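Introduction to Sentence Transformers and MLflow

Welcome to this tutorial on using Sentence Transformers with MLflow for advanced natural language processing and model management.

Learning Objectives

Set up a pipeline for sentence embeddings with sentence-transformers.
Log models and configurations with MLflow.
Understand model signatures in MLflow and apply them to sentence-transformers.
Deploy models and use them for inference with MLflow's capabilities.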

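What are Sentence Transformers?

Sentence Transformers, an extension of the Hugging Face Transformers library, is designed to generate semantically rich sentence embeddings. It builds on models such as BERT and RoBERTa that have been fine-tuned for tasks like semantic search and text clustering, producing high-quality sentence-level embeddings.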
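
To make "semantic search" a little more concrete, the following is a minimal sketch of how two embeddings are commonly compared, using cosine similarity. The helper name cosine_similarity and the 3-dimensional vectors are assumptions invented for illustration; they are not real model output and not part of either library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for this example only.
query = [1.0, 0.0, 1.0]
similar = [2.0, 0.0, 2.0]    # points in the same direction as query
unrelated = [0.0, 1.0, 0.0]  # orthogonal to query

print(cosine_similarity(query, similar))
print(cosine_similarity(query, unrelated))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0. With real sentence embeddings, a higher score is usually read as closer meaning.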

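Benefits of Integrating MLflow with Sentence Transformers

Combining MLflow with Sentence Transformers enhances NLP projects by:

Streamlining experiment management and logging.
Providing better control over model versions and configurations.
Ensuring reproducibility of results and model predictions.
Simplifying deployment to production environments.

This integration supports efficient tracking, management, and deployment of NLP applications.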

# Disable tokenizers warnings when constructing pipelines
%env TOKENIZERS_PARALLELISM=false

import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

env: TOKENIZERS_PARALLELISM=false

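Setting Up the Environment for Sentence Embedding

Begin working with Sentence Transformers and MLflow by establishing the core working environment.

Key Steps for Initialization

Import the necessary libraries: SentenceTransformer and mlflow.
Initialize the "all-MiniLM-L6-v2" Sentence Transformer model.

Model Initialization

The compact and efficient "all-MiniLM-L6-v2" model is chosen for its effectiveness at generating meaningful sentence embeddings. More models are available on the Hugging Face Hub.

Purpose of the Model

This model converts sentences into semantically rich embeddings that can be used in NLP tasks such as semantic search and clustering.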

from sentence_transformers import SentenceTransformer

import mlflow

model = SentenceTransformer("all-MiniLM-L6-v2")

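Defining the Model Signature with MLflow

Defining the model signature is a key step in setting up the Sentence Transformer model so that it behaves consistently and as expected during inference.

Steps for Signature Definition

Prepare example sentences: Define example sentences that demonstrate the model's input and output formats.
Generate the model signature: Use the mlflow.models.infer_signature function with the model's input and output to define the signature automatically.

Importance of the Model Signature

Clarity in data formats: Documents the data types and structures that the model expects and produces.
Model deployment and usage: Essential for deploying the model to production, where it must receive inputs in the correct format and produce the expected output.
Error prevention: Enforcing consistent data formats helps prevent errors during model inference.

NOTE: At inference time, a List[str] input type is equivalent to str. The MLflow flavor uses a ColSpec[str] definition for the input type.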

example_sentences = ["A sentence to encode.", "Another sentence to encode."]

# Infer the signature of the custom model by providing an input example and the resultant prediction output.
# We're not including any custom inference parameters in this example, but you can include them as a third argument
# to infer_signature(), as you will see in the advanced tutorials for Sentence Transformers.
signature = mlflow.models.infer_signature(
  model_input=example_sentences,
  model_output=model.encode(example_sentences),
)

# Visualize the signature
signature

inputs: 
[string]
outputs: 
[Tensor('float32', (-1, 384))]
params: 
None

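Creating an Experiment

We create a new MLflow experiment so that the run we log the model to is not recorded in the default experiment and instead has its own contextually relevant entry.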

# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.

# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("Introduction to Sentence Transformers")

<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/sentence-transformers/tutorials/quickstart/mlruns/469990615226680434', creation_time=1701280211449, experiment_id='469990615226680434', last_update_time=1701280211449, lifecycle_stage='active', name='Introduction to Sentence Transformers', tags={}>

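Logging the Sentence Transformer Model with MLflow

Logging the model in MLflow is essential for tracking, versioning, and deployment. It follows the initialization and signature definition of our Sentence Transformer model.

Steps for Logging the Model

Start an MLflow run: Use mlflow.start_run() to begin a new run, which groups all logging operations.
Log the model: Use mlflow.sentence_transformers.log_model to log the model, providing the model object, artifact path, signature, and input example.

Importance of Model Logging

Model management: Facilitates lifecycle management of the model, from training to deployment.
Reproducibility and tracking: Enables tracking of model versions and ensures reproducibility.
Ease of deployment: Simplifies deployment by making the model readily available for inference.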

with mlflow.start_run():
  logged_model = mlflow.sentence_transformers.log_model(
      model=model,
      artifact_path="sbert_model",
      signature=signature,
      input_example=example_sentences,
  )

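Loading the Model and Testing Inference

After logging the Sentence Transformer model in MLflow, we demonstrate how to load it and test it for real-time inference.

Loading the Model as a PyFunc

Why PyFunc: Load the logged model with mlflow.pyfunc.load_model for seamless integration into Python-based services or applications.
Model URI: Use logged_model.model_uri to locate and load the model from MLflow.

Conducting Inference Tests

Test sentences: Define sentences to test the model's embedding generation.
Performing predictions: Call the model's predict method on the test sentences to obtain embeddings.
Printing embedding lengths: Verify embedding generation by checking the length of each embedding array, which corresponds to the dimensionality of each sentence representation.

Importance of Inference Testing

Model validation: Confirms the model's expected behavior and data handling after loading.
Deployment readiness: Verifies that the model is ready for real-time integration into application services.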

inference_test = ["I enjoy pies of both apple and cherry.", "I prefer cookies."]

# Load our custom model by providing the uri for where the model was logged.
loaded_model_pyfunc = mlflow.pyfunc.load_model(logged_model.model_uri)

# Perform a quick test to ensure that our loaded model generates the correct output
embeddings_test = loaded_model_pyfunc.predict(inference_test)

# Verify that the output is a list of lists of floats (our expected output format)
print(f"The return structure length is: {len(embeddings_test)}")

for i, embedding in enumerate(embeddings_test):
  print(f"The size of embedding {i + 1} is: {len(embedding)}")

The return structure length is: 2
The size of embedding 1 is: 384
The size of embedding 2 is: 384
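
The length checks above could also be wrapped in a small helper that fails loudly instead of printing. The sketch below is one way to do that; check_embeddings is a hypothetical name, and the zero-filled stand-in data only mimics the shape of the real output (2 embeddings of size 384) so that the example runs without loading a model.

```python
def check_embeddings(embeddings, expected_count, expected_dim):
    """Raise ValueError if the batch does not have the expected shape."""
    if len(embeddings) != expected_count:
        raise ValueError(
            f"expected {expected_count} embeddings, got {len(embeddings)}"
        )
    for i, embedding in enumerate(embeddings):
        if len(embedding) != expected_dim:
            raise ValueError(
                f"embedding {i + 1} has size {len(embedding)}, expected {expected_dim}"
            )
    return True


# Stand-in data with the same shape as the output above; not real embeddings.
fake_embeddings = [[0.0] * 384, [0.0] * 384]
print(check_embeddings(fake_embeddings, expected_count=2, expected_dim=384))
```

In the tutorial's own session, the same helper could be called with embeddings_test in place of the stand-in data.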

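Displaying Samples of Generated Embeddings

Examine the contents of the embeddings to verify their quality and understand the model's output.

Inspecting the Embedding Samples

Purpose of sampling: Inspect a sample of the entries in each embedding to understand the vector representations the model generates.
Printing embedding samples: Print the first 10 entries of each embedding vector using embedding[:10] for a first look at the model's output.

Why Sampling is Important

Quality check: Sampling gives a quick way to verify that the embeddings are meaningful and non-degenerate.
Understanding model output: Seeing part of each embedding vector gives an intuitive sense of the model's output, which helps with debugging and development.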

for i, embedding in enumerate(embeddings_test):
  print(f"The sample of the first 10 entries in embedding {i + 1} is: {embedding[:10]}")

The sample of the first 10 entries in embedding 1 is: [ 0.04866192 -0.03687946  0.02408808  0.03534171 -0.12739632  0.00999414
0.07135344 -0.01433522  0.04296691 -0.00654414]
The sample of the first 10 entries in embedding 2 is: [-0.03879027 -0.02373698  0.01314073  0.03589077 -0.01641303 -0.0857707
0.08282158 -0.03173266  0.04507608  0.02777079]
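
Eyeballing a sample works for a quick look. As a complement, a rough programmatic check for degenerate vectors might look like the sketch below. The function name is_degenerate and the threshold eps are assumptions chosen for illustration, and the check only covers NaN or infinite entries and near-zero vectors, not every way an embedding can be uninformative.

```python
import math


def is_degenerate(embedding, eps=1e-12):
    """True if the vector has NaN/inf entries or (near-)zero length."""
    if any(math.isnan(x) or math.isinf(x) for x in embedding):
        return True
    norm = math.sqrt(sum(x * x for x in embedding))
    return norm < eps


# First three values of embedding 1 as printed above.
sample = [0.04866192, -0.03687946, 0.02408808]

print(is_degenerate(sample))
print(is_degenerate([0.0, 0.0, 0.0]))
```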

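Native Model Loading in MLflow for Extended Functionality

Explore the full range of Sentence Transformer functionality with MLflow's support for native model loading.

Why Support Native Loading?

Access to native functionality: Native loading exposes all the features of the Sentence Transformer model, which is essential for advanced NLP tasks.
Loading the model natively: Use mlflow.sentence_transformers.load_model to load the model with its full capabilities, for greater flexibility and efficiency.

Generating Embeddings with the Native Model

Model encoding: Generate embeddings with the model's native encode method, which benefits from the library's optimizations.
Importance of native encoding: Native encoding makes the model's full embedding generation capability available, which suits large-scale or complex NLP applications.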

# Load the saved model as a native Sentence Transformers model (unlike above, where we loaded as a generic python function)
loaded_model_native = mlflow.sentence_transformers.load_model(logged_model.model_uri)

# Use the native model to generate embeddings by calling encode() (unlike for the generic python function which uses the single entrypoint of `predict`)
native_embeddings = loaded_model_native.encode(inference_test)

for i, embedding in enumerate(native_embeddings):
  print(
      f"The sample of the native library encoding call for embedding {i + 1} is: {embedding[:10]}"
  )

2023/11/30 15:50:24 INFO mlflow.sentence_transformers: 'runs:/eeab3c1b13594fdea13e07585b1c0596/sbert_model' resolved as 'file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/sentence-transformers/tutorials/quickstart/mlruns/469990615226680434/eeab3c1b13594fdea13e07585b1c0596/artifacts/sbert_model'

The sample of the native library encoding call for embedding 1 is: [ 0.04866192 -0.03687946  0.02408808  0.03534171 -0.12739632  0.00999414
0.07135344 -0.01433522  0.04296691 -0.00654414]
The sample of the native library encoding call for embedding 2 is: [-0.03879027 -0.02373698  0.01314073  0.03589077 -0.01641303 -0.0857707
0.08282158 -0.03173266  0.04507608  0.02777079]
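
The native output printed here matches the PyFunc output shown earlier. A minimal sketch of checking that agreement in code, rather than by eye, is given below. The helper embeddings_match and its tolerance are assumptions for illustration, and the sample lists hold only the first three printed values of each embedding.

```python
import math


def embeddings_match(a, b, abs_tol=1e-6):
    """True if two batches of vectors agree element-wise within abs_tol."""
    if len(a) != len(b):
        return False
    return all(
        len(u) == len(v)
        and all(math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(u, v))
        for u, v in zip(a, b)
    )


# First three printed values of each embedding from the two outputs above.
pyfunc_sample = [
    [0.04866192, -0.03687946, 0.02408808],
    [-0.03879027, -0.02373698, 0.01314073],
]
native_sample = [
    [0.04866192, -0.03687946, 0.02408808],
    [-0.03879027, -0.02373698, 0.01314073],
]

print(embeddings_match(pyfunc_sample, native_sample))
```

In a live session, the same helper could be called on embeddings_test and native_embeddings directly. A small tolerance is used instead of exact equality because floating-point results can vary slightly between runs or hardware.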

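Conclusion: Embracing the Power of Sentence Transformers with MLflow

At the end of this introductory tutorial, we have covered the basics of integrating the Sentence Transformers library with MLflow. This foundation sets the stage for more advanced and specialized applications in natural language processing (NLP).

Recap of Key Learnings

Integration basics: We covered the essential steps for loading and logging a Sentence Transformer model with MLflow, showing how straightforward it is to integrate current NLP tooling within the MLflow ecosystem.
Signature and inference: By creating a model signature and running inference, we showed how to operationalize the Sentence Transformer model so that it is ready for real-world applications.
Model loading and prediction: We explored two ways of loading the model, as a PyFunc model and through the native Sentence Transformers loading mechanism. This dual approach highlights MLflow's versatility in accommodating different modes of model interaction.
Embeddings exploration: By generating and examining sentence embeddings, we saw how transformer models capture semantic information from text.

Looking Ahead

Expanding horizons: Although this tutorial focused on the foundations of Sentence Transformers and MLflow, many advanced applications remain to be explored, ranging from semantic similarity analysis to paraphrase mining.
Continued learning: We encourage you to work through the other tutorials in this series, which cover semantic similarity analysis, semantic search, and paraphrase mining in more depth and give a broader view of how Sentence Transformers can be applied across NLP tasks.

Final Thoughts

The journey into NLP with Sentence Transformers and MLflow is just beginning. With the skills and insights from this tutorial, you are well equipped to take on more complex applications. Combining advanced NLP models with MLflow's management and deployment capabilities opens new avenues for innovation in language understanding and beyond.

Thank you for joining us on this introductory journey. We look forward to seeing how you apply these tools and concepts in your own NLP work!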
