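Introduction to Sentence Transformers and MLflow

Welcome to this tutorial on using Sentence Transformers with MLflow for advanced natural language processing and model management.

Learning Objectives

Set up a pipeline for sentence embeddings with sentence-transformers.
Log models and configurations with MLflow.
Understand model signatures in MLflow and apply them to sentence-transformers models.
Deploy models and run inference using MLflow's capabilities.

What are Sentence Transformers?

Sentence Transformers is an extension of the Hugging Face Transformers library designed to generate semantically rich sentence embeddings. It uses models such as BERT and RoBERTa, fine-tuned for tasks such as semantic search and text clustering, to produce high-quality sentence-level embeddings.

Benefits of Integrating MLflow with Sentence Transformers

Combining MLflow with Sentence Transformers improves NLP projects in the following ways:

Streamlines experiment management and logging.
Gives better control over model versions and configurations.
Makes results and model predictions reproducible.
Simplifies deployment to production environments.

This integration makes it possible to track, manage, and deploy NLP applications efficiently.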

# Disable tokenizers warnings when constructing pipelines
%env TOKENIZERS_PARALLELISM=false

import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

env: TOKENIZERS_PARALLELISM=false

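Setting Up the Environment for Sentence Embedding

Begin your work with Sentence Transformers and MLflow by establishing the core working environment.

Key Steps for Initialization

Import the necessary libraries: SentenceTransformer and mlflow.
Initialize the "all-MiniLM-L6-v2" Sentence Transformer model.

Model Initialization

We use the compact and efficient "all-MiniLM-L6-v2" model because it generates meaningful sentence embeddings at a small model size. More models are available on the Hugging Face Hub.

Purpose of the Model

The model converts sentences into semantically rich embeddings that can be used in a variety of NLP tasks, such as semantic search and clustering.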

from sentence_transformers import SentenceTransformer

import mlflow

model = SentenceTransformer("all-MiniLM-L6-v2")

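Defining the Model Signature with MLflow

Defining the model signature is a key step in setting up the Sentence Transformer model for consistent, predictable behavior during inference.

Steps for Signature Definition

Prepare example sentences: Define example sentences that demonstrate the model's input and output format.
Generate the model signature: Call mlflow.models.infer_signature with the model's input and output to define the signature automatically.

Importance of the Model Signature

Clarity in data formats: Documents the data types and structures that the model expects and produces.
Model deployment and usage: Essential for deploying the model to production, where it ensures that the model receives input in the correct format and produces the expected output.
Error prevention: Helps prevent errors during model inference by enforcing consistent data formats.

Note: At inference time, a List[str] input type is equivalent to str. The MLflow flavor uses a ColSpec[str] definition for the input type.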

example_sentences = ["A sentence to encode.", "Another sentence to encode."]

# Infer the signature of the custom model by providing an input example and the resultant prediction output.
# We're not including any custom inference parameters in this example, but you can include them as a third argument
# to infer_signature(), as you will see in the advanced tutorials for Sentence Transformers.
signature = mlflow.models.infer_signature(
  model_input=example_sentences,
  model_output=model.encode(example_sentences),
)

# Visualize the signature
signature

inputs: 
[string]
outputs: 
[Tensor('float32', (-1, 384))]
params: 
None

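Creating an Experiment

We create a new MLflow experiment so that the run we log our model to is recorded under its own contextually relevant entry rather than in the default experiment.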

# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.

# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("Introduction to Sentence Transformers")

<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/sentence-transformers/tutorials/quickstart/mlruns/469990615226680434', creation_time=1701280211449, experiment_id='469990615226680434', last_update_time=1701280211449, lifecycle_stage='active', name='Introduction to Sentence Transformers', tags={}>

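Logging the Sentence Transformer Model with MLflow

Logging the model in MLflow is essential for tracking, versioning, and deployment. It follows the initialization of the Sentence Transformer model and the definition of its signature.

Steps for Logging the Model

Start an MLflow run: Use mlflow.start_run() to start a new run that groups all logging operations.
Log the model: Use mlflow.sentence_transformers.log_model to log the model, providing the model object, a name for the logged model, the signature, and an input example.

Importance of Model Logging

Model management: Supports the model's lifecycle from training to deployment.
Reproducibility and tracking: Makes it possible to track model versions and reproduce results.
Ease of deployment: Simplifies deployment by making the model easy to load for inference.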

with mlflow.start_run():
  logged_model = mlflow.sentence_transformers.log_model(
      model=model,
      name="sbert_model",
      signature=signature,
      input_example=example_sentences,
  )

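Loading the Model and Testing Inference

After logging the Sentence Transformer model in MLflow, we demonstrate how to load it and test it for real-time inference.

Loading the Model as a PyFunc

Why PyFunc: Load the logged model with mlflow.pyfunc.load_model so that it integrates easily into Python-based services or applications.
Model URI: Use logged_model.model_uri to locate and load the exact model from MLflow.

Conducting Inference Tests

Test sentences: Define sentences for testing the model's ability to generate embeddings.
Perform the prediction: Call the model's predict method with the test sentences to obtain embeddings.
Print the embedding lengths: Verify embedding generation by checking the length of each embedding array, which corresponds to the dimensionality of each sentence representation.

Importance of Inference Testing

Model validation: Confirms that the model behaves as expected and processes data correctly after loading.
Deployment readiness: Verifies that the model is ready to be integrated into real-time application services.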

inference_test = ["I enjoy pies of both apple and cherry.", "I prefer cookies."]

# Load our custom model by providing the uri for where the model was logged.
loaded_model_pyfunc = mlflow.pyfunc.load_model(logged_model.model_uri)

# Perform a quick test to ensure that our loaded model generates the correct output
embeddings_test = loaded_model_pyfunc.predict(inference_test)

# Verify the output structure: one embedding per input sentence, each a vector of floats
print(f"The return structure length is: {len(embeddings_test)}")

for i, embedding in enumerate(embeddings_test):
  print(f"The size of embedding {i + 1} is: {len(embedding)}")

The return structure length is: 2
The size of embedding 1 is: 384
The size of embedding 2 is: 384

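Displaying Samples of Generated Embeddings

Inspect the contents of the embeddings to verify their quality and to understand the model's output.

Inspecting the Embedding Samples

Purpose of sampling: Inspect a sample of the entries in each embedding to understand the vector representations that the model generates.
Printing embedding samples: Print the first 10 entries of each embedding vector with embedding[:10] to get a glimpse of the model's output.

Why Sampling is Important

Quality check: Sampling is a quick way to verify that the embeddings are meaningful and not degenerate.
Understanding model output: Viewing part of an embedding vector gives an intuitive sense of the model's output, which helps with debugging and development.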

for i, embedding in enumerate(embeddings_test):
  print(f"The sample of the first 10 entries in embedding {i + 1} is: {embedding[:10]}")

The sample of the first 10 entries in embedding 1 is: [ 0.04866192 -0.03687946  0.02408808  0.03534171 -0.12739632  0.00999414
0.07135344 -0.01433522  0.04296691 -0.00654414]
The sample of the first 10 entries in embedding 2 is: [-0.03879027 -0.02373698  0.01314073  0.03589077 -0.01641303 -0.0857707
0.08282158 -0.03173266  0.04507608  0.02777079]
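The quality check described above can be partly automated. The sketch below is one possible way to flag obviously degenerate embeddings; the helper looks_degenerate and its tolerance are assumptions made for illustration, not part of MLflow or Sentence Transformers. It treats a batch as degenerate if it contains non-finite values, a near-zero vector, or rows that are all effectively identical. The sample rows are the truncated 10-entry samples printed above, not full 384-dimensional embeddings.

```python
import numpy as np


def looks_degenerate(embeddings, tol=1e-6):
    """Flag embeddings that are clearly unusable (hypothetical helper)."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # NaN or infinite entries usually indicate a numerical problem upstream.
    if not np.all(np.isfinite(embeddings)):
        return True
    # A vector with (near-)zero norm carries no semantic information.
    if np.any(np.linalg.norm(embeddings, axis=1) < tol):
        return True
    # Different sentences mapped to the same vector suggest a collapsed model.
    if len(embeddings) > 1 and np.allclose(embeddings, embeddings[0], atol=tol):
        return True
    return False


# First 10 entries of each embedding, copied from the output above.
sample = np.array([
    [0.04866192, -0.03687946, 0.02408808, 0.03534171, -0.12739632,
     0.00999414, 0.07135344, -0.01433522, 0.04296691, -0.00654414],
    [-0.03879027, -0.02373698, 0.01314073, 0.03589077, -0.01641303,
     -0.0857707, 0.08282158, -0.03173266, 0.04507608, 0.02777079],
])
print(looks_degenerate(sample))
```

In the tutorial's own session, the same function could be called on embeddings_test directly. Passing this check only rules out gross failures; it says nothing about whether the embeddings are semantically useful.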

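Native Model Loading in MLflow for Extended Functionality

MLflow's support for native model loading gives access to the full functionality of the Sentence Transformer model.

Why Support Native Loading?

Access to native functionality: Native loading exposes all of the Sentence Transformer model's features, which advanced NLP tasks often require.
Loading the model natively: Use mlflow.sentence_transformers.load_model to load the model with its full functionality, for greater flexibility and efficiency.

Generating Embeddings with the Native Model

Model encoding: Generate embeddings with the model's native encode method, which takes advantage of its optimizations.
Importance of native encoding: Native encoding gives access to the model's full embedding capabilities, which suits large-scale or complex NLP applications.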

# Load the saved model as a native Sentence Transformers model (unlike above, where we loaded as a generic python function)
loaded_model_native = mlflow.sentence_transformers.load_model(logged_model.model_uri)

# Use the native model to generate embeddings by calling encode() (unlike for the generic python function which uses the single entrypoint of `predict`)
native_embeddings = loaded_model_native.encode(inference_test)

for i, embedding in enumerate(native_embeddings):
  print(
      f"The sample of the native library encoding call for embedding {i + 1} is: {embedding[:10]}"
  )

2023/11/30 15:50:24 INFO mlflow.sentence_transformers: 'runs:/eeab3c1b13594fdea13e07585b1c0596/sbert_model' resolved as 'file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/sentence-transformers/tutorials/quickstart/mlruns/469990615226680434/eeab3c1b13594fdea13e07585b1c0596/artifacts/sbert_model'

The sample of the native library encoding call for embedding 1 is: [ 0.04866192 -0.03687946  0.02408808  0.03534171 -0.12739632  0.00999414
0.07135344 -0.01433522  0.04296691 -0.00654414]
The sample of the native library encoding call for embedding 2 is: [-0.03879027 -0.02373698  0.01314073  0.03589077 -0.01641303 -0.0857707
0.08282158 -0.03173266  0.04507608  0.02777079]
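The printed samples from the PyFunc and native loading paths show the same values. A minimal sketch of checking that programmatically follows, assuming both results are NumPy arrays (or convertible to them). The helper same_embeddings and its tolerance are hypothetical, and the two stand-in arrays hold only the 10-entry sample for the first sentence, copied from the outputs above.

```python
import numpy as np


def same_embeddings(a, b, atol=1e-6):
    """Return True if two embedding arrays agree in shape and values."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        return False
    return bool(np.allclose(a, b, atol=atol))


# Stand-ins for embeddings_test[0][:10] and native_embeddings[0][:10].
pyfunc_sample = np.array([0.04866192, -0.03687946, 0.02408808, 0.03534171,
                          -0.12739632, 0.00999414, 0.07135344, -0.01433522,
                          0.04296691, -0.00654414])
native_sample = np.array([0.04866192, -0.03687946, 0.02408808, 0.03534171,
                          -0.12739632, 0.00999414, 0.07135344, -0.01433522,
                          0.04296691, -0.00654414])
print(same_embeddings(pyfunc_sample, native_sample))
```

In the tutorial's session, the equivalent call would be same_embeddings(embeddings_test, native_embeddings). A tolerance is used rather than exact equality because small floating-point differences between the two paths are possible.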

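Conclusion: Embracing the Power of Sentence Transformers with MLflow

As this introductory tutorial comes to a close, we have covered the basics of integrating the Sentence Transformers library with MLflow. These basics lay the groundwork for more advanced and specialized applications in natural language processing (NLP).

Recap of Key Learnings

Integration basics: We covered the key steps for loading and logging a Sentence Transformer model with MLflow. This process shows how simple and effective it is to integrate modern NLP tools into the MLflow ecosystem.
Signature and inference: By creating a model signature and running inference, we showed how to operationalize the Sentence Transformer model and prepare it for real-world use.
Model loading and prediction: We explored two ways of loading the model: as a PyFunc model, and through the native Sentence Transformers loading mechanism. This dual approach highlights MLflow's versatility in supporting different ways of interacting with a model.
Embedding exploration: By generating and inspecting sentence embeddings, we got a glimpse of how transformer models capture the semantic information in text.

Looking Ahead

Expanding horizons: This tutorial focused on the fundamentals of Sentence Transformers and MLflow, but many more advanced applications remain to be explored. Potential use cases range from semantic similarity analysis to paraphrase mining.
Continued learning: We encourage you to work through the other tutorials in this series, which go deeper into use cases such as similarity analysis, semantic search, and paraphrase mining. They offer a broader understanding of Sentence Transformers and more practical applications across NLP tasks.

Final Thoughts

The journey into NLP with Sentence Transformers and MLflow is just beginning. With the skills and insights from this tutorial, you are equipped to explore more complex applications. Combining advanced NLP models with MLflow's management and deployment capabilities opens new avenues for innovation and exploration in language understanding and beyond.

Thank you for joining us on this introductory journey. We look forward to seeing how you apply these tools and concepts in your NLP projects!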
