Advanced Semantic Search with Sentence Transformers and MLflow

Take a hands-on tour of advanced semantic search using Sentence Transformers and MLflow.

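What you will learn

Implement advanced semantic search with sentence-transformers.
Customize MLflow's PythonModel to fit the needs of a specific project.
Manage and log models within the MLflow ecosystem.
Deploy complex models for practical applications using MLflow.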

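Understanding semantic search

Semantic search goes beyond keyword matching: it uses the nuance and context of language to find relevant results. This mirrors how people understand language, taking into account that a word can mean different things in different settings.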

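Harnessing Sentence Transformers for search

Sentence Transformers specialize in context-rich sentence embeddings. They convert both the search query and the text corpus into semantic vectors, which makes it possible to identify semantically similar entries, the cornerstone of semantic search.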
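
To make the idea of comparing semantic vectors concrete, here is a minimal sketch of ranking entries by cosine similarity. The three-dimensional vectors and the entry names are hypothetical stand-ins invented for illustration; a real model such as all-MiniLM-L6-v2 produces 384-dimensional embeddings, and this tutorial relies on the library's own similarity helper rather than this hand-written function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", invented for this sketch only.
toy_corpus = {
    "sourdough baking": [0.9, 0.1, 0.0],
    "mars rover": [0.0, 0.2, 0.9],
    "vegan cuisine": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]  # stands in for an encoded cooking-related query

# Sort entry names from most to least similar to the query.
ranked = sorted(
    toy_corpus,
    key=lambda title: cosine_similarity(toy_query, toy_corpus[title]),
    reverse=True,
)
print(ranked)
```

With these toy vectors, the two food-related entries rank above the space-related one because their vectors point in nearly the same direction as the query vector.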

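MLflow: a pioneer in model management and deployment

MLflow strengthens NLP projects with efficient experiment logging and customizable model environments. It makes experiment tracking efficient and adds the customizability that unique NLP tasks require.

Follow this tutorial to learn advanced semantic search techniques and to see how MLflow can change the way you deploy and manage NLP models.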

python
import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

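Understanding the semantic search model with MLflow and Sentence Transformers

Take a closer look at SemanticSearchModel, a custom implementation of semantic search built with MLflow and Sentence Transformers.

MLflow and custom PyFunc models

MLflow's custom Python function (pyfunc) models provide a flexible, deployable way to package complex logic, which makes them a good fit for our SemanticSearchModel.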

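Core functionality of the model

Context loading: initializes the Sentence Transformer model and prepares the corpus for semantic comparison.
Predict method: the core function for semantic search, covering input validation, query encoding, and similarity computation.

The predict method in detail

Input validation: ensures the query sentence is in the expected format and extracts it.
Query encoding: converts the query into an embedding for comparison.
Cosine similarity computation: determines how relevant each corpus entry is to the query.
Top results extraction: identifies the most relevant entries based on similarity scores.
Relevancy filtering: filters results against a minimum relevancy threshold, improving usefulness.
Warning mechanism: issues a warning when all top results fall below the relevancy threshold, and still returns the best match so that a result is always provided.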
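
The last three steps (top results, relevancy filtering, and the warning fallback) can be sketched in isolation without a model. The function name `select_results` and the sample scores are hypothetical, introduced only for this illustration; the defaults of 3 and 0.2 match the ones used by the model in this tutorial.

```python
import warnings


def select_results(scored, top_k=3, minimum_relevancy=0.2):
    """Return the top_k (text, score) pairs that meet the threshold.

    Falls back to the single best match, with a warning, when no pair
    clears the threshold.
    """
    # Highest scores first, then keep only the first top_k pairs.
    top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
    kept = [pair for pair in top if pair[1] >= minimum_relevancy]
    if not kept:
        warnings.warn(
            "All top results are below the minimum relevancy threshold. "
            "Returning the highest match instead.",
            RuntimeWarning,
        )
        return top[:1]
    return kept


# Hypothetical (text, score) pairs standing in for cosine similarity output.
scored = [("post a", 0.12), ("post b", 0.44), ("post c", 0.25), ("post d", 0.31)]

print(select_results(scored))
print(select_results(scored, top_k=2, minimum_relevancy=0.4))
```

The first call keeps three pairs because all of the top three clear 0.2; the second keeps only "post b". Raising `minimum_relevancy` above 0.44 would trigger the warning and return "post b" alone.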

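Conclusion

This semantic search model exemplifies the integration of NLP with MLflow, demonstrating flexibility, ease of use, and practical application in modern machine learning workflows.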

python
import warnings

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util

import mlflow
from mlflow.models.signature import infer_signature
from mlflow.pyfunc import PythonModel


class SemanticSearchModel(PythonModel):
  def load_context(self, context):
      """Load the model context for inference, including the corpus from a file."""
      try:
          # Load the pre-trained sentence transformer model
          self.model = SentenceTransformer.load(context.artifacts["model_path"])

          # Load the corpus from the specified file
          corpus_file = context.artifacts["corpus_file"]
          with open(corpus_file) as file:
              self.corpus = file.read().splitlines()

          # Encode the corpus and convert it to a tensor
          self.corpus_embeddings = self.model.encode(self.corpus, convert_to_tensor=True)

      except Exception as e:
          raise ValueError(f"Error loading model and corpus: {e}") from e

  def predict(self, context, model_input, params=None):
      """Predict method to perform semantic search over the corpus."""

      if isinstance(model_input, pd.DataFrame):
          if model_input.shape[1] != 1:
              raise ValueError("DataFrame input must have exactly one column.")
          model_input = model_input.iloc[0, 0]
      elif isinstance(model_input, dict):
          model_input = model_input.get("sentence")
          if model_input is None:
              raise ValueError("The input dictionary must have a key named 'sentence'.")
      else:
          raise TypeError(
              f"Unexpected type for model_input: {type(model_input)}. Must be either a Dict or a DataFrame."
          )

      # Encode the query
      query_embedding = self.model.encode(model_input, convert_to_tensor=True)

      # Compute cosine similarity scores
      cos_scores = util.cos_sim(query_embedding, self.corpus_embeddings)[0]

      # Determine the number of top results to return
      top_k = params.get("top_k", 3) if params else 3  # Default to 3 if not specified

      minimum_relevancy = (
          params.get("minimum_relevancy", 0.2) if params else 0.2
      )  # Default to 0.2 if not specified

      # Get the top_k most similar sentences from the corpus
      top_results = np.argsort(cos_scores, axis=0)[-top_k:]

      # Prepare the initial results list
      initial_results = [
          (self.corpus[idx], cos_scores[idx].item()) for idx in reversed(top_results)
      ]

      # Filter the results based on the minimum relevancy threshold
      filtered_results = [result for result in initial_results if result[1] >= minimum_relevancy]

      # If all results are below the threshold, issue a warning and return the top result
      if not filtered_results:
          warnings.warn(
              "All top results are below the minimum relevancy threshold. "
              "Returning the highest match instead.",
              RuntimeWarning,
          )
          return [initial_results[0]]
      else:
          return filtered_results

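Building and preparing the semantic search corpus

Explore how the corpus for the semantic search model is built and prepared, a key component of the search functionality.

Simulating a real-world use case

We create a simplified, synthetic corpus of blog posts to demonstrate the model's core functionality, a scaled-down replica of a typical real-world scenario.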

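Key steps in corpus preparation

Corpus creation: builds a list in which each element represents a single blog post entry.
Writing to a file: saves the corpus to a text file, mimicking the data extraction and preprocessing step of a real application.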
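
Because the model reads the corpus back one entry per line, an entry that itself contains a line break would be split into several entries. The sketch below shows a write/read round trip that guards against this; the helper names `write_corpus` and `read_corpus` and the two sample entries are hypothetical and not part of the tutorial code.

```python
import os
import tempfile


def write_corpus(entries, path):
    """Write one corpus entry per line, rejecting entries that span lines."""
    for entry in entries:
        # splitlines() returns [entry] only for a non-empty, single-line string.
        if entry.splitlines() != [entry]:
            raise ValueError("Each corpus entry must be a single non-empty line.")
    with open(path, "w", encoding="utf-8") as file:
        for entry in entries:
            file.write(entry + "\n")


def read_corpus(path):
    """Read the corpus back as a list with one entry per line."""
    with open(path, encoding="utf-8") as file:
        return file.read().splitlines()


# Hypothetical two-entry corpus written to a temporary directory.
entries = ["First post. It has two sentences.", "Second post."]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.txt")
    write_corpus(entries, path)
    restored = read_corpus(path)

print(restored == entries)
```

A line-per-entry text file is convenient for a demonstration; a format with explicit record boundaries would be a safer choice for entries that may contain arbitrary text.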

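Efficient data handling for scalability

Our model encodes the corpus into embeddings once, when the model context is loaded, so that each query needs only a fast similarity comparison. This is an efficient approach that is suitable for scaling to larger datasets.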

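Production considerations

Storing embeddings: options for storing and retrieving embeddings efficiently, which is critical for large-scale applications.
Scalability: the importance of scalable storage systems for handling large datasets and complex queries.
Updating the corpus: strategies for managing and updating the corpus in dynamic, evolving use cases.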
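
One possible way to handle corpus updates without re-encoding everything is to cache embeddings by content hash. The sketch below is only an illustration under stated assumptions: `EmbeddingCache` and `fake_encode` are hypothetical names, the in-memory dictionary stands in for whatever persistent store a production system would use, and `fake_encode` merely imitates the shape of a real encoder call.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by content hash so unchanged entries are not re-encoded."""

    def __init__(self, encode):
        self._encode = encode  # callable mapping a string to an embedding
        self._store = {}  # content hash -> embedding

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def sync(self, corpus):
        """Encode new entries, drop removed ones, return embeddings in corpus order."""
        keys = [self._key(text) for text in corpus]
        for key, text in zip(keys, corpus):
            if key not in self._store:
                self._store[key] = self._encode(text)
        live = set(keys)
        self._store = {k: v for k, v in self._store.items() if k in live}
        return [self._store[key] for key in keys]


calls = []


def fake_encode(text):
    """Hypothetical stand-in for a real encoder; records each call it receives."""
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(fake_encode)
cache.sync(["post one", "post two"])
cache.sync(["post two", "post three"])  # only "post three" is newly encoded
print(calls)
```

The recorded calls show that "post two" is encoded once even though it appears in both versions of the corpus, which is the property that matters when the corpus changes incrementally.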

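Putting semantic search concepts into practice

Although simplified, this setup reflects the basic steps of building a robust, scalable semantic search system that combines NLP techniques with efficient data management. In a real production use case, corpus processing (creating the embeddings) would be a process external to the one that runs the semantic search. The corpus below is for demonstration purposes only.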

python
corpus = [
  "Perfecting a Sourdough Bread Recipe: The Joy of Baking. Baking sourdough bread "
  "requires patience, skill, and a good understanding of yeast fermentation. Each "
  "loaf is unique, telling its own story of the baker's journey.",
  "The Mars Rover's Discoveries: Unveiling the Red Planet. NASA's Mars rover has "
  "sent back stunning images and data, revealing the planet's secrets. These "
  "discoveries may hold the key to understanding Mars' history.",
  "The Art of Growing Herbs: Enhancing Your Culinary Skills. Growing your own "
  "herbs can transform your cooking, adding fresh and vibrant flavors. Whether it's "
  "basil, thyme, or rosemary, each herb has its own unique characteristics.",
  "AI in Software Development: Transforming the Tech Landscape. The rapid "
  "advancements in artificial intelligence are reshaping how we approach software "
  "development. From automation to machine learning, the possibilities are endless.",
  "Backpacking Through Europe: A Journey of Discovery. Traveling across Europe by "
  "backpack allows one to immerse in diverse cultures and landscapes. It's an "
  "adventure that combines the thrill of exploration with personal growth.",
  "Shakespeare's Timeless Influence: Reshaping Modern Storytelling. The works of "
  "William Shakespeare continue to inspire and influence contemporary literature. "
  "His mastery of language and deep understanding of human nature are unparalleled.",
  "The Rise of Renewable Energy: A Sustainable Future. Embracing renewable energy "
  "is crucial for achieving a sustainable and environmentally friendly lifestyle. "
  "Solar, wind, and hydro power are leading the way in this green revolution.",
  "The Magic of Jazz: An Exploration of Sound and Harmony. Jazz music, known for "
  "its improvisation and complex harmonies, has a rich and diverse history. It "
  "evokes a range of emotions, often reflecting the soul of the musician.",
  "Yoga for Mind and Body: The Benefits of Regular Practice. Engaging in regular "
  "yoga practice can significantly improve flexibility, strength, and mental "
  "well-being. It's a holistic approach to health, combining physical and spiritual "
  "aspects.",
  "The Egyptian Pyramids: Monuments of Ancient Majesty. The ancient Egyptian "
  "pyramids, monumental tombs for pharaohs, are marvels of architectural "
  "ingenuity. They stand as a testament to the advanced skills of ancient builders.",
  "Vegan Cuisine: A World of Flavor. Exploring vegan cuisine reveals a world of "
  "nutritious and delicious possibilities. From hearty soups to delectable desserts, "
  "plant-based dishes are diverse and satisfying.",
  "Extraterrestrial Life: The Endless Search. The quest to find life beyond Earth "
  "continues to captivate scientists and the public alike. Advances in space "
  "technology are bringing us closer to answering this age-old question.",
  "The Art of Plant Pruning: Promoting Healthy Growth. Regular pruning is essential "
  "for maintaining healthy and vibrant plants. It's not just about cutting back, but "
  "understanding each plant's growth patterns and needs.",
  "Cybersecurity in the Digital Age: Protecting Our Data. With the rise of digital "
  "technology, cybersecurity has become a critical concern. Protecting sensitive "
  "information from cyber threats is an ongoing challenge for individuals and "
  "businesses alike.",
  "The Great Wall of China: A Historical Journey. Visiting the Great Wall offers "
  "more than just breathtaking views; it's a journey through history. This ancient "
  "structure tells stories of empires, invasions, and human resilience.",
  "Mystery Novels: Crafting Suspense and Intrigue. A great mystery novel captivates "
  "the reader with intricate plots and unexpected twists. It's a genre that combines "
  "intellectual challenge with entertainment.",
  "Conserving Endangered Species: A Global Effort. Protecting endangered species "
  "is a critical task that requires international collaboration. From rainforests to "
  "oceans, every effort counts in preserving our planet's biodiversity.",
  "Emotions in Classical Music: A Symphony of Feelings. Classical music is not just "
  "an auditory experience; it's an emotional journey. Each composition tells a story, "
  "conveying feelings from joy to sorrow, tranquility to excitement.",
  "CrossFit: A Test of Strength and Endurance. CrossFit is more than just a fitness "
  "regimen; it's a lifestyle that challenges your physical and mental limits. It "
  "combines various disciplines to create a comprehensive workout.",
  "The Renaissance: An Era of Artistic Genius. The Renaissance marked a period of "
  "extraordinary artistic and scientific achievements. It was a time when creativity "
  "and innovation flourished, reshaping the course of history.",
  "Exploring International Cuisines: A Culinary Adventure. Discovering international "
  "cuisines is an adventure for the palate. Each dish offers a glimpse into the "
  "culture and traditions of its origin.",
  "Astronaut Training: Preparing for the Unknown. Becoming an astronaut involves "
  "rigorous training to prepare for the extreme conditions of space. It's a journey "
  "that tests both physical endurance and mental resilience.",
  "Sustainable Gardening: Nurturing the Environment. Sustainable gardening is not "
  "just about growing plants; it's about cultivating an ecosystem. By embracing "
  "environmentally friendly practices, gardeners can have a positive impact on the "
  "planet.",
  "The Smartphone Revolution: Changing Communication. Smartphones have transformed "
  "how we communicate, offering unprecedented connectivity and convenience. This "
  "technology continues to evolve, shaping our daily interactions.",
  "Experiencing African Safaris: Wildlife and Wilderness. An African safari is an "
  "unforgettable experience that brings you face-to-face with the wonders of "
  "wildlife. It's a journey that connects you with the raw beauty of nature.",
  "Graphic Novels: A Blend of Art and Story. Graphic novels offer a unique medium "
  "where art and narrative intertwine to tell compelling stories. They challenge "
  "traditional forms of storytelling, offering visual and textual richness.",
  "Addressing Ocean Pollution: A Call to Action. The increasing levels of pollution "
  "in our oceans are a pressing environmental concern. Protecting marine life and "
  "ecosystems requires concerted global efforts.",
  "The Origins of Hip Hop: A Cultural Movement. Hip hop music, originating from the "
  "streets of New York, has grown into a powerful cultural movement. Its beats and "
  "lyrics reflect the experiences and voices of a community.",
  "Swimming: A Comprehensive Workout. Swimming offers a full-body workout that is "
  "both challenging and refreshing. It's an exercise that enhances cardiovascular "
  "health, builds muscle, and improves endurance.",
  "The Fall of the Berlin Wall: A Historical Turning Point. The fall of the Berlin "
  "Wall was not just a physical demolition; it was a symbol of political and social "
  "change. This historic event marked the end of an era and the beginning of a new "
  "chapter in world history.",
]

# Write the corpus to a file
corpus_file = "/tmp/search_corpus.txt"
with open(corpus_file, "w") as file:
  for sentence in corpus:
      file.write(sentence + "\n")

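Model preparation and configuration in MLflow

Explore the steps to prepare and configure the Sentence Transformer model for integration with MLflow, which is essential for deployment readiness.

Loading and saving the Sentence Transformer model

Model initialization: loads the "all-MiniLM-L6-v2" model, known for its balance of performance and speed, which suits semantic search tasks.
Model storage: saves the model to a directory, which is required for later deployment through MLflow. /tmp/search_model is chosen for convenience in this tutorial so that your current working directory is not filled with model files. You can change it to any location you prefer.

Preparing model artifacts and signature

Artifacts dictionary: creates a dictionary with the paths to the model and the corpus file, pointing MLflow to the components needed to initialize the custom model object.
Input example and test output: defines a sample input and output to illustrate the data formats the model expects.
Model signature: uses infer_signature to generate the signature automatically, covering inputs, outputs, and operational parameters.

Importance of the model signature

The signature ensures data consistency between training and deployment, which improves model usability and reduces the chance of errors. Specifying a signature enforces type validation at inference time, preventing unexpected behavior from improper type conversion that could lead to incorrect or confusing inference results.

Conclusion

This thorough preparation ensures the model is ready for deployment, with all dependencies and operational requirements explicitly defined.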

python
# Load a pre-trained sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Create an input example (a list containing a single query string)
input_example = ["Something I want to find matches for."]

# Save the model in the /tmp directory
model_directory = "/tmp/search_model"
model.save(model_directory)

artifacts = {"model_path": model_directory, "corpus_file": corpus_file}

# Generate test output for signature
test_output = ["match 1", "match 2", "match 3"]

# Define the signature associated with the model
signature = infer_signature(
  input_example, test_output, params={"top_k": 3, "minimum_relevancy": 0.2}
)

# Visualize the signature
signature

inputs: 
[string]
outputs: 
[string]
params: 
['top_k': long (default: 3), 'minimum_relevancy': double (default: 0.2)]

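Creating an experiment

We create a new MLflow experiment so that the run we log the model to is not recorded in the default experiment and instead has its own contextually relevant entry.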

python
# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.

# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("Semantic Similarity")

<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/sentence-transformers/tutorials/semantic-search/mlruns/405641275158666585', creation_time=1701278766302, experiment_id='405641275158666585', last_update_time=1701278766302, lifecycle_stage='active', name='Semantic Similarity', tags={}>

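Logging the model with MLflow

Explore the steps for logging the model in MLflow, a key step toward managing and deploying it within the MLflow framework.

Starting an MLflow run

Context management: starts an MLflow run with `with mlflow.start_run()`, which is essential for tracking and managing the operations related to the model.

Logging the model

Model logging: uses mlflow.pyfunc.log_model to log the custom SemanticSearchModel, passing key arguments such as the model name, model instance, input example, signature, artifacts, and requirements.

Outcome of model logging

Model registration: ensures the model is recorded in MLflow together with all required components, ready for deployment.
Reproducibility and traceability: supports consistent model deployment and tracks versions and associated metadata.

Conclusion

Completing this step moves the model from development to a deployment-ready state, encapsulated within the MLflow ecosystem.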

python
with mlflow.start_run() as run:
  model_info = mlflow.pyfunc.log_model(
      name="semantic_search",
      python_model=SemanticSearchModel(),
      input_example=input_example,
      signature=signature,
      artifacts=artifacts,
      pip_requirements=["sentence_transformers", "numpy"],
  )

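Model inference and prediction demonstration

See the semantic search model in action as it responds to user queries with relevant predictions.

Loading the model for inference

Model loading: uses mlflow.pyfunc.load_model to load the model, ready to handle semantic search queries.

Making a prediction

Running a query: passes a sample query to the loaded model to demonstrate its semantic search capability.

Understanding the prediction output

Output format: the prediction is a list of (corpus entry, relevancy score) pairs; the scores reflect the model's semantic understanding of the query.
Example results: the model's search results, including relevancy scores for the entries related to the query.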
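
Since the model returns raw (text, score) pairs, a small formatting helper can make the output easier to scan. This is a sketch only: `format_results` is a hypothetical helper, and the sample entries are shortened versions of corpus entries paired with the scores reported for the cooking query in this tutorial.

```python
def format_results(results, title_only=True):
    """Render (text, score) pairs as numbered lines, best match first."""
    lines = []
    for rank, (text, score) in enumerate(results, start=1):
        # Corpus entries in this tutorial begin with "Title: Subtitle. Body...",
        # so the text before the first ". " serves as a short label.
        label = text.split(". ", 1)[0] if title_only else text
        lines.append(f"{rank}. [{score:.3f}] {label}")
    return lines


# Shortened sample results; the texts are truncated for brevity.
sample_results = [
    (
        "Exploring International Cuisines: A Culinary Adventure. Discovering "
        "international cuisines is an adventure for the palate.",
        0.43857115507125854,
    ),
    (
        "Vegan Cuisine: A World of Flavor. Exploring vegan cuisine reveals a "
        "world of nutritious and delicious possibilities.",
        0.34688490629196167,
    ),
]

formatted = format_results(sample_results)
for line in formatted:
    print(line)
```

Rounding the score to three decimals is a presentation choice; the underlying values are left untouched so they can still be compared against a threshold.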

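Conclusion

This demonstration highlights the model's effectiveness at semantic search and its potential for recommendation and knowledge retrieval applications.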

python
# Load our model as a PyFuncModel.
# Note that unlike the example shown in the Introductory Tutorial, there is no 'native' flavor for PyFunc models.
# This model cannot be loaded with `mlflow.sentence_transformers.load_model()` because it is not in the native model format.
loaded_dynamic = mlflow.pyfunc.load_model(model_info.model_uri)

# Make sure that it generates a reasonable output
loaded_dynamic.predict(["I'd like some ideas for a meal to cook."])

[('Exploring International Cuisines: A Culinary Adventure. Discovering international cuisines is an adventure for the palate. Each dish offers a glimpse into the culture and traditions of its origin.',
0.43857115507125854),
('Vegan Cuisine: A World of Flavor. Exploring vegan cuisine reveals a world of nutritious and delicious possibilities. From hearty soups to delectable desserts, plant-based dishes are diverse and satisfying.',
0.34688490629196167),
("The Art of Growing Herbs: Enhancing Your Culinary Skills. Growing your own herbs can transform your cooking, adding fresh and vibrant flavors. Whether it's basil, thyme, or rosemary, each herb has its own unique characteristics.",
0.22686949372291565)]

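Advanced query handling with customizable parameters and a warning mechanism

Explore the model's advanced features, including customizable search parameters and a warning mechanism designed to improve the user experience.

Running a custom prediction with a warning

Custom query with challenging parameters: tests the model's ability to discern highly relevant content for a query with a high relevancy threshold.
Triggering the warning: a mechanism that alerts the user when the search criteria are too restrictive, improving user feedback.

Understanding the model's response

Results in a challenging scenario: how the model responds to strict search criteria, including the case in which no result meets the relevancy threshold.

Implications and best practices

Balancing relevancy and coverage: set the relevancy threshold appropriately to balance precision against result coverage.
User feedback for corpus improvement: use the warnings as feedback for improving the corpus and enhancing the search system.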
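
A quick way to reason about the trade-off between relevancy and coverage is to count how many results survive at different thresholds. In this sketch, `coverage_by_threshold` is a hypothetical helper, the candidate thresholds are arbitrary, and the scores are rounded from the cooking query output shown earlier in this tutorial.

```python
def coverage_by_threshold(scores, thresholds):
    """Map each threshold to the number of scores that meet or exceed it."""
    return {t: sum(score >= t for score in scores) for t in thresholds}


# Scores rounded from the tutorial's cooking query output.
scores = [0.439, 0.347, 0.227]
coverage = coverage_by_threshold(scores, thresholds=[0.2, 0.3, 0.4, 0.5])
print(coverage)
```

For these scores, each step up in the threshold removes one result, and a threshold of 0.5 removes all of them, which is the situation in which the model falls back to the single best match and issues its warning.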

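Conclusion

These advanced features demonstrate the model's adaptability and the importance of tuning search parameters for a dynamic, responsive search experience.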

python
# Verify that the fallback logic works correctly by returning the 'best, closest' result, even though the parameters submitted should return no results.
# We are also validating that the warning is issued, alerting us to the fact that this behavior is occurring.
loaded_dynamic.predict(
  ["Latest stories on computing"], params={"top_k": 10, "minimum_relevancy": 0.4}
)

/var/folders/cd/n8n0rm2x53l_s0xv_j_xklb00000gp/T/ipykernel_55915/1325605132.py:71: RuntimeWarning: All top results are below the minimum relevancy threshold. Returning the highest match instead.
warnings.warn(

[('AI in Software Development: Transforming the Tech Landscape. The rapid advancements in artificial intelligence are reshaping how we approach software development. From automation to machine learning, the possibilities are endless.',
0.2533860206604004)]

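Conclusion: building custom logic with MLflow's PythonModel

As we wrap up this tutorial, let's review the key lessons and the power of MLflow's PythonModel for building custom logic for real-world applications, particularly when integrating advanced libraries such as sentence-transformers.

Key takeaways

Flexibility of PythonModel:
- PythonModel in MLflow offers great flexibility for defining custom logic. In this tutorial we used it to build a semantic search model tailored to our specific needs.
- This flexibility is invaluable for complex use cases that go beyond standard model implementations.
Integration with Sentence Transformers:
- We integrated the sentence-transformers library into our MLflow model. This shows how advanced NLP capabilities can be embedded in a custom model to handle complex tasks such as semantic search.
- Generating embeddings with a transformer model shows how state-of-the-art NLP techniques can be applied to practical scenarios.
Customization and user experience:
- Our model not only performs the core task of semantic search but also allows the search parameters (top_k and minimum_relevancy) to be customized. This level of customization is essential for aligning the model's output with different user needs.
- The warning mechanism further enriches the model by providing valuable feedback, which improves the user experience.
Real-world application and scalability:
- Although this tutorial focuses on a controlled dataset, the principles and methods apply to larger, real-world datasets. For scalability, vector databases and stores such as Redis or Elasticsearch are options for adapting the model to large-scale applications.

Empowering real-world applications

Combining MLflow's PythonModel with advanced libraries such as sentence-transformers simplifies the creation of complex, real-world applications.
The ability to encapsulate complex logic, manage dependencies, and ensure model portability makes MLflow a valuable tool in the modern data scientist's toolkit.

Moving forward

As we conclude, remember that the journey does not end here. The concepts and techniques explored in this tutorial lay the groundwork for further exploration and innovation in NLP and beyond.
We encourage you to put these lessons into practice, experiment with your own datasets, and keep pushing the boundaries of what MLflow and advanced NLP techniques can achieve.

Thank you for joining us on this journey through semantic search with Sentence Transformers and MLflow!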
