Advanced Paraphrase Mining with Sentence Transformers and MLflow

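This tutorial walks through advanced paraphrase mining with Sentence Transformers, with MLflow handling model tracking and deployment.

Learning Objectives

Apply sentence-transformers to advanced paraphrase mining.
Develop a custom PythonModel in MLflow tailored to this task.
Manage and track models effectively within the MLflow ecosystem.
Deploy the paraphrase mining model using MLflow's deployment capabilities.

Exploring Paraphrase Mining

Paraphrase mining is the process of identifying sentences that are semantically similar but textually different. It is a key part of many NLP applications, such as document summarization and chatbot development.

The Role of Sentence Transformers in Paraphrase Mining

Sentence Transformers specialize in generating rich sentence embeddings. These embeddings capture deeper semantic meaning, which makes it possible to compare texts by meaning rather than by surface wording.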
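
To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity, the default scoring function used by util.paraphrase_mining. The three-dimensional vectors are made up for illustration; real sentence embeddings have far more dimensions, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented values, not produced by a real model)
emb_space_a = [0.9, 0.1, 0.0]
emb_space_b = [0.8, 0.2, 0.1]
emb_cooking = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(emb_space_a, emb_space_b)
sim_unrelated = cosine_similarity(emb_space_a, emb_cooking)

# Vectors pointing in similar directions score higher
print(sim_related > sim_unrelated)
```

Sentences whose embeddings point in nearly the same direction score close to 1, which is what the similarity threshold later in this tutorial is compared against.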

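MLflow: Simplifying Model Management and Deployment

MLflow simplifies managing and deploying NLP models, with a focus on efficient tracking and customizable model implementations.

Join us to develop a deeper understanding of paraphrase mining and learn how to manage and deploy NLP models with MLflow.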

import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

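Introduction to the Paraphrase Mining Model

The paraphrase mining model combines Sentence Transformers and MLflow for advanced NLP tasks.

Overview of the Model Structure

Loading the model and corpus (load_context method): loads the Sentence Transformer model and the text corpus used for paraphrase identification.
Paraphrase mining logic (predict method): combines input validation with the paraphrase mining logic and exposes customizable parameters.
Sorting and filtering matches (_sort_and_filter_matches helper method): sorts matches by similarity score and filters them so that the results are relevant and unique.

Key Features

Advanced NLP techniques: uses Sentence Transformers for semantic text understanding.
Custom logic integration: demonstrates the flexibility available for customizing model behavior.
User customization options: lets end users adjust the matching criteria for different use cases.
Processing efficiency: loads the model and corpus once in load_context, so they are not reloaded on every prediction.
Robust error handling: validates inputs to keep model behavior reliable.

Practical Implications

This model provides a powerful tool for paraphrase detection across a range of applications and shows how to use a custom model effectively within the MLflow framework.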

import warnings

import pandas as pd
from sentence_transformers import SentenceTransformer, util

import mlflow
from mlflow.models.signature import infer_signature
from mlflow.pyfunc import PythonModel


class ParaphraseMiningModel(PythonModel):
  def load_context(self, context):
      """Load the model context for inference, including the customer feedback corpus."""
      try:
          # Load the pre-trained sentence transformer model
          self.model = SentenceTransformer.load(context.artifacts["model_path"])

          # Load the customer feedback corpus from the specified file
          corpus_file = context.artifacts["corpus_file"]
          with open(corpus_file) as file:
              self.corpus = file.read().splitlines()

      except Exception as e:
          raise ValueError(f"Error loading model and corpus: {e}") from e

  def _sort_and_filter_matches(
      self, query: str, paraphrase_pairs: list[tuple], similarity_threshold: float
  ):
      """Sort and filter the matches by similarity score."""

      # Sort the [score, i, j] triplets by score (index 0), highest first
      sorted_matches = sorted(paraphrase_pairs, key=lambda x: x[0], reverse=True)

      # Filter and collect paraphrases for the query, avoiding duplicates
      query_paraphrases = {}
      for score, i, j in sorted_matches:
          if score < similarity_threshold:
              continue

          paraphrase = self.corpus[j] if self.corpus[i] == query else self.corpus[i]
          if paraphrase == query:
              continue

          if paraphrase not in query_paraphrases or score > query_paraphrases[paraphrase]:
              query_paraphrases[paraphrase] = score

      return sorted(query_paraphrases.items(), key=lambda x: x[1], reverse=True)

  def predict(self, context, model_input, params=None):
      """Predict method to perform paraphrase mining over the corpus."""

      # Validate and extract the query input
      if isinstance(model_input, pd.DataFrame):
          if model_input.shape[1] != 1:
              raise ValueError("DataFrame input must have exactly one column.")
          query = model_input.iloc[0, 0]
      elif isinstance(model_input, dict):
          query = model_input.get("query")
          if query is None:
              raise ValueError("The input dictionary must have a key named 'query'.")
      else:
          raise TypeError(
              f"Unexpected type for model_input: {type(model_input)}. Must be either a Dict or a DataFrame."
          )

      # Determine the minimum similarity threshold
      similarity_threshold = params.get("similarity_threshold", 0.5) if params else 0.5

      # Add the query to the corpus for paraphrase mining
      extended_corpus = self.corpus + [query]

      # Perform paraphrase mining
      paraphrase_pairs = util.paraphrase_mining(
          self.model, extended_corpus, show_progress_bar=False
      )

      # Sort the matches by score and drop those below the threshold
      sorted_paraphrases = self._sort_and_filter_matches(
          query, paraphrase_pairs, similarity_threshold
      )

      # Warning if no paraphrases found
      if not sorted_paraphrases:
          warnings.warn("No paraphrases found above the similarity threshold.", UserWarning)

      return {sentence[0]: str(sentence[1]) for sentence in sorted_paraphrases}
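
util.paraphrase_mining scores pairs across the whole list it is given and returns [score, i, j] triplets, so when the query is appended to the corpus, many returned pairs may involve two corpus sentences rather than the query. The following is a minimal sketch of keeping only the pairs that involve the query, assuming the query sits at a known index (here, the last position). The function name matches_for_query and the toy data are hypothetical and not part of the class above.

```python
def matches_for_query(pairs, corpus, query_index, threshold=0.5):
    """Keep only pairs that involve the query; return (sentence, score) tuples."""
    best = {}
    for score, i, j in pairs:
        if score < threshold:
            continue
        if i == query_index:
            other = j
        elif j == query_index:
            other = i
        else:
            # Pair between two corpus sentences; it says nothing about the query
            continue
        sentence = corpus[other]
        # Keep the highest score seen for each distinct sentence
        if score > best.get(sentence, float("-inf")):
            best[sentence] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy triplets in the [score, i, j] shape; the scores are invented
extended_corpus = ["a cat sat", "a dog ran", "a feline sat"]  # last entry is the query
pairs = [[0.91, 0, 2], [0.72, 0, 1], [0.40, 1, 2]]

result = matches_for_query(pairs, extended_corpus, query_index=2, threshold=0.5)
print(result)
```

Matching on the index rather than on the sentence text also avoids ambiguity when the query happens to be identical to a corpus sentence.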

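Preparing the Corpus for Paraphrase Mining

Lay the groundwork for paraphrase mining by creating and preparing a diverse corpus.

Corpus Creation

Define a corpus of sentences drawn from a variety of topics, including space exploration, AI, gardening, and more. This diversity lets the model identify paraphrases across a broad range of subjects.

Writing the Corpus to a File

The corpus is saved to a file named feedback.txt, mirroring a common practice in large-scale data handling.
This step also makes the corpus available to the paraphrase mining model as a file that it can load.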
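
The write-then-read roundtrip can be checked in isolation. The sketch below writes one sentence per line and reads the file back with splitlines(), the same way load_context reads the corpus. The temporary directory, the explicit UTF-8 encoding, and the variable names are choices made for this example only.

```python
import os
import tempfile

sample_corpus = [
    "Gardening is a relaxing hobby that connects you with nature.",
    "Space tourism might soon become a reality for many.",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "feedback.txt")

    # One sentence per line, with an explicit encoding for portability
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sample_corpus:
            f.write(sentence + "\n")

    # splitlines() drops the newline characters, restoring the original list
    with open(path, encoding="utf-8") as f:
        loaded = f.read().splitlines()

print(loaded == sample_corpus)
```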

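Significance of the Corpus

The corpus is the dataset in which the model searches for semantically similar sentences. Its variety helps the model adapt to different use cases.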

corpus = [
  "Exploring ancient cities in Europe offers a glimpse into history.",
  "Modern AI technologies are revolutionizing industries.",
  "Healthy eating contributes significantly to overall well-being.",
  "Advancements in renewable energy are combating climate change.",
  "Learning a new language opens doors to different cultures.",
  "Gardening is a relaxing hobby that connects you with nature.",
  "Blockchain technology could redefine digital transactions.",
  "Homemade Italian pasta is a delight to cook and eat.",
  "Practicing yoga daily improves both physical and mental health.",
  "The art of photography captures moments in time.",
  "Baking bread at home has become a popular quarantine activity.",
  "Virtual reality is creating new experiences in gaming.",
  "Sustainable travel is becoming a priority for eco-conscious tourists.",
  "Reading books is a great way to unwind and learn.",
  "Jazz music provides a rich tapestry of sound and rhythm.",
  "Marathon training requires discipline and perseverance.",
  "Studying the stars helps us understand our universe.",
  "The rise of electric cars is an important environmental development.",
  "Documentary films offer deep insights into real-world issues.",
  "Crafting DIY projects can be both fun and rewarding.",
  "The history of ancient civilizations is fascinating to explore.",
  "Exploring the depths of the ocean reveals a world of marine wonders.",
  "Learning to play a musical instrument can be a rewarding challenge.",
  "Artificial intelligence is shaping the future of personalized medicine.",
  "Cycling is not only a great workout but also eco-friendly transportation.",
  "Home automation with IoT devices is enhancing living experiences.",
  "Understanding quantum computing requires a grasp of complex physics.",
  "A well-brewed cup of coffee is the perfect start to the day.",
  "Urban farming is gaining popularity as a sustainable food source.",
  "Meditation and mindfulness can lead to a more balanced life.",
  "The popularity of podcasts has revolutionized audio storytelling.",
  "Space exploration continues to push the boundaries of human knowledge.",
  "Wildlife conservation is essential for maintaining biodiversity.",
  "The fusion of technology and fashion is creating new trends.",
  "E-learning platforms have transformed the educational landscape.",
  "Dark chocolate has surprising health benefits when enjoyed in moderation.",
  "Robotics in manufacturing is leading to more efficient production.",
  "Creating a personal budget is key to financial well-being.",
  "Hiking in nature is a great way to connect with the outdoors.",
  "3D printing is innovating the way we create and manufacture objects.",
  "Sommeliers can identify a wine's characteristics with just a taste.",
  "Mind-bending puzzles and riddles are great for cognitive exercise.",
  "Social media has a profound impact on communication and culture.",
  "Urban sketching captures the essence of city life on paper.",
  "The ethics of AI is a growing field in tech philosophy.",
  "Homemade skincare remedies are becoming more popular.",
  "Virtual travel experiences can provide a sense of adventure at home.",
  "Ancient mythology still influences modern storytelling and literature.",
  "Building model kits is a hobby that requires patience and precision.",
  "The study of languages opens windows into different worldviews.",
  "Professional esports has become a major global phenomenon.",
  "The mysteries of the universe are unveiled through space missions.",
  "Astronauts' experiences in space stations offer unique insights into life beyond Earth.",
  "Telescopic observations bring distant galaxies within our view.",
  "The study of celestial bodies helps us understand the cosmos.",
  "Space travel advancements could lead to interplanetary exploration.",
  "Observing celestial events provides valuable data for astronomers.",
  "The development of powerful rockets is key to deep space exploration.",
  "Mars rover missions are crucial in searching for extraterrestrial life.",
  "Satellites play a vital role in our understanding of Earth's atmosphere.",
  "Astrophysics is central to unraveling the secrets of space.",
  "Zero gravity environments in space pose unique challenges and opportunities.",
  "Space tourism might soon become a reality for many.",
  "Lunar missions have contributed significantly to our knowledge of the moon.",
  "The International Space Station is a hub for groundbreaking space research.",
  "Studying comets and asteroids reveals information about the early solar system.",
  "Advancements in space technology have implications for many scientific fields.",
  "The possibility of life on other planets continues to intrigue scientists.",
  "Black holes are among the most mysterious phenomena in space.",
  "The history of space exploration is filled with remarkable achievements.",
  "Future space missions could unlock the mysteries of dark matter.",
]

# Write out the corpus to a file
corpus_file = "/tmp/feedback.txt"
with open(corpus_file, "w") as file:
  for sentence in corpus:
      file.write(sentence + "\n")

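Setting Up the Paraphrase Mining Model

Prepare the Sentence Transformer model for integration with MLflow so that its paraphrase mining capabilities can be used.

Loading the Sentence Transformer Model

Initialize the all-MiniLM-L6-v2 Sentence Transformer model, which is well suited to generating sentence embeddings for paraphrase mining.

Preparing the Input Example

Create a DataFrame as an input example to illustrate the kind of query the model will handle. It helps define the model's input structure.

Saving the Model

Save the model to /tmp/paraphrase_search_model so that it is portable and easy to load when deploying with MLflow.

Defining Artifacts and the Corpus Path

Specify the paths to the saved model and the corpus as artifacts in MLflow. They are needed for model logging and reproduction.

Generating Test Output for the Signature

Generate a sample output that illustrates the model's expected output format for paraphrase mining.

Creating the Model Signature

Use MLflow's infer_signature to define the model's input and output schema, and add the similarity_threshold parameter for flexibility at inference time.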
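
The idea of a parameter that has a default but can be overridden at inference time can be sketched in plain Python. This is an illustrative helper only: resolve_params and DEFAULT_PARAMS are hypothetical names, and the range check is an added assumption based on cosine similarity lying in [-1, 1]. It is not how MLflow handles params internally.

```python
DEFAULT_PARAMS = {"similarity_threshold": 0.5}


def resolve_params(params=None):
    """Merge caller-supplied params over the defaults and validate the threshold."""
    merged = {**DEFAULT_PARAMS, **(params or {})}
    threshold = float(merged["similarity_threshold"])
    # Assumption: scores are cosine similarities, which lie in [-1, 1]
    if not -1.0 <= threshold <= 1.0:
        raise ValueError(f"similarity_threshold must be in [-1, 1], got {threshold}")
    merged["similarity_threshold"] = threshold
    return merged


default_params = resolve_params()
overridden_params = resolve_params({"similarity_threshold": 0.65})
print(default_params, overridden_params)
```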

# Load a pre-trained sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Create an input example DataFrame
input_example = pd.DataFrame({"query": ["This product works well. I'm satisfied."]})

# Save the model in the /tmp directory
model_directory = "/tmp/paraphrase_search_model"
model.save(model_directory)

# Define the path for the corpus file
corpus_file = "/tmp/feedback.txt"

# Define the artifacts (paths to the model and corpus file)
artifacts = {"model_path": model_directory, "corpus_file": corpus_file}

# Generate test output for signature
# Sample output for paraphrase mining: a mapping of paraphrase -> similarity score (as a string)
test_output = [{"This product is satisfactory and functions as expected.": "0.8"}]

# Define the signature associated with the model
# The signature includes the structure of the input and the expected output, as well as any parameters that
# we would like to expose for overriding at inference time (including their default values if they are not overridden).
signature = infer_signature(
  model_input=input_example, model_output=test_output, params={"similarity_threshold": 0.5}
)

# Visualize the signature, showing our overridden inference parameter and its default.
signature

inputs: 
['query': string]
outputs: 
['This product is satisfactory and functions as expected.': string]
params: 
['similarity_threshold': double (default: 0.5)]

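Creating an Experiment

We create a new MLflow Experiment so that the run we log our model to is recorded under its own contextually relevant entry rather than in the default experiment.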

# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.

# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("Paraphrase Mining")

<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/sentence-transformers/tutorials/paraphrase-mining/mlruns/380691166097743403', creation_time=1701282619556, experiment_id='380691166097743403', last_update_time=1701282619556, lifecycle_stage='active', name='Paraphrase Mining', tags={}>

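Logging the Paraphrase Mining Model with MLflow

Log the custom paraphrase mining model with MLflow, a key step for model management and deployment.

Starting an MLflow Run

Start an MLflow run to create a complete record of model logging and tracking within the MLflow framework.

Logging the Model in MLflow

Use MLflow's Python model logging function to integrate the custom model into the MLflow ecosystem.
Give the model a unique name so that it is easy to identify in MLflow.
Log the instantiated paraphrase mining model together with the input example, model signature, artifacts, and Python dependencies.

Outcomes and Benefits of Model Logging

Logging the model in MLflow streamlines management and deployment and improves the model's accessibility and traceability.
It also supports reproducibility and version control across deployment environments.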

with mlflow.start_run() as run:
  model_info = mlflow.pyfunc.log_model(
      "paraphrase_model",
      python_model=ParaphraseMiningModel(),
      input_example=input_example,
      signature=signature,
      artifacts=artifacts,
      pip_requirements=["sentence_transformers"],
  )

Downloading artifacts:   0%|          | 0/11 [00:00<?, ?it/s]

2023/11/30 15:41:39 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false

Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

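Model Loading and Paraphrase Mining Prediction

Demonstrate the paraphrase mining model in practice by loading it with MLflow and running a prediction.

Loading the Model for Inference

Use MLflow's load_model function to retrieve the model and prepare it for inference.
Locate and load the model using the unique model URI that MLflow returned when the model was logged.

Executing a Paraphrase Mining Prediction

Call the model's predict method, which applies the paraphrase mining logic embedded in the model class.
Pass a representative query along with a chosen similarity_threshold to find matching paraphrases in the corpus.

Interpreting the Model Output

Review the list of semantically similar sentences returned by the model to see its paraphrase identification in action.
Analyze the similarity scores to see how closely the matched sentences are related.

Conclusion

This demonstration shows the paraphrase mining model working in a practical scenario and highlights its usefulness for content recommendation, information retrieval, and conversational AI.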

# Load our model by supplying the uri that was used to save the model artifacts
loaded_dynamic = mlflow.pyfunc.load_model(model_info.model_uri)

# Perform a quick validation that our loaded model is performing adequately
loaded_dynamic.predict(
  {"query": "Space exploration is fascinating."}, params={"similarity_threshold": 0.65}
)

{'Studying the stars helps us understand our universe.': '0.8207424879074097',
'The history of space exploration is filled with remarkable achievements.': '0.7770636677742004',
'Exploring ancient cities in Europe offers a glimpse into history.': '0.7461957335472107',
'Space travel advancements could lead to interplanetary exploration.': '0.7090306282043457',
'Space exploration continues to push the boundaries of human knowledge.': '0.6893945932388306',
'The mysteries of the universe are unveiled through space missions.': '0.6830739974975586',
'The study of celestial bodies helps us understand the cosmos.': '0.671358048915863'}
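
The scores in the returned dictionary are strings, so they should be converted to floats before any numeric comparison. A minimal sketch using three of the entries shown above follows; the 0.68 cutoff and the variable names are arbitrary choices for illustration.

```python
raw_output = {
    "Studying the stars helps us understand our universe.": "0.8207424879074097",
    "Space exploration continues to push the boundaries of human knowledge.": "0.6893945932388306",
    "The study of celestial bodies helps us understand the cosmos.": "0.671358048915863",
}

# Convert the string scores to floats for numeric work
scores = {sentence: float(score) for sentence, score in raw_output.items()}

# Keep only matches at or above a stricter, arbitrarily chosen cutoff
strong_matches = [sentence for sentence, score in scores.items() if score >= 0.68]

# Sentence with the highest similarity score
best_sentence = max(scores, key=scores.get)

print(len(strong_matches), best_sentence)
```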

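Conclusion: Insights and Potential Enhancements

As we wrap up this tutorial, let's reflect on the process of implementing a paraphrase mining model with Sentence Transformers and MLflow. We built and deployed a model capable of identifying semantically similar sentences, showcasing the flexibility and power of MLflow's PythonModel implementation.

Key Takeaways

We learned how to integrate advanced NLP techniques, specifically paraphrase mining, with MLflow. This integration improves model management and simplifies deployment and scaling.
The flexibility of the PythonModel implementation in MLflow was a central theme. We saw firsthand how it allows custom logic to be built into the model's predict function, catering to specific NLP tasks such as paraphrase mining.
Through our custom model, we explored sentence embeddings, semantic similarity, and the nuances of language understanding. This understanding matters in a wide range of applications, from content recommendation to conversational AI.

Ideas for Enhancing the Paraphrase Mining Model

While our model is a solid starting point, several enhancements could be made within the predict function to make it more capable and feature-rich.

Contextual filters: introduce filters based on contextual clues or specific keywords to refine the search results further. This would let users narrow the paraphrases down to those most relevant to their particular context or subject.
Sentiment analysis integration: incorporate sentiment analysis to group paraphrases by emotional tone. This is especially useful in applications such as customer feedback analysis, where understanding sentiment is as important as the content itself.
Multilingual support: extend the model to support paraphrase mining in multiple languages. This would significantly broaden the model's applicability in global or multilingual contexts.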
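
As an illustration of the contextual filter idea, the sketch below post-filters a paraphrase-to-score mapping by keyword. The function filter_by_keywords is a hypothetical helper rather than part of the tutorial's model, it uses simple case-insensitive substring matching, and the scores are invented.

```python
def filter_by_keywords(paraphrases, keywords):
    """Keep only paraphrases that contain at least one keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return {
        sentence: score
        for sentence, score in paraphrases.items()
        if any(keyword in sentence.lower() for keyword in lowered)
    }


# Example mapping of paraphrase -> score; the scores here are made up
candidates = {
    "Space tourism might soon become a reality for many.": 0.71,
    "Sustainable travel is becoming a priority for eco-conscious tourists.": 0.66,
    "Black holes are among the most mysterious phenomena in space.": 0.58,
}

space_only = filter_by_keywords(candidates, ["space"])
print(list(space_only))
```

A filter like this could be exposed as an additional inference parameter, in the same way as similarity_threshold.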

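Scalability with Vector Databases

Beyond using a static text file as the corpus, a more scalable and practical approach is to connect the model to an external vector database or in-memory store.
Precomputed embeddings can be stored and updated in such a database, accommodating content that is generated in real time without redeploying the model. This approach would significantly improve the model's scalability and responsiveness in real-world applications.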
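
The pattern of storing precomputed embeddings and querying them later can be sketched with a tiny in-memory store. Everything here is hypothetical: InMemoryVectorStore is not a real library class, and toy_embed is a bag-of-words stand-in for a real encoder such as SentenceTransformer.encode, used so that the example runs without downloading a model.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a real encoder: a bag-of-words count vector."""
    return Counter(text.lower().replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Stores each text with its embedding, computed once at insertion time."""

    def __init__(self):
        self._items = []

    def add(self, text):
        self._items.append((text, toy_embed(text)))

    def search(self, query, top_k=3, threshold=0.0):
        query_embedding = toy_embed(query)  # only the query is encoded at search time
        scored = [(text, cosine(query_embedding, emb)) for text, emb in self._items]
        scored = [(text, score) for text, score in scored if score >= threshold]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


store = InMemoryVectorStore()
store.add("Space tourism might soon become a reality for many.")
store.add("Homemade Italian pasta is a delight to cook and eat.")

# New content can be added later without rebuilding the store
store.add("Lunar missions have contributed significantly to our knowledge of the moon.")

hits = store.search("space tourism", top_k=1)
print(hits[0][0])
```

With a real encoder and a real vector database, the same split applies: corpus embeddings are computed when content is added, and only the query is encoded per request.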

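Final Thoughts

Building and deploying a paraphrase mining model has been both instructive and practical. We saw how MLflow's PythonModel offers a flexible canvas for building custom NLP solutions, and how Sentence Transformers can be used to dig into the semantics of language.

This tutorial is only a beginning. There is considerable room for further exploration and innovation in paraphrase mining and in NLP as a whole. We encourage you to build on this foundation, experiment with the enhancements described here, and keep exploring what is possible with MLflow and advanced NLP techniques.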
