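MLflow LlamaIndex Flavor

Introduction

LlamaIndex 🦙 is a powerful data-centric framework designed to connect custom data sources to large language models (LLMs). It offers a comprehensive set of data structures and tools that simplify ingesting, structuring, and accessing private or domain-specific data for use with LLMs. LlamaIndex excels at enabling context-aware AI applications by providing efficient indexing and retrieval mechanisms, which makes it easier to build advanced QA systems, chatbots, and other AI-driven applications that need to integrate external knowledge.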

Overview of LlamaIndex and MLflow integration

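Why use LlamaIndex with MLflow?

The integration of the LlamaIndex library with MLflow provides a seamless experience for managing and deploying LlamaIndex engines. The following are some of the key benefits of using LlamaIndex with MLflow:

MLflow Tracking lets you track your indices within MLflow and manage the many moving parts that make up a LlamaIndex project, such as prompts, LLMs, workflows, tools, global configurations, and more.
MLflow Model packages your LlamaIndex index, engine, or workflow together with all of its dependency versions, input and output interfaces, and other essential metadata. This lets you deploy LlamaIndex models for inference with ease, knowing that the environment is consistent across the different stages of the ML lifecycle.
MLflow Evaluate provides native capabilities within MLflow for evaluating GenAI applications. It supports efficient assessment of inference results from your LlamaIndex models, enabling robust performance analysis and quick iteration.
MLflow Tracing is an observability tool for monitoring and debugging what happens inside your LlamaIndex models, helping you identify potential bottlenecks or issues quickly. With its automatic logging capability, you can instrument your LlamaIndex application with a single function call, mlflow.llama_index.autolog(), and no other code changes.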

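Getting Started

In these introductory tutorials, you will learn the most fundamental components of LlamaIndex and how to use the integration with MLflow to bring better maintainability and observability to your LlamaIndex applications.

LlamaIndex Workflows with MLflow

Get started with MLflow and LlamaIndex by building a simple agentic workflow. Learn how to log and load the workflow for inference, and how to enable tracing for observability.

Building an Index with MLflow

Get started with MLflow and LlamaIndex by exploring the simplest possible configuration of a VectorStoreIndex.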

Concepts

Note

The Workflow integration is only available with LlamaIndex >= 0.11.0 and MLflow >= 2.17.0.

`Workflow` 🆕

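Workflow is LlamaIndex's event-driven orchestration framework. It is designed as a flexible and interpretable framework for building arbitrary LLM applications such as agents, RAG flows, and data extraction pipelines. MLflow supports tracking, evaluating, and tracing Workflow objects, which makes them more observable and maintainable.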

`Index`

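An Index object is a collection of documents indexed for fast information retrieval, powering applications such as Retrieval-Augmented Generation (RAG) and agents. An Index object can be logged directly to an MLflow run and loaded back for use as an inference engine.

`Engine`

An Engine is a generic interface built on top of an Index object that provides a set of APIs for interacting with the index. LlamaIndex provides two types of engines: QueryEngine and ChatEngine. A QueryEngine takes a single query and returns a response based on the index. A ChatEngine is designed for conversational agents and additionally keeps track of the conversation history.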
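The difference between the two engine types comes down to state. The following is a minimal sketch with toy stand-in classes (ToyIndex, ToyQueryEngine, and ToyChatEngine are hypothetical names, not LlamaIndex classes) showing why a follow-up question only makes sense to an engine that remembers earlier turns.

```python
from dataclasses import dataclass


@dataclass
class ToyIndex:
    """Hypothetical stand-in for an index: maps a keyword to a stored fact."""

    facts: dict

    def lookup(self, text: str) -> str:
        # Return the first fact whose keyword occurs in the text.
        for keyword, fact in self.facts.items():
            if keyword in text.lower():
                return fact
        return "no match"


class ToyQueryEngine:
    """Stateless: every query is answered from the index alone."""

    def __init__(self, index: ToyIndex):
        self.index = index

    def query(self, text: str) -> str:
        return self.index.lookup(text)


class ToyChatEngine:
    """Stateful: remembers every (role, message) turn it has seen."""

    def __init__(self, index: ToyIndex):
        self.index = index
        self.history = []

    def chat(self, message: str) -> str:
        # Search the new message together with earlier user turns, so that
        # a follow-up question can still be resolved against the index.
        context = " ".join(text for role, text in self.history if role == "user")
        answer = self.index.lookup(f"{message} {context}")
        self.history.append(("user", message))
        self.history.append(("assistant", answer))
        return answer


index = ToyIndex({"ibm": "The first program ran on an IBM 1401."})
query_engine = ToyQueryEngine(index)
chat_engine = ToyChatEngine(index)

# The follow-up has no keyword of its own: only the chat engine can resolve it.
query_engine.query("Which IBM machine ran it?")
stateless_followup = query_engine.query("How did the author feel?")
chat_engine.chat("Which IBM machine ran it?")
stateful_followup = chat_engine.chat("How did the author feel?")
```

Real engines use an LLM and embeddings rather than keyword matching; the sketch only illustrates the stateless versus stateful contract.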

`Settings`

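The Settings object is a global service context that bundles commonly used resources across a LlamaIndex application. It includes settings such as the LLM, the embedding model, and callbacks. When you log a LlamaIndex index, engine, or workflow, MLflow tracks the state of the Settings object so that you can reproduce the same results when loading the model back for inference (note that some objects, such as API keys and non-serializable objects, are not tracked).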
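To illustrate what "not tracked" means in practice, here is a toy sketch of snapshotting a settings dictionary while dropping credentials and values that cannot be serialized. This is not MLflow's actual serialization logic; the key names, the SECRET_KEYS set, and the use of JSON as the serializability check are all assumptions made for the example.

```python
import json

# Hypothetical key names; a real settings object uses its own attribute names.
SECRET_KEYS = {"api_key"}


def snapshot_settings(settings: dict) -> dict:
    """Return only the entries that are safe and possible to persist."""
    snapshot = {}
    for key, value in settings.items():
        if key in SECRET_KEYS:
            continue  # never persist credentials
        try:
            json.dumps(value)
        except TypeError:
            continue  # skip objects that cannot be serialized
        snapshot[key] = value
    return snapshot


settings = {
    "llm": {"model": "gpt-4o-mini", "temperature": 0.1},
    "chunk_size": 1024,
    "api_key": "sk-not-a-real-key",
    "callback": print,  # a function is not JSON-serializable
}
saved = snapshot_settings(settings)
print(sorted(saved))
```

The practical consequence is that anything filtered out this way, such as an API key, has to be supplied again in the environment where the model is loaded.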

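Usage

Saving and Loading an Index in an MLflow Experiment

Creating an Index

The index object is the centerpiece of the LlamaIndex and MLflow integration. With LlamaIndex, you can create an index from a collection of documents or from an external vector store. The following code creates a sample index from the Paul Graham essay data available in the LlamaIndex repository.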

mkdir -p data
curl -L https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt -o ./data/paul_graham_essay.txt

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

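Logging the Index to MLflow

You can log the index object to an MLflow experiment using the mlflow.llama_index.log_model() function.

A key step here is specifying the engine_type parameter. The choice of engine type does not affect the index itself, but it determines the interface through which you query the index when you load it for inference.

QueryEngine (engine_type="query") is designed for simple query-response systems: it takes a single query string and returns a response.
ChatEngine (engine_type="chat") is designed for conversational agents: it keeps track of the conversation history and responds to user messages.
Retriever (engine_type="retriever") is a lower-level component that returns the top-k relevant documents matching the query.

The following code is an example of logging an index to MLflow with the chat engine type.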

import mlflow

mlflow.set_experiment("llama-index-demo")

with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        index,
        name="index",
        engine_type="chat",
        input_example="What did the author do growing up?",
    )

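Note

The code snippet above passes the index object directly to the log_model function. This approach works only with the default SimpleVectorStore vector store, which keeps the embedded documents in memory. If your index uses an external vector store such as QdrantVectorStore or DatabricksVectorSearch, use the Model-from-Code logging method instead. See "How to log and load an index with an external vector store?" for more details.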

MLflow artifacts for the LlamaIndex index

Tip

Under the hood, MLflow calls the as_query_engine() / as_chat_engine() / as_retriever() method on the index object to convert it into the corresponding engine instance.
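The mapping from engine_type to a conversion method can be pictured as a small dispatch table. The sketch below is an illustration only, not MLflow's source code: build_engine and FakeIndex are hypothetical names, and FakeIndex returns plain strings so the example runs without LlamaIndex installed. Only the three method names come from the tip above.

```python
# Engine type -> name of the index method that builds the engine.
ENGINE_FACTORIES = {
    "query": "as_query_engine",
    "chat": "as_chat_engine",
    "retriever": "as_retriever",
}


def build_engine(index, engine_type: str):
    """Convert an index into the engine selected by engine_type."""
    try:
        method_name = ENGINE_FACTORIES[engine_type]
    except KeyError:
        raise ValueError(f"Unsupported engine_type: {engine_type!r}") from None
    # Look the method up by name and call it with no arguments.
    return getattr(index, method_name)()


class FakeIndex:
    """Hypothetical stand-in that mimics the three conversion methods."""

    def as_query_engine(self):
        return "query-engine"

    def as_chat_engine(self):
        return "chat-engine"

    def as_retriever(self):
        return "retriever"


engine = build_engine(FakeIndex(), "chat")
```

Because the index itself is unchanged by this conversion, the same logged data can back any of the three interfaces.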

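Loading the Index Back for Inference

The saved index can be loaded back for inference with the mlflow.pyfunc.load_model() function. This function returns an MLflow Python Model backed by the LlamaIndex engine, using the engine type specified at logging time.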

import mlflow

model = mlflow.pyfunc.load_model(model_info.model_uri)

response = model.predict("What was the first program the author wrote?")
print(response)
# >> The first program the author wrote was on the IBM 1401 ...

# The chat engine keeps track of the conversation history
response = model.predict("How did the author feel about it?")
print(response)
# >> The author felt puzzled by the first program ...

Tip

To load the index itself rather than the engine, use the mlflow.llama_index.load_model() function.

index = mlflow.llama_index.load_model("runs:/<run_id>/index")

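Enabling Tracing

You can enable tracing for your LlamaIndex code by calling the mlflow.llama_index.autolog() function. MLflow automatically logs the inputs and outputs of each LlamaIndex execution to the active MLflow experiment, giving you a detailed view of the model's behavior.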

import mlflow

mlflow.llama_index.autolog()

chat_engine = index.as_chat_engine()
response = chat_engine.chat("What was the first program the author wrote?")

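You can then navigate to the MLflow UI, select the experiment, and open the "Traces" tab to find the logged trace for the prediction made by the engine. The trace shows how the chat engine coordinates and executes multiple tasks to answer your question.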

Trace view in MLflow UI

You can disable tracing by calling the same function with the disable parameter set to True.

mlflow.llama_index.autolog(disable=True)

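Note

Tracing supports async prediction and streaming responses, but it does not support the combination of async and streaming, such as the astream_chat method.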

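FAQ

How to log and load an index with an external vector store?

If your index uses the default SimpleVectorStore, you can log the index directly to MLflow with the mlflow.llama_index.log_model() function. MLflow persists the in-memory index data (embedded documents) to the MLflow artifact store, so the index can be loaded back with the same data without re-indexing the documents.

However, when the index uses an external vector store such as DatabricksVectorSearch or QdrantVectorStore, the index data is stored remotely and does not support local serialization. As a result, you cannot log an index with these stores directly. In that case, you can use Model-from-Code logging, which gives you more control over how the index is saved and lets you use an external vector store.

To use Model-from-Code logging, you first need to create a separate Python file that defines the index. If you are in a Jupyter notebook, you can use the %%writefile magic command to save the cell code to a Python file.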

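# %%writefile index.py
import mlflow
import qdrant_client
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Create Qdrant client with your own settings.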
client = qdrant_client.QdrantClient(
    host="localhost",
    port=6333,
)

# Here we simply load vector store from the existing collection to avoid
# re-indexing documents, because this Python file is executed every time
# when the model is loaded. If you don't have an existing collection, create
# a new one by following the official tutorial:
# https://docs.llamaindex.org.cn/en/stable/examples/vector_stores/QdrantIndexDemo/
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

# IMPORTANT: call set_model() method to tell MLflow to log this index
mlflow.models.set_model(index)

Then you can log the index by passing the Python file path to the mlflow.llama_index.log_model() function. The global Settings object is saved as part of the model as usual.

import mlflow

with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        "index.py",
        name="index",
        engine_type="query",
    )

The logged index can be loaded back with the mlflow.llama_index.load_model() or mlflow.pyfunc.load_model() function, in the same way as a local index.

index = mlflow.llama_index.load_model(model_info.model_uri)
index.as_query_engine().query("What is MLflow?")

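Note

The object passed to the set_model() method must be a LlamaIndex index that is compatible with the engine type specified at logging time. Support for more object types will be added in future releases.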
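The comment in index.py above notes that the file is executed every time the model is loaded. The following toy sketch mimics that idea with the standard library only. It is not how MLflow implements Model-from-Code: set_model here is a hypothetical stand-in for mlflow.models.set_model(), load_from_code is an invented helper, and a plain dictionary takes the place of a real index.

```python
import runpy
import tempfile
import textwrap
from pathlib import Path

_registry = {}


def set_model(obj):
    """Hypothetical stand-in for mlflow.models.set_model(): remember the object."""
    _registry["model"] = obj


def load_from_code(path):
    """Execute the definition file and return whatever it registered."""
    _registry.clear()
    # init_globals injects our set_model so the file can call it directly.
    runpy.run_path(str(path), init_globals={"set_model": set_model})
    return _registry["model"]


definition = textwrap.dedent(
    """
    print("definition file executed")
    index = {"collection": "my_collection"}  # placeholder for a real index
    set_model(index)
    """
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.py"
    path.write_text(definition)
    first = load_from_code(path)   # runs the file
    second = load_from_code(path)  # runs the file again
```

Each load rebuilds the object from scratch, which is why the definition file should connect to an existing collection rather than re-index documents on every load.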

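How to log and load a LlamaIndex workflow?

MLflow supports logging and loading a LlamaIndex workflow through the Model-from-Code feature. For a detailed example of logging and loading a LlamaIndex workflow, see the LlamaIndex Workflows with MLflow notebook.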

import mlflow

with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        "/path/to/workflow.py",
        name="model",
        input_example={"input": "What is MLflow?"},
    )

The logged workflow can be loaded back with the mlflow.llama_index.load_model() or mlflow.pyfunc.load_model() function.

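# Use mlflow.llama_index.load_model to load the workflow object as is.
# Note: top-level `await` works in a notebook; in a plain script, make this
# call from inside an async function driven by asyncio.run().
workflow = mlflow.llama_index.load_model(model_info.model_uri)
await workflow.run(input="What is MLflow?")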

# Use mlflow.pyfunc.load_model to load the workflow as a MLflow Pyfunc Model
# with standard inference APIs for deployment and evaluation.
pyfunc = mlflow.pyfunc.load_model(model_info.model_uri)
pyfunc.predict({"input": "What is MLflow?"})

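Warning

MLflow PyFunc models do not support async inference. When you load the workflow with mlflow.pyfunc.load_model(), the predict method becomes synchronous and blocks until the workflow execution completes. The same applies when you deploy the logged LlamaIndex workflow to a production endpoint with MLflow Deployment or Databricks Model Serving.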
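The blocking behavior described in the warning can be reproduced with a few lines of asyncio. This is a sketch of the general pattern of exposing an async call through a synchronous method, not MLflow's implementation: ToyWorkflow and SyncWrapper are hypothetical classes, and the short sleep stands in for real LLM and tool calls.

```python
import asyncio
import time


class ToyWorkflow:
    """Hypothetical async workflow whose run() must be awaited."""

    async def run(self, input: str) -> str:
        await asyncio.sleep(0.05)  # stands in for LLM and tool calls
        return f"answer to: {input}"


class SyncWrapper:
    """Expose a blocking predict() on top of an async workflow."""

    def __init__(self, workflow: ToyWorkflow):
        self.workflow = workflow

    def predict(self, data: dict) -> str:
        # asyncio.run() starts an event loop, drives the coroutine to
        # completion, and only then returns to the caller.
        return asyncio.run(self.workflow.run(**data))


model = SyncWrapper(ToyWorkflow())
start = time.perf_counter()
result = model.predict({"input": "What is MLflow?"})
elapsed = time.perf_counter() - start  # includes the full workflow run time
```

One caveat of this particular pattern is that asyncio.run() raises an error when called from a thread that already has a running event loop, such as a notebook cell, so the sketch is meant to be run as a plain script.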

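I have an index logged with the `query` engine type. Can I load it back as a `chat` engine?

While it is not possible to update the engine type of a logged model in place, you can always load the index back and re-log it with the desired engine type. This process does not require re-creating the index, so it is an efficient way to switch between engine types.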

import mlflow

# Log the index with the query engine type first
with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        index,
        name="index-query",
        engine_type="query",
    )

# Load the index back and re-log it with the chat engine type
index = mlflow.llama_index.load_model(model_info.model_uri)
with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        index,
        name="index-chat",
        # Specify the chat engine type this time
        engine_type="chat",
    )

Alternatively, you can use the standard inference APIs on the loaded LlamaIndex native index object, specifically:

index.as_chat_engine().chat("hi")
index.as_query_engine().query("hi")
index.as_retriever().retrieve("hi")

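How to use a different LLM for inference?

When you save an index to MLflow, the global Settings object is persisted as part of the model. This object contains settings such as the LLM and the embedding model to be used by engines.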

import mlflow
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI("gpt-4o-mini")

# MLflow saves GPT-4o-Mini as the LLM to use for inference
with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(index, name="index", engine_type="chat")

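Then, when you load the index back, the persisted settings are also applied globally. This means that the loaded engine uses the same LLM as when it was logged.

However, sometimes you may want to use a different LLM for inference. In that case, you can update the global Settings object directly after loading the index.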

import mlflow
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Load the index back
loaded_index = mlflow.llama_index.load_model(model_info.model_uri)

assert Settings.llm.model == "gpt-4o-mini"


# Update the settings to use GPT-4 instead
Settings.llm = OpenAI("gpt-4")
query_engine = loaded_index.as_query_engine()
response = query_engine.query("What is the capital of France?")
