Deploying a Transformers Model as an OpenAI-Compatible Chatbot

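Welcome to our tutorial on using Transformers and MLflow to create an OpenAI-compatible chat model. In MLflow 2.11 and later, MLflow's Transformers flavor supports a special task type, llm/v1/chat, which turns thousands of text-generation models on Hugging Face into conversational chatbots that are interoperable with OpenAI models. This lets you seamlessly swap the backend LLM of a chat application, or easily evaluate different models, without modifying your client-side code.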

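If you haven't already seen it, we recommend going through our introductory notebook on chat and Transformers before reading this one, as this notebook is slightly higher-level and does not delve into the inner workings of Transformers or MLflow Tracking.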

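Note: This page covers how to deploy a Transformers model as a chatbot. If you are using a different framework or a custom Python model, use ChatModel instead to build an OpenAI-compatible chatbot.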

Learning objectives

In this tutorial, you will:

Create an OpenAI-compatible chat model using TinyLlama-1.1B-Chat
Log the model to MLflow and load it back for local inference
Serve the model with MLflow Model Serving

%pip install "mlflow>=2.11.0" -q -U
# OpenAI-compatible chat model support is available for Transformers 4.34.0 and above
%pip install "transformers>=4.34.0" -q -U

# Disable tokenizers warnings when constructing pipelines
%env TOKENIZERS_PARALLELISM=false

import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

env: TOKENIZERS_PARALLELISM=false

Building a chat model

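MLflow's native Transformers integration lets you specify the task parameter when saving or logging a pipeline. Originally, this parameter accepted any of the Transformers pipeline task types, but the mlflow.transformers flavor adds a few MLflow-specific keys for the text-generation pipeline type.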

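For text-generation pipelines, instead of specifying text-generation as the task type, you can provide one of two string literals that conform to the MLflow AI Gateway's endpoint_type specification ("llm/v1/embeddings" can be specified as the task for models saved with mlflow.sentence_transformers):

"llm/v1/chat" for chat-style applications
"llm/v1/completions" for generic completions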

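When one of these keys is specified, MLflow automatically handles everything required to serve a chat or completions model. This includes:

Setting a chat/completions-compatible signature on the model
Performing data pre- and post-processing so that inputs and outputs conform to the chat/completions API spec, which is compatible with the OpenAI API spec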
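To make the post-processing step concrete, here is a minimal sketch of wrapping a raw generated string in a chat completion dict. It is an illustration only, not MLflow's actual implementation: the helper name to_chat_completion is hypothetical, the token counts are supplied by the caller, and finish_reason is hard-coded to "stop" for simplicity. The field names follow the output signature shown later in this tutorial.

```python
import time
import uuid


def to_chat_completion(text, model_name, prompt_tokens, completion_tokens):
    """Wrap a generated string in an OpenAI-style chat completion dict.

    Hypothetical helper for illustration; MLflow does this internally.
    """
    return {
        "id": str(uuid.uuid4()),
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model_name,
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
        "choices": [
            {
                "index": 0,
                # Assumption: generation always ends normally.
                "finish_reason": "stop",
                "message": {"role": "assistant", "content": text},
            }
        ],
    }


response = to_chat_completion(
    'print("Hello, world!")', "TinyLlama/TinyLlama-1.1B-Chat-v1.0", 24, 71
)
print(response["choices"][0]["message"]["content"])
```

Because every backend returns the same shape, client code that reads choices[0].message.content does not need to change when the model does.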

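Note that these modifications apply only when the model is loaded with mlflow.pyfunc.load_model() (for example, when serving the model with the mlflow models serve CLI tool). If you want to load just the base pipeline, you can always do so with mlflow.transformers.load_model().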

In the next few cells, we'll learn how to serve a chat model with a local Transformers pipeline and MLflow, using TinyLlama-1.1B-Chat as an example.

To begin, let's go through the original flow of saving a text-generation pipeline:

from transformers import pipeline

import mlflow

generator = pipeline(
  "text-generation",
  model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)

# save the model using the vanilla `text-generation` task type
mlflow.transformers.save_model(
  path="tinyllama-text-generation", transformers_model=generator, task="text-generation"
)

/var/folders/qd/9rwd0_gd0qs65g4sdqlm51hr0000gp/T/ipykernel_55429/4268198845.py:11: FutureWarning: The 'transformers' MLflow Models integration is known to be compatible with the following package version ranges: ``4.25.1`` -  ``4.37.1``. MLflow Models integrations with transformers may not succeed when used with package versions outside of this range.
mlflow.transformers.save_model(

Now, let's load the model and use it for inference. Our loaded model is a text-generation pipeline, so let's look at its signature to see its expected inputs and outputs.

# load the model for inference
model = mlflow.pyfunc.load_model("tinyllama-text-generation")

model.metadata.signature

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

2024/02/26 21:06:51 WARNING mlflow.transformers: Could not specify device parameter for this pipeline type

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

inputs: 
[string (required)]
outputs: 
[string (required)]
params: 
None

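Unfortunately, it accepts only a string as input, which isn't directly compatible with a chat interface. When interacting with OpenAI's API, for example, we expect to simply pass in a list of messages. To do this with our current model, we have to write some additional boilerplate: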

# first, apply the tokenizer's chat template, since the
# model is tuned to accept prompts in a chat format. this
# also converts the list of messages to a string.
messages = [{"role": "user", "content": "Write me a hello world program in python"}]
prompt = generator.tokenizer.apply_chat_template(
  messages, tokenize=False, add_generation_prompt=True
)

model.predict(prompt)

['<|user|>
Write me a hello world program in python</s>
<|assistant|>
Here's a simple hello world program in Python:

```python
print("Hello, world!")
```

This program prints the string "Hello, world!" to the console. You can run this program by typing it into the Python interpreter or by running the command `python hello_world.py` in your terminal.']

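Now we're getting somewhere, but formatting our messages before inference is cumbersome.

Additionally, the output format isn't compatible with the OpenAI API spec either: it's just a list of strings. If we wanted to evaluate different model backends for our chat application, we'd have to rewrite some of our client-side code to both format the input and parse this new response.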
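As a rough illustration of that extra client-side work, the sketch below strips the echoed prompt from the raw output so that only the assistant's reply remains. The helper extract_reply is hypothetical, and it assumes the pipeline returns the prompt verbatim at the start of the generated text, as in the output above.

```python
def extract_reply(generated_text, prompt):
    """Return only the newly generated text, dropping the echoed prompt.

    Assumption: the prompt appears verbatim at the start of the output.
    If it does not, the text is returned unchanged (minus whitespace).
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


# A prompt in the TinyLlama chat format, as produced by apply_chat_template.
prompt = "<|user|>\nWrite me a hello world program in python</s>\n<|assistant|>\n"

# Stand-in for the list of strings returned by model.predict(prompt).
raw_output = [prompt + "Here's a simple hello world program in Python:"]

reply = extract_reply(raw_output[0], prompt)
print(reply)
```

This kind of parsing is tied to one model's chat template, which is exactly the coupling we would like to avoid.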

To simplify all of this, we just pass "llm/v1/chat" as the task parameter when saving the model.

# save the model using the `"llm/v1/chat"`
# task type instead of `text-generation`
mlflow.transformers.save_model(
  path="tinyllama-chat", transformers_model=generator, task="llm/v1/chat"
)

/var/folders/qd/9rwd0_gd0qs65g4sdqlm51hr0000gp/T/ipykernel_55429/609241782.py:3: FutureWarning: The 'transformers' MLflow Models integration is known to be compatible with the following package version ranges: ``4.25.1`` -  ``4.37.1``. MLflow Models integrations with transformers may not succeed when used with package versions outside of this range.
mlflow.transformers.save_model(

Once again, let's load the model and inspect the signature:

model = mlflow.pyfunc.load_model("tinyllama-chat")

model.metadata.signature

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

2024/02/26 21:10:04 WARNING mlflow.transformers: Could not specify device parameter for this pipeline type

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

inputs: 
['messages': Array({content: string (required), name: string (optional), role: string (required)}) (required), 'temperature': double (optional), 'max_tokens': long (optional), 'stop': Array(string) (optional), 'n': long (optional), 'stream': boolean (optional)]
outputs: 
['id': string (required), 'object': string (required), 'created': long (required), 'model': string (required), 'choices': Array({finish_reason: string (required), index: long (required), message: {content: string (required), name: string (optional), role: string (required)} (required)}) (required), 'usage': {completion_tokens: long (required), prompt_tokens: long (required), total_tokens: long (required)} (required)]
params: 
None
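To read the input signature more concretely, here is a small sketch that checks a request payload against the required fields listed above: messages must be an array, and each message needs a string role and a string content. The function validate_chat_request is a hypothetical client-side helper, not part of MLflow, and it ignores the optional fields.

```python
def validate_chat_request(payload):
    """Return a list of problems found in a chat request payload.

    Hypothetical helper: checks only the required fields from the
    signature ('messages', and each message's 'role' and 'content').
    """
    messages = payload.get("messages")
    if not isinstance(messages, list):
        return ["'messages' must be a list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}] must be a dict")
            continue
        for key in ("role", "content"):
            if not isinstance(message.get(key), str):
                problems.append(f"messages[{i}].{key} must be a string")
    return problems


good = {"messages": [{"role": "user", "content": "Hello"}]}
bad = {"messages": [{"role": "user"}]}

print(validate_chat_request(good))
print(validate_chat_request(bad))
```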

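Now when performing inference, we can pass our messages in a dict, as we would when interacting with the OpenAI API. Furthermore, the response we receive from the model also conforms to the spec.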

messages = [{"role": "user", "content": "Write me a hello world program in python"}]

model.predict({"messages": messages})

[{'id': '8435a57d-9895-485e-98d3-95b1cbe007c0',
'object': 'chat.completion',
'created': 1708949437,
'model': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0',
'usage': {'prompt_tokens': 24, 'completion_tokens': 71, 'total_tokens': 95},
'choices': [{'index': 0,
  'finish_reason': 'stop',
  'message': {'role': 'assistant',
   'content': 'Here's a simple hello world program in Python:

```python
print("Hello, world!")
```

This program prints the string "Hello, world!" to the console. You can run this program by typing it into the Python interpreter or by running the command `python hello_world.py` in your terminal.'}}]}]
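Since the request and response share the same message format, continuing a conversation is a matter of appending the assistant's message to the history. The sketch below shows one way to do that; append_reply is a hypothetical helper, and the prediction dict is a trimmed stand-in for one element of the list returned by model.predict.

```python
def append_reply(messages, prediction):
    """Return a new message list with the assistant's reply appended.

    'prediction' is a single chat completion dict. The original
    'messages' list is not modified.
    """
    reply = prediction["choices"][0]["message"]
    return messages + [{"role": reply["role"], "content": reply["content"]}]


messages = [{"role": "user", "content": "Write me a hello world program in python"}]

# Trimmed stand-in for one element of model.predict({"messages": messages}).
prediction = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {"role": "assistant", "content": 'print("Hello, world!")'},
        }
    ],
}

history = append_reply(messages, prediction)
# Add the next user turn; 'history' can then be sent as the new "messages".
history.append({"role": "user", "content": "Now do it in Rust"})
print([m["role"] for m in history])
```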

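Serving the chat model

To take this example further, let's use MLflow to serve our chat model so that we can interact with it like a web API. To do this, we can use the mlflow models serve CLI tool.

In a terminal, run: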

$ mlflow models serve -m tinyllama-chat

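Once the server has finished initializing, you should be able to interact with the model via HTTP requests. The input format is almost identical to the format described in the MLflow Deployments Server docs, with the exception that temperature defaults to 1.0 instead of 0.0.

Here's a quick example: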

%%sh
curl http://127.0.0.1:5000/invocations   -H 'Content-Type: application/json'   -d '{ "messages": [{"role": "user", "content": "Write me a hello world program in python"}] }'   | jq

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
100   706  100   617  100    89     25      3  0:00:29  0:00:23  0:00:06   160

[
{
  "id": "fc3d08c3-d37d-420d-a754-50f77eb32a92",
  "object": "chat.completion",
  "created": 1708949465,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 71,
    "total_tokens": 95
  },
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Here's a simple hello world program in Python:

```python
print("Hello, world!")
```

This program prints the string "Hello, world!" to the console. You can run this program by typing it into the Python interpreter or by running the command `python hello_world.py` in your terminal."
      }
    }
  ]
}
]

It's that easy!

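You can also call the API with a few optional inference params to adjust the model's responses. These map to Transformers pipeline params and are passed in directly at inference time.

max_tokens (maps to max_new_tokens): The maximum number of new tokens the model should generate.
temperature (maps to temperature): Controls the creativity of the model's response. Note that this is not guaranteed to be supported by all models, and for this param to have an effect, the pipeline must have been created with do_sample=True.
stop (maps to stopping_criteria): A list of tokens at which to stop generation.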

Note: n has no equivalent Transformers pipeline param and is not supported in queries. However, you can implement a model that consumes the n param using Custom Pyfunc (details below).
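As a sketch of how these optional params might be sent from Python rather than curl, the snippet below builds (but does not send) a POST request using only the standard library. The helper build_chat_request is hypothetical, the URL assumes the local server started earlier, and the parameter values, including the "</s>" stop token, are illustrative. n is left out because it is not supported.

```python
import json
import urllib.request


def build_chat_request(messages, url="http://127.0.0.1:5000/invocations", **params):
    """Build (but do not send) a POST request for the chat endpoint.

    Extra keyword arguments such as max_tokens, temperature, and stop
    are placed at the top level of the JSON body next to 'messages'.
    """
    body = {"messages": messages, **params}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    [{"role": "user", "content": "Write me a hello world program in python"}],
    max_tokens=100,
    temperature=0.7,
    stop=["</s>"],
)
print(request.data.decode("utf-8"))

# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         print(json.load(response))
```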

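Conclusion

In this tutorial, you learned how to create an OpenAI-compatible chat model by specifying "llm/v1/chat" as the task when saving a Transformers pipeline.

What's next?

Learn about custom ChatModel. If you're looking for further customization or models outside of Transformers, the linked page provides hands-on guidance on building a chatbot with MLflow's ChatModel class.
More on MLflow AI Gateway. In this tutorial, we saw how to deploy a model using a local server, but MLflow provides many other ways to deploy your models to production. Check out this page to learn more about the different options.
More on MLflow's Transformers integration. This page provides a comprehensive overview of MLflow's Transformers integration, along with many hands-on guides and notebooks. Learn how to fine-tune models, use prompt templates, and more!
Other LLM integrations. Aside from Transformers, MLflow has integrations with many other popular LLM libraries, such as LangChain and OpenAI.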
