Tasks in MLflow Transformers Flavor

This page describes how to use the task parameter in the MLflow Transformers flavor to control a model's inference interface.

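Overview
Native Transformers Task Types
Advanced Tasks for OpenAI-Compatible Inference
- Input and Output Formats
- Code Example of Using llm/v1 Tasks
Provisioned Throughput on Databricks Model Serving
FAQ
- How to override the default query parameters for OpenAI-compatible inference?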

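Overview

In the MLflow Transformers flavor, the task plays a crucial role in determining a model's input and output formats. Task is a fundamental concept in the Transformers library: it describes the API structure (inputs and outputs) of each model and determines which inference API and widget are displayed for a given model.

MLflow uses this concept to determine a model's input and output formats, persist the correct Model Signature, and provide a consistent Pyfunc inference API across different model types. In addition to the native Transformers task types, MLflow defines a few extra task types to support more complex use cases, such as chat-style applications.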

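Native Transformers Task Types

For native Transformers tasks, MLflow automatically infers the task type from the pipeline when you save it with mlflow.transformers.log_model(). You can also specify the task type explicitly by passing the task parameter. The full list of supported task types is available in the Transformers documentation, but note that MLflow does not support every task type.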

import mlflow
import transformers

pipeline = transformers.pipeline("text-generation", model="gpt2")

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=pipeline,
        artifact_path="model",
        save_pretrained=False,
    )

print(f"Inferred task: {model_info.flavors['transformers']['task']}")
# >> Inferred task: text-generation

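Advanced Tasks for OpenAI-Compatible Inference

In addition to the native Transformers task types, MLflow defines several extra task types. These advanced task types let you extend a Transformers pipeline with an OpenAI-compatible inference interface, so that the model can be served for specific use cases.

For example, the Transformers text-generation pipeline takes and returns a single string or a list of strings. When serving a model, however, a more structured input and output format is often required. In a chat-style application, for instance, the input may be a list of messages.

To support these use cases, MLflow defines a set of advanced task types prefixed with llm/v1:

"llm/v1/chat" for chat-style applications
"llm/v1/completions" for generic completions
"llm/v1/embeddings" for generating text embeddings

The only step required to use these advanced task types is to set the task parameter to an llm/v1 task when logging the model.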

import mlflow

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=pipeline,
        artifact_path="model",
        task="llm/v1/chat",  # <= Specify the llm/v1 task type
        # Optional, recommended for large models to avoid creating a local copy of the model weights
        save_pretrained=False,
    )

Note

This feature is available only in MLflow 2.11.0 and later. In addition, the llm/v1/chat task type is available only for models saved with transformers >= 4.34.0.
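The version constraints in this note can be checked before logging a model. The following is a minimal sketch and not part of MLflow: `parse_version` and `llm_v1_task_supported` are hypothetical helpers, and the parsing deliberately ignores pre-release suffixes, so it treats 2.11.0rc1 the same as 2.11.0.

```python
import re


def parse_version(version):
    """Return the leading numeric release as a 3-tuple, e.g. "2.11.0rc1" -> (2, 11, 0)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # Missing minor or patch components are treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def llm_v1_task_supported(task, mlflow_version, transformers_version):
    """Hypothetical check mirroring the two constraints stated in the note."""
    # llm/v1 tasks require MLflow 2.11.0 or later.
    if parse_version(mlflow_version) < (2, 11, 0):
        return False
    # The chat task additionally requires transformers >= 4.34.0.
    if task == "llm/v1/chat" and parse_version(transformers_version) < (4, 34, 0):
        return False
    return True


print(llm_v1_task_supported("llm/v1/chat", "2.11.0", "4.33.2"))
print(llm_v1_task_supported("llm/v1/completions", "2.11.0", "4.33.2"))
```

In real code, `importlib.metadata.version("mlflow")` can supply the installed version string, and `packaging.version.Version` compares pre-releases more rigorously than this simplified parser.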

Input and Output Formats

Task	Supported pipeline	Input	Output
`llm/v1/chat`	`text-generation`	Chat API spec	Returns a Chat Completion object in JSON format.
`llm/v1/completions`	`text-generation`	Completions API spec	Returns a Completion object in JSON format.
`llm/v1/embeddings`	`feature-extraction`	Embeddings API spec	Returns a list of Embedding objects. The model also returns a `usage` field containing the number of tokens used to generate the embeddings.

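Note

The Completions API is considered legacy, but MLflow still supports it for backward compatibility. We recommend the Chat API for compatibility with the latest APIs from OpenAI and other model providers.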
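To make the table concrete, the sketch below writes out one minimal request body per task and records which native pipeline each task expects. The field names (`messages`, `prompt`, `input`, `max_tokens`) follow the OpenAI API specs referenced in the table. The dictionary names and the `check_pipeline` helper are hypothetical and not part of MLflow.

```python
# Native pipeline type each llm/v1 task expects, taken from the table above.
SUPPORTED_PIPELINE = {
    "llm/v1/chat": "text-generation",
    "llm/v1/completions": "text-generation",
    "llm/v1/embeddings": "feature-extraction",
}

# One minimal request body per task, using OpenAI-style field names.
EXAMPLE_REQUESTS = {
    "llm/v1/chat": {
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 50,
    },
    "llm/v1/completions": {"prompt": "Hello, how are", "max_tokens": 50},
    "llm/v1/embeddings": {"input": ["Hello, how are you?"]},
}


def check_pipeline(inference_task, native_task):
    """Raise ValueError if the native pipeline cannot back the llm/v1 task."""
    expected = SUPPORTED_PIPELINE.get(inference_task)
    if expected is None:
        raise ValueError(f"unknown inference task: {inference_task}")
    if native_task != expected:
        raise ValueError(
            f"{inference_task} expects a {expected} pipeline, got {native_task}"
        )


# A text-generation pipeline is a valid backend for the chat task.
check_pipeline("llm/v1/chat", "text-generation")
```

Running a check like this before logging surfaces a mismatched pipeline early, for example a `feature-extraction` pipeline paired with `llm/v1/chat`.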

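Code Example of Using `llm/v1` Tasks

The following code snippet demonstrates how to log a Transformers pipeline with the llm/v1/chat task type and use the model for chat-style inference. See the notebook tutorial for more hands-on examples.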

import mlflow
import transformers

pipeline = transformers.pipeline("text-generation", "gpt2")

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=pipeline,
        artifact_path="model",
        task="llm/v1/chat",
        input_example={
            "messages": [
                {"role": "system", "content": "You are a bot."},
                {"role": "user", "content": "Hello, how are you?"},
            ]
        },
        save_pretrained=False,
    )

# Model metadata logs additional field "inference_task"
print(model_info.flavors["transformers"]["inference_task"])
# >> llm/v1/chat

# The original native task type is also saved
print(model_info.flavors["transformers"]["task"])
# >> text-generation

# Model signature is set to the chat API spec
print(model_info.signature)
# >> inputs:
# >>   ['messages': Array({content: string (required), name: string (optional), role: string (required)}) (required), 'temperature': double (optional), 'max_tokens': long (optional), 'stop': Array(string) (optional), 'n': long (optional), 'stream': boolean (optional)]
# >> outputs:
# >>   ['id': string (required), 'object': string (required), 'created': long (required), 'model': string (required), 'choices': Array({finish_reason: string (required), index: long (required), message: {content: string (required), name: string (optional), role: string (required)} (required)}) (required), 'usage': {completion_tokens: long (required), prompt_tokens: long (required), total_tokens: long (required)} (required)]
# >> params:
# >>     None

# The model can be served with the OpenAI-compatible inference API
pyfunc_model = mlflow.pyfunc.load_model(model_info.model_uri)
prediction = pyfunc_model.predict(
    {
        "messages": [
            {"role": "system", "content": "You are a bot."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
        "temperature": 0.5,
        "max_tokens": 200,
    }
)
print(prediction)
# >> [{'choices': [{'finish_reason': 'stop',
# >>               'index': 0,
# >>               'message': {'content': "I'm doing well, thank you for asking.", 'role': 'assistant'}}],
# >>   'created': 1719875820,
# >>   'id': '355c4e9e-040b-46b0-bf22-00e93486100c',
# >>   'model': 'gpt2',
# >>   'object': 'chat.completion',
# >>   'usage': {'completion_tokens': 7, 'prompt_tokens': 13, 'total_tokens': 20}}]

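Note that the input and output modifications apply only when the model is loaded with mlflow.pyfunc.load_model() (for example, when serving the model with the mlflow models serve CLI tool). If you only want to load the original pipeline, use mlflow.transformers.load_model().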
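Because the pyfunc output follows the Chat Completion structure, client code can read the reply the same way it would from an OpenAI-style response. Here is a minimal sketch that uses the prediction printed above as sample data. `first_reply` is a hypothetical helper, not an MLflow function.

```python
# Sample data copied from the printed prediction in the example above.
prediction = [
    {
        "choices": [
            {
                "finish_reason": "stop",
                "index": 0,
                "message": {
                    "content": "I'm doing well, thank you for asking.",
                    "role": "assistant",
                },
            }
        ],
        "created": 1719875820,
        "id": "355c4e9e-040b-46b0-bf22-00e93486100c",
        "model": "gpt2",
        "object": "chat.completion",
        "usage": {"completion_tokens": 7, "prompt_tokens": 13, "total_tokens": 20},
    }
]


def first_reply(prediction):
    """Return (content, finish_reason) of the first choice in the first completion."""
    completion = prediction[0]  # predict() returned a list with one completion
    choice = completion["choices"][0]
    return choice["message"]["content"], choice["finish_reason"]


content, finish_reason = first_reply(prediction)
usage = prediction[0]["usage"]
print(content)
print(finish_reason, usage["total_tokens"])
```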

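Provisioned Throughput on Databricks Model Serving

Provisioned Throughput on Databricks Model Serving is a capability that optimizes inference performance for foundation models and comes with performance guarantees. To serve a Transformers model with provisioned throughput, specify an llm/v1/xxx task type when logging the model. MLflow records the metadata required to enable provisioned throughput on Databricks Model Serving.

Tip

When logging large models, you can use save_pretrained=False to avoid creating a local copy of the model weights, which saves time and disk space. See the documentation for more details.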

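FAQ

How to override the default query parameters for OpenAI-compatible inference?

When serving a model saved with an llm/v1 task type, MLflow uses the same defaults as the OpenAI API for parameters such as temperature and stop. You can override them either by passing values at inference time or by setting different defaults when logging the model.

At inference time: pass the parameters as part of the input dictionary when calling predict(), the same way you pass the input messages.
When logging the model: override the default values by saving a model_config parameter when logging the model.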

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=pipeline,
        artifact_path="model",
        task="llm/v1/chat",
        model_config={
            "temperature": 0.5,  # <= Set the default temperature
            "stop": ["foo", "bar"],  # <= Set the default stop sequence
        },
        save_pretrained=False,
    )
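The two override points can be pictured as layers. The sketch below illustrates the intended precedence and is not MLflow's implementation: it assumes that a value passed at inference time wins over a `model_config` default, which in turn wins over the built-in default. `BUILTIN_DEFAULTS` and `resolve_params` are hypothetical names, and the built-in values shown (temperature 1.0, no stop sequence) are assumed to match the OpenAI API defaults.

```python
# Assumed built-in defaults, mirroring the OpenAI API (an assumption, see above).
BUILTIN_DEFAULTS = {"temperature": 1.0, "stop": None}


def resolve_params(request, model_config=None):
    """Layer request values over model_config defaults over built-in defaults."""
    params = dict(BUILTIN_DEFAULTS)
    # Defaults saved at logging time replace the built-in ones.
    params.update(model_config or {})
    # Values in the request win; "messages" is the input itself, not a parameter.
    overrides = {k: v for k, v in request.items() if k != "messages" and v is not None}
    params.update(overrides)
    return params


logged_config = {"temperature": 0.5, "stop": ["foo", "bar"]}
messages = [{"role": "user", "content": "Hello, how are you?"}]

# No parameters in the request: the logged defaults apply.
print(resolve_params({"messages": messages}, logged_config))

# A request-level temperature overrides the logged default; stop is unchanged.
print(resolve_params({"messages": messages, "temperature": 0.2}, logged_config))
```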

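Note

The stop parameter can be used to specify stop sequences for the llm/v1/chat and llm/v1/completions tasks. MLflow emulates the behavior of the stop parameter in the OpenAI API by passing the token IDs of the given stop sequences to the Transformers pipeline as stopping_criteria. However, this behavior may be unreliable because a tokenizer does not always produce the same token IDs for the same sequence in different sentences, especially a sentence-piece based tokenizer.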
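The instability described in this note can be reproduced with a toy example. The tokenizer below is a hypothetical stand-in, not a real SentencePiece model: it uses a four-entry vocabulary and greedy longest-match, with "\u2581" marking a word boundary. It shows only that matching on token IDs can miss a stop sequence that is plainly present in the text.

```python
# Toy vocabulary; "\u2581" marks the start of a word (assumption: mimics the
# word-boundary convention of sentence-piece style tokenizers).
VOCAB = {"\u2581foo": 1, "foo": 2, "\u2581bar": 3, "bar": 4}


def toy_encode(text):
    """Greedy longest-match encoding over the toy vocabulary."""
    marked = "\u2581" + text.replace(" ", "\u2581")
    ids, pos = [], 0
    while pos < len(marked):
        for end in range(len(marked), pos, -1):
            piece = marked[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token for {marked[pos:]!r}")
    return ids


def contains(sequence, pattern):
    """Return True if pattern occurs as a contiguous run inside sequence."""
    n = len(pattern)
    return any(sequence[i:i + n] == pattern for i in range(len(sequence) - n + 1))


stop_ids = toy_encode("foo")          # stop sequence encoded on its own
spaced_ids = toy_encode("bar foo")    # "foo" starts a new word
joined_ids = toy_encode("barfoo")     # "foo" appears mid-word

print(stop_ids, spaced_ids, joined_ids)
print(contains(spaced_ids, stop_ids), contains(joined_ids, stop_ids))
```

In this toy setup the string "foo" appears in both generated texts, but the ID-based match succeeds only when "foo" begins a new word, because the mid-word occurrence maps to a different token ID.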
