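Evaluate a Hugging Face LLM with `mlflow.evaluate()`

This guide shows how to load a pre-trained Hugging Face pipeline, log it to MLflow, and use mlflow.evaluate() to evaluate the model with built-in metrics as well as custom LLM-judged metrics.

For detailed information, please read the documentation on using MLflow evaluate.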

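Start MLflow Server

You can either:

Start a local tracking server by running mlflow ui within the same directory that your notebook is in.
Use a tracking server, as described in this overview.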
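
One possible way to point the notebook at the server is the MLFLOW_TRACKING_URI environment variable, set before any run is started. The sketch below assumes mlflow ui is running on its default address (http://127.0.0.1:5000); the helper name point_mlflow_at is hypothetical and used only for illustration.

```python
import os

# Assumption: `mlflow ui` was started with its default host and port.
DEFAULT_TRACKING_URI = "http://127.0.0.1:5000"


def point_mlflow_at(uri: str = DEFAULT_TRACKING_URI) -> str:
    """Set MLFLOW_TRACKING_URI unless the environment already defines it.

    Returns the URI that is in effect afterwards.
    """
    return os.environ.setdefault("MLFLOW_TRACKING_URI", uri)


tracking_uri = point_mlflow_at()
print(f"MLflow will log to: {tracking_uri}")
```

Using setdefault keeps any URI that was already exported in the shell, so a remote tracking server configured outside the notebook is not overridden.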

Install necessary dependencies

%pip install -q mlflow transformers torch torchvision evaluate datasets openai tiktoken fastapi rouge_score textstat

# Necessary imports
import warnings

import pandas as pd
from datasets import load_dataset
from transformers import pipeline

import mlflow
from mlflow.metrics.genai import EvaluationExample, answer_correctness, make_genai_metric

# Disable FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

Load a pretrained Hugging Face pipeline

Here we load a text generation pipeline, but you can also use a text summarization or a question answering pipeline.

mpt_pipeline = pipeline("text-generation", model="mosaicml/mpt-7b-chat")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

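Log the model using MLflow

We log our pipeline as an MLflow Model, which follows a standard format that lets you save a model in different "flavors" that can be understood by different downstream tools. In this case, the model is of the transformers "flavor".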

mlflow.set_experiment("Evaluate Hugging Face Text Pipeline")

# Define the signature
signature = mlflow.models.infer_signature(
  model_input="What are the three primary colors?",
  model_output="The three primary colors are red, yellow, and blue.",
)

# Log the model using mlflow
with mlflow.start_run():
  model_info = mlflow.transformers.log_model(
      transformers_model=mpt_pipeline,
      name="mpt-7b",
      signature=signature,
      registered_model_name="mpt-7b-chat",
  )

Successfully registered model 'mpt-7b-chat'.
Created version '1' of model 'mpt-7b-chat'.

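Load Evaluation Data

Load a dataset from the Hugging Face Hub to use for evaluation.

The data fields in the dataset below represent:

instruction: Describes the task that the model should perform. Each row in the dataset is a unique instruction (task) to be performed.
input: Optional contextual information related to the task defined in the instruction field. For example, for the instruction "Identify the odd one out", the input context is the list of items to select from, "Twitter, Instagram, Telegram".
output: The answer to the instruction (with the optional input context) as generated by the original evaluation model (text-davinci-003 from OpenAI).
text: The full text that results from applying the instruction, input, and output to the prompt template; this is the text sent to the model for fine-tuning.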
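
To make the relationship between the four fields concrete, here is a minimal sketch of how a text value could be assembled from the other three. The template wording is an assumption that approximates the Alpaca prompt format; the dataset card is the authoritative source, and the names build_text and context are hypothetical.

```python
# Assumption: these templates approximate the Alpaca prompt format.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{context}\n\n"
    "### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def build_text(instruction: str, context: str = "", output: str = "") -> str:
    """Fill the template, using the longer variant only when context is present."""
    template = PROMPT_WITH_INPUT if context else PROMPT_NO_INPUT
    return template.format(instruction=instruction, context=context, output=output)


print(build_text("Identify the odd one out.", "Twitter, Instagram, Telegram", "Telegram"))
```

Rows with an empty input fall back to the shorter template, which is why most text values in the dataset carry no input section.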

dataset = load_dataset("tatsu-lab/alpaca")
eval_df = pd.DataFrame(dataset["train"])
eval_df.head(10)

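	instruction	input	output	text
0	Give three tips for staying healthy.		1. Eat a balanced diet and make sure to include...	Below is an instruction that describes a task...
1	What are the three primary colors?		The three primary colors are red, blue, and yellow...	Below is an instruction that describes a task...
2	Describe the structure of an atom.		An atom is made up of a nucleus, which contains...	Below is an instruction that describes a task...
3	How can we reduce air pollution?		There are many ways to reduce air pollution...	Below is an instruction that describes a task...
4	Describe a time when you had to make a difficult decision...		I had to make a difficult decision when I...	Below is an instruction that describes a task...
5	Identify the odd one out.	Twitter, Instagram, Telegram	Telegram	Below is an instruction that describes a task...
6	Explain why the following fraction is equivalent...	4/16	The fraction 4/16 is equivalent to 1/4 because...	Below is an instruction that describes a task...
7	Write a short story in third person narration...		John was at a crossroads in his life. He had just...	Below is an instruction that describes a task...
8	Render a 3D model of a house		<nooutput> This type of instruction cannot be...	Below is an instruction that describes a task...
9	Evaluate this sentence for spelling and grammar...	He finished his meal and left the restaurant	He finished his meal and left the restaurant.	Below is an instruction that describes a task...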

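Define Metrics

Since we are evaluating how well our model answers a given instruction, we may want to choose some metrics that measure this, in addition to the built-in metrics that mlflow.evaluate() provides.

Let's measure how well our model does on the following two metrics:

Is the answer correct? Let's use the predefined metric answer_correctness here.
Is the answer fluent, clear, and concise? We will define a custom metric answer_quality to measure this.

We will need to pass both of these into the extra_metrics argument of mlflow.evaluate() in order to assess the quality of our model.

What is an Evaluation Metric?

An evaluation metric encapsulates any quantitative or qualitative measure you want to calculate for your model. For each model type, mlflow.evaluate() automatically calculates a set of built-in metrics; refer to the MLflow documentation for which built-in metrics are calculated for each model type. You can also pass in any other metrics you want to calculate as extra metrics. MLflow provides a set of predefined metrics, listed in its documentation, or you can define your own custom metrics. In this example, we use a combination of the predefined metric mlflow.metrics.genai.answer_correctness and a custom metric for the quality evaluation.

Let's load our predefined metric. In this case, we are using answer_correctness with GPT-4.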

answer_correctness_metric = answer_correctness(model="openai:/gpt-4")

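Now we want to create a custom LLM-judged metric named answer_quality using make_genai_metric(). We need to provide a metric definition and a grading rubric, as well as some examples for the LLM judge to use.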

# The definition explains what "answer quality" entails
answer_quality_definition = """Please evaluate answer quality for the provided output on the following criteria:
fluency, clarity, and conciseness. Each of the criteria is defined as follows:
- Fluency measures how naturally and smoothly the output reads.
- Clarity measures how understandable the output is.
- Conciseness measures the brevity and efficiency of the output without compromising meaning.
The more fluent, clear, and concise a text, the higher the score it deserves.
"""

# The grading prompt explains what each possible score means
answer_quality_grading_prompt = """Answer quality: Below are the details for different scores:
- Score 1: The output is entirely incomprehensible and cannot be read.
- Score 2: The output conveys some meaning, but needs lots of improvement in fluency, clarity, and conciseness.
- Score 3: The output is understandable but still needs improvement.
- Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.
- Score 5: The output reads smoothly, is easy to understand, and clear. There is no clear way to improve the output on these criteria.
"""

# We provide an example of a "bad" output
example1 = EvaluationExample(
  input="What is MLflow?",
  output="MLflow is an open-source platform. For managing machine learning workflows, it "
  "including experiment tracking model packaging versioning and deployment as well as a platform "
  "simplifying for on the ML lifecycle.",
  score=2,
  justification="The output is difficult to understand and demonstrates extremely low clarity. "
  "However, it still conveys some meaning so this output deserves a score of 2.",
)

# We also provide an example of a "good" output
example2 = EvaluationExample(
  input="What is MLflow?",
  output="MLflow is an open-source platform for managing machine learning workflows, including "
  "experiment tracking, model packaging, versioning, and deployment.",
  score=5,
  justification="The output is easily understandable, clear, and concise. It deserves a score of 5.",
)

answer_quality_metric = make_genai_metric(
  name="answer_quality",
  definition=answer_quality_definition,
  grading_prompt=answer_quality_grading_prompt,
  version="v1",
  examples=[example1, example2],
  model="openai:/gpt-4",
  greater_is_better=True,
)

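Evaluate

We need to set our OpenAI API key, since we are using GPT-4 for our LLM-judged metrics.

To set your private key safely, either export it through a command-line terminal for your current instance, or, to add it permanently to all user-based sessions, configure your preferred environment management configuration file (e.g., .bashrc, .zshrc) to include the following entry:

OPENAI_API_KEY=<your openai API key>

Now we can call mlflow.evaluate(). As a test, let's use the first 10 rows of the data. With the "text" model type, toxicity and readability metrics are calculated as built-in metrics. We also pass the two metrics defined above into the extra_metrics parameter to be evaluated.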
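
Because the LLM-judged metrics call the OpenAI API during evaluation, it can help to fail fast when the key is missing rather than discover it partway through a long run. The following is a minimal sketch; the helper name require_env is hypothetical and not part of MLflow.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it in your shell before running the evaluation."
        )
    return value


# Report the status without ever printing the key itself.
try:
    require_env("OPENAI_API_KEY")
    print("OPENAI_API_KEY found")
except RuntimeError as err:
    print(err)
```

Reading the key from the environment, rather than pasting it into a notebook cell, keeps it out of saved notebook outputs and version control.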

with mlflow.start_run():
  results = mlflow.evaluate(
      model_info.model_uri,
      eval_df.head(10),
      evaluators="default",
      model_type="text",
      targets="output",
      extra_metrics=[answer_correctness_metric, answer_quality_metric],
      evaluator_config={"col_mapping": {"inputs": "instruction"}},
  )

Downloading artifacts:   0%|          | 0/79 [00:00<?, ?it/s]

2023/12/28 11:57:30 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false

Loading checkpoint shards:   0%|          | 0/66 [00:00<?, ?it/s]

2023/12/28 12:00:25 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/12/28 12:00:25 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/12/28 12:02:23 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_correctness

  0%|          | 0/10 [00:00<?, ?it/s]

2023/12/28 12:02:53 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_quality

  0%|          | 0/10 [00:00<?, ?it/s]

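View results

results.metrics is a dictionary with the aggregate values for all the metrics calculated. Refer to the MLflow documentation for details on the built-in metrics for each model type.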

results.metrics

{'toxicity/v1/mean': 0.00809656630299287,
'toxicity/v1/variance': 0.0004603014839856817,
'toxicity/v1/p90': 0.010559113975614286,
'toxicity/v1/ratio': 0.0,
'flesch_kincaid_grade_level/v1/mean': 4.9,
'flesch_kincaid_grade_level/v1/variance': 6.3500000000000005,
'flesch_kincaid_grade_level/v1/p90': 6.829999999999998,
'ari_grade_level/v1/mean': 4.1899999999999995,
'ari_grade_level/v1/variance': 16.6329,
'ari_grade_level/v1/p90': 7.949999999999998,
'answer_correctness/v1/mean': 1.5,
'answer_correctness/v1/variance': 1.45,
'answer_correctness/v1/p90': 2.299999999999999,
'answer_quality/v1/mean': 2.4,
'answer_quality/v1/variance': 1.44,
'answer_quality/v1/p90': 4.1}
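
To see where these aggregates come from, the sketch below recomputes the answer_correctness values from the ten per-row scores reported in eval_results_table. It assumes population variance and a 90th percentile with linear interpolation; those assumptions match the numbers above, but this is an illustration, not MLflow's own implementation.

```python
import math
from statistics import fmean, pvariance

# Per-row answer_correctness scores for the 10 evaluated rows,
# as reported in eval_results_table.
scores = [2, 5, 1, 1, 1, 1, 1, 1, 1, 1]


def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def aggregate(values):
    """Mean, population variance, and 90th percentile of a list of scores."""
    return {
        "mean": fmean(values),
        "variance": pvariance(values),
        "p90": percentile(values, 90),
    }


summary = aggregate(scores)
print(summary)
```

The mean of 1.5 is driven almost entirely by two rows; eight of the ten answers received the lowest correctness score, which the p90 value alone would not reveal.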

We can also view the eval_results_table, which shows us the metrics for each row of data.

results.tables["eval_results_table"]

Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

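	instruction	input	text	output	outputs	token_count	toxicity/v1/score	flesch_kincaid_grade_level/v1/score	ari_grade_level/v1/score	answer_correctness/v1/score	answer_correctness/v1/justification	answer_quality/v1/score	answer_quality/v1/justification
0	Give three tips for staying healthy.		Below is an instruction that describes a task...	1. Eat a balanced diet and make sure to include...	Give three tips for staying healthy. 1. Eat...	19	0.000446	4.1	4.0	2	The output provided by the model only includes...	3	The output is understandable and fluent, but it...
1	What are the three primary colors?		Below is an instruction that describes a task...	The three primary colors are red, blue, and yellow...	What are the three primary colors? The three...	19	0.000217	5.0	4.9	5	The output provided by the model is completely...	5	The model's output is fluent, clear, and concise...
2	Describe the structure of an atom.		Below is an instruction that describes a task...	An atom is made up of a nucleus, which contains...	Describe the structure of an atom. An atom is...	18	0.000139	3.1	2.2	1	The output provided by the model is incomplete...	2	The output is incomplete and lacks clarity, making...
3	How can we reduce air pollution?		Below is an instruction that describes a task...	There are many ways to reduce air pollution...	How can we reduce air pollution? There are many...	18	0.000140	5.0	5.5	1	The output provided by the model is completely...	1	The output is entirely incomprehensible and cannot...
4	Describe a time when you had to make a difficult decision...		Below is an instruction that describes a task...	I had to make a difficult decision when I...	Describe a time when you had to make a difficult decision...	18	0.000159	5.2	2.9	1	The output provided by the model is completely...	2	The output is incomplete and lacks clarity, making...
5	Identify the odd one out.	Twitter, Instagram, Telegram	Below is an instruction that describes a task...	Telegram	Identify the odd one out. 1. Car 2. Train...	18	0.072345	0.1	-5.4	1	The output provided by the model is completely...	2	The output is unclear and lacks fluency. ...
6	Explain why the following fraction is equivalent...	4/16	Below is an instruction that describes a task...	The fraction 4/16 is equivalent to 1/4 because...	Explain why the following fraction is equivalent...	23	0.000320	6.4	7.6	1	The output provided by the model is completely...	2	The output is unclear and does not answer...
7	Write a short story in third person narration...		Below is an instruction that describes a task...	John was at a crossroads in his life. He had just...	Write a short story in third person narration...	20	0.000247	10.7	11.1	1	The output provided by the model is completely...	1	The output is exactly the same as the input, ...
8	Render a 3D model of a house		Below is an instruction that describes a task...	<nooutput> This type of instruction cannot be...	Render a 3D model of a house in Blender - Blen...	19	0.003694	5.2	2.7	1	The output provided by the model is completely...	2	The output is partially understandable but lacks...
9	Evaluate this sentence for spelling and grammar...	He finished his meal and left the restaurant	Below is an instruction that describes a task...	He finished his meal and left the restaurant.	Evaluate this sentence for spelling and grammar...	18	0.003260	4.2	6.4	1	The output provided by the model is completely...	4	The output is fluent and clear, but is not...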
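
A per-row table like this is most useful for spotting rows where the two judged metrics disagree, for example an answer that reads well but is wrong. The sketch below works on plain tuples copied from the table above so that it runs without MLflow; the function name fluent_but_wrong and the gap threshold are hypothetical choices for illustration. With the real DataFrame, the same filter could be written with boolean indexing on the two score columns.

```python
# (row index, answer_correctness score, answer_quality score), copied from the table above.
rows = [
    (0, 2, 3), (1, 5, 5), (2, 1, 2), (3, 1, 1), (4, 1, 2),
    (5, 1, 2), (6, 1, 2), (7, 1, 1), (8, 1, 2), (9, 1, 4),
]


def fluent_but_wrong(records, gap=2):
    """Return records whose quality score exceeds the correctness score by at least `gap`."""
    return [r for r in records if r[2] - r[1] >= gap]


for index, correctness, quality in fluent_but_wrong(rows):
    print(f"row {index}: correctness={correctness}, quality={quality}")
```

Rows flagged this way are good candidates for reading the justification columns, since the judge found the text readable but not a correct answer to the instruction.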

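View results in UI

Finally, we can view our evaluation results in the MLflow UI. We can select our experiment on the left sidebar, which brings us to the following page. We can see that one run logged our model "mpt-7b-chat", and the other run contains the dataset we evaluated.

Evaluation Main

We click on the Evaluation tab and hide any irrelevant runs.

Evaluation Filtering

We can now choose which columns we want to group by, as well as which column we want to compare. In the following example, we are looking at the answer correctness score for each input-output pair, but we could choose any other metric to compare.

Evaluation Selection

Finally, we get to the following view, where we can see the justification and score for answer correctness for each row.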

Evaluation Comparison
