Example Notebook for LLM Evaluation with MLflow
In this notebook, we will demonstrate how to evaluate various LLMs and RAG systems with MLflow, using simple metrics such as toxicity, LLM-judged metrics such as relevance, and even custom LLM-judged metrics such as professionalism.
We need to set our OpenAI API key, since we will be using GPT-4 for the LLM-judged metrics.
In order to set your private key safely, please be sure to either export your key through the command-line terminal of your current instance, or, for a permanent addition to all user-based sessions, configure your preferred environment management configuration file (i.e., .bashrc, .zshrc) to contain the following entry:
OPENAI_API_KEY=<your openai API key>
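If you would like to confirm that the key is visible to this notebook session before running the cells below, a quick check such as the following can be used (this snippet is an illustrative addition, not part of the original notebook):
import os

# Fail fast if the key is missing; the LLM-judged metrics below will call the OpenAI API.
assert "OPENAI_API_KEY" in os.environ, "Please set OPENAI_API_KEY before running this notebook."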
import openai
import pandas as pd
import mlflow
Basic Question-Answering Evaluation
Create a test case of inputs that will be passed into the model and ground_truth that will be used to compare against the output generated by the model.
eval_df = pd.DataFrame(
    {
        "inputs": [
            "How does useEffect() work?",
            "What does the static keyword in a function mean?",
            "What does the 'finally' block in Python do?",
            "What is the difference between multiprocessing and multithreading?",
        ],
        "ground_truth": [
            "The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.",
            "Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.",
            "'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.",
            "Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.",
        ],
    }
)
Create a simple OpenAI model that asks gpt-4o-mini to answer the question in two sentences. Call mlflow.evaluate() with the model and evaluation dataframe.
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )
results.metrics
2023/10/27 00:56:56 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:56:56 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:06 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
{'toxicity/v1/mean': 0.00020573455913108774, 'toxicity/v1/variance': 3.4433758978645428e-09, 'toxicity/v1/p90': 0.00027067282790085303, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 15.149999999999999, 'flesch_kincaid_grade_level/v1/variance': 26.502499999999998, 'flesch_kincaid_grade_level/v1/p90': 20.85, 'ari_grade_level/v1/mean': 17.375, 'ari_grade_level/v1/variance': 42.92187499999999, 'ari_grade_level/v1/p90': 24.48, 'exact_match/v1': 0.0}
Inspect the evaluation results table as a dataframe to see row-by-row metrics and further assess model performance.
results.tables["eval_results_table"]
|   | inputs | ground_truth | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score |
|---|---|---|---|---|---|---|---|
| 0 | How does useEffect() work? | The useEffect() hook tells React that your component... | useEffect() is a React hook that allows you to... | 64 | 0.000243 | 14.2 | 15.8 |
| 1 | What does the static keyword in a function mean? | Static members belongs to the class, rather than... | The static keyword in a function means... | 32 | 0.000150 | 12.6 | 14.9 |
| 2 | What does the 'finally' block in Python do? | 'Finally' defines a block of code to run when the... | The 'finally' block in Python is used to specify... | 46 | 0.000283 | 10.1 | 10.6 |
| 3 | What is the difference between multiprocessing and... | Multithreading refers to the ability of a processor... | The main difference between multiprocessing and multithreading is... | 34 | 0.000148 | 23.7 | 28.2 |
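The per-row results are also logged to the MLflow run as a table artifact, so they can be reloaded later without keeping the in-memory results object around. A minimal sketch, assuming an MLflow version that provides mlflow.load_table and that the table keeps its default artifact name eval_results_table.json:
# Reload the evaluation table that was logged to the run above.
eval_table = mlflow.load_table(
    "eval_results_table.json",
    run_ids=[run.info.run_id],
)
eval_table.head()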
LLM-Judged Correctness with OpenAI GPT-4
Construct an answer similarity metric using the answer_similarity() metric factory function.
from mlflow.metrics.genai import EvaluationExample, answer_similarity
# Create an example to describe what answer_similarity means like for this problem.
example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "targets": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)
# Construct the metric using OpenAI GPT-4 as the judge
answer_similarity_metric = answer_similarity(model="openai:/gpt-4", examples=[example])
print(answer_similarity_metric)
EvaluationMetric(name=answer_similarity, greater_is_better=True, long_name=answer_similarity, version=v1, metric_details=
Task: You are an impartial judge. You will be given an input that was sent to a machine learning model, and you will be given an output that the model produced. You may also be given additional information that was used by the model to generate the output.
Your task is to determine a numerical score called answer_similarity based on the input and output. A definition of answer_similarity and a grading rubric are provided below. You must use the grading rubric to determine your score. You must also justify your score.
Examples could be included below for reference. Make sure to use them as references and to understand them before completing the task.
Input: {input}
Output: {output}
{grading_context_columns}
Metric definition: Answer similarity is evaluated on the degree of semantic similarity of the provided output to the provided targets, which is the ground truth. Scores can be assigned based on the gradual similarity in meaning and description to the provided targets, where a higher score indicates greater alignment between the provided output and provided targets.
Grading rubric: Answer similarity: Below are the details for different scores:
- Score 1: the output has little to no semantic similarity to the provided targets.
- Score 2: the output displays partial semantic similarity to the provided targets on some aspects.
- Score 3: the output has moderate semantic similarity to the provided targets.
- Score 4: the output aligns with the provided targets in most aspects and has substantial semantic similarity.
- Score 5: the output closely aligns with the provided targets in all significant aspects.
Examples:
Input: What is MLflow?
Output: MLflow is an open-source platform for managing machine learning workflows, including experiment tracking, model packaging, versioning, and deployment, simplifying the ML lifecycle.
Additional information used by the model:
key: ground_truth
value: MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.
score: 4
justification: The definition effectively explains what MLflow is its purpose, and its developer. It could be more concise for a 5-score.
You must return the following fields in your response one below the other:
score: Your numerical score for the model's answer_similarity based on the rubric
justification: Your step-by-step reasoning about the model's answer_similarity score
)
Call mlflow.evaluate() again, but this time with your new answer_similarity_metric.
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[answer_similarity_metric],  # use the answer similarity metric created above
    )
results.metrics
2023/10/27 00:57:07 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:57:07 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/27 00:57:13 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_similarity
{'toxicity/v1/mean': 0.00023413174494635314, 'toxicity/v1/variance': 4.211776498455113e-09, 'toxicity/v1/p90': 0.00029628578631673007, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 14.774999999999999, 'flesch_kincaid_grade_level/v1/variance': 21.546875000000004, 'flesch_kincaid_grade_level/v1/p90': 19.71, 'ari_grade_level/v1/mean': 17.0, 'ari_grade_level/v1/variance': 41.005, 'ari_grade_level/v1/p90': 23.92, 'exact_match/v1': 0.0, 'answer_similarity/v1/mean': 3.75, 'answer_similarity/v1/variance': 1.1875, 'answer_similarity/v1/p90': 4.7}
View the row-by-row LLM-judged answer similarity scores and justifications.
results.tables["eval_results_table"]
|   | inputs | ground_truth | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | answer_similarity/v1/score | answer_similarity/v1/justification |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How does useEffect() work? | The useEffect() hook tells React that your component... | useEffect() is a React hook that allows you to... | 53 | 0.000299 | 12.1 | 12.1 | 4 | The output provided by the model aligns well with... |
| 1 | What does the static keyword in a function mean? | Static members belongs to the class, rather than... | In C/C++, the static keyword in a function means... | 55 | 0.000141 | 12.5 | 14.4 | 2 | The output provided by the model does correctly... |
| 2 | What does the 'finally' block in Python do? | 'Finally' defines a block of code to run when the... | The 'finally' block in Python is used to define... | 64 | 0.000290 | 11.7 | 13.5 | 5 | The output provided by the model aligns very closely... |
| 3 | What is the difference between multiprocessing and... | Multithreading refers to the ability of a processor... | Multiprocessing involves executing multiple processes... | 49 | 0.000207 | 22.8 | 28.0 | 4 | The output provided by the model aligns well with... |
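Because the results table is a regular pandas DataFrame, the judge's scores and justifications can be sliced like any other columns. For example, to pull out the rows the judge considered least similar to the ground truth (an illustrative snippet, not part of the original notebook):
eval_table = results.tables["eval_results_table"]

# Show the weakest LLM-judged answers together with the judge's reasoning.
low_similarity = eval_table[eval_table["answer_similarity/v1/score"] <= 3]
low_similarity[["inputs", "outputs", "answer_similarity/v1/score", "answer_similarity/v1/justification"]]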
Custom LLM-Judged Metric for Professionalism
Create a custom metric that will be used to determine the professionalism of the model outputs. Use make_genai_metric with a metric definition, grading prompt, grading example, and judge model configuration.
from mlflow.metrics.genai import EvaluationExample, make_genai_metric
professionalism_metric = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language"
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below "
        "are the details for different scores: "
        "- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts."
        "- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings."
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. "
        "- Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. "
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output=(
                "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!"
            ),
            score=2,
            justification=(
                "The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. "
            ),
        )
    ],
    version="v1",
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
print(professionalism_metric)
EvaluationMetric(name=professionalism, greater_is_better=True, long_name=professionalism, version=v1, metric_details=
Task: You are an impartial judge. You will be given an input that was sent to a machine learning model, and you will be given an output that the model produced. You may also be given additional information that was used by the model to generate the output.
Your task is to determine a numerical score called professionalism based on the input and output. A definition of professionalism and a grading rubric are provided below. You must use the grading rubric to determine your score. You must also justify your score.
Examples could be included below for reference. Make sure to use them as references and to understand them before completing the task.
Input: {input}
Output: {output}
{grading_context_columns}
Metric definition: Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language
Grading rubric: Professionalism: If the answer is written using a professional tone, below are the details for different scores:
- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts.
- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings.
- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts.
- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings.
- Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks.
Examples:
Input: What is MLflow?
Output: MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!
score: 2
justification: The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional.
You must return the following fields in your response one below the other:
score: Your numerical score for the model's professionalism based on the rubric
justification: Your step-by-step reasoning about the model's professionalism score
)
Call mlflow.evaluate with your new professionalism metric.
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[professionalism_metric],  # use the professionalism metric we created above
    )
print(results.metrics)
2023/10/27 00:57:20 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:57:20 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/27 00:57:24 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:24 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/27 00:57:25 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: professionalism
{'toxicity/v1/mean': 0.0002044261127593927, 'toxicity/v1/variance': 1.8580601275034412e-09, 'toxicity/v1/p90': 0.00025343164161313326, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 13.649999999999999, 'flesch_kincaid_grade_level/v1/variance': 33.927499999999995, 'flesch_kincaid_grade_level/v1/p90': 19.92, 'ari_grade_level/v1/mean': 16.25, 'ari_grade_level/v1/variance': 51.927499999999995, 'ari_grade_level/v1/p90': 23.900000000000002, 'professionalism/v1/mean': 4.0, 'professionalism/v1/variance': 0.0, 'professionalism/v1/p90': 4.0}
results.tables["eval_results_table"]
|   | inputs | ground_truth | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | professionalism/v1/score | professionalism/v1/justification |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How does useEffect() work? | The useEffect() hook tells React that your component... | useEffect() is a hook in React that allows you to... | 46 | 0.000218 | 11.1 | 12.7 | 4 | The language used in the output is formal and... |
| 1 | What does the static keyword in a function mean? | Static members belongs to the class, rather than... | The static keyword in a function means... | 48 | 0.000158 | 9.7 | 12.3 | 4 | The language used in the output is formal and... |
| 2 | What does the 'finally' block in Python do? | 'Finally' defines a block of code to run when the... | The 'finally' block in Python is used to define... | 45 | 0.000269 | 10.1 | 11.3 | 4 | The language used in the output is formal and... |
| 3 | What is the difference between multiprocessing and... | Multithreading refers to the ability of a processor... | Multiprocessing involves running multiple processes... | 33 | 0.000173 | 23.7 | 28.7 | 4 | The language used in the output is formal and... |
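A custom LLM-judged metric like this is not tied to a logged model: mlflow.evaluate can also score a static table of pre-computed outputs. A minimal sketch, assuming an MLflow version that supports evaluating a static dataset via the predictions argument; the example answers below are stand-ins for illustration, not outputs from the models above:
# Hypothetical pre-computed answers (e.g., exported from another system).
static_df = eval_df.copy()
static_df["outputs"] = [
    "useEffect() runs your effect function after React has committed the render.",
    "A static member belongs to the class itself rather than to any single instance.",
    "The 'finally' block always runs after try/except, whether or not an exception was raised.",
    "Multiprocessing uses separate processes, while multithreading runs several threads in one process.",
]

with mlflow.start_run():
    static_results = mlflow.evaluate(
        data=static_df,
        predictions="outputs",  # column holding the pre-computed answers
        model_type="question-answering",
        extra_metrics=[professionalism_metric],
    )
print(static_results.metrics)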
Let's see if we can improve basic_qa_model by changing the system prompt to create a new model that performs better.
Call mlflow.evaluate() using the new model. Observe that the professionalism score has increased!
with mlflow.start_run() as run:
    system_prompt = "Answer the following question using extreme formality."
    professional_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        professional_qa_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[professionalism_metric],
    )
print(results.metrics)
/Users/sunish.sheth/.local/lib/python3.8/site-packages/_distutils_hack/__init__.py:18: UserWarning: Distutils was imported before Setuptools, but importing Setuptools also replaces the `distutils` module in `sys.modules`. This may lead to undesirable behaviors or errors. To avoid these issues, avoid using distutils directly, ensure that setuptools is installed in the traditional way (e.g. not an editable install), and/or make sure that setuptools is always imported before distutils.
warnings.warn(
/Users/sunish.sheth/.local/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
2023/10/27 00:57:30 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/27 00:57:30 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/27 00:57:37 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/27 00:57:37 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/27 00:57:38 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: professionalism
{'toxicity/v1/mean': 0.00030383203556993976, 'toxicity/v1/variance': 9.482036560896618e-09, 'toxicity/v1/p90': 0.0003866828687023372, 'toxicity/v1/ratio': 0.0, 'flesch_kincaid_grade_level/v1/mean': 17.625, 'flesch_kincaid_grade_level/v1/variance': 2.9068750000000003, 'flesch_kincaid_grade_level/v1/p90': 19.54, 'ari_grade_level/v1/mean': 21.425, 'ari_grade_level/v1/variance': 3.6168750000000007, 'ari_grade_level/v1/p90': 23.6, 'professionalism/v1/mean': 4.5, 'professionalism/v1/variance': 0.25, 'professionalism/v1/p90': 5.0}
results.tables["eval_results_table"]
|   | inputs | ground_truth | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | professionalism/v1/score | professionalism/v1/justification |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How does useEffect() work? | The useEffect() hook tells React that your component... | Certainly, I shall elucidate the mechanism of... | 386 | 0.000398 | 16.3 | 19.7 | 5 | The response is written in an excessively formal... |
| 1 | What does the static keyword in a function mean? | Static members belongs to the class, rather than... | The static keyword, when used in the context of... | 73 | 0.000143 | 16.4 | 20.0 | 4 | The language used in the output is formal and... |
| 2 | What does the 'finally' block in Python do? | 'Finally' defines a block of code to run when the... | The 'finally' block in Python is a built-in... | 97 | 0.000313 | 20.5 | 24.5 | 4 | The language used in the output is formal and... |
| 3 | What is the difference between multiprocessing and... | Multithreading refers to the ability of a processor... | Permit me to expound upon the distinction between... | 324 | 0.000361 | 17.3 | 21.5 | 5 | The response is written in an excessively formal... |
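To confirm the improvement quantitatively, the two evaluation runs can be compared side by side in the MLflow UI or programmatically. A small sketch using mlflow.search_runs, assuming both runs were logged to the currently active experiment:
# List the most recent runs and compare their aggregated professionalism scores.
runs = mlflow.search_runs(order_by=["start_time DESC"], max_results=5)
runs[["run_id", "metrics.professionalism/v1/mean", "metrics.professionalism/v1/p90"]]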