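Evaluating Prompts
Combining the MLflow Prompt Registry with MLflow LLM Evaluation lets you evaluate prompt performance across different models and datasets and track the evaluation results in a centralized registry. You can also inspect model outputs in the traces logged during evaluation to understand how the model responds to different prompts.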
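Key Benefits of MLflow Prompt Evaluation
- Effective Evaluation: MLflow's LLM Evaluation API provides a simple, consistent way to evaluate prompts across different models and datasets without writing boilerplate code.
- Compare Results: Compare evaluation results side by side in the MLflow UI.
- Track Results: Track evaluation results in MLflow Experiments to keep a history of prompt performance and of the evaluation settings used.
- Tracing: Inspect model behavior during inference in depth through the traces generated during evaluation.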
Quickstart
1. Install Required Libraries
First, install MLflow and the OpenAI SDK. If you use a different LLM provider, install the corresponding SDK instead.
bash
pip install "mlflow>=2.21.0" openai -qU
Also set your OpenAI API key (or the key for any other LLM provider you use, such as Anthropic).
python
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
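When a script or notebook is rerun often, prompting for the key every time gets tedious. The following is a minimal sketch, assuming a hypothetical helper named ensure_api_key (it is not part of MLflow or the OpenAI SDK), that prompts only when the environment variable is not already set:

```python
import os
from getpass import getpass


def ensure_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in `var_name`, prompting only if it is missing.

    `ensure_api_key` is a hypothetical helper used for illustration.
    """
    key = os.environ.get(var_name)
    if not key:
        # Prompt without echoing the key to the terminal.
        key = getpass(f"Enter a value for {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling ensure_api_key() once before creating the client keeps the key out of the source file while avoiding repeated prompts within the same session.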
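2. Create a Prompt
You can create a prompt from either the MLflow UI or the Python API.

UI
- Run mlflow server in a terminal to start the MLflow UI.
- Navigate to the Prompts tab in the MLflow UI.
- Click the Create Prompt button.
- Fill in the prompt details, such as the name, the prompt template text, and an optional commit message.
- Click Create to register the prompt.

Python
To create a new prompt with the Python API, use the mlflow.genai.register_prompt() API.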
python
import mlflow
# Use double curly braces for variables in the template
initial_template = """\
Summarize content you are provided with in {{ num_sentences }} sentences.
Sentences: {{ sentences }}
"""
# Register a new prompt
prompt = mlflow.genai.register_prompt(
    name="summarization-prompt",
    template=initial_template,
    # Optional: Provide a commit message to describe the changes
    commit_message="Initial commit",
)
# The prompt object contains information about the registered prompt
print(f"Created prompt '{prompt.name}' (version {prompt.version})")
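To make the double-curly-brace syntax concrete, here is a rough pure-Python sketch of how such a template can be filled in. The render_template function is a hypothetical stand-in written for illustration; it is not MLflow's implementation, and MLflow's own prompt.format() may handle edge cases differently.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, **values: object) -> str:
    """Fill {{ name }} placeholders (hypothetical stand-in, not MLflow's code)."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"Missing value for template variable '{name}'")
        return str(values[name])

    return _VARIABLE.sub(substitute, template)


template = (
    "Summarize content you are provided with in {{ num_sentences }} sentences.\n"
    "Sentences: {{ sentences }}\n"
)
rendered = render_template(
    template, num_sentences=1, sentences="MLflow tracks experiments."
)
print(rendered)
```

Raising on a missing variable, rather than leaving the placeholder in the text, is a design choice of this sketch: it surfaces typos in variable names before a malformed prompt reaches the model.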
3. Prepare Evaluation Data
Below, we create a small summarization dataset for demonstration purposes.
python
import pandas as pd
eval_data = [
    {
        "inputs": {
            "sentences": "Artificial intelligence has transformed how businesses operate in the 21st century. Companies are leveraging AI for everything from customer service to supply chain optimization. The technology enables automation of routine tasks, freeing human workers for more creative endeavors. However, concerns about job displacement and ethical implications remain significant. Many experts argue that AI will ultimately create more jobs than it eliminates, though the transition may be challenging.",
        },
        "expectations": {
            "summary": "AI has revolutionized business operations through automation and optimization, though ethical concerns about job displacement persist alongside predictions that AI will ultimately create more employment opportunities than it eliminates.",
        },
    },
    {
        "inputs": {
            "sentences": "Climate change continues to affect ecosystems worldwide at an alarming rate. Rising global temperatures have led to more frequent extreme weather events including hurricanes, floods, and wildfires. Polar ice caps are melting faster than predicted, contributing to sea level rise that threatens coastal communities. Scientists warn that without immediate and dramatic reductions in greenhouse gas emissions, many of these changes may become irreversible. International cooperation remains essential but politically challenging.",
        },
        "expectations": {
            "summary": "Climate change is causing accelerating environmental damage through extreme weather events and melting ice caps, with scientists warning that without immediate reduction in greenhouse gas emissions, many changes may become irreversible.",
        },
    },
    {
        "inputs": {
            "sentences": "The human genome project was completed in 2003 after 13 years of international collaborative research. It successfully mapped all of the genes of the human genome, approximately 20,000-25,000 genes in total. The project cost nearly $3 billion but has enabled countless medical advances and spawned new fields like pharmacogenomics. The knowledge gained has dramatically improved our understanding of genetic diseases and opened pathways to personalized medicine. Today, a complete human genome can be sequenced in under a day for about $1,000.",
        },
        "expectations": {
            "summary": "The Human Genome Project, completed in 2003, mapped approximately 20,000-25,000 human genes at a cost of $3 billion, enabling medical advances, improving understanding of genetic diseases, and establishing the foundation for personalized medicine.",
        },
    },
    {
        "inputs": {
            "sentences": "Remote work adoption accelerated dramatically during the COVID-19 pandemic. Organizations that had previously resisted flexible work arrangements were forced to implement digital collaboration tools and virtual workflows. Many companies reported surprising productivity gains, though concerns about company culture and collaboration persisted. After the pandemic, a hybrid model emerged as the preferred approach for many businesses, combining in-office and remote work. This shift has profound implications for urban planning, commercial real estate, and work-life balance.",
        },
        "expectations": {
            "summary": "The COVID-19 pandemic forced widespread adoption of remote work, revealing unexpected productivity benefits despite collaboration challenges, and resulting in a hybrid work model that impacts urban planning, real estate, and work-life balance.",
        },
    },
]
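Before running an evaluation, it can help to confirm that every record follows the inputs/expectations layout used above. The following is a minimal sketch; validate_eval_data is a hypothetical helper written for this example and is not an MLflow API.

```python
def validate_eval_data(records: list[dict], required_inputs: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the data looks fine.

    Hypothetical helper for illustration, not part of MLflow.
    """
    problems = []
    for index, record in enumerate(records):
        inputs = record.get("inputs")
        if not isinstance(inputs, dict):
            problems.append(f"record {index}: 'inputs' must be a dict")
            continue
        # Every field the prediction function expects must be present.
        missing = required_inputs - set(inputs)
        if missing:
            problems.append(f"record {index}: missing input fields {sorted(missing)}")
        if not isinstance(record.get("expectations"), dict):
            problems.append(f"record {index}: 'expectations' must be a dict")
    return problems


sample = [
    {"inputs": {"sentences": "Some text."}, "expectations": {"summary": "Text."}},
    {"inputs": {}, "expectations": {"summary": "Text."}},
]
print(validate_eval_data(sample, {"sentences"}))
```

For the dataset in this guide, the equivalent check would be validate_eval_data(eval_data, {"sentences"}), which should return an empty list.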
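4. Define a Prediction Function
Define a function that takes the input fields of a single record and returns the model's prediction as a string.
MLflow passes the fields of each record's inputs dictionary (only sentences in this example) to the function as keyword arguments. The returned string is then scored against the record's expectations.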
python
import mlflow
import openai
def predict_fn(sentences: str) -> str:
    # Load the latest version of the registered prompt
    prompt = mlflow.genai.load_prompt("prompts:/summarization-prompt@latest")
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": prompt.format(sentences=sentences, num_sentences=1),
            }
        ],
    )
    return completion.choices[0].message.content
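The calling convention can be exercised offline before spending any API calls. The sketch below assumes that each record's inputs dictionary is unpacked into keyword arguments, which matches the predict_fn signature above. Both fake_predict_fn and run_offline are hypothetical names introduced for illustration; fake_predict_fn returns the first sentence instead of calling a model.

```python
def fake_predict_fn(sentences: str) -> str:
    """Hypothetical stand-in for predict_fn: return the first sentence, no LLM call."""
    return sentences.split(". ")[0].rstrip(".") + "."


def run_offline(predict, records: list[dict]) -> list[str]:
    """Call `predict` once per record, unpacking `inputs` as keyword arguments."""
    return [predict(**record["inputs"]) for record in records]


records = [
    {"inputs": {"sentences": "Remote work grew quickly. Many firms adapted."}},
    {"inputs": {"sentences": "Ice caps are melting."}},
]
outputs = run_offline(fake_predict_fn, records)
print(outputs)
```

If a key in inputs does not match a parameter name of the prediction function, this kind of dry run fails immediately with a TypeError, which is cheaper to discover here than partway through an evaluation.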
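5. Run the Evaluation
Run the mlflow.genai.evaluate() API to evaluate the model with the prepared data and prompt. This example scores the outputs with a custom LLM judge, answer_similarity, created with make_judge. The judge returns yes or no depending on whether the output is semantically similar to the expected answer.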
python
from typing import Literal
from mlflow.genai.judges import make_judge
answer_similarity = make_judge(
    name="answer_similarity",
    instructions=(
        "Evaluated on the degree of semantic similarity of the provided output to the expected answer.\n\n"
        "Output: {{ outputs }}\n\n"
        # The trailing "\n\n" keeps the expected answer separate from the final instruction
        "Expected: {{ expectations }}\n\n"
        "Return 'yes' if the output is similar to the expected answer, otherwise return 'no'."
    ),
    model="openai:/gpt-5-mini",
    feedback_value_type=Literal["yes", "no"],
)
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[answer_similarity],
)
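Because the judge returns a yes or no verdict per record, a natural summary number is the fraction of yes verdicts. The sketch below shows that arithmetic on a plain list of strings. It deliberately does not read from the object returned by mlflow.genai.evaluate(), since how verdicts are extracted depends on your MLflow version; pass_rate is a hypothetical helper and the verdict values are illustrative, not real results.

```python
from collections import Counter


def pass_rate(verdicts: list[str]) -> float:
    """Fraction of 'yes' verdicts; raises on values the judge should never return."""
    if not verdicts:
        raise ValueError("No verdicts to aggregate")
    counts = Counter(verdicts)
    unexpected = set(counts) - {"yes", "no"}
    if unexpected:
        raise ValueError(f"Unexpected verdicts: {sorted(unexpected)}")
    return counts["yes"] / len(verdicts)


verdicts = ["yes", "no", "yes", "yes"]  # illustrative values, not real results
print(f"answer_similarity pass rate: {pass_rate(verdicts):.2f}")
```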
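6. View the Results
You can view the evaluation results in the MLflow UI. Navigate to the Experiments tab, select the Evaluations tab, and click the evaluation run to see its results.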
