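Evaluating Prompts with MLflow

By combining the MLflow Prompt Registry with MLflow LLM Evaluation, you can evaluate prompt performance across different models and datasets and track the evaluation results in a centralized registry. You can also inspect model outputs in the traces logged during evaluation to understand how the model responds to different prompts.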

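Key benefits of MLflow prompt evaluation

Efficient evaluation: The mlflow.evaluate API provides a simple, consistent way to evaluate prompts across different models and datasets without writing boilerplate code.
Compare results: Easily compare evaluation results in the MLflow UI.
Track results: Track evaluation results in MLflow experiments to maintain a history of prompt performance and of the evaluation settings used.
Tracing: Inspect the model's inference behavior in depth through the traces generated during evaluation.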

Quickstart

1. Install required libraries

First, install MLflow and the OpenAI SDK. If you use a different LLM provider, install the corresponding SDK instead.

pip install "mlflow>=2.21.0" openai -qU

Also set your OpenAI API key (or the key for whichever LLM provider you use, such as Anthropic).

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
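When a script or notebook is rerun often, prompting for the key every time gets tedious. The following is a minimal sketch that prompts only when the variable is not already set. `ensure_env_secret` is a hypothetical helper name used for illustration; it is not part of MLflow or the OpenAI SDK.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env_secret(name: str, ask: Callable[[str], str] = getpass) -> str:
    """Return the secret in environment variable `name`, prompting only if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Nothing usable in the environment, so ask the user once and store the answer.
        value = ask(f"Enter a value for {name}: ").strip()
        if not value:
            raise ValueError(f"{name} must not be empty")
        os.environ[name] = value
    return value
```

Calling `ensure_env_secret("OPENAI_API_KEY")` would then leave an existing key untouched and fall back to the interactive prompt otherwise.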

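2. Create a prompt

Using the UI

Run mlflow ui in a terminal to start the MLflow UI.
Navigate to the Prompts tab in the MLflow UI.
Click the Create Prompt button.
Fill in the prompt details, such as the name, the prompt template text, and an optional commit message.
Click Create to register the prompt.

Using Python

To create a new prompt with the Python API, call mlflow.register_prompt():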

import mlflow

# Use double curly braces for variables in the template
initial_template = """\
Summarize content you are provided with in {{ num_sentences }} sentences.

Sentences: {{ sentences }}
"""

# Register a new prompt
prompt = mlflow.register_prompt(
    name="summarization-prompt",
    template=initial_template,
    # Optional: Provide a commit message to describe the changes
    commit_message="Initial commit",
)

# The prompt object contains information about the registered prompt
print(f"Created prompt '{prompt.name}' (version {prompt.version})")
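To make the double-curly-brace syntax concrete, here is a small standard-library sketch of how such a template could be filled in. It is not MLflow's implementation, and `template_variables` and `render` are hypothetical names; with a registered prompt you would call its own `format()` method instead, as in the prediction function later in this guide.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(template: str) -> set[str]:
    """Return the names of all {{ variables }} that appear in the template."""
    return set(_VARIABLE.findall(template))


def render(template: str, **values: object) -> str:
    """Substitute every {{ variable }} with its value, failing loudly if one is missing."""
    missing = template_variables(template) - set(values)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")
    return _VARIABLE.sub(lambda match: str(values[match.group(1)]), template)
```

Checking for missing variables before substitution catches typos in variable names early, before any request is sent to the model.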

3. Prepare evaluation data

Below, we create a small summarization dataset for demonstration purposes.

import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "Artificial intelligence has transformed how businesses operate in the 21st century. Companies are leveraging AI for everything from customer service to supply chain optimization. The technology enables automation of routine tasks, freeing human workers for more creative endeavors. However, concerns about job displacement and ethical implications remain significant. Many experts argue that AI will ultimately create more jobs than it eliminates, though the transition may be challenging.",
            "Climate change continues to affect ecosystems worldwide at an alarming rate. Rising global temperatures have led to more frequent extreme weather events including hurricanes, floods, and wildfires. Polar ice caps are melting faster than predicted, contributing to sea level rise that threatens coastal communities. Scientists warn that without immediate and dramatic reductions in greenhouse gas emissions, many of these changes may become irreversible. International cooperation remains essential but politically challenging.",
            "The human genome project was completed in 2003 after 13 years of international collaborative research. It successfully mapped all of the genes of the human genome, approximately 20,000-25,000 genes in total. The project cost nearly $3 billion but has enabled countless medical advances and spawned new fields like pharmacogenomics. The knowledge gained has dramatically improved our understanding of genetic diseases and opened pathways to personalized medicine. Today, a complete human genome can be sequenced in under a day for about $1,000.",
            "Remote work adoption accelerated dramatically during the COVID-19 pandemic. Organizations that had previously resisted flexible work arrangements were forced to implement digital collaboration tools and virtual workflows. Many companies reported surprising productivity gains, though concerns about company culture and collaboration persisted. After the pandemic, a hybrid model emerged as the preferred approach for many businesses, combining in-office and remote work. This shift has profound implications for urban planning, commercial real estate, and work-life balance.",
            "Quantum computing represents a fundamental shift in computational capability. Unlike classical computers that use bits as either 0 or 1, quantum computers use quantum bits or qubits that can exist in multiple states simultaneously. This property, known as superposition, theoretically allows quantum computers to solve certain problems exponentially faster than classical computers. Major technology companies and governments are investing billions in quantum research. Fields like cryptography, material science, and drug discovery are expected to be revolutionized once quantum computers reach practical scale.",
        ],
        "targets": [
            "AI has revolutionized business operations through automation and optimization, though ethical concerns about job displacement persist alongside predictions that AI will ultimately create more employment opportunities than it eliminates.",
            "Climate change is causing accelerating environmental damage through extreme weather events and melting ice caps, with scientists warning that without immediate reduction in greenhouse gas emissions, many changes may become irreversible.",
            "The Human Genome Project, completed in 2003, mapped approximately 20,000-25,000 human genes at a cost of $3 billion, enabling medical advances, improving understanding of genetic diseases, and establishing the foundation for personalized medicine.",
            "The COVID-19 pandemic forced widespread adoption of remote work, revealing unexpected productivity benefits despite collaboration challenges, and resulting in a hybrid work model that impacts urban planning, real estate, and work-life balance.",
            "Quantum computing uses qubits existing in multiple simultaneous states to potentially solve certain problems exponentially faster than classical computers, with major investment from tech companies and governments anticipating revolutionary applications in cryptography, materials science, and pharmaceutical research.",
        ],
    }
)
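Before spending API calls on an evaluation, it can be worth checking that every input has a matching, non-empty target. The sketch below works on a plain dictionary of columns so that it has no dependencies; `validate_eval_columns` is a hypothetical helper, and the column names are assumed to match the ones used above.

```python
def validate_eval_columns(
    columns: dict[str, list],
    input_col: str = "inputs",
    target_col: str = "targets",
) -> int:
    """Check that inputs and targets line up and contain non-blank strings; return the row count."""
    for col in (input_col, target_col):
        if col not in columns:
            raise ValueError(f"missing column: {col!r}")

    inputs, targets = columns[input_col], columns[target_col]
    if len(inputs) != len(targets):
        raise ValueError(f"{len(inputs)} inputs but {len(targets)} targets")

    for index, (text, target) in enumerate(zip(inputs, targets)):
        for label, value in ((input_col, text), (target_col, target)):
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"row {index}: {label} is empty or not a string")
    return len(inputs)
```

With the DataFrame above, this could be called as `validate_eval_columns(eval_data.to_dict("list"))`.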

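4. Define a prediction function

Define a function that takes a DataFrame of inputs and returns a list of predictions.

MLflow passes the input columns (only inputs in this example) to the function. The returned strings are compared against the targets column to evaluate the model.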

import mlflow
import openai
import pandas as pd


def predict(data: pd.DataFrame) -> list[str]:
    predictions = []
    prompt = mlflow.load_prompt("prompts:/summarization-prompt/1")

    for _, row in data.iterrows():
        # Fill in variables in the prompt template
        content = prompt.format(sentences=row["inputs"], num_sentences=1)
        completion = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": content}],
            temperature=0.1,
        )
        predictions.append(completion.choices[0].message.content)

    return predictions
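It can help to exercise the prediction function's contract (a table with an inputs column in, a list of strings out) without calling a real model. The sketch below separates prompt rendering from the model call so that either can be replaced by a stub. `make_predict`, `render_prompt`, and `complete` are hypothetical names, and the returned function accepts anything that supports `data["inputs"]`, which includes a pandas DataFrame.

```python
from typing import Callable, Iterable, Mapping


def make_predict(
    render_prompt: Callable[[str], str],
    complete: Callable[[str], str],
) -> Callable[[Mapping[str, Iterable[str]]], list[str]]:
    """Build a predict function from a prompt renderer and a model-call function."""

    def predict(data: Mapping[str, Iterable[str]]) -> list[str]:
        # One rendered prompt and one completion per input row, in order.
        return [complete(render_prompt(text)) for text in data["inputs"]]

    return predict


# Offline check with stubs in place of the prompt template and the LLM.
stub_predict = make_predict(
    render_prompt=lambda text: f"Summarize: {text}",
    complete=lambda prompt_text: prompt_text.upper(),
)
print(stub_predict({"inputs": ["first document", "second document"]}))
```

In a real run, `render_prompt` could wrap `prompt.format(sentences=text, num_sentences=1)` and `complete` could wrap the OpenAI call shown above.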

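5. Run the evaluation

Call mlflow.evaluate() to evaluate the model with the prepared data and prompt. This example uses two built-in metrics: latency (mlflow.metrics.latency) and answer similarity (mlflow.metrics.genai.answer_similarity), the latter scored by a judge LLM.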

with mlflow.start_run(run_name="prompt-evaluation"):
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.1)

    results = mlflow.evaluate(
        model=predict,
        data=eval_data,
        targets="targets",
        extra_metrics=[
            mlflow.metrics.latency(),
            # Specify GPT4 as a judge model for answer similarity. Other models such as Anthropic,
            # Bedrock, Databricks, are also supported.
            mlflow.metrics.genai.answer_similarity(model="openai:/gpt-4"),
        ],
    )
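The object returned by mlflow.evaluate() exposes the aggregated metrics as a dictionary through `results.metrics`. A minimal sketch for printing them in a stable, aligned form follows; `format_metrics` is a hypothetical helper, not an MLflow function.

```python
def format_metrics(metrics: dict[str, float], precision: int = 3) -> list[str]:
    """Return one aligned 'name  value' line per metric, sorted by metric name."""
    width = max((len(name) for name in metrics), default=0)
    return [
        f"{name:<{width}}  {value:.{precision}f}"
        for name, value in sorted(metrics.items())
    ]
```

After the evaluation above, `print("\n".join(format_metrics(results.metrics)))` would list every aggregated metric recorded for the run.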

Tip

MLflow provides many built-in metrics for evaluating LLMs. You can also define custom metrics, including LLM-as-a-Judge metrics. For more details, see LLM Evaluation Metrics.
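As an illustration of what a simple custom metric might compute, the sketch below scores whether each summary has the requested number of sentences. It covers only the scoring logic, using a deliberately crude sentence splitter (abbreviations such as "U.S." would be miscounted). The function names are hypothetical, and wiring the scores into MLflow, for example through mlflow.metrics.make_metric, is left to the metrics documentation.

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences by splitting after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def sentence_count_scores(predictions: list[str], expected: int = 1) -> list[int]:
    """Return 1 for each prediction with exactly `expected` sentences, else 0."""
    return [1 if count_sentences(text) == expected else 0 for text in predictions]
```

A per-row score list like this is the kind of output a custom metric function typically returns, along with an aggregate such as the mean.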

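6. View the results

You can view the evaluation results in the MLflow UI. Navigate to the Experiments tab and click the evaluation run (prompt-evaluation in this example) to see its results.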

Evaluation Results

If you have multiple evaluation runs, you can compare metrics across runs in the chart view.

Evaluation Chart
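The same comparison can be done in code when two sets of aggregated metrics are at hand, for instance from evaluating two prompt versions. This is a minimal sketch; `metric_deltas` is a hypothetical helper, and only metrics present in both runs are compared.

```python
def metric_deltas(
    baseline: dict[str, float],
    candidate: dict[str, float],
) -> dict[str, float]:
    """Return candidate minus baseline for every metric that both runs report."""
    shared = baseline.keys() & candidate.keys()
    return {name: candidate[name] - baseline[name] for name in sorted(shared)}
```

Whether a positive delta is an improvement depends on the metric: higher is better for answer similarity, while lower is better for latency.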

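You can also open the Traces tab on the evaluation run page to see every input sent to the LLM and every response it returned during evaluation, which shows how the model responds to different prompts.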

