End-to-End Workflow: Evaluation-Driven Development

This guide walks through the complete workflow for building and evaluating a GenAI application with MLflow's evaluation-driven development approach.

Warning

Databricks users: this workflow uses the MLflow OSS evaluation dataset APIs. In Databricks environments, use the databricks-agents package instead, which provides optimized dataset management integrated with Unity Catalog.

Workflow Overview

Evaluation-driven development is a loop: Build & Trace → Capture Traces → Add Expectations → Create Dataset → Evaluate → Analyze Results → Iterate & Improve, which feeds back into the next build.

Prerequisites

pip install --upgrade "mlflow>=3.4" openai
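
If you log to an MLflow tracking server rather than local files, point the client at it before running the examples. A minimal sketch, assuming a server at http://localhost:5000 (the URI is illustrative; adjust it to your deployment):

import mlflow

# Optional: use a tracking server instead of local ./mlruns (URI is an assumption)
mlflow.set_tracking_uri("http://localhost:5000")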

Step 1: Build and Trace Your Application

Start with a traced GenAI application. This example shows a customer support bot, but the pattern applies to any LLM application. You can instrument the code manually with the mlflow.trace decorator, or enable automatic tracing for OpenAI as shown below.

import mlflow
import openai
import os

# Configure environment
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
mlflow.set_experiment("Customer Support Bot")

# Enable automatic tracing for OpenAI
mlflow.openai.autolog()


class CustomerSupportBot:
    def __init__(self):
        self.client = openai.OpenAI()
        self.knowledge_base = {
            "refund": "Full refunds within 30 days with receipt.",
            "shipping": "Standard: 5-7 days. Express available.",
            "warranty": "1-year manufacturer warranty included.",
        }

    @mlflow.trace
    def answer(self, question: str) -> str:
        # Retrieve relevant context
        context = self._get_context(question)

        # Generate response
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful support assistant."},
                {
                    "role": "user",
                    "content": f"Context: {context}\n\nQuestion: {question}",
                },
            ],
            temperature=0.3,
        )
        return response.choices[0].message.content

    def _get_context(self, question: str) -> str:
        # Simple keyword matching for demo
        for key, value in self.knowledge_base.items():
            if key in question.lower():
                return value
        return "General customer support information."


bot = CustomerSupportBot()

Step 2: Capture Production Traces

Run your application against real or test scenarios to capture traces. You will later retrieve these traces with mlflow.search_traces() for annotation and dataset creation.

# Test scenarios
test_questions = [
    "What is your refund policy?",
    "How long does shipping take?",
    "Is my product under warranty?",
    "Can I get express shipping?",
]

# Capture traces - automatically logged to the active experiment
for question in test_questions:
    response = bot.answer(question)
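
To confirm the runs were traced into the active experiment, you can query them right back. A minimal sanity check, relying on mlflow.search_traces() returning a pandas DataFrame by default:

# Verify that the test runs produced traces in the active experiment
captured = mlflow.search_traces(max_results=10)
print(f"Captured {len(captured)} traces")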

Step 3: Add Ground-Truth Expectations

Add expectations to your traces to define what counts as correct behavior. Use mlflow.log_expectation() to annotate traces with ground-truth values that serve as your evaluation baseline.

# Search for recent traces (uses current active experiment by default)
traces = mlflow.search_traces(
    max_results=10, return_type="list"  # Return list of Trace objects for iteration
)

# Add expectations to specific traces
for trace in traces:
    # Get the question from the root span inputs
    root_span = trace.data._get_root_span()
    question = (
        root_span.inputs.get("question", "") if root_span and root_span.inputs else ""
    )

    if "refund" in question.lower():
        mlflow.log_expectation(
            trace_id=trace.info.trace_id,
            name="key_information",
            value={"must_mention": ["30 days", "receipt"], "tone": "helpful"},
        )
    elif "shipping" in question.lower():
        mlflow.log_expectation(
            trace_id=trace.info.trace_id,
            name="key_information",
            value={"must_mention": ["5-7 days"], "offers_express": True},
        )
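
Expectations record ground truth; if you also want to capture human review of a response, MLflow provides mlflow.log_feedback() with a similar shape. A minimal sketch, assuming log_feedback() accepts trace_id, name, value, and rationale keyword arguments in your MLflow version (the name and rationale below are illustrative):

# Illustrative: attach a reviewer judgment to the first captured trace
mlflow.log_feedback(
    trace_id=traces[0].info.trace_id,
    name="human_quality",
    value=True,
    rationale="Response was accurate and polite.",
)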

Step 4: Create an Evaluation Dataset

Convert the annotated traces into a reusable evaluation dataset. Use create_dataset() to initialize the dataset and merge_records() to add test cases from multiple sources.

from mlflow.genai.datasets import create_dataset

# Create dataset from current experiment
dataset = create_dataset(
    name="customer_support_qa_v1",
    experiment_id=mlflow.get_experiment_by_name("Customer Support Bot").experiment_id,
    tags={"stage": "validation", "domain": "customer_support"},
)

# Re-fetch traces to get the attached expectations
# The expectations are now part of the trace data
annotated_traces = mlflow.search_traces(
    max_results=100,
    return_type="list",  # Need list for merge_records
)

# Add traces to dataset
dataset.merge_records(annotated_traces)

# Optionally add manual test cases
manual_tests = [
    {
        "inputs": {"question": "Can I return an item after 45 days?"},
        "expectations": {"should_clarify": "30-day policy", "tone": "apologetic"},
    },
    {
        "inputs": {"question": "Do you ship internationally?"},
        "expectations": {"provides_alternatives": True},
    },
]
dataset.merge_records(manual_tests)
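
Before evaluating, it can help to review what actually landed in the dataset. A minimal sketch, assuming the dataset object exposes a to_df() helper as in recent MLflow releases:

# Assumes EvaluationDataset.to_df() is available in your MLflow version
records_df = dataset.to_df()
print(f"Dataset contains {len(records_df)} records")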

Step 5: Run Systematic Evaluation

Evaluate your application against the dataset with built-in and custom scorers. Run a comprehensive evaluation with mlflow.genai.evaluate(), using scorers such as Correctness for factual accuracy. You can also define custom scorers with the @scorer decorator to assess domain-specific criteria.

from mlflow.genai import evaluate
from mlflow.genai.scorers import Correctness, Guidelines, scorer


# Define custom scorer for your specific needs
@scorer
def contains_required_info(outputs: str, expectations: dict) -> float:
    """Check if response contains required information."""
    if "must_mention" not in expectations:
        return 1.0

    output_lower = outputs.lower()
    mentioned = [term for term in expectations["must_mention"] if term in output_lower]
    return len(mentioned) / len(expectations["must_mention"])


# Configure evaluation
scorers = [
    Correctness(name="factual_accuracy"),
    Guidelines(
        name="support_quality",
        guidelines="Response must be helpful, accurate, and professional",
    ),
    contains_required_info,
]

# Run evaluation
results = evaluate(
    data=dataset,
    predict_fn=bot.answer,
    scorers=scorers,
    model_id="customer-support-bot-v1",
)

# Access results
metrics = results.metrics
detailed_results = results.tables["eval_results_table"]
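
The returned object bundles aggregate metrics (a dict keyed by scorer) and a per-row results table (a pandas DataFrame). A quick inspection might look like this; the exact keys and columns depend on the scorers you configured:

# Aggregate scores across the dataset
for metric_name, metric_value in metrics.items():
    print(f"{metric_name}: {metric_value}")

# Per-row results, one row per evaluated example
print(detailed_results.head())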

Step 6: Iterate and Improve

Use the evaluation results to improve your application, then re-evaluate against the same dataset for a direct comparison.

# Analyze results
low_scores = detailed_results[detailed_results["factual_accuracy/score"] < 0.8]
if not low_scores.empty:
    # Identify patterns in failures
    failed_questions = low_scores["inputs.question"].tolist()

    # Example improvements based on failure analysis
    bot.knowledge_base["refund"] = (
        "Full refunds available within 30 days with original receipt. "
        "Store credit offered after 30 days."
    )
    # For more consistent responses, lower the temperature passed to
    # chat.completions.create in CustomerSupportBot.answer (e.g. 0.3 -> 0.2)

# Re-evaluate with same dataset for comparison
improved_results = evaluate(
    data=dataset,
    predict_fn=bot.answer,  # Updated bot
    scorers=scorers,
    model_id="customer-support-bot-v2",
)

# Compare versions
improvement = (
    improved_results.metrics["factual_accuracy/score"]
    - metrics["factual_accuracy/score"]
)
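
For a quick read on whether the changes helped, print the delta alongside both versions' scores; the metric key mirrors the one used in the comparison above:

print(f"v1 factual_accuracy: {metrics['factual_accuracy/score']:.2f}")
print(f"v2 factual_accuracy: {improved_results.metrics['factual_accuracy/score']:.2f}")
print(f"Improvement: {improvement:+.2f}")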

Next Steps