End-to-End Judge Workflow

This guide walks you through the complete lifecycle of developing and optimizing custom LLM judges with MLflow's judge APIs.

Why This Workflow Matters

Systematic Development

Turn subjective evaluation into data-driven judge development with clear metrics and goals.

Human Alignment

Ensure your judges reflect human expertise and domain knowledge through structured feedback.

Continuous Improvement

Iterate on and improve judge accuracy based on real-world performance and evolving requirements.

Production Ready

Deploy judges with confidence, knowing they have been tested and meet your quality standards.

The Development Cycle

Create Judge → Collect Feedback → Align with Humans → Test & Register → Iterate

Step 1: Create the Initial Judge

Start by defining your evaluation criteria:

import mlflow
from mlflow.genai.judges import make_judge
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Create experiment for judge development
experiment_id = mlflow.create_experiment("support-judge-development")
mlflow.set_experiment(experiment_id=experiment_id)

# Create a judge for evaluating customer support responses
support_judge = make_judge(
    name="support_quality",
    instructions="""
    Evaluate the quality of this customer support response.

    Rate as one of: excellent, good, needs_improvement, poor

    Consider:
    - Does it address the customer's issue?
    - Is the tone professional and empathetic?
    - Are next steps clear?

    Focus on {{ outputs }} responding to {{ inputs }}.
    """,
    model="anthropic:/claude-opus-4-1-20250805",
)

Step 2: Generate Traces and Collect Feedback

Run your application to generate traces, then collect human feedback:

# Generate traces from your application
@mlflow.trace
def customer_support_app(issue):
    # Your application logic here
    return {"response": f"I'll help you with: {issue}"}


# Run application to generate traces
issues = [
    "Password reset not working",
    "Billing discrepancy",
    "Feature request",
    "Technical error",
]

trace_ids = []
for issue in issues:
    with mlflow.start_run(experiment_id=experiment_id):
        result = customer_support_app(issue)
        trace_id = mlflow.get_last_active_trace_id()
        trace_ids.append(trace_id)

        # Judge evaluates the trace
        assessment = support_judge(inputs={"issue": issue}, outputs=result)

        # Log judge's assessment
        mlflow.log_assessment(trace_id=trace_id, assessment=assessment)

Collect Human Feedback

After running the judge on your traces, collect human feedback to establish ground truth.

When to use: you need human feedback to align the judge.

The MLflow UI provides the most intuitive way to review traces and add feedback.

How to collect feedback

  1. Open the MLflow UI and navigate to your experiment
  2. Go to the "Traces" tab to see all generated traces
  3. Click an individual trace to review:
    • The input data (the customer issue)
    • The output response
    • The judge's initial assessment
  4. Click "Add Feedback" to add your own feedback
  5. Select the assessment name that matches your judge (e.g., "support_quality")
  6. Provide your expert rating (excellent, good, needs_improvement, or poor)

Who should provide feedback?

If you are not a domain expert

  • Ask domain experts or other developers to provide labels through the MLflow UI
  • Assign traces to team members with the relevant expertise
  • Consider organizing feedback sessions where experts review batches together

If you are a domain expert

  • Review traces and add your expert assessments directly in the MLflow UI
  • Create a rubric or guidelines document to keep ratings consistent
  • Document your evaluation criteria for future reference

The UI records feedback in the correct format for alignment automatically.

Adding feedback through MLflow UI
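Feedback can also be logged programmatically, which helps when expert labels are collected outside the UI (for example, in a spreadsheet). The snippet below is a minimal sketch, not the definitive path: the human_labels mapping and the reviewer ID are hypothetical, and it assumes you still have the trace IDs collected in Step 2.

# Log human feedback programmatically (alternative to the MLflow UI)
# `human_labels` is a hypothetical mapping from trace ID to an expert rating
human_labels = {trace_ids[0]: "good", trace_ids[1]: "needs_improvement"}

for trace_id, rating in human_labels.items():
    mlflow.log_feedback(
        trace_id=trace_id,
        name="support_quality",  # must match the judge's assessment name
        value=rating,
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id="support-team-reviewer",  # illustrative reviewer ID
        ),
    )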

Step 3: Align the Judge with Human Feedback

Use the SIMBA optimizer to improve the judge's accuracy:

# Retrieve traces with both judge and human assessments
traces = mlflow.search_traces(experiment_ids=[experiment_id], return_type="list")

# Filter for traces with both assessments
aligned_traces = []
for trace in traces:
    assessments = trace.search_assessments(name="support_quality")
    has_judge = any(
        a.source.source_type == AssessmentSourceType.LLM_JUDGE for a in assessments
    )
    has_human = any(
        a.source.source_type == AssessmentSourceType.HUMAN for a in assessments
    )

    if has_judge and has_human:
        aligned_traces.append(trace)

print(f"Found {len(aligned_traces)} traces with both assessments")

# Align the judge (requires at least 10 traces)
if len(aligned_traces) >= 10:
    # Option 1: Use default optimizer (recommended for simplicity)
    aligned_judge = support_judge.align(aligned_traces)

    # Option 2: Explicitly specify optimizer with custom model
    # from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer
    # optimizer = SIMBAAlignmentOptimizer(model="anthropic:/claude-opus-4-1-20250805")
    # aligned_judge = support_judge.align(aligned_traces, optimizer)

    print("Judge aligned successfully!")
else:
    print(f"Need at least 10 traces (have {len(aligned_traces)})")

Step 4: Test and Register

Test the aligned judge, and register it once you're satisfied:

# Test the aligned judge on new data
test_cases = [
    {
        "inputs": {"issue": "Can't log in"},
        "outputs": {"response": "Let me reset your password for you."},
    },
    {
        "inputs": {"issue": "Refund request"},
        "outputs": {"response": "I'll process that refund immediately."},
    },
]

# Evaluate with aligned judge
for case in test_cases:
    assessment = aligned_judge(**case)
    print(f"Issue: {case['inputs']['issue']}")
    print(f"Judge rating: {assessment.value}")
    print(f"Rationale: {assessment.rationale}\n")

# Register the aligned judge for production use
aligned_judge.register(experiment_id=experiment_id)
print("Judge registered and ready for deployment!")

Step 5: Use the Registered Judge in Production

Retrieve your registered judge and use it with mlflow.genai.evaluate():

from mlflow.genai.scorers import get_scorer
import pandas as pd

# Retrieve the registered judge
production_judge = get_scorer(name="support_quality", experiment_id=experiment_id)

# Prepare evaluation data
eval_data = pd.DataFrame(
    [
        {
            "inputs": {"issue": "Can't access my account"},
            "outputs": {"response": "I'll help you regain access immediately."},
        },
        {
            "inputs": {"issue": "Slow website performance"},
            "outputs": {"response": "Let me investigate the performance issues."},
        },
    ]
)

# Run evaluation with the aligned judge
results = mlflow.genai.evaluate(data=eval_data, scorers=[production_judge])

# View results and metrics
print("Evaluation metrics:", results.metrics)
print("\nDetailed results:")
print(results.tables["eval_results_table"])

# Assessments are automatically logged to the traces
# You can view them in the MLflow UI Traces tab
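If you'd rather read the logged assessments back in code instead of the UI, the same search_traces / search_assessments calls from Step 3 apply. A minimal sketch, assuming the evaluation ran against the current experiment (note this also returns the earlier development traces):

# Fetch traces from the experiment and print the judge's logged assessments
eval_traces = mlflow.search_traces(experiment_ids=[experiment_id], return_type="list")

for trace in eval_traces:
    for a in trace.search_assessments(name="support_quality"):
        print(f"{trace.info.trace_id}: {a.value} - {a.rationale}")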

Best Practices

Clear Instructions

Start with specific, unambiguous evaluation criteria that reflect your domain requirements.

High-Quality Feedback

Make sure human feedback comes from domain experts who understand your evaluation criteria.

Sufficient Data

Collect at least 10-15 traces with assessments for effective alignment.

Iterate Often

Re-align your judges regularly as your application evolves and new edge cases emerge; a sketch of that loop follows below.
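Re-alignment reuses the pieces from Steps 3 and 4: gather the current set of traces that carry both judge and human assessments, call align() on the judge again, and re-register the result. A minimal sketch under those assumptions; the filtering mirrors Step 3.

# Periodically re-align with the latest labeled traces
recent_traces = mlflow.search_traces(experiment_ids=[experiment_id], return_type="list")

# Keep only traces with both a judge assessment and a human assessment (same logic as Step 3)
labeled = []
for t in recent_traces:
    assessments = t.search_assessments(name="support_quality")
    has_judge = any(
        a.source.source_type == AssessmentSourceType.LLM_JUDGE for a in assessments
    )
    has_human = any(
        a.source.source_type == AssessmentSourceType.HUMAN for a in assessments
    )
    if has_judge and has_human:
        labeled.append(t)

if len(labeled) >= 10:
    refreshed_judge = support_judge.align(labeled)
    refreshed_judge.register(experiment_id=experiment_id)  # register the updated judge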

Next Steps