End-to-End Judge Workflow
This guide walks you through the complete lifecycle of developing and refining a custom LLM judge with MLflow's judge APIs.
Why This Workflow Matters
Systematic Development
Move from subjective evaluation to data-driven judge development with clear metrics and goals.
Human Alignment
Ensure your judge reflects human expertise and domain knowledge through structured feedback.
Continuous Improvement
Iterate on and improve judge accuracy based on real-world performance and evolving requirements.
Production Ready
Deploy judges with confidence because they have been tested and meet your quality standards.
The Development Cycle
Create the judge
Collect feedback
Align with humans
Test and register
Step 1: Create the Initial Judge
Start by defining your evaluation criteria:
python
from typing import Literal

import mlflow
from mlflow.genai.judges import make_judge
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Create an experiment for judge development
experiment_id = mlflow.create_experiment("support-judge-development")
mlflow.set_experiment(experiment_id=experiment_id)

# Create a judge for evaluating customer support responses
support_judge = make_judge(
    name="support_quality",
    instructions="""
    Evaluate the quality of this customer support response.
    Rate as one of: excellent, good, needs_improvement, poor

    Consider:
    - Does it address the customer's issue?
    - Is the tone professional and empathetic?
    - Are next steps clear?

    Focus on {{ outputs }} responding to {{ inputs }}.
    """,
    model="anthropic:/claude-opus-4-1-20250805",
    feedback_value_type=Literal["excellent", "good", "needs_improvement", "poor"],
)
Step 2: Generate Traces and Collect Feedback
Run your application to generate traces, then collect human feedback:
python
# Generate traces from your application
@mlflow.trace
def customer_support_app(issue):
    # Your application logic here
    return {"response": f"I'll help you with: {issue}"}


# Run the application to generate traces
issues = [
    "Password reset not working",
    "Billing discrepancy",
    "Feature request",
    "Technical error",
]

trace_ids = []
for issue in issues:
    with mlflow.start_run(experiment_id=experiment_id):
        result = customer_support_app(issue)
        trace_id = mlflow.get_last_active_trace_id()
        trace_ids.append(trace_id)

        # Judge evaluates the trace
        assessment = support_judge(inputs={"issue": issue}, outputs=result)

        # Log the judge's assessment
        mlflow.log_assessment(trace_id=trace_id, assessment=assessment)
Collect Human Feedback
After running the judge on your traces, collect human feedback to establish ground truth:
- MLflow UI (Recommended)
- Programmatic (Existing Labels)
When to use: You need to collect human feedback for judge alignment.
The MLflow UI provides the most intuitive way to review traces and add feedback.
How to collect feedback:
- Open the MLflow UI and navigate to your experiment.
- Go to the Traces tab to see all generated traces.
- Click on an individual trace to review:
  - the input data (the customer issue)
  - the output response
  - the judge's initial assessment
- Add your feedback by clicking Add Feedback.
- Select the assessment name that matches your judge (for example, support_quality).
- Provide your expert rating (excellent, good, needs_improvement, or poor).
Who should provide feedback?
If you are not a domain expert:
- Ask domain experts or other developers to provide labels through the MLflow UI.
- Assign traces to team members with the relevant expertise.
- Consider organizing feedback sessions where experts review batches together.
If you are a domain expert:
- Review traces directly in the MLflow UI and add your expert assessments.
- Create a rubric or guideline document to ensure consistency.
- Document your evaluation criteria for future reference.
The UI automatically logs feedback in the correct format for alignment.
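Optionally, you can confirm that the UI feedback was recorded before moving on to alignment. The snippet below is a minimal sketch rather than a required step: it assumes the experiment_id and the judge name support_quality from Step 1, and counts how many traces carry at least one human assessment.
python
import mlflow
from mlflow.entities import AssessmentSourceType

# Count traces that carry at least one human assessment named "support_quality"
traces = mlflow.search_traces(experiment_ids=[experiment_id], return_type="list")
human_labeled = [
    t
    for t in traces
    if any(
        a.source.source_type == AssessmentSourceType.HUMAN
        for a in t.search_assessments(name="support_quality")
    )
]
print(f"{len(human_labeled)} of {len(traces)} traces have human feedback")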

When to use: You already have ground truth labels for your data.
If you have existing ground truth labels, log them programmatically:
python
# Example: you already have ground truth labels
ground_truth = {
    trace_ids[0]: "excellent",  # Known good response
    trace_ids[1]: "poor",  # Known bad response
    trace_ids[2]: "good",  # Known acceptable response
}

for trace_id, truth_value in ground_truth.items():
    mlflow.log_feedback(
        trace_id=trace_id,
        name="support_quality",  # MUST match the judge name
        value=truth_value,
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN, source_id="ground_truth"
        ),
    )
Step 3: Align the Judge with Human Feedback
Use the SIMBA optimizer to improve the judge's accuracy:
python
# Retrieve traces with both judge and human assessments
traces = mlflow.search_traces(experiment_ids=[experiment_id], return_type="list")

# Filter for traces with both assessments
aligned_traces = []
for trace in traces:
    assessments = trace.search_assessments(name="support_quality")
    has_judge = any(
        a.source.source_type == AssessmentSourceType.LLM_JUDGE for a in assessments
    )
    has_human = any(
        a.source.source_type == AssessmentSourceType.HUMAN for a in assessments
    )
    if has_judge and has_human:
        aligned_traces.append(trace)

print(f"Found {len(aligned_traces)} traces with both assessments")

# Align the judge (requires at least 10 traces)
if len(aligned_traces) >= 10:
    # Option 1: Use the default optimizer (recommended for simplicity)
    aligned_judge = support_judge.align(aligned_traces)

    # Option 2: Explicitly specify the optimizer with a custom model
    # from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer
    # optimizer = SIMBAAlignmentOptimizer(model="anthropic:/claude-opus-4-1-20250805")
    # aligned_judge = support_judge.align(aligned_traces, optimizer)

    print("Judge aligned successfully!")
else:
    print(f"Need at least 10 traces (have {len(aligned_traces)})")
Step 4: Test and Register
Test the aligned judge and register it when it is ready:
python
# Test the aligned judge on new data
test_cases = [
    {
        "inputs": {"issue": "Can't log in"},
        "outputs": {"response": "Let me reset your password for you."},
    },
    {
        "inputs": {"issue": "Refund request"},
        "outputs": {"response": "I'll process that refund immediately."},
    },
]

# Evaluate with the aligned judge
for case in test_cases:
    assessment = aligned_judge(**case)
    print(f"Issue: {case['inputs']['issue']}")
    print(f"Judge rating: {assessment.value}")
    print(f"Rationale: {assessment.rationale}\n")

# Register the aligned judge for production use
aligned_judge.register(experiment_id=experiment_id)
print("Judge registered and ready for deployment!")
Step 5: Use the Registered Judge in Production
Retrieve your registered judge and use it with mlflow.genai.evaluate():
python
from mlflow.genai.scorers import get_scorer
import pandas as pd

# Retrieve the registered judge
production_judge = get_scorer(name="support_quality", experiment_id=experiment_id)

# Prepare evaluation data
eval_data = pd.DataFrame(
    [
        {
            "inputs": {"issue": "Can't access my account"},
            "outputs": {"response": "I'll help you regain access immediately."},
        },
        {
            "inputs": {"issue": "Slow website performance"},
            "outputs": {"response": "Let me investigate the performance issues."},
        },
    ]
)

# Run evaluation with the aligned judge
results = mlflow.genai.evaluate(data=eval_data, scorers=[production_judge])

# View results and metrics
print("Evaluation metrics:", results.metrics)
print("\nDetailed results:")
print(results.tables["eval_results_table"])

# Assessments are automatically logged to the traces
# You can view them in the MLflow UI Traces tab
Best Practices
Clear Instructions
Start with specific, unambiguous evaluation criteria that reflect your domain requirements.
High-Quality Feedback
Ensure human feedback comes from domain experts who understand your evaluation criteria.
Sufficient Data
Collect at least 10-15 traces with assessments for effective alignment.
Iterate Often
Re-align the judge regularly as your application evolves and new edge cases emerge, as sketched below.
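As an illustration of that last point, a periodic re-alignment pass could reuse the filtering logic from Step 3. This is only a sketch, not an official recipe; it assumes you keep collecting human feedback on new traces in the same experiment, that support_judge from Step 1 is still available, and that re-registering under the same name is permitted in your setup.
python
import mlflow
from mlflow.entities import AssessmentSourceType

# Gather traces that carry both judge and human assessments (same filter as Step 3)
traces = mlflow.search_traces(experiment_ids=[experiment_id], return_type="list")
labeled = [
    t
    for t in traces
    if {AssessmentSourceType.HUMAN, AssessmentSourceType.LLM_JUDGE}
    <= {a.source.source_type for a in t.search_assessments(name="support_quality")}
]

# Re-align and re-register once enough new feedback has accumulated
if len(labeled) >= 10:
    realigned_judge = support_judge.align(labeled)
    realigned_judge.register(experiment_id=experiment_id)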