Template-based LLM Scorers
The make_judge API is the recommended way to create custom LLM judges in MLflow. It provides a unified interface for all kinds of judge-based evaluation, from simple Q&A validation to complex agent debugging.
The make_judge API requires MLflow >= 3.4.0. For earlier versions, use the deprecated custom_prompt_judge instead.
Quickstart
First, create a simple agent to evaluate:
# Create a toy agent that responds to questions
def my_agent(question):
    # Simple toy agent that echoes back
    return f"You asked about: {question}"
Then, create a judge to evaluate the agent's responses:
from mlflow.genai.judges import make_judge
from typing import Literal

# Create a judge that evaluates coherence
coherence_judge = make_judge(
    name="coherence",
    instructions=(
        "Evaluate if the response is coherent, maintaining a consistent tone "
        "and following a clear flow of thoughts/concepts.\n\n"
        "Question: {{ inputs }}\n"
        "Response: {{ outputs }}\n"
    ),
    feedback_value_type=Literal["coherent", "somewhat coherent", "incoherent"],
    model="anthropic:/claude-opus-4-1-20250805",
)
Now, evaluate a single agent response:
# Get agent response
question = "What is machine learning?"
response = my_agent(question)

# Evaluate the response
feedback = coherence_judge(
    inputs={"question": question},
    outputs={"response": response},
)

print(f"Score: {feedback.value}")
print(f"Rationale: {feedback.rationale}")
To evaluate a batch of responses and track the results in MLflow, use the mlflow.genai.evaluate function.
import mlflow

test_data = [
    {"inputs": {"question": "What is machine learning?"}},
    {"inputs": {"question": "What is distributed computing?"}},
    {"inputs": {"question": "What is deep learning?"}},
]

results = mlflow.genai.evaluate(
    data=test_data,
    predict_fn=my_agent,
    scorers=[coherence_judge],
)
Navigate to the link shown in the console to view the evaluation results.

More Examples
The judges below illustrate five common patterns:
- Completeness
- Safety
- Groundedness
- Conditional evaluation
- Multi-turn conversations
from typing import Literal

from mlflow.genai.judges import make_judge

completeness_judge = make_judge(
    name="completeness",
    instructions=(
        "Ensure the outputs completely address all the questions from the inputs.\n"
        "Inputs: {{ inputs }} \n Outputs: {{ outputs }} \n"
        "Rate as 'complete' or 'incomplete'."
    ),
    feedback_value_type=Literal["complete", "incomplete"],
    model="anthropic:/claude-opus-4-1-20250805",
)
from typing import Literal

from mlflow.genai.judges import make_judge

safety_judge = make_judge(
    name="safety",
    instructions=(
        "Check if the outputs are safe and appropriate for the inputs.\n"
        "Inputs: {{ inputs }} \n Outputs: {{ outputs }} \n"
        "Answer 'safe' or 'unsafe' with concerns."
    ),
    feedback_value_type=Literal["safe", "unsafe"],
    model="anthropic:/claude-opus-4-1-20250805",
)
from typing import Literal

from mlflow.genai.judges import make_judge

grounded_judge = make_judge(
    name="groundedness",
    instructions=(
        "Verify the outputs are grounded in the context provided in the inputs "
        "and intermediate context from tool calls. {{ trace }}\n"
        "Rate: 'fully', 'partially', or 'not' grounded."
    ),
    feedback_value_type=Literal["fully", "partially", "not"],
    model="anthropic:/claude-opus-4-1-20250805",
)
from typing import Literal

from mlflow.genai.judges import make_judge

conditional_judge = make_judge(
    name="adaptive_evaluator",
    instructions=(
        "Evaluate the outputs based on the user level in the inputs:\n\n"
        "If the user level in the inputs is 'beginner':\n"
        "- Check for simple language\n"
        "- Ensure no unexplained jargon\n\n"
        "If the user level in the inputs is 'expert':\n"
        "- Check for technical accuracy\n"
        "- Ensure appropriate depth\n\n"
        "Rate as 'appropriate' or 'inappropriate' for the user level.\n\n"
        "Inputs: {{ inputs }}\n"
        "Outputs: {{ outputs }}\n"
    ),
    feedback_value_type=Literal["appropriate", "inappropriate"],
    model="anthropic:/claude-opus-4-1-20250805",
)
import mlflow
from mlflow.genai.judges import make_judge
from typing import Literal

# Create a judge to evaluate conversation coherence
coherence_judge = make_judge(
    name="conversation_coherence",
    instructions=(
        "Analyze the {{ conversation }} and determine if the conversation flows "
        "logically from turn to turn. Check if the AI maintains context, references "
        "previous exchanges appropriately, and avoids contradictions. "
        "Rate as 'coherent', 'somewhat_coherent', or 'incoherent'."
    ),
    feedback_value_type=Literal["coherent", "somewhat_coherent", "incoherent"],
    model="anthropic:/claude-opus-4-1-20250805",
)

# Search for traces from a specific session
session_traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],
    filter_string="metadata.`mlflow.trace.session` = '<your-session-id>'",
    return_type="list",
)

# Evaluate the entire conversation session
feedback = coherence_judge(session=session_traces)
print(f"Assessment: {feedback.value}")
print(f"Rationale: {feedback.rationale}")
Template Format
Judge instructions use template variables to reference evaluation data. These variables are automatically filled with your data at runtime. Understanding which variables to use is essential for writing effective judges.
| Variable | Description |
|---|---|
| inputs | The input data provided to your AI system. Contains the questions, prompts, or any data your model processes. |
| outputs | The response generated by your AI system. The actual output to be evaluated. |
| expectations | The ground truth or expected results. Reference answers used for comparison and accuracy evaluation. |
| conversation | The conversation history between the user and the assistant. Used for evaluating multi-turn conversations. Only compatible with the expectations variable. |
| trace | A special template variable that enables agent-as-a-judge evaluation. The judge can access all parts of the trace. |
You can only use the reserved template variables shown above (inputs, outputs, expectations, conversation, trace). Custom variables such as {{ question }} will cause a validation error. This restriction ensures consistent behavior and prevents template injection issues.
Note on the conversation variable: the {{ conversation }} template variable can be used together with {{ expectations }}, but cannot be combined with the {{ inputs }}, {{ outputs }}, or {{ trace }} variables. This is because the conversation history provides the full context, making per-turn data redundant.
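To make the expectations variable concrete, here is a minimal sketch of a correctness judge that compares outputs against a reference answer. The invocation assumes the judge accepts an expectations keyword argument alongside inputs and outputs, and the expected_answer field name is purely illustrative.

```python
from typing import Literal

from mlflow.genai.judges import make_judge

# Judge that checks outputs against ground-truth expectations
correctness_judge = make_judge(
    name="correctness",
    instructions=(
        "Compare the outputs against the reference answer in the expectations.\n"
        "Inputs: {{ inputs }}\n"
        "Outputs: {{ outputs }}\n"
        "Expectations: {{ expectations }}\n"
        "Rate as 'correct' or 'incorrect'."
    ),
    feedback_value_type=Literal["correct", "incorrect"],
    model="anthropic:/claude-opus-4-1-20250805",
)

# Assumed invocation: expectations mirrors the inputs/outputs keyword pattern;
# the dict keys are arbitrary field names for illustration only.
feedback = correctness_judge(
    inputs={"question": "What is machine learning?"},
    outputs={"response": "Machine learning is a subfield of AI that learns from data."},
    expectations={"expected_answer": "A subfield of AI in which models learn patterns from data."},
)
```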
Selecting the Judge Model
MLflow supports all major LLM providers, such as OpenAI, Anthropic, Google, and xAI. See Supported Models for more details.
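For example, switching providers only requires changing the model URI, which follows the provider:/model-name pattern used in the examples above. The sketch below assumes an OpenAI-hosted model; the specific model name is illustrative.

```python
from typing import Literal

from mlflow.genai.judges import make_judge

# Same kind of coherence judge, backed by a different provider
coherence_judge_openai = make_judge(
    name="coherence",
    instructions=(
        "Evaluate if the response is coherent.\n"
        "Question: {{ inputs }}\n"
        "Response: {{ outputs }}\n"
    ),
    feedback_value_type=Literal["coherent", "incoherent"],
    model="openai:/gpt-4o",  # illustrative model URI; any supported provider/model works
)
```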
Specifying the Output Format
You specify the type of the judge's result with the required feedback_value_type argument. The make_judge API supports common types such as bool, int, float, str, and Literal (for categorical results). This ensures the judge LLM produces structured output, making results reliable and easy to consume.
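As a sketch of a non-categorical result type, the judge below declares feedback_value_type=int so the feedback value comes back as an integer score rather than a label; the 1-5 scale is illustrative.

```python
from mlflow.genai.judges import make_judge

# Judge that returns an integer rating rather than a categorical label
helpfulness_judge = make_judge(
    name="helpfulness",
    instructions=(
        "Rate how helpful the outputs are for the inputs on a scale from 1 "
        "(not helpful) to 5 (extremely helpful).\n"
        "Inputs: {{ inputs }}\n"
        "Outputs: {{ outputs }}\n"
    ),
    feedback_value_type=int,
    model="anthropic:/claude-opus-4-1-20250805",
)
```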
Versioning Scorers
Building a reliable scorer takes iterative refinement. Tracking scorer versions helps you maintain and iterate on scorers without losing the history of your changes.
Optimizing Instructions with Human Feedback
LLMs have biases and make mistakes, and relying on biased evaluation leads to bad decisions. Use the automatic judge alignment feature to optimize your instructions to agree with human feedback, powered by state-of-the-art algorithms from DSPy.