Scorer Concepts
What are scorers?
Scorers in MLflow are evaluation functions that assess the output quality of generative AI applications. They provide a systematic way to measure performance along dimensions such as correctness, relevance, safety, and adherence to guidelines.
Scorers turn subjective quality assessments into measurable metrics, enabling you to track performance, compare models, and ensure your application meets quality standards. They range from simple rule-based checks to sophisticated LLM judges that can assess nuance in generated language.
Use Cases
Automated Quality Assessment
Replace manual review processes with automated scoring, using deterministic rules or LLM-based evaluation to assess thousands of outputs consistently and at scale.
Safety and Compliance Validation
Systematically check for harmful content, bias, PII (personally identifiable information) leakage, and regulatory compliance. Ensure your application meets organizational and legal standards before deployment.
A/B Testing and Model Comparison
Compare different models, prompts, or configurations using consistent evaluation criteria. Make data-driven decisions about which approach works best for your use case.
Continuous Quality Monitoring
Track quality metrics in production over time as your application evolves and scales, catch regressions early, and maintain high standards.
Scorer Types
MLflow provides several scorer types to cover different evaluation needs:
Agent-as-a-Judge
Autonomous agents that analyze execution traces, evaluating not just outputs but the entire process. They can assess tool usage, reasoning chains, and error handling.
Human-Aligned Judges
LLM judges that have been aligned with human feedback via the built-in `align()` method to match your specific quality standards. These judges balance the consistency of automation with the nuance of human judgment.
LLM-Based Scorers (LLM-as-a-Judge)
Use large language models to assess subjective qualities such as helpfulness, coherence, and style. These scorers understand context and nuance that rule-based systems miss.
Code-Based Scorers
Custom Python functions for deterministic evaluation. Ideal for metrics that can be computed algorithmically, such as ROUGE scores, exact match, or custom business logic.
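As a minimal sketch, a code-based scorer can be a single decorated function. The example below performs an exact-match check; the `expected_response` key is an illustrative assumption about your evaluation data, not a fixed part of MLflow's API.
from mlflow.genai.scorers import scorer


# Hypothetical example: a deterministic exact-match check.
# Assumes your evaluation data supplies an "expected_response"
# key in the expectations dict.
@scorer
def exact_match(outputs: str, expectations: dict) -> bool:
    return outputs == expectations.get("expected_response")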
Scorer Output Structure
All scorers in MLflow produce standardized output that integrates seamlessly with the evaluation framework. Scorers return an `mlflow.entities.Feedback` object containing:

| Field | Type | Description |
|---|---|---|
| name | str | Unique identifier for the scorer (e.g., "correctness", "safety") |
| value | Any | The evaluation result - can be numeric, boolean, or categorical |
| rationale | Optional[str] | Explanation of why this score was given (especially useful for LLM judges) |
| metadata | Optional[dict] | Additional information about the evaluation (confidence, sub-scores, etc.) |
| error | Optional[str] | Error message if the scorer failed to evaluate |
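For example, a code-based scorer might construct a Feedback directly (a minimal sketch; the field values are illustrative only):
from mlflow.entities import Feedback

# Illustrative values only - demonstrates the fields described above.
feedback = Feedback(
    name="correctness",
    value=True,
    rationale="The response matches the reference answer.",
    metadata={"confidence": 0.9},
)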
Common Scorer Patterns
MLflow's scorer system is highly flexible, supporting everything from simple rule-based checks to sophisticated AI agents that analyze entire execution traces. The examples below illustrate the breadth of evaluation capabilities available - from detecting inefficiencies in multi-step workflows to assessing text readability, measuring response latency, and ensuring output quality. Each pattern can be customized for your specific use case and combined with others for comprehensive evaluation.
- Agent-as-a-Judge (trace analysis)
- LLM judge (field-based)
- Reading level evaluation
- Language perplexity scoring
- Response latency tracking
from mlflow.genai.judges import make_judge
import mlflow
# Create an Agent-as-a-Judge that analyzes execution patterns
efficiency_judge = make_judge(
name="efficiency_analyzer",
instructions=(
"Analyze the {{ trace }} for inefficiencies.\n\n"
"Check for:\n"
"- Redundant API calls or database queries\n"
"- Sequential operations that could be parallelized\n"
"- Unnecessary data processing\n\n"
"Rate as: 'efficient', 'acceptable', or 'inefficient'"
),
model="anthropic:/claude-opus-4-1-20250805",
)
# Example: RAG application with retrieval and generation
from mlflow.entities import SpanType
import time
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_context(query: str):
# Simulate vector database retrieval
time.sleep(0.5) # Retrieval latency
return [
{"doc": "MLflow is an open-source platform", "score": 0.95},
{"doc": "It manages the ML lifecycle", "score": 0.89},
{"doc": "Includes tracking and deployment", "score": 0.87},
]
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_user_history(user_id: str):
# Another retrieval that could be parallelized
time.sleep(0.5) # Could run parallel with above
return {"previous_queries": ["What is MLflow?", "How to log models?"]}
@mlflow.trace(span_type=SpanType.LLM)
def generate_response(query: str, context: list, history: dict):
# Simulate LLM generation
return f"Based on context about '{query}': MLflow is a platform for ML lifecycle management."
@mlflow.trace(span_type=SpanType.AGENT)
def rag_agent(query: str, user_id: str):
# Sequential operations that could be optimized
context = retrieve_context(query)
history = retrieve_user_history(user_id) # Could be parallel with above
response = generate_response(query, context, history)
return response
# Run the RAG agent
result = rag_agent("What is MLflow?", "user123")
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)
# Judge analyzes the trace to identify inefficiencies
feedback = efficiency_judge(trace=trace)
print(f"Efficiency: {feedback.value}")
print(f"Analysis: {feedback.rationale}")
from mlflow.genai.judges import make_judge
correctness_judge = make_judge(
name="correctness",
instructions=(
"Evaluate if the response in {{ outputs }} "
"correctly answers the question in {{ inputs }}."
),
model="anthropic:/claude-opus-4-1-20250805",
)
# Example usage
feedback = correctness_judge(
inputs={"question": "What is MLflow?"},
outputs={
"response": "MLflow is an open-source platform for ML lifecycle management."
},
)
print(f"Correctness: {feedback.value}")
import textstat
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
@scorer
def reading_level(outputs: str) -> Feedback:
"""Evaluate text complexity using Flesch Reading Ease."""
score = textstat.flesch_reading_ease(outputs)
if score >= 60:
level = "easy"
rationale = f"Reading ease score of {score:.1f} - accessible to most readers"
elif score >= 30:
level = "moderate"
rationale = f"Reading ease score of {score:.1f} - college level complexity"
else:
level = "difficult"
rationale = f"Reading ease score of {score:.1f} - expert level required"
return Feedback(value=level, rationale=rationale, metadata={"score": score})
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from mlflow.genai.scorers import scorer
@scorer
def perplexity_score(outputs: str) -> float:
"""Calculate perplexity to measure text quality and coherence."""
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer(outputs, return_tensors="pt")
    with torch.no_grad():
        # Use a distinct name to avoid shadowing the `outputs` string argument
        model_outputs = model(**inputs, labels=inputs["input_ids"])
    perplexity = torch.exp(model_outputs.loss).item()
    return perplexity  # Lower is better - indicates more natural text
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace
@scorer
def response_time(trace: Trace) -> Feedback:
"""Evaluate response time from trace spans."""
root_span = trace.data.spans[0]
latency_ms = (root_span.end_time - root_span.start_time) / 1e6
if latency_ms < 100:
value = "fast"
elif latency_ms < 500:
value = "acceptable"
else:
value = "slow"
return Feedback(
value=value,
rationale=f"Response took {latency_ms:.0f}ms",
metadata={"latency_ms": latency_ms},
)
Judge Alignment
One of the most powerful capabilities of MLflow scorers is the ability to **align LLM judges with human preferences**. This turns generic evaluation models into domain experts that understand your unique quality standards.
How Alignment Works
Judge alignment uses human feedback to improve the accuracy and consistency of LLM-based scorers:
from mlflow.genai.judges import make_judge
import mlflow
# Create an initial judge
quality_judge = make_judge(
name="quality",
instructions="Evaluate if {{ outputs }} meets quality standards for {{ inputs }}.",
model="anthropic:/claude-opus-4-1-20250805",
)
# Collect traces with both judge assessments and human feedback
traces_with_feedback = mlflow.search_traces(
    experiment_ids=[experiment_id],  # ID of the experiment holding your labeled traces
    max_results=20,  # a minimum of 10 traces is required for alignment
)
# Align the judge with human preferences (uses default DSPy-SIMBA optimizer)
aligned_judge = quality_judge.align(traces_with_feedback)
# The aligned judge now better matches your team's quality standards
feedback = aligned_judge(inputs={"query": "..."}, outputs={"response": "..."})
Key Benefits of Alignment
- Domain expertise: judges learn your specific quality standards from expert feedback
- Consistency: aligned judges apply criteria uniformly across evaluations
- Cost efficiency: after alignment, smaller/cheaper models can match expert judgment
- Continuous improvement: re-align as your standards evolve
Plugin Architecture
MLflow's alignment system uses a plugin architecture that lets you create custom optimizers by extending the `AlignmentOptimizer` base class.
from mlflow.genai.judges.base import AlignmentOptimizer
class CustomOptimizer(AlignmentOptimizer):
    def align(self, judge, traces):
        # Your custom alignment logic: analyze the traces (which carry
        # both judge assessments and human feedback) and return an
        # improved judge.
        return improved_judge
# Use your custom optimizer
aligned_judge = quality_judge.align(traces, CustomOptimizer())
Integration with MLflow Evaluation
Scorers are the building blocks of MLflow's evaluation framework. They integrate seamlessly with `mlflow.genai.evaluate()`.
import mlflow
import pandas as pd
# Your test data
test_data = pd.DataFrame(
[
{
"inputs": {"question": "What is MLflow?"},
"outputs": {
"response": "MLflow is an open-source platform for ML lifecycle management."
},
"expectations": {
"ground_truth": "MLflow is an open-source platform for managing the ML lifecycle"
},
},
{
"inputs": {"question": "How do I track experiments?"},
"outputs": {
"response": "Use mlflow.start_run() to track experiments in MLflow."
},
"expectations": {
"ground_truth": "Use mlflow.start_run() to track experiments"
},
},
]
)
# Your application (optional if data already has outputs)
def my_app(inputs):
# Your model logic here
return {"response": f"Answer to: {inputs['question']}"}
# Evaluate with multiple scorers
results = mlflow.genai.evaluate(
data=test_data,
# predict_fn is optional if data already has outputs
scorers=[
correctness_judge, # LLM judge from above
reading_level, # Custom scorer from above
],
)
# Access evaluation metrics
print(f"Correctness: {results.metrics.get('correctness/mean', 'N/A')}")
print(f"Reading Level: {results.metrics.get('reading_level/mode', 'N/A')}")
Best Practices
- Choose the right scorer type
  - Use code-based scorers for objective, deterministic metrics
  - Use LLM judges for subjective qualities that require nuanced understanding
  - Use Agent-as-a-Judge to evaluate complex multi-step processes
- Combine multiple scorers
  - No single metric captures every aspect of quality
  - Use a combination of scorers for comprehensive evaluation
  - Balance efficiency (fast code-based scorers) with depth (LLM and agent judges)
- Align with human judgment
  - Validate that your scorers correlate with human quality assessments
  - Use human feedback to refine the instructions for LLM and agent judges
  - Consider human-aligned judges for critical evaluations
- Monitor scorer performance
  - Track scorer execution time and cost
  - Monitor scorer failures and handle them gracefully (see the sketch after this list)
  - Periodically review scorer outputs for consistency
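As a sketch of the last point (hypothetical code; `fragile_quality_check` is a stand-in for your own logic), a scorer can catch its own failures and report them through the Feedback error field described earlier, rather than aborting the evaluation run:
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback


@scorer
def guarded_quality(outputs: str) -> Feedback:
    """Wrap a fragile check so failures surface as Feedback errors."""
    try:
        value = fragile_quality_check(outputs)  # hypothetical helper
        return Feedback(value=value)
    except Exception as e:
        # Report the failure instead of crashing the evaluation run
        return Feedback(value=None, error=str(e))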