Scorer Concepts
What are scorers?
Scorers in MLflow are evaluation functions that assess the output quality of generative AI applications. They provide a systematic way to measure performance along dimensions such as correctness, relevance, safety, and adherence to guidelines.
Scorers turn subjective quality assessments into measurable metrics, enabling you to track performance, compare models, and ensure your application meets quality standards. They range from simple rule-based checks to sophisticated LLM judges that can assess nuance in generated language.
Use Cases
Automated Quality Assessment
Replace manual review processes with automated scoring, using deterministic rules or LLM-based evaluation to assess thousands of outputs consistently and at scale.
Safety and Compliance Validation
Systematically check for harmful content, bias, PII (personally identifiable information) leakage, and regulatory compliance. Ensure your application meets organizational and legal standards before deployment.
A/B Testing and Model Comparison
Compare different models, prompts, or configurations using consistent evaluation criteria. Make data-driven decisions about which approach works best for your use case.
Continuous Quality Monitoring
Track quality metrics in production over time as your application evolves and scales, catch regressions early, and maintain high standards.
Scorer Types
MLflow provides several scorer types to cover different evaluation needs:
Agent-as-a-Judge
Autonomous agents that analyze execution traces, evaluating not just outputs but the entire process. They can assess tool usage, reasoning chains, and error handling.
Human-Aligned Judges
LLM judges that have been aligned with human feedback via the built-in `align()` method to match your specific quality standards. These judges balance the consistency of automation with the nuance of human judgment.
LLM-Based Scorers (LLM-as-a-Judge)
Use large language models to assess subjective qualities such as helpfulness, coherence, and style. These scorers understand context and nuance that rule-based systems miss.
Code-Based Scorers
Custom Python functions for deterministic evaluation. Ideal for metrics that can be computed algorithmically, such as ROUGE scores, exact match, or custom business logic.
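As a minimal sketch, a code-based scorer can be a single decorated function. The example below performs an exact-match check; the `expected_response` key is an illustrative assumption about your evaluation data, not a fixed part of MLflow's API.
from mlflow.genai.scorers import scorer


# Hypothetical example: a deterministic exact-match check.
# Assumes your evaluation data supplies an "expected_response"
# key in the expectations dict.
@scorer
def exact_match(outputs: str, expectations: dict) -> bool:
    return outputs == expectations.get("expected_response")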
Scorer Output Structure
All scorers in MLflow produce standardized output that integrates seamlessly with the evaluation framework. Scorers return an `mlflow.entities.Feedback` object containing:

| Field | Type | Description |
|---|---|---|
| name | str | Unique identifier for the scorer (e.g., "correctness", "safety") |
| value | Any | The evaluation result - can be numeric, boolean, or categorical |
| rationale | Optional[str] | Explanation of why this score was given (especially useful for LLM judges) |
| metadata | Optional[dict] | Additional information about the evaluation (confidence, sub-scores, etc.) |
| error | Optional[str] | Error message if the scorer failed to evaluate |
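For example, a code-based scorer might construct a Feedback directly (a minimal sketch; the field values are illustrative only):
from mlflow.entities import Feedback

# Illustrative values only - demonstrates the fields described above.
feedback = Feedback(
    name="correctness",
    value=True,
    rationale="The response matches the reference answer.",
    metadata={"confidence": 0.9},
)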
Common Scorer Patterns
MLflow's scorer system is highly flexible, supporting everything from simple rule-based checks to sophisticated AI agents that analyze entire execution traces. The examples below illustrate the breadth of evaluation capabilities available - from detecting inefficiencies in multi-step workflows to assessing text readability, measuring response latency, and ensuring output quality. Each pattern can be customized for your specific use case and combined with others for comprehensive evaluation.
- Agent-as-a-Judge (trace analysis)
- LLM judge (field-based)
- Reading level evaluation
- Language perplexity scoring
- Response latency tracking
from mlflow.genai.judges import make_judge
import mlflow
# Create an Agent-as-a-Judge that analyzes execution patterns
efficiency_judge = make_judge(
name="efficiency_analyzer",
instructions=(
"Analyze the {{ trace }} for inefficiencies.\n\n"
"Check for:\n"
"- Redundant API calls or database queries\n"
"- Sequential operations that could be parallelized\n"
"- Unnecessary data processing\n\n"
"Rate as: 'efficient', 'acceptable', or 'inefficient'"
),
model="anthropic:/claude-opus-4-1-20250805",
)
# Example: RAG application with retrieval and generation
from mlflow.entities import SpanType
import time
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_context(query: str):
# Simulate vector database retrieval
time.sleep(0.5) # Retrieval latency
return [
{"doc": "MLflow is an open-source platform", "score": 0.95},
{"doc": "It manages the ML lifecycle", "score": 0.89},
{"doc": "Includes tracking and deployment", "score": 0.87},
]
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_user_history(user_id: str):
# Another retrieval that could be parallelized
time.sleep(0.5) # Could run parallel with above
return {"previous_queries": ["What is MLflow?", "How to log models?"]}
@mlflow.trace(span_type=SpanType.LLM)
def generate_response(query: str, context: list, history: dict):
# Simulate LLM generation
return f"Based on context about '{query}': MLflow is a platform for ML lifecycle management."
@mlflow.trace(span_type=SpanType.AGENT)
def rag_agent(query: str, user_id: str):
# Sequential operations that could be optimized
context = retrieve_context(query)
history = retrieve_user_history(user_id) # Could be parallel with above
response = generate_response(query, context, history)
return response
# Run the RAG agent
result = rag_agent("What is MLflow?", "user123")
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)
# Judge analyzes the trace to identify inefficiencies
feedback = efficiency_judge(trace=trace)
print(f"Efficiency: {feedback.value}")
print(f"Analysis: {feedback.rationale}")
from mlflow.genai.judges import make_judge
correctness_judge = make_judge(
name="correctness",
instructions=(
"Evaluate if the response in {{ outputs }} "
"correctly answers the question in {{ inputs }}."
),
model="anthropic:/claude-opus-4-1-20250805",
)
# Example usage
feedback = correctness_judge(
inputs={"question": "What is MLflow?"},
outputs={
"response": "MLflow is an open-source platform for ML lifecycle management."
},
)
print(f"Correctness: {feedback.value}")
import textstat
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
@scorer
def reading_level(outputs: str) -> Feedback:
"""Evaluate text complexity using Flesch Reading Ease."""
score = textstat.flesch_reading_ease(outputs)
if score >= 60:
level = "easy"
rationale = f"Reading ease score of {score:.1f} - accessible to most readers"
elif score >= 30:
level = "moderate"
rationale = f"Reading ease score of {score:.1f} - college level complexity"
else:
level = "difficult"
rationale = f"Reading ease score of {score:.1f} - expert level required"
return Feedback(value=level, rationale=rationale, metadata={"score": score})
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from mlflow.genai.scorers import scorer
@scorer
def perplexity_score(outputs: str) -> float:
"""Calculate perplexity to measure text quality and coherence."""
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer(outputs, return_tensors="pt")
    with torch.no_grad():
        # Use a distinct name to avoid shadowing the `outputs` string argument
        model_outputs = model(**inputs, labels=inputs["input_ids"])
    perplexity = torch.exp(model_outputs.loss).item()
    return perplexity  # Lower is better - indicates more natural text
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace
@scorer
def response_time(trace: Trace) -> Feedback:
"""Evaluate response time from trace spans."""
root_span = trace.data.spans[0]
latency_ms = (root_span.end_time - root_span.start_time) / 1e6
if latency_ms < 100:
value = "fast"
elif latency_ms < 500:
value = "acceptable"
else:
value = "slow"
return Feedback(
value=value,
rationale=f"Response took {latency_ms:.0f}ms",
metadata={"latency_ms": latency_ms},
)
Judge Alignment
One of the most powerful capabilities of MLflow scorers is the ability to **align LLM judges with human preferences**. This turns generic evaluation models into domain experts that understand your unique quality standards.
How Alignment Works
Judge alignment uses human feedback to improve the accuracy and consistency of LLM-based scorers:
from mlflow.genai.judges import make_judge
import mlflow
# Create an initial judge
quality_judge = make_judge(
name="quality",
instructions="Evaluate if {{ outputs }} meets quality standards for {{ inputs }}.",
model="anthropic:/claude-opus-4-1-20250805",
)
# Collect traces with both judge assessments and human feedback
traces_with_feedback = mlflow.search_traces(
    experiment_ids=[experiment_id],  # ID of the experiment holding your labeled traces
    max_results=20,  # a minimum of 10 traces is required for alignment
)
# Align the judge with human preferences (uses default DSPy-SIMBA optimizer)
aligned_judge = quality_judge.align(traces_with_feedback)
# The aligned judge now better matches your team's quality standards
feedback = aligned_judge(inputs={"query": "..."}, outputs={"response": "..."})
Key Benefits of Alignment
- Domain expertise: judges learn your specific quality standards from expert feedback
- Consistency: aligned judges apply criteria uniformly across evaluations
- Cost efficiency: after alignment, smaller/cheaper models can match expert judgment
- Continuous improvement: re-align as your standards evolve
Plugin Architecture
MLflow's alignment system uses a plugin architecture that lets you create custom optimizers by extending the `AlignmentOptimizer` base class.
from mlflow.genai.judges.base import AlignmentOptimizer
class CustomOptimizer(AlignmentOptimizer):
    def align(self, judge, traces):
        # Your custom alignment logic: analyze the traces (which carry
        # both judge assessments and human feedback) and return an
        # improved judge.
        return improved_judge
# Use your custom optimizer
aligned_judge = quality_judge.align(traces, CustomOptimizer())
Integration with MLflow Evaluation
Scorers are the building blocks of MLflow's evaluation framework. They integrate seamlessly with `mlflow.genai.evaluate()`.
import mlflow
import pandas as pd
# Your test data
test_data = pd.DataFrame(
[
{
"inputs": {"question": "What is MLflow?"},
"outputs": {
"response": "MLflow is an open-source platform for ML lifecycle management."
},
"expectations": {
"ground_truth": "MLflow is an open-source platform for managing the ML lifecycle"
},
},
{
"inputs": {"question": "How do I track experiments?"},
"outputs": {
"response": "Use mlflow.start_run() to track experiments in MLflow."
},
"expectations": {
"ground_truth": "Use mlflow.start_run() to track experiments"
},
},
]
)
# Your application (optional if data already has outputs)
def my_app(inputs):
# Your model logic here
return {"response": f"Answer to: {inputs['question']}"}
# Evaluate with multiple scorers
results = mlflow.genai.evaluate(
data=test_data,
# predict_fn is optional if data already has outputs
scorers=[
correctness_judge, # LLM judge from above
reading_level, # Custom scorer from above
],
)
# Access evaluation metrics
print(f"Correctness: {results.metrics.get('correctness/mean', 'N/A')}")
print(f"Reading Level: {results.metrics.get('reading_level/mode', 'N/A')}")
Best Practices
- Choose the right scorer type
  - Use code-based scorers for objective, deterministic metrics
  - Use LLM judges for subjective qualities that require nuanced understanding
  - Use Agent-as-a-Judge to evaluate complex multi-step processes
- Combine multiple scorers
  - No single metric captures every aspect of quality
  - Use a combination of scorers for comprehensive evaluation
  - Balance efficiency (fast code-based scorers) with depth (LLM and agent judges)
- Align with human judgment
  - Validate that your scorers correlate with human quality assessments
  - Use human feedback to refine the instructions for LLM and agent judges
  - Consider human-aligned judges for critical evaluations
- Monitor scorer performance
  - Track scorer execution time and cost
  - Monitor scorer failures and handle them gracefully (see the sketch after this list)
  - Periodically review scorer outputs for consistency
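As a sketch of the last point (hypothetical code; `fragile_quality_check` is a stand-in for your own logic), a scorer can catch its own failures and report them through the Feedback error field described earlier, rather than aborting the evaluation run:
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback


@scorer
def guarded_quality(outputs: str) -> Feedback:
    """Wrap a fragile check so failures surface as Feedback errors."""
    try:
        value = fragile_quality_check(outputs)  # hypothetical helper
        return Feedback(value=value)
    except Exception as e:
        # Report the failure instead of crashing the evaluation run
        return Feedback(value=None, error=str(e))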