Agent-based Scorers (a.k.a. Agent-as-a-Judge)

Agent-as-a-Judge represents a major paradigm shift in LLM evaluation. Rather than simply assessing inputs and outputs, these judges act as **autonomous agents** equipped with tools to deeply investigate your application's execution.

How It Works

Agent-as-a-Judge uses the following tools to investigate traces logged to the MLflow backend. These tools let the judge explore your application's execution systematically, like an experienced debugger.

| Tool | Description |
| --- | --- |
| GetTraceInfo | Retrieves high-level information about a trace, including timing, status, and metadata. |
| ListSpans | Lists all spans in a trace with their hierarchy, timing, and basic attributes. |
| GetSpan | Fetches detailed information about a specific span, including inputs, outputs, and custom attributes. |
| SearchTraceRegex | Searches across all span data for patterns using regular expressions. |
Why not just pass the trace to the LLM directly?

While that works for simple cases, traces from real-world applications are often large and complex. Passing an entire trace to the LLM quickly exceeds context-window limits and degrades the judge's accuracy. The agentic approach instead uses tools to explore the trace structure and fetch only the details it needs, without consuming the context window.
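
To make the context-window concern concrete, the sketch below estimates the token footprint of a serialized trace. It assumes a trace has already been logged, serializes it with `Trace.to_json()`, and applies a rough heuristic of about 4 characters per token; both the serialization choice and the ratio are illustrative assumptions, not exact measurements.

```python
import mlflow

# Fetch the most recent trace (assumes your app has already logged one).
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())

# Serialize and apply a rough ~4-characters-per-token heuristic.
serialized = trace.to_json()
approx_tokens = len(serialized) // 4
print(f"Serialized trace: {len(serialized):,} chars (~{approx_tokens:,} tokens)")
# Multi-span production traces can easily exceed a model's context window,
# which is why the judge fetches only the spans it needs via tools.
```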

Comparison with LLM-as-a-Judge

Which approach to use depends on where you are in the development lifecycle:

| Aspect | Agent-as-a-Judge | LLM-as-a-Judge |
| --- | --- | --- |
| Ease of setup | Simple: just describe what to investigate | Requires careful prompt engineering and refinement |
| What it evaluates | Full execution traces and trajectories | Specific input and output fields |
| Performance | Slower (explores the trace in detail) | Fast execution |
| Cost | Higher (more context and tool usage) | Lower (less context) |

When to Use Agent-as-a-Judge

Agent-as-a-Judge is best suited for **bootstrapping** the evaluation flywheel:

  • Starting a new application
  • Revising and refining your agent
  • Identifying failure modes
  • Understanding unexpected behavior

When to Use LLM-as-a-Judge

LLM-as-a-Judge is more efficient at evaluating specific criteria, making it a better fit for **continuous evaluation** and **production use**:

  • Production monitoring
  • Regression testing
  • Final validation before deployment
  • Meeting specific quality expectations

Quickstart

To create an Agent-as-a-Judge, simply call the `make_judge` API with instructions that contain the **`{{ trace }}`** template variable:

```python
import mlflow
from mlflow.genai.judges import make_judge
import time

performance_judge = make_judge(
    name="performance_analyzer",
    instructions=(
        "Analyze the {{ trace }} for performance issues.\n\n"
        "Check for:\n"
        "- Operations taking longer than 2 seconds\n"
        "- Redundant API calls or database queries\n"
        "- Inefficient data processing patterns\n"
        "- Proper use of caching mechanisms\n\n"
        "Rate as: 'optimal', 'acceptable', or 'needs_improvement'"
    ),
    model="openai:/gpt-5",
    # model="anthropic:/claude-opus-4-1-20250805",
)
```
Note

Using the `{{ trace }}` template variable is essential. If the instructions do not include `{{ trace }}`, MLflow assumes the scorer is a regular LLM-as-a-Judge and will not use the MCP tools.
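
For contrast, here is a minimal sketch of a plain LLM-as-a-Judge built with the same API. Because the instructions reference only the `{{ inputs }}` and `{{ outputs }}` template variables (the judge name and instruction wording here are illustrative), MLflow evaluates those fields directly rather than exploring a trace with tools:

```python
from mlflow.genai.judges import make_judge

# Field-based judge: no {{ trace }} variable, so no MCP trace tools are used.
correctness_judge = make_judge(
    name="correctness",
    instructions=(
        "Evaluate whether the {{ outputs }} correctly answer the {{ inputs }}.\n"
        "Rate as: 'correct' or 'incorrect'"
    ),
    model="openai:/gpt-5",
)
```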

Next, generate a trace from your application and pass it to the judge:

```python
@mlflow.trace
def slow_data_processor(query: str):
    """Example application with performance bottlenecks."""
    with mlflow.start_span("fetch_data") as span:
        time.sleep(2.5)
        span.set_inputs({"query": query})
        span.set_outputs({"data": ["item1", "item2", "item3"]})

    with mlflow.start_span("process_data") as span:
        for i in range(3):
            with mlflow.start_span(f"redundant_api_call_{i}"):
                time.sleep(0.5)
        span.set_outputs({"processed": "results"})

    return "Processing complete"


result = slow_data_processor("SELECT * FROM users")
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)

feedback = performance_judge(trace=trace)

print(f"Performance Rating: {feedback.value}")
print(f"Analysis: {feedback.rationale}")
```
The judge explores the trace and returns structured feedback:

```
Performance Rating: needs_improvement
Analysis: Found critical performance issues:
1. The 'fetch_data' span took 2.5 seconds, exceeding the 2-second threshold
2. Detected 3 redundant API calls (redundant_api_call_0, redundant_api_call_1,
   redundant_api_call_2) that appear to be duplicate operations
3. Total execution time of 4 seconds could be optimized by parallelizing
   the redundant operations or implementing caching
```
*Figure: Agent-as-a-Judge evaluation results.*

Running the Judge on Batches of Traces

To apply the judge to a batch of traces, use the `mlflow.genai.evaluate` API:

```python
import mlflow

# Retrieve traces from MLflow
traces = mlflow.search_traces(filter_string="timestamp > 1727174400000")

# Run evaluation with Agent-as-a-Judge
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[performance_judge],
)
```
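
On a busy tracking server you will usually want to narrow the batch first. A minimal variation, assuming a placeholder experiment ID of `"1"`; `experiment_ids`, `filter_string`, and `max_results` are all standard parameters of `mlflow.search_traces`:

```python
# Scope the batch to a single experiment and cap its size before evaluating.
traces = mlflow.search_traces(
    experiment_ids=["1"],  # placeholder; substitute your experiment ID
    filter_string="timestamp > 1727174400000",
    max_results=100,
)
```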

Advanced Example

The following judge inspects how your agent uses its tools and provides concrete optimization suggestions:

```python
tool_optimization_judge = make_judge(
    name="tool_optimizer",
    instructions=(
        "Analyze tool usage patterns in {{ trace }}.\n\n"
        "Check for:\n"
        "1. Unnecessary tool calls (could be answered without tools)\n"
        "2. Wrong tool selection (better tool available)\n"
        "3. Inefficient sequencing (could parallelize or reorder)\n"
        "4. Missing tool usage (should have used a tool)\n\n"
        "Provide specific optimization suggestions.\n"
        "Rate efficiency as: 'optimal', 'good', 'suboptimal', or 'poor'"
    ),
    model="anthropic:/claude-opus-4-1-20250805",
)
```
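
Invoking it works the same way as in the quickstart, reusing the `trace` captured earlier:

```python
feedback = tool_optimization_judge(trace=trace)
print(f"Tool Efficiency: {feedback.value}")
print(f"Suggestions: {feedback.rationale}")
```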

Debugging Agent Judges

To see which MCP tools the Agent-as-a-Judge actually calls while analyzing your trace, enable debug logging:

```python
import logging

# Enable debug logging to see agent tool calls
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("mlflow.genai.judges")
logger.setLevel(logging.DEBUG)

# Now when you run the judge, you'll see detailed tool usage
feedback = performance_judge(trace=trace)
```

With debug logging enabled, you will see output similar to the following:

```
DEBUG:mlflow.genai.judges:Calling tool: GetTraceInfo
DEBUG:mlflow.genai.judges:Tool response: {"trace_id": "abc123", "duration_ms": 4000, ...}
DEBUG:mlflow.genai.judges:Calling tool: ListSpans
DEBUG:mlflow.genai.judges:Tool response: [{"span_id": "def456", "name": "fetch_data", ...}]
DEBUG:mlflow.genai.judges:Calling tool: GetSpan with span_id=def456
DEBUG:mlflow.genai.judges:Tool response: {"duration_ms": 2500, "inputs": {"query": "SELECT * FROM users"}, ...}
```

Next Steps