
Compare App Versions with Traces

Objective version comparison drives successful GenAI development. MLflow's trace-based comparison lets you analyze performance differences, validate improvements, and make data-driven deployment decisions as your application iterates.

Why Trace-Based Comparison Works

Complete Execution Context

Traces capture the full application flow, including inputs, outputs, intermediate steps, and performance metrics, enabling comprehensive analysis.

Objective Performance Metrics

Compare latency, token usage, error rates, and quality metrics across versions using precise, measurable data.

Detect Subtle Regressions

Identify performance degradations or behavioral changes that would not be obvious from a simple input/output comparison.

Development Decision Support

Make data-driven decisions about whether to ship an improvement, iterate further, or try a different approach, based on trace analysis.
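To make "complete execution context" concrete, the sketch below traces a trivial placeholder function and then walks the spans MLflow captured. This is a minimal sketch, assuming an MLflow 3.x environment where mlflow.get_last_active_trace_id and mlflow.get_trace are available; the answer function is a stand-in for your app, and attribute names such as execution_time_ms can vary slightly between MLflow versions.

import mlflow


@mlflow.trace
def answer(question: str) -> str:
    """Placeholder app; any traced function works the same way."""
    return f"You asked: {question}"


answer("Where is my order?")

# Fetch the trace produced by the call above and inspect what it captured
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)

print(f"Execution time: {trace.info.execution_time_ms} ms")
for span in trace.data.spans:
    # Every span records its own inputs, outputs, and timing
    print(span.name, span.inputs, span.outputs)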

Generating Traces for Different Versions

Create traces for multiple application versions to enable systematic comparison:

import mlflow
import openai


# Name each trace so the two versions can be told apart when searching later
@mlflow.trace(name="customer_support_v1")
def basic_agent(question: str) -> str:
    """Basic customer support agent."""
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": question},
        ],
        temperature=0.3,
        max_tokens=100,
    )

    return response.choices[0].message.content


@mlflow.trace(name="customer_support_v2")
def empathetic_agent(question: str) -> str:
    """Enhanced customer support agent with empathetic prompting."""
    client = openai.OpenAI()

    # Enhanced system prompt
    system_prompt = """You are a caring and empathetic customer support agent.
    Always acknowledge the customer's feelings before providing solutions.
    Use phrases like 'I understand how frustrating this must be'.
    Provide clear, actionable steps with a warm, supportive tone."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
        max_tokens=150,
    )

    return response.choices[0].message.content


print("✅ Agent functions ready for comparison")

Generate comparable traces by running both versions on the same inputs:

# Test scenarios for a fair comparison
test_questions = [
    "How can I track my package?",
    "What's your return policy?",
    "I need help with my account login",
    "My order arrived damaged, what should I do?",
    "Can I cancel my subscription?",
]

print("🔄 Generating traces for version comparison...")

# Run both versions on the same inputs
for i, question in enumerate(test_questions):
    print(f"Testing scenario {i+1}: {question[:30]}...")

    # Generate trace for v1
    v1_response = basic_agent(question)

    # Generate trace for v2
    v2_response = empathetic_agent(question)

    print(f"  V1 response: {v1_response[:50]}...")
    print(f"  V2 response: {v2_response[:50]}...")

print(f"\n✅ Generated {len(test_questions) * 2} traces for comparison")

Systematic Trace-Based Version Analysis

Trace Collection and Filtering

Use MLflow's search_traces API to collect traces and filter them by version metadata, enabling precise version-to-version comparison (see the tagging sketch after this list).

Performance Metrics Analysis

Extract and compare execution times, token usage, and quality metrics from traces to identify performance improvements or regressions.

Automated Deployment Logic

Build quality gates that automatically analyze trace metrics and determine deployment readiness against performance thresholds (a sketch appears at the end of this section).
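In this tutorial the two versions are distinguished by their trace names. For production apps, an explicit version tag is a more robust filter key. A hedged sketch, assuming mlflow.update_current_trace accepts a tags dictionary in your MLflow version; the app_version tag name and versioned_agent wrapper are illustrative, not part of the example above:

@mlflow.trace
def versioned_agent(question: str) -> str:
    # Tag the active trace so later searches can filter on the version,
    # e.g. filter_string="tags.app_version = 'v2.0'"
    mlflow.update_current_trace(tags={"app_version": "v2.0"})
    return empathetic_agent(question)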

Collect and Analyze Version Traces

Use search_traces to compare version performance systematically:

from datetime import datetime, timedelta

# Search for traces from the last hour (adjust the timeframe as needed).
# Trace search filters on epoch milliseconds, not ISO-formatted strings;
# some MLflow versions expect an "attributes." prefix on the field name.
recent_ms = int((datetime.now() - timedelta(hours=1)).timestamp() * 1000)

# Get traces for both versions; search_traces returns a pandas DataFrame
# with one row per trace and the Trace object in the "trace" column
all_traces = mlflow.search_traces(
    filter_string=f"timestamp_ms > {recent_ms}", max_results=100
)

print(f"Found {len(all_traces)} recent traces\n")

# Separate traces by version using the trace names set by @mlflow.trace
v1_traces = []
v2_traces = []

for trace in all_traces["trace"]:
    root_span_name = trace.data.spans[0].name

    if "customer_support_v1" in root_span_name:
        v1_traces.append(trace)
    elif "customer_support_v2" in root_span_name:
        v2_traces.append(trace)

print(f"Version 1 traces: {len(v1_traces)}")
print(f"Version 2 traces: {len(v2_traces)}")


# Calculate performance metrics for each version
def analyze_traces(traces, version_name):
    """Extract key metrics from a list of Trace objects."""
    if not traces:
        return {}

    execution_times = []
    response_lengths = []

    for trace in traces:
        # Extract execution time (in milliseconds)
        execution_times.append(trace.info.execution_time_ms or 0)

        # Extract response length from the root span's output
        spans = trace.data.spans
        if spans:
            output = spans[0].outputs
            response_lengths.append(len(str(output)) if output else 0)

    return {
        "version": version_name,
        "trace_count": len(traces),
        "avg_execution_time_ms": sum(execution_times) / len(execution_times)
        if execution_times
        else 0,
        "avg_response_length": sum(response_lengths) / len(response_lengths)
        if response_lengths
        else 0,
        "min_execution_time_ms": min(execution_times) if execution_times else 0,
        "max_execution_time_ms": max(execution_times) if execution_times else 0,
    }


# Analyze both versions
v1_metrics = analyze_traces(v1_traces, "v1.0")
v2_metrics = analyze_traces(v2_traces, "v2.0")

print("\n📊 Version Performance Comparison:")
print(
    f"V1 - Avg Execution: {v1_metrics['avg_execution_time_ms']:.1f}ms, "
    f"Avg Response: {v1_metrics['avg_response_length']:.0f} chars"
)
print(
    f"V2 - Avg Execution: {v2_metrics['avg_execution_time_ms']:.1f}ms, "
    f"Avg Response: {v2_metrics['avg_response_length']:.0f} chars"
)

This gives you a clear, data-driven view of how your application versions compare on performance and reliability. You can use these metrics to make an informed decision about which version to deploy or whether to iterate further.
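As a concrete version of the automated deployment logic described earlier, the sketch below turns the analyze_traces output into a simple quality gate. The two thresholds are illustrative assumptions, not recommendations; tune them to your own latency and quality requirements.

def quality_gate(v1_metrics: dict, v2_metrics: dict) -> bool:
    """Decide whether v2 is ready to deploy, based on trace metrics."""
    if not v1_metrics or not v2_metrics:
        return False  # not enough traces to make a call

    # Gate 1: v2 must not be more than 50% slower than v1 (assumed threshold)
    latency_ok = (
        v2_metrics["avg_execution_time_ms"]
        <= v1_metrics["avg_execution_time_ms"] * 1.5
    )

    # Gate 2: v2 responses must not collapse to near-empty outputs
    length_ok = (
        v2_metrics["avg_response_length"] >= v1_metrics["avg_response_length"] * 0.8
    )

    return latency_ok and length_ok


if quality_gate(v1_metrics, v2_metrics):
    print("🚀 V2 passes the quality gates - ready to deploy")
else:
    print("🔁 V2 needs further iteration before deployment")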

Next Steps