
Comparing App Versions with Tracing

Targeted version comparison is a driving force behind successful GenAI development. MLflow's trace-based comparison lets you analyze performance differences, validate improvements, and make data-driven deployment decisions as you iterate on your application.

Why Trace-Based Comparison Works

Complete execution context

Traces capture the full application flow - inputs, outputs, intermediate steps, and performance metrics - enabling comprehensive analysis.

Objective performance metrics

Compare latency, token usage, error rates, and quality metrics across versions using precise, measurable data.

Subtle regression detection

Identify performance degradations or behavioral changes that would not surface in a simple input/output comparison.

Development decision support

Make data-driven decisions about when to ship an improvement, iterate further, or try a different approach, grounded in trace analysis.

Generating Traces for Different Versions

Create traces for multiple application versions so they can be compared systematically:

python
import mlflow
import openai


# Name the trace explicitly so each version can be filtered by name later
@mlflow.trace(name="customer_support_v1")
def basic_agent(question: str) -> str:
    """Basic customer support agent."""
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": question},
        ],
        temperature=0.3,
        max_tokens=100,
    )

    return response.choices[0].message.content


@mlflow.trace(name="customer_support_v2")
def empathetic_agent(question: str) -> str:
    """Enhanced customer support agent with empathetic prompting."""
    client = openai.OpenAI()

    # Enhanced system prompt
    system_prompt = """You are a caring and empathetic customer support agent.
Always acknowledge the customer's feelings before providing solutions.
Use phrases like 'I understand how frustrating this must be'.
Provide clear, actionable steps with a warm, supportive tone."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
        max_tokens=150,
    )

    return response.choices[0].message.content


print("✅ Agent functions ready for comparison")

Generate comparable traces by running both versions on the same inputs:

python
# Test scenarios for fair comparison
test_questions = [
    "How can I track my package?",
    "What's your return policy?",
    "I need help with my account login",
    "My order arrived damaged, what should I do?",
    "Can I cancel my subscription?",
]

print("🔄 Generating traces for version comparison...")

# Run both versions on the same inputs
for i, question in enumerate(test_questions):
    print(f"Testing scenario {i+1}: {question[:30]}...")

    # Generate trace for v1
    v1_response = basic_agent(question)

    # Generate trace for v2
    v2_response = empathetic_agent(question)

    print(f"  V1 response: {v1_response[:50]}...")
    print(f"  V2 response: {v2_response[:50]}...")

print(f"\n✅ Generated {len(test_questions) * 2} traces for comparison")

Systematic Trace-Based Version Analysis

Trace collection and filtering

Use MLflow's search_traces API to collect and filter traces by version metadata for precise version-to-version comparison (see the tagging sketch after this list).

Performance metric analysis

Extract and compare execution times, token usage, and quality indicators from traces to identify performance improvements or regressions.

Automated deployment logic

Build quality gates that automatically analyze trace metrics and determine deployment readiness against performance thresholds (a sketch closes out this page).
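
As a minimal sketch of version-metadata filtering (assuming an MLflow version that supports mlflow.update_current_trace and tag filters in search_traces; the app_version tag key is an illustrative choice, not a reserved MLflow name), you can tag each trace with its version and filter on the tag instead of the trace name:

python
import mlflow


@mlflow.trace
def versioned_agent(question: str) -> str:
    """Wrapper that records the app version as a trace tag."""
    # "app_version" is an illustrative tag key, not a reserved MLflow name
    mlflow.update_current_trace(tags={"app_version": "v2.0"})
    return empathetic_agent(question)


# Later, filter directly on the version tag instead of parsing trace names
v2_tagged_traces = mlflow.search_traces(
    filter_string="tags.app_version = 'v2.0'", max_results=100
)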

Collecting and Analyzing Version Traces

Use search_traces to compare version performance systematically:

python
from datetime import datetime, timedelta

# Search for traces from the last hour (adjust timeframe as needed).
# Trace timestamps are epoch milliseconds; filter syntax can vary slightly
# across MLflow versions.
recent_time_ms = int((datetime.now() - timedelta(hours=1)).timestamp() * 1000)

# Get traces for both versions. Recent MLflow versions return a list of
# Trace objects here; older versions returned a pandas DataFrame instead.
all_traces = mlflow.search_traces(
    filter_string=f"attributes.timestamp_ms > {recent_time_ms}", max_results=100
)

print(f"Found {len(all_traces)} recent traces\n")

# Separate traces by version using trace metadata
v1_traces = []
v2_traces = []

for trace in all_traces:
    # Parse trace data to get metadata
    trace_data = trace if isinstance(trace, dict) else trace.to_dict()

    # The trace name is stored in the mlflow.traceName tag
    info = trace_data.get("info", {})
    trace_name = info.get("tags", {}).get("mlflow.traceName", info.get("name", ""))

    if "customer_support_v1" in trace_name:
        v1_traces.append(trace_data)
    elif "customer_support_v2" in trace_name:
        v2_traces.append(trace_data)

print(f"Version 1 traces: {len(v1_traces)}")
print(f"Version 2 traces: {len(v2_traces)}")


# Calculate performance metrics for each version
def analyze_traces(traces, version_name):
    """Extract key metrics from a list of traces."""
    if not traces:
        return {}

    execution_times = []
    response_lengths = []

    for trace in traces:
        # Extract execution time (in milliseconds)
        exec_time = trace.get("info", {}).get("execution_time_ms", 0)
        execution_times.append(exec_time)

        # Extract response length from spans
        spans = trace.get("data", {}).get("spans", [])
        if spans:
            # Get the root span's output
            root_span = spans[0]
            output = root_span.get("outputs", "")
            response_lengths.append(len(str(output)) if output else 0)

    return {
        "version": version_name,
        "trace_count": len(traces),
        "avg_execution_time_ms": sum(execution_times) / len(execution_times)
        if execution_times
        else 0,
        "avg_response_length": sum(response_lengths) / len(response_lengths)
        if response_lengths
        else 0,
        "min_execution_time_ms": min(execution_times) if execution_times else 0,
        "max_execution_time_ms": max(execution_times) if execution_times else 0,
    }


# Analyze both versions
v1_metrics = analyze_traces(v1_traces, "v1.0")
v2_metrics = analyze_traces(v2_traces, "v2.0")

print("\n📊 Version Performance Comparison:")
print(
    f"V1 - Avg Execution: {v1_metrics['avg_execution_time_ms']:.1f}ms, Avg Response: {v1_metrics['avg_response_length']:.0f} chars"
)
print(
    f"V2 - Avg Execution: {v2_metrics['avg_execution_time_ms']:.1f}ms, Avg Response: {v2_metrics['avg_response_length']:.0f} chars"
)

This gives you a clear, data-driven view of how your application versions compare on performance and reliability, and a basis for an informed decision about which version to deploy or iterate on further.
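
To close the loop with automated deployment logic, here is a minimal quality-gate sketch built on the v1_metrics and v2_metrics computed above. The thresholds and the deployment_gate helper are illustrative assumptions for this example, not an MLflow API:

python
# Minimal quality-gate sketch (illustrative thresholds, not an MLflow API)
MAX_LATENCY_REGRESSION = 1.25  # allow up to 25% slower average execution
MIN_RESPONSE_LENGTH_RATIO = 1.0  # require responses at least as detailed


def deployment_gate(baseline: dict, candidate: dict) -> bool:
    """Return True if the candidate version passes the quality gate."""
    if not baseline or not candidate:
        return False

    latency_ratio = (
        candidate["avg_execution_time_ms"] / baseline["avg_execution_time_ms"]
        if baseline["avg_execution_time_ms"]
        else float("inf")
    )
    length_ratio = (
        candidate["avg_response_length"] / baseline["avg_response_length"]
        if baseline["avg_response_length"]
        else 0
    )

    return (
        latency_ratio <= MAX_LATENCY_REGRESSION
        and length_ratio >= MIN_RESPONSE_LENGTH_RATIO
    )


if deployment_gate(v1_metrics, v2_metrics):
    print("🚀 V2 passes the quality gate - ready to deploy")
else:
    print("🛑 V2 fails the quality gate - iterate before deploying")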

Next Steps