
Compare app versions

After evaluating your customer support agent, let's make an improvement and compare performance across versions. This guide shows how to track changes and objectively compare different versions of your application.

Create an improved version

Building on the customer support agent from the previous sections, let's improve it by adding a more empathetic tone to the prompt:

import mlflow
import subprocess
import openai

# Get current git commit for the new version
try:
    git_commit = (
        subprocess.check_output(["git", "rev-parse", "HEAD"])
        .decode("ascii")
        .strip()[:8]
    )
    version_identifier = f"git-{git_commit}"
except subprocess.CalledProcessError:
    version_identifier = "local-dev"  # Fallback if not in a git repo

improved_model_name = f"customer_support_agent-v2-{version_identifier}"

# Set the new active model
improved_model_info = mlflow.set_active_model(name=improved_model_name)

# Log parameters for the improved version
improved_params = {
    "llm": "gpt-4o-mini",
    "temperature": 0.7,
    "retrieval_strategy": "vector_search_v3",
}
mlflow.log_model_params(model_id=improved_model_info.model_id, params=improved_params)


# Define the improved agent with more empathetic prompting
def improved_agent(question: str) -> str:
    client = openai.OpenAI()

    # Enhanced system prompt with empathy focus
    system_prompt = """You are a caring and empathetic customer support agent.
    Always acknowledge the customer's feelings and frustrations before providing solutions.
    Use phrases like 'I understand how frustrating this must be' and 'I'm here to help'.
    Provide clear, actionable steps while maintaining a warm, supportive tone."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
        max_tokens=150,
    )
    return response.choices[0].message.content

Now let's evaluate this improved version with the same dataset and scorers as the previous section. This ensures a fair comparison between versions:

# Evaluate the improved version with the same dataset
evaluation_results_v2 = mlflow.genai.evaluate(
    data=eval_data,  # Same evaluation dataset from previous section
    predict_fn=improved_agent,
    model_id=improved_model_info.model_id,
    scorers=[
        relevance_scorer,
        support_guidelines,
    ],  # Same scorers from previous section
)

print(f"V2 Metrics: {evaluation_results_v2.metrics}")

Compare versions in the MLflow UI

After creating multiple versions of your application, you need to compare them systematically to:

  • Validate improvements - Confirm that your changes actually improved the metrics you care about
  • Catch regressions - Make sure new versions don't degrade performance in unexpected ways
  • Select the best version - Make data-driven decisions about which version to deploy to production
  • Guide iteration - Understand which changes had the biggest impact to inform future improvements

The MLflow UI provides visual comparison tools that make this analysis intuitive:

  1. Navigate to your experiment containing the customer support agent versions
  2. Select multiple runs by checking the boxes next to each version's evaluation run
  3. Click "Compare" to see:
    • Side-by-side parameter differences
    • Metric comparisons with charts
    • Parallel coordinates plots for multi-metric analysis

Compare versions using the MLflow API

For automated workflows such as CI/CD pipelines, regression testing, or programmatic version selection, MLflow provides powerful APIs to search, rank, and compare your LoggedModel versions. These APIs enable you to:

  • Automatically flag versions that don't meet quality thresholds (see the sketch after this list)
  • Generate comparison reports for code review
  • Select the best version for deployment without manual intervention
  • Track how versions improve over time as part of your analysis
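For example, the first item could be implemented by tagging a version that falls below your threshold. The sketch below assumes mlflow.set_logged_model_tags is available in your MLflow version and that the support_guidelines/mean metric was logged by the evaluation above; the tag names are illustrative:

# Hypothetical quality flag: tag a version whose guidelines score is below threshold
model = mlflow.get_logged_model(model_id=improved_model_info.model_id)
guidelines_score = next(
    (m.value for m in model.metrics if m.key == "support_guidelines/mean"), 0.0
)

if guidelines_score < 0.80:
    mlflow.set_logged_model_tags(
        model_id=model.model_id,
        tags={"quality_gate": "failed", "failed_metric": "support_guidelines/mean"},
    )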

Rank and search versions

Use search_logged_models to find all versions of your application and rank them by quality, speed, or other performance characteristics. This helps you spot trends and identify the best-performing versions:

from mlflow import search_logged_models

# Search for all versions of our customer support agent
# Order by creation time to see version progression
all_versions = search_logged_models(
    filter_string=f"name IN ('{logged_model_name}', '{improved_model_name}')",
    order_by=[{"field_name": "creation_time", "ascending": False}],  # Most recent first
    output_format="list",
)

print(f"Found {len(all_versions)} versions of customer support agent\n")

# Compare key metrics across versions
for model in all_versions[:2]:  # Compare latest 2 versions
    print(f"Version: {model.name}")
    print(f"  Model ID: {model.model_id}")
    print(f"  Created: {model.creation_timestamp}")

    # Display evaluation metrics
    for metric in model.metrics:
        print(f"  {metric.key}: {metric.value}")

    # Display parameters
    print(f"  Temperature: {model.params.get('temperature', 'N/A')}")
    print(f"  LLM: {model.params.get('llm', 'N/A')}")
    print()
# Find the best performing version by a specific metric
best_by_guidelines = max(
    all_versions,
    key=lambda model: next(
        (m.value for m in model.metrics if m.key == "support_guidelines/mean"),
        float("-inf"),  # Rank versions without this metric last
    ),
)
print(f"Best version for support guidelines: {best_by_guidelines.name}")
print(
    f"  Score: {next((m.value for m in best_by_guidelines.metrics if m.key == 'support_guidelines/mean'), None)}"
)

Side-by-side comparison

Once you've identified the versions you want to compare, perform a detailed side-by-side analysis to see exactly what changed and how it affected performance:

# Get the two specific models we want to compare
v1 = mlflow.get_logged_model(
    model_id=active_model_info.model_id
)  # Original from previous section
v2 = mlflow.get_logged_model(model_id=improved_model_info.model_id)  # Improved

print("=== Version Comparison ===")
print(f"V1: {v1.name} vs V2: {v2.name}\n")

# Compare parameters (what changed)
print("Parameter Changes:")
all_params = set(v1.params.keys()) | set(v2.params.keys())
for param in sorted(all_params):
    v1_val = v1.params.get(param, "N/A")
    v2_val = v2.params.get(param, "N/A")
    if v1_val != v2_val:
        print(f"  {param}: '{v1_val}' → '{v2_val}' ✓")
    else:
        print(f"  {param}: '{v1_val}' (unchanged)")

# Compare metrics (impact of changes)
print("\nMetric Improvements:")
v1_metrics = {m.key: m.value for m in v1.metrics}
v2_metrics = {m.key: m.value for m in v2.metrics}

metric_keys = ["relevance_to_query/mean", "support_guidelines/mean"]
for metric in metric_keys:
    v1_score = v1_metrics.get(metric, 0)
    v2_score = v2_metrics.get(metric, 0)
    improvement = ((v2_score - v1_score) / max(v1_score, 0.01)) * 100
    print(f"  {metric}:")
    print(f"    V1: {v1_score:.3f}")
    print(f"    V2: {v2_score:.3f}")
    print(f"    Change: {improvement:+.1f}%")

Automate deployment decisions

One of the most powerful uses of the LoggedModel APIs is automating deployment decisions. Instead of manually reviewing every version, you can codify your quality criteria and let your CI/CD pipeline automatically determine whether a new version is ready for production.

This approach ensures consistent quality standards and enables fast, safe deployments:

# Decision logic based on your criteria
min_relevance_threshold = 0.80
min_guidelines_threshold = 0.80

if (
    v2_metrics.get("relevance_to_query/mean", 0) >= min_relevance_threshold
    and v2_metrics.get("support_guidelines/mean", 0) >= min_guidelines_threshold
    and v2_metrics.get("support_guidelines/mean", 0)
    > v1_metrics.get("support_guidelines/mean", 0)
):
    print(f"✅ Version 2 ({v2.name}) is ready for deployment!")
    print("   - Meets all quality thresholds")
    print("   - Shows improvement in support guidelines")
else:
    print("❌ Version 2 needs more work before deployment")

You can extend this pattern to:

  • Create quality gates in your CI/CD pipeline that block deployment if metrics drop (a minimal sketch follows this list)
  • Implement gradual rollouts based on the size of the improvement
  • Trigger alerts when a version shows a significant regression
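As a concrete starting point for the first item, a CI job can run the decision logic above and fail the build when thresholds are not met. The script below is a minimal sketch under the same assumptions as the example above; the file name check_quality_gate.py and the command-line interface are illustrative choices, not a prescribed setup:

# check_quality_gate.py - hypothetical CI quality gate, run as a pipeline step
import sys

import mlflow

MIN_THRESHOLDS = {
    "relevance_to_query/mean": 0.80,
    "support_guidelines/mean": 0.80,
}


def passes_quality_gate(model_id: str) -> bool:
    # Compare the candidate version's logged metrics against the thresholds
    model = mlflow.get_logged_model(model_id=model_id)
    metrics = {m.key: m.value for m in model.metrics}
    return all(metrics.get(key, 0) >= threshold for key, threshold in MIN_THRESHOLDS.items())


if __name__ == "__main__":
    candidate_model_id = sys.argv[1]  # Model ID of the candidate version, passed by the pipeline
    if not passes_quality_gate(candidate_model_id):
        print("Quality gate failed - blocking deployment")
        sys.exit(1)  # A non-zero exit code fails the CI job
    print("Quality gate passed")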

Next steps

You're now ready to deploy your best-performing version and monitor its performance in production by linking traces to the deployed version.
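In a production service, this typically means setting the deployed version as the active model at startup so that incoming traces are associated with that LoggedModel. A minimal sketch, assuming the deployed version's name is supplied through an environment variable (the variable name APP_VERSION_NAME and the entry-point function are illustrative):

import os

import mlflow

# Link production traces to the deployed version (name supplied by your deployment config)
mlflow.set_active_model(
    name=os.environ.get("APP_VERSION_NAME", "customer_support_agent-v2")
)


@mlflow.trace
def handle_support_request(question: str) -> str:
    # Traces from this entry point are linked to the active LoggedModel;
    # improved_agent is the agent function defined earlier in this guide
    return improved_agent(question)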