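Feedback Collection

MLflow Feedback provides a comprehensive system for capturing quality evaluations from multiple sources: automated LLM judges, programmatic rules, and human reviewers. This systematic approach to feedback collection lets you understand and improve your GenAI application's performance at scale.

For complete API documentation and implementation details, see the mlflow.log_feedback() reference.

What is Feedback?

Feedback captures evaluations of how well your AI performed. It measures the actual quality of what your AI produced across dimensions such as accuracy, relevance, safety, and helpfulness. Unlike expectations, which define what should happen, feedback tells you what actually happened and how well it met your quality standards.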

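Prerequisites

Before using feedback collection in MLflow, make sure you have:

MLflow 3.2.0 or later installed
An active MLflow Tracking Server or a local tracking setup
Traces logged from your GenAI application to an MLflow experiment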

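Feedback Sources

MLflow supports three types of feedback sources, each with its own strengths. You can use a single source or combine several for comprehensive quality coverage.

LLM Judge Evaluation

AI-powered evaluation at scale. LLM judges provide consistent quality assessments for nuanced dimensions such as relevance, tone, and safety, without human intervention.

Programmatic Code Checks

Deterministic rule-based evaluation. Well suited to format validation, compliance checks, and business logic rules that need instant, cost-effective assessment.

Human Expert Review

Domain expert evaluation for high-stakes content. Human feedback captures nuanced insights that automated systems miss and serves as the gold standard.

Feedback sources are used in the Python API as follows: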

from mlflow.entities import AssessmentSource, AssessmentSourceType

# Human expert providing evaluation
human_source = AssessmentSource(
    source_type=AssessmentSourceType.HUMAN, source_id="expert@company.com"
)

# Automated rule-based evaluation
code_source = AssessmentSource(
    source_type=AssessmentSourceType.CODE, source_id="accuracy_checker_v1"
)

# AI-powered evaluation at scale
llm_judge_source = AssessmentSource(
    source_type=AssessmentSourceType.LLM_JUDGE, source_id="gpt-4-evaluator"
)

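Why Collect Feedback?

Collecting feedback on the quality of your GenAI application is central to continuous improvement: it shows whether the application remains effective and where it should be enhanced over time.

Enable Continuous Improvement

Create a data-driven improvement cycle by systematically collecting quality signals to identify patterns, fix issues, and improve AI performance over time.

Scale Quality Assurance

Monitor quality at production scale by evaluating every trace instead of a small sample, catching issues before they impact users.

Build Trust Through Transparency

Show stakeholders exactly how quality is measured and by whom, building confidence in the reliability of your AI system through clear attribution.

Create Training Data

Generate high-quality training datasets from feedback, especially human corrections, to improve both the AI application and the evaluation systems.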

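How Feedback Works

Via API

Use the programmatic mlflow.log_feedback() API when you need to automate feedback collection at scale, integrate with existing systems, or build custom evaluation workflows. The API can record feedback from all three source types.

Step-by-Step Guides

Add Human Evaluations via the UI

The MLflow UI provides an intuitive way to add, edit, and manage feedback directly on traces. This approach suits manual review, collaborative evaluation, and situations where domain experts need to provide feedback without writing code.

Add New Feedback

The feedback is attached to the trace immediately, with your user information recorded as the source.

Edit Existing Feedback

Use this to refine an evaluation or correct a mistake.

Add Additional Feedback to Existing Entries

Use this when multiple reviewers want to give feedback on the same aspect, or when you want to correct an automated evaluation.

This collaborative approach allows multiple perspectives on the same aspect of a trace, producing richer evaluation datasets and helping to identify cases where evaluators disagree.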
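One way to surface cases where evaluators disagree is to compare the values recorded under the same feedback name. The sketch below is a minimal, MLflow-independent illustration: the record shape `(feedback_name, source_id, value)`, the `find_disagreements` helper, and the 0.2 tolerance are all assumptions made for this example, not part of the MLflow API.

```python
from collections import defaultdict

# Hypothetical record shape for this sketch: (feedback_name, source_id, numeric_value).
FeedbackRecord = tuple[str, str, float]


def find_disagreements(
    records: list[FeedbackRecord], tolerance: float = 0.2
) -> dict[str, float]:
    """Return {feedback_name: spread} for names whose values differ by more than tolerance."""
    values_by_name: dict[str, list[float]] = defaultdict(list)
    for name, _source_id, value in records:
        values_by_name[name].append(value)

    disagreements = {}
    for name, values in values_by_name.items():
        spread = max(values) - min(values)
        # A single reviewer cannot disagree with anyone.
        if len(values) > 1 and spread > tolerance:
            disagreements[name] = spread
    return disagreements


records = [
    ("relevance", "gpt-4-evaluator", 0.6),
    ("relevance", "expert@company.com", 0.9),
    ("tone", "gpt-4-evaluator", 0.8),
    ("tone", "expert@company.com", 0.75),
]
flagged = find_disagreements(records)
print(flagged)
```

Here only `relevance` is flagged, because the two reviewers differ by roughly 0.3, while the `tone` scores sit within the tolerance. In practice the records would be built from the assessments attached to your traces.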

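Log Automated Assessments via the API

Two approaches are covered here: LLM judges and heuristic metrics.

LLM Judge

Implement automated LLM-based evaluation with the following steps.

1. Set up your evaluation environment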

import json
import mlflow
from mlflow.entities import AssessmentSource, AssessmentError
from mlflow.entities.assessment_source import AssessmentSourceType
import openai  # or your preferred LLM client

# Configure your LLM client
client = openai.OpenAI(api_key="your-api-key")

2. Create your evaluation prompt

def create_evaluation_prompt(user_input, ai_response):
    return f"""
    Evaluate the AI response for helpfulness and accuracy.

    User Input: {user_input}
    AI Response: {ai_response}

    Rate the response on a scale of 0.0 to 1.0 for:
    1. Helpfulness: How well does it address the user's needs?
    2. Accuracy: Is the information factually correct?

    Respond with only a JSON object:
    {{"helpfulness": 0.0-1.0, "accuracy": 0.0-1.0, "rationale": "explanation"}}
    """

3. Implement the evaluation function

def evaluate_with_llm_judge(trace_id, user_input, ai_response):
    try:
        # Get LLM evaluation
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "user",
                    "content": create_evaluation_prompt(user_input, ai_response),
                }
            ],
            temperature=0.0,
        )

        # Parse the evaluation (the prompt asks the judge to reply with JSON only)
        evaluation = json.loads(response.choices[0].message.content)

        # Log feedback to MLflow
        mlflow.log_feedback(
            trace_id=trace_id,
            name="llm_judge_evaluation",
            value=evaluation,
            rationale=evaluation.get("rationale", ""),
            source=AssessmentSource(
                source_type=AssessmentSourceType.LLM_JUDGE, source_id="gpt-4-evaluator"
            ),
        )

    except Exception as e:
        # Log evaluation failure
        mlflow.log_feedback(
            trace_id=trace_id,
            name="llm_judge_evaluation",
            error=AssessmentError(error_code="EVALUATION_FAILED", error_message=str(e)),
            source=AssessmentSource(
                source_type=AssessmentSourceType.LLM_JUDGE, source_id="gpt-4-evaluator"
            ),
        )

4. Use the evaluation function

# Example usage
trace_id = "your-trace-id"
user_question = "What is the capital of France?"
ai_answer = "The capital of France is Paris."

evaluate_with_llm_judge(trace_id, user_question, ai_answer)
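The evaluation function in step 3 passes the judge's raw reply directly to `json.loads`. Models sometimes wrap JSON in a Markdown code fence or return scores outside the requested range, so a small validation step can make failures easier to diagnose. The following is a minimal sketch under those assumptions; `parse_judge_output` and `FENCE` are hypothetical names introduced for this example and are not part of MLflow or the OpenAI client.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker
FENCED_JSON = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_judge_output(raw: str, score_keys=("helpfulness", "accuracy")) -> dict:
    """Parse a judge reply into a dict, tolerating a code fence around the JSON."""
    text = raw.strip()
    fenced = FENCED_JSON.match(text)
    if fenced:
        text = fenced.group(1)

    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type.
    evaluation = json.loads(text)
    if not isinstance(evaluation, dict):
        raise ValueError("Judge output must be a JSON object")

    for key in score_keys:
        score = evaluation.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"Missing or non-numeric score: {key}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"Score out of range for {key}: {score}")
    return evaluation
```

Because every failure is raised as a `ValueError`, a wrapper such as the `except Exception` branch in step 3 would still record it as an `AssessmentError`.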

Heuristic Metrics

Implement rule-based programmatic evaluation with the following steps.

1. Define your evaluation rules

def evaluate_response_compliance(response_text):
    """Evaluate response against business rules."""
    results = {
        "has_disclaimer": False,
        "appropriate_length": False,
        "contains_prohibited_terms": False,
        "rationale": [],
    }

    # Check for required disclaimer
    if "This is not financial advice" in response_text:
        results["has_disclaimer"] = True
    else:
        results["rationale"].append("Missing required disclaimer")

    # Check response length
    if 50 <= len(response_text) <= 500:
        results["appropriate_length"] = True
    else:
        results["rationale"].append(
            f"Response length {len(response_text)} outside acceptable range"
        )

    # Check for prohibited terms
    prohibited_terms = ["guaranteed returns", "risk-free", "get rich quick"]
    found_terms = [
        term for term in prohibited_terms if term.lower() in response_text.lower()
    ]
    if found_terms:
        results["contains_prohibited_terms"] = True
        results["rationale"].append(f"Contains prohibited terms: {found_terms}")

    return results

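2. Implement the logging function

import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType


def log_compliance_check(trace_id, response_text):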
    # Run compliance evaluation
    evaluation = evaluate_response_compliance(response_text)

    # Calculate overall compliance score
    compliance_score = (
        sum(
            [
                evaluation["has_disclaimer"],
                evaluation["appropriate_length"],
                not evaluation["contains_prohibited_terms"],
            ]
        )
        / 3
    )

    # Log the feedback
    mlflow.log_feedback(
        trace_id=trace_id,
        name="compliance_check",
        value={"overall_score": compliance_score, "details": evaluation},
        rationale="; ".join(evaluation["rationale"]) or "All compliance checks passed",
        source=AssessmentSource(
            source_type=AssessmentSourceType.CODE, source_id="compliance_validator_v2.1"
        ),
    )

3. Use it in your application

# Example usage after your AI generates a response
with mlflow.start_span(name="financial_advice") as span:
    ai_response = your_ai_model.generate(user_question)
    trace_id = span.trace_id

    # Run automated compliance check
    log_compliance_check(trace_id, ai_response)
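The overall score computed in `log_compliance_check` is simply the fraction of the three checks that pass, with the prohibited-terms flag inverted. The standalone sketch below reproduces that arithmetic on a sample result so the value is easy to verify by hand; the `compliance_score` helper is introduced only for this illustration.

```python
def compliance_score(evaluation: dict) -> float:
    """Fraction of the three checks that pass (finding prohibited terms is a failure)."""
    checks = [
        evaluation["has_disclaimer"],
        evaluation["appropriate_length"],
        not evaluation["contains_prohibited_terms"],
    ]
    # Booleans sum as 0/1, so this counts the passing checks.
    return sum(checks) / len(checks)


# A response with the disclaimer and no prohibited terms, but outside the length range.
example = {
    "has_disclaimer": True,
    "appropriate_length": False,
    "contains_prohibited_terms": False,
}
score = compliance_score(example)
print(round(score, 3))  # 0.667
```

A single fractional score does not show which rule failed, which is why the example above also logs the per-check details and a rationale alongside it.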

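Managing Feedback

Once you have collected feedback on your traces, you will need to retrieve it, update it, and sometimes delete it. These operations are essential for keeping your evaluation data accurate.

Retrieving Feedback

Retrieve a specific piece of feedback to analyze evaluation results: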

# Get a specific feedback by ID
feedback = mlflow.get_assessment(
    trace_id="tr-1234567890abcdef", assessment_id="a-0987654321abcdef"
)

# Access feedback details
name = feedback.name
value = feedback.value
source_type = feedback.source.source_type
rationale = feedback.rationale  # None when no rationale was recorded

Updating Feedback

Update existing feedback when you need to correct or refine an evaluation:

from mlflow.entities import Feedback

# Update feedback with new information
updated_feedback = Feedback(
    name="response_quality",
    value=0.9,
    rationale="Updated after additional review - response is more comprehensive than initially evaluated",
)

mlflow.update_assessment(
    trace_id="tr-1234567890abcdef",
    assessment_id="a-0987654321abcdef",
    assessment=updated_feedback,
)

Deleting Feedback

Delete feedback that was logged in error:

# Delete specific feedback
mlflow.delete_assessment(
    trace_id="tr-1234567890abcdef", assessment_id="a-5555666677778888"
)

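Note

If you delete feedback that was marked as a replacement using the override_feedback API, the original feedback returns to a valid state.

Overriding Automated Feedback

The override_feedback function lets human experts correct automated evaluations while preserving the original for audit trails and learning.

When to Override vs. Update

Override: use when correcting automated feedback. The original is preserved for analysis.
Update: use when fixing a mistake in existing feedback. The feedback is modified in place.

Override Example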

# Step 1: Original automated feedback (logged earlier)
llm_feedback = mlflow.log_feedback(
    trace_id="tr-1234567890abcdef",
    name="relevance",
    value=0.6,
    rationale="Response partially addresses the question",
    source=AssessmentSource(
        source_type=AssessmentSourceType.LLM_JUDGE, source_id="gpt-4-evaluator"
    ),
)

# Step 2: Human expert reviews and disagrees
corrected_feedback = mlflow.override_feedback(
    trace_id="tr-1234567890abcdef",
    assessment_id=llm_feedback.assessment_id,
    value=0.9,
    rationale="Response fully addresses the question with comprehensive examples",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN, source_id="expert_reviewer@company.com"
    ),
    metadata={"override_reason": "LLM underestimated relevance", "confidence": "high"},
)

The override process marks the original feedback as invalid but preserves it for historical analysis and model improvement.
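To make the override semantics concrete, here is a toy in-memory model of the behavior described on this page: an override keeps the original entry but marks it invalid, and deleting the override restores the original. This is an illustration only. `ToyFeedback` and `ToyFeedbackStore` are hypothetical classes and do not reflect how MLflow stores assessments internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyFeedback:
    assessment_id: str
    value: float
    source_id: str
    valid: bool = True
    overrides: Optional[str] = None  # id of the feedback this entry replaces


class ToyFeedbackStore:
    """In-memory illustration only; not how MLflow stores assessments."""

    def __init__(self) -> None:
        self._items: dict[str, ToyFeedback] = {}

    def log(self, assessment_id: str, value: float, source_id: str) -> ToyFeedback:
        item = ToyFeedback(assessment_id, value, source_id)
        self._items[assessment_id] = item
        return item

    def override(
        self, original_id: str, new_id: str, value: float, source_id: str
    ) -> ToyFeedback:
        # Keep the original, but mark it invalid and link the replacement to it.
        self._items[original_id].valid = False
        item = ToyFeedback(new_id, value, source_id, overrides=original_id)
        self._items[new_id] = item
        return item

    def delete(self, assessment_id: str) -> None:
        removed = self._items.pop(assessment_id)
        # Deleting a replacement restores the feedback it had overridden.
        if removed.overrides in self._items:
            self._items[removed.overrides].valid = True

    def valid_values(self) -> list[float]:
        return [item.value for item in self._items.values() if item.valid]


store = ToyFeedbackStore()
store.log("a-1", 0.6, "gpt-4-evaluator")
store.override("a-1", "a-2", 0.9, "expert_reviewer@company.com")
after_override = store.valid_values()  # only the human correction counts
store.delete("a-2")
after_delete = store.valid_values()  # the original LLM score is valid again
```

An in-place update, by contrast, would overwrite the value of `a-1` directly and leave no record of the earlier score.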

Best Practices

Consistent Naming Conventions

Use clear, descriptive names that make feedback data easy to analyze:

# Good: Descriptive, specific names
mlflow.log_feedback(trace_id=trace_id, name="response_accuracy", value=0.95)
mlflow.log_feedback(trace_id=trace_id, name="sql_syntax_valid", value=True)
mlflow.log_feedback(trace_id=trace_id, name="execution_time_ms", value=245)

# Poor: Vague, inconsistent names
mlflow.log_feedback(trace_id=trace_id, name="good", value=True)
mlflow.log_feedback(trace_id=trace_id, name="score", value=0.95)
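A naming convention is easier to keep consistent when it is checked in code before feedback is logged. The sketch below assumes a team convention of lowercase snake_case plus a small deny-list of vague names. Both `NAME_PATTERN` and `VAGUE_NAMES` are illustrative choices, not rules enforced by MLflow.

```python
import re

# Assumed convention for this sketch: lowercase snake_case.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Hypothetical deny-list of names too vague to analyze later.
VAGUE_NAMES = {"good", "bad", "score", "rating", "result"}


def is_descriptive_name(name: str) -> bool:
    """Return True if the name follows the assumed team convention."""
    return name not in VAGUE_NAMES and bool(NAME_PATTERN.match(name))


for candidate in ["response_accuracy", "sql_syntax_valid", "good", "score"]:
    print(candidate, is_descriptive_name(candidate))
```

Calling a check like this before `mlflow.log_feedback` keeps inconsistent names out of the dataset instead of requiring cleanup afterwards.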

Traceable Source Attribution

Provide specific source information for audit trails:

# Excellent: Version-specific, environment-aware
source = AssessmentSource(
    source_type=AssessmentSourceType.CODE, source_id="response_validator_v2.1_prod"
)

# Good: Individual attribution
source = AssessmentSource(
    source_type=AssessmentSourceType.HUMAN, source_id="expert@company.com"
)

# Poor: Generic, untraceable
source = AssessmentSource(source_type=AssessmentSourceType.CODE, source_id="validator")
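Version-specific, environment-aware source ids like the one above are less error-prone when generated by a helper instead of typed by hand. The following is a minimal sketch. `build_source_id` and its `component_vVERSION_ENVIRONMENT` layout are assumptions modeled on the example id on this page, not an MLflow requirement.

```python
import re


def build_source_id(component: str, version: str, environment: str) -> str:
    """Compose an id such as 'response_validator_v2.1_prod' from its parts."""
    parts = (("component", component), ("version", version), ("environment", environment))
    for label, part in parts:
        # Allow letters, digits, underscores, dots, and hyphens; reject empty parts.
        if not re.fullmatch(r"[\w.-]+", part):
            raise ValueError(f"Invalid {label}: {part!r}")
    return f"{component}_v{version}_{environment}"


source_id = build_source_id("response_validator", "2.1", "prod")
print(source_id)  # response_validator_v2.1_prod
```

The resulting string can be passed as the `source_id` argument of `AssessmentSource`.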

Rich Metadata

Include context that helps with analysis:

mlflow.log_feedback(
    trace_id=trace_id,
    name="response_quality",
    value=0.85,
    source=human_source,
    metadata={
        "reviewer_expertise": "domain_expert",
        "review_duration_seconds": 45,
        "confidence": "high",
        "criteria_version": "v2.3",
        "evaluation_context": "production_review",
    },
)

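Next Steps

Feedback Concepts

Take a deeper look at the feedback architecture and schema

Learn the concepts →

Ground Truth Expectations

Learn how to define expected outputs for evaluation

Start annotating →

LLM Evaluation

Learn how to systematically evaluate and improve your GenAI application

Start evaluating →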
