Evaluation Datasets SDK Guide
Master the APIs for creating, evolving, and managing evaluation datasets through practical workflows and real-world patterns.
Getting Started
MLflow provides a fluent API for working with evaluation datasets that makes common workflows simple and intuitive.
from mlflow.genai.datasets import (
create_dataset,
get_dataset,
search_datasets,
set_dataset_tags,
delete_dataset_tag,
)
Your Dataset Journey
Follow this typical workflow to build and evolve your evaluation datasets.
The Complete Development Workflow
Step 1: Create Your Dataset
Start by creating a new evaluation dataset with meaningful metadata, using the mlflow.genai.datasets.create_dataset() API.
from mlflow.genai.datasets import create_dataset
# Create a new dataset with tags for organization
dataset = create_dataset(
name="customer_support_qa_v1",
experiment_id=["0"], # Link to experiments ("0" is default)
tags={
"version": "1.0",
"purpose": "regression_testing",
"model": "gpt-4",
"team": "ml-platform",
"status": "development",
},
)
Step 2: Add Your First Test Cases
Build your dataset by adding test cases from production traces and manual curation. Expectations are typically defined by subject matter experts (SMEs) who understand the domain and can establish the ground truth for what constitutes correct behavior.
Learn how to define expectations → Expectations are ground truth values that define what your AI should produce. They are added by subject matter experts who review outputs and establish quality standards.
- From production traces
- Manual test cases
import mlflow
# Search for production traces to build your dataset
# Request list format to work with individual Trace objects
production_traces = mlflow.search_traces(
experiment_ids=["0"], # Your production experiment
filter_string="attributes.user_feedback = 'positive'",
max_results=100,
return_type="list", # Returns list[Trace] for direct manipulation
)
# Subject matter experts add expectations to define correct behavior
for trace in production_traces:
# Subject matter experts review traces and define what the output should satisfy
mlflow.log_expectation(
trace_id=trace.info.trace_id,
name="quality_assessment",
value={
"should_match_production": True,
"minimum_quality": 0.8,
"response_time_ms": 2000,
"contains_citation": True,
},
)
# Can also add textual expectations
mlflow.log_expectation(
trace_id=trace.info.trace_id,
name="expected_behavior",
value="Response should provide step-by-step instructions with security considerations",
)
# Add annotated traces to dataset (expectations are automatically included)
dataset.merge_records(production_traces)
# Test cases can be manually defined as dictionaries
# merge_records() accepts both dict and pandas.DataFrame formats for manual
# record additions
test_cases = [
{
"inputs": {
"question": "How do I reset my password?",
"user_type": "premium",
"context": "User has been locked out after 3 failed attempts",
},
"expectations": {
"answer_quality": 0.95,
"contains_steps": True,
"mentions_security": True,
"response": "To reset your password, please follow these steps:\n1. Click 'Forgot Password' on the login page\n2. Enter your registered email address\n3. Check your email for the reset link\n4. Click the link and create a new password\n5. Use your new password to log in",
},
"tags": {
"category": "account_management",
"priority": "high",
"reviewed_by": "security_team",
},
},
{
"inputs": {
"question": "What are your business hours?",
"user_type": "standard",
},
"expectations": {
"accuracy": 1.0,
"includes_timezone": True,
"mentions_holidays": True,
},
},
]
# Add to your dataset (accepts list[dict], list[Trace] or pandas.DataFrame)
dataset.merge_records(test_cases)
Step 3: Evolve Your Dataset
Keep updating your dataset as you discover edge cases and deepen your understanding. The mlflow.entities.EvaluationDataset.merge_records() method intelligently handles both new records and updates to existing ones.
# Capture a production failure
failure_case = {
"inputs": {"question": "'; DROP TABLE users; --", "user_type": "malicious"},
"expectations": {
"handles_sql_injection": True,
"returns_safe_response": True,
"logs_security_event": True,
},
"source": {
"source_type": "HUMAN",
"source_data": {"discovered_by": "security_team"},
},
"tags": {"category": "security", "severity": "critical"},
}
# Add the new edge case
dataset.merge_records([failure_case])
# Update expectations for existing records
updated_records = []
for record in dataset.records:
if "accuracy" in record.get("expectations", {}):
# Raise the quality bar
record["expectations"]["accuracy"] = max(
0.9, record["expectations"]["accuracy"]
)
updated_records.append(record)
# Merge updates (intelligently handles duplicates)
dataset.merge_records(updated_records)
Step 4: Organize with Tags
Use tags to track your dataset's evolution and enable powerful search. See mlflow.search_traces() to learn more about building your dataset from production data.
from mlflow.genai.datasets import set_dataset_tags
# Update dataset metadata
set_dataset_tags(
dataset_id=dataset.dataset_id,
tags={
"status": "validated",
"coverage": "comprehensive",
"last_review": "2024-11-01",
},
)
# Remove outdated tags
set_dataset_tags(
dataset_id=dataset.dataset_id,
tags={"development_only": None}, # Setting to None removes the tag
)
Step 5: Search and Discover
Find datasets with the powerful search capabilities of mlflow.genai.datasets.search_datasets().
from mlflow.genai.datasets import search_datasets
# Find datasets by experiment
datasets = search_datasets(experiment_ids=["0", "1"]) # Search in multiple experiments
# Search by name pattern
regression_datasets = search_datasets(filter_string="name LIKE '%regression%'")
# Complex search with tags
production_ready = search_datasets(
filter_string="tags.status = 'validated' AND tags.coverage = 'comprehensive'",
order_by=["last_update_time DESC"],
max_results=10,
)
# The PagedList automatically handles pagination when iterating
Common Filter String Examples
Here are some practical filter string examples to help you find the right datasets.
Filter Expression | Description | Use Case |
---|---|---|
name = 'production_qa' | Exact name match | Find a specific dataset |
name LIKE '%test%' | Pattern matching | Find all test datasets |
tags.status = 'validated' | Tag equality | Find production-ready datasets |
tags.version = '2.0' AND tags.team = 'ml' | Multiple tag conditions | Find a specific team's version |
created_by = 'alice@company.com' | Creator filter | Find datasets by author |
created_time > 1698800000000 | Time-based filter | Find recent datasets |
tags.model = 'gpt-4' AND name LIKE '%eval%' | Combined conditions | Model-specific evaluation sets |
last_updated_by != 'bot@system' | Exclusion filter | Exclude automated updates |
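Because only AND combinations are supported, filter strings like those in the table can be assembled programmatically. The build_filter helper below is a hypothetical convenience for illustration, not part of the MLflow API, and it quotes values naively (no escaping):

```python
def build_filter(**conditions: str) -> str:
    """Join exact-match conditions with AND.

    Hypothetical helper, not an MLflow API. Keyword names use double
    underscores for dotted fields, e.g. tags__status -> tags.status.
    Values are quoted naively; do not use with untrusted input.
    """
    clauses = []
    for field, value in conditions.items():
        field = field.replace("__", ".")
        clauses.append(f"{field} = '{value}'")
    return " AND ".join(clauses)


filter_string = build_filter(tags__status="validated", tags__team="ml")
print(filter_string)
# tags.status = 'validated' AND tags.team = 'ml'
```

The resulting string can then be passed as the filter_string argument to search_datasets().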
Step 6: Manage Experiment Associations
After creating a dataset, you can dynamically associate it with experiments using mlflow.genai.datasets.add_dataset_to_experiments() and mlflow.genai.datasets.remove_dataset_from_experiments().
This capability supports several important use cases:
- Cross-team collaboration: share a dataset across teams by adding their experiment IDs.
- Lifecycle management: remove stale experiment associations as projects mature.
- Project reorganization: restructure datasets dynamically as your project layout evolves.
from mlflow.genai.datasets import (
add_dataset_to_experiments,
remove_dataset_from_experiments,
)
# Add dataset to additional experiments
dataset = add_dataset_to_experiments(
dataset_id="d-1a2b3c4d5e6f7890abcdef1234567890", experiment_ids=["3", "4", "5"]
)
print(f"Dataset now linked to experiments: {dataset.experiment_ids}")
# Remove dataset from specific experiments
dataset = remove_dataset_from_experiments(
dataset_id="d-1a2b3c4d5e6f7890abcdef1234567890", experiment_ids=["3"]
)
print(f"Updated experiment associations: {dataset.experiment_ids}")
The Active Record Pattern
The EvaluationDataset object follows the active record pattern: it is both a data container and an interface whose methods interact with the backend.
# Get a dataset
dataset = get_dataset(dataset_id="d-1a2b3c4d5e6f7890abcdef1234567890")
# The dataset object is "live" - it can fetch and update data
current_record_count = len(dataset.records) # Lazy loads if needed
# Add new records directly on the object
new_records = [
{
"inputs": {"question": "What are your business hours?"},
"expectations": {"mentions_hours": True, "includes_timezone": True},
}
]
dataset.merge_records(new_records) # Updates backend immediately
# Convert to DataFrame for analysis
df = dataset.to_df()
# Access auto-computed properties
schema = dataset.schema # Field structure
profile = dataset.profile # Dataset statistics
How Record Merging Works
The merge_records() method intelligently handles both new records and updates to existing ones. Records are matched by a hash of their inputs: if a record with identical inputs already exists, its expectations and tags are updated instead of creating a duplicate.
- Adding new records
- Updating existing records
- Bulk updates from traces
- Input uniqueness
When you first add records, they are stored together with their inputs, expectations, and metadata.
# Initial record
record_v1 = {
"inputs": {"question": "What is MLflow?", "context": "ML platform overview"},
"expectations": {"accuracy": 0.8, "mentions_tracking": True},
}
dataset.merge_records([record_v1])
# Creates a new record in the dataset
When you merge a record whose inputs match an existing record, the existing record is updated by merging in the new expectations and tags.
# Updated version with same inputs but enhanced expectations
record_v2 = {
"inputs": {
"question": "What is MLflow?", # Same question
"context": "ML platform overview", # Same context
},
"expectations": {
"accuracy": 0.95, # Updates existing value
"mentions_models": True, # Adds new expectation
"clarity": 0.9 # Adds new metric
# Note: "mentions_tracking": True is preserved from record_v1
},
"tags": {"reviewed": "true", "reviewer": "ml_team"},
}
dataset.merge_records([record_v2])
# The record is updated, not duplicated
# Final record has all expectations from both v1 and v2 merged together
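The merged result can be pictured as a dictionary union in which newer values win on conflict while keys unique to either version survive. This is a simplified illustration of the merge behavior described above, not MLflow's internal code:

```python
v1_expectations = {"accuracy": 0.8, "mentions_tracking": True}
v2_expectations = {"accuracy": 0.95, "mentions_models": True, "clarity": 0.9}

# Dict union: later values override on conflict, all other keys are preserved
merged = {**v1_expectations, **v2_expectations}

assert merged == {
    "accuracy": 0.95,           # updated by v2
    "mentions_tracking": True,  # preserved from v1
    "mentions_models": True,    # added by v2
    "clarity": 0.9,             # added by v2
}
```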
This update behavior is especially useful when adding expectations to production traces.
# First pass: Add traces without expectations
traces = mlflow.search_traces(experiment_ids=["0"], max_results=100, return_type="list")
dataset.merge_records(traces)
# Later: Subject matter experts review and add expectations
for trace in traces[:20]: # Review subset
mlflow.log_expectation(
trace_id=trace.info.trace_id,
name="quality_check",
value={"approved": True, "quality_score": 0.9},
)
# IMPORTANT: Re-fetch traces to get the attached expectations
updated_traces = mlflow.search_traces(
experiment_ids=["0"], max_results=100, return_type="list"
)
# Re-merge the updated traces - existing records are updated with expectations
dataset.merge_records(updated_traces[:20])
Records are treated as unique based on their entire inputs dictionary. Even a small difference creates a separate record.
# These are treated as different records due to different inputs
record_a = {
"inputs": {"question": "What is MLflow?", "temperature": 0.7},
"expectations": {"accuracy": 0.9},
}
record_b = {
"inputs": {
"question": "What is MLflow?",
"temperature": 0.8,
}, # Different temperature
"expectations": {"accuracy": 0.9},
}
dataset.merge_records([record_a, record_b])
# Results in 2 separate records due to different temperature values
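Conceptually, the uniqueness key behaves like a hash of the canonicalized inputs dictionary. The sketch below illustrates the idea with a stable JSON serialization; it is a simplified mental model, not MLflow's actual implementation:

```python
import hashlib
import json


def input_key(record: dict) -> str:
    """Illustrative uniqueness key: hash the inputs dict with sorted keys."""
    canonical = json.dumps(record["inputs"], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record_a = {"inputs": {"question": "What is MLflow?", "temperature": 0.7}}
record_b = {"inputs": {"temperature": 0.7, "question": "What is MLflow?"}}
record_c = {"inputs": {"question": "What is MLflow?", "temperature": 0.8}}

assert input_key(record_a) == input_key(record_b)  # key order does not matter
assert input_key(record_a) != input_key(record_c)  # any value change -> new record
```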
Understanding Source Types
MLflow tracks the provenance of every record in an evaluation dataset through source types. This helps you understand where your test data came from and analyze performance by data source.
Source Type Behaviors
Automatic Inference
When no explicit source is provided, MLflow automatically infers the source type from the record's characteristics.
Manual Override
You can always override the automatic inference by specifying explicit source information.
Provenance Tracking
Source types enable filtering and performance analysis by data origin.
Automatic Source Assignment
MLflow assigns source types automatically based on each record's characteristics.
- TRACE sources
- HUMAN sources
- CODE sources
Records created from MLflow traces are automatically assigned the TRACE source type.
# When adding traces directly (automatic TRACE source)
traces = mlflow.search_traces(experiment_ids=["0"], return_type="list")
dataset.merge_records(traces) # All records get TRACE source type
# Or when using DataFrame from search_traces
traces_df = mlflow.search_traces(experiment_ids=["0"]) # Returns DataFrame
dataset.merge_records(
traces_df
) # Automatically detects traces and assigns TRACE source
Records with expectations are inferred as HUMAN sources (subject matter expert annotations).
# Records with expectations indicate human review/annotation
human_curated = [
{
"inputs": {"question": "What is MLflow?"},
"expectations": {"answer": "MLflow is an ML platform", "quality": 0.9}
# Automatically inferred as HUMAN source due to expectations
}
]
dataset.merge_records(human_curated)
Records containing only inputs (no expectations) are inferred as CODE sources (programmatically generated).
# Records without expectations indicate programmatic generation
generated_tests = [
{"inputs": {"question": f"Test question {i}"}}
for i in range(100)
# Automatically inferred as CODE source (no expectations field)
]
dataset.merge_records(generated_tests)
Manual Source Specification
You can explicitly specify the source type and metadata for any record. When no explicit source is provided, MLflow infers the source type before sending records to the backend, using the following rules:
- Records with expectations → inferred as HUMAN sources (indicating manual annotation or ground truth).
- Records with only inputs (no expectations) → inferred as CODE sources (indicating programmatic generation).
- Records built from traces → always assigned the TRACE source, with or without expectations.
This inference runs client-side inside the merge_records() method, before records are sent to the tracking backend. You can override it by providing explicit source information.
# Specify HUMAN source for manually curated test cases
human_curated = {
"inputs": {"question": "What are your business hours?"},
"expectations": {"accuracy": 1.0, "includes_timezone": True},
"source": {
"source_type": "HUMAN",
"source_data": {"curator": "support_team", "date": "2024-11-01"},
},
}
# Specify DOCUMENT source for data from documentation
from_docs = {
"inputs": {"question": "How to install MLflow?"},
"expectations": {"mentions_pip": True, "mentions_conda": True},
"source": {
"source_type": "DOCUMENT",
"source_data": {"document_id": "install_guide", "page": 1},
},
}
# Specify CODE source for programmatically generated data
generated = [
    {
        "inputs": {"question": f"Test question {i}"},
        "source": {
            "source_type": "CODE",
            "source_data": {"generator": "test_suite_v2", "seed": 42},
        },
    }
    for i in range(100)
]
dataset.merge_records([human_curated, from_docs, *generated])
Available Source Types
Source types enable powerful filtering and analysis of your evaluation results. You can break down performance by data origin to see whether your model behaves differently on human-curated versus generated test cases, or on production traces versus documentation examples.
TRACE
Production data captured via MLflow tracing. Assigned automatically when traces are added.
HUMAN
Subject matter expert annotations. Inferred for records that include expectations.
CODE
Programmatically generated tests. Inferred for records without expectations.
DOCUMENT
Test cases drawn from documentation or specifications. Must be specified explicitly.
UNSPECIFIED
Source unknown or not provided. Used for legacy or imported data.
Analyzing Data by Source
- Source distribution
- Filtering by source
- Source metadata
# Convert dataset to DataFrame for analysis
df = dataset.to_df()
# Check source type distribution
source_distribution = df["source_type"].value_counts()
print("Data sources in dataset:")
for source_type, count in source_distribution.items():
print(f" {source_type}: {count} records")
# Analyze expectations by source
human_records = df[df["source_type"] == "HUMAN"]
trace_records = df[df["source_type"] == "TRACE"]
code_records = df[df["source_type"] == "CODE"]
print(f"Human-curated records: {len(human_records)}")
print(f"Production trace records: {len(trace_records)}")
print(f"Generated test records: {len(code_records)}")
# Filter high-value test cases for critical evaluation
high_value_test_cases = df[
(df["source_type"] == "HUMAN") | (df["source_type"] == "DOCUMENT")
]
The source_data field stores rich metadata about where a record came from.
# Example with detailed source metadata
detailed_source = {
"inputs": {"question": "Complex integration test"},
"expectations": {"passes_validation": True},
"source": {
"source_type": "TRACE",
"source_data": {
"trace_id": "tr-abc123",
"environment": "production",
"user_segment": "enterprise",
"timestamp": "2024-11-01T10:30:00Z",
"session_id": "sess-xyz789",
"feedback_score": 0.95,
},
},
}
# Access metadata after merging
dataset.merge_records([detailed_source])
df = dataset.to_df()
# source_data preserved for analysis
Search Filter Reference
Use these fields in your filter strings. Note: the fluent API returns a PagedList that can be iterated directly; pagination is handled automatically as you iterate over the results.
Field | Type | Example |
---|---|---|
name | string | name = 'production_tests' |
tags.<key> | string | tags.status = 'validated' |
created_by | string | created_by = 'alice@company.com' |
last_updated_by | string | last_updated_by = 'bob@company.com' |
created_time | timestamp | created_time > 1698800000000 |
last_update_time | timestamp | last_update_time > 1698800000000 |
Filter Operators
- =, != : exact match.
- LIKE, ILIKE : pattern matching with % wildcards (ILIKE is case-insensitive).
- >, <, >=, <= : numeric and timestamp comparisons.
- AND : combines conditions (OR is not currently supported for evaluation datasets).
# Complex filter example
datasets = search_datasets(
filter_string="""
tags.status = 'production'
AND name LIKE '%customer%'
AND created_time > 1698800000000
""",
order_by=["last_update_time DESC"],
)
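As a rough mental model for the pattern operators (a simplified illustration, not MLflow's implementation), LIKE and ILIKE can be thought of as anchored regular expressions in which % matches any run of characters and ILIKE ignores case:

```python
import re


def like_match(pattern: str, value: str, ignore_case: bool = False) -> bool:
    """Approximate LIKE/ILIKE semantics: % is a wildcard, match is anchored."""
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("%")) + "$"
    flags = re.IGNORECASE if ignore_case else 0
    return re.match(regex, value, flags) is not None


assert like_match("%test%", "regression_test_suite")       # LIKE substring match
assert not like_match("%test%", "Regression_Test_Suite")   # LIKE is case-sensitive
assert like_match("%test%", "Regression_Test_Suite", ignore_case=True)  # ILIKE
```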
Working with the Client API
For applications and advanced use cases, you can also use the MlflowClient API, which provides the same functionality as the fluent, object-oriented interface.
- Create datasets
- Get datasets
- Search datasets
- Manage tags
- Delete datasets
from mlflow import MlflowClient
client = MlflowClient()
# Create a dataset
dataset = client.create_dataset(
name="customer_support_qa",
experiment_id=["0"],
tags={"version": "1.0", "team": "ml-platform"},
)
# Get a dataset by ID
dataset = client.get_dataset(dataset_id="d-7f2e3a9b8c1d4e5f6a7b8c9d0e1f2a3b")
# Access properties
print(f"Dataset: {dataset.name}")
print(f"Records: {len(dataset.records)}")
# Search for datasets
datasets = client.search_datasets(
experiment_ids=["0"],
filter_string="tags.status = 'validated'",
order_by=["created_time DESC"],
max_results=50,
)
for dataset in datasets:
print(f"{dataset.name}: {dataset.dataset_id}")
# Set tags
client.set_dataset_tags(
dataset_id=dataset.dataset_id, tags={"status": "production", "validated": "true"}
)
# Delete a tag
client.delete_dataset_tag(dataset_id=dataset.dataset_id, key="deprecated")
# Delete a dataset
client.delete_dataset(dataset_id=dataset.dataset_id)
The client API provides the same functionality as the fluent API but is better suited for:
- Production applications that require explicit client management.
- Scenarios that require custom tracking URIs or authentication.
- Integration with existing MlflowClient-based workflows.