Evaluation Datasets SDK Guide
Master the APIs for creating, evolving, and managing evaluation datasets through practical workflows and real-world patterns.
Getting Started
MLflow provides a fluent API for working with evaluation datasets that makes common workflows simple and intuitive.
from mlflow.genai.datasets import (
create_dataset,
get_dataset,
search_datasets,
set_dataset_tags,
delete_dataset_tag,
)
Your Dataset Journey
Follow this typical workflow to build and evolve your evaluation datasets.
The Complete Development Workflow
Step 1: Create Your Dataset
Start by creating a new evaluation dataset with meaningful metadata, using the mlflow.genai.datasets.create_dataset() API.
from mlflow.genai.datasets import create_dataset
# Create a new dataset with tags for organization
dataset = create_dataset(
name="customer_support_qa_v1",
experiment_id=["0"], # Link to experiments ("0" is default)
tags={
"version": "1.0",
"purpose": "regression_testing",
"model": "gpt-4",
"team": "ml-platform",
"status": "development",
},
)
Step 2: Add Your First Test Cases
Build your dataset by adding test cases from production traces and manual curation. Expectations are typically defined by subject matter experts (SMEs) who understand the domain and can establish the ground truth for what constitutes correct behavior.
Learn how to define expectations → Expectations are ground truth values that define what your AI should produce. They are added by subject matter experts who review outputs and establish quality standards.
- From production traces
- Manual test cases
import mlflow
# Search for production traces to build your dataset
# Request list format to work with individual Trace objects
production_traces = mlflow.search_traces(
experiment_ids=["0"], # Your production experiment
filter_string="attributes.user_feedback = 'positive'",
max_results=100,
return_type="list", # Returns list[Trace] for direct manipulation
)
# Subject matter experts add expectations to define correct behavior
for trace in production_traces:
# Subject matter experts review traces and define what the output should satisfy
mlflow.log_expectation(
trace_id=trace.info.trace_id,
name="quality_assessment",
value={
"should_match_production": True,
"minimum_quality": 0.8,
"response_time_ms": 2000,
"contains_citation": True,
},
)
# Can also add textual expectations
mlflow.log_expectation(
trace_id=trace.info.trace_id,
name="expected_behavior",
value="Response should provide step-by-step instructions with security considerations",
)
# Add annotated traces to dataset (expectations are automatically included)
dataset.merge_records(production_traces)
# Test cases can be manually defined as dictionaries
# merge_records() accepts both dict and pandas.DataFrame formats for manual
# record additions
test_cases = [
{
"inputs": {
"question": "How do I reset my password?",
"user_type": "premium",
"context": "User has been locked out after 3 failed attempts",
},
"expectations": {
"answer_quality": 0.95,
"contains_steps": True,
"mentions_security": True,
"response": "To reset your password, please follow these steps:\n1. Click 'Forgot Password' on the login page\n2. Enter your registered email address\n3. Check your email for the reset link\n4. Click the link and create a new password\n5. Use your new password to log in",
},
"tags": {
"category": "account_management",
"priority": "high",
"reviewed_by": "security_team",
},
},
{
"inputs": {
"question": "What are your business hours?",
"user_type": "standard",
},
"expectations": {
"accuracy": 1.0,
"includes_timezone": True,
"mentions_holidays": True,
},
},
]
# Add to your dataset (accepts list[dict], list[Trace] or pandas.DataFrame)
dataset.merge_records(test_cases)
Step 3: Evolve Your Dataset
Keep updating your dataset as you discover edge cases and deepen your understanding. The mlflow.entities.EvaluationDataset.merge_records() method intelligently handles both new records and updates to existing ones.
# Capture a production failure
failure_case = {
"inputs": {"question": "'; DROP TABLE users; --", "user_type": "malicious"},
"expectations": {
"handles_sql_injection": True,
"returns_safe_response": True,
"logs_security_event": True,
},
"source": {
"source_type": "HUMAN",
"source_data": {"discovered_by": "security_team"},
},
"tags": {"category": "security", "severity": "critical"},
}
# Add the new edge case
dataset.merge_records([failure_case])
# Update expectations for existing records
updated_records = []
for record in dataset.records:
if "accuracy" in record.get("expectations", {}):
# Raise the quality bar
record["expectations"]["accuracy"] = max(
0.9, record["expectations"]["accuracy"]
)
updated_records.append(record)
# Merge updates (intelligently handles duplicates)
dataset.merge_records(updated_records)
Step 4: Organize with Tags
Use tags to track your dataset's evolution and enable powerful search. See mlflow.search_traces() to learn more about building your dataset from production data.
from mlflow.genai.datasets import set_dataset_tags
# Update dataset metadata
set_dataset_tags(
dataset_id=dataset.dataset_id,
tags={
"status": "validated",
"coverage": "comprehensive",
"last_review": "2024-11-01",
},
)
# Remove outdated tags
set_dataset_tags(
dataset_id=dataset.dataset_id,
tags={"development_only": None}, # Setting to None removes the tag
)
Step 5: Search and Discover
Find datasets with the powerful search capabilities of mlflow.genai.datasets.search_datasets().
from mlflow.genai.datasets import search_datasets
# Find datasets by experiment
datasets = search_datasets(experiment_ids=["0", "1"]) # Search in multiple experiments
# Search by name pattern
regression_datasets = search_datasets(filter_string="name LIKE '%regression%'")
# Complex search with tags
production_ready = search_datasets(
filter_string="tags.status = 'validated' AND tags.coverage = 'comprehensive'",
order_by=["last_update_time DESC"],
max_results=10,
)
# The PagedList automatically handles pagination when iterating
Common Filter String Examples
Here are some practical filter string examples to help you find the right datasets.
Filter Expression | Description | Use Case |
---|---|---|
name = 'production_qa' | Exact name match | Find a specific dataset |
name LIKE '%test%' | Pattern matching | Find all test datasets |
tags.status = 'validated' | Tag equality | Find production-ready datasets |
tags.version = '2.0' AND tags.team = 'ml' | Multiple tag conditions | Find a specific team's version |
created_by = 'alice@company.com' | Creator filter | Find datasets by author |
created_time > 1698800000000 | Time-based filter | Find recent datasets |
tags.model = 'gpt-4' AND name LIKE '%eval%' | Combined conditions | Model-specific evaluation sets |
last_updated_by != 'bot@system' | Exclusion filter | Exclude automated updates |
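Because only AND combinations are supported, filter strings like those in the table can be assembled programmatically. The build_filter helper below is a hypothetical convenience for illustration, not part of the MLflow API, and it quotes values naively (no escaping):

```python
def build_filter(**conditions: str) -> str:
    """Join exact-match conditions with AND.

    Hypothetical helper, not an MLflow API. Keyword names use double
    underscores for dotted fields, e.g. tags__status -> tags.status.
    Values are quoted naively; do not use with untrusted input.
    """
    clauses = []
    for field, value in conditions.items():
        field = field.replace("__", ".")
        clauses.append(f"{field} = '{value}'")
    return " AND ".join(clauses)


filter_string = build_filter(tags__status="validated", tags__team="ml")
print(filter_string)
# tags.status = 'validated' AND tags.team = 'ml'
```

The resulting string can then be passed as the filter_string argument to search_datasets().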
Step 6: Manage Experiment Associations
After creating a dataset, you can dynamically associate it with experiments using mlflow.genai.datasets.add_dataset_to_experiments() and mlflow.genai.datasets.remove_dataset_from_experiments().
This capability supports several important use cases:
- Cross-team collaboration: share a dataset across teams by adding their experiment IDs.
- Lifecycle management: remove stale experiment associations as projects mature.
- Project reorganization: restructure datasets dynamically as your project layout evolves.
from mlflow.genai.datasets import (
add_dataset_to_experiments,
remove_dataset_from_experiments,
)
# Add dataset to additional experiments
dataset = add_dataset_to_experiments(
dataset_id="d-1a2b3c4d5e6f7890abcdef1234567890", experiment_ids=["3", "4", "5"]
)
print(f"Dataset now linked to experiments: {dataset.experiment_ids}")
# Remove dataset from specific experiments
dataset = remove_dataset_from_experiments(
dataset_id="d-1a2b3c4d5e6f7890abcdef1234567890", experiment_ids=["3"]
)
print(f"Updated experiment associations: {dataset.experiment_ids}")
The Active Record Pattern
The EvaluationDataset object follows the active record pattern: it is both a data container and an interface whose methods interact with the backend.
# Get a dataset
dataset = get_dataset(dataset_id="d-1a2b3c4d5e6f7890abcdef1234567890")
# The dataset object is "live" - it can fetch and update data
current_record_count = len(dataset.records) # Lazy loads if needed
# Add new records directly on the object
new_records = [
{
"inputs": {"question": "What are your business hours?"},
"expectations": {"mentions_hours": True, "includes_timezone": True},
}
]
dataset.merge_records(new_records) # Updates backend immediately
# Convert to DataFrame for analysis
df = dataset.to_df()
# Access auto-computed properties
schema = dataset.schema # Field structure
profile = dataset.profile # Dataset statistics
How Record Merging Works
The merge_records() method intelligently handles both new records and updates to existing ones. Records are matched by a hash of their inputs: if a record with identical inputs already exists, its expectations and tags are updated instead of creating a duplicate.
- Adding new records
- Updating existing records
- Bulk updates from traces
- Input uniqueness
When you first add records, they are stored together with their inputs, expectations, and metadata.
# Initial record
record_v1 = {
"inputs": {"question": "What is MLflow?", "context": "ML platform overview"},
"expectations": {"accuracy": 0.8, "mentions_tracking": True},
}
dataset.merge_records([record_v1])
# Creates a new record in the dataset
When you merge a record whose inputs match an existing record, the existing record is updated by merging in the new expectations and tags.
# Updated version with same inputs but enhanced expectations
record_v2 = {
"inputs": {
"question": "What is MLflow?", # Same question
"context": "ML platform overview", # Same context
},
"expectations": {
"accuracy": 0.95, # Updates existing value
"mentions_models": True, # Adds new expectation
"clarity": 0.9 # Adds new metric
# Note: "mentions_tracking": True is preserved from record_v1
},
"tags": {"reviewed": "true", "reviewer": "ml_team"},
}
dataset.merge_records([record_v2])
# The record is updated, not duplicated
# Final record has all expectations from both v1 and v2 merged together
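The merged result can be pictured as a dictionary union in which newer values win on conflict while keys unique to either version survive. This is a simplified illustration of the merge behavior described above, not MLflow's internal code:

```python
v1_expectations = {"accuracy": 0.8, "mentions_tracking": True}
v2_expectations = {"accuracy": 0.95, "mentions_models": True, "clarity": 0.9}

# Dict union: later values override on conflict, all other keys are preserved
merged = {**v1_expectations, **v2_expectations}

assert merged == {
    "accuracy": 0.95,           # updated by v2
    "mentions_tracking": True,  # preserved from v1
    "mentions_models": True,    # added by v2
    "clarity": 0.9,             # added by v2
}
```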
This update behavior is especially useful when adding expectations to production traces.
# First pass: Add traces without expectations
traces = mlflow.search_traces(experiment_ids=["0"], max_results=100, return_type="list")
dataset.merge_records(traces)
# Later: Subject matter experts review and add expectations
for trace in traces[:20]: # Review subset
mlflow.log_expectation(
trace_id=trace.info.trace_id,
name="quality_check",
value={"approved": True, "quality_score": 0.9},
)
# IMPORTANT: Re-fetch traces to get the attached expectations
updated_traces = mlflow.search_traces(
experiment_ids=["0"], max_results=100, return_type="list"
)
# Re-merge the updated traces - existing records are updated with expectations
dataset.merge_records(updated_traces[:20])
Records are treated as unique based on their entire inputs dictionary. Even a small difference creates a separate record.
# These are treated as different records due to different inputs
record_a = {
"inputs": {"question": "What is MLflow?", "temperature": 0.7},
"expectations": {"accuracy": 0.9},
}
record_b = {
"inputs": {
"question": "What is MLflow?",
"temperature": 0.8,
}, # Different temperature
"expectations": {"accuracy": 0.9},
}
dataset.merge_records([record_a, record_b])
# Results in 2 separate records due to different temperature values
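Conceptually, the uniqueness key behaves like a hash of the canonicalized inputs dictionary. The sketch below illustrates the idea with a stable JSON serialization; it is a simplified mental model, not MLflow's actual implementation:

```python
import hashlib
import json


def input_key(record: dict) -> str:
    """Illustrative uniqueness key: hash the inputs dict with sorted keys."""
    canonical = json.dumps(record["inputs"], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record_a = {"inputs": {"question": "What is MLflow?", "temperature": 0.7}}
record_b = {"inputs": {"temperature": 0.7, "question": "What is MLflow?"}}
record_c = {"inputs": {"question": "What is MLflow?", "temperature": 0.8}}

assert input_key(record_a) == input_key(record_b)  # key order does not matter
assert input_key(record_a) != input_key(record_c)  # any value change -> new record
```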
Understanding Source Types
MLflow tracks the provenance of every record in an evaluation dataset through source types. This helps you understand where your test data came from and analyze performance by data source.
Source Type Behaviors
Automatic Inference
When no explicit source is provided, MLflow automatically infers the source type from the record's characteristics.
Manual Override
You can always override the automatic inference by specifying explicit source information.
Provenance Tracking
Source types enable filtering and performance analysis by data origin.
Automatic Source Assignment
MLflow assigns source types automatically based on each record's characteristics.
- TRACE sources
- HUMAN sources
- CODE sources
Records created from MLflow traces are automatically assigned the TRACE source type.
# When adding traces directly (automatic TRACE source)
traces = mlflow.search_traces(experiment_ids=["0"], return_type="list")
dataset.merge_records(traces) # All records get TRACE source type
# Or when using DataFrame from search_traces
traces_df = mlflow.search_traces(experiment_ids=["0"]) # Returns DataFrame
dataset.merge_records(
traces_df
) # Automatically detects traces and assigns TRACE source
Records with expectations are inferred as HUMAN sources (subject matter expert annotations).
# Records with expectations indicate human review/annotation
human_curated = [
{
"inputs": {"question": "What is MLflow?"},
"expectations": {"answer": "MLflow is an ML platform", "quality": 0.9}
# Automatically inferred as HUMAN source due to expectations
}
]
dataset.merge_records(human_curated)
Records containing only inputs (no expectations) are inferred as CODE sources (programmatically generated).
# Records without expectations indicate programmatic generation
generated_tests = [
{"inputs": {"question": f"Test question {i}"}}
for i in range(100)
# Automatically inferred as CODE source (no expectations field)
]
dataset.merge_records(generated_tests)
Manual Source Specification
You can explicitly specify the source type and metadata for any record. When no explicit source is provided, MLflow infers the source type before sending records to the backend, using the following rules:
- Records with expectations → inferred as HUMAN sources (indicating manual annotation or ground truth).
- Records with only inputs (no expectations) → inferred as CODE sources (indicating programmatic generation).
- Records built from traces → always assigned the TRACE source, with or without expectations.
This inference runs client-side inside the merge_records() method, before records are sent to the tracking backend. You can override it by providing explicit source information.
# Specify HUMAN source for manually curated test cases
human_curated = {
"inputs": {"question": "What are your business hours?"},
"expectations": {"accuracy": 1.0, "includes_timezone": True},
"source": {
"source_type": "HUMAN",
"source_data": {"curator": "support_team", "date": "2024-11-01"},
},
}
# Specify DOCUMENT source for data from documentation
from_docs = {
"inputs": {"question": "How to install MLflow?"},
"expectations": {"mentions_pip": True, "mentions_conda": True},
"source": {
"source_type": "DOCUMENT",
"source_data": {"document_id": "install_guide", "page": 1},
},
}
# Specify CODE source for programmatically generated data
generated = [
    {
        "inputs": {"question": f"Test question {i}"},
        "source": {
            "source_type": "CODE",
            "source_data": {"generator": "test_suite_v2", "seed": 42},
        },
    }
    for i in range(100)
]
dataset.merge_records([human_curated, from_docs, *generated])
Available Source Types
Source types enable powerful filtering and analysis of your evaluation results. You can break down performance by data origin to see whether your model behaves differently on human-curated versus generated test cases, or on production traces versus documentation examples.
TRACE
Production data captured via MLflow tracing. Assigned automatically when traces are added.
HUMAN
Subject matter expert annotations. Inferred for records that include expectations.
CODE
Programmatically generated tests. Inferred for records without expectations.
DOCUMENT
Test cases drawn from documentation or specifications. Must be specified explicitly.
UNSPECIFIED
Source unknown or not provided. Used for legacy or imported data.
Analyzing Data by Source
- Source distribution
- Filtering by source
- Source metadata
# Convert dataset to DataFrame for analysis
df = dataset.to_df()
# Check source type distribution
source_distribution = df["source_type"].value_counts()
print("Data sources in dataset:")
for source_type, count in source_distribution.items():
print(f" {source_type}: {count} records")
# Analyze expectations by source
human_records = df[df["source_type"] == "HUMAN"]
trace_records = df[df["source_type"] == "TRACE"]
code_records = df[df["source_type"] == "CODE"]
print(f"Human-curated records: {len(human_records)}")
print(f"Production trace records: {len(trace_records)}")
print(f"Generated test records: {len(code_records)}")
# Filter high-value test cases for critical evaluation
high_value_test_cases = df[
(df["source_type"] == "HUMAN") | (df["source_type"] == "DOCUMENT")
]
The source_data field stores rich metadata about where a record came from.
# Example with detailed source metadata
detailed_source = {
"inputs": {"question": "Complex integration test"},
"expectations": {"passes_validation": True},
"source": {
"source_type": "TRACE",
"source_data": {
"trace_id": "tr-abc123",
"environment": "production",
"user_segment": "enterprise",
"timestamp": "2024-11-01T10:30:00Z",
"session_id": "sess-xyz789",
"feedback_score": 0.95,
},
},
}
# Access metadata after merging
dataset.merge_records([detailed_source])
df = dataset.to_df()
# source_data preserved for analysis
Search Filter Reference
Use these fields in your filter strings. Note: the fluent API returns a PagedList that can be iterated directly; pagination is handled automatically as you iterate over the results.
Field | Type | Example |
---|---|---|
name | string | name = 'production_tests' |
tags.<key> | string | tags.status = 'validated' |
created_by | string | created_by = 'alice@company.com' |
last_updated_by | string | last_updated_by = 'bob@company.com' |
created_time | timestamp | created_time > 1698800000000 |
last_update_time | timestamp | last_update_time > 1698800000000 |
Filter Operators
- =, != : exact match.
- LIKE, ILIKE : pattern matching with % wildcards (ILIKE is case-insensitive).
- >, <, >=, <= : numeric and timestamp comparisons.
- AND : combines conditions (OR is not currently supported for evaluation datasets).
# Complex filter example
datasets = search_datasets(
filter_string="""
tags.status = 'production'
AND name LIKE '%customer%'
AND created_time > 1698800000000
""",
order_by=["last_update_time DESC"],
)
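As a rough mental model for the pattern operators (a simplified illustration, not MLflow's implementation), LIKE and ILIKE can be thought of as anchored regular expressions in which % matches any run of characters and ILIKE ignores case:

```python
import re


def like_match(pattern: str, value: str, ignore_case: bool = False) -> bool:
    """Approximate LIKE/ILIKE semantics: % is a wildcard, match is anchored."""
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("%")) + "$"
    flags = re.IGNORECASE if ignore_case else 0
    return re.match(regex, value, flags) is not None


assert like_match("%test%", "regression_test_suite")       # LIKE substring match
assert not like_match("%test%", "Regression_Test_Suite")   # LIKE is case-sensitive
assert like_match("%test%", "Regression_Test_Suite", ignore_case=True)  # ILIKE
```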
Working with the Client API
For applications and advanced use cases, you can also use the MlflowClient API, which provides the same functionality as the fluent, object-oriented interface.
- Create datasets
- Get datasets
- Search datasets
- Manage tags
- Delete datasets
from mlflow import MlflowClient
client = MlflowClient()
# Create a dataset
dataset = client.create_dataset(
name="customer_support_qa",
experiment_id=["0"],
tags={"version": "1.0", "team": "ml-platform"},
)
# Get a dataset by ID
dataset = client.get_dataset(dataset_id="d-7f2e3a9b8c1d4e5f6a7b8c9d0e1f2a3b")
# Access properties
print(f"Dataset: {dataset.name}")
print(f"Records: {len(dataset.records)}")
# Search for datasets
datasets = client.search_datasets(
experiment_ids=["0"],
filter_string="tags.status = 'validated'",
order_by=["created_time DESC"],
max_results=50,
)
for dataset in datasets:
print(f"{dataset.name}: {dataset.dataset_id}")
# Set tags
client.set_dataset_tags(
dataset_id=dataset.dataset_id, tags={"status": "production", "validated": "true"}
)
# Delete a tag
client.delete_dataset_tag(dataset_id=dataset.dataset_id, key="deprecated")
# Delete a dataset
client.delete_dataset(dataset_id=dataset.dataset_id)
The client API provides the same functionality as the fluent API but is better suited for:
- Production applications that require explicit client management.
- Scenarios that require custom tracking URIs or authentication.
- Integration with existing MlflowClient-based workflows.