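Search Traces
This guide walks you through searching for traces in MLflow using both the MLflow UI and the Python API. It is useful when you need to query specific traces by their metadata, tags, execution time, status, or other trace attributes.
MLflow's trace search lets you filter traces on a variety of conditions with a SQL-like syntax. The OR keyword is not supported, but the search is still expressive enough for complex trace discovery and analysis queries.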
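Because OR is not supported, one way to approximate it is to run one query per alternative and merge the results by trace ID. The sketch below is illustrative only: search_any and the search callable are hypothetical names, not part of MLflow. The callable stands in for whatever wrapper you write around a real search call and is assumed to yield (trace_id, trace) pairs.

```python
from typing import Callable, Iterable


def search_any(
    search: Callable[[str], Iterable[tuple[str, object]]],
    filters: list[str],
) -> dict[str, object]:
    """Emulate OR: run each filter separately and merge results by trace ID.

    `search` is a hypothetical callable that takes a filter string and
    yields (trace_id, trace) pairs; it is not an MLflow function.
    """
    merged: dict[str, object] = {}
    for filter_string in filters:
        for trace_id, trace in search(filter_string):
            # setdefault keeps the first copy, so a trace that matches
            # several filters appears only once in the result.
            merged.setdefault(trace_id, trace)
    return merged


# In-memory stand-in for a backend so the sketch runs without a tracking server.
_FAKE_RESULTS = {
    "tags.model_name = 'gpt-4'": [("tr-1", "trace 1"), ("tr-2", "trace 2")],
    "tags.environment = 'production'": [("tr-2", "trace 2"), ("tr-3", "trace 3")],
}


def fake_search(filter_string: str):
    return _FAKE_RESULTS.get(filter_string, [])


result = search_any(fake_search, list(_FAKE_RESULTS))
print(sorted(result))  # ['tr-1', 'tr-2', 'tr-3']
```

For string attributes such as status, the IN operator already covers this case in a single query, so the merge approach is mainly relevant for tags and metadata.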
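Trace Search Overview
When working with MLflow tracing in production, you will often have thousands of traces across different experiments, representing model inferences, LLM calls, or ML pipeline executions. The search_traces API helps you find specific traces by their execution characteristics, metadata, tags, and other attributes, which makes trace analysis and debugging more efficient.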
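Search Query Syntax
The search_traces API uses a SQL-like domain-specific language (DSL) to query traces.
[Figure: visual representation of the search components]
Key features:
- Supported attributes: request_id, timestamp_ms, execution_time_ms, status, name, run_id
- Tag support: filter by trace tags with the tags. or tag. prefix
- Metadata support: filter by request metadata with the metadata. prefix
- Timestamp filtering: built-in support for time-based queries
- Status filtering: filter by trace execution status (OK, ERROR, IN_PROGRESS)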
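Syntax rules:
Field syntax
- Attributes: status, timestamp_ms, execution_time_ms, trace.name
- Tags: tags.operation_type, tag.model_name (both prefixes are supported)
- Metadata: metadata.run_id
- Use backticks for keys with special characters: tags.`model-name`
Value syntax
- String values must be quoted: status = 'OK'
- Numeric values do not need quotes: execution_time_ms > 1000
- Tag and metadata values must be quoted as strings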
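The field and value rules above can be captured in a few small helpers that assemble a filter string. This is a minimal sketch, not an MLflow utility: quote_key, format_value, condition, and all_of are hypothetical names. Values containing a single quote are rejected because escaping rules are not covered in this guide.

```python
import re

# Keys made only of letters, digits, and underscores need no backticks.
_PLAIN_KEY = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote_key(prefix: str, key: str) -> str:
    """Return prefix.key, wrapping the key in backticks if it has special characters."""
    if _PLAIN_KEY.match(key):
        return f"{prefix}.{key}"
    return f"{prefix}.`{key}`"


def format_value(value) -> str:
    """Numbers stay bare; everything else becomes a single-quoted string."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    text = str(value)
    if "'" in text:
        # Escaping is out of scope for this sketch, so reject rather than guess.
        raise ValueError(f"value contains a single quote: {text!r}")
    return f"'{text}'"


def condition(field: str, op: str, value) -> str:
    """Render one comparison; tag and metadata values are always quoted."""
    is_tag_or_metadata = field.startswith(("tags.", "tag.", "metadata."))
    rendered = format_value(str(value) if is_tag_or_metadata else value)
    return f"{field} {op} {rendered}"


def all_of(*conditions: str) -> str:
    """Join conditions with AND, the only supported combinator."""
    return " AND ".join(conditions)


filter_string = all_of(
    condition(quote_key("tags", "model-name"), "=", "gpt-4"),
    condition("status", "=", "OK"),
    condition("execution_time_ms", ">", 1000),
)
print(filter_string)
# tags.`model-name` = 'gpt-4' AND status = 'OK' AND execution_time_ms > 1000
```

The resulting string can be passed as the filter_string argument of a search call.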
Supported comparators
- Numeric (timestamp_ms, execution_time_ms): >, >=, =, !=, <, <=
- String (name, status, request_id): =, !=, IN, NOT IN
- Tags/metadata: =, !=
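The comparator table can be turned into a small pre-flight check that catches an unsupported operator before a query is sent. This is a sketch under the assumption that the table above is complete; allowed_operators and is_supported are hypothetical helpers, and fields not listed in the table (such as run_id) are treated as unknown here.

```python
NUMERIC_FIELDS = {"timestamp_ms", "execution_time_ms"}
STRING_FIELDS = {"name", "status", "request_id"}

NUMERIC_OPS = {">", ">=", "=", "!=", "<", "<="}
STRING_OPS = {"=", "!=", "IN", "NOT IN"}
TAG_OPS = {"=", "!="}


def allowed_operators(field: str) -> set[str]:
    """Return the comparators listed in this guide for the given field."""
    # Accept the optional "trace." prefix shown in the field-syntax examples.
    name = field.removeprefix("trace.")
    if name.startswith(("tags.", "tag.", "metadata.")):
        return TAG_OPS
    if name in NUMERIC_FIELDS:
        return NUMERIC_OPS
    if name in STRING_FIELDS:
        return STRING_OPS
    raise ValueError(f"field not covered by the comparator table: {field}")


def is_supported(field: str, op: str) -> bool:
    """Check an operator case-insensitively against the table."""
    return op.upper() in allowed_operators(field)


print(is_supported("execution_time_ms", ">="))  # True
print(is_supported("status", "in"))             # True
print(is_supported("tags.model_name", "IN"))    # False
```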
Trace status values
- OK - successful execution
- ERROR - failed execution
- IN_PROGRESS - execution in progress
Example Queries
Filter by name
# Search for traces by name
mlflow.search_traces(filter_string="trace.name = 'predict'")
mlflow.search_traces(filter_string="name = 'llm_inference'")
Filter by status
# Get successful traces
mlflow.search_traces(filter_string="trace.status = 'OK'")
mlflow.search_traces(filter_string="status = 'OK'")
# Get failed traces
mlflow.search_traces(filter_string="status = 'ERROR'")
# Multiple statuses
mlflow.search_traces(filter_string="status IN ('OK', 'ERROR')")
Filter by execution time
# Find slow traces (> 1 second)
mlflow.search_traces(filter_string="execution_time_ms > 1000")
# Performance range
mlflow.search_traces(
    filter_string="execution_time_ms >= 200 AND execution_time_ms <= 800"
)
Filter by timestamp
import time
# Get traces from last hour
timestamp = int(time.time() * 1000)
mlflow.search_traces(filter_string=f"trace.timestamp > {timestamp - 3600000}")
# Alternative syntax
mlflow.search_traces(filter_string=f"timestamp_ms > {timestamp - 3600000}")
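Timestamps in filters are epoch milliseconds, so a bounded time window is easiest to build from timezone-aware datetimes. The helpers below are a minimal sketch (to_ms and time_window_filter are hypothetical names); they only produce the filter string and assume two timestamp_ms conditions can be joined with AND, as the range examples in this guide do for execution_time_ms.

```python
from datetime import datetime, timedelta, timezone


def to_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    if moment.tzinfo is None:
        # Naive datetimes are interpreted in local time, which is easy to get wrong.
        raise ValueError("use a timezone-aware datetime")
    return int(moment.timestamp() * 1000)


def time_window_filter(start: datetime, end: datetime) -> str:
    """Build a half-open [start, end) window on timestamp_ms."""
    return f"timestamp_ms >= {to_ms(start)} AND timestamp_ms < {to_ms(end)}"


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = start + timedelta(hours=1)
print(time_window_filter(start, end))
# timestamp_ms >= 1704067200000 AND timestamp_ms < 1704070800000
```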
Filter by tags
# Filter by tag values (both syntaxes supported)
mlflow.search_traces(filter_string="tag.model_name = 'gpt-4'")
mlflow.search_traces(filter_string="tags.operation_type = 'llm_inference'")
Filter by run association
# Find traces associated with a specific run
mlflow.search_traces(run_id="run_id_123456")
# Or using filter string
mlflow.search_traces(filter_string="metadata.run_id = 'run_id_123456'")
Combine multiple conditions
# Complex query
mlflow.search_traces(filter_string="trace.status = 'OK' AND tag.importance = 'high'")
# Production error analysis
mlflow.search_traces(
    filter_string="""
        tags.environment = 'production'
        AND status = 'ERROR'
        AND execution_time_ms > 500
    """
)
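Filter Traces in the UI
Use the search box in the MLflow Trace UI to filter traces by various conditions, using the same syntax described above.
The UI search supports the same filter syntax as the API, so you can:
- Filter by trace name, status, or execution time
- Search by tags and metadata
- Use timestamp ranges
- Combine multiple conditions with AND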
Programmatic Search with Python
mlflow.search_traces() provides a convenient way to search traces programmatically:
import mlflow
# Basic search with default DataFrame output
traces_df = mlflow.search_traces(filter_string="status = 'OK'")
# Return as list of Trace objects
traces_list = mlflow.search_traces(filter_string="status = 'OK'", return_type="list")
The return_type parameter is available in MLflow 2.21.1+. For older versions, use mlflow.client.MlflowClient.search_traces() to get list output.
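Return Format
1. DataFrame
By default, the search_traces API returns a pandas DataFrame. The columns depend on the MLflow version.
MLflow 3.x:
- trace_id - primary identifier
- trace - the trace object
- client_request_id - client request ID
- state - trace state (OK, ERROR, IN_PROGRESS, STATE_UNSPECIFIED)
- request_time - start time (milliseconds)
- execution_duration - duration (milliseconds)
- inputs - inputs to the traced logic
- outputs - outputs of the traced logic
- expectations - dictionary of ground-truth labels annotated on the trace
- trace_metadata - key-value metadata
- tags - associated tags
- assessments - list of assessment objects attached to the trace
MLflow 2.x:
- request_id - primary identifier
- trace - the trace object
- timestamp_ms - start time (milliseconds)
- status - trace status
- execution_time_ms - duration (milliseconds)
- request - inputs to the traced logic
- response - outputs of the traced logic
- request_metadata - key-value metadata
- spans - the spans in the trace
- tags - associated tags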
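Code that has to run against both schemas can normalize column names first. The mapping below is an assumption inferred by pairing columns with matching descriptions in the two lists above; it is not an official MLflow mapping, and to_v3_columns is a hypothetical helper.

```python
# 2.x column name -> 3.x column name, paired by their descriptions above.
V2_TO_V3_COLUMNS = {
    "request_id": "trace_id",
    "timestamp_ms": "request_time",
    "status": "state",
    "execution_time_ms": "execution_duration",
    "request": "inputs",
    "response": "outputs",
    "request_metadata": "trace_metadata",
}


def to_v3_columns(columns: list[str]) -> list[str]:
    """Rename 2.x columns to their 3.x counterparts; leave other names unchanged."""
    return [V2_TO_V3_COLUMNS.get(name, name) for name in columns]


v2_columns = ["request_id", "trace", "timestamp_ms", "status", "spans", "tags"]
print(to_v3_columns(v2_columns))
# ['trace_id', 'trace', 'request_time', 'state', 'spans', 'tags']
```

With a pandas DataFrame, the same dictionary can be passed to DataFrame.rename(columns=V2_TO_V3_COLUMNS). Columns that exist in only one version (for example spans or assessments) are left as they are.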
2. List of Trace objects
Alternatively, specify return_type="list" to get a list of mlflow.entities.Trace() objects instead of a DataFrame.
traces = mlflow.search_traces(filter_string="status = 'OK'", return_type="list")
# list[mlflow.entities.Trace]
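Ordering Results
MLflow supports ordering results by the following keys:
- timestamp_ms (default: DESC) - trace start time
- execution_time_ms - trace duration
- status - trace execution status
- request_id - trace identifier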
# Order by timestamp (most recent first)
traces = mlflow.search_traces(order_by=["timestamp_ms DESC"])
# Multiple ordering criteria
traces = mlflow.search_traces(order_by=["timestamp_ms DESC", "status ASC"])
Extract Span Fields
Extract specific span data into DataFrame columns:
traces = mlflow.search_traces(
    extract_fields=[
        "morning_greeting.inputs.name",  # Extract specific input
        "morning_greeting.outputs",  # Extract all outputs
    ],
)
# Creates additional columns:
# - morning_greeting.inputs.name
# - morning_greeting.outputs
This is useful for creating evaluation datasets:
eval_data = traces.rename(
    columns={
        "morning_greeting.inputs.name": "inputs",
        "morning_greeting.outputs": "ground_truth",
    }
)
results = mlflow.genai.evaluate(data=eval_data, scorers=[...])
extract_fields is only available with return_type="pandas".
Pagination
mlflow.client.MlflowClient.search_traces() supports pagination:
from mlflow import MlflowClient
client = MlflowClient()
page_token = None
all_traces = []
while True:
    results = client.search_traces(
        experiment_ids=["1"],
        filter_string="status = 'OK'",
        max_results=100,
        page_token=page_token,
    )
    all_traces.extend(results)
    if not results.token:
        break
    page_token = results.token
print(f"Found {len(all_traces)} total traces")
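The token loop above can be wrapped in a generator so callers iterate over traces without handling page tokens themselves. This is a sketch, not an MLflow API: iter_all and fetch_page are hypothetical names, and the in-memory pages stand in for a real backend so the example runs on its own.

```python
from typing import Callable, Iterator, Optional


def iter_all(
    fetch_page: Callable[[Optional[str]], tuple[list, Optional[str]]],
) -> Iterator:
    """Yield every item from a token-paginated source.

    `fetch_page` is a hypothetical callable: given a page token (None for
    the first page) it returns (items, next_token), where next_token is
    falsy on the last page.
    """
    token = None
    while True:
        items, token = fetch_page(token)
        yield from items
        if not token:
            break


# In-memory stand-in for a paginated backend.
_PAGES = {None: (["t1", "t2"], "p2"), "p2": (["t3"], None)}

all_items = list(iter_all(lambda token: _PAGES[token]))
print(all_items)  # ['t1', 't2', 't3']
```

With the client from the example above, fetch_page could call client.search_traces(..., page_token=token) and return (list(results), results.token). Because the generator is lazy, a caller can stop early without fetching the remaining pages.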
Common Use Cases
Performance analysis
# Find slowest 10 traces
slowest_traces = mlflow.search_traces(
    filter_string="status = 'OK'",
    order_by=["execution_time_ms DESC"],
    max_results=10,
)
# Performance threshold violations
slow_production = mlflow.search_traces(
    filter_string="""
        tags.environment = 'production'
        AND execution_time_ms > 2000
        AND status = 'OK'
    """,
)
Error analysis
import time
# Recent errors
yesterday = int((time.time() - 24 * 3600) * 1000)
error_traces = mlflow.search_traces(
    filter_string=f"status = 'ERROR' AND timestamp_ms > {yesterday}",
    order_by=["timestamp_ms DESC"],
)
# Analyze error patterns
error_by_operation = {}
for _, trace in error_traces.iterrows():
    # Access tags from the DataFrame row
    tags = trace["tags"] if "tags" in trace else {}
    op_type = tags.get("operation_type", "unknown")
    error_by_operation[op_type] = error_by_operation.get(op_type, 0) + 1
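The counting step can also be written with collections.Counter, which removes the manual dictionary bookkeeping. The sketch below works on plain tag dictionaries so it runs without MLflow; count_by_tag is a hypothetical helper and the sample rows are made up for illustration.

```python
from collections import Counter


def count_by_tag(tag_dicts, key, default="unknown"):
    """Count rows by one tag, treating a missing tag (or missing tags) as `default`."""
    return Counter((tags or {}).get(key, default) for tags in tag_dicts)


# Made-up tag dictionaries, one per trace.
rows = [
    {"operation_type": "llm_inference"},
    {"operation_type": "embedding_generation"},
    {"operation_type": "llm_inference"},
    {},    # trace without an operation_type tag
    None,  # trace without any tags
]
counts = count_by_tag(rows, "operation_type")
print(counts["llm_inference"], counts["embedding_generation"], counts["unknown"])
# 2 1 2
```

Assuming the DataFrame has a tags column of dictionaries, as described in the return format, the same helper could be applied to error_traces["tags"].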
Model performance comparison
# Compare performance across models
models = ["gpt-4", "bert-base", "roberta-large"]
model_stats = {}
for model in models:
    traces = mlflow.search_traces(
        filter_string=f"tags.model_name = '{model}' AND status = 'OK'",
        return_type="list",
    )
    if traces:
        exec_times = [trace.info.execution_time_ms for trace in traces]
        model_stats[model] = {
            "count": len(traces),
            "avg_time": sum(exec_times) / len(exec_times),
            "max_time": max(exec_times),
        }
print("Model performance comparison:")
for model, stats in model_stats.items():
    print(f"{model}: {stats['count']} traces, avg {stats['avg_time']:.1f}ms")
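Averages are sensitive to a few very slow traces, so latency comparisons often report percentiles as well. The helper below is a minimal sketch using the nearest-rank definition; percentile is a hypothetical name and the sample durations are made up. It could be applied to a list like exec_times from the example above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up execution times in milliseconds, with one outlier.
sample_times = [120, 95, 310, 2050, 180, 140, 160, 130, 110, 150]
print(sum(sample_times) / len(sample_times))  # 344.5
print(percentile(sample_times, 50))           # 140
print(percentile(sample_times, 95))           # 2050
```

Here the single 2050 ms trace pulls the mean to 344.5 ms while the median stays at 140 ms, which is why the two numbers are worth reporting side by side.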
Create evaluation datasets
# Extract LLM conversation data for evaluation
conversation_data = mlflow.search_traces(
    filter_string="tags.task_type = 'conversation' AND status = 'OK'",
    extract_fields=["llm_call.inputs.prompt", "llm_call.outputs.response"],
)
# Rename for evaluation
eval_dataset = conversation_data.rename(
    columns={
        "llm_call.inputs.prompt": "inputs",
        "llm_call.outputs.response": "ground_truth",
    }
)
# Use with MLflow evaluate
results = mlflow.genai.evaluate(data=eval_dataset, scorers=[...])
Environment monitoring
# Monitor error rates across environments
environments = ["production", "staging", "development"]
for env in environments:
    total = mlflow.search_traces(filter_string=f"tags.environment = '{env}'")
    errors = mlflow.search_traces(
        filter_string=f"tags.environment = '{env}' AND status = 'ERROR'",
    )
    error_rate = len(errors) / len(total) * 100 if len(total) > 0 else 0
    print(f"{env}: {error_rate:.1f}% error rate ({len(errors)}/{len(total)})")
Create Example Traces
Create example traces to explore the search functionality:
import time
import mlflow
from mlflow.entities import SpanType
# Define methods to be traced
@mlflow.trace(span_type=SpanType.TOOL, attributes={"time": "morning"})
def morning_greeting(name: str):
    time.sleep(1)
    mlflow.update_current_trace(tags={"person": name})
    return f"Good morning {name}."
@mlflow.trace(span_type=SpanType.TOOL, attributes={"time": "evening"})
def evening_greeting(name: str):
    time.sleep(1)
    mlflow.update_current_trace(tags={"person": name})
    return f"Good evening {name}."
@mlflow.trace(span_type=SpanType.TOOL)
def goodbye():
    raise Exception("Cannot say goodbye")
# Execute within different experiments
morning_experiment = mlflow.set_experiment("Morning Experiment")
morning_greeting("Tom")
# Get timestamp for filtering
morning_time = int(time.time() * 1000)
evening_experiment = mlflow.set_experiment("Evening Experiment")
evening_greeting("Mary")
try:
    goodbye()
except Exception:
    pass  # This creates an ERROR trace
print("Created example traces with different statuses and timing")
Alternative setup - production-like traces
import mlflow
import time
import random
mlflow.set_experiment("trace-search-guide")
# Configuration for realistic traces
operation_types = ["llm_inference", "embedding_generation", "text_classification"]
model_names = ["gpt-4", "bert-base", "roberta-large"]
environments = ["production", "staging", "development"]
def simulate_operation(op_type, model_name, duration_ms):
    """Simulate an AI/ML operation"""
    time.sleep(duration_ms / 1000.0)
    # Simulate occasional errors
    if random.random() < 0.1:
        raise Exception(f"Simulated error in {op_type}")
    return f"Completed {op_type} with {model_name}"
# Create diverse traces
for i in range(20):
    op_type = random.choice(operation_types)
    model_name = random.choice(model_names)
    environment = random.choice(environments)
    duration = random.randint(50, 2000)  # 50ms to 2s
    try:
        with mlflow.start_run():
            mlflow.set_tag("environment", environment)  # tags the run, not the trace
            # mlflow.trace is a decorator, so the span is opened with the
            # mlflow.start_span context manager instead
            with mlflow.start_span(
                name=f"{op_type}_{i}",
                attributes={
                    "operation_type": op_type,
                    "model_name": model_name,
                    "environment": environment,
                    "input_tokens": str(random.randint(10, 500)),
                },
            ) as span:
                # Tag the trace itself so the tags.* filters in this guide can match
                mlflow.update_current_trace(
                    tags={
                        "operation_type": op_type,
                        "model_name": model_name,
                        "environment": environment,
                    }
                )
                result = simulate_operation(op_type, model_name, duration)
                span.set_attribute("result", result)
    except Exception:
        # Creates ERROR status traces
        continue
print("Created 20 example traces with various characteristics")
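Start the MLflow UI to explore
mlflow ui
Visit https://:5000/ to view your traces in the UI.
Once these traces are created, you can try searching them in the UI, or programmatically through the fluent or client search_traces API.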
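Important Notes
MLflow version compatibility
DataFrame schema: the format depends on the MLflow version used to call the search_traces API, not the version used to log the traces. MLflow 3.x uses different column names than 2.x.
Return type support
- MLflow 2.21.1+: the return_type parameter is available in mlflow.search_traces()
- Earlier versions: use MlflowClient.search_traces() to get list output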
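Performance tips
- Use timestamp filters to limit the search space
- Limit max_results for faster queries (when ordering)
- Use pagination for large result sets
- Index frequently queried tags in your storage system
Backend considerations
- Database backend: optimize performance with appropriate indexes on timestamp and status
- Databricks: enhanced performance via the sql_warehouse_id parameter
- Local file store: can be slow for large datasets. Not recommended; suitable only for storing a small number of traces.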
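Summary
The search_traces API provides trace discovery and analysis capabilities in MLflow. By combining flexible filtering, time-based queries, tag-based organization, and advanced features such as span field extraction, you can efficiently investigate trace patterns, debug issues, and monitor system performance.
Key takeaways
- Use SQL-like syntax with tags./tag., metadata., and direct attribute references
- Filter by execution time, status, timestamp, and custom tags
- Combine multiple conditions with AND (OR is not supported)
- Use ordering and pagination for efficient data exploration
- Use span field extraction to create evaluation datasets
- Choose the return type that fits your use case
Whether you are debugging production issues, analyzing model performance, monitoring system health, or creating evaluation datasets, mastering the trace search API will make your MLflow workflows more efficient and insightful.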