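Setting Trace Tags
Tags are mutable key-value pairs that you can attach to traces to add valuable metadata and context. This metadata is useful for organizing, searching, and filtering your traces. For example, you might tag traces by the topic of the user's input, the environment they run in, or the model version in use.
MLflow lets you add, update, or remove tags at any time, even after a trace has been logged, through its APIs or the MLflow UI.
When to Use Trace Tags
Trace tags are particularly useful for:
- Session management: group traces by conversation session or user interaction
- Environment tracking: distinguish between production, staging, and development traces
- Model versioning: track which model version generated a given trace
- User context: associate traces with specific users or customer segments
- Performance monitoring: tag traces based on performance characteristics
- A/B testing: differentiate between experiment variants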
Setting Tags on an Active Trace
Use mlflow.update_current_trace() to add tags while a trace is executing.
import mlflow


@mlflow.trace
def my_func(x):
    mlflow.update_current_trace(tags={"fruit": "apple"})
    return x + 1


result = my_func(5)
Example: Setting Service Tags in a Production System
import mlflow
import os


@mlflow.trace
def process_user_request(user_id: str, session_id: str, request_text: str):
    # Add comprehensive tags for production monitoring
    mlflow.update_current_trace(
        tags={
            "user_id": user_id,
            "session_id": session_id,
            "environment": os.getenv("ENVIRONMENT", "development"),
            "model_version": os.getenv("MODEL_VERSION", "1.0.0"),
            "request_type": "chat_completion",
            "priority": "high" if "urgent" in request_text.lower() else "normal",
        }
    )
    response = f"Processed: {request_text}"
    return response
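Note
mlflow.update_current_trace() adds the specified tags to the current trace when the key is not already present. If the key already exists, its value is overwritten with the new one.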
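The add-or-overwrite behavior described in the note can be pictured as an ordinary dictionary update. The sketch below is only an illustration of those semantics: it uses a plain Python dict as a stand-in for a trace's tags and does not call MLflow, and `apply_tags` is a hypothetical helper name.

```python
def apply_tags(existing: dict[str, str], new_tags: dict[str, str]) -> dict[str, str]:
    """Return a copy of `existing` with `new_tags` added or overwritten."""
    merged = dict(existing)
    merged.update(new_tags)  # new keys are added, existing keys get the new value
    return merged


current = {"fruit": "apple"}
current = apply_tags(current, {"color": "red"})   # key absent: tag is added
current = apply_tags(current, {"fruit": "pear"})  # key present: value is replaced
print(current)  # {'fruit': 'pear', 'color': 'red'}
```

Calling the update repeatedly with the same key is therefore safe; only the most recent value is kept.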
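Setting Tags on a Finished Trace
Add or modify tags on traces that have already been completed and logged.
Available APIs
API | Use case |
---|---|
mlflow.set_trace_tag() | Fluent API for setting tags |
mlflow.client.MlflowClient.set_trace_tag() | Client API for setting tags |
MLflow UI | Visual tag management |
Basic Usage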
import mlflow
from mlflow import MlflowClient
# Using fluent API
mlflow.set_trace_tag(trace_id="your-trace-id", key="tag_key", value="tag_value")
mlflow.delete_trace_tag(trace_id="your-trace-id", key="tag_key")
# Using client API
client = MlflowClient()
client.set_trace_tag(trace_id="your-trace-id", key="tag_key", value="tag_value")
client.delete_trace_tag(trace_id="your-trace-id", key="tag_key")
Batch Tagging
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Find traces that need to be tagged.
# return_type="list" yields Trace objects; the default may be a pandas DataFrame.
traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="status = 'ERROR'",
    max_results=100,
    return_type="list",
)

# Add tags to all error traces
for trace in traces:
    client.set_trace_tag(trace_id=trace.info.trace_id, key="needs_review", value="true")
    client.set_trace_tag(
        trace_id=trace.info.trace_id, key="review_priority", value="high"
    )
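When several tags have to be applied to many traces, the nested loop can be factored into a small helper. The following is a minimal sketch, not part of MLflow: `tag_traces`, `TagClient`, and `RecordingClient` are hypothetical names, and the only assumption about the real client is the `set_trace_tag(trace_id=..., key=..., value=...)` call shown above. A recording stand-in is used so the example runs without a tracking server.

```python
from typing import Iterable, Mapping, Protocol


class TagClient(Protocol):
    """Anything exposing set_trace_tag with the keyword arguments used above."""

    def set_trace_tag(self, trace_id: str, key: str, value: str) -> None: ...


def tag_traces(client: TagClient, trace_ids: Iterable[str], tags: Mapping[str, str]) -> int:
    """Apply every tag in `tags` to every trace ID; return the number of writes."""
    writes = 0
    for trace_id in trace_ids:
        for key, value in tags.items():
            client.set_trace_tag(trace_id=trace_id, key=key, value=value)
            writes += 1
    return writes


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting MLflow."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str, str]] = []

    def set_trace_tag(self, trace_id: str, key: str, value: str) -> None:
        self.calls.append((trace_id, key, value))


fake = RecordingClient()
written = tag_traces(
    fake, ["tr-1", "tr-2"], {"needs_review": "true", "review_priority": "high"}
)
print(written)  # 4
```

In real use, an `MlflowClient()` instance would presumably be passed in place of `RecordingClient`.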
Performance Analysis Tagging
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Get slow traces for analysis.
# return_type="list" yields Trace objects; the default may be a pandas DataFrame.
traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="execution_time_ms > 5000",
    max_results=50,
    return_type="list",
)

# Tag based on performance analysis
for trace in traces:
    execution_time = trace.info.execution_time_ms
    if execution_time > 10000:
        performance_tag = "very_slow"
    elif execution_time > 7500:
        performance_tag = "slow"
    else:
        performance_tag = "moderate"
    client.set_trace_tag(
        trace_id=trace.info.trace_id, key="performance_category", value=performance_tag
    )
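The threshold logic above is easier to test and reuse when it lives in its own function. This sketch uses the same cut-offs (10000 ms and 7500 ms) as the example; `categorize_latency` is a hypothetical helper name.

```python
def categorize_latency(execution_time_ms: float) -> str:
    """Map an execution time in milliseconds to a performance category."""
    if execution_time_ms > 10000:
        return "very_slow"
    if execution_time_ms > 7500:
        return "slow"
    return "moderate"


for ms in (6000, 8000, 12000):
    print(ms, categorize_latency(ms))
```

Note that the boundaries are exclusive: a trace of exactly 7500 ms is still "moderate", matching the comparisons in the loop above.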
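Using the MLflow UI
Navigate to the trace details page and click the pencil icon next to the tags to edit them visually.
UI Features
- Add a new tag by clicking the "+" button
- Edit an existing tag by clicking the pencil icon
- Delete a tag by clicking the trash icon
- View all tags associated with a trace
Searching and Filtering with Tags
Use tags to find specific traces quickly and efficiently.
Basic Tag Filtering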
import mlflow

# Find traces by environment
production_traces = mlflow.search_traces(
    experiment_ids=["1"], filter_string="tags.environment = 'production'"
)

# Find traces by user
user_traces = mlflow.search_traces(
    experiment_ids=["1"], filter_string="tags.user_id = 'user_123'"
)

# Find high-priority traces
urgent_traces = mlflow.search_traces(
    experiment_ids=["1"], filter_string="tags.priority = 'high'"
)
Complex Tag-Based Queries
import mlflow

# Combine tag filters with other conditions
slow_production_errors = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="""
        tags.environment = 'production'
        AND status = 'ERROR'
        AND execution_time_ms > 5000
    """,
)

# Find traces that need review
review_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="tags.needs_review = 'true'",
    order_by=["timestamp_ms DESC"],
)

# Find specific user sessions
session_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="tags.session_id = 'session_456'",
    order_by=["timestamp_ms ASC"],
)
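Filter strings like the ones above are often assembled from a set of tag values at runtime. The following sketch builds an `AND`-joined filter in the `tags.<key> = '<value>'` form used in these examples. `build_tag_filter` is a hypothetical helper, not an MLflow function, and it deliberately rejects keys containing anything other than letters, digits, and underscores, as well as values containing single quotes, rather than guessing how they should be escaped.

```python
import re

# Only plain identifier-style keys are accepted in this sketch.
_SIMPLE_KEY = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_tag_filter(tags: dict[str, str]) -> str:
    """Build a filter string that matches traces carrying all the given tags."""
    clauses = []
    for key, value in tags.items():
        if not _SIMPLE_KEY.fullmatch(key):
            raise ValueError(f"unsupported tag key: {key!r}")
        if "'" in value:
            raise ValueError(f"unsupported tag value: {value!r}")
        clauses.append(f"tags.{key} = '{value}'")
    return " AND ".join(clauses)


print(build_tag_filter({"environment": "production", "priority": "high"}))
# tags.environment = 'production' AND tags.priority = 'high'
```

The resulting string could then be passed as `filter_string` to a search call like those shown above.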
Operational Monitoring Queries
import mlflow

# Monitor A/B test performance
control_group = mlflow.search_traces(
    experiment_ids=["1"], filter_string="tags.experiment_variant = 'control'"
)
treatment_group = mlflow.search_traces(
    experiment_ids=["1"], filter_string="tags.experiment_variant = 'treatment'"
)

# Find traces needing escalation
escalation_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="""
        tags.sla_tier = 'critical'
        AND execution_time_ms > 30000
    """,
)
Analytics and Reporting
import mlflow

# Generate performance reports by model version.
# return_type="list" yields Trace objects; the default may be a pandas DataFrame.
model_v1_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="tags.model_version = 'v1.0.0'",
    return_type="list",
)
model_v2_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="tags.model_version = 'v2.0.0'",
    return_type="list",
)

# Compare performance (assumes each search returned at least one trace)
v1_avg_time = sum(t.info.execution_time_ms for t in model_v1_traces) / len(
    model_v1_traces
)
v2_avg_time = sum(t.info.execution_time_ms for t in model_v2_traces) / len(
    model_v2_traces
)
print(f"V1 average time: {v1_avg_time:.2f}ms")
print(f"V2 average time: {v2_avg_time:.2f}ms")
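The averages above divide by the number of matching traces, which fails when a search returns nothing. A minimal sketch of a more defensive average follows; `average_ms` is a hypothetical helper that takes plain durations (for example, values collected from `t.info.execution_time_ms`) and skips missing ones.

```python
from typing import Iterable, Optional


def average_ms(durations: Iterable[Optional[float]]) -> Optional[float]:
    """Average the given durations, ignoring None; return None if nothing remains."""
    values = [d for d in durations if d is not None]
    if not values:
        return None
    return sum(values) / len(values)


print(average_ms([100, 200, None]))  # 150.0
print(average_ms([]))                # None
```

Returning `None` for an empty input leaves it to the caller to decide how to report a version that has no traces yet.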
Best Practices for Trace Tags
1. Consistent Naming Conventions
# Good: Consistent naming
tags = {
    "environment": "production",  # lowercase
    "model_version": "v2.1.0",  # semantic versioning
    "user_segment": "premium",  # descriptive names
    "processing_stage": "preprocessing",  # clear context
}

# Avoid: Inconsistent naming
tags = {
    "env": "PROD",  # abbreviation + uppercase
    "ModelVer": "2.1",  # mixed case + different format
    "user_type": "premium",  # different terminology
    "stage": "pre",  # unclear abbreviation
}
2. Hierarchical Organization
# Use dots for hierarchical organization
tags = {
    "service.name": "chat_api",
    "service.version": "1.2.0",
    "service.region": "us-east-1",
    "user.segment": "enterprise",
    "user.plan": "premium",
    "request.type": "completion",
    "request.priority": "high",
}
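If the metadata already exists as nested configuration, dotted keys like these can be derived mechanically. The sketch below flattens a nested dictionary into dot-separated tag keys and converts values to strings; `flatten_tags` is a hypothetical helper, not an MLflow API.

```python
def flatten_tags(nested: dict, prefix: str = "") -> dict[str, str]:
    """Flatten nested dicts into dot-separated keys with string values."""
    flat: dict[str, str] = {}
    for key, value in nested.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, carrying the dotted prefix along.
            flat.update(flatten_tags(value, full_key))
        else:
            flat[full_key] = str(value)
    return flat


config = {
    "service": {"name": "chat_api", "version": "1.2.0"},
    "user": {"plan": "premium"},
}
print(flatten_tags(config))
# {'service.name': 'chat_api', 'service.version': '1.2.0', 'user.plan': 'premium'}
```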
3. Temporal Information
tags = {
    "deployment_date": "2024-01-15",
    "quarter": "Q1_2024",
    "week": "2024-W03",
    "shift": "evening",  # for operational monitoring
}
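Temporal tags in this format do not need to be typed by hand; they can be computed from a date. This is a minimal sketch using only the standard library, with `temporal_tags` as a hypothetical helper name. It uses the ISO 8601 week number, which is an assumption about what the "week" tag is meant to represent.

```python
from datetime import date


def temporal_tags(day: date) -> dict[str, str]:
    """Derive date, quarter, and ISO-week tags from a calendar date."""
    iso_year, iso_week, _ = day.isocalendar()
    quarter = (day.month - 1) // 3 + 1
    return {
        "deployment_date": day.isoformat(),
        "quarter": f"Q{quarter}_{day.year}",
        "week": f"{iso_year}-W{iso_week:02d}",
    }


print(temporal_tags(date(2024, 1, 15)))
# {'deployment_date': '2024-01-15', 'quarter': 'Q1_2024', 'week': '2024-W03'}
```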
4. Operational Monitoring
# Tags for monitoring and alerting
tags = {
    "sla_tier": "critical",  # for SLA monitoring
    "cost_center": "ml_platform",  # for cost attribution
    "alert_group": "ml_ops",  # for alert routing
    "escalation": "tier_1",  # for incident management
}
5. Experiment Tracking
# Tags for A/B testing and experiments
tags = {
    "experiment_name": "prompt_optimization_v2",
    "variant": "control",
    "hypothesis": "improved_context_helps",
    "feature_flag": "new_prompt_engine",
}
Common Tag Categories
Category | Example tags | Use case |
---|---|---|
Environment | environment: production/staging/dev | Deployment tracking |
User context | user_id, session_id, user_segment | User behavior analysis |
Model information | model_version, model_type, checkpoint | Model performance tracking |
Request type | request_type, complexity, priority | Request categorization |
Performance | latency_tier, cost_category, sla_tier | Performance monitoring |
Business logic | feature_flag, experiment_variant, routing | A/B testing and routing |
Operations | region, deployment_id, instance_type | Infrastructure tracking |
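Tag Naming Guidelines
- Use lowercase with underscores for consistency
- Be descriptive but concise
- Use semantic versioning for version numbers (v1.2.3)
- Include units where relevant (time_seconds, size_mb)
- Use hierarchical naming for related concepts (service.name, service.version)
- Avoid abbreviations unless they are widely understood in your domain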
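Some of these guidelines can be checked automatically before tags are written. The sketch below validates only the mechanical rules (lowercase, underscores, optional dot-separated hierarchy); it cannot judge whether a name is descriptive or an abbreviation is well known. `is_valid_tag_key` is a hypothetical helper, and the pattern reflects this guide's conventions rather than any restriction enforced by MLflow.

```python
import re

# Lowercase segments of letters, digits, and underscores, joined by dots.
_TAG_KEY = re.compile(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*")


def is_valid_tag_key(key: str) -> bool:
    """Return True if the key follows the lowercase/underscore/dot convention."""
    return _TAG_KEY.fullmatch(key) is not None


for candidate in ("model_version", "service.name", "ModelVer", "env-name"):
    print(candidate, is_valid_tag_key(candidate))
```

A check like this could run in a unit test or a pre-commit hook so that inconsistent keys are caught before they reach production traces.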
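Summary
Trace tags provide a powerful way to add context and metadata to your MLflow traces, enabling:
- Better organization: group related traces together
- Powerful filtering: find specific traces quickly using search
- Operational monitoring: track performance and issues by category
- User analytics: understand user behavior patterns
- Debugging: add context that helps with troubleshooting
Whether you add tags during trace execution or afterwards, they make your trace data more valuable and actionable for production monitoring and analysis.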