
MLflow Tracking APIs

MLflow Tracking provides comprehensive APIs across multiple programming languages for capturing your machine learning experiments. Whether you prefer automatic instrumentation or fine-grained control, MLflow adapts to your workflow.

Choosing Your Approach

MLflow offers two main approaches to experiment tracking, each optimized for different use cases:

🤖 Automatic Logging - Zero Setup, Maximum Coverage

Perfect for getting started quickly or when working with supported ML libraries. Add a single line of code and MLflow automatically captures everything.

import mlflow

mlflow.autolog() # That's it!

# Your existing training code works unchanged
model.fit(X_train, y_train)

What gets logged automatically:

  • Model parameters and hyperparameters
  • Training and validation metrics
  • Model artifacts and checkpoints
  • Training plots and visualizations
  • Framework-specific metadata

Supported libraries: Scikit-learn, XGBoost, LightGBM, PyTorch, Keras/TensorFlow, Spark, and more.

→ Explore Automatic Logging

🛠️ Manual Logging - Complete Control, Custom Workflows

Ideal for custom training loops, advanced experiments, or when you need precise control over what gets tracked.

import mlflow

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)

    # Your training logic here
    for epoch in range(num_epochs):
        train_loss = train_model()
        val_loss = validate_model()

        # Log metrics with step tracking
        mlflow.log_metrics({"train_loss": train_loss, "val_loss": val_loss}, step=epoch)

    # Log final model
    mlflow.sklearn.log_model(model, name="model")

Core Logging Functions

Setup & Configuration

| Function | Purpose | Example |
|----------|---------|---------|
| `mlflow.set_tracking_uri()` | Connect to a tracking server or database | `mlflow.set_tracking_uri("http://localhost:5000")` |
| `mlflow.get_tracking_uri()` | Get the current tracking URI | `uri = mlflow.get_tracking_uri()` |
| `mlflow.create_experiment()` | Create a new experiment | `exp_id = mlflow.create_experiment("my-experiment")` |
| `mlflow.set_experiment()` | Set the active experiment | `mlflow.set_experiment("fraud-detection")` |
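
Putting these together, a typical session first points the client at a server and then selects an experiment. A minimal sketch, where the URL and experiment name are placeholders; note that set_experiment creates the experiment if it does not already exist:

import mlflow

# Point the client at a tracking server (placeholder URL)
mlflow.set_tracking_uri("http://localhost:5000")

# Creates "fraud-detection" on first use, then reuses it
mlflow.set_experiment("fraud-detection")

print(mlflow.get_tracking_uri())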

Run Management

| Function | Purpose | Example |
|----------|---------|---------|
| `mlflow.start_run()` | Start a new run (with context manager) | `with mlflow.start_run(): ...` |
| `mlflow.end_run()` | End the current run | `mlflow.end_run(status="FINISHED")` |
| `mlflow.active_run()` | Get the currently active run | `run = mlflow.active_run()` |
| `mlflow.last_active_run()` | Get the last completed run | `last_run = mlflow.last_active_run()` |
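
The context-manager form of mlflow.start_run() ends the run automatically, even when an exception is raised; the explicit form needs a matching mlflow.end_run(). A minimal sketch:

import mlflow

# Preferred: the context manager closes the run even on exceptions
with mlflow.start_run(run_name="demo") as run:
    print(run.info.run_id)

# Equivalent explicit form
mlflow.start_run(run_name="demo-explicit")
mlflow.end_run(status="FINISHED")

# Finished runs remain reachable for post-processing
print(mlflow.last_active_run().info.status)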

Data Logging

| Function | Purpose | Example |
|----------|---------|---------|
| `mlflow.log_param()` / `mlflow.log_params()` | Log hyperparameters | `mlflow.log_param("lr", 0.01)` |
| `mlflow.log_metric()` / `mlflow.log_metrics()` | Log performance metrics | `mlflow.log_metric("accuracy", 0.95, step=10)` |
| `mlflow.log_input()` | Log dataset information | `mlflow.log_input(dataset)` |
| `mlflow.set_tag()` / `mlflow.set_tags()` | Add metadata tags | `mlflow.set_tag("model_type", "CNN")` |
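
The data-logging calls combine naturally inside a single run. A minimal sketch; the DataFrame, names, and values are illustrative:

import mlflow
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})

with mlflow.start_run():
    # Batch and single-value forms are interchangeable
    mlflow.log_params({"lr": 0.01, "epochs": 10})
    mlflow.log_metric("accuracy", 0.95, step=10)

    # Wrap a DataFrame as an MLflow dataset, then attach it to the run
    dataset = mlflow.data.from_pandas(df, name="toy-training-data")
    mlflow.log_input(dataset, context="training")

    mlflow.set_tags({"model_type": "CNN", "stage": "dev"})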

Artifact Management

| Function | Purpose | Example |
|----------|---------|---------|
| `mlflow.log_artifact()` | Log a single file or directory | `mlflow.log_artifact("model.pkl")` |
| `mlflow.log_artifacts()` | Log an entire directory | `mlflow.log_artifacts("./plots/")` |
| `mlflow.get_artifact_uri()` | Get the artifact storage location | `uri = mlflow.get_artifact_uri()` |
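
A short sketch that writes a local file and attaches it to the run; the file name and contents are illustrative:

import json

import mlflow

with mlflow.start_run():
    # Write any local file, then attach it to the run
    with open("config.json", "w") as f:
        json.dump({"lr": 0.01}, f)
    mlflow.log_artifact("config.json")

    # Where this run's artifacts are stored
    print(mlflow.get_artifact_uri())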

Model Management (New in MLflow 3)

| Function | Purpose | Example |
|----------|---------|---------|
| `mlflow.initialize_logged_model()` | Initialize a logged model in PENDING state | `model = mlflow.initialize_logged_model(name="my_model")` |
| `mlflow.create_external_model()` | Create an external model (artifacts stored outside MLflow) | `model = mlflow.create_external_model(name="agent")` |
| `mlflow.finalize_logged_model()` | Update a model's status to READY or FAILED | `mlflow.finalize_logged_model(model_id, "READY")` |
| `mlflow.get_logged_model()` | Retrieve a logged model by ID | `model = mlflow.get_logged_model(model_id)` |
| `mlflow.last_logged_model()` | Get the most recently logged model | `model = mlflow.last_logged_model()` |
| `mlflow.search_logged_models()` | Search for logged models | `models = mlflow.search_logged_models(filter_string="name='my_model'")` |
| `mlflow.log_model_params()` | Log parameters to a specific model | `mlflow.log_model_params({"param": "value"}, model_id)` |
| `mlflow.set_logged_model_tags()` | Set tags on a logged model | `mlflow.set_logged_model_tags(model_id, {"key": "value"})` |
| `mlflow.delete_logged_model_tag()` | Delete a tag from a logged model | `mlflow.delete_logged_model_tag(model_id, "key")` |
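
Most of these functions appear in the advanced patterns below; the tag helpers, which do not, work as follows (a minimal sketch against a freshly created external model):

import mlflow

model = mlflow.create_external_model(name="tag-demo")

# Attach, then remove, metadata tags on the logged model itself
mlflow.set_logged_model_tags(model.model_id, {"owner": "data-science"})
mlflow.delete_logged_model_tag(model.model_id, "owner")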

Active Model Management (New in MLflow 3)

| Function | Purpose | Example |
|----------|---------|---------|
| `mlflow.set_active_model()` | Set the active model for trace linking | `mlflow.set_active_model(name="my_model")` |
| `mlflow.get_active_model_id()` | Get the current active model ID | `model_id = mlflow.get_active_model_id()` |
| `mlflow.clear_active_model()` | Clear the active model | `mlflow.clear_active_model()` |

Language-Specific API Coverage

| Feature | Python | Java | R | REST API |
|---------|--------|------|---|----------|
| Basic logging | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Auto logging | ✅ 15+ libraries | ❌ Not available | ✅ Limited | ❌ Not available |
| Model logging | ✅ 20+ flavors | ✅ Basic | ✅ Basic | ✅ Via artifacts |
| Logged model management | ✅ Full (MLflow 3) | ❌ Not available | ❌ Not available | ✅ Basic |
| Dataset tracking | ✅ Full | ✅ Basic | ✅ Basic | ✅ Basic |
| Search & query | ✅ Advanced | ✅ Basic | ✅ Basic | ✅ Full |
API Parity

The Python API offers the most comprehensive feature set. The Java and R APIs cover the core functionality, and new capabilities are added with every release.

Advanced Tracking Patterns

Working with Logged Models (New in MLflow 3)

MLflow 3 introduces powerful logged model management for tracking models independently of runs:

Creating and Managing External Models

Perfect for models stored outside of MLflow (e.g., deployed agents or externally managed model artifacts):

import mlflow

# Create an external model for tracking without storing artifacts in MLflow
model = mlflow.create_external_model(
    name="chatbot_agent",
    model_type="agent",
    tags={"version": "v1.0", "environment": "production"},
)

# Log parameters specific to this model
mlflow.log_model_params(
    {"temperature": "0.7", "max_tokens": "1000"}, model_id=model.model_id
)

# Set as active model for automatic trace linking
mlflow.set_active_model(model_id=model.model_id)


@mlflow.trace
def chat_with_agent(message):
    # This trace will be automatically linked to the active model
    return agent.chat(message)


# Traces are now linked to your external model
traces = mlflow.search_traces(model_id=model.model_id)

Advanced Model Lifecycle Management

For models that require custom preparation or validation:

import mlflow
from mlflow.entities import LoggedModelStatus

# Initialize model in PENDING state
model = mlflow.initialize_logged_model(
    name="custom_neural_network",
    model_type="neural_network",
    tags={"architecture": "transformer", "dataset": "custom"},
)

try:
    # Custom model preparation logic
    train_model()
    validate_model()

    # Save model artifacts using standard MLflow model logging
    mlflow.pytorch.log_model(
        pytorch_model=model_instance,
        name="model",
        model_id=model.model_id,  # Link to the logged model
    )

    # Finalize model as READY
    mlflow.finalize_logged_model(model.model_id, LoggedModelStatus.READY)

except Exception:
    # Mark model as FAILED if issues occur
    mlflow.finalize_logged_model(model.model_id, LoggedModelStatus.FAILED)
    raise

# Retrieve and work with the logged model
final_model = mlflow.get_logged_model(model.model_id)
print(f"Model {final_model.name} is {final_model.status}")

Searching and Querying Logged Models

# Find all production-ready transformer models
production_models = mlflow.search_logged_models(
    filter_string="tags.environment = 'production' AND model_type = 'transformer'",
    order_by=[{"field_name": "creation_time", "ascending": False}],
    output_format="pandas",
)

# Search for models with specific performance metrics
high_accuracy_models = mlflow.search_logged_models(
    filter_string="metrics.accuracy > 0.95",
    datasets=[{"dataset_name": "test_set"}],  # Only consider test set metrics
    max_results=10,
)

# Get the most recently logged model in current session
latest_model = mlflow.last_logged_model()
if latest_model:
    print(f"Latest model: {latest_model.name} (ID: {latest_model.model_id})")

Precision Metric Tracking

Control exactly when and how metrics are recorded, using custom timestamps and steps:

import time

# Log with custom step (training iteration/epoch)
for epoch in range(100):
    loss = train_epoch()
    mlflow.log_metric("train_loss", loss, step=epoch)

# Log with custom timestamp
now = int(time.time() * 1000)  # MLflow expects milliseconds
mlflow.log_metric("inference_latency", latency, timestamp=now)

# Log with both step and timestamp
mlflow.log_metric("gpu_utilization", gpu_usage, step=epoch, timestamp=now)

Step requirements

  • Must be a valid 64-bit integer
  • Can be negative or out of order
  • Gaps between steps are supported, e.g. 1, 5, 75, -20 (see the sketch below)
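
A minimal sketch exercising those rules; the metric name and values are illustrative:

import mlflow

with mlflow.start_run():
    # Steps may be sparse, out of order, and negative
    for step in [1, 5, 75, -20]:
        mlflow.log_metric("adjustment", float(step), step=step)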

Experiment Organization

Organize your experiments for easy comparison and analysis:

# Method 1: Environment variables
import os

os.environ["MLFLOW_EXPERIMENT_NAME"] = "fraud-detection-v2"

# Method 2: Explicit experiment setting
mlflow.set_experiment("hyperparameter-tuning")

# Method 3: Create with custom configuration
experiment_id = mlflow.create_experiment(
    "production-models",
    artifact_location="s3://my-bucket/experiments/",
    tags={"team": "data-science", "environment": "prod"},
)

Hierarchical Runs with Parent-Child Relationships

Organize complex experiments such as hyperparameter sweeps or cross-validation:

# Parent run for the entire experiment
with mlflow.start_run(run_name="hyperparameter_sweep") as parent_run:
mlflow.log_param("search_strategy", "random")

best_score = 0
best_params = {}

# Child runs for each parameter combination
for lr in [0.001, 0.01, 0.1]:
for batch_size in [16, 32, 64]:
with mlflow.start_run(
nested=True, run_name=f"lr_{lr}_bs_{batch_size}"
) as child_run:
mlflow.log_params({"learning_rate": lr, "batch_size": batch_size})

# Train and evaluate
model = train_model(lr, batch_size)
score = evaluate_model(model)
mlflow.log_metric("accuracy", score)

# Track best configuration in parent
if score > best_score:
best_score = score
best_params = {"learning_rate": lr, "batch_size": batch_size}

# Log best results to parent run
mlflow.log_params(best_params)
mlflow.log_metric("best_accuracy", best_score)

# Query child runs
child_runs = mlflow.search_runs(
filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'"
)
print("Child run results:")
print(child_runs[["run_id", "params.learning_rate", "metrics.accuracy"]])

Parallel Execution Strategies

Handle multiple runs efficiently with different parallelization approaches.

The simplest strategy executes runs sequentially, which works well for small hyperparameter sweeps or A/B tests:

configs = [
    {"model": "RandomForest", "n_estimators": 100},
    {"model": "XGBoost", "max_depth": 6},
    {"model": "LogisticRegression", "C": 1.0},
]

for config in configs:
    with mlflow.start_run(run_name=config["model"]):
        mlflow.log_params(config)
        model = train_model(config)
        score = evaluate_model(model)
        mlflow.log_metric("f1_score", score)

Smart Tagging for Organization

Use tags strategically to organize and filter your experiments:

with mlflow.start_run():
    # Descriptive tags for filtering
    mlflow.set_tags(
        {
            "model_family": "transformer",
            "dataset_version": "v2.1",
            "environment": "production",
            "team": "nlp-research",
            "gpu_type": "V100",
            "experiment_phase": "hyperparameter_tuning",
        }
    )

    # Special notes tag for documentation
    mlflow.set_tag(
        "mlflow.note.content",
        "Baseline transformer model with attention dropout. "
        "Testing different learning rate schedules.",
    )

    # Training code here...

Searching Experiments by Tags

# Find all transformer experiments
transformer_runs = mlflow.search_runs(filter_string="tags.model_family = 'transformer'")

# Find production-ready models
prod_models = mlflow.search_runs(
    filter_string="tags.environment = 'production' AND metrics.accuracy > 0.95"
)

System Tags Reference

MLflow automatically sets several system tags that capture execution context:

| Tag | Description | When Set |
|-----|-------------|----------|
| `mlflow.source.name` | Source file or notebook name | Always |
| `mlflow.source.type` | Source type (NOTEBOOK, JOB, LOCAL, etc.) | Always |
| `mlflow.user` | User who created the run | Always |
| `mlflow.source.git.commit` | Git commit hash | When run from a Git repository |
| `mlflow.source.git.branch` | Git branch name | MLflow Projects only |
| `mlflow.parentRunId` | Parent run ID for nested runs | Child runs only |
| `mlflow.docker.image.name` | Docker image used | Docker environments |
| `mlflow.note.content` | User-editable description | Manual only |
Pro Tip

Use mlflow.note.content to record experiment insights, hypotheses, or results directly in the MLflow UI. This tag appears in a dedicated "Notes" section on the run page.
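
System tags can be read back like any other tag once a run exists (a minimal sketch; the tag keys come from the table above):

import mlflow

with mlflow.start_run() as run:
    pass  # training code would go here

# Fetch the finished run and inspect the system tags MLflow set for it
finished = mlflow.get_run(run.info.run_id)
print(finished.data.tags.get("mlflow.user"))
print(finished.data.tags.get("mlflow.source.type"))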

Integration with Auto Logging

Combine auto logging with manual tracking for the best of both worlds:

import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Enable auto logging
mlflow.autolog()

with mlflow.start_run():
    # Auto logging captures model training automatically
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Add custom metrics and artifacts
    predictions = model.predict(X_test)

    # Log custom evaluation metrics
    report = classification_report(y_test, predictions, output_dict=True)
    mlflow.log_metrics(
        {
            "precision_macro": report["macro avg"]["precision"],
            "recall_macro": report["macro avg"]["recall"],
            "f1_macro": report["macro avg"]["f1-score"],
        }
    )

    # Log custom artifacts
    feature_importance = pd.DataFrame(
        {"feature": feature_names, "importance": model.feature_importances_}
    )
    feature_importance.to_csv("feature_importance.csv")
    mlflow.log_artifact("feature_importance.csv")

    # Access the auto-logged run for additional processing
    current_run = mlflow.active_run()
    print(f"Auto-logged run ID: {current_run.info.run_id}")

# Access the completed run
last_run = mlflow.last_active_run()
print(f"Final run status: {last_run.info.status}")

Language-Specific Guides


Next Steps