
MLflow Tracking API

MLflow Tracking provides comprehensive APIs across multiple programming languages to capture your machine learning experiments. Whether you prefer automatic instrumentation or fine-grained control, MLflow adapts to your workflow.

Choose Your Approach

MLflow offers two primary approaches to experiment tracking, each optimized for different use cases:

🤖 Autologging - Zero Setup, Maximum Coverage

Perfect for getting started quickly or when working with supported ML libraries. Add a single line of code and MLflow automatically captures everything.

import mlflow

mlflow.autolog() # That's it!

# Your existing training code works unchanged
model.fit(X_train, y_train)

What Gets Logged Automatically

  • Model parameters and hyperparameters
  • Training and validation metrics
  • Model artifacts and checkpoints
  • Training plots and visualizations
  • Framework-specific metadata

Supported libraries: Scikit-learn, XGBoost, LightGBM, PyTorch, Keras/TensorFlow, Spark, and more.

→ Explore Autologging

🛠️ Manual Logging - Full Control, Custom Workflows

Ideal for custom training loops, advanced experiments, or when you need precise control over what gets tracked.

import mlflow

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)

    # Your training logic here
    for epoch in range(num_epochs):
        train_loss = train_model()
        val_loss = validate_model()

        # Log metrics with step tracking
        mlflow.log_metrics({"train_loss": train_loss, "val_loss": val_loss}, step=epoch)

    # Log final model
    mlflow.sklearn.log_model(model, name="model")

Core Logging Functions

Setup & Configuration

| Function | Purpose | Example |
|---|---|---|
| mlflow.set_tracking_uri() | Connect to a tracking server or database | mlflow.set_tracking_uri("http://localhost:5000") |
| mlflow.get_tracking_uri() | Get the current tracking URI | uri = mlflow.get_tracking_uri() |
| mlflow.create_experiment() | Create a new experiment | exp_id = mlflow.create_experiment("my-experiment") |
| mlflow.set_experiment() | Set the active experiment | mlflow.set_experiment("fraud-detection") |
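
A minimal setup sketch combining these calls (the server URL and experiment name are placeholders; mlflow.set_experiment() creates the experiment if it does not already exist):

import mlflow

# Point the client at a tracking server (placeholder URL)
mlflow.set_tracking_uri("http://localhost:5000")
print(mlflow.get_tracking_uri())

# Make "my-experiment" the active experiment, creating it if needed
mlflow.set_experiment("my-experiment")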

Run Management

| Function | Purpose | Example |
|---|---|---|
| mlflow.start_run() | Start a new run (use as a context manager) | with mlflow.start_run(): ... |
| mlflow.end_run() | End the current run | mlflow.end_run(status="FINISHED") |
| mlflow.active_run() | Get the currently active run | run = mlflow.active_run() |
| mlflow.last_active_run() | Get the most recently ended run | last_run = mlflow.last_active_run() |
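
These compose naturally; a brief sketch (run names are illustrative):

import mlflow

# Preferred: the context manager ends the run automatically
with mlflow.start_run(run_name="demo") as run:
    print(mlflow.active_run().info.run_id == run.info.run_id)  # True inside the run

# Alternative: explicit pairing of start and end
mlflow.start_run(run_name="manual-demo")
mlflow.end_run(status="FINISHED")

# The most recently ended run remains accessible
print(mlflow.last_active_run().info.status)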

Data Logging

| Function | Purpose | Example |
|---|---|---|
| mlflow.log_param() / mlflow.log_params() | Log hyperparameters | mlflow.log_param("lr", 0.01) |
| mlflow.log_metric() / mlflow.log_metrics() | Log performance metrics | mlflow.log_metric("accuracy", 0.95, step=10) |
| mlflow.log_input() | Log dataset information | mlflow.log_input(dataset) |
| mlflow.set_tag() / mlflow.set_tags() | Add metadata tags | mlflow.set_tag("model_type", "CNN") |
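
The singular and batch forms can be mixed freely; a minimal sketch (all values are illustrative):

import mlflow

with mlflow.start_run():
    mlflow.log_param("lr", 0.01)
    mlflow.log_params({"batch_size": 32, "optimizer": "adam"})

    mlflow.log_metric("accuracy", 0.95, step=10)
    mlflow.log_metrics({"precision": 0.93, "recall": 0.91}, step=10)

    mlflow.set_tag("model_type", "CNN")
    mlflow.set_tags({"stage": "dev", "owner": "ml-team"})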

Artifact Management

| Function | Purpose | Example |
|---|---|---|
| mlflow.log_artifact() | Log a single file or directory | mlflow.log_artifact("model.pkl") |
| mlflow.log_artifacts() | Log an entire directory | mlflow.log_artifacts("./plots/") |
| mlflow.get_artifact_uri() | Get the artifact storage location | uri = mlflow.get_artifact_uri() |
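
For example (paths are placeholders; the file is created only for the demonstration):

import mlflow

with mlflow.start_run():
    # Write a local file, then attach it to the run
    with open("notes.txt", "w") as f:
        f.write("experiment notes")
    mlflow.log_artifact("notes.txt")

    # Upload every file under a local directory
    mlflow.log_artifacts("./plots/")

    # Where this run's artifacts are stored
    print(mlflow.get_artifact_uri())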

Model Management (New in MLflow 3)

| Function | Purpose | Example |
|---|---|---|
| mlflow.initialize_logged_model() | Initialize a logged model in the PENDING state | model = mlflow.initialize_logged_model(name="my_model") |
| mlflow.create_external_model() | Create an external model (artifacts stored outside MLflow) | model = mlflow.create_external_model(name="agent") |
| mlflow.finalize_logged_model() | Update a model's status to READY or FAILED | mlflow.finalize_logged_model(model_id, "READY") |
| mlflow.get_logged_model() | Retrieve a logged model by ID | model = mlflow.get_logged_model(model_id) |
| mlflow.last_logged_model() | Get the most recently logged model | model = mlflow.last_logged_model() |
| mlflow.search_logged_models() | Search for logged models | models = mlflow.search_logged_models(filter_string="name='my_model'") |
| mlflow.log_model_params() | Log parameters to a specific model | mlflow.log_model_params({"param": "value"}, model_id) |
| mlflow.set_logged_model_tags() | Set tags on a logged model | mlflow.set_logged_model_tags(model_id, {"key": "value"}) |
| mlflow.delete_logged_model_tag() | Delete a tag from a logged model | mlflow.delete_logged_model_tag(model_id, "key") |

Active Model Management (New in MLflow 3)

| Function | Purpose | Example |
|---|---|---|
| mlflow.set_active_model() | Set the active model for trace linking | mlflow.set_active_model(name="my_model") |
| mlflow.get_active_model_id() | Get the current active model ID | model_id = mlflow.get_active_model_id() |
| mlflow.clear_active_model() | Clear the active model | mlflow.clear_active_model() |
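
A short sketch of the active-model workflow (the model name is a placeholder; see the external-model example later in this page for how linked traces are used):

import mlflow

# Link subsequently generated traces to this model
mlflow.set_active_model(name="my_model")
print(mlflow.get_active_model_id())

# Stop linking traces to any model
mlflow.clear_active_model()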

Language-Specific API Coverage

| Capability | Python | Java | R | REST API |
|---|---|---|---|---|
| Basic logging | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Autologging | ✅ 15+ libraries | ❌ Not available | ✅ Limited | ❌ Not available |
| Model logging | ✅ 20+ model types | ✅ Basic | ✅ Basic | ✅ Via artifacts |
| Logged model management | ✅ Full (MLflow 3) | ❌ Not available | ❌ Not available | ✅ Basic |
| Dataset tracking | ✅ Full | ✅ Basic | ✅ Basic | ✅ Basic |
| Search & query | ✅ Advanced | ✅ Basic | ✅ Basic | ✅ Full |

The Python API offers the most comprehensive feature set. The Java and R APIs provide core functionality, with additional features added in each release.

Advanced Tracking Patterns

Working with Logged Models (New in MLflow 3)

MLflow 3 introduces powerful logged model management for tracking models independently of runs:

Creating and Managing External Models

For models stored outside of MLflow (e.g., deployed agents or external model artifacts):

import mlflow

# Create an external model for tracking without storing artifacts in MLflow
model = mlflow.create_external_model(
    name="chatbot_agent",
    model_type="agent",
    tags={"version": "v1.0", "environment": "production"},
)

# Log parameters specific to this model
mlflow.log_model_params(
    {"temperature": "0.7", "max_tokens": "1000"}, model_id=model.model_id
)

# Set as active model for automatic trace linking
mlflow.set_active_model(model_id=model.model_id)


@mlflow.trace
def chat_with_agent(message):
    # This trace will be automatically linked to the active model
    return agent.chat(message)


# Traces are now linked to your external model
traces = mlflow.search_traces(model_id=model.model_id)

Advanced Model Lifecycle Management

For models that require custom preparation or validation:

import mlflow
from mlflow.entities import LoggedModelStatus

# Initialize model in PENDING state
model = mlflow.initialize_logged_model(
    name="custom_neural_network",
    model_type="neural_network",
    tags={"architecture": "transformer", "dataset": "custom"},
)

try:
    # Custom model preparation logic
    train_model()
    validate_model()

    # Save model artifacts using standard MLflow model logging
    mlflow.pytorch.log_model(
        pytorch_model=model_instance,
        name="model",
        model_id=model.model_id,  # Link to the logged model
    )

    # Finalize model as READY
    mlflow.finalize_logged_model(model.model_id, LoggedModelStatus.READY)

except Exception:
    # Mark model as FAILED if issues occur
    mlflow.finalize_logged_model(model.model_id, LoggedModelStatus.FAILED)
    raise

# Retrieve and work with the logged model
final_model = mlflow.get_logged_model(model.model_id)
print(f"Model {final_model.name} is {final_model.status}")

Searching and Querying Logged Models

# Find all production-ready transformer models
production_models = mlflow.search_logged_models(
    filter_string="tags.environment = 'production' AND model_type = 'transformer'",
    order_by=[{"field_name": "creation_time", "ascending": False}],
    output_format="pandas",
)

# Search for models with specific performance metrics
high_accuracy_models = mlflow.search_logged_models(
    filter_string="metrics.accuracy > 0.95",
    datasets=[{"dataset_name": "test_set"}],  # Only consider test set metrics
    max_results=10,
)

# Get the most recently logged model in current session
latest_model = mlflow.last_logged_model()
if latest_model:
    print(f"Latest model: {latest_model.name} (ID: {latest_model.model_id})")

Precise Metric Tracking

Control exactly when and how metrics are recorded, with custom timestamps and steps:

import time

import mlflow

# Log with custom step (training iteration/epoch)
for epoch in range(100):
    loss = train_epoch()
    mlflow.log_metric("train_loss", loss, step=epoch)

# Log with custom timestamp
now = int(time.time() * 1000) # MLflow expects milliseconds
mlflow.log_metric("inference_latency", latency, timestamp=now)

# Log with both step and timestamp
mlflow.log_metric("gpu_utilization", gpu_usage, step=epoch, timestamp=now)

Step Requirements

  • Must be a valid 64-bit integer
  • Can be negative and logged out of order
  • Gaps in the sequence are supported (e.g., 1, 5, 75, -20)
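
A minimal sketch exercising these rules (the metric name and values are illustrative):

import mlflow

with mlflow.start_run():
    # Steps may be negative, out of order, and non-contiguous
    for step in [1, 5, 75, -20]:
        mlflow.log_metric("loss", abs(step) * 0.01, step=step)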

Experiment Organization

Structure your experiments for easy comparison and analysis:

import os

import mlflow

# Method 1: Environment variables
os.environ["MLFLOW_EXPERIMENT_NAME"] = "fraud-detection-v2"

# Method 2: Explicit experiment setting
mlflow.set_experiment("hyperparameter-tuning")

# Method 3: Create with custom configuration
experiment_id = mlflow.create_experiment(
    "production-models",
    artifact_location="s3://my-bucket/experiments/",
    tags={"team": "data-science", "environment": "prod"},
)

Hierarchical Runs with Parent-Child Relationships

Organize complex experiments such as hyperparameter sweeps or cross-validation:

# Parent run for the entire experiment
with mlflow.start_run(run_name="hyperparameter_sweep") as parent_run:
    mlflow.log_param("search_strategy", "random")

    best_score = 0
    best_params = {}

    # Child runs for each parameter combination
    for lr in [0.001, 0.01, 0.1]:
        for batch_size in [16, 32, 64]:
            with mlflow.start_run(
                nested=True, run_name=f"lr_{lr}_bs_{batch_size}"
            ) as child_run:
                mlflow.log_params({"learning_rate": lr, "batch_size": batch_size})

                # Train and evaluate
                model = train_model(lr, batch_size)
                score = evaluate_model(model)
                mlflow.log_metric("accuracy", score)

                # Track best configuration in parent
                if score > best_score:
                    best_score = score
                    best_params = {"learning_rate": lr, "batch_size": batch_size}

    # Log best results to parent run
    mlflow.log_params(best_params)
    mlflow.log_metric("best_accuracy", best_score)

# Query child runs
child_runs = mlflow.search_runs(
    filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'"
)
print("Child run results:")
print(child_runs[["run_id", "params.learning_rate", "metrics.accuracy"]])

Parallel Execution Strategies

Handle multiple runs efficiently with different parallelization approaches:

A sequential loop is perfect for simple hyperparameter sweeps or A/B tests:

configs = [
    {"model": "RandomForest", "n_estimators": 100},
    {"model": "XGBoost", "max_depth": 6},
    {"model": "LogisticRegression", "C": 1.0},
]

for config in configs:
    with mlflow.start_run(run_name=config["model"]):
        mlflow.log_params(config)
        model = train_model(config)
        score = evaluate_model(model)
        mlflow.log_metric("f1_score", score)

Smart Tag Organization

Use tags strategically to organize and filter your experiments:

with mlflow.start_run():
    # Descriptive tags for filtering
    mlflow.set_tags(
        {
            "model_family": "transformer",
            "dataset_version": "v2.1",
            "environment": "production",
            "team": "nlp-research",
            "gpu_type": "V100",
            "experiment_phase": "hyperparameter_tuning",
        }
    )

    # Special notes tag for documentation
    mlflow.set_tag(
        "mlflow.note.content",
        "Baseline transformer model with attention dropout. "
        "Testing different learning rate schedules.",
    )

    # Training code here...

Searching Experiments by Tags

# Find all transformer experiments
transformer_runs = mlflow.search_runs(filter_string="tags.model_family = 'transformer'")

# Find production-ready models
prod_models = mlflow.search_runs(
    filter_string="tags.environment = 'production' AND metrics.accuracy > 0.95"
)

System Tags Reference

MLflow automatically sets several system tags to capture execution context:

| Tag | Description | When Set |
|---|---|---|
| mlflow.source.name | Source file or notebook name | Always |
| mlflow.source.type | Source type (NOTEBOOK, JOB, LOCAL, etc.) | Always |
| mlflow.user | User who created the run | Always |
| mlflow.source.git.commit | Git commit hash | When run from a Git repository |
| mlflow.source.git.branch | Git branch name | MLflow Projects only |
| mlflow.parentRunId | Parent run ID for nested runs | Child runs only |
| mlflow.docker.image.name | Docker image used | Docker environments |
| mlflow.note.content | User-editable description | Manual only |
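
System tags can be queried like any user tag; for example, a minimal sketch filtering runs by source type (mirroring the tag-search examples above):

import mlflow

# Find runs that were launched from notebooks
notebook_runs = mlflow.search_runs(
    filter_string="tags.mlflow.source.type = 'NOTEBOOK'"
)
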
Pro Tip

Use mlflow.note.content to record experiment insights, hypotheses, or results directly in the MLflow UI. This tag appears in a dedicated "Notes" section on the run page.

Integrating with Autologging

Combine autologging with manual tracking for the best of both worlds:

import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Enable auto logging
mlflow.autolog()

with mlflow.start_run():
    # Auto logging captures model training automatically
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # Add custom metrics and artifacts
    predictions = model.predict(X_test)

    # Log custom evaluation metrics
    report = classification_report(y_test, predictions, output_dict=True)
    mlflow.log_metrics(
        {
            "precision_macro": report["macro avg"]["precision"],
            "recall_macro": report["macro avg"]["recall"],
            "f1_macro": report["macro avg"]["f1-score"],
        }
    )

    # Log custom artifacts
    feature_importance = pd.DataFrame(
        {"feature": feature_names, "importance": model.feature_importances_}
    )
    feature_importance.to_csv("feature_importance.csv")
    mlflow.log_artifact("feature_importance.csv")

    # Access the auto-logged run for additional processing
    current_run = mlflow.active_run()
    print(f"Auto-logged run ID: {current_run.info.run_id}")

# Access the completed run
last_run = mlflow.last_active_run()
print(f"Final run status: {last_run.info.status}")

Language-Specific Guides


Next Steps