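Scikit-learn with MLflow

This guide walks through how to use scikit-learn with MLflow for experiment tracking, model management, and production deployment. It covers both autologging and manual logging, from basic usage to advanced production patterns.

Quickstart with Autologging

The fastest way to get started is MLflow's scikit-learn autologging. With a single line of code, you can automatically track the parameters, metrics, and models from your scikit-learn experiments. This approach requires no changes to existing training code and captures what is needed for a reproducible ML workflow.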

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Enable autologging for scikit-learn
mlflow.sklearn.autolog()

# Load sample data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Train your model - MLflow automatically logs everything
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    model.fit(X_train, y_train)

    # Evaluation metrics are automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f"Training accuracy: {train_score:.3f}")
    print(f"Test accuracy: {test_score:.3f}")

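This simple example automatically logs all model parameters, the training metrics, the trained model with appropriate serialization, and a model signature for deployment, all without any additional code.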

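Understanding Autologging Behavior

MLflow's scikit-learn autologging captures comprehensive information about the training process. Each time you train a model, the following is tracked: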

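| Category | Captured information |
| --- | --- |
| Parameters | All parameters from estimator.get_params(deep=True) |
| Metrics | Training score, classification/regression metrics |
| Tags | Estimator class name and fully qualified class name |
| Artifacts | Serialized model, model signature, metric information |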

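The autologging system is designed to be comprehensive yet non-intrusive. It captures what is needed for reproducibility without requiring changes to existing scikit-learn code.

Logging Approaches

For full control over what gets logged, you can instrument your scikit-learn code manually. This approach is ideal when you need custom metrics, specific artifact logging, or a particular way of organizing experiments: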

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from mlflow.models import infer_signature

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual logging approach
with mlflow.start_run():
    # Define hyperparameters
    params = {"C": 1.0, "max_iter": 1000, "solver": "lbfgs", "random_state": 42}

    # Log parameters
    mlflow.log_params(params)

    # Train model
    model = LogisticRegression(**params)
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate and log metrics
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted"),
        "recall": recall_score(y_test, y_pred, average="weighted"),
        "f1_score": f1_score(y_test, y_pred, average="weighted"),
    }
    mlflow.log_metrics(metrics)

    # Infer model signature
    signature = infer_signature(X_train, model.predict(X_train))

    # Log the model
    mlflow.sklearn.log_model(
        sk_model=model,
        name="model",
        signature=signature,
        input_example=X_train[:5],  # Sample input for documentation
    )

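Hyperparameter Tuning

MLflow supports scikit-learn's hyperparameter optimization tools and automatically creates organized child runs for parameter search experiments. This makes it easy to track and compare different parameter combinations: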

import mlflow
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Enable autologging with hyperparameter tuning support
mlflow.sklearn.autolog(max_tuning_runs=10)  # Track top 10 parameter combinations

# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Define parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

with mlflow.start_run(run_name="Random Forest Hyperparameter Tuning"):
    # Create and fit GridSearchCV
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(
        rf, param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1
    )

    grid_search.fit(X_train, y_train)

    # Best model evaluation
    best_score = grid_search.score(X_test, y_test)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
    print(f"Test score: {best_score:.3f}")

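MLflow automatically creates a parent run containing the overall search results, plus child runs for the best-scoring parameter combinations (up to max_tuning_runs), making it easy to analyze which parameters work best.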
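It can help to size a search before launching it. The sketch below uses only the standard library to count the combinations in the grid above and the number of cross-validation fits they imply. With `max_tuning_runs=10`, only the ten best-scoring combinations are expected to receive child runs. The variable names are illustrative.

```python
from itertools import product

# Same grid as in the tuning example
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5
max_tuning_runs = 10

# Cartesian product of all candidate values
combinations = list(product(*param_grid.values()))
n_combinations = len(combinations)

# Each combination is fitted once per fold (the final refit is not counted)
total_fits = n_combinations * cv_folds
child_runs = min(max_tuning_runs, n_combinations)

print(f"Parameter combinations: {n_combinations}")  # 108
print(f"Cross-validation fits: {total_fits}")  # 540
print(f"Child runs kept: {child_runs}")  # 10
```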

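Model Evaluation with MLflow

MLflow provides a comprehensive evaluation API that automatically generates metrics, visualizations, and diagnostic tools for scikit-learn models: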

import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from mlflow.models import infer_signature

# Load data and train model
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create evaluation dataset
eval_data = X_test.copy()
eval_data = pd.DataFrame(eval_data, columns=wine.feature_names)
eval_data["label"] = y_test

with mlflow.start_run():
    # Log model with signature
    signature = infer_signature(X_test, model.predict(X_test))
    model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)
    # Use the URI returned by log_model so it resolves to the logged model
    model_uri = model_info.model_uri

    # Comprehensive evaluation with MLflow
    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="label",
        model_type="classifier",  # or "regressor" for regression
        evaluators=["default"],
    )

    # Access automatic metrics
    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
    print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")

    # Access generated artifacts
    print("Generated artifacts:")
    for artifact_name, path in result.artifacts.items():
        print(f"  {artifact_name}: {path}")

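The automatically generated outputs include:

- Performance metrics: accuracy, precision, recall, F1 score, and ROC AUC for classification.
- Visualizations: confusion matrix, ROC curve, and precision-recall curve.
- Feature importance: SHAP values and feature contribution analysis.
- Model artifacts: all plots and diagnostic data are saved to MLflow.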
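As a sanity check, the headline metrics can also be computed directly with scikit-learn and compared against the evaluation output. This is a minimal sketch under stated assumptions: the weighted F1 averaging, the one-vs-rest ROC AUC, and the stratified split are choices made here for illustration and may not match the averaging the MLflow evaluator applies.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

wine = load_wine()
# Stratify so that every class is present in the test split
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)  # class probabilities, needed for ROC AUC

manual_metrics = {
    "accuracy_score": accuracy_score(y_test, y_pred),
    "f1_score": f1_score(y_test, y_pred, average="weighted"),
    "roc_auc": roc_auc_score(y_test, y_proba, multi_class="ovr"),
}
for name, value in manual_metrics.items():
    print(f"{name}: {value:.3f}")

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
```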

Model Comparison and Selection

Use MLflow evaluation to systematically compare multiple scikit-learn models:

import mlflow
import pandas as pd
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Reuses X_train, y_train, and eval_data from the evaluation example above

# Define models to compare
sklearn_models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(random_state=42, max_iter=1000),
    "svm": SVC(probability=True, random_state=42),
}

# Evaluate each model systematically
comparison_results = {}

for model_name, model in sklearn_models.items():
    with mlflow.start_run(run_name=f"eval_{model_name}"):
        # Train model
        model.fit(X_train, y_train)

        # Log model and keep the URI returned by log_model
        signature = infer_signature(X_train, model.predict(X_train))
        model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)
        model_uri = model_info.model_uri

        # Comprehensive evaluation with MLflow
        result = mlflow.evaluate(
            model_uri,
            eval_data,
            targets="label",
            model_type="classifier",
            evaluators=["default"],
        )

        comparison_results[model_name] = result.metrics

        # Log key metrics for comparison
        mlflow.log_metrics(
            {
                "accuracy": result.metrics["accuracy_score"],
                "f1": result.metrics["f1_score"],
                "roc_auc": result.metrics["roc_auc"],
                "precision": result.metrics["precision_score"],
                "recall": result.metrics["recall_score"],
            }
        )

# Create comparison summary
comparison_df = pd.DataFrame(comparison_results).T
print("Model Comparison Summary:")
print(comparison_df[["accuracy_score", "f1_score", "roc_auc"]].round(3))

# Identify best model
best_model = comparison_df["f1_score"].idxmax()
print(f"\nBest model by F1 score: {best_model}")

Model Validation and Quality Gates

Use MLflow's validation API to enforce quality standards for scikit-learn models:

import mlflow
from mlflow.exceptions import MlflowException
from mlflow.models import MetricThreshold

# First, evaluate your scikit-learn model
result = mlflow.evaluate(model_uri, eval_data, targets="label", model_type="classifier")

# Define quality thresholds for classification models
quality_thresholds = {
    "accuracy_score": MetricThreshold(threshold=0.85, greater_is_better=True),
    "f1_score": MetricThreshold(threshold=0.80, greater_is_better=True),
    "roc_auc": MetricThreshold(threshold=0.75, greater_is_better=True),
}

# Validate model meets quality standards
try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        validation_thresholds=quality_thresholds,
    )
    print("✅ Scikit-learn model meets all quality thresholds")
except MlflowException as e:  # validation failures are reported as MLflow exceptions
    print(f"❌ Model failed validation: {e}")

# Compare against baseline model (e.g., previous model version)
# baseline_model_uri is a placeholder for the URI of a previously logged model
baseline_result = mlflow.evaluate(
    baseline_model_uri, eval_data, targets="label", model_type="classifier"
)

# Validate improvement over baseline
improvement_thresholds = {
    "f1_score": MetricThreshold(
        min_relative_change=0.02,  # Must be at least 2% better than the baseline
        greater_is_better=True,
    ),
}

try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        baseline_result=baseline_result,
        validation_thresholds=improvement_thresholds,
    )
    print("✅ New model improves over baseline")
except MlflowException as e:
    print(f"❌ Model doesn't improve sufficiently: {e}")
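A fixed threshold and a required improvement over a baseline are two different checks, and it is easy to confuse them. The sketch below is a simplified, hypothetical re-implementation of the idea in plain Python. The `Gate` class and `passes` function are not MLflow APIs, although the field names mirror the `threshold`, `min_absolute_change`, and `min_relative_change` parameters of `MetricThreshold`. MLflow's own validation logic may differ in details.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gate:
    """Hypothetical stand-in for a metric threshold; not an MLflow class."""

    threshold: Optional[float] = None
    min_absolute_change: Optional[float] = None
    min_relative_change: Optional[float] = None
    greater_is_better: bool = True


def passes(candidate: float, gate: Gate, baseline: Optional[float] = None) -> bool:
    # Flip signs so that "larger is better" holds for every metric
    sign = 1.0 if gate.greater_is_better else -1.0

    # Absolute check: the candidate must reach the fixed threshold
    if gate.threshold is not None and sign * (candidate - gate.threshold) < 0:
        return False

    # Change-based checks only apply when a baseline is supplied
    if baseline is not None:
        improvement = sign * (candidate - baseline)
        if gate.min_absolute_change is not None and improvement < gate.min_absolute_change:
            return False
        if gate.min_relative_change is not None and baseline != 0:
            if improvement / abs(baseline) < gate.min_relative_change:
                return False
    return True


print(passes(0.90, Gate(threshold=0.85)))  # True
print(passes(0.82, Gate(min_relative_change=0.02), baseline=0.81))  # False
print(passes(0.84, Gate(min_relative_change=0.02), baseline=0.81))  # True
```

In the second call the candidate is only about 1.2% better than the baseline, so a 2% relative gate rejects it even though its score looks acceptable in isolation.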

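Model Management

MLflow supports multiple serialization formats for scikit-learn models, each suited to different deployment scenarios. Understanding these options helps you choose the right approach for your production needs: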

import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Cloudpickle format (default) - better cross-system compatibility
mlflow.sklearn.log_model(
    sk_model=model,
    name="cloudpickle_model",
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE,
)

# Pickle format - faster but less portable
mlflow.sklearn.log_model(
    sk_model=model,
    name="pickle_model",
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE,
)

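Cloudpickle is the default format because it offers better cross-system compatibility: it identifies code dependencies and packages them with the serialized model. Pickle is faster but less portable across environments.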
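The portability difference comes from how functions are stored. The standard `pickle` module records functions by reference (module and name), so code that cannot be looked up by name on the loading side cannot be serialized. The sketch below demonstrates this with the standard library only. The `make_scaler` helper is hypothetical, and cloudpickle itself is not imported here. Serializing such nested functions and lambdas by value is the case cloudpickle is designed to handle, which matters for pipelines that embed custom callables.

```python
import pickle


def make_scaler(factor):
    # Nested function: it cannot be looked up by name at module level
    def scale(value):
        return value * factor

    return scale


scale_by_two = make_scaler(2)

try:
    pickle.dumps(scale_by_two)
    nested_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    # The exact exception type depends on the Python version
    nested_pickled = False
    print(f"pickle could not serialize the nested function: {exc}")

# A callable that can be found by name is stored as a reference
restored = pickle.loads(pickle.dumps(len))
print(restored is len)  # True
```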

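Production Deployment

The Model Registry provides centralized model management with versioning and alias-based deployment. This is essential for managing models from development through production deployment: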

# Register model to MLflow Model Registry
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Log and register model in one step
with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        name="model",
        registered_model_name="CustomerChurnModel",
        signature=signature,
    )

# Or register an existing model
run_id = "your_run_id"
model_uri = f"runs:/{run_id}/model"

# Register the model
registered_model = mlflow.register_model(model_uri=model_uri, name="CustomerChurnModel")

# Use aliases instead of deprecated stages for deployment management
# Set aliases for different deployment environments
client.set_registered_model_alias(
    name="CustomerChurnModel",
    alias="champion",  # Production model
    version=registered_model.version,
)

client.set_registered_model_alias(
    name="CustomerChurnModel",
    alias="challenger",  # A/B testing model
    version=registered_model.version,
)

# Use tags to track model status and metadata
client.set_model_version_tag(
    name="CustomerChurnModel",
    version=registered_model.version,
    key="validation_status",
    value="approved",
)

client.set_model_version_tag(
    name="CustomerChurnModel",
    version=registered_model.version,
    key="deployment_date",
    value="2025-05-29",
)

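Modern Model Registry Features

Model aliases replace the deprecated stages with flexible, named references. You can assign multiple aliases to any model version (for example, champion, challenger, shadow), update aliases independently of model training for seamless deployments, and use them for A/B testing and gradual rollouts.

Model tags provide rich metadata and status tracking. Track validation status with validation_status: approved, mark deployment readiness with ready_for_prod: true, and record team ownership with team: data-science.

Environment-based models support mature MLOps workflows. Create a separate registered model for each environment (dev.CustomerChurnModel, staging.CustomerChurnModel, prod.CustomerChurnModel) and use copy_model_version() to promote models between environments.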

# Promote model from staging to production environment
client.copy_model_version(
    src_model_uri="models:/staging.CustomerChurnModel@candidate",
    dst_name="prod.CustomerChurnModel",
)
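Consumers of the registry usually refer to a model through a `models:/` URI, either by alias (as in the `src_model_uri` above) or by a pinned version number. The helper below is a small, hypothetical convenience for building such URIs consistently. The function name and the version number `3` are illustrative only.

```python
from typing import Optional, Union


def registry_model_uri(
    name: str,
    *,
    alias: Optional[str] = None,
    version: Optional[Union[int, str]] = None,
) -> str:
    """Build a models:/ URI from either an alias or a version (hypothetical helper)."""
    # Exactly one of the two selectors must be given
    if (alias is None) == (version is None):
        raise ValueError("Provide exactly one of alias or version")
    if alias is not None:
        return f"models:/{name}@{alias}"
    return f"models:/{name}/{version}"


champion_uri = registry_model_uri("CustomerChurnModel", alias="champion")
pinned_uri = registry_model_uri("prod.CustomerChurnModel", version=3)

print(champion_uri)  # models:/CustomerChurnModel@champion
print(pinned_uri)  # models:/prod.CustomerChurnModel/3
```

The resulting string can be passed to a loader such as `mlflow.sklearn.load_model` or `mlflow.pyfunc.load_model`. Loading by alias means serving code does not need to change when the alias is moved to a new version.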

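Advanced Features

Scikit-learn pipelines are first-class citizens in MLflow, with end-to-end workflow tracking from data preprocessing through model training. This ensures reproducibility of the entire ML workflow: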

import mlflow
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Enable autologging for pipelines
mlflow.sklearn.autolog()

# Create a complex preprocessing and modeling pipeline
# X_train and X_test are assumed to be pandas DataFrames with the columns below
numeric_features = ["age", "income", "credit_score"]
categorical_features = ["occupation", "location"]

# Preprocessing pipeline
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler()), ("selector", SelectKBest(f_regression, k=2))]
)

categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(drop="first", sparse_output=False))]
)

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Complete pipeline with model
pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)

# Train pipeline - all steps are automatically logged
with mlflow.start_run(run_name="Complete Pipeline Experiment"):
    pipeline.fit(X_train, y_train)

    # Pipeline scoring is automatically captured
    train_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)

    print(f"Pipeline R² score: {test_score:.3f}")

MLflow automatically logs the parameters from each pipeline stage, making it easy to see how the data was processed and which model parameters were used.
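Per-stage parameters are easy to recognize because scikit-learn exposes them under `<step>__<param>` names through `get_params(deep=True)`. The sketch below uses scikit-learn only, with a deliberately small two-step pipeline, to show what those names look like. It is an assumption here that the logged parameter names follow this same scheme.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(
    steps=[("scaler", StandardScaler()), ("regressor", LinearRegression())]
)

# deep=True also returns the parameters of every nested step
params = pipeline.get_params(deep=True)

# Keep only the per-step hyperparameters, which use the <step>__<param> naming
step_params = {key: value for key, value in params.items() if "__" in key}

for key in sorted(step_params):
    print(f"{key} = {step_params[key]!r}")
```

Nested pipelines extend the same scheme, so a selector inside a numeric transformer inside a preprocessor would appear under a name such as `preprocessor__num__selector__k`.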

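Conclusion

MLflow's scikit-learn integration provides a comprehensive solution for experiment tracking, model management, and deployment in traditional machine learning workflows. Whether you use simple autologging for quick experiments or implement complex production pipelines, MLflow scales to meet your needs.

Key benefits of using MLflow with scikit-learn:

- Effortless experiment tracking: one-line autologging that captures what is needed for reproducible ML.
- Hyperparameter optimization: built-in support for grid search, with organized child runs and easy comparison.
- Comprehensive evaluation: automatic metric generation, visualizations, and SHAP analysis through mlflow.evaluate().
- Production-ready deployment: Model Registry integration with alias-based deployment and quality gates.
- Team collaboration: centralized experiment management with rich metadata and artifacts.

The patterns and examples in this guide provide a solid foundation for building scalable, reproducible machine learning systems with scikit-learn and MLflow. Start with autologging for immediate benefits, then gradually adopt more advanced features such as model evaluation, the registry, and custom configuration as your needs grow.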