Scikit-learn with MLflow
In this comprehensive guide, we walk through how to use scikit-learn with MLflow for experiment tracking, model management, and production deployment. We cover both autologging and manual logging approaches, from basic usage to advanced production patterns.
Quickstart with Autologging
The fastest way to get started is MLflow's autologging for scikit-learn. With a single line of code you can automatically track parameters, metrics, and models from your scikit-learn experiments. This approach requires no changes to your existing training code and captures everything needed for reproducible ML workflows:
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
# Enable autologging for scikit-learn
mlflow.sklearn.autolog()
# Load sample data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Train your model - MLflow automatically logs everything
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    model.fit(X_train, y_train)

    # Evaluation metrics are automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f"Training accuracy: {train_score:.3f}")
    print(f"Test accuracy: {test_score:.3f}")
This simple example automatically logs all model parameters, training metrics, the trained model with proper serialization, and a model signature for deployment, all without any additional code.
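If you want to confirm what autologging actually recorded, one option (a minimal sketch using MLflow's run APIs) is to fetch the run after the block above finishes and inspect its data:

# Sketch: inspect what the autologged run recorded (run this after the block above).
run = mlflow.last_active_run()
print("Run ID:", run.info.run_id)
print("Logged parameters:", list(run.data.params)[:5])
print("Logged metrics:", run.data.metrics)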
Understanding Autologging Behavior
- What Gets Logged
- Supported Estimators
MLflow's scikit-learn autologging automatically captures comprehensive information about the training process. Every time you train a model, the following is tracked:
| Category | Information Captured |
|---|---|
| Parameters | All parameters from estimator.get_params(deep=True) |
| Metrics | Training score, classification/regression metrics |
| Tags | Estimator class name and fully qualified class name |
| Artifacts | Serialized model, model signature, metric information |
The autologging system is designed to be comprehensive yet non-intrusive. It captures everything needed for reproducibility without requiring changes to your existing scikit-learn code.
Autologging works seamlessly with virtually all scikit-learn estimators and workflows. The integration is designed to handle both simple models and complex meta-estimators.
Core estimators
- All estimators from sklearn.utils.all_estimators()
- Pipeline objects with preprocessing and modeling steps
- Meta-estimators such as GridSearchCV and RandomizedSearchCV
- Ensemble methods including RandomForestClassifier and GradientBoostingRegressor
Special handling
- Meta-estimators automatically create child runs for parameter search results
- Pipeline stages have their individual parameters and transformations logged
- Cross-validation results are captured and organized for easy comparison
Most preprocessing estimators (such as scalers and transformers) are excluded from separate logging to avoid clutter, but they are still tracked when used inside Pipeline objects, as the sketch below illustrates.
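A minimal sketch of this behavior: with autologging enabled, a scaler fitted as a step inside a Pipeline has its parameters recorded as nested pipeline parameters of the run (the dataset and estimators here are only illustrative).

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mlflow.sklearn.autolog()

X, y = load_wine(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

with mlflow.start_run():
    pipe.fit(X, y)

# Nested step parameters (keys prefixed with "scaler__") should appear
# among the logged parameters of the pipeline run.
run = mlflow.last_active_run()
print([k for k in run.data.params if k.startswith("scaler__")])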
Logging Approaches
- Manual Logging
- Post-Training Metrics
For complete control over what gets logged, you can instrument your scikit-learn code manually. This approach is ideal when you need custom metrics, specific artifact logging, or want to organize experiments in a particular way:
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from mlflow.models import infer_signature
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual logging approach
with mlflow.start_run():
    # Define hyperparameters
    params = {"C": 1.0, "max_iter": 1000, "solver": "lbfgs", "random_state": 42}

    # Log parameters
    mlflow.log_params(params)

    # Train model
    model = LogisticRegression(**params)
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate and log metrics
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted"),
        "recall": recall_score(y_test, y_pred, average="weighted"),
        "f1_score": f1_score(y_test, y_pred, average="weighted"),
    }
    mlflow.log_metrics(metrics)

    # Infer model signature
    signature = infer_signature(X_train, model.predict(X_train))

    # Log the model
    mlflow.sklearn.log_model(
        sk_model=model,
        name="model",
        signature=signature,
        input_example=X_train[:5],  # Sample input for documentation
    )
One of MLflow's most powerful features is the automatic capture of evaluation metrics after model training. Any metric you compute after training is automatically linked to your MLflow run, giving you seamless model evaluation tracking:
import mlflow
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
# Enable autologging with post-training metrics
mlflow.sklearn.autolog(log_post_training_metrics=True)
# Load data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

with mlflow.start_run():
    # Train model
    model = GradientBoostingClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Make predictions - this links predictions to the MLflow run
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # These metric calls are automatically logged to MLflow!
    accuracy = accuracy_score(y_test, y_pred)
    auc_score = roc_auc_score(y_test, y_proba)

    # Model scoring is also automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f"Accuracy: {accuracy:.3f}")
    print(f"AUC Score: {auc_score:.3f}")
The post-training metrics feature intelligently detects when you are evaluating a model and automatically logs those metrics with the appropriate dataset context, making it easy to track performance across different evaluation datasets. A minimal sketch with an extra hold-out split follows.
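Continuing the snippet above, metrics computed on a second hold-out split are captured as well. This is a sketch; X_val and y_val are an extra split introduced here for illustration, and the exact logged metric name depends on your MLflow version (it typically reflects the dataset variable name).

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=100, random_state=42)
    model.fit(X_tr, y_tr)

    # With log_post_training_metrics=True, this call is recorded automatically;
    # the metric key typically includes the dataset variable name (e.g. "..._X_val").
    val_accuracy = accuracy_score(y_val, model.predict(X_val))
    print(f"Validation accuracy: {val_accuracy:.3f}")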
Hyperparameter Tuning
- GridSearchCV
- RandomizedSearchCV
MLflow provides excellent support for scikit-learn's hyperparameter optimization tools, automatically creating organized child runs for parameter search experiments. This makes it easy to track and compare different parameter combinations:
import mlflow
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# Enable autologging with hyperparameter tuning support
mlflow.sklearn.autolog(max_tuning_runs=10) # Track top 10 parameter combinations
# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Define parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

with mlflow.start_run(run_name="Random Forest Hyperparameter Tuning"):
    # Create and fit GridSearchCV
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(
        rf, param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1
    )
    grid_search.fit(X_train, y_train)

    # Best model evaluation
    best_score = grid_search.score(X_test, y_test)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
    print(f"Test score: {best_score:.3f}")
MLflow automatically creates a parent run containing the overall search results, plus a child run for each parameter combination, making it easy to analyze which parameters worked best; the child runs can also be queried programmatically, as sketched below.
For more efficient hyperparameter exploration, especially over large parameter spaces, RandomizedSearchCV is a great alternative. MLflow handles it just as seamlessly as GridSearchCV:
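As a sketch of how to look at those child runs afterwards, you can filter on the built-in mlflow.parentRunId tag. This assumes the tuning run above has just finished in the same process; the exact metrics recorded on child runs (for example mean_test_score) may vary by MLflow version.

from mlflow import MlflowClient

client = MlflowClient()
parent_run = mlflow.last_active_run()

child_runs = client.search_runs(
    experiment_ids=[parent_run.info.experiment_id],
    filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'",
)
for child in child_runs[:5]:
    # Each child run records the searched parameters and its CV metrics
    print(child.data.params.get("max_depth"), child.data.metrics)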
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define parameter distributions for more efficient exploration
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(5, 20),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": uniform(0.1, 0.9),
}

with mlflow.start_run(run_name="Randomized Hyperparameter Search"):
    rf = RandomForestClassifier(random_state=42)
    random_search = RandomizedSearchCV(
        rf,
        param_distributions,
        n_iter=50,  # Try 50 random combinations
        cv=5,
        scoring="accuracy",
        random_state=42,
        n_jobs=-1,
    )
    random_search.fit(X_train, y_train)

    # MLflow automatically creates child runs for parameter combinations
    # The parent run contains the best model and overall results
The max_tuning_runs parameter of autologging controls how many of the best parameter combinations get their own child runs, helping you focus on the most promising results.
Model Evaluation with MLflow
- The MLflow Evaluation API
- Regression Evaluation
- Custom Metrics and Artifacts
MLflow provides a comprehensive evaluation API that automatically generates metrics, visualizations, and diagnostic tools for scikit-learn models:
import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from mlflow.models import infer_signature
# Load data and train model
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Create evaluation dataset
eval_data = X_test.copy()
eval_data = pd.DataFrame(eval_data, columns=wine.feature_names)
eval_data["label"] = y_test
with mlflow.start_run():
    # Log model with signature
    signature = infer_signature(X_test, model.predict(X_test))
    mlflow.sklearn.log_model(model, name="model", signature=signature)
    model_uri = mlflow.get_artifact_uri("model")

    # Comprehensive evaluation with MLflow
    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="label",
        model_type="classifier",  # or "regressor" for regression
        evaluators=["default"],
    )

    # Access automatic metrics
    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
    print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")

    # Access generated artifacts
    print("Generated artifacts:")
    for artifact_name, path in result.artifacts.items():
        print(f"  {artifact_name}: {path}")
What gets generated automatically:
- Performance metrics such as accuracy, precision, recall, F1 score, and ROC-AUC for classification.
- Visualizations including confusion matrices, ROC curves, and precision-recall curves.
- Feature importance, including SHAP values and feature contribution analysis.
- Model artifacts, with all plots and diagnostics saved to MLflow.
For scikit-learn regression models, MLflow automatically provides regression-specific metrics:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
# Load regression dataset
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
# Train regression model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)
# Create evaluation dataset
eval_data = X_test.copy()
eval_data["target"] = y_test
with mlflow.start_run():
    # Log and evaluate regression model
    signature = infer_signature(X_train, reg_model.predict(X_train))
    mlflow.sklearn.log_model(reg_model, name="model", signature=signature)
    model_uri = mlflow.get_artifact_uri("model")

    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="target",
        model_type="regressor",
        evaluators=["default"],
    )

    print(f"MAE: {result.metrics['mean_absolute_error']:.3f}")
    print(f"RMSE: {result.metrics['root_mean_squared_error']:.3f}")
    print(f"R² Score: {result.metrics['r2_score']:.3f}")
Automatic regression metrics:
- Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) for assessing error magnitude.
- R² and adjusted R² for measuring goodness of fit.
- Mean Absolute Percentage Error (MAPE) for relative error rates.
- Residual plots and distribution analysis to help identify violations of model assumptions.
Extend MLflow evaluation with custom metrics and visualizations specific to your scikit-learn models:
from mlflow.models import make_metric
import matplotlib.pyplot as plt
import numpy as np
import os
def business_value_metric(predictions, targets, sample_weights=None):
    """Custom business metric: value from correct predictions."""
    # Assume $50 value per correct prediction, $20 cost per error
    correct_predictions = (predictions == targets).sum()
    incorrect_predictions = len(predictions) - correct_predictions
    business_value = (correct_predictions * 50) - (incorrect_predictions * 20)
    return business_value


def create_feature_distribution_plot(eval_df, builtin_metrics, artifacts_dir):
    """Create feature distribution plots for model analysis."""
    # Select numeric features for distribution analysis
    numeric_features = eval_df.select_dtypes(include=[np.number]).columns
    numeric_features = [
        col for col in numeric_features if col not in ["label", "prediction"]
    ]

    if len(numeric_features) > 0:
        # Create subplot for feature distributions
        n_features = min(6, len(numeric_features))  # Show up to 6 features
        fig, axes = plt.subplots(2, 3, figsize=(15, 10))
        axes = axes.flatten()

        for i, feature in enumerate(numeric_features[:n_features]):
            axes[i].hist(eval_df[feature], bins=30, alpha=0.7, edgecolor="black")
            axes[i].set_title(f"Distribution of {feature}")
            axes[i].set_xlabel(feature)
            axes[i].set_ylabel("Frequency")

        # Hide unused subplots
        for i in range(n_features, len(axes)):
            axes[i].set_visible(False)

        plt.tight_layout()
        plot_path = os.path.join(artifacts_dir, "feature_distributions.png")
        plt.savefig(plot_path)
        plt.close()

        return {"feature_distributions": plot_path}
    return {}


# Create custom metric
custom_business_value = make_metric(
    eval_fn=business_value_metric, greater_is_better=True, name="business_value_score"
)

# Use custom metrics and artifacts
result = mlflow.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",
    extra_metrics=[custom_business_value],
    custom_artifacts=[create_feature_distribution_plot],
)
print(f"Business Value Score: ${result.metrics['business_value_score']:.2f}")
Model Comparison and Selection
- Model Comparison with MLflow
- Hyperparameter Evaluation
Use MLflow evaluation to compare multiple scikit-learn models systematically:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Define models to compare
sklearn_models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(random_state=42, max_iter=1000),
    "svm": SVC(probability=True, random_state=42),
}

# Evaluate each model systematically
comparison_results = {}

for model_name, model in sklearn_models.items():
    with mlflow.start_run(run_name=f"eval_{model_name}"):
        # Train model
        model.fit(X_train, y_train)

        # Log model
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(model, name="model", signature=signature)
        model_uri = mlflow.get_artifact_uri("model")

        # Comprehensive evaluation with MLflow
        result = mlflow.evaluate(
            model_uri,
            eval_data,
            targets="label",
            model_type="classifier",
            evaluators=["default"],
        )

        comparison_results[model_name] = result.metrics

        # Log key metrics for comparison
        mlflow.log_metrics(
            {
                "accuracy": result.metrics["accuracy_score"],
                "f1": result.metrics["f1_score"],
                "roc_auc": result.metrics["roc_auc"],
                "precision": result.metrics["precision_score"],
                "recall": result.metrics["recall_score"],
            }
        )
# Create comparison summary
import pandas as pd
comparison_df = pd.DataFrame(comparison_results).T
print("Model Comparison Summary:")
print(comparison_df[["accuracy_score", "f1_score", "roc_auc"]].round(3))
# Identify best model
best_model = comparison_df["f1_score"].idxmax()
print(f"\nBest model by F1 score: {best_model}")
Combine hyperparameter tuning with MLflow evaluation for thorough assessment:
from sklearn.model_selection import ParameterGrid
# Define parameter grid for Random Forest
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Evaluate each parameter combination
grid_results = []

for params in ParameterGrid(param_grid):
    with mlflow.start_run(run_name="rf_grid_search"):
        # Log parameters
        mlflow.log_params(params)

        # Train model with current parameters
        model = RandomForestClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        # Log and evaluate
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(model, name="model", signature=signature)
        model_uri = mlflow.get_artifact_uri("model")

        # MLflow evaluation
        result = mlflow.evaluate(
            model_uri,
            eval_data,
            targets="label",
            model_type="classifier",
            evaluators=["default"],
        )

        # Track results
        grid_results.append(
            {
                **params,
                "f1_score": result.metrics["f1_score"],
                "roc_auc": result.metrics["roc_auc"],
                "accuracy": result.metrics["accuracy_score"],
            }
        )

        # Log selection metric
        mlflow.log_metric("grid_search_score", result.metrics["f1_score"])
# Find best parameters
best_result = max(grid_results, key=lambda x: x["f1_score"])
print(f"Best parameters: {best_result}")
Model Validation and Quality Gates
Use MLflow's validation API to enforce quality standards for scikit-learn models:
from mlflow.models import MetricThreshold
# First, evaluate your scikit-learn model
result = mlflow.evaluate(model_uri, eval_data, targets="label", model_type="classifier")
# Define quality thresholds for classification models
quality_thresholds = {
    "accuracy_score": MetricThreshold(threshold=0.85, greater_is_better=True),
    "f1_score": MetricThreshold(threshold=0.80, greater_is_better=True),
    "roc_auc": MetricThreshold(threshold=0.75, greater_is_better=True),
}

# Validate model meets quality standards
try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        validation_thresholds=quality_thresholds,
    )
    print("✅ Scikit-learn model meets all quality thresholds")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"❌ Model failed validation: {e}")

# Compare against baseline model (e.g., previous model version)
baseline_result = mlflow.evaluate(
    baseline_model_uri, eval_data, targets="label", model_type="classifier"
)

# Validate improvement over baseline
improvement_thresholds = {
    "f1_score": MetricThreshold(
        threshold=0.02, greater_is_better=True  # Must be 2% better
    ),
}

try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        baseline_result=baseline_result,
        validation_thresholds=improvement_thresholds,
    )
    print("✅ New model improves over baseline")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"❌ Model doesn't improve sufficiently: {e}")
Model Management
- Serialization and Formats
- Model Signatures
- Loading and Using Models
MLflow supports multiple serialization formats for scikit-learn models, each optimized for different deployment scenarios. Understanding these options helps you choose the right approach for your production needs:
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
# Cloudpickle format (default) - better cross-system compatibility
mlflow.sklearn.log_model(
    sk_model=model,
    name="cloudpickle_model",
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE,
)

# Pickle format - faster but less portable
mlflow.sklearn.log_model(
    sk_model=model,
    name="pickle_model",
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE,
)
Cloudpickle is the default format because it provides better cross-system compatibility by identifying and packaging code dependencies along with the serialized model. Pickle is faster but less portable across environments. The same format options apply when saving a model to a local path, as sketched below.
Model signatures describe the input and output schema, providing critical validation for production deployment. They help catch data compatibility issues early and ensure your model receives the correct input format:
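As a minimal sketch, mlflow.sklearn.save_model writes the same model files to a local directory instead of logging them to a run; the output path below is purely illustrative.

mlflow.sklearn.save_model(
    sk_model=model,
    path="local_model_cloudpickle",  # illustrative local output directory
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE,
)
# The saved directory can be reloaded the same way as a logged model
reloaded = mlflow.sklearn.load_model("local_model_cloudpickle")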
from mlflow.models import infer_signature
import pandas as pd
# Create model signature automatically
X_sample = X_train[:100]
predictions = model.predict(X_sample)
signature = infer_signature(X_sample, predictions)
# Log model with signature for production safety
mlflow.sklearn.log_model(
    sk_model=model,
    name="model_with_signature",
    signature=signature,
    input_example=X_sample[:5],  # Include example for documentation
)
Model signatures are inferred automatically when autologging is enabled, but you can also construct them manually for finer control over the schema validation process; a minimal manual construction is sketched below.
MLflow offers flexible ways to load and use your saved models depending on your deployment needs. You can load models as native scikit-learn objects or as generic Python functions:
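A sketch of building a signature by hand with MLflow's schema classes; the column names and types below are illustrative and should match your own feature matrix.

from mlflow.models.signature import ModelSignature
from mlflow.types.schema import ColSpec, Schema

# Illustrative two-column input schema and an integer output column
input_schema = Schema(
    [
        ColSpec("double", "alcohol"),
        ColSpec("double", "malic_acid"),
    ]
)
output_schema = Schema([ColSpec("long")])
manual_signature = ModelSignature(inputs=input_schema, outputs=output_schema)

# A manually built signature is passed to log_model just like an inferred one:
# mlflow.sklearn.log_model(sk_model=model, name="model", signature=manual_signature)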
# Load model in different ways
import mlflow.sklearn
import mlflow.pyfunc
import pandas as pd
run_id = "your_run_id_here"
# Load as scikit-learn model (preserves all sklearn functionality)
sklearn_model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
predictions = sklearn_model.predict(X_test)
# Load as PyFunc model (generic Python function interface)
pyfunc_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
predictions = pyfunc_model.predict(pd.DataFrame(X_test))
# Load from model registry (production deployment)
registered_model = mlflow.pyfunc.load_model("models:/MyModel@champion")
The PyFunc format is especially useful for deployment scenarios that need a consistent interface across different model types and frameworks.
Production Deployment
- Model Registry
- Model Serving
The Model Registry provides centralized model management with versioning and alias-based deployment. It is essential for managing models from development through production deployment:
# Register model to MLflow Model Registry
import mlflow
from mlflow import MlflowClient
client = MlflowClient()
# Log and register model in one step
with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        name="model",
        registered_model_name="CustomerChurnModel",
        signature=signature,
    )
# Or register an existing model
run_id = "your_run_id"
model_uri = f"runs:/{run_id}/model"
# Register the model
registered_model = mlflow.register_model(model_uri=model_uri, name="CustomerChurnModel")
# Use aliases instead of deprecated stages for deployment management
# Set aliases for different deployment environments
client.set_registered_model_alias(
    name="CustomerChurnModel",
    alias="champion",  # Production model
    version=registered_model.version,
)

client.set_registered_model_alias(
    name="CustomerChurnModel",
    alias="challenger",  # A/B testing model
    version=registered_model.version,
)

# Use tags to track model status and metadata
client.set_model_version_tag(
    name="CustomerChurnModel",
    version=registered_model.version,
    key="validation_status",
    value="approved",
)

client.set_model_version_tag(
    name="CustomerChurnModel",
    version=registered_model.version,
    key="deployment_date",
    value="2025-05-29",
)
Modern Model Registry features
Model aliases replace the deprecated stages with flexible, named references. You can assign multiple aliases to any model version (for example champion, challenger, shadow), update aliases independently of model training for seamless deployment, and use them for A/B testing and gradual rollouts.
Model tags provide rich metadata and status tracking. Track validation state with validation_status: approved, mark deployment readiness with ready_for_prod: true, and record team ownership with team: data-science.
Environment-based models support mature MLOps workflows. Create a separate registered model per environment, such as dev.CustomerChurnModel, staging.CustomerChurnModel, and prod.CustomerChurnModel, and promote models between environments with copy_model_version().
# Promote model from staging to production environment
client.copy_model_version(
    src_model_uri="models:/staging.CustomerChurnModel@candidate",
    dst_name="prod.CustomerChurnModel",
)
MLflow provides built-in model serving that makes it easy to deploy scikit-learn models as REST APIs. This works well for development, testing, and small-scale production deployments:
# Serve model using alias for production deployment
mlflow models serve \
    -m "models:/CustomerChurnModel@champion" \
    -p 5000 \
    --env-manager local
Deployment Best Practices
- Use aliases for production serving by pointing at the @champion or @production alias rather than hard-coding version numbers.
- Enable blue-green deployments by updating an alias to switch traffic between model versions instantly (see the sketch after this list).
- Ensure model signatures are in place so the serving endpoint performs automatic input validation.
- Configure environment variables on the serving endpoint with any required authentication and configuration.
Once the model is being served, you can make predictions by sending POST requests to the endpoint:
import requests
import json
# Example prediction request
data = {"inputs": [[1.2, 0.8, 3.4, 2.1]]}  # Feature values

response = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(data),
)

predictions = response.json()
For larger production deployments, you can also deploy MLflow models to cloud platforms such as AWS SageMaker or Azure ML, or ship them as Docker containers orchestrated by Kubernetes; a Docker-image sketch follows.
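A sketch of building such a container from Python. This assumes the mlflow.models.build_docker helper is available in your MLflow version and that Docker is installed locally; the image name is illustrative.

import mlflow.models

mlflow.models.build_docker(
    model_uri="models:/CustomerChurnModel@champion",
    name="churn-model-serving",  # illustrative image name
)
# The resulting image exposes the standard /invocations scoring endpoint and can be
# pushed to a registry for Kubernetes or another container platform.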
Advanced Features
- Pipeline Integration
- Autologging Configuration
- Experiment Organization
Scikit-learn pipelines are first-class citizens in MLflow, providing end-to-end workflow tracking from data preprocessing through model training. This ensures reproducibility of the entire ML workflow:
import mlflow
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Enable autologging for pipelines
mlflow.sklearn.autolog()
# Create a complex preprocessing and modeling pipeline
numeric_features = ["age", "income", "credit_score"]
categorical_features = ["occupation", "location"]
# Preprocessing pipeline
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler()), ("selector", SelectKBest(f_regression, k=2))]
)

categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(drop="first", sparse_output=False))]
)

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Complete pipeline with model
pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)

# Train pipeline - all steps are automatically logged
with mlflow.start_run(run_name="Complete Pipeline Experiment"):
    pipeline.fit(X_train, y_train)

    # Pipeline scoring is automatically captured
    train_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)

    print(f"Pipeline R² score: {test_score:.3f}")
MLflow automatically logs parameters from every pipeline stage, making it easy to see how your data was processed and which model parameters were used; you can inspect the captured stage parameters as sketched below.
MLflow's autologging behavior can be customized to fit your specific workflow needs. The key configuration options for controlling what gets logged, and how, are shown here:
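A sketch of checking this after the pipeline run above: nested stage parameters should appear with double-underscore prefixes such as preprocessor__ and regressor__ (key names depend on your step names).

# Fetch the finished pipeline run and list stage-level parameters
run = mlflow.last_active_run()
stage_params = {
    key: value
    for key, value in run.data.params.items()
    if key.startswith(("preprocessor__", "regressor__"))
}
for key in sorted(stage_params)[:10]:
    print(key, "=", stage_params[key])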
# Fine-tune autologging behavior
mlflow.sklearn.autolog(
    log_input_examples=True,  # Include input examples in logged models
    log_model_signatures=True,  # Include model signatures
    log_models=True,  # Log trained models
    log_datasets=True,  # Log dataset information
    max_tuning_runs=10,  # Limit hyperparameter search child runs
    log_post_training_metrics=True,  # Enable post-training metric capture
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE,
    pos_label=1,  # Specify positive label for binary classification
    extra_tags={"team": "data-science", "project": "customer-churn"},
)
These configuration options give you fine-grained control over autologging behavior. Dataset logging tracks the data used for training and evaluation. Input examples and signatures are essential for production deployment. The max_tuning_runs setting controls how many hyperparameter combinations get detailed tracking. Extra tags help organize experiments across teams and projects.
Proper experiment organization is essential for team collaboration and project management. MLflow provides several features to help you structure and categorize experiments effectively:
# Organize experiments with descriptive names and tags
experiment_name = "Customer Churn Prediction - Q4 2024"
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="Baseline Random Forest"):
    # Use consistent tagging for easy filtering and organization
    mlflow.set_tags(
        {
            "model_type": "ensemble",
            "algorithm": "random_forest",
            "dataset_version": "v2.1",
            "feature_engineering": "standard",
            "purpose": "baseline",
        }
    )

    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
Consistent tagging and naming conventions make it much easier to find, compare, and understand experiments later. Consider establishing team-wide conventions for experiment names, tags, and run organization. Tags also make runs easy to filter programmatically, as sketched below.
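For instance, a sketch of filtering runs by the tags set above; mlflow.search_runs returns a pandas DataFrame whose columns (such as tags.model_type) depend on what was actually logged.

runs = mlflow.search_runs(
    experiment_names=["Customer Churn Prediction - Q4 2024"],
    filter_string="tags.algorithm = 'random_forest' AND tags.purpose = 'baseline'",
    order_by=["attributes.start_time DESC"],
)
print(runs[["run_id", "tags.model_type", "tags.dataset_version"]].head())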
Conclusion
MLflow's scikit-learn integration provides a comprehensive solution for experiment tracking, model management, and deployment in traditional machine learning workflows. Whether you use simple autologging for quick experiments or implement sophisticated production pipelines, MLflow scales to meet your needs.
Key benefits of using MLflow with scikit-learn:
- Effortless experiment tracking: one-line autologging captures everything needed for reproducible ML.
- Hyperparameter optimization: built-in support for grid search with organized child runs and easy comparison.
- Comprehensive evaluation: automatic metric generation, visualizations, and SHAP analysis via mlflow.evaluate().
- Production-ready deployment: Model Registry integration with alias-based deployment and quality gates.
- Team collaboration: centralized experiment management with rich metadata and artifacts.
The patterns and examples in this guide provide a solid foundation for building scalable, reproducible machine learning systems with scikit-learn and MLflow. Start with autologging for immediate benefits, then gradually adopt more advanced features such as model evaluation, the registry, and custom configuration as your needs grow.