Scikit-learn with MLflow
In this comprehensive guide, we'll walk you through how to use scikit-learn with MLflow for experiment tracking, model management, and production deployment. We'll cover both autologging and manual logging approaches, from basic usage to advanced production patterns.
Getting Started with Autologging
The fastest way to get started is with MLflow's scikit-learn autologging. With a single line of code, you can automatically track parameters, metrics, and models from your scikit-learn experiments. This approach requires no changes to your existing training code and captures everything needed for reproducible ML workflows.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Enable autologging for scikit-learn
mlflow.sklearn.autolog()

# Load sample data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Train your model - MLflow automatically logs everything
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    model.fit(X_train, y_train)

    # Evaluation metrics are automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f"Training accuracy: {train_score:.3f}")
    print(f"Test accuracy: {test_score:.3f}")
This simple example automatically logs all model parameters, training metrics, the trained model with proper serialization, and a model signature for deployment - all without any additional code.
Understanding Autologging Behavior
- What Gets Logged
- Supported Estimators
MLflow's scikit-learn autologging automatically captures comprehensive information about your training process. Here's exactly what gets tracked every time you train a model:
Category | Information Captured
---|---
Parameters | All parameters from estimator.get_params(deep=True)
Metrics | Training score, classification/regression metrics
Tags | Estimator class name and fully qualified class name
Artifacts | Serialized model, model signature, metric information
The autologging system is designed to be comprehensive yet non-intrusive. It captures everything needed for reproducibility without requiring changes to your existing scikit-learn code.
Autologging works seamlessly with virtually all scikit-learn estimators and workflows. The integration is designed to handle both simple models and complex meta-estimators.
Core Estimators
- All estimators from sklearn.utils.all_estimators()
- Pipeline objects containing preprocessing and modeling steps
- Meta-estimators such as GridSearchCV and RandomizedSearchCV
- Ensemble methods including RandomForestClassifier and GradientBoostingRegressor
Special Handling
- Meta-estimators automatically create child runs for parameter search results
- Pipeline stages have their individual parameters and transformations logged
- Cross-validation results are captured and organized for easy comparison
Most preprocessing estimators (such as scalers and transformers) are excluded from individual logging to avoid clutter, but they are still tracked when used inside Pipeline objects, as the sketch below illustrates.
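To see where those pipeline parameter names come from, here is a minimal sketch (assuming a simple two-step pipeline) that prints the keys returned by get_params(deep=True) - the same source the autologging table above says parameters are drawn from:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# A scaler fitted on its own is excluded from individual autologging,
# but as a pipeline stage its parameters appear with prefixed names.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# Autologging records parameters from get_params(deep=True),
# e.g. "scaler__with_mean" and "clf__C".
for param_name in sorted(pipe.get_params(deep=True)):
    print(param_name)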
Logging Approaches
- Manual Logging
- Post-Training Metrics
For complete control over what gets logged, you can manually instrument your scikit-learn code. This approach is ideal when you need custom metrics, specific artifact logging, or want to organize experiments in a particular way.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from mlflow.models import infer_signature

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual logging approach
with mlflow.start_run():
    # Define hyperparameters
    params = {"C": 1.0, "max_iter": 1000, "solver": "lbfgs", "random_state": 42}

    # Log parameters
    mlflow.log_params(params)

    # Train model
    model = LogisticRegression(**params)
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate and log metrics
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted"),
        "recall": recall_score(y_test, y_pred, average="weighted"),
        "f1_score": f1_score(y_test, y_pred, average="weighted"),
    }
    mlflow.log_metrics(metrics)

    # Infer model signature
    signature = infer_signature(X_train, model.predict(X_train))

    # Log the model
    mlflow.sklearn.log_model(
        sk_model=model,
        name="model",
        signature=signature,
        input_example=X_train[:5],  # Sample input for documentation
    )
One of MLflow's most powerful features is automatic capture of evaluation metrics after model training. Any metrics you compute after training are automatically linked to your MLflow run, giving you seamless model evaluation tracking.
import mlflow
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Enable autologging with post-training metrics
mlflow.sklearn.autolog(log_post_training_metrics=True)

# Load data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

with mlflow.start_run():
    # Train model
    model = GradientBoostingClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Make predictions - this links predictions to the MLflow run
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # These metric calls are automatically logged to MLflow!
    accuracy = accuracy_score(y_test, y_pred)
    auc_score = roc_auc_score(y_test, y_proba)

    # Model scoring is also automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f"Accuracy: {accuracy:.3f}")
    print(f"AUC Score: {auc_score:.3f}")
The post-training metrics feature intelligently detects when you are evaluating a model and automatically logs those metrics with the appropriate dataset context, making it easy to track performance across different evaluation datasets.
Hyperparameter Tuning
- GridSearchCV
- RandomizedSearchCV
MLflow provides excellent support for scikit-learn's hyperparameter optimization tools, automatically creating organized child runs for parameter search experiments. This makes it easy to track and compare different parameter combinations.
import mlflow
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Enable autologging with hyperparameter tuning support
mlflow.sklearn.autolog(max_tuning_runs=10)  # Track top 10 parameter combinations

# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Define parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

with mlflow.start_run(run_name="Random Forest Hyperparameter Tuning"):
    # Create and fit GridSearchCV
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(
        rf, param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1
    )
    grid_search.fit(X_train, y_train)

    # Best model evaluation
    best_score = grid_search.score(X_test, y_test)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
    print(f"Test score: {best_score:.3f}")
MLflow automatically creates a parent run containing the overall search results, plus child runs for each parameter combination, making it easy to analyze which parameters performed best.
For more efficient hyperparameter exploration, especially with large parameter spaces, RandomizedSearchCV offers a great alternative. MLflow handles it just as seamlessly as GridSearchCV.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define parameter distributions for more efficient exploration
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(5, 20),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": uniform(0.1, 0.9),
}

with mlflow.start_run(run_name="Randomized Hyperparameter Search"):
    rf = RandomForestClassifier(random_state=42)
    random_search = RandomizedSearchCV(
        rf,
        param_distributions,
        n_iter=50,  # Try 50 random combinations
        cv=5,
        scoring="accuracy",
        random_state=42,
        n_jobs=-1,
    )
    random_search.fit(X_train, y_train)

    # MLflow automatically creates child runs for parameter combinations
    # The parent run contains the best model and overall results
The max_tuning_runs parameter in autolog controls how many of the best parameter combinations get their own child runs, helping you focus on the most promising results. The sketch below shows one way to inspect those child runs afterwards.
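As a hedged sketch of digging into those child runs, the snippet below assumes a hypothetical parent run ID taken from the tuning run above; the params.* column names are assumptions about what autologging records for each combination:

import mlflow

# Hypothetical parent run ID from the tuning run above
parent_run_id = "your_parent_run_id"

# Child runs created by autologging carry their parent's ID as a tag
child_runs = mlflow.search_runs(
    filter_string=f"tags.mlflow.parentRunId = '{parent_run_id}'"
)
print(child_runs[["run_id", "params.n_estimators", "params.max_depth"]].head())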
Model Evaluation with MLflow
- MLflow Evaluation API
- Regression Evaluation
- Custom Metrics and Artifacts
MLflow provides a comprehensive evaluation API that automatically generates metrics, visualizations, and diagnostic tools for scikit-learn models.
import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from mlflow.models import infer_signature

# Load data and train model
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create evaluation dataset
eval_data = X_test.copy()
eval_data = pd.DataFrame(eval_data, columns=wine.feature_names)
eval_data["label"] = y_test

with mlflow.start_run():
    # Log model with signature
    signature = infer_signature(X_test, model.predict(X_test))
    mlflow.sklearn.log_model(model, name="model", signature=signature)
    model_uri = mlflow.get_artifact_uri("model")

    # Comprehensive evaluation with MLflow
    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="label",
        model_type="classifier",  # or "regressor" for regression
        evaluators=["default"],
    )

    # Access automatic metrics
    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
    print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")

    # Access generated artifacts
    print("Generated artifacts:")
    for artifact_name, path in result.artifacts.items():
        print(f"  {artifact_name}: {path}")
Automatic generation includes:
- Performance metrics such as accuracy, precision, recall, F1 score, and ROC-AUC for classification.
- Visualizations including confusion matrices, ROC curves, and precision-recall curves.
- Feature importance with SHAP values and feature contribution analysis.
- Model artifacts: all plots and diagnostic information saved to MLflow.
For scikit-learn regression models, MLflow automatically provides regression-specific metrics.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

# Load regression dataset
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# Train regression model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

# Create evaluation dataset
eval_data = X_test.copy()
eval_data["target"] = y_test

with mlflow.start_run():
    # Log and evaluate regression model
    signature = infer_signature(X_train, reg_model.predict(X_train))
    mlflow.sklearn.log_model(reg_model, name="model", signature=signature)
    model_uri = mlflow.get_artifact_uri("model")

    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="target",
        model_type="regressor",
        evaluators=["default"],
    )

    print(f"MAE: {result.metrics['mean_absolute_error']:.3f}")
    print(f"RMSE: {result.metrics['root_mean_squared_error']:.3f}")
    print(f"R² Score: {result.metrics['r2_score']:.3f}")
Automatic regression metrics:
- Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) assess error magnitude.
- R² and adjusted R² measure goodness of fit.
- Mean Absolute Percentage Error (MAPE) shows relative error rates.
- Residual plots and distribution analysis help identify violations of model assumptions.
Extend MLflow's evaluation with custom metrics and visualizations specific to your scikit-learn models.
from mlflow.models import make_metric
import matplotlib.pyplot as plt
import numpy as np
import os

def business_value_metric(predictions, targets, sample_weights=None):
    """Custom business metric: value from correct predictions."""
    # Assume $50 value per correct prediction, $20 cost per error
    correct_predictions = (predictions == targets).sum()
    incorrect_predictions = len(predictions) - correct_predictions
    business_value = (correct_predictions * 50) - (incorrect_predictions * 20)
    return business_value

def create_feature_distribution_plot(eval_df, builtin_metrics, artifacts_dir):
    """Create feature distribution plots for model analysis."""
    # Select numeric features for distribution analysis
    numeric_features = eval_df.select_dtypes(include=[np.number]).columns
    numeric_features = [
        col for col in numeric_features if col not in ["label", "prediction"]
    ]

    if len(numeric_features) > 0:
        # Create subplot for feature distributions
        n_features = min(6, len(numeric_features))  # Show up to 6 features
        fig, axes = plt.subplots(2, 3, figsize=(15, 10))
        axes = axes.flatten()

        for i, feature in enumerate(numeric_features[:n_features]):
            axes[i].hist(eval_df[feature], bins=30, alpha=0.7, edgecolor="black")
            axes[i].set_title(f"Distribution of {feature}")
            axes[i].set_xlabel(feature)
            axes[i].set_ylabel("Frequency")

        # Hide unused subplots
        for i in range(n_features, len(axes)):
            axes[i].set_visible(False)

        plt.tight_layout()
        plot_path = os.path.join(artifacts_dir, "feature_distributions.png")
        plt.savefig(plot_path)
        plt.close()

        return {"feature_distributions": plot_path}

    return {}

# Create custom metric
custom_business_value = make_metric(
    eval_fn=business_value_metric, greater_is_better=True, name="business_value_score"
)

# Use custom metrics and artifacts
result = mlflow.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",
    extra_metrics=[custom_business_value],
    custom_artifacts=[create_feature_distribution_plot],
)

print(f"Business Value Score: ${result.metrics['business_value_score']:.2f}")
Model Comparison and Selection
- MLflow Model Comparison
- Hyperparameter Evaluation
Systematically compare multiple scikit-learn models using MLflow's evaluation capabilities.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Define models to compare
sklearn_models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(random_state=42, max_iter=1000),
    "svm": SVC(probability=True, random_state=42),
}

# Evaluate each model systematically
comparison_results = {}

for model_name, model in sklearn_models.items():
    with mlflow.start_run(run_name=f"eval_{model_name}"):
        # Train model
        model.fit(X_train, y_train)

        # Log model
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(model, name="model", signature=signature)
        model_uri = mlflow.get_artifact_uri("model")

        # Comprehensive evaluation with MLflow
        result = mlflow.evaluate(
            model_uri,
            eval_data,
            targets="label",
            model_type="classifier",
            evaluators=["default"],
        )

        comparison_results[model_name] = result.metrics

        # Log key metrics for comparison
        mlflow.log_metrics(
            {
                "accuracy": result.metrics["accuracy_score"],
                "f1": result.metrics["f1_score"],
                "roc_auc": result.metrics["roc_auc"],
                "precision": result.metrics["precision_score"],
                "recall": result.metrics["recall_score"],
            }
        )

# Create comparison summary
comparison_df = pd.DataFrame(comparison_results).T
print("Model Comparison Summary:")
print(comparison_df[["accuracy_score", "f1_score", "roc_auc"]].round(3))

# Identify best model
best_model = comparison_df["f1_score"].idxmax()
print(f"\nBest model by F1 score: {best_model}")
Combine hyperparameter tuning with MLflow evaluation for comprehensive assessment.
from sklearn.model_selection import ParameterGrid

# Define parameter grid for Random Forest
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Evaluate each parameter combination
grid_results = []

for params in ParameterGrid(param_grid):
    with mlflow.start_run(run_name="rf_grid_search"):
        # Log parameters
        mlflow.log_params(params)

        # Train model with current parameters
        model = RandomForestClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        # Log and evaluate
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(model, name="model", signature=signature)
        model_uri = mlflow.get_artifact_uri("model")

        # MLflow evaluation
        result = mlflow.evaluate(
            model_uri,
            eval_data,
            targets="label",
            model_type="classifier",
            evaluators=["default"],
        )

        # Track results
        grid_results.append(
            {
                **params,
                "f1_score": result.metrics["f1_score"],
                "roc_auc": result.metrics["roc_auc"],
                "accuracy": result.metrics["accuracy_score"],
            }
        )

        # Log selection metric
        mlflow.log_metric("grid_search_score", result.metrics["f1_score"])

# Find best parameters
best_result = max(grid_results, key=lambda x: x["f1_score"])
print(f"Best parameters: {best_result}")
Model Validation and Quality Gates
Use MLflow's validation API to ensure scikit-learn model quality.
from mlflow.models import MetricThreshold

# First, evaluate your scikit-learn model
result = mlflow.evaluate(model_uri, eval_data, targets="label", model_type="classifier")

# Define quality thresholds for classification models
quality_thresholds = {
    "accuracy_score": MetricThreshold(threshold=0.85, greater_is_better=True),
    "f1_score": MetricThreshold(threshold=0.80, greater_is_better=True),
    "roc_auc": MetricThreshold(threshold=0.75, greater_is_better=True),
}

# Validate model meets quality standards
try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        validation_thresholds=quality_thresholds,
    )
    print("✅ Scikit-learn model meets all quality thresholds")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"❌ Model failed validation: {e}")

# Compare against baseline model (e.g., previous model version)
# baseline_model_uri is a placeholder for a previously logged model URI
baseline_result = mlflow.evaluate(
    baseline_model_uri, eval_data, targets="label", model_type="classifier"
)

# Validate improvement over baseline
improvement_thresholds = {
    "f1_score": MetricThreshold(
        threshold=0.02, greater_is_better=True  # Must be 2% better
    ),
}

try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        baseline_result=baseline_result,
        validation_thresholds=improvement_thresholds,
    )
    print("✅ New model improves over baseline")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"❌ Model doesn't improve sufficiently: {e}")
Model Management
- Serialization and Formats
- Model Signatures
- Loading and Usage
MLflow supports multiple serialization formats for scikit-learn models, each optimized for different deployment scenarios. Understanding these options helps you choose the right approach for your production needs.
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Cloudpickle format (default) - better cross-system compatibility
mlflow.sklearn.log_model(
    sk_model=model,
    name="cloudpickle_model",
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE,
)

# Pickle format - faster but less portable
mlflow.sklearn.log_model(
    sk_model=model,
    name="pickle_model",
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE,
)
Cloudpickle is the default format because it provides better cross-system compatibility by identifying and packaging code dependencies alongside the serialized model. Pickle is faster but less portable across environments.
Model signatures describe input and output schemas, providing critical validation for production deployments. They help catch data compatibility issues early and ensure your model receives input in the correct format.
from mlflow.models import infer_signature
import pandas as pd

# Create model signature automatically
X_sample = X_train[:100]
predictions = model.predict(X_sample)
signature = infer_signature(X_sample, predictions)

# Log model with signature for production safety
mlflow.sklearn.log_model(
    sk_model=model,
    name="model_with_signature",
    signature=signature,
    input_example=X_sample[:5],  # Include example for documentation
)
Model signatures are inferred automatically when autologging is enabled, but you can also construct them manually for finer control over schema validation, as in the sketch below.
MLflow offers flexible ways to load and use your saved models depending on your deployment needs. You can load models as native scikit-learn objects or as generic Python functions.
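A minimal sketch of building a signature by hand, assuming two hypothetical double-typed input columns and an integer label output; Schema and ColSpec come from mlflow.types.schema:

from mlflow.models import ModelSignature
from mlflow.types.schema import Schema, ColSpec

# Hypothetical schema: two numeric features in, an integer class label out
input_schema = Schema(
    [
        ColSpec("double", "feature_1"),
        ColSpec("double", "feature_2"),
    ]
)
output_schema = Schema([ColSpec("long")])

manual_signature = ModelSignature(inputs=input_schema, outputs=output_schema)

# Log the model with the hand-built signature
mlflow.sklearn.log_model(
    sk_model=model,
    name="model_with_manual_signature",
    signature=manual_signature,
)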
# Load model in different ways
import mlflow.sklearn
import mlflow.pyfunc
import pandas as pd

run_id = "your_run_id_here"

# Load as scikit-learn model (preserves all sklearn functionality)
sklearn_model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
predictions = sklearn_model.predict(X_test)

# Load as PyFunc model (generic Python function interface)
pyfunc_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
predictions = pyfunc_model.predict(pd.DataFrame(X_test))

# Load from model registry (production deployment)
registered_model = mlflow.pyfunc.load_model("models:/MyModel@champion")
The PyFunc format is especially useful for deployment scenarios where you need a consistent interface across different model types and frameworks.
Production Deployment
- Model Registry
- Model Serving
The Model Registry provides centralized model management with versioning and alias-based deployment. This is essential for managing models from development through production.
# Register model to MLflow Model Registry
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Log and register model in one step
with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        name="model",
        registered_model_name="CustomerChurnModel",
        signature=signature,
    )

# Or register an existing model
run_id = "your_run_id"
model_uri = f"runs:/{run_id}/model"

# Register the model
registered_model = mlflow.register_model(model_uri=model_uri, name="CustomerChurnModel")

# Use aliases instead of deprecated stages for deployment management
# Set aliases for different deployment environments
client.set_registered_model_alias(
    name="CustomerChurnModel",
    alias="champion",  # Production model
    version=registered_model.version,
)

client.set_registered_model_alias(
    name="CustomerChurnModel",
    alias="challenger",  # A/B testing model
    version=registered_model.version,
)

# Use tags to track model status and metadata
client.set_model_version_tag(
    name="CustomerChurnModel",
    version=registered_model.version,
    key="validation_status",
    value="approved",
)

client.set_model_version_tag(
    name="CustomerChurnModel",
    version=registered_model.version,
    key="deployment_date",
    value="2025-05-29",
)
Modern Model Registry Features
Model aliases replace deprecated stages with flexible named references. You can assign multiple aliases to any model version (e.g., champion, challenger, shadow), update aliases independently of model training for seamless deployments, and use them for A/B testing and gradual rollouts.
Model version tags provide rich metadata and status tracking. Use validation_status: approved to track validation state, ready_for_prod: true to mark deployment readiness, and team: data-science to record team ownership.
Environment-based models support mature MLOps workflows. Create a separate registered model per environment - dev.CustomerChurnModel, staging.CustomerChurnModel, prod.CustomerChurnModel - and promote models between environments with copy_model_version().
# Promote model from staging to production environment
client.copy_model_version(
    src_model_uri="models:/staging.CustomerChurnModel@candidate",
    dst_name="prod.CustomerChurnModel",
)
MLflow provides built-in model serving capabilities that make it easy to deploy your scikit-learn models as REST APIs. This is perfect for development, testing, and small-scale production deployments.
# Serve model using alias for production deployment
mlflow models serve \
    -m "models:/CustomerChurnModel@champion" \
    -p 5000 \
    --no-conda
Deployment Best Practices
- Use aliases for production serving by pointing at the @champion or @production alias rather than hard-coding version numbers.
- Implement blue-green deployments by updating aliases to instantly switch traffic between model versions (see the sketch after this list).
- Ensure model signatures are logged so that serving provides automatic input validation.
- Configure environment variables for serving endpoints with the necessary authentication and configuration.
Once your model is being served, you can make predictions by sending POST requests to the endpoint.
import requests
import json

# Example prediction request
data = {"inputs": [[1.2, 0.8, 3.4, 2.1]]}  # Feature values

response = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(data),
)

predictions = response.json()
For larger production deployments, you can also deploy MLflow models to cloud platforms such as AWS SageMaker and Azure ML, or package them as Docker containers for Kubernetes orchestration. A sketch of the Docker route follows.
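A minimal sketch of the Docker path using the mlflow models build-docker CLI; the image name is an assumption, and the container serves the model on port 8080 internally:

# Build a Docker image for the registered model (image name is illustrative)
mlflow models build-docker \
    -m "models:/CustomerChurnModel@champion" \
    -n customer-churn-model

# Run the container locally, mapping the in-container serving port 8080
docker run -p 5001:8080 customer-churn-model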
Advanced Features
- Pipeline Integration
- Autologging Configuration
- Experiment Organization
Scikit-learn pipelines are first-class citizens in MLflow, providing end-to-end workflow tracking from data preprocessing through model training. This ensures your entire ML workflow is reproducible.
import mlflow
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Enable autologging for pipelines
mlflow.sklearn.autolog()

# Create a complex preprocessing and modeling pipeline
numeric_features = ["age", "income", "credit_score"]
categorical_features = ["occupation", "location"]

# Synthetic data matching the feature names above, so the example is runnable
rng = np.random.default_rng(42)
X = pd.DataFrame(
    {
        "age": rng.integers(18, 80, 500),
        "income": rng.normal(50000, 15000, 500),
        "credit_score": rng.integers(300, 850, 500),
        "occupation": rng.choice(["engineer", "teacher", "doctor"], 500),
        "location": rng.choice(["urban", "suburban", "rural"], 500),
    }
)
y = rng.normal(100, 20, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipeline
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler()), ("selector", SelectKBest(f_regression, k=2))]
)

categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(drop="first", sparse_output=False))]
)

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Complete pipeline with model
pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)

# Train pipeline - all steps are automatically logged
with mlflow.start_run(run_name="Complete Pipeline Experiment"):
    pipeline.fit(X_train, y_train)

    # Pipeline scoring is automatically captured
    train_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)

    print(f"Pipeline R² score: {test_score:.3f}")
MLflow automatically logs the parameters of every pipeline stage, making it easy to understand how the data was processed and which model parameters were used.
MLflow's autologging behavior can be customized to fit your specific workflow needs. Here are the key configuration options for controlling what and how things get logged:
# Fine-tune autologging behavior
mlflow.sklearn.autolog(
    log_input_examples=True,  # Include input examples in logged models
    log_model_signatures=True,  # Include model signatures
    log_models=True,  # Log trained models
    log_datasets=True,  # Log dataset information
    max_tuning_runs=10,  # Limit hyperparameter search child runs
    log_post_training_metrics=True,  # Enable post-training metric capture
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE,
    pos_label=1,  # Specify positive label for binary classification
    extra_tags={"team": "data-science", "project": "customer-churn"},
)
These configuration options give you fine-grained control over autologging behavior. Dataset logging tracks the data used for training and evaluation. Input examples and signatures are essential for production deployment. The maximum number of tuning runs controls how many hyperparameter combinations receive detailed tracking. Extra tags help organize experiments across teams and projects.
Proper experiment organization is crucial for team collaboration and project management. MLflow provides several features to help you structure and categorize experiments effectively.
# Organize experiments with descriptive names and tags
experiment_name = "Customer Churn Prediction - Q4 2024"
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="Baseline Random Forest"):
    # Use consistent tagging for easy filtering and organization
    mlflow.set_tags(
        {
            "model_type": "ensemble",
            "algorithm": "random_forest",
            "dataset_version": "v2.1",
            "feature_engineering": "standard",
            "purpose": "baseline",
        }
    )

    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
Consistent tagging and naming conventions make it much easier to find, compare, and understand experiments later. Consider establishing team-wide conventions for experiment names, tags, and run organization. A small search sketch follows.
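Consistent tags pay off at query time. A minimal sketch, assuming the experiment name and tags set above, uses mlflow.search_runs to pull back just the baseline runs:

import mlflow

# Find all baseline runs in the experiment, newest first
baseline_runs = mlflow.search_runs(
    experiment_names=["Customer Churn Prediction - Q4 2024"],
    filter_string="tags.purpose = 'baseline'",
    order_by=["start_time DESC"],
)
print(baseline_runs[["run_id", "tags.algorithm", "tags.dataset_version"]].head())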
Conclusion
MLflow's scikit-learn integration provides a comprehensive solution for experiment tracking, model management, and deployment in traditional machine learning workflows. Whether you are using simple autologging for quick experiments or implementing complex production pipelines, MLflow has you covered.
Key Benefits of Using MLflow with Scikit-learn
- Effortless experiment tracking with one-line autologging that captures everything needed for reproducible ML.
- Hyperparameter optimization with built-in grid search support, organized child runs, and easy comparison.
- Comprehensive evaluation through mlflow.evaluate() with automatic metric generation, visualizations, and SHAP analysis.
- Production-ready deployment with Model Registry integration, alias-based deployments, and quality gates.
- Team collaboration through centralized experiment management with rich metadata and artifacts.
The patterns and examples in this guide provide a solid foundation for building scalable, reproducible machine learning systems with scikit-learn and MLflow. Start with autologging for immediate benefits, then adopt more advanced features like model evaluation, the registry, and custom configurations as your needs grow.