
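MLflow Scikit-learn Integration

Introduction

Scikit-learn is a comprehensive Python machine learning library that provides tools for classification, regression, clustering, and preprocessing. Built on NumPy, SciPy, and matplotlib, it exposes a consistent API across all estimators through the unified fit(), predict(), and transform() methods.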
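As a minimal sketch of that consistent API, the snippet below pairs a transformer with a classifier. The dataset and the two estimators are illustrative choices, not something the integration requires.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformers learn statistics with fit() and apply them with transform()
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Predictors learn from data with fit() and produce outputs with predict()
clf = LogisticRegression(max_iter=200).fit(X_scaled, y)
predictions = clf.predict(X_scaled)

print(X_scaled.shape, predictions.shape)
```

Because every estimator follows the same method names, swapping in a different model usually means changing only the constructor call.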

MLflow's scikit-learn integration adds automated experiment tracking, model management, and deployment to traditional machine learning workflows.

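Why MLflow + Scikit-learn?

Autologging

A single line of code, mlflow.sklearn.autolog(), captures parameters, metrics, cross-validation results, and models, with no manual instrumentation.

Complete model logging

Log trained models together with their serialization format, input/output signature, model dependencies, and Python environment for reproducible deployment.

Hyperparameter tuning

Built-in support for GridSearchCV and RandomizedSearchCV automatically creates child runs for the parameter combinations that were searched.

Post-training metrics

Evaluation metrics computed after training are captured automatically, including calls to sklearn.metrics functions and model.score() evaluations.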

Getting Started

Get started with scikit-learn and MLflow in just a few lines of code:

python
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Enable autologging
mlflow.sklearn.autolog()

# Load and prepare data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Train model - MLflow automatically logs everything!
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    model.fit(X_train, y_train)

    # Evaluation metrics are automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f"Train accuracy: {train_score:.3f}, Test accuracy: {test_score:.3f}")

Autologging captures all model parameters, training metrics, the trained model, and the model signature.

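Tracking server setup

Running locally? By default, MLflow stores experiments in the current directory. For team collaboration or remote tracking, set up a tracking server.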

Autologging

Enable autologging to track scikit-learn experiments automatically:

python
import mlflow
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

# Enable autologging
mlflow.sklearn.autolog()

with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Model scoring is automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

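What gets logged

With autologging enabled, MLflow automatically captures:

  • Parameters: all model parameters from estimator.get_params(deep=True)
  • Metrics: training score, classification/regression metrics, cross-validation results
  • Model: the serialized model with its signature and, when log_input_examples is enabled, an input example
  • Artifacts: cross-validation results, metric information, model metadata

For GridSearchCV and RandomizedSearchCV, MLflow creates child runs for the parameter combinations (limited by max_tuning_runs) and logs the best estimator separately.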

Hyperparameter Tuning

MLflow automatically creates child runs for hyperparameter tuning:

python
import mlflow
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Enable autologging
mlflow.sklearn.autolog(max_tuning_runs=10)

# Define parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
}

with mlflow.start_run(run_name="RF Hyperparameter Tuning"):
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(rf, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    grid_search.fit(X_train, y_train)

    best_score = grid_search.score(X_test, y_test)
    print(f"Best params: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.3f}")
    print(f"Test score: {best_score:.3f}")

Optuna Integration

For more advanced hyperparameter optimization:

python
import mlflow
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Load data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

mlflow.sklearn.autolog()


def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
    }

    with mlflow.start_run(nested=True):
        model = GradientBoostingClassifier(**params, random_state=42)
        model.fit(X_train, y_train)
        accuracy = model.score(X_test, y_test)
        return accuracy


with mlflow.start_run():
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)

    mlflow.log_params({f"best_{k}": v for k, v in study.best_params.items()})
    mlflow.log_metric("best_accuracy", study.best_value)
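
Nested runs

The nested=True argument creates a child run under the parent run for each trial, which organizes hyperparameter tuning experiments hierarchically. Learn more about hierarchical runs.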

Learn more