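MLflow Scikit-learn Integration
Introduction
Scikit-learn is a comprehensive Python machine learning library that provides tools for classification, regression, clustering, and preprocessing. Built on NumPy, SciPy, and matplotlib, it gives every estimator a consistent API centered on the fit(), predict(), and transform() methods.
MLflow's scikit-learn integration adds automated experiment tracking, model management, and deployment to traditional machine learning workflows.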
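As a rough illustration of the consistent estimator API mentioned above, the sketch below runs a transformer and a classifier through the same fit/transform/predict pattern. The choice of StandardScaler, LogisticRegression, and the wine dataset is an assumption made for this example only.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Transformers learn statistics with fit() and apply them with transform().
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Predictors learn from data with fit() and produce outputs with predict().
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y)
predictions = clf.predict(X_scaled)
```

Because every estimator follows this pattern, MLflow can hook into fit() calls generically rather than per model class.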
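Why MLflow + Scikit-learn?
Autologging
A single line of code (mlflow.sklearn.autolog()) captures all parameters, metrics, cross-validation results, and models, with no manual instrumentation.
Complete Model Logging
Log trained models together with their serialization format, input/output signature, model dependencies, and Python environment for reproducible deployment.
Hyperparameter Tuning
Built-in support for GridSearchCV and RandomizedSearchCV, with child runs created automatically for parameter combinations (up to the max_tuning_runs limit).
Post-Training Metrics
Automatically capture evaluation metrics computed after training, including sklearn.metrics function calls and model.score() evaluations.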
Getting Started
Get started with scikit-learn and MLflow in just a few lines of code:
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
# Enable autologging
mlflow.sklearn.autolog()
# Load and prepare data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
# Train model - MLflow automatically logs everything!
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    model.fit(X_train, y_train)
    # Evaluation metrics are automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"Train accuracy: {train_score:.3f}, Test accuracy: {test_score:.3f}")
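Autologging captures all model parameters, training metrics, the trained model, and the model signature.
Running locally? By default, MLflow stores experiments in the current directory. For team collaboration or remote tracking, set up a tracking server.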
Autologging
Enable autologging to track scikit-learn experiments automatically:
import mlflow
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)
# Enable autologging
mlflow.sklearn.autolog()
with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    # Model scoring is automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
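What Gets Logged
With autologging enabled, MLflow automatically captures:
- Parameters: all model parameters from estimator.get_params(deep=True)
- Metrics: training score, classification/regression metrics, cross-validation results
- Model: the serialized model with its signature (an input example is included when log_input_examples=True)
- Artifacts: cross-validation results, metric information, model metadata
For GridSearchCV and RandomizedSearchCV, MLflow creates child runs for parameter combinations and logs the best estimator separately.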
Hyperparameter Tuning
Grid Search
MLflow automatically creates child runs for hyperparameter tuning:
import mlflow
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)
# Enable autologging
mlflow.sklearn.autolog(max_tuning_runs=10)
# Define parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
}
with mlflow.start_run(run_name="RF Hyperparameter Tuning"):
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(rf, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    grid_search.fit(X_train, y_train)
    best_score = grid_search.score(X_test, y_test)
    print(f"Best params: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.3f}")
    print(f"Test score: {best_score:.3f}")
Optuna Integration
For advanced hyperparameter optimization:
import mlflow
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
# Load data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)
mlflow.sklearn.autolog()
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
    }
    with mlflow.start_run(nested=True):
        model = GradientBoostingClassifier(**params, random_state=42)
        model.fit(X_train, y_train)
        accuracy = model.score(X_test, y_test)
        return accuracy
with mlflow.start_run():
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
    mlflow.log_params({f"best_{k}": v for k, v in study.best_params.items()})
    mlflow.log_metric("best_accuracy", study.best_value)
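The nested=True argument creates a child run under the parent run for each trial, which organizes the hyperparameter tuning experiment hierarchically. Learn more about hierarchical runs.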