Getting Started with MLflow + XGBoost

In this guide, we show how to train a model with XGBoost and record the training process with MLflow.

We will use the Databricks Free Trial, which has built-in support for MLflow and lets you use the Databricks platform at no cost. If you have not signed up yet, create an account before continuing.

You can run the code in this guide from a cloud-based notebook, such as a Databricks notebook or Google Colab, or on your local machine.

Install dependencies

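Let's install the mlflow package, along with xgboost and scikit-learn, which the code below imports.

%pip install mlflow xgboost scikit-learn

Then let's import the packages.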

import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

import mlflow
from mlflow.models import infer_signature

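Load and prepare the dataset

We will use the iris dataset to train a simple multiclass classifier for iris flowers.

Let's use load_iris() to load the dataset into a pandas DataFrame and take a look at the data.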

iris_df = load_iris(as_frame=True).frame
iris_df

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	2
146	6.3	2.5	5.0	1.9	2
147	6.5	3.0	5.2	2.0	2
148	6.2	3.4	5.4	2.3	2
149	5.9	3.0	5.1	1.8	2

150 rows × 5 columns

Now we split the dataset into a training set and a test set.

# Split into 80% training and 20% testing
train_df, test_df = train_test_split(iris_df, test_size=0.2, random_state=42)
train_df.shape, test_df.shape

((120, 5), (30, 5))
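To make the role of test_size and random_state concrete, here is a minimal pure-Python sketch of a seeded holdout split. The helper name holdout_split is hypothetical, and this is a simplified stand-in rather than scikit-learn's implementation, so it will not reproduce the exact row assignment of train_test_split. It only shows why a fixed seed gives the same 120/30 partition on every run.

```python
import math
import random


def holdout_split(rows, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed, then cut off a test fraction.

    Simplified illustration only; not scikit-learn's algorithm.
    """
    n_test = math.ceil(len(rows) * test_size)
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


row_ids = range(150)  # stand-in for the 150 iris rows
train_ids, test_ids = holdout_split(row_ids)
print(len(train_ids), len(test_ids))
```

Calling holdout_split again with the same seed returns the same two lists, which is the property that makes the metrics of repeated runs comparable.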

# Separate the target column for the training set
train_dataset = mlflow.data.from_pandas(train_df, name="train")
X_train = train_dataset.df.drop(["target"], axis=1)
y_train = train_dataset.df[["target"]]

dtrain = xgb.DMatrix(X_train, label=y_train)

# Separate the target column for the testing set
test_dataset = mlflow.data.from_pandas(test_df, name="test")
X_test = test_dataset.df.drop(["target"], axis=1)
y_test = test_dataset.df[["target"]]

dtest = xgb.DMatrix(X_test, label=y_test)

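Connect to the MLflow tracking server

Before training, we need to configure the MLflow tracking server, because we will log data to MLflow. In this tutorial, we use the Databricks Free Trial as the MLflow tracking server. For other options, such as using your own local MLflow server, see the Tracking Server Overview.

If you have not done so yet, set up your account and an access token for the Databricks Free Trial. Signing up should take no more than 5 minutes. In this guide, we need the ML experiment dashboard to track our training progress.

After registering an account on the Databricks Free Trial, let's connect MLflow to the Databricks Workspace. You will need to enter the following information: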

Databricks Host: https://<your workspace host>.cloud.databricks.com
Token: your personal access token

mlflow.login()
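mlflow.login() prompts for the host and token interactively. For non-interactive scripts, a common alternative is to supply the same two values through the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables. The sketch below only checks that both are set before a job starts; the helper name missing_credentials is hypothetical and not part of MLflow.

```python
import os

REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_credentials(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example environment with the host set but no token (placeholder values).
example_env = {
    "DATABRICKS_HOST": "https://<your workspace host>.cloud.databricks.com",
}
print(missing_credentials(example_env))
```

An empty list means both values are present. Read the token from a secret store or the shell environment rather than hard-coding it in a notebook.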

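This notebook is now connected to the hosted tracking server. Let's configure some MLflow metadata. Two things need to be set:

mlflow.set_tracking_uri: always use "databricks".
mlflow.set_experiment: pick a name you like, starting with /Users/<your email>/.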

mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/<your email>/mlflow-xgboost-quickstart")
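The experiment name has to follow the /Users/<your email>/<name> pattern described above. As a small illustration, the sketch below builds such a path from its parts. The helper experiment_path and the example address are hypothetical; only the path pattern comes from this guide.

```python
def experiment_path(email, name="mlflow-xgboost-quickstart"):
    """Build a workspace experiment path of the form /Users/<email>/<name>."""
    if not email or "/" in email:
        raise ValueError("email must be a non-empty string without '/'")
    return f"/Users/{email}/{name.strip('/')}"


path = experiment_path("someone@example.com")
print(path)
# Inside a connected notebook, the result would be passed on like this:
# mlflow.set_experiment(path)
```

Building the path in one place avoids typos such as a missing leading slash when several notebooks share the same experiment.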

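Logging with MLflow

MLflow has a powerful tracking API that lets us log runs and models together with their associated metadata, such as parameters and metrics. Let's train and evaluate our model.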

# Start a training run
with mlflow.start_run() as run:
  # Define and log the parameters for our model
  params = {
      "objective": "multi:softprob",
      "num_class": len(set(train_df["target"])),
      "max_depth": 8,
      "learning_rate": 0.05,
      "subsample": 0.9,
      "colsample_bytree": 0.9,
      "min_child_weight": 1,
      "gamma": 0,
      "reg_alpha": 0,
      "reg_lambda": 1,
      "random_state": 42,
  }
  training_config = {
      "num_boost_round": 200,
      "early_stopping_rounds": 20,
  }
  mlflow.log_params(params)
  mlflow.log_params(training_config)

  # Dictionary that xgb.train fills with the per-round evaluation metrics
  eval_results = {}
  # Train with early stopping, evaluating on both sets after every round
  model = xgb.train(
      params=params,
      dtrain=dtrain,
      num_boost_round=training_config["num_boost_round"],
      evals=[(dtrain, "train"), (dtest, "test")],
      early_stopping_rounds=training_config["early_stopping_rounds"],
      evals_result=eval_results,
      verbose_eval=False,
  )

  # Log training history to the run
  for epoch, (train_metrics, test_metrics) in enumerate(
      zip(eval_results["train"]["mlogloss"], eval_results["test"]["mlogloss"])
  ):
      mlflow.log_metrics(
          {"train_logloss": train_metrics, "test_logloss": test_metrics}, step=epoch
      )

  # Final evaluation
  y_pred_proba = model.predict(dtest)
  y_pred = np.argmax(y_pred_proba, axis=1)
  final_metrics = {
      "accuracy": accuracy_score(y_test, y_pred),
      "roc_auc": roc_auc_score(y_test, y_pred_proba, multi_class="ovr"),
  }
  mlflow.log_metrics(final_metrics, step=model.best_iteration)

  # Log the model at the best iteration, linked with all params and metrics
  model_info = mlflow.xgboost.log_model(
      xgb_model=model,
      name="xgboost_model",
      signature=infer_signature(X_train, y_pred_proba),
      input_example=X_train[:5],
      step=model.best_iteration,
  )
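Two steps in the run above are easy to miss. First, when evals has more than one entry, XGBoost uses the last one for early stopping, so here the test mlogloss decides when training stops and which round becomes best_iteration. Second, multi:softprob returns one probability per class for each row, and np.argmax turns each row into a class index. The pure-Python sketch below illustrates both ideas on made-up numbers. The function names are hypothetical, and the stopping rule is a simplified patience loop, not XGBoost's implementation.

```python
def best_round(losses, patience):
    """Return (best_index, rounds_run) for a patience-based stopping rule.

    Training stops once `patience` consecutive rounds fail to beat the
    best loss seen so far. Simplified illustration only.
    """
    best_index = 0
    for i, loss in enumerate(losses):
        if loss < losses[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i + 1
    return best_index, len(losses)


def to_labels(probabilities):
    """Pick the most probable class index in each row (row-wise argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up validation losses: no improvement after index 3.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56]
print(best_round(val_losses, patience=3))

# Made-up class probabilities for three samples and three classes.
probas = [[0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.1, 0.6, 0.3]]
labels = to_labels(probas)
print(labels, accuracy([0, 2, 2], labels))
```

With patience=3, the loop stops after three rounds without improvement and reports index 3 as the best round, which mirrors why the final metrics and the model above are logged with step=model.best_iteration.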

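View the results

Let's look at our training and test results. Log in to your Databricks Workspace and click the "Experiments" tab in the left menu. The initial page shows the list of runs, where we can find our run.

[Screenshot: runs page]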

Now let's go to the Models tab, where we can see the model we logged.

[Screenshot: models page]

Clicking the model name takes you to the model details page, which shows its parameters, metrics, and other metadata.

[Screenshot: model details page]

We can also inspect our model through the API:

logged_model = mlflow.get_logged_model(model_info.model_id)

logged_model, logged_model.metrics, logged_model.params
