Getting Started with MLflow + Scikit-learn

In this guide, we show how to train a model with scikit-learn and record the training process with MLflow.

We will use the Databricks Free Trial, which has built-in support for MLflow and lets you use the Databricks platform at no cost. If you have not signed up yet, register for an account before continuing.

You can run the code in this guide from a cloud-based notebook, such as a Databricks notebook or Google Colab, or on your local machine.

Install dependencies

Let's install the mlflow package.

%pip install mlflow

Then import the packages we need.

from sklearn.datasets import load_iris
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import mlflow

Load and prepare the dataset

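We will train a simple model on the iris dataset. Iris is a three-class flower classification dataset; to keep the example small, this guide fits an ElasticNet regressor to the numeric class label in the target column rather than a dedicated classifier.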

Let's use load_iris() to load the dataset into a pandas DataFrame and take a look at the data.

iris_df = load_iris(as_frame=True).frame
iris_df

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	2
146	6.3	2.5	5.0	1.9	2
147	6.5	3.0	5.2	2.0	2
148	6.2	3.4	5.4	2.3	2
149	5.9	3.0	5.1	1.8	2

150 rows × 5 columns

Now we split the dataset into a training set and a test set.

# Split into 80% training and 20% testing
train_df, test_df = train_test_split(iris_df, test_size=0.2, random_state=42)
train_df.shape, test_df.shape

((120, 5), (30, 5))
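The shapes above follow from applying test_size=0.2 to 150 rows. As a rough sketch of that arithmetic (the helper name split_sizes is hypothetical, and rounding the test set up with a ceiling is an assumption about how scikit-learn treats a fractional test_size):

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction strictly between 0 and 1")
    # Assumption: the test set is rounded up, the training set gets the rest.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(150, 0.2)
print(n_train, n_test)
```

The row counts it prints should match the first element of each shape reported by train_test_split above.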

# Separate the target column for the training set
train_dataset = mlflow.data.from_pandas(train_df, name="train")
train_x = train_dataset.df.drop(["target"], axis=1)
train_y = train_dataset.df[["target"]]

train_x.shape, train_y.shape

((120, 4), (120, 1))

# Separate the target column for the testing set
test_dataset = mlflow.data.from_pandas(test_df, name="test")
test_x = test_dataset.df.drop(["target"], axis=1)
test_y = test_dataset.df[["target"]]

test_x.shape, test_y.shape

((30, 4), (30, 1))

Define the model

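For this example, we will use an ElasticNet model with a few predefined hyperparameters. Let's also define a helper function that computes some metrics for evaluating the model's performance.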
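Before the model definition, it may help to see what the two hyperparameters used in this guide, alpha and l1_ratio, control. As documented by scikit-learn, ElasticNet minimizes a squared-error term plus a blend of L1 and L2 penalties: alpha scales the total penalty and l1_ratio sets the share given to L1. The following is a minimal plain-Python sketch of that objective; the function name elastic_net_objective and the toy data are hypothetical and only meant for illustration.

```python
def elastic_net_objective(X, y, weights, intercept, alpha, l1_ratio):
    """Evaluate the ElasticNet objective for a fixed set of coefficients.

    Hypothetical illustration: scikit-learn optimizes this quantity
    internally and does not expose a function with this name.
    """
    n_samples = len(y)
    squared_error = 0.0
    for row, target in zip(X, y):
        prediction = intercept + sum(w * x for w, x in zip(weights, row))
        squared_error += (target - prediction) ** 2

    data_term = squared_error / (2 * n_samples)
    l1_penalty = alpha * l1_ratio * sum(abs(w) for w in weights)
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * sum(w * w for w in weights)
    return data_term + l1_penalty + l2_penalty


# Toy data (made up): these weights fit the targets exactly,
# so the whole loss comes from the two penalty terms.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
loss = elastic_net_objective(X, y, weights=[1.0, 2.0], intercept=0.0,
                             alpha=0.5, l1_ratio=0.5)
print(loss)
```

With l1_ratio=0.5, as used in this guide, the L1 and L2 parts of the penalty are weighted equally.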

lr = ElasticNet(alpha=0.5, l1_ratio=0.5, random_state=42)

def compute_metrics(actual, predicted):
  # mean_squared_error returns the MSE; take the square root to get the RMSE
  rmse = mean_squared_error(actual, predicted) ** 0.5
  mae = mean_absolute_error(actual, predicted)
  r2 = r2_score(actual, predicted)

  return rmse, mae, r2
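To make the meaning of the three values concrete, here is a dependency-free sketch of the same metrics. The function name regression_metrics and the sample numbers are hypothetical; in the guide itself the scikit-learn functions do this work.

```python
import math


def regression_metrics(actual, predicted):
    """Compute RMSE, MAE and R^2 from two equal-length lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)                 # root of the mean squared error
    mae = sum(abs(e) for e in errors) / n  # mean absolute error

    # R^2 compares the residual error with the variance of the targets.
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    residual_ss = sum(e * e for e in errors)
    r2 = 1.0 - residual_ss / total_ss
    return rmse, mae, r2


# Made-up values: two perfect predictions and one that is off by 1.
rmse, mae, r2 = regression_metrics([0.0, 1.0, 2.0], [0.0, 1.0, 1.0])
print(rmse, mae, r2)
```

RMSE and MAE are in the units of the target, while R² is unitless and equals 1.0 for a perfect fit.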

Connect to the MLflow tracking server

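Before training, we need to configure the MLflow tracking server, because we will be logging data to MLflow. In this tutorial, we use the Databricks Free Trial as the MLflow tracking server. For other options, such as running a local MLflow server, see the Tracking Server Overview.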

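If you have not done so already, follow the setup guide to create your Databricks Free Trial account and access token. Signing up should take no more than 5 minutes. For this guide, we need the ML experiment dashboard to track our training progress.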

Once you have registered an account on the Databricks Free Trial, connect MLflow to your Databricks workspace. You will be asked for the following information:

Databricks Host: https://<your workspace host>.cloud.databricks.com
Token: your personal access token

mlflow.login()
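mlflow.login() prompts for these values interactively. In a non-interactive setting such as a scheduled job, one common alternative is to supply the same two values through the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, which MLflow can read when the tracking URI is "databricks". The sketch below assumes that approach; the helper name configure_databricks_env and the example values are hypothetical placeholders, not real credentials.

```python
import os


def configure_databricks_env(host: str, token: str) -> None:
    """Export Databricks credentials as environment variables.

    Hypothetical helper; it only sets variables and contacts no server.
    """
    if not host.startswith("https://"):
        raise ValueError("host should be a full https:// workspace URL")
    if not token:
        raise ValueError("token must not be empty")
    os.environ["DATABRICKS_HOST"] = host.rstrip("/")
    os.environ["DATABRICKS_TOKEN"] = token


# Placeholder values for illustration only; substitute your own.
configure_databricks_env(
    "https://example-workspace.cloud.databricks.com/",
    "dummy-token-for-illustration",
)
print(os.environ["DATABRICKS_HOST"])
```

In practice the token would come from a secret store rather than being written in the notebook.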

This notebook is now connected to the hosted tracking server. Next, let's configure some MLflow metadata. Two things need to be set:

mlflow.set_tracking_uri: always use "databricks".
mlflow.set_experiment: pick a name you like, starting with /Users/<your email>/.

mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/<your email>/mlflow-sklearn-quickstart")
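Because the experiment name has to start with /Users/<your email>/, it can be convenient to build the path from its parts instead of typing it by hand. This is only a sketch: the helper experiment_path and the example email address are hypothetical.

```python
def experiment_path(user_email: str, experiment_name: str) -> str:
    """Build a workspace experiment path under the user's home folder.

    Hypothetical helper for illustration; it is not part of MLflow.
    """
    if "@" not in user_email:
        raise ValueError("user_email should look like an email address")
    name = experiment_name.strip("/")
    if not name:
        raise ValueError("experiment_name must not be empty")
    return f"/Users/{user_email}/{name}"


path = experiment_path("someone@example.com", "mlflow-sklearn-quickstart")
print(path)
```

The resulting string is what you would pass to mlflow.set_experiment.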

Log with MLflow

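MLflow has a powerful tracking API that lets us log runs and models together with their associated metadata, such as parameters and metrics. Let's begin by starting a training run to train our model.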

# Start a training run
with mlflow.start_run() as training_run:
  # Log the parameters for our model
  mlflow.log_param("alpha", 0.5)
  mlflow.log_param("l1_ratio", 0.5)

  # Train and log our model, which inherits the parameters
  lr.fit(train_x, train_y)
  model_info = mlflow.sklearn.log_model(sk_model=lr, name="elasticnet", input_example=train_x)

  # Evaluate the model on the training dataset and log metrics
  # These metrics will be linked to both the model and run
  predictions = lr.predict(train_x)
  (rmse, mae, r2) = compute_metrics(train_y, predictions)
  mlflow.log_metrics(
      metrics={
          "rmse": rmse,
          "r2": r2,
          "mae": mae,
      },
      dataset=train_dataset,
  )

Now let's evaluate our model on the test dataset.

# Start an evaluation run
with mlflow.start_run() as evaluation_run:
  # Load our previous model
  logged_model = mlflow.sklearn.load_model(f"models:/{model_info.model_id}")

  # Evaluate the model on the test dataset and log metrics
  predictions = logged_model.predict(test_x)
  (rmse, mae, r2) = compute_metrics(test_y, predictions)
  mlflow.log_metrics(
      metrics={
          "rmse": rmse,
          "r2": r2,
          "mae": mae,
      },
      dataset=test_dataset,
      model_id=model_info.model_id,
  )

View the results

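Let's look at our training and test results. Log in to your Databricks workspace and click the "Experiments" tab in the left-hand menu. The initial page shows the list of runs, where we can see our training and evaluation runs.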

[Screenshot: runs page]

Now let's go to the Models tab, where we can see the model we logged.

[Screenshot: models page]

Clicking the model name takes you to the model details page, which shows its parameters, its metrics across both runs, and other metadata.

[Screenshot: model details page]

We can also inspect our model through the API.

logged_model = mlflow.get_logged_model(model_info.model_id)

logged_model, logged_model.metrics, logged_model.params

