mlflow.sklearn

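The mlflow.sklearn module provides an API for logging and loading scikit-learn models. This module exports scikit-learn models with the following flavors:

Python (native) pickle format: This is the main flavor that can be loaded back into scikit-learn.
mlflow.pyfunc: Produced for use by generic pyfunc-based deployment tools and batch inference. NOTE: The mlflow.pyfunc flavor is only added for scikit-learn models that define predict(), since predict() is required for pyfunc model inference.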

mlflow.sklearn.autolog(log_input_examples=False, log_model_signatures=True, log_models=True, log_datasets=True, disable=False, exclusive=False, disable_for_unsupported_versions=False, silent=False, max_tuning_runs=5, log_post_training_metrics=True, serialization_format='cloudpickle', registered_model_name=None, pos_label=None, extra_tags=None)[source]

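Note

Autologging is known to be compatible with the following package versions: 1.4.0 <= scikit-learn <= 1.8.0. Autologging may not succeed when used with package versions outside of this range.

Enables (or disables) and configures autologging for scikit-learn estimators.

When is autologging performed?

Autologging is performed when you call: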

estimator.fit()
estimator.fit_predict()
estimator.fit_transform()

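Logged information

Parameters

Parameters obtained by estimator.get_params(deep=True). Note that get_params is called with deep=True. This means that when you fit a meta estimator that chains a series of estimators, the parameters of these child estimators are also logged.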
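To make the effect of deep=True concrete, the following sketch compares shallow and deep parameter retrieval on a small Pipeline. It uses only scikit-learn; the step names "scaler" and "clf" are illustrative choices, not names that MLflow requires.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A meta estimator that chains two child estimators (step names are arbitrary).
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(C=0.5))])

shallow = pipe.get_params(deep=False)
deep = pipe.get_params(deep=True)

# Shallow parameters cover only the Pipeline's own constructor arguments.
print(sorted(shallow))

# Deep parameters also expose each child's settings as "<step>__<param>".
print(deep["clf__C"])
print(deep["scaler__with_mean"])
```

Because autologging reads the deep variant, keys such as clf__C are the kind of entries to expect among the logged parameters of a pipeline run.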

Training metrics

A training score obtained by estimator.score. Note that the training score is computed using the parameters given to fit().
Common metrics for classifiers:
- precision score
- recall score
- f1 score
- accuracy score
If the classifier has a predict_proba method, we additionally log:
- log loss
- roc auc score
Common metrics for regressors:
- mean squared error
- root mean squared error
- mean absolute error
- r2 score
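As a rough illustration of the regressor metrics listed above, the sketch below computes them by hand with scikit-learn on the same toy data used in the example later in this section. The dictionary keys follow the metric names that appear in that example's output. This is not MLflow's internal code, only an approximation of what the logged values represent.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy training data: y is an exact linear function of X.
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# Recompute the training metrics on the data passed to fit().
mse = mean_squared_error(y, pred)
metrics = {
    "training_score": model.score(X, y),
    "training_mean_squared_error": mse,
    "training_root_mean_squared_error": float(np.sqrt(mse)),
    "training_mean_absolute_error": mean_absolute_error(y, pred),
    "training_r2_score": r2_score(y, pred),
}
print(metrics)
```

For a regressor, estimator.score returns the R2 score, so training_score and training_r2_score coincide here.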

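Post training metrics

When users call metric APIs after model training, MLflow tries to capture the metric API results and log them as MLflow metrics to the Run associated with the model. The following types of scikit-learn metric APIs are supported:

model.score
metric APIs defined in the sklearn.metrics module

For post training metrics autologging, the metric key format is: "{metric_name}[-{call_index}]_{dataset_name}"

If the metric function is from sklearn.metrics, the MLflow "metric_name" is the metric function name. If the metric function is model.score, then "metric_name" is "{model_class_name}_score".
If multiple calls are made to the same scikit-learn metric API, each subsequent call adds a "call_index" (starting from 2) to the metric key.
MLflow uses the prediction input dataset variable name as the "dataset_name" in the metric key. The "prediction input dataset variable" refers to the variable that was used as the first argument of the associated model.predict or model.score call. Note: MLflow captures the "prediction input dataset" instance in the outermost call frame and fetches the variable name in the outermost call frame. If the "prediction input dataset" instance is an intermediate expression without a defined variable name, the dataset name is set to "unknown_dataset". If multiple "prediction input dataset" instances have the same variable name, then subsequent ones append an index (starting from 2) to the inspected dataset name.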
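The key format described above can be sketched as a small pure-Python function. The helper name post_training_metric_key is hypothetical and is not part of MLflow; it only mirrors the documented pattern "{metric_name}[-{call_index}]_{dataset_name}".

```python
def post_training_metric_key(metric_name, dataset_name, call_index=1):
    """Hypothetical helper that mirrors the documented metric key format."""
    # The call index is only added from the second call onward.
    suffix = f"-{call_index}" if call_index >= 2 else ""
    return f"{metric_name}{suffix}_{dataset_name}"


# First call to accuracy_score on predictions made from a variable named X_test.
print(post_training_metric_key("accuracy_score", "X_test"))  # accuracy_score_X_test

# Second call to the same metric API on the same dataset.
print(post_training_metric_key("accuracy_score", "X_test", call_index=2))  # accuracy_score-2_X_test

# model.score on an intermediate expression that has no variable name.
print(post_training_metric_key("LinearRegression_score", "unknown_dataset"))
```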

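Limitations

MLflow can only map the original prediction result object returned by a model prediction API (including predict / predict_proba / predict_log_proba / transform, but excluding fit_predict / fit_transform) to an MLflow run. MLflow cannot find run information for other objects derived from a given prediction result (e.g. by copying or selecting a subset of the prediction result). scikit-learn metric APIs invoked on derived objects do not log metrics to MLflow.
Autologging must be enabled before scikit-learn metric APIs are imported from sklearn.metrics. Metric APIs imported before autologging is enabled do not log metrics to MLflow runs.
If a user defines a scorer that is not based on metric APIs in sklearn.metrics, post training metric autologging for that scorer does not work.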

Tags

An estimator class name (e.g. "LinearRegression").
A fully qualified estimator class name (e.g. "sklearn.linear_model._base.LinearRegression").

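Artifacts

An MLflow Model with the mlflow.sklearn flavor containing a fitted estimator (logged by mlflow.sklearn.log_model()). The Model also contains the mlflow.pyfunc flavor when the scikit-learn estimator defines predict().
For post training metric API calls, a "metric_info.json" artifact is logged. This is a JSON object whose keys are MLflow post training metric names (see the "Post training metrics" section for the key format) and whose values are the corresponding metric call commands that produced the metrics, e.g. accuracy_score(y_true=test_iris_y, y_pred=pred_iris_y, normalize=False).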

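How does autologging work for meta estimators?

When a meta estimator (e.g. Pipeline, GridSearchCV) calls fit(), it internally calls fit() on its child estimators. Autologging does **not** perform logging on these constituent fit() calls.

Parameter search: In addition to recording the information discussed above, autologging for parameter search meta estimators (GridSearchCV and RandomizedSearchCV) records child runs with metrics for each set of explored parameters, as well as artifacts and parameters for the best model (if available).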

Supported estimators

All estimators obtained by sklearn.utils.all_estimators (including meta estimators).
Pipeline
Parameter search estimators (GridSearchCV and RandomizedSearchCV)

Example

See more examples

```python
from pprint import pprint
import numpy as np
from sklearn.linear_model import LinearRegression
import mlflow
from mlflow import MlflowClient


def fetch_logged_data(run_id):
    client = MlflowClient()
    data = client.get_run(run_id).data
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
    return data.params, data.metrics, tags, artifacts


# enable autologging
mlflow.sklearn.autolog()

# prepare training data
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3

# train a model
model = LinearRegression()
with mlflow.start_run() as run:
    model.fit(X, y)

# fetch logged data
params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)

pprint(params)
# {'copy_X': 'True',
#  'fit_intercept': 'True',
#  'n_jobs': 'None',
#  'normalize': 'False'}

pprint(metrics)
# {'training_score': 1.0,
#  'training_mean_absolute_error': 2.220446049250313e-16,
#  'training_mean_squared_error': 1.9721522630525295e-31,
#  'training_r2_score': 1.0,
#  'training_root_mean_squared_error': 4.440892098500626e-16}

pprint(tags)
# {'estimator_class': 'sklearn.linear_model._base.LinearRegression',
#  'estimator_name': 'LinearRegression'}

pprint(artifacts)
# ['model/MLmodel', 'model/conda.yaml', 'model/model.pkl']
```

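Parameters

log_input_examples – If True, input examples from training datasets are collected and logged along with scikit-learn model artifacts during training. If False, input examples are not logged. Note: Input examples are MLflow model attributes and are only collected if log_models is also True.
log_model_signatures – If True, ModelSignatures describing model inputs and outputs are collected and logged along with scikit-learn model artifacts during training. If False, signatures are not logged. Note: Model signatures are MLflow model attributes and are only collected if log_models is also True.
log_models – If True, trained models are logged as MLflow model artifacts. If False, trained models are not logged. Input examples and model signatures, which are attributes of MLflow models, are also omitted when log_models is False.
log_datasets – If True, train and validation dataset information is logged to MLflow Tracking if applicable. If False, dataset information is not logged.
disable – If True, disables the scikit-learn autologging integration. If False, enables the scikit-learn autologging integration.
exclusive – If True, autologged content is not logged to user-created fluent runs. If False, autologged content is logged to the active fluent run, which may be user-created.
disable_for_unsupported_versions – If True, disables autologging for versions of scikit-learn that have not been tested against this version of the MLflow client or are incompatible.
silent – If True, suppresses all event logs and warnings from MLflow during scikit-learn autologging. If False, shows all events and warnings during scikit-learn autologging.
max_tuning_runs – The maximum number of child MLflow runs created for hyperparameter search estimators. To create child runs for the best k results from the search, set max_tuning_runs to k. The default is to track the best 5 search parameter sets. If max_tuning_runs=None, then a child run is created for each search parameter set. Note: The best k results are based on the ordering in rank_test_score. In the case of multi-metric evaluation with a custom scorer, the first scorer's rank_test_score_<scorer_name> is used to select the best k results. To change the metric used for selecting the best k results, change the ordering of the dict passed as the scoring parameter for the estimator.
log_post_training_metrics – If True, post training metrics are logged. Defaults to True. See the Post training metrics section for more details.
serialization_format – The format in which to serialize the model. This should be one of the following: mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE or mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE.
registered_model_name – If given, each time a model is trained, it is registered as a new model version of the registered model with this name. The registered model is created if it does not already exist.
pos_label – If given, used as the positive label to compute binary classification training metrics such as precision, recall, f1, etc. This parameter should only be set for binary classification models. If used for multi-label models, the training metrics calculation fails and the training metrics are not logged. If used for regression models, the parameter is ignored.
extra_tags – A dictionary of extra tags to set on each managed run created by autologging.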
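The max_tuning_runs description above says the best k results follow the ordering in rank_test_score. The sketch below shows, using scikit-learn alone, which candidates such an ordering would pick from a small grid search. It is an illustration of the ranking idea, not MLflow's internal selection code; the variable k stands in for max_tuning_runs, and the SVC grid is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Six candidate parameter sets in total.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

k = 5  # plays the role of max_tuning_runs in this sketch
ranks = search.cv_results_["rank_test_score"]

# Indices of the k best-ranked candidates (rank 1 is the best).
top_k = np.argsort(ranks, kind="stable")[:k]
for i in top_k:
    print(ranks[i], search.cv_results_["params"][i])
```

With six candidates and k = 5, the lowest-ranked parameter set is the one left without a child run under this reading.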

mlflow.sklearn.get_default_conda_env(include_cloudpickle=False)[source]

Returns: The default Conda environment for MLflow Models produced by calls to save_model() and log_model().

mlflow.sklearn.get_default_pip_requirements(include_cloudpickle=False)[source]

Returns: A list of default pip requirements for MLflow Models produced by this flavor. Calls to save_model() and log_model() produce a pip environment that, at minimum, contains these requirements.

mlflow.sklearn.load_model(model_uri, dst_path=None)[source]

Load a scikit-learn model from a local file or a run.

Parameters

model_uri –
The location, in URI format, of the MLflow model, for example
- /Users/me/path/to/local/model
- relative/path/to/local/model
- s3://my_bucket/path/to/model
- runs:/<mlflow_run_id>/run-relative/path/to/model
- models:/<model_name>/<model_version>
- models:/<model_name>/<stage>
For more information about supported URI schemes, see Referencing Artifacts.
dst_path – The local filesystem path to which to download the model artifact. This directory must already exist. If unspecified, a local output path will be created.

Returns

A scikit-learn model.

Example

```python
import mlflow.sklearn

sk_model = mlflow.sklearn.load_model("runs:/96771d893a5e46159d9f3b49bf9013e2/sk_models")

# use Pandas DataFrame to make predictions
pandas_df = ...
predictions = sk_model.predict(pandas_df)
```

mlflow.sklearn.log_model(sk_model, artifact_path: str | None = None, conda_env=None, code_paths=None, serialization_format='cloudpickle', registered_model_name=None, signature: mlflow.models.signature.ModelSignature = None, input_example: Union[pandas.core.frame.DataFrame, numpy.ndarray, dict, list, csr_matrix, csc_matrix, str, bytes, tuple] = None, await_registration_for=300, pip_requirements=None, extra_pip_requirements=None, pyfunc_predict_fn='predict', metadata=None, params: dict[str, typing.Any] | None = None, tags: dict[str, typing.Any] | None = None, model_type: str | None = None, step: int = 0, model_id: str | None = None, name: str | None = None)[source]

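Log a scikit-learn model as an MLflow artifact for the current run. Produces an MLflow Model containing the following flavors:

mlflow.sklearn

mlflow.pyfunc. NOTE: This flavor is only included for scikit-learn models that define predict(), since predict() is required for pyfunc model inference.

Parameters

sk_model – scikit-learn model to be saved.
artifact_path – Deprecated. Use name instead.
conda_env –
Either a dictionary representation of a Conda environment or the path to a conda environment yaml file. If provided, this describes the environment this model should be run in. At a minimum, it should specify the dependencies contained in get_default_conda_env(). If None, a conda environment with pip requirements inferred by mlflow.models.infer_pip_requirements() is added to the model. If the requirement inference fails, it falls back to using get_default_pip_requirements. pip requirements from conda_env are written to a pip requirements.txt file and the full conda environment is written to conda.yaml. The following is an *example* dictionary representation of a conda environment: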
```
{
    "name": "mlflow-env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.8.15",
        {
            "pip": [
                "scikit-learn==x.y.z"
            ],
        },
    ],
}
```
code_paths –
A list of local filesystem paths to Python file dependencies (or directories containing file dependencies). These files are prepended to the system path when the model is loaded. Files declared as dependencies for a given model should have relative imports declared from a common root path if multiple files are defined with import dependencies between them to avoid import errors when loading the model.

For a detailed explanation of code_paths functionality, recommended usage patterns and limitations, see the code_paths usage guide.
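serialization_format – The format in which to serialize the model. This should be one of the formats listed in mlflow.sklearn.SUPPORTED_SERIALIZATION_FORMATS. The Cloudpickle format, mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE, provides better cross-system compatibility by identifying and packaging code dependencies with the serialized model.
registered_model_name – If given, create a model version under registered_model_name, also creating a registered model if one with the given name does not exist.
signature –
An instance of the ModelSignature class that describes the model's inputs and outputs. If not specified but an input_example is supplied, a signature is automatically inferred based on the supplied input example and model. To disable automatic signature inference when providing an input example, set signature to False. To manually infer a model signature, call infer_signature() on a dataset with valid model inputs and a dataset with valid model outputs (such as model predictions made on the training dataset), for example: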
```
from mlflow.models import infer_signature

train = df.drop(columns=["target_label"])  # df is an existing pandas DataFrame
predictions = ...  # compute model predictions
signature = infer_signature(train, predictions)
```
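input_example – One or several instances of valid model input. The input example is used as a hint of what data to feed the model. It is converted to a Pandas DataFrame and then serialized to json using the Pandas split-oriented format, or to a numpy array where the example is serialized to json by converting it to a list. Bytes are base64-encoded. When the signature parameter is None, the input example is used to infer a model signature.
await_registration_for – Number of seconds to wait for the model version to finish being created and reach the READY status. By default, the function waits for five minutes. Specify 0 or None to skip waiting.
pip_requirements – Either an iterable of pip requirement strings (e.g. ["scikit-learn", "-r requirements.txt", "-c constraints.txt"]) or the string path to a pip requirements file on the local filesystem (e.g. "requirements.txt"). If provided, this describes the environment this model should be run in. If None, a default list of requirements is inferred by mlflow.models.infer_pip_requirements() from the current software environment. If the requirement inference fails, it falls back to using get_default_pip_requirements. Both requirements and constraints are automatically parsed and written to the requirements.txt and constraints.txt files, respectively, and stored as part of the model. Requirements are also written to the pip section of the model's conda environment (conda.yaml) file.
extra_pip_requirements –
Either an iterable of pip requirement strings (e.g. ["pandas", "-r requirements.txt", "-c constraints.txt"]) or the string path to a pip requirements file on the local filesystem (e.g. "requirements.txt"). If provided, this describes additional pip requirements that are appended to a default set of pip requirements generated automatically based on the user's current software environment. Both requirements and constraints are automatically parsed and written to the requirements.txt and constraints.txt files, respectively, and stored as part of the model. Requirements are also written to the pip section of the model's conda environment (conda.yaml) file.
Warning

The following arguments can't be specified at the same time:
- conda_env
- pip_requirements
- extra_pip_requirements
This example demonstrates how to specify pip requirements using pip_requirements and extra_pip_requirements.
pyfunc_predict_fn – The name of the prediction function to use for inference with the pyfunc representation of the resulting MLflow Model. Current supported functions are: "predict", "predict_proba", "predict_log_proba", "predict_joint_log_proba", and "score".
metadata – Custom metadata dictionary passed to the model and stored in the MLmodel file.
params – A dictionary of parameters to log with the model.
tags – A dictionary of tags to log with the model.
model_type – The type of the model.
step – The step at which to log the model outputs and metrics.
model_id – The ID of the model.
name – Model name.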
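The input_example description above mentions the Pandas split-oriented JSON format and base64 encoding for bytes. The sketch below shows what those two encodings look like using pandas and the standard library only. The column names are made up for illustration, and the exact file that MLflow writes may contain additional or differently arranged fields.

```python
import base64
import json

import pandas as pd

# A hypothetical two-row input example.
input_example = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

# Split orientation stores column names, the index, and row data separately.
payload = json.loads(input_example.to_json(orient="split"))
print(payload["columns"])  # ['sepal_length', 'sepal_width']
print(payload["data"])     # [[5.1, 3.5], [4.9, 3.0]]

# Bytes values are represented as base64 text.
encoded = base64.b64encode(b"abc").decode("ascii")
print(encoded)  # YWJj
```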

Returns

A ModelInfo instance that contains the metadata of the logged model.

Example

```python
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.datasets import load_iris
from sklearn import tree

with mlflow.start_run():
    # load dataset and train model
    iris = load_iris()
    sk_model = tree.DecisionTreeClassifier()
    sk_model = sk_model.fit(iris.data, iris.target)

    # log model params
    mlflow.log_param("criterion", sk_model.criterion)
    mlflow.log_param("splitter", sk_model.splitter)
    signature = infer_signature(iris.data, sk_model.predict(iris.data))

    # log model
    mlflow.sklearn.log_model(sk_model, name="sk_models", signature=signature)
```

mlflow.sklearn.save_model(sk_model, path, conda_env=None, code_paths=None, mlflow_model=None, serialization_format='cloudpickle', signature: mlflow.models.signature.ModelSignature = None, input_example: Union[pandas.core.frame.DataFrame, numpy.ndarray, dict, list, csr_matrix, csc_matrix, str, bytes, tuple] = None, pip_requirements=None, extra_pip_requirements=None, pyfunc_predict_fn='predict', metadata=None)[source]

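Save a scikit-learn model to a path on the local file system. Produces an MLflow Model containing the following flavors:

mlflow.sklearn

mlflow.pyfunc. NOTE: This flavor is only included for scikit-learn models that define predict(), since predict() is required for pyfunc model inference.

Parameters

sk_model – scikit-learn model to be saved.
path – Local path where the model is to be saved.
conda_env –
Either a dictionary representation of a Conda environment or the path to a conda environment yaml file. If provided, this describes the environment this model should be run in. At a minimum, it should specify the dependencies contained in get_default_conda_env(). If None, a conda environment with pip requirements inferred by mlflow.models.infer_pip_requirements() is added to the model. If the requirement inference fails, it falls back to using get_default_pip_requirements. pip requirements from conda_env are written to a pip requirements.txt file and the full conda environment is written to conda.yaml. The following is an *example* dictionary representation of a conda environment: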
```
{
    "name": "mlflow-env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.8.15",
        {
            "pip": [
                "scikit-learn==x.y.z"
            ],
        },
    ],
}
```
code_paths –
A list of local filesystem paths to Python file dependencies (or directories containing file dependencies). These files are prepended to the system path when the model is loaded. Files declared as dependencies for a given model should have relative imports declared from a common root path if multiple files are defined with import dependencies between them to avoid import errors when loading the model.

For a detailed explanation of code_paths functionality, recommended usage patterns and limitations, see the code_paths usage guide.
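mlflow_model – The mlflow.models.Model that this flavor is being added to.
serialization_format – The format in which to serialize the model. This should be one of the formats listed in mlflow.sklearn.SUPPORTED_SERIALIZATION_FORMATS. The Cloudpickle format, mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE, provides better cross-system compatibility by identifying and packaging code dependencies with the serialized model.
signature –
An instance of the ModelSignature class that describes the model's inputs and outputs. If not specified but an input_example is supplied, a signature is automatically inferred based on the supplied input example and model. To disable automatic signature inference when providing an input example, set signature to False. To manually infer a model signature, call infer_signature() on a dataset with valid model inputs and a dataset with valid model outputs (such as model predictions made on the training dataset), for example: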
```
from mlflow.models import infer_signature

train = df.drop(columns=["target_label"])  # df is an existing pandas DataFrame
predictions = ...  # compute model predictions
signature = infer_signature(train, predictions)
```
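input_example – One or several instances of valid model input. The input example is used as a hint of what data to feed the model. It is converted to a Pandas DataFrame and then serialized to json using the Pandas split-oriented format, or to a numpy array where the example is serialized to json by converting it to a list. Bytes are base64-encoded. When the signature parameter is None, the input example is used to infer a model signature.
pip_requirements – Either an iterable of pip requirement strings (e.g. ["scikit-learn", "-r requirements.txt", "-c constraints.txt"]) or the string path to a pip requirements file on the local filesystem (e.g. "requirements.txt"). If provided, this describes the environment this model should be run in. If None, a default list of requirements is inferred by mlflow.models.infer_pip_requirements() from the current software environment. If the requirement inference fails, it falls back to using get_default_pip_requirements. Both requirements and constraints are automatically parsed and written to the requirements.txt and constraints.txt files, respectively, and stored as part of the model. Requirements are also written to the pip section of the model's conda environment (conda.yaml) file.
extra_pip_requirements –
Either an iterable of pip requirement strings (e.g. ["pandas", "-r requirements.txt", "-c constraints.txt"]) or the string path to a pip requirements file on the local filesystem (e.g. "requirements.txt"). If provided, this describes additional pip requirements that are appended to a default set of pip requirements generated automatically based on the user's current software environment. Both requirements and constraints are automatically parsed and written to the requirements.txt and constraints.txt files, respectively, and stored as part of the model. Requirements are also written to the pip section of the model's conda environment (conda.yaml) file.
Warning

The following arguments can't be specified at the same time:
- conda_env
- pip_requirements
- extra_pip_requirements
This example demonstrates how to specify pip requirements using pip_requirements and extra_pip_requirements.
pyfunc_predict_fn – The name of the prediction function to use for inference with the pyfunc representation of the resulting MLflow Model. Current supported functions are: "predict", "predict_proba", "predict_log_proba", "predict_joint_log_proba", and "score".
metadata – Custom metadata dictionary passed to the model and stored in the MLmodel file.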

Example

```python
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
sk_model = tree.DecisionTreeClassifier()
sk_model = sk_model.fit(iris.data, iris.target)

# Save the model in cloudpickle format
# set path to location for persistence
sk_path_dir_1 = ...
mlflow.sklearn.save_model(
    sk_model,
    sk_path_dir_1,
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE,
)

# save the model in pickle format
# set path to location for persistence
sk_path_dir_2 = ...
mlflow.sklearn.save_model(
    sk_model,
    sk_path_dir_2,
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE,
)
```