MLflow Signature Playground Notebook

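Welcome to the MLflow Signature Playground! This interactive Jupyter notebook walks you through the core concepts of model signatures in the MLflow ecosystem. Working through it, you will get hands-on experience defining, enforcing, and using model signatures, a key part of model management that improves reproducibility, reliability, and ease of use.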

Why Model Signatures Matter

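In machine learning, precisely defining a model's inputs and outputs is key to smooth operation. A model signature is the schema definition for the data a model expects and produces, and it serves as a blueprint for both model developers and model users. It makes expectations explicit and enables automatic validation checks, which streamlines the path from model training to deployment.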

Signature Enforcement in Practice

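As you work through the code cells in this notebook, you will see firsthand how model signatures enforce data integrity, prevent common errors, and return descriptive feedback when the data does not match. This is especially valuable for keeping model inputs consistent and well-formed when serving models in production.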

Hands-On Examples for Deeper Understanding

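The notebook contains a series of examples covering different data types and structures, from simple scalars to complex nested dictionaries. These examples show how signatures are inferred, logged, and updated, giving you a complete view of the signature lifecycle. As you interact with the provided PythonModel instances and call their predict methods, you will learn how to handle a range of input scenarios, including required and optional fields, and how to update an existing model to include a detailed signature. Whether you are a data scientist refining your model management practices or a developer integrating MLflow into your workflow, this notebook is your sandbox for mastering model signatures. Let's dive in and explore what MLflow signatures can do!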

Note: Some of the features shown in this notebook are only available in MLflow 2.10.0 and later. In particular, the Array and Object types are not supported in versions before 2.10.0.
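A quick way to check the version requirement before running the Array and Object examples is sketched below. The helper `supports_array_and_object` is hypothetical (it is not part of MLflow) and compares only the major and minor components of the version string; the installed version is read with the standard library's `importlib.metadata`.

```python
from importlib import metadata


def supports_array_and_object(version):
    """Return True if an MLflow version string is 2.10 or newer (major.minor only)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 10)


try:
    installed = metadata.version("mlflow")
    print(installed, supports_array_and_object(installed))
except metadata.PackageNotFoundError:
    print("mlflow is not installed in this environment")
```

If the check returns False, the examples that print Array(...) or nested object types will likely behave differently from the outputs shown in this notebook.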

import numpy as np
import pandas as pd

import mlflow
from mlflow.models.signature import infer_signature, set_signature


def report_signature_info(input_data, output_data=None, params=None):
  """Infer a signature from the given data and print it with the input and its type."""
  inferred_signature = infer_signature(input_data, output_data, params)

  report = f"""
The input data: 
	{input_data}.
The data is of type: {type(input_data)}.
The inferred signature is:

{inferred_signature}
"""
  print(report)

Scalar Support in MLflow Signatures

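In this part of the tutorial, we look at the role scalar data types play in MLflow model signatures. Scalar types such as string, integer, float, double, boolean, and datetime are the foundation of a model's input and output schema. Representing these types accurately is essential for the model to process data correctly, which directly affects the reliability and accuracy of its predictions.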

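By examining examples of the various scalar types, this section shows how MLflow infers and records the structure and nature of data. We will see how MLflow signatures accommodate different scalar types so that the data fed to the model matches the expected format. This understanding helps any machine learning practitioner prepare and validate data inputs, leading to smoother model operation and more reliable results.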

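Through practical examples, including lists of strings, floats, and other types, we illustrate how MLflow's infer_signature function infers data formats. This capability underpins how MLflow handles varied data inputs and is the basis for the more complex data structures covered later. By the end of this section, you will have a clear picture of how scalar data is represented in MLflow signatures and why that matters for your ML projects.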
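To make the scalar mapping concrete, here is a minimal pure-Python sketch of how a plain Python value might be mapped to the type names that appear in signatures. The function `scalar_type_name` is hypothetical and is not MLflow's implementation; the real logic lives inside infer_signature and also covers NumPy and pandas types, which this sketch ignores.

```python
from datetime import datetime


def scalar_type_name(value):
    """Return a signature-style type label for a plain Python scalar (toy helper)."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"  # assumption of this sketch: plain ints map to long
    if isinstance(value, float):
        return "double"  # assumption of this sketch: plain floats map to double
    if isinstance(value, str):
        return "string"
    if isinstance(value, datetime):
        return "datetime"
    raise TypeError(f"Unsupported scalar type: {type(value).__name__}")


for sample in ["a", 1, 0.5, True, datetime(2023, 12, 24, 11, 59, 59)]:
    print(f"{sample!r} -> {scalar_type_name(sample)}")
```

The order of the checks matters: testing int before bool would label True and False as long.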

# List of strings

report_signature_info(["a", "list", "of", "strings"])

The input data: 
['a', 'list', 'of', 'strings'].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
[string (required)]
outputs: 
None
params: 
None

# List of floats

report_signature_info([np.float32(0.117), np.float32(1.99)])

The input data: 
[0.117, 1.99].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
[float (required)]
outputs: 
None
params: 
None

# Adding a column header to a list of doubles
my_data = pd.DataFrame({"input_data": [np.float64(0.117), np.float64(1.99)]})
report_signature_info(my_data)

The input data: 
   input_data
0       0.117
1       1.990.
The data is of type: <class 'pandas.core.frame.DataFrame'>.
The inferred signature is:

inputs: 
['input_data': double (required)]
outputs: 
None
params: 
None

# List of Dictionaries
report_signature_info([{"a": "a1", "b": "b1"}, {"a": "a2", "b": "b2"}])

The input data: 
[{'a': 'a1', 'b': 'b1'}, {'a': 'a2', 'b': 'b2'}].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
['a': string (required), 'b': string (required)]
outputs: 
None
params: 
None

# List of Arrays of strings
report_signature_info([["a", "b", "c"], ["d", "e", "f"]])

The input data: 
[['a', 'b', 'c'], ['d', 'e', 'f']].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
[Array(string) (required)]
outputs: 
None
params: 
None

# List of Arrays of Dictionaries
report_signature_info(
  [[{"a": "a", "b": "b"}, {"a": "a", "b": "b"}], [{"a": "a", "b": "b"}, {"a": "a", "b": "b"}]]
)

The input data: 
[[{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 'b'}], [{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 'b'}]].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
[Array({a: string (required), b: string (required)}) (required)]
outputs: 
None
params: 
None

Understanding Type Conversion: Int to Long

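In this part of the tutorial, we observe an interesting aspect of type conversion in MLflow's schema inference. When reporting the signature for a list of integers, you may notice that the inferred data type is long rather than int. This conversion from int to long is not an error or a bug; it is a valid and intentional type conversion in MLflow's schema inference mechanism.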

Why Integers Are Inferred as Long

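Broader compatibility: Converting to long ensures compatibility across platforms and systems. Because the size of an integer (int) can vary with the system architecture, using long, which has a more consistent size specification, avoids potential discrepancies and data overflow issues.
Data integrity: By inferring integers as long, MLflow ensures that larger integer values, which may exceed the typical capacity of an int, are represented and processed accurately without data loss or overflow.
Consistency in machine learning models: In many machine learning frameworks, especially those that work with larger datasets or computations, long integers are often the standard data type for numerical operations. This standardization in the inferred schema aligns with common practice in the machine learning community.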
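The overflow concern behind the first two points can be seen with fixed-width integers from Python's standard library. This is only an illustration of 32-bit versus 64-bit storage, on the assumption that long corresponds to a 64-bit integer; it does not call MLflow.

```python
import ctypes

value = 2**31  # one past the largest signed 32-bit integer (2**31 - 1)

# ctypes performs no overflow checking, so the 32-bit value silently wraps around.
as_int32 = ctypes.c_int32(value).value
# A 64-bit integer has room for the same value.
as_int64 = ctypes.c_int64(value).value

print(as_int32)  # -2147483648
print(as_int64)  # 2147483648
```

A schema that declared the column as a 32-bit integer would corrupt this value, while a 64-bit type keeps it intact.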

# List of integers
report_signature_info([1, 2, 3])

The input data: 
[1, 2, 3].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
[long (required)]
outputs: 
None
params: 
None

/Users/benjamin.wilson/repos/mlflow-fork/mlflow/mlflow/types/utils.py:378: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values <https://www.mlflow.org/docs/latest/models.html#handling-integers-with-missing-values>`_ for more details.
warnings.warn(

# List of Booleans
report_signature_info([True, False, False, False, True])

The input data: 
[True, False, False, False, True].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
[boolean (required)]
outputs: 
None
params: 
None

# List of Datetimes
report_signature_info([np.datetime64("2023-12-24 11:59:59"), np.datetime64("2023-12-25 00:00:00")])

The input data: 
[numpy.datetime64('2023-12-24T11:59:59'), numpy.datetime64('2023-12-25T00:00:00')].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
[datetime (required)]
outputs: 
None
params: 
None

# Complex list of Dictionaries
report_signature_info([{"a": "b", "b": [1, 2, 3], "c": {"d": [4, 5, 6]}}])

The input data: 
[{'a': 'b', 'b': [1, 2, 3], 'c': {'d': [4, 5, 6]}}].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
['a': string (required), 'b': Array(long) (required), 'c': {d: Array(long) (required)} (required)]
outputs: 
None
params: 
None
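The nested output above can be read as a recursive description: lists become Array(...), dictionaries become objects with named fields, and leaves are scalars. The sketch below mimics that recursion in plain Python. The function `describe` is hypothetical, it assumes non-empty lists whose elements all share one type, and it omits the required/optional markers, so it is not a substitute for infer_signature.

```python
def describe(value):
    """Build a signature-style description of a nested Python value (toy helper)."""
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # Assumption: the list is non-empty and homogeneous, so the first
        # element is representative of the rest.
        return f"Array({describe(value[0])})"
    if isinstance(value, dict):
        fields = ", ".join(f"{key}: {describe(item)}" for key, item in value.items())
        return "{" + fields + "}"
    raise TypeError(f"Unsupported type: {type(value).__name__}")


record = {"a": "b", "b": [1, 2, 3], "c": {"d": [4, 5, 6]}}
print(describe(record))
# {a: string, b: Array(long), c: {d: Array(long)}}
```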

# Pandas DF input

data = [
  {"a": "a", "b": ["a", "b", "c"], "c": {"d": 1, "e": 0.1}, "f": [{"g": "g"}, {"h": 1}]},
  {"b": ["a", "b"], "c": {"d": 2, "f": "f"}, "f": [{"g": "g"}]},
]
data = pd.DataFrame(data)

report_signature_info(data)

The input data: 
     a          b                   c                       f
0    a  [a, b, c]  {'d': 1, 'e': 0.1}  [{'g': 'g'}, {'h': 1}]
1  NaN     [a, b]  {'d': 2, 'f': 'f'}            [{'g': 'g'}].
The data is of type: <class 'pandas.core.frame.DataFrame'>.
The inferred signature is:

inputs: 
['a': string (optional), 'b': Array(string) (required), 'c': {d: long (required), e: double (optional), f: string (optional)} (required), 'f': Array({g: string (optional), h: long (optional)}) (required)]
outputs: 
None
params: 
None

Signature Enforcement

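In this part of the tutorial, we focus on signature enforcement in practice. Signature enforcement is a powerful feature that ensures the data provided to a model conforms to the defined input schema. This step is essential for preventing errors and inconsistencies caused by mismatched or malformed data.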

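Through hands-on examples, we will observe how MLflow enforces conformance of data to the expected signature at runtime. We will use the MyModel class, a simple Python model, to demonstrate how MLflow checks input data for compatibility with the model's signature. This process protects the model from incompatible or erroneous inputs and makes its predictions more robust and reliable.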

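This section also highlights the importance of precise data representation in MLflow and its effect on model performance. By testing with different kinds of data, including data that does not conform to the expected schema, we will see how MLflow validates data and returns informative feedback. This aspect of signature enforcement is invaluable for debugging data issues and refining model inputs, making it a key skill for anyone involved in deploying machine learning models.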

class MyModel(mlflow.pyfunc.PythonModel):
  def predict(self, context, model_input, params=None):
      return model_input

data = [{"a": ["a", "b", "c"], "b": "b", "c": {"d": "d"}}, {"a": ["a"], "c": {"d": "d", "e": "e"}}]

report_signature_info(data)

The input data: 
[{'a': ['a', 'b', 'c'], 'b': 'b', 'c': {'d': 'd'}}, {'a': ['a'], 'c': {'d': 'd', 'e': 'e'}}].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]
outputs: 
None
params: 
None
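The required and optional markers in the signature above follow from which keys appear in every record. A rough pure-Python sketch of that rule for top-level keys is shown here; `classify_fields` is a hypothetical helper, not an MLflow function, and it does not descend into nested dictionaries.

```python
def classify_fields(records):
    """Label each top-level key as required (in every record) or optional."""
    all_keys = []
    for record in records:
        for key in record:
            if key not in all_keys:
                all_keys.append(key)  # keep first-seen order
    return {
        key: "required" if all(key in record for record in records) else "optional"
        for key in all_keys
    }


sample_records = [
    {"a": ["a", "b", "c"], "b": "b", "c": {"d": "d"}},
    {"a": ["a"], "c": {"d": "d", "e": "e"}},
]
print(classify_fields(sample_records))
# {'a': 'required', 'b': 'optional', 'c': 'required'}
```

Because "b" is missing from the second record it is marked optional, which matches the signature printed above.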

# Generate a prediction that will serve as the model output example for signature inference
model_output = MyModel().predict(context=None, model_input=data)

with mlflow.start_run():
  model_info = mlflow.pyfunc.log_model(
      python_model=MyModel(),
      name="test_model",
      signature=infer_signature(model_input=data, model_output=model_output),
  )

loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
prediction = loaded_model.predict(data)

prediction

	a	b	c
0	[a, b, c]	b	{'d': 'd'}
1	[a]	NaN	{'d': 'd', 'e': 'e'}

We can inspect the inferred signature directly from the logged model information returned by the call to log_model().

model_info.signature

inputs: 
['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]
outputs: 
['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]
params: 
None

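We can also quickly verify that the logged input signature matches what signature inference produces. While we are at it, we can generate the output signature too.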

Note: It is recommended to log both the input and output signatures with your model.

report_signature_info(data, prediction)

The input data: 
[{'a': ['a', 'b', 'c'], 'b': 'b', 'c': {'d': 'd'}}, {'a': ['a'], 'c': {'d': 'd', 'e': 'e'}}].
The data is of type: <class 'list'>.
The inferred signature is:

inputs: 
['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]
outputs: 
['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]
params: 
None

# Using the model without providing an optional input (note the structure of the returned output and the absence of the optional columns)

loaded_model.predict([{"a": ["a", "b", "c"], "c": {"d": "d"}}])

	a	c
0	[a, b, c]	{'d': 'd'}

# Using the model while omitting the input of required fields (this will raise an Exception from schema enforcement,
# stating that the required fields "a" and "c" are missing)

loaded_model.predict([{"b": "b"}])

---------------------------------------------------------------------------

MlflowException                           Traceback (most recent call last)

~/repos/mlflow-fork/mlflow/mlflow/pyfunc/__init__.py in predict(self, data, params)
  469             try:
--> 470                 data = _enforce_schema(data, input_schema)
  471             except Exception as e:

~/repos/mlflow-fork/mlflow/mlflow/models/utils.py in _enforce_schema(pf_input, input_schema)
  939                 message += f" Note that there were extra inputs: {extra_cols}"
--> 940             raise MlflowException(message)
  941     elif not input_schema.is_tensor_spec():

MlflowException: Model is missing inputs ['a', 'c'].

During handling of the above exception, another exception occurred:

MlflowException                           Traceback (most recent call last)

/var/folders/cd/n8n0rm2x53l_s0xv_j_xklb00000gp/T/ipykernel_97464/1628231496.py in <cell line: 4>()
    2 # stating that the required fields "a" and "c" are missing)
    3 
----> 4 loaded_model.predict([{"b": "b"}])

~/repos/mlflow-fork/mlflow/mlflow/pyfunc/__init__.py in predict(self, data, params)
  471             except Exception as e:
  472                 # Include error in message for backwards compatibility
--> 473                 raise MlflowException.invalid_parameter_value(
  474                     f"Failed to enforce schema of data '{data}' "
  475                     f"with schema '{input_schema}'. "

MlflowException: Failed to enforce schema of data '[{'b': 'b'}]' with schema '['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]'. Error: Model is missing inputs ['a', 'c'].
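The required-field check behind the error above can be approximated in a few lines of plain Python. This is a toy stand-in under stated assumptions: `enforce_required` and the `schema` dictionary are hypothetical, only top-level keys are checked, and MLflow's real enforcement also validates types and nested fields.

```python
def enforce_required(records, schema):
    """Raise if any record lacks a required field (toy schema enforcement)."""
    required = [name for name, kind in schema.items() if kind == "required"]
    missing = [name for name in required if any(name not in rec for rec in records)]
    if missing:
        raise ValueError(f"Model is missing inputs {missing}.")
    return records


schema = {"a": "required", "b": "optional", "c": "required"}

# Omitting the optional field "b" is accepted.
enforce_required([{"a": ["a", "b", "c"], "c": {"d": "d"}}], schema)

# Omitting the required fields "a" and "c" is rejected.
try:
    enforce_required([{"b": "b"}], schema)
except ValueError as err:
    print(err)  # Model is missing inputs ['a', 'c'].
```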

Updating Signatures

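This part of the tutorial addresses the dynamic nature of data and models, focusing on the task of updating an MLflow model's signature. As datasets evolve and requirements change, a model's signature may need to be modified to match new data structures or inputs. The ability to update signatures is key to keeping a model accurate and relevant over time.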

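We will show how to tell when a signature update is needed and walk through creating a new signature and applying it to an existing model. This section highlights MLflow's flexibility in adapting to changes in data format and structure without re-saving the entire model. For models registered in MLflow, however, updating the signature requires re-registering the model so that the change is reflected in the registered version.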

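By exploring the steps for updating a model signature, you will learn how to update a signature when an invalid one was defined manually, or when no signature was defined at logging time and the model needs to be updated with a valid one.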

# Updating an existing model that wasn't saved with a signature


class MyTypeCheckerModel(mlflow.pyfunc.PythonModel):
  def predict(self, context, model_input, params=None):
      print(type(model_input))
      print(model_input)
      if not isinstance(model_input, (pd.DataFrame, list)):
          raise ValueError("The input must be a pandas DataFrame or a list.")
      return "Input is valid."


with mlflow.start_run():
  model_info = mlflow.pyfunc.log_model(
      python_model=MyTypeCheckerModel(),
      name="test_model",
  )

loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

loaded_model.metadata.signature

test_data = [{"a": "we are expecting strings", "b": "and only strings"}, [1, 2, 3]]
loaded_model.predict(test_data)

<class 'list'>
[{'a': 'we are expecting strings', 'b': 'and only strings'}, [1, 2, 3]]

'Input is valid.'

The Necessity of Schema Enforcement in MLflow

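In this part of the tutorial, we address a common challenge in machine learning model deployment: the clarity and interpretability of error messages. Without schema enforcement, models often return cryptic or misleading error messages. This happens because, with no well-defined schema, the model tries to process inputs that may not match its expectations, which leads to ambiguous or hard-to-diagnose errors.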

Why Schema Enforcement Matters

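Schema enforcement acts as a gatekeeper, ensuring that the data fed into a model matches the expected format exactly. This reduces the likelihood of runtime errors and makes any errors that do occur easier to understand and correct. Without such enforcement, diagnosing problems becomes a time-consuming and complex task that often requires digging into the model's internal logic.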
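The difference between a cryptic failure and a descriptive one can be shown without MLflow at all. In the sketch below, both function names are hypothetical: `column_names` processes rows with no validation, while `validated_column_names` checks each row first and reports exactly which row is wrong.

```python
def column_names(rows):
    """Collect keys from each row with no validation (fails cryptically on bad rows)."""
    return [list(row.keys()) for row in rows]


def validated_column_names(rows):
    """Same result, but check each row first and report exactly what is wrong."""
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            raise TypeError(
                f"Row {index} must be a dict of field names to values, "
                f"got {type(row).__name__}: {row!r}"
            )
    return column_names(rows)


bad_rows = [{"a": "text", "b": "more text"}, [1, 2, 3]]

try:
    column_names(bad_rows)
except AttributeError as err:
    print(f"Unvalidated: {err}")

try:
    validated_column_names(bad_rows)
except TypeError as err:
    print(f"Validated:   {err}")
```

The unvalidated call fails deep inside the processing logic, while the validated call names the offending row and the type it received.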

Updating the Model Signature for Clearer Error Messages

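To illustrate the value of schema enforcement, we will update the saved model's signature to match the expected data structure. The process consists of defining the expected data structure, generating a suitable signature with the infer_signature function, and applying that signature to the model with set_signature. This way, any future errors will be more informative and aligned with the data structure we expect, which simplifies troubleshooting and improves model reliability.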

expected_data_structure = [{"a": "string", "b": "another string"}, {"a": "string"}]

signature = infer_signature(expected_data_structure, loaded_model.predict(expected_data_structure))

set_signature(model_info.model_uri, signature)

<class 'list'>
[{'a': 'string', 'b': 'another string'}, {'a': 'string'}]

loaded_with_signature = mlflow.pyfunc.load_model(model_info.model_uri)

loaded_with_signature.metadata.signature

inputs: 
['a': string (required), 'b': string (optional)]
outputs: 
[string (required)]
params: 
None

loaded_with_signature.predict(expected_data_structure)

<class 'pandas.core.frame.DataFrame'>
      a               b
0  string  another string
1  string             NaN

'Input is valid.'

Validating That Schema Enforcement Will Not Permit a Flawed Input

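Now that we have set the signature correctly and updated the model definition, let's make sure that the earlier flawed input type raises a useful error message!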

loaded_with_signature.predict(test_data)

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

~/repos/mlflow-fork/mlflow/mlflow/pyfunc/__init__.py in predict(self, data, params)
  469             try:
--> 470                 data = _enforce_schema(data, input_schema)
  471             except Exception as e:

~/repos/mlflow-fork/mlflow/mlflow/models/utils.py in _enforce_schema(pf_input, input_schema)
  907         elif isinstance(pf_input, (list, np.ndarray, pd.Series)):
--> 908             pf_input = pd.DataFrame(pf_input)
  909

~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
  781                         columns = ensure_index(columns)
--> 782                     arrays, columns, index = nested_data_to_arrays(
  783                         # error: Argument 3 to "nested_data_to_arrays" has incompatible

~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/internals/construction.py in nested_data_to_arrays(data, columns, index, dtype)
  497 
--> 498     arrays, columns = to_arrays(data, columns, dtype=dtype)
  499     columns = ensure_index(columns)

~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/internals/construction.py in to_arrays(data, columns, dtype)
  831     elif isinstance(data[0], abc.Mapping):
--> 832         arr, columns = _list_of_dict_to_arrays(data, columns)
  833     elif isinstance(data[0], ABCSeries):

~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/internals/construction.py in _list_of_dict_to_arrays(data, columns)
  911         sort = not any(isinstance(d, dict) for d in data)
--> 912         pre_cols = lib.fast_unique_multiple_list_gen(gen, sort=sort)
  913         columns = ensure_index(pre_cols)

~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.fast_unique_multiple_list_gen()

~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/internals/construction.py in <genexpr>(.0)
  909     if columns is None:
--> 910         gen = (list(x.keys()) for x in data)
  911         sort = not any(isinstance(d, dict) for d in data)

AttributeError: 'list' object has no attribute 'keys'

During handling of the above exception, another exception occurred:

MlflowException                           Traceback (most recent call last)

/var/folders/cd/n8n0rm2x53l_s0xv_j_xklb00000gp/T/ipykernel_97464/2586525788.py in <cell line: 1>()
----> 1 loaded_with_signature.predict(test_data)

~/repos/mlflow-fork/mlflow/mlflow/pyfunc/__init__.py in predict(self, data, params)
  471             except Exception as e:
  472                 # Include error in message for backwards compatibility
--> 473                 raise MlflowException.invalid_parameter_value(
  474                     f"Failed to enforce schema of data '{data}' "
  475                     f"with schema '{input_schema}'. "

MlflowException: Failed to enforce schema of data '[{'a': 'we are expecting strings', 'b': 'and only strings'}, [1, 2, 3]]' with schema '['a': string (required), 'b': string (optional)]'. Error: 'list' object has no attribute 'keys'

Wrapping Up: Insights and Best Practices from the MLflow Signature Playground

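As we conclude the MLflow Signature Playground notebook, we have gained valuable insight into the details of model signatures in the MLflow ecosystem. This tutorial has given you the knowledge and practical skills needed to manage and use model signatures effectively, helping keep your machine learning models robust and accurate.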

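Key takeaways include the importance of accurately defining scalar types, the value of enforcing and adhering to model signatures for data integrity, and the flexibility MLflow offers for updating an invalid model signature. These concepts are not just theoretical; they are fundamental to successful model deployment and management in real-world scenarios.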

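Whether you are a data scientist refining your models or a developer integrating machine learning into your applications, understanding and using model signatures is essential. We hope this tutorial has given you a solid foundation in MLflow signatures, so that you can apply these best practices in your future ML projects.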
