Deploying LLMs with MLflow: Leveraging Custom PyFunc
Introduction
This tutorial walks you through saving and deploying a large language model (LLM) with MLflow and a custom pyfunc, an approach well suited to models that the default MLflow transformers flavor does not support directly.
Learning Objectives
- Understand why a custom pyfunc definition is needed in certain model scenarios.
- Learn how to create a custom pyfunc to manage model dependencies and interface data.
- Gain insight into how a custom pyfunc can simplify the user-facing interface in a deployed environment.
Challenges with the Default Implementation
While MLflow's transformers flavor generally handles models from the HuggingFace Transformers library, some models or configurations do not fit this standard approach. In cases such as ours, where the model cannot use a default pipeline type, deploying it with MLflow poses a distinct challenge.
The Power of Custom PyFunc
To address this, MLflow's custom pyfunc comes to the rescue. It allows us to:
- Handle model loading and its dependencies efficiently.
- Customize the inference process to suit specific model requirements.
- Adapt interface data to create a user-friendly environment in deployed applications.
Our focus will be the practical application of a custom pyfunc for deploying LLMs effectively within the MLflow ecosystem.
By the end of this tutorial, you'll be equipped to tackle similar challenges in your own machine learning projects and to take full advantage of MLflow for custom model deployment.
Important Considerations Before Proceeding
Hardware Recommendations
This guide demonstrates the use of a particularly large and complex large language model (LLM). Given its complexity:
- GPU Requirements: Running this example on a CUDA-capable GPU system with at least 64GB of VRAM is strongly recommended.
- CPU Caution: While technically possible, executing the model on a CPU can lead to extremely long inference times; a single prediction may take tens of minutes even on a top-tier CPU. Because of this limited performance on the CPU-only system used here, the last cell of this notebook is deliberately left unexecuted. With a sufficiently powerful GPU, however, the total runtime of this notebook is approximately 8 minutes end to end. (An optional device-detection helper is sketched right after this list.)
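If you would like the notebook to adapt automatically to whatever hardware it lands on, a small helper along the lines of the sketch below can select the device at runtime. This is an optional convenience, not part of the original notebook; the tutorial itself keeps the CPU/GPU choice explicit in commented-out lines further down.

# Optional helper (an addition for convenience, not part of the original notebook):
# pick "cuda" when a CUDA-capable GPU is visible to PyTorch, otherwise fall back to "cpu".
# The returned string could be substituted wherever this tutorial hard-codes "cpu".
import torch


def pick_device() -> str:
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Running on: {pick_device()}")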
Execution Recommendations
If you're considering running the code in this notebook:
- Performance: For a smoother experience and to truly harness the model's capabilities, use hardware aligned with the model's design.
- Dependencies: Make sure the recommended dependencies are installed for optimal model performance. They are essential for efficient model loading, initialization, attention computation, and inference processing:
pip install xformers==0.0.20 einops==0.6.1 flash-attn==v1.0.3.post0 triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
# Load necessary libraries
import accelerate
import torch
import transformers
from huggingface_hub import snapshot_download
import mlflow
Downloading the Model and Tokenizer
First, we need to download our model and tokenizer. Here's how:
# Download the MPT-7B instruct model and tokenizer to a local directory cache
snapshot_location = snapshot_download(repo_id="mosaicml/mpt-7b-instruct", local_dir="mpt-7b")
Fetching 24 files: 0%| | 0/24 [00:00<?, ?it/s]
Defining the Custom PyFunc
Now, let's define our custom pyfunc. It dictates how the model loads its dependencies and how it performs predictions. Note how we've encapsulated the model's complexities within this class.
class MPT(mlflow.pyfunc.PythonModel):
def load_context(self, context):
"""
This method initializes the tokenizer and language model
using the specified model snapshot directory.
"""
# Initialize tokenizer and language model
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
context.artifacts["snapshot"], padding_side="left"
)
config = transformers.AutoConfig.from_pretrained(
context.artifacts["snapshot"], trust_remote_code=True
)
# If you are running this in a system that has a sufficiently powerful GPU with available VRAM,
# uncomment the configuration setting below to leverage triton.
# Note that triton dramatically improves the inference speed performance
# config.attn_config["attn_impl"] = "triton"
self.model = transformers.AutoModelForCausalLM.from_pretrained(
context.artifacts["snapshot"],
config=config,
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
# NB: If you do not have a CUDA-capable device or have torch installed with CUDA support
# this setting will not function correctly. Setting device to 'cpu' is valid, but
# the performance will be very slow.
self.model.to(device="cpu")
# If running on a GPU-compatible environment, uncomment the following line:
# self.model.to(device="cuda")
self.model.eval()
def _build_prompt(self, instruction):
"""
This method generates the prompt for the model.
"""
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = (
"Below is an instruction that describes a task. "
"Write a response that appropriately completes the request."
)
return f"""{INTRO_BLURB}
{INSTRUCTION_KEY}
{instruction}
{RESPONSE_KEY}
"""
def predict(self, context, model_input, params=None):
"""
This method generates prediction for the given input.
"""
prompt = model_input["prompt"][0]
# Retrieve or use default values for temperature and max_tokens
temperature = params.get("temperature", 0.1) if params else 0.1
max_tokens = params.get("max_tokens", 1000) if params else 1000
# Build the prompt
prompt = self._build_prompt(prompt)
# Encode the input and generate prediction
# NB: Sending the tokenized inputs to the GPU here explicitly will not work if your system does not have CUDA support.
# If attempting to run this with GPU support, change 'cpu' to 'cuda' for maximum performance
encoded_input = self.tokenizer.encode(prompt, return_tensors="pt").to("cpu")
output = self.model.generate(
encoded_input,
do_sample=True,
temperature=temperature,
max_new_tokens=max_tokens,
)
# Removing the prompt from the generated text
prompt_length = len(self.tokenizer.encode(prompt, return_tensors="pt")[0])
generated_response = self.tokenizer.decode(
output[0][prompt_length:], skip_special_tokens=True
)
return {"candidates": [generated_response]}
Building the Prompt
A key aspect of our custom pyfunc is the construction of the model prompt. Instead of requiring the end user to understand and assemble this prompt, our custom pyfunc takes care of it. This ensures that, regardless of how intricate the model's requirements are, the end-user interface remains simple and consistent.
Review the _build_prompt() method in the class above to see how custom input-processing logic can be added to a custom pyfunc to handle the transformations needed to convert user input data into a format compatible with the wrapped model instance.
import numpy as np
import pandas as pd
import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types import ColSpec, DataType, ParamSchema, ParamSpec, Schema
# Define input and output schema
input_schema = Schema(
[
ColSpec(DataType.string, "prompt"),
]
)
output_schema = Schema([ColSpec(DataType.string, "candidates")])
parameters = ParamSchema(
[
ParamSpec("temperature", DataType.float, np.float32(0.1), None),
ParamSpec("max_tokens", DataType.integer, np.int32(1000), None),
]
)
signature = ModelSignature(inputs=input_schema, outputs=output_schema, params=parameters)
# Define input example
input_example = pd.DataFrame({"prompt": ["What is machine learning?"]})
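As an aside, an equivalent signature could also be derived from example values with mlflow.models.infer_signature, including the inference-time params. The sketch below is an alternative, not what this tutorial uses; the explicit Schema/ParamSchema construction above is kept because it gives precise control over the declared types. The example output string is a placeholder.

# Alternative sketch (not used in this tutorial): derive the signature from examples.
import numpy as np
import pandas as pd

import mlflow

inferred_signature = mlflow.models.infer_signature(
    model_input=pd.DataFrame({"prompt": ["What is machine learning?"]}),
    model_output=pd.DataFrame({"candidates": ["<placeholder generated text>"]}),
    params={"temperature": np.float32(0.1), "max_tokens": np.int32(1000)},
)
print(inferred_signature)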
Setting the Experiment for Logging Our Custom Model
If the experiment doesn't already exist, MLflow creates a new one with this name and notifies you that it has created a new experiment.
# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.
# mlflow.set_tracking_uri("http://127.0.0.1:8080")
mlflow.set_experiment(experiment_name="mpt-7b-instruct-evaluation")
2023/11/29 17:33:23 INFO mlflow.tracking.fluent: Experiment with name 'mpt-7b-instruct-evaluation' does not exist. Creating a new experiment.
<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/custom-pyfunc-for-llms/notebooks/mlruns/265930820950682761', creation_time=1701297203895, experiment_id='265930820950682761', last_update_time=1701297203895, lifecycle_stage='active', name='mpt-7b-instruct-evaluation', tags={}>
# Get the current base version of torch that is installed, without specific version modifiers
torch_version = torch.__version__.split("+")[0]
# Start an MLflow run context and log the MPT-7B model wrapper along with the param-included signature to
# allow for overriding parameters at inference time
with mlflow.start_run():
model_info = mlflow.pyfunc.log_model(
name="mpt-7b-instruct",
python_model=MPT(),
# NOTE: the artifacts dictionary mapping is critical! This dict is used by the load_context() method in our MPT() class.
artifacts={"snapshot": snapshot_location},
pip_requirements=[
f"torch=={torch_version}",
f"transformers=={transformers.__version__}",
f"accelerate=={accelerate.__version__}",
"einops",
"sentencepiece",
],
input_example=input_example,
signature=signature,
)
Loading the Saved Model
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
/Users/benjamin.wilson/.cache/huggingface/modules/transformers_modules/mpt-7b/configuration_mpt.py:97: UserWarning: alibi is turned on, setting `learned_pos_emb` to `False.` warnings.warn(f'alibi is turned on, setting `learned_pos_emb` to `False.`')
Testing the Model for Inference
# The execution of this is commented out for the purposes of runtime on CPU.
# If you are running this on a system with a sufficiently powerful GPU, you may uncomment and interface with the model!
# loaded_model.predict(pd.DataFrame(
# {"prompt": ["What is machine learning?"]}), params={"temperature": 0.6}
# )
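Beyond in-process inference, the logged model can also be served with MLflow's standard scoring server and queried over REST. The sketch below is a minimal illustration under a few assumptions: a scoring server was started in a separate shell with "mlflow models serve -m <model_uri> -p 5000", it is reachable on localhost, and the requests package is installed. The payload keys (dataframe_records, params) follow MLflow's pyfunc scoring protocol.

# Minimal sketch of querying a locally served copy of the model.
# Assumes a scoring server was started separately, e.g.:
#   mlflow models serve -m <model_uri> -p 5000
# Host and port are assumptions for a local setup.
import json

import requests

payload = {
    "dataframe_records": [{"prompt": "What is machine learning?"}],
    "params": {"temperature": 0.6, "max_tokens": 250},
}

response = requests.post(
    "http://127.0.0.1:5000/invocations",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    timeout=600,
)
print(response.json())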
Conclusion
Throughout this tutorial, we've seen the power and flexibility of MLflow's custom pyfunc. By understanding our model's specific needs and defining a custom pyfunc to address them, we can ensure a seamless deployment process and a user-friendly interface.