Serving LLMs with MLflow: Leveraging Custom PyFunc

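Introduction

This tutorial walks you through saving and deploying a large language model (LLM) with an MLflow custom pyfunc, which is the right tool for models that MLflow's default transformers flavor does not support.

Learning Objectives

Understand why certain model scenarios require a custom pyfunc.
Learn how to create a custom pyfunc that manages model dependencies and interface data.
See how a custom pyfunc simplifies the user interface in a deployed environment.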

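Challenges with the Default Implementation

MLflow's transformers flavor handles most models from the HuggingFace Transformers library, but some models or configurations do not fit this standard approach. When a model cannot use one of the default pipeline types, as is the case here, deploying it with MLflow poses a distinct challenge.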

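The Power of Custom PyFunc

MLflow's custom pyfunc addresses this problem. It lets us:

Handle model loading and its dependencies efficiently.
Customize the inference process to fit the model's specific requirements.
Adapt the interface data to create a user-friendly environment in the deployed application.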
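
As a rough illustration of how these three capabilities map onto code, the sketch below uses a dependency-free stand-in class. It is not MLflow code: `ToyModel`, its lookup table, and the `unknown` parameter are all hypothetical. In real use the class would subclass `mlflow.pyfunc.PythonModel`, which provides `load_context` and `predict` hooks of the same shape.

```python
class ToyModel:
    """Hypothetical stand-in that mirrors the shape of a custom pyfunc."""

    def load_context(self, context):
        # Loading: heavy resources are created once, when the model is loaded.
        # A tiny lookup table stands in for a tokenizer and model weights.
        self.vocabulary = {"hello": "bonjour", "world": "monde"}

    def _prepare(self, text):
        # Interface data: callers pass plain text; the wrapper normalizes it.
        return text.strip().lower().split()

    def predict(self, context, model_input, params=None):
        # Inference: custom logic takes the place of a default pipeline.
        unknown = (params or {}).get("unknown", "?")
        words = self._prepare(model_input["prompt"][0])
        translated = " ".join(self.vocabulary.get(w, unknown) for w in words)
        return {"candidates": [translated]}


model = ToyModel()
model.load_context(context=None)
result = model.predict(None, {"prompt": ["  Hello World "]})
print(result)  # {'candidates': ['bonjour monde']}
```

The caller only ever supplies a prompt and, optionally, parameters; everything model-specific stays inside the class.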

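Our focus will be the practical application of a custom pyfunc to deploy an LLM effectively within the MLflow ecosystem.

By the end of this tutorial, you will know how to approach similar challenges in your own machine learning projects, taking full advantage of MLflow for custom model deployments.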

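Important Considerations Before Proceeding

Hardware Recommendations

This guide demonstrates the use of a particularly large and complex language model (LLM). Given its complexity:

GPU requirements: It is strongly recommended to run this example on a system with a CUDA-capable GPU that has at least 64GB of VRAM.
CPU caveats: Running the model on a CPU is technically feasible, but inference will be extremely slow; a single prediction can take tens of minutes even on a top-tier CPU. Because of this limited performance on CPU-only systems, the final cell of this notebook is intentionally left unexecuted. On a suitably powerful GPU, however, the notebook runs end to end in about 8 minutes.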
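
Before loading anything, it can help to check which device is actually available and how much memory the weights alone would need. The sketch below is a minimal example under stated assumptions: `pick_device` and `weight_memory_gib` are hypothetical helper names, and the figure of roughly 6.7 billion parameters for a "7B" model is an assumption. The estimate covers weights only; activations and generation caches need additional memory.

```python
def pick_device():
    """Return "cuda" when a CUDA-enabled torch is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check also works without torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def weight_memory_gib(n_params, bytes_per_param=2):
    """Memory for the weights alone; bfloat16 stores each value in 2 bytes."""
    return n_params * bytes_per_param / 1024**3


device = pick_device()
# Assumption: roughly 6.7 billion parameters for a "7B" model.
estimate = weight_memory_gib(6.7e9)
print(f"device={device}, weights need about {estimate:.1f} GiB")
```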

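Execution Recommendations

If you are considering running the code in this notebook:

Performance: For a smoother experience and to get the most out of the model, use hardware that matches the model's design.
Dependencies: Make sure the recommended dependencies are installed for optimal model performance. They are essential for efficient model loading, initialization, attention computation, and inference: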

pip install xformers==0.0.20 einops==0.6.1 flash-attn==v1.0.3.post0 triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
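
To confirm which of these optional packages ended up in the active environment, a small standard-library check is usually enough. This is a sketch, not part of the original notebook: `check_modules` is a hypothetical helper, and the import names in `OPTIONAL_MODULES` are assumptions about how the packages in the pip command expose themselves (in particular, that the Triton fork installs as `triton_pre_mlir`).

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages in the pip command.
OPTIONAL_MODULES = ["xformers", "einops", "flash_attn", "triton_pre_mlir"]


def check_modules(names):
    """Map each top-level module name to True if it can be located."""
    return {name: find_spec(name) is not None for name in names}


status = check_modules(OPTIONAL_MODULES)
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```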

# Load necessary libraries

import accelerate
import torch
import transformers
from huggingface_hub import snapshot_download

import mlflow

Downloading the Model and Tokenizer

First, we need to download our model and tokenizer. Here is how:

# Download the MPT-7B instruct model and tokenizer to a local directory cache
snapshot_location = snapshot_download(repo_id="mosaicml/mpt-7b-instruct", local_dir="mpt-7b")
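
Because the download is large, it can be worth confirming that the local directory contains the files the loader will need before moving on. The helper below is a hypothetical sketch using only the standard library; the list of required files is an assumption chosen for illustration, not an exhaustive manifest.

```python
from pathlib import Path

# Assumption: a usable snapshot contains at least these files.
REQUIRED_FILES = ("config.json", "tokenizer.json", "tokenizer_config.json")


def missing_snapshot_files(snapshot_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in required if not (root / name).is_file()]


missing = missing_snapshot_files("mpt-7b")
print("missing files:", missing)
```

An empty list means every listed file is present in the `mpt-7b` directory used as `local_dir`.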

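Defining the Custom PyFunc

Now let's define our custom pyfunc. It determines how the model loads its dependencies and how it performs predictions. Note how the class encapsulates the complexity of the model.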

class MPT(mlflow.pyfunc.PythonModel):
  def load_context(self, context):
      """
      This method initializes the tokenizer and language model
      using the specified model snapshot directory.
      """
      # Initialize tokenizer and language model
      self.tokenizer = transformers.AutoTokenizer.from_pretrained(
          context.artifacts["snapshot"], padding_side="left"
      )

      config = transformers.AutoConfig.from_pretrained(
          context.artifacts["snapshot"], trust_remote_code=True
      )
      # If you are running this in a system that has a sufficiently powerful GPU with available VRAM,
      # uncomment the configuration setting below to leverage triton.
      # Note that triton dramatically improves the inference speed performance

      # config.attn_config["attn_impl"] = "triton"

      self.model = transformers.AutoModelForCausalLM.from_pretrained(
          context.artifacts["snapshot"],
          config=config,
          torch_dtype=torch.bfloat16,
          trust_remote_code=True,
      )

      # NB: Running on "cpu" works on any system, but inference will be very slow.
      self.model.to(device="cpu")
      # If you have a CUDA-capable device and torch installed with CUDA support,
      # comment out the line above and uncomment the following line instead:
      # self.model.to(device="cuda")

      self.model.eval()

  def _build_prompt(self, instruction):
      """
      This method generates the prompt for the model.
      """
      INSTRUCTION_KEY = "### Instruction:"
      RESPONSE_KEY = "### Response:"
      INTRO_BLURB = (
          "Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request."
      )

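      # Join the pieces with explicit newlines so that the indentation of this
      # source file does not leak into the prompt text.
      return f"{INTRO_BLURB}\n{INSTRUCTION_KEY}\n{instruction}\n{RESPONSE_KEY}\n"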

  def predict(self, context, model_input, params=None):
      """
      This method generates prediction for the given input.
      """
      prompt = model_input["prompt"][0]

      # Retrieve or use default values for temperature and max_tokens
      temperature = params.get("temperature", 0.1) if params else 0.1
      max_tokens = params.get("max_tokens", 1000) if params else 1000

      # Build the prompt
      prompt = self._build_prompt(prompt)

      # Encode the input and generate prediction
      # NB: Sending the tokenized inputs to the GPU here explicitly will not work if your system does not have CUDA support.
      # If attempting to run this with GPU support, change 'cpu' to 'cuda' for maximum performance
      encoded_input = self.tokenizer.encode(prompt, return_tensors="pt").to("cpu")
      output = self.model.generate(
          encoded_input,
          do_sample=True,
          temperature=temperature,
          max_new_tokens=max_tokens,
      )

      # Removing the prompt from the generated text
      prompt_length = len(self.tokenizer.encode(prompt, return_tensors="pt")[0])
      generated_response = self.tokenizer.decode(
          output[0][prompt_length:], skip_special_tokens=True
      )

      return {"candidates": [generated_response]}
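
Two small pieces of logic inside predict() are easy to examine in isolation: falling back to default parameter values, and trimming the echoed prompt from the generated sequence (a causal language model's output starts with the prompt tokens). The sketch below mirrors both with plain Python lists in place of tensors; `resolve_params` and `strip_prompt` are hypothetical helper names, and the token ids are made up.

```python
DEFAULTS = {"temperature": 0.1, "max_tokens": 1000}


def resolve_params(params=None):
    """Overlay caller-supplied params on the defaults; None means no overrides."""
    return {**DEFAULTS, **(params or {})}


def strip_prompt(output_ids, prompt_ids):
    """Drop the echoed prompt tokens from the front of a generated sequence."""
    return output_ids[len(prompt_ids):]


print(resolve_params({"temperature": 0.6}))  # {'temperature': 0.6, 'max_tokens': 1000}
print(strip_prompt([11, 12, 13, 21, 22], [11, 12, 13]))  # [21, 22]
```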

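Building the Prompt

A key aspect of our custom pyfunc is the construction of the model prompt. Rather than requiring end users to understand and build this prompt, the custom pyfunc handles it for them. This keeps the end-user interface simple and consistent, however complex the model's requirements are.

See the _build_prompt() method in the class above for how custom input-processing logic can be added to a custom pyfunc to translate user input into the format that the wrapped model instance expects.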
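
To see what the model receives for a given user input, the template can be exercised on its own. This is a standalone sketch that reuses the same three constants and joins them with explicit newlines; `build_prompt` is a hypothetical free function, not a method of the class.

```python
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_prompt(instruction):
    """Wrap a raw user instruction in the instruction/response template."""
    return "\n".join([INTRO_BLURB, INSTRUCTION_KEY, instruction, RESPONSE_KEY, ""])


prompt = build_prompt("What is machine learning?")
print(prompt)
```

The end user types only the question; the blurb and the two section markers are added on their behalf.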

import numpy as np
import pandas as pd

import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types import ColSpec, DataType, ParamSchema, ParamSpec, Schema

# Define input and output schema
input_schema = Schema(
  [
      ColSpec(DataType.string, "prompt"),
  ]
)
output_schema = Schema([ColSpec(DataType.string, "candidates")])

parameters = ParamSchema(
  [
      ParamSpec("temperature", DataType.float, np.float32(0.1), None),
      ParamSpec("max_tokens", DataType.integer, np.int32(1000), None),
  ]
)

signature = ModelSignature(inputs=input_schema, outputs=output_schema, params=parameters)


# Define input example
input_example = pd.DataFrame({"prompt": ["What is machine learning?"]})

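Setting the Experiment for Logging Our Custom Model

If the experiment does not yet exist, MLflow creates a new experiment with this name and notifies you that it was created.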

# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.

# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment(experiment_name="mpt-7b-instruct-evaluation")

2023/11/29 17:33:23 INFO mlflow.tracking.fluent: Experiment with name 'mpt-7b-instruct-evaluation' does not exist. Creating a new experiment.

<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/custom-pyfunc-for-llms/notebooks/mlruns/265930820950682761', creation_time=1701297203895, experiment_id='265930820950682761', last_update_time=1701297203895, lifecycle_stage='active', name='mpt-7b-instruct-evaluation', tags={}>

# Get the current base version of torch that is installed, without specific version modifiers
torch_version = torch.__version__.split("+")[0]

# Start an MLflow run context and log the MPT-7B model wrapper along with the param-included signature to
# allow for overriding parameters at inference time
with mlflow.start_run():
  model_info = mlflow.pyfunc.log_model(
      "mpt-7b-instruct",
      python_model=MPT(),
      # NOTE: the artifacts dictionary mapping is critical! This dict is used by the load_context() method in our MPT() class.
      artifacts={"snapshot": snapshot_location},
      pip_requirements=[
          f"torch=={torch_version}",
          f"transformers=={transformers.__version__}",
          f"accelerate=={accelerate.__version__}",
          "einops",
          "sentencepiece",
      ],
      input_example=input_example,
      signature=signature,
  )
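
The torch version is split on "+" because CUDA builds of torch report a local suffix that does not belong in a requirement pin. The sketch below shows the effect with plain strings; `base_version` and `pin` are hypothetical helpers, and the version numbers are made-up examples rather than recommendations.

```python
def base_version(version):
    """Strip a local build suffix such as "+cu118" from a version string."""
    return version.split("+")[0]


def pin(package, version):
    """Format an exact-version requirement line for pip."""
    return f"{package}=={base_version(version)}"


# The version strings below are illustrative only.
print(pin("torch", "2.1.0+cu118"))    # torch==2.1.0
print(pin("transformers", "4.35.2"))  # transformers==4.35.2
```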

Loading the Saved Model

loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

Testing the Model for Inference

# The execution of this is commented out for the purposes of runtime on CPU.
# If you are running this on a system with a sufficiently powerful GPU, you may uncomment and interface with the model!

# loaded_model.predict(pd.DataFrame(
#     {"prompt": ["What is machine learning?"]}), params={"temperature": 0.6}
# )
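
If the logged model is later put behind MLflow's model server, the same request would travel as JSON instead of a DataFrame. The sketch below only builds such a request body. It assumes that the scoring endpoint accepts a `dataframe_split` object for the input rows and a `params` object for inference parameters; `build_invocation_payload` is a hypothetical helper, and nothing is sent over the network.

```python
import json


def build_invocation_payload(prompt, **params):
    """Build a JSON body with one "prompt" row and optional inference params."""
    payload = {"dataframe_split": {"columns": ["prompt"], "data": [[prompt]]}}
    if params:
        payload["params"] = params
    return json.dumps(payload)


body = build_invocation_payload("What is machine learning?", temperature=0.6)
print(body)
```

The column name and the parameter name match the signature logged with the model, so the server can validate the request against it.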

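Conclusion

In this tutorial we explored the power and flexibility of MLflow's custom pyfunc. By understanding our model's specific requirements and defining a custom pyfunc to meet them, we can ensure a seamless deployment process and a user-friendly interface.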
