Serving LLMs with MLflow: Leveraging Custom PyFunc

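Introduction

This tutorial walks you through saving and deploying a large language model (LLM) with an MLflow custom pyfunc, which is the right choice for models that MLflow's default transformers flavor does not support.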

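Learning Objectives

  • Understand why a custom pyfunc is necessary in certain model scenarios.
  • Learn how to create a custom pyfunc that manages model dependencies and interface data.
  • See how a custom pyfunc simplifies the user interface in a deployed environment.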

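Challenges with the Default Implementation

MLflow's transformers flavor generally handles models from the HuggingFace Transformers library, but some models or configurations do not fit this standard approach. When a model cannot use one of the default pipeline types, as is the case here, deploying it with MLflow presents a unique challenge.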

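The Power of Custom PyFunc

MLflow's custom pyfunc addresses this problem. It allows us to:

  • Handle model loading and its dependencies efficiently.
  • Customize the inference process to fit the needs of a specific model.
  • Adapt the interface data to create a user-friendly environment in deployed applications.

Our focus will be the practical application of a custom pyfunc to deploy an LLM effectively within the MLflow ecosystem.

By the end of this tutorial, you will know how to tackle similar challenges in your own machine learning projects and take full advantage of MLflow for custom model deployment.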

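Important Considerations Before Proceeding

Hardware Recommendations

This guide demonstrates the use of a particularly large and complex language model. Given its complexity:

  • GPU requirements: It is strongly recommended to run this example on a system with a CUDA-capable GPU that has at least 64GB of VRAM.
  • CPU caveats: Although technically feasible, running the model on a CPU can lead to extremely long inference times, potentially tens of minutes for a single prediction, even on a top-tier CPU. Because of this limited performance on CPU-only systems, the final cell of this Notebook is intentionally left unexecuted. With an appropriately powerful GPU, however, the total end-to-end runtime of this Notebook is about 8 minutes.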
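
To put the memory recommendation in perspective, a rough back-of-the-envelope estimate of the memory needed just to hold the model weights can be sketched as follows. The parameter count (about 6.7 billion) and the helper name `estimate_weight_memory_gib` are assumptions made for illustration and are not values stated in this tutorial. Activations, the attention cache, and framework overhead all come on top of this figure.

```python
def estimate_weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / (1024**3)


# Assumption: roughly 6.7 billion parameters (illustrative, not from the tutorial).
ASSUMED_PARAM_COUNT = 6.7e9

# bfloat16 stores each parameter in 2 bytes, float32 in 4 bytes.
bf16_gib = estimate_weight_memory_gib(ASSUMED_PARAM_COUNT, 2)
fp32_gib = estimate_weight_memory_gib(ASSUMED_PARAM_COUNT, 4)

print(f"bfloat16 weights: ~{bf16_gib:.1f} GiB")
print(f"float32 weights:  ~{fp32_gib:.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is one reason a model of this size is typically loaded in a 16-bit dtype.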

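Execution Recommendations

If you are considering running the code in this Notebook:

  • Performance: For a smoother experience and to see what the model can really do, use hardware that matches the model's design.

  • Dependencies: Make sure you have installed the recommended dependencies for optimal model performance. These are essential for efficient model loading, initialization, attention computation, and inference processing: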

pip install xformers==0.0.20 einops==0.6.1 flash-attn==v1.0.3.post0 triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python

# Load necessary libraries

import accelerate
import torch
import transformers
from huggingface_hub import snapshot_download

import mlflow

Downloading the Model and Tokenizer

First, we need to download our model and tokenizer. Here is how:

# Download the MPT-7B instruct model and tokenizer to a local directory cache
snapshot_location = snapshot_download(repo_id="mosaicml/mpt-7b-instruct", local_dir="mpt-7b")

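Defining the Custom PyFunc

Now, let's define our custom pyfunc. It determines how the model loads its dependencies and how it performs predictions. Notice how the model's complexity is encapsulated within this class.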

class MPT(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        """
        This method initializes the tokenizer and language model
        using the specified model snapshot directory.
        """
        # Initialize tokenizer and language model
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            context.artifacts["snapshot"], padding_side="left"
        )

        config = transformers.AutoConfig.from_pretrained(
            context.artifacts["snapshot"], trust_remote_code=True
        )
        # If you are running this in a system that has a sufficiently powerful GPU with available VRAM,
        # uncomment the configuration setting below to leverage triton.
        # Note that triton dramatically improves the inference speed performance

        # config.attn_config["attn_impl"] = "triton"

        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            context.artifacts["snapshot"],
            config=config,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
        )

        # NB: Moving the model to 'cuda' requires a CUDA-capable device and a torch build
        # with CUDA support. Setting device to 'cpu' is valid, but
        # the performance will be very slow.
        self.model.to(device="cpu")
        # If running on a GPU-compatible environment, uncomment the following line:
        # self.model.to(device="cuda")

        self.model.eval()

    def _build_prompt(self, instruction):
        """
        This method generates the prompt for the model.
        """
        INSTRUCTION_KEY = "### Instruction:"
        RESPONSE_KEY = "### Response:"
        INTRO_BLURB = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request."
        )

        return f"""{INTRO_BLURB}
{INSTRUCTION_KEY}
{instruction}
{RESPONSE_KEY}
"""

    def predict(self, context, model_input, params=None):
        """
        This method generates prediction for the given input.
        """
        prompt = model_input["prompt"][0]

        # Retrieve or use default values for temperature and max_tokens
        temperature = params.get("temperature", 0.1) if params else 0.1
        max_tokens = params.get("max_tokens", 1000) if params else 1000

        # Build the prompt
        prompt = self._build_prompt(prompt)

        # Encode the input and generate prediction
        # NB: Sending the tokenized inputs to the GPU here explicitly will not work if your system does not have CUDA support.
        # If attempting to run this with GPU support, change 'cpu' to 'cuda' for maximum performance
        encoded_input = self.tokenizer.encode(prompt, return_tensors="pt").to("cpu")
        output = self.model.generate(
            encoded_input,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=max_tokens,
        )

        # Removing the prompt from the generated text
        prompt_length = len(self.tokenizer.encode(prompt, return_tensors="pt")[0])
        generated_response = self.tokenizer.decode(
            output[0][prompt_length:], skip_special_tokens=True
        )

        return {"candidates": [generated_response]}
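
The last step of predict() works because, for a decoder-only model, the generated sequence starts with the prompt tokens and continues with the newly generated ones. A minimal sketch of that slicing with plain Python lists follows; the token IDs are made up for illustration and do not come from the real tokenizer, and `strip_prompt_tokens` is a hypothetical helper name.

```python
def strip_prompt_tokens(output_ids, prompt_ids):
    """Drop the leading prompt tokens from a generated token sequence."""
    prompt_length = len(prompt_ids)
    return output_ids[prompt_length:]


# Hypothetical token IDs: the "model output" echoes the prompt, then adds new tokens.
prompt_ids = [101, 2054, 2003]
output_ids = prompt_ids + [3698, 4083]

new_tokens = strip_prompt_tokens(output_ids, prompt_ids)
print(new_tokens)
```

Only the remaining tokens are decoded, so the caller receives the response without the instruction template echoed back.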

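Building the Prompt

A key aspect of our custom pyfunc is the construction of the model prompt. Rather than requiring the end user to understand and build this prompt, our custom pyfunc takes care of it. This keeps the end-user interface simple and consistent, no matter how complex the model's requirements are.

Review the _build_prompt() method in the class above to see how custom input-processing logic can be added to a custom pyfunc to translate user input data into the format that the wrapped model instance requires.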
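
To inspect the exact text the model receives, the same template can be exercised outside the class. The sketch below is a standalone approximation: `build_prompt` is a hypothetical function name, and the layout assumes each part of the template sits on its own line with no leading whitespace.

```python
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_prompt(instruction: str) -> str:
    """Wrap a raw user instruction in the instruction/response template."""
    return f"{INTRO_BLURB}\n{INSTRUCTION_KEY}\n{instruction}\n{RESPONSE_KEY}\n"


prompt = build_prompt("What is machine learning?")
print(prompt)
```

The end user supplies only the question; the template wording stays an internal detail of the wrapper.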

import numpy as np
import pandas as pd

import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types import ColSpec, DataType, ParamSchema, ParamSpec, Schema

# Define input and output schema
input_schema = Schema(
    [
        ColSpec(DataType.string, "prompt"),
    ]
)
output_schema = Schema([ColSpec(DataType.string, "candidates")])

parameters = ParamSchema(
    [
        ParamSpec("temperature", DataType.float, np.float32(0.1), None),
        ParamSpec("max_tokens", DataType.integer, np.int32(1000), None),
    ]
)

signature = ModelSignature(inputs=input_schema, outputs=output_schema, params=parameters)


# Define input example
input_example = pd.DataFrame({"prompt": ["What is machine learning?"]})
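
The parameter schema gives each inference parameter a default, and a caller may override any subset of them at prediction time. As a rough illustration of that behaviour only, using plain dictionaries rather than MLflow's actual implementation, the merge can be pictured like this (`resolve_params` is a hypothetical helper):

```python
# Defaults mirroring the values declared in the parameter schema.
DEFAULT_PARAMS = {"temperature": 0.1, "max_tokens": 1000}


def resolve_params(overrides=None):
    """Return the defaults with any caller-supplied overrides applied."""
    resolved = dict(DEFAULT_PARAMS)  # copy so the defaults are never mutated
    resolved.update(overrides or {})
    return resolved


print(resolve_params())
print(resolve_params({"temperature": 0.6}))
```

Declaring the defaults in the signature documents them alongside the model, so a caller only needs to pass the values they want to change.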

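Set the Experiment That We Will Log Our Custom Model To

If the experiment does not already exist, MLflow will create a new experiment with this name and notify you that it has been created.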

# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.

# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment(experiment_name="mpt-7b-instruct-evaluation")
2023/11/29 17:33:23 INFO mlflow.tracking.fluent: Experiment with name 'mpt-7b-instruct-evaluation' does not exist. Creating a new experiment.
<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/custom-pyfunc-for-llms/notebooks/mlruns/265930820950682761', creation_time=1701297203895, experiment_id='265930820950682761', last_update_time=1701297203895, lifecycle_stage='active', name='mpt-7b-instruct-evaluation', tags={}>
# Get the current base version of torch that is installed, without specific version modifiers
torch_version = torch.__version__.split("+")[0]

# Start an MLflow run context and log the MPT-7B model wrapper along with the param-included signature to
# allow for overriding parameters at inference time
with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        "mpt-7b-instruct",
        python_model=MPT(),
        # NOTE: the artifacts dictionary mapping is critical! This dict is used by the load_context() method in our MPT() class.
        artifacts={"snapshot": snapshot_location},
        pip_requirements=[
            f"torch=={torch_version}",
            f"transformers=={transformers.__version__}",
            f"accelerate=={accelerate.__version__}",
            "einops",
            "sentencepiece",
        ],
        input_example=input_example,
        signature=signature,
    )
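
The split on "+" matters because a PyTorch build can carry a local version suffix (for example a CUDA tag), and a requirement pinned to that full string may not resolve from a package index when the environment is recreated. A small sketch of the normalization follows; the version strings are examples rather than values from this environment, and `base_version` is a hypothetical helper.

```python
def base_version(version: str) -> str:
    """Strip any local version suffix such as '+cu118' from a version string."""
    return version.split("+")[0]


# Example version strings, with and without a local suffix.
for raw in ("2.1.0+cu118", "2.1.0"):
    print(f"torch=={base_version(raw)}")
```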

Load the Saved Model

loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

Test the Model for Inference

# The execution of this is commented out for the purposes of runtime on CPU.
# If you are running this on a system with a sufficiently powerful GPU, you may uncomment and interface with the model!

# loaded_model.predict(pd.DataFrame(
# {"prompt": ["What is machine learning?"]}), params={"temperature": 0.6}
# )
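
If the logged model is later served over REST (for example with `mlflow models serve`), the same prompt and parameter override could be sent as a JSON body. The sketch below only builds the payload with the standard library. The `dataframe_split` and `params` keys reflect our understanding of the MLflow 2.x scoring-server input format and should be checked against the documentation for your MLflow version; `build_request_body` is a hypothetical helper.

```python
import json


def build_request_body(prompt, temperature=None):
    """Build a JSON request body carrying one prompt and optional params."""
    body = {"dataframe_split": {"columns": ["prompt"], "data": [[prompt]]}}
    if temperature is not None:
        # Only send params the caller wants to override.
        body["params"] = {"temperature": temperature}
    return json.dumps(body)


payload = build_request_body("What is machine learning?", temperature=0.6)
print(payload)
```

The column name matches the input schema, so the served model receives the same single-column frame that the local predict() call would.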

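Conclusion

In this tutorial, we explored the power and flexibility of MLflow's custom pyfunc. By understanding the specific needs of our model and defining a custom pyfunc to meet them, we can ensure a seamless deployment process and a user-friendly interface.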