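Serving LLMs with MLflow: Leveraging Custom PyFunc
Introduction
This tutorial walks you through saving and deploying Large Language Models (LLMs) with a custom pyfunc in MLflow, an approach well suited to models that MLflow's default transformers flavor does not support directly.
Learning Objectives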
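- Understand why a custom pyfunc is needed in specific model scenarios.
- Learn to create a custom pyfunc that manages model dependencies and interface data.
- See how a custom pyfunc simplifies the user interface in a deployed environment.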
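Challenges with the Default Implementation
MLflow's transformers flavor handles most models from the HuggingFace Transformers library, but some models or configurations do not fit that standard approach. When a model cannot use one of the default pipeline types, as is the case here, deploying it with MLflow presents a distinct challenge.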
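The Power of Custom PyFunc
MLflow's custom pyfunc addresses this problem. It lets us:
- Handle model loading and its dependencies efficiently.
- Customize the inference process to fit the model's specific requirements.
- Adapt the interface data to create a user-friendly environment in deployed applications.
Our focus is the practical application of a custom pyfunc to deploy LLMs effectively within the MLflow ecosystem.
By the end of this tutorial, you will be equipped to tackle similar challenges in your own machine learning projects and to make full use of MLflow for custom model deployment.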
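Important Considerations Before Proceeding
Hardware Recommendations
This guide demonstrates a particularly large and complex Large Language Model (LLM). Given its complexity:
- GPU requirement: It is strongly recommended to run this example on a system with a CUDA-capable GPU that has at least 64GB of VRAM.
- CPU considerations: While technically feasible, running the model on a CPU can make inference extremely slow, potentially tens of minutes for a single prediction even on a top-tier CPU. Because of this limited performance on CPU-only systems, the final cell of this Notebook is deliberately left unexecuted. With a suitably powerful GPU, however, the total end-to-end runtime of this Notebook is about 8 minutes.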
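The MPT class later in this tutorial hard-codes the device as "cpu" and asks you to edit the code for GPU use. As a minimal sketch, the choice could instead be made at runtime. The helper name `pick_device` is hypothetical and not part of the tutorial; it assumes that a CUDA-enabled torch build is the only requirement for selecting the GPU.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled torch build sees a GPU, else "cpu"."""
    try:
        import torch  # imported lazily so the check also runs where torch is absent
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The returned string could be passed wherever the tutorial code uses "cpu" or "cuda", for example in `self.model.to(device=...)`.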
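Execution Recommendations
If you plan to run the code in this Notebook:
- Performance: For a smoother experience and to see what the model can really do, use hardware that matches the model's design.
- Dependencies: Make sure the recommended dependencies are installed for optimal model performance. They are essential for efficient model loading, initialization, attention computation, and inference processing: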
pip install xformers==0.0.20 einops==0.6.1 flash-attn==v1.0.3.post0 triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
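Before loading the model, it can help to confirm which of these optional packages are actually present. The following is a sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are taken from the pip command above.

```python
from importlib import metadata

# Distribution names taken from the pip command above.
RECOMMENDED = ["xformers", "einops", "flash-attn"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(RECOMMENDED).items():
    print(f"{name}: {version or 'not installed'}")
```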
# Load necessary libraries
import accelerate
import torch
import transformers
from huggingface_hub import snapshot_download
import mlflow
Downloading the Model and Tokenizer
First, we need to download the model and tokenizer. Here is how:
# Download the MPT-7B instruct model and tokenizer to a local directory cache
snapshot_location = snapshot_download(repo_id="mosaicml/mpt-7b-instruct", local_dir="mpt-7b")
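Because the snapshot is more than 13 GB, an interrupted download is worth ruling out before logging the model. The sketch below checks a local directory for a few files; `missing_files` is a hypothetical helper, and the three file names are examples from the repository's file listing rather than a complete manifest.

```python
from pathlib import Path

# Example file names from the repository listing; not a complete manifest.
EXPECTED = ["config.json", "tokenizer.json", "modeling_mpt.py"]


def missing_files(snapshot_dir, expected=EXPECTED):
    """Return the expected file names that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in expected if not (root / name).is_file()]


# "mpt-7b" is the local_dir used in the download call above.
print(missing_files("mpt-7b"))
```

An empty list means every listed file was found; it does not verify file contents or sizes.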
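Defining the Custom PyFunc
Now let's define our custom pyfunc. It determines how the model loads its dependencies and how it performs predictions. Notice how the class encapsulates the model's complexity.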
class MPT(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        """
        This method initializes the tokenizer and language model
        using the specified model snapshot directory.
        """
        # Initialize tokenizer and language model
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            context.artifacts["snapshot"], padding_side="left"
        )

        config = transformers.AutoConfig.from_pretrained(
            context.artifacts["snapshot"], trust_remote_code=True
        )
        # If you are running this in a system that has a sufficiently powerful GPU with available VRAM,
        # uncomment the configuration setting below to leverage triton.
        # Note that triton dramatically improves the inference speed performance

        # config.attn_config["attn_impl"] = "triton"

        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            context.artifacts["snapshot"],
            config=config,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
        )

        # NB: If you do not have a CUDA-capable device or have torch installed with CUDA support
        # this setting will not function correctly. Setting device to 'cpu' is valid, but
        # the performance will be very slow.
        self.model.to(device="cpu")
        # If running on a GPU-compatible environment, uncomment the following line:
        # self.model.to(device="cuda")

        self.model.eval()

    def _build_prompt(self, instruction):
        """
        This method generates the prompt for the model.
        """
        INSTRUCTION_KEY = "### Instruction:"
        RESPONSE_KEY = "### Response:"
        INTRO_BLURB = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request."
        )

        return f"""{INTRO_BLURB}
        {INSTRUCTION_KEY}
        {instruction}
        {RESPONSE_KEY}
        """

    def predict(self, context, model_input, params=None):
        """
        This method generates prediction for the given input.
        """
        prompt = model_input["prompt"][0]

        # Retrieve or use default values for temperature and max_tokens
        temperature = params.get("temperature", 0.1) if params else 0.1
        max_tokens = params.get("max_tokens", 1000) if params else 1000

        # Build the prompt
        prompt = self._build_prompt(prompt)

        # Encode the input and generate prediction
        # NB: Sending the tokenized inputs to the GPU here explicitly will not work if your system does not have CUDA support.
        # If attempting to run this with GPU support, change 'cpu' to 'cuda' for maximum performance
        encoded_input = self.tokenizer.encode(prompt, return_tensors="pt").to("cpu")
        output = self.model.generate(
            encoded_input,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=max_tokens,
        )

        # Removing the prompt from the generated text
        prompt_length = len(self.tokenizer.encode(prompt, return_tensors="pt")[0])
        generated_response = self.tokenizer.decode(
            output[0][prompt_length:], skip_special_tokens=True
        )

        return {"candidates": [generated_response]}
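The last step of `predict` relies on the generated sequence starting with the prompt tokens, so that slicing at the prompt length leaves only the response. The toy illustration below uses plain Python lists; the token IDs are invented for illustration, and no model or tokenizer is involved.

```python
# Hypothetical token IDs; a real tokenizer would produce different values.
prompt_ids = [101, 7, 42, 9]
new_ids = [55, 56, 57]

# For a causal LM, the generated sequence begins with the prompt tokens.
output_ids = prompt_ids + new_ids

# Slicing at the prompt length keeps only the newly generated tokens.
prompt_length = len(prompt_ids)
response_ids = output_ids[prompt_length:]
print(response_ids)  # [55, 56, 57]
```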
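Building the Prompt
A key aspect of our custom pyfunc is the construction of the model prompt. Rather than requiring the end user to understand and build this prompt, the custom pyfunc handles it. This keeps the end-user interface simple and consistent, no matter how complex the model's requirements are.
See the _build_prompt() method in the class above for how custom input-processing logic can be added to a custom pyfunc to translate user input into the format that the wrapped model instance expects.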
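To see what the model actually receives, the template can be exercised outside the class. The sketch below is a standalone variant of `_build_prompt()` that joins the same four pieces with explicit newlines; the function name `build_prompt` is hypothetical.

```python
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_prompt(instruction: str) -> str:
    """Wrap a raw user question in the instruction template."""
    return f"{INTRO_BLURB}\n{INSTRUCTION_KEY}\n{instruction}\n{RESPONSE_KEY}\n"


print(build_prompt("What is machine learning?"))
```

The caller supplies only the question; the intro blurb and the two section keys are added for them.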
import numpy as np
import pandas as pd
import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types import ColSpec, DataType, ParamSchema, ParamSpec, Schema
# Define input and output schema
input_schema = Schema(
    [
        ColSpec(DataType.string, "prompt"),
    ]
)
output_schema = Schema([ColSpec(DataType.string, "candidates")])

parameters = ParamSchema(
    [
        ParamSpec("temperature", DataType.float, np.float32(0.1), None),
        ParamSpec("max_tokens", DataType.integer, np.int32(1000), None),
    ]
)
signature = ModelSignature(inputs=input_schema, outputs=output_schema, params=parameters)
# Define input example
input_example = pd.DataFrame({"prompt": ["What is machine learning?"]})
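The defaults declared in the ParamSchema (0.1 and 1000) match the fallback values coded in `predict`. The override behaviour can be pictured with a plain-Python sketch. `resolve_params` is a hypothetical helper, not an MLflow function, and it only mirrors the fallback logic in `predict`.

```python
# Same defaults as the signature and the predict() method.
DEFAULTS = {"temperature": 0.1, "max_tokens": 1000}


def resolve_params(params=None):
    """Overlay caller-supplied values on the defaults; None means no overrides."""
    resolved = dict(DEFAULTS)
    resolved.update(params or {})
    return resolved


print(resolve_params())                      # {'temperature': 0.1, 'max_tokens': 1000}
print(resolve_params({"temperature": 0.6}))  # {'temperature': 0.6, 'max_tokens': 1000}
```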
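Set the experiment that we will log our custom model to
If the experiment does not already exist, MLflow creates a new experiment with this name and notifies you that it has done so.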
# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.
# mlflow.set_tracking_uri("http://127.0.0.1:8080")
mlflow.set_experiment(experiment_name="mpt-7b-instruct-evaluation")
2023/11/29 17:33:23 INFO mlflow.tracking.fluent: Experiment with name 'mpt-7b-instruct-evaluation' does not exist. Creating a new experiment.
<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/custom-pyfunc-for-llms/notebooks/mlruns/265930820950682761', creation_time=1701297203895, experiment_id='265930820950682761', last_update_time=1701297203895, lifecycle_stage='active', name='mpt-7b-instruct-evaluation', tags={}>
# Get the current base version of torch that is installed, without specific version modifiers
torch_version = torch.__version__.split("+")[0]
# Start an MLflow run context and log the MPT-7B model wrapper along with the param-included signature to
# allow for overriding parameters at inference time
with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        "mpt-7b-instruct",
        python_model=MPT(),
        # NOTE: the artifacts dictionary mapping is critical! This dict is used by the load_context() method in our MPT() class.
        artifacts={"snapshot": snapshot_location},
        pip_requirements=[
            f"torch=={torch_version}",
            f"transformers=={transformers.__version__}",
            f"accelerate=={accelerate.__version__}",
            "einops",
            "sentencepiece",
        ],
        input_example=input_example,
        signature=signature,
    )
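The torch pin above uses only the part of the version string before any "+". Some torch builds report a version with a local suffix, and a pin that includes the suffix may not resolve from the default package index. A small sketch of the same split, with an illustrative version string and a hypothetical helper name:

```python
def base_version(version: str) -> str:
    """Drop a local version suffix such as "+cu118" from a version string."""
    return version.split("+")[0]


# "2.1.0+cu118" is an illustrative value, not the version used in this tutorial.
print(base_version("2.1.0+cu118"))  # 2.1.0
print(base_version("2.1.0"))        # 2.1.0
```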
Load the saved model
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
Test the model for inference
# The execution of this is commented out for the purposes of runtime on CPU.
# If you are running this on a system with a sufficiently powerful GPU, you may uncomment and interface with the model!
# loaded_model.predict(pd.DataFrame(
# {"prompt": ["What is machine learning?"]}), params={"temperature": 0.6}
# )
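If the logged model is later served over REST, the same prompt and parameter override would travel as JSON rather than as a DataFrame. The sketch below only builds the request body. It assumes the MLflow 2.x scoring-server layout with `dataframe_split` and `params` keys, so check the documentation for your MLflow version before relying on it.

```python
import json

# Assumed MLflow 2.x request layout: a split-oriented DataFrame plus optional params.
payload = {
    "dataframe_split": {
        "columns": ["prompt"],
        "data": [["What is machine learning?"]],
    },
    "params": {"temperature": 0.6},
}

body = json.dumps(payload)
print(body)
```

The `columns` entry matches the "prompt" column in the signature, and `params` carries the same temperature override as the commented-out predict call above.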
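Conclusion
In this tutorial, we saw the power and flexibility of MLflow's custom pyfunc. By understanding the specific needs of our model and defining a custom pyfunc to meet them, we can ensure a smooth deployment process and a user-friendly interface.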