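Fine-Tuning Open-Source LLMs Using QLoRA with MLflow and PEFT

Overview

Many powerful open-source LLMs have emerged and are easy to obtain. However, they are not ready to deploy to production out of the box; you need to fine-tune them for your specific task, such as a chatbot or content generation. One challenge is that training LLMs is usually very expensive. Even if your fine-tuning dataset is small, the backpropagation step has to compute gradients for billions of parameters. For example, fully fine-tuning the Llama7B model requires 112GB of VRAM, i.e. at least two 80GB A100 GPUs. Fortunately, there has been a lot of research on reducing the cost of LLM fine-tuning.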
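The 112GB figure is consistent with a common back-of-the-envelope estimate. The sketch below assumes full-precision training with the Adam optimizer: 4 bytes per parameter for the weights, 4 for the gradients, and 8 for the two optimizer states. These byte counts are an assumption made for illustration, not something stated in this tutorial, and activation memory is ignored.

```python
# Rough VRAM estimate for full fine-tuning.
# Assumed: fp32 weights, fp32 gradients, and two fp32 Adam states per parameter.
BYTES_WEIGHTS = 4
BYTES_GRADIENTS = 4
BYTES_ADAM_STATES = 8


def full_finetune_vram_gb(num_params: int) -> float:
    """Return the estimated training memory in GB (10^9 bytes), excluding activations."""
    bytes_per_param = BYTES_WEIGHTS + BYTES_GRADIENTS + BYTES_ADAM_STATES
    return num_params * bytes_per_param / 1e9


print(full_finetune_vram_gb(7_000_000_000))
```

Under these assumptions, a 7-billion-parameter model lands on the figure quoted above, which is why a single 80GB GPU is not enough for full fine-tuning.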

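In this tutorial, we will demonstrate how to build a powerful text-to-SQL generator by fine-tuning the Mistral 7B model on a single GPU with 24GB of VRAM.

What You Will Learn

Get hands-on experience with a typical LLM fine-tuning process.
Understand how to use QLoRA and PEFT to overcome GPU memory limits during fine-tuning.
Use MLflow to manage the model training cycle by logging model artifacts, hyperparameters, metrics, and prompts.
Save prompt templates and default inference parameters (e.g. max_token_length) in MLflow to simplify the prediction interface.

Key Actors

In this tutorial, you will learn the techniques and methods behind efficient LLM fine-tuning by actually running the code. Each cell is explained in more detail below, but let's start with a brief preview of the main libraries and methods used in this tutorial.

Mistral-7B-v0.1 is a pretrained text-generation model with 7 billion parameters, developed by mistral.ai. The model employs various optimization techniques such as Grouped-Query Attention, Sliding-Window Attention, and a Byte-fallback BPE tokenizer, and outperforms Llama 2 13B on benchmarks with fewer parameters.
QLoRA is a novel method that allows us to fine-tune large foundation models with limited GPU resources. It reduces the number of trainable parameters by learning pairs of rank-decomposition matrices, and applies 4-bit quantization to the frozen pretrained model to further reduce the memory footprint.
PEFT is a library developed by HuggingFace🤗 that lets developers easily integrate various optimization methods with the pretrained models available on the HuggingFace Hub. With PEFT, you can apply QLoRA to a pretrained model with a few lines of configuration and run fine-tuning just like normal Transformers model training.
MLflow manages the exploding number of configurations, assets, and metrics during LLM training on your behalf. MLflow is natively integrated with Transformers and PEFT, and plays a crucial role in organizing the fine-tuning cycle.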

1. Environment Set-up

Hardware Requirements

Please make sure your GPU has at least 20GB of VRAM available. This notebook has been tested on a single NVIDIA A10G GPU with 24GB of VRAM.

%sh nvidia-smi

Wed Feb 21 07:16:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   15C    P8              16W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                       
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
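If you would rather check the requirement programmatically, a minimal sketch follows. It parses the `used / total` memory column from `nvidia-smi` table output such as the one above. The helper name `free_vram_mib` is hypothetical, and expressing the 20GB requirement as 20 * 1024 MiB is an assumption.

```python
import re


def free_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Return the free memory in MiB for each 'used / total' pair found in the output."""
    pattern = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")
    return [int(total) - int(used) for used, total in pattern.findall(nvidia_smi_output)]


# The GPU row copied from the nvidia-smi output above.
sample_row = "|  0%   15C    P8              16W / 300W |      4MiB / 23028MiB |      0%      Default |"
free = free_vram_mib(sample_row)
print(free, all(mib >= 20 * 1024 for mib in free))
```

In a real check you would pass the captured text of the `nvidia-smi` command instead of the hard-coded sample row.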

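Install Python Libraries

This tutorial uses the following Python libraries:

mlflow - for tracking parameters and metrics, and for saving the trained model. Version 2.11.0 or later is required to log PEFT models with MLflow.
transformers - for defining the model, tokenizer, and trainer.
peft - for creating a LoRA adapter on top of the Transformer model.
bitsandbytes - for loading the base model with 4-bit quantization for QLoRA.
accelerate - a dependency required by bitsandbytes.
datasets - for loading the training dataset from the HuggingFace hub.

Note: You may need to restart the Python kernel after installing these dependencies.

This notebook has been tested with mlflow==2.11.0, transformers==4.35.2, peft==0.8.2, bitsandbytes==0.42.0, accelerate==0.27.2, and datasets==2.17.1.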

%pip install "mlflow>=2.11.0"
%pip install transformers peft accelerate bitsandbytes datasets -q -U
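After installation, it can help to confirm which versions were actually picked up. The following is a minimal sketch using only the standard library; the helpers `version_tuple` and `meets_minimum` are hypothetical names introduced here, and the comparison deliberately ignores pre-release suffixes such as `rc1`.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(version_string: str) -> tuple:
    """Convert '2.11.0' to (2, 11, 0), keeping only the leading digits of each component."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to three components so that '2.11' compares equal to '2.11.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


for package in ("mlflow", "transformers", "peft", "bitsandbytes", "accelerate", "datasets"):
    try:
        print(package, version(package))
    except PackageNotFoundError:
        print(package, "is not installed")
```

For example, `meets_minimum(version("mlflow"), "2.11.0")` tells you whether the installed MLflow is recent enough to log PEFT models.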

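2. Dataset Preparation

Load Dataset from HuggingFace Hub

We will use the b-mc2/sql-create-context dataset from the Hugging Face Hub for this tutorial. The dataset contains 78.6k pairs of natural language queries and their corresponding SQL statements, which makes it well suited to training a text-to-SQL model. The dataset has three columns:

question: A natural language question posed about the data.
context: Additional information about the data, such as the schema of the table being queried.
answer: The SQL query that represents the expected output.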

import pandas as pd
from datasets import load_dataset
from IPython.display import HTML, display

dataset_name = "b-mc2/sql-create-context"
dataset = load_dataset(dataset_name, split="train")


def display_table(dataset_or_sample):
  # A helper function to nicely display a dataset or a single sample that contains multi-line strings
  pd.set_option("display.max_colwidth", None)
  pd.set_option("display.width", None)
  pd.set_option("display.max_rows", None)

  if isinstance(dataset_or_sample, dict):
      df = pd.DataFrame(dataset_or_sample, index=[0])
  else:
      df = pd.DataFrame(dataset_or_sample)

  html = df.to_html().replace("\n", "<br>")
  styled_html = f"""<style> .dataframe th, .dataframe tbody td {{ text-align: left; padding-right: 30px; }} </style> {html}"""
  display(HTML(styled_html))


display_table(dataset.select(range(3)))

	question	context	answer
0	How many heads of the departments are older than 56 ?	CREATE TABLE head (age INTEGER)	SELECT COUNT(*) FROM head WHERE age > 56
1	List the name, born state and age of the heads of departments ordered by age.	CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)	SELECT name, born_state, age FROM head ORDER BY age
2	List the creation year, name and budget of each department.	CREATE TABLE department (creation VARCHAR, name VARCHAR, budget_in_billions VARCHAR)	SELECT creation, name, budget_in_billions FROM department

Split Train and Test Dataset

The b-mc2/sql-create-context dataset has a single split, "train". We will set aside 20% of it as test samples.

split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(f"Training dataset contains {len(train_dataset)} text-to-SQL pairs")
print(f"Test dataset contains {len(test_dataset)} text-to-SQL pairs")

Training dataset contains 62861 text-to-SQL pairs
Test dataset contains 15716 text-to-SQL pairs
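The two counts above follow directly from the dataset size. As a quick arithmetic check, assuming the test split is rounded up and the remainder goes to training (this rounding rule is inferred from the printed counts, not taken from the library's documentation):

```python
import math

# Sizes printed by the cell above.
total_pairs = 62861 + 15716

# 20% of the data rounded up goes to the test split, the rest to training.
test_pairs = math.ceil(total_pairs * 0.2)
train_pairs = total_pairs - test_pairs

print(total_pairs, train_pairs, test_pairs)
```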

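Define Prompt Template

The Mistral 7B model is a text comprehension model, so we have to construct a text prompt that contains the user's question, the context, and our system instructions. The new prompt column in the dataset will hold the text prompt fed to the model during training. Note that we also include the expected response in the prompt, so that the model can be trained in a self-supervised manner.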

PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

### Table:
{context}

### Question:
{question}

### Response:
{output}"""


def apply_prompt_template(row):
  prompt = PROMPT_TEMPLATE.format(
      question=row["question"],
      context=row["context"],
      output=row["answer"],
  )
  return {"prompt": prompt}


train_dataset = train_dataset.map(apply_prompt_template)
display_table(train_dataset.select(range(1)))

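	question	context	answer	prompt
0	Which Perth has Gold Coast yes, Sydney yes, Melbourne yes, and Adelaide yes?	CREATE TABLE table_name_56 (perth VARCHAR, adelaide VARCHAR, melbourne VARCHAR, gold_coast VARCHAR, sydney VARCHAR)	SELECT perth FROM table_name_56 WHERE gold_coast = "yes" AND sydney = "yes" AND melbourne = "yes" AND adelaide = "yes"	You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question. ### Table: CREATE TABLE table_name_56 (perth VARCHAR, adelaide VARCHAR, melbourne VARCHAR, gold_coast VARCHAR, sydney VARCHAR) ### Question: Which Perth has Gold Coast yes, Sydney yes, Melbourne yes, and Adelaide yes? ### Response: SELECT perth FROM table_name_56 WHERE gold_coast = "yes" AND sydney = "yes" AND melbourne = "yes" AND adelaide = "yes"

Padding the Training Dataset

As the last step of dataset preparation, we need to apply padding to the training dataset. Padding ensures that all input sequences in a batch have the same length.

A crucial point is that the padding must be added on the left. The model generates tokens autoregressively, meaning it continues from the last token. If padding were added on the right, the model would generate new tokens after the padding tokens, leaving padding tokens in the middle of the output sequence.

Padding to right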

Today |  is  |   a    |  cold  |  <pad>  ==generate=>  "Today is a cold <pad> day"
 How  |  to  | become |  <pad> |  <pad>  ==generate=>  "How to become a <pad> <pad> great engineer".

Padding to left

<pad> |  Today  |  is  |  a   |  cold     ==generate=>  "<pad> Today is a cold day"
<pad> |  <pad>  |  How |  to  |  become   ==generate=>  "<pad> <pad> How to become a great engineer".

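To make the difference concrete, here is a minimal, library-free sketch of both padding strategies applied to lists of word-level tokens. The helper `pad_batch` is a hypothetical name used for illustration; the real tokenizer configured in the next cell works on token IDs rather than words.

```python
PAD = "<pad>"


def pad_batch(sequences, side="left"):
    """Pad every token list to the length of the longest one, on the given side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [PAD] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded


batch = [["Today", "is", "a", "cold"], ["How", "to", "become"]]

for row in pad_batch(batch, side="left"):
    print(row)
```

With left padding, the last position of every row holds a real token, so generation continues from meaningful context rather than from a padding token.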
from transformers import AutoTokenizer

base_model_id = "mistralai/Mistral-7B-v0.1"

# You can use a different max length if your custom dataset has shorter/longer input sequences.
MAX_LENGTH = 256

tokenizer = AutoTokenizer.from_pretrained(
  base_model_id,
  model_max_length=MAX_LENGTH,
  padding_side="left",
  add_eos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token


def tokenize_and_pad_to_fixed_length(sample):
  result = tokenizer(
      sample["prompt"],
      truncation=True,
      max_length=MAX_LENGTH,
      padding="max_length",
  )
  result["labels"] = result["input_ids"].copy()
  return result


tokenized_train_dataset = train_dataset.map(tokenize_and_pad_to_fixed_length)

assert all(len(x["input_ids"]) == MAX_LENGTH for x in tokenized_train_dataset)

display_table(tokenized_train_dataset.select(range(1)))

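3. Load the Base Model (with 4-bit quantization)

Next, we load the Mistral 7B model, which serves as the base model for fine-tuning. The model can be loaded from the HuggingFace Hub repository mistralai/Mistral-7B-v0.1 using the Transformers from_pretrained() API. Here, however, we also pass a quantization_config parameter.

This parameter embodies the key technique of QLoRA, which significantly reduces memory usage during fine-tuning. The following paragraphs describe the method and the effect of this configuration in detail. Feel free to skip them if they seem complicated. After all, we rarely need to modify the quantization_config values ourselves :)

How It Works

In short, QLoRA is a combination of Quantization and LoRA. To understand how it works, it is simpler to start with LoRA. LoRA (Low Rank Adaptation) is an earlier method for resource-efficient fine-tuning that reduces the number of trainable parameters through matrix decomposition. Let W' denote the final weight matrix after fine-tuning. In LoRA, W' is approximated by the sum of the original weight and its update, i.e. W + ΔW, and the delta part is then decomposed into two low-dimensional matrices, i.e. ΔW ≈ AB. Suppose W is m x m, and we pick a small r as the rank of A and B, where A is m x r and B is r x m. The number of trainable parameters, originally quadratic in the size of W (i.e. m^2), becomes 2mr after the decomposition. Empirically, we can choose an r much smaller than the dimension of the full weight matrix, e.g. 32 or 64, so this significantly reduces the number of parameters to train.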
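The saving from the decomposition is easy to quantify. The sketch below compares m^2 with 2mr for a single square weight matrix; the size m = 4096 is an illustrative assumption rather than a value taken from the text above.

```python
def full_update_params(m: int) -> int:
    """Trainable parameters when the whole m x m update matrix is learned."""
    return m * m


def lora_update_params(m: int, r: int) -> int:
    """Trainable parameters for A (m x r) plus B (r x m)."""
    return 2 * m * r


m, r = 4096, 32  # assumed matrix size and LoRA rank
full = full_update_params(m)
lora = lora_update_params(m, r)
print(full, lora, f"{100 * lora / full:.4f}%")
```

Because the ratio is 2r / m, doubling the rank doubles the adapter size, while larger weight matrices benefit even more from the same rank.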

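QLoRA extends LoRA and uses the same matrix decomposition strategy. However, it further reduces memory usage by applying 4-bit quantization to the frozen pretrained model W. According to the QLoRA authors, the largest memory usage during LoRA fine-tuning comes from backpropagating through the frozen parameters W to compute the gradients for the adapters A and B. Quantizing W to 4 bits therefore significantly reduces the overall memory consumption. This is achieved by setting load_in_4bit=True below.

In addition, QLoRA introduces other techniques to optimize resource usage without significantly affecting model performance. Please refer to the paper for the technical details; we enable them by setting the following quantization configs in bitsandbytes:

The 4-bit NormalFloat type is specified by bnb_4bit_quant_type="nf4".
Double quantization is activated by bnb_4bit_use_double_quant=True.
QLoRA de-quantizes the 4-bit weights to a higher precision when computing the gradients for A and B, to prevent performance degradation. This data type is specified by bnb_4bit_compute_dtype=torch.bfloat16.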
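To build intuition for the quantize-then-de-quantize cycle, here is a deliberately simplified sketch. It uses uniform absmax quantization on one small block of made-up weights, which is not the NF4 codebook or the double quantization that bitsandbytes applies; it only illustrates storing weights as 4-bit codes plus a scale and restoring approximate values before computation.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes in [-(2^(bits-1) - 1), 2^(bits-1) - 1] using an absmax scale."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [round(v / scale * levels) for v in values]
    return codes, scale


def dequantize_block(codes, scale, bits=4):
    """Restore approximate floats from the integer codes and the block scale."""
    levels = 2 ** (bits - 1) - 1
    return [code / levels * scale for code in codes]


weights = [0.12, -0.40, 0.03, 0.31, -0.07, 0.22, -0.18, 0.05]  # made-up example block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The restored values differ from the originals by at most half a quantization step, which is the precision traded away for the smaller memory footprint.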

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
  # Load the model with 4-bit quantization
  load_in_4bit=True,
  # Use double quantization
  bnb_4bit_use_double_quant=True,
  # Use 4-bit Normal Float for storing the base model weights in GPU memory
  bnb_4bit_quant_type="nf4",
  # De-quantize the weights to 16-bit (Brain) float before the forward/backward pass
  bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=quantization_config)

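How Does the Base Model Perform?

First, let's assess how the vanilla Mistral model performs on the SQL generation task before any fine-tuning. As expected, the model does not produce a correct SQL query; instead, it generates unrelated answers in natural language. This result shows the need to fine-tune the model for our specific task.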

import transformers

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
pipeline = transformers.pipeline(model=model, tokenizer=tokenizer, task="text-generation")

sample = test_dataset[1]
prompt = PROMPT_TEMPLATE.format(
  context=sample["context"], question=sample["question"], output=""
)  # Leave the answer part blank

with torch.no_grad():
  response = pipeline(prompt, max_new_tokens=256, repetition_penalty=1.15, return_full_text=False)

display_table({"prompt": prompt, "generated_query": response[0]["generated_text"]})

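	prompt	generated_query
0	You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question. ### Table: CREATE TABLE table_name_61 (game INTEGER, opponent VARCHAR, record VARCHAR) ### Question: What is the lowest numbered game against Phoenix with a record of 29-17? ### Response:	A: The lowest numbered game against Phoenix was played on March 4, 2018. The score was PHO 115 - DAL 106. What is the highest numbered game against Phoenix? A: The highest numbered game against Phoenix was played on March 4, 2018. The score was PHO 115 - DAL 106. Which players have played point guard for Dallas in a regular season game against Phoenix?

4. Define a PEFT Model

As discussed earlier, QLoRA stands for Quantization + LoRA. Having applied the quantization part, we now move on to the LoRA part. Although the mathematics behind LoRA is intricate, PEFT helps us by simplifying the process of applying LoRA to a pretrained Transformer model.

In the next cell, we create a LoraConfig with various LoRA settings. Unlike the earlier quantization_config, these hyperparameters may need tuning to reach the best model performance for your specific task. MLflow facilitates this process by tracking the hyperparameters, the associated model, and its results.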

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Enabling gradient checkpointing, to make the training further efficient
model.gradient_checkpointing_enable()
# Set up the model for quantization-aware training e.g. casting layers, parameter freezing, etc.
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
  task_type="CAUSAL_LM",
  # This is the rank of the decomposed matrices A and B to be learned during fine-tuning. A smaller number will save more GPU memory but might result in worse performance.
  r=32,
  # This is the coefficient for the learned ΔW factor, so the larger number will typically result in a larger behavior change after fine-tuning.
  lora_alpha=64,
  # Drop out ratio for the layers in LoRA adaptors A and B.
  lora_dropout=0.1,
  # We fine-tune all linear layers in the model. It might sound a bit large, but the trainable adapter size is still only **1.16%** of the whole model.
  target_modules=[
      "q_proj",
      "k_proj",
      "v_proj",
      "o_proj",
      "gate_proj",
      "up_proj",
      "down_proj",
      "lm_head",
  ],
  # Bias parameters to train. 'none' is recommended to keep the original model performing equally when turning off the adapter.
  bias="none",
)

peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

trainable params: 85,041,152 || all params: 7,326,773,248 || trainable%: 1.1606903765339511

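At the end of the cell, we display the number of trainable parameters during fine-tuning and their percentage of the total model parameters. Here, we train only 1.16% of the roughly 7 billion parameters.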
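The reported count can be reproduced by hand. The sketch below sums r * (in + out) over every targeted linear layer. The layer shapes (hidden size 4096, MLP size 14336, key/value projection size 1024, vocabulary size 32000, 32 decoder blocks) are assumptions taken from the published Mistral-7B-v0.1 configuration, not from this tutorial.

```python
# Assumed Mistral-7B-v0.1 dimensions; RANK matches r=32 in the LoraConfig above.
HIDDEN, INTERMEDIATE, KV_DIM, VOCAB, LAYERS, RANK = 4096, 14336, 1024, 32000, 32, 32

# (input features, output features) of each targeted linear layer inside one decoder block.
PER_BLOCK_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """The two low-rank factors together hold rank * (in + out) weights."""
    return rank * (in_features + out_features)


per_block = sum(lora_params(i, o, RANK) for i, o in PER_BLOCK_SHAPES.values())
# lm_head (HIDDEN -> VOCAB) appears once, outside the decoder blocks.
total = per_block * LAYERS + lora_params(HIDDEN, VOCAB, RANK)

print(f"{total:,}")
print(f"{100 * total / 7_326_773_248:.2f}%")
```

Under these assumptions the sum matches the trainable parameter count printed by print_trainable_parameters(), which confirms that the adapters are the only trainable weights.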

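That's it! PEFT makes the LoRA setup super easy.

An additional bonus is that the PEFT model exposes the same interface as a Transformers model. This means that everything from here on is very similar to the standard model training process with Transformers.

5. Kick Off a Training Job

As in conventional Transformers training, we first set up a Trainer object to organize the training iterations. There are many hyperparameters to configure, but MLflow will manage them on your behalf.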

from datetime import datetime

import transformers
from transformers import TrainingArguments

import mlflow

# Comment-out this line if you are running the tutorial on Databricks
mlflow.set_experiment("MLflow PEFT Tutorial")

training_args = TrainingArguments(
  # Set this to mlflow for logging your training
  report_to="mlflow",
  # Name the MLflow run
  run_name=f"Mistral-7B-SQL-QLoRA-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}",
  # Replace with your output destination
  output_dir="YOUR_OUTPUT_DIR",
  # For the following arguments, refer to https://hugging-face.cn/docs/transformers/main_classes/trainer
  per_device_train_batch_size=2,
  gradient_accumulation_steps=4,
  gradient_checkpointing=True,
  optim="paged_adamw_8bit",
  bf16=True,
  learning_rate=2e-5,
  lr_scheduler_type="constant",
  max_steps=500,
  save_steps=100,
  logging_steps=100,
  warmup_steps=5,
  # https://discuss.huggingface.co/t/training-llama-with-lora-on-multiple-gpus-may-exist-bug/47005/3
  ddp_find_unused_parameters=False,
)

trainer = transformers.Trainer(
  model=peft_model,
  train_dataset=tokenized_train_dataset,
  data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
  args=training_args,
)

# use_cache=True is incompatible with gradient checkpointing.
peft_model.config.use_cache = False

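To enable MLflow logging, specify report_to="mlflow" and name your training trial with the run_name parameter. This starts an MLflow run that automatically logs training metrics, hyperparameters, configurations, and the trained model.

Training may take several hours, depending on your hardware. However, the main goal of this tutorial is to familiarize you with the fine-tuning process using PEFT and MLflow, not to cultivate a high-performing SQL generator. If model performance is not a concern, you can specify a smaller number of steps or interrupt the following cell and proceed with the rest of the notebook.

trainer.train()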

[500/500 45:41, Epoch 0/1]
Step	Training Loss
100	0.681700
200	0.522400
300	0.507300
400	0.494800
500	0.474600

TrainOutput(global_step=500, training_loss=0.5361956100463867, metrics={'train_runtime': 2747.9223, 'train_samples_per_second': 1.456, 'train_steps_per_second': 0.182, 'total_flos': 4.421038813216768e+16, 'train_loss': 0.5361956100463867, 'epoch': 0.06})
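The throughput numbers in this output can be cross-checked against the training arguments. The sketch below assumes a single-GPU run, as described in the hardware section, and that the epoch value is simply the fraction of the training set seen.

```python
# Values copied from the TrainingArguments and TrainOutput above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 500
train_runtime_seconds = 2747.9223
train_dataset_size = 62861
num_gpus = 1  # assumption: single-GPU run

# Samples contributing to each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = effective_batch * max_steps

print(effective_batch, samples_seen)
print(round(samples_seen / train_runtime_seconds, 3))  # compare with train_samples_per_second
print(round(max_steps / train_runtime_seconds, 3))     # compare with train_steps_per_second
print(round(samples_seen / train_dataset_size, 2))     # compare with epoch
```

The run therefore touches only a small fraction of the training set, which is worth keeping in mind when judging the quality of the resulting model.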

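[Figure: Training loss]

6. Save the PEFT Model to MLflow

Hooray! We have successfully fine-tuned the Mistral 7B model into a SQL generator. Before concluding the training, one final step is to save the trained PEFT model to MLflow.

Set Prompt Template and Default Inference Parameters (optional)

The prediction behavior of LLMs is determined not only by the model weights, but also largely by the prompt and the inference parameters such as `max_token_length` and `repetition_penalty`. It is therefore highly advisable to save this metadata along with the model, so that you get consistent behavior when you load the model later.

Prompt Template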

# Basically the same format as we applied to the dataset. However, the template only accepts {prompt} variable so both table and question need to be fed in there.
prompt_template = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

{prompt}

### Response:
"""

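The user prompt itself is free text, but you can harness the input by applying a "template". The MLflow Transformer flavor supports saving a prompt template with the model and applies it automatically before prediction. This also lets you hide the system prompt from model clients. To save a prompt template, define a single string that contains the `{prompt}` variable and pass it as the `prompt_template` argument of the `mlflow.transformers.log_model` API. For more detailed usage of this feature, refer to Saving Prompt Templates with Transformer Pipelines.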
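Conceptually, applying the template amounts to substituting the user's input for the `{prompt}` placeholder. The following is a minimal sketch of that substitution using plain string formatting; `apply_template` is a hypothetical helper, and MLflow's internal implementation may differ.

```python
SERVING_TEMPLATE = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

{prompt}

### Response:
"""


def apply_template(template: str, user_prompt: str) -> str:
    """Insert the user's text into the single {prompt} placeholder of the template."""
    return template.format(prompt=user_prompt)


user_prompt = """### Table:
CREATE TABLE head (age INTEGER)

### Question:
How many heads of the departments are older than 56 ?"""

full_prompt = apply_template(SERVING_TEMPLATE, user_prompt)
print(full_prompt)
```

The client only supplies the table and the question; the system instruction and the trailing response marker come from the stored template.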

Inference Parameters

from mlflow.models import infer_signature

sample = train_dataset[1]

# MLflow infers schema from the provided sample input/output/params
signature = infer_signature(
  model_input=sample["prompt"],
  model_output=sample["answer"],
  # Parameters are saved with default values if specified
  params={"max_new_tokens": 256, "repetition_penalty": 1.15, "return_full_text": False},
)
signature

inputs: 
[string (required)]
outputs: 
[string (required)]
params: 
['max_new_tokens': long (default: 256), 'repetition_penalty': double (default: 1.15), 'return_full_text': boolean (default: False)]

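Inference parameters can be saved with the MLflow model as part of the Model Signature. The signature defines the input and output format of the model together with additional parameters passed to the model at prediction time, and you can let MLflow infer it from sample inputs using the mlflow.models.infer_signature API. If you pass concrete values for the parameters, MLflow treats them as defaults and applies them at inference time when the user does not provide them. For more details about the Model Signature, please refer to the MLflow documentation.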
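The default-value behavior described above can be pictured as a simple dictionary merge. The sketch below illustrates the idea only and is not MLflow's implementation; `resolve_params` is a hypothetical helper, and rejecting unknown parameter names is a design choice of this sketch.

```python
DEFAULT_PARAMS = {"max_new_tokens": 256, "repetition_penalty": 1.15, "return_full_text": False}


def resolve_params(defaults: dict, overrides: dict = None) -> dict:
    """Start from the saved defaults and let caller-supplied values take precedence."""
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"Unknown inference parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


print(resolve_params(DEFAULT_PARAMS))
print(resolve_params(DEFAULT_PARAMS, {"max_new_tokens": 64}))
```

A loaded pyfunc model accepts such overrides through the `params` argument of `predict()`; parameters that are not passed fall back to the defaults stored in the signature.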

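Save the PEFT Model to MLflow

Finally, we call the mlflow.transformers.log_model API to log the model to MLflow. A few key points to remember when logging a PEFT model to MLflow:

MLflow logs the Transformer model as a Pipeline. A pipeline bundles a model with its tokenizer (or other components, depending on the task type) and simplifies the prediction steps into an easy-to-use interface, making it a great tool for ensuring reproducibility. In the code below, we pass the model and tokenizer as a dictionary, and MLflow automatically deduces the correct pipeline type and saves it.
MLflow does not save the base model weights for a PEFT model. When mlflow.transformers.log_model runs, MLflow saves only the small number of trained parameters, i.e. the PEFT adapter. For the base model, MLflow records a reference to the HuggingFace hub (repository name and commit hash) and downloads the base model weights on the fly when the PEFT model is loaded. This approach significantly reduces storage usage and logging latency; for example, the logged artifacts in this tutorial are smaller than 1GB, whereas the full Mistral 7B model is about 20GB.
Save a tokenizer without padding. During fine-tuning, we applied padding to the dataset to standardize the sequence length within a batch. Padding is no longer needed at inference time, so we save a different tokenizer without padding. This ensures that the loaded model can be used for inference right away.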

import mlflow

# Get the ID of the MLflow Run that was automatically created above
last_run_id = mlflow.last_active_run().info.run_id

# Save a tokenizer without padding because it is only needed for training
tokenizer_no_pad = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True)

# If you interrupt the training, uncomment the following line to stop the MLflow run
# mlflow.end_run()

with mlflow.start_run(run_id=last_run_id):
  mlflow.log_params(peft_config.to_dict())
  mlflow.transformers.log_model(
      transformers_model={"model": trainer.model, "tokenizer": tokenizer_no_pad},
      prompt_template=prompt_template,
      signature=signature,
      artifact_path="model",  # This is a relative path to save model files within MLflow run
  )

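Note: Currently, the PEFT adapter and config must be logged manually, while other information such as the dataset, metrics, and Trainer parameters is logged automatically. However, this process may be automated in future versions of MLflow and Transformers.

What's Logged to MLflow?

Let's briefly review what is logged to MLflow as a result of your training. To access the MLflow UI, run the `mlflow ui` command and open `http://localhost:PORT` (PORT is 5000 by default). Select the experiment "MLflow PEFT Tutorial" on the left side (or the notebook name when running on Databricks). Then click the latest MLflow run, named `Mistral-7B-SQL-QLoRA-2024-...`, to view the run details.

Parameters

The "Parameters" section shows hundreds of parameters specified for the Trainer, LoraConfig, and BitsAndBytesConfig, such as `learning_rate`, `r`, and `bnb_4bit_quant_type`. It also includes default parameters that were not explicitly specified, which is crucial for reproducibility, especially when a library's default values change.

Metrics

The "Metrics" section shows the model metrics collected during the run, such as `train_loss`. You can visualize these metrics with various types of graphs in the "Chart" tab.

Artifacts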

    model/
      ├─ peft/
      │  ├─ adapter_config.json       # JSON file of the LoraConfig
      │  ├─ adapter_model.safetensors # The weight file of the LoRA adapter
      │  └─ README.md                 # Empty README file generated by Transformers
      │
      ├─ LICENSE.txt                  # License information about the base model (Mistral-7B-v0.1)
      ├─ MLModel                      # Contains various metadata about your model
      ├─ conda.yaml                   # Dependencies to create conda environment
      ├─ model_card.md                # Model card text for the base model
      ├─ model_card_data.yaml         # Model card data for the base model
      ├─ python_env.yaml              # Dependencies to create Python virtual environment
      └─ requirements.txt             # Pip requirements for model inference

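The "Artifacts" section shows the files and directories saved in MLflow as a result of training. For Transformers PEFT training, you should see the files and directories listed above.

Model Metadata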

flavors:
  transformers:
    peft_adaptor: peft                                 # Points the location of the saved PEFT model
    pipeline_model_type: MistralForCausalLM            # The base model implementation
    source_model_name: mistralai/Mistral-7B-v0.1       # Repository name of the base model
    source_model_revision: xxxxxxx                     # Commit hash in the repository for the base model
    task: text-generation                              # Pipeline type
    torch_dtype: torch.bfloat16                        # Dtype for loading the model
    tokenizer_type: LlamaTokenizerFast                 # Tokenizer implementation

# Prompt template saved with the model above
metadata:
  prompt_template: 'You are a powerful text-to-SQL model. Given the SQL tables and
    natural language question, your job is to write SQL query that answers the question.


    {prompt}


    ### Response:

    '
# Defines the input and output format of the model, with additional inference parameters with default values
signature:
  inputs: '[{"type": "string", "required": true}]'
  outputs: '[{"type": "string", "required": true}]'
  params: '[{"name": "max_new_tokens", "type": "long", "default": 256, "shape": null},
    {"name": "repetition_penalty", "type": "double", "default": 1.15, "shape": null},
    {"name": "return_full_text", "type": "boolean", "default": false, "shape": null}]'

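The MLModel file stores detailed metadata about the PEFT adapter and the base model. The listing above is an excerpt of the MLModel file (some fields are omitted for brevity).

7. Load the Saved PEFT Model from MLflow

Finally, let's load the model logged in MLflow and evaluate its performance as a text-to-SQL generator. There are two ways to load a Transformer model in MLflow:
Use mlflow.transformers.load_model(). This method returns a native Transformers pipeline instance.

Use mlflow.pyfunc.load_model(). This method returns an MLflow PythonModel instance that wraps the Transformers pipeline and offers additional features over the native pipeline, such as (1) a unified predict() API for inference, (2) model signature enforcement, and (3) automatic application of the prompt template and default parameters if they were saved. Note that not all Transformer pipelines support pyfunc loading; refer to the MLflow documentation for the full list of supported pipeline types.

The first option is preferable if you wish to use the model through the native Transformers interface. The second option offers a simplified and unified interface across different model types, and is particularly handy for testing a model before production deployment. In the following code, we use mlflow.pyfunc.load_model() to show how it applies the prompt template and the default inference parameters defined above.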

Note: Calling load_model() loads a new model instance onto your GPU, which may exceed the GPU memory limit and trigger an out-of-memory (OOM) error, or cause the Transformers library to attempt to offload parts of the model to other devices or to disk. This offloading can lead to issues such as "ValueError: We need an offload_dir to dispatch this model according to this device_map." If you encounter this error, consider restarting the Python kernel and loading the model again.

Note: Restarting the Python kernel erases all intermediate states and variables from the cells above. Make sure the trained PEFT model is properly logged in MLflow before restarting.

# You can find the ID of the run on the Run detail page in the MLflow UI
mlflow_model = mlflow.pyfunc.load_model("runs:/YOUR_RUN_ID/model")

# We only input the table and the question, since the system prompt is added by the prompt template.
test_prompt = """
### Table:
CREATE TABLE table_name_50 (venue VARCHAR, away_team VARCHAR)

### Question:
When Essendon played away; where did they play?
"""

# Inference parameters like max_new_tokens are set to the default values specified in the Model Signature
generated_query = mlflow_model.predict(test_prompt)[0]
display_table({"prompt": test_prompt, "generated_query": generated_query})

	prompt	generated_query
0	### Table: CREATE TABLE table_name_50 (venue VARCHAR, away_team VARCHAR) ### Question: When Essendon played away; where did they play?	SELECT venue FROM table_name_50 WHERE away_team = "essendon"

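Perfect! The fine-tuned model now generates the SQL query correctly. As you can see in the code and result above, the system prompt and default inference parameters are applied automatically, so we do not have to pass them to the loaded model. This is very powerful when you deploy multiple models (or update an existing one) with different system prompts or parameters, because you do not have to edit the client's implementation; those details are abstracted behind the MLflow model :)

Conclusion

In this tutorial, you learned how to fine-tune a large language model for a text-to-SQL task using PEFT and QLoRA. You also learned the key role MLflow plays in the LLM fine-tuning process: it tracks parameters and metrics during fine-tuning and manages models and other assets.

What's Next?
Evaluate a Hugging Face LLM with MLflow - Model evaluation is a critical step in model development. Check out this guide to learn how to evaluate LLMs efficiently with MLflow, including LLM-as-a-judge.
Deploy MLflow Model to Production - MLflow models store rich metadata and provide a unified prediction interface, which streamlines deployment. Learn how to deploy your fine-tuned models to various targets such as AWS SageMaker, Azure ML, Kubernetes, and Databricks Model Serving, with detailed guidance and hands-on notebooks.
MLflow Transformers Flavor Documentation - Learn more about the MLflow and Transformers integration and continue with more tutorials.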
