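Fine-Tuning Open-Source LLMs with QLoRA, MLflow, and PEFT

Overview

Many powerful open-source LLMs have emerged and are easily accessible. However, they are not designed to be deployed to your production environment out of the box; instead, you have to fine-tune them for your specific task, such as a chatbot or content generation. One challenge is that training LLMs is usually very expensive. Even if your fine-tuning dataset is small, the backpropagation step has to compute gradients for billions of parameters. For example, fully fine-tuning the Llama 7B model requires 112GB of VRAM, i.e. at least two 80GB A100 GPUs. Fortunately, there has been a lot of research on reducing the cost of LLM fine-tuning.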
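The 112GB figure is consistent with a common rule of thumb of roughly 16 bytes per parameter for full fine-tuning with an Adam-style optimizer in mixed precision. The byte breakdown below is an assumption made for illustration, not something stated in this tutorial, and it ignores activation memory entirely.

```python
# Rough VRAM estimate for full fine-tuning.
# The per-parameter byte breakdown is an illustrative assumption, not a measurement.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_momentum": 4,
    "adam_variance": 4,
}


def full_finetune_vram_gb(num_params: float) -> float:
    """Estimate VRAM in GB (1 GB = 1e9 bytes), excluding activations."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(full_finetune_vram_gb(7e9))  # estimate for a 7-billion-parameter model
```

Under these assumptions a 7-billion-parameter model needs about 112GB before activations are counted, which is why the rest of the tutorial shrinks both the trainable parameters and the stored weights.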

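In this tutorial, we will demonstrate how to build a powerful text-to-SQL generator by fine-tuning the Mistral 7B model on a single GPU with 24GB of VRAM.

What You Will Learn

Hands-on experience with a typical LLM fine-tuning process.
How to use QLoRA and PEFT to overcome the GPU memory limitations of fine-tuning.
How to use MLflow to manage the model training cycle by logging model artifacts, hyperparameters, metrics, and prompts.
How to save a prompt template and default inference parameters (e.g. max_new_tokens) in MLflow to simplify the prediction interface.

Key Actors

In this tutorial, you will learn the techniques and methods behind efficient LLM fine-tuning by actually running the code. Each cell below comes with a more detailed explanation, but let's start with a brief preview of the main libraries and methods used in this tutorial.

The Mistral-7B-v0.1 model is a pretrained text-generation model with 7 billion parameters, developed by mistral.ai. It uses several optimization techniques, such as Grouped-Query Attention, Sliding-Window Attention, and a byte-fallback BPE tokenizer, and it outperforms Llama 2 13B on benchmarks with fewer parameters.
QLoRA is a novel method that allows us to fine-tune large foundation models with limited GPU resources. It reduces the number of trainable parameters by learning pairs of rank-decomposition matrices, and it also applies 4-bit quantization to the frozen pretrained model to further reduce the memory footprint.
PEFT is a library developed by HuggingFace🤗 that enables developers to easily combine various optimization methods with the pretrained models available on the HuggingFace Hub. With PEFT, you can apply QLoRA to a pretrained model with a few lines of configuration and run fine-tuning just like normal Transformers model training.
MLflow manages, on your behalf, the exploding number of configurations, assets, and metrics produced during LLM training. MLflow is natively integrated with Transformers and PEFT and plays a crucial role in organizing the fine-tuning cycle.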

1. Environment Setup

Hardware Requirements

Please make sure your GPU has at least 20GB of VRAM available. This notebook was tested on a single NVIDIA A10G GPU with 24GB of VRAM.

%sh nvidia-smi

Wed Feb 21 07:16:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   15C    P8              16W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                       
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

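Install Python Libraries

This tutorial uses the following Python libraries:

mlflow - for tracking parameters and metrics and for saving the trained model. Version 2.11.0 or later is required to log PEFT models with MLflow.
transformers - for defining the model, tokenizer, and trainer.
peft - for creating a LoRA adapter on top of the Transformers model.
bitsandbytes - for loading the base model with 4-bit quantization for QLoRA.
accelerate - a dependency required by bitsandbytes.
datasets - for loading the training dataset from the HuggingFace Hub.

Note: You may need to restart the Python kernel after installing these dependencies.

This notebook was tested with mlflow==2.11.0, transformers==4.35.2, peft==0.8.2, bitsandbytes==0.42.0, accelerate==0.27.2, and datasets==2.17.1.

%pip install "mlflow>=2.11.0"
%pip install transformers peft accelerate bitsandbytes datasets -q -U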

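2. Dataset Preparation

Load the Dataset from the HuggingFace Hub

In this tutorial we will use the b-mc2/sql-create-context dataset from the Hugging Face Hub. It contains 78.6k pairs of natural language questions and their corresponding SQL statements, which makes it well suited to training a text-to-SQL model. The dataset has three columns:

question: A natural language question about the data.
context: Additional information about the data, such as the schema of the table being queried.
answer: The SQL query that represents the expected output.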

import pandas as pd
from datasets import load_dataset
from IPython.display import HTML, display

dataset_name = "b-mc2/sql-create-context"
dataset = load_dataset(dataset_name, split="train")


def display_table(dataset_or_sample):
  # A helper function to nicely display a Transformers dataset or a single sample that contains multi-line strings
  pd.set_option("display.max_colwidth", None)
  pd.set_option("display.width", None)
  pd.set_option("display.max_rows", None)

  if isinstance(dataset_or_sample, dict):
      df = pd.DataFrame(dataset_or_sample, index=[0])
  else:
      df = pd.DataFrame(dataset_or_sample)

  html = df.to_html().replace("\n", "<br>")
  styled_html = f"""<style> .dataframe th, .dataframe tbody td {{ text-align: left; padding-right: 30px; }} </style> {html}"""
  display(HTML(styled_html))


display_table(dataset.select(range(3)))

	question	context	answer
0	How many heads of the departments are older than 56?	CREATE TABLE head (age INTEGER)	SELECT COUNT(*) FROM head WHERE age > 56
1	List the name, born state and age of the heads of departments ordered by age.	CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)	SELECT name, born_state, age FROM head ORDER BY age
2	List the creation year, name and budget of each department.	CREATE TABLE department (creation VARCHAR, name VARCHAR, budget_in_billions VARCHAR)	SELECT creation, name, budget_in_billions FROM department

Split Train and Test Dataset

The b-mc2/sql-create-context dataset consists of a single split, "train". We will hold out 20% of it as test samples.

split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(f"Training dataset contains {len(train_dataset)} text-to-SQL pairs")
print(f"Test dataset contains {len(test_dataset)} text-to-SQL pairs")

Training dataset contains 62861 text-to-SQL pairs
Test dataset contains 15716 text-to-SQL pairs
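The two counts printed above add up to 78,577 rows, and they are consistent with the test share being rounded up to a whole row. The helper below is a small sketch of that arithmetic; the rounding rule is an assumption inferred from the printed numbers, and split_sizes is a hypothetical name, not part of the datasets API.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(62_861 + 15_716, test_size=0.2)
print(n_train, n_test)
```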

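Define the Prompt Template

The Mistral 7B model is a text comprehension model, so we have to construct a text prompt that combines the user's question, the context, and our system instructions. The new prompt column in the dataset will contain the text prompt that is fed into the model during training. Note that we also include the expected response in the prompt, which allows the model to be trained in a self-supervised manner.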

PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

### Table:
{context}

### Question:
{question}

### Response:
{output}"""


def apply_prompt_template(row):
  prompt = PROMPT_TEMPLATE.format(
      question=row["question"],
      context=row["context"],
      output=row["answer"],
  )
  return {"prompt": prompt}


train_dataset = train_dataset.map(apply_prompt_template)
display_table(train_dataset.select(range(1)))

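	question	context	answer	prompt
0	Which Perth has Gold Coast yes, Sydney yes, Melbourne yes, and Adelaide yes?	CREATE TABLE table_name_56 (perth VARCHAR, adelaide VARCHAR, melbourne VARCHAR, gold_coast VARCHAR, sydney VARCHAR)	SELECT perth FROM table_name_56 WHERE gold_coast = "yes" AND sydney = "yes" AND melbourne = "yes" AND adelaide = "yes"	You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question. ### Table: CREATE TABLE table_name_56 (perth VARCHAR, adelaide VARCHAR, melbourne VARCHAR, gold_coast VARCHAR, sydney VARCHAR) ### Question: Which Perth has Gold Coast yes, Sydney yes, Melbourne yes, and Adelaide yes? ### Response: SELECT perth FROM table_name_56 WHERE gold_coast = "yes" AND sydney = "yes" AND melbourne = "yes" AND adelaide = "yes"

Padding the Training Dataset

As the final step of dataset preparation, we need to apply padding to the training dataset. Padding ensures that all input sequences in a batch have the same length.

A crucial point is that the padding must be added on the left. The model generates tokens autoregressively, meaning that it continues from the last token. If padding were added on the right, the model would generate new tokens after the padding tokens, so the output sequence would contain padding tokens in the middle.

Padding to the right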

Today |  is  |   a    |  cold  |  <pad>  ==generate=>  "Today is a cold <pad> day"
 How  |  to  | become |  <pad> |  <pad>  ==generate=>  "How to become a <pad> <pad> great engineer".

Padding to the left

<pad> |  Today  |  is  |  a   |  cold     ==generate=>  "<pad> Today is a cold day"
<pad> |  <pad>  |  How |  to  |  become   ==generate=>  "<pad> <pad> How to become a great engineer".
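The illustration above can be reproduced with a few lines of plain Python. This is only a sketch of what padding does to a batch: pad_batch is a hypothetical helper and the token ids are made up, whereas in the tutorial the tokenizer performs this step when padding_side="left" is set.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token ids to the longest length in the batch.

    Returns (input_ids, attention_mask); the mask is 0 on padding positions.
    """
    if side not in ("left", "right"):
        raise ValueError(f"side must be 'left' or 'right', got {side!r}")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        pad_mask = [0] * len(pad)
        real_mask = [1] * len(seq)
        if side == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(list(seq) + pad)
            attention_mask.append(real_mask + pad_mask)
    return input_ids, attention_mask


batch = [[11, 12, 13, 14], [21, 22, 23]]  # made-up token ids
left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, side="right")
print(left_ids)
print(right_ids)
```

With left padding every sequence ends on a real token, so generation continues directly from the text instead of from a pad token.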

from transformers import AutoTokenizer

base_model_id = "mistralai/Mistral-7B-v0.1"

# You can use a different max length if your custom dataset has shorter/longer input sequences.
MAX_LENGTH = 256

tokenizer = AutoTokenizer.from_pretrained(
  base_model_id,
  model_max_length=MAX_LENGTH,
  padding_side="left",
  add_eos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token


def tokenize_and_pad_to_fixed_length(sample):
  result = tokenizer(
      sample["prompt"],
      truncation=True,
      max_length=MAX_LENGTH,
      padding="max_length",
  )
  result["labels"] = result["input_ids"].copy()
  return result


tokenized_train_dataset = train_dataset.map(tokenize_and_pad_to_fixed_length)

assert all(len(x["input_ids"]) == MAX_LENGTH for x in tokenized_train_dataset)

display_table(tokenized_train_dataset.select(range(1)))

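3. Load the Base Model (with 4-bit Quantization)

Next, we load the Mistral 7B model, which serves as the base model for our fine-tuning. It can be loaded from the HuggingFace Hub repository mistralai/Mistral-7B-v0.1 with the Transformers from_pretrained() API. Here, however, we also pass a quantization_config parameter.

This parameter embodies the key technique of QLoRA, which significantly reduces memory usage during fine-tuning. The following paragraphs explain the method and what this configuration means. If it looks complicated, feel free to skip ahead. After all, we rarely need to modify the quantization_config values ourselves :)

How It Works

In short, QLoRA is a combination of Quantization and LoRA. To understand it, it is easier to start with LoRA. LoRA (Low-Rank Adaptation) is the preceding method for resource-efficient fine-tuning, which reduces the number of trainable parameters through matrix decomposition. Let W' denote the final weight matrix after fine-tuning. In LoRA, W' is expressed as the sum of the original weights and their update, i.e. W' = W + ΔW, and the delta is then decomposed into two low-dimensional matrices, i.e. ΔW ≈ AB. Suppose W is m × m and we choose a small rank r for A and B, so that A is m × r and B is r × m. The number of trainable parameters, which was originally quadratic in the dimension of W (m^2), becomes 2mr after the decomposition. Empirically, r can be much smaller than the dimension of the full weight matrix, e.g. 32 or 64, so this significantly reduces the number of parameters to train.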
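To make the m^2 versus 2mr comparison concrete, the sketch below plugs in m = 4096 and r = 32. The value of m is chosen only for illustration, and lora_param_counts is a hypothetical helper rather than a PEFT function.

```python
def lora_param_counts(m: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for an m x m weight matrix.

    full: updating W directly trains m * m parameters.
    lora: training A (m x r) and B (r x m) instead trains 2 * m * r parameters.
    """
    return m * m, 2 * m * r


full, lora = lora_param_counts(m=4096, r=32)  # illustrative sizes
print(full, lora, lora / full)
```

For these sizes the adapter pair holds 262,144 parameters against 16,777,216 for the full matrix, which is about 1.6% of the original.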

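QLoRA extends LoRA and uses the same matrix-decomposition strategy. However, it further reduces memory usage by applying 4-bit quantization to the frozen pretrained model W. According to the QLoRA authors' study, the largest memory usage during LoRA fine-tuning comes from backpropagating through the frozen parameters W to compute the gradients for the adapters A and B. Quantizing W to 4 bits therefore significantly reduces the overall memory consumption. This is achieved with the load_in_4bit=True setting shown below.

In addition, QLoRA introduces further techniques to optimize resource usage without significantly affecting model performance. Please refer to the paper for the technical details; here we enable them through the following quantization settings in bitsandbytes:

The 4-bit NormalFloat type is specified by bnb_4bit_quant_type="nf4".
Double quantization is activated by bnb_4bit_use_double_quant=True.
QLoRA de-quantizes the 4-bit weights to a higher precision when computing the gradients for A and B, to prevent performance degradation. This data type is specified by bnb_4bit_compute_dtype=torch.bfloat16.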
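The store-in-4-bit, compute-in-higher-precision idea can be illustrated with a toy blockwise absmax quantizer. This is a deliberately simplified sketch: it uses uniformly spaced levels, whereas NF4 uses levels derived from a normal distribution, and it does not reflect how bitsandbytes is implemented. The function names and sample weights are made up for the example.

```python
def quantize_block(values, levels=7):
    """Map floats to integer codes in [-levels, levels] using one absmax scale.

    With levels=7 the codes fit in a signed 4-bit integer.
    """
    absmax = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = absmax / levels
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the codes for use in computation."""
    return [c * scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]  # made-up block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(max_error)
```

Only the small integer codes and one scale per block need to be stored, and the rounding error per weight stays within half a quantization step.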

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
  # Load the model with 4-bit quantization
  load_in_4bit=True,
  # Use double quantization
  bnb_4bit_use_double_quant=True,
  # Use 4-bit Normal Float for storing the base model weights in GPU memory
  bnb_4bit_quant_type="nf4",
  # De-quantize the weights to 16-bit (Brain) float before the forward/backward pass
  bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=quantization_config)

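How Does the Base Model Perform?

First, let's evaluate the vanilla Mistral model on the SQL generation task before any fine-tuning. As expected, the model does not generate a correct SQL query; instead, it produces an unrelated answer in natural language. This result shows that the model needs to be fine-tuned for our specific task.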

import transformers

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
pipeline = transformers.pipeline(model=model, tokenizer=tokenizer, task="text-generation")

sample = test_dataset[1]
prompt = PROMPT_TEMPLATE.format(
  context=sample["context"], question=sample["question"], output=""
)  # Leave the answer part blank

with torch.no_grad():
  response = pipeline(prompt, max_new_tokens=256, repetition_penalty=1.15, return_full_text=False)

display_table({"prompt": prompt, "generated_query": response[0]["generated_text"]})

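	prompt	generated_query
0	You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question. ### Table: CREATE TABLE table_name_61 (game INTEGER, opponent VARCHAR, record VARCHAR) ### Question: What is the lowest numbered game against Phoenix with a record of 29-17? ### Response:	A: The lowest numbered game against Phoenix was played on March 4, 2018. The score was PHO 115 - DAL 106. What is the highest numbered game against Phoenix? A: The highest numbered game against Phoenix was played on March 4, 2018. The score was PHO 115 - DAL 106. Which players started at point guard for Dallas in their regular season game against Phoenix?

4. Define a PEFT Model

As mentioned earlier, QLoRA stands for Quantization + LoRA. Having applied the quantization part, we now move on to the LoRA part. Although the math behind LoRA is intricate, PEFT simplifies the process of applying LoRA to a pretrained Transformers model.

In the next cell, we create a LoraConfig with various settings for LoRA. Unlike the earlier quantization_config, these hyperparameters may need to be tuned to get the best model performance for your specific task. MLflow facilitates this process by tracking the hyperparameters, the associated model, and its results.

At the end of the cell, we display the number of trainable parameters during fine-tuning and their percentage of the total model parameters. Here, we train only 1.16% of the roughly 7 billion parameters in total.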

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Enabling gradient checkpointing, to make the training further efficient
model.gradient_checkpointing_enable()
# Set up the model for quantization-aware training e.g. casting layers, parameter freezing, etc.
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
  task_type="CAUSAL_LM",
  # This is the rank of the decomposed matrices A and B to be learned during fine-tuning. A smaller number will save more GPU memory but might result in worse performance.
  r=32,
  # This is the coefficient for the learned ΔW factor, so the larger number will typically result in a larger behavior change after fine-tuning.
  lora_alpha=64,
  # Drop out ratio for the layers in LoRA adaptors A and B.
  lora_dropout=0.1,
  # We fine-tune all linear layers in the model. It might sound a bit large, but the trainable adapter size is still only **1.16%** of the whole model.
  target_modules=[
      "q_proj",
      "k_proj",
      "v_proj",
      "o_proj",
      "gate_proj",
      "up_proj",
      "down_proj",
      "lm_head",
  ],
  # Bias parameters to train. 'none' is recommended to keep the original model performing equally when turning off the adapter.
  bias="none",
)

peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

trainable params: 85,041,152 || all params: 7,326,773,248 || trainable%: 1.1606903765339511
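The trainable-parameter count printed above can be reproduced by hand. For a linear layer with in_features inputs and out_features outputs, a rank-r adapter adds r * (in_features + out_features) parameters. The layer shapes below are assumptions taken from the published Mistral-7B-v0.1 configuration, not values stated in this tutorial, and lora_params is a hypothetical helper.

```python
# Layer shapes assumed from the Mistral-7B-v0.1 model configuration.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads x 128 dimensions per head
VOCAB = 32000
NUM_LAYERS = 32
R = 32  # LoRA rank used in the LoraConfig above

# (in_features, out_features) for every target module inside one decoder layer
PER_LAYER_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Adapter A is (in_features x r) and adapter B is (r x out_features)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in PER_LAYER_SHAPES.values())
trainable = NUM_LAYERS * per_layer + lora_params(HIDDEN, VOCAB, R)  # plus lm_head
all_params = 7_326_773_248  # total reported by print_trainable_parameters()
print(trainable, 100 * trainable / all_params)
```

Under these assumed shapes the sum comes to 85,041,152, matching the value reported by print_trainable_parameters().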

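That's it! PEFT makes the LoRA setup very easy.

An additional bonus is that the PEFT model exposes the same interface as a Transformers model. This means that everything from here on is very similar to the standard model training process with Transformers.

5. Kick Off a Training Job

As in conventional Transformers training, we first set up a Trainer object to organize the training iterations. There are many hyperparameters to configure, but MLflow will manage them on your behalf.

To enable MLflow logging, specify report_to="mlflow" and name your training trial with the run_name parameter. This starts an MLflow run that automatically logs the training metrics, hyperparameters, configurations, and the trained model.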

from datetime import datetime

import transformers
from transformers import TrainingArguments

import mlflow

# Comment-out this line if you are running the tutorial on Databricks
mlflow.set_experiment("MLflow PEFT Tutorial")

training_args = TrainingArguments(
  # Set this to mlflow for logging your training
  report_to="mlflow",
  # Name the MLflow run
  run_name=f"Mistral-7B-SQL-QLoRA-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}",
  # Replace with your output destination
  output_dir="YOUR_OUTPUT_DIR",
  # For the following arguments, refer to https://hugging-face.cn/docs/transformers/main_classes/trainer
  per_device_train_batch_size=2,
  gradient_accumulation_steps=4,
  gradient_checkpointing=True,
  optim="paged_adamw_8bit",
  bf16=True,
  learning_rate=2e-5,
  lr_scheduler_type="constant",
  max_steps=500,
  save_steps=100,
  logging_steps=100,
  warmup_steps=5,
  # https://discuss.huggingface.co/t/training-llama-with-lora-on-multiple-gpus-may-exist-bug/47005/3
  ddp_find_unused_parameters=False,
)

trainer = transformers.Trainer(
  model=peft_model,
  train_dataset=tokenized_train_dataset,
  data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
  args=training_args,
)

# use_cache=True is incompatible with gradient checkpointing.
peft_model.config.use_cache = False

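Training may take several hours, depending on your hardware. However, the primary goal of this tutorial is to familiarize you with the process of fine-tuning with PEFT and MLflow, rather than to produce a high-performing SQL generator. If you don't care much about model performance, you can specify fewer steps or interrupt the following cell and continue with the rest of the notebook.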

trainer.train()

[500/500 45:41, Epoch 0/1]

Step	Training Loss
100	0.681700
200	0.522400
300	0.507300
400	0.494800
500	0.474600

TrainOutput(global_step=500, training_loss=0.5361956100463867, metrics={'train_runtime': 2747.9223, 'train_samples_per_second': 1.456, 'train_steps_per_second': 0.182, 'total_flos': 4.421038813216768e+16, 'train_loss': 0.5361956100463867, 'epoch': 0.06})
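The epoch value of 0.06 in the TrainOutput above follows from the batch settings. The sketch below redoes that arithmetic, assuming a single GPU as in this tutorial; the variable names simply mirror the TrainingArguments and the printed dataset size.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: single-GPU training, as in this tutorial
max_steps = 500
train_samples = 62_861  # size of the training split printed earlier
train_runtime_s = 2747.9223  # from the TrainOutput above

# Samples that contribute to one optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = effective_batch_size * max_steps
epochs = samples_seen / train_samples
samples_per_second = samples_seen / train_runtime_s

print(effective_batch_size, samples_seen, round(epochs, 2), round(samples_per_second, 3))
```

In other words, 500 steps cover only about 4,000 of the 62,861 training examples, so the run sees a small fraction of one epoch.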

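6. Save the PEFT Model to MLflow

Hooray! We have successfully fine-tuned the Mistral 7B model into a SQL generator. Before wrapping up the training, the final step is to save the trained PEFT model to MLflow.

Set Prompt Template and Default Inference Parameters (optional)

LLM prediction behavior is defined not only by the model weights but also, to a large extent, by the prompt and by inference parameters such as max_new_tokens and repetition_penalty. It is therefore highly recommended to save this metadata together with the model, so that you get consistent behavior when you load the model later.

Prompt Template

The user prompt itself is free text, but you can shape the input by applying a "template". The MLflow Transformers flavor supports saving a prompt template with the model and applies it automatically before prediction. This also lets you hide the system prompt from model clients. To save a prompt template, define a single string that contains the {prompt} variable and pass it to the prompt_template argument of the mlflow.transformers.log_model API. For more detailed usage of this feature, refer to "Saving Prompt Templates with Transformer Pipelines" in the MLflow documentation.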

# Basically the same format as we applied to the dataset. However, the template only accepts {prompt} variable so both table and question need to be fed in there.
prompt_template = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

{prompt}

### Response:
"""

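Inference Parameters

Inference parameters can be saved with the MLflow model as part of the model signature. The signature defines the model's input and output formats, together with additional parameters passed to model prediction, and you can let MLflow infer it from sample inputs with the mlflow.models.infer_signature API. If you pass concrete values for the parameters, MLflow treats them as defaults and applies them at inference time when the user does not provide them. For more details about the model signature, refer to the MLflow documentation.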

from mlflow.models import infer_signature

sample = train_dataset[1]

# MLflow infers schema from the provided sample input/output/params
signature = infer_signature(
  model_input=sample["prompt"],
  model_output=sample["answer"],
  # Parameters are saved with default values if specified
  params={"max_new_tokens": 256, "repetition_penalty": 1.15, "return_full_text": False},
)
signature

inputs: 
[string (required)]
outputs: 
[string (required)]
params: 
['max_new_tokens': long (default: 256), 'repetition_penalty': double (default: 1.15), 'return_full_text': boolean (default: False)]
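Conceptually, the defaults stored in the signature act as a base dictionary that caller-supplied values override. The sketch below illustrates that merging logic only; resolve_params is a hypothetical helper, and rejecting unknown names is a choice made for this example rather than a description of how MLflow handles them.

```python
DEFAULT_PARAMS = {
    "max_new_tokens": 256,
    "repetition_penalty": 1.15,
    "return_full_text": False,
}


def resolve_params(defaults: dict, overrides=None) -> dict:
    """Merge caller overrides on top of the saved defaults."""
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Design choice of this sketch: fail loudly on names not in the signature
        raise ValueError(f"Unknown inference parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


print(resolve_params(DEFAULT_PARAMS))
print(resolve_params(DEFAULT_PARAMS, {"max_new_tokens": 64}))
```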

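Save the PEFT Model to MLflow

Finally, we call the mlflow.transformers.log_model API to log the model to MLflow. There are a few key points to remember when logging a PEFT model to MLflow:

MLflow logs a Transformers model as a pipeline. A pipeline bundles the model with its tokenizer (or other components, depending on the task type) and simplifies the prediction steps into an easy-to-use interface, which makes it a great tool for ensuring reproducibility. In the code below, we pass the model and tokenizer as a dictionary, and MLflow automatically infers the appropriate pipeline type and saves it.
MLflow does not save the base model weights for a PEFT model. When mlflow.transformers.log_model runs, MLflow saves only the small number of trained parameters, i.e. the PEFT adapter. For the base model, MLflow instead records a reference to the HuggingFace Hub (repository name and commit hash) and downloads the base model weights on the fly when the PEFT model is loaded. This approach significantly reduces storage usage and logging latency; for example, the artifacts logged in this tutorial are smaller than 1GB, while the full Mistral 7B model is about 20GB.
Save a tokenizer without padding. During fine-tuning, we applied padding to the dataset to standardize the sequence length within a batch. Padding is no longer needed at inference time, so we save a different tokenizer without padding. This ensures that the loaded model can be used for inference right away.

Note: Currently, the PEFT adapter and config have to be logged manually, while other information such as the dataset, metrics, and Trainer parameters is logged automatically. This process may be automated in future versions of MLflow and Transformers.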

import mlflow

# Get the ID of the MLflow Run that was automatically created above
last_run_id = mlflow.last_active_run().info.run_id

# Save a tokenizer without padding because it is only needed for training
tokenizer_no_pad = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True)

# If you interrupt the training, uncomment the following line to stop the MLflow run
# mlflow.end_run()

with mlflow.start_run(run_id=last_run_id):
  mlflow.log_params(peft_config.to_dict())
  mlflow.transformers.log_model(
      transformers_model={"model": trainer.model, "tokenizer": tokenizer_no_pad},
      prompt_template=prompt_template,
      signature=signature,
      name="model",  # This is a relative path to save model files within MLflow run
  )

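What's Logged to MLflow?

Let's briefly review what has been logged or saved to MLflow as a result of your training. To access the MLflow UI, run the mlflow ui command and open http://localhost:PORT (PORT is 5000 by default). On the left side, select the experiment "MLflow PEFT Tutorial" (or the notebook name when running on Databricks). Then click the latest MLflow Run, named Mistral-7B-SQL-QLoRA-2024-..., to view the Run details.

Parameters

The Parameters section shows the hundreds of parameters specified for the Trainer, LoraConfig, and BitsAndBytesConfig, such as learning_rate, r, and bnb_4bit_quant_type. It also includes default parameters that were not explicitly specified, which is crucial for reproducibility, especially when a library's default values change.

Metrics

The Metrics section shows the model metrics collected during the run, such as train_loss. You can visualize these metrics with various types of charts in the "Chart" tab.

Artifacts

The Artifacts section shows the files and directories saved in MLflow as a result of training. For Transformers PEFT training, you should see the following files and directories: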

    model/
      ├─ peft/
      │  ├─ adapter_config.json       # JSON file of the LoraConfig
      │  ├─ adapter_module.safetensor # The weight file of the LoRA adapter
      │  └─ README.md                 # Empty README file generated by Transformers
      │
      ├─ LICENSE.txt                  # License information about the base model (Mistral-7B-0.1)
      ├─ MLModel                      # Contains various metadata about your model
      ├─ conda.yaml                   # Dependencies to create conda environment
      ├─ model_card.md                # Model card text for the base model
      ├─ model_card_data.yaml         # Model card data for the base model
      ├─ python_env.yaml              # Dependencies to create Python virtual environment
      └─ requirements.txt             # Pip requirements for model inference

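Model Metadata

In the MLModel file, you can see a lot of detailed metadata saved about the PEFT adapter and the base model. Here is an excerpt of the MLModel file (some fields are omitted for simplicity):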

flavors:
  transformers:
    peft_adaptor: peft                                 # Points the location of the saved PEFT model
    pipeline_model_type: MistralForCausalLM            # The base model implementation
    source_model_name: mistralai/Mistral-7B-v0.1       # Repository name of the base model
    source_model_revision: xxxxxxx                     # Commit hash in the repository for the base model
    task: text-generation                              # Pipeline type
    torch_dtype: torch.bfloat16                        # Dtype for loading the model
    tokenizer_type: LlamaTokenizerFast                 # Tokenizer implementation

# Prompt template saved with the model above
metadata:
  prompt_template: 'You are a powerful text-to-SQL model. Given the SQL tables and
    natural language question, your job is to write SQL query that answers the question.


    {prompt}


    ### Response:

    '
# Defines the input and output format of the model, with additional inference parameters with default values
signature:
  inputs: '[{"type": "string", "required": true}]'
  outputs: '[{"type": "string", "required": true}]'
  params: '[{"name": "max_new_tokens", "type": "long", "default": 256, "shape": null},
    {"name": "repetition_penalty", "type": "double", "default": 1.15, "shape": null},
    {"name": "return_full_text", "type": "boolean", "default": false, "shape": null}]'

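7. Load the Saved PEFT Model from MLflow

Finally, let's load the model logged in MLflow and evaluate its performance as a text-to-SQL generator. There are two ways to load a Transformers model in MLflow:

Use mlflow.transformers.load_model(). This method returns a native Transformers pipeline instance.
Use mlflow.pyfunc.load_model(). This method returns an MLflow PythonModel instance that wraps the Transformers pipeline and offers more features than the native pipeline, such as (1) a unified predict() API for inference, (2) model signature enforcement, and (3) automatic application of the prompt template and default parameters, if they were saved. Note that not all Transformers pipelines support pyfunc loading; refer to the MLflow documentation for the full list of supported pipeline types.

The first option is preferable if you want to use the model through the native Transformers interface. The second option offers a simplified and unified interface across different model types and is particularly useful for testing a model before production deployment. In the code below, we use mlflow.pyfunc.load_model() to show how it applies the prompt template and the default inference parameters defined above.

Note: Calling load_model() loads a new model instance onto your GPU, which may exceed the GPU memory limit and trigger an out-of-memory (OOM) error, or cause the Transformers library to try to offload parts of the model to another device or to disk. This offloading can lead to problems such as "ValueError: We need an offload_dir to dispatch this model according to this device_map." If you encounter this error, consider restarting the Python kernel and loading the model again.

Note: Restarting the Python kernel erases all intermediate state and variables from the cells above. Make sure the trained PEFT model has been properly logged in MLflow before restarting.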

# You can find the ID of run in the Run detail page on MLflow UI
mlflow_model = mlflow.pyfunc.load_model("runs:/YOUR_RUN_ID/model")

# We only input the table and the question, since the system prompt is added by the prompt template.
test_prompt = """
### Table:
CREATE TABLE table_name_50 (venue VARCHAR, away_team VARCHAR)

### Question:
When Essendon played away; where did they play?
"""

# Inference parameters like max_new_tokens are set to the default values specified in the model signature
generated_query = mlflow_model.predict(test_prompt)[0]
display_table({"prompt": test_prompt, "generated_query": generated_query})

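	prompt	generated_query
0	### Table: CREATE TABLE table_name_50 (venue VARCHAR, away_team VARCHAR) ### Question: When Essendon played away; where did they play?	SELECT venue FROM table_name_50 WHERE away_team = "essendon"

Perfect! The fine-tuned model now generates the SQL query correctly. As you can see in the code and result above, the system prompt and the default inference parameters are applied automatically, so we don't have to pass them to the loaded model. This is very powerful when you want to deploy multiple models with different system prompts or parameters (or update an existing model), because you don't have to edit the client's implementation: those details are abstracted behind the MLflow model :)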
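One inexpensive sanity check on a generated query is to run it against an empty in-memory SQLite database built from the table definition in the prompt. The sketch below uses only Python's built-in sqlite3 module; runs_on_schema is a hypothetical helper, and it confirms only that a query executes against the schema, not that it answers the question. The sample query uses single-quoted string literals, because SQLite's handling of double-quoted strings depends on how it was built.

```python
import sqlite3


def runs_on_schema(create_table_sql: str, query: str) -> bool:
    """Return True if `query` executes against an empty table built from the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(create_table_sql)
        conn.execute(query).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


schema = "CREATE TABLE table_name_50 (venue VARCHAR, away_team VARCHAR);"
valid_query = "SELECT venue FROM table_name_50 WHERE away_team = 'essendon'"
invalid_query = "SELECT stadium FROM table_name_50"  # references a column that does not exist

print(runs_on_schema(schema, valid_query))
print(runs_on_schema(schema, invalid_query))
```

A check like this catches references to missing tables or columns and syntax errors, which makes it a useful first filter before a more thorough evaluation.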

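Conclusion

In this tutorial, you learned how to fine-tune a large language model for a text-to-SQL task with QLoRA, using PEFT. You also learned the key role MLflow plays in the LLM fine-tuning process: it tracks parameters and metrics during fine-tuning and manages the model and other assets.

What's Next?

Evaluate a Hugging Face LLM with MLflow - Model evaluation is a critical step in model development. Check this guide to learn how to evaluate LLMs efficiently with MLflow, including LLM-as-a-judge.
Deploy MLflow Model to Production - MLflow models store rich metadata and provide a unified interface for prediction, which streamlines deployment. Learn how to deploy your fine-tuned model to various targets, such as AWS SageMaker, Azure ML, Kubernetes, and Databricks Model Serving, with detailed guides and hands-on notebooks.
MLflow Transformers Flavor Documentation - Learn more about the MLflow and Transformers integration and continue with more tutorials.
Large Language Models in MLflow - MLflow provides more LLM-related features and integrates with many other libraries, such as OpenAI and Langchain.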
