[NLP] Tutorial on fine-tuning using Alpaca-Lora based on the llama model

Time: 2024-5-6

Stanford Alpaca fine-tunes the entire LLaMA model, i.e., every parameter of the pre-trained model is updated (full fine-tuning). However, this approach still demands expensive hardware and trains inefficiently.

[NLP] Understanding Efficient Fine-Tuning of Large Language Models (PEFT)

Therefore, Alpaca-Lora applies the LoRA technique: it freezes the original LLaMA parameters and adds small extra network layers to the model, training only these added layers. Because the number of added parameters is small, the cost of fine-tuning drops significantly while the results remain close to those of full fine-tuning.

The principle of LoRA is not complicated. Its core idea is to add a bypass next to the original pre-trained language model that performs a dimensionality reduction followed by a dimensionality increase, simulating the so-called intrinsic rank (the intuition being that adapting a pre-trained model to various downstream tasks actually optimizes a very small number of free parameters in a common low-dimensional intrinsic subspace shared by those tasks). During training, the parameters of the pre-trained language model are frozen and only the down-projection matrix A and the up-projection matrix B are trained. The input and output dimensions of the model remain unchanged, and BA is added to the output of the pre-trained weights. A is initialized from a random Gaussian distribution and B with zeros, which guarantees that at the start of training the added bypass BA = 0 and has no effect on the model's output.

At inference time, it suffices to add the outputs of the two branches: h = Wx + BAx = (W + BA)x. Therefore, the product of the trained matrices, BA, can simply be added to the original weight matrix W and W + BA used as the new weight in place of W, which adds no extra computational cost at inference.
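
To make this concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer. It is an illustrative toy, not the actual peft implementation; the dimensions, scaling, and merge helper are assumptions for the example.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear weight W plus a trainable low-rank bypass B @ A."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # stand-in for the frozen pre-trained weight W (out_features x in_features)
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # A: down-projection, Gaussian init; B: up-projection, zero init, so BA = 0 at the start
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # h = Wx + scaling * BAx; because B starts at zero, the bypass has no initial effect
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

    def merge(self):
        # inference-time merge: replace W with W + scaling * B @ A, so no extra compute is needed
        self.weight.data += self.scaling * (self.lora_B @ self.lora_A)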


The biggest advantage of LoRA is that it is faster and uses less memory; therefore, it can run on consumer-grade hardware.

Preparing the dataset

The goals of fine-tuning usually fall into two kinds:

  • Like Alpaca: collect input/output pairs and generate prompts from them for training, so that the model learns to perform specific tasks.
  • Language completion: collect raw text for training, so that the model learns to continue a given prompt.

Taking the first goal as an example, suppose we want the model to speak Chinese. We can translate an existing dataset (e.g., the Alpaca dataset) into Chinese with another LLM (e.g., text-davinci-003) and then fine-tune on the translated data. In fact, this idea has already been realized in the open-source community, and some people have already used it for fine-tuning.

To accomplish this, I used the Alpaca dataset translated into Chinese by the Luotuo project authors; the training code comes mainly from Alpaca-LoRA.

# download the raw JSON file (the GitHub blob URL returns an HTML page, so use the raw link)
wget https://raw.githubusercontent.com/LC1332/Chinese-alpaca-lora/main/data/trans_chinese_alpaca_data.json
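
A quick way to sanity-check the downloaded file is to load it and print one record. This is just a sketch; it assumes the translated file keeps the standard Alpaca fields instruction / input / output.

import json

with open("trans_chinese_alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

print(len(data), "records")
# expected fields per record: "instruction", "input", "output"
print(json.dumps(data[0], ensure_ascii=False, indent=2))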

The Alpaca-LoRA repository also contains English datasets for fine-tuning (e.g., alpaca_data_cleaned.json).


In addition, the GPT-4-LLM project provides 52,000 instruction-following examples generated by GPT-4 using Alpaca prompts translated into Chinese.

I. Environment setup

The base environment is configured as follows:

  • Operating System: CentOS 7
  • CPU: single node with 1 TB of memory, Intel CPUs, 64 physical CPUs with 16 cores each
  • GPU: 4× A100 80GB
  • Docker Image: pytorch:1.13.0-cuda11.6-cudnn8-devel

1. In the Alpaca-LoRA project, the authors mention that they use Hugging Face's PEFT, a library (LoRA is one of the techniques it supports, alongside Prefix Tuning, P-Tuning, and Prompt Tuning) that lets you efficiently fine-tune various Transformer-based language models. Install PEFT as follows:

#install peft
git clone https://github.com/huggingface/peft.git
cd peft/
pip install .

2. bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

# Install bitsandbytes.
git clone [email protected]:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=116 make cuda11x
python setup.py install

If the following error occurs while installing bitsandbytes:
/usr/bin/ld: cannot find -lcudart

execute the following commands to fix it:

cd /usr/lib
ln -s /usr/local/cuda/lib64/libcudart.so libcudart.so

3. Alpaca-Lora fine-tuning code

# Download alpaca-lora
git clone [email protected]:tloen/alpaca-lora.git
cd alpaca-lora
pip install -r requirements.txt

The contents of requirements.txt are as follows:

accelerate
appdirs
loralib
bitsandbytes
black
black[jupyter]
datasets
fire
git+https://github.com/huggingface/peft.git
transformers>=4.28.0
sentencepiece
gradio

II. Model format conversion

Convert the original LLaMA weight files into the model file format used by the Transformers library. Alternatively, already-converted models can be downloaded directly from Hugging Face:

For download instructions, see: [NLP] Huggingface Model/Data File Download Methods

decapoda-research/llama-7b-hf · Hugging Face

decapoda-research/llama-13b-hf · Hugging Face
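
If you want to perform the conversion yourself instead of downloading, the Transformers library ships a conversion script. Below is a hedged example invocation: the paths are placeholders, and the script's exact location and flags may differ slightly across transformers versions.

# convert the original LLaMA weights to the Hugging Face format (example paths)
python -m transformers.models.llama.convert_llama_weights_to_hf \
    --input_dir /path/to/original-llama-weights \
    --model_size 7B \
    --output_dir /home/llama-7b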

III. Model fine-tuning

The authors of Alpaca-LoRA used the LoRA method supported in Hugging Face's Parameter-Efficient Fine-Tuning (PEFT) library. Two LoRA settings directly affect the number of parameters to be trained:

1) LoRA target modules (lora_target_modules), which specify the modules whose parameters receive LoRA adapters. For example, we can fine-tune Q, K, V, and O, or only Q and V. Different settings affect both the number of parameters to be fine-tuned and the amount of computation during training. For example, if only Q and V are fine-tuned, the trainable parameters are only a tiny fraction of the model's total parameters (about 0.05%, as the trainable-parameter line in the training output below shows).

2) The LoRA rank (lora_r) is another important factor affecting the number of trainable parameters. Objectively speaking, a model trained with LoRA will differ somewhat from the original full model, so LoRA should be configured flexibly based on factors such as the hardware you have and the maximum training time you can tolerate.
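
As an illustration of how these two settings appear in code, here is a minimal sketch using peft's LoraConfig with the same values as the default parameters listed below. It is not the exact code in finetune.py.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # lora_r: rank of the low-rank matrices A and B
    lora_alpha=16,                        # scaling factor applied to the BA update
    target_modules=["q_proj", "v_proj"],  # lora_target_modules: which projections get the bypass
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # wraps a loaded LLaMA model
# model.print_trainable_parameters()               # prints the trainable-parameter percentage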

1. The default fine-tuning parameters are as follows:

batch_size: 128
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
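
These defaults can be overridden on the command line, since finetune.py exposes the parameters above as flags (via fire, as in the upstream project). The following is a hypothetical example, not a recommendation:

# example: a larger rank and more target modules than the defaults
python finetune.py \
    --base_model '/home/llama-7b' \
    --data_path '../alpaca_data_cleaned.json' \
    --output_dir './lora-alpaca-7b-r16' \
    --lora_r 16 \
    --lora_target_modules '[q_proj,k_proj,v_proj,o_proj]'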

2. Run the following with a single GPU:

nohup python finetune.py \
    --base_model '/home/llama-7b' \
    --data_path '../alpaca_data_cleaned.json' \
    --output_dir './lora-alpaca-7b-1gpu' \
    > torchrun-7b-1gpu.log 2>&1 &
    
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:10:00.0 Off |                    0 |
| N/A   00C    P0   293W / 400W |  10813MiB / 81920MiB |     94%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               

 

3. Run the following using 4 GPUs:

nohup torchrun --nproc_per_node=4 --master_port=1234 finetune.py \
    --base_model '/home/llama-7b' \
    --data_path '../alpaca_data_cleaned.json' \
    --output_dir './lora-alpaca-7b-4gpu' \
    --num_epochs 1 \
    > torchrun-7b-4gpu.log 2>&1 &
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:16:00.0 Off |                    0 |
| N/A   11C    P0   282W / 400W |  17055MiB / 81920MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   12C    P0   339W / 400W |  14275MiB / 81920MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   13C    P0   324W / 400W |  14773MiB / 81920MiB |     94%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   14C    P0   325W / 400W |  14385MiB / 81920MiB |     94%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               

4. The output is as follows:

Training Alpaca-LoRA model with params:
base_model: /disk1/llama-13b
data_path: ./alpaca_data_cleaned_archive.json
output_dir: ./lora-alpaca
batch_size: 128
micro_batch_size: 8
num_epochs: 1
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project: 
wandb_run_name: 
wandb_watch: 
wandb_log_model: 
resume_from_checkpoint: False
prompt template: alpaca
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:43<00:00,  1.06s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:43<00:00,  1.06s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:43<00:00,  1.06s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:43<00:00,  1.06s/it]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
/opt/conda/lib/python3.9/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
/opt/conda/lib/python3.9/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
/opt/conda/lib/python3.9/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
/opt/conda/lib/python3.9/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
warnings.warn(
trainable params: 6,553,600 || all params: 13,022,417,920 || trainable%: 0.05032552357220002
Map:   3%|███▊                                                                                                                                          | 1330/49759 [00:01<00:39, 1216.23 examples/s]trainable params: 6,553,600 || all params: 13,022,417,920 || trainable%: 0.05032552357220002
Map:   0%|                                                                                                                                                           | 0/49759 [00:00<?, ? examples/s]trainable params: 6,553,600 || all params: 13,022,417,920 || trainable%: 0.05032552357220002
Map:   1%|▊                                                                                                                                              | 272/49759 [00:00<00:36, 1350.21 examples/s]trainable params: 6,553,600 || all params: 13,022,417,920 || trainable%: 0.05032552357220002
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49759/49759 [00:38<00:00, 1294.31 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49759/49759 [00:38<00:00, 1284.04 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49759/49759 [00:38<00:00, 1283.95 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:01<00:00, 1221.03 examples/s]
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29005 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29005 (errno: 97 - Address family not supported by protocol).
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49759/49759 [00:39<00:00, 1274.42 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:01<00:00, 1285.16 examples/s]
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29005 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29005 (errno: 97 - Address family not supported by protocol).
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:01<00:00, 1281.27 examples/s]
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29005 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29005 (errno: 97 - Address family not supported by protocol).
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:01<00:00, 1290.31 examples/s]
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29005 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29005 (errno: 97 - Address family not supported by protocol).
0%|                                                                                                                                                                         | 0/388 [00:00<?, ?it/s]/opt/conda/lib/python3.9/site-packages/bitsandbytes-0.41.0-py3.9.egg/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
/opt/conda/lib/python3.9/site-packages/bitsandbytes-0.41.0-py3.9.egg/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
/opt/conda/lib/python3.9/site-packages/bitsandbytes-0.41.0-py3.9.egg/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
/opt/conda/lib/python3.9/site-packages/bitsandbytes-0.41.0-py3.9.egg/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
{'loss': 2.249, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.03}                                                                                                                               
{'loss': 2.1927, 'learning_rate': 5.6999999999999996e-05, 'epoch': 0.05}                                                                                                                              
{'loss': 2.0813, 'learning_rate': 7.8e-05, 'epoch': 0.08}                                                                                                                                             
{'loss': 1.7206, 'learning_rate': 0.00010799999999999998, 'epoch': 0.1}                                                                                                                               
11%|████████████████▋                                                                                                                               11%|███████████▋                                                                                                | 42/388 [10:50<1:27:2

IV. Model merging

1. Export to HuggingFace format:

The LoRA weights can be downloaded from Angainor/alpaca-lora-13b · Hugging Face.

Modify the export_hf_checkpoint.py script as follows:

import os

import torch
import transformers
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer  # noqa: F402

BASE_MODEL = os.environ.get("BASE_MODEL", "/disk1/llama-13b")
LORA_MODEL = os.environ.get("LORA_MODEL", "./alpaca-lora-13b")
HF_CHECKPOINT = os.environ.get("HF_CHECKPOINT", "./hf_ckpt")

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)

base_model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map={"": "cpu"},
)

first_weight = base_model.model.layers[0].self_attn.q_proj.weight
first_weight_old = first_weight.clone()

lora_model = PeftModel.from_pretrained(
    base_model,
    LORA_MODEL,
    device_map={"": "cpu"},
    torch_dtype=torch.float16,
)

lora_weight = lora_model.base_model.model.model.layers[
    0
].self_attn.q_proj.weight

assert torch.allclose(first_weight_old, first_weight)

# merge weights - new merging method from peft
lora_model = lora_model.merge_and_unload()

lora_model.train(False)

# did we do anything?
assert not torch.allclose(first_weight_old, first_weight)

lora_model_sd = lora_model.state_dict()
deloreanized_sd = {
    k.replace("base_model.model.", ""): v
    for k, v in lora_model_sd.items()
    if "lora" not in k
}

LlamaForCausalLM.save_pretrained(
    base_model, HF_CHECKPOINT, state_dict=deloreanized_sd, max_shard_size="400MB"
)

python export_hf_checkpoint.py

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:26<00:00,  1.56it/s]

View the exported model files:

hf_ckpt/
├── config.json
├── generation_config.json
├── pytorch_model-00001-of-00082.bin
├── pytorch_model-00002-of-00082.bin
├── pytorch_model-00003-of-00082.bin
├── pytorch_model-00004-of-00082.bin
├── pytorch_model-00005-of-00082.bin
├── pytorch_model-00006-of-00082.bin
├── pytorch_model-00007-of-00082.bin
├── pytorch_model-00008-of-00082.bin
├── pytorch_model-00009-of-00082.bin
├── pytorch_model-00010-of-00082.bin
├── pytorch_model-00011-of-00082.bin
├── pytorch_model-00012-of-00082.bin
├── pytorch_model-00013-of-00082.bin
├── pytorch_model-00014-of-00082.bin
├── pytorch_model-00015-of-00082.bin
├── pytorch_model-00016-of-00082.bin
├── pytorch_model-00017-of-00082.bin
├── pytorch_model-00018-of-00082.bin
├── pytorch_model-00019-of-00082.bin
├── pytorch_model-00020-of-00082.bin
├── pytorch_model-00021-of-00082.bin
├── pytorch_model-00022-of-00082.bin
├── pytorch_model-00023-of-00082.bin
├── pytorch_model-00024-of-00082.bin
├── pytorch_model-00025-of-00082.bin
├── pytorch_model-00026-of-00082.bin
├── pytorch_model-00027-of-00082.bin
├── pytorch_model-00028-of-00082.bin
├── pytorch_model-00029-of-00082.bin
├── pytorch_model-00030-of-00082.bin
├── pytorch_model-00031-of-00082.bin
├── pytorch_model-00032-of-00082.bin
├── pytorch_model-00033-of-00082.bin
├── pytorch_model-00034-of-00082.bin
├── pytorch_model-00035-of-00082.bin
├── pytorch_model-00036-of-00082.bin
├── pytorch_model-00037-of-00082.bin
├── pytorch_model-00038-of-00082.bin
├── pytorch_model-00039-of-00082.bin
├── pytorch_model-00040-of-00082.bin
├── pytorch_model-00041-of-00082.bin
├── pytorch_model-00042-of-00082.bin
├── pytorch_model-00043-of-00082.bin
├── pytorch_model-00044-of-00082.bin
├── pytorch_model-00045-of-00082.bin
├── pytorch_model-00046-of-00082.bin
├── pytorch_model-00047-of-00082.bin
├── pytorch_model-00048-of-00082.bin
├── pytorch_model-00049-of-00082.bin
├── pytorch_model-00050-of-00082.bin
├── pytorch_model-00051-of-00082.bin
├── pytorch_model-00052-of-00082.bin
├── pytorch_model-00053-of-00082.bin
├── pytorch_model-00054-of-00082.bin
├── pytorch_model-00055-of-00082.bin
├── pytorch_model-00056-of-00082.bin
├── pytorch_model-00057-of-00082.bin
├── pytorch_model-00058-of-00082.bin
├── pytorch_model-00059-of-00082.bin
├── pytorch_model-00060-of-00082.bin
├── pytorch_model-00061-of-00082.bin
├── pytorch_model-00062-of-00082.bin
├── pytorch_model-00063-of-00082.bin
├── pytorch_model-00064-of-00082.bin
├── pytorch_model-00065-of-00082.bin
├── pytorch_model-00066-of-00082.bin
├── pytorch_model-00067-of-00082.bin
├── pytorch_model-00068-of-00082.bin
├── pytorch_model-00069-of-00082.bin
├── pytorch_model-00070-of-00082.bin
├── pytorch_model-00071-of-00082.bin
├── pytorch_model-00072-of-00082.bin
├── pytorch_model-00073-of-00082.bin
├── pytorch_model-00074-of-00082.bin
├── pytorch_model-00075-of-00082.bin
├── pytorch_model-00076-of-00082.bin
├── pytorch_model-00077-of-00082.bin
├── pytorch_model-00078-of-00082.bin
├── pytorch_model-00079-of-00082.bin
├── pytorch_model-00080-of-00082.bin
├── pytorch_model-00081-of-00082.bin
├── pytorch_model-00082-of-00082.bin
└── pytorch_model.bin.index.json
0 directories, 85 files
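
The merged checkpoint in hf_ckpt/ can then be loaded like any Hugging Face model. A minimal sketch (the path is the HF_CHECKPOINT directory from the export step):

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# the export step saves only the model weights, so load the tokenizer from the base model directory
tokenizer = LlamaTokenizer.from_pretrained("/disk1/llama-13b")
model = LlamaForCausalLM.from_pretrained(
    "./hf_ckpt",
    torch_dtype=torch.float16,
    device_map="auto",
)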

2. Export as PyTorch state_dicts

Similarly, modify the export_state_dict_checkpoint.py script.

V. Quantization (optional)

Finally, quantization can help accelerate model inference and reduce the memory required for inference. There are open-source tools for this that can be used directly.
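
For example, with bitsandbytes already installed above, the merged checkpoint can be loaded with 8-bit weights, which roughly halves the weight memory compared to float16. A hedged sketch, and only one option among several:

from transformers import LlamaForCausalLM

# load the merged checkpoint with 8-bit weights (LLM.int8() via bitsandbytes)
model_8bit = LlamaForCausalLM.from_pretrained(
    "./hf_ckpt",
    load_in_8bit=True,
    device_map="auto",
)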

VI. Related issues

Out of memory (OOM) when saving the checkpoint model

During fine-tuning, we hit an out-of-memory (OOM) error when saving the checkpoint model. According to the discussion in the issue "CUDA out of memory", the cause is a bug in the new bitsandbytes version 0.38.1; rolling back to version 0.37.2 solved the problem.
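
A one-line rollback (assuming pip manages the bitsandbytes install in your environment):

pip install bitsandbytes==0.37.2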

adapter_model.bin contains no parameters at the end of fine-tuning (size 443)

This problem is mainly due to a compatibility issue between alpaca-lora and the peft library. According to the discussion in "fix issues to be compatible with latest peft #359", the simplest fix is to modify finetune.py as described below:

model.save_pretrained(output_dir)  # original code at line 275
model.save_pretrained(output_dir, state_dict=old_state_dict())  # modified code at line 275
