[LLM] Windows local CPU deployment of the community ("folk") Chinese Alpaca model (Chinese-LLaMA-Alpaca): a record of the pitfalls

Time:2024-4-9

Contents

Preface

Prerequisites

Git 

Python 3.9 

CMake

Download the models

Merge the models

Deploy the model


Preface

I’m sure many of you would like to try deploying a large language model like I did, but are held back by hardware costs. Fortunately, the community has released plenty of quantized models, so ordinary users can get a taste too; the model can be deployed on a laptop. Just make sure your computer has at least 16 GB of RAM!

Original repository: GitHub – ymcui/Chinese-LLaMA-Alpaca: Chinese LLaMA & Alpaca Large Language Models + Local CPU Deployment (Chinese LLaMA & Alpaca LLMs)

Tutorials for Linux and Mac are available in the open-source repository, and if you are on an Apple M1 you can also refer to the following article:

https://gist.github.com/cedrickchee/e8d4cb0c4b1df6cc47ce8b18457ebde0


Prerequisites

It’s better to have a proxy configured, otherwise downloads may fail; I spent a whole day retrying a model download.

We first need to install the following on our computer:

  • Git
  • Python 3.9 (created with Anaconda3)
  • CMake (if your computer does not have a C/C++ build environment you will also need to install MinGW)

Git 

Download address: Git – Downloading Package 


After downloading the installer, open it, and click Next to install it…

In a cmd window, type the following command. If a version number is displayed, the installation was successful:

git -v


Python 3.9 

I’m using Anaconda3 to manage Python here. What is Anaconda3?

If you are familiar with Docker, the analogy carries over: Docker can create many containers, each with its own environment. Anaconda3 is similar in that it can create many Python environments with different versions that do not conflict with each other, and you simply switch to whichever version you want to use.

Anaconda3 download address: Anaconda | Anaconda Distribution

Installation is straightforward: run the installer, keep clicking Next, wait for it to complete, and click Finish to close it.

In a cmd window, enter the following command; if a version number is displayed, the installation was successful:

conda -V


Next we create a Python 3.9 environment by typing the following command in a cmd window:

conda create --name py39 python=3.9 -y

The py39 after --name is the environment name; you can call it whatever you like, and you will need it when switching environments.

python=3.9 specifies the Python version.

Adding -y skips the manual y confirmation during installation.


To list the available environments:

conda info -e


Command to activate/switch environments:

conda activate py39

Replace py39 with the name of the environment you want to use.


Once you are inside the environment you can run Python commands as usual.


To exit the environment enter:

conda deactivate

After I deactivate the environment and check the Python version again, Windows tells me that python “is not recognized as an internal or external command, operable program or batch file”. For example:


CMake

CMake is a build tool. We need it to compile llama.cpp, which is used to quantize the model; an unquantized model cannot run on an ordinary personal computer. You can loosely think of quantization as compression. That is not strictly accurate, but it helps with understanding.
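
To make the quantization idea concrete, here is a tiny, purely illustrative Python sketch of symmetric 4-bit quantization of one block of weights. This is not llama.cpp’s actual q4_0 algorithm, just the general principle of trading precision for a much smaller memory footprint:

import numpy as np

def quantize_q4_symmetric(block: np.ndarray):
    """Toy symmetric 4-bit quantization of a small block of weights."""
    scale = np.abs(block).max() / 7.0                             # map the largest magnitude to 7
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)   # 16 integer levels = 4 bits
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)                    # one block of fp32 weights
q, scale = quantize_q4_symmetric(block)
restored = dequantize(q, scale)
print("max absolute error:", np.abs(block - restored).max())

Each weight is stored as a 4-bit integer plus one shared scale per block, which is why a quantized model is several times smaller and fits in ordinary RAM.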

Before installing CMake, we need to install MinGW so that a compiler is available when we build. Press the Win+R shortcut and type powershell.

Enter the following command to install Scoop, a package manager that we will use to download and install MinGW:

(This step may fail if you do not have a proxy enabled.)

iex "& {$(irm get.scoop.sh)} -RunAsAdmin"

Run the following two commands separately after installation (to add libraries):

scoop bucket add extras
scoop bucket add main

Enter the following command to install MinGW:

scoop install mingw

That is all it takes to install MinGW. If you get any errors, please leave a comment and I will reply when I see it.
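
To confirm the compiler is actually on your PATH, you can open a new window and check the gcc version; a version banner means MinGW is ready:

gcc --version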

Next, install CMake.

Download address: Download | CMake 


Installation is again a matter of stepping through the installer; click Finish when it completes.


Download the models

We need to download two models: the original LLaMA weights and the Chinese-expanded LoRA model. They will be merged in the next step. (Note that the repository does not distribute the original LLaMA weights; you have to obtain them yourself.)


  • Download the Chinese-expanded LoRA model:

It is recommended to create a new folder on your D drive and perform the download inside it.


Enter the following commands one after another in the window that opens:

git lfs install
git clone https://huggingface.co/ziqingyang/chinese-alpaca-lora-7b

It may keep failing here because of network problems… just keep retrying. Please comment if you have any other questions and I will reply when I see it.


Merge the models

Finally, the merge step. (I’m getting tired.)

Open a cmd window in the directory where you downloaded the models.


Let me explain what goes into the two directories.

First is the chinese-alpaca-lora-7b directory. This is the one you just downloaded and it normally does not need to be touched; its layout is as follows:

chinese-alpaca-lora-7b/
        – adapter_config.json
        – adapter_model.bin
        – special_tokens_map.json
        – tokenizer_config.json
        – tokenizer.model

Then there is the path_to_original_llama_root_dir directory. You need to create this folder yourself (keep the file names exactly as shown), and its contents should look like this:

path_to_original_llama_root_dir/
        – 7B/                           # a folder named 7B
                – checklist.chk
                – consolidated.00.pth
                – params.json
                – tokenizer_checklist.chk
        – tokenizer.model

Arrange your own files in the layout above.
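
If you want to double-check the layout before merging, a small optional Python check like the one below can save you a failed run later. The D:\llama base path is only an assumption for this tutorial; change it to wherever you put the folders:

from pathlib import Path

# Assumed base folder for this tutorial; adjust to your own download location.
base = Path(r"D:\llama")
expected = [
    base / "chinese-alpaca-lora-7b" / "adapter_model.bin",
    base / "chinese-alpaca-lora-7b" / "tokenizer.model",
    base / "path_to_original_llama_root_dir" / "7B" / "consolidated.00.pth",
    base / "path_to_original_llama_root_dir" / "7B" / "params.json",
    base / "path_to_original_llama_root_dir" / "tokenizer.model",
]
missing = [str(p) for p in expected if not p.exists()]
print("All files in place." if not missing else "Missing: " + ", ".join(missing))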

After opening the window, activate the Python environment you created earlier with Anaconda3.

# Run the following command first if you don't remember which environments you have
conda info -e

# Then activate the environment you need; mine is named py39
conda activate py39

After switching, install the dependencies by executing the following commands one by one:

pip install git+https://github.com/huggingface/transformers

pip install sentencepiece==0.1.97

pip install peft==0.2.0

When the command is executed successfully, the word Successfully will appear.
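
Optionally, you can sanity-check the installed dependencies from Python before running the conversion (the version numbers below are the ones this tutorial installs; torch is needed by the scripts, so install it with pip install torch if it is missing):

# Quick sanity check of the dependencies installed above.
import torch
import transformers
import peft
import sentencepiece

print("transformers:", transformers.__version__)   # installed from the git main branch
print("peft:", peft.__version__)                    # this tutorial pins 0.2.0
print("sentencepiece:", sentencepiece.__version__)  # this tutorial pins 0.1.97
print("torch:", torch.__version__)                  # needed by the conversion and merge scripts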

The next step is to convert the original model to HF format, with the help of the convert_llama_weights_to_hf.py script provided by the latest transformers.

Create a new convert_llama_weights_to_hf.py file in the directory, open it with notepad and paste the following code into it

Note: I’ve copied it directly here for convenience. The script may be updated, so I suggest you go directly to the following address to copy the latest one:

transformers/convert_llama_weights_to_hf.py at main · huggingface/transformers · GitHub

# Copyright 2022 EleutherAI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import gc
import json
import math
import os
import shutil
import warnings

import torch

from transformers import LlamaConfig, LlamaForCausalLM, LlamaTokenizer

try:
    from transformers import LlamaTokenizerFast
except ImportError as e:
    warnings.warn(e)
    warnings.warn(
        "The converted tokenizer will be the `slow` tokenizer. To use the fast, update your `tokenizers` library and re-run the tokenizer conversion"
    )
    LlamaTokenizerFast = None

"""
Sample usage:
```
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
```
Thereafter, models can be loaded via:
```py
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("/output/path")
tokenizer = LlamaTokenizer.from_pretrained("/output/path")
```
Important note: you need to be able to host the whole model in RAM to execute this script (even if the biggest versions
come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM).
"""

INTERMEDIATE_SIZE_MAP = {
    "7B": 11008,
    "13B": 13824,
    "30B": 17920,
    "65B": 22016,
}
NUM_SHARDS = {
    "7B": 1,
    "13B": 2,
    "30B": 4,
    "65B": 8,
}


def compute_intermediate_size(n):
    return int(math.ceil(n * 8 / 3) + 255) // 256 * 256


def read_json(path):
    with open(path, "r") as f:
        return json.load(f)


def write_json(text, path):
    with open(path, "w") as f:
        json.dump(text, f)


def write_model(model_path, input_base_path, model_size):
    os.makedirs(model_path, exist_ok=True)
    tmp_model_path = os.path.join(model_path, "tmp")
    os.makedirs(tmp_model_path, exist_ok=True)

    params = read_json(os.path.join(input_base_path, "params.json"))
    num_shards = NUM_SHARDS[model_size]
    n_layers = params["n_layers"]
    n_heads = params["n_heads"]
    n_heads_per_shard = n_heads // num_shards
    dim = params["dim"]
    dims_per_head = dim // n_heads
    base = 10000.0
    inv_freq = 1.0 / (base ** (torch.arange(0, dims_per_head, 2).float() / dims_per_head))

    # permute for sliced rotary
    def permute(w):
        return w.view(n_heads, dim // n_heads // 2, 2, dim).transpose(1, 2).reshape(dim, dim)

    print(f"Fetching all parameters from the checkpoint at {input_base_path}.")
    # Load weights
    if model_size == "7B":
        # Not sharded
        # (The sharded implementation would also work, but this is simpler.)
        loaded = torch.load(os.path.join(input_base_path, "consolidated.00.pth"), map_location="cpu")
    else:
        # Sharded
        loaded = [
            torch.load(os.path.join(input_base_path, f"consolidated.{i:02d}.pth"), map_location="cpu")
            for i in range(num_shards)
        ]
    param_count = 0
    index_dict = {"weight_map": {}}
    for layer_i in range(n_layers):
        filename = f"pytorch_model-{layer_i + 1}-of-{n_layers + 1}.bin"
        if model_size == "7B":
            # Unsharded
            state_dict = {
                f"model.layers.{layer_i}.self_attn.q_proj.weight": permute(
                    loaded[f"layers.{layer_i}.attention.wq.weight"]
                ),
                f"model.layers.{layer_i}.self_attn.k_proj.weight": permute(
                    loaded[f"layers.{layer_i}.attention.wk.weight"]
                ),
                f"model.layers.{layer_i}.self_attn.v_proj.weight": loaded[f"layers.{layer_i}.attention.wv.weight"],
                f"model.layers.{layer_i}.self_attn.o_proj.weight": loaded[f"layers.{layer_i}.attention.wo.weight"],
                f"model.layers.{layer_i}.mlp.gate_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w1.weight"],
                f"model.layers.{layer_i}.mlp.down_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w2.weight"],
                f"model.layers.{layer_i}.mlp.up_proj.weight": loaded[f"layers.{layer_i}.feed_forward.w3.weight"],
                f"model.layers.{layer_i}.input_layernorm.weight": loaded[f"layers.{layer_i}.attention_norm.weight"],
                f"model.layers.{layer_i}.post_attention_layernorm.weight": loaded[f"layers.{layer_i}.ffn_norm.weight"],
            }
        else:
            # Sharded
            # Note that in the 13B checkpoint, not cloning the two following weights will result in the checkpoint
            # becoming 37GB instead of 26GB for some reason.
            state_dict = {
                f"model.layers.{layer_i}.input_layernorm.weight": loaded[0][
                    f"layers.{layer_i}.attention_norm.weight"
                ].clone(),
                f"model.layers.{layer_i}.post_attention_layernorm.weight": loaded[0][
                    f"layers.{layer_i}.ffn_norm.weight"
                ].clone(),
            }
            state_dict[f"model.layers.{layer_i}.self_attn.q_proj.weight"] = permute(
                torch.cat(
                    [
                        loaded[i][f"layers.{layer_i}.attention.wq.weight"].view(n_heads_per_shard, dims_per_head, dim)
                        for i in range(num_shards)
                    ],
                    dim=0,
                ).reshape(dim, dim)
            )
            state_dict[f"model.layers.{layer_i}.self_attn.k_proj.weight"] = permute(
                torch.cat(
                    [
                        loaded[i][f"layers.{layer_i}.attention.wk.weight"].view(n_heads_per_shard, dims_per_head, dim)
                        for i in range(num_shards)
                    ],
                    dim=0,
                ).reshape(dim, dim)
            )
            state_dict[f"model.layers.{layer_i}.self_attn.v_proj.weight"] = torch.cat(
                [
                    loaded[i][f"layers.{layer_i}.attention.wv.weight"].view(n_heads_per_shard, dims_per_head, dim)
                    for i in range(num_shards)
                ],
                dim=0,
            ).reshape(dim, dim)
            state_dict[f"model.layers.{layer_i}.self_attn.o_proj.weight"] = torch.cat(
                [loaded[i][f"layers.{layer_i}.attention.wo.weight"] for i in range(num_shards)], dim=1
            )
            state_dict[f"model.layers.{layer_i}.mlp.gate_proj.weight"] = torch.cat(
                [loaded[i][f"layers.{layer_i}.feed_forward.w1.weight"] for i in range(num_shards)], dim=0
            )
            state_dict[f"model.layers.{layer_i}.mlp.down_proj.weight"] = torch.cat(
                [loaded[i][f"layers.{layer_i}.feed_forward.w2.weight"] for i in range(num_shards)], dim=1
            )
            state_dict[f"model.layers.{layer_i}.mlp.up_proj.weight"] = torch.cat(
                [loaded[i][f"layers.{layer_i}.feed_forward.w3.weight"] for i in range(num_shards)], dim=0
            )

        state_dict[f"model.layers.{layer_i}.self_attn.rotary_emb.inv_freq"] = inv_freq
        for k, v in state_dict.items():
            index_dict["weight_map"][k] = filename
            param_count += v.numel()
        torch.save(state_dict, os.path.join(tmp_model_path, filename))

    filename = f"pytorch_model-{n_layers + 1}-of-{n_layers + 1}.bin"
    if model_size == "7B":
        # Unsharded
        state_dict = {
            "model.embed_tokens.weight": loaded["tok_embeddings.weight"],
            "model.norm.weight": loaded["norm.weight"],
            "lm_head.weight": loaded["output.weight"],
        }
    else:
        state_dict = {
            "model.norm.weight": loaded[0]["norm.weight"],
            "model.embed_tokens.weight": torch.cat(
                [loaded[i]["tok_embeddings.weight"] for i in range(num_shards)], dim=1
            ),
            "lm_head.weight": torch.cat([loaded[i]["output.weight"] for i in range(num_shards)], dim=0),
        }

    for k, v in state_dict.items():
        index_dict["weight_map"][k] = filename
        param_count += v.numel()
    torch.save(state_dict, os.path.join(tmp_model_path, filename))

    # Write configs
    index_dict["metadata"] = {"total_size": param_count * 2}
    write_json(index_dict, os.path.join(tmp_model_path, "pytorch_model.bin.index.json"))

    config = LlamaConfig(
        hidden_size=dim,
        intermediate_size=compute_intermediate_size(dim),
        num_attention_heads=params["n_heads"],
        num_hidden_layers=params["n_layers"],
        rms_norm_eps=params["norm_eps"],
    )
    config.save_pretrained(tmp_model_path)

    # Make space so we can load the model properly now.
    del state_dict
    del loaded
    gc.collect()

    print("Loading the checkpoint in a Llama model.")
    model = LlamaForCausalLM.from_pretrained(tmp_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
    # Avoid saving this as part of the config.
    del model.config._name_or_path

    print("Saving in the Transformers format.")
    model.save_pretrained(model_path)
    shutil.rmtree(tmp_model_path)


def write_tokenizer(tokenizer_path, input_tokenizer_path):
    # Initialize the tokenizer based on the `spm` model
    tokenizer_class = LlamaTokenizer if LlamaTokenizerFast is None else LlamaTokenizerFast
    print(f"Saving a {tokenizer_class} to {tokenizer_path}")
    tokenizer = tokenizer_class(input_tokenizer_path)
    tokenizer.save_pretrained(tokenizer_path)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_dir",
        help="Location of LLaMA weights, which contains tokenizer.model and model folders",
    )
    parser.add_argument(
        "--model_size",
        choices=["7B", "13B", "30B", "65B", "tokenizer_only"],
    )
    parser.add_argument(
        "--output_dir",
        help="Location to write HF model and tokenizer",
    )
    args = parser.parse_args()
    if args.model_size != "tokenizer_only":
        write_model(
            model_path=args.output_dir,
            input_base_path=os.path.join(args.input_dir, args.model_size),
            model_size=args.model_size,
        )
    spm_path = os.path.join(args.input_dir, "tokenizer.model")
    write_tokenizer(args.output_dir, spm_path)


if __name__ == "__main__":
    main()

Execute the following command in the cmd window (if you are using Anaconda, activate the environment before running it):

python convert_llama_weights_to_hf.py --input_dir path_to_original_llama_root_dir --model_size 7B --output_dir path_to_original_llama_hf_dir

After a long wait ….


Next, merge the models and output PyTorch-format weights (.pth files), using the merge_llama_with_chinese_lora.py script.

Create a new merge_llama_with_chinese_lora.py file in the directory, open it with notepad and paste the following code into it

Note: I’ve copied it directly here for convenience. The script may be updated, so I suggest you go directly to the following address to copy the latest one: 

Chinese-LLaMA-Alpaca/merge_llama_with_chinese_lora.py at main · ymcui/Chinese-LLaMA-Alpaca · GitHub

"""
Borrowed and modified from https://github.com/tloen/alpaca-lora
"""
import argparse
import os
import json
import gc
import torch
import transformers
import peft
from peft import PeftModel
parser = argparse.ArgumentParser()
parser.add_argument('--base_model',default=None,required=True,type=str,help="Please specify a base_model")
parser.add_argument('--lora_model',default=None,required=True,type=str,help="Please specify a lora_model")
# deprecated; the script infers the model size from the checkpoint
parser.add_argument('--model_size',default='7B',type=str,help="Size of the LLaMA model",choices=['7B','13B'])
parser.add_argument('--offload_dir',default=None,type=str,help="(Optional) Please specify a temp folder for offloading (useful for low-RAM machines). Default None (disable offload).")
parser.add_argument('--output_dir',default='./',type=str)
args = parser.parse_args()
assert (
"LlamaTokenizer" in transformers._import_structure["models.llama"]
), "LLaMA is now in HuggingFace's main branch.\nPlease reinstall it: pip uninstall transformers && pip install git+https://github.com/huggingface/transformers.git"
from transformers import LlamaTokenizer, LlamaForCausalLM
BASE_MODEL = args.base_model
LORA_MODEL = args.lora_model
output_dir = args.output_dir
assert (
BASE_MODEL
), "Please specify a BASE_MODEL in the script, e.g. 'decapoda-research/llama-7b-hf'"
tokenizer = LlamaTokenizer.from_pretrained(LORA_MODEL)
if args.offload_dir is not None:
# Load with offloading, which is useful for low-RAM machines.
# Note that if you have enough RAM, please use original method instead, as it is faster.
base_model = LlamaForCausalLM.from_pretrained(
BASE_MODEL,
load_in_8bit=False,
torch_dtype=torch.float16,
offload_folder=args.offload_dir,
offload_state_dict=True,
low_cpu_mem_usage=True,
device_map={"": "cpu"},
)
else:
# Original method without offloading
base_model = LlamaForCausalLM.from_pretrained(
BASE_MODEL,
load_in_8bit=False,
torch_dtype=torch.float16,
device_map={"": "cpu"},
)
base_model.resize_token_embeddings(len(tokenizer))
assert base_model.get_input_embeddings().weight.size(0) == len(tokenizer)
tokenizer.save_pretrained(output_dir)
print(f"Extended vocabulary size: {len(tokenizer)}")
first_weight = base_model.model.layers[0].self_attn.q_proj.weight
first_weight_old = first_weight.clone()
## infer the model size from the checkpoint
emb_to_model_size = {
4096 : '7B',
5120 : '13B',
6656 : '30B',
8192 : '65B',
}
embedding_size = base_model.get_input_embeddings().weight.size(1)
model_size = emb_to_model_size[embedding_size]
print(f"Loading LoRA for {model_size} model")
lora_model = PeftModel.from_pretrained(
base_model,
LORA_MODEL,
device_map={"": "cpu"},
torch_dtype=torch.float16,
)
assert torch.allclose(first_weight_old, first_weight)
# merge weights
print(f"Peft version: {peft.__version__}")
print(f"Merging model")
if peft.__version__ > '0.2.0':
# merge weights - new merging method from peft
lora_model = lora_model.merge_and_unload()
else:
# merge weights
for layer in lora_model.base_model.model.model.layers:
if hasattr(layer.self_attn.q_proj,'merge_weights'):
layer.self_attn.q_proj.merge_weights = True
if hasattr(layer.self_attn.v_proj,'merge_weights'):
layer.self_attn.v_proj.merge_weights = True
if hasattr(layer.self_attn.k_proj,'merge_weights'):
layer.self_attn.k_proj.merge_weights = True
if hasattr(layer.self_attn.o_proj,'merge_weights'):
layer.self_attn.o_proj.merge_weights = True
if hasattr(layer.mlp.gate_proj,'merge_weights'):
layer.mlp.gate_proj.merge_weights = True
if hasattr(layer.mlp.down_proj,'merge_weights'):
layer.mlp.down_proj.merge_weights = True
if hasattr(layer.mlp.up_proj,'merge_weights'):
layer.mlp.up_proj.merge_weights = True
lora_model.train(False)
# did we do anything?
assert not torch.allclose(first_weight_old, first_weight)
lora_model_sd = lora_model.state_dict()
del lora_model, base_model
num_shards_of_models = {'7B': 1, '13B': 2}
params_of_models = {
'7B':
{
"dim": 4096,
"multiple_of": 256,
"n_heads": 32,
"n_layers": 32,
"norm_eps": 1e-06,
"vocab_size": -1,
},
'13B':
{
"dim": 5120,
"multiple_of": 256,
"n_heads": 40,
"n_layers": 40,
"norm_eps": 1e-06,
"vocab_size": -1,
},
}
params = params_of_models[model_size]
num_shards = num_shards_of_models[model_size]
n_layers = params["n_layers"]
n_heads = params["n_heads"]
dim = params["dim"]
dims_per_head = dim // n_heads
base = 10000.0
inv_freq = 1.0 / (base ** (torch.arange(0, dims_per_head, 2).float() / dims_per_head))
def permute(w):
return (
w.view(n_heads, dim // n_heads // 2, 2, dim).transpose(1, 2).reshape(dim, dim)
)
def unpermute(w):
return (
w.view(n_heads, 2, dim // n_heads // 2, dim).transpose(1, 2).reshape(dim, dim)
)
def translate_state_dict_key(k):
k = k.replace("base_model.model.", "")
if k == "model.embed_tokens.weight":
return "tok_embeddings.weight"
elif k == "model.norm.weight":
return "norm.weight"
elif k == "lm_head.weight":
return "output.weight"
elif k.startswith("model.layers."):
layer = k.split(".")[2]
if k.endswith(".self_attn.q_proj.weight"):
return f"layers.{layer}.attention.wq.weight"
elif k.endswith(".self_attn.k_proj.weight"):
return f"layers.{layer}.attention.wk.weight"
elif k.endswith(".self_attn.v_proj.weight"):
return f"layers.{layer}.attention.wv.weight"
elif k.endswith(".self_attn.o_proj.weight"):
return f"layers.{layer}.attention.wo.weight"
elif k.endswith(".mlp.gate_proj.weight"):
return f"layers.{layer}.feed_forward.w1.weight"
elif k.endswith(".mlp.down_proj.weight"):
return f"layers.{layer}.feed_forward.w2.weight"
elif k.endswith(".mlp.up_proj.weight"):
return f"layers.{layer}.feed_forward.w3.weight"
elif k.endswith(".input_layernorm.weight"):
return f"layers.{layer}.attention_norm.weight"
elif k.endswith(".post_attention_layernorm.weight"):
return f"layers.{layer}.ffn_norm.weight"
elif k.endswith("rotary_emb.inv_freq") or "lora" in k:
return None
else:
print(layer, k)
raise NotImplementedError
else:
print(k)
raise NotImplementedError
def save_shards(lora_model_sd, num_shards: int):
# Add the no_grad context manager
with torch.no_grad():
if num_shards == 1:
new_state_dict = {}
for k, v in lora_model_sd.items():
new_k = translate_state_dict_key(k)
if new_k is not None:
if "wq" in new_k or "wk" in new_k:
new_state_dict[new_k] = unpermute(v)
else:
new_state_dict[new_k] = v
os.makedirs(output_dir, exist_ok=True)
print(f"Saving shard 1 of {num_shards} into {output_dir}/consolidated.00.pth")
torch.save(new_state_dict, output_dir + "/consolidated.00.pth")
with open(output_dir + "/params.json", "w") as f:
json.dump(params, f)
else:
new_state_dicts = [dict() for _ in range(num_shards)]
for k in list(lora_model_sd.keys()):
v = lora_model_sd[k]
new_k = translate_state_dict_key(k)
if new_k is not None:
if new_k=='tok_embeddings.weight':
print(f"Processing {new_k}")
assert v.size(1)%num_shards==0
splits = v.split(v.size(1)//num_shards,dim=1)
elif new_k=='output.weight':
print(f"Processing {new_k}")
splits = v.split(v.size(0)//num_shards,dim=0)
elif new_k=='norm.weight':
print(f"Processing {new_k}")
splits = [v] * num_shards
elif 'ffn_norm.weight' in new_k:
print(f"Processing {new_k}")
splits = [v] * num_shards
elif 'attention_norm.weight' in new_k:
print(f"Processing {new_k}")
splits = [v] * num_shards
elif 'w1.weight' in new_k:
print(f"Processing {new_k}")
splits = v.split(v.size(0)//num_shards,dim=0)
elif 'w2.weight' in new_k:
print(f"Processing {new_k}")
splits = v.split(v.size(1)//num_shards,dim=1)
elif 'w3.weight' in new_k:
print(f"Processing {new_k}")
splits = v.split(v.size(0)//num_shards,dim=0)
elif 'wo.weight' in new_k:
print(f"Processing {new_k}")
splits = v.split(v.size(1)//num_shards,dim=1)
elif 'wv.weight' in new_k:
print(f"Processing {new_k}")
splits = v.split(v.size(0)//num_shards,dim=0)
elif "wq.weight" in new_k or "wk.weight" in new_k:
print(f"Processing {new_k}")
v = unpermute(v)
splits = v.split(v.size(0)//num_shards,dim=0)
else:
print(f"Unexpected key {new_k}")
raise ValueError
for sd,split in zip(new_state_dicts,splits):
sd[new_k] = split.clone()
del split
del splits
del lora_model_sd[k],v
gc.collect()    # Effectively enforce garbage collection
os.makedirs(output_dir, exist_ok=True)
for i,new_state_dict in enumerate(new_state_dicts):
print(f"Saving shard {i+1} of {num_shards} into {output_dir}/consolidated.0{i}.pth")
torch.save(new_state_dict, output_dir + f"/consolidated.0{i}.pth")
with open(output_dir + "/params.json", "w") as f:
print(f"Saving params.json into {output_dir}/params.json")
json.dump(params, f)
save_shards(lora_model_sd=lora_model_sd, num_shards=num_shards)

Execute the following command (again, activate the Anaconda environment first if needed):

python merge_llama_with_chinese_lora.py --base_model path_to_original_llama_hf_dir --lora_model chinese-alpaca-lora-7b --output_dir path_to_output_dir

Parameter Description:

  • --base_model: directory containing the HF-format LLaMA weights and config files (converted in the previous step)
  • --lora_model: directory of the Chinese-expanded LoRA model
  • --output_dir: directory where the full merged model weights are saved; defaults to ./
  • (Optional) --offload_dir: for low-memory users, specify an offload cache path (see the example after this list)
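
For example, on a machine that is short on RAM, the merge command from above could be run with an offload folder (the offload_tmp name here is just an example):

python merge_llama_with_chinese_lora.py --base_model path_to_original_llama_hf_dir --lora_model chinese-alpaca-lora-7b --output_dir path_to_output_dir --offload_dir offload_tmp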

For more details, please see the open-source repository: GitHub – ymcui/Chinese-LLaMA-Alpaca: Chinese LLaMA & Alpaca Large Language Models + Local CPU/GPU Deployment (Chinese LLaMA & Alpaca LLMs)

At this point the models have been merged; the output directory looks like this:


Let’s get ready for deployment.


Deploy the model

We first need to download llama.cpp to quantize the model; enter the following command:

git clone https://github.com/ggerganov/llama.cpp

The directory now looks like this:


Here comes the key part. Enter the following command in the window to change into the llama.cpp directory you just downloaded:

cd llama.cpp

If you followed this tutorial and installed MinGW with Scoop (the package manager), use the following commands (scroll back up if you did not):

cmake . -G "MinGW Makefiles"
cmake --build . --config Release

After running the above commands you should see the compiled files (including main.exe and quantize.exe) in the bin directory of llama.cpp.

If you installed MinGW with its standalone installer instead, use the following commands:

mkdir build
cd build
cmake ..
cmake --build . --config Release

After running the above commands, the files should be under the build → Release → bin directory.


Do not run both sets of commands; pick the set that matches how you installed MinGW!

If these files are not there, you must have hit an error somewhere, usually either a failed dependency download or a compilation failure. This part took me a long time to sort out.

Next, create a new zh-models folder inside llama.cpp; it will hold the quantized model.

The directory format of zh-models is as follows:

zh-models/
        – 7B/                           # a folder named 7B
                – consolidated.00.pth
                – params.json
        – tokenizer.model

Copy consolidated.00.pth and params.json from the path_to_output_dir folder into the locations shown above.

Put the tokenizer.model file from the path_to_output_dir folder at the same level as the 7B folder.

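If you prefer to script the copying instead of doing it by hand, a short Python snippet like the following will lay the files out for you. The D:\llama paths are only an assumption based on this tutorial; adjust them to your own directories:

import shutil
from pathlib import Path

# Assumed locations; adjust src/dst to where your merged output and llama.cpp actually live.
src = Path(r"D:\llama\path_to_output_dir")
dst = Path(r"D:\llama\llama.cpp\zh-models")

(dst / "7B").mkdir(parents=True, exist_ok=True)
shutil.copy2(src / "consolidated.00.pth", dst / "7B" / "consolidated.00.pth")
shutil.copy2(src / "params.json", dst / "7B" / "params.json")
shutil.copy2(src / "tokenizer.model", dst / "tokenizer.model")
print("zh-models layout prepared.")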

Next, enter the following command to convert the .pth model weights above into ggml FP16 format; the generated file will be zh-models/7B/ggml-model-f16.bin.

python convert-pth-to-ggml.py zh-models/7B/ 1


Then further quantize the FP16 model to 4 bits; the quantized model file will be zh-models/7B/ggml-model-q4_0.bin.

D:\llama\llama.cpp\bin\quantize.exe ./zh-models/7B/ggml-model-f16.bin ./zh-models/7B/ggml-model-q4_0.bin 2

quantize.exe lives in the bin directory; adjust the path to match your setup.

At this point quantization is done and we can deploy the model to see the effect. If your computer is well equipped you can deploy the f16 model; otherwise deploy q4_0.

D:\llama\llama.cpp\bin\main.exe -m zh-models/7B/ggml-model-q4_0.bin --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.3

After the > prompt appears, enter your prompt. Press Ctrl+C to interrupt the output. For multi-line input, end each line with \.

Common parameters (for more options, run D:\llama\llama.cpp\bin\main.exe -h):

-ins launches a ChatGPT-like interactive dialog mode
-f specifies the prompt template; for the Alpaca model, load prompts/alpaca.txt
-c controls the context length; the larger the value, the longer the dialog history that can be referenced (default: 512)
-n controls the maximum length of the generated reply (default: 128)
-b controls the batch size (default: 8); can be increased as appropriate
-t controls the number of threads (default: 4)
--repeat_penalty controls how strongly repeated text in the reply is penalized
--temp is the temperature; the lower the value, the less random the reply, and vice versa
--top_p, top_k control the sampling-related decoding parameters

If you want to deploy f16, you can replace the -m parameter in the command with zh-models/7B/ggml-model-f16.bin.

Deployment result: the model now answers prompts interactively in the console.

It’s finally done.

