├── README.md
├── README_zh.md
├── alpaca
├── scripts
│ ├── assert
│ │ ├── dict.txt
│ │ └── test.src
│ ├── fsdp
│ │ ├── README.md
│ │ ├── inference
│ │ │ ├── run_inf.sh
│ │ │ ├── run_inf_hub.sh
│ │ │ └── run_webapp.sh
│ │ ├── run_train.sh
│ │ ├── run_train_belle.sh
│ │ └── run_train_cpu_offload.sh
│ ├── lora
│ │ ├── README.md
│ │ ├── inference
│ │ │ ├── run_inf.sh
│ │ │ ├── run_inf_hub.sh
│ │ │ └── run_webapp.sh
│ │ └── run_train.sh
│ ├── megatron
│ │ ├── README.md
│ │ ├── inference
│ │ │ └── run_inf_megatron.sh
│ │ └── run_train_megatron.sh
│ ├── megatron_lora
│ │ ├── README.md
│ │ ├── inference
│ │ │ └── run_inf_megatron_lora.sh
│ │ └── run_train_megatron_lora.sh
│ └── utils
│ │ ├── README.md
│ │ ├── convert_llama_to_half.py
│ │ ├── merge_llama_megatron_ckpt.py
│ │ ├── prepare_inf_data.sh
│ │ ├── prepare_llama_belle_data.sh
│ │ ├── prepare_llama_training_data.sh
│ │ ├── prepare_utils.py
│ │ ├── process_llama_ckpt.py
│ │ └── process_llama_megatron_ckpt.py
└── src
│ ├── __init__.py
│ ├── __pycache__
│ ├── __init__.cpython-37.pyc
│ ├── megatron_trainer.cpython-37.pyc
│ ├── trainer.cpython-37.pyc
│ └── utils.cpython-37.pyc
│ ├── fsdp
│ ├── __pycache__
│ │ ├── cpu_adam.cpython-37.pyc
│ │ └── fully_sharded_data_parallel.cpython-37.pyc
│ ├── cpu_adam.py
│ └── fully_sharded_data_parallel.py
│ ├── generate.py
│ ├── generator
│ ├── __pycache__
│ │ ├── search.cpython-37.pyc
│ │ └── sequence_generator.cpython-37.pyc
│ ├── search.py
│ └── sequence_generator.py
│ ├── inference.py
│ ├── loss
│ ├── __pycache__
│ │ └── lm_loss.cpython-37.pyc
│ └── lm_loss.py
│ ├── megatron_trainer.py
│ ├── model
│ ├── __pycache__
│ │ ├── hub_interface.cpython-37.pyc
│ │ ├── llama_megatron_transformer.cpython-37.pyc
│ │ ├── llama_model.cpython-37.pyc
│ │ ├── llama_transformer.cpython-37.pyc
│ │ └── lora_modules.cpython-37.pyc
│ ├── hub_interface.py
│ ├── llama_megatron_transformer.py
│ ├── llama_model.py
│ ├── llama_transformer.py
│ └── lora_modules.py
│ ├── preprocess.py
│ ├── task
│ ├── __pycache__
│ │ ├── dictionary.cpython-37.pyc
│ │ ├── seq2seq_dataset.cpython-37.pyc
│ │ ├── seq2seq_ft_task.cpython-37.pyc
│ │ └── seq2seq_lora_task.cpython-37.pyc
│ ├── dictionary.py
│ ├── seq2seq_dataset.py
│ ├── seq2seq_ft_task.py
│ └── seq2seq_lora_task.py
│ ├── train_fsdp.py
│ ├── train_lora.py
│ ├── train_megatron.py
│ ├── trainer.py
│ ├── utils.py
│ └── webapp.py
├── efficient_alpaca_logo.PNG
├── efficient_alpaca_logo_old.PNG
└── webapp.PNG
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 | Efficient Alpaca
7 |
8 |
9 |
10 |
11 | English | 中文
12 |
13 |
14 |
15 |
16 | The aim of Efficient Alpaca is to use LLaMA to build and enhance LLM-based chatbots, including but not limited to **reducing resource consumption (GPU memory and training time)**, **improving inference speed**, and **making the code easier for researchers to use** (especially fairseq users). This project will be constantly updated and maintained. Please feel free to use it!
17 |
18 |
19 | **************************** Updates ****************************
20 | - 4/5 We support fine-tuning with FSDP, trading extra RAM for lower GPU memory usage!
21 | - 3/17 We support model parallelism with Megatron-LM to reduce GPU memory!
22 | - 3/15 We support LoRA (efficient fine-tuning) to reproduce Stanford Alpaca!
23 |
24 |
25 | ## Supported Inference Devices
26 |
27 | We can choose any of the following devices for inference, even a 12G GTX 1080.
28 |
29 | | Method | Support | Device | GPU | Inference Speed |
30 | | -------- | ------------- | ---------- | ------- | ------------------- |
31 | | Original | all | 1 24G 3090 | 14G | |
32 | | Megatron | megatron_lora | 2 12G 1080 | 8G | |
33 |
34 | ## Supported Training Methods and Devices
35 |
36 | We can choose among the following methods and combinations. For example, if you have two 24G 3090s and plenty of RAM, you have two options: 1. use Megatron-LM for efficient fine-tuning (does not consume much RAM), or 2. use FSDP for fine-tuning (consumes a large amount of extra RAM).
37 |
38 | | Method | Type | Support | Data Para | Model Para | Device | GPU | Memory Limit | Training Speed |
39 | | ------------- | --------------------- | ------------- | --------- | ---------- | ---------- | ------- | ------------- | ------------------- |
40 | | LoRA | Efficient Fine-tuning | lora | ✓ | ✓ | 1 40G A100 | 30G | No | 90 sec / 100 step |
41 | | Megatron-LoRA | Efficient Fine-tuning | megatron_lora | ✗ | ✓ | 2 24G 3090 | 21G | No | 190 sec / 100 step |
42 | | FSDP | Fine-tuning | fsdp | ✓ | ✓ | 1 40G A100 | 32G | 128G + | 1600 sec / 100 step |
43 | | | | | | | 8 40G A100 | 32G | No | 400 sec / 100 step |
44 | | | | | | | 2 24G 3090 | 13G | 128G + | 900 sec / 100 step |
45 | | | | | | | 8 24G 3090 | 22G | 128G + | 800 sec / 100 step |
46 | | Megatron | Fine-tuning | megatron | ✗ | ✓ | 4 40G A100 | 25G | No | 130 sec / 100 step |
47 | | | | | | | 8 24G 3090 | 14G | No | 130 sec / 100 step |
48 |
49 | Some explanations about the table above:
50 |
51 | All evaluations were run with the hyper-parameter `--max-tokens 2048`.
52 |
53 | * Data Para: whether data parallelism is supported.
54 | * Model Para: whether model parallelism is supported.
55 | * GPU: approximate GPU memory used on each device during training.
56 | * Memory Limit: approximate RAM required during training.
57 | * Training Speed: reflects per-step speed rather than total training time; data parallelism shortens training time but does not change per-step speed.
58 |
59 |
60 |
61 | ## Web Interface
62 |
63 | We support a web interface using [Gradio](https://gradio.app/).
64 |
65 | ```
66 | bash alpaca/scripts/lora/inference/run_webapp.sh
67 | ```
68 |
69 |
70 |
71 |
72 |
73 | ## Setup
74 | Ensure that a working PyTorch and CUDA environment is available, then install the following dependencies:
75 |
76 | ```
77 | pip install fairseq
78 | pip install fairscale
79 | ```
80 |
81 | We also need to build sentencepiece from the [official repo](https://github.com/google/sentencepiece) to process the data, or to hack it for your specific task.
82 |
83 | ```
84 | git clone https://github.com/google/sentencepiece.git
85 | cd sentencepiece
86 | mkdir build
87 | cd build
88 | cmake ..
89 | make -j $(nproc)
90 | make install
91 | ```
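
After building, the encoder binary ends up at `sentencepiece/build/src/spm_encode`, which is the path the data-preparation scripts use for their `SPM` variable. A minimal sanity check (the tokenizer path below is only an example and should point at your downloaded LLaMA `tokenizer.model`):

```
echo "hello world" | sentencepiece/build/src/spm_encode --model=/path/to/tokenizer.model
```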
92 |
93 | ## Prepare Model and Data
94 |
95 | + [Data and Model Preprocess](alpaca/scripts/utils/README.md)
96 |
97 | ## Training Step:
98 |
99 | Efficient-Finetuning
100 | + [LoRA](alpaca/scripts/lora/README.md)
101 | + [Megatron + LoRA](alpaca/scripts/megatron_lora/README.md)
102 |
103 | Fine-tuning
104 | + [Megatron](alpaca/scripts/megatron/README.md)
105 | + [Fully Sharded Data Parallel](alpaca/scripts/fsdp/README.md)
106 |
107 |
108 | ## Some Cases Sampled by Our Model:
109 |
110 | 1. **Question**: what is the capital of china?
111 |
112 | **Answer**: Beijing is the capital of China.
113 |
114 | 2. **Question**: please give a pieces of python code about socket connection.
115 |
116 | **Answer**:
117 | ```
118 | import socket
119 |
120 | s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
121 | print("Constructing a socket connection")
122 | print(s)
123 |
124 | s.bind(('', 5000)) # Bind the socket to the '(' and 5000'
125 |
126 | print("Listening for connections")
127 | while True:
128 | connection, address = s.accept()
129 | print("Got connection from", address)
130 | print("Connected!")
131 | connection.close() # Close the connection.
132 |
133 | print("Exiting the program")
134 | ```
135 |
136 | 3. **Question**: 请讲一个武侠故事.
137 |
138 | **Answer**:
139 | The legendary warrior Guan Yu Once upon a time, there was a legendary warrior named Guan Yu. He was a highly skilled fighter and commander and was known for his great bravery and loyalty. He was the loyal servant of Emperor Liu Bei, and fought in the Three Kingdoms civil war alongside him. He was brave enough to challenge the powerful enemy warriors, and eventually helped Liu Bei's army defeat the Wei Kingdom, the strongest of the three kingdoms. Guan Yu was a great warrior and a hero in Chinese history.
140 |
141 | 4. **Question**: what is an alpaca? how is it different form a llama?
142 |
143 | **Answer**:
144 | Alpacas and llamas are both members of the Camelidae family. They are both herbivorous mammals that are domesticated for their coats, but they are distinctly different species. Llamas are generally larger than alpacas and have more pronounced spiral horns, while alpacas have smaller horns and more delicate coats. Alpacas also have smaller teeth and a narrower neck than llamas, and they are able to reproduced year-round. Llamas are also unable to reproduce during the winter months and have a more social behavior than alpacas.
145 |
146 | 5. **Question**: write a e-mail to congratulate Lee and mention that you are exited about meeting all of them in person.
147 |
148 | **Answer**:
149 | ```
150 | Dear Lee,
151 |
152 | I want to congratulate you on your success and I'm looking forward to meeting you and the other team members in person.
153 | It's exciting to see the progress you've made and I'm sure the hard work you have put in has paid off.
154 | I'm very proud of you and I'm looking forward to meeting you in person and seeing all of your progress.
155 |
156 | Sincerely,
157 | [Your Name]
158 | ```
159 |
160 | 6. **Question**: write a well-thought our abstract for a machine learning parper that proves that 42 is the optimal seed for training neural network.
161 |
162 | **Answer**:
163 | This paper investigates the effect of different seeds on the performance of neural networks. Through experiments, we demonstrate that the optimal seed is 42, which leads to the best performance on a wide range of tasks. We analyze the effects of different seeds on the model's accuracy, robustness, and other metrics. Our results show that 42 is the best seed for neural networks, and we discuss the implications of this finding.
164 |
165 |
166 | ## Reference
167 |
168 |
169 |
170 | Fairseq: a sequence modeling toolkit \[[github](https://github.com/facebookresearch/fairseq)\]
171 | ```
172 | @inproceedings{ott2019fairseq,
173 | title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
174 | author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
175 | booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
176 | year = {2019},
177 | }
178 | ```
179 |
180 | FairScale: a PyTorch extension library for high performance and large scale training. \[[github](https://github.com/facebookresearch/fairscale)\]
181 | ```
182 | @Misc{FairScale2021,
183 | author = {{FairScale authors}},
184 | title = {FairScale: A general purpose modular PyTorch library for high performance and large scale training},
185 | howpublished = {\url{https://github.com/facebookresearch/fairscale}},
186 | year = {2021}
187 | }
188 | ```
189 |
190 | LLaMA: Open and Efficient Foundation Language Models \[[paper](https://arxiv.org/abs/2302.13971)\]\[[github](https://github.com/facebookresearch/llama)\]
191 |
192 | ```
193 | @article{touvron2023llama,
194 | title={LLaMA: Open and Efficient Foundation Language Models},
195 | author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
196 | journal={arXiv preprint arXiv:2302.13971},
197 | year={2023}
198 | }
199 | ```
200 |
201 | Stanford Alpaca: An Instruction-following LLaMA model \[[github](https://github.com/tatsu-lab/stanford_alpaca)\]
202 |
203 | ```
204 | @misc{alpaca,
205 | author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
206 | title = {Stanford Alpaca: An Instruction-following LLaMA model},
207 | year = {2023},
208 | publisher = {GitHub},
209 | journal = {GitHub repository},
210 | howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
211 | }
--------------------------------------------------------------------------------
/README_zh.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 | Efficient Alpaca
8 |
9 |
10 |
11 |
12 | English | 中文
13 |
14 |
15 |
16 | The aim of Efficient Alpaca is to make it easy to build or enhance LLM-based chatbots; its features include, but are not limited to, **reducing resource usage (GPU memory, training time)**, **faster inference**, and **ease of use for developers (especially users familiar with fairseq)**. The project will be continuously updated; feel free to use it!
17 |
18 | **************************** Updates ****************************
19 | - 4/5 We support fine-tuning with FSDP, which can use extra RAM to reduce GPU memory usage!
20 | - 3/17 We support Megatron-LM to reduce GPU memory, for both fine-tuning and efficient fine-tuning!
21 | - 3/15 We support efficient fine-tuning with LoRA to reproduce Stanford Alpaca!
22 |
23 |
24 | ## Supported Inference Devices
25 |
26 | You can choose any of the following devices for inference, even a 12G GTX 1080.
27 |
28 | | Method | Support | Device | GPU | Inference Speed |
29 | | -------- | ------------- | ---------- | ------- | ------------------- |
30 | | Original | all | 1 24G 3090 | 14G | |
31 | | Megatron | megatron_lora | 2 12G 1080 | 8G | |
32 |
33 | ## Supported Training Methods and Devices
34 |
35 | You can pick an available method from the combinations in the table below. For example, with two 3090s and plenty of RAM you have two options: 1. use Megatron-LM for efficient fine-tuning (does not consume much RAM), or 2. use FSDP for fine-tuning (consumes a large amount of extra RAM).
36 |
37 | | Method | Type | Support | Data Para | Model Para | Device | GPU | Memory Limit | Training Speed |
38 | | ------------- | --------------------- | ------------- | --------- | ---------- | ---------- | ------- | ------------- | ------------------- |
39 | | LoRA | Efficient Fine-tuning | lora | ✓ | ✓ | 1 40G A100 | 30G | No | 90 sec / 100 step |
40 | | Megatron-LoRA | Efficient Fine-tuning | megatron_lora | ✗ | ✓ | 2 24G 3090 | 21G | No | 190 sec / 100 step |
41 | | FSDP | Fine-tuning | fsdp | ✓ | ✓ | 1 40G A100 | 32G | 128G + | 1600 sec / 100 step |
42 | | | | | | | 8 40G A100 | 32G | No | 400 sec / 100 step |
43 | | | | | | | 2 24G 3090 | 13G | 128G + | 900 sec / 100 step |
44 | | | | | | | 8 24G 3090 | 22G | 128G + | 800 sec / 100 step |
45 | | Megatron | Fine-tuning | megatron | ✗ | ✓ | 4 40G A100 | 25G | No | 130 sec / 100 step |
46 | | | | | | | 8 24G 3090 | 14G | No | 130 sec / 100 step |
47 |
48 |
49 | Some explanations about the table above:
50 |
51 | All of the above experiments were run with `--max-tokens 2048`.
52 |
53 | * Data Para: whether data parallelism is supported.
54 | * Model Para: whether model parallelism is supported.
55 | * GPU: approximate GPU memory actually used during training.
56 | * Memory Limit: approximate RAM requirement; a rough estimate rather than an exact measurement.
57 | * Training Speed: reflects per-step speed rather than total training time; for example, data parallelism shortens training time but does not change per-step speed.
58 |
59 |
60 |
61 |
62 | ## Web Interface
63 |
64 | We provide a web demo based on [Gradio](https://gradio.app/).
65 |
66 | ```
67 | bash alpaca/scripts/lora/inference/run_webapp.sh
68 | ```
69 |
70 |
71 |
72 |
73 |
74 |
75 | ## Setup
76 | Ensure that a working CUDA environment is available, then install the following dependencies:
77 |
78 | ```
79 | pip install fairseq
80 | pip install fairscale
81 | ```
82 |
83 | To train a model, please build sentencepiece from the [official repo](https://github.com/google/sentencepiece) to process the data.
84 |
85 | ```
86 | git clone https://github.com/google/sentencepiece.git
87 | cd sentencepiece
88 | mkdir build
89 | cd build
90 | cmake ..
91 | make -j $(nproc)
92 | make install
93 | ```
94 |
95 | ## Prepare Model and Data
96 |
97 | + [Data and Model Preprocess](alpaca/scripts/utils/README.md)
98 |
99 | ## Training Step
100 |
101 | Efficient-Finetuning
102 | + [LoRA](alpaca/scripts/lora/README.md)
103 | + [Megatron + LoRA](alpaca/scripts/megatron_lora/README.md)
104 |
105 | Fine-tuning
106 | + [Megatron](alpaca/scripts/megatron/README.md)
107 | + [Fully Sharded Data Parallel](alpaca/scripts/fsdp/README.md)
108 |
109 | ## Sample Model Outputs:
110 |
111 | 1. **Question**: what is the capital of china?
112 |
113 | **Answer**: Beijing is the capital of China.
114 |
115 | 2. **Question**: please give a pieces of python code about socket connection.
116 |
117 | **Answer**:
118 | ```
119 | import socket
120 |
121 | s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
122 | print("Constructing a socket connection")
123 | print(s)
124 |
125 | s.bind(('', 5000)) # Bind the socket to the '(' and 5000'
126 |
127 | print("Listening for connections")
128 | while True:
129 | connection, address = s.accept()
130 | print("Got connection from", address)
131 | print("Connected!")
132 | connection.close() # Close the connection.
133 |
134 | print("Exiting the program")
135 | ```
136 |
137 | 3. **Question**: 请讲一个武侠故事.
138 |
139 | **Answer**:
140 | The legendary warrior Guan Yu Once upon a time, there was a legendary warrior named Guan Yu. He was a highly skilled fighter and commander and was known for his great bravery and loyalty. He was the loyal servant of Emperor Liu Bei, and fought in the Three Kingdoms civil war alongside him. He was brave enough to challenge the powerful enemy warriors, and eventually helped Liu Bei's army defeat the Wei Kingdom, the strongest of the three kingdoms. Guan Yu was a great warrior and a hero in Chinese history.
141 |
142 | 4. **Question**: what is an alpaca? how is it different form a llama?
143 |
144 | **Answer**:
145 | Alpacas and llamas are both members of the Camelidae family. They are both herbivorous mammals that are domesticated for their coats, but they are distinctly different species. Llamas are generally larger than alpacas and have more pronounced spiral horns, while alpacas have smaller horns and more delicate coats. Alpacas also have smaller teeth and a narrower neck than llamas, and they are able to reproduced year-round. Llamas are also unable to reproduce during the winter months and have a more social behavior than alpacas.
146 |
147 | 5. **Question**: write a e-mail to congratulate Lee and mention that you are exited about meeting all of them in person.
148 |
149 | **Answer**:
150 | ```
151 | Dear Lee,
152 |
153 | I want to congratulate you on your success and I'm looking forward to meeting you and the other team members in person.
154 | It's exciting to see the progress you've made and I'm sure the hard work you have put in has paid off.
155 | I'm very proud of you and I'm looking forward to meeting you in person and seeing all of your progress.
156 |
157 | Sincerely,
158 | [Your Name]
159 | ```
160 |
161 | 6. **Question**: write a well-thought our abstract for a machine learning parper that proves that 42 is the optimal seed for training neural network.
162 |
163 | **Answer**:
164 | This paper investigates the effect of different seeds on the performance of neural networks. Through experiments, we demonstrate that the optimal seed is 42, which leads to the best performance on a wide range of tasks. We analyze the effects of different seeds on the model's accuracy, robustness, and other metrics. Our results show that 42 is the best seed for neural networks, and we discuss the implications of this finding.
165 |
166 |
167 | ## Reference
168 |
169 | In addition, thanks to the open source projects and communities that this project relies on, including but not limited to the following:
170 |
171 | Fairseq: a sequence modeling toolkit \[[github](https://github.com/facebookresearch/fairseq)\]
172 | ```
173 | @inproceedings{ott2019fairseq,
174 | title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
175 | author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
176 | booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
177 | year = {2019},
178 | }
179 | ```
180 |
181 | FairScale: a PyTorch extension library for high performance and large scale training. \[[github](https://github.com/facebookresearch/fairscale)\]
182 | ```
183 | @Misc{FairScale2021,
184 | author = {{FairScale authors}},
185 | title = {FairScale: A general purpose modular PyTorch library for high performance and large scale training},
186 | howpublished = {\url{https://github.com/facebookresearch/fairscale}},
187 | year = {2021}
188 | }
189 | ```
190 |
191 | LLaMA: Open and Efficient Foundation Language Models \[[paper](https://arxiv.org/abs/2302.13971)\]\[[github](https://github.com/facebookresearch/llama)\]
192 |
193 | ```
194 | @article{touvron2023llama,
195 | title={LLaMA: Open and Efficient Foundation Language Models},
196 | author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
197 | journal={arXiv preprint arXiv:2302.13971},
198 | year={2023}
199 | }
200 | ```
201 |
202 | Stanford Alpaca: An Instruction-following LLaMA model \[[github](https://github.com/tatsu-lab/stanford_alpaca)\]
203 |
204 | ```
205 | @misc{alpaca,
206 | author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
207 | title = {Stanford Alpaca: An Instruction-following LLaMA model},
208 | year = {2023},
209 | publisher = {GitHub},
210 | journal = {GitHub repository},
211 | howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
212 | }
--------------------------------------------------------------------------------
/alpaca/scripts/assert/test.src:
--------------------------------------------------------------------------------
1 | tell a story about Sleeping Beauty, please.
2 | write a e-mail to congratulate Lee and mention that you are exited about meeting all of them in person.
3 | what is an alpaca? how is it different form a llama?
4 | what is the capital of Tanzania?
5 | what is the capital of china?
6 | write a well-thought our abstract for a machine learning parper that proves that 42 is the optimal seed for training neural network.
7 | 请将一个武侠故事.
8 | please give a pieces of python code about socket connection.
9 |
--------------------------------------------------------------------------------
/alpaca/scripts/fsdp/README.md:
--------------------------------------------------------------------------------
1 | # Fully Sharded Data Parallel
2 |
3 | FSDP: [Fully Sharded Data Parallel: faster AI training with fewer GPUs](https://engineering.fb.com/2021/07/15/open-source/fsdp/)
4 |
5 | ## Training Step
6 |
7 | ```
8 | bash alpaca/scripts/fsdp/run_train.sh
9 | ```
10 |
11 | ```
12 | bash alpaca/scripts/fsdp/run_train_cpu_offload.sh
13 | ```
14 |
15 | ## Inference Step
16 |
17 | + (Batch-Level) Please prepare the test file (see the preparation sketch after the command below).
18 |
19 | ```
20 | bash alpaca/scripts/fsdp/inference/run_inf.sh
21 | ```
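
To prepare that test file, put one prompt per line, spm-encode it into `${DATA}/test.spm.src`, and binarize it with the helper script below (a sketch; the paths hard-coded inside the script will likely need editing):

```
bash alpaca/scripts/utils/prepare_inf_data.sh
```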
22 |
23 | + (Instance-Level) Edit line 17 of alpaca/src/inference.py to set the prompts, then run:
24 |
25 | ```
26 | bash alpaca/scripts/fsdp/inference/run_inf_hub.sh
27 | ```
--------------------------------------------------------------------------------
/alpaca/scripts/fsdp/inference/run_inf.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export CUDA_VISIBLE_DEVICES=0
6 |
7 | data_dir=/opt/data/private/data/llama/llama_instruction/inf/data-bin/
8 | llama_dir=/opt/data/private/ckpt/alpaca/fsdp/
9 | bpe_dir=/opt/data/private/data/llama/tokenizer.model
10 |
11 |
12 | python alpaca/src/generate.py $data_dir \
13 | --user-dir alpaca/src \
14 | --task seq2seq_ft_task \
15 | --arch llama_7b \
16 | -s $src -t $tgt \
17 | --gen-subset test \
18 | --bpe 'sentencepiece' --sentencepiece-model $bpe_dir \
19 | --path $llama_dir/checkpoint1.pt \
20 | --required-batch-size-multiple 1 \
21 | --batch-size 1 \
22 | --beam 1 --sampling --sampling-topp 0.95 --temperature 0.8 \
23 |
--------------------------------------------------------------------------------
/alpaca/scripts/fsdp/inference/run_inf_hub.sh:
--------------------------------------------------------------------------------
1 |
2 | export CUDA_VISIBLE_DEVICES=0
3 |
4 | python alpaca/src/inference.py \
5 | --model-dir /opt/data/private/ckpt/alpaca/fsdp/ \
6 | --model-file checkpoint1.pt \
7 | --bpe sentencepiece \
8 | --sentencepiece-model /opt/data/private/data/llama/tokenizer.model \
9 |
--------------------------------------------------------------------------------
/alpaca/scripts/fsdp/inference/run_webapp.sh:
--------------------------------------------------------------------------------
1 |
2 | export CUDA_VISIBLE_DEVICES=0
3 |
4 | python alpaca/src/webapp.py \
5 | --model-dir /opt/data/private/ckpt/alpaca/fsdp/ \
6 | --model-file checkpoint1.pt \
7 | --bpe sentencepiece \
8 | --sentencepiece-model /opt/data/private/data/llama/tokenizer.model \
9 |
--------------------------------------------------------------------------------
/alpaca/scripts/fsdp/run_train.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
6 | export CUDA_LAUNCH_BLOCKING=1
7 |
8 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin
9 | save_dir=/opt/data/private/ckpt/alpaca/fsdp/
10 | llama_dir=/opt/data/private/data/llama/7B/model_no_pad.pt
11 | max_token=2048
12 |
13 |
14 | python alpaca/src/train_fsdp.py $data_dir \
15 | --reset-optimizer --reset-dataloader --reset-meters \
16 | --restore-file $llama_dir \
17 | --user-dir alpaca/src \
18 | --ddp-backend fully_sharded \
19 | --fp16 --fp16-init-scale 4 \
20 | --checkpoint-activations \
21 | --no-reshard-after-forward \
22 | --no-save-optimizer-state \
23 | --max-target-positions 2048 \
24 | --task seq2seq_ft_task \
25 | --arch llama_7b \
26 | --data-para \
27 | --criterion lm_loss \
28 | -s $src -t $tgt \
29 | --max-tokens $max_token \
30 | --optimizer adam --adam-betas "(0.9, 0.98)" \
31 | --lr-scheduler polynomial_decay --lr 2e-5 \
32 | --weight-decay 0.0 \
33 | --total-num-update 2000 --warmup-updates 100 \
34 | --max-epoch 3 \
35 | --no-progress-bar \
36 | --log-interval 10 \
37 | --save-dir $save_dir | tee -a $save_dir/train.log \
38 |
--------------------------------------------------------------------------------
/alpaca/scripts/fsdp/run_train_belle.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
6 | export CUDA_LAUNCH_BLOCKING=1
7 |
8 | data_dir=/opt/data/private/data/llama/belle_1m/data-bin
9 | save_dir=/opt/data/private/ckpt/alpaca/fsdp_belle/
10 | llama_dir=/opt/data/private/data/llama/7B/model_no_pad.pt
11 | max_token=2048
12 |
13 |
14 | python alpaca/src/train_fsdp.py $data_dir \
15 | --reset-optimizer --reset-dataloader --reset-meters \
16 | --restore-file $llama_dir \
17 | --user-dir alpaca/src \
18 | --ddp-backend fully_sharded \
19 | --fp16 --fp16-init-scale 4 \
20 | --checkpoint-activations \
21 | --no-reshard-after-forward \
22 | --no-save-optimizer-state \
23 | --max-target-positions 2048 \
24 | --task seq2seq_ft_task \
25 | --arch llama_7b \
26 | --data-para \
27 | --criterion lm_loss \
28 | -s $src -t $tgt \
29 | --max-tokens $max_token \
30 | --optimizer adam --adam-betas "(0.9, 0.98)" \
31 | --lr-scheduler polynomial_decay --lr 2e-5 \
32 | --weight-decay 0.0 \
33 | --total-num-update 2000 --warmup-updates 100 \
34 | --max-epoch 3 \
35 | --no-progress-bar \
36 | --log-interval 10 \
37 | --save-dir $save_dir | tee -a $save_dir/train.log \
38 |
--------------------------------------------------------------------------------
/alpaca/scripts/fsdp/run_train_cpu_offload.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export OMP_NUM_THREADS=20
6 | export CUDA_VISIBLE_DEVICES=0,1
7 | export CUDA_LAUNCH_BLOCKING=1
8 |
9 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin
10 | save_dir=/opt/data/private/ckpt/alpaca/fsdp_cpu_offload/
11 | llama_dir=/opt/data/private/data/llama/7B/model_no_pad.pt
12 | max_token=2048
13 |
14 |
15 | python alpaca/src/train_fsdp.py $data_dir \
16 | --reset-optimizer --reset-dataloader --reset-meters \
17 | --restore-file $llama_dir \
18 | --user-dir alpaca/src \
19 | --ddp-backend fully_sharded \
20 | --fp16 --fp16-init-scale 4 \
21 | --cpu-offload --checkpoint-activations \
22 | --no-reshard-after-forward \
23 | --no-save-optimizer-state \
24 | --max-target-positions 2048 \
25 | --task seq2seq_ft_task \
26 | --arch llama_7b \
27 | --data-para \
28 | --criterion lm_loss \
29 | -s $src -t $tgt \
30 | --max-tokens $max_token \
31 | --optimizer new_cpu_adam --adam-betas "(0.9, 0.98)" \
32 | --lr-scheduler polynomial_decay --lr 2e-5 \
33 | --weight-decay 0.0 \
34 | --total-num-update 2000 --warmup-updates 100 \
35 | --max-epoch 3 \
36 | --no-progress-bar \
37 | --log-interval 10 \
38 | --save-dir $save_dir | tee -a $save_dir/train.log \
39 |
--------------------------------------------------------------------------------
/alpaca/scripts/lora/README.md:
--------------------------------------------------------------------------------
1 | # LoRA
2 |
3 | Efficient-Finetuning Method: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
4 |
5 |
6 | ## Training Step
7 |
8 | ```
9 | bash alpaca/scripts/lora/run_train.sh
10 | ```
11 |
12 | ## Inference Step
13 |
14 | + (Batch-Level) Please prepare the test file.
15 |
16 | ```
17 | bash alpaca/scripts/lora/inference/run_inf.sh
18 | ```
19 |
20 | + (Instance-Level) Edit line 17 of alpaca/src/inference.py to set the prompts, then run:
21 |
22 | ```
23 | bash alpaca/scripts/lora/inference/run_inf_hub.sh
24 | ```
25 |
--------------------------------------------------------------------------------
/alpaca/scripts/lora/inference/run_inf.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export CUDA_VISIBLE_DEVICES=0
6 |
7 | data_dir=/opt/data/private/data/llama/llama_instruction/inf/data-bin/
8 | llama_dir=/opt/data/private/data/llama/7B/
9 | lora_dir=/opt/data/private/ckpt/alpaca/lora/checkpoint3.pt
10 | bpe_dir=/opt/data/private/data/llama/tokenizer.model
11 |
12 |
13 | torchrun --master_port 29001 alpaca/src/generate.py $data_dir \
14 | --user-dir alpaca/src \
15 | --task seq2seq_lora_task \
16 | --arch llama_7b \
17 | --lora-model-inf $lora_dir \
18 | --lora-tuning \
19 | -s $src -t $tgt \
20 | --gen-subset test \
21 | --bpe 'sentencepiece' --sentencepiece-model $bpe_dir \
22 | --path $llama_dir/model_no_pad.pt \
23 | --required-batch-size-multiple 1 \
24 | --batch-size 1 \
25 | --beam 1 --sampling --sampling-topp 0.95 --temperature 0.8 \
26 |
--------------------------------------------------------------------------------
/alpaca/scripts/lora/inference/run_inf_hub.sh:
--------------------------------------------------------------------------------
1 |
2 | export CUDA_VISIBLE_DEVICES=0
3 |
4 | torchrun --master_port 29004 alpaca/src/inference.py \
5 | --model-dir /opt/data/private/data/llama/7B/ \
6 | --model-file model_no_pad.pt \
7 | --lora-tuning \
8 | --lora-model-inf /opt/data/private/ckpt/alpaca/lora/checkpoint3.pt \
9 | --bpe sentencepiece \
10 | --sentencepiece-model /opt/data/private/data/llama/tokenizer.model \
11 |
--------------------------------------------------------------------------------
/alpaca/scripts/lora/inference/run_webapp.sh:
--------------------------------------------------------------------------------
1 |
2 | export CUDA_VISIBLE_DEVICES=0
3 |
4 | torchrun --master_port 29002 alpaca/src/webapp.py \
5 | --model-dir /opt/data/private/data/llama/7B/ \
6 | --model-file model_no_pad.pt \
7 | --lora-model-inf /opt/data/private/ckpt/alpaca/lora/checkpoint3.pt \
8 | --lora-tuning \
9 | --bpe sentencepiece \
10 | --sentencepiece-model /opt/data/private/data/llama/tokenizer.model \
11 |
--------------------------------------------------------------------------------
/alpaca/scripts/lora/run_train.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export CUDA_VISIBLE_DEVICES=0,1,2,3
6 |
7 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin
8 | save_dir=/opt/data/private/ckpt/alpaca/lora/
9 | llama_dir=/opt/data/private/data/llama/7B/model_no_pad.pt
10 | max_token=1024
11 | update_freq=1
12 | world_size=4
13 |
14 |
15 | torchrun --master_port 29000 --nproc_per_node $world_size alpaca/src/train_lora.py $data_dir \
16 | --reset-optimizer --reset-dataloader --reset-meters \
17 | --restore-file $llama_dir \
18 | --user-dir alpaca/src \
19 | --max-source-positions 2048 \
20 | --max-target-positions 2048 \
21 | --memory-efficient-fp16 \
22 | --fp16 --fp16-init-scale 4 \
23 | --task seq2seq_lora_task \
24 | --arch llama_7b \
25 | --criterion lm_loss \
26 | --lora-tuning \
27 | --data-para \
28 | -s $src -t $tgt \
29 | --max-tokens $max_token \
30 | --update-freq $update_freq \
31 | --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 --clip-norm 0.0 \
32 | --lr-scheduler polynomial_decay --lr 3e-4 \
33 | --weight-decay 0.0 \
34 | --total-num-update 5000 --warmup-updates 200 \
35 | --max-epoch 3 \
36 | --no-progress-bar \
37 | --log-interval 100 \
38 | --save-dir $save_dir | tee -a $save_dir/train.log \
39 |
--------------------------------------------------------------------------------
/alpaca/scripts/megatron/README.md:
--------------------------------------------------------------------------------
1 | # Megatron-LM
2 |
3 | Megatron-LM: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
4 |
5 |
6 | ## Training Step
7 |
8 | ```
9 | bash alpaca/scripts/megatron/run_train_megatron.sh
10 | ```
11 |
12 | ## Inference Step
13 |
14 | + (Batch-Level) Please prepare the test file.
15 |
16 | ```
17 | bash alpaca/scripts/megatron/inference/run_inf_megatron.sh
18 | ```
19 |
--------------------------------------------------------------------------------
/alpaca/scripts/megatron/inference/run_inf_megatron.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
6 |
7 | data_dir=/opt/data/private/data/llama/llama_instruction/inf/data-bin/
8 | llama_dir=/opt/data/private/ckpt/alpaca/megatron8_ft/
9 | bpe_dir=/opt/data/private/data/llama/tokenizer.model
10 | world_size=8
11 |
12 |
13 | torchrun --master_port 29006 --nproc_per_node $world_size alpaca/src/generate.py $data_dir \
14 | --user-dir alpaca/src \
15 | --model-parallel-size $world_size \
16 | --distributed-world-size $world_size \
17 | --task seq2seq_ft_task \
18 | --megatron-model \
19 | --arch llama_7b \
20 | -s $src -t $tgt \
21 | --gen-subset test \
22 | --bpe 'sentencepiece' --sentencepiece-model $bpe_dir \
23 | --path $llama_dir/checkpoint1.pt \
24 | --required-batch-size-multiple 1 \
25 | --batch-size 1 \
26 | --beam 1 --sampling --sampling-topp 0.95 --temperature 0.8 \
27 |
--------------------------------------------------------------------------------
/alpaca/scripts/megatron/run_train_megatron.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
6 |
7 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin
8 | save_dir=/opt/data/private/ckpt/alpaca/megatron8_ft/
9 | llama_dir=/opt/data/private/data/llama/7B/megatron_8/model.pt
10 | max_token=2048
11 | update_freq=1
12 | world_size=8
13 |
14 |
15 | torchrun --master_port 29000 --nproc_per_node $world_size alpaca/src/train_megatron.py $data_dir \
16 | --model-parallel-size $world_size \
17 | --distributed-world-size $world_size \
18 | --reset-optimizer --reset-dataloader --reset-meters \
19 | --restore-file $llama_dir \
20 | --user-dir alpaca/src \
21 | --max-source-positions 2048 \
22 | --max-target-positions 2048 \
23 | --memory-efficient-fp16 \
24 | --fp16 --fp16-init-scale 4 \
25 | --checkpoint-activations \
26 | --task seq2seq_ft_task \
27 | --arch llama_7b \
28 | --megatron-model \
29 | --criterion lm_loss \
30 | -s $src -t $tgt \
31 | --max-tokens $max_token \
32 | --update-freq $update_freq \
33 | --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 --clip-norm 0.0 \
34 | --lr-scheduler polynomial_decay --lr 2e-5 \
35 | --weight-decay 0.0 \
36 | --total-num-update 7000 --warmup-updates 200 \
37 | --max-epoch 3 \
38 | --no-progress-bar \
39 | --log-interval 100 \
40 | --save-dir $save_dir | tee -a $save_dir/train.log \
41 |
--------------------------------------------------------------------------------
/alpaca/scripts/megatron_lora/README.md:
--------------------------------------------------------------------------------
1 | # Megatron-LM + LoRA
2 |
3 | Megatron-LM: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
4 | Efficient-Finetuning Method: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
5 |
6 |
7 | ## Training Step
8 |
9 | ```
10 | bash alpaca/scripts/megatron_lora/run_train_megatron_lora.sh
11 | ```
12 |
13 | ## Inference Step
14 |
15 | + (Batch-Level) Please prepare the test file.
16 |
17 | ```
18 | bash alpaca/scripts/megatron_lora/inference/run_inf_megatron_lora.sh
19 | ```
20 |
--------------------------------------------------------------------------------
/alpaca/scripts/megatron_lora/inference/run_inf_megatron_lora.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export CUDA_VISIBLE_DEVICES=0,1
6 |
7 | data_dir=/opt/data/private/data/llama/llama_instruction/inf/data-bin
8 | llama_dir=/opt/data/private/data/llama/7B/megatron_2/
9 | lora_dir=/opt/data/private/ckpt/alpaca/megatron_lora/checkpoint1-model_part-0.pt
10 | bpe_dir=/opt/data/private/data/llama/tokenizer.model
11 | world_size=2
12 |
13 |
14 | torchrun --master_port 29006 --nproc_per_node $world_size alpaca/src/generate.py $data_dir \
15 | --user-dir alpaca/src \
16 | --model-parallel-size $world_size \
17 | --distributed-world-size $world_size \
18 | --lora-model-inf $lora_dir \
19 | --task seq2seq_lora_task \
20 | --arch llama_7b \
21 | --megatron-model \
22 | --lora-tuning \
23 | -s $src -t $tgt \
24 | --gen-subset test \
25 | --bpe 'sentencepiece' --sentencepiece-model $bpe_dir \
26 | --path $llama_dir/model.pt \
27 | --required-batch-size-multiple 1 \
28 | --batch-size 1 \
29 | --beam 1 --sampling --sampling-topp 0.95 --temperature 0.8 \
30 |
--------------------------------------------------------------------------------
/alpaca/scripts/megatron_lora/run_train_megatron_lora.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | src=src
3 | tgt=tgt
4 |
5 | export CUDA_VISIBLE_DEVICES=0,1
6 |
7 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin
8 | save_dir=/opt/data/private/ckpt/alpaca/megatron_lora/
9 | llama_dir=/opt/data/private/data/llama/7B/megatron_2/model.pt
10 | max_token=1024
11 | update_freq=2
12 | world_size=2
13 |
14 |
15 | torchrun --master_port 29002 --nproc_per_node $world_size alpaca/src/train_megatron.py $data_dir \
16 | --model-parallel-size $world_size \
17 | --distributed-world-size $world_size \
18 | --reset-optimizer --reset-dataloader --reset-meters \
19 | --restore-file $llama_dir \
20 | --user-dir alpaca/src \
21 | --max-source-positions 2048 \
22 | --max-target-positions 2048 \
23 | --memory-efficient-fp16 \
24 | --fp16 --fp16-init-scale 4 \
25 | --task seq2seq_lora_task \
26 | --arch llama_7b \
27 | --megatron-model \
28 | --criterion lm_loss \
29 | --lora-tuning \
30 | -s $src -t $tgt \
31 | --max-tokens $max_token \
32 | --update-freq $update_freq \
33 | --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 --clip-norm 0.0 \
34 | --lr-scheduler polynomial_decay --lr 2e-4 \
35 | --weight-decay 0.0 \
36 | --total-num-update 7000 --warmup-updates 200 \
37 | --max-epoch 3 \
38 | --no-progress-bar \
39 | --log-interval 100 \
40 | --save-dir $save_dir | tee -a $save_dir/train.log \
41 |
--------------------------------------------------------------------------------
/alpaca/scripts/utils/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Data Process
3 |
4 | Prepare the chatbot data, which contains the 52K instruction-following examples used to fine-tune the Alpaca model.
5 |
6 | ```
7 | bash alpaca/scripts/utils/prepare_llama_training_data.sh
8 | ```
9 |
10 | Parameters (set at the top of the script):
11 |
12 | + `DATA` dataset dir; download the Alpaca data [alpaca_data.json](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) into it.
13 | + `SPM` path to the sentencepiece encoder: "sentencepiece/build/src/spm_encode".
14 | + `MODEL` LLaMA tokenizer model "tokenizer.model".
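
As a minimal sketch, the variables at the top of `prepare_llama_training_data.sh` might look like this (the values mirror the defaults in the script and should be replaced with your own paths):

```
DATA=/opt/data/private/data/llama/llama_instruction   # contains alpaca_data.json
SPM=/opt/data/private/code/sentencepiece/build/src/spm_encode
MODEL=/opt/data/private/data/llama/tokenizer.model
```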
15 |
16 | # Model Process
17 |
18 |
19 | ## Build Model Checkpoint
20 |
21 | Process the LLaMA checkpoint according to your hardware (number of GPUs).
22 |
23 | 1. single model checkpoint:
24 | ```
25 | python alpaca/scripts/utils/process_llama_ckpt.py --llama-model-dir $llama_dir --llama-model-file $llama_file
26 | ```
27 |
28 | 2. Megatron-LM model checkpoint:
29 | ```
30 | python alpaca/scripts/utils/process_llama_megatron_ckpt.py --llama-model-dir $llama_dir --llama-model-file $llama_file --parallel-size 2
31 | ```
32 |
33 | Parameter:
34 | + `--parallel-size` the number of GPUs (model-parallel parts) to split into.
35 |
36 | After that, we get a new checkpoint file (`model_no_pad.pt` for the single checkpoint, `model.pt` for the Megatron-LM checkpoints).
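
For example, assuming the original LLaMA 7B weights sit under `/opt/data/private/data/llama/7B/` as `consolidated.00.pth` (the defaults inside the two scripts), the calls might look like:

```
# single checkpoint
python alpaca/scripts/utils/process_llama_ckpt.py \
    --llama-model-dir /opt/data/private/data/llama/7B/ \
    --llama-model-file consolidated.00.pth

# Megatron-LM checkpoints split across 2 GPUs
python alpaca/scripts/utils/process_llama_megatron_ckpt.py \
    --llama-model-dir /opt/data/private/data/llama/7B/ \
    --llama-model-file consolidated.00.pth \
    --parallel-size 2
```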
37 |
38 |
39 | ## Merge Model Checkpoint
40 |
41 | We can merge multiple Megatron-LM checkpoint parts back into a single checkpoint to support the `hub` or `web interface` mode.
42 |
43 | ```
44 | python alpaca/scripts/utils/merge_llama_megatron_ckpt.py
45 | ```
46 |
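
The script reads `--llama-model-dir`, `--prefix` (the per-part filename pattern, e.g. `checkpoint_1_15000-model_part-{}.pt`), and `--parallel-size`, as shown in its argument parser. A sketch with the defaults swapped for your own checkpoint directory might be:

```
python alpaca/scripts/utils/merge_llama_megatron_ckpt.py \
    --llama-model-dir /path/to/megatron/checkpoints/ \
    --prefix "checkpoint_1_15000-model_part-{}.pt" \
    --parallel-size 2
```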
--------------------------------------------------------------------------------
/alpaca/scripts/utils/convert_llama_to_half.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import json
3 | import argparse
4 |
5 | def convert_llama_half(llama_file):
6 |
7 | with open(llama_file, "rb") as f:
8 | llama_state = torch.load(f, map_location=torch.device("cpu"))
9 |
10 | for k in list(llama_state['model'].keys()):
11 | llama_state['model'][k] = llama_state['model'][k].half()
12 |
13 | dump_file = "checkpoint_half.pt"
14 | torch.save(llama_state, llama_file.replace("checkpoint_best.pt", dump_file))
15 | print("dump new model to {}".format(dump_file))
16 |
17 | def main():
18 |
19 | parser = argparse.ArgumentParser()
20 | parser.add_argument(
21 | "--llama-model-file",
22 | type=str,
23 | default="/opt/data/private/ckpt/alpaca/fsdp_belle/checkpoint_best.pt",
24 | help="path containing model file",
25 | )
26 |
27 | args = parser.parse_args()
28 | print("convert model {}".format(args.llama_model_file))
29 | convert_llama_half(args.llama_model_file)
30 |
31 |
32 | if __name__ == "__main__":
33 | main()
34 |
--------------------------------------------------------------------------------
/alpaca/scripts/utils/merge_llama_megatron_ckpt.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import json
3 | import argparse
4 |
5 | def build_default_state():
6 |
7 | state = {}
8 |
9 | state['args'] = {}
10 | state['args']['arch'] = "llama_7b"
11 | state['args']['task'] = "seq2seq_ft_task"
12 | state['args']['criterion'] = "lm_loss"
13 | state['args']['decoder_attention_heads'] = 32
14 | state['args']['decoder_embed_dim'] = 4096
15 | state['args']['decoder_ffn_embed_dim'] = 16384
16 | state['args']['decoder_layers'] = 32
17 |
18 | state['args']['max_target_positions'] = 2048
19 | state['args']['max_tokens'] = 2048
20 |
21 | temp_parser = argparse.ArgumentParser()
22 | for key, value in state['args'].items():
23 | temp_parser.add_argument("--" + key, default=value)
24 | args = temp_parser.parse_args([])
25 |
26 | state['args'] = args
27 |
28 | state['model'] = {}
29 | state['optimizer_history'] = [
30 | {
31 | 'criterion_name': 'lm_loss',
32 | 'optimizer_name': 'MemoryEfficientFP16Optimizer',
33 | 'lr_scheduler_state': {'best': None},
34 | 'num_updates': 5000,
35 | }
36 | ]
37 | state['extra_state'] = {}
38 | print(state)
39 | return state
40 |
41 | def build_llama_state_dict(llama_dir, parallel_size, prefix):
42 |
43 | llama_state = None
44 | for file_idx in range(parallel_size):
45 | print(file_idx)
46 | with open((llama_dir + prefix).format(file_idx), "rb") as f:
47 | sep_state = torch.load(f, map_location=torch.device("cpu"))['model']
48 |
49 | if llama_state is None:
50 | llama_state = sep_state
51 | continue
52 |
53 | for k in list(sep_state.keys()):
54 |
55 | print("{}: {} -> +{}".format(k, llama_state[k].size(), sep_state[k].size()))
56 | if "inner_attention" in k:
57 | print("skip llama state key = {} size = {}".format(k, llama_state[k].size()))
58 | continue
59 | elif "norm.weight" in k or "_norm" in k:
60 | continue
61 | elif "decoder.embed_tokens.weight" in k:
62 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=-1)
63 | elif "decoder.output_projection.weight" in k:
64 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=0)
65 | elif "layers" in k:
66 | if "attention" in k and "out_proj" not in k:
67 | # 2048, 4096
68 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=0)
69 | elif "attention" in k and "out_proj" in k:
70 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=-1)
71 | elif "feed_forward.w1" in k:
72 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=0)
73 | elif "feed_forward.w2" in k:
74 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=-1)
75 | elif "feed_forward.w3" in k:
76 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=0)
77 | else:
78 | print(sep_state[k].size())
79 | print(k)
80 | raise NotImplementedError
81 | else:
82 | print(k)
83 | print(sep_state[k].size())
84 | raise NotImplementedError
85 |
86 | llama_state["decoder.embed_tokens.weight"] = llama_state["decoder.embed_tokens.weight"][:32008, :]
87 | llama_state["decoder.output_projection.weight"] = llama_state["decoder.output_projection.weight"][:32008, :]
88 | state = build_default_state()
89 | state['model'] = llama_state
90 | dump_file = "model.pt"
91 | torch.save(state, llama_dir + dump_file)
92 | print("dump new model to {}{}".format(llama_dir, dump_file))
93 |
94 | def main():
95 |
96 | parser = argparse.ArgumentParser()
97 | parser.add_argument(
98 | "--llama-model-dir",
99 | type=str,
100 | default="/opt/data/private/ckpt/alpaca/parallel_zh_new_ft/",
101 | help="path containing model file",
102 | )
103 | parser.add_argument(
104 | "--prefix",
105 | type=str,
106 | default="checkpoint_1_15000-model_part-{}.pt",
107 | help="where in model_dir are weights saved",
108 | )
109 | parser.add_argument(
110 | "--parallel-size",
111 | type=int,
112 | default=8,
113 | help="model parallel size to split",
114 | )
115 |
116 | args = parser.parse_args()
117 | print("load model from {}{}".format(args.llama_model_dir, args.prefix))
118 | build_llama_state_dict(args.llama_model_dir, args.parallel_size, args.prefix)
119 |
120 |
121 | if __name__ == "__main__":
122 | main()
123 |
--------------------------------------------------------------------------------
/alpaca/scripts/utils/prepare_inf_data.sh:
--------------------------------------------------------------------------------
1 | SRC=src
2 | TGT=tgt
3 |
4 | DATA=/opt/data/private/data/llama/llama_instruction/inf/
5 | SPM=/opt/data/private/code/sentencepiece/build/src/spm_encode
6 | MODEL=/opt/data/private/data/llama/tokenizer.model
7 |
8 | cp ${DATA}/test.spm.${SRC} ${DATA}/test.spm.${TGT}
9 |
10 | python alpaca/src/preprocess.py \
11 | --user-dir alpaca/src \
12 | --task llama_task \
13 | --source-lang ${SRC} \
14 | --target-lang ${TGT} \
15 | --testpref ${DATA}/test.spm \
16 | --destdir ${DATA}/data-bin \
17 | --srcdict alpaca/scripts/assert/dict.txt \
18 | --tgtdict alpaca/scripts/assert/dict.txt \
19 | --workers 40 \
20 |
--------------------------------------------------------------------------------
/alpaca/scripts/utils/prepare_llama_belle_data.sh:
--------------------------------------------------------------------------------
1 | SRC=src
2 | TGT=tgt
3 |
4 |
5 | DATA=/opt/data/private/data/llama/belle_1m
6 | SPM=/opt/data/private/code/sentencepiece/build/src/spm_encode
7 | MODEL=/opt/data/private/data/llama/tokenizer.model
8 |
9 | python alpaca/scripts/utils/prepare_utils.py --manner split_zh --alpaca-data $DATA/Belle_open_source_1M.json
10 |
11 | head -100 ${DATA}/train.src > ${DATA}/valid.src
12 | head -100 ${DATA}/train.tgt > ${DATA}/valid.tgt
13 |
14 | ${SPM} --model=${MODEL} < ${DATA}/train.${SRC} > ${DATA}/train.spm.${SRC}.tmp
15 | ${SPM} --model=${MODEL} < ${DATA}/train.${TGT} > ${DATA}/train.spm.${TGT}.tmp
16 | ${SPM} --model=${MODEL} < ${DATA}/valid.${SRC} > ${DATA}/valid.spm.${SRC}.tmp
17 | ${SPM} --model=${MODEL} < ${DATA}/valid.${TGT} > ${DATA}/valid.spm.${TGT}.tmp
18 |
19 | python alpaca/scripts/utils/prepare_utils.py --manner replace_zh --alpaca-data $DATA/Belle_open_source_1M.json
20 |
21 | python alpaca/src/preprocess.py \
22 | --user-dir alpaca/src \
23 | --task llama_task \
24 | --source-lang ${SRC} \
25 | --target-lang ${TGT} \
26 | --trainpref ${DATA}/train.spm \
27 | --validpref ${DATA}/valid.spm \
28 | --destdir ${DATA}/data-bin \
29 | --srcdict alpaca/scripts/assert/dict.txt \
30 | --tgtdict alpaca/scripts/assert/dict.txt \
31 | --workers 40 \
32 |
--------------------------------------------------------------------------------
/alpaca/scripts/utils/prepare_llama_training_data.sh:
--------------------------------------------------------------------------------
1 | SRC=src
2 | TGT=tgt
3 |
4 |
5 | DATA=/opt/data/private/data/llama/llama_instruction
6 | SPM=/opt/data/private/code/sentencepiece/build/src/spm_encode
7 | MODEL=/opt/data/private/data/llama/tokenizer.model
8 |
9 | python alpaca/scripts/utils/prepare_utils.py --manner split --alpaca-data $DATA/alpaca_data.json
10 |
11 | head -100 ${DATA}/train.src > ${DATA}/valid.src
12 | head -100 ${DATA}/train.tgt > ${DATA}/valid.tgt
13 |
14 | ${SPM} --model=${MODEL} < ${DATA}/train.${SRC} > ${DATA}/train.spm.${SRC}.tmp
15 | ${SPM} --model=${MODEL} < ${DATA}/train.${TGT} > ${DATA}/train.spm.${TGT}.tmp
16 | ${SPM} --model=${MODEL} < ${DATA}/valid.${SRC} > ${DATA}/valid.spm.${SRC}.tmp
17 | ${SPM} --model=${MODEL} < ${DATA}/valid.${TGT} > ${DATA}/valid.spm.${TGT}.tmp
18 |
19 | python alpaca/scripts/utils/prepare_utils.py --manner replace --alpaca-data $DATA/alpaca_data.json
20 |
21 | python alpaca/src/preprocess.py \
22 | --user-dir alpaca/src \
23 | --task seq2seq_lora_task \
24 | --source-lang ${SRC} \
25 | --target-lang ${TGT} \
26 | --trainpref ${DATA}/train.spm \
27 | --validpref ${DATA}/valid.spm \
28 | --destdir ${DATA}/data-bin \
29 | --srcdict alpaca/scripts/assert/dict.txt \
30 | --tgtdict alpaca/scripts/assert/dict.txt \
31 | --workers 40 \
32 |
--------------------------------------------------------------------------------
/alpaca/scripts/utils/prepare_utils.py:
--------------------------------------------------------------------------------
1 | import json
2 | import argparse
3 |
4 |
5 | def split_json(alpaca_data):
6 |
7 | json_data = json.load(open(alpaca_data))
8 | print("load aplaca data number = {}".format(len(json_data)))
9 | train_src = alpaca_data.replace("alpaca_data.json", 'train.src')
10 | train_tgt = alpaca_data.replace("alpaca_data.json", 'train.tgt')
11 |
12 | prompt_text = "## Instruction:\n{}\n\n## Input:\n{}\n\n## Response:"
13 | prompt_no_input_text = "## Instruction:\n{}\n\n## Response:"
14 |
15 | with open(train_src, 'w') as f_src, open(train_tgt, 'w') as f_tgt:
16 |
17 | for data in json_data:
18 | if len(data['input']) > 0:
19 | src = prompt_text.format(data['instruction'].strip(), data['input'].strip())
20 | else:
21 | src = prompt_no_input_text.format(data['instruction'].strip())
22 | tgt = data['output']
23 | f_src.writelines(src.replace("\n", '<0x0A>') + '\n')
24 | f_tgt.writelines(tgt.replace("\n", '<0x0A>') + '\n')
25 |
26 | def replace_data(alpaca_data):
27 |
28 | train_src = alpaca_data.replace("alpaca_data.json", 'train.spm.src.tmp')
29 | train_tgt = alpaca_data.replace("alpaca_data.json", 'train.spm.tgt.tmp')
30 |
31 | valid_src = alpaca_data.replace("alpaca_data.json", 'valid.spm.src.tmp')
32 | valid_tgt = alpaca_data.replace("alpaca_data.json", 'valid.spm.tgt.tmp')
33 |
34 | train_files = [train_src, train_tgt, valid_src, valid_tgt]
35 | for train_file_new in train_files:
36 |
37 | train_src_rep = train_file_new.replace(".tmp", "")
38 | with open(train_src_rep, 'w') as f_o:
39 | for line in open(train_file_new).readlines():
40 | newline = line.replace("▁< 0 x 0 A >", " <0x0A> ").replace("< 0 x 0 A >", " <0x0A> ")
41 | newline = newline.replace(" ", " ")
42 | f_o.writelines(newline)
43 |
44 | def split_zh_json(alpaca_data):
45 |
46 | print("load aplaca data {}".format(alpaca_data))
47 | train_src = alpaca_data.replace("Belle_open_source_1M.json", 'train.src')
48 | train_tgt = alpaca_data.replace("Belle_open_source_1M.json", 'train.tgt')
49 |
50 | prompt_text = "## Instruction:\n{}\n\n## Input:\n{}\n\n## Response:"
51 | prompt_no_input_text = "## Instruction:\n{}\n\n## Response:"
52 |
53 | with open(train_src, 'w') as f_src, open(train_tgt, 'w') as f_tgt:
54 | for lines in open(alpaca_data).readlines():
55 | data = json.loads(lines)
56 | if len(data['input']) > 0:
57 | src = prompt_text.format(data['instruction'].strip(), data['input'].strip()).strip()
58 | else:
59 | src = prompt_no_input_text.format(data['instruction'].strip()).strip()
60 | tgt = data['output'].strip()
61 | if len(src) > 0 and len(tgt) > 0:
62 | f_src.writelines(src.replace("\n", '<0x0A>').replace("\\n", '<0x0A>').replace("\r\n", '<0x0A>').replace("\r", '<0x0A>') + '\n')
63 | f_tgt.writelines(tgt.replace("\n", '<0x0A>').replace("\\n", '<0x0A>').replace("\r\n", '<0x0A>').replace("\r", '<0x0A>') + '\n')
64 |
65 | def replace_zh_data(alpaca_data):
66 |
67 | train_src = alpaca_data.replace("Belle_open_source_1M.json", 'train.spm.src.tmp')
68 | train_tgt = alpaca_data.replace("Belle_open_source_1M.json", 'train.spm.tgt.tmp')
69 |
70 | valid_src = alpaca_data.replace("Belle_open_source_1M.json", 'valid.spm.src.tmp')
71 | valid_tgt = alpaca_data.replace("Belle_open_source_1M.json", 'valid.spm.tgt.tmp')
72 |
73 | train_files = [train_src, train_tgt, valid_src, valid_tgt]
74 | for train_file_new in train_files:
75 |
76 | train_src_rep = train_file_new.replace(".tmp", "")
77 | with open(train_src_rep, 'w') as f_o:
78 | for line in open(train_file_new).readlines():
79 | newline = line.replace("▁< 0 x 0 A >", " <0x0A> ").replace("< 0 x 0 A >", " <0x0A> ")
80 | newline = newline.replace(" ", " ")
81 | f_o.writelines(newline)
82 |
83 | def main():
84 |
85 | parser = argparse.ArgumentParser()
86 | parser.add_argument(
87 | "--manner",
88 | required=True,
89 | type=str,
90 | default="split",
91 | help="process utils",
92 | )
93 | parser.add_argument(
94 | "--alpaca-data",
95 | default="/opt/data/private/data/llama_new/alpaca_data.json",
96 | help="alpaca self-instruction data_dir",
97 | )
98 | parser.add_argument(
99 | "--translation-data",
100 | default="/opt/data/private/data/llama/trans/translation2019zh_train.json",
101 | help="transltion data_dir",
102 | )
103 | args = parser.parse_args()
104 |
105 | if args.manner == "split":
106 | split_json(args.alpaca_data)
107 | elif args.manner == "replace":
108 | replace_data(args.alpaca_data)
109 | elif args.manner == "split_zh":
110 | split_zh_json(args.alpaca_data)
111 | elif args.manner == "replace_zh":
112 | replace_zh_data(args.alpaca_data)
113 | else:
114 | print("No Support!")
115 |
116 |
117 | if __name__ == "__main__":
118 | main()
119 |
--------------------------------------------------------------------------------
/alpaca/scripts/utils/process_llama_ckpt.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import json
3 | import argparse
4 |
5 |
6 | def build_default_state():
7 |
8 | state = {}
9 |
10 | state['args'] = {}
11 | state['args']['arch'] = "llama_7b"
12 | state['args']['task'] = "seq2seq_ft_task"
13 | state['args']['criterion'] = "lm_loss"
14 |
15 | state['args']['decoder_attention_heads'] = 32
16 | state['args']['decoder_embed_dim'] = 4096
17 | state['args']['decoder_ffn_embed_dim'] = 16384
18 | state['args']['decoder_layers'] = 32
19 |
20 | temp_parser = argparse.ArgumentParser()
21 | for key, value in state['args'].items():
22 | temp_parser.add_argument("--" + key, default=value)
23 | args = temp_parser.parse_args([])
24 |
25 | state['args'] = args
26 |
27 | state['model'] = {}
28 | state['optimizer_history'] = [
29 | {
30 | 'criterion_name': 'lm_loss',
31 | 'optimizer_name': 'AdamOptimizer',
32 | 'lr_scheduler_state': {'best': None},
33 | 'num_updates': 2000,
34 | }
35 | ]
36 | state['extra_state'] = {}
37 | print(state)
38 | return state
39 |
40 | def build_llama_state_dict(llama_dir, llama_file):
41 | # please replace the llama_path with real path
42 | with open(llama_dir + llama_file, "rb") as f:
43 | llama_state = torch.load(f, map_location=torch.device("cpu"))
44 |
45 | # add pad to token weight and prediction weight
46 | dict_size, dict_dim = llama_state['tok_embeddings.weight'].size()
47 | pad = llama_state['tok_embeddings.weight'].new_zeros([1, dict_dim])
48 | llama_state['tok_embeddings.weight'] = torch.cat([llama_state['tok_embeddings.weight'], pad], dim=0)
49 | llama_state['output.weight'] = torch.cat([llama_state['output.weight'], pad], dim=0)
50 |
51 | state = build_default_state()
52 | state['model'] = llama_state
53 | dump_file = "model_no_pad.pt"
54 | torch.save(state, llama_dir + dump_file)
55 | print("dump new model to {}{}".format(llama_dir, dump_file))
56 |
57 | def main():
58 |
59 | parser = argparse.ArgumentParser()
60 | parser.add_argument(
61 | "--llama-model-dir",
62 | type=str,
63 | default="/opt/data/private/data/llama/7B/",
64 | help="path containing model file",
65 | )
66 | parser.add_argument(
67 | "--llama-model-file",
68 | type=str,
69 | default="consolidated.00.pth",
70 | help="where in model_dir are weights saved",
71 | )
72 |
73 | args = parser.parse_args()
74 | print("load model from {}{}".format(args.llama_model_dir, args.llama_model_file))
75 | build_llama_state_dict(args.llama_model_dir, args.llama_model_file)
76 |
77 |
78 | if __name__ == "__main__":
79 | main()
80 |
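The script above appends a single zero row to `tok_embeddings.weight` and `output.weight` (the extra padding entry the fairseq dictionary expects) and wraps the weights in a fairseq-style checkpoint. A quick sanity check of the dumped file might look like this (illustrative sketch only; the path and sizes assume the 7B defaults above):

```python
import torch

# model_no_pad.pt is the file written by build_llama_state_dict() above
state = torch.load("/opt/data/private/data/llama/7B/model_no_pad.pt", map_location="cpu")
emb = state["model"]["tok_embeddings.weight"]
out = state["model"]["output.weight"]
# one row more than the original LLaMA vocabulary, e.g. 32001 x 4096 for 7B
print(emb.size(), out.size())
```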
--------------------------------------------------------------------------------
/alpaca/scripts/utils/process_llama_megatron_ckpt.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import json
3 | import argparse
4 | import os
5 |
6 |
7 | def build_default_state():
8 |
9 | state = {}
10 |
11 | state['args'] = {}
12 | state['args']['arch'] = "llama_7b"
13 | state['args']['task'] = "seq2seq_ft_task"
14 | state['args']['criterion'] = "lm_loss"
15 | state['args']['decoder_attention_heads'] = 32
16 | state['args']['decoder_embed_dim'] = 4096
17 | state['args']['decoder_ffn_embed_dim'] = 16384
18 | state['args']['decoder_layers'] = 32
19 |
20 | temp_parser = argparse.ArgumentParser()
21 | for key, value in state['args'].items():
22 | temp_parser.add_argument("--" + key, default=value)
23 | args = temp_parser.parse_args([])
24 |
25 | state['args'] = args
26 |
27 | state['model'] = {}
28 | state['optimizer_history'] = [
29 | {
30 | 'criterion_name': 'lm_loss',
31 | 'optimizer_name': 'MemoryEfficientFP16Optimizer',
32 | 'lr_scheduler_state': {'best': None},
33 | 'num_updates': 5000,
34 | }
35 | ]
36 | state['extra_state'] = {}
37 | print(state)
38 | return state
39 |
40 | DICT_MAP = {
41 | 2: 32016,
42 | 4: 32032,
43 | 8: 32064,
44 | }
45 |
46 | def split_parameter(llama_state, parallel_size):
47 |
48 | parallel_state_list = []
49 | incr_dict_size = DICT_MAP[parallel_size]
50 |
51 | dict_size, dict_dim = llama_state['tok_embeddings.weight'].size()
52 | pad = llama_state['tok_embeddings.weight'].new_zeros([incr_dict_size - dict_size, dict_dim])
53 | llama_state['tok_embeddings.weight'] = torch.cat([llama_state['tok_embeddings.weight'], pad], dim=0)
54 | llama_state['output.weight'] = torch.cat([llama_state['output.weight'], pad], dim=0)
55 |
56 | embed_size = dict_dim // parallel_size
57 | ffn_embed_size = (256 * ((int(2 * dict_dim * 4 / 3) + 256 - 1) // 256)) // parallel_size
58 | parallel_dict_size = incr_dict_size // parallel_size
59 |
60 | for parallel_idx in range(parallel_size):
61 | parallel_state = {}
62 | start_embed_size = parallel_idx * embed_size
63 | end_embed_size = (parallel_idx + 1) * embed_size
64 | start_ffn_embed_size = parallel_idx * ffn_embed_size
65 | end_ffn_embed_size = (parallel_idx + 1) * ffn_embed_size
66 | start_parallel_dict_size = parallel_idx * parallel_dict_size
67 | end_parallel_dict_size = (parallel_idx + 1) * parallel_dict_size
68 |
69 | print("embed dim start={} end={}".format(start_embed_size, end_embed_size))
70 | print("ffn dim start={} end={}".format(start_ffn_embed_size, end_ffn_embed_size))
71 |
72 | for k in list(llama_state.keys()):
73 | if "inner_attention" in k:
74 | print("skip llama state key = {} size = {}".format(k, llama_state[k].size()))
75 | continue
76 | elif "norm.weight" in k or "_norm" in k:
77 | parallel_state[k] = llama_state[k].clone()
78 | elif "tok_embeddings.weight" in k:
79 | parallel_state[k] = llama_state[k][:, start_embed_size:end_embed_size].clone()
80 | elif "output.weight" in k:
81 | parallel_state[k] = llama_state[k][start_parallel_dict_size:end_parallel_dict_size, :].clone()
82 | elif "layers" in k:
83 | if "attention" in k and "wo" not in k:
84 | # 2048, 4096
85 | parallel_state[k] = llama_state[k][start_embed_size:end_embed_size, :].clone()
86 | elif "attention" in k and "wo" in k:
87 | parallel_state[k] = llama_state[k][:, start_embed_size:end_embed_size].clone()
88 | elif "feed_forward.w1" in k:
89 | parallel_state[k] = llama_state[k][start_ffn_embed_size:end_ffn_embed_size, :].clone()
90 | elif "feed_forward.w2" in k:
91 | parallel_state[k] = llama_state[k][:, start_ffn_embed_size:end_ffn_embed_size].clone()
92 | elif "feed_forward.w3" in k:
93 | parallel_state[k] = llama_state[k][start_ffn_embed_size:end_ffn_embed_size, :].clone()
94 | else:
95 | print(llama_state[k].size())
96 | print(k)
97 | raise NotImplementedError
98 | else:
99 | print(llama_state[k].size())
100 | print(k)
101 | raise NotImplementedError
102 | print("split llama state key = {} size = {}".format(k, llama_state[k].size()))
103 | print("parallel state size = {}".format(parallel_state[k].size()))
104 | parallel_state_list.append(parallel_state)
105 | return parallel_state_list
106 |
107 | def build_llama_state_dict(llama_dir, llama_file, parallel_size):
108 | # please replace llama_dir with the real path to the LLaMA checkpoint
109 | with open(llama_dir + llama_file, "rb") as f:
110 | llama_state = torch.load(f, map_location=torch.device("cpu"))
111 |
112 | # add pad to token weight and prediction weight
113 | state = build_default_state()
114 | for parallel_idx, parallel_state in enumerate(split_parameter(llama_state, parallel_size)):
115 | state['model'] = parallel_state
116 | dump_file = "model-model_part-{}.pt".format(parallel_idx)
117 | if not os.path.exists(llama_dir + 'megatron_{}/'.format(parallel_size)):
118 | os.mkdir(llama_dir + 'megatron_{}/'.format(parallel_size))
119 | torch.save(state, llama_dir + 'megatron_{}/'.format(parallel_size) + dump_file)
120 | print("dump new model to {}{}".format(llama_dir, dump_file))
121 |
122 | def main():
123 |
124 | parser = argparse.ArgumentParser()
125 | parser.add_argument(
126 | "--llama-model-dir",
127 | type=str,
128 | default="/opt/data/private/data/llama/7B/",
129 | help="path containing model file",
130 | )
131 | parser.add_argument(
132 | "--llama-model-file",
133 | type=str,
134 | default="consolidated.00.pth",
135 | help="where in model_dir are weights saved",
136 | )
137 | parser.add_argument(
138 | "--parallel-size",
139 | type=int,
140 | default=2,
141 | help="model parallel size to split",
142 | )
143 |
144 | args = parser.parse_args()
145 | print("load model from {}{}.".format(args.llama_model_dir, args.llama_model_file))
146 | print("We will split the llama model into {} fragment.".format(args.parallel_size))
147 | build_llama_state_dict(args.llama_model_dir, args.llama_model_file, args.parallel_size)
148 |
149 | if __name__ == "__main__":
150 | main()
151 |
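`DICT_MAP` pads the vocabulary so it divides evenly across model-parallel ranks, and `split_parameter` then slices column-parallel weights (wq/wk/wv, w1, w3, output) by rows and row-parallel weights (wo, w2, tok_embeddings) by columns. The arithmetic for the 7B defaults with `--parallel-size 2`, as a sketch for illustration (not repository code):

```python
parallel_size = 2
embed_dim = 4096
padded_vocab = 32016                                           # DICT_MAP[2]
ffn_dim = 256 * ((int(2 * embed_dim * 4 / 3) + 255) // 256)    # 11008, the SwiGLU hidden size this formula yields for 7B

print(padded_vocab // parallel_size)   # 16008 output.weight rows per rank
print(embed_dim // parallel_size)      # 2048 tok_embeddings columns per rank
print(ffn_dim // parallel_size)        # 5504 w1/w3 rows (w2 columns) per rank
```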
--------------------------------------------------------------------------------
/alpaca/src/__init__.py:
--------------------------------------------------------------------------------
1 |
2 | try:
3 | from .model import llama_model
4 | except ValueError:
5 | print("llama model has been loaded!!!")
6 | from .loss import lm_loss
7 | from .task import seq2seq_ft_task, seq2seq_lora_task
8 | from .fsdp import cpu_adam, fully_sharded_data_parallel
9 |
--------------------------------------------------------------------------------
/alpaca/src/__pycache__/__init__.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/__pycache__/__init__.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/__pycache__/megatron_trainer.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/__pycache__/megatron_trainer.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/__pycache__/trainer.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/__pycache__/trainer.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/__pycache__/utils.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/__pycache__/utils.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/fsdp/__pycache__/cpu_adam.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/fsdp/__pycache__/cpu_adam.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/fsdp/__pycache__/fully_sharded_data_parallel.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/fsdp/__pycache__/fully_sharded_data_parallel.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/fsdp/cpu_adam.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | import importlib
7 | from collections.abc import Collection
8 | from dataclasses import dataclass, field
9 | from typing import List
10 |
11 | import torch
12 | from fairseq.dataclass import FairseqDataclass
13 | from fairseq.optim import FairseqOptimizer, register_optimizer
14 | from omegaconf import II, DictConfig
15 |
16 |
17 | try:
18 | import deepspeed
19 |
20 | has_deepspeed = True
21 | except ImportError as e:
22 | has_deepspeed = False
23 |
24 |
25 | def _get_cpu_adam():
26 | try:
27 | from deepspeed.ops.op_builder import CPUAdamBuilder
28 |
29 | return CPUAdamBuilder().load()
30 | except ImportError:
31 | # fbcode
32 | from deepspeed.ops.adam import DeepSpeedCPUAdam as ds_opt_adam
33 |
34 | return ds_opt_adam
35 |
36 |
37 | @dataclass
38 | class FairseqCPUAdamConfig(FairseqDataclass):
39 | adam_betas: str = field(
40 | default="(0.9, 0.999)", metadata={"help": "betas for Adam optimizer"}
41 | )
42 | adam_eps: float = field(
43 | default=1e-8, metadata={"help": "epsilon for Adam optimizer"}
44 | )
45 | weight_decay: float = field(default=0.0, metadata={"help": "weight decay"})
46 | fp16_adam_stats: bool = field(
47 | default=False, metadata={"help": "use FP16 stats (with automatic scaling)"}
48 | )
49 | # TODO common vars below in parent
50 | lr: List[float] = II("optimization.lr")
51 |
52 |
53 | @register_optimizer("new_cpu_adam", dataclass=FairseqCPUAdamConfig)
54 | class FairseqCPUAdam(FairseqOptimizer):
55 | """Adam optimizer for fairseq, optimized for CPU tensors.
56 |
57 | Important note: this optimizer corresponds to the "AdamW" variant of
58 | Adam in its weight decay behavior. As such, it is most closely
59 | analogous to torch.optim.AdamW from PyTorch.
60 | """
61 |
62 | def __init__(self, cfg: DictConfig, params):
63 | super().__init__(cfg)
64 | self._optimizer = CPUAdam(params, **self.optimizer_config)
65 |
66 | @property
67 | def optimizer_config(self):
68 | """
69 | Return a kwarg dictionary that will be used to override optimizer
70 | args stored in checkpoints. This allows us to load a checkpoint and
71 | resume training using a different set of optimizer args, e.g., with a
72 | different learning rate.
73 | """
74 | return {
75 | "lr": self.cfg.lr[0]
76 | if isinstance(self.cfg.lr, Collection)
77 | else self.cfg.lr,
78 | "betas": eval(self.cfg.adam_betas),
79 | "eps": self.cfg.adam_eps,
80 | "weight_decay": self.cfg.weight_decay,
81 | "use_fp16_stats": self.cfg.fp16_adam_stats,
82 | }
83 |
84 |
85 | class CPUAdam(torch.optim.Optimizer):
86 |
87 | optimizer_id = 0
88 |
89 | def __init__(
90 | self,
91 | params,
92 | lr=1e-3,
93 | bias_correction=True,
94 | betas=(0.9, 0.999),
95 | eps=1e-8,
96 | weight_decay=0,
97 | use_fp16_stats=False,
98 | ):
99 | defaults = {
100 | "lr": lr,
101 | "bias_correction": bias_correction,
102 | "betas": betas,
103 | "eps": eps,
104 | "weight_decay": weight_decay,
105 | }
106 | super().__init__(params, defaults)
107 |
108 | self.use_fp16_stats = use_fp16_stats
109 | self.FLOAT16_MAX = 65504.0
110 |
111 | if not has_deepspeed:
112 | raise ImportError("Please install DeepSpeed: pip install deepspeed")
113 |
114 | self.opt_id = CPUAdam.optimizer_id
115 | CPUAdam.optimizer_id = CPUAdam.optimizer_id + 1
116 |
117 | self.ds_opt_adam = _get_cpu_adam()
118 | adamw_mode = True
119 | self.ds_opt_adam.create_adam(
120 | self.opt_id, lr, betas[0], betas[1], eps, weight_decay, adamw_mode, True
121 | )
122 |
123 | @property
124 | def supports_memory_efficient_fp16(self):
125 | return True
126 |
127 | @property
128 | def supports_flat_params(self):
129 | return True
130 |
131 | @torch.no_grad()
132 | def step(self, closure=None):
133 | loss = None
134 | if closure is not None:
135 | with torch.enable_grad():
136 | loss = closure()
137 |
138 | torch.cuda.synchronize()
139 |
140 | for group_id, group in enumerate(self.param_groups):
141 | for param_id, p in enumerate(group["params"]):
142 | if p.grad is None:
143 | continue
144 |
145 | state = self.state[p]
146 | if len(state) == 0:
147 | state["step"] = 0
148 | dtype = torch.float16 if self.use_fp16_stats else p.data.dtype
149 | # gradient momentums
150 | state["exp_avg"] = torch.zeros_like(
151 | p.data, dtype=dtype, device="cpu"
152 | )
153 | # gradient variances
154 | state["exp_avg_sq"] = torch.zeros_like(
155 | p.data, dtype=dtype, device="cpu"
156 | )
157 | if self.use_fp16_stats:
158 | assert torch.is_floating_point(p.data)
159 | state["exp_avg_scale"] = 1.0
160 | state["exp_avg_sq_scale"] = 1.0
161 |
162 | exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
163 |
164 | p_data_bak = p.data # backup of the original data pointer
165 |
166 | p.data = p.data.to(dtype=torch.float32, device="cpu")
167 | p.grad.data = p.grad.data.to(dtype=torch.float32, device="cpu")
168 |
169 | if self.use_fp16_stats:
170 | exp_avg = exp_avg.float() * state["exp_avg_scale"]
171 | exp_avg_sq = exp_avg_sq.float() * state["exp_avg_sq_scale"]
172 |
173 | state["step"] += 1
174 | beta1, beta2 = group["betas"]
175 |
176 | self.ds_opt_adam.adam_update(
177 | self.opt_id,
178 | state["step"],
179 | group["lr"],
180 | beta1,
181 | beta2,
182 | group["eps"],
183 | group["weight_decay"],
184 | group["bias_correction"],
185 | p.data,
186 | p.grad.data,
187 | exp_avg,
188 | exp_avg_sq,
189 | )
190 |
191 | if p_data_bak.data_ptr() != p.data.data_ptr():
192 | p_data_bak.copy_(p.data)
193 | p.data = p_data_bak
194 |
195 | if self.use_fp16_stats:
196 |
197 | def inf_norm(t):
198 | return torch.norm(t, float("inf"))
199 |
200 | # from github.com/openai/jukebox/blob/master/jukebox/utils/fp16.py
201 | state["exp_avg_scale"], state["exp_avg_sq_scale"] = (
202 | 1e-8 + inf_norm(exp_avg) / self.FLOAT16_MAX,
203 | 1e-8 + inf_norm(exp_avg_sq) / self.FLOAT16_MAX,
204 | )
205 | state["exp_avg"], state["exp_avg_sq"] = (
206 | (exp_avg / state["exp_avg_scale"]).half(),
207 | (exp_avg_sq / state["exp_avg_sq_scale"]).half(),
208 | )
209 |
210 | return loss
211 |
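`CPUAdam.step()` moves parameters and gradients to FP32 on the CPU, calls DeepSpeed's fused CPU Adam kernel, and, when `fp16_adam_stats` is enabled, stores the moments in FP16 together with a per-tensor scale derived from their infinity norm. A sketch of that scaling trick (illustration only, not repository code):

```python
import torch

FLOAT16_MAX = 65504.0
exp_avg = torch.randn(1024) * 1e3                       # pretend FP32 first moment
scale = 1e-8 + exp_avg.norm(float("inf")) / FLOAT16_MAX
stored = (exp_avg / scale).half()                       # what lands in optimizer state
recovered = stored.float() * scale                      # what the next step() works with
print((exp_avg - recovered).abs().max())                # small quantization error
```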
--------------------------------------------------------------------------------
/alpaca/src/fsdp/fully_sharded_data_parallel.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | import contextlib
7 | from typing import Optional
8 | import os
9 | import torch
10 | from fairseq.dataclass.configs import DistributedTrainingConfig
11 | from fairseq.distributed import utils as dist_utils
12 | from typing import Any, Dict, Optional, Set, cast
13 | try:
14 | from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
15 | from fairscale.nn.data_parallel import TrainingState
16 | has_FSDP = True
17 | except ImportError:
18 | FSDP = torch.nn.Module
19 | has_FSDP = False
20 |
21 | def free_storage_(data: torch.Tensor):
22 | if data.storage().size() > 0:
23 | assert data.storage_offset() == 0
24 | data.storage().resize_(0)
25 |
26 |
27 | class FullyShardedDataParallel(FSDP):
28 | """
29 | A small wrapper around fairscale's FullyShardedDataParallel (FSDP) with some
30 | fairseq-specific checkpoint saving/loading logic.
31 |
32 | Args:
33 | use_sharded_state (bool): if True, then ``state_dict`` will return
34 | ``FSDP.local_state_dict`` and ``load_state_dict`` will call
35 | ``FSDP.load_local_state_dict``. Otherwise, ``state_dict`` will
36 | return the full model weights on data parallel rank 0 (empty on
37 | other ranks) and ``load_state_dict`` will broadcast model weights
38 | from rank 0 to other ranks.
39 | """
40 |
41 | def __init__(self, *args, use_sharded_state: bool = False, **kwargs):
42 | if not has_FSDP:
43 | raise ImportError(
44 | "Cannot find FullyShardedDataParallel. "
45 | "Please install fairscale with: pip install fairscale"
46 | )
47 | super().__init__(*args, **kwargs)
48 | self.use_sharded_state = use_sharded_state
49 |
50 | if dist_utils.get_world_size(group=dist_utils.get_data_parallel_group()) < 4 and \
51 | "NVIDIA GeForce RTX 3090" in torch.cuda.get_device_name():
52 | self.alpaca_force_full_precision = False
53 | else:
54 | self.alpaca_force_full_precision = True
55 |
56 | @property
57 | def unwrapped_module(self) -> torch.nn.Module:
58 | if self.flatten_parameters:
59 | return self.module.module
60 | else:
61 | return self.module
62 |
63 | def state_dict(self, destination=None, prefix="", keep_vars=False):
64 | if self.use_sharded_state:
65 | return super().local_state_dict(
66 | destination=destination, prefix=prefix, keep_vars=keep_vars
67 | )
68 | else:
69 | if self.rank == 0:
70 | return super().state_dict(
71 | destination=destination, prefix=prefix, keep_vars=keep_vars
72 | )
73 | else:
74 | # We must call state_dict() due to use of communication
75 | # primitives. But we don't use the result.
76 | super().state_dict()
77 | return destination or {}
78 |
79 | def load_state_dict(self, state_dict, strict=True, model_cfg=None):
80 | if self.use_sharded_state:
81 | return super().load_local_state_dict(state_dict, strict=strict)
82 | else:
83 | state_dict = dist_utils.broadcast_object(
84 | state_dict, src_rank=0, group=self.process_group,
85 | )
86 | return super().load_state_dict(state_dict, strict=strict)
87 |
88 | @contextlib.contextmanager
89 | def summon_full_params(self, recurse: bool = True, volatile: bool = False):
90 | if recurse:
91 | with contextlib.ExitStack() as stack:
92 | # Summon all params for any nested FSDP instances.
93 | for module in self.modules():
94 | if isinstance(module, FullyShardedDataParallel):
95 | stack.enter_context(module.summon_full_params(recurse=False, volatile=volatile))
96 | # Yield to the caller, with full params in all nested instances.
97 | yield
98 | # Exiting from the ExitStack will re-shard params.
99 | return
100 | else:
101 | torch.cuda.synchronize()
102 | self._lazy_init()
103 | self.assert_state(TrainingState.IDLE)
104 | # Set the state so that we assert when trying to go into fwd/bwd.
105 | self.training_state = TrainingState.SUMMON_FULL_PARAMS
106 | full_tensors = self._rebuild_full_params(force_full_precision=self.alpaca_force_full_precision)
107 | assert full_tensors is not None
108 | with contextlib.ExitStack() as stack:
109 | if self.module.is_flattened:
110 | # Update flattened views to point to fully-sized tensors. We
111 | # use self.params instead of full_tensors since the
112 | # latter may contain padding.
113 | stack.enter_context(
114 | self.module.unflatten_params(
115 | flat_params=[p.data for p in self.params[: self._num_flatten_params]]
116 | )
117 | )
118 | try:
119 | yield
120 | finally:
121 | stack.close()
122 | non_shared_params = self.params
123 | # filter out shared params for all but the owner FSDP module.
124 | if len(full_tensors) < len(non_shared_params):
125 | non_shared_params = self.non_shared_params()
126 | assert len(full_tensors) == len(
127 | non_shared_params
128 | ), f"{len(full_tensors)} vs. {len(non_shared_params)}"
129 | for p, (full_tensor, safe_to_free) in zip(non_shared_params, full_tensors):
130 | if not volatile:
131 | # Copy any changes made to the full params back into
132 | # the corresponding local shards.
133 | local_shard, _ = self._get_shard(full_tensor)
134 | p._fp32_shard.copy_(local_shard.view_as(p._fp32_shard))
135 | if safe_to_free:
136 | free_storage_(full_tensor)
137 | self.has_full_params = False
138 | self._use_fp32_param_shard()
139 | self.training_state = TrainingState.IDLE
140 |
141 |
142 | class DummyProcessGroup:
143 | def __init__(self, rank: int, size: int):
144 | self._rank = rank
145 | self._size = size
146 |
147 | def rank(self) -> int:
148 | return self._rank
149 |
150 | def size(self) -> int:
151 | return self._size
152 |
153 |
154 | @contextlib.contextmanager
155 | def fsdp_enable_wrap(cfg: DistributedTrainingConfig):
156 | try:
157 | from fairscale.nn import enable_wrap
158 | except ImportError:
159 | raise ImportError(
160 | "Cannot find FullyShardedDataParallel. "
161 | "Please install fairscale with: pip install fairscale"
162 | )
163 | if cfg.memory_efficient_fp16:
164 | assert cfg.fp16 # memory_efficient_fp16 should imply fp16
165 | group = dist_utils.get_data_parallel_group()
166 | if group is None and cfg.distributed_world_size == 1:
167 | group = DummyProcessGroup(rank=0, size=1)
168 | fsdp_config = {
169 | "process_group": group,
170 | "reshard_after_forward": not cfg.no_reshard_after_forward,
171 | "mixed_precision": cfg.fp16 and not cfg.memory_efficient_fp16,
172 | "fp32_reduce_scatter": cfg.fp32_reduce_scatter,
173 | "flatten_parameters": not cfg.not_fsdp_flatten_parameters,
174 | "cpu_offload": cfg.cpu_offload,
175 | "compute_dtype": torch.float16 if cfg.fp16 else torch.float32,
176 | "bucket_cap_mb": cfg.bucket_cap_mb,
177 | "state_dict_device": torch.device("cpu"), # reduce GPU mem usage
178 | }
179 | with enable_wrap(
180 | wrapper_cls=FullyShardedDataParallel,
181 | use_sharded_state=cfg.use_sharded_state,
182 | **fsdp_config,
183 | ):
184 | yield
185 |
186 |
187 | def fsdp_wrap(module, min_num_params: Optional[int] = None, **kwargs):
188 | """
189 | Helper to wrap layers/modules in FSDP. This falls back to a no-op if
190 | fairscale is not available.
191 |
192 | Args:
193 | module (nn.Module): module to (maybe) wrap
194 | min_num_params (int, Optional): minimum number of layer params to wrap
195 | """
196 | try:
197 | from fairscale.nn import wrap
198 |
199 | if min_num_params is not None:
200 | num_params = sum(p.numel() for p in module.parameters())
201 | if num_params >= min_num_params:
202 | return wrap(module, **kwargs)
203 | else:
204 | return module
205 | else:
206 | return wrap(module, **kwargs)
207 | except ImportError:
208 | return module
209 |
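`fsdp_enable_wrap` builds the FSDP configuration from the fairseq distributed-training config, and `fsdp_wrap` only shards a module when it is large enough, silently falling back to the plain module if fairscale is missing. A hedged usage sketch (the import path is assumed from this repository layout):

```python
import torch.nn as nn
from alpaca.src.fsdp.fully_sharded_data_parallel import fsdp_wrap  # assumed import path

layer = nn.Linear(4096, 4096, bias=False)               # ~16.8M parameters
wrapped = fsdp_wrap(layer, min_num_params=100_000_000)
# below the threshold (and outside fsdp_enable_wrap), so the layer comes back unwrapped
print(type(wrapped).__name__)
```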
--------------------------------------------------------------------------------
/alpaca/src/generate.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3 -u
2 | # Copyright (c) Facebook, Inc. and its affiliates.
3 | #
4 | # This source code is licensed under the MIT license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | """
7 | Translate pre-processed data with a trained model.
8 | """
9 |
10 | import ast
11 | import logging
12 | import math
13 | import os
14 | import sys
15 | from argparse import Namespace
16 | from itertools import chain
17 |
18 | import numpy as np
19 | import torch
20 | from omegaconf import DictConfig
21 |
22 | from fairseq import checkpoint_utils, options, scoring, tasks, utils
23 | from fairseq.dataclass.utils import convert_namespace_to_omegaconf
24 | from fairseq.logging import progress_bar
25 | from fairseq.logging.meters import StopwatchMeter, TimeMeter
26 | import utils as distributed_utils
27 |
28 |
29 | def main(cfg: DictConfig):
30 |
31 | if isinstance(cfg, Namespace):
32 | cfg = convert_namespace_to_omegaconf(cfg)
33 |
34 | assert cfg.common_eval.path is not None, "--path required for generation!"
35 | assert (
36 | not cfg.generation.sampling or cfg.generation.nbest == cfg.generation.beam
37 | ), "--sampling requires --nbest to be equal to --beam"
38 | assert (
39 | cfg.generation.replace_unk is None or cfg.dataset.dataset_impl == "raw"
40 | ), "--replace-unk requires a raw text dataset (--dataset-impl=raw)"
41 |
42 | if cfg.common_eval.results_path is not None:
43 | os.makedirs(cfg.common_eval.results_path, exist_ok=True)
44 | output_path = os.path.join(
45 | cfg.common_eval.results_path,
46 | "generate-{}.txt".format(cfg.dataset.gen_subset),
47 | )
48 | with open(output_path, "w", buffering=1, encoding="utf-8") as h:
49 | return _main(cfg, h)
50 | else:
51 | return _main(cfg, sys.stdout)
52 |
53 |
54 | def get_symbols_to_strip_from_output(generator):
55 | if hasattr(generator, "symbols_to_strip_from_output"):
56 | return generator.symbols_to_strip_from_output
57 | else:
58 | return {generator.eos}
59 |
60 |
61 | def _main(cfg: DictConfig, output_file):
62 | logging.basicConfig(
63 | format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
64 | datefmt="%Y-%m-%d %H:%M:%S",
65 | level=os.environ.get("LOGLEVEL", "INFO").upper(),
66 | stream=output_file,
67 | )
68 | logger = logging.getLogger("fairseq_cli.generate")
69 |
70 | utils.import_user_module(cfg.common)
71 |
72 | if cfg.dataset.max_tokens is None and cfg.dataset.batch_size is None:
73 | cfg.dataset.max_tokens = 12000
74 | logger.info(cfg)
75 |
76 | # Fix seed for stochastic decoding
77 | if cfg.common.seed is not None and not cfg.generation.no_seed_provided:
78 | np.random.seed(cfg.common.seed)
79 | utils.set_torch_seed(cfg.common.seed)
80 |
81 | use_cuda = torch.cuda.is_available() and not cfg.common.cpu
82 |
83 | # Load dataset splits
84 | task = tasks.setup_task(cfg.task)
85 |
86 | # Set dictionaries
87 | try:
88 | src_dict = getattr(task, "source_dictionary", None)
89 | except NotImplementedError:
90 | src_dict = None
91 | tgt_dict = task.target_dictionary
92 |
93 | overrides = ast.literal_eval(cfg.common_eval.model_overrides)
94 |
95 | logger.info("cfg.checkpoint.checkpoint_suffix {}".format(cfg.checkpoint.checkpoint_suffix))
96 | # Load ensemble
97 | logger.info("loading model(s) from {}".format(cfg.common_eval.path))
98 | models, saved_cfg = checkpoint_utils.load_model_ensemble(
99 | utils.split_paths(cfg.common_eval.path),
100 | arg_overrides=overrides,
101 | task=task,
102 | suffix=cfg.checkpoint.checkpoint_suffix,
103 | strict=(cfg.checkpoint.checkpoint_shard_count == 1),
104 | num_shards=cfg.checkpoint.checkpoint_shard_count,
105 | )
106 |
107 | # loading the dataset should happen after the checkpoint has been loaded so we can give it the saved task config
108 | task.load_dataset(cfg.dataset.gen_subset, task_cfg=saved_cfg.task)
109 |
110 | if cfg.generation.lm_path is not None:
111 | overrides["data"] = cfg.task.data
112 |
113 | try:
114 | lms, _ = checkpoint_utils.load_model_ensemble(
115 | [cfg.generation.lm_path], arg_overrides=overrides, task=None
116 | )
117 | except:
118 | logger.warning(
119 | f"Failed to load language model! Please make sure that the language model dict is the same "
120 | f"as target dict and is located in the data dir ({cfg.task.data})"
121 | )
122 | raise
123 |
124 | assert len(lms) == 1
125 | else:
126 | lms = [None]
127 |
128 | # Optimize ensemble for generation
129 | for model in chain(models, lms):
130 | if model is None:
131 | continue
132 | if cfg.common.fp16:
133 | model.half()
134 | if use_cuda and not cfg.distributed_training.pipeline_model_parallel:
135 | model.cuda()
136 | model.prepare_for_inference_(cfg)
137 |
138 | # Load alignment dictionary for unknown word replacement
139 | # (None if no unknown word replacement, empty if no path to align dictionary)
140 | align_dict = utils.load_align_dict(cfg.generation.replace_unk)
141 |
142 | # Load dataset (possibly sharded)
143 | itr = task.get_batch_iterator(
144 | dataset=task.dataset(cfg.dataset.gen_subset),
145 | max_tokens=cfg.dataset.max_tokens,
146 | max_sentences=cfg.dataset.batch_size,
147 | max_positions=utils.resolve_max_positions(
148 | task.max_positions(), *[m.max_positions() for m in models]
149 | ),
150 | ignore_invalid_inputs=cfg.dataset.skip_invalid_size_inputs_valid_test,
151 | required_batch_size_multiple=cfg.dataset.required_batch_size_multiple,
152 | seed=cfg.common.seed,
153 | num_shards=cfg.distributed_training.distributed_world_size,
154 | shard_id=cfg.distributed_training.distributed_rank,
155 | num_workers=cfg.dataset.num_workers,
156 | data_buffer_size=cfg.dataset.data_buffer_size,
157 | ).next_epoch_itr(shuffle=False)
158 | progress = progress_bar.progress_bar(
159 | itr,
160 | log_format=cfg.common.log_format,
161 | log_interval=cfg.common.log_interval,
162 | default_log_format=("tqdm" if not cfg.common.no_progress_bar else "simple"),
163 | )
164 |
165 | # Initialize generator
166 | gen_timer = StopwatchMeter()
167 | generator = task.build_generator(models, args=cfg.generation)
168 |
169 | # Handle tokenization and BPE
170 | tokenizer = task.build_tokenizer(cfg.tokenizer)
171 | bpe = task.build_bpe(cfg.bpe)
172 |
173 | def decode_fn(x):
174 | if bpe is not None:
175 | x = bpe.decode(x.tolist())
176 | if tokenizer is not None:
177 | x = tokenizer.decode(x)
178 | return x
179 |
180 | scorer = scoring.build_scorer(cfg.scoring, tgt_dict)
181 |
182 | num_sentences = 0
183 | has_target = True
184 | wps_meter = TimeMeter()
185 | for sample in progress:
186 | sample = utils.move_to_cuda(sample) if use_cuda else sample
187 | if "net_input" not in sample:
188 | continue
189 |
190 | prefix_tokens = None
191 | if cfg.generation.prefix_size > 0:
192 | prefix_tokens = sample["target"][:, : cfg.generation.prefix_size]
193 |
194 | constraints = None
195 | if "constraints" in sample:
196 | constraints = sample["constraints"]
197 |
198 | gen_timer.start()
199 | hypos = task.inference_step(
200 | generator,
201 | models,
202 | sample,
203 | prefix_tokens=prefix_tokens,
204 | constraints=constraints,
205 | )
206 | num_generated_tokens = sum(len(h[0]["tokens"]) for h in hypos)
207 | gen_timer.stop(num_generated_tokens)
208 |
209 | for i, sample_id in enumerate(sample["id"].tolist()):
210 | has_target = sample["target"] is not None
211 |
212 | # Remove padding
213 | if "src_tokens" in sample["net_input"]:
214 | src_tokens = utils.strip_pad(
215 | sample["net_input"]["src_tokens"][i, :], tgt_dict.pad()
216 | )
217 | else:
218 | src_tokens = None
219 |
220 | target_tokens = None
221 | if has_target:
222 | target_tokens = (
223 | utils.strip_pad(sample["target"][i, :], tgt_dict.pad()).int().cpu()
224 | )
225 |
226 | # Either retrieve the original sentences or regenerate them from tokens.
227 | if align_dict is not None:
228 | src_str = task.dataset(cfg.dataset.gen_subset).src.get_original_text(
229 | sample_id
230 | )
231 | target_str = task.dataset(cfg.dataset.gen_subset).tgt.get_original_text(
232 | sample_id
233 | )
234 | else:
235 | if src_dict is not None:
236 | src_str = src_dict.string(src_tokens, cfg.common_eval.post_process)
237 | else:
238 | src_str = ""
239 | if has_target:
240 | target_str = tgt_dict.string(
241 | target_tokens,
242 | cfg.common_eval.post_process,
243 | escape_unk=True,
244 | extra_symbols_to_ignore=get_symbols_to_strip_from_output(
245 | generator
246 | ),
247 | )
248 |
249 | src_str = decode_fn(src_tokens)
250 | if has_target:
251 | target_str = decode_fn(target_tokens)
252 |
253 | if "-model_part" in cfg.checkpoint.checkpoint_suffix and "-model_part-0" not in cfg.checkpoint.checkpoint_suffix:
254 | print(distributed_utils.get_model_parallel_rank())
255 | continue
256 |
257 | if not cfg.common_eval.quiet:
258 | if src_dict is not None:
259 | print("S-{}\t{}".format(sample_id, src_str), file=output_file)
260 | # if has_target:
261 | # print("T-{}\t{}".format(sample_id, target_str), file=output_file)
262 |
263 | # Process top predictions
264 | for j, hypo in enumerate(hypos[i][: cfg.generation.nbest]):
265 | hypo_tokens, hypo_str, alignment = utils.post_process_prediction(
266 | hypo_tokens=hypo["tokens"].int().cpu(),
267 | src_str=src_str,
268 | alignment=hypo["alignment"],
269 | align_dict=align_dict,
270 | tgt_dict=tgt_dict,
271 | remove_bpe=cfg.common_eval.post_process,
272 | extra_symbols_to_ignore=get_symbols_to_strip_from_output(generator),
273 | )
274 | detok_hypo_str = decode_fn(hypo_tokens)
275 | if not cfg.common_eval.quiet:
276 | score = hypo["score"] / math.log(2) # convert to base 2
277 | # original hypothesis (after tokenization and BPE)
278 | print(
279 | "H-{}\t{}\t{}".format(sample_id, score, hypo_str),
280 | file=output_file,
281 | )
282 | # detokenized hypothesis
283 | print(
284 | "D-{}\t{}\t{}".format(sample_id, score, detok_hypo_str),
285 | file=output_file,
286 | )
287 | # print(
288 | # "P-{}\t{}".format(
289 | # sample_id,
290 | # " ".join(
291 | # map(
292 | # lambda x: "{:.4f}".format(x),
293 | # # convert from base e to base 2
294 | # hypo["positional_scores"]
295 | # .div_(math.log(2))
296 | # .tolist(),
297 | # )
298 | # ),
299 | # ),
300 | # file=output_file,
301 | # )
302 |
303 | if cfg.generation.print_alignment == "hard":
304 | print(
305 | "A-{}\t{}".format(
306 | sample_id,
307 | " ".join(
308 | [
309 | "{}-{}".format(src_idx, tgt_idx)
310 | for src_idx, tgt_idx in alignment
311 | ]
312 | ),
313 | ),
314 | file=output_file,
315 | )
316 | if cfg.generation.print_alignment == "soft":
317 | print(
318 | "A-{}\t{}".format(
319 | sample_id,
320 | " ".join(
321 | [",".join(src_probs) for src_probs in alignment]
322 | ),
323 | ),
324 | file=output_file,
325 | )
326 |
327 | if cfg.generation.print_step:
328 | print(
329 | "I-{}\t{}".format(sample_id, hypo["steps"]),
330 | file=output_file,
331 | )
332 |
333 | if cfg.generation.retain_iter_history:
334 | for step, h in enumerate(hypo["history"]):
335 | _, h_str, _ = utils.post_process_prediction(
336 | hypo_tokens=h["tokens"].int().cpu(),
337 | src_str=src_str,
338 | alignment=None,
339 | align_dict=None,
340 | tgt_dict=tgt_dict,
341 | remove_bpe=None,
342 | )
343 | print(
344 | "E-{}_{}\t{}".format(sample_id, step, h_str),
345 | file=output_file,
346 | )
347 |
348 | # Score only the top hypothesis
349 | if has_target and j == 0:
350 | if (
351 | align_dict is not None
352 | or cfg.common_eval.post_process is not None
353 | ):
354 | # Convert back to tokens for evaluation with unk replacement and/or without BPE
355 | target_tokens = tgt_dict.encode_line(
356 | target_str, add_if_not_exist=True
357 | )
358 | hypo_tokens = tgt_dict.encode_line(
359 | detok_hypo_str, add_if_not_exist=True
360 | )
361 | if hasattr(scorer, "add_string"):
362 | scorer.add_string(target_str, detok_hypo_str)
363 | else:
364 | scorer.add(target_tokens, hypo_tokens)
365 |
366 | wps_meter.update(num_generated_tokens)
367 | progress.log({"wps": round(wps_meter.avg)})
368 | num_sentences += (
369 | sample["nsentences"] if "nsentences" in sample else sample["id"].numel()
370 | )
371 |
372 | logger.info("NOTE: hypothesis and token scores are output in base 2")
373 | logger.info(
374 | "Translated {:,} sentences ({:,} tokens) in {:.1f}s ({:.2f} sentences/s, {:.2f} tokens/s)".format(
375 | num_sentences,
376 | gen_timer.n,
377 | gen_timer.sum,
378 | num_sentences / gen_timer.sum,
379 | 1.0 / gen_timer.avg,
380 | )
381 | )
382 | # if has_target:
383 | # if cfg.bpe and not cfg.generation.sacrebleu:
384 | # if cfg.common_eval.post_process:
385 | # logger.warning(
386 | # "BLEU score is being computed by splitting detokenized string on spaces, this is probably not what you want. Use --sacrebleu for standard 13a BLEU tokenization"
387 | # )
388 | # else:
389 | # logger.warning(
390 | # "If you are using BPE on the target side, the BLEU score is computed on BPE tokens, not on proper words. Use --sacrebleu for standard 13a BLEU tokenization"
391 | # )
392 | # # use print to be consistent with other main outputs: S-, H-, T-, D- and so on
393 | # print(
394 | # "Generate {} with beam={}: {}".format(
395 | # cfg.dataset.gen_subset, cfg.generation.beam, scorer.result_string()
396 | # ),
397 | # file=output_file,
398 | # )
399 |
400 | return scorer
401 |
402 |
403 | def cli_main():
404 | parser = options.get_generation_parser()
405 | # TODO: replace this workaround with refactoring of `AudioPretraining`
406 | parser.add_argument(
407 | "--arch",
408 | "-a",
409 | metavar="ARCH",
410 | default="wav2vec2",
411 | help="Model architecture. For constructing tasks that rely on "
412 | "model args (e.g. `AudioPretraining`)",
413 | )
414 | args = options.parse_args_and_arch(parser)
415 |
416 | if args.model_parallel_size > 1:
417 | print("run megatron mode...")
418 | distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
419 | else:
420 | main(args)
421 |
422 |
423 | if __name__ == "__main__":
424 | cli_main()
425 |
--------------------------------------------------------------------------------
/alpaca/src/generator/__pycache__/search.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/generator/__pycache__/search.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/generator/__pycache__/sequence_generator.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/generator/__pycache__/sequence_generator.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/inference.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | import torch
7 | from model.llama_model import LLaMA
8 | import argparse
9 | import logging
10 |
11 | logger = logging.getLogger(__name__)
12 |
13 |
14 | @torch.no_grad()
15 | def generate(alpaca):
16 |
17 | # load from txt
18 | # prompts = [
19 | # "Give three tips for staying healthy.",
20 | # "What are the three primary colors?",
21 | # "Describe the structure of an atom.",
22 | # "Describe a time when you had to make a difficult decision.",
23 | # "Explain why the following fraction 4/16 is equivalent to 1/4",
24 | # "Write a short story in third person narration about a protagonist who has to make an important career decision.",
25 | # ]
26 |
27 | # load from files
28 | prompts = open("alpaca/scripts/assert/test.src").readlines()
29 |
30 | eval_kwargs = dict(sampling=True, sampling_topp=0.95, temperature=0.8)
31 | for prompt in prompts:
32 | print("-----" * 20)
33 | prompt_text = "## Instruction:\n{}\n\n## Response:".format(prompt)
34 | print(prompt_text)
35 | output = alpaca.sample([prompt_text], **eval_kwargs)[0][0]
36 | print(output)
37 |
38 | def main():
39 |
40 | parser = argparse.ArgumentParser()
41 | parser.add_argument(
42 | "--model-dir",
43 | required=True,
44 | type=str,
45 | default="",
46 | help="path containing model file",
47 | )
48 | parser.add_argument(
49 | "--model-file",
50 | default="",
51 | help="where in model_dir are weights saved",
52 | )
53 | parser.add_argument(
54 | "--lora-model-inf",
55 | default="",
56 | help="where in model_dir are weights saved",
57 | )
58 | parser.add_argument(
59 | "--lora-tuning",
60 | action="store_true",
61 | default=False,
62 | help="if true use XSUM_KWARGS else CNN_KWARGS",
63 | )
64 |
65 | parser.add_argument("--bpe",)
66 | parser.add_argument("--sentencepiece-model")
67 | args = parser.parse_args()
68 |
69 | kwargs = {
70 | "user_dir": "alpaca/src",
71 | "lora_model_inf": args.lora_model_inf,
72 | "bpe": args.bpe,
73 | "sentencepiece_model": args.sentencepiece_model,
74 | "source_lang": 'src',
75 | "target_lang": 'tgt',
76 | "lora_tuning": args.lora_tuning,
77 | "task": "seq2seq_lora_task",
78 | }
79 | alpaca = LLaMA.from_pretrained(
80 | model_name_or_path=args.model_dir,
81 | checkpoint_file=args.model_file,
82 | **kwargs,
83 | )
84 | alpaca = alpaca.eval()
85 | if torch.cuda.is_available():
86 | alpaca = alpaca.half().cuda()
87 |
88 | generate(alpaca)
89 |
90 |
91 | if __name__ == "__main__":
92 | main()
93 |
--------------------------------------------------------------------------------
/alpaca/src/loss/__pycache__/lm_loss.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/loss/__pycache__/lm_loss.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/loss/lm_loss.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | import math
7 | from dataclasses import dataclass, field
8 |
9 | import torch
10 | from fairseq import metrics, utils
11 | from fairseq.criterions import FairseqCriterion, register_criterion
12 | from fairseq.dataclass import FairseqDataclass
13 | import torch.nn.functional as F
14 |
15 |
16 | @register_criterion("lm_loss")
17 | class LMLabelSmoothedCrossEntropyCriterion(FairseqCriterion):
18 | def __init__(
19 | self,
20 | task,
21 | ):
22 | super().__init__(task)
23 | self.eps = 0.1
24 | self.pad = task.tgt_dict.pad()
25 |
26 | def forward(self, model, sample, reduce=True):
27 | target = sample["target"]
28 | target_mask = target.ne(self.pad)
29 | mask = sample["net_input"]['seq_mask'].eq(1)
30 |
31 | output = model(sample["net_input"]['seq_input'])
32 |
33 | loss, nll_loss = self.label_smooth_loss(output[mask], target[target_mask])
34 |
35 | sample_size = 1
36 | logging_output = {
37 | "loss": loss.data,
38 | "nll_loss": nll_loss.data,
39 | "ntokens": sample["ntokens"],
40 | "nsentences": sample["target"].size(0),
41 | "sample_size": sample_size,
42 | }
43 | return loss, sample_size, logging_output
44 |
45 | def label_smooth_loss(self, net_out, net_target):
46 | net_logits = F.log_softmax(net_out, dim=-1)
47 | nll_loss = F.nll_loss(net_logits, net_target, reduction="none").float().mean()
48 | loss = nll_loss * (1. - self.eps) - net_logits.float().mean() * self.eps
49 | return loss, nll_loss
50 |
51 | @classmethod
52 | def reduce_metrics(cls, logging_outputs) -> None:
53 | """Aggregate logging outputs from data parallel training."""
54 | loss_sum = sum(log.get("loss", 0) for log in logging_outputs)
55 | nll_loss_sum = sum(log.get("nll_loss", 0) for log in logging_outputs)
56 | ntokens = sum(log.get("ntokens", 0) for log in logging_outputs)
57 | sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
58 |
59 | metrics.log_scalar(
60 | "loss", loss_sum / sample_size / math.log(2), sample_size, round=3
61 | )
62 | metrics.log_scalar(
63 | "nll_loss", nll_loss_sum / sample_size / math.log(2), sample_size, round=3
64 | )
65 | metrics.log_derived(
66 | "ppl", lambda meters: utils.get_perplexity(meters["loss"].avg)
67 | )
68 |
69 | @staticmethod
70 | def logging_outputs_can_be_summed() -> bool:
71 | """
72 | Whether the logging outputs returned by `forward` can be summed
73 | across workers prior to calling `reduce_metrics`. Setting this
74 | to True will improve distributed training speed.
75 | """
76 | return True
77 |
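The criterion above is standard label smoothing with eps = 0.1: the smoothed loss is (1 - eps) * NLL(target) - eps * mean(log p), i.e. a fraction eps of the probability mass is spread uniformly over the vocabulary (up to an additive constant). A small numeric sketch of the same computation (illustration only, not repository code):

```python
import torch
import torch.nn.functional as F

eps = 0.1
logits = torch.randn(4, 32001)                 # (masked positions, vocab size)
target = torch.randint(0, 32001, (4,))
log_probs = F.log_softmax(logits, dim=-1)
nll = F.nll_loss(log_probs, target, reduction="none").float().mean()
loss = nll * (1.0 - eps) - log_probs.float().mean() * eps
print(loss, nll)
```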
--------------------------------------------------------------------------------
/alpaca/src/megatron_trainer.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | """
7 | Train a network across multiple GPUs.
8 | """
9 |
10 | from fairseq.dataclass.configs import FairseqConfig
11 | import utils as distributed_utils
12 | from trainer import Trainer
13 | from fairscale.nn.model_parallel.random import get_cuda_rng_tracker
14 |
15 | class MegatronTrainer(Trainer):
16 | """Main class for model parallel with data parallel training."""
17 |
18 | def __init__(self, cfg: FairseqConfig, task, model, criterion, **kwargs):
19 | super().__init__(cfg, task, model, criterion, **kwargs)
20 |
21 | def clip_grad_norm(self, clip_norm):
22 | def _aggregate_model_parallel_grad_norm(total_norm):
23 | total_norm = total_norm**2
24 | distributed_utils.all_reduce(
25 | total_norm, group=distributed_utils.get_model_parallel_group()
26 | )
27 | total_norm = total_norm**0.5
28 | return total_norm
29 |
30 | return self.optimizer.clip_grad_norm(
31 | clip_norm,
32 | aggregate_norm_fn=_aggregate_model_parallel_grad_norm,
33 | )
34 |
35 | def save_checkpoint(self, filename, extra_state):
36 | """Save all training state in a checkpoint file."""
37 | extra_state["rng_tracker_states"] = get_cuda_rng_tracker().get_states()
38 | super().save_checkpoint(filename, extra_state)
39 |
40 | def load_checkpoint(
41 | self,
42 | filename,
43 | reset_optimizer=False,
44 | reset_lr_scheduler=False,
45 | optimizer_overrides=None,
46 | reset_meters=False,
47 | ):
48 | extra_state = super().load_checkpoint(
49 | filename,
50 | reset_optimizer=reset_optimizer,
51 | reset_lr_scheduler=reset_lr_scheduler,
52 | optimizer_overrides=optimizer_overrides,
53 | reset_meters=reset_meters,
54 | )
55 | if extra_state is not None and "rng_tracker_states" in extra_state:
56 | get_cuda_rng_tracker().set_states(extra_state["rng_tracker_states"])
57 | return extra_state
58 |
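`clip_grad_norm` above aggregates the gradient norm across model-parallel ranks: each rank squares its local norm, the squares are all-reduced over the model-parallel group, and the square root of the sum gives the global norm. The arithmetic in miniature (not repository code):

```python
import math

per_rank_norms = [3.0, 4.0]                      # local grad norms on two model-parallel ranks
global_norm = math.sqrt(sum(n ** 2 for n in per_rank_norms))
print(global_norm)                               # 5.0
```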
--------------------------------------------------------------------------------
/alpaca/src/model/__pycache__/hub_interface.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/hub_interface.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/model/__pycache__/llama_megatron_transformer.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/llama_megatron_transformer.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/model/__pycache__/llama_model.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/llama_model.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/model/__pycache__/llama_transformer.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/llama_transformer.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/model/__pycache__/lora_modules.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/lora_modules.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/model/hub_interface.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | import copy
7 | import logging
8 | from typing import Dict, List
9 |
10 | import numpy as np
11 | import torch
12 | import torch.nn as nn
13 | import torch.nn.functional as F
14 | from fairseq import utils
15 | from fairseq.data import encoders
16 | from fairseq.hub_utils import GeneratorHubInterface
17 | from omegaconf import open_dict
18 |
19 |
20 | logger = logging.getLogger(__name__)
21 |
22 |
23 | class LLaMAHubInterface(GeneratorHubInterface):
24 |
25 | def __init__(self, cfg, task, model):
26 | super().__init__(cfg, task, [model])
27 | self.model = self.models[0]
28 |
29 | def encode(
30 | self, sentence: str, *addl_sentences, no_separator=True
31 | ) -> torch.LongTensor:
32 | bpe_sentence = " " + self.bpe.encode(sentence)
33 | tokens = self.task.target_dictionary.encode_line(bpe_sentence, append_eos=False)
34 | return tokens.long()
35 |
36 | def decode(self, tokens: torch.LongTensor):
37 | tokens = tokens.cpu().numpy()
38 | sentences = [self.bpe.sp.decode(tokens.tolist())]
39 | return sentences
40 |
41 | def sample(
42 | self, sentences: List[str], **kwargs
43 | ) -> List[str]:
44 | tokenized_sentences = [self.encode(sentence) for sentence in sentences]
45 | batched_hypos = self.generate(tokenized_sentences, **kwargs)
46 | return [self.decode(hypos[0]["tokens"]) for hypos in batched_hypos]
47 |
48 | def generate(
49 | self,
50 | tokenized_sentences: List[torch.LongTensor],
51 | **kwargs
52 | ) -> List[List[Dict[str, torch.Tensor]]]:
53 |
54 | generator = self.task.build_generator(
55 | self.models,
56 | **kwargs,
57 | )
58 |
59 | results = []
60 | for batch in self._build_batches(tokenized_sentences, skip_invalid_size_inputs=False):
61 | batch = utils.apply_to_sample(lambda t: t.to(self.device), batch)
62 | translations = self.task.inference_step(
63 | generator, self.models, batch,
64 | )
65 | for id, hypos in zip(batch["id"].tolist(), translations):
66 | results.append((id, hypos))
67 |
68 | # sort output to match input order
69 | outputs = [hypos for _, hypos in sorted(results, key=lambda x: x[0])]
70 | return outputs
71 |
--------------------------------------------------------------------------------
/alpaca/src/model/llama_megatron_transformer.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | from typing import Dict, List, Optional, Tuple
7 | import os
8 | import math
9 | import logging
10 |
11 | import torch
12 | from torch import Tensor, nn
13 | import torch.nn.functional as F
14 | from fairseq import options, utils
15 |
16 | from fairscale.nn.model_parallel import initialize as mpu
17 | from fairscale.nn.model_parallel.initialize import initialize_model_parallel
18 | from fairscale.nn.model_parallel.mappings import scatter_to_model_parallel_region, gather_from_model_parallel_region
19 | import fairscale.nn.model_parallel.initialize as fs_init
20 | from fairscale.nn.model_parallel.layers import (
21 | ParallelEmbedding,
22 | RowParallelLinear,
23 | ColumnParallelLinear,
24 | )
25 | from fairseq.modules.checkpoint_activations import checkpoint_wrapper
26 | from .lora_modules import LoRA
27 |
28 | logger = logging.getLogger(__name__)
29 |
30 |
31 | class LLaMAMegatronTransformer(nn.Module):
32 |
33 | def __init__(self, cfg, tgt_dict, embed_tokens, lora_tuning):
34 | super().__init__()
35 |
36 | self.lora_tuning = lora_tuning
37 |
38 | self.tgt_dict = tgt_dict
39 | self.embed_dim = cfg.decoder_embed_dim
40 | self.num_layers = cfg.decoder_layers
41 | self.num_heads = cfg.decoder_attention_heads
42 | self.head_dim = self.embed_dim // self.num_heads
43 | self.max_target_positions = cfg.max_target_positions
44 |
45 | self.pad = self.tgt_dict.pad()
46 | self.embed_tokens = embed_tokens
47 |
48 | self.layers = torch.nn.ModuleList()
49 | self.layers.extend(
50 | [
51 | self.build_decoder_layer(cfg, self.lora_tuning)
52 | for _ in range(self.num_layers)
53 | ]
54 | )
55 |
56 | self.layer_norm = RMSNorm(self.embed_dim)
57 | self.output_projection = ColumnParallelLinear(
58 | self.embed_dim, len(self.tgt_dict), bias=False, init_method=lambda x: x
59 | )
60 |
61 | self.freqs_cis = self.precompute_freqs_cis(
62 | self.embed_dim // self.num_heads, self.max_target_positions * 2
63 | )
64 | self._future_mask = torch.empty(0)
65 |
66 | def build_decoder_layer(self, cfg, lora_tuning):
67 | layer = LLaMATransformerLayer(cfg, lora_tuning)
68 | checkpoint = cfg.checkpoint_activations
69 | if checkpoint:
70 | offload_to_cpu = cfg.offload_activations
71 | layer = checkpoint_wrapper(layer, offload_to_cpu=offload_to_cpu)
72 | return layer
73 |
74 | def precompute_freqs_cis(self, dim: int, end: int, theta: float = 10000.0):
75 | freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
76 | t = torch.arange(end, device=freqs.device) # type: ignore
77 | freqs = torch.outer(t, freqs).float() # type: ignore
78 | freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
79 | return freqs_cis
80 |
81 | def output_layer(self, x):
82 | return self.output_projection(x).float()
83 |
84 | def buffered_future_mask(self, tensor):
85 | dim = tensor.size(1)
86 | if (
87 | self._future_mask.size(0) == 0
88 | or (not self._future_mask.device == tensor.device)
89 | or self._future_mask.size(0) < dim
90 | ):
91 | self._future_mask = torch.triu(
92 | utils.fill_with_neg_inf(torch.zeros([dim, dim])), 1
93 | )
94 | self._future_mask = self._future_mask.to(tensor)
95 | return self._future_mask[:dim, :dim]
96 |
97 | def forward_inf(
98 | self,
99 | prev_output_tokens,
100 | incremental_state: Optional[Dict[str, Dict[str, Optional[Tensor]]]] = None,
101 | src_pos: Optional[Tensor] = None,
102 | tgt_pos: Optional[Tensor] = None,
103 | trunc_flg: bool = False,
104 | ):
105 |
106 | if incremental_state is not None and trunc_flg:
107 | prev_output_tokens = prev_output_tokens[:, -1:]
108 |
109 | bsz, target_len = prev_output_tokens.size()
110 | x = self.embed_tokens(prev_output_tokens)
111 |
112 | key_padding_mask = prev_output_tokens.eq(self.pad)
113 | if incremental_state is not None:
114 | key_padding_mask = torch.cat([incremental_state['padding_mask'], key_padding_mask], dim=-1)
115 |
116 | self.freqs_cis = self.freqs_cis.to(x.device)
117 | if incremental_state is not None:
118 | freqs_cis = self.freqs_cis[:key_padding_mask.size(1)]
119 | else:
120 | freqs_cis = self.freqs_cis[:target_len]
121 |
122 | if incremental_state is not None:
123 | tgt_attn_mask = self.buffered_future_mask(x)
124 | tgt_len = tgt_attn_mask.size(1)
125 | src_len = key_padding_mask.size(1)
126 | src_attn_mask = torch.zeros([tgt_len, src_len - tgt_len]).to(tgt_attn_mask)
127 | self_attn_mask = torch.cat([src_attn_mask, tgt_attn_mask], dim=1)
128 | else:
129 | self_attn_mask = self.buffered_future_mask(x)
130 |
131 | hidden_state = [x]
132 | attn_state = None
133 | for layer_idx, layer in enumerate(self.layers):
134 |
135 | if incremental_state is not None:
136 | context = torch.cat([incremental_state[layer_idx]['key'], x], dim=1)
137 | else:
138 | context = x
139 |
140 | x, attn = layer(
141 | x,
142 | context,
143 | freqs_cis,
144 | key_padding_mask,
145 | self_attn_mask,
146 | src_pos,
147 | tgt_pos,
148 | )
149 |
150 | attn_state = attn
151 | hidden_state.append(x)
152 |
153 | attn_state = attn_state.mean(dim=1)
154 | x = self.layer_norm(x)
155 | return x, key_padding_mask, attn_state, hidden_state
156 |
157 | def forward(self, prev_output_tokens):
158 | bsz, target_len = prev_output_tokens.size()
159 | x = self.embed_tokens(prev_output_tokens)
160 |
161 | key_padding_mask = prev_output_tokens.eq(self.pad)
162 | freqs_cis = self.freqs_cis.to(x.device)[:target_len]
163 | self_attn_mask = self.buffered_future_mask(x)
164 |
165 | hidden_state = [x]
166 | attn_state = None
167 | for layer_idx, layer in enumerate(self.layers):
168 |
169 | x, attn = layer(
170 | x,
171 | x,
172 | freqs_cis,
173 | key_padding_mask,
174 | self_attn_mask,
175 | )
176 |
177 | attn_state = attn
178 | hidden_state.append(x)
179 |
180 | attn_state = attn_state.mean(dim=1)
181 | x = self.layer_norm(x)
182 | return x, key_padding_mask, attn_state, hidden_state
183 |
184 | class LLaMATransformerLayer(nn.Module):
185 |
186 | def __init__(self, cfg, lora_tuning):
187 | super().__init__()
188 |
189 | self.lora_tuning = lora_tuning
190 |
191 | self.embed_dim = cfg.decoder_embed_dim
192 | self.num_heads = cfg.decoder_attention_heads
193 | self.ffn_embed_dim = cfg.decoder_ffn_embed_dim
194 |
195 | self.attention = LLaMAAttention(self.num_heads, self.embed_dim, lora_tuning)
196 | self.feed_forward = LLaMAFeedForward(self.embed_dim, self.ffn_embed_dim)
197 |
198 | self.attention_norm = RMSNorm(self.embed_dim)
199 | self.ffn_norm = RMSNorm(self.embed_dim)
200 |
201 | def forward(
202 | self,
203 | query: Tensor,
204 | key_value: Tensor,
205 | freqs_cis: Tensor,
206 | key_padding_mask: Optional[Tensor],
207 | self_attn_mask: Optional[Tensor],
208 | src_pos: Optional[Tensor] = None,
209 | tgt_pos: Optional[Tensor] = None,
210 | ):
211 |
212 | x, attn = self.attention(
213 | self.attention_norm(query),
214 | self.attention_norm(key_value),
215 | freqs_cis,
216 | key_padding_mask,
217 | self_attn_mask,
218 | src_pos,
219 | tgt_pos,
220 | )
221 | x = query + x
222 | x = x + self.feed_forward(self.ffn_norm(x))
223 |
224 | return x, attn
225 |
226 | class RMSNorm(torch.nn.Module):
227 |
228 | def __init__(self, dim: int, eps: float = 1e-6):
229 | super().__init__()
230 |
231 | self.eps = eps
232 | self.weight = nn.Parameter(torch.ones(dim))
233 |
234 | def _norm(self, x):
235 | return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
236 |
237 | def forward(self, x):
238 | output = self._norm(x.float()).type_as(x)
239 | return output * self.weight
240 |
241 | class LLaMAAttention(nn.Module):
242 |
243 | def __init__(self, num_heads, embed_dim, lora_tuning):
244 | super().__init__()
245 |
246 | self.lora_tuning = lora_tuning
247 |
248 | self.num_heads = num_heads
249 | self.embed_dim = embed_dim
250 | self.head_dim = embed_dim // num_heads
251 | self.local_num_heads = self.num_heads // fs_init.get_model_parallel_world_size()
252 |
253 | self.q_proj = ColumnParallelLinear(
254 | self.embed_dim,
255 | self.embed_dim,
256 | bias=False,
257 | gather_output=False,
258 | init_method=lambda x: x,
259 | )
260 | self.k_proj = ColumnParallelLinear(
261 | self.embed_dim,
262 | self.embed_dim,
263 | bias=False,
264 | gather_output=False,
265 | init_method=lambda x: x,
266 | )
267 | self.v_proj = ColumnParallelLinear(
268 | self.embed_dim,
269 | self.embed_dim,
270 | bias=False,
271 | gather_output=False,
272 | init_method=lambda x: x,
273 | )
274 | self.out_proj = RowParallelLinear(
275 | self.embed_dim,
276 | self.embed_dim,
277 | bias=False,
278 | input_is_parallel=True,
279 | init_method=lambda x: x,
280 | )
281 |
282 | if self.lora_tuning:
283 | self.q_lora = LoRA(self.embed_dim, self.embed_dim)
284 | self.k_lora = LoRA(self.embed_dim, self.embed_dim)
285 | self.v_lora = LoRA(self.embed_dim, self.embed_dim)
286 |
287 | def apply_rotary_emb(
288 | self,
289 | query: Tensor,
290 | key: Tensor,
291 | freqs_cis: Tensor,
292 | src_pos: Tensor,
293 | tgt_pos: Tensor,
294 | ) -> Tuple[Tensor, Tensor]:
295 |
296 | def reshape_for_broadcast(freqs_cis: Tensor, x: Tensor):
297 | ndim = x.ndim
298 | assert 0 <= 1 < ndim
299 | assert freqs_cis.shape == (x.shape[1], x.shape[-1])
300 | shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
301 | return freqs_cis.view(*shape)
302 |
303 | q_ = torch.view_as_complex(query.float().reshape(*query.shape[:-1], -1, 2))
304 | k_ = torch.view_as_complex(key.float().reshape(*key.shape[:-1], -1, 2))
305 |
306 | if src_pos is not None and tgt_pos is not None:
307 | if freqs_cis.size(0) == q_.size(1):
308 | freqs_cis = reshape_for_broadcast(freqs_cis, q_)
309 | q_list = []
310 | k_list = []
311 | for idx, attn_p in enumerate(src_pos):
312 | q_list.append(q_[idx] * freqs_cis.index_select(dim=1, index=attn_p))
313 | k_list.append(k_[idx] * freqs_cis.index_select(dim=1, index=attn_p))
314 | q_out = torch.view_as_real(torch.cat(q_list, dim=0)).flatten(3)
315 | k_out = torch.view_as_real(torch.cat(k_list, dim=0)).flatten(3)
316 | else:
317 | freqs_cis = reshape_for_broadcast(freqs_cis, k_)
318 | q_list = []
319 | k_list = []
320 | idx = 0
321 | for q_pos, k_pos in zip(tgt_pos, torch.cat([src_pos, tgt_pos], dim=-1)):
322 | q_list.append(q_[idx] * freqs_cis.index_select(dim=1, index=q_pos))
323 | k_list.append(k_[idx] * freqs_cis.index_select(dim=1, index=k_pos))
324 | idx += 1
325 | q_out = torch.view_as_real(torch.cat(q_list, dim=0)).flatten(3)
326 | k_out = torch.view_as_real(torch.cat(k_list, dim=0)).flatten(3)
327 | else:
328 | freqs_cis = reshape_for_broadcast(freqs_cis, q_)
329 | q_out = torch.view_as_real(q_ * freqs_cis).flatten(3)
330 | k_out = torch.view_as_real(k_ * freqs_cis).flatten(3)
331 | return q_out.type_as(query), k_out.type_as(key)
332 |
333 | def forward(
334 | self,
335 | query: Tensor,
336 | key_value: Tensor,
337 | freqs_cis: Tensor,
338 | key_padding_mask: Optional[Tensor] = None,
339 | attn_mask: Optional[Tensor] = None,
340 | src_pos: Optional[Tensor] = None,
341 | tgt_pos: Optional[Tensor] = None,
342 | ):
343 |
344 | bsz, tgt_len, embed_dim = query.size()
345 | bsz, src_len, embed_dim = key_value.size()
346 |
347 | q = self.q_proj(query)
348 | k = self.k_proj(key_value)
349 | v = self.v_proj(key_value)
350 |
351 | if self.lora_tuning:
352 |
353 | q = gather_from_model_parallel_region(q) + self.q_lora(query)
354 | k = gather_from_model_parallel_region(k) + self.k_lora(key_value)
355 | v = gather_from_model_parallel_region(v) + self.v_lora(key_value)
356 |
357 | q = scatter_to_model_parallel_region(q)
358 | k = scatter_to_model_parallel_region(k)
359 | v = scatter_to_model_parallel_region(v)
360 |
361 | q = q.view(bsz, tgt_len, self.local_num_heads, self.head_dim)
362 | k = k.view(bsz, src_len, self.local_num_heads, self.head_dim)
363 | v = v.view(bsz, src_len, self.local_num_heads, self.head_dim)
364 |
365 | q, k = self.apply_rotary_emb(q, k, freqs_cis=freqs_cis, src_pos=src_pos, tgt_pos=tgt_pos)
366 |
367 | q = q.transpose(1, 2)
368 | k = k.transpose(1, 2)
369 | v = v.transpose(1, 2)
370 |
371 | attn_scores = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(self.head_dim)
372 |
373 | if attn_mask is not None:
374 | attn_scores = attn_scores + attn_mask.unsqueeze(0).unsqueeze(1)
375 | attn_scores = attn_scores.masked_fill(
376 | key_padding_mask.unsqueeze(1).unsqueeze(2),
377 | float("-inf")
378 | )
379 |
380 | attn_softmax_scores = F.softmax(attn_scores.float(), dim=-1).type_as(q)
381 | output = torch.matmul(attn_softmax_scores, v)
382 | output = output.transpose(1, 2).contiguous().view(bsz, tgt_len, -1)
383 | return self.out_proj(output), attn_softmax_scores
384 |
385 | class LLaMAFeedForward(nn.Module):
386 |
387 | def __init__(self, embed_dim: int, hidden_dim: int):
388 | super().__init__()
389 |
390 | self.embed_dim = embed_dim
391 | self.hidden_dim = hidden_dim
392 |
393 | multiple_of = 256
394 | self.hidden_dim = int(2 * self.hidden_dim / 3)
395 | self.hidden_dim = multiple_of * ((self.hidden_dim + multiple_of - 1) // multiple_of)
396 |
397 | self.w1 = ColumnParallelLinear(
398 | self.embed_dim, self.hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
399 | )
400 | self.w2 = RowParallelLinear(
401 | self.hidden_dim, self.embed_dim, bias=False, input_is_parallel=True, init_method=lambda x: x
402 | )
403 | self.w3 = ColumnParallelLinear(
404 | self.embed_dim, self.hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
405 | )
406 |
407 | def forward(self, x):
408 | return self.w2(F.silu(self.w1(x)) * self.w3(x))
409 |
--------------------------------------------------------------------------------
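The `precompute_freqs_cis` / `apply_rotary_emb` pair above implements rotary position embeddings: adjacent pairs of head dimensions are treated as complex numbers and rotated by position-dependent angles, and during incremental decoding `src_pos` / `tgt_pos` select the matching rows of `freqs_cis` so cached keys keep their original positions. A minimal standalone sketch of that math with toy sizes (variable names here are illustrative, not the repo's API):

```python
import torch

# Toy sizes: head_dim=8, 4 positions, same theta as the model above.
head_dim, seq_len, theta = 8, 4, 10000.0
freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))           # (head_dim/2,)
t = torch.arange(seq_len).float()
freqs_cis = torch.polar(torch.ones(seq_len, head_dim // 2), torch.outer(t, freqs))   # complex64, (seq, head_dim/2)

q = torch.randn(1, seq_len, 1, head_dim)                        # (bsz, seq, heads, head_dim)
q_ = torch.view_as_complex(q.reshape(*q.shape[:-1], -1, 2))     # pair adjacent dims into complex numbers
q_rot = torch.view_as_real(q_ * freqs_cis.view(1, seq_len, 1, -1)).flatten(3)
print(q_rot.shape)  # torch.Size([1, 4, 1, 8]) -- same shape as q, now position-rotated
```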
/alpaca/src/model/llama_model.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 |
7 | from dataclasses import dataclass, field
8 | from typing import Dict, List, Optional, Tuple
9 | import os
10 | from omegaconf import II
11 | import math
12 | import logging
13 |
14 | import torch
15 | from torch import Tensor, nn
16 | import torch.nn.functional as F
17 | from fairseq import options, utils
18 | from fairseq.dataclass import ChoiceEnum, FairseqDataclass
19 | from fairseq.models import (
20 | BaseFairseqModel,
21 | register_model,
22 | register_model_architecture,
23 | )
24 | from fairseq.models.transformer import DEFAULT_MIN_PARAMS_TO_WRAP
25 | from fairscale.nn.model_parallel import initialize as mpu
26 | from .hub_interface import LLaMAHubInterface
27 | from .llama_transformer import LLaMATransformer
28 | from .llama_megatron_transformer import LLaMAMegatronTransformer
29 | from fairscale.nn.model_parallel.layers import ParallelEmbedding
30 | from fairseq.utils import safe_getattr, safe_hasattr
31 |
32 |
33 | logger = logging.getLogger(__name__)
34 |
35 |
36 | @dataclass
37 | class LLaMAConfig(FairseqDataclass):
38 |
39 | dropout: float = field(default=0.1, metadata={"help": "dropout probability"})
40 | attention_dropout: float = field(
41 | default=0.0, metadata={"help": "dropout probability for attention weights"}
42 | )
43 | decoder_embed_dim: int = field(
44 | default=512, metadata={"help": "decoder embedding dimension"}
45 | )
46 | decoder_ffn_embed_dim: int = field(
47 | default=2048, metadata={"help": "decoder embedding dimension for FFN"}
48 | )
49 | decoder_layers: int = field(default=6, metadata={"help": "num decoder layers"})
50 | decoder_attention_heads: int = field(
51 | default=8, metadata={"help": "num decoder attention heads"}
52 | )
53 | max_target_positions: Optional[int] = II("task.max_target_positions")
54 | checkpoint_activations: bool = field(
55 | default=False, metadata={"help": "checkpoint activations at each layer"}
56 | )
57 | offload_activations: bool = field(
58 | default=False,
59 | metadata={"help": "move checkpointed activations to CPU after they are used."},
60 | )
61 | min_params_to_wrap: int = field(
62 | default=DEFAULT_MIN_PARAMS_TO_WRAP,
63 | metadata={
64 | "help": ("minimum number of params for a layer to be wrapped with FSDP()")
65 | },
66 | )
67 |
68 |
69 | def Embedding(num_embeddings, embedding_dim, padding_idx):
70 | m = nn.Embedding(num_embeddings, embedding_dim, padding_idx=padding_idx)
71 | nn.init.normal_(m.weight, mean=0, std=embedding_dim**-0.5)
72 | nn.init.constant_(m.weight[padding_idx], 0)
73 | return m
74 |
75 | @register_model("llama", dataclass=LLaMAConfig)
76 | class LLaMA(BaseFairseqModel):
77 |
78 | def __init__(self, decoder, lora_tuning):
79 | super().__init__()
80 | self.decoder = decoder
81 |
82 | self.lora_tuning = lora_tuning
83 | logger.info('run efficient-tuning method {}'.format(self.lora_tuning))
84 | if self.lora_tuning:
85 | self.mark_only_lora_as_trainable()
86 | self.lora_model_inf = None
87 |
88 | def set_lora_model_inf(self, lora_model_inf):
89 | self.lora_model_inf = lora_model_inf
90 |
91 | def mark_only_lora_as_trainable(self) -> None:
92 | for n, p in self.named_parameters():
93 | if 'lora' not in n:
94 | p.requires_grad = False
95 | else:
96 | p.requires_grad = True
97 |
98 | @classmethod
99 | def build_model(cls, args, task):
100 | """Build a new model instance."""
101 | llama_base_architecture(args)
102 |
103 | logger.info("rescale [src] dictionary: {} types and [tgt] dictionary: {} types".format(
104 | len(task.source_dictionary), len(task.target_dictionary)))
105 |
106 | lora_tuning = safe_getattr(task, "lora_tuning", False)
107 | if safe_getattr(task, "megatron_model", False):
108 | cls.initialize_model_parallel()
109 |
110 | task.source_dictionary.pad_to_multiple_(torch.distributed.get_world_size() * 8)
111 | task.target_dictionary.pad_to_multiple_(torch.distributed.get_world_size() * 8)
112 |
113 | embed_tokens = cls.build_megatron_embedding(args, task.target_dictionary, args.decoder_embed_dim)
114 | decoder = LLaMAMegatronTransformer(args, task.target_dictionary, embed_tokens, lora_tuning)
115 | else:
116 | embed_tokens = cls.build_embedding(args, task.target_dictionary, args.decoder_embed_dim)
117 | decoder = LLaMATransformer(args, task.target_dictionary, embed_tokens, lora_tuning)
118 |
119 | return cls(decoder, lora_tuning)
120 |
121 | @classmethod
122 | def initialize_model_parallel(cls):
123 | logger.info("llama model init process group")
124 |
125 | if not torch.distributed.is_initialized():
126 | torch.distributed.init_process_group("nccl")
127 |
128 | if not mpu.model_parallel_is_initialized():
129 | ws = torch.distributed.get_world_size()
130 | mpu.initialize_model_parallel(ws)
131 |
132 | @classmethod
133 | def build_megatron_embedding(cls, args, dictionary, embed_dim):
134 | embed_tokens = ParallelEmbedding(len(dictionary), embed_dim, init_method=lambda x: x)
135 | return embed_tokens
136 |
137 | @classmethod
138 | def build_embedding(cls, cfg, dictionary, embed_dim):
139 | emb = Embedding(len(dictionary), embed_dim, dictionary.pad())
140 | return emb
141 |
142 | @classmethod
143 | def from_pretrained(
144 | cls,
145 | model_name_or_path,
146 | checkpoint_file,
147 | **kwargs
148 | ):
149 | from fairseq import hub_utils
150 |
151 | x = hub_utils.from_pretrained(
152 | model_name_or_path,
153 | checkpoint_file,
154 | **kwargs,
155 | )
156 | return LLaMAHubInterface(x["args"], x["task"], x["models"][0])
157 |
158 | def forward_encoder(self, encoder_inputs):
159 |
160 | src_x, src_padding, src_attn, src_hiddens = self.decoder.forward_inf(
161 | prev_output_tokens=encoder_inputs['src_tokens'],
162 | src_pos=encoder_inputs['src_pos'],
163 | )
164 |
165 | return {
166 | "encoder_out": [src_x],
167 | "encoder_padding_mask": [src_padding],
168 | "encoder_states": src_hiddens,
169 | "src_tokens": [encoder_inputs['src_tokens']],
170 | "src_pos": [encoder_inputs['src_pos']],
171 | "tgt_pos": [encoder_inputs['tgt_pos']] if encoder_inputs['tgt_pos'] is not None else [],
172 | "bos_token_pos": [encoder_inputs['bos_token_pos']],
173 | }
174 |
175 | def forward_decoder(self, prev_output_tokens, encoder_out, incremental_state=None):
176 |
177 | if len(incremental_state) == 0:
178 | incremental_state["padding_mask"] = encoder_out["encoder_padding_mask"][0]
179 | for layer_idx, layer_hidden_states in enumerate(encoder_out["encoder_states"]):
180 |
181 | incremental_state[layer_idx] = {}
182 | incremental_state[layer_idx]['key'] = layer_hidden_states
183 | incremental_state['src_pos'] = encoder_out['src_pos'][0]
184 | incremental_state['tgt_pos'] = encoder_out['bos_token_pos'][0]
185 |
186 | tgt_x, tgt_padding, tgt_attn, tgt_hiddens = self.decoder.forward_inf(
187 | prev_output_tokens=prev_output_tokens,
188 | incremental_state=incremental_state,
189 | src_pos=incremental_state['src_pos'],
190 | tgt_pos=incremental_state['tgt_pos'],
191 | trunc_flg=True,
192 | )
193 |
194 | tgt_out = self.decoder.output_layer(tgt_x)
195 |
196 | if len(incremental_state) > 0:
197 | incremental_state["padding_mask"] = tgt_padding
198 | for layer_idx, tgt_hid in enumerate(tgt_hiddens):
199 |
200 | incremental_state[layer_idx]['key'] = torch.cat(
201 | [incremental_state[layer_idx]['key'], tgt_hid], dim=1
202 | )
203 | incremental_state['src_pos'] = torch.cat([
204 | incremental_state['src_pos'], incremental_state['tgt_pos']], dim=-1)
205 | incremental_state['tgt_pos'] += 1
206 | return tgt_out, {"attn": [tgt_attn], "inner_states": tgt_hiddens}, incremental_state
207 |
208 | @torch.jit.export
209 | def get_normalized_probs(
210 | self,
211 | net_output: Tuple[Tensor, Optional[Dict[str, List[Optional[Tensor]]]]],
212 | log_probs: bool,
213 | sample: Optional[Dict[str, Tensor]] = None,
214 | ):
215 | logits = net_output[0]
216 |
217 | if log_probs:
218 | return utils.log_softmax(logits, dim=-1)
219 | else:
220 | return utils.softmax(logits, dim=-1)
221 |
222 | def forward(self, prev_output_tokens):
223 | x, x_padding, x_attn, x_hiddens = self.decoder(prev_output_tokens)
224 | x_out = self.decoder.output_layer(x)
225 | return x_out
226 |
227 | @torch.jit.export
228 | def reorder_incremental_state(
229 | self,
230 | incremental_state: Optional[Dict[str, Dict[str, Optional[Tensor]]]],
231 | new_order: Tensor,
232 | ):
233 | for key, value in incremental_state.items():
234 | if "padding_mask" in str(key):
235 | incremental_state[key] = value.index_select(0, new_order)
236 | elif "pos" in str(key):
237 | incremental_state[key] = value.index_select(0, new_order)
238 | else:
239 | incremental_state[key]['key'] = value['key'].index_select(0, new_order)
240 | return incremental_state
241 |
242 | @torch.jit.export
243 | def reorder_encoder_out(self, encoder_out: Dict[str, List[Tensor]], new_order):
244 |
245 | if len(encoder_out["encoder_out"]) == 0:
246 | new_encoder_out = []
247 | else:
248 | new_encoder_out = [encoder_out["encoder_out"][0].index_select(0, new_order)]
249 |
250 | if len(encoder_out["encoder_padding_mask"]) == 0:
251 | new_encoder_padding_mask = []
252 | else:
253 | new_encoder_padding_mask = [
254 | encoder_out["encoder_padding_mask"][0].index_select(0, new_order)
255 | ]
256 |
257 | encoder_states = encoder_out["encoder_states"]
258 | if len(encoder_states) > 0:
259 | for idx, state in enumerate(encoder_states):
260 | encoder_states[idx] = state.index_select(0, new_order)
261 |
262 | if len(encoder_out["src_tokens"]) == 0:
263 | src_tokens = []
264 | else:
265 | src_tokens = [(encoder_out["src_tokens"][0]).index_select(0, new_order)]
266 |
267 | if len(encoder_out["src_pos"]) == 0:
268 | src_pos = []
269 | else:
270 | src_pos = [(encoder_out["src_pos"][0]).index_select(0, new_order)]
271 |
272 | if len(encoder_out["tgt_pos"]) == 0:
273 | tgt_pos = []
274 | else:
275 | tgt_pos = [(encoder_out["tgt_pos"][0]).index_select(0, new_order)]
276 |
277 | if len(encoder_out["bos_token_pos"]) == 0:
278 | bos_token_pos = []
279 | else:
280 | bos_token_pos = [(encoder_out["bos_token_pos"][0]).index_select(0, new_order)]
281 |
282 | return {
283 | "encoder_out": new_encoder_out, # T x B x C
284 | "encoder_padding_mask": new_encoder_padding_mask, # B x T
285 | "encoder_states": encoder_states, # List[T x B x C]
286 | "src_tokens": src_tokens, # B x T
287 | "src_pos": src_pos, # B x T
288 | "tgt_pos": tgt_pos, # B x T
289 | "bos_token_pos": bos_token_pos,
290 | }
291 |
292 | def upgrade_state_dict_named(self, state_dict, name):
293 |
294 | if self.lora_tuning and self.lora_model_inf is not None:
295 | if os.path.exists(self.lora_model_inf):
296 | print("load lora model from {}".format(self.lora_model_inf))
297 | with open(self.lora_model_inf, "rb") as f:
298 | lora_state_dict = torch.load(f, map_location=torch.device("cuda"))['model']
299 | for k in list(lora_state_dict.keys()):
300 | state_dict[k] = lora_state_dict[k]
301 | else:
302 | print("no lora model!")
303 |
304 | if "decoder.embed_tokens.weight" not in state_dict.keys():
305 | for k in list(state_dict.keys()):
306 | if "tok_embeddings.weight" in k:
307 | state_dict["decoder.embed_tokens.weight"] = state_dict[k]
308 | del state_dict[k]
309 | elif "output.weight" in k:
310 | state_dict["decoder.output_projection.weight"] = state_dict[k]
311 | del state_dict[k]
312 |
313 | elif "layers" in k:
314 |
315 | if "inner_attention" in k:
316 | del state_dict[k]
317 | continue
318 |
319 | if "wq" in k:
320 | new_k = 'decoder.' + k.replace("wq", "q_proj")
321 | elif "wk" in k:
322 | new_k = 'decoder.' + k.replace("wk", "k_proj")
323 | elif "wv" in k:
324 | new_k = 'decoder.' + k.replace("wv", "v_proj")
325 | elif "wo" in k:
326 | new_k = 'decoder.' + k.replace("wo", "out_proj")
327 | elif "feed_forward" in k:
328 | new_k = 'decoder.' + k
329 | elif "_norm" in k:
330 | new_k = 'decoder.' + k
331 | else:
332 | continue
333 |
334 | state_dict[new_k] = state_dict[k]
335 | del state_dict[k]
336 |
337 | elif "norm.weight" in k:
338 | state_dict["decoder.layer_norm.weight"] = state_dict[k]
339 | del state_dict[k]
340 |
341 | else:
342 | raise NotImplementedError
343 |
344 | super().upgrade_state_dict_named(state_dict, name)
345 |
346 |
347 | def llama_base_architecture(args):
348 |
349 | args.dropout = safe_getattr(args, "dropout", 0.1)
350 | args.attention_dropout = safe_getattr(args, "attention_dropout", 0.0)
351 | args.decoder_embed_dim = safe_getattr(args, "decoder_embed_dim", 4096)
352 | args.decoder_ffn_embed_dim = safe_getattr(args, "decoder_ffn_embed_dim", 4096 * 4)
353 | args.decoder_layers = safe_getattr(args, "decoder_layers", 32)
354 | args.decoder_attention_heads = safe_getattr(args, "decoder_attention_heads", 32)
355 | args.max_target_positions = safe_getattr(args, "max_target_positions", 2048)
356 |
357 | @register_model_architecture("llama", "llama_7b")
358 | def llama_7b(args):
359 | llama_base_architecture(args)
360 |
--------------------------------------------------------------------------------
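`upgrade_state_dict_named` above renames the keys of the original LLaMA checkpoint into this model's namespace (and, when `lora_model_inf` is set, overlays the LoRA weights on top). A simplified sketch of that mapping; the helper name is hypothetical and the real method additionally drops `inner_attention` keys:

```python
def rename_llama_key(k: str) -> str:
    """Simplified mirror of the key mapping in LLaMA.upgrade_state_dict_named (sketch only)."""
    if k == "tok_embeddings.weight":
        return "decoder.embed_tokens.weight"
    if k == "output.weight":
        return "decoder.output_projection.weight"
    if k == "norm.weight":
        return "decoder.layer_norm.weight"
    for old, new in (("wq", "q_proj"), ("wk", "k_proj"), ("wv", "v_proj"), ("wo", "out_proj")):
        if old in k:
            return "decoder." + k.replace(old, new)
    if "feed_forward" in k or "_norm" in k:
        return "decoder." + k          # FFN and per-layer norm weights keep their names
    raise NotImplementedError(k)

print(rename_llama_key("layers.0.attention.wq.weight"))
# decoder.layers.0.attention.q_proj.weight
```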
/alpaca/src/model/llama_transformer.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | from typing import Dict, List, Optional, Tuple
7 | import os
8 | import math
9 | import logging
10 |
11 | import torch
12 | from torch import Tensor, nn
13 | import torch.nn.functional as F
14 | from fairseq import utils
15 | from torch.nn import Linear
16 | from fsdp.fully_sharded_data_parallel import fsdp_enable_wrap, fsdp_wrap
17 | from fairseq.modules.checkpoint_activations import checkpoint_wrapper
18 | from .lora_modules import LoRA
19 |
20 | logger = logging.getLogger(__name__)
21 |
22 |
23 | class LLaMATransformer(nn.Module):
24 |
25 | def __init__(self, cfg, tgt_dict, embed_tokens, lora_tuning):
26 | super().__init__()
27 |
28 | self.lora_tuning = lora_tuning
29 |
30 | self.tgt_dict = tgt_dict
31 | self.embed_dim = cfg.decoder_embed_dim
32 | self.num_layers = cfg.decoder_layers
33 | self.num_heads = cfg.decoder_attention_heads
34 | self.head_dim = self.embed_dim // self.num_heads
35 | self.max_target_positions = cfg.max_target_positions
36 |
37 | self.pad = self.tgt_dict.pad()
38 | self.embed_tokens = embed_tokens
39 |
40 | self.layers = torch.nn.ModuleList()
41 | self.layers.extend(
42 | [
43 | self.build_decoder_layer(cfg, self.lora_tuning)
44 | for _ in range(self.num_layers)
45 | ]
46 | )
47 |
48 | self.layer_norm = RMSNorm(self.embed_dim)
49 | self.output_projection = Linear(
50 | self.embed_dim, len(self.tgt_dict), bias=False
51 | )
52 |
53 | self.freqs_cis = self.precompute_freqs_cis(
54 | self.embed_dim // self.num_heads, self.max_target_positions * 2
55 | )
56 | self._future_mask = torch.empty(0)
57 |
58 | def build_decoder_layer(self, cfg, lora_tuning):
59 | layer = LLaMATransformerLayer(cfg, lora_tuning)
60 | checkpoint = cfg.checkpoint_activations
61 | if checkpoint:
62 | offload_to_cpu = cfg.offload_activations
63 | layer = checkpoint_wrapper(layer, offload_to_cpu=offload_to_cpu)
64 | # if we are checkpointing, enforce that FSDP always wraps the
65 | # checkpointed layer, regardless of layer size
66 | min_params_to_wrap = cfg.min_params_to_wrap if not checkpoint else 0
67 | layer = fsdp_wrap(layer, min_num_params=min_params_to_wrap)
68 | return layer
69 |
70 | def precompute_freqs_cis(self, dim: int, end: int, theta: float = 10000.0):
71 | freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
72 | t = torch.arange(end, device=freqs.device) # type: ignore
73 | freqs = torch.outer(t, freqs).float() # type: ignore
74 | freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
75 | return freqs_cis
76 |
77 | def output_layer(self, x):
78 | return self.output_projection(x).float()
79 |
80 | def buffered_future_mask(self, tensor):
81 | dim = tensor.size(1)
82 | if (
83 | self._future_mask.size(0) == 0
84 | or (not self._future_mask.device == tensor.device)
85 | or self._future_mask.size(0) < dim
86 | ):
87 | self._future_mask = torch.triu(
88 | utils.fill_with_neg_inf(torch.zeros([dim, dim])), 1
89 | )
90 | self._future_mask = self._future_mask.to(tensor)
91 | return self._future_mask[:dim, :dim]
92 |
93 | def forward_inf(
94 | self,
95 | prev_output_tokens,
96 | incremental_state: Optional[Dict[str, Dict[str, Optional[Tensor]]]] = None,
97 | src_pos: Optional[Tensor] = None,
98 | tgt_pos: Optional[Tensor] = None,
99 | trunc_flg: bool = False,
100 | ):
101 |
102 | if incremental_state is not None and trunc_flg:
103 | prev_output_tokens = prev_output_tokens[:, -1:]
104 |
105 | bsz, target_len = prev_output_tokens.size()
106 | x = self.embed_tokens(prev_output_tokens)
107 |
108 | key_padding_mask = prev_output_tokens.eq(self.pad)
109 | if incremental_state is not None:
110 | key_padding_mask = torch.cat([incremental_state['padding_mask'], key_padding_mask], dim=-1)
111 |
112 | self.freqs_cis = self.freqs_cis.to(x.device)
113 | if incremental_state is not None:
114 | freqs_cis = self.freqs_cis[:key_padding_mask.size(1)]
115 | else:
116 | freqs_cis = self.freqs_cis[:target_len]
117 |
118 | if incremental_state is not None:
119 | tgt_attn_mask = self.buffered_future_mask(x)
120 | tgt_len = tgt_attn_mask.size(1)
121 | src_len = key_padding_mask.size(1)
122 |             src_attn_mask = torch.zeros([tgt_len, src_len - tgt_len]).to(tgt_attn_mask)
123 | self_attn_mask = torch.cat([src_attn_mask, tgt_attn_mask], dim=1)
124 | else:
125 | self_attn_mask = self.buffered_future_mask(x)
126 |
127 | hidden_state = [x]
128 | attn_state = None
129 | for layer_idx, layer in enumerate(self.layers):
130 |
131 | if incremental_state is not None:
132 | context = torch.cat([incremental_state[layer_idx]['key'], x], dim=1)
133 | else:
134 | context = x
135 |
136 | x, attn = layer(
137 | x,
138 | context,
139 | freqs_cis,
140 | key_padding_mask,
141 | self_attn_mask,
142 | src_pos,
143 | tgt_pos,
144 | )
145 |
146 | attn_state = attn
147 | hidden_state.append(x)
148 |
149 | attn_state = attn_state.mean(dim=1)
150 | x = self.layer_norm(x)
151 | return x, key_padding_mask, attn_state, hidden_state
152 |
153 | def forward(self, prev_output_tokens):
154 | bsz, target_len = prev_output_tokens.size()
155 | x = self.embed_tokens(prev_output_tokens)
156 |
157 | key_padding_mask = prev_output_tokens.eq(self.pad)
158 | freqs_cis = self.freqs_cis.to(x.device)[:target_len]
159 | self_attn_mask = self.buffered_future_mask(x)
160 |
161 | hidden_state = [x]
162 | attn_state = None
163 | for layer_idx, layer in enumerate(self.layers):
164 |
165 | x, attn = layer(
166 | x,
167 | x,
168 | freqs_cis,
169 | key_padding_mask,
170 | self_attn_mask,
171 | )
172 |
173 | attn_state = attn
174 | hidden_state.append(x)
175 |
176 | attn_state = attn_state.mean(dim=1)
177 | x = self.layer_norm(x)
178 | return x, key_padding_mask, attn_state, hidden_state
179 |
180 |
181 | class LLaMATransformerLayer(nn.Module):
182 |
183 | def __init__(self, cfg, lora_tuning):
184 | super().__init__()
185 |
186 | self.lora_tuning = lora_tuning
187 |
188 | self.embed_dim = cfg.decoder_embed_dim
189 | self.num_heads = cfg.decoder_attention_heads
190 | self.ffn_embed_dim = cfg.decoder_ffn_embed_dim
191 |
192 | self.attention = LLaMAAttention(self.num_heads, self.embed_dim, self.lora_tuning)
193 | self.feed_forward = LLaMAFeedForward(self.embed_dim, self.ffn_embed_dim)
194 |
195 | self.attention_norm = RMSNorm(self.embed_dim)
196 | self.ffn_norm = RMSNorm(self.embed_dim)
197 |
198 | def forward(
199 | self,
200 | query: Tensor,
201 | key_value: Tensor,
202 | freqs_cis: Tensor,
203 | key_padding_mask: Optional[Tensor],
204 | self_attn_mask: Optional[Tensor],
205 | src_pos: Optional[Tensor] = None,
206 | tgt_pos: Optional[Tensor] = None,
207 | ):
208 |
209 | x, attn = self.attention(
210 | self.attention_norm(query),
211 | self.attention_norm(key_value),
212 | freqs_cis,
213 | key_padding_mask,
214 | self_attn_mask,
215 | src_pos,
216 | tgt_pos,
217 | )
218 | x = query + x
219 | x = x + self.feed_forward(self.ffn_norm(x))
220 | return x, attn
221 |
222 | class RMSNorm(nn.Module):
223 |
224 | def __init__(self, dim: int, eps: float = 1e-6):
225 | super().__init__()
226 |
227 | self.eps = eps
228 | self.weight = nn.Parameter(torch.ones(dim))
229 |
230 | def _norm(self, x):
231 | return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
232 |
233 | def forward(self, x):
234 | output = self._norm(x.float()).type_as(x)
235 | return output * self.weight
236 |
237 |
238 | class LLaMAAttention(nn.Module):
239 |
240 | def __init__(self, num_heads, embed_dim, lora_tuning):
241 | super().__init__()
242 |
243 | self.lora_tuning = lora_tuning
244 |
245 | self.num_heads = num_heads
246 | self.embed_dim = embed_dim
247 | self.head_dim = embed_dim // num_heads
248 | self.local_num_heads = self.num_heads
249 |
250 | self.q_proj = Linear(
251 | self.embed_dim,
252 | self.embed_dim,
253 | bias=False,
254 | )
255 | self.k_proj = Linear(
256 | self.embed_dim,
257 | self.embed_dim,
258 | bias=False,
259 | )
260 | self.v_proj = Linear(
261 | self.embed_dim,
262 | self.embed_dim,
263 | bias=False,
264 | )
265 | self.out_proj = Linear(
266 | self.embed_dim,
267 | self.embed_dim,
268 | bias=False,
269 | )
270 |
271 | if self.lora_tuning:
272 | self.q_lora = LoRA(self.embed_dim, self.embed_dim)
273 | self.k_lora = LoRA(self.embed_dim, self.embed_dim)
274 | self.v_lora = LoRA(self.embed_dim, self.embed_dim)
275 |
276 | def apply_rotary_emb(
277 | self,
278 | query: Tensor,
279 | key: Tensor,
280 | freqs_cis: Tensor,
281 | src_pos: Tensor,
282 | tgt_pos: Tensor,
283 | ) -> Tuple[Tensor, Tensor]:
284 |
285 | def reshape_for_broadcast(freqs_cis: Tensor, x: Tensor):
286 | ndim = x.ndim
287 | assert 0 <= 1 < ndim
288 | assert freqs_cis.shape == (x.shape[1], x.shape[-1])
289 | shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
290 | return freqs_cis.view(*shape)
291 |
292 | q_ = torch.view_as_complex(query.float().reshape(*query.shape[:-1], -1, 2))
293 | k_ = torch.view_as_complex(key.float().reshape(*key.shape[:-1], -1, 2))
294 |
295 | if src_pos is not None and tgt_pos is not None:
296 | if freqs_cis.size(0) == q_.size(1):
297 | freqs_cis = reshape_for_broadcast(freqs_cis, q_)
298 | q_list = []
299 | k_list = []
300 | for idx, attn_p in enumerate(src_pos):
301 | q_list.append(q_[idx] * freqs_cis.index_select(dim=1, index=attn_p))
302 | k_list.append(k_[idx] * freqs_cis.index_select(dim=1, index=attn_p))
303 | q_out = torch.view_as_real(torch.cat(q_list, dim=0)).flatten(3)
304 | k_out = torch.view_as_real(torch.cat(k_list, dim=0)).flatten(3)
305 | else:
306 | freqs_cis = reshape_for_broadcast(freqs_cis, k_)
307 | q_list = []
308 | k_list = []
309 | idx = 0
310 | for q_pos, k_pos in zip(tgt_pos, torch.cat([src_pos, tgt_pos], dim=-1)):
311 | q_list.append(q_[idx] * freqs_cis.index_select(dim=1, index=q_pos))
312 | k_list.append(k_[idx] * freqs_cis.index_select(dim=1, index=k_pos))
313 | idx += 1
314 | q_out = torch.view_as_real(torch.cat(q_list, dim=0)).flatten(3)
315 | k_out = torch.view_as_real(torch.cat(k_list, dim=0)).flatten(3)
316 | else:
317 | freqs_cis = reshape_for_broadcast(freqs_cis, q_)
318 | q_out = torch.view_as_real(q_ * freqs_cis).flatten(3)
319 | k_out = torch.view_as_real(k_ * freqs_cis).flatten(3)
320 | return q_out.type_as(query), k_out.type_as(key)
321 |
322 | def forward(
323 | self,
324 | query: Tensor,
325 | key_value: Tensor,
326 | freqs_cis: Tensor,
327 | key_padding_mask: Optional[Tensor] = None,
328 | attn_mask: Optional[Tensor] = None,
329 | src_pos: Optional[Tensor] = None,
330 | tgt_pos: Optional[Tensor] = None,
331 | ):
332 |
333 | bsz, tgt_len, embed_dim = query.size()
334 | bsz, src_len, embed_dim = key_value.size()
335 |
336 | q = self.q_proj(query)
337 | k = self.k_proj(key_value)
338 | v = self.v_proj(key_value)
339 |
340 | if self.lora_tuning:
341 |
342 | q = q + self.q_lora(query)
343 | k = k + self.k_lora(key_value)
344 | v = v + self.v_lora(key_value)
345 |
346 | q = q.view(bsz, tgt_len, self.local_num_heads, self.head_dim)
347 | k = k.view(bsz, src_len, self.local_num_heads, self.head_dim)
348 | v = v.view(bsz, src_len, self.local_num_heads, self.head_dim)
349 |
350 | q, k = self.apply_rotary_emb(q, k, freqs_cis=freqs_cis, src_pos=src_pos, tgt_pos=tgt_pos)
351 |
352 | q = q.transpose(1, 2)
353 | k = k.transpose(1, 2)
354 | v = v.transpose(1, 2)
355 |
356 | attn_scores = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(self.head_dim)
357 |
358 | if attn_mask is not None:
359 | attn_scores = attn_scores + attn_mask.unsqueeze(0).unsqueeze(1)
360 | attn_scores = attn_scores.masked_fill(
361 | key_padding_mask.unsqueeze(1).unsqueeze(2),
362 | float("-inf")
363 | )
364 |
365 | attn_softmax_scores = F.softmax(attn_scores.float(), dim=-1).type_as(q)
366 | output = torch.matmul(attn_softmax_scores, v)
367 | output = output.transpose(1, 2).contiguous().view(bsz, tgt_len, -1)
368 |
369 | return self.out_proj(output), attn_softmax_scores
370 |
371 |
372 | class LLaMAFeedForward(nn.Module):
373 |
374 | def __init__(self, embed_dim: int, hidden_dim: int):
375 | super().__init__()
376 |
377 | self.embed_dim = embed_dim
378 | self.hidden_dim = hidden_dim
379 |
380 | multiple_of = 256
381 | self.hidden_dim = int(2 * self.hidden_dim / 3)
382 | self.hidden_dim = multiple_of * ((self.hidden_dim + multiple_of - 1) // multiple_of)
383 |
384 | self.w1 = Linear(
385 | self.embed_dim, self.hidden_dim, bias=False,
386 | )
387 | self.w2 = Linear(
388 | self.hidden_dim, self.embed_dim, bias=False,
389 | )
390 | self.w3 = Linear(
391 | self.embed_dim, self.hidden_dim, bias=False,
392 | )
393 |
394 | def forward(self, x):
395 | return self.w2(F.silu(self.w1(x)) * self.w3(x))
396 |
--------------------------------------------------------------------------------
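`LLaMAFeedForward` above shrinks the nominal FFN width to 2/3 of `decoder_ffn_embed_dim` and rounds it up to a multiple of 256, which is the SwiGLU layout of the released LLaMA weights. A worked example of that arithmetic with the `llama_7b` defaults (variables here are just for illustration):

```python
# llama_base_architecture sets decoder_embed_dim=4096 and decoder_ffn_embed_dim=4*4096.
embed_dim, ffn_dim, multiple_of = 4096, 4 * 4096, 256
hidden_dim = int(2 * ffn_dim / 3)                                            # 10922
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)   # round up to multiple of 256
print(hidden_dim)  # 11008, the FFN width of the LLaMA-7B checkpoint
```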
/alpaca/src/model/lora_modules.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 | import torch
6 | from torch import nn
7 | import math
8 |
9 |
10 | class LoRA(nn.Module):
11 |
12 | def __init__(self, input_dim, output_dim):
13 | super().__init__()
14 | self.lora_alpha = 32
15 | self.r = 4
16 | self.scaling = self.lora_alpha / self.r
17 |
18 | self.lora_A = nn.Parameter(torch.zeros((self.r, input_dim)))
19 | self.lora_B = nn.Parameter(torch.zeros((output_dim, self.r)))
20 | self.reset_lora_parameters()
21 |
22 | def reset_lora_parameters(self):
23 | nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
24 | nn.init.zeros_(self.lora_B)
25 |
26 | def forward(self, x):
27 | return (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
28 |
29 | def upgrade_state_dict_named(self, state_dict, name):
30 |
31 | prefix = name + '.lora_A'
32 | if prefix not in state_dict:
33 | state_dict[prefix] = self.lora_A
34 |
35 | prefix = name + '.lora_B'
36 | if prefix not in state_dict:
37 | state_dict[prefix] = self.lora_B
38 |
--------------------------------------------------------------------------------
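The `LoRA` module above adds a rank-`r` update on top of a frozen projection; because `lora_B` is initialised to zero, the combined layer starts exactly at the pretrained weights. A minimal sketch of how the delta composes with a frozen linear layer (toy dimensions and variable names assumed, not the repo's classes):

```python
import math
import torch
from torch import nn

input_dim, output_dim, r, alpha = 16, 16, 4, 32
frozen = nn.Linear(input_dim, output_dim, bias=False)
for p in frozen.parameters():
    p.requires_grad = False                       # only the LoRA factors are trained

lora_A = nn.Parameter(torch.zeros(r, input_dim))
lora_B = nn.Parameter(torch.zeros(output_dim, r))
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))  # B stays zero, so the initial delta is zero

x = torch.randn(2, input_dim)
y = frozen(x) + (x @ lora_A.T @ lora_B.T) * (alpha / r)   # y = W x + (alpha/r) * B A x
print(y.shape)  # torch.Size([2, 16])
```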
/alpaca/src/preprocess.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # Copyright (c) Facebook, Inc. and its affiliates.
3 | #
4 | # This source code is licensed under the MIT license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | """
7 | Data pre-processing: build vocabularies and binarize training data.
8 | """
9 |
10 | import logging
11 | import os
12 | import shutil
13 | import sys
14 | import typing as tp
15 | from argparse import Namespace
16 | from itertools import zip_longest
17 |
18 | from fairseq import options, tasks, utils
19 | from fairseq.binarizer import (
20 | AlignmentDatasetBinarizer,
21 | FileBinarizer,
22 | VocabularyDatasetBinarizer,
23 | )
24 | from dictionary import Dictionary
25 |
26 | logging.basicConfig(
27 | format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
28 | datefmt="%Y-%m-%d %H:%M:%S",
29 | level=os.environ.get("LOGLEVEL", "INFO").upper(),
30 | stream=sys.stdout,
31 | )
32 | logger = logging.getLogger("fairseq_cli.preprocess")
33 |
34 | #####################################################################
35 | # file name tools
36 | #####################################################################
37 |
38 |
39 | def _train_path(lang, trainpref):
40 | return "{}{}".format(trainpref, ("." + lang) if lang else "")
41 |
42 |
43 | def _file_name(prefix, lang):
44 | fname = prefix
45 | if lang is not None:
46 | fname += ".{lang}".format(lang=lang)
47 | return fname
48 |
49 |
50 | def _dest_path(prefix, lang, destdir):
51 | return os.path.join(destdir, _file_name(prefix, lang))
52 |
53 |
54 | def _dict_path(lang, destdir):
55 | return _dest_path("dict", lang, destdir) + ".txt"
56 |
57 |
58 | def dataset_dest_prefix(args, output_prefix, lang):
59 | base = os.path.join(args.destdir, output_prefix)
60 | if lang is not None:
61 | lang_part = f".{args.source_lang}-{args.target_lang}.{lang}"
62 | elif args.only_source:
63 | lang_part = ""
64 | else:
65 | lang_part = f".{args.source_lang}-{args.target_lang}"
66 |
67 | return "{}{}".format(base, lang_part)
68 |
69 |
70 | def dataset_dest_file(args, output_prefix, lang, extension):
71 | return "{}.{}".format(dataset_dest_prefix(args, output_prefix, lang), extension)
72 |
73 |
74 | #####################################################################
75 | # dictionary tools
76 | #####################################################################
77 |
78 |
79 | def _build_dictionary(
80 | filenames,
81 | task,
82 | args,
83 | src=False,
84 | tgt=False,
85 | ):
86 | assert src ^ tgt
87 | return task.build_dictionary(
88 | filenames,
89 | workers=args.workers,
90 | threshold=args.thresholdsrc if src else args.thresholdtgt,
91 | nwords=args.nwordssrc if src else args.nwordstgt,
92 | padding_factor=args.padding_factor,
93 | )
94 |
95 |
96 | #####################################################################
97 | # bin file creation logic
98 | #####################################################################
99 |
100 |
101 | def _make_binary_dataset(
102 | vocab: Dictionary,
103 | input_prefix: str,
104 | output_prefix: str,
105 | lang: tp.Optional[str],
106 | num_workers: int,
107 | args: Namespace,
108 | ):
109 | logger.info("[{}] Dictionary: {} types".format(lang, len(vocab)))
110 |
111 | binarizer = VocabularyDatasetBinarizer(
112 | vocab,
113 | append_eos=True,
114 | )
115 |
116 | input_file = "{}{}".format(input_prefix, ("." + lang) if lang is not None else "")
117 | full_output_prefix = dataset_dest_prefix(args, output_prefix, lang)
118 |
119 | final_summary = FileBinarizer.multiprocess_dataset(
120 | input_file,
121 | args.dataset_impl,
122 | binarizer,
123 | full_output_prefix,
124 | vocab_size=len(vocab),
125 | num_workers=num_workers,
126 | )
127 |
128 | logger.info(f"[{lang}] {input_file}: {final_summary} (by {vocab.unk_word})")
129 |
130 |
131 | def _make_binary_alignment_dataset(
132 | input_prefix: str, output_prefix: str, num_workers: int, args: Namespace
133 | ):
134 |
135 | binarizer = AlignmentDatasetBinarizer(utils.parse_alignment)
136 |
137 | input_file = input_prefix
138 | full_output_prefix = dataset_dest_prefix(args, output_prefix, lang=None)
139 |
140 | final_summary = FileBinarizer.multiprocess_dataset(
141 | input_file,
142 | args.dataset_impl,
143 | binarizer,
144 | full_output_prefix,
145 | vocab_size=None,
146 | num_workers=num_workers,
147 | )
148 |
149 | logger.info(
150 | "[alignments] {}: parsed {} alignments".format(
151 | input_file, final_summary.num_seq
152 | )
153 | )
154 |
155 |
156 | #####################################################################
157 | # routing logic
158 | #####################################################################
159 |
160 |
161 | def _make_dataset(
162 | vocab: Dictionary,
163 | input_prefix: str,
164 | output_prefix: str,
165 | lang: tp.Optional[str],
166 | args: Namespace,
167 | num_workers: int,
168 | ):
169 | if args.dataset_impl == "raw":
170 | # Copy original text file to destination folder
171 | output_text_file = _dest_path(
172 | output_prefix + ".{}-{}".format(args.source_lang, args.target_lang),
173 | lang,
174 | args.destdir,
175 | )
176 | shutil.copyfile(_file_name(input_prefix, lang), output_text_file)
177 | else:
178 | _make_binary_dataset(
179 | vocab, input_prefix, output_prefix, lang, num_workers, args
180 | )
181 |
182 |
183 | def _make_all(lang, vocab, args):
184 | if args.trainpref:
185 | _make_dataset(
186 | vocab, args.trainpref, "train", lang, args=args, num_workers=args.workers
187 | )
188 | if args.validpref:
189 | for k, validpref in enumerate(args.validpref.split(",")):
190 | outprefix = "valid{}".format(k) if k > 0 else "valid"
191 | _make_dataset(
192 | vocab, validpref, outprefix, lang, args=args, num_workers=args.workers
193 | )
194 | if args.testpref:
195 | for k, testpref in enumerate(args.testpref.split(",")):
196 | outprefix = "test{}".format(k) if k > 0 else "test"
197 | _make_dataset(
198 | vocab, testpref, outprefix, lang, args=args, num_workers=args.workers
199 | )
200 |
201 |
202 | def _make_all_alignments(args):
203 | if args.trainpref and os.path.exists(args.trainpref + "." + args.align_suffix):
204 | _make_binary_alignment_dataset(
205 | args.trainpref + "." + args.align_suffix,
206 | "train.align",
207 | num_workers=args.workers,
208 | args=args,
209 | )
210 | if args.validpref and os.path.exists(args.validpref + "." + args.align_suffix):
211 | _make_binary_alignment_dataset(
212 | args.validpref + "." + args.align_suffix,
213 | "valid.align",
214 | num_workers=args.workers,
215 | args=args,
216 | )
217 | if args.testpref and os.path.exists(args.testpref + "." + args.align_suffix):
218 | _make_binary_alignment_dataset(
219 | args.testpref + "." + args.align_suffix,
220 | "test.align",
221 | num_workers=args.workers,
222 | args=args,
223 | )
224 |
225 |
226 | #####################################################################
227 | # align
228 | #####################################################################
229 |
230 |
231 | def _align_files(args, src_dict, tgt_dict):
232 | assert args.trainpref, "--trainpref must be set if --alignfile is specified"
233 | src_file_name = _train_path(args.source_lang, args.trainpref)
234 | tgt_file_name = _train_path(args.target_lang, args.trainpref)
235 | freq_map = {}
236 | with open(args.alignfile, "r", encoding="utf-8") as align_file:
237 | with open(src_file_name, "r", encoding="utf-8") as src_file:
238 | with open(tgt_file_name, "r", encoding="utf-8") as tgt_file:
239 | for a, s, t in zip_longest(align_file, src_file, tgt_file):
240 | si = src_dict.encode_line(s, add_if_not_exist=False)
241 | ti = tgt_dict.encode_line(t, add_if_not_exist=False)
242 | ai = list(map(lambda x: tuple(x.split("-")), a.split()))
243 | for sai, tai in ai:
244 | srcidx = si[int(sai)]
245 | tgtidx = ti[int(tai)]
246 | if srcidx != src_dict.unk() and tgtidx != tgt_dict.unk():
247 | assert srcidx != src_dict.pad()
248 | assert srcidx != src_dict.eos()
249 | assert tgtidx != tgt_dict.pad()
250 | assert tgtidx != tgt_dict.eos()
251 | if srcidx not in freq_map:
252 | freq_map[srcidx] = {}
253 | if tgtidx not in freq_map[srcidx]:
254 | freq_map[srcidx][tgtidx] = 1
255 | else:
256 | freq_map[srcidx][tgtidx] += 1
257 | align_dict = {}
258 | for srcidx in freq_map.keys():
259 | align_dict[srcidx] = max(freq_map[srcidx], key=freq_map[srcidx].get)
260 | with open(
261 | os.path.join(
262 | args.destdir,
263 | "alignment.{}-{}.txt".format(args.source_lang, args.target_lang),
264 | ),
265 | "w",
266 | encoding="utf-8",
267 | ) as f:
268 | for k, v in align_dict.items():
269 | print("{} {}".format(src_dict[k], tgt_dict[v]), file=f)
270 |
271 |
272 | #####################################################################
273 | # MAIN
274 | #####################################################################
275 |
276 |
277 | def main(args):
278 | # setup some basic things
279 | utils.import_user_module(args)
280 |
281 | os.makedirs(args.destdir, exist_ok=True)
282 |
283 | logger.addHandler(
284 | logging.FileHandler(
285 | filename=os.path.join(args.destdir, "preprocess.log"),
286 | )
287 | )
288 | logger.info(args)
289 |
290 | assert (
291 | args.dataset_impl != "huffman"
292 | ), "preprocessing.py doesn't support Huffman yet, use HuffmanCodeBuilder directly."
293 |
294 | # build dictionaries
295 |
296 | target = not args.only_source
297 |
298 | if not args.srcdict and os.path.exists(_dict_path(args.source_lang, args.destdir)):
299 | raise FileExistsError(_dict_path(args.source_lang, args.destdir))
300 |
301 | if (
302 | target
303 | and not args.tgtdict
304 | and os.path.exists(_dict_path(args.target_lang, args.destdir))
305 | ):
306 | raise FileExistsError(_dict_path(args.target_lang, args.destdir))
307 |
308 | task = tasks.get_task(args.task)
309 |
310 | if args.joined_dictionary:
311 | assert (
312 | not args.srcdict or not args.tgtdict
313 | ), "cannot use both --srcdict and --tgtdict with --joined-dictionary"
314 |
315 | if args.srcdict:
316 | src_dict = task.load_dictionary(args.srcdict)
317 | elif args.tgtdict:
318 | src_dict = task.load_dictionary(args.tgtdict)
319 | else:
320 | assert (
321 | args.trainpref
322 | ), "--trainpref must be set if --srcdict is not specified"
323 | src_dict = _build_dictionary(
324 | {
325 | _train_path(lang, args.trainpref)
326 | for lang in [args.source_lang, args.target_lang]
327 | },
328 | task=task,
329 | args=args,
330 | src=True,
331 | )
332 | tgt_dict = src_dict
333 | else:
334 | if args.srcdict:
335 | src_dict = task.load_dictionary(args.srcdict)
336 | else:
337 | assert (
338 | args.trainpref
339 | ), "--trainpref must be set if --srcdict is not specified"
340 | src_dict = _build_dictionary(
341 | [_train_path(args.source_lang, args.trainpref)],
342 | task=task,
343 | args=args,
344 | src=True,
345 | )
346 |
347 | if target:
348 | if args.tgtdict:
349 | tgt_dict = task.load_dictionary(args.tgtdict)
350 | else:
351 | assert (
352 | args.trainpref
353 | ), "--trainpref must be set if --tgtdict is not specified"
354 | tgt_dict = _build_dictionary(
355 | [_train_path(args.target_lang, args.trainpref)],
356 | task=task,
357 | args=args,
358 | tgt=True,
359 | )
360 | else:
361 | tgt_dict = None
362 |
363 | # save dictionaries
364 |
365 | src_dict.save(_dict_path(args.source_lang, args.destdir))
366 | if target and tgt_dict is not None:
367 | tgt_dict.save(_dict_path(args.target_lang, args.destdir))
368 |
369 | if args.dict_only:
370 | return
371 |
372 | _make_all(args.source_lang, src_dict, args)
373 | if target:
374 | _make_all(args.target_lang, tgt_dict, args)
375 |
376 | # align the datasets if needed
377 | if args.align_suffix:
378 | _make_all_alignments(args)
379 |
380 | logger.info("Wrote preprocessed data to {}".format(args.destdir))
381 |
382 | if args.alignfile:
383 | _align_files(args, src_dict=src_dict, tgt_dict=tgt_dict)
384 |
385 |
386 | def cli_main():
387 | parser = options.get_preprocessing_parser()
388 | args = parser.parse_args()
389 | main(args)
390 |
391 |
392 | if __name__ == "__main__":
393 | cli_main()
394 |
--------------------------------------------------------------------------------
/alpaca/src/task/__pycache__/dictionary.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/task/__pycache__/dictionary.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/task/__pycache__/seq2seq_dataset.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/task/__pycache__/seq2seq_dataset.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/task/__pycache__/seq2seq_ft_task.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/task/__pycache__/seq2seq_ft_task.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/task/__pycache__/seq2seq_lora_task.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/task/__pycache__/seq2seq_lora_task.cpython-37.pyc
--------------------------------------------------------------------------------
/alpaca/src/task/dictionary.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | import os
7 | from collections import Counter
8 | from multiprocessing import Pool
9 |
10 | import torch
11 | from fairseq import utils
12 | from fairseq.data import data_utils
13 | from fairseq.file_chunker_utils import Chunker, find_offsets
14 | from fairseq.file_io import PathManager
15 | from fairseq.tokenizer import tokenize_line
16 |
17 |
18 | class Dictionary:
19 | """A mapping from symbols to consecutive integers"""
20 |
21 | def __init__(
22 | self,
23 | *, # begin keyword-only arguments
24 |         bos="<s>",
25 |         pad="<pad>",
26 |         eos="</s>",
27 |         unk="<unk>",
28 | extra_special_symbols=None,
29 | ):
30 | self.bos_word, self.unk_word, self.pad_word, self.eos_word = bos, unk, pad, eos
31 | self.symbols = []
32 | self.count = []
33 | self.indices = {}
34 | self.unk_index = self.add_symbol(unk)
35 | self.bos_index = self.add_symbol(bos)
36 | self.eos_index = self.add_symbol(eos)
37 |
38 | if extra_special_symbols:
39 | for s in extra_special_symbols:
40 | self.add_symbol(s)
41 | self.nspecial = len(self.symbols)
42 |
43 | def __eq__(self, other):
44 | return self.indices == other.indices
45 |
46 | def __getitem__(self, idx):
47 | if idx < len(self.symbols):
48 | return self.symbols[idx]
49 | return self.unk_word
50 |
51 | def get_count(self, idx):
52 | return self.count[idx]
53 |
54 | def __len__(self):
55 | """Returns the number of symbols in the dictionary"""
56 | return len(self.symbols)
57 |
58 | def __contains__(self, sym):
59 | return sym in self.indices
60 |
61 | def index(self, sym):
62 | """Returns the index of the specified symbol"""
63 | assert isinstance(sym, str)
64 | if sym in self.indices:
65 | return self.indices[sym]
66 | return self.unk_index
67 |
68 | def string(
69 | self,
70 | tensor,
71 | bpe_symbol=None,
72 | escape_unk=False,
73 | extra_symbols_to_ignore=None,
74 | unk_string=None,
75 | include_eos=False,
76 | separator=" ",
77 | ):
78 | """Helper for converting a tensor of token indices to a string.
79 |
80 | Can optionally remove BPE symbols or escape words.
81 | """
82 | if torch.is_tensor(tensor) and tensor.dim() == 2:
83 | return "\n".join(
84 | self.string(
85 | t,
86 | bpe_symbol,
87 | escape_unk,
88 | extra_symbols_to_ignore,
89 | include_eos=include_eos,
90 | )
91 | for t in tensor
92 | )
93 |
94 | extra_symbols_to_ignore = set(extra_symbols_to_ignore or [])
95 | if not include_eos:
96 | extra_symbols_to_ignore.add(self.eos())
97 |
98 | def token_string(i):
99 | if i == self.unk():
100 | if unk_string is not None:
101 | return unk_string
102 | else:
103 | return self.unk_string(escape_unk)
104 | else:
105 | return self[i]
106 |
107 | if hasattr(self, "bos_index"):
108 | extra_symbols_to_ignore.add(self.bos())
109 |
110 | sent = separator.join(
111 | token_string(i)
112 | for i in tensor
113 | if utils.item(i) not in extra_symbols_to_ignore
114 | )
115 |
116 | return data_utils.post_process(sent, bpe_symbol)
117 |
118 | def unk_string(self, escape=False):
119 |         """Return unknown string, optionally escaped as: <<unk>>"""
120 | if escape:
121 | return "<{}>".format(self.unk_word)
122 | else:
123 | return self.unk_word
124 |
125 | def add_symbol(self, word, n=1, overwrite=False):
126 | """Adds a word to the dictionary"""
127 | if word in self.indices and not overwrite:
128 | idx = self.indices[word]
129 | self.count[idx] = self.count[idx] + n
130 | return idx
131 | else:
132 | idx = len(self.symbols)
133 | self.indices[word] = idx
134 | self.symbols.append(word)
135 | self.count.append(n)
136 | return idx
137 |
138 | def update(self, new_dict):
139 | """Updates counts from new dictionary."""
140 | for word in new_dict.symbols:
141 | idx2 = new_dict.indices[word]
142 | if word in self.indices:
143 | idx = self.indices[word]
144 | self.count[idx] = self.count[idx] + new_dict.count[idx2]
145 | else:
146 | idx = len(self.symbols)
147 | self.indices[word] = idx
148 | self.symbols.append(word)
149 | self.count.append(new_dict.count[idx2])
150 |
151 | def finalize(self, threshold=-1, nwords=-1, padding_factor=8):
152 | """Sort symbols by frequency in descending order, ignoring special ones.
153 |
154 | Args:
155 | - threshold defines the minimum word count
156 | - nwords defines the total number of words in the final dictionary,
157 | including special symbols
158 | - padding_factor can be used to pad the dictionary size to be a
159 | multiple of 8, which is important on some hardware (e.g., Nvidia
160 | Tensor Cores).
161 | """
162 | if nwords <= 0:
163 | nwords = len(self)
164 |
165 | new_indices = dict(zip(self.symbols[: self.nspecial], range(self.nspecial)))
166 | new_symbols = self.symbols[: self.nspecial]
167 | new_count = self.count[: self.nspecial]
168 |
169 | c = Counter(
170 | dict(
171 | sorted(zip(self.symbols[self.nspecial :], self.count[self.nspecial :]))
172 | )
173 | )
174 | for symbol, count in c.most_common(nwords - self.nspecial):
175 | if count >= threshold:
176 | new_indices[symbol] = len(new_symbols)
177 | new_symbols.append(symbol)
178 | new_count.append(count)
179 | else:
180 | break
181 |
182 | assert len(new_symbols) == len(new_indices)
183 |
184 | self.count = list(new_count)
185 | self.symbols = list(new_symbols)
186 | self.indices = new_indices
187 |
188 | self.pad_to_multiple_(padding_factor)
189 |
190 | def pad_to_multiple_(self, padding_factor):
191 | """Pad Dictionary size to be a multiple of *padding_factor*."""
192 | if padding_factor > 1:
193 | i = 0
194 | while len(self) % padding_factor != 0:
195 | symbol = "madeupword{:04d}".format(i)
196 | self.add_symbol(symbol, n=0)
197 | i += 1
198 |
199 | def bos(self):
200 | """Helper to get index of beginning-of-sentence symbol"""
201 | return self.bos_index
202 |
203 | def pad(self):
204 | """Helper to get index of pad symbol"""
205 | return self.pad_index
206 |
207 | def eos(self):
208 | """Helper to get index of end-of-sentence symbol"""
209 | return self.eos_index
210 |
211 | def unk(self):
212 | """Helper to get index of unk symbol"""
213 | return self.unk_index
214 |
215 | @classmethod
216 | def load(cls, f):
217 | """Loads the dictionary from a text file with the format:
218 |
219 | ```
220 |         <symbol0> <count0>
221 |         <symbol1> <count1>
222 | ...
223 | ```
224 | """
225 | d = cls()
226 | d.add_from_file(f)
227 | return d
228 |
229 | def add_from_file(self, f):
230 | """
231 | Loads a pre-existing dictionary from a text file and adds its symbols
232 | to this instance.
233 | """
234 | if isinstance(f, str):
235 | try:
236 | with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd:
237 | self.add_from_file(fd)
238 | except FileNotFoundError as fnfe:
239 | raise fnfe
240 | except UnicodeError:
241 | raise Exception(
242 | "Incorrect encoding detected in {}, please "
243 | "rebuild the dataset".format(f)
244 | )
245 | return
246 |
247 | lines = f.readlines()
248 | indices_start_line = self._load_meta(lines)
249 |
250 | for line in lines[indices_start_line:]:
251 | try:
252 | line, field = line.rstrip().rsplit(" ", 1)
253 | if field == "#fairseq:overwrite":
254 | overwrite = True
255 | line, field = line.rsplit(" ", 1)
256 | else:
257 | overwrite = False
258 | count = int(field)
259 | word = line
260 | if word in self and not overwrite:
261 | raise RuntimeError(
262 | "Duplicate word found when loading Dictionary: '{}'. "
263 | "Duplicate words can overwrite earlier ones by adding the "
264 | "#fairseq:overwrite flag at the end of the corresponding row "
265 | "in the dictionary file. If using the Camembert model, please "
266 | "download an updated copy of the model file.".format(word)
267 | )
268 | self.add_symbol(word, n=count, overwrite=overwrite)
269 | except ValueError:
270 | raise ValueError(
271 | f"Incorrect dictionary format, expected '<token> <cnt> [flags]': \"{line}\""
272 | )
273 |
274 | def _save(self, f, kv_iterator):
275 | if isinstance(f, str):
276 | PathManager.mkdirs(os.path.dirname(f))
277 | with PathManager.open(f, "w", encoding="utf-8") as fd:
278 | return self.save(fd)
279 | for k, v in kv_iterator:
280 | print("{} {}".format(k, v), file=f)
281 |
282 | def _get_meta(self):
283 | return [], []
284 |
285 | def _load_meta(self, lines):
286 | return 0
287 |
288 | def save(self, f):
289 | """Stores dictionary into a text file"""
290 | ex_keys, ex_vals = self._get_meta()
291 | self._save(
292 | f,
293 | zip(
294 | ex_keys + self.symbols[self.nspecial :],
295 | ex_vals + self.count[self.nspecial :],
296 | ),
297 | )
298 |
299 | def dummy_sentence(self, length):
300 | t = torch.Tensor(length).uniform_(self.nspecial + 1, len(self)).long()
301 | t[-1] = self.eos()
302 | return t
303 |
304 | def encode_line(
305 | self,
306 | line,
307 | line_tokenizer=tokenize_line,
308 | add_if_not_exist=True,
309 | consumer=None,
310 | append_eos=True,
311 | reverse_order=False,
312 | ) -> torch.IntTensor:
313 | words = line_tokenizer(line)
314 | if reverse_order:
315 | words = list(reversed(words))
316 | nwords = len(words)
317 | ids = torch.IntTensor(nwords + 1 if append_eos else nwords)
318 |
319 | for i, word in enumerate(words):
320 | if add_if_not_exist:
321 | idx = self.add_symbol(word)
322 | else:
323 | idx = self.index(word)
324 | if consumer is not None:
325 | consumer(word, idx)
326 | ids[i] = idx
327 | if append_eos:
328 | ids[nwords] = self.eos_index
329 | return ids
330 |
331 | @staticmethod
332 | def _add_file_to_dictionary_single_worker(
333 | filename,
334 | tokenize,
335 | eos_word,
336 | start_offset,
337 | end_offset,
338 | ):
339 | counter = Counter()
340 | with Chunker(filename, start_offset, end_offset) as line_iterator:
341 | for line in line_iterator:
342 | for word in tokenize(line):
343 | counter.update([word])
344 | counter.update([eos_word])
345 | return counter
346 |
347 | @staticmethod
348 | def add_file_to_dictionary(filename, dict, tokenize, num_workers):
349 | def merge_result(counter):
350 | for w, c in sorted(counter.items()):
351 | dict.add_symbol(w, c)
352 |
353 | local_file = PathManager.get_local_path(filename)
354 | offsets = find_offsets(local_file, num_workers)
355 | if num_workers > 1:
356 | chunks = zip(offsets, offsets[1:])
357 | pool = Pool(processes=num_workers)
358 | results = []
359 | for (start_offset, end_offset) in chunks:
360 | results.append(
361 | pool.apply_async(
362 | Dictionary._add_file_to_dictionary_single_worker,
363 | (
364 | local_file,
365 | tokenize,
366 | dict.eos_word,
367 | start_offset,
368 | end_offset,
369 | ),
370 | )
371 | )
372 | pool.close()
373 | pool.join()
374 | for r in results:
375 | merge_result(r.get())
376 | else:
377 | merge_result(
378 | Dictionary._add_file_to_dictionary_single_worker(
379 | local_file, tokenize, dict.eos_word, offsets[0], offsets[1]
380 | )
381 | )
382 |
383 |
--------------------------------------------------------------------------------
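The `Dictionary` above is essentially fairseq's vocabulary class: a text-file round trip (`load`/`save`), incremental building (`add_symbol`, `add_from_file`, `finalize`), and whitespace tokenization via `encode_line`. Below is a minimal usage sketch, not repo code; it assumes `alpaca/src` is on `sys.path` (the other modules here import each other that way) and uses the shared dict file from `alpaca/scripts/assert/dict.txt`.

```python
# Minimal usage sketch (not part of the repo). Assumes alpaca/src is on sys.path
# and that alpaca/scripts/assert/dict.txt is the shared "<symbol> <count>" file.
from task.dictionary import Dictionary

d = Dictionary.load("alpaca/scripts/assert/dict.txt")
print(len(d), d.pad(), d.eos(), d.unk())   # vocab size and special-symbol indices

# encode_line() splits on whitespace, maps tokens to indices (unk for OOV when
# add_if_not_exist=False) and appends </s> by default.
ids = d.encode_line("hello world", add_if_not_exist=False, append_eos=True)
print(ids.tolist())
```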
/alpaca/src/task/seq2seq_dataset.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | import logging
7 |
8 | import numpy as np
9 | import torch
10 | from fairseq.data import FairseqDataset, data_utils
11 | from fairseq.utils import new_arange
12 | import math
13 |
14 |
15 | logger = logging.getLogger(__name__)
16 |
17 | def collate(
18 | samples,
19 | pad_idx,
20 | eos_idx,
21 | left_pad_source=True,
22 | left_pad_target=False,
23 | input_feeding=True,
24 | pad_to_length=None,
25 | pad_to_multiple=1,
26 | ):
27 | if len(samples) == 0:
28 | return {}
29 |
30 | def merge(key, left_pad, move_eos_to_beginning=False, pad_to_length=None):
31 | return data_utils.collate_tokens(
32 | [s[key] for s in samples],
33 | pad_idx,
34 | eos_idx,
35 | left_pad,
36 | move_eos_to_beginning,
37 | pad_to_length=pad_to_length,
38 | pad_to_multiple=pad_to_multiple,
39 | )
40 |
41 | def check_alignment(alignment, src_len, tgt_len):
42 | if alignment is None or len(alignment) == 0:
43 | return False
44 | if (
45 | alignment[:, 0].max().item() >= src_len - 1
46 | or alignment[:, 1].max().item() >= tgt_len - 1
47 | ):
48 | logger.warning("alignment size mismatch found, skipping alignment!")
49 | return False
50 | return True
51 |
52 | def compute_alignment_weights(alignments):
53 | """
54 | Given a tensor of shape [:, 2] containing the source-target indices
55 | corresponding to the alignments, a weight vector containing the
56 | inverse frequency of each target index is computed.
57 | For e.g. if alignments = [[5, 7], [2, 3], [1, 3], [4, 2]], then
58 | a tensor containing [1., 0.5, 0.5, 1] should be returned (since target
59 | index 3 is repeated twice)
60 | """
61 | align_tgt = alignments[:, 1]
62 | _, align_tgt_i, align_tgt_c = torch.unique(
63 | align_tgt, return_inverse=True, return_counts=True
64 | )
65 | align_weights = align_tgt_c[align_tgt_i[np.arange(len(align_tgt))]]
66 | return 1.0 / align_weights.float()
67 |
68 | id = torch.LongTensor([s["id"] for s in samples])
69 | src_tokens = merge(
70 | "source",
71 | left_pad=left_pad_source,
72 | pad_to_length=pad_to_length["source"] if pad_to_length is not None else None,
73 | )
74 | # sort by descending source length
75 | src_lengths = torch.LongTensor(
76 | [s["source"].ne(pad_idx).long().sum() for s in samples]
77 | )
78 | src_lengths, sort_order = src_lengths.sort(descending=True)
79 | id = id.index_select(0, sort_order)
80 | src_tokens = src_tokens.index_select(0, sort_order)
81 |
82 | def merge_data(data_name, sort_order):
83 | prepared_data = merge(
84 | data_name,
85 | left_pad=left_pad_target,
86 | pad_to_length=pad_to_length[data_name]
87 | if pad_to_length is not None
88 | else None,
89 | )
90 | return prepared_data.index_select(0, sort_order)
91 |
92 | src_input = merge(
93 | "src_input",
94 | left_pad=left_pad_source,
95 | pad_to_length=pad_to_length["src_input"] if pad_to_length is not None else None,
96 | )
97 | src_input = src_input.index_select(0, sort_order)
98 |
99 | bos_token = merge(
100 | "bos_token",
101 | left_pad=left_pad_source,
102 | pad_to_length=pad_to_length["bos_token"] if pad_to_length is not None else None,
103 | )
104 | bos_token = bos_token.index_select(0, sort_order)
105 |
106 | src_pos = merge(
107 | "src_pos",
108 | left_pad=left_pad_source,
109 | pad_to_length=pad_to_length["src_pos"] if pad_to_length is not None else None,
110 | )
111 | src_pos = src_pos.index_select(0, sort_order)
112 |
113 | bos_token_pos = merge(
114 | "bos_token_pos",
115 | left_pad=left_pad_source,
116 | pad_to_length=pad_to_length["bos_token_pos"] if pad_to_length is not None else None,
117 | )
118 | bos_token_pos = bos_token_pos.index_select(0, sort_order)
119 |
120 | target = None
121 | tgt_input = None
122 | tgt_pos = None
123 |
124 | seq_input = None
125 | seq_mask = None
126 |
127 | if samples[0].get("target", None) is not None:
128 | target = merge(
129 | "target",
130 | left_pad=left_pad_target,
131 | pad_to_length=pad_to_length["target"]
132 | if pad_to_length is not None
133 | else None,
134 | )
135 | target = target.index_select(0, sort_order)
136 | tgt_lengths = torch.LongTensor(
137 | [s["target"].ne(pad_idx).long().sum() for s in samples]
138 | ).index_select(0, sort_order)
139 | ntokens = tgt_lengths.sum().item()
140 |
141 | prev_output_tokens = merge(
142 | "target",
143 | left_pad=left_pad_target,
144 | move_eos_to_beginning=True,
145 | pad_to_length=pad_to_length["target"] if pad_to_length is not None else None,
146 | )
147 | prev_output_tokens = prev_output_tokens.index_select(0, sort_order)
148 | prev_output_tokens[:,0:1] = bos_token
149 |
150 | tgt_pos = merge(
151 | "tgt_pos",
152 | left_pad=left_pad_source,
153 | move_eos_to_beginning=True,
154 | pad_to_length=pad_to_length["tgt_pos"] if pad_to_length is not None else None,
155 | )
156 | tgt_pos = tgt_pos.index_select(0, sort_order)
157 | tgt_pos[:,0:1] = bos_token_pos
158 | tgt_pos_mask = (tgt_pos == pad_idx)
159 | tgt_pos = tgt_pos.masked_fill(tgt_pos_mask, 0)
160 |
161 | seq_mask = merge(
162 | "seq_mask",
163 | left_pad=left_pad_source,
164 | pad_to_length=pad_to_length["seq_mask"] if pad_to_length is not None else None,
165 | )
166 | seq_mask = seq_mask.index_select(0, sort_order)
167 | seq_input = merge(
168 | "seq_input",
169 | left_pad=left_pad_source,
170 | pad_to_length=pad_to_length["seq_input"] if pad_to_length is not None else None,
171 | )
172 | seq_input = seq_input.index_select(0, sort_order)
173 | else:
174 | ntokens = src_lengths.sum().item()
175 | prev_output_tokens = None
176 |
177 | src_pos_mask = (src_pos == pad_idx)
178 | src_pos = src_pos.masked_fill(src_pos_mask, 0)
179 |
180 | batch = {
181 | "id": id,
182 | "nsentences": len(samples),
183 | "ntokens": ntokens,
184 | "net_input": {
185 | "seq_mask": seq_mask,
186 | "seq_input": seq_input,
187 | "soruce": src_tokens,
188 | "src_tokens": src_input,
189 | "src_lengths": src_lengths,
190 | "bos_token": bos_token,
191 | "src_pos": src_pos,
192 | "tgt_pos": tgt_pos,
193 | "bos_token_pos": bos_token_pos,
194 | },
195 | "target": target,
196 | }
197 | if prev_output_tokens is not None:
198 | batch["net_input"]["prev_output_tokens"] = prev_output_tokens
199 |
200 | if samples[0].get("alignment", None) is not None:
201 | bsz, tgt_sz = batch["target"].shape
202 | src_sz = batch["net_input"]["src_tokens"].shape[1]
203 |
204 | offsets = torch.zeros((len(sort_order), 2), dtype=torch.long)
205 | offsets[:, 1] += torch.arange(len(sort_order), dtype=torch.long) * tgt_sz
206 | if left_pad_source:
207 | offsets[:, 0] += src_sz - src_lengths
208 | if left_pad_target:
209 | offsets[:, 1] += tgt_sz - tgt_lengths
210 |
211 | alignments = [
212 | alignment + offset
213 | for align_idx, offset, src_len, tgt_len in zip(
214 | sort_order, offsets, src_lengths, tgt_lengths
215 | )
216 | for alignment in [samples[align_idx]["alignment"].view(-1, 2)]
217 | if check_alignment(alignment, src_len, tgt_len)
218 | ]
219 |
220 | if len(alignments) > 0:
221 | alignments = torch.cat(alignments, dim=0)
222 | align_weights = compute_alignment_weights(alignments)
223 |
224 | batch["alignments"] = alignments
225 | batch["align_weights"] = align_weights
226 |
227 | if samples[0].get("constraints", None) is not None:
228 | # Collate the packed constraints across the samples, padding to
229 | # the length of the longest sample.
230 | lens = [sample.get("constraints").size(0) for sample in samples]
231 | max_len = max(lens)
232 | constraints = torch.zeros((len(samples), max(lens))).long()
233 | for i, sample in enumerate(samples):
234 | constraints[i, 0 : lens[i]] = samples[i].get("constraints")
235 | batch["constraints"] = constraints
236 |
237 | return batch
238 |
239 |
240 | class LanguagePairDataset(FairseqDataset):
241 |
242 | def __init__(
243 | self,
244 | src,
245 | src_sizes,
246 | src_dict,
247 | tgt=None,
248 | tgt_sizes=None,
249 | tgt_dict=None,
250 | left_pad_source=True,
251 | left_pad_target=False,
252 | shuffle=True,
253 | input_feeding=True,
254 | remove_eos_from_source=False,
255 | append_eos_to_target=False,
256 | align_dataset=None,
257 | constraints=None,
258 | append_bos=False,
259 | eos=None,
260 | num_buckets=0,
261 | src_lang_id=None,
262 | tgt_lang_id=None,
263 | pad_to_multiple=1,
264 | ):
265 | if tgt_dict is not None:
266 | assert src_dict.pad() == tgt_dict.pad()
267 | assert src_dict.eos() == tgt_dict.eos()
268 | assert src_dict.unk() == tgt_dict.unk()
269 | if tgt is not None:
270 | assert len(src) == len(
271 | tgt
272 | ), "Source and target must contain the same number of examples"
273 | self.src = src
274 | self.tgt = tgt
275 | self.src_sizes = np.array(src_sizes)
276 | self.tgt_sizes = np.array(tgt_sizes) if tgt_sizes is not None else None
277 | self.sizes = (
278 | np.vstack((self.src_sizes, self.tgt_sizes)).T
279 | if self.tgt_sizes is not None
280 | else self.src_sizes
281 | )
282 | self.src_dict = src_dict
283 | self.tgt_dict = tgt_dict
284 | self.left_pad_source = left_pad_source
285 | self.left_pad_target = left_pad_target
286 | self.shuffle = shuffle
287 | self.input_feeding = input_feeding
288 | self.remove_eos_from_source = remove_eos_from_source
289 | self.append_eos_to_target = append_eos_to_target
290 | self.align_dataset = align_dataset
291 | if self.align_dataset is not None:
292 | assert (
293 | self.tgt_sizes is not None
294 | ), "Both source and target needed when alignments are provided"
295 | self.constraints = constraints
296 | self.append_bos = append_bos
297 | self.eos = eos if eos is not None else src_dict.eos()
298 | self.src_lang_id = src_lang_id
299 | self.tgt_lang_id = tgt_lang_id
300 | if num_buckets > 0:
301 | from fairseq.data import BucketPadLengthDataset
302 |
303 | self.src = BucketPadLengthDataset(
304 | self.src,
305 | sizes=self.src_sizes,
306 | num_buckets=num_buckets,
307 | pad_idx=self.src_dict.pad(),
308 | left_pad=self.left_pad_source,
309 | )
310 | self.src_sizes = self.src.sizes
311 | logger.info("bucketing source lengths: {}".format(list(self.src.buckets)))
312 | if self.tgt is not None:
313 | self.tgt = BucketPadLengthDataset(
314 | self.tgt,
315 | sizes=self.tgt_sizes,
316 | num_buckets=num_buckets,
317 | pad_idx=self.tgt_dict.pad(),
318 | left_pad=self.left_pad_target,
319 | )
320 | self.tgt_sizes = self.tgt.sizes
321 | logger.info(
322 | "bucketing target lengths: {}".format(list(self.tgt.buckets))
323 | )
324 |
325 | # determine bucket sizes using self.num_tokens, which will return
326 | # the padded lengths (thanks to BucketPadLengthDataset)
327 | num_tokens = np.vectorize(self.num_tokens, otypes=[np.long])
328 | self.bucketed_num_tokens = num_tokens(np.arange(len(self.src)))
329 | self.buckets = [
330 | (None, num_tokens) for num_tokens in np.unique(self.bucketed_num_tokens)
331 | ]
332 | else:
333 | self.buckets = None
334 | self.pad_to_multiple = pad_to_multiple
335 |
336 | def get_batch_shapes(self):
337 | return self.buckets
338 |
339 | def __getitem__(self, index):
340 | tgt_item = self.tgt[index] if self.tgt is not None else None
341 | src_item = self.src[index]
342 |
343 | # strip the trailing eos from the source item (the target's leading bos is stripped below)
344 | eos = self.src_dict.eos()
345 | if self.src[index][-1] == eos:
346 | src_item = self.src[index][:-1]
347 |
348 | bos_token = src_item[-1:]
349 | bos_token_pos = new_arange(bos_token) + src_item.size(0) - 1
350 |
351 | src_input = src_item[:-1]
352 | src_pos = new_arange(src_input)
353 |
354 | bos = self.tgt_dict.bos()
355 | if tgt_item is not None:
356 | if self.tgt[index][0] == bos:
357 | tgt_item = self.tgt[index][1:]
358 |
359 | tgt_pos = new_arange(tgt_item) + src_item.size(0)
360 | seq_input = torch.cat([src_item, tgt_item[:-1]], 0)
361 | seq_mask = torch.cat([src_input.new_zeros(src_input.size()), tgt_item.new_ones(tgt_item.size())], 0)
362 | else:
363 | tgt_pos = None
364 | seq_input = None
365 | seq_mask = None
366 |
367 | example = {
368 | "id": index,
369 | "source": src_item,
370 | "target": tgt_item,
371 | "src_input": src_input,
372 | "bos_token": bos_token,
373 | "src_pos": src_pos,
374 | "tgt_pos": tgt_pos,
375 | "bos_token_pos": bos_token_pos,
376 | "seq_input": seq_input,
377 | "seq_mask": seq_mask,
378 | }
379 |
380 | if self.align_dataset is not None:
381 | example["alignment"] = self.align_dataset[index]
382 | if self.constraints is not None:
383 | example["constraints"] = self.constraints[index]
384 | return example
385 |
386 | def __len__(self):
387 | return len(self.src)
388 |
389 | def collater(self, samples, pad_to_length=None):
390 | res = collate(
391 | samples,
392 | pad_idx=self.src_dict.pad(),
393 | eos_idx=self.eos,
394 | left_pad_source=self.left_pad_source,
395 | left_pad_target=self.left_pad_target,
396 | input_feeding=self.input_feeding,
397 | pad_to_length=pad_to_length,
398 | pad_to_multiple=self.pad_to_multiple,
399 | )
400 | if self.src_lang_id is not None or self.tgt_lang_id is not None:
401 | src_tokens = res["net_input"]["src_tokens"]
402 | bsz = src_tokens.size(0)
403 | if self.src_lang_id is not None:
404 | res["net_input"]["src_lang_id"] = (
405 | torch.LongTensor([[self.src_lang_id]]).expand(bsz, 1).to(src_tokens)
406 | )
407 | if self.tgt_lang_id is not None:
408 | res["tgt_lang_id"] = (
409 | torch.LongTensor([[self.tgt_lang_id]]).expand(bsz, 1).to(src_tokens)
410 | )
411 | return res
412 |
413 | def num_tokens(self, index):
414 | """Return the number of tokens in a sample. This value is used to
415 | enforce ``--max-tokens`` during batching."""
416 | return max(
417 | self.src_sizes[index],
418 | self.tgt_sizes[index] if self.tgt_sizes is not None else 0,
419 | )
420 |
421 | def num_tokens_vec(self, indices):
422 | """Return the number of tokens for a set of positions defined by indices.
423 | This value is used to enforce ``--max-tokens`` during batching."""
424 | sizes = self.src_sizes[indices]
425 | if self.tgt_sizes is not None:
426 | sizes = np.maximum(sizes, self.tgt_sizes[indices])
427 | return sizes
428 |
429 | def size(self, index):
430 | """Return an example's size as a float or tuple. This value is used when
431 | filtering a dataset with ``--max-positions``."""
432 | return (
433 | self.src_sizes[index],
434 | self.tgt_sizes[index] if self.tgt_sizes is not None else 0,
435 | )
436 |
437 | def ordered_indices(self):
438 | """Return an ordered list of indices. Batches will be constructed based
439 | on this order."""
440 | if self.shuffle:
441 | indices = np.random.permutation(len(self)).astype(np.int64)
442 | else:
443 | indices = np.arange(len(self), dtype=np.int64)
444 | if self.buckets is None:
445 | # sort by target length, then source length
446 | if self.tgt_sizes is not None:
447 | indices = indices[np.argsort(self.tgt_sizes[indices], kind="mergesort")]
448 | return indices[np.argsort(self.src_sizes[indices], kind="mergesort")]
449 | else:
450 | # sort by bucketed_num_tokens, which is:
451 | # max(padded_src_len, padded_tgt_len)
452 | return indices[
453 | np.argsort(self.bucketed_num_tokens[indices], kind="mergesort")
454 | ]
455 |
456 | @property
457 | def supports_prefetch(self):
458 | return getattr(self.src, "supports_prefetch", False) and (
459 | getattr(self.tgt, "supports_prefetch", False) or self.tgt is None
460 | )
461 |
462 | def prefetch(self, indices):
463 | self.src.prefetch(indices)
464 | if self.tgt is not None:
465 | self.tgt.prefetch(indices)
466 | if self.align_dataset is not None:
467 | self.align_dataset.prefetch(indices)
468 |
469 | def filter_indices_by_size(self, indices, max_sizes):
470 | """Filter a list of sample indices. Remove those that are longer
471 | than specified in max_sizes.
472 |
473 | Args:
474 | indices (np.array): original array of sample indices
475 | max_sizes (int or list[int] or tuple[int]): max sample size,
476 | can be defined separately for src and tgt (then list or tuple)
477 |
478 | Returns:
479 | np.array: filtered sample array
480 | list: list of removed indices
481 | """
482 | return data_utils.filter_paired_dataset_indices_by_size(
483 | self.src_sizes,
484 | self.tgt_sizes,
485 | indices,
486 | max_sizes,
487 | )
488 |
--------------------------------------------------------------------------------
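For orientation, here is a small, self-contained walk-through (toy tensors, not repo code) of the index bookkeeping in `LanguagePairDataset.__getitem__`: target positions continue where the source positions stop, the last prompt token becomes `bos_token` for decoding, and `seq_input`/`seq_mask` concatenate prompt and response so the loss can later be restricted to the response part. The `new_arange` helper below is a minimal stand-in for `fairseq.utils.new_arange` on 1-D tensors.

```python
# Toy walk-through (assumed example, not repo code) of __getitem__'s index arithmetic.
import torch

def new_arange(x):
    # minimal stand-in for fairseq.utils.new_arange on 1-D tensors
    return torch.arange(x.numel()).type_as(x)

src_item = torch.tensor([11, 12, 13, 14])   # prompt ids, trailing eos already stripped
tgt_item = torch.tensor([21, 22, 23, 2])    # response ids ending in eos (index 2)

bos_token = src_item[-1:]                                      # last prompt token seeds decoding
bos_token_pos = new_arange(bos_token) + src_item.size(0) - 1   # tensor([3])
src_input = src_item[:-1]
src_pos = new_arange(src_input)                                # tensor([0, 1, 2])
tgt_pos = new_arange(tgt_item) + src_item.size(0)              # tensor([4, 5, 6, 7])

seq_input = torch.cat([src_item, tgt_item[:-1]], 0)            # prompt followed by shifted response
seq_mask = torch.cat([src_input.new_zeros(src_input.size()),
                      tgt_item.new_ones(tgt_item.size())], 0)  # loss applies where mask == 1
print(seq_input.tolist())   # [11, 12, 13, 14, 21, 22, 23]
print(seq_mask.tolist())    # [0, 0, 0, 1, 1, 1, 1]
```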
/alpaca/src/task/seq2seq_ft_task.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import itertools
3 | import os
4 | import logging
5 | from typing import Dict, Optional
6 |
7 | from dataclasses import dataclass, field
8 | from fairseq import utils
9 | from fairseq.tasks.translation import TranslationTask
10 | from fairseq.utils import new_arange
11 | from fairseq.tasks import FairseqTask, register_task
12 | from fairseq.tasks.translation import TranslationConfig
13 | from fairseq.data import (
14 | AppendTokenDataset,
15 | ConcatDataset,
16 | PrependTokenDataset,
17 | StripTokenDataset,
18 | TruncateDataset,
19 | data_utils,
20 | indexed_dataset,
21 | )
22 | from fairseq.data import iterators
23 | from .dictionary import Dictionary
24 | from .seq2seq_dataset import LanguagePairDataset
25 | from fairseq.utils import safe_getattr, safe_hasattr
26 |
27 | logger = logging.getLogger(__name__)
28 |
29 |
30 | def load_langpair_dataset(
31 | data_path,
32 | split,
33 | src,
34 | src_dict,
35 | tgt,
36 | tgt_dict,
37 | combine,
38 | dataset_impl,
39 | upsample_primary,
40 | left_pad_source,
41 | left_pad_target,
42 | max_source_positions,
43 | max_target_positions,
44 | prepend_bos=False,
45 | load_alignments=False,
46 | truncate_source=False,
47 | append_source_id=False,
48 | num_buckets=0,
49 | shuffle=True,
50 | pad_to_multiple=1,
51 | prepend_bos_src=None,
52 | ):
53 | def split_exists(split, src, tgt, lang, data_path):
54 | filename = os.path.join(data_path, "{}.{}-{}.{}".format(split, src, tgt, lang))
55 | return indexed_dataset.dataset_exists(filename, impl=dataset_impl)
56 |
57 | src_datasets = []
58 | tgt_datasets = []
59 |
60 | for k in itertools.count():
61 | split_k = split + (str(k) if k > 0 else "")
62 |
63 | # infer langcode
64 | if split_exists(split_k, src, tgt, src, data_path):
65 | prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, src, tgt))
66 | elif split_exists(split_k, tgt, src, src, data_path):
67 | prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, tgt, src))
68 | else:
69 | if k > 0:
70 | break
71 | else:
72 | raise FileNotFoundError(
73 | "Dataset not found: {} ({})".format(split, data_path)
74 | )
75 |
76 | src_dataset = data_utils.load_indexed_dataset(
77 | prefix + src, src_dict, dataset_impl
78 | )
79 | if truncate_source:
80 | src_dataset = AppendTokenDataset(
81 | TruncateDataset(
82 | StripTokenDataset(src_dataset, src_dict.eos()),
83 | max_source_positions - 1,
84 | ),
85 | src_dict.eos(),
86 | )
87 | src_datasets.append(src_dataset)
88 |
89 | tgt_dataset = data_utils.load_indexed_dataset(
90 | prefix + tgt, tgt_dict, dataset_impl
91 | )
92 | if tgt_dataset is not None:
93 | tgt_datasets.append(tgt_dataset)
94 |
95 | logger.info(
96 | "{} {} {}-{} {} examples".format(
97 | data_path, split_k, src, tgt, len(src_datasets[-1])
98 | )
99 | )
100 |
101 | if not combine:
102 | break
103 |
104 | assert len(src_datasets) == len(tgt_datasets) or len(tgt_datasets) == 0
105 |
106 | if len(src_datasets) == 1:
107 | src_dataset = src_datasets[0]
108 | tgt_dataset = tgt_datasets[0] if len(tgt_datasets) > 0 else None
109 | else:
110 | sample_ratios = [1] * len(src_datasets)
111 | sample_ratios[0] = upsample_primary
112 | src_dataset = ConcatDataset(src_datasets, sample_ratios)
113 | if len(tgt_datasets) > 0:
114 | tgt_dataset = ConcatDataset(tgt_datasets, sample_ratios)
115 | else:
116 | tgt_dataset = None
117 |
118 | if prepend_bos:
119 | assert hasattr(src_dict, "bos_index") and hasattr(tgt_dict, "bos_index")
120 | src_dataset = PrependTokenDataset(src_dataset, src_dict.bos())
121 | if tgt_dataset is not None:
122 | tgt_dataset = PrependTokenDataset(tgt_dataset, tgt_dict.bos())
123 | elif prepend_bos_src is not None:
124 | logger.info(f"prepending src bos: {prepend_bos_src}")
125 | src_dataset = PrependTokenDataset(src_dataset, prepend_bos_src)
126 |
127 | eos = None
128 | if append_source_id:
129 | src_dataset = AppendTokenDataset(
130 | src_dataset, src_dict.index("[{}]".format(src))
131 | )
132 | if tgt_dataset is not None:
133 | tgt_dataset = AppendTokenDataset(
134 | tgt_dataset, tgt_dict.index("[{}]".format(tgt))
135 | )
136 | eos = tgt_dict.index("[{}]".format(tgt))
137 |
138 | align_dataset = None
139 | if load_alignments:
140 | align_path = os.path.join(data_path, "{}.align.{}-{}".format(split, src, tgt))
141 | if indexed_dataset.dataset_exists(align_path, impl=dataset_impl):
142 | align_dataset = data_utils.load_indexed_dataset(
143 | align_path, None, dataset_impl
144 | )
145 |
146 | tgt_dataset_sizes = tgt_dataset.sizes if tgt_dataset is not None else None
147 | return LanguagePairDataset(
148 | src_dataset,
149 | src_dataset.sizes,
150 | src_dict,
151 | tgt_dataset,
152 | tgt_dataset_sizes,
153 | tgt_dict,
154 | left_pad_source=left_pad_source,
155 | left_pad_target=left_pad_target,
156 | align_dataset=align_dataset,
157 | eos=eos,
158 | num_buckets=num_buckets,
159 | shuffle=shuffle,
160 | pad_to_multiple=pad_to_multiple,
161 | )
162 |
163 | @dataclass
164 | class FTTaskConfig(TranslationConfig):
165 |
166 | megatron_model: bool = field(
167 | default=False,
168 | metadata={"help": "using megatron-lm to split model"},
169 | )
170 |
171 | data_para: bool = field(
172 | default=False, metadata={"help": "data parallel"},
173 | )
174 |
175 | @register_task("seq2seq_ft_task", dataclass=FTTaskConfig)
176 | class Seq2SeqFineTuningTask(TranslationTask):
177 |
178 | def __init__(self, cfg, src_dict, tgt_dict):
179 | super().__init__(cfg, src_dict, tgt_dict)
180 |
181 | self.data_para = safe_getattr(cfg, "data_para", False)
182 | self.megatron_model = safe_getattr(cfg, "megatron_model", False)
183 |
184 | def build_bpe(self, args):
185 | from sentencepiece import SentencePieceProcessor
186 | model_path = args.sentencepiece_model
187 | self.sp_model = SentencePieceProcessor(model_file=model_path)
188 | return self.sp_model
189 |
190 | @classmethod
191 | def load_dictionary(cls, filename):
192 | if "dict.src.txt" not in filename or "dict.tgt.txt" not in filename:
193 | logger.info("{} does not exist!".format(filename))
194 | filename = "alpaca/scripts/assert/dict.txt"
195 | logger.info("loading common dict {} instead!".format(filename))
196 |
197 | dictionary = Dictionary.load(filename)
198 | dictionary.pad_index = dictionary.add_symbol(dictionary.pad_word)
199 | return dictionary
200 |
201 | def load_dataset(self, split, epoch=1, combine=False, **kwargs):
202 | paths = utils.split_paths(self.cfg.data)
203 | data_path = paths[(epoch - 1) % len(paths)]
204 | src, tgt = self.cfg.source_lang, self.cfg.target_lang
205 |
206 | self.cfg.left_pad_source = False
207 | self.cfg.left_pad_target = False
208 | self.datasets[split] = load_langpair_dataset(
209 | data_path,
210 | split,
211 | src,
212 | self.src_dict,
213 | tgt,
214 | self.tgt_dict,
215 | combine=combine,
216 | dataset_impl=self.cfg.dataset_impl,
217 | upsample_primary=self.cfg.upsample_primary,
218 | left_pad_source=self.cfg.left_pad_source,
219 | left_pad_target=self.cfg.left_pad_target,
220 | max_source_positions=self.cfg.max_source_positions,
221 | max_target_positions=self.cfg.max_target_positions,
222 | truncate_source=self.cfg.truncate_source,
223 | shuffle=(split != "test"),
224 | prepend_bos=True,
225 | )
226 |
227 | def build_dataset_for_inference(self, src_tokens, src_lengths, constraints=None):
228 | return LanguagePairDataset(
229 | src_tokens,
230 | src_lengths,
231 | self.source_dictionary,
232 | tgt_dict=self.target_dictionary,
233 | constraints=constraints,
234 | )
235 |
236 | def build_generator(
237 | self,
238 | models,
239 | args=None,
240 | **kwargs,
241 | ):
242 | from generator.sequence_generator import SequenceGenerator
243 | from generator import search
244 |
245 | if isinstance(kwargs, dict):
246 | if "sampling" in kwargs:
247 | sampling = kwargs["sampling"]
248 | else:
249 | sampling = False
250 | if "sampling_topk" in kwargs:
251 | sampling_topk = kwargs["sampling_topk"]
252 | else:
253 | sampling_topk = -1.0
254 | if "sampling_topp" in kwargs:
255 | sampling_topp = kwargs["sampling_topp"]
256 | else:
257 | sampling_topp = -1.0
258 | else:
259 | sampling = getattr(args, "sampling", False)
260 | sampling_topk = getattr(args, "sampling_topk", -1.0)
261 | sampling_topp = getattr(args, "sampling_topp", -1.0)
262 |
263 | if sampling:
264 | search_strategy = search.Sampling(
265 | self.target_dictionary, sampling_topk, sampling_topp
266 | )
267 | else:
268 | search_strategy = search.BeamSearch(self.target_dictionary)
269 |
270 | extra_gen_cls_kwargs = {}
271 | return SequenceGenerator(
272 | models,
273 | self.target_dictionary,
274 | beam_size=getattr(args, "beam", 5),
275 | max_len_a=getattr(args, "max_len_a", 0),
276 | max_len_b=getattr(args, "max_len_b", 512),
277 | min_len=getattr(args, "min_len", 1),
278 | normalize_scores=(not getattr(args, "unnormalized", False)),
279 | len_penalty=getattr(args, "lenpen", 1),
280 | unk_penalty=getattr(args, "unkpen", 0),
281 | temperature=getattr(args, "temperature", 1.0),
282 | match_source_len=getattr(args, "match_source_len", False),
283 | no_repeat_ngram_size=getattr(args, "no_repeat_ngram_size", 0),
284 | search_strategy=search_strategy,
285 | **extra_gen_cls_kwargs,
286 | )
287 |
288 | def inference_step(
289 | self, generator, models, sample, prefix_tokens=None, constraints=None
290 | ):
291 | with torch.no_grad():
292 | bos_token = sample['net_input']['bos_token']
293 | return generator.generate(
294 | models, sample,
295 | prefix_tokens=prefix_tokens, constraints=constraints,
296 | bos_token=bos_token,
297 | )
298 |
299 | def get_batch_iterator(
300 | self,
301 | dataset,
302 | max_tokens=None,
303 | max_sentences=None,
304 | max_positions=None,
305 | ignore_invalid_inputs=False,
306 | required_batch_size_multiple=1,
307 | seed=1,
308 | num_shards=1,
309 | shard_id=0,
310 | num_workers=0,
311 | epoch=1,
312 | data_buffer_size=0,
313 | disable_iterator_cache=False,
314 | skip_remainder_batch=False,
315 | grouped_shuffling=False,
316 | update_epoch_batch_itr=False,
317 | ):
318 | if not self.data_para:
319 | num_shards = 1
320 | shard_id = 0
321 | return super().get_batch_iterator(
322 | dataset,
323 | max_tokens,
324 | max_sentences,
325 | max_positions,
326 | ignore_invalid_inputs,
327 | required_batch_size_multiple,
328 | seed,
329 | num_shards,
330 | shard_id,
331 | num_workers,
332 | epoch,
333 | data_buffer_size,
334 | disable_iterator_cache,
335 | skip_remainder_batch,
336 | grouped_shuffling,
337 | update_epoch_batch_itr,
338 | )
--------------------------------------------------------------------------------
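`load_langpair_dataset` discovers the binarized split files through `split_exists`, which probes `{split}.{src}-{tgt}.{lang}` under the data directory. As a quick orientation, the sketch below prints the filenames it would look for with `--source-lang src --target-lang tgt`; it is illustrative only, and `data-bin` plus the `.bin`/`.idx` suffixes are assumptions matching fairseq's default mmap dataset layout.

```python
# Illustrative only: filenames that split_exists()/load_indexed_dataset() probe for
# split="train" with src="src", tgt="tgt". "data-bin" and the .bin/.idx suffixes are
# assumptions based on fairseq's default mmap dataset implementation.
import os

data_path, split, src, tgt = "data-bin", "train", "src", "tgt"
prefix = os.path.join(data_path, "{}.{}-{}.".format(split, src, tgt))
for lang in (src, tgt):
    for ext in (".bin", ".idx"):
        print(prefix + lang + ext)   # e.g. data-bin/train.src-tgt.src.bin
```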
/alpaca/src/task/seq2seq_lora_task.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import itertools
3 | import os
4 | import logging
5 | from typing import Dict, Optional
6 |
7 | from dataclasses import dataclass, field
8 | from fairseq.tasks import FairseqTask, register_task
9 | from fairseq.tasks.translation import TranslationConfig
10 | from fairseq.data import iterators
11 | from .seq2seq_ft_task import Seq2SeqFineTuningTask, FTTaskConfig
12 | from fairseq.utils import safe_getattr, safe_hasattr
13 |
14 |
15 | logger = logging.getLogger(__name__)
16 |
17 |
18 | @dataclass
19 | class LoRATaskConfig(FTTaskConfig):
20 |
21 | lora_model_inf: Optional[str] = field(
22 | default="", metadata={"help": "load lora model for inference"},
23 | )
24 |
25 | lora_tuning: bool = field(
26 | default=False, metadata={"help": "if using lora tuning"},
27 | )
28 |
29 |
30 | @register_task("seq2seq_lora_task", dataclass=LoRATaskConfig)
31 | class Seq2SeqLoRATask(Seq2SeqFineTuningTask):
32 |
33 | def __init__(self, cfg, src_dict, tgt_dict):
34 | super().__init__(cfg, src_dict, tgt_dict)
35 |
36 | self.lora_model_inf = safe_getattr(cfg, "lora_model_inf", "")
37 | self.lora_tuning = safe_getattr(cfg, "lora_tuning", False)
38 |
39 | def build_model(self, cfg, from_checkpoint=False):
40 | model = super().build_model(cfg, from_checkpoint)
41 | if len(self.lora_model_inf) > 0:
42 | model.set_lora_model_inf(self.lora_model_inf)
43 | logger.info("Seq2SeqLoRATask loaded inference checkpoint from {}".format(self.lora_model_inf))
44 | return model
45 |
46 |
--------------------------------------------------------------------------------
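The LoRA task only adds two switches on top of the fine-tuning task: `lora_tuning` turns adaptation on, and `lora_model_inf` points `build_model` at a LoRA checkpoint to load for inference. A hedged sketch of inspecting those fields programmatically follows; the field names come from the dataclass above, while instantiating the config with defaults assumes fairseq is installed and `alpaca/src` is on `sys.path`.

```python
# Sketch only: peek at the two LoRA-specific fields of the task config.
# Assumes fairseq is installed and alpaca/src is on sys.path.
from task.seq2seq_lora_task import LoRATaskConfig

cfg = LoRATaskConfig()
print(cfg.lora_tuning)      # False -> plain fine-tuning unless enabled
print(cfg.lora_model_inf)   # ""    -> no LoRA checkpoint loaded at inference

# In fairseq's CLI these surface as --lora-tuning and --lora-model-inf.
cfg = LoRATaskConfig(lora_tuning=True, lora_model_inf="/path/to/lora_checkpoint.pt")
```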
/alpaca/src/webapp.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Facebook, Inc. and its affiliates.
2 | #
3 | # This source code is licensed under the MIT license found in the
4 | # LICENSE file in the root directory of this source tree.
5 |
6 | import torch
7 | from model.llama_model import LLaMA
8 | import argparse
9 | import gradio as gr
10 |
11 |
12 |
13 | def sample_demo(alpaca):
14 |
15 | @torch.no_grad()
16 | def process(prompt):
17 | prompt_text = "## Instruction:\n{}\n\n## Response:".format(prompt)
18 | print("Received:\n", prompt_text)
19 | eval_kwargs = dict(beam=1, sampling=True, sampling_topp=0.95, temperature=0.8, min_len=512)
20 | prompts = [prompt_text]
21 | results = alpaca.sample(prompts, **eval_kwargs)[0]
22 | print("Generated:\n", results[0])
23 | return str(results[0])
24 |
25 | demo = gr.Interface(
26 | title = "Efficient Alpaca",
27 | thumbnail = "https://github.com/dropreg/efficient_alpaca/blob/main/efficient_alpaca_logo.PNG",
28 | fn = process,
29 | inputs = gr.Textbox(lines=10, placeholder="Your prompt here..."),
30 | outputs = "text",
31 | )
32 |
33 | demo.launch(share=True)
34 |
35 | def demo(alpaca):
36 |
37 | @torch.no_grad()
38 | def process(prompt, temperature, topp):
39 | prompt_text = "## Instruction:\n{}\n\n## Response:".format(prompt)
40 | print("Received:\n", prompt_text)
41 | eval_kwargs = dict(sampling=True, sampling_topp=topp, temperature=temperature)
42 | prompts = [prompt_text]
43 | results = alpaca.sample(prompts, **eval_kwargs)[0]
44 | print("Generated:\n", results[0])
45 | return str(results[0])
46 |
47 | with gr.Blocks() as demo:
48 | gr.Markdown(
49 | """
50 |
51 |
52 |
53 | """)
54 |
55 | with gr.Row():
56 | with gr.Column():
57 | model_input = gr.Textbox(lines=15, placeholder='Input something', label='Input')
58 | with gr.Row():
59 | gen = gr.Button("Generate")
60 | clr = gr.Button("Clear")
61 |
62 | outputs = gr.Textbox(lines=15, label='Output')
63 |
64 | gr.Markdown(
65 | """
66 | Generation Parameters
67 | """)
68 | with gr.Row():
69 | with gr.Column():
70 | temperature = gr.Slider(maximum=1, value=0.8, minimum=0, label='Temperature')
71 | topp = gr.Slider(maximum=1, value=0.95, minimum=0, label='Top P')
72 |
73 | inputs = [model_input, temperature, topp]
74 | gen.click(fn=process, inputs=inputs, outputs=outputs)
75 | clr.click(fn=lambda value: gr.update(value=""), inputs=clr, outputs=model_input)
76 |
77 | gr.Markdown(
78 | """
79 | Our project can be found from [Efficient Alpaca](https://github.com/dropreg/efficient_alpaca)
80 | """)
81 |
82 | demo.launch(share=True)
83 |
84 |
85 | if __name__ == "__main__":
86 |
87 | parser = argparse.ArgumentParser()
88 | parser.add_argument(
89 | "--model-dir",
90 | required=True,
91 | type=str,
92 | default="alpaca_lora",
93 | help="path containing model file and src_dict.txt",
94 | )
95 | parser.add_argument(
96 | "--model-file",
97 | default="checkpoint_best.pt",
98 | help="where in model_dir are weights saved",
99 | )
100 | parser.add_argument(
101 | "--lora-model-inf",
102 | default="",
103 | help="path to the LoRA checkpoint to load for inference",
104 | )
105 | parser.add_argument(
106 | "--lora-tuning",
107 | action="store_true",
108 | default=False,
109 | help="enable LoRA tuning when loading the model",
110 | )
111 |
112 | parser.add_argument("--bpe")
113 | parser.add_argument("--sentencepiece-model")
114 | args = parser.parse_args()
115 |
116 | kwargs = {
117 | "user_dir": "alpaca/src",
118 | "lora_model_inf": args.lora_model_inf,
119 | "bpe": args.bpe,
120 | "sentencepiece_model": args.sentencepiece_model,
121 | "source_lang": 'src',
122 | "target_lang": 'tgt',
123 | "lora_tuning": args.lora_tuning,
124 | "task": "seq2seq_lora_task",
125 | }
126 | alpaca = LLaMA.from_pretrained(
127 | model_name_or_path=args.model_dir,
128 | checkpoint_file=args.model_file,
129 | **kwargs,
130 | )
131 |
132 | alpaca = alpaca.eval()
133 | if torch.cuda.is_available():
134 | alpaca = alpaca.half().cuda()
135 |
136 | demo(alpaca)
137 |
--------------------------------------------------------------------------------
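The same hub interface can be driven without Gradio; a minimal sketch is below. All paths (model directory, sentencepiece model, LoRA checkpoint) are placeholders to replace with your own artifacts, `bpe="sentencepiece"` is an assumption about what the run scripts pass, and the keyword arguments simply mirror the ones `webapp.py` assembles above.

```python
# Minimal sketch mirroring webapp.py without the Gradio UI. All paths are
# placeholders; run with alpaca/src on sys.path (as webapp.py does).
import torch
from model.llama_model import LLaMA

alpaca = LLaMA.from_pretrained(
    model_name_or_path="/path/to/model_dir",         # directory holding checkpoint and dict
    checkpoint_file="checkpoint_best.pt",
    user_dir="alpaca/src",
    task="seq2seq_lora_task",
    source_lang="src",
    target_lang="tgt",
    bpe="sentencepiece",                             # assumption
    sentencepiece_model="/path/to/tokenizer.model",  # placeholder
    lora_tuning=True,
    lora_model_inf="/path/to/lora_checkpoint.pt",    # placeholder; omit for full fine-tuning
)

alpaca = alpaca.eval()
if torch.cuda.is_available():
    alpaca = alpaca.half().cuda()

prompt = "## Instruction:\n{}\n\n## Response:".format("Tell me about alpacas.")
results = alpaca.sample([prompt], sampling=True, sampling_topp=0.95, temperature=0.8)
print(results[0][0])   # first hypothesis for the first (and only) prompt
```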
/efficient_alpaca_logo.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/efficient_alpaca_logo.PNG
--------------------------------------------------------------------------------
/efficient_alpaca_logo_old.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/efficient_alpaca_logo_old.PNG
--------------------------------------------------------------------------------
/webapp.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/webapp.PNG
--------------------------------------------------------------------------------