├── README.md ├── README_zh.md ├── alpaca ├── scripts │ ├── assert │ │ ├── dict.txt │ │ └── test.src │ ├── fsdp │ │ ├── README.md │ │ ├── inference │ │ │ ├── run_inf.sh │ │ │ ├── run_inf_hub.sh │ │ │ └── run_webapp.sh │ │ ├── run_train.sh │ │ ├── run_train_belle.sh │ │ └── run_train_cpu_offload.sh │ ├── lora │ │ ├── README.md │ │ ├── inference │ │ │ ├── run_inf.sh │ │ │ ├── run_inf_hub.sh │ │ │ └── run_webapp.sh │ │ └── run_train.sh │ ├── megatron │ │ ├── README.md │ │ ├── inference │ │ │ └── run_inf_megatron.sh │ │ └── run_train_megatron.sh │ ├── megatron_lora │ │ ├── README.md │ │ ├── inference │ │ │ └── run_inf_megatron_lora.sh │ │ └── run_train_megatron_lora.sh │ └── utils │ │ ├── README.md │ │ ├── convert_llama_to_half.py │ │ ├── merge_llama_megatron_ckpt.py │ │ ├── prepare_inf_data.sh │ │ ├── prepare_llama_belle_data.sh │ │ ├── prepare_llama_training_data.sh │ │ ├── prepare_utils.py │ │ ├── process_llama_ckpt.py │ │ └── process_llama_megatron_ckpt.py └── src │ ├── __init__.py │ ├── __pycache__ │ ├── __init__.cpython-37.pyc │ ├── megatron_trainer.cpython-37.pyc │ ├── trainer.cpython-37.pyc │ └── utils.cpython-37.pyc │ ├── fsdp │ ├── __pycache__ │ │ ├── cpu_adam.cpython-37.pyc │ │ └── fully_sharded_data_parallel.cpython-37.pyc │ ├── cpu_adam.py │ └── fully_sharded_data_parallel.py │ ├── generate.py │ ├── generator │ ├── __pycache__ │ │ ├── search.cpython-37.pyc │ │ └── sequence_generator.cpython-37.pyc │ ├── search.py │ └── sequence_generator.py │ ├── inference.py │ ├── loss │ ├── __pycache__ │ │ └── lm_loss.cpython-37.pyc │ └── lm_loss.py │ ├── megatron_trainer.py │ ├── model │ ├── __pycache__ │ │ ├── hub_interface.cpython-37.pyc │ │ ├── llama_megatron_transformer.cpython-37.pyc │ │ ├── llama_model.cpython-37.pyc │ │ ├── llama_transformer.cpython-37.pyc │ │ └── lora_modules.cpython-37.pyc │ ├── hub_interface.py │ ├── llama_megatron_transformer.py │ ├── llama_model.py │ ├── llama_transformer.py │ └── lora_modules.py │ ├── preprocess.py │ ├── task │ ├── __pycache__ │ │ ├── dictionary.cpython-37.pyc │ │ ├── seq2seq_dataset.cpython-37.pyc │ │ ├── seq2seq_ft_task.cpython-37.pyc │ │ └── seq2seq_lora_task.cpython-37.pyc │ ├── dictionary.py │ ├── seq2seq_dataset.py │ ├── seq2seq_ft_task.py │ └── seq2seq_lora_task.py │ ├── train_fsdp.py │ ├── train_lora.py │ ├── train_megatron.py │ ├── trainer.py │ ├── utils.py │ └── webapp.py ├── efficient_alpaca_logo.PNG ├── efficient_alpaca_logo_old.PNG └── webapp.PNG /README.md: -------------------------------------------------------------------------------- 1 |

2 | Efficient-Alpaca 3 |

4 | 5 |

6 |

Efficient Alpaca

7 |

8 | 9 |

10 |

11 | English | 中文 12 |

13 |

14 | 15 | 16 | The aim of Efficient Alpaca is to use LLaMA to build and enhance LLM-based chatbots, including but not limited to **reducing resource consumption (GPU memory and training time)**, **improving inference speed**, and **making the toolkit easier for researchers to use** (especially for fairseq users). This project will be constantly updated and maintained. Please feel free to use it! 17 | 18 | 19 | **************************** Updates **************************** 20 | - 4/5 We support Fine-tuning with FSDP, which reduces GPU memory usage at the cost of extra RAM! 21 | - 3/17 We support model parallelism via Megatron-LM to reduce GPU memory! 22 | - 3/15 We support LoRA (Efficient-Finetuning) to reproduce Stanford Alpaca! 23 | 24 | 25 | ## Supported Inference Devices 26 | 27 | We can choose any of the following devices for inference, even a 12G 1080. 28 | 29 | | Method | Support | Device | GPU | Inference Speed | 30 | | -------- | ------------- | ---------- | ------- | ------------------- | 31 | | Original | all | 1 24G 3090 | 14G | | 32 | | Megatron | megatron_lora | 2 12G 1080 | 8G | | 33 | 34 | ## Supported Training Methods and Devices 35 | 36 | We can choose among the following methods and device combinations. For example, if you have 2 24G 3090 GPUs and plenty of RAM, you have two options: 1. use Megatron-LM for Efficient-Finetuning (which does not need much extra RAM); 2. use FSDP for Fine-tuning (which needs a large amount of extra RAM). The corresponding launch commands are shown after the table. 37 | 38 | | Method | Type | Support | Data Para | Model Para | Device | GPU | Memory Limit | Training Speed | 39 | | ------------- | --------------------- | ------------- | --------- | ---------- | ---------- | ------- | ------------- | ------------------- | 40 | | LoRA | Efficient Fine-tuning | lora | ✓ | ✓ | 1 40G A100 | 30G | No | 90 sec / 100 step | 41 | | Megatron-LoRA | Efficient Fine-tuning | megatron_lora | ✗ | ✓ | 2 24G 3090 | 21G | No | 190 sec / 100 step | 42 | | FSDP | Fine-tuning | fsdp | ✓ | ✓ | 1 40G A100 | 32G | 128G + | 1600 sec / 100 step | 43 | | | | | | | 8 40G A100 | 32G | No | 400 sec / 100 step | 44 | | | | | | | 2 24G 3090 | 13G | 128G + | 900 sec / 100 step | 45 | | | | | | | 8 24G 3090 | 22G | 128G + | 800 sec / 100 step | 46 | | Megatron | Fine-tuning | megatron | ✗ | ✓ | 4 40G A100 | 25G | No | 130 sec / 100 step | 47 | | | | | | | 8 24G 3090 | 14G | No | 130 sec / 100 step | 48 | 49 |
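As a concrete illustration of the two options above, both launch scripts ship with this repository (see each method's README for the full configuration):

```
# Option 1: Efficient-Finetuning with Megatron-LM + LoRA on 2 GPUs
bash alpaca/scripts/megatron_lora/run_train_megatron_lora.sh

# Option 2: Fine-tuning with FSDP and CPU offload (uses a large amount of extra RAM)
bash alpaca/scripts/fsdp/run_train_cpu_offload.sh
```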
Some Explanations about the Support Table.

50 | 51 | All evaluations used the hyper-parameter --max-tokens 2048. 52 | 53 | * Data Para: whether data parallelism is supported. 54 | * Model Para: whether model parallelism is supported. 55 | * GPU: approximate GPU memory usage on each device during training. 56 | * Memory Limit: approximate RAM usage during training. 57 | * Training Speed: reflects per-step training speed rather than total training time; data parallelism shortens training time but does not change the per-step speed. 58 | 59 |

60 | 61 | ## Web Interface 62 | 63 | We support a web interface built with [Gradio](https://gradio.app/). 64 | 65 | ``` 66 | bash alpaca/scripts/lora/inference/run_webapp.sh 67 | ``` 68 | 69 |
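If you want to adapt the demo to your own task, the web app is essentially a thin Gradio wrapper around the model's generate call. The sketch below is illustrative only and is not the actual `alpaca/src/webapp.py`; the `generate` function is a hypothetical stand-in for the model call:

```
# Illustrative sketch only; not the repository's webapp.py.
import gradio as gr

def generate(prompt: str) -> str:
    # Hypothetical placeholder: call your fine-tuned LLaMA checkpoint here
    # (the real demo loads it inside alpaca/src/webapp.py).
    return "model response for: " + prompt

demo = gr.Interface(fn=generate, inputs="text", outputs="text", title="Efficient Alpaca")
demo.launch(server_name="0.0.0.0", server_port=7860)
```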

70 | Examples 71 |

72 | 73 | ## Setup 74 | Ensure the pytorch and cuda environment available, and install fllowing dependences: 75 | 76 | ``` 77 | pip install fairseq 78 | pip install fairscale 79 | ``` 80 | 81 | We have to install sentencepiece from [official repo](https://github.com/google/sentencepiece) to process data or hack for your specific task. 82 | 83 | ``` 84 | git clone https://github.com/google/sentencepiece.git 85 | cd sentencepiece 86 | mkdir build 87 | cd build 88 | cmake .. 89 | make -j $(nproc) 90 | make install 91 | ``` 92 | 93 | ## Prepare Model and Data 94 | 95 | + [Data and Model Preprocess](alpaca/scripts/utils/README.md) 96 | 97 | ## Training Step: 98 | 99 | Efficient-Finetuning 100 | + [LoRA](alpaca/scripts/lora/README.md) 101 | + [Megatron + LoRA](alpaca/scripts/megatron_lora/README.md) 102 | 103 | Fine-tuning 104 | + [Megatron](alpaca/scripts/megatron/README.md) 105 | + [Fully Sharded Data Parallel](alpaca/scripts/fsdp/README.md) 106 | 107 | 108 | ## Some Case Sampled by Our Model: 109 | 110 | 1. **Question**: what is the capital of china? 111 | 112 | **Answer**: Beijing is the capital of China. 113 | 114 | 2. **Question**: please give a pieces of python code about socket connection. 115 | 116 | **Answer**: 117 | ``` 118 | import socket 119 | 120 | s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 121 | print("Constructing a socket connection") 122 | print(s) 123 | 124 | s.bind(('', 5000)) # Bind the socket to the '(' and 5000' 125 | 126 | print("Listening for connections") 127 | while True: 128 | connection, address = s.accept() 129 | print("Got connection from", address) 130 | print("Connected!") 131 | connection.close() # Close the connection. 132 | 133 | print("Exiting the program") 134 | ``` 135 | 136 | 3. **Question**: 请讲一个武侠故事. 137 | 138 | **Answer**: 139 | The legendary warrior Guan Yu Once upon a time, there was a legendary warrior named Guan Yu. He was a highly skilled fighter and commander and was known for his great bravery and loyalty. He was the loyal servant of Emperor Liu Bei, and fought in the Three Kingdoms civil war alongside him. He was brave enough to challenge the powerful enemy warriors, and eventually helped Liu Bei's army defeat the Wei Kingdom, the strongest of the three kingdoms. Guan Yu was a great warrior and a hero in Chinese history. 140 | 141 | 4. **Question**: what is an alpaca? how is it different form a llama? 142 | 143 | **Answer**: 144 | Alpacas and llamas are both members of the Camelidae family. They are both herbivorous mammals that are domesticated for their coats, but they are distinctly different species. Llamas are generally larger than alpacas and have more pronounced spiral horns, while alpacas have smaller horns and more delicate coats. Alpacas also have smaller teeth and a narrower neck than llamas, and they are able to reproduced year-round. Llamas are also unable to reproduce during the winter months and have a more social behavior than alpacas. 145 | 146 | 5. **Question**: write a e-mail to congratulate Lee and mention that you are exited about meeting all of them in person. 147 | 148 | **Answer**: 149 | ``` 150 | Dear Lee, 151 | 152 | I want to congratulate you on your success and I'm looking forward to meeting you and the other team members in person. 153 | It's exciting to see the progress you've made and I'm sure the hard work you have put in has paid off. 154 | I'm very proud of you and I'm looking forward to meeting you in person and seeing all of your progress. 
155 | 156 | Sincerely, 157 | [Your Name] 158 | ``` 159 | 160 | 5. **Question**: write a well-thought our abstract for a machine learning parper that proves that 42 is the optimal seed for training neural network. 161 | 162 | **Answer**: 163 | This paper investigates the effect of different seeds on the performance of neural networks. Through experiments, we demonstrate that the optimal seed is 42, which leads to the best performance on a wide range of tasks. We analyze the effects of different seeds on the model's accuracy, robustness, and other metrics. Our results show that 42 is the best seed for neural networks, and we discuss the implications of this finding. 164 | 165 | 166 | ## Reference 167 | 168 | 169 | 170 | Fairseq: a sequence modeling toolkit \[[github](https://github.com/facebookresearch/fairseq)\] 171 | ``` 172 | @inproceedings{ott2019fairseq, 173 | title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling}, 174 | author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli}, 175 | booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations}, 176 | year = {2019}, 177 | } 178 | ``` 179 | 180 | FairScale: is a PyTorch extension library for high performance and large scale training. \[[github](https://github.com/facebookresearch/fairscale)\] 181 | ``` 182 | @Misc{FairScale2021, 183 | author = {{FairScale authors}}, 184 | title = {FairScale: A general purpose modular PyTorch library for high performance and large scale training}, 185 | howpublished = {\url{https://github.com/facebookresearch/fairscale}}, 186 | year = {2021} 187 | } 188 | ``` 189 | 190 | LLaMA: Open and Efficient Foundation Language Models \[[paper](https://arxiv.org/abs/2302.13971)\]\[[github](https://github.com/facebookresearch/llama)\] 191 | 192 | ``` 193 | @article{touvron2023llama, 194 | title={LLaMA: Open and Efficient Foundation Language Models}, 195 | author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume}, 196 | journal={arXiv preprint arXiv:2302.13971}, 197 | year={2023} 198 | } 199 | ``` 200 | 201 | Stanford Alpaca: An Instruction-following LLaMA model \[[github](https://github.com/tatsu-lab/stanford_alpaca)\] 202 | 203 | ``` 204 | @misc{alpaca, 205 | author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto }, 206 | title = {Stanford Alpaca: An Instruction-following LLaMA model}, 207 | year = {2023}, 208 | publisher = {GitHub}, 209 | journal = {GitHub repository}, 210 | howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}}, 211 | } -------------------------------------------------------------------------------- /README_zh.md: -------------------------------------------------------------------------------- 1 |

2 | Efficient-Alpaca 3 |

4 | 5 | 6 |

7 |

Efficient Alpaca

8 |

9 | 10 |

11 |

12 | English | 中文 13 |

14 |

15 | 16 | Efficient Alpaca 的目的是为了方便构建或者增强基于 LLMs 的 Chatbots,其功能包括但不限于 **减少资源使用 (GPU 显存,训练时间)**,**推理速度**,**方便开发者使用(尤其是熟悉 Fairseq 的用户)**。项目会持续更新,欢迎使用! 17 | 18 | **************************** 更新记录 **************************** 19 | - 4/5 我们支持 FSDP 进行 Fine-tuning,可以使用额外内存来减少 GPU 显存占用! 20 | - 3/17 我们支持使用 Megatron-LM 来减少 GPU 显存,包括 Fine-tuning 和 Efficient-finetuning ! 21 | - 3/15 我们支持使用 LoRA 来进行 Efficient-finetuning 来复现 Stanford Alpaca ! 22 | 23 | 24 | ## 可供选择的推理设备 25 | 26 | 你可以选择下面的任意设备来支持推理,即使是 12G 的 1080。 27 | 28 | | Method | Support | Device | GPU | Inference Speed | 29 | | -------- | ------------- | ---------- | ------- | ------------------- | 30 | | Original | all | 1 24G 3090 | 14G | | 31 | | Megatron | megatron_lora | 2 12G 1080 | 8G | | 32 | 33 | ## 可供选择的训练方法和设备 34 | 35 | 你可以根据下标的组合来选择可用方法:例如,我有两块3090和大量的内存,这个时候你可以有两种选择:1. 使用Megatron-LM 来进行 Efficient-Finetuning(不会使用大量内存)。 2. 使用 FSDP 来进行 Fine-tuning (会使用额外大量内存)。 36 | 37 | | Method | Type | Support | Data Para | Model Para | Device | GPU | Memory Limit | Training Speed | 38 | | ------------- | --------------------- | ------------- | --------- | ---------- | ---------- | ------- | ------------- | ------------------- | 39 | | LoRA | Efficient Fine-tuning | lora | ✓ | ✓ | 1 40G A100 | 30G | No | 90 sec / 100 step | 40 | | Megatron-LoRA | Efficient Fine-tuning | megatron_lora | ✗ | ✓ | 2 24G 3090 | 21G | No | 190 sec / 100 step | 41 | | FSDP | Fine-tuning | fsdp | ✓ | ✓ | 1 40G A100 | 32G | 128G + | 1600 sec / 100 step | 42 | | | | | | | 8 40G A100 | 32G | No | 400 sec / 100 step | 43 | | | | | | | 2 24G 3090 | 13G | 128G + | 900 sec / 100 step | 44 | | | | | | | 8 24G 3090 | 22G | 128G + | 800 sec / 100 step | 45 | | Megatron | Fine-tuning | megatron | ✗ | ✓ | 4 40G A100 | 25G | No | 130 sec / 100 step | 46 | | | | | | | 8 24G 3090 | 14G | No | 130 sec / 100 step | 47 | 48 | 49 |
关于表格的解释

50 | 51 | 以上所有的实验都是使用 --max-tokens 2048 这一参数进行测试. 52 | 53 | * Data Para: 是否支持数据并行. 54 | * Model Para: 是否支持模型并行. 55 | * GPU: 在训练中实际大概使用的 GPU 显存. 56 | * Memory Limit: 内存限制,只是大概的测试,并不代表实际情况. 57 | * Training Speed: 仅仅代表训练速度而非训练时间,比如数据并行可以加快训练时间,但是并不会加快训练速度。 58 | 59 |

60 | 61 | 62 | ## 进行演示 63 | 64 | 我们支持使用 [Gradio](https://gradio.app/) 来进行演示。 65 | 66 | ``` 67 | bash alpaca_lora/scripts/run_webapp.sh 68 | ``` 69 | 70 |

71 | Examples 72 |

73 | 74 | 75 | ## 环境依赖 76 | 请确保你有正常的CUDA环境,并安装以下依赖: 77 | 78 | ``` 79 | pip install fairseq 80 | pip install fairscale 81 | ``` 82 | 83 | 如果要训练模型,请安装 sentencepiece [official repo](https://github.com/google/sentencepiece) 来处理数据。 84 | 85 | ``` 86 | git clone https://github.com/google/sentencepiece.git 87 | cd sentencepiece 88 | mkdir build 89 | cd build 90 | cmake .. 91 | make -j $(nproc) 92 | make install 93 | ``` 94 | 95 | ## 数据和模型处理 96 | 97 | + [Data and Model Preprocess](alpaca/scripts/utils/README.md) 98 | 99 | ## 训练模型 100 | 101 | Efficient-Finetuning 102 | + [LoRA](alpaca/scripts/lora/README.md) 103 | + [Megatron + LoRA](alpaca/scripts/megatron_lora/README.md) 104 | 105 | Fine-tuning 106 | + [Megatron](alpaca/scripts/megatron/README.md) 107 | + [Fully Sharded Data Parallel](alpaca/scripts/fsdp/README.md) 108 | 109 | ## 模型输出示例: 110 | 111 | 1. **Question**: what is the capital of china? 112 | 113 | **Answer**: Beijing is the capital of China. 114 | 115 | 2. **Question**: please give a pieces of python code about socket connection. 116 | 117 | **Answer**: 118 | ``` 119 | import socket 120 | 121 | s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 122 | print("Constructing a socket connection") 123 | print(s) 124 | 125 | s.bind(('', 5000)) # Bind the socket to the '(' and 5000' 126 | 127 | print("Listening for connections") 128 | while True: 129 | connection, address = s.accept() 130 | print("Got connection from", address) 131 | print("Connected!") 132 | connection.close() # Close the connection. 133 | 134 | print("Exiting the program") 135 | ``` 136 | 137 | 3. **Question**: 请讲一个武侠故事. 138 | 139 | **Answer**: 140 | The legendary warrior Guan Yu Once upon a time, there was a legendary warrior named Guan Yu. He was a highly skilled fighter and commander and was known for his great bravery and loyalty. He was the loyal servant of Emperor Liu Bei, and fought in the Three Kingdoms civil war alongside him. He was brave enough to challenge the powerful enemy warriors, and eventually helped Liu Bei's army defeat the Wei Kingdom, the strongest of the three kingdoms. Guan Yu was a great warrior and a hero in Chinese history. 141 | 142 | 4. **Question**: what is an alpaca? how is it different form a llama? 143 | 144 | **Answer**: 145 | Alpacas and llamas are both members of the Camelidae family. They are both herbivorous mammals that are domesticated for their coats, but they are distinctly different species. Llamas are generally larger than alpacas and have more pronounced spiral horns, while alpacas have smaller horns and more delicate coats. Alpacas also have smaller teeth and a narrower neck than llamas, and they are able to reproduced year-round. Llamas are also unable to reproduce during the winter months and have a more social behavior than alpacas. 146 | 147 | 5. **Question**: write a e-mail to congratulate Lee and mention that you are exited about meeting all of them in person. 148 | 149 | **Answer**: 150 | ``` 151 | Dear Lee, 152 | 153 | I want to congratulate you on your success and I'm looking forward to meeting you and the other team members in person. 154 | It's exciting to see the progress you've made and I'm sure the hard work you have put in has paid off. 155 | I'm very proud of you and I'm looking forward to meeting you in person and seeing all of your progress. 156 | 157 | Sincerely, 158 | [Your Name] 159 | ``` 160 | 161 | 5. **Question**: write a well-thought our abstract for a machine learning parper that proves that 42 is the optimal seed for training neural network. 
162 | 163 | **Answer**: 164 | This paper investigates the effect of different seeds on the performance of neural networks. Through experiments, we demonstrate that the optimal seed is 42, which leads to the best performance on a wide range of tasks. We analyze the effects of different seeds on the model's accuracy, robustness, and other metrics. Our results show that 42 is the best seed for neural networks, and we discuss the implications of this finding. 165 | 166 | 167 | ## Reference 168 | 169 | In addition, thanks to the open source projects and communities that this project relies on, including but not limited to the following: 170 | 171 | Fairseq: a sequence modeling toolkit \[[github](https://github.com/facebookresearch/fairseq)\] 172 | ``` 173 | @inproceedings{ott2019fairseq, 174 | title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling}, 175 | author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli}, 176 | booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations}, 177 | year = {2019}, 178 | } 179 | ``` 180 | 181 | FairScale: is a PyTorch extension library for high performance and large scale training. \[[github](https://github.com/facebookresearch/fairscale)\] 182 | ``` 183 | @Misc{FairScale2021, 184 | author = {{FairScale authors}}, 185 | title = {FairScale: A general purpose modular PyTorch library for high performance and large scale training}, 186 | howpublished = {\url{https://github.com/facebookresearch/fairscale}}, 187 | year = {2021} 188 | } 189 | ``` 190 | 191 | LLaMA: Open and Efficient Foundation Language Models \[[paper](https://arxiv.org/abs/2302.13971)\]\[[github](https://github.com/facebookresearch/llama)\] 192 | 193 | ``` 194 | @article{touvron2023llama, 195 | title={LLaMA: Open and Efficient Foundation Language Models}, 196 | author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume}, 197 | journal={arXiv preprint arXiv:2302.13971}, 198 | year={2023} 199 | } 200 | ``` 201 | 202 | Stanford Alpaca: An Instruction-following LLaMA model \[[github](https://github.com/tatsu-lab/stanford_alpaca)\] 203 | 204 | ``` 205 | @misc{alpaca, 206 | author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto }, 207 | title = {Stanford Alpaca: An Instruction-following LLaMA model}, 208 | year = {2023}, 209 | publisher = {GitHub}, 210 | journal = {GitHub repository}, 211 | howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}}, 212 | } -------------------------------------------------------------------------------- /alpaca/scripts/assert/test.src: -------------------------------------------------------------------------------- 1 | tell a story about Sleeping Beauty, please. 2 | write a e-mail to congratulate Lee and mention that you are exited about meeting all of them in person. 3 | what is an alpaca? how is it different form a llama? 4 | what is the capital of Tanzania? 5 | what is the capital of china? 6 | write a well-thought our abstract for a machine learning parper that proves that 42 is the optimal seed for training neural network. 7 | 请将一个武侠故事. 8 | please give a pieces of python code about socket connection. 
9 | -------------------------------------------------------------------------------- /alpaca/scripts/fsdp/README.md: -------------------------------------------------------------------------------- 1 | # Fully Sharded Data Parallel 2 | 3 | FSDP: [Fully Sharded Data Parallel: faster AI training with fewer GPUs](https://engineering.fb.com/2021/07/15/open-source/fsdp/) 4 | 5 | ## Training Step 6 | 7 | ``` 8 | bash alpaca/scripts/fsdp/run_train.sh 9 | ``` 10 | 11 | ``` 12 | bash alpaca/scripts/fsdp/run_train_cpu_offload.sh 13 | ``` 14 | 15 | ## Inference Step 16 | 17 | + (Batch-Level) Please prepare the test file. 18 | 19 | ``` 20 | bash alpaca/scripts/fsdp/inference/run_inf.sh 21 | ``` 22 | 23 | + (Instance-Level) Using alpaca/src/inference.py line 17 to set prompts. 24 | 25 | ``` 26 | bash alpaca/scripts/fsdp/inference/run_inf_hub.sh 27 | ``` -------------------------------------------------------------------------------- /alpaca/scripts/fsdp/inference/run_inf.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export CUDA_VISIBLE_DEVICES=0 6 | 7 | data_dir=/opt/data/private/data/llama/llama_instruction/inf/data-bin/ 8 | llama_dir=/opt/data/private/ckpt/alpaca/fsdp/ 9 | bpe_dir=/opt/data/private/data/llama/tokenizer.model 10 | 11 | 12 | python alpaca/src/generate.py $data_dir \ 13 | --user-dir alpaca/src \ 14 | --task seq2seq_ft_task \ 15 | --arch llama_7b \ 16 | -s $src -t $tgt \ 17 | --gen-subset test \ 18 | --bpe 'sentencepiece' --sentencepiece-model $bpe_dir \ 19 | --path $llama_dir/checkpoint1.pt \ 20 | --required-batch-size-multiple 1 \ 21 | --batch-size 1 \ 22 | --beam 1 --sampling --sampling-topp 0.95 --temperature 0.8 \ 23 | -------------------------------------------------------------------------------- /alpaca/scripts/fsdp/inference/run_inf_hub.sh: -------------------------------------------------------------------------------- 1 | 2 | export CUDA_VISIBLE_DEVICES=0 3 | 4 | python alpaca/src/inference.py \ 5 | --model-dir /opt/data/private/ckpt/alpaca/fsdp/ \ 6 | --model-file checkpoint1.pt \ 7 | --bpe sentencepiece \ 8 | --sentencepiece-model /opt/data/private/data/llama/tokenizer.model \ 9 | -------------------------------------------------------------------------------- /alpaca/scripts/fsdp/inference/run_webapp.sh: -------------------------------------------------------------------------------- 1 | 2 | export CUDA_VISIBLE_DEVICES=0 3 | 4 | python alpaca/src/webapp.py \ 5 | --model-dir /opt/data/private/ckpt/alpaca/fsdp/ \ 6 | --model-file checkpoint1.pt \ 7 | --bpe sentencepiece \ 8 | --sentencepiece-model /opt/data/private/data/llama/tokenizer.model \ 9 | -------------------------------------------------------------------------------- /alpaca/scripts/fsdp/run_train.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 6 | export CUDA_LAUNCH_BLOCKING=1 7 | 8 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin 9 | save_dir=/opt/data/private/ckpt/alpaca/fsdp/ 10 | llama_dir=/opt/data/private/data/llama/7B/model_no_pad.pt 11 | max_token=2048 12 | 13 | 14 | python alpaca/src/train_fsdp.py $data_dir \ 15 | --reset-optimizer --reset-dataloader --reset-meters \ 16 | --restore-file $llama_dir \ 17 | --user-dir alpaca/src \ 18 | --ddp-backend fully_sharded \ 19 | --fp16 --fp16-init-scale 4 \ 20 | --checkpoint-activations \ 21 | --no-reshard-after-forward \ 22 | 
--no-save-optimizer-state \ 23 | --max-target-positions 2048 \ 24 | --task seq2seq_ft_task \ 25 | --arch llama_7b \ 26 | --data-para \ 27 | --criterion lm_loss \ 28 | -s $src -t $tgt \ 29 | --max-tokens $max_token \ 30 | --optimizer adam --adam-betas "(0.9, 0.98)" \ 31 | --lr-scheduler polynomial_decay --lr 2e-5 \ 32 | --weight-decay 0.0 \ 33 | --total-num-update 2000 --warmup-updates 100 \ 34 | --max-epoch 3 \ 35 | --no-progress-bar \ 36 | --log-interval 10 \ 37 | --save-dir $save_dir | tee -a $save_dir/train.log \ 38 | -------------------------------------------------------------------------------- /alpaca/scripts/fsdp/run_train_belle.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 6 | export CUDA_LAUNCH_BLOCKING=1 7 | 8 | data_dir=/opt/data/private/data/llama/belle_1m/data-bin 9 | save_dir=/opt/data/private/ckpt/alpaca/fsdp_belle/ 10 | llama_dir=/opt/data/private/data/llama/7B/model_no_pad.pt 11 | max_token=2048 12 | 13 | 14 | python alpaca/src/train_fsdp.py $data_dir \ 15 | --reset-optimizer --reset-dataloader --reset-meters \ 16 | --restore-file $llama_dir \ 17 | --user-dir alpaca/src \ 18 | --ddp-backend fully_sharded \ 19 | --fp16 --fp16-init-scale 4 \ 20 | --checkpoint-activations \ 21 | --no-reshard-after-forward \ 22 | --no-save-optimizer-state \ 23 | --max-target-positions 2048 \ 24 | --task seq2seq_ft_task \ 25 | --arch llama_7b \ 26 | --data-para \ 27 | --criterion lm_loss \ 28 | -s $src -t $tgt \ 29 | --max-tokens $max_token \ 30 | --optimizer adam --adam-betas "(0.9, 0.98)" \ 31 | --lr-scheduler polynomial_decay --lr 2e-5 \ 32 | --weight-decay 0.0 \ 33 | --total-num-update 2000 --warmup-updates 100 \ 34 | --max-epoch 3 \ 35 | --no-progress-bar \ 36 | --log-interval 10 \ 37 | --save-dir $save_dir | tee -a $save_dir/train.log \ 38 | -------------------------------------------------------------------------------- /alpaca/scripts/fsdp/run_train_cpu_offload.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export OMP_NUM_THREADS=20 6 | export CUDA_VISIBLE_DEVICES=0,1 7 | export CUDA_LAUNCH_BLOCKING=1 8 | 9 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin 10 | save_dir=/opt/data/private/ckpt/alpaca/fsdp_cpu_offload/ 11 | llama_dir=/opt/data/private/data/llama/7B/model_no_pad.pt 12 | max_token=2048 13 | 14 | 15 | python alpaca/src/train_fsdp.py $data_dir \ 16 | --reset-optimizer --reset-dataloader --reset-meters \ 17 | --restore-file $llama_dir \ 18 | --user-dir alpaca/src \ 19 | --ddp-backend fully_sharded \ 20 | --fp16 --fp16-init-scale 4 \ 21 | --cpu-offload --checkpoint-activations \ 22 | --no-reshard-after-forward \ 23 | --no-save-optimizer-state \ 24 | --max-target-positions 2048 \ 25 | --task seq2seq_ft_task \ 26 | --arch llama_7b \ 27 | --data-para \ 28 | --criterion lm_loss \ 29 | -s $src -t $tgt \ 30 | --max-tokens $max_token \ 31 | --optimizer new_cpu_adam --adam-betas "(0.9, 0.98)" \ 32 | --lr-scheduler polynomial_decay --lr 2e-5 \ 33 | --weight-decay 0.0 \ 34 | --total-num-update 2000 --warmup-updates 100 \ 35 | --max-epoch 3 \ 36 | --no-progress-bar \ 37 | --log-interval 10 \ 38 | --save-dir $save_dir | tee -a $save_dir/train.log \ 39 | -------------------------------------------------------------------------------- /alpaca/scripts/lora/README.md: -------------------------------------------------------------------------------- 1 | # LoRA 2 | 
3 | Efficient-Finetuning Method: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) 4 | 5 | 6 | ## Training Step 7 | 8 | ``` 9 | bash alpaca/scripts/lora/run_train.sh 10 | ``` 11 | 12 | ## Inference Step 13 | 14 | + (Batch-Level) Please prepare the test file. 15 | 16 | ``` 17 | bash alpaca/scripts/lora/inference/run_inf.sh 18 | ``` 19 | 20 | + (Instance-Level) Using alpaca/src/inference.py line 17 to set prompts. 21 | 22 | ``` 23 | bash alpaca/scripts/lora/inference/run_inf_hub.sh 24 | ``` 25 | -------------------------------------------------------------------------------- /alpaca/scripts/lora/inference/run_inf.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export CUDA_VISIBLE_DEVICES=0 6 | 7 | data_dir=/opt/data/private/data/llama/llama_instruction/inf/data-bin/ 8 | llama_dir=/opt/data/private/data/llama/7B/ 9 | lora_dir=/opt/data/private/ckpt/alpaca/lora/checkpoint3.pt 10 | bpe_dir=/opt/data/private/data/llama/tokenizer.model 11 | 12 | 13 | torchrun --master_port 29001 alpaca/src/generate.py $data_dir \ 14 | --user-dir alpaca/src \ 15 | --task seq2seq_lora_task \ 16 | --arch llama_7b \ 17 | --lora-model-inf $lora_dir \ 18 | --lora-tuning \ 19 | -s $src -t $tgt \ 20 | --gen-subset test \ 21 | --bpe 'sentencepiece' --sentencepiece-model $bpe_dir \ 22 | --path $llama_dir/model_no_pad.pt \ 23 | --required-batch-size-multiple 1 \ 24 | --batch-size 1 \ 25 | --beam 1 --sampling --sampling-topp 0.95 --temperature 0.8 \ 26 | -------------------------------------------------------------------------------- /alpaca/scripts/lora/inference/run_inf_hub.sh: -------------------------------------------------------------------------------- 1 | 2 | export CUDA_VISIBLE_DEVICES=0 3 | 4 | torchrun --master_port 29004 alpaca/src/inference.py \ 5 | --model-dir /opt/data/private/data/llama/7B/ \ 6 | --model-file model_no_pad.pt \ 7 | --lora-tuning \ 8 | --lora-model-inf /opt/data/private/ckpt/alpaca/lora/checkpoint3.pt \ 9 | --bpe sentencepiece \ 10 | --sentencepiece-model /opt/data/private/data/llama/tokenizer.model \ 11 | -------------------------------------------------------------------------------- /alpaca/scripts/lora/inference/run_webapp.sh: -------------------------------------------------------------------------------- 1 | 2 | export CUDA_VISIBLE_DEVICES=0 3 | 4 | torchrun --master_port 29002 alpaca/src/webapp.py \ 5 | --model-dir /opt/data/private/data/llama/7B/ \ 6 | --model-file model_no_pad.pt \ 7 | --lora-model-inf /opt/data/private/ckpt/alpaca/lora/checkpoint3.pt \ 8 | --lora-tuning \ 9 | --bpe sentencepiece \ 10 | --sentencepiece-model /opt/data/private/data/llama/tokenizer.model \ 11 | -------------------------------------------------------------------------------- /alpaca/scripts/lora/run_train.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export CUDA_VISIBLE_DEVICES=0,1,2,3 6 | 7 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin 8 | save_dir=/opt/data/private/ckpt/alpaca/lora/ 9 | llama_dir=/opt/data/private/data/llama/7B/model_no_pad.pt 10 | max_token=1024 11 | update_freq=1 12 | world_size=4 13 | 14 | 15 | torchrun --master_port 29000 --nproc_per_node $world_size alpaca/src/train_lora.py $data_dir \ 16 | --reset-optimizer --reset-dataloader --reset-meters \ 17 | --restore-file $llama_dir \ 18 | --user-dir alpaca/src \ 19 | --max-source-positions 2048 \ 20 | 
--max-target-positions 2048 \ 21 | --memory-efficient-fp16 \ 22 | --fp16 --fp16-init-scale 4 \ 23 | --task seq2seq_lora_task \ 24 | --arch llama_7b \ 25 | --criterion lm_loss \ 26 | --lora-tuning \ 27 | --data-para \ 28 | -s $src -t $tgt \ 29 | --max-tokens $max_token \ 30 | --update-freq $update_freq \ 31 | --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 --clip-norm 0.0 \ 32 | --lr-scheduler polynomial_decay --lr 3e-4 \ 33 | --weight-decay 0.0 \ 34 | --total-num-update 5000 --warmup-updates 200 \ 35 | --max-epoch 3 \ 36 | --no-progress-bar \ 37 | --log-interval 100 \ 38 | --save-dir $save_dir | tee -a $save_dir/train.log \ 39 | -------------------------------------------------------------------------------- /alpaca/scripts/megatron/README.md: -------------------------------------------------------------------------------- 1 | # LoRA 2 | 3 | Megatron-LM: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 4 | 5 | 6 | ## Training Step 7 | 8 | ``` 9 | bash alpaca/scripts/megatron/run_train_megatron.sh 10 | ``` 11 | 12 | ## Inference Step 13 | 14 | + (Batch-Level) Please prepare the test file. 15 | 16 | ``` 17 | bash alpaca/scripts/megatron/run_inf_megatron.sh 18 | ``` 19 | -------------------------------------------------------------------------------- /alpaca/scripts/megatron/inference/run_inf_megatron.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 6 | 7 | data_dir=/opt/data/private/data/llama/llama_instruction/inf/data-bin/ 8 | llama_dir=/opt/data/private/ckpt/alpaca/megatron8_ft/ 9 | bpe_dir=/opt/data/private/data/llama/tokenizer.model 10 | world_size=8 11 | 12 | 13 | torchrun --master_port 29006 --nproc_per_node $world_size alpaca/src/generate.py $data_dir \ 14 | --user-dir alpaca/src \ 15 | --model-parallel-size $world_size \ 16 | --distributed-world-size $world_size \ 17 | --task seq2seq_ft_task \ 18 | --megatron-model \ 19 | --arch llama_7b \ 20 | -s $src -t $tgt \ 21 | --gen-subset test \ 22 | --bpe 'sentencepiece' --sentencepiece-model $bpe_dir \ 23 | --path $llama_dir/checkpoint1.pt \ 24 | --required-batch-size-multiple 1 \ 25 | --batch-size 1 \ 26 | --beam 1 --sampling --sampling-topp 0.95 --temperature 0.8 \ 27 | -------------------------------------------------------------------------------- /alpaca/scripts/megatron/run_train_megatron.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 6 | 7 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin 8 | save_dir=/opt/data/private/ckpt/alpaca/megatron8_ft/ 9 | llama_dir=/opt/data/private/data/llama/7B/megatron_8/model.pt 10 | max_token=2048 11 | update_freq=1 12 | world_size=8 13 | 14 | 15 | torchrun --master_port 29000 --nproc_per_node $world_size alpaca/src/train_megatron.py $data_dir \ 16 | --model-parallel-size $world_size \ 17 | --distributed-world-size $world_size \ 18 | --reset-optimizer --reset-dataloader --reset-meters \ 19 | --restore-file $llama_dir \ 20 | --user-dir alpaca/src \ 21 | --max-source-positions 2048 \ 22 | --max-target-positions 2048 \ 23 | --memory-efficient-fp16 \ 24 | --fp16 --fp16-init-scale 4 \ 25 | --checkpoint-activations \ 26 | --task seq2seq_ft_task \ 27 | --arch llama_7b \ 28 | --megatron-model \ 29 | --criterion lm_loss \ 30 | -s $src -t $tgt \ 31 | 
--max-tokens $max_token \ 32 | --update-freq $update_freq \ 33 | --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 --clip-norm 0.0 \ 34 | --lr-scheduler polynomial_decay --lr 2e-5 \ 35 | --weight-decay 0.0 \ 36 | --total-num-update 7000 --warmup-updates 200 \ 37 | --max-epoch 3 \ 38 | --no-progress-bar \ 39 | --log-interval 100 \ 40 | --save-dir $save_dir | tee -a $save_dir/train.log \ 41 | -------------------------------------------------------------------------------- /alpaca/scripts/megatron_lora/README.md: -------------------------------------------------------------------------------- 1 | # LoRA 2 | 3 | Megatron-LM: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 4 | Efficient-Finetuning Method: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) 5 | 6 | 7 | ## Training Step 8 | 9 | ``` 10 | bash alpaca/scripts/megatron_lora/run_train_megatron_lora.sh 11 | ``` 12 | 13 | ## Inference Step 14 | 15 | + (Batch-Level) Please prepare the test file. 16 | 17 | ``` 18 | bash alpaca/scripts/megatron_lora/inference/run_inf_megatron_lora.sh 19 | ``` 20 | -------------------------------------------------------------------------------- /alpaca/scripts/megatron_lora/inference/run_inf_megatron_lora.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export CUDA_VISIBLE_DEVICES=0,1 6 | 7 | data_dir=/opt/data/private/data/llama/llama_instruction/inf/data-bin 8 | llama_dir=/opt/data/private/data/llama/7B/megatron_2/ 9 | lora_dir=/opt/data/private/ckpt/alpaca/megatron_lora/checkpoint1-model_part-0.pt 10 | bpe_dir=/opt/data/private/data/llama/tokenizer.model 11 | world_size=2 12 | 13 | 14 | torchrun --master_port 29006 --nproc_per_node $world_size alpaca/src/generate.py $data_dir \ 15 | --user-dir alpaca/src \ 16 | --model-parallel-size $world_size \ 17 | --distributed-world-size $world_size \ 18 | --lora-model-inf $lora_dir \ 19 | --task seq2seq_lora_task \ 20 | --arch llama_7b \ 21 | --megatron-model \ 22 | --lora-tuning \ 23 | -s $src -t $tgt \ 24 | --gen-subset test \ 25 | --bpe 'sentencepiece' --sentencepiece-model $bpe_dir \ 26 | --path $llama_dir/model.pt \ 27 | --required-batch-size-multiple 1 \ 28 | --batch-size 1 \ 29 | --beam 1 --sampling --sampling-topp 0.95 --temperature 0.8 \ 30 | -------------------------------------------------------------------------------- /alpaca/scripts/megatron_lora/run_train_megatron_lora.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | src=src 3 | tgt=tgt 4 | 5 | export CUDA_VISIBLE_DEVICES=0,1 6 | 7 | data_dir=/opt/data/private/data/llama/llama_instruction/data-bin 8 | save_dir=/opt/data/private/ckpt/alpaca/megatron_lora/ 9 | llama_dir=/opt/data/private/data/llama/7B/megatron_2/model.pt 10 | max_token=1024 11 | update_freq=2 12 | world_size=2 13 | 14 | 15 | torchrun --master_port 29002 --nproc_per_node $world_size alpaca/src/train_megatron.py $data_dir \ 16 | --model-parallel-size $world_size \ 17 | --distributed-world-size $world_size \ 18 | --reset-optimizer --reset-dataloader --reset-meters \ 19 | --restore-file $llama_dir \ 20 | --user-dir alpaca/src \ 21 | --max-source-positions 2048 \ 22 | --max-target-positions 2048 \ 23 | --memory-efficient-fp16 \ 24 | --fp16 --fp16-init-scale 4 \ 25 | --task seq2seq_lora_task \ 26 | --arch llama_7b \ 27 | --megatron-model \ 28 | --criterion lm_loss \ 29 | --lora-tuning \ 
30 | -s $src -t $tgt \ 31 | --max-tokens $max_token \ 32 | --update-freq $update_freq \ 33 | --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 --clip-norm 0.0 \ 34 | --lr-scheduler polynomial_decay --lr 2e-4 \ 35 | --weight-decay 0.0 \ 36 | --total-num-update 7000 --warmup-updates 200 \ 37 | --max-epoch 3 \ 38 | --no-progress-bar \ 39 | --log-interval 100 \ 40 | --save-dir $save_dir | tee -a $save_dir/train.log \ 41 | -------------------------------------------------------------------------------- /alpaca/scripts/utils/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Data Process 3 | 4 | Prepare the chatbot data, which contains the 52K instruction-following examples we used for fine-tuning the Alpaca model. 5 | 6 | ``` 7 | bash prepare_llama_training_data.sh 8 | ``` 9 | 10 | Parameters: 11 | 12 | + `DATA` the dataset directory; download the alpaca data [alpaca_data.json](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) into it. 13 | + `SPM` path to the sentencepiece encoder binary: "sentencepiece/build/src/spm_encode". 14 | + `MODEL` the LLaMA tokenizer model "tokenizer.model". 15 | 16 | # Model Process 17 | 18 | 19 | ## Build Model Checkpoint 20 | 21 | Process the LLaMA checkpoint according to your equipment (GPU devices). 22 | 23 | 1. Single model checkpoint: 24 | ``` 25 | python alpaca/scripts/utils/process_llama_ckpt.py --llama-model-dir $llama_dir --llama-model-file $llama_file 26 | ``` 27 | 28 | 2. Megatron-LM model checkpoint: 29 | ``` 30 | python alpaca/scripts/utils/process_llama_megatron_ckpt.py --llama-model-dir $llama_dir --llama-model-file $llama_file --parallel-size 2 31 | ``` 32 | 33 | Parameters: 34 | + `--parallel-size` the number of model-parallel parts (GPUs) to split the checkpoint into. 35 | 36 | After that, we get the new checkpoint file ``model_no_pad.pt`` (single checkpoint) or the per-part files ``model-model_part-{i}.pt`` under ``megatron_{N}/`` (Megatron-LM split). 37 | 38 | 39 | ## Merge Model Checkpoint 40 | 41 | We can merge multiple `Megatron-LM Checkpoints` into a single checkpoint to support the `hub` or `web interface` mode.
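The script takes the checkpoint directory, the per-part file name pattern, and the model-parallel size as arguments (see `merge_llama_megatron_ckpt.py`). An explicit invocation for a 2-way split might look like the following (the paths are illustrative):

```
python alpaca/scripts/utils/merge_llama_megatron_ckpt.py \
    --llama-model-dir /path/to/megatron_lora_checkpoints/ \
    --prefix checkpoint1-model_part-{}.pt \
    --parallel-size 2
```

Called without arguments, the script falls back to the defaults hard-coded in its argument parser: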
42 | 43 | ``` 44 | python merge_llama_megatron_ckpt.py 45 | ``` 46 | -------------------------------------------------------------------------------- /alpaca/scripts/utils/convert_llama_to_half.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import json 3 | import argparse 4 | 5 | def convert_llama_half(llama_file): 6 | 7 | with open(llama_file, "rb") as f: 8 | llama_state = torch.load(f, map_location=torch.device("cpu")) 9 | 10 | for k in list(llama_state['model'].keys()): 11 | llama_state['model'][k] = llama_state['model'][k].half() 12 | 13 | dump_file = "checkpoint_half.pt" 14 | torch.save(llama_state, llama_file.replace("checkpoint_best.pt", dump_file)) 15 | print("dump new model to {}".format(dump_file)) 16 | 17 | def main(): 18 | 19 | parser = argparse.ArgumentParser() 20 | parser.add_argument( 21 | "--llama-model-file", 22 | type=str, 23 | default="/opt/data/private/ckpt/alpaca/fsdp_belle/checkpoint_best.pt", 24 | help="path containing model file", 25 | ) 26 | 27 | args = parser.parse_args() 28 | print("convert model {}".format(args.llama_model_file)) 29 | convert_llama_half(args.llama_model_file) 30 | 31 | 32 | if __name__ == "__main__": 33 | main() 34 | -------------------------------------------------------------------------------- /alpaca/scripts/utils/merge_llama_megatron_ckpt.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import json 3 | import argparse 4 | 5 | def build_default_state(): 6 | 7 | state = {} 8 | 9 | state['args'] = {} 10 | state['args']['arch'] = "llama_7b" 11 | state['args']['task'] = "seq2seq_ft_task" 12 | state['args']['criterion'] = "ll_loss" 13 | state['args']['decoder_attention_heads'] = 32 14 | state['args']['decoder_embed_dim'] = 4096 15 | state['args']['decoder_ffn_embed_dim'] = 16384 16 | state['args']['decoder_layers'] = 32 17 | 18 | state['args']['max_target_positions'] = 2048 19 | state['args']['max_tokens'] = 2048 20 | 21 | temp_parser = argparse.ArgumentParser() 22 | for key, value in state['args'].items(): 23 | temp_parser.add_argument("--" + key, default=value) 24 | args = temp_parser.parse_args([]) 25 | 26 | state['args'] = args 27 | 28 | state['model'] = {} 29 | state['optimizer_history'] = [ 30 | { 31 | 'criterion_name': 'lm_loss', 32 | 'optimizer_name': 'MemoryEfficientFP16Optimizer', 33 | 'lr_scheduler_state': {'best': None}, 34 | 'num_updates': 5000, 35 | } 36 | ] 37 | state['extra_state'] = {} 38 | print(state) 39 | return state 40 | 41 | def build_llama_state_dict(llama_dir, parallel_size, prefix): 42 | 43 | llama_state = None 44 | for file_idx in range(parallel_size): 45 | print(file_idx) 46 | with open((llama_dir + prefix).format(file_idx), "rb") as f: 47 | sep_state = torch.load(f, map_location=torch.device("cpu"))['model'] 48 | 49 | if llama_state is None: 50 | llama_state = sep_state 51 | continue 52 | 53 | for k in list(sep_state.keys()): 54 | 55 | print("{}: {} -> +{}".format(k, llama_state[k].size(), sep_state[k].size())) 56 | if "inner_attention" in k: 57 | print("skip llama state key = {} size = {}".format(k, llama_state[k].size())) 58 | continue 59 | elif "norm.weight" in k or "_norm" in k: 60 | continue 61 | elif "decoder.embed_tokens.weight" in k: 62 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=-1) 63 | elif "decoder.output_projection.weight" in k: 64 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=0) 65 | elif "layers" in k: 66 | if "attention" 
in k and "out_proj" not in k: 67 | # 2048, 4096 68 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=0) 69 | elif "attention" in k and "out_proj" in k: 70 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=-1) 71 | elif "feed_forward.w1" in k: 72 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=0) 73 | elif "feed_forward.w2" in k: 74 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=-1) 75 | elif "feed_forward.w3" in k: 76 | llama_state[k] = torch.cat([llama_state[k].half(), sep_state[k].half()], dim=0) 77 | else: 78 | print(sep_state[k].size()) 79 | print(k) 80 | raise NotImplementedError 81 | else: 82 | print(k) 83 | print(sep_state[k].size()) 84 | raise NotImplementedError 85 | 86 | llama_state["decoder.embed_tokens.weight"] = llama_state["decoder.embed_tokens.weight"][:32008, :] 87 | llama_state["decoder.output_projection.weight"] = llama_state["decoder.output_projection.weight"][:32008, :] 88 | state = build_default_state() 89 | state['model'] = llama_state 90 | dump_file = "model.pt" 91 | torch.save(state, llama_dir + dump_file) 92 | print("dump new model to {}{}".format(llama_dir, dump_file)) 93 | 94 | def main(): 95 | 96 | parser = argparse.ArgumentParser() 97 | parser.add_argument( 98 | "--llama-model-dir", 99 | type=str, 100 | default="/opt/data/private/ckpt/alpaca/parallel_zh_new_ft/", 101 | help="path containing model file", 102 | ) 103 | parser.add_argument( 104 | "--prefix", 105 | type=str, 106 | default="checkpoint_1_15000-model_part-{}.pt", 107 | help="where in model_dir are weights saved", 108 | ) 109 | parser.add_argument( 110 | "--parallel-size", 111 | type=int, 112 | default=8, 113 | help="model parallel size to split", 114 | ) 115 | 116 | args = parser.parse_args() 117 | print("load model from {}{}".format(args.llama_model_dir, args.prefix)) 118 | build_llama_state_dict(args.llama_model_dir, args.parallel_size, args.prefix) 119 | 120 | 121 | if __name__ == "__main__": 122 | main() 123 | -------------------------------------------------------------------------------- /alpaca/scripts/utils/prepare_inf_data.sh: -------------------------------------------------------------------------------- 1 | SRC=src 2 | TGT=tgt 3 | 4 | DATA=/opt/data/private/data/llama/llama_instruction/inf/ 5 | SPM=/opt/data/private/code/sentencepiece/build/src/spm_encode 6 | MODEL=/opt/data/private/data/llama/tokenizer.model 7 | 8 | cp ${DATA}/test.spm.${SRC} ${DATA}/test.spm.${TGT} 9 | 10 | python alpaca/src/preprocess.py \ 11 | --user-dir alpaca/src \ 12 | --task llama_task \ 13 | --source-lang ${SRC} \ 14 | --target-lang ${TGT} \ 15 | --testpref ${DATA}/test.spm \ 16 | --destdir ${DATA}/data-bin \ 17 | --srcdict alpaca/scripts/assert/dict.txt \ 18 | --tgtdict alpaca/scripts/assert/dict.txt \ 19 | --workers 40 \ 20 | -------------------------------------------------------------------------------- /alpaca/scripts/utils/prepare_llama_belle_data.sh: -------------------------------------------------------------------------------- 1 | SRC=src 2 | TGT=tgt 3 | 4 | 5 | DATA=/opt/data/private/data/llama/belle_1m 6 | SPM=/opt/data/private/code/sentencepiece/build/src/spm_encode 7 | MODEL=/opt/data/private/data/llama/tokenizer.model 8 | 9 | python fsdp/scripts/utils/prepare_utils.py --manner split_zh --alpaca-data $DATA/Belle_open_source_1M.json 10 | 11 | head -100 ${DATA}/train.src > ${DATA}/valid.src 12 | head -100 ${DATA}/train.tgt > ${DATA}/valid.tgt 13 | 14 | ${SPM} --model=${MODEL} < 
${DATA}/train.${SRC} > ${DATA}/train.spm.${SRC}.tmp 15 | ${SPM} --model=${MODEL} < ${DATA}/train.${TGT} > ${DATA}/train.spm.${TGT}.tmp 16 | ${SPM} --model=${MODEL} < ${DATA}/valid.${SRC} > ${DATA}/valid.spm.${SRC}.tmp 17 | ${SPM} --model=${MODEL} < ${DATA}/valid.${TGT} > ${DATA}/valid.spm.${TGT}.tmp 18 | 19 | python fsdp/scripts/utils/prepare_utils.py --manner replace_zh --alpaca-data $DATA/Belle_open_source_1M.json 20 | 21 | python fsdp/src/preprocess.py \ 22 | --user-dir fsdp/src \ 23 | --task llama_task \ 24 | --source-lang ${SRC} \ 25 | --target-lang ${TGT} \ 26 | --trainpref ${DATA}/train.spm \ 27 | --validpref ${DATA}/valid.spm \ 28 | --destdir ${DATA}/data-bin \ 29 | --srcdict alpaca/scripts/assert/dict.txt \ 30 | --tgtdict alpaca/scripts/assert/dict.txt \ 31 | --workers 40 \ 32 | -------------------------------------------------------------------------------- /alpaca/scripts/utils/prepare_llama_training_data.sh: -------------------------------------------------------------------------------- 1 | SRC=src 2 | TGT=tgt 3 | 4 | 5 | DATA=/opt/data/private/data/llama/llama_instruction 6 | SPM=/opt/data/private/code/sentencepiece/build/src/spm_encode 7 | MODEL=/opt/data/private/data/llama/tokenizer.model 8 | 9 | python alpaca/scripts/utils/prepare_utils.py --manner split --alpaca-data $DATA/alpaca_data.json 10 | 11 | head -100 ${DATA}/train.src > ${DATA}/valid.src 12 | head -100 ${DATA}/train.tgt > ${DATA}/valid.tgt 13 | 14 | ${SPM} --model=${MODEL} < ${DATA}/train.${SRC} > ${DATA}/train.spm.${SRC}.tmp 15 | ${SPM} --model=${MODEL} < ${DATA}/train.${TGT} > ${DATA}/train.spm.${TGT}.tmp 16 | ${SPM} --model=${MODEL} < ${DATA}/valid.${SRC} > ${DATA}/valid.spm.${SRC}.tmp 17 | ${SPM} --model=${MODEL} < ${DATA}/valid.${TGT} > ${DATA}/valid.spm.${TGT}.tmp 18 | 19 | python alpaca/scripts/utils/prepare_utils.py --manner replace --alpaca-data $DATA/alpaca_data.json 20 | 21 | python alpaca/src/preprocess.py \ 22 | --user-dir alpaca/src \ 23 | --task seq2seq_lora_task \ 24 | --source-lang ${SRC} \ 25 | --target-lang ${TGT} \ 26 | --trainpref ${DATA}/train.spm \ 27 | --validpref ${DATA}/valid.spm \ 28 | --destdir ${DATA}/data-bin \ 29 | --srcdict alpaca/scripts/assert/dict.txt \ 30 | --tgtdict alpaca/scripts/assert/dict.txt \ 31 | --workers 40 \ 32 | -------------------------------------------------------------------------------- /alpaca/scripts/utils/prepare_utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | import argparse 3 | 4 | 5 | def split_json(alpaca_data): 6 | 7 | json_data = json.load(open(alpaca_data)) 8 | print("load aplaca data number = {}".format(len(json_data))) 9 | train_src = alpaca_data.replace("alpaca_data.json", 'train.src') 10 | train_tgt = alpaca_data.replace("alpaca_data.json", 'train.tgt') 11 | 12 | prompt_text = "## Instruction:\n{}\n\n## Input:\n{}\n\n## Response:" 13 | prompt_no_input_text = "## Instruction:\n{}\n\n## Response:" 14 | 15 | with open(train_src, 'w') as f_src, open(train_tgt, 'w') as f_tgt: 16 | 17 | for data in json_data: 18 | if len(data['input']) > 0: 19 | src = prompt_text.format(data['instruction'].strip(), data['input'].strip()) 20 | else: 21 | src = prompt_no_input_text.format(data['instruction'].strip()) 22 | tgt = data['output'] 23 | f_src.writelines(src.replace("\n", '<0x0A>') + '\n') 24 | f_tgt.writelines(tgt.replace("\n", '<0x0A>') + '\n') 25 | 26 | def replace_data(alpaca_data): 27 | 28 | train_src = alpaca_data.replace("alpaca_data.json", 'train.spm.src.tmp') 29 | train_tgt = 
alpaca_data.replace("alpaca_data.json", 'train.spm.tgt.tmp') 30 | 31 | valid_src = alpaca_data.replace("alpaca_data.json", 'valid.spm.src.tmp') 32 | valid_tgt = alpaca_data.replace("alpaca_data.json", 'valid.spm.tgt.tmp') 33 | 34 | train_files = [train_src, train_tgt, valid_src, valid_tgt] 35 | for train_file_new in train_files: 36 | 37 | train_src_rep = train_file_new.replace(".tmp", "") 38 | with open(train_src_rep, 'w') as f_o: 39 | for line in open(train_file_new).readlines(): 40 | newline = line.replace("▁< 0 x 0 A >", " <0x0A> ").replace("< 0 x 0 A >", " <0x0A> ") 41 | newline = newline.replace(" ", " ") 42 | f_o.writelines(newline) 43 | 44 | def split_zh_json(alpaca_data): 45 | 46 | print("load aplaca data {}".format(alpaca_data)) 47 | train_src = alpaca_data.replace("Belle_open_source_1M.json", 'train.src') 48 | train_tgt = alpaca_data.replace("Belle_open_source_1M.json", 'train.tgt') 49 | 50 | prompt_text = "## Instruction:\n{}\n\n## Input:\n{}\n\n## Response:" 51 | prompt_no_input_text = "## Instruction:\n{}\n\n## Response:" 52 | 53 | with open(train_src, 'w') as f_src, open(train_tgt, 'w') as f_tgt: 54 | for lines in open(alpaca_data).readlines(): 55 | data = json.loads(lines) 56 | if len(data['input']) > 0: 57 | src = prompt_text.format(data['instruction'].strip(), data['input'].strip()).strip() 58 | else: 59 | src = prompt_no_input_text.format(data['instruction'].strip()).strip() 60 | tgt = data['output'].strip() 61 | if len(src) > 0 and len(tgt) > 0: 62 | f_src.writelines(src.replace("\n", '<0x0A>').replace("\\n", '<0x0A>').replace("\r\n", '<0x0A>').replace("\r", '<0x0A>') + '\n') 63 | f_tgt.writelines(tgt.replace("\n", '<0x0A>').replace("\\n", '<0x0A>').replace("\r\n", '<0x0A>').replace("\r", '<0x0A>') + '\n') 64 | 65 | def replace_zh_data(alpaca_data): 66 | 67 | train_src = alpaca_data.replace("Belle_open_source_1M.json", 'train.spm.src.tmp') 68 | train_tgt = alpaca_data.replace("Belle_open_source_1M.json", 'train.spm.tgt.tmp') 69 | 70 | valid_src = alpaca_data.replace("Belle_open_source_1M.json", 'valid.spm.src.tmp') 71 | valid_tgt = alpaca_data.replace("Belle_open_source_1M.json", 'valid.spm.tgt.tmp') 72 | 73 | train_files = [train_src, train_tgt, valid_src, valid_tgt] 74 | for train_file_new in train_files: 75 | 76 | train_src_rep = train_file_new.replace(".tmp", "") 77 | with open(train_src_rep, 'w') as f_o: 78 | for line in open(train_file_new).readlines(): 79 | newline = line.replace("▁< 0 x 0 A >", " <0x0A> ").replace("< 0 x 0 A >", " <0x0A> ") 80 | newline = newline.replace(" ", " ") 81 | f_o.writelines(newline) 82 | 83 | def main(): 84 | 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument( 87 | "--manner", 88 | required=True, 89 | type=str, 90 | default="split", 91 | help="process utils", 92 | ) 93 | parser.add_argument( 94 | "--alpaca-data", 95 | default="/opt/data/private/data/llama_new/alpaca_data.json", 96 | help="alpaca self-instruction data_dir", 97 | ) 98 | parser.add_argument( 99 | "--translation-data", 100 | default="/opt/data/private/data/llama/trans/translation2019zh_train.json", 101 | help="transltion data_dir", 102 | ) 103 | args = parser.parse_args() 104 | 105 | if args.manner == "split": 106 | split_json(args.alpaca_data) 107 | elif args.manner == "replace": 108 | replace_data(args.alpaca_data) 109 | elif args.manner == "split_zh": 110 | split_zh_json(args.alpaca_data) 111 | elif args.manner == "replace_zh": 112 | replace_zh_data(args.alpaca_data) 113 | else: 114 | print("No Support!") 115 | 116 | 117 | if __name__ == "__main__": 118 | 
main() 119 | -------------------------------------------------------------------------------- /alpaca/scripts/utils/process_llama_ckpt.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import json 3 | import argparse 4 | 5 | 6 | def build_default_state(): 7 | 8 | state = {} 9 | 10 | state['args'] = {} 11 | state['args']['arch'] = "llama_7b" 12 | state['args']['task'] = "seq2seq_ft_task" 13 | state['args']['criterion'] = "lm_loss" 14 | 15 | state['args']['decoder_attention_heads'] = 32 16 | state['args']['decoder_embed_dim'] = 4096 17 | state['args']['decoder_ffn_embed_dim'] = 16384 18 | state['args']['decoder_layers'] = 32 19 | 20 | temp_parser = argparse.ArgumentParser() 21 | for key, value in state['args'].items(): 22 | temp_parser.add_argument("--" + key, default=value) 23 | args = temp_parser.parse_args([]) 24 | 25 | state['args'] = args 26 | 27 | state['model'] = {} 28 | state['optimizer_history'] = [ 29 | { 30 | 'criterion_name': 'lm_loss', 31 | 'optimizer_name': 'AdamOptimizer', 32 | 'lr_scheduler_state': {'best': None}, 33 | 'num_updates': 2000, 34 | } 35 | ] 36 | state['extra_state'] = {} 37 | print(state) 38 | return state 39 | 40 | def build_llama_state_dict(llama_dir, llama_file): 41 | # please replace the llama_path with real path 42 | with open(llama_dir + llama_file, "rb") as f: 43 | llama_state = torch.load(f, map_location=torch.device("cpu")) 44 | 45 | # add pad to token weight and predicion weight 46 | dict_size, dict_dim = llama_state['tok_embeddings.weight'].size() 47 | pad = llama_state['tok_embeddings.weight'].new_zeros([1, dict_dim]) 48 | llama_state['tok_embeddings.weight'] = torch.cat([llama_state['tok_embeddings.weight'], pad], dim=0) 49 | llama_state['output.weight'] = torch.cat([llama_state['output.weight'], pad], dim=0) 50 | 51 | state = build_default_state() 52 | state['model'] = llama_state 53 | dump_file = "model_no_pad.pt" 54 | torch.save(state, llama_dir + dump_file) 55 | print("dump new model to {}{}".format(llama_dir, dump_file)) 56 | 57 | def main(): 58 | 59 | parser = argparse.ArgumentParser() 60 | parser.add_argument( 61 | "--llama-model-dir", 62 | type=str, 63 | default="/opt/data/private/data/llama/7B/", 64 | help="path containing model file", 65 | ) 66 | parser.add_argument( 67 | "--llama-model-file", 68 | type=str, 69 | default="consolidated.00.pth", 70 | help="where in model_dir are weights saved", 71 | ) 72 | 73 | args = parser.parse_args() 74 | print("load model from {}{}".format(args.llama_model_dir, args.llama_model_file)) 75 | build_llama_state_dict(args.llama_model_dir, args.llama_model_file) 76 | 77 | 78 | if __name__ == "__main__": 79 | main() 80 | -------------------------------------------------------------------------------- /alpaca/scripts/utils/process_llama_megatron_ckpt.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import json 3 | import argparse 4 | import os 5 | 6 | 7 | def build_default_state(): 8 | 9 | state = {} 10 | 11 | state['args'] = {} 12 | state['args']['arch'] = "llama_7b" 13 | state['args']['task'] = "seq2seq_ft_task" 14 | state['args']['criterion'] = "lm_loss" 15 | state['args']['decoder_attention_heads'] = 32 16 | state['args']['decoder_embed_dim'] = 4096 17 | state['args']['decoder_ffn_embed_dim'] = 16384 18 | state['args']['decoder_layers'] = 32 19 | 20 | temp_parser = argparse.ArgumentParser() 21 | for key, value in state['args'].items(): 22 | temp_parser.add_argument("--" + key, default=value) 23 | 
args = temp_parser.parse_args([]) 24 | 25 | state['args'] = args 26 | 27 | state['model'] = {} 28 | state['optimizer_history'] = [ 29 | { 30 | 'criterion_name': 'lm_loss', 31 | 'optimizer_name': 'MemoryEfficientFP16Optimizer', 32 | 'lr_scheduler_state': {'best': None}, 33 | 'num_updates': 5000, 34 | } 35 | ] 36 | state['extra_state'] = {} 37 | print(state) 38 | return state 39 | 40 | DICT_MAP = { 41 | 2: 32016, 42 | 4: 32032, 43 | 8: 32064, 44 | } 45 | 46 | def split_parameter(llama_state, parallel_size): 47 | 48 | parallel_state_list = [] 49 | incr_dict_size = DICT_MAP[parallel_size] 50 | 51 | dict_size, dict_dim = llama_state['tok_embeddings.weight'].size() 52 | pad = llama_state['tok_embeddings.weight'].new_zeros([incr_dict_size - dict_size, dict_dim]) 53 | llama_state['tok_embeddings.weight'] = torch.cat([llama_state['tok_embeddings.weight'], pad], dim=0) 54 | llama_state['output.weight'] = torch.cat([llama_state['output.weight'], pad], dim=0) 55 | 56 | embed_size = dict_dim // parallel_size 57 | ffn_embed_size = (256 * ((int(2 * dict_dim * 4 / 3) + 256 - 1) // 256)) // parallel_size 58 | parallel_dict_size = incr_dict_size // parallel_size 59 | 60 | for parallel_idx in range(parallel_size): 61 | parallel_state = {} 62 | start_embed_size = parallel_idx * embed_size 63 | end_embed_size = (parallel_idx + 1) * embed_size 64 | start_ffn_embed_size = parallel_idx * ffn_embed_size 65 | end_ffn_embed_size = (parallel_idx + 1) * ffn_embed_size 66 | start_parallel_dict_size = parallel_idx * parallel_dict_size 67 | end_parallel_dict_size = (parallel_idx + 1) * parallel_dict_size 68 | 69 | print("embed dim start={} end={}".format(start_embed_size, end_embed_size)) 70 | print("ffn dim start={} end={}".format(start_ffn_embed_size, end_ffn_embed_size)) 71 | 72 | for k in list(llama_state.keys()): 73 | if "inner_attention" in k: 74 | print("skip llama state key = {} size = {}".format(k, llama_state[k].size())) 75 | continue 76 | elif "norm.weight" in k or "_norm" in k: 77 | parallel_state[k] = llama_state[k].clone() 78 | elif "tok_embeddings.weight" in k: 79 | parallel_state[k] = llama_state[k][:, start_embed_size:end_embed_size].clone() 80 | elif "output.weight" in k: 81 | parallel_state[k] = llama_state[k][start_parallel_dict_size:end_parallel_dict_size, :].clone() 82 | elif "layers" in k: 83 | if "attention" in k and "wo" not in k: 84 | # 2048, 4096 85 | parallel_state[k] = llama_state[k][start_embed_size:end_embed_size, :].clone() 86 | elif "attention" in k and "wo" in k: 87 | parallel_state[k] = llama_state[k][:, start_embed_size:end_embed_size].clone() 88 | elif "feed_forward.w1" in k: 89 | parallel_state[k] = llama_state[k][start_ffn_embed_size:end_ffn_embed_size, :].clone() 90 | elif "feed_forward.w2" in k: 91 | parallel_state[k] = llama_state[k][:, start_ffn_embed_size:end_ffn_embed_size].clone() 92 | elif "feed_forward.w3" in k: 93 | parallel_state[k] = llama_state[k][start_ffn_embed_size:end_ffn_embed_size, :].clone() 94 | else: 95 | print(llama_state[k].size()) 96 | print(k) 97 | raise NotImplementedError 98 | else: 99 | print(state[k].size()) 100 | print(k) 101 | raise NotImplementedError 102 | print("split llama state key = {} size = {}".format(k, llama_state[k].size())) 103 | print("parallel state size = {}".format(parallel_state[k].size())) 104 | parallel_state_list.append(parallel_state) 105 | return parallel_state_list 106 | 107 | def build_llama_state_dict(llama_dir, llama_file, parallel_size): 108 | # please replace the llama_path with real path 109 | with open(llama_dir + 
llama_file, "rb") as f: 110 | llama_state = torch.load(f, map_location=torch.device("cpu")) 111 | 112 | # add pad to token weight and predicion weight 113 | state = build_default_state() 114 | for parallel_idx, parallel_state in enumerate(split_parameter(llama_state, parallel_size)): 115 | state['model'] = parallel_state 116 | dump_file = "model-model_part-{}.pt".format(parallel_idx) 117 | if not os.path.exists(llama_dir + 'megatron_{}/'.format(parallel_size)): 118 | os.mkdir(llama_dir + 'megatron_{}/'.format(parallel_size)) 119 | torch.save(state, llama_dir + 'megatron_{}/'.format(parallel_size) + dump_file) 120 | print("dump new model to {}{}".format(llama_dir, dump_file)) 121 | 122 | def main(): 123 | 124 | parser = argparse.ArgumentParser() 125 | parser.add_argument( 126 | "--llama-model-dir", 127 | type=str, 128 | default="/opt/data/private/data/llama/7B/", 129 | help="path containing model file", 130 | ) 131 | parser.add_argument( 132 | "--llama-model-file", 133 | type=str, 134 | default="consolidated.00.pth", 135 | help="where in model_dir are weights saved", 136 | ) 137 | parser.add_argument( 138 | "--parallel-size", 139 | type=int, 140 | default=2, 141 | help="model parallel size to split", 142 | ) 143 | 144 | args = parser.parse_args() 145 | print("load model from {}{}.".format(args.llama_model_dir, args.llama_model_file)) 146 | print("We will split the llama model into {} fragment.".format(args.parallel_size)) 147 | build_llama_state_dict(args.llama_model_dir, args.llama_model_file, args.parallel_size) 148 | 149 | if __name__ == "__main__": 150 | main() 151 | -------------------------------------------------------------------------------- /alpaca/src/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | try: 3 | from .model import llama_model 4 | except ValueError: 5 | print("llama model has been loaded!!!") 6 | from .loss import lm_loss 7 | from .task import seq2seq_ft_task, seq2seq_lora_task 8 | from .fsdp import cpu_adam, fully_sharded_data_parallel 9 | -------------------------------------------------------------------------------- /alpaca/src/__pycache__/__init__.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/__pycache__/__init__.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/__pycache__/megatron_trainer.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/__pycache__/megatron_trainer.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/__pycache__/trainer.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/__pycache__/trainer.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/__pycache__/utils.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/__pycache__/utils.cpython-37.pyc 
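The package __init__.py above is what exposes the custom model ("llama"), criterion ("lm_loss"), and tasks ("seq2seq_ft_task", "seq2seq_lora_task") to fairseq whenever a script points --user-dir at alpaca/src (inference.py does the same through its user_dir kwarg). A minimal sketch of that mechanism, assuming it is run from the repository root and a recent fairseq release with these registry names:

from argparse import Namespace

from fairseq import utils
from fairseq.models import MODEL_REGISTRY
from fairseq.tasks import TASK_REGISTRY
from fairseq.criterions import CRITERION_REGISTRY

# Import alpaca/src the same way fairseq-train / fairseq-generate do with --user-dir.
utils.import_user_module(Namespace(user_dir="alpaca/src"))

# The registrations performed in __init__.py should now be visible.
assert "llama" in MODEL_REGISTRY
assert "lm_loss" in CRITERION_REGISTRY
assert "seq2seq_ft_task" in TASK_REGISTRY and "seq2seq_lora_task" in TASK_REGISTRY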
-------------------------------------------------------------------------------- /alpaca/src/fsdp/__pycache__/cpu_adam.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/fsdp/__pycache__/cpu_adam.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/fsdp/__pycache__/fully_sharded_data_parallel.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/fsdp/__pycache__/fully_sharded_data_parallel.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/fsdp/cpu_adam.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import importlib 7 | from collections.abc import Collection 8 | from dataclasses import dataclass, field 9 | from typing import List 10 | 11 | import torch 12 | from fairseq.dataclass import FairseqDataclass 13 | from fairseq.optim import FairseqOptimizer, register_optimizer 14 | from omegaconf import II, DictConfig 15 | 16 | 17 | try: 18 | import deepspeed 19 | 20 | has_deepspeed = True 21 | except ImportError as e: 22 | has_deepspeed = False 23 | 24 | 25 | def _get_cpu_adam(): 26 | try: 27 | from deepspeed.ops.op_builder import CPUAdamBuilder 28 | 29 | return CPUAdamBuilder().load() 30 | except ImportError: 31 | # fbcode 32 | from deepspeed.ops.adam import DeepSpeedCPUAdam as ds_opt_adam 33 | 34 | return ds_opt_adam 35 | 36 | 37 | @dataclass 38 | class FairseqCPUAdamConfig(FairseqDataclass): 39 | adam_betas: str = field( 40 | default="(0.9, 0.999)", metadata={"help": "betas for Adam optimizer"} 41 | ) 42 | adam_eps: float = field( 43 | default=1e-8, metadata={"help": "epsilon for Adam optimizer"} 44 | ) 45 | weight_decay: float = field(default=0.0, metadata={"help": "weight decay"}) 46 | fp16_adam_stats: bool = field( 47 | default=False, metadata={"help": "use FP16 stats (with automatic scaling)"} 48 | ) 49 | # TODO common vars below in parent 50 | lr: List[float] = II("optimization.lr") 51 | 52 | 53 | @register_optimizer("new_cpu_adam", dataclass=FairseqCPUAdamConfig) 54 | class FairseqCPUAdam(FairseqOptimizer): 55 | """Adam optimizer for fairseq, optimized for CPU tensors. 56 | 57 | Important note: this optimizer corresponds to the "AdamW" variant of 58 | Adam in its weight decay behavior. As such, it is most closely 59 | analogous to torch.optim.AdamW from PyTorch. 60 | """ 61 | 62 | def __init__(self, cfg: DictConfig, params): 63 | super().__init__(cfg) 64 | self._optimizer = CPUAdam(params, **self.optimizer_config) 65 | 66 | @property 67 | def optimizer_config(self): 68 | """ 69 | Return a kwarg dictionary that will be used to override optimizer 70 | args stored in checkpoints. This allows us to load a checkpoint and 71 | resume training using a different set of optimizer args, e.g., with a 72 | different learning rate. 
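Note that ``lr`` is interpolated from ``optimization.lr`` via the ``II`` field on FairseqCPUAdamConfig above; when it is given as a list, only the first value is passed through to the underlying CPUAdam.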
73 | """ 74 | return { 75 | "lr": self.cfg.lr[0] 76 | if isinstance(self.cfg.lr, Collection) 77 | else self.cfg.lr, 78 | "betas": eval(self.cfg.adam_betas), 79 | "eps": self.cfg.adam_eps, 80 | "weight_decay": self.cfg.weight_decay, 81 | "use_fp16_stats": self.cfg.fp16_adam_stats, 82 | } 83 | 84 | 85 | class CPUAdam(torch.optim.Optimizer): 86 | 87 | optimizer_id = 0 88 | 89 | def __init__( 90 | self, 91 | params, 92 | lr=1e-3, 93 | bias_correction=True, 94 | betas=(0.9, 0.999), 95 | eps=1e-8, 96 | weight_decay=0, 97 | use_fp16_stats=False, 98 | ): 99 | defaults = { 100 | "lr": lr, 101 | "bias_correction": bias_correction, 102 | "betas": betas, 103 | "eps": eps, 104 | "weight_decay": weight_decay, 105 | } 106 | super().__init__(params, defaults) 107 | 108 | self.use_fp16_stats = use_fp16_stats 109 | self.FLOAT16_MAX = 65504.0 110 | 111 | if not has_deepspeed: 112 | raise ImportError("Please install DeepSpeed: pip install deepspeed") 113 | 114 | self.opt_id = CPUAdam.optimizer_id 115 | CPUAdam.optimizer_id = CPUAdam.optimizer_id + 1 116 | 117 | self.ds_opt_adam = _get_cpu_adam() 118 | adamw_mode = True 119 | self.ds_opt_adam.create_adam( 120 | self.opt_id, lr, betas[0], betas[1], eps, weight_decay, adamw_mode, True 121 | ) 122 | 123 | @property 124 | def supports_memory_efficient_fp16(self): 125 | return True 126 | 127 | @property 128 | def supports_flat_params(self): 129 | return True 130 | 131 | @torch.no_grad() 132 | def step(self, closure=None): 133 | loss = None 134 | if closure is not None: 135 | with torch.enable_grad(): 136 | loss = closure() 137 | 138 | torch.cuda.synchronize() 139 | 140 | for group_id, group in enumerate(self.param_groups): 141 | for param_id, p in enumerate(group["params"]): 142 | if p.grad is None: 143 | continue 144 | 145 | state = self.state[p] 146 | if len(state) == 0: 147 | state["step"] = 0 148 | dtype = torch.float16 if self.use_fp16_stats else p.data.dtype 149 | # gradient momentums 150 | state["exp_avg"] = torch.zeros_like( 151 | p.data, dtype=dtype, device="cpu" 152 | ) 153 | # gradient variances 154 | state["exp_avg_sq"] = torch.zeros_like( 155 | p.data, dtype=dtype, device="cpu" 156 | ) 157 | if self.use_fp16_stats: 158 | assert torch.is_floating_point(p.data) 159 | state["exp_avg_scale"] = 1.0 160 | state["exp_avg_sq_scale"] = 1.0 161 | 162 | exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"] 163 | 164 | p_data_bak = p.data # backup of the original data pointer 165 | 166 | p.data = p.data.to(dtype=torch.float32, device="cpu") 167 | p.grad.data = p.grad.data.to(dtype=torch.float32, device="cpu") 168 | 169 | if self.use_fp16_stats: 170 | exp_avg = exp_avg.float() * state["exp_avg_scale"] 171 | exp_avg_sq = exp_avg_sq.float() * state["exp_avg_sq_scale"] 172 | 173 | state["step"] += 1 174 | beta1, beta2 = group["betas"] 175 | 176 | self.ds_opt_adam.adam_update( 177 | self.opt_id, 178 | state["step"], 179 | group["lr"], 180 | beta1, 181 | beta2, 182 | group["eps"], 183 | group["weight_decay"], 184 | group["bias_correction"], 185 | p.data, 186 | p.grad.data, 187 | exp_avg, 188 | exp_avg_sq, 189 | ) 190 | 191 | if p_data_bak.data_ptr() != p.data.data_ptr(): 192 | p_data_bak.copy_(p.data) 193 | p.data = p_data_bak 194 | 195 | if self.use_fp16_stats: 196 | 197 | def inf_norm(t): 198 | return torch.norm(t, float("inf")) 199 | 200 | # from github.com/openai/jukebox/blob/master/jukebox/utils/fp16.py 201 | state["exp_avg_scale"], state["exp_avg_sq_scale"] = ( 202 | 1e-8 + inf_norm(exp_avg) / self.FLOAT16_MAX, 203 | 1e-8 + inf_norm(exp_avg_sq) / 
self.FLOAT16_MAX, 204 | ) 205 | state["exp_avg"], state["exp_avg_sq"] = ( 206 | (exp_avg / state["exp_avg_scale"]).half(), 207 | (exp_avg_sq / state["exp_avg_sq_scale"]).half(), 208 | ) 209 | 210 | return loss 211 | -------------------------------------------------------------------------------- /alpaca/src/fsdp/fully_sharded_data_parallel.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import contextlib 7 | from typing import Optional 8 | import os 9 | import torch 10 | from fairseq.dataclass.configs import DistributedTrainingConfig 11 | from fairseq.distributed import utils as dist_utils 12 | from typing import Any, Dict, Optional, Set, cast 13 | try: 14 | from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP 15 | from fairscale.nn.data_parallel import TrainingState 16 | has_FSDP = True 17 | except ImportError: 18 | FSDP = torch.nn.Module 19 | has_FSDP = False 20 | 21 | def free_storage_(data: torch.Tensor): 22 | if data.storage().size() > 0: 23 | assert data.storage_offset() == 0 24 | data.storage().resize_(0) 25 | 26 | 27 | class FullyShardedDataParallel(FSDP): 28 | """ 29 | A small wrapper around fairscale's FullyShardedDataParallel (FSDP) with some 30 | fairseq-specific checkpoint saving/loading logic. 31 | 32 | Args: 33 | use_sharded_state (bool): if True, then ``state_dict`` will return 34 | ``FSDP.local_state_dict`` and ``load_state_dict`` will call 35 | ``FSDP.load_local_state_dict``. Otherwise, ``state_dict`` will 36 | return the full model weights on data parallel rank 0 (empty on 37 | other ranks) and ``load_state_dict`` will broadcast model weights 38 | from rank 0 to other ranks. 39 | """ 40 | 41 | def __init__(self, *args, use_sharded_state: bool = False, **kwargs): 42 | if not has_FSDP: 43 | raise ImportError( 44 | "Cannot find FullyShardedDataParallel. " 45 | "Please install fairscale with: pip install fairscale" 46 | ) 47 | super().__init__(*args, **kwargs) 48 | self.use_sharded_state = use_sharded_state 49 | 50 | if dist_utils.get_world_size(group=dist_utils.get_data_parallel_group()) < 4 and \ 51 | "NVIDIA GeForce RTX 3090" in torch.cuda.get_device_name(): 52 | self.alpaca_force_full_precision = False 53 | else: 54 | self.alpaca_force_full_precision = True 55 | 56 | @property 57 | def unwrapped_module(self) -> torch.nn.Module: 58 | if self.flatten_parameters: 59 | return self.module.module 60 | else: 61 | return self.module 62 | 63 | def state_dict(self, destination=None, prefix="", keep_vars=False): 64 | if self.use_sharded_state: 65 | return super().local_state_dict( 66 | destination=destination, prefix=prefix, keep_vars=keep_vars 67 | ) 68 | else: 69 | if self.rank == 0: 70 | return super().state_dict( 71 | destination=destination, prefix=prefix, keep_vars=keep_vars 72 | ) 73 | else: 74 | # We must call state_dict() due to use of communication 75 | # primitives. But we don't use the result. 
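# Side note (unrelated to the state_dict call below): the
# alpaca_force_full_precision flag chosen in __init__ is consumed by
# summon_full_params() further down.  On fewer than four data-parallel ranks
# with RTX 3090 cards the gathered parameters are rebuilt in half precision,
# presumably so the full weights fit within 24G, while every other setup
# rebuilds them in full precision.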
76 | super().state_dict() 77 | return destination or {} 78 | 79 | def load_state_dict(self, state_dict, strict=True, model_cfg=None): 80 | if self.use_sharded_state: 81 | return super().load_local_state_dict(state_dict, strict=strict) 82 | else: 83 | state_dict = dist_utils.broadcast_object( 84 | state_dict, src_rank=0, group=self.process_group, 85 | ) 86 | return super().load_state_dict(state_dict, strict=strict) 87 | 88 | @contextlib.contextmanager 89 | def summon_full_params(self, recurse: bool = True, volatile: bool = False): 90 | if recurse: 91 | with contextlib.ExitStack() as stack: 92 | # Summon all params for any nested FSDP instances. 93 | for module in self.modules(): 94 | if isinstance(module, FullyShardedDataParallel): 95 | stack.enter_context(module.summon_full_params(recurse=False, volatile=volatile)) 96 | # Yield to the caller, with full params in all nested instances. 97 | yield 98 | # Exiting from the ExitStack will re-shard params. 99 | return 100 | else: 101 | torch.cuda.synchronize() 102 | self._lazy_init() 103 | self.assert_state(TrainingState.IDLE) 104 | # Set the state so that we assert when trying to go into fwd/bwd. 105 | self.training_state = TrainingState.SUMMON_FULL_PARAMS 106 | full_tensors = self._rebuild_full_params(force_full_precision=self.alpaca_force_full_precision) 107 | assert full_tensors is not None 108 | with contextlib.ExitStack() as stack: 109 | if self.module.is_flattened: 110 | # Update flattened views to point to fully-sized tensors. We 111 | # use self.params instead of full_tensors since the 112 | # latter may contain padding. 113 | stack.enter_context( 114 | self.module.unflatten_params( 115 | flat_params=[p.data for p in self.params[: self._num_flatten_params]] 116 | ) 117 | ) 118 | try: 119 | yield 120 | finally: 121 | stack.close() 122 | non_shared_params = self.params 123 | # filter out shared params for all but the owner FSDP module. 124 | if len(full_tensors) < len(non_shared_params): 125 | non_shared_params = self.non_shared_params() 126 | assert len(full_tensors) == len( 127 | non_shared_params 128 | ), f"{len(full_tensors)} vs. {len(non_shared_params)}" 129 | for p, (full_tensor, safe_to_free) in zip(non_shared_params, full_tensors): 130 | if not volatile: 131 | # Copy any changes made to the full params back into 132 | # the corresponding local shards. 133 | local_shard, _ = self._get_shard(full_tensor) 134 | p._fp32_shard.copy_(local_shard.view_as(p._fp32_shard)) 135 | if safe_to_free: 136 | free_storage_(full_tensor) 137 | self.has_full_params = False 138 | self._use_fp32_param_shard() 139 | self.training_state = TrainingState.IDLE 140 | 141 | 142 | class DummyProcessGroup: 143 | def __init__(self, rank: int, size: int): 144 | self._rank = rank 145 | self._size = size 146 | 147 | def rank(self) -> int: 148 | return self._rank 149 | 150 | def size(self) -> int: 151 | return self._size 152 | 153 | 154 | @contextlib.contextmanager 155 | def fsdp_enable_wrap(cfg: DistributedTrainingConfig): 156 | try: 157 | from fairscale.nn import enable_wrap 158 | except ImportError: 159 | raise ImportError( 160 | "Cannot find FullyShardedDataParallel. 
" 161 | "Please install fairscale with: pip install fairscale" 162 | ) 163 | if cfg.memory_efficient_fp16: 164 | assert cfg.fp16 # memory_efficient_fp16 should imply fp16 165 | group = dist_utils.get_data_parallel_group() 166 | if group is None and cfg.distributed_world_size == 1: 167 | group = DummyProcessGroup(rank=0, size=1) 168 | fsdp_config = { 169 | "process_group": group, 170 | "reshard_after_forward": not cfg.no_reshard_after_forward, 171 | "mixed_precision": cfg.fp16 and not cfg.memory_efficient_fp16, 172 | "fp32_reduce_scatter": cfg.fp32_reduce_scatter, 173 | "flatten_parameters": not cfg.not_fsdp_flatten_parameters, 174 | "cpu_offload": cfg.cpu_offload, 175 | "compute_dtype": torch.float16 if cfg.fp16 else torch.float32, 176 | "bucket_cap_mb": cfg.bucket_cap_mb, 177 | "state_dict_device": torch.device("cpu"), # reduce GPU mem usage 178 | } 179 | with enable_wrap( 180 | wrapper_cls=FullyShardedDataParallel, 181 | use_sharded_state=cfg.use_sharded_state, 182 | **fsdp_config, 183 | ): 184 | yield 185 | 186 | 187 | def fsdp_wrap(module, min_num_params: Optional[int] = None, **kwargs): 188 | """ 189 | Helper to wrap layers/modules in FSDP. This falls back to a no-op if 190 | fairscale is not available. 191 | 192 | Args: 193 | module (nn.Module): module to (maybe) wrap 194 | min_num_params (int, Optional): minimum number of layer params to wrap 195 | """ 196 | try: 197 | from fairscale.nn import wrap 198 | 199 | if min_num_params is not None: 200 | num_params = sum(p.numel() for p in module.parameters()) 201 | if num_params >= min_num_params: 202 | return wrap(module, **kwargs) 203 | else: 204 | return module 205 | else: 206 | return wrap(module, **kwargs) 207 | except ImportError: 208 | return module 209 | -------------------------------------------------------------------------------- /alpaca/src/generate.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 -u 2 | # Copyright (c) Facebook, Inc. and its affiliates. 3 | # 4 | # This source code is licensed under the MIT license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | """ 7 | Translate pre-processed data with a trained model. 8 | """ 9 | 10 | import ast 11 | import logging 12 | import math 13 | import os 14 | import sys 15 | from argparse import Namespace 16 | from itertools import chain 17 | 18 | import numpy as np 19 | import torch 20 | from omegaconf import DictConfig 21 | 22 | from fairseq import checkpoint_utils, options, scoring, tasks, utils 23 | from fairseq.dataclass.utils import convert_namespace_to_omegaconf 24 | from fairseq.logging import progress_bar 25 | from fairseq.logging.meters import StopwatchMeter, TimeMeter 26 | import utils as distributed_utils 27 | 28 | 29 | def main(cfg: DictConfig): 30 | 31 | if isinstance(cfg, Namespace): 32 | cfg = convert_namespace_to_omegaconf(cfg) 33 | 34 | assert cfg.common_eval.path is not None, "--path required for generation!" 
35 | assert ( 36 | not cfg.generation.sampling or cfg.generation.nbest == cfg.generation.beam 37 | ), "--sampling requires --nbest to be equal to --beam" 38 | assert ( 39 | cfg.generation.replace_unk is None or cfg.dataset.dataset_impl == "raw" 40 | ), "--replace-unk requires a raw text dataset (--dataset-impl=raw)" 41 | 42 | if cfg.common_eval.results_path is not None: 43 | os.makedirs(cfg.common_eval.results_path, exist_ok=True) 44 | output_path = os.path.join( 45 | cfg.common_eval.results_path, 46 | "generate-{}.txt".format(cfg.dataset.gen_subset), 47 | ) 48 | with open(output_path, "w", buffering=1, encoding="utf-8") as h: 49 | return _main(cfg, h) 50 | else: 51 | return _main(cfg, sys.stdout) 52 | 53 | 54 | def get_symbols_to_strip_from_output(generator): 55 | if hasattr(generator, "symbols_to_strip_from_output"): 56 | return generator.symbols_to_strip_from_output 57 | else: 58 | return {generator.eos} 59 | 60 | 61 | def _main(cfg: DictConfig, output_file): 62 | logging.basicConfig( 63 | format="%(asctime)s | %(levelname)s | %(name)s | %(message)s", 64 | datefmt="%Y-%m-%d %H:%M:%S", 65 | level=os.environ.get("LOGLEVEL", "INFO").upper(), 66 | stream=output_file, 67 | ) 68 | logger = logging.getLogger("fairseq_cli.generate") 69 | 70 | utils.import_user_module(cfg.common) 71 | 72 | if cfg.dataset.max_tokens is None and cfg.dataset.batch_size is None: 73 | cfg.dataset.max_tokens = 12000 74 | logger.info(cfg) 75 | 76 | # Fix seed for stochastic decoding 77 | if cfg.common.seed is not None and not cfg.generation.no_seed_provided: 78 | np.random.seed(cfg.common.seed) 79 | utils.set_torch_seed(cfg.common.seed) 80 | 81 | use_cuda = torch.cuda.is_available() and not cfg.common.cpu 82 | 83 | # Load dataset splits 84 | task = tasks.setup_task(cfg.task) 85 | 86 | # Set dictionaries 87 | try: 88 | src_dict = getattr(task, "source_dictionary", None) 89 | except NotImplementedError: 90 | src_dict = None 91 | tgt_dict = task.target_dictionary 92 | 93 | overrides = ast.literal_eval(cfg.common_eval.model_overrides) 94 | 95 | logger.info("cfg.checkpoint.checkpoint_suffix {}".format(cfg.checkpoint.checkpoint_suffix)) 96 | # Load ensemble 97 | logger.info("loading model(s) from {}".format(cfg.common_eval.path)) 98 | models, saved_cfg = checkpoint_utils.load_model_ensemble( 99 | utils.split_paths(cfg.common_eval.path), 100 | arg_overrides=overrides, 101 | task=task, 102 | suffix=cfg.checkpoint.checkpoint_suffix, 103 | strict=(cfg.checkpoint.checkpoint_shard_count == 1), 104 | num_shards=cfg.checkpoint.checkpoint_shard_count, 105 | ) 106 | 107 | # loading the dataset should happen after the checkpoint has been loaded so we can give it the saved task config 108 | task.load_dataset(cfg.dataset.gen_subset, task_cfg=saved_cfg.task) 109 | 110 | if cfg.generation.lm_path is not None: 111 | overrides["data"] = cfg.task.data 112 | 113 | try: 114 | lms, _ = checkpoint_utils.load_model_ensemble( 115 | [cfg.generation.lm_path], arg_overrides=overrides, task=None 116 | ) 117 | except: 118 | logger.warning( 119 | f"Failed to load language model! 
Please make sure that the language model dict is the same " 120 | f"as target dict and is located in the data dir ({cfg.task.data})" 121 | ) 122 | raise 123 | 124 | assert len(lms) == 1 125 | else: 126 | lms = [None] 127 | 128 | # Optimize ensemble for generation 129 | for model in chain(models, lms): 130 | if model is None: 131 | continue 132 | if cfg.common.fp16: 133 | model.half() 134 | if use_cuda and not cfg.distributed_training.pipeline_model_parallel: 135 | model.cuda() 136 | model.prepare_for_inference_(cfg) 137 | 138 | # Load alignment dictionary for unknown word replacement 139 | # (None if no unknown word replacement, empty if no path to align dictionary) 140 | align_dict = utils.load_align_dict(cfg.generation.replace_unk) 141 | 142 | # Load dataset (possibly sharded) 143 | itr = task.get_batch_iterator( 144 | dataset=task.dataset(cfg.dataset.gen_subset), 145 | max_tokens=cfg.dataset.max_tokens, 146 | max_sentences=cfg.dataset.batch_size, 147 | max_positions=utils.resolve_max_positions( 148 | task.max_positions(), *[m.max_positions() for m in models] 149 | ), 150 | ignore_invalid_inputs=cfg.dataset.skip_invalid_size_inputs_valid_test, 151 | required_batch_size_multiple=cfg.dataset.required_batch_size_multiple, 152 | seed=cfg.common.seed, 153 | num_shards=cfg.distributed_training.distributed_world_size, 154 | shard_id=cfg.distributed_training.distributed_rank, 155 | num_workers=cfg.dataset.num_workers, 156 | data_buffer_size=cfg.dataset.data_buffer_size, 157 | ).next_epoch_itr(shuffle=False) 158 | progress = progress_bar.progress_bar( 159 | itr, 160 | log_format=cfg.common.log_format, 161 | log_interval=cfg.common.log_interval, 162 | default_log_format=("tqdm" if not cfg.common.no_progress_bar else "simple"), 163 | ) 164 | 165 | # Initialize generator 166 | gen_timer = StopwatchMeter() 167 | generator = task.build_generator(models, args=cfg.generation) 168 | 169 | # Handle tokenization and BPE 170 | tokenizer = task.build_tokenizer(cfg.tokenizer) 171 | bpe = task.build_bpe(cfg.bpe) 172 | 173 | def decode_fn(x): 174 | if bpe is not None: 175 | x = bpe.decode(x.tolist()) 176 | if tokenizer is not None: 177 | x = tokenizer.decode(x) 178 | return x 179 | 180 | scorer = scoring.build_scorer(cfg.scoring, tgt_dict) 181 | 182 | num_sentences = 0 183 | has_target = True 184 | wps_meter = TimeMeter() 185 | for sample in progress: 186 | sample = utils.move_to_cuda(sample) if use_cuda else sample 187 | if "net_input" not in sample: 188 | continue 189 | 190 | prefix_tokens = None 191 | if cfg.generation.prefix_size > 0: 192 | prefix_tokens = sample["target"][:, : cfg.generation.prefix_size] 193 | 194 | constraints = None 195 | if "constraints" in sample: 196 | constraints = sample["constraints"] 197 | 198 | gen_timer.start() 199 | hypos = task.inference_step( 200 | generator, 201 | models, 202 | sample, 203 | prefix_tokens=prefix_tokens, 204 | constraints=constraints, 205 | ) 206 | num_generated_tokens = sum(len(h[0]["tokens"]) for h in hypos) 207 | gen_timer.stop(num_generated_tokens) 208 | 209 | for i, sample_id in enumerate(sample["id"].tolist()): 210 | has_target = sample["target"] is not None 211 | 212 | # Remove padding 213 | if "src_tokens" in sample["net_input"]: 214 | src_tokens = utils.strip_pad( 215 | sample["net_input"]["src_tokens"][i, :], tgt_dict.pad() 216 | ) 217 | else: 218 | src_tokens = None 219 | 220 | target_tokens = None 221 | if has_target: 222 | target_tokens = ( 223 | utils.strip_pad(sample["target"][i, :], tgt_dict.pad()).int().cpu() 224 | ) 225 | 226 | # Either 
retrieve the original sentences or regenerate them from tokens. 227 | if align_dict is not None: 228 | src_str = task.dataset(cfg.dataset.gen_subset).src.get_original_text( 229 | sample_id 230 | ) 231 | target_str = task.dataset(cfg.dataset.gen_subset).tgt.get_original_text( 232 | sample_id 233 | ) 234 | else: 235 | if src_dict is not None: 236 | src_str = src_dict.string(src_tokens, cfg.common_eval.post_process) 237 | else: 238 | src_str = "" 239 | if has_target: 240 | target_str = tgt_dict.string( 241 | target_tokens, 242 | cfg.common_eval.post_process, 243 | escape_unk=True, 244 | extra_symbols_to_ignore=get_symbols_to_strip_from_output( 245 | generator 246 | ), 247 | ) 248 | 249 | src_str = decode_fn(src_tokens) 250 | if has_target: 251 | target_str = decode_fn(target_tokens) 252 | 253 | if "-model_part" in cfg.checkpoint.checkpoint_suffix and "-model_part-0" not in cfg.checkpoint.checkpoint_suffix: 254 | print(distributed_utils.get_model_parallel_rank()) 255 | continue 256 | 257 | if not cfg.common_eval.quiet: 258 | if src_dict is not None: 259 | print("S-{}\t{}".format(sample_id, src_str), file=output_file) 260 | # if has_target: 261 | # print("T-{}\t{}".format(sample_id, target_str), file=output_file) 262 | 263 | # Process top predictions 264 | for j, hypo in enumerate(hypos[i][: cfg.generation.nbest]): 265 | hypo_tokens, hypo_str, alignment = utils.post_process_prediction( 266 | hypo_tokens=hypo["tokens"].int().cpu(), 267 | src_str=src_str, 268 | alignment=hypo["alignment"], 269 | align_dict=align_dict, 270 | tgt_dict=tgt_dict, 271 | remove_bpe=cfg.common_eval.post_process, 272 | extra_symbols_to_ignore=get_symbols_to_strip_from_output(generator), 273 | ) 274 | detok_hypo_str = decode_fn(hypo_tokens) 275 | if not cfg.common_eval.quiet: 276 | score = hypo["score"] / math.log(2) # convert to base 2 277 | # original hypothesis (after tokenization and BPE) 278 | print( 279 | "H-{}\t{}\t{}".format(sample_id, score, hypo_str), 280 | file=output_file, 281 | ) 282 | # detokenized hypothesis 283 | print( 284 | "D-{}\t{}\t{}".format(sample_id, score, detok_hypo_str), 285 | file=output_file, 286 | ) 287 | # print( 288 | # "P-{}\t{}".format( 289 | # sample_id, 290 | # " ".join( 291 | # map( 292 | # lambda x: "{:.4f}".format(x), 293 | # # convert from base e to base 2 294 | # hypo["positional_scores"] 295 | # .div_(math.log(2)) 296 | # .tolist(), 297 | # ) 298 | # ), 299 | # ), 300 | # file=output_file, 301 | # ) 302 | 303 | if cfg.generation.print_alignment == "hard": 304 | print( 305 | "A-{}\t{}".format( 306 | sample_id, 307 | " ".join( 308 | [ 309 | "{}-{}".format(src_idx, tgt_idx) 310 | for src_idx, tgt_idx in alignment 311 | ] 312 | ), 313 | ), 314 | file=output_file, 315 | ) 316 | if cfg.generation.print_alignment == "soft": 317 | print( 318 | "A-{}\t{}".format( 319 | sample_id, 320 | " ".join( 321 | [",".join(src_probs) for src_probs in alignment] 322 | ), 323 | ), 324 | file=output_file, 325 | ) 326 | 327 | if cfg.generation.print_step: 328 | print( 329 | "I-{}\t{}".format(sample_id, hypo["steps"]), 330 | file=output_file, 331 | ) 332 | 333 | if cfg.generation.retain_iter_history: 334 | for step, h in enumerate(hypo["history"]): 335 | _, h_str, _ = utils.post_process_prediction( 336 | hypo_tokens=h["tokens"].int().cpu(), 337 | src_str=src_str, 338 | alignment=None, 339 | align_dict=None, 340 | tgt_dict=tgt_dict, 341 | remove_bpe=None, 342 | ) 343 | print( 344 | "E-{}_{}\t{}".format(sample_id, step, h_str), 345 | file=output_file, 346 | ) 347 | 348 | # Score only the top hypothesis 349 | 
if has_target and j == 0: 350 | if ( 351 | align_dict is not None 352 | or cfg.common_eval.post_process is not None 353 | ): 354 | # Convert back to tokens for evaluation with unk replacement and/or without BPE 355 | target_tokens = tgt_dict.encode_line( 356 | target_str, add_if_not_exist=True 357 | ) 358 | hypo_tokens = tgt_dict.encode_line( 359 | detok_hypo_str, add_if_not_exist=True 360 | ) 361 | if hasattr(scorer, "add_string"): 362 | scorer.add_string(target_str, detok_hypo_str) 363 | else: 364 | scorer.add(target_tokens, hypo_tokens) 365 | 366 | wps_meter.update(num_generated_tokens) 367 | progress.log({"wps": round(wps_meter.avg)}) 368 | num_sentences += ( 369 | sample["nsentences"] if "nsentences" in sample else sample["id"].numel() 370 | ) 371 | 372 | logger.info("NOTE: hypothesis and token scores are output in base 2") 373 | logger.info( 374 | "Translated {:,} sentences ({:,} tokens) in {:.1f}s ({:.2f} sentences/s, {:.2f} tokens/s)".format( 375 | num_sentences, 376 | gen_timer.n, 377 | gen_timer.sum, 378 | num_sentences / gen_timer.sum, 379 | 1.0 / gen_timer.avg, 380 | ) 381 | ) 382 | # if has_target: 383 | # if cfg.bpe and not cfg.generation.sacrebleu: 384 | # if cfg.common_eval.post_process: 385 | # logger.warning( 386 | # "BLEU score is being computed by splitting detokenized string on spaces, this is probably not what you want. Use --sacrebleu for standard 13a BLEU tokenization" 387 | # ) 388 | # else: 389 | # logger.warning( 390 | # "If you are using BPE on the target side, the BLEU score is computed on BPE tokens, not on proper words. Use --sacrebleu for standard 13a BLEU tokenization" 391 | # ) 392 | # # use print to be consistent with other main outputs: S-, H-, T-, D- and so on 393 | # print( 394 | # "Generate {} with beam={}: {}".format( 395 | # cfg.dataset.gen_subset, cfg.generation.beam, scorer.result_string() 396 | # ), 397 | # file=output_file, 398 | # ) 399 | 400 | return scorer 401 | 402 | 403 | def cli_main(): 404 | parser = options.get_generation_parser() 405 | # TODO: replace this workaround with refactoring of `AudioPretraining` 406 | parser.add_argument( 407 | "--arch", 408 | "-a", 409 | metavar="ARCH", 410 | default="wav2vec2", 411 | help="Model architecture. For constructing tasks that rely on " 412 | "model args (e.g. 
`AudioPretraining`)", 413 | ) 414 | args = options.parse_args_and_arch(parser) 415 | 416 | if args.model_parallel_size > 1: 417 | print("run megatron mode...") 418 | distributed_utils.call_main(convert_namespace_to_omegaconf(args), main) 419 | else: 420 | main(args) 421 | 422 | 423 | if __name__ == "__main__": 424 | cli_main() 425 | -------------------------------------------------------------------------------- /alpaca/src/generator/__pycache__/search.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/generator/__pycache__/search.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/generator/__pycache__/sequence_generator.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/generator/__pycache__/sequence_generator.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import torch 7 | from model.llama_model import LLaMA 8 | import argparse 9 | import logging 10 | 11 | logger = logging.getLogger(__name__) 12 | 13 | 14 | @torch.no_grad() 15 | def generate(alpaca): 16 | 17 | # load from txt 18 | # prompts = [ 19 | # "Give three tips for staying healthy.", 20 | # "What are the three primary colors?", 21 | # "Describe the structure of an atom.", 22 | # "Describe a time when you had to make a difficult decision.", 23 | # "Explain why the following fraction 4/16 is equivalent to 1/4", 24 | # "Write a short story in third person narration about a protagonist who has to make an important career decision.", 25 | # ] 26 | 27 | # load from files 28 | prompts = open("alpaca/scripts/assert/test.src").readlines() 29 | 30 | eval_kwargs = dict(sampling=True, sampling_topp=0.95, temperature=0.8) 31 | for prompt in prompts: 32 | print("-----" * 20) 33 | prompt_text = "## Instruction:\n{}\n\n## Response:".format(prompt) 34 | print(prompt_text) 35 | output = alpaca.sample([prompt_text], **eval_kwargs)[0][0] 36 | print(output) 37 | 38 | def main(): 39 | 40 | parser = argparse.ArgumentParser() 41 | parser.add_argument( 42 | "--model-dir", 43 | required=True, 44 | type=str, 45 | default="", 46 | help="path containing model file", 47 | ) 48 | parser.add_argument( 49 | "--model-file", 50 | default="", 51 | help="where in model_dir are weights saved", 52 | ) 53 | parser.add_argument( 54 | "--lora-model-inf", 55 | default="", 56 | help="where in model_dir are weights saved", 57 | ) 58 | parser.add_argument( 59 | "--lora-tuning", 60 | action="store_true", 61 | default=False, 62 | help="if true use XSUM_KWARGS else CNN_KWARGS", 63 | ) 64 | 65 | parser.add_argument("--bpe",) 66 | parser.add_argument("--sentencepiece-model") 67 | args = parser.parse_args() 68 | 69 | kwargs = { 70 | "user_dir": "alpaca/src", 71 | "lora_model_inf": args.lora_model_inf, 72 | "bpe": args.bpe, 73 | "sentencepiece_model": args.sentencepiece_model, 74 | "source_lang": 'src', 75 | "target_lang": 'tgt', 76 | "lora_tuning": args.lora_tuning, 77 
| "task": "seq2seq_lora_task", 78 | } 79 | alpaca = LLaMA.from_pretrained( 80 | model_name_or_path=args.model_dir, 81 | checkpoint_file=args.model_file, 82 | **kwargs, 83 | ) 84 | alpaca = alpaca.eval() 85 | if torch.cuda.is_available(): 86 | alpaca = alpaca.half().cuda() 87 | 88 | generate(alpaca) 89 | 90 | 91 | if __name__ == "__main__": 92 | main() 93 | -------------------------------------------------------------------------------- /alpaca/src/loss/__pycache__/lm_loss.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/loss/__pycache__/lm_loss.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/loss/lm_loss.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import math 7 | from dataclasses import dataclass, field 8 | 9 | import torch 10 | from fairseq import metrics, utils 11 | from fairseq.criterions import FairseqCriterion, register_criterion 12 | from fairseq.dataclass import FairseqDataclass 13 | import torch.nn.functional as F 14 | 15 | 16 | @register_criterion("lm_loss") 17 | class LMLabelSmoothedCrossEntropyCriterion(FairseqCriterion): 18 | def __init__( 19 | self, 20 | task, 21 | ): 22 | super().__init__(task) 23 | self.eps = 0.1 24 | self.pad = task.tgt_dict.pad() 25 | 26 | def forward(self, model, sample, reduce=True): 27 | target = sample["target"] 28 | target_mask = target.ne(self.pad) 29 | mask = sample["net_input"]['seq_mask'].eq(1) 30 | 31 | output = model(sample["net_input"]['seq_input']) 32 | 33 | loss, nll_loss = self.label_smooth_loss(output[mask], target[target_mask]) 34 | 35 | sample_size = 1 36 | logging_output = { 37 | "loss": loss.data, 38 | "nll_loss": nll_loss.data, 39 | "ntokens": sample["ntokens"], 40 | "nsentences": sample["target"].size(0), 41 | "sample_size": sample_size, 42 | } 43 | return loss, sample_size, logging_output 44 | 45 | def label_smooth_loss(self, net_out, net_target): 46 | net_logits = F.log_softmax(net_out, dim=-1) 47 | nll_loss = F.nll_loss(net_logits, net_target, reduction="none").float().mean() 48 | loss = nll_loss * (1. - self.eps) - net_logits.float().mean() * self.eps 49 | return loss, nll_loss 50 | 51 | @classmethod 52 | def reduce_metrics(cls, logging_outputs) -> None: 53 | """Aggregate logging outputs from data parallel training.""" 54 | loss_sum = sum(log.get("loss", 0) for log in logging_outputs) 55 | nll_loss_sum = sum(log.get("nll_loss", 0) for log in logging_outputs) 56 | ntokens = sum(log.get("ntokens", 0) for log in logging_outputs) 57 | sample_size = sum(log.get("sample_size", 0) for log in logging_outputs) 58 | 59 | metrics.log_scalar( 60 | "loss", loss_sum / sample_size / math.log(2), sample_size, round=3 61 | ) 62 | metrics.log_scalar( 63 | "nll_loss", nll_loss_sum / sample_size / math.log(2), sample_size, round=3 64 | ) 65 | metrics.log_derived( 66 | "ppl", lambda meters: utils.get_perplexity(meters["loss"].avg) 67 | ) 68 | 69 | @staticmethod 70 | def logging_outputs_can_be_summed() -> bool: 71 | """ 72 | Whether the logging outputs returned by `forward` can be summed 73 | across workers prior to calling `reduce_metrics`. 
Setting this 74 | to True will improves distributed training speed. 75 | """ 76 | return True 77 | -------------------------------------------------------------------------------- /alpaca/src/megatron_trainer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | """ 7 | Train a network across multiple GPUs. 8 | """ 9 | 10 | from fairseq.dataclass.configs import FairseqConfig 11 | import utils as distributed_utils 12 | from trainer import Trainer 13 | from fairscale.nn.model_parallel.random import get_cuda_rng_tracker 14 | 15 | class MegatronTrainer(Trainer): 16 | """Main class for model parallel with data parallel training.""" 17 | 18 | def __init__(self, cfg: FairseqConfig, task, model, criterion, **kwargs): 19 | super().__init__(cfg, task, model, criterion, **kwargs) 20 | 21 | def clip_grad_norm(self, clip_norm): 22 | def _aggregate_model_parallel_grad_norm(total_norm): 23 | total_norm = total_norm**2 24 | distributed_utils.all_reduce( 25 | total_norm, group=distributed_utils.get_model_parallel_group() 26 | ) 27 | total_norm = total_norm**0.5 28 | return total_norm 29 | 30 | return self.optimizer.clip_grad_norm( 31 | clip_norm, 32 | aggregate_norm_fn=_aggregate_model_parallel_grad_norm, 33 | ) 34 | 35 | def save_checkpoint(self, filename, extra_state): 36 | """Save all training state in a checkpoint file.""" 37 | extra_state["rng_tracker_states"] = get_cuda_rng_tracker().get_states() 38 | super().save_checkpoint(filename, extra_state) 39 | 40 | def load_checkpoint( 41 | self, 42 | filename, 43 | reset_optimizer=False, 44 | reset_lr_scheduler=False, 45 | optimizer_overrides=None, 46 | reset_meters=False, 47 | ): 48 | extra_state = super().load_checkpoint( 49 | filename, 50 | reset_optimizer=reset_optimizer, 51 | reset_lr_scheduler=reset_lr_scheduler, 52 | optimizer_overrides=optimizer_overrides, 53 | reset_meters=reset_meters, 54 | ) 55 | if extra_state is not None and "rng_tracker_states" in extra_state: 56 | get_cuda_rng_tracker().set_states(extra_state["rng_tracker_states"]) 57 | return extra_state 58 | -------------------------------------------------------------------------------- /alpaca/src/model/__pycache__/hub_interface.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/hub_interface.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/model/__pycache__/llama_megatron_transformer.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/llama_megatron_transformer.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/model/__pycache__/llama_model.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/llama_model.cpython-37.pyc -------------------------------------------------------------------------------- 
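MegatronTrainer.clip_grad_norm above folds the per-shard gradient norms into a global norm by all-reducing the squared local norms over the model-parallel group and taking the square root. A single-process sketch of that arithmetic, with toy tensors standing in for the shards held by two ranks:

import torch

# Two stand-in gradient shards, as if held by two model-parallel ranks.
shards = [torch.randn(1000), torch.randn(1000)]

# Each rank contributes its squared local norm; the all_reduce sums them and
# the square root of that sum is the global norm.
local_sq = torch.stack([s.norm() ** 2 for s in shards])
global_norm = local_sq.sum() ** 0.5

# Identical to the norm of the full, un-sharded gradient vector.
assert torch.allclose(global_norm, torch.cat(shards).norm())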
/alpaca/src/model/__pycache__/llama_transformer.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/llama_transformer.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/model/__pycache__/lora_modules.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/model/__pycache__/lora_modules.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/model/hub_interface.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import copy 7 | import logging 8 | from typing import Dict, List 9 | 10 | import numpy as np 11 | import torch 12 | import torch.nn as nn 13 | import torch.nn.functional as F 14 | from fairseq import utils 15 | from fairseq.data import encoders 16 | from fairseq.hub_utils import GeneratorHubInterface 17 | from omegaconf import open_dict 18 | 19 | 20 | logger = logging.getLogger(__name__) 21 | 22 | 23 | class LLaMAHubInterface(GeneratorHubInterface): 24 | 25 | def __init__(self, cfg, task, model): 26 | super().__init__(cfg, task, [model]) 27 | self.model = self.models[0] 28 | 29 | def encode( 30 | self, sentence: str, *addl_sentences, no_separator=True 31 | ) -> torch.LongTensor: 32 | bpe_sentence = " " + self.bpe.encode(sentence) 33 | tokens = self.task.target_dictionary.encode_line(bpe_sentence, append_eos=False) 34 | return tokens.long() 35 | 36 | def decode(self, tokens: torch.LongTensor): 37 | tokens = tokens.cpu().numpy() 38 | sentences = [self.bpe.sp.decode(tokens.tolist())] 39 | return sentences 40 | 41 | def sample( 42 | self, sentences: List[str], **kwargs 43 | ) -> List[str]: 44 | tokenized_sentences = [self.encode(sentence) for sentence in sentences] 45 | batched_hypos = self.generate(tokenized_sentences, **kwargs) 46 | return [self.decode(hypos[0]["tokens"]) for hypos in batched_hypos] 47 | 48 | def generate( 49 | self, 50 | tokenized_sentences: List[torch.LongTensor], 51 | **kwargs 52 | ) -> List[List[Dict[str, torch.Tensor]]]: 53 | 54 | generator = self.task.build_generator( 55 | self.models, 56 | **kwargs, 57 | ) 58 | 59 | results = [] 60 | for batch in self._build_batches(tokenized_sentences, skip_invalid_size_inputs=False): 61 | batch = utils.apply_to_sample(lambda t: t.to(self.device), batch) 62 | translations = self.task.inference_step( 63 | generator, self.models, batch, 64 | ) 65 | for id, hypos in zip(batch["id"].tolist(), translations): 66 | results.append((id, hypos)) 67 | 68 | # sort output to match input order 69 | outputs = [hypos for _, hypos in sorted(results, key=lambda x: x[0])] 70 | return outputs 71 | -------------------------------------------------------------------------------- /alpaca/src/model/llama_megatron_transformer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 
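# Usage note for LLaMAHubInterface above, sketched after inference.py (local
# paths below are hypothetical): once a checkpoint has been converted with
# process_llama_ckpt.py, generation goes through encode/sample/decode:
#
#   alpaca = LLaMA.from_pretrained(
#       model_name_or_path="/path/to/llama/7B/",   # directory holding model_no_pad.pt
#       checkpoint_file="model_no_pad.pt",
#       user_dir="alpaca/src",
#       task="seq2seq_ft_task",
#       bpe="sentencepiece",
#       sentencepiece_model="/path/to/tokenizer.model",
#   )
#   prompt = "## Instruction:\nGive three tips for staying healthy.\n\n## Response:"
#   print(alpaca.sample([prompt], sampling=True, sampling_topp=0.95, temperature=0.8)[0][0])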
2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | from typing import Dict, List, Optional, Tuple 7 | import os 8 | import math 9 | import logging 10 | 11 | import torch 12 | from torch import Tensor, nn 13 | import torch.nn.functional as F 14 | from fairseq import options, utils 15 | 16 | from fairscale.nn.model_parallel import initialize as mpu 17 | from fairscale.nn.model_parallel.initialize import initialize_model_parallel 18 | from fairscale.nn.model_parallel.mappings import scatter_to_model_parallel_region, gather_from_model_parallel_region 19 | import fairscale.nn.model_parallel.initialize as fs_init 20 | from fairscale.nn.model_parallel.layers import ( 21 | ParallelEmbedding, 22 | RowParallelLinear, 23 | ColumnParallelLinear, 24 | ) 25 | from fairseq.modules.checkpoint_activations import checkpoint_wrapper 26 | from .lora_modules import LoRA 27 | 28 | logger = logging.getLogger(__name__) 29 | 30 | 31 | class LLaMAMegatronTransformer(nn.Module): 32 | 33 | def __init__(self, cfg, tgt_dict, embed_tokens, lora_tuning): 34 | super().__init__() 35 | 36 | self.lora_tuning = lora_tuning 37 | 38 | self.tgt_dict = tgt_dict 39 | self.embed_dim = cfg.decoder_embed_dim 40 | self.num_layers = cfg.decoder_layers 41 | self.num_heads = cfg.decoder_attention_heads 42 | self.head_dim = self.embed_dim // self.num_heads 43 | self.max_target_positions = cfg.max_target_positions 44 | 45 | self.pad = self.tgt_dict.pad() 46 | self.embed_tokens = embed_tokens 47 | 48 | self.layers = torch.nn.ModuleList() 49 | self.layers.extend( 50 | [ 51 | self.build_decoder_layer(cfg, self.lora_tuning) 52 | for _ in range(self.num_layers) 53 | ] 54 | ) 55 | 56 | self.layer_norm = RMSNorm(self.embed_dim) 57 | self.output_projection = ColumnParallelLinear( 58 | self.embed_dim, len(self.tgt_dict), bias=False, init_method=lambda x: x 59 | ) 60 | 61 | self.freqs_cis = self.precompute_freqs_cis( 62 | self.embed_dim // self.num_heads, self.max_target_positions * 2 63 | ) 64 | self._future_mask = torch.empty(0) 65 | 66 | def build_decoder_layer(self, cfg, lora_tuning): 67 | layer = LLaMATransformerLayer(cfg, lora_tuning) 68 | checkpoint = cfg.checkpoint_activations 69 | if checkpoint: 70 | offload_to_cpu = cfg.offload_activations 71 | layer = checkpoint_wrapper(layer, offload_to_cpu=offload_to_cpu) 72 | return layer 73 | 74 | def precompute_freqs_cis(self, dim: int, end: int, theta: float = 10000.0): 75 | freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim)) 76 | t = torch.arange(end, device=freqs.device) # type: ignore 77 | freqs = torch.outer(t, freqs).float() # type: ignore 78 | freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64 79 | return freqs_cis 80 | 81 | def output_layer(self, x): 82 | return self.output_projection(x).float() 83 | 84 | def buffered_future_mask(self, tensor): 85 | dim = tensor.size(1) 86 | if ( 87 | self._future_mask.size(0) == 0 88 | or (not self._future_mask.device == tensor.device) 89 | or self._future_mask.size(0) < dim 90 | ): 91 | self._future_mask = torch.triu( 92 | utils.fill_with_neg_inf(torch.zeros([dim, dim])), 1 93 | ) 94 | self._future_mask = self._future_mask.to(tensor) 95 | return self._future_mask[:dim, :dim] 96 | 97 | def forward_inf( 98 | self, 99 | prev_output_tokens, 100 | incremental_state: Optional[Dict[str, Dict[str, Optional[Tensor]]]] = None, 101 | src_pos: Optional[Tensor] = None, 102 | tgt_pos: Optional[Tensor] = None, 103 | trunc_flg: bool = 
False, 104 | ): 105 | 106 | if incremental_state is not None and trunc_flg: 107 | prev_output_tokens = prev_output_tokens[:, -1:] 108 | 109 | bsz, target_len = prev_output_tokens.size() 110 | x = self.embed_tokens(prev_output_tokens) 111 | 112 | key_padding_mask = prev_output_tokens.eq(self.pad) 113 | if incremental_state is not None: 114 | key_padding_mask = torch.cat([incremental_state['padding_mask'], key_padding_mask], dim=-1) 115 | 116 | self.freqs_cis = self.freqs_cis.to(x.device) 117 | if incremental_state is not None: 118 | freqs_cis = self.freqs_cis[:key_padding_mask.size(1)] 119 | else: 120 | freqs_cis = self.freqs_cis[:target_len] 121 | 122 | if incremental_state is not None: 123 | tgt_attn_mask = self.buffered_future_mask(x) 124 | tgt_len = tgt_attn_mask.size(1) 125 | src_len = key_padding_mask.size(1) 126 | src_attn_mask = torch.torch.zeros([tgt_len, src_len - tgt_len]).to(tgt_attn_mask) 127 | self_attn_mask = torch.cat([src_attn_mask, tgt_attn_mask], dim=1) 128 | else: 129 | self_attn_mask = self.buffered_future_mask(x) 130 | 131 | hidden_state = [x] 132 | attn_state = None 133 | for layer_idx, layer in enumerate(self.layers): 134 | 135 | if incremental_state is not None: 136 | context = torch.cat([incremental_state[layer_idx]['key'], x], dim=1) 137 | else: 138 | context = x 139 | 140 | x, attn = layer( 141 | x, 142 | context, 143 | freqs_cis, 144 | key_padding_mask, 145 | self_attn_mask, 146 | src_pos, 147 | tgt_pos, 148 | ) 149 | 150 | attn_state = attn 151 | hidden_state.append(x) 152 | 153 | attn_state = attn_state.mean(dim=1) 154 | x = self.layer_norm(x) 155 | return x, key_padding_mask, attn_state, hidden_state 156 | 157 | def forward(self, prev_output_tokens): 158 | bsz, target_len = prev_output_tokens.size() 159 | x = self.embed_tokens(prev_output_tokens) 160 | 161 | key_padding_mask = prev_output_tokens.eq(self.pad) 162 | freqs_cis = self.freqs_cis.to(x.device)[:target_len] 163 | self_attn_mask = self.buffered_future_mask(x) 164 | 165 | hidden_state = [x] 166 | attn_state = None 167 | for layer_idx, layer in enumerate(self.layers): 168 | 169 | x, attn = layer( 170 | x, 171 | x, 172 | freqs_cis, 173 | key_padding_mask, 174 | self_attn_mask, 175 | ) 176 | 177 | attn_state = attn 178 | hidden_state.append(x) 179 | 180 | attn_state = attn_state.mean(dim=1) 181 | x = self.layer_norm(x) 182 | return x, key_padding_mask, attn_state, hidden_state 183 | 184 | class LLaMATransformerLayer(nn.Module): 185 | 186 | def __init__(self, cfg, lora_tuning): 187 | super().__init__() 188 | 189 | self.lora_tuning = lora_tuning 190 | 191 | self.embed_dim = cfg.decoder_embed_dim 192 | self.num_heads = cfg.decoder_attention_heads 193 | self.ffn_embed_dim = cfg.decoder_ffn_embed_dim 194 | 195 | self.attention = LLaMAAttention(self.num_heads, self.embed_dim, lora_tuning) 196 | self.feed_forward = LLaMAFeedForward(self.embed_dim, self.ffn_embed_dim) 197 | 198 | self.attention_norm = RMSNorm(self.embed_dim) 199 | self.ffn_norm = RMSNorm(self.embed_dim) 200 | 201 | def forward( 202 | self, 203 | query: Tensor, 204 | key_value: Tensor, 205 | freqs_cis: Tensor, 206 | key_padding_mask: Optional[Tensor], 207 | self_attn_mask: Optional[Tensor], 208 | src_pos: Optional[Tensor] = None, 209 | tgt_pos: Optional[Tensor] = None, 210 | ): 211 | 212 | x, attn = self.attention( 213 | self.attention_norm(query), 214 | self.attention_norm(key_value), 215 | freqs_cis, 216 | key_padding_mask, 217 | self_attn_mask, 218 | src_pos, 219 | tgt_pos, 220 | ) 221 | x = query + x 222 | x = x + 
self.feed_forward(self.ffn_norm(x)) 223 | 224 | return x, attn 225 | 226 | class RMSNorm(torch.nn.Module): 227 | 228 | def __init__(self, dim: int, eps: float = 1e-6): 229 | super().__init__() 230 | 231 | self.eps = eps 232 | self.weight = nn.Parameter(torch.ones(dim)) 233 | 234 | def _norm(self, x): 235 | return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) 236 | 237 | def forward(self, x): 238 | output = self._norm(x.float()).type_as(x) 239 | return output * self.weight 240 | 241 | class LLaMAAttention(nn.Module): 242 | 243 | def __init__(self, num_heads, embed_dim, lora_tuning): 244 | super().__init__() 245 | 246 | self.lora_tuning = lora_tuning 247 | 248 | self.num_heads = num_heads 249 | self.embed_dim = embed_dim 250 | self.head_dim = embed_dim // num_heads 251 | self.local_num_heads = self.num_heads // fs_init.get_model_parallel_world_size() 252 | 253 | self.q_proj = ColumnParallelLinear( 254 | self.embed_dim, 255 | self.embed_dim, 256 | bias=False, 257 | gather_output=False, 258 | init_method=lambda x: x, 259 | ) 260 | self.k_proj = ColumnParallelLinear( 261 | self.embed_dim, 262 | self.embed_dim, 263 | bias=False, 264 | gather_output=False, 265 | init_method=lambda x: x, 266 | ) 267 | self.v_proj = ColumnParallelLinear( 268 | self.embed_dim, 269 | self.embed_dim, 270 | bias=False, 271 | gather_output=False, 272 | init_method=lambda x: x, 273 | ) 274 | self.out_proj = RowParallelLinear( 275 | self.embed_dim, 276 | self.embed_dim, 277 | bias=False, 278 | input_is_parallel=True, 279 | init_method=lambda x: x, 280 | ) 281 | 282 | if self.lora_tuning: 283 | self.q_lora = LoRA(self.embed_dim, self.embed_dim) 284 | self.k_lora = LoRA(self.embed_dim, self.embed_dim) 285 | self.v_lora = LoRA(self.embed_dim, self.embed_dim) 286 | 287 | def apply_rotary_emb( 288 | self, 289 | query: Tensor, 290 | key: Tensor, 291 | freqs_cis: Tensor, 292 | src_pos: Tensor, 293 | tgt_pos: Tensor, 294 | ) -> Tuple[Tensor, Tensor]: 295 | 296 | def reshape_for_broadcast(freqs_cis: Tensor, x: Tensor): 297 | ndim = x.ndim 298 | assert 0 <= 1 < ndim 299 | assert freqs_cis.shape == (x.shape[1], x.shape[-1]) 300 | shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)] 301 | return freqs_cis.view(*shape) 302 | 303 | q_ = torch.view_as_complex(query.float().reshape(*query.shape[:-1], -1, 2)) 304 | k_ = torch.view_as_complex(key.float().reshape(*key.shape[:-1], -1, 2)) 305 | 306 | if src_pos is not None and tgt_pos is not None: 307 | if freqs_cis.size(0) == q_.size(1): 308 | freqs_cis = reshape_for_broadcast(freqs_cis, q_) 309 | q_list = [] 310 | k_list = [] 311 | for idx, attn_p in enumerate(src_pos): 312 | q_list.append(q_[idx] * freqs_cis.index_select(dim=1, index=attn_p)) 313 | k_list.append(k_[idx] * freqs_cis.index_select(dim=1, index=attn_p)) 314 | q_out = torch.view_as_real(torch.cat(q_list, dim=0)).flatten(3) 315 | k_out = torch.view_as_real(torch.cat(k_list, dim=0)).flatten(3) 316 | else: 317 | freqs_cis = reshape_for_broadcast(freqs_cis, k_) 318 | q_list = [] 319 | k_list = [] 320 | idx = 0 321 | for q_pos, k_pos in zip(tgt_pos, torch.cat([src_pos, tgt_pos], dim=-1)): 322 | q_list.append(q_[idx] * freqs_cis.index_select(dim=1, index=q_pos)) 323 | k_list.append(k_[idx] * freqs_cis.index_select(dim=1, index=k_pos)) 324 | idx += 1 325 | q_out = torch.view_as_real(torch.cat(q_list, dim=0)).flatten(3) 326 | k_out = torch.view_as_real(torch.cat(k_list, dim=0)).flatten(3) 327 | else: 328 | freqs_cis = reshape_for_broadcast(freqs_cis, q_) 329 | q_out = torch.view_as_real(q_ * 
freqs_cis).flatten(3) 330 | k_out = torch.view_as_real(k_ * freqs_cis).flatten(3) 331 | return q_out.type_as(query), k_out.type_as(key) 332 | 333 | def forward( 334 | self, 335 | query: Tensor, 336 | key_value: Tensor, 337 | freqs_cis: Tensor, 338 | key_padding_mask: Optional[Tensor] = None, 339 | attn_mask: Optional[Tensor] = None, 340 | src_pos: Optional[Tensor] = None, 341 | tgt_pos: Optional[Tensor] = None, 342 | ): 343 | 344 | bsz, tgt_len, embed_dim = query.size() 345 | bsz, src_len, embed_dim = key_value.size() 346 | 347 | q = self.q_proj(query) 348 | k = self.k_proj(key_value) 349 | v = self.v_proj(key_value) 350 | 351 | if self.lora_tuning: 352 | 353 | q = gather_from_model_parallel_region(q) + self.q_lora(query) 354 | k = gather_from_model_parallel_region(k) + self.k_lora(key_value) 355 | v = gather_from_model_parallel_region(v) + self.v_lora(key_value) 356 | 357 | q = scatter_to_model_parallel_region(q) 358 | k = scatter_to_model_parallel_region(k) 359 | v = scatter_to_model_parallel_region(v) 360 | 361 | q = q.view(bsz, tgt_len, self.local_num_heads, self.head_dim) 362 | k = k.view(bsz, src_len, self.local_num_heads, self.head_dim) 363 | v = v.view(bsz, src_len, self.local_num_heads, self.head_dim) 364 | 365 | q, k = self.apply_rotary_emb(q, k, freqs_cis=freqs_cis, src_pos=src_pos, tgt_pos=tgt_pos) 366 | 367 | q = q.transpose(1, 2) 368 | k = k.transpose(1, 2) 369 | v = v.transpose(1, 2) 370 | 371 | attn_scores = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(self.head_dim) 372 | 373 | if attn_mask is not None: 374 | attn_scores = attn_scores + attn_mask.unsqueeze(0).unsqueeze(1) 375 | attn_scores = attn_scores.masked_fill( 376 | key_padding_mask.unsqueeze(1).unsqueeze(2), 377 | float("-inf") 378 | ) 379 | 380 | attn_softmax_scores = F.softmax(attn_scores.float(), dim=-1).type_as(q) 381 | output = torch.matmul(attn_softmax_scores, v) 382 | output = output.transpose(1, 2).contiguous().view(bsz, tgt_len, -1) 383 | return self.out_proj(output), attn_softmax_scores 384 | 385 | class LLaMAFeedForward(nn.Module): 386 | 387 | def __init__(self, embed_dim: int, hidden_dim: int): 388 | super().__init__() 389 | 390 | self.embed_dim = embed_dim 391 | self.hidden_dim = hidden_dim 392 | 393 | multiple_of = 256 394 | self.hidden_dim = int(2 * self.hidden_dim / 3) 395 | self.hidden_dim = multiple_of * ((self.hidden_dim + multiple_of - 1) // multiple_of) 396 | 397 | self.w1 = ColumnParallelLinear( 398 | self.embed_dim, self.hidden_dim, bias=False, gather_output=False, init_method=lambda x: x 399 | ) 400 | self.w2 = RowParallelLinear( 401 | self.hidden_dim, self.embed_dim, bias=False, input_is_parallel=True, init_method=lambda x: x 402 | ) 403 | self.w3 = ColumnParallelLinear( 404 | self.embed_dim, self.hidden_dim, bias=False, gather_output=False, init_method=lambda x: x 405 | ) 406 | 407 | def forward(self, x): 408 | return self.w2(F.silu(self.w1(x)) * self.w3(x)) 409 | -------------------------------------------------------------------------------- /alpaca/src/model/llama_model.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 
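#
# Overview of this module (summary of the code below): LLaMA.build_model registers the "llama"
# architecture with fairseq and constructs either a plain LLaMATransformer decoder or, when the
# task sets megatron_model, a model-parallel LLaMAMegatronTransformer (after initializing the
# fairscale model-parallel group and padding both dictionaries to a multiple of world_size * 8).
# With lora_tuning enabled, every parameter whose name does not contain "lora" is frozen.
# upgrade_state_dict_named adapts an original LLaMA checkpoint to these module names, for example:
#   tok_embeddings.weight           -> decoder.embed_tokens.weight
#   layers.<i>.attention.wq.weight  -> decoder.layers.<i>.attention.q_proj.weight
#   output.weight                   -> decoder.output_projection.weight
#   norm.weight                     -> decoder.layer_norm.weight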
5 | 6 | 7 | from dataclasses import dataclass, field 8 | from typing import Dict, List, Optional, Tuple 9 | import os 10 | from omegaconf import II 11 | import math 12 | import logging 13 | 14 | import torch 15 | from torch import Tensor, nn 16 | import torch.nn.functional as F 17 | from fairseq import options, utils 18 | from fairseq.dataclass import ChoiceEnum, FairseqDataclass 19 | from fairseq.models import ( 20 | BaseFairseqModel, 21 | register_model, 22 | register_model_architecture, 23 | ) 24 | from fairseq.models.transformer import DEFAULT_MIN_PARAMS_TO_WRAP 25 | from fairscale.nn.model_parallel import initialize as mpu 26 | from .hub_interface import LLaMAHubInterface 27 | from .llama_transformer import LLaMATransformer 28 | from .llama_megatron_transformer import LLaMAMegatronTransformer 29 | from fairscale.nn.model_parallel.layers import ParallelEmbedding 30 | from fairseq.utils import safe_getattr, safe_hasattr 31 | 32 | 33 | logger = logging.getLogger(__name__) 34 | 35 | 36 | @dataclass 37 | class LLaMAConfig(FairseqDataclass): 38 | 39 | dropout: float = field(default=0.1, metadata={"help": "dropout probability"}) 40 | attention_dropout: float = field( 41 | default=0.0, metadata={"help": "dropout probability for attention weights"} 42 | ) 43 | decoder_embed_dim: int = field( 44 | default=512, metadata={"help": "decoder embedding dimension"} 45 | ) 46 | decoder_ffn_embed_dim: int = field( 47 | default=2048, metadata={"help": "decoder embedding dimension for FFN"} 48 | ) 49 | decoder_layers: int = field(default=6, metadata={"help": "num decoder layers"}) 50 | decoder_attention_heads: int = field( 51 | default=8, metadata={"help": "num decoder attention heads"} 52 | ) 53 | max_target_positions: Optional[int] = II("task.max_target_positions") 54 | checkpoint_activations: bool = field( 55 | default=False, metadata={"help": "checkpoint activations at each layer"} 56 | ) 57 | offload_activations: bool = field( 58 | default=False, 59 | metadata={"help": "move checkpointed activations to CPU after they are used."}, 60 | ) 61 | min_params_to_wrap: int = field( 62 | default=DEFAULT_MIN_PARAMS_TO_WRAP, 63 | metadata={ 64 | "help": ("minimum number of params for a layer to be wrapped with FSDP()") 65 | }, 66 | ) 67 | 68 | 69 | def Embedding(num_embeddings, embedding_dim, padding_idx): 70 | m = nn.Embedding(num_embeddings, embedding_dim, padding_idx=padding_idx) 71 | nn.init.normal_(m.weight, mean=0, std=embedding_dim**-0.5) 72 | nn.init.constant_(m.weight[padding_idx], 0) 73 | return m 74 | 75 | @register_model("llama", dataclass=LLaMAConfig) 76 | class LLaMA(BaseFairseqModel): 77 | 78 | def __init__(self, decoder, lora_tuning): 79 | super().__init__() 80 | self.decoder = decoder 81 | 82 | self.lora_tuning = lora_tuning 83 | logger.info('run efficient-tuning method {}'.format(self.lora_tuning)) 84 | if self.lora_tuning: 85 | self.mark_only_lora_as_trainable() 86 | self.lora_model_inf = None 87 | 88 | def set_lora_model_inf(self, lora_model_inf): 89 | self.lora_model_inf = lora_model_inf 90 | 91 | def mark_only_lora_as_trainable(self) -> None: 92 | for n, p in self.named_parameters(): 93 | if 'lora' not in n: 94 | p.requires_grad = False 95 | else: 96 | p.requires_grad = True 97 | 98 | @classmethod 99 | def build_model(cls, args, task): 100 | """Build a new model instance.""" 101 | llama_base_architecture(args) 102 | 103 | logger.info("rescale [src] dictionary: {} types and [tgt] dictionary: {} types".format( 104 | len(task.source_dictionary), len(task.target_dictionary))) 105 | 106 | 
lora_tuning = safe_getattr(task, "lora_tuning", False) 107 | if safe_getattr(task, "megatron_model", False): 108 | cls.initialize_model_parallel() 109 | 110 | task.source_dictionary.pad_to_multiple_(torch.distributed.get_world_size() * 8) 111 | task.target_dictionary.pad_to_multiple_(torch.distributed.get_world_size() * 8) 112 | 113 | embed_tokens = cls.build_megatron_embedding(args, task.target_dictionary, args.decoder_embed_dim) 114 | decoder = LLaMAMegatronTransformer(args, task.target_dictionary, embed_tokens, lora_tuning) 115 | else: 116 | embed_tokens = cls.build_embedding(args, task.target_dictionary, args.decoder_embed_dim) 117 | decoder = LLaMATransformer(args, task.target_dictionary, embed_tokens, lora_tuning) 118 | 119 | return cls(decoder, lora_tuning) 120 | 121 | @classmethod 122 | def initialize_model_parallel(cls): 123 | logger.info("llama model init process group") 124 | 125 | if not torch.distributed.is_initialized(): 126 | torch.distributed.init_process_group("nccl") 127 | 128 | if not mpu.model_parallel_is_initialized(): 129 | ws = torch.distributed.get_world_size() 130 | mpu.initialize_model_parallel(ws) 131 | 132 | @classmethod 133 | def build_megatron_embedding(cls, args, dictionary, embed_dim): 134 | embed_tokens = ParallelEmbedding(len(dictionary), embed_dim, init_method=lambda x: x) 135 | return embed_tokens 136 | 137 | @classmethod 138 | def build_embedding(cls, cfg, dictionary, embed_dim): 139 | emb = Embedding(len(dictionary), embed_dim, dictionary.pad()) 140 | return emb 141 | 142 | @classmethod 143 | def from_pretrained( 144 | cls, 145 | model_name_or_path, 146 | checkpoint_file, 147 | **kwargs 148 | ): 149 | from fairseq import hub_utils 150 | 151 | x = hub_utils.from_pretrained( 152 | model_name_or_path, 153 | checkpoint_file, 154 | **kwargs, 155 | ) 156 | return LLaMAHubInterface(x["args"], x["task"], x["models"][0]) 157 | 158 | def forward_encoder(self, encoder_inputs): 159 | 160 | src_x, src_padding, src_attn, src_hiddens = self.decoder.forward_inf( 161 | prev_output_tokens=encoder_inputs['src_tokens'], 162 | src_pos=encoder_inputs['src_pos'], 163 | ) 164 | 165 | return { 166 | "encoder_out": [src_x], 167 | "encoder_padding_mask": [src_padding], 168 | "encoder_states": src_hiddens, 169 | "src_tokens": [encoder_inputs['src_tokens']], 170 | "src_pos": [encoder_inputs['src_pos']], 171 | "tgt_pos": [encoder_inputs['tgt_pos']] if encoder_inputs['tgt_pos'] is not None else [], 172 | "bos_token_pos": [encoder_inputs['bos_token_pos']], 173 | } 174 | 175 | def forward_decoder(self, prev_output_tokens, encoder_out, incremental_state=None): 176 | 177 | if len(incremental_state) == 0: 178 | incremental_state["padding_mask"] = encoder_out["encoder_padding_mask"][0] 179 | for layer_idx, layer_hidden_states in enumerate(encoder_out["encoder_states"]): 180 | 181 | incremental_state[layer_idx] = {} 182 | incremental_state[layer_idx]['key'] = layer_hidden_states 183 | incremental_state['src_pos'] = encoder_out['src_pos'][0] 184 | incremental_state['tgt_pos'] = encoder_out['bos_token_pos'][0] 185 | 186 | tgt_x, tgt_padding, tgt_attn, tgt_hiddens = self.decoder.forward_inf( 187 | prev_output_tokens=prev_output_tokens, 188 | incremental_state=incremental_state, 189 | src_pos=incremental_state['src_pos'], 190 | tgt_pos=incremental_state['tgt_pos'], 191 | trunc_flg=True, 192 | ) 193 | 194 | tgt_out = self.decoder.output_layer(tgt_x) 195 | 196 | if len(incremental_state) > 0: 197 | incremental_state["padding_mask"] = tgt_padding 198 | for layer_idx, tgt_hid in 
enumerate(tgt_hiddens): 199 | 200 | incremental_state[layer_idx]['key'] = torch.cat( 201 | [incremental_state[layer_idx]['key'], tgt_hid], dim=1 202 | ) 203 | incremental_state['src_pos'] = torch.cat([ 204 | incremental_state['src_pos'], incremental_state['tgt_pos']], dim=-1) 205 | incremental_state['tgt_pos'] += 1 206 | return tgt_out, {"attn": [tgt_attn], "inner_states": tgt_hiddens}, incremental_state 207 | 208 | @torch.jit.export 209 | def get_normalized_probs( 210 | self, 211 | net_output: Tuple[Tensor, Optional[Dict[str, List[Optional[Tensor]]]]], 212 | log_probs: bool, 213 | sample: Optional[Dict[str, Tensor]] = None, 214 | ): 215 | logits = net_output[0] 216 | 217 | if log_probs: 218 | return utils.log_softmax(logits, dim=-1) 219 | else: 220 | return utils.softmax(logits, dim=-1) 221 | 222 | def forward(self, prev_output_tokens): 223 | x, x_padding, x_attn, x_hiddens = self.decoder(prev_output_tokens) 224 | x_out = self.decoder.output_layer(x) 225 | return x_out 226 | 227 | @torch.jit.export 228 | def reorder_incremental_state( 229 | self, 230 | incremental_state: Optional[Dict[str, Dict[str, Optional[Tensor]]]], 231 | new_order: Tensor, 232 | ): 233 | for key, value in incremental_state.items(): 234 | if "padding_mask" in str(key): 235 | incremental_state[key] = value.index_select(0, new_order) 236 | elif "pos" in str(key): 237 | incremental_state[key] = value.index_select(0, new_order) 238 | else: 239 | incremental_state[key]['key'] = value['key'].index_select(0, new_order) 240 | return incremental_state 241 | 242 | @torch.jit.export 243 | def reorder_encoder_out(self, encoder_out: Dict[str, List[Tensor]], new_order): 244 | 245 | if len(encoder_out["encoder_out"]) == 0: 246 | new_encoder_out = [] 247 | else: 248 | new_encoder_out = [encoder_out["encoder_out"][0].index_select(0, new_order)] 249 | 250 | if len(encoder_out["encoder_padding_mask"]) == 0: 251 | new_encoder_padding_mask = [] 252 | else: 253 | new_encoder_padding_mask = [ 254 | encoder_out["encoder_padding_mask"][0].index_select(0, new_order) 255 | ] 256 | 257 | encoder_states = encoder_out["encoder_states"] 258 | if len(encoder_states) > 0: 259 | for idx, state in enumerate(encoder_states): 260 | encoder_states[idx] = state.index_select(0, new_order) 261 | 262 | if len(encoder_out["src_tokens"]) == 0: 263 | src_tokens = [] 264 | else: 265 | src_tokens = [(encoder_out["src_tokens"][0]).index_select(0, new_order)] 266 | 267 | if len(encoder_out["src_pos"]) == 0: 268 | src_pos = [] 269 | else: 270 | src_pos = [(encoder_out["src_pos"][0]).index_select(0, new_order)] 271 | 272 | if len(encoder_out["tgt_pos"]) == 0: 273 | tgt_pos = [] 274 | else: 275 | tgt_pos = [(encoder_out["tgt_pos"][0]).index_select(0, new_order)] 276 | 277 | if len(encoder_out["bos_token_pos"]) == 0: 278 | bos_token_pos = [] 279 | else: 280 | bos_token_pos = [(encoder_out["bos_token_pos"][0]).index_select(0, new_order)] 281 | 282 | return { 283 | "encoder_out": new_encoder_out, # T x B x C 284 | "encoder_padding_mask": new_encoder_padding_mask, # B x T 285 | "encoder_states": encoder_states, # List[T x B x C] 286 | "src_tokens": src_tokens, # B x T 287 | "src_pos": src_pos, # B x T 288 | "tgt_pos": tgt_pos, # B x T 289 | "bos_token_pos": bos_token_pos, 290 | } 291 | 292 | def upgrade_state_dict_named(self, state_dict, name): 293 | 294 | if self.lora_tuning and self.lora_model_inf is not None: 295 | if os.path.exists(self.lora_model_inf): 296 | print("load lora model from {}".format(self.lora_model_inf)) 297 | with open(self.lora_model_inf, "rb") as f: 
298 | lora_state_dict = torch.load(f, map_location=torch.device("cuda"))['model'] 299 | for k in list(lora_state_dict.keys()): 300 | state_dict[k] = lora_state_dict[k] 301 | else: 302 | print("no lora model!") 303 | 304 | if "decoder.embed_tokens.weight" not in state_dict.keys(): 305 | for k in list(state_dict.keys()): 306 | if "tok_embeddings.weight" in k: 307 | state_dict["decoder.embed_tokens.weight"] = state_dict[k] 308 | del state_dict[k] 309 | elif "output.weight" in k: 310 | state_dict["decoder.output_projection.weight"] = state_dict[k] 311 | del state_dict[k] 312 | 313 | elif "layers" in k: 314 | 315 | if "inner_attention" in k: 316 | del state_dict[k] 317 | continue 318 | 319 | if "wq" in k: 320 | new_k = 'decoder.' + k.replace("wq", "q_proj") 321 | elif "wk" in k: 322 | new_k = 'decoder.' + k.replace("wk", "k_proj") 323 | elif "wv" in k: 324 | new_k = 'decoder.' + k.replace("wv", "v_proj") 325 | elif "wo" in k: 326 | new_k = 'decoder.' + k.replace("wo", "out_proj") 327 | elif "feed_forward" in k: 328 | new_k = 'decoder.' + k 329 | elif "_norm" in k: 330 | new_k = 'decoder.' + k 331 | else: 332 | continue 333 | 334 | state_dict[new_k] = state_dict[k] 335 | del state_dict[k] 336 | 337 | elif "norm.weight" in k: 338 | state_dict["decoder.layer_norm.weight"] = state_dict[k] 339 | del state_dict[k] 340 | 341 | else: 342 | raise NotImplementedError 343 | 344 | super().upgrade_state_dict_named(state_dict, name) 345 | 346 | 347 | def llama_base_architecture(args): 348 | 349 | args.dropout = safe_getattr(args, "dropout", 0.1) 350 | args.attention_dropout = safe_getattr(args, "attention_dropout", 0.0) 351 | args.decoder_embed_dim = safe_getattr(args, "decoder_embed_dim", 4096) 352 | args.decoder_ffn_embed_dim = safe_getattr(args, "decoder_ffn_embed_dim", 4096 * 4) 353 | args.decoder_layers = safe_getattr(args, "decoder_layers", 32) 354 | args.decoder_attention_heads = safe_getattr(args, "decoder_attention_heads", 32) 355 | args.max_target_positions = safe_getattr(args, "max_target_positions", 2048) 356 | 357 | @register_model_architecture("llama", "llama_7b") 358 | def llama_7b(args): 359 | llama_base_architecture(args) 360 | -------------------------------------------------------------------------------- /alpaca/src/model/llama_transformer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 
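#
# Overview of this module (summary of the code below): this is the single-GPU / FSDP decoder
# stack (the model-parallel variant lives in llama_megatron_transformer.py). Each
# LLaMATransformerLayer applies pre-RMSNorm attention with rotary position embeddings
# (freqs_cis is precomputed once for 2 * max_target_positions) followed by a SwiGLU
# feed-forward, w2(silu(w1(x)) * w3(x)), whose hidden size is scaled by 2/3 and rounded up
# to a multiple of 256. forward() is the training path; forward_inf() supports incremental
# decoding by concatenating the cached per-layer hidden states from incremental_state in
# front of the current step as the attention context. When lora_tuning is set, low-rank
# LoRA deltas are added to the q/k/v projections.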
5 | 6 | from typing import Dict, List, Optional, Tuple 7 | import os 8 | import math 9 | import logging 10 | 11 | import torch 12 | from torch import Tensor, nn 13 | import torch.nn.functional as F 14 | from fairseq import utils 15 | from torch.nn import Linear 16 | from fsdp.fully_sharded_data_parallel import fsdp_enable_wrap, fsdp_wrap 17 | from fairseq.modules.checkpoint_activations import checkpoint_wrapper 18 | from .lora_modules import LoRA 19 | 20 | logger = logging.getLogger(__name__) 21 | 22 | 23 | class LLaMATransformer(nn.Module): 24 | 25 | def __init__(self, cfg, tgt_dict, embed_tokens, lora_tuning): 26 | super().__init__() 27 | 28 | self.lora_tuning = lora_tuning 29 | 30 | self.tgt_dict = tgt_dict 31 | self.embed_dim = cfg.decoder_embed_dim 32 | self.num_layers = cfg.decoder_layers 33 | self.num_heads = cfg.decoder_attention_heads 34 | self.head_dim = self.embed_dim // self.num_heads 35 | self.max_target_positions = cfg.max_target_positions 36 | 37 | self.pad = self.tgt_dict.pad() 38 | self.embed_tokens = embed_tokens 39 | 40 | self.layers = torch.nn.ModuleList() 41 | self.layers.extend( 42 | [ 43 | self.build_decoder_layer(cfg, self.lora_tuning) 44 | for _ in range(self.num_layers) 45 | ] 46 | ) 47 | 48 | self.layer_norm = RMSNorm(self.embed_dim) 49 | self.output_projection = Linear( 50 | self.embed_dim, len(self.tgt_dict), bias=False 51 | ) 52 | 53 | self.freqs_cis = self.precompute_freqs_cis( 54 | self.embed_dim // self.num_heads, self.max_target_positions * 2 55 | ) 56 | self._future_mask = torch.empty(0) 57 | 58 | def build_decoder_layer(self, cfg, lora_tuning): 59 | layer = LLaMATransformerLayer(cfg, lora_tuning) 60 | checkpoint = cfg.checkpoint_activations 61 | if checkpoint: 62 | offload_to_cpu = cfg.offload_activations 63 | layer = checkpoint_wrapper(layer, offload_to_cpu=offload_to_cpu) 64 | # if we are checkpointing, enforce that FSDP always wraps the 65 | # checkpointed layer, regardless of layer size 66 | min_params_to_wrap = cfg.min_params_to_wrap if not checkpoint else 0 67 | layer = fsdp_wrap(layer, min_num_params=min_params_to_wrap) 68 | return layer 69 | 70 | def precompute_freqs_cis(self, dim: int, end: int, theta: float = 10000.0): 71 | freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim)) 72 | t = torch.arange(end, device=freqs.device) # type: ignore 73 | freqs = torch.outer(t, freqs).float() # type: ignore 74 | freqs_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64 75 | return freqs_cis 76 | 77 | def output_layer(self, x): 78 | return self.output_projection(x).float() 79 | 80 | def buffered_future_mask(self, tensor): 81 | dim = tensor.size(1) 82 | if ( 83 | self._future_mask.size(0) == 0 84 | or (not self._future_mask.device == tensor.device) 85 | or self._future_mask.size(0) < dim 86 | ): 87 | self._future_mask = torch.triu( 88 | utils.fill_with_neg_inf(torch.zeros([dim, dim])), 1 89 | ) 90 | self._future_mask = self._future_mask.to(tensor) 91 | return self._future_mask[:dim, :dim] 92 | 93 | def forward_inf( 94 | self, 95 | prev_output_tokens, 96 | incremental_state: Optional[Dict[str, Dict[str, Optional[Tensor]]]] = None, 97 | src_pos: Optional[Tensor] = None, 98 | tgt_pos: Optional[Tensor] = None, 99 | trunc_flg: bool = False, 100 | ): 101 | 102 | if incremental_state is not None and trunc_flg: 103 | prev_output_tokens = prev_output_tokens[:, -1:] 104 | 105 | bsz, target_len = prev_output_tokens.size() 106 | x = self.embed_tokens(prev_output_tokens) 107 | 108 | key_padding_mask = prev_output_tokens.eq(self.pad) 109 
| if incremental_state is not None: 110 | key_padding_mask = torch.cat([incremental_state['padding_mask'], key_padding_mask], dim=-1) 111 | 112 | self.freqs_cis = self.freqs_cis.to(x.device) 113 | if incremental_state is not None: 114 | freqs_cis = self.freqs_cis[:key_padding_mask.size(1)] 115 | else: 116 | freqs_cis = self.freqs_cis[:target_len] 117 | 118 | if incremental_state is not None: 119 | tgt_attn_mask = self.buffered_future_mask(x) 120 | tgt_len = tgt_attn_mask.size(1) 121 | src_len = key_padding_mask.size(1) 122 | src_attn_mask = torch.torch.zeros([tgt_len, src_len - tgt_len]).to(tgt_attn_mask) 123 | self_attn_mask = torch.cat([src_attn_mask, tgt_attn_mask], dim=1) 124 | else: 125 | self_attn_mask = self.buffered_future_mask(x) 126 | 127 | hidden_state = [x] 128 | attn_state = None 129 | for layer_idx, layer in enumerate(self.layers): 130 | 131 | if incremental_state is not None: 132 | context = torch.cat([incremental_state[layer_idx]['key'], x], dim=1) 133 | else: 134 | context = x 135 | 136 | x, attn = layer( 137 | x, 138 | context, 139 | freqs_cis, 140 | key_padding_mask, 141 | self_attn_mask, 142 | src_pos, 143 | tgt_pos, 144 | ) 145 | 146 | attn_state = attn 147 | hidden_state.append(x) 148 | 149 | attn_state = attn_state.mean(dim=1) 150 | x = self.layer_norm(x) 151 | return x, key_padding_mask, attn_state, hidden_state 152 | 153 | def forward(self, prev_output_tokens): 154 | bsz, target_len = prev_output_tokens.size() 155 | x = self.embed_tokens(prev_output_tokens) 156 | 157 | key_padding_mask = prev_output_tokens.eq(self.pad) 158 | freqs_cis = self.freqs_cis.to(x.device)[:target_len] 159 | self_attn_mask = self.buffered_future_mask(x) 160 | 161 | hidden_state = [x] 162 | attn_state = None 163 | for layer_idx, layer in enumerate(self.layers): 164 | 165 | x, attn = layer( 166 | x, 167 | x, 168 | freqs_cis, 169 | key_padding_mask, 170 | self_attn_mask, 171 | ) 172 | 173 | attn_state = attn 174 | hidden_state.append(x) 175 | 176 | attn_state = attn_state.mean(dim=1) 177 | x = self.layer_norm(x) 178 | return x, key_padding_mask, attn_state, hidden_state 179 | 180 | 181 | class LLaMATransformerLayer(nn.Module): 182 | 183 | def __init__(self, cfg, lora_tuning): 184 | super().__init__() 185 | 186 | self.lora_tuning = lora_tuning 187 | 188 | self.embed_dim = cfg.decoder_embed_dim 189 | self.num_heads = cfg.decoder_attention_heads 190 | self.ffn_embed_dim = cfg.decoder_ffn_embed_dim 191 | 192 | self.attention = LLaMAAttention(self.num_heads, self.embed_dim, self.lora_tuning) 193 | self.feed_forward = LLaMAFeedForward(self.embed_dim, self.ffn_embed_dim) 194 | 195 | self.attention_norm = RMSNorm(self.embed_dim) 196 | self.ffn_norm = RMSNorm(self.embed_dim) 197 | 198 | def forward( 199 | self, 200 | query: Tensor, 201 | key_value: Tensor, 202 | freqs_cis: Tensor, 203 | key_padding_mask: Optional[Tensor], 204 | self_attn_mask: Optional[Tensor], 205 | src_pos: Optional[Tensor] = None, 206 | tgt_pos: Optional[Tensor] = None, 207 | ): 208 | 209 | x, attn = self.attention( 210 | self.attention_norm(query), 211 | self.attention_norm(key_value), 212 | freqs_cis, 213 | key_padding_mask, 214 | self_attn_mask, 215 | src_pos, 216 | tgt_pos, 217 | ) 218 | x = query + x 219 | x = x + self.feed_forward(self.ffn_norm(x)) 220 | return x, attn 221 | 222 | class RMSNorm(nn.Module): 223 | 224 | def __init__(self, dim: int, eps: float = 1e-6): 225 | super().__init__() 226 | 227 | self.eps = eps 228 | self.weight = nn.Parameter(torch.ones(dim)) 229 | 230 | def _norm(self, x): 231 | return x * 
torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) 232 | 233 | def forward(self, x): 234 | output = self._norm(x.float()).type_as(x) 235 | return output * self.weight 236 | 237 | 238 | class LLaMAAttention(nn.Module): 239 | 240 | def __init__(self, num_heads, embed_dim, lora_tuning): 241 | super().__init__() 242 | 243 | self.lora_tuning = lora_tuning 244 | 245 | self.num_heads = num_heads 246 | self.embed_dim = embed_dim 247 | self.head_dim = embed_dim // num_heads 248 | self.local_num_heads = self.num_heads 249 | 250 | self.q_proj = Linear( 251 | self.embed_dim, 252 | self.embed_dim, 253 | bias=False, 254 | ) 255 | self.k_proj = Linear( 256 | self.embed_dim, 257 | self.embed_dim, 258 | bias=False, 259 | ) 260 | self.v_proj = Linear( 261 | self.embed_dim, 262 | self.embed_dim, 263 | bias=False, 264 | ) 265 | self.out_proj = Linear( 266 | self.embed_dim, 267 | self.embed_dim, 268 | bias=False, 269 | ) 270 | 271 | if self.lora_tuning: 272 | self.q_lora = LoRA(self.embed_dim, self.embed_dim) 273 | self.k_lora = LoRA(self.embed_dim, self.embed_dim) 274 | self.v_lora = LoRA(self.embed_dim, self.embed_dim) 275 | 276 | def apply_rotary_emb( 277 | self, 278 | query: Tensor, 279 | key: Tensor, 280 | freqs_cis: Tensor, 281 | src_pos: Tensor, 282 | tgt_pos: Tensor, 283 | ) -> Tuple[Tensor, Tensor]: 284 | 285 | def reshape_for_broadcast(freqs_cis: Tensor, x: Tensor): 286 | ndim = x.ndim 287 | assert 0 <= 1 < ndim 288 | assert freqs_cis.shape == (x.shape[1], x.shape[-1]) 289 | shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)] 290 | return freqs_cis.view(*shape) 291 | 292 | q_ = torch.view_as_complex(query.float().reshape(*query.shape[:-1], -1, 2)) 293 | k_ = torch.view_as_complex(key.float().reshape(*key.shape[:-1], -1, 2)) 294 | 295 | if src_pos is not None and tgt_pos is not None: 296 | if freqs_cis.size(0) == q_.size(1): 297 | freqs_cis = reshape_for_broadcast(freqs_cis, q_) 298 | q_list = [] 299 | k_list = [] 300 | for idx, attn_p in enumerate(src_pos): 301 | q_list.append(q_[idx] * freqs_cis.index_select(dim=1, index=attn_p)) 302 | k_list.append(k_[idx] * freqs_cis.index_select(dim=1, index=attn_p)) 303 | q_out = torch.view_as_real(torch.cat(q_list, dim=0)).flatten(3) 304 | k_out = torch.view_as_real(torch.cat(k_list, dim=0)).flatten(3) 305 | else: 306 | freqs_cis = reshape_for_broadcast(freqs_cis, k_) 307 | q_list = [] 308 | k_list = [] 309 | idx = 0 310 | for q_pos, k_pos in zip(tgt_pos, torch.cat([src_pos, tgt_pos], dim=-1)): 311 | q_list.append(q_[idx] * freqs_cis.index_select(dim=1, index=q_pos)) 312 | k_list.append(k_[idx] * freqs_cis.index_select(dim=1, index=k_pos)) 313 | idx += 1 314 | q_out = torch.view_as_real(torch.cat(q_list, dim=0)).flatten(3) 315 | k_out = torch.view_as_real(torch.cat(k_list, dim=0)).flatten(3) 316 | else: 317 | freqs_cis = reshape_for_broadcast(freqs_cis, q_) 318 | q_out = torch.view_as_real(q_ * freqs_cis).flatten(3) 319 | k_out = torch.view_as_real(k_ * freqs_cis).flatten(3) 320 | return q_out.type_as(query), k_out.type_as(key) 321 | 322 | def forward( 323 | self, 324 | query: Tensor, 325 | key_value: Tensor, 326 | freqs_cis: Tensor, 327 | key_padding_mask: Optional[Tensor] = None, 328 | attn_mask: Optional[Tensor] = None, 329 | src_pos: Optional[Tensor] = None, 330 | tgt_pos: Optional[Tensor] = None, 331 | ): 332 | 333 | bsz, tgt_len, embed_dim = query.size() 334 | bsz, src_len, embed_dim = key_value.size() 335 | 336 | q = self.q_proj(query) 337 | k = self.k_proj(key_value) 338 | v = self.v_proj(key_value) 339 | 340 | if 
self.lora_tuning: 341 | 342 | q = q + self.q_lora(query) 343 | k = k + self.k_lora(key_value) 344 | v = v + self.v_lora(key_value) 345 | 346 | q = q.view(bsz, tgt_len, self.local_num_heads, self.head_dim) 347 | k = k.view(bsz, src_len, self.local_num_heads, self.head_dim) 348 | v = v.view(bsz, src_len, self.local_num_heads, self.head_dim) 349 | 350 | q, k = self.apply_rotary_emb(q, k, freqs_cis=freqs_cis, src_pos=src_pos, tgt_pos=tgt_pos) 351 | 352 | q = q.transpose(1, 2) 353 | k = k.transpose(1, 2) 354 | v = v.transpose(1, 2) 355 | 356 | attn_scores = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(self.head_dim) 357 | 358 | if attn_mask is not None: 359 | attn_scores = attn_scores + attn_mask.unsqueeze(0).unsqueeze(1) 360 | attn_scores = attn_scores.masked_fill( 361 | key_padding_mask.unsqueeze(1).unsqueeze(2), 362 | float("-inf") 363 | ) 364 | 365 | attn_softmax_scores = F.softmax(attn_scores.float(), dim=-1).type_as(q) 366 | output = torch.matmul(attn_softmax_scores, v) 367 | output = output.transpose(1, 2).contiguous().view(bsz, tgt_len, -1) 368 | 369 | return self.out_proj(output), attn_softmax_scores 370 | 371 | 372 | class LLaMAFeedForward(nn.Module): 373 | 374 | def __init__(self, embed_dim: int, hidden_dim: int): 375 | super().__init__() 376 | 377 | self.embed_dim = embed_dim 378 | self.hidden_dim = hidden_dim 379 | 380 | multiple_of = 256 381 | self.hidden_dim = int(2 * self.hidden_dim / 3) 382 | self.hidden_dim = multiple_of * ((self.hidden_dim + multiple_of - 1) // multiple_of) 383 | 384 | self.w1 = Linear( 385 | self.embed_dim, self.hidden_dim, bias=False, 386 | ) 387 | self.w2 = Linear( 388 | self.hidden_dim, self.embed_dim, bias=False, 389 | ) 390 | self.w3 = Linear( 391 | self.embed_dim, self.hidden_dim, bias=False, 392 | ) 393 | 394 | def forward(self, x): 395 | return self.w2(F.silu(self.w1(x)) * self.w3(x)) 396 | -------------------------------------------------------------------------------- /alpaca/src/model/lora_modules.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | import torch 6 | from torch import nn 7 | import math 8 | 9 | 10 | class LoRA(nn.Module): 11 | 12 | def __init__(self, input_dim, output_dim): 13 | super().__init__() 14 | self.lora_alpha = 32 15 | self.r = 4 16 | self.scaling = self.lora_alpha / self.r 17 | 18 | self.lora_A = nn.Parameter(torch.zeros((self.r, input_dim))) 19 | self.lora_B = nn.Parameter(torch.zeros((output_dim, self.r))) 20 | self.reset_lora_parameters() 21 | 22 | def reset_lora_parameters(self): 23 | nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5)) 24 | nn.init.zeros_(self.lora_B) 25 | 26 | def forward(self, x): 27 | return (x @ self.lora_A.T @ self.lora_B.T) * self.scaling 28 | 29 | def upgrade_state_dict_named(self, state_dict, name): 30 | 31 | prefix = name + '.lora_A' 32 | if prefix not in state_dict: 33 | state_dict[prefix] = self.lora_A 34 | 35 | prefix = name + '.lora_B' 36 | if prefix not in state_dict: 37 | state_dict[prefix] = self.lora_B 38 | -------------------------------------------------------------------------------- /alpaca/src/preprocess.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # Copyright (c) Facebook, Inc. and its affiliates. 
3 | # 4 | # This source code is licensed under the MIT license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | """ 7 | Data pre-processing: build vocabularies and binarize training data. 8 | """ 9 | 10 | import logging 11 | import os 12 | import shutil 13 | import sys 14 | import typing as tp 15 | from argparse import Namespace 16 | from itertools import zip_longest 17 | 18 | from fairseq import options, tasks, utils 19 | from fairseq.binarizer import ( 20 | AlignmentDatasetBinarizer, 21 | FileBinarizer, 22 | VocabularyDatasetBinarizer, 23 | ) 24 | from dictionary import Dictionary 25 | 26 | logging.basicConfig( 27 | format="%(asctime)s | %(levelname)s | %(name)s | %(message)s", 28 | datefmt="%Y-%m-%d %H:%M:%S", 29 | level=os.environ.get("LOGLEVEL", "INFO").upper(), 30 | stream=sys.stdout, 31 | ) 32 | logger = logging.getLogger("fairseq_cli.preprocess") 33 | 34 | ##################################################################### 35 | # file name tools 36 | ##################################################################### 37 | 38 | 39 | def _train_path(lang, trainpref): 40 | return "{}{}".format(trainpref, ("." + lang) if lang else "") 41 | 42 | 43 | def _file_name(prefix, lang): 44 | fname = prefix 45 | if lang is not None: 46 | fname += ".{lang}".format(lang=lang) 47 | return fname 48 | 49 | 50 | def _dest_path(prefix, lang, destdir): 51 | return os.path.join(destdir, _file_name(prefix, lang)) 52 | 53 | 54 | def _dict_path(lang, destdir): 55 | return _dest_path("dict", lang, destdir) + ".txt" 56 | 57 | 58 | def dataset_dest_prefix(args, output_prefix, lang): 59 | base = os.path.join(args.destdir, output_prefix) 60 | if lang is not None: 61 | lang_part = f".{args.source_lang}-{args.target_lang}.{lang}" 62 | elif args.only_source: 63 | lang_part = "" 64 | else: 65 | lang_part = f".{args.source_lang}-{args.target_lang}" 66 | 67 | return "{}{}".format(base, lang_part) 68 | 69 | 70 | def dataset_dest_file(args, output_prefix, lang, extension): 71 | return "{}.{}".format(dataset_dest_prefix(args, output_prefix, lang), extension) 72 | 73 | 74 | ##################################################################### 75 | # dictionary tools 76 | ##################################################################### 77 | 78 | 79 | def _build_dictionary( 80 | filenames, 81 | task, 82 | args, 83 | src=False, 84 | tgt=False, 85 | ): 86 | assert src ^ tgt 87 | return task.build_dictionary( 88 | filenames, 89 | workers=args.workers, 90 | threshold=args.thresholdsrc if src else args.thresholdtgt, 91 | nwords=args.nwordssrc if src else args.nwordstgt, 92 | padding_factor=args.padding_factor, 93 | ) 94 | 95 | 96 | ##################################################################### 97 | # bin file creation logic 98 | ##################################################################### 99 | 100 | 101 | def _make_binary_dataset( 102 | vocab: Dictionary, 103 | input_prefix: str, 104 | output_prefix: str, 105 | lang: tp.Optional[str], 106 | num_workers: int, 107 | args: Namespace, 108 | ): 109 | logger.info("[{}] Dictionary: {} types".format(lang, len(vocab))) 110 | 111 | binarizer = VocabularyDatasetBinarizer( 112 | vocab, 113 | append_eos=True, 114 | ) 115 | 116 | input_file = "{}{}".format(input_prefix, ("." 
+ lang) if lang is not None else "") 117 | full_output_prefix = dataset_dest_prefix(args, output_prefix, lang) 118 | 119 | final_summary = FileBinarizer.multiprocess_dataset( 120 | input_file, 121 | args.dataset_impl, 122 | binarizer, 123 | full_output_prefix, 124 | vocab_size=len(vocab), 125 | num_workers=num_workers, 126 | ) 127 | 128 | logger.info(f"[{lang}] {input_file}: {final_summary} (by {vocab.unk_word})") 129 | 130 | 131 | def _make_binary_alignment_dataset( 132 | input_prefix: str, output_prefix: str, num_workers: int, args: Namespace 133 | ): 134 | 135 | binarizer = AlignmentDatasetBinarizer(utils.parse_alignment) 136 | 137 | input_file = input_prefix 138 | full_output_prefix = dataset_dest_prefix(args, output_prefix, lang=None) 139 | 140 | final_summary = FileBinarizer.multiprocess_dataset( 141 | input_file, 142 | args.dataset_impl, 143 | binarizer, 144 | full_output_prefix, 145 | vocab_size=None, 146 | num_workers=num_workers, 147 | ) 148 | 149 | logger.info( 150 | "[alignments] {}: parsed {} alignments".format( 151 | input_file, final_summary.num_seq 152 | ) 153 | ) 154 | 155 | 156 | ##################################################################### 157 | # routing logic 158 | ##################################################################### 159 | 160 | 161 | def _make_dataset( 162 | vocab: Dictionary, 163 | input_prefix: str, 164 | output_prefix: str, 165 | lang: tp.Optional[str], 166 | args: Namespace, 167 | num_workers: int, 168 | ): 169 | if args.dataset_impl == "raw": 170 | # Copy original text file to destination folder 171 | output_text_file = _dest_path( 172 | output_prefix + ".{}-{}".format(args.source_lang, args.target_lang), 173 | lang, 174 | args.destdir, 175 | ) 176 | shutil.copyfile(_file_name(input_prefix, lang), output_text_file) 177 | else: 178 | _make_binary_dataset( 179 | vocab, input_prefix, output_prefix, lang, num_workers, args 180 | ) 181 | 182 | 183 | def _make_all(lang, vocab, args): 184 | if args.trainpref: 185 | _make_dataset( 186 | vocab, args.trainpref, "train", lang, args=args, num_workers=args.workers 187 | ) 188 | if args.validpref: 189 | for k, validpref in enumerate(args.validpref.split(",")): 190 | outprefix = "valid{}".format(k) if k > 0 else "valid" 191 | _make_dataset( 192 | vocab, validpref, outprefix, lang, args=args, num_workers=args.workers 193 | ) 194 | if args.testpref: 195 | for k, testpref in enumerate(args.testpref.split(",")): 196 | outprefix = "test{}".format(k) if k > 0 else "test" 197 | _make_dataset( 198 | vocab, testpref, outprefix, lang, args=args, num_workers=args.workers 199 | ) 200 | 201 | 202 | def _make_all_alignments(args): 203 | if args.trainpref and os.path.exists(args.trainpref + "." + args.align_suffix): 204 | _make_binary_alignment_dataset( 205 | args.trainpref + "." + args.align_suffix, 206 | "train.align", 207 | num_workers=args.workers, 208 | args=args, 209 | ) 210 | if args.validpref and os.path.exists(args.validpref + "." + args.align_suffix): 211 | _make_binary_alignment_dataset( 212 | args.validpref + "." + args.align_suffix, 213 | "valid.align", 214 | num_workers=args.workers, 215 | args=args, 216 | ) 217 | if args.testpref and os.path.exists(args.testpref + "." + args.align_suffix): 218 | _make_binary_alignment_dataset( 219 | args.testpref + "." 
+ args.align_suffix, 220 | "test.align", 221 | num_workers=args.workers, 222 | args=args, 223 | ) 224 | 225 | 226 | ##################################################################### 227 | # align 228 | ##################################################################### 229 | 230 | 231 | def _align_files(args, src_dict, tgt_dict): 232 | assert args.trainpref, "--trainpref must be set if --alignfile is specified" 233 | src_file_name = _train_path(args.source_lang, args.trainpref) 234 | tgt_file_name = _train_path(args.target_lang, args.trainpref) 235 | freq_map = {} 236 | with open(args.alignfile, "r", encoding="utf-8") as align_file: 237 | with open(src_file_name, "r", encoding="utf-8") as src_file: 238 | with open(tgt_file_name, "r", encoding="utf-8") as tgt_file: 239 | for a, s, t in zip_longest(align_file, src_file, tgt_file): 240 | si = src_dict.encode_line(s, add_if_not_exist=False) 241 | ti = tgt_dict.encode_line(t, add_if_not_exist=False) 242 | ai = list(map(lambda x: tuple(x.split("-")), a.split())) 243 | for sai, tai in ai: 244 | srcidx = si[int(sai)] 245 | tgtidx = ti[int(tai)] 246 | if srcidx != src_dict.unk() and tgtidx != tgt_dict.unk(): 247 | assert srcidx != src_dict.pad() 248 | assert srcidx != src_dict.eos() 249 | assert tgtidx != tgt_dict.pad() 250 | assert tgtidx != tgt_dict.eos() 251 | if srcidx not in freq_map: 252 | freq_map[srcidx] = {} 253 | if tgtidx not in freq_map[srcidx]: 254 | freq_map[srcidx][tgtidx] = 1 255 | else: 256 | freq_map[srcidx][tgtidx] += 1 257 | align_dict = {} 258 | for srcidx in freq_map.keys(): 259 | align_dict[srcidx] = max(freq_map[srcidx], key=freq_map[srcidx].get) 260 | with open( 261 | os.path.join( 262 | args.destdir, 263 | "alignment.{}-{}.txt".format(args.source_lang, args.target_lang), 264 | ), 265 | "w", 266 | encoding="utf-8", 267 | ) as f: 268 | for k, v in align_dict.items(): 269 | print("{} {}".format(src_dict[k], tgt_dict[v]), file=f) 270 | 271 | 272 | ##################################################################### 273 | # MAIN 274 | ##################################################################### 275 | 276 | 277 | def main(args): 278 | # setup some basic things 279 | utils.import_user_module(args) 280 | 281 | os.makedirs(args.destdir, exist_ok=True) 282 | 283 | logger.addHandler( 284 | logging.FileHandler( 285 | filename=os.path.join(args.destdir, "preprocess.log"), 286 | ) 287 | ) 288 | logger.info(args) 289 | 290 | assert ( 291 | args.dataset_impl != "huffman" 292 | ), "preprocessing.py doesn't support Huffman yet, use HuffmanCodeBuilder directly." 
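    # Dictionary handling (elaborating the steps below): with --joined-dictionary a single
    # vocabulary is built (or loaded via --srcdict / --tgtdict) and shared between source and
    # target; otherwise separate dictionaries are loaded or built from the --trainpref files.
    # Dictionaries are written to dict.<lang>.txt under --destdir, and _make_all then binarizes
    # the train/valid/test splits (or just copies the raw text when --dataset-impl raw is used).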
293 | 294 | # build dictionaries 295 | 296 | target = not args.only_source 297 | 298 | if not args.srcdict and os.path.exists(_dict_path(args.source_lang, args.destdir)): 299 | raise FileExistsError(_dict_path(args.source_lang, args.destdir)) 300 | 301 | if ( 302 | target 303 | and not args.tgtdict 304 | and os.path.exists(_dict_path(args.target_lang, args.destdir)) 305 | ): 306 | raise FileExistsError(_dict_path(args.target_lang, args.destdir)) 307 | 308 | task = tasks.get_task(args.task) 309 | 310 | if args.joined_dictionary: 311 | assert ( 312 | not args.srcdict or not args.tgtdict 313 | ), "cannot use both --srcdict and --tgtdict with --joined-dictionary" 314 | 315 | if args.srcdict: 316 | src_dict = task.load_dictionary(args.srcdict) 317 | elif args.tgtdict: 318 | src_dict = task.load_dictionary(args.tgtdict) 319 | else: 320 | assert ( 321 | args.trainpref 322 | ), "--trainpref must be set if --srcdict is not specified" 323 | src_dict = _build_dictionary( 324 | { 325 | _train_path(lang, args.trainpref) 326 | for lang in [args.source_lang, args.target_lang] 327 | }, 328 | task=task, 329 | args=args, 330 | src=True, 331 | ) 332 | tgt_dict = src_dict 333 | else: 334 | if args.srcdict: 335 | src_dict = task.load_dictionary(args.srcdict) 336 | else: 337 | assert ( 338 | args.trainpref 339 | ), "--trainpref must be set if --srcdict is not specified" 340 | src_dict = _build_dictionary( 341 | [_train_path(args.source_lang, args.trainpref)], 342 | task=task, 343 | args=args, 344 | src=True, 345 | ) 346 | 347 | if target: 348 | if args.tgtdict: 349 | tgt_dict = task.load_dictionary(args.tgtdict) 350 | else: 351 | assert ( 352 | args.trainpref 353 | ), "--trainpref must be set if --tgtdict is not specified" 354 | tgt_dict = _build_dictionary( 355 | [_train_path(args.target_lang, args.trainpref)], 356 | task=task, 357 | args=args, 358 | tgt=True, 359 | ) 360 | else: 361 | tgt_dict = None 362 | 363 | # save dictionaries 364 | 365 | src_dict.save(_dict_path(args.source_lang, args.destdir)) 366 | if target and tgt_dict is not None: 367 | tgt_dict.save(_dict_path(args.target_lang, args.destdir)) 368 | 369 | if args.dict_only: 370 | return 371 | 372 | _make_all(args.source_lang, src_dict, args) 373 | if target: 374 | _make_all(args.target_lang, tgt_dict, args) 375 | 376 | # align the datasets if needed 377 | if args.align_suffix: 378 | _make_all_alignments(args) 379 | 380 | logger.info("Wrote preprocessed data to {}".format(args.destdir)) 381 | 382 | if args.alignfile: 383 | _align_files(args, src_dict=src_dict, tgt_dict=tgt_dict) 384 | 385 | 386 | def cli_main(): 387 | parser = options.get_preprocessing_parser() 388 | args = parser.parse_args() 389 | main(args) 390 | 391 | 392 | if __name__ == "__main__": 393 | cli_main() 394 | -------------------------------------------------------------------------------- /alpaca/src/task/__pycache__/dictionary.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/task/__pycache__/dictionary.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/task/__pycache__/seq2seq_dataset.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/task/__pycache__/seq2seq_dataset.cpython-37.pyc 
-------------------------------------------------------------------------------- /alpaca/src/task/__pycache__/seq2seq_ft_task.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/task/__pycache__/seq2seq_ft_task.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/task/__pycache__/seq2seq_lora_task.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/alpaca/src/task/__pycache__/seq2seq_lora_task.cpython-37.pyc -------------------------------------------------------------------------------- /alpaca/src/task/dictionary.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import os 7 | from collections import Counter 8 | from multiprocessing import Pool 9 | 10 | import torch 11 | from fairseq import utils 12 | from fairseq.data import data_utils 13 | from fairseq.file_chunker_utils import Chunker, find_offsets 14 | from fairseq.file_io import PathManager 15 | from fairseq.tokenizer import tokenize_line 16 | 17 | 18 | class Dictionary: 19 | """A mapping from symbols to consecutive integers""" 20 | 21 | def __init__( 22 | self, 23 | *, # begin keyword-only arguments 24 | bos="", 25 | pad="", 26 | eos="", 27 | unk="", 28 | extra_special_symbols=None, 29 | ): 30 | self.bos_word, self.unk_word, self.pad_word, self.eos_word = bos, unk, pad, eos 31 | self.symbols = [] 32 | self.count = [] 33 | self.indices = {} 34 | self.unk_index = self.add_symbol(unk) 35 | self.bos_index = self.add_symbol(bos) 36 | self.eos_index = self.add_symbol(eos) 37 | 38 | if extra_special_symbols: 39 | for s in extra_special_symbols: 40 | self.add_symbol(s) 41 | self.nspecial = len(self.symbols) 42 | 43 | def __eq__(self, other): 44 | return self.indices == other.indices 45 | 46 | def __getitem__(self, idx): 47 | if idx < len(self.symbols): 48 | return self.symbols[idx] 49 | return self.unk_word 50 | 51 | def get_count(self, idx): 52 | return self.count[idx] 53 | 54 | def __len__(self): 55 | """Returns the number of symbols in the dictionary""" 56 | return len(self.symbols) 57 | 58 | def __contains__(self, sym): 59 | return sym in self.indices 60 | 61 | def index(self, sym): 62 | """Returns the index of the specified symbol""" 63 | assert isinstance(sym, str) 64 | if sym in self.indices: 65 | return self.indices[sym] 66 | return self.unk_index 67 | 68 | def string( 69 | self, 70 | tensor, 71 | bpe_symbol=None, 72 | escape_unk=False, 73 | extra_symbols_to_ignore=None, 74 | unk_string=None, 75 | include_eos=False, 76 | separator=" ", 77 | ): 78 | """Helper for converting a tensor of token indices to a string. 79 | 80 | Can optionally remove BPE symbols or escape words. 
81 | """ 82 | if torch.is_tensor(tensor) and tensor.dim() == 2: 83 | return "\n".join( 84 | self.string( 85 | t, 86 | bpe_symbol, 87 | escape_unk, 88 | extra_symbols_to_ignore, 89 | include_eos=include_eos, 90 | ) 91 | for t in tensor 92 | ) 93 | 94 | extra_symbols_to_ignore = set(extra_symbols_to_ignore or []) 95 | if not include_eos: 96 | extra_symbols_to_ignore.add(self.eos()) 97 | 98 | def token_string(i): 99 | if i == self.unk(): 100 | if unk_string is not None: 101 | return unk_string 102 | else: 103 | return self.unk_string(escape_unk) 104 | else: 105 | return self[i] 106 | 107 | if hasattr(self, "bos_index"): 108 | extra_symbols_to_ignore.add(self.bos()) 109 | 110 | sent = separator.join( 111 | token_string(i) 112 | for i in tensor 113 | if utils.item(i) not in extra_symbols_to_ignore 114 | ) 115 | 116 | return data_utils.post_process(sent, bpe_symbol) 117 | 118 | def unk_string(self, escape=False): 119 | """Return unknown string, optionally escaped as: <>""" 120 | if escape: 121 | return "<{}>".format(self.unk_word) 122 | else: 123 | return self.unk_word 124 | 125 | def add_symbol(self, word, n=1, overwrite=False): 126 | """Adds a word to the dictionary""" 127 | if word in self.indices and not overwrite: 128 | idx = self.indices[word] 129 | self.count[idx] = self.count[idx] + n 130 | return idx 131 | else: 132 | idx = len(self.symbols) 133 | self.indices[word] = idx 134 | self.symbols.append(word) 135 | self.count.append(n) 136 | return idx 137 | 138 | def update(self, new_dict): 139 | """Updates counts from new dictionary.""" 140 | for word in new_dict.symbols: 141 | idx2 = new_dict.indices[word] 142 | if word in self.indices: 143 | idx = self.indices[word] 144 | self.count[idx] = self.count[idx] + new_dict.count[idx2] 145 | else: 146 | idx = len(self.symbols) 147 | self.indices[word] = idx 148 | self.symbols.append(word) 149 | self.count.append(new_dict.count[idx2]) 150 | 151 | def finalize(self, threshold=-1, nwords=-1, padding_factor=8): 152 | """Sort symbols by frequency in descending order, ignoring special ones. 153 | 154 | Args: 155 | - threshold defines the minimum word count 156 | - nwords defines the total number of words in the final dictionary, 157 | including special symbols 158 | - padding_factor can be used to pad the dictionary size to be a 159 | multiple of 8, which is important on some hardware (e.g., Nvidia 160 | Tensor Cores). 
161 | """ 162 | if nwords <= 0: 163 | nwords = len(self) 164 | 165 | new_indices = dict(zip(self.symbols[: self.nspecial], range(self.nspecial))) 166 | new_symbols = self.symbols[: self.nspecial] 167 | new_count = self.count[: self.nspecial] 168 | 169 | c = Counter( 170 | dict( 171 | sorted(zip(self.symbols[self.nspecial :], self.count[self.nspecial :])) 172 | ) 173 | ) 174 | for symbol, count in c.most_common(nwords - self.nspecial): 175 | if count >= threshold: 176 | new_indices[symbol] = len(new_symbols) 177 | new_symbols.append(symbol) 178 | new_count.append(count) 179 | else: 180 | break 181 | 182 | assert len(new_symbols) == len(new_indices) 183 | 184 | self.count = list(new_count) 185 | self.symbols = list(new_symbols) 186 | self.indices = new_indices 187 | 188 | self.pad_to_multiple_(padding_factor) 189 | 190 | def pad_to_multiple_(self, padding_factor): 191 | """Pad Dictionary size to be a multiple of *padding_factor*.""" 192 | if padding_factor > 1: 193 | i = 0 194 | while len(self) % padding_factor != 0: 195 | symbol = "madeupword{:04d}".format(i) 196 | self.add_symbol(symbol, n=0) 197 | i += 1 198 | 199 | def bos(self): 200 | """Helper to get index of beginning-of-sentence symbol""" 201 | return self.bos_index 202 | 203 | def pad(self): 204 | """Helper to get index of pad symbol""" 205 | return self.pad_index 206 | 207 | def eos(self): 208 | """Helper to get index of end-of-sentence symbol""" 209 | return self.eos_index 210 | 211 | def unk(self): 212 | """Helper to get index of unk symbol""" 213 | return self.unk_index 214 | 215 | @classmethod 216 | def load(cls, f): 217 | """Loads the dictionary from a text file with the format: 218 | 219 | ``` 220 | 221 | 222 | ... 223 | ``` 224 | """ 225 | d = cls() 226 | d.add_from_file(f) 227 | return d 228 | 229 | def add_from_file(self, f): 230 | """ 231 | Loads a pre-existing dictionary from a text file and adds its symbols 232 | to this instance. 233 | """ 234 | if isinstance(f, str): 235 | try: 236 | with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd: 237 | self.add_from_file(fd) 238 | except FileNotFoundError as fnfe: 239 | raise fnfe 240 | except UnicodeError: 241 | raise Exception( 242 | "Incorrect encoding detected in {}, please " 243 | "rebuild the dataset".format(f) 244 | ) 245 | return 246 | 247 | lines = f.readlines() 248 | indices_start_line = self._load_meta(lines) 249 | 250 | for line in lines[indices_start_line:]: 251 | try: 252 | line, field = line.rstrip().rsplit(" ", 1) 253 | if field == "#fairseq:overwrite": 254 | overwrite = True 255 | line, field = line.rsplit(" ", 1) 256 | else: 257 | overwrite = False 258 | count = int(field) 259 | word = line 260 | if word in self and not overwrite: 261 | raise RuntimeError( 262 | "Duplicate word found when loading Dictionary: '{}'. " 263 | "Duplicate words can overwrite earlier ones by adding the " 264 | "#fairseq:overwrite flag at the end of the corresponding row " 265 | "in the dictionary file. 
If using the Camembert model, please " 266 | "download an updated copy of the model file.".format(word) 267 | ) 268 | self.add_symbol(word, n=count, overwrite=overwrite) 269 | except ValueError: 270 | raise ValueError( 271 | f"Incorrect dictionary format, expected ' [flags]': \"{line}\"" 272 | ) 273 | 274 | def _save(self, f, kv_iterator): 275 | if isinstance(f, str): 276 | PathManager.mkdirs(os.path.dirname(f)) 277 | with PathManager.open(f, "w", encoding="utf-8") as fd: 278 | return self.save(fd) 279 | for k, v in kv_iterator: 280 | print("{} {}".format(k, v), file=f) 281 | 282 | def _get_meta(self): 283 | return [], [] 284 | 285 | def _load_meta(self, lines): 286 | return 0 287 | 288 | def save(self, f): 289 | """Stores dictionary into a text file""" 290 | ex_keys, ex_vals = self._get_meta() 291 | self._save( 292 | f, 293 | zip( 294 | ex_keys + self.symbols[self.nspecial :], 295 | ex_vals + self.count[self.nspecial :], 296 | ), 297 | ) 298 | 299 | def dummy_sentence(self, length): 300 | t = torch.Tensor(length).uniform_(self.nspecial + 1, len(self)).long() 301 | t[-1] = self.eos() 302 | return t 303 | 304 | def encode_line( 305 | self, 306 | line, 307 | line_tokenizer=tokenize_line, 308 | add_if_not_exist=True, 309 | consumer=None, 310 | append_eos=True, 311 | reverse_order=False, 312 | ) -> torch.IntTensor: 313 | words = line_tokenizer(line) 314 | if reverse_order: 315 | words = list(reversed(words)) 316 | nwords = len(words) 317 | ids = torch.IntTensor(nwords + 1 if append_eos else nwords) 318 | 319 | for i, word in enumerate(words): 320 | if add_if_not_exist: 321 | idx = self.add_symbol(word) 322 | else: 323 | idx = self.index(word) 324 | if consumer is not None: 325 | consumer(word, idx) 326 | ids[i] = idx 327 | if append_eos: 328 | ids[nwords] = self.eos_index 329 | return ids 330 | 331 | @staticmethod 332 | def _add_file_to_dictionary_single_worker( 333 | filename, 334 | tokenize, 335 | eos_word, 336 | start_offset, 337 | end_offset, 338 | ): 339 | counter = Counter() 340 | with Chunker(filename, start_offset, end_offset) as line_iterator: 341 | for line in line_iterator: 342 | for word in tokenize(line): 343 | counter.update([word]) 344 | counter.update([eos_word]) 345 | return counter 346 | 347 | @staticmethod 348 | def add_file_to_dictionary(filename, dict, tokenize, num_workers): 349 | def merge_result(counter): 350 | for w, c in sorted(counter.items()): 351 | dict.add_symbol(w, c) 352 | 353 | local_file = PathManager.get_local_path(filename) 354 | offsets = find_offsets(local_file, num_workers) 355 | if num_workers > 1: 356 | chunks = zip(offsets, offsets[1:]) 357 | pool = Pool(processes=num_workers) 358 | results = [] 359 | for (start_offset, end_offset) in chunks: 360 | results.append( 361 | pool.apply_async( 362 | Dictionary._add_file_to_dictionary_single_worker, 363 | ( 364 | local_file, 365 | tokenize, 366 | dict.eos_word, 367 | start_offset, 368 | end_offset, 369 | ), 370 | ) 371 | ) 372 | pool.close() 373 | pool.join() 374 | for r in results: 375 | merge_result(r.get()) 376 | else: 377 | merge_result( 378 | Dictionary._add_file_to_dictionary_single_worker( 379 | local_file, tokenize, dict.eos_word, offsets[0], offsets[1] 380 | ) 381 | ) 382 | 383 | -------------------------------------------------------------------------------- /alpaca/src/task/seq2seq_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 
2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import logging 7 | 8 | import numpy as np 9 | import torch 10 | from fairseq.data import FairseqDataset, data_utils 11 | from fairseq.utils import new_arange 12 | import math 13 | from fairseq.utils import new_arange 14 | 15 | logger = logging.getLogger(__name__) 16 | 17 | def collate( 18 | samples, 19 | pad_idx, 20 | eos_idx, 21 | left_pad_source=True, 22 | left_pad_target=False, 23 | input_feeding=True, 24 | pad_to_length=None, 25 | pad_to_multiple=1, 26 | ): 27 | if len(samples) == 0: 28 | return {} 29 | 30 | def merge(key, left_pad, move_eos_to_beginning=False, pad_to_length=None): 31 | return data_utils.collate_tokens( 32 | [s[key] for s in samples], 33 | pad_idx, 34 | eos_idx, 35 | left_pad, 36 | move_eos_to_beginning, 37 | pad_to_length=pad_to_length, 38 | pad_to_multiple=pad_to_multiple, 39 | ) 40 | 41 | def check_alignment(alignment, src_len, tgt_len): 42 | if alignment is None or len(alignment) == 0: 43 | return False 44 | if ( 45 | alignment[:, 0].max().item() >= src_len - 1 46 | or alignment[:, 1].max().item() >= tgt_len - 1 47 | ): 48 | logger.warning("alignment size mismatch found, skipping alignment!") 49 | return False 50 | return True 51 | 52 | def compute_alignment_weights(alignments): 53 | """ 54 | Given a tensor of shape [:, 2] containing the source-target indices 55 | corresponding to the alignments, a weight vector containing the 56 | inverse frequency of each target index is computed. 57 | For e.g. if alignments = [[5, 7], [2, 3], [1, 3], [4, 2]], then 58 | a tensor containing [1., 0.5, 0.5, 1] should be returned (since target 59 | index 3 is repeated twice) 60 | """ 61 | align_tgt = alignments[:, 1] 62 | _, align_tgt_i, align_tgt_c = torch.unique( 63 | align_tgt, return_inverse=True, return_counts=True 64 | ) 65 | align_weights = align_tgt_c[align_tgt_i[np.arange(len(align_tgt))]] 66 | return 1.0 / align_weights.float() 67 | 68 | id = torch.LongTensor([s["id"] for s in samples]) 69 | src_tokens = merge( 70 | "source", 71 | left_pad=left_pad_source, 72 | pad_to_length=pad_to_length["source"] if pad_to_length is not None else None, 73 | ) 74 | # sort by descending source length 75 | src_lengths = torch.LongTensor( 76 | [s["source"].ne(pad_idx).long().sum() for s in samples] 77 | ) 78 | src_lengths, sort_order = src_lengths.sort(descending=True) 79 | id = id.index_select(0, sort_order) 80 | src_tokens = src_tokens.index_select(0, sort_order) 81 | 82 | def merge_data(data_name, sort_order): 83 | prepared_data = merge( 84 | data_name, 85 | left_pad=left_pad_target, 86 | pad_to_length=pad_to_length[data_name] 87 | if pad_to_length is not None 88 | else None, 89 | ) 90 | return prepared_data.index_select(0, sort_order) 91 | 92 | src_input = merge( 93 | "src_input", 94 | left_pad=left_pad_source, 95 | pad_to_length=pad_to_length["src_input"] if pad_to_length is not None else None, 96 | ) 97 | src_input = src_input.index_select(0, sort_order) 98 | 99 | bos_token = merge( 100 | "bos_token", 101 | left_pad=left_pad_source, 102 | pad_to_length=pad_to_length["bos_token"] if pad_to_length is not None else None, 103 | ) 104 | bos_token = bos_token.index_select(0, sort_order) 105 | 106 | src_pos = merge( 107 | "src_pos", 108 | left_pad=left_pad_source, 109 | pad_to_length=pad_to_length["src_pos"] if pad_to_length is not None else None, 110 | ) 111 | src_pos = src_pos.index_select(0, sort_order) 112 | 113 | bos_token_pos = merge( 114 | 
"bos_token_pos", 115 | left_pad=left_pad_source, 116 | pad_to_length=pad_to_length["bos_token_pos"] if pad_to_length is not None else None, 117 | ) 118 | bos_token_pos = bos_token_pos.index_select(0, sort_order) 119 | 120 | target = None 121 | tgt_input = None 122 | tgt_pos = None 123 | 124 | seq_input = None 125 | seq_mask = None 126 | 127 | if samples[0].get("target", None) is not None: 128 | target = merge( 129 | "target", 130 | left_pad=left_pad_target, 131 | pad_to_length=pad_to_length["target"] 132 | if pad_to_length is not None 133 | else None, 134 | ) 135 | target = target.index_select(0, sort_order) 136 | tgt_lengths = torch.LongTensor( 137 | [s["target"].ne(pad_idx).long().sum() for s in samples] 138 | ).index_select(0, sort_order) 139 | ntokens = tgt_lengths.sum().item() 140 | 141 | prev_output_tokens = merge( 142 | "target", 143 | left_pad=left_pad_target, 144 | move_eos_to_beginning=True, 145 | pad_to_length=pad_to_length["target"] if pad_to_length is not None else None, 146 | ) 147 | prev_output_tokens = prev_output_tokens.index_select(0, sort_order) 148 | prev_output_tokens[:,0:1] = bos_token 149 | 150 | tgt_pos = merge( 151 | "tgt_pos", 152 | left_pad=left_pad_source, 153 | move_eos_to_beginning=True, 154 | pad_to_length=pad_to_length["tgt_pos"] if pad_to_length is not None else None, 155 | ) 156 | tgt_pos = tgt_pos.index_select(0, sort_order) 157 | tgt_pos[:,0:1] = bos_token_pos 158 | tgt_pos_mask = (tgt_pos == pad_idx) 159 | tgt_pos = tgt_pos.masked_fill(tgt_pos_mask, 0) 160 | 161 | seq_mask = merge( 162 | "seq_mask", 163 | left_pad=left_pad_source, 164 | pad_to_length=pad_to_length["seq_mask"] if pad_to_length is not None else None, 165 | ) 166 | seq_mask = seq_mask.index_select(0, sort_order) 167 | seq_input = merge( 168 | "seq_input", 169 | left_pad=left_pad_source, 170 | pad_to_length=pad_to_length["seq_input"] if pad_to_length is not None else None, 171 | ) 172 | seq_input = seq_input.index_select(0, sort_order) 173 | else: 174 | ntokens = src_lengths.sum().item() 175 | prev_output_tokens = None 176 | 177 | src_pos_mask = (src_pos == pad_idx) 178 | src_pos = src_pos.masked_fill(src_pos_mask, 0) 179 | 180 | batch = { 181 | "id": id, 182 | "nsentences": len(samples), 183 | "ntokens": ntokens, 184 | "net_input": { 185 | "seq_mask": seq_mask, 186 | "seq_input": seq_input, 187 | "soruce": src_tokens, 188 | "src_tokens": src_input, 189 | "src_lengths": src_lengths, 190 | "bos_token": bos_token, 191 | "src_pos": src_pos, 192 | "tgt_pos": tgt_pos, 193 | "bos_token_pos": bos_token_pos, 194 | }, 195 | "target": target, 196 | } 197 | if prev_output_tokens is not None: 198 | batch["net_input"]["prev_output_tokens"] = prev_output_tokens 199 | 200 | if samples[0].get("alignment", None) is not None: 201 | bsz, tgt_sz = batch["target"].shape 202 | src_sz = batch["net_input"]["src_tokens"].shape[1] 203 | 204 | offsets = torch.zeros((len(sort_order), 2), dtype=torch.long) 205 | offsets[:, 1] += torch.arange(len(sort_order), dtype=torch.long) * tgt_sz 206 | if left_pad_source: 207 | offsets[:, 0] += src_sz - src_lengths 208 | if left_pad_target: 209 | offsets[:, 1] += tgt_sz - tgt_lengths 210 | 211 | alignments = [ 212 | alignment + offset 213 | for align_idx, offset, src_len, tgt_len in zip( 214 | sort_order, offsets, src_lengths, tgt_lengths 215 | ) 216 | for alignment in [samples[align_idx]["alignment"].view(-1, 2)] 217 | if check_alignment(alignment, src_len, tgt_len) 218 | ] 219 | 220 | if len(alignments) > 0: 221 | alignments = torch.cat(alignments, dim=0) 222 | align_weights = 
compute_alignment_weights(alignments) 223 | 224 | batch["alignments"] = alignments 225 | batch["align_weights"] = align_weights 226 | 227 | if samples[0].get("constraints", None) is not None: 228 | # Collate the packed constraints across the samples, padding to 229 | # the length of the longest sample. 230 | lens = [sample.get("constraints").size(0) for sample in samples] 231 | max_len = max(lens) 232 | constraints = torch.zeros((len(samples), max(lens))).long() 233 | for i, sample in enumerate(samples): 234 | constraints[i, 0 : lens[i]] = samples[i].get("constraints") 235 | batch["constraints"] = constraints 236 | 237 | return batch 238 | 239 | 240 | class LanguagePairDataset(FairseqDataset): 241 | 242 | def __init__( 243 | self, 244 | src, 245 | src_sizes, 246 | src_dict, 247 | tgt=None, 248 | tgt_sizes=None, 249 | tgt_dict=None, 250 | left_pad_source=True, 251 | left_pad_target=False, 252 | shuffle=True, 253 | input_feeding=True, 254 | remove_eos_from_source=False, 255 | append_eos_to_target=False, 256 | align_dataset=None, 257 | constraints=None, 258 | append_bos=False, 259 | eos=None, 260 | num_buckets=0, 261 | src_lang_id=None, 262 | tgt_lang_id=None, 263 | pad_to_multiple=1, 264 | ): 265 | if tgt_dict is not None: 266 | assert src_dict.pad() == tgt_dict.pad() 267 | assert src_dict.eos() == tgt_dict.eos() 268 | assert src_dict.unk() == tgt_dict.unk() 269 | if tgt is not None: 270 | assert len(src) == len( 271 | tgt 272 | ), "Source and target must contain the same number of examples" 273 | self.src = src 274 | self.tgt = tgt 275 | self.src_sizes = np.array(src_sizes) 276 | self.tgt_sizes = np.array(tgt_sizes) if tgt_sizes is not None else None 277 | self.sizes = ( 278 | np.vstack((self.src_sizes, self.tgt_sizes)).T 279 | if self.tgt_sizes is not None 280 | else self.src_sizes 281 | ) 282 | self.src_dict = src_dict 283 | self.tgt_dict = tgt_dict 284 | self.left_pad_source = left_pad_source 285 | self.left_pad_target = left_pad_target 286 | self.shuffle = shuffle 287 | self.input_feeding = input_feeding 288 | self.remove_eos_from_source = remove_eos_from_source 289 | self.append_eos_to_target = append_eos_to_target 290 | self.align_dataset = align_dataset 291 | if self.align_dataset is not None: 292 | assert ( 293 | self.tgt_sizes is not None 294 | ), "Both source and target needed when alignments are provided" 295 | self.constraints = constraints 296 | self.append_bos = append_bos 297 | self.eos = eos if eos is not None else src_dict.eos() 298 | self.src_lang_id = src_lang_id 299 | self.tgt_lang_id = tgt_lang_id 300 | if num_buckets > 0: 301 | from fairseq.data import BucketPadLengthDataset 302 | 303 | self.src = BucketPadLengthDataset( 304 | self.src, 305 | sizes=self.src_sizes, 306 | num_buckets=num_buckets, 307 | pad_idx=self.src_dict.pad(), 308 | left_pad=self.left_pad_source, 309 | ) 310 | self.src_sizes = self.src.sizes 311 | logger.info("bucketing source lengths: {}".format(list(self.src.buckets))) 312 | if self.tgt is not None: 313 | self.tgt = BucketPadLengthDataset( 314 | self.tgt, 315 | sizes=self.tgt_sizes, 316 | num_buckets=num_buckets, 317 | pad_idx=self.tgt_dict.pad(), 318 | left_pad=self.left_pad_target, 319 | ) 320 | self.tgt_sizes = self.tgt.sizes 321 | logger.info( 322 | "bucketing target lengths: {}".format(list(self.tgt.buckets)) 323 | ) 324 | 325 | # determine bucket sizes using self.num_tokens, which will return 326 | # the padded lengths (thanks to BucketPadLengthDataset) 327 | num_tokens = np.vectorize(self.num_tokens, otypes=[np.long]) 328 | 
self.bucketed_num_tokens = num_tokens(np.arange(len(self.src))) 329 | self.buckets = [ 330 | (None, num_tokens) for num_tokens in np.unique(self.bucketed_num_tokens) 331 | ] 332 | else: 333 | self.buckets = None 334 | self.pad_to_multiple = pad_to_multiple 335 | 336 | def get_batch_shapes(self): 337 | return self.buckets 338 | 339 | def __getitem__(self, index): 340 | tgt_item = self.tgt[index] if self.tgt is not None else None 341 | src_item = self.src[index] 342 | 343 | # remove bos eos from item 344 | eos = self.src_dict.eos() 345 | if self.src[index][-1] == eos: 346 | src_item = self.src[index][:-1] 347 | 348 | bos_token = src_item[-1:] 349 | bos_token_pos = new_arange(bos_token) + src_item.size(0) - 1 350 | 351 | src_input = src_item[:-1] 352 | src_pos = new_arange(src_input) 353 | 354 | bos = self.tgt_dict.bos() 355 | if tgt_item is not None: 356 | if self.tgt[index][0] == bos: 357 | tgt_item = self.tgt[index][1:] 358 | 359 | tgt_pos = new_arange(tgt_item) + src_item.size(0) 360 | seq_input = torch.cat([src_item, tgt_item[:-1]], 0) 361 | seq_mask = torch.cat([src_input.new_zeros(src_input.size()), tgt_item.new_ones(tgt_item.size())], 0) 362 | else: 363 | tgt_pos = None 364 | seq_input = None 365 | seq_mask = None 366 | 367 | example = { 368 | "id": index, 369 | "source": src_item, 370 | "target": tgt_item, 371 | "src_input": src_input, 372 | "bos_token": bos_token, 373 | "src_pos": src_pos, 374 | "tgt_pos": tgt_pos, 375 | "bos_token_pos": bos_token_pos, 376 | "seq_input": seq_input, 377 | "seq_mask": seq_mask, 378 | } 379 | 380 | if self.align_dataset is not None: 381 | example["alignment"] = self.align_dataset[index] 382 | if self.constraints is not None: 383 | example["constraints"] = self.constraints[index] 384 | return example 385 | 386 | def __len__(self): 387 | return len(self.src) 388 | 389 | def collater(self, samples, pad_to_length=None): 390 | res = collate( 391 | samples, 392 | pad_idx=self.src_dict.pad(), 393 | eos_idx=self.eos, 394 | left_pad_source=self.left_pad_source, 395 | left_pad_target=self.left_pad_target, 396 | input_feeding=self.input_feeding, 397 | pad_to_length=pad_to_length, 398 | pad_to_multiple=self.pad_to_multiple, 399 | ) 400 | if self.src_lang_id is not None or self.tgt_lang_id is not None: 401 | src_tokens = res["net_input"]["src_tokens"] 402 | bsz = src_tokens.size(0) 403 | if self.src_lang_id is not None: 404 | res["net_input"]["src_lang_id"] = ( 405 | torch.LongTensor([[self.src_lang_id]]).expand(bsz, 1).to(src_tokens) 406 | ) 407 | if self.tgt_lang_id is not None: 408 | res["tgt_lang_id"] = ( 409 | torch.LongTensor([[self.tgt_lang_id]]).expand(bsz, 1).to(src_tokens) 410 | ) 411 | return res 412 | 413 | def num_tokens(self, index): 414 | """Return the number of tokens in a sample. This value is used to 415 | enforce ``--max-tokens`` during batching.""" 416 | return max( 417 | self.src_sizes[index], 418 | self.tgt_sizes[index] if self.tgt_sizes is not None else 0, 419 | ) 420 | 421 | def num_tokens_vec(self, indices): 422 | """Return the number of tokens for a set of positions defined by indices. 423 | This value is used to enforce ``--max-tokens`` during batching.""" 424 | sizes = self.src_sizes[indices] 425 | if self.tgt_sizes is not None: 426 | sizes = np.maximum(sizes, self.tgt_sizes[indices]) 427 | return sizes 428 | 429 | def size(self, index): 430 | """Return an example's size as a float or tuple. 
This value is used when 431 | filtering a dataset with ``--max-positions``.""" 432 | return ( 433 | self.src_sizes[index], 434 | self.tgt_sizes[index] if self.tgt_sizes is not None else 0, 435 | ) 436 | 437 | def ordered_indices(self): 438 | """Return an ordered list of indices. Batches will be constructed based 439 | on this order.""" 440 | if self.shuffle: 441 | indices = np.random.permutation(len(self)).astype(np.int64) 442 | else: 443 | indices = np.arange(len(self), dtype=np.int64) 444 | if self.buckets is None: 445 | # sort by target length, then source length 446 | if self.tgt_sizes is not None: 447 | indices = indices[np.argsort(self.tgt_sizes[indices], kind="mergesort")] 448 | return indices[np.argsort(self.src_sizes[indices], kind="mergesort")] 449 | else: 450 | # sort by bucketed_num_tokens, which is: 451 | # max(padded_src_len, padded_tgt_len) 452 | return indices[ 453 | np.argsort(self.bucketed_num_tokens[indices], kind="mergesort") 454 | ] 455 | 456 | @property 457 | def supports_prefetch(self): 458 | return getattr(self.src, "supports_prefetch", False) and ( 459 | getattr(self.tgt, "supports_prefetch", False) or self.tgt is None 460 | ) 461 | 462 | def prefetch(self, indices): 463 | self.src.prefetch(indices) 464 | if self.tgt is not None: 465 | self.tgt.prefetch(indices) 466 | if self.align_dataset is not None: 467 | self.align_dataset.prefetch(indices) 468 | 469 | def filter_indices_by_size(self, indices, max_sizes): 470 | """Filter a list of sample indices. Remove those that are longer 471 | than specified in max_sizes. 472 | 473 | Args: 474 | indices (np.array): original array of sample indices 475 | max_sizes (int or list[int] or tuple[int]): max sample size, 476 | can be defined separately for src and tgt (then list or tuple) 477 | 478 | Returns: 479 | np.array: filtered sample array 480 | list: list of removed indices 481 | """ 482 | return data_utils.filter_paired_dataset_indices_by_size( 483 | self.src_sizes, 484 | self.tgt_sizes, 485 | indices, 486 | max_sizes, 487 | ) 488 | -------------------------------------------------------------------------------- /alpaca/src/task/seq2seq_ft_task.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import itertools 3 | import os 4 | import logging 5 | from typing import Dict, Optional 6 | 7 | from dataclasses import dataclass, field 8 | from fairseq import utils 9 | from fairseq.tasks.translation import TranslationTask 10 | from fairseq.utils import new_arange 11 | from fairseq.tasks import FairseqTask, register_task 12 | from fairseq.tasks.translation import TranslationConfig 13 | from fairseq.data import ( 14 | AppendTokenDataset, 15 | ConcatDataset, 16 | PrependTokenDataset, 17 | StripTokenDataset, 18 | TruncateDataset, 19 | data_utils, 20 | indexed_dataset, 21 | ) 22 | from fairseq.data import iterators 23 | from .dictionary import Dictionary 24 | from .seq2seq_dataset import LanguagePairDataset 25 | from fairseq.utils import safe_getattr, safe_hasattr 26 | 27 | logger = logging.getLogger(__name__) 28 | 29 | 30 | def load_langpair_dataset( 31 | data_path, 32 | split, 33 | src, 34 | src_dict, 35 | tgt, 36 | tgt_dict, 37 | combine, 38 | dataset_impl, 39 | upsample_primary, 40 | left_pad_source, 41 | left_pad_target, 42 | max_source_positions, 43 | max_target_positions, 44 | prepend_bos=False, 45 | load_alignments=False, 46 | truncate_source=False, 47 | append_source_id=False, 48 | num_buckets=0, 49 | shuffle=True, 50 | pad_to_multiple=1, 51 | prepend_bos_src=None, 52 | 
): 53 | def split_exists(split, src, tgt, lang, data_path): 54 | filename = os.path.join(data_path, "{}.{}-{}.{}".format(split, src, tgt, lang)) 55 | return indexed_dataset.dataset_exists(filename, impl=dataset_impl) 56 | 57 | src_datasets = [] 58 | tgt_datasets = [] 59 | 60 | for k in itertools.count(): 61 | split_k = split + (str(k) if k > 0 else "") 62 | 63 | # infer langcode 64 | if split_exists(split_k, src, tgt, src, data_path): 65 | prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, src, tgt)) 66 | elif split_exists(split_k, tgt, src, src, data_path): 67 | prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, tgt, src)) 68 | else: 69 | if k > 0: 70 | break 71 | else: 72 | raise FileNotFoundError( 73 | "Dataset not found: {} ({})".format(split, data_path) 74 | ) 75 | 76 | src_dataset = data_utils.load_indexed_dataset( 77 | prefix + src, src_dict, dataset_impl 78 | ) 79 | if truncate_source: 80 | src_dataset = AppendTokenDataset( 81 | TruncateDataset( 82 | StripTokenDataset(src_dataset, src_dict.eos()), 83 | max_source_positions - 1, 84 | ), 85 | src_dict.eos(), 86 | ) 87 | src_datasets.append(src_dataset) 88 | 89 | tgt_dataset = data_utils.load_indexed_dataset( 90 | prefix + tgt, tgt_dict, dataset_impl 91 | ) 92 | if tgt_dataset is not None: 93 | tgt_datasets.append(tgt_dataset) 94 | 95 | logger.info( 96 | "{} {} {}-{} {} examples".format( 97 | data_path, split_k, src, tgt, len(src_datasets[-1]) 98 | ) 99 | ) 100 | 101 | if not combine: 102 | break 103 | 104 | assert len(src_datasets) == len(tgt_datasets) or len(tgt_datasets) == 0 105 | 106 | if len(src_datasets) == 1: 107 | src_dataset = src_datasets[0] 108 | tgt_dataset = tgt_datasets[0] if len(tgt_datasets) > 0 else None 109 | else: 110 | sample_ratios = [1] * len(src_datasets) 111 | sample_ratios[0] = upsample_primary 112 | src_dataset = ConcatDataset(src_datasets, sample_ratios) 113 | if len(tgt_datasets) > 0: 114 | tgt_dataset = ConcatDataset(tgt_datasets, sample_ratios) 115 | else: 116 | tgt_dataset = None 117 | 118 | if prepend_bos: 119 | assert hasattr(src_dict, "bos_index") and hasattr(tgt_dict, "bos_index") 120 | src_dataset = PrependTokenDataset(src_dataset, src_dict.bos()) 121 | if tgt_dataset is not None: 122 | tgt_dataset = PrependTokenDataset(tgt_dataset, tgt_dict.bos()) 123 | elif prepend_bos_src is not None: 124 | logger.info(f"prepending src bos: {prepend_bos_src}") 125 | src_dataset = PrependTokenDataset(src_dataset, prepend_bos_src) 126 | 127 | eos = None 128 | if append_source_id: 129 | src_dataset = AppendTokenDataset( 130 | src_dataset, src_dict.index("[{}]".format(src)) 131 | ) 132 | if tgt_dataset is not None: 133 | tgt_dataset = AppendTokenDataset( 134 | tgt_dataset, tgt_dict.index("[{}]".format(tgt)) 135 | ) 136 | eos = tgt_dict.index("[{}]".format(tgt)) 137 | 138 | align_dataset = None 139 | if load_alignments: 140 | align_path = os.path.join(data_path, "{}.align.{}-{}".format(split, src, tgt)) 141 | if indexed_dataset.dataset_exists(align_path, impl=dataset_impl): 142 | align_dataset = data_utils.load_indexed_dataset( 143 | align_path, None, dataset_impl 144 | ) 145 | 146 | tgt_dataset_sizes = tgt_dataset.sizes if tgt_dataset is not None else None 147 | return LanguagePairDataset( 148 | src_dataset, 149 | src_dataset.sizes, 150 | src_dict, 151 | tgt_dataset, 152 | tgt_dataset_sizes, 153 | tgt_dict, 154 | left_pad_source=left_pad_source, 155 | left_pad_target=left_pad_target, 156 | align_dataset=align_dataset, 157 | eos=eos, 158 | num_buckets=num_buckets, 159 | shuffle=shuffle, 160 | 
pad_to_multiple=pad_to_multiple, 161 | ) 162 | 163 | @dataclass 164 | class FTTaskConfig(TranslationConfig): 165 | 166 | megatron_model: bool = field( 167 | default=False, 168 | metadata={"help": "using megatron-lm to split model"}, 169 | ) 170 | 171 | data_para: bool = field( 172 | default=False, metadata={"help": "data parallel"}, 173 | ) 174 | 175 | @register_task("seq2seq_ft_task", dataclass=FTTaskConfig) 176 | class Seq2SeqFineTuningTask(TranslationTask): 177 | 178 | def __init__(self, cfg, src_dict, tgt_dict): 179 | super().__init__(cfg, src_dict, tgt_dict) 180 | 181 | self.data_para = safe_getattr(cfg, "data_para", False) 182 | self.megatron_model = safe_getattr(cfg, "megatron_model", False) 183 | 184 | def build_bpe(self, args): 185 | from sentencepiece import SentencePieceProcessor 186 | model_path = args.sentencepiece_model 187 | self.sp_model = SentencePieceProcessor(model_file=model_path) 188 | return self.sp_model 189 | 190 | @classmethod 191 | def load_dictionary(cls, filename): 192 | if "dict.src.txt" not in filename or "dict.tgt.txt" not in filename: 193 | logger.info("{} is not exist!".format(filename)) 194 | filename = "alpaca/scripts/assert/dict.txt" 195 | logger.info("load common dict {}!".format(filename)) 196 | 197 | dictionary = Dictionary.load(filename) 198 | dictionary.pad_index = dictionary.add_symbol(dictionary.pad_word) 199 | return dictionary 200 | 201 | def load_dataset(self, split, epoch=1, combine=False, **kwargs): 202 | paths = utils.split_paths(self.cfg.data) 203 | data_path = paths[(epoch - 1) % len(paths)] 204 | src, tgt = self.cfg.source_lang, self.cfg.target_lang 205 | 206 | self.cfg.left_pad_source = False 207 | self.cfg.left_pad_target = False 208 | self.datasets[split] = load_langpair_dataset( 209 | data_path, 210 | split, 211 | src, 212 | self.src_dict, 213 | tgt, 214 | self.tgt_dict, 215 | combine=combine, 216 | dataset_impl=self.cfg.dataset_impl, 217 | upsample_primary=self.cfg.upsample_primary, 218 | left_pad_source=self.cfg.left_pad_source, 219 | left_pad_target=self.cfg.left_pad_target, 220 | max_source_positions=self.cfg.max_source_positions, 221 | max_target_positions=self.cfg.max_target_positions, 222 | truncate_source=self.cfg.truncate_source, 223 | shuffle=(split != "test"), 224 | prepend_bos=True, 225 | ) 226 | 227 | def build_dataset_for_inference(self, src_tokens, src_lengths, constraints=None): 228 | return LanguagePairDataset( 229 | src_tokens, 230 | src_lengths, 231 | self.source_dictionary, 232 | tgt_dict=self.target_dictionary, 233 | constraints=constraints, 234 | ) 235 | 236 | def build_generator( 237 | self, 238 | models, 239 | args=None, 240 | **kwargs, 241 | ): 242 | from generator.sequence_generator import SequenceGenerator 243 | from generator import search 244 | 245 | if isinstance(kwargs, dict): 246 | if "sampling" in kwargs: 247 | sampling = kwargs["sampling"] 248 | else: 249 | sampling = False 250 | if "sampling_topk" in kwargs: 251 | sampling_topk = kwargs["sampling_topk"] 252 | else: 253 | sampling_topk = -1.0 254 | if "sampling_topp" in kwargs: 255 | sampling_topp = kwargs["sampling_topp"] 256 | else: 257 | sampling_topp = -1.0 258 | else: 259 | sampling = getattr(args, "sampling", False) 260 | sampling_topk = getattr(args, "sampling_topk", -1.0) 261 | sampling_topp = getattr(args, "sampling_topp", -1.0) 262 | 263 | if sampling: 264 | search_strategy = search.Sampling( 265 | self.target_dictionary, sampling_topk, sampling_topp 266 | ) 267 | else: 268 | search_strategy = search.BeamSearch(self.target_dictionary) 269 | 
270 | extra_gen_cls_kwargs = {} 271 | return SequenceGenerator( 272 | models, 273 | self.target_dictionary, 274 | beam_size=getattr(args, "beam", 5), 275 | max_len_a=getattr(args, "max_len_a", 0), 276 | max_len_b=getattr(args, "max_len_b", 512), 277 | min_len=getattr(args, "min_len", 1), 278 | normalize_scores=(not getattr(args, "unnormalized", False)), 279 | len_penalty=getattr(args, "lenpen", 1), 280 | unk_penalty=getattr(args, "unkpen", 0), 281 | temperature=getattr(args, "temperature", 1.0), 282 | match_source_len=getattr(args, "match_source_len", False), 283 | no_repeat_ngram_size=getattr(args, "no_repeat_ngram_size", 0), 284 | search_strategy=search_strategy, 285 | **extra_gen_cls_kwargs, 286 | ) 287 | 288 | def inference_step( 289 | self, generator, models, sample, prefix_tokens=None, constraints=None 290 | ): 291 | with torch.no_grad(): 292 | bos_token = sample['net_input']['bos_token'] 293 | return generator.generate( 294 | models, sample, 295 | prefix_tokens=prefix_tokens, constraints=constraints, 296 | bos_token=bos_token, 297 | ) 298 | 299 | def get_batch_iterator( 300 | self, 301 | dataset, 302 | max_tokens=None, 303 | max_sentences=None, 304 | max_positions=None, 305 | ignore_invalid_inputs=False, 306 | required_batch_size_multiple=1, 307 | seed=1, 308 | num_shards=1, 309 | shard_id=0, 310 | num_workers=0, 311 | epoch=1, 312 | data_buffer_size=0, 313 | disable_iterator_cache=False, 314 | skip_remainder_batch=False, 315 | grouped_shuffling=False, 316 | update_epoch_batch_itr=False, 317 | ): 318 | if not self.data_para: 319 | num_shards = 1 320 | shard_id=0 321 | return super().get_batch_iterator( 322 | dataset, 323 | max_tokens, 324 | max_sentences, 325 | max_positions, 326 | ignore_invalid_inputs, 327 | required_batch_size_multiple, 328 | seed, 329 | num_shards, 330 | shard_id, 331 | num_workers, 332 | epoch, 333 | data_buffer_size, 334 | disable_iterator_cache, 335 | skip_remainder_batch, 336 | grouped_shuffling, 337 | update_epoch_batch_itr, 338 | ) -------------------------------------------------------------------------------- /alpaca/src/task/seq2seq_lora_task.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import itertools 3 | import os 4 | import logging 5 | from typing import Dict, Optional 6 | 7 | from dataclasses import dataclass, field 8 | from fairseq.tasks import FairseqTask, register_task 9 | from fairseq.tasks.translation import TranslationConfig 10 | from fairseq.data import iterators 11 | from .seq2seq_ft_task import Seq2SeqFineTuningTask, FTTaskConfig 12 | from fairseq.utils import safe_getattr, safe_hasattr 13 | 14 | 15 | logger = logging.getLogger(__name__) 16 | 17 | 18 | @dataclass 19 | class LoRATaskConfig(FTTaskConfig): 20 | 21 | lora_model_inf: Optional[str] = field( 22 | default="", metadata={"help": "load lora model for inference"}, 23 | ) 24 | 25 | lora_tuning: bool = field( 26 | default=False, metadata={"help": "if using lora tuning"}, 27 | ) 28 | 29 | 30 | @register_task("seq2seq_lora_task", dataclass=LoRATaskConfig) 31 | class Seq2SeqLoRATask(Seq2SeqFineTuningTask): 32 | 33 | def __init__(self, cfg, src_dict, tgt_dict): 34 | super().__init__(cfg, src_dict, tgt_dict) 35 | 36 | self.lora_model_inf = safe_getattr(cfg, "lora_model_inf", "") 37 | self.lora_tuning = safe_getattr(cfg, "lora_tuning", False) 38 | 39 | def build_model(self, cfg, from_checkpoint=False): 40 | model = super().build_model(cfg, from_checkpoint) 41 | if len(self.lora_model_inf) > 0: 42 | 
model.set_lora_model_inf(self.lora_model_inf) 43 | logging.info("Seq2SeqLoRATask load inference model checkpoint from {}".format(self.lora_model_inf)) 44 | return model 45 | 46 | -------------------------------------------------------------------------------- /alpaca/src/webapp.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | 6 | import torch 7 | from model.llama_model import LLaMA 8 | import argparse 9 | import gradio as gr 10 | 11 | 12 | 13 | def sample_demo(alpaca): 14 | 15 | @torch.no_grad() 16 | def process(prompt): 17 | prompt_text = "## Instruction:\n{}\n\n## Response:".format(prompt) 18 | print("Received:\n", prompt_text) 19 | eval_kwargs = dict(beam=1, sampling=True, sampling_topp=0.95, temperature=0.8, min_len=512) 20 | prompts = [prompt_text] 21 | results = alpaca.sample(prompts, **eval_kwargs)[0] 22 | print("Generated:\n", results[0]) 23 | return str(results[0]) 24 | 25 | demo = gr.Interface( 26 | title = "Efficient Alpaca", 27 | thumbnail = "https://github.com/dropreg/efficient_alpaca/blob/main/efficient_alpaca_logo.PNG", 28 | fn = process, 29 | inputs = gr.Textbox(lines=10, placeholder="Your prompt here..."), 30 | outputs = "text", 31 | ) 32 | 33 | demo.launch(share=True) 34 | 35 | def demo(alpaca): 36 | 37 | @torch.no_grad() 38 | def process(prompt, temperature, topp): 39 | prompt_text = "## Instruction:\n{}\n\n## Response:".format(prompt) 40 | print("Received:\n", prompt_text) 41 | eval_kwargs = dict(sampling=True, sampling_topp=topp, temperature=temperature) 42 | prompts = [prompt_text] 43 | results = alpaca.sample(prompts, **eval_kwargs)[0] 44 | print("Generated:\n", results[0]) 45 | return str(results[0]) 46 | 47 | with gr.Blocks() as demo: 48 | gr.Markdown( 49 | """ 50 |

51 |         <img src="https://github.com/dropreg/efficient_alpaca/blob/main/efficient_alpaca_logo.PNG" alt="Efficient-Alpaca"/> 52 | 

53 | """) 54 | 55 | with gr.Row(): 56 | with gr.Column(): 57 | model_input = gr.Textbox(lines=15, placeholder='Input something', label='Input') 58 | with gr.Row(): 59 | gen = gr.Button("Generate") 60 | clr = gr.Button("Clear") 61 | 62 | outputs = gr.Textbox(lines=15, label='Output') 63 | 64 | gr.Markdown( 65 | """ 66 | Generation Parameter 67 | """) 68 | with gr.Row(): 69 | with gr.Column(): 70 | temperature = gr.Slider(maximum=1, value=0.8, minimum=0, label='Temperature') 71 | topp = gr.Slider(maximum=1, value=0.95, minimum=0, label='Top P') 72 | 73 | inputs = [model_input, temperature, topp] 74 | gen.click(fn=process, inputs=inputs, outputs=outputs) 75 | clr.click(fn=lambda value: gr.update(value=""), inputs=clr, outputs=model_input) 76 | 77 | gr.Markdown( 78 | """ 79 | Our project can be found from [Efficient Alpaca](https://github.com/dropreg/efficient_alpaca) 80 | """) 81 | 82 | demo.launch(share=True) 83 | 84 | 85 | if __name__ == "__main__": 86 | 87 | parser = argparse.ArgumentParser() 88 | parser.add_argument( 89 | "--model-dir", 90 | required=True, 91 | type=str, 92 | default="alpaca_lora", 93 | help="path containing model file and src_dict.txt", 94 | ) 95 | parser.add_argument( 96 | "--model-file", 97 | default="checkpoint_best.pt", 98 | help="where in model_dir are weights saved", 99 | ) 100 | parser.add_argument( 101 | "--lora-model-inf", 102 | default="", 103 | help="where in model_dir are weights saved", 104 | ) 105 | parser.add_argument( 106 | "--lora-tuning", 107 | action="store_true", 108 | default=False, 109 | help="if true use XSUM_KWARGS else CNN_KWARGS", 110 | ) 111 | 112 | parser.add_argument("--bpe",) 113 | parser.add_argument("--sentencepiece-model") 114 | args = parser.parse_args() 115 | 116 | kwargs = { 117 | "user_dir": "alpaca/src", 118 | "lora_model_inf": args.lora_model_inf, 119 | "bpe": args.bpe, 120 | "sentencepiece_model": args.sentencepiece_model, 121 | "source_lang": 'src', 122 | "target_lang": 'tgt', 123 | "lora_tuning": args.lora_tuning, 124 | "task": "seq2seq_lora_task", 125 | } 126 | alpaca = LLaMA.from_pretrained( 127 | model_name_or_path=args.model_dir, 128 | checkpoint_file=args.model_file, 129 | **kwargs, 130 | ) 131 | 132 | alpaca = alpaca.eval() 133 | if torch.cuda.is_available(): 134 | alpaca = alpaca.half().cuda() 135 | 136 | demo(alpaca) 137 | -------------------------------------------------------------------------------- /efficient_alpaca_logo.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/efficient_alpaca_logo.PNG -------------------------------------------------------------------------------- /efficient_alpaca_logo_old.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/efficient_alpaca_logo_old.PNG -------------------------------------------------------------------------------- /webapp.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dropreg/efficient_alpaca/7232348d19873e055bf99f67e7d90d1722e9d029/webapp.PNG --------------------------------------------------------------------------------
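For convenience, the sketch below shows one way the Gradio demo in `alpaca/src/webapp.py` could be launched from the repository root, based only on the argparse flags defined in that file. Every path is a placeholder, and both the `PYTHONPATH` setting and `--bpe sentencepiece` are assumptions (chosen so that `from model.llama_model import LLaMA` resolves and so that the task's SentencePiece-based `build_bpe` is used); the `run_webapp.sh` scripts shipped with the repository remain the authoritative entry points.

```bash
# Hypothetical launch command for the web demo; all paths below are placeholders.
# PYTHONPATH=alpaca/src is assumed so that "from model.llama_model import LLaMA" resolves,
# and --bpe sentencepiece is assumed to match the task's SentencePiece tokenizer setup.
PYTHONPATH=alpaca/src python alpaca/src/webapp.py \
    --model-dir /path/to/llama-7b-fairseq \
    --model-file model.pt \
    --lora-model-inf /path/to/lora_checkpoint.pt \
    --lora-tuning \
    --bpe sentencepiece \
    --sentencepiece-model /path/to/tokenizer.model
```

If the checkpoint loads, `demo(alpaca)` starts a shareable Gradio app whose temperature and top-p sliders are passed straight through as the `temperature` and `sampling_topp` keyword arguments of `alpaca.sample`, with each prompt wrapped in the `## Instruction: ... ## Response:` template before sampling.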