# Minitron

🤗 Hugging Face Models &nbsp;|&nbsp; 📄 Paper &nbsp;|&nbsp; 📜 Blog &nbsp;|&nbsp; 💬 Demo
## Introduction

Minitron is a family of small language models (SLMs) obtained via pruning and knowledge distillation. We prune the model's embedding dimension, attention heads, and MLP intermediate dimension, and then perform continued training with distillation to arrive at the final models. A minimal illustrative sketch of this prune-then-distill recipe is shown below.
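The sketch below illustrates the two ingredients in plain PyTorch: activation-based width pruning of an MLP intermediate dimension, and a logit-distillation loss for continued training. It is an illustration only, not the NeMo/Megatron implementation used to produce Minitron, and all module and variable names are hypothetical.

```python
# Illustrative sketch only -- not the NeMo/Megatron implementation used for Minitron.
# Shows (1) activation-based width pruning of an MLP intermediate dimension and
# (2) a logit-distillation loss for continued training. All names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


def prune_mlp_intermediate(fc1: nn.Linear, fc2: nn.Linear, calib_x: torch.Tensor, keep: int):
    """Rank intermediate neurons by mean |activation| on calibration data and keep the top `keep`."""
    with torch.no_grad():
        acts = F.relu(fc1(calib_x))                  # (batch, intermediate)
        importance = acts.abs().mean(dim=0)          # one score per intermediate neuron
        idx = importance.topk(keep).indices.sort().values

        new_fc1 = nn.Linear(fc1.in_features, keep)
        new_fc2 = nn.Linear(keep, fc2.out_features)
        new_fc1.weight.copy_(fc1.weight[idx])        # keep only the selected rows
        new_fc1.bias.copy_(fc1.bias[idx])
        new_fc2.weight.copy_(fc2.weight[:, idx])     # ...and the matching columns
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2


def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL divergence between teacher and student token distributions, scaled by T^2."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


if __name__ == "__main__":
    hidden, intermediate, vocab = 64, 256, 1000
    fc1, fc2 = nn.Linear(hidden, intermediate), nn.Linear(intermediate, hidden)
    calib = torch.randn(32, hidden)
    small_fc1, small_fc2 = prune_mlp_intermediate(fc1, fc2, calib, keep=intermediate // 2)

    student = torch.randn(4, vocab, requires_grad=True)   # stand-in for student logits
    teacher = torch.randn(4, vocab)                       # stand-in for teacher logits
    loss = distillation_loss(student, teacher, temperature=2.0)
    loss.backward()
    print(small_fc1, small_fc2, loss.item())
```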

## News

1. 🔥🔥🔥 SOTA 8B model via pruning and distillation with only 400B tokens! See our [technical report](https://arxiv.org/abs/2408.11796) and blog post: [Mistral-NeMo-Minitron 8B Foundation Model Delivers Unparalleled Accuracy](https://developer.nvidia.com/blog/mistral-nemo-minitron-8b-foundation-model-delivers-unparalleled-accuracy/).
2. The best Llama-3.1 4B model is out! New blog post on the Llama-3.1-Minitron 4B models: [How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/).

## Minitron Model Performance

![Minitron accuracy](images/minitron.png)

*Minitron accuracy (MMLU) vs. other baseline models. Compression yields a significant (40x) reduction in the cost of training each additional model while producing better results. Please refer to our paper for the full set of results.*

Deriving the Minitron 8B and 4B models from the base Nemotron-4 15B model using our approach requires up to **40x fewer training tokens** per model compared to training from scratch; this results in **compute cost savings of 1.8x** for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B, and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. Please refer to our [arXiv paper](https://arxiv.org/abs/2407.14679) for more details.

## Hugging Face Checkpoints, Model Cards and Usage

Please see:

1. [Mistral-NeMo-Minitron-8B-Base](https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base) / [Instruct](https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Instruct).
2. [Llama-3.1-Minitron-4B-Width-Base](https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base).
3. [Llama-3.1-Minitron-4B-Depth-Base](https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Depth-Base).
4. [Minitron-8B-Base](https://huggingface.co/nvidia/Minitron-8B-Base).
5. [Minitron-4B-Base](https://huggingface.co/nvidia/Minitron-4B-Base) / [Instruct](https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct).

## Usage

### Hugging Face

Please refer to the instructions in the respective model cards above; a minimal loading example is also sketched below.
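For a quick start, the snippet below shows one way to load a base checkpoint with 🤗 Transformers. It is a minimal sketch that assumes a `transformers` release recent enough to support the corresponding architecture (and `accelerate` for `device_map="auto"`); see the individual model cards for exact version requirements and recommended generation settings.

```python
# Minimal sketch: generate with a Minitron base checkpoint via Hugging Face Transformers.
# Assumes a transformers version that supports the model's architecture (see the model card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Minitron-4B-Width-Base"  # any checkpoint from the list above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # base checkpoints are distributed in bf16
    device_map="auto",            # requires the `accelerate` package
)

prompt = "Complete the paragraph: our solar system is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```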

**Quantized Versions:** The 🤗 Hugging Face community has already created FP8 quantized versions of Minitron models. Give them a try here: [Minitron-8B-Base-FP8](https://huggingface.co/mgoin/Minitron-8B-Base-FP8) and [Minitron-4B-Base-FP8](https://huggingface.co/mgoin/Minitron-4B-Base-FP8). One way to serve them is sketched below.
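The sketch below loads one of these FP8 checkpoints with [vLLM](https://github.com/vllm-project/vllm). This is a community path rather than an officially supported one; it assumes a recent vLLM release that supports the model architecture and a GPU with native FP8 support.

```python
# Hedged usage sketch: serve a community FP8 quantization of Minitron with vLLM.
# Assumes a recent vLLM build that supports this architecture and FP8-capable hardware.
from vllm import LLM, SamplingParams

llm = LLM(model="mgoin/Minitron-4B-Base-FP8")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key idea behind knowledge distillation is"], params)
for out in outputs:
    print(out.outputs[0].text)
```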
### NeMo support

1. [Depth pruning](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/depth-pruning.html) support.
2. Width pruning support.
3. [Distillation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/distillation/distillation.html) support.

Find notebook examples [here](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/llama-3/pruning-distillation) for performing teacher correction, width/depth pruning, and distillation.

### TRT-LLM

The following steps provide an example of how to load the Minitron-8B model in the `.nemo` checkpoint format. You can download the corresponding `.nemo` checkpoints here: [Minitron-8B-Base](https://huggingface.co/nvidia/Minitron-8B-Base/tree/main/nemo) and [Minitron-4B-Base](https://huggingface.co/nvidia/Minitron-4B-Base/tree/main/nemo).

1. Export the TensorRT-LLM checkpoint.

First, launch the NeMo container `nvcr.io/nvidia/nemo:24.05` with the `.nemo` model checkpoint and the [TensorRT-Model-Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) folder mounted:

```
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --init -it \
  -v <TensorRT-Model-Optimizer-path>:/workspace/TensorRT-Model-Optimizer \
  -v <minitron-checkpoint-dir>:/workspace/minitron \
  --rm nvcr.io/nvidia/nemo:24.05 bash
```

Inside the container, run the following commands to export the TensorRT-LLM checkpoint:

```
export GPT_MODEL_FILE=<path-to-.nemo-checkpoint>
pip install "nvidia-modelopt[torch]" -U
cd TensorRT-Model-Optimizer/llm_ptq/
scripts/nemo_example.sh --type gptnext --model $GPT_MODEL_FILE --quant bf16 --tp 1 --task "build"
```

You will see something like:

```
Model config exported to: <export-path>. Total time used ** s.
```

which means the TensorRT-LLM checkpoint has been exported successfully.

2. Build the TensorRT engine.

Use Docker to build and run [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) following these [instructions](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html):

```
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
make -C docker release_build
```

Now copy the exported TensorRT-LLM checkpoint into the TensorRT-LLM repository directory and launch the docker container:

```
cp -r <exported-trtllm-checkpoint> <TensorRT-LLM-repo-path>
cd <TensorRT-LLM-repo-path>
make -C docker release_run
```

Inside the docker container, build the TensorRT engine:

```
trtllm-build --checkpoint_dir /code/tensorrt_llm/<exported-trtllm-checkpoint> --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16 --output_dir <engine-dir>
```

Run inference with the built TensorRT engine to summarize articles from the [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset:

```
python3 examples/summarize.py --test_trt_llm --no_add_special_tokens --engine_dir <engine-dir> --vocab_file <minitron-checkpoint-dir>/tokenizer.model
```

### Fine-tuning with LMFlow

[LMFlow](https://github.com/OptimalScale/LMFlow) is a complete pipeline for fine-tuning large language models. The following steps provide an example of how to fine-tune the `Minitron-8B-Base` model using LMFlow with the `alpaca` dataset.

1. Install LMFlow

```
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
bash install.sh
```

2. Prepare the dataset

Download the [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) dataset and preprocess it using the following command.

```bash
cd data && ./download.sh alpaca && cd -
```

3. Fine-tune the model

Fine-tune the Minitron-8B model on the alpaca dataset using the following command.

```bash
bash ./scripts/run_finetune.sh \
  --model_name_or_path nvidia/Minitron-8B-Base \
  --dataset_path data/alpaca/train_conversation \
  --output_model_path output_models/finetuned_minitron
```

With LMFlow, you can also fine-tune the model on your custom dataset. The only thing you need to do is transform your dataset into the [LMFlow data format](https://optimalscale.github.io/LMFlow/examples/DATASETS.html); a small sketch of that format follows below.
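As an illustration, the snippet below writes a tiny custom dataset in LMFlow's `text_only` format. The field names here are an assumption based on the LMFlow dataset documentation linked above, so verify them against that page before training.

```python
# Minimal sketch: write a custom dataset in LMFlow's `text_only` format.
# Field names assume the schema from the LMFlow dataset docs; verify against
# https://optimalscale.github.io/LMFlow/examples/DATASETS.html before use.
import json
from pathlib import Path

dataset = {
    "type": "text_only",
    "instances": [
        {"text": "Question: What is Minitron? Answer: A family of small language models."},
        {"text": "Question: How was it built? Answer: Via pruning and knowledge distillation."},
    ],
}

out_dir = Path("data/my_custom_dataset")
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "train.json", "w") as f:
    json.dump(dataset, f, indent=2)

# Then point --dataset_path at data/my_custom_dataset in run_finetune.sh.
```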
In addition to full fine-tuning, you can also fine-tune Minitron efficiently with [LoRA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lora), [LISA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lisa), [Flash Attention](https://github.com/OptimalScale/LMFlow/blob/main/readme/flash_attn2.md), and other acceleration techniques.

## License

Minitron models are released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).

## Acknowledgments

We would like to thank Ameya Sunil Mahabaleshwarkar, Hayley Ross, Brandon Rowlett, Oluwatobi Olabiyi, Ao Tang, and Yoshi Suhara for help with producing the instruction-tuned versions of Minitron; James Shen for TRT-LLM support; Sanjeev Satheesh, Oleksii Kuchaiev, Shengyang Sun, Jiaqi Zeng, Zhilin Wang, Yi Dong, Zihan Liu, Rajarshi Roy, Wei Ping, and Makesh Narsimhan Sreedhar for help with datasets; and Ao Tang for Hugging Face support. We would also like to gratefully acknowledge the insightful discussion and feedback from Chenhan Yu and Daniel Korzekwa.

## Citation

If you find our work helpful, please consider citing our paper:

```
@article{minitron2024,
  title={Compact Language Models via Pruning and Knowledge Distillation},
  author={Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov},
  journal={arXiv preprint arXiv:2407.14679},
  year={2024},
  url={https://arxiv.org/abs/2407.14679},
}
```