├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE-CC-BY-SA
├── README.md
├── debug
    ├── NicerTrace.py
    ├── README.md
    ├── printflock.py
    └── torch-distributed-gpu-test.py
├── dtype
    └── README.md
├── hparams
    └── README.md
├── instabilities
    └── README.md
├── parallelism
    └── README.md
├── resources
    └── README.md
├── slurm
    ├── README.md
    ├── cron-daily.slurm
    └── cron-hourly.slurm
└── throughput
    ├── README.md
    └── all_reduce_bench.py

/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/CODE_OF_CONDUCT.md
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/CONTRIBUTING.md
--------------------------------------------------------------------------------
/LICENSE-CC-BY-SA:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/LICENSE-CC-BY-SA
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/README.md
--------------------------------------------------------------------------------
/debug/NicerTrace.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/debug/NicerTrace.py
--------------------------------------------------------------------------------
/debug/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/debug/README.md
--------------------------------------------------------------------------------
/debug/printflock.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/debug/printflock.py
--------------------------------------------------------------------------------
/debug/torch-distributed-gpu-test.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/debug/torch-distributed-gpu-test.py
--------------------------------------------------------------------------------
/dtype/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/dtype/README.md
--------------------------------------------------------------------------------
/hparams/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/hparams/README.md
--------------------------------------------------------------------------------
/instabilities/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/instabilities/README.md
--------------------------------------------------------------------------------
/parallelism/README.md:
--------------------------------------------------------------------------------
# Model Parallelism

## TP

The TP degree shouldn't span nodes: keep each tensor-parallel group within a single node, because TP communicates on every layer and inter-node links are usually much slower than intra-node ones.
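
The rule above can be checked before launching a job. The following is only a minimal sketch, not part of the handbook: the function name `tp_fits_in_node` and its parameters are hypothetical, and it assumes ranks are assigned to nodes contiguously, so that a TP group stays on one node only when the TP degree divides the number of GPUs per node.

```python
def tp_fits_in_node(tp_degree: int, gpus_per_node: int) -> bool:
    """Return True if every TP group of `tp_degree` ranks stays on one node.

    Hypothetical helper. Assumes ranks fill nodes contiguously, so a group
    stays within a node only when the degree divides the per-node GPU count.
    """
    if tp_degree < 1 or gpus_per_node < 1:
        raise ValueError("tp_degree and gpus_per_node must be positive")
    return tp_degree <= gpus_per_node and gpus_per_node % tp_degree == 0


if __name__ == "__main__":
    # Example: nodes with 8 GPUs each (an assumed hardware layout).
    for tp in (2, 4, 8, 16):
        print(f"tp={tp}: fits in an 8-GPU node = {tp_fits_in_node(tp, 8)}")
```

A launcher script could call such a check on its parsed arguments and abort early when it returns `False`.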
--------------------------------------------------------------------------------
/resources/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/resources/README.md
--------------------------------------------------------------------------------
/slurm/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/slurm/README.md
--------------------------------------------------------------------------------
/slurm/cron-daily.slurm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/slurm/cron-daily.slurm
--------------------------------------------------------------------------------
/slurm/cron-hourly.slurm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/slurm/cron-hourly.slurm
--------------------------------------------------------------------------------
/throughput/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/throughput/README.md
--------------------------------------------------------------------------------
/throughput/all_reduce_bench.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/llm_training_handbook/HEAD/throughput/all_reduce_bench.py
--------------------------------------------------------------------------------