├── LICENSE
├── README.md
├── __init__.py
├── analysis
│   ├── burstiness.py
│   └── distraction.py
├── evaluation
│   ├── cbqa.sh
│   ├── eval_utils.py
│   ├── fewshot.py
│   ├── icl.sh
│   └── mrc.sh
├── llama_modeling.py
├── packing_dataset.py
├── pics
│   ├── bm25chunk.png
│   └── random.png
├── preprocessing
│   ├── __init__.py
│   ├── create_corpus.py
│   └── split_to_subsets.py
├── project_config.py
├── retrieval_packing.py
├── retriv_bm25.py
├── save_offline_dataset.py
├── scripts
│   ├── download_eval_data.py
│   └── download_slimpajama.sh
└── utils.py

/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/README.md
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/analysis/burstiness.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/analysis/burstiness.py
--------------------------------------------------------------------------------
/analysis/distraction.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/analysis/distraction.py
--------------------------------------------------------------------------------
/evaluation/cbqa.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/evaluation/cbqa.sh
--------------------------------------------------------------------------------
/evaluation/eval_utils.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/evaluation/eval_utils.py
--------------------------------------------------------------------------------
/evaluation/fewshot.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/evaluation/fewshot.py
--------------------------------------------------------------------------------
/evaluation/icl.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/evaluation/icl.sh
--------------------------------------------------------------------------------
/evaluation/mrc.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/evaluation/mrc.sh
--------------------------------------------------------------------------------
/llama_modeling.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/llama_modeling.py
--------------------------------------------------------------------------------
/packing_dataset.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/packing_dataset.py
--------------------------------------------------------------------------------
/pics/bm25chunk.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/pics/bm25chunk.png
--------------------------------------------------------------------------------
/pics/random.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/pics/random.png
--------------------------------------------------------------------------------
/preprocessing/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/preprocessing/create_corpus.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/preprocessing/create_corpus.py
--------------------------------------------------------------------------------
/preprocessing/split_to_subsets.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/preprocessing/split_to_subsets.py
--------------------------------------------------------------------------------
/project_config.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/project_config.py
--------------------------------------------------------------------------------
/retrieval_packing.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/retrieval_packing.py
--------------------------------------------------------------------------------
/retriv_bm25.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/retriv_bm25.py
--------------------------------------------------------------------------------
/save_offline_dataset.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/save_offline_dataset.py
--------------------------------------------------------------------------------
/scripts/download_eval_data.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/scripts/download_eval_data.py
--------------------------------------------------------------------------------
/scripts/download_slimpajama.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/scripts/download_slimpajama.sh
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuzhaouoe/pretraining-data-packing/HEAD/utils.py
--------------------------------------------------------------------------------