├── LICENSE
├── README.md
├── classification
    ├── README.md
    ├── run_edu_bert.py
    ├── run_edu_bert.slurm
    ├── train_edu_bert.py
    └── train_edu_bert.slurm
├── decontamination
    ├── README.md
    └── decontaminate.py
├── deduplication
    ├── README.md
    └── deduplicate_dataset.py
├── evaluation
    ├── README.md
    ├── eval.slurm
    └── lighteval_tasks.py
├── fulltext_search
    ├── README.md
    ├── index_docs.py
    ├── index_docs.slurm
    ├── manticore.conf
    ├── search_sharded.py
    └── search_sharded.slurm
├── generation
    ├── README.md
    ├── boilerplate_cleanup.py
    └── llm_swarm_script.py
├── plots
    ├── clusters_map.png
    ├── cover.png
    ├── cover_01.png
    ├── educational_score.png
    └── topics_distpng.png
└── prompts
    ├── README.md
    ├── auto_math_text
        ├── README.md
        └── build_science_prompts.py
    ├── khanacademy
        ├── README.md
        ├── generate_textbooks.py
        └── khan_dl
        │   ├── khan_dl.py
        │   ├── main.py
        │   └── requirements.txt
    ├── openstax
        ├── README.md
        └── build_openstax_prompts.py
    ├── stanford
        ├── 1_scraper.ipynb
        ├── 2_generate_course_outlines.ipynb
        └── README.md
    ├── stories
        ├── README.md
        ├── build_openhermes_stories_prompts.py
        ├── build_ultrachat_stories_prompts.py
        └── filter_openhermes.py
    ├── web_samples
        ├── README.md
        ├── build_web_prompts.py
        └── filter_and_classify_clusters.py
    └── wikihow
        ├── README.md
        └── wikihowcom-20231012-titles.txt

/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/README.md
--------------------------------------------------------------------------------
/classification/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/classification/README.md
--------------------------------------------------------------------------------
/classification/run_edu_bert.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/classification/run_edu_bert.py
--------------------------------------------------------------------------------
/classification/run_edu_bert.slurm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/classification/run_edu_bert.slurm
--------------------------------------------------------------------------------
/classification/train_edu_bert.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/classification/train_edu_bert.py
--------------------------------------------------------------------------------
/classification/train_edu_bert.slurm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/classification/train_edu_bert.slurm
--------------------------------------------------------------------------------
/decontamination/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/decontamination/README.md
--------------------------------------------------------------------------------
/decontamination/decontaminate.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/decontamination/decontaminate.py
--------------------------------------------------------------------------------
/deduplication/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/deduplication/README.md
--------------------------------------------------------------------------------
/deduplication/deduplicate_dataset.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/deduplication/deduplicate_dataset.py
--------------------------------------------------------------------------------
/evaluation/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/evaluation/README.md
--------------------------------------------------------------------------------
/evaluation/eval.slurm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/evaluation/eval.slurm
--------------------------------------------------------------------------------
/evaluation/lighteval_tasks.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/evaluation/lighteval_tasks.py
--------------------------------------------------------------------------------
/fulltext_search/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/fulltext_search/README.md
--------------------------------------------------------------------------------
/fulltext_search/index_docs.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/fulltext_search/index_docs.py
--------------------------------------------------------------------------------
/fulltext_search/index_docs.slurm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/fulltext_search/index_docs.slurm
--------------------------------------------------------------------------------
/fulltext_search/manticore.conf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/fulltext_search/manticore.conf
--------------------------------------------------------------------------------
/fulltext_search/search_sharded.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/fulltext_search/search_sharded.py
--------------------------------------------------------------------------------
/fulltext_search/search_sharded.slurm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/fulltext_search/search_sharded.slurm
--------------------------------------------------------------------------------
/generation/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/generation/README.md
--------------------------------------------------------------------------------
/generation/boilerplate_cleanup.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/generation/boilerplate_cleanup.py
--------------------------------------------------------------------------------
/generation/llm_swarm_script.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/generation/llm_swarm_script.py
--------------------------------------------------------------------------------
/plots/clusters_map.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/plots/clusters_map.png
--------------------------------------------------------------------------------
/plots/cover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/plots/cover.png
--------------------------------------------------------------------------------
/plots/cover_01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/plots/cover_01.png
--------------------------------------------------------------------------------
/plots/educational_score.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/plots/educational_score.png
--------------------------------------------------------------------------------
/plots/topics_distpng.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/plots/topics_distpng.png
--------------------------------------------------------------------------------
/prompts/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/README.md
--------------------------------------------------------------------------------
/prompts/auto_math_text/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/auto_math_text/README.md
--------------------------------------------------------------------------------
/prompts/auto_math_text/build_science_prompts.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/auto_math_text/build_science_prompts.py
--------------------------------------------------------------------------------
/prompts/khanacademy/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/khanacademy/README.md
--------------------------------------------------------------------------------
/prompts/khanacademy/generate_textbooks.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/khanacademy/generate_textbooks.py
--------------------------------------------------------------------------------
/prompts/khanacademy/khan_dl/khan_dl.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/khanacademy/khan_dl/khan_dl.py
--------------------------------------------------------------------------------
/prompts/khanacademy/khan_dl/main.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/khanacademy/khan_dl/main.py
--------------------------------------------------------------------------------
/prompts/khanacademy/khan_dl/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/khanacademy/khan_dl/requirements.txt
--------------------------------------------------------------------------------
/prompts/openstax/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/openstax/README.md
--------------------------------------------------------------------------------
/prompts/openstax/build_openstax_prompts.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/openstax/build_openstax_prompts.py
--------------------------------------------------------------------------------
/prompts/stanford/1_scraper.ipynb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/stanford/1_scraper.ipynb
--------------------------------------------------------------------------------
/prompts/stanford/2_generate_course_outlines.ipynb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/stanford/2_generate_course_outlines.ipynb
--------------------------------------------------------------------------------
/prompts/stanford/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/stanford/README.md
--------------------------------------------------------------------------------
/prompts/stories/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/stories/README.md
--------------------------------------------------------------------------------
/prompts/stories/build_openhermes_stories_prompts.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/stories/build_openhermes_stories_prompts.py
--------------------------------------------------------------------------------
/prompts/stories/build_ultrachat_stories_prompts.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/stories/build_ultrachat_stories_prompts.py
--------------------------------------------------------------------------------
/prompts/stories/filter_openhermes.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/stories/filter_openhermes.py
--------------------------------------------------------------------------------
/prompts/web_samples/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/web_samples/README.md
--------------------------------------------------------------------------------
/prompts/web_samples/build_web_prompts.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/web_samples/build_web_prompts.py
--------------------------------------------------------------------------------
/prompts/web_samples/filter_and_classify_clusters.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/web_samples/filter_and_classify_clusters.py
--------------------------------------------------------------------------------
/prompts/wikihow/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/wikihow/README.md
--------------------------------------------------------------------------------
/prompts/wikihow/wikihowcom-20231012-titles.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/huggingface/cosmopedia/HEAD/prompts/wikihow/wikihowcom-20231012-titles.txt
--------------------------------------------------------------------------------