├── .gitignore
├── LICENSE
├── README.md
├── extract_from_cc
│   ├── configs
│   │   └── randomized_all.yaml
│   ├── extract_from_warc.py
│   ├── spark_extract_dataset.py
│   ├── spark_session_builder.py
│   ├── text_normalizer.py
│   └── warcs
│       └── shard_0.txt
├── filtering
│   └── filter_dataset.py
├── imgs
│   ├── OpenWebMath-left.png
│   ├── openwebmath_logo.png
│   └── pipeline.png
└── text_extraction
    ├── setup.py
    └── text_extract
        ├── __init__.py
        ├── banned_selectors.txt
        ├── boilerplate_words.txt
        ├── extract.py
        ├── latex_processing.py
        ├── line_processing.py
        ├── mmltex
        │   ├── README
        │   ├── cmarkup.xsl
        │   ├── entities.xsl
        │   ├── glayout.xsl
        │   ├── mmltex.xsl
        │   ├── scripts.xsl
        │   ├── tables.xsl
        │   └── tokens.xsl
        ├── tree_processing.py
        └── utils.py

/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/.gitignore
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/README.md
--------------------------------------------------------------------------------
/extract_from_cc/configs/randomized_all.yaml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/extract_from_cc/configs/randomized_all.yaml
--------------------------------------------------------------------------------
/extract_from_cc/extract_from_warc.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/extract_from_cc/extract_from_warc.py
--------------------------------------------------------------------------------
/extract_from_cc/spark_extract_dataset.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/extract_from_cc/spark_extract_dataset.py
--------------------------------------------------------------------------------
/extract_from_cc/spark_session_builder.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/extract_from_cc/spark_session_builder.py
--------------------------------------------------------------------------------
/extract_from_cc/text_normalizer.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/extract_from_cc/text_normalizer.py
--------------------------------------------------------------------------------
/extract_from_cc/warcs/shard_0.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/extract_from_cc/warcs/shard_0.txt
--------------------------------------------------------------------------------
/filtering/filter_dataset.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/filtering/filter_dataset.py
--------------------------------------------------------------------------------
/imgs/OpenWebMath-left.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/imgs/OpenWebMath-left.png
--------------------------------------------------------------------------------
/imgs/openwebmath_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/imgs/openwebmath_logo.png
--------------------------------------------------------------------------------
/imgs/pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/imgs/pipeline.png
--------------------------------------------------------------------------------
/text_extraction/setup.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/setup.py
--------------------------------------------------------------------------------
/text_extraction/text_extract/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/text_extraction/text_extract/banned_selectors.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/banned_selectors.txt
--------------------------------------------------------------------------------
/text_extraction/text_extract/boilerplate_words.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/boilerplate_words.txt
--------------------------------------------------------------------------------
/text_extraction/text_extract/extract.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/extract.py
--------------------------------------------------------------------------------
/text_extraction/text_extract/latex_processing.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/latex_processing.py
--------------------------------------------------------------------------------
/text_extraction/text_extract/line_processing.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/line_processing.py
--------------------------------------------------------------------------------
/text_extraction/text_extract/mmltex/README:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/mmltex/README
--------------------------------------------------------------------------------
/text_extraction/text_extract/mmltex/cmarkup.xsl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/mmltex/cmarkup.xsl
--------------------------------------------------------------------------------
/text_extraction/text_extract/mmltex/entities.xsl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/mmltex/entities.xsl
--------------------------------------------------------------------------------
/text_extraction/text_extract/mmltex/glayout.xsl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/mmltex/glayout.xsl
--------------------------------------------------------------------------------
/text_extraction/text_extract/mmltex/mmltex.xsl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/mmltex/mmltex.xsl
--------------------------------------------------------------------------------
/text_extraction/text_extract/mmltex/scripts.xsl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/mmltex/scripts.xsl
--------------------------------------------------------------------------------
/text_extraction/text_extract/mmltex/tables.xsl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/mmltex/tables.xsl
--------------------------------------------------------------------------------
/text_extraction/text_extract/mmltex/tokens.xsl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/mmltex/tokens.xsl
--------------------------------------------------------------------------------
/text_extraction/text_extract/tree_processing.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/tree_processing.py
--------------------------------------------------------------------------------
/text_extraction/text_extract/utils.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keirp/OpenWebMath/HEAD/text_extraction/text_extract/utils.py
--------------------------------------------------------------------------------