├── .gitignore
├── LICENSE
├── README.md
├── extract_paragraphs_from_page_htmls.py
├── filter_items_by_pageid.py
├── get_all_page_ids_from_cirrussearch.py
├── get_all_page_ids_from_web.py
├── get_page_htmls.py
├── make_corpus_from_cirrussearch.py
├── make_corpus_from_paragraphs.py
├── make_passages_from_paragraphs.py
├── push_to_hub.py
├── requirements.txt
└── sentence_splitters.py

/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/.gitignore
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/README.md
--------------------------------------------------------------------------------
/extract_paragraphs_from_page_htmls.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/extract_paragraphs_from_page_htmls.py
--------------------------------------------------------------------------------
/filter_items_by_pageid.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/filter_items_by_pageid.py
--------------------------------------------------------------------------------
/get_all_page_ids_from_cirrussearch.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/get_all_page_ids_from_cirrussearch.py
--------------------------------------------------------------------------------
/get_all_page_ids_from_web.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/get_all_page_ids_from_web.py
--------------------------------------------------------------------------------
/get_page_htmls.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/get_page_htmls.py
--------------------------------------------------------------------------------
/make_corpus_from_cirrussearch.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/make_corpus_from_cirrussearch.py
--------------------------------------------------------------------------------
/make_corpus_from_paragraphs.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/make_corpus_from_paragraphs.py
--------------------------------------------------------------------------------
/make_passages_from_paragraphs.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/make_passages_from_paragraphs.py
--------------------------------------------------------------------------------
/push_to_hub.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/push_to_hub.py
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/requirements.txt
--------------------------------------------------------------------------------
/sentence_splitters.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singletongue/wikipedia-utils/HEAD/sentence_splitters.py
--------------------------------------------------------------------------------