├── .gitignore
├── LICENSE
├── README.md
├── basic_dedup
    ├── README.md
    ├── find_duplicates.py
    ├── readme.txt
    └── write_meta_data_pkl.py
├── convert
    ├── README.md
    ├── convert.py
    └── wudao_convert.py
├── corpus_processing
    ├── blacklist.txt
    ├── clean_file.py
    ├── decp936messy.py
    ├── extract.py
    ├── move_file.py
    ├── passwords.txt
    └── readme.txt
├── parallel_dedup
    ├── README.md
    ├── convert_jsonl_to_csv.py
    ├── multiprocess_deduplication.py
    ├── reset_csv.py
    └── write_output_to_jsonl.py
├── requirements.txt
├── utils
    ├── customSimhash.py
    ├── redisSimhash.py
    └── utils.py
└── words_dedup
    ├── add_jsonl_detailed_simhash.py
    └── alltext_simhash.py

/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/.gitignore
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/README.md
--------------------------------------------------------------------------------
/basic_dedup/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/basic_dedup/README.md
--------------------------------------------------------------------------------
/basic_dedup/find_duplicates.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/basic_dedup/find_duplicates.py
--------------------------------------------------------------------------------
/basic_dedup/readme.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/basic_dedup/readme.txt
--------------------------------------------------------------------------------
/basic_dedup/write_meta_data_pkl.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/basic_dedup/write_meta_data_pkl.py
--------------------------------------------------------------------------------
/convert/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/convert/README.md
--------------------------------------------------------------------------------
/convert/convert.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/convert/convert.py
--------------------------------------------------------------------------------
/convert/wudao_convert.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/convert/wudao_convert.py
--------------------------------------------------------------------------------
/corpus_processing/blacklist.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/corpus_processing/blacklist.txt
--------------------------------------------------------------------------------
/corpus_processing/clean_file.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/corpus_processing/clean_file.py
--------------------------------------------------------------------------------
/corpus_processing/decp936messy.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/corpus_processing/decp936messy.py
--------------------------------------------------------------------------------
/corpus_processing/extract.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/corpus_processing/extract.py
--------------------------------------------------------------------------------
/corpus_processing/move_file.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/corpus_processing/move_file.py
--------------------------------------------------------------------------------
/corpus_processing/passwords.txt:
--------------------------------------------------------------------------------
253874
--------------------------------------------------------------------------------
/corpus_processing/readme.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/corpus_processing/readme.txt
--------------------------------------------------------------------------------
/parallel_dedup/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/parallel_dedup/README.md
--------------------------------------------------------------------------------
/parallel_dedup/convert_jsonl_to_csv.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/parallel_dedup/convert_jsonl_to_csv.py
--------------------------------------------------------------------------------
/parallel_dedup/multiprocess_deduplication.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/parallel_dedup/multiprocess_deduplication.py
--------------------------------------------------------------------------------
/parallel_dedup/reset_csv.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/parallel_dedup/reset_csv.py
--------------------------------------------------------------------------------
/parallel_dedup/write_output_to_jsonl.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/parallel_dedup/write_output_to_jsonl.py
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/requirements.txt
--------------------------------------------------------------------------------
/utils/customSimhash.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/utils/customSimhash.py
--------------------------------------------------------------------------------
/utils/redisSimhash.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/utils/redisSimhash.py
--------------------------------------------------------------------------------
/utils/utils.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/utils/utils.py
--------------------------------------------------------------------------------
/words_dedup/add_jsonl_detailed_simhash.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/words_dedup/add_jsonl_detailed_simhash.py
--------------------------------------------------------------------------------
/words_dedup/alltext_simhash.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aplmikex/deduplication_mnbvc/HEAD/words_dedup/alltext_simhash.py
--------------------------------------------------------------------------------