├── 20B_tokenizer.json
├── README.md
├── data
    ├── sample_data_text_document.bin
    └── sample_data_text_document.idx
├── requirements.txt
├── rwkv_vocab_v20230424.txt
├── sample.jsonl
└── tools
    ├── __pycache__
        ├── indexed_dataset.cpython-38.pyc
        └── tokenizer.cpython-38.pyc
    ├── indexed_dataset.py
    ├── preprocess_data.py
    ├── rwkv_tokenizer.py
    └── tokenizer.py

/20B_tokenizer.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/20B_tokenizer.json
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/README.md
--------------------------------------------------------------------------------
/data/sample_data_text_document.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/data/sample_data_text_document.bin
--------------------------------------------------------------------------------
/data/sample_data_text_document.idx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/data/sample_data_text_document.idx
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/requirements.txt
--------------------------------------------------------------------------------
/rwkv_vocab_v20230424.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/rwkv_vocab_v20230424.txt
--------------------------------------------------------------------------------
/sample.jsonl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/sample.jsonl
--------------------------------------------------------------------------------
/tools/__pycache__/indexed_dataset.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/tools/__pycache__/indexed_dataset.cpython-38.pyc
--------------------------------------------------------------------------------
/tools/__pycache__/tokenizer.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/tools/__pycache__/tokenizer.cpython-38.pyc
--------------------------------------------------------------------------------
/tools/indexed_dataset.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/tools/indexed_dataset.py
--------------------------------------------------------------------------------
/tools/preprocess_data.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/tools/preprocess_data.py
--------------------------------------------------------------------------------
/tools/rwkv_tokenizer.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/tools/rwkv_tokenizer.py
--------------------------------------------------------------------------------
/tools/tokenizer.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Abel2076/json2binidx_tool/HEAD/tools/tokenizer.py
--------------------------------------------------------------------------------