├── README.md
└── split.sh

/README.md:
--------------------------------------------------------------------------------
# text-normalization-data

Links to the data used in the Sproat & Jaitly (https://arxiv.org/abs/1611.00068) experiments, with data for Polish added by request.

The data is hosted on Kaggle:

https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish

Data splits can be obtained using the script split.sh in this repository.

--------------------------------------------------------------------------------
/split.sh:
--------------------------------------------------------------------------------
#!/bin/bash
# The following script can be used to create the training/development/evaluation
# splits used in Zhang et al. (submitted):
#
# "In most experiments we train using the first file for English (about 10
# million tokens) and the first four files for Russian (about 11 million
# tokens), and evaluate on the first 92,000 tokens of the last file for English,
# and the first 93,000 tokens of the last file for Russian."
#
# But due to symbols, we actually want the first 100002 and 100007 lines,
# respectively.

set -e -u

echo -n "Downloading and decompressing..."
# Fetch and unpack both archives in parallel.
curl -sO https://storage.googleapis.com/text-normalization/en_with_types.tgz && \
    tar -xzf en_with_types.tgz &
curl -sO https://storage.googleapis.com/text-normalization/ru_with_types.tgz && \
    tar -xzf ru_with_types.tgz &
wait
echo "done"

# English: first shard for training, shards 90-94 for development, and the
# first 100002 lines of the last shard for evaluation.
cd en_with_types
cp output-00000-of-00100 training &
cat output-0009[0-4]-of-00100 > development &
head -n 100002 output-00099-of-00100 > evaluation &
wait
echo "English:"
wc -l training development evaluation

# Russian: same layout, with the first 100007 lines of the last shard for
# evaluation. Note that only the first shard is copied for training here,
# although the description quoted above mentions the first four files.
cd ../ru_with_types
cp output-00000-of-00100 training &
cat output-0009[0-4]-of-00100 > development &
head -n 100007 output-00099-of-00100 > evaluation &
wait
echo "Russian:"
wc -l training development evaluation

# Remove the downloaded archives; the extracted directories are kept.
cd ..
rm -r en_with_types.tgz ru_with_types.tgz
--------------------------------------------------------------------------------
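split.sh prints the line counts with `wc -l` and leaves it to the reader to compare them against the intended sizes. As a minimal sketch, assuming bash and the standard `wc`, `head`, `seq`, and `mktemp` utilities, the helper below fails with an error message when a split has an unexpected number of lines. The names `count_lines` and `check_split` are hypothetical and not part of the repository, and the demonstration runs on a 20-line stand-in file rather than the real data.

```shell
#!/bin/bash
set -e -u

# count_lines FILE: print the number of lines in FILE (hypothetical helper).
count_lines() {
    wc -l < "$1" | tr -d '[:space:]'
}

# check_split FILE EXPECTED: succeed only if FILE has exactly EXPECTED lines
# (hypothetical helper; split.sh itself performs no such check).
check_split() {
    local file="$1" expected="$2" actual
    actual="$(count_lines "$file")"
    if [ "$actual" -ne "$expected" ]; then
        echo "$file: expected $expected lines, found $actual" >&2
        return 1
    fi
    echo "$file: $actual lines, OK"
}

# Toy demonstration: a 20-line stand-in for output-00099-of-00100, from which
# the first 7 lines are taken in the same way split.sh uses head.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
seq 1 20 > "$demo_dir/output-00099-of-00100"
head -n 7 "$demo_dir/output-00099-of-00100" > "$demo_dir/evaluation"
check_split "$demo_dir/evaluation" 7
```

After running split.sh, the same check could be applied to the real files, for example `check_split en_with_types/evaluation 100002` and `check_split ru_with_types/evaluation 100007`, using the line counts given in the script.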