├── README.md
└── split.sh


/README.md:
--------------------------------------------------------------------------------
 1 | # text-normalization-data
 2 | 
 3 | Links to data used in Sproat & Jaitly (https://arxiv.org/abs/1611.00068) experiments, with data for Polish added by request.
 4 | 
 5 | The data is hosted by Kaggle:
 6 | 
 7 | https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish
 8 | 
 9 | Data splits can be obtained by running the script `split.sh` in this repository.
10 | 


--------------------------------------------------------------------------------
/split.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | # The following script can be used to create the training, development, and
 3 | # evaluation splits used in Zhang et al. (submitted):
 4 | #
 5 | # "In most experiments we train using the first file for English (about 10
 6 | # million tokens) and the first four files for Russian (about 11 million
 7 | # tokens), and evaluate on the first 92,000 tokens of the last file for English,
 8 | # and the first 93,000 tokens of the last file for Russian."
 9 | #
10 | # But because <eos> lines also count, we actually take the first 100002 and
10 | # 100007 lines of those files, respectively.
11 |  
12 | set -e -u
13 |  
14 | echo -n "Downloading and decompressing..."
15 | curl -fsSO https://storage.googleapis.com/text-normalization/en_with_types.tgz && \
16 |   tar -xzf en_with_types.tgz &
17 | curl -fsSO https://storage.googleapis.com/text-normalization/ru_with_types.tgz && \
18 |   tar -xzf ru_with_types.tgz &
19 | wait
20 | echo "done"
21 |  
22 | cd en_with_types
23 | cp output-00000-of-00100 training &
24 | cat output-0009[0-4]-of-00100 > development &
25 | head -100002 output-00099-of-00100 > evaluation &
26 | wait
27 | echo "English:"
28 | wc -l training development evaluation
29 |  
30 | cd ../ru_with_types
31 | cp output-00000-of-00100 training &
32 | cat output-0009[0-4]-of-00100 > development &
33 | head -100007 output-00099-of-00100 > evaluation &
34 | wait
35 | echo "Russian:"
36 | wc -l training development evaluation
37 |  
38 | cd ..
39 | rm en_with_types.tgz ru_with_types.tgz
40 | 


--------------------------------------------------------------------------------
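The comment in split.sh says the `head` line counts are larger than the quoted token counts because `<eos>` lines are included. One way to check this after running the script is to count the two kinds of lines separately. The following is a minimal sketch and not part of the repository: `count_tokens` is a hypothetical helper, and it assumes the data files are tab-separated with sentence-boundary lines whose first field is exactly `<eos>`.

```shell
#!/bin/bash
# Hypothetical helper (not part of split.sh): report how many lines of a split
# are <eos> sentence markers and how many are ordinary token lines.
# Assumption: fields are tab-separated and marker lines have "<eos>" as the
# first field.
count_tokens() {
  local file="$1"
  awk -F'\t' '
    $1 == "<eos>" { eos++; next }   # sentence-boundary marker
    { tokens++ }                    # any other line counts as a token
    END { printf "%d tokens, %d <eos> markers, %d lines\n", tokens, eos, NR }
  ' "$file"
}
```

For example, `count_tokens en_with_types/evaluation` could be run from the directory where split.sh was executed. If the assumption about the file format holds, the tokens and markers should sum to the line count passed to `head`.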