├── LICENSE
├── README.md
├── download.sh
└── md5sum.lst


/LICENSE:
--------------------------------------------------------------------------------
Dual license, cc-by-nc and commercial usage available after agreement with dataset authors.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# **Russian Open Text To Speech (TTS) Dataset**

Arguably the largest public Russian TTS dataset to date:
- ~5 000 voices;
- ~13 000 hours;
- **(new!)** A new domain - public speech with ~3 000 hours;
- **(new!)** A new domain - radio with ~10 000 hours;
- Speaker labels for new domains are coming soon!

Prove [us](mailto:open_stt@googlegroups.com) wrong!
Open issues, collaborate, submit a PR, contribute, share your datasets!
Let's make TTS/STT in Russian (and more) as open and available as CV models.

**Table of contents**
- [Downloads](https://github.com/snakers4/open_tts/#downloads)
- [Links](https://github.com/snakers4/open_tts/#links)
- [Download instructions](https://github.com/snakers4/open_tts/#download-instructions)
- [End-to-end download scripts](https://github.com/snakers4/open_tts/#end-to-end-download-scripts)
- [Annotation methodology](https://github.com/snakers4/open_tts/#annotation-methodology)
- [Contacts](https://github.com/snakers4/open_tts/#contacts)
- [License](https://github.com/snakers4/open_tts/#license)
- [Donations](https://github.com/snakers4/open_tts/#donations)

# **Updates**

## **_Update 2019-11-04_**

**New train datasets added:**

- 10,430 hours radio_v4;
- 2,709 hours public_speech;
- 154 hours radio_v4_add;
- 5% sample of all new datasets with annotation;
- Speaker labels are coming soon!


## **_Update 2019-06-28_**

`russian_young_male_1` added (~43 hours)

## **_Update 2019-05-24_**

It's alive!
Looking for collaborators)


# **Downloads**

# **Links**

~~Meta data [file](https://ru-open-tts.ams3.digitaloceanspaces.com/public_tts_df_02.csv).~~ **Coming soon!**

| Voice                       | Clips     | Hours  | GB    | Comment | Links | Md5sum |
|-----------------------------|-----------|--------|-------|---------|-------|--------|
| 5% of radio + public_speech | 469,797   | 665    | 66.7  |         | [mp3+txt](https://ru-open-stt.ams3.cdn.digitaloceanspaces.com/radio_pspeech_sample_mp3.tar.gz), [manifest file](https://ru-open-stt.ams3.cdn.digitaloceanspaces.com/radio_pspeech_sample_manifest.csv) | `84397631475426f505babbb73b4197d9` |
| radio                       | 7,603,192 | 10,430 | 1,195 |         | [mp3](https://ru-open-stt.ams3.cdn.digitaloceanspaces.com/radio_v4_mp3.tar.gz), [txt](https://forms.gle/nosMaNgj8MWKm99d9), [manifest file](https://ru-open-stt.ams3.cdn.digitaloceanspaces.com/radio_v4_manifest.csv) | `7c2273a5b8c3cc10df3754dbe9c783e1` |
| public_speech               | 1,700,060 | 2,709  | 301   |         | [mp3](https://ru-open-stt.ams3.cdn.digitaloceanspaces.com/public_speech_mp3.tar.gz), [txt](https://forms.gle/nosMaNgj8MWKm99d9), [manifest file](https://ru-open-stt.ams3.cdn.digitaloceanspaces.com/public_speech_manifest.csv) | `d41f3f21d3cb9328de3cd6a530a70832` |
| radio_add                   | 92,679    | 157    | 18    |         | [mp3](https://ru-open-stt.ams3.cdn.digitaloceanspaces.com/radio_v4_add_mp3.tar.gz), [txt](https://forms.gle/nosMaNgj8MWKm99d9), [manifest file](https://ru-open-stt.ams3.cdn.digitaloceanspaces.com/radio_v4_add_manifest.csv) | `ae00489678836b92e3a65d2ee8b51960` |
| russian_middle_aged_male_1  | 45,311    | 64     | 9.7   | Rnnoise | [wav+txt](https://ru-open-tts.ams3.digitaloceanspaces.com/russian_middle_aged_male_1.tar.gz) | `f1157d6dfd07c302c23cfe7dcb0298f5` |
| russian_middle_aged_male_2  | 46,684    | 38     | 6.0   | Rnnoise | [wav+txt](https://ru-open-tts.ams3.digitaloceanspaces.com/russian_middle_aged_male_2.tar.gz) | `059ab6b3e5fa77319f7bf20e594fc133` |
| russian_young_male_1 (tts_2) | 118,536  | 43     | 4.9   |         | [wav+txt](https://ru-open-tts.ams3.digitaloceanspaces.com/tts_2.tar.gz) | `403c90662beb51ac9a39d64b879e0f1b` |
| total                       | 9,606,462 | 13,446 | 1,535 |         |       |        |

## **Download instructions**

### End to end

Use `download.sh` or `download.py` with this config [file](https://github.com/snakers4/open_tts/blob/master/md5sum.lst). Please check the config first.

### Manually

1. Download each dataset separately:

Via `wget`
```
wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
```

For multi-threaded downloads use aria2 with the `-x` flag, e.g.
```
aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
```

2. Download the metadata.

# **Data collection / denoising / normalization methodology**
The dataset is compiled using open domain sources.

## Russian_middle_aged/young_male
The dataset is cleaned using the best ASR engine we have at hand; only items with `CER` less than `0.1` are kept.

Then, where applicable:
- Spectral [gating](https://github.com/timsainb/noisereduce) / de-noising is applied;
- Rnnoise is [applied](https://github.com/xiph/rnnoise/issues/69);

All files are normalized as follows:
- Converted to mono, if necessary;
- Converted to 22 kHz sampling rate, if necessary;
- Stored as 16-bit integers;

22 kHz was chosen as a rate commonly used in the literature, though in real applications as low as 8 kHz may suffice.
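
To make the `CER` threshold above concrete, here is a minimal sketch of how such a filter could look. It is not the authors' pipeline: the ASR engine and the exact CER definition they used are not specified, so this assumes the common definition (Levenshtein distance between reference text and ASR output, divided by the reference length). The function names `cer` and `keep_clip` are hypothetical, and treating an empty reference as CER `1` (or `0` when the hypothesis is empty too) is just a convention chosen here.

```bash
#!/bin/bash

# cer REF HYP: print the character error rate, assumed here to be the
# Levenshtein distance between REF and HYP divided by the length of REF.
cer() {
  awk -v ref="$1" -v hyp="$2" 'BEGIN {
    n = length(ref); m = length(hyp)
    # CER is undefined for an empty reference; use 0 or 1 by convention.
    if (n == 0) { print (m == 0 ? "0.0000" : "1.0000"); exit }
    for (j = 0; j <= m; j++) prev[j] = j
    for (i = 1; i <= n; i++) {
      cur[0] = i
      for (j = 1; j <= m; j++) {
        cost = (substr(ref, i, 1) == substr(hyp, j, 1)) ? 0 : 1
        best = prev[j - 1] + cost                          # match or substitution
        if (prev[j] + 1 < best) best = prev[j] + 1         # deletion
        if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1   # insertion
        cur[j] = best
      }
      for (j = 0; j <= m; j++) prev[j] = cur[j]
    }
    printf "%.4f\n", prev[m] / n
  }'
}

# keep_clip REF HYP [THRESHOLD]: succeed when the CER is strictly below
# THRESHOLD (default 0.1, the value quoted above).
keep_clip() {
  awk -v c="$(cer "$1" "$2")" -v t="${3:-0.1}" 'BEGIN { exit !(c < t) }'
}
```

For example, `keep_clip "hello world" "hello wurld" && echo keep` prints `keep` (one error in eleven characters), while a single error in a five-character reference would be rejected. Note that `length` and `substr` count characters only in a UTF-8 aware awk (such as gawk in a UTF-8 locale); a byte-oriented awk would give approximate values for Cyrillic text.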

## Radio/Public Speech

All files are normalized for easier / faster runtime augmentations and processing as follows:

- Converted to mono, if necessary;
- Converted to 16 kHz sampling rate, if necessary;
- Stored as 32 kbps `mp3`;

# **Contacts**

Please contact us [here](mailto:open_stt@googlegroups.com) or just create a GitHub issue!

**Authors in alphabetic order:**
- Anna Slizhikova;
- Alexander Veysov;
- Dmitry Voronin;
- Yuri Baburov;

# **License**
Dual license, cc-by-nc and commercial usage available after agreement with dataset authors.


# **Donations**

[Donate](https://buymeacoff.ee/8oneCIN) (each coffee pays for several full downloads), contribute via [open_collective](https://opencollective.com/open_stt), or use our DO referral [link](https://sohabr.net/habr/post/357748/) to help.


--------------------------------------------------------------------------------
/download.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# md5sum.lst holds "<md5> <url>" pairs; wget saves each file under the
# last path component of its URL.
list='md5sum.lst'

while true; do
  # Download (or resume) every URL listed in the second column.
  for url in $(awk 'NF >= 2 {print $2}' "$list"); do
    wget -c "$url"
  done

  echo ''
  echo '>>> Checking MD5 digests...'

  # md5sum -c expects local file names, so map each URL to its basename.
  awk 'NF >= 2 {n = split($2, p, "/"); print $1 "  " p[n]}' "$list" |
    md5sum -c - 1>md5sum.log 2>/dev/null
  status=$?

  if test $status -eq 0; then
    rm md5sum.log
    echo '>>> Data is downloaded and checked.'
    break
  fi

  # Remove files whose digest did not match so they are fetched from scratch.
  for failed in $(grep 'FAILED$' md5sum.log | grep -Po '^[^:]+'); do
    echo ">>> MD5 digest for ${failed} is incorrect, the file will be downloaded again."
    rm -f "${failed}"
  done

  echo ''
done


--------------------------------------------------------------------------------
/md5sum.lst:
--------------------------------------------------------------------------------
f524a5b3f2a46eb8bf9c307b902f0d32 https://ru-open-tts.ams3.digitaloceanspaces.com/public_tts_df_02.csv
f1157d6dfd07c302c23cfe7dcb0298f5 https://ru-open-tts.ams3.digitaloceanspaces.com/russian_middle_aged_male_1.tar.gz
059ab6b3e5fa77319f7bf20e594fc133 https://ru-open-tts.ams3.digitaloceanspaces.com/russian_middle_aged_male_2.tar.gz
403c90662beb51ac9a39d64b879e0f1b https://ru-open-tts.ams3.digitaloceanspaces.com/tts_2.tar.gz

--------------------------------------------------------------------------------
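
`md5sum.lst` covers only some of the archives, while the README table lists a digest for every archive. For a file fetched manually, a single-file check along these lines may be handy. This is a sketch, not part of the repository: `check_md5` is a hypothetical helper name, and it assumes GNU `md5sum` is available, as `download.sh` already does.

```bash
#!/bin/bash

# check_md5 FILE EXPECTED: succeed when the MD5 digest of FILE equals
# EXPECTED (a lowercase hex string, as printed in the README table).
check_md5() {
  local file="$1" expected="$2" actual
  # md5sum prints "<digest>  <file>"; keep only the digest column.
  actual="$(md5sum "$file" 2>/dev/null | awk '{print $1}')"
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

Usage, with the digest taken from the table: `check_md5 russian_middle_aged_male_1.tar.gz f1157d6dfd07c302c23cfe7dcb0298f5 && echo OK`. A missing or unreadable file is treated as a failed check.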