├── LICENSE
└── README.md


/LICENSE:
--------------------------------------------------------------------------------
SOVA Dataset is licensed under Creative Commons BY 4.0 license by Virtual Assistant, LLC.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SOVA Dataset

SOVA Dataset is a free public STT/ASR dataset.

Key facts:
- Russian, English and Chinese languages
- ~ 32 328 hours
- ~ 3,21 TB in `.wav` format

## Dataset composition
|Name|Link|Lang|Hours|Size|Source|Equipment|Annotation|Speech type|Augmentation|Quality|
|-|:-:|-|-|-|-|-|-|-|-|-|
|EngAudiobooksOriginal|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|EN|7 130|743 GB|audiobook|professional|forced alignment|reading|none|95%|
|EngAudiobooksNoisy|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|EN|3 873|310 GB|audiobook|professional|forced alignment|reading|phone calls|95%|
|RuAudiobooksDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|RU|298|30,24 GB|audiobook|unprofessional|manual|reading|none|99%|
|RuDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|RU|101|10,42 GB|audio records|unprofessional|manual|live speech|none|98%|
|RuYoutube|[Download](https://disk.yandex.ru/d/QsnbNTK0yzXSiA "Download")|RU|17 451|1 873 GB|audio records|unprofessional|asr|live speech|none|95%|
|ZhYoutube|[Download](https://disk.yandex.ru/d/zCY5yRvW7PWjvA "Download")|CN|3 475,1|321 GB|audio records|unprofessional|asr|live speech|none|97.83%|
|**TOTAL**|-|-|**32 328,1**|**3 287,66 GB (3,21 TB)**|-|-|-|-|-|-|

## Audio characteristics
* Bit rate mode: constant
* Bit rate: 256 kbps
* Channel(s): 1 channel
* Sample rate: 16.0 kHz
* Bit depth: 16 bit

## Updates
* 08/11/2022: [Release v0.4.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.4.0)
* 10/12/2021: [Release v0.3.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.3.0)
* 22/12/2020: [Release v0.2.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.2.0)
* 24/12/2019: Published dataset with 116 hours.

## Contacts
For any questions, please feel free to contact us at support@sova.ai.

## License

SOVA Dataset is licensed under the [Creative Commons BY 4.0](https://creativecommons.org/licenses/by/4.0/) license by Virtual Assistant, LLC.

--------------------------------------------------------------------------------
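
The audio characteristics listed in the README are mutually consistent: for uncompressed PCM, bit rate = sample rate × bit depth × channels, so 16 000 Hz × 16 bit × 1 channel = 256 kbps. The sketch below is not part of the repository. It shows one possible way to check a downloaded file against those values using Python's standard `wave` module. It assumes the files are plain PCM `.wav` files that `wave` can open; the function names and constants are hypothetical and introduced only for illustration.

```python
import wave

# Values copied from the "Audio characteristics" list in the README.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_BIT_DEPTH = 16
EXPECTED_BIT_RATE_KBPS = 256


def pcm_bit_rate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bit rate of uncompressed PCM audio in kbps (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bit_depth * channels / 1000


def wav_matches_readme(path: str) -> bool:
    """Return True if the WAV header matches the README's audio characteristics.

    Assumption: the file is plain PCM; wave.open raises wave.Error otherwise.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            # getsampwidth() returns bytes per sample, so convert to bits.
            and wav.getsampwidth() * 8 == EXPECTED_BIT_DEPTH
        )


# The README's stated bit rate follows from its other three values.
assert pcm_bit_rate_kbps(
    EXPECTED_SAMPLE_RATE_HZ, EXPECTED_BIT_DEPTH, EXPECTED_CHANNELS
) == EXPECTED_BIT_RATE_KBPS
```

For example, `wav_matches_readme("sample.wav")` (where `sample.wav` is a placeholder path to any file from the dataset) would return `False` for a stereo or 44.1 kHz file, which can help catch files that were resampled or converted after download.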