├── LICENSE
└── README.md
/LICENSE:
--------------------------------------------------------------------------------
1 | SOVA Dataset is licensed under the Creative Commons BY 4.0 license by Virtual Assistant, LLC.
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SOVA Dataset
2 |
3 | SOVA Dataset is a free public STT/ASR dataset.
4 |
5 | Key facts:
6 | - Russian, English and Chinese languages
7 | - ~32,328 hours
8 | - ~3.21 TB in `.wav` format
9 |
10 | ## Dataset composition
11 | |Name|Link|Lang|Hours|Size|Source|Equipment|Annotation|Speech type|Augmentation|Quality|
12 | |-|:-:|-|-|-|-|-|-|-|-|-|
13 | |EngAudiobooksOriginal|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|EN|7,130|743 GB|audiobook|professional|forced alignment|reading|none|95%|
14 | |EngAudiobooksNoisy|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|EN|3,873|310 GB|audiobook|professional|forced alignment|reading|phone calls|95%|
15 | |RuAudiobooksDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|RU|298|30.24 GB|audiobook|unprofessional|manual|reading|none|99%|
16 | |RuDevices|[Download](https://disk.yandex.ru/d/jz3k7pnzTpnTgw "Download")|RU|101|10.42 GB|audio records|unprofessional|manual|live speech|none|98%|
17 | |RuYoutube|[Download](https://disk.yandex.ru/d/QsnbNTK0yzXSiA "Download")|RU|17,451|1,873 GB|audio records|unprofessional|ASR|live speech|none|95%|
18 | |ZhYoutube|[Download](https://disk.yandex.ru/d/zCY5yRvW7PWjvA "Download")|ZH|3,475.1|321 GB|audio records|unprofessional|ASR|live speech|none|97.83%|
19 | |**TOTAL**|-|-|**32,328.1**|**3,287.66 GB (3.21 TB)**|-|-|-|-|-|-|
20 |
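The totals in the table can be reproduced from the per-subset rows. The snippet below is a minimal sketch, not part of the dataset tooling: the `SUBSETS` dictionary and the `totals` helper are hypothetical names, the figures are copied from the table, and the conversion assumes 1 TB = 1024 GB, which is the factor that yields the stated 3.21 TB.

```python
# Per-subset figures copied from the composition table: (hours, size in GB).
SUBSETS = {
    "EngAudiobooksOriginal": (7130.0, 743.0),
    "EngAudiobooksNoisy": (3873.0, 310.0),
    "RuAudiobooksDevices": (298.0, 30.24),
    "RuDevices": (101.0, 10.42),
    "RuYoutube": (17451.0, 1873.0),
    "ZhYoutube": (3475.1, 321.0),
}


def totals(subsets):
    """Return (total hours, total size in GB, total size in TB)."""
    hours = sum(h for h, _ in subsets.values())
    size_gb = sum(s for _, s in subsets.values())
    # Assumption: 1 TB = 1024 GB, matching the 3.21 TB figure in the table.
    return hours, size_gb, size_gb / 1024


total_hours, total_gb, total_tb = totals(SUBSETS)
print(f"{total_hours:.1f} h, {total_gb:.2f} GB, {total_tb:.2f} TB")
```
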
21 | ## Audio characteristics
22 | * Bit rate mode: constant
23 | * Bit rate: 256 kbps
24 | * Channel(s): 1 channel
25 | * Sample rate: 16.0 kHz
26 | * Bit depth: 16 bit
27 |
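The listed bit rate follows from the other values: for uncompressed PCM, bit rate = sample rate × bit depth × channels, so 16,000 × 16 × 1 = 256,000 bit/s. The sketch below, using only Python's standard `wave` module, shows one way a downloaded file could be checked against these characteristics. The function names are hypothetical, and it assumes the files are plain PCM `.wav` files.

```python
import wave

# Values taken from the "Audio characteristics" list.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_BIT_DEPTH = 16


def pcm_bit_rate_kbps(sample_rate_hz, bit_depth, channels):
    """Bit rate of uncompressed PCM audio in kbps."""
    return sample_rate_hz * bit_depth * channels / 1000


def matches_expected_format(path):
    """Return True if the WAV header matches the listed characteristics."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            # getsampwidth() returns bytes per sample.
            and wav.getsampwidth() * 8 == EXPECTED_BIT_DEPTH
        )


print(pcm_bit_rate_kbps(EXPECTED_SAMPLE_RATE_HZ, EXPECTED_BIT_DEPTH, EXPECTED_CHANNELS))
```

Calling `matches_expected_format("sample.wav")` on a local file (the path here is a placeholder) returns `False` if, for example, the file is stereo or was resampled.
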
28 | ## Updates
29 | * 08/11/2022: [Release v0.4.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.4.0)
30 | * 10/12/2021: [Release v0.3.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.3.0)
31 | * 22/12/2020: [Release v0.2.0](https://github.com/sovaai/sova-dataset/releases/tag/v0.2.0)
32 | * 24/12/2019: Published dataset with 116 hours.
33 |
34 | ## Contacts
35 | For any questions, please feel free to contact us at support@sova.ai.
36 |
37 | ## License
38 |
39 | SOVA Dataset is licensed under the [Creative Commons BY 4.0](https://creativecommons.org/licenses/by/4.0/) license by Virtual Assistant, LLC.
40 |
--------------------------------------------------------------------------------