├── LICENSE
├── README.md
├── list
    ├── dev-clean-1mix.jsonl
    ├── dev-clean-2mix.jsonl
    ├── dev-clean-3mix.jsonl
    ├── optional
    │   ├── dev-clean-1mix-8prof-10utt.jsonl
    │   ├── dev-clean-1mix-8prof-1utt.jsonl
    │   ├── dev-clean-1mix-8prof-2utt.jsonl
    │   ├── dev-clean-1mix-8prof-5utt.jsonl
    │   ├── dev-clean-2mix-8prof-10utt.jsonl
    │   ├── dev-clean-2mix-8prof-1utt.jsonl
    │   ├── dev-clean-2mix-8prof-2utt.jsonl
    │   ├── dev-clean-2mix-8prof-5utt.jsonl
    │   ├── dev-clean-3mix-8prof-10utt.jsonl
    │   ├── dev-clean-3mix-8prof-1utt.jsonl
    │   ├── dev-clean-3mix-8prof-2utt.jsonl
    │   ├── dev-clean-3mix-8prof-5utt.jsonl
    │   ├── test-clean-1mix-8prof-10utt.jsonl
    │   ├── test-clean-1mix-8prof-1utt.jsonl
    │   ├── test-clean-1mix-8prof-2utt.jsonl
    │   ├── test-clean-1mix-8prof-5utt.jsonl
    │   ├── test-clean-2mix-8prof-10utt.jsonl
    │   ├── test-clean-2mix-8prof-1utt.jsonl
    │   ├── test-clean-2mix-8prof-2utt.jsonl
    │   ├── test-clean-2mix-8prof-5utt.jsonl
    │   ├── test-clean-3mix-8prof-10utt.jsonl
    │   ├── test-clean-3mix-8prof-1utt.jsonl
    │   ├── test-clean-3mix-8prof-2utt.jsonl
    │   └── test-clean-3mix-8prof-5utt.jsonl
    ├── test-clean-1mix.jsonl
    ├── test-clean-2mix.jsonl
    └── test-clean-3mix.jsonl
├── run.sh
└── utils
    └── mix_wavs.py


/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Naoyuki Kanda

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

============================
The original LibriSpeech corpus is distributed at http://www.openslr.org/12/
under CC BY 4.0.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# LibriSpeechMix

LibriSpeechMix is the dataset used in [Serialized Output Training for End-to-End Overlapped Speech Recognition](https://www.isca-speech.org/archive/Interspeech_2020/pdfs/0999.pdf) and [Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers](https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1085.pdf) for evaluating multi-talker speech recognition systems. The dataset has been derived from the LibriSpeech "dev_clean" and "test_clean" sets.
- Notable features
  - Consists of partially overlapped speech utterances (instead of the commonly used fully overlapped utterances), which is closer to real scenarios.
  - Designed for ASR evaluation.
    - The dataset comprises single-speaker, two-speaker-mixture, and three-speaker-mixture sets. Each utterance in the original LibriSpeech evaluation data is used exactly N times in the N-speaker set, which allows the WERs to be compared across the three speaker-number conditions.
    - No mixed audio contains multiple utterances of the same speaker.
  - Includes the information for speaker profile extraction, which is suitable for speaker-attributed automatic speech recognition (SA-ASR) experiments.
- The dataset was used for the papers listed below.
  - Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka: Serialized Output Training for End-to-End Overlapped Speech Recognition, Proc. Interspeech, pp. 2797-2801, 2020. [[pdf]](https://www.isca-speech.org/archive/Interspeech_2020/pdfs/0999.pdf)
  - Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka: Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers. Proc. Interspeech, pp. 36-40, 2020. [[pdf]](https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1085.pdf)
- Interested readers are also referred to the following related papers.
  - Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka: Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR, arXiv:2011.02921, 2020. [[pdf]](https://arxiv.org/pdf/2011.02921.pdf)
  - Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka: Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings. Proc. SLT, 2021 (to appear). [[pdf]](https://arxiv.org/pdf/2008.04546.pdf)

## Prerequisites
- Linux
- python3
- flac

## How to Generate Data
The following commands first download the LibriSpeech evaluation data ("dev_clean" and "test_clean") and then generate the mixed audio.
```sh
$ pip install soundfile librosa numpy
$ bash run.sh
```
The mixed audio files are generated under the ./data/ directory according to the information in the *.jsonl files.
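Once generation has finished, a quick check along the following lines could confirm that every mixed file named in a list was actually written. This is only a minimal sketch and not part of the repository: the helper name `missing_mixed_wavs` is hypothetical, and the only thing it relies on is that each JSON line carries a `mixed_wav` field holding a path relative to ./data/, as described in the data format section.

```python
import json
from pathlib import Path


def missing_mixed_wavs(jsonl_path, data_dir="./data"):
    """Return the mixed_wav paths listed in jsonl_path that are absent from data_dir.

    Hypothetical helper for illustration; it is not shipped with the repository.
    """
    data_dir = Path(data_dir)
    missing = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            # mixed_wav is stored relative to the data directory
            if not (data_dir / entry["mixed_wav"]).is_file():
                missing.append(entry["mixed_wav"])
    return missing


if __name__ == "__main__":
    for split in ("dev-clean", "test-clean"):
        for n_mix in (1, 2, 3):
            list_file = Path("list") / f"{split}-{n_mix}mix.jsonl"
            if list_file.is_file():  # skip lists that are not present
                missing = missing_mixed_wavs(list_file)
                print(f"{list_file}: {len(missing)} missing")
```

An empty result for a list means every mixed file it names exists under ./data/. The six list files that `run.sh` reads are: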
```
list/
├── dev-clean-1mix.jsonl
├── dev-clean-2mix.jsonl
├── dev-clean-3mix.jsonl
├── test-clean-1mix.jsonl
├── test-clean-2mix.jsonl
└── test-clean-3mix.jsonl
```

## Data Format of *.jsonl file
### Each line of *.jsonl corresponds to a string of JSON data.
|Element|Type|Meaning|
|---|---|---|
|id|Required|Utterance id|
|mixed_wav|Required|Path to the mixed audio (NOTE: relative path from ./data/)|
|texts|Required|Transcription|
|speaker_profile|Optional for SA-ASR|Audio list for speaker profile extraction (NOTE: relative path from ./data/)|
|speaker_profile_index|Optional for SA-ASR|Index of speaker profile corresponding to each utterance in the mixed audio|
|wavs||Original wav files used to generate the mixed audio (NOTE: relative path from ./data/)|
|delays||Delay (in seconds) applied to each utterance before mixing the signals|
|speakers||Speaker id of each utterance|
|durations||Duration (in seconds) of each original wav file|
|genders||Gender of the speaker of each utterance in the mixed audio|

### Example of 2-speaker-mixture audio (indented for visibility)
```
{
    "id": "dev-clean-2mix/dev-clean-2mix-0000",
    "mixed_wav": "dev-clean-2mix/dev-clean-2mix-0000.wav",
    "texts": [
        "MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL",
        "THAT ENCHANTMENT HAD POSSESSED HIM USURPING AS IT WERE THE THRONE OF HIS LIFE AND DISPLACING IT WHEN IT CEASED HE WAS NOT HIS OWN MASTER"],
    "speaker_profile": [
        ["dev-clean/6241/61943/6241-61943-0008.wav", "dev-clean/6241/61946/6241-61946-0001.wav"],
        ["dev-clean/174/84280/174-84280-0011.wav", "dev-clean/174/50561/174-50561-0005.wav"],
        ["dev-clean/1988/147956/1988-147956-0007.wav", "dev-clean/1988/24833/1988-24833-0011.wav"],
        ["dev-clean/7850/281318/7850-281318-0016.wav", "dev-clean/7850/286674/7850-286674-0013.wav"],
        ["dev-clean/1919/142785/1919-142785-0024.wav", "dev-clean/1919/142785/1919-142785-0034.wav"],
        ["dev-clean/6295/244435/6295-244435-0023.wav", "dev-clean/6295/64301/6295-64301-0002.wav"],
        ["dev-clean/2428/83699/2428-83699-0005.wav", "dev-clean/2428/83705/2428-83705-0025.wav"],
        ["dev-clean/1272/141231/1272-141231-0022.wav", "dev-clean/1272/128104/1272-128104-0005.wav"]],
    "speaker_profile_index": [7, 5],
    "wavs": ["dev-clean/1272/128104/1272-128104-0000.wav", "dev-clean/6295/64301/6295-64301-0026.wav"],
    "delays": [0.0, 4.469242864375414],
    "speakers": ["1272", "6295"],
    "durations": [5.855, 10.43],
    "genders": ["m", "m"]
}
```

## Optional list
- The ./list/optional/ directory contains optional *.jsonl files with different profile settings for SA-ASR.
- Each file is named [dev|test]-clean-[1|2|3]mix-8prof-[1|2|5|10]utt.jsonl.
  - [dev|test] indicates whether the file is development data or test data.
  - [1|2|3]mix indicates the number of utterances mixed into each audio.
  - [1|2|5|10]utt indicates the number of utterances used to extract the speaker profile of each speaker.
- Files with the suffix '-8prof-2utt.jsonl' are identical to the files in the ./list/ directory.

## When referring to this dataset, one of the following papers may be cited.
```
@inproceedings{kanda2020serialized,
  title={Serialized Output Training for End-to-End Overlapped Speech Recognition},
  author={Kanda, Naoyuki and Gaur, Yashesh and Wang, Xiaofei and Meng, Zhong and Yoshioka, Takuya},
  booktitle={Proc. Interspeech},
  pages={2797--2801},
  year={2020}
}

@inproceedings{kanda2020joint,
  title={Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers},
  author={Kanda, Naoyuki and Gaur, Yashesh and Wang, Xiaofei and Meng, Zhong and Chen, Zhuo and Zhou, Tianyan and Yoshioka, Takuya},
  booktitle={Proc. Interspeech},
  pages={36--40},
  year={2020}
}
```


--------------------------------------------------------------------------------
/run.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Copyright 2020 Naoyuki Kanda
# MIT license

set -e
set -u
set -o pipefail

data_out=./data

# all utterances are FLAC compressed
if ! which flac >&/dev/null; then
    echo "Please install 'flac'!"
    exit 1
fi

# download & untar necessary data
if [ ! -d $data_out/original ]; then
    mkdir -p $data_out/original
    (
        cd $data_out/original
        for dataid in dev-clean test-clean; do
            wget http://www.openslr.org/resources/12/$dataid.tar.gz
            tar xvzf $dataid.tar.gz
        done
    )
fi

# convert flac to wav
if [ ! -f $data_out/.done.wavfile_gen ]; then
    for flac_file in `find $data_out/original/LibriSpeech -type f | grep '\.flac'`; do
        # print the command that is about to run
        echo flac -d -s $flac_file -o ${flac_file/\.flac/.wav}
        flac -d -s $flac_file -o ${flac_file/\.flac/.wav}
    done
    for dataid in dev-clean test-clean; do
        (cd $data_out; ln -s original/LibriSpeech/$dataid $dataid)
    done
    touch $data_out/.done.wavfile_gen
fi

# generate mixed wav
if [ ! -f $data_out/.done.mix_wavfile_gen ]; then
    for datatype in dev-clean test-clean; do
        for mix in 1 2 3; do
            python utils/mix_wavs.py \
                list/${datatype}-${mix}mix.jsonl \
                $data_out \
                $data_out
        done
    done
    touch $data_out/.done.mix_wavfile_gen
fi


--------------------------------------------------------------------------------
/utils/mix_wavs.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

# Copyright 2020 Naoyuki Kanda
# MIT license

import sys
import os
import json
import soundfile
import librosa
import numpy as np


def get_delayed_audio(wav_file, delay, sampling_rate=16000):
    # the sampling rate stored in the file is ignored; 16 kHz is assumed
    audio, _ = soundfile.read(wav_file)
    delay_frame = int(delay * sampling_rate)
    if delay_frame != 0:
        # prepend silence so that the utterance starts `delay` seconds in
        audio = np.append(np.zeros(delay_frame), audio)
    return audio


def mix_audio(wavin_dir, wav_files, delays):
    for i, wav_file in enumerate(wav_files):
        if i == 0:
            audio = get_delayed_audio(os.path.join(wavin_dir, wav_file), delays[i])
        else:
            additional_audio = get_delayed_audio(os.path.join(wavin_dir, wav_file), delays[i])
            # tune length & sum up to audio
            target_length = max(len(audio), len(additional_audio))
            # `size` must be passed by keyword in recent librosa releases
            audio = librosa.util.fix_length(audio, size=target_length)
            additional_audio = librosa.util.fix_length(additional_audio, size=target_length)
            audio = audio + additional_audio
    return audio


if __name__ == "__main__":
    jsonl_file = sys.argv[1]
    wavin_dir = sys.argv[2]
    wavout_dir = sys.argv[3]

    with open(jsonl_file, "r") as f:
        for line in f:
            data = json.loads(line)
            mixed_audio = mix_audio(wavin_dir, data['wavs'], data['delays'])

            outfile_path = os.path.join(wavout_dir, data['mixed_wav'])
            outdir = os.path.dirname(outfile_path)
            if not os.path.exists(outdir):
                os.makedirs(outdir)
            soundfile.write(outfile_path, mixed_audio, samplerate=16000)

--------------------------------------------------------------------------------