├── LICENSE
├── README.md
├── list
│   ├── dev-clean-1mix.jsonl
│   ├── dev-clean-2mix.jsonl
│   ├── dev-clean-3mix.jsonl
│   ├── optional
│   │   ├── dev-clean-1mix-8prof-10utt.jsonl
│   │   ├── dev-clean-1mix-8prof-1utt.jsonl
│   │   ├── dev-clean-1mix-8prof-2utt.jsonl
│   │   ├── dev-clean-1mix-8prof-5utt.jsonl
│   │   ├── dev-clean-2mix-8prof-10utt.jsonl
│   │   ├── dev-clean-2mix-8prof-1utt.jsonl
│   │   ├── dev-clean-2mix-8prof-2utt.jsonl
│   │   ├── dev-clean-2mix-8prof-5utt.jsonl
│   │   ├── dev-clean-3mix-8prof-10utt.jsonl
│   │   ├── dev-clean-3mix-8prof-1utt.jsonl
│   │   ├── dev-clean-3mix-8prof-2utt.jsonl
│   │   ├── dev-clean-3mix-8prof-5utt.jsonl
│   │   ├── test-clean-1mix-8prof-10utt.jsonl
│   │   ├── test-clean-1mix-8prof-1utt.jsonl
│   │   ├── test-clean-1mix-8prof-2utt.jsonl
│   │   ├── test-clean-1mix-8prof-5utt.jsonl
│   │   ├── test-clean-2mix-8prof-10utt.jsonl
│   │   ├── test-clean-2mix-8prof-1utt.jsonl
│   │   ├── test-clean-2mix-8prof-2utt.jsonl
│   │   ├── test-clean-2mix-8prof-5utt.jsonl
│   │   ├── test-clean-3mix-8prof-10utt.jsonl
│   │   ├── test-clean-3mix-8prof-1utt.jsonl
│   │   ├── test-clean-3mix-8prof-2utt.jsonl
│   │   └── test-clean-3mix-8prof-5utt.jsonl
│   ├── test-clean-1mix.jsonl
│   ├── test-clean-2mix.jsonl
│   └── test-clean-3mix.jsonl
├── run.sh
└── utils
    └── mix_wavs.py


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 Naoyuki Kanda
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
23 | ============================
24 | The original LibriSpeech corpus is distributed at http://www.openslr.org/12/
25 | under CC BY 4.0. 
26 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # LibriSpeechMix
  2 | 
  3 | LibriSpeechMix is the dataset used in [Serialized Output Training for End-to-End Overlapped Speech Recognition](https://www.isca-speech.org/archive/Interspeech_2020/pdfs/0999.pdf) and [Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers](https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1085.pdf) for evaluating multi-talker speech recognition systems. The dataset is derived from the LibriSpeech "dev-clean" and "test-clean" sets.
  4 | - Notable features
  5 |   - Consists of partially overlapped speech utterances (instead of the commonly used fully overlapped utterances), which is closer to real scenarios.
  6 |   - Designed for ASR evaluation.
  7 |     - The dataset comprises single-speaker, two-speaker-mixture, and three-speaker-mixture subsets. Each utterance in the original LibriSpeech evaluation data is used exactly N times in the N-speaker subset, which allows WERs to be compared across the three speaker-count conditions.
  8 |     - No mixed audio contains multiple utterances from the same speaker.
  9 |   - Includes the information for speaker profile extraction, which makes it suitable for speaker-attributed automatic speech recognition (SA-ASR) experiments.
 10 | - The dataset was used for the papers listed below. 
 11 |   - Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka: Serialized Output Training for End-to-End Overlapped Speech Recognition, Proc. Interspeech, pp. 2797-2801, 2020. [[pdf]](https://www.isca-speech.org/archive/Interspeech_2020/pdfs/0999.pdf)
 12 |   - Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka: Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers, Proc. Interspeech, pp. 36-40, 2020. [[pdf]](https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1085.pdf)
 13 | - Interested readers are also referred to the following related papers.
 14 |   - Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka: Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR, arXiv:2011.02921, 2020. [[pdf]](https://arxiv.org/pdf/2011.02921.pdf)
 15 |   - Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka: Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings, Proc. SLT, 2021 (to appear). [[pdf]](https://arxiv.org/pdf/2008.04546.pdf)
 16 | 
 17 | ## Prerequisites
 18 | - Linux
 19 |   - python3
 20 |   - flac
 21 | 
 22 | ## How to Generate Data
 23 | The following commands first download the LibriSpeech evaluation data ("dev-clean" and "test-clean") and then generate the mixed audio.
 24 | ```sh
 25 | $ pip install soundfile librosa numpy
 26 | $ bash run.sh
 27 | ```
 28 | The mixed audio files are generated under the ./data/ directory according to the information in the *.jsonl files listed below.
 29 | ```
 30 | list/
 31 | ├── dev-clean-1mix.jsonl
 32 | ├── dev-clean-2mix.jsonl
 33 | ├── dev-clean-3mix.jsonl
 34 | ├── test-clean-1mix.jsonl
 35 | ├── test-clean-2mix.jsonl
 36 | └── test-clean-3mix.jsonl
 37 | ```
 38 | 
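The mixing step amounts to delay-and-sum: each source utterance is prefixed with its delay as silence, the signals are zero-padded to the longest one, and the results are added. The following is a minimal NumPy-only sketch of that idea on synthetic signals, assuming 16 kHz audio as in utils/mix_wavs.py. The function name `delay_and_sum` is hypothetical and not part of this repository.

```python
import numpy as np

SAMPLING_RATE = 16000  # assumption: 16 kHz, matching utils/mix_wavs.py


def delay_and_sum(signals, delays, sampling_rate=SAMPLING_RATE):
    """Hypothetical helper: shift each signal by its delay (seconds) and sum."""
    # Number of leading silent samples for each signal.
    offsets = [int(d * sampling_rate) for d in delays]
    total_length = max(off + len(sig) for off, sig in zip(offsets, signals))
    mixed = np.zeros(total_length)
    for off, sig in zip(offsets, signals):
        # Adding into a slice is equivalent to zero-padding and summing.
        mixed[off:off + len(sig)] += sig
    return mixed


# Synthetic stand-ins for two utterances (not real speech).
first = np.ones(16000)         # 1.0 s of samples
second = 0.5 * np.ones(8000)   # 0.5 s of samples, delayed by 0.75 s
mixed = delay_and_sum([first, second], delays=[0.0, 0.75])
```

Here the two signals overlap only between 0.75 s and 1.0 s, which mirrors the partially overlapped design of the dataset.
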
 39 | ## Data Format of *.jsonl files
 40 | ### Each line of a *.jsonl file is a single JSON object.
 41 | |Element|Requirement|Meaning|
 42 | |---|---|---|
 43 | |id|Required|Utterance id|
 44 | |mixed_wav|Required|Path to the mixed audio (NOTE: relative path from ./data/)|
 45 | |texts|Required|Transcription of each utterance in the mixed audio|
 46 | |speaker_profile|Optional (for SA-ASR)|Audio list for speaker profile extraction (NOTE: relative path from ./data/)|
 47 | |speaker_profile_index|Optional (for SA-ASR)|Index of the speaker profile corresponding to each utterance in the mixed audio|
 48 | |wavs||Original wav files used to generate the mixed audio (NOTE: relative path from ./data/)|
 49 | |delays||Delay (in seconds) applied to each utterance before mixing the signals|
 50 | |speakers||Speaker id of each utterance|
 51 | |durations||Duration (in seconds) of each original wav file|
 52 | |genders||Gender of the speaker of each utterance in the mixed audio|
 53 | 
 54 | ### Example of 2-speaker-mixture audio (indented for visibility)
 55 | ```
 56 | {
 57 |     "id": "dev-clean-2mix/dev-clean-2mix-0000", 
 58 |     "mixed_wav": "dev-clean-2mix/dev-clean-2mix-0000.wav", 
 59 |     "texts": [
 60 |         "MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL",
 61 |         "THAT ENCHANTMENT HAD POSSESSED HIM USURPING AS IT WERE THE THRONE OF HIS LIFE AND DISPLACING IT WHEN IT CEASED HE WAS NOT HIS OWN MASTER"], 
 62 |     "speaker_profile": [
 63 |         ["dev-clean/6241/61943/6241-61943-0008.wav", "dev-clean/6241/61946/6241-61946-0001.wav"], 
 64 |         ["dev-clean/174/84280/174-84280-0011.wav", "dev-clean/174/50561/174-50561-0005.wav"], 
 65 |         ["dev-clean/1988/147956/1988-147956-0007.wav", "dev-clean/1988/24833/1988-24833-0011.wav"],
 66 |         ["dev-clean/7850/281318/7850-281318-0016.wav", "dev-clean/7850/286674/7850-286674-0013.wav"],
 67 |         ["dev-clean/1919/142785/1919-142785-0024.wav", "dev-clean/1919/142785/1919-142785-0034.wav"],
 68 |         ["dev-clean/6295/244435/6295-244435-0023.wav", "dev-clean/6295/64301/6295-64301-0002.wav"],
 69 |         ["dev-clean/2428/83699/2428-83699-0005.wav", "dev-clean/2428/83705/2428-83705-0025.wav"], 
 70 |         ["dev-clean/1272/141231/1272-141231-0022.wav", "dev-clean/1272/128104/1272-128104-0005.wav"]], 
 71 |     "speaker_profile_index": [7, 5], 
 72 |     "wavs": ["dev-clean/1272/128104/1272-128104-0000.wav", "dev-clean/6295/64301/6295-64301-0026.wav"],
 73 |     "delays": [0.0, 4.469242864375414], 
 74 |     "speakers": ["1272", "6295"], 
 75 |     "durations": [5.855, 10.43], 
 76 |     "genders": ["m", "m"]
 77 | }
 78 | ```
 79 | 
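As a reading aid, the sketch below parses a trimmed copy of the record above and derives the time span of each utterance inside the mixture (from `delays` and `durations`) together with its profile audio (from `speaker_profile_index`, which the example suggests is 0-based). The helper name `utterance_spans` is hypothetical, and each profile is cut down to a single wav file to keep the example short.

```python
import json

# One record, trimmed from the example above (one profile wav per speaker).
line = json.dumps({
    "id": "dev-clean-2mix/dev-clean-2mix-0000",
    "speaker_profile": [
        ["dev-clean/6241/61943/6241-61943-0008.wav"],
        ["dev-clean/174/84280/174-84280-0011.wav"],
        ["dev-clean/1988/147956/1988-147956-0007.wav"],
        ["dev-clean/7850/281318/7850-281318-0016.wav"],
        ["dev-clean/1919/142785/1919-142785-0024.wav"],
        ["dev-clean/6295/244435/6295-244435-0023.wav"],
        ["dev-clean/2428/83699/2428-83699-0005.wav"],
        ["dev-clean/1272/141231/1272-141231-0022.wav"],
    ],
    "speaker_profile_index": [7, 5],
    "delays": [0.0, 4.469242864375414],
    "speakers": ["1272", "6295"],
    "durations": [5.855, 10.43],
})


def utterance_spans(record):
    """Hypothetical helper: (start, end) in seconds of each utterance in the mix."""
    return [(d, d + dur) for d, dur in zip(record["delays"], record["durations"])]


record = json.loads(line)
spans = utterance_spans(record)
mix_duration = max(end for _, end in spans)
# Overlap (in seconds) between the two utterances of this 2-speaker record.
overlap = max(0.0, min(spans[0][1], spans[1][1]) - max(spans[0][0], spans[1][0]))
# Profile audio for each utterance, treating speaker_profile_index as 0-based.
profiles = [record["speaker_profile"][i] for i in record["speaker_profile_index"]]
```

In this record the second utterance starts about 4.47 s into the first one, so only the last 1.39 s or so of the first utterance is overlapped.
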
 80 | ## Optional list
 81 | - The ./list/optional/ directory contains optional *.jsonl files with different profile settings for SA-ASR.
 82 | - Each file is named [dev|test]-clean-[1|2|3]mix-8prof-[1|2|5|10]utt.jsonl.
 83 |   - [dev|test] indicates whether the file describes development data or test data
 84 |   - [1|2|3]mix indicates the number of utterances mixed into each audio
 85 |   - [1|2|5|10]utt indicates the number of utterances used to extract the profile of each speaker
 86 | - Files with the suffix '-8prof-2utt.jsonl' are identical to the corresponding files in the ./list/ directory.
 87 | 
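The naming scheme above can be decoded mechanically. The following is a small sketch using a regular expression; the helper name `parse_optional_name` and the returned keys are hypothetical, and reading `8prof` as "eight candidate speaker profiles" is an assumption based on the eight `speaker_profile` entries in the example record.

```python
import re

# Pattern for the optional list names described above.
OPTIONAL_LIST = re.compile(
    r"^(?P<split>dev|test)-clean-(?P<mix>[123])mix-8prof-(?P<utt>1|2|5|10)utt\.jsonl$"
)


def parse_optional_name(filename):
    """Hypothetical helper: split an optional list name into its settings."""
    m = OPTIONAL_LIST.match(filename)
    if m is None:
        raise ValueError(f"unexpected list name: {filename}")
    return {
        "split": m["split"],
        "num_speakers": int(m["mix"]),
        "num_profiles": 8,  # assumption: '8prof' means eight candidate profiles
        "utts_per_profile": int(m["utt"]),
    }


info = parse_optional_name("test-clean-3mix-8prof-10utt.jsonl")
```

Names that do not follow the scheme, such as the plain `dev-clean-1mix.jsonl` lists in ./list/, raise a `ValueError` in this sketch.
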
 88 | ## When referring to this dataset, one of the following papers may be cited.
 89 | ```
 90 | @inproceedings{kanda2020serialized,
 91 |   title={Serialized Output Training for End-to-End Overlapped Speech Recognition},
 92 |   author={Kanda, Naoyuki and Gaur, Yashesh and Wang, Xiaofei and Meng, Zhong and Yoshioka, Takuya},
 93 |   booktitle={Proc. Interspeech},
 94 |   pages={2797--2801},
 95 |   year={2020}
 96 | }
 97 | 
 98 | @inproceedings{kanda2020joint,
 99 |   title={Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers},
100 |   author={Kanda, Naoyuki and Gaur, Yashesh and Wang, Xiaofei and Meng, Zhong and Chen, Zhuo and Zhou, Tianyan and Yoshioka, Takuya},
101 |   booktitle={Proc. Interspeech},
102 |   pages={36--40},
103 |   year={2020}
104 | }
105 | ```
106 | 


--------------------------------------------------------------------------------
/run.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Copyright 2020 Naoyuki Kanda
 4 | # MIT license
 5 | 
 6 | set -e
 7 | set -u
 8 | set -o pipefail
 9 | 
10 | data_out=./data
11 | 
12 | # all utterances are FLAC compressed
13 | if ! which flac >&/dev/null; then
14 |    echo "Please install 'flac'!"
15 |    exit 1
16 | fi
17 | 
18 | # download & untar necessary data
19 | if [ ! -d $data_out/original ]; then
20 |     mkdir -p $data_out/original
21 |     (
22 |         cd $data_out/original
23 |         for dataid in dev-clean test-clean; do
24 |             wget http://www.openslr.org/resources/12/$dataid.tar.gz
25 |             tar xvzf $dataid.tar.gz
26 |         done
27 |     )
28 | fi
29 | 
30 | # convert flac to wav
31 | if [ ! -f $data_out/.done.wavfile_gen ]; then
32 |     for flac_file in $(find "$data_out/original/LibriSpeech" -type f -name '*.flac'); do
33 |         echo flac -d -s -f "$flac_file" -o "${flac_file%.flac}.wav"
34 |         flac -d -s -f "$flac_file" -o "${flac_file%.flac}.wav"
35 |     done
36 |     for dataid in dev-clean test-clean; do
37 |         (cd "$data_out"; ln -sfn "original/LibriSpeech/$dataid" "$dataid")
38 |     done
39 |     touch $data_out/.done.wavfile_gen
40 | fi
41 | 
42 | # generate mixed wav
43 | if [ ! -f $data_out/.done.mix_wavfile_gen ]; then
44 |     for datatype in dev-clean test-clean; do
45 |         for mix in 1 2 3; do
46 |             python3 utils/mix_wavs.py \
47 |                 list/${datatype}-${mix}mix.jsonl \
48 |                 $data_out \
49 |                 $data_out
50 |         done
51 |     done
52 |     touch $data_out/.done.mix_wavfile_gen 
53 | fi


--------------------------------------------------------------------------------
/utils/mix_wavs.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # Copyright 2020 Naoyuki Kanda
 4 | # MIT license
 5 | 
 6 | import sys
 7 | import os
 8 | import json
 9 | import soundfile
10 | import librosa
11 | import numpy as np
12 | 
13 | 
14 | def get_delayed_audio(wav_file, delay, sampling_rate=16000):
15 |     audio, _ = soundfile.read(wav_file)
16 |     delay_frame = int(delay * sampling_rate)
17 |     if delay_frame != 0:
18 |         audio = np.append(np.zeros(delay_frame), audio)
19 |     return audio
20 | 
21 | 
22 | def mix_audio(wavin_dir, wav_files, delays):
23 |     for i, wav_file in enumerate(wav_files):
24 |         if i == 0:
25 |             audio = get_delayed_audio(os.path.join(wavin_dir, wav_file), delays[i])
26 |         else:
27 |             additional_audio = get_delayed_audio(os.path.join(wavin_dir, wav_file), delays[i])
28 |             # pad both signals with trailing zeros to a common length, then sum them
29 |             target_length = max(len(audio), len(additional_audio))
30 |             audio = librosa.util.fix_length(audio, size=target_length)
31 |             additional_audio = librosa.util.fix_length(additional_audio, size=target_length)
32 |             audio = audio + additional_audio
33 |     return audio
34 | 
35 | 
36 | if __name__ == "__main__":
37 |     jsonl_file = sys.argv[1]
38 |     wavin_dir = sys.argv[2]
39 |     wavout_dir = sys.argv[3]
40 | 
41 |     with open(jsonl_file, "r") as f:
42 |         for line in f:
43 |             data = json.loads(line)
44 |             mixed_audio = mix_audio(wavin_dir, data['wavs'], data['delays'])
45 | 
46 |             outfile_path = os.path.join(wavout_dir, data['mixed_wav'])
47 |             outdir = os.path.dirname(outfile_path)
48 |             # create the output directory if it does not exist yet
49 |             os.makedirs(outdir, exist_ok=True)
50 |             soundfile.write(outfile_path, mixed_audio, samplerate=16000)
51 | 


--------------------------------------------------------------------------------