├── .DS_Store
├── README.md
├── asr1
│   ├── espnet_cmn
│   │   ├── asr.sh
│   │   ├── cmd.sh
│   │   ├── conf
│   │   │   ├── .DS_Store
│   │   │   ├── decode_asr.yaml
│   │   │   ├── decode_asr_lm.yaml
│   │   │   ├── fbank.conf
│   │   │   ├── pbs.conf
│   │   │   ├── pitch.conf
│   │   │   ├── queue.conf
│   │   │   ├── slurm.conf
│   │   │   ├── train.yaml
│   │   │   └── tuning
│   │   │       ├── .DS_Store
│   │   │       ├── train_asr_conformer.yaml
│   │   │       ├── train_asr_conformer_seame.yaml
│   │   │       ├── train_lm_conf.yaml
│   │   │       ├── train_lm_lstm.yaml
│   │   │       ├── train_lm_lstm2.yaml
│   │   │       └── train_lm_transformer.yaml
│   │   ├── db.sh
│   │   ├── filter.sh
│   │   ├── local
│   │   │   ├── __pycache__
│   │   │   │   └── preprocess.cpython-39.pyc
│   │   │   ├── add_lid.py
│   │   │   ├── add_lid_seame.py
│   │   │   ├── add_lid_seame_v2.py
│   │   │   ├── cmi.py
│   │   │   ├── cmi2.py
│   │   │   ├── data.sh
│   │   │   ├── path.sh
│   │   │   ├── preprocess.py
│   │   │   ├── score.sh
│   │   │   ├── split_lang_trn.py
│   │   │   ├── subset_seame_cs.py
│   │   │   └── subset_seame_mono.py
│   │   ├── path.sh
│   │   ├── pyscripts
│   │   ├── run.sh
│   │   ├── run_bigram.sh
│   │   ├── run_bigram_subset.sh
│   │   ├── run_bigram_subset2.sh
│   │   ├── run_mono.sh
│   │   ├── run_uni.sh
│   │   ├── run_uni_imp.sh
│   │   ├── sample.sh
│   │   ├── scripts
│   │   ├── seperate_mono.sh
│   │   ├── steps
│   │   └── utils
│   └── kaldi_cmn
│       ├── align.sh
│       ├── cmd.sh
│       ├── conf
│       │   ├── cmu2pinyin
│       │   ├── decode.config
│       │   ├── decode_dnn.config
│       │   ├── g2p_model
│       │   ├── mfcc.conf
│       │   ├── mfcc_hires.conf
│       │   ├── online_cmvn.conf
│       │   ├── pinyin2cmu
│       │   ├── slurm.conf
│       │   └── vad.conf
│       ├── decode.sh
│       ├── decode_test.sh
│       ├── fix_spk_pref.sh
│       ├── local
│       │   ├── format_data.sh
│       │   ├── format_data2.sh
│       │   ├── prep_dict_en_zh.sh
│       │   ├── prepare_dict.sh
│       │   ├── prepare_dict2.sh
│       │   ├── prepare_grammar.sh
│       │   ├── sample_data.sh
│       │   ├── score.sh
│       │   ├── train_lms.sh
│       │   └── train_lms_extra.sh
│       ├── path.sh
│       ├── results.sh
│       ├── run.sh
│       ├── sample.sh
│       ├── steps
│       └── utils
├── conf
│   └── slurm.conf
├── environment.yml
├── images
│   └── high-level.png
├── run.sh
├── run_cmn.sh
├── src
│   ├── __pycache__
│   │   ├── splice_bigram_random.cpython-38.pyc
│   │   ├── splice_unigram.cpython-38.pyc
│   │   ├── splice_unigram.cpython-39.pyc
│   │   ├── splice_unigram_improved.cpython-37.pyc
│   │   ├── splice_unigram_improved.cpython-38.pyc
│   │   ├── splice_unigram_improved.cpython-39.pyc
│   │   ├── utils.cpython-37.pyc
│   │   └── utils.cpython-38.pyc
│   ├── generate_bigram.py
│   ├── generate_unigram.py
│   ├── generate_unigram_improved.py
│   ├── seg2rec_ctm.py
│   ├── setup_recording_dict.py
│   ├── setup_supervision_bigram_dict.py
│   ├── setup_supervision_dict.py
│   ├── setup_supervision_improved_dict.py
│   ├── splice_bigram_random.py
│   ├── splice_unigram.py
│   ├── splice_unigram_improved.py
│   └── utils.py
├── test
│   ├── bigram_supervisions.pkl
│   ├── recording_dict.pkl
│   └── supervisions.pkl
└── utils
    ├── make_utt2spk.py
    └── make_wav_scp.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# **Speech Collage**

This repository contains the code used for the paper ["SPEECH COLLAGE: CODE-SWITCHED AUDIO GENERATION BY COLLAGING MONOLINGUAL CORPORA"](https://arxiv.org/pdf/2309.15674.pdf).

🔹 **Dataset Samples:** You can listen to a sample of the generated audio [here]().

---

## **High-Level Approach Description**

![Proposed Approach](images/high-level.png)

---

## Requirements

### Python Environment

- Python version `3.8.12`
- To create an Anaconda environment, run the following command:

```bash
conda env create -f environment.yml
```

### Install Necessary Toolkits

1. Install ESPnet and Kaldi by following the instructions provided [here](https://espnet.github.io/espnet/installation.html).

2. Install the SoX format libraries:

```bash
sudo apt-get install libsox-fmt-all
```

## Steps to Generate Audio from Monolingual Data

1. Train a standard HMM-GMM ASR system following the standard Kaldi recipes for your monolingual data. You can also follow the provided monolingual Chinese-English (Aishell + TED-LIUM 3) recipe in `asr1/kaldi_cmn/`.

2. Generate the alignment (ctm) file using the Kaldi script `steps/get_train_ctm.sh` and save it in your `data_dir`. Additionally, copy the `text` file (in this case, the code-switched text) used for generation. Note that you can use any text as long as you have monolingual audio for that text.

To generate the ctm using Kaldi:

```bash
steps/get_train_ctm.sh --use-segments false data/train data/lang exp/tri3_ali data_dir/ctm.mono
```

If the first column of the `ctm` file contains segment IDs, run:

```bash
python src/seg2rec_ctm.py data_dir
```

This will convert the segment IDs to the names of the audio recordings from `wav.scp`.

### Note: From this step onward, you can follow `run.sh` for an automated execution of the procedures below.

3. Following the Kaldi style, copy the `wav.scp` file containing the monolingual utterances to `data_dir`. Generate a recording dictionary as follows:

```bash
python src/setup_recording_dict.py ${indir}/wav.scp outdir
```

4. With the ctm file for the monolingual utterances and the recording dictionary, create a supervision dictionary. Choose one of the following options based on your requirements:

- For randomly generated utterances with unigram units and no signal enhancement:

```bash
python src/setup_supervision_dict.py data_dir/ctm.mono outdir/recording_dict.pkl outdir
```

- For randomly generated utterances with unigram units and signal enhancement:

```bash
python src/setup_supervision_improved_dict.py data_dir/ctm.mono outdir/recording_dict.pkl outdir
```

- For randomly generated utterances with bigram units and signal enhancement:

```bash
python src/setup_supervision_bigram_dict.py data_dir/ctm.mono outdir/recording_dict.pkl outdir
```

5. Run the audio generation. Below is an example for bigram-based generation:

```bash
./src/generate_bigram.py \
    --input text \
    --output outdir/bigrams \
    --data outdir \
    --jobs $nj
```

6. Once the audio files have been generated, run `make_wav_scp.py` to create the `wav.scp` file (a sketch of what this step produces is shown after the command):

```bash
python utils/make_wav_scp.py --audio-dir outdir/bigrams --out-dir data_dir_mode
```
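
Each line of a Kaldi-style `wav.scp` maps an utterance ID to the path of its audio file. The snippet below is a minimal, hypothetical sketch of such a helper: it assumes the generated audio files are `.wav` files named after their utterance IDs, and it is not necessarily identical to this repository's `utils/make_wav_scp.py`.

```python
#!/usr/bin/env python3
# Hypothetical sketch: build a Kaldi-style wav.scp from a directory of
# generated .wav files. Not necessarily identical to utils/make_wav_scp.py.
import argparse
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(description="Write wav.scp for generated audio")
    parser.add_argument("--audio-dir", required=True,
                        help="directory containing the generated .wav files")
    parser.add_argument("--out-dir", required=True,
                        help="Kaldi-style data directory to write wav.scp into")
    args = parser.parse_args()

    out_dir = Path(args.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with open(out_dir / "wav.scp", "w", encoding="utf-8") as scp:
        for wav in sorted(Path(args.audio_dir).glob("*.wav")):
            # Each line: <utterance-id> <absolute-path-to-audio>
            scp.write(f"{wav.stem} {wav.resolve()}\n")


if __name__ == "__main__":
    main()
```

Whatever tool you use, the utterance IDs written here must match the ones used in `text`, `utt2spk`, and `spk2utt` created in the next step.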
7. Create the rest of the necessary files: `text`, `utt2spk`, and `spk2utt`:

```bash
cp outdir/bigrams/transcripts.txt data_dir_mode/text
cat data_dir_mode/wav.scp | awk '{print $1 " " $1}' > data_dir_mode/utt2spk
cp data_dir_mode/utt2spk data_dir_mode/spk2utt
```

8. Use the resulting `data_dir_mode` data folder for ESPnet training.


## Cite the Paper

If you use this code for your work, please consider citing the paper:

```bibtex
@INPROCEEDINGS{10446857,
  author={Hussein, Amir and Zeinali, Dorsa and Klejch, Ondřej and Wiesner, Matthew and Yan, Brian and Chowdhury, Shammur and Ali, Ahmed and Watanabe, Shinji and Khudanpur, Sanjeev},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora},
  year={2024},
  pages={12006-12010},
  keywords={Training;Speech coding;Zero-shot learning;Splicing;Signal processing;Data augmentation;Data models;Code-switching;ASR;data augmentation;end-to-end;zero-shot learning},
  doi={10.1109/ICASSP48485.2024.10446857}
}
```

--------------------------------------------------------------------------------
/asr1/espnet_cmn/cmd.sh:
--------------------------------------------------------------------------------
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
#  run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
#    --time