├── requirements.txt
├── .gitignore
├── LICENSE
├── kaldi.sh
├── main.py
└── README.md

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pysubs2
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
series
venv
temp
kaldi
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org>
--------------------------------------------------------------------------------
/kaldi.sh:
--------------------------------------------------------------------------------
#!/bin/bash
# Adapted from kaldi/egs/voxceleb/v2/run.sh:
# Copyright 2017 Johns Hopkins University (Author: Daniel Garcia-Romero)
#           2017 Johns Hopkins University (Author: Daniel Povey)
#      2017-2018 David Snyder
#           2018 Ewald Enzinger
# Apache 2.0.
#
# This adaptation skips training: it registers the utterances cut out by main.py
# as a VoxCeleb2-style "test" set, computes MFCCs and VAD for them, and extracts
# x-vectors with the network installed in $nnet_dir (see the README).

cd ../kaldi/egs/voxceleb/v2 # Move to the recipe's root dir

. ./cmd.sh
. ./path.sh
set -e
mfccdir=`pwd`/mfcc
vaddir=`pwd`/mfcc


# The trials file is downloaded by local/make_voxceleb1.pl.
voxceleb1_trials=data/voxceleb1_test/trials
voxceleb1_root=/export/corpora/VoxCeleb1
voxceleb2_root=~/projects/waifuchat/temp/utt
nnet_dir=exp/xvector_nnet_1a
musan_root=/export/corpora/JHU/musan
results_dir=~/projects/waifuchat/temp

stage=0

if [ $stage -le 0 ]; then
  #local/make_voxceleb2.pl $voxceleb2_root dev data/voxceleb2_train
  local/make_voxceleb2.pl $voxceleb2_root test data/voxceleb2_test
  # This script creates data/voxceleb1_test and data/voxceleb1_train.
  # Our evaluation set is the test portion of VoxCeleb1.
  #local/make_voxceleb1.pl $voxceleb1_root data
  # The original recipe combines all of VoxCeleb2 plus the training portion of
  # VoxCeleb1 here (7,351 speakers and 1,277,503 utterances); in this project only
  # the utterances extracted by main.py (registered above as data/voxceleb2_test)
  # go into data/train.
  utils/combine_data.sh data/train data/voxceleb2_test
fi


if [ $stage -le 1 ]; then
  # Make MFCCs and compute the energy-based VAD for each dataset
  for name in train; do
    steps/make_mfcc.sh --write-utt2num-frames true --mfcc-config conf/mfcc.conf --nj 40 --cmd "$train_cmd" \
      data/${name} exp/make_mfcc $mfccdir
    utils/fix_data_dir.sh data/${name}
    sid/compute_vad_decision.sh --nj 1 --cmd "$train_cmd" \
      data/${name} exp/make_vad $vaddir
    utils/fix_data_dir.sh data/${name}
  done
fi

# Extract one x-vector per utterance with the network in $nnet_dir.
sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 4G" --nj 1 \
  $nnet_dir data/train \
  $nnet_dir/xvectors_train

# Dump the x-vectors as a readable text archive for main.py to parse.
../../../src/bin/copy-vector ark:$nnet_dir/xvectors_train/xvector.1.ark ark,t:$results_dir/xvectors.txt
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import pysubs2
import subprocess, re, json
from os import listdir

# 0-based index of the subtitle track to use, among the subtitle tracks listed by
# `mkvmerge --identify` (see the Data section of the README).
subtitleTrack = 0

def run(cmd):
    # Run a command and return its stdout as text.
    return subprocess.check_output(cmd).decode("utf-8", errors="replace")

def formatTime(t):  # t in millisecs, returns in format 00:01:02.500
    return '{0:02d}:{1:02d}:{2:02d}.{3:03d}'.format(t//(1000*60*60), (t//(1000*60))%60, (t//(1000))%60, t%1000)

data = {}

for serie in listdir('./series'):
    data[serie] = []
    run(["rm", "-rf", "temp"])
    run(["mkdir", "-p", "temp/utt/test/aac"])
    for video in listdir('./series/'+serie):
        filename = './series/'+serie+'/'+video
        tracks = run(["mkvmerge", "--identify", filename])
        subtracks = re.findall(r'Track ID ([0-9]*): subtitles.*', tracks)
        audiotracks = re.findall(r'Track ID ([0-9]*): audio.*', tracks)
        if subtracks and audiotracks:
            run(["mkvextract", filename, "tracks", audiotracks[0]+":temp/audio"])
            run(["mkvextract", filename, "tracks", subtracks[subtitleTrack]+":temp/subs"])
            subs = pysubs2.load("temp/subs", encoding="utf-8")
            for line in subs:
                if line.start == line.end:
                    continue
                # Cut the audio segment for this subtitle and store it using the VoxCeleb2
                # directory layout (aac/<speaker>/<video>/<utt>.m4a) expected by
                # local/make_voxceleb2.pl. Every subtitle line gets its own "speaker" dir so
                # that kaldi.sh extracts one x-vector per line; 01dfn2spqyE is a dummy video id.
                audioFilename = str(len(data[serie]))
                run(["mkdir", "-p", "temp/utt/test/aac/"+audioFilename+"/01dfn2spqyE"])
                run(["ffmpeg", "-loglevel", "quiet", "-i", "temp/audio", "-ss", formatTime(line.start), "-to", formatTime(line.end), "-ar", "16000", "temp/utt/test/aac/"+audioFilename+"/01dfn2spqyE/"+audioFilename+".m4a"])
                data[serie].append({
                    'text': line.text,
                    'speaker': None,
                    'start': line.start,
                    'end': line.end
                })
    run(["bash", "kaldi.sh"])  # Compute xvectors using Kaldi

    # Parse Kaldi's text archive: each line looks like "<utterance id>  [ v1 v2 ... vN ]".
    # The numeric file name embedded in the utterance id is the index the segment was
    # given above, so it maps the vector back to data[serie].
    with open("temp/xvectors.txt", "r") as xvectors:
        for line in xvectors:
            audioFilename, vector = re.findall(r'([A-z0-9]*) \[([^\]]*)\]', line)[0]
            data[serie][int(audioFilename)]["xvector"] = list(map(float, vector.strip().split(' ')))

with open("data.json", "w") as f:
    json.dump(data, f)


# TODO:
# Apply clustering
# Save all data
# Get user input on chosen waifu
# Train seq2seq nn
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Waifu Chat
> AI-based chatbot trained on specific waifus' speech

## Using it
### Set up / Installation
1. [Install Kaldi](http://kaldi-asr.org/doc/install.html)
2. Install ffmpeg and mkvtoolnix (`sudo apt-get install mkvtoolnix ffmpeg`)
3. Install Python dependencies (`pip install -r requirements.txt`)
4. Download the VoxCeleb1 and VoxCeleb2 datasets and train the VoxCeleb Kaldi DNN on them (by running `kaldi/egs/voxceleb/v2/run.sh`)
   **- OR -**
   Download a pretrained network like [2] and install it by replacing the files in `kaldi/egs/voxceleb/v2` with the downloaded ones
5. Prepare the data following the guidelines in the **Data** section

### Data

The system takes unmodified anime episodes as input. It assumes these are in `.mkv` format and that the first audio and subtitle tracks, as listed by `mkvmerge --identify`, are the ones that should be used to build the model. This is usually the case. The only issue this assumption can cause is that, if a series has its first subtitle track in Japanese and the second one in English, the model will use the Japanese one, resulting in a chatbot that speaks Japanese. If that happens, the subtitle track being used can be changed by modifying the definition of the variable `subtitleTrack` in `main.py`, setting it to the 0-based index of the desired subtitle track (so if it's set to `0`, the system uses the first subtitle track listed by `mkvmerge --identify`; if it's set to `2`, it uses the third one, and so on).

These episodes should be placed in the `series` folder, organized by series, so that all the episodes of a series are in the same folder, following this folder structure:
- `series`
- `series/Shingeki no Kyojin`
- `series/Shingeki no Kyojin/1x01.mkv`
- `series/Shingeki no Kyojin/1x02.mkv`
- `series/Shingeki no Kyojin/1x03.mkv`
- `...`
- `series/Shingeki no Kyojin/2x01.mkv`
- `series/Shingeki no Kyojin/2x02.mkv`
- `...`
- `series/Sword Art Online`
- `series/Sword Art Online/Episode 1x01.mkv`
- `series/Sword Art Online/Episode 1x02.mkv`
- `series/Sword Art Online/Episode 1x03.mkv`
- `...`

### Run
```
python3 main.py
```

## How does it work?
1. Extract the audio and subtitle tracks from the anime episodes.
2. Using timing information from the subtitles, extract the audio segment associated with each subtitle, so that we have a list of (subtitle text, audio segment) tuples for all the subtitles in a series.
3. Extract the x-vector associated with each audio segment. This is done by computing a set of acoustic features from the segment and feeding them to a neural network that outputs an embedding called an x-vector. The original research behind this can be found in [1]. We use the Kaldi implementation with a pretrained DNN[2] that was trained on the VoxCeleb datasets[3][4].
4. Use DBSCAN[5] to cluster the x-vectors, grouping together the audio segments spoken by the same speaker. After this step we should have a list of (subtitle text, speaker) tuples for all subtitles. (A sketch of this step is included below, after the **Why?** section.)
5. Select the speaker or set of speakers to mimic via user input (also sketched below).
6. Train a seq2seq[6] model on that speaker's lines.
7. Serve the chatbot as a Discord bot.

## Why?
> No waifu no laifu

- Adolf Hitler
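
## Sketch: clustering the x-vectors (step 4)

The clustering step is still a `TODO` in `main.py`. The snippet below is a minimal sketch of how it could look using scikit-learn's `DBSCAN` on the x-vectors stored in `data.json`. It assumes numpy and scikit-learn as extra dependencies (they are not in `requirements.txt`), and the `eps` / `min_samples` values are placeholders that would need tuning.

```
# Minimal sketch, not part of main.py yet: cluster the x-vectors in data.json so that
# segments spoken by the same character end up sharing a "speaker" label.
# Assumes numpy and scikit-learn are installed (not listed in requirements.txt).
import json

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

with open("data.json") as f:
    data = json.load(f)

for serie, lines in data.items():
    # Keep only the lines that actually received an x-vector.
    indexed = [(i, l["xvector"]) for i, l in enumerate(lines) if "xvector" in l]
    if not indexed:
        continue
    ids, vectors = zip(*indexed)
    # Length-normalize so that euclidean distances behave like cosine distances.
    X = normalize(np.array(vectors))
    # eps and min_samples are placeholder values and would need tuning per series.
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
    for i, label in zip(ids, labels):
        # DBSCAN labels outliers as -1; leave those segments without a speaker.
        lines[i]["speaker"] = int(label) if label != -1 else None

with open("data.json", "w") as f:
    json.dump(data, f)
```

DBSCAN fits this problem because the number of speakers in a series is not known in advance; `eps` controls how far apart two x-vectors can be while still being treated as the same speaker, and segments labelled `-1` are left unassigned.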

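## Sketch: from clusters to training pairs (steps 5 and 6)

Steps 5 and 6 are not implemented yet either. As a rough illustration only (not the project's actual implementation), once `data.json` has speaker labels, the chosen character's lines could be turned into (input, reply) pairs for a seq2seq model like this; the cluster id is assumed to come from user input, and `pairs.json` is a hypothetical output file.

```
# Rough illustration, not the project's implementation: build (input, reply) pairs for a
# seq2seq model from the lines of one chosen character in data.json.
import json

with open("data.json") as f:
    data = json.load(f)

serie = input("Series name: ")
waifu = int(input("Cluster id of the character to mimic: "))

lines = data[serie]
pairs = []
for prev, cur in zip(lines, lines[1:]):
    # Treat a line by the chosen character as a reply to the subtitle right before it,
    # as long as that previous line was said by someone else.
    if cur.get("speaker") == waifu and prev.get("speaker") != waifu:
        pairs.append({"input": prev["text"], "reply": cur["text"]})

# pairs.json is a hypothetical output file name.
with open("pairs.json", "w") as f:
    json.dump(pairs, f)

print(len(pairs), "training pairs written to pairs.json")
```

A real implementation would also strip ASS override tags from the subtitle text (e.g. via pysubs2's `plaintext`) and probably merge consecutive lines by the same speaker before pairing them up.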

## References
[1] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D. and Khudanpur, S. 2018. X-vectors: Robust DNN Embeddings for Speaker Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] http://kaldi-asr.org/models/m7 (VoxCeleb Xvector System 1a), retrieved 19 December 2018.
[3] Nagrani, A., Chung, J. S. and Zisserman, A. 2017. VoxCeleb: a large-scale speaker identification dataset. INTERSPEECH.
[4] Chung, J. S., Nagrani, A. and Zisserman, A. 2018. VoxCeleb2: Deep Speaker Recognition. INTERSPEECH.
[5] Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD).
[6] Sutskever, I., Vinyals, O. and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems (NIPS).
--------------------------------------------------------------------------------