├── requirements.txt
├── .gitignore
├── LICENSE
├── kaldi.sh
├── main.py
└── README.md

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pysubs2
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
series
venv
temp
kaldi
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org>
--------------------------------------------------------------------------------
/kaldi.sh:
--------------------------------------------------------------------------------
#!/bin/bash
# Adapted from kaldi/egs/voxceleb/v2/run.sh:
# Copyright 2017 Johns Hopkins University (Author: Daniel Garcia-Romero)
#           2017 Johns Hopkins University (Author: Daniel Povey)
#      2017-2018 David Snyder
#           2018 Ewald Enzinger
# Apache 2.0.
#
# This adaptation skips training: it registers the utterances cut out by main.py
# as a VoxCeleb2-style "test" set, computes MFCCs and VAD for them, and extracts
# x-vectors with the network installed in $nnet_dir (see the README).

cd ../kaldi/egs/voxceleb/v2 # Move to the recipe's root dir

. ./cmd.sh
. ./path.sh
set -e
mfccdir=`pwd`/mfcc
vaddir=`pwd`/mfcc


# The trials file is downloaded by local/make_voxceleb1.pl.
voxceleb1_trials=data/voxceleb1_test/trials
voxceleb1_root=/export/corpora/VoxCeleb1
voxceleb2_root=~/projects/waifuchat/temp/utt
nnet_dir=exp/xvector_nnet_1a
musan_root=/export/corpora/JHU/musan
results_dir=~/projects/waifuchat/temp

stage=0

if [ $stage -le 0 ]; then
  #local/make_voxceleb2.pl $voxceleb2_root dev data/voxceleb2_train
  local/make_voxceleb2.pl $voxceleb2_root test data/voxceleb2_test
  # This script creates data/voxceleb1_test and data/voxceleb1_train.
  # Our evaluation set is the test portion of VoxCeleb1.
  #local/make_voxceleb1.pl $voxceleb1_root data
  # The original recipe combines all of VoxCeleb2 plus the training portion of
  # VoxCeleb1 here (7,351 speakers and 1,277,503 utterances); in this project only
  # the utterances extracted by main.py (registered above as data/voxceleb2_test)
  # go into data/train.
  utils/combine_data.sh data/train data/voxceleb2_test
fi


if [ $stage -le 1 ]; then
  # Make MFCCs and compute the energy-based VAD for each dataset
  for name in train; do
    steps/make_mfcc.sh --write-utt2num-frames true --mfcc-config conf/mfcc.conf --nj 40 --cmd "$train_cmd" \
      data/${name} exp/make_mfcc $mfccdir
    utils/fix_data_dir.sh data/${name}
    sid/compute_vad_decision.sh --nj 1 --cmd "$train_cmd" \
      data/${name} exp/make_vad $vaddir
    utils/fix_data_dir.sh data/${name}
  done
fi

# Extract one x-vector per utterance with the network in $nnet_dir.
sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 4G" --nj 1 \
  $nnet_dir data/train \
  $nnet_dir/xvectors_train

# Dump the x-vectors as a readable text archive for main.py to parse.
../../../src/bin/copy-vector ark:$nnet_dir/xvectors_train/xvector.1.ark ark,t:$results_dir/xvectors.txt
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import pysubs2
import subprocess, re, json
from os import listdir

# 0-based index of the subtitle track to use, among the subtitle tracks listed by
# `mkvmerge --identify` (see the Data section of the README).
subtitleTrack = 0

def run(cmd):
    # Run a command and return its stdout as text.
    return subprocess.check_output(cmd).decode("utf-8", errors="replace")

def formatTime(t):  # t in millisecs, returns in format 00:01:02.500
    return '{0:02d}:{1:02d}:{2:02d}.{3:03d}'.format(t//(1000*60*60), (t//(1000*60))%60, (t//(1000))%60, t%1000)

data = {}

for serie in listdir('./series'):
    data[serie] = []
    run(["rm", "-rf", "temp"])
    run(["mkdir", "-p", "temp/utt/test/aac"])
    for video in listdir('./series/'+serie):
        filename = './series/'+serie+'/'+video
        tracks = run(["mkvmerge", "--identify", filename])
        subtracks = re.findall(r'Track ID ([0-9]*): subtitles.*', tracks)
        audiotracks = re.findall(r'Track ID ([0-9]*): audio.*', tracks)
        if subtracks and audiotracks:
            run(["mkvextract", filename, "tracks", audiotracks[0]+":temp/audio"])
            run(["mkvextract", filename, "tracks", subtracks[subtitleTrack]+":temp/subs"])
            subs = pysubs2.load("temp/subs", encoding="utf-8")
            for line in subs:
                if line.start == line.end:
                    continue
                # Cut the audio segment for this subtitle and store it using the VoxCeleb2
                # directory layout (aac/<speaker>/<video>/<utt>.m4a) expected by
                # local/make_voxceleb2.pl. Every subtitle line gets its own "speaker" dir so
                # that kaldi.sh extracts one x-vector per line; 01dfn2spqyE is a dummy video id.
                audioFilename = str(len(data[serie]))
                run(["mkdir", "-p", "temp/utt/test/aac/"+audioFilename+"/01dfn2spqyE"])
                run(["ffmpeg", "-loglevel", "quiet", "-i", "temp/audio", "-ss", formatTime(line.start), "-to", formatTime(line.end), "-ar", "16000", "temp/utt/test/aac/"+audioFilename+"/01dfn2spqyE/"+audioFilename+".m4a"])
                data[serie].append({
                    'text': line.text,
                    'speaker': None,
                    'start': line.start,
                    'end': line.end
                })
    run(["bash", "kaldi.sh"])  # Compute xvectors using Kaldi

    # Parse Kaldi's text archive: each line looks like "<utterance id>  [ v1 v2 ... vN ]".
    # The numeric file name embedded in the utterance id is the index the segment was
    # given above, so it maps the vector back to data[serie].
    with open("temp/xvectors.txt", "r") as xvectors:
        for line in xvectors:
            audioFilename, vector = re.findall(r'([A-z0-9]*) \[([^\]]*)\]', line)[0]
            data[serie][int(audioFilename)]["xvector"] = list(map(float, vector.strip().split(' ')))

with open("data.json", "w") as f:
    json.dump(data, f)


# TODO:
# Apply clustering
# Save all data
# Get user input on chosen waifu
# Train seq2seq nn
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Waifu Chat
> AI-based chatbot trained on specific waifus' speech

## Using it
### Set up / Installation
1. [Install Kaldi](http://kaldi-asr.org/doc/install.html)
2. Install ffmpeg and mkvtoolnix (`sudo apt-get install mkvtoolnix ffmpeg`)
3. Install Python dependencies (`pip install -r requirements.txt`)
4. Download the VoxCeleb1 and VoxCeleb2 datasets and train the VoxCeleb Kaldi DNN on them (by running `kaldi/egs/voxceleb/v2/run.sh`)
   **- OR -**
   Download a pretrained network like [2] and install it by replacing the files in `kaldi/egs/voxceleb/v2` with the downloaded ones
5. Prepare the data following the guidelines in the **Data** section

### Data

The system takes unmodified anime episodes as input. It assumes these are in `.mkv` format and that the first audio and subtitle tracks, as listed by `mkvmerge --identify`, are the ones that should be used to build the model. This is usually the case. The only issue this assumption can cause is that, if a series has its first subtitle track in Japanese and the second one in English, the model will use the Japanese one, resulting in a chatbot that speaks Japanese. If that happens, the subtitle track being used can be changed by modifying the definition of the variable `subtitleTrack` in `main.py`, setting it to the 0-based index of the desired subtitle track (so if it's set to `0`, the system uses the first subtitle track listed by `mkvmerge --identify`; if it's set to `2`, it uses the third one, and so on).

These episodes should be placed in the `series` folder, organized by series, so that all the episodes of a series are in the same folder, following this folder structure:
- `series`
- `series/Shingeki no Kyojin`
- `series/Shingeki no Kyojin/1x01.mkv`
- `series/Shingeki no Kyojin/1x02.mkv`
- `series/Shingeki no Kyojin/1x03.mkv`
- `...`
- `series/Shingeki no Kyojin/2x01.mkv`
- `series/Shingeki no Kyojin/2x02.mkv`
- `...`
- `series/Sword Art Online`
- `series/Sword Art Online/Episode 1x01.mkv`
- `series/Sword Art Online/Episode 1x02.mkv`
- `series/Sword Art Online/Episode 1x03.mkv`
- `...`

### Run
```
python3 main.py
```

## How does it work?
1. Extract the audio and subtitle tracks from the anime episodes.
2. Using timing information from the subtitles, extract the audio segment associated with each subtitle, so that we have a list of (subtitle text, audio segment) tuples for all the subtitles in a series.
3. Extract the x-vector associated with each audio segment. This is done by computing a set of acoustic features from the segment and feeding them to a neural network that outputs an embedding called an x-vector. The original research behind this can be found in [1]. We use the Kaldi implementation with a pretrained DNN[2] that was trained on the VoxCeleb datasets[3][4].
4. Use DBSCAN[5] to cluster the x-vectors, grouping together the audio segments spoken by the same speaker. After this step we should have a list of (subtitle text, speaker) tuples for all subtitles. (A sketch of this step is included below, after the **Why?** section.)
5. Select the speaker or set of speakers to mimic via user input (also sketched below).
6. Train a seq2seq[6] model on that speaker's lines.
7. Serve the chatbot as a Discord bot.

## Why?
> No waifu no laifu

- Adolf Hitler
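
## Sketch: clustering the x-vectors (step 4)

The clustering step is still a `TODO` in `main.py`. The snippet below is a minimal sketch of how it could look using scikit-learn's `DBSCAN` on the x-vectors stored in `data.json`. It assumes numpy and scikit-learn as extra dependencies (they are not in `requirements.txt`), and the `eps` / `min_samples` values are placeholders that would need tuning.

```
# Minimal sketch, not part of main.py yet: cluster the x-vectors in data.json so that
# segments spoken by the same character end up sharing a "speaker" label.
# Assumes numpy and scikit-learn are installed (not listed in requirements.txt).
import json

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

with open("data.json") as f:
    data = json.load(f)

for serie, lines in data.items():
    # Keep only the lines that actually received an x-vector.
    indexed = [(i, l["xvector"]) for i, l in enumerate(lines) if "xvector" in l]
    if not indexed:
        continue
    ids, vectors = zip(*indexed)
    # Length-normalize so that euclidean distances behave like cosine distances.
    X = normalize(np.array(vectors))
    # eps and min_samples are placeholder values and would need tuning per series.
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
    for i, label in zip(ids, labels):
        # DBSCAN labels outliers as -1; leave those segments without a speaker.
        lines[i]["speaker"] = int(label) if label != -1 else None

with open("data.json", "w") as f:
    json.dump(data, f)
```

DBSCAN fits this problem because the number of speakers in a series is not known in advance; `eps` controls how far apart two x-vectors can be while still being treated as the same speaker, and segments labelled `-1` are left unassigned.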

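## Sketch: from clusters to training pairs (steps 5 and 6)

Steps 5 and 6 are not implemented yet either. As a rough illustration only (not the project's actual implementation), once `data.json` has speaker labels, the chosen character's lines could be turned into (input, reply) pairs for a seq2seq model like this; the cluster id is assumed to come from user input, and `pairs.json` is a hypothetical output file.

```
# Rough illustration, not the project's implementation: build (input, reply) pairs for a
# seq2seq model from the lines of one chosen character in data.json.
import json

with open("data.json") as f:
    data = json.load(f)

serie = input("Series name: ")
waifu = int(input("Cluster id of the character to mimic: "))

lines = data[serie]
pairs = []
for prev, cur in zip(lines, lines[1:]):
    # Treat a line by the chosen character as a reply to the subtitle right before it,
    # as long as that previous line was said by someone else.
    if cur.get("speaker") == waifu and prev.get("speaker") != waifu:
        pairs.append({"input": prev["text"], "reply": cur["text"]})

# pairs.json is a hypothetical output file name.
with open("pairs.json", "w") as f:
    json.dump(pairs, f)

print(len(pairs), "training pairs written to pairs.json")
```

A real implementation would also strip ASS override tags from the subtitle text (e.g. via pysubs2's `plaintext`) and probably merge consecutive lines by the same speaker before pairing them up.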

## References
[1] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D. and Khudanpur, S. 2018. X-vectors: Robust DNN Embeddings for Speaker Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] http://kaldi-asr.org/models/m7 (VoxCeleb Xvector System 1a), retrieved 19 December 2018.
[3] Nagrani, A., Chung, J. S. and Zisserman, A. 2017. VoxCeleb: a large-scale speaker identification dataset. INTERSPEECH.
[4] Chung, J. S., Nagrani, A. and Zisserman, A. 2018. VoxCeleb2: Deep Speaker Recognition. INTERSPEECH.
[5] Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD).
[6] Sutskever, I., Vinyals, O. and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems (NIPS).
--------------------------------------------------------------------------------