├── LICENSE.md
├── README.md
├── align_db.py
└── emov_mfa_alignment.py

/LICENSE.md:
--------------------------------------------------------------------------------
1 | By downloading or using the EmoV-DB Dataset, you are agreeing to the “Non-commercial Purposes” condition. “Non-commercial Purposes” means research, teaching, scientific publication and personal experimentation. Non-commercial Purposes include use of the Dataset to perform benchmarking for purposes of academic or applied research publication. Non-commercial Purposes does not include purposes primarily intended for or directed towards commercial advantage or monetary compensation, or purposes intended for or directed towards litigation, licensing, or enforcement, even in part. The EmoV-DB dataset is provided as-is, is experimental in nature, and is not intended for use by, with, or for the diagnosis of human subjects, or for incorporation into a product.
2 |
3 | If you wish to make use of the EmoV-DB Dataset for Commercial Purposes, please contact Prof. Sarah Ostadabbas:
4 | https://coe.northeastern.edu/people/ostadabbas-sarah/
5 |
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
1 | # EmoV-DB
2 |
3 | # See also
4 | https://github.com/noetits/ICE-Talk for controllable TTS
5 |
6 | # How to use
7 | ## Download link
8 | Sorted version (recommended), **new link**:
9 | https://openslr.org/115/
10 |
11 | Old link (slow download), but it gives you the folder structure needed by the "load_emov_db()" function:
12 | https://mega.nz/#F!KBp32apT!gLIgyWf9iQ-yqnWFUFuUHg
13 |
14 | Unsorted version:
15 | http://www.coe.neu.edu/Research/AClab/Speech%20Data/
16 |
17 | ## Forced alignments
18 | "It is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in the speech segment."
[source](http://www.voxforge.org/home/docs/faq/faq/what-is-forced-alignment)
19 |
20 | It also makes it possible to separate verbal and non-verbal vocalizations (laughs, yawns, etc.) that occur before/after the sentence.
21 | Note that it might also be possible to detect non-verbal vocalizations inside sentences when they are not mixed with speech (e.g. a chuckle between words) using the "sil" or "spn" tokens of the Montreal Forced Aligner, but we have not experimented with this.
22 |
23 |
24 |
25 | ### Alignment with Montreal Forced Aligner (MFA)
26 | First, install MFA:
27 | + [Installation](https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html)
28 |
29 | Then follow the steps below, which are based on the instructions in [Phone alignment of a dataset](https://montreal-forced-aligner.readthedocs.io/en/latest/first_steps/index.html#first-steps-align-pretrained) and use their acoustic and g2p models.
30 | To use them, you need to download the models as described [here](https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/models/index.html). In this example we use `english_us_arpa`, but you could use their IPA model as well.
31 |
32 | In a Python terminal:
33 | ```python
34 | from emov_mfa_alignment import Emov
35 | dataset = Emov()
36 | dataset.download()
37 | dataset.prepare_mfa()
38 | ```
39 |
40 | Then in a shell terminal:
41 | ```sh
42 | mfa align EMOV-DB/ english_us_arpa english_us_arpa EMOV
43 | ```
44 |
45 | The `convert` function then removes non-verbal vocalizations occurring before/after the whole sentence. It reads the phone alignment results, extracts the start time of the first phoneme and the end time of the last phoneme, and uses them to cut and rewrite the audio.
46 |
47 | ```python
48 | from emov_mfa_alignment import Emov
49 | dataset = Emov()
50 | dataset.convert()
51 | ```
52 |
53 | ### Alignment with gentle
54 | An older alternative; its performance is expected to be worse than MFA's.
55 |
56 |
57 |
58 | 1. Go to https://github.com/lowerquality/gentle
59 | 2. Clone the repo.
60 | 3. In "Getting started", use the third option: `./install.sh`
61 | 4. Copy `align_db.py` into the repository.
62 | 5. In `align_db.py`, change the "path" variable so that it points to your EmoV-DB folder.
63 | 6. Run `python align_db.py`. You will probably have to install some packages to make it work.
64 | 7. This should create a folder called "alignments" in the repo, mirroring the structure of the database and containing a JSON file for each sentence of the database.
65 |
66 | 8. The function `get_start_end_from_json(path)` extracts the start and end times of the computed forced alignment.
67 | 9. You can play a file with `play(path)`.
68 | 10. You can play the part of the file that contains speech, according to the forced alignment, with `play_start_end(path, start, end)`.
69 |
70 |
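Conceptually, steps 8 to 10 above amount to slicing the loaded waveform between the aligned start and end times. A minimal, self-contained sketch of that idea (the `trim_speech` helper and the dummy signal are illustrative only, not part of this repo):

```python
import numpy as np

def trim_speech(y, fs, start, end):
    """Return the part of waveform y between start and end (in seconds)."""
    return y[int(start * fs):int(end * fs)]

fs = 16000                              # sample rate in Hz (demo assumption)
y = np.zeros(2 * fs, dtype=np.float32)  # 2-second dummy waveform
segment = trim_speech(y, fs, 0.5, 1.5)  # keep only the 0.5 s .. 1.5 s region
print(len(segment))                     # 16000 samples, i.e. exactly 1 second
```

With real data, `y` and `fs` would come from `librosa.load(path)` and `start`/`end` from `get_start_end_from_json(json_path)`.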
71 |
72 | # Overview of data
73 |
74 | The Emotional Voices Database: Towards Controlling the Emotional Expressiveness in Voice Generation Systems
75 |
76 | - This dataset is built for the purpose of emotional speech synthesis. The transcripts are based on the CMU Arctic database: http://www.festvox.org/cmu_arctic/cmuarctic.data.
77 |
78 | - It includes recordings of four speakers: two males and two females.
79 |
80 | - The emotional styles are neutral, sleepy, angry, disgusted and amused.
81 |
82 | - Each audio file is recorded in 16-bit .wav format.
83 |
84 | - Spk-Je (Female, English: Neutral (417 files), Amused (222 files), Angry (523 files), Sleepy (466 files), Disgust (189 files))
85 | - Spk-Bea (Female, English: Neutral (373 files), Amused (309 files), Angry (317 files), Sleepy (520 files), Disgust (347 files))
86 | - Spk-Sa (Male, English: Neutral (493 files), Amused (501 files), Angry (468 files), Sleepy (495 files), Disgust (497 files))
87 | - Spk-Jsh (Male, English: Neutral (302 files), Amused (298 files), Sleepy (263 files))
88 |
89 | - File naming (audio folder): in `anger_1-28_0011.wav`, the first word is the emotion style, "1-28" is the annotation doc file range, and the last four digits are the sentence number.
90 |
91 | - File naming (annotation folder): in `anger_1-28.TextGrid`, the first word is the emotion style and "1-28" is the annotation doc range.
92 |
93 | # References
94 | A description of the database is available here:
95 | https://arxiv.org/pdf/1806.09514.pdf
96 |
97 | Please reference this paper when using this database:
98 |
99 | BibTeX:
100 | ```
101 | @article{adigwe2018emotional,
102 |   title={The emotional voices database: Towards controlling the emotion dimension in voice generation systems},
103 |   author={Adigwe, Adaeze and Tits, No{\'e} and Haddad, Kevin El and Ostadabbas, Sarah and Dutoit, Thierry},
104 |   journal={arXiv preprint arXiv:1806.09514},
105 |   year={2018}
106 | }
107 | ```
108 |
109 |
--------------------------------------------------------------------------------

/align_db.py:
--------------------------------------------------------------------------------
1 | import os
2 | import gentle
3 | import pandas as pd
4 | import codecs
5 | import logging
6 |
7 | def on_progress(p):
8 |     for k, v in p.items():
9 |         logging.debug("%s: %s" % (k, v))
10 |
11 | # DOWNLOAD THE DB AND CHANGE THIS PATH
12 | path = 'path/to/EmoV-DB_sorted/'
13 |
14 |
15 |
16 | resources = gentle.Resources()
17 |
18 | def load_emov_db(path_to_EmoV_DB):
19 |     transcript = os.path.join(path_to_EmoV_DB, 'cmuarctic.data')
20 |     lines = codecs.open(transcript, 'r', 'utf-8').readlines()
21 |
22 |     # In our database we only use files beginning with arctic_a, and their numbers correspond to the sentence numbers in cmuarctic.data.
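    # For reference, a cmuarctic.data line looks like this (representative
    # example; the exact text may differ):
    #   ( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )
    # so the four characters after 'arctic_a' are the sentence number, and the
    # quoted part is the transcription that the slicing below extracts.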
23 |     # Here we build a dataframe with the number and text of each of these lines
24 |     sentences = []
25 |     for line in lines:
26 |         temp = {}
27 |         idx_n_0 = line.find('arctic_a') + len('arctic_a')
28 |         if line.find('arctic_a') != -1:
29 |             print(line)
30 |             print(idx_n_0)
31 |             idx_n_end = idx_n_0 + 4
32 |             number = line[idx_n_0:idx_n_end]
33 |             print(number)
34 |             temp['n'] = number
35 |             idx_text_0 = idx_n_end + 2
36 |             text = line.strip()[idx_text_0:-3]
37 |             temp['text'] = text
38 |             # print(text)
39 |             sentences.append(temp)
40 |     sentences = pd.DataFrame(sentences)
41 |
42 |     print(sentences)
43 |     speakers = next(os.walk(path_to_EmoV_DB))[1]  # lists directories (not files, unlike os.listdir())
44 |
45 |     data = []
46 |
47 |     for spk in speakers:
48 |
49 |         emo_cat = next(os.walk(os.path.join(path_to_EmoV_DB, spk)))[1]  # lists directories (not files, unlike os.listdir())
50 |
51 |         for emo in emo_cat:
52 |             for file in os.listdir(os.path.join(path_to_EmoV_DB, spk, emo)):
53 |                 print(file)
54 |                 fpath = os.path.join(path_to_EmoV_DB, spk, emo, file)
55 |
56 |                 if file[-4:] == '.wav':
57 |                     fnumber = file[-8:-4]
58 |                     print(fnumber)
59 |                     if fnumber.isdigit():
60 |                         text = sentences[sentences['n'] == fnumber]['text'].iloc[0]  # result must be a string, not a single-element DataFrame
61 |                         # text_lengths.append(len(text))
62 |                         # texts.append(text)
63 |                         # texts.append(np.array(text, np.int32).tostring())
64 |                         # fpaths.append(fpath)
65 |                         # emo_cats.append(emo)
66 |
67 |                         e = {'database': 'EmoV-DB',
68 |                              'id': file[:-4],
69 |                              'speaker': spk,
70 |                              'emotion': emo,
71 |                              'transcription': text,
72 |                              'sentence_path': fpath}
73 |                         data.append(e)
74 |                         print(e)
75 |
76 |     data = pd.DataFrame.from_records(data)
77 |
78 |     return data
79 |
80 |
81 | def align_db(data):
82 |     import pathlib
83 |
84 |     for i, row in data.iterrows():
85 |         f = row.sentence_path
86 |         transcript = row.transcription
87 |         with gentle.resampled(f) as wavfile:
88 |             aligner = gentle.ForcedAligner(resources, transcript)
89 |             result = aligner.transcribe(wavfile, progress_cb=on_progress, logging=logging)
90 |             # os.system('python align.py ' + f + ' words.txt -o test.json')
91 |
92 |             output = os.path.join('alignments', '/'.join(f.split('/')[-4:]).split('.')[0] + '.json')
93 |             pathlib.Path('/'.join(output.split('/')[0:-1])).mkdir(parents=True, exist_ok=True)
94 |
95 |             fh = open(output, 'w')
96 |             fh.write(result.to_json(indent=2))
97 |             if output:
98 |                 logging.info("output written to %s" % (output))
99 |
100 |             fh.close()
101 |
102 |
103 | data = load_emov_db(path)
104 | align_db(data)
105 |
106 |
107 | def get_start_end_from_json(path):
108 |     a = pd.read_json(os.path.join('file://localhost', os.path.abspath(path)))
109 |     b = pd.DataFrame.from_records(a.words)
110 |
111 |     print('start:')
112 |     start = b.start[0]
113 |     print(start)
114 |
115 |     print('end:')
116 |     end = b.end.round(2).tolist()[-1]
117 |     print(end)
118 |
119 |     return start, end
120 |
121 |
122 | # path = 'alignments/EmoV-DB/bea/amused/amused_1-15_0001.json'
123 | # start, end = get_start_end_from_json(path)
124 |
125 | def play_start_end(path, start, end):
126 |     import sounddevice as sd
127 |
128 |     import librosa
129 |
130 |     y, fs = librosa.load(path)
131 |     sd.play(y[int(start * fs):int(end * fs)], fs)
132 |
133 | def play(path):
134 |     import sounddevice as sd
135 |
136 |     import librosa
137 |
138 |     y, fs = librosa.load(path)
139 |     sd.play(y, fs)
--------------------------------------------------------------------------------

/emov_mfa_alignment.py:
--------------------------------------------------------------------------------
1 | import os
2 | import shutil
3 | import requests
4 | import tarfile
5 |
6 | import textgrid
7 | import pandas as pd
8 | import librosa
9 | import numpy as np
10 | from scipy.io import wavfile
11 |
12 | class Emov:
13 |     def __init__(self):
14 |         pass
15 |
16 |     def get_all_phone_with_timings(self, f='/home/weili/data/EMOV/1/amused_1-15_0001.TextGrid'):
17 |         """Get all phonemes of a sentence (located in tg[1]), filter out silence and
empty parts, then convert to a DataFrame
18 |         """
19 |         tg = textgrid.TextGrid.fromFile(f)
20 |         # get the phones and drop "sp", "sil", "spn" and empty marks
21 |         phones = [[el.minTime, el.maxTime, el.mark] for el in tg[1] if el.mark not in ['sil', 'sp', '', 'spn']]
22 |         phones = pd.DataFrame(phones)
23 |         phones.columns = ["start", "end", "phone"]
24 |         return phones
25 |
26 |     def convert(self):
27 |         for speaker in range(1, 5):
28 |             speaker_path = os.path.join("EMOV-DB", str(speaker))
29 |             for audio in os.listdir(speaker_path):
30 |                 if audio[-4:] == ".wav":
31 |                     audio_path = os.path.join(speaker_path, audio)
32 |                     y, sr = librosa.load(audio_path)
33 |                     textgrid_path = audio_path.replace("EMOV-DB", "EMOV").replace(".wav", ".TextGrid")
34 |                     if os.path.exists(textgrid_path):
35 |                         p = self.get_all_phone_with_timings(f=textgrid_path)
36 |                     else:
37 |                         # wav file has no matching TextGrid; skip it
38 |                         continue
39 |
40 |                     speech_segs = np.array([])
41 |
42 |                     for interval in p.values:
43 |                         speech_seg = y[int(interval[0] * sr): int(interval[1] * sr)]
44 |                         speech_segs = np.append(speech_segs, speech_seg)
45 |
46 |                     wavfile.write(textgrid_path.replace(".TextGrid", ".wav"), sr, speech_segs.astype(np.float32))  # cast back to float32 (np.append promotes to float64)
47 |
48 |
49 |     def prepare_mfa(self, clean=False):
50 |         def remove_punct(string):
51 |             punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
52 |             for x in string.lower():
53 |                 if x in punctuations:
54 |                     string = string.replace(x, " ")
55 |
56 |             return string.lower()
57 |         # create a .lab text file with the same basename as each wav file
58 |
59 |         # 1. read transcripts
60 |         with open("EMOV-DB/cmuarctic.data", "r") as rf:
61 |             lines = rf.readlines()
62 |
63 |         label_to_transcript = {}
64 |
65 |         for line in lines:
66 |             line = line.split('"')
67 |             sent = line[1]
68 |             label = line[0].rstrip().split('_')[-1]
69 |             if label[0] == "b":
70 |                 continue
71 |             label = label[1:]  # keep only arctic_a sentences and drop the leading "a"
72 |             sent = remove_punct(sent)  # remove punct
73 |             sent = sent.replace("1908", "nineteen o eight")
74 |             sent = sent.replace("18", "eighteen")
75 |             sent = sent.replace("16", "sixteen")
76 |             sent = sent.replace("nightglow", "night glow")
77 |             sent = sent.replace("mr ", "mister ")
78 |             sent = sent.replace("mrs ", "misses ")
79 |             sent = sent.replace("  ", " ")  # collapse double spaces
80 |             label_to_transcript[label] = sent
81 |
82 |         # 2. scan wav files and create text files
83 |         for speaker in range(1, 5):
84 |             speaker_path = os.path.join("EMOV-DB", str(speaker))
85 |             # for emotion in os.listdir(speaker_path):
86 |             #     emotion_path = os.path.join(speaker_path, emotion)
87 |             for audio in os.listdir(speaker_path):
88 |                 if audio[-4:] == ".wav":
89 |                     textfile = audio[:-4] + ".lab"
90 |                     label = audio.split('_')[-1].split('.')[0]
91 |                     transcript = label_to_transcript[label]
92 |                     if clean:
93 |                         os.remove(os.path.join(speaker_path, textfile))
94 |                     else:
95 |                         with open(os.path.join(speaker_path, textfile), 'w') as wf:
96 |                             wf.write(transcript)
97 |
98 |
99 |     def download(self):
100 |         download_links = [
101 |             "https://www.openslr.org/resources/115/bea_Amused.tar.gz",
102 |             "https://www.openslr.org/resources/115/bea_Angry.tar.gz",
103 |             "https://www.openslr.org/resources/115/bea_Disgusted.tar.gz",
104 |             "https://www.openslr.org/resources/115/bea_Neutral.tar.gz",
105 |             "https://www.openslr.org/resources/115/bea_Sleepy.tar.gz",
106 |
107 |             "https://www.openslr.org/resources/115/jenie_Amused.tar.gz",
108 |             "https://www.openslr.org/resources/115/jenie_Angry.tar.gz",
109 |             "https://www.openslr.org/resources/115/jenie_Disgusted.tar.gz",
110 |             "https://www.openslr.org/resources/115/jenie_Neutral.tar.gz",
111 |             "https://www.openslr.org/resources/115/jenie_Sleepy.tar.gz",
112 |
113 |             "https://www.openslr.org/resources/115/josh_Amused.tar.gz",
114 |             "https://www.openslr.org/resources/115/josh_Neutral.tar.gz",
115 |             "https://www.openslr.org/resources/115/josh_Sleepy.tar.gz",
116 |
117 |             "https://www.openslr.org/resources/115/sam_Amused.tar.gz",
118 |             "https://www.openslr.org/resources/115/sam_Angry.tar.gz",
119 |             "https://www.openslr.org/resources/115/sam_Disgusted.tar.gz",
120 |             "https://www.openslr.org/resources/115/sam_Neutral.tar.gz",
121 |             "https://www.openslr.org/resources/115/sam_Sleepy.tar.gz",
122 |
123 |             "http://www.festvox.org/cmu_arctic/cmuarctic.data"
124 |         ]
125 |
126 |         target_directories = [
127 |
128 |             "EMOV-DB/1",
129 |             "EMOV-DB/1",
130 |             "EMOV-DB/1",
131 |             "EMOV-DB/1",
132 |             "EMOV-DB/1",
133 |
134 |             "EMOV-DB/2",
135 |             "EMOV-DB/2",
136 |             "EMOV-DB/2",
137 |             "EMOV-DB/2",
138 |             "EMOV-DB/2",
139 |
140 |             "EMOV-DB/3",
141 |             "EMOV-DB/3",
142 |             "EMOV-DB/3",
143 |
144 |             "EMOV-DB/4",
145 |             "EMOV-DB/4",
146 |             "EMOV-DB/4",
147 |             "EMOV-DB/4",
148 |             "EMOV-DB/4",
149 |
150 |             "EMOV-DB"
151 |         ]
152 |
153 |         for directory in target_directories:
154 |             os.makedirs(directory, exist_ok=True)
155 |
156 |         for link, target_directory in zip(download_links, target_directories):
157 |             filename = os.path.basename(link)
158 |             file_path = os.path.join(target_directory, filename)
159 |
160 |             response = requests.get(link, stream=True)
161 |             if response.status_code == 200:
162 |                 with open(file_path, 'wb') as file:
163 |                     for chunk in response.iter_content(1024):
164 |                         file.write(chunk)
165 |                 print(f"download succeeded: {link}")
166 |
167 |                 if filename[-5:] != ".data":
168 |                     with tarfile.open(file_path, 'r:gz') as tar:
169 |                         tar.extractall(path=target_directory)
170 |                     os.remove(file_path)
171 |             else:
172 |                 print(f"download failed: {link}")
173 |
--------------------------------------------------------------------------------