├── .github
│   └── ISSUE_TEMPLATE
│       └── bug_report.md
├── .gitignore
├── LICENSE.md
├── README.md
├── example.py
├── formantfeatures.code-workspace
├── formantfeatures.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   ├── requires.txt
│   └── top_level.txt
├── formantfeatures
│   ├── FormantsExtract.py
│   ├── FormatsHDFread.py
│   └── __init__.py
├── setup.py
└── test_1.wav
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
 - OS: [e.g. iOS]
 - Browser [e.g. chrome, safari]
 - Version [e.g. 22]

**Smartphone (please complete the following information):**
 - Device: [e.g. iPhone6]
 - OS: [e.g. iOS8.1]
 - Browser [e.g. stock browser, safari]
 - Version [e.g. 22]

**Additional context**
Add any other context about the problem here.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
Archive/
cache/
__pycache__/
.vscode
build/
dist/
*.code-workspace
.pypirc
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright © 2020 Tabahi Abdul Rehman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Formant characteristic features extraction

Extract the frequency, power, width and dissonance of formants from a WAV file. These formant features can be used for speech recognition or music analysis.

## Dependencies

+ Python 3.7 or later
+ Numpy 1.16 or later
+ [Scipy v1.3.1](https://scipy.org/install.html)
+ [H5py v2.9.0](https://pypi.org/project/h5py/)
+ [Numba (v0.45.1)](https://numba.pydata.org/numba-doc/dev/user/installing.html)
+ [Wavio v0.0.4](https://pypi.org/project/wavio/)

> Install: `pip install formantfeatures`
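
A minimal quick start (a sketch assuming the `test_1.wav` sample from this repository is in the working directory; everything except the file path uses the defaults documented below):

```python
import numpy as np
import formantfeatures as ff

# Extract per-frame formant features with the default settings
features, frame_count, signal_len, trimmed_len = ff.Extract_wav_file_formants("test_1.wav")

print(features.shape)  # (400, 12): max_frames x (4 features x 3 formants)
print(frame_count)     # number of frames actually filled
```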

---------

## Get formant characteristics from a single file

`Extract_wav_file_formants`
--------------------------------

```python
import formantfeatures as ff

formants_features, frame_count, signal_length, trimmed_length = ff.Extract_wav_file_formants(wav_file_path, window_length, window_step, emphasize_ratio, norm=0, f0_min=f0_min, f0_max=f0_max, max_frames=max_frames, formants=max_formants)
```

### Parameters

>`wav_file_path`: string. Path of the input wav audio file.

>`window_length`: float, optional (default=0.025). Frame window size in seconds.

>`window_step`: float, optional (default=0.010). Frame window step size in seconds.

>`emphasize_ratio`: float, optional (default=0.7). Amplitude-increasing factor for pre-emphasis of higher frequencies, so that high frequencies end up with an amplitude comparable to the low frequencies.

>`norm`: int, optional (default=0). Enable or disable normalization of Mel-filters.

>`f0_min`: int, optional (default=30). Lower frequency bound, in Hertz.

>`f0_max`: int, optional (default=4000). Upper frequency bound, in Hertz.

>`max_frames`: int, optional (default=400). Cut-off size for the number of frames per clip. It is used to standardize the size of clips during processing. If the clip is shorter than this, the remaining frames are filled with zeros.

>`formants`: int, optional (default=3). Number of formants to extract.

>`formant_decay`: float, optional (default=0.5). Decay constant to exponentially decrease feature values by their formant amplitude ranks.

### Returns

returns `frames_features, frame_count, signal_length, trimmed_length`

>`frames_features`: array-like, `np.array((max_frames, num_of_features*formants), dtype=np.uint16)`. If `formants=3` then `frames_features` is a numpy array of shape (max_frames, 12), holding 12 features for each 0.025 s frame of the WAV file. The frame size can be adjusted; the recommended size is 0.025 s.
The 12 features (frequency, power, width and dissonance of the top 3 formants) are at the following indices of the numpy array:

Indices | Description
------------ | -------------
`frames_features[frame, 0]`| frequency of formant 0
`frames_features[frame, 1]`| power of formant 0
`frames_features[frame, 2]`| width of formant 0
`frames_features[frame, 3]`| dissonance of formant 0
`frames_features[frame, 4]`| frequency of formant 1
`frames_features[frame, 5]`| power of formant 1
`frames_features[frame, 6]`| width of formant 1
`frames_features[frame, 7]`| dissonance of formant 1
`frames_features[frame, 8]`| frequency of formant 2
`frames_features[frame, 9]`| power of formant 2
`frames_features[frame, 10]`| width of formant 2
`frames_features[frame, 11]`| dissonance of formant 2
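
The four descriptors of formant `k` can also be sliced in one step (a small sketch; `frame` stands for any frame index below `frame_count`):

```python
k = 1  # formant index: 0, 1 or 2
freq_k, power_k, width_k, dissonance_k = frames_features[frame, k*4 : (k*4)+4]
```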

>`frame_count`: int, number of filled frames (out of `max_frames`). It is the number of non-zero frames starting from index 0.

>`signal_length`: float, raw signal length in seconds.

>`trimmed_length`: float, trimmed length in seconds. Silence at the beginning and end of the input signal is trimmed before processing; `trimmed_length` is the duration that remains.

Note: frequencies are not on the Hertz or Mel scale. Instead, a disproportionate log scaling is applied to all features, which puts each feature on a scale of its own. An example of converting frequencies back to Hz can be seen in `save_features_stats` in `FormatsHDFread.py`, and in the sketch below.
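
For instance, the scaled frequency maps back to Hertz as follows (a sketch assuming the default `formant_decay=0.5`; values are stored as rounded `uint16`, so the recovery is approximate, and `0` marks a missing formant):

```python
import numpy as np

def scaled_freq_to_hz(scaled_freq, k, formant_decay=0.5):
    """Invert the log scaling applied to the frequency of formant k."""
    decay_rate = formant_decay ** k
    return np.exp(scaled_freq / (200 * decay_rate))

hz_f0 = scaled_freq_to_hz(frames_features[50, 0], k=0)  # formant 0, frame 50
```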

## Example

Example code is given in `example.py`.
It extracts the 12 formant features for each frame of the test wav file (`test_1.wav` yields 383 frames with a 25 ms window at a 10 ms stride) by calling `ff.Extract_wav_file_formants` with `max_frames=500`.

The `formants_features` array of size (500, 12) is returned by the function `formantfeatures.Extract_wav_file_formants`, in which 500 is the maximum number of frames but only `frame_count` frames are actually used.

Then we calculate the mean frequency, power, width and dissonance of the first 3 formants across the 383 frames.

The 12 formant features of each individual frame can be accessed as `formants_features[i, j]`, where `i` is the frame number out of `frame_count` (383 in this example), and `j` is the feature index out of the 12 features (0 for the 1st formant's frequency).

To calculate the mean of the first formant's frequency across all used frames (383 frames are used out of the maximum 500):
```python
first_formant_freq_mean = np.mean(formants_features[0:frame_count, 0])
# 0:frame_count gives the range of used frames out of the total 500 frames. The '0' is the index of the 1st formant's frequency in the feature list.

# Similarly, the power (index is '1'):
first_formant_power_mean = np.mean(formants_features[0:frame_count, 1])

# For the frequency of the 2nd formant (index is '4'; see the list of indices given above):
second_formant_freq_mean = np.mean(formants_features[0:frame_count, 4])

# To get features of individual frames (without mean):
first_freq_of_frame_50 = formants_features[50, 0]   # frequency of 1st formant of frame 50
first_width_of_frame_50 = formants_features[50, 2]  # width of 1st formant of frame 50
```

Output of `example.py`:
```
formants_features max_frames: 500 features count: 12 frame_count 383
Formant 0 Mean freq: 1174.3315926892951
Formant 0 Mean power: 448.1566579634465
Formant 0 Mean width: 46.30548302872063
Formant 0 Mean dissonance: 5.169712793733681
Formant 1 Mean freq: 579.9373368146214
Formant 1 Mean power: 188.7859007832898
Formant 1 Mean width: 12.459530026109661
Formant 1 Mean dissonance: 2.2323759791122715
Formant 2 Mean freq: 268.45430809399477
Formant 2 Mean power: 79.54830287206266
Formant 2 Mean width: 3.8929503916449084
Formant 2 Mean dissonance: 1.0783289817232375
Done

```

## Bulk processing

Pass a list of clip objects (see the `Clip_file_Class` sketch below) and the path of an HDF file in which to save the extracted features:

`Extract_files_formant_features`
--------------------------------

```python
import formantfeatures as ff

ff.Extract_files_formant_features(array_of_clips, features_save_file, window_length=0.025, window_step=0.010, emphasize_ratio=0.7, f0_min=30, f0_max=4000, max_frames=400, formants=3)
```

### Parameters

`array_of_clips`: list of `Clip_file_Class` objects from 'SER_DB.py'.

`features_save_file`: string, path of the HDF file where the extracted features will be stored.
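
`SER_DB.py` is not included in this package, so any object that exposes the attributes read by `Extract_files_formant_features` will do. A hypothetical stand-in (attribute names taken from the code in `FormantsExtract.py`):

```python
class Clip_file_Class:
    """Hypothetical stand-in for the clip class from 'SER_DB.py'; only these
    attributes are accessed by Extract_files_formant_features."""
    def __init__(self, filepath, speaker_id, accent, sex, emotion,
                 intensity=1, statement=1, repetition=1):
        self.filepath = filepath      # path of the wav file
        self.speaker_id = speaker_id  # int
        self.accent = accent          # int (e.g. RAVDESS: 1=speech, 2=song)
        self.sex = sex                # single character, e.g. 'M' or 'F'
        self.emotion = emotion        # single-character label, e.g. 'H'
        self.intensity = intensity    # int
        self.statement = statement    # int
        self.repetition = repetition  # int

array_of_clips = [Clip_file_Class("test_1.wav", speaker_id=1, accent=1, sex='F', emotion='H')]
```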

### Returns

`processed_clips`: int, number of successfully processed clips.

## Read HDF data files

HDF read functions such as `import_features_from_HDF` are imported from `FormatsHDFread`:

```python
import formantfeatures as ff

formant_features, labels, unique_speaker_ids, unique_classes = ff.import_features_from_HDF(storage_file, deselect_labels=['B', 'X'])
# Import while deselecting labels B (boring) and X (unknown)
```

Print label stats and save feature stats to file:

```python
ff.print_database_stats(labels)

ff.save_features_stats("DB_X", "csv_filename.csv", labels, formant_features)
```
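
`labels` has one row per clip; its column layout (see the `Ix` class in `FormatsHDFread.py`) can be read back as in this sketch:

```python
# Columns: 0 speaker_id, 1 accent, 2 sex (ord), 3 emotion (ord), 4 intensity,
# 5 statement, 6 repetition, 7 frame_count, 8 signal_len (ms), 9 trimmed_len (ms), 10 file_size (kB)
first_clip_emotion = chr(labels[0, 3])    # e.g. 'H'
first_clip_seconds = labels[0, 8] / 1000  # raw signal length in seconds
```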

------------------

## Citations

```tex
@article{LIU2021309,
title = {Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence},
journal = {Information Sciences},
volume = {563},
pages = {309-325},
year = {2021},
issn = {0020-0255},
doi = {https://doi.org/10.1016/j.ins.2021.02.016},
url = {https://www.sciencedirect.com/science/article/pii/S0020025521001584},
author = {Zhen-Tao Liu and Abdul Rehman and Min Wu and Wei-Hua Cao and Man Hao},
keywords = {Speech, Emotion recognition, Formants extraction, Phonemes, Clustering, Cross-corpus},
abstract = {Speech Emotion Recognition (SER) has numerous applications including human-robot interaction, online gaming, and health care assistance. While deep learning-based approaches achieve considerable precision, they often come with high computational and time costs. Indeed, feature learning strategies must search for important features in a large amount of speech data. In order to reduce these time and computational costs, we propose pre-processing step in which speech segments with similar formant characteristics are clustered together and labeled as the same phoneme. The phoneme occurrence rates in emotional utterances are then used as the input features for classifiers. Using six databases (EmoDB, RAVDESS, IEMOCAP, ShEMO, DEMoS and MSP-Improv) for evaluation, the level of accuracy is comparable to that of current state-of-the-art methods and the required training time was significantly reduced from hours to minutes.}
}
```

Paper: [Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence](https://www.sciencedirect.com/science/article/abs/pii/S0020025521001584)
--------------------------------------------------------------------------------
/example.py:
--------------------------------------------------------------------------------
'''
This example extracts 12 formant features for each frame (test_1.wav has 383 frames of a 25 ms window at a 10 ms stride).

The `formants_features` array of size (500, 12) is returned by the function `FormantsExtract.Extract_wav_file_formants`, in which 500 is the maximum number of frames but only `frame_count` frames are used.

Then we calculate the mean frequency, power, width and dissonance of the first 3 formants across the 383 frames.

The 12 formant features of each individual frame can be accessed as `formants_features[i, j]`, where `i` is the frame number out of `frame_count` (383 in this example), and `j` is the feature index out of the 12 features (0 for the 1st formant's frequency).

Output of `example.py`:
```
formants_features max_frames: 500 features count: 12 frame_count 383
Formant 0 Mean freq: 1174.3315926892951
Formant 0 Mean power: 448.1566579634465
Formant 0 Mean width: 46.30548302872063
Formant 0 Mean dissonance: 5.169712793733681
Formant 1 Mean freq: 579.9373368146214
Formant 1 Mean power: 188.7859007832898
Formant 1 Mean width: 12.459530026109661
Formant 1 Mean dissonance: 2.2323759791122715
Formant 2 Mean freq: 268.45430809399477
Formant 2 Mean power: 79.54830287206266
Formant 2 Mean width: 3.8929503916449084
Formant 2 Mean dissonance: 1.0783289817232375
Done

```
'''

import numpy as np
import formantfeatures as ff
import matplotlib.pyplot as plt


def main():

    test_wav = "test_1.wav"  # A sample from RAVDESS

    window_length = 0.025  # Keep it such that it's easier to differentiate syllables and remove pauses
    window_step = 0.010
    emphasize_ratio = 0.65
    f0_min = 30
    f0_max = 4000
    max_frames = 500
    max_formants = 3

    formants_features, frame_count, signal_length, trimmed_length = ff.Extract_wav_file_formants(test_wav, window_length, window_step, emphasize_ratio, norm=0, f0_min=f0_min, f0_max=f0_max, max_frames=max_frames, formants=max_formants)

    print("formants_features max_frames:", formants_features.shape[0], " features count:", formants_features.shape[1], "frame_count", frame_count)

    for formant in range(max_formants):
        print("Formant", formant, "Mean freq:", np.mean(formants_features[0:frame_count, (formant*4)+0]))
        print("Formant", formant, "Mean power:", np.mean(formants_features[0:frame_count, (formant*4)+1]))
        print("Formant", formant, "Mean width:", np.mean(formants_features[0:frame_count, (formant*4)+2]))
        print("Formant", formant, "Mean dissonance:", np.mean(formants_features[0:frame_count, (formant*4)+3]))

    x_axis_i = [*range(0, frame_count, 1)]

    colors = ['b', 'r', 'g']

    for formant in range(0, 1):
        formant_decay_rate = 0.5**(formant)

        log_scaled_freq = formants_features[0:frame_count, formant*4]

        Hz_freq = np.exp(log_scaled_freq / (200*formant_decay_rate))

        Hz_width = np.exp(np.log(Hz_freq) - formants_features[0:frame_count, (formant*4)+2] / (50 * formant_decay_rate))/4

        width_dn = Hz_freq - Hz_width
        width_up = Hz_freq + Hz_width

        plt.plot(x_axis_i, Hz_freq)
        plt.fill_between(x_axis_i, Hz_freq, width_dn, color=colors[formant], alpha=0.30)
        plt.fill_between(x_axis_i, Hz_freq, width_up, color=colors[formant], alpha=0.30)

    plt.tight_layout()
    plt.xlabel("frame")
    plt.ylabel("f")
    plt.title("freq")

    plt.show()

    print("Done")
    exit()


'''
Other functions:

# Pass a list of augmented DB objects (see SER_Datasets_Import) and the path of an HDF file in which to save the extracted features:

FormantsExtract.Extract_files_formant_features(array_of_clips, features_save_file, window_length=0.025, window_step=0.010, emphasize_ratio=0.7, norm=0, f0_min=30, f0_max=4000, max_frames=400, formants=3)

import formantfeatures.FormatsHDFread as FormatsHDFread

# Read extracted formants from HDF files:
formant_features, labels, unique_speaker_ids, unique_classes = FormatsHDFread.import_features_from_HDF(storage_file, deselect_labels=['B'])

FormatsHDFread.print_database_stats(labels)

FormatsHDFread.save_features_stats("DB_X", "csv_filename.csv", labels, formant_features)
'''


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/formantfeatures.code-workspace:
--------------------------------------------------------------------------------
{
    "folders": [
        {
            "path": "."
        }
    ]
}
--------------------------------------------------------------------------------
/formantfeatures.egg-info/PKG-INFO:
--------------------------------------------------------------------------------
Metadata-Version: 1.2
Name: formantfeatures
Version: 1.0.3
Summary: Extract formant characteristics from speech wav files.
Home-page: https://github.com/tabahi/formantfeatures
Author: Abdul Rehman
Author-email: alabdulrehman@hotmail.fr
License: MIT
Description: Please go to: https://github.com/tabahi/formantfeatures
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Developers
Requires-Python: >=3.7
--------------------------------------------------------------------------------
/formantfeatures.egg-info/SOURCES.txt:
--------------------------------------------------------------------------------
README.md
setup.py
formantfeatures/FormantsExtract.py
formantfeatures/FormatsHDFread.py
formantfeatures/__init__.py
formantfeatures.egg-info/PKG-INFO
formantfeatures.egg-info/SOURCES.txt
formantfeatures.egg-info/dependency_links.txt
formantfeatures.egg-info/requires.txt
formantfeatures.egg-info/top_level.txt
--------------------------------------------------------------------------------
/formantfeatures.egg-info/dependency_links.txt:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/formantfeatures.egg-info/requires.txt:
--------------------------------------------------------------------------------
numpy
scipy
h5py
numba
wavio
--------------------------------------------------------------------------------
/formantfeatures.egg-info/top_level.txt:
--------------------------------------------------------------------------------
formantfeatures
--------------------------------------------------------------------------------
/formantfeatures/FormantsExtract.py:
--------------------------------------------------------------------------------
"""
-----
Author: Abdul Rehman
License: The MIT License (MIT)
Copyright (c) 2020, Tabahi Abdul Rehman
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
"""
import numpy as np
from scipy import signal as signallib
from numba import jit  # install numba to speed up the execution
from wavio import read as wavio_read


@jit(nopython=True)
def get_lowest_positions(array_y, n_positions):
    order = array_y.argsort()
    ranks = order.argsort()  # ascending
    top_indexes = np.zeros((n_positions,), dtype=np.int16)
    #print(array_y)
    i = int(0)

    while(i < n_positions):
        itemindices = np.where(ranks==i)
        for itemindex in itemindices:
            if(itemindex.size):
                #print(i, array_y[itemindex], itemindex)
                top_indexes[i] = itemindex[0]
            else:  # for when positions are more than the array size
                itemindices2 = np.where(ranks==(array_y.size -1-i+ array_y.size))
                for itemindex2 in itemindices2:
                    #print(i, array_y[itemindex2], itemindex2)
                    top_indexes[i] = itemindex2[0]
        i += 1
    #print(array_y[top_indexes])
    return top_indexes


@jit(nopython=True)
def get_top_positions(array_y, n_positions):
    order = array_y.argsort()
    ranks = order.argsort()  # ascending
    top_indexes = np.zeros((n_positions,), dtype=np.int16)
    #print(array_y)
    i = int(n_positions - 1)

    while(i >= 0):
        itemindices = np.where(ranks==(len(array_y)-1-i))
        for itemindex in itemindices:
            if(itemindex.size):
                #print(i, array_y[itemindex], itemindex)
                top_indexes[i] = itemindex[0]
            else:  # for when positions are more than the array size
                itemindices2 = np.where(ranks==len(array_y)-1-i+len(array_y))
                for itemindex2 in itemindices2:
                    #print(i, array_y[itemindex2], itemindex2)
                    top_indexes[i] = itemindex2[0]
        i -= 1

    return top_indexes


def frame_segmentation(signal, sample_rate, window_length=0.040, window_step=0.020):

    # Framing
    frame_length, frame_step = window_length * sample_rate, window_step * sample_rate  # Convert from seconds to samples
    signal_length = len(signal)
    frame_length = int(round(frame_length))
    frame_step = int(round(frame_step))
    num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step))  # Make sure that we have at least 1 frame
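    # Example: with sample_rate=16000, window_length=0.025 and window_step=0.010,
    # frame_length=400 and frame_step=160 samples, so a 1 s clip (16000 samples)
    # yields ceil((16000 - 400) / 160) = 98 frames.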
    if(num_frames < 1):
        raise Exception("Clip length is too short. It should be at least " + str(window_length*2) + " seconds")

    pad_signal_length = num_frames * frame_step + frame_length
    z = np.zeros((pad_signal_length - signal_length))
    pad_signal = np.append(signal, z)  # Pad the signal to make sure that all frames have an equal number of samples without truncating any samples from the original signal

    indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
    frames = pad_signal[indices.astype(np.int32, copy=False)]

    # Hamming window
    frames *= np.hamming(frame_length)
    #frames *= 0.54 - 0.46 * numpy.cos((2 * numpy.pi * n) / (frame_length - 1))  # Explicit implementation
    #print(frames.shape)
    return frames, signal_length


def get_filter_banks(frames, sample_rate, f0_min=60, f0_max=4000, num_filt=128, norm=0):
    '''
    Fourier-Transform and Power Spectrum

    return filter_banks, hz_points

    filter_banks: array-like, shape = [n_frames, num_filt]

    hz_points: array-like, shape = [num_filt + 2], boundary and center frequencies of the mel-filters

    This code is from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
    Courtesy of Haytham Fayek
    '''

    NFFT = num_filt*32  # FFT bins (equally spaced, unlike the mel filters)
    mag_frames = np.absolute(np.fft.rfft(frames, NFFT))  # Magnitude of the FFT
    pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2))  # Power Spectrum

    # Filter banks
    nfilt = num_filt
    low_freq_mel = (2595 * np.log10(1 + (f0_min) / 700))
    high_freq_mel = (2595 * np.log10(1 + (f0_max) / 700))  # Convert Hz to Mel
    mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2)  # Equally spaced in Mel scale
    hz_points = (700 * (10**(mel_points / 2595) - 1))  # Convert Mel to Hz
    bin = np.floor((NFFT + 1) * hz_points / sample_rate)

    n_overlap = int(np.floor(NFFT / 2 + 1))
    fbank = np.zeros((nfilt, n_overlap))

    for m in range(1, nfilt + 1):
        f_m_minus = int(bin[m - 1])  # left
        f_m = int(bin[m])  # center
        f_m_plus = int(bin[m + 1])  # right

        for k in range(f_m_minus, f_m):
            fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
        for k in range(f_m, f_m_plus):
            fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
    filter_banks = np.dot(pow_frames, fbank.T)
    filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks)  # Numerical stability
    #filter_banks = 20 * np.log10(filter_banks)  # dB
    if(norm):
        filter_banks -= (np.mean(filter_banks))  # normalize

    return filter_banks, hz_points


freq, power, width, dissonance = 0, 1, 2, 3


def Extract_formant_descriptors(fft_x, fft_y, formants=2, f_min=30, f_max=4000):
    '''
    returns an array of shape ((formants*4,), dtype=np.uint64): frequency, power, width and dissonance of each formant
    '''

    len_of_x = len(fft_x)
    len_of_y = len(fft_y)

    # 4 features per formant
    returno = np.zeros((formants*4,), dtype=np.uint64)

    if(len_of_x!=len_of_y) or (len_of_x <= 3):
        #print("Empty Frame")
        return returno

    peak_indices = signallib.argrelextrema(fft_y, np.greater, mode='wrap')
    valley_indices = signallib.argrelextrema(fft_y, np.less, mode='wrap')
    peak_indices = peak_indices[0]
    valley_indices = valley_indices[0]
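    # Local maxima are the formant candidates; the surrounding local minima
    # (valleys) bound each peak below when measuring its width and dissonance.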
    peak_fft_x, peak_fft_y = fft_x[peak_indices], fft_y[peak_indices]
    valley_fft_x, valley_fft_y = fft_x[valley_indices], fft_y[valley_indices]

    len_of_peaks = len(peak_indices)
    if(len_of_peaks < 1) or (len(valley_indices) < 1):
        #print("Silence")
        return returno

    ground_level = 0
    if (len(valley_fft_y) > 1):
        ground_level = np.max(valley_fft_y)  # range(valleys_y)/2
    if(ground_level < 10):
        # Silence
        return returno

    # add extra valleys at the start and end
    if(peak_fft_x[0] < valley_fft_x[0]):
        valley_fft_x = np.append([f_min/2], valley_fft_x)
        valley_fft_y = np.append([ground_level/8], valley_fft_y)
    if(peak_fft_x[-1] > valley_fft_x[-1]):
        valley_fft_x = np.append(valley_fft_x, [f_max+f_min])
        valley_fft_y = np.append(valley_fft_y, [ground_level/8])

    top_peaks_n = formants*2
    # make sure the fft has enough points
    if(len(peak_fft_y) < (formants+1)):
        return returno
    if(len(peak_fft_y) < (top_peaks_n-1)):
        top_peaks_n = len(peak_fft_y) - 1

    tp_indexes = get_top_positions(peak_fft_y, top_peaks_n)  # descending
    dissonance_peak = np.zeros(top_peaks_n)
    biggest_peak_y = peak_fft_y[tp_indexes[0]]

    formants_detected = 0

    # calc width and dissonance
    for i in range(0, top_peaks_n):

        if(dissonance_peak[i]==0) and (peak_fft_y[tp_indexes[i]] > (biggest_peak_y/16)) and (peak_fft_x[tp_indexes[i]] >= f_min) and (peak_fft_x[tp_indexes[i]] <= f_max) and (formants_detected < formants):
            next_valley = np.min(np.where(valley_fft_x > peak_fft_x[tp_indexes[i]]))
            next_valley_x = valley_fft_x[next_valley]
            next_valley_y = valley_fft_y[next_valley]

            this_peak_gnd_thresh = peak_fft_y[tp_indexes[i]]/4

            while(next_valley_y > this_peak_gnd_thresh) and (len(np.where(valley_fft_x > next_valley_x)[0]) > 0):
                valley_next_peak_ind = np.where(peak_fft_x > next_valley_x)
                if(len(valley_next_peak_ind[0]) > 0):
                    valley_next_peak = np.min(valley_next_peak_ind)
                    if(peak_fft_y[tp_indexes[i]] > peak_fft_y[valley_next_peak]):
                        next_valley = np.min(np.where(valley_fft_x > next_valley_x))
                        next_valley_x = valley_fft_x[next_valley]
                        next_valley_y = valley_fft_y[next_valley]
                    else:
                        break
                else:
                    break

            prev_valley = np.max(np.where(valley_fft_x < peak_fft_x[tp_indexes[i]]))
            prev_valley_x = valley_fft_x[prev_valley]
            prev_valley_y = valley_fft_y[prev_valley]

            while(prev_valley_y > this_peak_gnd_thresh) and (len(np.where(valley_fft_x < prev_valley_x)[0]) > 0):
                valleys_prev_peak_ind = np.where(peak_fft_x < prev_valley_x)
                if(len(valleys_prev_peak_ind[0]) > 0):
                    valley_prev_peak = np.max(valleys_prev_peak_ind)
                    if(peak_fft_y[tp_indexes[i]] > peak_fft_y[valley_prev_peak]):
                        prev_valley = np.max(np.where(valley_fft_x < prev_valley_x))
                        prev_valley_x = valley_fft_x[prev_valley]
                        prev_valley_y = valley_fft_y[prev_valley]
                    else:
                        break
                else:
                    break

            dissonance_peak[i] = 1
            this_dissonance = 0
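            # Dissonance: sum the amplitudes of the other top peaks that fall inside
            # this peak's valley-to-valley span and are more than 2% away in frequency,
            # then normalize by this peak's own amplitude; closer peaks are merged in.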
            for k in range(0, top_peaks_n):
                if(peak_fft_x[tp_indexes[k]] < next_valley_x) and (peak_fft_x[tp_indexes[k]] > prev_valley_x) and k!=i:
                    dissonance_peak[k] = 1
                    if(np.abs(peak_fft_x[tp_indexes[k]] - peak_fft_x[tp_indexes[i]]) > (peak_fft_x[tp_indexes[i]]/50)):
                        this_dissonance += peak_fft_y[tp_indexes[k]]
                    else:
                        peak_fft_x[tp_indexes[i]] = (peak_fft_x[tp_indexes[i]]+peak_fft_x[tp_indexes[k]])/2
                        peak_fft_y[tp_indexes[i]] = (peak_fft_y[tp_indexes[i]]+peak_fft_y[tp_indexes[k]])/2

            this_dissonance = this_dissonance/peak_fft_y[tp_indexes[i]]
            this_width = np.log(next_valley_x)-np.log(prev_valley_x)

            returno[freq + (formants_detected*4)] = peak_fft_x[tp_indexes[i]]
            returno[power + (formants_detected*4)] = peak_fft_y[tp_indexes[i]]
            returno[width + (formants_detected*4)] = this_width*10
            returno[dissonance + (formants_detected*4)] = this_dissonance*10

            formants_detected += 1

    #plt.figure(1)
    #plt.plot(fft_x, fft_y)
    #plt.plot(peak_fft_x, peak_fft_y, marker='o', linestyle='dashed', color='green', label="Splits")
    #plt.plot(valley_fft_x, valley_fft_y, marker='o', linestyle='dashed', color='red', label="Splits")
    #plt.show()

    return returno


def Extract_wav_file_formants(wav_file_path, window_length=0.025, window_step=0.010, emphasize_ratio=0.7, norm=0, f0_min=30, f0_max=4000, max_frames=400, formants=3, formant_decay=0.5):
    '''
    Parameters
    ----------
    `wav_file_path`: string. Path of the input wav audio file;

    `window_length`: float, optional (default=0.025). Frame window size in seconds;

    `window_step`: float, optional (default=0.010). Frame window step size in seconds;

    `emphasize_ratio`: float, optional (default=0.7). Amplitude-increasing factor for pre-emphasis of higher frequencies, so that high frequencies end up with an amplitude comparable to the low frequencies;

    `norm`: int, optional (default=0). Enable or disable normalization of Mel-filters;

    `f0_min`: int, optional (default=30), Hertz;

    `f0_max`: int, optional (default=4000), Hertz;

    `max_frames`: int, optional (default=400). Cut-off size for the number of frames per clip. It is used to standardize the size of clips during processing;

    `formants`: int, optional (default=3). Number of formants to extract;

    `formant_decay`: float, optional (default=0.5). Decay constant to exponentially decrease feature values by their formant amplitude ranks;

    Returns
    -------
    returns `frames_features, frame_count, signal_length, trimmed_length`

    `frames_features`: array-like, `np.array((max_frames, num_of_features*formants), dtype=np.uint16)`;

    `frame_count`: int, number of filled frames (out of max_frames);

    `signal_length`: float, signal length in seconds;

    `trimmed_length`: float, trimmed length in seconds; silence at the beginning and end of the input signal is trimmed before processing;
    '''

    wav_data = wavio_read(wav_file_path)
    raw_signal = wav_data.data
    sample_rate = wav_data.rate

    #emphasize_ratio = 0.70
    signal_to_plot = np.append(raw_signal[0], raw_signal[1:] - emphasize_ratio * raw_signal[:-1])
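    # Pre-emphasis: y[t] = x[t] - emphasize_ratio * x[t-1], a first-order high-pass
    # filter that boosts high frequencies relative to low frequencies before framing.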
    #signal_to_plot = raw_signal

    num_filt = 256
    frames, signal_length = frame_segmentation(signal_to_plot, sample_rate, window_length=window_length, window_step=window_step)
    frames_filter_banks, hz_points = get_filter_banks(frames, sample_rate, f0_min=f0_min, f0_max=f0_max, num_filt=num_filt, norm=norm)

    # x-axis points of the triangular mel filters used
    #hz_bins_min = hz_points[0:num_filt]  # discarding the last 2 points
    hz_bins_mid = hz_points[1:num_filt+1]  # discarding the 1st and last point
    #hz_bins_max = hz_points[2:num_filt+2]  # discarding the first 2 points

    num_of_frames = frames_filter_banks.shape[0]

    #min_peaks_count = 2

    neighboring_frames = 2  # number of neighboring frames to compare
    if(num_of_frames < ((neighboring_frames*2)+1)):
        raise Exception("Not enough frames to compare harmonics. Need at least " + str((neighboring_frames*2)+1) + " frames. Frame count: " + str(num_of_frames))

    #formants = 2
    num_of_features = 4  # freq, power, width, dissonance
    formants_data = np.zeros((num_of_frames, num_of_features*formants), dtype=np.uint64)

    for frame_index in range(0, num_of_frames):

        # Find peaks (local maxima) in this frame's mel-filter spectrum.
        peak_indexes = signallib.argrelextrema(frames_filter_banks[frame_index], np.greater, mode='wrap')
        peak_indexes = peak_indexes[0]
        peak_fft_x, peak_fft_y = hz_bins_mid[peak_indexes], frames_filter_banks[frame_index][peak_indexes]

        formants_data[frame_index] = Extract_formant_descriptors(peak_fft_x, peak_fft_y, formants, f0_min, f0_max)

    # silence threshold: mean power of the 1st formant over non-zero frames, / 1000
    power_ground = int(np.mean(formants_data[:,power][np.where(formants_data[:,power] > 0)])/1000)
    if(power_ground < 1):
        power_ground = 1

    # trim silent ends
    first_frame, last_frame = 0, 0
    for i in range(0, num_of_frames):
        first_frame = i
        if(formants_data[i, power] > power_ground):
            break

    for i in range(0, num_of_frames):
        last_frame = num_of_frames - i - 1
        if(formants_data[last_frame, power] > power_ground):
            break

    #print(power_ground, num_of_frames, last_frame - first_frame)
    trimmed_length = ((last_frame - first_frame)/num_of_frames)*signal_length
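    # Scale features into uint16 range, attenuated by formant rank r (decay^r):
    #   power -> log10(power) * 100 * decay^r,  freq -> ln(freq) * 200 * decay^r,
    #   width -> width * 5 * decay^r,           dissonance -> dissonance * 10 * decay^r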
    for fr in range(0, num_of_frames):
        for i in range(0, formants):
            formant_decay_rate = formant_decay**(i)

            if(formants_data[fr, power + (i*num_of_features)] < 1):
                formants_data[fr, power + (i*num_of_features)] = 0
            else:
                formants_data[fr, power + (i*num_of_features)] = np.log10(formants_data[fr, power + (i*num_of_features)]) * 100 * formant_decay_rate

            if(formants_data[fr, freq + (i*num_of_features)] < f0_min):
                formants_data[fr, freq + (i*num_of_features)] = 0
            else:
                formants_data[fr, freq + (i*num_of_features)] = np.log(formants_data[fr, freq + (i*num_of_features)]) * 200 * formant_decay_rate

            formants_data[fr, width + (i*num_of_features)] = formants_data[fr, width + (i*num_of_features)] * 5 * formant_decay_rate
            formants_data[fr, dissonance + (i*num_of_features)] = formants_data[fr, dissonance + (i*num_of_features)] * 10 * formant_decay_rate
        #print(formants_data[fr])
    #exit()

    returno = np.zeros((max_frames, num_of_features*formants), dtype=np.uint16)
    frame_count = 0
    for i in range(0, max_frames):
        old_frame_i = first_frame+i
        returno[i] = formants_data[old_frame_i]
        frame_count = i
        if(i >= (last_frame - first_frame - 1)):
            break
        elif(i >= (max_frames-1)):
            print("Warning! Frame size overflow, Size:", (last_frame - first_frame), "Limit:", max_frames)
            break

    #print(frame_count, signal_length/sample_rate, trimmed_length/sample_rate)
    return returno, frame_count, signal_length/sample_rate, trimmed_length/sample_rate


def Extract_files_formant_features(array_of_clips, features_save_file, window_length=0.025, window_step=0.010, emphasize_ratio=0.7, norm=0, f0_min=30, f0_max=4000, max_frames=400, formants=3):
    '''
    Parameters
    ----------
    `array_of_clips`: list of Clip_file_Class objects from 'SER_DB.py';

    `features_save_file`: string, path of the HDF file where the extracted features will be stored;

    `window_length`: float, optional (default=0.025). Frame window size in seconds;

    `window_step`: float, optional (default=0.010). Frame window step size in seconds;

    `emphasize_ratio`: float, optional (default=0.7). Amplitude-increasing factor for pre-emphasis of higher frequencies, so that high frequencies end up with an amplitude comparable to the low frequencies;

    `norm`: int, optional (default=0). Enable or disable normalization of Mel-filters;

    `f0_min`: int, optional (default=30), Hertz;

    `f0_max`: int, optional (default=4000), Hertz;

    `max_frames`: int, optional (default=400). Cut-off size for the number of frames per clip. It is used to standardize the size of clips during processing;

    `formants`: int, optional (default=3). Number of formants to extract;

    Returns
    -------
    `processed_clips`: int, number of successfully processed clips;
    '''

    import os
    if(os.path.isfile(features_save_file)):
        print("Removing HDF")
        os.remove(features_save_file)

    total_clips = len(array_of_clips)
    processed_clips = 0

    import h5py
    with h5py.File(features_save_file, 'w') as hf:
        dset_label = hf.create_dataset('labels', (total_clips, 11), dtype='u2')
        dset_features = hf.create_dataset('features', (total_clips, max_frames, formants*4), dtype='u2')
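        # 'labels' holds one 11-column row per clip (column order matches class Ix
        # in FormatsHDFread.py); 'features' holds each clip's uint16 feature matrix.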
        print("Clip", "i", "of", "Total", "SpeakerID", "Accent", "Sex", "Emotion")
        for index, clip in enumerate(array_of_clips):
            try:
                print("Clip ", index+1, "of", total_clips, clip.speaker_id, clip.accent, clip.sex, clip.emotion)
                array_frames_by_features = np.zeros((max_frames, formants*4), dtype=np.uint16)
                #print(clip.filepath)
                array_frames_by_features, frame_count, signal_length, trimmed_length = Extract_wav_file_formants(clip.filepath, window_length, window_step, emphasize_ratio, norm, f0_min, f0_max, max_frames, formants)
                clipfile_size = int(os.path.getsize(clip.filepath)/1000)

                dset_features[index] = array_frames_by_features
                dset_label[index] = [clip.speaker_id, clip.accent, ord(clip.sex), ord(clip.emotion), int(clip.intensity), int(clip.statement), int(clip.repetition), int(frame_count), int(signal_length*1000), int(trimmed_length*1000), clipfile_size]
                processed_clips += 1
            except Exception as e:
                print(e)

    print("Read features of", total_clips, "clips")

    print("Closing HDF")
    return processed_clips
--------------------------------------------------------------------------------
/formantfeatures/FormatsHDFread.py:
--------------------------------------------------------------------------------
"""
-----
Author: Abdul Rehman
License: The MIT License (MIT)
Copyright (c) 2020, Tabahi Abdul Rehman
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
"""
import numpy as np


class Ix(object):
    '''
    Clip label indices for enumeration - Ignore
    '''
    speaker_id, accent, sex, emotion, intensity, statement, repetition, frame_count, signal_len, trimmed_len, file_size = 0,1,2,3,4,5,6,7,8,9,10


def print_database_stats(labels):

    print("Total clips", labels.shape[0])
    print("wav files size (MB)", round(np.sum(labels[:, Ix.file_size])/1000, 2))
    print("Total raw length (min)", round(np.sum(labels[:, Ix.signal_len])/60000, 2))
    print("Total trimmed length (min)", round(np.sum(labels[:, Ix.trimmed_len])/60000, 2))
    print("Avg raw length (s)", round(np.mean(labels[:, Ix.signal_len]/1000), 2))
    print("Avg trimmed length (s)", round(np.mean(labels[:, Ix.trimmed_len]/1000), 2))
    print("Avg. frame count", round(np.mean(labels[:, Ix.frame_count]), 2))
    print("Male Female Clips", np.where(labels[:, Ix.sex]==ord('M'))[0].size, np.where(labels[:, Ix.sex]==ord('F'))[0].size)

    unique_speaker_id = np.unique(labels[:, Ix.speaker_id])
    print("Unique speakers: ", len(unique_speaker_id))
    print("Speakers id: ", unique_speaker_id)

    unique_classes = np.unique(labels[:, Ix.emotion])
    print("Emotion classes: ", len(unique_classes))
    print("Unique emotions: ", [chr(x) for x in unique_classes])

    print("Emotion", "N clips", "Total(min)", "Trimmed(min)")
    for this_e in unique_classes:
        select_e = np.where(labels[:, Ix.emotion]==this_e)[0]
        print(chr(this_e), '\t', labels[select_e].shape[0], '\t', round(np.sum(labels[select_e, Ix.signal_len]/1000)/60, 2), '\t', round(np.sum(labels[select_e, Ix.trimmed_len]/1000)/60, 2))

    return len(unique_classes), len(unique_speaker_id)

def save_features_stats(db_name, csv_filename, labels, features):

    import csv
    with open(csv_filename, 'a') as csvFile:
        writer = csv.writer(csvFile, delimiter=',', lineterminator='\n')
        #writer.writerow(["Emotion", "Combination", "Occurrences"])
        writer.writerow(["DB", "Emotion", "N clips", "f0", "p0", "w0", "d0", "f1", "p1", "w1", "d1", "f2", "p2", "w2", "d2"])
        unique_classes = np.unique(labels[:, Ix.emotion])

        print("Mean Values")
        print("Emotion", "freq", "power", "width", "diss")
        for this_e in unique_classes:
            select_e = np.where(labels[:, Ix.emotion]==this_e)[0]

            clips_n = features[select_e].shape[0]
            e_fts = features[select_e]
            this_row = [db_name, str(chr(this_e)), str(clips_n)]
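            # Invert the scaling applied in FormantsExtract: frequency was stored
            # as ln(Hz) * 200 * decay^rank, so exp(value / (200 * decay^rank))
            # recovers Hz; power, width and dissonance are rescaled linearly.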
            for i in range(0, 3):
                formant_decay_rate = 0.5**i
                freq = int(np.mean(np.exp(e_fts[:, :, (i*4)][np.where(e_fts[:, :, (i*4)] > 0)] / (200*formant_decay_rate))))
                power = int(np.mean(e_fts[:, (i*4)+1][np.where(e_fts[:, (i*4)+1] > 0)]) / (100*formant_decay_rate) * 10)
                width = int(np.mean(e_fts[:, (i*4)+2][np.where(e_fts[:, (i*4)+2] > 0)]) / (5*formant_decay_rate))
                diss = int(np.mean(e_fts[:, (i*4)+3][np.where(e_fts[:, (i*4)+3] > 0)]) / (10*formant_decay_rate))

                this_row.append(str(freq))
                this_row.append(str(power))
                this_row.append(str(width))
                this_row.append(str(diss))
            print(chr(this_e), clips_n, freq, power, width, diss)

            writer.writerow(this_row)
    #print(1000, np.log(1000), np.exp(np.log(1000)))

    return


def import_features_from_HDF(storage_file, deselect_labels=None):
    # e.g. deselect_labels=['C', 'D', 'F', 'U']
    print("Reading dataset from file:", storage_file)
    import h5py
    hf = h5py.File(storage_file, 'r')
    lbl = np.array(hf.get('labels'))
    formant_features = np.array(hf.get('features'))

    conditions = (lbl[:, Ix.accent]==1)  # RAVDESS has 2 accents (1=speech, 2=song), select only speech.

    if(deselect_labels is not None) and (len(deselect_labels) > 0):
        for em in deselect_labels:
            conditions &= (lbl[:, Ix.emotion]!=ord(em))

    selected = np.where(conditions)
    lbl = lbl[selected]
    formant_features = formant_features[selected]

    if(lbl.shape[0]!=formant_features.shape[0]):
        raise Exception("Labels and Features samples size mismatch", lbl.shape[0], formant_features.shape[0])

    print("Clips count:", formant_features.shape[0])

    unique_speaker_id = np.unique(lbl[:, Ix.speaker_id])
    unique_classes = np.unique(lbl[:, Ix.emotion])

    return formant_features, lbl, unique_speaker_id, unique_classes


def import_mutiple_HDFs(storage_files, deselect_labels=['C', 'D', 'F', 'U', 'E', 'R', 'G', 'B']):

    import os.path as os_path
    import h5py
    print("Reading dataset from file:", storage_files)

    # check if features are already extracted
    if (os_path.isfile(storage_files[0])==False) or (int(os_path.getsize(storage_files[0])) < 8000):
        raise Exception("Formant features for this training set are not extracted yet. Call 'run_train_and_test' to extract formant features.")

    storage_file = storage_files[0]
    hf = h5py.File(storage_file, 'r')
    lbl = np.array(hf.get('labels'))
    formant_features = np.array(hf.get('features'))

    for sn in range(1, len(storage_files)):
        if (os_path.isfile(storage_files[sn])==False) or (int(os_path.getsize(storage_files[sn])) < 8000):
            raise Exception("Formant features for this training set are not extracted yet. Call 'run_train_and_test' to extract formant features.")

        storage_file = storage_files[sn]
        hf = h5py.File(storage_file, 'r')
        lbl = np.concatenate((lbl, np.array(hf.get('labels'))))
        formant_features = np.concatenate((formant_features, np.array(hf.get('features'))))

    conditions = (lbl[:, Ix.accent]==1)  # RAVDESS has 2 accents (1=speech, 2=song), select only speech.
    if(deselect_labels is not None):
        if(len(deselect_labels) > 0):
            for em in deselect_labels:
                conditions &= (lbl[:, Ix.emotion]!=ord(em))

    selected = np.where(conditions)
    lbl = lbl[selected]
    formant_features = formant_features[selected]

    if(lbl.shape[0]!=formant_features.shape[0]):
        raise Exception("Labels and Features samples size mismatch", lbl.shape[0], formant_features.shape[0])

    print("Clips count:", formant_features.shape[0])

    unique_speaker_id = np.unique(lbl[:, Ix.speaker_id])
    unique_classes = np.unique(lbl[:, Ix.emotion])

    return formant_features, lbl, unique_speaker_id, unique_classes


#if(db_name=="IEMOCAP"):
#    features, labels, u_speakers, u_classes = FormatsHDFread.import_features_from_HDF(features_HDF_file, deselect_labels=['D','F','U','E','R'])
# Deselect some labels from IEMOCAP because these emotions have very few samples.
--------------------------------------------------------------------------------
/formantfeatures/__init__.py:
--------------------------------------------------------------------------------

from .FormantsExtract import (
    Extract_files_formant_features, Extract_wav_file_formants,
)

from .FormatsHDFread import (
    import_features_from_HDF, import_mutiple_HDFs, save_features_stats, print_database_stats
)
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------

import setuptools


with open('README.md') as f:
    README = f.read()

setuptools.setup(
    author="Abdul Rehman",
    author_email="alabdulrehman@hotmail.fr",
    name='formantfeatures',
    license="MIT",
    description='Extract formant characteristics from speech wav files.',
    version='v1.0.3',
    long_description='Please go to: https://github.com/tabahi/formantfeatures',
    url='https://github.com/tabahi/formantfeatures',
    packages=setuptools.find_packages(),
    python_requires=">=3.7",
    install_requires=['numpy', 'scipy', 'h5py', 'numba', 'wavio'],
    classifiers=[
        # Trove classifiers
        # (https://pypi.python.org/pypi?%3Aaction=list_classifiers)
        'Development Status :: 4 - Beta',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Topic :: Software Development :: Libraries',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Intended Audience :: Developers',
    ],
)
--------------------------------------------------------------------------------
/test_1.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tabahi/formantfeatures/363fe4c9c0480705819ee2770cd05926228d21b1/test_1.wav
--------------------------------------------------------------------------------