├── .github
│   └── ISSUE_TEMPLATE
│       └── bug_report.md
├── .gitignore
├── LICENSE.md
├── README.md
├── example.py
├── formantfeatures.code-workspace
├── formantfeatures.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   ├── requires.txt
│   └── top_level.txt
├── formantfeatures
│   ├── FormantsExtract.py
│   ├── FormatsHDFread.py
│   └── __init__.py
├── setup.py
└── test_1.wav
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
 - OS: [e.g. iOS]
 - Browser [e.g. chrome, safari]
 - Version [e.g. 22]

**Smartphone (please complete the following information):**
 - Device: [e.g. iPhone6]
 - OS: [e.g. iOS8.1]
 - Browser [e.g. stock browser, safari]
 - Version [e.g. 22]

**Additional context**
Add any other context about the problem here.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
Archive/
cache/
__pycache__/
.vscode
build/
dist/
*.code-workspace
.pypirc
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright © 2020 Tabahi Abdul Rehman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Formant characteristic features extraction

Extract the frequency, power, width and dissonance of formants from a WAV file. These formant features can be used for speech recognition or music analysis.

## Dependencies

+ Python 3.7 or later
+ Numpy 1.16 or later
+ [Scipy v1.3.1](https://scipy.org/install.html)
+ [H5py v2.9.0](https://pypi.org/project/h5py/)
+ [Numba (v0.45.1)](https://numba.pydata.org/numba-doc/dev/user/installing.html)
+ [Wavio v0.0.4](https://pypi.org/project/wavio/)

> Install: `pip install formantfeatures`
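
A minimal quick start (a sketch assuming the `test_1.wav` sample from this repository is in the working directory; everything except the file path uses the defaults documented below):

```python
import numpy as np
import formantfeatures as ff

# Extract per-frame formant features with the default settings
features, frame_count, signal_len, trimmed_len = ff.Extract_wav_file_formants("test_1.wav")

print(features.shape)  # (400, 12): max_frames x (4 features x 3 formants)
print(frame_count)     # number of frames actually filled
```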

---------

## Get formant characteristics from a single file

`Extract_wav_file_formants`
--------------------------------

```python
import formantfeatures as ff

formants_features, frame_count, signal_length, trimmed_length = ff.Extract_wav_file_formants(wav_file_path, window_length, window_step, emphasize_ratio, norm=0, f0_min=f0_min, f0_max=f0_max, max_frames=max_frames, formants=max_formants)
```

### Parameters

>`wav_file_path`: string. Path of the input wav audio file.

>`window_length`: float, optional (default=0.025). Frame window size in seconds.

>`window_step`: float, optional (default=0.010). Frame window step size in seconds.

>`emphasize_ratio`: float, optional (default=0.7). Amplitude-increasing factor for pre-emphasis of higher frequencies, so that high frequencies end up with an amplitude comparable to the low frequencies.

>`norm`: int, optional (default=0). Enable or disable normalization of Mel-filters.

>`f0_min`: int, optional (default=30). Lower frequency bound, in Hertz.

>`f0_max`: int, optional (default=4000). Upper frequency bound, in Hertz.

>`max_frames`: int, optional (default=400). Cut-off size for the number of frames per clip. It is used to standardize the size of clips during processing. If the clip is shorter than this, the remaining frames are filled with zeros.

>`formants`: int, optional (default=3). Number of formants to extract.

>`formant_decay`: float, optional (default=0.5). Decay constant to exponentially decrease feature values by their formant amplitude ranks.

### Returns

returns `frames_features, frame_count, signal_length, trimmed_length`

>`frames_features`: array-like, `np.array((max_frames, num_of_features*formants), dtype=np.uint16)`. If `formants=3` then `frames_features` is a numpy array of shape (max_frames, 12), holding 12 features for each 0.025 s frame of the WAV file. The frame size can be adjusted; the recommended size is 0.025 s.
The 12 features (frequency, power, width and dissonance of the top 3 formants) are at the following indices of the numpy array:

Indices | Description
------------ | -------------
`frames_features[frame, 0]`| frequency of formant 0
`frames_features[frame, 1]`| power of formant 0
`frames_features[frame, 2]`| width of formant 0
`frames_features[frame, 3]`| dissonance of formant 0
`frames_features[frame, 4]`| frequency of formant 1
`frames_features[frame, 5]`| power of formant 1
`frames_features[frame, 6]`| width of formant 1
`frames_features[frame, 7]`| dissonance of formant 1
`frames_features[frame, 8]`| frequency of formant 2
`frames_features[frame, 9]`| power of formant 2
`frames_features[frame, 10]`| width of formant 2
`frames_features[frame, 11]`| dissonance of formant 2
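
The four descriptors of formant `k` can also be sliced in one step (a small sketch; `frame` stands for any frame index below `frame_count`):

```python
k = 1  # formant index: 0, 1 or 2
freq_k, power_k, width_k, dissonance_k = frames_features[frame, k*4 : (k*4)+4]
```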

>`frame_count`: int, number of filled frames (out of `max_frames`). It is the number of non-zero frames starting from index 0.

>`signal_length`: float, raw signal length in seconds.

>`trimmed_length`: float, trimmed length in seconds. Silence at the beginning and end of the input signal is trimmed before processing; `trimmed_length` is the duration that remains.

Note: frequencies are not on the Hertz or Mel scale. Instead, a disproportionate log scaling is applied to all features, which puts each feature on a scale of its own. An example of converting frequencies back to Hz can be seen in `save_features_stats` in `FormatsHDFread.py`, and in the sketch below.
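
For instance, the scaled frequency maps back to Hertz as follows (a sketch assuming the default `formant_decay=0.5`; values are stored as rounded `uint16`, so the recovery is approximate, and `0` marks a missing formant):

```python
import numpy as np

def scaled_freq_to_hz(scaled_freq, k, formant_decay=0.5):
    """Invert the log scaling applied to the frequency of formant k."""
    decay_rate = formant_decay ** k
    return np.exp(scaled_freq / (200 * decay_rate))

hz_f0 = scaled_freq_to_hz(frames_features[50, 0], k=0)  # formant 0, frame 50
```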

## Example

Example code is given in `example.py`.
It extracts the 12 formant features for each frame of the test wav file (`test_1.wav` yields 383 frames with a 25 ms window at a 10 ms stride) by calling `ff.Extract_wav_file_formants` with `max_frames=500`.

The `formants_features` array of size (500, 12) is returned by the function `formantfeatures.Extract_wav_file_formants`, in which 500 is the maximum number of frames but only `frame_count` frames are actually used.

Then we calculate the mean frequency, power, width and dissonance of the first 3 formants across the 383 frames.

The 12 formant features of each individual frame can be accessed as `formants_features[i, j]`, where `i` is the frame number out of `frame_count` (383 in this example), and `j` is the feature index out of the 12 features (0 for the 1st formant's frequency).

To calculate the mean of the first formant's frequency across all used frames (383 frames are used out of the maximum 500):
```python
first_formant_freq_mean = np.mean(formants_features[0:frame_count, 0])
# 0:frame_count gives the range of used frames out of the total 500 frames. The '0' is the index of the 1st formant's frequency in the feature list.

# Similarly, the power (index is '1'):
first_formant_power_mean = np.mean(formants_features[0:frame_count, 1])

# For the frequency of the 2nd formant (index is '4'; see the list of indices given above):
second_formant_freq_mean = np.mean(formants_features[0:frame_count, 4])

# To get features of individual frames (without mean):
first_freq_of_frame_50 = formants_features[50, 0]   # frequency of 1st formant of frame 50
first_width_of_frame_50 = formants_features[50, 2]  # width of 1st formant of frame 50
```

Output of `example.py`:
```
formants_features max_frames: 500 features count: 12 frame_count 383
Formant 0 Mean freq: 1174.3315926892951
Formant 0 Mean power: 448.1566579634465
Formant 0 Mean width: 46.30548302872063
Formant 0 Mean dissonance: 5.169712793733681
Formant 1 Mean freq: 579.9373368146214
Formant 1 Mean power: 188.7859007832898
Formant 1 Mean width: 12.459530026109661
Formant 1 Mean dissonance: 2.2323759791122715
Formant 2 Mean freq: 268.45430809399477
Formant 2 Mean power: 79.54830287206266
Formant 2 Mean width: 3.8929503916449084
Formant 2 Mean dissonance: 1.0783289817232375
Done

```

## Bulk processing

Pass a list of clip objects (see the `Clip_file_Class` sketch below) and the path of an HDF file in which to save the extracted features:

`Extract_files_formant_features`
--------------------------------

```python
import formantfeatures as ff

ff.Extract_files_formant_features(array_of_clips, features_save_file, window_length=0.025, window_step=0.010, emphasize_ratio=0.7, f0_min=30, f0_max=4000, max_frames=400, formants=3)
```

### Parameters

`array_of_clips`: list of `Clip_file_Class` objects from 'SER_DB.py'.

`features_save_file`: string, path of the HDF file where the extracted features will be stored.
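
`SER_DB.py` is not included in this package, so any object that exposes the attributes read by `Extract_files_formant_features` will do. A hypothetical stand-in (attribute names taken from the code in `FormantsExtract.py`):

```python
class Clip_file_Class:
    """Hypothetical stand-in for the clip class from 'SER_DB.py'; only these
    attributes are accessed by Extract_files_formant_features."""
    def __init__(self, filepath, speaker_id, accent, sex, emotion,
                 intensity=1, statement=1, repetition=1):
        self.filepath = filepath      # path of the wav file
        self.speaker_id = speaker_id  # int
        self.accent = accent          # int (e.g. RAVDESS: 1=speech, 2=song)
        self.sex = sex                # single character, e.g. 'M' or 'F'
        self.emotion = emotion        # single-character label, e.g. 'H'
        self.intensity = intensity    # int
        self.statement = statement    # int
        self.repetition = repetition  # int

array_of_clips = [Clip_file_Class("test_1.wav", speaker_id=1, accent=1, sex='F', emotion='H')]
```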

### Returns

`processed_clips`: int, number of successfully processed clips.

## Read HDF data files

HDF read functions such as `import_features_from_HDF` are imported from `FormatsHDFread`:

```python
import formantfeatures as ff

formant_features, labels, unique_speaker_ids, unique_classes = ff.import_features_from_HDF(storage_file, deselect_labels=['B', 'X'])
# Import while deselecting labels B (boring) and X (unknown)
```

Print label stats and save feature stats to file:

```python
ff.print_database_stats(labels)

ff.save_features_stats("DB_X", "csv_filename.csv", labels, formant_features)
```
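
`labels` has one row per clip; its column layout (see the `Ix` class in `FormatsHDFread.py`) can be read back as in this sketch:

```python
# Columns: 0 speaker_id, 1 accent, 2 sex (ord), 3 emotion (ord), 4 intensity,
# 5 statement, 6 repetition, 7 frame_count, 8 signal_len (ms), 9 trimmed_len (ms), 10 file_size (kB)
first_clip_emotion = chr(labels[0, 3])    # e.g. 'H'
first_clip_seconds = labels[0, 8] / 1000  # raw signal length in seconds
```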

------------------

## Citations

```tex
@article{LIU2021309,
title = {Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence},
journal = {Information Sciences},
volume = {563},
pages = {309-325},
year = {2021},
issn = {0020-0255},
doi = {https://doi.org/10.1016/j.ins.2021.02.016},
url = {https://www.sciencedirect.com/science/article/pii/S0020025521001584},
author = {Zhen-Tao Liu and Abdul Rehman and Min Wu and Wei-Hua Cao and Man Hao},
keywords = {Speech, Emotion recognition, Formants extraction, Phonemes, Clustering, Cross-corpus},
abstract = {Speech Emotion Recognition (SER) has numerous applications including human-robot interaction, online gaming, and health care assistance. While deep learning-based approaches achieve considerable precision, they often come with high computational and time costs. Indeed, feature learning strategies must search for important features in a large amount of speech data. In order to reduce these time and computational costs, we propose pre-processing step in which speech segments with similar formant characteristics are clustered together and labeled as the same phoneme. The phoneme occurrence rates in emotional utterances are then used as the input features for classifiers. Using six databases (EmoDB, RAVDESS, IEMOCAP, ShEMO, DEMoS and MSP-Improv) for evaluation, the level of accuracy is comparable to that of current state-of-the-art methods and the required training time was significantly reduced from hours to minutes.}
}
```

Paper: [Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence](https://www.sciencedirect.com/science/article/abs/pii/S0020025521001584)
--------------------------------------------------------------------------------
/example.py:
--------------------------------------------------------------------------------
'''
This example extracts 12 formant features for each frame (test_1.wav has 383 frames of a 25 ms window at a 10 ms stride).

The `formants_features` array of size (500, 12) is returned by the function `FormantsExtract.Extract_wav_file_formants`, in which 500 is the maximum number of frames but only `frame_count` frames are used.

Then we calculate the mean frequency, power, width and dissonance of the first 3 formants across the 383 frames.

The 12 formant features of each individual frame can be accessed as `formants_features[i, j]`, where `i` is the frame number out of `frame_count` (383 in this example), and `j` is the feature index out of the 12 features (0 for the 1st formant's frequency).

Output of `example.py`:
```
formants_features max_frames: 500 features count: 12 frame_count 383
Formant 0 Mean freq: 1174.3315926892951
Formant 0 Mean power: 448.1566579634465
Formant 0 Mean width: 46.30548302872063
Formant 0 Mean dissonance: 5.169712793733681
Formant 1 Mean freq: 579.9373368146214
Formant 1 Mean power: 188.7859007832898
Formant 1 Mean width: 12.459530026109661
Formant 1 Mean dissonance: 2.2323759791122715
Formant 2 Mean freq: 268.45430809399477
Formant 2 Mean power: 79.54830287206266
Formant 2 Mean width: 3.8929503916449084
Formant 2 Mean dissonance: 1.0783289817232375
Done

```
'''

import numpy as np
import formantfeatures as ff
import matplotlib.pyplot as plt


def main():

    test_wav = "test_1.wav"  # A sample from RAVDESS

    window_length = 0.025  # Keep it such that it's easier to differentiate syllables and remove pauses
    window_step = 0.010
    emphasize_ratio = 0.65
    f0_min = 30
    f0_max = 4000
    max_frames = 500
    max_formants = 3

    formants_features, frame_count, signal_length, trimmed_length = ff.Extract_wav_file_formants(test_wav, window_length, window_step, emphasize_ratio, norm=0, f0_min=f0_min, f0_max=f0_max, max_frames=max_frames, formants=max_formants)

    print("formants_features max_frames:", formants_features.shape[0], " features count:", formants_features.shape[1], "frame_count", frame_count)

    for formant in range(max_formants):
        print("Formant", formant, "Mean freq:", np.mean(formants_features[0:frame_count, (formant*4)+0]))
        print("Formant", formant, "Mean power:", np.mean(formants_features[0:frame_count, (formant*4)+1]))
        print("Formant", formant, "Mean width:", np.mean(formants_features[0:frame_count, (formant*4)+2]))
        print("Formant", formant, "Mean dissonance:", np.mean(formants_features[0:frame_count, (formant*4)+3]))

    x_axis_i = [*range(0, frame_count, 1)]

    colors = ['b', 'r', 'g']

    for formant in range(0, 1):
        formant_decay_rate = 0.5**(formant)

        log_scaled_freq = formants_features[0:frame_count, formant*4]

        Hz_freq = np.exp(log_scaled_freq / (200*formant_decay_rate))

        Hz_width = np.exp(np.log(Hz_freq) - formants_features[0:frame_count, (formant*4)+2] / (50 * formant_decay_rate))/4

        width_dn = Hz_freq - Hz_width
        width_up = Hz_freq + Hz_width

        plt.plot(x_axis_i, Hz_freq)
        plt.fill_between(x_axis_i, Hz_freq, width_dn, color=colors[formant], alpha=0.30)
        plt.fill_between(x_axis_i, Hz_freq, width_up, color=colors[formant], alpha=0.30)

    plt.tight_layout()
    plt.xlabel("frame")
    plt.ylabel("f")
    plt.title("freq")

    plt.show()

    print("Done")
    exit()


'''
Other functions:

# Pass a list of augmented DB objects (see SER_Datasets_Import) and the path of an HDF file in which to save the extracted features:

FormantsExtract.Extract_files_formant_features(array_of_clips, features_save_file, window_length=0.025, window_step=0.010, emphasize_ratio=0.7, norm=0, f0_min=30, f0_max=4000, max_frames=400, formants=3)

import formantfeatures.FormatsHDFread as FormatsHDFread

# Read extracted formants from HDF files:
formant_features, labels, unique_speaker_ids, unique_classes = FormatsHDFread.import_features_from_HDF(storage_file, deselect_labels=['B'])

FormatsHDFread.print_database_stats(labels)

FormatsHDFread.save_features_stats("DB_X", "csv_filename.csv", labels, formant_features)
'''


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/formantfeatures.code-workspace:
--------------------------------------------------------------------------------
{
    "folders": [
        {
            "path": "."
        }
    ]
}
--------------------------------------------------------------------------------
/formantfeatures.egg-info/PKG-INFO:
--------------------------------------------------------------------------------
Metadata-Version: 1.2
Name: formantfeatures
Version: 1.0.3
Summary: Extract formant characteristics from speech wav files.
Home-page: https://github.com/tabahi/formantfeatures
Author: Abdul Rehman
Author-email: alabdulrehman@hotmail.fr
License: MIT
Description: Please go to: https://github.com/tabahi/formantfeatures
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Developers
Requires-Python: >=3.7
--------------------------------------------------------------------------------
/formantfeatures.egg-info/SOURCES.txt:
--------------------------------------------------------------------------------
README.md
setup.py
formantfeatures/FormantsExtract.py
formantfeatures/FormatsHDFread.py
formantfeatures/__init__.py
formantfeatures.egg-info/PKG-INFO
formantfeatures.egg-info/SOURCES.txt
formantfeatures.egg-info/dependency_links.txt
formantfeatures.egg-info/requires.txt
formantfeatures.egg-info/top_level.txt
--------------------------------------------------------------------------------
/formantfeatures.egg-info/dependency_links.txt:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/formantfeatures.egg-info/requires.txt:
--------------------------------------------------------------------------------
numpy
scipy
h5py
numba
wavio
--------------------------------------------------------------------------------
/formantfeatures.egg-info/top_level.txt:
--------------------------------------------------------------------------------
formantfeatures
--------------------------------------------------------------------------------
/formantfeatures/FormantsExtract.py:
--------------------------------------------------------------------------------
"""
-----
Author: Abdul Rehman
License: The MIT License (MIT)
Copyright (c) 2020, Tabahi Abdul Rehman
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
"""
import numpy as np
from scipy import signal as signallib
from numba import jit  # install numba to speed up the execution
from wavio import read as wavio_read


@jit(nopython=True)
def get_lowest_positions(array_y, n_positions):
    order = array_y.argsort()
    ranks = order.argsort()  # ascending
    top_indexes = np.zeros((n_positions,), dtype=np.int16)
    #print(array_y)
    i = int(0)

    while(i < n_positions):
        itemindices = np.where(ranks==i)
        for itemindex in itemindices:
            if(itemindex.size):
                #print(i, array_y[itemindex], itemindex)
                top_indexes[i] = itemindex[0]
            else:  # for when positions are more than the array size
                itemindices2 = np.where(ranks==(array_y.size -1-i+ array_y.size))
                for itemindex2 in itemindices2:
                    #print(i, array_y[itemindex2], itemindex2)
                    top_indexes[i] = itemindex2[0]
        i += 1
    #print(array_y[top_indexes])
    return top_indexes


@jit(nopython=True)
def get_top_positions(array_y, n_positions):
    order = array_y.argsort()
    ranks = order.argsort()  # ascending
    top_indexes = np.zeros((n_positions,), dtype=np.int16)
    #print(array_y)
    i = int(n_positions - 1)

    while(i >= 0):
        itemindices = np.where(ranks==(len(array_y)-1-i))
        for itemindex in itemindices:
            if(itemindex.size):
                #print(i, array_y[itemindex], itemindex)
                top_indexes[i] = itemindex[0]
            else:  # for when positions are more than the array size
                itemindices2 = np.where(ranks==len(array_y)-1-i+len(array_y))
                for itemindex2 in itemindices2:
                    #print(i, array_y[itemindex2], itemindex2)
                    top_indexes[i] = itemindex2[0]
        i -= 1

    return top_indexes


def frame_segmentation(signal, sample_rate, window_length=0.040, window_step=0.020):

    # Framing
    frame_length, frame_step = window_length * sample_rate, window_step * sample_rate  # Convert from seconds to samples
    signal_length = len(signal)
    frame_length = int(round(frame_length))
    frame_step = int(round(frame_step))
    num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step))  # Make sure that we have at least 1 frame
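    # Example: with sample_rate=16000, window_length=0.025 and window_step=0.010,
    # frame_length=400 and frame_step=160 samples, so a 1 s clip (16000 samples)
    # yields ceil((16000 - 400) / 160) = 98 frames.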
    if(num_frames < 1):
        raise Exception("Clip length is too short. It should be at least " + str(window_length*2) + " seconds")

    pad_signal_length = num_frames * frame_step + frame_length
    z = np.zeros((pad_signal_length - signal_length))
    pad_signal = np.append(signal, z)  # Pad the signal to make sure that all frames have an equal number of samples without truncating any samples from the original signal

    indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
    frames = pad_signal[indices.astype(np.int32, copy=False)]

    # Hamming window
    frames *= np.hamming(frame_length)
    #frames *= 0.54 - 0.46 * numpy.cos((2 * numpy.pi * n) / (frame_length - 1))  # Explicit implementation
    #print(frames.shape)
    return frames, signal_length


def get_filter_banks(frames, sample_rate, f0_min=60, f0_max=4000, num_filt=128, norm=0):
    '''
    Fourier-Transform and Power Spectrum

    return filter_banks, hz_points

    filter_banks: array-like, shape = [n_frames, num_filt]

    hz_points: array-like, shape = [num_filt + 2], boundary and center frequencies of the mel-filters

    This code is from https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
    Courtesy of Haytham Fayek
    '''

    NFFT = num_filt*32  # FFT bins (equally spaced, unlike the mel filters)
    mag_frames = np.absolute(np.fft.rfft(frames, NFFT))  # Magnitude of the FFT
    pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2))  # Power Spectrum

    # Filter banks
    nfilt = num_filt
    low_freq_mel = (2595 * np.log10(1 + (f0_min) / 700))
    high_freq_mel = (2595 * np.log10(1 + (f0_max) / 700))  # Convert Hz to Mel
    mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2)  # Equally spaced in Mel scale
    hz_points = (700 * (10**(mel_points / 2595) - 1))  # Convert Mel to Hz
    bin = np.floor((NFFT + 1) * hz_points / sample_rate)

    n_overlap = int(np.floor(NFFT / 2 + 1))
    fbank = np.zeros((nfilt, n_overlap))

    for m in range(1, nfilt + 1):
        f_m_minus = int(bin[m - 1])  # left
        f_m = int(bin[m])  # center
        f_m_plus = int(bin[m + 1])  # right

        for k in range(f_m_minus, f_m):
            fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
        for k in range(f_m, f_m_plus):
            fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
    filter_banks = np.dot(pow_frames, fbank.T)
    filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks)  # Numerical stability
    #filter_banks = 20 * np.log10(filter_banks)  # dB
    if(norm):
        filter_banks -= (np.mean(filter_banks))  # normalize

    return filter_banks, hz_points


freq, power, width, dissonance = 0, 1, 2, 3


def Extract_formant_descriptors(fft_x, fft_y, formants=2, f_min=30, f_max=4000):
    '''
    returns an array of shape ((formants*4,), dtype=np.uint64): frequency, power, width and dissonance of each formant
    '''

    len_of_x = len(fft_x)
    len_of_y = len(fft_y)

    # 4 features per formant
    returno = np.zeros((formants*4,), dtype=np.uint64)

    if(len_of_x!=len_of_y) or (len_of_x <= 3):
        #print("Empty Frame")
        return returno

    peak_indices = signallib.argrelextrema(fft_y, np.greater, mode='wrap')
    valley_indices = signallib.argrelextrema(fft_y, np.less, mode='wrap')
    peak_indices = peak_indices[0]
    valley_indices = valley_indices[0]
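    # Local maxima are the formant candidates; the surrounding local minima
    # (valleys) bound each peak below when measuring its width and dissonance.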
    peak_fft_x, peak_fft_y = fft_x[peak_indices], fft_y[peak_indices]
    valley_fft_x, valley_fft_y = fft_x[valley_indices], fft_y[valley_indices]

    len_of_peaks = len(peak_indices)
    if(len_of_peaks < 1) or (len(valley_indices) < 1):
        #print("Silence")
        return returno

    ground_level = 0
    if (len(valley_fft_y) > 1):
        ground_level = np.max(valley_fft_y)  # range(valleys_y)/2
    if(ground_level < 10):
        # Silence
        return returno

    # add extra valleys at the start and end
    if(peak_fft_x[0] < valley_fft_x[0]):
        valley_fft_x = np.append([f_min/2], valley_fft_x)
        valley_fft_y = np.append([ground_level/8], valley_fft_y)
    if(peak_fft_x[-1] > valley_fft_x[-1]):
        valley_fft_x = np.append(valley_fft_x, [f_max+f_min])
        valley_fft_y = np.append(valley_fft_y, [ground_level/8])

    top_peaks_n = formants*2
    # make sure the fft has enough points
    if(len(peak_fft_y) < (formants+1)):
        return returno
    if(len(peak_fft_y) < (top_peaks_n-1)):
        top_peaks_n = len(peak_fft_y) - 1

    tp_indexes = get_top_positions(peak_fft_y, top_peaks_n)  # descending
    dissonance_peak = np.zeros(top_peaks_n)
    biggest_peak_y = peak_fft_y[tp_indexes[0]]

    formants_detected = 0

    # calc width and dissonance
    for i in range(0, top_peaks_n):

        if(dissonance_peak[i]==0) and (peak_fft_y[tp_indexes[i]] > (biggest_peak_y/16)) and (peak_fft_x[tp_indexes[i]] >= f_min) and (peak_fft_x[tp_indexes[i]] <= f_max) and (formants_detected < formants):
            next_valley = np.min(np.where(valley_fft_x > peak_fft_x[tp_indexes[i]]))
            next_valley_x = valley_fft_x[next_valley]
            next_valley_y = valley_fft_y[next_valley]

            this_peak_gnd_thresh = peak_fft_y[tp_indexes[i]]/4

            while(next_valley_y > this_peak_gnd_thresh) and (len(np.where(valley_fft_x > next_valley_x)[0]) > 0):
                valley_next_peak_ind = np.where(peak_fft_x > next_valley_x)
                if(len(valley_next_peak_ind[0]) > 0):
                    valley_next_peak = np.min(valley_next_peak_ind)
                    if(peak_fft_y[tp_indexes[i]] > peak_fft_y[valley_next_peak]):
                        next_valley = np.min(np.where(valley_fft_x > next_valley_x))
                        next_valley_x = valley_fft_x[next_valley]
                        next_valley_y = valley_fft_y[next_valley]
                    else:
                        break
                else:
                    break

            prev_valley = np.max(np.where(valley_fft_x < peak_fft_x[tp_indexes[i]]))
            prev_valley_x = valley_fft_x[prev_valley]
            prev_valley_y = valley_fft_y[prev_valley]

            while(prev_valley_y > this_peak_gnd_thresh) and (len(np.where(valley_fft_x < prev_valley_x)[0]) > 0):
                valleys_prev_peak_ind = np.where(peak_fft_x < prev_valley_x)
                if(len(valleys_prev_peak_ind[0]) > 0):
                    valley_prev_peak = np.max(valleys_prev_peak_ind)
                    if(peak_fft_y[tp_indexes[i]] > peak_fft_y[valley_prev_peak]):
                        prev_valley = np.max(np.where(valley_fft_x < prev_valley_x))
                        prev_valley_x = valley_fft_x[prev_valley]
                        prev_valley_y = valley_fft_y[prev_valley]
                    else:
                        break
                else:
                    break

            dissonance_peak[i] = 1
            this_dissonance = 0
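            # Dissonance: sum the amplitudes of the other top peaks that fall inside
            # this peak's valley-to-valley span and are more than 2% away in frequency,
            # then normalize by this peak's own amplitude; closer peaks are merged in.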
            for k in range(0, top_peaks_n):
                if(peak_fft_x[tp_indexes[k]] < next_valley_x) and (peak_fft_x[tp_indexes[k]] > prev_valley_x) and k!=i:
                    dissonance_peak[k] = 1
                    if(np.abs(peak_fft_x[tp_indexes[k]] - peak_fft_x[tp_indexes[i]]) > (peak_fft_x[tp_indexes[i]]/50)):
                        this_dissonance += peak_fft_y[tp_indexes[k]]
                    else:
                        peak_fft_x[tp_indexes[i]] = (peak_fft_x[tp_indexes[i]]+peak_fft_x[tp_indexes[k]])/2
                        peak_fft_y[tp_indexes[i]] = (peak_fft_y[tp_indexes[i]]+peak_fft_y[tp_indexes[k]])/2

            this_dissonance = this_dissonance/peak_fft_y[tp_indexes[i]]
            this_width = np.log(next_valley_x)-np.log(prev_valley_x)

            returno[freq + (formants_detected*4)] = peak_fft_x[tp_indexes[i]]
            returno[power + (formants_detected*4)] = peak_fft_y[tp_indexes[i]]
            returno[width + (formants_detected*4)] = this_width*10
            returno[dissonance + (formants_detected*4)] = this_dissonance*10

            formants_detected += 1

    #plt.figure(1)
    #plt.plot(fft_x, fft_y)
    #plt.plot(peak_fft_x, peak_fft_y, marker='o', linestyle='dashed', color='green', label="Splits")
    #plt.plot(valley_fft_x, valley_fft_y, marker='o', linestyle='dashed', color='red', label="Splits")
    #plt.show()

    return returno


def Extract_wav_file_formants(wav_file_path, window_length=0.025, window_step=0.010, emphasize_ratio=0.7, norm=0, f0_min=30, f0_max=4000, max_frames=400, formants=3, formant_decay=0.5):
    '''
    Parameters
    ----------
    `wav_file_path`: string. Path of the input wav audio file;

    `window_length`: float, optional (default=0.025). Frame window size in seconds;

    `window_step`: float, optional (default=0.010). Frame window step size in seconds;

    `emphasize_ratio`: float, optional (default=0.7). Amplitude-increasing factor for pre-emphasis of higher frequencies, so that high frequencies end up with an amplitude comparable to the low frequencies;

    `norm`: int, optional (default=0). Enable or disable normalization of Mel-filters;

    `f0_min`: int, optional (default=30), Hertz;

    `f0_max`: int, optional (default=4000), Hertz;

    `max_frames`: int, optional (default=400). Cut-off size for the number of frames per clip. It is used to standardize the size of clips during processing;

    `formants`: int, optional (default=3). Number of formants to extract;

    `formant_decay`: float, optional (default=0.5). Decay constant to exponentially decrease feature values by their formant amplitude ranks;

    Returns
    -------
    returns `frames_features, frame_count, signal_length, trimmed_length`

    `frames_features`: array-like, `np.array((max_frames, num_of_features*formants), dtype=np.uint16)`;

    `frame_count`: int, number of filled frames (out of max_frames);

    `signal_length`: float, signal length in seconds;

    `trimmed_length`: float, trimmed length in seconds; silence at the beginning and end of the input signal is trimmed before processing;
    '''

    wav_data = wavio_read(wav_file_path)
    raw_signal = wav_data.data
    sample_rate = wav_data.rate

    #emphasize_ratio = 0.70
    signal_to_plot = np.append(raw_signal[0], raw_signal[1:] - emphasize_ratio * raw_signal[:-1])
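    # Pre-emphasis: y[t] = x[t] - emphasize_ratio * x[t-1], a first-order high-pass
    # filter that boosts high frequencies relative to low frequencies before framing.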
    #signal_to_plot = raw_signal

    num_filt = 256
    frames, signal_length = frame_segmentation(signal_to_plot, sample_rate, window_length=window_length, window_step=window_step)
    frames_filter_banks, hz_points = get_filter_banks(frames, sample_rate, f0_min=f0_min, f0_max=f0_max, num_filt=num_filt, norm=norm)

    # x-axis points of the triangular mel filters used
    #hz_bins_min = hz_points[0:num_filt]  # discarding the last 2 points
    hz_bins_mid = hz_points[1:num_filt+1]  # discarding the 1st and last point
    #hz_bins_max = hz_points[2:num_filt+2]  # discarding the first 2 points

    num_of_frames = frames_filter_banks.shape[0]

    #min_peaks_count = 2

    neighboring_frames = 2  # number of neighboring frames to compare
    if(num_of_frames < ((neighboring_frames*2)+1)):
        raise Exception("Not enough frames to compare harmonics. Need at least " + str((neighboring_frames*2)+1) + " frames. Frame count: " + str(num_of_frames))

    #formants = 2
    num_of_features = 4  # freq, power, width, dissonance
    formants_data = np.zeros((num_of_frames, num_of_features*formants), dtype=np.uint64)

    for frame_index in range(0, num_of_frames):

        # Find peaks (local maxima) in this frame's mel-filter spectrum.
        peak_indexes = signallib.argrelextrema(frames_filter_banks[frame_index], np.greater, mode='wrap')
        peak_indexes = peak_indexes[0]
        peak_fft_x, peak_fft_y = hz_bins_mid[peak_indexes], frames_filter_banks[frame_index][peak_indexes]

        formants_data[frame_index] = Extract_formant_descriptors(peak_fft_x, peak_fft_y, formants, f0_min, f0_max)

    # silence threshold: mean power of the 1st formant over non-zero frames, / 1000
    power_ground = int(np.mean(formants_data[:,power][np.where(formants_data[:,power] > 0)])/1000)
    if(power_ground < 1):
        power_ground = 1

    # trim silent ends
    first_frame, last_frame = 0, 0
    for i in range(0, num_of_frames):
        first_frame = i
        if(formants_data[i, power] > power_ground):
            break

    for i in range(0, num_of_frames):
        last_frame = num_of_frames - i - 1
        if(formants_data[last_frame, power] > power_ground):
            break

    #print(power_ground, num_of_frames, last_frame - first_frame)
    trimmed_length = ((last_frame - first_frame)/num_of_frames)*signal_length
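    # Scale features into uint16 range, attenuated by formant rank r (decay^r):
    #   power -> log10(power) * 100 * decay^r,  freq -> ln(freq) * 200 * decay^r,
    #   width -> width * 5 * decay^r,           dissonance -> dissonance * 10 * decay^r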
    for fr in range(0, num_of_frames):
        for i in range(0, formants):
            formant_decay_rate = formant_decay**(i)

            if(formants_data[fr, power + (i*num_of_features)] < 1):
                formants_data[fr, power + (i*num_of_features)] = 0
            else:
                formants_data[fr, power + (i*num_of_features)] = np.log10(formants_data[fr, power + (i*num_of_features)]) * 100 * formant_decay_rate

            if(formants_data[fr, freq + (i*num_of_features)] < f0_min):
                formants_data[fr, freq + (i*num_of_features)] = 0
            else:
                formants_data[fr, freq + (i*num_of_features)] = np.log(formants_data[fr, freq + (i*num_of_features)]) * 200 * formant_decay_rate

            formants_data[fr, width + (i*num_of_features)] = formants_data[fr, width + (i*num_of_features)] * 5 * formant_decay_rate
            formants_data[fr, dissonance + (i*num_of_features)] = formants_data[fr, dissonance + (i*num_of_features)] * 10 * formant_decay_rate
        #print(formants_data[fr])
    #exit()

    returno = np.zeros((max_frames, num_of_features*formants), dtype=np.uint16)
    frame_count = 0
    for i in range(0, max_frames):
        old_frame_i = first_frame+i
        returno[i] = formants_data[old_frame_i]
        frame_count = i
        if(i >= (last_frame - first_frame - 1)):
            break
        elif(i >= (max_frames-1)):
            print("Warning! Frame size overflow, Size:", (last_frame - first_frame), "Limit:", max_frames)
            break

    #print(frame_count, signal_length/sample_rate, trimmed_length/sample_rate)
    return returno, frame_count, signal_length/sample_rate, trimmed_length/sample_rate


def Extract_files_formant_features(array_of_clips, features_save_file, window_length=0.025, window_step=0.010, emphasize_ratio=0.7, norm=0, f0_min=30, f0_max=4000, max_frames=400, formants=3):
    '''
    Parameters
    ----------
    `array_of_clips`: list of Clip_file_Class objects from 'SER_DB.py';

    `features_save_file`: string, path of the HDF file where the extracted features will be stored;

    `window_length`: float, optional (default=0.025). Frame window size in seconds;

    `window_step`: float, optional (default=0.010). Frame window step size in seconds;

    `emphasize_ratio`: float, optional (default=0.7). Amplitude-increasing factor for pre-emphasis of higher frequencies, so that high frequencies end up with an amplitude comparable to the low frequencies;

    `norm`: int, optional (default=0). Enable or disable normalization of Mel-filters;

    `f0_min`: int, optional (default=30), Hertz;

    `f0_max`: int, optional (default=4000), Hertz;

    `max_frames`: int, optional (default=400). Cut-off size for the number of frames per clip. It is used to standardize the size of clips during processing;

    `formants`: int, optional (default=3). Number of formants to extract;

    Returns
    -------
    `processed_clips`: int, number of successfully processed clips;
    '''

    import os
    if(os.path.isfile(features_save_file)):
        print("Removing HDF")
        os.remove(features_save_file)

    total_clips = len(array_of_clips)
    processed_clips = 0

    import h5py
    with h5py.File(features_save_file, 'w') as hf:
        dset_label = hf.create_dataset('labels', (total_clips, 11), dtype='u2')
        dset_features = hf.create_dataset('features', (total_clips, max_frames, formants*4), dtype='u2')
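        # 'labels' holds one 11-column row per clip (column order matches class Ix
        # in FormatsHDFread.py); 'features' holds each clip's uint16 feature matrix.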
        print("Clip", "i", "of", "Total", "SpeakerID", "Accent", "Sex", "Emotion")
        for index, clip in enumerate(array_of_clips):
            try:
                print("Clip ", index+1, "of", total_clips, clip.speaker_id, clip.accent, clip.sex, clip.emotion)
                array_frames_by_features = np.zeros((max_frames, formants*4), dtype=np.uint16)
                #print(clip.filepath)
                array_frames_by_features, frame_count, signal_length, trimmed_length = Extract_wav_file_formants(clip.filepath, window_length, window_step, emphasize_ratio, norm, f0_min, f0_max, max_frames, formants)
                clipfile_size = int(os.path.getsize(clip.filepath)/1000)

                dset_features[index] = array_frames_by_features
                dset_label[index] = [clip.speaker_id, clip.accent, ord(clip.sex), ord(clip.emotion), int(clip.intensity), int(clip.statement), int(clip.repetition), int(frame_count), int(signal_length*1000), int(trimmed_length*1000), clipfile_size]
                processed_clips += 1
            except Exception as e:
                print(e)

    print("Read features of", total_clips, "clips")

    print("Closing HDF")
    return processed_clips
--------------------------------------------------------------------------------
/formantfeatures/FormatsHDFread.py:
--------------------------------------------------------------------------------
"""
-----
Author: Abdul Rehman
License: The MIT License (MIT)
Copyright (c) 2020, Tabahi Abdul Rehman
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
"""
import numpy as np


class Ix(object):
    '''
    Clip label indices for enumeration - Ignore
    '''
    speaker_id, accent, sex, emotion, intensity, statement, repetition, frame_count, signal_len, trimmed_len, file_size = 0,1,2,3,4,5,6,7,8,9,10


def print_database_stats(labels):

    print("Total clips", labels.shape[0])
    print("wav files size (MB)", round(np.sum(labels[:, Ix.file_size])/1000, 2))
    print("Total raw length (min)", round(np.sum(labels[:, Ix.signal_len])/60000, 2))
    print("Total trimmed length (min)", round(np.sum(labels[:, Ix.trimmed_len])/60000, 2))
    print("Avg raw length (s)", round(np.mean(labels[:, Ix.signal_len]/1000), 2))
    print("Avg trimmed length (s)", round(np.mean(labels[:, Ix.trimmed_len]/1000), 2))
    print("Avg. frame count", round(np.mean(labels[:, Ix.frame_count]), 2))
    print("Male Female Clips", np.where(labels[:, Ix.sex]==ord('M'))[0].size, np.where(labels[:, Ix.sex]==ord('F'))[0].size)

    unique_speaker_id = np.unique(labels[:, Ix.speaker_id])
    print("Unique speakers: ", len(unique_speaker_id))
    print("Speakers id: ", unique_speaker_id)

    unique_classes = np.unique(labels[:, Ix.emotion])
    print("Emotion classes: ", len(unique_classes))
    print("Unique emotions: ", [chr(x) for x in unique_classes])

    print("Emotion", "N clips", "Total(min)", "Trimmed(min)")
    for this_e in unique_classes:
        select_e = np.where(labels[:, Ix.emotion]==this_e)[0]
        print(chr(this_e), '\t', labels[select_e].shape[0], '\t', round(np.sum(labels[select_e, Ix.signal_len]/1000)/60, 2), '\t', round(np.sum(labels[select_e, Ix.trimmed_len]/1000)/60, 2))

    return len(unique_classes), len(unique_speaker_id)

def save_features_stats(db_name, csv_filename, labels, features):

    import csv
    with open(csv_filename, 'a') as csvFile:
        writer = csv.writer(csvFile, delimiter=',', lineterminator='\n')
        #writer.writerow(["Emotion", "Combination", "Occurrences"])
        writer.writerow(["DB", "Emotion", "N clips", "f0", "p0", "w0", "d0", "f1", "p1", "w1", "d1", "f2", "p2", "w2", "d2"])
        unique_classes = np.unique(labels[:, Ix.emotion])

        print("Mean Values")
        print("Emotion", "freq", "power", "width", "diss")
        for this_e in unique_classes:
            select_e = np.where(labels[:, Ix.emotion]==this_e)[0]

            clips_n = features[select_e].shape[0]
            e_fts = features[select_e]
            this_row = [db_name, str(chr(this_e)), str(clips_n)]
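            # Invert the scaling applied in FormantsExtract: frequency was stored
            # as ln(Hz) * 200 * decay^rank, so exp(value / (200 * decay^rank))
            # recovers Hz; power, width and dissonance are rescaled linearly.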
            for i in range(0, 3):
                formant_decay_rate = 0.5**i
                freq = int(np.mean(np.exp(e_fts[:, :, (i*4)][np.where(e_fts[:, :, (i*4)] > 0)] / (200*formant_decay_rate))))
                power = int(np.mean(e_fts[:, (i*4)+1][np.where(e_fts[:, (i*4)+1] > 0)]) / (100*formant_decay_rate) * 10)
                width = int(np.mean(e_fts[:, (i*4)+2][np.where(e_fts[:, (i*4)+2] > 0)]) / (5*formant_decay_rate))
                diss = int(np.mean(e_fts[:, (i*4)+3][np.where(e_fts[:, (i*4)+3] > 0)]) / (10*formant_decay_rate))

                this_row.append(str(freq))
                this_row.append(str(power))
                this_row.append(str(width))
                this_row.append(str(diss))
            print(chr(this_e), clips_n, freq, power, width, diss)

            writer.writerow(this_row)
    #print(1000, np.log(1000), np.exp(np.log(1000)))

    return


def import_features_from_HDF(storage_file, deselect_labels=None):
    # e.g. deselect_labels=['C', 'D', 'F', 'U']
    print("Reading dataset from file:", storage_file)
    import h5py
    hf = h5py.File(storage_file, 'r')
    lbl = np.array(hf.get('labels'))
    formant_features = np.array(hf.get('features'))

    conditions = (lbl[:, Ix.accent]==1)  # RAVDESS has 2 accents (1=speech, 2=song), select only speech.

    if(deselect_labels is not None) and (len(deselect_labels) > 0):
        for em in deselect_labels:
            conditions &= (lbl[:, Ix.emotion]!=ord(em))

    selected = np.where(conditions)
    lbl = lbl[selected]
    formant_features = formant_features[selected]

    if(lbl.shape[0]!=formant_features.shape[0]):
        raise Exception("Labels and Features samples size mismatch", lbl.shape[0], formant_features.shape[0])

    print("Clips count:", formant_features.shape[0])

    unique_speaker_id = np.unique(lbl[:, Ix.speaker_id])
    unique_classes = np.unique(lbl[:, Ix.emotion])

    return formant_features, lbl, unique_speaker_id, unique_classes


def import_mutiple_HDFs(storage_files, deselect_labels=['C', 'D', 'F', 'U', 'E', 'R', 'G', 'B']):

    import os.path as os_path
    import h5py
    print("Reading dataset from file:", storage_files)

    # check if features are already extracted
    if (os_path.isfile(storage_files[0])==False) or (int(os_path.getsize(storage_files[0])) < 8000):
        raise Exception("Formant features for this training set are not extracted yet. Call 'run_train_and_test' to extract formant features.")

    storage_file = storage_files[0]
    hf = h5py.File(storage_file, 'r')
    lbl = np.array(hf.get('labels'))
    formant_features = np.array(hf.get('features'))

    for sn in range(1, len(storage_files)):
        if (os_path.isfile(storage_files[sn])==False) or (int(os_path.getsize(storage_files[sn])) < 8000):
            raise Exception("Formant features for this training set are not extracted yet. Call 'run_train_and_test' to extract formant features.")

        storage_file = storage_files[sn]
        hf = h5py.File(storage_file, 'r')
        lbl = np.concatenate((lbl, np.array(hf.get('labels'))))
        formant_features = np.concatenate((formant_features, np.array(hf.get('features'))))

    conditions = (lbl[:, Ix.accent]==1)  # RAVDESS has 2 accents (1=speech, 2=song), select only speech.
    if(deselect_labels is not None):
        if(len(deselect_labels) > 0):
            for em in deselect_labels:
                conditions &= (lbl[:, Ix.emotion]!=ord(em))

    selected = np.where(conditions)
    lbl = lbl[selected]
    formant_features = formant_features[selected]

    if(lbl.shape[0]!=formant_features.shape[0]):
        raise Exception("Labels and Features samples size mismatch", lbl.shape[0], formant_features.shape[0])

    print("Clips count:", formant_features.shape[0])

    unique_speaker_id = np.unique(lbl[:, Ix.speaker_id])
    unique_classes = np.unique(lbl[:, Ix.emotion])

    return formant_features, lbl, unique_speaker_id, unique_classes


#if(db_name=="IEMOCAP"):
#    features, labels, u_speakers, u_classes = FormatsHDFread.import_features_from_HDF(features_HDF_file, deselect_labels=['D','F','U','E','R'])
# Deselect some labels from IEMOCAP because these emotions have very few samples.
--------------------------------------------------------------------------------
/formantfeatures/__init__.py:
--------------------------------------------------------------------------------

from .FormantsExtract import (
    Extract_files_formant_features, Extract_wav_file_formants,
)

from .FormatsHDFread import (
    import_features_from_HDF, import_mutiple_HDFs, save_features_stats, print_database_stats
)
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------

import setuptools


with open('README.md') as f:
    README = f.read()

setuptools.setup(
    author="Abdul Rehman",
    author_email="alabdulrehman@hotmail.fr",
    name='formantfeatures',
    license="MIT",
    description='Extract formant characteristics from speech wav files.',
    version='v1.0.3',
    long_description='Please go to: https://github.com/tabahi/formantfeatures',
    url='https://github.com/tabahi/formantfeatures',
    packages=setuptools.find_packages(),
    python_requires=">=3.7",
    install_requires=['numpy', 'scipy', 'h5py', 'numba', 'wavio'],
    classifiers=[
        # Trove classifiers
        # (https://pypi.python.org/pypi?%3Aaction=list_classifiers)
        'Development Status :: 4 - Beta',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Topic :: Software Development :: Libraries',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Intended Audience :: Developers',
    ],
)
--------------------------------------------------------------------------------
/test_1.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tabahi/formantfeatures/363fe4c9c0480705819ee2770cd05926228d21b1/test_1.wav
--------------------------------------------------------------------------------