├── LICENSE
├── README.md
├── feature.py
├── images
    └── CRNN_SED_DCASE2017_task3.jpg
├── metrics.py
├── requirements.txt
├── sed.py
└── utils.py


/LICENSE:
--------------------------------------------------------------------------------
 1 | -----------COPYRIGHT NOTICE STARTS WITH THIS LINE------------
 2 | Copyright (c) 2017 Tampere University of Technology and its licensors
 3 | All rights reserved.
 4 | 
 5 | Permission is hereby granted, without written agreement and without
 6 | license or royalty fees, to use and copy the code for the Multichannel
 7 | Sound Event Detection using Convolutional Recurrent Neural Network
 8 | method/architecture, present in the GitHub repository with the handle
 9 | sed-crnn, (“Work”) described in the paper with title
10 | "Sound event detection using spatial features and convolutional
11 | recurrent neural network" (and available also from
12 | https://arxiv.org/abs/1706.02291) and composed of files with code in the
13 | Python programming language. This grant is only for experimental and
14 | non-commercial purposes, provided that the copyright notice in its entirety
15 | appear in all copies of this Work, and the original source of this Work,
16 | Audio Research Group, Lab. of Signal Processing at Tampere University
17 | of Technology, is acknowledged in any publication that reports research
18 | using this Work.
19 | 
20 | Any commercial use of the Work or any part thereof is strictly prohibited.
21 | Commercial use include, but is not limited to:
22 | - selling or reproducing the Work
23 | - selling or distributing the results or content achieved by use of the Work
24 | - providing services by using the Work.
25 | 
26 | IN NO EVENT SHALL TAMPERE UNIVERSITY OF TECHNOLOGY OR ITS LICENSORS BE LIABLE TO
27 | ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
28 | ARISING OUT OF THE USE OF THIS WORK AND ITS DOCUMENTATION, EVEN IF TAMPERE
29 | UNIVERSITY OF TECHNOLOGY OR ITS LICENSORS HAS BEEN ADVISED OF THE POSSIBILITY
30 | OF SUCH DAMAGE.
31 | 
32 | TAMPERE UNIVERSITY OF TECHNOLOGY AND ALL ITS LICENSORS SPECIFICALLY DISCLAIMS
33 | ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
34 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE WORK PROVIDED HEREUNDER
35 | IS ON AN "AS IS" BASIS, AND THE TAMPERE UNIVERSITY OF TECHNOLOGY HAS NO OBLIGATION
36 | TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
37 | 
38 | -----------COPYRIGHT NOTICE ENDS WITH THIS LINE------------
39 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Single and multichannel sound event detection using convolutional recurrent neural network
 2 | [Sound event detection (SED)](https://www.aane.in/research/computational-audio-scene-analysis-casa/sound-event-detection) is the task of recognizing the sound events and their respective temporal start and end time in a recording. Sound events in real life do not always occur in isolation, but tend to considerably overlap with each other.
 3 | Recognizing such overlapping sound events is referred as polyphonic SED. Performing polyphonic SED using monochannel audio is a challenging task. These overlapping sound events can potentially be recognized better with multichannel audio.
 4 | This repository supports both single- and multichannel versions of polyphonic SED and is referred as SEDnet hereafter. You can read more about [sound event detection literature here](https://www.aane.in/research/computational-audio-scene-analysis-casa/sound-event-detection).
 5 | 
 6 | This method was first proposed in '[Sound event detection using spatial features and convolutional recurrent neural network](https://arxiv.org/abs/1706.02291 "Arxiv paper")'. It recently won the [DCASE 2017 real-life sound event detection](https://goo.gl/8eqCg3 "Challenge webpage"). We are releasing a simple vanila code without much frills here. 
 7 | 
 8 | If you are using anything from this repository please consider citing,
 9 | 
10 | >Sharath Adavanne, Pasi Pertila and Tuomas Virtanen, "Sound event detection using spatial features and convolutional recurrent neural network" in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017)
11 | 
12 | Similar CRNN architecture has been successfully used for different tasks and research challenges as below. You can accordingly play around with a suitable prediction layer as the task requires.
13 | 
14 | 1. Sound event detection
15 |    - Sharath Adavanne, Pasi Pertila and Tuomas Virtanen, '[Sound event detection using spatial features and convolutional recurrent neural network](https://arxiv.org/abs/1706.02291 "Arxiv paper")' at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017) 
16 |    - Sharath Adavanne, Archontis Politis and Tuomas Virtanen, '[Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features](https://arxiv.org/abs/1801.09522 "Arxiv paper")' at International Joint Conference on Neural Networks (IJCNN 2018)
17 | 
18 | 2. SED with weak labels
19 |    - Sharath Adavanne and Tuomas Virtanen, '[Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network](https://arxiv.org/abs/1710.02998 "Arxiv paper")' at Detection and Classification of Acoustic Scenes and Events (DCASE 2017)
20 | 
21 | 3. Bird audio detection 
22 |    - Sharath Adavanne, Konstantinos Drossos, Emre Cakir and Tuomas Virtanen, '[Stacked convolutional and recurrent neural networks for bird audio detection](https://arxiv.org/abs/1706.02047 "Arxiv paper")' at European Signal Processing Conference (EUSIPCO 2017)
23 |    - Emre Cakir, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos and Tuomas Virtanen, '[Convolutional recurrent neural networks for bird audio detection](https://arxiv.org/abs/1703.02317 "Arxiv paper")' at European Signal Processing Conference (EUSIPCO 2017)
24 | 
25 | 4. Music emotion recognition
26 |    - Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, Roman Jarina , '[Stacked convolutional and recurrent neural networks for music emotion recognition](https://arxiv.org/abs/1706.02292 "Arxiv paper")', at Sound and Music Computing Conference (SMC 2017)
27 | 
28 | ## More about SEDnet
29 | The proposed SEDnet is shown in the figure below. The input to the method is either a single or multichannel audio. The log mel-band energy feature is then extracted from each channel of the corresponding input audio. These audio features are fed to a convolutional recurrent neural network that maps them to the activities of the sound event classes in the dataset. The output of the neural network is in the continuous range of [0, 1] for each of the sound event classes and corresponds to the probability of the particular sound class being active in the frame. This continuous range output is further thresholded to obtain the final binary decision of the sound event class being active or absent in each frame. In general, the proposed method takes a sequence of frame-wise audio features as the input and predicts the activity of the target sound event classes for each of the input frames.
30 | 
31 | <p align="center">
32 |    <img src="https://github.com/sharathadavanne/multichannel-sed-crnn/blob/master/images/CRNN_SED_DCASE2017_task3.jpg" width="400" title="SEDnet Architecture">
33 | </p>
34 | 
35 | 
36 | ## Getting Started
37 | 
38 | This repository is built around the DCASE 2017 task 3 dataset, and consists of four Python scripts. 
39 | * The feature.py script, extracts the features, labels, and normalizes the training and test split features. Make sure you update the location of the wav files, evaluation setup and folder to write features in before running it. 
40 | * The sed.py script, loads the normalized features, and traines the SEDnet. The training stops when the error rate metric in one second segment (http://tut-arg.github.io/sed_eval/) stops improving.
41 | * The metrics.py script, implements the core metrics from sound event detection evaluation module http://tut-arg.github.io/sed_eval/
42 | * The utils.py script has some utility functions.
43 | 
44 | If you are only interested in the SEDnet model then just check  `get_model()` function in the sed.py script.
45 | 
46 | 
47 | ### Prerequisites
48 | 
49 | The requirements.txt file consists of the libraries and their versions used. The Python script is written and tested in 3.7.3 version. You can install the requirements by running the following line
50 | 
51 | ```
52 | pip install -r requirements.txt
53 | ```
54 | ## Training the SEDnet on development dataset of DCASE 2017
55 | 
56 | * Download the dataset from https://zenodo.org/record/814831#.Ws2xO3VuYUE
57 | * Update the path of the `audio/street/` and `evaluation_setup` folders of the dataset in feature.py script. Also update the `feat_folder` variable with a local folder where the script can dump the extracted feature files. Run the script `python feature.py` this will save the features and labels of training and test splits in the provided `feat_folder`. Change the flag `is_mono` to `True` for single channel SED, and `False` for multichannel SED. Since the dataset used has only binaural audio, by setting `is_mono = False`, the SEDnet trains on binaural audio.
58 | * In the sed.py script, update the `feat_folder` path as used in feature.py script.  Change the `is_mono` flag according to single or multichannel SED studies and run the script `python sed.py`. This should train on the default training split of the dataset, and evaluate the model on the testing split for all four folds.
59 | 
60 | The sound event detection metrics - error rate (ER) and F-score for one second segment averaged over four folds are as following. Since the dataset is small the results vary quite a bit, hence we report the mean of five separate runs. An ideal SED method has an ER of 0 and F of 1.
61 | 
62 | | SEDnet mode | ER | F|
63 | | ----| --- | --- |
64 | | Single channel | 0.60 | 0.57 |
65 | | Multichannel |0.60 | 0.59|
66 | 
67 | The results vary from the original paper, as we are not using the evaluation split here
68 | 
69 | ## License
70 | 
71 | This repository is licensed under the TUT License - see the [LICENSE](LICENSE) file for details
72 | 
73 | ## Acknowledgments
74 | 
75 | The research leading to these results has received funding from the European Research Council under the European Unions H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND.
76 | 


--------------------------------------------------------------------------------
/feature.py:
--------------------------------------------------------------------------------
  1 | import wave
  2 | import numpy as np
  3 | import utils
  4 | import librosa
  5 | from IPython import embed
  6 | import os
  7 | from sklearn import preprocessing
  8 | 
  9 | 
 10 | def load_audio(filename, mono=True, fs=44100):
 11 |     """Load audio file into numpy array
 12 |     Supports 24-bit wav-format
 13 |     
 14 |     Taken from TUT-SED system: https://github.com/TUT-ARG/DCASE2016-baseline-system-python
 15 |     
 16 |     Parameters
 17 |     ----------
 18 |     filename:  str
 19 |         Path to audio file
 20 | 
 21 |     mono : bool
 22 |         In case of multi-channel audio, channels are averaged into single channel.
 23 |         (Default value=True)
 24 | 
 25 |     fs : int > 0 [scalar]
 26 |         Target sample rate, if input audio does not fulfil this, audio is resampled.
 27 |         (Default value=44100)
 28 | 
 29 |     Returns
 30 |     -------
 31 |     audio_data : numpy.ndarray [shape=(signal_length, channel)]
 32 |         Audio
 33 | 
 34 |     sample_rate : integer
 35 |         Sample rate
 36 | 
 37 |     """
 38 | 
 39 |     file_base, file_extension = os.path.splitext(filename)
 40 |     if file_extension == '.wav':
 41 |         _audio_file = wave.open(filename)
 42 | 
 43 |         # Audio info
 44 |         sample_rate = _audio_file.getframerate()
 45 |         sample_width = _audio_file.getsampwidth()
 46 |         number_of_channels = _audio_file.getnchannels()
 47 |         number_of_frames = _audio_file.getnframes()
 48 | 
 49 |         # Read raw bytes
 50 |         data = _audio_file.readframes(number_of_frames)
 51 |         _audio_file.close()
 52 | 
 53 |         # Convert bytes based on sample_width
 54 |         num_samples, remainder = divmod(len(data), sample_width * number_of_channels)
 55 |         if remainder > 0:
 56 |             raise ValueError('The length of data is not a multiple of sample size * number of channels.')
 57 |         if sample_width > 4:
 58 |             raise ValueError('Sample size cannot be bigger than 4 bytes.')
 59 | 
 60 |         if sample_width == 3:
 61 |             # 24 bit audio
 62 |             a = np.empty((num_samples, number_of_channels, 4), dtype=np.uint8)
 63 |             raw_bytes = np.fromstring(data, dtype=np.uint8)
 64 |             a[:, :, :sample_width] = raw_bytes.reshape(-1, number_of_channels, sample_width)
 65 |             a[:, :, sample_width:] = (a[:, :, sample_width - 1:sample_width] >> 7) * 255
 66 |             audio_data = a.view('<i4').reshape(a.shape[:-1]).T
 67 |         else:
 68 |             # 8 bit samples are stored as unsigned ints; others as signed ints.
 69 |             dt_char = 'u' if sample_width == 1 else 'i'
 70 |             a = np.fromstring(data, dtype='<%s%d' % (dt_char, sample_width))
 71 |             audio_data = a.reshape(-1, number_of_channels).T
 72 | 
 73 |         if mono:
 74 |             # Down-mix audio
 75 |             audio_data = np.mean(audio_data, axis=0)
 76 | 
 77 |         # Convert int values into float
 78 |         audio_data = audio_data / float(2 ** (sample_width * 8 - 1) + 1)
 79 | 
 80 |         # Resample
 81 |         if fs != sample_rate:
 82 |             audio_data = librosa.core.resample(audio_data, sample_rate, fs)
 83 |             sample_rate = fs
 84 | 
 85 |         return audio_data, sample_rate
 86 |     return None, None
 87 | 
 88 | 
 89 | def load_desc_file(_desc_file):
 90 |     _desc_dict = dict()
 91 |     for line in open(_desc_file):
 92 |         words = line.strip().split('\t')
 93 |         name = words[0].split('/')[-1]
 94 |         if name not in _desc_dict:
 95 |             _desc_dict[name] = list()
 96 |         _desc_dict[name].append([float(words[2]), float(words[3]), __class_labels[words[-1]]])
 97 |     return _desc_dict
 98 | 
 99 | 
100 | def extract_mbe(_y, _sr, _nfft, _nb_mel):
101 |     spec, n_fft = librosa.core.spectrum._spectrogram(y=_y, n_fft=_nfft, hop_length=_nfft//2, power=1)
102 |     mel_basis = librosa.filters.mel(sr=_sr, n_fft=_nfft, n_mels=_nb_mel)
103 |     return np.log(np.dot(mel_basis, spec))
104 | 
105 | # ###################################################################
106 | #              Main script starts here
107 | # ###################################################################
108 | 
109 | is_mono = True
110 | __class_labels = {
111 |     'brakes squeaking': 0,
112 |     'car': 1,
113 |     'children': 2,
114 |     'large vehicle': 3,
115 |     'people speaking': 4,
116 |     'people walking': 5
117 | }
118 | 
119 | # location of data.
120 | folds_list = [1, 2, 3, 4]
121 | evaluation_setup_folder = '/scratch/asignal/sharath/DCASE2017/TUT-sound-events-2017-development/evaluation_setup'
122 | audio_folder = '/scratch/asignal/sharath/DCASE2017/TUT-sound-events-2017-development/audio/street'
123 | 
124 | # Output
125 | feat_folder = '/scratch/asignal/sharath/DCASE2017/TUT-sound-events-2017-development/feat/'
126 | utils.create_folder(feat_folder)
127 | 
128 | # User set parameters
129 | nfft = 2048
130 | win_len = nfft
131 | hop_len = win_len // 2
132 | nb_mel_bands = 40
133 | sr = 44100
134 | 
135 | # -----------------------------------------------------------------------
136 | # Feature extraction and label generation
137 | # -----------------------------------------------------------------------
138 | # Load labels
139 | train_file = os.path.join(evaluation_setup_folder, 'street_fold{}_train.txt'.format(1))
140 | evaluate_file = os.path.join(evaluation_setup_folder, 'street_fold{}_evaluate.txt'.format(1))
141 | desc_dict = load_desc_file(train_file)
142 | desc_dict.update(load_desc_file(evaluate_file)) # contains labels for all the audio in the dataset
143 | 
144 | # Extract features for all audio files, and save it along with labels
145 | for audio_filename in os.listdir(audio_folder):
146 |     audio_file = os.path.join(audio_folder, audio_filename)
147 |     print('Extracting features and label for : {}'.format(audio_file))
148 |     y, sr = load_audio(audio_file, mono=is_mono, fs=sr)
149 |     mbe = None
150 | 
151 |     if is_mono:
152 |         mbe = extract_mbe(y, sr, nfft, nb_mel_bands).T
153 |     else:
154 |         for ch in range(y.shape[0]):
155 |             mbe_ch = extract_mbe(y[ch, :], sr, nfft, nb_mel_bands).T
156 |             if mbe is None:
157 |                 mbe = mbe_ch
158 |             else:
159 |                 mbe = np.concatenate((mbe, mbe_ch), 1)
160 | 
161 |     label = np.zeros((mbe.shape[0], len(__class_labels)))
162 |     tmp_data = np.array(desc_dict[audio_filename])
163 |     frame_start = np.floor(tmp_data[:, 0] * sr / hop_len).astype(int)
164 |     frame_end = np.ceil(tmp_data[:, 1] * sr / hop_len).astype(int)
165 |     se_class = tmp_data[:, 2].astype(int)
166 |     for ind, val in enumerate(se_class):
167 |         label[frame_start[ind]:frame_end[ind], val] = 1
168 |     tmp_feat_file = os.path.join(feat_folder, '{}_{}.npz'.format(audio_filename, 'mon' if is_mono else 'bin'))
169 |     np.savez(tmp_feat_file, mbe, label)
170 | 
171 | # -----------------------------------------------------------------------
172 | # Feature Normalization
173 | # -----------------------------------------------------------------------
174 | 
175 | for fold in folds_list:
176 |     train_file = os.path.join(evaluation_setup_folder, 'street_fold{}_train.txt'.format(fold))
177 |     evaluate_file = os.path.join(evaluation_setup_folder, 'street_fold{}_evaluate.txt'.format(fold))
178 |     train_dict = load_desc_file(train_file)
179 |     test_dict = load_desc_file(evaluate_file)
180 | 
181 |     X_train, Y_train, X_test, Y_test = None, None, None, None
182 |     for key in train_dict.keys():
183 |         tmp_feat_file = os.path.join(feat_folder, '{}_{}.npz'.format(key, 'mon' if is_mono else 'bin'))
184 |         dmp = np.load(tmp_feat_file)
185 |         tmp_mbe, tmp_label = dmp['arr_0'], dmp['arr_1']
186 |         if X_train is None:
187 |             X_train, Y_train = tmp_mbe, tmp_label
188 |         else:
189 |             X_train, Y_train = np.concatenate((X_train, tmp_mbe), 0), np.concatenate((Y_train, tmp_label), 0)
190 | 
191 |     for key in test_dict.keys():
192 |         tmp_feat_file = os.path.join(feat_folder, '{}_{}.npz'.format(key, 'mon' if is_mono else 'bin'))
193 |         dmp = np.load(tmp_feat_file)
194 |         tmp_mbe, tmp_label = dmp['arr_0'], dmp['arr_1']
195 |         if X_test is None:
196 |             X_test, Y_test = tmp_mbe, tmp_label
197 |         else:
198 |             X_test, Y_test = np.concatenate((X_test, tmp_mbe), 0), np.concatenate((Y_test, tmp_label), 0)
199 | 
200 |     # Normalize the training data, and scale the testing data using the training data weights
201 |     scaler = preprocessing.StandardScaler()
202 |     X_train = scaler.fit_transform(X_train)
203 |     X_test = scaler.transform(X_test)
204 | 
205 |     normalized_feat_file = os.path.join(feat_folder, 'mbe_{}_fold{}.npz'.format('mon' if is_mono else 'bin', fold))
206 |     np.savez(normalized_feat_file, X_train, Y_train, X_test, Y_test)
207 |     print('normalized_feat_file : {}'.format(normalized_feat_file))
208 | 
209 | 
210 | 
211 | 
212 | 


--------------------------------------------------------------------------------
/images/CRNN_SED_DCASE2017_task3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sharathadavanne/sed-crnn/4d0bb802ee17b45dae3115bc0e213e75b71d9b0c/images/CRNN_SED_DCASE2017_task3.jpg


--------------------------------------------------------------------------------
/metrics.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import utils
 3 | #####################
 4 | # Scoring functions
 5 | #
 6 | # Code blocks taken from Toni Heittola's repository: http://tut-arg.github.io/sed_eval/
 7 | #
 8 | # Implementation of the Metrics in the following paper:
 9 | # Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, 'Metrics for polyphonic sound event detection',
10 | # Applied Sciences, 6(6):162, 2016
11 | #####################
12 | 
13 | 
14 | def f1_overall_framewise(O, T):
15 |     if len(O.shape) == 3:
16 |         O, T = utils.reshape_3Dto2D(O), utils.reshape_3Dto2D(T)
17 |     TP = ((2 * T - O) == 1).sum()
18 |     Nref, Nsys = T.sum(), O.sum()
19 | 
20 |     prec = float(TP) / float(Nsys + utils.eps)
21 |     recall = float(TP) / float(Nref + utils.eps)
22 |     f1_score = 2 * prec * recall / (prec + recall + utils.eps)
23 |     return f1_score
24 | 
25 | 
26 | def er_overall_framewise(O, T):
27 |     if len(O.shape) == 3:
28 |         O, T = utils.reshape_3Dto2D(O), utils.reshape_3Dto2D(T)
29 |     FP = np.logical_and(T == 0, O == 1).sum(1)
30 |     FN = np.logical_and(T == 1, O == 0).sum(1)
31 | 
32 |     S = np.minimum(FP, FN).sum()
33 |     D = np.maximum(0, FN-FP).sum()
34 |     I = np.maximum(0, FP-FN).sum()
35 | 
36 |     Nref = T.sum()
37 |     ER = (S+D+I) / (Nref + 0.0)
38 |     return ER
39 | 
40 | 
41 | def f1_overall_1sec(O, T, block_size):
42 |     if len(O.shape) == 3:
43 |         O, T = utils.reshape_3Dto2D(O), utils.reshape_3Dto2D(T)
44 |     new_size = int(np.ceil(O.shape[0] / block_size))
45 |     O_block = np.zeros((new_size, O.shape[1]))
46 |     T_block = np.zeros((new_size, O.shape[1]))
47 |     for i in range(0, new_size):
48 |         O_block[i, :] = np.max(O[int(i * block_size):int(i * block_size + block_size - 1), ], axis=0)
49 |         T_block[i, :] = np.max(T[int(i * block_size):int(i * block_size + block_size - 1), ], axis=0)
50 |     return f1_overall_framewise(O_block, T_block)
51 | 
52 | 
53 | def er_overall_1sec(O, T, block_size):
54 |     if len(O.shape) == 3:
55 |         O, T = utils.reshape_3Dto2D(O), utils.reshape_3Dto2D(T)
56 |     new_size = int(O.shape[0] / block_size)
57 |     O_block = np.zeros((new_size, O.shape[1]))
58 |     T_block = np.zeros((new_size, O.shape[1]))
59 |     for i in range(0, new_size):
60 |         O_block[i, :] = np.max(O[int(i * block_size):int(i * block_size + block_size - 1), ], axis=0)
61 |         T_block[i, :] = np.max(T[int(i * block_size):int(i * block_size + block_size - 1), ], axis=0)
62 |     return er_overall_framewise(O_block, T_block)
63 | 
64 | 
65 | def compute_scores(pred, y, frames_in_1_sec=50):
66 |     scores = dict()
67 |     scores['f1_overall_1sec'] = f1_overall_1sec(pred, y, frames_in_1_sec)
68 |     scores['er_overall_1sec'] = er_overall_1sec(pred, y, frames_in_1_sec)
69 |     return scores
70 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | keras==2.2.4
2 | numpy==1.16.4
3 | matplotlib==3.1.0
4 | librosa==0.7.0
5 | IPython==7.6.1
6 | sklearn==0.21.2
7 | 


--------------------------------------------------------------------------------
/sed.py:
--------------------------------------------------------------------------------
  1 | from __future__ import print_function
  2 | import os
  3 | import numpy as np
  4 | import time
  5 | import sys
  6 | import matplotlib.pyplot as plot
  7 | from keras.layers import Bidirectional, TimeDistributed, Conv2D, MaxPooling2D, Input, GRU, Dense, Activation, Dropout, Reshape, Permute
  8 | from keras.layers.normalization import BatchNormalization
  9 | from keras.models import Model
 10 | from sklearn.metrics import confusion_matrix
 11 | import metrics
 12 | import utils
 13 | from IPython import embed
 14 | import keras.backend as K
 15 | K.set_image_data_format('channels_first')
 16 | plot.switch_backend('agg')
 17 | sys.setrecursionlimit(10000)
 18 | 
 19 | 
 20 | def load_data(_feat_folder, _mono, _fold=None):
 21 |     feat_file_fold = os.path.join(_feat_folder, 'mbe_{}_fold{}.npz'.format('mon' if _mono else 'bin', _fold))
 22 |     dmp = np.load(feat_file_fold)
 23 |     _X_train, _Y_train, _X_test, _Y_test = dmp['arr_0'],  dmp['arr_1'],  dmp['arr_2'],  dmp['arr_3']
 24 |     return _X_train, _Y_train, _X_test, _Y_test
 25 | 
 26 | 
 27 | def get_model(data_in, data_out, _cnn_nb_filt, _cnn_pool_size, _rnn_nb, _fc_nb):
 28 | 
 29 |     spec_start = Input(shape=(data_in.shape[-3], data_in.shape[-2], data_in.shape[-1]))
 30 |     spec_x = spec_start
 31 |     for _i, _cnt in enumerate(_cnn_pool_size):
 32 |         spec_x = Conv2D(filters=_cnn_nb_filt, kernel_size=(3, 3), padding='same')(spec_x)
 33 |         spec_x = BatchNormalization(axis=1)(spec_x)
 34 |         spec_x = Activation('relu')(spec_x)
 35 |         spec_x = MaxPooling2D(pool_size=(1, _cnn_pool_size[_i]))(spec_x)
 36 |         spec_x = Dropout(dropout_rate)(spec_x)
 37 |     spec_x = Permute((2, 1, 3))(spec_x)
 38 |     spec_x = Reshape((data_in.shape[-2], -1))(spec_x)
 39 | 
 40 |     for _r in _rnn_nb:
 41 |         spec_x = Bidirectional(
 42 |             GRU(_r, activation='tanh', dropout=dropout_rate, recurrent_dropout=dropout_rate, return_sequences=True),
 43 |             merge_mode='mul')(spec_x)
 44 | 
 45 |     for _f in _fc_nb:
 46 |         spec_x = TimeDistributed(Dense(_f))(spec_x)
 47 |         spec_x = Dropout(dropout_rate)(spec_x)
 48 | 
 49 |     spec_x = TimeDistributed(Dense(data_out.shape[-1]))(spec_x)
 50 |     out = Activation('sigmoid', name='strong_out')(spec_x)
 51 | 
 52 |     _model = Model(inputs=spec_start, outputs=out)
 53 |     _model.compile(optimizer='Adam', loss='binary_crossentropy')
 54 |     _model.summary()
 55 |     return _model
 56 | 
 57 | 
 58 | def plot_functions(_nb_epoch, _tr_loss, _val_loss, _f1, _er, extension=''):
 59 |     plot.figure()
 60 | 
 61 |     plot.subplot(211)
 62 |     plot.plot(range(_nb_epoch), _tr_loss, label='train loss')
 63 |     plot.plot(range(_nb_epoch), _val_loss, label='val loss')
 64 |     plot.legend()
 65 |     plot.grid(True)
 66 | 
 67 |     plot.subplot(212)
 68 |     plot.plot(range(_nb_epoch), _f1, label='f')
 69 |     plot.plot(range(_nb_epoch), _er, label='er')
 70 |     plot.legend()
 71 |     plot.grid(True)
 72 | 
 73 |     plot.savefig(__models_dir + __fig_name + extension)
 74 |     plot.close()
 75 |     print('figure name : {}'.format(__fig_name))
 76 | 
 77 | 
 78 | def preprocess_data(_X, _Y, _X_test, _Y_test, _seq_len, _nb_ch):
 79 |     # split into sequences
 80 |     _X = utils.split_in_seqs(_X, _seq_len)
 81 |     _Y = utils.split_in_seqs(_Y, _seq_len)
 82 | 
 83 |     _X_test = utils.split_in_seqs(_X_test, _seq_len)
 84 |     _Y_test = utils.split_in_seqs(_Y_test, _seq_len)
 85 | 
 86 |     _X = utils.split_multi_channels(_X, _nb_ch)
 87 |     _X_test = utils.split_multi_channels(_X_test, _nb_ch)
 88 |     return _X, _Y, _X_test, _Y_test
 89 | 
 90 | 
 91 | #######################################################################################
 92 | # MAIN SCRIPT STARTS HERE
 93 | #######################################################################################
 94 | 
 95 | is_mono = True  # True: mono-channel input, False: binaural input
 96 | 
 97 | feat_folder = '/scratch/asignal/sharath/DCASE2017/TUT-sound-events-2017-development/feat/'
 98 | __fig_name = '{}_{}'.format('mon' if is_mono else 'bin', time.strftime("%Y_%m_%d_%H_%M_%S"))
 99 | 
100 | 
101 | nb_ch = 1 if is_mono else 2
102 | batch_size = 128    # Decrease this if you want to run on smaller GPU's
103 | seq_len = 256       # Frame sequence length. Input to the CRNN.
104 | nb_epoch = 500      # Training epochs
105 | patience = int(0.25 * nb_epoch)  # Patience for early stopping
106 | 
107 | # Number of frames in 1 second, required to calculate F and ER for 1 sec segments.
108 | # Make sure the nfft and sr are the same as in feature.py
109 | sr = 44100
110 | nfft = 2048
111 | frames_1_sec = int(sr/(nfft/2.0))
112 | 
113 | print('\n\nUNIQUE ID: {}'.format(__fig_name))
114 | print('TRAINING PARAMETERS: nb_ch: {}, seq_len: {}, batch_size: {}, nb_epoch: {}, frames_1_sec: {}'.format(
115 |     nb_ch, seq_len, batch_size, nb_epoch, frames_1_sec))
116 | 
117 | # Folder for saving model and training curves
118 | __models_dir = 'models/'
119 | utils.create_folder(__models_dir)
120 | 
121 | # CRNN model definition
122 | cnn_nb_filt = 128            # CNN filter size
123 | cnn_pool_size = [5, 2, 2]   # Maxpooling across frequency. Length of cnn_pool_size =  number of CNN layers
124 | rnn_nb = [32, 32]           # Number of RNN nodes.  Length of rnn_nb =  number of RNN layers
125 | fc_nb = [32]                # Number of FC nodes.  Length of fc_nb =  number of FC layers
126 | dropout_rate = 0.5          # Dropout after each layer
127 | print('MODEL PARAMETERS:\n cnn_nb_filt: {}, cnn_pool_size: {}, rnn_nb: {}, fc_nb: {}, dropout_rate: {}'.format(
128 |     cnn_nb_filt, cnn_pool_size, rnn_nb, fc_nb, dropout_rate))
129 | 
130 | avg_er = list()
131 | avg_f1 = list()
132 | for fold in [1, 2, 3, 4]:
133 |     print('\n\n----------------------------------------------')
134 |     print('FOLD: {}'.format(fold))
135 |     print('----------------------------------------------\n')
136 |     # Load feature and labels, pre-process it
137 |     X, Y, X_test, Y_test = load_data(feat_folder, is_mono, fold)
138 |     X, Y, X_test, Y_test = preprocess_data(X, Y, X_test, Y_test, seq_len, nb_ch)
139 | 
140 |     # Load model
141 |     model = get_model(X, Y, cnn_nb_filt, cnn_pool_size, rnn_nb, fc_nb)
142 | 
143 |     # Training
144 |     best_epoch, pat_cnt, best_er, f1_for_best_er, best_conf_mat = 0, 0, 99999, None, None
145 |     tr_loss, val_loss, f1_overall_1sec_list, er_overall_1sec_list = [0] * nb_epoch, [0] * nb_epoch, [0] * nb_epoch, [0] * nb_epoch
146 |     posterior_thresh = 0.5
147 |     for i in range(nb_epoch):
148 |         print('Epoch : {} '.format(i), end='')
149 |         hist = model.fit(
150 |             X, Y,
151 |             batch_size=batch_size,
152 |             validation_data=[X_test, Y_test],
153 |             epochs=1,
154 |             verbose=2
155 |         )
156 |         val_loss[i] = hist.history.get('val_loss')[-1]
157 |         tr_loss[i] = hist.history.get('loss')[-1]
158 | 
159 |         # Calculate the predictions on test data, in order to calculate ER and F scores
160 |         pred = model.predict(X_test)
161 |         pred_thresh = pred > posterior_thresh
162 |         score_list = metrics.compute_scores(pred_thresh, Y_test, frames_in_1_sec=frames_1_sec)
163 | 
164 |         f1_overall_1sec_list[i] = score_list['f1_overall_1sec']
165 |         er_overall_1sec_list[i] = score_list['er_overall_1sec']
166 |         pat_cnt = pat_cnt + 1
167 | 
168 |         # Calculate confusion matrix
169 |         test_pred_cnt = np.sum(pred_thresh, 2)
170 |         Y_test_cnt = np.sum(Y_test, 2)
171 |         conf_mat = confusion_matrix(Y_test_cnt.reshape(-1), test_pred_cnt.reshape(-1))
172 |         conf_mat = conf_mat / (utils.eps + np.sum(conf_mat, 1)[:, None].astype('float'))
173 | 
174 |         if er_overall_1sec_list[i] < best_er:
175 |             best_conf_mat = conf_mat
176 |             best_er = er_overall_1sec_list[i]
177 |             f1_for_best_er = f1_overall_1sec_list[i]
178 |             model.save(os.path.join(__models_dir, '{}_fold_{}_model.h5'.format(__fig_name, fold)))
179 |             best_epoch = i
180 |             pat_cnt = 0
181 | 
182 |         print('tr Er : {}, val Er : {}, F1_overall : {}, ER_overall : {} Best ER : {}, best_epoch: {}'.format(
183 |                 tr_loss[i], val_loss[i], f1_overall_1sec_list[i], er_overall_1sec_list[i], best_er, best_epoch))
184 |         plot_functions(nb_epoch, tr_loss, val_loss, f1_overall_1sec_list, er_overall_1sec_list, '_fold_{}'.format(fold))
185 |         if pat_cnt > patience:
186 |             break
187 |     avg_er.append(best_er)
188 |     avg_f1.append(f1_for_best_er)
189 |     print('saved model for the best_epoch: {} with best_f1: {} f1_for_best_er: {}'.format(
190 |         best_epoch, best_er, f1_for_best_er))
191 |     print('best_conf_mat: {}'.format(best_conf_mat))
192 |     print('best_conf_mat_diag: {}'.format(np.diag(best_conf_mat)))
193 | 
194 | print('\n\nMETRICS FOR ALL FOUR FOLDS: avg_er: {}, avg_f1: {}'.format(avg_er, avg_f1))
195 | print('MODEL AVERAGE OVER FOUR FOLDS: avg_er: {}, avg_f1: {}'.format(np.mean(avg_er), np.mean(avg_f1)))
196 | 


--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import numpy as np
 3 | 
 4 | eps = np.finfo(np.float).eps
 5 | 
 6 | 
 7 | def create_folder(_fold_path):
 8 |     if not os.path.exists(_fold_path):
 9 |         os.makedirs(_fold_path)
10 | 
11 | 
12 | def reshape_3Dto2D(A):
13 |     return A.reshape(A.shape[0] * A.shape[1], A.shape[2])
14 | 
15 | 
16 | def split_multi_channels(data, num_channels):
17 |     in_shape = data.shape
18 |     if len(in_shape) == 3:
19 |         hop = in_shape[2] // num_channels
20 |         tmp = np.zeros((in_shape[0], num_channels, in_shape[1], hop))
21 |         for i in range(num_channels):
22 |             tmp[:, i, :, :] = data[:, :, i * hop:(i + 1) * hop]
23 |     else:
24 |         print("ERROR: The input should be a 3D matrix but it seems to have dimensions ", in_shape)
25 |         exit()
26 |     return tmp
27 | 
28 | 
29 | def split_in_seqs(data, subdivs):
30 |     if len(data.shape) == 1:
31 |         if data.shape[0] % subdivs:
32 |             data = data[:-(data.shape[0] % subdivs), :]
33 |         data = data.reshape((data.shape[0] // subdivs, subdivs, 1))
34 |     elif len(data.shape) == 2:
35 |         if data.shape[0] % subdivs:
36 |             data = data[:-(data.shape[0] % subdivs), :]
37 |         data = data.reshape((data.shape[0] // subdivs, subdivs, data.shape[1]))
38 |     elif len(data.shape) == 3:
39 |         if data.shape[0] % subdivs:
40 |             data = data[:-(data.shape[0] % subdivs), :, :]
41 |         data = data.reshape((data.shape[0] // subdivs, subdivs, data.shape[1], data.shape[2]))
42 |     return data
43 | 


--------------------------------------------------------------------------------