├── README.md
├── appendixes
├── fold1_plot.png
├── results.png
├── split1_ir0_ov1_1_pred.png
└── split1_ir0_ov1_1_ref.png
├── evaluation_tools
├── cls_feature_class.py
└── evaluation_metrics.py
├── licenses
├── MIT_LICENSE.md
└── TUT_LICENSE.md
├── pytorch
├── evaluate.py
├── losses.py
├── main.py
├── models.py
└── pytorch_utils.py
├── runme.sh
└── utils
├── config.py
├── data_generator.py
├── features.py
├── plot_results.py
└── utilities.py
/README.md:
--------------------------------------------------------------------------------
1 | # DCASE 2019 Task 3 Sound Event Localization and Detection
2 |
3 | DCASE 2019 Task 3 Sound Event Localization and Detection (SELD) is a task that requires jointly localizing and recognizing individual sound events and estimating their respective temporal onset and offset times. A more detailed description of this task can be found at http://dcase.community/challenge2019/task-sound-event-localization-and-detection.
4 |
5 | ## DATASET
6 | The dataset can be downloaded from http://dcase.community/challenge2019/task-sound-event-localization-and-detection. It contains 400 one-minute audio recordings sampled at 48 kHz. Two audio formats, First-Order Ambisonic (FOA) and microphone array (MIC), are provided for each recording; both have 4 channels. Each one-minute recording contains synthetic polyphonic sound events from 11 classes.
7 |
8 | The statistics of the data are shown below:
9 |
10 | |      | Attributes        | Dev. recordings | Eval. recordings |
11 | |:----:|:-----------------:|:---------------:|:---------------:|
12 | | Data | FOA & MIC, 48 kHz | 400 | - |
13 |
14 | The log mel spectrograms of the scenes are shown below:
15 |
16 |
17 |
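A log mel spectrogram of a 4-channel recording can be computed roughly as follows. This is a minimal sketch using librosa; the hop length, FFT size and number of mel bins below are assumptions loosely based on the "64melbins" / 20 ms hop settings mentioned elsewhere in this repository and may differ from utils/features.py:

```python
import numpy as np
import librosa

def logmel_4ch(wav_path, sample_rate=48000, n_fft=2048, hop_length=960, mel_bins=64):
    """Compute a log mel spectrogram for each of the 4 channels.

    Returns an array of shape (channels, frames, mel_bins).
    """
    # Load all channels without downmixing to mono
    audio, _ = librosa.load(wav_path, sr=sample_rate, mono=False)
    melW = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=mel_bins)
    logmels = []
    for channel in audio:
        stft = librosa.stft(channel, n_fft=n_fft, hop_length=hop_length, window='hann')
        mel = np.dot(melW, np.abs(stft) ** 2)       # power mel spectrogram
        logmels.append(librosa.power_to_db(mel).T)  # (frames, mel_bins)
    return np.stack(logmels)
```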
18 | ## Run the code
19 |
20 | **0. Prepare data**
21 |
22 | Download and unzip the data. The directory structure should look like:
23 |
24 |
25 | dataset_root
26 | ├── metadata_dev (400 files)
27 | │ ├── split1_ir0_ov1_10.csv
28 | │ └── ...
29 | ├── foa_dev (400 files)
30 | │ ├── split1_ir0_ov1_10.wav
31 | │ └── ...
32 | ├── mic_dev (400 files)
33 | │ ├── split1_ir0_ov1_10.wav
34 | │ └── ...
35 | └── ...
36 |
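To sanity-check the download, each of the three folders above should contain 400 files. A small sketch (the dataset_root path below is a placeholder for your local path):

```python
import os

dataset_root = '/path/to/dataset_root'  # placeholder, set to your local path

for subdir in ['metadata_dev', 'foa_dev', 'mic_dev']:
    path = os.path.join(dataset_root, subdir)
    n_files = len([f for f in os.listdir(path) if not f.startswith('.')])
    print('{}: {} files'.format(subdir, n_files))  # each should report 400 files
```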
37 |
38 | **1. Requirements**
39 |
40 | Python 3.6 + PyTorch 1.0
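A quick way to check the environment, for example:

```python
import sys
import torch

print(sys.version)               # expect 3.6.x
print(torch.__version__)         # expect 1.0.x
print(torch.cuda.is_available()) # True if a CUDA GPU is usable
```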
41 |
42 | **2. Then simply run:**
43 |
44 | $ ./runme.sh
45 |
46 | Or run the commands in runme.sh line by line. The commands include:
47 |
48 | (1) Modify the paths of the dataset and your workspace
49 |
50 | (2) Extract features
51 |
52 | (3) Train model
53 |
54 | (4) Inference
55 |
56 | ## Model
57 | We apply convolutional neural networks that take the log mel spectrogram of the 4-channel audio as input. The targets are the onset and offset times, elevation and azimuth of the sound events. Training a 9-layer CNN with a mini-batch size of 32 takes approximately 200 ms per iteration on a single GTX Titan Xp GPU. The model is trained for 5000 iterations. The training log looks like:
58 |
59 |
60 | Load data time: 90.292 s
61 | Training audio num: 300
62 | Validation audio num: 100
63 | ------------------------------------
64 | ...
65 | ------------------------------------
66 | iteration: 5000
67 | train statistics: total_loss: 0.076, event_loss: 0.007, position_loss: 0.069
68 | Total 10 files written to /vol/vssp/msos/qk/workspaces/dcase2019_task3/_temp/submissions/main/Cnn_9layers_foa_dev_logmel_64frames_64melbins
69 | sed_error_rate : 0.057
70 | sed_f1_score : 0.971
71 | doa_error : 8.902
72 | doa_frame_recall : 0.966
73 | seld_score : 0.042
74 | validate statistics: total_loss: 0.449, event_loss: 0.039, position_loss: 0.409
75 | Total 10 files written to /vol/vssp/msos/qk/workspaces/dcase2019_task3/_temp/submissions/main/Cnn_9layers_foa_dev_logmel_64frames_64melbins
76 | sed_error_rate : 0.206
77 | sed_f1_score : 0.875
78 | doa_error : 33.374
79 | doa_frame_recall : 0.894
80 | seld_score : 0.156
81 | train time: 20.135 s, validate time: 7.023 s
82 | Model saved to /vol/vssp/msos/qk/workspaces/dcase2019_task3/models/main/Cnn_9layers_foa_dev_logmel_64frames_64melbins/holdout_fold=1/md_5000_iters.pth
83 | ------------------------------------
84 | ...
85 |
86 |
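For reference, a 9-layer CNN of the kind described above could be sketched as follows. This is only a minimal sketch: the layer widths, pooling and output heads here are assumptions and do not necessarily match pytorch/models.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """Two 3x3 convolutions, each followed by batch norm and ReLU."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        return x


class Cnn9LayersSketch(nn.Module):
    """Input: (batch, 4 channels, frames, mel_bins). Outputs frame-wise event
    probabilities and regressed elevation / azimuth per class."""
    def __init__(self, classes_num=11):
        super().__init__()
        self.block1 = ConvBlock(4, 64)
        self.block2 = ConvBlock(64, 128)
        self.block3 = ConvBlock(128, 256)
        self.block4 = ConvBlock(256, 512)
        self.event_fc = nn.Linear(512, classes_num)
        self.elevation_fc = nn.Linear(512, classes_num)
        self.azimuth_fc = nn.Linear(512, classes_num)

    def forward(self, x):
        # Pool only along the mel-bin axis so the frame resolution is kept
        x = F.avg_pool2d(self.block1(x), kernel_size=(1, 2))
        x = F.avg_pool2d(self.block2(x), kernel_size=(1, 2))
        x = F.avg_pool2d(self.block3(x), kernel_size=(1, 2))
        x = self.block4(x)
        x = torch.mean(x, dim=3)   # (batch, 512, frames)
        x = x.transpose(1, 2)      # (batch, frames, 512)
        event = torch.sigmoid(self.event_fc(x))
        elevation = self.elevation_fc(x)
        azimuth = self.azimuth_fc(x)
        return event, elevation, azimuth
```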
87 | ## Results
88 |
89 | **Validation results on 400 audio files**
90 |
91 |
92 |
93 | The 9-layer CNN achieves slightly better results than the other CNNs. The baseline system result is from [2], which uses phase information as extra input and obtains a better DOA result. Our system uses only the log mel spectrogram magnitude as input, without phase.
94 |
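The seld_score reported above is the average of the error rate, one minus the F-score, the DOA error normalized by 180 degrees, and one minus the frame recall (see compute_seld_metric in evaluation_tools/evaluation_metrics.py). For example, the validation numbers from the training log above give:

```python
import numpy as np

# Validation metrics copied from the training log above
sed_error_rate, sed_f1_score = 0.206, 0.875
doa_error, doa_frame_recall = 33.374, 0.894

seld_score = np.mean([sed_error_rate,
                      1 - sed_f1_score,
                      doa_error / 180,
                      1 - doa_frame_recall])
print(round(seld_score, 3))  # 0.156, matching the log
```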
95 | **Results plotted over training iterations**
96 |
97 |
98 |
99 | The 5-layer and 9-layer CNNs achieve similar results. The 13-layer CNN tends to overfit.
100 |
101 | **Visualization of the predictions**
102 |
103 |
104 |
105 | We are able to predict the DOA using only the log mel spectrogram magnitude as input.
106 |
107 | ## Summary
108 | This codebase provides a convolutional neural network (CNN) for DCASE 2019 challenge Task 3 Sound Event Localization and Detection.
109 |
110 | ## Citation
111 |
112 | **If this codebase is helpful, please feel free to cite the following paper:**
113 |
114 | **[1] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yong Xu, Wenwu Wang, Mark D. Plumbley. Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems. arXiv preprint arXiv:1904.03476 (2019).**
115 |
116 | ## FAQ
117 | If you run out of GPU memory, try reducing batch_size.
118 |
119 | ## License
120 | The file evaluation_tools/cls_feature_class.py is under the TUT_LICENSE.
121 | 
122 | All files other than evaluation_tools/cls_feature_class.py are under the MIT_LICENSE.
123 |
124 | ## External link
125 |
126 | [2] https://github.com/sharathadavanne/seld-dcase2019
127 |
128 | [3] http://dcase.community/challenge2019/task-audio-tagging
129 |
--------------------------------------------------------------------------------
/appendixes/fold1_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/qiuqiangkong/dcase2019_task3/aa40091cd9ce49149201634c3a8da2fc01ffd67c/appendixes/fold1_plot.png
--------------------------------------------------------------------------------
/appendixes/results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/qiuqiangkong/dcase2019_task3/aa40091cd9ce49149201634c3a8da2fc01ffd67c/appendixes/results.png
--------------------------------------------------------------------------------
/appendixes/split1_ir0_ov1_1_pred.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/qiuqiangkong/dcase2019_task3/aa40091cd9ce49149201634c3a8da2fc01ffd67c/appendixes/split1_ir0_ov1_1_pred.png
--------------------------------------------------------------------------------
/appendixes/split1_ir0_ov1_1_ref.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/qiuqiangkong/dcase2019_task3/aa40091cd9ce49149201634c3a8da2fc01ffd67c/appendixes/split1_ir0_ov1_1_ref.png
--------------------------------------------------------------------------------
/evaluation_tools/cls_feature_class.py:
--------------------------------------------------------------------------------
1 | # Contains routines for labels creation, features extraction and normalization
2 | #
3 |
4 |
5 | import os
6 | import numpy as np
7 | import scipy.io.wavfile as wav
8 | from sklearn import preprocessing
9 | from sklearn.externals import joblib
10 | from IPython import embed
11 | import matplotlib.pyplot as plot
12 | import librosa
13 | # plot.switch_backend('agg')
14 |
15 |
16 | class FeatureClass:
17 | def __init__(self, dataset_dir='', feat_label_dir='', dataset='foa', is_eval=False):
18 | """
19 |
20 | :param dataset: string, dataset name, supported: foa - ambisonic or mic- microphone format
21 | :param is_eval: if True, does not load dataset labels.
22 | """
23 |
24 | # Input directories
25 | self._feat_label_dir = feat_label_dir
26 | self._dataset_dir = dataset_dir
27 | self._dataset_combination = '{}_{}'.format(dataset, 'eval' if is_eval else 'dev')
28 | self._aud_dir = os.path.join(self._dataset_dir, self._dataset_combination)
29 |
30 | self._desc_dir = None if is_eval else os.path.join(self._dataset_dir, 'metadata_dev')
31 |
32 | # Output directories
33 | self._label_dir = None
34 | self._feat_dir = None
35 | self._feat_dir_norm = None
36 |
37 | # Local parameters
38 | self._is_eval = is_eval
39 |
40 | self._fs = 48000
41 | self._hop_len_s = 0.02
42 | self._hop_len = int(self._fs * self._hop_len_s)
43 | self._frame_res = self._fs / float(self._hop_len)
44 | self._nb_frames_1s = int(self._frame_res)
45 |
46 | self._win_len = 2 * self._hop_len
47 | self._nfft = self._next_greater_power_of_2(self._win_len)
48 |
49 | self._dataset = dataset
50 | self._eps = np.spacing(np.float(1e-16))
51 | self._nb_channels = 4
52 |
53 | # Sound event classes dictionary # DCASE 2016 Task 2 sound events
54 | self._unique_classes = dict()
55 | self._unique_classes = \
56 | {
57 | 'clearthroat': 2,
58 | 'cough': 8,
59 | 'doorslam': 9,
60 | 'drawer': 1,
61 | 'keyboard': 6,
62 | 'keysDrop': 4,
63 | 'knock': 0,
64 | 'laughter': 10,
65 | 'pageturn': 7,
66 | 'phone': 3,
67 | 'speech': 5
68 | }
69 |
70 | self._doa_resolution = 10
71 | self._azi_list = range(-180, 180, self._doa_resolution)
72 | self._length = len(self._azi_list)
73 | self._ele_list = range(-40, 50, self._doa_resolution)
74 | self._height = len(self._ele_list)
75 |
76 | self._audio_max_len_samples = 60 * self._fs # TODO: Fix the audio synthesis code to always generate 60s of
77 | # audio. Currently it generates audio till the last active sound event, which is not always 60s long. This is a
78 | # quick fix to overcome that. We need this because, for processing and training we need the length of features
79 | # to be fixed.
80 |
81 | # For regression task only
82 | self._default_azi = 180
83 | self._default_ele = 50
84 |
85 | if self._default_azi in self._azi_list:
86 | print('ERROR: chosen default_azi value {} should not exist in azi_list'.format(self._default_azi))
87 | exit()
88 | if self._default_ele in self._ele_list:
89 | print('ERROR: chosen default_ele value {} should not exist in ele_list'.format(self._default_ele))
90 | exit()
91 |
92 | self._max_frames = int(np.ceil(self._audio_max_len_samples / float(self._hop_len)))
93 |
94 | def _load_audio(self, audio_path):
95 | fs, audio = wav.read(audio_path)
96 | audio = audio[:, :self._nb_channels] / 32768.0 + self._eps
97 | if audio.shape[0] < self._audio_max_len_samples:
98 | zero_pad = np.zeros((self._audio_max_len_samples - audio.shape[0], audio.shape[1]))
99 | audio = np.vstack((audio, zero_pad))
100 | elif audio.shape[0] > self._audio_max_len_samples:
101 | audio = audio[:self._audio_max_len_samples, :]
102 | return audio, fs
103 |
104 | # INPUT FEATURES
105 | @staticmethod
106 | def _next_greater_power_of_2(x):
107 | return 2 ** (x - 1).bit_length()
108 |
109 | def _spectrogram(self, audio_input):
110 | _nb_ch = audio_input.shape[1]
111 | nb_bins = self._nfft // 2
112 | spectra = np.zeros((self._max_frames, nb_bins, _nb_ch), dtype=complex)
113 | for ch_cnt in range(_nb_ch):
114 | stft_ch = librosa.core.stft(audio_input[:, ch_cnt], n_fft=self._nfft, hop_length=self._hop_len,
115 | win_length=self._win_len, window='hann')
116 | spectra[:, :, ch_cnt] = stft_ch[1:, :self._max_frames].T
117 | return spectra
118 |
119 | def _extract_spectrogram_for_file(self, audio_filename):
120 | audio_in, fs = self._load_audio(os.path.join(self._aud_dir, audio_filename))
121 | audio_spec = self._spectrogram(audio_in)
122 | # print('\t{}'.format(audio_spec.shape))
123 | np.save(os.path.join(self._feat_dir, '{}.npy'.format(audio_filename.split('.')[0])), audio_spec.reshape(self._max_frames, -1))
124 |
125 | # OUTPUT LABELS
126 | def read_desc_file(self, desc_filename, in_sec=False):
127 | desc_file = {
128 | 'class': list(), 'start': list(), 'end': list(), 'ele': list(), 'azi': list()
129 | }
130 | fid = open(desc_filename, 'r')
131 | next(fid)
132 | for line in fid:
133 | split_line = line.strip().split(',')
134 | desc_file['class'].append(split_line[0])
135 | # desc_file['class'].append(split_line[0].split('.')[0][:-3])
136 | if in_sec:
137 | # return onset-offset time in seconds
138 | desc_file['start'].append(float(split_line[1]))
139 | desc_file['end'].append(float(split_line[2]))
140 | else:
141 | # return onset-offset time in frames
142 | desc_file['start'].append(int(np.floor(float(split_line[1])*self._frame_res)))
143 | desc_file['end'].append(int(np.ceil(float(split_line[2])*self._frame_res)))
144 | desc_file['ele'].append(int(split_line[3]))
145 | desc_file['azi'].append(int(split_line[4]))
146 | fid.close()
147 | return desc_file
148 |
149 | def get_list_index(self, azi, ele):
150 | azi = (azi - self._azi_list[0]) // 10
151 | ele = (ele - self._ele_list[0]) // 10
152 | return azi * self._height + ele
153 |
154 | def get_matrix_index(self, ind):
155 | azi, ele = ind // self._height, ind % self._height
156 | azi = (azi * 10 + self._azi_list[0])
157 | ele = (ele * 10 + self._ele_list[0])
158 | return azi, ele
159 |
160 | def _get_doa_labels_regr(self, _desc_file):
161 | azi_label = self._default_azi*np.ones((self._max_frames, len(self._unique_classes)))
162 | ele_label = self._default_ele*np.ones((self._max_frames, len(self._unique_classes)))
163 | for i, ele_ang in enumerate(_desc_file['ele']):
164 | start_frame = _desc_file['start'][i]
165 | end_frame = self._max_frames if _desc_file['end'][i] > self._max_frames else _desc_file['end'][i]
166 | azi_ang = _desc_file['azi'][i]
167 | class_ind = self._unique_classes[_desc_file['class'][i]]
168 | if (azi_ang >= self._azi_list[0]) & (azi_ang <= self._azi_list[-1]) & \
169 | (ele_ang >= self._ele_list[0]) & (ele_ang <= self._ele_list[-1]):
170 | azi_label[start_frame:end_frame + 1, class_ind] = azi_ang
171 | ele_label[start_frame:end_frame + 1, class_ind] = ele_ang
172 | else:
173 | print('bad_angle {} {}'.format(azi_ang, ele_ang))
174 | doa_label_regr = np.concatenate((azi_label, ele_label), axis=1)
175 | return doa_label_regr
176 |
177 | def _get_se_labels(self, _desc_file):
178 | se_label = np.zeros((self._max_frames, len(self._unique_classes)))
179 | for i, se_class in enumerate(_desc_file['class']):
180 | start_frame = _desc_file['start'][i]
181 | end_frame = self._max_frames if _desc_file['end'][i] > self._max_frames else _desc_file['end'][i]
182 | se_label[start_frame:end_frame + 1, self._unique_classes[se_class]] = 1
183 | return se_label
184 |
185 | def get_labels_for_file(self, _desc_file):
186 | """
187 | Reads description csv file and returns classification based SED labels and regression based DOA labels
188 |
189 | :param _desc_file: csv file
190 | :return: label_mat: labels of the format [sed_label, doa_label],
191 | where sed_label is of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero
192 | where doa_labels is of dimension [nb_frames, 2*nb_classes], nb_classes each for azimuth and elevation angles,
193 | if active, the DOA values will be in degrees, else, it will contain default doa values given by
194 | self._default_ele and self._default_azi
195 | """
196 |
197 | se_label = self._get_se_labels(_desc_file)
198 | doa_label = self._get_doa_labels_regr(_desc_file)
199 | label_mat = np.concatenate((se_label, doa_label), axis=1)
200 | # print(label_mat.shape)
201 | return label_mat
202 |
203 | def get_clas_labels_for_file(self, _desc_file):
204 | """
205 | Reads description file and returns classification format labels for SELD
206 |
207 | :param _desc_file: csv file
208 | :return: _labels: matrix of SELD labels of dimension [nb_frames, nb_classes, nb_azi*nb_ele],
209 | which is 1 for active sound event and location else zero
210 | """
211 |
212 | _labels = np.zeros((self._max_frames, len(self._unique_classes), len(self._azi_list) * len(self._ele_list)))
213 | for _ind, _start_frame in enumerate(_desc_file['start']):
214 | _tmp_class = self._unique_classes[_desc_file['class'][_ind]]
215 | _tmp_azi = _desc_file['azi'][_ind]
216 | _tmp_ele = _desc_file['ele'][_ind]
217 | _tmp_end = self._max_frames if _desc_file['end'][_ind] > self._max_frames else _desc_file['end'][_ind]
218 | _tmp_ind = self.get_list_index(_tmp_azi, _tmp_ele)
219 | _labels[_start_frame:_tmp_end + 1, _tmp_class, _tmp_ind] = 1
220 |
221 | return _labels
222 |
223 | # ------------------------------- EXTRACT FEATURE AND PREPROCESS IT -------------------------------
224 | def extract_all_feature(self):
225 | # setting up folders
226 | self._feat_dir = self.get_unnormalized_feat_dir()
227 | create_folder(self._feat_dir)
228 |
229 | # extraction starts
230 | print('Extracting spectrogram:')
231 | print('\t\taud_dir {}\n\t\tdesc_dir {}\n\t\tfeat_dir {}'.format(
232 | self._aud_dir, self._desc_dir, self._feat_dir))
233 |
234 | for file_cnt, file_name in enumerate(os.listdir(self._aud_dir)):
235 | print('{}: {}'.format(file_cnt, file_name))
236 | wav_filename = '{}.wav'.format(file_name.split('.')[0])
237 | self._extract_spectrogram_for_file(wav_filename)
238 |
239 | def preprocess_features(self):
240 | # Setting up folders and filenames
241 | self._feat_dir = self.get_unnormalized_feat_dir()
242 | self._feat_dir_norm = self.get_normalized_feat_dir()
243 | create_folder(self._feat_dir_norm)
244 | normalized_features_wts_file = self.get_normalized_wts_file()
245 | spec_scaler = None
246 |
247 | # pre-processing starts
248 | if self._is_eval:
249 | spec_scaler = joblib.load(normalized_features_wts_file)
250 | print('Normalized_features_wts_file: {}. Loaded.'.format(normalized_features_wts_file))
251 |
252 | else:
253 | print('Estimating weights for normalizing feature files:')
254 | print('\t\tfeat_dir: {}'.format(self._feat_dir))
255 |
256 | spec_scaler = preprocessing.StandardScaler()
257 | for file_cnt, file_name in enumerate(os.listdir(self._feat_dir)):
258 | print('{}: {}'.format(file_cnt, file_name))
259 | feat_file = np.load(os.path.join(self._feat_dir, file_name))
260 | spec_scaler.partial_fit(np.concatenate((np.abs(feat_file), np.angle(feat_file)), axis=1))
261 | del feat_file
262 | joblib.dump(
263 | spec_scaler,
264 | normalized_features_wts_file
265 | )
266 | print('Normalized_features_wts_file: {}. Saved.'.format(normalized_features_wts_file))
267 |
268 | print('Normalizing feature files:')
269 | print('\t\tfeat_dir_norm {}'.format(self._feat_dir_norm))
270 | for file_cnt, file_name in enumerate(os.listdir(self._feat_dir)):
271 | print('{}: {}'.format(file_cnt, file_name))
272 | feat_file = np.load(os.path.join(self._feat_dir, file_name))
273 | feat_file = spec_scaler.transform(np.concatenate((np.abs(feat_file), np.angle(feat_file)), axis=1))
274 | np.save(
275 | os.path.join(self._feat_dir_norm, file_name),
276 | feat_file
277 | )
278 | del feat_file
279 |
280 | print('normalized files written to {}'.format(self._feat_dir_norm))
281 |
282 | # ------------------------------- EXTRACT LABELS AND PREPROCESS IT -------------------------------
283 | def extract_all_labels(self):
284 | self._label_dir = self.get_label_dir()
285 |
286 | print('Extracting labels:')
287 | print('\t\taud_dir {}\n\t\tdesc_dir {}\n\t\tlabel_dir {}'.format(
288 | self._aud_dir, self._desc_dir, self._label_dir))
289 | create_folder(self._label_dir)
290 |
291 | for file_cnt, file_name in enumerate(os.listdir(self._desc_dir)):
292 | print('{}: {}'.format(file_cnt, file_name))
293 | wav_filename = '{}.wav'.format(file_name.split('.')[0])
294 | desc_file = self.read_desc_file(os.path.join(self._desc_dir, file_name))
295 | label_mat = self.get_labels_for_file(desc_file)
296 | np.save(os.path.join(self._label_dir, '{}.npy'.format(wav_filename.split('.')[0])), label_mat)
297 |
298 | # ------------------------------- Misc public functions -------------------------------
299 | def get_classes(self):
300 | return self._unique_classes
301 |
302 | def get_normalized_feat_dir(self):
303 | return os.path.join(
304 | self._feat_label_dir,
305 | '{}_norm'.format(self._dataset_combination)
306 | )
307 |
308 | def get_unnormalized_feat_dir(self):
309 | return os.path.join(
310 | self._feat_label_dir,
311 | '{}'.format(self._dataset_combination)
312 | )
313 |
314 | def get_label_dir(self):
315 | if self._is_eval:
316 | return None
317 | else:
318 | return os.path.join(
319 | self._feat_label_dir, '{}_label'.format(self._dataset_combination)
320 | )
321 |
322 | def get_normalized_wts_file(self):
323 | return os.path.join(
324 | self._feat_label_dir,
325 | '{}_wts'.format(self._dataset)
326 | )
327 |
328 | def get_default_azi_ele_regr(self):
329 | return self._default_azi, self._default_ele
330 |
331 | def get_nb_channels(self):
332 | return self._nb_channels
333 |
334 | def nb_frames_1s(self):
335 | return self._nb_frames_1s
336 |
337 | def get_hop_len_sec(self):
338 | return self._hop_len_s
339 |
340 | def get_azi_ele_list(self):
341 | return self._azi_list, self._ele_list
342 |
343 | def get_nb_frames(self):
344 | return self._max_frames
345 |
346 |
347 | def create_folder(folder_name):
348 | if not os.path.exists(folder_name):
349 | print('{} folder does not exist, creating it.'.format(folder_name))
350 | os.makedirs(folder_name)
--------------------------------------------------------------------------------
/evaluation_tools/evaluation_metrics.py:
--------------------------------------------------------------------------------
1 | #
2 | # Implements the core metrics from sound event detection evaluation module http://tut-arg.github.io/sed_eval/ and
3 | # The DOA metrics are explained in the SELDnet paper
4 | #
5 |
6 | import numpy as np
7 | from scipy.optimize import linear_sum_assignment
8 | from IPython import embed
9 | eps = np.finfo(np.float).eps
10 |
11 |
12 | ##########################################################################################
13 | # SELD scoring functions - class implementation
14 | #
15 | # NOTE: Supports only one-hot labels for both SED and DOA. Doesn't work for the baseline method
16 | # directly, since it estimates DOA with a regression approach. Check below the class for
17 | # one-shot (function) implementations of all metrics. The function implementation has
18 | # support for both one-hot labels and regression values of DOA estimation.
19 | ##########################################################################################
20 |
21 | class SELDMetrics(object):
22 | def __init__(self, nb_frames_1s=None, data_gen=None):
23 | # SED params
24 | self._S = 0
25 | self._D = 0
26 | self._I = 0
27 | self._TP = 0
28 | self._Nref = 0
29 | self._Nsys = 0
30 | self._block_size = nb_frames_1s
31 |
32 | # DOA params
33 | self._doa_loss_pred_cnt = 0
34 | self._nb_frames = 0
35 |
36 | self._doa_loss_pred = 0
37 | self._nb_good_pks = 0
38 |
39 | self._data_gen = data_gen
40 |
41 | self._less_est_cnt, self._less_est_frame_cnt = 0, 0
42 | self._more_est_cnt, self._more_est_frame_cnt = 0, 0
43 |
44 | def f1_overall_framewise(self, O, T):
45 | TP = ((2 * T - O) == 1).sum()
46 | Nref, Nsys = T.sum(), O.sum()
47 | self._TP += TP
48 | self._Nref += Nref
49 | self._Nsys += Nsys
50 |
51 | def er_overall_framewise(self, O, T):
52 | FP = np.logical_and(T == 0, O == 1).sum(1)
53 | FN = np.logical_and(T == 1, O == 0).sum(1)
54 | S = np.minimum(FP, FN).sum()
55 | D = np.maximum(0, FN - FP).sum()
56 | I = np.maximum(0, FP - FN).sum()
57 | self._S += S
58 | self._D += D
59 | self._I += I
60 |
61 | def f1_overall_1sec(self, O, T):
62 | new_size = int(np.ceil(O.shape[0] / self._block_size))
63 | O_block = np.zeros((new_size, O.shape[1]))
64 | T_block = np.zeros((new_size, O.shape[1]))
65 | for i in range(0, new_size):
66 | O_block[i, :] = np.max(O[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0)
67 | T_block[i, :] = np.max(T[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0)
68 | return self.f1_overall_framewise(O_block, T_block)
69 |
70 | def er_overall_1sec(self, O, T):
71 | new_size = int(O.shape[0] / self._block_size)
72 | O_block = np.zeros((new_size, O.shape[1]))
73 | T_block = np.zeros((new_size, O.shape[1]))
74 | for i in range(0, new_size):
75 | O_block[i, :] = np.max(O[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0)
76 | T_block[i, :] = np.max(T[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0)
77 | return self.er_overall_framewise(O_block, T_block)
78 |
79 | def update_sed_scores(self, pred, gt):
80 | """
81 | Computes SED metrics for one second segments
82 |
83 | :param pred: predicted matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0
84 | :param gt: reference matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0
85 | :param nb_frames_1s: integer, number of frames in one second
86 | :return:
87 | """
88 | self.f1_overall_1sec(pred, gt)
89 | self.er_overall_1sec(pred, gt)
90 |
91 | def compute_sed_scores(self):
92 | ER = (self._S + self._D + self._I) / (self._Nref + 0.0)
93 |
94 | prec = float(self._TP) / float(self._Nsys + eps)
95 | recall = float(self._TP) / float(self._Nref + eps)
96 | F = 2 * prec * recall / (prec + recall + eps)
97 |
98 | return ER, F
99 |
100 | def update_doa_scores(self, pred_doa_thresholded, gt_doa):
101 | '''
102 | Compute DOA metrics when DOA is estimated using classification approach
103 |
104 | :param pred_doa_thresholded: predicted results of dimension [nb_frames, nb_classes, nb_azi*nb_ele],
105 | with value 1 when sound event active, else 0
106 | :param gt_doa: reference results of dimension [nb_frames, nb_classes, nb_azi*nb_ele],
107 | with value 1 when sound event active, else 0
108 | :param data_gen_test: feature or data generator class
109 |
110 | :return: DOA metrics
111 |
112 | '''
113 | self._doa_loss_pred_cnt += np.sum(pred_doa_thresholded)
114 | self._nb_frames += pred_doa_thresholded.shape[0]
115 |
116 | for frame in range(pred_doa_thresholded.shape[0]):
117 | nb_gt_peaks = int(np.sum(gt_doa[frame, :]))
118 | nb_pred_peaks = int(np.sum(pred_doa_thresholded[frame, :]))
119 |
120 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction
121 | if nb_gt_peaks == nb_pred_peaks:
122 | self._nb_good_pks += 1
123 | elif nb_gt_peaks > nb_pred_peaks:
124 | self._less_est_frame_cnt += 1
125 | self._less_est_cnt += (nb_gt_peaks - nb_pred_peaks)
126 | elif nb_pred_peaks > nb_gt_peaks:
127 | self._more_est_frame_cnt += 1
128 | self._more_est_cnt += (nb_pred_peaks - nb_gt_peaks)
129 |
130 | # when nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas
131 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas
132 | if nb_gt_peaks and nb_pred_peaks:
133 | pred_ind = np.where(pred_doa_thresholded[frame] == 1)[1]
134 | pred_list_rad = np.array(self._data_gen .get_matrix_index(pred_ind)) * np.pi / 180
135 |
136 | gt_ind = np.where(gt_doa[frame] == 1)[1]
137 | gt_list_rad = np.array(self._data_gen .get_matrix_index(gt_ind)) * np.pi / 180
138 |
139 | frame_dist = distance_between_gt_pred(gt_list_rad.T, pred_list_rad.T)
140 | self._doa_loss_pred += frame_dist
141 |
142 | def compute_doa_scores(self):
143 | doa_error = self._doa_loss_pred / self._doa_loss_pred_cnt
144 | frame_recall = self._nb_good_pks / float(self._nb_frames)
145 | return doa_error, frame_recall
146 |
147 | def reset(self):
148 | # SED params
149 | self._S = 0
150 | self._D = 0
151 | self._I = 0
152 | self._TP = 0
153 | self._Nref = 0
154 | self._Nsys = 0
155 |
156 | # DOA params
157 | self._doa_loss_pred_cnt = 0
158 | self._nb_frames = 0
159 |
160 | self._doa_loss_pred = 0
161 | self._nb_good_pks = 0
162 |
163 | self._less_est_cnt, self._less_est_frame_cnt = 0, 0
164 | self._more_est_cnt, self._more_est_frame_cnt = 0, 0
165 |
166 |
167 | ###############################################################
168 | # SED scoring functions
169 | ###############################################################
170 |
171 |
172 | def reshape_3Dto2D(A):
173 | return A.reshape(A.shape[0] * A.shape[1], A.shape[2])
174 |
175 |
176 | def f1_overall_framewise(O, T):
177 | if len(O.shape) == 3:
178 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T)
179 | TP = ((2 * T - O) == 1).sum()
180 | Nref, Nsys = T.sum(), O.sum()
181 |
182 | prec = float(TP) / float(Nsys + eps)
183 | recall = float(TP) / float(Nref + eps)
184 | f1_score = 2 * prec * recall / (prec + recall + eps)
185 | return f1_score
186 |
187 |
188 | def er_overall_framewise(O, T):
189 | if len(O.shape) == 3:
190 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T)
191 |
192 | FP = np.logical_and(T == 0, O == 1).sum(1)
193 | FN = np.logical_and(T == 1, O == 0).sum(1)
194 |
195 | S = np.minimum(FP, FN).sum()
196 | D = np.maximum(0, FN-FP).sum()
197 | I = np.maximum(0, FP-FN).sum()
198 |
199 | Nref = T.sum()
200 | ER = (S+D+I) / (Nref + 0.0)
201 | return ER
202 |
203 |
204 | def f1_overall_1sec(O, T, block_size):
205 | if len(O.shape) == 3:
206 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T)
207 | new_size = int(np.ceil(O.shape[0] / block_size))
208 | O_block = np.zeros((new_size, O.shape[1]))
209 | T_block = np.zeros((new_size, O.shape[1]))
210 | for i in range(0, new_size):
211 | O_block[i, :] = np.max(O[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0)
212 | T_block[i, :] = np.max(T[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0)
213 | return f1_overall_framewise(O_block, T_block)
214 |
215 |
216 | def er_overall_1sec(O, T, block_size):
217 | if len(O.shape) == 3:
218 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T)
219 | new_size = int(O.shape[0] / (block_size))
220 | O_block = np.zeros((new_size, O.shape[1]))
221 | T_block = np.zeros((new_size, O.shape[1]))
222 | for i in range(0, new_size):
223 | O_block[i, :] = np.max(O[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0)
224 | T_block[i, :] = np.max(T[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0)
225 | return er_overall_framewise(O_block, T_block)
226 |
227 |
228 | def compute_sed_scores(pred, gt, nb_frames_1s):
229 | """
230 | Computes SED metrics for one second segments
231 |
232 | :param pred: predicted matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0
233 | :param gt: reference matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0
234 | :param nb_frames_1s: integer, number of frames in one second
235 | :return:
236 | """
237 | f1o = f1_overall_1sec(pred, gt, nb_frames_1s)
238 | ero = er_overall_1sec(pred, gt, nb_frames_1s)
239 | scores = [ero, f1o]
240 | return scores
241 |
242 |
243 | ###############################################################
244 | # DOA scoring functions
245 | ###############################################################
246 |
247 |
248 | def compute_doa_scores_regr(pred_doa_rad, gt_doa_rad, pred_sed, gt_sed):
249 | """
250 | Compute DOA metrics when DOA is estimated using regression approach
251 |
252 | :param pred_doa_rad: predicted doa_labels is of dimension [nb_frames, 2*nb_classes],
253 | nb_classes each for azimuth and elevation angles,
254 | if active, the DOA values will be in RADIANS, else, it will contain default doa values
255 | :param gt_doa_rad: reference doa_labels is of dimension [nb_frames, 2*nb_classes],
256 | nb_classes each for azimuth and elevation angles,
257 | if active, the DOA values will be in RADIANS, else, it will contain default doa values
258 | :param pred_sed: predicted sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero
259 | :param gt_sed: reference sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero
260 | :return:
261 | """
262 |
263 | nb_src_gt_list = np.zeros(gt_doa_rad.shape[0]).astype(int)
264 | nb_src_pred_list = np.zeros(gt_doa_rad.shape[0]).astype(int)
265 | good_frame_cnt = 0
266 | doa_loss_pred = 0.0
267 | nb_sed = gt_sed.shape[-1]
268 |
269 | less_est_cnt, less_est_frame_cnt = 0, 0
270 | more_est_cnt, more_est_frame_cnt = 0, 0
271 |
272 | for frame_cnt, sed_frame in enumerate(gt_sed):
273 | nb_src_gt_list[frame_cnt] = int(np.sum(sed_frame))
274 | nb_src_pred_list[frame_cnt] = int(np.sum(pred_sed[frame_cnt]))
275 |
276 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction
277 | if nb_src_gt_list[frame_cnt] == nb_src_pred_list[frame_cnt]:
278 | good_frame_cnt = good_frame_cnt + 1
279 | elif nb_src_gt_list[frame_cnt] > nb_src_pred_list[frame_cnt]:
280 | less_est_cnt = less_est_cnt + nb_src_gt_list[frame_cnt] - nb_src_pred_list[frame_cnt]
281 | less_est_frame_cnt = less_est_frame_cnt + 1
282 | elif nb_src_gt_list[frame_cnt] < nb_src_pred_list[frame_cnt]:
283 | more_est_cnt = more_est_cnt + nb_src_pred_list[frame_cnt] - nb_src_gt_list[frame_cnt]
284 | more_est_frame_cnt = more_est_frame_cnt + 1
285 |
286 | # when nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas
287 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas
288 | if nb_src_gt_list[frame_cnt] and nb_src_pred_list[frame_cnt]:
289 | # DOA Loss with respect to predicted confidence
290 | sed_frame_gt = gt_sed[frame_cnt]
291 | doa_frame_gt_azi = gt_doa_rad[frame_cnt][:nb_sed][sed_frame_gt == 1]
292 | doa_frame_gt_ele = gt_doa_rad[frame_cnt][nb_sed:][sed_frame_gt == 1]
293 |
294 | sed_frame_pred = pred_sed[frame_cnt]
295 | doa_frame_pred_azi = pred_doa_rad[frame_cnt][:nb_sed][sed_frame_pred == 1]
296 | doa_frame_pred_ele = pred_doa_rad[frame_cnt][nb_sed:][sed_frame_pred == 1]
297 |
298 | doa_loss_pred += distance_between_gt_pred(np.vstack((doa_frame_gt_azi, doa_frame_gt_ele)).T,
299 | np.vstack((doa_frame_pred_azi, doa_frame_pred_ele)).T)
300 |
301 | doa_loss_pred_cnt = np.sum(nb_src_pred_list)
302 | if doa_loss_pred_cnt:
303 | doa_loss_pred /= doa_loss_pred_cnt
304 |
305 | frame_recall = good_frame_cnt / float(gt_sed.shape[0])
306 | er_metric = [doa_loss_pred, frame_recall, doa_loss_pred_cnt, good_frame_cnt, more_est_cnt, less_est_cnt]
307 | return er_metric
308 |
309 |
310 | def compute_doa_scores_clas(pred_doa_thresholded, gt_doa, data_gen_test):
311 | '''
312 | Compute DOA metrics when DOA is estimated using classification approach
313 |
314 | :param pred_doa_thresholded: predicted results of dimension [nb_frames, nb_classes, nb_azi*nb_ele],
315 | with value 1 when sound event active, else 0
316 | :param gt_doa: reference results of dimension [nb_frames, nb_classes, nb_azi*nb_ele],
317 | with value 1 when sound event active, else 0
318 | :param data_gen_test: feature or data generator class
319 |
320 | :return: DOA metrics
321 |
322 | '''
323 | doa_loss_pred_cnt = np.sum(pred_doa_thresholded)
324 |
325 | doa_loss_pred = 0
326 | nb_good_pks = 0
327 |
328 | less_est_cnt, less_est_frame_cnt = 0, 0
329 | more_est_cnt, more_est_frame_cnt = 0, 0
330 |
331 | for frame in range(pred_doa_thresholded.shape[0]):
332 | nb_gt_peaks = int(np.sum(gt_doa[frame, :]))
333 | nb_pred_peaks = int(np.sum(pred_doa_thresholded[frame, :]))
334 |
335 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction
336 | if nb_gt_peaks == nb_pred_peaks:
337 | nb_good_pks += 1
338 | elif nb_gt_peaks > nb_pred_peaks:
339 | less_est_frame_cnt += 1
340 | less_est_cnt += (nb_gt_peaks - nb_pred_peaks)
341 | elif nb_pred_peaks > nb_gt_peaks:
342 | more_est_frame_cnt += 1
343 | more_est_cnt += (nb_pred_peaks - nb_gt_peaks)
344 |
345 | # when nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas
346 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas
347 | if nb_gt_peaks and nb_pred_peaks:
348 | pred_ind = np.where(pred_doa_thresholded[frame] == 1)[1]
349 | pred_list_rad = np.array(data_gen_test.get_matrix_index(pred_ind)) * np.pi / 180
350 |
351 | gt_ind = np.where(gt_doa[frame] == 1)[1]
352 | gt_list_rad = np.array(data_gen_test.get_matrix_index(gt_ind)) * np.pi / 180
353 |
354 | frame_dist = distance_between_gt_pred(gt_list_rad.T, pred_list_rad.T)
355 | doa_loss_pred += frame_dist
356 |
357 | if doa_loss_pred_cnt:
358 | doa_loss_pred /= doa_loss_pred_cnt
359 |
360 | frame_recall = nb_good_pks / float(pred_doa_thresholded.shape[0])
361 | er_metric = [doa_loss_pred, frame_recall, doa_loss_pred_cnt, nb_good_pks, more_est_cnt, less_est_cnt]
362 | return er_metric
363 |
364 |
365 | def distance_between_gt_pred(gt_list_rad, pred_list_rad):
366 | """
367 | Shortest distance between two sets of spherical coordinates. Given a set of groundtruth spherical coordinates,
368 | and its respective predicted coordinates, we calculate the spherical distance between each of the spherical
369 | coordinate pairs resulting in a matrix of distances, where one axis represents the number of groundtruth
370 | coordinates and the other the predicted coordinates. The number of estimated peaks need not be the same as in
371 | groundtruth, thus the distance matrix is not always a square matrix. We use the hungarian algorithm to find the
372 | least cost in this distance matrix.
373 |
374 | :param gt_list_rad: list of ground-truth spherical coordinates
375 | :param pred_list_rad: list of predicted spherical coordinates
376 | :return: cost - distance
377 | :return: less - number of DOA's missed
378 | :return: extra - number of DOA's over-estimated
379 | """
380 |
381 | gt_len, pred_len = gt_list_rad.shape[0], pred_list_rad.shape[0]
382 | ind_pairs = np.array([[x, y] for y in range(pred_len) for x in range(gt_len)])
383 | cost_mat = np.zeros((gt_len, pred_len))
384 |
385 | # Slow implementation
386 | # cost_mat = np.zeros((gt_len, pred_len))
387 | # for gt_cnt, gt in enumerate(gt_list_rad):
388 | # for pred_cnt, pred in enumerate(pred_list_rad):
389 | # cost_mat[gt_cnt, pred_cnt] = distance_between_spherical_coordinates_rad(gt, pred)
390 |
391 | # Fast implementation
392 | if gt_len and pred_len:
393 | az1, ele1, az2, ele2 = gt_list_rad[ind_pairs[:, 0], 0], gt_list_rad[ind_pairs[:, 0], 1], \
394 | pred_list_rad[ind_pairs[:, 1], 0], pred_list_rad[ind_pairs[:, 1], 1]
395 | cost_mat[ind_pairs[:, 0], ind_pairs[:, 1]] = distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2)
396 |
397 | row_ind, col_ind = linear_sum_assignment(cost_mat)
398 | cost = cost_mat[row_ind, col_ind].sum()
399 | return cost
400 |
401 |
402 | def distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2):
403 | """
404 | Angular distance between two spherical coordinates
405 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance
406 |
407 | :return: angular distance in degrees
408 | """
409 | dist = np.sin(ele1) * np.sin(ele2) + np.cos(ele1) * np.cos(ele2) * np.cos(np.abs(az1 - az2))
410 | # Making sure the dist values are in -1 to 1 range, else np.arccos kills the job
411 | dist = np.clip(dist, -1, 1)
412 | dist = np.arccos(dist) * 180 / np.pi
413 | return dist
414 |
415 |
416 | def distance_between_cartesian_coordinates(x1, y1, z1, x2, y2, z2):
417 | """
418 | Angular distance between two cartesian coordinates
419 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance
420 | Check 'From chord length' section
421 |
422 | :return: angular distance in degrees
423 | """
424 | dist = np.sqrt((x1-x2) ** 2 + (y1-y2) ** 2 + (z1-z2) ** 2)
425 | dist = 2 * np.arcsin(dist / 2.0) * 180/np.pi
426 | return dist
427 |
428 |
429 | def sph2cart(azimuth, elevation, r):
430 | '''
431 | Convert spherical to cartesian coordinates
432 |
433 | :param azimuth: in radians
434 | :param elevation: in radians
435 | :param r: in meters
436 | :return: cartesian coordinates
437 | '''
438 |
439 | x = r * np.cos(elevation) * np.cos(azimuth)
440 | y = r * np.cos(elevation) * np.sin(azimuth)
441 | z = r * np.sin(elevation)
442 | return x, y, z
443 |
444 |
445 | def cart2sph(x, y, z):
446 | '''
447 | Convert cartesian to spherical coordinates
448 |
449 | :param x:
450 | :param y:
451 | :param z:
452 | :return: azi, ele in radians and r in meters
453 | '''
454 |
455 | azimuth = np.arctan2(y,x)
456 | elevation = np.arctan2(z,np.sqrt(x**2 + y**2))
457 | r = np.sqrt(x**2 + y**2 + z**2)
458 | return azimuth, elevation, r
459 |
460 |
461 | ###############################################################
462 | # SELD scoring functions
463 | ###############################################################
464 |
465 |
466 | def compute_seld_metric(sed_error, doa_error):
467 | """
468 | Compute SELD metric from sed and doa errors.
469 |
470 | :param sed_error: [error rate (0 to 1 range), f score (0 to 1 range)]
471 | :param doa_error: [doa error (in degrees), frame recall (0 to 1 range)]
472 | :return: seld metric result
473 | """
474 | seld_metric = np.mean([
475 | sed_error[0],
476 | 1 - sed_error[1],
477 | doa_error[0]/180,
478 | 1 - doa_error[1]]
479 | )
480 | return seld_metric
481 |
482 |
483 | def compute_seld_metrics_from_output_format_dict(_pred_dict, _gt_dict, _feat_cls):
484 | """
485 | Compute SELD metrics between _gt_dict and_pred_dict in DCASE output format
486 |
487 | :param _pred_dict: dcase output format dict
488 | :param _gt_dict: dcase output format dict
489 | :param _feat_cls: feature or data generator class
490 | :return: the seld metrics
491 | """
492 | _gt_labels = output_format_dict_to_classification_labels(_gt_dict, _feat_cls)
493 | _pred_labels = output_format_dict_to_classification_labels(_pred_dict, _feat_cls)
494 |
495 | _er, _f = compute_sed_scores(_pred_labels.max(2), _gt_labels.max(2), _feat_cls.nb_frames_1s())
496 | _doa_err, _frame_recall, d1, d2, d3, d4 = compute_doa_scores_clas(_pred_labels, _gt_labels, _feat_cls)
497 | _seld_scr = compute_seld_metric([_er, _f], [_doa_err, _frame_recall])
498 | return _seld_scr, _er, _f, _doa_err, _frame_recall
499 |
500 |
501 | ###############################################################
502 | # Functions for format conversions
503 | ###############################################################
504 |
505 | def output_format_dict_to_classification_labels(_output_dict, _feat_cls):
506 |
507 | _unique_classes = _feat_cls.get_classes()
508 | _nb_classes = len(_unique_classes)
509 | _azi_list, _ele_list = _feat_cls.get_azi_ele_list()
510 | _max_frames = _feat_cls.get_nb_frames()
511 | _labels = np.zeros((_max_frames, _nb_classes, len(_azi_list) * len(_ele_list)))
512 |
513 | for _frame_cnt in _output_dict.keys():
514 | if _frame_cnt < _max_frames:
515 | for _tmp_doa in _output_dict[_frame_cnt]:
516 | # Making sure the doa's are within the limits
517 | _tmp_doa[1] = np.clip(_tmp_doa[1], _azi_list[0], _azi_list[-1])
518 | _tmp_doa[2] = np.clip(_tmp_doa[2], _ele_list[0], _ele_list[-1])
519 |
520 | # create label
521 | _labels[_frame_cnt, _tmp_doa[0], int(_feat_cls.get_list_index(_tmp_doa[1], _tmp_doa[2]))] = 1
522 |
523 | return _labels
524 |
525 |
526 | def regression_label_format_to_output_format(_feat_cls, _sed_labels, _doa_labels_deg):
527 | """
528 | Converts the sed (classification) and doa labels predicted in regression format to dcase output format.
529 |
530 | :param _feat_cls: feature or data generator class instance
531 | :param _sed_labels: SED labels matrix [nb_frames, nb_classes]
532 | :param _doa_labels_deg: DOA labels matrix [nb_frames, 2*nb_classes] in degrees
533 | :return: _output_dict: returns a dict containing dcase output format
534 | """
535 |
536 | _unique_classes = _feat_cls.get_classes()
537 | _nb_classes = len(_unique_classes)
538 | _azi_labels = _doa_labels_deg[:, :_nb_classes]
539 | _ele_labels = _doa_labels_deg[:, _nb_classes:]
540 |
541 | _output_dict = {}
542 | for _frame_ind in range(_sed_labels.shape[0]):
543 | _tmp_ind = np.where(_sed_labels[_frame_ind, :])
544 | if len(_tmp_ind[0]):
545 | _output_dict[_frame_ind] = []
546 | for _tmp_class in _tmp_ind[0]:
547 | _output_dict[_frame_ind].append([_tmp_class, _azi_labels[_frame_ind, _tmp_class], _ele_labels[_frame_ind, _tmp_class]])
548 | return _output_dict
549 |
550 |
551 | def classification_label_format_to_output_format(_feat_cls, _labels):
552 | """
553 | Converts the seld labels predicted in classification format to dcase output format.
554 |
555 | :param _feat_cls: feature or data generator class instance
556 | :param _labels: SED labels matrix [nb_frames, nb_classes, nb_azi*nb_ele]
557 | :return: _output_dict: returns a dict containing dcase output format
558 | """
559 | _output_dict = {}
560 | for _frame_ind in range(_labels.shape[0]):
561 | _tmp_class_ind = np.where(_labels[_frame_ind].sum(1))
562 | if len(_tmp_class_ind[0]):
563 | _output_dict[_frame_ind] = []
564 | for _tmp_class in _tmp_class_ind[0]:
565 | _tmp_spatial_ind = np.where(_labels[_frame_ind, _tmp_class])
566 | for _tmp_spatial in _tmp_spatial_ind[0]:
567 | _azi, _ele = _feat_cls.get_matrix_index(_tmp_spatial)
568 | _output_dict[_frame_ind].append(
569 | [_tmp_class, _azi, _ele])
570 |
571 | return _output_dict
572 |
573 |
574 | def description_file_to_output_format(_desc_file_dict, _unique_classes, _hop_length_sec):
575 | """
576 | Reads description file in csv format. Outputs, the dcase format results in dictionary, and additionally writes it
577 | to the _output_file
578 |
579 | :param _unique_classes: unique classes dictionary, maps class name to class index
580 | :param _desc_file_dict: full path of the description file
581 | :param _hop_length_sec: hop length in seconds
582 |
583 | :return: _output_dict: dcase output in dicitionary format
584 | """
585 |
586 | _output_dict = {}
587 | for _ind, _tmp_start_sec in enumerate(_desc_file_dict['start']):
588 | _tmp_class = _unique_classes[_desc_file_dict['class'][_ind]]
589 | _tmp_azi = _desc_file_dict['azi'][_ind]
590 | _tmp_ele = _desc_file_dict['ele'][_ind]
591 | _tmp_end_sec = _desc_file_dict['end'][_ind]
592 |
593 | _start_frame = int(_tmp_start_sec / _hop_length_sec)
594 | _end_frame = int(_tmp_end_sec / _hop_length_sec)
595 | for _frame_ind in range(_start_frame, _end_frame + 1):
596 | if _frame_ind not in _output_dict:
597 | _output_dict[_frame_ind] = []
598 | _output_dict[_frame_ind].append([_tmp_class, _tmp_azi, _tmp_ele])
599 |
600 | return _output_dict
601 |
602 |
603 | def load_output_format_file(_output_format_file):
604 | """
605 | Loads DCASE output format csv file and returns it in dictionary format
606 |
607 | :param _output_format_file: DCASE output format CSV
608 | :return: _output_dict: dictionary
609 | """
610 | _output_dict = {}
611 | _fid = open(_output_format_file, 'r')
612 | # next(_fid)
613 | for _line in _fid:
614 | _words = _line.strip().split(',')
615 | _frame_ind = int(_words[0])
616 | if _frame_ind not in _output_dict:
617 | _output_dict[_frame_ind] = []
618 | _output_dict[_frame_ind].append([int(_words[1]), int(_words[2]), int(_words[3])])
619 | _fid.close()
620 | return _output_dict
621 |
622 |
623 | def write_output_format_file(_output_format_file, _output_format_dict):
624 | """
625 | Writes DCASE output format csv file, given output format dictionary
626 |
627 | :param _output_format_file:
628 | :param _output_format_dict:
629 | :return:
630 | """
631 | _fid = open(_output_format_file, 'w')
632 | # _fid.write('{},{},{},{}\n'.format('frame number with 20ms hop (int)', 'class index (int)', 'azimuth angle (int)', 'elevation angle (int)'))
633 | for _frame_ind in _output_format_dict.keys():
634 | for _value in _output_format_dict[_frame_ind]:
635 | _fid.write('{},{},{},{}\n'.format(int(_frame_ind), int(_value[0]), int(_value[1]), int(_value[2])))
636 | _fid.close()
637 |
--------------------------------------------------------------------------------
/licenses/MIT_LICENSE.md:
--------------------------------------------------------------------------------
1 | The MIT License
2 |
3 | Copyright (c) 2010-2017 Google, Inc. http://angularjs.org
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in
13 | all copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21 | THE SOFTWARE.
22 |
--------------------------------------------------------------------------------
/licenses/TUT_LICENSE.md:
--------------------------------------------------------------------------------
1 | -----------COPYRIGHT NOTICE STARTS WITH THIS LINE------------ Copyright (c) 2019 Tampere University and its licensors All rights reserved.
2 |
3 | Permission is hereby granted, without written agreement and without license or royalty fees, to use and copy the code for the Sound Event Localization and Detection using Convolutional Recurrent Neural Network method/architecture, present in the GitHub repository with the handle seld-dcase2019, (“Work”) described in the paper with title "Sound event localization and detection of overlapping sources using convolutional recurrent neural network" and composed of files with code in the Python programming language. This grant is only for experimental and non-commercial purposes, provided that the copyright notice in its entirety appear in all copies of this Work, and the original source of this Work, Audio Research Group at Tampere University, is acknowledged in any publication that reports research using this Work.
4 |
5 | Any commercial use of the Work or any part thereof is strictly prohibited. Commercial use include, but is not limited to:
6 |
7 | selling or reproducing the Work
8 | selling or distributing the results or content achieved by use of the Work
9 | providing services by using the Work.
10 | IN NO EVENT SHALL TAMPERE UNIVERSITY OR ITS LICENSORS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OF THIS WORK AND ITS DOCUMENTATION, EVEN IF TAMPERE UNIVERSITY OR ITS LICENSORS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
11 |
12 | TAMPERE UNIVERSITY AND ALL ITS LICENSORS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE WORK PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND THE TAMPERE UNIVERSITY HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
13 |
14 | -----------COPYRIGHT NOTICE ENDS WITH THIS LINE------------
15 |
--------------------------------------------------------------------------------
/pytorch/evaluate.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | sys.path.insert(1, os.path.join(sys.path[0], '../utils'))
4 |
5 | import numpy as np
6 | import time
7 | import logging
8 | import datetime
9 | import _pickle as cPickle
10 | import matplotlib.pyplot as plt
11 |
12 | from utilities import (get_filename, write_submission, calculate_metrics,
13 | inverse_scale)
14 | from pytorch_utils import forward
15 | from losses import event_spatial_loss
16 | import config
17 |
18 |
19 | class Evaluator(object):
20 | def __init__(self, model, data_generator, cuda=True):
21 | '''Evaluator to evaluate prediction performance.
22 |
23 | Args:
24 | model: object
25 | data_generator: object
26 | cuda: bool
27 | '''
28 |
29 | self.model = model
30 | self.data_generator = data_generator
31 | self.cuda = cuda
32 |
33 | self.frames_per_second = config.frames_per_second
34 | self.submission_frames_per_second = config.submission_frames_per_second
35 |
36 | def evaluate(self, data_type, metadata_dir, submissions_dir,
37 | max_validate_num=None):
38 | '''Evaluate the performance.
39 |
40 | Args:
41 | data_type: 'train' | 'validate'
42 | metadata_dir: string, directory of reference meta csvs
43 | submissions_dir: string: directory to write out submission csvs
44 | max_validate_num: None | int, maximum iteration to run to speed up
45 | evaluation
46 | '''
47 |
48 | # Forward
49 | generate_func=self.data_generator.generate_validate(
50 | data_type=data_type, max_validate_num=max_validate_num)
51 |
52 | list_dict = forward(
53 | model=self.model,
54 | generate_func=generate_func,
55 | cuda=self.cuda,
56 | return_target=True)
57 |
58 | # Calculate loss
59 | (total_loss, event_loss, position_loss) = self.calculate_loss(list_dict)
60 |
61 | logging.info('{:<20} {}: {:.3f}, {}: {:.3f}, {}: {:.3f}'
62 | ''.format(data_type + ' statistics: ', 'total_loss', total_loss,
63 | 'event_loss', event_loss, 'position_loss', position_loss))
64 |
65 | # Write out submission and evaluate using code provided by organizer
66 | write_submission(list_dict, submissions_dir)
67 |
68 | prediction_paths = [os.path.join(submissions_dir,
69 | '{}.csv'.format(dict['name'])) for dict in list_dict]
70 |
71 | statistics = calculate_metrics(metadata_dir, prediction_paths)
72 |
73 | for key in statistics.keys():
74 | logging.info(' {:<20} {:.3f}'.format(key + ' :', statistics[key]))
75 |
76 | return statistics
77 |
78 | def calculate_loss(self, list_dict):
79 | total_loss_list = []
80 | event_loss_list = []
81 | position_loss_list = []
82 |
83 | for dict in list_dict:
84 | (output_dict, target_dict) = self._get_output_target_dict(dict)
85 |
86 | (total_loss, event_loss, position_loss) = event_spatial_loss(
87 | output_dict=output_dict,
88 | target_dict=target_dict,
89 | return_individual_loss=True)
90 |
91 | total_loss_list.append(total_loss)
92 | event_loss_list.append(event_loss)
93 | position_loss_list.append(position_loss)
94 |
95 | return np.mean(total_loss_list), np.mean(event_loss_list), np.mean(position_loss_list)
96 |
97 | def _get_output_target_dict(self, dict):
98 | output_dict = {
99 | 'event': dict['output_event'],
100 | 'elevation': dict['output_elevation'],
101 | 'azimuth': dict['output_azimuth']}
102 |
103 | target_dict = {
104 | 'event': dict['target_event'],
105 | 'elevation': dict['target_elevation'],
106 | 'azimuth': dict['target_azimuth']}
107 |
108 | return output_dict, target_dict
109 |
110 |
111 | def visualize(self, data_type, max_validate_num=None):
112 | '''Visualize the log mel spectrogram, reference and prediction of
113 | sound events, elevation and azimuth.
114 |
115 | Args:
116 | data_type: 'train' | 'validate'
117 | max_validate_num: None | int, maximum iteration to run to speed up
118 | evaluation
119 | '''
120 |
121 | mel_bins = config.mel_bins
122 | frames_per_second = config.frames_per_second
123 | classes_num = config.classes_num
124 | labels = config.labels
125 |
126 | # Forward
127 | generate_func=self.data_generator.generate_validate(
128 | data_type=data_type, max_validate_num=max_validate_num)
129 |
130 | list_dict = forward(
131 | model=self.model,
132 | generate_func=generate_func,
133 | cuda=self.cuda,
134 | return_input=True,
135 | return_target=True)
136 |
137 | for n, dict in enumerate(list_dict):
138 | print('File: {}'.format(dict['name']))
139 |
140 | frames_num = dict['target_event'].shape[1]
141 | length_in_second = frames_num / float(frames_per_second)
142 |
143 | fig, axs = plt.subplots(4, 2, figsize=(15, 10))
144 | logmel = inverse_scale(dict['feature'][0][0],
145 | self.data_generator.scalar['mean'],
146 | self.data_generator.scalar['std'])
147 | axs[0, 0].matshow(logmel.T, origin='lower', aspect='auto', cmap='jet')
148 | axs[1, 0].matshow(dict['target_event'][0].T, origin='lower', aspect='auto', cmap='jet')
149 | axs[2, 0].matshow(dict['output_event'][0].T, origin='lower', aspect='auto', cmap='jet')
150 | axs[0, 1].matshow(dict['target_elevation'][0].T, origin='lower', aspect='auto', cmap='jet')
151 | axs[1, 1].matshow(dict['target_azimuth'][0].T, origin='lower', aspect='auto', cmap='jet')
152 | masksed_evaluation = dict['output_elevation'] * dict['output_event']
153 | axs[2, 1].matshow(masksed_evaluation[0].T, origin='lower', aspect='auto', cmap='jet')
154 | masksed_azimuth = dict['output_azimuth'] * dict['output_event']
155 | axs[3, 1].matshow(masksed_azimuth[0].T, origin='lower', aspect='auto', cmap='jet')
156 |
157 | axs[0,0].set_title('Log mel spectrogram', color='r')
158 | axs[1,0].set_title('Reference sound events', color='r')
159 | axs[2,0].set_title('Predicted sound events', color='b')
160 | axs[0,1].set_title('Reference elevation', color='r')
161 | axs[1,1].set_title('Reference azimuth', color='r')
162 | axs[2,1].set_title('Predicted elevation', color='b')
163 | axs[3,1].set_title('Predicted azimuth', color='b')
164 |
165 | for i in range(4):
166 | for j in range(2):
167 | axs[i, j].set_xticks([0, frames_num])
168 | axs[i, j].set_xticklabels(['0', '{:.1f} s'.format(length_in_second)])
169 | axs[i, j].xaxis.set_ticks_position('bottom')
170 | axs[i, j].set_yticks(np.arange(classes_num))
171 | axs[i, j].set_yticklabels(labels)
172 | axs[i, j].yaxis.grid(color='w', linestyle='solid', linewidth=0.2)
173 |
174 | axs[0, 0].set_ylabel('Mel bins')
175 | axs[0, 0].set_yticks([0, mel_bins])
176 | axs[0, 0].set_yticklabels([0, mel_bins])
177 | axs[3, 0].set_visible(False)
178 |
179 | fig.tight_layout()
180 | plt.show()
181 |
182 |
183 | class StatisticsContainer(object):
184 | def __init__(self, statistics_path):
185 | '''Container of statistics during training.
186 |
187 | Args:
188 | statistics_path: string, path to write out
189 | '''
190 | self.statistics_path = statistics_path
191 |
192 | self.backup_statistics_path = '{}_{}.pickle'.format(
193 | os.path.splitext(self.statistics_path)[0],
194 | datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
195 |
196 | self.statistics_list = []
197 |
198 | def append_and_dump(self, iteration, statistics):
199 | '''Append statistics to container and dump the container.
200 |
201 | Args:
202 | iteration: int
203 | statistics: dict of statistics
204 | '''
205 | statistics['iteration'] = iteration
206 | self.statistics_list.append(statistics)
207 |
208 | cPickle.dump(self.statistics_list, open(self.statistics_path, 'wb'))
209 | cPickle.dump(self.statistics_list, open(self.backup_statistics_path, 'wb'))
210 | logging.info(' Dump statistics to {}'.format(self.statistics_path))
--------------------------------------------------------------------------------
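Note: StatisticsContainer above pickles a growing list of dicts, one per evaluation, each holding an 'iteration' key plus the metric keys logged by the evaluator (sed_error_rate, sed_f1_score, doa_error, doa_frame_recall, seld_score). A minimal sketch of inspecting such a file after training; the path below is only a placeholder for the validate_statistics.pickle written by your own run:

    import _pickle as cPickle

    # Placeholder path; substitute the validate_statistics.pickle of your run.
    statistics_path = 'validate_statistics.pickle'

    with open(statistics_path, 'rb') as f:
        statistics_list = cPickle.load(f)

    last = statistics_list[-1]
    print('iteration: {}'.format(last['iteration']))
    for key in ['sed_error_rate', 'sed_f1_score', 'doa_error',
                'doa_frame_recall', 'seld_score']:
        print('  {:<20} {:.3f}'.format(key + ' :', last[key]))
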
/pytorch/losses.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn.functional as F
3 |
4 |
5 | def to_tensor(x):
6 | if type(x).__name__ == 'ndarray':
7 | return torch.Tensor(x)
8 | else:
9 | return x
10 |
11 |
12 | def binary_crossentropy(output, target):
13 | '''Binary crossentropy between output and target.
14 |
15 | Args:
16 | output: (batch_size, frames_num, classes_num)
17 | target: (batch_size, frames_num, classes_num)
18 | '''
19 | output = to_tensor(output)
20 | target = to_tensor(target)
21 |
22 |     # Trim output and target to the same number of time steps. The size
23 |     # mismatch is caused by pooling in the CNN.
24 | N = min(output.shape[1], target.shape[1])
25 |
26 | return F.binary_cross_entropy(
27 | output[:, 0 : N, :],
28 | target[:, 0 : N, :])
29 |
30 |
31 | def mean_absolute_error(output, target, mask):
32 | '''Mean absolute error between output and target.
33 |
34 | Args:
35 | output: (batch_size, frames_num, classes_num)
36 |         target, mask: (batch_size, frames_num, classes_num)
37 | '''
38 | output = to_tensor(output)
39 | target = to_tensor(target)
40 | mask = to_tensor(mask)
41 |
42 |     # Trim output, target and mask to the same number of time steps. The
43 |     # size mismatch is caused by pooling in the CNN.
44 | N = min(output.shape[1], target.shape[1])
45 |
46 | output = output[:, 0 : N, :]
47 | target = target[:, 0 : N, :]
48 | mask = mask[:, 0 : N, :]
49 |
50 | normalize_value = torch.sum(mask)
51 |
52 | return torch.sum(torch.abs(output - target) * mask) / normalize_value
53 |
54 |
55 | def event_spatial_loss(output_dict, target_dict, return_individual_loss=False):
56 | '''Joint event and spatial loss.
57 |
58 | Args:
59 | output_dict: {'event': (batch_size, frames_num, classes_num),
60 | 'elevation': (batch_size, frames_num, classes_num),
61 | 'azimuth': (batch_size, frames_num, classes_num)}
62 | target_dict: {'event': (batch_size, frames_num, classes_num),
63 | 'elevation': (batch_size, frames_num, classes_num),
64 | 'azimuth': (batch_size, frames_num, classes_num)}
65 | return_individual_loss: bool
66 |
67 | Returns:
68 |       total_loss: scalar, or (total_loss, event_loss, position_loss) if return_individual_loss is True
69 | '''
70 |
71 | event_loss = binary_crossentropy(
72 | output_dict['event'],
73 | target_dict['event'])
74 |
75 | elevation_loss = mean_absolute_error(
76 | output=output_dict['elevation'],
77 | target=target_dict['elevation'],
78 | mask=target_dict['event'])
79 |
80 | azimuth_loss = mean_absolute_error(
81 | output=output_dict['azimuth'],
82 | target=target_dict['azimuth'],
83 | mask=target_dict['event'])
84 |
85 | alpha = 0.01 # To control the balance between the event loss and position loss
86 | position_loss = alpha * (elevation_loss + azimuth_loss)
87 |
88 | total_loss = event_loss + position_loss
89 |
90 | if return_individual_loss:
91 | return total_loss, event_loss, position_loss
92 | else:
93 | return total_loss
--------------------------------------------------------------------------------
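The loss above masks the elevation and azimuth regression by the event target, so frames without an active event contribute nothing to the position loss, and alpha = 0.01 keeps the position term on a scale comparable to the event binary cross-entropy. A minimal sketch of calling event_spatial_loss on random tensors, assuming it is run from the pytorch/ directory so that losses.py is importable:

    import torch
    from losses import event_spatial_loss

    batch_size, frames_num, classes_num = 2, 64, 11

    # Event predictions must lie in (0, 1) for binary cross-entropy.
    output_dict = {
        'event': torch.rand(batch_size, frames_num, classes_num),
        'elevation': torch.randn(batch_size, frames_num, classes_num),
        'azimuth': torch.randn(batch_size, frames_num, classes_num)}

    # Binary event targets; elevation/azimuth targets only matter where event == 1.
    target_dict = {
        'event': torch.randint(0, 2, (batch_size, frames_num, classes_num)).float(),
        'elevation': torch.randn(batch_size, frames_num, classes_num),
        'azimuth': torch.randn(batch_size, frames_num, classes_num)}

    total_loss, event_loss, position_loss = event_spatial_loss(
        output_dict, target_dict, return_individual_loss=True)
    print(total_loss.item(), event_loss.item(), position_loss.item())
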
/pytorch/main.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | sys.path.insert(1, os.path.join(sys.path[0], '../utils'))
4 | import numpy as np
5 | import argparse
6 | import h5py
7 | import math
8 | import time
9 | import logging
10 | import matplotlib.pyplot as plt
11 | import torch
12 | import torch.nn as nn
13 | import torch.nn.functional as F
14 | import torch.optim as optim
15 |
16 | from utilities import (create_folder, get_filename, create_logging,
17 | load_scalar, calculate_metrics)
18 | from data_generator import DataGenerator
19 | from models import (Cnn_5layers_AvgPooling, Cnn_9layers_AvgPooling,
20 | Cnn_9layers_MaxPooling, Cnn_13layers_AvgPooling)
21 | from losses import event_spatial_loss
22 | from evaluate import Evaluator, StatisticsContainer
23 | from pytorch_utils import move_data_to_gpu, forward
24 | import config
25 |
26 |
27 | def train(args):
28 | '''Train. Model will be saved after several iterations.
29 |
30 | Args:
31 | dataset_dir: string, directory of dataset
32 | workspace: string, directory of workspace
33 | audio_type: 'foa' | 'mic'
34 | holdout_fold: '1' | '2' | '3' | '4' | 'none', set to none if using all
35 | data without validation to train
36 | model_type: string, e.g. 'Cnn_9layers_AvgPooling'
37 | batch_size: int
38 | cuda: bool
39 | mini_data: bool, set True for debugging on a small part of data
40 | '''
41 |
42 |     # Arguments & parameters
43 | dataset_dir = args.dataset_dir
44 | workspace = args.workspace
45 | audio_type = args.audio_type
46 | holdout_fold = args.holdout_fold
47 | model_type = args.model_type
48 | batch_size = args.batch_size
49 | cuda = args.cuda and torch.cuda.is_available()
50 | mini_data = args.mini_data
51 | filename = args.filename
52 |
53 | mel_bins = config.mel_bins
54 | frames_per_second = config.frames_per_second
55 | classes_num = config.classes_num
56 | max_validate_num = None # Number of audio recordings to validate
57 | reduce_lr = True # Reduce learning rate after several iterations
58 |
59 | # Paths
60 | if mini_data:
61 | prefix = 'minidata_'
62 | else:
63 | prefix = ''
64 |
65 | metadata_dir = os.path.join(dataset_dir, 'metadata_dev')
66 |
67 | features_dir = os.path.join(workspace, 'features',
68 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
69 | 'dev', frames_per_second, mel_bins))
70 |
71 | scalar_path = os.path.join(workspace, 'scalars',
72 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
73 | 'dev', frames_per_second, mel_bins), 'scalar.h5')
74 |
75 | checkpoints_dir = os.path.join(workspace, 'checkpoints', filename,
76 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
77 | 'dev', frames_per_second, mel_bins), model_type,
78 | 'holdout_fold={}'.format(holdout_fold))
79 | create_folder(checkpoints_dir)
80 |
81 |     # Results of all folds should be written to the same directory
82 | temp_submissions_dir = os.path.join(workspace, '_temp', 'submissions', filename,
83 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
84 | 'dev', frames_per_second, mel_bins), model_type)
85 | create_folder(temp_submissions_dir)
86 |
87 | validate_statistics_path = os.path.join(workspace, 'statistics', filename,
88 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
89 | 'dev', frames_per_second, mel_bins), 'holdout_fold={}'.format(holdout_fold),
90 | model_type, 'validate_statistics.pickle')
91 | create_folder(os.path.dirname(validate_statistics_path))
92 |
93 | logs_dir = os.path.join(args.workspace, 'logs', filename, args.mode,
94 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type, 'dev',
95 | frames_per_second, mel_bins), 'holdout_fold={}'.format(holdout_fold),
96 | model_type)
97 | create_logging(logs_dir, filemode='w')
98 | logging.info(args)
99 |
100 | if cuda:
101 | logging.info('Using GPU.')
102 | else:
103 | logging.info('Using CPU. Set --cuda flag to use GPU.')
104 |
105 | # Load scalar
106 | scalar = load_scalar(scalar_path)
107 |
108 | # Model
109 | Model = eval(model_type)
110 | model = Model(classes_num)
111 |
112 | if cuda:
113 | model.cuda()
114 |
115 | # Optimizer
116 | optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
117 | eps=1e-08, weight_decay=0., amsgrad=True)
118 |
119 | # Data generator
120 | data_generator = DataGenerator(
121 | features_dir=features_dir,
122 | scalar=scalar,
123 | batch_size=batch_size,
124 | holdout_fold=holdout_fold)
125 |
126 | # Evaluator
127 | evaluator = Evaluator(
128 | model=model,
129 | data_generator=data_generator,
130 | cuda=cuda)
131 |
132 | # Statistics
133 | validate_statistics_container = StatisticsContainer(validate_statistics_path)
134 |
135 | train_bgn_time = time.time()
136 | iteration = 0
137 |
138 | # Train on mini batches
139 | for batch_data_dict in data_generator.generate_train():
140 |
141 | # Evaluate
142 | if iteration % 200 == 0:
143 |
144 | logging.info('------------------------------------')
145 | logging.info('Iteration: {}'.format(iteration))
146 |
147 | train_fin_time = time.time()
148 |
149 | '''
150 | # Uncomment for evaluating on training dataset
151 | train_statistics = evaluator.evaluate(
152 | data_type='train',
153 | metadata_dir=metadata_dir,
154 | submissions_dir=temp_submissions_dir,
155 | max_validate_num=max_validate_num)
156 | '''
157 |
158 | if holdout_fold != 'none':
159 | validate_statistics = evaluator.evaluate(
160 | data_type='validate',
161 | metadata_dir=metadata_dir,
162 | submissions_dir=temp_submissions_dir,
163 | max_validate_num=max_validate_num)
164 |
165 | validate_statistics_container.append_and_dump(
166 | iteration, validate_statistics)
167 |
168 | train_time = train_fin_time - train_bgn_time
169 | validate_time = time.time() - train_fin_time
170 |
171 | logging.info(
172 | 'Train time: {:.3f} s, validate time: {:.3f} s'
173 | ''.format(train_time, validate_time))
174 |
175 | train_bgn_time = time.time()
176 |
177 | # Save model
178 | if iteration % 1000 == 0 and iteration > 0:
179 |
180 | checkpoint = {
181 | 'iteration': iteration,
182 | 'model': model.state_dict(),
183 | 'optimizer': optimizer.state_dict()}
184 |
185 | checkpoint_path = os.path.join(
186 | checkpoints_dir, '{}_iterations.pth'.format(iteration))
187 |
188 | torch.save(checkpoint, checkpoint_path)
189 | logging.info('Model saved to {}'.format(checkpoint_path))
190 |
191 | # Reduce learning rate
192 | if reduce_lr and iteration % 200 == 0 and iteration > 0:
193 | for param_group in optimizer.param_groups:
194 | param_group['lr'] *= 0.9
195 |
196 | # Move data to GPU
197 | for key in batch_data_dict.keys():
198 | batch_data_dict[key] = move_data_to_gpu(batch_data_dict[key], cuda)
199 |
200 | # Train
201 | model.train()
202 | batch_output_dict = model(batch_data_dict['feature'])
203 | loss = event_spatial_loss(batch_output_dict, batch_data_dict)
204 |
205 | # Backward
206 | optimizer.zero_grad()
207 | loss.backward()
208 | optimizer.step()
209 |
210 | # Stop learning
211 | if iteration == 5000:
212 | break
213 |
214 | iteration += 1
215 |
216 |
217 | def inference_validation(args):
218 | '''Inference validation data.
219 |
220 | Args:
221 | dataset_dir: string, directory of dataset
222 | workspace: string, directory of workspace
223 | audio_type: 'foa' | 'mic'
224 |         holdout_fold: '1' | '2' | '3' | '4' | 'none', where 'none' summarizes
225 |             and prints the results of all folds 1, 2, 3 and 4.
226 | model_type: string, e.g. 'Cnn_9layers_AvgPooling'
227 | iteration: int, load model of this iteration
228 | batch_size: int
229 | cuda: bool
230 | visualize: bool
231 | mini_data: bool, set True for debugging on a small part of data
232 | '''
233 |
234 |     # Arguments & parameters
235 | dataset_dir = args.dataset_dir
236 | workspace = args.workspace
237 | audio_type = args.audio_type
238 | holdout_fold = args.holdout_fold
239 | model_type = args.model_type
240 | iteration = args.iteration
241 | batch_size = args.batch_size
242 | cuda = args.cuda and torch.cuda.is_available()
243 | visualize = args.visualize
244 | mini_data = args.mini_data
245 | filename = args.filename
246 |
247 | mel_bins = config.mel_bins
248 | frames_per_second = config.frames_per_second
249 | classes_num = config.classes_num
250 |
251 | # Paths
252 | if mini_data:
253 | prefix = 'minidata_'
254 | else:
255 | prefix = ''
256 |
257 | metadata_dir = os.path.join(dataset_dir, 'metadata_dev')
258 |
259 | submissions_dir = os.path.join(workspace, 'submissions', filename,
260 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type, 'dev',
261 | frames_per_second, mel_bins), model_type, 'iteration={}'.format(iteration))
262 | create_folder(submissions_dir)
263 |
264 | logs_dir = os.path.join(args.workspace, 'logs', filename, args.mode,
265 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type, 'dev',
266 | frames_per_second, mel_bins), 'holdout_fold={}'.format(holdout_fold),
267 | model_type)
268 | create_logging(logs_dir, filemode='w')
269 | logging.info(args)
270 |
271 | # Inference and calculate metrics for a fold
272 | if holdout_fold != 'none':
273 |
274 | features_dir = os.path.join(workspace, 'features',
275 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
276 | 'dev', frames_per_second, mel_bins))
277 |
278 | scalar_path = os.path.join(workspace, 'scalars',
279 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
280 | 'dev', frames_per_second, mel_bins), 'scalar.h5')
281 |
282 |         checkpoint_path = os.path.join(workspace, 'checkpoints', filename,
283 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
284 | 'dev', frames_per_second, mel_bins), model_type,
285 | 'holdout_fold={}'.format(holdout_fold),
286 | '{}_iterations.pth'.format(iteration))
287 |
288 | # Load scalar
289 | scalar = load_scalar(scalar_path)
290 |
291 | # Load model
292 | Model = eval(model_type)
293 | model = Model(classes_num)
294 |         checkpoint = torch.load(checkpoint_path)
295 | model.load_state_dict(checkpoint['model'])
296 |
297 | if cuda:
298 | model.cuda()
299 |
300 | # Data generator
301 | data_generator = DataGenerator(
302 | features_dir=features_dir,
303 | scalar=scalar,
304 | batch_size=batch_size,
305 | holdout_fold=holdout_fold)
306 |
307 | # Evaluator
308 | evaluator = Evaluator(
309 | model=model,
310 | data_generator=data_generator,
311 | cuda=cuda)
312 |
313 | # Calculate metrics
314 | data_type = 'validate'
315 |
316 | evaluator.evaluate(
317 | data_type=data_type,
318 | metadata_dir=metadata_dir,
319 | submissions_dir=submissions_dir,
320 | max_validate_num=None)
321 |
322 | # Visualize reference and predicted events, elevation and azimuth
323 | if visualize:
324 | evaluator.visualize(data_type=data_type)
325 |
326 | # Calculate metrics for all 4 folds
327 | else:
328 | prediction_names = os.listdir(submissions_dir)
329 | prediction_paths = [os.path.join(submissions_dir, name) for \
330 | name in prediction_names]
331 |
332 | metrics = calculate_metrics(metadata_dir=metadata_dir,
333 | prediction_paths=prediction_paths)
334 |
335 | logging.info('Metrics of {} files: '.format(len(prediction_names)))
336 | for key in metrics.keys():
337 | logging.info(' {:<20} {:.3f}'.format(key + ' :', metrics[key]))
338 |
339 |
340 | if __name__ == '__main__':
341 |     parser = argparse.ArgumentParser(description='DCASE 2019 Task 3 training and inference.')
342 | subparsers = parser.add_subparsers(dest='mode')
343 |
344 | # Train
345 | parser_train = subparsers.add_parser('train')
346 | parser_train.add_argument('--dataset_dir', type=str, required=True, help='Directory of dataset.')
347 | parser_train.add_argument('--workspace', type=str, required=True, help='Directory of your workspace.')
348 | parser_train.add_argument('--audio_type', type=str, choices=['foa', 'mic'], required=True)
349 | parser_train.add_argument('--holdout_fold', type=str, choices=['1', '2', '3', '4', 'none'], required=True,
350 | help='Holdout fold. Set to none if using all data without validation to train. ')
351 | parser_train.add_argument('--model_type', type=str, required=True, help='E.g., Cnn_9layers_AvgPooling.')
352 | parser_train.add_argument('--batch_size', type=int, required=True)
353 | parser_train.add_argument('--cuda', action='store_true', default=False)
354 | parser_train.add_argument('--mini_data', action='store_true', default=False, help='Set True for debugging on a small part of data.')
355 |
356 | # Inference validation data
357 | parser_inference_validation = subparsers.add_parser('inference_validation')
358 | parser_inference_validation.add_argument('--dataset_dir', type=str, required=True, help='Directory of dataset.')
359 | parser_inference_validation.add_argument('--workspace', type=str, required=True, help='Directory of your workspace.')
360 | parser_inference_validation.add_argument('--audio_type', type=str, choices=['foa', 'mic'], required=True)
361 | parser_inference_validation.add_argument('--holdout_fold', type=str, choices=['1', '2', '3', '4', 'none'], required=True)
362 | parser_inference_validation.add_argument('--model_type', type=str, required=True, help='E.g., Cnn_9layers_AvgPooling.')
363 | parser_inference_validation.add_argument('--iteration', type=int, required=True, help='Load model of this iteration.')
364 | parser_inference_validation.add_argument('--batch_size', type=int, required=True)
365 | parser_inference_validation.add_argument('--cuda', action='store_true', default=False)
366 | parser_inference_validation.add_argument('--visualize', action='store_true', default=False, help='Visualize log mel spectrogram, prediction and reference')
367 | parser_inference_validation.add_argument('--mini_data', action='store_true', default=False, help='Set True for debugging on a small part of data.')
368 |
369 | # Parse arguments
370 | args = parser.parse_args()
371 | args.filename = get_filename(__file__)
372 |
373 | if args.mode == 'train':
374 | train(args)
375 |
376 | elif args.mode == 'inference_validation':
377 | inference_validation(args)
378 |
379 | else:
380 |         raise Exception('Incorrect argument!')
--------------------------------------------------------------------------------
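Checkpoints saved by train() in main.py above are plain dicts with 'iteration', 'model' and 'optimizer' entries, so they can be restored outside the training loop. A minimal sketch, assuming it is run from the pytorch/ directory and that the checkpoint path below (a placeholder) points to a file written by a previous run:

    import torch
    from models import Cnn_9layers_AvgPooling

    classes_num = 11  # len(config.labels) in utils/config.py

    # Placeholder path; substitute e.g. <workspace>/checkpoints/.../5000_iterations.pth
    checkpoint_path = '5000_iterations.pth'

    model = Cnn_9layers_AvgPooling(classes_num)
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    model.load_state_dict(checkpoint['model'])
    model.eval()

    print('Loaded model from iteration {}'.format(checkpoint['iteration']))
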
/pytorch/models.py:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | import torch
4 | import torch.nn as nn
5 | import torch.nn.functional as F
6 |
7 | from pytorch_utils import interpolate
8 |
9 |
10 | def init_layer(layer, nonlinearity='leaky_relu'):
11 | """Initialize a Linear or Convolutional layer. """
12 | nn.init.kaiming_uniform_(layer.weight, nonlinearity=nonlinearity)
13 |
14 | if hasattr(layer, 'bias'):
15 | if layer.bias is not None:
16 | layer.bias.data.fill_(0.)
17 |
18 |
19 | def init_bn(bn):
20 | """Initialize a Batchnorm layer. """
21 |
22 | bn.bias.data.fill_(0.)
23 | bn.running_mean.data.fill_(0.)
24 | bn.weight.data.fill_(1.)
25 | bn.running_var.data.fill_(1.)
26 |
27 |
28 | class Cnn_5layers_AvgPooling(nn.Module):
29 |
30 | def __init__(self, classes_num):
31 | super(Cnn_5layers_AvgPooling, self).__init__()
32 |
33 | self.conv1 = nn.Conv2d(in_channels=4, out_channels=64,
34 | kernel_size=(5, 5), stride=(1, 1),
35 | padding=(2, 2), bias=False)
36 |
37 | self.conv2 = nn.Conv2d(in_channels=64, out_channels=128,
38 | kernel_size=(5, 5), stride=(1, 1),
39 | padding=(2, 2), bias=False)
40 |
41 | self.conv3 = nn.Conv2d(in_channels=128, out_channels=256,
42 | kernel_size=(5, 5), stride=(1, 1),
43 | padding=(2, 2), bias=False)
44 |
45 | self.conv4 = nn.Conv2d(in_channels=256, out_channels=512,
46 | kernel_size=(5, 5), stride=(1, 1),
47 | padding=(2, 2), bias=False)
48 |
49 | self.bn1 = nn.BatchNorm2d(64)
50 | self.bn2 = nn.BatchNorm2d(128)
51 | self.bn3 = nn.BatchNorm2d(256)
52 | self.bn4 = nn.BatchNorm2d(512)
53 |
54 | self.event_fc = nn.Linear(512, classes_num, bias=True)
55 | self.elevation_fc = nn.Linear(512, classes_num, bias=True)
56 | self.azimuth_fc = nn.Linear(512, classes_num, bias=True)
57 |
58 | self.init_weights()
59 |
60 | def init_weights(self):
61 | init_layer(self.conv1)
62 | init_layer(self.conv2)
63 | init_layer(self.conv3)
64 | init_layer(self.conv4)
65 | init_layer(self.event_fc)
66 | init_layer(self.elevation_fc)
67 | init_layer(self.azimuth_fc)
68 |
69 | init_bn(self.bn1)
70 | init_bn(self.bn2)
71 | init_bn(self.bn3)
72 | init_bn(self.bn4)
73 |
74 | def forward(self, input):
75 | '''
76 |         Input: (channels_num, batch_size, time_steps, freq_bins)'''
77 |
78 | interpolate_ratio = 8
79 |
80 | x = input.transpose(0, 1)
81 |         '''(batch_size, channels_num, time_steps, freq_bins)'''
82 |
83 | x = F.relu_(self.bn1(self.conv1(x)))
84 | x = F.avg_pool2d(x, kernel_size=(2, 2))
85 |
86 | x = F.relu_(self.bn2(self.conv2(x)))
87 | x = F.avg_pool2d(x, kernel_size=(2, 2))
88 |
89 | x = F.relu_(self.bn3(self.conv3(x)))
90 | x = F.avg_pool2d(x, kernel_size=(2, 2))
91 |
92 | x = F.relu_(self.bn4(self.conv4(x)))
93 | x = F.avg_pool2d(x, kernel_size=(1, 1))
94 | '''(batch_size, feature_maps, time_steps, freq_bins)'''
95 |
96 | x = torch.mean(x, dim=3) # (batch_size, feature_maps, time_steps)
97 | x = x.transpose(1, 2) # (batch_size, time_steps, feature_maps)
98 |
99 | event_output = torch.sigmoid(self.event_fc(x)) # (batch_size, time_steps, classes_num)
100 | elevation_output = self.elevation_fc(x) # (batch_size, time_steps, classes_num)
101 | azimuth_output = self.azimuth_fc(x) # (batch_size, time_steps, classes_num)
102 |
103 | # Interpolate
104 | event_output = interpolate(event_output, interpolate_ratio)
105 | elevation_output = interpolate(elevation_output, interpolate_ratio)
106 | azimuth_output = interpolate(azimuth_output, interpolate_ratio)
107 |
108 | output_dict = {
109 | 'event': event_output,
110 | 'elevation': elevation_output,
111 | 'azimuth': azimuth_output}
112 |
113 | return output_dict
114 |
115 |
116 | class ConvBlock(nn.Module):
117 | def __init__(self, in_channels, out_channels):
118 |
119 | super(ConvBlock, self).__init__()
120 |
121 | self.conv1 = nn.Conv2d(in_channels=in_channels,
122 | out_channels=out_channels,
123 | kernel_size=(3, 3), stride=(1, 1),
124 | padding=(1, 1), bias=False)
125 |
126 | self.conv2 = nn.Conv2d(in_channels=out_channels,
127 | out_channels=out_channels,
128 | kernel_size=(3, 3), stride=(1, 1),
129 | padding=(1, 1), bias=False)
130 |
131 | self.bn1 = nn.BatchNorm2d(out_channels)
132 | self.bn2 = nn.BatchNorm2d(out_channels)
133 |
134 | self.init_weights()
135 |
136 | def init_weights(self):
137 |
138 | init_layer(self.conv1)
139 | init_layer(self.conv2)
140 | init_bn(self.bn1)
141 | init_bn(self.bn2)
142 |
143 | def forward(self, input, pool_size=(2, 2), pool_type='avg'):
144 |
145 | x = input
146 | x = F.relu_(self.bn1(self.conv1(x)))
147 | x = F.relu_(self.bn2(self.conv2(x)))
148 | if pool_type == 'max':
149 | x = F.max_pool2d(x, kernel_size=pool_size)
150 | elif pool_type == 'avg':
151 | x = F.avg_pool2d(x, kernel_size=pool_size)
152 | else:
153 | raise Exception('Incorrect argument!')
154 |
155 | return x
156 |
157 |
158 | class Cnn_9layers_AvgPooling(nn.Module):
159 | def __init__(self, classes_num):
160 |
161 | super(Cnn_9layers_AvgPooling, self).__init__()
162 |
163 | self.conv_block1 = ConvBlock(in_channels=4, out_channels=64)
164 | self.conv_block2 = ConvBlock(in_channels=64, out_channels=128)
165 | self.conv_block3 = ConvBlock(in_channels=128, out_channels=256)
166 | self.conv_block4 = ConvBlock(in_channels=256, out_channels=512)
167 |
168 | self.event_fc = nn.Linear(512, classes_num, bias=True)
169 | self.elevation_fc = nn.Linear(512, classes_num, bias=True)
170 | self.azimuth_fc = nn.Linear(512, classes_num, bias=True)
171 |
172 | self.init_weights()
173 |
174 | def init_weights(self):
175 |
176 | init_layer(self.event_fc)
177 | init_layer(self.elevation_fc)
178 | init_layer(self.azimuth_fc)
179 |
180 | def forward(self, input):
181 | '''
182 |         Input: (channels_num, batch_size, time_steps, freq_bins)'''
183 |
184 | interpolate_ratio = 8
185 |
186 | x = input.transpose(0, 1)
187 |         '''(batch_size, channels_num, time_steps, freq_bins)'''
188 |
189 | x = self.conv_block1(x, pool_size=(2, 2), pool_type='avg')
190 | x = self.conv_block2(x, pool_size=(2, 2), pool_type='avg')
191 | x = self.conv_block3(x, pool_size=(2, 2), pool_type='avg')
192 | x = self.conv_block4(x, pool_size=(1, 1), pool_type='avg')
193 |
194 | x = torch.mean(x, dim=3) # (batch_size, feature_maps, time_steps)
195 | x = x.transpose(1, 2) # (batch_size, time_steps, feature_maps)
196 |
197 | event_output = torch.sigmoid(self.event_fc(x)) # (batch_size, time_steps, classes_num)
198 | elevation_output = self.elevation_fc(x) # (batch_size, time_steps, classes_num)
199 | azimuth_output = self.azimuth_fc(x) # (batch_size, time_steps, classes_num)
200 |
201 | # Interpolate
202 | event_output = interpolate(event_output, interpolate_ratio)
203 | elevation_output = interpolate(elevation_output, interpolate_ratio)
204 | azimuth_output = interpolate(azimuth_output, interpolate_ratio)
205 |
206 | output_dict = {
207 | 'event': event_output,
208 | 'elevation': elevation_output,
209 | 'azimuth': azimuth_output}
210 |
211 | return output_dict
212 |
213 |
214 | class Cnn_9layers_MaxPooling(nn.Module):
215 | def __init__(self, classes_num):
216 |
217 | super(Cnn_9layers_MaxPooling, self).__init__()
218 |
219 | self.conv_block1 = ConvBlock(in_channels=4, out_channels=64)
220 | self.conv_block2 = ConvBlock(in_channels=64, out_channels=128)
221 | self.conv_block3 = ConvBlock(in_channels=128, out_channels=256)
222 | self.conv_block4 = ConvBlock(in_channels=256, out_channels=512)
223 |
224 | self.event_fc = nn.Linear(512, classes_num, bias=True)
225 | self.elevation_fc = nn.Linear(512, classes_num, bias=True)
226 | self.azimuth_fc = nn.Linear(512, classes_num, bias=True)
227 |
228 | self.init_weights()
229 |
230 | def init_weights(self):
231 |
232 | init_layer(self.event_fc)
233 | init_layer(self.elevation_fc)
234 | init_layer(self.azimuth_fc)
235 |
236 | def forward(self, input):
237 | '''
238 |         Input: (channels_num, batch_size, time_steps, freq_bins)'''
239 |
240 | interpolate_ratio = 8
241 |
242 | x = input.transpose(0, 1)
243 |         '''(batch_size, channels_num, time_steps, freq_bins)'''
244 |
245 | x = self.conv_block1(x, pool_size=(2, 2), pool_type='max')
246 | x = self.conv_block2(x, pool_size=(2, 2), pool_type='max')
247 | x = self.conv_block3(x, pool_size=(2, 2), pool_type='max')
248 | x = self.conv_block4(x, pool_size=(1, 1), pool_type='max')
249 |
250 | x = torch.mean(x, dim=3) # (batch_size, feature_maps, time_steps)
251 | x = x.transpose(1, 2) # (batch_size, time_steps, feature_maps)
252 |
253 | event_output = torch.sigmoid(self.event_fc(x)) # (batch_size, time_steps, classes_num)
254 | elevation_output = self.elevation_fc(x) # (batch_size, time_steps, classes_num)
255 | azimuth_output = self.azimuth_fc(x) # (batch_size, time_steps, classes_num)
256 |
257 | # Interpolate
258 | event_output = interpolate(event_output, interpolate_ratio)
259 | elevation_output = interpolate(elevation_output, interpolate_ratio)
260 | azimuth_output = interpolate(azimuth_output, interpolate_ratio)
261 |
262 | output_dict = {
263 | 'event': event_output,
264 | 'elevation': elevation_output,
265 | 'azimuth': azimuth_output}
266 |
267 | return output_dict
268 |
269 |
270 | class Cnn_13layers_AvgPooling(nn.Module):
271 | def __init__(self, classes_num):
272 |
273 | super(Cnn_13layers_AvgPooling, self).__init__()
274 |
275 | self.conv_block1 = ConvBlock(in_channels=4, out_channels=64)
276 | self.conv_block2 = ConvBlock(in_channels=64, out_channels=128)
277 | self.conv_block3 = ConvBlock(in_channels=128, out_channels=256)
278 | self.conv_block4 = ConvBlock(in_channels=256, out_channels=512)
279 | self.conv_block5 = ConvBlock(in_channels=512, out_channels=1024)
280 | self.conv_block6 = ConvBlock(in_channels=1024, out_channels=2048)
281 |
282 | self.event_fc = nn.Linear(2048, classes_num, bias=True)
283 | self.elevation_fc = nn.Linear(2048, classes_num, bias=True)
284 | self.azimuth_fc = nn.Linear(2048, classes_num, bias=True)
285 |
286 | self.init_weights()
287 |
288 | def init_weights(self):
289 |
290 | init_layer(self.event_fc)
291 | init_layer(self.elevation_fc)
292 | init_layer(self.azimuth_fc)
293 |
294 | def forward(self, input):
295 | '''
296 |         Input: (channels_num, batch_size, time_steps, freq_bins)'''
297 |
298 | interpolate_ratio = 32
299 |
300 | x = input.transpose(0, 1)
301 |         '''(batch_size, channels_num, time_steps, freq_bins)'''
302 |
303 | x = self.conv_block1(x, pool_size=(2, 2), pool_type='avg')
304 | x = self.conv_block2(x, pool_size=(2, 2), pool_type='avg')
305 | x = self.conv_block3(x, pool_size=(2, 2), pool_type='avg')
306 | x = self.conv_block4(x, pool_size=(2, 2), pool_type='avg')
307 | x = self.conv_block5(x, pool_size=(2, 2), pool_type='avg')
308 | x = self.conv_block6(x, pool_size=(1, 1), pool_type='avg')
309 |
310 | x = torch.mean(x, dim=3) # (batch_size, feature_maps, time_steps)
311 | x = x.transpose(1, 2) # (batch_size, time_steps, feature_maps)
312 |
313 | event_output = torch.sigmoid(self.event_fc(x)) # (batch_size, time_steps, classes_num)
314 | elevation_output = self.elevation_fc(x) # (batch_size, time_steps, classes_num)
315 | azimuth_output = self.azimuth_fc(x) # (batch_size, time_steps, classes_num)
316 |
317 | # Interpolate
318 | event_output = interpolate(event_output, interpolate_ratio)
319 | elevation_output = interpolate(elevation_output, interpolate_ratio)
320 | azimuth_output = interpolate(azimuth_output, interpolate_ratio)
321 |
322 | output_dict = {
323 | 'event': event_output,
324 | 'elevation': elevation_output,
325 | 'azimuth': azimuth_output}
326 |
327 | return output_dict
--------------------------------------------------------------------------------
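The models in models.py above all take input shaped (channels_num, batch_size, time_steps, freq_bins), pool the time axis inside the CNN, and interpolate the per-frame outputs back to the input frame rate. A minimal shape check for Cnn_9layers_AvgPooling, assuming it is run from the pytorch/ directory and using classes_num = 11 from utils/config.py:

    import torch
    from models import Cnn_9layers_AvgPooling

    model = Cnn_9layers_AvgPooling(classes_num=11)
    model.eval()

    # (channels_num=4, batch_size=2, time_steps=128, mel_bins=64)
    x = torch.rand(4, 2, 128, 64)

    with torch.no_grad():
        output_dict = model(x)

    # Three 2x2 poolings reduce 128 frames to 16; interpolate(ratio=8) restores 128.
    print(output_dict['event'].shape)      # torch.Size([2, 128, 11])
    print(output_dict['elevation'].shape)  # torch.Size([2, 128, 11])
    print(output_dict['azimuth'].shape)    # torch.Size([2, 128, 11])
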
/pytorch/pytorch_utils.py:
--------------------------------------------------------------------------------
1 | import torch
2 |
3 |
4 | def move_data_to_gpu(x, cuda):
5 | if 'float' in str(x.dtype):
6 | x = torch.Tensor(x)
7 | elif 'int' in str(x.dtype):
8 | x = torch.LongTensor(x)
9 | else:
10 | raise Exception("Error!")
11 |
12 | if cuda:
13 | x = x.cuda()
14 |
15 | return x
16 |
17 |
18 | def interpolate(x, ratio):
19 | '''Interpolate the prediction to have the same time_steps as the target.
20 |     The time_steps mismatch is caused by pooling in the CNN.
21 |
22 | Args:
23 | x: (batch_size, time_steps, classes_num)
24 | ratio: int, ratio to upsample
25 | '''
26 | (batch_size, time_steps, classes_num) = x.shape
27 | upsampled = x[:, :, None, :].repeat(1, 1, ratio, 1)
28 | upsampled = upsampled.reshape(batch_size, time_steps * ratio, classes_num)
29 | return upsampled
30 |
31 |
32 | def forward(model, generate_func, cuda, return_input=False,
33 | return_target=False):
34 | '''Forward data to model in mini-batch.
35 |
36 | Args:
37 | model: object
38 | generate_func: function
39 | cuda: bool
40 | return_input: bool
41 | return_target: bool
42 |
43 | Returns:
44 | list_dict, e.g.:
45 | [{'name': 'split1_ir0_ov1_7',
46 | 'output_event': (1, frames_num, classes_num),
47 | 'output_elevation': (1, frames_num, classes_num),
48 | 'output_azimuth': (1, frames_num, classes_num),
49 | ...
50 | },
51 | ...]
52 | '''
53 |
54 | list_dict = []
55 |
56 | # Evaluate on mini-batch
57 | for (n, single_data_dict) in enumerate(generate_func):
58 |
59 | # Predict
60 | batch_feature = move_data_to_gpu(single_data_dict['feature'], cuda)
61 |
62 | with torch.no_grad():
63 | model.eval()
64 | batch_output_dict = model(batch_feature)
65 |
66 | output_dict = {
67 | 'name': single_data_dict['name'],
68 | 'output_event': batch_output_dict['event'].data.cpu().numpy(),
69 | 'output_elevation': batch_output_dict['elevation'].data.cpu().numpy(),
70 | 'output_azimuth': batch_output_dict['azimuth'].data.cpu().numpy()}
71 |
72 | if return_input:
73 | output_dict['feature'] = single_data_dict['feature']
74 |
75 | if return_target:
76 | output_dict['target_event'] = single_data_dict['event']
77 | output_dict['target_elevation'] = single_data_dict['elevation']
78 | output_dict['target_azimuth'] = single_data_dict['azimuth']
79 |
80 | list_dict.append(output_dict)
81 |
82 | return list_dict
--------------------------------------------------------------------------------
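The interpolate function above simply repeats each frame ratio times along the time axis (nearest-neighbour upsampling), which is how the frame-rate reduction from pooling is undone before the frame-wise loss is computed. A tiny worked example, assuming pytorch/ is the working directory:

    import torch
    from pytorch_utils import interpolate

    x = torch.tensor([[[1.0], [2.0], [3.0]]])  # (batch_size=1, time_steps=3, classes_num=1)
    y = interpolate(x, ratio=2)                # (1, 6, 1)

    print(y.squeeze().tolist())  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
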
/runme.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # You need to modify this path to your downloaded dataset directory
3 | DATASET_DIR='/vol/vssp/cvpnobackup/scratch_4weeks/qk00006/dcase2019/task3/dataset_root'
4 |
5 | # You need to modify this path to your workspace to store features, models, etc.
6 | WORKSPACE='/vol/vssp/msos/qk/workspaces/dcase2019_task3'
7 |
8 | # Hyper-parameters
9 | GPU_ID=1
10 | DATA_TYPE='development' # 'development' | 'evaluation'
11 | AUDIO_TYPE='foa' # 'foa' | 'mic'
12 | MODEL_TYPE='Cnn_9layers_AvgPooling'
13 | BATCH_SIZE=32
14 |
15 | # Calculate feature
16 | python utils/features.py calculate_feature_for_each_audio_file --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --data_type=$DATA_TYPE --audio_type=$AUDIO_TYPE
17 |
18 | # Calculate scalar
19 | python utils/features.py calculate_scalar --workspace=$WORKSPACE --data_type=$DATA_TYPE --audio_type=$AUDIO_TYPE
20 |
21 | ############ Train and validate system on development dataset ############
22 | for HOLDOUT_FOLD in '1' '2' '3' '4'
23 | do
24 | echo 'Holdout fold: '$HOLDOUT_FOLD
25 |
26 | # Train
27 | CUDA_VISIBLE_DEVICES=$GPU_ID python pytorch/main.py train --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --audio_type=$AUDIO_TYPE --holdout_fold=$HOLDOUT_FOLD --model_type=$MODEL_TYPE --batch_size=$BATCH_SIZE --cuda
28 |
29 | # Validate
30 | CUDA_VISIBLE_DEVICES=$GPU_ID python pytorch/main.py inference_validation --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --audio_type=$AUDIO_TYPE --holdout_fold=$HOLDOUT_FOLD --model_type=$MODEL_TYPE --iteration=5000 --batch_size=$BATCH_SIZE --cuda
31 |
32 | HOLDOUT_FOLD=$[$HOLDOUT_FOLD+1]
33 | done
34 |
35 | # Calculate metrics on all cross-validation folds
36 | HOLDOUT_FOLD='none'   # 'none' tells main.py to summarize results over all folds
37 | CUDA_VISIBLE_DEVICES=$GPU_ID python pytorch/main.py inference_validation --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --audio_type=$AUDIO_TYPE --holdout_fold=$HOLDOUT_FOLD --model_type=$MODEL_TYPE --iteration=5000 --batch_size=$BATCH_SIZE --cuda
38 |
39 | # Plot statistics
40 | python utils/plot_results.py --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --audio_type='foa'
41 |
42 | ############ END ############
43 |
--------------------------------------------------------------------------------
/utils/config.py:
--------------------------------------------------------------------------------
1 | sample_rate = 32000
2 | window_size = 1024
3 | hop_size = 500 # So that there are 64 frames per second
4 | mel_bins = 64
5 | fmin = 50 # Hz
6 | fmax = 14000 # Hz
7 |
8 | frames_per_second = sample_rate // hop_size
9 | time_steps = frames_per_second * 10 # 10-second log mel spectrogram as input
10 | submission_frames_per_second = 50 # DCASE2019 Task3 submission format
11 |
12 | # The label configuration is the same as https://github.com/sharathadavanne/seld-dcase2019
13 | labels = ['knock', 'drawer', 'clearthroat', 'phone', 'keysDrop', 'speech',
14 | 'keyboard', 'pageturn', 'cough', 'doorslam', 'laughter']
15 |
16 | classes_num = len(labels)
17 | lb_to_idx = {lb: idx for idx, lb in enumerate(labels)}
18 | idx_to_lb = {idx: lb for idx, lb in enumerate(labels)}
--------------------------------------------------------------------------------
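The settings in config.py above determine the feature frame rate and the model input length: 32000 / 500 = 64 frames per second, so a 10-second training excerpt spans 640 frames. A small sanity check, assuming utils/ is the working directory so that config.py is importable:

    import config

    assert config.frames_per_second == 32000 // 500 == 64
    assert config.time_steps == 64 * 10 == 640
    assert config.classes_num == len(config.labels) == 11

    # The label <-> index mappings are inverses of each other.
    for idx, lb in config.idx_to_lb.items():
        assert config.lb_to_idx[lb] == idx
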
/utils/data_generator.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import h5py
3 | import csv
4 | import time
5 | import logging
6 | import os
7 | import glob
8 | import matplotlib.pyplot as plt
9 | import logging
10 |
11 | from utilities import scale
12 | import config
13 |
14 |
15 | class DataGenerator(object):
16 |
17 | def __init__(self, features_dir, scalar, batch_size, holdout_fold, seed=1234):
18 | '''Data generator for training and validation.
19 |
20 | Args:
21 | features_dir: string, directory of features
22 | scalar: object, containing mean and std value
23 | batch_size: int
24 | holdout_fold: '1' | '2' | '3' | '4' | 'none', where 'none' indicates
25 | using all data without validation for training
26 | seed: int, random seed
27 | '''
28 |
29 | self.scalar = scalar
30 | self.batch_size = batch_size
31 | self.random_state = np.random.RandomState(seed)
32 |
33 | self.frames_per_second = config.frames_per_second
34 | self.classes_num = config.classes_num
35 | self.lb_to_idx = config.lb_to_idx
36 | self.time_steps = config.time_steps
37 |
38 | # Load data
39 | load_time = time.time()
40 |
41 | feature_names = sorted(os.listdir(features_dir))
42 |
43 | self.train_feature_names = [name for name in feature_names \
44 | if 'split{}'.format(holdout_fold) not in name]
45 |
46 | self.validate_feature_names = [name for name in feature_names \
47 | if 'split{}'.format(holdout_fold) in name]
48 |
49 | self.train_features_list = []
50 | self.train_event_matrix_list = []
51 | self.train_elevation_matrix_list = []
52 | self.train_azimuth_matrix_list = []
53 | self.train_index_array_list = []
54 | frame_index = 0
55 |
56 | # Load training feature and targets
57 | for feature_name in self.train_feature_names:
58 | feature_path = os.path.join(features_dir, feature_name)
59 |
60 | (feature, event_matrix, elevation_matrix, azimuth_matrix) = \
61 | self.load_hdf5(feature_path)
62 |
63 | frames_num = feature.shape[1]
64 | '''Number of frames of the log mel spectrogram of an audio
65 | recording. May be different from file to file'''
66 |
67 | index_array = np.arange(frame_index, frame_index + frames_num - self.time_steps)
68 | frame_index += frames_num
69 |
70 | # Append data
71 | self.train_features_list.append(feature)
72 | self.train_event_matrix_list.append(event_matrix)
73 | self.train_elevation_matrix_list.append(elevation_matrix)
74 | self.train_azimuth_matrix_list.append(azimuth_matrix)
75 | self.train_index_array_list.append(index_array)
76 |
77 | self.train_features = np.concatenate(self.train_features_list, axis=1)
78 | self.train_event_matrix = np.concatenate(self.train_event_matrix_list, axis=0)
79 | self.train_elevation_matrix = np.concatenate(self.train_elevation_matrix_list, axis=0)
80 | self.train_azimuth_matrix = np.concatenate(self.train_azimuth_matrix_list, axis=0)
81 | self.train_index_array = np.concatenate(self.train_index_array_list, axis=0)
82 |
83 | # Load validation feature and targets
84 | self.validate_features_list = []
85 | self.validate_event_matrix_list = []
86 | self.validate_elevation_matrix_list = []
87 | self.validate_azimuth_matrix_list = []
88 |
89 | for feature_name in self.validate_feature_names:
90 | feature_path = os.path.join(features_dir, feature_name)
91 |
92 | (feature, event_matrix, elevation_matrix, azimuth_matrix) = \
93 | self.load_hdf5(feature_path)
94 |
95 | self.validate_features_list.append(feature)
96 | self.validate_event_matrix_list.append(event_matrix)
97 | self.validate_elevation_matrix_list.append(elevation_matrix)
98 | self.validate_azimuth_matrix_list.append(azimuth_matrix)
99 |
100 | logging.info('Load data time: {:.3f} s'.format(time.time() - load_time))
101 | logging.info('Training audio num: {}'.format(len(self.train_feature_names)))
102 | logging.info('Validation audio num: {}'.format(len(self.validate_feature_names)))
103 |
104 | self.random_state.shuffle(self.train_index_array)
105 | self.pointer = 0
106 |
107 | def load_hdf5(self, feature_path):
108 | '''Load hdf5.
109 |
110 | Args:
111 | feature_path: string
112 |
113 | Returns:
114 | feature: (channels_num, frames_num, freq_bins)
115 |             event_matrix: (frames_num, classes_num)
116 | elevation_matrix: (frames_num, classes_num)
117 | azimuth_matrix: (frames_num, classes_num)
118 | '''
119 |
120 | with h5py.File(feature_path, 'r') as hf:
121 | feature = hf['feature'][:]
122 | events = [e.decode() for e in hf['target']['event'][:]]
123 | start_times = hf['target']['start_time'][:]
124 | end_times = hf['target']['end_time'][:]
125 | elevations = hf['target']['elevation'][:]
126 | azimuths = hf['target']['azimuth'][:]
127 | distances = hf['target']['distance'][:]
128 |
129 | frames_num = feature.shape[1]
130 |
131 |         # Reserve space for target matrices
132 | event_matrix = np.zeros((frames_num, self.classes_num))
133 | elevation_matrix = np.zeros((frames_num, self.classes_num))
134 | azimuth_matrix = np.zeros((frames_num, self.classes_num))
135 |
136 | for n in range(len(events)):
137 | class_id = self.lb_to_idx[events[n]]
138 | start_frame = int(round(start_times[n] * self.frames_per_second))
139 | end_frame = int(round(end_times[n] * self.frames_per_second)) + 1
140 |
141 | event_matrix[start_frame : end_frame, class_id] = 1
142 | elevation_matrix[start_frame : end_frame, class_id] = elevations[n]
143 | azimuth_matrix[start_frame : end_frame, class_id] = azimuths[n]
144 |
145 | return feature, event_matrix, elevation_matrix, azimuth_matrix
146 |
147 | def generate_train(self):
148 | '''Generate mini-batch data for training.
149 |
150 |         Yields:
151 | batch_data_dict: dict containing feature, event, elevation and azimuth
152 | '''
153 |
154 | while True:
155 | # Reset pointer
156 | if self.pointer >= len(self.train_index_array):
157 | self.pointer = 0
158 | self.random_state.shuffle(self.train_index_array)
159 |
160 | # Get batch indexes
161 | batch_indexes = self.train_index_array[
162 | self.pointer: self.pointer + self.batch_size]
163 |
164 | data_indexes = batch_indexes[:, None] + np.arange(self.time_steps)
165 |
166 | self.pointer += self.batch_size
167 |
168 | batch_feature = self.train_features[:, data_indexes]
169 | batch_event_matrix = self.train_event_matrix[data_indexes]
170 | batch_elevation_matrix = self.train_elevation_matrix[data_indexes]
171 | batch_azimuth_matrix = self.train_azimuth_matrix[data_indexes]
172 |
173 | # Transform data
174 | batch_feature = self.transform(batch_feature)
175 |
176 | batch_data_dict = {
177 | 'feature': batch_feature,
178 | 'event': batch_event_matrix,
179 | 'elevation': batch_elevation_matrix,
180 | 'azimuth': batch_azimuth_matrix}
181 |
182 | yield batch_data_dict
183 |
184 | def generate_validate(self, data_type, max_validate_num=None):
185 | '''Generate feature and targets of a single audio file.
186 |
187 | Args:
188 | data_type: 'train' | 'validate'
189 |             max_validate_num: None | int, maximum number of audio files to
190 |                 evaluate; set a small number to speed up evaluation
191 |
192 |         Yields:
193 | batch_data_dict: dict containing feature, event, elevation and azimuth
194 | '''
195 |
196 | if data_type == 'train':
197 | feature_names = self.train_feature_names
198 | features_list = self.train_features_list
199 | event_matrix_list = self.train_event_matrix_list
200 | elevation_matrix_list = self.train_elevation_matrix_list
201 | azimuth_matrix_list = self.train_azimuth_matrix_list
202 |
203 | elif data_type == 'validate':
204 | feature_names = self.validate_feature_names
205 | features_list = self.validate_features_list
206 | event_matrix_list = self.validate_event_matrix_list
207 | elevation_matrix_list = self.validate_elevation_matrix_list
208 | azimuth_matrix_list = self.validate_azimuth_matrix_list
209 |
210 | else:
211 | raise Exception('Incorrect argument!')
212 |
213 | validate_num = len(feature_names)
214 |
215 | for n in range(validate_num):
216 | if n == max_validate_num:
217 | break
218 |
219 | name = os.path.splitext(feature_names[n])[0]
220 | feature = features_list[n]
221 | event_matrix = event_matrix_list[n]
222 | elevation_matrix = elevation_matrix_list[n]
223 | azimuth_matrix = azimuth_matrix_list[n]
224 |
225 | feature = self.transform(feature)
226 |
227 | batch_data_dict = {
228 | 'name': name,
229 | 'feature': feature[:, None, :, :], # (channels_num, batch_size=1, frames_num, mel_bins)
230 |                 'event': event_matrix[None, :, :],    # (batch_size=1, frames_num, classes_num)
231 |                 'elevation': elevation_matrix[None, :, :],    # (batch_size=1, frames_num, classes_num)
232 |                 'azimuth': azimuth_matrix[None, :, :]    # (batch_size=1, frames_num, classes_num)
233 | }
234 | '''The None above indicates using an entire audio recording as
235 | input and batch_size=1 in inference'''
236 |
237 | yield batch_data_dict
238 |
239 | def transform(self, x):
240 | return scale(x, self.scalar['mean'], self.scalar['std'])
241 |
--------------------------------------------------------------------------------
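generate_train() in data_generator.py above draws random window start frames and expands each into a full training window with a single broadcasted addition. A tiny example of that indexing pattern with made-up numbers (time_steps of 4 instead of the real config.time_steps = 640):

    import numpy as np

    time_steps = 4                        # the real value is config.time_steps = 640
    batch_indexes = np.array([0, 7, 20])  # randomly drawn window start frames

    data_indexes = batch_indexes[:, None] + np.arange(time_steps)
    print(data_indexes)
    # [[ 0  1  2  3]
    #  [ 7  8  9 10]
    #  [20 21 22 23]]

    # Indexing a (total_frames, classes_num) target matrix with data_indexes yields
    # a (batch_size, time_steps, classes_num) batch, as in generate_train().
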
/utils/features.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | sys.path.insert(1, os.path.join(sys.path[0], 'utils'))
4 | import numpy as np
5 | import pandas as pd
6 | import argparse
7 | import h5py
8 | import librosa
9 | from scipy import signal
10 | import matplotlib.pyplot as plt
11 | import time
12 | import csv
13 | import random
14 |
15 | from utilities import (read_multichannel_audio, create_folder,
16 | calculate_scalar_of_tensor)
17 | import config
18 |
19 |
20 | class LogMelExtractor(object):
21 | def __init__(self, sample_rate, window_size, hop_size, mel_bins, fmin, fmax):
22 | '''Log mel feature extractor.
23 |
24 | Args:
25 | sample_rate: int
26 | window_size: int
27 | hop_size: int
28 | mel_bins: int
29 | fmin: int, minimum frequency of mel filter banks
30 | fmax: int, maximum frequency of mel filter banks
31 | '''
32 |
33 | self.window_size = window_size
34 | self.hop_size = hop_size
35 | self.window_func = np.hanning(window_size)
36 |
37 | self.melW = librosa.filters.mel(
38 | sr=sample_rate,
39 | n_fft=window_size,
40 | n_mels=mel_bins,
41 | fmin=fmin,
42 | fmax=fmax).T
43 | '''(n_fft // 2 + 1, mel_bins)'''
44 |
45 | def transform_multichannel(self, multichannel_audio):
46 | '''Extract feature of a multichannel audio file.
47 |
48 | Args:
49 | multichannel_audio: (samples, channels_num)
50 |
51 | Returns:
52 | feature: (channels_num, frames_num, freq_bins)
53 | '''
54 |
55 | (samples, channels_num) = multichannel_audio.shape
56 |
57 | feature = np.array([self.transform_singlechannel(
58 | multichannel_audio[:, m]) for m in range(channels_num)])
59 |
60 | return feature
61 |
62 | def transform_singlechannel(self, audio):
63 | '''Extract feature of a singlechannel audio file.
64 |
65 | Args:
66 | audio: (samples,)
67 |
68 | Returns:
69 | feature: (frames_num, freq_bins)
70 | '''
71 |
72 | window_size = self.window_size
73 | hop_size = self.hop_size
74 | window_func = self.window_func
75 |
76 | # Compute short-time Fourier transform
77 | stft_matrix = librosa.core.stft(
78 | y=audio,
79 | n_fft=window_size,
80 | hop_length=hop_size,
81 | window=window_func,
82 | center=True,
83 | dtype=np.complex64,
84 | pad_mode='reflect').T
85 | '''(N, n_fft // 2 + 1)'''
86 |
87 | # Mel spectrogram
88 | mel_spectrogram = np.dot(np.abs(stft_matrix) ** 2, self.melW)
89 |
90 | # Log mel spectrogram
91 | logmel_spectrogram = librosa.core.power_to_db(
92 | mel_spectrogram, ref=1.0, amin=1e-10,
93 | top_db=None)
94 |
95 | logmel_spectrogram = logmel_spectrogram.astype(np.float32)
96 |
97 | return logmel_spectrogram
98 |
99 |
100 | def calculate_feature_for_each_audio_file(args):
101 | '''Calculate feature for each audio file and write out to hdf5.
102 |
103 | Args:
104 | dataset_dir: string
105 | workspace: string
106 | data_type: 'development' | 'evaluation'
107 | audio_type: 'foa' | 'mic'
108 | mini_data: bool, set True for debugging on a small part of data
109 | '''
110 |
111 | # Arguments & parameters
112 | dataset_dir = args.dataset_dir
113 | workspace = args.workspace
114 | data_type = args.data_type
115 | audio_type = args.audio_type
116 | mini_data = args.mini_data
117 |
118 | sample_rate = config.sample_rate
119 | window_size = config.window_size
120 | hop_size = config.hop_size
121 | mel_bins = config.mel_bins
122 | fmin = config.fmin
123 | fmax = config.fmax
124 | frames_per_second = config.frames_per_second
125 |
126 | # Paths
127 | if data_type == 'development':
128 | data_type = 'dev'
129 |
130 | elif data_type == 'evaluation':
131 | data_type = 'eva'
132 | raise Exception('Todo after evaluation data released. ')
133 |
134 | if mini_data:
135 | prefix = 'minidata_'
136 | else:
137 | prefix = ''
138 |
139 | metas_dir = os.path.join(dataset_dir, 'metadata_{}'.format(data_type))
140 | audios_dir = os.path.join(dataset_dir, '{}_{}'.format(audio_type, data_type))
141 |
142 | features_dir = os.path.join(workspace, 'features',
143 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
144 | data_type, frames_per_second, mel_bins))
145 |
146 | create_folder(features_dir)
147 |
148 | # Feature extractor
149 | feature_extractor = LogMelExtractor(
150 | sample_rate=sample_rate,
151 | window_size=window_size,
152 | hop_size=hop_size,
153 | mel_bins=mel_bins,
154 | fmin=fmin,
155 | fmax=fmax)
156 |
157 | # Extract features and targets
158 | meta_names = sorted(os.listdir(metas_dir))
159 |
160 | if mini_data:
161 | mini_num = 10
162 | random_state = np.random.RandomState(1234)
163 | random_state.shuffle(meta_names)
164 | meta_names = meta_names[0 : mini_num]
165 |
166 | print('Extracting features of all audio files ...')
167 | extract_time = time.time()
168 |
169 | for (n, meta_name) in enumerate(meta_names):
170 | meta_path = os.path.join(metas_dir, meta_name)
171 | bare_name = os.path.splitext(meta_name)[0]
172 | audio_path = os.path.join(audios_dir, '{}.wav'.format(bare_name))
173 | feature_path = os.path.join(features_dir, '{}.h5'.format(bare_name))
174 |
175 | df = pd.read_csv(meta_path, sep=',')
176 | event_array = df['sound_event_recording'].values
177 | start_time_array = df['start_time'].values
178 | end_time_array = df['end_time'].values
179 | elevation_array = df['ele'].values
180 | azimuth_array = df['azi'].values
181 | distance_array = df['dist'].values
182 |
183 | # Read audio
184 | (multichannel_audio, _) = read_multichannel_audio(
185 | audio_path=audio_path,
186 | target_fs=sample_rate)
187 |
188 | # Extract feature
189 | feature = feature_extractor.transform_multichannel(multichannel_audio)
190 |
191 | with h5py.File(feature_path, 'w') as hf:
192 | hf.create_dataset('feature', data=feature, dtype=np.float32)
193 |
194 | hf.create_group('target')
195 | hf['target'].create_dataset('event', data=[e.encode() for e in event_array], dtype='S20')
196 | hf['target'].create_dataset('start_time', data=start_time_array, dtype=np.float32)
197 | hf['target'].create_dataset('end_time', data=end_time_array, dtype=np.float32)
198 | hf['target'].create_dataset('elevation', data=elevation_array, dtype=np.int32)
199 | hf['target'].create_dataset('azimuth', data=azimuth_array, dtype=np.int32)
200 | hf['target'].create_dataset('distance', data=distance_array, dtype=np.int32)
201 |
202 | print(n, feature_path, feature.shape)
203 |
204 |     print('Feature extraction finished! {:.3f} s'.format(time.time() - extract_time))
205 |
206 |
207 | def calculate_scalar(args):
208 | '''Calculate and write out scalar of development data.
209 |
210 | Args:
211 | dataset_dir: string
212 | workspace: string
213 | audio_type: 'foa' | 'mic'
214 | mini_data: bool, set True for debugging on a small part of data
215 | '''
216 |
217 | # Arguments & parameters
218 | workspace = args.workspace
219 | audio_type = args.audio_type
220 | mini_data = args.mini_data
221 | data_type = 'dev'
222 |
223 | mel_bins = config.mel_bins
224 | frames_per_second = config.frames_per_second
225 |
226 | # Paths
227 | if mini_data:
228 | prefix = 'minidata_'
229 | else:
230 | prefix = ''
231 |
232 | features_dir = os.path.join(workspace, 'features',
233 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
234 | data_type, frames_per_second, mel_bins))
235 |
236 | scalar_path = os.path.join(workspace, 'scalars',
237 | '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix, audio_type,
238 | data_type, frames_per_second, mel_bins), 'scalar.h5')
239 |
240 | create_folder(os.path.dirname(scalar_path))
241 |
242 | # Load data
243 | load_time = time.time()
244 | feature_names = os.listdir(features_dir)
245 | all_features = []
246 |
247 | for feature_name in feature_names:
248 | feature_path = os.path.join(features_dir, feature_name)
249 |
250 | with h5py.File(feature_path, 'r') as hf:
251 | feature = hf['feature'][:]
252 | all_features.append(feature)
253 |
254 | print('Load feature time: {:.3f} s'.format(time.time() - load_time))
255 |
256 | # Calculate scalar
257 | all_features = np.concatenate(all_features, axis=1)
258 | (mean, std) = calculate_scalar_of_tensor(all_features)
259 |
260 | with h5py.File(scalar_path, 'w') as hf:
261 | hf.create_dataset('mean', data=mean, dtype=np.float32)
262 | hf.create_dataset('std', data=std, dtype=np.float32)
263 |
264 | print('All features: {}'.format(all_features.shape))
265 | print('mean: {}'.format(mean))
266 | print('std: {}'.format(std))
267 | print('Write out scalar to {}'.format(scalar_path))
268 |
269 |
270 | if __name__ == '__main__':
271 | parser = argparse.ArgumentParser(description='')
272 | subparsers = parser.add_subparsers(dest='mode')
273 |
274 | # Calculate feature for each audio file
275 | parser_logmel = subparsers.add_parser('calculate_feature_for_each_audio_file')
276 | parser_logmel.add_argument('--dataset_dir', type=str, required=True, help='Directory of dataset.')
277 | parser_logmel.add_argument('--workspace', type=str, required=True, help='Directory of your workspace.')
278 | parser_logmel.add_argument('--data_type', type=str, required=True, choices=['development', 'evaluation'])
279 | parser_logmel.add_argument('--audio_type', type=str, required=True, choices=['foa', 'mic'])
280 | parser_logmel.add_argument('--mini_data', action='store_true', default=False, help='Set True for debugging on a small part of data.')
281 |
282 | # Calculate scalar
283 | parser_scalar = subparsers.add_parser('calculate_scalar')
284 | parser_scalar.add_argument('--workspace', type=str, required=True, help='Directory of your workspace.')
285 | parser_scalar.add_argument('--data_type', type=str, required=True, choices=['development'], help='Scalar is calculated on development set.')
286 | parser_scalar.add_argument('--audio_type', type=str, required=True, choices=['foa', 'mic'])
287 | parser_scalar.add_argument('--mini_data', action='store_true', default=False, help='Set True for debugging on a small part of data.')
288 |
289 | # Parse arguments
290 | args = parser.parse_args()
291 |
292 | if args.mode == 'calculate_feature_for_each_audio_file':
293 | calculate_feature_for_each_audio_file(args)
294 |
295 | elif args.mode == 'calculate_scalar':
296 | calculate_scalar(args)
297 |
298 | else:
299 | raise Exception('Incorrect arguments!')
--------------------------------------------------------------------------------
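A minimal sketch of running LogMelExtractor from features.py above on synthetic 4-channel noise, assuming the repository's dependencies (librosa, numpy, h5py, soundfile, pandas) are installed and utils/ is the working directory so that config.py and utilities.py resolve. It produces one log mel spectrogram per channel:

    import numpy as np
    from features import LogMelExtractor
    import config

    extractor = LogMelExtractor(
        sample_rate=config.sample_rate,
        window_size=config.window_size,
        hop_size=config.hop_size,
        mel_bins=config.mel_bins,
        fmin=config.fmin,
        fmax=config.fmax)

    # One second of random 4-channel audio: (samples, channels_num)
    audio = np.random.uniform(-1., 1., (config.sample_rate, 4)).astype(np.float32)

    feature = extractor.transform_multichannel(audio)
    print(feature.shape)  # roughly (4, 65, 64): (channels_num, frames_num, mel_bins)
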
/utils/plot_results.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os
3 | import matplotlib.pyplot as plt
4 | import _pickle as cPickle
5 | import numpy as np
6 |
7 | import config
8 |
9 |
10 | def plot_results(args):
11 |
12 |     # Arguments & parameters
13 | dataset_dir = args.dataset_dir
14 | workspace = args.workspace
15 | audio_type = args.audio_type
16 |
17 | filename = 'main'
18 | prefix = ''
19 | frames_per_second = config.frames_per_second
20 | mel_bins = config.mel_bins
21 | holdout_fold = 1
22 | max_plot_iteration = 5000
23 |
24 | iterations = np.arange(0, max_plot_iteration, 200)
25 |
26 | def _load_stat(model_type):
27 | validate_statistics_path = os.path.join(workspace, 'statistics',
28 | filename, '{}{}_{}_logmel_{}frames_{}melbins'.format(prefix,
29 | audio_type, 'dev', frames_per_second, mel_bins),
30 | 'holdout_fold={}'.format(holdout_fold), model_type,
31 | 'validate_statistics.pickle')
32 |
33 | statistics_list = cPickle.load(open(validate_statistics_path, 'rb'))
34 |
35 | sed_error_rate = np.array([statistics['sed_error_rate']
36 | for statistics in statistics_list])
37 |
38 | sed_f1_score = np.array([statistics['sed_f1_score']
39 | for statistics in statistics_list])
40 |
41 | doa_error = np.array([statistics['doa_error']
42 | for statistics in statistics_list])
43 |
44 | doa_frame_recall = np.array([statistics['doa_frame_recall']
45 | for statistics in statistics_list])
46 |
47 | seld_score = np.array([statistics['seld_score']
48 | for statistics in statistics_list])
49 |
50 | legend = '{}'.format(model_type)
51 |
52 | results = {'sed_error_rate': sed_error_rate,
53 | 'sed_f1_score': sed_f1_score, 'doa_error': doa_error,
54 | 'doa_frame_recall': doa_frame_recall, 'seld_score': seld_score,
55 | 'legend': legend}
56 |
57 | print('Model type: {}'.format(model_type))
58 | print(' sed_error_rate: {:.3f}'.format(sed_error_rate[-1]))
59 | print(' sed_f1_score: {:.3f}'.format(sed_f1_score[-1]))
60 | print(' doa_error: {:.3f}'.format(doa_error[-1]))
61 | print(' doa_frame_recall: {:.3f}'.format(doa_frame_recall[-1]))
62 | print(' seld_score: {:.3f}'.format(seld_score[-1]))
63 |
64 | return results
65 |
66 | measure_keys = ['sed_error_rate', 'sed_f1_score', 'doa_error', 'doa_frame_recall']
67 |
68 | fig, axs = plt.subplots(2, 2, figsize=(12, 8))
69 |
70 | results_dict = {}
71 | results_dict['Cnn_5layers_AvgPooling'] = _load_stat('Cnn_5layers_AvgPooling')
72 | results_dict['Cnn_9layers_AvgPooling'] = _load_stat('Cnn_9layers_AvgPooling')
73 | results_dict['Cnn_9layers_MaxPooling'] = _load_stat('Cnn_9layers_MaxPooling')
74 | results_dict['Cnn_13layers_AvgPooling'] = _load_stat('Cnn_13layers_AvgPooling')
75 |
76 | for n, measure_key in enumerate(measure_keys):
77 | lines = []
78 |
79 | row = n // 2
80 | col = n % 2
81 |
82 | for model_key in results_dict.keys():
83 | line, = axs[row, col].plot(results_dict[model_key][measure_key], label=results_dict[model_key]['legend'])
84 | lines.append(line)
85 |
86 | axs[row, col].set_title(measure_key)
87 | axs[row, col].legend(handles=lines, loc=4)
88 | axs[row, col].set_ylim(0, 1.0)
89 | axs[row, col].set_xlabel('Iterations')
90 | axs[row, col].grid(color='b', linestyle='solid', linewidth=0.2)
91 | axs[row, col].xaxis.set_ticks(np.arange(0, len(iterations), len(iterations) // 4))
92 |         axs[row, col].xaxis.set_ticklabels(iterations[::len(iterations) // 4])  # label the same tick positions with their iteration numbers
93 |
94 | axs[1, 0].set_ylim(0, 100.)
95 | axs[0, 0].set_ylabel('sed_error_rate')
96 | axs[0, 1].set_ylabel('sed_f1_score')
97 | axs[1, 0].set_ylabel('doa_error')
98 | axs[1, 1].set_ylabel('doa_frame_recall')
99 |
100 | plt.tight_layout()
101 | plt.show()
102 |
103 |
104 | if __name__ == '__main__':
105 | parser = argparse.ArgumentParser(description='')
106 | parser.add_argument('--dataset_dir', type=str, required=True, help='Directory of dataset.')
107 | parser.add_argument('--workspace', type=str, required=True, help='Directory of your workspace.')
108 | parser.add_argument('--audio_type', type=str, choices=['foa', 'mic'], required=True)
109 |
110 | args = parser.parse_args()
111 |
112 | plot_results(args)
--------------------------------------------------------------------------------
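utils/plot_results.py loads the validate_statistics.pickle files that training writes for the four model types hard-coded above (Cnn_5layers_AvgPooling, Cnn_9layers_AvgPooling, Cnn_9layers_MaxPooling and Cnn_13layers_AvgPooling) and plots the four validation metrics against training iterations. A sketch of an invocation, assuming DATASET_DIR and WORKSPACE are placeholders for your own paths:

    $ python utils/plot_results.py --dataset_dir=$DATASET_DIR --workspace=$WORKSPACE --audio_type=foa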
/utils/utilities.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | sys.path.insert(1, os.path.join(sys.path[0], '../evaluation_tools'))
4 | import numpy as np
5 | import soundfile
6 | import librosa
7 | import h5py
8 | from sklearn import metrics
9 | import logging
10 | import matplotlib.pyplot as plt
11 |
12 | import evaluation_metrics
13 | import cls_feature_class
14 | import config
15 |
16 |
17 | def create_folder(fd):
18 | if not os.path.exists(fd):
19 | os.makedirs(fd)
20 |
21 |
22 | def get_filename(path):
23 | path = os.path.realpath(path)
24 |     name_ext = os.path.basename(path)
25 | name = os.path.splitext(name_ext)[0]
26 | return name
27 |
28 |
29 | def create_logging(log_dir, filemode):
30 |
31 | create_folder(log_dir)
32 | i1 = 0
33 |
34 | while os.path.isfile(os.path.join(log_dir, '{:04d}.log'.format(i1))):
35 | i1 += 1
36 |
37 | log_path = os.path.join(log_dir, '{:04d}.log'.format(i1))
38 | logging.basicConfig(
39 | level=logging.DEBUG,
40 | format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
41 | datefmt='%a, %d %b %Y %H:%M:%S',
42 | filename=log_path,
43 | filemode=filemode)
44 |
45 | # Print to console
46 | console = logging.StreamHandler()
47 | console.setLevel(logging.INFO)
48 | formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
49 | console.setFormatter(formatter)
50 | logging.getLogger('').addHandler(console)
51 |
52 | return logging
53 |
54 |
55 | def read_multichannel_audio(audio_path, target_fs=None):
56 |
57 | (multichannel_audio, fs) = soundfile.read(audio_path)
58 | '''(samples, channels_num)'''
59 |
60 | if target_fs is not None and fs != target_fs:
61 | (samples, channels_num) = multichannel_audio.shape
62 |
63 | multichannel_audio = np.array(
64 | [librosa.resample(
65 | multichannel_audio[:, i],
66 | orig_sr=fs,
67 | target_sr=target_fs)
68 | for i in range(channels_num)]).T
69 | '''(samples, channels_num)'''
70 |         fs = target_fs  # return the sample rate of the resampled audio, not the original
71 | return multichannel_audio, fs
72 |
73 |
74 | def calculate_scalar_of_tensor(x):
75 | if x.ndim == 2:
76 | axis = 0
77 | elif x.ndim == 3:
78 | axis = (0, 1)
79 |
80 | mean = np.mean(x, axis=axis)
81 | std = np.std(x, axis=axis)
82 |
83 | return mean, std
84 |
85 |
86 | def load_scalar(scalar_path):
87 | with h5py.File(scalar_path, 'r') as hf:
88 | mean = hf['mean'][:]
89 | std = hf['std'][:]
90 |
91 | scalar = {'mean': mean, 'std': std}
92 | return scalar
93 |
94 |
95 | def scale(x, mean, std):
96 | return (x - mean) / std
97 |
98 |
99 | def inverse_scale(x, mean, std):
100 | return x * std + mean
101 |
102 |
103 | def resample_matrix(matrix, ratio):
104 | '''Resample matrix
105 |
106 | Args:
107 | matrix: (time_steps, classes_num)
108 | ratio: float, ratio to resample
109 | '''
110 | new_len = int(round(ratio * matrix.shape[0]))
111 | new_matrix = np.zeros((new_len, matrix.shape[1]))
112 |
113 | for n in range(new_len):
114 |         new_matrix[n] = matrix[min(int(round(n / ratio)), matrix.shape[0] - 1)]  # clamp index to avoid rounding overflow
115 |
116 | return new_matrix
117 |
118 |
119 | def calculate_metrics(metadata_dir, prediction_paths):
120 | '''Calculate metrics using official tool. This part of code is modified from:
121 | https://github.com/sharathadavanne/seld-dcase2019/blob/master/calculate_SELD_metrics.py
122 |
123 | Args:
124 | metadata_dir: string, directory of reference files.
125 | prediction_paths: list of string
126 |
127 | Returns:
128 | metrics: dict
129 | '''
130 |
131 | # Load feature class
132 | feat_cls = cls_feature_class.FeatureClass()
133 |
134 |     # Load evaluation metric class
135 |     seld_eval = evaluation_metrics.SELDMetrics(
136 |         nb_frames_1s=feat_cls.nb_frames_1s(), data_gen=feat_cls)
137 | 
138 |     seld_eval.reset() # Reset the evaluation metric parameters
139 | for prediction_path in prediction_paths:
140 | reference_path = os.path.join(metadata_dir, '{}.csv'.format(
141 | get_filename(prediction_path)))
142 |
143 | prediction_dict = evaluation_metrics.load_output_format_file(prediction_path)
144 | reference_dict = feat_cls.read_desc_file(reference_path)
145 |
146 | # Generate classification labels for SELD
147 | reference_tensor = feat_cls.get_clas_labels_for_file(reference_dict)
148 | prediction_tensor = evaluation_metrics.output_format_dict_to_classification_labels(
149 | prediction_dict, feat_cls)
150 |
151 |         # Calculate SED and DOA scores
152 |         seld_eval.update_sed_scores(prediction_tensor.max(2), reference_tensor.max(2))
153 |         seld_eval.update_doa_scores(prediction_tensor, reference_tensor)
154 |
155 |     # Overall SED and DOA scores
156 |     sed_error_rate, sed_f1_score = seld_eval.compute_sed_scores()
157 |     doa_error, doa_frame_recall = seld_eval.compute_doa_scores()
158 | seld_score = evaluation_metrics.compute_seld_metric(
159 | [sed_error_rate, sed_f1_score], [doa_error, doa_frame_recall])
160 |
161 | metrics = {
162 | 'sed_error_rate': sed_error_rate,
163 | 'sed_f1_score': sed_f1_score,
164 | 'doa_error': doa_error,
165 | 'doa_frame_recall': doa_frame_recall,
166 | 'seld_score': seld_score }
167 |
168 | return metrics
169 |
170 |
171 | def write_submission(list_dict, submissions_dir):
172 | '''Write predicted result to submission csv files.
173 |
174 | Args:
175 | list_dict: list of dict, e.g.:
176 | [{'name': 'split1_ir0_ov1_7',
177 | 'output_event': (1, frames_num, classes_num),
178 | 'output_elevation': (1, frames_num, classes_num),
179 | 'output_azimuth': (1, frames_num, classes_num),
180 | ...
181 | },
182 | ...]
183 | submissions_dir: string, directory to write out submission files
184 | '''
185 |
186 | frames_per_second = config.frames_per_second
187 | submission_frames_per_second = config.submission_frames_per_second
188 |
189 |     for result_dict in list_dict:
190 |         filename = '{}.csv'.format(result_dict['name'])
191 |         filepath = os.path.join(submissions_dir, filename)
192 | 
193 |         event_matrix = result_dict['output_event'][0]
194 |         elevation_matrix = result_dict['output_elevation'][0]
195 |         azimuth_matrix = result_dict['output_azimuth'][0]
196 |
197 | # Resample predicted frames to submission format
198 | ratio = submission_frames_per_second / float(frames_per_second)
199 | resampled_event_matrix = resample_matrix(event_matrix, ratio)
200 | resampled_elevation_matrix = resample_matrix(elevation_matrix, ratio)
201 | resampled_azimuth_matrix = resample_matrix(azimuth_matrix, ratio)
202 |
203 | with open(filepath, 'w') as f:
204 | for n in range(resampled_event_matrix.shape[0]):
205 | for k in range(resampled_event_matrix.shape[1]):
206 | if resampled_event_matrix[n, k] > 0.5:
207 | elevation = int(resampled_elevation_matrix[n, k])
208 | azimuth = int(resampled_azimuth_matrix[n, k])
209 | f.write('{},{},{},{}\n'.format(n, k, azimuth, elevation))
210 |
211 | logging.info(' Total {} files written to {}'.format(len(list_dict), submissions_dir))
212 |
--------------------------------------------------------------------------------
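For reference, write_submission emits one CSV row per active (frame, class) pair in the form frame_index,class_index,azimuth,elevation, resampled to config.submission_frames_per_second, and calculate_metrics expects one such prediction file per recording, named to match its reference CSV under the metadata directory. A hypothetical excerpt of a submission file (values are illustrative only, not taken from the dataset):

    0,3,-30,10
    1,3,-30,10
    2,3,-20,10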