├── .gitignore
├── Licenses
│   └── MIT_LICENSE
├── README.md
├── configs
│   └── ein_seld
│       └── seld.yaml
├── environment.yml
├── figures
│   └── performance_compare.png
├── papers
│   ├── An Improved Event Independent Network for SELD.pdf
│   └── Event Independent Network for SELD.pdf
├── scripts
│   ├── download_dataset.sh
│   ├── evaluate.sh
│   ├── predict.sh
│   ├── prepare_env.sh
│   ├── preproc.sh
│   └── train.sh
└── seld
    ├── learning
    │   ├── __init__.py
    │   ├── checkpoint.py
    │   ├── evaluate.py
    │   ├── infer.py
    │   ├── initialize.py
    │   ├── preprocess.py
    │   └── train.py
    ├── main.py
    ├── methods
    │   ├── __init__.py
    │   ├── data.py
    │   ├── data_augmentation
    │   │   └── __init__.py
    │   ├── ein_seld
    │   │   ├── __init__.py
    │   │   ├── data.py
    │   │   ├── data_augmentation
    │   │   │   └── __init__.py
    │   │   ├── inference.py
    │   │   ├── losses.py
    │   │   ├── metrics.py
    │   │   ├── models
    │   │   │   ├── __init__.py
    │   │   │   └── seld.py
    │   │   └── training.py
    │   ├── feature.py
    │   ├── inference.py
    │   ├── training.py
    │   └── utils
    │       ├── SELD_evaluation_metrics_2019.py
    │       ├── SELD_evaluation_metrics_2020.py
    │       ├── __init__.py
    │       ├── data_utilities.py
    │       ├── loss_utilities.py
    │       ├── model_utilities.py
    │       └── stft.py
    └── utils
        ├── __init__.py
        ├── cli_parser.py
        ├── common.py
        ├── config.py
        └── datasets.py
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | .DS_Store
3 |
--------------------------------------------------------------------------------
/Licenses/MIT_LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Yin Cao
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection
2 | An Improved Event-Independent Network (EIN) for Polyphonic Sound Event Localization and Detection (SELD)
3 |
4 | from Centre for Vision, Speech and Signal Processing, University of Surrey.
5 |
6 | ## Contents
7 |
8 | - [Introduction](#Introduction)
9 | - [Requirements](#Requirements)
10 | - [Download Dataset](#Download-Dataset)
11 | - [Preprocessing](#Preprocessing)
12 | - [QuickEvaluate](#QuickEvaluate)
13 | - [Usage](#Usage)
14 | * [Training](#Training)
15 | * [Prediction](#Prediction)
16 | * [Evaluation](#Evaluation)
17 | - [Results](#Results)
18 | - [FAQs](#FAQs)
19 | - [Citing](#Citing)
20 | - [Reference](#Reference)
21 |
22 |
23 | ## Introduction
24 |
25 | This is a PyTorch implementation of Event-Independent Networks for Polyphonic SELD.
26 |
27 | Event-Independent Networks for Polyphonic SELD use a trackwise output format and multi-task learning (MTL) with a soft parameter-sharing scheme. For more information, please read the papers [here](#Citing).
28 |
29 |
30 |
31 | The features of this method are:
32 | - It uses a trackwise output format to detect different sound events of the same type but with different DoAs.
33 | - It uses permutation-invariant training (PIT) to solve the track permutation problem introduced by the trackwise output format (see the sketch after this list).
34 | - It uses multi-head self-attention (MHSA) to separate tracks.
35 | - It uses multi-task learning (MTL) with a soft parameter-sharing scheme for joint SELD.
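
To illustrate the PIT idea above, here is a minimal, hypothetical PyTorch sketch of frame-level trackwise PIT for the SED branch with two tracks. It is not the repository's implementation (see `seld/methods/ein_seld/losses.py` for that); the function name `tpit_sed_loss` and the binary cross-entropy choice are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def tpit_sed_loss(pred, target):
    """Illustrative frame-level PIT for SED with two tracks.

    pred, target: (batch, frames, tracks=2, classes); pred is a sigmoid output.
    """
    # Loss for the identity permutation and for the swapped permutation,
    # averaged over classes -> shape (batch, frames, tracks)
    loss_orig = F.binary_cross_entropy(pred, target, reduction='none').mean(dim=-1)
    loss_swap = F.binary_cross_entropy(pred, target.flip(dims=[2]), reduction='none').mean(dim=-1)
    # Sum over tracks, then keep the better permutation for every frame
    per_frame = torch.min(loss_orig.sum(dim=-1), loss_swap.sum(dim=-1))
    return per_frame.mean()
```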
36 |
37 | Currently, the code is available for the [*TAU-NIGENS Spatial Sound Events 2020*](http://dcase.community/challenge2020/task-sound-event-localization-and-detection#download) dataset. Data augmentation methods are not included.
38 |
39 | ## Requirements
40 |
41 | We provide two ways to set up the environment. Both are based on [Anaconda](https://www.anaconda.com/products/individual).
42 |
43 | 1. Use the provided `prepare_env.sh`. Note that you need to set the `anaconda_dir` in `prepare_env.sh` to your anaconda directory, then directly run
44 |
45 | ```bash
46 | bash scripts/prepare_env.sh
47 | ```
48 |
49 | 2. Use the provided `environment.yml`. Note that you also need to set the `prefix` to your target environment directory, then directly run
50 |
51 | ```bash
52 | conda env create -f environment.yml
53 | ```
54 |
55 | After setting up your environment, don't forget to activate it
56 |
57 | ```bash
58 | conda activate ein
59 | ```
60 |
61 | ## Download Dataset
62 |
63 | Downloading the dataset is easy. Directly run
64 |
65 | ```bash
66 | bash scripts/download_dataset.sh
67 | ```
68 |
69 | ## Preprocessing
70 |
71 | The data and meta files need to be preprocessed. `.wav` files will be saved as `.h5` files, and meta files will also be converted to `.h5` files. After downloading the data, directly run
72 |
73 | ```bash
74 | bash scripts/preproc.sh
75 | ```
76 |
77 | Preprocessing of meta files (labels) separates labels into different tracks, each with at most one event and a corresponding DoA. The same event is consistently put in the same track. For frame-level permutation-invariant training (PIT), this may not be necessary, but for chunk-level PIT or no PIT, consistently arranging the same event in the same track is reasonable.
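
For reference, the meta preprocessing in `seld/learning/preprocess.py` encodes each clip as frame-wise SED and DoA label arrays with a track dimension (600 label frames, 2 tracks, 14 classes, Cartesian DoA for DCASE 2020 Task 3). Below is a short sketch of inspecting one converted meta file; the file path is hypothetical and depends on your `hdf5_dir` and the clip name.

```python
import h5py

# Hypothetical path: <hdf5_dir>/<dataset>/meta/<dataset_type>/<clip_name>.h5
with h5py.File('./_hdf5/dcase2020task3/meta/dev/fold1_room1_mix001_ov1.h5', 'r') as hf:
    sed_label = hf['sed_label'][:]   # (600, 2, 14): frames x tracks x classes
    doa_label = hf['doa_label'][:]   # (600, 2, 3): frames x tracks x (x, y, z)
    print(sed_label.shape, doa_label.shape)
```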
78 |
79 | ## QuickEvaluate
80 |
81 | We uploaded the pre-trained model to Zenodo. Download and unzip it in the code folder (the `EIN-SELD` folder) using
82 |
83 | ```bash
84 | wget 'https://zenodo.org/record/4158864/files/out_train.zip' && unzip out_train.zip
85 | ```
86 |
87 | Then directly run
88 |
89 | ```bash
90 | bash scripts/predict.sh && bash scripts/evaluate.sh
91 | ```
92 |
93 | ## Usage
94 |
95 | Hyper-parameters are stored in `./configs/ein_seld/seld.yaml`. You can change some of them, such as `train_chunklen_sec`, `train_hoplen_sec`, `test_chunklen_sec`, `test_hoplen_sec`, `batch_size`, `lr`, and others (see the excerpt below).
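
For example, the relevant parts of the default `seld.yaml` look like the excerpt below: chunk lengths live under the `data` block, while batch size, learning rate, and epochs live under `training`.

```yaml
data:
  train_chunklen_sec: 4
  train_hoplen_sec: 4
  test_chunklen_sec: 4
  test_hoplen_sec: 4
training:
  batch_size: 32
  lr: 0.0005
  max_epoch: 90
```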
96 |
97 | ### Training
98 |
99 | To train a model yourself, set up `./configs/ein_seld/seld.yaml` and directly run
100 |
101 | ```bash
102 | bash scripts/train.sh
103 | ```
104 |
105 | `train_fold` and `valid_fold` in `./configs/ein_seld/seld.yaml` specify which folds are used for training and validation. Note that `valid_fold` can be `None`, which means no validation is performed; this is usually used when training on folds 1-6.
106 |
107 | `overlap` can be `1`, `2`, or the combination `1&2`, which means training on non-overlapping sound events, overlapping sound events, or both.
108 |
109 | `--seed` is set to a random integer by default. You can set it to a fixed number, but results will still not be completely reproducible if an RNN or Transformer is used.
110 |
111 | Depending on your resources, you can consider adding the `--read_into_mem` argument in `train.sh` to pre-load all of the data into memory and increase the training speed.
112 |
113 | `--num_workers` also affects the training speed; adjust it according to your resources.
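
Putting these options together, the training command in `train.sh` might look like the sketch below. The `-c`, `--seed`, and `--num_workers` usages are taken from the provided scripts; passing `--read_into_mem` as a bare flag is an assumption made for illustration.

```bash
CONFIG_FILE='./configs/ein_seld/seld.yaml'

# Fixed seed, 8 dataloader workers, and pre-loading data into memory
python seld/main.py -c $CONFIG_FILE train --seed=2020 --num_workers=8 --read_into_mem
```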
114 |
115 | ### Prediction
116 |
117 | Prediction saves results to the `./out_infer` folder. The saved results are in the submission format for the DCASE challenge. Directly run
118 |
119 | ```bash
120 | bash scripts/predict.sh
121 | ```
122 |
123 | Prediction runs on the `testset_type` set, which can be `dev` or `eval`. If it is `dev`, `test_fold` cannot be `None`.
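
For example, to predict on the development set with fold 1 held out, the `inference` block of `seld.yaml` could be set roughly as follows (a sketch based on the default config, which uses `testset_type: eval`):

```yaml
inference:
  infer_id: EINV2_tPIT_n1
  testset_type: dev   # dev | eval
  test_fold: 1        # must not be None when testset_type is dev
  train_ids: EINV2_tPIT_n1
  models: EINV2
```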
124 |
125 |
126 | ### Evaluation
127 |
128 | Evaluation scores the generated submission result. Directly run
129 |
130 | ```bash
131 | bash scripts/evaluate.sh
132 | ```
133 |
134 | ## Results
135 |
136 | It is notable that EINV2-DA is a single model with a plain VGGish architecture, using only the channel-rotation and SpecAugment data-augmentation methods.
137 |
138 |
139 |
140 | ## FAQs
141 |
142 | 1. If you have any questions, please email caoyfive@gmail.com or report an issue here.
143 |
144 | 2. Currently, `pin_memory` can only be set to `True`. For more information, please check the [PyTorch Doc](https://pytorch.org/docs/stable/data.html#memory-pinning) and the [Nvidia Developer Blog](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/).
145 |
146 | 3. After downloading, you can delete the `downloaded_packages` folder to save some space.
147 |
148 | ## Citing
149 |
150 | If you use the code, please consider citing the papers below
151 |
152 | [[1]. Yin Cao, Turab Iqbal, Qiuqiang Kong, Fengyan An, Wenwu Wang, Mark D. Plumbley, "*An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection*", submitted for publication](http://bit.ly/2N8cF6w)
153 | ```
154 | @article{cao2020anevent,
155 | title={An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection},
156 |   author={Cao, Yin and Iqbal, Turab and Kong, Qiuqiang and An, Fengyan and Wang, Wenwu and Plumbley, Mark D},
157 | journal={arXiv preprint arXiv:2010.13092},
158 | year={2020}
159 | }
160 | ```
161 |
162 | [[2]. Yin Cao, Turab Iqbal, Qiuqiang Kong, Yue Zhong, Wenwu Wang, Mark D. Plumbley, "*Event-Independent Network for Polyphonic Sound Event Localization and Detection*", DCASE 2020 Workshop, November 2020](https://bit.ly/2Tz8oJ9)
163 | ```
164 | @article{cao2020event,
165 | title={Event-Independent Network for Polyphonic Sound Event Localization and Detection},
166 | author={Cao, Yin and Iqbal, Turab and Kong, Qiuqiang and Zhong, Yue and Wang, Wenwu and Plumbley, Mark D},
167 | journal={arXiv preprint arXiv:2010.00140},
168 | year={2020}
169 | }
170 | ```
171 |
172 | ## Reference
173 |
174 | 1. Archontis Politis, Sharath Adavanne, and Tuomas Virtanen. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020). November 2020. [URL](https://arxiv.org/abs/2006.01919)
175 |
176 | 2. Annamaria Mesaros, Sharath Adavanne, Archontis Politis, Toni Heittola, and Tuomas Virtanen. Joint measurement of localization and detection of sound events. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, NY, Oct 2019. [URL](https://ieeexplore.ieee.org/abstract/document/8937220?casa_token=Z4aGA4E2Dz4AAAAA:BELmzMjaZslLDf1EN1NVZ92_9J0PRnRymY360j--954Un9jb_WXbvLSDhp--7yOeXp0HXYoKuUek)
177 |
178 | 3. Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2019. [URL](https://ieeexplore.ieee.org/abstract/document/8567942)
179 |
180 | 4. https://github.com/yinkalario/DCASE2019-TASK3
181 |
182 |
--------------------------------------------------------------------------------
/configs/ein_seld/seld.yaml:
--------------------------------------------------------------------------------
1 | method: ein_seld
2 | dataset: dcase2020task3
3 | workspace_dir: ./
4 | dataset_dir: ./_dataset/dataset_root
5 | hdf5_dir: ./_hdf5
6 | data:
7 | type: foa
8 | sample_rate: 24000
9 | n_fft: 1024
10 | hop_length: 600
11 | n_mels: 256
12 | window: hann
13 | fmin: 20
14 | fmax: 12000
15 | train_chunklen_sec: 4
16 | train_hoplen_sec: 4
17 | test_chunklen_sec: 4
18 | test_hoplen_sec: 4
19 | audio_feature: logmel&intensity
20 | feature_freeze: True
21 | data_augmentation:
22 | type: None
23 | training:
24 | train_id: EINV2_tPIT_n1
25 | model: EINV2
26 | resume_model: # None_epoch_latest.pth
27 | loss_type: all
28 | loss_beta: 0.5
29 | PIT_type: tPIT
30 | batch_size: 32
31 | train_fold: 2,3,4,5,6
32 | valid_fold: 1
33 | overlap: 1&2
34 | optimizer: adam
35 | lr: 0.0005
36 | lr_step_size: 80
37 | lr_gamma: 0.1
38 | max_epoch: 90
39 | threshold_sed: 0.5
40 | remark: None
41 | inference:
42 | infer_id: EINV2_tPIT_n1
43 | testset_type: eval # dev | eval
44 | test_fold: None
45 | overlap: 1&2
46 | train_ids: EINV2_tPIT_n1
47 | models: EINV2
48 | batch_size: 64
49 | threshold_sed: 0.5
50 | remark: None
51 |
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: ein
2 | channels:
3 | - pytorch
4 | - anaconda
5 | - conda-forge
6 | - defaults
7 | dependencies:
8 | - _libgcc_mutex=0.1=main
9 | - appdirs=1.4.4=pyh9f0ad1d_0
10 | - astroid=2.4.2=py37_0
11 | - audioread=2.1.8=py37hc8dfbb8_3
12 | - backcall=0.2.0=py_0
13 | - blas=1.0=mkl
14 | - brotlipy=0.7.0=py37hb5d75c8_1001
15 | - bzip2=1.0.8=h516909a_3
16 | - ca-certificates=2020.10.14=0
17 | - certifi=2020.6.20=py37he5f6b98_2
18 | - cffi=1.14.3=py37he30daa8_0
19 | - chardet=3.0.4=py37he5f6b98_1008
20 | - cryptography=3.1.1=py37hff6837a_1
21 | - cudatoolkit=10.2.89=hfd86e86_1
22 | - cycler=0.10.0=py_2
23 | - decorator=4.4.2=py_0
24 | - ffmpeg=4.3.1=h3215721_1
25 | - freetype=2.10.4=h5ab3b9f_0
26 | - gettext=0.19.8.1=h5e8e0c9_1
27 | - gmp=6.2.0=he1b5a44_3
28 | - gnutls=3.6.13=h79a8f9a_0
29 | - h5py=2.10.0=py37hd6299e0_1
30 | - hdf5=1.10.6=hb1b8bf9_0
31 | - idna=2.10=pyh9f0ad1d_0
32 | - intel-openmp=2020.2=254
33 | - ipython=7.18.1=py37h5ca1d4c_0
34 | - ipython_genutils=0.2.0=py37_0
35 | - isort=5.6.4=py_0
36 | - jedi=0.17.2=py37_0
37 | - joblib=0.17.0=py_0
38 | - jpeg=9b=h024ee3a_2
39 | - kiwisolver=1.2.0=py37h99015e2_1
40 | - lame=3.100=h14c3975_1001
41 | - lazy-object-proxy=1.4.3=py37h7b6447c_0
42 | - lcms2=2.11=h396b838_0
43 | - ld_impl_linux-64=2.33.1=h53a641e_7
44 | - libedit=3.1.20191231=h14c3975_1
45 | - libffi=3.3=he6710b0_2
46 | - libflac=1.3.3=he1b5a44_0
47 | - libgcc-ng=9.1.0=hdf63c60_0
48 | - libgfortran-ng=7.3.0=hdf63c60_0
49 | - libiconv=1.16=h516909a_0
50 | - libllvm10=10.0.1=he513fc3_3
51 | - libogg=1.3.2=h516909a_1002
52 | - libpng=1.6.37=hbc83047_0
53 | - librosa=0.8.0=pyh9f0ad1d_0
54 | - libsndfile=1.0.29=he1b5a44_0
55 | - libstdcxx-ng=9.1.0=hdf63c60_0
56 | - libtiff=4.1.0=h2733197_1
57 | - libvorbis=1.3.7=he1b5a44_0
58 | - llvmlite=0.34.0=py37h5202443_2
59 | - lz4-c=1.9.2=heb0550a_3
60 | - matplotlib-base=3.3.2=py37hc9afd2a_1
61 | - mccabe=0.6.1=py37_1
62 | - mkl=2019.4=243
63 | - mkl-service=2.3.0=py37he904b0f_0
64 | - mkl_fft=1.2.0=py37h23d657b_0
65 | - mkl_random=1.1.0=py37hd6b4f25_0
66 | - ncurses=6.2=he6710b0_1
67 | - nettle=3.4.1=h1bed415_1002
68 | - ninja=1.10.1=py37hfd86e86_0
69 | - numba=0.51.2=py37h9fdb41a_0
70 | - numpy=1.19.1=py37hbc911f0_0
71 | - numpy-base=1.19.1=py37hfa32c7d_0
72 | - olefile=0.46=py37_0
73 | - openh264=2.1.1=h8b12597_0
74 | - openssl=1.1.1h=h516909a_0
75 | - packaging=20.4=pyh9f0ad1d_0
76 | - pandas=1.1.3=py37he6710b0_0
77 | - parso=0.7.0=py_0
78 | - pexpect=4.8.0=py37_1
79 | - pickleshare=0.7.5=py37_1001
80 | - pillow=8.0.0=py37h9a89aac_0
81 | - pip=20.2.4=py37_0
82 | - pooch=1.2.0=py_0
83 | - prompt-toolkit=3.0.8=py_0
84 | - ptyprocess=0.6.0=py37_0
85 | - pudb=2019.2=pyh9f0ad1d_2
86 | - pycparser=2.20=pyh9f0ad1d_2
87 | - pygments=2.7.1=py_0
88 | - pylint=2.6.0=py37_0
89 | - pyopenssl=19.1.0=py_1
90 | - pyparsing=2.4.7=pyh9f0ad1d_0
91 | - pysocks=1.7.1=py37he5f6b98_2
92 | - pysoundfile=0.10.2=py_1001
93 | - python=3.7.9=h7579374_0
94 | - python-dateutil=2.8.1=py_0
95 | - python_abi=3.7=1_cp37m
96 | - pytorch=1.6.0=py3.7_cuda10.2.89_cudnn7.6.5_0
97 | - pytz=2020.1=py_0
98 | - pyyaml=5.3.1=py37h7b6447c_1
99 | - readline=8.0=h7b6447c_0
100 | - requests=2.24.0=pyh9f0ad1d_0
101 | - resampy=0.2.2=py_0
102 | - ruamel.yaml=0.16.12=py37h8f50634_1
103 | - ruamel.yaml.clib=0.2.2=py37h8f50634_1
104 | - scikit-learn=0.23.2=py37h0573a6f_0
105 | - scipy=1.5.2=py37h0b6359f_0
106 | - setuptools=50.3.0=py37hb0f4dca_1
107 | - six=1.15.0=py_0
108 | - sqlite=3.33.0=h62c20be_0
109 | - termcolor=1.1.0=py37_1
110 | - threadpoolctl=2.1.0=pyh5ca1d4c_0
111 | - tk=8.6.10=hbc83047_0
112 | - toml=0.10.1=py_0
113 | - torchaudio=0.6.0=py37
114 | - torchvision=0.7.0=py37_cu102
115 | - tornado=6.0.4=py37h8f50634_2
116 | - tqdm=4.50.2=pyh9f0ad1d_0
117 | - traitlets=5.0.5=py_0
118 | - typed-ast=1.4.1=py37h7b6447c_0
119 | - urllib3=1.25.11=py_0
120 | - urwid=2.1.2=py37h8f50634_1
121 | - wcwidth=0.2.5=py_0
122 | - wheel=0.35.1=py_0
123 | - wrapt=1.11.2=py37h7b6447c_0
124 | - x264=1!152.20180806=h14c3975_0
125 | - xz=5.2.5=h7b6447c_0
126 | - yaml=0.2.5=h7b6447c_0
127 | - zlib=1.2.11=h7b6447c_3
128 | - zstd=1.4.5=h9ceee32_0
129 | - pip:
130 | - absl-py==0.10.0
131 | - cachetools==4.1.1
132 | - google-auth==1.22.1
133 | - google-auth-oauthlib==0.4.1
134 | - grpcio==1.33.1
135 | - importlib-metadata==2.0.0
136 | - markdown==3.3.2
137 | - oauthlib==3.1.0
138 | - protobuf==3.13.0
139 | - pyasn1==0.4.8
140 | - pyasn1-modules==0.2.8
141 | - requests-oauthlib==1.3.0
142 | - rsa==4.6
143 | - tensorboard==2.3.0
144 | - tensorboard-plugin-wit==1.7.0
145 | - werkzeug==1.0.1
146 | - zipp==3.3.2
147 | prefix: # /vol/research/yc/miniconda3/envs/ein
148 |
149 |
--------------------------------------------------------------------------------
/figures/performance_compare.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/figures/performance_compare.png
--------------------------------------------------------------------------------
/papers/An Improved Event Independent Network for SELD.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/papers/An Improved Event Independent Network for SELD.pdf
--------------------------------------------------------------------------------
/papers/Event Independent Network for SELD.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/papers/Event Independent Network for SELD.pdf
--------------------------------------------------------------------------------
/scripts/download_dataset.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -e
4 |
5 | # check bin requirement
6 | command -v wget >/dev/null 2>&1 || { echo 'wget is missing' >&2; exit 1; }
7 | command -v zip >/dev/null 2>&1 || { echo 'zip is missing' >&2; exit 1; }
8 | command -v unzip >/dev/null 2>&1 || { echo 'unzip is missing' >&2; exit 1; }
9 |
10 | ## dcase 2020 Task 3
11 | Dataset_dir='_dataset'
12 | Dataset_root=$Dataset_dir'/dataset_root'
13 | mkdir -p $Dataset_root
14 | Download_packages_dir=$Dataset_dir'/downloaded_packages'
15 | mkdir -p $Download_packages_dir
16 |
17 | # dev and eval data packages
18 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/foa_dev.z01'
19 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/foa_dev.z02'
20 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/foa_dev.zip'
21 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/foa_eval.zip'
22 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/metadata_dev.zip'
23 | wget -P $Download_packages_dir 'https://zenodo.org/record/4064792/files/metadata_eval.zip'
24 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/mic_dev.z01'
25 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/mic_dev.z02'
26 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/mic_dev.zip'
27 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/mic_eval.zip'
28 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/README.md'
29 |
30 | zip -s 0 $Download_packages_dir'/foa_dev.zip' --out $Download_packages_dir'/foa_dev_single.zip'
31 | zip -s 0 $Download_packages_dir'/mic_dev.zip' --out $Download_packages_dir'/mic_dev_single.zip'
32 |
33 | unzip $Download_packages_dir'/foa_dev_single.zip' -d $Dataset_root
34 | unzip $Download_packages_dir'/mic_dev_single.zip' -d $Dataset_root
35 | unzip $Download_packages_dir'/metadata_dev.zip' -d $Dataset_root
36 | unzip $Download_packages_dir'/metadata_eval.zip' -d $Dataset_root
37 | unzip $Download_packages_dir'/foa_eval.zip' -d $Dataset_root
38 | unzip $Download_packages_dir'/mic_eval.zip' -d $Dataset_root
--------------------------------------------------------------------------------
/scripts/evaluate.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -e
4 |
5 | CONFIG_FILE='./configs/ein_seld/seld.yaml'
6 |
7 | python seld/main.py -c $CONFIG_FILE evaluate
--------------------------------------------------------------------------------
/scripts/predict.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -e
4 |
5 | CONFIG_FILE='./configs/ein_seld/seld.yaml'
6 |
7 | python seld/main.py -c $CONFIG_FILE infer --num_workers=8
--------------------------------------------------------------------------------
/scripts/prepare_env.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -e
4 |
5 | anaconda_dir= # '/vol/research/yc/miniconda3'
6 |
7 | . $anaconda_dir'/etc/profile.d/conda.sh'
8 | conda remove -n ein --all -y
9 | conda create -n ein python=3.7 -y
10 | conda activate ein
11 |
12 | conda install -c anaconda pandas h5py ipython pyyaml pylint -y
13 | conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-lts -y
14 | conda install -c conda-forge librosa pudb tqdm ruamel.yaml -y
15 | conda install -c omnia termcolor -y
16 | pip install tensorboard
17 |
--------------------------------------------------------------------------------
/scripts/preproc.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -e
4 |
5 | CONFIG_FILE='./configs/ein_seld/seld.yaml'
6 |
7 | # Extract data
8 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_data' --dataset_type='dev'
9 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_data' --dataset_type='eval'
10 |
11 | # Extract scalar
12 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_scalar' --num_workers=8
13 |
14 | # Extract meta
15 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_meta' --dataset_type='dev'
16 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_meta' --dataset_type='eval'
17 |
--------------------------------------------------------------------------------
/scripts/train.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -e
4 |
5 | CONFIG_FILE='./configs/ein_seld/seld.yaml'
6 |
7 | python seld/main.py -c $CONFIG_FILE train --seed=$(shuf -i 0-10000 -n 1) --num_workers=8
8 |
--------------------------------------------------------------------------------
/seld/learning/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/seld/learning/__init__.py
--------------------------------------------------------------------------------
/seld/learning/checkpoint.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import random
3 | from pathlib import Path
4 |
5 | import numpy as np
6 | import pandas as pd
7 | import torch
8 |
9 |
10 | class CheckpointIO:
11 | """CheckpointIO class.
12 |
13 | It handles saving and loading checkpoints.
14 | """
15 |
16 | def __init__(self, checkpoints_dir, model, optimizer, batch_sampler, metrics_names, num_checkpoints=1, remark=None):
17 | """
18 | Args:
19 |             checkpoints_dir (Path obj): directory where checkpoints are saved
20 | model: model
21 | optimizer: optimizer
22 | batch_sampler: batch_sampler
23 | metrics_names: metrics names to be saved in a checkpoints csv file
24 |             num_checkpoints: maximum number of checkpoints to save. When this number is exceeded, the worst
25 |                 checkpoint (oldest, lowest-value, or highest-value, depending on rank_order) will be deleted
26 | remark (optional): to remark the name of the checkpoint
27 | """
28 | self.checkpoints_dir = checkpoints_dir
29 | self.checkpoints_dir.mkdir(parents=True, exist_ok=True)
30 | self.model = model
31 | self.optimizer = optimizer
32 | self.batch_sampler = batch_sampler
33 | self.num_checkpoints = num_checkpoints
34 | self.remark = remark
35 |
36 | self.value_list = []
37 | self.epoch_list = []
38 |
39 | self.checkpoints_csv_path = checkpoints_dir.joinpath('metrics_statistics.csv')
40 |
41 | # save checkpoints_csv header
42 | metrics_keys_list = [name for name in metrics_names]
43 | header = ['epoch'] + metrics_keys_list
44 | df_header = pd.DataFrame(columns=header)
45 | df_header.to_csv(self.checkpoints_csv_path, sep='\t', index=False, mode='a+')
46 |
47 | def save(self, epoch, it, metrics, key_rank=None, rank_order='high'):
48 | """Save model. It will save a latest model, a best model of rank_order for value, and
49 | 'self.num_checkpoints' best models of rank_order for value.
50 |
51 | Args:
52 | metrics: metrics to log
53 | key_rank (str): the key of metrics to rank
54 | rank_order: 'low' | 'high' | 'latest'
55 | 'low' to keep the models of lowest values
56 | 'high' to keep the models of highest values
57 | 'latest' to keep the models of latest epochs
58 | """
59 |
60 |         ## save checkpoints_csv
61 | metrics_values_list = [value for value in metrics.values()]
62 | checkpoint_list = [[epoch] + metrics_values_list]
63 | df_checkpoint = pd.DataFrame(checkpoint_list)
64 | df_checkpoint.to_csv(self.checkpoints_csv_path, sep='\t', header=False, index=False, mode='a+')
65 |
66 | ## save checkpoints
67 | current_value = None if rank_order == 'latest' else metrics[key_rank]
68 |
69 | # latest model
70 | latest_checkpoint_path = self.checkpoints_dir.joinpath('{}_epoch_latest.pth'.format(self.remark))
71 | self.save_file(latest_checkpoint_path, epoch, it)
72 |
73 | # self.num_checkpoints best models
74 | if len(self.value_list) < self.num_checkpoints:
75 | self.value_list.append(current_value)
76 | self.epoch_list.append(epoch)
77 | checkpoint_path = self.checkpoints_dir.joinpath('{}_epoch_{}.pth'.format(self.remark, epoch))
78 | self.save_file(checkpoint_path, epoch, it)
79 | logging.info('Checkpoint saved to {}'.format(checkpoint_path))
80 | elif len(self.value_list) >= self.num_checkpoints:
81 | value_list = np.array(self.value_list)
82 | if rank_order == 'high' and current_value >= value_list.min():
83 | worst_index = value_list.argmin()
84 | self.del_and_save(worst_index, current_value, epoch, it)
85 | elif rank_order == 'low' and current_value <= value_list.max():
86 | worst_index = value_list.argmax()
87 | self.del_and_save(worst_index, current_value, epoch, it)
88 | elif rank_order == 'latest':
89 | worst_index = 0
90 | self.del_and_save(worst_index, current_value, epoch, it)
91 |
92 | # best model
93 | value_list = np.array(self.value_list)
94 | best_checkpoint_path = self.checkpoints_dir.joinpath('{}_epoch_best.pth'.format(self.remark))
95 | if rank_order == 'high' and current_value >= value_list.max():
96 | self.save_file(best_checkpoint_path, epoch, it)
97 | elif rank_order == 'low' and current_value <= value_list.min():
98 | self.save_file(best_checkpoint_path, epoch, it)
99 | elif rank_order == 'latest':
100 | self.save_file(best_checkpoint_path, epoch, it)
101 |
102 | def del_and_save(self, worst_index, current_value, epoch, it):
103 | """Delete and save checkpoint
104 |
105 | Args:
106 | worst_index: worst index,
107 | current_value: current value,
108 | epoch: epoch,
109 | it: it,
110 | """
111 | worst_chpt_path = self.checkpoints_dir.joinpath('{}_epoch_{}.pth'.format(self.remark, self.epoch_list[worst_index]))
112 | if worst_chpt_path.is_file():
113 | worst_chpt_path.unlink()
114 | self.value_list.pop(worst_index)
115 | self.epoch_list.pop(worst_index)
116 |
117 | self.value_list.append(current_value)
118 | self.epoch_list.append(epoch)
119 | checkpoint_path = self.checkpoints_dir.joinpath('{}_epoch_{}.pth'.format(self.remark, epoch))
120 | self.save_file(checkpoint_path, epoch, it)
121 | logging.info('Checkpoint saved to {}'.format(checkpoint_path))
122 |
123 | def save_file(self, checkpoint_path, epoch, it):
124 | """Save a module to a file
125 |
126 | Args:
127 | checkpoint_path (Path obj): checkpoint path, including .pth file name
128 | epoch: epoch,
129 | it: it
130 | """
131 | outdict = {
132 | 'epoch': epoch,
133 | 'it': it,
134 | 'model': self.model.module.state_dict(),
135 | 'optimizer': self.optimizer.state_dict(),
136 | 'sampler': self.batch_sampler.get_state(),
137 | 'rng': torch.get_rng_state(),
138 | 'cuda_rng': torch.cuda.get_rng_state(),
139 | 'random': random.getstate(),
140 | 'np_random': np.random.get_state(),
141 | }
142 | torch.save(outdict, checkpoint_path)
143 |
144 | def load(self, checkpoint_path):
145 | """Load a module from a file
146 |
147 | """
148 | state_dict = torch.load(checkpoint_path)
149 | epoch = state_dict['epoch']
150 | it = state_dict['it']
151 | self.model.module.load_state_dict(state_dict['model'])
152 | self.optimizer.load_state_dict(state_dict['optimizer'])
153 | self.batch_sampler.set_state(state_dict['sampler'])
154 | torch.set_rng_state(state_dict['rng'])
155 | torch.cuda.set_rng_state(state_dict['cuda_rng'])
156 | random.setstate(state_dict['random'])
157 | np.random.set_state(state_dict['np_random'])
158 | logging.info('Resuming complete from {}\n'.format(checkpoint_path))
159 | return epoch, it
160 |
161 |
--------------------------------------------------------------------------------
/seld/learning/evaluate.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 |
3 | import numpy as np
4 | import pandas as pd
5 |
6 | import methods.utils.SELD_evaluation_metrics_2019 as SELDMetrics2019
7 | from methods.utils.data_utilities import (load_dcase_format,
8 | to_metrics2020_format)
9 | from methods.utils.SELD_evaluation_metrics_2020 import \
10 | SELDMetrics as SELDMetrics2020
11 | from methods.utils.SELD_evaluation_metrics_2020 import early_stopping_metric
12 |
13 |
14 | def evaluate(cfg, dataset):
15 |
16 | """ Evaluate scores
17 |
18 | """
19 |
20 | '''Directories'''
21 | print('Inference ID is {}\n'.format(cfg['inference']['infer_id']))
22 |
23 | out_infer_dir = Path(cfg['workspace_dir']).joinpath('out_infer').joinpath(cfg['method']) \
24 | .joinpath(cfg['inference']['infer_id'])
25 | submissions_dir = out_infer_dir.joinpath('submissions')
26 |
27 | main_dir = Path(cfg['dataset_dir'])
28 | dev_meta_dir = main_dir.joinpath('metadata_dev')
29 | eval_meta_dir = main_dir.joinpath('metadata_eval')
30 | if cfg['inference']['testset_type'] == 'dev':
31 | meta_dir = dev_meta_dir
32 | elif cfg['inference']['testset_type'] == 'eval':
33 | meta_dir = eval_meta_dir
34 |
35 | pred_frame_begin_index = 0
36 | gt_frame_begin_index = 0
37 | frame_length = int(dataset.clip_length / dataset.label_resolution)
38 | pred_output_dict, pred_sed_metrics2019, pred_doa_metrics2019 = {}, [], []
39 | gt_output_dict, gt_sed_metrics2019, gt_doa_metrics2019= {}, [], []
40 | for pred_path in sorted(submissions_dir.glob('*.csv')):
41 | fn = pred_path.name
42 | gt_path = meta_dir.joinpath(fn)
43 |
44 | # pred
45 | output_dict, sed_metrics2019, doa_metrics2019 = load_dcase_format(
46 | pred_path, frame_begin_index=pred_frame_begin_index,
47 | frame_length=frame_length, num_classes=len(dataset.label_set), set_type='pred')
48 | pred_output_dict.update(output_dict)
49 | pred_sed_metrics2019.append(sed_metrics2019)
50 | pred_doa_metrics2019.append(doa_metrics2019)
51 | pred_frame_begin_index += frame_length
52 |
53 | # gt
54 | output_dict, sed_metrics2019, doa_metrics2019 = load_dcase_format(
55 | gt_path, frame_begin_index=gt_frame_begin_index,
56 | frame_length=frame_length, num_classes=len(dataset.label_set), set_type='gt')
57 | gt_output_dict.update(output_dict)
58 | gt_sed_metrics2019.append(sed_metrics2019)
59 | gt_doa_metrics2019.append(doa_metrics2019)
60 | gt_frame_begin_index += frame_length
61 |
62 | pred_sed_metrics2019 = np.concatenate(pred_sed_metrics2019, axis=0)
63 | pred_doa_metrics2019 = np.concatenate(pred_doa_metrics2019, axis=0)
64 | gt_sed_metrics2019 = np.concatenate(gt_sed_metrics2019, axis=0)
65 | gt_doa_metrics2019 = np.concatenate(gt_doa_metrics2019, axis=0)
66 | pred_metrics2020_dict = to_metrics2020_format(pred_output_dict,
67 | pred_sed_metrics2019.shape[0], label_resolution=dataset.label_resolution)
68 | gt_metrics2020_dict = to_metrics2020_format(gt_output_dict,
69 | gt_sed_metrics2019.shape[0], label_resolution=dataset.label_resolution)
70 |
71 | # 2019 metrics
72 | num_frames_1s = int(1 / dataset.label_resolution)
73 | ER_19, F_19 = SELDMetrics2019.compute_sed_scores(pred_sed_metrics2019, gt_sed_metrics2019, num_frames_1s)
74 | LE_19, LR_19, _, _, _, _ = SELDMetrics2019.compute_doa_scores_regr(
75 | pred_doa_metrics2019, gt_doa_metrics2019, pred_sed_metrics2019, gt_sed_metrics2019)
76 | seld_score_19 = SELDMetrics2019.early_stopping_metric([ER_19, F_19], [LE_19, LR_19])
77 |
78 | # 2020 metrics
79 | dcase2020_metric = SELDMetrics2020(nb_classes=len(dataset.label_set), doa_threshold=20)
80 | dcase2020_metric.update_seld_scores(pred_metrics2020_dict, gt_metrics2020_dict)
81 | ER_20, F_20, LE_20, LR_20 = dcase2020_metric.compute_seld_scores()
82 | seld_score_20 = early_stopping_metric([ER_20, F_20], [LE_20, LR_20])
83 |
84 | metrics_scores ={
85 | 'ER20': ER_20,
86 | 'F20': F_20,
87 | 'LE20': LE_20,
88 | 'LR20': LR_20,
89 | 'seld20': seld_score_20,
90 | 'ER19': ER_19,
91 | 'F19': F_19,
92 | 'LE19': LE_19,
93 | 'LR19': LR_19,
94 | 'seld19': seld_score_19,
95 | }
96 |
97 | out_str = 'test: '
98 | for key, value in metrics_scores.items():
99 | out_str += '{}: {:.3f}, '.format(key, value)
100 | print('---------------------------------------------------------------------------------------------------'
101 | +'-------------------------------------------------')
102 | print(out_str)
103 | print('---------------------------------------------------------------------------------------------------'
104 | +'-------------------------------------------------')
105 |
106 | out_eval_dir = Path(cfg['workspace_dir']).joinpath('out_eval').joinpath(cfg['method']) \
107 | .joinpath(cfg['inference']['infer_id'])
108 | out_eval_dir.mkdir(parents=True, exist_ok=True)
109 | result_path = out_eval_dir.joinpath('results.csv')
110 | df = pd.DataFrame(metrics_scores, index=[0])
111 | df.to_csv(result_path, sep=',', mode='a')
112 |
113 |
--------------------------------------------------------------------------------
/seld/learning/infer.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from utils.config import get_afextractor, get_inferer, get_models
3 |
4 |
5 | def infer(cfg, dataset, **infer_initializer):
6 | """ Infer, only save the testset predictions
7 |
8 | """
9 | submissions_dir = infer_initializer['submissions_dir']
10 | ckpts_paths_list = infer_initializer['ckpts_paths_list']
11 | ckpts_models_list = infer_initializer['ckpts_models_list']
12 | test_generator = infer_initializer['test_generator']
13 | cuda = infer_initializer['cuda']
14 | preds = []
15 | for ckpt_path, model_name in zip(ckpts_paths_list, ckpts_models_list):
16 | print('=====>> Resuming from the checkpoint: {}\n'.format(ckpt_path))
17 | af_extractor = get_afextractor(cfg, cuda)
18 | model = get_models(cfg, dataset, cuda, model_name=model_name)
19 | state_dict = torch.load(ckpt_path)
20 | model.module.load_state_dict(state_dict['model'])
21 | print(' Resuming complete\n')
22 | inferer = get_inferer(cfg, dataset, af_extractor, model, cuda)
23 | pred = inferer.infer(test_generator)
24 | preds.append(pred)
25 | print('\n Inference finished for {}\n'.format(ckpt_path))
26 | inferer.fusion(submissions_dir, preds)
27 |
28 |
29 |
--------------------------------------------------------------------------------
/seld/learning/initialize.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import random
3 | import shutil
4 | import socket
5 | from datetime import datetime
6 | from pathlib import Path
7 |
8 | import numpy as np
9 | import torch
10 | import torch.optim as optim
11 | from torch.backends import cudnn
12 | from torch.utils.tensorboard import SummaryWriter
13 | from utils.common import create_logging
14 | from utils.config import (get_afextractor, get_generator, get_losses,
15 | get_metrics, get_models, get_optimizer, get_trainer,
16 | store_config)
17 |
18 | from learning.checkpoint import CheckpointIO
19 |
20 |
21 | def init_train(args, cfg, dataset):
22 | """ Training initialization.
23 |
24 | Including Data generator, model, optimizer initialization.
25 | """
26 |
27 | '''Cuda'''
28 | args.cuda = not args.no_cuda and torch.cuda.is_available()
29 |
30 | ''' Reproducible seed set'''
31 | torch.manual_seed(args.seed)
32 | if args.cuda:
33 | torch.cuda.manual_seed(args.seed)
34 | np.random.seed(args.seed)
35 | random.seed(args.seed)
36 | cudnn.deterministic = True
37 | cudnn.benchmark = True
38 |
39 | '''Directories'''
40 | print('Train ID is {}\n'.format(cfg['training']['train_id']))
41 | out_train_dir = Path(cfg['workspace_dir']).joinpath('out_train') \
42 | .joinpath(cfg['method']).joinpath(cfg['training']['train_id'])
43 | if out_train_dir.is_dir():
44 | flag = input("Train ID folder {} is existed, delete it? (y/n)". \
45 | format(str(out_train_dir))).lower()
46 | print('')
47 | if flag == 'y':
48 | shutil.rmtree(str(out_train_dir))
49 | elif flag == 'n':
50 | print("User select not to remove the training ID folder {}.\n". \
51 | format(str(out_train_dir)))
52 | out_train_dir.mkdir(parents=True, exist_ok=True)
53 |
54 | current_time = datetime.now().strftime('%b%d_%H-%M-%S')
55 | tb_dir = out_train_dir.joinpath('tb').joinpath(current_time + '_' + socket.gethostname())
56 | tb_dir.mkdir(parents=True, exist_ok=True)
57 | logs_dir = out_train_dir.joinpath('logs')
58 | ckpts_dir = out_train_dir.joinpath('checkpoints')
59 |
60 | '''tensorboard and logging'''
61 | writer = SummaryWriter(log_dir=str(tb_dir))
62 | create_logging(logs_dir, filemode='w')
63 | param_file = out_train_dir.joinpath('config.yaml')
64 | if param_file.is_file():
65 | param_file.unlink()
66 | store_config(param_file, cfg)
67 |
68 | '''Data generator'''
69 | train_set, train_generator, batch_sampler = get_generator(args, cfg, dataset, generator_type='train')
70 | valid_set, valid_generator, _ = get_generator(args, cfg, dataset, generator_type='valid')
71 |
72 | '''Loss'''
73 | losses = get_losses(cfg)
74 |
75 | '''Metrics'''
76 | metrics = get_metrics(cfg, dataset)
77 |
78 | '''Audio feature extractor'''
79 | af_extractor = get_afextractor(cfg, args.cuda)
80 |
81 | '''Model'''
82 | model = get_models(cfg, dataset, args.cuda)
83 |
84 | '''Optimizer'''
85 | optimizer = get_optimizer(cfg, af_extractor, model)
86 | lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=cfg['training']['lr_step_size'],
87 | gamma=cfg['training']['lr_gamma'])
88 |
89 | '''Trainer'''
90 | trainer = get_trainer(args=args, cfg=cfg, dataset=dataset, valid_set=valid_set,
91 | af_extractor=af_extractor, model=model, optimizer=optimizer, losses=losses, metrics=metrics)
92 |
93 | '''CheckpointIO'''
94 | if not cfg['training']['valid_fold']:
95 | metrics_names = losses.names
96 | else:
97 | metrics_names = metrics.names
98 | ckptIO = CheckpointIO(
99 | checkpoints_dir=ckpts_dir,
100 | model=model,
101 | optimizer=optimizer,
102 | batch_sampler=batch_sampler,
103 | metrics_names=metrics_names,
104 | num_checkpoints=1,
105 | remark=cfg['training']['remark']
106 | )
107 |
108 | if cfg['training']['resume_model']:
109 | resume_path = ckpts_dir.joinpath(cfg['training']['resume_model'])
110 | logging.info('=====>> Resume from the checkpoint: {}......\n'.format(str(resume_path)))
111 | epoch_it, it = ckptIO.load(resume_path)
112 | for param_group in optimizer.param_groups:
113 | param_group['lr'] = cfg['training']['lr']
114 | else:
115 | epoch_it, it = 0, 0
116 |
117 | '''logging and return'''
118 | logging.info('Train folds are: {}\n'.format(cfg['training']['train_fold']))
119 | logging.info('Valid folds are: {}\n'.format(cfg['training']['valid_fold']))
120 | logging.info('Training clip number is: {}\n'.format(len(train_set)))
121 | logging.info('Number of batches per epoch is: {}\n'.format(len(batch_sampler)))
122 | logging.info('Validation clip number is: {}\n'.format(len(valid_set)))
123 | logging.info('Training loss type is: {}\n'.format(cfg['training']['loss_type']))
124 | if cfg['training']['loss_type'] == 'doa':
125 | logging.info('DOA loss type is: {}\n'.format(cfg['training']['doa_loss_type']))
126 | logging.info('Data augmentation methods used are: {}\n'.format(cfg['data_augmentation']['type']))
127 |
128 | train_initializer = {
129 | 'writer': writer,
130 | 'train_generator': train_generator,
131 | 'valid_generator': valid_generator,
132 | 'lr_scheduler': lr_scheduler,
133 | 'trainer': trainer,
134 | 'ckptIO': ckptIO,
135 | 'epoch_it': epoch_it,
136 | 'it': it
137 | }
138 |
139 | return train_initializer
140 |
141 |
142 | def init_infer(args, cfg, dataset):
143 | """ Inference initialization.
144 |
145 | Including Data generator, model, optimizer initialization.
146 | """
147 |
148 | '''Cuda'''
149 | args.cuda = not args.no_cuda and torch.cuda.is_available()
150 |
151 | '''Directories'''
152 | print('Inference ID is {}\n'.format(cfg['inference']['infer_id']))
153 | out_infer_dir = Path(cfg['workspace_dir']).joinpath('out_infer').joinpath(cfg['method']) \
154 | .joinpath(cfg['inference']['infer_id'])
155 | if out_infer_dir.is_dir():
156 | shutil.rmtree(str(out_infer_dir))
157 | submissions_dir = out_infer_dir.joinpath('submissions')
158 | submissions_dir.mkdir(parents=True, exist_ok=True)
159 | train_ids = [train_id.strip() for train_id in str(cfg['inference']['train_ids']).split(',')]
160 | models = [model.strip() for model in str(cfg['inference']['models']).split(',')]
161 | ckpts_paths_list = []
162 | ckpts_models_list = []
163 | for train_id, model_name in zip(train_ids, models):
164 | ckpts_dir = Path(cfg['workspace_dir']).joinpath('out_train').joinpath(cfg['method']) \
165 | .joinpath(train_id).joinpath('checkpoints')
166 | ckpt_path = [path for path in sorted(ckpts_dir.iterdir()) if path.stem.split('_')[-1].isnumeric()]
167 | for path in ckpt_path:
168 | ckpts_paths_list.append(path)
169 | ckpts_models_list.append(model_name)
170 |
171 | '''Parameters'''
172 | param_file = out_infer_dir.joinpath('config.yaml')
173 | if param_file.is_file():
174 | param_file.unlink()
175 | store_config(param_file, cfg)
176 |
177 | '''Data generator'''
178 | test_set, test_generator, _ = get_generator(args, cfg, dataset, generator_type='test')
179 |
180 | '''logging and return'''
181 | logging.info('Test clip number is: {}\n'.format(len(test_set)))
182 |
183 | infer_initializer = {
184 | 'submissions_dir': submissions_dir,
185 | 'ckpts_paths_list': ckpts_paths_list,
186 | 'ckpts_models_list': ckpts_models_list,
187 | 'test_generator': test_generator,
188 | 'cuda': args.cuda
189 | }
190 |
191 | return infer_initializer
192 |
--------------------------------------------------------------------------------
/seld/learning/preprocess.py:
--------------------------------------------------------------------------------
1 | import shutil
2 | from functools import reduce
3 | from pathlib import Path
4 | from timeit import default_timer as timer
5 |
6 | import h5py
7 | import librosa
8 | import numpy as np
9 | import pandas as pd
10 | import torch
11 | from methods.data import BaseDataset, collate_fn
12 | from torch.utils.data import DataLoader
13 | from tqdm import tqdm
14 | from utils.common import float_samples_to_int16
15 | from utils.config import get_afextractor
16 |
17 |
18 | class Preprocessor:
19 | """Preprocess the audio data.
20 |
21 | 1. Extract wav file and store to hdf5 file
22 | 2. Extract meta file and store to hdf5 file
23 | """
24 | def __init__(self, args, cfg, dataset):
25 | """
26 | Args:
27 | args: parsed args
28 | cfg: configurations
29 | dataset: dataset class
30 | """
31 | self.args = args
32 | self.cfg = cfg
33 | self.dataset = dataset
34 |
35 | # Path for dataset
36 | hdf5_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset'])
37 |
38 | # Path for extraction of wav
39 | self.data_dir_list = [
40 | dataset.dataset_dir[args.dataset_type]['foa'],
41 | dataset.dataset_dir[args.dataset_type]['mic']
42 | ]
43 | data_h5_dir = hdf5_dir.joinpath('data').joinpath('{}fs'.format(cfg['data']['sample_rate']))
44 | self.data_h5_dir_list = [
45 | data_h5_dir.joinpath(args.dataset_type).joinpath('foa'),
46 | data_h5_dir.joinpath(args.dataset_type).joinpath('mic')
47 | ]
48 | self.data_statistics_path_list = [
49 | data_h5_dir.joinpath(args.dataset_type).joinpath('statistics_foa.txt'),
50 | data_h5_dir.joinpath(args.dataset_type).joinpath('statistics_mic.txt')
51 | ]
52 |
53 | # Path for extraction of scalar
54 | self.scalar_h5_dir = hdf5_dir.joinpath('scalar')
55 | fn_scalar = '{}_{}_sr{}_nfft{}_hop{}_mel{}.h5'.format(cfg['data']['type'], cfg['data']['audio_feature'],
56 | cfg['data']['sample_rate'], cfg['data']['n_fft'], cfg['data']['hop_length'], cfg['data']['n_mels'])
57 | self.scalar_path = self.scalar_h5_dir.joinpath(fn_scalar)
58 |
59 | # Path for extraction of meta
60 | self.meta_dir = dataset.dataset_dir[args.dataset_type]['meta']
61 | self.meta_h5_dir = hdf5_dir.joinpath('meta').joinpath(args.dataset_type)
62 |
63 | def extract_data(self):
64 | """ Extract wave and store to hdf5 file
65 |
66 | """
67 | print('Converting wav file to hdf5 file starts......\n')
68 |
69 | for h5_dir in self.data_h5_dir_list:
70 | if h5_dir.is_dir():
71 | flag = input("HDF5 folder {} is already existed, delete it? (y/n)".format(h5_dir)).lower()
72 | if flag == 'y':
73 | shutil.rmtree(h5_dir)
74 | elif flag == 'n':
75 | print("User select not to remove the HDF5 folder {}. The process will quit.\n".format(h5_dir))
76 | return
77 | h5_dir.mkdir(parents=True)
78 | for statistic_path in self.data_statistics_path_list:
79 | if statistic_path.is_file():
80 | statistic_path.unlink()
81 |
82 | for idx, data_dir in enumerate(self.data_dir_list):
83 | begin_time = timer()
84 | h5_dir = self.data_h5_dir_list[idx]
85 | statistic_path = self.data_statistics_path_list[idx]
86 | audio_count = 0
87 | silent_audio_count = 0
88 | data_list = [path for path in sorted(data_dir.glob('*.wav')) if not path.name.startswith('.')]
89 | iterator = tqdm(data_list, total=len(data_list), unit='it')
90 | for data_path in iterator:
91 | # read data
92 | data, _ = librosa.load(data_path, sr=self.cfg['data']['sample_rate'], mono=False)
93 | if len(data.shape) == 1:
94 | data = data[None,:]
95 | '''data: (channels, samples)'''
96 |
97 |             # silent data statistics: a channel counts as silent if its mean absolute amplitude is below 1e-4
98 | lst = np.sum(np.abs(data), axis=1) > data.shape[1]*1e-4
99 | if not reduce(lambda x, y: x*y, lst):
100 | with statistic_path.open(mode='a+') as f:
101 | print(f"Silent file in feature extractor: {data_path.name}\n", file=f)
102 | silent_audio_count += 1
103 | tqdm.write("Silent file in feature extractor: {}".format(data_path.name))
104 | tqdm.write("Total silent files are: {}\n".format(silent_audio_count))
105 |
106 | # save to h5py
107 | h5_path = h5_dir.joinpath(data_path.stem + '.h5')
108 | with h5py.File(h5_path, 'w') as hf:
109 | hf.create_dataset(name='waveform', data=float_samples_to_int16(data), dtype=np.int16)
110 |
111 | audio_count += 1
112 |
113 | tqdm.write('{}, {}, {}'.format(audio_count, h5_path, data.shape))
114 |
115 | with statistic_path.open(mode='a+') as f:
116 | print(f"Total number of audio clips extracted: {audio_count}", file=f)
117 | print(f"Total number of silent audio clips is: {silent_audio_count}\n", file=f)
118 |
119 | iterator.close()
120 | print("Extacting feature finished! Time spent: {:.3f} s".format(timer() - begin_time))
121 |
122 | return
123 |
124 | def extract_scalar(self):
125 | """ Extract scalar and store to hdf5 file
126 |
127 | """
128 | print('Extracting scalar......\n')
129 | self.scalar_h5_dir.mkdir(parents=True, exist_ok=True)
130 |
131 | cuda_enabled = not self.args.no_cuda and torch.cuda.is_available()
132 | train_set = BaseDataset(self.args, self.cfg, self.dataset)
133 | data_generator = DataLoader(
134 | dataset=train_set,
135 | batch_size=32,
136 | shuffle=False,
137 | num_workers=self.args.num_workers,
138 | collate_fn=collate_fn,
139 | pin_memory=True
140 | )
141 | af_extractor = get_afextractor(self.cfg, cuda_enabled).eval()
142 | iterator = tqdm(enumerate(data_generator), total=len(data_generator), unit='it')
143 | features = []
144 | begin_time = timer()
145 | for it, batch_sample in iterator:
146 | if it == len(data_generator):
147 | break
148 | batch_x = batch_sample['waveform']
149 | batch_x.require_grad = False
150 | if cuda_enabled:
151 | batch_x = batch_x.cuda(non_blocking=True)
152 | batch_y = af_extractor(batch_x).transpose(0, 1)
153 | C, _, _, F = batch_y.shape
154 | features.append(batch_y.reshape(C, -1, F).cpu().numpy())
155 | iterator.close()
156 | features = np.concatenate(features, axis=1)
157 | mean = []
158 | std = []
159 |
160 | for ch in range(C):
161 | mean.append(np.mean(features[ch], axis=0, keepdims=True))
162 | std.append(np.std(features[ch], axis=0, keepdims=True))
163 | mean = np.stack(mean)[None, ...]
164 | std = np.stack(std)[None, ...]
165 |
166 | # save to h5py
167 | with h5py.File(self.scalar_path, 'w') as hf:
168 | hf.create_dataset(name='mean', data=mean, dtype=np.float32)
169 | hf.create_dataset(name='std', data=std, dtype=np.float32)
170 | print("\nScalar saved to {}\n".format(str(self.scalar_path)))
171 | print("Extacting scalar finished! Time spent: {:.3f} s\n".format(timer() - begin_time))
172 |
173 | def extract_meta(self):
174 | """ Extract meta .csv file and re-organize the meta data and store it to hdf5 file.
175 |
176 | """
177 | print('Converting meta file to hdf5 file starts......\n')
178 |
179 | shutil.rmtree(str(self.meta_h5_dir), ignore_errors=True)
180 | self.meta_h5_dir.mkdir(parents=True, exist_ok=True)
181 |
182 | if self.cfg['dataset'] == 'dcase2020task3':
183 | self.extract_meta_dcase2020task3()
184 |
185 | def extract_meta_dcase2020task3(self):
186 | num_frames = 600
187 | num_tracks = 2
188 | num_classes = 14
189 | meta_list = [path for path in sorted(self.meta_dir.glob('*.csv')) if not path.name.startswith('.')]
190 | iterator = tqdm(enumerate(meta_list), total=len(meta_list), unit='it')
191 | for idx, meta_file in iterator:
192 | fn = meta_file.stem
193 | df = pd.read_csv(meta_file, header=None)
194 | sed_label = np.zeros((num_frames, num_tracks, num_classes))
195 | doa_label = np.zeros((num_frames, num_tracks, 3))
196 | event_indexes = np.array([[None, None]] * num_frames) # event indexes of all frames
197 | track_numbers = np.array([[None, None]] * num_frames) # track number of all frames
198 | for row in df.iterrows():
199 | frame_idx = row[1][0]
200 | event_idx = row[1][1]
201 | track_number = row[1][2]
202 | azi = row[1][3]
203 | elev = row[1][4]
204 |
205 | ##### track indexing #####
206 | # default assign current_track_idx to the first available track
207 | current_event_indexes = event_indexes[frame_idx]
208 | current_track_indexes = np.where(current_event_indexes == None)[0].tolist()
209 | current_track_idx = current_track_indexes[0]
210 |
211 | # tracking from the last frame if the last frame is not empty
212 | last_event_indexes = np.array([None, None]) if frame_idx == 0 else event_indexes[frame_idx-1]
213 | last_track_indexes = np.where(last_event_indexes != None)[0].tolist() # event index of the last frame
214 | last_events_tracks = list(zip(event_indexes[frame_idx-1], track_numbers[frame_idx-1]))
215 | if last_track_indexes != []:
216 | for track_idx in last_track_indexes:
217 | if last_events_tracks[track_idx] == (event_idx, track_number):
218 |                         if current_track_idx != track_idx: # swap tracks if track_idx is not consistent
219 | sed_label[frame_idx, [current_track_idx, track_idx]] = \
220 | sed_label[frame_idx, [track_idx, current_track_idx]]
221 | doa_label[frame_idx, [current_track_idx, track_idx]] = \
222 | doa_label[frame_idx, [track_idx, current_track_idx]]
223 | event_indexes[frame_idx, [current_track_idx, track_idx]] = \
224 | event_indexes[frame_idx, [track_idx, current_track_idx]]
225 | track_numbers[frame_idx, [current_track_idx, track_idx]] = \
226 | track_numbers[frame_idx, [track_idx, current_track_idx]]
227 | current_track_idx = track_idx
228 | #########################
229 |
230 | # label encode
231 | azi_rad, elev_rad = azi * np.pi / 180, elev * np.pi / 180
232 | sed_label[frame_idx, current_track_idx, event_idx] = 1.0
233 | doa_label[frame_idx, current_track_idx, :] = np.cos(elev_rad) * np.cos(azi_rad), \
234 | np.cos(elev_rad) * np.sin(azi_rad), np.sin(elev_rad)
235 | event_indexes[frame_idx, current_track_idx] = event_idx
236 | track_numbers[frame_idx, current_track_idx] = track_number
237 |
238 | meta_h5_path = self.meta_h5_dir.joinpath(fn + '.h5')
239 | with h5py.File(meta_h5_path, 'w') as hf:
240 | hf.create_dataset(name='sed_label', data=sed_label, dtype=np.float32)
241 | hf.create_dataset(name='doa_label', data=doa_label, dtype=np.float32)
242 |
243 | tqdm.write('{}, {}'.format(idx, meta_h5_path))
244 |
245 |
--------------------------------------------------------------------------------
/seld/learning/train.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from timeit import default_timer as timer
3 |
4 | from tqdm import tqdm
5 |
6 | from utils.common import print_metrics
7 |
8 |
9 | def train(cfg, **initializer):
10 | """Train
11 |
12 | """
13 | writer = initializer['writer']
14 | train_generator = initializer['train_generator']
15 | valid_generator = initializer['valid_generator']
16 | lr_scheduler = initializer['lr_scheduler']
17 | trainer = initializer['trainer']
18 | ckptIO = initializer['ckptIO']
19 | epoch_it = initializer['epoch_it']
20 | it = initializer['it']
21 |
22 | batchNum_per_epoch = len(train_generator)
23 | max_epoch = cfg['training']['max_epoch']
24 |
25 | logging.info('===> Training mode\n')
26 | iterator = tqdm(train_generator, total=max_epoch*batchNum_per_epoch-it, unit='it')
27 | train_begin_time = timer()
28 | for batch_sample in iterator:
29 |
30 | epoch_it, rem_batch = it // batchNum_per_epoch, it % batchNum_per_epoch
31 |
32 | ################
33 | ## Validation
34 | ################
35 | if it % int(1*batchNum_per_epoch) == 0:
36 | valid_begin_time = timer()
37 | train_time = valid_begin_time - train_begin_time
38 |
39 | train_losses = trainer.validate_step(valid_type='train', epoch_it=epoch_it)
40 | for k, v in train_losses.items():
41 | train_losses[k] = v / batchNum_per_epoch
42 |
43 | if cfg['training']['valid_fold']:
44 | valid_losses, valid_metrics = trainer.validate_step(
45 | generator=valid_generator,
46 | valid_type='valid',
47 | epoch_it=epoch_it
48 | )
49 | valid_time = timer() - valid_begin_time
50 |
51 | writer.add_scalar('train/lr', lr_scheduler.get_last_lr()[0], it)
52 | logging.info('---------------------------------------------------------------------------------------------------'
53 | +'------------------------------------------------------')
54 | logging.info('Iter: {}, Epoch/Total Epoch: {}/{}, Batch/Total Batch: {}/{}'.format(
55 | it, epoch_it, max_epoch, rem_batch, batchNum_per_epoch))
56 | print_metrics(logging, writer, train_losses, it, set_type='train')
57 | if cfg['training']['valid_fold']:
58 | print_metrics(logging, writer, valid_losses, it, set_type='valid')
59 | if cfg['training']['valid_fold']:
60 | print_metrics(logging, writer, valid_metrics, it, set_type='valid')
61 | logging.info('Train time: {:.3f}s, Valid time: {:.3f}s, Lr: {}'.format(
62 | train_time, valid_time, lr_scheduler.get_last_lr()[0]))
63 | if 'PIT_type' in cfg['training']:
64 | logging.info('PIT type: {}'.format(cfg['training']['PIT_type']))
65 | logging.info('---------------------------------------------------------------------------------------------------'
66 | +'------------------------------------------------------')
67 |
68 | train_begin_time = timer()
69 |
70 | ###############
71 | ## Save model
72 | ###############
73 | if rem_batch == 0 and it > 0:
74 | if cfg['training']['valid_fold']:
75 | ckptIO.save(epoch_it, it, metrics=valid_metrics, key_rank='seld20', rank_order='latest')
76 | else:
77 | ckptIO.save(epoch_it, it, metrics=train_losses, key_rank='loss_all', rank_order='latest')
78 |
79 | ###############
80 | ## Finish training
81 | ###############
82 | if it == max_epoch * batchNum_per_epoch:
83 | iterator.close()
84 | break
85 |
86 | ###############
87 | ## Train
88 | ###############
89 | trainer.train_step(batch_sample, epoch_it)
90 | if rem_batch == 0 and it > 0:
91 | lr_scheduler.step()
92 |
93 | it += 1
94 |
95 | iterator.close()
96 |
97 |
--------------------------------------------------------------------------------
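
train.py drives everything from a single iteration counter: the epoch index and the position within the epoch are derived with integer division and modulo, validation and checkpointing happen at epoch boundaries, and training stops once the counter reaches max_epoch * batches_per_epoch. A small sketch of that bookkeeping with a hypothetical epoch size:

    batches_per_epoch = 100          # hypothetical; the real value is len(train_generator)
    max_epoch = 3

    for it in range(max_epoch * batches_per_epoch + 1):
        epoch_it, rem_batch = it // batches_per_epoch, it % batches_per_epoch
        if rem_batch == 0 and it > 0:
            # epoch boundary: train.py saves a checkpoint and steps the lr scheduler here
            print('completed {} epoch(s) at iteration {}'.format(epoch_it, it))
        if it == max_epoch * batches_per_epoch:
            break                    # same stopping condition as the training loop
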
/seld/main.py:
--------------------------------------------------------------------------------
1 | import sys
2 | from pathlib import Path
3 |
4 | from learning import evaluate, initialize, infer, preprocess, train
5 | from utils.cli_parser import parse_cli_overides
6 | from utils.config import get_dataset
7 |
8 |
9 | def main(args, cfg):
10 | """Execute a task based on the given command-line arguments.
11 |
12 | This function is the main entry-point of the program. It allows the
13 | user to extract features, train a model, infer predictions, and
14 | evaluate predictions using the command-line interface.
15 |
16 | Args:
17 | args: command line arguments.
18 | Return:
19 | 0: successful termination
20 | 'any nonzero value': abnormal termination
21 | """
22 |
23 | # Create workspace
24 | Path(cfg['workspace_dir']).mkdir(parents=True, exist_ok=True)
25 |
26 | # Dataset initialization
27 | dataset = get_dataset(dataset_name=cfg['dataset'], root_dir=cfg['dataset_dir'])
28 |
29 | # Preprocess
30 | if args.mode == 'preprocess':
31 | preprocessor = preprocess.Preprocessor(args, cfg, dataset)
32 |
33 | if args.preproc_mode == 'extract_data':
34 | preprocessor.extract_data()
35 | elif args.preproc_mode == 'extract_scalar':
36 | preprocessor.extract_scalar()
37 | elif args.preproc_mode == 'extract_meta':
38 | preprocessor.extract_meta()
39 |
40 | # Train
41 | if args.mode == 'train':
42 | train_initializer = initialize.init_train(args, cfg, dataset)
43 | train.train(cfg, **train_initializer)
44 |
45 | # Inference
46 | elif args.mode == 'infer':
47 | infer_initializer = initialize.init_infer(args, cfg, dataset)
48 | infer.infer(cfg, dataset, **infer_initializer)
49 |
50 | # Evaluate
51 | elif args.mode == 'evaluate':
52 | evaluate.evaluate(cfg, dataset)
53 |
54 | return 0
55 |
56 |
57 | if __name__ == '__main__':
58 | args, cfg = parse_cli_overides()
59 | sys.exit(main(args, cfg))
60 |
--------------------------------------------------------------------------------
/seld/methods/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/seld/methods/__init__.py
--------------------------------------------------------------------------------
/seld/methods/data.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 |
3 | import h5py
4 | import numpy as np
5 | import torch
6 | from torch.utils.data import Dataset, Sampler
7 |
8 | from methods.utils.data_utilities import _segment_index
9 | from utils.common import int16_samples_to_float32
10 |
11 |
12 | class BaseDataset(Dataset):
13 | """ User defined datset
14 |
15 | """
16 | def __init__(self, args, cfg, dataset):
17 | """
18 | Args:
19 | args: input args
20 | cfg: configurations
21 | dataset: dataset used
22 | """
23 | super().__init__()
24 |
25 | self.args = args
26 | self.sample_rate = cfg['data']['sample_rate']
27 | self.clip_length = dataset.clip_length
28 |
29 | # Chunk length, hop length, and segmentation. Since all clips are 60 s long, segmentation is only computed once here
30 | data = np.zeros((1, self.clip_length * self.sample_rate))
31 | chunklen = int(10 * self.sample_rate)
32 | self.segmented_indexes, self.segmented_pad_width = _segment_index(data, chunklen, hoplen=chunklen)
33 | self.num_segments = len(self.segmented_indexes)
34 |
35 | # Data path
36 | data_sr_folder_name = '{}fs'.format(self.sample_rate)
37 | main_data_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('data').joinpath(data_sr_folder_name)
38 | self.data_dir = main_data_dir.joinpath('dev').joinpath(cfg['data']['type'])
39 | self.fn_list = [path.stem for path in sorted(self.data_dir.glob('*.h5')) \
40 | if not path.name.startswith('.')]
41 | self.fn_list = [fn + '%' + str(n) for fn in self.fn_list for n in range(self.num_segments)]
42 |
43 | def __len__(self):
44 | """Get length of the dataset
45 |
46 | """
47 | return len(self.fn_list)
48 |
49 | def __getitem__(self, idx):
50 | """
51 | Read a waveform segment from the dataset
52 | """
53 | fn_segment = self.fn_list[idx]
54 | fn, n_segment = fn_segment.split('%')[0], int(fn_segment.split('%')[1])
55 | data_path = self.data_dir.joinpath(fn + '.h5')
56 | index_begin = self.segmented_indexes[n_segment][0]
57 | index_end = self.segmented_indexes[n_segment][1]
58 | pad_width_before = self.segmented_pad_width[n_segment][0]
59 | pad_width_after = self.segmented_pad_width[n_segment][1]
60 | with h5py.File(data_path, 'r') as hf:
61 | x = int16_samples_to_float32(hf['waveform'][:, index_begin: index_end])
62 | pad_width = ((0, 0), (pad_width_before, pad_width_after))
63 | x = np.pad(x, pad_width, mode='constant')
64 | sample = {
65 | 'waveform': x
66 | }
67 |
68 | return sample
69 |
70 |
71 | class UserBatchSampler(Sampler):
72 | """User defined batch sampler. Only for train set.
73 |
74 | """
75 | def __init__(self, clip_num, batch_size, seed=2020):
76 | self.clip_num = clip_num
77 | self.batch_size = batch_size
78 | self.random_state = np.random.RandomState(seed)
79 |
80 | self.indexes = np.arange(self.clip_num)
81 | self.random_state.shuffle(self.indexes)
82 | self.pointer = 0
83 |
84 | def get_state(self):
85 | sampler_state = {
86 | 'random': self.random_state.get_state(),
87 | 'indexes': self.indexes,
88 | 'pointer': self.pointer
89 | }
90 | return sampler_state
91 |
92 | def set_state(self, sampler_state):
93 | self.random_state.set_state(sampler_state['random'])
94 | self.indexes = sampler_state['indexes']
95 | self.pointer = sampler_state['pointer']
96 |
97 | def __iter__(self):
98 | """
99 | Return:
100 | batch_indexes (int): indexes of batch
101 | """
102 | while True:
103 | if self.pointer >= self.clip_num:
104 | self.pointer = 0
105 | self.random_state.shuffle(self.indexes)
106 |
107 | batch_indexes = self.indexes[self.pointer: self.pointer + self.batch_size]
108 | self.pointer += self.batch_size
109 | yield batch_indexes
110 |
111 | def __len__(self):
112 | return (self.clip_num + self.batch_size - 1) // self.batch_size
113 |
114 |
115 | class PinMemCustomBatch:
116 | def __init__(self, batch_dict):
117 | batch_x = []
118 | for n in range(len(batch_dict)):
119 | batch_x.append(batch_dict[n]['waveform'])
120 | self.batch_out_dict = {
121 | 'waveform': torch.tensor(batch_x, dtype=torch.float32),
122 | }
123 |
124 | def pin_memory(self):
125 | self.batch_out_dict['waveform'] = self.batch_out_dict['waveform'].pin_memory()
126 | return self.batch_out_dict
127 |
128 |
129 | def collate_fn(batch_dict):
130 | """
131 | Merges a list of samples to form a mini-batch
132 | Pin memory for customized dataset
133 | """
134 | return PinMemCustomBatch(batch_dict)
135 |
--------------------------------------------------------------------------------
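
Both dataset classes call _segment_index from methods/utils/data_utilities.py (not shown in this listing) to split each 60 s clip into fixed-length chunks with a hop and to record how much padding the last chunk needs. A rough, illustrative sketch of that kind of segmentation (zero padding at the tail only), not the repo's exact implementation:

    import numpy as np

    def segment_index_sketch(data, chunklen, hoplen):
        """Return (begin, end) sample indexes and (pad_before, pad_after) per chunk."""
        n_samples = data.shape[1]
        indexes, pad_widths = [], []
        begin = 0
        while begin < n_samples:
            end = begin + chunklen
            indexes.append((begin, min(end, n_samples)))
            pad_widths.append((0, max(0, end - n_samples)))   # zero-pad the tail chunk
            begin += hoplen
        return indexes, pad_widths

    data = np.zeros((1, 24000 * 60))                  # hypothetical 60 s clip at 24 kHz
    idx, pads = segment_index_sketch(data, chunklen=24000 * 5, hoplen=24000 * 5)
    print(len(idx), idx[0], pads[-1])                 # 12 non-overlapping 5 s chunks
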
/seld/methods/data_augmentation/__init__.py:
--------------------------------------------------------------------------------
1 | # from .rotate import *
2 | # from .specaug import *
3 | # from .crop import *
4 |
--------------------------------------------------------------------------------
/seld/methods/ein_seld/__init__.py:
--------------------------------------------------------------------------------
1 | from . import (data, data_augmentation, inference, losses, metrics, models,
2 | training)
3 |
4 | __all__ = [
5 |     'data_augmentation', 'models', 'data', 'training', 'inference', 'losses', 'metrics'
6 | ]
7 |
--------------------------------------------------------------------------------
/seld/methods/ein_seld/data.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | from timeit import default_timer as timer
3 |
4 | import h5py
5 | import numpy as np
6 | import torch
7 | from methods.utils.data_utilities import (_segment_index, load_dcase_format,
8 | to_metrics2020_format)
9 | from torch.utils.data import Dataset, Sampler
10 | from tqdm import tqdm
11 | from utils.common import int16_samples_to_float32
12 |
13 |
14 | class UserDataset(Dataset):
15 | """ User defined datset
16 |
17 | """
18 | def __init__(self, args, cfg, dataset, dataset_type='train', overlap=''):
19 | """
20 | Args:
21 | args: input args
22 | cfg: configurations
23 | dataset: dataset used
24 | dataset_type: 'train' | 'valid' | 'dev_test' | 'eval_test'
25 | overlap: '1' | '2'
26 | """
27 | super().__init__()
28 |
29 | self.dataset_type = dataset_type
30 | self.read_into_mem = args.read_into_mem
31 | self.sample_rate = cfg['data']['sample_rate']
32 | self.clip_length = dataset.clip_length
33 | self.label_resolution = dataset.label_resolution
34 | self.frame_length = int(self.clip_length / self.label_resolution)
35 | self.label_interp_ratio = int(self.label_resolution * self.sample_rate / cfg['data']['hop_length'])
36 |
37 | # Chunk length, hop length, and segmentation. Since all clips are 60 s long, segmentation is only computed once here
38 | data = np.zeros((1, self.clip_length * self.sample_rate))
39 | if 'train' in self.dataset_type:
40 | chunklen = int(cfg['data']['train_chunklen_sec'] * self.sample_rate)
41 | hoplen = int(cfg['data']['train_hoplen_sec'] * self.sample_rate)
42 | self.segmented_indexes, self.segmented_pad_width = _segment_index(data, chunklen, hoplen)
43 | elif self.dataset_type in ['valid', 'dev_test', 'eval_test']:
44 | chunklen = int(cfg['data']['test_chunklen_sec'] * self.sample_rate)
45 | hoplen = int(cfg['data']['test_hoplen_sec'] * self.sample_rate)
46 | self.segmented_indexes, self.segmented_pad_width = _segment_index(data, chunklen, hoplen, last_frame_always_paddding=True)
47 | self.num_segments = len(self.segmented_indexes)
48 |
49 | # Data and meta path
50 | fold_str_idx = dataset.fold_str_index
51 | ov_str_idx = dataset.ov_str_index
52 | data_sr_folder_name = '{}fs'.format(self.sample_rate)
53 | main_data_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('data').joinpath(data_sr_folder_name)
54 | dev_data_dir = main_data_dir.joinpath('dev').joinpath(cfg['data']['type'])
55 | eval_data_dir = main_data_dir.joinpath('eval').joinpath(cfg['data']['type'])
56 | main_meta_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('meta')
57 | dev_meta_dir = main_meta_dir.joinpath('dev')
58 | eval_meta_dir = main_meta_dir.joinpath('eval')
59 | if self.dataset_type == 'train':
60 | data_dirs = [dev_data_dir]
61 | self.meta_dir = dev_meta_dir
62 | train_fold = [int(fold.strip()) for fold in str(cfg['training']['train_fold']).split(',')]
63 | ov_set = str(cfg['training']['overlap']) if not overlap else overlap
64 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \
65 | if int(path.stem[fold_str_idx]) in train_fold and path.stem[ov_str_idx] in ov_set \
66 | and not path.name.startswith('.')]
67 | elif self.dataset_type == 'valid':
68 | if cfg['training']['valid_fold'] != 'eval':
69 | data_dirs = [dev_data_dir]
70 | self.meta_dir = dev_meta_dir
71 | valid_fold = [int(fold.strip()) for fold in str(cfg['training']['valid_fold']).split(',')]
72 | ov_set = str(cfg['training']['overlap']) if not overlap else overlap
73 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \
74 | if int(path.stem[fold_str_idx]) in valid_fold and path.stem[ov_str_idx] in ov_set \
75 | and not path.name.startswith('.')]
76 | ori_meta_dir = Path(cfg['dataset_dir']).joinpath('metadata_dev')
77 | else:
78 | data_dirs = [eval_data_dir]
79 | self.meta_dir = eval_meta_dir
80 | ov_set = str(cfg['training']['overlap']) if not overlap else overlap
81 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \
82 | if not path.name.startswith('.')]
83 | ori_meta_dir = Path(cfg['dataset_dir']).joinpath('metadata_eval')
84 | frame_begin_index = 0
85 | self.valid_gt_sed_metrics2019 = []
86 | self.valid_gt_doa_metrics2019 = []
87 | self.valid_gt_dcaseformat = {}
88 | for path in self.paths_list:
89 | ori_meta_path = ori_meta_dir.joinpath(path.stem + '.csv')
90 | output_dict, sed_metrics2019, doa_metrics2019 = \
91 | load_dcase_format(ori_meta_path, frame_begin_index=frame_begin_index,
92 | frame_length=self.frame_length, num_classes=len(dataset.label_set))
93 | self.valid_gt_dcaseformat.update(output_dict)
94 | self.valid_gt_sed_metrics2019.append(sed_metrics2019)
95 | self.valid_gt_doa_metrics2019.append(doa_metrics2019)
96 | frame_begin_index += self.frame_length
97 | self.valid_gt_sed_metrics2019 = np.concatenate(self.valid_gt_sed_metrics2019, axis=0)
98 | self.valid_gt_doa_metrics2019 = np.concatenate(self.valid_gt_doa_metrics2019, axis=0)
99 | self.gt_metrics2020_dict = to_metrics2020_format(self.valid_gt_dcaseformat,
100 | self.valid_gt_sed_metrics2019.shape[0], label_resolution=self.label_resolution)
101 | elif self.dataset_type == 'dev_test':
102 | data_dirs = [dev_data_dir]
103 | self.meta_dir = dev_meta_dir
104 | dev_test_fold = [int(fold.strip()) for fold in str(cfg['inference']['test_fold']).split(',')]
105 | ov_set = str(cfg['inference']['overlap']) if not overlap else overlap
106 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \
107 | if int(path.stem[fold_str_idx]) in dev_test_fold and path.stem[ov_str_idx] in ov_set \
108 | and not path.name.startswith('.')]
109 | elif self.dataset_type == 'eval_test':
110 | data_dirs = [eval_data_dir]
111 | self.meta_dir = eval_meta_dir
112 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \
113 | if not path.name.startswith('.')]
114 | self.paths_list = [Path(str(path) + '%' + str(n)) for path in self.paths_list for n in range(self.num_segments)]
115 |
116 | # Read into memory
117 | if self.read_into_mem:
118 | load_begin_time = timer()
119 | print('Start to load dataset: {}, ov={}......\n'.format(self.dataset_type + ' set', ov_set))
120 | iterator = tqdm(self.paths_list, total=len(self.paths_list), unit='clips')
121 | self.dataset_list = []
122 | for path in iterator:
123 | fn, n_segment = path.stem, int(path.name.split('%')[1])
124 | data_path = Path(str(path).split('%')[0])
125 | index_begin = self.segmented_indexes[n_segment][0]
126 | index_end = self.segmented_indexes[n_segment][1]
127 | pad_width_before = self.segmented_pad_width[n_segment][0]
128 | pad_width_after = self.segmented_pad_width[n_segment][1]
129 | with h5py.File(data_path, 'r') as hf:
130 | x = int16_samples_to_float32(hf['waveform'][:, index_begin: index_end])
131 | pad_width = ((0, 0), (pad_width_before, pad_width_after))
132 | x = np.pad(x, pad_width, mode='constant')
133 | if 'test' not in self.dataset_type:
134 | ov = fn[-1]
135 | index_begin_label = int(index_begin / (self.sample_rate * self.label_resolution))
136 | index_end_label = int(index_end / (self.sample_rate * self.label_resolution))
137 | # pad_width_before_label = int(pad_width_before / (self.sample_rate * self.label_resolution))
138 | pad_width_after_label = int(pad_width_after / (self.sample_rate * self.label_resolution))
139 | meta_path = self.meta_dir.joinpath(fn + '.h5')
140 | with h5py.File(meta_path, 'r') as hf:
141 | sed_label = hf['sed_label'][index_begin_label: index_end_label, ...]
142 | doa_label = hf['doa_label'][index_begin_label: index_end_label, ...] # NOTE: this is Cartesian coordinates
143 | if pad_width_after_label != 0:
144 | sed_label_new = np.zeros((pad_width_after_label, 2, 14))
145 | doa_label_new = np.zeros((pad_width_after_label, 2, 3))
146 | sed_label = np.concatenate((sed_label, sed_label_new), axis=0)
147 | doa_label = np.concatenate((doa_label, doa_label_new), axis=0)
148 | self.dataset_list.append({
149 | 'filename': fn,
150 | 'n_segment': n_segment,
151 | 'ov': ov,
152 | 'waveform': x,
153 | 'sed_label': sed_label,
154 | 'doa_label': doa_label
155 | })
156 | else:
157 | self.dataset_list.append({
158 | 'filename': fn,
159 | 'n_segment': n_segment,
160 | 'waveform': x
161 | })
162 | iterator.close()
163 | print('Loading dataset time: {:.3f}s\n'.format(timer()-load_begin_time))
164 |
165 | def __len__(self):
166 | """Get length of the dataset
167 |
168 | """
169 | return len(self.paths_list)
170 |
171 | def __getitem__(self, idx):
172 | """
173 | Read a waveform segment from the dataset
174 | """
175 | if self.read_into_mem:
176 | data_dict = self.dataset_list[idx]
177 | fn = data_dict['filename']
178 | n_segment = data_dict['n_segment']
179 | x = data_dict['waveform']
180 | if 'test' not in self.dataset_type:
181 | ov = data_dict['ov']
182 | sed_label = data_dict['sed_label']
183 | doa_label = data_dict['doa_label']
184 | else:
185 | path = self.paths_list[idx]
186 | fn, n_segment = path.stem, int(path.name.split('%')[1])
187 | data_path = Path(str(path).split('%')[0])
188 | index_begin = self.segmented_indexes[n_segment][0]
189 | index_end = self.segmented_indexes[n_segment][1]
190 | pad_width_before = self.segmented_pad_width[n_segment][0]
191 | pad_width_after = self.segmented_pad_width[n_segment][1]
192 | with h5py.File(data_path, 'r') as hf:
193 | x = int16_samples_to_float32(hf['waveform'][:, index_begin: index_end])
194 | pad_width = ((0, 0), (pad_width_before, pad_width_after))
195 | x = np.pad(x, pad_width, mode='constant')
196 | if 'test' not in self.dataset_type:
197 | ov = fn[-1]
198 | index_begin_label = int(index_begin / (self.sample_rate * self.label_resolution))
199 | index_end_label = int(index_end / (self.sample_rate * self.label_resolution))
200 | # pad_width_before_label = int(pad_width_before / (self.sample_rate * self.label_resolution))
201 | pad_width_after_label = int(pad_width_after / (self.sample_rate * self.label_resolution))
202 | meta_path = self.meta_dir.joinpath(fn + '.h5')
203 | with h5py.File(meta_path, 'r') as hf:
204 | sed_label = hf['sed_label'][index_begin_label: index_end_label, ...]
205 | doa_label = hf['doa_label'][index_begin_label: index_end_label, ...] # NOTE: this is Cartesian coordinates
206 | if pad_width_after_label != 0:
207 | sed_label_new = np.zeros((pad_width_after_label, 2, 14))
208 | doa_label_new = np.zeros((pad_width_after_label, 2, 3))
209 | sed_label = np.concatenate((sed_label, sed_label_new), axis=0)
210 | doa_label = np.concatenate((doa_label, doa_label_new), axis=0)
211 |
212 | if 'test' not in self.dataset_type:
213 | sample = {
214 | 'filename': fn,
215 | 'n_segment': n_segment,
216 | 'ov': ov,
217 | 'waveform': x,
218 | 'sed_label': sed_label,
219 | 'doa_label': doa_label
220 | }
221 | else:
222 | sample = {
223 | 'filename': fn,
224 | 'n_segment': n_segment,
225 | 'waveform': x
226 | }
227 |
228 | return sample
229 |
230 |
231 | class UserBatchSampler(Sampler):
232 | """User defined batch sampler. Only for train set.
233 |
234 | """
235 | def __init__(self, clip_num, batch_size, seed=2020):
236 | self.clip_num = clip_num
237 | self.batch_size = batch_size
238 | self.random_state = np.random.RandomState(seed)
239 |
240 | self.indexes = np.arange(self.clip_num)
241 | self.random_state.shuffle(self.indexes)
242 | self.pointer = 0
243 |
244 | def get_state(self):
245 | sampler_state = {
246 | 'random': self.random_state.get_state(),
247 | 'indexes': self.indexes,
248 | 'pointer': self.pointer
249 | }
250 | return sampler_state
251 |
252 | def set_state(self, sampler_state):
253 | self.random_state.set_state(sampler_state['random'])
254 | self.indexes = sampler_state['indexes']
255 | self.pointer = sampler_state['pointer']
256 |
257 | def __iter__(self):
258 | """
259 | Return:
260 | batch_indexes (int): indexes of batch
261 | """
262 | while True:
263 | if self.pointer >= self.clip_num:
264 | self.pointer = 0
265 | self.random_state.shuffle(self.indexes)
266 |
267 | batch_indexes = self.indexes[self.pointer: self.pointer + self.batch_size]
268 | self.pointer += self.batch_size
269 | yield batch_indexes
270 |
271 | def __len__(self):
272 | return (self.clip_num + self.batch_size - 1) // self.batch_size
273 |
274 |
275 | class PinMemCustomBatch:
276 | def __init__(self, batch_dict):
277 | batch_fn = []
278 | batch_n_segment = []
279 | batch_ov = []
280 | batch_x = []
281 | batch_sed_label = []
282 | batch_doa_label = []
283 |
284 | for n in range(len(batch_dict)):
285 | batch_fn.append(batch_dict[n]['filename'])
286 | batch_n_segment.append(batch_dict[n]['n_segment'])
287 | batch_ov.append(batch_dict[n]['ov'])
288 | batch_x.append(batch_dict[n]['waveform'])
289 | batch_sed_label.append(batch_dict[n]['sed_label'])
290 | batch_doa_label.append(batch_dict[n]['doa_label'])
291 |
292 | self.batch_out_dict = {
293 | 'filename': batch_fn,
294 | 'n_segment': batch_n_segment,
295 | 'ov': batch_ov,
296 | 'waveform': torch.tensor(batch_x, dtype=torch.float32),
297 | 'sed_label': torch.tensor(batch_sed_label, dtype=torch.float32),
298 | 'doa_label': torch.tensor(batch_doa_label, dtype=torch.float32),
299 | }
300 |
301 | def pin_memory(self):
302 | self.batch_out_dict['waveform'] = self.batch_out_dict['waveform'].pin_memory()
303 | self.batch_out_dict['sed_label'] = self.batch_out_dict['sed_label'].pin_memory()
304 | self.batch_out_dict['doa_label'] = self.batch_out_dict['doa_label'].pin_memory()
305 | return self.batch_out_dict
306 |
307 |
308 | def collate_fn(batch_dict):
309 | """
310 | Merges a list of samples to form a mini-batch
311 | Pin memory for customized dataset
312 | """
313 | return PinMemCustomBatch(batch_dict)
314 |
315 |
316 | class PinMemCustomBatchTest:
317 | def __init__(self, batch_dict):
318 | batch_fn = []
319 | batch_n_segment = []
320 | batch_x = []
321 |
322 | for n in range(len(batch_dict)):
323 | batch_fn.append(batch_dict[n]['filename'])
324 | batch_n_segment.append(batch_dict[n]['n_segment'])
325 | batch_x.append(batch_dict[n]['waveform'])
326 |
327 | self.batch_out_dict = {
328 | 'filename': batch_fn,
329 | 'n_segment': batch_n_segment,
330 | 'waveform': torch.tensor(batch_x, dtype=torch.float32)
331 | }
332 |
333 | def pin_memory(self):
334 | self.batch_out_dict['waveform'] = self.batch_out_dict['waveform'].pin_memory()
335 | return self.batch_out_dict
336 |
337 |
338 | def collate_fn_test(batch_dict):
339 | """
340 | Merges a list of samples to form a mini-batch
341 | Pin memory for customized dataset
342 | """
343 | return PinMemCustomBatchTest(batch_dict)
344 |
--------------------------------------------------------------------------------
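
UserBatchSampler is an infinite sampler, and its get_state/set_state pair is what makes shuffling resumable when training restarts from a checkpoint. A small usage sketch with hypothetical sizes:

    sampler = UserBatchSampler(clip_num=10, batch_size=4, seed=2020)
    it = iter(sampler)

    first_batch = next(it)            # an array of 4 clip indexes
    state = sampler.get_state()       # snapshot: RNG state, shuffled indexes, pointer
    next_batch = next(it)             # the batch that follows the snapshot

    # later, e.g. after resuming from a checkpoint
    resumed = UserBatchSampler(clip_num=10, batch_size=4, seed=2020)
    resumed.set_state(state)
    assert (next(iter(resumed)) == next_batch).all()   # sampling continues identically
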
/seld/methods/ein_seld/data_augmentation/__init__.py:
--------------------------------------------------------------------------------
1 | # build your data augmentation method in this folder
2 | # from .trackmix import *
3 |
--------------------------------------------------------------------------------
/seld/methods/ein_seld/inference.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 |
3 | import h5py
4 | import numpy as np
5 | import torch
6 | from methods.inference import BaseInferer
7 | from tqdm import tqdm
8 |
9 | from .training import to_dcase_format
10 |
11 |
12 | class Inferer(BaseInferer):
13 |
14 | def __init__(self, cfg, dataset, af_extractor, model, cuda):
15 | super().__init__()
16 | self.cfg = cfg
17 | self.af_extractor = af_extractor
18 | self.model = model
19 | self.cuda = cuda
20 |
21 | # Scalar
22 | scalar_h5_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('scalar')
23 | fn_scalar = '{}_{}_sr{}_nfft{}_hop{}_mel{}.h5'.format(cfg['data']['type'], cfg['data']['audio_feature'],
24 | cfg['data']['sample_rate'], cfg['data']['n_fft'], cfg['data']['hop_length'], cfg['data']['n_mels'])
25 | scalar_path = scalar_h5_dir.joinpath(fn_scalar)
26 | with h5py.File(scalar_path, 'r') as hf:
27 | self.mean = hf['mean'][:]
28 | self.std = hf['std'][:]
29 | if cuda:
30 | self.mean = torch.tensor(self.mean, dtype=torch.float32).cuda()
31 | self.std = torch.tensor(self.std, dtype=torch.float32).cuda()
32 |
33 | self.label_resolution = dataset.label_resolution
34 | self.label_interp_ratio = int(self.label_resolution * cfg['data']['sample_rate'] / cfg['data']['hop_length'])
35 |
36 | def infer(self, generator):
37 | fn_list, n_segment_list = [], []
38 | pred_sed_list, pred_doa_list = [], []
39 |
40 | iterator = tqdm(generator)
41 | for batch_sample in iterator:
42 | batch_x = batch_sample['waveform']
43 | if self.cuda:
44 | batch_x = batch_x.cuda(non_blocking=True)
45 | with torch.no_grad():
46 | self.af_extractor.eval()
47 | self.model.eval()
48 | batch_x = self.af_extractor(batch_x)
49 | batch_x = (batch_x - self.mean) / self.std
50 | pred = self.model(batch_x)
51 | pred['sed'] = torch.sigmoid(pred['sed'])
52 | fn_list.append(batch_sample['filename'])
53 | n_segment_list.append(batch_sample['n_segment'])
54 | pred_sed_list.append(pred['sed'].cpu().detach().numpy())
55 | pred_doa_list.append(pred['doa'].cpu().detach().numpy())
56 |
57 | iterator.close()
58 |
59 | self.fn_list = [fn for row in fn_list for fn in row]
60 | self.n_segment_list = [n_segment for row in n_segment_list for n_segment in row]
61 | pred_sed = np.concatenate(pred_sed_list, axis=0)
62 | pred_doa = np.concatenate(pred_doa_list, axis=0)
63 |
64 | self.num_segments = max(self.n_segment_list) + 1
65 | origin_num_clips = int(pred_sed.shape[0]/self.num_segments)
66 | origin_T = int(pred_sed.shape[1]*self.num_segments)
67 | pred_sed = pred_sed.reshape((origin_num_clips, origin_T, 2, -1))[:, :int(60 / self.label_resolution)]
68 | pred_doa = pred_doa.reshape((origin_num_clips, origin_T, 2, -1))[:, :int(60 / self.label_resolution)]
69 |
70 | pred = {
71 | 'sed': pred_sed,
72 | 'doa': pred_doa
73 | }
74 | return pred
75 |
76 | def fusion(self, submissions_dir, preds):
77 | """ Ensamble predictions
78 |
79 | """
80 | num_preds = len(preds)
81 | pred_sed = []
82 | pred_doa = []
83 | for n in range(num_preds):
84 | pred_sed.append(preds[n]['sed'])
85 | pred_doa.append(preds[n]['doa'])
86 | pred_sed = np.array(pred_sed).mean(axis=0)
87 | pred_doa = np.array(pred_doa).mean(axis=0)
88 |
89 | N, T = pred_sed.shape[:2]
90 | pred_sed_max = pred_sed.max(axis=-1)
91 | pred_sed_max_idx = pred_sed.argmax(axis=-1)
92 | pred_sed = np.zeros_like(pred_sed)
93 | for b_idx in range(N):
94 | for t_idx in range(T):
95 | for track_idx in range(2):
96 | pred_sed[b_idx, t_idx, track_idx, pred_sed_max_idx[b_idx, t_idx, track_idx]] = \
97 | pred_sed_max[b_idx, t_idx, track_idx]
98 | pred_sed = (pred_sed > self.cfg['inference']['threshold_sed']).astype(np.float32)
99 | # convert Cartesian to spherical
100 | azi = np.arctan2(pred_doa[..., 1], pred_doa[..., 0])
101 | elev = np.arctan2(pred_doa[..., 2], np.sqrt(pred_doa[..., 0]**2 + pred_doa[..., 1]**2))
102 | pred_doa = np.stack((azi, elev), axis=-1) # (N, T, tracks, (azi, elev))
103 |
104 | fn_list = self.fn_list[::self.num_segments]
105 | for n in range(pred_sed.shape[0]):
106 | fn = fn_list[n]
107 | pred_sed_f = pred_sed[n][None, ...]
108 | pred_doa_f = pred_doa[n][None, ...]
109 | pred_dcase_format_dict = to_dcase_format(pred_sed_f, pred_doa_f)
110 | csv_path = submissions_dir.joinpath(fn + '.csv')
111 | self.write_submission(csv_path, pred_dcase_format_dict)
112 | print('Results are saved to {}\n'.format(str(submissions_dir)))
113 |
114 |
--------------------------------------------------------------------------------
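
fusion above averages the SED/DOA outputs of several checkpoints, keeps only the highest-scoring class per frame and track, applies the SED threshold, and converts the averaged Cartesian DOA back to azimuth/elevation. A condensed sketch of that post-processing on hypothetical arrays:

    import numpy as np

    preds = [np.random.rand(1, 600, 2, 14) for _ in range(3)]   # 3 hypothetical model outputs
    sed = np.stack(preds).mean(axis=0)                          # ensemble average

    # keep only the most likely class per frame and track, then threshold
    keep = np.zeros_like(sed)
    max_idx = sed.argmax(axis=-1)[..., None]
    np.put_along_axis(keep, max_idx, np.take_along_axis(sed, max_idx, axis=-1), axis=-1)
    sed_binary = (keep > 0.5).astype(np.float32)                # hypothetical threshold

    # Cartesian DOA -> (azimuth, elevation) in radians
    doa = np.random.rand(1, 600, 2, 3) * 2 - 1                  # hypothetical (x, y, z)
    azi = np.arctan2(doa[..., 1], doa[..., 0])
    elev = np.arctan2(doa[..., 2], np.sqrt(doa[..., 0] ** 2 + doa[..., 1] ** 2))
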
/seld/methods/ein_seld/losses.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | from methods.utils.loss_utilities import BCEWithLogitsLoss, MSELoss
4 |
5 |
6 | class Losses:
7 | def __init__(self, cfg):
8 |
9 | self.cfg = cfg
10 | self.beta = cfg['training']['loss_beta']
11 |
12 | self.losses = [BCEWithLogitsLoss(reduction='mean'), MSELoss(reduction='mean')]
13 | self.losses_pit = [BCEWithLogitsLoss(reduction='PIT'), MSELoss(reduction='PIT')]
14 |
15 | self.names = ['loss_all'] + [loss.name for loss in self.losses]
16 |
17 | def calculate(self, pred, target, epoch_it=0):
18 |
19 | if 'PIT' not in self.cfg['training']['PIT_type']:
20 | updated_target = target
21 | loss_sed = self.losses[0].calculate_loss(pred['sed'], updated_target['sed'])
22 | loss_doa = self.losses[1].calculate_loss(pred['doa'], updated_target['doa'])
23 | elif self.cfg['training']['PIT_type'] == 'tPIT':
24 | loss_sed, loss_doa, updated_target = self.tPIT(pred, target)
25 |
26 | loss_all = self.beta * loss_sed + (1 - self.beta) * loss_doa
27 | losses_dict = {
28 | 'all': loss_all,
29 | 'sed': loss_sed,
30 | 'doa': loss_doa,
31 | 'updated_target': updated_target
32 | }
33 | return losses_dict
34 |
35 | def tPIT(self, pred, target):
36 | """Frame Permutation Invariant Training for 2 possible combinations
37 |
38 | Args:
39 | pred: {
40 | 'sed': [batch_size, T, num_tracks=2, num_classes],
41 | 'doa': [batch_size, T, num_tracks=2, doas=3]
42 | }
43 | target: {
44 | 'sed': [batch_size, T, num_tracks=2, num_classes],
45 | 'doa': [batch_size, T, num_tracks=2, doas=3]
46 | }
47 | Return:
48 | updated_target: target reordered frame-wise to the permutation with the minimum loss
49 | {
50 | 'sed': [batch_size, T, num_tracks=2, num_classes],
51 | 'doa': [batch_size, T, num_tracks=2, doas=3]
52 | }
53 | """
54 | target_flipped = {
55 | 'sed': target['sed'].flip(dims=[2]),
56 | 'doa': target['doa'].flip(dims=[2])
57 | }
58 |
59 | loss_sed1 = self.losses_pit[0].calculate_loss(pred['sed'], target['sed'])
60 | loss_sed2 = self.losses_pit[0].calculate_loss(pred['sed'], target_flipped['sed'])
61 | loss_doa1 = self.losses_pit[1].calculate_loss(pred['doa'], target['doa'])
62 | loss_doa2 = self.losses_pit[1].calculate_loss(pred['doa'], target_flipped['doa'])
63 |
64 | loss1 = loss_sed1 + loss_doa1
65 | loss2 = loss_sed2 + loss_doa2
66 |
67 | loss_sed = (loss_sed1 * (loss1 <= loss2) + loss_sed2 * (loss1 > loss2)).mean()
68 | loss_doa = (loss_doa1 * (loss1 <= loss2) + loss_doa2 * (loss1 > loss2)).mean()
69 | updated_target_sed = target['sed'].clone() * (loss1[:, :, None, None] <= loss2[:, :, None, None]) + \
70 | target_flipped['sed'].clone() * (loss1[:, :, None, None] > loss2[:, :, None, None])
71 | updated_target_doa = target['doa'].clone() * (loss1[:, :, None, None] <= loss2[:, :, None, None]) + \
72 | target_flipped['doa'].clone() * (loss1[:, :, None, None] > loss2[:, :, None, None])
73 | updated_target = {
74 | 'sed': updated_target_sed,
75 | 'doa': updated_target_doa
76 | }
77 | return loss_sed, loss_doa, updated_target
78 |
--------------------------------------------------------------------------------
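
tPIT above evaluates both track orderings for every frame and keeps, per frame, the permutation with the smaller combined SED + DOA loss; the chosen permutation is also returned as updated_target. A minimal numpy sketch of the same idea for a single loss term and two tracks:

    import numpy as np

    def frame_pit_sketch(pred, target):
        """pred/target: (T, num_tracks=2, dims). Frame-wise minimum MSE over permutations."""
        target_flipped = target[:, ::-1, :]                        # swap the two tracks
        loss1 = ((pred - target) ** 2).mean(axis=(1, 2))           # per-frame loss, original order
        loss2 = ((pred - target_flipped) ** 2).mean(axis=(1, 2))   # per-frame loss, swapped order
        choose_first = loss1 <= loss2
        updated_target = np.where(choose_first[:, None, None], target, target_flipped)
        return np.where(choose_first, loss1, loss2).mean(), updated_target

    pred = np.random.rand(100, 2, 3)
    target = pred[:, ::-1, :].copy()                # targets given in the swapped order
    loss, new_target = frame_pit_sketch(pred, target)
    print(round(loss, 6))                           # ~0: PIT recovers the right pairing
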
/seld/methods/ein_seld/metrics.py:
--------------------------------------------------------------------------------
1 | import methods.utils.SELD_evaluation_metrics_2019 as SELDMetrics2019
2 | from methods.utils.SELD_evaluation_metrics_2020 import \
3 | SELDMetrics as SELDMetrics2020
4 | from methods.utils.SELD_evaluation_metrics_2020 import early_stopping_metric
5 |
6 |
7 | class Metrics(object):
8 | """Metrics for evaluation
9 |
10 | """
11 | def __init__(self, dataset):
12 |
13 | self.metrics = []
14 | self.names = ['ER20', 'F20', 'LE20', 'LR20', 'seld20', 'ER19', 'F19', 'LE19', 'LR19', 'seld19']
15 |
16 | self.num_classes = len(dataset.label_set)
17 | self.doa_threshold = 20 # in deg
18 | self.num_frames_1s = int(1 / dataset.label_resolution)
19 |
20 | def calculate(self, pred_dict, gt_dict):
21 |
22 | # ER20: error rate, F20: F1-score, LE20: Location error, LR20: Location recall
23 | ER_19, F_19 = SELDMetrics2019.compute_sed_scores(pred_dict['dcase2019_sed'], gt_dict['dcase2019_sed'], \
24 | self.num_frames_1s)
25 | LE_19, LR_19, _, _, _, _ = SELDMetrics2019.compute_doa_scores_regr( \
26 | pred_dict['dcase2019_doa'], gt_dict['dcase2019_doa'], pred_dict['dcase2019_sed'], gt_dict['dcase2019_sed'])
27 | seld_score_19 = SELDMetrics2019.early_stopping_metric([ER_19, F_19], [LE_19, LR_19])
28 |
29 | dcase2020_metric = SELDMetrics2020(nb_classes=self.num_classes, doa_threshold=self.doa_threshold)
30 | dcase2020_metric.update_seld_scores(pred_dict['dcase2020'], gt_dict['dcase2020'])
31 | ER_20, F_20, LE_20, LR_20 = dcase2020_metric.compute_seld_scores()
32 | seld_score_20 = early_stopping_metric([ER_20, F_20], [LE_20, LR_20])
33 |
34 | metrics_scores = {
35 | 'ER20': ER_20,
36 | 'F20': F_20,
37 | 'LE20': LE_20,
38 | 'LR20': LR_20,
39 | 'seld20': seld_score_20,
40 | 'ER19': ER_19,
41 | 'F19': F_19,
42 | 'LE19': LE_19,
43 | 'LR19': LR_19,
44 | 'seld19': seld_score_19,
45 | }
46 | return metrics_scores
47 |
--------------------------------------------------------------------------------
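
The single number used to rank checkpoints (seld20) aggregates the four DCASE metrics. In the DCASE 2020 early-stopping formulation this is the mean of the error rate, the F-score deficit, the localization error normalized by 180 degrees, and the localization-recall deficit. A worked example with hypothetical validation scores, assuming that definition:

    import numpy as np

    def seld_score(er, f, le_deg, lr):
        """Aggregate SELD score, lower is better (assumes the DCASE 2020 definition)."""
        return np.mean([er, 1.0 - f, le_deg / 180.0, 1.0 - lr])

    print(round(seld_score(er=0.40, f=0.70, le_deg=18.0, lr=0.75), 4))   # -> 0.2625
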
/seld/methods/ein_seld/models/__init__.py:
--------------------------------------------------------------------------------
1 | from .seld import *
--------------------------------------------------------------------------------
/seld/methods/ein_seld/models/seld.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import torch.nn.functional as F
4 | from methods.utils.model_utilities import (DoubleConv, PositionalEncoding,
5 | init_layer)
6 |
7 |
8 | class EINV2(nn.Module):
9 | def __init__(self, cfg, dataset):
10 | super().__init__()
11 | self.pe_enable = False # True | False
12 |
13 | if cfg['data']['audio_feature'] == 'logmel&intensity':
14 | self.f_bins = cfg['data']['n_mels']
15 | self.in_channels = 7
16 |
17 | self.downsample_ratio = 2 ** 2
18 | self.sed_conv_block1 = nn.Sequential(
19 | DoubleConv(in_channels=4, out_channels=64),
20 | nn.AvgPool2d(kernel_size=(2, 2)),
21 | )
22 | self.sed_conv_block2 = nn.Sequential(
23 | DoubleConv(in_channels=64, out_channels=128),
24 | nn.AvgPool2d(kernel_size=(2, 2)),
25 | )
26 | self.sed_conv_block3 = nn.Sequential(
27 | DoubleConv(in_channels=128, out_channels=256),
28 | nn.AvgPool2d(kernel_size=(1, 2)),
29 | )
30 | self.sed_conv_block4 = nn.Sequential(
31 | DoubleConv(in_channels=256, out_channels=512),
32 | nn.AvgPool2d(kernel_size=(1, 2)),
33 | )
34 |
35 | self.doa_conv_block1 = nn.Sequential(
36 | DoubleConv(in_channels=self.in_channels, out_channels=64),
37 | nn.AvgPool2d(kernel_size=(2, 2)),
38 | )
39 | self.doa_conv_block2 = nn.Sequential(
40 | DoubleConv(in_channels=64, out_channels=128),
41 | nn.AvgPool2d(kernel_size=(2, 2)),
42 | )
43 | self.doa_conv_block3 = nn.Sequential(
44 | DoubleConv(in_channels=128, out_channels=256),
45 | nn.AvgPool2d(kernel_size=(1, 2)),
46 | )
47 | self.doa_conv_block4 = nn.Sequential(
48 | DoubleConv(in_channels=256, out_channels=512),
49 | nn.AvgPool2d(kernel_size=(1, 2)),
50 | )
51 |
52 | self.stitch = nn.ParameterList([
53 | nn.Parameter(torch.FloatTensor(64, 2, 2).uniform_(0.1, 0.9)),
54 | nn.Parameter(torch.FloatTensor(128, 2, 2).uniform_(0.1, 0.9)),
55 | nn.Parameter(torch.FloatTensor(256, 2, 2).uniform_(0.1, 0.9)),
56 | ])
57 |
58 | if self.pe_enable:
59 | self.pe = PositionalEncoding(pos_len=100, d_model=512, pe_type='t', dropout=0.0)
60 | self.sed_trans_track1 = nn.TransformerEncoder(
61 | nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024, dropout=0.2), num_layers=2)
62 | self.sed_trans_track2 = nn.TransformerEncoder(
63 | nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024, dropout=0.2), num_layers=2)
64 | self.doa_trans_track1 = nn.TransformerEncoder(
65 | nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024, dropout=0.2), num_layers=2)
66 | self.doa_trans_track2 = nn.TransformerEncoder(
67 | nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024, dropout=0.2), num_layers=2)
68 |
69 | self.fc_sed_track1 = nn.Linear(512, 14, bias=True)
70 | self.fc_sed_track2 = nn.Linear(512, 14, bias=True)
71 | self.fc_doa_track1 = nn.Linear(512, 3, bias=True)
72 | self.fc_doa_track2 = nn.Linear(512, 3, bias=True)
73 | self.final_act_sed = nn.Sequential() # nn.Sigmoid()
74 | self.final_act_doa = nn.Tanh()
75 |
76 | self.init_weight()
77 |
78 | def init_weight(self):
79 |
80 | init_layer(self.fc_sed_track1)
81 | init_layer(self.fc_sed_track2)
82 | init_layer(self.fc_doa_track1)
83 | init_layer(self.fc_doa_track2)
84 |
85 | def forward(self, x):
86 | """
87 | x: waveform, (batch_size, num_channels, data_length)
88 | """
89 | x_sed = x[:, :4]
90 | x_doa = x
91 |
92 | # cnn
93 | x_sed = self.sed_conv_block1(x_sed)
94 | x_doa = self.doa_conv_block1(x_doa)
95 | x_sed = torch.einsum('c, nctf -> nctf', self.stitch[0][:, 0, 0], x_sed) + \
96 | torch.einsum('c, nctf -> nctf', self.stitch[0][:, 0, 1], x_doa)
97 | x_doa = torch.einsum('c, nctf -> nctf', self.stitch[0][:, 1, 0], x_sed) + \
98 | torch.einsum('c, nctf -> nctf', self.stitch[0][:, 1, 1], x_doa)
99 | x_sed = self.sed_conv_block2(x_sed)
100 | x_doa = self.doa_conv_block2(x_doa)
101 | x_sed = torch.einsum('c, nctf -> nctf', self.stitch[1][:, 0, 0], x_sed) + \
102 | torch.einsum('c, nctf -> nctf', self.stitch[1][:, 0, 1], x_doa)
103 | x_doa = torch.einsum('c, nctf -> nctf', self.stitch[1][:, 1, 0], x_sed) + \
104 | torch.einsum('c, nctf -> nctf', self.stitch[1][:, 1, 1], x_doa)
105 | x_sed = self.sed_conv_block3(x_sed)
106 | x_doa = self.doa_conv_block3(x_doa)
107 | x_sed = torch.einsum('c, nctf -> nctf', self.stitch[2][:, 0, 0], x_sed) + \
108 | torch.einsum('c, nctf -> nctf', self.stitch[2][:, 0, 1], x_doa)
109 | x_doa = torch.einsum('c, nctf -> nctf', self.stitch[2][:, 1, 0], x_sed) + \
110 | torch.einsum('c, nctf -> nctf', self.stitch[2][:, 1, 1], x_doa)
111 | x_sed = self.sed_conv_block4(x_sed)
112 | x_doa = self.doa_conv_block4(x_doa)
113 | x_sed = x_sed.mean(dim=3) # (N, C, T)
114 | x_doa = x_doa.mean(dim=3) # (N, C, T)
115 |
116 | # transformer
117 | if self.pe_enable:
118 | x_sed = self.pe(x_sed)
119 | if self.pe_enable:
120 | x_doa = self.pe(x_doa)
121 | x_sed = x_sed.permute(2, 0, 1) # (T, N, C)
122 | x_doa = x_doa.permute(2, 0, 1) # (T, N, C)
123 |
124 | x_sed_1 = self.sed_trans_track1(x_sed).transpose(0, 1) # (N, T, C)
125 | x_sed_2 = self.sed_trans_track2(x_sed).transpose(0, 1) # (N, T, C)
126 | x_doa_1 = self.doa_trans_track1(x_doa).transpose(0, 1) # (N, T, C)
127 | x_doa_2 = self.doa_trans_track2(x_doa).transpose(0, 1) # (N, T, C)
128 |
129 | # fc
130 | x_sed_1 = self.final_act_sed(self.fc_sed_track1(x_sed_1))
131 | x_sed_2 = self.final_act_sed(self.fc_sed_track2(x_sed_2))
132 | x_sed = torch.stack((x_sed_1, x_sed_2), 2)
133 | x_doa_1 = self.final_act_doa(self.fc_doa_track1(x_doa_1))
134 | x_doa_2 = self.final_act_doa(self.fc_doa_track2(x_doa_2))
135 | x_doa = torch.stack((x_doa_1, x_doa_2), 2)
136 | output = {
137 | 'sed': x_sed,
138 | 'doa': x_doa,
139 | }
140 |
141 | return output
142 |
143 |
--------------------------------------------------------------------------------
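
The stitch parameters in EINV2 act as channel-wise cross-stitch units: after each pair of conv blocks, the SED and DOA feature maps are re-mixed as learned per-channel weighted sums via torch.einsum (note that forward reuses the already-updated x_sed when forming x_doa). A small sketch of one such exchange on random tensors:

    import torch

    N, C, T, F = 2, 64, 10, 32
    x_sed = torch.randn(N, C, T, F)
    x_doa = torch.randn(N, C, T, F)

    # one cross-stitch unit: a (C, 2, 2) mixing weight per channel
    stitch = torch.empty(C, 2, 2).uniform_(0.1, 0.9)

    # each output branch is a per-channel weighted sum of both input branches
    new_sed = torch.einsum('c, nctf -> nctf', stitch[:, 0, 0], x_sed) + \
              torch.einsum('c, nctf -> nctf', stitch[:, 0, 1], x_doa)
    new_doa = torch.einsum('c, nctf -> nctf', stitch[:, 1, 0], x_sed) + \
              torch.einsum('c, nctf -> nctf', stitch[:, 1, 1], x_doa)

    print(new_sed.shape, new_doa.shape)    # torch.Size([2, 64, 10, 32]) each
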
/seld/methods/ein_seld/training.py:
--------------------------------------------------------------------------------
1 | import random
2 | from itertools import combinations
3 | from pathlib import Path
4 |
5 | import h5py
6 | import numpy as np
7 | import torch
8 | from methods.training import BaseTrainer
9 | from methods.utils.data_utilities import to_metrics2020_format
10 |
11 |
12 | class Trainer(BaseTrainer):
13 |
14 | def __init__(self, args, cfg, dataset, af_extractor, valid_set, model, optimizer, losses, metrics):
15 |
16 | super().__init__()
17 | self.cfg = cfg
18 | self.af_extractor = af_extractor
19 | self.model = model
20 | self.optimizer = optimizer
21 | self.losses = losses
22 | self.metrics = metrics
23 | self.cuda = args.cuda
24 |
25 | self.clip_length = dataset.clip_length
26 | self.label_resolution = dataset.label_resolution
27 | self.label_interp_ratio = int(self.label_resolution * cfg['data']['sample_rate'] / cfg['data']['hop_length'])
28 |
29 | # Load ground truth for dcase metrics
30 | self.num_segments = valid_set.num_segments
31 | self.valid_gt_sed_metrics2019 = valid_set.valid_gt_sed_metrics2019
32 | self.valid_gt_doa_metrics2019 = valid_set.valid_gt_doa_metrics2019
33 | self.gt_metrics2020_dict = valid_set.gt_metrics2020_dict
34 |
35 | # Scalar
36 | scalar_h5_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('scalar')
37 | fn_scalar = '{}_{}_sr{}_nfft{}_hop{}_mel{}.h5'.format(cfg['data']['type'], cfg['data']['audio_feature'],
38 | cfg['data']['sample_rate'], cfg['data']['n_fft'], cfg['data']['hop_length'], cfg['data']['n_mels'])
39 | scalar_path = scalar_h5_dir.joinpath(fn_scalar)
40 | with h5py.File(scalar_path, 'r') as hf:
41 | self.mean = hf['mean'][:]
42 | self.std = hf['std'][:]
43 | if args.cuda:
44 | self.mean = torch.tensor(self.mean, dtype=torch.float32).cuda()
45 | self.std = torch.tensor(self.std, dtype=torch.float32).cuda()
46 |
47 | self.init_train_losses()
48 |
49 | def init_train_losses(self):
50 | """ Initialize train losses
51 |
52 | """
53 | self.train_losses = {
54 | 'loss_all': 0.,
55 | 'loss_sed': 0.,
56 | 'loss_doa': 0.
57 | }
58 |
59 | def train_step(self, batch_sample, epoch_it):
60 | """ Perform a train step
61 |
62 | """
63 | batch_x = batch_sample['waveform']
64 | batch_target = {
65 | 'ov': batch_sample['ov'],
66 | 'sed': batch_sample['sed_label'],
67 | 'doa': batch_sample['doa_label']
68 | }
69 | if self.cuda:
70 | batch_x = batch_x.cuda(non_blocking=True)
71 | batch_target['sed'] = batch_target['sed'].cuda(non_blocking=True)
72 | batch_target['doa'] = batch_target['doa'].cuda(non_blocking=True)
73 |
74 | self.optimizer.zero_grad()
75 | self.af_extractor.train()
76 | self.model.train()
77 | batch_x = self.af_extractor(batch_x)
78 | batch_x = (batch_x - self.mean) / self.std
79 | pred = self.model(batch_x)
80 | loss_dict = self.losses.calculate(pred, batch_target)
81 | loss_dict[self.cfg['training']['loss_type']].backward()
82 | self.optimizer.step()
83 |
84 | self.train_losses['loss_all'] += loss_dict['all']
85 | self.train_losses['loss_sed'] += loss_dict['sed']
86 | self.train_losses['loss_doa'] += loss_dict['doa']
87 |
88 |
89 | def validate_step(self, generator=None, max_batch_num=None, valid_type='train', epoch_it=0):
90 | """ Perform the validation on the train, valid set
91 |
92 | Generate a batch of segmentations each time
93 | """
94 |
95 | if valid_type == 'train':
96 | train_losses = self.train_losses.copy()
97 | self.init_train_losses()
98 | return train_losses
99 |
100 | elif valid_type == 'valid':
101 | pred_sed_list, pred_doa_list = [], []
102 | gt_sed_list, gt_doa_list = [], []
103 | loss_all, loss_sed, loss_doa = 0., 0., 0.
104 |
105 | for batch_idx, batch_sample in enumerate(generator):
106 | if batch_idx == max_batch_num:
107 | break
108 |
109 | batch_x = batch_sample['waveform']
110 | batch_target = {
111 | 'sed': batch_sample['sed_label'],
112 | 'doa': batch_sample['doa_label']
113 | }
114 |
115 | if self.cuda:
116 | batch_x = batch_x.cuda(non_blocking=True)
117 | batch_target['sed'] = batch_target['sed'].cuda(non_blocking=True)
118 | batch_target['doa'] = batch_target['doa'].cuda(non_blocking=True)
119 |
120 | with torch.no_grad():
121 | self.af_extractor.eval()
122 | self.model.eval()
123 | batch_x = self.af_extractor(batch_x)
124 | batch_x = (batch_x - self.mean) / self.std
125 | pred = self.model(batch_x)
126 | loss_dict = self.losses.calculate(pred, batch_target, epoch_it)
127 | pred['sed'] = torch.sigmoid(pred['sed'])
128 | loss_all += loss_dict['all'].cpu().detach().numpy()
129 | loss_sed += loss_dict['sed'].cpu().detach().numpy()
130 | loss_doa += loss_dict['doa'].cpu().detach().numpy()
131 | pred_sed_list.append(pred['sed'].cpu().detach().numpy())
132 | pred_doa_list.append(pred['doa'].cpu().detach().numpy())
133 |
134 | pred_sed = np.concatenate(pred_sed_list, axis=0)
135 | pred_doa = np.concatenate(pred_doa_list, axis=0)
136 |
137 | origin_num_clips = int(pred_sed.shape[0]/self.num_segments)
138 | origin_T = int(pred_sed.shape[1]*self.num_segments)
139 | pred_sed = pred_sed.reshape((origin_num_clips, origin_T, 2, -1))[:, :int(self.clip_length / self.label_resolution)]
140 | pred_doa = pred_doa.reshape((origin_num_clips, origin_T, 2, -1))[:, :int(self.clip_length / self.label_resolution)]
141 |
142 | pred_sed_max = pred_sed.max(axis=-1)
143 | pred_sed_max_idx = pred_sed.argmax(axis=-1)
144 | pred_sed = np.zeros_like(pred_sed)
145 | for b_idx in range(origin_num_clips):
146 | for t_idx in range(origin_T):
147 | for track_idx in range(2):
148 | pred_sed[b_idx, t_idx, track_idx, pred_sed_max_idx[b_idx, t_idx, track_idx]] = \
149 | pred_sed_max[b_idx, t_idx, track_idx]
150 | pred_sed = (pred_sed > self.cfg['training']['threshold_sed']).astype(np.float32)
151 |
152 | # convert Cartesian to spherical
153 | azi = np.arctan2(pred_doa[..., 1], pred_doa[..., 0])
154 | elev = np.arctan2(pred_doa[..., 2], np.sqrt(pred_doa[..., 0]**2 + pred_doa[..., 1]**2))
155 | pred_doa = np.stack((azi, elev), axis=-1) # (N, T, tracks, (azi, elev))
156 |
157 | # convert format
158 | pred_sed_metrics2019, pred_doa_metrics2019 = to_metrics2019_format(pred_sed, pred_doa)
159 | gt_sed_metrics2019, gt_doa_metrics2019 = self.valid_gt_sed_metrics2019, self.valid_gt_doa_metrics2019
160 | pred_dcase_format_dict = to_dcase_format(pred_sed, pred_doa)
161 | pred_metrics2020_dict = to_metrics2020_format(pred_dcase_format_dict,
162 | pred_sed.shape[0]*pred_sed.shape[1], label_resolution=self.label_resolution)
163 | gt_metrics2020_dict = self.gt_metrics2020_dict
164 |
165 | out_losses = {
166 | 'loss_all': loss_all / (batch_idx + 1),
167 | 'loss_sed': loss_sed / (batch_idx + 1),
168 | 'loss_doa': loss_doa / (batch_idx + 1),
169 | }
170 |
171 | pred_dict = {
172 | 'dcase2019_sed': pred_sed_metrics2019,
173 | 'dcase2019_doa': pred_doa_metrics2019,
174 | 'dcase2020': pred_metrics2020_dict,
175 | }
176 |
177 | gt_dict = {
178 | 'dcase2019_sed': gt_sed_metrics2019,
179 | 'dcase2019_doa': gt_doa_metrics2019,
180 | 'dcase2020': gt_metrics2020_dict,
181 | }
182 | metrics_scores = self.metrics.calculate(pred_dict, gt_dict)
183 | return out_losses, metrics_scores
184 |
185 |
186 | def to_metrics2019_format(sed_labels, doa_labels):
187 | """Convert sed and doa labels from track-wise output format to DCASE2019 evaluation metrics input format
188 |
189 | Args:
190 | sed_labels: SED labels, (batch_size, time_steps, num_tracks=2, logits_events=14 (number of classes))
191 | doa_labels: DOA labels, (batch_size, time_steps, num_tracks=2, logits_degrees=2 (azi in radians, ele in radians))
192 | Output:
193 | out_sed_labels: SED labels, (batch_size * time_steps, logits_events=14 (True or False))
194 | out_doa_labels: DOA labels, (batch_size * time_steps, azi_index=14 + ele_index=14)
195 | """
196 | batch_size, T, num_tracks, num_classes = sed_labels.shape
197 | sed_labels = sed_labels.reshape(batch_size * T, num_tracks, num_classes)
198 | doa_labels = doa_labels.reshape(batch_size * T, num_tracks, 2)
199 | out_sed_labels = np.logical_or(sed_labels[:, 0], sed_labels[:, 1]).astype(float)
200 | out_doa_labels = np.zeros((batch_size * T, num_classes * 2))
201 | for n_track in range(num_tracks):
202 | indexes = np.where(sed_labels[:, n_track, :])
203 | out_doa_labels[:, 0: num_classes][indexes[0], indexes[1]] = \
204 | doa_labels[indexes[0], n_track, 0] # azimuth
205 | out_doa_labels[:, num_classes: 2*num_classes][indexes[0], indexes[1]] = \
206 | doa_labels[indexes[0], n_track, 1] # elevation
207 | return out_sed_labels, out_doa_labels
208 |
209 | def to_dcase_format(sed_labels, doa_labels):
210 | """Convert sed and doa labels from track-wise output format to dcase output format
211 |
212 | Args:
213 | sed_labels: SED labels, (batch_size, time_steps, num_tracks=2, logits_events=14 (number of classes))
214 | doa_labels: DOA labels, (batch_size, time_steps, num_tracks=2, logits_degrees=2 (azi in radians, ele in radians))
215 | Output:
216 | output_dict: return a dict containing dcase output format
217 | output_dict[frame-containing-events] = [[class_index_1, azi_1 in degrees, ele_1 in degrees], [class_index_2, azi_2 in degrees, ele_2 in degrees]]
218 | """
219 | batch_size, T, num_tracks, num_classes = sed_labels.shape
220 |
221 | sed_labels = sed_labels.reshape(batch_size*T, num_tracks, num_classes)
222 | doa_labels = doa_labels.reshape(batch_size*T, num_tracks, 2)
223 |
224 | output_dict = {}
225 | for n_idx in range(batch_size*T):
226 | for n_track in range(num_tracks):
227 | class_index = list(np.where(sed_labels[n_idx, n_track, :])[0])
228 | assert len(class_index) <= 1, 'each track should have at most one active class!!\n'
229 | if class_index:
230 | event_doa = [class_index[0], int(np.around(doa_labels[n_idx, n_track, 0] * 180 / np.pi)), \
231 | int(np.around(doa_labels[n_idx, n_track, 1] * 180 / np.pi))] # NOTE: this is in degree
232 | if n_idx not in output_dict:
233 | output_dict[n_idx] = []
234 | output_dict[n_idx].append(event_doa)
235 | return output_dict
236 |
237 |
--------------------------------------------------------------------------------
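
to_dcase_format above turns track-wise one-hot SED predictions and radian DOA predictions into the per-frame event lists of the DCASE output format. A tiny usage sketch with a single frame and one active track:

    import numpy as np

    # (batch_size=1, T=1, num_tracks=2, num_classes=14)
    sed = np.zeros((1, 1, 2, 14))
    sed[0, 0, 0, 3] = 1.0                        # track 0 predicts class 3, track 1 is silent

    # (batch_size=1, T=1, num_tracks=2, (azimuth, elevation)) in radians
    doa = np.zeros((1, 1, 2, 2))
    doa[0, 0, 0] = [np.pi / 6, -np.pi / 18]      # 30 deg azimuth, -10 deg elevation

    print(to_dcase_format(sed, doa))             # {0: [[3, 30, -10]]}
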
/seld/methods/feature.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 |
4 | from methods.utils.stft import (STFT, LogmelFilterBank, intensityvector,
5 | spectrogram_STFTInput)
6 |
7 |
8 | class LogmelIntensity_Extractor(nn.Module):
9 | def __init__(self, cfg):
10 | super().__init__()
11 |
12 | data = cfg['data']
13 | sample_rate, n_fft, hop_length, window, n_mels, fmin, fmax = \
14 | data['sample_rate'], data['n_fft'], data['hop_length'], data['window'], data['n_mels'], \
15 | data['fmin'], data['fmax']
16 |
17 | center = True
18 | pad_mode = 'reflect'
19 | ref = 1.0
20 | amin = 1e-10
21 | top_db = None
22 |
23 | # STFT extractor
24 | self.stft_extractor = STFT(n_fft=n_fft, hop_length=hop_length, win_length=n_fft,
25 | window=window, center=center, pad_mode=pad_mode,
26 | freeze_parameters=data['feature_freeze'])
27 |
28 | # Spectrogram extractor
29 | self.spectrogram_extractor = spectrogram_STFTInput
30 |
31 | # Logmel extractor
32 | self.logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=n_fft,
33 | n_mels=n_mels, fmin=fmin, fmax=fmax, ref=ref, amin=amin, top_db=top_db,
34 | freeze_parameters=data['feature_freeze'])
35 |
36 | # Intensity vector extractor
37 | self.intensityVector_extractor = intensityvector
38 |
39 | def forward(self, x):
40 | """
41 | input:
42 | (batch_size, channels=4, data_length)
43 | output:
44 | (batch_size, channels, time_steps, freq_bins)
45 | """
46 | if x.ndim != 3:
47 | raise ValueError("x shape must be (batch_size, num_channels, data_length)\n \
48 | Now it is {}".format(x.shape))
49 | x = self.stft_extractor(x)
50 | logmel = self.logmel_extractor(self.spectrogram_extractor(x))
51 | intensity_vector = self.intensityVector_extractor(x, self.logmel_extractor.melW)
52 | out = torch.cat((logmel, intensity_vector), dim=1)
53 | return out
54 |
55 |
--------------------------------------------------------------------------------
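
LogmelIntensity_Extractor stacks 4 log-mel spectrogram channels with a 3-channel intensity vector, giving the 7 input channels the DOA branch of EINV2 expects. A usage sketch with a hypothetical configuration dict (the real values live in configs/ein_seld/seld.yaml):

    import torch

    cfg = {'data': {                              # hypothetical values for illustration only
        'sample_rate': 24000, 'n_fft': 1024, 'hop_length': 480, 'window': 'hann',
        'n_mels': 128, 'fmin': 50, 'fmax': 12000, 'feature_freeze': True,
    }}

    extractor = LogmelIntensity_Extractor(cfg)
    waveform = torch.randn(2, 4, 24000 * 5)       # batch of 2 five-second FOA clips
    features = extractor(waveform)
    print(features.shape)                         # (2, 7, time_steps, n_mels)
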
/seld/methods/inference.py:
--------------------------------------------------------------------------------
1 | class BaseInferer:
2 | """ Base inferer class
3 |
4 | """
5 | def infer(self, *args, **kwargs):
6 | """ Perform an inference on test data.
7 |
8 | """
9 | raise NotImplementedError
10 |
11 | def fusion(self, submissions_dir, preds):
12 | """ Ensamble predictions.
13 |
14 | """
15 | raise NotImplementedError
16 |
17 | @staticmethod
18 | def write_submission(submissions_dir, pred_dict):
19 | """ Write predicted result to submission csv files
20 | Args:
21 | pred_dict: DCASE2020 format dict:
22 | pred_dict[frame-containing-events] = [[class_index_1, azi_1 in degrees, ele_1 in degrees], [class_index_2, azi_2 in degrees, ele_2 in degrees]]
23 | """
24 | for key, values in pred_dict.items():
25 | for value in values:
26 | with submissions_dir.open('a') as f:
27 | f.write('{},{},{},{}\n'.format(key, value[0], value[1], value[2]))
28 |
29 |
30 |
31 |
--------------------------------------------------------------------------------
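
write_submission appends one CSV row per detected event, keyed by frame index, in the order frame, class, azimuth, elevation. A tiny sketch with a hypothetical prediction dict and output path:

    from pathlib import Path

    pred_dict = {0: [[3, 30, -10]], 5: [[7, -45, 20]]}        # hypothetical predictions
    BaseInferer.write_submission(Path('example_submission.csv'), pred_dict)
    # example_submission.csv now contains:
    # 0,3,30,-10
    # 5,7,-45,20
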
/seld/methods/training.py:
--------------------------------------------------------------------------------
1 | class BaseTrainer:
2 | """ Base trainer class
3 |
4 | """
5 | def train_step(self, *args, **kwargs):
6 | """ Perform a training step.
7 |
8 | """
9 | raise NotImplementedError
10 |
11 | def validate_step(self, *args, **kwargs):
12 | """ Perform a validation step
13 |
14 | """
15 | raise NotImplementedError
16 |
17 |
18 |
19 |
--------------------------------------------------------------------------------
/seld/methods/utils/SELD_evaluation_metrics_2019.py:
--------------------------------------------------------------------------------
1 | #
2 | # Implements the core metrics from sound event detection evaluation module http://tut-arg.github.io/sed_eval/ and
3 | # The DOA metrics are explained in the SELDnet paper
4 | #
5 | # This script has MIT license
6 | #
7 |
8 | import numpy as np
9 | from scipy.optimize import linear_sum_assignment
10 | from IPython import embed
11 | eps = np.finfo(float).eps
12 |
13 |
14 | ##########################################################################################
15 | # SELD scoring functions - class implementation
16 | #
17 | # NOTE: Supports only one-hot labels for both SED and DOA. Doesn't work for the baseline method
18 | # directly, since it estimates DOA with a regression approach. Check below the class for
19 | # one shot (function) implementations of all metrics. The function implementation has
20 | # support for both one-hot labels and regression values of DOA estimation.
21 | ##########################################################################################
22 |
23 | class SELDMetrics(object):
24 | def __init__(self, nb_frames_1s=None, data_gen=None):
25 | # SED params
26 | self._S = 0
27 | self._D = 0
28 | self._I = 0
29 | self._TP = 0
30 | self._Nref = 0
31 | self._Nsys = 0
32 | self._block_size = nb_frames_1s
33 |
34 | # DOA params
35 | self._doa_loss_pred_cnt = 0
36 | self._nb_frames = 0
37 |
38 | self._doa_loss_pred = 0
39 | self._nb_good_pks = 0
40 |
41 | self._data_gen = data_gen
42 |
43 | self._less_est_cnt, self._less_est_frame_cnt = 0, 0
44 | self._more_est_cnt, self._more_est_frame_cnt = 0, 0
45 |
46 | def f1_overall_framewise(self, O, T):
47 | TP = ((2 * T - O) == 1).sum()
48 | Nref, Nsys = T.sum(), O.sum()
49 | self._TP += TP
50 | self._Nref += Nref
51 | self._Nsys += Nsys
52 |
53 | def er_overall_framewise(self, O, T):
54 | FP = np.logical_and(T == 0, O == 1).sum(1)
55 | FN = np.logical_and(T == 1, O == 0).sum(1)
56 | S = np.minimum(FP, FN).sum()
57 | D = np.maximum(0, FN - FP).sum()
58 | I = np.maximum(0, FP - FN).sum()
59 | self._S += S
60 | self._D += D
61 | self._I += I
62 |
63 | def f1_overall_1sec(self, O, T):
64 | new_size = int(np.ceil(float(O.shape[0]) / self._block_size))
65 | O_block = np.zeros((new_size, O.shape[1]))
66 | T_block = np.zeros((new_size, O.shape[1]))
67 | for i in range(0, new_size):
68 | O_block[i, :] = np.max(O[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0)
69 | T_block[i, :] = np.max(T[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0)
70 | return self.f1_overall_framewise(O_block, T_block)
71 |
72 | def er_overall_1sec(self, O, T):
73 | new_size = int(np.ceil(float(O.shape[0]) / self._block_size))
74 | O_block = np.zeros((new_size, O.shape[1]))
75 | T_block = np.zeros((new_size, O.shape[1]))
76 | for i in range(0, new_size):
77 | O_block[i, :] = np.max(O[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0)
78 | T_block[i, :] = np.max(T[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0)
79 | return self.er_overall_framewise(O_block, T_block)
80 |
81 | def update_sed_scores(self, pred, gt):
82 | """
83 | Computes SED metrics for one second segments
84 |
85 | :param pred: predicted matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0
86 | :param gt: reference matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0
87 | :param nb_frames_1s: integer, number of frames in one second
88 | :return:
89 | """
90 | self.f1_overall_1sec(pred, gt)
91 | self.er_overall_1sec(pred, gt)
92 |
93 | def compute_sed_scores(self):
94 | ER = (self._S + self._D + self._I) / (self._Nref + 0.0)
95 |
96 | prec = float(self._TP) / float(self._Nsys + eps)
97 | recall = float(self._TP) / float(self._Nref + eps)
98 | F = 2 * prec * recall / (prec + recall + eps)
99 |
100 | return ER, F
101 |
102 | def update_doa_scores(self, pred_doa_thresholded, gt_doa):
103 | '''
104 | Compute DOA metrics when DOA is estimated using classification approach
105 |
106 | :param pred_doa_thresholded: predicted results of dimension [nb_frames, nb_classes, nb_azi*nb_ele],
107 | with value 1 when sound event active, else 0
108 | :param gt_doa: reference results of dimension [nb_frames, nb_classes, nb_azi*nb_ele],
109 | with value 1 when sound event active, else 0
110 | (the data generator passed to __init__ is used to map DOA indices to azimuth/elevation angles)
111 |
112 | :return: None (scores are accumulated internally; call compute_doa_scores to read them)
113 |
114 | '''
115 | self._doa_loss_pred_cnt += np.sum(pred_doa_thresholded)
116 | self._nb_frames += pred_doa_thresholded.shape[0]
117 |
118 | for frame in range(pred_doa_thresholded.shape[0]):
119 | nb_gt_peaks = int(np.sum(gt_doa[frame, :]))
120 | nb_pred_peaks = int(np.sum(pred_doa_thresholded[frame, :]))
121 |
122 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction
123 | if nb_gt_peaks == nb_pred_peaks:
124 | self._nb_good_pks += 1
125 | elif nb_gt_peaks > nb_pred_peaks:
126 | self._less_est_frame_cnt += 1
127 | self._less_est_cnt += (nb_gt_peaks - nb_pred_peaks)
128 | elif nb_pred_peaks > nb_gt_peaks:
129 | self._more_est_frame_cnt += 1
130 | self._more_est_cnt += (nb_pred_peaks - nb_gt_peaks)
131 |
132 | # when nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas
133 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas
134 | if nb_gt_peaks and nb_pred_peaks:
135 | pred_ind = np.where(pred_doa_thresholded[frame] == 1)[1]
136 | pred_list_rad = np.array(self._data_gen.get_matrix_index(pred_ind)) * np.pi / 180
137 |
138 | gt_ind = np.where(gt_doa[frame] == 1)[1]
139 | gt_list_rad = np.array(self._data_gen.get_matrix_index(gt_ind)) * np.pi / 180
140 |
141 | frame_dist = distance_between_gt_pred(gt_list_rad.T, pred_list_rad.T)
142 | self._doa_loss_pred += frame_dist
143 |
144 | def compute_doa_scores(self):
145 | doa_error = self._doa_loss_pred / self._doa_loss_pred_cnt
146 | frame_recall = self._nb_good_pks / float(self._nb_frames)
147 | return doa_error, frame_recall
148 |
149 | def reset(self):
150 | # SED params
151 | self._S = 0
152 | self._D = 0
153 | self._I = 0
154 | self._TP = 0
155 | self._Nref = 0
156 | self._Nsys = 0
157 |
158 | # DOA params
159 | self._doa_loss_pred_cnt = 0
160 | self._nb_frames = 0
161 |
162 | self._doa_loss_pred = 0
163 | self._nb_good_pks = 0
164 |
165 | self._less_est_cnt, self._less_est_frame_cnt = 0, 0
166 | self._more_est_cnt, self._more_est_frame_cnt = 0, 0
167 |
168 |
169 | ###############################################################
170 | # SED scoring functions
171 | ###############################################################
172 |
173 |
174 | def reshape_3Dto2D(A):
175 | return A.reshape(A.shape[0] * A.shape[1], A.shape[2])
176 |
177 |
178 | def f1_overall_framewise(O, T):
179 | if len(O.shape) == 3:
180 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T)
181 | TP = ((2 * T - O) == 1).sum()
182 | Nref, Nsys = T.sum(), O.sum()
183 |
184 | prec = float(TP) / float(Nsys + eps)
185 | recall = float(TP) / float(Nref + eps)
186 | f1_score = 2 * prec * recall / (prec + recall + eps)
187 | return f1_score
188 |
189 |
190 | def er_overall_framewise(O, T):
191 | if len(O.shape) == 3:
192 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T)
193 |
194 | FP = np.logical_and(T == 0, O == 1).sum(1)
195 | FN = np.logical_and(T == 1, O == 0).sum(1)
196 |
197 | S = np.minimum(FP, FN).sum()
198 | D = np.maximum(0, FN-FP).sum()
199 | I = np.maximum(0, FP-FN).sum()
200 |
201 | Nref = T.sum()
202 | ER = (S+D+I) / (Nref + 0.0)
203 | return ER
204 |
205 |
206 | def f1_overall_1sec(O, T, block_size):
207 | if len(O.shape) == 3:
208 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T)
209 | new_size = int(np.ceil(float(O.shape[0]) / block_size))
210 | O_block = np.zeros((new_size, O.shape[1]))
211 | T_block = np.zeros((new_size, O.shape[1]))
212 | for i in range(0, new_size):
213 | O_block[i, :] = np.max(O[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0)
214 | T_block[i, :] = np.max(T[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0)
215 | return f1_overall_framewise(O_block, T_block)
216 |
217 |
218 | def er_overall_1sec(O, T, block_size):
219 | if len(O.shape) == 3:
220 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T)
221 | new_size = int(np.ceil(float(O.shape[0]) / block_size))
222 | O_block = np.zeros((new_size, O.shape[1]))
223 | T_block = np.zeros((new_size, O.shape[1]))
224 | for i in range(0, new_size):
225 | O_block[i, :] = np.max(O[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0)
226 | T_block[i, :] = np.max(T[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0)
227 | return er_overall_framewise(O_block, T_block)
228 |
229 |
230 | def compute_sed_scores(pred, gt, nb_frames_1s):
231 | """
232 | Computes SED metrics for one second segments
233 |
234 | :param pred: predicted matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0
235 | :param gt: reference matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0
236 | :param nb_frames_1s: integer, number of frames in one second
237 | :return:
238 | """
239 | f1o = f1_overall_1sec(pred, gt, nb_frames_1s)
240 | ero = er_overall_1sec(pred, gt, nb_frames_1s)
241 | scores = [ero, f1o]
242 | return scores
243 |
244 |
245 | ###############################################################
246 | # DOA scoring functions
247 | ###############################################################
248 |
249 |
250 | def compute_doa_scores_regr_xyz(pred_doa, gt_doa, pred_sed, gt_sed):
251 | """
252 | Compute DOA metrics when DOA is estimated using regression approach
253 |
254 | :param pred_doa: predicted doa_labels is of dimension [nb_frames, 3*nb_classes],
255 | nb_classes each for x, y, and z axes,
256 | if active, the DOA values will be in real numbers [-1 1] range, else, it will contain default doa values of (0, 0, 0)
257 | :param gt_doa: reference doa_labels is of dimension [nb_frames, 3*nb_classes],
258 | :param pred_sed: predicted sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero
259 | :param gt_sed: reference sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero
260 | :return:
261 | """
262 |
263 | nb_src_gt_list = np.zeros(gt_doa.shape[0]).astype(int)
264 | nb_src_pred_list = np.zeros(gt_doa.shape[0]).astype(int)
265 | good_frame_cnt = 0
266 | doa_loss_pred = 0.0
267 | nb_sed = gt_sed.shape[-1]
268 |
269 | less_est_cnt, less_est_frame_cnt = 0, 0
270 | more_est_cnt, more_est_frame_cnt = 0, 0
271 |
272 | for frame_cnt, sed_frame in enumerate(gt_sed):
273 | nb_src_gt_list[frame_cnt] = int(np.sum(sed_frame))
274 | nb_src_pred_list[frame_cnt] = int(np.sum(pred_sed[frame_cnt]))
275 |
276 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction
277 | if nb_src_gt_list[frame_cnt] == nb_src_pred_list[frame_cnt]:
278 | good_frame_cnt = good_frame_cnt + 1
279 | elif nb_src_gt_list[frame_cnt] > nb_src_pred_list[frame_cnt]:
280 | less_est_cnt = less_est_cnt + nb_src_gt_list[frame_cnt] - nb_src_pred_list[frame_cnt]
281 | less_est_frame_cnt = less_est_frame_cnt + 1
282 | elif nb_src_gt_list[frame_cnt] < nb_src_pred_list[frame_cnt]:
283 | more_est_cnt = more_est_cnt + nb_src_pred_list[frame_cnt] - nb_src_gt_list[frame_cnt]
284 | more_est_frame_cnt = more_est_frame_cnt + 1
285 |
286 | # when nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas
287 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas
288 | if nb_src_gt_list[frame_cnt] and nb_src_pred_list[frame_cnt]:
289 | # DOA Loss with respect to predicted confidence
290 | sed_frame_gt = gt_sed[frame_cnt]
291 | doa_frame_gt_x = gt_doa[frame_cnt][:nb_sed][sed_frame_gt == 1]
292 | doa_frame_gt_y = gt_doa[frame_cnt][nb_sed:2*nb_sed][sed_frame_gt == 1]
293 | doa_frame_gt_z = gt_doa[frame_cnt][2*nb_sed:][sed_frame_gt == 1]
294 |
295 | sed_frame_pred = pred_sed[frame_cnt]
296 | doa_frame_pred_x = pred_doa[frame_cnt][:nb_sed][sed_frame_pred == 1]
297 | doa_frame_pred_y = pred_doa[frame_cnt][nb_sed:2*nb_sed][sed_frame_pred == 1]
298 | doa_frame_pred_z = pred_doa[frame_cnt][2*nb_sed:][sed_frame_pred == 1]
299 |
300 | doa_loss_pred += distance_between_gt_pred_xyz(np.vstack((doa_frame_gt_x, doa_frame_gt_y, doa_frame_gt_z)).T,
301 | np.vstack((doa_frame_pred_x, doa_frame_pred_y, doa_frame_pred_z)).T)
302 |
303 | doa_loss_pred_cnt = np.sum(nb_src_pred_list)
304 | if doa_loss_pred_cnt:
305 | doa_loss_pred /= doa_loss_pred_cnt
306 |
307 | frame_recall = good_frame_cnt / float(gt_sed.shape[0])
308 | er_metric = [doa_loss_pred, frame_recall, doa_loss_pred_cnt, good_frame_cnt, more_est_cnt, less_est_cnt]
309 | return er_metric
310 |
311 |
312 | def compute_doa_scores_regr(pred_doa_rad, gt_doa_rad, pred_sed, gt_sed):
313 | """
314 | Compute DOA metrics when DOA is estimated using regression approach
315 |
316 | :param pred_doa_rad: predicted doa_labels is of dimension [nb_frames, 2*nb_classes],
317 | nb_classes each for azimuth and elevation angles,
318 | if active, the DOA values will be in RADIANS, else, it will contain default doa values
319 | :param gt_doa_rad: reference doa_labels is of dimension [nb_frames, 2*nb_classes],
320 | nb_classes each for azimuth and elevation angles,
321 | if active, the DOA values will be in RADIANS, else, it will contain default doa values
322 | :param pred_sed: predicted sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero
323 | :param gt_sed: reference sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero
324 | :return:
325 | """
326 |
327 | nb_src_gt_list = np.zeros(gt_doa_rad.shape[0]).astype(int)
328 | nb_src_pred_list = np.zeros(gt_doa_rad.shape[0]).astype(int)
329 | good_frame_cnt = 0
330 | doa_loss_pred = 0.0
331 | nb_sed = gt_sed.shape[-1]
332 |
333 | less_est_cnt, less_est_frame_cnt = 0, 0
334 | more_est_cnt, more_est_frame_cnt = 0, 0
335 |
336 | for frame_cnt, sed_frame in enumerate(gt_sed):
337 | nb_src_gt_list[frame_cnt] = int(np.sum(sed_frame))
338 | nb_src_pred_list[frame_cnt] = int(np.sum(pred_sed[frame_cnt]))
339 |
340 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction
341 | if nb_src_gt_list[frame_cnt] == nb_src_pred_list[frame_cnt]:
342 | good_frame_cnt = good_frame_cnt + 1
343 | elif nb_src_gt_list[frame_cnt] > nb_src_pred_list[frame_cnt]:
344 | less_est_cnt = less_est_cnt + nb_src_gt_list[frame_cnt] - nb_src_pred_list[frame_cnt]
345 | less_est_frame_cnt = less_est_frame_cnt + 1
346 | elif nb_src_gt_list[frame_cnt] < nb_src_pred_list[frame_cnt]:
347 | more_est_cnt = more_est_cnt + nb_src_pred_list[frame_cnt] - nb_src_gt_list[frame_cnt]
348 | more_est_frame_cnt = more_est_frame_cnt + 1
349 |
350 | # when nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas
351 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas
352 | if nb_src_gt_list[frame_cnt] and nb_src_pred_list[frame_cnt]:
353 | # DOA Loss with respect to predicted confidence
354 | sed_frame_gt = gt_sed[frame_cnt]
355 | doa_frame_gt_azi = gt_doa_rad[frame_cnt][:nb_sed][sed_frame_gt == 1]
356 | doa_frame_gt_ele = gt_doa_rad[frame_cnt][nb_sed:][sed_frame_gt == 1]
357 |
358 | sed_frame_pred = pred_sed[frame_cnt]
359 | doa_frame_pred_azi = pred_doa_rad[frame_cnt][:nb_sed][sed_frame_pred == 1]
360 | doa_frame_pred_ele = pred_doa_rad[frame_cnt][nb_sed:][sed_frame_pred == 1]
361 |
362 | doa_loss_pred += distance_between_gt_pred(np.vstack((doa_frame_gt_azi, doa_frame_gt_ele)).T,
363 | np.vstack((doa_frame_pred_azi, doa_frame_pred_ele)).T)
364 |
365 | doa_loss_pred_cnt = np.sum(nb_src_pred_list)
366 | if doa_loss_pred_cnt:
367 | doa_loss_pred /= doa_loss_pred_cnt
368 |
369 | frame_recall = good_frame_cnt / float(gt_sed.shape[0])
370 | er_metric = [doa_loss_pred, frame_recall, doa_loss_pred_cnt, good_frame_cnt, more_est_cnt, less_est_cnt]
371 | return er_metric
372 |
373 |
374 | def compute_doa_scores_clas(pred_doa_thresholded, gt_doa, data_gen_test):
375 | '''
376 | Compute DOA metrics when DOA is estimated using classification approach
377 |
378 | :param pred_doa_thresholded: predicted results of dimension [nb_frames, nb_classes, nb_azi*nb_ele],
379 | with value 1 when sound event active, else 0
380 | :param gt_doa: reference results of dimension [nb_frames, nb_classes, nb_azi*nb_ele],
381 | with value 1 when sound event active, else 0
382 | :param data_gen_test: feature or data generator class
383 |
384 | :return: DOA metrics
385 |
386 | '''
387 | doa_loss_pred_cnt = np.sum(pred_doa_thresholded)
388 |
389 | doa_loss_pred = 0
390 | nb_good_pks = 0
391 |
392 | less_est_cnt, less_est_frame_cnt = 0, 0
393 | more_est_cnt, more_est_frame_cnt = 0, 0
394 |
395 | for frame in range(pred_doa_thresholded.shape[0]):
396 | nb_gt_peaks = int(np.sum(gt_doa[frame, :]))
397 | nb_pred_peaks = int(np.sum(pred_doa_thresholded[frame, :]))
398 |
399 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction
400 | if nb_gt_peaks == nb_pred_peaks:
401 | nb_good_pks += 1
402 | elif nb_gt_peaks > nb_pred_peaks:
403 | less_est_frame_cnt += 1
404 | less_est_cnt += (nb_gt_peaks - nb_pred_peaks)
405 | elif nb_pred_peaks > nb_gt_peaks:
406 | more_est_frame_cnt += 1
407 | more_est_cnt += (nb_pred_peaks - nb_gt_peaks)
408 |
409 | # when nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas
410 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas
411 | if nb_gt_peaks and nb_pred_peaks:
412 | pred_ind = np.where(pred_doa_thresholded[frame] == 1)[1]
413 | pred_list_rad = np.array(data_gen_test.get_matrix_index(pred_ind)) * np.pi / 180
414 |
415 | gt_ind = np.where(gt_doa[frame] == 1)[1]
416 | gt_list_rad = np.array(data_gen_test.get_matrix_index(gt_ind)) * np.pi / 180
417 |
418 | frame_dist = distance_between_gt_pred(gt_list_rad.T, pred_list_rad.T)
419 | doa_loss_pred += frame_dist
420 |
421 | if doa_loss_pred_cnt:
422 | doa_loss_pred /= doa_loss_pred_cnt
423 |
424 | frame_recall = nb_good_pks / float(pred_doa_thresholded.shape[0])
425 | er_metric = [doa_loss_pred, frame_recall, doa_loss_pred_cnt, nb_good_pks, more_est_cnt, less_est_cnt]
426 | return er_metric
427 |
428 |
429 | def distance_between_gt_pred(gt_list_rad, pred_list_rad):
430 | """
431 | Shortest distance between two sets of spherical coordinates. Given a set of groundtruth spherical coordinates,
432 | and its respective predicted coordinates, we calculate the spherical distance between each of the spherical
433 | coordinate pairs resulting in a matrix of distances, where one axis represents the number of groundtruth
434 | coordinates and the other the predicted coordinates. The number of estimated peaks need not be the same as in
435 | groundtruth, thus the distance matrix is not always a square matrix. We use the Hungarian algorithm to find the
436 | least cost in this distance matrix.
437 |
438 | :param gt_list_rad: list of ground-truth spherical coordinates
439 | :param pred_list_rad: list of predicted spherical coordinates
440 | :return: cost - total angular distance (in degrees) over the matched pairs
441 | (note: the counts of missed and over-estimated DOAs are accumulated by the
442 | calling code, not returned here)
443 | """
444 |
445 | gt_len, pred_len = gt_list_rad.shape[0], pred_list_rad.shape[0]
446 | ind_pairs = np.array([[x, y] for y in range(pred_len) for x in range(gt_len)])
447 | cost_mat = np.zeros((gt_len, pred_len))
448 |
449 | # Slow implementation
450 | # cost_mat = np.zeros((gt_len, pred_len))
451 | # for gt_cnt, gt in enumerate(gt_list_rad):
452 | # for pred_cnt, pred in enumerate(pred_list_rad):
453 | # cost_mat[gt_cnt, pred_cnt] = distance_between_spherical_coordinates_rad(gt, pred)
454 |
455 | # Fast implementation
456 | if gt_len and pred_len:
457 | az1, ele1, az2, ele2 = gt_list_rad[ind_pairs[:, 0], 0], gt_list_rad[ind_pairs[:, 0], 1], \
458 | pred_list_rad[ind_pairs[:, 1], 0], pred_list_rad[ind_pairs[:, 1], 1]
459 | cost_mat[ind_pairs[:, 0], ind_pairs[:, 1]] = distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2)
460 |
461 | row_ind, col_ind = linear_sum_assignment(cost_mat)
462 | cost = cost_mat[row_ind, col_ind].sum()
463 | return cost
464 |
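# Added illustrative note (not part of the original DCASE code): with, say, two reference
# and three estimated DOAs, cost_mat is a 2x3 matrix of pairwise angular distances, and
# scipy's linear_sum_assignment picks the lowest-cost one-to-one pairing, e.g.
#   gt = np.array([[0.0, 0.0], [np.pi / 2, 0.0]])                      # (azimuth, elevation) in radians
#   est = np.array([[0.1, 0.0], [np.pi / 2, 0.1], [np.pi, 0.0]])
#   distance_between_gt_pred(gt, est)    # sum of the two matched angular distances, in degrees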
465 |
466 | def distance_between_gt_pred_xyz(gt_list, pred_list):
467 | """
468 | Shortest distance between two sets of Cartesian coordinates. Given a set of groundtruth coordinates,
469 | and its respective predicted coordinates, we calculate the angular distance between each of the
470 | coordinate pairs, resulting in a matrix of distances, where one axis represents the number of groundtruth
471 | coordinates and the other the predicted coordinates. The number of estimated peaks need not be the same as in
472 | groundtruth, thus the distance matrix is not always a square matrix. We use the Hungarian algorithm to find the
473 | least cost in this distance matrix.
474 |
475 | :param gt_list: list of ground-truth Cartesian coordinates
476 | :param pred_list: list of predicted Cartesian coordinates
477 | :return: cost - total angular distance (in degrees) over the matched pairs
478 | (note: the counts of missed and over-estimated DOAs are accumulated by the
479 | calling code, not returned here)
480 | """
481 |
482 | gt_len, pred_len = gt_list.shape[0], pred_list.shape[0]
483 | ind_pairs = np.array([[x, y] for y in range(pred_len) for x in range(gt_len)])
484 | cost_mat = np.zeros((gt_len, pred_len))
485 |
486 | # Slow implementation
487 | # cost_mat = np.zeros((gt_len, pred_len))
488 | # for gt_cnt, gt in enumerate(gt_list):
489 | # for pred_cnt, pred in enumerate(pred_list):
490 | # cost_mat[gt_cnt, pred_cnt] = distance_between_cartesian_coordinates(*gt, *pred)
491 |
492 | # Fast implementation
493 | if gt_len and pred_len:
494 | x1, y1, z1, x2, y2, z2 = gt_list[ind_pairs[:, 0], 0], gt_list[ind_pairs[:, 0], 1], gt_list[ind_pairs[:, 0], 2], \
495 | pred_list[ind_pairs[:, 1], 0], pred_list[ind_pairs[:, 1], 1], pred_list[ind_pairs[:, 1], 2]
496 | cost_mat[ind_pairs[:, 0], ind_pairs[:, 1]] = distance_between_cartesian_coordinates(x1, y1, z1, x2, y2, z2)
497 |
498 | row_ind, col_ind = linear_sum_assignment(cost_mat)
499 | cost = cost_mat[row_ind, col_ind].sum()
500 | return cost
501 |
502 |
503 | def distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2):
504 | """
505 | Angular distance between two spherical coordinates
506 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance
507 |
508 | :return: angular distance in degrees
509 | """
510 | dist = np.sin(ele1) * np.sin(ele2) + np.cos(ele1) * np.cos(ele2) * np.cos(np.abs(az1 - az2))
511 | # Making sure the dist values are in -1 to 1 range, else np.arccos kills the job
512 | dist = np.clip(dist, -1, 1)
513 | dist = np.arccos(dist) * 180 / np.pi
514 | return dist
515 |
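# Added sanity-check note: the expression above is the great-circle distance, so identical
# directions give 0 degrees and antipodal directions give 180 degrees, e.g.
#   distance_between_spherical_coordinates_rad(0.0, 0.0, 0.0, 0.0)      # -> 0.0
#   distance_between_spherical_coordinates_rad(0.0, 0.0, np.pi, 0.0)    # -> 180.0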
516 |
517 | def distance_between_cartesian_coordinates(x1, y1, z1, x2, y2, z2):
518 | """
519 | Angular distance between two cartesian coordinates
520 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance
521 | Check 'From chord length' section
522 |
523 | :return: angular distance in degrees
524 | """
525 | # Normalize the Cartesian vectors
526 | N1 = np.sqrt(x1**2 + y1**2 + z1**2 + 1e-10)
527 | N2 = np.sqrt(x2**2 + y2**2 + z2**2 + 1e-10)
528 | x1, y1, z1, x2, y2, z2 = x1/N1, y1/N1, z1/N1, x2/N2, y2/N2, z2/N2
529 |
530 | #Compute the distance
531 | dist = x1*x2 + y1*y2 + z1*z2
532 | dist = np.clip(dist, -1, 1)
533 | dist = np.arccos(dist) * 180 / np.pi
534 | return dist
535 |
536 |
537 | def sph2cart(azimuth, elevation, r):
538 | '''
539 | Convert spherical to cartesian coordinates
540 |
541 | :param azimuth: in radians
542 | :param elevation: in radians
543 | :param r: in meters
544 | :return: cartesian coordinates
545 | '''
546 |
547 | x = r * np.cos(elevation) * np.cos(azimuth)
548 | y = r * np.cos(elevation) * np.sin(azimuth)
549 | z = r * np.sin(elevation)
550 | return x, y, z
551 |
552 |
553 | def cart2sph(x, y, z):
554 | '''
555 | Convert cartesian to spherical coordinates
556 |
557 | :param x:
558 | :param y:
559 | :param z:
560 | :return: azi, ele in radians and r in meters
561 | '''
562 |
563 | azimuth = np.arctan2(y,x)
564 | elevation = np.arctan2(z,np.sqrt(x**2 + y**2))
565 | r = np.sqrt(x**2 + y**2 + z**2)
566 | return azimuth, elevation, r
567 |
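# Added round-trip note: sph2cart and cart2sph are inverses up to floating-point error, e.g.
#   x, y, z = sph2cart(np.pi / 4, np.pi / 6, 1.0)
#   cart2sph(x, y, z)    # -> (~pi/4, ~pi/6, ~1.0)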
568 |
569 | ###############################################################
570 | # SELD scoring functions
571 | ###############################################################
572 |
573 |
574 | def early_stopping_metric(sed_error, doa_error):
575 | """
576 | Compute early stopping metric from sed and doa errors.
577 |
578 | :param sed_error: [error rate (0 to 1 range), f score (0 to 1 range)]
579 | :param doa_error: [doa error (in degrees), frame recall (0 to 1 range)]
580 | :return: seld metric result
581 | """
582 | seld_metric = np.mean([
583 | sed_error[0],
584 | 1 - sed_error[1],
585 | doa_error[0]/180,
586 | 1 - doa_error[1]]
587 | )
588 | return seld_metric
589 |
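# Added worked example (hypothetical values): with sed_error = [0.30, 0.80] (ER, F) and
# doa_error = [20.0, 0.85] (DOA error in degrees, frame recall),
#   early_stopping_metric([0.30, 0.80], [20.0, 0.85])
# averages 0.30, 0.20, 20/180 ~= 0.111 and 0.15, giving ~0.19 (lower is better).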
590 |
--------------------------------------------------------------------------------
/seld/methods/utils/SELD_evaluation_metrics_2020.py:
--------------------------------------------------------------------------------
1 | #
2 | # Implements the localization and detection metrics proposed in the paper
3 | #
4 | # Joint Measurement of Localization and Detection of Sound Events
5 | # Annamaria Mesaros, Sharath Adavanne, Archontis Politis, Toni Heittola, Tuomas Virtanen
6 | # WASPAA 2019
7 | #
8 | #
9 | # This script has MIT license
10 | #
11 |
12 | import numpy as np
13 | from IPython import embed
14 | eps = np.finfo(float).eps
15 | from scipy.optimize import linear_sum_assignment
16 |
17 |
18 | class SELDMetrics(object):
19 | def __init__(self, doa_threshold=20, nb_classes=11):
20 | '''
21 | This class implements both the class-sensitive localization and location-sensitive detection metrics.
22 | Additionally, based on the user input, the corresponding averaging is performed within the segment.
23 |
24 | :param nb_classes: Number of sound classes. In the paper, nb_classes = 11
25 | :param doa_threshold: DOA threshold for location-sensitive detection.
26 | '''
27 |
28 | self._TP = 0
29 | self._FP = 0
30 | self._TN = 0
31 | self._FN = 0
32 |
33 | self._S = 0
34 | self._D = 0
35 | self._I = 0
36 |
37 | self._Nref = 0
38 | self._Nsys = 0
39 |
40 | self._total_DE = 0
41 | self._DE_TP = 0
42 |
43 | self._spatial_T = doa_threshold
44 | self._nb_classes = nb_classes
45 |
46 | def compute_seld_scores(self):
47 | '''
48 | Collect the final SELD scores
49 |
50 | :return: returns both location-sensitive detection scores and class-sensitive localization scores
51 | '''
52 |
53 | # Location-sensitive detection performance
54 | ER = (self._S + self._D + self._I) / float(self._Nref + eps)
55 |
56 | prec = float(self._TP) / float(self._Nsys + eps)
57 | recall = float(self._TP) / float(self._Nref + eps)
58 | F = 2 * prec * recall / (prec + recall + eps)
59 |
60 | # Class-sensitive localization performance
61 | if self._DE_TP:
62 | DE = self._total_DE / float(self._DE_TP + eps)
63 | else:
64 | # When there are no localization true positives, use the worst-case error of 180 degrees
65 | DE = 180
66 |
67 | DE_prec = float(self._DE_TP) / float(self._Nsys + eps)
68 | DE_recall = float(self._DE_TP) / float(self._Nref + eps)
69 | DE_F = 2 * DE_prec * DE_recall / (DE_prec + DE_recall + eps)
70 |
71 | return ER, F, DE, DE_F
72 |
73 | def update_seld_scores_xyz(self, pred, gt):
74 | '''
75 | Implements the spatial error averaging according to equation [5] in the paper, using Cartesian distance
76 |
77 | :param pred: dictionary containing class-wise prediction results for each N-seconds segment block
78 | :param gt: dictionary containing class-wise groundtruth for each N-seconds segment block
79 | '''
80 | for block_cnt in range(len(gt.keys())):
81 | # print('\nblock_cnt', block_cnt, end='')
82 | loc_FN, loc_FP = 0, 0
83 | for class_cnt in range(self._nb_classes):
84 | # print('\tclass:', class_cnt, end='')
85 | # Counting the number of ref and sys outputs should include the number of tracks for each class in the segment
86 | if class_cnt in gt[block_cnt]:
87 | self._Nref += 1
88 | if class_cnt in pred[block_cnt]:
89 | self._Nsys += 1
90 |
91 | if class_cnt in gt[block_cnt] and class_cnt in pred[block_cnt]:
92 | # True positives or False negative case
93 |
94 | # NOTE: For multiple tracks per class, identify multiple tracks using hungarian algorithm and then
95 | # calculate the spatial distance using the following code. In the current code, if there are multiple
96 | # tracks of the same class in a frame we are calculating the least cost between the groundtruth and predicted and using it.
97 |
98 | total_spatial_dist = 0
99 | total_framewise_matching_doa = 0
100 | gt_ind_list = gt[block_cnt][class_cnt][0][0]
101 | pred_ind_list = pred[block_cnt][class_cnt][0][0]
102 | for gt_ind, gt_val in enumerate(gt_ind_list):
103 | if gt_val in pred_ind_list:
104 | total_framewise_matching_doa += 1
105 | pred_ind = pred_ind_list.index(gt_val)
106 |
107 | gt_arr = np.array(gt[block_cnt][class_cnt][0][1][gt_ind])
108 | pred_arr = np.array(pred[block_cnt][class_cnt][0][1][pred_ind])
109 |
110 | if gt_arr.shape[0]==1 and pred_arr.shape[0]==1:
111 | total_spatial_dist += distance_between_cartesian_coordinates(gt_arr[0][0], gt_arr[0][1], gt_arr[0][2], pred_arr[0][0], pred_arr[0][1], pred_arr[0][2])
112 | else:
113 | total_spatial_dist += least_distance_between_gt_pred(gt_arr, pred_arr)
114 |
115 | if total_spatial_dist == 0 and total_framewise_matching_doa == 0:
116 | loc_FN += 1
117 | self._FN += 1
118 | else:
119 | avg_spatial_dist = (total_spatial_dist / total_framewise_matching_doa)
120 |
121 | self._total_DE += avg_spatial_dist
122 | self._DE_TP += 1
123 |
124 | if avg_spatial_dist <= self._spatial_T:
125 | self._TP += 1
126 | else:
127 | loc_FN += 1
128 | self._FN += 1
129 | elif class_cnt in gt[block_cnt] and class_cnt not in pred[block_cnt]:
130 | # False negative
131 | loc_FN += 1
132 | self._FN += 1
133 | elif class_cnt not in gt[block_cnt] and class_cnt in pred[block_cnt]:
134 | # False positive
135 | loc_FP += 1
136 | self._FP += 1
137 | elif class_cnt not in gt[block_cnt] and class_cnt not in pred[block_cnt]:
138 | # True negative
139 | self._TN += 1
140 |
141 | self._S += np.minimum(loc_FP, loc_FN)
142 | self._D += np.maximum(0, loc_FN - loc_FP)
143 | self._I += np.maximum(0, loc_FP - loc_FN)
144 | return
145 |
146 | def update_seld_scores(self, pred_deg, gt_deg):
147 | '''
148 | Implements the spatial error averaging according to equation [5] in the paper, using Polar distance
149 | Expects the angles in degrees
150 |
151 | :param pred_deg: dictionary containing class-wise prediction results for each N-seconds segment block
152 | :param gt_deg: dictionary containing class-wise groundtruth for each N-seconds segment block
153 | '''
154 | for block_cnt in range(len(gt_deg.keys())):
155 | # print('\nblock_cnt', block_cnt, end='')
156 | loc_FN, loc_FP = 0, 0
157 | for class_cnt in range(self._nb_classes):
158 | # print('\tclass:', class_cnt, end='')
159 | # Counting the number of ref and sys outputs should include the number of tracks for each class in the segment
160 | if class_cnt in gt_deg[block_cnt]:
161 | self._Nref += 1
162 | if class_cnt in pred_deg[block_cnt]:
163 | self._Nsys += 1
164 |
165 | if class_cnt in gt_deg[block_cnt] and class_cnt in pred_deg[block_cnt]:
166 | # True positives or False negative case
167 |
168 | # NOTE: For multiple tracks per class, identify multiple tracks using hungarian algorithm and then
169 | # calculate the spatial distance using the following code. In the current code, if there are multiple
170 | # tracks of the same class in a frame we are calculating the least cost between the groundtruth and predicted and using it.
171 | total_spatial_dist = 0
172 | total_framewise_matching_doa = 0
173 | gt_ind_list = gt_deg[block_cnt][class_cnt][0][0]
174 | pred_ind_list = pred_deg[block_cnt][class_cnt][0][0]
175 | for gt_ind, gt_val in enumerate(gt_ind_list):
176 | if gt_val in pred_ind_list:
177 | total_framewise_matching_doa += 1
178 | pred_ind = pred_ind_list.index(gt_val)
179 |
180 | gt_arr = np.array(gt_deg[block_cnt][class_cnt][0][1][gt_ind]) * np.pi / 180
181 | pred_arr = np.array(pred_deg[block_cnt][class_cnt][0][1][pred_ind]) * np.pi / 180
182 | if gt_arr.shape[0]==1 and pred_arr.shape[0]==1:
183 | total_spatial_dist += distance_between_spherical_coordinates_rad(gt_arr[0][0], gt_arr[0][1], pred_arr[0][0], pred_arr[0][1])
184 | else:
185 | total_spatial_dist += least_distance_between_gt_pred(gt_arr, pred_arr)
186 |
187 | if total_spatial_dist == 0 and total_framewise_matching_doa == 0:
188 | loc_FN += 1
189 | self._FN += 1
190 | else:
191 | avg_spatial_dist = (total_spatial_dist / total_framewise_matching_doa)
192 |
193 | self._total_DE += avg_spatial_dist
194 | self._DE_TP += 1
195 |
196 | if avg_spatial_dist <= self._spatial_T:
197 | self._TP += 1
198 | else:
199 | loc_FN += 1
200 | self._FN += 1
201 | elif class_cnt in gt_deg[block_cnt] and class_cnt not in pred_deg[block_cnt]:
202 | # False negative
203 | loc_FN += 1
204 | self._FN += 1
205 | elif class_cnt not in gt_deg[block_cnt] and class_cnt in pred_deg[block_cnt]:
206 | # False positive
207 | loc_FP += 1
208 | self._FP += 1
209 | elif class_cnt not in gt_deg[block_cnt] and class_cnt not in pred_deg[block_cnt]:
210 | # True negative
211 | self._TN += 1
212 |
213 | self._S += np.minimum(loc_FP, loc_FN)
214 | self._D += np.maximum(0, loc_FN - loc_FP)
215 | self._I += np.maximum(0, loc_FP - loc_FN)
216 | return
217 |
218 |
219 | def distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2):
220 | """
221 | Angular distance between two spherical coordinates
222 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance
223 |
224 | :return: angular distance in degrees
225 | """
226 | dist = np.sin(ele1) * np.sin(ele2) + np.cos(ele1) * np.cos(ele2) * np.cos(np.abs(az1 - az2))
227 | # Making sure the dist values are in -1 to 1 range, else np.arccos kills the job
228 | dist = np.clip(dist, -1, 1)
229 | dist = np.arccos(dist) * 180 / np.pi
230 | return dist
231 |
232 |
233 | def distance_between_cartesian_coordinates(x1, y1, z1, x2, y2, z2):
234 | """
235 | Angular distance between two cartesian coordinates
236 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance
237 | Check 'From chord length' section
238 |
239 | :return: angular distance in degrees
240 | """
241 | # Normalize the Cartesian vectors
242 | N1 = np.sqrt(x1**2 + y1**2 + z1**2 + 1e-10)
243 | N2 = np.sqrt(x2**2 + y2**2 + z2**2 + 1e-10)
244 | x1, y1, z1, x2, y2, z2 = x1/N1, y1/N1, z1/N1, x2/N2, y2/N2, z2/N2
245 |
246 | #Compute the distance
247 | dist = x1*x2 + y1*y2 + z1*z2
248 | dist = np.clip(dist, -1, 1)
249 | dist = np.arccos(dist) * 180 / np.pi
250 | return dist
251 |
252 |
253 | def least_distance_between_gt_pred(gt_list, pred_list):
254 | """
255 | Shortest distance between two sets of DOA coordinates. Given a set of groundtruth coordinates,
256 | and its respective predicted coordinates, we calculate the distance between each of the
257 | coordinate pairs resulting in a matrix of distances, where one axis represents the number of groundtruth
258 | coordinates and the other the predicted coordinates. The number of estimated peaks need not be the same as in
259 | groundtruth, thus the distance matrix is not always a square matrix. We use the Hungarian algorithm to find the
260 | least cost in this distance matrix.
261 | :param gt_list: list of ground-truth Cartesian or polar coordinates (polar angles in radians)
262 | :param pred_list: list of predicted Cartesian or polar coordinates (polar angles in radians)
263 | :return: cost - total angular distance (in degrees) over the matched pairs
264 | (note: the counts of missed and over-estimated DOAs are tracked by the caller,
265 | not returned here)
266 | """
267 | gt_len, pred_len = gt_list.shape[0], pred_list.shape[0]
268 | ind_pairs = np.array([[x, y] for y in range(pred_len) for x in range(gt_len)])
269 | cost_mat = np.zeros((gt_len, pred_len))
270 |
271 | if gt_len and pred_len:
272 | if len(gt_list[0]) == 3: #Cartesian
273 | x1, y1, z1, x2, y2, z2 = gt_list[ind_pairs[:, 0], 0], gt_list[ind_pairs[:, 0], 1], gt_list[ind_pairs[:, 0], 2], pred_list[ind_pairs[:, 1], 0], pred_list[ind_pairs[:, 1], 1], pred_list[ind_pairs[:, 1], 2]
274 | cost_mat[ind_pairs[:, 0], ind_pairs[:, 1]] = distance_between_cartesian_coordinates(x1, y1, z1, x2, y2, z2)
275 | else:
276 | az1, ele1, az2, ele2 = gt_list[ind_pairs[:, 0], 0], gt_list[ind_pairs[:, 0], 1], pred_list[ind_pairs[:, 1], 0], pred_list[ind_pairs[:, 1], 1]
277 | cost_mat[ind_pairs[:, 0], ind_pairs[:, 1]] = distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2)
278 |
279 | row_ind, col_ind = linear_sum_assignment(cost_mat)
280 | cost = cost_mat[row_ind, col_ind].sum()
281 | return cost
282 |
283 |
284 | def early_stopping_metric(sed_error, doa_error):
285 | """
286 | Compute early stopping metric from sed and doa errors.
287 |
288 | :param sed_error: [error rate (0 to 1 range), f score (0 to 1 range)]
289 | :param doa_error: [doa error (in degrees), frame recall (0 to 1 range)]
290 | :return: early stopping metric result
291 | """
292 | seld_metric = np.mean([
293 | sed_error[0],
294 | 1 - sed_error[1],
295 | doa_error[0]/180,
296 | 1 - doa_error[1]]
297 | )
298 | return seld_metric
299 |
--------------------------------------------------------------------------------
/seld/methods/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/seld/methods/utils/__init__.py
--------------------------------------------------------------------------------
/seld/methods/utils/data_utilities.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import torch
4 |
5 |
6 | def _segment_index(x, chunklen, hoplen, last_frame_always_paddding=False):
7 | """Segment input x with chunklen, hoplen parameters. Return
8 |
9 | Args:
10 | x: input, time domain or feature domain (channels, time)
11 | chunklen:
12 | hoplen:
13 | last_frame_always_paddding: to decide if always padding for the last frame
14 |
15 | Return:
16 | segmented_indexes: [(begin_index, end_index), (begin_index, end_index), ...]
17 | segmented_pad_width: [(before, after), (before, after), ...]
18 | """
19 | x_len = x.shape[1]
20 |
21 | segmented_indexes = []
22 | segmented_pad_width = []
23 | if x_len < chunklen:
24 | begin_index = 0
25 | end_index = x_len
26 | pad_width_before = 0
27 | pad_width_after = chunklen - x_len
28 | segmented_indexes.append((begin_index, end_index))
29 | segmented_pad_width.append((pad_width_before, pad_width_after))
30 | return segmented_indexes, segmented_pad_width
31 |
32 | n_frames = 1 + (x_len - chunklen) // hoplen
33 | for n in range(n_frames):
34 | begin_index = n * hoplen
35 | end_index = n * hoplen + chunklen
36 | segmented_indexes.append((begin_index, end_index))
37 | pad_width_before = 0
38 | pad_width_after = 0
39 | segmented_pad_width.append((pad_width_before, pad_width_after))
40 |
41 | if (n_frames - 1) * hoplen + chunklen == x_len:
42 | return segmented_indexes, segmented_pad_width
43 |
44 | # the last frame
45 | if last_frame_always_paddding:
46 | begin_index = n_frames * hoplen
47 | end_index = x_len
48 | pad_width_before = 0
49 | pad_width_after = chunklen - (x_len - n_frames * hoplen)
50 | else:
51 | if x_len - n_frames * hoplen >= chunklen // 2:
52 | begin_index = n_frames * hoplen
53 | end_index = x_len
54 | pad_width_before = 0
55 | pad_width_after = chunklen - (x_len - n_frames * hoplen)
56 | else:
57 | begin_index = x_len - chunklen
58 | end_index = x_len
59 | pad_width_before = 0
60 | pad_width_after = 0
61 | segmented_indexes.append((begin_index, end_index))
62 | segmented_pad_width.append((pad_width_before, pad_width_after))
63 |
64 | return segmented_indexes, segmented_pad_width
65 |
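# Added usage sketch (hypothetical values): for a 10-frame feature with chunklen=4 and
# hoplen=2, the input splits exactly into four chunks and no padding is needed:
#   x = np.zeros((7, 10))                      # (channels, time)
#   _segment_index(x, chunklen=4, hoplen=2)
#   # -> ([(0, 4), (2, 6), (4, 8), (6, 10)], [(0, 0), (0, 0), (0, 0), (0, 0)])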
66 |
67 | def load_dcase_format(meta_path, frame_begin_index=0, frame_length=600, num_classes=14, set_type='gt'):
68 | """ Load meta into dcase format
69 |
70 | Args:
71 | meta_path (Path obj): path of meta file
72 | frame_begin_index (int): frame begin index, for concatenating labels
73 | frame_length (int): frame length in a file
74 | num_classes (int): number of classes
75 | Output:
76 | output_dict: return a dict containing dcase output format
77 | output_dict[frame-containing-events] = [[class_index_1, azi_1 in degree, ele_1 in degree], [class_index_2, azi_2 in degree, ele_2 in degree]]
78 | sed_metrics2019: (frame, num_classes)
79 | doa_metrics2019: (frame, 2*num_classes), with (frame, 0:num_classes) represents azimuth, (frame, num_classes:2*num_classes) represents elevation
80 | both are in radians
81 | """
82 | df = pd.read_csv(meta_path, header=None)
83 |
84 | output_dict = {}
85 | sed_metrics2019 = np.zeros((frame_length, num_classes))
86 | doa_metrics2019 = np.zeros((frame_length, 2*num_classes))
87 | for row in df.iterrows():
88 | frame_idx = row[1][0]
89 | frame_idx2020 = frame_idx + frame_begin_index
90 | event_idx = row[1][1]
91 | if set_type == 'gt':
92 | azi = row[1][3]
93 | ele = row[1][4]
94 | elif set_type == 'pred':
95 | azi = row[1][2]
96 | ele = row[1][3]
97 | if frame_idx2020 not in output_dict:
98 | output_dict[frame_idx2020] = []
99 | output_dict[frame_idx2020].append([event_idx, azi, ele])
100 | sed_metrics2019[frame_idx, event_idx] = 1.0
101 | doa_metrics2019[frame_idx, event_idx], doa_metrics2019[frame_idx, event_idx + num_classes] \
102 | = azi * np.pi / 180.0, ele * np.pi / 180.0
103 | return output_dict, sed_metrics2019, doa_metrics2019
104 |
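# Added note on the meta layout assumed by the column indices above: ground-truth csv rows
# are read as [frame, class, track, azimuth, elevation] and prediction csv rows as
# [frame, class, azimuth, elevation], with angles in degrees; the 2019-style matrices
# (sed_metrics2019, doa_metrics2019) store the same information frame-wise, with angles in radians.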
105 |
106 | def to_metrics2020_format(label_dict, num_frames, label_resolution):
107 | """Collect class-wise sound event location information in segments of length 1s (according to DCASE2020) from reference dataset
108 |
109 | Reference:
110 | https://github.com/sharathadavanne/seld-dcase2020/blob/74a0e1db61cee32c19ea9dde87ba1a5389eb9a85/cls_feature_class.py#L312
111 | Args:
112 | label_dict: Dictionary containing frame-wise sound event time and location information. Dcase format.
113 | num_frames: Total number of frames in the recording.
114 | label_resolution: Groundtruth label resolution.
115 | Output:
116 | output_dict: Dictionary containing class-wise sound event location information in each segment of audio
117 | dictionary_name[segment-index][class-index] = list(frame-cnt-within-segment, azimuth in degree, elevation in degree)
118 | """
119 |
120 | num_label_frames_1s = int(1 / label_resolution)
121 | num_blocks = int(np.ceil(num_frames / float(num_label_frames_1s)))
122 | output_dict = {x: {} for x in range(num_blocks)}
123 | for n_frame in range(0, num_frames, num_label_frames_1s):
124 | # Collect class-wise information for each block
125 | # loc_dict[class][frame-within-block] = list of [azimuth, elevation] entries
126 | # Data structure supports multi-instance occurrence of the same class
127 | n_block = n_frame // num_label_frames_1s
128 | loc_dict = {}
129 | for audio_frame in range(n_frame, n_frame + num_label_frames_1s):
130 | if audio_frame not in label_dict:
131 | continue
132 | for value in label_dict[audio_frame]:
133 | if value[0] not in loc_dict:
134 | loc_dict[value[0]] = {}
135 |
136 | block_frame = audio_frame - n_frame
137 | if block_frame not in loc_dict[value[0]]:
138 | loc_dict[value[0]][block_frame] = []
139 | loc_dict[value[0]][block_frame].append(value[1:])
140 |
141 | # Update the block wise details collected above in a global structure
142 | for n_class in loc_dict:
143 | if n_class not in output_dict[n_block]:
144 | output_dict[n_block][n_class] = []
145 |
146 | keys = [k for k in loc_dict[n_class]]
147 | values = [loc_dict[n_class][k] for k in loc_dict[n_class]]
148 |
149 | output_dict[n_block][n_class].append([keys, values])
150 |
151 | return output_dict
152 |
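# Added note on the structure built above: for each 1 s segment and class,
# output_dict[segment][class] holds one [frame_indices, doa_values] pair, where frame indices
# count frames within the segment and each frame maps to a list of [azimuth, elevation] entries
# (more than one when several instances of the same class are active), e.g. (hypothetical values)
#   output_dict[3][7] = [[[0, 1, 2], [[[30.0, 10.0]], [[30.0, 10.0]], [[40.0, 10.0]]]]]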
153 |
154 |
--------------------------------------------------------------------------------
/seld/methods/utils/loss_utilities.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import torch.nn.functional as F
4 | eps = torch.finfo(torch.float32).eps
5 |
6 |
7 | class MSELoss:
8 | def __init__(self, reduction='mean'):
9 | self.reduction = reduction
10 | self.name = 'loss_MSE'
11 | if self.reduction != 'PIT':
12 | self.loss = nn.MSELoss(reduction='mean')
13 | else:
14 | self.loss = nn.MSELoss(reduction='none')
15 |
16 | def calculate_loss(self, pred, target):
17 | if self.reduction != 'PIT':
18 | return self.loss(pred, target)
19 | else:
20 | return self.loss(pred, target).mean(dim=tuple(range(2, pred.ndim)))
21 |
22 |
23 | class BCEWithLogitsLoss:
24 | def __init__(self, reduction='mean', pos_weight=None):
25 | self.reduction = reduction
26 | self.name = 'loss_BCEWithLogits'
27 | if self.reduction != 'PIT':
28 | self.loss = nn.BCEWithLogitsLoss(reduction=self.reduction, pos_weight=pos_weight)
29 | else:
30 | self.loss = nn.BCEWithLogitsLoss(reduction='none', pos_weight=pos_weight)
31 |
32 | def calculate_loss(self, pred, target):
33 | if self.reduction != 'PIT':
34 | return self.loss(pred, target)
35 | else:
36 | return self.loss(pred, target).mean(dim=tuple(range(2, pred.ndim)))
37 |
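# Added usage sketch (tensors are hypothetical): with reduction='PIT' the losses keep the
# (batch, track) dimensions so a permutation-invariant training wrapper can compare track orders:
#   loss_fn = MSELoss(reduction='PIT')
#   pred = torch.rand(8, 2, 100, 3)      # (batch, tracks, frames, xyz)
#   target = torch.rand(8, 2, 100, 3)
#   per_track = loss_fn.calculate_loss(pred, target)    # shape (8, 2)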
38 |
--------------------------------------------------------------------------------
/seld/methods/utils/model_utilities.py:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | import numpy as np
4 | import torch
5 | import torch.nn as nn
6 | import torch.nn.functional as F
7 |
8 |
9 | def init_layer(layer, nonlinearity='leaky_relu'):
10 | '''
11 | Initialize a layer
12 | '''
13 | classname = layer.__class__.__name__
14 | if (classname.find('Conv') != -1) or (classname.find('Linear') != -1):
15 | nn.init.kaiming_uniform_(layer.weight, nonlinearity=nonlinearity)
16 | if hasattr(layer, 'bias'):
17 | if layer.bias is not None:
18 | nn.init.constant_(layer.bias, 0.0)
19 | elif classname.find('BatchNorm') != -1:
20 | nn.init.normal_(layer.weight, 1.0, 0.02)
21 | nn.init.constant_(layer.bias, 0.0)
22 |
23 |
24 | class DoubleConv(nn.Module):
25 | def __init__(self, in_channels, out_channels,
26 | kernel_size=(3,3), stride=(1,1), padding=(1,1),
27 | dilation=1, bias=False):
28 | super().__init__()
29 |
30 | self.double_conv = nn.Sequential(
31 | nn.Conv2d(in_channels=in_channels,
32 | out_channels=out_channels,
33 | kernel_size=kernel_size, stride=stride,
34 | padding=padding, dilation=dilation, bias=bias),
35 | nn.BatchNorm2d(out_channels),
36 | nn.ReLU(inplace=True),
37 | # nn.LeakyReLU(negative_slope=0.1, inplace=True),
38 | nn.Conv2d(in_channels=out_channels,
39 | out_channels=out_channels,
40 | kernel_size=kernel_size, stride=stride,
41 | padding=padding, dilation=dilation, bias=bias),
42 | nn.BatchNorm2d(out_channels),
43 | nn.ReLU(inplace=True),
44 | # nn.LeakyReLU(negative_slope=0.1, inplace=True),
45 | )
46 |
47 | self.init_weights()
48 |
49 | def init_weights(self):
50 | for layer in self.double_conv:
51 | init_layer(layer)
52 |
53 | def forward(self, x):
54 | x = self.double_conv(x)
55 |
56 | return x
57 |
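# Added usage sketch (shapes are hypothetical): the default 3x3 kernels with padding 1 keep
# the time-frequency resolution, so DoubleConv only changes the channel dimension:
#   block = DoubleConv(in_channels=7, out_channels=64)
#   x = torch.rand(4, 7, 100, 64)        # (batch, channels, time, freq)
#   block(x).shape                       # -> (4, 64, 100, 64)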
58 |
59 | class PositionalEncoding(nn.Module):
60 | def __init__(self, pos_len, d_model=512, pe_type='t', dropout=0.0):
61 | """ Positional encoding using sin and cos
62 |
63 | Args:
64 | pos_len: positional length
65 | d_model: number of feature maps
66 | pe_type: 't' | 'f' , time domain, frequency domain
67 | dropout: dropout probability
68 | """
69 | super().__init__()
70 |
71 | self.pe_type = pe_type
72 | pe = torch.zeros(pos_len, d_model)
73 | pos = torch.arange(0, pos_len).float().unsqueeze(1)
74 | div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
75 | pe[:, 0::2] = 0.1 * torch.sin(pos * div_term)
76 | pe[:, 1::2] = 0.1 * torch.cos(pos * div_term)
77 | pe = pe.unsqueeze(0).transpose(1, 2) # (N, C, T)
78 | self.register_buffer('pe', pe)
79 | self.dropout = nn.Dropout(p=dropout)
80 |
81 | def forward(self, x):
82 | # x is (N, C, T, F) or (N, C, T) or (N, C, F)
83 | if x.ndim == 4:
84 | if self.pe_type == 't':
85 | pe = self.pe.unsqueeze(3)
86 | x += pe[:, :, :x.shape[2]]
87 | elif self.pe_type == 'f':
88 | pe = self.pe.unsqueeze(2)
89 | x += pe[:, :, :, :x.shape[3]]
90 | elif x.ndim == 3:
91 | x += self.pe[:, :, :x.shape[2]]
92 | return self.dropout(x)
93 |
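# Added usage sketch (shapes are hypothetical): with pe_type='t' the sin/cos table, scaled
# by 0.1, is added along the time axis and broadcast over frequency:
#   pos_enc = PositionalEncoding(pos_len=100, d_model=64, pe_type='t')
#   x = torch.rand(4, 64, 100, 32)       # (batch, channels, time, freq)
#   pos_enc(x).shape                     # same shape as x: (4, 64, 100, 32)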
94 |
--------------------------------------------------------------------------------
/seld/methods/utils/stft.py:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | import librosa
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from librosa import ParameterError
9 | from torch.nn.parameter import Parameter
10 |
11 | eps = torch.finfo(torch.float32).eps
12 |
13 | class DFTBase(nn.Module):
14 | def __init__(self):
15 | """Base class for DFT and IDFT matrix"""
16 | super().__init__()
17 |
18 | def dft_matrix(self, n):
19 | (x, y) = np.meshgrid(np.arange(n), np.arange(n))
20 | omega = np.exp(-2 * np.pi * 1j / n)
21 | W = np.power(omega, x * y)
22 | return W
23 |
24 | def idft_matrix(self, n):
25 | (x, y) = np.meshgrid(np.arange(n), np.arange(n))
26 | omega = np.exp(2 * np.pi * 1j / n)
27 | W = np.power(omega, x * y)
28 | return W
29 |
30 |
31 | class DFT(DFTBase):
32 | def __init__(self, n, norm):
33 | """Calculate DFT, IDFT, RDFT, IRDFT.
34 | Args:
35 | n: fft window size
36 | norm: None | 'ortho'
37 | """
38 | super().__init__()
39 |
40 | self.W = self.dft_matrix(n)
41 | self.inv_W = self.idft_matrix(n)
42 |
43 | self.W_real = torch.Tensor(np.real(self.W))
44 | self.W_imag = torch.Tensor(np.imag(self.W))
45 | self.inv_W_real = torch.Tensor(np.real(self.inv_W))
46 | self.inv_W_imag = torch.Tensor(np.imag(self.inv_W))
47 |
48 | self.n = n
49 | self.norm = norm
50 |
51 | def dft(self, x_real, x_imag):
52 | """Calculate DFT of signal.
53 | Args:
54 | x_real: (n,), signal real part
55 | x_imag: (n,), signal imag part
56 | Returns:
57 | z_real: (n,), output real part
58 | z_imag: (n,), output imag part
59 | """
60 | z_real = torch.matmul(x_real, self.W_real) - torch.matmul(x_imag, self.W_imag)
61 | z_imag = torch.matmul(x_imag, self.W_real) + torch.matmul(x_real, self.W_imag)
62 |
63 | if self.norm is None:
64 | pass
65 | elif self.norm == 'ortho':
66 | z_real /= math.sqrt(self.n)
67 | z_imag /= math.sqrt(self.n)
68 |
69 | return z_real, z_imag
70 |
71 | def idft(self, x_real, x_imag):
72 | """Calculate IDFT of signal.
73 | Args:
74 | x_real: (n,), signal real part
75 | x_imag: (n,), signal imag part
76 | Returns:
77 | z_real: (n,), output real part
78 | z_imag: (n,), output imag part
79 | """
80 | z_real = torch.matmul(x_real, self.inv_W_real) - torch.matmul(x_imag, self.inv_W_imag)
81 | z_imag = torch.matmul(x_imag, self.inv_W_real) + torch.matmul(x_real, self.inv_W_imag)
82 |
83 | if self.norm is None:
84 | z_real /= self.n
85 | elif self.norm == 'ortho':
86 | z_real /= math.sqrt(self.n)
87 | z_imag /= math.sqrt(self.n)
88 |
89 | return z_real, z_imag
90 |
91 | def rdft(self, x_real):
92 | """Calculate right DFT of signal.
93 | Args:
94 | x_real: (n,), signal real part
95 | x_imag: (n,), signal imag part
96 | Returns:
97 | z_real: (n // 2 + 1,), output real part
98 | z_imag: (n // 2 + 1,), output imag part
99 | """
100 | n_rfft = self.n // 2 + 1
101 | z_real = torch.matmul(x_real, self.W_real[..., 0 : n_rfft])
102 | z_imag = torch.matmul(x_real, self.W_imag[..., 0 : n_rfft])
103 |
104 | if self.norm is None:
105 | pass
106 | elif self.norm == 'ortho':
107 | z_real /= math.sqrt(self.n)
108 | z_imag /= math.sqrt(self.n)
109 |
110 | return z_real, z_imag
111 |
112 | def irdft(self, x_real, x_imag):
113 | """Calculate inverse right DFT of signal.
114 | Args:
115 | x_real: (n // 2 + 1,), signal real part
116 | x_imag: (n // 2 + 1,), signal imag part
117 | Returns:
118 | z_real: (n,), reconstructed real-valued signal
119 | (only the real part is returned)
120 | """
121 | n_rfft = self.n // 2 + 1
122 |
123 | flip_x_real = torch.flip(x_real, dims=(-1,))
124 | x_real = torch.cat((x_real, flip_x_real[..., 1 : n_rfft - 1]), dim=-1)
125 |
126 | flip_x_imag = torch.flip(x_imag, dims=(-1,))
127 | x_imag = torch.cat((x_imag, -1. * flip_x_imag[..., 1 : n_rfft - 1]), dim=-1)
128 |
129 | z_real = torch.matmul(x_real, self.inv_W_real) - torch.matmul(x_imag, self.inv_W_imag)
130 |
131 | if self.norm is None:
132 | z_real /= self.n
133 | elif self.norm == 'ortho':
134 | z_real /= math.sqrt(self.n)
135 |
136 | return z_real
137 |
138 |
139 | class STFT(DFTBase):
140 | def __init__(self, n_fft=2048, hop_length=None, win_length=None,
141 | window='hann', center=True, pad_mode='reflect', freeze_parameters=True):
142 | """Implementation of STFT with Conv1d. The function has the same output
143 | as librosa.core.stft
144 | """
145 | super().__init__()
146 |
147 | assert pad_mode in ['constant', 'reflect']
148 |
149 | self.n_fft = n_fft
150 | self.center = center
151 | self.pad_mode = pad_mode
152 |
153 | # By default, use the entire frame
154 | if win_length is None:
155 | win_length = n_fft
156 |
157 | # Set the default hop, if it's not already specified
158 | if hop_length is None:
159 | hop_length = int(win_length // 4)
160 |
161 | fft_window = librosa.filters.get_window(window, win_length, fftbins=True)
162 |
163 | # Pad the window out to n_fft size
164 | fft_window = librosa.util.pad_center(fft_window, n_fft)
165 |
166 | # DFT & IDFT matrix
167 | self.W = self.dft_matrix(n_fft)
168 |
169 | out_channels = n_fft // 2 + 1
170 |
171 | self.conv_real = nn.Conv1d(in_channels=1, out_channels=out_channels,
172 | kernel_size=n_fft, stride=hop_length, padding=0, dilation=1,
173 | groups=1, bias=False)
174 |
175 | self.conv_imag = nn.Conv1d(in_channels=1, out_channels=out_channels,
176 | kernel_size=n_fft, stride=hop_length, padding=0, dilation=1,
177 | groups=1, bias=False)
178 |
179 | self.conv_real.weight.data = torch.Tensor(
180 | np.real(self.W[:, 0 : out_channels] * fft_window[:, None]).T)[:, None, :]
181 | # (n_fft // 2 + 1, 1, n_fft)
182 |
183 | self.conv_imag.weight.data = torch.Tensor(
184 | np.imag(self.W[:, 0 : out_channels] * fft_window[:, None]).T)[:, None, :]
185 | # (n_fft // 2 + 1, 1, n_fft)
186 |
187 | if freeze_parameters:
188 | for param in self.parameters():
189 | param.requires_grad = False
190 |
191 | def forward(self, input):
192 | """input: (batch_size, num_channels, data_length)
193 | Returns:
194 | real: (batch_size, num_channels, time_steps, n_fft // 2 + 1)
195 | imag: (batch_size, num_channels, time_steps, n_fft // 2 + 1)
196 | """
197 | _, num_channels, _ = input.shape
198 |
199 | real_out = []
200 | imag_out = []
201 | for n in range(num_channels):
202 | x = input[:, n, :][:, None, :]
203 | # (batch_size, 1, data_length)
204 |
205 | if self.center:
206 | x = F.pad(x, pad=(self.n_fft // 2, self.n_fft // 2), mode=self.pad_mode)
207 |
208 | real = self.conv_real(x)
209 | imag = self.conv_imag(x)
210 | # (batch_size, n_fft // 2 + 1, time_steps)
211 |
212 | real = real[:, None, :, :].transpose(2, 3)
213 | imag = imag[:, None, :, :].transpose(2, 3)
214 | # (batch_size, 1, time_steps, n_fft // 2 + 1)
215 |
216 | real_out.append(real)
217 | imag_out.append(imag)
218 |
219 | real_out = torch.cat(real_out, dim=1)
220 | imag_out = torch.cat(imag_out, dim=1)
221 |
222 | return real_out, imag_out
223 |
224 |
225 | def magphase(real, imag):
226 | mag = (real ** 2 + imag ** 2) ** 0.5
227 | cos = real / torch.clamp(mag, 1e-10, np.inf)
228 | sin = imag / torch.clamp(mag, 1e-10, np.inf)
229 | return mag, cos, sin
230 |
231 |
232 | class ISTFT(DFTBase):
233 | def __init__(self, n_fft=2048, hop_length=None, win_length=None,
234 | window='hann', center=True, pad_mode='reflect', freeze_parameters=True):
235 | """Implementation of ISTFT with Conv1d. The function has the same output
236 | as librosa.core.istft
237 | """
238 | super().__init__()
239 |
240 | assert pad_mode in ['constant', 'reflect']
241 |
242 | self.n_fft = n_fft
243 | self.hop_length = hop_length
244 | self.win_length = win_length
245 | self.window = window
246 | self.center = center
247 | self.pad_mode = pad_mode
248 |
249 | # By default, use the entire frame
250 | if win_length is None:
251 | win_length = n_fft
252 |
253 | # Set the default hop, if it's not already specified
254 | if hop_length is None:
255 | hop_length = int(win_length // 4)
256 |
257 | ifft_window = librosa.filters.get_window(window, win_length, fftbins=True)
258 |
259 | # Pad the window out to n_fft size
260 | ifft_window = librosa.util.pad_center(ifft_window, n_fft)
261 |
262 | # DFT & IDFT matrix
263 | self.W = self.idft_matrix(n_fft) / n_fft
264 |
265 | self.conv_real = nn.Conv1d(in_channels=n_fft, out_channels=n_fft,
266 | kernel_size=1, stride=1, padding=0, dilation=1,
267 | groups=1, bias=False)
268 |
269 | self.conv_imag = nn.Conv1d(in_channels=n_fft, out_channels=n_fft,
270 | kernel_size=1, stride=1, padding=0, dilation=1,
271 | groups=1, bias=False)
272 |
273 |
274 | self.conv_real.weight.data = torch.Tensor(
275 | np.real(self.W * ifft_window[None, :]).T)[:, :, None]
276 | # (n_fft // 2 + 1, 1, n_fft)
277 |
278 | self.conv_imag.weight.data = torch.Tensor(
279 | np.imag(self.W * ifft_window[None, :]).T)[:, :, None]
280 | # (n_fft // 2 + 1, 1, n_fft)
281 |
282 | if freeze_parameters:
283 | for param in self.parameters():
284 | param.requires_grad = False
285 |
286 | def forward(self, real_stft, imag_stft, length):
287 | """input: (batch_size, num_channels, time_steps, n_fft // 2 + 1)
288 | Returns:
289 | real: (batch_size, num_channels, data_length)
290 | """
291 | assert real_stft.ndimension() == 4 and imag_stft.ndimension() == 4
292 | device = next(self.parameters()).device
293 | batch_size, num_channels, _, _ = real_stft.shape
294 |
295 | wav_out = []
296 | for n in range(num_channels):
297 | real_ch = real_stft[:, n, :, :].transpose(1, 2)
298 | imag_ch = imag_stft[:, n, :, :].transpose(1, 2)
299 | # (batch_size, n_fft // 2 + 1, time_steps); sliced per channel so the input tensors are not overwritten
300 |
301 | # Full stft
302 | full_real_stft = torch.cat((real_ch, torch.flip(real_ch[:, 1 : -1, :], dims=[1])), dim=1)
303 | full_imag_stft = torch.cat((imag_ch, - torch.flip(imag_ch[:, 1 : -1, :], dims=[1])), dim=1)
304 |
305 | # Reserve space for reconstructed waveform
306 | if length:
307 | if self.center:
308 | padded_length = length + int(self.n_fft)
309 | else:
310 | padded_length = length
311 | n_frames = min(
312 | real_ch.shape[2], int(np.ceil(padded_length / self.hop_length)))
313 | else:
314 | n_frames = real_ch.shape[2]
315 |
316 | # Expected length of the overlap-added signal before trimming
317 | expected_signal_len = self.n_fft + self.hop_length * (n_frames - 1)
318 | y = torch.zeros(batch_size, expected_signal_len).to(device)
319 |
320 | # IDFT
321 | s_real = self.conv_real(full_real_stft) - self.conv_imag(full_imag_stft)
322 |
323 | # Overlap add
324 | for i in range(n_frames):
325 | y[:, i * self.hop_length : i * self.hop_length + self.n_fft] += s_real[:, :, i]
326 |
327 | ifft_window_sum = librosa.filters.window_sumsquare(self.window, n_frames,
328 | win_length=self.win_length, n_fft=self.n_fft, hop_length=self.hop_length)
329 |
330 | approx_nonzero_indices = np.where(ifft_window_sum > librosa.util.tiny(ifft_window_sum))[0]
331 | approx_nonzero_indices = torch.LongTensor(approx_nonzero_indices).to(device)
332 | ifft_window_sum = torch.Tensor(ifft_window_sum).to(device)
333 |
334 | y[:, approx_nonzero_indices] /= ifft_window_sum[approx_nonzero_indices][None, :]
335 |
336 | # Trim or pad to length
337 | if length is None:
338 | if self.center:
339 | y = y[:, self.n_fft // 2 : -self.n_fft // 2]
340 | else:
341 | if self.center:
342 | start = self.n_fft // 2
343 | else:
344 | start = 0
345 |
346 | y = y[:, start : start + length]
347 | (batch_size, len_y) = y.shape
348 | if y.shape[-1] < length:
349 | y = torch.cat((y, torch.zeros(batch_size, length - len_y).to(device)), dim=-1)
350 |
351 | wav_out.append(y)
352 |
353 | wav_out = torch.stack(wav_out, dim=1)  # (batch_size, num_channels, data_length)
354 |
355 | return wav_out
356 |
357 |
358 | def spectrogram_STFTInput(input, power=2.0):
359 | """
360 | Input:
361 | real: (batch_size, num_channels, time_steps, n_fft // 2 + 1)
362 | imag: (batch_size, num_channels, time_steps, n_fft // 2 + 1)
363 | Returns:
364 | spectrogram: (batch_size, num_channels, time_steps, n_fft // 2 + 1)
365 | """
366 |
367 | (real, imag) = input
368 | # (batch_size, num_channels, n_fft // 2 + 1, time_steps)
369 |
370 | spectrogram = real ** 2 + imag ** 2
371 |
372 | if power == 2.0:
373 | pass
374 | else:
375 | spectrogram = spectrogram ** (power / 2.0)
376 |
377 | return spectrogram
378 |
379 |
380 | class Spectrogram(nn.Module):
381 | def __init__(self, n_fft=2048, hop_length=None, win_length=None,
382 | window='hann', center=True, pad_mode='reflect', power=2.0,
383 | freeze_parameters=True):
384 | """Calculate spectrogram using pytorch. The STFT is implemented with
385 |         Conv1d. The function has the same output as librosa.core.stft
386 | """
387 | super().__init__()
388 |
389 | self.power = power
390 |
391 | self.stft = STFT(n_fft=n_fft, hop_length=hop_length,
392 | win_length=win_length, window=window, center=center,
393 |             pad_mode=pad_mode, freeze_parameters=freeze_parameters)
394 |
395 | def forward(self, input):
396 | """input: (batch_size, num_channels, data_length)
397 | Returns:
398 | spectrogram: (batch_size, num_channels, time_steps, n_fft // 2 + 1)
399 | """
400 |
401 | (real, imag) = self.stft.forward(input)
402 |         # (batch_size, num_channels, time_steps, n_fft // 2 + 1)
403 |
404 | spectrogram = real ** 2 + imag ** 2
405 |
406 | if self.power == 2.0:
407 | pass
408 | else:
409 | spectrogram = spectrogram ** (self.power / 2.0)
410 |
411 | return spectrogram
412 |
413 |
414 | class LogmelFilterBank(nn.Module):
415 | def __init__(self, sr=32000, n_fft=2048, n_mels=64, fmin=50, fmax=14000, is_log=True,
416 | ref=1.0, amin=1e-10, top_db=80.0, freeze_parameters=True):
417 | """Calculate logmel spectrogram using pytorch. The mel filter bank is
418 |         the pytorch implementation of librosa.filters.mel
419 | """
420 | super().__init__()
421 |
422 | self.is_log = is_log
423 | self.ref = ref
424 | self.amin = amin
425 | self.top_db = top_db
426 |
427 | self.melW = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
428 | fmin=fmin, fmax=fmax).T
429 | # (n_fft // 2 + 1, mel_bins)
430 |
431 | self.melW = nn.Parameter(torch.Tensor(self.melW))
432 |
433 | if freeze_parameters:
434 | for param in self.parameters():
435 | param.requires_grad = False
436 |
437 | def forward(self, input):
438 | """input: (batch_size, num_channels, time_steps, freq_bins)
439 |
440 | Output: (batch_size, num_channels, time_steps, mel_bins)
441 | """
442 | # Mel spectrogram
443 | mel_spectrogram = torch.matmul(input, self.melW)
444 |
445 | # Logmel spectrogram
446 | if self.is_log:
447 | output = self.power_to_db(mel_spectrogram)
448 | else:
449 | output = mel_spectrogram
450 |
451 | return output
452 |
453 |
454 | def power_to_db(self, input):
455 | """Power to db, this function is the pytorch implementation of
456 |         librosa.core.power_to_db
457 | """
458 | ref_value = self.ref
459 | log_spec = 10.0 * torch.log10(torch.clamp(input, min=self.amin, max=np.inf))
460 | log_spec -= 10.0 * np.log10(np.maximum(self.amin, ref_value))
461 |
462 | if self.top_db is not None:
463 | if self.top_db < 0:
464 | raise ParameterError('top_db must be non-negative')
465 | log_spec = torch.clamp(log_spec, min=log_spec.max().item() - self.top_db, max=np.inf)
466 |
467 | return log_spec
468 |
469 |
470 | def intensityvector(input, melW):
471 | """Calculate intensity vector. Input is four channel stft of the signals.
472 | input: (stft_real, stft_imag)
473 | stft_real: (batch_size, 4, time_steps, freq_bins)
474 | stft_imag: (batch_size, 4, time_steps, freq_bins)
475 | out:
476 | intenVec: (batch_size, 3, time_steps, freq_bins)
477 | """
478 | sig_real, sig_imag = input[0], input[1]
479 | Pref_real, Pref_imag = sig_real[:,0,...], sig_imag[:,0,...]
480 | Px_real, Px_imag = sig_real[:,1,...], sig_imag[:,1,...]
481 | Py_real, Py_imag = sig_real[:,2,...], sig_imag[:,2,...]
482 | Pz_real, Pz_imag = sig_real[:,3,...], sig_imag[:,3,...]
483 |
484 |     IVx = Pref_real * Px_real + Pref_imag * Px_imag    # Re{conj(Pref) * Px}
485 |     IVy = Pref_real * Py_real + Pref_imag * Py_imag    # Re{conj(Pref) * Py}
486 |     IVz = Pref_real * Pz_real + Pref_imag * Pz_imag    # Re{conj(Pref) * Pz}
487 | normal = torch.sqrt(IVx**2 + IVy**2 + IVz**2) + eps
488 |
489 | IVx_mel = torch.matmul(IVx / normal, melW)
490 | IVy_mel = torch.matmul(IVy / normal, melW)
491 | IVz_mel = torch.matmul(IVz / normal, melW)
492 | intenVec = torch.stack([IVx_mel, IVy_mel, IVz_mel], dim=1)
493 |
494 | return intenVec
495 |
496 |
497 | class Enframe(nn.Module):
498 | def __init__(self, frame_length=2048, hop_length=512):
499 | """Enframe a time sequence. This function is the pytorch implementation
500 | of librosa.util.frame
501 | """
502 | super().__init__()
503 |
504 | '''
505 | self.enframe_conv = nn.Conv1d(in_channels=1, out_channels=frame_length,
506 | kernel_size=frame_length, stride=hop_length,
507 | padding=frame_length // 2, bias=False)
508 | '''
509 | self.enframe_conv = nn.Conv1d(in_channels=1, out_channels=frame_length,
510 | kernel_size=frame_length, stride=hop_length,
511 | padding=0, bias=False)
512 |
513 | self.enframe_conv.weight.data = torch.Tensor(torch.eye(frame_length)[:, None, :])
514 | self.enframe_conv.weight.requires_grad = False
515 |
516 | def forward(self, input):
517 | """input: (batch_size, num_channels, samples)
518 |
519 | Output: (batch_size, num_channels, window_length, frames_num)
520 | """
521 | _, num_channels, _ = input.shape
522 |
523 | output = []
524 | for n in range(num_channels):
525 | output.append(self.enframe_conv(input[:, n, :][:, None, :]))
526 |
527 | output = torch.cat(output, dim=1)
528 | return output
529 |
530 |
531 | class Scalar(nn.Module):
532 | def __init__(self, scalar, freeze_parameters):
533 | super().__init__()
534 |
535 | self.scalar_mean = Parameter(torch.Tensor(scalar['mean']))
536 | self.scalar_std = Parameter(torch.Tensor(scalar['std']))
537 |
538 | if freeze_parameters:
539 | for param in self.parameters():
540 | param.requires_grad = False
541 |
542 | def forward(self, input):
543 | return (input - self.scalar_mean) / self.scalar_std
544 |
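# A minimal usage sketch for Scalar (not part of the original file), assuming
# per-bin 'mean' and 'std' statistics have been computed offline during
# preprocessing; the shapes below are illustrative:
#
#     scalar = {'mean': np.zeros((1, 1, 1, 64)), 'std': np.ones((1, 1, 1, 64))}
#     normalizer = Scalar(scalar, freeze_parameters=True)
#     normalized = normalizer(torch.randn(8, 7, 100, 64))   # standardized features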
545 |
546 | def debug(select, device):
547 | """Compare numpy + librosa and pytorch implementation result. For debug.
548 | Args:
549 |       select: 'dft' | 'stft' | 'logmel' | 'enframe' | 'logmel&iv'
550 | device: 'cpu' | 'cuda'
551 | """
552 |
553 | if select == 'dft':
554 | n = 10
555 | norm = None # None | 'ortho'
556 | np.random.seed(0)
557 |
558 | # Data
559 | np_data = np.random.uniform(-1, 1, n)
560 | pt_data = torch.Tensor(np_data)
561 |
562 | # Numpy FFT
563 | np_fft = np.fft.fft(np_data, norm=norm)
564 | np_ifft = np.fft.ifft(np_fft, norm=norm)
565 | np_rfft = np.fft.rfft(np_data, norm=norm)
566 |         np_irfft = np.fft.irfft(np_rfft, norm=norm)
567 |
568 | # Pytorch FFT
569 | obj = DFT(n, norm)
570 | pt_dft = obj.dft(pt_data, torch.zeros_like(pt_data))
571 | pt_idft = obj.idft(pt_dft[0], pt_dft[1])
572 | pt_rdft = obj.rdft(pt_data)
573 | pt_irdft = obj.irdft(pt_rdft[0], pt_rdft[1])
574 |
575 |         print('Comparing numpy and pytorch implementations of DFT. All numbers '
576 | 'below should be close to 0.')
577 | print(np.mean((np.abs(np.real(np_fft) - pt_dft[0].cpu().numpy()))))
578 | print(np.mean((np.abs(np.imag(np_fft) - pt_dft[1].cpu().numpy()))))
579 |
580 | print(np.mean((np.abs(np.real(np_ifft) - pt_idft[0].cpu().numpy()))))
581 | print(np.mean((np.abs(np.imag(np_ifft) - pt_idft[1].cpu().numpy()))))
582 |
583 | print(np.mean((np.abs(np.real(np_rfft) - pt_rdft[0].cpu().numpy()))))
584 | print(np.mean((np.abs(np.imag(np_rfft) - pt_rdft[1].cpu().numpy()))))
585 |
586 | print(np.mean(np.abs(np_data - pt_irdft.cpu().numpy())))
587 |
588 | elif select == 'stft':
589 | data_length = 32000
590 | device = torch.device(device)
591 | np.random.seed(0)
592 |
593 | sample_rate = 16000
594 | n_fft = 1024
595 | hop_length = 250
596 | win_length = 1024
597 | window = 'hann'
598 | center = True
599 | dtype = np.complex64
600 | pad_mode = 'reflect'
601 |
602 | # Data
603 | np_data = np.random.uniform(-1, 1, data_length)
604 | pt_data = torch.Tensor(np_data).to(device)
605 |
606 | # Numpy stft matrix
607 | np_stft_matrix = librosa.core.stft(y=np_data, n_fft=n_fft,
608 | hop_length=hop_length, window=window, center=center).T
609 |
610 | # Pytorch stft matrix
611 | pt_stft_extractor = STFT(n_fft=n_fft, hop_length=hop_length,
612 | win_length=win_length, window=window, center=center, pad_mode=pad_mode,
613 | freeze_parameters=True)
614 |
615 | pt_stft_extractor.to(device)
616 |
617 | (pt_stft_real, pt_stft_imag) = pt_stft_extractor.forward(pt_data[None, None, :])
618 |
619 | print('Comparing librosa and pytorch implementation of stft. All numbers '
620 | 'below should be close to 0.')
621 |
622 | print(np.mean(np.abs(np.real(np_stft_matrix) - pt_stft_real.data.cpu().numpy()[0, 0])))
623 | print(np.mean(np.abs(np.imag(np_stft_matrix) - pt_stft_imag.data.cpu().numpy()[0, 0])))
624 |
625 | # Numpy istft
626 | np_istft_s = librosa.core.istft(stft_matrix=np_stft_matrix.T,
627 | hop_length=hop_length, window=window, center=center, length=data_length)
628 |
629 | # Pytorch istft
630 | pt_istft_extractor = ISTFT(n_fft=n_fft, hop_length=hop_length,
631 | win_length=win_length, window=window, center=center, pad_mode=pad_mode,
632 | freeze_parameters=True)
633 | pt_istft_extractor.to(device)
634 |
635 | # Recover from real and imag part
636 | pt_istft_s = pt_istft_extractor.forward(pt_stft_real, pt_stft_imag, data_length)[0, :]
637 |
638 | # Recover from magnitude and phase
639 | (pt_stft_mag, cos, sin) = magphase(pt_stft_real, pt_stft_imag)
640 | pt_istft_s2 = pt_istft_extractor.forward(pt_stft_mag * cos, pt_stft_mag * sin, data_length)[0, :]
641 |
642 | print(np.mean(np.abs(np_istft_s - pt_istft_s.data.cpu().numpy())))
643 | print(np.mean(np.abs(np_data - pt_istft_s.data.cpu().numpy())))
644 | print(np.mean(np.abs(np_data - pt_istft_s2.data.cpu().numpy())))
645 |
646 | elif select == 'logmel':
647 |
648 | data_length = 4*32000
649 | norm = None # None | 'ortho'
650 | device = torch.device(device)
651 | np.random.seed(0)
652 |
653 | # Spectrogram parameters
654 | sample_rate = 32000
655 | n_fft = 1024
656 | hop_length = 320
657 | win_length = 1024
658 | window = 'hann'
659 | center = True
660 | dtype = np.complex64
661 | pad_mode = 'reflect'
662 |
663 | # Mel parameters
664 | n_mels = 128
665 | fmin = 50
666 | fmax = 14000
667 | ref = 1.0
668 | amin = 1e-10
669 | top_db = None
670 |
671 | # Data
672 | np_data = np.random.uniform(-1, 1, data_length)
673 | pt_data = torch.Tensor(np_data).to(device)
674 |
675 | print('Comparing librosa and pytorch implementation of logmel '
676 | 'spectrogram. All numbers below should be close to 0.')
677 |
678 | # Numpy librosa
679 | np_stft_matrix = librosa.core.stft(y=np_data, n_fft=n_fft, hop_length=hop_length,
680 | win_length=win_length, window=window, center=center, dtype=dtype,
681 | pad_mode=pad_mode)
682 |
683 | np_pad = np.pad(np_data, int(n_fft // 2), mode=pad_mode)
684 |
685 | np_melW = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels,
686 | fmin=fmin, fmax=fmax).T
687 |
688 | np_mel_spectrogram = np.dot(np.abs(np_stft_matrix.T) ** 2, np_melW)
689 |
690 | np_logmel_spectrogram = librosa.core.power_to_db(
691 | np_mel_spectrogram, ref=ref, amin=amin, top_db=top_db)
692 |
693 | # Pytorch
694 | stft_extractor = STFT(n_fft=n_fft, hop_length=hop_length,
695 | win_length=win_length, window=window, center=center, pad_mode=pad_mode,
696 | freeze_parameters=True)
697 |
698 | logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=n_fft,
699 | n_mels=n_mels, fmin=fmin, fmax=fmax, ref=ref, amin=amin,
700 | top_db=top_db, freeze_parameters=True)
701 |
702 | stft_extractor.to(device)
703 | logmel_extractor.to(device)
704 |
705 | pt_pad = F.pad(pt_data[None, None, :], pad=(n_fft // 2, n_fft // 2), mode=pad_mode)[0, 0]
706 | print(np.mean(np.abs(np_pad - pt_pad.cpu().numpy())))
707 |
708 | pt_stft_matrix_real = stft_extractor.conv_real(pt_pad[None, None, :])[0]
709 | pt_stft_matrix_imag = stft_extractor.conv_imag(pt_pad[None, None, :])[0]
710 | print(np.mean(np.abs(np.real(np_stft_matrix) - pt_stft_matrix_real.data.cpu().numpy())))
711 | print(np.mean(np.abs(np.imag(np_stft_matrix) - pt_stft_matrix_imag.data.cpu().numpy())))
712 |
713 | # Spectrogram
714 | spectrogram_extractor = Spectrogram(n_fft=n_fft, hop_length=hop_length,
715 | win_length=win_length, window=window, center=center, pad_mode=pad_mode,
716 | freeze_parameters=True)
717 |
718 | spectrogram_extractor.to(device)
719 |
720 | pt_spectrogram = spectrogram_extractor.forward(pt_data[None, None, :])
721 | pt_mel_spectrogram = torch.matmul(pt_spectrogram, logmel_extractor.melW)
722 | print(np.mean(np.abs(np_mel_spectrogram - pt_mel_spectrogram.data.cpu().numpy()[0, 0])))
723 |
724 | # Log mel spectrogram
725 | pt_logmel_spectrogram = logmel_extractor.forward(pt_spectrogram)
726 | print(np.mean(np.abs(np_logmel_spectrogram - pt_logmel_spectrogram[0, 0].data.cpu().numpy())))
727 |
728 | elif select == 'enframe':
729 | data_length = 32000
730 | device = torch.device(device)
731 | np.random.seed(0)
732 |
733 | # Spectrogram parameters
734 | hop_length = 250
735 | win_length = 1024
736 |
737 | # Data
738 | np_data = np.random.uniform(-1, 1, data_length)
739 | pt_data = torch.Tensor(np_data).to(device)
740 |
741 | print('Comparing librosa and pytorch implementation of '
742 | 'librosa.util.frame. All numbers below should be close to 0.')
743 |
744 | # Numpy librosa
745 | np_frames = librosa.util.frame(np_data, frame_length=win_length,
746 | hop_length=hop_length)
747 |
748 | # Pytorch
749 | pt_frame_extractor = Enframe(frame_length=win_length, hop_length=hop_length)
750 | pt_frame_extractor.to(device)
751 |
752 | pt_frames = pt_frame_extractor(pt_data[None, None, :])
753 | print(np.mean(np.abs(np_frames - pt_frames.data.cpu().numpy())))
754 |
755 | elif select == 'logmel&iv':
756 | data_size = (1, 4, 24000*3)
757 | device = torch.device(device)
758 | np.random.seed(0)
759 |
760 | # Stft parameters
761 | sample_rate = 24000
762 | n_fft = 1024
763 | hop_length = 240
764 | win_length = 1024
765 | window = 'hann'
766 | center = True
767 | dtype = np.complex64
768 | pad_mode = 'reflect'
769 |
770 | # Mel parameters
771 | n_mels = 128
772 | fmin = 50
773 | fmax = 10000
774 | ref = 1.0
775 | amin = 1e-10
776 | top_db = None
777 |
778 | # Data
779 | np_data = np.random.uniform(-1, 1, data_size)
780 | pt_data = torch.Tensor(np_data).to(device)
781 |
782 | # Numpy stft matrix
783 | np_stft_matrix = []
784 | for chn in range(np_data.shape[1]):
785 | np_stft_matrix.append(librosa.core.stft(y=np_data[0,chn,:], n_fft=n_fft,
786 | hop_length=hop_length, window=window, center=center).T)
787 | np_stft_matrix = np.array(np_stft_matrix)[None,...]
788 |
789 | # Pytorch stft matrix
790 | pt_stft_extractor = STFT(n_fft=n_fft, hop_length=hop_length,
791 | win_length=win_length, window=window, center=center, pad_mode=pad_mode,
792 | freeze_parameters=True)
793 | pt_stft_extractor.to(device)
794 | (pt_stft_real, pt_stft_imag) = pt_stft_extractor(pt_data)
795 | print('Comparing librosa and pytorch implementation of intensity vector. All numbers '
796 | 'below should be close to 0.')
797 |
798 | print(np.mean(np.abs(np.real(np_stft_matrix) - pt_stft_real.cpu().detach().numpy())))
799 | print(np.mean(np.abs(np.imag(np_stft_matrix) - pt_stft_imag.cpu().detach().numpy())))
800 |
801 | # Numpy logmel
802 | np_pad = np.pad(np_data, ((0,0), (0,0), (int(n_fft // 2),int(n_fft // 2))), mode=pad_mode)
803 | np_melW = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels,
804 | fmin=fmin, fmax=fmax).T
805 | np_mel_spectrogram = np.dot(np.abs(np_stft_matrix) ** 2, np_melW)
806 | np_logmel_spectrogram = librosa.core.power_to_db(
807 | np_mel_spectrogram, ref=ref, amin=amin, top_db=top_db)
808 |
809 | # Pytorch logmel
810 | pt_logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=n_fft,
811 | n_mels=n_mels, fmin=fmin, fmax=fmax, ref=ref, amin=amin,
812 | top_db=top_db, freeze_parameters=True)
813 | pt_logmel_extractor.to(device)
814 | pt_pad = F.pad(pt_data, pad=(n_fft // 2, n_fft // 2), mode=pad_mode)
815 | print(np.mean(np.abs(np_pad - pt_pad.cpu().numpy())))
816 | pt_spectrogram = spectrogram_STFTInput((pt_stft_real, pt_stft_imag))
817 | pt_mel_spectrogram = torch.matmul(pt_spectrogram, pt_logmel_extractor.melW)
818 | print(np.mean(np.abs(np_mel_spectrogram - pt_mel_spectrogram.cpu().detach().numpy())))
819 | pt_logmel_spectrogram = pt_logmel_extractor(pt_spectrogram)
820 | print(np.mean(np.abs(np_logmel_spectrogram - pt_logmel_spectrogram.cpu().detach().numpy())))
821 |
822 | # Numpy intensity
823 | Pref = np_stft_matrix[:,0,...]
824 | Px = np_stft_matrix[:,1,...]
825 | Py = np_stft_matrix[:,2,...]
826 | Pz = np_stft_matrix[:,3,...]
827 | IVx = np.real(np.conj(Pref) * Px)
828 | IVy = np.real(np.conj(Pref) * Py)
829 | IVz = np.real(np.conj(Pref) * Pz)
830 | normal = np.sqrt(IVx**2 + IVy**2 + IVz**2) + np.finfo(np.float32).eps
831 | IVx_mel = np.dot(IVx / normal, np_melW)
832 | IVy_mel = np.dot(IVy / normal, np_melW)
833 | IVz_mel = np.dot(IVz / normal, np_melW)
834 | np_IV = np.stack([IVx_mel, IVy_mel, IVz_mel], axis=1)
835 |
836 | # Pytorch intensity
837 | pt_IV = intensityvector((pt_stft_real, pt_stft_imag), pt_logmel_extractor.melW)
838 | print(np.mean(np.abs(np_IV - pt_IV.cpu().detach().numpy())))
839 |
840 |
841 | if __name__ == '__main__':
842 |
843 | data_length = 12800
844 | norm = None # None | 'ortho'
845 | device = 'cuda' # 'cuda' | 'cpu'
846 | np.random.seed(0)
847 |
848 | # Spectrogram parameters
849 | sample_rate = 32000
850 | n_fft = 1024
851 | hop_length = 320
852 | win_length = 1024
853 | window = 'hann'
854 | center = True
855 | dtype = np.complex64
856 | pad_mode = 'reflect'
857 |
858 | # Mel parameters
859 | n_mels = 128
860 | fmin = 50
861 | fmax = 14000
862 | ref = 1.0
863 | amin = 1e-10
864 | top_db = None
865 |
866 | # Data
867 | np_data = np.random.uniform(-1, 1, data_length)
868 | pt_data = torch.Tensor(np_data).to(device)
869 |
870 | # Pytorch
871 | spectrogram_extractor = Spectrogram(n_fft=n_fft, hop_length=hop_length,
872 | win_length=win_length, window=window, center=center, pad_mode=pad_mode,
873 | freeze_parameters=True)
874 |
875 | logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=n_fft,
876 | n_mels=n_mels, fmin=fmin, fmax=fmax, ref=ref, amin=amin, top_db=top_db,
877 | freeze_parameters=True)
878 |
879 | spectrogram_extractor.to(device)
880 | logmel_extractor.to(device)
881 |
882 | # Spectrogram
883 | pt_spectrogram = spectrogram_extractor.forward(pt_data[None, None, :])
884 |
885 | # Log mel spectrogram
886 | pt_logmel_spectrogram = logmel_extractor.forward(pt_spectrogram)
887 |
888 |     # Compare numpy/librosa and pytorch implementations (set to False to skip)
889 | if True:
890 | debug(select='dft', device=device)
891 | debug(select='stft', device=device)
892 | debug(select='logmel', device=device)
893 | debug(select='enframe', device=device)
894 | debug(select='logmel&iv', device=device)
895 |
--------------------------------------------------------------------------------
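A minimal usage sketch (not from the repository) of the feature classes defined in the file above, assuming the module is importable as methods.utils.stft with the seld/ directory on the import path, and that the input is a 4-channel FOA batch; the parameter values mirror the 'logmel&iv' debug branch:

    import torch
    from methods.utils.stft import STFT, LogmelFilterBank, spectrogram_STFTInput, intensityvector

    batch = torch.randn(2, 4, 24000 * 3)   # (batch_size, 4 FOA channels, samples)
    stft_extractor = STFT(n_fft=1024, hop_length=240, win_length=1024, window='hann',
        center=True, pad_mode='reflect', freeze_parameters=True)
    logmel_extractor = LogmelFilterBank(sr=24000, n_fft=1024, n_mels=128, fmin=50,
        fmax=10000, freeze_parameters=True)

    real, imag = stft_extractor(batch)                               # (2, 4, T, 513)
    logmel = logmel_extractor(spectrogram_STFTInput((real, imag)))   # (2, 4, T, 128)
    iv = intensityvector((real, imag), logmel_extractor.melW)        # (2, 3, T, 128)
    features = torch.cat((logmel, iv), dim=1)                        # (2, 7, T, 128)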
/seld/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/seld/utils/__init__.py
--------------------------------------------------------------------------------
/seld/utils/cli_parser.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import sys
3 | from pathlib import Path
4 |
5 | from ruamel.yaml import YAML
6 | from termcolor import cprint
7 |
8 |
9 | def parse_cli_overides():
10 | """Parse the command-line arguments.
11 |
12 | Parse args from CLI and override config dictionary entries
13 |
14 | This function implements the command-line interface of the program.
15 | The interface accepts general command-line arguments as well as
16 | arguments that are specific to a sub-command. The sub-commands are
17 |     *preprocess*, *train*, *infer*, and *evaluate*. Specifying a
18 | sub-command is required, as it specifies the task that the program
19 | should carry out.
20 |
21 | Returns:
22 |         (args, cfg): The parsed arguments and the configuration dictionary loaded from the YAML config file.
23 | """
24 | # Parse the command-line arguments, but separate the `--config_file`
25 | # option from the other arguments. This way, options can be parsed
26 |     # from the config file(s) first and then overridden by the other
27 | # command-line arguments later.
28 | parser = argparse.ArgumentParser(
29 | description='Event Independent Network for Sound Event Localization and Detection.',
30 | add_help=False
31 | )
32 | parser.add_argument('-c', '--config_file', help='Specify config file', metavar='FILE')
33 | subparsers = parser.add_subparsers(dest='mode')
34 | parser_preproc = subparsers.add_parser('preprocess')
35 | parser_train = subparsers.add_parser('train')
36 | parser_infer = subparsers.add_parser('infer')
37 | subparsers.add_parser('evaluate')
38 |
39 | # Require the user to specify a sub-command
40 | subparsers.required = True
41 | parser_preproc.add_argument('--preproc_mode', choices=['extract_data', 'extract_scalar', 'extract_meta'],
42 | required=True, help='select preprocessing mode')
43 | parser_preproc.add_argument('--dataset_type', default='dev', choices=['dev', 'eval'],
44 | help='select dataset to preprocess')
45 | parser_preproc.add_argument('--num_workers', type=int, default=8, metavar='N')
46 | parser_preproc.add_argument('--no_cuda', action='store_true', help='Do not use cuda.')
47 | parser_train.add_argument('--seed', type=int, default=2020, metavar='N')
48 | parser_train.add_argument('--num_workers', type=int, default=8, metavar='N')
49 | parser_train.add_argument('--read_into_mem', action='store_true', help='Read dataloader into memory')
50 | parser_train.add_argument('--no_cuda', action='store_true', help='Do not use cuda.')
51 | parser_infer.add_argument('--num_workers', type=int, default=8, metavar='N')
52 | parser_infer.add_argument('--read_into_mem', action='store_true')
53 | parser_infer.add_argument('--no_cuda', action='store_true', help='Do not use cuda.')
54 |
55 | args = parser.parse_args()
56 | args_dict = vars(args)
57 | cprint("Args:", "green")
58 | for key, value in args_dict.items():
59 | print(f" {key:25s} -> {value}")
60 |
61 | yaml = YAML()
62 | yaml.indent(mapping=4, sequence=6, offset=3)
63 | yaml.default_flow_style = False
64 | with open(args.config_file, 'r') as f:
65 | cfg = yaml.load(f)
66 | cprint("Cfg:", "red")
67 | yaml.dump(cfg, sys.stdout, transform=replace_indent)
68 |
69 | return args, cfg
70 |
71 | def replace_indent(stream):
72 | stream = " " + stream
73 | return stream.replace("\n", "\n ")
74 |
--------------------------------------------------------------------------------
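A hedged sketch of how parse_cli_overides is typically driven. The flags come from the parser above and the config path from configs/ein_seld/seld.yaml; seld/main.py is assumed to be the entry point that calls this function, and the seld/ directory is assumed to be on the import path:

    import sys
    from utils.cli_parser import parse_cli_overides

    # Equivalent to: python seld/main.py -c configs/ein_seld/seld.yaml train --seed 2020 --num_workers 8
    sys.argv = ['main.py', '-c', 'configs/ein_seld/seld.yaml',
        'train', '--seed', '2020', '--num_workers', '8']
    args, cfg = parse_cli_overides()
    print(args.mode, args.seed, args.num_workers)   # train 2020 8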
/seld/utils/common.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import math
3 | from datetime import datetime
4 | from pathlib import Path
5 |
6 | import numpy as np
7 | import torch
8 | import yaml
9 | from tqdm import tqdm
10 |
11 |
12 | def float_samples_to_int16(y):
13 | """Convert floating-point numpy array of audio samples to int16."""
14 | if not issubclass(y.dtype.type, np.floating):
15 | raise ValueError('input samples not floating-point')
16 | return (y * np.iinfo(np.int16).max).astype(np.int16)
17 |
18 |
19 | def int16_samples_to_float32(y):
20 | """Convert int16 numpy array of audio samples to float32."""
21 | if y.dtype != np.int16:
22 | raise ValueError('input samples not int16')
23 | return y.astype(np.float32) / np.iinfo(np.int16).max
24 |
25 |
26 | class TqdmLoggingHandler(logging.Handler):
27 | def __init__(self, level=logging.NOTSET):
28 | super().__init__(level)
29 |
30 | def emit(self, record):
31 | try:
32 | msg = self.format(record)
33 | tqdm.write(msg)
34 | self.flush()
35 |         except Exception:
36 | self.handleError(record)
37 |
38 |
39 | def create_logging(logs_dir, filemode):
40 |     """Create a logging object.
41 |
42 | Args:
43 | logs_dir (Path obj): logs directory
44 |         filemode: open file mode
45 | """
46 | logs_dir.mkdir(parents=True, exist_ok=True)
47 |
48 | i1 = 0
49 |
50 | while logs_dir.joinpath('{:04d}.log'.format(i1)).is_file():
51 | i1 += 1
52 |
53 | logs_path = logs_dir.joinpath('{:04d}.log'.format(i1))
54 | logging.basicConfig(
55 | level=logging.DEBUG,
56 | format='%(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
57 | # format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
58 | datefmt='%a, %d %b %Y %H:%M:%S',
59 | filename=logs_path,
60 | filemode=filemode)
61 |
62 | # Print to console
63 | console = logging.StreamHandler()
64 | console.setLevel(logging.INFO)
65 | formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
66 | console.setFormatter(formatter)
67 | # logging.getLogger('').addHandler(console)
68 | logging.getLogger('').addHandler(TqdmLoggingHandler())
69 |
70 | dt_string = datetime.now().strftime('%a, %d %b %Y %H:%M:%S')
71 | logging.info(dt_string)
72 | logging.info('')
73 |
74 | return logging
75 |
76 |
77 | def convert_ordinal(n):
78 |     """Convert a number to an ordinal number
79 |
80 | """
81 | ordinal = lambda n: "%d%s" % (n,"tsnrhtdd"[(math.floor(n/10)%10!=1)*(n%10<4)*n%10::4])
82 | return ordinal(n)
83 |
84 |
85 | def move_model_to_gpu(model, cuda):
86 | """Move model to GPU
87 |
88 | """
89 | # TODO: change DataParallel to DistributedDataParallel
90 | model = torch.nn.DataParallel(model)
91 | if cuda:
92 | logging.info('Utilize GPUs for computation')
93 | logging.info('Number of GPU available: {}\n'.format(torch.cuda.device_count()))
94 | model.cuda()
95 | else:
96 | logging.info('Utilize CPU for computation')
97 | return model
98 |
99 |
100 | def count_parameters(model):
101 | """Count model parameters
102 |
103 | """
104 | params_num = sum(p.numel() for p in model.parameters() if p.requires_grad)
105 | logging.info('Total number of parameters: {}\n'.format(params_num))
106 |
107 |
108 | def print_metrics(logging, writer, values_dict, it, set_type='train'):
109 |     """Print losses and metrics, and write them to tensorboard
110 |
111 | Args:
112 | logging: logging
113 | writer: tensorboard writer
114 | values_dict: losses or metrics
115 | it: iter
116 | set_type: 'train' | 'valid' | 'test'
117 | """
118 | out_str = ''
119 | if set_type == 'train':
120 | out_str += 'Train: '
121 | elif set_type == 'valid':
122 | out_str += 'valid: '
123 |
124 | for key, value in values_dict.items():
125 | out_str += '{}: {:.3f}, '.format(key, value)
126 | writer.add_scalar('{}/{}'.format(set_type, key), value, it)
127 | logging.info(out_str)
128 |
129 |
130 |
--------------------------------------------------------------------------------
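A small sketch of the audio sample conversion helpers and the ordinal helper above, assuming the module is importable as utils.common:

    import numpy as np
    from utils.common import float_samples_to_int16, int16_samples_to_float32, convert_ordinal

    x = np.random.uniform(-1, 1, 24000).astype(np.float32)
    x_int16 = float_samples_to_int16(x)              # scale to the int16 range
    x_round_trip = int16_samples_to_float32(x_int16)
    print(np.abs(x - x_round_trip).max())            # bounded by the int16 quantization step
    print(convert_ordinal(1), convert_ordinal(2), convert_ordinal(13))   # 1st 2nd 13th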
/seld/utils/config.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from pathlib import Path
3 |
4 | import methods.feature as feature
5 | import torch.optim as optim
6 | from methods import ein_seld
7 | from ruamel.yaml import YAML
8 | from termcolor import cprint
9 | from torch.utils.data import ConcatDataset, DataLoader
10 |
11 | from utils.common import convert_ordinal, count_parameters, move_model_to_gpu
12 | from utils.datasets import Dcase2020task3
13 |
14 | method_dict = {
15 | 'ein_seld': ein_seld,
16 | }
17 |
18 | datasets_dict = {
19 | 'dcase2020task3': Dcase2020task3,
20 | }
21 |
22 |
23 | def store_config(output_path, config):
24 | """ Write the given config parameter values to a YAML file.
25 |
26 | Args:
27 | output_path (str): Output file path.
28 | config: Parameter values to log.
29 | """
30 | yaml = YAML()
31 | with open(output_path, 'w') as f:
32 | yaml.dump(config, f)
33 |
34 |
35 | # Datasets
36 | def get_dataset(dataset_name, root_dir):
37 |     assert dataset_name in datasets_dict, \
38 |         "Dataset name '{}' is not one of {}".format(dataset_name, list(datasets_dict.keys()))
39 | dataset = datasets_dict[dataset_name](root_dir)
40 |     print('\nDataset {} is being used......\n'.format(dataset_name))
41 | return dataset
42 |
43 |
44 | # Dataloaders
45 | def get_generator(args, cfg, dataset, generator_type):
46 | """ Get generator.
47 |
48 | Args:
49 | args: input args
50 | cfg: configuration
51 | dataset: dataset used
52 |         generator_type: 'train' | 'valid' | 'test'
53 |             'train' for training, 'valid' for validation of the valid set,
54 |             'test' for inference on the test set.
55 | Output:
56 |         subset: the train, valid or test subset
57 |         data_generator: DataLoader for the subset (batch_sampler is also returned; it is None except for 'train')
58 | """
59 | assert generator_type == 'train' or generator_type == 'valid' or generator_type == 'test', \
60 |         "Data generator type '{}' is not 'train', 'valid' or 'test'".format(generator_type)
61 |
62 | batch_sampler = None
63 | if generator_type == 'train':
64 |
65 | subset = method_dict[cfg['method']].data.UserDataset(args, cfg, dataset, dataset_type='train')
66 | if 'pitchshift' in cfg['data_augmentation']['type']:
67 | augset = method_dict[cfg['method']].data.UserDataset(args, cfg, dataset, dataset_type='train_pitchshift')
68 | subset = ConcatDataset([subset, augset])
69 |
70 | batch_sampler = method_dict[cfg['method']].data.UserBatchSampler(
71 | clip_num=len(subset),
72 | batch_size=cfg['training']['batch_size'],
73 | seed=args.seed
74 | )
75 | data_generator = DataLoader(
76 | dataset=subset,
77 | batch_sampler=batch_sampler,
78 | num_workers=args.num_workers,
79 | collate_fn=method_dict[cfg['method']].data.collate_fn,
80 | pin_memory=True
81 | )
82 | elif generator_type == 'valid':
83 | subset = method_dict[cfg['method']].data.UserDataset(args, cfg, dataset, dataset_type='valid')
84 | data_generator = DataLoader(
85 | dataset=subset,
86 | batch_size=cfg['training']['batch_size'],
87 | shuffle=False,
88 | num_workers=args.num_workers,
89 | collate_fn=method_dict[cfg['method']].data.collate_fn,
90 | pin_memory=True
91 | )
92 | elif generator_type == 'test':
93 | testset_type = cfg['inference']['testset_type']
94 | dataset_type = testset_type + '_test'
95 | subset = method_dict[cfg['method']].data.UserDataset(args, cfg, dataset, dataset_type=dataset_type)
96 | data_generator = DataLoader(
97 | dataset=subset,
98 | batch_size=cfg['inference']['batch_size'],
99 | shuffle=False,
100 | num_workers=args.num_workers,
101 | collate_fn=method_dict[cfg['method']].data.collate_fn_test,
102 | pin_memory=True
103 | )
104 |
105 | return subset, data_generator, batch_sampler
106 |
107 |
108 | # Losses
109 | def get_losses(cfg):
110 | """ Get losses
111 |
112 | """
113 | losses = method_dict[cfg['method']].losses.Losses(cfg)
114 | for idx, loss_name in enumerate(losses.names):
115 | logging.info('{} is used as the {} loss.'.format(loss_name, convert_ordinal(idx + 1)))
116 | logging.info('')
117 | return losses
118 |
119 |
120 | # Metrics
121 | def get_metrics(cfg, dataset):
122 | """ Get metrics
123 |
124 | """
125 | metrics = method_dict[cfg['method']].metrics.Metrics(dataset)
126 | for idx, metric_name in enumerate(metrics.names):
127 | logging.info('{} is used as the {} metric.'.format(metric_name, convert_ordinal(idx + 1)))
128 | logging.info('')
129 | return metrics
130 |
131 |
132 | # Audio feature extractor
133 | def get_afextractor(cfg, cuda):
134 | """ Get audio feature extractor
135 |
136 | """
137 |     afextractor = None
138 |     if cfg['data']['audio_feature'] == 'logmel&intensity':
139 |         afextractor = move_model_to_gpu(feature.LogmelIntensity_Extractor(cfg), cuda)
140 |     return afextractor
141 |
142 |
143 | # Models
144 | def get_models(cfg, dataset, cuda, model_name=None):
145 | """ Get models
146 |
147 | """
148 | logging.info('=====>> Building a model\n')
149 | if not model_name:
150 | model = vars(method_dict[cfg['method']].models)[cfg['training']['model']](cfg, dataset)
151 | else:
152 | model = vars(method_dict[cfg['method']].models)[model_name](cfg, dataset)
153 | model = move_model_to_gpu(model, cuda)
154 | logging.info('Model architectures:\n{}\n'.format(model))
155 | count_parameters(model)
156 | return model
157 |
158 |
159 | # Optimizers
160 | def get_optimizer(cfg, af_extractor, model):
161 | """ Get optimizers
162 |
163 | """
164 | opt_method = cfg['training']['optimizer']
165 | lr = cfg['training']['lr']
166 | params = list(af_extractor.parameters()) + list(model.parameters())
167 | if opt_method == 'adam':
168 | optimizer = optim.Adam(params, lr=lr, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=True)
169 | elif opt_method == 'sgd':
170 | optimizer = optim.SGD(params, lr=lr, momentum=0, weight_decay=0)
171 | elif opt_method == 'adamw':
172 | # optimizer = AdamW(params, lr=lr, betas=(0.9, 0.999), weight_decay=0, warmup=0)
173 | optimizer = optim.AdamW(params, lr=lr, betas=(0.9, 0.999), eps=1e-08,
174 | weight_decay=0.01, amsgrad=True)
175 |
176 | logging.info('Optimizer is: {}\n'.format(opt_method))
177 | return optimizer
178 |
179 | # Trainer
180 | def get_trainer(args, cfg, dataset, valid_set, af_extractor, model, optimizer, losses, metrics):
181 | """ Get trainer
182 |
183 | """
184 | trainer = method_dict[cfg['method']].training.Trainer(
185 | args=args, cfg=cfg, dataset=dataset, valid_set=valid_set, af_extractor=af_extractor,
186 | model=model, optimizer=optimizer, losses=losses, metrics=metrics
187 | )
188 | return trainer
189 |
190 | # Inferer
191 | def get_inferer(cfg, dataset, af_extractor, model, cuda):
192 | """ Get inferer
193 |
194 | """
195 | inferer = method_dict[cfg['method']].inference.Inferer(
196 | cfg=cfg, dataset=dataset, af_extractor=af_extractor, model=model, cuda=cuda
197 | )
198 | return inferer
199 |
--------------------------------------------------------------------------------
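A condensed sketch of how the factory functions above are typically chained for training. The actual wiring lives under seld/learning/; the dataset root path below is a placeholder, and args/cfg are assumed to come from utils.cli_parser.parse_cli_overides():

    from utils.cli_parser import parse_cli_overides
    from utils.config import (get_afextractor, get_dataset, get_generator, get_losses,
        get_metrics, get_models, get_optimizer, get_trainer)

    args, cfg = parse_cli_overides()
    dataset = get_dataset('dcase2020task3', './dataset_root')   # root dir of the downloaded DCASE2020 Task 3 data
    train_set, train_gen, batch_sampler = get_generator(args, cfg, dataset, 'train')
    valid_set, valid_gen, _ = get_generator(args, cfg, dataset, 'valid')
    af_extractor = get_afextractor(cfg, cuda=True)
    model = get_models(cfg, dataset, cuda=True)
    optimizer = get_optimizer(cfg, af_extractor, model)
    losses, metrics = get_losses(cfg), get_metrics(cfg, dataset)
    trainer = get_trainer(args, cfg, dataset, valid_set, af_extractor, model,
        optimizer, losses, metrics)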
/seld/utils/datasets.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | import pandas as pd
3 |
4 |
5 | class Dcase2020task3:
6 | """DCASE 2020 Task 3 dataset
7 |
8 | """
9 | def __init__(self, root_dir):
10 | self.root_dir = Path(root_dir)
11 | self.dataset_dir = dict()
12 | self.dataset_dir['dev'] = {
13 | 'foa': self.root_dir.joinpath('foa_dev'),
14 | 'mic': self.root_dir.joinpath('mic_dev'),
15 | 'meta': self.root_dir.joinpath('metadata_dev'),
16 | }
17 | self.dataset_dir['eval'] = {
18 | 'foa': self.root_dir.joinpath('foa_eval'),
19 | 'mic': self.root_dir.joinpath('mic_eval'),
20 | 'meta': self.root_dir.joinpath('metadata_eval'),
21 | }
22 |
23 | self.label_set = ['alarm', 'crying baby', 'crash', 'barking dog', 'running engine', 'female scream', \
24 | 'female speech', 'burning fire', 'footsteps', 'knocking on door', 'male scream', 'male speech', \
25 | 'ringing phone', 'piano']
26 |
27 | self.clip_length = 60 # seconds long
28 | self.label_resolution = 0.1 # 0.1s is the label resolution
29 | self.fold_str_index = 4 # string index indicating fold number
30 | self.ov_str_index = -1 # string index indicating overlap
31 |
--------------------------------------------------------------------------------