├── .gitignore ├── Licenses └── MIT_LICENSE ├── README.md ├── configs └── ein_seld │ └── seld.yaml ├── environment.yml ├── figures └── performance_compare.png ├── papers ├── An Improved Event Independent Network for SELD.pdf └── Event Independent Network for SELD.pdf ├── scripts ├── download_dataset.sh ├── evaluate.sh ├── predict.sh ├── prepare_env.sh ├── preproc.sh └── train.sh └── seld ├── learning ├── __init__.py ├── checkpoint.py ├── evaluate.py ├── infer.py ├── initialize.py ├── preprocess.py └── train.py ├── main.py ├── methods ├── __init__.py ├── data.py ├── data_augmentation │ └── __init__.py ├── ein_seld │ ├── __init__.py │ ├── data.py │ ├── data_augmentation │ │ └── __init__.py │ ├── inference.py │ ├── losses.py │ ├── metrics.py │ ├── models │ │ ├── __init__.py │ │ └── seld.py │ └── training.py ├── feature.py ├── inference.py ├── training.py └── utils │ ├── SELD_evaluation_metrics_2019.py │ ├── SELD_evaluation_metrics_2020.py │ ├── __init__.py │ ├── data_utilities.py │ ├── loss_utilities.py │ ├── model_utilities.py │ └── stft.py └── utils ├── __init__.py ├── cli_parser.py ├── common.py ├── config.py └── datasets.py /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .DS_Store 3 | -------------------------------------------------------------------------------- /Licenses/MIT_LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Yin Cao 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection 2 | An Improved Event-Independent Network (EIN) for Polyphonic Sound Event Localization and Detection (SELD) 3 | 4 | from Centre for Vision, Speech and Signal Processing, University of Surrey. 
5 | 6 | ## Contents 7 | 8 | - [Introduction](#Introduction) 9 | - [Requirements](#Requirements) 10 | - [Download Dataset](#Download-Dataset) 11 | - [Preprocessing](#Preprocessing) 12 | - [QuickEvaluate](#QuickEvaluate) 13 | - [Usage](#Usage) 14 | * [Training](#Training) 15 | * [Prediction](#Prediction) 16 | * [Evaluation](#Evaluation) 17 | - [Results](#Results) 18 | - [FAQs](#FAQs) 19 | - [Citing](#Citing) 20 | - [Reference](#Reference) 21 | 22 | 23 | ## Introduction 24 | 25 | This is a PyTorch implementation of Event-Independent Networks for Polyphonic SELD. 26 | 27 | Event-Independent Networks for Polyphonic SELD uses a trackwise output format and multi-task learning (MTL) with a soft parameter-sharing scheme. For more information, please read the papers [here](#Citing). 28 | 29 | 30 | 31 | The features of this method are: 32 | - It uses a trackwise output format to detect different sound events of the same type but with different DoAs. 33 | - It uses permutation-invariant training (PIT) to solve the track permutation problem introduced by the trackwise output format. 34 | - It uses multi-head self-attention (MHSA) to separate tracks. 35 | - It uses multi-task learning (MTL) with a soft parameter-sharing scheme for joint SELD. 36 | 37 | Currently, the code is available for the [*TAU-NIGENS Spatial Sound Events 2020*](http://dcase.community/challenge2020/task-sound-event-localization-and-detection#download) dataset. Data augmentation methods are not included. 38 | 39 | ## Requirements 40 | 41 | We provide two ways to set up the environment. Both are based on [Anaconda](https://www.anaconda.com/products/individual). 42 | 43 | 1. Use the provided `prepare_env.sh`. Note that you need to set the `anaconda_dir` in `prepare_env.sh` to your anaconda directory, then directly run 44 | 45 | ```bash 46 | bash scripts/prepare_env.sh 47 | ``` 48 | 49 | 2. Use the provided `environment.yml`. Note that you also need to set the `prefix` to your target environment directory, then directly run 50 | 51 | ```bash 52 | conda env create -f environment.yml 53 | ``` 54 | 55 | After setting up the environment, don't forget to activate it 56 | 57 | ```bash 58 | conda activate ein 59 | ``` 60 | 61 | ## Download Dataset 62 | 63 | Downloading the dataset is easy. Directly run 64 | 65 | ```bash 66 | bash scripts/download_dataset.sh 67 | ``` 68 | 69 | ## Preprocessing 70 | 71 | The data and meta files need to be preprocessed: `.wav` files will be saved to `.h5` files, and meta files will also be converted to `.h5` files. After downloading the data, directly run 72 | 73 | ```bash 74 | bash scripts/preproc.sh 75 | ``` 76 | 77 | Preprocessing of the meta files (labels) separates the labels into different tracks, each with up to one event and a corresponding DoA. The same event is consistently put in the same track. For frame-level permutation-invariant training, this may not be necessary, but for chunk-level PIT or no PIT, consistently arranging the same event in the same track is reasonable. 78 | 
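For reference, the sketch below shows how the generated `.h5` files can be inspected with `h5py`. The file paths are hypothetical examples; the dataset names and shapes follow what `seld/learning/preprocess.py` writes for the DCASE 2020 Task 3 data (4-channel FOA audio at 24 kHz, 14 classes, 2 tracks, 0.1 s label frames).

```python
import h5py

# Waveform clips: one .h5 per wav file, int16 samples of shape (channels, samples)
with h5py.File('_hdf5/dcase2020task3/data/24000fs/dev/foa/fold1_room1_mix001_ov1.h5', 'r') as hf:
    waveform = hf['waveform'][:]      # e.g. (4, 1440000) for a 60 s FOA clip at 24 kHz

# Meta files: track-wise labels, up to one event per track per frame
with h5py.File('_hdf5/dcase2020task3/meta/dev/fold1_room1_mix001_ov1.h5', 'r') as hf:
    sed_label = hf['sed_label'][:]    # (600, 2, 14): frames x tracks x classes
    doa_label = hf['doa_label'][:]    # (600, 2, 3):  frames x tracks x Cartesian DoA
```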
79 | ## QuickEvaluate 80 | 81 | We uploaded the pre-trained model to Zenodo. Download it and unzip it in the code folder (the `EIN-SELD` folder) using 82 | 83 | ```bash 84 | wget 'https://zenodo.org/record/4158864/files/out_train.zip' && unzip out_train.zip 85 | ``` 86 | 87 | Then directly run 88 | 89 | ```bash 90 | bash scripts/predict.sh && bash scripts/evaluate.sh 91 | ``` 92 | 93 | ## Usage 94 | 95 | Hyper-parameters are stored in `./configs/ein_seld/seld.yaml`. You can change some of them, such as `train_chunklen_sec`, `train_hoplen_sec`, `test_chunklen_sec`, `test_hoplen_sec`, `batch_size`, `lr` and others. 96 | 97 | ### Training 98 | 99 | To train a model yourself, set up `./configs/ein_seld/seld.yaml` and directly run 100 | 101 | ```bash 102 | bash scripts/train.sh 103 | ``` 104 | 105 | `train_fold` and `valid_fold` in `./configs/ein_seld/seld.yaml` specify which folds are used for training and validation. Note that `valid_fold` can be `None`, which means no validation is needed; this is usually used when training on folds 1-6. 106 | 107 | `overlap` can be `1`, `2`, or the combined `1&2`, which means training on non-overlapped sound events, on overlapped sound events, or on both. 108 | 109 | `--seed` is set to a random integer by default. You can set it to a fixed number, but results will not be completely the same if an RNN or Transformer is used. 110 | 111 | You can consider adding the `--read_into_mem` argument in `train.sh` to pre-load all of the data into memory and increase the training speed, depending on your resources. 112 | 113 | `--num_workers` also affects the training speed; adjust it according to your resources. 114 | 115 | ### Prediction 116 | 117 | Prediction predicts results and saves them to the `./out_infer` folder. The saved results are the submission results for the DCASE challenge. Directly run 118 | 119 | ```bash 120 | bash scripts/predict.sh 121 | ``` 122 | 123 | Prediction predicts results on the `testset_type` set, which can be `dev` or `eval`. If it is `dev`, `test_fold` cannot be `None`. 124 | 125 | 126 | ### Evaluation 127 | 128 | Evaluation evaluates the generated submission results. Directly run 129 | 130 | ```bash 131 | bash scripts/evaluate.sh 132 | ``` 133 | 134 | ## Results 135 | 136 | It is notable that EINV2-DA is a single model with a plain VGGish architecture using only the channel-rotation and SpecAugment data-augmentation methods. 137 | 138 | 139 | 140 | ## FAQs 141 | 142 | 1. If you have any questions, please email caoyfive@gmail.com or report an issue here. 143 | 144 | 2. Currently, `pin_memory` can only be set to `True`. For more information, please check the [PyTorch Doc](https://pytorch.org/docs/stable/data.html#memory-pinning) and the [Nvidia Developer Blog](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/). 145 | 146 | 3. After downloading, you can delete the `downloaded_packages` folder to save some space. 147 | 148 | ## Citing 149 | 150 | If you use the code, please consider citing the papers below 151 | 152 | [[1]. Yin Cao, Turab Iqbal, Qiuqiang Kong, Fengyan An, Wenwu Wang, Mark D. Plumbley, "*An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection*", submitted for publication](http://bit.ly/2N8cF6w) 153 | ``` 154 | @article{cao2020anevent, 155 | title={An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection}, 156 | author={Cao, Yin and Iqbal, Turab and Kong, Qiuqiang and An, Fengyan and Wang, Wenwu and Plumbley, Mark D}, 157 | journal={arXiv preprint arXiv:2010.13092}, 158 | year={2020} 159 | } 160 | ``` 161 | 162 | [[2]. Yin Cao, Turab Iqbal, Qiuqiang Kong, Yue Zhong, Wenwu Wang, Mark D. 
Plumbley, "*Event-Independent Network for Polyphonic Sound Event Localization and Detection*", DCASE 2020 Workshop, November 2020](https://bit.ly/2Tz8oJ9) 163 | ``` 164 | @article{cao2020event, 165 | title={Event-Independent Network for Polyphonic Sound Event Localization and Detection}, 166 | author={Cao, Yin and Iqbal, Turab and Kong, Qiuqiang and Zhong, Yue and Wang, Wenwu and Plumbley, Mark D}, 167 | journal={arXiv preprint arXiv:2010.00140}, 168 | year={2020} 169 | } 170 | ``` 171 | 172 | ## Reference 173 | 174 | 1. Archontis Politis, Sharath Adavanne, and Tuomas Virtanen. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020). November 2020. [URL](https://arxiv.org/abs/2006.01919) 175 | 176 | 2. Annamaria Mesaros, Sharath Adavanne, Archontis Politis, Toni Heittola, and Tuomas Virtanen. Joint measurement of localization and detection of sound events. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, NY, Oct 2019. [URL](https://ieeexplore.ieee.org/abstract/document/8937220?casa_token=Z4aGA4E2Dz4AAAAA:BELmzMjaZslLDf1EN1NVZ92_9J0PRnRymY360j--954Un9jb_WXbvLSDhp--7yOeXp0HXYoKuUek) 177 | 178 | 3. Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2018. [URL](https://ieeexplore.ieee.org/abstract/document/8567942) 179 | 180 | 4. https://github.com/yinkalario/DCASE2019-TASK3 181 | 182 | -------------------------------------------------------------------------------- /configs/ein_seld/seld.yaml: -------------------------------------------------------------------------------- 1 | method: ein_seld 2 | dataset: dcase2020task3 3 | workspace_dir: ./ 4 | dataset_dir: ./_dataset/dataset_root 5 | hdf5_dir: ./_hdf5 6 | data: 7 | type: foa 8 | sample_rate: 24000 9 | n_fft: 1024 10 | hop_length: 600 11 | n_mels: 256 12 | window: hann 13 | fmin: 20 14 | fmax: 12000 15 | train_chunklen_sec: 4 16 | train_hoplen_sec: 4 17 | test_chunklen_sec: 4 18 | test_hoplen_sec: 4 19 | audio_feature: logmel&intensity 20 | feature_freeze: True 21 | data_augmentation: 22 | type: None 23 | training: 24 | train_id: EINV2_tPIT_n1 25 | model: EINV2 26 | resume_model: # None_epoch_latest.pth 27 | loss_type: all 28 | loss_beta: 0.5 29 | PIT_type: tPIT 30 | batch_size: 32 31 | train_fold: 2,3,4,5,6 32 | valid_fold: 1 33 | overlap: 1&2 34 | optimizer: adam 35 | lr: 0.0005 36 | lr_step_size: 80 37 | lr_gamma: 0.1 38 | max_epoch: 90 39 | threshold_sed: 0.5 40 | remark: None 41 | inference: 42 | infer_id: EINV2_tPIT_n1 43 | testset_type: eval # dev | eval 44 | test_fold: None 45 | overlap: 1&2 46 | train_ids: EINV2_tPIT_n1 47 | models: EINV2 48 | batch_size: 64 49 | threshold_sed: 0.5 50 | remark: None 51 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: ein 2 | channels: 3 | - pytorch 4 | - anaconda 5 | - conda-forge 6 | - defaults 7 | dependencies: 8 | - _libgcc_mutex=0.1=main 9 | - appdirs=1.4.4=pyh9f0ad1d_0 10 | - astroid=2.4.2=py37_0 11 | - audioread=2.1.8=py37hc8dfbb8_3 12 | - backcall=0.2.0=py_0 13 | - blas=1.0=mkl 14 | - brotlipy=0.7.0=py37hb5d75c8_1001 15 | - bzip2=1.0.8=h516909a_3 
16 | - ca-certificates=2020.10.14=0 17 | - certifi=2020.6.20=py37he5f6b98_2 18 | - cffi=1.14.3=py37he30daa8_0 19 | - chardet=3.0.4=py37he5f6b98_1008 20 | - cryptography=3.1.1=py37hff6837a_1 21 | - cudatoolkit=10.2.89=hfd86e86_1 22 | - cycler=0.10.0=py_2 23 | - decorator=4.4.2=py_0 24 | - ffmpeg=4.3.1=h3215721_1 25 | - freetype=2.10.4=h5ab3b9f_0 26 | - gettext=0.19.8.1=h5e8e0c9_1 27 | - gmp=6.2.0=he1b5a44_3 28 | - gnutls=3.6.13=h79a8f9a_0 29 | - h5py=2.10.0=py37hd6299e0_1 30 | - hdf5=1.10.6=hb1b8bf9_0 31 | - idna=2.10=pyh9f0ad1d_0 32 | - intel-openmp=2020.2=254 33 | - ipython=7.18.1=py37h5ca1d4c_0 34 | - ipython_genutils=0.2.0=py37_0 35 | - isort=5.6.4=py_0 36 | - jedi=0.17.2=py37_0 37 | - joblib=0.17.0=py_0 38 | - jpeg=9b=h024ee3a_2 39 | - kiwisolver=1.2.0=py37h99015e2_1 40 | - lame=3.100=h14c3975_1001 41 | - lazy-object-proxy=1.4.3=py37h7b6447c_0 42 | - lcms2=2.11=h396b838_0 43 | - ld_impl_linux-64=2.33.1=h53a641e_7 44 | - libedit=3.1.20191231=h14c3975_1 45 | - libffi=3.3=he6710b0_2 46 | - libflac=1.3.3=he1b5a44_0 47 | - libgcc-ng=9.1.0=hdf63c60_0 48 | - libgfortran-ng=7.3.0=hdf63c60_0 49 | - libiconv=1.16=h516909a_0 50 | - libllvm10=10.0.1=he513fc3_3 51 | - libogg=1.3.2=h516909a_1002 52 | - libpng=1.6.37=hbc83047_0 53 | - librosa=0.8.0=pyh9f0ad1d_0 54 | - libsndfile=1.0.29=he1b5a44_0 55 | - libstdcxx-ng=9.1.0=hdf63c60_0 56 | - libtiff=4.1.0=h2733197_1 57 | - libvorbis=1.3.7=he1b5a44_0 58 | - llvmlite=0.34.0=py37h5202443_2 59 | - lz4-c=1.9.2=heb0550a_3 60 | - matplotlib-base=3.3.2=py37hc9afd2a_1 61 | - mccabe=0.6.1=py37_1 62 | - mkl=2019.4=243 63 | - mkl-service=2.3.0=py37he904b0f_0 64 | - mkl_fft=1.2.0=py37h23d657b_0 65 | - mkl_random=1.1.0=py37hd6b4f25_0 66 | - ncurses=6.2=he6710b0_1 67 | - nettle=3.4.1=h1bed415_1002 68 | - ninja=1.10.1=py37hfd86e86_0 69 | - numba=0.51.2=py37h9fdb41a_0 70 | - numpy=1.19.1=py37hbc911f0_0 71 | - numpy-base=1.19.1=py37hfa32c7d_0 72 | - olefile=0.46=py37_0 73 | - openh264=2.1.1=h8b12597_0 74 | - openssl=1.1.1h=h516909a_0 75 | - packaging=20.4=pyh9f0ad1d_0 76 | - pandas=1.1.3=py37he6710b0_0 77 | - parso=0.7.0=py_0 78 | - pexpect=4.8.0=py37_1 79 | - pickleshare=0.7.5=py37_1001 80 | - pillow=8.0.0=py37h9a89aac_0 81 | - pip=20.2.4=py37_0 82 | - pooch=1.2.0=py_0 83 | - prompt-toolkit=3.0.8=py_0 84 | - ptyprocess=0.6.0=py37_0 85 | - pudb=2019.2=pyh9f0ad1d_2 86 | - pycparser=2.20=pyh9f0ad1d_2 87 | - pygments=2.7.1=py_0 88 | - pylint=2.6.0=py37_0 89 | - pyopenssl=19.1.0=py_1 90 | - pyparsing=2.4.7=pyh9f0ad1d_0 91 | - pysocks=1.7.1=py37he5f6b98_2 92 | - pysoundfile=0.10.2=py_1001 93 | - python=3.7.9=h7579374_0 94 | - python-dateutil=2.8.1=py_0 95 | - python_abi=3.7=1_cp37m 96 | - pytorch=1.6.0=py3.7_cuda10.2.89_cudnn7.6.5_0 97 | - pytz=2020.1=py_0 98 | - pyyaml=5.3.1=py37h7b6447c_1 99 | - readline=8.0=h7b6447c_0 100 | - requests=2.24.0=pyh9f0ad1d_0 101 | - resampy=0.2.2=py_0 102 | - ruamel.yaml=0.16.12=py37h8f50634_1 103 | - ruamel.yaml.clib=0.2.2=py37h8f50634_1 104 | - scikit-learn=0.23.2=py37h0573a6f_0 105 | - scipy=1.5.2=py37h0b6359f_0 106 | - setuptools=50.3.0=py37hb0f4dca_1 107 | - six=1.15.0=py_0 108 | - sqlite=3.33.0=h62c20be_0 109 | - termcolor=1.1.0=py37_1 110 | - threadpoolctl=2.1.0=pyh5ca1d4c_0 111 | - tk=8.6.10=hbc83047_0 112 | - toml=0.10.1=py_0 113 | - torchaudio=0.6.0=py37 114 | - torchvision=0.7.0=py37_cu102 115 | - tornado=6.0.4=py37h8f50634_2 116 | - tqdm=4.50.2=pyh9f0ad1d_0 117 | - traitlets=5.0.5=py_0 118 | - typed-ast=1.4.1=py37h7b6447c_0 119 | - urllib3=1.25.11=py_0 120 | - urwid=2.1.2=py37h8f50634_1 121 | - wcwidth=0.2.5=py_0 122 | - 
wheel=0.35.1=py_0 123 | - wrapt=1.11.2=py37h7b6447c_0 124 | - x264=1!152.20180806=h14c3975_0 125 | - xz=5.2.5=h7b6447c_0 126 | - yaml=0.2.5=h7b6447c_0 127 | - zlib=1.2.11=h7b6447c_3 128 | - zstd=1.4.5=h9ceee32_0 129 | - pip: 130 | - absl-py==0.10.0 131 | - cachetools==4.1.1 132 | - google-auth==1.22.1 133 | - google-auth-oauthlib==0.4.1 134 | - grpcio==1.33.1 135 | - importlib-metadata==2.0.0 136 | - markdown==3.3.2 137 | - oauthlib==3.1.0 138 | - protobuf==3.13.0 139 | - pyasn1==0.4.8 140 | - pyasn1-modules==0.2.8 141 | - requests-oauthlib==1.3.0 142 | - rsa==4.6 143 | - tensorboard==2.3.0 144 | - tensorboard-plugin-wit==1.7.0 145 | - werkzeug==1.0.1 146 | - zipp==3.3.2 147 | prefix: # /vol/research/yc/miniconda3/envs/ein 148 | 149 | -------------------------------------------------------------------------------- /figures/performance_compare.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/figures/performance_compare.png -------------------------------------------------------------------------------- /papers/An Improved Event Independent Network for SELD.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/papers/An Improved Event Independent Network for SELD.pdf -------------------------------------------------------------------------------- /papers/Event Independent Network for SELD.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/papers/Event Independent Network for SELD.pdf -------------------------------------------------------------------------------- /scripts/download_dataset.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | # check bin requirement 6 | command -v wget >/dev/null 2>&1 || { echo 'wget is missing' >&2; exit 1; } 7 | command -v zip >/dev/null 2>&1 || { echo 'zip is missing' >&2; exit 1; } 8 | command -v unzip >/dev/null 2>&1 || { echo 'unzip is missing' >&2; exit 1; } 9 | 10 | ## dcase 2020 Task 3 11 | Dataset_dir='_dataset' 12 | Dataset_root=$Dataset_dir'/dataset_root' 13 | mkdir -p $Dataset_root 14 | Download_packages_dir=$Dataset_dir'/downloaded_packages' 15 | mkdir -p $Download_packages_dir 16 | 17 | # dev 18 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/foa_dev.z01' 19 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/foa_dev.z02' 20 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/foa_dev.zip' 21 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/foa_eval.zip' 22 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/metadata_dev.zip' 23 | wget -P $Download_packages_dir 'https://zenodo.org/record/4064792/files/metadata_eval.zip' 24 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/mic_dev.z01' 25 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/mic_dev.z02' 26 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/mic_dev.zip' 27 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/mic_eval.zip' 28 | wget -P $Download_packages_dir 'https://zenodo.org/record/3870859/files/README.md' 29 | 30 | zip -s 0 
$Download_packages_dir'/foa_dev.zip' --out $Download_packages_dir'/foa_dev_single.zip' 31 | zip -s 0 $Download_packages_dir'/mic_dev.zip' --out $Download_packages_dir'/mic_dev_single.zip' 32 | 33 | unzip $Download_packages_dir'/foa_dev_single.zip' -d $Dataset_root 34 | unzip $Download_packages_dir'/mic_dev_single.zip' -d $Dataset_root 35 | unzip $Download_packages_dir'/metadata_dev.zip' -d $Dataset_root 36 | unzip $Download_packages_dir'/metadata_eval.zip' -d $Dataset_root 37 | unzip $Download_packages_dir'/foa_eval.zip' -d $Dataset_root 38 | unzip $Download_packages_dir'/mic_eval.zip' -d $Dataset_root -------------------------------------------------------------------------------- /scripts/evaluate.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | CONFIG_FILE='./configs/ein_seld/seld.yaml' 6 | 7 | python seld/main.py -c $CONFIG_FILE evaluate -------------------------------------------------------------------------------- /scripts/predict.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | CONFIG_FILE='./configs/ein_seld/seld.yaml' 6 | 7 | python seld/main.py -c $CONFIG_FILE infer --num_workers=8 -------------------------------------------------------------------------------- /scripts/prepare_env.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | anaconda_dir= # '/vol/research/yc/miniconda3' 6 | 7 | . $anaconda_dir'/etc/profile.d/conda.sh' 8 | conda remove -n ein --all -y 9 | conda create -n ein python=3.7 -y 10 | conda activate ein 11 | 12 | conda install -c anaconda pandas h5py ipython pyyaml pylint -y 13 | conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-lts -y 14 | conda install -c conda-forge librosa pudb tqdm ruamel.yaml -y 15 | conda install -c omnia termcolor -y 16 | pip install tensorboard 17 | -------------------------------------------------------------------------------- /scripts/preproc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | CONFIG_FILE='./configs/ein_seld/seld.yaml' 6 | 7 | # Extract data 8 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_data' --dataset_type='dev' 9 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_data' --dataset_type='eval' 10 | 11 | # Extract scalar 12 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_scalar' --num_workers=8 13 | 14 | # Extract meta 15 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_meta' --dataset_type='dev' 16 | python seld/main.py -c $CONFIG_FILE preprocess --preproc_mode='extract_meta' --dataset_type='eval' 17 | -------------------------------------------------------------------------------- /scripts/train.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | CONFIG_FILE='./configs/ein_seld/seld.yaml' 6 | 7 | python seld/main.py -c $CONFIG_FILE train --seed=$(shuf -i 0-10000 -n 1) --num_workers=8 8 | -------------------------------------------------------------------------------- /seld/learning/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/seld/learning/__init__.py 
-------------------------------------------------------------------------------- /seld/learning/checkpoint.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import random 3 | from pathlib import Path 4 | 5 | import numpy as np 6 | import pandas as pd 7 | import torch 8 | 9 | 10 | class CheckpointIO: 11 | """CheckpointIO class. 12 | 13 | It handles saving and loading checkpoints. 14 | """ 15 | 16 | def __init__(self, checkpoints_dir, model, optimizer, batch_sampler, metrics_names, num_checkpoints=1, remark=None): 17 | """ 18 | Args: 19 | checkpoint_dir (Path obj): path where checkpoints are saved 20 | model: model 21 | optimizer: optimizer 22 | batch_sampler: batch_sampler 23 | metrics_names: metrics names to be saved in a checkpoints csv file 24 | num_checkpoints: maximum number of checkpoints to save. When it exceeds the number, the older 25 | (older, smaller or higher) checkpoints will be deleted 26 | remark (optional): to remark the name of the checkpoint 27 | """ 28 | self.checkpoints_dir = checkpoints_dir 29 | self.checkpoints_dir.mkdir(parents=True, exist_ok=True) 30 | self.model = model 31 | self.optimizer = optimizer 32 | self.batch_sampler = batch_sampler 33 | self.num_checkpoints = num_checkpoints 34 | self.remark = remark 35 | 36 | self.value_list = [] 37 | self.epoch_list = [] 38 | 39 | self.checkpoints_csv_path = checkpoints_dir.joinpath('metrics_statistics.csv') 40 | 41 | # save checkpoints_csv header 42 | metrics_keys_list = [name for name in metrics_names] 43 | header = ['epoch'] + metrics_keys_list 44 | df_header = pd.DataFrame(columns=header) 45 | df_header.to_csv(self.checkpoints_csv_path, sep='\t', index=False, mode='a+') 46 | 47 | def save(self, epoch, it, metrics, key_rank=None, rank_order='high'): 48 | """Save model. It will save a latest model, a best model of rank_order for value, and 49 | 'self.num_checkpoints' best models of rank_order for value. 
50 | 51 | Args: 52 | metrics: metrics to log 53 | key_rank (str): the key of metrics to rank 54 | rank_order: 'low' | 'high' | 'latest' 55 | 'low' to keep the models of lowest values 56 | 'high' to keep the models of highest values 57 | 'latest' to keep the models of latest epochs 58 | """ 59 | 60 | ## save checkpionts_csv 61 | metrics_values_list = [value for value in metrics.values()] 62 | checkpoint_list = [[epoch] + metrics_values_list] 63 | df_checkpoint = pd.DataFrame(checkpoint_list) 64 | df_checkpoint.to_csv(self.checkpoints_csv_path, sep='\t', header=False, index=False, mode='a+') 65 | 66 | ## save checkpoints 67 | current_value = None if rank_order == 'latest' else metrics[key_rank] 68 | 69 | # latest model 70 | latest_checkpoint_path = self.checkpoints_dir.joinpath('{}_epoch_latest.pth'.format(self.remark)) 71 | self.save_file(latest_checkpoint_path, epoch, it) 72 | 73 | # self.num_checkpoints best models 74 | if len(self.value_list) < self.num_checkpoints: 75 | self.value_list.append(current_value) 76 | self.epoch_list.append(epoch) 77 | checkpoint_path = self.checkpoints_dir.joinpath('{}_epoch_{}.pth'.format(self.remark, epoch)) 78 | self.save_file(checkpoint_path, epoch, it) 79 | logging.info('Checkpoint saved to {}'.format(checkpoint_path)) 80 | elif len(self.value_list) >= self.num_checkpoints: 81 | value_list = np.array(self.value_list) 82 | if rank_order == 'high' and current_value >= value_list.min(): 83 | worst_index = value_list.argmin() 84 | self.del_and_save(worst_index, current_value, epoch, it) 85 | elif rank_order == 'low' and current_value <= value_list.max(): 86 | worst_index = value_list.argmax() 87 | self.del_and_save(worst_index, current_value, epoch, it) 88 | elif rank_order == 'latest': 89 | worst_index = 0 90 | self.del_and_save(worst_index, current_value, epoch, it) 91 | 92 | # best model 93 | value_list = np.array(self.value_list) 94 | best_checkpoint_path = self.checkpoints_dir.joinpath('{}_epoch_best.pth'.format(self.remark)) 95 | if rank_order == 'high' and current_value >= value_list.max(): 96 | self.save_file(best_checkpoint_path, epoch, it) 97 | elif rank_order == 'low' and current_value <= value_list.min(): 98 | self.save_file(best_checkpoint_path, epoch, it) 99 | elif rank_order == 'latest': 100 | self.save_file(best_checkpoint_path, epoch, it) 101 | 102 | def del_and_save(self, worst_index, current_value, epoch, it): 103 | """Delete and save checkpoint 104 | 105 | Args: 106 | worst_index: worst index, 107 | current_value: current value, 108 | epoch: epoch, 109 | it: it, 110 | """ 111 | worst_chpt_path = self.checkpoints_dir.joinpath('{}_epoch_{}.pth'.format(self.remark, self.epoch_list[worst_index])) 112 | if worst_chpt_path.is_file(): 113 | worst_chpt_path.unlink() 114 | self.value_list.pop(worst_index) 115 | self.epoch_list.pop(worst_index) 116 | 117 | self.value_list.append(current_value) 118 | self.epoch_list.append(epoch) 119 | checkpoint_path = self.checkpoints_dir.joinpath('{}_epoch_{}.pth'.format(self.remark, epoch)) 120 | self.save_file(checkpoint_path, epoch, it) 121 | logging.info('Checkpoint saved to {}'.format(checkpoint_path)) 122 | 123 | def save_file(self, checkpoint_path, epoch, it): 124 | """Save a module to a file 125 | 126 | Args: 127 | checkpoint_path (Path obj): checkpoint path, including .pth file name 128 | epoch: epoch, 129 | it: it 130 | """ 131 | outdict = { 132 | 'epoch': epoch, 133 | 'it': it, 134 | 'model': self.model.module.state_dict(), 135 | 'optimizer': self.optimizer.state_dict(), 136 | 'sampler': 
self.batch_sampler.get_state(), 137 | 'rng': torch.get_rng_state(), 138 | 'cuda_rng': torch.cuda.get_rng_state(), 139 | 'random': random.getstate(), 140 | 'np_random': np.random.get_state(), 141 | } 142 | torch.save(outdict, checkpoint_path) 143 | 144 | def load(self, checkpoint_path): 145 | """Load a module from a file 146 | 147 | """ 148 | state_dict = torch.load(checkpoint_path) 149 | epoch = state_dict['epoch'] 150 | it = state_dict['it'] 151 | self.model.module.load_state_dict(state_dict['model']) 152 | self.optimizer.load_state_dict(state_dict['optimizer']) 153 | self.batch_sampler.set_state(state_dict['sampler']) 154 | torch.set_rng_state(state_dict['rng']) 155 | torch.cuda.set_rng_state(state_dict['cuda_rng']) 156 | random.setstate(state_dict['random']) 157 | np.random.set_state(state_dict['np_random']) 158 | logging.info('Resuming complete from {}\n'.format(checkpoint_path)) 159 | return epoch, it 160 | 161 | -------------------------------------------------------------------------------- /seld/learning/evaluate.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | import numpy as np 4 | import pandas as pd 5 | 6 | import methods.utils.SELD_evaluation_metrics_2019 as SELDMetrics2019 7 | from methods.utils.data_utilities import (load_dcase_format, 8 | to_metrics2020_format) 9 | from methods.utils.SELD_evaluation_metrics_2020 import \ 10 | SELDMetrics as SELDMetrics2020 11 | from methods.utils.SELD_evaluation_metrics_2020 import early_stopping_metric 12 | 13 | 14 | def evaluate(cfg, dataset): 15 | 16 | """ Evaluate scores 17 | 18 | """ 19 | 20 | '''Directories''' 21 | print('Inference ID is {}\n'.format(cfg['inference']['infer_id'])) 22 | 23 | out_infer_dir = Path(cfg['workspace_dir']).joinpath('out_infer').joinpath(cfg['method']) \ 24 | .joinpath(cfg['inference']['infer_id']) 25 | submissions_dir = out_infer_dir.joinpath('submissions') 26 | 27 | main_dir = Path(cfg['dataset_dir']) 28 | dev_meta_dir = main_dir.joinpath('metadata_dev') 29 | eval_meta_dir = main_dir.joinpath('metadata_eval') 30 | if cfg['inference']['testset_type'] == 'dev': 31 | meta_dir = dev_meta_dir 32 | elif cfg['inference']['testset_type'] == 'eval': 33 | meta_dir = eval_meta_dir 34 | 35 | pred_frame_begin_index = 0 36 | gt_frame_begin_index = 0 37 | frame_length = int(dataset.clip_length / dataset.label_resolution) 38 | pred_output_dict, pred_sed_metrics2019, pred_doa_metrics2019 = {}, [], [] 39 | gt_output_dict, gt_sed_metrics2019, gt_doa_metrics2019= {}, [], [] 40 | for pred_path in sorted(submissions_dir.glob('*.csv')): 41 | fn = pred_path.name 42 | gt_path = meta_dir.joinpath(fn) 43 | 44 | # pred 45 | output_dict, sed_metrics2019, doa_metrics2019 = load_dcase_format( 46 | pred_path, frame_begin_index=pred_frame_begin_index, 47 | frame_length=frame_length, num_classes=len(dataset.label_set), set_type='pred') 48 | pred_output_dict.update(output_dict) 49 | pred_sed_metrics2019.append(sed_metrics2019) 50 | pred_doa_metrics2019.append(doa_metrics2019) 51 | pred_frame_begin_index += frame_length 52 | 53 | # gt 54 | output_dict, sed_metrics2019, doa_metrics2019 = load_dcase_format( 55 | gt_path, frame_begin_index=gt_frame_begin_index, 56 | frame_length=frame_length, num_classes=len(dataset.label_set), set_type='gt') 57 | gt_output_dict.update(output_dict) 58 | gt_sed_metrics2019.append(sed_metrics2019) 59 | gt_doa_metrics2019.append(doa_metrics2019) 60 | gt_frame_begin_index += frame_length 61 | 62 | pred_sed_metrics2019 = 
np.concatenate(pred_sed_metrics2019, axis=0) 63 | pred_doa_metrics2019 = np.concatenate(pred_doa_metrics2019, axis=0) 64 | gt_sed_metrics2019 = np.concatenate(gt_sed_metrics2019, axis=0) 65 | gt_doa_metrics2019 = np.concatenate(gt_doa_metrics2019, axis=0) 66 | pred_metrics2020_dict = to_metrics2020_format(pred_output_dict, 67 | pred_sed_metrics2019.shape[0], label_resolution=dataset.label_resolution) 68 | gt_metrics2020_dict = to_metrics2020_format(gt_output_dict, 69 | gt_sed_metrics2019.shape[0], label_resolution=dataset.label_resolution) 70 | 71 | # 2019 metrics 72 | num_frames_1s = int(1 / dataset.label_resolution) 73 | ER_19, F_19 = SELDMetrics2019.compute_sed_scores(pred_sed_metrics2019, gt_sed_metrics2019, num_frames_1s) 74 | LE_19, LR_19, _, _, _, _ = SELDMetrics2019.compute_doa_scores_regr( 75 | pred_doa_metrics2019, gt_doa_metrics2019, pred_sed_metrics2019, gt_sed_metrics2019) 76 | seld_score_19 = SELDMetrics2019.early_stopping_metric([ER_19, F_19], [LE_19, LR_19]) 77 | 78 | # 2020 metrics 79 | dcase2020_metric = SELDMetrics2020(nb_classes=len(dataset.label_set), doa_threshold=20) 80 | dcase2020_metric.update_seld_scores(pred_metrics2020_dict, gt_metrics2020_dict) 81 | ER_20, F_20, LE_20, LR_20 = dcase2020_metric.compute_seld_scores() 82 | seld_score_20 = early_stopping_metric([ER_20, F_20], [LE_20, LR_20]) 83 | 84 | metrics_scores ={ 85 | 'ER20': ER_20, 86 | 'F20': F_20, 87 | 'LE20': LE_20, 88 | 'LR20': LR_20, 89 | 'seld20': seld_score_20, 90 | 'ER19': ER_19, 91 | 'F19': F_19, 92 | 'LE19': LE_19, 93 | 'LR19': LR_19, 94 | 'seld19': seld_score_19, 95 | } 96 | 97 | out_str = 'test: ' 98 | for key, value in metrics_scores.items(): 99 | out_str += '{}: {:.3f}, '.format(key, value) 100 | print('---------------------------------------------------------------------------------------------------' 101 | +'-------------------------------------------------') 102 | print(out_str) 103 | print('---------------------------------------------------------------------------------------------------' 104 | +'-------------------------------------------------') 105 | 106 | out_eval_dir = Path(cfg['workspace_dir']).joinpath('out_eval').joinpath(cfg['method']) \ 107 | .joinpath(cfg['inference']['infer_id']) 108 | out_eval_dir.mkdir(parents=True, exist_ok=True) 109 | result_path = out_eval_dir.joinpath('results.csv') 110 | df = pd.DataFrame(metrics_scores, index=[0]) 111 | df.to_csv(result_path, sep=',', mode='a') 112 | 113 | -------------------------------------------------------------------------------- /seld/learning/infer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from utils.config import get_afextractor, get_inferer, get_models 3 | 4 | 5 | def infer(cfg, dataset, **infer_initializer): 6 | """ Infer, only save the testset predictions 7 | 8 | """ 9 | submissions_dir = infer_initializer['submissions_dir'] 10 | ckpts_paths_list = infer_initializer['ckpts_paths_list'] 11 | ckpts_models_list = infer_initializer['ckpts_models_list'] 12 | test_generator = infer_initializer['test_generator'] 13 | cuda = infer_initializer['cuda'] 14 | preds = [] 15 | for ckpt_path, model_name in zip(ckpts_paths_list, ckpts_models_list): 16 | print('=====>> Resuming from the checkpoint: {}\n'.format(ckpt_path)) 17 | af_extractor = get_afextractor(cfg, cuda) 18 | model = get_models(cfg, dataset, cuda, model_name=model_name) 19 | state_dict = torch.load(ckpt_path) 20 | model.module.load_state_dict(state_dict['model']) 21 | print(' Resuming complete\n') 22 | inferer = 
get_inferer(cfg, dataset, af_extractor, model, cuda) 23 | pred = inferer.infer(test_generator) 24 | preds.append(pred) 25 | print('\n Inference finished for {}\n'.format(ckpt_path)) 26 | inferer.fusion(submissions_dir, preds) 27 | 28 | 29 | -------------------------------------------------------------------------------- /seld/learning/initialize.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import random 3 | import shutil 4 | import socket 5 | from datetime import datetime 6 | from pathlib import Path 7 | 8 | import numpy as np 9 | import torch 10 | import torch.optim as optim 11 | from torch.backends import cudnn 12 | from torch.utils.tensorboard import SummaryWriter 13 | from utils.common import create_logging 14 | from utils.config import (get_afextractor, get_generator, get_losses, 15 | get_metrics, get_models, get_optimizer, get_trainer, 16 | store_config) 17 | 18 | from learning.checkpoint import CheckpointIO 19 | 20 | 21 | def init_train(args, cfg, dataset): 22 | """ Training initialization. 23 | 24 | Including Data generator, model, optimizer initialization. 25 | """ 26 | 27 | '''Cuda''' 28 | args.cuda = not args.no_cuda and torch.cuda.is_available() 29 | 30 | ''' Reproducible seed set''' 31 | torch.manual_seed(args.seed) 32 | if args.cuda: 33 | torch.cuda.manual_seed(args.seed) 34 | np.random.seed(args.seed) 35 | random.seed(args.seed) 36 | cudnn.deterministic = True 37 | cudnn.benchmark = True 38 | 39 | '''Directories''' 40 | print('Train ID is {}\n'.format(cfg['training']['train_id'])) 41 | out_train_dir = Path(cfg['workspace_dir']).joinpath('out_train') \ 42 | .joinpath(cfg['method']).joinpath(cfg['training']['train_id']) 43 | if out_train_dir.is_dir(): 44 | flag = input("Train ID folder {} is existed, delete it? (y/n)". \ 45 | format(str(out_train_dir))).lower() 46 | print('') 47 | if flag == 'y': 48 | shutil.rmtree(str(out_train_dir)) 49 | elif flag == 'n': 50 | print("User select not to remove the training ID folder {}.\n". 
\ 51 | format(str(out_train_dir))) 52 | out_train_dir.mkdir(parents=True, exist_ok=True) 53 | 54 | current_time = datetime.now().strftime('%b%d_%H-%M-%S') 55 | tb_dir = out_train_dir.joinpath('tb').joinpath(current_time + '_' + socket.gethostname()) 56 | tb_dir.mkdir(parents=True, exist_ok=True) 57 | logs_dir = out_train_dir.joinpath('logs') 58 | ckpts_dir = out_train_dir.joinpath('checkpoints') 59 | 60 | '''tensorboard and logging''' 61 | writer = SummaryWriter(log_dir=str(tb_dir)) 62 | create_logging(logs_dir, filemode='w') 63 | param_file = out_train_dir.joinpath('config.yaml') 64 | if param_file.is_file(): 65 | param_file.unlink() 66 | store_config(param_file, cfg) 67 | 68 | '''Data generator''' 69 | train_set, train_generator, batch_sampler = get_generator(args, cfg, dataset, generator_type='train') 70 | valid_set, valid_generator, _ = get_generator(args, cfg, dataset, generator_type='valid') 71 | 72 | '''Loss''' 73 | losses = get_losses(cfg) 74 | 75 | '''Metrics''' 76 | metrics = get_metrics(cfg, dataset) 77 | 78 | '''Audio feature extractor''' 79 | af_extractor = get_afextractor(cfg, args.cuda) 80 | 81 | '''Model''' 82 | model = get_models(cfg, dataset, args.cuda) 83 | 84 | '''Optimizer''' 85 | optimizer = get_optimizer(cfg, af_extractor, model) 86 | lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=cfg['training']['lr_step_size'], 87 | gamma=cfg['training']['lr_gamma']) 88 | 89 | '''Trainer''' 90 | trainer = get_trainer(args=args, cfg=cfg, dataset=dataset, valid_set=valid_set, 91 | af_extractor=af_extractor, model=model, optimizer=optimizer, losses=losses, metrics=metrics) 92 | 93 | '''CheckpointIO''' 94 | if not cfg['training']['valid_fold']: 95 | metrics_names = losses.names 96 | else: 97 | metrics_names = metrics.names 98 | ckptIO = CheckpointIO( 99 | checkpoints_dir=ckpts_dir, 100 | model=model, 101 | optimizer=optimizer, 102 | batch_sampler=batch_sampler, 103 | metrics_names=metrics_names, 104 | num_checkpoints=1, 105 | remark=cfg['training']['remark'] 106 | ) 107 | 108 | if cfg['training']['resume_model']: 109 | resume_path = ckpts_dir.joinpath(cfg['training']['resume_model']) 110 | logging.info('=====>> Resume from the checkpoint: {}......\n'.format(str(resume_path))) 111 | epoch_it, it = ckptIO.load(resume_path) 112 | for param_group in optimizer.param_groups: 113 | param_group['lr'] = cfg['training']['lr'] 114 | else: 115 | epoch_it, it = 0, 0 116 | 117 | '''logging and return''' 118 | logging.info('Train folds are: {}\n'.format(cfg['training']['train_fold'])) 119 | logging.info('Valid folds are: {}\n'.format(cfg['training']['valid_fold'])) 120 | logging.info('Training clip number is: {}\n'.format(len(train_set))) 121 | logging.info('Number of batches per epoch is: {}\n'.format(len(batch_sampler))) 122 | logging.info('Validation clip number is: {}\n'.format(len(valid_set))) 123 | logging.info('Training loss type is: {}\n'.format(cfg['training']['loss_type'])) 124 | if cfg['training']['loss_type'] == 'doa': 125 | logging.info('DOA loss type is: {}\n'.format(cfg['training']['doa_loss_type'])) 126 | logging.info('Data augmentation methods used are: {}\n'.format(cfg['data_augmentation']['type'])) 127 | 128 | train_initializer = { 129 | 'writer': writer, 130 | 'train_generator': train_generator, 131 | 'valid_generator': valid_generator, 132 | 'lr_scheduler': lr_scheduler, 133 | 'trainer': trainer, 134 | 'ckptIO': ckptIO, 135 | 'epoch_it': epoch_it, 136 | 'it': it 137 | } 138 | 139 | return train_initializer 140 | 141 | 142 | def init_infer(args, cfg, dataset): 143 | 
""" Inference initialization. 144 | 145 | Including Data generator, model, optimizer initialization. 146 | """ 147 | 148 | '''Cuda''' 149 | args.cuda = not args.no_cuda and torch.cuda.is_available() 150 | 151 | '''Directories''' 152 | print('Inference ID is {}\n'.format(cfg['inference']['infer_id'])) 153 | out_infer_dir = Path(cfg['workspace_dir']).joinpath('out_infer').joinpath(cfg['method']) \ 154 | .joinpath(cfg['inference']['infer_id']) 155 | if out_infer_dir.is_dir(): 156 | shutil.rmtree(str(out_infer_dir)) 157 | submissions_dir = out_infer_dir.joinpath('submissions') 158 | submissions_dir.mkdir(parents=True, exist_ok=True) 159 | train_ids = [train_id.strip() for train_id in str(cfg['inference']['train_ids']).split(',')] 160 | models = [model.strip() for model in str(cfg['inference']['models']).split(',')] 161 | ckpts_paths_list = [] 162 | ckpts_models_list = [] 163 | for train_id, model_name in zip(train_ids, models): 164 | ckpts_dir = Path(cfg['workspace_dir']).joinpath('out_train').joinpath(cfg['method']) \ 165 | .joinpath(train_id).joinpath('checkpoints') 166 | ckpt_path = [path for path in sorted(ckpts_dir.iterdir()) if path.stem.split('_')[-1].isnumeric()] 167 | for path in ckpt_path: 168 | ckpts_paths_list.append(path) 169 | ckpts_models_list.append(model_name) 170 | 171 | '''Parameters''' 172 | param_file = out_infer_dir.joinpath('config.yaml') 173 | if param_file.is_file(): 174 | param_file.unlink() 175 | store_config(param_file, cfg) 176 | 177 | '''Data generator''' 178 | test_set, test_generator, _ = get_generator(args, cfg, dataset, generator_type='test') 179 | 180 | '''logging and return''' 181 | logging.info('Test clip number is: {}\n'.format(len(test_set))) 182 | 183 | infer_initializer = { 184 | 'submissions_dir': submissions_dir, 185 | 'ckpts_paths_list': ckpts_paths_list, 186 | 'ckpts_models_list': ckpts_models_list, 187 | 'test_generator': test_generator, 188 | 'cuda': args.cuda 189 | } 190 | 191 | return infer_initializer 192 | -------------------------------------------------------------------------------- /seld/learning/preprocess.py: -------------------------------------------------------------------------------- 1 | import shutil 2 | from functools import reduce 3 | from pathlib import Path 4 | from timeit import default_timer as timer 5 | 6 | import h5py 7 | import librosa 8 | import numpy as np 9 | import pandas as pd 10 | import torch 11 | from methods.data import BaseDataset, collate_fn 12 | from torch.utils.data import DataLoader 13 | from tqdm import tqdm 14 | from utils.common import float_samples_to_int16 15 | from utils.config import get_afextractor 16 | 17 | 18 | class Preprocessor: 19 | """Preprocess the audio data. 20 | 21 | 1. Extract wav file and store to hdf5 file 22 | 2. 
Extract meta file and store to hdf5 file 23 | """ 24 | def __init__(self, args, cfg, dataset): 25 | """ 26 | Args: 27 | args: parsed args 28 | cfg: configurations 29 | dataset: dataset class 30 | """ 31 | self.args = args 32 | self.cfg = cfg 33 | self.dataset = dataset 34 | 35 | # Path for dataset 36 | hdf5_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']) 37 | 38 | # Path for extraction of wav 39 | self.data_dir_list = [ 40 | dataset.dataset_dir[args.dataset_type]['foa'], 41 | dataset.dataset_dir[args.dataset_type]['mic'] 42 | ] 43 | data_h5_dir = hdf5_dir.joinpath('data').joinpath('{}fs'.format(cfg['data']['sample_rate'])) 44 | self.data_h5_dir_list = [ 45 | data_h5_dir.joinpath(args.dataset_type).joinpath('foa'), 46 | data_h5_dir.joinpath(args.dataset_type).joinpath('mic') 47 | ] 48 | self.data_statistics_path_list = [ 49 | data_h5_dir.joinpath(args.dataset_type).joinpath('statistics_foa.txt'), 50 | data_h5_dir.joinpath(args.dataset_type).joinpath('statistics_mic.txt') 51 | ] 52 | 53 | # Path for extraction of scalar 54 | self.scalar_h5_dir = hdf5_dir.joinpath('scalar') 55 | fn_scalar = '{}_{}_sr{}_nfft{}_hop{}_mel{}.h5'.format(cfg['data']['type'], cfg['data']['audio_feature'], 56 | cfg['data']['sample_rate'], cfg['data']['n_fft'], cfg['data']['hop_length'], cfg['data']['n_mels']) 57 | self.scalar_path = self.scalar_h5_dir.joinpath(fn_scalar) 58 | 59 | # Path for extraction of meta 60 | self.meta_dir = dataset.dataset_dir[args.dataset_type]['meta'] 61 | self.meta_h5_dir = hdf5_dir.joinpath('meta').joinpath(args.dataset_type) 62 | 63 | def extract_data(self): 64 | """ Extract wave and store to hdf5 file 65 | 66 | """ 67 | print('Converting wav file to hdf5 file starts......\n') 68 | 69 | for h5_dir in self.data_h5_dir_list: 70 | if h5_dir.is_dir(): 71 | flag = input("HDF5 folder {} is already existed, delete it? (y/n)".format(h5_dir)).lower() 72 | if flag == 'y': 73 | shutil.rmtree(h5_dir) 74 | elif flag == 'n': 75 | print("User select not to remove the HDF5 folder {}. 
The process will quit.\n".format(h5_dir)) 76 | return 77 | h5_dir.mkdir(parents=True) 78 | for statistic_path in self.data_statistics_path_list: 79 | if statistic_path.is_file(): 80 | statistic_path.unlink() 81 | 82 | for idx, data_dir in enumerate(self.data_dir_list): 83 | begin_time = timer() 84 | h5_dir = self.data_h5_dir_list[idx] 85 | statistic_path = self.data_statistics_path_list[idx] 86 | audio_count = 0 87 | silent_audio_count = 0 88 | data_list = [path for path in sorted(data_dir.glob('*.wav')) if not path.name.startswith('.')] 89 | iterator = tqdm(data_list, total=len(data_list), unit='it') 90 | for data_path in iterator: 91 | # read data 92 | data, _ = librosa.load(data_path, sr=self.cfg['data']['sample_rate'], mono=False) 93 | if len(data.shape) == 1: 94 | data = data[None,:] 95 | '''data: (channels, samples)''' 96 | 97 | # silent data statistics 98 | lst = np.sum(np.abs(data), axis=1) > data.shape[1]*1e-4 99 | if not reduce(lambda x, y: x*y, lst): 100 | with statistic_path.open(mode='a+') as f: 101 | print(f"Silent file in feature extractor: {data_path.name}\n", file=f) 102 | silent_audio_count += 1 103 | tqdm.write("Silent file in feature extractor: {}".format(data_path.name)) 104 | tqdm.write("Total silent files are: {}\n".format(silent_audio_count)) 105 | 106 | # save to h5py 107 | h5_path = h5_dir.joinpath(data_path.stem + '.h5') 108 | with h5py.File(h5_path, 'w') as hf: 109 | hf.create_dataset(name='waveform', data=float_samples_to_int16(data), dtype=np.int16) 110 | 111 | audio_count += 1 112 | 113 | tqdm.write('{}, {}, {}'.format(audio_count, h5_path, data.shape)) 114 | 115 | with statistic_path.open(mode='a+') as f: 116 | print(f"Total number of audio clips extracted: {audio_count}", file=f) 117 | print(f"Total number of silent audio clips is: {silent_audio_count}\n", file=f) 118 | 119 | iterator.close() 120 | print("Extacting feature finished! Time spent: {:.3f} s".format(timer() - begin_time)) 121 | 122 | return 123 | 124 | def extract_scalar(self): 125 | """ Extract scalar and store to hdf5 file 126 | 127 | """ 128 | print('Extracting scalar......\n') 129 | self.scalar_h5_dir.mkdir(parents=True, exist_ok=True) 130 | 131 | cuda_enabled = not self.args.no_cuda and torch.cuda.is_available() 132 | train_set = BaseDataset(self.args, self.cfg, self.dataset) 133 | data_generator = DataLoader( 134 | dataset=train_set, 135 | batch_size=32, 136 | shuffle=False, 137 | num_workers=self.args.num_workers, 138 | collate_fn=collate_fn, 139 | pin_memory=True 140 | ) 141 | af_extractor = get_afextractor(self.cfg, cuda_enabled).eval() 142 | iterator = tqdm(enumerate(data_generator), total=len(data_generator), unit='it') 143 | features = [] 144 | begin_time = timer() 145 | for it, batch_sample in iterator: 146 | if it == len(data_generator): 147 | break 148 | batch_x = batch_sample['waveform'] 149 | batch_x.require_grad = False 150 | if cuda_enabled: 151 | batch_x = batch_x.cuda(non_blocking=True) 152 | batch_y = af_extractor(batch_x).transpose(0, 1) 153 | C, _, _, F = batch_y.shape 154 | features.append(batch_y.reshape(C, -1, F).cpu().numpy()) 155 | iterator.close() 156 | features = np.concatenate(features, axis=1) 157 | mean = [] 158 | std = [] 159 | 160 | for ch in range(C): 161 | mean.append(np.mean(features[ch], axis=0, keepdims=True)) 162 | std.append(np.std(features[ch], axis=0, keepdims=True)) 163 | mean = np.stack(mean)[None, ...] 164 | std = np.stack(std)[None, ...] 
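        # Descriptive note (added): mean and std are stacked to shape (1, C, 1, F), i.e. per-channel,
        # per-frequency statistics with singleton batch and time axes, so they can broadcast over
        # (batch, channel, time, freq) feature tensors when the saved scalar is used for standardization.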
165 | 166 | # save to h5py 167 | with h5py.File(self.scalar_path, 'w') as hf: 168 | hf.create_dataset(name='mean', data=mean, dtype=np.float32) 169 | hf.create_dataset(name='std', data=std, dtype=np.float32) 170 | print("\nScalar saved to {}\n".format(str(self.scalar_path))) 171 | print("Extacting scalar finished! Time spent: {:.3f} s\n".format(timer() - begin_time)) 172 | 173 | def extract_meta(self): 174 | """ Extract meta .csv file and re-organize the meta data and store it to hdf5 file. 175 | 176 | """ 177 | print('Converting meta file to hdf5 file starts......\n') 178 | 179 | shutil.rmtree(str(self.meta_h5_dir), ignore_errors=True) 180 | self.meta_h5_dir.mkdir(parents=True, exist_ok=True) 181 | 182 | if self.cfg['dataset'] == 'dcase2020task3': 183 | self.extract_meta_dcase2020task3() 184 | 185 | def extract_meta_dcase2020task3(self): 186 | num_frames = 600 187 | num_tracks = 2 188 | num_classes = 14 189 | meta_list = [path for path in sorted(self.meta_dir.glob('*.csv')) if not path.name.startswith('.')] 190 | iterator = tqdm(enumerate(meta_list), total=len(meta_list), unit='it') 191 | for idx, meta_file in iterator: 192 | fn = meta_file.stem 193 | df = pd.read_csv(meta_file, header=None) 194 | sed_label = np.zeros((num_frames, num_tracks, num_classes)) 195 | doa_label = np.zeros((num_frames, num_tracks, 3)) 196 | event_indexes = np.array([[None, None]] * num_frames) # event indexes of all frames 197 | track_numbers = np.array([[None, None]] * num_frames) # track number of all frames 198 | for row in df.iterrows(): 199 | frame_idx = row[1][0] 200 | event_idx = row[1][1] 201 | track_number = row[1][2] 202 | azi = row[1][3] 203 | elev = row[1][4] 204 | 205 | ##### track indexing ##### 206 | # default assign current_track_idx to the first available track 207 | current_event_indexes = event_indexes[frame_idx] 208 | current_track_indexes = np.where(current_event_indexes == None)[0].tolist() 209 | current_track_idx = current_track_indexes[0] 210 | 211 | # tracking from the last frame if the last frame is not empty 212 | last_event_indexes = np.array([None, None]) if frame_idx == 0 else event_indexes[frame_idx-1] 213 | last_track_indexes = np.where(last_event_indexes != None)[0].tolist() # event index of the last frame 214 | last_events_tracks = list(zip(event_indexes[frame_idx-1], track_numbers[frame_idx-1])) 215 | if last_track_indexes != []: 216 | for track_idx in last_track_indexes: 217 | if last_events_tracks[track_idx] == (event_idx, track_number): 218 | if current_track_idx != track_idx: # swap tracks if track_idx is not consistant 219 | sed_label[frame_idx, [current_track_idx, track_idx]] = \ 220 | sed_label[frame_idx, [track_idx, current_track_idx]] 221 | doa_label[frame_idx, [current_track_idx, track_idx]] = \ 222 | doa_label[frame_idx, [track_idx, current_track_idx]] 223 | event_indexes[frame_idx, [current_track_idx, track_idx]] = \ 224 | event_indexes[frame_idx, [track_idx, current_track_idx]] 225 | track_numbers[frame_idx, [current_track_idx, track_idx]] = \ 226 | track_numbers[frame_idx, [track_idx, current_track_idx]] 227 | current_track_idx = track_idx 228 | ######################### 229 | 230 | # label encode 231 | azi_rad, elev_rad = azi * np.pi / 180, elev * np.pi / 180 232 | sed_label[frame_idx, current_track_idx, event_idx] = 1.0 233 | doa_label[frame_idx, current_track_idx, :] = np.cos(elev_rad) * np.cos(azi_rad), \ 234 | np.cos(elev_rad) * np.sin(azi_rad), np.sin(elev_rad) 235 | event_indexes[frame_idx, current_track_idx] = event_idx 236 | track_numbers[frame_idx, 
current_track_idx] = track_number 237 | 238 | meta_h5_path = self.meta_h5_dir.joinpath(fn + '.h5') 239 | with h5py.File(meta_h5_path, 'w') as hf: 240 | hf.create_dataset(name='sed_label', data=sed_label, dtype=np.float32) 241 | hf.create_dataset(name='doa_label', data=doa_label, dtype=np.float32) 242 | 243 | tqdm.write('{}, {}'.format(idx, meta_h5_path)) 244 | 245 | -------------------------------------------------------------------------------- /seld/learning/train.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from timeit import default_timer as timer 3 | 4 | from tqdm import tqdm 5 | 6 | from utils.common import print_metrics 7 | 8 | 9 | def train(cfg, **initializer): 10 | """Train 11 | 12 | """ 13 | writer = initializer['writer'] 14 | train_generator = initializer['train_generator'] 15 | valid_generator = initializer['valid_generator'] 16 | lr_scheduler = initializer['lr_scheduler'] 17 | trainer = initializer['trainer'] 18 | ckptIO = initializer['ckptIO'] 19 | epoch_it = initializer['epoch_it'] 20 | it = initializer['it'] 21 | 22 | batchNum_per_epoch = len(train_generator) 23 | max_epoch = cfg['training']['max_epoch'] 24 | 25 | logging.info('===> Training mode\n') 26 | iterator = tqdm(train_generator, total=max_epoch*batchNum_per_epoch-it, unit='it') 27 | train_begin_time = timer() 28 | for batch_sample in iterator: 29 | 30 | epoch_it, rem_batch = it // batchNum_per_epoch, it % batchNum_per_epoch 31 | 32 | ################ 33 | ## Validation 34 | ################ 35 | if it % int(1*batchNum_per_epoch) == 0: 36 | valid_begin_time = timer() 37 | train_time = valid_begin_time - train_begin_time 38 | 39 | train_losses = trainer.validate_step(valid_type='train', epoch_it=epoch_it) 40 | for k, v in train_losses.items(): 41 | train_losses[k] = v / batchNum_per_epoch 42 | 43 | if cfg['training']['valid_fold']: 44 | valid_losses, valid_metrics = trainer.validate_step( 45 | generator=valid_generator, 46 | valid_type='valid', 47 | epoch_it=epoch_it 48 | ) 49 | valid_time = timer() - valid_begin_time 50 | 51 | writer.add_scalar('train/lr', lr_scheduler.get_last_lr()[0], it) 52 | logging.info('---------------------------------------------------------------------------------------------------' 53 | +'------------------------------------------------------') 54 | logging.info('Iter: {}, Epoch/Total Epoch: {}/{}, Batch/Total Batch: {}/{}'.format( 55 | it, epoch_it, max_epoch, rem_batch, batchNum_per_epoch)) 56 | print_metrics(logging, writer, train_losses, it, set_type='train') 57 | if cfg['training']['valid_fold']: 58 | print_metrics(logging, writer, valid_losses, it, set_type='valid') 59 | if cfg['training']['valid_fold']: 60 | print_metrics(logging, writer, valid_metrics, it, set_type='valid') 61 | logging.info('Train time: {:.3f}s, Valid time: {:.3f}s, Lr: {}'.format( 62 | train_time, valid_time, lr_scheduler.get_last_lr()[0])) 63 | if 'PIT_type' in cfg['training']: 64 | logging.info('PIT type: {}'.format(cfg['training']['PIT_type'])) 65 | logging.info('---------------------------------------------------------------------------------------------------' 66 | +'------------------------------------------------------') 67 | 68 | train_begin_time = timer() 69 | 70 | ############### 71 | ## Save model 72 | ############### 73 | if rem_batch == 0 and it > 0: 74 | if cfg['training']['valid_fold']: 75 | ckptIO.save(epoch_it, it, metrics=valid_metrics, key_rank='seld20', rank_order='latest') 76 | else: 77 | ckptIO.save(epoch_it, it, 
metrics=train_losses, key_rank='loss_all', rank_order='latest') 78 | 79 | ############### 80 | ## Finish training 81 | ############### 82 | if it == max_epoch * batchNum_per_epoch: 83 | iterator.close() 84 | break 85 | 86 | ############### 87 | ## Train 88 | ############### 89 | trainer.train_step(batch_sample, epoch_it) 90 | if rem_batch == 0 and it > 0: 91 | lr_scheduler.step() 92 | 93 | it += 1 94 | 95 | iterator.close() 96 | 97 | -------------------------------------------------------------------------------- /seld/main.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from pathlib import Path 3 | 4 | from learning import evaluate, initialize, infer, preprocess, train 5 | from utils.cli_parser import parse_cli_overides 6 | from utils.config import get_dataset 7 | 8 | 9 | def main(args, cfg): 10 | """Execute a task based on the given command-line arguments. 11 | 12 | This function is the main entry-point of the program. It allows the 13 | user to extract features, train a model, infer predictions, and 14 | evaluate predictions using the command-line interface. 15 | 16 | Args: 17 | args: command line arguments. 18 | Return: 19 | 0: successful termination 20 | 'any nonzero value': abnormal termination 21 | """ 22 | 23 | # Create workspace 24 | Path(cfg['workspace_dir']).mkdir(parents=True, exist_ok=True) 25 | 26 | # Dataset initialization 27 | dataset = get_dataset(dataset_name=cfg['dataset'], root_dir=cfg['dataset_dir']) 28 | 29 | # Preprocess 30 | if args.mode == 'preprocess': 31 | preprocessor = preprocess.Preprocessor(args, cfg, dataset) 32 | 33 | if args.preproc_mode == 'extract_data': 34 | preprocessor.extract_data() 35 | elif args.preproc_mode == 'extract_scalar': 36 | preprocessor.extract_scalar() 37 | elif args.preproc_mode == 'extract_meta': 38 | preprocessor.extract_meta() 39 | 40 | # Train 41 | if args.mode == 'train': 42 | train_initializer = initialize.init_train(args, cfg, dataset) 43 | train.train(cfg, **train_initializer) 44 | 45 | # Inference 46 | elif args.mode == 'infer': 47 | infer_initializer = initialize.init_infer(args, cfg, dataset) 48 | infer.infer(cfg, dataset, **infer_initializer) 49 | 50 | # Evaluate 51 | elif args.mode == 'evaluate': 52 | evaluate.evaluate(cfg, dataset) 53 | 54 | return 0 55 | 56 | 57 | if __name__ == '__main__': 58 | args, cfg = parse_cli_overides() 59 | sys.exit(main(args, cfg)) 60 | -------------------------------------------------------------------------------- /seld/methods/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/seld/methods/__init__.py -------------------------------------------------------------------------------- /seld/methods/data.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | import h5py 4 | import numpy as np 5 | import torch 6 | from torch.utils.data import Dataset, Sampler 7 | 8 | from methods.utils.data_utilities import _segment_index 9 | from utils.common import int16_samples_to_float32 10 | 11 | 12 | class BaseDataset(Dataset): 13 | """ User defined datset 14 | 15 | """ 16 | def __init__(self, args, cfg, dataset): 17 | """ 18 | Args: 19 | args: input args 20 | cfg: configurations 21 | dataset: dataset used 22 | """ 23 | super().__init__() 24 | 25 | self.args = args 26 | self.sample_rate = cfg['data']['sample_rate'] 27 | self.clip_length = 
dataset.clip_length 28 | 29 | # Chunklen and hoplen and segmentation. Since all of the clips are 60s long, it only segments once here 30 | data = np.zeros((1, self.clip_length * self.sample_rate)) 31 | chunklen = int(10 * self.sample_rate) 32 | self.segmented_indexes, self.segmented_pad_width = _segment_index(data, chunklen, hoplen=chunklen) 33 | self.num_segments = len(self.segmented_indexes) 34 | 35 | # Data path 36 | data_sr_folder_name = '{}fs'.format(self.sample_rate) 37 | main_data_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('data').joinpath(data_sr_folder_name) 38 | self.data_dir = main_data_dir.joinpath('dev').joinpath(cfg['data']['type']) 39 | self.fn_list = [path.stem for path in sorted(self.data_dir.glob('*.h5')) \ 40 | if not path.name.startswith('.')] 41 | self.fn_list = [fn + '%' + str(n) for fn in self.fn_list for n in range(self.num_segments)] 42 | 43 | def __len__(self): 44 | """Get length of the dataset 45 | 46 | """ 47 | return len(self.fn_list) 48 | 49 | def __getitem__(self, idx): 50 | """ 51 | Read features from the dataset 52 | """ 53 | fn_segment = self.fn_list[idx] 54 | fn, n_segment = fn_segment.split('%')[0], int(fn_segment.split('%')[1]) 55 | data_path = self.data_dir.joinpath(fn + '.h5') 56 | index_begin = self.segmented_indexes[n_segment][0] 57 | index_end = self.segmented_indexes[n_segment][1] 58 | pad_width_before = self.segmented_pad_width[n_segment][0] 59 | pad_width_after = self.segmented_pad_width[n_segment][1] 60 | with h5py.File(data_path, 'r') as hf: 61 | x = int16_samples_to_float32(hf['waveform'][:, index_begin: index_end]) 62 | pad_width = ((0, 0), (pad_width_before, pad_width_after)) 63 | x = np.pad(x, pad_width, mode='constant') 64 | sample = { 65 | 'waveform': x 66 | } 67 | 68 | return sample 69 | 70 | 71 | class UserBatchSampler(Sampler): 72 | """User defined batch sampler. Only for train set. 
73 | 74 | """ 75 | def __init__(self, clip_num, batch_size, seed=2020): 76 | self.clip_num = clip_num 77 | self.batch_size = batch_size 78 | self.random_state = np.random.RandomState(seed) 79 | 80 | self.indexes = np.arange(self.clip_num) 81 | self.random_state.shuffle(self.indexes) 82 | self.pointer = 0 83 | 84 | def get_state(self): 85 | sampler_state = { 86 | 'random': self.random_state.get_state(), 87 | 'indexes': self.indexes, 88 | 'pointer': self.pointer 89 | } 90 | return sampler_state 91 | 92 | def set_state(self, sampler_state): 93 | self.random_state.set_state(sampler_state['random']) 94 | self.indexes = sampler_state['indexes'] 95 | self.pointer = sampler_state['pointer'] 96 | 97 | def __iter__(self): 98 | """ 99 | Return: 100 | batch_indexes (int): indexes of batch 101 | """ 102 | while True: 103 | if self.pointer >= self.clip_num: 104 | self.pointer = 0 105 | self.random_state.shuffle(self.indexes) 106 | 107 | batch_indexes = self.indexes[self.pointer: self.pointer + self.batch_size] 108 | self.pointer += self.batch_size 109 | yield batch_indexes 110 | 111 | def __len__(self): 112 | return (self.clip_num + self.batch_size - 1) // self.batch_size 113 | 114 | 115 | class PinMemCustomBatch: 116 | def __init__(self, batch_dict): 117 | batch_x = [] 118 | for n in range(len(batch_dict)): 119 | batch_x.append(batch_dict[n]['waveform']) 120 | self.batch_out_dict = { 121 | 'waveform': torch.tensor(batch_x, dtype=torch.float32), 122 | } 123 | 124 | def pin_memory(self): 125 | self.batch_out_dict['waveform'] = self.batch_out_dict['waveform'].pin_memory() 126 | return self.batch_out_dict 127 | 128 | 129 | def collate_fn(batch_dict): 130 | """ 131 | Merges a list of samples to form a mini-batch 132 | Pin memory for customized dataset 133 | """ 134 | return PinMemCustomBatch(batch_dict) 135 | -------------------------------------------------------------------------------- /seld/methods/data_augmentation/__init__.py: -------------------------------------------------------------------------------- 1 | # from .rotate import * 2 | # from .specaug import * 3 | # from .crop import * 4 | -------------------------------------------------------------------------------- /seld/methods/ein_seld/__init__.py: -------------------------------------------------------------------------------- 1 | from . 
import (data, data_augmentation, inference, losses, metrics, models, 2 | training) 3 | 4 | __all__ = [ 5 | data_augmentation, models, data, training, inference, losses, metrics 6 | ] 7 | -------------------------------------------------------------------------------- /seld/methods/ein_seld/data.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | from timeit import default_timer as timer 3 | 4 | import h5py 5 | import numpy as np 6 | import torch 7 | from methods.utils.data_utilities import (_segment_index, load_dcase_format, 8 | to_metrics2020_format) 9 | from torch.utils.data import Dataset, Sampler 10 | from tqdm import tqdm 11 | from utils.common import int16_samples_to_float32 12 | 13 | 14 | class UserDataset(Dataset): 15 | """ User defined datset 16 | 17 | """ 18 | def __init__(self, args, cfg, dataset, dataset_type='train', overlap=''): 19 | """ 20 | Args: 21 | args: input args 22 | cfg: configurations 23 | dataset: dataset used 24 | dataset_type: 'train' | 'valid' | 'dev_test' | 'eval_test' 25 | overlap: '1' | '2' 26 | """ 27 | super().__init__() 28 | 29 | self.dataset_type = dataset_type 30 | self.read_into_mem = args.read_into_mem 31 | self.sample_rate = cfg['data']['sample_rate'] 32 | self.clip_length = dataset.clip_length 33 | self.label_resolution = dataset.label_resolution 34 | self.frame_length = int(self.clip_length / self.label_resolution) 35 | self.label_interp_ratio = int(self.label_resolution * self.sample_rate / cfg['data']['hop_length']) 36 | 37 | # Chunklen and hoplen and segmentation. Since all of the clips are 60s long, it only segments once here 38 | data = np.zeros((1, self.clip_length * self.sample_rate)) 39 | if 'train' in self.dataset_type: 40 | chunklen = int(cfg['data']['train_chunklen_sec'] * self.sample_rate) 41 | hoplen = int(cfg['data']['train_hoplen_sec'] * self.sample_rate) 42 | self.segmented_indexes, self.segmented_pad_width = _segment_index(data, chunklen, hoplen) 43 | elif self.dataset_type in ['valid', 'dev_test', 'eval_test']: 44 | chunklen = int(cfg['data']['test_chunklen_sec'] * self.sample_rate) 45 | hoplen = int(cfg['data']['test_hoplen_sec'] * self.sample_rate) 46 | self.segmented_indexes, self.segmented_pad_width = _segment_index(data, chunklen, hoplen, last_frame_always_paddding=True) 47 | self.num_segments = len(self.segmented_indexes) 48 | 49 | # Data and meta path 50 | fold_str_idx = dataset.fold_str_index 51 | ov_str_idx = dataset.ov_str_index 52 | data_sr_folder_name = '{}fs'.format(self.sample_rate) 53 | main_data_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('data').joinpath(data_sr_folder_name) 54 | dev_data_dir = main_data_dir.joinpath('dev').joinpath(cfg['data']['type']) 55 | eval_data_dir = main_data_dir.joinpath('eval').joinpath(cfg['data']['type']) 56 | main_meta_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('meta') 57 | dev_meta_dir = main_meta_dir.joinpath('dev') 58 | eval_meta_dir = main_meta_dir.joinpath('eval') 59 | if self.dataset_type == 'train': 60 | data_dirs = [dev_data_dir] 61 | self.meta_dir = dev_meta_dir 62 | train_fold = [int(fold.strip()) for fold in str(cfg['training']['train_fold']).split(',')] 63 | ov_set = str(cfg['training']['overlap']) if not overlap else overlap 64 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \ 65 | if int(path.stem[fold_str_idx]) in train_fold and path.stem[ov_str_idx] in ov_set \ 66 | and not path.name.startswith('.')] 67 | elif 
self.dataset_type == 'valid': 68 | if cfg['training']['valid_fold'] != 'eval': 69 | data_dirs = [dev_data_dir] 70 | self.meta_dir = dev_meta_dir 71 | valid_fold = [int(fold.strip()) for fold in str(cfg['training']['valid_fold']).split(',')] 72 | ov_set = str(cfg['training']['overlap']) if not overlap else overlap 73 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \ 74 | if int(path.stem[fold_str_idx]) in valid_fold and path.stem[ov_str_idx] in ov_set \ 75 | and not path.name.startswith('.')] 76 | ori_meta_dir = Path(cfg['dataset_dir']).joinpath('metadata_dev') 77 | else: 78 | data_dirs = [eval_data_dir] 79 | self.meta_dir = eval_meta_dir 80 | ov_set = str(cfg['training']['overlap']) if not overlap else overlap 81 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \ 82 | if not path.name.startswith('.')] 83 | ori_meta_dir = Path(cfg['dataset_dir']).joinpath('metadata_eval') 84 | frame_begin_index = 0 85 | self.valid_gt_sed_metrics2019 = [] 86 | self.valid_gt_doa_metrics2019 = [] 87 | self.valid_gt_dcaseformat = {} 88 | for path in self.paths_list: 89 | ori_meta_path = ori_meta_dir.joinpath(path.stem + '.csv') 90 | output_dict, sed_metrics2019, doa_metrics2019 = \ 91 | load_dcase_format(ori_meta_path, frame_begin_index=frame_begin_index, 92 | frame_length=self.frame_length, num_classes=len(dataset.label_set)) 93 | self.valid_gt_dcaseformat.update(output_dict) 94 | self.valid_gt_sed_metrics2019.append(sed_metrics2019) 95 | self.valid_gt_doa_metrics2019.append(doa_metrics2019) 96 | frame_begin_index += self.frame_length 97 | self.valid_gt_sed_metrics2019 = np.concatenate(self.valid_gt_sed_metrics2019, axis=0) 98 | self.valid_gt_doa_metrics2019 = np.concatenate(self.valid_gt_doa_metrics2019, axis=0) 99 | self.gt_metrics2020_dict = to_metrics2020_format(self.valid_gt_dcaseformat, 100 | self.valid_gt_sed_metrics2019.shape[0], label_resolution=self.label_resolution) 101 | elif self.dataset_type == 'dev_test': 102 | data_dirs = [dev_data_dir] 103 | self.meta_dir = dev_meta_dir 104 | dev_test_fold = [int(fold.strip()) for fold in str(cfg['inference']['test_fold']).split(',')] 105 | ov_set = str(cfg['inference']['overlap']) if not overlap else overlap 106 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \ 107 | if int(path.stem[fold_str_idx]) in dev_test_fold and path.stem[ov_str_idx] in ov_set \ 108 | and not path.name.startswith('.')] 109 | elif self.dataset_type == 'eval_test': 110 | data_dirs = [eval_data_dir] 111 | self.meta_dir = eval_meta_dir 112 | self.paths_list = [path for data_dir in data_dirs for path in sorted(data_dir.glob('*.h5')) \ 113 | if not path.name.startswith('.')] 114 | self.paths_list = [Path(str(path) + '%' + str(n)) for path in self.paths_list for n in range(self.num_segments)] 115 | 116 | # Read into memory 117 | if self.read_into_mem: 118 | load_begin_time = timer() 119 | print('Start to load dataset: {}, ov={}......\n'.format(self.dataset_type + ' set', ov_set)) 120 | iterator = tqdm(self.paths_list, total=len(self.paths_list), unit='clips') 121 | self.dataset_list = [] 122 | for path in iterator: 123 | fn, n_segment = path.stem, int(path.name.split('%')[1]) 124 | data_path = Path(str(path).split('%')[0]) 125 | index_begin = self.segmented_indexes[n_segment][0] 126 | index_end = self.segmented_indexes[n_segment][1] 127 | pad_width_before = self.segmented_pad_width[n_segment][0] 128 | pad_width_after = self.segmented_pad_width[n_segment][1] 
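# NOTE (illustrative, values not taken from the config shown here): index_begin, index_end
# and the pad widths above are in samples, while the label arrays read further below are
# indexed in label frames, obtained by dividing the sample indexes by
# (sample_rate * label_resolution). For example, with sample_rate = 24000 Hz and
# label_resolution = 0.1 s, one label frame covers 2400 samples, so a 4 s chunk of
# 96000 samples maps to 40 label frames; the actual values come from seld.yaml and the
# dataset class.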
129 | with h5py.File(data_path, 'r') as hf: 130 | x = int16_samples_to_float32(hf['waveform'][:, index_begin: index_end]) 131 | pad_width = ((0, 0), (pad_width_before, pad_width_after)) 132 | x = np.pad(x, pad_width, mode='constant') 133 | if 'test' not in self.dataset_type: 134 | ov = fn[-1] 135 | index_begin_label = int(index_begin / (self.sample_rate * self.label_resolution)) 136 | index_end_label = int(index_end / (self.sample_rate * self.label_resolution)) 137 | # pad_width_before_label = int(pad_width_before / (self.sample_rate * self.label_resolution)) 138 | pad_width_after_label = int(pad_width_after / (self.sample_rate * self.label_resolution)) 139 | meta_path = self.meta_dir.joinpath(fn + '.h5') 140 | with h5py.File(meta_path, 'r') as hf: 141 | sed_label = hf['sed_label'][index_begin_label: index_end_label, ...] 142 | doa_label = hf['doa_label'][index_begin_label: index_end_label, ...] # NOTE: this is Catesian coordinates 143 | if pad_width_after_label != 0: 144 | sed_label_new = np.zeros((pad_width_after_label, 2, 14)) 145 | doa_label_new = np.zeros((pad_width_after_label, 2, 3)) 146 | sed_label = np.concatenate((sed_label, sed_label_new), axis=0) 147 | doa_label = np.concatenate((doa_label, doa_label_new), axis=0) 148 | self.dataset_list.append({ 149 | 'filename': fn, 150 | 'n_segment': n_segment, 151 | 'ov': ov, 152 | 'waveform': x, 153 | 'sed_label': sed_label, 154 | 'doa_label': doa_label 155 | }) 156 | else: 157 | self.dataset_list.append({ 158 | 'filename': fn, 159 | 'n_segment': n_segment, 160 | 'waveform': x 161 | }) 162 | iterator.close() 163 | print('Loading dataset time: {:.3f}\n'.format(timer()-load_begin_time)) 164 | 165 | def __len__(self): 166 | """Get length of the dataset 167 | 168 | """ 169 | return len(self.paths_list) 170 | 171 | def __getitem__(self, idx): 172 | """ 173 | Read features from the dataset 174 | """ 175 | if self.read_into_mem: 176 | data_dict = self.dataset_list[idx] 177 | fn = data_dict['filename'] 178 | n_segment = data_dict['n_segment'] 179 | x = data_dict['waveform'] 180 | if 'test' not in self.dataset_type: 181 | ov = data_dict['ov'] 182 | sed_label = data_dict['sed_label'] 183 | doa_label = data_dict['doa_label'] 184 | else: 185 | path = self.paths_list[idx] 186 | fn, n_segment = path.stem, int(path.name.split('%')[1]) 187 | data_path = Path(str(path).split('%')[0]) 188 | index_begin = self.segmented_indexes[n_segment][0] 189 | index_end = self.segmented_indexes[n_segment][1] 190 | pad_width_before = self.segmented_pad_width[n_segment][0] 191 | pad_width_after = self.segmented_pad_width[n_segment][1] 192 | with h5py.File(data_path, 'r') as hf: 193 | x = int16_samples_to_float32(hf['waveform'][:, index_begin: index_end]) 194 | pad_width = ((0, 0), (pad_width_before, pad_width_after)) 195 | x = np.pad(x, pad_width, mode='constant') 196 | if 'test' not in self.dataset_type: 197 | ov = fn[-1] 198 | index_begin_label = int(index_begin / (self.sample_rate * self.label_resolution)) 199 | index_end_label = int(index_end / (self.sample_rate * self.label_resolution)) 200 | # pad_width_before_label = int(pad_width_before / (self.sample_rate * self.label_resolution)) 201 | pad_width_after_label = int(pad_width_after / (self.sample_rate * self.label_resolution)) 202 | meta_path = self.meta_dir.joinpath(fn + '.h5') 203 | with h5py.File(meta_path, 'r') as hf: 204 | sed_label = hf['sed_label'][index_begin_label: index_end_label, ...] 205 | doa_label = hf['doa_label'][index_begin_label: index_end_label, ...] 
# NOTE: this is Catesian coordinates 206 | if pad_width_after_label != 0: 207 | sed_label_new = np.zeros((pad_width_after_label, 2, 14)) 208 | doa_label_new = np.zeros((pad_width_after_label, 2, 3)) 209 | sed_label = np.concatenate((sed_label, sed_label_new), axis=0) 210 | doa_label = np.concatenate((doa_label, doa_label_new), axis=0) 211 | 212 | if 'test' not in self.dataset_type: 213 | sample = { 214 | 'filename': fn, 215 | 'n_segment': n_segment, 216 | 'ov': ov, 217 | 'waveform': x, 218 | 'sed_label': sed_label, 219 | 'doa_label': doa_label 220 | } 221 | else: 222 | sample = { 223 | 'filename': fn, 224 | 'n_segment': n_segment, 225 | 'waveform': x 226 | } 227 | 228 | return sample 229 | 230 | 231 | class UserBatchSampler(Sampler): 232 | """User defined batch sampler. Only for train set. 233 | 234 | """ 235 | def __init__(self, clip_num, batch_size, seed=2020): 236 | self.clip_num = clip_num 237 | self.batch_size = batch_size 238 | self.random_state = np.random.RandomState(seed) 239 | 240 | self.indexes = np.arange(self.clip_num) 241 | self.random_state.shuffle(self.indexes) 242 | self.pointer = 0 243 | 244 | def get_state(self): 245 | sampler_state = { 246 | 'random': self.random_state.get_state(), 247 | 'indexes': self.indexes, 248 | 'pointer': self.pointer 249 | } 250 | return sampler_state 251 | 252 | def set_state(self, sampler_state): 253 | self.random_state.set_state(sampler_state['random']) 254 | self.indexes = sampler_state['indexes'] 255 | self.pointer = sampler_state['pointer'] 256 | 257 | def __iter__(self): 258 | """ 259 | Return: 260 | batch_indexes (int): indexes of batch 261 | """ 262 | while True: 263 | if self.pointer >= self.clip_num: 264 | self.pointer = 0 265 | self.random_state.shuffle(self.indexes) 266 | 267 | batch_indexes = self.indexes[self.pointer: self.pointer + self.batch_size] 268 | self.pointer += self.batch_size 269 | yield batch_indexes 270 | 271 | def __len__(self): 272 | return (self.clip_num + self.batch_size - 1) // self.batch_size 273 | 274 | 275 | class PinMemCustomBatch: 276 | def __init__(self, batch_dict): 277 | batch_fn = [] 278 | batch_n_segment = [] 279 | batch_ov = [] 280 | batch_x = [] 281 | batch_sed_label = [] 282 | batch_doa_label = [] 283 | 284 | for n in range(len(batch_dict)): 285 | batch_fn.append(batch_dict[n]['filename']) 286 | batch_n_segment.append(batch_dict[n]['n_segment']) 287 | batch_ov.append(batch_dict[n]['ov']) 288 | batch_x.append(batch_dict[n]['waveform']) 289 | batch_sed_label.append(batch_dict[n]['sed_label']) 290 | batch_doa_label.append(batch_dict[n]['doa_label']) 291 | 292 | self.batch_out_dict = { 293 | 'filename': batch_fn, 294 | 'n_segment': batch_n_segment, 295 | 'ov': batch_ov, 296 | 'waveform': torch.tensor(batch_x, dtype=torch.float32), 297 | 'sed_label': torch.tensor(batch_sed_label, dtype=torch.float32), 298 | 'doa_label': torch.tensor(batch_doa_label, dtype=torch.float32), 299 | } 300 | 301 | def pin_memory(self): 302 | self.batch_out_dict['waveform'] = self.batch_out_dict['waveform'].pin_memory() 303 | self.batch_out_dict['sed_label'] = self.batch_out_dict['sed_label'].pin_memory() 304 | self.batch_out_dict['doa_label'] = self.batch_out_dict['doa_label'].pin_memory() 305 | return self.batch_out_dict 306 | 307 | 308 | def collate_fn(batch_dict): 309 | """ 310 | Merges a list of samples to form a mini-batch 311 | Pin memory for customized dataset 312 | """ 313 | return PinMemCustomBatch(batch_dict) 314 | 315 | 316 | class PinMemCustomBatchTest: 317 | def __init__(self, batch_dict): 318 | batch_fn = [] 319 | 
batch_n_segment = [] 320 | batch_x = [] 321 | 322 | for n in range(len(batch_dict)): 323 | batch_fn.append(batch_dict[n]['filename']) 324 | batch_n_segment.append(batch_dict[n]['n_segment']) 325 | batch_x.append(batch_dict[n]['waveform']) 326 | 327 | self.batch_out_dict = { 328 | 'filename': batch_fn, 329 | 'n_segment': batch_n_segment, 330 | 'waveform': torch.tensor(batch_x, dtype=torch.float32) 331 | } 332 | 333 | def pin_memory(self): 334 | self.batch_out_dict['waveform'] = self.batch_out_dict['waveform'].pin_memory() 335 | return self.batch_out_dict 336 | 337 | 338 | def collate_fn_test(batch_dict): 339 | """ 340 | Merges a list of samples to form a mini-batch 341 | Pin memory for customized dataset 342 | """ 343 | return PinMemCustomBatchTest(batch_dict) 344 | -------------------------------------------------------------------------------- /seld/methods/ein_seld/data_augmentation/__init__.py: -------------------------------------------------------------------------------- 1 | # build your data augmentation method in this folder 2 | # from .trackmix import * 3 | -------------------------------------------------------------------------------- /seld/methods/ein_seld/inference.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | import h5py 4 | import numpy as np 5 | import torch 6 | from methods.inference import BaseInferer 7 | from tqdm import tqdm 8 | 9 | from .training import to_dcase_format 10 | 11 | 12 | class Inferer(BaseInferer): 13 | 14 | def __init__(self, cfg, dataset, af_extractor, model, cuda): 15 | super().__init__() 16 | self.cfg = cfg 17 | self.af_extractor = af_extractor 18 | self.model = model 19 | self.cuda = cuda 20 | 21 | # Scalar 22 | scalar_h5_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('scalar') 23 | fn_scalar = '{}_{}_sr{}_nfft{}_hop{}_mel{}.h5'.format(cfg['data']['type'], cfg['data']['audio_feature'], 24 | cfg['data']['sample_rate'], cfg['data']['n_fft'], cfg['data']['hop_length'], cfg['data']['n_mels']) 25 | scalar_path = scalar_h5_dir.joinpath(fn_scalar) 26 | with h5py.File(scalar_path, 'r') as hf: 27 | self.mean = hf['mean'][:] 28 | self.std = hf['std'][:] 29 | if cuda: 30 | self.mean = torch.tensor(self.mean, dtype=torch.float32).cuda() 31 | self.std = torch.tensor(self.std, dtype=torch.float32).cuda() 32 | 33 | self.label_resolution = dataset.label_resolution 34 | self.label_interp_ratio = int(self.label_resolution * cfg['data']['sample_rate'] / cfg['data']['hop_length']) 35 | 36 | def infer(self, generator): 37 | fn_list, n_segment_list = [], [] 38 | pred_sed_list, pred_doa_list = [], [] 39 | 40 | iterator = tqdm(generator) 41 | for batch_sample in iterator: 42 | batch_x = batch_sample['waveform'] 43 | if self.cuda: 44 | batch_x = batch_x.cuda(non_blocking=True) 45 | with torch.no_grad(): 46 | self.af_extractor.eval() 47 | self.model.eval() 48 | batch_x = self.af_extractor(batch_x) 49 | batch_x = (batch_x - self.mean) / self.std 50 | pred = self.model(batch_x) 51 | pred['sed'] = torch.sigmoid(pred['sed']) 52 | fn_list.append(batch_sample['filename']) 53 | n_segment_list.append(batch_sample['n_segment']) 54 | pred_sed_list.append(pred['sed'].cpu().detach().numpy()) 55 | pred_doa_list.append(pred['doa'].cpu().detach().numpy()) 56 | 57 | iterator.close() 58 | 59 | self.fn_list = [fn for row in fn_list for fn in row] 60 | self.n_segment_list = [n_segment for row in n_segment_list for n_segment in row] 61 | pred_sed = np.concatenate(pred_sed_list, axis=0) 62 | pred_doa = 
np.concatenate(pred_doa_list, axis=0) 63 | 64 | self.num_segments = max(self.n_segment_list) + 1 65 | origin_num_clips = int(pred_sed.shape[0]/self.num_segments) 66 | origin_T = int(pred_sed.shape[1]*self.num_segments) 67 | pred_sed = pred_sed.reshape((origin_num_clips, origin_T, 2, -1))[:, :int(60 / self.label_resolution)] 68 | pred_doa = pred_doa.reshape((origin_num_clips, origin_T, 2, -1))[:, :int(60 / self.label_resolution)] 69 | 70 | pred = { 71 | 'sed': pred_sed, 72 | 'doa': pred_doa 73 | } 74 | return pred 75 | 76 | def fusion(self, submissions_dir, preds): 77 | """ Ensamble predictions 78 | 79 | """ 80 | num_preds = len(preds) 81 | pred_sed = [] 82 | pred_doa = [] 83 | for n in range(num_preds): 84 | pred_sed.append(preds[n]['sed']) 85 | pred_doa.append(preds[n]['doa']) 86 | pred_sed = np.array(pred_sed).mean(axis=0) 87 | pred_doa = np.array(pred_doa).mean(axis=0) 88 | 89 | N, T = pred_sed.shape[:2] 90 | pred_sed_max = pred_sed.max(axis=-1) 91 | pred_sed_max_idx = pred_sed.argmax(axis=-1) 92 | pred_sed = np.zeros_like(pred_sed) 93 | for b_idx in range(N): 94 | for t_idx in range(T): 95 | for track_idx in range(2): 96 | pred_sed[b_idx, t_idx, track_idx, pred_sed_max_idx[b_idx, t_idx, track_idx]] = \ 97 | pred_sed_max[b_idx, t_idx, track_idx] 98 | pred_sed = (pred_sed > self.cfg['inference']['threshold_sed']).astype(np.float32) 99 | # convert Catesian to Spherical 100 | azi = np.arctan2(pred_doa[..., 1], pred_doa[..., 0]) 101 | elev = np.arctan2(pred_doa[..., 2], np.sqrt(pred_doa[..., 0]**2 + pred_doa[..., 1]**2)) 102 | pred_doa = np.stack((azi, elev), axis=-1) # (N, T, tracks, (azi, elev)) 103 | 104 | fn_list = self.fn_list[::self.num_segments] 105 | for n in range(pred_sed.shape[0]): 106 | fn = fn_list[n] 107 | pred_sed_f = pred_sed[n][None, ...] 108 | pred_doa_f = pred_doa[n][None, ...] 
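# NOTE: to_dcase_format (imported from training.py) converts the track-wise one-hot SED
# predictions and (azimuth, elevation) angles in radians into a dict keyed by frame index,
# e.g. {12: [[3, -40, 10]]} for class 3 at azimuth -40 deg / elevation 10 deg in frame 12
# (purely illustrative numbers). write_submission then writes each entry as a
# "frame,class,azimuth,elevation" row of the per-clip submission csv.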
109 | pred_dcase_format_dict = to_dcase_format(pred_sed_f, pred_doa_f) 110 | csv_path = submissions_dir.joinpath(fn + '.csv') 111 | self.write_submission(csv_path, pred_dcase_format_dict) 112 | print('Rsults are saved to {}\n'.format(str(submissions_dir))) 113 | 114 | -------------------------------------------------------------------------------- /seld/methods/ein_seld/losses.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from methods.utils.loss_utilities import BCEWithLogitsLoss, MSELoss 4 | 5 | 6 | class Losses: 7 | def __init__(self, cfg): 8 | 9 | self.cfg = cfg 10 | self.beta = cfg['training']['loss_beta'] 11 | 12 | self.losses = [BCEWithLogitsLoss(reduction='mean'), MSELoss(reduction='mean')] 13 | self.losses_pit = [BCEWithLogitsLoss(reduction='PIT'), MSELoss(reduction='PIT')] 14 | 15 | self.names = ['loss_all'] + [loss.name for loss in self.losses] 16 | 17 | def calculate(self, pred, target, epoch_it=0): 18 | 19 | if 'PIT' not in self.cfg['training']['PIT_type']: 20 | updated_target = target 21 | loss_sed = self.losses[0].calculate_loss(pred['sed'], updated_target['sed']) 22 | loss_doa = self.losses[1].calculate_loss(pred['doa'], updated_target['doa']) 23 | elif self.cfg['training']['PIT_type'] == 'tPIT': 24 | loss_sed, loss_doa, updated_target = self.tPIT(pred, target) 25 | 26 | loss_all = self.beta * loss_sed + (1 - self.beta) * loss_doa 27 | losses_dict = { 28 | 'all': loss_all, 29 | 'sed': loss_sed, 30 | 'doa': loss_doa, 31 | 'updated_target': updated_target 32 | } 33 | return losses_dict 34 | 35 | def tPIT(self, pred, target): 36 | """Frame Permutation Invariant Training for 2 possible combinations 37 | 38 | Args: 39 | pred: { 40 | 'sed': [batch_size, T, num_tracks=2, num_classes], 41 | 'doa': [batch_size, T, num_tracks=2, doas=3] 42 | } 43 | target: { 44 | 'sed': [batch_size, T, num_tracks=2, num_classes], 45 | 'doa': [batch_size, T, num_tracks=2, doas=3] 46 | } 47 | Return: 48 | updated_target: updated target with the minimum loss frame-wisely 49 | { 50 | 'sed': [batch_size, T, num_tracks=2, num_classes], 51 | 'doa': [batch_size, T, num_tracks=2, doas=3] 52 | } 53 | """ 54 | target_flipped = { 55 | 'sed': target['sed'].flip(dims=[2]), 56 | 'doa': target['doa'].flip(dims=[2]) 57 | } 58 | 59 | loss_sed1 = self.losses_pit[0].calculate_loss(pred['sed'], target['sed']) 60 | loss_sed2 = self.losses_pit[0].calculate_loss(pred['sed'], target_flipped['sed']) 61 | loss_doa1 = self.losses_pit[1].calculate_loss(pred['doa'], target['doa']) 62 | loss_doa2 = self.losses_pit[1].calculate_loss(pred['doa'], target_flipped['doa']) 63 | 64 | loss1 = loss_sed1 + loss_doa1 65 | loss2 = loss_sed2 + loss_doa2 66 | 67 | loss_sed = (loss_sed1 * (loss1 <= loss2) + loss_sed2 * (loss1 > loss2)).mean() 68 | loss_doa = (loss_doa1 * (loss1 <= loss2) + loss_doa2 * (loss1 > loss2)).mean() 69 | updated_target_sed = target['sed'].clone() * (loss1[:, :, None, None] <= loss2[:, :, None, None]) + \ 70 | target_flipped['sed'].clone() * (loss1[:, :, None, None] > loss2[:, :, None, None]) 71 | updated_target_doa = target['doa'].clone() * (loss1[:, :, None, None] <= loss2[:, :, None, None]) + \ 72 | target_flipped['doa'].clone() * (loss1[:, :, None, None] > loss2[:, :, None, None]) 73 | updated_target = { 74 | 'sed': updated_target_sed, 75 | 'doa': updated_target_doa 76 | } 77 | return loss_sed, loss_doa, updated_target 78 | -------------------------------------------------------------------------------- 
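Since there are only two tracks, the tPIT loss in `losses.py` above only has to compare two permutations per frame. Below is a minimal standalone sketch of that frame-wise permutation choice, using plain PyTorch on dummy tensors; it is illustrative only and omits the repository's `BCEWithLogitsLoss`/`MSELoss` wrappers, the `loss_beta` weighting between SED and DoA, and the `updated_target` bookkeeping that `Losses.tPIT` actually returns.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch=2, T=5 label frames, 2 tracks, 14 classes, 3 DoA axes
pred_sed   = torch.randn(2, 5, 2, 14)           # SED logits
pred_doa   = torch.rand(2, 5, 2, 3) * 2 - 1     # Cartesian DoA in [-1, 1]
target_sed = torch.randint(0, 2, (2, 5, 2, 14)).float()
target_doa = torch.rand(2, 5, 2, 3) * 2 - 1

def frame_loss(p_sed, p_doa, t_sed, t_doa):
    # Per-frame loss: BCE on SED logits plus MSE on DoA, reduced over tracks and classes only
    l_sed = F.binary_cross_entropy_with_logits(p_sed, t_sed, reduction='none').mean(dim=(2, 3))
    l_doa = F.mse_loss(p_doa, t_doa, reduction='none').mean(dim=(2, 3))
    return l_sed + l_doa                         # shape (batch, T)

# Loss for the identity track order and for the swapped track order
loss_orig = frame_loss(pred_sed, pred_doa, target_sed, target_doa)
loss_swap = frame_loss(pred_sed, pred_doa, target_sed.flip(dims=[2]), target_doa.flip(dims=[2]))

# Frame-wise, keep whichever permutation gives the lower loss, then average
loss = torch.where(loss_orig <= loss_swap, loss_orig, loss_swap).mean()
print(loss)
```

The real implementation additionally keeps the SED and DoA terms separate so that `loss_all = beta * loss_sed + (1 - beta) * loss_doa` can be reported and backpropagated.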
/seld/methods/ein_seld/metrics.py: -------------------------------------------------------------------------------- 1 | import methods.utils.SELD_evaluation_metrics_2019 as SELDMetrics2019 2 | from methods.utils.SELD_evaluation_metrics_2020 import \ 3 | SELDMetrics as SELDMetrics2020 4 | from methods.utils.SELD_evaluation_metrics_2020 import early_stopping_metric 5 | 6 | 7 | class Metrics(object): 8 | """Metrics for evaluation 9 | 10 | """ 11 | def __init__(self, dataset): 12 | 13 | self.metrics = [] 14 | self.names = ['ER20', 'F20', 'LE20', 'LR20', 'seld20', 'ER19', 'F19', 'LE19', 'LR19', 'seld19'] 15 | 16 | self.num_classes = len(dataset.label_set) 17 | self.doa_threshold = 20 # in deg 18 | self.num_frames_1s = int(1 / dataset.label_resolution) 19 | 20 | def calculate(self, pred_dict, gt_dict): 21 | 22 | # ER20: error rate, F20: F1-score, LE20: Location error, LR20: Location recall 23 | ER_19, F_19 = SELDMetrics2019.compute_sed_scores(pred_dict['dcase2019_sed'], gt_dict['dcase2019_sed'], \ 24 | self.num_frames_1s) 25 | LE_19, LR_19, _, _, _, _ = SELDMetrics2019.compute_doa_scores_regr( \ 26 | pred_dict['dcase2019_doa'], gt_dict['dcase2019_doa'], pred_dict['dcase2019_sed'], gt_dict['dcase2019_sed']) 27 | seld_score_19 = SELDMetrics2019.early_stopping_metric([ER_19, F_19], [LE_19, LR_19]) 28 | 29 | dcase2020_metric = SELDMetrics2020(nb_classes=self.num_classes, doa_threshold=self.doa_threshold) 30 | dcase2020_metric.update_seld_scores(pred_dict['dcase2020'], gt_dict['dcase2020']) 31 | ER_20, F_20, LE_20, LR_20 = dcase2020_metric.compute_seld_scores() 32 | seld_score_20 = early_stopping_metric([ER_20, F_20], [LE_20, LR_20]) 33 | 34 | metrics_scores = { 35 | 'ER20': ER_20, 36 | 'F20': F_20, 37 | 'LE20': LE_20, 38 | 'LR20': LR_20, 39 | 'seld20': seld_score_20, 40 | 'ER19': ER_19, 41 | 'F19': F_19, 42 | 'LE19': LE_19, 43 | 'LR19': LR_19, 44 | 'seld19': seld_score_19, 45 | } 46 | return metrics_scores 47 | -------------------------------------------------------------------------------- /seld/methods/ein_seld/models/__init__.py: -------------------------------------------------------------------------------- 1 | from .seld import * -------------------------------------------------------------------------------- /seld/methods/ein_seld/models/seld.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from methods.utils.model_utilities import (DoubleConv, PositionalEncoding, 5 | init_layer) 6 | 7 | 8 | class EINV2(nn.Module): 9 | def __init__(self, cfg, dataset): 10 | super().__init__() 11 | self.pe_enable = False # Ture | False 12 | 13 | if cfg['data']['audio_feature'] == 'logmel&intensity': 14 | self.f_bins = cfg['data']['n_mels'] 15 | self.in_channels = 7 16 | 17 | self.downsample_ratio = 2 ** 2 18 | self.sed_conv_block1 = nn.Sequential( 19 | DoubleConv(in_channels=4, out_channels=64), 20 | nn.AvgPool2d(kernel_size=(2, 2)), 21 | ) 22 | self.sed_conv_block2 = nn.Sequential( 23 | DoubleConv(in_channels=64, out_channels=128), 24 | nn.AvgPool2d(kernel_size=(2, 2)), 25 | ) 26 | self.sed_conv_block3 = nn.Sequential( 27 | DoubleConv(in_channels=128, out_channels=256), 28 | nn.AvgPool2d(kernel_size=(1, 2)), 29 | ) 30 | self.sed_conv_block4 = nn.Sequential( 31 | DoubleConv(in_channels=256, out_channels=512), 32 | nn.AvgPool2d(kernel_size=(1, 2)), 33 | ) 34 | 35 | self.doa_conv_block1 = nn.Sequential( 36 | DoubleConv(in_channels=self.in_channels, out_channels=64), 37 | 
nn.AvgPool2d(kernel_size=(2, 2)), 38 | ) 39 | self.doa_conv_block2 = nn.Sequential( 40 | DoubleConv(in_channels=64, out_channels=128), 41 | nn.AvgPool2d(kernel_size=(2, 2)), 42 | ) 43 | self.doa_conv_block3 = nn.Sequential( 44 | DoubleConv(in_channels=128, out_channels=256), 45 | nn.AvgPool2d(kernel_size=(1, 2)), 46 | ) 47 | self.doa_conv_block4 = nn.Sequential( 48 | DoubleConv(in_channels=256, out_channels=512), 49 | nn.AvgPool2d(kernel_size=(1, 2)), 50 | ) 51 | 52 | self.stitch = nn.ParameterList([ 53 | nn.Parameter(torch.FloatTensor(64, 2, 2).uniform_(0.1, 0.9)), 54 | nn.Parameter(torch.FloatTensor(128, 2, 2).uniform_(0.1, 0.9)), 55 | nn.Parameter(torch.FloatTensor(256, 2, 2).uniform_(0.1, 0.9)), 56 | ]) 57 | 58 | if self.pe_enable: 59 | self.pe = PositionalEncoding(pos_len=100, d_model=512, pe_type='t', dropout=0.0) 60 | self.sed_trans_track1 = nn.TransformerEncoder( 61 | nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024, dropout=0.2), num_layers=2) 62 | self.sed_trans_track2 = nn.TransformerEncoder( 63 | nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024, dropout=0.2), num_layers=2) 64 | self.doa_trans_track1 = nn.TransformerEncoder( 65 | nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024, dropout=0.2), num_layers=2) 66 | self.doa_trans_track2 = nn.TransformerEncoder( 67 | nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1024, dropout=0.2), num_layers=2) 68 | 69 | self.fc_sed_track1 = nn.Linear(512, 14, bias=True) 70 | self.fc_sed_track2 = nn.Linear(512, 14, bias=True) 71 | self.fc_doa_track1 = nn.Linear(512, 3, bias=True) 72 | self.fc_doa_track2 = nn.Linear(512, 3, bias=True) 73 | self.final_act_sed = nn.Sequential() # nn.Sigmoid() 74 | self.final_act_doa = nn.Tanh() 75 | 76 | self.init_weight() 77 | 78 | def init_weight(self): 79 | 80 | init_layer(self.fc_sed_track1) 81 | init_layer(self.fc_sed_track2) 82 | init_layer(self.fc_doa_track1) 83 | init_layer(self.fc_doa_track2) 84 | 85 | def forward(self, x): 86 | """ 87 | x: waveform, (batch_size, num_channels, data_length) 88 | """ 89 | x_sed = x[:, :4] 90 | x_doa = x 91 | 92 | # cnn 93 | x_sed = self.sed_conv_block1(x_sed) 94 | x_doa = self.doa_conv_block1(x_doa) 95 | x_sed = torch.einsum('c, nctf -> nctf', self.stitch[0][:, 0, 0], x_sed) + \ 96 | torch.einsum('c, nctf -> nctf', self.stitch[0][:, 0, 1], x_doa) 97 | x_doa = torch.einsum('c, nctf -> nctf', self.stitch[0][:, 1, 0], x_sed) + \ 98 | torch.einsum('c, nctf -> nctf', self.stitch[0][:, 1, 1], x_doa) 99 | x_sed = self.sed_conv_block2(x_sed) 100 | x_doa = self.doa_conv_block2(x_doa) 101 | x_sed = torch.einsum('c, nctf -> nctf', self.stitch[1][:, 0, 0], x_sed) + \ 102 | torch.einsum('c, nctf -> nctf', self.stitch[1][:, 0, 1], x_doa) 103 | x_doa = torch.einsum('c, nctf -> nctf', self.stitch[1][:, 1, 0], x_sed) + \ 104 | torch.einsum('c, nctf -> nctf', self.stitch[1][:, 1, 1], x_doa) 105 | x_sed = self.sed_conv_block3(x_sed) 106 | x_doa = self.doa_conv_block3(x_doa) 107 | x_sed = torch.einsum('c, nctf -> nctf', self.stitch[2][:, 0, 0], x_sed) + \ 108 | torch.einsum('c, nctf -> nctf', self.stitch[2][:, 0, 1], x_doa) 109 | x_doa = torch.einsum('c, nctf -> nctf', self.stitch[2][:, 1, 0], x_sed) + \ 110 | torch.einsum('c, nctf -> nctf', self.stitch[2][:, 1, 1], x_doa) 111 | x_sed = self.sed_conv_block4(x_sed) 112 | x_doa = self.doa_conv_block4(x_doa) 113 | x_sed = x_sed.mean(dim=3) # (N, C, T) 114 | x_doa = x_doa.mean(dim=3) # (N, C, T) 115 | 116 | # transformer 117 | if self.pe_enable: 118 | x_sed = 
self.pe(x_sed) 119 | if self.pe_enable: 120 | x_doa = self.pe(x_doa) 121 | x_sed = x_sed.permute(2, 0, 1) # (T, N, C) 122 | x_doa = x_doa.permute(2, 0, 1) # (T, N, C) 123 | 124 | x_sed_1 = self.sed_trans_track1(x_sed).transpose(0, 1) # (N, T, C) 125 | x_sed_2 = self.sed_trans_track2(x_sed).transpose(0, 1) # (N, T, C) 126 | x_doa_1 = self.doa_trans_track1(x_doa).transpose(0, 1) # (N, T, C) 127 | x_doa_2 = self.doa_trans_track2(x_doa).transpose(0, 1) # (N, T, C) 128 | 129 | # fc 130 | x_sed_1 = self.final_act_sed(self.fc_sed_track1(x_sed_1)) 131 | x_sed_2 = self.final_act_sed(self.fc_sed_track2(x_sed_2)) 132 | x_sed = torch.stack((x_sed_1, x_sed_2), 2) 133 | x_doa_1 = self.final_act_doa(self.fc_doa_track1(x_doa_1)) 134 | x_doa_2 = self.final_act_doa(self.fc_doa_track2(x_doa_2)) 135 | x_doa = torch.stack((x_doa_1, x_doa_2), 2) 136 | output = { 137 | 'sed': x_sed, 138 | 'doa': x_doa, 139 | } 140 | 141 | return output 142 | 143 | -------------------------------------------------------------------------------- /seld/methods/ein_seld/training.py: -------------------------------------------------------------------------------- 1 | import random 2 | from itertools import combinations 3 | from pathlib import Path 4 | 5 | import h5py 6 | import numpy as np 7 | import torch 8 | from methods.training import BaseTrainer 9 | from methods.utils.data_utilities import to_metrics2020_format 10 | 11 | 12 | class Trainer(BaseTrainer): 13 | 14 | def __init__(self, args, cfg, dataset, af_extractor, valid_set, model, optimizer, losses, metrics): 15 | 16 | super().__init__() 17 | self.cfg = cfg 18 | self.af_extractor = af_extractor 19 | self.model = model 20 | self.optimizer = optimizer 21 | self.losses = losses 22 | self.metrics = metrics 23 | self.cuda = args.cuda 24 | 25 | self.clip_length = dataset.clip_length 26 | self.label_resolution = dataset.label_resolution 27 | self.label_interp_ratio = int(self.label_resolution * cfg['data']['sample_rate'] / cfg['data']['hop_length']) 28 | 29 | # Load ground truth for dcase metrics 30 | self.num_segments = valid_set.num_segments 31 | self.valid_gt_sed_metrics2019 = valid_set.valid_gt_sed_metrics2019 32 | self.valid_gt_doa_metrics2019 = valid_set.valid_gt_doa_metrics2019 33 | self.gt_metrics2020_dict = valid_set.gt_metrics2020_dict 34 | 35 | # Scalar 36 | scalar_h5_dir = Path(cfg['hdf5_dir']).joinpath(cfg['dataset']).joinpath('scalar') 37 | fn_scalar = '{}_{}_sr{}_nfft{}_hop{}_mel{}.h5'.format(cfg['data']['type'], cfg['data']['audio_feature'], 38 | cfg['data']['sample_rate'], cfg['data']['n_fft'], cfg['data']['hop_length'], cfg['data']['n_mels']) 39 | scalar_path = scalar_h5_dir.joinpath(fn_scalar) 40 | with h5py.File(scalar_path, 'r') as hf: 41 | self.mean = hf['mean'][:] 42 | self.std = hf['std'][:] 43 | if args.cuda: 44 | self.mean = torch.tensor(self.mean, dtype=torch.float32).cuda() 45 | self.std = torch.tensor(self.std, dtype=torch.float32).cuda() 46 | 47 | self.init_train_losses() 48 | 49 | def init_train_losses(self): 50 | """ Initialize train losses 51 | 52 | """ 53 | self.train_losses = { 54 | 'loss_all': 0., 55 | 'loss_sed': 0., 56 | 'loss_doa': 0. 
57 | } 58 | 59 | def train_step(self, batch_sample, epoch_it): 60 | """ Perform a train step 61 | 62 | """ 63 | batch_x = batch_sample['waveform'] 64 | batch_target = { 65 | 'ov': batch_sample['ov'], 66 | 'sed': batch_sample['sed_label'], 67 | 'doa': batch_sample['doa_label'] 68 | } 69 | if self.cuda: 70 | batch_x = batch_x.cuda(non_blocking=True) 71 | batch_target['sed'] = batch_target['sed'].cuda(non_blocking=True) 72 | batch_target['doa'] = batch_target['doa'].cuda(non_blocking=True) 73 | 74 | self.optimizer.zero_grad() 75 | self.af_extractor.train() 76 | self.model.train() 77 | batch_x = self.af_extractor(batch_x) 78 | batch_x = (batch_x - self.mean) / self.std 79 | pred = self.model(batch_x) 80 | loss_dict = self.losses.calculate(pred, batch_target) 81 | loss_dict[self.cfg['training']['loss_type']].backward() 82 | self.optimizer.step() 83 | 84 | self.train_losses['loss_all'] += loss_dict['all'] 85 | self.train_losses['loss_sed'] += loss_dict['sed'] 86 | self.train_losses['loss_doa'] += loss_dict['doa'] 87 | 88 | 89 | def validate_step(self, generator=None, max_batch_num=None, valid_type='train', epoch_it=0): 90 | """ Perform the validation on the train, valid set 91 | 92 | Generate a batch of segmentations each time 93 | """ 94 | 95 | if valid_type == 'train': 96 | train_losses = self.train_losses.copy() 97 | self.init_train_losses() 98 | return train_losses 99 | 100 | elif valid_type == 'valid': 101 | pred_sed_list, pred_doa_list = [], [] 102 | gt_sed_list, gt_doa_list = [], [] 103 | loss_all, loss_sed, loss_doa = 0., 0., 0. 104 | 105 | for batch_idx, batch_sample in enumerate(generator): 106 | if batch_idx == max_batch_num: 107 | break 108 | 109 | batch_x = batch_sample['waveform'] 110 | batch_target = { 111 | 'sed': batch_sample['sed_label'], 112 | 'doa': batch_sample['doa_label'] 113 | } 114 | 115 | if self.cuda: 116 | batch_x = batch_x.cuda(non_blocking=True) 117 | batch_target['sed'] = batch_target['sed'].cuda(non_blocking=True) 118 | batch_target['doa'] = batch_target['doa'].cuda(non_blocking=True) 119 | 120 | with torch.no_grad(): 121 | self.af_extractor.eval() 122 | self.model.eval() 123 | batch_x = self.af_extractor(batch_x) 124 | batch_x = (batch_x - self.mean) / self.std 125 | pred = self.model(batch_x) 126 | loss_dict = self.losses.calculate(pred, batch_target, epoch_it) 127 | pred['sed'] = torch.sigmoid(pred['sed']) 128 | loss_all += loss_dict['all'].cpu().detach().numpy() 129 | loss_sed += loss_dict['sed'].cpu().detach().numpy() 130 | loss_doa += loss_dict['doa'].cpu().detach().numpy() 131 | pred_sed_list.append(pred['sed'].cpu().detach().numpy()) 132 | pred_doa_list.append(pred['doa'].cpu().detach().numpy()) 133 | 134 | pred_sed = np.concatenate(pred_sed_list, axis=0) 135 | pred_doa = np.concatenate(pred_doa_list, axis=0) 136 | 137 | origin_num_clips = int(pred_sed.shape[0]/self.num_segments) 138 | origin_T = int(pred_sed.shape[1]*self.num_segments) 139 | pred_sed = pred_sed.reshape((origin_num_clips, origin_T, 2, -1))[:, :int(self.clip_length / self.label_resolution)] 140 | pred_doa = pred_doa.reshape((origin_num_clips, origin_T, 2, -1))[:, :int(self.clip_length / self.label_resolution)] 141 | 142 | pred_sed_max = pred_sed.max(axis=-1) 143 | pred_sed_max_idx = pred_sed.argmax(axis=-1) 144 | pred_sed = np.zeros_like(pred_sed) 145 | for b_idx in range(origin_num_clips): 146 | for t_idx in range(origin_T): 147 | for track_idx in range(2): 148 | pred_sed[b_idx, t_idx, track_idx, pred_sed_max_idx[b_idx, t_idx, track_idx]] = \ 149 | pred_sed_max[b_idx, t_idx, track_idx] 150 
| pred_sed = (pred_sed > self.cfg['training']['threshold_sed']).astype(np.float32) 151 | 152 | # convert Catesian to Spherical 153 | azi = np.arctan2(pred_doa[..., 1], pred_doa[..., 0]) 154 | elev = np.arctan2(pred_doa[..., 2], np.sqrt(pred_doa[..., 0]**2 + pred_doa[..., 1]**2)) 155 | pred_doa = np.stack((azi, elev), axis=-1) # (N, T, tracks, (azi, elev)) 156 | 157 | # convert format 158 | pred_sed_metrics2019, pred_doa_metrics2019 = to_metrics2019_format(pred_sed, pred_doa) 159 | gt_sed_metrics2019, gt_doa_metrics2019 = self.valid_gt_sed_metrics2019, self.valid_gt_doa_metrics2019 160 | pred_dcase_format_dict = to_dcase_format(pred_sed, pred_doa) 161 | pred_metrics2020_dict = to_metrics2020_format(pred_dcase_format_dict, 162 | pred_sed.shape[0]*pred_sed.shape[1], label_resolution=self.label_resolution) 163 | gt_metrics2020_dict = self.gt_metrics2020_dict 164 | 165 | out_losses = { 166 | 'loss_all': loss_all / (batch_idx + 1), 167 | 'loss_sed': loss_sed / (batch_idx + 1), 168 | 'loss_doa': loss_doa / (batch_idx + 1), 169 | } 170 | 171 | pred_dict = { 172 | 'dcase2019_sed': pred_sed_metrics2019, 173 | 'dcase2019_doa': pred_doa_metrics2019, 174 | 'dcase2020': pred_metrics2020_dict, 175 | } 176 | 177 | gt_dict = { 178 | 'dcase2019_sed': gt_sed_metrics2019, 179 | 'dcase2019_doa': gt_doa_metrics2019, 180 | 'dcase2020': gt_metrics2020_dict, 181 | } 182 | metrics_scores = self.metrics.calculate(pred_dict, gt_dict) 183 | return out_losses, metrics_scores 184 | 185 | 186 | def to_metrics2019_format(sed_labels, doa_labels): 187 | """Convert sed and doa labels from track-wise output format to DCASE2019 evaluation metrics input format 188 | 189 | Args: 190 | sed_labels: SED labels, (batch_size, time_steps, num_tracks=2, logits_events=14 (number of classes)) 191 | doa_labels: DOA labels, (batch_size, time_steps, num_tracks=2, logits_degrees=2 (azi in radians, ele in radians)) 192 | Output: 193 | out_sed_labels: SED labels, (batch_size * time_steps, logits_events=14 (True or False) 194 | out_doa_labels: DOA labels, (batch_size * time_steps, azi_index=14 + ele_index=14) 195 | """ 196 | batch_size, T, num_tracks, num_classes = sed_labels.shape 197 | sed_labels = sed_labels.reshape(batch_size * T, num_tracks, num_classes) 198 | doa_labels = doa_labels.reshape(batch_size * T, num_tracks, 2) 199 | out_sed_labels = np.logical_or(sed_labels[:, 0], sed_labels[:, 1]).astype(float) 200 | out_doa_labels = np.zeros((batch_size * T, num_classes * 2)) 201 | for n_track in range(num_tracks): 202 | indexes = np.where(sed_labels[:, n_track, :]) 203 | out_doa_labels[:, 0: num_classes][indexes[0], indexes[1]] = \ 204 | doa_labels[indexes[0], n_track, 0] # azimuth 205 | out_doa_labels[:, num_classes: 2*num_classes][indexes[0], indexes[1]] = \ 206 | doa_labels[indexes[0], n_track, 1] # elevation 207 | return out_sed_labels, out_doa_labels 208 | 209 | def to_dcase_format(sed_labels, doa_labels): 210 | """Convert sed and doa labels from track-wise output format to dcase output format 211 | 212 | Args: 213 | sed_labels: SED labels, (batch_size, time_steps, num_tracks=2, logits_events=14 (number of classes)) 214 | doa_labels: DOA labels, (batch_size, time_steps, num_tracks=2, logits_degrees=2 (azi in radiance, ele in radiance)) 215 | Output: 216 | output_dict: return a dict containing dcase output format 217 | output_dict[frame-containing-events] = [[class_index_1, azi_1 in degree, ele_1 in degree], [class_index_2, azi_2 in degree, ele_2 in degree]] 218 | """ 219 | batch_size, T, num_tracks, num_classes= sed_labels.shape 220 | 
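# NOTE: the batch and time axes are flattened below, so the dict key n_idx is a global
# frame index running from 0 to batch_size*T - 1 across all clips in the batch, and the
# DoA angles are converted from radians to rounded integer degrees before being stored.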
221 | sed_labels = sed_labels.reshape(batch_size*T, num_tracks, num_classes) 222 | doa_labels = doa_labels.reshape(batch_size*T, num_tracks, 2) 223 | 224 | output_dict = {} 225 | for n_idx in range(batch_size*T): 226 | for n_track in range(num_tracks): 227 | class_index = list(np.where(sed_labels[n_idx, n_track, :])[0]) 228 | assert len(class_index) <= 1, 'class_index should be smaller or equal to 1!!\n' 229 | if class_index: 230 | event_doa = [class_index[0], int(np.around(doa_labels[n_idx, n_track, 0] * 180 / np.pi)), \ 231 | int(np.around(doa_labels[n_idx, n_track, 1] * 180 / np.pi))] # NOTE: this is in degree 232 | if n_idx not in output_dict: 233 | output_dict[n_idx] = [] 234 | output_dict[n_idx].append(event_doa) 235 | return output_dict 236 | 237 | -------------------------------------------------------------------------------- /seld/methods/feature.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | from methods.utils.stft import (STFT, LogmelFilterBank, intensityvector, 5 | spectrogram_STFTInput) 6 | 7 | 8 | class LogmelIntensity_Extractor(nn.Module): 9 | def __init__(self, cfg): 10 | super().__init__() 11 | 12 | data = cfg['data'] 13 | sample_rate, n_fft, hop_length, window, n_mels, fmin, fmax = \ 14 | data['sample_rate'], data['n_fft'], data['hop_length'], data['window'], data['n_mels'], \ 15 | data['fmin'], data['fmax'] 16 | 17 | center = True 18 | pad_mode = 'reflect' 19 | ref = 1.0 20 | amin = 1e-10 21 | top_db = None 22 | 23 | # STFT extractor 24 | self.stft_extractor = STFT(n_fft=n_fft, hop_length=hop_length, win_length=n_fft, 25 | window=window, center=center, pad_mode=pad_mode, 26 | freeze_parameters=data['feature_freeze']) 27 | 28 | # Spectrogram extractor 29 | self.spectrogram_extractor = spectrogram_STFTInput 30 | 31 | # Logmel extractor 32 | self.logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=n_fft, 33 | n_mels=n_mels, fmin=fmin, fmax=fmax, ref=ref, amin=amin, top_db=top_db, 34 | freeze_parameters=data['feature_freeze']) 35 | 36 | # Intensity vector extractor 37 | self.intensityVector_extractor = intensityvector 38 | 39 | def forward(self, x): 40 | """ 41 | input: 42 | (batch_size, channels=4, data_length) 43 | output: 44 | (batch_size, channels, time_steps, freq_bins) 45 | """ 46 | if x.ndim != 3: 47 | raise ValueError("x shape must be (batch_size, num_channels, data_length)\n \ 48 | Now it is {}".format(x.shape)) 49 | x = self.stft_extractor(x) 50 | logmel = self.logmel_extractor(self.spectrogram_extractor(x)) 51 | intensity_vector = self.intensityVector_extractor(x, self.logmel_extractor.melW) 52 | out = torch.cat((logmel, intensity_vector), dim=1) 53 | return out 54 | 55 | -------------------------------------------------------------------------------- /seld/methods/inference.py: -------------------------------------------------------------------------------- 1 | class BaseInferer: 2 | """ Base inferer class 3 | 4 | """ 5 | def infer(self, *args, **kwargs): 6 | """ Perform an inference on test data. 7 | 8 | """ 9 | raise NotImplementedError 10 | 11 | def fusion(self, submissions_dir, preds): 12 | """ Ensamble predictions. 
13 | 14 | """ 15 | raise NotImplementedError 16 | 17 | @staticmethod 18 | def write_submission(submissions_dir, pred_dict): 19 | """ Write predicted result to submission csv files 20 | Args: 21 | pred_dict: DCASE2020 format dict: 22 | pred_dict[frame-containing-events] = [[class_index_1, azi_1 in degree, ele_1 in degree], [class_index_2, azi_2 in degree, ele_2 in degree]] 23 | """ 24 | for key, values in pred_dict.items(): 25 | for value in values: 26 | with submissions_dir.open('a') as f: 27 | f.write('{},{},{},{}\n'.format(key, value[0], value[1], value[2])) 28 | 29 | 30 | 31 | -------------------------------------------------------------------------------- /seld/methods/training.py: -------------------------------------------------------------------------------- 1 | class BaseTrainer: 2 | """ Base trainer class 3 | 4 | """ 5 | def train_step(self, *args, **kwargs): 6 | """ Perform a training step. 7 | 8 | """ 9 | raise NotImplementedError 10 | 11 | def validate_step(self, *args, **kwargs): 12 | """ Perform a validation step 13 | 14 | """ 15 | raise NotImplementedError 16 | 17 | 18 | 19 | -------------------------------------------------------------------------------- /seld/methods/utils/SELD_evaluation_metrics_2019.py: -------------------------------------------------------------------------------- 1 | # 2 | # Implements the core metrics from sound event detection evaluation module http://tut-arg.github.io/sed_eval/ and 3 | # The DOA metrics are explained in the SELDnet paper 4 | # 5 | # This script has MIT license 6 | # 7 | 8 | import numpy as np 9 | from scipy.optimize import linear_sum_assignment 10 | from IPython import embed 11 | eps = np.finfo(np.float).eps 12 | 13 | 14 | ########################################################################################## 15 | # SELD scoring functions - class implementation 16 | # 17 | # NOTE: Supports only one-hot labels for both SED and DOA. Doesnt work for baseline method 18 | # directly, since it estimated DOA in regression approach. Check below the class for 19 | # one shot (function) implementations of all metrics. The function implementation has 20 | # support for both one-hot labels and regression values of DOA estimation. 
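# Quick reference, summarizing the code below: SED is scored on 1-second blocks with
# error rate ER = (S + D + I) / N_ref and F1 = 2*P*R / (P + R), where P = TP / N_sys and
# R = TP / N_ref; DOA is scored with the average angular distance between reference and
# estimated directions plus a frame recall that counts frames whose predicted number of
# active sources matches the reference.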
21 | ########################################################################################## 22 | 23 | class SELDMetrics(object): 24 | def __init__(self, nb_frames_1s=None, data_gen=None): 25 | # SED params 26 | self._S = 0 27 | self._D = 0 28 | self._I = 0 29 | self._TP = 0 30 | self._Nref = 0 31 | self._Nsys = 0 32 | self._block_size = nb_frames_1s 33 | 34 | # DOA params 35 | self._doa_loss_pred_cnt = 0 36 | self._nb_frames = 0 37 | 38 | self._doa_loss_pred = 0 39 | self._nb_good_pks = 0 40 | 41 | self._data_gen = data_gen 42 | 43 | self._less_est_cnt, self._less_est_frame_cnt = 0, 0 44 | self._more_est_cnt, self._more_est_frame_cnt = 0, 0 45 | 46 | def f1_overall_framewise(self, O, T): 47 | TP = ((2 * T - O) == 1).sum() 48 | Nref, Nsys = T.sum(), O.sum() 49 | self._TP += TP 50 | self._Nref += Nref 51 | self._Nsys += Nsys 52 | 53 | def er_overall_framewise(self, O, T): 54 | FP = np.logical_and(T == 0, O == 1).sum(1) 55 | FN = np.logical_and(T == 1, O == 0).sum(1) 56 | S = np.minimum(FP, FN).sum() 57 | D = np.maximum(0, FN - FP).sum() 58 | I = np.maximum(0, FP - FN).sum() 59 | self._S += S 60 | self._D += D 61 | self._I += I 62 | 63 | def f1_overall_1sec(self, O, T): 64 | new_size = int(np.ceil(float(O.shape[0]) / self._block_size)) 65 | O_block = np.zeros((new_size, O.shape[1])) 66 | T_block = np.zeros((new_size, O.shape[1])) 67 | for i in range(0, new_size): 68 | O_block[i, :] = np.max(O[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0) 69 | T_block[i, :] = np.max(T[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0) 70 | return self.f1_overall_framewise(O_block, T_block) 71 | 72 | def er_overall_1sec(self, O, T): 73 | new_size = int(np.ceil(float(O.shape[0]) / self._block_size)) 74 | O_block = np.zeros((new_size, O.shape[1])) 75 | T_block = np.zeros((new_size, O.shape[1])) 76 | for i in range(0, new_size): 77 | O_block[i, :] = np.max(O[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0) 78 | T_block[i, :] = np.max(T[int(i * self._block_size):int(i * self._block_size + self._block_size - 1), :], axis=0) 79 | return self.er_overall_framewise(O_block, T_block) 80 | 81 | def update_sed_scores(self, pred, gt): 82 | """ 83 | Computes SED metrics for one second segments 84 | 85 | :param pred: predicted matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0 86 | :param gt: reference matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0 87 | :param nb_frames_1s: integer, number of frames in one second 88 | :return: 89 | """ 90 | self.f1_overall_1sec(pred, gt) 91 | self.er_overall_1sec(pred, gt) 92 | 93 | def compute_sed_scores(self): 94 | ER = (self._S + self._D + self._I) / (self._Nref + 0.0) 95 | 96 | prec = float(self._TP) / float(self._Nsys + eps) 97 | recall = float(self._TP) / float(self._Nref + eps) 98 | F = 2 * prec * recall / (prec + recall + eps) 99 | 100 | return ER, F 101 | 102 | def update_doa_scores(self, pred_doa_thresholded, gt_doa): 103 | ''' 104 | Compute DOA metrics when DOA is estimated using classification approach 105 | 106 | :param pred_doa_thresholded: predicted results of dimension [nb_frames, nb_classes, nb_azi*nb_ele], 107 | with value 1 when sound event active, else 0 108 | :param gt_doa: reference results of dimension [nb_frames, nb_classes, nb_azi*nb_ele], 109 | with value 1 when sound event active, else 0 110 | :param data_gen_test: feature or data generator class 111 | 112 | :return: DOA 
metrics 113 | 114 | ''' 115 | self._doa_loss_pred_cnt += np.sum(pred_doa_thresholded) 116 | self._nb_frames += pred_doa_thresholded.shape[0] 117 | 118 | for frame in range(pred_doa_thresholded.shape[0]): 119 | nb_gt_peaks = int(np.sum(gt_doa[frame, :])) 120 | nb_pred_peaks = int(np.sum(pred_doa_thresholded[frame, :])) 121 | 122 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction 123 | if nb_gt_peaks == nb_pred_peaks: 124 | self._nb_good_pks += 1 125 | elif nb_gt_peaks > nb_pred_peaks: 126 | self._less_est_frame_cnt += 1 127 | self._less_est_cnt += (nb_gt_peaks - nb_pred_peaks) 128 | elif nb_pred_peaks > nb_gt_peaks: 129 | self._more_est_frame_cnt += 1 130 | self._more_est_cnt += (nb_pred_peaks - nb_gt_peaks) 131 | 132 | # when nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas 133 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas 134 | if nb_gt_peaks and nb_pred_peaks: 135 | pred_ind = np.where(pred_doa_thresholded[frame] == 1)[1] 136 | pred_list_rad = np.array(self._data_gen .get_matrix_index(pred_ind)) * np.pi / 180 137 | 138 | gt_ind = np.where(gt_doa[frame] == 1)[1] 139 | gt_list_rad = np.array(self._data_gen .get_matrix_index(gt_ind)) * np.pi / 180 140 | 141 | frame_dist = distance_between_gt_pred(gt_list_rad.T, pred_list_rad.T) 142 | self._doa_loss_pred += frame_dist 143 | 144 | def compute_doa_scores(self): 145 | doa_error = self._doa_loss_pred / self._doa_loss_pred_cnt 146 | frame_recall = self._nb_good_pks / float(self._nb_frames) 147 | return doa_error, frame_recall 148 | 149 | def reset(self): 150 | # SED params 151 | self._S = 0 152 | self._D = 0 153 | self._I = 0 154 | self._TP = 0 155 | self._Nref = 0 156 | self._Nsys = 0 157 | 158 | # DOA params 159 | self._doa_loss_pred_cnt = 0 160 | self._nb_frames = 0 161 | 162 | self._doa_loss_pred = 0 163 | self._nb_good_pks = 0 164 | 165 | self._less_est_cnt, self._less_est_frame_cnt = 0, 0 166 | self._more_est_cnt, self._more_est_frame_cnt = 0, 0 167 | 168 | 169 | ############################################################### 170 | # SED scoring functions 171 | ############################################################### 172 | 173 | 174 | def reshape_3Dto2D(A): 175 | return A.reshape(A.shape[0] * A.shape[1], A.shape[2]) 176 | 177 | 178 | def f1_overall_framewise(O, T): 179 | if len(O.shape) == 3: 180 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T) 181 | TP = ((2 * T - O) == 1).sum() 182 | Nref, Nsys = T.sum(), O.sum() 183 | 184 | prec = float(TP) / float(Nsys + eps) 185 | recall = float(TP) / float(Nref + eps) 186 | f1_score = 2 * prec * recall / (prec + recall + eps) 187 | return f1_score 188 | 189 | 190 | def er_overall_framewise(O, T): 191 | if len(O.shape) == 3: 192 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T) 193 | 194 | FP = np.logical_and(T == 0, O == 1).sum(1) 195 | FN = np.logical_and(T == 1, O == 0).sum(1) 196 | 197 | S = np.minimum(FP, FN).sum() 198 | D = np.maximum(0, FN-FP).sum() 199 | I = np.maximum(0, FP-FN).sum() 200 | 201 | Nref = T.sum() 202 | ER = (S+D+I) / (Nref + 0.0) 203 | return ER 204 | 205 | 206 | def f1_overall_1sec(O, T, block_size): 207 | if len(O.shape) == 3: 208 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T) 209 | new_size = int(np.ceil(float(O.shape[0]) / block_size)) 210 | O_block = np.zeros((new_size, O.shape[1])) 211 | T_block = np.zeros((new_size, O.shape[1])) 212 | for i in range(0, new_size): 213 | O_block[i, :] = 
np.max(O[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0) 214 | T_block[i, :] = np.max(T[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0) 215 | return f1_overall_framewise(O_block, T_block) 216 | 217 | 218 | def er_overall_1sec(O, T, block_size): 219 | if len(O.shape) == 3: 220 | O, T = reshape_3Dto2D(O), reshape_3Dto2D(T) 221 | new_size = int(np.ceil(float(O.shape[0]) / block_size)) 222 | O_block = np.zeros((new_size, O.shape[1])) 223 | T_block = np.zeros((new_size, O.shape[1])) 224 | for i in range(0, new_size): 225 | O_block[i, :] = np.max(O[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0) 226 | T_block[i, :] = np.max(T[int(i * block_size):int(i * block_size + block_size - 1), :], axis=0) 227 | return er_overall_framewise(O_block, T_block) 228 | 229 | 230 | def compute_sed_scores(pred, gt, nb_frames_1s): 231 | """ 232 | Computes SED metrics for one second segments 233 | 234 | :param pred: predicted matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0 235 | :param gt: reference matrix of dimension [nb_frames, nb_classes], with 1 when sound event is active else 0 236 | :param nb_frames_1s: integer, number of frames in one second 237 | :return: 238 | """ 239 | f1o = f1_overall_1sec(pred, gt, nb_frames_1s) 240 | ero = er_overall_1sec(pred, gt, nb_frames_1s) 241 | scores = [ero, f1o] 242 | return scores 243 | 244 | 245 | ############################################################### 246 | # DOA scoring functions 247 | ############################################################### 248 | 249 | 250 | def compute_doa_scores_regr_xyz(pred_doa, gt_doa, pred_sed, gt_sed): 251 | """ 252 | Compute DOA metrics when DOA is estimated using regression approach 253 | 254 | :param pred_doa: predicted doa_labels is of dimension [nb_frames, 3*nb_classes], 255 | nb_classes each for x, y, and z axes, 256 | if active, the DOA values will be in real numbers [-1 1] range, else, it will contain default doa values of (0, 0, 0) 257 | :param gt_doa: reference doa_labels is of dimension [nb_frames, 3*nb_classes], 258 | :param pred_sed: predicted sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero 259 | :param gt_sed: reference sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero 260 | :return: 261 | """ 262 | 263 | nb_src_gt_list = np.zeros(gt_doa.shape[0]).astype(int) 264 | nb_src_pred_list = np.zeros(gt_doa.shape[0]).astype(int) 265 | good_frame_cnt = 0 266 | doa_loss_pred = 0.0 267 | nb_sed = gt_sed.shape[-1] 268 | 269 | less_est_cnt, less_est_frame_cnt = 0, 0 270 | more_est_cnt, more_est_frame_cnt = 0, 0 271 | 272 | for frame_cnt, sed_frame in enumerate(gt_sed): 273 | nb_src_gt_list[frame_cnt] = int(np.sum(sed_frame)) 274 | nb_src_pred_list[frame_cnt] = int(np.sum(pred_sed[frame_cnt])) 275 | 276 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction 277 | if nb_src_gt_list[frame_cnt] == nb_src_pred_list[frame_cnt]: 278 | good_frame_cnt = good_frame_cnt + 1 279 | elif nb_src_gt_list[frame_cnt] > nb_src_pred_list[frame_cnt]: 280 | less_est_cnt = less_est_cnt + nb_src_gt_list[frame_cnt] - nb_src_pred_list[frame_cnt] 281 | less_est_frame_cnt = less_est_frame_cnt + 1 282 | elif nb_src_gt_list[frame_cnt] < nb_src_pred_list[frame_cnt]: 283 | more_est_cnt = more_est_cnt + nb_src_pred_list[frame_cnt] - nb_src_gt_list[frame_cnt] 284 | more_est_frame_cnt = more_est_frame_cnt + 1 285 | 286 | # when 
nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas 287 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas 288 | if nb_src_gt_list[frame_cnt] and nb_src_pred_list[frame_cnt]: 289 | # DOA Loss with respect to predicted confidence 290 | sed_frame_gt = gt_sed[frame_cnt] 291 | doa_frame_gt_x = gt_doa[frame_cnt][:nb_sed][sed_frame_gt == 1] 292 | doa_frame_gt_y = gt_doa[frame_cnt][nb_sed:2*nb_sed][sed_frame_gt == 1] 293 | doa_frame_gt_z = gt_doa[frame_cnt][2*nb_sed:][sed_frame_gt == 1] 294 | 295 | sed_frame_pred = pred_sed[frame_cnt] 296 | doa_frame_pred_x = pred_doa[frame_cnt][:nb_sed][sed_frame_pred == 1] 297 | doa_frame_pred_y = pred_doa[frame_cnt][nb_sed:2*nb_sed][sed_frame_pred == 1] 298 | doa_frame_pred_z = pred_doa[frame_cnt][2*nb_sed:][sed_frame_pred == 1] 299 | 300 | doa_loss_pred += distance_between_gt_pred_xyz(np.vstack((doa_frame_gt_x, doa_frame_gt_y, doa_frame_gt_z)).T, 301 | np.vstack((doa_frame_pred_x, doa_frame_pred_y, doa_frame_pred_z)).T) 302 | 303 | doa_loss_pred_cnt = np.sum(nb_src_pred_list) 304 | if doa_loss_pred_cnt: 305 | doa_loss_pred /= doa_loss_pred_cnt 306 | 307 | frame_recall = good_frame_cnt / float(gt_sed.shape[0]) 308 | er_metric = [doa_loss_pred, frame_recall, doa_loss_pred_cnt, good_frame_cnt, more_est_cnt, less_est_cnt] 309 | return er_metric 310 | 311 | 312 | def compute_doa_scores_regr(pred_doa_rad, gt_doa_rad, pred_sed, gt_sed): 313 | """ 314 | Compute DOA metrics when DOA is estimated using regression approach 315 | 316 | :param pred_doa_rad: predicted doa_labels is of dimension [nb_frames, 2*nb_classes], 317 | nb_classes each for azimuth and elevation angles, 318 | if active, the DOA values will be in RADIANS, else, it will contain default doa values 319 | :param gt_doa_rad: reference doa_labels is of dimension [nb_frames, 2*nb_classes], 320 | nb_classes each for azimuth and elevation angles, 321 | if active, the DOA values will be in RADIANS, else, it will contain default doa values 322 | :param pred_sed: predicted sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero 323 | :param gt_sed: reference sed label of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero 324 | :return: 325 | """ 326 | 327 | nb_src_gt_list = np.zeros(gt_doa_rad.shape[0]).astype(int) 328 | nb_src_pred_list = np.zeros(gt_doa_rad.shape[0]).astype(int) 329 | good_frame_cnt = 0 330 | doa_loss_pred = 0.0 331 | nb_sed = gt_sed.shape[-1] 332 | 333 | less_est_cnt, less_est_frame_cnt = 0, 0 334 | more_est_cnt, more_est_frame_cnt = 0, 0 335 | 336 | for frame_cnt, sed_frame in enumerate(gt_sed): 337 | nb_src_gt_list[frame_cnt] = int(np.sum(sed_frame)) 338 | nb_src_pred_list[frame_cnt] = int(np.sum(pred_sed[frame_cnt])) 339 | 340 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction 341 | if nb_src_gt_list[frame_cnt] == nb_src_pred_list[frame_cnt]: 342 | good_frame_cnt = good_frame_cnt + 1 343 | elif nb_src_gt_list[frame_cnt] > nb_src_pred_list[frame_cnt]: 344 | less_est_cnt = less_est_cnt + nb_src_gt_list[frame_cnt] - nb_src_pred_list[frame_cnt] 345 | less_est_frame_cnt = less_est_frame_cnt + 1 346 | elif nb_src_gt_list[frame_cnt] < nb_src_pred_list[frame_cnt]: 347 | more_est_cnt = more_est_cnt + nb_src_pred_list[frame_cnt] - nb_src_gt_list[frame_cnt] 348 | more_est_frame_cnt = more_est_frame_cnt + 1 349 | 350 | # when nb_ref_doa > nb_estimated_doa, ignores the 
extra ref doas and scores only the nearest matching doas 351 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas 352 | if nb_src_gt_list[frame_cnt] and nb_src_pred_list[frame_cnt]: 353 | # DOA Loss with respect to predicted confidence 354 | sed_frame_gt = gt_sed[frame_cnt] 355 | doa_frame_gt_azi = gt_doa_rad[frame_cnt][:nb_sed][sed_frame_gt == 1] 356 | doa_frame_gt_ele = gt_doa_rad[frame_cnt][nb_sed:][sed_frame_gt == 1] 357 | 358 | sed_frame_pred = pred_sed[frame_cnt] 359 | doa_frame_pred_azi = pred_doa_rad[frame_cnt][:nb_sed][sed_frame_pred == 1] 360 | doa_frame_pred_ele = pred_doa_rad[frame_cnt][nb_sed:][sed_frame_pred == 1] 361 | 362 | doa_loss_pred += distance_between_gt_pred(np.vstack((doa_frame_gt_azi, doa_frame_gt_ele)).T, 363 | np.vstack((doa_frame_pred_azi, doa_frame_pred_ele)).T) 364 | 365 | doa_loss_pred_cnt = np.sum(nb_src_pred_list) 366 | if doa_loss_pred_cnt: 367 | doa_loss_pred /= doa_loss_pred_cnt 368 | 369 | frame_recall = good_frame_cnt / float(gt_sed.shape[0]) 370 | er_metric = [doa_loss_pred, frame_recall, doa_loss_pred_cnt, good_frame_cnt, more_est_cnt, less_est_cnt] 371 | return er_metric 372 | 373 | 374 | def compute_doa_scores_clas(pred_doa_thresholded, gt_doa, data_gen_test): 375 | ''' 376 | Compute DOA metrics when DOA is estimated using classification approach 377 | 378 | :param pred_doa_thresholded: predicted results of dimension [nb_frames, nb_classes, nb_azi*nb_ele], 379 | with value 1 when sound event active, else 0 380 | :param gt_doa: reference results of dimension [nb_frames, nb_classes, nb_azi*nb_ele], 381 | with value 1 when sound event active, else 0 382 | :param data_gen_test: feature or data generator class 383 | 384 | :return: DOA metrics 385 | 386 | ''' 387 | doa_loss_pred_cnt = np.sum(pred_doa_thresholded) 388 | 389 | doa_loss_pred = 0 390 | nb_good_pks = 0 391 | 392 | less_est_cnt, less_est_frame_cnt = 0, 0 393 | more_est_cnt, more_est_frame_cnt = 0, 0 394 | 395 | for frame in range(pred_doa_thresholded.shape[0]): 396 | nb_gt_peaks = int(np.sum(gt_doa[frame, :])) 397 | nb_pred_peaks = int(np.sum(pred_doa_thresholded[frame, :])) 398 | 399 | # good_frame_cnt includes frames where the nb active sources were zero in both groundtruth and prediction 400 | if nb_gt_peaks == nb_pred_peaks: 401 | nb_good_pks += 1 402 | elif nb_gt_peaks > nb_pred_peaks: 403 | less_est_frame_cnt += 1 404 | less_est_cnt += (nb_gt_peaks - nb_pred_peaks) 405 | elif nb_pred_peaks > nb_gt_peaks: 406 | more_est_frame_cnt += 1 407 | more_est_cnt += (nb_pred_peaks - nb_gt_peaks) 408 | 409 | # when nb_ref_doa > nb_estimated_doa, ignores the extra ref doas and scores only the nearest matching doas 410 | # similarly, when nb_estimated_doa > nb_ref_doa, ignores the extra estimated doa and scores the remaining matching doas 411 | if nb_gt_peaks and nb_pred_peaks: 412 | pred_ind = np.where(pred_doa_thresholded[frame] == 1)[1] 413 | pred_list_rad = np.array(data_gen_test.get_matrix_index(pred_ind)) * np.pi / 180 414 | 415 | gt_ind = np.where(gt_doa[frame] == 1)[1] 416 | gt_list_rad = np.array(data_gen_test.get_matrix_index(gt_ind)) * np.pi / 180 417 | 418 | frame_dist = distance_between_gt_pred(gt_list_rad.T, pred_list_rad.T) 419 | doa_loss_pred += frame_dist 420 | 421 | if doa_loss_pred_cnt: 422 | doa_loss_pred /= doa_loss_pred_cnt 423 | 424 | frame_recall = nb_good_pks / float(pred_doa_thresholded.shape[0]) 425 | er_metric = [doa_loss_pred, frame_recall, doa_loss_pred_cnt, nb_good_pks, more_est_cnt, less_est_cnt] 426 | 
return er_metric 427 | 428 | 429 | def distance_between_gt_pred(gt_list_rad, pred_list_rad): 430 | """ 431 | Shortest distance between two sets of spherical coordinates. Given a set of groundtruth spherical coordinates, 432 | and its respective predicted coordinates, we calculate the spherical distance between each of the spherical 433 | coordinate pairs resulting in a matrix of distances, where one axis represents the number of groundtruth 434 | coordinates and the other the predicted coordinates. The number of estimated peaks need not be the same as in 435 | groundtruth, thus the distance matrix is not always a square matrix. We use the hungarian algorithm to find the 436 | least cost in this distance matrix. 437 | 438 | :param gt_list_rad: list of ground-truth spherical coordinates 439 | :param pred_list_rad: list of predicted spherical coordinates 440 | :return: cost - distance 441 | :return: less - number of DOA's missed 442 | :return: extra - number of DOA's over-estimated 443 | """ 444 | 445 | gt_len, pred_len = gt_list_rad.shape[0], pred_list_rad.shape[0] 446 | ind_pairs = np.array([[x, y] for y in range(pred_len) for x in range(gt_len)]) 447 | cost_mat = np.zeros((gt_len, pred_len)) 448 | 449 | # Slow implementation 450 | # cost_mat = np.zeros((gt_len, pred_len)) 451 | # for gt_cnt, gt in enumerate(gt_list_rad): 452 | # for pred_cnt, pred in enumerate(pred_list_rad): 453 | # cost_mat[gt_cnt, pred_cnt] = distance_between_spherical_coordinates_rad(gt, pred) 454 | 455 | # Fast implementation 456 | if gt_len and pred_len: 457 | az1, ele1, az2, ele2 = gt_list_rad[ind_pairs[:, 0], 0], gt_list_rad[ind_pairs[:, 0], 1], \ 458 | pred_list_rad[ind_pairs[:, 1], 0], pred_list_rad[ind_pairs[:, 1], 1] 459 | cost_mat[ind_pairs[:, 0], ind_pairs[:, 1]] = distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2) 460 | 461 | row_ind, col_ind = linear_sum_assignment(cost_mat) 462 | cost = cost_mat[row_ind, col_ind].sum() 463 | return cost 464 | 465 | 466 | def distance_between_gt_pred_xyz(gt_list, pred_list): 467 | """ 468 | Shortest distance between two sets of Cartesian coordinates. Given a set of groundtruth coordinates, 469 | and its respective predicted coordinates, we calculate the spherical distance between each of the spherical 470 | coordinate pairs resulting in a matrix of distances, where one axis represents the number of groundtruth 471 | coordinates and the other the predicted coordinates. The number of estimated peaks need not be the same as in 472 | groundtruth, thus the distance matrix is not always a square matrix. We use the hungarian algorithm to find the 473 | least cost in this distance matrix. 
474 | 475 | :param gt_list: list of ground-truth Cartesian coordinates 476 | :param pred_list: list of predicted Cartesian coordinates 477 | :return: cost - distance 478 | :return: less - number of DOA's missed 479 | :return: extra - number of DOA's over-estimated 480 | """ 481 | 482 | gt_len, pred_len = gt_list.shape[0], pred_list.shape[0] 483 | ind_pairs = np.array([[x, y] for y in range(pred_len) for x in range(gt_len)]) 484 | cost_mat = np.zeros((gt_len, pred_len)) 485 | 486 | # Slow implementation 487 | # cost_mat = np.zeros((gt_len, pred_len)) 488 | # for gt_cnt, gt in enumerate(gt_list_rad): 489 | # for pred_cnt, pred in enumerate(pred_list_rad): 490 | # cost_mat[gt_cnt, pred_cnt] = distance_between_spherical_coordinates_rad(gt, pred) 491 | 492 | # Fast implementation 493 | if gt_len and pred_len: 494 | x1, y1, z1, x2, y2, z2 = gt_list[ind_pairs[:, 0], 0], gt_list[ind_pairs[:, 0], 1], gt_list[ind_pairs[:, 0], 2], \ 495 | pred_list[ind_pairs[:, 1], 0], pred_list[ind_pairs[:, 1], 1], pred_list[ind_pairs[:, 1], 2] 496 | cost_mat[ind_pairs[:, 0], ind_pairs[:, 1]] = distance_between_cartesian_coordinates(x1, y1, z1, x2, y2, z2) 497 | 498 | row_ind, col_ind = linear_sum_assignment(cost_mat) 499 | cost = cost_mat[row_ind, col_ind].sum() 500 | return cost 501 | 502 | 503 | def distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2): 504 | """ 505 | Angular distance between two spherical coordinates 506 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance 507 | 508 | :return: angular distance in degrees 509 | """ 510 | dist = np.sin(ele1) * np.sin(ele2) + np.cos(ele1) * np.cos(ele2) * np.cos(np.abs(az1 - az2)) 511 | # Making sure the dist values are in -1 to 1 range, else np.arccos kills the job 512 | dist = np.clip(dist, -1, 1) 513 | dist = np.arccos(dist) * 180 / np.pi 514 | return dist 515 | 516 | 517 | def distance_between_cartesian_coordinates(x1, y1, z1, x2, y2, z2): 518 | """ 519 | Angular distance between two cartesian coordinates 520 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance 521 | Check 'From chord length' section 522 | 523 | :return: angular distance in degrees 524 | """ 525 | # Normalize the Cartesian vectors 526 | N1 = np.sqrt(x1**2 + y1**2 + z1**2 + 1e-10) 527 | N2 = np.sqrt(x2**2 + y2**2 + z2**2 + 1e-10) 528 | x1, y1, z1, x2, y2, z2 = x1/N1, y1/N1, z1/N1, x2/N2, y2/N2, z2/N2 529 | 530 | #Compute the distance 531 | dist = x1*x2 + y1*y2 + z1*z2 532 | dist = np.clip(dist, -1, 1) 533 | dist = np.arccos(dist) * 180 / np.pi 534 | return dist 535 | 536 | 537 | def sph2cart(azimuth, elevation, r): 538 | ''' 539 | Convert spherical to cartesian coordinates 540 | 541 | :param azimuth: in radians 542 | :param elevation: in radians 543 | :param r: in meters 544 | :return: cartesian coordinates 545 | ''' 546 | 547 | x = r * np.cos(elevation) * np.cos(azimuth) 548 | y = r * np.cos(elevation) * np.sin(azimuth) 549 | z = r * np.sin(elevation) 550 | return x, y, z 551 | 552 | 553 | def cart2sph(x, y, z): 554 | ''' 555 | Convert cartesian to spherical coordinates 556 | 557 | :param x: 558 | :param y: 559 | :param z: 560 | :return: azi, ele in radians and r in meters 561 | ''' 562 | 563 | azimuth = np.arctan2(y,x) 564 | elevation = np.arctan2(z,np.sqrt(x**2 + y**2)) 565 | r = np.sqrt(x**2 + y**2 + z**2) 566 | return azimuth, elevation, r 567 | 568 | 569 | ############################################################### 570 | # SELD scoring functions 571 | ############################################################### 572 | 573 | 574 | def 
early_stopping_metric(sed_error, doa_error): 575 | """ 576 | Compute early stopping metric from sed and doa errors. 577 | 578 | :param sed_error: [error rate (0 to 1 range), f score (0 to 1 range)] 579 | :param doa_error: [doa error (in degrees), frame recall (0 to 1 range)] 580 | :return: seld metric result 581 | """ 582 | seld_metric = np.mean([ 583 | sed_error[0], 584 | 1 - sed_error[1], 585 | doa_error[0]/180, 586 | 1 - doa_error[1]] 587 | ) 588 | return seld_metric 589 | 590 | -------------------------------------------------------------------------------- /seld/methods/utils/SELD_evaluation_metrics_2020.py: -------------------------------------------------------------------------------- 1 | # 2 | # Implements the localization and detection metrics proposed in the paper 3 | # 4 | # Joint Measurement of Localization and Detection of Sound Events 5 | # Annamaria Mesaros, Sharath Adavanne, Archontis Politis, Toni Heittola, Tuomas Virtanen 6 | # WASPAA 2019 7 | # 8 | # 9 | # This script has MIT license 10 | # 11 | 12 | import numpy as np 13 | from IPython import embed 14 | eps = np.finfo(np.float).eps 15 | from scipy.optimize import linear_sum_assignment 16 | 17 | 18 | class SELDMetrics(object): 19 | def __init__(self, doa_threshold=20, nb_classes=11): 20 | ''' 21 | This class implements both the class-sensitive localization and location-sensitive detection metrics. 22 | Additionally, based on the user input, the corresponding averaging is performed within the segment. 23 | 24 | :param nb_classes: Number of sound classes. In the paper, nb_classes = 11 25 | :param doa_thresh: DOA threshold for location sensitive detection. 26 | ''' 27 | 28 | self._TP = 0 29 | self._FP = 0 30 | self._TN = 0 31 | self._FN = 0 32 | 33 | self._S = 0 34 | self._D = 0 35 | self._I = 0 36 | 37 | self._Nref = 0 38 | self._Nsys = 0 39 | 40 | self._total_DE = 0 41 | self._DE_TP = 0 42 | 43 | self._spatial_T = doa_threshold 44 | self._nb_classes = nb_classes 45 | 46 | def compute_seld_scores(self): 47 | ''' 48 | Collect the final SELD scores 49 | 50 | :return: returns both location-sensitive detection scores and class-sensitive localization scores 51 | ''' 52 | 53 | # Location-senstive detection performance 54 | ER = (self._S + self._D + self._I) / float(self._Nref + eps) 55 | 56 | prec = float(self._TP) / float(self._Nsys + eps) 57 | recall = float(self._TP) / float(self._Nref + eps) 58 | F = 2 * prec * recall / (prec + recall + eps) 59 | 60 | # Class-sensitive localization performance 61 | if self._DE_TP: 62 | DE = self._total_DE / float(self._DE_TP + eps) 63 | else: 64 | # When the total number of prediction is zero 65 | DE = 180 66 | 67 | DE_prec = float(self._DE_TP) / float(self._Nsys + eps) 68 | DE_recall = float(self._DE_TP) / float(self._Nref + eps) 69 | DE_F = 2 * DE_prec * DE_recall / (DE_prec + DE_recall + eps) 70 | 71 | return ER, F, DE, DE_F 72 | 73 | def update_seld_scores_xyz(self, pred, gt): 74 | ''' 75 | Implements the spatial error averaging according to equation [5] in the paper, using Cartesian distance 76 | 77 | :param pred: dictionary containing class-wise prediction results for each N-seconds segment block 78 | :param gt: dictionary containing class-wise groundtruth for each N-seconds segment block 79 | ''' 80 | for block_cnt in range(len(gt.keys())): 81 | # print('\nblock_cnt', block_cnt, end='') 82 | loc_FN, loc_FP = 0, 0 83 | for class_cnt in range(self._nb_classes): 84 | # print('\tclass:', class_cnt, end='') 85 | # Counting the number of ref and sys outputs should include the number of 
tracks for each class in the segment 86 | if class_cnt in gt[block_cnt]: 87 | self._Nref += 1 88 | if class_cnt in pred[block_cnt]: 89 | self._Nsys += 1 90 | 91 | if class_cnt in gt[block_cnt] and class_cnt in pred[block_cnt]: 92 | # True positives or False negative case 93 | 94 | # NOTE: For multiple tracks per class, identify multiple tracks using hungarian algorithm and then 95 | # calculate the spatial distance using the following code. In the current code, if there are multiple 96 | # tracks of the same class in a frame we are calculating the least cost between the groundtruth and predicted and using it. 97 | 98 | total_spatial_dist = 0 99 | total_framewise_matching_doa = 0 100 | gt_ind_list = gt[block_cnt][class_cnt][0][0] 101 | pred_ind_list = pred[block_cnt][class_cnt][0][0] 102 | for gt_ind, gt_val in enumerate(gt_ind_list): 103 | if gt_val in pred_ind_list: 104 | total_framewise_matching_doa += 1 105 | pred_ind = pred_ind_list.index(gt_val) 106 | 107 | gt_arr = np.array(gt[block_cnt][class_cnt][0][1][gt_ind]) 108 | pred_arr = np.array(pred[block_cnt][class_cnt][0][1][pred_ind]) 109 | 110 | if gt_arr.shape[0]==1 and pred_arr.shape[0]==1: 111 | total_spatial_dist += distance_between_cartesian_coordinates(gt_arr[0][0], gt_arr[0][1], gt_arr[0][2], pred_arr[0][0], pred_arr[0][1], pred_arr[0][2]) 112 | else: 113 | total_spatial_dist += least_distance_between_gt_pred(gt_arr, pred_arr) 114 | 115 | if total_spatial_dist == 0 and total_framewise_matching_doa == 0: 116 | loc_FN += 1 117 | self._FN += 1 118 | else: 119 | avg_spatial_dist = (total_spatial_dist / total_framewise_matching_doa) 120 | 121 | self._total_DE += avg_spatial_dist 122 | self._DE_TP += 1 123 | 124 | if avg_spatial_dist <= self._spatial_T: 125 | self._TP += 1 126 | else: 127 | loc_FN += 1 128 | self._FN += 1 129 | elif class_cnt in gt[block_cnt] and class_cnt not in pred[block_cnt]: 130 | # False negative 131 | loc_FN += 1 132 | self._FN += 1 133 | elif class_cnt not in gt[block_cnt] and class_cnt in pred[block_cnt]: 134 | # False positive 135 | loc_FP += 1 136 | self._FP += 1 137 | elif class_cnt not in gt[block_cnt] and class_cnt not in pred[block_cnt]: 138 | # True negative 139 | self._TN += 1 140 | 141 | self._S += np.minimum(loc_FP, loc_FN) 142 | self._D += np.maximum(0, loc_FN - loc_FP) 143 | self._I += np.maximum(0, loc_FP - loc_FN) 144 | return 145 | 146 | def update_seld_scores(self, pred_deg, gt_deg): 147 | ''' 148 | Implements the spatial error averaging according to equation [5] in the paper, using Polar distance 149 | Expects the angles in degrees 150 | 151 | :param pred_deg: dictionary containing class-wise prediction results for each N-seconds segment block 152 | :param gt_deg: dictionary containing class-wise groundtruth for each N-seconds segment block 153 | ''' 154 | for block_cnt in range(len(gt_deg.keys())): 155 | # print('\nblock_cnt', block_cnt, end='') 156 | loc_FN, loc_FP = 0, 0 157 | for class_cnt in range(self._nb_classes): 158 | # print('\tclass:', class_cnt, end='') 159 | # Counting the number of ref and sys outputs should include the number of tracks for each class in the segment 160 | if class_cnt in gt_deg[block_cnt]: 161 | self._Nref += 1 162 | if class_cnt in pred_deg[block_cnt]: 163 | self._Nsys += 1 164 | 165 | if class_cnt in gt_deg[block_cnt] and class_cnt in pred_deg[block_cnt]: 166 | # True positives or False negative case 167 | 168 | # NOTE: For multiple tracks per class, identify multiple tracks using hungarian algorithm and then 169 | # calculate the spatial distance using the 
following code. In the current code, if there are multiple 170 | # tracks of the same class in a frame we are calculating the least cost between the groundtruth and predicted and using it. 171 | total_spatial_dist = 0 172 | total_framewise_matching_doa = 0 173 | gt_ind_list = gt_deg[block_cnt][class_cnt][0][0] 174 | pred_ind_list = pred_deg[block_cnt][class_cnt][0][0] 175 | for gt_ind, gt_val in enumerate(gt_ind_list): 176 | if gt_val in pred_ind_list: 177 | total_framewise_matching_doa += 1 178 | pred_ind = pred_ind_list.index(gt_val) 179 | 180 | gt_arr = np.array(gt_deg[block_cnt][class_cnt][0][1][gt_ind]) * np.pi / 180 181 | pred_arr = np.array(pred_deg[block_cnt][class_cnt][0][1][pred_ind]) * np.pi / 180 182 | if gt_arr.shape[0]==1 and pred_arr.shape[0]==1: 183 | total_spatial_dist += distance_between_spherical_coordinates_rad(gt_arr[0][0], gt_arr[0][1], pred_arr[0][0], pred_arr[0][1]) 184 | else: 185 | total_spatial_dist += least_distance_between_gt_pred(gt_arr, pred_arr) 186 | 187 | if total_spatial_dist == 0 and total_framewise_matching_doa == 0: 188 | loc_FN += 1 189 | self._FN += 1 190 | else: 191 | avg_spatial_dist = (total_spatial_dist / total_framewise_matching_doa) 192 | 193 | self._total_DE += avg_spatial_dist 194 | self._DE_TP += 1 195 | 196 | if avg_spatial_dist <= self._spatial_T: 197 | self._TP += 1 198 | else: 199 | loc_FN += 1 200 | self._FN += 1 201 | elif class_cnt in gt_deg[block_cnt] and class_cnt not in pred_deg[block_cnt]: 202 | # False negative 203 | loc_FN += 1 204 | self._FN += 1 205 | elif class_cnt not in gt_deg[block_cnt] and class_cnt in pred_deg[block_cnt]: 206 | # False positive 207 | loc_FP += 1 208 | self._FP += 1 209 | elif class_cnt not in gt_deg[block_cnt] and class_cnt not in pred_deg[block_cnt]: 210 | # True negative 211 | self._TN += 1 212 | 213 | self._S += np.minimum(loc_FP, loc_FN) 214 | self._D += np.maximum(0, loc_FN - loc_FP) 215 | self._I += np.maximum(0, loc_FP - loc_FN) 216 | return 217 | 218 | 219 | def distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2): 220 | """ 221 | Angular distance between two spherical coordinates 222 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance 223 | 224 | :return: angular distance in degrees 225 | """ 226 | dist = np.sin(ele1) * np.sin(ele2) + np.cos(ele1) * np.cos(ele2) * np.cos(np.abs(az1 - az2)) 227 | # Making sure the dist values are in -1 to 1 range, else np.arccos kills the job 228 | dist = np.clip(dist, -1, 1) 229 | dist = np.arccos(dist) * 180 / np.pi 230 | return dist 231 | 232 | 233 | def distance_between_cartesian_coordinates(x1, y1, z1, x2, y2, z2): 234 | """ 235 | Angular distance between two cartesian coordinates 236 | MORE: https://en.wikipedia.org/wiki/Great-circle_distance 237 | Check 'From chord length' section 238 | 239 | :return: angular distance in degrees 240 | """ 241 | # Normalize the Cartesian vectors 242 | N1 = np.sqrt(x1**2 + y1**2 + z1**2 + 1e-10) 243 | N2 = np.sqrt(x2**2 + y2**2 + z2**2 + 1e-10) 244 | x1, y1, z1, x2, y2, z2 = x1/N1, y1/N1, z1/N1, x2/N2, y2/N2, z2/N2 245 | 246 | #Compute the distance 247 | dist = x1*x2 + y1*y2 + z1*z2 248 | dist = np.clip(dist, -1, 1) 249 | dist = np.arccos(dist) * 180 / np.pi 250 | return dist 251 | 252 | 253 | def least_distance_between_gt_pred(gt_list, pred_list): 254 | """ 255 |         Shortest distance between two sets of DOA coordinates. 
Given a set of groundtruth coordinates, 256 |         and its respective predicted coordinates, we calculate the distance between each of the 257 |         coordinate pairs resulting in a matrix of distances, where one axis represents the number of groundtruth 258 |         coordinates and the other the predicted coordinates. The number of estimated peaks need not be the same as in 259 |         groundtruth, thus the distance matrix is not always a square matrix. We use the hungarian algorithm to find the 260 |         least cost in this distance matrix. 261 |         :param gt_list_xyz: list of ground-truth Cartesian or Polar coordinates in Radians 262 |         :param pred_list_xyz: list of predicted Carteisan or Polar coordinates in Radians 263 |         :return: cost -  distance 264 |         :return: less - number of DOA's missed 265 |         :return: extra - number of DOA's over-estimated 266 |     """ 267 | gt_len, pred_len = gt_list.shape[0], pred_list.shape[0] 268 | ind_pairs = np.array([[x, y] for y in range(pred_len) for x in range(gt_len)]) 269 | cost_mat = np.zeros((gt_len, pred_len)) 270 | 271 | if gt_len and pred_len: 272 | if len(gt_list[0]) == 3: #Cartesian 273 | x1, y1, z1, x2, y2, z2 = gt_list[ind_pairs[:, 0], 0], gt_list[ind_pairs[:, 0], 1], gt_list[ind_pairs[:, 0], 2], pred_list[ind_pairs[:, 1], 0], pred_list[ind_pairs[:, 1], 1], pred_list[ind_pairs[:, 1], 2] 274 | cost_mat[ind_pairs[:, 0], ind_pairs[:, 1]] = distance_between_cartesian_coordinates(x1, y1, z1, x2, y2, z2) 275 | else: 276 | az1, ele1, az2, ele2 = gt_list[ind_pairs[:, 0], 0], gt_list[ind_pairs[:, 0], 1], pred_list[ind_pairs[:, 1], 0], pred_list[ind_pairs[:, 1], 1] 277 | cost_mat[ind_pairs[:, 0], ind_pairs[:, 1]] = distance_between_spherical_coordinates_rad(az1, ele1, az2, ele2) 278 | 279 | row_ind, col_ind = linear_sum_assignment(cost_mat) 280 | cost = cost_mat[row_ind, col_ind].sum() 281 | return cost 282 | 283 | 284 | def early_stopping_metric(sed_error, doa_error): 285 | """ 286 | Compute early stopping metric from sed and doa errors. 287 | 288 | :param sed_error: [error rate (0 to 1 range), f score (0 to 1 range)] 289 | :param doa_error: [doa error (in degrees), frame recall (0 to 1 range)] 290 | :return: early stopping metric result 291 | """ 292 | seld_metric = np.mean([ 293 | sed_error[0], 294 | 1 - sed_error[1], 295 | doa_error[0]/180, 296 | 1 - doa_error[1]] 297 | ) 298 | return seld_metric 299 | -------------------------------------------------------------------------------- /seld/methods/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/seld/methods/utils/__init__.py -------------------------------------------------------------------------------- /seld/methods/utils/data_utilities.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import torch 4 | 5 | 6 | def _segment_index(x, chunklen, hoplen, last_frame_always_paddding=False): 7 | """Segment input x with chunklen, hoplen parameters. Return 8 | 9 | Args: 10 | x: input, time domain or feature domain (channels, time) 11 | chunklen: 12 | hoplen: 13 | last_frame_always_paddding: to decide if always padding for the last frame 14 | 15 | Return: 16 | segmented_indexes: [(begin_index, end_index), (begin_index, end_index), ...] 17 | segmented_pad_width: [(before, after), (before, after), ...] 
18 | """ 19 | x_len = x.shape[1] 20 | 21 | segmented_indexes = [] 22 | segmented_pad_width = [] 23 | if x_len < chunklen: 24 | begin_index = 0 25 | end_index = x_len 26 | pad_width_before = 0 27 | pad_width_after = chunklen - x_len 28 | segmented_indexes.append((begin_index, end_index)) 29 | segmented_pad_width.append((pad_width_before, pad_width_after)) 30 | return segmented_indexes, segmented_pad_width 31 | 32 | n_frames = 1 + (x_len - chunklen) // hoplen 33 | for n in range(n_frames): 34 | begin_index = n * hoplen 35 | end_index = n * hoplen + chunklen 36 | segmented_indexes.append((begin_index, end_index)) 37 | pad_width_before = 0 38 | pad_width_after = 0 39 | segmented_pad_width.append((pad_width_before, pad_width_after)) 40 | 41 | if (n_frames - 1) * hoplen + chunklen == x_len: 42 | return segmented_indexes, segmented_pad_width 43 | 44 | # the last frame 45 | if last_frame_always_paddding: 46 | begin_index = n_frames * hoplen 47 | end_index = x_len 48 | pad_width_before = 0 49 | pad_width_after = chunklen - (x_len - n_frames * hoplen) 50 | else: 51 | if x_len - n_frames * hoplen >= chunklen // 2: 52 | begin_index = n_frames * hoplen 53 | end_index = x_len 54 | pad_width_before = 0 55 | pad_width_after = chunklen - (x_len - n_frames * hoplen) 56 | else: 57 | begin_index = x_len - chunklen 58 | end_index = x_len 59 | pad_width_before = 0 60 | pad_width_after = 0 61 | segmented_indexes.append((begin_index, end_index)) 62 | segmented_pad_width.append((pad_width_before, pad_width_after)) 63 | 64 | return segmented_indexes, segmented_pad_width 65 | 66 | 67 | def load_dcase_format(meta_path, frame_begin_index=0, frame_length=600, num_classes=14, set_type='gt'): 68 | """ Load meta into dcase format 69 | 70 | Args: 71 | meta_path (Path obj): path of meta file 72 | frame_begin_index (int): frame begin index, for concatenating labels 73 | frame_length (int): frame length in a file 74 | num_classes (int): number of classes 75 | Output: 76 | output_dict: return a dict containing dcase output format 77 | output_dict[frame-containing-events] = [[class_index_1, azi_1 in degree, ele_1 in degree], [class_index_2, azi_2 in degree, ele_2 in degree]] 78 | sed_metrics2019: (frame, num_classes) 79 | doa_metrics2019: (frame, 2*num_classes), with (frame, 0:num_classes) represents azimuth, (frame, num_classes:2*num_classes) represents elevation 80 | both are in radiance 81 | """ 82 | df = pd.read_csv(meta_path, header=None) 83 | 84 | output_dict = {} 85 | sed_metrics2019 = np.zeros((frame_length, num_classes)) 86 | doa_metrics2019 = np.zeros((frame_length, 2*num_classes)) 87 | for row in df.iterrows(): 88 | frame_idx = row[1][0] 89 | frame_idx2020 = frame_idx + frame_begin_index 90 | event_idx = row[1][1] 91 | if set_type == 'gt': 92 | azi = row[1][3] 93 | ele = row[1][4] 94 | elif set_type == 'pred': 95 | azi = row[1][2] 96 | ele = row[1][3] 97 | if frame_idx2020 not in output_dict: 98 | output_dict[frame_idx2020] = [] 99 | output_dict[frame_idx2020].append([event_idx, azi, ele]) 100 | sed_metrics2019[frame_idx, event_idx] = 1.0 101 | doa_metrics2019[frame_idx, event_idx], doa_metrics2019[frame_idx, event_idx + num_classes] \ 102 | = azi * np.pi / 180.0, ele * np.pi / 180.0 103 | return output_dict, sed_metrics2019, doa_metrics2019 104 | 105 | 106 | def to_metrics2020_format(label_dict, num_frames, label_resolution): 107 | """Collect class-wise sound event location information in segments of length 1s (according to DCASE2020) from reference dataset 108 | 109 | Reference: 110 | 
https://github.com/sharathadavanne/seld-dcase2020/blob/74a0e1db61cee32c19ea9dde87ba1a5389eb9a85/cls_feature_class.py#L312 111 | Args: 112 | label_dict: Dictionary containing frame-wise sound event time and location information. Dcase format. 113 | num_frames: Total number of frames in the recording. 114 | label_resolution: Groundtruth label resolution. 115 | Output: 116 | output_dict: Dictionary containing class-wise sound event location information in each segment of audio 117 | dictionary_name[segment-index][class-index] = list(frame-cnt-within-segment, azimuth in degree, elevation in degree) 118 | """ 119 | 120 | num_label_frames_1s = int(1 / label_resolution) 121 | num_blocks = int(np.ceil(num_frames / float(num_label_frames_1s))) 122 | output_dict = {x: {} for x in range(num_blocks)} 123 | for n_frame in range(0, num_frames, num_label_frames_1s): 124 | # Collect class-wise information for each block 125 | # [class][frame] = 126 | # Data structure supports multi-instance occurence of same class 127 | n_block = n_frame // num_label_frames_1s 128 | loc_dict = {} 129 | for audio_frame in range(n_frame, n_frame + num_label_frames_1s): 130 | if audio_frame not in label_dict: 131 | continue 132 | for value in label_dict[audio_frame]: 133 | if value[0] not in loc_dict: 134 | loc_dict[value[0]] = {} 135 | 136 | block_frame = audio_frame - n_frame 137 | if block_frame not in loc_dict[value[0]]: 138 | loc_dict[value[0]][block_frame] = [] 139 | loc_dict[value[0]][block_frame].append(value[1:]) 140 | 141 | # Update the block wise details collected above in a global structure 142 | for n_class in loc_dict: 143 | if n_class not in output_dict[n_block]: 144 | output_dict[n_block][n_class] = [] 145 | 146 | keys = [k for k in loc_dict[n_class]] 147 | values = [loc_dict[n_class][k] for k in loc_dict[n_class]] 148 | 149 | output_dict[n_block][n_class].append([keys, values]) 150 | 151 | return output_dict 152 | 153 | 154 | -------------------------------------------------------------------------------- /seld/methods/utils/loss_utilities.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | eps = torch.finfo(torch.float32).eps 5 | 6 | 7 | class MSELoss: 8 | def __init__(self, reduction='mean'): 9 | self.reduction = reduction 10 | self.name = 'loss_MSE' 11 | if self.reduction != 'PIT': 12 | self.loss = nn.MSELoss(reduction='mean') 13 | else: 14 | self.loss = nn.MSELoss(reduction='none') 15 | 16 | def calculate_loss(self, pred, target): 17 | if self.reduction != 'PIT': 18 | return self.loss(pred, target) 19 | else: 20 | return self.loss(pred, target).mean(dim=tuple(range(2, pred.ndim))) 21 | 22 | 23 | class BCEWithLogitsLoss: 24 | def __init__(self, reduction='mean', pos_weight=None): 25 | self.reduction = reduction 26 | self.name = 'loss_BCEWithLogits' 27 | if self.reduction != 'PIT': 28 | self.loss = nn.BCEWithLogitsLoss(reduction=self.reduction, pos_weight=pos_weight) 29 | else: 30 | self.loss = nn.BCEWithLogitsLoss(reduction='none', pos_weight=pos_weight) 31 | 32 | def calculate_loss(self, pred, target): 33 | if self.reduction != 'PIT': 34 | return self.loss(pred, target) 35 | else: 36 | return self.loss(pred, target).mean(dim=tuple(range(2, pred.ndim))) 37 | 38 | -------------------------------------------------------------------------------- /seld/methods/utils/model_utilities.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | import 
numpy as np 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | 8 | 9 | def init_layer(layer, nonlinearity='leaky_relu'): 10 | ''' 11 | Initialize a layer 12 | ''' 13 | classname = layer.__class__.__name__ 14 | if (classname.find('Conv') != -1) or (classname.find('Linear') != -1): 15 | nn.init.kaiming_uniform_(layer.weight, nonlinearity=nonlinearity) 16 | if hasattr(layer, 'bias'): 17 | if layer.bias is not None: 18 | nn.init.constant_(layer.bias, 0.0) 19 | elif classname.find('BatchNorm') != -1: 20 | nn.init.normal_(layer.weight, 1.0, 0.02) 21 | nn.init.constant_(layer.bias, 0.0) 22 | 23 | 24 | class DoubleConv(nn.Module): 25 | def __init__(self, in_channels, out_channels, 26 | kernel_size=(3,3), stride=(1,1), padding=(1,1), 27 | dilation=1, bias=False): 28 | super().__init__() 29 | 30 | self.double_conv = nn.Sequential( 31 | nn.Conv2d(in_channels=in_channels, 32 | out_channels=out_channels, 33 | kernel_size=kernel_size, stride=stride, 34 | padding=padding, dilation=dilation, bias=bias), 35 | nn.BatchNorm2d(out_channels), 36 | nn.ReLU(inplace=True), 37 | # nn.LeakyReLU(negative_slope=0.1, inplace=True), 38 | nn.Conv2d(in_channels=out_channels, 39 | out_channels=out_channels, 40 | kernel_size=kernel_size, stride=stride, 41 | padding=padding, dilation=dilation, bias=bias), 42 | nn.BatchNorm2d(out_channels), 43 | nn.ReLU(inplace=True), 44 | # nn.LeakyReLU(negative_slope=0.1, inplace=True), 45 | ) 46 | 47 | self.init_weights() 48 | 49 | def init_weights(self): 50 | for layer in self.double_conv: 51 | init_layer(layer) 52 | 53 | def forward(self, x): 54 | x = self.double_conv(x) 55 | 56 | return x 57 | 58 | 59 | class PositionalEncoding(nn.Module): 60 | def __init__(self, pos_len, d_model=512, pe_type='t', dropout=0.0): 61 | """ Positional encoding using sin and cos 62 | 63 | Args: 64 | pos_len: positional length 65 | d_model: number of feature maps 66 | pe_type: 't' | 'f' , time domain, frequency domain 67 | dropout: dropout probability 68 | """ 69 | super().__init__() 70 | 71 | self.pe_type = pe_type 72 | pe = torch.zeros(pos_len, d_model) 73 | pos = torch.arange(0, pos_len).float().unsqueeze(1) 74 | div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)) 75 | pe[:, 0::2] = 0.1 * torch.sin(pos * div_term) 76 | pe[:, 1::2] = 0.1 * torch.cos(pos * div_term) 77 | pe = pe.unsqueeze(0).transpose(1, 2) # (N, C, T) 78 | self.register_buffer('pe', pe) 79 | self.dropout = nn.Dropout(p=dropout) 80 | 81 | def forward(self, x): 82 | # x is (N, C, T, F) or (N, C, T) or (N, C, F) 83 | if x.ndim == 4: 84 | if self.pe_type == 't': 85 | pe = self.pe.unsqueeze(3) 86 | x += pe[:, :, :x.shape[2]] 87 | elif self.pe_type == 'f': 88 | pe = self.pe.unsqueeze(2) 89 | x += pe[:, :, :, :x.shape[3]] 90 | elif x.ndim == 3: 91 | x += self.pe[:, :, :x.shape[2]] 92 | return self.dropout(x) 93 | 94 | -------------------------------------------------------------------------------- /seld/methods/utils/stft.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | import librosa 4 | import numpy as np 5 | import torch 6 | import torch.nn as nn 7 | import torch.nn.functional as F 8 | from librosa import ParameterError 9 | from torch.nn.parameter import Parameter 10 | 11 | eps = torch.finfo(torch.float32).eps 12 | 13 | class DFTBase(nn.Module): 14 | def __init__(self): 15 | """Base class for DFT and IDFT matrix""" 16 | super().__init__() 17 | 18 | def dft_matrix(self, n): 19 | (x, y) = np.meshgrid(np.arange(n), 
np.arange(n)) 20 | omega = np.exp(-2 * np.pi * 1j / n) 21 | W = np.power(omega, x * y) 22 | return W 23 | 24 | def idft_matrix(self, n): 25 | (x, y) = np.meshgrid(np.arange(n), np.arange(n)) 26 | omega = np.exp(2 * np.pi * 1j / n) 27 | W = np.power(omega, x * y) 28 | return W 29 | 30 | 31 | class DFT(DFTBase): 32 | def __init__(self, n, norm): 33 | """Calculate DFT, IDFT, RDFT, IRDFT. 34 | Args: 35 | n: fft window size 36 | norm: None | 'ortho' 37 | """ 38 | super().__init__() 39 | 40 | self.W = self.dft_matrix(n) 41 | self.inv_W = self.idft_matrix(n) 42 | 43 | self.W_real = torch.Tensor(np.real(self.W)) 44 | self.W_imag = torch.Tensor(np.imag(self.W)) 45 | self.inv_W_real = torch.Tensor(np.real(self.inv_W)) 46 | self.inv_W_imag = torch.Tensor(np.imag(self.inv_W)) 47 | 48 | self.n = n 49 | self.norm = norm 50 | 51 | def dft(self, x_real, x_imag): 52 | """Calculate DFT of signal. 53 | Args: 54 | x_real: (n,), signal real part 55 | x_imag: (n,), signal imag part 56 | Returns: 57 | z_real: (n,), output real part 58 | z_imag: (n,), output imag part 59 | """ 60 | z_real = torch.matmul(x_real, self.W_real) - torch.matmul(x_imag, self.W_imag) 61 | z_imag = torch.matmul(x_imag, self.W_real) + torch.matmul(x_real, self.W_imag) 62 | 63 | if self.norm is None: 64 | pass 65 | elif self.norm == 'ortho': 66 | z_real /= math.sqrt(self.n) 67 | z_imag /= math.sqrt(self.n) 68 | 69 | return z_real, z_imag 70 | 71 | def idft(self, x_real, x_imag): 72 | """Calculate IDFT of signal. 73 | Args: 74 | x_real: (n,), signal real part 75 | x_imag: (n,), signal imag part 76 | Returns: 77 | z_real: (n,), output real part 78 | z_imag: (n,), output imag part 79 | """ 80 | z_real = torch.matmul(x_real, self.inv_W_real) - torch.matmul(x_imag, self.inv_W_imag) 81 | z_imag = torch.matmul(x_imag, self.inv_W_real) + torch.matmul(x_real, self.inv_W_imag) 82 | 83 | if self.norm is None: 84 | z_real /= self.n 85 | elif self.norm == 'ortho': 86 | z_real /= math.sqrt(n) 87 | z_imag /= math.sqrt(n) 88 | 89 | return z_real, z_imag 90 | 91 | def rdft(self, x_real): 92 | """Calculate right DFT of signal. 93 | Args: 94 | x_real: (n,), signal real part 95 | x_imag: (n,), signal imag part 96 | Returns: 97 | z_real: (n // 2 + 1,), output real part 98 | z_imag: (n // 2 + 1,), output imag part 99 | """ 100 | n_rfft = self.n // 2 + 1 101 | z_real = torch.matmul(x_real, self.W_real[..., 0 : n_rfft]) 102 | z_imag = torch.matmul(x_real, self.W_imag[..., 0 : n_rfft]) 103 | 104 | if self.norm is None: 105 | pass 106 | elif self.norm == 'ortho': 107 | z_real /= math.sqrt(self.n) 108 | z_imag /= math.sqrt(self.n) 109 | 110 | return z_real, z_imag 111 | 112 | def irdft(self, x_real, x_imag): 113 | """Calculate inverse right DFT of signal. 114 | Args: 115 | x_real: (n // 2 + 1,), signal real part 116 | x_imag: (n // 2 + 1,), signal imag part 117 | Returns: 118 | z_real: (n,), output real part 119 | z_imag: (n,), output imag part 120 | """ 121 | n_rfft = self.n // 2 + 1 122 | 123 | flip_x_real = torch.flip(x_real, dims=(-1,)) 124 | x_real = torch.cat((x_real, flip_x_real[..., 1 : n_rfft - 1]), dim=-1) 125 | 126 | flip_x_imag = torch.flip(x_imag, dims=(-1,)) 127 | x_imag = torch.cat((x_imag, -1. 
* flip_x_imag[..., 1 : n_rfft - 1]), dim=-1) 128 | 129 | z_real = torch.matmul(x_real, self.inv_W_real) - torch.matmul(x_imag, self.inv_W_imag) 130 | 131 | if self.norm is None: 132 | z_real /= self.n 133 | elif self.norm == 'ortho': 134 | z_real /= math.sqrt(n) 135 | 136 | return z_real 137 | 138 | 139 | class STFT(DFTBase): 140 | def __init__(self, n_fft=2048, hop_length=None, win_length=None, 141 | window='hann', center=True, pad_mode='reflect', freeze_parameters=True): 142 | """Implementation of STFT with Conv1d. The function has the same output 143 | of librosa.core.stft 144 | """ 145 | super().__init__() 146 | 147 | assert pad_mode in ['constant', 'reflect'] 148 | 149 | self.n_fft = n_fft 150 | self.center = center 151 | self.pad_mode = pad_mode 152 | 153 | # By default, use the entire frame 154 | if win_length is None: 155 | win_length = n_fft 156 | 157 | # Set the default hop, if it's not already specified 158 | if hop_length is None: 159 | hop_length = int(win_length // 4) 160 | 161 | fft_window = librosa.filters.get_window(window, win_length, fftbins=True) 162 | 163 | # Pad the window out to n_fft size 164 | fft_window = librosa.util.pad_center(fft_window, n_fft) 165 | 166 | # DFT & IDFT matrix 167 | self.W = self.dft_matrix(n_fft) 168 | 169 | out_channels = n_fft // 2 + 1 170 | 171 | self.conv_real = nn.Conv1d(in_channels=1, out_channels=out_channels, 172 | kernel_size=n_fft, stride=hop_length, padding=0, dilation=1, 173 | groups=1, bias=False) 174 | 175 | self.conv_imag = nn.Conv1d(in_channels=1, out_channels=out_channels, 176 | kernel_size=n_fft, stride=hop_length, padding=0, dilation=1, 177 | groups=1, bias=False) 178 | 179 | self.conv_real.weight.data = torch.Tensor( 180 | np.real(self.W[:, 0 : out_channels] * fft_window[:, None]).T)[:, None, :] 181 | # (n_fft // 2 + 1, 1, n_fft) 182 | 183 | self.conv_imag.weight.data = torch.Tensor( 184 | np.imag(self.W[:, 0 : out_channels] * fft_window[:, None]).T)[:, None, :] 185 | # (n_fft // 2 + 1, 1, n_fft) 186 | 187 | if freeze_parameters: 188 | for param in self.parameters(): 189 | param.requires_grad = False 190 | 191 | def forward(self, input): 192 | """input: (batch_size, num_channels, data_length) 193 | Returns: 194 | real: (batch_size, num_channels, time_steps, n_fft // 2 + 1) 195 | imag: (batch_size, num_channels, time_steps, n_fft // 2 + 1) 196 | """ 197 | _, num_channels, _ = input.shape 198 | 199 | real_out = [] 200 | imag_out = [] 201 | for n in range(num_channels): 202 | x = input[:, n, :][:, None, :] 203 | # (batch_size, 1, data_length) 204 | 205 | if self.center: 206 | x = F.pad(x, pad=(self.n_fft // 2, self.n_fft // 2), mode=self.pad_mode) 207 | 208 | real = self.conv_real(x) 209 | imag = self.conv_imag(x) 210 | # (batch_size, n_fft // 2 + 1, time_steps) 211 | 212 | real = real[:, None, :, :].transpose(2, 3) 213 | imag = imag[:, None, :, :].transpose(2, 3) 214 | # (batch_size, 1, time_steps, n_fft // 2 + 1) 215 | 216 | real_out.append(real) 217 | imag_out.append(imag) 218 | 219 | real_out = torch.cat(real_out, dim=1) 220 | imag_out = torch.cat(imag_out, dim=1) 221 | 222 | return real_out, imag_out 223 | 224 | 225 | def magphase(real, imag): 226 | mag = (real ** 2 + imag ** 2) ** 0.5 227 | cos = real / torch.clamp(mag, 1e-10, np.inf) 228 | sin = imag / torch.clamp(mag, 1e-10, np.inf) 229 | return mag, cos, sin 230 | 231 | 232 | class ISTFT(DFTBase): 233 | def __init__(self, n_fft=2048, hop_length=None, win_length=None, 234 | window='hann', center=True, pad_mode='reflect', freeze_parameters=True): 235 | """Implementation 
of ISTFT with Conv1d. The function has the same output 236 | of librosa.core.istft 237 | """ 238 | super().__init__() 239 | 240 | assert pad_mode in ['constant', 'reflect'] 241 | 242 | self.n_fft = n_fft 243 | self.hop_length = hop_length 244 | self.win_length = win_length 245 | self.window = window 246 | self.center = center 247 | self.pad_mode = pad_mode 248 | 249 | # By default, use the entire frame 250 | if win_length is None: 251 | win_length = n_fft 252 | 253 | # Set the default hop, if it's not already specified 254 | if hop_length is None: 255 | hop_length = int(win_length // 4) 256 | 257 | ifft_window = librosa.filters.get_window(window, win_length, fftbins=True) 258 | 259 | # Pad the window out to n_fft size 260 | ifft_window = librosa.util.pad_center(ifft_window, n_fft) 261 | 262 | # DFT & IDFT matrix 263 | self.W = self.idft_matrix(n_fft) / n_fft 264 | 265 | self.conv_real = nn.Conv1d(in_channels=n_fft, out_channels=n_fft, 266 | kernel_size=1, stride=1, padding=0, dilation=1, 267 | groups=1, bias=False) 268 | 269 | self.conv_imag = nn.Conv1d(in_channels=n_fft, out_channels=n_fft, 270 | kernel_size=1, stride=1, padding=0, dilation=1, 271 | groups=1, bias=False) 272 | 273 | 274 | self.conv_real.weight.data = torch.Tensor( 275 | np.real(self.W * ifft_window[None, :]).T)[:, :, None] 276 | # (n_fft // 2 + 1, 1, n_fft) 277 | 278 | self.conv_imag.weight.data = torch.Tensor( 279 | np.imag(self.W * ifft_window[None, :]).T)[:, :, None] 280 | # (n_fft // 2 + 1, 1, n_fft) 281 | 282 | if freeze_parameters: 283 | for param in self.parameters(): 284 | param.requires_grad = False 285 | 286 | def forward(self, real_stft, imag_stft, length): 287 | """input: (batch_size, num_channels, time_steps, n_fft // 2 + 1) 288 | Returns: 289 | real: (batch_size, num_channels, data_length) 290 | """ 291 | assert real_stft.ndimension() == 4 and imag_stft.ndimension() == 4 292 | device = next(self.parameters()).device 293 | batch_size, num_channels, _, _ = real_stft.shape 294 | 295 | wav_out = [] 296 | for n in range(num_channels): 297 | real_stft = real_stft[:, n, :, :].transpose(1, 2) 298 | imag_stft = imag_stft[:, n, :, :].transpose(1, 2) 299 | # (batch_size, n_fft // 2 + 1, time_steps) 300 | 301 | # Full stft 302 | full_real_stft = torch.cat((real_stft, torch.flip(real_stft[:, 1 : -1, :], dims=[1])), dim=1) 303 | full_imag_stft = torch.cat((imag_stft, - torch.flip(imag_stft[:, 1 : -1, :], dims=[1])), dim=1) 304 | 305 | # Reserve space for reconstructed waveform 306 | if length: 307 | if self.center: 308 | padded_length = length + int(self.n_fft) 309 | else: 310 | padded_length = length 311 | n_frames = min( 312 | real_stft.shape[2], int(np.ceil(padded_length / self.hop_length))) 313 | else: 314 | n_frames = real_stft.shape[2] 315 | 316 | expected_signal_len = self.n_fft + self.hop_length * (n_frames - 1) 317 | expected_signal_len = self.n_fft + self.hop_length * (n_frames - 1) 318 | y = torch.zeros(batch_size, expected_signal_len).to(device) 319 | 320 | # IDFT 321 | s_real = self.conv_real(full_real_stft) - self.conv_imag(full_imag_stft) 322 | 323 | # Overlap add 324 | for i in range(n_frames): 325 | y[:, i * self.hop_length : i * self.hop_length + self.n_fft] += s_real[:, :, i] 326 | 327 | ifft_window_sum = librosa.filters.window_sumsquare(self.window, n_frames, 328 | win_length=self.win_length, n_fft=self.n_fft, hop_length=self.hop_length) 329 | 330 | approx_nonzero_indices = np.where(ifft_window_sum > librosa.util.tiny(ifft_window_sum))[0] 331 | approx_nonzero_indices = 
torch.LongTensor(approx_nonzero_indices).to(device) 332 | ifft_window_sum = torch.Tensor(ifft_window_sum).to(device) 333 | 334 | y[:, approx_nonzero_indices] /= ifft_window_sum[approx_nonzero_indices][None, :] 335 | 336 | # Trim or pad to length 337 | if length is None: 338 | if self.center: 339 | y = y[:, self.n_fft // 2 : -self.n_fft // 2] 340 | else: 341 | if self.center: 342 | start = self.n_fft // 2 343 | else: 344 | start = 0 345 | 346 | y = y[:, start : start + length] 347 | (batch_size, len_y) = y.shape 348 | if y.shape[-1] < length: 349 | y = torch.cat((y, torch.zeros(batch_size, length - len_y).to(device)), dim=-1) 350 | 351 | wav_out.append(y) 352 | 353 | wav_out = torch.cat(wav_out, dim=1) 354 | 355 | return y 356 | 357 | 358 | def spectrogram_STFTInput(input, power=2.0): 359 | """ 360 | Input: 361 | real: (batch_size, num_channels, time_steps, n_fft // 2 + 1) 362 | imag: (batch_size, num_channels, time_steps, n_fft // 2 + 1) 363 | Returns: 364 | spectrogram: (batch_size, num_channels, time_steps, n_fft // 2 + 1) 365 | """ 366 | 367 | (real, imag) = input 368 | # (batch_size, num_channels, n_fft // 2 + 1, time_steps) 369 | 370 | spectrogram = real ** 2 + imag ** 2 371 | 372 | if power == 2.0: 373 | pass 374 | else: 375 | spectrogram = spectrogram ** (power / 2.0) 376 | 377 | return spectrogram 378 | 379 | 380 | class Spectrogram(nn.Module): 381 | def __init__(self, n_fft=2048, hop_length=None, win_length=None, 382 | window='hann', center=True, pad_mode='reflect', power=2.0, 383 | freeze_parameters=True): 384 | """Calculate spectrogram using pytorch. The STFT is implemented with 385 | Conv1d. The function has the same output of librosa.core.stft 386 | """ 387 | super().__init__() 388 | 389 | self.power = power 390 | 391 | self.stft = STFT(n_fft=n_fft, hop_length=hop_length, 392 | win_length=win_length, window=window, center=center, 393 | pad_mode=pad_mode, freeze_parameters=True) 394 | 395 | def forward(self, input): 396 | """input: (batch_size, num_channels, data_length) 397 | Returns: 398 | spectrogram: (batch_size, num_channels, time_steps, n_fft // 2 + 1) 399 | """ 400 | 401 | (real, imag) = self.stft.forward(input) 402 | # (batch_size, num_channels, n_fft // 2 + 1, time_steps) 403 | 404 | spectrogram = real ** 2 + imag ** 2 405 | 406 | if self.power == 2.0: 407 | pass 408 | else: 409 | spectrogram = spectrogram ** (self.power / 2.0) 410 | 411 | return spectrogram 412 | 413 | 414 | class LogmelFilterBank(nn.Module): 415 | def __init__(self, sr=32000, n_fft=2048, n_mels=64, fmin=50, fmax=14000, is_log=True, 416 | ref=1.0, amin=1e-10, top_db=80.0, freeze_parameters=True): 417 | """Calculate logmel spectrogram using pytorch. 
The mel filter bank is 418 | the pytorch implementation of as librosa.filters.mel 419 | """ 420 | super().__init__() 421 | 422 | self.is_log = is_log 423 | self.ref = ref 424 | self.amin = amin 425 | self.top_db = top_db 426 | 427 | self.melW = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, 428 | fmin=fmin, fmax=fmax).T 429 | # (n_fft // 2 + 1, mel_bins) 430 | 431 | self.melW = nn.Parameter(torch.Tensor(self.melW)) 432 | 433 | if freeze_parameters: 434 | for param in self.parameters(): 435 | param.requires_grad = False 436 | 437 | def forward(self, input): 438 | """input: (batch_size, num_channels, time_steps, freq_bins) 439 | 440 | Output: (batch_size, num_channels, time_steps, mel_bins) 441 | """ 442 | # Mel spectrogram 443 | mel_spectrogram = torch.matmul(input, self.melW) 444 | 445 | # Logmel spectrogram 446 | if self.is_log: 447 | output = self.power_to_db(mel_spectrogram) 448 | else: 449 | output = mel_spectrogram 450 | 451 | return output 452 | 453 | 454 | def power_to_db(self, input): 455 | """Power to db, this function is the pytorch implementation of 456 | librosa.core.power_to_lb 457 | """ 458 | ref_value = self.ref 459 | log_spec = 10.0 * torch.log10(torch.clamp(input, min=self.amin, max=np.inf)) 460 | log_spec -= 10.0 * np.log10(np.maximum(self.amin, ref_value)) 461 | 462 | if self.top_db is not None: 463 | if self.top_db < 0: 464 | raise ParameterError('top_db must be non-negative') 465 | log_spec = torch.clamp(log_spec, min=log_spec.max().item() - self.top_db, max=np.inf) 466 | 467 | return log_spec 468 | 469 | 470 | def intensityvector(input, melW): 471 | """Calculate intensity vector. Input is four channel stft of the signals. 472 | input: (stft_real, stft_imag) 473 | stft_real: (batch_size, 4, time_steps, freq_bins) 474 | stft_imag: (batch_size, 4, time_steps, freq_bins) 475 | out: 476 | intenVec: (batch_size, 3, time_steps, freq_bins) 477 | """ 478 | sig_real, sig_imag = input[0], input[1] 479 | Pref_real, Pref_imag = sig_real[:,0,...], sig_imag[:,0,...] 480 | Px_real, Px_imag = sig_real[:,1,...], sig_imag[:,1,...] 481 | Py_real, Py_imag = sig_real[:,2,...], sig_imag[:,2,...] 482 | Pz_real, Pz_imag = sig_real[:,3,...], sig_imag[:,3,...] 483 | 484 | IVx = Pref_real * Px_real + Pref_imag * Px_imag 485 | IVy = Pref_real * Py_real + Pref_imag * Py_imag 486 | IVz = Pref_real * Pz_real + Pref_imag * Pz_imag 487 | normal = torch.sqrt(IVx**2 + IVy**2 + IVz**2) + eps 488 | 489 | IVx_mel = torch.matmul(IVx / normal, melW) 490 | IVy_mel = torch.matmul(IVy / normal, melW) 491 | IVz_mel = torch.matmul(IVz / normal, melW) 492 | intenVec = torch.stack([IVx_mel, IVy_mel, IVz_mel], dim=1) 493 | 494 | return intenVec 495 | 496 | 497 | class Enframe(nn.Module): 498 | def __init__(self, frame_length=2048, hop_length=512): 499 | """Enframe a time sequence. 
This function is the pytorch implementation 500 | of librosa.util.frame 501 | """ 502 | super().__init__() 503 | 504 | ''' 505 | self.enframe_conv = nn.Conv1d(in_channels=1, out_channels=frame_length, 506 | kernel_size=frame_length, stride=hop_length, 507 | padding=frame_length // 2, bias=False) 508 | ''' 509 | self.enframe_conv = nn.Conv1d(in_channels=1, out_channels=frame_length, 510 | kernel_size=frame_length, stride=hop_length, 511 | padding=0, bias=False) 512 | 513 | self.enframe_conv.weight.data = torch.Tensor(torch.eye(frame_length)[:, None, :]) 514 | self.enframe_conv.weight.requires_grad = False 515 | 516 | def forward(self, input): 517 | """input: (batch_size, num_channels, samples) 518 | 519 | Output: (batch_size, num_channels, window_length, frames_num) 520 | """ 521 | _, num_channels, _ = input.shape 522 | 523 | output = [] 524 | for n in range(num_channels): 525 | output.append(self.enframe_conv(input[:, n, :][:, None, :])) 526 | 527 | output = torch.cat(output, dim=1) 528 | return output 529 | 530 | 531 | class Scalar(nn.Module): 532 | def __init__(self, scalar, freeze_parameters): 533 | super().__init__() 534 | 535 | self.scalar_mean = Parameter(torch.Tensor(scalar['mean'])) 536 | self.scalar_std = Parameter(torch.Tensor(scalar['std'])) 537 | 538 | if freeze_parameters: 539 | for param in self.parameters(): 540 | param.requires_grad = False 541 | 542 | def forward(self, input): 543 | return (input - self.scalar_mean) / self.scalar_std 544 | 545 | 546 | def debug(select, device): 547 | """Compare numpy + librosa and pytorch implementation result. For debug. 548 | Args: 549 | select: 'dft' | 'logmel' | 'logmel&iv' | 'logmel&gcc' 550 | device: 'cpu' | 'cuda' 551 | """ 552 | 553 | if select == 'dft': 554 | n = 10 555 | norm = None # None | 'ortho' 556 | np.random.seed(0) 557 | 558 | # Data 559 | np_data = np.random.uniform(-1, 1, n) 560 | pt_data = torch.Tensor(np_data) 561 | 562 | # Numpy FFT 563 | np_fft = np.fft.fft(np_data, norm=norm) 564 | np_ifft = np.fft.ifft(np_fft, norm=norm) 565 | np_rfft = np.fft.rfft(np_data, norm=norm) 566 | np_irfft = np.fft.ifft(np_rfft, norm=norm) 567 | 568 | # Pytorch FFT 569 | obj = DFT(n, norm) 570 | pt_dft = obj.dft(pt_data, torch.zeros_like(pt_data)) 571 | pt_idft = obj.idft(pt_dft[0], pt_dft[1]) 572 | pt_rdft = obj.rdft(pt_data) 573 | pt_irdft = obj.irdft(pt_rdft[0], pt_rdft[1]) 574 | 575 | print('Comparing librosa and pytorch implementation of DFT. 
All numbers ' 576 | 'below should be close to 0.') 577 | print(np.mean((np.abs(np.real(np_fft) - pt_dft[0].cpu().numpy())))) 578 | print(np.mean((np.abs(np.imag(np_fft) - pt_dft[1].cpu().numpy())))) 579 | 580 | print(np.mean((np.abs(np.real(np_ifft) - pt_idft[0].cpu().numpy())))) 581 | print(np.mean((np.abs(np.imag(np_ifft) - pt_idft[1].cpu().numpy())))) 582 | 583 | print(np.mean((np.abs(np.real(np_rfft) - pt_rdft[0].cpu().numpy())))) 584 | print(np.mean((np.abs(np.imag(np_rfft) - pt_rdft[1].cpu().numpy())))) 585 | 586 | print(np.mean(np.abs(np_data - pt_irdft.cpu().numpy()))) 587 | 588 | elif select == 'stft': 589 | data_length = 32000 590 | device = torch.device(device) 591 | np.random.seed(0) 592 | 593 | sample_rate = 16000 594 | n_fft = 1024 595 | hop_length = 250 596 | win_length = 1024 597 | window = 'hann' 598 | center = True 599 | dtype = np.complex64 600 | pad_mode = 'reflect' 601 | 602 | # Data 603 | np_data = np.random.uniform(-1, 1, data_length) 604 | pt_data = torch.Tensor(np_data).to(device) 605 | 606 | # Numpy stft matrix 607 | np_stft_matrix = librosa.core.stft(y=np_data, n_fft=n_fft, 608 | hop_length=hop_length, window=window, center=center).T 609 | 610 | # Pytorch stft matrix 611 | pt_stft_extractor = STFT(n_fft=n_fft, hop_length=hop_length, 612 | win_length=win_length, window=window, center=center, pad_mode=pad_mode, 613 | freeze_parameters=True) 614 | 615 | pt_stft_extractor.to(device) 616 | 617 | (pt_stft_real, pt_stft_imag) = pt_stft_extractor.forward(pt_data[None, None, :]) 618 | 619 | print('Comparing librosa and pytorch implementation of stft. All numbers ' 620 | 'below should be close to 0.') 621 | 622 | print(np.mean(np.abs(np.real(np_stft_matrix) - pt_stft_real.data.cpu().numpy()[0, 0]))) 623 | print(np.mean(np.abs(np.imag(np_stft_matrix) - pt_stft_imag.data.cpu().numpy()[0, 0]))) 624 | 625 | # Numpy istft 626 | np_istft_s = librosa.core.istft(stft_matrix=np_stft_matrix.T, 627 | hop_length=hop_length, window=window, center=center, length=data_length) 628 | 629 | # Pytorch istft 630 | pt_istft_extractor = ISTFT(n_fft=n_fft, hop_length=hop_length, 631 | win_length=win_length, window=window, center=center, pad_mode=pad_mode, 632 | freeze_parameters=True) 633 | pt_istft_extractor.to(device) 634 | 635 | # Recover from real and imag part 636 | pt_istft_s = pt_istft_extractor.forward(pt_stft_real, pt_stft_imag, data_length)[0, :] 637 | 638 | # Recover from magnitude and phase 639 | (pt_stft_mag, cos, sin) = magphase(pt_stft_real, pt_stft_imag) 640 | pt_istft_s2 = pt_istft_extractor.forward(pt_stft_mag * cos, pt_stft_mag * sin, data_length)[0, :] 641 | 642 | print(np.mean(np.abs(np_istft_s - pt_istft_s.data.cpu().numpy()))) 643 | print(np.mean(np.abs(np_data - pt_istft_s.data.cpu().numpy()))) 644 | print(np.mean(np.abs(np_data - pt_istft_s2.data.cpu().numpy()))) 645 | 646 | elif select == 'logmel': 647 | 648 | data_length = 4*32000 649 | norm = None # None | 'ortho' 650 | device = torch.device(device) 651 | np.random.seed(0) 652 | 653 | # Spectrogram parameters 654 | sample_rate = 32000 655 | n_fft = 1024 656 | hop_length = 320 657 | win_length = 1024 658 | window = 'hann' 659 | center = True 660 | dtype = np.complex64 661 | pad_mode = 'reflect' 662 | 663 | # Mel parameters 664 | n_mels = 128 665 | fmin = 50 666 | fmax = 14000 667 | ref = 1.0 668 | amin = 1e-10 669 | top_db = None 670 | 671 | # Data 672 | np_data = np.random.uniform(-1, 1, data_length) 673 | pt_data = torch.Tensor(np_data).to(device) 674 | 675 | print('Comparing librosa and pytorch implementation of 
logmel ' 676 | 'spectrogram. All numbers below should be close to 0.') 677 | 678 | # Numpy librosa 679 | np_stft_matrix = librosa.core.stft(y=np_data, n_fft=n_fft, hop_length=hop_length, 680 | win_length=win_length, window=window, center=center, dtype=dtype, 681 | pad_mode=pad_mode) 682 | 683 | np_pad = np.pad(np_data, int(n_fft // 2), mode=pad_mode) 684 | 685 | np_melW = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels, 686 | fmin=fmin, fmax=fmax).T 687 | 688 | np_mel_spectrogram = np.dot(np.abs(np_stft_matrix.T) ** 2, np_melW) 689 | 690 | np_logmel_spectrogram = librosa.core.power_to_db( 691 | np_mel_spectrogram, ref=ref, amin=amin, top_db=top_db) 692 | 693 | # Pytorch 694 | stft_extractor = STFT(n_fft=n_fft, hop_length=hop_length, 695 | win_length=win_length, window=window, center=center, pad_mode=pad_mode, 696 | freeze_parameters=True) 697 | 698 | logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=n_fft, 699 | n_mels=n_mels, fmin=fmin, fmax=fmax, ref=ref, amin=amin, 700 | top_db=top_db, freeze_parameters=True) 701 | 702 | stft_extractor.to(device) 703 | logmel_extractor.to(device) 704 | 705 | pt_pad = F.pad(pt_data[None, None, :], pad=(n_fft // 2, n_fft // 2), mode=pad_mode)[0, 0] 706 | print(np.mean(np.abs(np_pad - pt_pad.cpu().numpy()))) 707 | 708 | pt_stft_matrix_real = stft_extractor.conv_real(pt_pad[None, None, :])[0] 709 | pt_stft_matrix_imag = stft_extractor.conv_imag(pt_pad[None, None, :])[0] 710 | print(np.mean(np.abs(np.real(np_stft_matrix) - pt_stft_matrix_real.data.cpu().numpy()))) 711 | print(np.mean(np.abs(np.imag(np_stft_matrix) - pt_stft_matrix_imag.data.cpu().numpy()))) 712 | 713 | # Spectrogram 714 | spectrogram_extractor = Spectrogram(n_fft=n_fft, hop_length=hop_length, 715 | win_length=win_length, window=window, center=center, pad_mode=pad_mode, 716 | freeze_parameters=True) 717 | 718 | spectrogram_extractor.to(device) 719 | 720 | pt_spectrogram = spectrogram_extractor.forward(pt_data[None, None, :]) 721 | pt_mel_spectrogram = torch.matmul(pt_spectrogram, logmel_extractor.melW) 722 | print(np.mean(np.abs(np_mel_spectrogram - pt_mel_spectrogram.data.cpu().numpy()[0, 0]))) 723 | 724 | # Log mel spectrogram 725 | pt_logmel_spectrogram = logmel_extractor.forward(pt_spectrogram) 726 | print(np.mean(np.abs(np_logmel_spectrogram - pt_logmel_spectrogram[0, 0].data.cpu().numpy()))) 727 | 728 | elif select == 'enframe': 729 | data_length = 32000 730 | device = torch.device(device) 731 | np.random.seed(0) 732 | 733 | # Spectrogram parameters 734 | hop_length = 250 735 | win_length = 1024 736 | 737 | # Data 738 | np_data = np.random.uniform(-1, 1, data_length) 739 | pt_data = torch.Tensor(np_data).to(device) 740 | 741 | print('Comparing librosa and pytorch implementation of ' 742 | 'librosa.util.frame. 
All numbers below should be close to 0.') 743 | 744 | # Numpy librosa 745 | np_frames = librosa.util.frame(np_data, frame_length=win_length, 746 | hop_length=hop_length) 747 | 748 | # Pytorch 749 | pt_frame_extractor = Enframe(frame_length=win_length, hop_length=hop_length) 750 | pt_frame_extractor.to(device) 751 | 752 | pt_frames = pt_frame_extractor(pt_data[None, None, :]) 753 | print(np.mean(np.abs(np_frames - pt_frames.data.cpu().numpy()))) 754 | 755 | elif select == 'logmel&iv': 756 | data_size = (1, 4, 24000*3) 757 | device = torch.device(device) 758 | np.random.seed(0) 759 | 760 | # Stft parameters 761 | sample_rate = 24000 762 | n_fft = 1024 763 | hop_length = 240 764 | win_length = 1024 765 | window = 'hann' 766 | center = True 767 | dtype = np.complex64 768 | pad_mode = 'reflect' 769 | 770 | # Mel parameters 771 | n_mels = 128 772 | fmin = 50 773 | fmax = 10000 774 | ref = 1.0 775 | amin = 1e-10 776 | top_db = None 777 | 778 | # Data 779 | np_data = np.random.uniform(-1, 1, data_size) 780 | pt_data = torch.Tensor(np_data).to(device) 781 | 782 | # Numpy stft matrix 783 | np_stft_matrix = [] 784 | for chn in range(np_data.shape[1]): 785 | np_stft_matrix.append(librosa.core.stft(y=np_data[0,chn,:], n_fft=n_fft, 786 | hop_length=hop_length, window=window, center=center).T) 787 | np_stft_matrix = np.array(np_stft_matrix)[None,...] 788 | 789 | # Pytorch stft matrix 790 | pt_stft_extractor = STFT(n_fft=n_fft, hop_length=hop_length, 791 | win_length=win_length, window=window, center=center, pad_mode=pad_mode, 792 | freeze_parameters=True) 793 | pt_stft_extractor.to(device) 794 | (pt_stft_real, pt_stft_imag) = pt_stft_extractor(pt_data) 795 | print('Comparing librosa and pytorch implementation of intensity vector. All numbers ' 796 | 'below should be close to 0.') 797 | 798 | print(np.mean(np.abs(np.real(np_stft_matrix) - pt_stft_real.cpu().detach().numpy()))) 799 | print(np.mean(np.abs(np.imag(np_stft_matrix) - pt_stft_imag.cpu().detach().numpy()))) 800 | 801 | # Numpy logmel 802 | np_pad = np.pad(np_data, ((0,0), (0,0), (int(n_fft // 2),int(n_fft // 2))), mode=pad_mode) 803 | np_melW = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels, 804 | fmin=fmin, fmax=fmax).T 805 | np_mel_spectrogram = np.dot(np.abs(np_stft_matrix) ** 2, np_melW) 806 | np_logmel_spectrogram = librosa.core.power_to_db( 807 | np_mel_spectrogram, ref=ref, amin=amin, top_db=top_db) 808 | 809 | # Pytorch logmel 810 | pt_logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=n_fft, 811 | n_mels=n_mels, fmin=fmin, fmax=fmax, ref=ref, amin=amin, 812 | top_db=top_db, freeze_parameters=True) 813 | pt_logmel_extractor.to(device) 814 | pt_pad = F.pad(pt_data, pad=(n_fft // 2, n_fft // 2), mode=pad_mode) 815 | print(np.mean(np.abs(np_pad - pt_pad.cpu().numpy()))) 816 | pt_spectrogram = spectrogram_STFTInput((pt_stft_real, pt_stft_imag)) 817 | pt_mel_spectrogram = torch.matmul(pt_spectrogram, pt_logmel_extractor.melW) 818 | print(np.mean(np.abs(np_mel_spectrogram - pt_mel_spectrogram.cpu().detach().numpy()))) 819 | pt_logmel_spectrogram = pt_logmel_extractor(pt_spectrogram) 820 | print(np.mean(np.abs(np_logmel_spectrogram - pt_logmel_spectrogram.cpu().detach().numpy()))) 821 | 822 | # Numpy intensity 823 | Pref = np_stft_matrix[:,0,...] 824 | Px = np_stft_matrix[:,1,...] 825 | Py = np_stft_matrix[:,2,...] 826 | Pz = np_stft_matrix[:,3,...] 
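        # The acoustic intensity vector is proportional to Re{conj(P_W) * P_XYZ}: the
        # real part of the cross-spectrum between the omnidirectional (W) channel and
        # each first-order (X/Y/Z) channel of the FOA signal. Below it is normalized
        # to unit norm per time-frequency bin and projected onto the mel filter bank,
        # mirroring the pytorch implementation in intensityvector() above.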
827 | IVx = np.real(np.conj(Pref) * Px) 828 | IVy = np.real(np.conj(Pref) * Py) 829 | IVz = np.real(np.conj(Pref) * Pz) 830 | normal = np.sqrt(IVx**2 + IVy**2 + IVz**2) + np.finfo(np.float32).eps 831 | IVx_mel = np.dot(IVx / normal, np_melW) 832 | IVy_mel = np.dot(IVy / normal, np_melW) 833 | IVz_mel = np.dot(IVz / normal, np_melW) 834 | np_IV = np.stack([IVx_mel, IVy_mel, IVz_mel], axis=1) 835 | 836 | # Pytorch intensity 837 | pt_IV = intensityvector((pt_stft_real, pt_stft_imag), pt_logmel_extractor.melW) 838 | print(np.mean(np.abs(np_IV - pt_IV.cpu().detach().numpy()))) 839 | 840 | 841 | if __name__ == '__main__': 842 | 843 | data_length = 12800 844 | norm = None # None | 'ortho' 845 | device = 'cuda' # 'cuda' | 'cpu' 846 | np.random.seed(0) 847 | 848 | # Spectrogram parameters 849 | sample_rate = 32000 850 | n_fft = 1024 851 | hop_length = 320 852 | win_length = 1024 853 | window = 'hann' 854 | center = True 855 | dtype = np.complex64 856 | pad_mode = 'reflect' 857 | 858 | # Mel parameters 859 | n_mels = 128 860 | fmin = 50 861 | fmax = 14000 862 | ref = 1.0 863 | amin = 1e-10 864 | top_db = None 865 | 866 | # Data 867 | np_data = np.random.uniform(-1, 1, data_length) 868 | pt_data = torch.Tensor(np_data).to(device) 869 | 870 | # Pytorch 871 | spectrogram_extractor = Spectrogram(n_fft=n_fft, hop_length=hop_length, 872 | win_length=win_length, window=window, center=center, pad_mode=pad_mode, 873 | freeze_parameters=True) 874 | 875 | logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=n_fft, 876 | n_mels=n_mels, fmin=fmin, fmax=fmax, ref=ref, amin=amin, top_db=top_db, 877 | freeze_parameters=True) 878 | 879 | spectrogram_extractor.to(device) 880 | logmel_extractor.to(device) 881 | 882 | # Spectrogram 883 | pt_spectrogram = spectrogram_extractor.forward(pt_data[None, None, :]) 884 | 885 | # Log mel spectrogram 886 | pt_logmel_spectrogram = logmel_extractor.forward(pt_spectrogram) 887 | 888 | # Uncomment for debug 889 | if True: 890 | debug(select='dft', device=device) 891 | debug(select='stft', device=device) 892 | debug(select='logmel', device=device) 893 | debug(select='enframe', device=device) 894 | debug(select='logmel&iv', device=device) 895 | -------------------------------------------------------------------------------- /seld/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yinkalario/EIN-SELD/9f42da23f4ef4620ca495af37e079dc7f8cde06c/seld/utils/__init__.py -------------------------------------------------------------------------------- /seld/utils/cli_parser.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | from pathlib import Path 4 | 5 | from ruamel.yaml import YAML 6 | from termcolor import cprint 7 | 8 | 9 | def parse_cli_overides(): 10 | """Parse the command-line arguments. 11 | 12 | Parse args from CLI and override config dictionary entries 13 | 14 | This function implements the command-line interface of the program. 15 | The interface accepts general command-line arguments as well as 16 | arguments that are specific to a sub-command. The sub-commands are 17 | *preprocess*, *train*, *predict*, and *evaluate*. Specifying a 18 | sub-command is required, as it specifies the task that the program 19 | should carry out. 20 | 21 | Returns: 22 | args: The parsed arguments. 23 | """ 24 | # Parse the command-line arguments, but separate the `--config_file` 25 | # option from the other arguments. 
This way, options can be parsed 26 | # from the config file(s) first and then overidden by the other 27 | # command-line arguments later. 28 | parser = argparse.ArgumentParser( 29 | description='Event Independent Network for Sound Event Localization and Detection.', 30 | add_help=False 31 | ) 32 | parser.add_argument('-c', '--config_file', help='Specify config file', metavar='FILE') 33 | subparsers = parser.add_subparsers(dest='mode') 34 | parser_preproc = subparsers.add_parser('preprocess') 35 | parser_train = subparsers.add_parser('train') 36 | parser_infer = subparsers.add_parser('infer') 37 | subparsers.add_parser('evaluate') 38 | 39 | # Require the user to specify a sub-command 40 | subparsers.required = True 41 | parser_preproc.add_argument('--preproc_mode', choices=['extract_data', 'extract_scalar', 'extract_meta'], 42 | required=True, help='select preprocessing mode') 43 | parser_preproc.add_argument('--dataset_type', default='dev', choices=['dev', 'eval'], 44 | help='select dataset to preprocess') 45 | parser_preproc.add_argument('--num_workers', type=int, default=8, metavar='N') 46 | parser_preproc.add_argument('--no_cuda', action='store_true', help='Do not use cuda.') 47 | parser_train.add_argument('--seed', type=int, default=2020, metavar='N') 48 | parser_train.add_argument('--num_workers', type=int, default=8, metavar='N') 49 | parser_train.add_argument('--read_into_mem', action='store_true', help='Read dataloader into memory') 50 | parser_train.add_argument('--no_cuda', action='store_true', help='Do not use cuda.') 51 | parser_infer.add_argument('--num_workers', type=int, default=8, metavar='N') 52 | parser_infer.add_argument('--read_into_mem', action='store_true') 53 | parser_infer.add_argument('--no_cuda', action='store_true', help='Do not use cuda.') 54 | 55 | args = parser.parse_args() 56 | args_dict = vars(args) 57 | cprint("Args:", "green") 58 | for key, value in args_dict.items(): 59 | print(f" {key:25s} -> {value}") 60 | 61 | yaml = YAML() 62 | yaml.indent(mapping=4, sequence=6, offset=3) 63 | yaml.default_flow_style = False 64 | with open(args.config_file, 'r') as f: 65 | cfg = yaml.load(f) 66 | cprint("Cfg:", "red") 67 | yaml.dump(cfg, sys.stdout, transform=replace_indent) 68 | 69 | return args, cfg 70 | 71 | def replace_indent(stream): 72 | stream = " " + stream 73 | return stream.replace("\n", "\n ") 74 | -------------------------------------------------------------------------------- /seld/utils/common.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import math 3 | from datetime import datetime 4 | from pathlib import Path 5 | 6 | import numpy as np 7 | import torch 8 | import yaml 9 | from tqdm import tqdm 10 | 11 | 12 | def float_samples_to_int16(y): 13 | """Convert floating-point numpy array of audio samples to int16.""" 14 | if not issubclass(y.dtype.type, np.floating): 15 | raise ValueError('input samples not floating-point') 16 | return (y * np.iinfo(np.int16).max).astype(np.int16) 17 | 18 | 19 | def int16_samples_to_float32(y): 20 | """Convert int16 numpy array of audio samples to float32.""" 21 | if y.dtype != np.int16: 22 | raise ValueError('input samples not int16') 23 | return y.astype(np.float32) / np.iinfo(np.int16).max 24 | 25 | 26 | class TqdmLoggingHandler(logging.Handler): 27 | def __init__(self, level=logging.NOTSET): 28 | super().__init__(level) 29 | 30 | def emit(self, record): 31 | try: 32 | msg = self.format(record) 33 | tqdm.write(msg) 34 | self.flush() 35 | except: 36 | 
self.handleError(record) 37 | 38 | 39 | def create_logging(logs_dir, filemode): 40 | """Create log objective. 41 | 42 | Args: 43 | logs_dir (Path obj): logs directory 44 | filenmode: open file mode 45 | """ 46 | logs_dir.mkdir(parents=True, exist_ok=True) 47 | 48 | i1 = 0 49 | 50 | while logs_dir.joinpath('{:04d}.log'.format(i1)).is_file(): 51 | i1 += 1 52 | 53 | logs_path = logs_dir.joinpath('{:04d}.log'.format(i1)) 54 | logging.basicConfig( 55 | level=logging.DEBUG, 56 | format='%(filename)s[line:%(lineno)d] %(levelname)s %(message)s', 57 | # format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s', 58 | datefmt='%a, %d %b %Y %H:%M:%S', 59 | filename=logs_path, 60 | filemode=filemode) 61 | 62 | # Print to console 63 | console = logging.StreamHandler() 64 | console.setLevel(logging.INFO) 65 | formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s') 66 | console.setFormatter(formatter) 67 | # logging.getLogger('').addHandler(console) 68 | logging.getLogger('').addHandler(TqdmLoggingHandler()) 69 | 70 | dt_string = datetime.now().strftime('%a, %d %b %Y %H:%M:%S') 71 | logging.info(dt_string) 72 | logging.info('') 73 | 74 | return logging 75 | 76 | 77 | def convert_ordinal(n): 78 | """Convert a number to a ordinal number 79 | 80 | """ 81 | ordinal = lambda n: "%d%s" % (n,"tsnrhtdd"[(math.floor(n/10)%10!=1)*(n%10<4)*n%10::4]) 82 | return ordinal(n) 83 | 84 | 85 | def move_model_to_gpu(model, cuda): 86 | """Move model to GPU 87 | 88 | """ 89 | # TODO: change DataParallel to DistributedDataParallel 90 | model = torch.nn.DataParallel(model) 91 | if cuda: 92 | logging.info('Utilize GPUs for computation') 93 | logging.info('Number of GPU available: {}\n'.format(torch.cuda.device_count())) 94 | model.cuda() 95 | else: 96 | logging.info('Utilize CPU for computation') 97 | return model 98 | 99 | 100 | def count_parameters(model): 101 | """Count model parameters 102 | 103 | """ 104 | params_num = sum(p.numel() for p in model.parameters() if p.requires_grad) 105 | logging.info('Total number of parameters: {}\n'.format(params_num)) 106 | 107 | 108 | def print_metrics(logging, writer, values_dict, it, set_type='train'): 109 | """Print losses and metrics, and write it to tensorboard 110 | 111 | Args: 112 | logging: logging 113 | writer: tensorboard writer 114 | values_dict: losses or metrics 115 | it: iter 116 | set_type: 'train' | 'valid' | 'test' 117 | """ 118 | out_str = '' 119 | if set_type == 'train': 120 | out_str += 'Train: ' 121 | elif set_type == 'valid': 122 | out_str += 'valid: ' 123 | 124 | for key, value in values_dict.items(): 125 | out_str += '{}: {:.3f}, '.format(key, value) 126 | writer.add_scalar('{}/{}'.format(set_type, key), value, it) 127 | logging.info(out_str) 128 | 129 | 130 | -------------------------------------------------------------------------------- /seld/utils/config.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from pathlib import Path 3 | 4 | import methods.feature as feature 5 | import torch.optim as optim 6 | from methods import ein_seld 7 | from ruamel.yaml import YAML 8 | from termcolor import cprint 9 | from torch.utils.data import ConcatDataset, DataLoader 10 | 11 | from utils.common import convert_ordinal, count_parameters, move_model_to_gpu 12 | from utils.datasets import Dcase2020task3 13 | 14 | method_dict = { 15 | 'ein_seld': ein_seld, 16 | } 17 | 18 | datasets_dict = { 19 | 'dcase2020task3': Dcase2020task3, 20 | } 21 | 22 | 23 | def store_config(output_path, 
config): 24 | """ Write the given config parameter values to a YAML file. 25 | 26 | Args: 27 | output_path (str): Output file path. 28 | config: Parameter values to log. 29 | """ 30 | yaml = YAML() 31 | with open(output_path, 'w') as f: 32 | yaml.dump(config, f) 33 | 34 | 35 | # Datasets 36 | def get_dataset(dataset_name, root_dir): 37 | assert dataset_name == 'dcase2019task3' or dataset_name == 'dcase2020task3', \ 38 | "Dataset name '{}' is not 'dcase2019task3' or 'dcase2020task3'".format(dataset_name) 39 | dataset = datasets_dict[dataset_name](root_dir) 40 | print('\nDataset {} is being developed......\n'.format(dataset_name)) 41 | return dataset 42 | 43 | 44 | # Dataloaders 45 | def get_generator(args, cfg, dataset, generator_type): 46 | """ Get generator. 47 | 48 | Args: 49 | args: input args 50 | cfg: configuration 51 | dataset: dataset used 52 | generator_type: 'train' | 'valid' | 'test' 53 | 'train' for training, 'valid_train' for validation of train set, 54 | 'valid' for validation of valid set. 55 | Output: 56 | subset: train_set or valid_set 57 | data_generator: 'train_generator', or 'valid_generator' 58 | """ 59 | assert generator_type == 'train' or generator_type == 'valid' or generator_type == 'test', \ 60 | "Data generator type '{}' is not 'train', 'valid_train' or 'valid'".format(generator_type) 61 | 62 | batch_sampler = None 63 | if generator_type == 'train': 64 | 65 | subset = method_dict[cfg['method']].data.UserDataset(args, cfg, dataset, dataset_type='train') 66 | if 'pitchshift' in cfg['data_augmentation']['type']: 67 | augset = method_dict[cfg['method']].data.UserDataset(args, cfg, dataset, dataset_type='train_pitchshift') 68 | subset = ConcatDataset([subset, augset]) 69 | 70 | batch_sampler = method_dict[cfg['method']].data.UserBatchSampler( 71 | clip_num=len(subset), 72 | batch_size=cfg['training']['batch_size'], 73 | seed=args.seed 74 | ) 75 | data_generator = DataLoader( 76 | dataset=subset, 77 | batch_sampler=batch_sampler, 78 | num_workers=args.num_workers, 79 | collate_fn=method_dict[cfg['method']].data.collate_fn, 80 | pin_memory=True 81 | ) 82 | elif generator_type == 'valid': 83 | subset = method_dict[cfg['method']].data.UserDataset(args, cfg, dataset, dataset_type='valid') 84 | data_generator = DataLoader( 85 | dataset=subset, 86 | batch_size=cfg['training']['batch_size'], 87 | shuffle=False, 88 | num_workers=args.num_workers, 89 | collate_fn=method_dict[cfg['method']].data.collate_fn, 90 | pin_memory=True 91 | ) 92 | elif generator_type == 'test': 93 | testset_type = cfg['inference']['testset_type'] 94 | dataset_type = testset_type + '_test' 95 | subset = method_dict[cfg['method']].data.UserDataset(args, cfg, dataset, dataset_type=dataset_type) 96 | data_generator = DataLoader( 97 | dataset=subset, 98 | batch_size=cfg['inference']['batch_size'], 99 | shuffle=False, 100 | num_workers=args.num_workers, 101 | collate_fn=method_dict[cfg['method']].data.collate_fn_test, 102 | pin_memory=True 103 | ) 104 | 105 | return subset, data_generator, batch_sampler 106 | 107 | 108 | # Losses 109 | def get_losses(cfg): 110 | """ Get losses 111 | 112 | """ 113 | losses = method_dict[cfg['method']].losses.Losses(cfg) 114 | for idx, loss_name in enumerate(losses.names): 115 | logging.info('{} is used as the {} loss.'.format(loss_name, convert_ordinal(idx + 1))) 116 | logging.info('') 117 | return losses 118 | 119 | 120 | # Metrics 121 | def get_metrics(cfg, dataset): 122 | """ Get metrics 123 | 124 | """ 125 | metrics = method_dict[cfg['method']].metrics.Metrics(dataset) 126 
| for idx, metric_name in enumerate(metrics.names): 127 | logging.info('{} is used as the {} metric.'.format(metric_name, convert_ordinal(idx + 1))) 128 | logging.info('') 129 | return metrics 130 | 131 | 132 | # Audio feature extractor 133 | def get_afextractor(cfg, cuda): 134 | """ Get audio feature extractor 135 | 136 | """ 137 | if cfg['data']['audio_feature'] == 'logmel&intensity': 138 | afextractor = feature.LogmelIntensity_Extractor(cfg) 139 | afextractor = move_model_to_gpu(afextractor, cuda) 140 | return afextractor 141 | 142 | 143 | # Models 144 | def get_models(cfg, dataset, cuda, model_name=None): 145 | """ Get models 146 | 147 | """ 148 | logging.info('=====>> Building a model\n') 149 | if not model_name: 150 | model = vars(method_dict[cfg['method']].models)[cfg['training']['model']](cfg, dataset) 151 | else: 152 | model = vars(method_dict[cfg['method']].models)[model_name](cfg, dataset) 153 | model = move_model_to_gpu(model, cuda) 154 | logging.info('Model architectures:\n{}\n'.format(model)) 155 | count_parameters(model) 156 | return model 157 | 158 | 159 | # Optimizers 160 | def get_optimizer(cfg, af_extractor, model): 161 | """ Get optimizers 162 | 163 | """ 164 | opt_method = cfg['training']['optimizer'] 165 | lr = cfg['training']['lr'] 166 | params = list(af_extractor.parameters()) + list(model.parameters()) 167 | if opt_method == 'adam': 168 | optimizer = optim.Adam(params, lr=lr, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=True) 169 | elif opt_method == 'sgd': 170 | optimizer = optim.SGD(params, lr=lr, momentum=0, weight_decay=0) 171 | elif opt_method == 'adamw': 172 | # optimizer = AdamW(params, lr=lr, betas=(0.9, 0.999), weight_decay=0, warmup=0) 173 | optimizer = optim.AdamW(params, lr=lr, betas=(0.9, 0.999), eps=1e-08, 174 | weight_decay=0.01, amsgrad=True) 175 | 176 | logging.info('Optimizer is: {}\n'.format(opt_method)) 177 | return optimizer 178 | 179 | # Trainer 180 | def get_trainer(args, cfg, dataset, valid_set, af_extractor, model, optimizer, losses, metrics): 181 | """ Get trainer 182 | 183 | """ 184 | trainer = method_dict[cfg['method']].training.Trainer( 185 | args=args, cfg=cfg, dataset=dataset, valid_set=valid_set, af_extractor=af_extractor, 186 | model=model, optimizer=optimizer, losses=losses, metrics=metrics 187 | ) 188 | return trainer 189 | 190 | # Inferer 191 | def get_inferer(cfg, dataset, af_extractor, model, cuda): 192 | """ Get inferer 193 | 194 | """ 195 | inferer = method_dict[cfg['method']].inference.Inferer( 196 | cfg=cfg, dataset=dataset, af_extractor=af_extractor, model=model, cuda=cuda 197 | ) 198 | return inferer 199 | -------------------------------------------------------------------------------- /seld/utils/datasets.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import pandas as pd 3 | 4 | 5 | class Dcase2020task3: 6 | """DCASE 2020 Task 3 dataset 7 | 8 | """ 9 | def __init__(self, root_dir): 10 | self.root_dir = Path(root_dir) 11 | self.dataset_dir = dict() 12 | self.dataset_dir['dev'] = { 13 | 'foa': self.root_dir.joinpath('foa_dev'), 14 | 'mic': self.root_dir.joinpath('mic_dev'), 15 | 'meta': self.root_dir.joinpath('metadata_dev'), 16 | } 17 | self.dataset_dir['eval'] = { 18 | 'foa': self.root_dir.joinpath('foa_eval'), 19 | 'mic': self.root_dir.joinpath('mic_eval'), 20 | 'meta': self.root_dir.joinpath('metadata_eval'), 21 | } 22 | 23 | self.label_set = ['alarm', 'crying baby', 'crash', 'barking dog', 'running engine', 'female scream', \ 24 | 'female 
speech', 'burning fire', 'footsteps', 'knocking on door', 'male scream', 'male speech', \ 25 | 'ringing phone', 'piano'] 26 | 27 | self.clip_length = 60 # seconds long 28 | self.label_resolution = 0.1 # 0.1s is the label resolution 29 | self.fold_str_index = 4 # string index indicating fold number 30 | self.ov_str_index = -1 # string index indicating overlap 31 | --------------------------------------------------------------------------------
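Taken together, the `STFT`, `spectrogram_STFTInput`, `LogmelFilterBank`, and `intensityvector` utilities in `seld/methods/utils/stft.py` produce the log-mel plus intensity-vector input that the `logmel&intensity` setting in `get_afextractor` refers to (via `LogmelIntensity_Extractor`, which presumably wraps these modules). The snippet below is a minimal usage sketch, not part of the repository: the import path assumes the `seld/` directory is on the Python path, and the STFT and mel parameters mirror the `logmel&iv` debug branch above rather than the training configuration in `configs/ein_seld/seld.yaml`.

```python
import torch

# Assumed import path: run from the seld/ directory so that `methods` is importable,
# as in seld/utils/config.py (`import methods.feature as feature`).
from methods.utils.stft import (STFT, LogmelFilterBank, intensityvector,
                                spectrogram_STFTInput)

# Parameters copied from the 'logmel&iv' debug branch; the training config may differ.
sample_rate, n_fft, hop_length, n_mels = 24000, 1024, 240, 128

stft_extractor = STFT(n_fft=n_fft, hop_length=hop_length, win_length=n_fft,
                      window='hann', center=True, pad_mode='reflect',
                      freeze_parameters=True)
logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=n_fft, n_mels=n_mels,
                                    fmin=50, fmax=10000, ref=1.0, amin=1e-10,
                                    top_db=None, freeze_parameters=True)

# A batch of two 3-second FOA (W, X, Y, Z) clips
waveform = torch.randn(2, 4, sample_rate * 3)

real, imag = stft_extractor(waveform)                       # (2, 4, time_steps, n_fft//2+1)
spec = spectrogram_STFTInput((real, imag))                  # power spectrogram, same shape
logmel = logmel_extractor(spec)                             # (2, 4, time_steps, n_mels)
iv = intensityvector((real, imag), logmel_extractor.melW)   # (2, 3, time_steps, n_mels)

features = torch.cat((logmel, iv), dim=1)                   # (2, 7, time_steps, n_mels)
print(features.shape)
```

The sketch yields a 7-channel feature per clip: four log-mel spectrogram channels concatenated with the three mel-projected intensity-vector components.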