├── .gitignore
├── Data-Preprocessing.ipynb
├── README.md
├── Run-All-on-Colab.ipynb
├── Training-Classifier.ipynb
├── advanced
│   ├── Multi-Label-FSDKaggle2019-on-Colab.ipynb
│   ├── Note-webdataset-smalldata.ipynb
│   ├── Perceiver_MelSpecAudio_Example_Colab.ipynb
│   ├── __init__.py
│   ├── create_wds_fsd50k.py
│   ├── create_wds_fsd50k_resample.py
│   ├── fat2018.py
│   ├── metric_fat2018.py
│   └── preprocess_fat2018.py
├── config-fat2018.yaml
├── config.yaml
├── for_evar
│   ├── README.md
│   ├── cnn14_decoupled.py
│   └── sampler.py
├── requirements.txt
├── src
│   ├── __init__.py
│   ├── augmentations.py
│   ├── libs.py
│   ├── lwlrap.py
│   ├── models.py
│   └── multi_label_libs.py
└── work
    └── .placeholder
/.gitignore:
--------------------------------------------------------------------------------
1 | work/*
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Sound Classifier Tutorial for PyTorch
2 | 
3 | This is a sound classifier tutorial using PyTorch, PyTorch Lightning and torchaudio.
4 | 
5 | ## 0. Motivation
6 | 
7 | I previously made a repository for a sound classifier solution: [Machine Learning Sound Classifier for Live Audio](https://github.com/daisukelab/ml-sound-classifier),
8 | which is based on my Keras solution for the Kaggle competition "[Freesound General-Purpose Audio Tagging Challenge](https://www.kaggle.com/c/freesound-audio-tagging)".
9 | 
10 | Keras was popular when that repository was created, but many people today are using PyTorch, and so am I.
11 | 
12 | This repository is an updated example solution using PyTorch that shows how I would approach a new machine learning sound competition with current software assets.
13 | 
14 | ## 1. Quickstart
15 | 
16 | - `pip install -r requirements.txt` to install the required modules.
17 | - Run the notebooks.
18 | 
19 | ## 2. What you can find
20 | 
21 | ### 2-1. What's included
22 | 
23 | - An audio preprocessing example: [Data-Preprocessing.ipynb](Data-Preprocessing.ipynb)
24 | - A training example: [Training-Classifier.ipynb](Training-Classifier.ipynb)
25 | - An [FSDKaggle2018](https://zenodo.org/record/2552860#.X9TH6mT7RzU) handling example; this is a multi-class sound classification task.
26 | - New) ResNetish/VGGish [1] models.
27 | - Models are equipped with AdaptiveXXXPool2d to be flexible with input size; they now accept inputs of any shape.
28 | - New) Colab all-in-one notebook [Run-All-on-Colab.ipynb](Run-All-on-Colab.ipynb). You can run the entire training/evaluation pipeline online.
29 | 
30 | ### 2-2. What's not
31 | 
32 | - No usual practices/techniques such as normalization, augmentation, and regularization --> these are followed up in the advanced notebooks.
33 | - No cutting-edge networks like those you can find here: [PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://github.com/qiuqiangkong/audioset_tagging_cnn).
34 | 
35 | You can simply run the notebooks to reproduce the results, or try advanced techniques on top of the tutorials.
36 | 
37 | ## 3. Notes on design choices
38 | 
39 | ### 3-1. Input data format: raw audio or spectrogram?
40 | 
41 | If we need to augment input data in the time domain, we feed raw audio to the dataset class.
42 | 
43 | But in this example, as the main design choice, all the data are converted to log-mel spectrograms in advance.
44 | 
45 | - Good: This makes data handling easy, especially in the training pipeline.
46 | - Bad: Applicable data augmentations are limited. The available transformations in torchaudio are [FrequencyMasking](https://pytorch.org/audio/stable/transforms.html#frequencymasking) and [TimeMasking](https://pytorch.org/audio/stable/transforms.html#timemasking).
47 | 
48 | ### 3-2. Input data size
49 | 
50 | The number of frequency bins (n_mels) is set to 64 as a typical choice.
51 | The duration is set to ~~1 second, just as an example~~ 5 seconds in the current configuration, because 1 second was too short for the FSDKaggle2018 dataset.
52 | 
53 | You can find and change these settings in [config.yaml](config.yaml).
54 | 
55 |     clip_length: 5.0 # [sec] -- it was 1.0 s at the initial release.
56 |     n_mels: 64
57 | 
58 | ### 3-3. FFT parameters
59 | 
60 | Typical parameters are configured in [config.yaml](config.yaml).
61 | 
62 |     sample_rate: 44100
63 |     hop_length: 441
64 |     n_fft: 1024
65 |     n_mels: 64
66 |     f_min: 0
67 |     f_max: 22050
68 | 
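For illustration, here is a minimal sketch (not part of the repository's notebooks) that puts 3-1 and 3-3 together: it converts a waveform to a log-mel spectrogram with the parameters above and then applies the two masking transforms. The file name `sample.wav`, the epsilon added before `log()`, and the mask sizes are placeholder assumptions.

    import torch
    import torchaudio
    import torchaudio.transforms as AT

    # Parameters from config.yaml (see 3-3).
    cfg = dict(sample_rate=44100, n_fft=1024, hop_length=441, n_mels=64, f_min=0, f_max=22050)

    # Load a clip; assumed to be recorded at cfg['sample_rate'] already (resample first otherwise).
    waveform, sr = torchaudio.load('sample.wav')  # shape: (channels, samples)

    # Waveform -> log-mel spectrogram, computed once in advance (see 3-1).
    to_mel = AT.MelSpectrogram(sample_rate=cfg['sample_rate'], n_fft=cfg['n_fft'],
                               hop_length=cfg['hop_length'], n_mels=cfg['n_mels'],
                               f_min=cfg['f_min'], f_max=cfg['f_max'])
    log_mel = (to_mel(waveform) + torch.finfo(torch.float).eps).log()  # (channels, n_mels, frames)

    # Spectrogram-domain augmentations that remain applicable after precomputation.
    augment = torch.nn.Sequential(AT.FrequencyMasking(freq_mask_param=8),
                                  AT.TimeMasking(time_mask_param=20))
    augmented = augment(log_mel)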
69 | ## 4. Performance
70 | 
71 | How well do the models trained in the tutorials perform?
72 | 
73 | - The best Kaggle result was reported as MAP@3 = 0.942 (see the [Kaggle 4th place solution](https://www.kaggle.com/c/freesound-audio-tagging/discussion/62634)). Note that this result is an ensemble of 5 models of the same SE-ResNeXt network trained on 5 folds.
74 | - The best result in this repo is MAP@3 = 0.87 (with ResNetish). This is a single-model result, without any data augmentation.
75 | 
76 | The ResNetish result already comes close to the top solution, and there is still room for improvement from data augmentation and regularization techniques.
77 | 
78 | ## References
79 | 
80 | - [1] S. Hershey et al., "CNN Architectures for Large-Scale Audio Classification," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. Available: https://arxiv.org/abs/1609.09430, https://ai.google/research/pubs/pub45611
81 | 
--------------------------------------------------------------------------------
/advanced/Note-webdataset-smalldata.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "361027b5",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "from dlcliche.notebook import *\n",
11 | "from dlcliche.torch_utils import *"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "id": "e8d4d37d",
17 | "metadata": {},
18 | "source": [
19 | "## Goal\n",
20 | "\n",
21 | "Check if webdataset is useful for downstream datasets, which are typically small.\n",
22 | "\n",
23 | "### Preparing webdataset shards\n",
24 | "\n",
25 | "Used `create_wds_fsd50k.py` to make tar shards encapsulating local 16kHz FSD50K files.\n",
26 | "This resulted in four tar files: `fsd50k-eval-16k-{000000..000003}.tar`.\n",
27 | "\n",
28 | "### Test result\n",
29 | "\n",
30 | "The results show that webdataset is not effective in the small-data regime."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 24,
36 | "id": "0eab9f25",
37 | "metadata": {},
38 | "outputs": [
39 | {
40 | "name": "stdout",
41 | "output_type": "stream",
42 | "text": [
43 | "9.86 s ± 534 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "%%timeit\n", 49 | "\n", 50 | "import webdataset as wds\n", 51 | "import io\n", 52 | "import librosa\n", 53 | "\n", 54 | "url = '/data/A/fsd50k/fsd50k-eval-16k-{000000..000003}.tar'\n", 55 | "ds = (\n", 56 | " wds.WebDataset(url)\n", 57 | " .shuffle(1000)\n", 58 | " .to_tuple('wav', 'labels')\n", 59 | ")\n", 60 | "for i, (wav, labels) in enumerate(ds):\n", 61 | " wav = librosa.load(io.BytesIO(wav))\n", 62 | " labels = labels.decode()\n", 63 | " if i > 100:\n", 64 | " break" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 25, 70 | "id": "b61de49f", 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "name": "stdout", 75 | "output_type": "stream", 76 | "text": [ 77 | "9.06 s ± 8.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 78 | ] 79 | } 80 | ], 81 | "source": [ 82 | "%%timeit\n", 83 | "\n", 84 | "import io\n", 85 | "import librosa\n", 86 | "\n", 87 | "def IterativeDataset(root, files, label_set):\n", 88 | " root = Path(root)\n", 89 | " for fname, labels in zip(files, label_set):\n", 90 | " data = librosa.load(root/fname)\n", 91 | " labels = labels\n", 92 | " yield data, labels\n", 93 | "\n", 94 | "df = pd.read_csv('/lab/AR2021/evar/metadata/fsd50k.csv')\n", 95 | "df = df[df.split == 'test']\n", 96 | "\n", 97 | "for i, (binary, labels) in enumerate(IterativeDataset('work/16k/fsd50k', df.file_name.values, df.label.values)):\n", 98 | " wav = binary\n", 99 | " labels = labels\n", 100 | " if i > 100:\n", 101 | " break\n", 102 | "#print(wav, labels)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "id": "0185a362", 108 | "metadata": {}, 109 | "source": [ 110 | "## Note: create tar shard files by codes" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 33, 116 | "id": "22a05c53", 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "data": { 121 | "text/html": [ 122 | "
\n", 123 | "\n", 136 | "\n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | "
fnamelabelsmidssplitkey
0FSD50K.dev_audio/64760.wavElectric_guitar,Guitar,Plucked_string_instrume.../m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlftraintrain_64760
1FSD50K.dev_audio/16399.wavElectric_guitar,Guitar,Plucked_string_instrume.../m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlftraintrain_16399
2FSD50K.dev_audio/16401.wavElectric_guitar,Guitar,Plucked_string_instrume.../m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlftraintrain_16401
\n", 174 | "
" 175 | ], 176 | "text/plain": [ 177 | " fname \\\n", 178 | "0 FSD50K.dev_audio/64760.wav \n", 179 | "1 FSD50K.dev_audio/16399.wav \n", 180 | "2 FSD50K.dev_audio/16401.wav \n", 181 | "\n", 182 | " labels \\\n", 183 | "0 Electric_guitar,Guitar,Plucked_string_instrume... \n", 184 | "1 Electric_guitar,Guitar,Plucked_string_instrume... \n", 185 | "2 Electric_guitar,Guitar,Plucked_string_instrume... \n", 186 | "\n", 187 | " mids split key \n", 188 | "0 /m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf train train_64760 \n", 189 | "1 /m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf train train_16399 \n", 190 | "2 /m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf train train_16401 " 191 | ] 192 | }, 193 | "execution_count": 33, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "def fsd50k_metadata(FSD50K_root):\n", 200 | " FSD = Path(FSD50K_root)\n", 201 | " df = pd.read_csv(FSD/f'FSD50K.ground_truth/dev.csv')\n", 202 | " df['key'] = df.split + '_' + df.fname.apply(lambda s: str(s))\n", 203 | " df['fname'] = df.fname.apply(lambda s: f'FSD50K.dev_audio/{s}.wav')\n", 204 | " dftest = pd.read_csv(FSD/f'FSD50K.ground_truth/eval.csv')\n", 205 | " dftest['key'] = 'eval_' + dftest.fname.apply(lambda s: str(s))\n", 206 | " dftest['split'] = 'eval'\n", 207 | " dftest['fname'] = dftest.fname.apply(lambda s: f'FSD50K.eval_audio/{s}.wav')\n", 208 | " df = pd.concat([df, dftest], ignore_index=True)\n", 209 | " return df\n", 210 | "\n", 211 | "\n", 212 | "df = fsd50k_metadata(FSD50K_root='/data/A/fsd50k/')\n", 213 | "df[:3]" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 56, 219 | "id": "8970bd0b", 220 | "metadata": {}, 221 | "outputs": [ 222 | { 223 | "name": "stdout", 224 | "output_type": "stream", 225 | "text": [ 226 | "Processing 36796 train samples.\n", 227 | "/data/A/fsd50k/FSD50K.dev_audio/64760.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64760\n" 228 | ] 229 | }, 230 | { 231 | "data": { 232 | "text/plain": [ 233 | "{'__key__': 'train_64760',\n", 234 | " 'npy': array([-0.00026427, -0.00128246, 0.00068087, ..., -0.00253225,\n", 235 | " -0.00244647, 0. 
], dtype=float32),\n", 236 | " 'labels': 'Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music'}" 237 | ] 238 | }, 239 | "execution_count": 56, 240 | "metadata": {}, 241 | "output_type": "execute_result" 242 | } 243 | ], 244 | "source": [ 245 | "import librosa\n", 246 | "\n", 247 | "\n", 248 | "def load_resampled_mono_wav(fpath, sr):\n", 249 | " y, org_sr = librosa.load('/data/A/fsd50k/FSD50K.dev_audio/382455.wav', sr=None, mono=True)\n", 250 | " if org_sr != sr:\n", 251 | " y = librosa.resample(y, orig_sr=org_sr, target_sr=sr)\n", 252 | " return y\n", 253 | "\n", 254 | "\n", 255 | "def fsd50k_generator(root, split, sr):\n", 256 | " root = Path(root)\n", 257 | " df = fsd50k_metadata(FSD50K_root=root)\n", 258 | " df = df[df.split == split]\n", 259 | " print(f'Processing {len(df)} {split} samples.')\n", 260 | " for file_name, labels, key in df[['fname', 'labels', 'key']].values:\n", 261 | " fpath = root/file_name\n", 262 | " print(fpath, labels, key)\n", 263 | "\n", 264 | " sample = {\n", 265 | " '__key__': key,\n", 266 | " 'npy': load_resampled_mono_wav(fpath, sr),\n", 267 | " 'labels': labels,\n", 268 | " }\n", 269 | " yield sample\n", 270 | "\n", 271 | "gen = fsd50k_generator('/data/A/fsd50k/', 'train', 16000)\n", 272 | "next(iter(gen))" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 57, 278 | "id": "a7fb2d1c", 279 | "metadata": {}, 280 | "outputs": [ 281 | { 282 | "name": "stdout", 283 | "output_type": "stream", 284 | "text": [ 285 | "# writing /data/A/fsd50k/train-000000.tar 0 0.0 GB 0\n", 286 | "Processing 36796 train samples.\n", 287 | "/data/A/fsd50k/FSD50K.dev_audio/64760.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64760\n", 288 | "/data/A/fsd50k/FSD50K.dev_audio/16399.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_16399\n", 289 | "/data/A/fsd50k/FSD50K.dev_audio/16401.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_16401\n", 290 | "/data/A/fsd50k/FSD50K.dev_audio/16402.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_16402\n", 291 | "/data/A/fsd50k/FSD50K.dev_audio/16404.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_16404\n", 292 | "/data/A/fsd50k/FSD50K.dev_audio/64761.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64761\n", 293 | "/data/A/fsd50k/FSD50K.dev_audio/268259.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_268259\n", 294 | "/data/A/fsd50k/FSD50K.dev_audio/64762.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64762\n", 295 | "/data/A/fsd50k/FSD50K.dev_audio/40515.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40515\n", 296 | "/data/A/fsd50k/FSD50K.dev_audio/40516.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40516\n", 297 | "/data/A/fsd50k/FSD50K.dev_audio/40517.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40517\n", 298 | "/data/A/fsd50k/FSD50K.dev_audio/64741.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64741\n", 299 | "/data/A/fsd50k/FSD50K.dev_audio/40523.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40523\n", 300 | "/data/A/fsd50k/FSD50K.dev_audio/64743.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music 
train_64743\n", 301 | "/data/A/fsd50k/FSD50K.dev_audio/64744.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64744\n", 302 | "/data/A/fsd50k/FSD50K.dev_audio/40525.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40525\n", 303 | "/data/A/fsd50k/FSD50K.dev_audio/64746.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64746\n", 304 | "/data/A/fsd50k/FSD50K.dev_audio/5318.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5318\n", 305 | "/data/A/fsd50k/FSD50K.dev_audio/4258.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4258\n", 306 | "/data/A/fsd50k/FSD50K.dev_audio/4259.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4259\n", 307 | "/data/A/fsd50k/FSD50K.dev_audio/4260.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4260\n", 308 | "/data/A/fsd50k/FSD50K.dev_audio/4261.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4261\n", 309 | "/data/A/fsd50k/FSD50K.dev_audio/4262.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4262\n", 310 | "/data/A/fsd50k/FSD50K.dev_audio/4263.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4263\n", 311 | "/data/A/fsd50k/FSD50K.dev_audio/4264.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4264\n", 312 | "/data/A/fsd50k/FSD50K.dev_audio/4265.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4265\n", 313 | "/data/A/fsd50k/FSD50K.dev_audio/4266.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4266\n", 314 | "/data/A/fsd50k/FSD50K.dev_audio/4267.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4267\n", 315 | "/data/A/fsd50k/FSD50K.dev_audio/4268.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4268\n", 316 | "/data/A/fsd50k/FSD50K.dev_audio/4269.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4269\n", 317 | "/data/A/fsd50k/FSD50K.dev_audio/4270.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4270\n", 318 | "/data/A/fsd50k/FSD50K.dev_audio/4272.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4272\n", 319 | "/data/A/fsd50k/FSD50K.dev_audio/64757.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64757\n", 320 | "/data/A/fsd50k/FSD50K.dev_audio/4276.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4276\n", 321 | "/data/A/fsd50k/FSD50K.dev_audio/4277.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4277\n", 322 | "/data/A/fsd50k/FSD50K.dev_audio/4278.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4278\n", 323 | "/data/A/fsd50k/FSD50K.dev_audio/4279.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4279\n", 324 | "/data/A/fsd50k/FSD50K.dev_audio/4280.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4280\n", 325 | "/data/A/fsd50k/FSD50K.dev_audio/4281.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4281\n", 326 | "/data/A/fsd50k/FSD50K.dev_audio/4283.wav 
Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4283\n", 327 | "/data/A/fsd50k/FSD50K.dev_audio/4284.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4284\n", 328 | "/data/A/fsd50k/FSD50K.dev_audio/4285.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4285\n", 329 | "/data/A/fsd50k/FSD50K.dev_audio/4286.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4286\n", 330 | "/data/A/fsd50k/FSD50K.dev_audio/4287.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4287\n", 331 | "/data/A/fsd50k/FSD50K.dev_audio/4288.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4288\n", 332 | "/data/A/fsd50k/FSD50K.dev_audio/4289.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4289\n", 333 | "/data/A/fsd50k/FSD50K.dev_audio/5314.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5314\n", 334 | "/data/A/fsd50k/FSD50K.dev_audio/4290.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4290\n", 335 | "/data/A/fsd50k/FSD50K.dev_audio/4291.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4291\n", 336 | "/data/A/fsd50k/FSD50K.dev_audio/5310.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5310\n", 337 | "/data/A/fsd50k/FSD50K.dev_audio/64703.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64703\n", 338 | "/data/A/fsd50k/FSD50K.dev_audio/5312.wav Electric_guitar,Bass_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5312\n", 339 | "/data/A/fsd50k/FSD50K.dev_audio/64704.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64704\n", 340 | "/data/A/fsd50k/FSD50K.dev_audio/64706.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64706\n", 341 | "/data/A/fsd50k/FSD50K.dev_audio/64707.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64707\n", 342 | "/data/A/fsd50k/FSD50K.dev_audio/64708.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64708\n", 343 | "/data/A/fsd50k/FSD50K.dev_audio/5315.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5315\n", 344 | "/data/A/fsd50k/FSD50K.dev_audio/5317.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5317\n", 345 | "/data/A/fsd50k/FSD50K.dev_audio/64711.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64711\n", 346 | "/data/A/fsd50k/FSD50K.dev_audio/64712.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64712\n", 347 | "/data/A/fsd50k/FSD50K.dev_audio/64714.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64714\n", 348 | "/data/A/fsd50k/FSD50K.dev_audio/64715.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64715\n", 349 | "/data/A/fsd50k/FSD50K.dev_audio/64717.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64717\n", 350 | "/data/A/fsd50k/FSD50K.dev_audio/64718.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64718\n", 351 | "/data/A/fsd50k/FSD50K.dev_audio/64720.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64720\n" 352 | 
] 353 | }, 354 | { 355 | "name": "stdout", 356 | "output_type": "stream", 357 | "text": [ 358 | "/data/A/fsd50k/FSD50K.dev_audio/64721.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64721\n", 359 | "/data/A/fsd50k/FSD50K.dev_audio/64722.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64722\n" 360 | ] 361 | }, 362 | { 363 | "ename": "KeyboardInterrupt", 364 | "evalue": "", 365 | "output_type": "error", 366 | "traceback": [ 367 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 368 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 369 | "\u001b[0;32m/tmp/ipykernel_2207172/328821770.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mwds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mShardWriter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_count\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0msink\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 12\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0msample\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mislice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfsd50k_generator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource_dir\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msplit\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m100\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 13\u001b[0m \u001b[0msink\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msample\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 370 | "\u001b[0;32m/tmp/ipykernel_2207172/1443749655.py\u001b[0m in \u001b[0;36mfsd50k_generator\u001b[0;34m(root, split, sr)\u001b[0m\n\u001b[1;32m 20\u001b[0m sample = {\n\u001b[1;32m 21\u001b[0m \u001b[0;34m'__key__'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 22\u001b[0;31m \u001b[0;34m'npy'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mload_resampled_mono_wav\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfpath\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 23\u001b[0m \u001b[0;34m'labels'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mlabels\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 24\u001b[0m }\n", 371 | "\u001b[0;32m/tmp/ipykernel_2207172/1443749655.py\u001b[0m in \u001b[0;36mload_resampled_mono_wav\u001b[0;34m(fpath, sr)\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morg_sr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlibrosa\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'/data/A/fsd50k/FSD50K.dev_audio/382455.wav'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmono\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m 
\u001b[0;32mif\u001b[0m \u001b[0morg_sr\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0msr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlibrosa\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morig_sr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morg_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtarget_sr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 372 | "\u001b[0;32m~/anaconda3/lib/python3.9/site-packages/librosa/core/audio.py\u001b[0m in \u001b[0;36mresample\u001b[0;34m(y, orig_sr, target_sr, res_type, fix, scale, **kwargs)\u001b[0m\n\u001b[1;32m 602\u001b[0m \u001b[0my_hat\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msoxr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mT\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morig_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtarget_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mquality\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mres_type\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mT\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 603\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 604\u001b[0;31m \u001b[0my_hat\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresampy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morig_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtarget_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfilter\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mres_type\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 605\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 606\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mfix\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 373 | "\u001b[0;32m~/anaconda3/lib/python3.9/site-packages/resampy/core.py\u001b[0m in \u001b[0;36mresample\u001b[0;34m(x, sr_orig, sr_new, axis, filter, **kwargs)\u001b[0m\n\u001b[1;32m 118\u001b[0m \u001b[0mx_2d\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mswapaxes\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 119\u001b[0m \u001b[0my_2d\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mswapaxes\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0maxis\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 120\u001b[0;31m \u001b[0mresample_f\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx_2d\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_2d\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msample_ratio\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minterp_win\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minterp_delta\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprecision\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 121\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 122\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 374 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 375 | ] 376 | } 377 | ], 378 | "source": [ 379 | "import webdataset as wds\n", 380 | "from itertools import islice\n", 381 | "\n", 382 | "\n", 383 | "source_dir = '/data/A/fsd50k/'\n", 384 | "split = 'train'\n", 385 | "sr = 16000\n", 386 | "output_name = f'/data/A/fsd50k/{split}-%06d.tar'\n", 387 | "max_count = 10000\n", 388 | "\n", 389 | "with wds.ShardWriter(output_name, max_count) as sink:\n", 390 | " for sample in islice(fsd50k_generator(source_dir, split, sr), 0, 100):\n", 391 | " sink.write(sample)" 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "id": "bfc6032c", 397 | "metadata": {}, 398 | "source": [ 399 | "## Note: creating dataset tar archives with command-line\n", 400 | "\n", 401 | "\n", 402 | "### Install go and tarp commands\n", 403 | "\n", 404 | "https://github.com/webdataset/tarp\n", 405 | "\n", 406 | "- `sudo apt install golang-go`\n", 407 | "- `go get -v github.com/tmbdev/tarp/tarp`\n", 408 | "\n", 409 | "### Create tar archive\n", 410 | "\n", 411 | "- `tar --sort=name -cf your_archive.tar your_folders`\n", 412 | "- `find your_folder - type f -print| sort | tar -cf your_archive.tar - T -'\n", 413 | "\n", 414 | "### Shuffle and split\n", 415 | "\n", 416 | "- `tar --sorted -cf - your_folders | tarp" 417 | ] 418 | } 419 | ], 420 | "metadata": { 421 | "kernelspec": { 422 | "display_name": "base", 423 | "language": "python", 424 | "name": "base" 425 | }, 426 | "language_info": { 427 | "codemirror_mode": { 428 | "name": "ipython", 429 | "version": 3 430 | }, 431 | "file_extension": ".py", 432 | "mimetype": "text/x-python", 433 | "name": "python", 434 | "nbconvert_exporter": "python", 435 | "pygments_lexer": "ipython3", 436 | "version": "3.9.7" 437 | } 438 | }, 439 | "nbformat": 4, 440 | "nbformat_minor": 5 441 | } 442 | -------------------------------------------------------------------------------- /advanced/__init__.py: -------------------------------------------------------------------------------- 1 | # adaanced 2 | 3 | -------------------------------------------------------------------------------- /advanced/create_wds_fsd50k.py: -------------------------------------------------------------------------------- 1 | """webdataset 2 | python create_wds_fsd50k.py work/16k/fsd50k /data/A/fsd50k eval 16k 3 | """ 4 | 5 | import sys 6 | from multiprocessing import Pool 7 | from pathlib import Path 8 | import pandas as pd 9 | import webdataset as wds 10 | from itertools import islice 
11 | import librosa 12 | import fire 13 | 14 | 15 | def fsd50k_metadata(FSD50K_root): 16 | FSD = Path(FSD50K_root) 17 | df = pd.read_csv(FSD/f'FSD50K.ground_truth/dev.csv') 18 | df['key'] = df.split + '_' + df.fname.apply(lambda s: str(s)) 19 | df['fname'] = df.fname.apply(lambda s: f'FSD50K.dev_audio/{s}.wav') 20 | dftest = pd.read_csv(FSD/f'FSD50K.ground_truth/eval.csv') 21 | dftest['key'] = 'eval_' + dftest.fname.apply(lambda s: str(s)) 22 | dftest['split'] = 'eval' 23 | dftest['fname'] = dftest.fname.apply(lambda s: f'FSD50K.eval_audio/{s}.wav') 24 | df = pd.concat([df, dftest], ignore_index=True) 25 | return df 26 | 27 | 28 | def load_resampled_mono_wav(fpath, sr): 29 | with open(fpath, 'rb') as f: 30 | y = f.read() 31 | # y, org_sr = librosa.load(fpath, sr=None, mono=True) 32 | # if org_sr != sr: 33 | # y = librosa.resample(y, orig_sr=org_sr, target_sr=sr) 34 | return y 35 | 36 | 37 | def _converter_worker(args): 38 | fpath, sr = args 39 | return load_resampled_mono_wav(fpath, sr) 40 | 41 | 42 | def fsd50k_generator(root, split, sr): 43 | root = Path(root) 44 | df = fsd50k_metadata(FSD50K_root=root) 45 | df = df[df.split == split] 46 | print(f'Processing {len(df)} {split} samples.') 47 | for file_name, labels, key in df[['fname', 'labels', 'key']].values: 48 | fpath = root/file_name 49 | 50 | sample = { 51 | '__key__': key, 52 | 'wav': fpath, # load_resampled_mono_wav(fpath, sr), 53 | 'labels': labels, 54 | } 55 | yield sample 56 | 57 | 58 | def create_wds(source, output, split, sr, name='fsd50k-[SPLIT]-[SR]-%06d.tar', maxsize=10**9): 59 | source = source 60 | name = name.replace('[SPLIT]', split).replace('[SR]', str(sr)) 61 | output_name = str(Path(output)/name) 62 | 63 | gen = fsd50k_generator(source, split, sr) 64 | with wds.ShardWriter(output_name, maxsize=maxsize) as sink: 65 | while True: 66 | samples = list(islice(gen, 100)) 67 | if len(samples) == 0: 68 | break 69 | # load and resample wav files 70 | with Pool() as p: 71 | args = [[s['wav'], sr] for s in samples] 72 | wavs = list(p.imap(_converter_worker, args)) 73 | for s, wav in zip(samples, wavs): 74 | s['wav'] = wav 75 | sink.write(s) 76 | print('.', end='') 77 | sys.stdout.flush() 78 | print('Finished') 79 | 80 | 81 | if __name__ == '__main__': 82 | fire.Fire(create_wds) 83 | -------------------------------------------------------------------------------- /advanced/create_wds_fsd50k_resample.py: -------------------------------------------------------------------------------- 1 | """webdataset 2 | """ 3 | 4 | import sys 5 | from multiprocessing import Pool 6 | from pathlib import Path 7 | import pandas as pd 8 | import webdataset as wds 9 | from itertools import islice 10 | import librosa 11 | import fire 12 | 13 | 14 | def fsd50k_metadata(FSD50K_root): 15 | FSD = Path(FSD50K_root) 16 | df = pd.read_csv(FSD/f'FSD50K.ground_truth/dev.csv') 17 | df['key'] = df.split + '_' + df.fname.apply(lambda s: str(s)) 18 | df['fname'] = df.fname.apply(lambda s: f'FSD50K.dev_audio/{s}.wav') 19 | dftest = pd.read_csv(FSD/f'FSD50K.ground_truth/eval.csv') 20 | dftest['key'] = 'eval_' + dftest.fname.apply(lambda s: str(s)) 21 | dftest['split'] = 'eval' 22 | dftest['fname'] = dftest.fname.apply(lambda s: f'FSD50K.eval_audio/{s}.wav') 23 | df = pd.concat([df, dftest], ignore_index=True) 24 | return df 25 | 26 | 27 | def load_resampled_mono_wav(fpath, sr): 28 | y, org_sr = librosa.load(fpath, sr=None, mono=True) 29 | if org_sr != sr: 30 | y = librosa.resample(y, orig_sr=org_sr, target_sr=sr) 31 | return y 32 | 33 | 34 | def 
_converter_worker(args): 35 | fpath, sr = args 36 | return load_resampled_mono_wav(fpath, sr) 37 | 38 | 39 | def fsd50k_generator(root, split, sr): 40 | root = Path(root) 41 | df = fsd50k_metadata(FSD50K_root=root) 42 | df = df[df.split == split] 43 | print(f'Processing {len(df)} {split} samples.') 44 | for file_name, labels, key in df[['fname', 'labels', 'key']].values: 45 | fpath = root/file_name 46 | 47 | sample = { 48 | '__key__': key, 49 | 'npy': fpath, # load_resampled_mono_wav(fpath, sr), 50 | 'labels': labels, 51 | } 52 | yield sample 53 | 54 | 55 | def create_wds(source, output, split, sr, name='fsd50k-[SPLIT]-[SR]-%06d.tar', maxsize=10**9): 56 | source = source 57 | name = name.replace('[SPLIT]', split).replace('[SR]', str(sr)) 58 | output_name = str(Path(output)/name) 59 | 60 | gen = fsd50k_generator(source, split, sr) 61 | with wds.ShardWriter(output_name, maxsize=maxsize) as sink: 62 | while True: 63 | samples = list(islice(gen, 100)) 64 | if len(samples) == 0: 65 | break 66 | # load and resample wav files 67 | with Pool() as p: 68 | args = [[s['npy'], sr] for s in samples] 69 | npys = list(p.imap(_converter_worker, args)) 70 | for s, npy in zip(samples, npys): 71 | s['npy'] = npy 72 | sink.write(s) 73 | print('.', end='') 74 | sys.stdout.flush() 75 | print('Finished') 76 | 77 | 78 | if __name__ == '__main__': 79 | fire.Fire(create_wds) 80 | -------------------------------------------------------------------------------- /advanced/fat2018.py: -------------------------------------------------------------------------------- 1 | """Multi-fold Freesound Audio Tagging solution. 2 | """ 3 | 4 | from src.libs import * 5 | import datetime 6 | from advanced.metric_fat2018 import eval_fat2018_all_splits, eval_fat2018_by_probas 7 | from src.models import resnetish18, VGGish, AlexNet 8 | 9 | 10 | def report_result(message): 11 | print(message) 12 | # you might want to report to slack or anything here 13 | 14 | 15 | def get_transforms(cfg): 16 | NF = cfg.n_mels 17 | NT = cfg.unit_length 18 | augs = [] 19 | for a in cfg.aug.split('x'): 20 | if a == 'RC': 21 | augs.append(GenericRandomResizedCrop((NF, NT), scale=(0.8, 1.0), ratio=(NF/(NT*1.2), NF/(NT*0.8)))) 22 | elif a == 'SA': 23 | augs.append(AT.FrequencyMasking(NF//10)) 24 | augs.append(AT.TimeMasking(NT//10)) 25 | else: 26 | if a: 27 | raise Exception(f'unknown: {a}') 28 | tfms = VT.Compose(augs) 29 | print(tfms) 30 | return tfms 31 | 32 | 33 | def get_model(cfg, num_classes): 34 | if cfg.model == 'AN': 35 | return AlexNet(num_classes) 36 | if cfg.model == 'R18': 37 | return resnetish18(num_classes) 38 | if cfg.model == 'VGG': 39 | return VGGish(num_classes) 40 | raise Exception(f'unknown: {cfg.model}') 41 | 42 | 43 | def read_metadata(cfg): 44 | # Make lists of filenames and labels from meta files 45 | filenames, labels = {}, {} 46 | for split, npy_folder, meta_filename in [['train', f'work/{cfg.type}/FSDKaggle2018.audio_train', 'train_post_competition.csv'], 47 | ['test', f'work/{cfg.type}/FSDKaggle2018.audio_test', 'test_post_competition_scoring_clips.csv']]: 48 | df = pd.read_csv(cfg.data_root/'FSDKaggle2018.meta'/meta_filename) 49 | filenames[split] = np.array([(npy_folder + '/' + fname.replace('.wav', '.npy')) for fname in df.fname.values]) 50 | labels[split] = list(df.label.values) 51 | 52 | # Make a list of classes, converting labels into numbers 53 | classes = sorted(set(labels['train'] + labels['test'])) 54 | for split in labels: 55 | labels[split] = np.array([classes.index(label) for label in labels[split]]) 56 | 57 | return 
filenames, labels, classes 58 | 59 | 60 | def calc_stat(cfg, filenames, labels, classes, calc_stat=False, n_calc_stat=10000): 61 | print(labels) 62 | class_weight = compute_class_weight('balanced', range(len(classes)), labels['train']) 63 | class_weight = torch.tensor(class_weight).to(torch.float) 64 | 65 | if calc_stat: 66 | all_train_lms = np.hstack([np.load(f)[0] for f in filenames['train'][:n_calc_stat]]) 67 | train_mean_std = all_train_lms.mean(), all_train_lms.std() 68 | print(train_mean_std) 69 | else: 70 | train_mean_std = None 71 | 72 | return class_weight, train_mean_std 73 | 74 | 75 | def run(config_file='config-fat2018.yaml', epochs=None, finetune_epochs=None, mixup=None, aug=None, norm=False): 76 | print(config_file, epochs, mixup, aug) 77 | cfg = load_config(config_file) 78 | cfg.epochs = epochs or cfg.epochs 79 | cfg.finetune_epochs = finetune_epochs or cfg.finetune_epochs 80 | cfg.mixup = cfg.mixup if mixup is None else mixup 81 | cfg.aug = aug or cfg.aug or '' 82 | filenames, labels, classes = read_metadata(cfg) 83 | class_weight, train_mean_std = calc_stat(cfg, filenames, labels, classes, calc_stat=norm) 84 | 85 | name = datetime.datetime.now().strftime('%y%m%d%H%M') 86 | name = f'model{cfg.type}-{cfg.model}-{cfg.aug}-m{str(cfg.mixup)[2:]}{"-N" if norm else ""}-{name}' 87 | 88 | weight_folder = Path('work/' + name) 89 | weight_folder.mkdir(parents=True, exist_ok=True) 90 | results, all_file_probas = [], [] 91 | print(f'Training {weight_folder}') 92 | 93 | skf = StratifiedKFold(n_splits=cfg.n_folds) 94 | for fold, (train_index, test_index) in enumerate(skf.split(filenames['train'], labels['train'])): 95 | print("TRAIN:", len(train_index), "TEST:", len(test_index)) 96 | train_files, val_files = filenames['train'][train_index], filenames['train'][test_index] 97 | train_ys, val_ys = labels['train'][train_index], labels['train'][test_index] 98 | 99 | train_dataset = LMSClfDataset(cfg, train_files, train_ys, norm_mean_std=train_mean_std, 100 | transforms=get_transforms(cfg)) 101 | valid_dataset = LMSClfDataset(cfg, val_files, val_ys, norm_mean_std=train_mean_std) 102 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=cfg.bs, shuffle=True, pin_memory=True, 103 | num_workers=multiprocessing.cpu_count()) 104 | valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=cfg.bs, pin_memory=True, 105 | num_workers=multiprocessing.cpu_count()) 106 | 107 | # main training 108 | model = get_model(cfg, len(classes)) 109 | dataloaders = [train_loader, valid_loader, None] 110 | learner = LMSClfLearner(model, dataloaders, mixup_alpha=cfg.mixup, weight=class_weight) 111 | checkpoint = pl.callbacks.ModelCheckpoint(monitor='val_acc') 112 | trainer = pl.Trainer(gpus=1, max_epochs=cfg.epochs, callbacks=[checkpoint]) 113 | trainer.fit(learner) 114 | # result for now 115 | learner.load_state_dict(torch.load(checkpoint.best_model_path)['state_dict']) 116 | (acc, MAP3), file_probas = eval_fat2018_all_splits(cfg, model, device, filenames['test'], labels['test'], 117 | norm_mean_std=train_mean_std, debug_name='test') 118 | 119 | # fine tuning 120 | learner = LMSClfLearner(model, dataloaders, mixup_alpha=0.0, learning_rate=1e-4, weight=class_weight) 121 | checkpoint = pl.callbacks.ModelCheckpoint(monitor='val_acc') 122 | trainer = pl.Trainer(gpus=1, max_epochs=cfg.finetune_epochs, callbacks=[checkpoint]) 123 | trainer.fit(learner) 124 | # result for fine tuned model 125 | learner.load_state_dict(torch.load(checkpoint.best_model_path)['state_dict']) 126 | (acc, MAP3), 
file_probas = eval_fat2018_all_splits(cfg, model, device, filenames['test'], labels['test'], 127 | norm_mean_std=train_mean_std, debug_name='test') 128 | all_file_probas.append(file_probas) 129 | results.append(MAP3) 130 | 131 | fold_weight = weight_folder/f'{fold}-{Path(checkpoint.best_model_path).name}' 132 | copy_file(checkpoint.best_model_path, fold_weight) 133 | print(f'Saved fold#{fold} weight as {fold_weight}') 134 | 135 | mean_file_probas = np.array(all_file_probas).mean(axis=0) 136 | acc, MAP3 = eval_fat2018_by_probas(mean_file_probas, labels['test'], debug_name='test') 137 | np.save(weight_folder/'ens_probas.npy', mean_file_probas) 138 | report_text = f'{name},{epochs},{aug},{mixup},{norm},{MAP3},{np.mean(results)}' 139 | report_result(report_text) 140 | 141 | 142 | if __name__ == '__main__': 143 | fire.Fire(run) 144 | 145 | 146 | -------------------------------------------------------------------------------- /advanced/metric_fat2018.py: -------------------------------------------------------------------------------- 1 | # Based on https://github.com/DCASE-REPO/dcase2018_baseline/blob/master/task2/evaluation.py 2 | 3 | from src.libs import * 4 | import datetime 5 | import numpy as np 6 | import torch 7 | import multiprocessing 8 | 9 | 10 | def one_ap(gt, topk): 11 | for i, p in enumerate(topk): 12 | if gt == p: 13 | return 1.0 / (i + 1.0) 14 | return 0.0 15 | 16 | 17 | def avg_precision(gts=None, topks=None): 18 | return np.array([one_ap(gt, topk) for gt, topk in zip(gts, topks)]) 19 | 20 | 21 | def eval_fat2018_by_probas(probas, labels, debug_name=None, TOP_K=3): 22 | correct = ap = 0.0 23 | for proba, label in zip(probas, labels): 24 | topk = proba.argsort()[-TOP_K:][::-1] 25 | correct += int(topk[0] == label) 26 | ap += one_ap(label, topk) 27 | acc = correct / len(labels) 28 | mAP = ap / len(labels) 29 | if debug_name: 30 | print(f'{debug_name} acc = {acc:.4f}, MAP@{TOP_K} = {mAP}') 31 | return acc, mAP 32 | 33 | 34 | def eval_fat2018(model, device, dataloader, debug_name=None, TTA=1): 35 | model = model.to(device).eval() 36 | all_probas, labels = [], [] 37 | with torch.no_grad(): 38 | for _ in range(TTA): 39 | for X, gts in dataloader: 40 | preds = model(X.to(device)) 41 | probas = preds.softmax(1) 42 | all_probas.extend(probas.cpu().numpy()) 43 | labels.extend(gts.cpu().numpy()) 44 | all_probas = np.array(all_probas) 45 | return eval_fat2018_by_probas(all_probas, labels, debug_name=debug_name), all_probas 46 | 47 | 48 | def eval_fat2018_all_splits(cfg, model, device, filenames, labels, norm_mean_std=None, debug_name=None, head_n=999, agg='mean'): 49 | model = model.to(device).eval() 50 | file_probas = [[] for _ in range(len(labels))] 51 | test_dataset = SplitAllDataset(cfg, filenames, norm_mean_std=norm_mean_std, head_n=head_n) 52 | test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=cfg.bs, 53 | num_workers=multiprocessing.cpu_count(), pin_memory=True) 54 | print(f'Predicting all {len(test_dataset)} splits for {len(labels)} files...') 55 | for X, fileidxs in test_loader: 56 | with torch.no_grad(): 57 | preds = model(X.to(device)) 58 | probas = F.softmax(preds, dim=1) 59 | for idx, prob in zip(fileidxs.cpu().numpy(), probas.cpu().numpy()): 60 | file_probas[idx].append(prob) 61 | 62 | if agg == 'max': 63 | file_probas = np.array([np.max(probas, axis=0) for probas in file_probas]) 64 | elif agg == 'mean': 65 | file_probas = np.array([np.mean(probas, axis=0) for probas in file_probas]) 66 | else: 67 | raise Exception() 68 | 69 | return 
eval_fat2018_by_probas(file_probas, labels, debug_name=debug_name), file_probas 70 | -------------------------------------------------------------------------------- /advanced/preprocess_fat2018.py: -------------------------------------------------------------------------------- 1 | """Preprocess Freesound Audio Tagging 2018 competition data. 2 | """ 3 | 4 | import warnings 5 | warnings.simplefilter('ignore') 6 | 7 | from src.libs import * 8 | from tqdm import tqdm 9 | import fire 10 | 11 | def convert(config='config.yaml'): 12 | cfg = load_config(config) 13 | print(cfg) 14 | DATA_ROOT = Path(cfg.data_root) 15 | DEST = Path('work')/cfg.type 16 | 17 | folders = ['FSDKaggle2018.audio_test', 'FSDKaggle2018.audio_train'] 18 | 19 | to_mel_spectrogram = torchaudio.transforms.MelSpectrogram( 20 | sample_rate=cfg.sample_rate, n_fft=cfg.n_fft, n_mels=cfg.n_mels, 21 | hop_length=cfg.hop_length, f_min=cfg.f_min, f_max=cfg.f_max) 22 | 23 | for folder in folders: 24 | cur_folder = DATA_ROOT/folder 25 | filenames = sorted(cur_folder.glob('*.wav')) 26 | resampler = None 27 | for filename in tqdm(filenames): 28 | # Load waveform 29 | waveform, sr = torchaudio.load(filename) 30 | #assert sr == cfg.sample_rate 31 | if sr != cfg.sample_rate: 32 | if resampler is None: 33 | resampler = torchaudio.transforms.Resample(sr, cfg.sample_rate) 34 | print(f'CAUTION: RESAMPLING from {sr} Hz to {cfg.sample_rate} Hz.') 35 | waveform = resampler(waveform) 36 | # To log-mel spectrogram 37 | log_mel_spec = to_mel_spectrogram(waveform).log() 38 | # Write to work 39 | (DEST/folder).mkdir(parents=True, exist_ok=True) 40 | np.save(DEST/folder/filename.name.replace('.wav', '.npy'), log_mel_spec) 41 | 42 | 43 | fire.Fire(convert) 44 | -------------------------------------------------------------------------------- /config-fat2018.yaml: -------------------------------------------------------------------------------- 1 | # type name of this configuration 2 | type: B 3 | 4 | # basic setting parameters 5 | clip_length: 5.0 # [sec] 6 | 7 | # preprocessing parameters 8 | sample_rate: 44100 9 | hop_length: 441 10 | n_fft: 2048 11 | n_mels: 64 12 | f_min: 0 13 | f_max: 22050 14 | 15 | # test parameters 16 | bs: 64 #128 17 | mixup: 0.4 18 | n_folds: 5 19 | epochs: 300 20 | finetune_epochs: 20 21 | aug: RCxSA # RC: random resized crop, SA: spec augment 22 | model: R18 # R18: ResNetish18, VGG: VGGish 23 | 24 | # dataset configurations 25 | data_root: /data/A/2018fsd -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | # basic setting parameters 2 | clip_length: 5.0 # [sec] 3 | 4 | # preprocessing parameters 5 | sample_rate: 44100 6 | hop_length: 441 7 | n_fft: 1024 8 | n_mels: 64 9 | f_min: 0 10 | f_max: 22050 11 | -------------------------------------------------------------------------------- /for_evar/README.md: -------------------------------------------------------------------------------- 1 | # For EVAR 2 | 3 | [EVAR](https://github.com/nttcslab/eval-audio-repr) is a evaluation package for audio representations. 4 | 5 | This subfolder holds files belonging to opensource for EVAR. 6 | 7 | ## Acknoledgement 8 | 9 | We use/borrow [PANNs](https://github.com/qiuqiangkong/audioset_tagging_cnn) implementation. 
10 | 11 | - https://github.com/qiuqiangkong/audioset_tagging_cnn 12 | -------------------------------------------------------------------------------- /for_evar/cnn14_decoupled.py: -------------------------------------------------------------------------------- 1 | """CNN14 network, decoupled from Spectrogram, LogmelFilterBank, SpecAugmentation, and classifier head. 2 | 3 | ## Reference 4 | - [1] https://arxiv.org/abs/1912.10211 "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition" 5 | - [2] https://github.com/qiuqiangkong/audioset_tagging_cnn 6 | """ 7 | 8 | import torch 9 | from torch import nn 10 | import torch.nn.functional as F 11 | from torchlibrosa.stft import Spectrogram, LogmelFilterBank 12 | 13 | 14 | class AudioFeatureExtractor(nn.Module): 15 | def __init__(self, sample_rate=16000, n_fft=512, n_mels=64, hop_length=160, win_length=512, f_min=50, f_max=8000): 16 | super().__init__() 17 | 18 | # Spectrogram extractor 19 | self.spectrogram_extractor = Spectrogram(n_fft=n_fft, hop_length=hop_length, 20 | win_length=win_length, window='hann', center=True, pad_mode='reflect', 21 | freeze_parameters=True) 22 | 23 | # Logmel feature extractor 24 | self.logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=win_length, 25 | n_mels=n_mels, fmin=f_min, fmax=f_max, ref=1.0, amin=1e-10, top_db=None, 26 | freeze_parameters=True) 27 | 28 | def forward(self, batch_audio): 29 | x = self.spectrogram_extractor(batch_audio) # (B, 1, T, F(freq_bins)) 30 | x = self.logmel_extractor(x) # (B, 1, T, F(mel_bins)) 31 | return x 32 | 33 | 34 | def initialize_layers(layer): 35 | # initialize all childrens first. 36 | for l in layer.children(): 37 | initialize_layers(l) 38 | 39 | # initialize only linaer 40 | if type(layer) != nn.Linear: 41 | return 42 | 43 | # Thanks to https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/pytorch/models.py#L10 44 | nn.init.xavier_uniform_(layer.weight) 45 | if hasattr(layer, 'bias'): 46 | if layer.bias is not None: 47 | layer.bias.data.fill_(0.) 48 | 49 | 50 | def init_bn(bn): 51 | """Initialize a Batchnorm layer. """ 52 | bn.bias.data.fill_(0.) 53 | bn.weight.data.fill_(1.) 
54 | 55 | 56 | class ConvBlock(nn.Module): 57 | def __init__(self, in_channels, out_channels): 58 | 59 | super(ConvBlock, self).__init__() 60 | 61 | self.conv1 = nn.Conv2d(in_channels=in_channels, 62 | out_channels=out_channels, 63 | kernel_size=(3, 3), stride=(1, 1), 64 | padding=(1, 1), bias=False) 65 | 66 | self.conv2 = nn.Conv2d(in_channels=out_channels, 67 | out_channels=out_channels, 68 | kernel_size=(3, 3), stride=(1, 1), 69 | padding=(1, 1), bias=False) 70 | 71 | self.bn1 = nn.BatchNorm2d(out_channels) 72 | self.bn2 = nn.BatchNorm2d(out_channels) 73 | 74 | self.init_weight() 75 | 76 | def init_weight(self): 77 | initialize_layers(self.conv1) 78 | initialize_layers(self.conv2) 79 | init_bn(self.bn1) 80 | init_bn(self.bn2) 81 | 82 | 83 | def forward(self, input, pool_size=(2, 2), pool_type='avg'): 84 | 85 | x = input 86 | x = F.relu_(self.bn1(self.conv1(x))) 87 | x = F.relu_(self.bn2(self.conv2(x))) 88 | if pool_type == 'max': 89 | x = F.max_pool2d(x, kernel_size=pool_size) 90 | elif pool_type == 'avg': 91 | x = F.avg_pool2d(x, kernel_size=pool_size) 92 | elif pool_type == 'avg+max': 93 | x1 = F.avg_pool2d(x, kernel_size=pool_size) 94 | x2 = F.max_pool2d(x, kernel_size=pool_size) 95 | x = x1 + x2 96 | else: 97 | raise Exception('Incorrect argument!') 98 | 99 | return x 100 | 101 | 102 | class Cnn14_Decoupled(nn.Module): 103 | """CNN14 network, decoupled from Spectrogram, LogmelFilterBank, SpecAugmentation, and classifier head. 104 | Original implementation: https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/master/pytorch/models.py 105 | """ 106 | 107 | def __init__(self, n_mels=64, d=2048): 108 | assert d == 2048, 'This implementation accepts d=2048 only, for compatible with the original Cnn14.' 109 | super().__init__() 110 | 111 | self.bn0 = nn.BatchNorm2d(n_mels) 112 | 113 | self.conv_block1 = ConvBlock(in_channels=1, out_channels=64) 114 | self.conv_block2 = ConvBlock(in_channels=64, out_channels=128) 115 | self.conv_block3 = ConvBlock(in_channels=128, out_channels=256) 116 | self.conv_block4 = ConvBlock(in_channels=256, out_channels=512) 117 | self.conv_block5 = ConvBlock(in_channels=512, out_channels=1024) 118 | self.conv_block6 = ConvBlock(in_channels=1024, out_channels=2048) 119 | 120 | self.fc1 = nn.Linear(2048, d, bias=True) 121 | #self.fc_audioset = nn.Linear(d, classes_num, bias=True) 122 | 123 | self.init_weight() 124 | 125 | def init_weight(self): 126 | init_bn(self.bn0) 127 | initialize_layers(self.fc1) 128 | #init_layer(self.fc_audioset) 129 | 130 | def encode(self, x, squash_freq=True): 131 | x = x.transpose(1, 3) 132 | x = self.bn0(x) 133 | x = x.transpose(1, 3) 134 | 135 | x = self.conv_block1(x, pool_size=(2, 2), pool_type='avg') 136 | x = F.dropout(x, p=0.2, training=self.training) 137 | x = self.conv_block2(x, pool_size=(2, 2), pool_type='avg') 138 | x = F.dropout(x, p=0.2, training=self.training) 139 | x = self.conv_block3(x, pool_size=(2, 2), pool_type='avg') 140 | x = F.dropout(x, p=0.2, training=self.training) 141 | x3 = x 142 | x = self.conv_block4(x, pool_size=(2, 2), pool_type='avg') 143 | x = F.dropout(x, p=0.2, training=self.training) 144 | x = self.conv_block5(x, pool_size=(2, 2), pool_type='avg') 145 | x = F.dropout(x, p=0.2, training=self.training) 146 | x = self.conv_block6(x, pool_size=(1, 1), pool_type='avg') 147 | x = F.dropout(x, p=0.2, training=self.training) 148 | if squash_freq: 149 | x = torch.mean(x, dim=3) 150 | return x 151 | 152 | def temporal_pooling(self, x): 153 | (x1, _) = torch.max(x, dim=2) 154 | x2 = torch.mean(x, dim=2) 
155 | x = x1 + x2 156 | x = F.dropout(x, p=0.5, training=self.training) 157 | x = F.relu_(self.fc1(x)) 158 | embedding = F.dropout(x, p=0.5, training=self.training) 159 | return embedding 160 | 161 | def forward(self, x): 162 | x = self.encode(x) 163 | embedding = self.temporal_pooling(x) 164 | 165 | return embedding 166 | -------------------------------------------------------------------------------- /for_evar/sampler.py: -------------------------------------------------------------------------------- 1 | """Samplers. 2 | 3 | Mostly borrowed from: 4 | https://github.com/qiuqiangkong/audioset_tagging_cnn 5 | """ 6 | 7 | import numpy as np 8 | import logging 9 | 10 | 11 | class BalancedRandomSampler(): 12 | """ 13 | This is a simple version of: 14 | https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/utils/data_generator.py#L175 15 | """ 16 | def __init__(self, dataset, batch_size, random_seed=42): 17 | 18 | self.dataset = dataset 19 | self.batch_size = batch_size 20 | self.random_state = np.random.RandomState(random_seed) 21 | 22 | self.samples_per_class = np.sum(self.dataset.labels.numpy(), axis=0) 23 | logging.info(f'samples per class: {self.samples_per_class.astype(np.int32)}') 24 | 25 | # Training indexes of all sound classes. E.g.: 26 | # [[0, 11, 12, ...], [3, 4, 15, 16, ...], [7, 8, ...], ...] 27 | self.indexes_per_class = [] 28 | self.classes_num = len(self.dataset.classes) 29 | 30 | for k in range(self.classes_num): 31 | self.indexes_per_class.append( 32 | np.where(dataset.labels[:, k] != 0)[0]) 33 | 34 | # Shuffle indexes 35 | for k in range(self.classes_num): 36 | self.random_state.shuffle(self.indexes_per_class[k]) 37 | 38 | self.queue = [] 39 | self.pointers_of_classes = [0] * self.classes_num 40 | 41 | def expand_queue(self, queue): 42 | classes_set = np.arange(self.classes_num).tolist() 43 | self.random_state.shuffle(classes_set) 44 | queue += classes_set 45 | return queue 46 | 47 | def __iter__(self): 48 | while True: 49 | batch_idxs = [] 50 | for _ in range(self.batch_size): 51 | if len(self.queue) == 0: 52 | self.queue = self.expand_queue(self.queue) 53 | 54 | class_id = self.queue.pop(0) 55 | pointer = self.pointers_of_classes[class_id] 56 | self.pointers_of_classes[class_id] += 1 57 | batch_idxs.append(self.indexes_per_class[class_id][pointer]) 58 | 59 | # When finish one epoch of a sound class, then shuffle its indexes and reset pointer 60 | if self.pointers_of_classes[class_id] >= self.samples_per_class[class_id]: 61 | self.pointers_of_classes[class_id] = 0 62 | self.random_state.shuffle(self.indexes_per_class[class_id]) 63 | 64 | yield batch_idxs 65 | 66 | def __len__(self): 67 | return (len(self.dataset) + self.batch_size - 1) // self.batch_size 68 | 69 | 70 | class InfiniteSampler(object): 71 | def __init__(self, dataset, batch_size, random_seed=42, shuffle=False): 72 | self.df = dataset.df 73 | self.batch_size = batch_size 74 | self.random_state = np.random.RandomState(random_seed) 75 | self.indexes = self.df.index.values.copy() 76 | self.shuffle = shuffle 77 | if self.shuffle: 78 | self.random_state.shuffle(self.indexes) 79 | 80 | def __iter__(self): 81 | pointer = 0 82 | while True: 83 | batch_idxs = [] 84 | for _ in range(self.batch_size): 85 | batch_idxs.append(self.indexes[pointer]) 86 | pointer += 1 87 | if pointer >= len(self.indexes): 88 | pointer = 0 89 | if self.shuffle: 90 | self.random_state.shuffle(self.indexes) 91 | yield batch_idxs 92 | 93 | def __len__(self): 94 | return (len(self.df) + 
self.batch_size - 1) // self.batch_size 95 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch>=1.7.0 2 | torchaudio>=0.7.0 3 | pytorch-lightning 4 | pyyaml 5 | easydict 6 | matplotlib 7 | numpy 8 | jupyter 9 | pandas 10 | scikit-learn 11 | fire 12 | dl-cliche 13 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- 1 | # multi-label 2 | -------------------------------------------------------------------------------- /src/augmentations.py: -------------------------------------------------------------------------------- 1 | # Borrowed from https://github.com/pytorch/vision/blob/master/torchvision/transforms/functional.py 2 | import torch 3 | import torch.nn.functional as F 4 | import math 5 | 6 | 7 | class GenericRandomResizedCrop(): 8 | def __init__(self, size, scale=(0.08, 1.0), ratio=(3. / 4., 4. / 3.)): 9 | self.size = size 10 | self.scale = scale 11 | self.ratio = ratio 12 | 13 | @staticmethod 14 | def get_params(x, scale, ratio): 15 | width, height = x.shape[1:] 16 | area = height * width 17 | 18 | for _ in range(100): 19 | target_area = area * torch.empty(1).uniform_(scale[0], scale[1]).item() 20 | log_ratio = torch.log(torch.tensor(ratio)) 21 | aspect_ratio = torch.exp( 22 | torch.empty(1).uniform_(log_ratio[0], log_ratio[1]) 23 | ).item() 24 | 25 | w = int(round(math.sqrt(target_area * aspect_ratio))) 26 | h = int(round(math.sqrt(target_area / aspect_ratio))) 27 | 28 | if 0 < w <= width and 0 < h <= height: 29 | i = torch.randint(0, height - h + 1, size=(1,)).item() 30 | j = torch.randint(0, width - w + 1, size=(1,)).item() 31 | return i, j, h, w 32 | 33 | # Fallback to central crop 34 | in_ratio = float(width) / float(height) 35 | if in_ratio < min(ratio): 36 | w = width 37 | h = int(round(w / min(ratio))) 38 | elif in_ratio > max(ratio): 39 | h = height 40 | w = int(round(h * max(ratio))) 41 | else: # whole image 42 | w = width 43 | h = height 44 | i = (height - h) // 2 45 | j = (width - w) // 2 46 | return i, j, h, w 47 | 48 | def __call__(self, x): 49 | i, j, h, w = self.get_params(x, self.scale, self.ratio) 50 | x = x[:, j:j+w, i:i+h] 51 | return F.interpolate(x.unsqueeze(0), size=self.size, mode='bicubic', align_corners=True).squeeze(0) 52 | 53 | def __repr__(self): 54 | return f'{self.__class__.__name__}(size={self.size}, scale={self.scale}, ratio={self.ratio})' 55 | -------------------------------------------------------------------------------- /src/libs.py: -------------------------------------------------------------------------------- 1 | import warnings 2 | warnings.simplefilter('ignore') 3 | 4 | # Essential PyTorch 5 | import torch 6 | import torchaudio 7 | 8 | # Other modules used in this notebook 9 | from pathlib import Path 10 | import matplotlib.pyplot as plt 11 | import pandas as pd 12 | import numpy as np 13 | from IPython.display import Audio 14 | import fire 15 | import yaml 16 | import multiprocessing 17 | from easydict import EasyDict 18 | from sklearn.model_selection import train_test_split, StratifiedKFold 19 | from sklearn.utils.class_weight import compute_class_weight 20 | import torch 21 | import torch.nn as nn 22 | import torch.nn.functional as F 23 | import pytorch_lightning as pl 24 | from pytorch_lightning.metrics.functional import accuracy 25 | 26 | import torchvision.transforms as VT 
27 | import torchaudio.transforms as AT 28 | 29 | from dlcliche.torch_utils import IntraBatchMixup 30 | from dlcliche.utils import copy_file 31 | 32 | from src.augmentations import GenericRandomResizedCrop 33 | 34 | 35 | device = torch.device('cuda') 36 | 37 | 38 | def load_config(filename, debug=False): 39 | with open(filename) as conf: 40 | cfg = EasyDict(yaml.safe_load(conf)) 41 | cfg.unit_length = int((cfg.clip_length * cfg.sample_rate + cfg.hop_length - 1) // cfg.hop_length) 42 | cfg.data_root = Path(cfg.data_root) 43 | if debug: 44 | print(cfg) 45 | return cfg 46 | 47 | 48 | def sample_length(log_mel_spec): 49 | return log_mel_spec.shape[-1] 50 | 51 | 52 | class LMSClfDataset(torch.utils.data.Dataset): 53 | def __init__(self, cfg, filenames, labels, transforms=None, norm_mean_std=None): 54 | assert len(filenames) == len(labels), f'Inconsistent length of filenames and labels.' 55 | 56 | self.filenames = filenames 57 | self.labels = labels 58 | self.transforms = transforms 59 | self.norm_mean_std = norm_mean_std 60 | 61 | # Calculate length of clip this dataset will make 62 | self.unit_length = cfg.unit_length 63 | 64 | # Test with first file 65 | assert self[0][0].shape[-1] == self.unit_length, f'Check your files, failed to load {filenames[0]}' 66 | 67 | # Show basic info. 68 | print(f'Dataset will yield log-mel spectrogram {len(self)} data samples in shape [1, {cfg.n_mels}, {self.unit_length}]') 69 | 70 | def __len__(self): 71 | return len(self.filenames) 72 | 73 | def __getitem__(self, index): 74 | assert 0 <= index and index < len(self) 75 | 76 | log_mel_spec = np.load(self.filenames[index]) 77 | 78 | # Normalize 79 | if self.norm_mean_std is not None: 80 | log_mel_spec = (log_mel_spec - self.norm_mean_std[0]) / self.norm_mean_std[1] 81 | 82 | # Padding if sample is shorter than expected - both head & tail are filled with 0s 83 | pad_size = self.unit_length - sample_length(log_mel_spec) 84 | if pad_size > 0: 85 | offset = pad_size // 2 86 | log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, 0), (offset, pad_size - offset)), 'constant') 87 | 88 | # Random crop 89 | crop_size = sample_length(log_mel_spec) - self.unit_length 90 | if crop_size > 0: 91 | start = np.random.randint(0, crop_size) 92 | log_mel_spec = log_mel_spec[..., start:start + self.unit_length] 93 | 94 | # Apply augmentations 95 | log_mel_spec = torch.Tensor(log_mel_spec) 96 | if self.transforms is not None: 97 | log_mel_spec = self.transforms(log_mel_spec) 98 | 99 | return log_mel_spec, self.labels[index] 100 | 101 | 102 | class SplitAllDataset(torch.utils.data.Dataset): 103 | def __init__(self, cfg, filenames, norm_mean_std=None, head_n=99999): 104 | self.filenames = filenames 105 | self.norm_mean_std = norm_mean_std 106 | 107 | # Calculate length of clip this dataset will make 108 | self.L = cfg.unit_length 109 | 110 | # Get # of splits for all files 111 | self.n_splits = np.array([(np.load(f).shape[-1] + self.L - 1) // self.L for f in filenames]) 112 | self.n_splits = np.clip(1, head_n, self.n_splits) # limit number of splits. 
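# Cumulative split counts; file_index() below uses them to map a flat sample index back to its source file.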
113 | self.sum_splits = np.cumsum(self.n_splits) 114 | 115 | def __len__(self): 116 | return self.sum_splits[-1] 117 | 118 | def file_index(self, index): 119 | return sum((index < self.sum_splits) == False) 120 | 121 | def filename(self, index): 122 | return self.filenames[self.file_index(index)] 123 | 124 | def split_index(self, index): 125 | fidx = self.file_index(index) 126 | prev_sum = self.sum_splits[fidx - 1] if fidx > 0 else 0 127 | return index - prev_sum 128 | 129 | def __getitem__(self, index): 130 | assert 0 <= index and index < len(self) 131 | 132 | log_mel_spec = np.load(self.filename(index)) 133 | start = self.split_index(index) * self.L 134 | log_mel_spec = log_mel_spec[..., start:start + self.L] 135 | 136 | # Normalize 137 | if self.norm_mean_std is not None: 138 | log_mel_spec = (log_mel_spec - self.norm_mean_std[0]) / self.norm_mean_std[1] 139 | 140 | # Padding if sample is shorter than expected - both head & tail are filled with 0s 141 | pad_size = self.L - sample_length(log_mel_spec) 142 | if pad_size > 0: 143 | offset = pad_size // 2 144 | log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, 0), (offset, pad_size - offset)), 'constant') 145 | 146 | return log_mel_spec, self.file_index(index) 147 | 148 | 149 | class LMSClfLearner(pl.LightningModule): 150 | 151 | def __init__(self, model, dataloaders, learning_rate=3e-4, mixup_alpha=0.0, weight=None): 152 | super().__init__() 153 | self.learning_rate = learning_rate 154 | self.model = model 155 | self.trn_dl, self.val_dl, self.test_dl = dataloaders 156 | self.criterion = nn.CrossEntropyLoss(weight=weight) 157 | self.batch_mixer = IntraBatchMixup(self.criterion, alpha=mixup_alpha) if mixup_alpha > 0.0 else None 158 | 159 | def forward(self, x): 160 | x = self.model(x) 161 | return x 162 | 163 | def step(self, x, y, train): 164 | if self.batch_mixer is None: 165 | preds = self(x) 166 | loss = self.criterion(preds, y) 167 | else: 168 | x, stacked_y = self.batch_mixer.transform(x, y, train=train) 169 | preds = self(x) 170 | loss = self.batch_mixer.criterion(preds, stacked_y) 171 | return preds, loss 172 | 173 | def training_step(self, batch, batch_idx): 174 | x, y = batch 175 | preds, loss = self.step(x, y, train=True) 176 | return loss 177 | 178 | def validation_step(self, batch, batch_idx, split='val'): 179 | x, y = batch 180 | preds, loss = self.step(x, y, train=False) 181 | yhat = torch.argmax(preds, dim=1) 182 | acc = accuracy(yhat, y) 183 | 184 | self.log(f'{split}_loss', loss, prog_bar=True) 185 | self.log(f'{split}_acc', acc, prog_bar=True) 186 | return loss 187 | 188 | def test_step(self, batch, batch_idx): 189 | return self.validation_step(batch, batch_idx, split='test') 190 | 191 | def configure_optimizers(self): 192 | optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate) 193 | return optimizer 194 | 195 | def train_dataloader(self): 196 | return self.trn_dl 197 | 198 | def val_dataloader(self): 199 | return self.val_dl 200 | 201 | def test_dataloader(self): 202 | return self.test_dl -------------------------------------------------------------------------------- /src/lwlrap.py: -------------------------------------------------------------------------------- 1 | # Borrowed from https://github.com/DCASE-REPO/dcase2019_task2_baseline/blob/master/evaluation.py 2 | import numpy as np 3 | 4 | 5 | class Lwlrap(object): 6 | """Computes label-weighted label-ranked average precision (lwlrap).""" 7 | 8 | def __init__(self, class_map): 9 | self.num_classes = 0 10 | self.total_num_samples = 0 11 | 
self._class_map = class_map 12 | 13 | def accumulate(self, batch_truth, batch_scores): 14 | """Accumulate a new batch of samples into the metric. 15 | Args: 16 | batch_truth: np.array of (num_samples, num_classes) giving boolean 17 | ground-truth of presence of that class in that sample for this batch. 18 | batch_scores: np.array of (num_samples, num_classes) giving the 19 | classifier-under-test's real-valued score for each class for each 20 | sample. 21 | """ 22 | assert batch_scores.shape == batch_truth.shape 23 | num_samples, num_classes = batch_truth.shape 24 | if not self.num_classes: 25 | self.num_classes = num_classes 26 | self._per_class_cumulative_precision = np.zeros(self.num_classes) 27 | self._per_class_cumulative_count = np.zeros(self.num_classes, 28 | dtype=int) 29 | assert num_classes == self.num_classes 30 | for truth, scores in zip(batch_truth, batch_scores): 31 | pos_class_indices, precision_at_hits = ( 32 | self._one_sample_positive_class_precisions(scores, truth)) 33 | self._per_class_cumulative_precision[pos_class_indices] += ( 34 | precision_at_hits) 35 | self._per_class_cumulative_count[pos_class_indices] += 1 36 | self.total_num_samples += num_samples 37 | 38 | def _one_sample_positive_class_precisions(self, scores, truth): 39 | """Calculate precisions for each true class for a single sample. 40 | Args: 41 | scores: np.array of (num_classes,) giving the individual classifier scores. 42 | truth: np.array of (num_classes,) bools indicating which classes are true. 43 | Returns: 44 | pos_class_indices: np.array of indices of the true classes for this sample. 45 | pos_class_precisions: np.array of precisions corresponding to each of those 46 | classes. 47 | """ 48 | num_classes = scores.shape[0] 49 | pos_class_indices = np.flatnonzero(truth > 0) 50 | # Only calculate precisions if there are some true classes. 51 | if not len(pos_class_indices): 52 | return pos_class_indices, np.zeros(0) 53 | # Retrieval list of classes for this sample. 54 | retrieved_classes = np.argsort(scores)[::-1] 55 | # class_rankings[top_scoring_class_index] == 0 etc. 56 | class_rankings = np.zeros(num_classes, dtype=int) 57 | class_rankings[retrieved_classes] = range(num_classes) 58 | # Which of these is a true label? 59 | retrieved_class_true = np.zeros(num_classes, dtype=bool) 60 | retrieved_class_true[class_rankings[pos_class_indices]] = True 61 | # Num hits for every truncated retrieval list. 62 | retrieved_cumulative_hits = np.cumsum(retrieved_class_true) 63 | # Precision of retrieval list truncated at each hit, in order of pos_labels. 64 | precision_at_hits = ( 65 | retrieved_cumulative_hits[class_rankings[pos_class_indices]] / 66 | (1 + class_rankings[pos_class_indices].astype(float))) 67 | return pos_class_indices, precision_at_hits 68 | 69 | def per_class_lwlrap(self): 70 | """Return a vector of the per-class lwlraps for the accumulated samples.""" 71 | return (self._per_class_cumulative_precision / 72 | np.maximum(1, self._per_class_cumulative_count)) 73 | 74 | def per_class_weight(self): 75 | """Return a normalized weight vector for the contributions of each class.""" 76 | return (self._per_class_cumulative_count / 77 | float(np.sum(self._per_class_cumulative_count))) 78 | 79 | def overall_lwlrap(self): 80 | """Return the scalar overall lwlrap for accumulated samples.""" 81 | return np.sum(self.per_class_lwlrap() * self.per_class_weight()) 82 | 83 | def __str__(self): 84 | per_class_lwlrap = self.per_class_lwlrap() 85 | # List classes in descending order of lwlrap.
86 | s = (['Lwlrap(%s) = %.6f' % (name, lwlrap) for (lwlrap, name) in 87 | sorted([(per_class_lwlrap[i], self._class_map[i]) for i in range(self.num_classes)], 88 | reverse=True)]) 89 | s.append('Overall lwlrap = %.6f' % (self.overall_lwlrap())) 90 | return '\n'.join(s) 91 | -------------------------------------------------------------------------------- /src/models.py: -------------------------------------------------------------------------------- 1 | """Audio models based on VGGish [1] paper. 2 | 3 | ## About 4 | 5 | Based on following implementations: 6 | 7 | - https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py -- borrowed most of code from this torchvision implementation. 8 | - https://github.com/harritaylor/torchvggish 9 | 10 | ## Disclaimer 11 | 12 | Tried to follow the original paper description, but there could be difference from the real ResNetish/VGGish. 13 | 14 | ## References 15 | 16 | [1] S. Hershey et al., ‘CNN Architectures for Large-Scale Audio Classification’,\ in International Conference on Acoustics, Speech and Signal Processing (ICASSP),2017\ Available: https://arxiv.org/abs/1609.09430, https://ai.google/research/pubs/pub45611 17 | """ 18 | 19 | import torch 20 | from torch import Tensor 21 | import torch.nn as nn 22 | from typing import Type, Any, Callable, Union, List, Optional 23 | 24 | 25 | def conv3x3(in_planes: int, out_planes: int, stride: int = 1, groups: int = 1, dilation: int = 1) -> nn.Conv2d: 26 | """3x3 convolution with padding""" 27 | return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, 28 | padding=dilation, groups=groups, bias=False, dilation=dilation) 29 | 30 | 31 | def conv1x1(in_planes: int, out_planes: int, stride: int = 1) -> nn.Conv2d: 32 | """1x1 convolution""" 33 | return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False) 34 | 35 | 36 | class BasicBlock(nn.Module): 37 | expansion: int = 1 38 | 39 | def __init__( 40 | self, 41 | inplanes: int, 42 | planes: int, 43 | stride: int = 1, 44 | downsample: Optional[nn.Module] = None, 45 | groups: int = 1, 46 | base_width: int = 64, 47 | dilation: int = 1, 48 | norm_layer: Optional[Callable[..., nn.Module]] = None 49 | ) -> None: 50 | super(BasicBlock, self).__init__() 51 | if norm_layer is None: 52 | norm_layer = nn.BatchNorm2d 53 | if groups != 1 or base_width != 64: 54 | raise ValueError('BasicBlock only supports groups=1 and base_width=64') 55 | if dilation > 1: 56 | raise NotImplementedError("Dilation > 1 not supported in BasicBlock") 57 | # Both self.conv1 and self.downsample layers downsample the input when stride != 1 58 | self.conv1 = conv3x3(inplanes, planes, stride) 59 | self.bn1 = norm_layer(planes) 60 | self.relu = nn.ReLU(inplace=True) 61 | self.conv2 = conv3x3(planes, planes) 62 | self.bn2 = norm_layer(planes) 63 | self.downsample = downsample 64 | self.stride = stride 65 | 66 | def forward(self, x: Tensor) -> Tensor: 67 | identity = x 68 | 69 | out = self.conv1(x) 70 | out = self.bn1(out) 71 | out = self.relu(out) 72 | 73 | out = self.conv2(out) 74 | out = self.bn2(out) 75 | 76 | if self.downsample is not None: 77 | identity = self.downsample(x) 78 | 79 | out += identity 80 | out = self.relu(out) 81 | 82 | return out 83 | 84 | 85 | class Bottleneck(nn.Module): 86 | # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2) 87 | # while original implementation places the stride at the first 1x1 convolution(self.conv1) 88 | # according to "Deep residual learning for image 
recognition"https://arxiv.org/abs/1512.03385. 89 | # This variant is also known as ResNet V1.5 and improves accuracy according to 90 | # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch. 91 | 92 | expansion: int = 4 93 | 94 | def __init__( 95 | self, 96 | inplanes: int, 97 | planes: int, 98 | stride: int = 1, 99 | downsample: Optional[nn.Module] = None, 100 | groups: int = 1, 101 | base_width: int = 64, 102 | dilation: int = 1, 103 | norm_layer: Optional[Callable[..., nn.Module]] = None 104 | ) -> None: 105 | super(Bottleneck, self).__init__() 106 | if norm_layer is None: 107 | norm_layer = nn.BatchNorm2d 108 | width = int(planes * (base_width / 64.)) * groups 109 | # Both self.conv2 and self.downsample layers downsample the input when stride != 1 110 | self.conv1 = conv1x1(inplanes, width) 111 | self.bn1 = norm_layer(width) 112 | self.conv2 = conv3x3(width, width, stride, groups, dilation) 113 | self.bn2 = norm_layer(width) 114 | self.conv3 = conv1x1(width, planes * self.expansion) 115 | self.bn3 = norm_layer(planes * self.expansion) 116 | self.relu = nn.ReLU(inplace=True) 117 | self.downsample = downsample 118 | self.stride = stride 119 | 120 | def forward(self, x: Tensor) -> Tensor: 121 | identity = x 122 | 123 | out = self.conv1(x) 124 | out = self.bn1(out) 125 | out = self.relu(out) 126 | 127 | out = self.conv2(out) 128 | out = self.bn2(out) 129 | out = self.relu(out) 130 | 131 | out = self.conv3(out) 132 | out = self.bn3(out) 133 | 134 | if self.downsample is not None: 135 | identity = self.downsample(x) 136 | 137 | out += identity 138 | out = self.relu(out) 139 | 140 | return out 141 | 142 | 143 | class ResNetish(nn.Module): 144 | 145 | def __init__( 146 | self, 147 | block: Type[Union[BasicBlock, Bottleneck]], 148 | layers: List[int], 149 | num_classes: int = 1000, 150 | zero_init_residual: bool = False, 151 | groups: int = 1, 152 | width_per_group: int = 64, 153 | replace_stride_with_dilation: Optional[List[bool]] = None, 154 | norm_layer: Optional[Callable[..., nn.Module]] = None 155 | ) -> None: 156 | super(ResNetish, self).__init__() 157 | if norm_layer is None: 158 | norm_layer = nn.BatchNorm2d 159 | self._norm_layer = norm_layer 160 | 161 | self.inplanes = 64 162 | self.dilation = 1 163 | if replace_stride_with_dilation is None: 164 | # each element in the tuple indicates if we should replace 165 | # the 2x2 stride with a dilated convolution instead 166 | replace_stride_with_dilation = [False, False, False] 167 | if len(replace_stride_with_dilation) != 3: 168 | raise ValueError("replace_stride_with_dilation should be None " 169 | "or a 3-element tuple, got {}".format(replace_stride_with_dilation)) 170 | self.groups = groups 171 | self.base_width = width_per_group 172 | self.conv1 = nn.Conv2d(1, self.inplanes, kernel_size=7, stride=1, padding=3, # Audio input 3 -> 1, stride 2 -> 1 173 | bias=False) 174 | self.bn1 = norm_layer(self.inplanes) 175 | self.relu = nn.ReLU(inplace=True) 176 | self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1) 177 | self.layer1 = self._make_layer(block, 64, layers[0]) 178 | self.layer2 = self._make_layer(block, 128, layers[1], stride=2, 179 | dilate=replace_stride_with_dilation[0]) 180 | self.layer3 = self._make_layer(block, 256, layers[2], stride=2, 181 | dilate=replace_stride_with_dilation[1]) 182 | self.layer4 = self._make_layer(block, 512, layers[3], stride=2, 183 | dilate=replace_stride_with_dilation[2]) 184 | self.avgpool = nn.AdaptiveAvgPool2d((4, 6)) 185 | self.fc = nn.Linear(512 * 24 * 
block.expansion, num_classes) 186 | 187 | for m in self.modules(): 188 | if isinstance(m, nn.Conv2d): 189 | nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') 190 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)): 191 | nn.init.constant_(m.weight, 1) 192 | nn.init.constant_(m.bias, 0) 193 | 194 | # Zero-initialize the last BN in each residual branch, 195 | # so that the residual branch starts with zeros, and each residual block behaves like an identity. 196 | # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677 197 | if zero_init_residual: 198 | for m in self.modules(): 199 | if isinstance(m, Bottleneck): 200 | nn.init.constant_(m.bn3.weight, 0) # type: ignore[arg-type] 201 | elif isinstance(m, BasicBlock): 202 | nn.init.constant_(m.bn2.weight, 0) # type: ignore[arg-type] 203 | 204 | def _make_layer(self, block: Type[Union[BasicBlock, Bottleneck]], planes: int, blocks: int, 205 | stride: int = 1, dilate: bool = False) -> nn.Sequential: 206 | norm_layer = self._norm_layer 207 | downsample = None 208 | previous_dilation = self.dilation 209 | if dilate: 210 | self.dilation *= stride 211 | stride = 1 212 | if stride != 1 or self.inplanes != planes * block.expansion: 213 | downsample = nn.Sequential( 214 | conv1x1(self.inplanes, planes * block.expansion, stride), 215 | norm_layer(planes * block.expansion), 216 | ) 217 | 218 | layers = [] 219 | layers.append(block(self.inplanes, planes, stride, downsample, self.groups, 220 | self.base_width, previous_dilation, norm_layer)) 221 | self.inplanes = planes * block.expansion 222 | for _ in range(1, blocks): 223 | layers.append(block(self.inplanes, planes, groups=self.groups, 224 | base_width=self.base_width, dilation=self.dilation, 225 | norm_layer=norm_layer)) 226 | 227 | return nn.Sequential(*layers) 228 | 229 | def _forward_impl(self, x: Tensor) -> Tensor: 230 | # See note [TorchScript super()] 231 | x = self.conv1(x) 232 | x = self.bn1(x) 233 | x = self.relu(x) 234 | x = self.maxpool(x) 235 | 236 | x = self.layer1(x) 237 | x = self.layer2(x) 238 | x = self.layer3(x) 239 | x = self.layer4(x) 240 | 241 | x = self.avgpool(x) 242 | x = torch.flatten(x, 1) 243 | x = self.fc(x) 244 | 245 | return x 246 | 247 | def forward(self, x: Tensor) -> Tensor: 248 | return self._forward_impl(x) 249 | 250 | 251 | def _resnet( 252 | arch: str, 253 | block: Type[Union[BasicBlock, Bottleneck]], 254 | layers: List[int], 255 | **kwargs: Any 256 | ) -> ResNetish: 257 | model = ResNetish(block, layers, **kwargs) 258 | return model 259 | 260 | 261 | def resnetish18(num_classes: int, **kwargs: Any) -> ResNetish: 262 | r"""ResNet-18 model from 263 | `"Deep Residual Learning for Image Recognition" `_. 264 | Args: 265 | pretrained (bool): If True, returns a model pre-trained on ImageNet 266 | progress (bool): If True, displays a progress bar of the download to stderr 267 | """ 268 | return _resnet('resnetish18', BasicBlock, [2, 2, 2, 2], num_classes=num_classes, 269 | **kwargs) 270 | 271 | 272 | def resnetish34(num_classes: int, **kwargs: Any) -> ResNetish: 273 | r"""ResNet-34 model from 274 | `"Deep Residual Learning for Image Recognition" `_. 
275 | Args: 276 | pretrained (bool): If True, returns a model pre-trained on ImageNet 277 | progress (bool): If True, displays a progress bar of the download to stderr 278 | """ 279 | return _resnet('resnetish34', BasicBlock, [3, 4, 6, 3], num_classes=num_classes, 280 | **kwargs) 281 | 282 | 283 | def resnetish50(num_classes: int, **kwargs: Any) -> ResNetish: 284 | r"""ResNet-50 model from 285 | `"Deep Residual Learning for Image Recognition" `_. 286 | Args: 287 | pretrained (bool): If True, returns a model pre-trained on ImageNet 288 | progress (bool): If True, displays a progress bar of the download to stderr 289 | """ 290 | return _resnet('resnetish50', Bottleneck, [3, 4, 6, 3], num_classes=num_classes, 291 | **kwargs) 292 | 293 | 294 | class VGGish(nn.Module): 295 | """Based on: 296 | https://github.com/harritaylor/torchvggish/blob/master/docs/_example_download_weights.ipynb 297 | """ 298 | 299 | def __init__(self, num_classes: int): # Added num_classes 300 | super(VGGish, self).__init__() 301 | self.features = nn.Sequential( 302 | nn.Conv2d(1, 64, 3, 1, 1), 303 | nn.ReLU(inplace=True), 304 | nn.MaxPool2d(2, 2), 305 | nn.Conv2d(64, 128, 3, 1, 1), 306 | nn.ReLU(inplace=True), 307 | nn.MaxPool2d(2, 2), 308 | nn.Conv2d(128, 256, 3, 1, 1), 309 | nn.ReLU(inplace=True), 310 | nn.Conv2d(256, 256, 3, 1, 1), 311 | nn.ReLU(inplace=True), 312 | nn.MaxPool2d(2, 2), 313 | nn.Conv2d(256, 512, 3, 1, 1), 314 | nn.ReLU(inplace=True), 315 | nn.Conv2d(512, 512, 3, 1, 1), 316 | nn.ReLU(inplace=True), 317 | nn.AdaptiveMaxPool2d((4, 6))) # Replaced: MaxPool2d(2,2) 318 | self.embeddings = nn.Sequential( 319 | nn.Linear(512*24, 4096), 320 | nn.ReLU(inplace=True), 321 | nn.Linear(4096, 4096), 322 | nn.ReLU(inplace=True), 323 | nn.Linear(4096, 128), 324 | nn.ReLU(inplace=True)) 325 | self.head = nn.Linear(128, num_classes) # Added 326 | 327 | def forward(self, x): 328 | x = self.features(x) 329 | x = x.view(x.size(0),-1) 330 | x = self.embeddings(x) 331 | x = self.head(x) # Added 332 | return x 333 | 334 | 335 | class AlexNet(nn.Module): 336 | """Based on https://github.com/pytorch/vision/blob/master/torchvision/models/alexnet.py 337 | """ 338 | 339 | def __init__(self, num_classes: int = 1000) -> None: 340 | super(AlexNet, self).__init__() 341 | self.features = nn.Sequential( 342 | nn.Conv2d(1, 64, kernel_size=11, stride=(1,2), padding=2), # Replaced 3-channel with 1, strid=4 with (1,2) 343 | nn.BatchNorm2d(64), # Added according to the paper. 344 | nn.ReLU(inplace=True), 345 | nn.MaxPool2d(kernel_size=3, stride=2), 346 | nn.Conv2d(64, 192, kernel_size=5, padding=2), 347 | nn.BatchNorm2d(192), # Added according to the paper. 348 | nn.ReLU(inplace=True), 349 | nn.MaxPool2d(kernel_size=3, stride=2), 350 | nn.Conv2d(192, 384, kernel_size=3, padding=1), 351 | nn.BatchNorm2d(384), # Added according to the paper. 352 | nn.ReLU(inplace=True), 353 | nn.Conv2d(384, 256, kernel_size=3, padding=1), 354 | nn.BatchNorm2d(256), # Added according to the paper. 355 | nn.ReLU(inplace=True), 356 | nn.Conv2d(256, 256, kernel_size=3, padding=1), 357 | nn.BatchNorm2d(256), # Added according to the paper. 
358 | nn.ReLU(inplace=True), 359 | nn.MaxPool2d(kernel_size=3, stride=2), 360 | ) 361 | self.avgpool = nn.AdaptiveAvgPool2d((4, 6)) # Replaced: n.AdaptiveAvgPool2d((6, 6)) 362 | self.classifier = nn.Sequential( 363 | nn.Dropout(), 364 | nn.Linear(256 * 4 * 6, 4096), # Replaced: 256 * 6 * 6 365 | nn.ReLU(inplace=True), 366 | nn.Dropout(), 367 | nn.Linear(4096, 4096), 368 | nn.ReLU(inplace=True), 369 | nn.Linear(4096, num_classes), 370 | ) 371 | 372 | def forward(self, x: torch.Tensor) -> torch.Tensor: 373 | x = self.features(x) 374 | x = self.avgpool(x) 375 | x = torch.flatten(x, 1) 376 | x = self.classifier(x) 377 | return x 378 | -------------------------------------------------------------------------------- /src/multi_label_libs.py: -------------------------------------------------------------------------------- 1 | import pytorch_lightning as pl 2 | import datetime 3 | import logging 4 | import numpy as np 5 | import torch 6 | import torch.nn as nn 7 | import torch.nn.functional as F 8 | import multiprocessing 9 | from dlcliche.torch_utils import IntraBatchMixupBCE 10 | from dlcliche.utils import copy_file 11 | from .lwlrap import Lwlrap 12 | from skmultilearn.model_selection import IterativeStratification 13 | 14 | 15 | class SplitAllDataset(torch.utils.data.Dataset): 16 | def __init__(self, cfg, df, normalize=False): 17 | self.df = df 18 | self.normalize = normalize 19 | 20 | # Calculate length of clip this dataset will make 21 | self.L = cfg.unit_length 22 | 23 | # Get # of splits for all files 24 | self.n_splits = np.array([(np.load(f).shape[-1] + self.L - 1) // self.L for f in df.index.values]) 25 | self.sum_splits = np.cumsum(self.n_splits) 26 | 27 | def __len__(self): 28 | return self.sum_splits[-1] 29 | 30 | def file_index(self, index): 31 | return sum((index < self.sum_splits) == False) 32 | 33 | def filename(self, index): 34 | return self.df.index.values[self.file_index(index)] 35 | 36 | def split_index(self, index): 37 | fidx = self.file_index(index) 38 | prev_sum = self.sum_splits[fidx - 1] if fidx > 0 else 0 39 | return index - prev_sum 40 | 41 | def __getitem__(self, index): 42 | assert 0 <= index and index < len(self) 43 | 44 | log_mel_spec = np.load(self.filename(index)) 45 | start = self.split_index(index) * self.L 46 | log_mel_spec = log_mel_spec[..., start:start + self.L] 47 | 48 | # normalize - instance based 49 | if self.normalize: 50 | _m, _s = log_mel_spec.mean(), log_mel_spec.std() + np.finfo(np.float).eps 51 | log_mel_spec = (log_mel_spec - _m) / _s 52 | 53 | # Padding if sample is shorter than expected - both head & tail are filled with 0s 54 | pad_size = self.L - sample_length(log_mel_spec) 55 | if pad_size > 0: 56 | offset = pad_size // 2 57 | log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, 0), (offset, pad_size - offset)), 'constant') 58 | 59 | return log_mel_spec, self.file_index(index) 60 | 61 | 62 | def eval_all_splits(cfg, model, device, classes, df, normalize=False, debug_name=None, n=1, bs=64): 63 | model = model.to(device).eval() 64 | file_probas = [[] for _ in range(len(df))] 65 | test_dataset = SplitAllDataset(cfg, df, normalize=normalize) 66 | test_loader = torch.utils.data.DataLoader(test_dataset, num_workers=multiprocessing.cpu_count(), 67 | batch_size=bs, pin_memory=True) 68 | print(f'Predicting all {len(test_dataset)} splits for {len(df)} files...') 69 | for _ in range(n): 70 | with torch.no_grad(): 71 | for X, fileidxs in test_loader: 72 | preds = model(X.to(device)) 73 | probas = F.sigmoid(preds) 74 | for idx, proba in 
zip(fileidxs.cpu().numpy(), probas.cpu().numpy()): 75 | file_probas[idx].append(proba) 76 | file_probas = np.array([np.mean(probas, axis=0) for probas in file_probas]) 77 | lwlrap = Lwlrap(classes) 78 | lwlrap.accumulate(df.values, file_probas) 79 | return file_probas, lwlrap.overall_lwlrap(), lwlrap.per_class_lwlrap() 80 | 81 | 82 | def sample_length(log_mel_spec): 83 | return log_mel_spec.shape[-1] 84 | 85 | 86 | class MLClfDataset(torch.utils.data.Dataset): 87 | def __init__(self, cfg, df, transforms=None, normalize=False): 88 | self.df = df 89 | self.transforms = transforms 90 | self.normalize = normalize 91 | 92 | # Calculate length of clip this dataset will make 93 | self.cfg = cfg 94 | self.unit_length = cfg.unit_length 95 | self.hop = cfg.hop_length / cfg.sample_rate 96 | 97 | # Show basic info. 98 | print(f'Dataset will yield log-mel spectrogram {len(self)} data samples in shape [1, {cfg.n_mels}, {self.unit_length}]') 99 | 100 | def __len__(self): 101 | return len(self.df) 102 | 103 | def __getitem__(self, index): 104 | assert 0 <= index and index < len(self) 105 | row = self.df.iloc[index] 106 | filename = f'{row.name}' 107 | 108 | log_mel_spec = np.load(filename) 109 | 110 | # normalize - instance based 111 | if self.normalize: 112 | _m, _s = log_mel_spec.mean(), log_mel_spec.std() + np.finfo(np.float).eps 113 | log_mel_spec = (log_mel_spec - _m) / _s 114 | 115 | # Padding if sample is shorter than expected - both head & tail are filled with 0s 116 | pad_size = self.unit_length - sample_length(log_mel_spec) 117 | offset = 0 118 | if pad_size > 0: 119 | offset = np.random.randint(1, pad_size) if pad_size > 1 else 0 # (pad_size // 2) -- for making it center 120 | log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, 0), (offset, pad_size - offset)), 'constant') 121 | 122 | # Random crop 123 | crop_size = sample_length(log_mel_spec) - self.unit_length 124 | start = 0 125 | if crop_size > 0: 126 | start = np.random.randint(0, crop_size) 127 | log_mel_spec = log_mel_spec[..., start:start + self.unit_length] 128 | 129 | # Apply augmentations 130 | log_mel_spec = torch.Tensor(log_mel_spec) 131 | if self.transforms is not None: 132 | log_mel_spec = self.transforms(log_mel_spec) 133 | 134 | return log_mel_spec, row.values 135 | 136 | 137 | class MLClfLearner(pl.LightningModule): 138 | 139 | def __init__(self, model, dataloaders, classes, learning_rate=3e-4, mixup_alpha=0.2, weight=None): 140 | super().__init__() 141 | self.learning_rate = learning_rate 142 | self.model = model 143 | self.classes = classes 144 | self.train_loader, self.valid_loader, self.test_loader = dataloaders 145 | 146 | self.criterion = nn.BCEWithLogitsLoss(weight=weight) 147 | self.batch_mixer = IntraBatchMixupBCE(alpha=mixup_alpha) 148 | self.lwlrap = Lwlrap(classes) 149 | 150 | def forward(self, x): 151 | x = self.model(x) 152 | return x 153 | 154 | def step(self, x, y, train): 155 | mixed_inputs, mixed_labels = self.batch_mixer.transform(x, y, train=train) 156 | preds = self(mixed_inputs) 157 | #print(preds, mixed_labels.to(torch.float)) 158 | loss = self.criterion(preds, mixed_labels.to(torch.float)) 159 | return preds, loss 160 | 161 | def training_step(self, batch, batch_idx): 162 | x, y = batch 163 | preds, loss = self.step(x, y, train=True) 164 | return loss 165 | 166 | def on_validation_start(self, **kwargs): 167 | self.lwlrap = Lwlrap(self.classes) 168 | 169 | def validation_step(self, batch, batch_idx, split='val'): 170 | x, gt = batch 171 | preds, loss = self.step(x, gt, train=False) 172 | 
self.lwlrap.accumulate(gt.cpu().numpy(), F.sigmoid(preds).cpu().numpy()) 173 | 174 | self.log(f'{split}_loss', loss, prog_bar=True) 175 | #batch_lwlrap = lwlrap(gt.cpu().numpy(), preds.cpu().numpy()) 176 | #self.log(f'{split}_lwlrap', batch_lwlrap, prog_bar=True) 177 | if batch_idx >= len(self.valid_loader) - 1: 178 | self.log(f'val_lwlrap', self.lwlrap.overall_lwlrap(), prog_bar=False) 179 | logging.info(self.lwlrap) 180 | return loss 181 | 182 | def test_step(self, batch, batch_idx): 183 | return self.validation_step(batch, batch_idx, split='test') 184 | 185 | def configure_optimizers(self): 186 | optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate) 187 | return optimizer 188 | 189 | def train_dataloader(self): 190 | return self.train_loader 191 | 192 | def val_dataloader(self): 193 | return self.valid_loader 194 | 195 | def test_dataloader(self): 196 | return self.test_loader 197 | 198 | 199 | def ml_fold_spliter(train_df, random_state=42): 200 | fnames = train_df.index.values 201 | 202 | # multi label stratified train-test splitter 203 | splitter = IterativeStratification(n_splits=5, random_state=random_state) 204 | 205 | for train, test in splitter.split(train_df.index, train_df): 206 | yield train_df.iloc[train], train_df.iloc[test] 207 | -------------------------------------------------------------------------------- /work/.placeholder: -------------------------------------------------------------------------------- 1 | Working files will be in this folder. 2 | --------------------------------------------------------------------------------