├── .gitignore
├── Data-Preprocessing.ipynb
├── README.md
├── Run-All-on-Colab.ipynb
├── Training-Classifier.ipynb
├── advanced
│   ├── Multi-Label-FSDKaggle2019-on-Colab.ipynb
│   ├── Note-webdataset-smalldata.ipynb
│   ├── Perceiver_MelSpecAudio_Example_Colab.ipynb
│   ├── __init__.py
│   ├── create_wds_fsd50k.py
│   ├── create_wds_fsd50k_resample.py
│   ├── fat2018.py
│   ├── metric_fat2018.py
│   └── preprocess_fat2018.py
├── config-fat2018.yaml
├── config.yaml
├── for_evar
│   ├── README.md
│   ├── cnn14_decoupled.py
│   └── sampler.py
├── requirements.txt
├── src
│   ├── __init__.py
│   ├── augmentations.py
│   ├── libs.py
│   ├── lwlrap.py
│   ├── models.py
│   └── multi_label_libs.py
└── work
    └── .placeholder
/.gitignore:
--------------------------------------------------------------------------------
1 | work/*
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Sound Classifier Tutorial for PyTorch
2 |
3 | This is a sound classifier tutorial using PyTorch, PyTorch Lightning, and torchaudio.
4 |
5 | ## 0. Motivation
6 |
7 | I previously made a repository for a sound classifier solution, [Machine Learning Sound Classifier for Live Audio](https://github.com/daisukelab/ml-sound-classifier),
8 | which is based on my Keras solution for the Kaggle competition "[Freesound General-Purpose Audio Tagging Challenge](https://www.kaggle.com/c/freesound-audio-tagging)".
9 |
10 | Keras was popular when that repository was created, but many people today use PyTorch, and so do I.
11 |
12 | This repository is an updated example solution using PyTorch that shows how I would approach a new machine learning sound competition with current software assets.
13 |
14 | ## 1. Quickstart
15 |
16 | - `pip install -r requirements.txt` to install modules.
17 | - Run notebooks.
18 |
19 | ## 2. What you can find
20 |
21 | ### 2-1. What's included
22 |
23 | - An audio preprocessing example: [Data-Preprocessing.ipynb](Data-Preprocessing.ipynb)
24 | - A training example: [Training-Classifier.ipynb](Training-Classifier.ipynb)
25 | - An example of handling [FSDKaggle2018](https://zenodo.org/record/2552860#.X9TH6mT7RzU), a multi-class sound classification task.
26 | - New) ResNetish/VGGish [1] models.
27 | - Models are equipped with AdaptiveXXXPool2d so they are flexible with input size and now accept inputs of any shape (see the sketch after this list).
28 | - New) Colab all-in-one notebook [Run-All-on-Colab.ipynb](Run-All-on-Colab.ipynb). You can run the whole training/evaluation pipeline online.
29 |
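As a rough sketch of the adaptive-pooling point above (assuming the models in `src/models.py` take `(batch, 1, n_mels, time)` inputs, as the training code suggests), different input widths map to the same output shape:

```python
import torch
from src.models import resnetish18  # defined in src/models.py

model = resnetish18(41)  # FSDKaggle2018 has 41 classes

# Thanks to adaptive pooling, different time lengths yield the same output shape.
for frames in (100, 500):
    x = torch.randn(2, 1, 64, frames)  # (batch, channel, n_mels, time)
    print(model(x).shape)              # expected: torch.Size([2, 41])
```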
30 | ### 2-2. What's not
31 |
32 | - No usual practices/techniques: normalization, augmentation, regularization, etc. --> these will be followed up in advanced notebooks.
33 | - No cutting-edge networks like those you can find in [PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://github.com/qiuqiangkong/audioset_tagging_cnn).
34 |
35 | You can simply run the notebooks to reproduce the results, or try advanced techniques on top of the tutorials.
36 |
37 | ## 3. Notes on design choices
38 |
39 | ### 3-1. Input data format: raw audio or spectrogram?
40 |
41 | If we need to augment input data in the time domain, we feed raw audio to the dataset class.
42 |
43 | But in this example, all the data are converted to log-mel spectrograms in advance, which is a common choice.
44 |
45 | - Good: This makes data handling easy, especially in the training pipeline.
46 | - Bad: Applicable data augmentations are limited. Available transformations in torchaudio are [FrequencyMasking](https://pytorch.org/audio/stable/transforms.html#frequencymasking) and [TimeMasking](https://pytorch.org/audio/stable/transforms.html#timemasking).
47 |
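As a minimal sketch of those masking transforms (assuming a log-mel spectrogram tensor shaped `(1, n_mels, time)`), they can be applied like this:

```python
import torch
import torchaudio.transforms as AT

log_mel = torch.randn(1, 64, 500)  # placeholder log-mel spectrogram (1, n_mels, time)

masking = torch.nn.Sequential(
    AT.FrequencyMasking(freq_mask_param=8),  # mask up to 8 mel bins
    AT.TimeMasking(time_mask_param=50),      # mask up to 50 time frames
)
augmented = masking(log_mel)  # same shape as the input
```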
48 | ### 3-2. Input data size
49 |
50 | The number of frequency bins (n_mels) is set to 64 as a typical choice.
51 | Duration is set to ~~1 second, just as an example~~ 5 seconds in the current configuration, because 1 second was too short for the FSDKaggle2018 dataset.
52 |
53 | You can find and change these settings in [config.yaml](config.yaml).
54 |
55 | clip_length: 5.0 # [sec] -- it was 1.0 s at the initial release.
56 | n_mels: 64
57 |
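For reference, the clip length determines the time dimension of the model input. A minimal sketch (assuming the values from [config.yaml](config.yaml); the repository's own config loader may derive this differently):

```python
import yaml

with open('config.yaml') as f:
    cfg = yaml.safe_load(f)

# Number of spectrogram frames per clip: 5.0 s * 44100 Hz / 441 = 500 frames,
# so the model input is roughly (1, 64, 500).
unit_frames = int(cfg['clip_length'] * cfg['sample_rate'] / cfg['hop_length'])
print(unit_frames)
```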
58 | ### 3-3. FFT parameters
59 | 
60 | Typical parameters are configured in [config.yaml](config.yaml).
61 |
62 | sample_rate: 44100
63 | hop_length: 441
64 | n_fft: 1024
65 | n_mels: 64
66 | f_min: 0
67 | f_max: 22050
68 |
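These parameters map directly onto `torchaudio.transforms.MelSpectrogram`, roughly as used by the preprocessing code; a minimal sketch:

```python
import torchaudio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100, n_fft=1024, hop_length=441,
    n_mels=64, f_min=0, f_max=22050)

# waveform: (channels, samples) tensor, e.g. from torchaudio.load(...)
# log_mel = to_mel(waveform).log()
```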
69 | ## 4. Performances
70 |
71 | How do the models trained in these tutorials perform?
72 |
73 | - The best Kaggle result, MAP@3 of 0.942, was reported in the [Kaggle 4th place solution](https://www.kaggle.com/c/freesound-audio-tagging/discussion/62634). Note that this result is an ensemble of 5 models of the same SE-ResNeXt network trained on 5 folds.
74 | - The best result in this repo is MAP@3 of 0.87 (with ResNetish). This is a single-model result, without the use of data augmentation.
75 |
76 | The ResNetish result already comes close to the top solution, and there is still room for improvement from data augmentation and regularization techniques.
77 |
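For reference, MAP@3 averages the reciprocal rank of the ground-truth label within each clip's top-3 predictions (0 if it is absent). A minimal sketch, mirroring `advanced/metric_fat2018.py`:

```python
import numpy as np

def map_at_3(probas, labels):
    # probas: (n_clips, n_classes) predicted probabilities; labels: ground-truth class indices
    ap = 0.0
    for proba, label in zip(probas, labels):
        top3 = proba.argsort()[-3:][::-1]      # indices of the 3 highest scores
        hits = np.where(top3 == label)[0]
        ap += 1.0 / (hits[0] + 1.0) if len(hits) else 0.0
    return ap / len(labels)
```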
78 | ## References
79 |
80 | - [1] S. Hershey et al., "CNN Architectures for Large-Scale Audio Classification," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. Available: https://arxiv.org/abs/1609.09430, https://ai.google/research/pubs/pub45611
81 |
--------------------------------------------------------------------------------
/advanced/Note-webdataset-smalldata.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "361027b5",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "from dlcliche.notebook import *\n",
11 | "from dlcliche.torch_utils import *"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "id": "e8d4d37d",
17 | "metadata": {},
18 | "source": [
19 | "## Goal\n",
20 | "\n",
21 | "Check if webdataset is useful for downstream datasets which are typically small.\n",
22 | "\n",
23 | "### Preparing webdataset shards\n",
24 | "\n",
25 | "Used `create_wds_fsd50k.py` to make tar-shards encupslating local 16kHz FSD50K files.\n",
26 | "Resulted in making four tar files: `fsd50k-eval-16k-{000000..000003}.tar`.\n",
27 | "\n",
28 | "### Test result\n",
29 | "\n",
30 | "The result show that webdataset is not effective small data regime."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 24,
36 | "id": "0eab9f25",
37 | "metadata": {},
38 | "outputs": [
39 | {
40 | "name": "stdout",
41 | "output_type": "stream",
42 | "text": [
43 | "9.86 s ± 534 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
44 | ]
45 | }
46 | ],
47 | "source": [
48 | "%%timeit\n",
49 | "\n",
50 | "import webdataset as wds\n",
51 | "import io\n",
52 | "import librosa\n",
53 | "\n",
54 | "url = '/data/A/fsd50k/fsd50k-eval-16k-{000000..000003}.tar'\n",
55 | "ds = (\n",
56 | " wds.WebDataset(url)\n",
57 | " .shuffle(1000)\n",
58 | " .to_tuple('wav', 'labels')\n",
59 | ")\n",
60 | "for i, (wav, labels) in enumerate(ds):\n",
61 | " wav = librosa.load(io.BytesIO(wav))\n",
62 | " labels = labels.decode()\n",
63 | " if i > 100:\n",
64 | " break"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 25,
70 | "id": "b61de49f",
71 | "metadata": {},
72 | "outputs": [
73 | {
74 | "name": "stdout",
75 | "output_type": "stream",
76 | "text": [
77 | "9.06 s ± 8.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
78 | ]
79 | }
80 | ],
81 | "source": [
82 | "%%timeit\n",
83 | "\n",
84 | "import io\n",
85 | "import librosa\n",
86 | "\n",
87 | "def IterativeDataset(root, files, label_set):\n",
88 | " root = Path(root)\n",
89 | " for fname, labels in zip(files, label_set):\n",
90 | " data = librosa.load(root/fname)\n",
91 | " labels = labels\n",
92 | " yield data, labels\n",
93 | "\n",
94 | "df = pd.read_csv('/lab/AR2021/evar/metadata/fsd50k.csv')\n",
95 | "df = df[df.split == 'test']\n",
96 | "\n",
97 | "for i, (binary, labels) in enumerate(IterativeDataset('work/16k/fsd50k', df.file_name.values, df.label.values)):\n",
98 | " wav = binary\n",
99 | " labels = labels\n",
100 | " if i > 100:\n",
101 | " break\n",
102 | "#print(wav, labels)"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "id": "0185a362",
108 | "metadata": {},
109 | "source": [
110 | "## Note: create tar shard files by codes"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": 33,
116 | "id": "22a05c53",
117 | "metadata": {},
118 | "outputs": [
119 | {
120 | "data": {
121 | "text/html": [
122 | "
\n",
123 | "\n",
136 | "
\n",
137 | " \n",
138 | " \n",
139 | " | \n",
140 | " fname | \n",
141 | " labels | \n",
142 | " mids | \n",
143 | " split | \n",
144 | " key | \n",
145 | "
\n",
146 | " \n",
147 | " \n",
148 | " \n",
149 | " 0 | \n",
150 | " FSD50K.dev_audio/64760.wav | \n",
151 | " Electric_guitar,Guitar,Plucked_string_instrume... | \n",
152 | " /m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf | \n",
153 | " train | \n",
154 | " train_64760 | \n",
155 | "
\n",
156 | " \n",
157 | " 1 | \n",
158 | " FSD50K.dev_audio/16399.wav | \n",
159 | " Electric_guitar,Guitar,Plucked_string_instrume... | \n",
160 | " /m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf | \n",
161 | " train | \n",
162 | " train_16399 | \n",
163 | "
\n",
164 | " \n",
165 | " 2 | \n",
166 | " FSD50K.dev_audio/16401.wav | \n",
167 | " Electric_guitar,Guitar,Plucked_string_instrume... | \n",
168 | " /m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf | \n",
169 | " train | \n",
170 | " train_16401 | \n",
171 | "
\n",
172 | " \n",
173 | "
\n",
174 | "
"
175 | ],
176 | "text/plain": [
177 | " fname \\\n",
178 | "0 FSD50K.dev_audio/64760.wav \n",
179 | "1 FSD50K.dev_audio/16399.wav \n",
180 | "2 FSD50K.dev_audio/16401.wav \n",
181 | "\n",
182 | " labels \\\n",
183 | "0 Electric_guitar,Guitar,Plucked_string_instrume... \n",
184 | "1 Electric_guitar,Guitar,Plucked_string_instrume... \n",
185 | "2 Electric_guitar,Guitar,Plucked_string_instrume... \n",
186 | "\n",
187 | " mids split key \n",
188 | "0 /m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf train train_64760 \n",
189 | "1 /m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf train train_16399 \n",
190 | "2 /m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf train train_16401 "
191 | ]
192 | },
193 | "execution_count": 33,
194 | "metadata": {},
195 | "output_type": "execute_result"
196 | }
197 | ],
198 | "source": [
199 | "def fsd50k_metadata(FSD50K_root):\n",
200 | " FSD = Path(FSD50K_root)\n",
201 | " df = pd.read_csv(FSD/f'FSD50K.ground_truth/dev.csv')\n",
202 | " df['key'] = df.split + '_' + df.fname.apply(lambda s: str(s))\n",
203 | " df['fname'] = df.fname.apply(lambda s: f'FSD50K.dev_audio/{s}.wav')\n",
204 | " dftest = pd.read_csv(FSD/f'FSD50K.ground_truth/eval.csv')\n",
205 | " dftest['key'] = 'eval_' + dftest.fname.apply(lambda s: str(s))\n",
206 | " dftest['split'] = 'eval'\n",
207 | " dftest['fname'] = dftest.fname.apply(lambda s: f'FSD50K.eval_audio/{s}.wav')\n",
208 | " df = pd.concat([df, dftest], ignore_index=True)\n",
209 | " return df\n",
210 | "\n",
211 | "\n",
212 | "df = fsd50k_metadata(FSD50K_root='/data/A/fsd50k/')\n",
213 | "df[:3]"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": 56,
219 | "id": "8970bd0b",
220 | "metadata": {},
221 | "outputs": [
222 | {
223 | "name": "stdout",
224 | "output_type": "stream",
225 | "text": [
226 | "Processing 36796 train samples.\n",
227 | "/data/A/fsd50k/FSD50K.dev_audio/64760.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64760\n"
228 | ]
229 | },
230 | {
231 | "data": {
232 | "text/plain": [
233 | "{'__key__': 'train_64760',\n",
234 | " 'npy': array([-0.00026427, -0.00128246, 0.00068087, ..., -0.00253225,\n",
235 | " -0.00244647, 0. ], dtype=float32),\n",
236 | " 'labels': 'Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music'}"
237 | ]
238 | },
239 | "execution_count": 56,
240 | "metadata": {},
241 | "output_type": "execute_result"
242 | }
243 | ],
244 | "source": [
245 | "import librosa\n",
246 | "\n",
247 | "\n",
248 | "def load_resampled_mono_wav(fpath, sr):\n",
249 | " y, org_sr = librosa.load('/data/A/fsd50k/FSD50K.dev_audio/382455.wav', sr=None, mono=True)\n",
250 | " if org_sr != sr:\n",
251 | " y = librosa.resample(y, orig_sr=org_sr, target_sr=sr)\n",
252 | " return y\n",
253 | "\n",
254 | "\n",
255 | "def fsd50k_generator(root, split, sr):\n",
256 | " root = Path(root)\n",
257 | " df = fsd50k_metadata(FSD50K_root=root)\n",
258 | " df = df[df.split == split]\n",
259 | " print(f'Processing {len(df)} {split} samples.')\n",
260 | " for file_name, labels, key in df[['fname', 'labels', 'key']].values:\n",
261 | " fpath = root/file_name\n",
262 | " print(fpath, labels, key)\n",
263 | "\n",
264 | " sample = {\n",
265 | " '__key__': key,\n",
266 | " 'npy': load_resampled_mono_wav(fpath, sr),\n",
267 | " 'labels': labels,\n",
268 | " }\n",
269 | " yield sample\n",
270 | "\n",
271 | "gen = fsd50k_generator('/data/A/fsd50k/', 'train', 16000)\n",
272 | "next(iter(gen))"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 57,
278 | "id": "a7fb2d1c",
279 | "metadata": {},
280 | "outputs": [
281 | {
282 | "name": "stdout",
283 | "output_type": "stream",
284 | "text": [
285 | "# writing /data/A/fsd50k/train-000000.tar 0 0.0 GB 0\n",
286 | "Processing 36796 train samples.\n",
287 | "/data/A/fsd50k/FSD50K.dev_audio/64760.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64760\n",
288 | "/data/A/fsd50k/FSD50K.dev_audio/16399.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_16399\n",
289 | "/data/A/fsd50k/FSD50K.dev_audio/16401.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_16401\n",
290 | "/data/A/fsd50k/FSD50K.dev_audio/16402.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_16402\n",
291 | "/data/A/fsd50k/FSD50K.dev_audio/16404.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_16404\n",
292 | "/data/A/fsd50k/FSD50K.dev_audio/64761.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64761\n",
293 | "/data/A/fsd50k/FSD50K.dev_audio/268259.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_268259\n",
294 | "/data/A/fsd50k/FSD50K.dev_audio/64762.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64762\n",
295 | "/data/A/fsd50k/FSD50K.dev_audio/40515.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40515\n",
296 | "/data/A/fsd50k/FSD50K.dev_audio/40516.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40516\n",
297 | "/data/A/fsd50k/FSD50K.dev_audio/40517.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40517\n",
298 | "/data/A/fsd50k/FSD50K.dev_audio/64741.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64741\n",
299 | "/data/A/fsd50k/FSD50K.dev_audio/40523.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40523\n",
300 | "/data/A/fsd50k/FSD50K.dev_audio/64743.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64743\n",
301 | "/data/A/fsd50k/FSD50K.dev_audio/64744.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64744\n",
302 | "/data/A/fsd50k/FSD50K.dev_audio/40525.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_40525\n",
303 | "/data/A/fsd50k/FSD50K.dev_audio/64746.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64746\n",
304 | "/data/A/fsd50k/FSD50K.dev_audio/5318.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5318\n",
305 | "/data/A/fsd50k/FSD50K.dev_audio/4258.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4258\n",
306 | "/data/A/fsd50k/FSD50K.dev_audio/4259.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4259\n",
307 | "/data/A/fsd50k/FSD50K.dev_audio/4260.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4260\n",
308 | "/data/A/fsd50k/FSD50K.dev_audio/4261.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4261\n",
309 | "/data/A/fsd50k/FSD50K.dev_audio/4262.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4262\n",
310 | "/data/A/fsd50k/FSD50K.dev_audio/4263.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4263\n",
311 | "/data/A/fsd50k/FSD50K.dev_audio/4264.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4264\n",
312 | "/data/A/fsd50k/FSD50K.dev_audio/4265.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4265\n",
313 | "/data/A/fsd50k/FSD50K.dev_audio/4266.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4266\n",
314 | "/data/A/fsd50k/FSD50K.dev_audio/4267.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4267\n",
315 | "/data/A/fsd50k/FSD50K.dev_audio/4268.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4268\n",
316 | "/data/A/fsd50k/FSD50K.dev_audio/4269.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4269\n",
317 | "/data/A/fsd50k/FSD50K.dev_audio/4270.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4270\n",
318 | "/data/A/fsd50k/FSD50K.dev_audio/4272.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4272\n",
319 | "/data/A/fsd50k/FSD50K.dev_audio/64757.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64757\n",
320 | "/data/A/fsd50k/FSD50K.dev_audio/4276.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4276\n",
321 | "/data/A/fsd50k/FSD50K.dev_audio/4277.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4277\n",
322 | "/data/A/fsd50k/FSD50K.dev_audio/4278.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4278\n",
323 | "/data/A/fsd50k/FSD50K.dev_audio/4279.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4279\n",
324 | "/data/A/fsd50k/FSD50K.dev_audio/4280.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4280\n",
325 | "/data/A/fsd50k/FSD50K.dev_audio/4281.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4281\n",
326 | "/data/A/fsd50k/FSD50K.dev_audio/4283.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4283\n",
327 | "/data/A/fsd50k/FSD50K.dev_audio/4284.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4284\n",
328 | "/data/A/fsd50k/FSD50K.dev_audio/4285.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4285\n",
329 | "/data/A/fsd50k/FSD50K.dev_audio/4286.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4286\n",
330 | "/data/A/fsd50k/FSD50K.dev_audio/4287.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4287\n",
331 | "/data/A/fsd50k/FSD50K.dev_audio/4288.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4288\n",
332 | "/data/A/fsd50k/FSD50K.dev_audio/4289.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4289\n",
333 | "/data/A/fsd50k/FSD50K.dev_audio/5314.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5314\n",
334 | "/data/A/fsd50k/FSD50K.dev_audio/4290.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4290\n",
335 | "/data/A/fsd50k/FSD50K.dev_audio/4291.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_4291\n",
336 | "/data/A/fsd50k/FSD50K.dev_audio/5310.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5310\n",
337 | "/data/A/fsd50k/FSD50K.dev_audio/64703.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64703\n",
338 | "/data/A/fsd50k/FSD50K.dev_audio/5312.wav Electric_guitar,Bass_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5312\n",
339 | "/data/A/fsd50k/FSD50K.dev_audio/64704.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64704\n",
340 | "/data/A/fsd50k/FSD50K.dev_audio/64706.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64706\n",
341 | "/data/A/fsd50k/FSD50K.dev_audio/64707.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64707\n",
342 | "/data/A/fsd50k/FSD50K.dev_audio/64708.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64708\n",
343 | "/data/A/fsd50k/FSD50K.dev_audio/5315.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5315\n",
344 | "/data/A/fsd50k/FSD50K.dev_audio/5317.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_5317\n",
345 | "/data/A/fsd50k/FSD50K.dev_audio/64711.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64711\n",
346 | "/data/A/fsd50k/FSD50K.dev_audio/64712.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64712\n",
347 | "/data/A/fsd50k/FSD50K.dev_audio/64714.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64714\n",
348 | "/data/A/fsd50k/FSD50K.dev_audio/64715.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64715\n",
349 | "/data/A/fsd50k/FSD50K.dev_audio/64717.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64717\n",
350 | "/data/A/fsd50k/FSD50K.dev_audio/64718.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64718\n",
351 | "/data/A/fsd50k/FSD50K.dev_audio/64720.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64720\n"
352 | ]
353 | },
354 | {
355 | "name": "stdout",
356 | "output_type": "stream",
357 | "text": [
358 | "/data/A/fsd50k/FSD50K.dev_audio/64721.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64721\n",
359 | "/data/A/fsd50k/FSD50K.dev_audio/64722.wav Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music train_64722\n"
360 | ]
361 | },
362 | {
363 | "ename": "KeyboardInterrupt",
364 | "evalue": "",
365 | "output_type": "error",
366 | "traceback": [
367 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
368 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
369 | "\u001b[0;32m/tmp/ipykernel_2207172/328821770.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mwds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mShardWriter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_count\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0msink\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 12\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0msample\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mislice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfsd50k_generator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource_dir\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msplit\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m100\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 13\u001b[0m \u001b[0msink\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msample\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
370 | "\u001b[0;32m/tmp/ipykernel_2207172/1443749655.py\u001b[0m in \u001b[0;36mfsd50k_generator\u001b[0;34m(root, split, sr)\u001b[0m\n\u001b[1;32m 20\u001b[0m sample = {\n\u001b[1;32m 21\u001b[0m \u001b[0;34m'__key__'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 22\u001b[0;31m \u001b[0;34m'npy'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mload_resampled_mono_wav\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfpath\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 23\u001b[0m \u001b[0;34m'labels'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mlabels\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 24\u001b[0m }\n",
371 | "\u001b[0;32m/tmp/ipykernel_2207172/1443749655.py\u001b[0m in \u001b[0;36mload_resampled_mono_wav\u001b[0;34m(fpath, sr)\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morg_sr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlibrosa\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'/data/A/fsd50k/FSD50K.dev_audio/382455.wav'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmono\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0morg_sr\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0msr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlibrosa\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morig_sr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morg_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtarget_sr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
372 | "\u001b[0;32m~/anaconda3/lib/python3.9/site-packages/librosa/core/audio.py\u001b[0m in \u001b[0;36mresample\u001b[0;34m(y, orig_sr, target_sr, res_type, fix, scale, **kwargs)\u001b[0m\n\u001b[1;32m 602\u001b[0m \u001b[0my_hat\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msoxr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mT\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morig_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtarget_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mquality\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mres_type\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mT\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 603\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 604\u001b[0;31m \u001b[0my_hat\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresampy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morig_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtarget_sr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfilter\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mres_type\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 605\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 606\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mfix\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
373 | "\u001b[0;32m~/anaconda3/lib/python3.9/site-packages/resampy/core.py\u001b[0m in \u001b[0;36mresample\u001b[0;34m(x, sr_orig, sr_new, axis, filter, **kwargs)\u001b[0m\n\u001b[1;32m 118\u001b[0m \u001b[0mx_2d\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mswapaxes\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 119\u001b[0m \u001b[0my_2d\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mswapaxes\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 120\u001b[0;31m \u001b[0mresample_f\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx_2d\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_2d\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msample_ratio\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minterp_win\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minterp_delta\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprecision\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 121\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 122\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
374 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
375 | ]
376 | }
377 | ],
378 | "source": [
379 | "import webdataset as wds\n",
380 | "from itertools import islice\n",
381 | "\n",
382 | "\n",
383 | "source_dir = '/data/A/fsd50k/'\n",
384 | "split = 'train'\n",
385 | "sr = 16000\n",
386 | "output_name = f'/data/A/fsd50k/{split}-%06d.tar'\n",
387 | "max_count = 10000\n",
388 | "\n",
389 | "with wds.ShardWriter(output_name, max_count) as sink:\n",
390 | " for sample in islice(fsd50k_generator(source_dir, split, sr), 0, 100):\n",
391 | " sink.write(sample)"
392 | ]
393 | },
394 | {
395 | "cell_type": "markdown",
396 | "id": "bfc6032c",
397 | "metadata": {},
398 | "source": [
399 | "## Note: creating dataset tar archives with command-line\n",
400 | "\n",
401 | "\n",
402 | "### Install go and tarp commands\n",
403 | "\n",
404 | "https://github.com/webdataset/tarp\n",
405 | "\n",
406 | "- `sudo apt install golang-go`\n",
407 | "- `go get -v github.com/tmbdev/tarp/tarp`\n",
408 | "\n",
409 | "### Create tar archive\n",
410 | "\n",
411 | "- `tar --sort=name -cf your_archive.tar your_folders`\n",
412 | "- `find your_folder - type f -print| sort | tar -cf your_archive.tar - T -'\n",
413 | "\n",
414 | "### Shuffle and split\n",
415 | "\n",
416 | "- `tar --sorted -cf - your_folders | tarp"
417 | ]
418 | }
419 | ],
420 | "metadata": {
421 | "kernelspec": {
422 | "display_name": "base",
423 | "language": "python",
424 | "name": "base"
425 | },
426 | "language_info": {
427 | "codemirror_mode": {
428 | "name": "ipython",
429 | "version": 3
430 | },
431 | "file_extension": ".py",
432 | "mimetype": "text/x-python",
433 | "name": "python",
434 | "nbconvert_exporter": "python",
435 | "pygments_lexer": "ipython3",
436 | "version": "3.9.7"
437 | }
438 | },
439 | "nbformat": 4,
440 | "nbformat_minor": 5
441 | }
442 |
--------------------------------------------------------------------------------
/advanced/__init__.py:
--------------------------------------------------------------------------------
1 | # advanced
2 |
3 |
--------------------------------------------------------------------------------
/advanced/create_wds_fsd50k.py:
--------------------------------------------------------------------------------
1 | """webdataset
2 | python create_wds_fsd50k.py work/16k/fsd50k /data/A/fsd50k eval 16k
3 | """
4 |
5 | import sys
6 | from multiprocessing import Pool
7 | from pathlib import Path
8 | import pandas as pd
9 | import webdataset as wds
10 | from itertools import islice
11 | import librosa
12 | import fire
13 |
14 |
15 | def fsd50k_metadata(FSD50K_root):
16 | FSD = Path(FSD50K_root)
17 | df = pd.read_csv(FSD/f'FSD50K.ground_truth/dev.csv')
18 | df['key'] = df.split + '_' + df.fname.apply(lambda s: str(s))
19 | df['fname'] = df.fname.apply(lambda s: f'FSD50K.dev_audio/{s}.wav')
20 | dftest = pd.read_csv(FSD/f'FSD50K.ground_truth/eval.csv')
21 | dftest['key'] = 'eval_' + dftest.fname.apply(lambda s: str(s))
22 | dftest['split'] = 'eval'
23 | dftest['fname'] = dftest.fname.apply(lambda s: f'FSD50K.eval_audio/{s}.wav')
24 | df = pd.concat([df, dftest], ignore_index=True)
25 | return df
26 |
27 |
28 | def load_resampled_mono_wav(fpath, sr):
29 | with open(fpath, 'rb') as f:
30 |         y = f.read()  # raw wav file bytes; resampling (below) is disabled in this variant
31 | # y, org_sr = librosa.load(fpath, sr=None, mono=True)
32 | # if org_sr != sr:
33 | # y = librosa.resample(y, orig_sr=org_sr, target_sr=sr)
34 | return y
35 |
36 |
37 | def _converter_worker(args):
38 | fpath, sr = args
39 | return load_resampled_mono_wav(fpath, sr)
40 |
41 |
42 | def fsd50k_generator(root, split, sr):
43 | root = Path(root)
44 | df = fsd50k_metadata(FSD50K_root=root)
45 | df = df[df.split == split]
46 | print(f'Processing {len(df)} {split} samples.')
47 | for file_name, labels, key in df[['fname', 'labels', 'key']].values:
48 | fpath = root/file_name
49 |
50 | sample = {
51 | '__key__': key,
52 | 'wav': fpath, # load_resampled_mono_wav(fpath, sr),
53 | 'labels': labels,
54 | }
55 | yield sample
56 |
57 |
58 | def create_wds(source, output, split, sr, name='fsd50k-[SPLIT]-[SR]-%06d.tar', maxsize=10**9):
59 | source = source
60 | name = name.replace('[SPLIT]', split).replace('[SR]', str(sr))
61 | output_name = str(Path(output)/name)
62 |
63 | gen = fsd50k_generator(source, split, sr)
64 | with wds.ShardWriter(output_name, maxsize=maxsize) as sink:
65 | while True:
66 | samples = list(islice(gen, 100))
67 | if len(samples) == 0:
68 | break
69 | # load and resample wav files
70 | with Pool() as p:
71 | args = [[s['wav'], sr] for s in samples]
72 | wavs = list(p.imap(_converter_worker, args))
73 | for s, wav in zip(samples, wavs):
74 | s['wav'] = wav
75 | sink.write(s)
76 | print('.', end='')
77 | sys.stdout.flush()
78 | print('Finished')
79 |
80 |
81 | if __name__ == '__main__':
82 | fire.Fire(create_wds)
83 |
--------------------------------------------------------------------------------
/advanced/create_wds_fsd50k_resample.py:
--------------------------------------------------------------------------------
1 | """webdataset
2 | """
3 |
4 | import sys
5 | from multiprocessing import Pool
6 | from pathlib import Path
7 | import pandas as pd
8 | import webdataset as wds
9 | from itertools import islice
10 | import librosa
11 | import fire
12 |
13 |
14 | def fsd50k_metadata(FSD50K_root):
15 | FSD = Path(FSD50K_root)
16 | df = pd.read_csv(FSD/f'FSD50K.ground_truth/dev.csv')
17 | df['key'] = df.split + '_' + df.fname.apply(lambda s: str(s))
18 | df['fname'] = df.fname.apply(lambda s: f'FSD50K.dev_audio/{s}.wav')
19 | dftest = pd.read_csv(FSD/f'FSD50K.ground_truth/eval.csv')
20 | dftest['key'] = 'eval_' + dftest.fname.apply(lambda s: str(s))
21 | dftest['split'] = 'eval'
22 | dftest['fname'] = dftest.fname.apply(lambda s: f'FSD50K.eval_audio/{s}.wav')
23 | df = pd.concat([df, dftest], ignore_index=True)
24 | return df
25 |
26 |
27 | def load_resampled_mono_wav(fpath, sr):
28 | y, org_sr = librosa.load(fpath, sr=None, mono=True)
29 | if org_sr != sr:
30 | y = librosa.resample(y, orig_sr=org_sr, target_sr=sr)
31 | return y
32 |
33 |
34 | def _converter_worker(args):
35 | fpath, sr = args
36 | return load_resampled_mono_wav(fpath, sr)
37 |
38 |
39 | def fsd50k_generator(root, split, sr):
40 | root = Path(root)
41 | df = fsd50k_metadata(FSD50K_root=root)
42 | df = df[df.split == split]
43 | print(f'Processing {len(df)} {split} samples.')
44 | for file_name, labels, key in df[['fname', 'labels', 'key']].values:
45 | fpath = root/file_name
46 |
47 | sample = {
48 | '__key__': key,
49 | 'npy': fpath, # load_resampled_mono_wav(fpath, sr),
50 | 'labels': labels,
51 | }
52 | yield sample
53 |
54 |
55 | def create_wds(source, output, split, sr, name='fsd50k-[SPLIT]-[SR]-%06d.tar', maxsize=10**9):
56 | source = source
57 | name = name.replace('[SPLIT]', split).replace('[SR]', str(sr))
58 | output_name = str(Path(output)/name)
59 |
60 | gen = fsd50k_generator(source, split, sr)
61 | with wds.ShardWriter(output_name, maxsize=maxsize) as sink:
62 | while True:
63 | samples = list(islice(gen, 100))
64 | if len(samples) == 0:
65 | break
66 | # load and resample wav files
67 | with Pool() as p:
68 | args = [[s['npy'], sr] for s in samples]
69 | npys = list(p.imap(_converter_worker, args))
70 | for s, npy in zip(samples, npys):
71 | s['npy'] = npy
72 | sink.write(s)
73 | print('.', end='')
74 | sys.stdout.flush()
75 | print('Finished')
76 |
77 |
78 | if __name__ == '__main__':
79 | fire.Fire(create_wds)
80 |
--------------------------------------------------------------------------------
/advanced/fat2018.py:
--------------------------------------------------------------------------------
1 | """Multi-fold Freesound Audio Tagging solution.
2 | """
3 |
4 | from src.libs import *
5 | import datetime
6 | from advanced.metric_fat2018 import eval_fat2018_all_splits, eval_fat2018_by_probas
7 | from src.models import resnetish18, VGGish, AlexNet
8 |
9 |
10 | def report_result(message):
11 | print(message)
12 | # you might want to report to slack or anything here
13 |
14 |
15 | def get_transforms(cfg):
16 | NF = cfg.n_mels
17 | NT = cfg.unit_length
18 | augs = []
19 | for a in cfg.aug.split('x'):
20 | if a == 'RC':
21 | augs.append(GenericRandomResizedCrop((NF, NT), scale=(0.8, 1.0), ratio=(NF/(NT*1.2), NF/(NT*0.8))))
22 | elif a == 'SA':
23 | augs.append(AT.FrequencyMasking(NF//10))
24 | augs.append(AT.TimeMasking(NT//10))
25 | else:
26 | if a:
27 | raise Exception(f'unknown: {a}')
28 | tfms = VT.Compose(augs)
29 | print(tfms)
30 | return tfms
31 |
32 |
33 | def get_model(cfg, num_classes):
34 | if cfg.model == 'AN':
35 | return AlexNet(num_classes)
36 | if cfg.model == 'R18':
37 | return resnetish18(num_classes)
38 | if cfg.model == 'VGG':
39 | return VGGish(num_classes)
40 | raise Exception(f'unknown: {cfg.model}')
41 |
42 |
43 | def read_metadata(cfg):
44 | # Make lists of filenames and labels from meta files
45 | filenames, labels = {}, {}
46 | for split, npy_folder, meta_filename in [['train', f'work/{cfg.type}/FSDKaggle2018.audio_train', 'train_post_competition.csv'],
47 | ['test', f'work/{cfg.type}/FSDKaggle2018.audio_test', 'test_post_competition_scoring_clips.csv']]:
48 | df = pd.read_csv(cfg.data_root/'FSDKaggle2018.meta'/meta_filename)
49 | filenames[split] = np.array([(npy_folder + '/' + fname.replace('.wav', '.npy')) for fname in df.fname.values])
50 | labels[split] = list(df.label.values)
51 |
52 | # Make a list of classes, converting labels into numbers
53 | classes = sorted(set(labels['train'] + labels['test']))
54 | for split in labels:
55 | labels[split] = np.array([classes.index(label) for label in labels[split]])
56 |
57 | return filenames, labels, classes
58 |
59 |
60 | def calc_stat(cfg, filenames, labels, classes, calc_stat=False, n_calc_stat=10000):
61 | print(labels)
62 |     class_weight = compute_class_weight('balanced', classes=np.arange(len(classes)), y=labels['train'])
63 | class_weight = torch.tensor(class_weight).to(torch.float)
64 |
65 | if calc_stat:
66 | all_train_lms = np.hstack([np.load(f)[0] for f in filenames['train'][:n_calc_stat]])
67 | train_mean_std = all_train_lms.mean(), all_train_lms.std()
68 | print(train_mean_std)
69 | else:
70 | train_mean_std = None
71 |
72 | return class_weight, train_mean_std
73 |
74 |
75 | def run(config_file='config-fat2018.yaml', epochs=None, finetune_epochs=None, mixup=None, aug=None, norm=False):
76 | print(config_file, epochs, mixup, aug)
77 | cfg = load_config(config_file)
78 | cfg.epochs = epochs or cfg.epochs
79 | cfg.finetune_epochs = finetune_epochs or cfg.finetune_epochs
80 | cfg.mixup = cfg.mixup if mixup is None else mixup
81 | cfg.aug = aug or cfg.aug or ''
82 | filenames, labels, classes = read_metadata(cfg)
83 | class_weight, train_mean_std = calc_stat(cfg, filenames, labels, classes, calc_stat=norm)
84 |
85 | name = datetime.datetime.now().strftime('%y%m%d%H%M')
86 | name = f'model{cfg.type}-{cfg.model}-{cfg.aug}-m{str(cfg.mixup)[2:]}{"-N" if norm else ""}-{name}'
87 |
88 | weight_folder = Path('work/' + name)
89 | weight_folder.mkdir(parents=True, exist_ok=True)
90 | results, all_file_probas = [], []
91 | print(f'Training {weight_folder}')
92 |
93 | skf = StratifiedKFold(n_splits=cfg.n_folds)
94 | for fold, (train_index, test_index) in enumerate(skf.split(filenames['train'], labels['train'])):
95 | print("TRAIN:", len(train_index), "TEST:", len(test_index))
96 | train_files, val_files = filenames['train'][train_index], filenames['train'][test_index]
97 | train_ys, val_ys = labels['train'][train_index], labels['train'][test_index]
98 |
99 | train_dataset = LMSClfDataset(cfg, train_files, train_ys, norm_mean_std=train_mean_std,
100 | transforms=get_transforms(cfg))
101 | valid_dataset = LMSClfDataset(cfg, val_files, val_ys, norm_mean_std=train_mean_std)
102 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=cfg.bs, shuffle=True, pin_memory=True,
103 | num_workers=multiprocessing.cpu_count())
104 | valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=cfg.bs, pin_memory=True,
105 | num_workers=multiprocessing.cpu_count())
106 |
107 | # main training
108 | model = get_model(cfg, len(classes))
109 | dataloaders = [train_loader, valid_loader, None]
110 | learner = LMSClfLearner(model, dataloaders, mixup_alpha=cfg.mixup, weight=class_weight)
111 | checkpoint = pl.callbacks.ModelCheckpoint(monitor='val_acc')
112 | trainer = pl.Trainer(gpus=1, max_epochs=cfg.epochs, callbacks=[checkpoint])
113 | trainer.fit(learner)
114 | # result for now
115 | learner.load_state_dict(torch.load(checkpoint.best_model_path)['state_dict'])
116 | (acc, MAP3), file_probas = eval_fat2018_all_splits(cfg, model, device, filenames['test'], labels['test'],
117 | norm_mean_std=train_mean_std, debug_name='test')
118 |
119 | # fine tuning
120 | learner = LMSClfLearner(model, dataloaders, mixup_alpha=0.0, learning_rate=1e-4, weight=class_weight)
121 | checkpoint = pl.callbacks.ModelCheckpoint(monitor='val_acc')
122 | trainer = pl.Trainer(gpus=1, max_epochs=cfg.finetune_epochs, callbacks=[checkpoint])
123 | trainer.fit(learner)
124 | # result for fine tuned model
125 | learner.load_state_dict(torch.load(checkpoint.best_model_path)['state_dict'])
126 | (acc, MAP3), file_probas = eval_fat2018_all_splits(cfg, model, device, filenames['test'], labels['test'],
127 | norm_mean_std=train_mean_std, debug_name='test')
128 | all_file_probas.append(file_probas)
129 | results.append(MAP3)
130 |
131 | fold_weight = weight_folder/f'{fold}-{Path(checkpoint.best_model_path).name}'
132 | copy_file(checkpoint.best_model_path, fold_weight)
133 | print(f'Saved fold#{fold} weight as {fold_weight}')
134 |
135 | mean_file_probas = np.array(all_file_probas).mean(axis=0)
136 | acc, MAP3 = eval_fat2018_by_probas(mean_file_probas, labels['test'], debug_name='test')
137 | np.save(weight_folder/'ens_probas.npy', mean_file_probas)
138 | report_text = f'{name},{epochs},{aug},{mixup},{norm},{MAP3},{np.mean(results)}'
139 | report_result(report_text)
140 |
141 |
142 | if __name__ == '__main__':
143 | fire.Fire(run)
144 |
145 |
146 |
--------------------------------------------------------------------------------
/advanced/metric_fat2018.py:
--------------------------------------------------------------------------------
1 | # Based on https://github.com/DCASE-REPO/dcase2018_baseline/blob/master/task2/evaluation.py
2 |
3 | from src.libs import *
4 | import datetime
5 | import numpy as np
6 | import torch
7 | import multiprocessing
8 |
9 |
10 | def one_ap(gt, topk):
11 | for i, p in enumerate(topk):
12 | if gt == p:
13 | return 1.0 / (i + 1.0)
14 | return 0.0
15 |
16 |
17 | def avg_precision(gts=None, topks=None):
18 | return np.array([one_ap(gt, topk) for gt, topk in zip(gts, topks)])
19 |
20 |
21 | def eval_fat2018_by_probas(probas, labels, debug_name=None, TOP_K=3):
22 | correct = ap = 0.0
23 | for proba, label in zip(probas, labels):
24 | topk = proba.argsort()[-TOP_K:][::-1]
25 | correct += int(topk[0] == label)
26 | ap += one_ap(label, topk)
27 | acc = correct / len(labels)
28 | mAP = ap / len(labels)
29 | if debug_name:
30 | print(f'{debug_name} acc = {acc:.4f}, MAP@{TOP_K} = {mAP}')
31 | return acc, mAP
32 |
33 |
34 | def eval_fat2018(model, device, dataloader, debug_name=None, TTA=1):
35 | model = model.to(device).eval()
36 | all_probas, labels = [], []
37 | with torch.no_grad():
38 | for _ in range(TTA):
39 | for X, gts in dataloader:
40 | preds = model(X.to(device))
41 | probas = preds.softmax(1)
42 | all_probas.extend(probas.cpu().numpy())
43 | labels.extend(gts.cpu().numpy())
44 | all_probas = np.array(all_probas)
45 | return eval_fat2018_by_probas(all_probas, labels, debug_name=debug_name), all_probas
46 |
47 |
48 | def eval_fat2018_all_splits(cfg, model, device, filenames, labels, norm_mean_std=None, debug_name=None, head_n=999, agg='mean'):
49 | model = model.to(device).eval()
50 | file_probas = [[] for _ in range(len(labels))]
51 | test_dataset = SplitAllDataset(cfg, filenames, norm_mean_std=norm_mean_std, head_n=head_n)
52 | test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=cfg.bs,
53 | num_workers=multiprocessing.cpu_count(), pin_memory=True)
54 | print(f'Predicting all {len(test_dataset)} splits for {len(labels)} files...')
55 | for X, fileidxs in test_loader:
56 | with torch.no_grad():
57 | preds = model(X.to(device))
58 | probas = F.softmax(preds, dim=1)
59 | for idx, prob in zip(fileidxs.cpu().numpy(), probas.cpu().numpy()):
60 | file_probas[idx].append(prob)
61 |
62 | if agg == 'max':
63 | file_probas = np.array([np.max(probas, axis=0) for probas in file_probas])
64 | elif agg == 'mean':
65 | file_probas = np.array([np.mean(probas, axis=0) for probas in file_probas])
66 | else:
67 |         raise Exception(f'unknown aggregation: {agg}')
68 |
69 | return eval_fat2018_by_probas(file_probas, labels, debug_name=debug_name), file_probas
70 |
--------------------------------------------------------------------------------
/advanced/preprocess_fat2018.py:
--------------------------------------------------------------------------------
1 | """Preprocess Freesound Audio Tagging 2018 competition data.
2 | """
3 |
4 | import warnings
5 | warnings.simplefilter('ignore')
6 |
7 | from src.libs import *
8 | from tqdm import tqdm
9 | import fire
10 |
11 | def convert(config='config.yaml'):
12 | cfg = load_config(config)
13 | print(cfg)
14 | DATA_ROOT = Path(cfg.data_root)
15 | DEST = Path('work')/cfg.type
16 |
17 | folders = ['FSDKaggle2018.audio_test', 'FSDKaggle2018.audio_train']
18 |
19 | to_mel_spectrogram = torchaudio.transforms.MelSpectrogram(
20 | sample_rate=cfg.sample_rate, n_fft=cfg.n_fft, n_mels=cfg.n_mels,
21 | hop_length=cfg.hop_length, f_min=cfg.f_min, f_max=cfg.f_max)
22 |
23 | for folder in folders:
24 | cur_folder = DATA_ROOT/folder
25 | filenames = sorted(cur_folder.glob('*.wav'))
26 | resampler = None
27 | for filename in tqdm(filenames):
28 | # Load waveform
29 | waveform, sr = torchaudio.load(filename)
30 | #assert sr == cfg.sample_rate
31 | if sr != cfg.sample_rate:
32 | if resampler is None:
33 | resampler = torchaudio.transforms.Resample(sr, cfg.sample_rate)
34 | print(f'CAUTION: RESAMPLING from {sr} Hz to {cfg.sample_rate} Hz.')
35 | waveform = resampler(waveform)
36 | # To log-mel spectrogram
37 | log_mel_spec = to_mel_spectrogram(waveform).log()
38 | # Write to work
39 | (DEST/folder).mkdir(parents=True, exist_ok=True)
40 | np.save(DEST/folder/filename.name.replace('.wav', '.npy'), log_mel_spec)
41 |
42 |
43 | if __name__ == '__main__':
44 |     fire.Fire(convert)
--------------------------------------------------------------------------------
/config-fat2018.yaml:
--------------------------------------------------------------------------------
1 | # type name of this configuration
2 | type: B
3 |
4 | # basic setting parameters
5 | clip_length: 5.0 # [sec]
6 |
7 | # preprocessing parameters
8 | sample_rate: 44100
9 | hop_length: 441
10 | n_fft: 2048
11 | n_mels: 64
12 | f_min: 0
13 | f_max: 22050
14 |
15 | # training parameters
16 | bs: 64 #128
17 | mixup: 0.4
18 | n_folds: 5
19 | epochs: 300
20 | finetune_epochs: 20
21 | aug: RCxSA # RC: random resized crop, SA: spec augment
22 | model: R18 # AN: AlexNet, R18: ResNetish18, VGG: VGGish
23 |
24 | # dataset configurations
25 | data_root: /data/A/2018fsd
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | # basic setting parameters
2 | clip_length: 5.0 # [sec]
3 |
4 | # preprocessing parameters
5 | sample_rate: 44100
6 | hop_length: 441
7 | n_fft: 1024
8 | n_mels: 64
9 | f_min: 0
10 | f_max: 22050
11 |
--------------------------------------------------------------------------------
/for_evar/README.md:
--------------------------------------------------------------------------------
1 | # For EVAR
2 |
3 | [EVAR](https://github.com/nttcslab/eval-audio-repr) is an evaluation package for audio representations.
4 | 
5 | This subfolder holds files open-sourced for use with EVAR.
6 | 
7 | ## Acknowledgement
8 | 
9 | We use/borrow the [PANNs](https://github.com/qiuqiangkong/audioset_tagging_cnn) implementation.
10 |
11 | - https://github.com/qiuqiangkong/audioset_tagging_cnn
12 |
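As a rough usage sketch (an assumption based on the code in `cnn14_decoupled.py`, not an official EVAR API), the decoupled CNN14 can be combined with the log-mel front end to produce clip embeddings:

```python
import torch
from cnn14_decoupled import AudioFeatureExtractor, Cnn14_Decoupled

frontend = AudioFeatureExtractor(sample_rate=16000)  # waveform -> log-mel (B, 1, T, n_mels)
model = Cnn14_Decoupled(n_mels=64)

wav = torch.randn(2, 16000 * 5)   # (batch, samples): 5 s of 16 kHz audio
lms = frontend(wav)
emb = model(lms)                  # (batch, 2048) embeddings
```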
--------------------------------------------------------------------------------
/for_evar/cnn14_decoupled.py:
--------------------------------------------------------------------------------
1 | """CNN14 network, decoupled from Spectrogram, LogmelFilterBank, SpecAugmentation, and classifier head.
2 |
3 | ## Reference
4 | - [1] https://arxiv.org/abs/1912.10211 "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition"
5 | - [2] https://github.com/qiuqiangkong/audioset_tagging_cnn
6 | """
7 |
8 | import torch
9 | from torch import nn
10 | import torch.nn.functional as F
11 | from torchlibrosa.stft import Spectrogram, LogmelFilterBank
12 |
13 |
14 | class AudioFeatureExtractor(nn.Module):
15 | def __init__(self, sample_rate=16000, n_fft=512, n_mels=64, hop_length=160, win_length=512, f_min=50, f_max=8000):
16 | super().__init__()
17 |
18 | # Spectrogram extractor
19 | self.spectrogram_extractor = Spectrogram(n_fft=n_fft, hop_length=hop_length,
20 | win_length=win_length, window='hann', center=True, pad_mode='reflect',
21 | freeze_parameters=True)
22 |
23 | # Logmel feature extractor
24 | self.logmel_extractor = LogmelFilterBank(sr=sample_rate, n_fft=win_length,
25 | n_mels=n_mels, fmin=f_min, fmax=f_max, ref=1.0, amin=1e-10, top_db=None,
26 | freeze_parameters=True)
27 |
28 | def forward(self, batch_audio):
29 | x = self.spectrogram_extractor(batch_audio) # (B, 1, T, F(freq_bins))
30 | x = self.logmel_extractor(x) # (B, 1, T, F(mel_bins))
31 | return x
32 |
33 |
34 | def initialize_layers(layer):
35 |     # initialize all children first.
36 | for l in layer.children():
37 | initialize_layers(l)
38 |
39 |     # only initialize Linear layers
40 | if type(layer) != nn.Linear:
41 | return
42 |
43 | # Thanks to https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/pytorch/models.py#L10
44 | nn.init.xavier_uniform_(layer.weight)
45 | if hasattr(layer, 'bias'):
46 | if layer.bias is not None:
47 | layer.bias.data.fill_(0.)
48 |
49 |
50 | def init_bn(bn):
51 | """Initialize a Batchnorm layer. """
52 | bn.bias.data.fill_(0.)
53 | bn.weight.data.fill_(1.)
54 |
55 |
56 | class ConvBlock(nn.Module):
57 | def __init__(self, in_channels, out_channels):
58 |
59 | super(ConvBlock, self).__init__()
60 |
61 | self.conv1 = nn.Conv2d(in_channels=in_channels,
62 | out_channels=out_channels,
63 | kernel_size=(3, 3), stride=(1, 1),
64 | padding=(1, 1), bias=False)
65 |
66 | self.conv2 = nn.Conv2d(in_channels=out_channels,
67 | out_channels=out_channels,
68 | kernel_size=(3, 3), stride=(1, 1),
69 | padding=(1, 1), bias=False)
70 |
71 | self.bn1 = nn.BatchNorm2d(out_channels)
72 | self.bn2 = nn.BatchNorm2d(out_channels)
73 |
74 | self.init_weight()
75 |
76 | def init_weight(self):
77 | initialize_layers(self.conv1)
78 | initialize_layers(self.conv2)
79 | init_bn(self.bn1)
80 | init_bn(self.bn2)
81 |
82 |
83 | def forward(self, input, pool_size=(2, 2), pool_type='avg'):
84 |
85 | x = input
86 | x = F.relu_(self.bn1(self.conv1(x)))
87 | x = F.relu_(self.bn2(self.conv2(x)))
88 | if pool_type == 'max':
89 | x = F.max_pool2d(x, kernel_size=pool_size)
90 | elif pool_type == 'avg':
91 | x = F.avg_pool2d(x, kernel_size=pool_size)
92 | elif pool_type == 'avg+max':
93 | x1 = F.avg_pool2d(x, kernel_size=pool_size)
94 | x2 = F.max_pool2d(x, kernel_size=pool_size)
95 | x = x1 + x2
96 | else:
97 | raise Exception('Incorrect argument!')
98 |
99 | return x
100 |
101 |
102 | class Cnn14_Decoupled(nn.Module):
103 | """CNN14 network, decoupled from Spectrogram, LogmelFilterBank, SpecAugmentation, and classifier head.
104 | Original implementation: https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/master/pytorch/models.py
105 | """
106 |
107 | def __init__(self, n_mels=64, d=2048):
108 |         assert d == 2048, 'This implementation accepts d=2048 only, for compatibility with the original Cnn14.'
109 | super().__init__()
110 |
111 | self.bn0 = nn.BatchNorm2d(n_mels)
112 |
113 | self.conv_block1 = ConvBlock(in_channels=1, out_channels=64)
114 | self.conv_block2 = ConvBlock(in_channels=64, out_channels=128)
115 | self.conv_block3 = ConvBlock(in_channels=128, out_channels=256)
116 | self.conv_block4 = ConvBlock(in_channels=256, out_channels=512)
117 | self.conv_block5 = ConvBlock(in_channels=512, out_channels=1024)
118 | self.conv_block6 = ConvBlock(in_channels=1024, out_channels=2048)
119 |
120 | self.fc1 = nn.Linear(2048, d, bias=True)
121 | #self.fc_audioset = nn.Linear(d, classes_num, bias=True)
122 |
123 | self.init_weight()
124 |
125 | def init_weight(self):
126 | init_bn(self.bn0)
127 | initialize_layers(self.fc1)
128 | #init_layer(self.fc_audioset)
129 |
130 | def encode(self, x, squash_freq=True):
131 | x = x.transpose(1, 3)
132 | x = self.bn0(x)
133 | x = x.transpose(1, 3)
134 |
135 | x = self.conv_block1(x, pool_size=(2, 2), pool_type='avg')
136 | x = F.dropout(x, p=0.2, training=self.training)
137 | x = self.conv_block2(x, pool_size=(2, 2), pool_type='avg')
138 | x = F.dropout(x, p=0.2, training=self.training)
139 | x = self.conv_block3(x, pool_size=(2, 2), pool_type='avg')
140 | x = F.dropout(x, p=0.2, training=self.training)
141 | x3 = x
142 | x = self.conv_block4(x, pool_size=(2, 2), pool_type='avg')
143 | x = F.dropout(x, p=0.2, training=self.training)
144 | x = self.conv_block5(x, pool_size=(2, 2), pool_type='avg')
145 | x = F.dropout(x, p=0.2, training=self.training)
146 | x = self.conv_block6(x, pool_size=(1, 1), pool_type='avg')
147 | x = F.dropout(x, p=0.2, training=self.training)
148 | if squash_freq:
149 | x = torch.mean(x, dim=3)
150 | return x
151 |
152 | def temporal_pooling(self, x):
153 | (x1, _) = torch.max(x, dim=2)
154 | x2 = torch.mean(x, dim=2)
155 | x = x1 + x2
156 | x = F.dropout(x, p=0.5, training=self.training)
157 | x = F.relu_(self.fc1(x))
158 | embedding = F.dropout(x, p=0.5, training=self.training)
159 | return embedding
160 |
161 | def forward(self, x):
162 | x = self.encode(x)
163 | embedding = self.temporal_pooling(x)
164 |
165 | return embedding
166 |
--------------------------------------------------------------------------------
/for_evar/sampler.py:
--------------------------------------------------------------------------------
1 | """Samplers.
2 |
3 | Mostly borrowed from:
4 | https://github.com/qiuqiangkong/audioset_tagging_cnn
5 | """
6 |
7 | import numpy as np
8 | import logging
9 |
10 |
11 | class BalancedRandomSampler():
12 | """
13 | This is a simple version of:
14 | https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/d2f4b8c18eab44737fcc0de1248ae21eb43f6aa4/utils/data_generator.py#L175
15 | """
16 | def __init__(self, dataset, batch_size, random_seed=42):
17 |
18 | self.dataset = dataset
19 | self.batch_size = batch_size
20 | self.random_state = np.random.RandomState(random_seed)
21 |
22 | self.samples_per_class = np.sum(self.dataset.labels.numpy(), axis=0)
23 | logging.info(f'samples per class: {self.samples_per_class.astype(np.int32)}')
24 |
25 | # Training indexes of all sound classes. E.g.:
26 | # [[0, 11, 12, ...], [3, 4, 15, 16, ...], [7, 8, ...], ...]
27 | self.indexes_per_class = []
28 | self.classes_num = len(self.dataset.classes)
29 |
30 | for k in range(self.classes_num):
31 | self.indexes_per_class.append(
32 | np.where(dataset.labels[:, k] != 0)[0])
33 |
34 | # Shuffle indexes
35 | for k in range(self.classes_num):
36 | self.random_state.shuffle(self.indexes_per_class[k])
37 |
38 | self.queue = []
39 | self.pointers_of_classes = [0] * self.classes_num
40 |
41 | def expand_queue(self, queue):
42 | classes_set = np.arange(self.classes_num).tolist()
43 | self.random_state.shuffle(classes_set)
44 | queue += classes_set
45 | return queue
46 |
47 | def __iter__(self):
48 | while True:
49 | batch_idxs = []
50 | for _ in range(self.batch_size):
51 | if len(self.queue) == 0:
52 | self.queue = self.expand_queue(self.queue)
53 |
54 | class_id = self.queue.pop(0)
55 | pointer = self.pointers_of_classes[class_id]
56 | self.pointers_of_classes[class_id] += 1
57 | batch_idxs.append(self.indexes_per_class[class_id][pointer])
58 |
59 | # When one epoch of a sound class finishes, shuffle its indexes and reset the pointer
60 | if self.pointers_of_classes[class_id] >= self.samples_per_class[class_id]:
61 | self.pointers_of_classes[class_id] = 0
62 | self.random_state.shuffle(self.indexes_per_class[class_id])
63 |
64 | yield batch_idxs
65 |
66 | def __len__(self):
67 | return (len(self.dataset) + self.batch_size - 1) // self.batch_size
68 |
69 |
70 | class InfiniteSampler(object):
71 | def __init__(self, dataset, batch_size, random_seed=42, shuffle=False):
72 | self.df = dataset.df
73 | self.batch_size = batch_size
74 | self.random_state = np.random.RandomState(random_seed)
75 | self.indexes = self.df.index.values.copy()
76 | self.shuffle = shuffle
77 | if self.shuffle:
78 | self.random_state.shuffle(self.indexes)
79 |
80 | def __iter__(self):
81 | pointer = 0
82 | while True:
83 | batch_idxs = []
84 | for _ in range(self.batch_size):
85 | batch_idxs.append(self.indexes[pointer])
86 | pointer += 1
87 | if pointer >= len(self.indexes):
88 | pointer = 0
89 | if self.shuffle:
90 | self.random_state.shuffle(self.indexes)
91 | yield batch_idxs
92 |
93 | def __len__(self):
94 | return (len(self.df) + self.batch_size - 1) // self.batch_size
95 |
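96 | # Usage sketch (illustrative): both samplers yield whole batches of indices forever, so
97 | # they plug into DataLoader as `batch_sampler`. Assumes `train_ds` exposes the `.labels`
98 | # one/multi-hot tensor and `.classes` list used above, and that torch is imported by the caller.
99 | #   sampler = BalancedRandomSampler(train_ds, batch_size=64)
100 | #   loader = torch.utils.data.DataLoader(train_ds, batch_sampler=sampler, num_workers=4)
101 | #   # Note: iteration never ends by itself; cap the number of steps per epoch in the trainer.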
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch>=1.7.0
2 | torchaudio>=0.7.0
3 | pytorch-lightning
4 | pyyaml
5 | easydict
6 | matplotlib
7 | numpy
8 | jupyter
9 | pandas
10 | scikit-learn
11 | fire
12 | dl-cliche
13 |
--------------------------------------------------------------------------------
/src/__init__.py:
--------------------------------------------------------------------------------
1 | # multi-label
2 |
--------------------------------------------------------------------------------
/src/augmentations.py:
--------------------------------------------------------------------------------
1 | # Borrowed from https://github.com/pytorch/vision/blob/master/torchvision/transforms/functional.py
2 | import torch
3 | import torch.nn.functional as F
4 | import math
5 |
6 |
7 | class GenericRandomResizedCrop():
8 | def __init__(self, size, scale=(0.08, 1.0), ratio=(3. / 4., 4. / 3.)):
9 | self.size = size
10 | self.scale = scale
11 | self.ratio = ratio
12 |
13 | @staticmethod
14 | def get_params(x, scale, ratio):
15 | width, height = x.shape[1:]
16 | area = height * width
17 |
18 | for _ in range(100):
19 | target_area = area * torch.empty(1).uniform_(scale[0], scale[1]).item()
20 | log_ratio = torch.log(torch.tensor(ratio))
21 | aspect_ratio = torch.exp(
22 | torch.empty(1).uniform_(log_ratio[0], log_ratio[1])
23 | ).item()
24 |
25 | w = int(round(math.sqrt(target_area * aspect_ratio)))
26 | h = int(round(math.sqrt(target_area / aspect_ratio)))
27 |
28 | if 0 < w <= width and 0 < h <= height:
29 | i = torch.randint(0, height - h + 1, size=(1,)).item()
30 | j = torch.randint(0, width - w + 1, size=(1,)).item()
31 | return i, j, h, w
32 |
33 | # Fallback to central crop
34 | in_ratio = float(width) / float(height)
35 | if in_ratio < min(ratio):
36 | w = width
37 | h = int(round(w / min(ratio)))
38 | elif in_ratio > max(ratio):
39 | h = height
40 | w = int(round(h * max(ratio)))
41 | else: # whole image
42 | w = width
43 | h = height
44 | i = (height - h) // 2
45 | j = (width - w) // 2
46 | return i, j, h, w
47 |
48 | def __call__(self, x):
49 | i, j, h, w = self.get_params(x, self.scale, self.ratio)
50 | x = x[:, j:j+w, i:i+h]
51 | return F.interpolate(x.unsqueeze(0), size=self.size, mode='bicubic', align_corners=True).squeeze(0)
52 |
53 | def __repr__(self):
54 | return f'{self.__class__.__name__}(size={self.size}, scale={self.scale}, ratio={self.ratio})'
55 |
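56 | # Usage sketch (illustrative): crops a random region of a [1, n_mels, time] log-mel
57 | # spectrogram and resizes it back to `size` with bicubic interpolation.
58 | #   aug = GenericRandomResizedCrop(size=(64, 400), scale=(0.5, 1.0))
59 | #   out = aug(torch.rand(1, 64, 400))   # -> torch.Size([1, 64, 400])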
--------------------------------------------------------------------------------
/src/libs.py:
--------------------------------------------------------------------------------
1 | import warnings
2 | warnings.simplefilter('ignore')
3 |
4 | # Essential PyTorch
5 | import torch
6 | import torchaudio
7 |
8 | # Other modules used in this notebook
9 | from pathlib import Path
10 | import matplotlib.pyplot as plt
11 | import pandas as pd
12 | import numpy as np
13 | from IPython.display import Audio
14 | import fire
15 | import yaml
16 | import multiprocessing
17 | from easydict import EasyDict
18 | from sklearn.model_selection import train_test_split, StratifiedKFold
19 | from sklearn.utils.class_weight import compute_class_weight
20 | import torch
21 | import torch.nn as nn
22 | import torch.nn.functional as F
23 | import pytorch_lightning as pl
24 | from pytorch_lightning.metrics.functional import accuracy
25 |
26 | import torchvision.transforms as VT
27 | import torchaudio.transforms as AT
28 |
29 | from dlcliche.torch_utils import IntraBatchMixup
30 | from dlcliche.utils import copy_file
31 |
32 | from src.augmentations import GenericRandomResizedCrop
33 |
34 |
35 | device = torch.device('cuda')
36 |
37 |
38 | def load_config(filename, debug=False):
39 | with open(filename) as conf:
40 | cfg = EasyDict(yaml.safe_load(conf))
41 | cfg.unit_length = int((cfg.clip_length * cfg.sample_rate + cfg.hop_length - 1) // cfg.hop_length)
42 | cfg.data_root = Path(cfg.data_root)
43 | if debug:
44 | print(cfg)
45 | return cfg
46 |
47 |
48 | def sample_length(log_mel_spec):
49 | return log_mel_spec.shape[-1]
50 |
51 |
52 | class LMSClfDataset(torch.utils.data.Dataset):
53 | def __init__(self, cfg, filenames, labels, transforms=None, norm_mean_std=None):
54 | assert len(filenames) == len(labels), f'Inconsistent length of filenames and labels.'
55 |
56 | self.filenames = filenames
57 | self.labels = labels
58 | self.transforms = transforms
59 | self.norm_mean_std = norm_mean_std
60 |
61 | # Calculate length of clip this dataset will make
62 | self.unit_length = cfg.unit_length
63 |
64 | # Test with first file
65 | assert self[0][0].shape[-1] == self.unit_length, f'Check your files, failed to load {filenames[0]}'
66 |
67 | # Show basic info.
68 | print(f'Dataset will yield log-mel spectrogram {len(self)} data samples in shape [1, {cfg.n_mels}, {self.unit_length}]')
69 |
70 | def __len__(self):
71 | return len(self.filenames)
72 |
73 | def __getitem__(self, index):
74 | assert 0 <= index and index < len(self)
75 |
76 | log_mel_spec = np.load(self.filenames[index])
77 |
78 | # Normalize
79 | if self.norm_mean_std is not None:
80 | log_mel_spec = (log_mel_spec - self.norm_mean_std[0]) / self.norm_mean_std[1]
81 |
82 | # Padding if sample is shorter than expected - both head & tail are filled with 0s
83 | pad_size = self.unit_length - sample_length(log_mel_spec)
84 | if pad_size > 0:
85 | offset = pad_size // 2
86 | log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, 0), (offset, pad_size - offset)), 'constant')
87 |
88 | # Random crop
89 | crop_size = sample_length(log_mel_spec) - self.unit_length
90 | if crop_size > 0:
91 | start = np.random.randint(0, crop_size)
92 | log_mel_spec = log_mel_spec[..., start:start + self.unit_length]
93 |
94 | # Apply augmentations
95 | log_mel_spec = torch.Tensor(log_mel_spec)
96 | if self.transforms is not None:
97 | log_mel_spec = self.transforms(log_mel_spec)
98 |
99 | return log_mel_spec, self.labels[index]
100 |
101 |
102 | class SplitAllDataset(torch.utils.data.Dataset):
103 | def __init__(self, cfg, filenames, norm_mean_std=None, head_n=99999):
104 | self.filenames = filenames
105 | self.norm_mean_std = norm_mean_std
106 |
107 | # Calculate length of clip this dataset will make
108 | self.L = cfg.unit_length
109 |
110 | # Get # of splits for all files
111 | self.n_splits = np.array([(np.load(f).shape[-1] + self.L - 1) // self.L for f in filenames])
112 | self.n_splits = np.clip(self.n_splits, 1, head_n)  # limit the number of splits per file.
113 | self.sum_splits = np.cumsum(self.n_splits)
114 |
115 | def __len__(self):
116 | return self.sum_splits[-1]
117 |
118 | def file_index(self, index):
119 | return np.sum(index >= self.sum_splits)
120 |
121 | def filename(self, index):
122 | return self.filenames[self.file_index(index)]
123 |
124 | def split_index(self, index):
125 | fidx = self.file_index(index)
126 | prev_sum = self.sum_splits[fidx - 1] if fidx > 0 else 0
127 | return index - prev_sum
128 |
129 | def __getitem__(self, index):
130 | assert 0 <= index and index < len(self)
131 |
132 | log_mel_spec = np.load(self.filename(index))
133 | start = self.split_index(index) * self.L
134 | log_mel_spec = log_mel_spec[..., start:start + self.L]
135 |
136 | # Normalize
137 | if self.norm_mean_std is not None:
138 | log_mel_spec = (log_mel_spec - self.norm_mean_std[0]) / self.norm_mean_std[1]
139 |
140 | # Padding if sample is shorter than expected - both head & tail are filled with 0s
141 | pad_size = self.L - sample_length(log_mel_spec)
142 | if pad_size > 0:
143 | offset = pad_size // 2
144 | log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, 0), (offset, pad_size - offset)), 'constant')
145 |
146 | return log_mel_spec, self.file_index(index)
147 |
148 |
149 | class LMSClfLearner(pl.LightningModule):
150 |
151 | def __init__(self, model, dataloaders, learning_rate=3e-4, mixup_alpha=0.0, weight=None):
152 | super().__init__()
153 | self.learning_rate = learning_rate
154 | self.model = model
155 | self.trn_dl, self.val_dl, self.test_dl = dataloaders
156 | self.criterion = nn.CrossEntropyLoss(weight=weight)
157 | self.batch_mixer = IntraBatchMixup(self.criterion, alpha=mixup_alpha) if mixup_alpha > 0.0 else None
158 |
159 | def forward(self, x):
160 | x = self.model(x)
161 | return x
162 |
163 | def step(self, x, y, train):
164 | if self.batch_mixer is None:
165 | preds = self(x)
166 | loss = self.criterion(preds, y)
167 | else:
168 | x, stacked_y = self.batch_mixer.transform(x, y, train=train)
169 | preds = self(x)
170 | loss = self.batch_mixer.criterion(preds, stacked_y)
171 | return preds, loss
172 |
173 | def training_step(self, batch, batch_idx):
174 | x, y = batch
175 | preds, loss = self.step(x, y, train=True)
176 | return loss
177 |
178 | def validation_step(self, batch, batch_idx, split='val'):
179 | x, y = batch
180 | preds, loss = self.step(x, y, train=False)
181 | yhat = torch.argmax(preds, dim=1)
182 | acc = accuracy(yhat, y)
183 |
184 | self.log(f'{split}_loss', loss, prog_bar=True)
185 | self.log(f'{split}_acc', acc, prog_bar=True)
186 | return loss
187 |
188 | def test_step(self, batch, batch_idx):
189 | return self.validation_step(batch, batch_idx, split='test')
190 |
191 | def configure_optimizers(self):
192 | optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)
193 | return optimizer
194 |
195 | def train_dataloader(self):
196 | return self.trn_dl
197 |
198 | def val_dataloader(self):
199 | return self.val_dl
200 |
201 | def test_dataloader(self):
202 | return self.test_dl
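203 |
204 | # Usage sketch (illustrative; the file lists, labels, model, and hyperparameters below are assumptions):
205 | #   cfg = load_config('config.yaml')
206 | #   train_ds = LMSClfDataset(cfg, train_files, train_labels)
207 | #   valid_ds = LMSClfDataset(cfg, valid_files, valid_labels)
208 | #   make_dl = lambda ds, shuffle: torch.utils.data.DataLoader(ds, batch_size=64, shuffle=shuffle)
209 | #   learner = LMSClfLearner(model, [make_dl(train_ds, True), make_dl(valid_ds, False), make_dl(valid_ds, False)])
210 | #   pl.Trainer(max_epochs=100).fit(learner)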
--------------------------------------------------------------------------------
/src/lwlrap.py:
--------------------------------------------------------------------------------
1 | # Borrowed from https://github.com/DCASE-REPO/dcase2019_task2_baseline/blob/master/evaluation.py
2 | import numpy as np
3 |
4 |
5 | class Lwlrap(object):
6 | """Computes label-weighted label-ranked average precision (lwlrap)."""
7 |
8 | def __init__(self, class_map):
9 | self.num_classes = 0
10 | self.total_num_samples = 0
11 | self._class_map = class_map
12 |
13 | def accumulate(self, batch_truth, batch_scores):
14 | """Accumulate a new batch of samples into the metric.
15 | Args:
16 | batch_truth: np.array of (num_samples, num_classes) giving boolean
17 | ground-truth of presence of that class in that sample for this batch.
18 | batch_scores: np.array of (num_samples, num_classes) giving the
19 | classifier-under-test's real-valued score for each class for each
20 | sample.
21 | """
22 | assert batch_scores.shape == batch_truth.shape
23 | num_samples, num_classes = batch_truth.shape
24 | if not self.num_classes:
25 | self.num_classes = num_classes
26 | self._per_class_cumulative_precision = np.zeros(self.num_classes)
27 | self._per_class_cumulative_count = np.zeros(self.num_classes,
28 | dtype=int)
29 | assert num_classes == self.num_classes
30 | for truth, scores in zip(batch_truth, batch_scores):
31 | pos_class_indices, precision_at_hits = (
32 | self._one_sample_positive_class_precisions(scores, truth))
33 | self._per_class_cumulative_precision[pos_class_indices] += (
34 | precision_at_hits)
35 | self._per_class_cumulative_count[pos_class_indices] += 1
36 | self.total_num_samples += num_samples
37 |
38 | def _one_sample_positive_class_precisions(self, scores, truth):
39 | """Calculate precisions for each true class for a single sample.
40 | Args:
41 | scores: np.array of (num_classes,) giving the individual classifier scores.
42 | truth: np.array of (num_classes,) bools indicating which classes are true.
43 | Returns:
44 | pos_class_indices: np.array of indices of the true classes for this sample.
45 | pos_class_precisions: np.array of precisions corresponding to each of those
46 | classes.
47 | """
48 | num_classes = scores.shape[0]
49 | pos_class_indices = np.flatnonzero(truth > 0)
50 | # Only calculate precisions if there are some true classes.
51 | if not len(pos_class_indices):
52 | return pos_class_indices, np.zeros(0)
53 | # Retrieval list of classes for this sample.
54 | retrieved_classes = np.argsort(scores)[::-1]
55 | # class_rankings[top_scoring_class_index] == 0 etc.
56 | class_rankings = np.zeros(num_classes, dtype=int)
57 | class_rankings[retrieved_classes] = range(num_classes)
58 | # Which of these is a true label?
59 | retrieved_class_true = np.zeros(num_classes, dtype=bool)
60 | retrieved_class_true[class_rankings[pos_class_indices]] = True
61 | # Num hits for every truncated retrieval list.
62 | retrieved_cumulative_hits = np.cumsum(retrieved_class_true)
63 | # Precision of retrieval list truncated at each hit, in order of pos_labels.
64 | precision_at_hits = (
65 | retrieved_cumulative_hits[class_rankings[pos_class_indices]] /
66 | (1 + class_rankings[pos_class_indices].astype(float)))
67 | return pos_class_indices, precision_at_hits
68 |
69 | def per_class_lwlrap(self):
70 | """Return a vector of the per-class lwlraps for the accumulated samples."""
71 | return (self._per_class_cumulative_precision /
72 | np.maximum(1, self._per_class_cumulative_count))
73 |
74 | def per_class_weight(self):
75 | """Return a normalized weight vector for the contributions of each class."""
76 | return (self._per_class_cumulative_count /
77 | float(np.sum(self._per_class_cumulative_count)))
78 |
79 | def overall_lwlrap(self):
80 | """Return the scalar overall lwlrap for cumulated samples."""
81 | return np.sum(self.per_class_lwlrap() * self.per_class_weight())
82 |
83 | def __str__(self):
84 | per_class_lwlrap = self.per_class_lwlrap()
85 | # List classes in descending order of lwlrap.
86 | s = (['Lwlrap(%s) = %.6f' % (name, lwlrap) for (lwlrap, name) in
87 | sorted([(per_class_lwlrap[i], self._class_map[i]) for i in range(self.num_classes)],
88 | reverse=True)])
89 | s.append('Overall lwlrap = %.6f' % (self.overall_lwlrap()))
90 | return '\n'.join(s)
91 |
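92 | # Usage sketch (illustrative): accumulate per-batch truth/score matrices, then read the metric.
93 | #   metric = Lwlrap(class_map=['class_a', 'class_b', 'class_c'])
94 | #   metric.accumulate(truth, scores)   # both np.array of shape (num_samples, num_classes)
95 | #   print(metric.overall_lwlrap())     # scalar lwlrap over everything accumulated so far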
--------------------------------------------------------------------------------
/src/models.py:
--------------------------------------------------------------------------------
1 | """Audio models based on VGGish [1] paper.
2 |
3 | ## About
4 |
5 | Based on following implementations:
6 |
7 | - https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py -- borrowed most of code from this torchvision implementation.
8 | - https://github.com/harritaylor/torchvggish
9 |
10 | ## Disclaimer
11 |
12 | Tried to follow the original paper description, but there could be difference from the real ResNetish/VGGish.
13 |
14 | ## References
15 |
16 | [1] S. Hershey et al., "CNN Architectures for Large-Scale Audio Classification," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. Available: https://arxiv.org/abs/1609.09430, https://ai.google/research/pubs/pub45611
17 | """
18 |
19 | import torch
20 | from torch import Tensor
21 | import torch.nn as nn
22 | from typing import Type, Any, Callable, Union, List, Optional
23 |
24 |
25 | def conv3x3(in_planes: int, out_planes: int, stride: int = 1, groups: int = 1, dilation: int = 1) -> nn.Conv2d:
26 | """3x3 convolution with padding"""
27 | return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
28 | padding=dilation, groups=groups, bias=False, dilation=dilation)
29 |
30 |
31 | def conv1x1(in_planes: int, out_planes: int, stride: int = 1) -> nn.Conv2d:
32 | """1x1 convolution"""
33 | return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)
34 |
35 |
36 | class BasicBlock(nn.Module):
37 | expansion: int = 1
38 |
39 | def __init__(
40 | self,
41 | inplanes: int,
42 | planes: int,
43 | stride: int = 1,
44 | downsample: Optional[nn.Module] = None,
45 | groups: int = 1,
46 | base_width: int = 64,
47 | dilation: int = 1,
48 | norm_layer: Optional[Callable[..., nn.Module]] = None
49 | ) -> None:
50 | super(BasicBlock, self).__init__()
51 | if norm_layer is None:
52 | norm_layer = nn.BatchNorm2d
53 | if groups != 1 or base_width != 64:
54 | raise ValueError('BasicBlock only supports groups=1 and base_width=64')
55 | if dilation > 1:
56 | raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
57 | # Both self.conv1 and self.downsample layers downsample the input when stride != 1
58 | self.conv1 = conv3x3(inplanes, planes, stride)
59 | self.bn1 = norm_layer(planes)
60 | self.relu = nn.ReLU(inplace=True)
61 | self.conv2 = conv3x3(planes, planes)
62 | self.bn2 = norm_layer(planes)
63 | self.downsample = downsample
64 | self.stride = stride
65 |
66 | def forward(self, x: Tensor) -> Tensor:
67 | identity = x
68 |
69 | out = self.conv1(x)
70 | out = self.bn1(out)
71 | out = self.relu(out)
72 |
73 | out = self.conv2(out)
74 | out = self.bn2(out)
75 |
76 | if self.downsample is not None:
77 | identity = self.downsample(x)
78 |
79 | out += identity
80 | out = self.relu(out)
81 |
82 | return out
83 |
84 |
85 | class Bottleneck(nn.Module):
86 | # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
87 | # while original implementation places the stride at the first 1x1 convolution(self.conv1)
88 | # according to "Deep residual learning for image recognition"https://arxiv.org/abs/1512.03385.
89 | # This variant is also known as ResNet V1.5 and improves accuracy according to
90 | # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.
91 |
92 | expansion: int = 4
93 |
94 | def __init__(
95 | self,
96 | inplanes: int,
97 | planes: int,
98 | stride: int = 1,
99 | downsample: Optional[nn.Module] = None,
100 | groups: int = 1,
101 | base_width: int = 64,
102 | dilation: int = 1,
103 | norm_layer: Optional[Callable[..., nn.Module]] = None
104 | ) -> None:
105 | super(Bottleneck, self).__init__()
106 | if norm_layer is None:
107 | norm_layer = nn.BatchNorm2d
108 | width = int(planes * (base_width / 64.)) * groups
109 | # Both self.conv2 and self.downsample layers downsample the input when stride != 1
110 | self.conv1 = conv1x1(inplanes, width)
111 | self.bn1 = norm_layer(width)
112 | self.conv2 = conv3x3(width, width, stride, groups, dilation)
113 | self.bn2 = norm_layer(width)
114 | self.conv3 = conv1x1(width, planes * self.expansion)
115 | self.bn3 = norm_layer(planes * self.expansion)
116 | self.relu = nn.ReLU(inplace=True)
117 | self.downsample = downsample
118 | self.stride = stride
119 |
120 | def forward(self, x: Tensor) -> Tensor:
121 | identity = x
122 |
123 | out = self.conv1(x)
124 | out = self.bn1(out)
125 | out = self.relu(out)
126 |
127 | out = self.conv2(out)
128 | out = self.bn2(out)
129 | out = self.relu(out)
130 |
131 | out = self.conv3(out)
132 | out = self.bn3(out)
133 |
134 | if self.downsample is not None:
135 | identity = self.downsample(x)
136 |
137 | out += identity
138 | out = self.relu(out)
139 |
140 | return out
141 |
142 |
143 | class ResNetish(nn.Module):
144 |
145 | def __init__(
146 | self,
147 | block: Type[Union[BasicBlock, Bottleneck]],
148 | layers: List[int],
149 | num_classes: int = 1000,
150 | zero_init_residual: bool = False,
151 | groups: int = 1,
152 | width_per_group: int = 64,
153 | replace_stride_with_dilation: Optional[List[bool]] = None,
154 | norm_layer: Optional[Callable[..., nn.Module]] = None
155 | ) -> None:
156 | super(ResNetish, self).__init__()
157 | if norm_layer is None:
158 | norm_layer = nn.BatchNorm2d
159 | self._norm_layer = norm_layer
160 |
161 | self.inplanes = 64
162 | self.dilation = 1
163 | if replace_stride_with_dilation is None:
164 | # each element in the tuple indicates if we should replace
165 | # the 2x2 stride with a dilated convolution instead
166 | replace_stride_with_dilation = [False, False, False]
167 | if len(replace_stride_with_dilation) != 3:
168 | raise ValueError("replace_stride_with_dilation should be None "
169 | "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
170 | self.groups = groups
171 | self.base_width = width_per_group
172 | self.conv1 = nn.Conv2d(1, self.inplanes, kernel_size=7, stride=1, padding=3, # Audio input 3 -> 1, stride 2 -> 1
173 | bias=False)
174 | self.bn1 = norm_layer(self.inplanes)
175 | self.relu = nn.ReLU(inplace=True)
176 | self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
177 | self.layer1 = self._make_layer(block, 64, layers[0])
178 | self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
179 | dilate=replace_stride_with_dilation[0])
180 | self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
181 | dilate=replace_stride_with_dilation[1])
182 | self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
183 | dilate=replace_stride_with_dilation[2])
184 | self.avgpool = nn.AdaptiveAvgPool2d((4, 6))
185 | self.fc = nn.Linear(512 * 24 * block.expansion, num_classes)
186 |
187 | for m in self.modules():
188 | if isinstance(m, nn.Conv2d):
189 | nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
190 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
191 | nn.init.constant_(m.weight, 1)
192 | nn.init.constant_(m.bias, 0)
193 |
194 | # Zero-initialize the last BN in each residual branch,
195 | # so that the residual branch starts with zeros, and each residual block behaves like an identity.
196 | # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
197 | if zero_init_residual:
198 | for m in self.modules():
199 | if isinstance(m, Bottleneck):
200 | nn.init.constant_(m.bn3.weight, 0) # type: ignore[arg-type]
201 | elif isinstance(m, BasicBlock):
202 | nn.init.constant_(m.bn2.weight, 0) # type: ignore[arg-type]
203 |
204 | def _make_layer(self, block: Type[Union[BasicBlock, Bottleneck]], planes: int, blocks: int,
205 | stride: int = 1, dilate: bool = False) -> nn.Sequential:
206 | norm_layer = self._norm_layer
207 | downsample = None
208 | previous_dilation = self.dilation
209 | if dilate:
210 | self.dilation *= stride
211 | stride = 1
212 | if stride != 1 or self.inplanes != planes * block.expansion:
213 | downsample = nn.Sequential(
214 | conv1x1(self.inplanes, planes * block.expansion, stride),
215 | norm_layer(planes * block.expansion),
216 | )
217 |
218 | layers = []
219 | layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
220 | self.base_width, previous_dilation, norm_layer))
221 | self.inplanes = planes * block.expansion
222 | for _ in range(1, blocks):
223 | layers.append(block(self.inplanes, planes, groups=self.groups,
224 | base_width=self.base_width, dilation=self.dilation,
225 | norm_layer=norm_layer))
226 |
227 | return nn.Sequential(*layers)
228 |
229 | def _forward_impl(self, x: Tensor) -> Tensor:
230 | # See note [TorchScript super()]
231 | x = self.conv1(x)
232 | x = self.bn1(x)
233 | x = self.relu(x)
234 | x = self.maxpool(x)
235 |
236 | x = self.layer1(x)
237 | x = self.layer2(x)
238 | x = self.layer3(x)
239 | x = self.layer4(x)
240 |
241 | x = self.avgpool(x)
242 | x = torch.flatten(x, 1)
243 | x = self.fc(x)
244 |
245 | return x
246 |
247 | def forward(self, x: Tensor) -> Tensor:
248 | return self._forward_impl(x)
249 |
250 |
251 | def _resnet(
252 | arch: str,
253 | block: Type[Union[BasicBlock, Bottleneck]],
254 | layers: List[int],
255 | **kwargs: Any
256 | ) -> ResNetish:
257 | model = ResNetish(block, layers, **kwargs)
258 | return model
259 |
260 |
261 | def resnetish18(num_classes: int, **kwargs: Any) -> ResNetish:
262 | r"""ResNet-18 model from
263 | `"Deep Residual Learning for Image Recognition" `_.
264 | Args:
265 | pretrained (bool): If True, returns a model pre-trained on ImageNet
266 | progress (bool): If True, displays a progress bar of the download to stderr
267 | """
268 | return _resnet('resnetish18', BasicBlock, [2, 2, 2, 2], num_classes=num_classes,
269 | **kwargs)
270 |
271 |
272 | def resnetish34(num_classes: int, **kwargs: Any) -> ResNetish:
273 | r"""ResNet-34 model from
274 | `"Deep Residual Learning for Image Recognition" `_.
275 | Args:
276 | pretrained (bool): If True, returns a model pre-trained on ImageNet
277 | progress (bool): If True, displays a progress bar of the download to stderr
278 | """
279 | return _resnet('resnetish34', BasicBlock, [3, 4, 6, 3], num_classes=num_classes,
280 | **kwargs)
281 |
282 |
283 | def resnetish50(num_classes: int, **kwargs: Any) -> ResNetish:
284 | r"""ResNet-50 model from
285 | `"Deep Residual Learning for Image Recognition" `_.
286 | Args:
287 | pretrained (bool): If True, returns a model pre-trained on ImageNet
288 | progress (bool): If True, displays a progress bar of the download to stderr
289 | """
290 | return _resnet('resnetish50', Bottleneck, [3, 4, 6, 3], num_classes=num_classes,
291 | **kwargs)
292 |
293 |
294 | class VGGish(nn.Module):
295 | """Based on:
296 | https://github.com/harritaylor/torchvggish/blob/master/docs/_example_download_weights.ipynb
297 | """
298 |
299 | def __init__(self, num_classes: int): # Added num_classes
300 | super(VGGish, self).__init__()
301 | self.features = nn.Sequential(
302 | nn.Conv2d(1, 64, 3, 1, 1),
303 | nn.ReLU(inplace=True),
304 | nn.MaxPool2d(2, 2),
305 | nn.Conv2d(64, 128, 3, 1, 1),
306 | nn.ReLU(inplace=True),
307 | nn.MaxPool2d(2, 2),
308 | nn.Conv2d(128, 256, 3, 1, 1),
309 | nn.ReLU(inplace=True),
310 | nn.Conv2d(256, 256, 3, 1, 1),
311 | nn.ReLU(inplace=True),
312 | nn.MaxPool2d(2, 2),
313 | nn.Conv2d(256, 512, 3, 1, 1),
314 | nn.ReLU(inplace=True),
315 | nn.Conv2d(512, 512, 3, 1, 1),
316 | nn.ReLU(inplace=True),
317 | nn.AdaptiveMaxPool2d((4, 6))) # Replaced: MaxPool2d(2,2)
318 | self.embeddings = nn.Sequential(
319 | nn.Linear(512*24, 4096),
320 | nn.ReLU(inplace=True),
321 | nn.Linear(4096, 4096),
322 | nn.ReLU(inplace=True),
323 | nn.Linear(4096, 128),
324 | nn.ReLU(inplace=True))
325 | self.head = nn.Linear(128, num_classes) # Added
326 |
327 | def forward(self, x):
328 | x = self.features(x)
329 | x = x.view(x.size(0),-1)
330 | x = self.embeddings(x)
331 | x = self.head(x) # Added
332 | return x
333 |
334 |
335 | class AlexNet(nn.Module):
336 | """Based on https://github.com/pytorch/vision/blob/master/torchvision/models/alexnet.py
337 | """
338 |
339 | def __init__(self, num_classes: int = 1000) -> None:
340 | super(AlexNet, self).__init__()
341 | self.features = nn.Sequential(
342 | nn.Conv2d(1, 64, kernel_size=11, stride=(1,2), padding=2), # Replaced 3-channel input with 1, stride=4 with (1,2)
343 | nn.BatchNorm2d(64), # Added according to the paper.
344 | nn.ReLU(inplace=True),
345 | nn.MaxPool2d(kernel_size=3, stride=2),
346 | nn.Conv2d(64, 192, kernel_size=5, padding=2),
347 | nn.BatchNorm2d(192), # Added according to the paper.
348 | nn.ReLU(inplace=True),
349 | nn.MaxPool2d(kernel_size=3, stride=2),
350 | nn.Conv2d(192, 384, kernel_size=3, padding=1),
351 | nn.BatchNorm2d(384), # Added according to the paper.
352 | nn.ReLU(inplace=True),
353 | nn.Conv2d(384, 256, kernel_size=3, padding=1),
354 | nn.BatchNorm2d(256), # Added according to the paper.
355 | nn.ReLU(inplace=True),
356 | nn.Conv2d(256, 256, kernel_size=3, padding=1),
357 | nn.BatchNorm2d(256), # Added according to the paper.
358 | nn.ReLU(inplace=True),
359 | nn.MaxPool2d(kernel_size=3, stride=2),
360 | )
361 | self.avgpool = nn.AdaptiveAvgPool2d((4, 6)) # Replaced: nn.AdaptiveAvgPool2d((6, 6))
362 | self.classifier = nn.Sequential(
363 | nn.Dropout(),
364 | nn.Linear(256 * 4 * 6, 4096), # Replaced: 256 * 6 * 6
365 | nn.ReLU(inplace=True),
366 | nn.Dropout(),
367 | nn.Linear(4096, 4096),
368 | nn.ReLU(inplace=True),
369 | nn.Linear(4096, num_classes),
370 | )
371 |
372 | def forward(self, x: torch.Tensor) -> torch.Tensor:
373 | x = self.features(x)
374 | x = self.avgpool(x)
375 | x = torch.flatten(x, 1)
376 | x = self.classifier(x)
377 | return x
378 |
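379 | # Usage sketch (illustrative): every model here takes a single-channel log-mel batch
380 | # [B, 1, n_mels, time] of any reasonable size, thanks to the adaptive pooling layers.
381 | #   model = resnetish34(num_classes=41)         # 41 is just an example class count
382 | #   logits = model(torch.rand(8, 1, 64, 400))   # -> torch.Size([8, 41])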
--------------------------------------------------------------------------------
/src/multi_label_libs.py:
--------------------------------------------------------------------------------
1 | import pytorch_lightning as pl
2 | import datetime
3 | import logging
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | import multiprocessing
9 | from dlcliche.torch_utils import IntraBatchMixupBCE
10 | from dlcliche.utils import copy_file
11 | from .lwlrap import Lwlrap
12 | from skmultilearn.model_selection import IterativeStratification
13 |
14 |
15 | class SplitAllDataset(torch.utils.data.Dataset):
16 | def __init__(self, cfg, df, normalize=False):
17 | self.df = df
18 | self.normalize = normalize
19 |
20 | # Calculate length of clip this dataset will make
21 | self.L = cfg.unit_length
22 |
23 | # Get # of splits for all files
24 | self.n_splits = np.array([(np.load(f).shape[-1] + self.L - 1) // self.L for f in df.index.values])
25 | self.sum_splits = np.cumsum(self.n_splits)
26 |
27 | def __len__(self):
28 | return self.sum_splits[-1]
29 |
30 | def file_index(self, index):
31 | return np.sum(index >= self.sum_splits)
32 |
33 | def filename(self, index):
34 | return self.df.index.values[self.file_index(index)]
35 |
36 | def split_index(self, index):
37 | fidx = self.file_index(index)
38 | prev_sum = self.sum_splits[fidx - 1] if fidx > 0 else 0
39 | return index - prev_sum
40 |
41 | def __getitem__(self, index):
42 | assert 0 <= index and index < len(self)
43 |
44 | log_mel_spec = np.load(self.filename(index))
45 | start = self.split_index(index) * self.L
46 | log_mel_spec = log_mel_spec[..., start:start + self.L]
47 |
48 | # normalize - instance based
49 | if self.normalize:
50 | _m, _s = log_mel_spec.mean(), log_mel_spec.std() + np.finfo(float).eps
51 | log_mel_spec = (log_mel_spec - _m) / _s
52 |
53 | # Padding if sample is shorter than expected - both head & tail are filled with 0s
54 | pad_size = self.L - sample_length(log_mel_spec)
55 | if pad_size > 0:
56 | offset = pad_size // 2
57 | log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, 0), (offset, pad_size - offset)), 'constant')
58 |
59 | return log_mel_spec, self.file_index(index)
60 |
61 |
62 | def eval_all_splits(cfg, model, device, classes, df, normalize=False, debug_name=None, n=1, bs=64):
63 | model = model.to(device).eval()
64 | file_probas = [[] for _ in range(len(df))]
65 | test_dataset = SplitAllDataset(cfg, df, normalize=normalize)
66 | test_loader = torch.utils.data.DataLoader(test_dataset, num_workers=multiprocessing.cpu_count(),
67 | batch_size=bs, pin_memory=True)
68 | print(f'Predicting all {len(test_dataset)} splits for {len(df)} files...')
69 | for _ in range(n):
70 | with torch.no_grad():
71 | for X, fileidxs in test_loader:
72 | preds = model(X.to(device))
73 | probas = torch.sigmoid(preds)
74 | for idx, proba in zip(fileidxs.cpu().numpy(), probas.cpu().numpy()):
75 | file_probas[idx].append(proba)
76 | file_probas = np.array([np.mean(probas, axis=0) for probas in file_probas])
77 | lwlrap = Lwlrap(classes)
78 | lwlrap.accumulate(df.values, file_probas)
79 | return file_probas, lwlrap.overall_lwlrap(), lwlrap.per_class_lwlrap()
80 |
81 |
82 | def sample_length(log_mel_spec):
83 | return log_mel_spec.shape[-1]
84 |
85 |
86 | class MLClfDataset(torch.utils.data.Dataset):
87 | def __init__(self, cfg, df, transforms=None, normalize=False):
88 | self.df = df
89 | self.transforms = transforms
90 | self.normalize = normalize
91 |
92 | # Calculate length of clip this dataset will make
93 | self.cfg = cfg
94 | self.unit_length = cfg.unit_length
95 | self.hop = cfg.hop_length / cfg.sample_rate
96 |
97 | # Show basic info.
98 | print(f'Dataset will yield log-mel spectrogram {len(self)} data samples in shape [1, {cfg.n_mels}, {self.unit_length}]')
99 |
100 | def __len__(self):
101 | return len(self.df)
102 |
103 | def __getitem__(self, index):
104 | assert 0 <= index and index < len(self)
105 | row = self.df.iloc[index]
106 | filename = f'{row.name}'
107 |
108 | log_mel_spec = np.load(filename)
109 |
110 | # normalize - instance based
111 | if self.normalize:
112 | _m, _s = log_mel_spec.mean(), log_mel_spec.std() + np.finfo(float).eps
113 | log_mel_spec = (log_mel_spec - _m) / _s
114 |
115 | # Padding if sample is shorter than expected - both head & tail are filled with 0s
116 | pad_size = self.unit_length - sample_length(log_mel_spec)
117 | offset = 0
118 | if pad_size > 0:
119 | offset = np.random.randint(1, pad_size) if pad_size > 1 else 0  # use (pad_size // 2) instead to center the padding
120 | log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, 0), (offset, pad_size - offset)), 'constant')
121 |
122 | # Random crop
123 | crop_size = sample_length(log_mel_spec) - self.unit_length
124 | start = 0
125 | if crop_size > 0:
126 | start = np.random.randint(0, crop_size)
127 | log_mel_spec = log_mel_spec[..., start:start + self.unit_length]
128 |
129 | # Apply augmentations
130 | log_mel_spec = torch.Tensor(log_mel_spec)
131 | if self.transforms is not None:
132 | log_mel_spec = self.transforms(log_mel_spec)
133 |
134 | return log_mel_spec, row.values
135 |
136 |
137 | class MLClfLearner(pl.LightningModule):
138 |
139 | def __init__(self, model, dataloaders, classes, learning_rate=3e-4, mixup_alpha=0.2, weight=None):
140 | super().__init__()
141 | self.learning_rate = learning_rate
142 | self.model = model
143 | self.classes = classes
144 | self.train_loader, self.valid_loader, self.test_loader = dataloaders
145 |
146 | self.criterion = nn.BCEWithLogitsLoss(weight=weight)
147 | self.batch_mixer = IntraBatchMixupBCE(alpha=mixup_alpha)
148 | self.lwlrap = Lwlrap(classes)
149 |
150 | def forward(self, x):
151 | x = self.model(x)
152 | return x
153 |
154 | def step(self, x, y, train):
155 | mixed_inputs, mixed_labels = self.batch_mixer.transform(x, y, train=train)
156 | preds = self(mixed_inputs)
157 | #print(preds, mixed_labels.to(torch.float))
158 | loss = self.criterion(preds, mixed_labels.to(torch.float))
159 | return preds, loss
160 |
161 | def training_step(self, batch, batch_idx):
162 | x, y = batch
163 | preds, loss = self.step(x, y, train=True)
164 | return loss
165 |
166 | def on_validation_start(self, **kwargs):
167 | self.lwlrap = Lwlrap(self.classes)
168 |
169 | def validation_step(self, batch, batch_idx, split='val'):
170 | x, gt = batch
171 | preds, loss = self.step(x, gt, train=False)
172 | self.lwlrap.accumulate(gt.cpu().numpy(), torch.sigmoid(preds).cpu().numpy())
173 |
174 | self.log(f'{split}_loss', loss, prog_bar=True)
175 | #batch_lwlrap = lwlrap(gt.cpu().numpy(), preds.cpu().numpy())
176 | #self.log(f'{split}_lwlrap', batch_lwlrap, prog_bar=True)
177 | if batch_idx >= len(self.valid_loader) - 1:
178 | self.log(f'val_lwlrap', self.lwlrap.overall_lwlrap(), prog_bar=False)
179 | logging.info(self.lwlrap)
180 | return loss
181 |
182 | def test_step(self, batch, batch_idx):
183 | return self.validation_step(batch, batch_idx, split='test')
184 |
185 | def configure_optimizers(self):
186 | optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)
187 | return optimizer
188 |
189 | def train_dataloader(self):
190 | return self.train_loader
191 |
192 | def val_dataloader(self):
193 | return self.valid_loader
194 |
195 | def test_dataloader(self):
196 | return self.test_loader
197 |
198 |
199 | def ml_fold_spliter(train_df, random_state=42):
200 | fnames = train_df.index.values
201 |
202 | # multi label stratified train-test splitter
203 | splitter = IterativeStratification(n_splits=5, random_state=random_state)
204 |
205 | for train, test in splitter.split(train_df.index, train_df):
206 | yield train_df.iloc[train], train_df.iloc[test]
207 |
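208 | # Usage sketch (illustrative; `train_df` is assumed to be a multi-hot label DataFrame
209 | # indexed by preprocessed .npy file paths, as the classes above expect):
210 | #   for trn_df, val_df in ml_fold_spliter(train_df):
211 | #       trn_ds = MLClfDataset(cfg, trn_df)
212 | #       ...  # build loaders, train an MLClfLearner, then evaluate:
213 | #   probas, overall, per_class = eval_all_splits(cfg, model, device, classes, val_df)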
--------------------------------------------------------------------------------
/work/.placeholder:
--------------------------------------------------------------------------------
1 | Working files will be in this folder.
2 |
--------------------------------------------------------------------------------