├── Procfile
├── app.py
├── requirements.txt
├── README.md
└── [TensorFlow]MaskedAutoEncoders__Demo.ipynb
/Procfile:
--------------------------------------------------------------------------------
1 | web: gunicorn app:app
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Jun 11 22:34:20 2020
4 |
5 | @author: Krish Naik
6 | """
7 |
8 | from __future__ import division, print_function
9 | # coding=utf-8
10 | import sys
11 | import os
12 | import glob
13 | import re
14 | import numpy as np
15 | # Keras
16 | import tensorflow as tf
17 | from tensorflow.keras.applications.imagenet_utils import preprocess_input, decode_predictions
18 | from tensorflow.keras.models import load_model
19 | from tensorflow.keras.preprocessing import image
20 |
21 | # Flask utils
22 | from flask import Flask, redirect, url_for, request, render_template
23 | from werkzeug.utils import secure_filename
24 | #from gevent.pywsgi import WSGIServer
25 |
26 | # Define a flask app
27 | app = Flask(__name__)
28 |
29 | # Model saved with Keras model.save()
30 | MODEL_PATH = './model'
31 |
32 | # Load your trained model
33 | model = tf.saved_model.load(MODEL_PATH)
34 |
35 | print("Model loaded from", MODEL_PATH)
36 |
37 |
38 | def model_predict(img_path, model):
39 |     # Load the uploaded image and resize it to the input shape used in training
40 |     img = image.load_img(img_path)
41 |     x = img.resize((128, 48))
42 |     # Preprocessing the image
43 |     x = image.img_to_array(x)
44 |
45 |     ## Scaling
46 |     # x = x / 255
47 |     x = np.expand_dims(x, axis=0)
48 |
49 |     # Run inference and report the most likely class index
50 |     pred = model(x)
51 |     pred = np.argmax(pred, axis=1)
52 |     preds = "The bird class is " + str(int(pred[0]))
53 |
54 |     return preds
55 |
56 |
57 | @app.route('/', methods=['GET'])
58 | def index():
59 | # Main page
60 | return render_template('index.html')
61 |
62 |
63 | @app.route('/predict', methods=['GET', 'POST'])
64 | def upload():
65 | if request.method == 'POST':
66 | # Get the file from post request
67 | f = request.files['file']
68 |
69 | # Save the file to ./uploads
70 | basepath = os.path.dirname(__file__)
71 | file_path = os.path.join(
72 | basepath, 'uploads', secure_filename(f.filename))
73 | f.save(file_path)
74 |
75 | # Make prediction
76 | preds = model_predict(file_path, model)
77 |         result = preds
78 |         return result
79 |     return render_template('index.html')
80 |
81 |
82 | if __name__ == '__main__':
83 | app.run(debug=True)
84 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | absl-py==0.9.0
2 | astunparse==1.6.3
3 | attrs==19.3.0
4 | backcall==0.1.0
5 | bleach==3.1.5
6 | cachetools==4.1.0
7 | certifi==2020.4.5.1
8 | chardet==3.0.4
9 | click==7.1.2
10 | colorama==0.4.3
11 | cycler==0.10.0
12 | decorator==4.4.2
13 | defusedxml==0.6.0
14 | entrypoints==0.3
15 | Flask==1.1.2
16 | Flask-Cors==3.0.8
17 | gast==0.3.3
18 | geojson==2.5.0
19 | google-auth==1.15.0
20 | google-auth-oauthlib==0.4.1
21 | google-pasta==0.2.0
22 | grpcio==1.29.0
23 | h5py==2.10.0
24 | idna==2.9
25 | importlib-metadata==1.6.0
26 | ipykernel==5.3.0
27 | ipython==7.14.0
28 | ipython-genutils==0.2.0
29 | ipywidgets==7.5.1
30 | itsdangerous==1.1.0
31 | jedi==0.17.0
32 | Jinja2==2.11.2
33 | joblib==0.15.1
34 | jsonify==0.5
35 | jsonschema==3.2.0
36 | jupyter==1.0.0
37 | jupyter-client==6.1.3
38 | jupyter-console==6.1.0
39 | jupyter-core==4.6.3
40 | Keras-Preprocessing==1.1.2
41 | kiwisolver==1.2.0
42 | lxml==4.5.1
43 | Markdown==3.2.2
44 | MarkupSafe==1.1.1
45 | matplotlib==3.2.1
46 | mistune==0.8.4
47 | nbconvert==5.6.1
48 | nbformat==5.0.6
49 | notebook==6.0.3
50 | numpy==1.18.4
51 | oauthlib==3.1.0
52 | opencv-python==4.2.0.34
53 | opt-einsum==3.2.1
54 | packaging==20.4
55 | pandas==1.0.3
56 | pandas-datareader==0.8.1
57 | pandocfilters==1.4.2
58 | parso==0.7.0
59 | pexpect==4.8.0
60 | pickleshare==0.7.5
61 | Pillow==7.1.2
62 | prometheus-client==0.7.1
63 | prompt-toolkit==3.0.5
64 | protobuf==3.8.0
65 | ptyprocess==0.6.0
66 | pyasn1==0.4.8
67 | pyasn1-modules==0.2.8
68 | Pygments==2.6.1
69 | pyparsing==2.4.7
70 | pyrsistent==0.16.0
71 | PySocks==1.7.1
72 | python-dateutil==2.8.1
73 | pytz==2020.1
74 | pywinpty==0.5.7
75 | pyzmq==19.0.1
76 | qtconsole==4.7.4
77 | QtPy==1.9.0
78 | requests==2.23.0
79 | requests-oauthlib==1.3.0
80 | rsa==4.0
81 | scikit-learn==0.23.1
82 | scipy==1.4.1
83 | seaborn==0.10.1
84 | Send2Trash==1.5.0
85 | six==1.15.0
86 | sklearn==0.0
87 | tensorboard==2.2.1
88 | tensorboard-plugin-wit==1.6.0.post3
89 | tensorflow==2.2.0
90 | tensorflow-estimator==2.2.0
91 | termcolor==1.1.0
92 | terminado==0.8.3
93 | testpath==0.4.4
94 | threadpoolctl==2.0.0
95 | tornado==6.0.4
96 | traitlets==4.3.3
97 | urllib3==1.25.9
98 | wcwidth==0.1.9
99 | webencodings==0.5.1
100 | Werkzeug==1.0.1
101 | widgetsnbextension==3.5.1
102 | wincertstore==0.2
103 | wrapt==1.12.1
104 | zipp==3.1.0
105 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Kaggle Notebook : https://www.kaggle.com/code/usharengaraju/tensorflow-maskedautoencoders
2 |
3 | Colab Notebook : https://colab.research.google.com/drive/1rbYMNOTjcDPAegxpyBgc0c4rQv2ob95R?usp=sharing
4 |
5 |
6 |
7 | ### Bioacoustic-Sound-Detector
8 |
9 | Intro to the project
10 | The “Visionary Perspective Plan (2020-2030) for the conservation of avian diversity, their ecosystems, habitats and landscapes in the country”, proposed by the Indian government to aid the conservation of birds and their habitats, inspired me to take up this project. The extinction of bird species is a growing global concern, as it has a huge impact on food chains. Bioacoustic monitoring can provide a passive, low-labor, and cost-effective strategy for studying endangered bird populations. Recent advances in machine learning have made it possible to automatically identify bird songs for common species with ample training data. This makes it easier for researchers and conservation practitioners to accurately survey population trends, and to evaluate threats and adjust their conservation actions regularly and more effectively.
11 | The deep learning prototype processes continuous audio data and acoustically recognizes the species.
12 | The goal of the project when I started was to build a basic prototype for monitoring rare bird species in India. In the future, I would like to expand the project to monitor other endangered species as well.
13 |
14 |
15 | Which Google ML products did you use?
16 |
17 | The Google ML products used are TensorFlow and Cloud TPUs.
18 | The data was converted to TFRecords, the entire code was written in TensorFlow, and the model was trained on Cloud TPUs.
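
As an illustration of that conversion step (not the exact training code), the sketch below writes one `tf.train.Example` per spectrogram chunk, using the same feature names (`filename`, `time`, `audio`, `label`) that the demo notebook later parses and the same `train.tfrec` file name; the `samples` list here is a single dummy record, standing in for the real preprocessed BirdCLEF audio.

```python
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_sample(filename, time_sec, encoded_spec, label):
    # One record per spectrogram chunk; 'audio' holds the encoded spectrogram image bytes
    feature = {
        'filename': _bytes_feature(filename.encode('utf-8')),
        'time': _int64_feature(time_sec),
        'audio': _bytes_feature(encoded_spec),
        'label': _int64_feature(label),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Dummy record for illustration: a blank 48x128 spectrogram "image" with an arbitrary label
samples = [('example.ogg', 0, tf.io.encode_png(tf.zeros((48, 128, 3), tf.uint8)).numpy(), 17)]

with tf.io.TFRecordWriter('train.tfrec') as writer:
    for filename, time_sec, encoded_spec, label in samples:
        writer.write(serialize_sample(filename, time_sec, encoded_spec, label))
```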
19 |
20 | How did you build the dataset?
21 |
22 | The BirdCLEF 2022 challenge dataset consists of short recordings of individual bird calls generously uploaded by users of xenocanto.org. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format.
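
The exact preprocessing script is not part of this repository, but a mel-spectrogram conversion consistent with the constants in the demo notebook (32 kHz sample rate, 500-12,500 Hz band, 48x128 spectrograms) might look roughly like the sketch below; `librosa` is an assumption here (it is not pinned in requirements.txt) and the file path is a placeholder.

```python
import cv2
import librosa
import numpy as np

SAMPLE_RATE = 32000
FMIN, FMAX = 500, 12500
SPEC_SHAPE = (48, 128)  # (mel bins, time frames)

def audio_to_spec(path, offset=0.0, duration=5.0):
    # Load a short chunk of the ogg file at the competition sample rate
    sig, _ = librosa.load(path, sr=SAMPLE_RATE, offset=offset, duration=duration)
    # Mel spectrogram restricted to the 500 Hz - 12.5 kHz band
    mel = librosa.feature.melspectrogram(y=sig, sr=SAMPLE_RATE, n_mels=SPEC_SHAPE[0],
                                         fmin=FMIN, fmax=FMAX)
    mel = librosa.power_to_db(mel, ref=np.max)
    # Rescale to 0-255, resize to 48x128, and stack three channels to get an image-like input
    mel = (255 * (mel - mel.min()) / (mel.max() - mel.min() + 1e-6)).astype(np.uint8)
    mel = cv2.resize(mel, (SPEC_SHAPE[1], SPEC_SHAPE[0]))
    return np.stack([mel] * 3, axis=-1)

# spec = audio_to_spec('train_audio/some_species/some_recording.ogg')  # placeholder path
```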
23 |
24 | A brief overview of the dataset used for training is below:
25 |
26 | train_metadata.csv - A wide range of metadata is provided for the training data. The most directly relevant fields are:
27 | primary_label - a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.
28 | secondary_labels - background species as annotated by the recordist. An empty list does not mean that no background birds are audible.
29 | author - the eBird user who provided the recording.
30 | filename - the associated audio file.
31 | rating - float value between 0.0 and 5.0 indicating the quality rating on Xeno-canto and the number of background species, where 5.0 is the highest and 1.0 is the lowest; 0.0 means the recording has no user rating yet.
32 | train_audio/ - The training data consists of short recordings of individual bird calls generously uploaded by users of xenocanto.org.
33 | scored_birds.json - The subset of the species in the dataset that are scored.
34 | eBird_Taxonomy_v2021.csv - Data on the relationships between different species.
35 |
36 | Link to the dataset : https://www.kaggle.com/competitions/birdclef-2022/data
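
For reference, here is a minimal sketch of turning the metadata above into integer training labels, assuming the competition files have been downloaded to a local `birdclef-2022/` folder (the folder name and the sorted-label mapping are illustrative, not the exact notebook code):

```python
import json
import pandas as pd

DATA_DIR = 'birdclef-2022'  # hypothetical local copy of the Kaggle dataset

meta = pd.read_csv(f'{DATA_DIR}/train_metadata.csv')
with open(f'{DATA_DIR}/scored_birds.json') as f:
    scored_birds = json.load(f)

# Map each species code (primary_label) to an integer class index
label2idx = {code: i for i, code in enumerate(sorted(meta['primary_label'].unique()))}
meta['label'] = meta['primary_label'].map(label2idx)

# Flag the subset of species that are actually scored in the competition
meta['is_scored'] = meta['primary_label'].isin(scored_birds)
print(meta[['filename', 'primary_label', 'label', 'is_scored', 'rating']].head())
```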
37 |
38 |
39 | Troubleshooting
40 |
41 | Converting the model to the TFJS format was challenging and produced errors that couldn’t be resolved, so I used Flask to build a browser-based application instead.
42 |
43 |
--------------------------------------------------------------------------------
/[TensorFlow]MaskedAutoEncoders__Demo.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "kernelspec": {
4 | "language": "python",
5 | "display_name": "Python 3",
6 | "name": "python3"
7 | },
8 | "language_info": {
9 | "pygments_lexer": "ipython3",
10 | "nbconvert_exporter": "python",
11 | "version": "3.6.4",
12 | "file_extension": ".py",
13 | "codemirror_mode": {
14 | "name": "ipython",
15 | "version": 3
16 | },
17 | "name": "python",
18 | "mimetype": "text/x-python"
19 | },
20 | "colab": {
21 | "name": "[TensorFlow]MaskedAutoEncoders _Demo.ipynb",
22 | "provenance": [],
23 | "machine_shape": "hm",
24 | "background_execution": "on"
25 | },
26 | "accelerator": "TPU",
27 | "gpuClass": "standard"
28 | },
29 | "nbformat_minor": 0,
30 | "nbformat": 4,
31 | "cells": [
32 | {
33 | "cell_type": "markdown",
34 | "source": [
35 | ""
36 | ],
37 | "metadata": {
38 | "papermill": {
39 | "duration": 0.034259,
40 | "end_time": "2022-03-27T12:55:21.915376",
41 | "exception": false,
42 | "start_time": "2022-03-27T12:55:21.881117",
43 | "status": "completed"
44 | },
45 | "tags": [],
46 | "id": "RcgrrsmsJXwT"
47 | }
48 | },
49 | {
50 | "cell_type": "code",
51 | "source": [
52 | "from google.colab import drive\n",
53 | "drive.mount('/content/drive')"
54 | ],
55 | "metadata": {
56 | "id": "BzASdDi_ZRTw",
57 | "colab": {
58 | "base_uri": "https://localhost:8080/"
59 | },
60 | "outputId": "7837cf91-78a3-4e44-c972-d3d15d027c13"
61 | },
62 | "execution_count": null,
63 | "outputs": [
64 | {
65 | "output_type": "stream",
66 | "name": "stdout",
67 | "text": [
68 | "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
69 | ]
70 | }
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "source": [
76 |         "# Writing the whole dataset into a TFRecord file after mild preprocessing"
77 | ],
78 | "metadata": {
79 | "id": "AcYEQOd6Q5ZW"
80 | }
81 | },
82 | {
83 | "cell_type": "code",
84 | "source": [
85 | "import os\n",
86 | "import pandas as pd\n",
87 | "from tqdm import tqdm\n",
88 | "import numpy as np\n",
89 | "from sklearn.model_selection import StratifiedKFold\n",
90 | "import cv2\n",
91 | "import os\n",
92 | "import matplotlib.pyplot as plt\n",
93 | "from math import ceil\n",
94 | "import tensorflow as tf\n",
95 | "\n",
96 | "import warnings\n",
97 | "warnings.filterwarnings(\"ignore\")"
98 | ],
99 | "metadata": {
100 | "execution": {
101 | "iopub.status.busy": "2022-03-28T17:29:38.741983Z",
102 | "iopub.execute_input": "2022-03-28T17:29:38.742347Z",
103 | "iopub.status.idle": "2022-03-28T17:29:40.253470Z",
104 | "shell.execute_reply.started": "2022-03-28T17:29:38.742258Z",
105 | "shell.execute_reply": "2022-03-28T17:29:40.252502Z"
106 | },
107 | "trusted": true,
108 | "id": "OnPwxirZuSI3"
109 | },
110 | "execution_count": null,
111 | "outputs": []
112 | },
113 | {
114 | "cell_type": "code",
115 | "source": [
116 | "def configure_device():\n",
117 | " try:\n",
118 | " tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect() # connect to tpu cluster\n",
119 | " strategy = tf.distribute.TPUStrategy(tpu) # get strategy for tpu\n",
120 | " print('Num of TPUs: ', strategy.num_replicas_in_sync)\n",
121 | " device='TPU'\n",
122 | " except: # otherwise detect GPUs\n",
123 | " tpu = None\n",
124 | " gpus = tf.config.list_logical_devices('GPU') # get logical gpus\n",
125 | " ngpu = len(gpus)\n",
126 | " if ngpu: # if number of GPUs are 0 then CPU\n",
127 | " strategy = tf.distribute.MirroredStrategy(gpus) # single-GPU or multi-GPU\n",
128 | " print(\"> Running on GPU\", end=' | ')\n",
129 | " print(\"Num of GPUs: \", ngpu)\n",
130 | " device='GPU'\n",
131 | " else:\n",
132 | " print(\"> Running on CPU\")\n",
133 | " strategy = tf.distribute.get_strategy() # connect to single gpu or cpu\n",
134 | " device='CPU'\n",
135 | " return strategy, device, tpu"
136 | ],
137 | "metadata": {
138 | "id": "ASDwpoVPj1JC"
139 | },
140 | "execution_count": null,
141 | "outputs": []
142 | },
143 | {
144 | "cell_type": "code",
145 | "source": [
146 | "strategy, device, tpu = configure_device()\n",
147 | "AUTO = tf.data.experimental.AUTOTUNE\n"
148 | ],
149 | "metadata": {
150 | "colab": {
151 | "base_uri": "https://localhost:8080/"
152 | },
153 | "id": "rJd40S9ElIFZ",
154 | "outputId": "37848f0c-efe0-472e-f972-0f8104d5e237"
155 | },
156 | "execution_count": null,
157 | "outputs": [
158 | {
159 | "output_type": "stream",
160 | "name": "stdout",
161 | "text": [
162 | "INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.\n"
163 | ]
164 | },
165 | {
166 | "output_type": "stream",
167 | "name": "stderr",
168 | "text": [
169 | "INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.\n"
170 | ]
171 | },
172 | {
173 | "output_type": "stream",
174 | "name": "stdout",
175 | "text": [
176 | "WARNING:tensorflow:TPU system grpc://10.13.242.82:8470 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.\n"
177 | ]
178 | },
179 | {
180 | "output_type": "stream",
181 | "name": "stderr",
182 | "text": [
183 | "WARNING:tensorflow:TPU system grpc://10.13.242.82:8470 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.\n"
184 | ]
185 | },
186 | {
187 | "output_type": "stream",
188 | "name": "stdout",
189 | "text": [
190 | "INFO:tensorflow:Initializing the TPU system: grpc://10.13.242.82:8470\n"
191 | ]
192 | },
193 | {
194 | "output_type": "stream",
195 | "name": "stderr",
196 | "text": [
197 | "INFO:tensorflow:Initializing the TPU system: grpc://10.13.242.82:8470\n"
198 | ]
199 | },
200 | {
201 | "output_type": "stream",
202 | "name": "stdout",
203 | "text": [
204 | "INFO:tensorflow:Finished initializing TPU system.\n"
205 | ]
206 | },
207 | {
208 | "output_type": "stream",
209 | "name": "stderr",
210 | "text": [
211 | "INFO:tensorflow:Finished initializing TPU system.\n"
212 | ]
213 | },
214 | {
215 | "output_type": "stream",
216 | "name": "stdout",
217 | "text": [
218 | "INFO:tensorflow:Found TPU system:\n"
219 | ]
220 | },
221 | {
222 | "output_type": "stream",
223 | "name": "stderr",
224 | "text": [
225 | "INFO:tensorflow:Found TPU system:\n"
226 | ]
227 | },
228 | {
229 | "output_type": "stream",
230 | "name": "stdout",
231 | "text": [
232 | "INFO:tensorflow:*** Num TPU Cores: 8\n"
233 | ]
234 | },
235 | {
236 | "output_type": "stream",
237 | "name": "stderr",
238 | "text": [
239 | "INFO:tensorflow:*** Num TPU Cores: 8\n"
240 | ]
241 | },
242 | {
243 | "output_type": "stream",
244 | "name": "stdout",
245 | "text": [
246 | "INFO:tensorflow:*** Num TPU Workers: 1\n"
247 | ]
248 | },
249 | {
250 | "output_type": "stream",
251 | "name": "stderr",
252 | "text": [
253 | "INFO:tensorflow:*** Num TPU Workers: 1\n"
254 | ]
255 | },
256 | {
257 | "output_type": "stream",
258 | "name": "stdout",
259 | "text": [
260 | "INFO:tensorflow:*** Num TPU Cores Per Worker: 8\n"
261 | ]
262 | },
263 | {
264 | "output_type": "stream",
265 | "name": "stderr",
266 | "text": [
267 | "INFO:tensorflow:*** Num TPU Cores Per Worker: 8\n"
268 | ]
269 | },
270 | {
271 | "output_type": "stream",
272 | "name": "stdout",
273 | "text": [
274 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)\n"
275 | ]
276 | },
277 | {
278 | "output_type": "stream",
279 | "name": "stderr",
280 | "text": [
281 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)\n"
282 | ]
283 | },
284 | {
285 | "output_type": "stream",
286 | "name": "stdout",
287 | "text": [
288 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)\n"
289 | ]
290 | },
291 | {
292 | "output_type": "stream",
293 | "name": "stderr",
294 | "text": [
295 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)\n"
296 | ]
297 | },
298 | {
299 | "output_type": "stream",
300 | "name": "stdout",
301 | "text": [
302 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)\n"
303 | ]
304 | },
305 | {
306 | "output_type": "stream",
307 | "name": "stderr",
308 | "text": [
309 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)\n"
310 | ]
311 | },
312 | {
313 | "output_type": "stream",
314 | "name": "stdout",
315 | "text": [
316 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)\n"
317 | ]
318 | },
319 | {
320 | "output_type": "stream",
321 | "name": "stderr",
322 | "text": [
323 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)\n"
324 | ]
325 | },
326 | {
327 | "output_type": "stream",
328 | "name": "stdout",
329 | "text": [
330 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)\n"
331 | ]
332 | },
333 | {
334 | "output_type": "stream",
335 | "name": "stderr",
336 | "text": [
337 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)\n"
338 | ]
339 | },
340 | {
341 | "output_type": "stream",
342 | "name": "stdout",
343 | "text": [
344 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)\n"
345 | ]
346 | },
347 | {
348 | "output_type": "stream",
349 | "name": "stderr",
350 | "text": [
351 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)\n"
352 | ]
353 | },
354 | {
355 | "output_type": "stream",
356 | "name": "stdout",
357 | "text": [
358 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)\n"
359 | ]
360 | },
361 | {
362 | "output_type": "stream",
363 | "name": "stderr",
364 | "text": [
365 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)\n"
366 | ]
367 | },
368 | {
369 | "output_type": "stream",
370 | "name": "stdout",
371 | "text": [
372 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)\n"
373 | ]
374 | },
375 | {
376 | "output_type": "stream",
377 | "name": "stderr",
378 | "text": [
379 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)\n"
380 | ]
381 | },
382 | {
383 | "output_type": "stream",
384 | "name": "stdout",
385 | "text": [
386 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)\n"
387 | ]
388 | },
389 | {
390 | "output_type": "stream",
391 | "name": "stderr",
392 | "text": [
393 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)\n"
394 | ]
395 | },
396 | {
397 | "output_type": "stream",
398 | "name": "stdout",
399 | "text": [
400 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)\n"
401 | ]
402 | },
403 | {
404 | "output_type": "stream",
405 | "name": "stderr",
406 | "text": [
407 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)\n"
408 | ]
409 | },
410 | {
411 | "output_type": "stream",
412 | "name": "stdout",
413 | "text": [
414 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)\n"
415 | ]
416 | },
417 | {
418 | "output_type": "stream",
419 | "name": "stderr",
420 | "text": [
421 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)\n"
422 | ]
423 | },
424 | {
425 | "output_type": "stream",
426 | "name": "stdout",
427 | "text": [
428 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)\n"
429 | ]
430 | },
431 | {
432 | "output_type": "stream",
433 | "name": "stderr",
434 | "text": [
435 | "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)\n"
436 | ]
437 | },
438 | {
439 | "output_type": "stream",
440 | "name": "stdout",
441 | "text": [
442 | "Num of TPUs: 8\n"
443 | ]
444 | }
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "source": [
450 | "import re\n",
451 | "def count_data_items(filenames):\n",
452 | " n = [int(re.compile(r\"-([0-9]*)\\.\").search(filename).group(1)) for filename in filenames]\n",
453 | " return np.sum(n)\n",
454 | "\n",
455 | "GCS_PATH = \"gs://kds-cd6394f23429c6b928662c3e4c9479f1b4e4371b159e5633d5a79b6d\"\n",
456 | "ALL_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/*.tfrec')\n",
457 | "print('NUM TFRECORD FILES: {:,}'.format(len(ALL_FILENAMES)))\n"
458 | ],
459 | "metadata": {
460 | "colab": {
461 | "base_uri": "https://localhost:8080/"
462 | },
463 | "id": "lBtNTWAliCwv",
464 | "outputId": "b6ba382e-76c3-42d5-f9a4-268a38af2940"
465 | },
466 | "execution_count": null,
467 | "outputs": [
468 | {
469 | "output_type": "stream",
470 | "name": "stdout",
471 | "text": [
472 | "NUM TFRECORD FILES: 1\n"
473 | ]
474 | }
475 | ]
476 | },
477 | {
478 | "cell_type": "markdown",
479 | "source": [
480 | "# Parsing the saved tfrecord file and getting the dataset"
481 | ],
482 | "metadata": {
483 | "id": "XFXqZmJgQzQ7"
484 | }
485 | },
486 | {
487 | "cell_type": "code",
488 | "source": [
489 | "ALL_FILENAMES"
490 | ],
491 | "metadata": {
492 | "id": "TvBb6eSBiPYI",
493 | "colab": {
494 | "base_uri": "https://localhost:8080/"
495 | },
496 | "outputId": "0512623d-6e06-45cb-e6e0-c7e4458f64fc"
497 | },
498 | "execution_count": null,
499 | "outputs": [
500 | {
501 | "output_type": "execute_result",
502 | "data": {
503 | "text/plain": [
504 | "['gs://kds-cd6394f23429c6b928662c3e4c9479f1b4e4371b159e5633d5a79b6d/train.tfrec']"
505 | ]
506 | },
507 | "metadata": {},
508 | "execution_count": 34
509 | }
510 | ]
511 | },
512 | {
513 | "cell_type": "code",
514 | "source": [
515 | "# tf.config.run_functions_eagerly(True)\n",
516 | "def parse_tfr_element(element):\n",
517 |     "  # feature spec matching the structure used when the TFRecord was written\n",
518 | " data = {\n",
519 | " 'filename': tf.io.FixedLenFeature([], tf.string),\n",
520 | " 'time':tf.io.FixedLenFeature([], tf.int64),\n",
521 | " 'audio' : tf.io.FixedLenFeature([], tf.string),\n",
522 | " 'label':tf.io.FixedLenFeature([], tf.int64),\n",
523 | " }\n",
524 | "\n",
525 | " \n",
526 | " content = tf.io.parse_single_example(element, data)\n",
527 | " \n",
528 | " filename = content['filename']\n",
529 | "\n",
530 | " time = content['time']\n",
531 | " label = content['label']\n",
532 | " audio = content['audio']\n",
533 | "\n",
534 | " return (audio, label)\n",
535 | "\n",
536 | "dataset = tf.data.TFRecordDataset(ALL_FILENAMES)\n",
537 | "#pass every single feature through our mapping function\n",
538 | "dataset=dataset.shuffle(75000)\n",
539 | "dataset = dataset.map(parse_tfr_element)\n",
540 | "# dataset = dataset.batch(10)"
541 | ],
542 | "metadata": {
543 | "id": "TuAeQTATRGy_"
544 | },
545 | "execution_count": null,
546 | "outputs": []
547 | },
548 | {
549 | "cell_type": "code",
550 | "source": [
551 | "from tqdm import tqdm\n",
552 | "train_x=[]\n",
553 | "train_y=[]\n",
554 | "#from google.colab.patches import cv2_imshow\n",
555 | "for sample in tqdm(dataset.take(20000)):\n",
556 |     "  x = np.frombuffer(sample[0].numpy(), dtype='uint8')  # frombuffer replaces the deprecated np.fromstring\n",
557 | " image = cv2.imdecode(x, cv2.IMREAD_UNCHANGED)\n",
558 | " image = cv2.resize(image,(128,48))\n",
559 | " train_x.append(image)\n",
560 | " train_y.append(sample[1].numpy())"
561 | ],
562 | "metadata": {
563 | "colab": {
564 | "base_uri": "https://localhost:8080/"
565 | },
566 | "id": "UpHNaXaGTdOv",
567 | "outputId": "36094b03-5ae5-4932-d185-8646f361d317"
568 | },
569 | "execution_count": null,
570 | "outputs": [
571 | {
572 | "output_type": "stream",
573 | "name": "stderr",
574 | "text": [
575 | "20000it [01:21, 244.15it/s]\n"
576 | ]
577 | }
578 | ]
579 | },
580 | {
581 | "cell_type": "code",
582 | "source": [
583 | "# train_X = np.asarray(train_x)\n",
584 | "train_Y = np.asarray(train_y)\n",
585 | "del train_y"
586 | ],
587 | "metadata": {
588 | "id": "ow3pOXa3xP68"
589 | },
590 | "execution_count": null,
591 | "outputs": []
592 | },
593 | {
594 | "cell_type": "code",
595 | "source": [
596 | "train_Y = train_Y.ravel()"
597 | ],
598 | "metadata": {
599 | "id": "_qE2nsUMxWA-"
600 | },
601 | "execution_count": null,
602 | "outputs": []
603 | },
604 | {
605 | "cell_type": "code",
606 | "source": [
607 | "np.unique(train_Y).reshape(-1, 1).shape"
608 | ],
609 | "metadata": {
610 | "colab": {
611 | "base_uri": "https://localhost:8080/"
612 | },
613 | "id": "-0yGKO2K8NzD",
614 | "outputId": "68350f65-1c5a-473c-b67a-e9ae66e535db"
615 | },
616 | "execution_count": null,
617 | "outputs": [
618 | {
619 | "output_type": "execute_result",
620 | "data": {
621 | "text/plain": [
622 | "(96, 1)"
623 | ]
624 | },
625 | "metadata": {},
626 | "execution_count": 39
627 | }
628 | ]
629 | },
630 | {
631 | "cell_type": "code",
632 | "source": [
633 | "jg = [[float(i)] for i in range(151)]\n",
634 | "jg=np.array(jg)\n",
635 | "jg.shape"
636 | ],
637 | "metadata": {
638 | "colab": {
639 | "base_uri": "https://localhost:8080/"
640 | },
641 | "id": "RustmjSN_H4m",
642 | "outputId": "959bab8d-19e5-4c64-94e2-fac557050868"
643 | },
644 | "execution_count": null,
645 | "outputs": [
646 | {
647 | "output_type": "execute_result",
648 | "data": {
649 | "text/plain": [
650 | "(151, 1)"
651 | ]
652 | },
653 | "metadata": {},
654 | "execution_count": 40
655 | }
656 | ]
657 | },
658 | {
659 | "cell_type": "code",
660 | "source": [
661 | "from sklearn.preprocessing import OneHotEncoder\n",
662 | "ohe = OneHotEncoder()\n",
663 | "ohe.fit(jg)\n",
664 | "Y_vals=[]\n",
665 | "for y in tqdm(train_Y):\n",
666 | " res = ohe.transform(np.array([y]).reshape(-1, 1)).todense()\n",
667 | " y_arr = np.array(res).reshape(-1, 151)\n",
668 | " Y_vals.extend(y_arr)\n",
669 | "Y_vals=np.array(Y_vals)\n",
670 | "print(Y_vals.shape)\n",
671 | "del train_Y\n",
672 | "\n",
673 | "train_x_placeholder=[]\n",
674 | "train_y = []\n",
675 | "for i in range(len(train_x)):\n",
676 | " try:\n",
677 | " assert train_x[i].shape == (48, 128, 3)\n",
678 | " train_x_placeholder.append(train_x[i])\n",
679 | " train_y.append(Y_vals[i])\n",
680 | " except:\n",
681 | " pass\n",
682 | "del Y_vals"
683 | ],
684 | "metadata": {
685 | "colab": {
686 | "base_uri": "https://localhost:8080/"
687 | },
688 | "id": "4BfrtV1c7R_6",
689 | "outputId": "98c9ae15-d332-4ea9-e115-9f273d6d1944"
690 | },
691 | "execution_count": null,
692 | "outputs": [
693 | {
694 | "output_type": "stream",
695 | "name": "stderr",
696 | "text": [
697 | "100%|██████████| 20000/20000 [00:06<00:00, 2992.09it/s]"
698 | ]
699 | },
700 | {
701 | "output_type": "stream",
702 | "name": "stdout",
703 | "text": [
704 | "(20000, 151)\n"
705 | ]
706 | },
707 | {
708 | "output_type": "stream",
709 | "name": "stderr",
710 | "text": [
711 | "\n"
712 | ]
713 | }
714 | ]
715 | },
716 | {
717 | "cell_type": "code",
718 | "source": [
719 | "train_x_placeholder = tf.convert_to_tensor(train_x_placeholder)\n",
720 | "train_y = tf.convert_to_tensor(train_y)"
721 | ],
722 | "metadata": {
723 | "id": "g6XvtB0K7SAK"
724 | },
725 | "execution_count": null,
726 | "outputs": []
727 | },
728 | {
729 | "cell_type": "markdown",
730 | "source": [
731 |     "This tutorial explains the concepts and terminology of the research paper \"Masked Autoencoders Are Scalable Vision Learners\".\n",
732 | "\n",
733 | "\n"
734 | ],
735 | "metadata": {
736 | "papermill": {
737 | "duration": 0.035917,
738 | "end_time": "2022-03-27T12:55:21.984473",
739 | "exception": false,
740 | "start_time": "2022-03-27T12:55:21.948556",
741 | "status": "completed"
742 | },
743 | "tags": [],
744 | "id": "xGdFB6lSJXwW"
745 | }
746 | },
747 | {
748 | "cell_type": "markdown",
749 | "source": [
750 | "## **Context**\n",
751 | "\n",
752 |         "`MAE (Masked Autoencoders)` are self-supervised models that reconstruct an image with missing pixels. They rest on two core designs: an encoder-decoder architecture and masking (hiding) of pixels. First, for training, up to 75% of the tokens are masked and removed, and the remaining patches are encoded. Since the output of an autoencoder has the same number of tokens as its input, the mask tokens are inserted again and, with a lightweight decoder, the input image is reconstructed. After pre-training, the decoder is discarded and the encoder is applied to complete images (no missing pixels) for recognition tasks. This design allows faster training and improved accuracy. The model works well on a variety of image tasks, outperforms supervised training, and scales effectively.\n",
753 | "\n",
754 |         "The transformer architecture has been successfully applied to Natural Language Processing (NLP) using autoregressive language modelling and masked encoding, wherein a portion of the data is removed and models are trained to predict the missing data. However, computer vision has been predominantly associated with Convolutional Neural Networks (CNNs). The paper explores the use of masked autoencoders in computer vision.\n",
755 | "\n",
756 | "Previously, autoencoding wasn’t used in computer vision due to the following reasons-\n",
757 | "\n",
758 |         "📌 It was difficult to integrate mask tokens or positional embeddings into CNNs. With Vision Transformers (ViT), this problem was solved; ViT slightly outperforms CNNs on large datasets (more than 100 million images). In ViT, we split the image into fixed-size patches, vectorize them, add positional embeddings, and feed the vectors into a transformer encoder to train the model.\n",
759 | "\n",
760 |         "📌 Language is information-dense, and predicting missing words is a sophisticated task that requires sophisticated language understanding, whereas missing patches of an image can be recreated with little high-level understanding. To overcome this, a large proportion of the image patches are masked (removed), reducing redundancy and requiring a higher level of understanding.\n",
761 | "\n",
762 |         "📌 The decoder plays different roles in reconstructing images and text: text has a high level of semantic information (information that refers to facts, concepts, and ideas accumulated over the course of our lives), while images have a low level of semantic information. Thus the decoder design plays an important role in reconstructing images.\n",
763 | "\n",
764 |         "The paper presents a simple, effective and scalable form of MAE (Masked Autoencoder) for visual representation learning. In MAE, random patches of the input are masked and then reconstructed in pixel space. It has an asymmetric encoder-decoder design: the encoder operates on the tokens that remain after removal of the masked tokens, and a lightweight decoder reconstructs the image from the latent representation and the mask tokens. With a high masking ratio (75%), high accuracy can be achieved while reducing the overall training time by more than 3x and also reducing memory consumption. MAE pre-training helps data-hungry models like ViT-Huge improve their performance. The paper also evaluates transfer learning on a variety of downstream tasks such as object detection, instance segmentation, and semantic segmentation; in these tasks, the proposed pre-trained model achieves better results than supervised pre-trained models.\n"
765 | ],
766 | "metadata": {
767 | "papermill": {
768 | "duration": 0.039857,
769 | "end_time": "2022-03-27T12:55:22.061429",
770 | "exception": false,
771 | "start_time": "2022-03-27T12:55:22.021572",
772 | "status": "completed"
773 | },
774 | "tags": [],
775 | "id": "B_J8mjneJXwX"
776 | }
777 | },
778 | {
779 | "cell_type": "markdown",
780 | "source": [
781 | "\n",
782 | "## **Masked language modelling**\n",
783 | "\n",
784 |         "`Masked language modelling` is successful for pre-training in NLP. In methods such as BERT and GPT, sequences of words are removed from the input and the model is trained to predict the missing sequence. These models scale efficiently and work for a variety of downstream tasks.\n",
785 | "\n",
786 | "## **Autoencoding**\n",
787 | "\n",
788 |         "`Autoencoding` is a type of neural network that is trained to copy its input to its output. It has an encoder that converts the input vector into a code vector (latent representation) using recognition weights, and a decoder that regenerates the input from the code vector using generative weights. Denoising autoencoders (DAE) corrupt an input signal and learn to predict the original signal; they are used to extract a representation from the encoder that is robust to the introduction of noise. A DAE can be constructed in many ways, such as masking pixels or removing the colour of the input. MAE is a kind of denoising autoencoder, but it differs from the classical DAE in many ways.\n",
789 | "\n",
790 | "## **Masked image encoding**\n",
791 | "\n",
792 |         "`Masked image encoding` is used to recreate images that have been corrupted by masking. Context encoders are convolutional neural networks that generate patches of missing pixels on the basis of the surrounding pixels. The success of unsupervised learning with transformers in NLP has prompted similar methods to be applied to images. `iGPT` operates on sequences of pixels, using a sequence transformer to predict unknown pixels autoregressively. `ViT (Vision Transformer)` masks patches of images and uses transformers to predict the image type. Vision Transformers, using self-supervision, attain excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. More recently, `BEiT (Bidirectional Encoder representation from Image Transformers)` has been used to pretrain image transformers: the original image is broken into tokens (patches), a few tokens are randomly masked, and the original tokens are recovered by fine-tuning the model.\n",
793 | "\n",
794 | "## **Self Supervised Learning**\n",
795 | "\n",
796 | "`Self Supervised Learning` has been used significantly in language and is now being used in computer vision. Contrastive learning is used to train a CNN to classify similar and dissimilar images. Contrastive methods strongly depend on data augmentation. Autoencoding is based on a different concept, and it exhibits different behaviours as we will present. "
797 | ],
798 | "metadata": {
799 | "papermill": {
800 | "duration": 0.033898,
801 | "end_time": "2022-03-27T12:55:22.129566",
802 | "exception": false,
803 | "start_time": "2022-03-27T12:55:22.095668",
804 | "status": "completed"
805 | },
806 | "tags": [],
807 | "id": "PeHTE-VqJXwY"
808 | }
809 | },
810 | {
811 | "cell_type": "markdown",
812 | "source": [
813 | "## **Masked Autoencoders are Scalable Vision Learners**\n",
814 | "\n",
815 |         "Masked Autoencoder (MAE) follows a simple autoencoding approach, wherein an encoder maps an input into a latent representation and a decoder reconstructs the original image from the latent representation and mask tokens. The paper proposes an asymmetric design, as it allows the encoder to operate on the input after the masked tokens have been removed from it.\n",
816 | "\n",
817 | "## **Masking**\n",
818 | "\n",
819 |         "The image is divided into regular non-overlapping patches and a subset of the patches is randomly sampled, masking (removing) the other patches. This is called “random sampling”, i.e., sampling random patches without replacement from a uniform distribution. Random sampling with a high masking ratio eliminates redundancy and reduces the chance of the task being solved by extrapolation from the visible patches. A uniform distribution prevents centre bias (i.e., more masked patches near the centre), and we get a highly sparse input, which helps in designing an efficient encoder.\n",
820 | "\n",
821 | "## **MAE Encoder**\n",
822 | "\n",
823 |         "The ViT (Vision Transformer) encoder is applied only to the patches that are not masked. The encoder embeds the visible patches by a linear projection with added positional embeddings and then processes them using transformer blocks. Mask tokens are vectors that indicate the presence of a missing patch to be predicted. Since masked patches are removed and no mask tokens are used, large encoders can be trained with a fraction of the computation and memory. The MAE encoder is used both during pre-training and at test time.\n",
824 | "\n",
825 | "\n",
826 | "\n",
827 | "## **MAE Decoder**\n",
828 | "\n",
829 |         "The inputs to the decoder are the encoded visible patches and the mask tokens. The total count of tokens is the same as in the input image. All tokens carry positional embeddings so that their location in the image is known. The decoder consists of transformer blocks. MAE decoders are used only during pre-training and hence can be designed independently of the encoder. Due to the asymmetric design, the full set of tokens can be processed by a lightweight decoder, which significantly reduces pre-training time.\n",
830 | "\n",
831 | "\n",
832 | "\n",
833 | "## **Reconstruction Target**\n",
834 | " \n",
835 |         "MAE reconstructs the input image by predicting the pixel values for each masked patch. Each element in the decoder’s output is a vector of pixel values representing a patch. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch. The decoder’s output is reshaped to form a reconstructed image. The loss function computes the mean squared error between the reconstructed and original images on the masked patches. The paper also studies the results after normalising the pixel values of each masked patch, as using normalised pixel values improves the accuracy of the experiments.\n",
836 | "\n",
837 | "## **Implementation**\n",
838 | "\n",
839 |         "In MAE pre-training, a token is generated for every input patch by adding a positional embedding to the linear projection of the patch. The list of tokens is randomly shuffled and the last portion of the list is removed according to the masking ratio. The remaining tokens are encoded, and then a list of mask tokens is appended so that the total number of tokens equals the number of input tokens. The full list is unshuffled to align the tokens with their original positions, then decoded, and the original image is reconstructed. This process has negligible overhead, as shuffling and unshuffling are fast and no sparse operations (operations on matrices stored as row and column indices of non-zero entries) are needed.\n",
840 | "\n",
841 | "\n",
842 | "\n"
843 | ],
844 | "metadata": {
845 | "papermill": {
846 | "duration": 0.032258,
847 | "end_time": "2022-03-27T12:55:22.194424",
848 | "exception": false,
849 | "start_time": "2022-03-27T12:55:22.162166",
850 | "status": "completed"
851 | },
852 | "tags": [],
853 | "id": "xldwdC4tJXwY"
854 | }
855 | },
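    {
      "cell_type": "markdown",
      "source": [
        "Below is a minimal NumPy sketch of the shuffle / mask / unshuffle bookkeeping described above, assuming a toy setting of 168 patch tokens of dimension 128 (matching NUM_PATCHES and ENC_PROJECTION_DIM) and a 75% masking ratio; the helper names and the zero-valued mask token are illustrative, not the actual MAE implementation."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "def random_masking(tokens, mask_ratio=0.75, seed=0):\n",
        "    # tokens: (num_patches, dim) array of patch embeddings (projection + positional embedding)\n",
        "    num_patches = tokens.shape[0]\n",
        "    num_keep = int(num_patches * (1 - mask_ratio))\n",
        "    rng = np.random.default_rng(seed)\n",
        "    shuffle_idx = rng.permutation(num_patches)   # randomly shuffle the token positions\n",
        "    keep_idx = shuffle_idx[:num_keep]            # keep the first portion, drop the rest\n",
        "    restore_idx = np.argsort(shuffle_idx)        # indices that unshuffle back to the original order\n",
        "    return tokens[keep_idx], restore_idx         # only the visible tokens go through the encoder\n",
        "\n",
        "def unshuffle(encoded_visible, mask_token, restore_idx):\n",
        "    # Append mask tokens so the total count equals the number of input tokens, then unshuffle\n",
        "    num_masked = restore_idx.shape[0] - encoded_visible.shape[0]\n",
        "    full = np.concatenate([encoded_visible, np.tile(mask_token, (num_masked, 1))], axis=0)\n",
        "    return full[restore_idx]                     # decoder input, aligned to the original positions\n",
        "\n",
        "tokens = np.random.randn(168, 128)               # NUM_PATCHES x ENC_PROJECTION_DIM\n",
        "visible, restore_idx = random_masking(tokens)\n",
        "decoder_input = unshuffle(visible, np.zeros((1, 128)), restore_idx)\n",
        "print(visible.shape, decoder_input.shape)        # (42, 128) (168, 128)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },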
856 | {
857 | "cell_type": "code",
858 | "source": [
859 | "!pip install -q noisereduce "
860 | ],
861 | "metadata": {
862 | "_kg_hide-input": true,
863 | "_kg_hide-output": true,
864 | "execution": {
865 | "iopub.execute_input": "2022-03-27T12:55:22.264928Z",
866 | "iopub.status.busy": "2022-03-27T12:55:22.263783Z",
867 | "iopub.status.idle": "2022-03-27T12:55:31.020598Z",
868 | "shell.execute_reply": "2022-03-27T12:55:31.021119Z",
869 | "shell.execute_reply.started": "2022-03-27T12:53:06.667196Z"
870 | },
871 | "papermill": {
872 | "duration": 8.793844,
873 | "end_time": "2022-03-27T12:55:31.021409",
874 | "exception": false,
875 | "start_time": "2022-03-27T12:55:22.227565",
876 | "status": "completed"
877 | },
878 | "tags": [],
879 | "id": "LxTl3mN5JXwZ"
880 | },
881 | "execution_count": null,
882 | "outputs": []
883 | },
884 | {
885 | "cell_type": "code",
886 | "source": [
887 | "!pip install tensorflow_addons "
888 | ],
889 | "metadata": {
890 | "id": "ZR41eUuIl0al",
891 | "colab": {
892 | "base_uri": "https://localhost:8080/"
893 | },
894 | "outputId": "5af4dd4f-aea6-439c-882d-36ab636abe61"
895 | },
896 | "execution_count": null,
897 | "outputs": [
898 | {
899 | "output_type": "stream",
900 | "name": "stdout",
901 | "text": [
902 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
903 | "Requirement already satisfied: tensorflow_addons in /usr/local/lib/python3.7/dist-packages (0.17.1)\n",
904 | "Requirement already satisfied: typeguard>=2.7 in /usr/local/lib/python3.7/dist-packages (from tensorflow_addons) (2.7.1)\n",
905 | "Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from tensorflow_addons) (21.3)\n",
906 | "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->tensorflow_addons) (3.0.9)\n"
907 | ]
908 | }
909 | ]
910 | },
911 | {
912 | "cell_type": "code",
913 | "source": [
914 | "import os\n",
915 | "import json\n",
916 | "import numpy as np\n",
917 | "import pandas as pd\n",
918 | "import seaborn as sns\n",
919 | "import matplotlib.pyplot as plt\n",
920 | "\n",
921 | "pd.set_option('max_rows', 250)\n",
922 | "pd.set_option('max_columns', 100)\n",
923 | "\n",
924 | "from tensorflow.keras import layers\n",
925 | "from tensorflow.keras import models\n",
926 | "import tensorflow as tf\n",
927 | "import tensorflow_addons as tfa\n",
928 | "from tensorflow import keras\n",
929 | "\n",
930 | "import noisereduce as nr\n",
931 | "from math import ceil\n",
932 | "\n",
933 | "import random\n",
934 | "\n",
935 | "# Setting seeds for reproducibility.\n",
936 | "seed = 42\n",
937 | "tf.random.set_seed(seed)\n",
938 | "np.random.seed(seed)\n",
939 | "random.seed(seed)"
940 | ],
941 | "metadata": {
942 | "_kg_hide-input": true,
943 | "_kg_hide-output": true,
944 | "execution": {
945 | "iopub.execute_input": "2022-03-27T12:55:31.09744Z",
946 | "iopub.status.busy": "2022-03-27T12:55:31.096695Z",
947 | "iopub.status.idle": "2022-03-27T12:55:42.3049Z",
948 | "shell.execute_reply": "2022-03-27T12:55:42.30541Z",
949 | "shell.execute_reply.started": "2022-03-27T12:53:18.014455Z"
950 | },
951 | "papermill": {
952 | "duration": 11.250103,
953 | "end_time": "2022-03-27T12:55:42.305582",
954 | "exception": false,
955 | "start_time": "2022-03-27T12:55:31.055479",
956 | "status": "completed"
957 | },
958 | "tags": [],
959 | "id": "8PXj70QaJXwa"
960 | },
961 | "execution_count": null,
962 | "outputs": []
963 | },
964 | {
965 | "cell_type": "code",
966 | "source": [
967 | "#seed = 42\n",
968 | "os.environ['PYTHONHASHSEED'] = str(seed)\n",
969 | "np.random.seed(seed)\n",
970 | "\n",
971 | "DURATION = 15\n",
972 | "SPEC_SHAPE = (48, 128)\n",
973 | "SAMPLE_RATE = 32000\n",
974 | "TEST_DURATION = 5\n",
975 | "SPEC_SHAPE = (48, 128)\n",
976 | "FMIN = 500\n",
977 | "FMAX = 12500\n",
978 | "\n",
979 | "# DATA\n",
980 | "BUFFER_SIZE = 1024\n",
981 | "BATCH_SIZE = 256\n",
982 | "AUTO = tf.data.AUTOTUNE\n",
983 | "INPUT_SHAPE = (48, 128, 3)\n",
984 | "NUM_CLASSES = 151\n",
985 | "\n",
986 | "# OPTIMIZER\n",
987 | "LEARNING_RATE = 5e-3\n",
988 | "WEIGHT_DECAY = 1e-4\n",
989 | "\n",
990 | "# TRAINING\n",
991 | "EPOCHS = 1\n",
992 | "\n",
993 | "# AUGMENTATION\n",
994 |     "IMAGE_SIZE = 48  # input image height\n",
995 |     "IMAGE_SIZE1 = 128  # input image width\n",
996 |     "PATCH_SIZE = 6  # size of the patches to be extracted from the input images\n",
997 |     "NUM_PATCHES = 168  # (IMAGE_SIZE // PATCH_SIZE) * (IMAGE_SIZE1 // PATCH_SIZE) = 8 * 21\n",
998 | "\n",
999 | "# ENCODER and DECODER\n",
1000 | "LAYER_NORM_EPS = 1e-6\n",
1001 | "ENC_PROJECTION_DIM = 128\n",
1002 | "ENC_NUM_HEADS = 4\n",
1003 | "ENC_LAYERS = 3\n",
1004 | "ENC_TRANSFORMER_UNITS = [\n",
1005 | " ENC_PROJECTION_DIM * 2,\n",
1006 | " ENC_PROJECTION_DIM,\n",
1007 | "] # Size of the transformer layers."
1008 | ],
1009 | "metadata": {
1010 | "execution": {
1011 | "iopub.execute_input": "2022-03-27T12:55:42.775718Z",
1012 | "iopub.status.busy": "2022-03-27T12:55:42.775031Z",
1013 | "iopub.status.idle": "2022-03-27T12:55:42.777044Z",
1014 | "shell.execute_reply": "2022-03-27T12:55:42.776404Z",
1015 | "shell.execute_reply.started": "2022-03-27T12:53:29.740765Z"
1016 | },
1017 | "papermill": {
1018 | "duration": 0.042232,
1019 | "end_time": "2022-03-27T12:55:42.777173",
1020 | "exception": false,
1021 | "start_time": "2022-03-27T12:55:42.734941",
1022 | "status": "completed"
1023 | },
1024 | "tags": [],
1025 | "id": "lh29TujyJXwb"
1026 | },
1027 | "execution_count": null,
1028 | "outputs": []
1029 | },
1030 | {
1031 | "cell_type": "markdown",
1032 | "source": [
1033 | "## **Masking ratio**\n",
1034 | "\n",
1035 |         "A high masking ratio is optimal for MAE: a ratio of 75% works well for both linear probing and fine-tuning. In other computer vision models the masking ratio is lower (between 20% and 50%), while in language it is lower still, around 15%.\n",
1036 | "\n",
1037 | "## **Mask Tokens**\n",
1038 | "\n",
1039 | "\n",
1040 |         "The mask tokens are dropped before encoding and inserted again afterwards. If mask tokens are used during encoding, accuracy drops by 14% in linear probing and 1% in fine-tuning. This is because, in pre-training, a large proportion of the tokens are masked, so the encoder would pre-train on tokens that are not part of the uncorrupted image, reducing accuracy. With the mask tokens removed, the encoder pre-trains only on patches that exist in uncorrupted images.\n",
1041 |         "Removing the mask tokens also reduces the computational resources required.\n",
1042 | "\n",
1043 | "## **Mask Sampling Strategy**\n",
1044 | "\n",
1045 |         "One sampling strategy removes a large block of pixels (block-wise masking). At a masking ratio of 50%, fine-tuning and linear probing do not degrade much, but at a higher masking ratio of 75% the accuracy decreases considerably, and the observed reconstruction is much blurrier due to the higher training loss. Grid-wise sampling has a lower training loss, but the representation quality is low. Simple random sampling works best for MAE, with a high masking ratio providing high speed and good accuracy.\n",
1046 | "\n"
1047 | ],
1048 | "metadata": {
1049 | "papermill": {
1050 | "duration": 0.047982,
1051 | "end_time": "2022-03-27T12:55:47.128207",
1052 | "exception": false,
1053 | "start_time": "2022-03-27T12:55:47.080225",
1054 | "status": "completed"
1055 | },
1056 | "tags": [],
1057 | "id": "v3EOTYUxJXwe"
1058 | }
1059 | },
1060 | {
1061 | "cell_type": "code",
1062 | "source": [
1063 | "class Patches(layers.Layer):\n",
1064 | " def __init__(self, patch_size=PATCH_SIZE):\n",
1065 | " super(Patches, self).__init__()\n",
1066 | " self.patch_size = patch_size\n",
1067 | "\n",
1068 | " def call(self, images):\n",
1069 | " batch_size = tf.shape(images)[0]\n",
1070 | " patches = tf.image.extract_patches(\n",
1071 | " images=images,\n",
1072 | " sizes=[1, self.patch_size, self.patch_size, 1],\n",
1073 | " strides=[1, self.patch_size, self.patch_size, 1],\n",
1074 | " rates=[1, 1, 1, 1],\n",
1075 | " padding=\"VALID\",\n",
1076 | " )\n",
1077 | " patch_dims = patches.shape[-1]\n",
1078 | " patches = tf.reshape(patches, [batch_size, -1, patch_dims])\n",
1079 | " return patches"
1080 | ],
1081 | "metadata": {
1082 | "execution": {
1083 | "iopub.execute_input": "2022-03-27T12:55:47.231686Z",
1084 | "iopub.status.busy": "2022-03-27T12:55:47.230934Z",
1085 | "iopub.status.idle": "2022-03-27T12:55:47.234234Z",
1086 | "shell.execute_reply": "2022-03-27T12:55:47.233653Z",
1087 | "shell.execute_reply.started": "2022-03-27T12:54:00.787624Z"
1088 | },
1089 | "papermill": {
1090 | "duration": 0.058255,
1091 | "end_time": "2022-03-27T12:55:47.234379",
1092 | "exception": false,
1093 | "start_time": "2022-03-27T12:55:47.176124",
1094 | "status": "completed"
1095 | },
1096 | "tags": [],
1097 | "id": "qAXPHasTJXwe"
1098 | },
1099 | "execution_count": null,
1100 | "outputs": []
1101 | },
1102 | {
1103 | "cell_type": "code",
1104 | "source": [
1105 | "class PatchEncoder(layers.Layer):\n",
1106 | " def __init__(self, num_patches=NUM_PATCHES, projection_dim=ENC_PROJECTION_DIM):\n",
1107 | " super(PatchEncoder, self).__init__()\n",
1108 | " self.num_patches = num_patches\n",
1109 | " self.projection = layers.Dense(units=projection_dim)\n",
1110 | " self.position_embedding = layers.Embedding(\n",
1111 | " input_dim=num_patches, output_dim=projection_dim\n",
1112 | " )\n",
1113 | "\n",
1114 | " def call(self, patch):\n",
1115 | " positions = tf.range(start=0, limit=self.num_patches, delta=1)\n",
1116 | " encoded = self.projection(patch) + self.position_embedding(positions)\n",
1117 | " return encoded"
1118 | ],
1119 | "metadata": {
1120 | "execution": {
1121 | "iopub.execute_input": "2022-03-27T12:55:47.334786Z",
1122 | "iopub.status.busy": "2022-03-27T12:55:47.3341Z",
1123 | "iopub.status.idle": "2022-03-27T12:55:47.339401Z",
1124 | "shell.execute_reply": "2022-03-27T12:55:47.33995Z",
1125 | "shell.execute_reply.started": "2022-03-27T12:54:01.181312Z"
1126 | },
1127 | "papermill": {
1128 | "duration": 0.057524,
1129 | "end_time": "2022-03-27T12:55:47.340144",
1130 | "exception": false,
1131 | "start_time": "2022-03-27T12:55:47.28262",
1132 | "status": "completed"
1133 | },
1134 | "tags": [],
1135 | "id": "3HUW96iDJXwe"
1136 | },
1137 | "execution_count": null,
1138 | "outputs": []
1139 | },
1140 | {
1141 | "cell_type": "code",
1142 | "source": [
1143 | "def mlp(x, dropout_rate, hidden_units):\n",
1144 | " for units in hidden_units:\n",
1145 | " x = layers.Dense(units, activation=tf.nn.gelu)(x)\n",
1146 | " x = layers.Dropout(dropout_rate)(x)\n",
1147 | " return x\n"
1148 | ],
1149 | "metadata": {
1150 | "execution": {
1151 | "iopub.execute_input": "2022-03-27T12:55:47.441672Z",
1152 | "iopub.status.busy": "2022-03-27T12:55:47.441038Z",
1153 | "iopub.status.idle": "2022-03-27T12:55:47.444702Z",
1154 | "shell.execute_reply": "2022-03-27T12:55:47.445314Z",
1155 | "shell.execute_reply.started": "2022-03-27T12:54:01.633413Z"
1156 | },
1157 | "papermill": {
1158 | "duration": 0.056428,
1159 | "end_time": "2022-03-27T12:55:47.44548",
1160 | "exception": false,
1161 | "start_time": "2022-03-27T12:55:47.389052",
1162 | "status": "completed"
1163 | },
1164 | "tags": [],
1165 | "id": "f4N-fQIAJXwf"
1166 | },
1167 | "execution_count": null,
1168 | "outputs": []
1169 | },
1170 | {
1171 | "cell_type": "code",
1172 | "source": [
1173 | "def create_vit_classifier():\n",
1174 | " inputs = layers.Input(shape=(IMAGE_SIZE, IMAGE_SIZE1, 3))\n",
1175 | " # Create patches.\n",
1176 | " patches = Patches()(inputs)\n",
1177 | " # Encode patches.\n",
1178 | " encoded_patches = PatchEncoder()(patches)\n",
1179 | "\n",
1180 | " # Create multiple layers of the Transformer block.\n",
1181 | " for _ in range(ENC_LAYERS):\n",
1182 | " # Layer normalization 1.\n",
1183 | " x1 = layers.LayerNormalization(epsilon=LAYER_NORM_EPS)(encoded_patches)\n",
1184 | " # Create a multi-head attention layer.\n",
1185 | " attention_output = layers.MultiHeadAttention(\n",
1186 | " num_heads=ENC_NUM_HEADS, key_dim=ENC_PROJECTION_DIM, dropout=0.1\n",
1187 | " )(x1, x1)\n",
1188 | " # Skip connection 1.\n",
1189 | " x2 = layers.Add()([attention_output, encoded_patches])\n",
1190 | " # Layer normalization 2.\n",
1191 | " x3 = layers.LayerNormalization(epsilon=LAYER_NORM_EPS)(x2)\n",
1192 | " # MLP.\n",
1193 | " x3 = mlp(x3, hidden_units=ENC_TRANSFORMER_UNITS, dropout_rate=0.1)\n",
1194 | " # Skip connection 2.\n",
1195 | " encoded_patches = layers.Add()([x3, x2])\n",
1196 | " \n",
1197 | "\n",
1198 | " # Create a [batch_size, projection_dim] tensor.\n",
1199 | " representation = layers.LayerNormalization(epsilon=LAYER_NORM_EPS)(encoded_patches)\n",
1200 | " representation = layers.GlobalAveragePooling1D()(representation)\n",
1201 | " \n",
1202 | " # Classify outputs.\n",
1203 | " outputs = layers.Dense(NUM_CLASSES, activation=\"softmax\")(representation)\n",
1204 | " \n",
1205 | " # Create the Keras model.\n",
1206 | " model = keras.Model(inputs=inputs, outputs=outputs)\n",
1207 | " return model"
1208 | ],
1209 | "metadata": {
1210 | "execution": {
1211 | "iopub.execute_input": "2022-03-27T12:55:47.546267Z",
1212 | "iopub.status.busy": "2022-03-27T12:55:47.545545Z",
1213 | "iopub.status.idle": "2022-03-27T12:55:47.552957Z",
1214 | "shell.execute_reply": "2022-03-27T12:55:47.553501Z",
1215 | "shell.execute_reply.started": "2022-03-27T12:54:02.076178Z"
1216 | },
1217 | "papermill": {
1218 | "duration": 0.059666,
1219 | "end_time": "2022-03-27T12:55:47.553682",
1220 | "exception": false,
1221 | "start_time": "2022-03-27T12:55:47.494016",
1222 | "status": "completed"
1223 | },
1224 | "tags": [],
1225 | "id": "V7nRpIG9JXwf"
1226 | },
1227 | "execution_count": null,
1228 | "outputs": []
1229 | },
1230 | {
1231 | "cell_type": "code",
1232 | "source": [
1233 | "with strategy.scope():\n",
1234 | " vit_model = create_vit_classifier()\n",
1235 | " vit_model.compile(\n",
1236 | " optimizer='adam',\n",
1237 | " loss=\"categorical_crossentropy\",#sparse_categorical_crossentropy\n",
1238 | " metrics=[\"accuracy\"]\n",
1239 | " )\n",
1240 | " \n",
1241 | " vit_model.fit(train_x_placeholder,train_y, batch_size= 1,epochs=EPOCHS,verbose=1)\n",
1242 | " "
1243 | ],
1244 | "metadata": {
1245 | "execution": {
1246 | "iopub.execute_input": "2022-03-27T12:55:47.761975Z",
1247 | "iopub.status.busy": "2022-03-27T12:55:47.761385Z",
1248 | "iopub.status.idle": "2022-03-27T12:55:47.765361Z",
1249 | "shell.execute_reply": "2022-03-27T12:55:47.765913Z",
1250 | "shell.execute_reply.started": "2022-03-27T12:54:03.632774Z"
1251 | },
1252 | "papermill": {
1253 | "duration": 0.055226,
1254 | "end_time": "2022-03-27T12:55:47.766119",
1255 | "exception": false,
1256 | "start_time": "2022-03-27T12:55:47.710893",
1257 | "status": "completed"
1258 | },
1259 | "tags": [],
1260 | "id": "sqiBLt3AJXwf",
1261 | "colab": {
1262 | "base_uri": "https://localhost:8080/"
1263 | },
1264 | "outputId": "436463f1-ae72-4263-a0c0-9e4c4ce0b927"
1265 | },
1266 | "execution_count": null,
1267 | "outputs": [
1268 | {
1269 | "output_type": "stream",
1270 | "name": "stdout",
1271 | "text": [
1272 | "20000/20000 [==============================] - 311s 15ms/step - loss: 3.6487 - accuracy: 0.1261\n"
1273 | ]
1274 | }
1275 | ]
1276 | },
1277 | {
1278 | "cell_type": "code",
1279 | "source": [
1280 | "vit_model.summary()"
1281 | ],
1282 | "metadata": {
1283 | "execution": {
1284 | "iopub.execute_input": "2022-03-27T12:55:48.431129Z",
1285 | "iopub.status.busy": "2022-03-27T12:55:48.430442Z",
1286 | "iopub.status.idle": "2022-03-27T12:55:48.450068Z",
1287 | "shell.execute_reply": "2022-03-27T12:55:48.450748Z",
1288 | "shell.execute_reply.started": "2022-03-27T12:54:06.768691Z"
1289 | },
1290 | "papermill": {
1291 | "duration": 0.059597,
1292 | "end_time": "2022-03-27T12:55:48.450924",
1293 | "exception": false,
1294 | "start_time": "2022-03-27T12:55:48.391327",
1295 | "status": "completed"
1296 | },
1297 | "tags": [],
1298 | "id": "LkkD0vCNJXwg",
1299 | "colab": {
1300 | "base_uri": "https://localhost:8080/"
1301 | },
1302 | "outputId": "3b5f9ab3-bf62-4677-ee4d-8a15a6be41f3"
1303 | },
1304 | "execution_count": null,
1305 | "outputs": [
1306 | {
1307 | "output_type": "stream",
1308 | "name": "stdout",
1309 | "text": [
1310 | "Model: \"model_1\"\n",
1311 | "__________________________________________________________________________________________________\n",
1312 | " Layer (type) Output Shape Param # Connected to \n",
1313 | "==================================================================================================\n",
1314 | " input_2 (InputLayer) [(None, 48, 128, 3) 0 [] \n",
1315 | " ] \n",
1316 | " \n",
1317 | " patches_1 (Patches) (None, None, 108) 0 ['input_2[0][0]'] \n",
1318 | " \n",
1319 | " patch_encoder_1 (PatchEncoder) (None, 168, 128) 35456 ['patches_1[0][0]'] \n",
1320 | " \n",
1321 | " layer_normalization_7 (LayerNo (None, 168, 128) 256 ['patch_encoder_1[0][0]'] \n",
1322 | " rmalization) \n",
1323 | " \n",
1324 | " multi_head_attention_3 (MultiH (None, 168, 128) 263808 ['layer_normalization_7[0][0]', \n",
1325 | " eadAttention) 'layer_normalization_7[0][0]'] \n",
1326 | " \n",
1327 | " add_6 (Add) (None, 168, 128) 0 ['multi_head_attention_3[0][0]', \n",
1328 | " 'patch_encoder_1[0][0]'] \n",
1329 | " \n",
1330 | " layer_normalization_8 (LayerNo (None, 168, 128) 256 ['add_6[0][0]'] \n",
1331 | " rmalization) \n",
1332 | " \n",
1333 | " dense_9 (Dense) (None, 168, 256) 33024 ['layer_normalization_8[0][0]'] \n",
1334 | " \n",
1335 | " dropout_6 (Dropout) (None, 168, 256) 0 ['dense_9[0][0]'] \n",
1336 | " \n",
1337 | " dense_10 (Dense) (None, 168, 128) 32896 ['dropout_6[0][0]'] \n",
1338 | " \n",
1339 | " dropout_7 (Dropout) (None, 168, 128) 0 ['dense_10[0][0]'] \n",
1340 | " \n",
1341 | " add_7 (Add) (None, 168, 128) 0 ['dropout_7[0][0]', \n",
1342 | " 'add_6[0][0]'] \n",
1343 | " \n",
1344 | " layer_normalization_9 (LayerNo (None, 168, 128) 256 ['add_7[0][0]'] \n",
1345 | " rmalization) \n",
1346 | " \n",
1347 | " multi_head_attention_4 (MultiH (None, 168, 128) 263808 ['layer_normalization_9[0][0]', \n",
1348 | " eadAttention) 'layer_normalization_9[0][0]'] \n",
1349 | " \n",
1350 | " add_8 (Add) (None, 168, 128) 0 ['multi_head_attention_4[0][0]', \n",
1351 | " 'add_7[0][0]'] \n",
1352 | " \n",
1353 | " layer_normalization_10 (LayerN (None, 168, 128) 256 ['add_8[0][0]'] \n",
1354 | " ormalization) \n",
1355 | " \n",
1356 | " dense_11 (Dense) (None, 168, 256) 33024 ['layer_normalization_10[0][0]'] \n",
1357 | " \n",
1358 | " dropout_8 (Dropout) (None, 168, 256) 0 ['dense_11[0][0]'] \n",
1359 | " \n",
1360 | " dense_12 (Dense) (None, 168, 128) 32896 ['dropout_8[0][0]'] \n",
1361 | " \n",
1362 | " dropout_9 (Dropout) (None, 168, 128) 0 ['dense_12[0][0]'] \n",
1363 | " \n",
1364 | " add_9 (Add) (None, 168, 128) 0 ['dropout_9[0][0]', \n",
1365 | " 'add_8[0][0]'] \n",
1366 | " \n",
1367 | " layer_normalization_11 (LayerN (None, 168, 128) 256 ['add_9[0][0]'] \n",
1368 | " ormalization) \n",
1369 | " \n",
1370 | " multi_head_attention_5 (MultiH (None, 168, 128) 263808 ['layer_normalization_11[0][0]', \n",
1371 | " eadAttention) 'layer_normalization_11[0][0]'] \n",
1372 | " \n",
1373 | " add_10 (Add) (None, 168, 128) 0 ['multi_head_attention_5[0][0]', \n",
1374 | " 'add_9[0][0]'] \n",
1375 | " \n",
1376 | " layer_normalization_12 (LayerN (None, 168, 128) 256 ['add_10[0][0]'] \n",
1377 | " ormalization) \n",
1378 | " \n",
1379 | " dense_13 (Dense) (None, 168, 256) 33024 ['layer_normalization_12[0][0]'] \n",
1380 | " \n",
1381 | " dropout_10 (Dropout) (None, 168, 256) 0 ['dense_13[0][0]'] \n",
1382 | " \n",
1383 | " dense_14 (Dense) (None, 168, 128) 32896 ['dropout_10[0][0]'] \n",
1384 | " \n",
1385 | " dropout_11 (Dropout) (None, 168, 128) 0 ['dense_14[0][0]'] \n",
1386 | " \n",
1387 | " add_11 (Add) (None, 168, 128) 0 ['dropout_11[0][0]', \n",
1388 | " 'add_10[0][0]'] \n",
1389 | " \n",
1390 | " layer_normalization_13 (LayerN (None, 168, 128) 256 ['add_11[0][0]'] \n",
1391 | " ormalization) \n",
1392 | " \n",
1393 | " global_average_pooling1d_1 (Gl (None, 128) 0 ['layer_normalization_13[0][0]'] \n",
1394 | " obalAveragePooling1D) \n",
1395 | " \n",
1396 | " dense_15 (Dense) (None, 151) 19479 ['global_average_pooling1d_1[0][0\n",
1397 | " ]'] \n",
1398 | " \n",
1399 | "==================================================================================================\n",
1400 | "Total params: 1,045,911\n",
1401 | "Trainable params: 1,045,911\n",
1402 | "Non-trainable params: 0\n",
1403 | "__________________________________________________________________________________________________\n"
1404 | ]
1405 | }
1406 | ]
1407 | },
1408 | {
1409 | "cell_type": "code",
1410 | "source": [
1411 | "vit_model.save_weights('BirdClef.h5', overwrite=True)"
1412 | ],
1413 | "metadata": {
1414 | "id": "JOfbCX7LVneO"
1415 | },
1416 | "execution_count": null,
1417 | "outputs": []
1418 | },
1419 | {
1420 | "cell_type": "markdown",
1421 | "source": [
1422 |         "The core of deep learning consists of simple algorithms that scale well. While self-supervised learning methods dominate NLP thanks to models that scale exponentially, computer vision still relies primarily on supervised models. In this paper, the authors observe that using an autoencoder, a simple self-supervised method similar to techniques in NLP, provides scalable benefits; self-supervised learning in vision may now be on a similar trajectory to NLP. Images and language are different types of signals, and these differences must be addressed carefully. Images do not have a semantic decomposition like language, so instead of attempting to remove objects, as is done with words, random patches that most likely do not form semantic segments are removed. Thus, the MAE model reconstructs pixels, which are not semantic entities; this behaviour occurs by way of a rich hidden representation inside the MAE. The method predicts content based on statistics learned from the training dataset and will reflect biases in those data, including ones with a negative societal impact or nonexistent content.\n"
1423 | ],
1424 | "metadata": {
1425 | "papermill": {
1426 | "duration": 0.048961,
1427 | "end_time": "2022-03-27T12:55:48.549486",
1428 | "exception": false,
1429 | "start_time": "2022-03-27T12:55:48.500525",
1430 | "status": "completed"
1431 | },
1432 | "tags": [],
1433 | "id": "CaghtRcXJXwg"
1434 | }
1435 | }
1436 | ]
1437 | }
--------------------------------------------------------------------------------