├── .gitignore
├── README.md
├── docs
│   ├── inter_rcnn_2.png
│   ├── simple_conv_loss_epoch_20.png
│   ├── simple_crnn_2_loss_epoch_7.png
│   ├── simple_crnn_3_loss_epoch_10.png
│   ├── simple_crnn_loss_epoch_20.png
│   └── simple_dense_loss_epoch_10.png
├── main_dynamic.py
├── main_static.py
├── mer
│   ├── __init__.py
│   ├── const.py
│   ├── loss.py
│   ├── model.py
│   └── utils.py
├── requirements.txt
├── server
│   ├── .gitignore
│   ├── README.md
│   ├── app.py
│   ├── requirements.txt
│   └── utils.py
└── wav_converter.py
/.gitignore:
--------------------------------------------------------------------------------
1 | **__pycache__/
2 | dataset/
3 | .vscode/
4 | weights/
5 | history/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Music Emotion Recognition Algorithm using Deep Learning.
2 | Author: Viet Dung Nguyen, Gettysburg College
3 |
4 | `(Introduction to be written)`
5 |
6 | ## Requirements:
7 | 1. System: Windows, Linux, or macOS.
8 | 2. Dataset:
9 | * DEAM Dataset: [https://www.kaggle.com/imsparsh/deam-mediaeval-dataset-emotional-analysis-in-music](https://www.kaggle.com/imsparsh/deam-mediaeval-dataset-emotional-analysis-in-music)
10 | 3. Know the benchmark: [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392)
11 | 4. VS Code (to run the Jupyter-style notebooks, which are written as Python files in VS Code's interactive format)
12 | 5. [Anaconda](https://docs.anaconda.com/anaconda/install/index.html): The Python environment manager (for systematic code execution)
13 |
14 | ## Project structure
15 | * `docs`: Contains images for documentation
16 | * `mer`: The core library for model training and evaluation
17 |   * `__init__.py`: Package initialization
18 |   * `const.py`: Constants for the library
19 |   * `loss.py`: Loss functions for the deep learning models
20 |   * `model.py`: Deep learning models
21 |   * `utils.py`: Utility methods
22 | * `main_dynamic.py`: Main Jupyter-style notebook (Python file format) for training on data with per-second labels.
23 | * `main_static.py`: Main Jupyter-style notebook (Python file format) for training on data with whole-song labels.
24 | * `requirements.txt`: Dependency file
25 | * `wav_converter.py`: Script to convert every MP3 file in a folder to WAV format.
26 |
27 | ## How to run the project
28 |
29 | * First, install the required libraries:
30 | ```
31 | conda create -n mer
32 | conda activate mer
33 | pip install -r requirements.txt
34 | ```
35 | * Next, open either `main_dynamic.py` or `main_static.py` and experiment with it as a VS Code interactive notebook.
36 |
37 | ## Misc:
38 | * Convert MP3 to WAV (Windows cmd syntax):
39 | ```
40 | cd audio
41 | mkdir wav
42 | for %f in (*.mp3) do (ffmpeg -i "%f" -acodec pcm_s16le -ar 44100 "./wav/%f.wav")
43 | ```
44 | or
45 | ```
46 | python wav_converter.py {src} {dst} {ffmpeg bin path}
47 | # E.g: python wav_converter.py "./dataset/DEAM/audio" "./dataset/DEAM/wav" "C:/Users/Alex Nguyen/Documents/ffmpeg-master-latest-win64-gpl/bin/ffmpeg.exe"
48 | ```
49 |
50 | ## Reports
51 |
52 | ### Jan 4, 2022
53 | * We want to experiment with depth-wise and point-wise (MobileNet-style) convolutions to reduce computational cost while keeping comparable performance.
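
A minimal sketch, assuming Keras, of one depth-wise + point-wise block (the filter count and kernel size here are illustrative, not taken from `mer/model.py`):

```
import tensorflow as tf
import tensorflow.keras.layers as L

def depthwise_separable_block(filters, kernel_size=(3, 3)):
  """One MobileNet-style block: depth-wise conv, then point-wise (1x1) conv."""
  return tf.keras.Sequential([
    # Depth-wise: one spatial filter per input channel (no channel mixing).
    L.DepthwiseConv2D(kernel_size, padding="same"),
    L.ReLU(),
    # Point-wise: 1x1 convolution mixes channels and sets the output width.
    L.Conv2D(filters, (1, 1), padding="same"),
    L.ReLU(),
  ])
```

Compared with a standard `Conv2D(filters, (3, 3))`, splitting the spatial and channel mixing is where the computational saving comes from.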
54 |
55 | ### Jan 3, 2022
56 | * We tried a CRNN model with the CBAM attention architecture and a shallow GRU (1 GRU layer). For the first 3 epochs (300 steps with batch size 16) we trained with learning rate 1e-3, then 1e-4. We used MAE loss throughout training and achieved a loss similar to the other models.
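
For reference, a compact sketch of a CBAM-style block (channel attention followed by spatial attention). This is a generic illustration assuming a 4-D feature map; the reduction ratio and kernel size follow the defaults of the CBAM paper rather than the values in this repository's `mer/model.py`:

```
import tensorflow as tf
import tensorflow.keras.layers as L

class CBAMBlock(tf.keras.layers.Layer):
  """Convolutional Block Attention Module: channel then spatial attention."""

  def __init__(self, reduction=8, spatial_kernel=7, **kwargs):
    super().__init__(**kwargs)
    self.reduction = reduction
    self.spatial_kernel = spatial_kernel

  def build(self, input_shape):
    channels = int(input_shape[-1])
    # Shared MLP used by both the average- and max-pooled channel descriptors.
    self.mlp = tf.keras.Sequential([
      L.Dense(channels // self.reduction, activation="relu"),
      L.Dense(channels),
    ])
    self.spatial_conv = L.Conv2D(1, self.spatial_kernel, padding="same",
                                 activation="sigmoid")

  def call(self, x):
    # Channel attention: squeeze the spatial dims with avg and max pooling.
    avg_pool = tf.reduce_mean(x, axis=[1, 2])            # (B, C)
    max_pool = tf.reduce_max(x, axis=[1, 2])             # (B, C)
    channel_att = tf.sigmoid(self.mlp(avg_pool) + self.mlp(max_pool))
    x = x * channel_att[:, tf.newaxis, tf.newaxis, :]
    # Spatial attention: squeeze the channel dim, then one conv + sigmoid.
    avg_map = tf.reduce_mean(x, axis=-1, keepdims=True)  # (B, H, W, 1)
    max_map = tf.reduce_max(x, axis=-1, keepdims=True)   # (B, H, W, 1)
    spatial_att = self.spatial_conv(tf.concat([avg_map, max_map], axis=-1))
    return x * spatial_att
```

Such a block can be dropped between convolution stages, e.g. `x = CBAMBlock()(x)` after a `Conv2D`/`MaxPool2D` pair.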
57 |
58 | ### Dec 31, 2021
59 | * We tried a CRNN model with a deep GRU (3 GRU layers).
60 |
61 |
62 |
63 |
64 | ### Dec 30, 2021
65 | * We tried a CRNN model with a shallow bidirectional LSTM.
66 |
68 |
69 |
70 |
71 |
72 | ### Dec 28, 2021
73 | * We tried a CRNN model; a sketch of the general CRNN architecture used in these experiments is shown below.
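
The CRNN experiments in these reports share one outline: a CNN front-end over the spectrogram, a reshape of the feature map into a sequence, a recurrent layer, and a small dense head that regresses valence and arousal. A hedged sketch with illustrative sizes (the actual layers live in `mer/model.py`):

```
import tensorflow as tf
import tensorflow.keras.layers as L

def build_crnn(time_steps, n_freq, n_channels=2, rnn=lambda: L.GRU(128)):
  """Generic CRNN: conv blocks -> sequence -> RNN -> valence/arousal head."""
  inputs = L.Input(shape=(time_steps, n_freq, n_channels))
  x = inputs
  for filters in (64, 128, 256):
    x = L.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = L.MaxPool2D((2, 2))(x)
  # Collapse the frequency axis so each remaining time step becomes one
  # feature vector, giving a sequence the recurrent layer can consume.
  x = L.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
  x = rnn()(x)
  x = L.Dense(64, activation="relu")(x)
  outputs = L.Dense(2)(x)  # [valence, arousal]
  return tf.keras.Model(inputs, outputs)

# The variants from the reports above are swapped in via the `rnn` argument:
#   deep GRU:           lambda: tf.keras.Sequential([L.GRU(128, return_sequences=True),
#                                                    L.GRU(128, return_sequences=True),
#                                                    L.GRU(128)])
#   bidirectional LSTM: lambda: L.Bidirectional(L.LSTM(128))
```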
74 |
75 |
76 |
77 |
78 | ### Dec 26, 2021
79 | * We tested a simple CNN model with 5 convolution blocks (each block consists of a convolution layer with filter size (3, 3) and leaky ReLU activation, followed by a convolution layer with filter size (1, 1) and leaky ReLU activation). Each block doubles the number of filters (64, 128, 256, 512, 1024); a sketch of one such block is shown below. Here is the training result after 25 epochs:
80 |
81 |
82 |
83 |
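
A minimal sketch of one such convolution block, assuming Keras (the pooling between blocks is an assumption, not stated in the report above):

```
import tensorflow as tf
import tensorflow.keras.layers as L

def simple_conv_block(filters):
  """3x3 conv + leaky ReLU, then 1x1 conv + leaky ReLU, as described above."""
  return tf.keras.Sequential([
    L.Conv2D(filters, (3, 3), padding="same"),
    L.LeakyReLU(alpha=0.1),
    L.Conv2D(filters, (1, 1), padding="same"),
    L.LeakyReLU(alpha=0.1),
    # Assumed: downsample between blocks (not specified in the report).
    L.MaxPool2D((2, 2)),
  ])

# Five blocks, doubling the number of filters each time.
blocks = [simple_conv_block(f) for f in (64, 128, 256, 512, 1024)]
```
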
84 | ### Dec 25, 2021
85 | * The Music Information Retrieval (MIR) field has always been challenging because not many refined datasets have been constructed. Especially for the Music Emotion Recognition (MER) task, to assess the emotion of a song one has to collect the songs as input, which is often not possible because of copyright [\[1\]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392). According to Aljanaki et al. [\[1\]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392), emotion is subjective to the listener and their language and is therefore hard to determine. There are many emotion labeling schemes, such as the emotion-adjective wording scheme from Lin et al. [\[2\]](https://doi.org/10.1145/2037676.2037683) or the two-dimensional regression scheme of the DEAM dataset developed by Aljanaki et al. [\[1\]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392), which utilizes the two orthogonal psychological dimensions discussed by Russell [\[3\]](https://www.researchgate.net/publication/235361517_A_Circumplex_Model_of_Affect).
86 | * There are many traditional music emotion recognition models that utilize sound processing and musical feature detection on the waveform and the spectrogram of the sound, such as the Improved Back Propagation network [\[4\]](https://www.frontiersin.org/articles/10.3389/fpsyg.2021.760060/full) and the MIDI-note, melodic, and dynamic features of [\[5\]](https://www.semanticscholar.org/paper/Novel-Audio-Features-for-Music-Emotion-Recognition-Panda-Malheiro/6feb6c070313992897140a1802fdb8f0bf129422).
87 | * The dataset we use is the DEAM dataset developed by Aljanaki et al. [\[1\]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392). The dataset contains 1082 45-second MP3 soundtracks and a set of annotations. The annotations consist of averaged static (whole-song) and dynamic (per-second) labels of the valence/arousal point on a scale of 10.
88 | * At the beginning of the project, we experiment with the most popular method nowadays: deep learning. We want to apply deep learning to the MER task by using the music (and potentially its standard (Panda et al. [\[5\]](https://www.semanticscholar.org/paper/Novel-Audio-Features-for-Music-Emotion-Recognition-Panda-Malheiro/6feb6c070313992897140a1802fdb8f0bf129422)) and derived features) as input to the deep learning model, and the annotated valence-arousal point (ranging from 0 to 10) as the label.
89 | * We first want to test whether a linear dense network can accurately predict the two values, valence and arousal. We preprocess the audio by performing an STFT on the waveform to get the time-frequency spectrogram of the sound, represented as a 3D array \[`time_length`, `n_frequency`, `n_channel`\] (a typical spectrogram of a 45-second clip has shape (15502, 129, 2)). We then resize this array to a smaller size (i.e., (512, 129, 2)) using bilinear interpolation, flatten it, and feed the resulting vector of 512 * 129 * 2 values through 4 dense layers of 512, 256, 128, and 64 neurons with rectified linear unit activations. The last layer is also a dense layer, with 2 output neurons and a rectified linear unit activation. We simply use the L2 loss as the distance between the output neurons and the actual labelled valence and arousal. For the optimizer, we use stochastic gradient descent with a learning rate of 1e-4. After training the model with batch size 16, 100 steps per epoch, and 10 epochs, we get the following total loss per batch, so the mean squared error is `loss` / `batch_size`. A minimal sketch of this baseline is given below.
90 |
91 |
92 |
93 |
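
A minimal sketch of this dense baseline, assuming the resized (512, 129, 2) spectrogram input described above (the resizing itself can be done with, e.g., `tf.image.resize(spec, (512, 129), method="bilinear")`):

```
import tensorflow as tf
import tensorflow.keras.layers as L

def build_dense_baseline(time_length=512, n_freq=129, n_channels=2):
  """Flattened spectrogram -> 4 dense ReLU layers -> (valence, arousal)."""
  model = tf.keras.Sequential([
    L.Input(shape=(time_length, n_freq, n_channels)),
    L.Flatten(),                        # 512 * 129 * 2 values
    L.Dense(512, activation="relu"),
    L.Dense(256, activation="relu"),
    L.Dense(128, activation="relu"),
    L.Dense(64, activation="relu"),
    L.Dense(2, activation="relu"),      # [valence, arousal]
  ])
  # L2 (mean squared error) loss and SGD with learning rate 1e-4, as in the report.
  model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4), loss="mse")
  return model
```
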
94 | ## Resources
95 |
96 | * Database benchmark:
97 |
98 | https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392
99 |
100 | ```
101 | @article{AlajankiEmoInMusicAnalysis,
102 | author = {Alajanki, Anna and Yang, Yi-Hsuan and Soleymani, Mohammad},
103 | title = {Benchmarking music emotion recognition systems},
104 | journal = {PLOS ONE},
105 | year = {2016},
106 | note= {under review}
107 | }
108 | ```
109 |
110 | https://www.kaggle.com/imsparsh/deam-mediaeval-dataset-emotional-analysis-in-music
111 | https://cvml.unige.ch/databases/DEAM/
112 | https://cvml.unige.ch/databases/DEAM/manual.pdf
113 |
114 | https://www.semanticscholar.org/paper/The-AMG1608-dataset-for-music-emotion-recognition-Chen-Yang/16a21a57c85bae0ada26454300dc5c5891f1c0e2
115 |
116 | The PMEmo Dataset for Music Emotion Recognition. https://dl.acm.org/doi/10.1145/3206025.3206037
117 | https://github.com/HuiZhangDB/PMEmo
118 |
119 | RAVDESS database: https://zenodo.org/record/1188976
120 |
121 | * Label schemes:
122 |
123 | Lin, Y.-C., Yang, Y.-H., and Chen, H. H. (2011). Exploiting online music tags for music emotion classification. ACM Trans. Multimed. Comput. Commun. Appl. 7, 1–16. doi: 10.1145/2037676.2037683
124 |
125 | Russell, James. (1980). A Circumplex Model of Affect. Journal of Personality and Social Psychology. 39. 1161-1178. 10.1037/h0077714.
126 |
127 | * Technical:
128 |
129 | https://www.tensorflow.org/io/tutorials/audio
130 | https://librosa.org/
131 | https://www.tensorflow.org/tutorials/audio/simple_audio
132 |
133 | Visual attention: https://www.youtube.com/watch?v=1mjI_Jm4W1E
134 | Visual attention notebook: https://github.com/EscVM/EscVM_YT/blob/master/Notebooks/0%20-%20TF2.X%20Tutorials/tf_2_visual_attention.ipynb
135 |
136 | YAMNet: https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet.py
137 |
138 | * Related works
139 |
140 | Most-influential MER papers (Semantic Scholar search): https://www.semanticscholar.org/search?fos%5B0%5D=computer-science&q=Music%20Emotion%20Recognition&sort=influence
141 |
142 | Audio-based deep music emotion recognition: https://www.semanticscholar.org/paper/Audio-based-deep-music-emotion-recognition-Liu-Han/6c4ed9c7cad950a6398a9caa2debb2dea0d16f73
143 |
144 | A Novel Music Emotion Recognition Model Using Neural Network Technology: https://www.frontiersin.org/articles/10.3389/fpsyg.2021.760060/full
145 |
146 | Novel audio features for MER: https://www.semanticscholar.org/paper/Novel-Audio-Features-for-Music-Emotion-Recognition-Panda-Malheiro/6feb6c070313992897140a1802fdb8f0bf129422
147 |
148 | Musical texture and expressivity features: https://www.semanticscholar.org/paper/Musical-Texture-and-Expressivity-Features-for-Music-Panda-Malheiro/e4693023ae525b7dd1ecabf494654e7632f148b3
149 |
150 | Speech recognition: http://proceedings.mlr.press/v32/graves14.pdf
151 |
152 | Data augmentation (SpecAugment): https://paperswithcode.com/paper/specaugment-a-simple-data-augmentation-method
153 |
154 | RNN regularization: https://paperswithcode.com/paper/recurrent-neural-network-regularization
155 |
156 | wav2vec 2.0: https://paperswithcode.com/paper/wav2vec-2-0-a-framework-for-self-supervised
157 |
158 | * MER Task:
159 |
160 | Music Mood Detection Based On Audio And Lyrics With Deep Neural Net: https://paperswithcode.com/paper/music-mood-detection-based-on-audio-and
161 | Transformer-based: https://paperswithcode.com/paper/transformer-based-approach-towards-music
162 |
163 | ISMIR tutorial: https://arxiv.org/abs/1709.04396
164 |
165 | * DNN-based:
166 |
167 | https://www.semanticscholar.org/paper/Music-Emotion-Classification-with-Deep-Neural-Nets-Pandeya-Bhattarai/023f9feb933c6e82ed2e7095c285e203d31241dc
168 |
169 |
170 | * CRNN-based:
171 |
172 | Recognizing Song Mood and Theme Using Convolutional Recurrent Neural Networks: https://www.semanticscholar.org/paper/Recognizing-Song-Mood-and-Theme-Using-Convolutional-Mayerl-V%C3%B6tter/5319195f1f7be778a04186bfe4165e3516165a19
173 |
174 | Convolutional Recurrent Neural Networks for Music Classification: https://arxiv.org/abs/1609.04243
175 |
176 | * CNN-based:
177 |
178 | https://www.semanticscholar.org/paper/CNN-based-music-emotion-classification-Liu-Chen/63e83168006678410d137315dd3e8488136aed39
179 |
180 | https://www.semanticscholar.org/paper/Recognition-of-emotion-in-music-based-on-deep-Sarkar-Choudhury/396fd30fa5d2e8821b9413c5a227ec7c902d5b33
181 |
182 | https://www.semanticscholar.org/paper/Music-Emotion-Recognition-by-Using-Chroma-and-Deep-Er-Aydilek/79b35f61ee84f2f7161c98f591b55f0bb31c4d0e
183 |
184 | * Attention:
185 |
186 | Emotion and Themes Recognition in Music with Convolutional and Recurrent Attention-Blocks: https://www.semanticscholar.org/paper/Emotion-and-Themes-Recognition-in-Music-with-and-Gerczuk-Amiriparian/50e46f65dc0dd2d09bb5a08c0fe4872bbe5a2810
187 |
188 | CBAM: https://arxiv.org/abs/1807.06521
189 |
--------------------------------------------------------------------------------
/docs/inter_rcnn_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/inter_rcnn_2.png
--------------------------------------------------------------------------------
/docs/simple_conv_loss_epoch_20.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_conv_loss_epoch_20.png
--------------------------------------------------------------------------------
/docs/simple_crnn_2_loss_epoch_7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_crnn_2_loss_epoch_7.png
--------------------------------------------------------------------------------
/docs/simple_crnn_3_loss_epoch_10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_crnn_3_loss_epoch_10.png
--------------------------------------------------------------------------------
/docs/simple_crnn_loss_epoch_20.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_crnn_loss_epoch_20.png
--------------------------------------------------------------------------------
/docs/simple_dense_loss_epoch_10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_dense_loss_epoch_10.png
--------------------------------------------------------------------------------
/main_dynamic.py:
--------------------------------------------------------------------------------
1 | """
2 | file: main_dynamic.py
3 | author: Alex Nguyen
4 | This file contains code to process the per-second labeled song data (dynamically labeled)
5 | """
6 | # %%
7 |
8 | import os
9 | import pathlib
10 |
11 | import matplotlib.pyplot as plt
12 | import numpy as np
13 | import seaborn as sns
14 | import tensorflow as tf
15 | import sounddevice as sd
16 | import pandas as pd
17 | import tensorflow.keras.layers as L
18 |
19 | from tensorflow.keras import layers
20 | from tensorflow.keras import models
21 | from IPython import display
22 | from tensorflow.python.keras.layers.core import Dropout
23 |
24 | from mer.utils import get_spectrogram, \
25 | plot_spectrogram, \
26 | load_metadata, \
27 | plot_and_play, \
28 | preprocess_waveforms, \
29 | split_train_test, \
30 | tanh_to_sigmoid
31 |
32 | from mer.const import *
33 | from mer.loss import simple_mse_loss, simple_mae_loss
34 | from mer.model import SimpleDenseModel, \
35 |   SimpleConvModel, \
36 |   ConvBlock, \
37 |   ConvBlock2, \
38 |   Simple_CRNN, \
39 |   Simple_CRNN_2, \
40 |   Simple_CRNN_3
41 |
42 | # Set the seed value for experiment reproducibility.
43 | # seed = 42
44 | # tf.random.set_seed(seed)
45 | # np.random.seed(seed)
46 |
47 | sd.default.samplerate = DEFAULT_FREQ
48 |
49 | ANNOTATION_SONG_LEVEL = "./dataset/DEAM/annotations/annotations averaged per song/dynamic (per second annotations)/"
50 | AUDIO_FOLDER = "./dataset/DEAM/wav"
51 | filenames = tf.io.gfile.glob(str(AUDIO_FOLDER) + '/*')
52 |
53 | BATCH_SIZE = 8
54 | DEFAULT_SECOND_PER_TIME_STEP = 0.5
55 |
56 | # Process with average annotation per song.
57 | df = load_metadata(ANNOTATION_SONG_LEVEL)
58 |
59 | train_df, test_df = split_train_test(df, TRAIN_RATIO)
60 | # Process with average annotation per second.
61 | valence_df = pd.read_csv(os.path.join(ANNOTATION_SONG_LEVEL, "valence.csv"), sep=r"\s*,\s*", engine="python")
62 | arousal_df = pd.read_csv(os.path.join(ANNOTATION_SONG_LEVEL, "arousal.csv"), sep=r"\s*,\s*", engine="python")
63 |
64 | assert len(valence_df) == len(arousal_df)
65 |
66 | song_id_df = pd.DataFrame({"song_id": valence_df["song_id"]})
67 |
68 | # NOTE: Split only the song id table, then join that table with valence_df and arousal_df
69 | train_song_ids, test_song_ids = split_train_test(song_id_df, TRAIN_RATIO)
70 |
71 | train_valence_df = train_song_ids.merge(valence_df, on="song_id", how="left").dropna(axis=1)
72 | train_arousal_df = train_song_ids.merge(arousal_df, on="song_id", how="left").dropna(axis=1)
73 |
74 | test_valence_df = test_song_ids.merge(valence_df, on="song_id", how="left").dropna(axis=1)
75 | test_arousal_df = test_song_ids.merge(arousal_df, on="song_id", how="left").dropna(axis=1)
76 |
77 | # Describe (summarize) the dataset labels
78 | # train_valence_flatten_df = pd.DataFrame(np.reshape(train_valence_df.loc[:, train_valence_df.columns != "song_id"].to_numpy(), (-1,)))
79 | # print(train_valence_flatten_df.describe())
80 |
81 | # test_valence_flatten_df = pd.DataFrame(np.reshape(test_valence_df.loc[:, test_valence_df.columns != "song_id"].to_numpy(), (-1,)))
82 | # print(test_valence_flatten_df.describe())
83 |
84 | # Debugging
85 | # pointer = 0
86 | # row = train_valence_df.loc[pointer]
87 | # song_id = row["song_id"]
88 | # assert train_arousal_df.loc[pointer, "song_id"] == song_id, "Wrong row!"
89 | # # Load song and waveform
90 | # song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION)
91 | # audio_file = tf.io.read_file(song_path)
92 | # waveforms, _ = tf.audio.decode_wav(contents=audio_file)
93 | # # Pad to max 45 second. Shape (total_frequency, n_channels)
94 | # waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
95 |
96 | # # Get the labels series
97 | # valence_labels = train_valence_df.loc[pointer, train_valence_df.columns != "song_id"]
98 | # arousal_labels = train_arousal_df.loc[pointer, train_arousal_df.columns != "song_id"]
99 |
100 | # time_pointer = 8
101 | # time_end_point = MIN_TIME_END_POINT + time_pointer * DEFAULT_SECOND_PER_TIME_STEP # 15 + ptr * 0.5
102 |
103 | # end_wave_index = int(time_end_point * DEFAULT_FREQ)
104 | # start_wave_index = int(end_wave_index - WINDOW_SIZE)
105 |
106 | # current_waveforms = waveforms[start_wave_index: end_wave_index, ...]
107 | # # Work on building spectrogram
108 | # # Shape (timestep, frequency, n_channel)
109 | # spectrograms = None
110 | # # Loop through each channel
111 |
112 | # test_wave = get_spectrogram(current_waveforms[..., 0], input_len=current_waveforms.shape[0])
113 | # print(test_wave.shape) # TensorShape([171, 129, 1])
114 |
115 | # for i in range(current_waveforms.shape[-1]):
116 | # # Shape (timestep, frequency, 1)
117 | # spectrogram = get_spectrogram(current_waveforms[..., i], input_len=current_waveforms.shape[0])
118 | # # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps))
119 | # if spectrograms == None:
120 | # spectrograms = spectrogram
121 | # else:
122 | # spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
123 | # pointer += 1
124 |
125 | # padded_spectrogram = np.zeros((SPECTROGRAM_HALF_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float)
126 | # # spectrograms = spectrograms[tf.newaxis, ...]
127 | # # some spectrogram are not the same shape
128 | # padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms
129 |
130 | # print(padded_spectrogram.shape)
131 |
132 | def train_datagen_per_second():
133 | """ Predicting valence mean and arousal mean
134 | """
135 | pointer = 0
136 | while True:
137 | # Reset pointer
138 | if pointer >= len(train_valence_df):
139 | pointer = 0
140 |
141 | row = train_valence_df.loc[pointer]
142 | song_id = row["song_id"]
143 | assert train_arousal_df.loc[pointer, "song_id"] == song_id, "Wrong row!"
144 |
145 | # Load song and waveform
146 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION)
147 | audio_file = tf.io.read_file(song_path)
148 | waveforms, _ = tf.audio.decode_wav(contents=audio_file)
149 | # Pad to max 45 second. Shape (total_frequency, n_channels)
150 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
151 |
152 | # Get the labels series
153 | valence_labels = train_valence_df.loc[pointer, train_valence_df.columns != "song_id"]
154 | arousal_labels = train_arousal_df.loc[pointer, train_arousal_df.columns != "song_id"]
155 | # Loop through the series
156 | for time_pointer, ((valence_time_name, valence), (arousal_time_name, arousal)) in enumerate(zip(valence_labels.iteritems(), arousal_labels.iteritems())):
157 | label = tf.convert_to_tensor([tanh_to_sigmoid(valence), tanh_to_sigmoid(arousal)], dtype=tf.float32)
158 | time_end_point = MIN_TIME_END_POINT + time_pointer * DEFAULT_SECOND_PER_TIME_STEP # 15 + ptr * 0.5
159 |
160 | end_wave_index = int(time_end_point * DEFAULT_FREQ)
161 | start_wave_index = int(end_wave_index - WINDOW_SIZE)
162 |
163 | try:
164 | current_waveforms = waveforms[start_wave_index: end_wave_index, ...]
165 | # Work on building spectrogram
166 | # Shape (timestep, frequency, n_channel)
167 | spectrograms = None
168 | # Loop through each channel
169 | for i in range(current_waveforms.shape[-1]):
170 | # Shape (timestep, frequency, 1)
171 | spectrogram = get_spectrogram(current_waveforms[..., i], input_len=current_waveforms.shape[0])
172 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps))
173 |           if spectrograms is None:
174 | spectrograms = spectrogram
175 | else:
176 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
177 | pointer += 1
178 |
179 | padded_spectrogram = np.zeros((SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float)
180 | # spectrograms = spectrograms[tf.newaxis, ...]
181 | # some spectrogram are not the same shape
182 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms
183 |
184 | yield (tf.convert_to_tensor(padded_spectrogram), label)
185 |
186 |         except Exception as e:
187 |           print(f"Error accessing the waveforms by index: {e}")
188 | break
189 |
190 | # train_dataset = tf.data.Dataset.from_generator(
191 | # train_datagen_per_second,
192 | # output_signature=(
193 | # tf.TensorSpec(shape=(SPECTROGRAM_HALF_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32),
194 | # tf.TensorSpec(shape=(2), dtype=tf.float32)
195 | # )
196 | # )
197 | # train_iter = iter(train_dataset)
198 |
199 | def test_datagen_per_second():
200 | """ Predicting valence mean and arousal mean
201 | """
202 | pointer = 0
203 | while True:
204 | # Reset pointer
205 | if pointer >= len(test_valence_df):
206 | pointer = 0
207 |
208 | row = test_valence_df.loc[pointer]
209 | song_id = row["song_id"]
210 | assert test_arousal_df.loc[pointer, "song_id"] == song_id, "Wrong row!"
211 |
212 | # Load song and waveform
213 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION)
214 | audio_file = tf.io.read_file(song_path)
215 | waveforms, _ = tf.audio.decode_wav(contents=audio_file)
216 | # Pad to max 45 second. Shape (total_frequency, n_channels)
217 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
218 |
219 | # Get the labels series
220 | valence_labels = test_valence_df.loc[pointer, test_valence_df.columns != "song_id"]
221 | arousal_labels = test_arousal_df.loc[pointer, test_arousal_df.columns != "song_id"]
222 | # Loop through the series
223 | for time_pointer, ((valence_time_name, valence), (arousal_time_name, arousal)) in enumerate(zip(valence_labels.iteritems(), arousal_labels.iteritems())):
224 | label = tf.convert_to_tensor([tanh_to_sigmoid(valence), tanh_to_sigmoid(arousal)], dtype=tf.float32)
225 | time_end_point = MIN_TIME_END_POINT + time_pointer * DEFAULT_SECOND_PER_TIME_STEP # 15 + ptr * 0.5
226 |
227 | end_wave_index = int(time_end_point * DEFAULT_FREQ)
228 | start_wave_index = int(end_wave_index - WINDOW_SIZE)
229 |
230 | try:
231 | current_waveforms = waveforms[start_wave_index: end_wave_index, ...]
232 | # Work on building spectrogram
233 | # Shape (timestep, frequency, n_channel)
234 | spectrograms = None
235 | # Loop through each channel
236 | for i in range(current_waveforms.shape[-1]):
237 | # Shape (timestep, frequency, 1)
238 | spectrogram = get_spectrogram(current_waveforms[..., i], input_len=current_waveforms.shape[0])
239 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps))
240 |           if spectrograms is None:
241 | spectrograms = spectrogram
242 | else:
243 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
244 | pointer += 1
245 |
246 | padded_spectrogram = np.zeros((SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float)
247 | # spectrograms = spectrograms[tf.newaxis, ...]
248 | # some spectrogram are not the same shape
249 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms
250 |
251 | yield (tf.convert_to_tensor(padded_spectrogram), label)
252 |
253 |         except Exception as e:
254 |           print(f"Error accessing the waveforms by index: {e}")
255 | break
256 |
257 |
258 | train_dataset = tf.data.Dataset.from_generator(
259 | train_datagen_per_second,
260 | output_signature=(
261 | tf.TensorSpec(shape=(SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32),
262 | tf.TensorSpec(shape=(2), dtype=tf.float32)
263 | )
264 | )
265 | train_batch_dataset = train_dataset.batch(BATCH_SIZE)
266 | # train_batch_dataset = train_batch_dataset.cache().prefetch(tf.data.AUTOTUNE) # OOM error
267 | train_batch_iter = iter(train_batch_dataset)
268 |
269 |
270 | # Uncomment to create a normalization layer.
271 | # NOTE: this is very time consuming because it looks at all the data; only
272 | # use it the first time.
273 | # NOTE: Normally, we create this layer once and save it somewhere to reuse in
274 | # every other model.
275 | #
276 | # norm_layer = L.Normalization()
277 | # norm_layer.adapt(data=train_dataset.map(map_func=lambda spec, label: spec))
278 | #
279 |
280 | test_dataset = tf.data.Dataset.from_generator(
281 | test_datagen_per_second,
282 | output_signature=(
283 | tf.TensorSpec(shape=(SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32),
284 | tf.TensorSpec(shape=(2, ), dtype=tf.float32)
285 | )
286 | )
287 | test_batch_dataset = test_dataset.batch(BATCH_SIZE)
288 | # test_batch_dataset = test_batch_dataset.cache().prefetch(tf.data.AUTOTUNE) # OOM error
289 | test_batch_iter = iter(test_batch_dataset)
290 |
291 | # ds = iter(train_dataset)
292 | # i, o = next(ds)
293 | # log_spec = np.log(i + np.finfo(float).eps)
294 |
295 | # print(tf.reduce_max(i))
296 | # print(tf.reduce_min(i))
297 | # print(tf.reduce_mean(i))
298 |
299 | # print(tf.reduce_max(log_spec))
300 | # print(tf.reduce_min(log_spec))
301 | # print(tf.reduce_mean(log_spec))
302 |
303 | # ii = tf.transpose(i[..., 0], [1,0])
304 | # height = ii.shape[0]
305 | # width = ii.shape[1]
306 | # X = np.linspace(0, np.size(ii), num=width, dtype=int)
307 | # Y = range(height)
308 | # plt.pcolormesh(X, Y, ii)
309 | # plt.show()
310 |
311 | # it = iter(train_dataset)
312 | # i, o = next(it)
313 | # o.shape
314 |
315 |
316 | # %%
317 |
318 | ## Training
319 |
320 | def train_step(batch_x, batch_label, model, loss_function, optimizer, step=-1):
321 | with tf.device("/GPU:0"):
322 | with tf.GradientTape() as tape:
323 | logits = model(batch_x, training=True)
324 | loss = loss_function(batch_label, logits)
325 | grads = tape.gradient(loss, model.trainable_weights)
326 | optimizer.apply_gradients(zip(grads, model.trainable_weights))
327 | return loss
328 |
329 | def train(model,
330 | training_batch_iter,
331 | test_batch_iter,
332 | optimizer,
333 | loss_function,
334 | epochs=1,
335 | steps_per_epoch=20,
336 | valid_step=5,
337 | history_path=None,
338 | weights_path=None,
339 | save_history=False):
340 |
341 | if history_path != None and os.path.exists(history_path):
342 | # Sometimes, we have not created the files
343 | with open(history_path, "rb") as f:
344 | history = np.load(f, allow_pickle=True)
345 | epochs_loss, epochs_val_loss = history
346 | epochs_loss = epochs_loss.tolist()
347 | epochs_val_loss = epochs_val_loss.tolist()
348 | else:
349 | epochs_val_loss = []
350 | epochs_loss = []
351 |
352 | if weights_path != None and os.path.exists(weights_path + ".index"):
353 | try:
354 | model.load_weights(weights_path)
355 | print("Model weights loaded!")
356 |     except Exception:
357 |       print("Cannot load weights!")
358 |
359 | for epoch in range(epochs):
360 | losses = []
361 | val_losses = []
362 | with tf.device("/CPU:0"):
363 | step_pointer = 0
364 | while step_pointer < steps_per_epoch:
365 | batch = next(training_batch_iter)
366 | batch_x = batch[0]
367 | batch_label = batch[1]
368 | loss = train_step(batch_x, batch_label, model, loss_function, optimizer, step=step_pointer + 1)
369 | print(f"Epoch {epoch + 1} - Step {step_pointer + 1} - Loss: {loss}")
370 | losses.append(loss)
371 |
372 | val_batch = next(test_batch_iter)
373 | logits = model(val_batch[0], training=False)
374 | val_loss = loss_function(val_batch[1], logits)
375 | val_losses.append(val_loss)
376 |
377 | if (step_pointer + 1) % valid_step == 0:
378 | print(
379 | "Training loss (for one batch) at step %d: %.4f"
380 | % (step_pointer + 1, float(loss))
381 | )
382 | # perform validation
383 |           print(f"example logits: {logits}")
384 | print(f"Validation loss: {val_loss}\n-----------------")
385 |
386 | step_pointer += 1
387 | epochs_loss.append(losses)
388 | epochs_val_loss.append(val_losses)
389 |
390 | # Save history and model
391 | if history_path != None and save_history:
392 | np.save(history_path, [epochs_loss, epochs_val_loss])
393 |
394 | if weights_path != None:
395 | model.save_weights(weights_path)
396 |
397 | # return history
398 | return [epochs_loss, epochs_val_loss]
399 |
400 | # %%
401 |
402 | """################## Training #################"""
403 |
404 | ## Define model first
405 |
406 | weights_path = "./weights/dynamics/base_shallow_lstm/checkpoint"
407 | history_path = "./history/dynamics/base_shallow_lstm.npy"
408 |
409 | # model = SimpleDenseModel(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL, BATCH_SIZE)
410 | # model.build(input_shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL))
411 | # model.model().summary()
412 |
413 | # model = SimpleConvModel(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL, BATCH_SIZE)
414 | # model.model.load_weights(weights_path)
415 |
416 |
417 | optimizer = tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE)
418 |
419 | # %%
420 |
421 | model = Simple_CRNN_3()
422 | # model.summary()
423 | sample_input = tf.ones(shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))
424 | with tf.device("/CPU:0"):
425 | sample_output = model(sample_input, training=False)
426 | print(sample_output)
427 |
428 | # %%
429 |
430 | # About 50 epochs of 100 steps each will cover the whole training dataset!
431 | history = train(
432 | model,
433 | train_batch_iter,
434 | test_batch_iter,
435 | optimizer,
436 | simple_mae_loss,
437 | epochs=2,
438 | steps_per_epoch=100, # 1200 // 16
439 | valid_step=20,
440 | history_path=history_path,
441 | weights_path=weights_path,
442 | save_history=True
443 | )
444 |
445 |
446 | # %%
447 |
448 | ### MODEL DEBUGGING ###
449 |
450 | class ResBlock(tf.keras.layers.Layer):
451 | def __init__(self, filters, kernel_size_1=(5,5), kernel_size_2=(3,3), **kwargs) -> None:
452 | super().__init__(**kwargs)
453 | self.filters = filters
454 | self.kernel_size_1 = kernel_size_1
455 | self.kernel_size_2 = kernel_size_2
456 |
457 | def build(self, input_shape):
458 | self.conv_norm = L.Conv2D(self.filters, (1, 1), padding="same")
459 | self.conv1 = L.Conv2D(self.filters, self.kernel_size_1, padding="same")
460 | self.conv2 = L.Conv2D(self.filters, self.kernel_size_2, padding="same")
461 |
462 | def call(self, inputs):
463 | skip_tensor = self.conv_norm(inputs)
464 | tensor = L.ReLU()(skip_tensor)
465 | tensor = self.conv1(tensor)
466 | tensor = L.ReLU()(tensor)
467 | tensor = self.conv2(tensor)
468 | tensor = L.Add()([skip_tensor, tensor])
469 | out = L.ReLU()(tensor)
470 | return out
471 |
472 | class ResBlock2(tf.keras.layers.Layer):
473 | def __init__(self, filters, kernel_size=(5,5), **kwargs) -> None:
474 | super().__init__(**kwargs)
475 | self.filters = filters
476 | self.kernel_size = kernel_size
477 |
478 | def build(self, input_shape):
479 | self.conv_norm = L.Conv2D(self.filters, (1, 1), padding="same")
480 | self.conv1 = L.Conv2D(self.filters // 2, (1, 1), padding="same")
481 | self.conv2 = L.Conv2D(self.filters, self.kernel_size, padding="same")
482 |
483 | def call(self, inputs):
484 | skip_tensor = self.conv_norm(inputs)
485 | tensor = L.ReLU()(skip_tensor)
486 | tensor = self.conv1(tensor)
487 | tensor = L.ReLU()(tensor)
488 | tensor = self.conv2(tensor)
489 | tensor = L.Add()([skip_tensor, tensor])
490 | out = L.ReLU()(tensor)
491 | return out
492 |
493 | def base_cnn(task_type="static"):
494 | """ Base CNN Feature extractor for 45 second spectrogram
495 | Input to model shape: (SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)
496 | Output of the model shape: (4, 60, 256)
497 | (Convolved frequency, convolved timestep, feature neurons)
498 | Args:
499 |     task_type (str, optional): There are three values:
500 |       "static" for model evaluation per song,
501 |       "dynamic" for model evaluation per time step; it takes only that time step's waveform as input,
502 |       "seq2seq" for model evaluation per second but takes the whole song as input. Defaults to "static".
503 | Returns:
504 | tf.keras.Model: Return a model
505 | """
506 | model = tf.keras.Sequential(name="base_cnn")
507 | if task_type == "static" or task_type == "seq2seq":
508 | model.add(L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)))
509 | model.add(L.Permute((2, 1, 3)))
510 | model.add(L.Resizing(FREQUENCY_LENGTH, 1024))
511 | elif task_type == "dynamic":
512 | model.add(L.Input(shape=(SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, 2)))
513 | model.add(L.Permute((2, 1, 3)))
514 | model.add(L.Resizing(FREQUENCY_LENGTH, 1024))
515 | else:
516 | print("Wrong parameters")
517 | return
518 |
519 | cnn_config = [
520 | # [32, (5,5), (3, 3)],
521 | [64, (5,5), (3, 3)],
522 | [128, (5,5), (3, 3)],
523 | [256, (5,5), (3, 3)]
524 | # [512, (5,5), (3, 3)],
525 | ]
526 |
527 | for i, (filters, kernel_size_1, kernel_size_2) in enumerate(cnn_config):
528 | # model.add(ResBlock(filters, kernel_size_1, kernel_size_2, name=f"res_block_{i}"))
529 | # model.add(L.MaxPool2D(2,2, name=f"max_pool_{i}"))
530 | model.add(ResBlock(filters, kernel_size_1, name=f"res_block_{i}"))
531 | model.add(L.MaxPool2D(2,2, name=f"max_pool_{i}"))
532 |
533 | model.add(L.Conv2D(128, (1,1), activation="relu"))
534 | model.add(L.Conv2D(64, (1,1), activation="relu"))
535 | model.add(L.Conv2D(32, (1,1), activation="relu"))
536 |
537 | return model
538 |
539 | def model():
540 | base = base_cnn(task_type="dynamic")
541 |
542 | tensor = base.outputs[0]
543 | tensor = L.Permute((2, 3, 1))(tensor)
544 | tensor = L.Reshape((128, -1))(tensor)
545 | tensor = L.LSTM(256)(tensor)
546 | tensor = L.Dense(256, activation=None)(tensor)
547 | tensor = L.Dense(64, activation=None)(tensor)
548 | out = L.Dense(2, activation=None)(tensor)
549 |
550 | model = tf.keras.Model(inputs=base.inputs, outputs=out)
551 | return model
552 |
553 | model = model()
554 | model.summary()
555 | sample_input = tf.ones(shape=(BATCH_SIZE, SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, 2))
556 | with tf.device("/CPU:0"):
557 | sample_output = model(sample_input, training=False)
558 | print(sample_output.shape)
559 |
560 |
561 |
562 |
563 | # %%
564 |
565 |
566 | sample_output.shape
567 |
568 | # %%
569 |
570 |
571 |
572 |
573 | # %%
574 |
575 | # Plot
576 | with open(history_path, "rb") as f:
577 | [epochs_loss, epochs_val_loss] = np.load(f, allow_pickle=True)
578 |
579 |
580 | e_loss = [k[0] for k in epochs_loss]
581 |
582 | e_all_loss = []
583 |
584 | id = 0
585 | time_val = []
586 | for epoch in epochs_loss:
587 | for step in epoch:
588 | e_all_loss.append(step)
589 | id += 1
590 | time_val.append(id)
591 |
592 | e_val_loss = [k[0] for k in epochs_val_loss]
593 |
594 | e_all_val_loss = []
595 |
596 | id = 0
597 | time_val = []
598 | for epoch in epochs_val_loss:
599 | for step in epoch:
600 | e_all_val_loss.append(step)
601 | id += 1
602 | time_val.append(id)
603 |
604 | plt.plot(np.arange(0, len(e_all_loss), 1), e_all_loss, label = "train loss")
605 | # plt.plot(time_val, epochs_val_loss, label = "val loss")
606 | plt.plot(np.arange(0, len(e_all_val_loss), 1), e_all_val_loss, label = "val loss")
607 |
608 | # plt.plot(np.arange(1,len(e_loss)+ 1), e_loss, label = "train loss")
609 | # plt.plot(np.arange(1,len(epochs_val_loss)+ 1), epochs_val_loss, label = "val loss")
610 | plt.xlabel("Step")
611 | plt.ylabel("Loss")
612 | plt.legend()
613 | plt.show()
614 |
615 | # %%
616 |
617 | # model.load_weights(weights_path)
618 | # model.trainable_weights
619 | # y.shape
620 |
621 |
622 | # %%
623 |
624 |
625 |
626 | # model.build(input_shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL))
627 | # model.summary()
628 | # %%
629 |
630 | model.save_weights(weights_path)
631 |
632 |
633 | # %%
634 |
635 |
636 |
637 | model.load_weights(weights_path)
638 |
639 |
640 | # %%
641 |
642 |
643 | def evaluate(df_pointer, model, loss_func, play=False):
644 | row = test_df.loc[df_pointer]
645 | song_id = row["song_id"]
646 | valence_mean = row["valence_mean"]
647 | arousal_mean = row["arousal_mean"]
648 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32)
649 | print(f"Label: Valence: {valence_mean}, Arousal: {arousal_mean}")
650 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION)
651 | audio_file = tf.io.read_file(song_path)
652 | waveforms, _ = tf.audio.decode_wav(contents=audio_file)
653 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
654 | spectrograms = None
655 | # Loop through each channel
656 | for i in range(waveforms.shape[-1]):
657 | # Shape (timestep, frequency, 1)
658 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0])
659 | if spectrograms == None:
660 | spectrograms = spectrogram
661 | else:
662 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
663 |
664 | spectrograms = spectrograms[tf.newaxis, ...]
665 |
666 | ## Eval
667 | y_pred = model(spectrograms, training=False)[0]
668 | print(f"Predicted y_pred value: Valence: {y_pred[0]}, Arousal: {y_pred[1]}")
669 |
670 | loss = loss_func(label[tf.newaxis, ...], y_pred)
671 | print(f"Loss: {loss}")
672 |
673 | if play:
674 | plot_and_play(waveforms, 0, 40, 0)
675 |
676 | i = 0
677 |
678 | # %%
679 |
680 | i += 1
681 | evaluate(i, model, simple_mae_loss, play=False)
682 |
683 | # %%
684 |
685 | ####### INTERMEDIARY REPRESENTATION ########
686 |
687 | layer_list = [l for l in model.layers]
688 | debugging_model = tf.keras.Model(inputs=model.inputs, outputs=[l.output for l in layer_list])
689 |
690 | # %%
691 |
692 | layer_list
693 |
694 | # %%
695 |
696 | test_id = 35
697 | test_time_ptr = 0
698 | time_end_point = MIN_TIME_END_POINT + test_time_ptr
699 | df_id = int(test_time_ptr * 2)
700 | row = test_valence_df.loc[test_id]
701 | song_id = row["song_id"]
702 | # Get the labels series
703 | valence_labels = test_valence_df.loc[test_id, test_valence_df.columns != "song_id"]
704 | arousal_labels = test_arousal_df.loc[test_id, test_arousal_df.columns != "song_id"]
705 |
706 | valence_val = tanh_to_sigmoid(valence_labels.iloc[[df_id]].to_numpy())
707 | arousal_val = tanh_to_sigmoid(arousal_labels.iloc[[df_id]].to_numpy())
708 |
709 | label = tf.convert_to_tensor([valence_val, arousal_val], dtype=tf.float32)
710 | print(f"Label: Valence: {valence_val}, Arousal: {arousal_val}")
711 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION)
712 | audio_file = tf.io.read_file(song_path)
713 | waveforms, _ = tf.audio.decode_wav(contents=audio_file)
714 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
715 |
716 |
717 | end_wave_index = int(time_end_point * DEFAULT_FREQ)
718 | start_wave_index = int(end_wave_index - WINDOW_SIZE)
719 |
720 | current_waveforms = waveforms[start_wave_index: end_wave_index, ...]
721 |
722 | spectrograms = None
723 | # Loop through each channel
724 | for i in range(current_waveforms.shape[-1]):
725 | # Shape (timestep, frequency, 1)
726 | spectrogram = get_spectrogram(current_waveforms[..., i], input_len=current_waveforms.shape[0])
727 | if spectrograms == None:
728 | spectrograms = spectrogram
729 | else:
730 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
731 |
732 | spectrograms = spectrograms[tf.newaxis, ...]
733 |
734 |
735 | ## Eval
736 | y_pred_list = debugging_model(spectrograms, training=False)
737 | print(f"Predicted y_pred value: Valence: {y_pred_list[-1][0, 0]}, Arousal: {y_pred_list[-1][0, 1]}")
738 |
739 | plot_and_play(waveforms, time_end_point - WINDOW_TIME, WINDOW_TIME, 0)
740 |
741 |
742 | # def show_color_mesh(spectrogram):
743 | # """ Generate color mesh
744 |
745 | # Args:
746 | # spectrogram (2D array): Expect shape (Frequency length, time step)
747 | # """
748 | # assert len(spectrogram.shape) == 2
749 | # log_spec = np.log(spectrogram + np.finfo(float).eps)
750 | # height = log_spec.shape[0]
751 | # width = log_spec.shape[1]
752 | # X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
753 | # Y = range(height)
754 | # plt.pcolormesh(X, Y, log_spec)
755 | # plt.show()
756 |
757 | # show_color_mesh(tf.transpose(spectrograms[0, :, :, 0], [1,0]))
758 |
759 |
760 | # %%
761 |
762 | f, axarr = plt.subplots(7,4, figsize=(25,15))
763 | CONVOLUTION_NUMBER_LIST = [2, 3, 4, 5]
764 | LAYER_LIST = [3, 5, 7, 9, 10, 11, 13]
765 | for x, CONVOLUTION_NUMBER in enumerate(CONVOLUTION_NUMBER_LIST):
766 | f1 = y_pred_list[LAYER_LIST[0]]
767 | plot_spectrogram(tf.transpose(f1[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[0,x])
768 | axarr[0,x].grid(False)
769 | f2 = y_pred_list[LAYER_LIST[1]]
770 | plot_spectrogram(tf.transpose(f2[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[1,x])
771 | axarr[1,x].grid(False)
772 | f3 = y_pred_list[LAYER_LIST[2]]
773 | plot_spectrogram(tf.transpose(f3[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[2,x])
774 | axarr[2,x].grid(False)
775 | f4 = y_pred_list[LAYER_LIST[3]]
776 | plot_spectrogram(tf.transpose(f4[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[3,x])
777 | axarr[3,x].grid(False)
778 |
779 | f5 = y_pred_list[LAYER_LIST[4]]
780 | plot_spectrogram(tf.transpose(f5[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[4,x])
781 | axarr[4,x].grid(False)
782 | f6 = y_pred_list[LAYER_LIST[5]]
783 | plot_spectrogram(tf.transpose(f6[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[5,x])
784 | axarr[5,x].grid(False)
785 | f7 = y_pred_list[LAYER_LIST[6]]
786 | # plot_spectrogram(tf.transpose(f7[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[6,x])
787 | plot_spectrogram(f7[0, : , :].numpy(), axarr[6,x])
788 | axarr[6,x].grid(False)
789 | # f8 = y_pred_list[LAYER_LIST[7]]
790 | # plot_spectrogram(tf.transpose(f8[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[7,x])
791 | # axarr[7,x].grid(False)
792 |
793 | axarr[0,0].set_ylabel("After convolution layer 1")
794 | axarr[1,0].set_ylabel("After convolution layer 2")
795 | axarr[2,0].set_ylabel("After convolution layer 3")
796 | axarr[3,0].set_ylabel("After convolution layer 7")
797 |
798 | axarr[0,0].set_title("convolution number 0")
799 | axarr[0,1].set_title("convolution number 4")
800 | axarr[0,2].set_title("convolution number 7")
801 | axarr[0,3].set_title("convolution number 23")
802 |
803 | plt.show()
804 |
805 | # %%
806 |
807 |
--------------------------------------------------------------------------------
/main_static.py:
--------------------------------------------------------------------------------
1 | """
2 | file: main_static.py
3 | author: Alex Nguyen
4 | This file contains code to process the whole-song labeled data (statically labeled)
5 | """
6 | # %%
7 |
8 | import os
9 | import pathlib
10 |
11 | import matplotlib.pyplot as plt
12 | import numpy as np
13 | import seaborn as sns
14 | import tensorflow as tf
15 | import sounddevice as sd
16 | import pandas as pd
17 | import tensorflow.keras.layers as L
18 |
19 | from tensorflow.keras import layers
20 | from tensorflow.keras import models
21 | from IPython import display
22 | from tensorflow.python.keras.layers.core import Dropout
23 |
24 | from mer.utils import get_spectrogram, \
25 | plot_spectrogram, \
26 | load_metadata, \
27 | plot_and_play, \
28 | preprocess_waveforms, \
29 | split_train_test
30 |
31 | from mer.const import *
32 | from mer.loss import simple_mse_loss, simple_mae_loss
33 | from mer.model import SimpleDenseModel, \
34 |   SimpleConvModel, \
35 |   ConvBlock, \
36 |   ConvBlock2, \
37 |   Simple_CRNN, \
38 |   Simple_CRNN_2, \
39 |   Simple_CRNN_3
40 |
41 | # Set the seed value for experiment reproducibility.
42 | # seed = 42
43 | # tf.random.set_seed(seed)
44 | # np.random.seed(seed)
45 |
46 | sd.default.samplerate = DEFAULT_FREQ
47 |
48 | ANNOTATION_SONG_LEVEL = "./dataset/DEAM/annotations/annotations averaged per song/song_level/"
49 | AUDIO_FOLDER = "./dataset/DEAM/wav"
50 | filenames = tf.io.gfile.glob(str(AUDIO_FOLDER) + '/*')
51 |
52 | # Process with average annotation per song.
53 | df = load_metadata(ANNOTATION_SONG_LEVEL)
54 |
55 | train_df, test_df = split_train_test(df, TRAIN_RATIO)
56 |
57 | # test_file = tf.io.read_file(os.path.join(AUDIO_FOLDER, "2011.wav"))
58 | # test_audio, _ = tf.audio.decode_wav(contents=test_file)
59 | # test_audio.shape
60 | # test_audio = preprocess_waveforms(test_audio, WAVE_ARRAY_LENGTH)
61 | # test_audio.shape
62 |
63 | # plot_and_play(test_audio, 24, 5, 0)
64 | # plot_and_play(test_audio, 26, 5, 0)
65 | # plot_and_play(test_audio, 28, 5, 0)
66 | # plot_and_play(test_audio, 30, 5, 0)
67 |
68 | # TODO: Check if all the audio files have the same number of channels
69 |
70 | # TODO: Loop through all music files to get the max-length spectrogram and other specs
71 | # Spectrogram length for 45s audio at 44100 Hz is often 15523
72 | # Largest 3 spectrograms: 16874 at 1198.wav, 103922 at 2001.wav, 216080 at 2011.wav
73 | # The reason there are multiple spectrogram lengths is that the songs have different lengths
74 | # For an exactly 45-second audio clip, the spectrogram time length is 15502.
75 |
76 | # SPECTROGRAM_TIME_LENGTH = 15502
77 | # min_audio_length = 1e8
78 | # for fname in os.listdir(AUDIO_FOLDER):
79 | # song_path = os.path.join(AUDIO_FOLDER, fname)
80 | # audio_file = tf.io.read_file(song_path)
81 | # waveforms, _ = tf.audio.decode_wav(contents=audio_file)
82 | # audio_length = waveforms.shape[0] // DEFAULT_FREQ
83 | # if audio_length < min_audio_length:
84 | # min_audio_length = audio_length
85 | # print(f"The min audio time length is: {min_audio_length} second(s) at {fname}")
86 | # spectrogram = get_spectrogram(waveforms[..., 0], input_len=waveforms.shape[0])
87 | # if spectrogram.shape[0] > SPECTROGRAM_TIME_LENGTH:
88 | # SPECTROGRAM_TIME_LENGTH = spectrogram.shape[0]
89 | # print(f"The max spectrogram time length is: {SPECTROGRAM_TIME_LENGTH} at {fname}")
90 |
91 | # TODO: Get the max and min val of the label. Mean:
92 |
93 | def train_datagen_song_level():
94 | """ Predicting valence mean and arousal mean
95 | """
96 | pointer = 0
97 | while True:
98 | # Reset pointer
99 | if pointer >= len(train_df):
100 | pointer = 0
101 |
102 | row = train_df.loc[pointer]
103 | song_id = row["song_id"]
104 | valence_mean = float(row["valence_mean"])
105 | arousal_mean = float(row["arousal_mean"])
106 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32)
107 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION)
108 | audio_file = tf.io.read_file(song_path)
109 | waveforms, _ = tf.audio.decode_wav(contents=audio_file)
110 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
111 | # print(waveforms.shape)
112 |
113 | # Work on building spectrogram
114 | # Shape (timestep, frequency, n_channel)
115 | spectrograms = None
116 | # Loop through each channel
117 | for i in range(waveforms.shape[-1]):
118 | # Shape (timestep, frequency, 1)
119 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0])
120 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps))
121 |       if spectrograms is None:
122 | spectrograms = spectrogram
123 | else:
124 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
125 | pointer += 1
126 |
127 | padded_spectrogram = np.zeros((SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float)
128 | # spectrograms = spectrograms[tf.newaxis, ...]
129 | # some spectrogram are not the same shape
130 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms
131 |
132 | yield (tf.convert_to_tensor(padded_spectrogram), label)
133 |
134 | def test_datagen_song_level():
135 | """ Predicting valence mean and arousal mean
136 | """
137 | pointer = 0
138 | while True:
139 | # Reset pointer
140 | if pointer >= len(test_df):
141 | pointer = 0
142 |
143 | row = test_df.loc[pointer]
144 | song_id = row["song_id"]
145 | valence_mean = float(row["valence_mean"])
146 | arousal_mean = float(row["arousal_mean"])
147 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32)
148 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION)
149 | audio_file = tf.io.read_file(song_path)
150 | waveforms, _ = tf.audio.decode_wav(contents=audio_file)
151 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
152 | # print(waveforms.shape)
153 |
154 | # Work on building spectrogram
155 | # Shape (timestep, frequency, n_channel)
156 | spectrograms = None
157 | # Loop through each channel
158 | for i in range(waveforms.shape[-1]):
159 | # Shape (timestep, frequency, 1)
160 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0])
161 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps))
162 |       if spectrograms is None:
163 | spectrograms = spectrogram
164 | else:
165 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
166 | pointer += 1
167 |
168 | padded_spectrogram = np.zeros((SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float)
169 | # spectrograms = spectrograms[tf.newaxis, ...]
170 | # some spectrogram are not the same shape
171 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms
172 |
173 | yield (tf.convert_to_tensor(padded_spectrogram), label)
174 |
175 | train_dataset = tf.data.Dataset.from_generator(
176 | train_datagen_song_level,
177 | output_signature=(
178 | tf.TensorSpec(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32),
179 | tf.TensorSpec(shape=(2), dtype=tf.float32)
180 | )
181 | )
182 | train_batch_dataset = train_dataset.batch(BATCH_SIZE)
183 | # train_batch_dataset = train_batch_dataset.cache().prefetch(tf.data.AUTOTUNE) # OOM error
184 | train_batch_iter = iter(train_batch_dataset)
185 |
186 |
187 | # Uncomment to create a normalization layer.
188 | # NOTE: this is very time consuming because it looks at all the data; only
189 | # use it the first time.
190 | # NOTE: Normally, we create this layer once and save it somewhere to reuse in
191 | # every other model.
192 | #
193 | # norm_layer = L.Normalization()
194 | # norm_layer.adapt(data=train_dataset.map(map_func=lambda spec, label: spec))
195 | #
196 |
197 | test_dataset = tf.data.Dataset.from_generator(
198 | test_datagen_song_level,
199 | output_signature=(
200 | tf.TensorSpec(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32),
201 | tf.TensorSpec(shape=(2, ), dtype=tf.float32)
202 | )
203 | )
204 | test_batch_dataset = test_dataset.batch(BATCH_SIZE)
205 | # test_batch_dataset = test_batch_dataset.cache().prefetch(tf.data.AUTOTUNE) # OOM error
206 | test_batch_iter = iter(test_batch_dataset)
207 |
208 | # ds = iter(train_dataset)
209 | # i, o = next(ds)
210 | # log_spec = np.log(i + np.finfo(float).eps)
211 |
212 | # print(tf.reduce_max(i))
213 | # print(tf.reduce_min(i))
214 | # print(tf.reduce_mean(i))
215 |
216 | # print(tf.reduce_max(log_spec))
217 | # print(tf.reduce_min(log_spec))
218 | # print(tf.reduce_mean(log_spec))
219 |
220 | # ii = tf.transpose(i[..., 0], [1,0])
221 | # height = ii.shape[0]
222 | # width = ii.shape[1]
223 | # X = np.linspace(0, np.size(ii), num=width, dtype=int)
224 | # Y = range(height)
225 | # plt.pcolormesh(X, Y, ii)
226 | # plt.show()
227 |
228 |
229 | # %%
230 |
231 | ## Training
232 |
233 | def train_step(batch_x, batch_label, model, loss_function, optimizer, step=-1):
234 | with tf.device("/GPU:0"):
235 | with tf.GradientTape() as tape:
236 | logits = model(batch_x, training=True)
237 | loss = loss_function(batch_label, logits)
238 | grads = tape.gradient(loss, model.trainable_weights)
239 | optimizer.apply_gradients(zip(grads, model.trainable_weights))
240 | return loss
241 |
242 | def train(model,
243 | training_batch_iter,
244 | test_batch_iter,
245 | optimizer,
246 | loss_function,
247 | epochs=1,
248 | steps_per_epoch=20,
249 | valid_step=5,
250 | history_path=None,
251 | weights_path=None,
252 | save_history=False):
253 |
254 | if history_path != None and os.path.exists(history_path):
255 | # Sometimes, we have not created the files
256 | with open(history_path, "rb") as f:
257 | history = np.load(f, allow_pickle=True)
258 | epochs_loss, epochs_val_loss = history
259 | epochs_loss = epochs_loss.tolist()
260 | epochs_val_loss = epochs_val_loss.tolist()
261 | else:
262 | epochs_val_loss = []
263 | epochs_loss = []
264 |
265 | if weights_path != None and os.path.exists(weights_path + ".index"):
266 | try:
267 | model.load_weights(weights_path)
268 | print("Model weights loaded!")
269 |     except Exception:
270 |       print("Cannot load weights!")
271 |
272 | for epoch in range(epochs):
273 | losses = []
274 |
275 | with tf.device("/CPU:0"):
276 | step_pointer = 0
277 | while step_pointer < steps_per_epoch:
278 | batch = next(training_batch_iter)
279 | batch_x = batch[0]
280 | batch_label = batch[1]
281 | loss = train_step(batch_x, batch_label, model, loss_function, optimizer, step=step_pointer + 1)
282 | print(f"Epoch {epoch + 1} - Step {step_pointer + 1} - Loss: {loss}")
283 | losses.append(loss)
284 |
285 | if (step_pointer + 1) % valid_step == 0:
286 | print(
287 | "Training loss (for one batch) at step %d: %.4f"
288 | % (step_pointer + 1, float(loss))
289 | )
290 | # perform validation
291 | val_batch = next(test_batch_iter)
292 | logits = model(val_batch[0], training=False)
293 | val_loss = loss_function(val_batch[1], logits)
294 |           print(f"example logits: {logits}")
295 | print(f"Validation loss: {val_loss}\n-----------------")
296 | if (step_pointer + 1) == steps_per_epoch:
297 | val_batch = next(test_batch_iter)
298 | logits = model(val_batch[0], training=False)
299 | val_loss = loss_function(val_batch[1], logits)
300 | epochs_val_loss.append(val_loss)
301 |
302 | step_pointer += 1
303 | epochs_loss.append(losses)
304 |
305 | # Save history and model
306 | if history_path != None and save_history:
307 | np.save(history_path, [epochs_loss, epochs_val_loss])
308 |
309 | if weights_path != None:
310 | model.save_weights(weights_path)
311 |
312 | # return history
313 | return [epochs_loss, epochs_val_loss]
314 |
315 | # %%
316 |
317 | """################## Training #################"""
318 |
319 | ## Define model first
320 |
321 | weights_path = "./weights/cbam_2/checkpoint"
322 | history_path = "./history/cbam_2.npy"
323 |
324 | # model = SimpleDenseModel(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL, BATCH_SIZE)
325 | # model.build(input_shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL))
326 | # model.model().summary()
327 |
328 | # model = SimpleConvModel(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL, BATCH_SIZE)
329 | # model.model.load_weights(weights_path)
330 |
331 | optimizer = tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE)
332 |
333 | # %%
334 |
335 | model = Simple_CRNN_3()
336 | # model.summary()
337 | sample_input = tf.ones(shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))
338 | with tf.device("/CPU:0"):
339 | sample_output = model(sample_input, training=False)
340 | print(sample_output)
341 |
342 | # %%
343 |
344 | # About 50 epochs with each epoch step 100 will cover the whole training dataset!
345 | history = train(
346 | model,
347 | train_batch_iter,
348 | test_batch_iter,
349 | optimizer,
350 | simple_mae_loss,
351 | epochs=2,
352 | steps_per_epoch=100, # 1800 // 16
353 | valid_step=20,
354 | history_path=history_path,
355 | weights_path=weights_path,
356 | save_history=True
357 | )
358 |
359 | # %%
360 |
361 | ### MODEL DEBUGGING ###
362 |
363 | def base_cnn():
364 | """ Base CNN Feature extractor for 45 second spectrogram
365 | Input to model shape: (SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)
366 | Output of the model shape: (4, 60, 256)
367 | (Convolved frequency, convolved timestep, feature neurons)
368 | Returns:
369 | tf.keras.Model: Return a model
370 | """
371 |
372 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))
373 |
374 | tensor = L.Permute((2, 1, 3))(inputs)
375 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor)
376 |
377 | tensor = L.Conv2D(64, (5,5), padding="valid", name="conv_1_1")(tensor)
378 | tensor = L.ReLU()(tensor)
379 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
380 | tensor = L.Conv2D(64 // 2, (1,1), padding="valid", name="conv_1_2")(tensor)
381 | tensor = L.ReLU()(tensor)
382 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
383 | tensor = L.MaxPool2D(2,2)(tensor)
384 | tensor = L.Dropout(0.1)(tensor)
385 |
386 | tensor = L.Conv2D(128, (5,5), padding="valid", name="conv_2_1")(tensor)
387 | tensor = L.ReLU()(tensor)
388 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
389 | tensor = L.Conv2D(128 // 2, (1,1), padding="valid", name="conv_2_2")(tensor)
390 | tensor = L.ReLU()(tensor)
391 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
392 | tensor = L.MaxPool2D(2,2)(tensor)
393 | tensor = L.Dropout(0.1)(tensor)
394 |
395 | tensor = L.Conv2D(256, (5,5), padding="valid", name="conv_3_1")(tensor)
396 | tensor = L.ReLU()(tensor)
397 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
398 | tensor = L.Conv2D(256 // 2, (1,1), padding="valid", name="conv_3_2")(tensor)
399 | tensor = L.ReLU()(tensor)
400 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
401 | tensor = L.MaxPool2D(2,2)(tensor)
402 | tensor = L.Dropout(0.1)(tensor)
403 |
404 | tensor = L.Conv2D(512, (5,5), padding="valid", name="conv_4_1")(tensor)
405 | tensor = L.ReLU()(tensor)
406 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
407 | tensor = L.Conv2D(512 // 2, (1,1), padding="valid", name="conv_4_2")(tensor)
408 | tensor = L.ReLU()(tensor)
409 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
410 | tensor = L.MaxPool2D(2,2)(tensor)
411 | out = L.Dropout(0.1)(tensor)
412 | model = tf.keras.Model(inputs=inputs, outputs=out, name="base_model")
413 | return model
414 |
415 | class ChannelAttention(tf.keras.layers.Layer):
416 | def __init__(self, neuron: int, ratio: int, use_average=True, **kwargs) -> None:
417 | super().__init__(**kwargs)
418 | self.neuron = neuron
419 | self.ratio = ratio
420 | self.use_average = use_average
421 |
422 | def build(self, input_shape):
423 | """build layers
424 |
425 | Args:
426 | input_shape (tf.shape): the shape of the input
427 |
428 | Returns:
429 | [type]: [description]
430 | """
431 |     assert len(input_shape) == 4, "The input to the layer has to be 4D: (batch, height, width, channels)"
432 | self.first_shared_layer = L.Dense(self.neuron // self.ratio, activation="relu", kernel_initializer="he_normal")
433 | self.second_shared_layer = L.Dense(self.neuron, activation="relu", kernel_initializer="he_normal")
434 |
435 | def call(self, inputs):
436 | if self.use_average:
437 | avg_pool_tensor = L.GlobalAveragePooling2D()(inputs) # Shape (batch, filters)
438 | avg_pool_tensor = L.Reshape((1,1,-1))(avg_pool_tensor) # Shape (batch, 1, 1, filters)
439 | avg_pool_tensor = self.first_shared_layer(avg_pool_tensor)
440 | avg_pool_tensor = self.second_shared_layer(avg_pool_tensor)
441 |
442 | max_pool_tensor = L.GlobalMaxPool2D()(inputs) # Shape (batch, filters)
443 | max_pool_tensor = L.Reshape((1,1,-1))(max_pool_tensor) # Shape (batch, 1, 1, filters)
444 | max_pool_tensor = self.first_shared_layer(max_pool_tensor)
445 | max_pool_tensor = self.second_shared_layer(max_pool_tensor)
446 |
447 | attention_tensor = L.Add()([avg_pool_tensor, max_pool_tensor])
448 | attention_tensor = L.Activation("sigmoid")(attention_tensor)
449 |
450 | out = L.Multiply()([inputs, attention_tensor]) # Broadcast element-wise multiply. (batch, height, width, filters) x (batch, 1, 1, neurons)
451 |
452 | return out
453 | else:
454 | max_pool_tensor = L.GlobalMaxPool2D()(inputs) # Shape (batch, filters)
455 | max_pool_tensor = L.Reshape((1,1,-1))(max_pool_tensor) # Shape (batch, 1, 1, filters)
456 | max_pool_tensor = self.first_shared_layer(max_pool_tensor)
457 | max_pool_tensor = self.second_shared_layer(max_pool_tensor)
458 | attention_tensor = L.Activation("sigmoid")(max_pool_tensor)
459 | out = L.Multiply()([inputs, attention_tensor])
460 | return out
461 |
462 | class SpatialAttention(tf.keras.layers.Layer):
463 | def __init__(self, kernel_size, use_average=True, **kwargs) -> None:
464 | super().__init__(**kwargs)
465 | self.kernel_size = kernel_size
466 | self.use_average = use_average
467 |
468 | def build(self, input_shape):
469 | """build layers
470 |
471 | Args:
472 | input_shape (tf.shape): the shape of the input
473 |
474 | Returns:
475 | [type]: [description]
476 | """
477 |     assert len(input_shape) == 4, "The input to the layer has to be 4D: (batch, height, width, channels)"
478 | self.conv_layer = L.Conv2D(1, self.kernel_size, padding="same", activation="relu",
479 | kernel_initializer="he_normal")
480 |
481 | def call(self, inputs):
482 | if self.use_average:
483 | avg_pool_tensor = L.Lambda(lambda x: tf.reduce_mean(x, axis=-1, keepdims=True))(inputs)
484 | max_pool_tensor = L.Lambda(lambda x: tf.reduce_max(x, axis=-1, keepdims=True))(inputs)
485 | concat_tensor = L.Concatenate(axis=-1)([avg_pool_tensor, max_pool_tensor])
486 | tensor = self.conv_layer(concat_tensor) # shape (height, width, 1)
487 | out = L.Multiply()([inputs, tensor]) # Broadcast element-wise multiply. (batch, height, width, neurons) x (batch, height, width, 1)
488 |
489 | return out
490 | else:
491 | max_pool_tensor = L.Lambda(lambda x: tf.reduce_max(x, axis=-1, keepdims=True))(inputs)
492 | tensor = self.conv_layer(max_pool_tensor) # shape (height, width, 1)
493 | out = L.Multiply()([inputs, tensor]) # Broadcast element-wise multiply. (batch, height, width, neurons) x (batch, height, width, 1)
494 |
495 | return out
496 |
497 | class CBAM_Block(tf.keras.layers.Layer):
498 | """ TODO: Implement Res Block architecture for CBAM Block
499 |
500 | Args:
501 | tf ([type]): [description]
502 | """
503 | def __init__(self,
504 | channel_attention_filters,
505 | channel_attention_ratio,
506 | spatial_attention_kernel_size,
507 | **kwargs) -> None:
508 | super().__init__(**kwargs)
509 | self.channel_attention_filters = channel_attention_filters
510 | self.channel_attention_ratio = channel_attention_ratio
511 | self.spatial_attention_kernel_size = spatial_attention_kernel_size
512 |
513 | def build(self, input_shape):
514 |     assert len(input_shape) == 4, "The input must be 4D: (batch, height, width, channels)!"
515 |
516 |     # NOTE: self.channel_attention_filters is used here because the number of
517 |     # channels of the input to the channel attention has to equal the number of
518 |     # filters in the channel attention.
519 | self.conv_1 = L.Conv2D(self.channel_attention_filters * 2, (5,5), padding="same", activation="relu")
520 | self.conv_2 = L.Conv2D(self.channel_attention_filters, (1,1), padding="same", activation="relu")
521 | self.c_att = ChannelAttention(self.channel_attention_filters, self.channel_attention_ratio)
522 | self.s_att = SpatialAttention(self.spatial_attention_kernel_size)
523 |
524 | def call(self, inputs):
525 | # inputs shape (batch, height, width, channel)
526 | tensor = self.conv_1(inputs) # shape (batch, height, width, filters * 2)
527 | tensor = self.conv_2(tensor) # shape (batch, height, width, filters)
528 | tensor = self.c_att(tensor) # shape (batch, height, width, filters)
529 | tensor = self.s_att(tensor) # shape (batch, height, width, filters)
530 | return tensor
531 |
532 | def cbam_1():
533 |
534 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))
535 | tensor = L.Permute((2, 1, 3))(inputs)
536 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor)
537 |
538 | # tensor = CBAM_Block(32, 2, (5,5))(tensor)
539 | tensor = L.Conv2D(64, (5,5), padding="same", activation="relu")(tensor)
540 | tensor = L.Conv2D(32, (1,1), padding="same", activation="relu")(tensor)
541 | tensor_att_1 = ChannelAttention(32, 2)(tensor)
542 | tensor_att_1 = SpatialAttention((5,5))(tensor_att_1)
543 | tensor = L.Add()([tensor, tensor_att_1])
544 | tensor = L.MaxPool2D(2,2)(tensor)
545 |
546 | # tensor = CBAM_Block(64, 2, (7,7))(tensor)
547 | tensor = L.Conv2D(128, (5,5), padding="same", activation="relu")(tensor)
548 | tensor = L.Conv2D(64, (1,1), padding="same", activation="relu")(tensor)
549 | tensor_att_2 = ChannelAttention(64, 2)(tensor)
550 | tensor_att_2 = SpatialAttention((7,7))(tensor_att_2)
551 | tensor = L.Add()([tensor, tensor_att_2])
552 | tensor = L.MaxPool2D(2,2)(tensor)
553 | # tensor = L.BatchNormalization()(tensor)
554 | # tensor = L.Dropout(0.1)(tensor)
555 |
556 | # tensor = CBAM_Block(128, 2, (7,7))(tensor)
557 | tensor = L.Conv2D(256, (5,5), padding="same", activation="relu")(tensor)
558 | tensor = L.Conv2D(128, (1,1), padding="same", activation="relu")(tensor)
559 | tensor_att_3 = ChannelAttention(128, 2)(tensor)
560 | tensor_att_3 = SpatialAttention((7,7))(tensor_att_3)
561 | tensor = L.Add()([tensor, tensor_att_3])
562 | tensor = L.MaxPool2D(2,2)(tensor)
563 | # tensor = L.BatchNormalization()(tensor)
564 | # tensor = L.Dropout(0.1)(tensor)
565 |
566 | # tensor = CBAM_Block(256, 2, (5,5))(tensor)
567 | tensor = L.Conv2D(512, (5,5), padding="same", activation="relu")(tensor)
568 | tensor = L.Conv2D(256, (1,1), padding="same", activation="relu")(tensor)
569 | tensor_att_4 = ChannelAttention(256, 2)(tensor)
570 | tensor_att_4 = SpatialAttention((5,5))(tensor_att_4)
571 | tensor = L.Add()([tensor, tensor_att_4])
572 | tensor = L.MaxPool2D(2,2)(tensor)
573 | # tensor = L.BatchNormalization()(tensor)
574 | # tensor = L.Dropout(0.1)(tensor)
575 |
576 | # tensor = CBAM_Block(256, 2, (3,3))(tensor)
577 | tensor = L.Conv2D(512, (5,5), padding="same", activation="relu")(tensor)
578 | tensor = L.Conv2D(256, (1,1), padding="same", activation="relu")(tensor)
579 | tensor_att_5 = ChannelAttention(256, 2)(tensor)
580 | tensor_att_5 = SpatialAttention((3,3))(tensor_att_5)
581 | tensor = L.Add()([tensor, tensor_att_5])
582 | tensor = L.MaxPool2D(2,2)(tensor)
583 | # tensor = L.BatchNormalization()(tensor)
584 | # tensor = L.Dropout(0.1)(tensor)
585 |
586 | tensor = L.Permute((2, 1, 3))(tensor)
587 | tensor = L.Reshape((32, 4 * 256))(tensor)
588 |
589 | # tensor = L.GRU(256, activation="tanh", return_sequences=True)(tensor)
590 | # tensor = L.GRU(128, activation="tanh", return_sequences=True)(tensor)
591 | tensor = L.GRU(64, activation="tanh")(tensor)
592 | tensor = L.Dense(512, activation="relu")(tensor)
593 | tensor = L.Dense(256, activation="relu")(tensor)
594 | tensor = L.Dense(64, activation="relu")(tensor)
595 | out = L.Dense(2, activation="relu")(tensor)
596 |
597 | model = tf.keras.Model(inputs=inputs, outputs=out)
598 | return model
599 |
600 | def cbam_2():
601 | """ No average CBAM
602 |
603 | Returns:
604 | tf.keras.Model: The Model
605 | """
606 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))
607 | tensor = L.Permute((2, 1, 3))(inputs)
608 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor)
609 |
610 | # tensor = CBAM_Block(32, 2, (5,5))(tensor)
611 | tensor = L.Conv2D(64, (5,5), padding="same", activation="relu")(tensor)
612 | tensor = L.Conv2D(32, (1,1), padding="same", activation="relu")(tensor)
613 | tensor_att_1 = ChannelAttention(32, 2, use_average=False)(tensor)
614 | tensor_att_1 = SpatialAttention((5,5), use_average=False)(tensor_att_1)
615 | tensor = L.Add()([tensor, tensor_att_1])
616 | tensor = L.MaxPool2D(2,2)(tensor)
617 |
618 | # tensor = CBAM_Block(64, 2, (7,7))(tensor)
619 | tensor = L.Conv2D(128, (5,5), padding="same", activation="relu")(tensor)
620 | tensor = L.Conv2D(64, (1,1), padding="same", activation="relu")(tensor)
621 | tensor_att_2 = ChannelAttention(64, 2, use_average=False)(tensor)
622 | tensor_att_2 = SpatialAttention((7,7), use_average=False)(tensor_att_2)
623 | tensor = L.Add()([tensor, tensor_att_2])
624 | tensor = L.MaxPool2D(2,2)(tensor)
625 | # tensor = L.BatchNormalization()(tensor)
626 | # tensor = L.Dropout(0.1)(tensor)
627 |
628 | # tensor = CBAM_Block(128, 2, (7,7))(tensor)
629 | tensor = L.Conv2D(256, (5,5), padding="same", activation="relu")(tensor)
630 | tensor = L.Conv2D(128, (1,1), padding="same", activation="relu")(tensor)
631 | tensor_att_3 = ChannelAttention(128, 2, use_average=False)(tensor)
632 | tensor_att_3 = SpatialAttention((7,7), use_average=False)(tensor_att_3)
633 | tensor = L.Add()([tensor, tensor_att_3])
634 | tensor = L.MaxPool2D(2,2)(tensor)
635 | # tensor = L.BatchNormalization()(tensor)
636 | # tensor = L.Dropout(0.1)(tensor)
637 |
638 | # tensor = CBAM_Block(256, 2, (5,5))(tensor)
639 | tensor = L.Conv2D(512, (5,5), padding="same", activation="relu")(tensor)
640 | tensor = L.Conv2D(256, (1,1), padding="same", activation="relu")(tensor)
641 | tensor_att_4 = ChannelAttention(256, 2, use_average=False)(tensor)
642 | tensor_att_4 = SpatialAttention((5,5), use_average=False)(tensor_att_4)
643 | tensor = L.Add()([tensor, tensor_att_4])
644 | tensor = L.MaxPool2D(2,2)(tensor)
645 | # tensor = L.BatchNormalization()(tensor)
646 | # tensor = L.Dropout(0.1)(tensor)
647 |
648 | # tensor = CBAM_Block(256, 2, (3,3))(tensor)
649 | tensor = L.Conv2D(512, (5,5), padding="same", activation="relu")(tensor)
650 | tensor = L.Conv2D(256, (1,1), padding="same", activation="relu")(tensor)
651 | tensor_att_5 = ChannelAttention(256, 2, use_average=False)(tensor)
652 | tensor_att_5 = SpatialAttention((3,3), use_average=False)(tensor_att_5)
653 | tensor = L.Add()([tensor, tensor_att_5])
654 | tensor = L.MaxPool2D(2,2)(tensor)
655 | # tensor = L.BatchNormalization()(tensor)
656 | # tensor = L.Dropout(0.1)(tensor)
657 |
658 | tensor = L.Permute((2, 1, 3))(tensor)
659 | tensor = L.Reshape((32, 4 * 256))(tensor)
660 |
661 | # tensor = L.GRU(256, activation="tanh", return_sequences=True)(tensor)
662 | # tensor = L.GRU(128, activation="tanh", return_sequences=True)(tensor)
663 | tensor = L.LSTM(256, activation="tanh")(tensor)
664 | tensor = L.Dense(512, activation="relu")(tensor)
665 | tensor = L.Dense(256, activation="relu")(tensor)
666 | tensor = L.Dense(64, activation="relu")(tensor)
667 | out = L.Dense(2, activation="relu")(tensor)
668 |
669 | model = tf.keras.Model(inputs=inputs, outputs=out)
670 | return model
671 |
672 | model: tf.keras.Model = cbam_2()
673 | model.summary()
674 | sample_input = tf.ones(shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))
675 | with tf.device("/CPU:0"):
676 | sample_output = model(sample_input, training=False)
677 | print(sample_output.shape)
678 |
679 | # TODO: Code the CBAM architecture
680 | # TODO: Code the attention after the CBAM
681 |
682 |
683 |
684 | # %%
685 |
686 | model.load_weights(weights_path)
687 |
688 | # %%
689 |
690 | tf.keras.models.save_model(
691 | model,
692 | "./server/model/my_model",
693 | overwrite=True,
694 | include_optimizer=True,
695 | save_format=None,
696 | signatures=None,
697 | options=None
698 | )
699 |
700 |
701 | # %%
702 |
703 |
704 | sample_output.shape
705 |
706 | # %%
707 |
708 |
709 |
710 |
711 |
712 |
713 |
714 |
715 | # %%
716 |
717 | # Plot
718 | with open(history_path, "rb") as f:
719 | [epochs_loss, epochs_val_loss] = np.load(f, allow_pickle=True)
720 |
721 |
722 | e_loss = [k[0] for k in epochs_loss]
723 |
724 | e_all_loss = []
725 |
726 | id = 0
727 | time_val = []
728 | for epoch in epochs_loss:
729 | for step in epoch:
730 | e_all_loss.append(step.numpy())
731 | id += 1
732 | time_val.append(id)
733 |
734 | # %%
735 |
736 | plt.plot(np.arange(0, len(e_all_loss), 1), e_all_loss, label = "train loss")
737 | plt.plot(time_val, epochs_val_loss, label = "val loss")
738 |
739 | # plt.plot(np.arange(1,len(e_loss)+ 1), e_loss, label = "train loss")
740 | # plt.plot(np.arange(1,len(epochs_val_loss)+ 1), epochs_val_loss, label = "val loss")
741 | plt.xlabel("Step")
742 | plt.ylabel("Loss")
743 | plt.legend()
744 | plt.show()
745 |
746 | # %%
747 |
748 | # model.load_weights(weights_path)
749 | # model.trainable_weights
750 | # y.shape
751 |
752 |
753 | # %%
754 |
755 |
756 |
757 | # model.build(input_shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL))
758 | # model.summary()
759 | # %%
760 |
761 | model.save_weights(weights_path)
762 |
763 |
764 | # %%
765 |
766 |
767 |
768 | model.load_weights(weights_path)
769 |
770 |
771 | # %%
772 |
773 |
774 | def evaluate(df_pointer, model, loss_func, play=False):
775 | row = test_df.loc[df_pointer]
776 | song_id = row["song_id"]
777 | valence_mean = row["valence_mean"]
778 | arousal_mean = row["arousal_mean"]
779 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32)
780 | print(f"Label: Valence: {valence_mean}, Arousal: {arousal_mean}")
781 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION)
782 | audio_file = tf.io.read_file(song_path)
783 | waveforms, _ = tf.audio.decode_wav(contents=audio_file)
784 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
785 | spectrograms = None
786 | # Loop through each channel
787 | for i in range(waveforms.shape[-1]):
788 | # Shape (timestep, frequency, 1)
789 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0])
790 |     if spectrograms is None:
791 | spectrograms = spectrogram
792 | else:
793 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
794 |
795 | spectrograms = spectrograms[tf.newaxis, ...]
796 |
797 | ## Eval
798 | y_pred = model(spectrograms, training=False)[0]
799 | print(f"Predicted y_pred value: Valence: {y_pred[0]}, Arousal: {y_pred[1]}")
800 |
801 | loss = loss_func(label[tf.newaxis, ...], y_pred)
802 | print(f"Loss: {loss}")
803 |
804 | if play:
805 | plot_and_play(waveforms, 0, 40, 0)
806 |
807 | i = 0
808 |
809 | # %%
810 |
811 | i += 1
812 | evaluate(i, model, simple_mae_loss, play=False)
813 |
814 | # %%
815 |
816 | ####### INTERMEDIARY REPRESENTATION ########
817 |
818 | layer_list = [l for l in model.layers]
819 | debugging_model = tf.keras.Model(inputs=model.inputs, outputs=[l.output for l in layer_list])
820 |
821 | # %%
822 |
823 | layer_list
824 |
825 | # %%
826 |
827 | test_id = 223
828 | row = test_df.loc[test_id]
829 | song_id = row["song_id"]
830 | valence_mean = row["valence_mean"]
831 | arousal_mean = row["arousal_mean"]
832 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32)
833 | print(f"Label: Valence: {valence_mean}, Arousal: {arousal_mean}")
834 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION)
835 | audio_file = tf.io.read_file(song_path)
836 | waveforms, _ = tf.audio.decode_wav(contents=audio_file)
837 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
838 | spectrograms = None
839 | # Loop through each channel
840 | for i in range(waveforms.shape[-1]):
841 | # Shape (timestep, frequency, 1)
842 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0])
843 |   if spectrograms is None:
844 | spectrograms = spectrogram
845 | else:
846 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
847 |
848 | spectrograms = spectrograms[tf.newaxis, ...]
849 |
850 | print(label)
851 | # plot_and_play(waveforms, 0, 40, 0)
852 |
853 | ## Eval
854 | y_pred_list = debugging_model(spectrograms, training=False)
855 | print(f"Predicted y_pred value: Valence: {y_pred_list[-1][0, 0]}, Arousal: {y_pred_list[-1][0, 1]}")
856 |
857 |
858 | # %%
859 |
860 | def show_color_mesh(spectrogram):
861 | """ Generate color mesh
862 |
863 | Args:
864 | spectrogram (2D array): Expect shape (Frequency length, time step)
865 | """
866 | assert len(spectrogram.shape) == 2
867 | log_spec = np.log(spectrogram + np.finfo(float).eps)
868 | height = log_spec.shape[0]
869 | width = log_spec.shape[1]
870 | X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
871 | Y = range(height)
872 | plt.pcolormesh(X, Y, log_spec)
873 | plt.show()
874 |
875 | show_color_mesh(tf.transpose(spectrograms[0, :, :, 0], [1,0]))
876 |
877 |
878 | # %%
879 |
880 | f, axarr = plt.subplots(8,8, figsize=(25,15))
881 | CONVOLUTION_NUMBER_LIST = [8, 9, 10, 11, 12, 13, 14, 15]
882 | LAYER_LIST = [10, 11, 12, 13, 16, 17, 18, 19]
883 | for x, CONVOLUTION_NUMBER in enumerate(CONVOLUTION_NUMBER_LIST):
884 | f1 = y_pred_list[LAYER_LIST[0]]
885 | plot_spectrogram(tf.transpose(f1[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[0,x])
886 | axarr[0,x].grid(False)
887 | f2 = y_pred_list[LAYER_LIST[1]]
888 | plot_spectrogram(tf.transpose(f2[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[1,x])
889 | axarr[1,x].grid(False)
890 | f3 = y_pred_list[LAYER_LIST[2]]
891 | plot_spectrogram(tf.transpose(f3[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[2,x])
892 | axarr[2,x].grid(False)
893 | f4 = y_pred_list[LAYER_LIST[3]]
894 | plot_spectrogram(tf.transpose(f4[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[3,x])
895 | axarr[3,x].grid(False)
896 |
897 | f5 = y_pred_list[LAYER_LIST[4]]
898 | plot_spectrogram(tf.transpose(f5[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[4,x])
899 | axarr[4,x].grid(False)
900 | f6 = y_pred_list[LAYER_LIST[5]]
901 | plot_spectrogram(tf.transpose(f6[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[5,x])
902 | axarr[5,x].grid(False)
903 | f7 = y_pred_list[LAYER_LIST[6]]
904 | plot_spectrogram(tf.transpose(f7[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[6,x])
905 | axarr[6,x].grid(False)
906 | f8 = y_pred_list[LAYER_LIST[7]]
907 | plot_spectrogram(tf.transpose(f8[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[7,x])
908 | axarr[7,x].grid(False)
909 |
910 | axarr[0,0].set_ylabel(f"Layer {LAYER_LIST[0]} output")
911 | axarr[1,0].set_ylabel(f"Layer {LAYER_LIST[1]} output")
912 | axarr[2,0].set_ylabel(f"Layer {LAYER_LIST[2]} output")
913 | axarr[3,0].set_ylabel(f"Layer {LAYER_LIST[3]} output")
914 |
915 | axarr[0,0].set_title(f"Convolution number {CONVOLUTION_NUMBER_LIST[0]}")
916 | axarr[0,1].set_title(f"Convolution number {CONVOLUTION_NUMBER_LIST[1]}")
917 | axarr[0,2].set_title(f"Convolution number {CONVOLUTION_NUMBER_LIST[2]}")
918 | axarr[0,3].set_title(f"Convolution number {CONVOLUTION_NUMBER_LIST[3]}")
919 |
920 | plt.show()
921 |
922 | # %%
923 |
924 |
925 | f4 = y_pred_list[3]
926 |
927 | # %%
928 |
929 | f4.shape
930 |
931 | # %%
932 |
933 | f4
934 |
935 | # %%
936 |
937 | w = layer_list[5].weights[0]
938 | w
939 | # %%
940 |
941 | y_pred_list[-1]
942 |
943 | # %%
944 |
945 |
--------------------------------------------------------------------------------
/mer/__init__.py:
--------------------------------------------------------------------------------
1 | from . import *
--------------------------------------------------------------------------------
/mer/const.py:
--------------------------------------------------------------------------------
1 | DEFAULT_FREQ = 44100
2 | DEFAULT_TIME = 45
3 | WAVE_ARRAY_LENGTH = DEFAULT_FREQ * DEFAULT_TIME
4 |
5 | WINDOW_TIME = 5
6 | WINDOW_SIZE = WINDOW_TIME * DEFAULT_FREQ
7 |
8 | TRAIN_RATIO = 0.8
9 |
10 | BATCH_SIZE = 16
11 |
12 | FREQUENCY_LENGTH = 129
13 | N_CHANNEL = 2
14 | SPECTROGRAM_TIME_LENGTH = 15502
15 | SPECTROGRAM_HALF_SECOND_LENGTH = 171
16 | SPECTROGRAM_5_SECOND_LENGTH = 1721
17 | MFCCS_TIME_LENGTH = 3876
18 |
19 | LEARNING_RATE = 1e-4
20 |
21 | SOUND_EXTENSION = ".wav"
22 |
23 | # The minimum second to be labeled in the dynamics files.
24 | MIN_TIME_END_POINT = 15
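25 |
26 | # Derivation sketch (added note, assuming the tf.signal.stft settings used in
27 | # mer/utils.get_spectrogram: frame_length=255, frame_step=128, default fft_length):
28 | # a 45 s, 44.1 kHz waveform yields 1 + (44100 * 45 - 255) // 128 = 15502 STFT
29 | # frames and 256 // 2 + 1 = 129 frequency bins, which is where
30 | # SPECTROGRAM_TIME_LENGTH and FREQUENCY_LENGTH come from.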
--------------------------------------------------------------------------------
/mer/loss.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 |
3 | def simple_mse_loss(true, pred):
4 |
5 | # loss_valence = tf.reduce_sum(tf.square(true[..., 0] - pred[0][..., 0])) / true.shape[0] # divide by batch size
6 | # loss_arousal = tf.reduce_sum(tf.square(true[..., 1] - pred[1][..., 0])) / true.shape[0] # divide by batch size
7 |
8 | # return loss_valence + loss_arousal
9 |
10 | loss = tf.reduce_sum(tf.square(true - pred)) / true.shape[0]
11 |
12 | return loss
13 |
14 | def simple_mae_loss(true, pred):
15 | loss = tf.reduce_sum(tf.abs(true - pred)) / true.shape[0]
16 | return loss
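17 |
18 | # Hedged usage sketch (illustration only, not called by the training code): both
19 | # losses expect `true` and `pred` of shape (batch_size, 2) for (valence, arousal)
20 | # and divide the summed error by the batch size, e.g.
21 | #   true = tf.constant([[5.0, 5.0], [3.0, 7.0]])
22 | #   pred = tf.constant([[4.5, 5.5], [3.0, 6.0]])
23 | #   simple_mae_loss(true, pred)  # (0.5 + 0.5 + 0.0 + 1.0) / 2 = 1.0
24 | #   simple_mse_loss(true, pred)  # (0.25 + 0.25 + 0.0 + 1.0) / 2 = 0.75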
--------------------------------------------------------------------------------
/mer/model.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import tensorflow.keras.layers as L
3 |
4 | from .const import *
5 |
6 | class SimpleDenseModel(tf.keras.Model):
7 | def __init__(self, max_timestep, n_freq, n_channel, batch_size, **kwargs):
8 | super().__init__(**kwargs)
9 | self.max_timestep = max_timestep
10 | self.n_freq = n_freq
11 | self.n_channel = n_channel
12 | self.batch_size = batch_size
13 |
14 | self.resize = tf.keras.layers.Resizing(self.n_freq, 1024)
15 | self.flatten = tf.keras.layers.Flatten()
16 | self.dense1 = tf.keras.layers.Dense(512, activation="relu")
17 | self.dense2 = tf.keras.layers.Dense(256, activation="relu")
18 | self.dense3 = tf.keras.layers.Dense(128, activation="relu")
19 | self.dense4 = tf.keras.layers.Dense(64, activation="relu")
20 | self.dense5 = tf.keras.layers.Dense(2, activation="relu")
21 |
22 | def call(self, x):
23 | """
24 |
25 | Args:
26 | x ([type]): [description]
27 |
28 | Returns:
29 | [type]: [description]
30 | """
31 | # Condense
32 | tensor = self.resize(x)
33 | tensor = self.flatten(tensor)
34 | tensor = self.dense1(tensor)
35 | tensor = self.dense2(tensor)
36 | tensor = self.dense3(tensor)
37 | tensor = self.dense4(tensor)
38 | out = self.dense5(tensor)
39 | return out
40 |
41 | def model(self):
42 | x = tf.keras.layers.Input(shape=(self.max_timestep, self.n_freq, self.n_channel))
43 | return tf.keras.Model(inputs=x, outputs=self.call(x))
44 |
45 | class ConvBlock(tf.keras.Model):
46 | def __init__(self, neurons, **kwargs) -> None:
47 | super().__init__(**kwargs)
48 | self.model = tf.keras.Sequential([
49 | tf.keras.layers.Conv2D(neurons, (3,3), padding="same"),
50 | tf.keras.layers.LeakyReLU(alpha=0.1),
51 | tf.keras.layers.Conv2D(neurons // 2, (1,1), padding="same"),
52 | tf.keras.layers.LeakyReLU(alpha=0.1),
53 | tf.keras.layers.MaxPool2D(2,2),
54 | tf.keras.layers.Dropout(0.1)
55 | ])
56 |
57 | def call(self, x):
58 | return self.model(x)
59 |
60 |
61 | class SimpleConvModel(tf.keras.Model):
62 | def __init__(self, max_timestep, n_freq, n_channel, batch_size, **kwargs):
63 | super().__init__(**kwargs)
64 | self.max_timestep = max_timestep
65 | self.n_freq = n_freq
66 | self.n_channel = n_channel
67 | self.batch_size = batch_size
68 |
69 | neuron_conv = [64, 128, 256, 512, 1024]
70 |
71 | self.model = tf.keras.Sequential()
72 | self.model.add(tf.keras.layers.Resizing(self.n_freq, 512, input_shape=(max_timestep, n_freq, n_channel)))
73 | for neuron in neuron_conv:
74 | self.model.add(ConvBlock(neuron))
75 | self.model.add(tf.keras.layers.Flatten())
76 | self.model.add(tf.keras.layers.Dense(128, activation="relu"))
77 | self.model.add(tf.keras.layers.Dense(2, activation="relu"))
78 | self.model.summary()
79 |
80 | def call(self, x):
81 | """
82 |
83 | Args:
84 | x ([type]): [description]
85 |
86 | Returns:
87 | [type]: [description]
88 | """
89 | # Condense
90 | return self.model(x)
91 |
92 | def model(self):
93 | x = tf.keras.layers.Input(shape=(self.max_timestep, self.n_freq, self.n_channel))
94 | return tf.keras.Model(inputs=x, outputs=self.call(x))
95 |
96 |
97 | class ConvBlock2(tf.keras.Model):
98 | def __init__(self, neurons, **kwargs) -> None:
99 | super().__init__(**kwargs)
100 | self.model = tf.keras.Sequential([
101 | tf.keras.layers.Conv2D(neurons, (5,5), padding="valid"),
102 | tf.keras.layers.LeakyReLU(alpha=0.1),
103 | tf.keras.layers.Conv2D(neurons // 2, (1,1), padding="valid"),
104 | tf.keras.layers.LeakyReLU(alpha=0.1),
105 | tf.keras.layers.MaxPool2D(2,2),
106 | tf.keras.layers.Dropout(0.1)
107 | ])
108 |
109 | def call(self, x):
110 | return self.model(x)
111 |
112 | def Simple_CRNN():
113 | """[summary]
114 |
115 | Args:
116 | inputs (tf.Tensor): Expect tensor shape (batch, width, height, channel)
117 |
118 | Returns:
119 | [type]: [description]
120 | """
121 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))
122 | tensor = L.Permute((2, 1, 3))(inputs)
123 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor)
124 |
125 | tensor = L.Conv2D(64, (5,5), padding="valid")(tensor)
126 | tensor = L.LeakyReLU(alpha=0.1)(tensor)
127 | tensor = L.Conv2D(64 // 2, (1,1), padding="valid")(tensor)
128 | tensor = L.LeakyReLU(alpha=0.1)(tensor)
129 | tensor = L.MaxPool2D(2,2)(tensor)
130 | tensor = L.Dropout(0.1)(tensor)
131 |
132 | tensor = L.Conv2D(128, (5,5), padding="valid")(tensor)
133 | tensor = L.LeakyReLU(alpha=0.1)(tensor)
134 | tensor = L.Conv2D(128 // 2, (1,1), padding="valid")(tensor)
135 | tensor = L.LeakyReLU(alpha=0.1)(tensor)
136 | tensor = L.MaxPool2D(2,2)(tensor)
137 | tensor = L.Dropout(0.1)(tensor)
138 |
139 | tensor = L.Conv2D(256, (5,5), padding="valid")(tensor)
140 | tensor = L.LeakyReLU(alpha=0.1)(tensor)
141 | tensor = L.Conv2D(256 // 2, (1,1), padding="valid")(tensor)
142 | tensor = L.LeakyReLU(alpha=0.1)(tensor)
143 | tensor = L.MaxPool2D(2,2)(tensor)
144 | tensor = L.Dropout(0.1)(tensor)
145 |
146 | tensor = L.Conv2D(512, (5,5), padding="valid")(tensor)
147 | tensor = L.LeakyReLU(alpha=0.1)(tensor)
148 | tensor = L.Conv2D(512 // 2, (1,1), padding="valid")(tensor)
149 | tensor = L.LeakyReLU(alpha=0.1)(tensor)
150 | tensor = L.MaxPool2D(2,2)(tensor)
151 | tensor = L.Dropout(0.1)(tensor)
152 |
153 | tensor = L.MaxPool2D(pool_size=(2,1), strides=(2,1))(tensor)
154 | tensor = L.Conv2D(512, (2,2), padding="valid")(tensor)
155 | tensor = L.LeakyReLU(alpha=0.1)(tensor)
156 | tensor = L.Dropout(0.1)(tensor)
157 |
158 | tensor = L.Reshape((59, 512))(tensor)
159 | tensor = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor)
160 | tensor = L.Bidirectional(L.LSTM(64, return_sequences=True))(tensor)
161 | tensor = L.Bidirectional(L.LSTM(32))(tensor)
162 | tensor = L.Dense(128, activation="relu")(tensor)
163 | out = L.Dense(2, activation="relu")(tensor)
164 |
165 | model = tf.keras.Model(inputs=inputs, outputs=out)
166 | return model
167 |
168 | def Simple_CRNN_2():
169 | """[summary]
170 |
171 | Args:
172 | inputs (tf.Tensor): Expect tensor shape (batch, width, height, channel)
173 |
174 | Returns:
175 | [type]: [description]
176 | """
177 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))
178 | tensor = L.Permute((2, 1, 3))(inputs)
179 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor)
180 |
181 | tensor = L.Conv2D(64, (5,5), padding="valid")(tensor)
182 | tensor = L.ReLU()(tensor)
183 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
184 | tensor = L.Conv2D(64 // 2, (1,1), padding="valid")(tensor)
185 | tensor = L.ReLU()(tensor)
186 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
187 | tensor = L.MaxPool2D(2,2)(tensor)
188 | tensor = L.Dropout(0.1)(tensor)
189 |
190 | tensor = L.Conv2D(128, (5,5), padding="valid")(tensor)
191 | tensor = L.ReLU()(tensor)
192 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
193 | tensor = L.Conv2D(128 // 2, (1,1), padding="valid")(tensor)
194 | tensor = L.ReLU()(tensor)
195 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
196 | tensor = L.MaxPool2D(2,2)(tensor)
197 | tensor = L.Dropout(0.1)(tensor)
198 |
199 | tensor = L.Conv2D(256, (5,5), padding="valid")(tensor)
200 | tensor = L.ReLU()(tensor)
201 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
202 | tensor = L.Conv2D(256 // 2, (1,1), padding="valid")(tensor)
203 | tensor = L.ReLU()(tensor)
204 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
205 | tensor = L.MaxPool2D(2,2)(tensor)
206 | tensor = L.Dropout(0.1)(tensor)
207 |
208 | tensor = L.Conv2D(512, (5,5), padding="valid")(tensor)
209 | tensor = L.ReLU()(tensor)
210 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
211 | tensor = L.Conv2D(512 // 2, (1,1), padding="valid")(tensor)
212 | tensor = L.ReLU()(tensor)
213 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
214 | tensor = L.MaxPool2D(2,2)(tensor)
215 | tensor = L.Dropout(0.1)(tensor)
216 |
217 | tensor = L.Permute((2, 1, 3))(tensor)
218 | tensor = L.Reshape((60, 4 * 256))(tensor)
219 |
220 | # tensor = L.MaxPool2D(pool_size=(2,1), strides=(2,1))(tensor)
221 | # tensor = L.Conv2D(512, (2,2), padding="valid")(tensor)
222 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
223 | # out = L.Dropout(0.1)(tensor)
224 |
225 | # tensor = L.Bidirectional(L.LSTM(128, return_sequences=True, activation="sigmoid"))(tensor)
226 | # tensor = L.Bidirectional(L.LSTM(128, return_sequences=True, activation="sigmoid"))(tensor)
227 | tensor = L.Bidirectional(L.LSTM(128, activation="tanh"))(tensor)
228 | tensor = L.Dense(512, activation="relu")(tensor)
229 | tensor = L.Dense(256, activation="relu")(tensor)
230 | tensor = L.Dense(64, activation="relu")(tensor)
231 | out = L.Dense(2, activation="relu")(tensor)
232 |
233 | # tensor_1 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor)
234 | # tensor_1 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor_1)
235 | # tensor_1 = L.Bidirectional(L.LSTM(128))(tensor_1)
236 | # tensor_1 = L.Dense(512, activation="relu")(tensor_1)
237 | # tensor_1 = L.Dense(256, activation="relu")(tensor_1)
238 | # tensor_1 = L.Dense(64, activation="relu")(tensor_1)
239 | # out_1 = L.Dense(1, activation="relu")(tensor_1)
240 |
241 | # tensor_2 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor)
242 | # tensor_2 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor_2)
243 | # tensor_2 = L.Bidirectional(L.LSTM(128))(tensor_2)
244 | # tensor_2 = L.Dense(512, activation="relu")(tensor_2)
245 | # tensor_2 = L.Dense(256, activation="relu")(tensor_2)
246 | # tensor_2 = L.Dense(64, activation="relu")(tensor_2)
247 | # out_2 = L.Dense(1, activation="relu")(tensor_2)
248 |
249 |
250 | model = tf.keras.Model(inputs=inputs, outputs=out)
251 | return model
252 |
253 |
254 | def Simple_CRNN_3():
255 | """ CRNN that uses GRU
256 |
257 | Args:
258 | inputs (tf.Tensor): Expect tensor shape (batch, width, height, channel)
259 |
260 | Returns:
261 | [type]: [description]
262 | """
263 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))
264 | tensor = L.Permute((2, 1, 3))(inputs)
265 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor)
266 |
267 | tensor = L.Conv2D(64, (5,5), padding="valid")(tensor)
268 | tensor = L.ReLU()(tensor)
269 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
270 | tensor = L.Conv2D(64 // 2, (1,1), padding="valid")(tensor)
271 | tensor = L.ReLU()(tensor)
272 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
273 | tensor = L.MaxPool2D(2,2)(tensor)
274 | tensor = L.Dropout(0.1)(tensor)
275 |
276 | tensor = L.Conv2D(128, (5,5), padding="valid")(tensor)
277 | tensor = L.ReLU()(tensor)
278 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
279 | tensor = L.Conv2D(128 // 2, (1,1), padding="valid")(tensor)
280 | tensor = L.ReLU()(tensor)
281 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
282 | tensor = L.MaxPool2D(2,2)(tensor)
283 | tensor = L.Dropout(0.1)(tensor)
284 |
285 | tensor = L.Conv2D(256, (5,5), padding="valid")(tensor)
286 | tensor = L.ReLU()(tensor)
287 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
288 | tensor = L.Conv2D(256 // 2, (1,1), padding="valid")(tensor)
289 | tensor = L.ReLU()(tensor)
290 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
291 | tensor = L.MaxPool2D(2,2)(tensor)
292 | tensor = L.Dropout(0.1)(tensor)
293 |
294 | tensor = L.Conv2D(512, (5,5), padding="valid")(tensor)
295 | tensor = L.ReLU()(tensor)
296 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
297 | tensor = L.Conv2D(512 // 2, (1,1), padding="valid")(tensor)
298 | tensor = L.ReLU()(tensor)
299 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
300 | tensor = L.MaxPool2D(2,2)(tensor)
301 | tensor = L.Dropout(0.1)(tensor)
302 |
303 | tensor = L.Permute((2, 1, 3))(tensor)
304 | tensor = L.Reshape((60, 4 * 256))(tensor)
305 |
306 | # tensor = L.MaxPool2D(pool_size=(2,1), strides=(2,1))(tensor)
307 | # tensor = L.Conv2D(512, (2,2), padding="valid")(tensor)
308 | # tensor = L.LeakyReLU(alpha=0.1)(tensor)
309 | # out = L.Dropout(0.1)(tensor)
310 |
311 | tensor = L.GRU(256, activation="tanh", return_sequences=True)(tensor)
312 | tensor = L.GRU(128, activation="tanh", return_sequences=True)(tensor)
313 | tensor = L.GRU(64, activation="tanh")(tensor)
314 | tensor = L.Dense(512, activation="relu")(tensor)
315 | tensor = L.Dense(256, activation="relu")(tensor)
316 | tensor = L.Dense(64, activation="relu")(tensor)
317 | out = L.Dense(2, activation="relu")(tensor)
318 |
319 | # tensor_1 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor)
320 | # tensor_1 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor_1)
321 | # tensor_1 = L.Bidirectional(L.LSTM(128))(tensor_1)
322 | # tensor_1 = L.Dense(512, activation="relu")(tensor_1)
323 | # tensor_1 = L.Dense(256, activation="relu")(tensor_1)
324 | # tensor_1 = L.Dense(64, activation="relu")(tensor_1)
325 | # out_1 = L.Dense(1, activation="relu")(tensor_1)
326 |
327 | # tensor_2 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor)
328 | # tensor_2 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor_2)
329 | # tensor_2 = L.Bidirectional(L.LSTM(128))(tensor_2)
330 | # tensor_2 = L.Dense(512, activation="relu")(tensor_2)
331 | # tensor_2 = L.Dense(256, activation="relu")(tensor_2)
332 | # tensor_2 = L.Dense(64, activation="relu")(tensor_2)
333 | # out_2 = L.Dense(1, activation="relu")(tensor_2)
334 |
335 |
336 | model = tf.keras.Model(inputs=inputs, outputs=out)
337 | return model
--------------------------------------------------------------------------------
/mer/utils.py:
--------------------------------------------------------------------------------
1 |
2 | import tensorflow as tf
3 | import numpy as np
4 | import os
5 | import pandas as pd
6 | import matplotlib.pyplot as plt
7 | import sounddevice as sd
8 |
9 | from .const import *
10 |
11 | def get_spectrogram(waveform, input_len=44100):
12 | """ Check out https://www.tensorflow.org/io/tutorials/audio
13 |
14 | Args:
15 | waveform ([type]): Expect waveform array of shape (>44100,)
16 | input_len (int, optional): [description]. Defaults to 44100.
17 |
18 | Returns:
19 | Tensor: Spectrogram of the 1D waveform. Shape (freq, time, 1)
20 | """
21 | max_zero_padding = min(input_len, tf.shape(waveform))
22 | # Zero-padding for an audio waveform with less than 44,100 samples.
23 | waveform = waveform[:input_len]
24 | zero_padding = tf.zeros(
25 | (input_len - max_zero_padding),
26 | dtype=tf.float32)
27 | # Cast the waveform tensors' dtype to float32.
28 | waveform = tf.cast(waveform, dtype=tf.float32)
29 | # Concatenate the waveform with `zero_padding`, which ensures all audio
30 | # clips are of the same length.
31 | equal_length = tf.concat([waveform, zero_padding], 0)
32 | # Convert the waveform to a spectrogram via a STFT.
33 | spectrogram = tf.signal.stft(
34 | equal_length, frame_length=255, frame_step=128)
35 | # Obtain the magnitude of the STFT.
36 | spectrogram = tf.abs(spectrogram)
37 | # Add a `channels` dimension, so that the spectrogram can be used
38 | # as image-like input data with convolution layers (which expect
39 | # shape (`batch_size`, `height`, `width`, `channels`).
40 | spectrogram = spectrogram[..., tf.newaxis]
41 | return spectrogram
42 |
43 | def plot_spectrogram(spectrogram, ax):
44 | """ Check out https://www.tensorflow.org/io/tutorials/audio
45 |
46 | Args:
47 | spectrogram ([type]): Expect shape (time step, frequency)
48 | ax (plt.axes[i]): [description]
49 | """
50 | if len(spectrogram.shape) > 2:
51 | assert len(spectrogram.shape) == 3
52 | spectrogram = np.squeeze(spectrogram, axis=-1)
53 | # Convert the frequencies to log scale and transpose, so that the time is
54 | # represented on the x-axis (columns).
55 | # Add an epsilon to avoid taking a log of zero.
56 | log_spec = np.log(spectrogram.T + np.finfo(float).eps)
57 | height = log_spec.shape[0]
58 | width = log_spec.shape[1]
59 | X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
60 | Y = range(height)
61 | ax.pcolormesh(X, Y, log_spec)
62 |
63 | def load_metadata(csv_folder):
64 | """ Pandas load multiple csv file and concat them into one df.
65 |
66 | Args:
67 | csv_folder (str): Path to the csv folder
68 |
69 | Returns:
70 |     pd.DataFrame: The concatenated DataFrame.
71 | """
72 | global_df = pd.DataFrame()
73 | for i, fname in enumerate(os.listdir(csv_folder)):
74 | # headers: song_id, valence_mean, valence_std, arousal_mean, arousal_std
75 | df = pd.read_csv(os.path.join(csv_folder, fname), sep=r"\s*,\s*", engine="python")
76 | global_df = pd.concat([global_df, df], axis=0)
77 |
78 | # Reset the index
79 | global_df = global_df.reset_index(drop=True)
80 |
81 | return global_df
82 |
83 | def split_train_test(df: pd.DataFrame, train_ratio: float):
84 | train_size = int(len(df) * train_ratio)
85 | train_df: pd.DataFrame = df[:train_size]
86 | train_df = train_df.reset_index(drop=True)
87 | test_df: pd.DataFrame = df[train_size:]
88 | test_df = test_df.reset_index(drop=True)
89 | return train_df, test_df
90 |
91 | def plot_and_play(test_audio, second_id = 24.0, second_length = 1, channel = 0):
92 | """ Plot and play
93 |
94 | Args:
95 | test_audio ([type]): [description]
96 | second_id (float, optional): [description]. Defaults to 24.0.
97 | second_length (int, optional): [description]. Defaults to 1.
98 | channel (int, optional): [description]. Defaults to 0.
99 | """
100 | # Spectrogram of one second
101 | from_id = int(DEFAULT_FREQ * second_id)
102 | to_id = min(int(DEFAULT_FREQ * (second_id + second_length)), test_audio.shape[0])
103 |
104 | test_spectrogram = get_spectrogram(test_audio[from_id:, channel], input_len=int(DEFAULT_FREQ * second_length))
105 | print(test_spectrogram.shape)
106 | fig, axes = plt.subplots(2, figsize=(12, 8))
107 | timescale = np.arange(to_id - from_id)
108 | axes[0].plot(timescale, test_audio[from_id:to_id, channel].numpy())
109 | axes[0].set_title('Waveform')
110 | axes[0].set_xlim([0, int(DEFAULT_FREQ * second_length)])
111 |
112 | plot_spectrogram(test_spectrogram.numpy(), axes[1])
113 | axes[1].set_title('Spectrogram')
114 | plt.show()
115 |
116 | # Play sound
117 | sd.play(test_audio[from_id: to_id, channel], blocking=True)
118 |
119 | def preprocess_waveforms(waveforms, input_len):
120 | """ Get the first input_len value of the waveforms, if not exist, pad it with 0.
121 |
122 | Args:
123 | waveforms ([type]): [description]
124 | input_len ([type]): [description]
125 |
126 | Returns:
127 | [type]: [description]
128 | """
129 | n_channel = waveforms.shape[-1]
130 | preprocessed = np.zeros((input_len, n_channel))
131 | if input_len <= waveforms.shape[0]:
132 | preprocessed = waveforms[:input_len, :]
133 | else:
134 | preprocessed[:waveforms.shape[0], :] = waveforms
135 | return tf.convert_to_tensor(preprocessed)
136 |
137 | def tanh_to_sigmoid(inputs):
138 | """ Convert from tanh range to sigmoid range
139 |
140 | Args:
141 | inputs (): number of np array of number
142 |
143 | Returns:
144 | number or array-like object: changed range object
145 | """
146 | return (inputs + 1.0) / 2.0
147 |
148 | def get_CAM(model, img, actual_label, loss_func, layer_name='block5_conv3'):
149 |
150 | model_grad = tf.keras.Model(model.inputs,
151 | [model.get_layer(layer_name).output, model.output])
152 |
153 | with tf.GradientTape() as tape:
154 | conv_output_values, predictions = model_grad(img)
155 |
156 | # watch the conv_output_values
157 | tape.watch(conv_output_values)
158 |
159 | # Calculate loss as in the loss func
160 | try:
161 | loss, _ = loss_func(actual_label, predictions)
162 | except:
163 | loss = loss_func(actual_label, predictions)
164 | print(f"Loss: {loss}")
165 |
166 | # get the gradient of the loss with respect to the outputs of the last conv layer
167 | grads_values = tape.gradient(loss, conv_output_values)
168 | grads_values = tf.reduce_mean(grads_values, axis=(0,1,2))
169 |
170 | conv_output_values = np.squeeze(conv_output_values.numpy())
171 | grads_values = grads_values.numpy()
172 |
173 | # weight the convolution outputs with the computed gradients
174 | for i in range(conv_output_values.shape[-1]):
175 | conv_output_values[:,:,i] *= grads_values[i]
176 | heatmap = np.mean(conv_output_values, axis=-1)
177 |
178 | heatmap = np.maximum(heatmap, 0)
179 | heatmap /= heatmap.max()
180 |
181 | del model_grad, conv_output_values, grads_values, loss
182 |
183 | return heatmap
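184 |
185 | # Note (added): `layer_name` must name a layer that exists in the supplied model;
186 | # the 'block5_conv3' default comes from VGG-style backbones and does not exist in
187 | # the models defined in mer/model.py, so pass an explicit layer name when using
188 | # get_CAM with those models.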
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | matplotlib==3.5.1
2 | numpy==1.19.5
3 | opencv-python==4.5.4.60
4 | pandas==1.0.5
5 | Pillow==8.4.0
6 | scipy==1.7.3
7 | seaborn==0.11.2
8 | sounddevice==0.4.3
9 | tensorflow==2.6.0
--------------------------------------------------------------------------------
/server/.gitignore:
--------------------------------------------------------------------------------
1 | model
--------------------------------------------------------------------------------
/server/README.md:
--------------------------------------------------------------------------------
1 | # Server for serving music emotion recognition model
2 |
3 | ## Requirements:
4 | 1. System: Windows, Linux, and macOS.
5 | 2. Python
6 |
7 | ## Running the server:
8 | 1. Export the model to `server/model/my_model`
9 | 2. Run:
10 | ```
11 | cd server
12 | pip install -r requirements.txt
13 | python app.py
14 | ```
15 | 3. Or to start a deployment server, run:
16 | ```
17 | uvicorn app:app --host 0.0.0.0 --port 80
18 | ```
19 |
20 |
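21 | ## Example request
22 | Once the server is running, send a wav file to the `/predict/sound` endpoint.
23 | The snippet below is a minimal sketch assuming the server listens on
24 | `localhost:80` (as in the uvicorn command above) and that `example.wav` exists
25 | in the current directory:
26 | ```
27 | curl -X POST -F "file=@example.wav" http://localhost:80/predict/sound
28 | ```
29 | The response is the predicted `[valence, arousal]` tensor rendered as a string.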
--------------------------------------------------------------------------------
/server/app.py:
--------------------------------------------------------------------------------
1 | import uvicorn
2 | from fastapi import FastAPI, File, UploadFile
3 | from fastapi.responses import HTMLResponse
4 | import tensorflow as tf
5 | import numpy as np
6 |
7 | app = FastAPI()
8 | model = tf.keras.models.load_model("./model/my_model/")
9 |
10 | DEFAULT_FREQ = 44100
11 | DEFAULT_TIME = 45
12 | WAVE_ARRAY_LENGTH = DEFAULT_FREQ * DEFAULT_TIME
13 | FREQUENCY_LENGTH = 129
14 | N_CHANNEL = 2
15 | SPECTROGRAM_TIME_LENGTH = 15502
16 |
17 | def preprocess_waveforms(waveforms, input_len):
18 | """ Get the first input_len value of the waveforms, if not exist, pad it with 0.
19 |
20 | Args:
21 | waveforms ([type]): [description]
22 | input_len ([type]): [description]
23 |
24 | Returns:
25 | [type]: [description]
26 | """
27 | n_channel = waveforms.shape[-1]
28 | preprocessed = np.zeros((input_len, n_channel))
29 | if input_len <= waveforms.shape[0]:
30 | preprocessed = waveforms[:input_len, :]
31 | else:
32 | preprocessed[:waveforms.shape[0], :] = waveforms
33 | return tf.convert_to_tensor(preprocessed)
34 |
35 | def get_spectrogram(waveform, input_len=44100):
36 | """ Check out https://www.tensorflow.org/io/tutorials/audio
37 |
38 | Args:
39 | waveform ([type]): Expect waveform array of shape (>44100,)
40 | input_len (int, optional): [description]. Defaults to 44100.
41 |
42 | Returns:
43 | Tensor: Spectrogram of the 1D waveform. Shape (freq, time, 1)
44 | """
45 | max_zero_padding = min(input_len, tf.shape(waveform))
46 | # Zero-padding for an audio waveform with less than 44,100 samples.
47 | waveform = waveform[:input_len]
48 | zero_padding = tf.zeros(
49 | (input_len - max_zero_padding),
50 | dtype=tf.float32)
51 | # Cast the waveform tensors' dtype to float32.
52 | waveform = tf.cast(waveform, dtype=tf.float32)
53 | # Concatenate the waveform with `zero_padding`, which ensures all audio
54 | # clips are of the same length.
55 | equal_length = tf.concat([waveform, zero_padding], 0)
56 | # Convert the waveform to a spectrogram via a STFT.
57 | spectrogram = tf.signal.stft(
58 | equal_length, frame_length=255, frame_step=128)
59 | # Obtain the magnitude of the STFT.
60 | spectrogram = tf.abs(spectrogram)
61 | # Add a `channels` dimension, so that the spectrogram can be used
62 | # as image-like input data with convolution layers (which expect
63 | # shape (`batch_size`, `height`, `width`, `channels`).
64 | spectrogram = spectrogram[..., tf.newaxis]
65 | return spectrogram
66 |
67 | def predict(sound: bytes):
68 | waveforms, _ = tf.audio.decode_wav(contents=sound)
69 |   # Pad to a max of 45 seconds. Shape (n_samples, n_channels)
70 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH)
71 | # Work on building spectrogram
72 | # Shape (timestep, frequency, n_channel)
73 | spectrograms = None
74 | # Loop through each channel
75 | for i in range(waveforms.shape[-1]):
76 | # Shape (timestep, frequency, 1)
77 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0])
78 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps))
79 |     if spectrograms is None:
80 | spectrograms = spectrogram
81 | else:
82 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1)
83 |
84 |
85 | padded_spectrogram = np.zeros((SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float)
86 | # spectrograms = spectrograms[tf.newaxis, ...]
87 |   # Some spectrograms are not the same shape
88 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms
89 |
90 | sample_input = tf.convert_to_tensor(padded_spectrogram)
91 | prediction = model(sample_input[tf.newaxis, ...], training=False)[0, ...]
92 | return prediction
93 |
94 | @app.post("/predict/sound")
95 | async def predict_api(file: UploadFile = File(...)):
96 |     extension = file.filename.split(".")[-1] in ("wav",)
97 | if not extension:
98 | return "Sound must be wav format!"
99 | sound = await file.read()
100 | prediction = predict(sound)
101 | # print(prediction)
102 | return str(prediction)
103 |
104 | @app.get("/", response_class=HTMLResponse)
105 | async def index():
106 | return """
107 |
108 |