├── .gitignore
├── README.md
├── docs
│   ├── inter_rcnn_2.png
│   ├── simple_conv_loss_epoch_20.png
│   ├── simple_crnn_2_loss_epoch_7.png
│   ├── simple_crnn_3_loss_epoch_10.png
│   ├── simple_crnn_loss_epoch_20.png
│   └── simple_dense_loss_epoch_10.png
├── main_dynamic.py
├── main_static.py
├── mer
│   ├── __init__.py
│   ├── const.py
│   ├── loss.py
│   ├── model.py
│   └── utils.py
├── requirements.txt
├── server
│   ├── .gitignore
│   ├── README.md
│   ├── app.py
│   ├── requirements.txt
│   └── utils.py
└── wav_converter.py

/.gitignore:
--------------------------------------------------------------------------------
1 | **__pycache__/
2 | dataset/
3 | .vscode/
4 | weights/
5 | history/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Music Emotion Recognition Algorithm using Deep Learning
2 | Author: Viet Dung Nguyen, Gettysburg College
3 | 
4 | `(Introduction to be written)`
5 | 
6 | ## Requirements:
7 | 1. System: Windows, Linux, or macOS.
8 | 2. Dataset:
9 |    * DEAM Dataset: [https://www.kaggle.com/imsparsh/deam-mediaeval-dataset-emotional-analysis-in-music](https://www.kaggle.com/imsparsh/deam-mediaeval-dataset-emotional-analysis-in-music)
10 | 3. Know the benchmark: [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392)
11 | 4. VS Code (to run the notebook-style Python files with VS Code's interactive notebook support)
12 | 5. [Anaconda](https://docs.anaconda.com/anaconda/install/index.html): The Python environment manager (for consistent code execution)
13 | 
14 | ## Project structure
15 | * `docs`: Images used in the documentation
16 | * `mer`: The core library for model training and evaluation
17 |   * `__init__.py`: Package initialization
18 |   * `const.py`: Constants for the library
19 |   * `loss.py`: Loss functions for the deep learning models
20 |   * `model.py`: Deep learning models
21 |   * `utils.py`: Utility methods
22 | * `main_dynamic.py`: Main notebook-style Python file for training on data with per-second labels.
23 | * `main_static.py`: Main notebook-style Python file for training on data with whole-song labels.
24 | * `requirements.txt`: Dependency file
25 | * `wav_converter.py`: Script to convert every mp3 file in a folder to wav format.
26 | 
27 | ## How to run the project
28 | 
29 | * First, install the required libraries:
30 | ```
31 | conda create -n mer
32 | conda activate mer
33 | pip install -r requirements.txt
34 | ```
35 | * Next, open either `main_dynamic.py` or `main_static.py` and experiment with it as a VS Code interactive notebook.
36 | 
37 | ## Misc:
38 | * Convert mp3 to wav (Windows cmd):
39 | ```
40 | cd audio
41 | mkdir wav
42 | for %f in (*.mp3) do (ffmpeg -i "%f" -acodec pcm_s16le -ar 44100 "./wav/%f.wav")
43 | ```
44 | or
45 | ```
46 | python wav_converter.py {src} {dst} {ffmpeg bin path}
47 | # E.g: python wav_converter.py "./dataset/DEAM/audio" "./dataset/DEAM/wav" "C:/Users/Alex Nguyen/Documents/ffmpeg-master-latest-win64-gpl/bin/ffmpeg.exe"
48 | ```
49 | 
50 | ## Reports
51 | 
52 | ### Jan 4, 2022
53 | * We want to experiment with depth-wise and point-wise (MobileNet-style) convolutions to reduce computational cost while keeping comparable performance (see the sketch at the end of this README).
54 | 
55 | ### Jan 3, 2022
56 | * We tried a CRNN model with the CBAM architecture and a shallow GRU (1 GRU layer). For the first 3 epochs (300 steps of batch size 16) we trained with learning rate 1e-3, then 1e-4 afterwards. We used MAE loss throughout the training process. We achieved a loss similar to the other models.
57 | 
58 | ### Dec 31, 2021
59 | * We tried a CRNN model with a deep GRU (3 GRU layers).
60 | 
61 | 
62 | 
63 | 
64 | ### Dec 30, 2021
65 | * We tried a CRNN model with a shallow bidirectional LSTM.
66 | 
68 | 
69 | 
70 | 
71 | 
72 | ### Dec 28, 2021
73 | * We tried a CRNN model.
74 | 
75 | 
76 | 
77 | 
78 | ### Dec 26, 2021
79 | * We tried a simple CNN model with 5 convolution blocks (each block consists of a convolution layer with filter size (3, 3) and leaky ReLU activation, followed by a convolution layer with filter size (1, 1) and leaky ReLU activation). Each block doubles the number of filters (64, 128, 256, 512, 1024). Here is the training result after 25 epochs:
80 | 
81 | 
82 | 
83 | 
84 | ### Dec 25, 2021
85 | * The Music Information Retrieval (MIR) field has always been challenging because few refined datasets have been constructed. For the Music Emotion Recognition (MER) task in particular, one has to collect the songs themselves as input in order to assess their emotion (often not possible because of copyright [\[1\]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392)). According to Aljanaki et al. [\[1\]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392), emotion is subjective to the individual listener and their language, and is therefore hard to determine. There are many emotion labeling schemes, such as the emotion-adjective scheme from Lin et al. [\[2\]](https://doi.org/10.1145/2037676.2037683) or the two-dimensional regression scheme of the DEAM dataset developed by Aljanaki et al. [\[1\]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392), which uses the two orthogonal psychological dimensions discussed by Russell [\[3\]](https://www.researchgate.net/publication/235361517_A_Circumplex_Model_of_Affect).
86 | * Many traditional music emotion recognition models rely on sound processing and musical feature detection from the waveform and the spectrogram of the sound, such as the improved back-propagation network [\[4\]](https://www.frontiersin.org/articles/10.3389/fpsyg.2021.760060/full) or MIDI-note, melodic, and dynamic features [\[5\]](https://www.semanticscholar.org/paper/Novel-Audio-Features-for-Music-Emotion-Recognition-Panda-Malheiro/6feb6c070313992897140a1802fdb8f0bf129422).
87 | * The dataset we use is the DEAM dataset, developed by Aljanaki et al. [\[1\]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392). The dataset contains 1082 45-second mp3 soundtracks and a set of annotations. The annotations consist of averaged static (whole-song) and dynamic (per-second) labels of the valence/arousal point on a 10-point scale.
88 | * At the beginning of the project, we experiment with the currently most popular method: deep learning. We want to apply deep learning to the MER task by feeding the music (and potentially its standard (Panda et al. [\[5\]](https://www.semanticscholar.org/paper/Novel-Audio-Features-for-Music-Emotion-Recognition-Panda-Malheiro/6feb6c070313992897140a1802fdb8f0bf129422)) and derived features) into the deep learning model as input, with the annotated valence-arousal point (ranging from 0 to 10) as the label.
89 | * We first want to test whether a plain dense network can accurately predict the two values, valence and arousal.
We first preprocess the audio by performing an STFT on the waveform to get the time-frequency spectrogram of the sound, represented as a 3D array \[`time_length`, `n_frequency`, `n_channel`\] (a typical spectrogram of a 45-second clip has the shape (15502, 129, 2)). We then resize this data to a smaller size (i.e., (512, 129, 2)) using bilinear interpolation. We then flatten the array and feed the resulting vector of 512 * 129 * 2 values through 4 linear layers of 512, 256, 128, and 64 neurons with rectified linear unit activations. The last layer is also a linear layer with 2 output neurons and a rectified linear unit activation. We simply use the L2 loss on the distance between the output neurons and the actual labelled valence and arousal. For the optimizer, we use stochastic gradient descent with a learning rate of 1e-4. After training the model with batch size 16, 100 steps per epoch, and 10 epochs, we get the following total loss per batch, so the mean squared error is roughly `loss` / `batch_size`. (A minimal sketch of this baseline appears at the end of this README.)
90 | 
91 | 
92 | 
93 | 
94 | ## Resources
95 | 
96 | * Database benchmark:
97 | 
98 | https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173392
99 | 
100 | ```
101 | @article{AlajankiEmoInMusicAnalysis,
102 |   author = {Alajanki, Anna and Yang, Yi-Hsuan and Soleymani, Mohammad},
103 |   title = {Benchmarking music emotion recognition systems},
104 |   journal = {PLOS ONE},
105 |   year = {2016},
106 |   note = {under review}
107 | }
108 | ```
109 | 
110 | https://www.kaggle.com/imsparsh/deam-mediaeval-dataset-emotional-analysis-in-music
111 | https://cvml.unige.ch/databases/DEAM/
112 | https://cvml.unige.ch/databases/DEAM/manual.pdf
113 | 
114 | https://www.semanticscholar.org/paper/The-AMG1608-dataset-for-music-emotion-recognition-Chen-Yang/16a21a57c85bae0ada26454300dc5c5891f1c0e2
115 | 
116 | The PMEmo Dataset for Music Emotion Recognition: https://dl.acm.org/doi/10.1145/3206025.3206037
117 | https://github.com/HuiZhangDB/PMEmo
118 | 
119 | RAVDESS database: https://zenodo.org/record/1188976
120 | 
121 | * Label scheme:
122 | 
123 | Lin, Y. C., Yang, Y. H., and Chen, H. H. (2011). Exploiting online music tags for music emotion classification. ACM Trans. Multimed. Comput. Commun. Appl. 7, 1–16. doi: 10.1145/2037676.2037683
124 | 
125 | Russell, James (1980). A Circumplex Model of Affect. Journal of Personality and Social Psychology, 39, 1161–1178. doi: 10.1037/h0077714
126 | 
127 | * Technical:
128 | 
129 | https://www.tensorflow.org/io/tutorials/audio
130 | https://librosa.org/
131 | https://www.tensorflow.org/tutorials/audio/simple_audio
132 | 
133 | Visual attention: https://www.youtube.com/watch?v=1mjI_Jm4W1E
134 | Visual attention notebook: https://github.com/EscVM/EscVM_YT/blob/master/Notebooks/0%20-%20TF2.X%20Tutorials/tf_2_visual_attention.ipynb
135 | 
136 | YAMNet: https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet.py
137 | 
138 | * Related works
139 | 
140 | Most influential MER papers (Semantic Scholar search, sorted by influence): https://www.semanticscholar.org/search?fos%5B0%5D=computer-science&q=Music%20Emotion%20Recognition&sort=influence
141 | 
142 | Audio-based deep music emotion recognition: https://www.semanticscholar.org/paper/Audio-based-deep-music-emotion-recognition-Liu-Han/6c4ed9c7cad950a6398a9caa2debb2dea0d16f73
143 | 
144 | A Novel Music Emotion Recognition Model Using Neural Network Technology: https://www.frontiersin.org/articles/10.3389/fpsyg.2021.760060/full
145 | 
146 | Novel audio features for MER: https://www.semanticscholar.org/paper/Novel-Audio-Features-for-Music-Emotion-Recognition-Panda-Malheiro/6feb6c070313992897140a1802fdb8f0bf129422
147 | 
148 | Musical texture and expressivity features: https://www.semanticscholar.org/paper/Musical-Texture-and-Expressivity-Features-for-Music-Panda-Malheiro/e4693023ae525b7dd1ecabf494654e7632f148b3
149 | 
150 | Speech recognition: http://proceedings.mlr.press/v32/graves14.pdf
151 | 
152 | Data augmentation (SpecAugment): https://paperswithcode.com/paper/specaugment-a-simple-data-augmentation-method
153 | 
154 | RNN regularization: https://paperswithcode.com/paper/recurrent-neural-network-regularization
155 | 
156 | wav2vec 2.0: https://paperswithcode.com/paper/wav2vec-2-0-a-framework-for-self-supervised
157 | 
158 | * MER Task:
159 | 
160 | Music Mood Detection Based On Audio And Lyrics With Deep Neural Net: https://paperswithcode.com/paper/music-mood-detection-based-on-audio-and
161 | Transformer-based: https://paperswithcode.com/paper/transformer-based-approach-towards-music
162 | 
163 | ISMIR tutorial: https://arxiv.org/abs/1709.04396
164 | 
165 | * DNN-based:
166 | 
167 | https://www.semanticscholar.org/paper/Music-Emotion-Classification-with-Deep-Neural-Nets-Pandeya-Bhattarai/023f9feb933c6e82ed2e7095c285e203d31241dc
168 | 
169 | 
170 | * CRNN-based:
171 | 
172 | Recognizing Song Mood and Theme Using Convolutional Recurrent Neural Networks: https://www.semanticscholar.org/paper/Recognizing-Song-Mood-and-Theme-Using-Convolutional-Mayerl-V%C3%B6tter/5319195f1f7be778a04186bfe4165e3516165a19
173 | 
174 | Convolutional Recurrent Neural Networks for Music Classification: https://arxiv.org/abs/1609.04243
175 | 
176 | * CNN-based:
177 | 
178 | https://www.semanticscholar.org/paper/CNN-based-music-emotion-classification-Liu-Chen/63e83168006678410d137315dd3e8488136aed39
179 | 
180 | https://www.semanticscholar.org/paper/Recognition-of-emotion-in-music-based-on-deep-Sarkar-Choudhury/396fd30fa5d2e8821b9413c5a227ec7c902d5b33
181 | 
182 | https://www.semanticscholar.org/paper/Music-Emotion-Recognition-by-Using-Chroma-and-Deep-Er-Aydilek/79b35f61ee84f2f7161c98f591b55f0bb31c4d0e
183 | 
184 | * Attention:
185 | 
186 | Emotion and Themes Recognition in Music with Convolutional and Recurrent Attention-Blocks: https://www.semanticscholar.org/paper/Emotion-and-Themes-Recognition-in-Music-with-and-Gerczuk-Amiriparian/50e46f65dc0dd2d09bb5a08c0fe4872bbe5a2810
187 | 
188 | CBAM: https://arxiv.org/abs/1807.06521
189 | 
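## Model sketches

A minimal, hedged sketch of the simple dense baseline described in the Dec 25, 2021 report (STFT spectrogram bilinearly resized to (512, 129, 2), flattened, four ReLU dense layers of 512/256/128/64 units, a 2-unit ReLU output, L2 loss, SGD with learning rate 1e-4). It assumes TensorFlow/Keras as used elsewhere in this repository; the function name `build_dense_baseline` and the `compile` call are illustrative only and are not taken from `mer/model.py`.

```python
import tensorflow as tf
import tensorflow.keras.layers as L

def build_dense_baseline(time_length=15502, n_freq=129, n_channel=2):
  # Hypothetical helper: layer sizes follow the Dec 25 report; the name is not from mer/model.py.
  inputs = L.Input(shape=(time_length, n_freq, n_channel))
  # Bilinear resize of the spectrogram "image" down to (512, n_freq) per channel.
  x = L.Resizing(512, n_freq, interpolation="bilinear")(inputs)
  x = L.Flatten()(x)                          # vector of 512 * 129 * 2 values
  for units in (512, 256, 128, 64):
    x = L.Dense(units, activation="relu")(x)
  outputs = L.Dense(2, activation="relu")(x)  # [valence, arousal]
  return tf.keras.Model(inputs, outputs, name="simple_dense_baseline")

model = build_dense_baseline()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4), loss="mse")
```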
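Below is a similarly hedged sketch of the depth-wise plus point-wise (MobileNet-style) convolution block mentioned in the Jan 4, 2022 report, as a cheaper drop-in for the plain Conv2D blocks used so far. The helper name `separable_conv_block` and its default parameters are illustrative; nothing like it exists in `mer/model.py` yet.

```python
import tensorflow as tf
import tensorflow.keras.layers as L

def separable_conv_block(x, filters, kernel_size=(3, 3)):
  # Depth-wise convolution: one spatial kernel per input channel (cheap spatial filtering).
  x = L.DepthwiseConv2D(kernel_size, padding="same")(x)
  x = L.LeakyReLU(alpha=0.1)(x)
  # Point-wise (1x1) convolution: mixes channels and sets the number of output filters.
  x = L.Conv2D(filters, (1, 1), padding="same")(x)
  x = L.LeakyReLU(alpha=0.1)(x)
  return L.MaxPool2D(2, 2)(x)

# Example: two separable blocks over a resized spectrogram input.
inputs = L.Input(shape=(512, 129, 2))
x = separable_conv_block(inputs, 64)
x = separable_conv_block(x, 128)
print(tf.keras.Model(inputs, x).count_params())
```

A depth-wise kernel costs roughly k*k multiplications per input channel instead of k*k times the number of output filters, so this block should be noticeably cheaper than the (3, 3) + (1, 1) Conv2D pairs of the Dec 26 model while keeping a similar receptive field.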
-------------------------------------------------------------------------------- /docs/inter_rcnn_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/inter_rcnn_2.png -------------------------------------------------------------------------------- /docs/simple_conv_loss_epoch_20.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_conv_loss_epoch_20.png -------------------------------------------------------------------------------- /docs/simple_crnn_2_loss_epoch_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_crnn_2_loss_epoch_7.png -------------------------------------------------------------------------------- /docs/simple_crnn_3_loss_epoch_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_crnn_3_loss_epoch_10.png -------------------------------------------------------------------------------- /docs/simple_crnn_loss_epoch_20.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_crnn_loss_epoch_20.png -------------------------------------------------------------------------------- /docs/simple_dense_loss_epoch_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rxng8/Music-Emotion-Recognition-Algorithm/f43aef6f3d410f9d0ea3e9b55f8fe5f29a7bb030/docs/simple_dense_loss_epoch_10.png -------------------------------------------------------------------------------- /main_dynamic.py: -------------------------------------------------------------------------------- 1 | """ 2 | file: main_dynamic.py 3 | author: Alex Nguyen 4 | This file contains code to process the each-second song labeled data (dynamically labeled) 5 | """ 6 | # %% 7 | 8 | import os 9 | import pathlib 10 | 11 | import matplotlib.pyplot as plt 12 | import numpy as np 13 | import seaborn as sns 14 | import tensorflow as tf 15 | import sounddevice as sd 16 | import pandas as pd 17 | import tensorflow.keras.layers as L 18 | 19 | from tensorflow.keras import layers 20 | from tensorflow.keras import models 21 | from IPython import display 22 | from tensorflow.python.keras.layers.core import Dropout 23 | 24 | from mer.utils import get_spectrogram, \ 25 | plot_spectrogram, \ 26 | load_metadata, \ 27 | plot_and_play, \ 28 | preprocess_waveforms, \ 29 | split_train_test, \ 30 | tanh_to_sigmoid 31 | 32 | from mer.const import * 33 | from mer.loss import simple_mse_loss, simple_mae_loss 34 | from mer.model import Simple_CRNN_3, SimpleDenseModel, \ 35 | SimpleConvModel, \ 36 | ConvBlock, \ 37 | ConvBlock2,\ 38 | Simple_CRNN, \ 39 | Simple_CRNN_2, \ 40 | Simple_CRNN_3 41 | 42 | # Set the seed value for experiment reproducibility. 
43 | # seed = 42 44 | # tf.random.set_seed(seed) 45 | # np.random.seed(seed) 46 | 47 | sd.default.samplerate = DEFAULT_FREQ 48 | 49 | ANNOTATION_SONG_LEVEL = "./dataset/DEAM/annotations/annotations averaged per song/dynamic (per second annotations)/" 50 | AUDIO_FOLDER = "./dataset/DEAM/wav" 51 | filenames = tf.io.gfile.glob(str(AUDIO_FOLDER) + '/*') 52 | 53 | BATCH_SIZE = 8 54 | DEFAULT_SECOND_PER_TIME_STEP = 0.5 55 | 56 | # Process with average annotation per song. 57 | df = load_metadata(ANNOTATION_SONG_LEVEL) 58 | 59 | train_df, test_df = split_train_test(df, TRAIN_RATIO) 60 | # Process with average annotation per second. 61 | valence_df = pd.read_csv(os.path.join(ANNOTATION_SONG_LEVEL, "valence.csv"), sep=r"\s*,\s*", engine="python") 62 | arousal_df = pd.read_csv(os.path.join(ANNOTATION_SONG_LEVEL, "arousal.csv"), sep=r"\s*,\s*", engine="python") 63 | 64 | assert len(valence_df) == len(arousal_df) 65 | 66 | song_id_df = pd.DataFrame({"song_id": valence_df["song_id"]}) 67 | 68 | # NOTE: Split only the song id table, then join that table with valence_df and arousal_df 69 | train_song_ids, test_song_ids = split_train_test(song_id_df, TRAIN_RATIO) 70 | 71 | train_valence_df = train_song_ids.merge(valence_df, on="song_id", how="left").dropna(axis=1) 72 | train_arousal_df = train_song_ids.merge(arousal_df, on="song_id", how="left").dropna(axis=1) 73 | 74 | test_valence_df = test_song_ids.merge(valence_df, on="song_id", how="left").dropna(axis=1) 75 | test_arousal_df = test_song_ids.merge(arousal_df, on="song_id", how="left").dropna(axis=1) 76 | 77 | # Describe (Summaraize) the datasets label 78 | # train_valence_flatten_df = pd.DataFrame(np.reshape(train_valence_df.loc[:, train_valence_df.columns != "song_id"].to_numpy(), (-1,))) 79 | # print(train_valence_flatten_df.describe()) 80 | 81 | # test_valence_flatten_df = pd.DataFrame(np.reshape(test_valence_df.loc[:, test_valence_df.columns != "song_id"].to_numpy(), (-1,))) 82 | # print(test_valence_flatten_df.describe()) 83 | 84 | # Debugging 85 | # pointer = 0 86 | # row = train_valence_df.loc[pointer] 87 | # song_id = row["song_id"] 88 | # assert train_arousal_df.loc[pointer, "song_id"] == song_id, "Wrong row!" 89 | # # Load song and waveform 90 | # song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION) 91 | # audio_file = tf.io.read_file(song_path) 92 | # waveforms, _ = tf.audio.decode_wav(contents=audio_file) 93 | # # Pad to max 45 second. Shape (total_frequency, n_channels) 94 | # waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 95 | 96 | # # Get the labels series 97 | # valence_labels = train_valence_df.loc[pointer, train_valence_df.columns != "song_id"] 98 | # arousal_labels = train_arousal_df.loc[pointer, train_arousal_df.columns != "song_id"] 99 | 100 | # time_pointer = 8 101 | # time_end_point = MIN_TIME_END_POINT + time_pointer * DEFAULT_SECOND_PER_TIME_STEP # 15 + ptr * 0.5 102 | 103 | # end_wave_index = int(time_end_point * DEFAULT_FREQ) 104 | # start_wave_index = int(end_wave_index - WINDOW_SIZE) 105 | 106 | # current_waveforms = waveforms[start_wave_index: end_wave_index, ...] 
107 | # # Work on building spectrogram 108 | # # Shape (timestep, frequency, n_channel) 109 | # spectrograms = None 110 | # # Loop through each channel 111 | 112 | # test_wave = get_spectrogram(current_waveforms[..., 0], input_len=current_waveforms.shape[0]) 113 | # print(test_wave.shape) # TensorShape([171, 129, 1]) 114 | 115 | # for i in range(current_waveforms.shape[-1]): 116 | # # Shape (timestep, frequency, 1) 117 | # spectrogram = get_spectrogram(current_waveforms[..., i], input_len=current_waveforms.shape[0]) 118 | # # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps)) 119 | # if spectrograms == None: 120 | # spectrograms = spectrogram 121 | # else: 122 | # spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 123 | # pointer += 1 124 | 125 | # padded_spectrogram = np.zeros((SPECTROGRAM_HALF_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float) 126 | # # spectrograms = spectrograms[tf.newaxis, ...] 127 | # # some spectrogram are not the same shape 128 | # padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms 129 | 130 | # print(padded_spectrogram.shape) 131 | 132 | def train_datagen_per_second(): 133 | """ Predicting valence mean and arousal mean 134 | """ 135 | pointer = 0 136 | while True: 137 | # Reset pointer 138 | if pointer >= len(train_valence_df): 139 | pointer = 0 140 | 141 | row = train_valence_df.loc[pointer] 142 | song_id = row["song_id"] 143 | assert train_arousal_df.loc[pointer, "song_id"] == song_id, "Wrong row!" 144 | 145 | # Load song and waveform 146 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION) 147 | audio_file = tf.io.read_file(song_path) 148 | waveforms, _ = tf.audio.decode_wav(contents=audio_file) 149 | # Pad to max 45 second. Shape (total_frequency, n_channels) 150 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 151 | 152 | # Get the labels series 153 | valence_labels = train_valence_df.loc[pointer, train_valence_df.columns != "song_id"] 154 | arousal_labels = train_arousal_df.loc[pointer, train_arousal_df.columns != "song_id"] 155 | # Loop through the series 156 | for time_pointer, ((valence_time_name, valence), (arousal_time_name, arousal)) in enumerate(zip(valence_labels.iteritems(), arousal_labels.iteritems())): 157 | label = tf.convert_to_tensor([tanh_to_sigmoid(valence), tanh_to_sigmoid(arousal)], dtype=tf.float32) 158 | time_end_point = MIN_TIME_END_POINT + time_pointer * DEFAULT_SECOND_PER_TIME_STEP # 15 + ptr * 0.5 159 | 160 | end_wave_index = int(time_end_point * DEFAULT_FREQ) 161 | start_wave_index = int(end_wave_index - WINDOW_SIZE) 162 | 163 | try: 164 | current_waveforms = waveforms[start_wave_index: end_wave_index, ...] 165 | # Work on building spectrogram 166 | # Shape (timestep, frequency, n_channel) 167 | spectrograms = None 168 | # Loop through each channel 169 | for i in range(current_waveforms.shape[-1]): 170 | # Shape (timestep, frequency, 1) 171 | spectrogram = get_spectrogram(current_waveforms[..., i], input_len=current_waveforms.shape[0]) 172 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps)) 173 | if spectrograms == None: 174 | spectrograms = spectrogram 175 | else: 176 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 177 | pointer += 1 178 | 179 | padded_spectrogram = np.zeros((SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float) 180 | # spectrograms = spectrograms[tf.newaxis, ...] 
181 | # some spectrogram are not the same shape 182 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms 183 | 184 | yield (tf.convert_to_tensor(padded_spectrogram), label) 185 | 186 | except: 187 | print("There is some error accessing the waveforms by index") 188 | break 189 | 190 | # train_dataset = tf.data.Dataset.from_generator( 191 | # train_datagen_per_second, 192 | # output_signature=( 193 | # tf.TensorSpec(shape=(SPECTROGRAM_HALF_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32), 194 | # tf.TensorSpec(shape=(2), dtype=tf.float32) 195 | # ) 196 | # ) 197 | # train_iter = iter(train_dataset) 198 | 199 | def test_datagen_per_second(): 200 | """ Predicting valence mean and arousal mean 201 | """ 202 | pointer = 0 203 | while True: 204 | # Reset pointer 205 | if pointer >= len(test_valence_df): 206 | pointer = 0 207 | 208 | row = test_valence_df.loc[pointer] 209 | song_id = row["song_id"] 210 | assert test_arousal_df.loc[pointer, "song_id"] == song_id, "Wrong row!" 211 | 212 | # Load song and waveform 213 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION) 214 | audio_file = tf.io.read_file(song_path) 215 | waveforms, _ = tf.audio.decode_wav(contents=audio_file) 216 | # Pad to max 45 second. Shape (total_frequency, n_channels) 217 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 218 | 219 | # Get the labels series 220 | valence_labels = test_valence_df.loc[pointer, test_valence_df.columns != "song_id"] 221 | arousal_labels = test_arousal_df.loc[pointer, test_arousal_df.columns != "song_id"] 222 | # Loop through the series 223 | for time_pointer, ((valence_time_name, valence), (arousal_time_name, arousal)) in enumerate(zip(valence_labels.iteritems(), arousal_labels.iteritems())): 224 | label = tf.convert_to_tensor([tanh_to_sigmoid(valence), tanh_to_sigmoid(arousal)], dtype=tf.float32) 225 | time_end_point = MIN_TIME_END_POINT + time_pointer * DEFAULT_SECOND_PER_TIME_STEP # 15 + ptr * 0.5 226 | 227 | end_wave_index = int(time_end_point * DEFAULT_FREQ) 228 | start_wave_index = int(end_wave_index - WINDOW_SIZE) 229 | 230 | try: 231 | current_waveforms = waveforms[start_wave_index: end_wave_index, ...] 232 | # Work on building spectrogram 233 | # Shape (timestep, frequency, n_channel) 234 | spectrograms = None 235 | # Loop through each channel 236 | for i in range(current_waveforms.shape[-1]): 237 | # Shape (timestep, frequency, 1) 238 | spectrogram = get_spectrogram(current_waveforms[..., i], input_len=current_waveforms.shape[0]) 239 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps)) 240 | if spectrograms == None: 241 | spectrograms = spectrogram 242 | else: 243 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 244 | pointer += 1 245 | 246 | padded_spectrogram = np.zeros((SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float) 247 | # spectrograms = spectrograms[tf.newaxis, ...] 
248 | # some spectrogram are not the same shape 249 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms 250 | 251 | yield (tf.convert_to_tensor(padded_spectrogram), label) 252 | 253 | except: 254 | print("There is some error accessing the waveforms by index") 255 | break 256 | 257 | 258 | train_dataset = tf.data.Dataset.from_generator( 259 | train_datagen_per_second, 260 | output_signature=( 261 | tf.TensorSpec(shape=(SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32), 262 | tf.TensorSpec(shape=(2), dtype=tf.float32) 263 | ) 264 | ) 265 | train_batch_dataset = train_dataset.batch(BATCH_SIZE) 266 | # train_batch_dataset = train_batch_dataset.cache().prefetch(tf.data.AUTOTUNE) # OOM error 267 | train_batch_iter = iter(train_batch_dataset) 268 | 269 | 270 | # Comment out to decide to create a normalization layer. 271 | # NOTE: this is every time consuming because it looks at all the data, only 272 | # use this at the first time. 273 | # NOTE: Normally, we create this layer once, save it somewhere to reuse in 274 | # every other model. 275 | # 276 | # norm_layer = L.Normalization() 277 | # norm_layer.adapt(data=train_dataset.map(map_func=lambda spec, label: spec)) 278 | # 279 | 280 | test_dataset = tf.data.Dataset.from_generator( 281 | test_datagen_per_second, 282 | output_signature=( 283 | tf.TensorSpec(shape=(SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32), 284 | tf.TensorSpec(shape=(2, ), dtype=tf.float32) 285 | ) 286 | ) 287 | test_batch_dataset = test_dataset.batch(BATCH_SIZE) 288 | # test_batch_dataset = test_batch_dataset.cache().prefetch(tf.data.AUTOTUNE) # OOM error 289 | test_batch_iter = iter(test_batch_dataset) 290 | 291 | # ds = iter(train_dataset) 292 | # i, o = next(ds) 293 | # log_spec = np.log(i + np.finfo(float).eps) 294 | 295 | # print(tf.reduce_max(i)) 296 | # print(tf.reduce_min(i)) 297 | # print(tf.reduce_mean(i)) 298 | 299 | # print(tf.reduce_max(log_spec)) 300 | # print(tf.reduce_min(log_spec)) 301 | # print(tf.reduce_mean(log_spec)) 302 | 303 | # ii = tf.transpose(i[..., 0], [1,0]) 304 | # height = ii.shape[0] 305 | # width = ii.shape[1] 306 | # X = np.linspace(0, np.size(ii), num=width, dtype=int) 307 | # Y = range(height) 308 | # plt.pcolormesh(X, Y, ii) 309 | # plt.show() 310 | 311 | # it = iter(train_dataset) 312 | # i, o = next(it) 313 | # o.shape 314 | 315 | 316 | # %% 317 | 318 | ## Training 319 | 320 | def train_step(batch_x, batch_label, model, loss_function, optimizer, step=-1): 321 | with tf.device("/GPU:0"): 322 | with tf.GradientTape() as tape: 323 | logits = model(batch_x, training=True) 324 | loss = loss_function(batch_label, logits) 325 | grads = tape.gradient(loss, model.trainable_weights) 326 | optimizer.apply_gradients(zip(grads, model.trainable_weights)) 327 | return loss 328 | 329 | def train(model, 330 | training_batch_iter, 331 | test_batch_iter, 332 | optimizer, 333 | loss_function, 334 | epochs=1, 335 | steps_per_epoch=20, 336 | valid_step=5, 337 | history_path=None, 338 | weights_path=None, 339 | save_history=False): 340 | 341 | if history_path != None and os.path.exists(history_path): 342 | # Sometimes, we have not created the files 343 | with open(history_path, "rb") as f: 344 | history = np.load(f, allow_pickle=True) 345 | epochs_loss, epochs_val_loss = history 346 | epochs_loss = epochs_loss.tolist() 347 | epochs_val_loss = epochs_val_loss.tolist() 348 | else: 349 | epochs_val_loss = [] 350 | epochs_loss = [] 351 | 352 | if weights_path != None and 
os.path.exists(weights_path + ".index"): 353 | try: 354 | model.load_weights(weights_path) 355 | print("Model weights loaded!") 356 | except: 357 | print("cannot load weights!") 358 | 359 | for epoch in range(epochs): 360 | losses = [] 361 | val_losses = [] 362 | with tf.device("/CPU:0"): 363 | step_pointer = 0 364 | while step_pointer < steps_per_epoch: 365 | batch = next(training_batch_iter) 366 | batch_x = batch[0] 367 | batch_label = batch[1] 368 | loss = train_step(batch_x, batch_label, model, loss_function, optimizer, step=step_pointer + 1) 369 | print(f"Epoch {epoch + 1} - Step {step_pointer + 1} - Loss: {loss}") 370 | losses.append(loss) 371 | 372 | val_batch = next(test_batch_iter) 373 | logits = model(val_batch[0], training=False) 374 | val_loss = loss_function(val_batch[1], logits) 375 | val_losses.append(val_loss) 376 | 377 | if (step_pointer + 1) % valid_step == 0: 378 | print( 379 | "Training loss (for one batch) at step %d: %.4f" 380 | % (step_pointer + 1, float(loss)) 381 | ) 382 | # perform validation 383 | print(f"exmaple logits: {logits}") 384 | print(f"Validation loss: {val_loss}\n-----------------") 385 | 386 | step_pointer += 1 387 | epochs_loss.append(losses) 388 | epochs_val_loss.append(val_losses) 389 | 390 | # Save history and model 391 | if history_path != None and save_history: 392 | np.save(history_path, [epochs_loss, epochs_val_loss]) 393 | 394 | if weights_path != None: 395 | model.save_weights(weights_path) 396 | 397 | # return history 398 | return [epochs_loss, epochs_val_loss] 399 | 400 | # %% 401 | 402 | """################## Training #################""" 403 | 404 | ## Define model first 405 | 406 | weights_path = "./weights/dynamics/base_shallow_lstm/checkpoint" 407 | history_path = "./history/dynamics/base_shallow_lstm.npy" 408 | 409 | # model = SimpleDenseModel(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL, BATCH_SIZE) 410 | # model.build(input_shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL)) 411 | # model.model().summary() 412 | 413 | # model = SimpleConvModel(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL, BATCH_SIZE) 414 | # model.model.load_weights(weights_path) 415 | 416 | 417 | optimizer = tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE) 418 | 419 | # %% 420 | 421 | model = Simple_CRNN_3() 422 | # model.summary() 423 | sample_input = tf.ones(shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)) 424 | with tf.device("/CPU:0"): 425 | sample_output = model(sample_input, training=False) 426 | print(sample_output) 427 | 428 | # %% 429 | 430 | # About 50 epochs with each epoch step 100 will cover the whole training dataset! 
431 | history = train( 432 | model, 433 | train_batch_iter, 434 | test_batch_iter, 435 | optimizer, 436 | simple_mae_loss, 437 | epochs=2, 438 | steps_per_epoch=100, # 1200 // 16 439 | valid_step=20, 440 | history_path=history_path, 441 | weights_path=weights_path, 442 | save_history=True 443 | ) 444 | 445 | 446 | # %% 447 | 448 | ### MODEL DEBUGGING ### 449 | 450 | class ResBlock(tf.keras.layers.Layer): 451 | def __init__(self, filters, kernel_size_1=(5,5), kernel_size_2=(3,3), **kwargs) -> None: 452 | super().__init__(**kwargs) 453 | self.filters = filters 454 | self.kernel_size_1 = kernel_size_1 455 | self.kernel_size_2 = kernel_size_2 456 | 457 | def build(self, input_shape): 458 | self.conv_norm = L.Conv2D(self.filters, (1, 1), padding="same") 459 | self.conv1 = L.Conv2D(self.filters, self.kernel_size_1, padding="same") 460 | self.conv2 = L.Conv2D(self.filters, self.kernel_size_2, padding="same") 461 | 462 | def call(self, inputs): 463 | skip_tensor = self.conv_norm(inputs) 464 | tensor = L.ReLU()(skip_tensor) 465 | tensor = self.conv1(tensor) 466 | tensor = L.ReLU()(tensor) 467 | tensor = self.conv2(tensor) 468 | tensor = L.Add()([skip_tensor, tensor]) 469 | out = L.ReLU()(tensor) 470 | return out 471 | 472 | class ResBlock2(tf.keras.layers.Layer): 473 | def __init__(self, filters, kernel_size=(5,5), **kwargs) -> None: 474 | super().__init__(**kwargs) 475 | self.filters = filters 476 | self.kernel_size = kernel_size 477 | 478 | def build(self, input_shape): 479 | self.conv_norm = L.Conv2D(self.filters, (1, 1), padding="same") 480 | self.conv1 = L.Conv2D(self.filters // 2, (1, 1), padding="same") 481 | self.conv2 = L.Conv2D(self.filters, self.kernel_size, padding="same") 482 | 483 | def call(self, inputs): 484 | skip_tensor = self.conv_norm(inputs) 485 | tensor = L.ReLU()(skip_tensor) 486 | tensor = self.conv1(tensor) 487 | tensor = L.ReLU()(tensor) 488 | tensor = self.conv2(tensor) 489 | tensor = L.Add()([skip_tensor, tensor]) 490 | out = L.ReLU()(tensor) 491 | return out 492 | 493 | def base_cnn(task_type="static"): 494 | """ Base CNN Feature extractor for 45 second spectrogram 495 | Input to model shape: (SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2) 496 | Output of the model shape: (4, 60, 256) 497 | (Convolved frequency, convolved timestep, feature neurons) 498 | Args: 499 | task_type (str, optional): There are three value: 500 | "static" for model evaluation per song, 501 | "dynamic" for model evaluation per timestep and it takes the input waveform at only per timestep, 502 | "seq2seq" for model evaluation per second but take the input as the whole song. Defaults to "static". 
503 | Returns: 504 | tf.keras.Model: Return a model 505 | """ 506 | model = tf.keras.Sequential(name="base_cnn") 507 | if task_type == "static" or task_type == "seq2seq": 508 | model.add(L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2))) 509 | model.add(L.Permute((2, 1, 3))) 510 | model.add(L.Resizing(FREQUENCY_LENGTH, 1024)) 511 | elif task_type == "dynamic": 512 | model.add(L.Input(shape=(SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, 2))) 513 | model.add(L.Permute((2, 1, 3))) 514 | model.add(L.Resizing(FREQUENCY_LENGTH, 1024)) 515 | else: 516 | print("Wrong parameters") 517 | return 518 | 519 | cnn_config = [ 520 | # [32, (5,5), (3, 3)], 521 | [64, (5,5), (3, 3)], 522 | [128, (5,5), (3, 3)], 523 | [256, (5,5), (3, 3)] 524 | # [512, (5,5), (3, 3)], 525 | ] 526 | 527 | for i, (filters, kernel_size_1, kernel_size_2) in enumerate(cnn_config): 528 | # model.add(ResBlock(filters, kernel_size_1, kernel_size_2, name=f"res_block_{i}")) 529 | # model.add(L.MaxPool2D(2,2, name=f"max_pool_{i}")) 530 | model.add(ResBlock(filters, kernel_size_1, name=f"res_block_{i}")) 531 | model.add(L.MaxPool2D(2,2, name=f"max_pool_{i}")) 532 | 533 | model.add(L.Conv2D(128, (1,1), activation="relu")) 534 | model.add(L.Conv2D(64, (1,1), activation="relu")) 535 | model.add(L.Conv2D(32, (1,1), activation="relu")) 536 | 537 | return model 538 | 539 | def model(): 540 | base = base_cnn(task_type="dynamic") 541 | 542 | tensor = base.outputs[0] 543 | tensor = L.Permute((2, 3, 1))(tensor) 544 | tensor = L.Reshape((128, -1))(tensor) 545 | tensor = L.LSTM(256)(tensor) 546 | tensor = L.Dense(256, activation=None)(tensor) 547 | tensor = L.Dense(64, activation=None)(tensor) 548 | out = L.Dense(2, activation=None)(tensor) 549 | 550 | model = tf.keras.Model(inputs=base.inputs, outputs=out) 551 | return model 552 | 553 | model = model() 554 | model.summary() 555 | sample_input = tf.ones(shape=(BATCH_SIZE, SPECTROGRAM_5_SECOND_LENGTH, FREQUENCY_LENGTH, 2)) 556 | with tf.device("/CPU:0"): 557 | sample_output = model(sample_input, training=False) 558 | print(sample_output.shape) 559 | 560 | 561 | 562 | 563 | # %%5 564 | 565 | 566 | sample_output.shape 567 | 568 | # %% 569 | 570 | 571 | 572 | 573 | # %% 574 | 575 | # Plot 576 | with open(history_path, "rb") as f: 577 | [epochs_loss, epochs_val_loss] = np.load(f, allow_pickle=True) 578 | 579 | 580 | e_loss = [k[0] for k in epochs_loss] 581 | 582 | e_all_loss = [] 583 | 584 | id = 0 585 | time_val = [] 586 | for epoch in epochs_loss: 587 | for step in epoch: 588 | e_all_loss.append(step) 589 | id += 1 590 | time_val.append(id) 591 | 592 | e_val_loss = [k[0] for k in epochs_val_loss] 593 | 594 | e_all_val_loss = [] 595 | 596 | id = 0 597 | time_val = [] 598 | for epoch in epochs_val_loss: 599 | for step in epoch: 600 | e_all_val_loss.append(step) 601 | id += 1 602 | time_val.append(id) 603 | 604 | plt.plot(np.arange(0, len(e_all_loss), 1), e_all_loss, label = "train loss") 605 | # plt.plot(time_val, epochs_val_loss, label = "val loss") 606 | plt.plot(np.arange(0, len(e_all_val_loss), 1), e_all_val_loss, label = "val loss") 607 | 608 | # plt.plot(np.arange(1,len(e_loss)+ 1), e_loss, label = "train loss") 609 | # plt.plot(np.arange(1,len(epochs_val_loss)+ 1), epochs_val_loss, label = "val loss") 610 | plt.xlabel("Step") 611 | plt.ylabel("Loss") 612 | plt.legend() 613 | plt.show() 614 | 615 | # %%5 616 | 617 | # model.load_weights(weights_path) 618 | # model.trainable_weights 619 | # y.shape 620 | 621 | 622 | # %% 623 | 624 | 625 | 626 | # model.build(input_shape=(BATCH_SIZE, 
SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL)) 627 | # model.summary() 628 | # %% 629 | 630 | model.save_weights(weights_path) 631 | 632 | 633 | # %% 634 | 635 | 636 | 637 | model.load_weights(weights_path) 638 | 639 | 640 | # %% 641 | 642 | 643 | def evaluate(df_pointer, model, loss_func, play=False): 644 | row = test_df.loc[df_pointer] 645 | song_id = row["song_id"] 646 | valence_mean = row["valence_mean"] 647 | arousal_mean = row["arousal_mean"] 648 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32) 649 | print(f"Label: Valence: {valence_mean}, Arousal: {arousal_mean}") 650 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION) 651 | audio_file = tf.io.read_file(song_path) 652 | waveforms, _ = tf.audio.decode_wav(contents=audio_file) 653 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 654 | spectrograms = None 655 | # Loop through each channel 656 | for i in range(waveforms.shape[-1]): 657 | # Shape (timestep, frequency, 1) 658 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0]) 659 | if spectrograms == None: 660 | spectrograms = spectrogram 661 | else: 662 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 663 | 664 | spectrograms = spectrograms[tf.newaxis, ...] 665 | 666 | ## Eval 667 | y_pred = model(spectrograms, training=False)[0] 668 | print(f"Predicted y_pred value: Valence: {y_pred[0]}, Arousal: {y_pred[1]}") 669 | 670 | loss = loss_func(label[tf.newaxis, ...], y_pred) 671 | print(f"Loss: {loss}") 672 | 673 | if play: 674 | plot_and_play(waveforms, 0, 40, 0) 675 | 676 | i = 0 677 | 678 | # %% 679 | 680 | i += 1 681 | evaluate(i, model, simple_mae_loss, play=False) 682 | 683 | # %% 684 | 685 | ####### INTERMEDIARY REPRESENTATION ######## 686 | 687 | layer_list = [l for l in model.layers] 688 | debugging_model = tf.keras.Model(inputs=model.inputs, outputs=[l.output for l in layer_list]) 689 | 690 | # %% 691 | 692 | layer_list 693 | 694 | # %% 695 | 696 | test_id = 35 697 | test_time_ptr = 0 698 | time_end_point = MIN_TIME_END_POINT + test_time_ptr 699 | df_id = int(test_time_ptr * 2) 700 | row = test_valence_df.loc[test_id] 701 | song_id = row["song_id"] 702 | # Get the labels series 703 | valence_labels = test_valence_df.loc[test_id, test_valence_df.columns != "song_id"] 704 | arousal_labels = test_arousal_df.loc[test_id, test_arousal_df.columns != "song_id"] 705 | 706 | valence_val = tanh_to_sigmoid(valence_labels.iloc[[df_id]].to_numpy()) 707 | arousal_val = tanh_to_sigmoid(arousal_labels.iloc[[df_id]].to_numpy()) 708 | 709 | label = tf.convert_to_tensor([valence_val, arousal_val], dtype=tf.float32) 710 | print(f"Label: Valence: {valence_val}, Arousal: {arousal_val}") 711 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION) 712 | audio_file = tf.io.read_file(song_path) 713 | waveforms, _ = tf.audio.decode_wav(contents=audio_file) 714 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 715 | 716 | 717 | end_wave_index = int(time_end_point * DEFAULT_FREQ) 718 | start_wave_index = int(end_wave_index - WINDOW_SIZE) 719 | 720 | current_waveforms = waveforms[start_wave_index: end_wave_index, ...] 
721 | 722 | spectrograms = None 723 | # Loop through each channel 724 | for i in range(current_waveforms.shape[-1]): 725 | # Shape (timestep, frequency, 1) 726 | spectrogram = get_spectrogram(current_waveforms[..., i], input_len=current_waveforms.shape[0]) 727 | if spectrograms == None: 728 | spectrograms = spectrogram 729 | else: 730 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 731 | 732 | spectrograms = spectrograms[tf.newaxis, ...] 733 | 734 | 735 | ## Eval 736 | y_pred_list = debugging_model(spectrograms, training=False) 737 | print(f"Predicted y_pred value: Valence: {y_pred_list[-1][0, 0]}, Arousal: {y_pred_list[-1][0, 1]}") 738 | 739 | plot_and_play(waveforms, time_end_point - WINDOW_TIME, WINDOW_TIME, 0) 740 | 741 | 742 | # def show_color_mesh(spectrogram): 743 | # """ Generate color mesh 744 | 745 | # Args: 746 | # spectrogram (2D array): Expect shape (Frequency length, time step) 747 | # """ 748 | # assert len(spectrogram.shape) == 2 749 | # log_spec = np.log(spectrogram + np.finfo(float).eps) 750 | # height = log_spec.shape[0] 751 | # width = log_spec.shape[1] 752 | # X = np.linspace(0, np.size(spectrogram), num=width, dtype=int) 753 | # Y = range(height) 754 | # plt.pcolormesh(X, Y, log_spec) 755 | # plt.show() 756 | 757 | # show_color_mesh(tf.transpose(spectrograms[0, :, :, 0], [1,0])) 758 | 759 | 760 | # %% 761 | 762 | f, axarr = plt.subplots(7,4, figsize=(25,15)) 763 | CONVOLUTION_NUMBER_LIST = [2, 3, 4, 5] 764 | LAYER_LIST = [3, 5, 7, 9, 10, 11, 13] 765 | for x, CONVOLUTION_NUMBER in enumerate(CONVOLUTION_NUMBER_LIST): 766 | f1 = y_pred_list[LAYER_LIST[0]] 767 | plot_spectrogram(tf.transpose(f1[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[0,x]) 768 | axarr[0,x].grid(False) 769 | f2 = y_pred_list[LAYER_LIST[1]] 770 | plot_spectrogram(tf.transpose(f2[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[1,x]) 771 | axarr[1,x].grid(False) 772 | f3 = y_pred_list[LAYER_LIST[2]] 773 | plot_spectrogram(tf.transpose(f3[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[2,x]) 774 | axarr[2,x].grid(False) 775 | f4 = y_pred_list[LAYER_LIST[3]] 776 | plot_spectrogram(tf.transpose(f4[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[3,x]) 777 | axarr[3,x].grid(False) 778 | 779 | f5 = y_pred_list[LAYER_LIST[4]] 780 | plot_spectrogram(tf.transpose(f5[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[4,x]) 781 | axarr[4,x].grid(False) 782 | f6 = y_pred_list[LAYER_LIST[5]] 783 | plot_spectrogram(tf.transpose(f6[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[5,x]) 784 | axarr[5,x].grid(False) 785 | f7 = y_pred_list[LAYER_LIST[6]] 786 | # plot_spectrogram(tf.transpose(f7[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[6,x]) 787 | plot_spectrogram(f7[0, : , :].numpy(), axarr[6,x]) 788 | axarr[6,x].grid(False) 789 | # f8 = y_pred_list[LAYER_LIST[7]] 790 | # plot_spectrogram(tf.transpose(f8[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[7,x]) 791 | # axarr[7,x].grid(False) 792 | 793 | axarr[0,0].set_ylabel("After convolution layer 1") 794 | axarr[1,0].set_ylabel("After convolution layer 2") 795 | axarr[2,0].set_ylabel("After convolution layer 3") 796 | axarr[3,0].set_ylabel("After convolution layer 7") 797 | 798 | axarr[0,0].set_title("convolution number 0") 799 | axarr[0,1].set_title("convolution number 4") 800 | axarr[0,2].set_title("convolution number 7") 801 | axarr[0,3].set_title("convolution number 23") 802 | 803 | plt.show() 804 | 805 | # %% 806 | 807 | -------------------------------------------------------------------------------- 
/main_static.py: -------------------------------------------------------------------------------- 1 | """ 2 | file: main_static.py 3 | author: Alex Nguyen 4 | This file contains code to process the whole song labeled data (statically labeled) 5 | """ 6 | # %% 7 | 8 | import os 9 | import pathlib 10 | 11 | import matplotlib.pyplot as plt 12 | import numpy as np 13 | import seaborn as sns 14 | import tensorflow as tf 15 | import sounddevice as sd 16 | import pandas as pd 17 | import tensorflow.keras.layers as L 18 | 19 | from tensorflow.keras import layers 20 | from tensorflow.keras import models 21 | from IPython import display 22 | from tensorflow.python.keras.layers.core import Dropout 23 | 24 | from mer.utils import get_spectrogram, \ 25 | plot_spectrogram, \ 26 | load_metadata, \ 27 | plot_and_play, \ 28 | preprocess_waveforms, \ 29 | split_train_test 30 | 31 | from mer.const import * 32 | from mer.loss import simple_mse_loss, simple_mae_loss 33 | from mer.model import Simple_CRNN_3, SimpleDenseModel, \ 34 | SimpleConvModel, \ 35 | ConvBlock, \ 36 | ConvBlock2,\ 37 | Simple_CRNN, \ 38 | Simple_CRNN_2, \ 39 | Simple_CRNN_3 40 | 41 | # Set the seed value for experiment reproducibility. 42 | # seed = 42 43 | # tf.random.set_seed(seed) 44 | # np.random.seed(seed) 45 | 46 | sd.default.samplerate = DEFAULT_FREQ 47 | 48 | ANNOTATION_SONG_LEVEL = "./dataset/DEAM/annotations/annotations averaged per song/song_level/" 49 | AUDIO_FOLDER = "./dataset/DEAM/wav" 50 | filenames = tf.io.gfile.glob(str(AUDIO_FOLDER) + '/*') 51 | 52 | # Process with average annotation per song. 53 | df = load_metadata(ANNOTATION_SONG_LEVEL) 54 | 55 | train_df, test_df = split_train_test(df, TRAIN_RATIO) 56 | 57 | # test_file = tf.io.read_file(os.path.join(AUDIO_FOLDER, "2011.wav")) 58 | # test_audio, _ = tf.audio.decode_wav(contents=test_file) 59 | # test_audio.shape 60 | # test_audio = preprocess_waveforms(test_audio, WAVE_ARRAY_LENGTH) 61 | # test_audio.shape 62 | 63 | # plot_and_play(test_audio, 24, 5, 0) 64 | # plot_and_play(test_audio, 26, 5, 0) 65 | # plot_and_play(test_audio, 28, 5, 0) 66 | # plot_and_play(test_audio, 30, 5, 0) 67 | 68 | # TODO: Check if all the audio files have the same number of channels 69 | 70 | # TODO: Loop through all music file to get the max length spectrogram, and other specs 71 | # Spectrogram length for 45s audio with freq 44100 is often 15523 72 | # Largeest 3 spectrogram, 16874 at 1198.wav, 103922 at 2001.wav, 216080 at 2011.wav 73 | # The reason why there are multiple spectrogram is because the music have different length 74 | # For the exact 45 seconds audio, the spectrogram time length is 15502. 75 | 76 | # SPECTROGRAM_TIME_LENGTH = 15502 77 | # min_audio_length = 1e8 78 | # for fname in os.listdir(AUDIO_FOLDER): 79 | # song_path = os.path.join(AUDIO_FOLDER, fname) 80 | # audio_file = tf.io.read_file(song_path) 81 | # waveforms, _ = tf.audio.decode_wav(contents=audio_file) 82 | # audio_length = waveforms.shape[0] // DEFAULT_FREQ 83 | # if audio_length < min_audio_length: 84 | # min_audio_length = audio_length 85 | # print(f"The min audio time length is: {min_audio_length} second(s) at {fname}") 86 | # spectrogram = get_spectrogram(waveforms[..., 0], input_len=waveforms.shape[0]) 87 | # if spectrogram.shape[0] > SPECTROGRAM_TIME_LENGTH: 88 | # SPECTROGRAM_TIME_LENGTH = spectrogram.shape[0] 89 | # print(f"The max spectrogram time length is: {SPECTROGRAM_TIME_LENGTH} at {fname}") 90 | 91 | # TODO: Get the max and min val of the label. 
Mean: 92 | 93 | def train_datagen_song_level(): 94 | """ Predicting valence mean and arousal mean 95 | """ 96 | pointer = 0 97 | while True: 98 | # Reset pointer 99 | if pointer >= len(train_df): 100 | pointer = 0 101 | 102 | row = train_df.loc[pointer] 103 | song_id = row["song_id"] 104 | valence_mean = float(row["valence_mean"]) 105 | arousal_mean = float(row["arousal_mean"]) 106 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32) 107 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION) 108 | audio_file = tf.io.read_file(song_path) 109 | waveforms, _ = tf.audio.decode_wav(contents=audio_file) 110 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 111 | # print(waveforms.shape) 112 | 113 | # Work on building spectrogram 114 | # Shape (timestep, frequency, n_channel) 115 | spectrograms = None 116 | # Loop through each channel 117 | for i in range(waveforms.shape[-1]): 118 | # Shape (timestep, frequency, 1) 119 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0]) 120 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps)) 121 | if spectrograms == None: 122 | spectrograms = spectrogram 123 | else: 124 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 125 | pointer += 1 126 | 127 | padded_spectrogram = np.zeros((SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float) 128 | # spectrograms = spectrograms[tf.newaxis, ...] 129 | # some spectrogram are not the same shape 130 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms 131 | 132 | yield (tf.convert_to_tensor(padded_spectrogram), label) 133 | 134 | def test_datagen_song_level(): 135 | """ Predicting valence mean and arousal mean 136 | """ 137 | pointer = 0 138 | while True: 139 | # Reset pointer 140 | if pointer >= len(test_df): 141 | pointer = 0 142 | 143 | row = test_df.loc[pointer] 144 | song_id = row["song_id"] 145 | valence_mean = float(row["valence_mean"]) 146 | arousal_mean = float(row["arousal_mean"]) 147 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32) 148 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION) 149 | audio_file = tf.io.read_file(song_path) 150 | waveforms, _ = tf.audio.decode_wav(contents=audio_file) 151 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 152 | # print(waveforms.shape) 153 | 154 | # Work on building spectrogram 155 | # Shape (timestep, frequency, n_channel) 156 | spectrograms = None 157 | # Loop through each channel 158 | for i in range(waveforms.shape[-1]): 159 | # Shape (timestep, frequency, 1) 160 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0]) 161 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps)) 162 | if spectrograms == None: 163 | spectrograms = spectrogram 164 | else: 165 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 166 | pointer += 1 167 | 168 | padded_spectrogram = np.zeros((SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float) 169 | # spectrograms = spectrograms[tf.newaxis, ...] 
170 | # some spectrogram are not the same shape 171 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms 172 | 173 | yield (tf.convert_to_tensor(padded_spectrogram), label) 174 | 175 | train_dataset = tf.data.Dataset.from_generator( 176 | train_datagen_song_level, 177 | output_signature=( 178 | tf.TensorSpec(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32), 179 | tf.TensorSpec(shape=(2), dtype=tf.float32) 180 | ) 181 | ) 182 | train_batch_dataset = train_dataset.batch(BATCH_SIZE) 183 | # train_batch_dataset = train_batch_dataset.cache().prefetch(tf.data.AUTOTUNE) # OOM error 184 | train_batch_iter = iter(train_batch_dataset) 185 | 186 | 187 | # Comment out to decide to create a normalization layer. 188 | # NOTE: this is every time consuming because it looks at all the data, only 189 | # use this at the first time. 190 | # NOTE: Normally, we create this layer once, save it somewhere to reuse in 191 | # every other model. 192 | # 193 | # norm_layer = L.Normalization() 194 | # norm_layer.adapt(data=train_dataset.map(map_func=lambda spec, label: spec)) 195 | # 196 | 197 | test_dataset = tf.data.Dataset.from_generator( 198 | test_datagen_song_level, 199 | output_signature=( 200 | tf.TensorSpec(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=tf.float32), 201 | tf.TensorSpec(shape=(2, ), dtype=tf.float32) 202 | ) 203 | ) 204 | test_batch_dataset = test_dataset.batch(BATCH_SIZE) 205 | # test_batch_dataset = test_batch_dataset.cache().prefetch(tf.data.AUTOTUNE) # OOM error 206 | test_batch_iter = iter(test_batch_dataset) 207 | 208 | # ds = iter(train_dataset) 209 | # i, o = next(ds) 210 | # log_spec = np.log(i + np.finfo(float).eps) 211 | 212 | # print(tf.reduce_max(i)) 213 | # print(tf.reduce_min(i)) 214 | # print(tf.reduce_mean(i)) 215 | 216 | # print(tf.reduce_max(log_spec)) 217 | # print(tf.reduce_min(log_spec)) 218 | # print(tf.reduce_mean(log_spec)) 219 | 220 | # ii = tf.transpose(i[..., 0], [1,0]) 221 | # height = ii.shape[0] 222 | # width = ii.shape[1] 223 | # X = np.linspace(0, np.size(ii), num=width, dtype=int) 224 | # Y = range(height) 225 | # plt.pcolormesh(X, Y, ii) 226 | # plt.show() 227 | 228 | 229 | # %% 230 | 231 | ## Training 232 | 233 | def train_step(batch_x, batch_label, model, loss_function, optimizer, step=-1): 234 | with tf.device("/GPU:0"): 235 | with tf.GradientTape() as tape: 236 | logits = model(batch_x, training=True) 237 | loss = loss_function(batch_label, logits) 238 | grads = tape.gradient(loss, model.trainable_weights) 239 | optimizer.apply_gradients(zip(grads, model.trainable_weights)) 240 | return loss 241 | 242 | def train(model, 243 | training_batch_iter, 244 | test_batch_iter, 245 | optimizer, 246 | loss_function, 247 | epochs=1, 248 | steps_per_epoch=20, 249 | valid_step=5, 250 | history_path=None, 251 | weights_path=None, 252 | save_history=False): 253 | 254 | if history_path != None and os.path.exists(history_path): 255 | # Sometimes, we have not created the files 256 | with open(history_path, "rb") as f: 257 | history = np.load(f, allow_pickle=True) 258 | epochs_loss, epochs_val_loss = history 259 | epochs_loss = epochs_loss.tolist() 260 | epochs_val_loss = epochs_val_loss.tolist() 261 | else: 262 | epochs_val_loss = [] 263 | epochs_loss = [] 264 | 265 | if weights_path != None and os.path.exists(weights_path + ".index"): 266 | try: 267 | model.load_weights(weights_path) 268 | print("Model weights loaded!") 269 | except: 270 | print("cannot load weights!") 271 | 272 | for 
epoch in range(epochs): 273 | losses = [] 274 | 275 | with tf.device("/CPU:0"): 276 | step_pointer = 0 277 | while step_pointer < steps_per_epoch: 278 | batch = next(training_batch_iter) 279 | batch_x = batch[0] 280 | batch_label = batch[1] 281 | loss = train_step(batch_x, batch_label, model, loss_function, optimizer, step=step_pointer + 1) 282 | print(f"Epoch {epoch + 1} - Step {step_pointer + 1} - Loss: {loss}") 283 | losses.append(loss) 284 | 285 | if (step_pointer + 1) % valid_step == 0: 286 | print( 287 | "Training loss (for one batch) at step %d: %.4f" 288 | % (step_pointer + 1, float(loss)) 289 | ) 290 | # perform validation 291 | val_batch = next(test_batch_iter) 292 | logits = model(val_batch[0], training=False) 293 | val_loss = loss_function(val_batch[1], logits) 294 | print(f"exmaple logits: {logits}") 295 | print(f"Validation loss: {val_loss}\n-----------------") 296 | if (step_pointer + 1) == steps_per_epoch: 297 | val_batch = next(test_batch_iter) 298 | logits = model(val_batch[0], training=False) 299 | val_loss = loss_function(val_batch[1], logits) 300 | epochs_val_loss.append(val_loss) 301 | 302 | step_pointer += 1 303 | epochs_loss.append(losses) 304 | 305 | # Save history and model 306 | if history_path != None and save_history: 307 | np.save(history_path, [epochs_loss, epochs_val_loss]) 308 | 309 | if weights_path != None: 310 | model.save_weights(weights_path) 311 | 312 | # return history 313 | return [epochs_loss, epochs_val_loss] 314 | 315 | # %% 316 | 317 | """################## Training #################""" 318 | 319 | ## Define model first 320 | 321 | weights_path = "./weights/cbam_2/checkpoint" 322 | history_path = "./history/cbam_2.npy" 323 | 324 | # model = SimpleDenseModel(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL, BATCH_SIZE) 325 | # model.build(input_shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL)) 326 | # model.model().summary() 327 | 328 | # model = SimpleConvModel(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL, BATCH_SIZE) 329 | # model.model.load_weights(weights_path) 330 | 331 | optimizer = tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE) 332 | 333 | # %% 334 | 335 | model = Simple_CRNN_3() 336 | # model.summary() 337 | sample_input = tf.ones(shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)) 338 | with tf.device("/CPU:0"): 339 | sample_output = model(sample_input, training=False) 340 | print(sample_output) 341 | 342 | # %% 343 | 344 | # About 50 epochs with each epoch step 100 will cover the whole training dataset! 
345 | history = train( 346 | model, 347 | train_batch_iter, 348 | test_batch_iter, 349 | optimizer, 350 | simple_mae_loss, 351 | epochs=2, 352 | steps_per_epoch=100, # 1800 // 16 353 | valid_step=20, 354 | history_path=history_path, 355 | weights_path=weights_path, 356 | save_history=True 357 | ) 358 | 359 | # %% 360 | 361 | ### MODEL DEBUGGING ### 362 | 363 | def base_cnn(): 364 | """ Base CNN Feature extractor for 45 second spectrogram 365 | Input to model shape: (SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2) 366 | Output of the model shape: (4, 60, 256) 367 | (Convolved frequency, convolved timestep, feature neurons) 368 | Returns: 369 | tf.keras.Model: Return a model 370 | """ 371 | 372 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)) 373 | 374 | tensor = L.Permute((2, 1, 3))(inputs) 375 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor) 376 | 377 | tensor = L.Conv2D(64, (5,5), padding="valid", name="conv_1_1")(tensor) 378 | tensor = L.ReLU()(tensor) 379 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 380 | tensor = L.Conv2D(64 // 2, (1,1), padding="valid", name="conv_1_2")(tensor) 381 | tensor = L.ReLU()(tensor) 382 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 383 | tensor = L.MaxPool2D(2,2)(tensor) 384 | tensor = L.Dropout(0.1)(tensor) 385 | 386 | tensor = L.Conv2D(128, (5,5), padding="valid", name="conv_2_1")(tensor) 387 | tensor = L.ReLU()(tensor) 388 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 389 | tensor = L.Conv2D(128 // 2, (1,1), padding="valid", name="conv_2_2")(tensor) 390 | tensor = L.ReLU()(tensor) 391 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 392 | tensor = L.MaxPool2D(2,2)(tensor) 393 | tensor = L.Dropout(0.1)(tensor) 394 | 395 | tensor = L.Conv2D(256, (5,5), padding="valid", name="conv_3_1")(tensor) 396 | tensor = L.ReLU()(tensor) 397 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 398 | tensor = L.Conv2D(256 // 2, (1,1), padding="valid", name="conv_3_2")(tensor) 399 | tensor = L.ReLU()(tensor) 400 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 401 | tensor = L.MaxPool2D(2,2)(tensor) 402 | tensor = L.Dropout(0.1)(tensor) 403 | 404 | tensor = L.Conv2D(512, (5,5), padding="valid", name="conv_4_1")(tensor) 405 | tensor = L.ReLU()(tensor) 406 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 407 | tensor = L.Conv2D(512 // 2, (1,1), padding="valid", name="conv_4_2")(tensor) 408 | tensor = L.ReLU()(tensor) 409 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 410 | tensor = L.MaxPool2D(2,2)(tensor) 411 | out = L.Dropout(0.1)(tensor) 412 | model = tf.keras.Model(inputs=inputs, outputs=out, name="base_model") 413 | return model 414 | 415 | class ChannelAttention(tf.keras.layers.Layer): 416 | def __init__(self, neuron: int, ratio: int, use_average=True, **kwargs) -> None: 417 | super().__init__(**kwargs) 418 | self.neuron = neuron 419 | self.ratio = ratio 420 | self.use_average = use_average 421 | 422 | def build(self, input_shape): 423 | """build layers 424 | 425 | Args: 426 | input_shape (tf.shape): the shape of the input 427 | 428 | Returns: 429 | [type]: [description] 430 | """ 431 | assert len(input_shape) == 4, "The input shape to the layer has to be 3D" 432 | self.first_shared_layer = L.Dense(self.neuron // self.ratio, activation="relu", kernel_initializer="he_normal") 433 | self.second_shared_layer = L.Dense(self.neuron, activation="relu", kernel_initializer="he_normal") 434 | 435 | def call(self, inputs): 436 | if self.use_average: 437 | avg_pool_tensor = L.GlobalAveragePooling2D()(inputs) # Shape (batch, filters) 438 | avg_pool_tensor = 
L.Reshape((1,1,-1))(avg_pool_tensor) # Shape (batch, 1, 1, filters) 439 | avg_pool_tensor = self.first_shared_layer(avg_pool_tensor) 440 | avg_pool_tensor = self.second_shared_layer(avg_pool_tensor) 441 | 442 | max_pool_tensor = L.GlobalMaxPool2D()(inputs) # Shape (batch, filters) 443 | max_pool_tensor = L.Reshape((1,1,-1))(max_pool_tensor) # Shape (batch, 1, 1, filters) 444 | max_pool_tensor = self.first_shared_layer(max_pool_tensor) 445 | max_pool_tensor = self.second_shared_layer(max_pool_tensor) 446 | 447 | attention_tensor = L.Add()([avg_pool_tensor, max_pool_tensor]) 448 | attention_tensor = L.Activation("sigmoid")(attention_tensor) 449 | 450 | out = L.Multiply()([inputs, attention_tensor]) # Broadcast element-wise multiply. (batch, height, width, filters) x (batch, 1, 1, neurons) 451 | 452 | return out 453 | else: 454 | max_pool_tensor = L.GlobalMaxPool2D()(inputs) # Shape (batch, filters) 455 | max_pool_tensor = L.Reshape((1,1,-1))(max_pool_tensor) # Shape (batch, 1, 1, filters) 456 | max_pool_tensor = self.first_shared_layer(max_pool_tensor) 457 | max_pool_tensor = self.second_shared_layer(max_pool_tensor) 458 | attention_tensor = L.Activation("sigmoid")(max_pool_tensor) 459 | out = L.Multiply()([inputs, attention_tensor]) 460 | return out 461 | 462 | class SpatialAttention(tf.keras.layers.Layer): 463 | def __init__(self, kernel_size, use_average=True, **kwargs) -> None: 464 | super().__init__(**kwargs) 465 | self.kernel_size = kernel_size 466 | self.use_average = use_average 467 | 468 | def build(self, input_shape): 469 | """build layers 470 | 471 | Args: 472 | input_shape (tf.shape): the shape of the input 473 | 474 | Returns: 475 | [type]: [description] 476 | """ 477 | assert len(input_shape) == 4, "The input shape to the layer has to be 3D" 478 | self.conv_layer = L.Conv2D(1, self.kernel_size, padding="same", activation="relu", 479 | kernel_initializer="he_normal") 480 | 481 | def call(self, inputs): 482 | if self.use_average: 483 | avg_pool_tensor = L.Lambda(lambda x: tf.reduce_mean(x, axis=-1, keepdims=True))(inputs) 484 | max_pool_tensor = L.Lambda(lambda x: tf.reduce_max(x, axis=-1, keepdims=True))(inputs) 485 | concat_tensor = L.Concatenate(axis=-1)([avg_pool_tensor, max_pool_tensor]) 486 | tensor = self.conv_layer(concat_tensor) # shape (height, width, 1) 487 | out = L.Multiply()([inputs, tensor]) # Broadcast element-wise multiply. (batch, height, width, neurons) x (batch, height, width, 1) 488 | 489 | return out 490 | else: 491 | max_pool_tensor = L.Lambda(lambda x: tf.reduce_max(x, axis=-1, keepdims=True))(inputs) 492 | tensor = self.conv_layer(max_pool_tensor) # shape (height, width, 1) 493 | out = L.Multiply()([inputs, tensor]) # Broadcast element-wise multiply. (batch, height, width, neurons) x (batch, height, width, 1) 494 | 495 | return out 496 | 497 | class CBAM_Block(tf.keras.layers.Layer): 498 | """ TODO: Implement Res Block architecture for CBAM Block 499 | 500 | Args: 501 | tf ([type]): [description] 502 | """ 503 | def __init__(self, 504 | channel_attention_filters, 505 | channel_attention_ratio, 506 | spatial_attention_kernel_size, 507 | **kwargs) -> None: 508 | super().__init__(**kwargs) 509 | self.channel_attention_filters = channel_attention_filters 510 | self.channel_attention_ratio = channel_attention_ratio 511 | self.spatial_attention_kernel_size = spatial_attention_kernel_size 512 | 513 | def build(self, input_shape): 514 | assert len(input_shape) == 4, "The shape must be 3D!" 
515 |
516 | # NOTE: The reason why self.channel_attention_filters is put here is because the number 517 | # of neurons of the input to channel attention has to be equal to the number of filters 518 | # in the channel attention 519 | self.conv_1 = L.Conv2D(self.channel_attention_filters * 2, (5,5), padding="same", activation="relu") 520 | self.conv_2 = L.Conv2D(self.channel_attention_filters, (1,1), padding="same", activation="relu") 521 | self.c_att = ChannelAttention(self.channel_attention_filters, self.channel_attention_ratio) 522 | self.s_att = SpatialAttention(self.spatial_attention_kernel_size) 523 |
524 | def call(self, inputs): 525 | # inputs shape (batch, height, width, channel) 526 | tensor = self.conv_1(inputs) # shape (batch, height, width, filters * 2) 527 | tensor = self.conv_2(tensor) # shape (batch, height, width, filters) 528 | tensor = self.c_att(tensor) # shape (batch, height, width, filters) 529 | tensor = self.s_att(tensor) # shape (batch, height, width, filters) 530 | return tensor 531 |
532 | def cbam_1(): 533 | """ CRNN in which each convolution block is followed by channel and spatial attention, added back as a residual connection. Returns a tf.keras.Model. """ 534 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)) 535 | tensor = L.Permute((2, 1, 3))(inputs) 536 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor) 537 |
538 | # tensor = CBAM_Block(32, 2, (5,5))(tensor) 539 | tensor = L.Conv2D(64, (5,5), padding="same", activation="relu")(tensor) 540 | tensor = L.Conv2D(32, (1,1), padding="same", activation="relu")(tensor) 541 | tensor_att_1 = ChannelAttention(32, 2)(tensor) 542 | tensor_att_1 = SpatialAttention((5,5))(tensor_att_1) 543 | tensor = L.Add()([tensor, tensor_att_1]) 544 | tensor = L.MaxPool2D(2,2)(tensor) 545 |
546 | # tensor = CBAM_Block(64, 2, (7,7))(tensor) 547 | tensor = L.Conv2D(128, (5,5), padding="same", activation="relu")(tensor) 548 | tensor = L.Conv2D(64, (1,1), padding="same", activation="relu")(tensor) 549 | tensor_att_2 = ChannelAttention(64, 2)(tensor) 550 | tensor_att_2 = SpatialAttention((7,7))(tensor_att_2) 551 | tensor = L.Add()([tensor, tensor_att_2]) 552 | tensor = L.MaxPool2D(2,2)(tensor) 553 | # tensor = L.BatchNormalization()(tensor) 554 | # tensor = L.Dropout(0.1)(tensor) 555 |
556 | # tensor = CBAM_Block(128, 2, (7,7))(tensor) 557 | tensor = L.Conv2D(256, (5,5), padding="same", activation="relu")(tensor) 558 | tensor = L.Conv2D(128, (1,1), padding="same", activation="relu")(tensor) 559 | tensor_att_3 = ChannelAttention(128, 2)(tensor) 560 | tensor_att_3 = SpatialAttention((7,7))(tensor_att_3) 561 | tensor = L.Add()([tensor, tensor_att_3]) 562 | tensor = L.MaxPool2D(2,2)(tensor) 563 | # tensor = L.BatchNormalization()(tensor) 564 | # tensor = L.Dropout(0.1)(tensor) 565 |
566 | # tensor = CBAM_Block(256, 2, (5,5))(tensor) 567 | tensor = L.Conv2D(512, (5,5), padding="same", activation="relu")(tensor) 568 | tensor = L.Conv2D(256, (1,1), padding="same", activation="relu")(tensor) 569 | tensor_att_4 = ChannelAttention(256, 2)(tensor) 570 | tensor_att_4 = SpatialAttention((5,5))(tensor_att_4) 571 | tensor = L.Add()([tensor, tensor_att_4]) 572 | tensor = L.MaxPool2D(2,2)(tensor) 573 | # tensor = L.BatchNormalization()(tensor) 574 | # tensor = L.Dropout(0.1)(tensor) 575 |
576 | # tensor = CBAM_Block(256, 2, (3,3))(tensor) 577 | tensor = L.Conv2D(512, (5,5), padding="same", activation="relu")(tensor) 578 | tensor = L.Conv2D(256, (1,1), padding="same", activation="relu")(tensor) 579 | tensor_att_5 = ChannelAttention(256, 2)(tensor) 580 | tensor_att_5 = SpatialAttention((3,3))(tensor_att_5) 581 | tensor = L.Add()([tensor, tensor_att_5]) 582 | tensor =
L.MaxPool2D(2,2)(tensor) 583 | # tensor = L.BatchNormalization()(tensor) 584 | # tensor = L.Dropout(0.1)(tensor) 585 | 586 | tensor = L.Permute((2, 1, 3))(tensor) 587 | tensor = L.Reshape((32, 4 * 256))(tensor) 588 | 589 | # tensor = L.GRU(256, activation="tanh", return_sequences=True)(tensor) 590 | # tensor = L.GRU(128, activation="tanh", return_sequences=True)(tensor) 591 | tensor = L.GRU(64, activation="tanh")(tensor) 592 | tensor = L.Dense(512, activation="relu")(tensor) 593 | tensor = L.Dense(256, activation="relu")(tensor) 594 | tensor = L.Dense(64, activation="relu")(tensor) 595 | out = L.Dense(2, activation="relu")(tensor) 596 | 597 | model = tf.keras.Model(inputs=inputs, outputs=out) 598 | return model 599 | 600 | def cbam_2(): 601 | """ No average CBAM 602 | 603 | Returns: 604 | tf.keras.Model: The Model 605 | """ 606 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)) 607 | tensor = L.Permute((2, 1, 3))(inputs) 608 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor) 609 | 610 | # tensor = CBAM_Block(32, 2, (5,5))(tensor) 611 | tensor = L.Conv2D(64, (5,5), padding="same", activation="relu")(tensor) 612 | tensor = L.Conv2D(32, (1,1), padding="same", activation="relu")(tensor) 613 | tensor_att_1 = ChannelAttention(32, 2, use_average=False)(tensor) 614 | tensor_att_1 = SpatialAttention((5,5), use_average=False)(tensor_att_1) 615 | tensor = L.Add()([tensor, tensor_att_1]) 616 | tensor = L.MaxPool2D(2,2)(tensor) 617 | 618 | # tensor = CBAM_Block(64, 2, (7,7))(tensor) 619 | tensor = L.Conv2D(128, (5,5), padding="same", activation="relu")(tensor) 620 | tensor = L.Conv2D(64, (1,1), padding="same", activation="relu")(tensor) 621 | tensor_att_2 = ChannelAttention(64, 2, use_average=False)(tensor) 622 | tensor_att_2 = SpatialAttention((7,7), use_average=False)(tensor_att_2) 623 | tensor = L.Add()([tensor, tensor_att_2]) 624 | tensor = L.MaxPool2D(2,2)(tensor) 625 | # tensor = L.BatchNormalization()(tensor) 626 | # tensor = L.Dropout(0.1)(tensor) 627 | 628 | # tensor = CBAM_Block(128, 2, (7,7))(tensor) 629 | tensor = L.Conv2D(256, (5,5), padding="same", activation="relu")(tensor) 630 | tensor = L.Conv2D(128, (1,1), padding="same", activation="relu")(tensor) 631 | tensor_att_3 = ChannelAttention(128, 2, use_average=False)(tensor) 632 | tensor_att_3 = SpatialAttention((7,7), use_average=False)(tensor_att_3) 633 | tensor = L.Add()([tensor, tensor_att_3]) 634 | tensor = L.MaxPool2D(2,2)(tensor) 635 | # tensor = L.BatchNormalization()(tensor) 636 | # tensor = L.Dropout(0.1)(tensor) 637 | 638 | # tensor = CBAM_Block(256, 2, (5,5))(tensor) 639 | tensor = L.Conv2D(512, (5,5), padding="same", activation="relu")(tensor) 640 | tensor = L.Conv2D(256, (1,1), padding="same", activation="relu")(tensor) 641 | tensor_att_4 = ChannelAttention(256, 2, use_average=False)(tensor) 642 | tensor_att_4 = SpatialAttention((5,5), use_average=False)(tensor_att_4) 643 | tensor = L.Add()([tensor, tensor_att_4]) 644 | tensor = L.MaxPool2D(2,2)(tensor) 645 | # tensor = L.BatchNormalization()(tensor) 646 | # tensor = L.Dropout(0.1)(tensor) 647 | 648 | # tensor = CBAM_Block(256, 2, (3,3))(tensor) 649 | tensor = L.Conv2D(512, (5,5), padding="same", activation="relu")(tensor) 650 | tensor = L.Conv2D(256, (1,1), padding="same", activation="relu")(tensor) 651 | tensor_att_5 = ChannelAttention(256, 2, use_average=False)(tensor) 652 | tensor_att_5 = SpatialAttention((3,3), use_average=False)(tensor_att_5) 653 | tensor = L.Add()([tensor, tensor_att_5]) 654 | tensor = L.MaxPool2D(2,2)(tensor) 655 | # tensor = 
L.BatchNormalization()(tensor) 656 | # tensor = L.Dropout(0.1)(tensor) 657 | 658 | tensor = L.Permute((2, 1, 3))(tensor) 659 | tensor = L.Reshape((32, 4 * 256))(tensor) 660 | 661 | # tensor = L.GRU(256, activation="tanh", return_sequences=True)(tensor) 662 | # tensor = L.GRU(128, activation="tanh", return_sequences=True)(tensor) 663 | tensor = L.LSTM(256, activation="tanh")(tensor) 664 | tensor = L.Dense(512, activation="relu")(tensor) 665 | tensor = L.Dense(256, activation="relu")(tensor) 666 | tensor = L.Dense(64, activation="relu")(tensor) 667 | out = L.Dense(2, activation="relu")(tensor) 668 | 669 | model = tf.keras.Model(inputs=inputs, outputs=out) 670 | return model 671 | 672 | model: tf.keras.Model = cbam_2() 673 | model.summary() 674 | sample_input = tf.ones(shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)) 675 | with tf.device("/CPU:0"): 676 | sample_output = model(sample_input, training=False) 677 | print(sample_output.shape) 678 | 679 | # TODO: Code the CBAM architecture 680 | # TODO: Code the attetion after the CBAM 681 | 682 | 683 | 684 | # %% 685 | 686 | model.load_weights(weights_path) 687 | 688 | # %% 689 | 690 | tf.keras.models.save_model( 691 | model, 692 | "./server/model/my_model", 693 | overwrite=True, 694 | include_optimizer=True, 695 | save_format=None, 696 | signatures=None, 697 | options=None 698 | ) 699 | 700 | 701 | # %%5 702 | 703 | 704 | sample_output.shape 705 | 706 | # %% 707 | 708 | 709 | 710 | 711 | 712 | 713 | 714 | 715 | # %% 716 | 717 | # Plot 718 | with open(history_path, "rb") as f: 719 | [epochs_loss, epochs_val_loss] = np.load(f, allow_pickle=True) 720 | 721 | 722 | e_loss = [k[0] for k in epochs_loss] 723 | 724 | e_all_loss = [] 725 | 726 | id = 0 727 | time_val = [] 728 | for epoch in epochs_loss: 729 | for step in epoch: 730 | e_all_loss.append(step.numpy()) 731 | id += 1 732 | time_val.append(id) 733 | 734 | # %% 735 | 736 | plt.plot(np.arange(0, len(e_all_loss), 1), e_all_loss, label = "train loss") 737 | plt.plot(time_val, epochs_val_loss, label = "val loss") 738 | 739 | # plt.plot(np.arange(1,len(e_loss)+ 1), e_loss, label = "train loss") 740 | # plt.plot(np.arange(1,len(epochs_val_loss)+ 1), epochs_val_loss, label = "val loss") 741 | plt.xlabel("Step") 742 | plt.ylabel("Loss") 743 | plt.legend() 744 | plt.show() 745 | 746 | # %%5 747 | 748 | # model.load_weights(weights_path) 749 | # model.trainable_weights 750 | # y.shape 751 | 752 | 753 | # %% 754 | 755 | 756 | 757 | # model.build(input_shape=(BATCH_SIZE, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL)) 758 | # model.summary() 759 | # %% 760 | 761 | model.save_weights(weights_path) 762 | 763 | 764 | # %% 765 | 766 | 767 | 768 | model.load_weights(weights_path) 769 | 770 | 771 | # %% 772 | 773 | 774 | def evaluate(df_pointer, model, loss_func, play=False): 775 | row = test_df.loc[df_pointer] 776 | song_id = row["song_id"] 777 | valence_mean = row["valence_mean"] 778 | arousal_mean = row["arousal_mean"] 779 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32) 780 | print(f"Label: Valence: {valence_mean}, Arousal: {arousal_mean}") 781 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION) 782 | audio_file = tf.io.read_file(song_path) 783 | waveforms, _ = tf.audio.decode_wav(contents=audio_file) 784 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 785 | spectrograms = None 786 | # Loop through each channel 787 | for i in range(waveforms.shape[-1]): 788 | # Shape (timestep, frequency, 1) 789 | spectrogram = 
get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0]) 790 | if spectrograms == None: 791 | spectrograms = spectrogram 792 | else: 793 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 794 | 795 | spectrograms = spectrograms[tf.newaxis, ...] 796 | 797 | ## Eval 798 | y_pred = model(spectrograms, training=False)[0] 799 | print(f"Predicted y_pred value: Valence: {y_pred[0]}, Arousal: {y_pred[1]}") 800 | 801 | loss = loss_func(label[tf.newaxis, ...], y_pred) 802 | print(f"Loss: {loss}") 803 | 804 | if play: 805 | plot_and_play(waveforms, 0, 40, 0) 806 | 807 | i = 0 808 | 809 | # %% 810 | 811 | i += 1 812 | evaluate(i, model, simple_mae_loss, play=False) 813 | 814 | # %% 815 | 816 | ####### INTERMEDIARY REPRESENTATION ######## 817 | 818 | layer_list = [l for l in model.layers] 819 | debugging_model = tf.keras.Model(inputs=model.inputs, outputs=[l.output for l in layer_list]) 820 | 821 | # %% 822 | 823 | layer_list 824 | 825 | # %% 826 | 827 | test_id = 223 828 | row = test_df.loc[test_id] 829 | song_id = row["song_id"] 830 | valence_mean = row["valence_mean"] 831 | arousal_mean = row["arousal_mean"] 832 | label = tf.convert_to_tensor([valence_mean, arousal_mean], dtype=tf.float32) 833 | print(f"Label: Valence: {valence_mean}, Arousal: {arousal_mean}") 834 | song_path = os.path.join(AUDIO_FOLDER, str(int(song_id)) + SOUND_EXTENSION) 835 | audio_file = tf.io.read_file(song_path) 836 | waveforms, _ = tf.audio.decode_wav(contents=audio_file) 837 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 838 | spectrograms = None 839 | # Loop through each channel 840 | for i in range(waveforms.shape[-1]): 841 | # Shape (timestep, frequency, 1) 842 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0]) 843 | if spectrograms == None: 844 | spectrograms = spectrogram 845 | else: 846 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 847 | 848 | spectrograms = spectrograms[tf.newaxis, ...] 
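# (Added comment.) After adding the batch axis, spectrograms has shape
# (1, SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL) = (1, 15502, 129, 2),
# which is the input layout the model and debugging_model expect in the evaluation cells below.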
849 | 850 | print(label) 851 | # plot_and_play(waveforms, 0, 40, 0) 852 | 853 | ## Eval 854 | y_pred_list = debugging_model(spectrograms, training=False) 855 | print(f"Predicted y_pred value: Valence: {y_pred_list[-1][0, 0]}, Arousal: {y_pred_list[-1][0, 1]}") 856 | 857 | 858 | # %% 859 | 860 | def show_color_mesh(spectrogram): 861 | """ Generate color mesh 862 | 863 | Args: 864 | spectrogram (2D array): Expect shape (Frequency length, time step) 865 | """ 866 | assert len(spectrogram.shape) == 2 867 | log_spec = np.log(spectrogram + np.finfo(float).eps) 868 | height = log_spec.shape[0] 869 | width = log_spec.shape[1] 870 | X = np.linspace(0, np.size(spectrogram), num=width, dtype=int) 871 | Y = range(height) 872 | plt.pcolormesh(X, Y, log_spec) 873 | plt.show() 874 | 875 | show_color_mesh(tf.transpose(spectrograms[0, :, :, 0], [1,0])) 876 | 877 | 878 | # %% 879 | 880 | f, axarr = plt.subplots(8,8, figsize=(25,15)) 881 | CONVOLUTION_NUMBER_LIST = [8, 9, 10, 11, 12, 13, 14, 15] 882 | LAYER_LIST = [10, 11, 12, 13, 16, 17, 18, 19] 883 | for x, CONVOLUTION_NUMBER in enumerate(CONVOLUTION_NUMBER_LIST): 884 | f1 = y_pred_list[LAYER_LIST[0]] 885 | plot_spectrogram(tf.transpose(f1[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[0,x]) 886 | axarr[0,x].grid(False) 887 | f2 = y_pred_list[LAYER_LIST[1]] 888 | plot_spectrogram(tf.transpose(f2[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[1,x]) 889 | axarr[1,x].grid(False) 890 | f3 = y_pred_list[LAYER_LIST[2]] 891 | plot_spectrogram(tf.transpose(f3[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[2,x]) 892 | axarr[2,x].grid(False) 893 | f4 = y_pred_list[LAYER_LIST[3]] 894 | plot_spectrogram(tf.transpose(f4[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[3,x]) 895 | axarr[3,x].grid(False) 896 | 897 | f5 = y_pred_list[LAYER_LIST[4]] 898 | plot_spectrogram(tf.transpose(f5[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[4,x]) 899 | axarr[4,x].grid(False) 900 | f6 = y_pred_list[LAYER_LIST[5]] 901 | plot_spectrogram(tf.transpose(f6[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[5,x]) 902 | axarr[5,x].grid(False) 903 | f7 = y_pred_list[LAYER_LIST[6]] 904 | plot_spectrogram(tf.transpose(f7[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[6,x]) 905 | axarr[6,x].grid(False) 906 | f8 = y_pred_list[LAYER_LIST[7]] 907 | plot_spectrogram(tf.transpose(f8[0, : , :, CONVOLUTION_NUMBER], [1,0]).numpy(), axarr[7,x]) 908 | axarr[7,x].grid(False) 909 | 910 | axarr[0,0].set_ylabel("After convolution layer 1") 911 | axarr[1,0].set_ylabel("After convolution layer 2") 912 | axarr[2,0].set_ylabel("After convolution layer 3") 913 | axarr[3,0].set_ylabel("After convolution layer 7") 914 | 915 | axarr[0,0].set_title("convolution number 0") 916 | axarr[0,1].set_title("convolution number 4") 917 | axarr[0,2].set_title("convolution number 7") 918 | axarr[0,3].set_title("convolution number 23") 919 | 920 | plt.show() 921 | 922 | # %% 923 | 924 | 925 | f4 = y_pred_list[3] 926 | 927 | # %% 928 | 929 | f4.shape 930 | 931 | # %% 932 | 933 | f4 934 | 935 | # %% 936 | 937 | w = layer_list[5].weights[0] 938 | w 939 | # %% 940 | 941 | y_pred_list[-1] 942 | 943 | # %% 944 | 945 | -------------------------------------------------------------------------------- /mer/__init__.py: -------------------------------------------------------------------------------- 1 | from . 
import * -------------------------------------------------------------------------------- /mer/const.py: -------------------------------------------------------------------------------- 1 | DEFAULT_FREQ = 44100 2 | DEFAULT_TIME = 45 3 | WAVE_ARRAY_LENGTH = DEFAULT_FREQ * DEFAULT_TIME 4 | 5 | WINDOW_TIME = 5 6 | WINDOW_SIZE = WINDOW_TIME * DEFAULT_FREQ 7 | 8 | TRAIN_RATIO = 0.8 9 | 10 | BATCH_SIZE = 16 11 | 12 | FREQUENCY_LENGTH = 129 13 | N_CHANNEL = 2 14 | SPECTROGRAM_TIME_LENGTH = 15502 15 | SPECTROGRAM_HALF_SECOND_LENGTH = 171 16 | SPECTROGRAM_5_SECOND_LENGTH = 1721 17 | MFCCS_TIME_LENGTH = 3876 18 | 19 | LEARNING_RATE = 1e-4 20 | 21 | SOUND_EXTENSION = ".wav" 22 | 23 | # The minimum second to be labeled in the dynamics files. 24 | MIN_TIME_END_POINT = 15 -------------------------------------------------------------------------------- /mer/loss.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | def simple_mse_loss(true, pred): 4 | 5 | # loss_valence = tf.reduce_sum(tf.square(true[..., 0] - pred[0][..., 0])) / true.shape[0] # divide by batch size 6 | # loss_arousal = tf.reduce_sum(tf.square(true[..., 1] - pred[1][..., 0])) / true.shape[0] # divide by batch size 7 | 8 | # return loss_valence + loss_arousal 9 | 10 | loss = tf.reduce_sum(tf.square(true - pred)) / true.shape[0] 11 | 12 | return loss 13 | 14 | def simple_mae_loss(true, pred): 15 | loss = tf.reduce_sum(tf.abs(true - pred)) / true.shape[0] 16 | return loss -------------------------------------------------------------------------------- /mer/model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import tensorflow.keras.layers as L 3 | 4 | from .const import * 5 | 6 | class SimpleDenseModel(tf.keras.Model): 7 | def __init__(self, max_timestep, n_freq, n_channel, batch_size, **kwargs): 8 | super().__init__(**kwargs) 9 | self.max_timestep = max_timestep 10 | self.n_freq = n_freq 11 | self.n_channel = n_channel 12 | self.batch_size = batch_size 13 | 14 | self.resize = tf.keras.layers.Resizing(self.n_freq, 1024) 15 | self.flatten = tf.keras.layers.Flatten() 16 | self.dense1 = tf.keras.layers.Dense(512, activation="relu") 17 | self.dense2 = tf.keras.layers.Dense(256, activation="relu") 18 | self.dense3 = tf.keras.layers.Dense(128, activation="relu") 19 | self.dense4 = tf.keras.layers.Dense(64, activation="relu") 20 | self.dense5 = tf.keras.layers.Dense(2, activation="relu") 21 | 22 | def call(self, x): 23 | """ 24 | 25 | Args: 26 | x ([type]): [description] 27 | 28 | Returns: 29 | [type]: [description] 30 | """ 31 | # Condense 32 | tensor = self.resize(x) 33 | tensor = self.flatten(tensor) 34 | tensor = self.dense1(tensor) 35 | tensor = self.dense2(tensor) 36 | tensor = self.dense3(tensor) 37 | tensor = self.dense4(tensor) 38 | out = self.dense5(tensor) 39 | return out 40 | 41 | def model(self): 42 | x = tf.keras.layers.Input(shape=(self.max_timestep, self.n_freq, self.n_channel)) 43 | return tf.keras.Model(inputs=x, outputs=self.call(x)) 44 | 45 | class ConvBlock(tf.keras.Model): 46 | def __init__(self, neurons, **kwargs) -> None: 47 | super().__init__(**kwargs) 48 | self.model = tf.keras.Sequential([ 49 | tf.keras.layers.Conv2D(neurons, (3,3), padding="same"), 50 | tf.keras.layers.LeakyReLU(alpha=0.1), 51 | tf.keras.layers.Conv2D(neurons // 2, (1,1), padding="same"), 52 | tf.keras.layers.LeakyReLU(alpha=0.1), 53 | tf.keras.layers.MaxPool2D(2,2), 54 | tf.keras.layers.Dropout(0.1) 55 | ]) 56 | 57 | 
def call(self, x): 58 | return self.model(x) 59 | 60 | 61 | class SimpleConvModel(tf.keras.Model): 62 | def __init__(self, max_timestep, n_freq, n_channel, batch_size, **kwargs): 63 | super().__init__(**kwargs) 64 | self.max_timestep = max_timestep 65 | self.n_freq = n_freq 66 | self.n_channel = n_channel 67 | self.batch_size = batch_size 68 | 69 | neuron_conv = [64, 128, 256, 512, 1024] 70 | 71 | self.model = tf.keras.Sequential() 72 | self.model.add(tf.keras.layers.Resizing(self.n_freq, 512, input_shape=(max_timestep, n_freq, n_channel))) 73 | for neuron in neuron_conv: 74 | self.model.add(ConvBlock(neuron)) 75 | self.model.add(tf.keras.layers.Flatten()) 76 | self.model.add(tf.keras.layers.Dense(128, activation="relu")) 77 | self.model.add(tf.keras.layers.Dense(2, activation="relu")) 78 | self.model.summary() 79 | 80 | def call(self, x): 81 | """ 82 | 83 | Args: 84 | x ([type]): [description] 85 | 86 | Returns: 87 | [type]: [description] 88 | """ 89 | # Condense 90 | return self.model(x) 91 | 92 | def model(self): 93 | x = tf.keras.layers.Input(shape=(self.max_timestep, self.n_freq, self.n_channel)) 94 | return tf.keras.Model(inputs=x, outputs=self.call(x)) 95 | 96 | 97 | class ConvBlock2(tf.keras.Model): 98 | def __init__(self, neurons, **kwargs) -> None: 99 | super().__init__(**kwargs) 100 | self.model = tf.keras.Sequential([ 101 | tf.keras.layers.Conv2D(neurons, (5,5), padding="valid"), 102 | tf.keras.layers.LeakyReLU(alpha=0.1), 103 | tf.keras.layers.Conv2D(neurons // 2, (1,1), padding="valid"), 104 | tf.keras.layers.LeakyReLU(alpha=0.1), 105 | tf.keras.layers.MaxPool2D(2,2), 106 | tf.keras.layers.Dropout(0.1) 107 | ]) 108 | 109 | def call(self, x): 110 | return self.model(x) 111 | 112 | def Simple_CRNN(): 113 | """[summary] 114 | 115 | Args: 116 | inputs (tf.Tensor): Expect tensor shape (batch, width, height, channel) 117 | 118 | Returns: 119 | [type]: [description] 120 | """ 121 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)) 122 | tensor = L.Permute((2, 1, 3))(inputs) 123 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor) 124 | 125 | tensor = L.Conv2D(64, (5,5), padding="valid")(tensor) 126 | tensor = L.LeakyReLU(alpha=0.1)(tensor) 127 | tensor = L.Conv2D(64 // 2, (1,1), padding="valid")(tensor) 128 | tensor = L.LeakyReLU(alpha=0.1)(tensor) 129 | tensor = L.MaxPool2D(2,2)(tensor) 130 | tensor = L.Dropout(0.1)(tensor) 131 | 132 | tensor = L.Conv2D(128, (5,5), padding="valid")(tensor) 133 | tensor = L.LeakyReLU(alpha=0.1)(tensor) 134 | tensor = L.Conv2D(128 // 2, (1,1), padding="valid")(tensor) 135 | tensor = L.LeakyReLU(alpha=0.1)(tensor) 136 | tensor = L.MaxPool2D(2,2)(tensor) 137 | tensor = L.Dropout(0.1)(tensor) 138 | 139 | tensor = L.Conv2D(256, (5,5), padding="valid")(tensor) 140 | tensor = L.LeakyReLU(alpha=0.1)(tensor) 141 | tensor = L.Conv2D(256 // 2, (1,1), padding="valid")(tensor) 142 | tensor = L.LeakyReLU(alpha=0.1)(tensor) 143 | tensor = L.MaxPool2D(2,2)(tensor) 144 | tensor = L.Dropout(0.1)(tensor) 145 | 146 | tensor = L.Conv2D(512, (5,5), padding="valid")(tensor) 147 | tensor = L.LeakyReLU(alpha=0.1)(tensor) 148 | tensor = L.Conv2D(512 // 2, (1,1), padding="valid")(tensor) 149 | tensor = L.LeakyReLU(alpha=0.1)(tensor) 150 | tensor = L.MaxPool2D(2,2)(tensor) 151 | tensor = L.Dropout(0.1)(tensor) 152 | 153 | tensor = L.MaxPool2D(pool_size=(2,1), strides=(2,1))(tensor) 154 | tensor = L.Conv2D(512, (2,2), padding="valid")(tensor) 155 | tensor = L.LeakyReLU(alpha=0.1)(tensor) 156 | tensor = L.Dropout(0.1)(tensor) 157 | 158 | tensor = 
L.Reshape((59, 512))(tensor) 159 | tensor = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor) 160 | tensor = L.Bidirectional(L.LSTM(64, return_sequences=True))(tensor) 161 | tensor = L.Bidirectional(L.LSTM(32))(tensor) 162 | tensor = L.Dense(128, activation="relu")(tensor) 163 | out = L.Dense(2, activation="relu")(tensor) 164 | 165 | model = tf.keras.Model(inputs=inputs, outputs=out) 166 | return model 167 | 168 | def Simple_CRNN_2(): 169 | """[summary] 170 | 171 | Args: 172 | inputs (tf.Tensor): Expect tensor shape (batch, width, height, channel) 173 | 174 | Returns: 175 | [type]: [description] 176 | """ 177 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)) 178 | tensor = L.Permute((2, 1, 3))(inputs) 179 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor) 180 | 181 | tensor = L.Conv2D(64, (5,5), padding="valid")(tensor) 182 | tensor = L.ReLU()(tensor) 183 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 184 | tensor = L.Conv2D(64 // 2, (1,1), padding="valid")(tensor) 185 | tensor = L.ReLU()(tensor) 186 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 187 | tensor = L.MaxPool2D(2,2)(tensor) 188 | tensor = L.Dropout(0.1)(tensor) 189 | 190 | tensor = L.Conv2D(128, (5,5), padding="valid")(tensor) 191 | tensor = L.ReLU()(tensor) 192 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 193 | tensor = L.Conv2D(128 // 2, (1,1), padding="valid")(tensor) 194 | tensor = L.ReLU()(tensor) 195 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 196 | tensor = L.MaxPool2D(2,2)(tensor) 197 | tensor = L.Dropout(0.1)(tensor) 198 | 199 | tensor = L.Conv2D(256, (5,5), padding="valid")(tensor) 200 | tensor = L.ReLU()(tensor) 201 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 202 | tensor = L.Conv2D(256 // 2, (1,1), padding="valid")(tensor) 203 | tensor = L.ReLU()(tensor) 204 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 205 | tensor = L.MaxPool2D(2,2)(tensor) 206 | tensor = L.Dropout(0.1)(tensor) 207 | 208 | tensor = L.Conv2D(512, (5,5), padding="valid")(tensor) 209 | tensor = L.ReLU()(tensor) 210 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 211 | tensor = L.Conv2D(512 // 2, (1,1), padding="valid")(tensor) 212 | tensor = L.ReLU()(tensor) 213 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 214 | tensor = L.MaxPool2D(2,2)(tensor) 215 | tensor = L.Dropout(0.1)(tensor) 216 | 217 | tensor = L.Permute((2, 1, 3))(tensor) 218 | tensor = L.Reshape((60, 4 * 256))(tensor) 219 | 220 | # tensor = L.MaxPool2D(pool_size=(2,1), strides=(2,1))(tensor) 221 | # tensor = L.Conv2D(512, (2,2), padding="valid")(tensor) 222 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 223 | # out = L.Dropout(0.1)(tensor) 224 | 225 | # tensor = L.Bidirectional(L.LSTM(128, return_sequences=True, activation="sigmoid"))(tensor) 226 | # tensor = L.Bidirectional(L.LSTM(128, return_sequences=True, activation="sigmoid"))(tensor) 227 | tensor = L.Bidirectional(L.LSTM(128, activation="tanh"))(tensor) 228 | tensor = L.Dense(512, activation="relu")(tensor) 229 | tensor = L.Dense(256, activation="relu")(tensor) 230 | tensor = L.Dense(64, activation="relu")(tensor) 231 | out = L.Dense(2, activation="relu")(tensor) 232 | 233 | # tensor_1 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor) 234 | # tensor_1 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor_1) 235 | # tensor_1 = L.Bidirectional(L.LSTM(128))(tensor_1) 236 | # tensor_1 = L.Dense(512, activation="relu")(tensor_1) 237 | # tensor_1 = L.Dense(256, activation="relu")(tensor_1) 238 | # tensor_1 = L.Dense(64, activation="relu")(tensor_1) 239 | # out_1 = L.Dense(1, 
activation="relu")(tensor_1) 240 | 241 | # tensor_2 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor) 242 | # tensor_2 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor_2) 243 | # tensor_2 = L.Bidirectional(L.LSTM(128))(tensor_2) 244 | # tensor_2 = L.Dense(512, activation="relu")(tensor_2) 245 | # tensor_2 = L.Dense(256, activation="relu")(tensor_2) 246 | # tensor_2 = L.Dense(64, activation="relu")(tensor_2) 247 | # out_2 = L.Dense(1, activation="relu")(tensor_2) 248 | 249 | 250 | model = tf.keras.Model(inputs=inputs, outputs=out) 251 | return model 252 | 253 | 254 | def Simple_CRNN_3(): 255 | """ CRNN that uses GRU 256 | 257 | Args: 258 | inputs (tf.Tensor): Expect tensor shape (batch, width, height, channel) 259 | 260 | Returns: 261 | [type]: [description] 262 | """ 263 | inputs = L.Input(shape=(SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, 2)) 264 | tensor = L.Permute((2, 1, 3))(inputs) 265 | tensor = L.Resizing(FREQUENCY_LENGTH, 1024)(tensor) 266 | 267 | tensor = L.Conv2D(64, (5,5), padding="valid")(tensor) 268 | tensor = L.ReLU()(tensor) 269 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 270 | tensor = L.Conv2D(64 // 2, (1,1), padding="valid")(tensor) 271 | tensor = L.ReLU()(tensor) 272 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 273 | tensor = L.MaxPool2D(2,2)(tensor) 274 | tensor = L.Dropout(0.1)(tensor) 275 | 276 | tensor = L.Conv2D(128, (5,5), padding="valid")(tensor) 277 | tensor = L.ReLU()(tensor) 278 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 279 | tensor = L.Conv2D(128 // 2, (1,1), padding="valid")(tensor) 280 | tensor = L.ReLU()(tensor) 281 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 282 | tensor = L.MaxPool2D(2,2)(tensor) 283 | tensor = L.Dropout(0.1)(tensor) 284 | 285 | tensor = L.Conv2D(256, (5,5), padding="valid")(tensor) 286 | tensor = L.ReLU()(tensor) 287 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 288 | tensor = L.Conv2D(256 // 2, (1,1), padding="valid")(tensor) 289 | tensor = L.ReLU()(tensor) 290 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 291 | tensor = L.MaxPool2D(2,2)(tensor) 292 | tensor = L.Dropout(0.1)(tensor) 293 | 294 | tensor = L.Conv2D(512, (5,5), padding="valid")(tensor) 295 | tensor = L.ReLU()(tensor) 296 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 297 | tensor = L.Conv2D(512 // 2, (1,1), padding="valid")(tensor) 298 | tensor = L.ReLU()(tensor) 299 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 300 | tensor = L.MaxPool2D(2,2)(tensor) 301 | tensor = L.Dropout(0.1)(tensor) 302 | 303 | tensor = L.Permute((2, 1, 3))(tensor) 304 | tensor = L.Reshape((60, 4 * 256))(tensor) 305 | 306 | # tensor = L.MaxPool2D(pool_size=(2,1), strides=(2,1))(tensor) 307 | # tensor = L.Conv2D(512, (2,2), padding="valid")(tensor) 308 | # tensor = L.LeakyReLU(alpha=0.1)(tensor) 309 | # out = L.Dropout(0.1)(tensor) 310 | 311 | tensor = L.GRU(256, activation="tanh", return_sequences=True)(tensor) 312 | tensor = L.GRU(128, activation="tanh", return_sequences=True)(tensor) 313 | tensor = L.GRU(64, activation="tanh")(tensor) 314 | tensor = L.Dense(512, activation="relu")(tensor) 315 | tensor = L.Dense(256, activation="relu")(tensor) 316 | tensor = L.Dense(64, activation="relu")(tensor) 317 | out = L.Dense(2, activation="relu")(tensor) 318 | 319 | # tensor_1 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor) 320 | # tensor_1 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor_1) 321 | # tensor_1 = L.Bidirectional(L.LSTM(128))(tensor_1) 322 | # tensor_1 = L.Dense(512, activation="relu")(tensor_1) 323 | # tensor_1 = L.Dense(256, activation="relu")(tensor_1) 
324 | # tensor_1 = L.Dense(64, activation="relu")(tensor_1) 325 | # out_1 = L.Dense(1, activation="relu")(tensor_1) 326 | 327 | # tensor_2 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor) 328 | # tensor_2 = L.Bidirectional(L.LSTM(128, return_sequences=True))(tensor_2) 329 | # tensor_2 = L.Bidirectional(L.LSTM(128))(tensor_2) 330 | # tensor_2 = L.Dense(512, activation="relu")(tensor_2) 331 | # tensor_2 = L.Dense(256, activation="relu")(tensor_2) 332 | # tensor_2 = L.Dense(64, activation="relu")(tensor_2) 333 | # out_2 = L.Dense(1, activation="relu")(tensor_2) 334 | 335 | 336 | model = tf.keras.Model(inputs=inputs, outputs=out) 337 | return model -------------------------------------------------------------------------------- /mer/utils.py: -------------------------------------------------------------------------------- 1 | 2 | import tensorflow as tf 3 | import numpy as np 4 | import os 5 | import pandas as pd 6 | import matplotlib.pyplot as plt 7 | import sounddevice as sd 8 | 9 | from .const import * 10 | 11 | def get_spectrogram(waveform, input_len=44100): 12 | """ Check out https://www.tensorflow.org/io/tutorials/audio 13 | 14 | Args: 15 | waveform ([type]): Expect waveform array of shape (>44100,) 16 | input_len (int, optional): [description]. Defaults to 44100. 17 | 18 | Returns: 19 | Tensor: Spectrogram of the 1D waveform. Shape (freq, time, 1) 20 | """ 21 | max_zero_padding = min(input_len, tf.shape(waveform)) 22 | # Zero-padding for an audio waveform with less than 44,100 samples. 23 | waveform = waveform[:input_len] 24 | zero_padding = tf.zeros( 25 | (input_len - max_zero_padding), 26 | dtype=tf.float32) 27 | # Cast the waveform tensors' dtype to float32. 28 | waveform = tf.cast(waveform, dtype=tf.float32) 29 | # Concatenate the waveform with `zero_padding`, which ensures all audio 30 | # clips are of the same length. 31 | equal_length = tf.concat([waveform, zero_padding], 0) 32 | # Convert the waveform to a spectrogram via a STFT. 33 | spectrogram = tf.signal.stft( 34 | equal_length, frame_length=255, frame_step=128) 35 | # Obtain the magnitude of the STFT. 36 | spectrogram = tf.abs(spectrogram) 37 | # Add a `channels` dimension, so that the spectrogram can be used 38 | # as image-like input data with convolution layers (which expect 39 | # shape (`batch_size`, `height`, `width`, `channels`). 40 | spectrogram = spectrogram[..., tf.newaxis] 41 | return spectrogram 42 | 43 | def plot_spectrogram(spectrogram, ax): 44 | """ Check out https://www.tensorflow.org/io/tutorials/audio 45 | 46 | Args: 47 | spectrogram ([type]): Expect shape (time step, frequency) 48 | ax (plt.axes[i]): [description] 49 | """ 50 | if len(spectrogram.shape) > 2: 51 | assert len(spectrogram.shape) == 3 52 | spectrogram = np.squeeze(spectrogram, axis=-1) 53 | # Convert the frequencies to log scale and transpose, so that the time is 54 | # represented on the x-axis (columns). 55 | # Add an epsilon to avoid taking a log of zero. 56 | log_spec = np.log(spectrogram.T + np.finfo(float).eps) 57 | height = log_spec.shape[0] 58 | width = log_spec.shape[1] 59 | X = np.linspace(0, np.size(spectrogram), num=width, dtype=int) 60 | Y = range(height) 61 | ax.pcolormesh(X, Y, log_spec) 62 | 63 | def load_metadata(csv_folder): 64 | """ Pandas load multiple csv file and concat them into one df. 65 | 66 | Args: 67 | csv_folder (str): Path to the csv folder 68 | 69 | Returns: 70 | pd.DataFrame: The concatnated one! 
71 | """ 72 | global_df = pd.DataFrame() 73 | for i, fname in enumerate(os.listdir(csv_folder)): 74 | # headers: song_id, valence_mean, valence_std, arousal_mean, arousal_std 75 | df = pd.read_csv(os.path.join(csv_folder, fname), sep=r"\s*,\s*", engine="python") 76 | global_df = pd.concat([global_df, df], axis=0) 77 | 78 | # Reset the index 79 | global_df = global_df.reset_index(drop=True) 80 | 81 | return global_df 82 | 83 | def split_train_test(df: pd.DataFrame, train_ratio: float): 84 | train_size = int(len(df) * train_ratio) 85 | train_df: pd.DataFrame = df[:train_size] 86 | train_df = train_df.reset_index(drop=True) 87 | test_df: pd.DataFrame = df[train_size:] 88 | test_df = test_df.reset_index(drop=True) 89 | return train_df, test_df 90 | 91 | def plot_and_play(test_audio, second_id = 24.0, second_length = 1, channel = 0): 92 | """ Plot and play 93 | 94 | Args: 95 | test_audio ([type]): [description] 96 | second_id (float, optional): [description]. Defaults to 24.0. 97 | second_length (int, optional): [description]. Defaults to 1. 98 | channel (int, optional): [description]. Defaults to 0. 99 | """ 100 | # Spectrogram of one second 101 | from_id = int(DEFAULT_FREQ * second_id) 102 | to_id = min(int(DEFAULT_FREQ * (second_id + second_length)), test_audio.shape[0]) 103 | 104 | test_spectrogram = get_spectrogram(test_audio[from_id:, channel], input_len=int(DEFAULT_FREQ * second_length)) 105 | print(test_spectrogram.shape) 106 | fig, axes = plt.subplots(2, figsize=(12, 8)) 107 | timescale = np.arange(to_id - from_id) 108 | axes[0].plot(timescale, test_audio[from_id:to_id, channel].numpy()) 109 | axes[0].set_title('Waveform') 110 | axes[0].set_xlim([0, int(DEFAULT_FREQ * second_length)]) 111 | 112 | plot_spectrogram(test_spectrogram.numpy(), axes[1]) 113 | axes[1].set_title('Spectrogram') 114 | plt.show() 115 | 116 | # Play sound 117 | sd.play(test_audio[from_id: to_id, channel], blocking=True) 118 | 119 | def preprocess_waveforms(waveforms, input_len): 120 | """ Get the first input_len value of the waveforms, if not exist, pad it with 0. 
121 | 122 | Args: 123 | waveforms ([type]): [description] 124 | input_len ([type]): [description] 125 | 126 | Returns: 127 | [type]: [description] 128 | """ 129 | n_channel = waveforms.shape[-1] 130 | preprocessed = np.zeros((input_len, n_channel)) 131 | if input_len <= waveforms.shape[0]: 132 | preprocessed = waveforms[:input_len, :] 133 | else: 134 | preprocessed[:waveforms.shape[0], :] = waveforms 135 | return tf.convert_to_tensor(preprocessed) 136 | 137 | def tanh_to_sigmoid(inputs): 138 | """ Convert from tanh range to sigmoid range 139 | 140 | Args: 141 | inputs (): number of np array of number 142 | 143 | Returns: 144 | number or array-like object: changed range object 145 | """ 146 | return (inputs + 1.0) / 2.0 147 | 148 | def get_CAM(model, img, actual_label, loss_func, layer_name='block5_conv3'): 149 | 150 | model_grad = tf.keras.Model(model.inputs, 151 | [model.get_layer(layer_name).output, model.output]) 152 | 153 | with tf.GradientTape() as tape: 154 | conv_output_values, predictions = model_grad(img) 155 | 156 | # watch the conv_output_values 157 | tape.watch(conv_output_values) 158 | 159 | # Calculate loss as in the loss func 160 | try: 161 | loss, _ = loss_func(actual_label, predictions) 162 | except: 163 | loss = loss_func(actual_label, predictions) 164 | print(f"Loss: {loss}") 165 | 166 | # get the gradient of the loss with respect to the outputs of the last conv layer 167 | grads_values = tape.gradient(loss, conv_output_values) 168 | grads_values = tf.reduce_mean(grads_values, axis=(0,1,2)) 169 | 170 | conv_output_values = np.squeeze(conv_output_values.numpy()) 171 | grads_values = grads_values.numpy() 172 | 173 | # weight the convolution outputs with the computed gradients 174 | for i in range(conv_output_values.shape[-1]): 175 | conv_output_values[:,:,i] *= grads_values[i] 176 | heatmap = np.mean(conv_output_values, axis=-1) 177 | 178 | heatmap = np.maximum(heatmap, 0) 179 | heatmap /= heatmap.max() 180 | 181 | del model_grad, conv_output_values, grads_values, loss 182 | 183 | return heatmap -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib==3.5.1 2 | numpy==1.19.5 3 | opencv-python==4.5.4.60 4 | pandas==1.0.5 5 | Pillow==8.4.0 6 | scipy==1.7.3 7 | seaborn==0.11.2 8 | sounddevice==0.4.3 9 | tensorflow==2.6.0 -------------------------------------------------------------------------------- /server/.gitignore: -------------------------------------------------------------------------------- 1 | model -------------------------------------------------------------------------------- /server/README.md: -------------------------------------------------------------------------------- 1 | # Server for serving music emotion recognition model 2 | 3 | ## Requirements: 4 | 1. System: Window, Linux, and Mac. 5 | 2. Python 6 | 7 | ## Running the server: 8 | 1. Export the model to `server/model/my_model` 9 | 2. Running 10 | ``` 11 | cd server 12 | pip install -r requirements.txt 13 | python app.py 14 | ``` 15 | 3. 
Or to start a deployment server, run: 16 | ``` 17 | uvicorn app:app --host 0.0.0.0 --port 80 18 | ``` 19 | 20 | -------------------------------------------------------------------------------- /server/app.py: -------------------------------------------------------------------------------- 1 | import uvicorn 2 | from fastapi import FastAPI, File, UploadFile 3 | from fastapi.responses import HTMLResponse 4 | import tensorflow as tf 5 | import numpy as np 6 | 7 | app = FastAPI() 8 | model = tf.keras.models.load_model("./model/my_model/") 9 | 10 | DEFAULT_FREQ = 44100 11 | DEFAULT_TIME = 45 12 | WAVE_ARRAY_LENGTH = DEFAULT_FREQ * DEFAULT_TIME 13 | FREQUENCY_LENGTH = 129 14 | N_CHANNEL = 2 15 | SPECTROGRAM_TIME_LENGTH = 15502 16 | 17 | def preprocess_waveforms(waveforms, input_len): 18 | """ Get the first input_len value of the waveforms, if not exist, pad it with 0. 19 | 20 | Args: 21 | waveforms ([type]): [description] 22 | input_len ([type]): [description] 23 | 24 | Returns: 25 | [type]: [description] 26 | """ 27 | n_channel = waveforms.shape[-1] 28 | preprocessed = np.zeros((input_len, n_channel)) 29 | if input_len <= waveforms.shape[0]: 30 | preprocessed = waveforms[:input_len, :] 31 | else: 32 | preprocessed[:waveforms.shape[0], :] = waveforms 33 | return tf.convert_to_tensor(preprocessed) 34 | 35 | def get_spectrogram(waveform, input_len=44100): 36 | """ Check out https://www.tensorflow.org/io/tutorials/audio 37 | 38 | Args: 39 | waveform ([type]): Expect waveform array of shape (>44100,) 40 | input_len (int, optional): [description]. Defaults to 44100. 41 | 42 | Returns: 43 | Tensor: Spectrogram of the 1D waveform. Shape (freq, time, 1) 44 | """ 45 | max_zero_padding = min(input_len, tf.shape(waveform)) 46 | # Zero-padding for an audio waveform with less than 44,100 samples. 47 | waveform = waveform[:input_len] 48 | zero_padding = tf.zeros( 49 | (input_len - max_zero_padding), 50 | dtype=tf.float32) 51 | # Cast the waveform tensors' dtype to float32. 52 | waveform = tf.cast(waveform, dtype=tf.float32) 53 | # Concatenate the waveform with `zero_padding`, which ensures all audio 54 | # clips are of the same length. 55 | equal_length = tf.concat([waveform, zero_padding], 0) 56 | # Convert the waveform to a spectrogram via a STFT. 57 | spectrogram = tf.signal.stft( 58 | equal_length, frame_length=255, frame_step=128) 59 | # Obtain the magnitude of the STFT. 60 | spectrogram = tf.abs(spectrogram) 61 | # Add a `channels` dimension, so that the spectrogram can be used 62 | # as image-like input data with convolution layers (which expect 63 | # shape (`batch_size`, `height`, `width`, `channels`). 64 | spectrogram = spectrogram[..., tf.newaxis] 65 | return spectrogram 66 | 67 | def predict(sound: bytes): 68 | waveforms, _ = tf.audio.decode_wav(contents=sound) 69 | # Pad to max 45 second. 
Shape (n_samples, n_channels) 70 | waveforms = preprocess_waveforms(waveforms, WAVE_ARRAY_LENGTH) 71 | # Work on building the spectrogram 72 | # Shape (timestep, frequency, n_channel) 73 | spectrograms = None 74 | # Loop through each channel 75 | for i in range(waveforms.shape[-1]): 76 | # Shape (timestep, frequency, 1) 77 | spectrogram = get_spectrogram(waveforms[..., i], input_len=waveforms.shape[0]) 78 | # spectrogram = tf.convert_to_tensor(np.log(spectrogram.numpy() + np.finfo(float).eps)) 79 | if spectrograms is None: 80 | spectrograms = spectrogram 81 | else: 82 | spectrograms = tf.concat([spectrograms, spectrogram], axis=-1) 83 | 84 |
85 | padded_spectrogram = np.zeros((SPECTROGRAM_TIME_LENGTH, FREQUENCY_LENGTH, N_CHANNEL), dtype=float) 86 | # spectrograms = spectrograms[tf.newaxis, ...] 87 | # Some spectrograms are not the same shape, so pad them to a fixed size 88 | padded_spectrogram[:spectrograms.shape[0], :spectrograms.shape[1], :] = spectrograms 89 |
90 | sample_input = tf.convert_to_tensor(padded_spectrogram) 91 | prediction = model(sample_input[tf.newaxis, ...], training=False)[0, ...] 92 | return prediction 93 |
94 | @app.post("/predict/sound") 95 | async def predict_api(file: UploadFile = File(...)): 96 | extension = file.filename.split(".")[-1] in ("wav",) # tuple membership, not a substring check 97 | if not extension: 98 | return "Sound must be in wav format!" 99 | sound = await file.read() 100 | prediction = predict(sound) 101 | # print(prediction) 102 | return str(prediction) 103 |
104 | @app.get("/", response_class=HTMLResponse) 105 | async def index(): 106 | return """ 107 | 108 | 109 | Chào Linh Nhím 110 | 111 | 112 |

Hello Linh Nhím!

113 | 114 | 115 | """ 116 | 117 | if __name__ == "__main__": 118 | # uvicorn.run(app, debug=True) 119 | uvicorn.run("app:app", host="0.0.0.0", port=80, reload=True) -------------------------------------------------------------------------------- /server/requirements.txt: -------------------------------------------------------------------------------- 1 | # ============ AI LIB ============= # 2 | matplotlib==3.5.1 3 | numpy==1.19.5 4 | opencv-python==4.5.4.60 5 | pandas==1.0.5 6 | Pillow==8.4.0 7 | scipy==1.7.3 8 | seaborn==0.11.2 9 | sounddevice==0.4.3 10 | tensorflow==2.6.0 11 | 12 | # ============ WEB SERVER ============= # 13 | fastapi==0.72.0 14 | uvicorn==0.17.0 15 | python-multipart==0.0.5 -------------------------------------------------------------------------------- /server/utils.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | -------------------------------------------------------------------------------- /wav_converter.py: -------------------------------------------------------------------------------- 1 | import os 2 | from os import path 3 | import sys 4 | from pydub import AudioSegment 5 | 6 | if __name__ == "__main__": 7 | 8 | src = sys.argv[1] 9 | dst = sys.argv[2] 10 | AudioSegment.converter = sys.argv[3] 11 | 12 | for i, fname in enumerate(os.listdir(src)): 13 | try: 14 | # convert mp3 to wav 15 | sound = AudioSegment.from_mp3(os.path.join(src, fname)) 16 | sound.export(os.path.join(dst, fname[:-4] + ".wav"), format="wav") 17 | print(f"Exported to {os.path.join(dst, fname[:-4] + '.wav')}" ) 18 | except: 19 | print(f"Cannot convert {fname} to wav.") 20 | --------------------------------------------------------------------------------
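Example client for the prediction endpoint (an added illustration, not part of the repository): the sketch below shows one way to call the `/predict/sound` route served by `server/app.py`, assuming the server is running locally on port 80 as in the `uvicorn` command above, and that the `requests` package is installed (it is not listed in either requirements file). The wav path is a placeholder.
```
import requests

# Placeholder path; any wav file produced by wav_converter.py should work.
WAV_PATH = "./dataset/DEAM/wav/10.wav"

with open(WAV_PATH, "rb") as f:
    # The endpoint expects a multipart upload under the form field "file".
    response = requests.post(
        "http://localhost:80/predict/sound",
        files={"file": ("10.wav", f, "audio/wav")},
    )

# The server returns the predicted (valence, arousal) pair as a string.
print(response.text)
```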