├── audio_captioning_1.1_final_submission.pdf ├── imgs └── ismir2016-ldb-captioning-diagram1-for-web.png ├── recurrentnet.py └── README.md /audio_captioning_1.1_final_submission.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/keunwoochoi/ismir2016-ldb-audio-captioning-model-keras/HEAD/audio_captioning_1.1_final_submission.pdf -------------------------------------------------------------------------------- /imgs/ismir2016-ldb-captioning-diagram1-for-web.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/keunwoochoi/ismir2016-ldb-audio-captioning-model-keras/HEAD/imgs/ismir2016-ldb-captioning-diagram1-for-web.png -------------------------------------------------------------------------------- /recurrentnet.py: -------------------------------------------------------------------------------- 1 | # work with keras 1.0.6, 5th Aug 2016, Keunwoo Choi 2 | import keras 3 | from keras.models import Model 4 | from keras.layers import Input, GRU, Dense 5 | from keras.layers import RepeatVector, merge 6 | 7 | def get_model_approach_1(maxlen_enc, maxlen_dec, loss_function): 8 | '''Build an rnn model for audio summarisation. 9 | This model is trained per song-album pair. 10 | 11 | Parameters 12 | ---------- 13 | maxlen_enc: integer, length of encoder RNN (max number of tracks in an album) 14 | maxlen_dec: integer, length of decoder RNN (max number of words of album descriptions) 15 | loss_function: string, e.g. 'mse', 'kld', 'binary_crossentropy' or whatever keras supports. 16 | 17 | return: it returns a keras model. 18 | 19 | Encoder 20 | ---------- 21 | Encoder one input, which is a list of track_feature. 22 | (track feature=concat(np.mean(word_embeddings), tag_pred)). 23 | Word embeddings are the w2v vectors) of the words in the metadata/descriptions of a track. 24 | tag_pred is a (50-dim) vector from the audio. 25 | Word embeddings are averaged to summarise each track. 26 | The output of encoder is a single vector that is a summary of the song - aka a context vector 27 | 28 | Decoder 29 | ---------- 30 | Decoder takes two inputs: a context vector and album_seq[:-1]. 31 | The context vector is repeated and concatenated to album_seq[:-1]. 32 | The output is album_seq[1:]. 33 | ''' 34 | n_hidden_enc = 128 35 | n_hidden_dec = 64 36 | dim_w2v = 300 # this is just a common dimension length. 37 | dim_tag = 50 # because my convnet is trained to predict 50 tags. 38 | 39 | feats_input = Input(shape=(maxlen_enc, dim_w2v+dim_tag), name='feats_input') # feature for track, which is 300+50 dim 40 | 41 | album_seq_input = Input(shape=(maxlen_dec, dim_w2v), name='album_seq_input') # album_seq[:-1], for training of decoder 42 | 43 | # input 44 | encoded = GRU(n_hidden_enc, return_sequences=True, input_shape=(maxlen_enc, dim_w2v+dim_tag))(feats_input) 45 | encoded = GRU(n_hidden_enc, return_sequences=False)(encoded) 46 | # context vector 47 | context = RepeatVector(maxlen_dec)(encoded) 48 | 49 | # decoder 50 | x_decoder = merge([album_seq_input, context], mode='concat', concat_axis=2) 51 | 52 | decoded = GRU(n_hidden_dec, return_sequences=True, input_shape=(maxlen_dec, n_hidden_enc))(x_decoder) 53 | decoded = GRU(dim_w2v, return_sequences=True)(decoded) 54 | 55 | # model 56 | model = Model(input=[feats_input, album_seq_input], output=[decoded]) 57 | optimiser = keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08) 58 | model.compile(optimizer=optimiser, loss=loss_function) 59 | 60 | return model 61 | 62 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ismir2016-ldb-audio-captioning-model-keras 2 | Audio captioning model in Keras 3 | 4 | ### What does it do 5 | This is a general sequence-to-sequence model. 6 | 7 | Details are in [my ISMIR 2016 LDB extended abstract](https://github.com/keunwoochoi/ismir2016-ldb-audio-captioning-model-keras/blob/master/audio_captioning_1.1_final_submission.pdf) 8 | 9 | ### Usage 10 | TODO 11 | 12 | # More details 13 | 14 | ### Motivations 15 | ```python 16 | if recommendation.type == playlist: 17 | if generation_method == automatic: 18 | raise("No descrption! Users don't understand what the playlists are about and confused. Consider generating descriptions so that music discovery can be easier! ") 19 | ``` 20 | E.g., 21 | 22 | * Playfully, silly or sublime--this is the sound of Paul in love. (for [PaulMcCartney Ballads playlist, Apple music](https://itunes.apple.com/us/playlist/michael-jackson-love-songs/idpl.8058d87c60b647a7bc81185b9f59e4c2)) 23 | * Just the right blend of chilled-out acoustic songs to work, relax, and dream to (for [Your Coffee Break, Spotify playlist](https://open.spotify.com/user/spotify_uk_/playlist/48910w3L1DNiqvMHbUfZyY)) 24 | 25 | ### Backgrounds 26 | #### Previously, 27 | 28 | * Tags prediction for track *[(Eck et al., 2008)](http://papers.nips.cc/paper/3370-automatic-generation-of-social-tags-for-music-recommendation)* 29 | * Tags for playlists *[(Fields et al., 2010)](http://research.gold.ac.uk/8793/)* 30 | * Visual avatars for tracks *[(Bogdanov et al, 2013)](https://scholar.google.co.kr/scholar?cluster=14015036933742269998&hl=en&as_sdt=0,33)* 31 | 32 | #### Some techniques we can use 33 | * **RNNs** for sequence modelling 34 | * **Seq2Seq** which uses RNNs to model the relationships between two sequences, e.g., two sentences in different languges 35 | * **Word2vec** or anyother word-embedding methods 36 | * **ConvNet** Convolutional neural networks for various tasks including music 37 | 38 | ### The proposed structure 39 | ![structure alt text](https://github.com/keunwoochoi/ismir2016-ldb-audio-captioning-model-keras/blob/master/imgs/ismir2016-ldb-captioning-diagram1-for-web.png "structure") 40 | * Input 41 | - A sequence of track features 42 | - A track feature: 43 | - Concat(audio_feature, text_embedding) 44 | - audio_feature: audio content feature 45 | - text_embedding: text summarisation of text data (metadata, lyrics, descriptions...) of the track 46 | - text summasation method: averaging word embeddings of every word 47 | * Output 48 | - Sequence of word embeddings 49 | - each word embedding represent each word of the description (e.g. if the ground-truth description is *Playfully, silly or sublime--this is the sound of Paul in love*, `word_embedding(playfully)`, `word_embedding(silly)`, `word_embedding(or)`,... ) 50 | 51 | 52 | ### The results 53 | I don't have a proper result and looking for dataset. Anybody help? 54 | 55 | 56 | --------------------------------------------------------------------------------