├── audio_captioning_1.1_final_submission.pdf
├── imgs
    └── ismir2016-ldb-captioning-diagram1-for-web.png
├── recurrentnet.py
└── README.md


/audio_captioning_1.1_final_submission.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keunwoochoi/ismir2016-ldb-audio-captioning-model-keras/HEAD/audio_captioning_1.1_final_submission.pdf


--------------------------------------------------------------------------------
/imgs/ismir2016-ldb-captioning-diagram1-for-web.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/keunwoochoi/ismir2016-ldb-audio-captioning-model-keras/HEAD/imgs/ismir2016-ldb-captioning-diagram1-for-web.png


--------------------------------------------------------------------------------
/recurrentnet.py:
--------------------------------------------------------------------------------
 1 | # work with keras 1.0.6, 5th Aug 2016, Keunwoo Choi
 2 | import keras 
 3 | from keras.models import Model
 4 | from keras.layers import Input, GRU, Dense
 5 | from keras.layers import RepeatVector, merge
 6 | 
 7 | def get_model_approach_1(maxlen_enc, maxlen_dec, loss_function):
 8 | 	'''Build an rnn model for audio summarisation.
 9 | 	This model is trained per song-album pair.
10 | 
11 | 	Parameters
12 | 	----------
13 | 	maxlen_enc: integer, length of encoder RNN (max number of tracks in an album)
14 | 	maxlen_dec: integer, length of decoder RNN (max number of words of album descriptions)
15 | 	loss_function: string, e.g. 'mse', 'kld', 'binary_crossentropy' or whatever keras supports.
16 | 
17 | 	return: it returns a keras model.
18 | 
19 | 	Encoder
20 | 	----------
21 | 	Encoder one input, which is a list of track_feature.
22 | 		(track feature=concat(np.mean(word_embeddings), tag_pred)).
23 | 	Word embeddings are the w2v vectors) of the words in the metadata/descriptions of a track. 
24 | 	tag_pred is a (50-dim) vector from the audio.
25 | 	Word embeddings are averaged to summarise each track.
26 | 	The output of encoder is a single vector that is a summary of the song - aka a context vector
27 | 
28 | 	Decoder
29 | 	----------
30 | 	Decoder takes two inputs: a context vector and album_seq[:-1].
31 | 	The context vector is repeated and concatenated to album_seq[:-1].
32 | 	The output is album_seq[1:].
33 | 	'''
34 | 	n_hidden_enc = 128
35 | 	n_hidden_dec = 64
36 | 	dim_w2v = 300 # this is just a common dimension length. 
37 | 	dim_tag = 50 # because my convnet is trained to predict 50 tags.
38 | 
39 | 	feats_input = Input(shape=(maxlen_enc, dim_w2v+dim_tag), name='feats_input') # feature for track, which is 300+50 dim
40 | 	
41 | 	album_seq_input = Input(shape=(maxlen_dec, dim_w2v), name='album_seq_input') # album_seq[:-1], for training of decoder
42 | 
43 | 	# input
44 | 	encoded = GRU(n_hidden_enc, return_sequences=True, input_shape=(maxlen_enc, dim_w2v+dim_tag))(feats_input)
45 | 	encoded = GRU(n_hidden_enc, return_sequences=False)(encoded)
46 | 	# context vector
47 | 	context = RepeatVector(maxlen_dec)(encoded)
48 | 
49 | 	# decoder
50 | 	x_decoder = merge([album_seq_input, context], mode='concat', concat_axis=2)
51 | 	
52 | 	decoded = GRU(n_hidden_dec, return_sequences=True, input_shape=(maxlen_dec, n_hidden_enc))(x_decoder)
53 | 	decoded = GRU(dim_w2v, return_sequences=True)(decoded)
54 | 	
55 | 	# model
56 | 	model = Model(input=[feats_input, album_seq_input], output=[decoded])
57 | 	optimiser = keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
58 | 	model.compile(optimizer=optimiser, loss=loss_function)
59 | 
60 | 	return model
61 | 
62 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ismir2016-ldb-audio-captioning-model-keras
 2 | Audio captioning model in Keras
 3 | 
 4 | ### What does it do
 5 | This is a general sequence-to-sequence model. 
 6 | 
 7 | Details are in [my ISMIR 2016 LDB extended abstract](https://github.com/keunwoochoi/ismir2016-ldb-audio-captioning-model-keras/blob/master/audio_captioning_1.1_final_submission.pdf)
 8 | 
 9 | ### Usage
10 | TODO
11 | 
12 | # More details
13 | 
14 | ### Motivations
15 | ```python
16 | if recommendation.type == playlist:
17 |     if generation_method == automatic:
18 |         raise("No descrption! Users don't understand what the playlists are about and confused. Consider generating descriptions so that music discovery can be easier! ")
19 | ```
20 | E.g.,
21 | 
22 |  * Playfully, silly or sublime--this is the sound of Paul in love. (for [PaulMcCartney Ballads playlist, Apple music](https://itunes.apple.com/us/playlist/michael-jackson-love-songs/idpl.8058d87c60b647a7bc81185b9f59e4c2))
23 |  * Just the right blend of chilled-out acoustic songs to work, relax, and dream to (for [Your Coffee Break, Spotify playlist](https://open.spotify.com/user/spotify_uk_/playlist/48910w3L1DNiqvMHbUfZyY))
24 |  
25 | ### Backgrounds
26 | #### Previously,
27 | 
28 |  * Tags prediction for track *[(Eck et al., 2008)](http://papers.nips.cc/paper/3370-automatic-generation-of-social-tags-for-music-recommendation)*
29 |  * Tags for playlists *[(Fields et al., 2010)](http://research.gold.ac.uk/8793/)*
30 |  * Visual avatars for tracks *[(Bogdanov et al, 2013)](https://scholar.google.co.kr/scholar?cluster=14015036933742269998&hl=en&as_sdt=0,33)*
31 | 
32 | #### Some techniques we can use
33 |  * **RNNs** for sequence modelling
34 |  * **Seq2Seq** which uses RNNs to model the relationships between two sequences, e.g., two sentences in different languges
35 |  * **Word2vec** or anyother word-embedding methods
36 |  * **ConvNet** Convolutional neural networks for various tasks including music
37 | 
38 | ### The proposed structure
39 | ![structure alt text](https://github.com/keunwoochoi/ismir2016-ldb-audio-captioning-model-keras/blob/master/imgs/ismir2016-ldb-captioning-diagram1-for-web.png "structure")
40 |  *  Input
41 |   - A sequence of track features
42 |   - A track feature:
43 |     - Concat(audio_feature, text_embedding)
44 |       - audio_feature: audio content feature
45 |       - text_embedding: text summarisation of text data (metadata, lyrics, descriptions...) of the track
46 |         - text summasation method: averaging word embeddings of every word
47 |  * Output
48 |   - Sequence of word embeddings
49 |     - each word embedding represent each word of the description (e.g. if the ground-truth description is *Playfully, silly or sublime--this is the sound of Paul in love*, `word_embedding(playfully)`, `word_embedding(silly)`, `word_embedding(or)`,... )
50 | 
51 | 
52 | ### The results
53 | I don't have a proper result and looking for dataset. Anybody help? 
54 | 
55 | 
56 | 


--------------------------------------------------------------------------------