├── README.md
├── __init__.py
├── datasets
│   ├── shs_dzr_test.csv
│   ├── shs_dzr_train.csv
│   ├── test_shs.csv
│   └── train_shs.csv
├── es_search.py
├── evaluations.py
├── experiments.py
├── requirements.txt
├── templates.py
├── utilities
│   ├── __init__.py
│   ├── audio_utils.py
│   ├── clique_similarity.py
│   ├── plots.py
│   ├── serra_et_al_2009
│   │   ├── README.md
│   │   ├── compute_hpcpFeatures.sh
│   │   ├── compute_qmaxDistance.sh
│   │   ├── run_mirex_binary.py
│   │   ├── run_submission.sh
│   │   └── utils.py
│   └── text_utils.py
└── utils.py

/README.md:
--------------------------------------------------------------------------------
# Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features

Source code and supplementary materials for the paper: Correya, Albin, Romain Hennequin, and Mickaël Arcos. "Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features." arXiv preprint arXiv:1808.10351 (2018).

This repo contains scripts to run the text-based experiments for the cover song detection task on the [Million Song Dataset (MSD)](https://labrosa.ee.columbia.edu/millionsong/),
which is imported into an [Elasticsearch (ES)](https://www.elastic.co/blog/what-is-an-elasticsearch-index) index as described in the paper.

# Requirements

Install the Python dependencies from the requirements.txt file:

```
$ pip install -r requirements.txt
```

# Setup

* Use the [ElasticMSD](https://github.com/deezer/elasticmsd) scripts to set up your local Elasticsearch index of the MSD.
* Fill in your ES db credentials (host, port and index) as environment variables on your local system.
Check the [templates.py](templates.py) file.

## Datasets

The following datasets have corresponding mappings to MSD tracks. These data are ingested into the ES index with an update operation:

* The [Second Hand Songs (SHS)](https://labrosa.ee.columbia.edu/millionsong/secondhand) dataset. Check the ./datasets folder.
* For lyrics, we used the [musiXmatch (MXM)](https://labrosa.ee.columbia.edu/millionsong/musixmatch) dataset.

# Usage

## Modular mode

This section gives a glimpse of how to use the classes and their various methods for running experiments.

```python
# import modules
from es_search import SearchModule
from experiments import Experiments
import templates as presets

# initiate the search class
es = SearchModule(presets.uri_config)

# search by msd_track title in view mode
results = es.search_by_exact_title('Listen To My Babe', 'TRPIIKF128F1459A09', out_mode='view')

# You can also use the Experiments class to automate particular experiments for a method.
# Initiate it with an instance of SearchModule and the path to the dataset as arguments.
exp = Experiments(es, './datasets/test_shs.csv')

# run the song-title match experiment with top 100 results
results = exp.run_song_title_match_task(size=100)

# compute the evaluation metric for the task
mean_avg_precision = exp.mean_average_precision(results)
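
# The task methods return a pandas DataFrame indexed by the query msd_id, with
# 'id' (ranked response ids) and 'score' (their ES scores) columns, so results
# can be persisted and reloaded later. A minimal sketch (the file name is just
# a placeholder):
results.to_json('./title_match_results.json', orient='index')
results = exp.load_result_json_as_df('./title_match_results.json')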

# reset the presets if you want to run another experiment on the same SearchModule instance
exp.reset_preset()

results = exp.run_mxm_lyrics_search_task(size=1000)

mean_avg_precision = exp.mean_average_precision(results)
```

## Evaluation tasks

Some examples of using the functions in the evaluations.py script to reproduce the results reported in the paper:

```python
from evaluations import *

# evaluation task on the SHS train set against the whole MSD (1 x 999,999 songs)
shs_train_set_evals(size=100, method="msd_title", mode="msd", with_duplicates=True)

# you can specify various prune sizes and methods as parameters
shs_train_set_evals(size=1000, method="mxm_lyrics", mode="msd", with_duplicates=False)

# you can run the same experiment on the SHS train set against itself by specifying the "mode" param as "shs" (1 x 12,960)
shs_train_set_evals(size=100, method="msd_title", mode="shs", with_duplicates=True)

# in the same way, you can run the evaluation experiments on the SHS test set
shs_test_set_evals(size=100, method="title_mxm_lyrics", with_duplicates=True)
```

If you don't care about how the module works and only need the results of the various experiments, this is for you.
evaluations.py is a wrapper around the modules that runs automated experiments and saves the results to a .log file or a JSON template.
The experiments are multi-threaded and can be run from the terminal using command-line arguments.

```bash
$ python evaluations.py -m test -t -1 -e msd -d 0 -s 100

-m : (type: string) choose between "train" and "test" modes
-t : (type: int) number of threads (-1 to use all available cores)
-e : (type: string) choose between "msd" and "shs" modes
-d : (type: int) 1 to include the MSD official duplicates, 0 to exclude them
-s : (type: int) required pruning size for the experiments
```
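For example, to run the train-set experiments against the SHS subset itself with four threads, keeping duplicates and pruning to the top 1000 results (illustrative flag values, using the options documented above):

```bash
$ python evaluations.py -m train -t 4 -e shs -d 1 -s 1000
```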
# Cite

If you use this work, please cite our paper.

```
Correya, Albin, Romain Hennequin, and Mickaël Arcos. "Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features." arXiv preprint arXiv:1808.10351 (2018).
```

BibTeX format:
```
@article{correya2018large,
  title={Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features},
  author={Correya, Albin and Hennequin, Romain and Arcos, Micka{\"e}l},
  journal={arXiv preprint arXiv:1808.10351},
  year={2018}
}
```
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/datasets/shs_dzr_train.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deezer/cover_song_detection/035b925c4a3380ad833202da618f5025e7643322/datasets/shs_dzr_train.csv
--------------------------------------------------------------------------------
/es_search.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Set of functions and methods for various search requests to the Deezer Elasticsearch-augmented MSD db

Albin Andrew Correya
R&D Intern
@Deezer, 2017
"""
from requests import get
from copy import deepcopy
import templates as presets


class SearchModule(object):
    """
    Class containing custom methods to search the Elasticsearch index containing the augmented MSD dataset
    """
    from elasticsearch import Elasticsearch
    import pandas as pd
    import json

    init_json = deepcopy(presets.simple_query_string)  # save the preset as an attribute

    def __init__(self, uri_config, query_json=None, timeout=30):
        """
        Init params:
            uri_config : dictionary specifying the host, port, scheme, index and type of the ES db
                         (check 'uri_config' in the templates.py file)
            query_json : query DSL dict to use as the initial POST body {default : None}
        """
        self.config = uri_config
        self.handler = self.Elasticsearch(hosts=[{'host': self.config['host'],
                                                  'port': self.config['port'],
                                                  'scheme': self.config['scheme']}], timeout=timeout)

        if query_json:
            self.post_json = query_json
        else:
            self.post_json = presets.simple_query_string

        return

    def _load_json(self, jsonfile):
        """Load a json file as a python dict"""
        with open(jsonfile) as f:
            json_data = self.json.load(f)
        return json_data

    def _make_request(self, target_url, query, verbose=False):
        """
        [DEPRECATED] make the request and fetch the results
        """
        if verbose:
            print "GET %s -d '%s'" % (target_url, self.json.dumps(query))
        r = get(target_url, data=self.json.dumps(query))
        return self.json.loads(r.text)

    def _format_url(self, msd_id):
        return "%s://%s:%s/%s/%s/%s" % (
            self.config['scheme'],
            self.config['host'],
            self.config['port'],
            self.config['index'],
            self.config['type'],
            msd_id
        )
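
    # _format_url builds the document URL for a single track, e.g. (with
    # placeholder credentials host=localhost, port=9200, index=msd, type=track):
    #   http://localhost:9200/msd/track/TRPIIKF128F1459A09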
    def _format_query(self, query_str, msd_id, mode='simple_query', field='msd_title', size=100):
        """
        Format the POST json dict object with query_str and msd_id
        """
        if mode == 'simple_query':
            self.post_json['query']['bool']['must'][0]['simple_query_string']['query'] = query_str
            self.post_json['query']['bool']['must'][0]['simple_query_string']['fields'][0] = field
            # we exclude the query id from the result
            self.post_json['query']['bool']['must_not'][0]['query_string']['query'] = msd_id
            self.post_json['size'] = size
        if mode == 'query_string':
            self.post_json['query']['bool']['must'][0]['query_string']['query'] = query_str
            # we exclude the query id from the result
            self.post_json['query']['bool']['must_not'][0]['query_string']['query'] = msd_id
            self.post_json['size'] = size
        return self.post_json

    @staticmethod
    def _format_init_json(init_json, query_str, msd_id, field='msd_title', size=100):
        """
        Format a fresh init_json template with query_str and msd_id
        """
        init_json['query']['bool']['must'][0]['simple_query_string']['query'] = query_str
        init_json['query']['bool']['must'][0]['simple_query_string']['fields'][0] = field
        # we exclude the query id from the result
        init_json['query']['bool']['must_not'][0]['query_string']['query'] = msd_id
        init_json['size'] = size
        return init_json

    @staticmethod
    def _parse_response_for_eval(response):
        """
        Parse the list of msd_track_ids and their respective scores from a search response json

        Input :
            response : json response from elasticsearch
        """
        msd_ids = [d['_id'] for d in response]
        scores = [d['_score'] for d in response]
        return msd_ids, scores

    def _view_response(self, response):
        """
        Aggregate a response as a pandas dataframe to view it as a table in the ipython console

        Input :
            response : json response from elasticsearch

        Output : a pandas dataframe with the aggregated results
        """
        row_list = [(track['_id'], track['_score'], track['_source']['msd_title']) for track in response]

        results = self.pd.DataFrame({
            'msd_id': [r[0] for r in row_list],
            'score': [r[1] for r in row_list],
            'msd_title': [r[2] for r in row_list]
        })
        return results

    def format_lyrics_post_json(self, body, lyrics, track_id, size, field='dzr_lyrics.content'):
        """
        Format the post_json template for lyrics search with the lyrics and msd_track_id
        """
        self.post_json = body
        self.post_json['query']['bool']['must'][0]['more_like_this']['like'] = lyrics
        self.post_json['query']['bool']['must'][0]['more_like_this']['fields'][0] = field
        # we exclude the query id from the results
        self.post_json['query']['bool']['must_not'][0]['query_string']['query'] = track_id
        self.post_json['size'] = size
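
    # After formatting, the 'must' clause of the POST body looks roughly like
    # this (sketch, based on the more_like_this template in templates.py):
    #   {"more_like_this": {"fields": ["dzr_lyrics.content"], "like": "<lyrics text>",
    #                       "min_term_freq": 1, "max_query_terms": 12}}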
    def limit_post_json_to_shs(self):
        """
        Limits the search to the songs present in the Second Hand Songs train set as provided by labrosa,
        ie. limits the search space to 1 x 12,960 from 1 x 1M
        """
        if len(self.post_json['query']['bool']['must']) <= 1:  # mar: why this condition?
            self.post_json['query']['bool']['must'].append({'exists': {'field': 'shs_id'}})

    def limit_to_dzr_mapped_msd(self):
        """
        Limits the search to the songs which have a respective mapping to the deezer_song_ids,
        ie. limits the search space to 1 x ~83k
        """
        self.post_json['query']['bool']['must'].append({'exists': {'field': 'dzr_song_title'}})

    def add_remove_duplicates_filter(self):
        """
        Filter out songs with the field 'msd_is_duplicate_of' from the search
        response using the must_not exists method in the post-request
        """
        if len(self.post_json['query']['bool']['must_not']) <= 1:  # mar: why this condition?
            self.post_json['query']['bool']['must_not'].append({'exists': {'field': 'msd_is_duplicate_of'}})

    @staticmethod
    def add_must_field_to_query_dsl(post_json, role_type='Composer', field='dzr_artists.role_name',
                                    query_type='simple_query_string'):
        post_json['query']['bool']['must'].append({query_type: {'fields': [field], 'query': role_type}})
        return post_json  # mar: why here we return something and not the previous add_ function

    @staticmethod
    def add_role_artists_to_query_dsl(post_json, artist_names, field='dzr_artists.artist_name',
                                      query_type='simple_query_string'):
        if len(artist_names) > 1:
            query_str = ' OR '.join(artist_names)
        else:
            query_str = artist_names[0]

        post_json['query']['bool']['must'].append({query_type: {'fields': [field], 'query': query_str}})
        return post_json  # mar: why here we return something and not the previous add_ function

    @staticmethod
    def parse_field_from_response(response, field='msd_artist_id'):
        """
        Parse a particular field value from the es response

        :param response: es response json
        :param field: field_name
        """
        if field not in response['_source'].keys():
            return None
        elif not response['_source'][field]:
            return None
        elif field == 'dzr_lyrics':
            return response['_source'][field]['content']
        else:
            return response['_source'][field]

    def get_field_info_from_id(self, msd_id, field):
        """
        Retrieve the info of a particular field associated with a msd_id in the es db
        eg. get_field_info_from_id(msd_id='TRWFERO128F425FE0D', field='dzr_lyrics.content')
        """
        response = get(self._format_url(msd_id))
        field_info = self.parse_field_from_response(response.json(), field=field)
        return field_info

    def get_mxm_lyrics_by_id(self, track_id):
        """
        Get the Musixmatch lyrics associated with a msd track id from the es index if there are any.
        :param track_id: msd track id
        :return: lyrics string or None
        """
        return self.get_field_info_from_id(msd_id=track_id, field='mxm_lyrics')

    def get_cleaned_title_from_id(self, msd_id, field="dzr_msd_title_clean"):
        """
        Get the preprocessed MSD title by MSD track id
        """
        # mar: the field "dzr_msd_title_clean" should not be a parameter (like in get_mxm_lyrics)
        response = get(self._format_url(msd_id))
        return self.parse_field_from_response(response.json(), field=field)

    def search_es(self, body):
        """
        Make a search request to elasticsearch with the provided json POST dictionary
        [This is a general method you can use for querying the es db with the respective query_dsl as input]

        Input :
            body : JSON post dict for elasticsearch
                   (you can use the template jsons in the templates.py script)
                   eg : body = templates.simple_query_string
        """
        res = self.handler.search(index=self.config["index"], body=body)
        return res['hits']['hits']

    def search_by_exact_title(self, track_title, track_id, mode='simple_query', out_mode='view', size=100):
        """
        Search by track_title using the simple_query_string method of elasticsearch
        """
        res = self.search_es(self._format_query(query_str=track_title, msd_id=track_id, mode=mode, size=size))

        # mar: because the following code is copy/pasted several times, it should be a function
        # like return_results(res, out_mode)
        if out_mode == 'eval':
            msd_ids, scores = self._parse_response_for_eval(res)
            return msd_ids, scores

        if out_mode == 'view':
            return self._view_response(res)

        return None

    def search_with_cleaned_title(self, track_id, out_mode='view', field="dzr_msd_title_clean", size=100):
        """
        Search by the cleaned msd_track_title
        """
        # mar: field="dzr_msd_title_clean" should not be a parameter but included in get_cleaned_title_from_id
        track_title = self.get_cleaned_title_from_id(msd_id=track_id)
        res = self.search_es(self._format_query(query_str=track_title, msd_id=track_id, mode='simple_query',
                                                field=field, size=size))
        if out_mode == 'eval':
            msd_ids, scores = self._parse_response_for_eval(res)
            return msd_ids, scores

        if out_mode == 'view':
            return self._view_response(res)

        return None

    def search_by_mxm_lyrics(self, post_json, msd_track_id, out_mode='eval', size=100):
        """
        Search the es_db by the musixmatch lyrics which are mapped to certain msd_track_ids.
        These mappings are obtained from the musixmatch dataset (https://labrosa.ee.columbia.edu/millionsong/musixmatch)

        [NOTE]: It returns a tuple of a list of response msd_track_ids and
        response scores from the elasticsearch response if the track has corresponding "mxm_lyrics",
        otherwise it returns a tuple of (None, None)

        Inputs:
            post_json : (dict) Query_DSL json template for the es_query (eg. presets.more_like_this)
            msd_track_id : (string) MSD track identifier of the query file

        Params :
            out_mode : (string) available modes (['eval', 'view'])
            size : (int) size of the required response from the es_db
        """
        lyrics = self.get_mxm_lyrics_by_id(msd_track_id)

        if not lyrics:
            return None, None

        self.format_lyrics_post_json(body=post_json, track_id=msd_track_id, lyrics=lyrics,
                                     size=size, field='mxm_lyrics')
        res = self.search_es(body=self.post_json)

        if out_mode == 'eval':
            msd_ids, scores = self._parse_response_for_eval(res)
            return msd_ids, scores

        if out_mode == 'view':
            return self._view_response(res)

        return None
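

if __name__ == '__main__':
    # Quick usage sketch (illustrative; the title/id pair is the one from the
    # README, and this assumes the ES index described in templates.py is reachable)
    es = SearchModule(presets.uri_config)
    print es.search_by_exact_title('Listen To My Babe', 'TRPIIKF128F1459A09', out_mode='view', size=10)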
--------------------------------------------------------------------------------
/evaluations.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Scripts for running various evaluation tasks for large-scale cover song detection

[NOTE] : All the logs are stored in the LOG_FILE

Albin Andrew Correya
R&D Intern
@Deezer
"""

from joblib import Parallel, delayed
from es_search import SearchModule
from experiments import Experiments
from utils import log
import templates as presets
import argparse

# Logging handlers
LOG_FILE = './logs/evaluations.log'
LOGGER = log(LOG_FILE)


def shs_train_set_evals(size, method="msd_title", with_duplicates=True, mode="msd"):
    """
    :param size: required prune size of the results
    :param method: (string type) {default:"msd_title"}
                   choose the method of the experiment; available methods are
                   ["msd_title", "pre-msd_title", "mxm_lyrics", "title_mxm_lyrics", "pre-title_mxm_lyrics"]
    :param with_duplicates: (boolean) {default:True} include
                   or exclude the MSD official duplicate tracks from the experiments
    :param mode: 'msd' or 'shs'
    """

    es = SearchModule(presets.uri_config)

    if mode == "msd":
        if with_duplicates:
            exp = Experiments(es, './datasets/train_shs.csv', presets.shs_msd)
        else:
            exp = Experiments(es, './datasets/train_shs.csv', presets.shs_msd_no_dup)
    elif mode == "shs":
        exp = Experiments(es, './datasets/train_shs.csv', presets.shs_shs)
    else:
        raise Exception("\nInvalid 'mode' parameter ...")

    if method == "msd_title":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_song_title_match_task(size=size)

    elif method == "pre-msd_title":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_cleaned_song_title_task(size=size)

    elif method == "mxm_lyrics":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_mxm_lyrics_search_task(presets.more_like_this, size=size)

    elif method == "title_mxm_lyrics":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_rerank_title_with_mxm_lyrics_task(size=size, with_cleaned=False)

    elif method == "pre-title_mxm_lyrics":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_rerank_title_with_mxm_lyrics_task(size=size, with_cleaned=True)

    else:
        raise Exception("\nInvalid 'method' parameter....")

    mean_avg_precision = exp.mean_average_precision(results)
    LOGGER.info("\n Mean Average Precision (MAP) = %s" % mean_avg_precision)

    return


def shs_test_set_evals(size, method="msd_title", with_duplicates=True):
    """
    :param size: required prune size of the results
    :param method: (string type) {default:"msd_title"}
                   choose the method of the experiment; available methods are
                   ["msd_title", "pre-msd_title", "mxm_lyrics", "title_mxm_lyrics", "pre-title_mxm_lyrics"]
    :param with_duplicates: (boolean) {default:True} include
                   or exclude the MSD official duplicate tracks from the experiments
    :return:
    """

    es = SearchModule(presets.uri_config)

    if with_duplicates:
        exp = Experiments(es, './datasets/test_shs.csv', presets.shs_msd)
    else:
        exp = Experiments(es, './datasets/test_shs.csv', presets.shs_msd_no_dup)

    if method == "msd_title":
        LOGGER.info("\n%s with size %s and duplicates=%s " % (method, size, with_duplicates))
        results = exp.run_song_title_match_task(size=size)

    elif method == "pre-msd_title":
        LOGGER.info("\n%s with size %s and duplicates=%s" % (method, size, with_duplicates))
        results = exp.run_cleaned_song_title_task(size=size)

    elif method == "mxm_lyrics":
        LOGGER.info("\n%s with size %s and duplicates=%s" % (method, size, with_duplicates))
        results = exp.run_mxm_lyrics_search_task(presets.more_like_this, size=size)

    elif method == "title_mxm_lyrics":
        LOGGER.info("\n%s with size %s and duplicates=%s" % (method, size, with_duplicates))
        results = exp.run_rerank_title_with_mxm_lyrics_task(size=size, with_cleaned=False)

    elif method == "pre-title_mxm_lyrics":
        LOGGER.info("\n%s with size %s and duplicates=%s" % (method, size, with_duplicates))
        results = exp.run_rerank_title_with_mxm_lyrics_task(size=size, with_cleaned=True)

    else:
        raise Exception("\nInvalid 'method' parameter for the experiment !")

    mean_avg_precision = exp.mean_average_precision(results)
    LOGGER.info("\n Mean Average Precision (MAP) = %s" % mean_avg_precision)

    return


def automate_online_evals(mode, n_threads=-1, exp_mode="msd", is_duplicates=False, size=100,
                          methods=["msd_title", "pre-msd_title", "mxm_lyrics",
                                   "title_mxm_lyrics", "pre-title_mxm_lyrics"]):
    """
    Run the parallelized automated evaluation tasks as per the requirements chosen in the parameters

    :param mode: (type : string) choose either train or test mode from the list ["test", "train"]
    :param n_threads: number of threads to parallelize with (-1 to use all available cores)
    :param exp_mode: (type : string) choose the experiment mode from the list ["msd", "shs"]
    :param is_duplicates: (type : boolean) choose whether the MSD official duplicates should be included in the experiments
    :param size: (type : int) required size of the pruned response
    :param methods: choose a list of methods to compute in the automated process;
                    available methods are ["msd_title", "pre-msd_title",
                    "mxm_lyrics", "title_mxm_lyrics", "pre-title_mxm_lyrics"]
    """
    LOGGER.info("\n ======== Automated online experiments on shs_%s "
                "with exp_mode %s and duplicates %s size %s ======= "
                % (mode, exp_mode, is_duplicates, size))

    sizes = [size for i in range(len(methods))]
    duplicates = [is_duplicates for i in range(len(methods))]

    if mode == "test":
        args = zip(sizes, methods, duplicates)
        # each tuple of args is unpacked into the positional arguments of the eval function
        Parallel(n_jobs=n_threads, verbose=1)(delayed(shs_test_set_evals)(*arg) for arg in args)

    if mode == "train":
        exp_modes = [exp_mode for i in range(len(methods))]
        args = zip(sizes, methods, duplicates, exp_modes)
        Parallel(n_jobs=n_threads, verbose=1)(delayed(shs_train_set_evals)(*arg) for arg in args)

    LOGGER.info("\n ===== Process finished successfully... ===== ")

    return


if __name__ == '__main__':

    parser = argparse.ArgumentParser(
        description="Run the automated evaluations for the cover song detection task mentioned in the paper",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument("-m", action="store", default='test',
                        help="choose either 'train' or 'test' mode")
    parser.add_argument("-t", action="store", type=int, default=-1,
                        help="number of threads required")
    parser.add_argument("-e", action="store", default='msd',
                        help="choose between 'msd' or 'shs' ")
    parser.add_argument("-d", action="store", type=int, default=0,
                        help="choose whether to include (1) or exclude (0) the msd official duplicate songs in the experiments")
    parser.add_argument("-s", action="store", type=int, default=100,
                        help="required prune size for the results")

    args = parser.parse_args()

    d = bool(args.d)
    methods = ["msd_title", "pre-msd_title", "mxm_lyrics", "title_mxm_lyrics", "pre-title_mxm_lyrics"]

    automate_online_evals(mode=args.m, n_threads=args.t, exp_mode=args.e, is_duplicates=d, size=args.s, methods=methods)

    print "\n ...Done..."
--------------------------------------------------------------------------------
/experiments.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Methods for running various experiments for the task of cover song detection
using the metadata and lyrics ingested in the ES MSD index.

----------
Albin Andrew Correya
R&D Intern
@Deezer, 2017
"""

from utils import log, timeit
import templates as presets
import sys
import os

# bad hack for avoiding encoding errors for the moment
# to be removed soon
reload(sys)
sys.setdefaultencoding("utf8")


if not os.path.isdir('./logs/'):
    os.makedirs('./logs/')
LOGGER = log('./logs/experiments.log')


class Experiments(object):
    """
    Class containing methods for running various experiments on the
    SecondHandSongs and MillionSongDataset ingested in the elasticsearch
    msd_augmented index for the task of cover song detection.
    This is a wrapper on the es_search.py -> SearchModule class
    for doing fast prototyping.

    Pandas dataframes and json dicts are mainly used as the data
    structures for dealing with the aggregated response results.

    Usage:
        exp = Experiments(es_search_class, shs_dataset_csv, presets.shs_msd)
        results = exp.run_song_title_match_task(size=100)
        m_avgp = exp.mean_average_precision(results)
    """

    import pandas as pd
    import numpy as np
    import time

    def __init__(self, search_class, shs_csv, profile=None):
        """
        Init parameters

        :param search_class: an instance of the SearchModule class (es_search.py)
        :param shs_csv: path to a csv file of the SecondHandSongs dataset (check the ./datasets/ folder).
                        This will be the query set and groundtruth for the experiments.
        :param profile: {default: None}
                        a python dictionary corresponding to the profile of the experiment object

                        eg: {
                            'filter_duplicates': True,
                            'dzr_map': False,
                            'shs_mode': False
                        }

        NOTE : a set of profile templates can be found inside the templates.py file.
        """
        self.es = search_class
        self.dataset = self._load_csv_as_df(shs_csv)
        self.query_ids = self.dataset.msd_id.values.tolist()
        self.query_titles = self.dataset.title.values.tolist()

        if profile:
            self.filter_duplicates = profile['filter_duplicates']
            self.dzr_map = profile['dzr_map']
            self.shs_mode = profile['shs_mode']
        else:
            self.filter_duplicates = presets.shs_msd['filter_duplicates']
            self.dzr_map = presets.shs_msd['dzr_map']
            self.shs_mode = presets.shs_msd['shs_mode']
        return

    def _load_csv_as_df(self, csvfile):
        """Load a csv file as a pandas dataframe"""
        return self.pd.read_csv(csvfile)

    def _get_subframe_df(self, dataframe, field):
        """Get a particular subframe from the pandas dataframe as a list"""
        return dataframe[field].copy().values.tolist()

    def _tolist(self, x):
        """For use as a pandas dataframe.apply() callback"""
        return list(x)

    def _merge_df(self, results_df, field='msd_id'):
        """Merge the dataset and the results df"""
        results_df[field] = self.pd.Series(results_df.index.values, index=results_df.index)
        return self.pd.merge(self.dataset, results_df, on=field, how='left')

    def _groupby_work(self, merged_df):
        return merged_df.groupby('work_id')['msd_id'].agg({'clique_songs': self._tolist})

    def load_result_json_as_df(self, jsonfile):
        """Load a results json from the experiments as a pandas df"""
        return self.pd.read_json(jsonfile, orient='index')

    def dict_to_pickle(self, mydict, filename):
        """Save a dict to a pickle file"""
        import pickle
        doc = open(filename, 'wb')
        pickle.dump(mydict, doc)
        return

    def get_clique_id(self, track_id):
        """DEPRECATED"""
        # have to recheck if this is the same for all the samples
        return self.dataset[self.dataset.msd_id == track_id].clique_id.values.tolist()

    def get_ground_truth(self, query_id, reference_id):
        """DEPRECATED [To_remove]"""
        if str(self.get_clique_id(query_id)) == str(self.get_clique_id(reference_id)):
            return 1
        else:
            return 0

    def reset_preset(self):
        self.es.post_json = self.es.init_json
        return

    def get_artist_id(self, track_id):
        """
        Returns the artist_id for a specific msd_track_id from the dataset
        """
        return self.dataset.artist_id[self.dataset.msd_id == track_id].values[0]

    def rerank_by_field(self, field_id, response, proximity=1, field='msd_artist_id'):
        """
        Re-rank the search results by taking a field with thresholding
        """
        top_list = list()
        bottom_list = list()
        if response:
            top_score = response[0]['_score']
        else:
            return []
        for row in response:
            if row['_source'][field] == field_id and (top_score - row['_score']) <= proximity:
                top_list.append(row)
            else:
                bottom_list.append(row)
        if not top_list:
            return response
        else:
            return top_list + bottom_list

    def get_score_thres(self, res_ids, res_scores, proximity=1.):
        """
        :param res_ids: a list of ranked msd_track_ids (typically from the lyrics_search response)
        :param res_scores: a list of ranked scores corresponding to the res_ids
        :param proximity: (int, default: 1) a threshold value for determining the boundary of the
                          difference between the top score and the other scores
        :return: (top_ids, top_list, thres_idx)
                 top_ids : top msd_track_ids
                 top_list : top es search scores
                 thres_idx : threshold index

        eg. for res_scores = [10.0, 9.8, 9.2, 5.0] and proximity=1,
            top_list = [10.0, 9.8, 9.2] and thres_idx = 3
        """
        top_score = res_scores[0]
        top_list = [score for score in res_scores if (top_score - score) <= proximity]
        thres_idx = len(top_list)
        top_ids = res_ids[:thres_idx]
        return top_ids, top_list, thres_idx

    def rerank_title_results_by_lyrics(self, title_res, lyrics_res, mode='view', proximity=0.5):
        """
        :param title_res: pandas dataframe with the aggregated response of the song_title match results
        :param lyrics_res: pandas dataframe with the aggregated response of the lyrics_similarity search results
        :param mode: (available modes ['view', 'eval']) {default : 'view'}
                     'view' - return the reranked response as a pandas dataframe
                     'eval' - return the reranked response as a tuple of a list of msd_ids and relative scores
        :param proximity: score-proximity threshold passed to get_score_thres
        :return:
        """
        top_ids, top_scores, thres_idx = self.get_score_thres(
            lyrics_res.msd_id.values, lyrics_res.score.values, proximity=proximity)  # threshold is 0.5
        title_res_ids = title_res.msd_id.values.tolist()
        common_ids = self.np.intersect1d(title_res.msd_id.values, top_ids)

        if len(common_ids) > 0:
            top_list = common_ids
            bottom_list = [x for x in title_res_ids if x not in common_ids]

            # preserve the ranking of the lyrics search response
            top_list = top_ids[sorted([list(top_ids).index(x) for x in top_list])]

            new_ranked_list = list(top_list) + bottom_list
            idx = [title_res_ids.index(x) for x in new_ranked_list]
            merged_df = title_res.iloc[idx]  # select the new ranked dataframe from the indexes
            merged_df = merged_df.set_index(self.np.arange(len(merged_df)))  # update the dataframe with the new ranks
            if mode == 'view':
                return merged_df
            elif mode == 'eval':
                return merged_df.msd_id.values.tolist(), merged_df.score.values.tolist()
        else:
            if mode == 'view':
                return title_res
            elif mode == 'eval':
                return title_res.msd_id.values.tolist(), title_res.score.values.tolist()
        return

    """
    ------------------------------------------
    ------ AUTOMATED EXPERIMENTS -------------
    These are methods for running automated search experiments on the ES MSD db
    """

    @timeit
    def run_song_title_match_task(self, size=100, verbose=True):
        """
        Simple experiment with a simple text match on the song title
        """
        start_time = self.time.time()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        results = dict()

        LOGGER.info("\n=======Running song title-match task for %s query songs against top %s results of MSD... "
                    "with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for title in enumerate(self.query_titles):
            if verbose:
                print "------%s-------%s" % (title[0], title[1])

            res_ids, res_scores = self.es.search_by_exact_title(
                unicode(title[1]), track_id=self.query_ids[title[0]], out_mode='eval', size=size)
            # aggregate the response_ids and scores into a dict keyed by the query msd_id
            results[self.query_ids[title[0]]] = {'id': res_ids, 'score': res_scores}

        LOGGER.info("\n Task runtime : %s" % (self.time.time() - start_time))
        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def run_cleaned_song_title_task(self, size=100, verbose=True):
        """Run the MSD pre-processed title task"""
        start_time = self.time.time()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        results = dict()

        LOGGER.info("\n=======Running cleaned title-match task for %s query songs against top %s results of MSD... "
                    "with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for ids in enumerate(self.query_ids):
            if verbose:
                print "----%s----%s" % (ids[0], ids[1])
            res_ids, res_scores = self.es.search_with_cleaned_title(track_id=ids[1], out_mode='eval', size=size)
            results[ids[1]] = {'id': res_ids, 'score': res_scores}

        LOGGER.info("\n Task runtime : %s" % (self.time.time() - start_time))
        return self.pd.DataFrame.from_dict(results, orient='index')
    @timeit
    def run_field_rerank_task(self, field='msd_artist_id', size=100, proximity=1, verbose=True):
        """
        In this task, an msd song with the same artist id as the query song is ranked at the top of the list
        """
        results = dict()
        LOGGER.info("\n=======Running song title-matching task with reranking by '%s' for %s query "
                    "songs against top %s results of MSD... with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (field, len(self.query_ids), size, str(self.shs_mode),
                       str(self.filter_duplicates), str(self.dzr_map)))

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        for index, title in enumerate(self.query_titles):
            if verbose:
                print "------%s-------%s" % (index, title)
            response = self.es.search_es(self.es._format_query(title, self.query_ids[index], size=size))
            query_artist_id = self.get_artist_id(self.query_ids[index])
            re_ranked = self.rerank_by_field(query_artist_id, response, field=field, proximity=proximity)
            res_ids, res_scores = self.es._parse_response_for_eval(re_ranked)
            results[self.query_ids[index]] = {'id': res_ids, 'score': res_scores}  # save it to the dictionary
        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def run_mxm_lyrics_search_task(self, post_json=presets.more_like_this, size=100, verbose=True):
        """
        Lyrics search task using the MXM lyrics
        (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html)
        """
        results = dict()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        LOGGER.info("\n=======Running musixmatch-msd lyrics search task for %s query songs against "
                    "top %s results of MSD... with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for index, ids in enumerate(self.query_ids):
            if verbose:
                print "----%s----%s" % (index, ids)
            res_ids, res_scores = self.es.search_by_mxm_lyrics(post_json, msd_track_id=ids, out_mode='eval', size=size)
            results[ids] = {'id': res_ids, 'score': res_scores}

        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def run_rerank_title_with_dzr_lyrics_task(self, size=100, with_cleaned=False, verbose=True):
        """
        Here we make two requests, with the song_title metadata and with the dzr_lyrics, and merge the top
        results of the lyrics search into the song-title search response to rerank it
        """
        results = dict()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        post_json = self.es.post_json

        LOGGER.info("\n=======Running rerank experiment of title search response with dzr_lyrics response for %s "
                    "query songs against top %s results of MSD... with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for index, title in enumerate(self.query_titles):
            if verbose:
                print "---%s---%s" % (index, self.query_ids[index])

            self.es.post_json = post_json  # post-json template for the title search

            if with_cleaned:
                text_df = self.es.search_with_cleaned_title(self.query_ids[index], out_mode='view', size=size)
            else:
                text_df = self.es.search_by_exact_title(title, self.query_ids[index], out_mode='view', size=size)

            # note: relies on a dzr_lyrics search method which is not part of es_search.py in this release
            lyrics_df = self.es.search_by_dzr_lyrics(
                presets.more_like_this, self.query_ids[index], out_mode='view', size=size)

            if type(lyrics_df) != tuple:
                if lyrics_df.empty:
                    res_ids, res_scores = text_df.msd_id.values.tolist(), text_df.score.values.tolist()
                else:
                    res_ids, res_scores = self.rerank_title_results_by_lyrics(text_df, lyrics_df, mode='eval')
            else:
                res_ids, res_scores = text_df.msd_id.values.tolist(), text_df.score.values.tolist()
            results[self.query_ids[index]] = {'id': res_ids, 'score': res_scores}
        return self.pd.DataFrame.from_dict(results, orient='index')
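
    # Illustrative sketch of the rerank step used below (hypothetical ids/scores):
    #   title search  : ids [A, B, C]   scores [9.0, 8.5, 8.1]
    #   lyrics search : ids [C, D]      scores [15.2, 14.9]  (both within the proximity)
    #   reranked      : [C, A, B] -- the lyrics top hits that also appear in the title
    #   response are promoted; the rest keep their title-search order.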
    @timeit
    def run_rerank_title_with_mxm_lyrics_task(self, size=100, with_cleaned=False, verbose=True, threshold=0.5):
        """
        Experiment where we rerank the es response of the song_title search with the top results
        of the mxm_lyrics similarity search

        :param size: {default : 100}
        :param with_cleaned: {default : False} if set to True, switch from the simple
                             text_search method to the cleaned, pre-processed title method
        :param verbose: {default : True}
        :param threshold: score-proximity threshold for the rerank step
        :return: aggregated results as a pandas dataframe
        """
        results = dict()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        post_json = self.es.post_json

        LOGGER.info("\n=======Running rerank experiment of title search response with mxm_lyrics response for %s query "
                    "songs against top %s results of MSD... with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for index, title in enumerate(self.query_titles):
            if verbose:
                print "---%s---%s" % (index, self.query_ids[index])

            self.es.post_json = post_json  # post-json template for the title search

            if with_cleaned:
                text_df = self.es.search_with_cleaned_title(self.query_ids[index], out_mode='view', size=size)
            else:
                text_df = self.es.search_by_exact_title(title, self.query_ids[index], out_mode='view', size=size)

            lyrics_df = self.es.search_by_mxm_lyrics(
                presets.more_like_this, msd_track_id=self.query_ids[index], out_mode='view', size=size)

            if type(lyrics_df) != tuple:
                if lyrics_df.empty:
                    res_ids, res_scores = text_df.msd_id.values.tolist(), text_df.score.values.tolist()
                else:
                    res_ids, res_scores = self.rerank_title_results_by_lyrics(
                        text_df, lyrics_df, mode='eval', proximity=threshold)
            else:
                res_ids, res_scores = text_df.msd_id.values.tolist(), text_df.score.values.tolist()
            results[self.query_ids[index]] = {'id': res_ids, 'score': res_scores}
        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def run_audio_rerank_task(self, text_results_json, audio_results_json, threshold=0.1):
        """
        [OFFLINE EXPERIMENT]

        Function to re-rank the text-based results with the audio-based results
            text_results_json : json file
            audio_results_json : json file
            threshold : {default: 0.1}
        """
        text_df = self.pd.read_json(text_results_json)
        audio_df = self.pd.read_json(audio_results_json)
        results = dict()
        cnt = 0
        error_idxs = []

        def get_low_score(scores, thres=threshold):

            def list_duplicates_of(seq, item):
                start_at = -1
                locs = []
                while True:
                    try:
                        loc = seq.index(item, start_at+1)
                    except ValueError:
                        break
                    else:
                        locs.append(loc)
                        start_at = loc
                return locs

            # keep the scores below the distance threshold
            top_list = []
            for score in scores:
                if score <= thres:
                    top_list.append(score)

            dup_idxs = []
            for s in top_list:
                dup_idxs.extend(list_duplicates_of(top_list, s))

            idxs = list(set(dup_idxs))
            print "Score index :", idxs
            return idxs

        LOGGER.info("Running audio reranking task on the metadata search experiment results "
                    "file with a threshold of %s" % threshold)

        for idx in range(len(audio_df)):
            print "Index :", idx
            text_res_ids = text_df.iloc[idx].id
            text_res_scores = text_df.iloc[idx].score
            audio_res_ids = audio_df.iloc[idx].id
            audio_res_scores = audio_df.iloc[idx].score

            if not audio_res_scores or not audio_res_ids or len(audio_res_ids) == 0:
                results[audio_df.index[idx]] = {'id': text_res_ids, 'score': text_res_scores}
                cnt += 1
                error_idxs.append(idx)
            else:
                a_df = self.pd.DataFrame({'id': audio_res_ids, 'score': audio_res_scores})

                thres_idxs = get_low_score(a_df.score.values.tolist(), threshold)

                if len(thres_idxs) != 0:
                    a_df = a_df.iloc[thres_idxs]
                    top_ids = a_df.id.values.tolist()
                    top_scores = a_df.score.tolist()
                    bottom_ids = [x for x in text_df.iloc[idx].id if x not in top_ids]
                    bottom_idx = [text_res_ids.index(x) for x in bottom_ids]
                    text_res_scores = self.np.array(text_res_scores)
                    bottom_scores = text_res_scores[bottom_idx]
                    new_ranked_ids = top_ids + bottom_ids
                    new_ranked_scores = top_scores + list(bottom_scores)
                    results[audio_df.index[idx]] = {'id': new_ranked_ids, 'score': new_ranked_scores}
                else:
                    results[audio_df.index[idx]] = {'id': text_res_ids, 'score': text_res_scores}

        LOGGER.debug("%s queries don't have a proper audio reranked response" % cnt)

        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def maximum_achievable_metrics(self, results_df):
        """
        In this experiment we rerank the response ids with the ground_truth to compute
        the maximum achievable MAP obtainable by re-ranking the metadata-search results with
        other content such as lyrics, audio etc. This was only done on the train set of the dataset.
        """
        LOGGER.info("Computing the maximum achievable mean average precision from the results dataframe")
        results_df = self._merge_df(results_df)
        results = dict()
        for index, response in results_df.iterrows():
            if type(response['id']) == list:
                response_ids = response['id']
                clique_songs = results_df.msd_id[results_df.work_id == response['work_id']].values
                top_list = self.np.intersect1d(clique_songs, response_ids)
                if len(top_list) > 0:
                    bottom_list = [x for x in response_ids if x not in top_list]
                    if bottom_list:
                        results[response['msd_id']] = {'id': list(top_list) + bottom_list}
                    else:
                        results[response['msd_id']] = {'id': list(top_list)}
                else:
                    results[response['msd_id']] = {'id': response_ids}
        return self.pd.DataFrame.from_dict(results, orient='index')

    # ----------------------------------------EVALUATION METRICS----------------------------------------------------
    def average_precision_at_k(self, results_df, query_msd_id):
        """
        Compute the average precision for a particular query and response from the aggregated results dataframe.
        Here "k" is the query_msd_id in the results df.

        Inputs:
            results_df : aggregated results dataframe from one of the tasks
            query_msd_id : msd track id of the query

        eg. for a query whose clique has 3 songs (the query + 2 other covers) and a
        ranked response where the covers appear at ranks 1 and 4:
            ground_truth = [1, 0, 0, 1]
            precision@k  = [1/1, 1/2, 1/3, 2/4]
            AP = (1*1.0 + 0 + 0 + 1*0.5) / (3 - 1) = 0.75
        """
        results_df = self._merge_df(results_df)
        response_ids = results_df[results_df.msd_id == query_msd_id].id.values.tolist()[0]
        work_id = results_df.work_id[results_df.msd_id == query_msd_id].values[0]
        clique_songs = results_df.msd_id[results_df.work_id == work_id].values
        true_idx = [response_ids.index(x) for x in response_ids if x in clique_songs]
        ground_truth = self.np.zeros(len(response_ids))
        if len(true_idx) > 0:
            ground_truth[true_idx] = 1
        precision_at_k = self.np.cumsum(ground_truth) / self.np.arange(1., len(response_ids)+1)
        precision_list = ground_truth * precision_at_k
        avg_precision = sum(precision_list) / float(len(clique_songs) - 1)
        return avg_precision

    def average_precision(self, results_df, size=None):
        """
        Average precision of each query in the results dataframe

        Inputs :
            results_df : aggregated results dataframe from one of the tasks
            size : optional prune size of the response

        Returns a list of average precisions
        """
        results_df = self._merge_df(results_df)
        avg_precisions = list()
        cnt = 0
        for index, response in results_df.iterrows():
            if type(response['id']) == list:
                if size:
                    response_ids = response['id'][:size]
                else:
                    response_ids = response['id']
                clique_songs = results_df.msd_id[results_df.work_id == response['work_id']].values
                true_idx = [response_ids.index(x) for x in response_ids if x in clique_songs]
                ground_truth = self.np.zeros(len(response_ids))
                if len(true_idx) > 0:
                    ground_truth[true_idx] = 1
                precision_at_k = self.np.cumsum(ground_truth) / self.np.arange(1., len(response_ids)+1)
                precision_list = ground_truth * precision_at_k
                avg_precision = sum(precision_list) / float(len(clique_songs) - 1)
                avg_precisions.append(avg_precision)
            else:
                cnt += 1
                avg_precisions.append(0)
        LOGGER.debug("%s queries have no lyrics nor response out of %s queries" % (cnt, len(results_df)))
        return avg_precisions

    @timeit
    def mean_average_precision(self, results_df, size=None):
        """
        Mean of the average precisions for the task
        """
        return self.np.mean(self.average_precision(results_df, size=size))

    def average_rank(self, results_df):
        """
        Computes the average position of the relevant documents, ie. measures where
        the relevant docs fall in a ranked list
        """
        average_ranks = list()
        for query_id in results_df.keys():
            response_ids = results_df[query_id][0]
            if type(response_ids) == list:
                clique_id = self.dataset.work_id[self.dataset.msd_id == query_id].values[0]
                clique_songs = self.dataset.msd_id[self.dataset.work_id == clique_id].values
                true_idx = [response_ids.index(x) for x in response_ids if x in clique_songs]
                if len(true_idx) == 0:
                    average_ranks.append(1000000)
                else:
                    average_ranks.append(self.np.average(true_idx))
        return self.np.average(average_ranks)

    def mean_rank_first_cover(self, results_df):
        """
        Mean rank of the first correctly identified cover
        """
        mean_ranks = list()
        for query_id in results_df.keys():
            response_ids = results_df[query_id][0]
            if type(response_ids) == list:
                clique_id = self.dataset.work_id[self.dataset.msd_id == query_id].values[0]
                clique_songs = self.dataset.msd_id[self.dataset.work_id == clique_id].values
                true_idx = [response_ids.index(x) for x in response_ids if x in clique_songs]
                if len(true_idx) == 0:
                    pass
                else:
                    mean_ranks.append(true_idx[0]+1)

        return self.np.mean(mean_ranks)

    def covers_identified(self, results_df, size=None):
        """
        Total number of covers identified compared to the dataset
        """
        total_covers = list()
        percentage = list()
        # here we merge the results_df with the shs_dataset df loaded in the init
        results_df = self._merge_df(results_df)
        for index, response in results_df.iterrows():
            if type(response['id']) == list:
                if size:
                    response_ids = response['id'][:size]
                else:
                    response_ids = response['id']
                clique_songs = results_df.msd_id[results_df.work_id == response['work_id']].values
                # check the intersection of the two lists for the detected covers
                detected_covers = self.np.intersect1d(clique_songs, response_ids)
                total_covers.append(len(detected_covers))
                percentage.append((len(detected_covers) / float(len(clique_songs)))*100)
        return total_covers, percentage

    def total_covers_identified(self, results_df):
        """
        Total number of covers identified
        """
        total_covers, percentage = self.covers_identified(results_df)
        return sum(total_covers)

    def mean_percentage_of_covers(self, results_df, size=None):
        """
        Mean percentage of covers identified
        """
        total_covers, percentage = self.covers_identified(results_df, size=size)
        return self.np.mean(percentage)
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy==1.11.0
pandas==0.20.3
elasticsearch==2.3.0
requests
joblib==0.11
seaborn==0.7.1
python-Levenshtein==0.12.0
fuzzywuzzy==0.15.0
nltk==3.3
--------------------------------------------------------------------------------
/templates.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
A set of custom query DSL templates for elasticsearch search post-json requests for various tasks

Check the elasticsearch documentation for more details
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

Albin Andrew Correya
R&D Intern
@2017
"""

import os

assert os.environ["MSDES_HOST"]
assert os.environ["MSDES_PORT"]
assert os.environ["MSDES_INDEX"]
assert os.environ["MSDES_TYPE"]

SCHEME = "http"
URI = os.environ["MSDES_HOST"]
PORT = os.environ["MSDES_PORT"]
ES_INDEX = os.environ["MSDES_INDEX"]
ES_TYPE = os.environ["MSDES_TYPE"]


uri_config = {
    'host': URI,
    'port': PORT,
    'scheme': SCHEME,
    'index': ES_INDEX,
    'type': ES_TYPE
}
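
# The environment variables asserted above must be set before importing this
# module, e.g. (placeholder values; use your own ES credentials):
#   export MSDES_HOST=localhost
#   export MSDES_PORT=9200
#   export MSDES_INDEX=msd_augmented
#   export MSDES_TYPE=track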
# for string search with the song title
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
query_string = {
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "default_field": "msd_title",
                        "query": "sample_query_here"
                    }
                }
            ],
            "must_not": [
                {
                    "query_string": {
                        "default_field": "_id",
                        "query": "msd_track_id_here"
                    }
                }
            ]
        }
    },
    "from": 0,
    "size": 100
}

# for string search with the title
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html
simple_query_string = {
    "query": {
        "bool": {
            "must": [
                {
                    "simple_query_string": {
                        "fields": ["msd_title"],
                        "query": "sample_query_here"
                    }
                }
            ],
            "must_not": [
                {
                    "query_string": {
                        "default_field": "_id",
                        "query": "msd_track_id_here"
                    }
                }
            ]
        }
    },
    "from": 0,
    "size": 100
}

# for lyrics search
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
more_like_this = {
    "query": {
        "bool": {
            "must": [
                {
                    "more_like_this": {
                        "fields": ["dzr_lyrics.content"],
                        "like": "sample_query_here",
                        "min_term_freq": 1,
                        "max_query_terms": 12
                    }
                }
            ],
            "must_not": [
                {
                    "query_string": {
                        "default_field": "_id",
                        "query": "msd_track_id_here"
                    }
                }
            ]
        }
    },
    "from": 0,
    "size": 100
}

# config preset for the logs
log_config = {
    'method':
        {
            'query_method': 'title_query_with_artist_rerank',
            'mode': 'msd_field',
            'size': 100,
            'dataset': 'shs_train'
        },
    'metrics':
        {
            'MAP': 0.,
            'MRFC': 0.,
            'MPER': 40.
        },
    'run_time': 60
}

# Experiment profiles
# SHS against MSD experiment
shs_msd = {
    'dzr_map': False,
    'filter_duplicates': False,
    'shs_mode': False
}

# SHS against MSD experiment excluding all the official duplicates
shs_msd_no_dup = {
    'dzr_map': False,
    'filter_duplicates': True,
    'shs_mode': False
}

# SHS-DZR against MSD-DZR experiment excluding all the official duplicates
shs_dzr_msd = {
    'dzr_map': True,
    'filter_duplicates': True,
    'shs_mode': False
}

# SHS train set against SHS train set experiment
shs_shs = {
    'dzr_map': False,
    'filter_duplicates': False,
    'shs_mode': True
}

# SHS train set against SHS train set experiment without the official duplicates
shs_shs_no_dup = {
    'dzr_map': False,
    'filter_duplicates': True,
    'shs_mode': True
}

output_evaluations = {
    'size': 100,
    'map': 0,
    'method': 'title',
    'experiment': 'shs_msd',
}
--------------------------------------------------------------------------------
/utilities/__init__.py:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/utilities/audio_utils.py:
--------------------------------------------------------------------------------
"""
Utility functions for processing the results files of the audio reranking experiments
"""
from utils import timeit, log
import pandas as pd
import numpy as np
import os


logger = log('./logs/audio_logs.log')


def savelist_to_file(path_list, filename):
    """Write a list of strings to a text file"""
    doc = open(filename, 'w')
    for item in path_list:
        doc.write("%s\n" % item)
    doc.close()
    return


def parse_mirex_output_txt(textfile):
    """
    Parse the distance matrix from the text output of Joan Serra's cover song detection algorithm

    Input : path/to/the/textfile

    Output : pandas dataframe with the query/candidates distance scores
    """
    text = open(textfile)
    data = text.readlines()
    array = list()
    m = None
    for lines in data:
        if lines.startswith("Dist"):
            m = True
        if m is True:
            if not lines.startswith(" Could not open"):
                array.append(lines)
    text.close()
    doc = open("../distanceMatrix.txt", "w")
    for lines in array:
        doc.write("%s\n" % lines)
    doc.close()
    df = pd.read_csv("../distanceMatrix.txt", index_col=0, skiprows=1, sep='\t')
    os.system('rm ../distanceMatrix.txt')
    return df.transpose()
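
# For example (hypothetical filename, following the output_*.txt naming
# convention of the mirex binary outputs processed below):
#   distance_df = parse_mirex_output_txt('./output_qmax_0.txt')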
- only needs to be run once per results file
54 |     """
55 |     collections = list()
56 |     queries = list()
57 |     results = pd.read_json(results_json)
58 |     for i in range(len(results)):
59 |         collections.extend(results.iloc[i].id)
60 |         queries.append(results.iloc[i].msd_id)
61 |     collections.extend(queries)
62 |     df = pd.DataFrame({'msd_track_id': collections})
63 |     df = df.drop_duplicates('msd_track_id')
64 |     df_dzr = pd.merge(df, dzr_msd_map_df, on='msd_track_id', how='left')
65 |     df_dzr['dzr_path'] = df_dzr.song_id.apply(get_dzr_sng_path_from_sng_id)
66 |     return df_dzr
67 | 
68 | 
69 | def results_json_to_query_collection_pairs(results_json, enriched_csv, col_path, query_path):
70 |     """
71 |     Create a set of query/collection text file pairs from an aggregated ES search
72 |     results file, for running the MIREX (Serra 2009) binary scripts
73 |     Inputs :
74 |         results_json : path to the aggregated ES search results JSON
75 |         enriched_csv : csv with the msd_track_id-to-dzr_path mapping (see results_json_to_enriched_results_df)
76 |         col_path : output directory for the collection text files
77 |         query_path : output directory for the query text files
78 |     """
79 |     res = pd.read_json(results_json)
80 |     map_data = pd.read_csv(enriched_csv)
81 |     logger.info("Constructing query-collection text files from the results-%s to %s and %s"
82 |                 % (results_json, col_path, query_path))
83 |     for i in range(len(res)):
84 |         mid = res.iloc[i].msd_id
85 |         rids = res.iloc[i].id
86 |         if not rids:
87 |             logger.debug("No response found for index %s" % i)
88 |             continue  # skip queries with an empty response instead of failing below
89 |         qpaths = map_data.dzr_path[map_data.msd_track_id == mid].values[0]
90 |         rpaths = [map_data.dzr_path[map_data.msd_track_id == rid].values[0] for rid in rids]
91 |         qpaths = qpaths.replace("data", "mnt")
92 |         rpaths = [string.replace("data", "mnt") for string in rpaths]
93 |         savelist_to_file([qpaths, '\n'], query_path+'query_'+str(i)+'_.txt')
94 |         savelist_to_file(rpaths, col_path+'collections_'+str(i)+'_.txt')
95 |     return
96 | 
97 | 
98 | def get_id_score_pairs_from_distance_df(distance_df, results_df, index):
 99 |     """
100 |     Returns the reranked response of msd_track_ids and audio similarity
101 |     scores from a distance matrix of the MIREX 2009 binary output
102 |     Inputs:
103 |         distance_df : distance matrix dataframe from parse_mirex_output_txt
104 |         results_df : aggregated ES search results dataframe
105 |         index : row index of the query in results_df
106 | 
107 |     Outputs:
108 |         res_ids : msd_track_ids reranked by audio similarity
109 |         res_scores : the corresponding qmax distance scores
110 |     """
111 |     res_msd_ids = results_df.iloc[index].id
112 |     sorted_df = distance_df.sort_values(1)
113 |     new_ranked_idx = sorted_df.index.values
114 | 
115 |     if len(sorted_df) != len(res_msd_ids):
116 |         logger.debug("Mismatch of response msd id length in index %s" % index)
117 | 
118 |     # new reranked response ids and scores from the audio similarity measures
119 |     # note the index from the output_txt file starts with 1
120 |     res_ids = [res_msd_ids[int(i)-1] for i in new_ranked_idx]
121 |     res_scores = sorted_df[1].values.tolist()
122 |     return res_ids, res_scores
123 | 
124 | 
125 | def serra_output_txt_to_results_df(output_directory, results_json):
126 |     """
127 |     Read a collection of output_*.txt files from the MIREX 2009 binary
128 |     output and aggregate them into a pandas dataframe,
129 |     as required by the metric computation scripts
130 | 
131 |     Inputs :
132 |         output_directory : path to the folder with the output_*.txt files from the MIREX binary scripts
133 |         results_json : path to the aggregated ES search results JSON
134 |     """
135 |     results_df = pd.read_json(results_json)
136 |     output_files = [t for t in os.listdir(output_directory)
137 |                     if not t.startswith('.') and t.endswith('.txt')]
138 |     # (a list comprehension is used above: removing items from a list
139 |     # while iterating over it silently skips elements)
140 | 
141 |     output_files = sorted(output_files, key=lambda x: int(x.split('_')[2].split('.')[0]))  # sort the filename list
142 |     results = dict()
143 |     cnt = 0
144 |     error_files = list()
145 |     for idx, txt_file in enumerate(output_files):
146 |         print "--%s--%s" %
(idx, txt_file) 147 | distance_df = parse_mirex_output_txt(output_directory+txt_file) 148 | if distance_df.shape[1] == 1: 149 | query_msd = results_df.index[idx] 150 | res_ids, res_scores = get_id_score_pairs_from_distance_df(distance_df, results_df, index=idx) 151 | results[query_msd] = {'id': res_ids, 'score': res_scores} 152 | else: 153 | cnt += 1 154 | query_msd = results_df.index[idx] 155 | results[query_msd] = {'id': results_df.iloc[idx].id, 'score': None} 156 | error_files.append(txt_file) 157 | logger.debug("\n%s files had errors with the output distance matrix.." % cnt) 158 | #print error_files 159 | return pd.DataFrame.from_dict(results, orient='index') 160 | -------------------------------------------------------------------------------- /utilities/clique_similarity.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Set of functions to compute the similarity of song titles with in and outside it's clique. 4 | ~ 5 | Albin Andrew Correya 6 | R&D Intern 7 | @Deezer, 2018 8 | """ 9 | from itertools import combinations 10 | from Levenshtein import ratio 11 | import numpy as np 12 | import pandas as pd 13 | import random 14 | 15 | 16 | def get_clique_similarity_same_set(dataset_csv): 17 | """Compute Levenshtein similarity of song titles in same cliques in SHS""" 18 | dataset = pd.read_csv(dataset_csv) 19 | clique_ids = dataset.work_id.unique().tolist() 20 | clique_sims = list() 21 | for work_id in clique_ids: 22 | song_titles = dataset.title[dataset.work_id == work_id].values.tolist() 23 | distances = list() 24 | for (title1, title2) in combinations(song_titles, 2): 25 | measure = ratio(title1, title2) 26 | distances.append(measure) 27 | clique_sims.append(np.mean(distances)) 28 | return clique_sims 29 | 30 | 31 | def get_clique_similarity_dif_set(dataset_csv): 32 | """Compute Levenshtein similarity of song titles in different cliques in SHS""" 33 | distances = list() 34 | dataset = pd.read_csv(dataset_csv) 35 | clique_ids = dataset.work_id.unique().tolist() 36 | all_titles = dataset.title.values.tolist() 37 | 38 | for i in range(len(clique_ids)): 39 | ref_title = random.choice(all_titles) 40 | clique_id = dataset.work_id[dataset.title == ref_title].values[0] 41 | ref_titles = dataset.title[dataset.work_id != clique_id].values.tolist() 42 | com_title = random.choice(ref_titles) 43 | distance = ratio(ref_title, com_title) 44 | distances.append(distance) 45 | 46 | return distances 47 | 48 | 49 | def plot_clique_similarity_dist(dataset_csv): 50 | """Plot the distribution plot of string similarities within and outside its clique""" 51 | import matplotlib.pyplot as plt 52 | import seaborn as sns 53 | palette = ["#000000", "#737170"] 54 | sns.set_palette(palette) 55 | sim_same_clique = get_clique_similarity_same_set(dataset_csv) 56 | sim_dif_clique = get_clique_similarity_dif_set(dataset_csv) 57 | sns.distplot(sim_same_clique, hist=True, 58 | kde_kws={"lw": 1, "label": "within same clique"}) 59 | sns.distplot(sim_dif_clique, hist=True, 60 | kde_kws={"lw": 1, "label": "within different clique"}) 61 | plt.xlabel("Similarity measure") 62 | plt.ylabel("Density") 63 | plt.show() 64 | return 65 | -------------------------------------------------------------------------------- /utilities/plots.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Functions for various plots on the results json file 4 | """ 5 | import pandas as pd 6 | import seaborn as sns 
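# A minimal usage sketch for this module (hypothetical path and metric values;
# assumes a results JSON shaped like
#   {"title_match": [{"size": 100, "map": 0.39, "mper": 55.0}, ...]}):
#   info = parse_results('./logs/results.json', method='title_match')
#   plot_results_boxplot('./logs/results.json', metric='map')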
7 | import matplotlib.pyplot as plt 8 | import json 9 | 10 | 11 | def parse_results(jsonfile, method): 12 | """ 13 | jsonfile : path to jsonfile with experiment results 14 | method : Any of the methods inside the list 15 | ['title_match', rerank_artist_id', 'lyrics_more_like'] 16 | 17 | """ 18 | with open(jsonfile) as f: 19 | data = json.load(f) 20 | sizes = list() 21 | mp = list() 22 | mper = list() 23 | for key, values in data.iteritems(): 24 | if key == method: 25 | for d in values: 26 | sizes.append(d['size']) 27 | mp.append(d['map']) 28 | mper.append(d['mper']) 29 | 30 | return {"map": mp, "mper": mper, "size": sizes} 31 | 32 | 33 | def plot_optimal_topN_pruning(results_json): 34 | with open(results_json) as f: 35 | data = json.load(f) 36 | results = [(int(key), value['map']) for key, value in data.iteritems()] 37 | results = sorted(results, key=lambda r: int(r[0])) 38 | sizes = [x[0] for x in results] 39 | metrics = [x[1] for x in results] 40 | 41 | # plt.title("Mean average precision of msd song-title experiment 42 | # on the SHS train against the MSD for various prune sizes") 43 | plt.plot(sizes, metrics) 44 | plt.xlabel("Prune size (k)") 45 | plt.ylabel("Mean Average Precision") 46 | plt.show() 47 | return 48 | 49 | 50 | # functions for plotting some stats 51 | def plot_lang_stats(msd_dzr_lang_csv, crop=True, barwidth=0.99, norm=False): 52 | """plots the histogram of language distribution""" 53 | lan_csv = pd.read_csv(msd_dzr_lang_csv) 54 | langs = lan_csv.lan.unique() 55 | freqs = list() 56 | for lan in langs: 57 | freqs.append(len(lan_csv[lan_csv.lan == lan])) 58 | sorted_tup = [(item[0], item[1]) for item in zip(langs, freqs)] 59 | sorted_tup.sort(key=lambda x: x[1], reverse=True) 60 | print "\n---Language stats for MSD---\n" 61 | for item in sorted_tup: 62 | print "%s : %s" % (item[0], (item[1]/1000000. 
* 100))
 63 | 
 64 |     langs = [item[0] for item in sorted_tup]
 65 |     freqs = [item[1] for item in sorted_tup]
 66 |     if norm:
 67 |         freqs = [(item/1000000.)*100 for item in freqs]
 68 |     if crop:
 69 |         langs = langs[:10]
 70 |         freqs = freqs[:10]
 71 | 
 72 |     # bar_locs = np.arange(1+barwidth, len(langs))
 73 |     # Plotting histogram
 74 |     plt.title("Histogram of top 10 languages in the MillionSongDataset")
 75 |     ax = plt.subplot(111)
 76 |     bins = range(1, len(freqs)+1)  # (the identity map() here was redundant)
 77 |     ax.bar(bins, freqs, width=barwidth)
 78 |     ax.set_xticks(range(1, len(langs)+1))
 79 |     ax.set_xticklabels(langs, rotation=0)
 80 |     ax.set_xlabel("Language")
 81 |     ax.set_ylabel("Percentage (%)")
 82 |     plt.show()
 83 |     return
 84 | 
 85 | 
 86 | def plot_results_boxplot(jsonfile, metric='map'):
 87 |     """
 88 |     map = mean average precision, mper = mean percentage of covers
 89 |     """
 90 |     title_info = parse_results(jsonfile, method='title_match')
 91 |     rerank_info = parse_results(jsonfile, method='rerank_artist_id')
 92 |     lyrics_info = parse_results(jsonfile, method='lyrics_more_like')
 93 | 
 94 |     data = pd.DataFrame({"song-title": title_info[metric],
 95 |                          "artist_id_rerank": rerank_info[metric],
 96 |                          "lyrics": lyrics_info[metric]})
 97 | 
 98 |     ax = sns.boxplot(data=data,
 99 |                      palette="Set2",
100 |                      orient="v",
101 |                      order=["song-title", "artist_id_rerank", "lyrics"])
102 | 
103 |     ax.set_xlabel("search-method")
104 |     if metric == 'map':
105 |         ax.set_ylabel("mean average precision (MAP)")
106 |     if metric == 'mper':
107 |         ax.set_ylabel("mean percentage of covers (MPER)")
108 |     plt.show()
109 |     return
110 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/README.md:
--------------------------------------------------------------------------------
 1 | # Reproducing MIREX 2009 Cover Song Detection Algorithm results
 2 | 
 3 | # Requirements
 4 | 
 5 | * Download the binary codes of the Serra et al. 2009 MIREX submission from
 6 | [here](http://www.iiia.csic.es/~jserra/downloads/2009_SerraZA_MIREX-Covers.tar.gz) and copy them into this directory.
 7 | 
 8 | 
 9 | # Document structure
10 | 
11 | (Note : the wildcard (\*) denotes the index of a query in the aggregated results dataframe,
12 | i.e. [0 to 4252] in the case of the shs_test_dzr dataset)
13 | 
14 | ## Query lists
15 | 
16 | ```
17 | ./path_to_query_folder/query_*_.txt
18 | ```
19 | 
20 | ## Collection lists
21 | ```
22 | ./path_to_collections_folder/collections_*_.txt
23 | ```
24 | 
25 | # Usage
26 | 
27 | ```bash
28 | $ python run_mirex_binary.py -a ./audio_collections/ -c ./collection_txts/ -q ./query_txts/ -p ./output_features/ -o ./qmax_output/
29 | ```
30 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/compute_hpcpFeatures.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 3 ]; then
 4 |     echo "USAGE: ./compute_hpcpFeatures.sh <collection_list_txt> <feature_output_dir> <log_filename>"
 5 |     exit
 6 | fi
 7 | 
 8 | echo "Extracting descriptors..."
 9 | ./myessentiaextractor -sl $1 -op $2 -dn hpcp -ah 20 -al 20 -at divmax > $2$3
10 | 
11 | echo "Done."
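# Example invocation (hypothetical paths, mirroring how run_mirex_binary.py calls
# this script; the myessentiaextractor binary is assumed to sit in this directory):
#   ./compute_hpcpFeatures.sh ./collection_txts/collections_0_.txt ./output_features/ hpcp_logs_split_0.txt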
12 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/compute_qmaxDistance.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 5 ]; then
 4 |     echo "USAGE: ./compute_qmaxDistance.sh <collection_list_txt> <query_list_txt> <feature_dir> <qmax_raw_output> <refined_output>"
 5 |     exit
 6 | fi
 7 | 
 8 | echo "Computing Qmax..."
 9 | ./coverid -d qmax -q $2 -c $1 -p $3 -oti 2 -m 9 -tau 1 -k 0.095 -go 0.5 -ge 0.5 > $4
10 | 
11 | echo "Refining distances..."
12 | ./setdetect -rf $4 -nn 1 -dt 1000.0 > $5
13 | 
14 | echo "Done...."
15 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/run_mirex_binary.py:
--------------------------------------------------------------------------------
 1 | from joblib import Parallel, delayed
 2 | import time
 3 | import os
 4 | import subprocess
 5 | import argparse
 6 | 
 7 | 
 8 | def compute_features(args):
 9 |     '''Run the HPCP feature extraction script on one (collections_txt, feature_path, logfile_path) tuple'''
10 |     print args
11 |     start_time = time.time()
12 |     collections_txt = args[0]
13 |     feature_path = args[1]
14 |     logfile_path = args[2]
15 |     ret = subprocess.call("./compute_hpcpFeatures.sh %s %s %s" % (collections_txt, feature_path, logfile_path), shell=True)
16 |     print "Feature extraction finished in --%s-- s" % (time.time() - start_time)
17 |     return ret
18 | 
19 | 
20 | def compute_qmax_distance(args):
21 |     '''Run the qmax distance script on one (collections_txt, query_txt, directory, log_filename, output_filename) tuple'''
22 |     print args
23 |     collections_txt = args[0]
24 |     query_txt = args[1]
25 |     directory = args[2]
26 |     log_filename = args[3]
27 |     output_filename = args[4]
28 |     return subprocess.call("./compute_qmaxDistance.sh %s %s %s %s %s" % (collections_txt, query_txt, directory, log_filename, output_filename), shell=True)
29 | 
30 | 
31 | def serra_cover_algo(collections_txt, query_txt, directory, output_filename):
32 |     '''Run the full pipeline (features + qmax) for a single query/collection pair.
33 |     The two log filenames below are placeholders.'''
34 |     feature_process = compute_features((collections_txt, directory, 'hpcp_log.txt'))
35 |     if feature_process != 0:
36 |         raise Exception("Feature extraction process failed ...")
37 |     qmax_process = compute_qmax_distance((collections_txt, query_txt, directory, 'qmax_log.txt', output_filename))
38 |     if qmax_process != 0:
39 |         raise Exception("Qmax distance computation failed ...")
40 |     return
41 | 
42 | 
43 | def run_feature_extraction(collection_directory, feature_directory):
44 |     '''Run the feature extraction with parallelisation'''
45 |     # filter hidden files with a comprehension (removing items while iterating skips elements)
46 |     collection_files = [s for s in os.listdir(collection_directory) if not s.startswith(".")]
47 |     collection_files = [collection_directory+s for s in collection_files]
48 |     collection_files = sorted(collection_files, key=lambda x: int(x.split('_')[2].split('/')[1]))
49 |     print "%s collections txt files found..." % len(collection_files)
50 |     feature_path = [feature_directory for i in range(len(collection_files))]
51 |     log_file_paths = ['hpcp_logs_split_'+str(i)+'.txt' for i in range(len(collection_files))]
52 | 
53 |     args = zip(collection_files, feature_path, log_file_paths)
54 |     Parallel(n_jobs=-1, verbose=1)(map(delayed(compute_features), args))
55 |     return
56 | 
57 | 
58 | def run_qmax_computation(col_path, query_path, feature_path, out_path):
59 |     '''Run the qmax distance computation with parallelisation'''
60 |     # filter hidden files with comprehensions (do not mutate a list while iterating over it)
61 |     collection_files = [s for s in os.listdir(col_path) if not s.startswith(".")]
62 |     query_files = [x for x in os.listdir(query_path) if not x.startswith(".")]
63 |     collection_files = sorted(collection_files, key=lambda m: int(m.split('_')[1]))
64 |     query_files = sorted(query_files, key=lambda m: int(m.split('_')[1]))
65 | 
66 |     collection_files = [col_path+c for c in collection_files]
67 |     query_files = [query_path+q for q in query_files]
68 | 
69 |     feature_directory = [feature_path for i in range(len(query_files))]
70 |     log_filenames = [feature_path+'qmax_log_'+str(i)+'.txt' for i in range(len(query_files))]
71 |     out_filenames = [out_path+'output_qmax_'+str(i)+'.txt' for i in range(len(query_files))]
72 | 
73 |     args = zip(collection_files, query_files, feature_directory, log_filenames, out_filenames)
74 |     Parallel(n_jobs=-1, verbose=1)(map(delayed(compute_qmax_distance), args))
75 |     return
76 | 
77 | 
78 | if __name__ == '__main__':
79 | 
80 |     parser = argparse.ArgumentParser(description="Run the MIREX cover similarity algorithm (Serra et al. 2009) binary files with parallelisation",
81 |                                      formatter_class=argparse.ArgumentDefaultsHelpFormatter)
82 | 
83 |     parser.add_argument("-a", action="store", default='./audio_collections/',
84 |                         help="path to collection files for audio feature extraction")
85 |     parser.add_argument("-c", action="store", default='./collection_txts/',
86 |                         help="path to collection files for qmax")
87 |     parser.add_argument("-q", action="store", default='./query_txts/',
88 |                         help="path to query files for qmax")
89 |     parser.add_argument("-p", action="store", default="./output_features/",
90 |                         help="path to the directory where the audio features should be stored")
91 |     parser.add_argument("-o", action="store", default='./qmax_output/',
92 |                         help="path to the directory for the qmax output files")
93 |     parser.add_argument("-m", action="store", default=0,
94 |                         help="mode of the process")
95 | 
96 |     cmd_args = parser.parse_args()
97 | 
98 |     run_feature_extraction(cmd_args.a, cmd_args.p)
99 | 
100 |     print 'Feature extraction finished'
101 | 
102 |     run_qmax_computation(cmd_args.c, cmd_args.q, cmd_args.p, cmd_args.o)
103 | 
104 |     print "\n.....DONE...."
105 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/run_submission.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 4 ]; then
 4 |     echo "USAGE: ./run_submission.sh <collection_list_txt> <query_list_txt> <working_dir> <result_file>"
 5 |     exit
 6 | fi
 7 | 
 8 | #echo "Creating temporary directory"
 9 | #mkdir $3
10 | 
11 | echo "Extracting descriptors..."
12 | ./myessentiaextractor -sl $1 -op $3 -dn hpcp -ah 20 -al 20 -at divmax > $3/log_feature_extraction.txt
13 | 
14 | echo "Computing distances..."
15 | ./coverid -d qmax -q $2 -c $1 -p $3 -oti 2 -m 9 -tau 1 -k 0.095 -go 0.5 -ge 0.5 > $3/log_temporaryresults.txt
16 | 
17 | echo "Refining distances..."
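# setdetect post-processes the raw qmax distances produced by coverid above;
# the -nn/-dt parameters below match the standalone compute_qmaxDistance.sh script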
18 | ./setdetect -rf $3/log_temporaryresults.txt -nn 1 -dt 1000.0 > $4
19 | 
20 | echo "Removing logs and temporary files..."
21 | rm -r $3
22 | 
23 | echo "Done."
24 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/utils.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import glob
 3 | 
 4 | 
 5 | def rename_txts(path):
 6 |     """One-off helper to rename the output .txt files based on a '_'-separated token of their path"""
 7 |     files = glob.glob(os.path.join(path, '*'))
 8 |     for f in files:
 9 |         if f.endswith('.txt'):
10 |             new_f = f.split("_")[5]+'_'+f
11 |             os.system("mv %s %s" % (f, new_f))
12 |     return
13 | 
14 | 
15 | def format_col_text_for_docker(path):
16 |     """Rewrite the audio paths in the collection .txt files from data/... to mnt/... for the docker setup"""
17 |     files = os.listdir(path)
18 |     for txt_file in files:
19 |         if txt_file.endswith(".txt"):
20 |             fname = txt_file
21 |             f = open(path+txt_file)
22 |             data = f.readlines()
23 |             new_txt = [line.replace("data", "mnt") for line in data]
24 |             f.close()
25 |             savelist_to_file(new_txt, path+fname)
26 |     return
27 | 
28 | 
29 | def format_newline_col(path):
30 |     """Strip empty lines from the collection .txt files"""
31 |     files = glob.glob(os.path.join(path, '*'))
32 |     for txt_file in files:
33 |         if txt_file.endswith('.txt'):
34 |             fname = txt_file
35 |             f = open(txt_file)
36 |             data = f.readlines()
37 |             new_txt = [x for x in data if x != '\n']
38 |             f.close()
39 |             savelist_to_file(new_txt, fname)
40 |     return
41 | 
42 | 
43 | def savelist_to_file(pathList, filename):
44 |     """Write a list of strings to a text file verbatim (no newline is added)"""
45 |     doc = open(filename, 'w')
46 |     for item in pathList:
47 |         doc.write("%s" % item)
48 |     doc.close()
49 |     return
50 | 
--------------------------------------------------------------------------------
/utilities/text_utils.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | """
 3 | Set of functions for text processing
 4 | 
 5 | * Format msd_track titles for the song-title based queries of cover song detection
 6 | ---------------------
 7 | Albin Andrew Correya
 8 | R&D Intern
 9 | @Deezer, 2017
10 | """
11 | 
12 | from nltk.stem import SnowballStemmer
13 | from fuzzywuzzy import fuzz, process
14 | from utils import init_connection, timeit
15 | import re
16 | import csv
17 | # [To be removed in future:
just a small hack to get things done for the moment]
18 | import pandas as pd  # needed by add_formatted_title_to_dataset below
19 | import sys
20 | reload(sys)
21 | sys.setdefaultencoding("utf-8")
22 | stemmer = SnowballStemmer("english")
23 | codebook = ['version', 'demo', 'live', 'remix', 'mix', 'remaster', 'albumn', 'instrumental', 'cover', 'digital',
24 |             'acoustic', 'lp version', 'remastered', 'digital remaster', 'remastered lp version', 'mono', 'stereo',
25 |             'extended', 'vocal mix', 'album version', 'album', 'vocal', 'extended version', 'reprise', 'single',
26 |             'radio edit', 'short version', 'explicit', 'bonus track', 'edit', 'session', 'e.p', 'ep version',
27 |             'original']
28 | 
29 | 
30 | def stemit(string):
31 |     """Apply stemming to a string using nltk.stem.SnowballStemmer()"""
32 |     return stemmer.stem(unicode(string))
33 | 
34 | 
35 | def title_formatter(string, mode='regex', striplist=codebook, threshold=70):
36 |     """
37 |     Remove a parenthesised qualifier from a title when it is similar to one of the predefined items in the strip list
38 |     Note : callback function to be used inside pandas.DataFrame.apply(); apply it twice to catch nested qualifiers
39 | 
40 |     Inputs :
41 |         mode : choose one mode from ['regex', 'fuzzy']
42 |             'regex' : uses regex matching
43 |             'fuzzy' : uses fuzzy Levenshtein distance
44 | 
45 |         striplist : a list of strings against which the match is computed
46 |             eg : ['version', 'live', 'remix', 'mix', 'remaster', 'albumn',
47 |                   'instrumental', 'cover', 'digital', 'acoustic', 'lp version', 'remastered',
48 |                   'digital remaster', 'remastered lp version', 'mono', 'stereo']
49 | 
50 |     eg : >>> string = "Let it be (Live Version)"
51 |          >>> title_formatter(string)
52 |          out: "Let it be"
53 |     """
54 |     # to avoid 'NaN' values appearing in the pandas dataframe when applied as a callback function
55 |     if type(string) != float:
56 |         to_remove = "(" + string[string.find("(")+1:string.find(")")] + ")"
57 |         stemmed_str = stemit(to_remove)
58 |         for word in striplist:
59 |             if mode == 'fuzzy':
60 |                 if fuzz.ratio(stemmed_str, word) >= threshold:
61 |                     return string.replace(to_remove, "")
62 |             if mode == 'regex':
63 |                 # strip the parsed substring if there is any match with the words in the striplist
64 |                 if re.findall(r"\b" + word + r"\b", stemit(string)):
65 |                     return string.replace(to_remove, "")
66 |     return string
67 | 
68 | 
69 | @timeit
70 | def add_formatted_title_to_dataset(dataset_csv, mode='regex'):
71 |     """
72 |     dataset_csv : a SHS csv file
73 |     mode : choose one mode from ['regex', 'fuzzy']
74 |     """
75 |     dataset = pd.read_csv(dataset_csv)
76 |     new_data = pd.DataFrame()
77 |     new_data['new_title'] = dataset.title.apply(title_formatter, mode=mode)
78 |     new_data.new_title = new_data.new_title.apply(title_formatter, mode=mode)
79 |     new_data = new_data.merge(dataset, left_index=True, right_index=True)
80 |     return new_data
81 | 
82 | 
83 | @timeit
84 | def get_formatted_msd_track_title_csv(db_file, filename='./msd_formatted_titles.csv'):
85 |     """
86 |     Remove and reformat all the MSD song titles and store them as a csv file.
 87 |     The output csv file is structured as follows :
 88 |         msd_track_id, msd_song_title, msd_new_song_title
 89 | 
 90 |     Inputs :
 91 |         db_file - the track_metadata.db file provided by LabROSA
 92 |         filename - filename for the output csv file ('./msd_formatted_titles.csv' by default)
 93 | 
 94 | 
 95 |     [NOTE] : tested runtime ~ 33.76 minutes
 96 |     """
 97 | 
 98 |     def double_format(string):
 99 |         s = title_formatter(string)
100 |         return title_formatter(s)
101 | 
102 |     con = init_connection(db_file)
103 |     query = con.execute("""SELECT track_id, title FROM songs""")
104 |     results = query.fetchall()
105 |     con.close()
106 |     with open(filename, 'w') as csvfile:
107 |         writer = csv.DictWriter(csvfile, fieldnames=['msd_id', 'msd_title', 'title'])
108 |         writer.writeheader()
109 |         cnt = 0
110 |         for track_id, track_name in results:
111 |             print "--%s--" % cnt
112 |             if track_name:
113 |                 title = double_format(track_name).encode('utf8')
114 |                 writer.writerow({'msd_id': track_id,
115 |                                  'msd_title': track_name,
116 |                                  'title': title})
117 |             cnt += 1
118 |     print "~Done..."
119 |     return
120 | 
121 | 
122 | def extract_removefactor(string):
123 |     if type(string) != float:
124 |         return "(" + string[string.find("(")+1:string.find(")")] + ")"
125 |     return string
126 | 
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | """
 3 | Some general utility functions
 4 | 
 5 | Albin Andrew Correya
 6 | R&D Intern
 7 | @2017
 8 | """
 9 | 
10 | import logging
11 | import time
12 | import json
13 | import csv
14 | 
15 | 
16 | def log(log_file):
17 |     """Returns a logger object with predefined settings"""
18 |     root_logger = logging.getLogger(__name__)
19 |     root_logger.setLevel(logging.DEBUG)
20 |     file_handler = logging.FileHandler(log_file)
21 |     log_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
22 |     file_handler.setFormatter(log_formatter)
23 |     root_logger.addHandler(file_handler)
24 |     console_handler = logging.StreamHandler()
25 |     console_handler.setFormatter(log_formatter)
26 |     root_logger.addHandler(console_handler)
27 |     return root_logger
28 | 
29 | 
30 | def timeit(method):
31 |     """Custom timeit profiling function."""
32 |     def timed(*args, **kw):
33 |         ts = time.time()
34 |         result = method(*args, **kw)
35 |         te = time.time()
36 |         if 'log_time' in kw:
37 |             name = kw.get('log_name', method.__name__.upper())
38 |             kw['log_time'][name] = int((te - ts) * 1000)
39 |         else:
40 |             print '%r - runtime : %2.2f ms' % \
41 |                   (method.__name__, (te - ts) * 1000)
42 |         return result
43 |     return timed
44 | 
45 | 
46 | def slice_results(results_json, size):
47 |     """Slice the query response results to a specified size"""
48 |     with open(results_json) as f:
49 |         data = json.load(f)
50 |     sliced_dict = dict()
51 |     for msd_id in data.keys():
52 |         if isinstance(data[msd_id]['id'], list):
53 |             sliced_dict[msd_id] = {
54 |                 'id': data[msd_id]['id'][:size],
55 |                 'score': data[msd_id]['score'][:size]
56 |             }
57 |         else:
58 |             sliced_dict[msd_id] = {'id': None, 'score': None}
59 |     return sliced_dict
60 | 
61 | 
62 | # some utils for accessing the msd_metadata sql db
63 | def init_connection(db_file):
64 |     """Loads a sqlite db file and returns the connection object"""
65 |     try:
66 |         import sqlite3
67 |         con = sqlite3.connect(db_file)  # specify the path to the sql db file provided by the LabROSA team
68 |     except Exception:
69 |         raise ImportError("Cannot import db_file")
70 |     return con
71 | 
72 | 
73 | def get_fields_from_msd_db(db_file, field_name='track_id'):
74 |     """
75 |     Input : the "track_metadata.db" sql db file provided by LabROSA
76 |     Output : a list of the specified field for the 1M songs in the MSD dataset
77 |     """
78 |     con = init_connection(db_file)
79 |     query = con.execute("""SELECT %s FROM songs""" % field_name)
80 |     results = query.fetchall()
81 |     con.close()
82 |     return [field[0] for field in results]
83 | 
84 | 
85 | def get_msd_data_from_track_id(con, track_id, field_name='track_name'):
86 |     query = con.execute("""SELECT %s FROM songs WHERE track_id='%s'""" % (field_name, track_id))
87 |     results = query.fetchall()
88 |     return [field[0] for field in results]
89 | 
90 | 
91 | def get_msd_field_metadata_from_ids(db_file, track_ids, field_name='track_name'):
92 |     con = init_connection(db_file)
93 |     # msd track ids are strings (eg. 'TRPIIKF128F1459A09'), so they have to be quoted in the IN clause
94 |     msd_ids = ','.join(["'%s'" % msd_id for msd_id in track_ids])
95 |     query = con.execute("""SELECT %s FROM songs WHERE track_id IN (%s)""" % (field_name, msd_ids))
96 |     results = query.fetchall()
97 |     return [field[0] for field in results]
98 | 
--------------------------------------------------------------------------------
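For orientation, here is a short sketch of how the helpers in `utils.py` above fit together. The `track_metadata.db` and results-file paths are placeholders for files you provide yourself (the db comes from LabROSA, as noted in the docstrings):

```python
from utils import log, timeit, slice_results, get_fields_from_msd_db

logger = log('./logs/example.log')


@timeit
def count_msd_tracks(db_file):
    """Count the msd track ids available in the LabROSA track_metadata.db file."""
    track_ids = get_fields_from_msd_db(db_file, field_name='track_id')
    logger.info("%s tracks found in %s" % (len(track_ids), db_file))
    return len(track_ids)


# prune a saved query-response file down to its top-10 candidates per query
top10 = slice_results('./results/title_match_results.json', size=10)

count_msd_tracks('./track_metadata.db')
```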