├── README.md
├── __init__.py
├── datasets
│   ├── shs_dzr_test.csv
│   ├── shs_dzr_train.csv
│   ├── test_shs.csv
│   └── train_shs.csv
├── es_search.py
├── evaluations.py
├── experiments.py
├── requirements.txt
├── templates.py
├── utilities
│   ├── __init__.py
│   ├── audio_utils.py
│   ├── clique_similarity.py
│   ├── plots.py
│   ├── serra_et_al_2009
│   │   ├── README.md
│   │   ├── compute_hpcpFeatures.sh
│   │   ├── compute_qmaxDistance.sh
│   │   ├── run_mirex_binary.py
│   │   ├── run_submission.sh
│   │   └── utils.py
│   └── text_utils.py
└── utils.py

/README.md:
--------------------------------------------------------------------------------
# Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features

Source code and supplementary materials for the paper: Correya, Albin, Romain Hennequin, and Mickaël Arcos. "Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features." arXiv preprint arXiv:1808.10351 (2018).

This repo contains scripts to run the text-based experiments for the cover song detection task on the [Million Song Dataset (MSD)](https://labrosa.ee.columbia.edu/millionsong/),
which is imported into an [Elasticsearch (ES)](https://www.elastic.co/blog/what-is-an-elasticsearch-index) index as described in the paper.

# Requirements

Install the Python dependencies from the requirements.txt file:

```
$ pip install -r requirements.txt
```

# Setup

* Use the [ElasticMSD](https://github.com/deezer/elasticmsd) scripts to set up your local Elasticsearch index of the MSD.
* Fill in your ES db credentials (host, port and index) as environment variables on your local system.
Check the [templates.py](templates.py) file.

## Datasets

The following datasets have corresponding mappings to MSD tracks. These data are ingested into the ES index with an update operation:

* The [Second Hand Songs (SHS)](https://labrosa.ee.columbia.edu/millionsong/secondhand) dataset. Check the ./datasets folder.
* For lyrics, we used the [musiXmatch (MXM)](https://labrosa.ee.columbia.edu/millionsong/musixmatch) dataset.

# Usage

## Modular mode

This section gives a glimpse of how to use the classes and their various methods for running experiments.

```python
# import modules
from es_search import SearchModule
from experiments import Experiments
import templates as presets

# initiate the search class
es = SearchModule(presets.uri_config)

# search by msd_track title in view mode
results = es.search_by_exact_title('Listen To My Babe', 'TRPIIKF128F1459A09', out_mode='view')

# You can also use the Experiments class to automate particular experiments for a method.
# Initiate it with an instance of SearchModule and the path to the dataset as arguments.
exp = Experiments(es, './datasets/test_shs.csv')

# run the song-title match experiment with top 100 results
results = exp.run_song_title_match_task(size=100)

# compute the evaluation metric for the task
mean_avg_precision = exp.mean_average_precision(results)
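
# The task methods return a pandas DataFrame indexed by the query msd_id, with
# 'id' (ranked response ids) and 'score' (their ES scores) columns, so results
# can be persisted and reloaded later. A minimal sketch (the file name is just
# a placeholder):
results.to_json('./title_match_results.json', orient='index')
results = exp.load_result_json_as_df('./title_match_results.json')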

# reset the presets if you want to run another experiment on the same SearchModule instance
exp.reset_preset()

results = exp.run_mxm_lyrics_search_task(size=1000)

mean_avg_precision = exp.mean_average_precision(results)
```

## Evaluation tasks

Some examples of using the functions in the evaluations.py script to reproduce the results reported in the paper:

```python
from evaluations import *

# evaluation task on the SHS train set against the whole MSD (1 x 999,999 songs)
shs_train_set_evals(size=100, method="msd_title", mode="msd", with_duplicates=True)

# you can specify various prune sizes and methods as parameters
shs_train_set_evals(size=1000, method="mxm_lyrics", mode="msd", with_duplicates=False)

# you can run the same experiment on the SHS train set against itself by specifying the "mode" param as "shs" (1 x 12,960)
shs_train_set_evals(size=100, method="msd_title", mode="shs", with_duplicates=True)

# in the same way, you can run the evaluation experiments on the SHS test set
shs_test_set_evals(size=100, method="title_mxm_lyrics", with_duplicates=True)
```

If you don't care about how the module works and only need the results of the various experiments, this is for you.
evaluations.py is a wrapper around the modules that runs automated experiments and saves the results to a .log file or a JSON template.
The experiments are multi-threaded and can be run from the terminal using command-line arguments.

```bash
$ python evaluations.py -m test -t -1 -e msd -d 0 -s 100

-m : (type: string) choose between "train" and "test" modes
-t : (type: int) number of threads (-1 to use all available cores)
-e : (type: string) choose between "msd" and "shs" modes
-d : (type: int) 1 to include the MSD official duplicates, 0 to exclude them
-s : (type: int) required pruning size for the experiments
```
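For example, to run the train-set experiments against the SHS subset itself with four threads, keeping duplicates and pruning to the top 1000 results (illustrative flag values, using the options documented above):

```bash
$ python evaluations.py -m train -t 4 -e shs -d 1 -s 1000
```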
# Cite

If you use this work, please cite our paper.

```
Correya, Albin, Romain Hennequin, and Mickaël Arcos. "Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features." arXiv preprint arXiv:1808.10351 (2018).
```

BibTeX format:
```
@article{correya2018large,
  title={Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features},
  author={Correya, Albin and Hennequin, Romain and Arcos, Micka{\"e}l},
  journal={arXiv preprint arXiv:1808.10351},
  year={2018}
}
```
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/datasets/shs_dzr_train.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deezer/cover_song_detection/035b925c4a3380ad833202da618f5025e7643322/datasets/shs_dzr_train.csv
--------------------------------------------------------------------------------
/es_search.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Set of functions and methods for various search requests to the Deezer Elasticsearch-augmented MSD db

Albin Andrew Correya
R&D Intern
@Deezer, 2017
"""
from requests import get
from copy import deepcopy
import templates as presets


class SearchModule(object):
    """
    Class containing custom methods to search the Elasticsearch index containing the augmented MSD dataset
    """
    from elasticsearch import Elasticsearch
    import pandas as pd
    import json

    init_json = deepcopy(presets.simple_query_string)  # save the preset as an attribute

    def __init__(self, uri_config, query_json=None, timeout=30):
        """
        Init params:
            uri_config : dictionary specifying the host, port, scheme, index and type of the ES db
                         (check 'uri_config' in the templates.py file)
            query_json : query DSL dict to use as the initial POST body {default : None}
        """
        self.config = uri_config
        self.handler = self.Elasticsearch(hosts=[{'host': self.config['host'],
                                                  'port': self.config['port'],
                                                  'scheme': self.config['scheme']}], timeout=timeout)

        if query_json:
            self.post_json = query_json
        else:
            self.post_json = presets.simple_query_string

        return

    def _load_json(self, jsonfile):
        """Load a json file as a python dict"""
        with open(jsonfile) as f:
            json_data = self.json.load(f)
        return json_data

    def _make_request(self, target_url, query, verbose=False):
        """
        [DEPRECATED] make the request and fetch the results
        """
        if verbose:
            print "GET %s -d '%s'" % (target_url, self.json.dumps(query))
        r = get(target_url, data=self.json.dumps(query))
        return self.json.loads(r.text)

    def _format_url(self, msd_id):
        return "%s://%s:%s/%s/%s/%s" % (
            self.config['scheme'],
            self.config['host'],
            self.config['port'],
            self.config['index'],
            self.config['type'],
            msd_id
        )
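
    # _format_url builds the document URL for a single track, e.g. (with
    # placeholder credentials host=localhost, port=9200, index=msd, type=track):
    #   http://localhost:9200/msd/track/TRPIIKF128F1459A09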
    def _format_query(self, query_str, msd_id, mode='simple_query', field='msd_title', size=100):
        """
        Format the POST json dict object with query_str and msd_id
        """
        if mode == 'simple_query':
            self.post_json['query']['bool']['must'][0]['simple_query_string']['query'] = query_str
            self.post_json['query']['bool']['must'][0]['simple_query_string']['fields'][0] = field
            # we exclude the query id from the result
            self.post_json['query']['bool']['must_not'][0]['query_string']['query'] = msd_id
            self.post_json['size'] = size
        if mode == 'query_string':
            self.post_json['query']['bool']['must'][0]['query_string']['query'] = query_str
            # we exclude the query id from the result
            self.post_json['query']['bool']['must_not'][0]['query_string']['query'] = msd_id
            self.post_json['size'] = size
        return self.post_json

    @staticmethod
    def _format_init_json(init_json, query_str, msd_id, field='msd_title', size=100):
        """
        Format a fresh init_json template with query_str and msd_id
        """
        init_json['query']['bool']['must'][0]['simple_query_string']['query'] = query_str
        init_json['query']['bool']['must'][0]['simple_query_string']['fields'][0] = field
        # we exclude the query id from the result
        init_json['query']['bool']['must_not'][0]['query_string']['query'] = msd_id
        init_json['size'] = size
        return init_json

    @staticmethod
    def _parse_response_for_eval(response):
        """
        Parse the list of msd_track_ids and their respective scores from a search response json

        Input :
            response : json response from elasticsearch
        """
        msd_ids = [d['_id'] for d in response]
        scores = [d['_score'] for d in response]
        return msd_ids, scores

    def _view_response(self, response):
        """
        Aggregate a response as a pandas dataframe to view it as a table in the ipython console

        Input :
            response : json response from elasticsearch

        Output : a pandas dataframe with the aggregated results
        """
        row_list = [(track['_id'], track['_score'], track['_source']['msd_title']) for track in response]

        results = self.pd.DataFrame({
            'msd_id': [r[0] for r in row_list],
            'score': [r[1] for r in row_list],
            'msd_title': [r[2] for r in row_list]
        })
        return results

    def format_lyrics_post_json(self, body, lyrics, track_id, size, field='dzr_lyrics.content'):
        """
        Format the post_json template for lyrics search with the lyrics and msd_track_id
        """
        self.post_json = body
        self.post_json['query']['bool']['must'][0]['more_like_this']['like'] = lyrics
        self.post_json['query']['bool']['must'][0]['more_like_this']['fields'][0] = field
        # we exclude the query id from the results
        self.post_json['query']['bool']['must_not'][0]['query_string']['query'] = track_id
        self.post_json['size'] = size
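
    # After formatting, the 'must' clause of the POST body looks roughly like
    # this (sketch, based on the more_like_this template in templates.py):
    #   {"more_like_this": {"fields": ["dzr_lyrics.content"], "like": "<lyrics text>",
    #                       "min_term_freq": 1, "max_query_terms": 12}}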
    def limit_post_json_to_shs(self):
        """
        Limits the search to the songs present in the Second Hand Songs train set as provided by labrosa,
        ie. limits the search space to 1 x 12,960 from 1 x 1M
        """
        if len(self.post_json['query']['bool']['must']) <= 1:  # mar: why this condition?
            self.post_json['query']['bool']['must'].append({'exists': {'field': 'shs_id'}})

    def limit_to_dzr_mapped_msd(self):
        """
        Limits the search to the songs which have a respective mapping to the deezer_song_ids,
        ie. limits the search space to 1 x ~83k
        """
        self.post_json['query']['bool']['must'].append({'exists': {'field': 'dzr_song_title'}})

    def add_remove_duplicates_filter(self):
        """
        Filter out songs with the field 'msd_is_duplicate_of' from the search
        response using the must_not exists method in the post-request
        """
        if len(self.post_json['query']['bool']['must_not']) <= 1:  # mar: why this condition?
            self.post_json['query']['bool']['must_not'].append({'exists': {'field': 'msd_is_duplicate_of'}})

    @staticmethod
    def add_must_field_to_query_dsl(post_json, role_type='Composer', field='dzr_artists.role_name',
                                    query_type='simple_query_string'):
        post_json['query']['bool']['must'].append({query_type: {'fields': [field], 'query': role_type}})
        return post_json  # mar: why here we return something and not the previous add_ function

    @staticmethod
    def add_role_artists_to_query_dsl(post_json, artist_names, field='dzr_artists.artist_name',
                                      query_type='simple_query_string'):
        if len(artist_names) > 1:
            query_str = ' OR '.join(artist_names)
        else:
            query_str = artist_names[0]

        post_json['query']['bool']['must'].append({query_type: {'fields': [field], 'query': query_str}})
        return post_json  # mar: why here we return something and not the previous add_ function

    @staticmethod
    def parse_field_from_response(response, field='msd_artist_id'):
        """
        Parse a particular field value from the es response

        :param response: es response json
        :param field: field_name
        """
        if field not in response['_source'].keys():
            return None
        elif not response['_source'][field]:
            return None
        elif field == 'dzr_lyrics':
            return response['_source'][field]['content']
        else:
            return response['_source'][field]

    def get_field_info_from_id(self, msd_id, field):
        """
        Retrieve the info of a particular field associated with a msd_id in the es db
        eg. get_field_info_from_id(msd_id='TRWFERO128F425FE0D', field='dzr_lyrics.content')
        """
        response = get(self._format_url(msd_id))
        field_info = self.parse_field_from_response(response.json(), field=field)
        return field_info

    def get_mxm_lyrics_by_id(self, track_id):
        """
        Get the Musixmatch lyrics associated with a msd track id from the es index if there are any.
        :param track_id: msd track id
        :return: lyrics string or None
        """
        return self.get_field_info_from_id(msd_id=track_id, field='mxm_lyrics')

    def get_cleaned_title_from_id(self, msd_id, field="dzr_msd_title_clean"):
        """
        Get the preprocessed MSD title by MSD track id
        """
        # mar: the field "dzr_msd_title_clean" should not be a parameter (like in get_mxm_lyrics)
        response = get(self._format_url(msd_id))
        return self.parse_field_from_response(response.json(), field=field)

    def search_es(self, body):
        """
        Make a search request to elasticsearch with the provided json POST dictionary
        [This is a general method you can use for querying the es db with the respective query_dsl as input]

        Input :
            body : JSON post dict for elasticsearch
                   (you can use the template jsons in the templates.py script)
                   eg : body = templates.simple_query_string
        """
        res = self.handler.search(index=self.config["index"], body=body)
        return res['hits']['hits']

    def search_by_exact_title(self, track_title, track_id, mode='simple_query', out_mode='view', size=100):
        """
        Search by track_title using the simple_query_string method of elasticsearch
        """
        res = self.search_es(self._format_query(query_str=track_title, msd_id=track_id, mode=mode, size=size))

        # mar: because the following code is copy/pasted several times, it should be a function
        # like return_results(res, out_mode)
        if out_mode == 'eval':
            msd_ids, scores = self._parse_response_for_eval(res)
            return msd_ids, scores

        if out_mode == 'view':
            return self._view_response(res)

        return None

    def search_with_cleaned_title(self, track_id, out_mode='view', field="dzr_msd_title_clean", size=100):
        """
        Search by the cleaned msd_track_title
        """
        # mar: field="dzr_msd_title_clean" should not be a parameter but included in get_cleaned_title_from_id
        track_title = self.get_cleaned_title_from_id(msd_id=track_id)
        res = self.search_es(self._format_query(query_str=track_title, msd_id=track_id, mode='simple_query',
                                                field=field, size=size))
        if out_mode == 'eval':
            msd_ids, scores = self._parse_response_for_eval(res)
            return msd_ids, scores

        if out_mode == 'view':
            return self._view_response(res)

        return None

    def search_by_mxm_lyrics(self, post_json, msd_track_id, out_mode='eval', size=100):
        """
        Search the es_db by the musixmatch lyrics which are mapped to certain msd_track_ids.
        These mappings are obtained from the musixmatch dataset (https://labrosa.ee.columbia.edu/millionsong/musixmatch)

        [NOTE]: It returns a tuple of a list of response msd_track_ids and
        response scores from the elasticsearch response if the track has corresponding "mxm_lyrics",
        otherwise it returns a tuple of (None, None)

        Inputs:
            post_json : (dict) Query_DSL json template for the es_query (eg. presets.more_like_this)
            msd_track_id : (string) MSD track identifier of the query file

        Params :
            out_mode : (string) available modes (['eval', 'view'])
            size : (int) size of the required response from the es_db
        """
        lyrics = self.get_mxm_lyrics_by_id(msd_track_id)

        if not lyrics:
            return None, None

        self.format_lyrics_post_json(body=post_json, track_id=msd_track_id, lyrics=lyrics,
                                     size=size, field='mxm_lyrics')
        res = self.search_es(body=self.post_json)

        if out_mode == 'eval':
            msd_ids, scores = self._parse_response_for_eval(res)
            return msd_ids, scores

        if out_mode == 'view':
            return self._view_response(res)

        return None
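

if __name__ == '__main__':
    # Quick usage sketch (illustrative; the title/id pair is the one from the
    # README, and this assumes the ES index described in templates.py is reachable)
    es = SearchModule(presets.uri_config)
    print es.search_by_exact_title('Listen To My Babe', 'TRPIIKF128F1459A09', out_mode='view', size=10)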
--------------------------------------------------------------------------------
/evaluations.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Scripts for running various evaluation tasks for large-scale cover song detection

[NOTE] : All the logs are stored in the LOG_FILE

Albin Andrew Correya
R&D Intern
@Deezer
"""

from joblib import Parallel, delayed
from es_search import SearchModule
from experiments import Experiments
from utils import log
import templates as presets
import argparse

# Logging handlers
LOG_FILE = './logs/evaluations.log'
LOGGER = log(LOG_FILE)


def shs_train_set_evals(size, method="msd_title", with_duplicates=True, mode="msd"):
    """
    :param size: required prune size of the results
    :param method: (string type) {default:"msd_title"}
                   choose the method of the experiment; available methods are
                   ["msd_title", "pre-msd_title", "mxm_lyrics", "title_mxm_lyrics", "pre-title_mxm_lyrics"]
    :param with_duplicates: (boolean) {default:True} include
                   or exclude the MSD official duplicate tracks from the experiments
    :param mode: 'msd' or 'shs'
    """

    es = SearchModule(presets.uri_config)

    if mode == "msd":
        if with_duplicates:
            exp = Experiments(es, './datasets/train_shs.csv', presets.shs_msd)
        else:
            exp = Experiments(es, './datasets/train_shs.csv', presets.shs_msd_no_dup)
    elif mode == "shs":
        exp = Experiments(es, './datasets/train_shs.csv', presets.shs_shs)
    else:
        raise Exception("\nInvalid 'mode' parameter ...")

    if method == "msd_title":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_song_title_match_task(size=size)

    elif method == "pre-msd_title":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_cleaned_song_title_task(size=size)

    elif method == "mxm_lyrics":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_mxm_lyrics_search_task(presets.more_like_this, size=size)

    elif method == "title_mxm_lyrics":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_rerank_title_with_mxm_lyrics_task(size=size, with_cleaned=False)

    elif method == "pre-title_mxm_lyrics":
        LOGGER.info("\n%s with size %s, duplicates=%s and msd_mode=%s" %
                    (method, size, with_duplicates, mode))
        results = exp.run_rerank_title_with_mxm_lyrics_task(size=size, with_cleaned=True)

    else:
        raise Exception("\nInvalid 'method' parameter....")

    mean_avg_precision = exp.mean_average_precision(results)
    LOGGER.info("\n Mean Average Precision (MAP) = %s" % mean_avg_precision)

    return


def shs_test_set_evals(size, method="msd_title", with_duplicates=True):
    """
    :param size: required prune size of the results
    :param method: (string type) {default:"msd_title"}
                   choose the method of the experiment; available methods are
                   ["msd_title", "pre-msd_title", "mxm_lyrics", "title_mxm_lyrics", "pre-title_mxm_lyrics"]
    :param with_duplicates: (boolean) {default:True} include
                   or exclude the MSD official duplicate tracks from the experiments
    :return:
    """

    es = SearchModule(presets.uri_config)

    if with_duplicates:
        exp = Experiments(es, './datasets/test_shs.csv', presets.shs_msd)
    else:
        exp = Experiments(es, './datasets/test_shs.csv', presets.shs_msd_no_dup)

    if method == "msd_title":
        LOGGER.info("\n%s with size %s and duplicates=%s " % (method, size, with_duplicates))
        results = exp.run_song_title_match_task(size=size)

    elif method == "pre-msd_title":
        LOGGER.info("\n%s with size %s and duplicates=%s" % (method, size, with_duplicates))
        results = exp.run_cleaned_song_title_task(size=size)

    elif method == "mxm_lyrics":
        LOGGER.info("\n%s with size %s and duplicates=%s" % (method, size, with_duplicates))
        results = exp.run_mxm_lyrics_search_task(presets.more_like_this, size=size)

    elif method == "title_mxm_lyrics":
        LOGGER.info("\n%s with size %s and duplicates=%s" % (method, size, with_duplicates))
        results = exp.run_rerank_title_with_mxm_lyrics_task(size=size, with_cleaned=False)

    elif method == "pre-title_mxm_lyrics":
        LOGGER.info("\n%s with size %s and duplicates=%s" % (method, size, with_duplicates))
        results = exp.run_rerank_title_with_mxm_lyrics_task(size=size, with_cleaned=True)

    else:
        raise Exception("\nInvalid 'method' parameter for the experiment !")

    mean_avg_precision = exp.mean_average_precision(results)
    LOGGER.info("\n Mean Average Precision (MAP) = %s" % mean_avg_precision)

    return


def automate_online_evals(mode, n_threads=-1, exp_mode="msd", is_duplicates=False, size=100,
                          methods=["msd_title", "pre-msd_title", "mxm_lyrics",
                                   "title_mxm_lyrics", "pre-title_mxm_lyrics"]):
    """
    Run the parallelized automated evaluation tasks as per the requirements chosen in the parameters

    :param mode: (type : string) choose either train or test mode from the list ["test", "train"]
    :param n_threads: number of threads to parallelize with (-1 to use all available cores)
    :param exp_mode: (type : string) choose the experiment mode from the list ["msd", "shs"]
    :param is_duplicates: (type : boolean) choose whether the MSD official duplicates should be included in the experiments
    :param size: (type : int) required size of the pruned response
    :param methods: choose a list of methods to compute in the automated process;
                    available methods are ["msd_title", "pre-msd_title",
                    "mxm_lyrics", "title_mxm_lyrics", "pre-title_mxm_lyrics"]
    """
    LOGGER.info("\n ======== Automated online experiments on shs_%s "
                "with exp_mode %s and duplicates %s size %s ======= "
                % (mode, exp_mode, is_duplicates, size))

    sizes = [size for i in range(len(methods))]
    duplicates = [is_duplicates for i in range(len(methods))]

    if mode == "test":
        args = zip(sizes, methods, duplicates)
        # each tuple of args is unpacked into the positional arguments of the eval function
        Parallel(n_jobs=n_threads, verbose=1)(delayed(shs_test_set_evals)(*arg) for arg in args)

    if mode == "train":
        exp_modes = [exp_mode for i in range(len(methods))]
        args = zip(sizes, methods, duplicates, exp_modes)
        Parallel(n_jobs=n_threads, verbose=1)(delayed(shs_train_set_evals)(*arg) for arg in args)

    LOGGER.info("\n ===== Process finished successfully... ===== ")

    return


if __name__ == '__main__':

    parser = argparse.ArgumentParser(
        description="Run the automated evaluations for the cover song detection task mentioned in the paper",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument("-m", action="store", default='test',
                        help="choose either 'train' or 'test' mode")
    parser.add_argument("-t", action="store", type=int, default=-1,
                        help="number of threads required")
    parser.add_argument("-e", action="store", default='msd',
                        help="choose between 'msd' or 'shs' ")
    parser.add_argument("-d", action="store", type=int, default=0,
                        help="choose whether to include (1) or exclude (0) the msd official duplicate songs in the experiments")
    parser.add_argument("-s", action="store", type=int, default=100,
                        help="required prune size for the results")

    args = parser.parse_args()

    d = bool(args.d)
    methods = ["msd_title", "pre-msd_title", "mxm_lyrics", "title_mxm_lyrics", "pre-title_mxm_lyrics"]

    automate_online_evals(mode=args.m, n_threads=args.t, exp_mode=args.e, is_duplicates=d, size=args.s, methods=methods)

    print "\n ...Done..."
--------------------------------------------------------------------------------
/experiments.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Methods for running various experiments for the task of cover song detection
using the metadata and lyrics ingested in the ES MSD index.

----------
Albin Andrew Correya
R&D Intern
@Deezer, 2017
"""

from utils import log, timeit
import templates as presets
import sys
import os

# bad hack for avoiding encoding errors for the moment
# to be removed soon
reload(sys)
sys.setdefaultencoding("utf8")


if not os.path.isdir('./logs/'):
    os.makedirs('./logs/')
LOGGER = log('./logs/experiments.log')


class Experiments(object):
    """
    Class containing methods for running various experiments on the
    SecondHandSongs and MillionSongDataset ingested in the elasticsearch
    msd_augmented index for the task of cover song detection.
    This is a wrapper on the es_search.py -> SearchModule class
    for doing fast prototyping.

    Pandas dataframes and json dicts are mainly used as the data
    structures for dealing with the aggregated response results.

    Usage:
        exp = Experiments(es_search_class, shs_dataset_csv, presets.shs_msd)
        results = exp.run_song_title_match_task(size=100)
        m_avgp = exp.mean_average_precision(results)
    """

    import pandas as pd
    import numpy as np
    import time

    def __init__(self, search_class, shs_csv, profile=None):
        """
        Init parameters

        :param search_class: an instance of the SearchModule class (es_search.py)
        :param shs_csv: path to a csv file of the SecondHandSongs dataset (check the ./datasets/ folder).
                        This will be the query set and groundtruth for the experiments.
        :param profile: {default: None}
                        a python dictionary corresponding to the profile of the experiment object

                        eg: {
                            'filter_duplicates': True,
                            'dzr_map': False,
                            'shs_mode': False
                        }

        NOTE : a set of profile templates can be found inside the templates.py file.
        """
        self.es = search_class
        self.dataset = self._load_csv_as_df(shs_csv)
        self.query_ids = self.dataset.msd_id.values.tolist()
        self.query_titles = self.dataset.title.values.tolist()

        if profile:
            self.filter_duplicates = profile['filter_duplicates']
            self.dzr_map = profile['dzr_map']
            self.shs_mode = profile['shs_mode']
        else:
            self.filter_duplicates = presets.shs_msd['filter_duplicates']
            self.dzr_map = presets.shs_msd['dzr_map']
            self.shs_mode = presets.shs_msd['shs_mode']
        return

    def _load_csv_as_df(self, csvfile):
        """Load a csv file as a pandas dataframe"""
        return self.pd.read_csv(csvfile)

    def _get_subframe_df(self, dataframe, field):
        """Get a particular subframe from the pandas dataframe as a list"""
        return dataframe[field].copy().values.tolist()

    def _tolist(self, x):
        """For use as a pandas dataframe.apply() callback"""
        return list(x)

    def _merge_df(self, results_df, field='msd_id'):
        """Merge the dataset and the results df"""
        results_df[field] = self.pd.Series(results_df.index.values, index=results_df.index)
        return self.pd.merge(self.dataset, results_df, on=field, how='left')

    def _groupby_work(self, merged_df):
        return merged_df.groupby('work_id')['msd_id'].agg({'clique_songs': self._tolist})

    def load_result_json_as_df(self, jsonfile):
        """Load a results json from the experiments as a pandas df"""
        return self.pd.read_json(jsonfile, orient='index')

    def dict_to_pickle(self, mydict, filename):
        """Save a dict to a pickle file"""
        import pickle
        doc = open(filename, 'wb')
        pickle.dump(mydict, doc)
        return

    def get_clique_id(self, track_id):
        """DEPRECATED"""
        # have to recheck if this is the same for all the samples
        return self.dataset[self.dataset.msd_id == track_id].clique_id.values.tolist()

    def get_ground_truth(self, query_id, reference_id):
        """DEPRECATED [To_remove]"""
        if str(self.get_clique_id(query_id)) == str(self.get_clique_id(reference_id)):
            return 1
        else:
            return 0

    def reset_preset(self):
        self.es.post_json = self.es.init_json
        return

    def get_artist_id(self, track_id):
        """
        Returns the artist_id for a specific msd_track_id from the dataset
        """
        return self.dataset.artist_id[self.dataset.msd_id == track_id].values[0]

    def rerank_by_field(self, field_id, response, proximity=1, field='msd_artist_id'):
        """
        Re-rank the search results by taking a field with thresholding
        """
        top_list = list()
        bottom_list = list()
        if response:
            top_score = response[0]['_score']
        else:
            return []
        for row in response:
            if row['_source'][field] == field_id and (top_score - row['_score']) <= proximity:
                top_list.append(row)
            else:
                bottom_list.append(row)
        if not top_list:
            return response
        else:
            return top_list + bottom_list

    def get_score_thres(self, res_ids, res_scores, proximity=1.):
        """
        :param res_ids: a list of ranked msd_track_ids (typically from the lyrics_search response)
        :param res_scores: a list of ranked scores corresponding to the res_ids
        :param proximity: (int, default: 1) a threshold value for determining the boundary of the
                          difference between the top score and the other scores
        :return: (top_ids, top_list, thres_idx)
                 top_ids : top msd_track_ids
                 top_list : top es search scores
                 thres_idx : threshold index

        eg. for res_scores = [10.0, 9.8, 9.2, 5.0] and proximity=1,
            top_list = [10.0, 9.8, 9.2] and thres_idx = 3
        """
        top_score = res_scores[0]
        top_list = [score for score in res_scores if (top_score - score) <= proximity]
        thres_idx = len(top_list)
        top_ids = res_ids[:thres_idx]
        return top_ids, top_list, thres_idx

    def rerank_title_results_by_lyrics(self, title_res, lyrics_res, mode='view', proximity=0.5):
        """
        :param title_res: pandas dataframe with the aggregated response of the song_title match results
        :param lyrics_res: pandas dataframe with the aggregated response of the lyrics_similarity search results
        :param mode: (available modes ['view', 'eval']) {default : 'view'}
                     'view' - return the reranked response as a pandas dataframe
                     'eval' - return the reranked response as a tuple of a list of msd_ids and relative scores
        :param proximity: score-proximity threshold passed to get_score_thres
        :return:
        """
        top_ids, top_scores, thres_idx = self.get_score_thres(
            lyrics_res.msd_id.values, lyrics_res.score.values, proximity=proximity)  # threshold is 0.5
        title_res_ids = title_res.msd_id.values.tolist()
        common_ids = self.np.intersect1d(title_res.msd_id.values, top_ids)

        if len(common_ids) > 0:
            top_list = common_ids
            bottom_list = [x for x in title_res_ids if x not in common_ids]

            # preserve the ranking of the lyrics search response
            top_list = top_ids[sorted([list(top_ids).index(x) for x in top_list])]

            new_ranked_list = list(top_list) + bottom_list
            idx = [title_res_ids.index(x) for x in new_ranked_list]
            merged_df = title_res.iloc[idx]  # select the new ranked dataframe from the indexes
            merged_df = merged_df.set_index(self.np.arange(len(merged_df)))  # update the dataframe with the new ranks
            if mode == 'view':
                return merged_df
            elif mode == 'eval':
                return merged_df.msd_id.values.tolist(), merged_df.score.values.tolist()
        else:
            if mode == 'view':
                return title_res
            elif mode == 'eval':
                return title_res.msd_id.values.tolist(), title_res.score.values.tolist()
        return

    """
    ------------------------------------------
    ------ AUTOMATED EXPERIMENTS -------------
    These are methods for running automated search experiments on the ES MSD db
    """

    @timeit
    def run_song_title_match_task(self, size=100, verbose=True):
        """
        Simple experiment with a simple text match on the song title
        """
        start_time = self.time.time()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        results = dict()

        LOGGER.info("\n=======Running song title-match task for %s query songs against top %s results of MSD... "
                    "with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for title in enumerate(self.query_titles):
            if verbose:
                print "------%s-------%s" % (title[0], title[1])

            res_ids, res_scores = self.es.search_by_exact_title(
                unicode(title[1]), track_id=self.query_ids[title[0]], out_mode='eval', size=size)
            # aggregate the response_ids and scores into a dict keyed by the query msd_id
            results[self.query_ids[title[0]]] = {'id': res_ids, 'score': res_scores}

        LOGGER.info("\n Task runtime : %s" % (self.time.time() - start_time))
        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def run_cleaned_song_title_task(self, size=100, verbose=True):
        """Run the MSD pre-processed title task"""
        start_time = self.time.time()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        results = dict()

        LOGGER.info("\n=======Running cleaned title-match task for %s query songs against top %s results of MSD... "
                    "with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for ids in enumerate(self.query_ids):
            if verbose:
                print "----%s----%s" % (ids[0], ids[1])
            res_ids, res_scores = self.es.search_with_cleaned_title(track_id=ids[1], out_mode='eval', size=size)
            results[ids[1]] = {'id': res_ids, 'score': res_scores}

        LOGGER.info("\n Task runtime : %s" % (self.time.time() - start_time))
        return self.pd.DataFrame.from_dict(results, orient='index')
    @timeit
    def run_field_rerank_task(self, field='msd_artist_id', size=100, proximity=1, verbose=True):
        """
        In this task, an msd song with the same artist id as the query song is ranked at the top of the list
        """
        results = dict()
        LOGGER.info("\n=======Running song title-matching task with reranking by '%s' for %s query "
                    "songs against top %s results of MSD... with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (field, len(self.query_ids), size, str(self.shs_mode),
                       str(self.filter_duplicates), str(self.dzr_map)))

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        for index, title in enumerate(self.query_titles):
            if verbose:
                print "------%s-------%s" % (index, title)
            response = self.es.search_es(self.es._format_query(title, self.query_ids[index], size=size))
            query_artist_id = self.get_artist_id(self.query_ids[index])
            re_ranked = self.rerank_by_field(query_artist_id, response, field=field, proximity=proximity)
            res_ids, res_scores = self.es._parse_response_for_eval(re_ranked)
            results[self.query_ids[index]] = {'id': res_ids, 'score': res_scores}  # save it to the dictionary
        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def run_mxm_lyrics_search_task(self, post_json=presets.more_like_this, size=100, verbose=True):
        """
        Lyrics search task using the MXM lyrics
        (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html)
        """
        results = dict()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        LOGGER.info("\n=======Running musixmatch-msd lyrics search task for %s query songs against "
                    "top %s results of MSD... with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for index, ids in enumerate(self.query_ids):
            if verbose:
                print "----%s----%s" % (index, ids)
            res_ids, res_scores = self.es.search_by_mxm_lyrics(post_json, msd_track_id=ids, out_mode='eval', size=size)
            results[ids] = {'id': res_ids, 'score': res_scores}

        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def run_rerank_title_with_dzr_lyrics_task(self, size=100, with_cleaned=False, verbose=True):
        """
        Here we make two requests, with the song_title metadata and with the dzr_lyrics, and merge the top
        results of the lyrics search into the song-title search response to rerank it
        """
        results = dict()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        post_json = self.es.post_json

        LOGGER.info("\n=======Running rerank experiment of title search response with dzr_lyrics response for %s "
                    "query songs against top %s results of MSD... with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for index, title in enumerate(self.query_titles):
            if verbose:
                print "---%s---%s" % (index, self.query_ids[index])

            self.es.post_json = post_json  # post-json template for the title search

            if with_cleaned:
                text_df = self.es.search_with_cleaned_title(self.query_ids[index], out_mode='view', size=size)
            else:
                text_df = self.es.search_by_exact_title(title, self.query_ids[index], out_mode='view', size=size)

            # note: relies on a dzr_lyrics search method which is not part of es_search.py in this release
            lyrics_df = self.es.search_by_dzr_lyrics(
                presets.more_like_this, self.query_ids[index], out_mode='view', size=size)

            if type(lyrics_df) != tuple:
                if lyrics_df.empty:
                    res_ids, res_scores = text_df.msd_id.values.tolist(), text_df.score.values.tolist()
                else:
                    res_ids, res_scores = self.rerank_title_results_by_lyrics(text_df, lyrics_df, mode='eval')
            else:
                res_ids, res_scores = text_df.msd_id.values.tolist(), text_df.score.values.tolist()
            results[self.query_ids[index]] = {'id': res_ids, 'score': res_scores}
        return self.pd.DataFrame.from_dict(results, orient='index')
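
    # Illustrative sketch of the rerank step used below (hypothetical ids/scores):
    #   title search  : ids [A, B, C]   scores [9.0, 8.5, 8.1]
    #   lyrics search : ids [C, D]      scores [15.2, 14.9]  (both within the proximity)
    #   reranked      : [C, A, B] -- the lyrics top hits that also appear in the title
    #   response are promoted; the rest keep their title-search order.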
    @timeit
    def run_rerank_title_with_mxm_lyrics_task(self, size=100, with_cleaned=False, verbose=True, threshold=0.5):
        """
        Experiment where we rerank the es response of the song_title search with the top results
        of the mxm_lyrics similarity search

        :param size: {default : 100}
        :param with_cleaned: {default : False} if set to True, switch from the simple
                             text_search method to the cleaned, pre-processed title method
        :param verbose: {default : True}
        :param threshold: score-proximity threshold for the rerank step
        :return: aggregated results as a pandas dataframe
        """
        results = dict()

        if self.shs_mode:
            self.es.limit_post_json_to_shs()

        if self.filter_duplicates:
            self.es.add_remove_duplicates_filter()

        if self.dzr_map:
            self.es.limit_to_dzr_mapped_msd()

        post_json = self.es.post_json

        LOGGER.info("\n=======Running rerank experiment of title search response with mxm_lyrics response for %s query "
                    "songs against top %s results of MSD... with shs_mode %s, duplicate %s, dzr_map %s ========\n"
                    % (len(self.query_ids), size, str(self.shs_mode), str(self.filter_duplicates), str(self.dzr_map)))

        for index, title in enumerate(self.query_titles):
            if verbose:
                print "---%s---%s" % (index, self.query_ids[index])

            self.es.post_json = post_json  # post-json template for the title search

            if with_cleaned:
                text_df = self.es.search_with_cleaned_title(self.query_ids[index], out_mode='view', size=size)
            else:
                text_df = self.es.search_by_exact_title(title, self.query_ids[index], out_mode='view', size=size)

            lyrics_df = self.es.search_by_mxm_lyrics(
                presets.more_like_this, msd_track_id=self.query_ids[index], out_mode='view', size=size)

            if type(lyrics_df) != tuple:
                if lyrics_df.empty:
                    res_ids, res_scores = text_df.msd_id.values.tolist(), text_df.score.values.tolist()
                else:
                    res_ids, res_scores = self.rerank_title_results_by_lyrics(
                        text_df, lyrics_df, mode='eval', proximity=threshold)
            else:
                res_ids, res_scores = text_df.msd_id.values.tolist(), text_df.score.values.tolist()
            results[self.query_ids[index]] = {'id': res_ids, 'score': res_scores}
        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def run_audio_rerank_task(self, text_results_json, audio_results_json, threshold=0.1):
        """
        [OFFLINE EXPERIMENT]

        Function to re-rank the text-based results with the audio-based results
            text_results_json : json file
            audio_results_json : json file
            threshold : {default: 0.1}
        """
        text_df = self.pd.read_json(text_results_json)
        audio_df = self.pd.read_json(audio_results_json)
        results = dict()
        cnt = 0
        error_idxs = []

        def get_low_score(scores, thres=threshold):

            def list_duplicates_of(seq, item):
                start_at = -1
                locs = []
                while True:
                    try:
                        loc = seq.index(item, start_at+1)
                    except ValueError:
                        break
                    else:
                        locs.append(loc)
                        start_at = loc
                return locs

            # keep the scores below the distance threshold
            top_list = []
            for score in scores:
                if score <= thres:
                    top_list.append(score)

            dup_idxs = []
            for s in top_list:
                dup_idxs.extend(list_duplicates_of(top_list, s))

            idxs = list(set(dup_idxs))
            print "Score index :", idxs
            return idxs

        LOGGER.info("Running audio reranking task on the metadata search experiment results "
                    "file with a threshold of %s" % threshold)

        for idx in range(len(audio_df)):
            print "Index :", idx
            text_res_ids = text_df.iloc[idx].id
            text_res_scores = text_df.iloc[idx].score
            audio_res_ids = audio_df.iloc[idx].id
            audio_res_scores = audio_df.iloc[idx].score

            if not audio_res_scores or not audio_res_ids or len(audio_res_ids) == 0:
                results[audio_df.index[idx]] = {'id': text_res_ids, 'score': text_res_scores}
                cnt += 1
                error_idxs.append(idx)
            else:
                a_df = self.pd.DataFrame({'id': audio_res_ids, 'score': audio_res_scores})

                thres_idxs = get_low_score(a_df.score.values.tolist(), threshold)

                if len(thres_idxs) != 0:
                    a_df = a_df.iloc[thres_idxs]
                    top_ids = a_df.id.values.tolist()
                    top_scores = a_df.score.tolist()
                    bottom_ids = [x for x in text_df.iloc[idx].id if x not in top_ids]
                    bottom_idx = [text_res_ids.index(x) for x in bottom_ids]
                    text_res_scores = self.np.array(text_res_scores)
                    bottom_scores = text_res_scores[bottom_idx]
                    new_ranked_ids = top_ids + bottom_ids
                    new_ranked_scores = top_scores + list(bottom_scores)
                    results[audio_df.index[idx]] = {'id': new_ranked_ids, 'score': new_ranked_scores}
                else:
                    results[audio_df.index[idx]] = {'id': text_res_ids, 'score': text_res_scores}

        LOGGER.debug("%s queries don't have a proper audio reranked response" % cnt)

        return self.pd.DataFrame.from_dict(results, orient='index')

    @timeit
    def maximum_achievable_metrics(self, results_df):
        """
        In this experiment we rerank the response ids with the ground_truth to compute
        the maximum achievable MAP obtainable by re-ranking the metadata-search results with
        other content such as lyrics, audio etc. This was only done on the train set of the dataset.
        """
        LOGGER.info("Computing the maximum achievable mean average precision from the results dataframe")
        results_df = self._merge_df(results_df)
        results = dict()
        for index, response in results_df.iterrows():
            if type(response['id']) == list:
                response_ids = response['id']
                clique_songs = results_df.msd_id[results_df.work_id == response['work_id']].values
                top_list = self.np.intersect1d(clique_songs, response_ids)
                if len(top_list) > 0:
                    bottom_list = [x for x in response_ids if x not in top_list]
                    if bottom_list:
                        results[response['msd_id']] = {'id': list(top_list) + bottom_list}
                    else:
                        results[response['msd_id']] = {'id': list(top_list)}
                else:
                    results[response['msd_id']] = {'id': response_ids}
        return self.pd.DataFrame.from_dict(results, orient='index')

    # ----------------------------------------EVALUATION METRICS----------------------------------------------------
    def average_precision_at_k(self, results_df, query_msd_id):
        """
        Compute the average precision for a particular query and response from the aggregated results dataframe.
        Here "k" is the query_msd_id in the results df.

        Inputs:
            results_df : aggregated results dataframe from one of the tasks
            query_msd_id : msd track id of the query

        eg. for a query whose clique has 3 songs (the query + 2 other covers) and a
        ranked response where the covers appear at ranks 1 and 4:
            ground_truth = [1, 0, 0, 1]
            precision@k  = [1/1, 1/2, 1/3, 2/4]
            AP = (1*1.0 + 0 + 0 + 1*0.5) / (3 - 1) = 0.75
        """
        results_df = self._merge_df(results_df)
        response_ids = results_df[results_df.msd_id == query_msd_id].id.values.tolist()[0]
        work_id = results_df.work_id[results_df.msd_id == query_msd_id].values[0]
        clique_songs = results_df.msd_id[results_df.work_id == work_id].values
        true_idx = [response_ids.index(x) for x in response_ids if x in clique_songs]
        ground_truth = self.np.zeros(len(response_ids))
        if len(true_idx) > 0:
            ground_truth[true_idx] = 1
        precision_at_k = self.np.cumsum(ground_truth) / self.np.arange(1., len(response_ids)+1)
        precision_list = ground_truth * precision_at_k
        avg_precision = sum(precision_list) / float(len(clique_songs) - 1)
        return avg_precision

    def average_precision(self, results_df, size=None):
        """
        Average precision of each query in the results dataframe

        Inputs :
            results_df : aggregated results dataframe from one of the tasks
            size : optional prune size of the response

        Returns a list of average precisions
        """
        results_df = self._merge_df(results_df)
        avg_precisions = list()
        cnt = 0
        for index, response in results_df.iterrows():
            if type(response['id']) == list:
                if size:
                    response_ids = response['id'][:size]
                else:
                    response_ids = response['id']
                clique_songs = results_df.msd_id[results_df.work_id == response['work_id']].values
                true_idx = [response_ids.index(x) for x in response_ids if x in clique_songs]
                ground_truth = self.np.zeros(len(response_ids))
                if len(true_idx) > 0:
                    ground_truth[true_idx] = 1
                precision_at_k = self.np.cumsum(ground_truth) / self.np.arange(1., len(response_ids)+1)
                precision_list = ground_truth * precision_at_k
                avg_precision = sum(precision_list) / float(len(clique_songs) - 1)
                avg_precisions.append(avg_precision)
            else:
                cnt += 1
                avg_precisions.append(0)
        LOGGER.debug("%s queries have no lyrics nor response out of %s queries" % (cnt, len(results_df)))
        return avg_precisions

    @timeit
    def mean_average_precision(self, results_df, size=None):
        """
        Mean of the average precisions for the task
        """
        return self.np.mean(self.average_precision(results_df, size=size))

    def average_rank(self, results_df):
        """
        Computes the average position of the relevant documents, ie. measures where
        the relevant docs fall in a ranked list
        """
        average_ranks = list()
        for query_id in results_df.keys():
            response_ids = results_df[query_id][0]
            if type(response_ids) == list:
                clique_id = self.dataset.work_id[self.dataset.msd_id == query_id].values[0]
                clique_songs = self.dataset.msd_id[self.dataset.work_id == clique_id].values
                true_idx = [response_ids.index(x) for x in response_ids if x in clique_songs]
                if len(true_idx) == 0:
                    average_ranks.append(1000000)
                else:
                    average_ranks.append(self.np.average(true_idx))
        return self.np.average(average_ranks)

    def mean_rank_first_cover(self, results_df):
        """
        Mean rank of the first correctly identified cover
        """
        mean_ranks = list()
        for query_id in results_df.keys():
            response_ids = results_df[query_id][0]
            if type(response_ids) == list:
                clique_id = self.dataset.work_id[self.dataset.msd_id == query_id].values[0]
                clique_songs = self.dataset.msd_id[self.dataset.work_id == clique_id].values
                true_idx = [response_ids.index(x) for x in response_ids if x in clique_songs]
                if len(true_idx) == 0:
                    pass
                else:
                    mean_ranks.append(true_idx[0]+1)

        return self.np.mean(mean_ranks)

    def covers_identified(self, results_df, size=None):
        """
        Total number of covers identified compared to the dataset
        """
        total_covers = list()
        percentage = list()
        # here we merge the results_df with the shs_dataset df loaded in the init
        results_df = self._merge_df(results_df)
        for index, response in results_df.iterrows():
            if type(response['id']) == list:
                if size:
                    response_ids = response['id'][:size]
                else:
                    response_ids = response['id']
                clique_songs = results_df.msd_id[results_df.work_id == response['work_id']].values
                # check the intersection of the two lists for the detected covers
                detected_covers = self.np.intersect1d(clique_songs, response_ids)
                total_covers.append(len(detected_covers))
                percentage.append((len(detected_covers) / float(len(clique_songs)))*100)
        return total_covers, percentage

    def total_covers_identified(self, results_df):
        """
        Total number of covers identified
        """
        total_covers, percentage = self.covers_identified(results_df)
        return sum(total_covers)

    def mean_percentage_of_covers(self, results_df, size=None):
        """
        Mean percentage of covers identified
        """
        total_covers, percentage = self.covers_identified(results_df, size=size)
        return self.np.mean(percentage)
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy==1.11.0
pandas==0.20.3
elasticsearch==2.3.0
requests
joblib==0.11
seaborn==0.7.1
python-Levenshtein==0.12.0
fuzzywuzzy==0.15.0
nltk==3.3
--------------------------------------------------------------------------------
/templates.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
A set of custom query DSL templates for elasticsearch search post-json requests for various tasks

Check the elasticsearch documentation for more details
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

Albin Andrew Correya
R&D Intern
@2017
"""

import os

assert os.environ["MSDES_HOST"]
assert os.environ["MSDES_PORT"]
assert os.environ["MSDES_INDEX"]
assert os.environ["MSDES_TYPE"]

SCHEME = "http"
URI = os.environ["MSDES_HOST"]
PORT = os.environ["MSDES_PORT"]
ES_INDEX = os.environ["MSDES_INDEX"]
ES_TYPE = os.environ["MSDES_TYPE"]


uri_config = {
    'host': URI,
    'port': PORT,
    'scheme': SCHEME,
    'index': ES_INDEX,
    'type': ES_TYPE
}
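
# The environment variables asserted above must be set before importing this
# module, e.g. (placeholder values; use your own ES credentials):
#   export MSDES_HOST=localhost
#   export MSDES_PORT=9200
#   export MSDES_INDEX=msd_augmented
#   export MSDES_TYPE=track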
# for string search with the song title
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
query_string = {
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "default_field": "msd_title",
                        "query": "sample_query_here"
                    }
                }
            ],
            "must_not": [
                {
                    "query_string": {
                        "default_field": "_id",
                        "query": "msd_track_id_here"
                    }
                }
            ]
        }
    },
    "from": 0,
    "size": 100
}

# for string search with the title
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html
simple_query_string = {
    "query": {
        "bool": {
            "must": [
                {
                    "simple_query_string": {
                        "fields": ["msd_title"],
                        "query": "sample_query_here"
                    }
                }
            ],
            "must_not": [
                {
                    "query_string": {
                        "default_field": "_id",
                        "query": "msd_track_id_here"
                    }
                }
            ]
        }
    },
    "from": 0,
    "size": 100
}

# for lyrics search
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
more_like_this = {
    "query": {
        "bool": {
            "must": [
                {
                    "more_like_this": {
                        "fields": ["dzr_lyrics.content"],
                        "like": "sample_query_here",
                        "min_term_freq": 1,
                        "max_query_terms": 12
                    }
                }
            ],
            "must_not": [
                {
                    "query_string": {
                        "default_field": "_id",
                        "query": "msd_track_id_here"
                    }
                }
            ]
        }
    },
    "from": 0,
    "size": 100
}

# config preset for the logs
log_config = {
    'method':
        {
            'query_method': 'title_query_with_artist_rerank',
            'mode': 'msd_field',
            'size': 100,
            'dataset': 'shs_train'
        },
    'metrics':
        {
            'MAP': 0.,
            'MRFC': 0.,
            'MPER': 40.
        },
    'run_time': 60
}

# Experiment profiles
# SHS against MSD experiment
shs_msd = {
    'dzr_map': False,
    'filter_duplicates': False,
    'shs_mode': False
}

# SHS against MSD experiment excluding all the official duplicates
shs_msd_no_dup = {
    'dzr_map': False,
    'filter_duplicates': True,
    'shs_mode': False
}

# SHS-DZR against MSD-DZR experiment excluding all the official duplicates
shs_dzr_msd = {
    'dzr_map': True,
    'filter_duplicates': True,
    'shs_mode': False
}

# SHS train set against SHS train set experiment
shs_shs = {
    'dzr_map': False,
    'filter_duplicates': False,
    'shs_mode': True
}

# SHS train set against SHS train set experiment without the official duplicates
shs_shs_no_dup = {
    'dzr_map': False,
    'filter_duplicates': True,
    'shs_mode': True
}

output_evaluations = {
    'size': 100,
    'map': 0,
    'method': 'title',
    'experiment': 'shs_msd',
}
--------------------------------------------------------------------------------
/utilities/__init__.py:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/utilities/audio_utils.py:
--------------------------------------------------------------------------------
"""
Utility functions for processing the results files of the audio reranking experiments
"""
from utils import timeit, log
import pandas as pd
import numpy as np
import os


logger = log('./logs/audio_logs.log')


def savelist_to_file(path_list, filename):
    """Write a list of strings to a text file"""
    doc = open(filename, 'w')
    for item in path_list:
        doc.write("%s\n" % item)
    doc.close()
    return


def parse_mirex_output_txt(textfile):
    """
    Parse the distance matrix from the text output of Joan Serra's cover song detection algorithm

    Input : path/to/the/textfile

    Output : pandas dataframe with the query/candidates distance scores
    """
    text = open(textfile)
    data = text.readlines()
    array = list()
    m = None
    for lines in data:
        if lines.startswith("Dist"):
            m = True
        if m is True:
            if not lines.startswith(" Could not open"):
                array.append(lines)
    text.close()
    doc = open("../distanceMatrix.txt", "w")
    for lines in array:
        doc.write("%s\n" % lines)
    doc.close()
    df = pd.read_csv("../distanceMatrix.txt", index_col=0, skiprows=1, sep='\t')
    os.system('rm ../distanceMatrix.txt')
    return df.transpose()
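
# For example (hypothetical filename, following the output_*.txt naming
# convention of the mirex binary outputs processed below):
#   distance_df = parse_mirex_output_txt('./output_qmax_0.txt')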
- only needs to be run once per results file
54 |     """
55 |     collections = list()
56 |     queries = list()
57 |     results = pd.read_json(results_json)
58 |     for i in range(len(results)):
59 |         collections.extend(results.iloc[i].id)
60 |         queries.append(results.iloc[i].msd_id)
61 |     collections.extend(queries)
62 |     df = pd.DataFrame({'msd_track_id': collections})
63 |     df = df.drop_duplicates('msd_track_id')
64 |     df_dzr = pd.merge(df, dzr_msd_map_df, on='msd_track_id', how='left')
65 |     df_dzr['dzr_path'] = df_dzr.song_id.apply(get_dzr_sng_path_from_sng_id)
66 |     return df_dzr
67 | 
68 | 
69 | def results_json_to_query_collection_pairs(results_json, enriched_csv, col_path, query_path):
70 |     """
71 |     Create a set of query/collection text file pairs from an aggregated ES search
72 |     results file, for running the MIREX (Serra 2009) binary scripts
73 |     Inputs :
74 |         results_json : path to the aggregated ES search results JSON
75 |         enriched_csv : csv with the msd_track_id-to-dzr_path mapping (see results_json_to_enriched_results_df)
76 |         col_path : output directory for the collection text files
77 |         query_path : output directory for the query text files
78 |     """
79 |     res = pd.read_json(results_json)
80 |     map_data = pd.read_csv(enriched_csv)
81 |     logger.info("Constructing query-collection text files from the results-%s to %s and %s"
82 |                 % (results_json, col_path, query_path))
83 |     for i in range(len(res)):
84 |         mid = res.iloc[i].msd_id
85 |         rids = res.iloc[i].id
86 |         if not rids:
87 |             logger.debug("No response found for index %s" % i)
88 |             continue  # skip queries with an empty response instead of failing below
89 |         qpaths = map_data.dzr_path[map_data.msd_track_id == mid].values[0]
90 |         rpaths = [map_data.dzr_path[map_data.msd_track_id == rid].values[0] for rid in rids]
91 |         qpaths = qpaths.replace("data", "mnt")
92 |         rpaths = [string.replace("data", "mnt") for string in rpaths]
93 |         savelist_to_file([qpaths, '\n'], query_path+'query_'+str(i)+'_.txt')
94 |         savelist_to_file(rpaths, col_path+'collections_'+str(i)+'_.txt')
95 |     return
96 | 
97 | 
98 | def get_id_score_pairs_from_distance_df(distance_df, results_df, index):
 99 |     """
100 |     Returns the reranked response of msd_track_ids and audio similarity
101 |     scores from a distance matrix of the MIREX 2009 binary output
102 |     Inputs:
103 |         distance_df : distance matrix dataframe from parse_mirex_output_txt
104 |         results_df : aggregated ES search results dataframe
105 |         index : row index of the query in results_df
106 | 
107 |     Outputs:
108 |         res_ids : msd_track_ids reranked by audio similarity
109 |         res_scores : the corresponding qmax distance scores
110 |     """
111 |     res_msd_ids = results_df.iloc[index].id
112 |     sorted_df = distance_df.sort_values(1)
113 |     new_ranked_idx = sorted_df.index.values
114 | 
115 |     if len(sorted_df) != len(res_msd_ids):
116 |         logger.debug("Mismatch of response msd id length in index %s" % index)
117 | 
118 |     # new reranked response ids and scores from the audio similarity measures
119 |     # note the index from the output_txt file starts with 1
120 |     res_ids = [res_msd_ids[int(i)-1] for i in new_ranked_idx]
121 |     res_scores = sorted_df[1].values.tolist()
122 |     return res_ids, res_scores
123 | 
124 | 
125 | def serra_output_txt_to_results_df(output_directory, results_json):
126 |     """
127 |     Read a collection of output_*.txt files from the MIREX 2009 binary
128 |     output and aggregate them into a pandas dataframe,
129 |     as required by the metric computation scripts
130 | 
131 |     Inputs :
132 |         output_directory : path to the folder with the output_*.txt files from the MIREX binary scripts
133 |         results_json : path to the aggregated ES search results JSON
134 |     """
135 |     results_df = pd.read_json(results_json)
136 |     output_files = [t for t in os.listdir(output_directory)
137 |                     if not t.startswith('.') and t.endswith('.txt')]
138 |     # (a list comprehension is used above: removing items from a list
139 |     # while iterating over it silently skips elements)
140 | 
141 |     output_files = sorted(output_files, key=lambda x: int(x.split('_')[2].split('.')[0]))  # sort the filename list
142 |     results = dict()
143 |     cnt = 0
144 |     error_files = list()
145 |     for idx, txt_file in enumerate(output_files):
146 |         print "--%s--%s" %
(idx, txt_file) 147 | distance_df = parse_mirex_output_txt(output_directory+txt_file) 148 | if distance_df.shape[1] == 1: 149 | query_msd = results_df.index[idx] 150 | res_ids, res_scores = get_id_score_pairs_from_distance_df(distance_df, results_df, index=idx) 151 | results[query_msd] = {'id': res_ids, 'score': res_scores} 152 | else: 153 | cnt += 1 154 | query_msd = results_df.index[idx] 155 | results[query_msd] = {'id': results_df.iloc[idx].id, 'score': None} 156 | error_files.append(txt_file) 157 | logger.debug("\n%s files had errors with the output distance matrix.." % cnt) 158 | #print error_files 159 | return pd.DataFrame.from_dict(results, orient='index') 160 | -------------------------------------------------------------------------------- /utilities/clique_similarity.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Set of functions to compute the similarity of song titles with in and outside it's clique. 4 | ~ 5 | Albin Andrew Correya 6 | R&D Intern 7 | @Deezer, 2018 8 | """ 9 | from itertools import combinations 10 | from Levenshtein import ratio 11 | import numpy as np 12 | import pandas as pd 13 | import random 14 | 15 | 16 | def get_clique_similarity_same_set(dataset_csv): 17 | """Compute Levenshtein similarity of song titles in same cliques in SHS""" 18 | dataset = pd.read_csv(dataset_csv) 19 | clique_ids = dataset.work_id.unique().tolist() 20 | clique_sims = list() 21 | for work_id in clique_ids: 22 | song_titles = dataset.title[dataset.work_id == work_id].values.tolist() 23 | distances = list() 24 | for (title1, title2) in combinations(song_titles, 2): 25 | measure = ratio(title1, title2) 26 | distances.append(measure) 27 | clique_sims.append(np.mean(distances)) 28 | return clique_sims 29 | 30 | 31 | def get_clique_similarity_dif_set(dataset_csv): 32 | """Compute Levenshtein similarity of song titles in different cliques in SHS""" 33 | distances = list() 34 | dataset = pd.read_csv(dataset_csv) 35 | clique_ids = dataset.work_id.unique().tolist() 36 | all_titles = dataset.title.values.tolist() 37 | 38 | for i in range(len(clique_ids)): 39 | ref_title = random.choice(all_titles) 40 | clique_id = dataset.work_id[dataset.title == ref_title].values[0] 41 | ref_titles = dataset.title[dataset.work_id != clique_id].values.tolist() 42 | com_title = random.choice(ref_titles) 43 | distance = ratio(ref_title, com_title) 44 | distances.append(distance) 45 | 46 | return distances 47 | 48 | 49 | def plot_clique_similarity_dist(dataset_csv): 50 | """Plot the distribution plot of string similarities within and outside its clique""" 51 | import matplotlib.pyplot as plt 52 | import seaborn as sns 53 | palette = ["#000000", "#737170"] 54 | sns.set_palette(palette) 55 | sim_same_clique = get_clique_similarity_same_set(dataset_csv) 56 | sim_dif_clique = get_clique_similarity_dif_set(dataset_csv) 57 | sns.distplot(sim_same_clique, hist=True, 58 | kde_kws={"lw": 1, "label": "within same clique"}) 59 | sns.distplot(sim_dif_clique, hist=True, 60 | kde_kws={"lw": 1, "label": "within different clique"}) 61 | plt.xlabel("Similarity measure") 62 | plt.ylabel("Density") 63 | plt.show() 64 | return 65 | -------------------------------------------------------------------------------- /utilities/plots.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Functions for various plots on the results json file 4 | """ 5 | import pandas as pd 6 | import seaborn as sns 
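# A minimal usage sketch for this module (hypothetical path and metric values;
# assumes a results JSON shaped like
#   {"title_match": [{"size": 100, "map": 0.39, "mper": 55.0}, ...]}):
#   info = parse_results('./logs/results.json', method='title_match')
#   plot_results_boxplot('./logs/results.json', metric='map')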
7 | import matplotlib.pyplot as plt 8 | import json 9 | 10 | 11 | def parse_results(jsonfile, method): 12 | """ 13 | jsonfile : path to jsonfile with experiment results 14 | method : Any of the methods inside the list 15 | ['title_match', rerank_artist_id', 'lyrics_more_like'] 16 | 17 | """ 18 | with open(jsonfile) as f: 19 | data = json.load(f) 20 | sizes = list() 21 | mp = list() 22 | mper = list() 23 | for key, values in data.iteritems(): 24 | if key == method: 25 | for d in values: 26 | sizes.append(d['size']) 27 | mp.append(d['map']) 28 | mper.append(d['mper']) 29 | 30 | return {"map": mp, "mper": mper, "size": sizes} 31 | 32 | 33 | def plot_optimal_topN_pruning(results_json): 34 | with open(results_json) as f: 35 | data = json.load(f) 36 | results = [(int(key), value['map']) for key, value in data.iteritems()] 37 | results = sorted(results, key=lambda r: int(r[0])) 38 | sizes = [x[0] for x in results] 39 | metrics = [x[1] for x in results] 40 | 41 | # plt.title("Mean average precision of msd song-title experiment 42 | # on the SHS train against the MSD for various prune sizes") 43 | plt.plot(sizes, metrics) 44 | plt.xlabel("Prune size (k)") 45 | plt.ylabel("Mean Average Precision") 46 | plt.show() 47 | return 48 | 49 | 50 | # functions for plotting some stats 51 | def plot_lang_stats(msd_dzr_lang_csv, crop=True, barwidth=0.99, norm=False): 52 | """plots the histogram of language distribution""" 53 | lan_csv = pd.read_csv(msd_dzr_lang_csv) 54 | langs = lan_csv.lan.unique() 55 | freqs = list() 56 | for lan in langs: 57 | freqs.append(len(lan_csv[lan_csv.lan == lan])) 58 | sorted_tup = [(item[0], item[1]) for item in zip(langs, freqs)] 59 | sorted_tup.sort(key=lambda x: x[1], reverse=True) 60 | print "\n---Language stats for MSD---\n" 61 | for item in sorted_tup: 62 | print "%s : %s" % (item[0], (item[1]/1000000. 
* 100))
 63 | 
 64 |     langs = [item[0] for item in sorted_tup]
 65 |     freqs = [item[1] for item in sorted_tup]
 66 |     if norm:
 67 |         freqs = [(item/1000000.)*100 for item in freqs]
 68 |     if crop:
 69 |         langs = langs[:10]
 70 |         freqs = freqs[:10]
 71 | 
 72 |     # bar_locs = np.arange(1+barwidth, len(langs))
 73 |     # Plotting histogram
 74 |     plt.title("Histogram of top 10 languages in the MillionSongDataset")
 75 |     ax = plt.subplot(111)
 76 |     bins = range(1, len(freqs)+1)  # (the identity map() here was redundant)
 77 |     ax.bar(bins, freqs, width=barwidth)
 78 |     ax.set_xticks(range(1, len(langs)+1))
 79 |     ax.set_xticklabels(langs, rotation=0)
 80 |     ax.set_xlabel("Language")
 81 |     ax.set_ylabel("Percentage (%)")
 82 |     plt.show()
 83 |     return
 84 | 
 85 | 
 86 | def plot_results_boxplot(jsonfile, metric='map'):
 87 |     """
 88 |     map = mean average precision, mper = mean percentage of covers
 89 |     """
 90 |     title_info = parse_results(jsonfile, method='title_match')
 91 |     rerank_info = parse_results(jsonfile, method='rerank_artist_id')
 92 |     lyrics_info = parse_results(jsonfile, method='lyrics_more_like')
 93 | 
 94 |     data = pd.DataFrame({"song-title": title_info[metric],
 95 |                          "artist_id_rerank": rerank_info[metric],
 96 |                          "lyrics": lyrics_info[metric]})
 97 | 
 98 |     ax = sns.boxplot(data=data,
 99 |                      palette="Set2",
100 |                      orient="v",
101 |                      order=["song-title", "artist_id_rerank", "lyrics"])
102 | 
103 |     ax.set_xlabel("search-method")
104 |     if metric == 'map':
105 |         ax.set_ylabel("mean average precision (MAP)")
106 |     if metric == 'mper':
107 |         ax.set_ylabel("mean percentage of covers (MPER)")
108 |     plt.show()
109 |     return
110 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/README.md:
--------------------------------------------------------------------------------
 1 | # Reproducing MIREX 2009 Cover Song Detection Algorithm results
 2 | 
 3 | # Requirements
 4 | 
 5 | * Download the binary codes of the Serra et al. 2009 MIREX submission from
 6 | [here](http://www.iiia.csic.es/~jserra/downloads/2009_SerraZA_MIREX-Covers.tar.gz) and copy them into this directory.
 7 | 
 8 | 
 9 | # Document structure
10 | 
11 | (Note : the wildcard (\*) denotes the index of a query in the aggregated results dataframe,
12 | i.e. [0 to 4252] in the case of the shs_test_dzr dataset)
13 | 
14 | ## Query lists
15 | 
16 | ```
17 | ./path_to_query_folder/query_*_.txt
18 | ```
19 | 
20 | ## Collection lists
21 | ```
22 | ./path_to_collections_folder/collections_*_.txt
23 | ```
24 | 
25 | # Usage
26 | 
27 | ```bash
28 | $ python run_mirex_binary.py -a ./audio_collections/ -c ./collection_txts/ -q ./query_txts/ -p ./output_features/ -o ./qmax_output/
29 | ```
30 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/compute_hpcpFeatures.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 3 ]; then
 4 |     echo "USAGE: ./compute_hpcpFeatures.sh <collection_list_txt> <feature_output_dir> <log_filename>"
 5 |     exit
 6 | fi
 7 | 
 8 | echo "Extracting descriptors..."
 9 | ./myessentiaextractor -sl $1 -op $2 -dn hpcp -ah 20 -al 20 -at divmax > $2$3
10 | 
11 | echo "Done."
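# Example invocation (hypothetical paths, mirroring how run_mirex_binary.py calls
# this script; the myessentiaextractor binary is assumed to sit in this directory):
#   ./compute_hpcpFeatures.sh ./collection_txts/collections_0_.txt ./output_features/ hpcp_logs_split_0.txt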
12 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/compute_qmaxDistance.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 5 ]; then
 4 |     echo "USAGE: ./compute_qmaxDistance.sh <collection_list_txt> <query_list_txt> <feature_dir> <qmax_raw_output> <refined_output>"
 5 |     exit
 6 | fi
 7 | 
 8 | echo "Computing Qmax..."
 9 | ./coverid -d qmax -q $2 -c $1 -p $3 -oti 2 -m 9 -tau 1 -k 0.095 -go 0.5 -ge 0.5 > $4
10 | 
11 | echo "Refining distances..."
12 | ./setdetect -rf $4 -nn 1 -dt 1000.0 > $5
13 | 
14 | echo "Done...."
15 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/run_mirex_binary.py:
--------------------------------------------------------------------------------
 1 | from joblib import Parallel, delayed
 2 | import time
 3 | import os
 4 | import subprocess
 5 | import argparse
 6 | 
 7 | 
 8 | def compute_features(args):
 9 |     '''Run the HPCP feature extraction script on one (collections_txt, feature_path, logfile_path) tuple'''
10 |     print args
11 |     start_time = time.time()
12 |     collections_txt = args[0]
13 |     feature_path = args[1]
14 |     logfile_path = args[2]
15 |     ret = subprocess.call("./compute_hpcpFeatures.sh %s %s %s" % (collections_txt, feature_path, logfile_path), shell=True)
16 |     print "Feature extraction finished in --%s-- s" % (time.time() - start_time)
17 |     return ret
18 | 
19 | 
20 | def compute_qmax_distance(args):
21 |     '''Run the qmax distance script on one (collections_txt, query_txt, directory, log_filename, output_filename) tuple'''
22 |     print args
23 |     collections_txt = args[0]
24 |     query_txt = args[1]
25 |     directory = args[2]
26 |     log_filename = args[3]
27 |     output_filename = args[4]
28 |     return subprocess.call("./compute_qmaxDistance.sh %s %s %s %s %s" % (collections_txt, query_txt, directory, log_filename, output_filename), shell=True)
29 | 
30 | 
31 | def serra_cover_algo(collections_txt, query_txt, directory, output_filename):
32 |     '''Run the full pipeline (features + qmax) for a single query/collection pair.
33 |     The two log filenames below are placeholders.'''
34 |     feature_process = compute_features((collections_txt, directory, 'hpcp_log.txt'))
35 |     if feature_process != 0:
36 |         raise Exception("Feature extraction process failed ...")
37 |     qmax_process = compute_qmax_distance((collections_txt, query_txt, directory, 'qmax_log.txt', output_filename))
38 |     if qmax_process != 0:
39 |         raise Exception("Qmax distance computation failed ...")
40 |     return
41 | 
42 | 
43 | def run_feature_extraction(collection_directory, feature_directory):
44 |     '''Run the feature extraction with parallelisation'''
45 |     # filter hidden files with a comprehension (removing items while iterating skips elements)
46 |     collection_files = [s for s in os.listdir(collection_directory) if not s.startswith(".")]
47 |     collection_files = [collection_directory+s for s in collection_files]
48 |     collection_files = sorted(collection_files, key=lambda x: int(x.split('_')[2].split('/')[1]))
49 |     print "%s collections txt files found..." % len(collection_files)
50 |     feature_path = [feature_directory for i in range(len(collection_files))]
51 |     log_file_paths = ['hpcp_logs_split_'+str(i)+'.txt' for i in range(len(collection_files))]
52 | 
53 |     args = zip(collection_files, feature_path, log_file_paths)
54 |     Parallel(n_jobs=-1, verbose=1)(map(delayed(compute_features), args))
55 |     return
56 | 
57 | 
58 | def run_qmax_computation(col_path, query_path, feature_path, out_path):
59 |     '''Run the qmax distance computation with parallelisation'''
60 |     # filter hidden files with comprehensions (do not mutate a list while iterating over it)
61 |     collection_files = [s for s in os.listdir(col_path) if not s.startswith(".")]
62 |     query_files = [x for x in os.listdir(query_path) if not x.startswith(".")]
63 |     collection_files = sorted(collection_files, key=lambda m: int(m.split('_')[1]))
64 |     query_files = sorted(query_files, key=lambda m: int(m.split('_')[1]))
65 | 
66 |     collection_files = [col_path+c for c in collection_files]
67 |     query_files = [query_path+q for q in query_files]
68 | 
69 |     feature_directory = [feature_path for i in range(len(query_files))]
70 |     log_filenames = [feature_path+'qmax_log_'+str(i)+'.txt' for i in range(len(query_files))]
71 |     out_filenames = [out_path+'output_qmax_'+str(i)+'.txt' for i in range(len(query_files))]
72 | 
73 |     args = zip(collection_files, query_files, feature_directory, log_filenames, out_filenames)
74 |     Parallel(n_jobs=-1, verbose=1)(map(delayed(compute_qmax_distance), args))
75 |     return
76 | 
77 | 
78 | if __name__ == '__main__':
79 | 
80 |     parser = argparse.ArgumentParser(description="Run the MIREX cover similarity algorithm (Serra et al. 2009) binary files with parallelisation",
81 |                                      formatter_class=argparse.ArgumentDefaultsHelpFormatter)
82 | 
83 |     parser.add_argument("-a", action="store", default='./audio_collections/',
84 |                         help="path to collection files for audio feature extraction")
85 |     parser.add_argument("-c", action="store", default='./collection_txts/',
86 |                         help="path to collection files for qmax")
87 |     parser.add_argument("-q", action="store", default='./query_txts/',
88 |                         help="path to query files for qmax")
89 |     parser.add_argument("-p", action="store", default="./output_features/",
90 |                         help="path to the directory where the audio features should be stored")
91 |     parser.add_argument("-o", action="store", default='./qmax_output/',
92 |                         help="path to the directory for the qmax output files")
93 |     parser.add_argument("-m", action="store", default=0,
94 |                         help="mode of the process")
95 | 
96 |     cmd_args = parser.parse_args()
97 | 
98 |     run_feature_extraction(cmd_args.a, cmd_args.p)
99 | 
100 |     print 'Feature extraction finished'
101 | 
102 |     run_qmax_computation(cmd_args.c, cmd_args.q, cmd_args.p, cmd_args.o)
103 | 
104 |     print "\n.....DONE...."
105 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/run_submission.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 4 ]; then
 4 |     echo "USAGE: ./run_submission.sh <collection_list_txt> <query_list_txt> <working_dir> <result_file>"
 5 |     exit
 6 | fi
 7 | 
 8 | #echo "Creating temporary directory"
 9 | #mkdir $3
10 | 
11 | echo "Extracting descriptors..."
12 | ./myessentiaextractor -sl $1 -op $3 -dn hpcp -ah 20 -al 20 -at divmax > $3/log_feature_extraction.txt
13 | 
14 | echo "Computing distances..."
15 | ./coverid -d qmax -q $2 -c $1 -p $3 -oti 2 -m 9 -tau 1 -k 0.095 -go 0.5 -ge 0.5 > $3/log_temporaryresults.txt
16 | 
17 | echo "Refining distances..."
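# setdetect post-processes the raw qmax distances produced by coverid above;
# the -nn/-dt parameters below match the standalone compute_qmaxDistance.sh script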
18 | ./setdetect -rf $3/log_temporaryresults.txt -nn 1 -dt 1000.0 > $4
19 | 
20 | echo "Removing logs and temporary files..."
21 | rm -r $3
22 | 
23 | echo "Done."
24 | 
--------------------------------------------------------------------------------
/utilities/serra_et_al_2009/utils.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import glob
 3 | 
 4 | 
 5 | def rename_txts(path):
 6 |     """One-off helper to rename the output .txt files based on a '_'-separated token of their path"""
 7 |     files = glob.glob(os.path.join(path, '*'))
 8 |     for f in files:
 9 |         if f.endswith('.txt'):
10 |             new_f = f.split("_")[5]+'_'+f
11 |             os.system("mv %s %s" % (f, new_f))
12 |     return
13 | 
14 | 
15 | def format_col_text_for_docker(path):
16 |     """Rewrite the audio paths in the collection .txt files from data/... to mnt/... for the docker setup"""
17 |     files = os.listdir(path)
18 |     for txt_file in files:
19 |         if txt_file.endswith(".txt"):
20 |             fname = txt_file
21 |             f = open(path+txt_file)
22 |             data = f.readlines()
23 |             new_txt = [line.replace("data", "mnt") for line in data]
24 |             f.close()
25 |             savelist_to_file(new_txt, path+fname)
26 |     return
27 | 
28 | 
29 | def format_newline_col(path):
30 |     """Strip empty lines from the collection .txt files"""
31 |     files = glob.glob(os.path.join(path, '*'))
32 |     for txt_file in files:
33 |         if txt_file.endswith('.txt'):
34 |             fname = txt_file
35 |             f = open(txt_file)
36 |             data = f.readlines()
37 |             new_txt = [x for x in data if x != '\n']
38 |             f.close()
39 |             savelist_to_file(new_txt, fname)
40 |     return
41 | 
42 | 
43 | def savelist_to_file(pathList, filename):
44 |     """Write a list of strings to a text file verbatim (no newline is added)"""
45 |     doc = open(filename, 'w')
46 |     for item in pathList:
47 |         doc.write("%s" % item)
48 |     doc.close()
49 |     return
50 | 
--------------------------------------------------------------------------------
/utilities/text_utils.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | """
 3 | Set of functions for text processing
 4 | 
 5 | * Format msd_track titles for the song-title based queries of cover song detection
 6 | ---------------------
 7 | Albin Andrew Correya
 8 | R&D Intern
 9 | @Deezer, 2017
10 | """
11 | 
12 | from nltk.stem import SnowballStemmer
13 | from fuzzywuzzy import fuzz, process
14 | from utils import init_connection, timeit
15 | import re
16 | import csv
17 | # [To be removed in future:
just a small hack to get things done for the moment]
18 | import pandas as pd  # needed by add_formatted_title_to_dataset below
19 | import sys
20 | reload(sys)
21 | sys.setdefaultencoding("utf-8")
22 | stemmer = SnowballStemmer("english")
23 | codebook = ['version', 'demo', 'live', 'remix', 'mix', 'remaster', 'albumn', 'instrumental', 'cover', 'digital',
24 |             'acoustic', 'lp version', 'remastered', 'digital remaster', 'remastered lp version', 'mono', 'stereo',
25 |             'extended', 'vocal mix', 'album version', 'album', 'vocal', 'extended version', 'reprise', 'single',
26 |             'radio edit', 'short version', 'explicit', 'bonus track', 'edit', 'session', 'e.p', 'ep version',
27 |             'original']
28 | 
29 | 
30 | def stemit(string):
31 |     """Apply stemming to a string using nltk.stem.SnowballStemmer()"""
32 |     return stemmer.stem(unicode(string))
33 | 
34 | 
35 | def title_formatter(string, mode='regex', striplist=codebook, threshold=70):
36 |     """
37 |     Remove a parenthesised qualifier from a title when it is similar to one of the predefined items in the strip list
38 |     Note : callback function to be used inside pandas.DataFrame.apply(); apply it twice to catch nested qualifiers
39 | 
40 |     Inputs :
41 |         mode : choose one mode from ['regex', 'fuzzy']
42 |             'regex' : uses regex matching
43 |             'fuzzy' : uses fuzzy Levenshtein distance
44 | 
45 |         striplist : a list of strings against which the match is computed
46 |             eg : ['version', 'live', 'remix', 'mix', 'remaster', 'albumn',
47 |                   'instrumental', 'cover', 'digital', 'acoustic', 'lp version', 'remastered',
48 |                   'digital remaster', 'remastered lp version', 'mono', 'stereo']
49 | 
50 |     eg : >>> string = "Let it be (Live Version)"
51 |          >>> title_formatter(string)
52 |          out: "Let it be"
53 |     """
54 |     # to avoid 'NaN' values appearing in the pandas dataframe when applied as a callback function
55 |     if type(string) != float:
56 |         to_remove = "(" + string[string.find("(")+1:string.find(")")] + ")"
57 |         stemmed_str = stemit(to_remove)
58 |         for word in striplist:
59 |             if mode == 'fuzzy':
60 |                 if fuzz.ratio(stemmed_str, word) >= threshold:
61 |                     return string.replace(to_remove, "")
62 |             if mode == 'regex':
63 |                 # strip the parsed substring if there is any match with the words in the striplist
64 |                 if re.findall(r"\b" + word + r"\b", stemit(string)):
65 |                     return string.replace(to_remove, "")
66 |     return string
67 | 
68 | 
69 | @timeit
70 | def add_formatted_title_to_dataset(dataset_csv, mode='regex'):
71 |     """
72 |     dataset_csv : a SHS csv file
73 |     mode : choose one mode from ['regex', 'fuzzy']
74 |     """
75 |     dataset = pd.read_csv(dataset_csv)
76 |     new_data = pd.DataFrame()
77 |     new_data['new_title'] = dataset.title.apply(title_formatter, mode=mode)
78 |     new_data.new_title = new_data.new_title.apply(title_formatter, mode=mode)
79 |     new_data = new_data.merge(dataset, left_index=True, right_index=True)
80 |     return new_data
81 | 
82 | 
83 | @timeit
84 | def get_formatted_msd_track_title_csv(db_file, filename='./msd_formatted_titles.csv'):
85 |     """
86 |     Remove and reformat all the MSD song titles and store them as a csv file.
 87 |     The output csv file is structured as follows :
 88 |         msd_track_id, msd_song_title, msd_new_song_title
 89 | 
 90 |     Inputs :
 91 |         db_file - the track_metadata.db file provided by LabROSA
 92 |         filename - filename for the output csv file ('./msd_formatted_titles.csv' by default)
 93 | 
 94 | 
 95 |     [NOTE] : tested runtime ~ 33.76 minutes
 96 |     """
 97 | 
 98 |     def double_format(string):
 99 |         s = title_formatter(string)
100 |         return title_formatter(s)
101 | 
102 |     con = init_connection(db_file)
103 |     query = con.execute("""SELECT track_id, title FROM songs""")
104 |     results = query.fetchall()
105 |     con.close()
106 |     with open(filename, 'w') as csvfile:
107 |         writer = csv.DictWriter(csvfile, fieldnames=['msd_id', 'msd_title', 'title'])
108 |         writer.writeheader()
109 |         cnt = 0
110 |         for track_id, track_name in results:
111 |             print "--%s--" % cnt
112 |             if track_name:
113 |                 title = double_format(track_name).encode('utf8')
114 |                 writer.writerow({'msd_id': track_id,
115 |                                  'msd_title': track_name,
116 |                                  'title': title})
117 |             cnt += 1
118 |     print "~Done..."
119 |     return
120 | 
121 | 
122 | def extract_removefactor(string):
123 |     if type(string) != float:
124 |         return "(" + string[string.find("(")+1:string.find(")")] + ")"
125 |     return string
126 | 
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | """
 3 | Some general utility functions
 4 | 
 5 | Albin Andrew Correya
 6 | R&D Intern
 7 | @2017
 8 | """
 9 | 
10 | import logging
11 | import time
12 | import json
13 | import csv
14 | 
15 | 
16 | def log(log_file):
17 |     """Returns a logger object with predefined settings"""
18 |     root_logger = logging.getLogger(__name__)
19 |     root_logger.setLevel(logging.DEBUG)
20 |     file_handler = logging.FileHandler(log_file)
21 |     log_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
22 |     file_handler.setFormatter(log_formatter)
23 |     root_logger.addHandler(file_handler)
24 |     console_handler = logging.StreamHandler()
25 |     console_handler.setFormatter(log_formatter)
26 |     root_logger.addHandler(console_handler)
27 |     return root_logger
28 | 
29 | 
30 | def timeit(method):
31 |     """Custom timeit profiling function."""
32 |     def timed(*args, **kw):
33 |         ts = time.time()
34 |         result = method(*args, **kw)
35 |         te = time.time()
36 |         if 'log_time' in kw:
37 |             name = kw.get('log_name', method.__name__.upper())
38 |             kw['log_time'][name] = int((te - ts) * 1000)
39 |         else:
40 |             print '%r - runtime : %2.2f ms' % \
41 |                   (method.__name__, (te - ts) * 1000)
42 |         return result
43 |     return timed
44 | 
45 | 
46 | def slice_results(results_json, size):
47 |     """Slice the query response results to a specified size"""
48 |     with open(results_json) as f:
49 |         data = json.load(f)
50 |     sliced_dict = dict()
51 |     for msd_id in data.keys():
52 |         if isinstance(data[msd_id]['id'], list):
53 |             sliced_dict[msd_id] = {
54 |                 'id': data[msd_id]['id'][:size],
55 |                 'score': data[msd_id]['score'][:size]
56 |             }
57 |         else:
58 |             sliced_dict[msd_id] = {'id': None, 'score': None}
59 |     return sliced_dict
60 | 
61 | 
62 | # some utils for accessing the msd_metadata sql db
63 | def init_connection(db_file):
64 |     """Loads a sqlite db file and returns the connection object"""
65 |     try:
66 |         import sqlite3
67 |         con = sqlite3.connect(db_file)  # specify the path to the sql db file provided by the LabROSA team
68 |     except Exception:
69 |         raise ImportError("Cannot import db_file")
70 |     return con
71 | 
72 | 
73 | def get_fields_from_msd_db(db_file, field_name='track_id'):
74 |     """
75 |     Input : the "track_metadata.db" sql db file provided by LabROSA
76 |     Output : a list of the specified field for the 1M songs in the MSD dataset
77 |     """
78 |     con = init_connection(db_file)
79 |     query = con.execute("""SELECT %s FROM songs""" % field_name)
80 |     results = query.fetchall()
81 |     con.close()
82 |     return [field[0] for field in results]
83 | 
84 | 
85 | def get_msd_data_from_track_id(con, track_id, field_name='track_name'):
86 |     query = con.execute("""SELECT %s FROM songs WHERE track_id='%s'""" % (field_name, track_id))
87 |     results = query.fetchall()
88 |     return [field[0] for field in results]
89 | 
90 | 
91 | def get_msd_field_metadata_from_ids(db_file, track_ids, field_name='track_name'):
92 |     con = init_connection(db_file)
93 |     # msd track ids are strings (eg. 'TRPIIKF128F1459A09'), so they have to be quoted in the IN clause
94 |     msd_ids = ','.join(["'%s'" % msd_id for msd_id in track_ids])
95 |     query = con.execute("""SELECT %s FROM songs WHERE track_id IN (%s)""" % (field_name, msd_ids))
96 |     results = query.fetchall()
97 |     return [field[0] for field in results]
98 | 
--------------------------------------------------------------------------------
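For orientation, here is a short sketch of how the helpers in `utils.py` above fit together. The `track_metadata.db` and results-file paths are placeholders for files you provide yourself (the db comes from LabROSA, as noted in the docstrings):

```python
from utils import log, timeit, slice_results, get_fields_from_msd_db

logger = log('./logs/example.log')


@timeit
def count_msd_tracks(db_file):
    """Count the msd track ids available in the LabROSA track_metadata.db file."""
    track_ids = get_fields_from_msd_db(db_file, field_name='track_id')
    logger.info("%s tracks found in %s" % (len(track_ids), db_file))
    return len(track_ids)


# prune a saved query-response file down to its top-10 candidates per query
top10 = slice_results('./results/title_match_results.json', size=10)

count_msd_tracks('./track_metadata.db')
```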