├── README.md ├── main.py └── CPA_fig2.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Description of datasets: 2 | 3 | All datasets were produced using the [Hooktheory API](https://www.hooktheory.com/api/trends/docs) and the [Spotify API](https://developer.spotify.com/documentation/web-api/). 4 | 5 | Given a chord progression, the Hooktheory database annotates all of its child progressions with the proportion of songs containing that child node. For example, of all songs containing the progression "I, IV, vi", 55% are followed by "V" and 14% are followed by "IV". Information about interpreting the Hooktheory chord notation can be found [here](http://forum.hooktheory.com/t/vizualitation-of-all-chord-progressions-kinda/164/2). 6 | 7 | Chord progressions were pulled from the Hooktheory database as follows: 8 | * First, all length-one chord progressions (single chords) which were contained in at least 5% of the Hooktheory song database were pulled. 9 | * Next, length-two chord progressions were constructed by appending to each of the length-one chord progressions any chord which follows it in at least 5% of all songs containing the given length-one progression. So for example, 15% of songs in the Hooktheory database begin with the "I" chord. Of songs starting with "I", 6% are followed by "ii", hence the length-two chord progression "I, ii" was pulled from the database. 10 | * Continuing in this way, we extract all 3-, 4-, and 5-chord progressions which have at least a 5% chance of occurring given their parent progression. 11 | 12 | After the chord progressions were obtained, I used the Hooktheory API again to pull all songs associated with the pulled progressions. Each song item contained information about the song, artist, and section (chorus, verse, etc.) which contained the given progression. 
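The threshold-based expansion described in the list above can be sketched without touching the API. This is only an illustration: the `TOY_CHILDREN` table and its probabilities are made up, and `expand` is a hypothetical helper, not part of `main.py`. It uses the same strict `> tol` comparison that `main.py` applies to each node's `probability`, and writes progressions in the comma-separated numeric form used in the data (e.g. `1,4`).

```python
# Toy stand-in for the Hooktheory child-node lookup.
# Keys are parent progressions; values map each child progression
# to its (made-up) conditional probability given the parent.
TOY_CHILDREN = {
    "": {"1": 0.15, "4": 0.10, "7": 0.01},
    "1": {"1,2": 0.06, "1,4": 0.20, "1,3": 0.02},
    "4": {"4,1": 0.30, "4,7": 0.04},
}


def expand(parents, children=TOY_CHILDREN, tol=0.05):
    """Return every child progression whose probability exceeds tol."""
    kept = []
    for parent in parents:
        for child, prob in children.get(parent, {}).items():
            if prob > tol:
                kept.append(child)
    return kept


# Start from the empty progression and grow one chord at a time.
one_chord = expand([""])
two_chord = expand(one_chord)
print(one_chord)
print(two_chord)
```

Repeating the call on `two_chord`, and so on, would give the longer progressions in the same way.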
13 | 14 | Next, given the song/artist pairs pulled from the Hooktheory database, I used the Spotify API to search the Spotify track database for tracks matching each song/artist pair. When a search returned several tracks, the most popular one was kept. Of the songs for which a match was found, I then used the Spotify API again to pull detailed audio features for each track, and the genre information for the artists of the tracks. Detailed information about audio features can be found [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). 15 | 16 | All of the following datasets were constructed by manipulating/cleaning the data pulled in the aforementioned manner. 17 | 18 | ## three_, four_, and five_chord_songs.csv 19 | 20 | As the titles suggest, these contain all three-, four-, and five-chord progressions along with any song/artist/section information pulled from the Hooktheory database. Where possible, Spotify audio feature data is given. No genre information is given. 21 | 22 | ## three_four_five.csv 23 | 24 | This is the concatenation of the three_, four_, and five_chord_songs.csv datasets. In addition, genre information is given where possible. The genres correspond to the artist of the song (though the song doesn't necessarily fit into each genre which describes a given artist). 25 | 26 | ## three_four_five_pruned.csv 27 | 28 | Several three- and four-chord progressions are contained in progressions of a longer length. For example, "I, IV, I" is contained in "I, IV, I, V", which is itself contained in "I, IV, I, V, I". This dataset removes redundant chord progressions by favoring the longer progression. Thus, in the example given, if a given song/artist/section combination contains both "I, IV, I" and "I, IV, I, V", the rows pertaining to "I, IV, I" are pruned from the dataset. Similarly, if the same song/artist/section combination contains "I, IV, I, V, I", then the four-chord progression is also pruned. 
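As a rough illustration of the pruning rule above, here is a small pure-Python sketch on made-up rows. The helpers `contains` and `prune` are hypothetical and are not the functions in `main.py`: `main.py` works on a pandas DataFrame and tests containment with a plain substring check on the progression string, whereas this sketch compares chord by chord, which is an assumption made here for clarity. Progressions use the comma-separated numeric form found in the data.

```python
def contains(longer, shorter):
    """True if the chords of `shorter` appear contiguously inside `longer`."""
    a, b = longer.split(","), shorter.split(",")
    return any(a[i:i + len(b)] == b for i in range(len(a) - len(b) + 1))


def prune(rows):
    """Drop rows whose progression is contained in a longer progression
    recorded for the same artist/song/section combination."""
    kept = []
    for row in rows:
        key = (row["artist"], row["song"], row["section"])
        redundant = any(
            (other["artist"], other["song"], other["section"]) == key
            and len(other["cp"].split(",")) > len(row["cp"].split(","))
            and contains(other["cp"], row["cp"])
            for other in rows
        )
        if not redundant:
            kept.append(row)
    return kept


# Made-up example rows: one chorus with nested progressions, one verse.
rows = [
    {"cp": "1,4,1", "artist": "a", "song": "s", "section": "chorus"},
    {"cp": "1,4,1,5", "artist": "a", "song": "s", "section": "chorus"},
    {"cp": "1,4,1,5,1", "artist": "a", "song": "s", "section": "chorus"},
    {"cp": "1,4,1", "artist": "a", "song": "s", "section": "verse"},
]
pruned = prune(rows)
print([(r["cp"], r["section"]) for r in pruned])
```

In the chorus only the five-chord progression survives, while the three-chord progression in the verse is kept because no longer progression was recorded for that section.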
29 | 30 | ## three_four_five_has_audio_pruned.csv 31 | 32 | This is a further refinement of three_four_five_pruned.csv which retains only those song/artist combinations for which Spotify information was successfully queried. 33 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | """ 4 | 2/2/18 5 | 6 | Author: Jesse Hamer 7 | 8 | The Data Incubator Application 9 | 10 | Challenge 3: Propose a Project 11 | 12 | Working Project Title: Sentiment Analysis of Chord Progressions 13 | """ 14 | 15 | import requests 16 | import time 17 | import json 18 | import pandas as pd 19 | import numpy as np 20 | import seaborn as sns 21 | import matplotlib.pyplot as plt 22 | 23 | HOOKTHEORY_USERNAME = 'your_username_here' 24 | HOOKTHEORY_PWORD = 'your_password_here' 25 | HOOKTHEORY_API_ENDPT = 'https://api.hooktheory.com/v1/' 26 | 27 | HT_AUTH_REQUEST_URL = HOOKTHEORY_API_ENDPT + 'users/auth' 28 | 29 | HT_AUTH_REQUEST_BODY = {'username': HOOKTHEORY_USERNAME, 30 | 'password': HOOKTHEORY_PWORD} 31 | 32 | HT_AUTH_REQUEST_RESP = requests.request('POST', 33 | HT_AUTH_REQUEST_URL, 34 | json=HT_AUTH_REQUEST_BODY) 35 | 36 | HT_AUTH_REQUEST_CONTENT = HT_AUTH_REQUEST_RESP.json() 37 | 38 | ht_activation_key = 'Bearer ' + HT_AUTH_REQUEST_CONTENT['activkey'] 39 | 40 | ht_auth_header = {'Authorization': ht_activation_key} 41 | 42 | client = requests.session() 43 | client.headers = ht_auth_header 44 | 45 | def get_song_request(sess, cp, wait_time=0, verbose=False): 46 | songs = [] 47 | 48 | page = 0 49 | redo=True 50 | new_songs=[] 51 | 52 | while new_songs or redo: 53 | songs+=new_songs 54 | page+=1 55 | redo=False 56 | 57 | r = sess.get(HOOKTHEORY_API_ENDPT + 'trends/songs', 58 | params = {'cp':cp, 'page':str(page)}) 59 | new_songs = r.json() 60 | 61 | remaining = int(r.headers['X-Rate-Limit-Remaining']) 62 | wait_time = 
int(r.headers['X-Rate-Limit-Reset']) 63 | 64 | if verbose: 65 | print('Retrieved page {}; contains {} new results'.format(page, 66 | len(new_songs))) 67 | 68 | if remaining==0: 69 | print('Too many requests. Waiting {} seconds...'.format(wait_time)) 70 | time.sleep(wait_time) 71 | if False in [type(s)==dict for s in new_songs]: 72 | page-=1 73 | new_songs=[] 74 | redo=True 75 | 76 | return songs 77 | 78 | 79 | def get_chord_progressions(sess, initial_progressions, tol = 0, verbose=False): 80 | chord_progs = [] 81 | initial_progs = initial_progressions.copy() 82 | 83 | while initial_progs: 84 | prog = initial_progs.pop(0) 85 | cp= prog['child_path'] 86 | 87 | r = sess.get(HOOKTHEORY_API_ENDPT + 'trends/nodes', 88 | params = {'cp':cp}) 89 | new_prog = r.json() 90 | 91 | remaining = int(r.headers['X-Rate-Limit-Remaining']) 92 | wait_time = int(r.headers['X-Rate-Limit-Reset']) 93 | 94 | if remaining==0: 95 | print('Too many requests. Waiting {} seconds...'.format(wait_time)) 96 | time.sleep(wait_time) 97 | if False in [type(s)==dict for s in new_prog]: 98 | initial_progs.insert(0, prog) 99 | else: 100 | new_prog = [p for p in new_prog if p['probability']>tol] 101 | chord_progs+=new_prog 102 | if verbose: 103 | print('Progression {} processed.'.format(cp)) 104 | 105 | 106 | return chord_progs 107 | 108 | one_chord = get_chord_progressions(client, [{'child_path':''}], tol=0.05, verbose=True) 109 | 110 | two_chord = get_chord_progressions(client, one_chord, tol=0.05, verbose=True) 111 | 112 | three_chord = get_chord_progressions(client, two_chord, tol=0.05, verbose=True) 113 | 114 | four_chord = get_chord_progressions(client, three_chord, tol=0.05, verbose=True) 115 | 116 | five_chord = get_chord_progressions(client, four_chord, tol=0.05, verbose=True) 117 | 118 | def get_cp_song_data(sess, chord_progs, verbose=0): 119 | data = pd.DataFrame([], columns=['cp', 'artist', 'song', 'section']) 120 | total_cp = len(chord_progs) 121 | cp_counter = 0 122 | 123 | if verbose==0: 
124 | songs_verbose=False 125 | cp_verbose=False 126 | if verbose==1: 127 | songs_verbose=False 128 | cp_verbose=True 129 | if verbose==2: 130 | songs_verbose=True 131 | cp_verbose=True 132 | 133 | for prog in chord_progs: 134 | cp_counter+=1 135 | cp = prog['child_path'] 136 | if cp_verbose: 137 | print('###### FETCHING SONGS FOR {}; {}/{} ##########\n'.format(cp, 138 | cp_counter, total_cp)) 139 | songs = get_song_request(sess, cp, verbose=songs_verbose) 140 | for song in songs: 141 | song['cp'] = cp 142 | song.pop('url') 143 | data = data.append(songs) 144 | if cp_verbose: 145 | print('###### DONE WITH {}; {} SONGS ADDED; DATA SHAPE: {} #########\n'.format(cp, 146 | len(songs), data.shape)) 147 | return data 148 | 149 | # Get data for four-chord progressions first, to make sure everything works. 150 | 151 | cp_song_data_four = get_cp_song_data(client, four_chord, verbose=1) 152 | 153 | cp_song_data_four.song = cp_song_data_four.song.apply(lambda x: x.lower()) 154 | cp_song_data_four.artist = cp_song_data_four.artist.apply(lambda x: x.lower()) 155 | cp_song_data_four.section= cp_song_data_four.section.apply(lambda x: x.lower()) 156 | 157 | print(cp_song_data_four.describe()) 158 | 159 | # 2220 unique artists, 3746 unique songs; 22 unique sections 160 | 161 | print(cp_song_data_four.groupby(['artist', 'song']).size().shape[0]) 162 | 163 | # 3874 unique artist/song combinations 164 | 165 | cp4_artist_song = cp_song_data_four[['artist', 'song']] 166 | cp4_artist_song.drop_duplicates(inplace=True) 167 | cp4_artist_song.sort_values(by=['artist', 'song'], inplace=True) 168 | cp4_artist_song.reset_index(inplace=True, drop=True) 169 | 170 | 171 | 172 | ############################################## 173 | 174 | # Now try to retrieve spotify information for each artist/song 175 | 176 | from spotipy import Spotify 177 | from spotipy.oauth2 import SpotifyClientCredentials 178 | 179 | SPOTIFY_CLIENT_ID = 'your_spotify_client_id_here' 180 | 181 | SPOTIFY_CLIENT_SECRET = 
'your_spotify_client_secret_here' 182 | 183 | token = SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, 184 | client_secret=SPOTIFY_CLIENT_SECRET) 185 | 186 | cache_token = token.get_access_token() 187 | 188 | spotify = Spotify(cache_token) 189 | 190 | def get_track_ids(client, data, num_tracks=None, turn_update=None): 191 | 192 | for i, row in data.iterrows(): 193 | artist=row['artist'] 194 | song=row['song'] 195 | q = 'artist:'+artist + ' ' + 'track:'+song 196 | result = client.search(q) 197 | items = result['tracks']['items'] 198 | if not items: 199 | continue 200 | popularities = [item['popularity'] for item in items] 201 | most_popular = items[popularities.index(max(popularities))] 202 | data.loc[(data.artist==artist) & (data.song==song),'spotify_ID']=most_popular['id'] 203 | if i == num_tracks: 204 | break 205 | if turn_update and (i+1)%turn_update==0: 206 | print('Finished track {}'.format(i+1)) 207 | 208 | get_track_ids(spotify, cp4_artist_song) 209 | 210 | def get_audio_features(client, data, verbose=False): 211 | 212 | ids = data.spotify_ID.dropna() 213 | 214 | audio_feature_data = pd.DataFrame([],columns=['danceability', 'energy', 215 | 'key', 'loudness', 'mode', 'speechiness', 216 | 'acousticness', 'instrumentalness', 217 | 'liveness', 'valence', 'tempo','id', 218 | 'duration_ms', 'time_signature']) 219 | for i in range(0, len(ids), 50): 220 | new_features = client.audio_features(ids[i:i+50]) # use the client passed in, not the global `spotify` 221 | for track in new_features: 222 | track.pop('type') 223 | track.pop('uri') 224 | track.pop('track_href') 225 | track.pop('analysis_url') 226 | audio_feature_data = audio_feature_data.append(new_features) 227 | if verbose: 228 | print('Done with tracks {} through {}'.format(i+1, i+50)) 229 | 230 | return audio_feature_data 231 | 232 | cp4_audio_features = get_audio_features(spotify, cp4_artist_song,verbose=True) 233 | 234 | cp4_audio_features.rename(columns={'id':'spotify_ID'}, inplace=True) 235 | 236 | cp4_artist_song = 
cp4_artist_song.merge(cp4_audio_features, how='left', 237 | on='spotify_ID') 238 | 239 | cp_song_data_four = cp_song_data_four.merge(cp4_artist_song, how='left', 240 | on=['artist','song']) 241 | 242 | cp_song_data_four.to_csv('four_chord_songs.csv', index=False) 243 | 244 | del cp_song_data_four 245 | del cp4_artist_song 246 | del cp4_audio_features 247 | 248 | 249 | ################################## 250 | # Now repeat for three and five chord progressions. 251 | ################################## 252 | 253 | # THREE CHORD PROGRESSIONS: 254 | 255 | HT_AUTH_REQUEST_RESP = requests.request('POST', 256 | HT_AUTH_REQUEST_URL, 257 | json=HT_AUTH_REQUEST_BODY) 258 | 259 | HT_AUTH_REQUEST_CONTENT = HT_AUTH_REQUEST_RESP.json() 260 | 261 | ht_activation_key = 'Bearer ' + HT_AUTH_REQUEST_CONTENT['activkey'] 262 | 263 | ht_auth_header = {'Authorization': ht_activation_key} 264 | 265 | client = requests.session() 266 | client.headers = ht_auth_header 267 | 268 | cp_song_data_three = get_cp_song_data(client, three_chord, verbose=1) 269 | 270 | cp_song_data_three.song = cp_song_data_three.song.apply(lambda x: x.lower()) 271 | cp_song_data_three.artist = cp_song_data_three.artist.apply(lambda x: x.lower()) 272 | cp_song_data_three.section= cp_song_data_three.section.apply(lambda x: x.lower()) 273 | 274 | 275 | cp3_artist_song = cp_song_data_three[['artist', 'song']] 276 | cp3_artist_song.drop_duplicates(inplace=True) 277 | cp3_artist_song.sort_values(by=['artist', 'song'], inplace=True) 278 | cp3_artist_song.reset_index(inplace=True, drop=True) 279 | 280 | token = SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, 281 | client_secret=SPOTIFY_CLIENT_SECRET) 282 | 283 | cache_token = token.get_access_token() 284 | 285 | spotify = Spotify(cache_token) 286 | 287 | get_track_ids(spotify, cp3_artist_song) 288 | 289 | cp3_audio_features = get_audio_features(spotify, cp3_artist_song,verbose=True) 290 | 291 | cp3_audio_features.rename(columns={'id':'spotify_ID'}, inplace=True) 292 | 
293 | cp3_artist_song = cp3_artist_song.merge(cp3_audio_features, how='left', 294 | on='spotify_ID') 295 | 296 | cp_song_data_three = cp_song_data_three.merge(cp3_artist_song, how='left', 297 | on=['artist','song']) 298 | 299 | cp_song_data_three.to_csv('three_chord_songs.csv', index=False) 300 | 301 | del cp_song_data_three 302 | del cp3_artist_song 303 | del cp3_audio_features 304 | 305 | ###################################### 306 | 307 | # FIVE CHORD PROGRESSIONS 308 | 309 | HT_AUTH_REQUEST_RESP = requests.request('POST', 310 | HT_AUTH_REQUEST_URL, 311 | json=HT_AUTH_REQUEST_BODY) 312 | 313 | HT_AUTH_REQUEST_CONTENT = HT_AUTH_REQUEST_RESP.json() 314 | 315 | ht_activation_key = 'Bearer ' + HT_AUTH_REQUEST_CONTENT['activkey'] 316 | 317 | ht_auth_header = {'Authorization': ht_activation_key} 318 | 319 | client = requests.session() 320 | client.headers = ht_auth_header 321 | 322 | cp_song_data_five = get_cp_song_data(client, five_chord, verbose=1) 323 | 324 | cp_song_data_five.song = cp_song_data_five.song.apply(lambda x: x.lower()) 325 | cp_song_data_five.artist = cp_song_data_five.artist.apply(lambda x: x.lower()) 326 | cp_song_data_five.section= cp_song_data_five.section.apply(lambda x: x.lower()) 327 | 328 | 329 | cp5_artist_song = cp_song_data_five[['artist', 'song']] 330 | cp5_artist_song.drop_duplicates(inplace=True) 331 | cp5_artist_song.sort_values(by=['artist', 'song'], inplace=True) 332 | cp5_artist_song.reset_index(inplace=True, drop=True) 333 | 334 | token = SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, 335 | client_secret=SPOTIFY_CLIENT_SECRET) 336 | 337 | cache_token = token.get_access_token() 338 | 339 | spotify = Spotify(cache_token) 340 | 341 | get_track_ids(spotify, cp5_artist_song) 342 | 343 | cp5_audio_features = get_audio_features(spotify, cp5_artist_song,verbose=True) 344 | 345 | cp5_audio_features.rename(columns={'id':'spotify_ID'}, inplace=True) 346 | 347 | cp5_artist_song = cp5_artist_song.merge(cp5_audio_features, how='left', 348 | 
on='spotify_ID') 349 | 350 | cp_song_data_five = cp_song_data_five.merge(cp5_artist_song, how='left', 351 | on=['artist','song']) 352 | 353 | cp_song_data_five.to_csv('five_chord_songs.csv', index=False) 354 | 355 | del cp_song_data_five 356 | del cp5_artist_song 357 | del cp5_audio_features 358 | 359 | ########################################### 360 | 361 | # Get genre information for all tracks 362 | 363 | three = pd.read_csv('three_chord_songs.csv') 364 | four = pd.read_csv('four_chord_songs.csv') 365 | five = pd.read_csv('five_chord_songs.csv') 366 | 367 | three['cp_length'] = 3 368 | four['cp_length'] = 4 369 | five['cp_length'] = 5 370 | 371 | three_four_five = three.append(four).append(five).reset_index(drop=True) 372 | 373 | def get_track_genres(client, track_ids, verbose=False): 374 | data = pd.DataFrame([], columns=['spotify_ID', 'genres']) 375 | for i in range(0, len(track_ids), 20): 376 | tids = track_ids[i:i+20] 377 | tracks = client.tracks(tids)['tracks'] 378 | artist_ids = [track['artists'][0]['id'] for track in tracks] 379 | artists = client.artists(artist_ids)['artists'] 380 | genres = [artist['genres'] for artist in artists] 381 | new_data = [{'spotify_ID':tid, 'genres':genre} for tid,genre in zip(tids, genres)] 382 | data = data.append(new_data).dropna() 383 | if verbose: 384 | print('Finished fetching genres for tids {} through {}.'.format(i+1, i+20)) 385 | print('New data shape: {}'.format(data.shape)) 386 | 387 | return data 388 | 389 | token = SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, 390 | client_secret=SPOTIFY_CLIENT_SECRET) 391 | 392 | cache_token = token.get_access_token() 393 | 394 | spotify = Spotify(cache_token) 395 | 396 | track_genres = get_track_genres(spotify, three_four_five.spotify_ID.dropna().unique(), 397 | verbose=True) 398 | 399 | three_four_five = three_four_five.merge(track_genres, how='left', on='spotify_ID') 400 | 401 | three_four_five.to_csv('three_four_five.csv', index=False) 402 | 403 | # If one chord 
progression is contained within another, we favor the longer. 404 | def remove_redundant_cp(data, l1, l2): 405 | cp_l1 = data.loc[data.cp_length==l1,['cp', 'artist', 'song', 'section']] 406 | cp_l2 = data.loc[data.cp_length==l2, ['cp', 'artist', 'song', 'section']] 407 | 408 | for i, row in cp_l1.iterrows(): 409 | cp = row['cp'] 410 | artist=row['artist'] 411 | song=row['song'] 412 | section=row['section'] 413 | if ((cp_l2.cp.apply(lambda x: cp in x))&((artist==cp_l2.artist)&\ 414 | ((song==cp_l2.song)&(section==cp_l2.section)))).any(): 415 | data.drop(i, inplace=True) 416 | 417 | three_four_five_pruned = three_four_five.copy() 418 | 419 | remove_redundant_cp(three_four_five_pruned, 3, 4) 420 | remove_redundant_cp(three_four_five_pruned, 3, 5) 421 | remove_redundant_cp(three_four_five_pruned, 4, 5) 422 | 423 | # Save the new 'pruned' datasets 424 | 425 | three_four_five_pruned.to_csv('three_four_five_pruned.csv', index=False) 426 | 427 | # Make dataset of cp/artist/song/section combinations which have spotify info 428 | 429 | has_audio_data_pruned = three_four_five_pruned.dropna(subset=['spotify_ID']) 430 | 431 | has_audio_data_pruned.to_csv('three_four_five_has_audio_pruned.csv', index=False) 432 | 433 | has_audio_data_pruned.reset_index(drop=True, inplace=True) 434 | 435 | #################################### 436 | 437 | # NOW FOR EDA 438 | 439 | print(three_four_five.shape) 440 | 441 | print(has_audio_data_pruned.shape) 442 | 443 | # 12511 non-redundant records with audio data 444 | 445 | print(has_audio_data_pruned.cp_length.value_counts()) 446 | 447 | # 9884 5-chord progressions, 1479 3-chord progressions, 1148 4-chord progressions 448 | 449 | print(has_audio_data_pruned.cp.describe()) 450 | 451 | # 1018 unique chord progressions, 452 | 453 | print(has_audio_data_pruned.groupby('cp_length').apply(lambda x: x.cp.describe())) 454 | 455 | # 68 unique 3-chord progressions, 197 unique 4-chord progressions, 753 unique 456 | # 5-chord progressions 457 | 458 | # Note: 
to reformat genres as lists after loading csv, run the following: 459 | 460 | # has_audio_data_pruned.genres = has_audio_data_pruned.genres.apply( 461 | # lambda x: [g.strip("' ") for g in x.strip('[]').split(',')] if x!='[]' else np.nan) 462 | 463 | # Get all genres 464 | all_genres = {} 465 | 466 | for genres in has_audio_data_pruned.genres.dropna(): 467 | for g in genres: 468 | if g in all_genres.keys(): 469 | all_genres[g]+=1 470 | else: 471 | all_genres[g]=1 472 | 473 | all_genres = pd.Series(all_genres) 474 | 475 | print('The 20 most popular genres are: \n{}'.format(all_genres.sort_values(ascending=False).head(20))) 476 | """ 477 | pop 3016 478 | rock 2499 479 | dance pop 2483 480 | pop rock 1589 481 | modern rock 1511 482 | post-teen pop 1393 483 | edm 1074 484 | permanent wave 947 485 | pop punk 940 486 | album rock 930 487 | folk-pop 915 488 | mellow gold 810 489 | indie rock 755 490 | soft rock 739 491 | classic rock 729 492 | indie pop 721 493 | alternative rock 696 494 | post-grunge 685 495 | hard rock 633 496 | neo mellow 631 497 | """ 498 | 499 | # Only use cps with at least 5 observations: 500 | 501 | cp_group_sizes = has_audio_data_pruned.groupby('cp').size() 502 | cp_group_sizes.name='n' 503 | cp_group_sizes = cp_group_sizes.reset_index() 504 | 505 | 506 | has_audio_data_pruned = has_audio_data_pruned.merge(cp_group_sizes, on='cp') 507 | 508 | has_5_obs = has_audio_data_pruned[has_audio_data_pruned.n>=5] 509 | 510 | print('Still have {} unique chord_progressions.'.format(has_5_obs.cp.unique().shape[0])) 511 | 512 | print(has_5_obs.groupby('cp_length').apply(lambda x: x.cp.describe())) 513 | 514 | """ 515 | cp count unique top freq 516 | cp_length 517 | 3 1473 66 4,5,1 104 518 | 4 932 93 6,4,1,5 29 519 | 5 9066 342 4,1,5,6,4 312 520 | """ 521 | 522 | all_genres = {} 523 | 524 | for genres in has_5_obs.genres.dropna(): 525 | for g in genres: 526 | if g in all_genres.keys(): 527 | all_genres[g]+=1 528 | else: 529 | all_genres[g]=1 530 | 531 | all_genres = 
pd.Series(all_genres) 532 | print('The 20 most popular genres are: \n{}'.format(all_genres.sort_values(ascending=False).head(20))) 533 | """ 534 | The 20 most popular genres are: 535 | pop 2820 536 | dance pop 2306 537 | rock 2280 538 | pop rock 1495 539 | modern rock 1366 540 | post-teen pop 1308 541 | edm 972 542 | pop punk 893 543 | permanent wave 885 544 | album rock 852 545 | folk-pop 846 546 | mellow gold 758 547 | soft rock 688 548 | classic rock 680 549 | indie rock 670 550 | indie pop 643 551 | post-grunge 642 552 | alternative rock 621 553 | neo mellow 596 554 | hard rock 567 555 | """ 556 | 557 | numeric_audio_features = ['danceability', 'energy', 'loudness', 558 | 'acousticness', 'valence', 'tempo'] 559 | 560 | def all_cp_plot(data, features): 561 | 562 | num_plots = len(features) 563 | cols = int(np.ceil(num_plots/3)) 564 | fig = plt.figure(figsize=(6*cols,9)) 565 | fig.subplots_adjust(hspace=.5, wspace=.3) 566 | 567 | for i, feature in enumerate(features): 568 | ax = fig.add_subplot(3, cols, i+1) 569 | cp_feature_groups = data.groupby('cp')[feature].agg([np.mean, np.std]) 570 | feature_sorted = cp_feature_groups.sort_values(by='mean') 571 | feature_sorted['mean'].plot(ax=ax) 572 | plt.fill_between(feature_sorted.index, 573 | feature_sorted['mean']-feature_sorted['std'], 574 | feature_sorted['mean'] + feature_sorted['std'], 575 | color='orange', alpha=0.2) 576 | plt.xticks([],[]) 577 | plt.ylabel(feature) 578 | ax.set_title('Mean {}'.format(feature)) 579 | plt.show() 580 | 581 | 582 | def cp_plot(cp, data, numeric_features=[], compare=False): 583 | cp_data = data[data.cp==cp] 584 | num_plots = len(numeric_features) 585 | cols = int(np.ceil(num_plots/3)) 586 | fig = plt.figure(figsize=(6*cols,9)) 587 | fig.subplots_adjust(hspace=.5, wspace=.3) 588 | for i, feature in enumerate(numeric_features): 589 | ax = fig.add_subplot(3, cols, i+1) 590 | sns.distplot(cp_data[feature], hist=False, ax=ax, label=cp) 591 | if compare: 592 | sns.distplot(data[feature], 
color='orange', hist=False, ax=ax, 593 | label='All CPs') 594 | ax.set_title('Distribution of {} for {}'.format(feature, cp)) 595 | plt.show() 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | -------------------------------------------------------------------------------- /CPA_fig2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Audio Features for Single Chord Progression\n", 8 | "\n", 9 | "This figure compares one of the highest-valence (happiest) chord progressions to all other chord progressions." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import seaborn as sns\n", 20 | "import matplotlib.pyplot as plt\n", 21 | "import numpy as np\n", 22 | "\n", 23 | "%matplotlib inline\n", 24 | "\n", 25 | "# load data\n", 26 | "\n", 27 | "has_audio_data_pruned = pd.read_csv('three_four_five_has_audio_pruned.csv')\n", 28 | "\n", 29 | "cp_group_sizes = has_audio_data_pruned.groupby('cp').size()\n", 30 | "cp_group_sizes.name='n'\n", 31 | "cp_group_sizes = cp_group_sizes.reset_index()\n", 32 | "\n", 33 | "\n", 34 | "has_audio_data_pruned = has_audio_data_pruned.merge(cp_group_sizes, on='cp')\n", 35 | "\n", 36 | "# We only look at chord progressions associated to at least five artist/song/section combinations\n", 37 | "has_5_obs = has_audio_data_pruned[has_audio_data_pruned.n>=5]" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "name": "stderr", 47 | "output_type": "stream", 48 | "text": [ 49 | "/Applications/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. 
In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.\n", 50 | " return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval\n" 51 | ] 52 | }, 53 | { 54 | "data": { 55 | "image/png": "\n", 56 | "text/plain": [ 57 | "
" 58 | ] 59 | }, 60 | "metadata": {}, 61 | "output_type": "display_data" 62 | } 63 | ], 64 | "source": [ 65 | "numeric_audio_features = ['danceability', 'energy', 'loudness', \n", 66 | " 'acousticness', 'valence', 'tempo']\n", 67 | "\n", 68 | "\n", 69 | "def cp_plot(cp, data, numeric_features=[], compare=False):\n", 70 | " cp_data = data[data.cp==cp]\n", 71 | " num_plots = len(numeric_features)\n", 72 | " cols = int(np.ceil(num_plots/3))\n", 73 | " fig = plt.figure(figsize=(6*cols,9))\n", 74 | " fig.subplots_adjust(hspace=.5, wspace=.3)\n", 75 | " for i, feature in enumerate(numeric_features):\n", 76 | " ax = fig.add_subplot(3, cols, i+1)\n", 77 | " sns.distplot(cp_data[feature], hist=False, ax=ax, label=cp)\n", 78 | " if compare:\n", 79 | " sns.distplot(data[feature], color='orange', hist=False, ax=ax,\n", 80 | " label='All CPs')\n", 81 | " ax.set_title('Distribution of {} for {}'.format(feature, cp))\n", 82 | " plt.show()\n", 83 | " \n", 84 | "cp_plot('1,4,1,5,1',has_5_obs, numeric_features=numeric_audio_features, compare=True)" 85 | ] 86 | } 87 | ], 88 | "metadata": { 89 | "kernelspec": { 90 | "display_name": "Python 3", 91 | "language": "python", 92 | "name": "python3" 93 | }, 94 | "language_info": { 95 | "codemirror_mode": { 96 | "name": "ipython", 97 | "version": 3 98 | }, 99 | "file_extension": ".py", 100 | "mimetype": "text/x-python", 101 | "name": "python", 102 | "nbconvert_exporter": "python", 103 | "pygments_lexer": "ipython3", 104 | "version": "3.6.8" 105 | } 106 | }, 107 | "nbformat": 4, 108 | "nbformat_minor": 2 109 | } 110 | --------------------------------------------------------------------------------