├── Ranking.png
├── requirements.txt
├── README.md
├── textm.py
└── api.py


/Ranking.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Hugo-Nattagh/2017-Hip-Hop/HEAD/Ranking.png


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
- For the api.py script:

You need a token to use the Genius.com API.
You'll get one if you sign up to the website -> https://genius.com/api-clients

pip install requests
pip install bs4
pip install lyricsgenius


- For the textm.py script:

pip install pandas
pip install nltk
pip install textblob
pip install vadersentiment


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 2017Hip-Hop

### 2017 Rap Albums’ Text Mining and Sentiment Analysis

A lot of rappers claim to deliver a positive message to the world, which can be hard to believe when you listen to the lyrics.

How about putting some albums to the test with sentiment analysis?

Sentiment analysis is a natural language processing technique used on texts to detect the overall opinion of the author (negative, neutral or positive).

#### The Data

I knew [Genius.com](https://genius.com/), where the lyrics are transcribed by the community, had to be one of the best databases for this.

To get the lyrics, I used [lyricsGenius](https://github.com/johnwmillr/LyricsGenius), a Python library by [Johnwmillr](https://github.com/johnwmillr).

It’s very useful, but unfortunately it cannot fetch all the lyrics for an album at once; I had to run a query for each song.

I forked his repo and tried my hand at improving his script, but realized that his library uses Genius.com’s API, which only allows requests for artists and songs.

So I worked around it, in a way that means my modifications don’t fit cleanly into his library.

I used Beautiful Soup to scrape the songs’ titles from the album page, and made a loop to request all those songs with lyricsGenius.

#### What's going on

I wrote a script that preprocesses the lyrics of each song before the sentiment analysis.

I used TextBlob and Vader for the sentiment analysis of each song of an album, then calculated the average for each album.

The results were not what I was hoping for, but they made sense. Sentiment analysis is usually applied to tweets or reviews, not to song lyrics, and especially not to rap, which often contains a lot of slang, wordplay and double entendres. Also, a song’s length brings in many more neutral expressions, which pulls the average toward zero.

Long story short, sentiment analysis, for now, isn’t the way to go to measure an album’s positivity.

I visualized it anyway, to see the relative ranking with this technique.

![alt text](https://github.com/Hugo-Nattagh/2017-Hip-Hop/blob/master/Ranking.png)

This project was a warm-up in text mining and sentiment analysis with Python, and I’d like to continue with another year, or all the verses from an artist, or a whole discography. But I would need to take a closer look at the existing dictionaries and, if possible, adjust them to rap lyrics. Or maybe remove all expressions that are too neutral.
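
To make the "neutral expressions pull the average toward zero" point concrete, here is a minimal sketch. It is only an illustration under stated assumptions: `TOY_LEXICON` is a made-up four-word dictionary (not the one used by TextBlob or Vader), the helper names are hypothetical, and it scores a song line by line, whereas `textm.py` scores each song as a whole.

```python
from statistics import mean

# Hypothetical toy lexicon, for illustration only; it is NOT the
# dictionary used by TextBlob or Vader.
TOY_LEXICON = {"love": 1.0, "great": 0.8, "hate": -1.0, "pain": -0.6}


def line_score(line):
    """Mean lexicon score of the words in a line (unknown words count as 0)."""
    words = line.lower().split()
    if not words:
        return 0.0
    return sum(TOY_LEXICON.get(word, 0.0) for word in words) / len(words)


def song_score(lines, drop_neutral=False):
    """Mean of the line scores, optionally ignoring lines that score exactly 0."""
    scores = [line_score(line) for line in lines]
    if drop_neutral:
        scores = [score for score in scores if score != 0.0]
    return mean(scores) if scores else 0.0


def album_score(songs, drop_neutral=False):
    """Mean of the song scores, mirroring the per-album average described above."""
    return mean([song_score(lines, drop_neutral) for lines in songs])


song = ["i love this", "walking down the street", "riding in my car"]
print(song_score(song))                     # diluted by the two neutral lines
print(song_score(song, drop_neutral=True))  # only the line that carries sentiment
```

With two neutral lines out of three, the song's score drops to a third of what the single sentiment-bearing line scores on its own. Dropping the neutral lines is one possible way to try the "remove all expressions that are too neutral" idea, though it has not been tested on the real data.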

#### Files Description

- `api.py`: Script to get the albums' lyrics in json files
- `textm.py`: Script to perform sentiment analysis on the json files
- `Ranking.png`: Visualization of my results
- `requirements.txt`: What you need in order to make it work


--------------------------------------------------------------------------------
/textm.py:
--------------------------------------------------------------------------------
import os

import pandas as pd
from nltk.corpus import stopwords
from textblob import TextBlob, Word
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

pd.options.display.max_colwidth = 17

analyzer = SentimentIntensityAnalyzer()


def s_analysis(file_df):
    """Score every song of one album; return (album, artist, TextBlob average, Vader average)."""
    lyrics = pd.DataFrame(data=file_df)
    album = lyrics['songs'][0]['album']
    artist = lyrics['artist'][0]

    # Each cell of 'songs' is a dict: pull out the title, then keep only the lyrics text.
    # (Column-wise apply avoids the chained assignment of the original while loop.)
    lyrics['title'] = lyrics['songs'].apply(lambda song: song['title'])
    lyrics['songs'] = lyrics['songs'].apply(lambda song: song['lyrics'].replace('\n', ' '))

    # 1) Basic Feature Extraction

    stop = set(stopwords.words('english'))
    lyrics['stopwords'] = lyrics['songs'].apply(lambda x: len([w for w in x.split() if w in stop]))

    # 2) Basic Pre-Processing: lowercase, strip punctuation, drop stopwords, lemmatize

    lyrics['songs'] = lyrics['songs'].apply(lambda x: " ".join(w.lower() for w in x.split()))
    # regex=True is required: recent pandas versions treat the pattern as a literal string by default
    lyrics['songs'] = lyrics['songs'].str.replace(r'[^\w\s]', '', regex=True)
    lyrics['songs'] = lyrics['songs'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))
    lyrics['songs'] = lyrics['songs'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

    # 3) Sentiment: TextBlob polarity, and Vader positive share minus negative share

    lyrics['sentimentTxtB'] = lyrics['songs'].apply(lambda x: TextBlob(x).sentiment[0])
    lyrics['sentimentTxtB'] = lyrics['sentimentTxtB'].round(3)
    # lyrics = lyrics.sort_values('sentimentTxtB', ascending=False)

    vader = lyrics['songs'].apply(analyzer.polarity_scores)  # score each song only once
    lyrics['sentimentVaderPos'] = vader.apply(lambda scores: scores['pos'])
    lyrics['sentimentVaderNeg'] = vader.apply(lambda scores: scores['neg'])

    lyrics['Vader'] = lyrics['sentimentVaderPos'] - lyrics['sentimentVaderNeg']

    # Album averages over all songs
    avg = round(lyrics['sentimentTxtB'].mean(), 3)
    avg_vad = round(lyrics['Vader'].mean(), 3)

    print(lyrics[['title', 'songs', 'sentimentTxtB', 'Vader']])
    print(album + ' by ' + artist + ' - TextBlob: ' + str(avg) + ", Vader: " + str(avg_vad))
    return album, artist, avg, avg_vad


# Analyse every .json file of the current directory (one file per album)
json_files = [filename for filename in os.listdir(os.getcwd()) if filename.endswith(".json")]
rows = [s_analysis(pd.read_json(filename)) for filename in json_files]
avg_df = pd.DataFrame(rows, columns=['album', 'artist', 'sent_avg_txtB', 'sent_avg_vad'])

avg_df['final_average'] = (avg_df['sent_avg_txtB'] + avg_df['sent_avg_vad']) / 2

avg_df = avg_df.sort_values(by=['final_average'], ascending=False)
print(avg_df)


--------------------------------------------------------------------------------
/api.py:
--------------------------------------------------------------------------------
# Usage:
# import genius
# api = genius.Genius('my_client_access_token_here')
# artist = api.search_artist('Andy Shauf', max_songs=5)
# print(artist)
#
# song = api.search_song('To You', artist.name)
#
artist.add_song(song) 9 | # print(artist) 10 | # print(artist.songs[-1]) 11 | # 12 | # For an album, see at the end 13 | 14 | 15 | import sys 16 | from urllib.request import Request, urlopen, quote 17 | import os 18 | import re 19 | import requests 20 | import socket 21 | import json 22 | from bs4 import BeautifulSoup 23 | from string import punctuation 24 | import time 25 | from warnings import warn 26 | 27 | from lyricsgenius.song import Song 28 | 29 | 30 | class Artist(object): 31 | """An artist from the Genius.com database. 32 | 33 | Attributes: 34 | name: (str) Artist name. 35 | num_songs: (int) Total number of songs listed on Genius.com 36 | 37 | """ 38 | 39 | def __init__(self, json_dict): 40 | """Populate the Artist object with the data from *json_dict*""" 41 | self._body = json_dict['artist'] 42 | self._url = self._body['url'] 43 | self._api_path = self._body['api_path'] 44 | self._id = self._body['id'] 45 | self._songs = [] 46 | self._num_songs = len(self._songs) 47 | 48 | def __len__(self): 49 | return 1 50 | 51 | @property 52 | def name(self): 53 | return self._body['name'] 54 | 55 | @property 56 | def image_url(self): 57 | try: 58 | return self._body['image_url'] 59 | except: 60 | return None 61 | 62 | @property 63 | def songs(self): 64 | return self._songs 65 | 66 | @property 67 | def num_songs(self): 68 | return self._num_songs 69 | 70 | def add_song(self, newsong, verbose=True): 71 | """Add a Song object to the Artist object""" 72 | 73 | if any([song.title == newsong.title for song in self._songs]): 74 | if verbose: 75 | print('{newsong.title} already in {self.name}, not adding song.'.format(newsong=newsong, self=self)) 76 | return 1 # Failure 77 | if newsong.artist == self.name: 78 | self._songs.append(newsong) 79 | self._num_songs += 1 80 | return 0 # Success 81 | else: 82 | self._songs.append(newsong) 83 | self._num_songs += 1 84 | print("Song by {newsong.artist} was added to {self.name}.".format(newsong=newsong, self=self)) 85 | return 0 # 
Success 86 | 87 | def get_song(self, song_name): 88 | """Search Genius.com for *song_name* and add it to artist""" 89 | raise NotImplementedError("I need to figure out how to allow Artist() to access search_song().") 90 | song = Genius.search_song(song_name, self.name) 91 | self.add_song(song) 92 | return 93 | 94 | # TODO: define an export_to_json() method 95 | 96 | def save_lyrics(self, format='json', filename=None, 97 | overwrite=False, skip_duplicates=True, verbose=True): 98 | """Allows user to save all lyrics within an Artist obejct to a .json or .txt file.""" 99 | if format[0] == '.': 100 | format = format[1:] 101 | assert (format == 'json') or (format == 'txt'), "Format must be json or txt" 102 | 103 | # We want to reject songs that have already been added to artist collection 104 | def songsAreSame(s1, s2): 105 | from difflib import SequenceMatcher as sm # For comparing similarity of lyrics 106 | # Idea credit: https://bigishdata.com/2016/10/25/talkin-bout-trucks-beer-and-love-in-country-songs-analyzing-genius-lyrics/ 107 | seqA = sm(None, s1.lyrics, s2['lyrics']) 108 | seqB = sm(None, s2['lyrics'], s1.lyrics) 109 | return seqA.ratio() > 0.5 or seqB.ratio() > 0.5 110 | 111 | def songInArtist(new_song): 112 | # artist_lyrics is global (works in Jupyter notebook) 113 | for song in lyrics_to_write['songs']: 114 | if songsAreSame(new_song, song): 115 | return True 116 | return False 117 | 118 | # Determine the filename 119 | if filename is None: 120 | filename = "Lyrics_{}.{}".format(self.name.replace(" ", ""), format) 121 | else: 122 | if filename.rfind('.') != -1: 123 | filename = filename[filename.rfind('.'):] + '.' + format 124 | else: 125 | filename = filename + '.' + format 126 | 127 | # Check if file already exists 128 | write_file = False 129 | if not os.path.isfile(filename): 130 | write_file = True 131 | elif overwrite: 132 | write_file = True 133 | else: 134 | if input("{} already exists. 
Overwrite?\n(y/n): ".format(filename)).lower() == 'y': 135 | write_file = True 136 | 137 | # Format lyrics in either .txt or .json format 138 | if format == 'json': 139 | lyrics_to_write = {'songs': [], 'artist': self.name} 140 | for song in self.songs: 141 | if skip_duplicates is False or not songInArtist(song): # This takes way too long! It's basically O(n^2), can I do better? 142 | lyrics_to_write['songs'].append({}) 143 | lyrics_to_write['songs'][-1]['title'] = song.title 144 | lyrics_to_write['songs'][-1]['album'] = song.album 145 | lyrics_to_write['songs'][-1]['year'] = song.year 146 | lyrics_to_write['songs'][-1]['lyrics'] = song.lyrics 147 | lyrics_to_write['songs'][-1]['image'] = song.song_art_image_url 148 | lyrics_to_write['songs'][-1]['artist'] = self.name 149 | lyrics_to_write['songs'][-1]['raw'] = song._body 150 | else: 151 | if verbose: 152 | print("SKIPPING \"{}\" -- already found in artist collection.".format(song.title)) 153 | else: 154 | lyrics_to_write = " ".join([s.lyrics + 5 * '\n' for s in self.songs]) 155 | 156 | # Write the lyrics to either a .json or .txt file 157 | if write_file: 158 | with open(filename, 'w') as lyrics_file: 159 | if format == 'json': 160 | json.dump(lyrics_to_write, lyrics_file) 161 | else: 162 | lyrics_file.write(lyrics_to_write) 163 | if verbose: 164 | print('Wrote {} songs to {}.'.format(self.num_songs, filename)) 165 | else: 166 | if verbose: 167 | print('Skipping file save.\n') 168 | return lyrics_to_write 169 | 170 | def __str__(self): 171 | """Return a string representation of the Artist object.""" 172 | if self._num_songs == 1: 173 | return '{0}, {1} song'.format(self.name, self._num_songs) 174 | else: 175 | return '{0}, {1} songs'.format(self.name, self._num_songs) 176 | 177 | def __repr__(self): 178 | return repr((self.name, '{0} songs'.format(self._num_songs))) 179 | 180 | 181 | class _API(object): 182 | # This is a superclass that Genius() inherits from. 
Not sure if this makes any sense, but it 183 | # seemed like a good idea to have this class (more removed from user) handle the lower-level 184 | # interaction with the Genius API, and then Genius() has the more user-friendly search 185 | # functions 186 | """Interface with the Genius.com API 187 | Attributes: 188 | base_url: (str) Top-most URL to access the Genius.com API with 189 | Methods: 190 | _load_credentials() 191 | OUTPUT: client_id, client_secret, client_access_token 192 | _make_api_request() 193 | INPUT: 194 | OUTPUT: 195 | """ 196 | 197 | # Genius API constants 198 | _API_URL = "https://api.genius.com/" 199 | _API_REQUEST_TYPES =\ 200 | {'song': 'songs/', 'artist': 'artists/', 201 | 'artist-songs': 'artists/songs/', 'search': 'search?q='} 202 | 203 | def __init__(self, client_access_token, client_secret='', client_id=''): 204 | self._CLIENT_ACCESS_TOKEN = client_access_token 205 | self._HEADER_AUTHORIZATION = 'Bearer ' + self._CLIENT_ACCESS_TOKEN 206 | 207 | def _make_api_request(self, request_term_and_type, page=1): 208 | """Send a request (song, artist, or search) to the Genius API, returning a json object 209 | INPUT: 210 | request_term_and_type: (tuple) (request_term, request_type) 211 | *request term* is a string. If *request_type* is 'search', then *request_term* is just 212 | what you'd type into the search box on Genius.com. If you have an song ID or an artist ID, 213 | you'd do this: self._make_api_request('2236','song') 214 | Returns a json object. 
215 | """ 216 | 217 | # TODO: This should maybe be a generator 218 | 219 | # The API request URL must be formatted according to the desired 220 | # request type""" 221 | api_request = self._format_api_request( 222 | request_term_and_type, page=page) 223 | 224 | # Add the necessary headers to the request 225 | request = Request(api_request) 226 | request.add_header("Authorization", self._HEADER_AUTHORIZATION) 227 | request.add_header("User-Agent", "LyricsGenius") 228 | while True: 229 | try: 230 | # timeout set to 4 seconds; automatically retries if times out 231 | response = urlopen(request, timeout=4) 232 | raw = response.read().decode('utf-8') 233 | except socket.timeout: 234 | print("Timeout raised and caught") 235 | continue 236 | break 237 | 238 | return json.loads(raw)['response'] 239 | 240 | def _format_api_request(self, term_and_type, page=1): 241 | """Format the request URL depending on the type of request""" 242 | 243 | request_term, request_type = str(term_and_type[0]), term_and_type[1] 244 | assert request_type in self._API_REQUEST_TYPES, "Unknown API request type" 245 | 246 | # TODO - Clean this up (might not need separate returns) 247 | if request_type == 'artist-songs': 248 | return self._API_URL + 'artists/' + quote(request_term) + '/songs?per_page=50&page=' + str(page) 249 | else: 250 | return self._API_URL + self._API_REQUEST_TYPES[request_type] + quote(request_term) 251 | 252 | def _scrape_song_lyrics_from_url(self, URL, remove_section_headers=False): 253 | """Use BeautifulSoup to scrape song info off of a Genius song URL""" 254 | page = requests.get(URL) 255 | html = BeautifulSoup(page.text, "html.parser") 256 | 257 | # Scrape the song lyrics from the HTML 258 | lyrics = html.find("div", class_="lyrics").get_text() 259 | if remove_section_headers: 260 | # Remove [Verse] and [Bridge] stuff 261 | lyrics = re.sub('(\[.*?\])*', '', lyrics) 262 | # Remove gaps between verses 263 | lyrics = re.sub('\n{2}', '\n', lyrics) 264 | 265 | return 
lyrics.strip('\n') 266 | 267 | def _clean_str(self, s): 268 | return s.translate(str.maketrans('', '', punctuation)).replace('\u200b', " ").strip().lower() 269 | 270 | def _result_is_lyrics(self, song_title): 271 | """Returns False if result from Genius is not actually song lyrics""" 272 | regex = re.compile( 273 | r"(tracklist)|(track list)|(album art(work)?)|(liner notes)|(booklet)|(credits)", re.IGNORECASE) 274 | return not regex.search(song_title) 275 | 276 | 277 | class Genius(_API): 278 | """User-level interface with the Genius.com API. User can search for songs (getting lyrics) and artists (getting songs)""" 279 | 280 | def search_song(self, song_title, artist_name="", take_first_result=False, verbose=True, remove_section_headers=True, remove_non_songs=True): 281 | # TODO: Should search_song() be a @classmethod? 282 | """Search Genius.com for *song_title* by *artist_name*""" 283 | 284 | # Perform a Genius API search for the song 285 | if verbose: 286 | if artist_name != "": 287 | print('Searching for "{0}" by {1}...'.format( 288 | song_title, artist_name)) 289 | else: 290 | print('Searching for "{0}"...'.format(song_title)) 291 | search_term = "{} {}".format(song_title, artist_name) 292 | 293 | json_search = self._make_api_request((search_term, 'search')) 294 | 295 | # Loop through search results, stopping as soon as title and artist of 296 | # result match request 297 | n_hits = min(10, len(json_search['hits'])) 298 | for i in range(n_hits): 299 | search_hit = json_search['hits'][i]['result'] 300 | found_song = self._clean_str(search_hit['title']) 301 | found_artist = self._clean_str( 302 | search_hit['primary_artist']['name']) 303 | 304 | # Download song from Genius.com if title and artist match the request 305 | if take_first_result or found_song == self._clean_str(song_title) and found_artist == self._clean_str(artist_name) or artist_name == "": 306 | 307 | # Remove non-song results (e.g. Linear Notes, Tracklists, etc.) 
308 | song_is_valid = self._result_is_lyrics(found_song) if remove_non_songs else True 309 | if song_is_valid: 310 | # Found correct song, accessing API ID 311 | json_song = self._make_api_request((search_hit['id'], 'song')) 312 | 313 | # Scrape the song's HTML for lyrics 314 | lyrics = self._scrape_song_lyrics_from_url(json_song['song']['url'], remove_section_headers) 315 | 316 | # Create the Song object 317 | song = Song(json_song, lyrics) 318 | 319 | if verbose: 320 | print('Done.') 321 | return song 322 | else: 323 | if verbose: 324 | print('Specified song does not contain lyrics. Rejecting.') 325 | return None 326 | 327 | if verbose: 328 | print('Specified song was not first result :(') 329 | return None 330 | 331 | def search_artist(self, artist_name, verbose=True, max_songs=None, take_first_result=False, get_full_song_info=True, remove_section_headers=False, remove_non_songs=True): 332 | """Allow user to search for an artist on the Genius.com database by supplying an artist name. 333 | Returns an Artist() object containing all songs for that particular artist.""" 334 | 335 | if verbose: 336 | print('Searching for songs by {0}...\n'.format(artist_name)) 337 | 338 | # Perform a Genius API search for the artist 339 | json_search = self._make_api_request((artist_name, 'search')) 340 | first_result, artist_id = None, None 341 | for hit in json_search['hits']: 342 | found_artist = hit['result']['primary_artist'] 343 | if first_result is None: 344 | first_result = found_artist 345 | artist_id = found_artist['id'] 346 | if take_first_result or self._clean_str(found_artist['name'].lower()) == self._clean_str(artist_name.lower()): 347 | artist_name = found_artist['name'] 348 | break 349 | else: 350 | # check for searched name in alternate artist names 351 | json_artist = self._make_api_request((artist_id, 'artist'))['artist'] 352 | if artist_name.lower() in [s.lower() for s in json_artist['alternate_names']]: 353 | if verbose: 354 | print("Found alternate name. 
Changing name to {}.".format(json_artist['name'])) 355 | artist_name = json_artist['name'] 356 | break 357 | artist_id = None 358 | 359 | if first_result is not None and artist_id is None and verbose: 360 | if input("Couldn't find {}. Did you mean {}? (y/n): ".format(artist_name, first_result['name'])).lower() == 'y': 361 | artist_name, artist_id = first_result['name'], first_result['id'] 362 | assert (not isinstance(artist_id, type(None))), "Could not find artist. Check spelling?" 363 | 364 | # Make Genius API request for the determined artist ID 365 | json_artist = self._make_api_request((artist_id, 'artist')) 366 | # Create the Artist object 367 | artist = Artist(json_artist) 368 | 369 | if max_songs is None or max_songs > 0: 370 | # Access the api_path found by searching 371 | artist_search_results = self._make_api_request((artist_id, 'artist-songs')) 372 | 373 | # Download each song by artist, store as Song objects in Artist object 374 | keep_searching = True 375 | next_page = 0 376 | n = 0 377 | while keep_searching: 378 | for json_song in artist_search_results['songs']: 379 | # TODO: Shouldn't I use self.search_song() here? 380 | 381 | # Songs must have a title 382 | if 'title' not in json_song: 383 | json_song['title'] = 'MISSING TITLE' 384 | 385 | # Remove non-song results (e.g. Linear Notes, Tracklists, etc.) 
386 | song_is_valid = self._result_is_lyrics(json_song['title']) if remove_non_songs else True 387 | 388 | if song_is_valid: 389 | # Scrape song lyrics from the song's HTML 390 | lyrics = self._scrape_song_lyrics_from_url(json_song['url'], remove_section_headers) 391 | 392 | # Create song object for current song 393 | if get_full_song_info: 394 | song = Song(self._make_api_request((json_song['id'], 'song')), lyrics) 395 | else: 396 | song = Song({'song': json_song}, lyrics) # Faster, less info from API 397 | 398 | # Add song to the Artist object 399 | if artist.add_song(song, verbose=False) == 0: 400 | # print("Add song: {}".format(song.title)) 401 | n += 1 402 | if verbose: 403 | print('Song {0}: "{1}"'.format(n, song.title)) 404 | 405 | else: # Song does not contain lyrics 406 | if verbose: 407 | print('"{title}" does not contain lyrics. Rejecting.'.format(title=json_song['title'])) 408 | 409 | # Check if user specified a max number of songs for the artist 410 | if not isinstance(max_songs, type(None)): 411 | if artist.num_songs >= max_songs: 412 | keep_searching = False 413 | if verbose: 414 | print('\nReached user-specified song limit ({0}).'.format(max_songs)) 415 | break 416 | 417 | # Move on to next page of search results 418 | next_page = artist_search_results['next_page'] 419 | if next_page == None: 420 | break 421 | else: # Get next page of artist song results 422 | artist_search_results = self._make_api_request((artist_id, 'artist-songs'), page=next_page) 423 | 424 | if verbose: 425 | print('Found {n_songs} songs.'.format(n_songs=artist.num_songs)) 426 | 427 | if verbose: 428 | print('Done.') 429 | 430 | return artist 431 | 432 | def search_album(self, artist_name, album_title): 433 | """Get all lyrics from an album and save them in a json file""" 434 | 435 | # genius finds the artist 436 | artist = self.search_artist(artist_name, max_songs=0) 437 | # modify artist_name and album_title so that they will lead us to the album page on Genius.com 438 | 
artist_name = artist._body['name'] 439 | for ch in [',', "/", " ", '$', ';', ':', '(', ')', '[', ']', '----', '---', '--']: 440 | if ch in artist_name: 441 | artist_name = artist_name.replace(ch, "-") 442 | if ch in album_title: 443 | album_title = album_title.replace(ch, "-") 444 | for ch in ['.', "\"", "'"]: 445 | if ch in artist_name: 446 | artist_name = artist_name.replace(ch, "") 447 | if ch in album_title: 448 | album_title = album_title.replace(ch, "-") 449 | for ch in ['é', 'è', 'ê', 'ë']: 450 | if ch in artist_name: 451 | artist_name = artist_name.replace(ch, "e") 452 | if ch in album_title: 453 | album_title = album_title.replace(ch, "e") 454 | 455 | artist_name = artist_name.replace("&", "and") 456 | album_title = album_title.replace("&", "").replace("#", "").replace("--", "-") 457 | 458 | while artist_name[-1] == '-': 459 | artist_name = artist_name[:-1] 460 | while album_title[-1] == '-': 461 | album_title = album_title[:-1] 462 | # create index 463 | index = [] 464 | # get the album page on Genius.com 465 | r = requests.get('https://genius.com/albums/' + artist_name + "/" + album_title) 466 | soup = BeautifulSoup(r.text, 'html.parser') 467 | # get the html section indicating if the album isn't found 468 | not_found = soup.find('h1', attrs={'class': 'render_404-headline'}) 469 | if not_found != None and "Page not found" in not_found.text: 470 | print("Album not found.") 471 | return None 472 | # get the html section indicating if the song is missing lyrics 473 | missing = soup.find_all('div', attrs={'class': 'chart_row-metadata_element chart_row-metadata_element--large'}) 474 | miss_nb = 0 475 | # count the number of songs without lyrics 476 | for miss in missing: 477 | if miss.text.find("(Missing Lyrics)") >= 0 or miss.text.find("(Unreleased)") >= 0: 478 | miss_nb += 1 479 | divi = soup.find_all('div', attrs={'class': 'column_layout-column_span column_layout-column_span--primary'}) 480 | for div in divi: 481 | var = 0 482 | # get the html section 
indicating the track numbers (this will be to eliminate sections similar to those of songs but are actually of tracklist or credits of the album) 483 | mdiv = div.find_all('span', attrs={'class': 'chart_row-number_container-number chart_row-number_container-number--gray'}) 484 | for mindiv in mdiv: 485 | nb = mindiv.text.replace("\n", "") 486 | if nb != "": 487 | index.append(nb) 488 | # create a list holding the tracks' titles 489 | df = [] 490 | ndiv = div.find_all('h3', attrs={'class': 'chart_row-content-title'}) 491 | for mindiv in ndiv: 492 | tt = mindiv.text.replace("\n", "").strip() 493 | # getting rid of the featurings in the title 494 | if tt.find("(Ft") >= 0: 495 | tt = tt.split(" (Ft.", 1)[0] 496 | else: 497 | # getting ride of "lyrics" at the end of the title 498 | tt = tt.rsplit(" ", 1)[0].strip() 499 | df.append(tt) 500 | var += 1 501 | if var == len(index): 502 | break 503 | # loop to add song with title from the list 504 | for track in df: 505 | # search the song 506 | song = self.search_song(track, artist.name) 507 | # if the song was found, it's added to artist 508 | if song != None: 509 | artist.add_song(song) 510 | # if the song wasn't found, it might be because it's formatted in this way : "title by other artist" 511 | elif track.find("by") >= 0: 512 | s_artist_name = track.replace("\xa0", " ").rsplit(" by ", 1)[1] 513 | s_artist = self.search_artist(s_artist_name, max_songs=0) 514 | track = track.replace("\xa0", " ").rsplit(" by ", 1)[0].strip() 515 | # look for song with other artist 516 | song = self.search_song(track, s_artist.name) 517 | if song != None: 518 | # add song to the album's main artist 519 | artist.add_song(song) 520 | else: 521 | print("Missing lyrics") 522 | else: 523 | print("Missing lyrics") 524 | artist.save_lyrics(format='json') 525 | if miss_nb == 1: 526 | print("{} song was ignored due to missing lyrics.".format(miss_nb)) 527 | elif miss_nb > 1: 528 | print("{} songs were ignored due to missing lyrics.".format(miss_nb)) 

    def save_artists(self, artists, filename="artist_lyrics", overwrite=False):
        """Pass a list of Artist objects to save multiple artists"""
        if isinstance(artists, Artist):
            artists = [artists]
        assert isinstance(artists, list), "Must pass in list of Artist objects."

        # Create a temporary directory for lyrics
        tmp_dir = 'tmp_lyrics'
        if not os.path.isdir(tmp_dir):
            os.mkdir(tmp_dir)
            tmp_count = 0
        else:
            tmp_count = len(os.listdir('./' + tmp_dir))

        # Check if file already exists; ask before overwriting unless overwrite=True
        if os.path.isfile(filename + ".json") and not overwrite:
            if input("{} already exists. Overwrite?\n(y/n): ".format(filename + ".json")).lower() != 'y':
                print("Leaving file in place. Exiting.")
                # os.rmdir() only succeeds on an empty directory
                if not os.listdir(tmp_dir):
                    os.rmdir(tmp_dir)
                return

        # Extract each artist's lyrics in json format
        all_lyrics = {'artists': []}
        for n, artist in enumerate(artists):
            if isinstance(artist, Artist):
                tmp_file = "./{dir}/tmp_{num}_{name}".format(dir=tmp_dir, num=n + tmp_count, name=artist.name.replace(" ", ""))
                print(tmp_file)
                all_lyrics['artists'].append(artist.save_lyrics(filename=tmp_file, overwrite=True))
            else:
                warn("Item #{} was not of type Artist. Skipping.".format(n))

        # Save all of the lyrics
        with open(filename + '.json', 'w') as outfile:
            json.dump(all_lyrics, outfile)


# To get an album's lyrics:
#
# Get the token by signing in on the Genius website https://genius.com/api-clients
# client_access_token = 'YOUR_TOKEN_HERE'
# api = Genius(client_access_token)
# api.search_album("DJ Khaled", "Grateful")

--------------------------------------------------------------------------------