├── Ranking.png
├── requirements.txt
├── README.md
├── textm.py
└── api.py


/Ranking.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Hugo-Nattagh/2017-Hip-Hop/HEAD/Ranking.png


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | - For the api.py script:
 2 | 
 3 | You need a token to use the Genius.com API.
 4 | You'll get one if you sign up to the website -> https://genius.com/api-clients
 5 | 
 6 | pip install bs4
 7 | pip install lyricsgenius
 8 | 
 9 | 
10 | 
11 | 
12 | - For the textm.py script:
13 | 
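14 | pip install pandas
15 | pip install nltk
16 | pip install textblob
17 | pip install vaderSentiment
18 | 
19 | The NLTK data used by the script (stop words, and WordNet for lemmatization) must be downloaded once:
20 | 
21 | python -m nltk.downloader stopwords wordnet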


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # 2017Hip-Hop
 2 | 
 3 | ### 2017 Rap Albums’ Text Mining and Sentiment Analysis
 4 | 
 5 | A lot of rappers claim to deliver a positive message to the world, which can often be challenging to believe when listening to the lyrics.
 6 | 
 7 | How about putting some albums to the test with sentiment analysis?
 8 | 
 9 | Sentiment analysis is a natural language processing technique used on texts to detect the overall opinion of the author (negative, neutral or positive).
10 | 
11 | #### The Data
12 | 
13 | I knew [Genius.com](https://genius.com/) had to be one of the best databases for this, since its lyrics are transcribed by its community.
14 | 
15 | To get the lyrics, I used [lyricsGenius](https://github.com/johnwmillr/LyricsGenius), a Python library by [Johnwmillr](https://github.com/johnwmillr).
16 | 
17 | It’s very useful, but unfortunately it cannot fetch all the lyrics of an album at once; I had to run a query for each song.
18 | 
19 | I forked his repo and tried my hand at improving his script but realized that his library is using Genius.com’s API, which only allows requests for artists and songs.
20 | 
21 | So I worked around it, though in a way that doesn't let my modifications fit into his library.
22 | 
23 | I used Beautiful Soup to scrape the songs’ titles from the album page, and made a loop to request all those songs with lyricsGenius.
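The title-scraping step can be sketched without any network access. This is only an illustration: it swaps Beautiful Soup for the standard library's `html.parser` so it runs with no dependencies, `SAMPLE_HTML` is a made-up, simplified stand-in for an album page, and `TitleParser`, `clean_title` and `album_titles` are hypothetical names. The `chart_row-content-title` class is the one `api.py` looks for.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for a Genius album page.
SAMPLE_HTML = """
<div>
  <h3 class="chart_row-content-title">First Song (Ft. Some Guest) Lyrics</h3>
  <h3 class="chart_row-content-title">Second Song Lyrics</h3>
</div>
"""


class TitleParser(HTMLParser):
    """Collect the text of every <h3 class="chart_row-content-title">."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h3" and ("class", "chart_row-content-title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def clean_title(raw):
    """Drop the featured artists, or else the trailing "Lyrics" word."""
    title = " ".join(raw.split())
    if " (Ft." in title:
        return title.split(" (Ft.", 1)[0]
    return title.rsplit(" ", 1)[0]


def album_titles(html):
    """Return the cleaned track titles found in the page."""
    parser = TitleParser()
    parser.feed(html)
    return [clean_title(t) for t in parser.titles]


titles = album_titles(SAMPLE_HTML)
print(titles)
```

In the real script, each cleaned title is then passed to `search_song` together with the artist's name, one request per track.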
24 | 
25 | #### What's going on
26 | 
27 | I wrote a script to process the lyrics of each song to optimize the sentiment analysis.
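A rough sketch of that cleaning step, in plain Python. It assumes a tiny hand-written stop-word set (`STOP_WORDS`, hypothetical) instead of NLTK's English list, and it leaves out the lemmatization that `textm.py` does with TextBlob, so it only shows the order of operations: flatten line breaks, lowercase, strip punctuation, drop stop words.

```python
import re

# Tiny illustrative stop-word set; textm.py uses NLTK's English list instead.
STOP_WORDS = {"i", "the", "a", "to", "and", "is", "it", "in", "my", "you"}


def preprocess(lyrics):
    """Lowercase the lyrics, strip punctuation, and drop stop words."""
    text = lyrics.replace("\n", " ").lower()
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


cleaned = preprocess("I got the keys,\nthe keys, the KEYS!")
print(cleaned)
```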
28 | 
29 | I used TextBlob and Vader for the sentiment analysis of each song of an album, then calculated the average for each album.
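The averaging itself is simple arithmetic. The sketch below uses made-up per-song numbers (not results from this project) and hypothetical helper names; it mirrors what `textm.py` does: a net Vader score per song (positive share minus negative share), a per-album mean for each tool, and a final value that is the mean of the two.

```python
def vader_net(pos, neg):
    """Net Vader-style score for one song: positive share minus negative share."""
    return pos - neg


def album_average(song_scores):
    """Mean of the per-song scores, rounded to 3 decimals."""
    return round(sum(song_scores) / len(song_scores), 3)


# Made-up per-song scores for a three-track album (illustrative only).
textblob_scores = [0.10, -0.05, 0.25]
vader_scores = [vader_net(0.20, 0.10), vader_net(0.10, 0.20), vader_net(0.30, 0.10)]

avg_textblob = album_average(textblob_scores)
avg_vader = album_average(vader_scores)
final_average = (avg_textblob + avg_vader) / 2  # value used for the ranking

print(avg_textblob, avg_vader, final_average)
```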
30 | 
31 | The results were not what I was hoping for, but they made sense. Sentiment analysis is usually applied to tweets or reviews, not to song lyrics, and especially not to rap, which is full of slang, wordplay and double entendres. A song’s length also adds many neutral expressions, which pulls the average toward zero.
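The pull toward zero can be seen with a toy example. This is not how TextBlob or Vader actually compute their scores; it is a deliberately naive average over a three-word lexicon (`LEXICON`, hypothetical values) that only shows how neutral words dilute a mean.

```python
# Toy polarity lexicon (hypothetical values, far smaller than a real one).
LEXICON = {"love": 1.0, "great": 1.0, "hate": -1.0}


def mean_polarity(text):
    """Average lexicon score over all words; unknown words count as 0."""
    words = text.lower().split()
    return sum(LEXICON.get(w, 0.0) for w in words) / len(words)


short_text = "love this great song"
# Same two scored words, plus forty neutral ones.
long_text = short_text + " " + "and then we went down the road again " * 5

print(mean_polarity(short_text))
print(mean_polarity(long_text))
```

The two scored words are identical in both texts, yet the longer text ends up much closer to zero simply because it has more neutral words.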
32 | 
33 | Long story short, sentiment analysis is, for now, not the way to measure an album’s positivity.
34 | 
35 | I visualized it anyway, to see the relative ranking with this technique.
36 | 
37 | ![alt text](https://github.com/Hugo-Nattagh/2017-Hip-Hop/blob/master/Ranking.png)
38 | 
39 | By doing this project, I warmed up to text mining and sentiment analysis in Python, and I’d like to go on with another year, all the verses of an artist, or a whole discography... But I would need to take a closer look at the existing dictionaries to adjust them, if possible, to rap lyrics. Or maybe remove all the expressions that are too neutral.
40 | 
41 | #### Files Description
42 | 
43 | - `api.py`: Script to get the albums' lyrics in json files
44 | - `textm.py`: Script to perform sentiment analysis on the json files
45 | - `Ranking.png`: Visualization of my results
46 | - `requirements.txt`: What you need in order to make it work
47 | 


--------------------------------------------------------------------------------
/textm.py:
--------------------------------------------------------------------------------
 1 | import pandas as pd
 2 | import os
 3 | from nltk.corpus import stopwords
 4 | from textblob import TextBlob
 5 | from textblob import Word
 6 | from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
 7 | 
 8 | pd.options.display.max_colwidth = 17
 9 | 
10 | analyzer = SentimentIntensityAnalyzer()
11 | 
12 | 
13 | def s_analysis(file_df):
14 |     lyrics = pd.DataFrame(data=file_df)
15 |     album = lyrics['songs'][0]['album']
16 |     artist = lyrics['artist'][0]
17 |     lyrics['title'] = lyrics['songs'][0]['title']
18 | 
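19 |     # Assign through .loc: chained indexing (lyrics['title'][n] = ...) may write to a
20 |     # temporary copy, which pandas warns about and which fails under copy-on-write.
21 |     for n in range(len(lyrics)):
22 |         lyrics.loc[n, 'title'] = lyrics['songs'][n]['title']
23 |         lyrics.loc[n, 'songs'] = lyrics['songs'][n]['lyrics'].replace('\n', ' ')
24 | 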
25 | 
26 |     # pprint(lyrics)
27 | 
28 |     # 1) Basic Feature Extraction
29 | 
30 |     stop = stopwords.words('english')
31 |     lyrics['stopwords'] = lyrics['songs'].apply(lambda x: len([word for word in x.split() if word in stop]))
32 | 
33 |     # 2) Basic Pre-Processing
34 | 
35 |     lyrics['songs'] = lyrics['songs'].apply(lambda x: " ".join(x.lower() for x in x.split()))
36 |     lyrics['songs'] = lyrics['songs'].str.replace(r'[^\w\s]', '', regex=True)
37 |     lyrics['songs'] = lyrics['songs'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
38 | 
39 |     TextBlob(lyrics['songs'][0]).words
40 |     lyrics['songs'] = lyrics['songs'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
41 |     lyrics['sentimentTxtB'] = lyrics['songs'].apply(lambda x: TextBlob(x).sentiment[0])
42 |     lyrics['sentimentTxtB'] = lyrics['sentimentTxtB'].round(3)
43 |     # lyrics = lyrics.sort_values('sentimentTxtB', ascending=False)
44 | 
45 |     lyrics['sentimentVaderPos'] = lyrics['songs'].apply(lambda x: analyzer.polarity_scores(x)['pos'])
46 |     lyrics['sentimentVaderNeg'] = lyrics['songs'].apply(lambda x: analyzer.polarity_scores(x)['neg'])
47 | 
48 |     lyrics['Vader'] = lyrics['sentimentVaderPos'] - lyrics['sentimentVaderNeg']
49 | 
50 |     nb_rows = len(lyrics.index)
51 |     total_sent = sum(lyrics['sentimentTxtB'])
52 |     avg = total_sent / nb_rows
53 | 
54 |     total_vad = sum(lyrics['Vader'])
55 |     avg_vad = total_vad / nb_rows
56 | 
57 |     print(lyrics[['title', 'songs', 'sentimentTxtB', 'Vader']])
58 |     avg = round(avg, 3)
59 |     avg_vad = round(avg_vad, 3)
60 |     print(album + ' by ' + artist + ' - TextBlob: ' + str(avg) + ", Vader: " + str(avg_vad))
61 |     return album, artist, avg, avg_vad
62 | 
63 | 
64 | index = len([filename for filename in os.listdir(os.getcwd()) if filename.endswith(".json")])
65 | varb = 0
66 | avg_df = pd.DataFrame(index=range(index), columns=['album', 'artist', 'sent_avg_txtB', 'sent_avg_vad'])
67 | 
68 | for filename in os.listdir(os.getcwd()):
69 |     if filename.endswith(".json"):
70 |         df = pd.read_json(filename)
71 |         avg_df.loc[varb] = list(s_analysis(df))
72 |         varb += 1
73 |         continue
74 |     else:
75 |         continue
76 | 
77 | avg_df['final_average'] = (avg_df['sent_avg_txtB'] + avg_df['sent_avg_vad']) / 2
78 | 
79 | avg_df = avg_df.sort_values(by=['final_average'], ascending=False)
80 | print(avg_df)
81 | 


--------------------------------------------------------------------------------
/api.py:
--------------------------------------------------------------------------------
  1 | #  Usage:
  2 | #    import genius
  3 | #    api = genius.Genius('my_client_access_token_here')
  4 | #    artist = api.search_artist('Andy Shauf', max_songs=5)
  5 | #    print(artist)
  6 | #
  7 | #    song = api.search_song('To You', artist.name)
  8 | #    artist.add_song(song)
  9 | #    print(artist)
 10 | #    print(artist.songs[-1])
 11 | #
 12 | #  For an album, see at the end
 13 | 
 14 | 
 15 | import sys
 16 | from urllib.request import Request, urlopen, quote
 17 | import os
 18 | import re
 19 | import requests
 20 | import socket
 21 | import json
 22 | from bs4 import BeautifulSoup
 23 | from string import punctuation
 24 | import time
 25 | from warnings import warn
 26 | 
 27 | from lyricsgenius.song import Song
 28 | 
 29 | 
 30 | class Artist(object):
 31 |     """An artist from the Genius.com database.
 32 | 
 33 |     Attributes:
 34 |         name: (str) Artist name.
 35 |         num_songs: (int) Total number of songs listed on Genius.com
 36 | 
 37 |     """
 38 | 
 39 |     def __init__(self, json_dict):
 40 |         """Populate the Artist object with the data from *json_dict*"""
 41 |         self._body = json_dict['artist']
 42 |         self._url = self._body['url']
 43 |         self._api_path = self._body['api_path']
 44 |         self._id = self._body['id']
 45 |         self._songs = []
 46 |         self._num_songs = len(self._songs)
 47 | 
 48 |     def __len__(self):
 49 |         return 1
 50 | 
 51 |     @property
 52 |     def name(self):
 53 |         return self._body['name']
 54 | 
 55 |     @property
 56 |     def image_url(self):
 57 |         try:
 58 |             return self._body['image_url']
 59 |         except KeyError:
 60 |             return None
 61 | 
 62 |     @property
 63 |     def songs(self):
 64 |         return self._songs
 65 | 
 66 |     @property
 67 |     def num_songs(self):
 68 |         return self._num_songs
 69 | 
 70 |     def add_song(self, newsong, verbose=True):
 71 |         """Add a Song object to the Artist object"""
 72 | 
 73 |         if any([song.title == newsong.title for song in self._songs]):
 74 |             if verbose:
 75 |                 print('{newsong.title} already in {self.name}, not adding song.'.format(newsong=newsong, self=self))
 76 |             return 1  # Failure
 77 |         if newsong.artist == self.name:
 78 |             self._songs.append(newsong)
 79 |             self._num_songs += 1
 80 |             return 0  # Success
 81 |         else:
 82 |             self._songs.append(newsong)
 83 |             self._num_songs += 1
 84 |             print("Song by {newsong.artist} was added to {self.name}.".format(newsong=newsong, self=self))
 85 |             return 0  # Success
 86 | 
 87 |     def get_song(self, song_name):
 88 |         """Search Genius.com for *song_name* and add it to artist"""
 89 |         raise NotImplementedError("I need to figure out how to allow Artist() to access search_song().")
 90 |         song = Genius.search_song(song_name, self.name)
 91 |         self.add_song(song)
 92 |         return
 93 | 
 94 |     # TODO: define an export_to_json() method
 95 | 
 96 |     def save_lyrics(self, format='json', filename=None,
 97 |                     overwrite=False, skip_duplicates=True, verbose=True):
 98 |         """Allows user to save all lyrics within an Artist object to a .json or .txt file."""
 99 |         if format[0] == '.':
100 |             format = format[1:]
101 |         assert (format == 'json') or (format == 'txt'), "Format must be json or txt"
102 | 
103 |         # We want to reject songs that have already been added to artist collection
104 |         def songsAreSame(s1, s2):
105 |             from difflib import SequenceMatcher as sm  # For comparing similarity of lyrics
106 |             # Idea credit: https://bigishdata.com/2016/10/25/talkin-bout-trucks-beer-and-love-in-country-songs-analyzing-genius-lyrics/
107 |             seqA = sm(None, s1.lyrics, s2['lyrics'])
108 |             seqB = sm(None, s2['lyrics'], s1.lyrics)
109 |             return seqA.ratio() > 0.5 or seqB.ratio() > 0.5
110 | 
111 |         def songInArtist(new_song):
112 |             # artist_lyrics is global (works in Jupyter notebook)
113 |             for song in lyrics_to_write['songs']:
114 |                 if songsAreSame(new_song, song):
115 |                     return True
116 |             return False
117 | 
118 |         # Determine the filename
119 |         if filename is None:
120 |             filename = "Lyrics_{}.{}".format(self.name.replace(" ", ""), format)
121 |         else:
122 |             if filename.rfind('.') != -1:
123 |                 filename = filename[:filename.rfind('.')] + '.' + format
124 |             else:
125 |                 filename = filename + '.' + format
126 | 
127 |         # Check if file already exists
128 |         write_file = False
129 |         if not os.path.isfile(filename):
130 |             write_file = True
131 |         elif overwrite:
132 |             write_file = True
133 |         else:
134 |             if input("{} already exists. Overwrite?\n(y/n): ".format(filename)).lower() == 'y':
135 |                 write_file = True
136 | 
137 |         # Format lyrics in either .txt or .json format
138 |         if format == 'json':
139 |             lyrics_to_write = {'songs': [], 'artist': self.name}
140 |             for song in self.songs:
141 |                 if skip_duplicates is False or not songInArtist(song):  # This takes way too long! It's basically O(n^2), can I do better?
142 |                     lyrics_to_write['songs'].append({})
143 |                     lyrics_to_write['songs'][-1]['title'] = song.title
144 |                     lyrics_to_write['songs'][-1]['album'] = song.album
145 |                     lyrics_to_write['songs'][-1]['year'] = song.year
146 |                     lyrics_to_write['songs'][-1]['lyrics'] = song.lyrics
147 |                     lyrics_to_write['songs'][-1]['image'] = song.song_art_image_url
148 |                     lyrics_to_write['songs'][-1]['artist'] = self.name
149 |                     lyrics_to_write['songs'][-1]['raw'] = song._body
150 |                 else:
151 |                     if verbose:
152 |                         print("SKIPPING \"{}\" -- already found in artist collection.".format(song.title))
153 |         else:
154 |             lyrics_to_write = " ".join([s.lyrics + 5 * '\n' for s in self.songs])
155 | 
156 |         # Write the lyrics to either a .json or .txt file
157 |         if write_file:
158 |             with open(filename, 'w') as lyrics_file:
159 |                 if format == 'json':
160 |                     json.dump(lyrics_to_write, lyrics_file)
161 |                 else:
162 |                     lyrics_file.write(lyrics_to_write)
163 |             if verbose:
164 |                 print('Wrote {} songs to {}.'.format(self.num_songs, filename))
165 |         else:
166 |             if verbose:
167 |                 print('Skipping file save.\n')
168 |         return lyrics_to_write
169 | 
170 |     def __str__(self):
171 |         """Return a string representation of the Artist object."""
172 |         if self._num_songs == 1:
173 |             return '{0}, {1} song'.format(self.name, self._num_songs)
174 |         else:
175 |             return '{0}, {1} songs'.format(self.name, self._num_songs)
176 | 
177 |     def __repr__(self):
178 |         return repr((self.name, '{0} songs'.format(self._num_songs)))
179 | 
180 | 
181 | class _API(object):
182 |     # This is a superclass that Genius() inherits from. Not sure if this makes any sense, but it
183 |     # seemed like a good idea to have this class (more removed from user) handle the lower-level
184 |     # interaction with the Genius API, and then Genius() has the more user-friendly search
185 |     # functions
186 |     """Interface with the Genius.com API
187 |     Attributes:
188 |         base_url: (str) Top-most URL to access the Genius.com API with
189 |     Methods:
190 |         _load_credentials()
191 |             OUTPUT: client_id, client_secret, client_access_token
192 |         _make_api_request()
193 |             INPUT:
194 |             OUTPUT:
195 |     """
196 | 
197 |     # Genius API constants
198 |     _API_URL = "https://api.genius.com/"
199 |     _API_REQUEST_TYPES =\
200 |         {'song': 'songs/', 'artist': 'artists/',
201 |             'artist-songs': 'artists/songs/', 'search': 'search?q='}
202 | 
203 |     def __init__(self, client_access_token, client_secret='', client_id=''):
204 |         self._CLIENT_ACCESS_TOKEN = client_access_token
205 |         self._HEADER_AUTHORIZATION = 'Bearer ' + self._CLIENT_ACCESS_TOKEN
206 | 
207 |     def _make_api_request(self, request_term_and_type, page=1):
208 |         """Send a request (song, artist, or search) to the Genius API, returning a json object
209 |         INPUT:
210 |             request_term_and_type: (tuple) (request_term, request_type)
211 |         *request term* is a string. If *request_type* is 'search', then *request_term* is just
212 |         what you'd type into the search box on Genius.com. If you have a song ID or an artist ID,
213 |         you'd do this: self._make_api_request(('2236', 'song'))
214 |         Returns a json object.
215 |         """
216 | 
217 |         # TODO: This should maybe be a generator
218 | 
219 |         # The API request URL must be formatted according to the desired
220 |         # request type
221 |         api_request = self._format_api_request(
222 |             request_term_and_type, page=page)
223 | 
224 |         # Add the necessary headers to the request
225 |         request = Request(api_request)
226 |         request.add_header("Authorization", self._HEADER_AUTHORIZATION)
227 |         request.add_header("User-Agent", "LyricsGenius")
228 |         while True:
229 |             try:
230 |                 # timeout set to 4 seconds; automatically retries if times out
231 |                 response = urlopen(request, timeout=4)
232 |                 raw = response.read().decode('utf-8')
233 |             except socket.timeout:
234 |                 print("Timeout raised and caught")
235 |                 continue
236 |             break
237 | 
238 |         return json.loads(raw)['response']
239 | 
240 |     def _format_api_request(self, term_and_type, page=1):
241 |         """Format the request URL depending on the type of request"""
242 | 
243 |         request_term, request_type = str(term_and_type[0]), term_and_type[1]
244 |         assert request_type in self._API_REQUEST_TYPES, "Unknown API request type"
245 | 
246 |         # TODO - Clean this up (might not need separate returns)
247 |         if request_type == 'artist-songs':
248 |             return self._API_URL + 'artists/' + quote(request_term) + '/songs?per_page=50&page=' + str(page)
249 |         else:
250 |             return self._API_URL + self._API_REQUEST_TYPES[request_type] + quote(request_term)
251 | 
252 |     def _scrape_song_lyrics_from_url(self, URL, remove_section_headers=False):
253 |         """Use BeautifulSoup to scrape song info off of a Genius song URL"""
254 |         page = requests.get(URL)
255 |         html = BeautifulSoup(page.text, "html.parser")
256 | 
257 |         # Scrape the song lyrics from the HTML
258 |         lyrics = html.find("div", class_="lyrics").get_text()
259 |         if remove_section_headers:
260 |             # Remove [Verse] and [Bridge] stuff
261 |             lyrics = re.sub(r'(\[.*?\])*', '', lyrics)
262 |             # Remove gaps between verses
263 |             lyrics = re.sub('\n{2}', '\n', lyrics)
264 | 
265 |         return lyrics.strip('\n')
266 | 
267 |     def _clean_str(self, s):
268 |         return s.translate(str.maketrans('', '', punctuation)).replace('\u200b', " ").strip().lower()
269 | 
270 |     def _result_is_lyrics(self, song_title):
271 |         """Returns False if result from Genius is not actually song lyrics"""
272 |         regex = re.compile(
273 |             r"(tracklist)|(track list)|(album art(work)?)|(liner notes)|(booklet)|(credits)", re.IGNORECASE)
274 |         return not regex.search(song_title)
275 | 
276 | 
277 | class Genius(_API):
278 |     """User-level interface with the Genius.com API. User can search for songs (getting lyrics) and artists (getting songs)"""
279 | 
280 |     def search_song(self, song_title, artist_name="", take_first_result=False, verbose=True, remove_section_headers=True, remove_non_songs=True):
281 |         # TODO: Should search_song() be a @classmethod?
282 |         """Search Genius.com for *song_title* by *artist_name*"""
283 | 
284 |         # Perform a Genius API search for the song
285 |         if verbose:
286 |             if artist_name != "":
287 |                 print('Searching for "{0}" by {1}...'.format(
288 |                     song_title, artist_name))
289 |             else:
290 |                 print('Searching for "{0}"...'.format(song_title))
291 |         search_term = "{} {}".format(song_title, artist_name)
292 | 
293 |         json_search = self._make_api_request((search_term, 'search'))
294 | 
295 |         # Loop through search results, stopping as soon as title and artist of
296 |         # result match request
297 |         n_hits = min(10, len(json_search['hits']))
298 |         for i in range(n_hits):
299 |             search_hit = json_search['hits'][i]['result']
300 |             found_song = self._clean_str(search_hit['title'])
301 |             found_artist = self._clean_str(
302 |                 search_hit['primary_artist']['name'])
303 | 
304 |             # Download song from Genius.com if title and artist match the request
305 |             if take_first_result or (found_song == self._clean_str(song_title) and found_artist == self._clean_str(artist_name)) or artist_name == "":
306 | 
307 |                 # Remove non-song results (e.g. Liner Notes, Tracklists, etc.)
308 |                 song_is_valid = self._result_is_lyrics(found_song) if remove_non_songs else True
309 |                 if song_is_valid:
310 |                     # Found correct song, accessing API ID
311 |                     json_song = self._make_api_request((search_hit['id'], 'song'))
312 | 
313 |                     # Scrape the song's HTML for lyrics
314 |                     lyrics = self._scrape_song_lyrics_from_url(json_song['song']['url'], remove_section_headers)
315 | 
316 |                     # Create the Song object
317 |                     song = Song(json_song, lyrics)
318 | 
319 |                     if verbose:
320 |                         print('Done.')
321 |                     return song
322 |                 else:
323 |                     if verbose:
324 |                         print('Specified song does not contain lyrics. Rejecting.')
325 |                     return None
326 | 
327 |         if verbose:
328 |             print('Specified song was not first result :(')
329 |         return None
330 | 
331 |     def search_artist(self, artist_name, verbose=True, max_songs=None, take_first_result=False, get_full_song_info=True, remove_section_headers=False, remove_non_songs=True):
332 |         """Allow user to search for an artist on the Genius.com database by supplying an artist name.
333 |         Returns an Artist() object containing all songs for that particular artist."""
334 | 
335 |         if verbose:
336 |             print('Searching for songs by {0}...\n'.format(artist_name))
337 | 
338 |         # Perform a Genius API search for the artist
339 |         json_search = self._make_api_request((artist_name, 'search'))
340 |         first_result, artist_id = None, None
341 |         for hit in json_search['hits']:
342 |             found_artist = hit['result']['primary_artist']
343 |             if first_result is None:
344 |                 first_result = found_artist
345 |             artist_id = found_artist['id']
346 |             if take_first_result or self._clean_str(found_artist['name'].lower()) == self._clean_str(artist_name.lower()):
347 |                 artist_name = found_artist['name']
348 |                 break
349 |             else:
350 |                 # check for searched name in alternate artist names
351 |                 json_artist = self._make_api_request((artist_id, 'artist'))['artist']
352 |                 if artist_name.lower() in [s.lower() for s in json_artist['alternate_names']]:
353 |                     if verbose:
354 |                         print("Found alternate name. Changing name to {}.".format(json_artist['name']))
355 |                     artist_name = json_artist['name']
356 |                     break
357 |                 artist_id = None
358 | 
359 |         if first_result is not None and artist_id is None and verbose:
360 |             if input("Couldn't find {}. Did you mean {}? (y/n): ".format(artist_name, first_result['name'])).lower() == 'y':
361 |                 artist_name, artist_id = first_result['name'], first_result['id']
362 |         assert (not isinstance(artist_id, type(None))), "Could not find artist. Check spelling?"
363 | 
364 |         # Make Genius API request for the determined artist ID
365 |         json_artist = self._make_api_request((artist_id, 'artist'))
366 |         # Create the Artist object
367 |         artist = Artist(json_artist)
368 | 
369 |         if max_songs is None or max_songs > 0:
370 |             # Access the api_path found by searching
371 |             artist_search_results = self._make_api_request((artist_id, 'artist-songs'))
372 | 
373 |             # Download each song by artist, store as Song objects in Artist object
374 |             keep_searching = True
375 |             next_page = 0
376 |             n = 0
377 |             while keep_searching:
378 |                 for json_song in artist_search_results['songs']:
379 |                     # TODO: Shouldn't I use self.search_song() here?
380 | 
381 |                     # Songs must have a title
382 |                     if 'title' not in json_song:
383 |                         json_song['title'] = 'MISSING TITLE'
384 | 
385 |                     # Remove non-song results (e.g. Liner Notes, Tracklists, etc.)
386 |                     song_is_valid = self._result_is_lyrics(json_song['title']) if remove_non_songs else True
387 | 
388 |                     if song_is_valid:
389 |                         # Scrape song lyrics from the song's HTML
390 |                         lyrics = self._scrape_song_lyrics_from_url(json_song['url'], remove_section_headers)
391 | 
392 |                         # Create song object for current song
393 |                         if get_full_song_info:
394 |                             song = Song(self._make_api_request((json_song['id'], 'song')), lyrics)
395 |                         else:
396 |                             song = Song({'song': json_song}, lyrics)  # Faster, less info from API
397 | 
398 |                         # Add song to the Artist object
399 |                         if artist.add_song(song, verbose=False) == 0:
400 |                             # print("Add song: {}".format(song.title))
401 |                             n += 1
402 |                             if verbose:
403 |                                 print('Song {0}: "{1}"'.format(n, song.title))
404 | 
405 |                     else:  # Song does not contain lyrics
406 |                         if verbose:
407 |                             print('"{title}" does not contain lyrics. Rejecting.'.format(title=json_song['title']))
408 | 
409 |                     # Check if user specified a max number of songs for the artist
410 |                     if not isinstance(max_songs, type(None)):
411 |                         if artist.num_songs >= max_songs:
412 |                             keep_searching = False
413 |                             if verbose:
414 |                                 print('\nReached user-specified song limit ({0}).'.format(max_songs))
415 |                             break
416 | 
417 |                 # Move on to next page of search results
418 |                 next_page = artist_search_results['next_page']
419 |                 if next_page is None:
420 |                     break
421 |                 else:  # Get next page of artist song results
422 |                     artist_search_results = self._make_api_request((artist_id, 'artist-songs'), page=next_page)
423 | 
424 |             if verbose:
425 |                 print('Found {n_songs} songs.'.format(n_songs=artist.num_songs))
426 | 
427 |         if verbose:
428 |             print('Done.')
429 | 
430 |         return artist
431 | 
432 |     def search_album(self, artist_name, album_title):
433 |         """Get all lyrics from an album and save them in a json file"""
434 | 
435 |         # genius finds the artist
436 |         artist = self.search_artist(artist_name, max_songs=0)
437 |         # modify artist_name and album_title so that they will lead us to the album page on Genius.com
438 |         artist_name = artist._body['name']
439 |         for ch in [',', "/", " ", '$', ';', ':', '(', ')', '[', ']', '----', '---', '--']:
440 |             if ch in artist_name:
441 |                 artist_name = artist_name.replace(ch, "-")
442 |             if ch in album_title:
443 |                 album_title = album_title.replace(ch, "-")
444 |         for ch in ['.', "\"", "'"]:
445 |             if ch in artist_name:
446 |                 artist_name = artist_name.replace(ch, "")
447 |             if ch in album_title:
448 |                 album_title = album_title.replace(ch, "-")
449 |         for ch in ['é', 'è', 'ê', 'ë']:
450 |             if ch in artist_name:
451 |                 artist_name = artist_name.replace(ch, "e")
452 |             if ch in album_title:
453 |                 album_title = album_title.replace(ch, "e")
454 | 
455 |         artist_name = artist_name.replace("&", "and")
456 |         album_title = album_title.replace("&", "").replace("#", "").replace("--", "-")
457 | 
458 |         while artist_name[-1] == '-':
459 |             artist_name = artist_name[:-1]
460 |         while album_title[-1] == '-':
461 |             album_title = album_title[:-1]
462 |         # create the track-number index and an empty default for the title list
463 |         index, df = [], []
464 |         # get the album page on Genius.com
465 |         r = requests.get('https://genius.com/albums/' + artist_name + "/" + album_title)
466 |         soup = BeautifulSoup(r.text, 'html.parser')
467 |         # get the html section indicating if the album isn't found
468 |         not_found = soup.find('h1', attrs={'class': 'render_404-headline'})
469 |         if not_found is not None and "Page not found" in not_found.text:
470 |             print("Album not found.")
471 |             return None
472 |         # get the html section indicating if the song is missing lyrics
473 |         missing = soup.find_all('div', attrs={'class': 'chart_row-metadata_element chart_row-metadata_element--large'})
474 |         miss_nb = 0
475 |         # count the number of songs without lyrics
476 |         for miss in missing:
477 |             if miss.text.find("(Missing Lyrics)") >= 0 or miss.text.find("(Unreleased)") >= 0:
478 |                 miss_nb += 1
479 |         divi = soup.find_all('div', attrs={'class': 'column_layout-column_span column_layout-column_span--primary'})
480 |         for div in divi:
481 |             var = 0
482 |             # get the html section indicating the track numbers (this will be to eliminate sections similar to those of songs but are actually of tracklist or credits of the album)
483 |             mdiv = div.find_all('span', attrs={'class': 'chart_row-number_container-number chart_row-number_container-number--gray'})
484 |             for mindiv in mdiv:
485 |                 nb = mindiv.text.replace("\n", "")
486 |                 if nb != "":
487 |                     index.append(nb)
488 |             # create a list holding the tracks' titles
489 |             df = []
490 |             ndiv = div.find_all('h3', attrs={'class': 'chart_row-content-title'})
491 |             for mindiv in ndiv:
492 |                 tt = mindiv.text.replace("\n", "").strip()
493 |                 # strip the featured artists from the title
494 |                 if "(Ft" in tt:
495 |                     tt = tt.split(" (Ft.", 1)[0]
496 |                 else:
497 |                     # strip the trailing "Lyrics" from the title
498 |                     tt = tt.rsplit(" ", 1)[0].strip()
499 |                 df.append(tt)
500 |                 var += 1
501 |                 if var == len(index):
502 |                     break
503 |         # loop to add song with title from the list
504 |         for track in df:
505 |             # search the song
506 |             song = self.search_song(track, artist.name)
507 |             # if the song was found, it's added to artist
508 |             if song is not None:
509 |                 artist.add_song(song)
510 |             # if the song wasn't found, the title may be formatted as "title by other artist"
511 |             elif " by " in track.replace("\xa0", " "):
512 |                 # split on the last " by " (after normalizing non-breaking spaces)
513 |                 title, s_artist_name = track.replace("\xa0", " ").rsplit(" by ", 1)
514 |                 s_artist = self.search_artist(s_artist_name, max_songs=0)
515 |                 # look for the song under the other artist (skipped if that artist wasn't found)
516 |                 song = self.search_song(title.strip(), s_artist.name) if s_artist is not None else None
517 |                 if song is not None:
518 |                     # add song to the album's main artist
519 |                     artist.add_song(song)
520 |                 else:
521 |                     print("Missing lyrics")
522 |             else:
523 |                 print("Missing lyrics")
524 |         artist.save_lyrics(format='json')
525 |         if miss_nb == 1:
526 |             print("{} song was ignored due to missing lyrics.".format(miss_nb))
527 |         elif miss_nb > 1:
528 |             print("{} songs were ignored due to missing lyrics.".format(miss_nb))
529 | 
530 |     def save_artists(self, artists, filename="artist_lyrics", overwrite=False):
531 |         """Pass a list of Artist objects to save multiple artists"""
532 |         if isinstance(artists, Artist):
533 |             artists = [artists]
534 |         assert isinstance(artists, list), "Must pass in list of Artist objects."
535 | 
536 |         # Create a temporary directory for lyrics
537 |         start = time.time()
538 |         tmp_dir = 'tmp_lyrics'
539 |         if not os.path.isdir(tmp_dir):
540 |             os.mkdir(tmp_dir)
541 |             tmp_count = 0
542 |         else:
543 |             tmp_count = len(os.listdir('./' + tmp_dir))
544 | 
545 |         # Check if file already exists
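546 |         # Ask before replacing an existing file unless overwrite=True
547 |         if os.path.isfile(filename + ".json") and not overwrite:
548 |             prompt = "{}.json already exists. Overwrite?\n(y/n): ".format(filename)
549 |             if input(prompt).lower() != 'y':
550 |                 print("Leaving file in place. Exiting.")
551 |                 # Remove the temporary directory only if it is empty;
552 |                 # os.rmdir raises OSError on a non-empty directory
553 |                 if tmp_count == 0:
554 |                     os.rmdir(tmp_dir)
555 |                 return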
556 | 
557 |         # Extract each artist's lyrics in json format
558 |         all_lyrics = {'artists': []}
559 |         for n, artist in enumerate(artists):
560 |             if isinstance(artist, Artist):
561 |                 all_lyrics['artists'].append({})
562 |                 tmp_file = "./{dir}/tmp_{num}_{name}".format(dir=tmp_dir, num=n + tmp_count, name=artist.name.replace(" ", ""))
563 |                 print(tmp_file)
564 |                 all_lyrics['artists'][-1] = artist.save_lyrics(filename=tmp_file, overwrite=True)
565 |             else:
566 |                 warn("Item #{} was not of type Artist. Skipping.".format(n))
567 | 
568 |         # Save all of the lyrics
569 |         with open(filename + '.json', 'w') as outfile:
570 |             json.dump(all_lyrics, outfile)
571 | 
572 |         end = time.time()
573 | 
574 | # To get an album's lyrics:
575 | #
576 | # Get a token by signing up on the Genius website: https://genius.com/api-clients
577 | # client_access_token = 'YOUR_TOKEN_HERE'
578 | # api = Genius(client_access_token)
579 | # api.search_album("DJ Khaled", "Grateful")
580 | 


--------------------------------------------------------------------------------