├── .gitignore ├── .travis.yml ├── LICENSE ├── MANIFEST.in ├── README.md ├── requirements.txt ├── setup.py ├── soundscrape ├── .gitignore ├── __init__.py └── soundscrape.py ├── test.sh └── tests └── test.py /.gitignore: -------------------------------------------------------------------------------- 1 | env/ 2 | *.DS_Store 3 | *.pyc 4 | *.bak 5 | build/ 6 | dist/ 7 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "3.4" 4 | - "3.5" 5 | - "3.8" 6 | - "3.9" 7 | # command to install dependencies 8 | install: 9 | - "pip install setuptools --upgrade; python setup.py install" 10 | # command to run tests 11 | script: nosetests 12 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2013 Rich Jones 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so, 10 | subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 21 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.md LICENSE requirements.txt 2 | recursive-include soundscrape *.py 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![SoundScrape!](http://i.imgur.com/nHAt2ow.png) 2 | 3 | SoundScrape [![Build Status](https://travis-ci.org/Miserlou/SoundScrape.svg)](https://travis-ci.org/Miserlou/SoundScrape) [![Python 3](https://img.shields.io/badge/Python-3-brightgreen.svg)](https://pypi.python.org/pypi/soundscrape/) [![PyPI](https://img.shields.io/pypi/v/soundscrape.svg)](https://pypi.python.org/pypi/SoundScrape) 4 | ============== 5 | 6 | **SoundScrape** makes it super easy to download artists from SoundCloud (and Bandcamp and MixCloud) - even those which don't have download links! It automatically creates ID3 tags as well (including album art), which is handy. 
7 | 8 | Usage 9 | --------- 10 | 11 | First, install it: 12 | 13 | ```bash 14 | pip install soundscrape 15 | ``` 16 | 17 | Note that if you are having problems, please first try updating to the latest version: 18 | 19 | ```bash 20 | pip install soundscrape --upgrade 21 | ``` 22 | 23 | Then, just call soundscrape and the name of the artist you want to scrape: 24 | 25 | ```bash 26 | soundscrape rabbit-i-am 27 | ``` 28 | 29 | And you're done! Hooray! Files are stored as mp3s in the format **Artist name - Track title.mp3**. 30 | 31 | You can also use the *-n* argument to only download a certain number of songs. 32 | 33 | ```bash 34 | soundscrape rabbit-i-am -n 3 35 | ``` 36 | 37 | Sets 38 | ------- 39 | 40 | Soundscrape can also download sets, but you have to include the full URL of the set you want to download: 41 | 42 | ```bash 43 | soundscrape https://soundcloud.com/vsauce-awesome/sets/awesome 44 | ``` 45 | 46 | Groups 47 | -------- 48 | 49 | Soundscrape can also download tracks from SoundCloud groups with the *-g* argument. 50 | 51 | ```bash 52 | soundscrape chopped-and-screwed -gn 2 53 | ``` 54 | 55 | Tracks 56 | -------- 57 | 58 | Soundscrape can also download specific tracks with *-t*: 59 | 60 | ```bash 61 | soundscrape foolsgoldrecs -t danny-brown-dip 62 | ``` 63 | 64 | or with just the straight URL: 65 | 66 | ```bash 67 | soundscrape https://soundcloud.com/foolsgoldrecs/danny-brown-dip 68 | ``` 69 | 70 | Likes 71 | -------- 72 | 73 | Soundscrape can also download all of an Artist's Liked items with *-l*: 74 | 75 | ```bash 76 | soundscrape troyboi -l 77 | ``` 78 | 79 | or with just the straight URL: 80 | 81 | ```bash 82 | soundscrape https://soundcloud.com/troyboi/likes 83 | ``` 84 | 85 | High-Quality Downloads Only 86 | -------- 87 | 88 | By default, SoundScrape will try to rip everything it can. However, if you only want to download tracks that have an official download available (which are typically at a higher-quality 320kbps bitrate), you can use the *-d* argument. 89 | 90 | ```bash 91 | soundscrape sly-dogg -d 92 | ``` 93 | 94 | Keep Preview Tracks 95 | -------- 96 | 97 | By default, SoundScrape will skip the 30-second preview tracks that SoundCloud now provides. You can choose to keep these preview snippets with the *-k* argument. 98 | 99 | ```bash 100 | soundscrape chromeo -k 101 | ``` 102 | 103 | Folders 104 | -------- 105 | 106 | By default, SoundScrape aims to act like _wget_, downloading in place in the current directory. With the *-f* argument, however, SoundScrape acts more like a download manager and sorts songs into the following format: 107 | 108 | ``` 109 | ./ARTIST_NAME - ALBUM_NAME/SONG_NUMBER - SONG_TITLE.mp3 110 | ``` 111 | 112 | It will also skip previously downloaded tracks. 113 | 114 | ```bash 115 | soundscrape murdercitydevils -f 116 | ``` 117 | 118 | Bandcamp 119 | -------- 120 | 121 | SoundScrape can also pull down albums from Bandcamp. For Bandcamp pages, use the *-b* argument along with an artist's username or a specific URL. It only downloads one album at a time. This works with all of the other arguments, except *-d* as Bandcamp streams only come at one bitrate, as far as I can tell. 122 | 123 | Note: Currently, when using the *-n* argument, the limit is evaluated for each album separately. 
124 | 125 | ```bash 126 | soundscrape warsaw -b -f 127 | ``` 128 | 129 | This also works for non-Bandcamp URLs that are hosted on Bandcamp: 130 | 131 | ```bash 132 | soundscrape -b http://music.monstercat.com/ 133 | ``` 134 | 135 | Note that the full URL must be included. 136 | 137 | Mixcloud 138 | -------- 139 | 140 | SoundScrape can also grab mixes from Mixcloud. This feature is extremely expermental and is in no way guaranteed to work! 141 | 142 | Finds the original mp3 of a mix and grabs that (with tags and album art) if it can, or else just gets the raw m4a stream. 143 | 144 | Mixcloud currently only takes an invidiual mix. Capacity for a whole artist's profile due shortly. 145 | 146 | ```bash 147 | soundscrape https://www.mixcloud.com/corenewsuploads/flume-essential-mix-2015-10-03/ -of 148 | ``` 149 | 150 | Audiomack 151 | -------- 152 | 153 | Just for fun, SoundScrape can also download individual songs from Audiomack. Not that you'd ever want to. 154 | 155 | ```bash 156 | soundscrape -a http://www.audiomack.com/song/bottomfeedermusic/top-shottas 157 | ``` 158 | 159 | MusicBed 160 | -------- 161 | 162 | For some strange reason, it also works for MusicBed.com. Thanks @brachna for this feature. 163 | 164 | ```bash 165 | soundscrape https://www.musicbed.com/albums/be-still/2828 166 | ``` 167 | 168 | Opening Files 169 | -------- 170 | 171 | As a convenience method, SoundScrape can automatically _'open'_ files that it downloads. This uses your system's 'open' command for file associations. 172 | 173 | ```bash 174 | soundscrape lorn -of 175 | ``` 176 | 177 | Issues 178 | ------- 179 | 180 | There's probably a lot more that can be done to improve this. Please file issues if you find them! 181 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | args>=0.1.0 2 | clint>=0.3.2 3 | demjson>=2.2.2 4 | fudge>=1.0.3 5 | nose>=1.3.7 6 | requests[security]>=2.9.0 7 | setuptools>=18.0.0 8 | simplejson>=3.3.1 9 | soundcloud>=0.4.1 10 | wheel>=0.24.0 11 | mutagen>=1.31.0 12 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import os 2 | import setuptools 3 | import soundscrape 4 | import sys 5 | 6 | from setuptools import setup 7 | 8 | # To support 2/3 installation 9 | setup_version = int(setuptools.__version__.split('.')[0]) 10 | if setup_version < 18: 11 | print("Please upgrade your setuptools to install SoundScrape: ") 12 | print("pip install -U pip wheel setuptools") 13 | quit() 14 | 15 | # Set external files 16 | try: 17 | from pypandoc import convert 18 | README = convert('README.md', 'rst') 19 | except ImportError: 20 | README = open(os.path.join(os.path.dirname(__file__), 'README.md')).read() 21 | 22 | with open(os.path.join(os.path.dirname(__file__), 'requirements.txt')) as f: 23 | required = f.read().splitlines() 24 | 25 | # allow setup.py to be run from any path 26 | os.chdir(os.path.normpath(os.path.join(os.path.abspath(__file__), os.pardir))) 27 | 28 | setup( 29 | name='soundscrape', 30 | version=soundscrape.__version__, 31 | packages=['soundscrape'], 32 | install_requires=required, 33 | extras_require={ ':python_version < "3.0"': [ 'wsgiref>=0.1.2', ], }, 34 | include_package_data=True, 35 | license='MIT License', 36 | description='Scrape an artist from SoundCloud', 37 | long_description=README, 38 | 
url='https://github.com/Miserlou/SoundScrape', 39 | author='Rich Jones', 40 | author_email='rich@openwatch.net', 41 | entry_points={ 42 | 'console_scripts': [ 43 | 'soundscrape = soundscrape.soundscrape:main', 44 | ] 45 | }, 46 | classifiers=[ 47 | 'Environment :: Console', 48 | 'License :: OSI Approved :: Apache Software License', 49 | 'Operating System :: OS Independent', 50 | 'Programming Language :: Python', 51 | 'Programming Language :: Python :: 3.4', 52 | 'Programming Language :: Python :: 3.5', 53 | 'Programming Language :: Python :: 3.7', 54 | 'Programming Language :: Python :: 3.8', 55 | 'Programming Language :: Python :: 3.9', 56 | 'Topic :: Internet :: WWW/HTTP', 57 | 'Topic :: Internet :: WWW/HTTP :: Dynamic Content', 58 | ], 59 | ) 60 | -------------------------------------------------------------------------------- /soundscrape/.gitignore: -------------------------------------------------------------------------------- 1 | *.mp3 -------------------------------------------------------------------------------- /soundscrape/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.31.0' 2 | -------------------------------------------------------------------------------- /soundscrape/soundscrape.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | import argparse 3 | import demjson 4 | import html 5 | import os 6 | import re 7 | import requests 8 | import soundcloud 9 | import sys 10 | import urllib 11 | 12 | from clint.textui import colored, puts, progress 13 | from datetime import datetime 14 | from mutagen.mp3 import MP3, EasyMP3 15 | from mutagen.id3 import APIC, WXXX 16 | from mutagen.id3 import ID3 as OldID3 17 | from subprocess import Popen, PIPE 18 | from os.path import dirname, exists, join 19 | from os import access, mkdir, W_OK 20 | 21 | if sys.version_info.minor < 4: 22 | html_unescape = html.parser.HTMLParser().unescape 23 | else: 24 | html_unescape = html.unescape 25 | 26 | #################################################################### 27 | 28 | # Please be nice with this! 29 | CLIENT_ID = 'a3dd183a357fcff9a6943c0d65664087' 30 | CLIENT_SECRET = '7e10d33e967ad42574124977cf7fa4b7' 31 | MAGIC_CLIENT_ID = 'b45b1aa10f1ac2941910a7f0d10f8e28' 32 | 33 | AGGRESSIVE_CLIENT_ID = 'OmTFHKYSMLFqnu2HHucmclAptedxWXkq' 34 | APP_VERSION = '1481046241' 35 | 36 | #################################################################### 37 | 38 | 39 | def main(): 40 | """ 41 | Main function. 42 | 43 | Converts arguments to Python and processes accordingly. 44 | 45 | """ 46 | 47 | # Hack related to #58 48 | if sys.platform == "win32": 49 | os.system("chcp 65001"); 50 | 51 | parser = argparse.ArgumentParser(description='SoundScrape. 
Scrape an artist from SoundCloud.\n') 52 | parser.add_argument('artist_url', metavar='U', type=str, nargs='*', 53 | help='An artist\'s SoundCloud username or URL') 54 | parser.add_argument('-n', '--num-tracks', type=int, default=sys.maxsize, 55 | help='The number of tracks to download') 56 | parser.add_argument('-g', '--group', action='store_true', 57 | help='Use if downloading tracks from a SoundCloud group') 58 | parser.add_argument('-b', '--bandcamp', action='store_true', 59 | help='Use if downloading from Bandcamp rather than SoundCloud') 60 | parser.add_argument('-m', '--mixcloud', action='store_true', 61 | help='Use if downloading from Mixcloud rather than SoundCloud') 62 | parser.add_argument('-a', '--audiomack', action='store_true', 63 | help='Use if downloading from Audiomack rather than SoundCloud') 64 | parser.add_argument('-c', '--hive', action='store_true', 65 | help='Use if downloading from Hive.co rather than SoundCloud') 66 | parser.add_argument('-l', '--likes', action='store_true', 67 | help='Download all of a user\'s Likes.') 68 | parser.add_argument('-L', '--login', type=str, default='soundscrape123@mailinator.com', 69 | help='Set login') 70 | parser.add_argument('-d', '--downloadable', action='store_true', 71 | help='Only fetch tracks with a Downloadable link.') 72 | parser.add_argument('-t', '--track', type=str, default='', 73 | help='The name of a specific track by an artist') 74 | parser.add_argument('-f', '--folders', action='store_true', 75 | help='Organize saved songs in folders by artists') 76 | parser.add_argument('-p', '--path', type=str, default='', 77 | help='Set directory path where downloads should be saved to') 78 | parser.add_argument('-P', '--password', type=str, default='soundscraperocks', 79 | help='Set password') 80 | parser.add_argument('-o', '--open', action='store_true', 81 | help='Open downloaded files after downloading.') 82 | parser.add_argument('-k', '--keep', action='store_true', 83 | help='Keep 30-second preview tracks') 84 | parser.add_argument('-v', '--version', action='store_true', default=False, 85 | help='Display the current version of SoundScrape') 86 | 87 | args = parser.parse_args() 88 | vargs = vars(args) 89 | 90 | if vargs['version']: 91 | import pkg_resources 92 | version = pkg_resources.require("soundscrape")[0].version 93 | print(version) 94 | return 95 | 96 | if not vargs['artist_url']: 97 | parser.error('Please supply an artist\'s username or URL!') 98 | 99 | if sys.version_info < (3,0,0): 100 | vargs['artist_url'] = urllib.quote(vargs['artist_url'][0], safe=':/') 101 | else: 102 | vargs['artist_url'] = urllib.parse.quote(vargs['artist_url'][0], safe=':/') 103 | 104 | artist_url = vargs['artist_url'] 105 | 106 | if not exists(vargs['path']): 107 | if not access(dirname(vargs['path']), W_OK): 108 | vargs['path'] = '' 109 | else: 110 | mkdir(vargs['path']) 111 | 112 | if 'bandcamp.com' in artist_url or vargs['bandcamp']: 113 | process_bandcamp(vargs) 114 | elif 'mixcloud.com' in artist_url or vargs['mixcloud']: 115 | process_mixcloud(vargs) 116 | elif 'audiomack.com' in artist_url or vargs['audiomack']: 117 | process_audiomack(vargs) 118 | elif 'hive.co' in artist_url or vargs['hive']: 119 | process_hive(vargs) 120 | elif 'musicbed.com' in artist_url: 121 | process_musicbed(vargs) 122 | else: 123 | process_soundcloud(vargs) 124 | 125 | 126 | #################################################################### 127 | # SoundCloud 128 | #################################################################### 129 | 130 | 131 | def 
process_soundcloud(vargs): 132 | """ 133 | Main SoundCloud path. 134 | """ 135 | 136 | artist_url = vargs['artist_url'] 137 | track_permalink = vargs['track'] 138 | keep_previews = vargs['keep'] 139 | folders = vargs['folders'] 140 | 141 | id3_extras = {} 142 | one_track = False 143 | likes = False 144 | client = get_client() 145 | if 'soundcloud' not in artist_url.lower(): 146 | if vargs['group']: 147 | artist_url = 'https://soundcloud.com/groups/' + artist_url.lower() 148 | elif len(track_permalink) > 0: 149 | one_track = True 150 | track_url = 'https://soundcloud.com/' + artist_url.lower() + '/' + track_permalink.lower() 151 | else: 152 | artist_url = 'https://soundcloud.com/' + artist_url.lower() 153 | if vargs['likes'] or 'likes' in artist_url.lower(): 154 | likes = True 155 | 156 | if 'likes' in artist_url.lower(): 157 | artist_url = artist_url[0:artist_url.find('/likes')] 158 | likes = True 159 | 160 | if one_track: 161 | num_tracks = 1 162 | else: 163 | num_tracks = vargs['num_tracks'] 164 | 165 | try: 166 | if one_track: 167 | resolved = client.get('/resolve', url=track_url, limit=200) 168 | 169 | elif likes: 170 | userId = str(client.get('/resolve', url=artist_url).id) 171 | 172 | resolved = client.get('/users/' + userId + '/favorites', limit=200, linked_partitioning=1) 173 | next_href = False 174 | if(hasattr(resolved, 'next_href')): 175 | next_href = resolved.next_href 176 | while (next_href): 177 | 178 | resolved2 = requests.get(next_href).json() 179 | if('next_href' in resolved2): 180 | next_href = resolved2['next_href'] 181 | else: 182 | next_href = False 183 | resolved2 = soundcloud.resource.ResourceList(resolved2['collection']) 184 | resolved.collection.extend(resolved2) 185 | resolved = resolved.collection 186 | 187 | else: 188 | resolved = client.get('/resolve', url=artist_url, limit=200) 189 | 190 | except Exception as e: # HTTPError? 191 | 192 | # SoundScrape is trying to prevent us from downloading this. 193 | # We're going to have to stop trusting the API/client and 194 | # do all our own scraping. Boo. 195 | 196 | if '404 Client Error' in str(e): 197 | puts(colored.red("Problem downloading [404]: ") + colored.white("Item Not Found")) 198 | return None 199 | 200 | message = str(e) 201 | item_id = message.rsplit('/', 1)[-1].split('.json')[0].split('?client_id')[0] 202 | hard_track_url = get_hard_track_url(item_id) 203 | 204 | track_data = get_soundcloud_data(artist_url) 205 | puts_safe(colored.green("Scraping") + colored.white(": " + track_data['title'])) 206 | 207 | filenames = [] 208 | filename = sanitize_filename(track_data['artist'] + ' - ' + track_data['title'] + '.mp3') 209 | 210 | if folders: 211 | name_path = join(vargs['path'], track_data['artist']) 212 | if not exists(name_path): 213 | mkdir(name_path) 214 | filename = join(name_path, filename) 215 | else: 216 | filename = join(vargs['path'], filename) 217 | 218 | if exists(filename): 219 | puts_safe(colored.yellow("Track already downloaded: ") + colored.white(track_data['title'])) 220 | return None 221 | 222 | filename = download_file(hard_track_url, filename) 223 | tagged = tag_file(filename, 224 | artist=track_data['artist'], 225 | title=track_data['title'], 226 | year='2018', 227 | genre='', 228 | album='', 229 | artwork_url='') 230 | 231 | if not tagged: 232 | wav_filename = filename[:-3] + 'wav' 233 | os.rename(filename, wav_filename) 234 | filename = wav_filename 235 | 236 | filenames.append(filename) 237 | 238 | else: 239 | 240 | aggressive = False 241 | 242 | # This is is likely a 'likes' page. 
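        # A plain resource list (no 'kind' attribute) is what the likes lookup
        # produces, so it is used as the track list directly. Anything else is
        # dispatched on resolved.kind: 'artist', 'playlist', 'track', 'group',
        # or, as a fallback, treated as a user whose tracks are fetched.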
243 | if not hasattr(resolved, 'kind'): 244 | tracks = resolved 245 | else: 246 | if resolved.kind == 'artist': 247 | artist = resolved 248 | artist_id = str(artist.id) 249 | tracks = client.get('/users/' + artist_id + '/tracks', limit=200) 250 | elif resolved.kind == 'playlist': 251 | id3_extras['album'] = resolved.title 252 | if resolved.tracks != []: 253 | tracks = resolved.tracks 254 | else: 255 | tracks = get_soundcloud_api_playlist_data(resolved.id)['tracks'] 256 | tracks = tracks[:num_tracks] 257 | aggressive = True 258 | for track in tracks: 259 | download_track(track, resolved.title, keep_previews, folders, custom_path=vargs['path']) 260 | 261 | elif resolved.kind == 'track': 262 | tracks = [resolved] 263 | elif resolved.kind == 'group': 264 | group = resolved 265 | group_id = str(group.id) 266 | tracks = client.get('/groups/' + group_id + '/tracks', limit=200) 267 | else: 268 | artist = resolved 269 | artist_id = str(artist.id) 270 | tracks = client.get('/users/' + artist_id + '/tracks', limit=200) 271 | if tracks == [] and artist.track_count > 0: 272 | aggressive = True 273 | filenames = [] 274 | 275 | # this might be buggy 276 | data = get_soundcloud_api2_data(artist_id) 277 | 278 | for track in data['collection']: 279 | 280 | if len(filenames) >= num_tracks: 281 | break 282 | 283 | if track['type'] == 'playlist': 284 | track['playlist']['tracks'] = track['playlist']['tracks'][:num_tracks] 285 | for playlist_track in track['playlist']['tracks']: 286 | album_name = track['playlist']['title'] 287 | filename = download_track(playlist_track, album_name, keep_previews, folders, filenames, custom_path=vargs['path']) 288 | if filename: 289 | filenames.append(filename) 290 | else: 291 | d_track = track['track'] 292 | filename = download_track(d_track, custom_path=vargs['path']) 293 | if filename: 294 | filenames.append(filename) 295 | 296 | if not aggressive: 297 | filenames = download_tracks(client, tracks, num_tracks, vargs['downloadable'], vargs['folders'], vargs['path'], 298 | id3_extras=id3_extras) 299 | 300 | if vargs['open']: 301 | open_files(filenames) 302 | 303 | 304 | def get_client(): 305 | """ 306 | Return a new SoundCloud Client object. 307 | """ 308 | client = soundcloud.Client(client_id=CLIENT_ID) 309 | return client 310 | 311 | def download_track(track, album_name=u'', keep_previews=False, folders=False, filenames=[], custom_path=''): 312 | """ 313 | Given a track, force scrape it. 314 | """ 315 | 316 | hard_track_url = get_hard_track_url(track['id']) 317 | 318 | # We have no info on this track whatsoever. 319 | if not 'title' in track: 320 | return None 321 | 322 | if not keep_previews: 323 | if (track.get('duration', 0) < track.get('full_duration', 0)): 324 | puts_safe(colored.yellow("Skipping preview track") + colored.white(": " + track['title'])) 325 | return None 326 | 327 | # May not have a "full name" 328 | name = track['user'].get('full_name', '') 329 | if name == '': 330 | name = track['user']['username'] 331 | 332 | filename = sanitize_filename(name + ' - ' + track['title'] + '.mp3') 333 | 334 | if folders: 335 | name_path = join(custom_path, name) 336 | if not exists(name_path): 337 | mkdir(name_path) 338 | filename = join(name_path, filename) 339 | else: 340 | filename = join(custom_path, filename) 341 | 342 | if exists(filename): 343 | puts_safe(colored.yellow("Track already downloaded: ") + colored.white(track['title'])) 344 | return None 345 | 346 | # Skip already downloaded track. 
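    # (i.e. one whose filename was already produced earlier in this run,
    # for example by another playlist in the same scrape.)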
347 | if filename in filenames: 348 | return None 349 | 350 | if hard_track_url: 351 | puts_safe(colored.green("Scraping") + colored.white(": " + track['title'])) 352 | else: 353 | # Region coded? 354 | puts_safe(colored.yellow("Unable to download") + colored.white(": " + track['title'])) 355 | return None 356 | 357 | filename = download_file(hard_track_url, filename) 358 | tagged = tag_file(filename, 359 | artist=name, 360 | title=track['title'], 361 | year=track['created_at'][:4], 362 | genre=track['genre'], 363 | album=album_name, 364 | artwork_url=track['artwork_url']) 365 | if not tagged: 366 | wav_filename = filename[:-3] + 'wav' 367 | os.rename(filename, wav_filename) 368 | filename = wav_filename 369 | 370 | return filename 371 | 372 | def download_tracks(client, tracks, num_tracks=sys.maxsize, downloadable=False, folders=False, custom_path='', id3_extras={}): 373 | """ 374 | Given a list of tracks, iteratively download all of them. 375 | 376 | """ 377 | 378 | filenames = [] 379 | 380 | for i, track in enumerate(tracks): 381 | 382 | # "Track" and "Resource" objects are actually different, 383 | # even though they're the same. 384 | if isinstance(track, soundcloud.resource.Resource): 385 | 386 | try: 387 | 388 | t_track = {} 389 | t_track['downloadable'] = track.downloadable 390 | t_track['streamable'] = track.streamable 391 | t_track['title'] = track.title 392 | t_track['user'] = {'username': track.user['username']} 393 | t_track['release_year'] = track.release 394 | t_track['genre'] = track.genre 395 | t_track['artwork_url'] = track.artwork_url 396 | if track.downloadable: 397 | t_track['stream_url'] = track.download_url 398 | else: 399 | if downloadable: 400 | puts_safe(colored.red("Skipping") + colored.white(": " + track.title)) 401 | continue 402 | if hasattr(track, 'stream_url'): 403 | t_track['stream_url'] = track.stream_url 404 | else: 405 | t_track['direct'] = True 406 | streams_url = "https://api.soundcloud.com/i1/tracks/%s/streams?client_id=%s&app_version=%s" % ( 407 | str(track.id), AGGRESSIVE_CLIENT_ID, APP_VERSION) 408 | response = requests.get(streams_url).json() 409 | t_track['stream_url'] = response['http_mp3_128_url'] 410 | 411 | track = t_track 412 | except Exception as e: 413 | puts_safe(colored.white(track.title) + colored.red(' is not downloadable.')) 414 | continue 415 | 416 | if i > num_tracks - 1: 417 | continue 418 | try: 419 | if not track.get('stream_url', False): 420 | puts_safe(colored.white(track['title']) + colored.red(' is not downloadable.')) 421 | continue 422 | else: 423 | track_artist = sanitize_filename(track['user']['username']) 424 | track_title = sanitize_filename(track['title']) 425 | track_filename = track_artist + ' - ' + track_title + '.mp3' 426 | 427 | if folders: 428 | track_artist_path = join(custom_path, track_artist) 429 | if not exists(track_artist_path): 430 | mkdir(track_artist_path) 431 | track_filename = join(track_artist_path, track_filename) 432 | else: 433 | track_filename = join(custom_path, track_filename) 434 | 435 | if exists(track_filename): 436 | puts_safe(colored.yellow("Track already downloaded: ") + colored.white(track_title)) 437 | continue 438 | 439 | puts_safe(colored.green("Downloading") + colored.white(": " + track['title'])) 440 | 441 | 442 | if track.get('direct', False): 443 | location = track['stream_url'] 444 | else: 445 | stream = client.get(track['stream_url'], allow_redirects=False, limit=200) 446 | if hasattr(stream, 'location'): 447 | location = stream.location 448 | else: 449 | location = stream.url 
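                    # At this point 'location' holds the concrete media URL
                    # (the redirect target of the stream URL, or a direct link),
                    # which is downloaded and then tagged below.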
450 | 451 | filename = download_file(location, track_filename) 452 | tagged = tag_file(filename, 453 | artist=track['user']['username'], 454 | title=track['title'], 455 | year=track['release_year'], 456 | genre=track['genre'], 457 | album=id3_extras.get('album', None), 458 | artwork_url=track['artwork_url']) 459 | 460 | if not tagged: 461 | wav_filename = filename[:-3] + 'wav' 462 | os.rename(filename, wav_filename) 463 | filename = wav_filename 464 | 465 | filenames.append(filename) 466 | except Exception as e: 467 | puts_safe(colored.red("Problem downloading ") + colored.white(track['title'])) 468 | puts_safe(str(e)) 469 | 470 | return filenames 471 | 472 | 473 | 474 | def get_soundcloud_data(url): 475 | """ 476 | Scrapes a SoundCloud page for a track's important information. 477 | 478 | Returns: 479 | dict: of audio data 480 | 481 | """ 482 | 483 | data = {} 484 | 485 | request = requests.get(url) 486 | 487 | title_tag = request.text.split('')[1].split('</title')[0] 488 | data['title'] = title_tag.split(' by ')[0].strip() 489 | data['artist'] = title_tag.split(' by ')[1].split('|')[0].strip() 490 | # XXX Do more.. 491 | 492 | return data 493 | 494 | 495 | def get_soundcloud_api2_data(artist_id): 496 | """ 497 | Scrape the new API. Returns the parsed JSON response. 498 | """ 499 | 500 | v2_url = "https://api-v2.soundcloud.com/stream/users/%s?limit=500&client_id=%s&app_version=%s" % ( 501 | artist_id, AGGRESSIVE_CLIENT_ID, APP_VERSION) 502 | response = requests.get(v2_url) 503 | parsed = response.json() 504 | 505 | return parsed 506 | 507 | def get_soundcloud_api_playlist_data(playlist_id): 508 | """ 509 | Scrape the new API. Returns the parsed JSON response. 510 | """ 511 | 512 | url = "https://api.soundcloud.com/playlists/%s?representation=full&client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&app_version=1467724310" % ( 513 | playlist_id) 514 | response = requests.get(url) 515 | parsed = response.json() 516 | 517 | return parsed 518 | 519 | def get_hard_track_url(item_id): 520 | """ 521 | Hard-scrapes a track. 522 | """ 523 | 524 | streams_url = "https://api.soundcloud.com/i1/tracks/%s/streams/?client_id=%s&app_version=%s" % ( 525 | item_id, AGGRESSIVE_CLIENT_ID, APP_VERSION) 526 | response = requests.get(streams_url) 527 | json_response = response.json() 528 | 529 | if response.status_code == 200: 530 | hard_track_url = json_response['http_mp3_128_url'] 531 | return hard_track_url 532 | else: 533 | return None 534 | 535 | #################################################################### 536 | # Bandcamp 537 | #################################################################### 538 | 539 | 540 | def process_bandcamp(vargs): 541 | """ 542 | Main BandCamp path. 
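    Accepts either a full Bandcamp (or Bandcamp-hosted) URL or a bare artist
    name, which is expanded to https://<name>.bandcamp.com/music.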
543 | """ 544 | 545 | artist_url = vargs['artist_url'] 546 | 547 | if 'bandcamp.com' in artist_url or ('://' in artist_url and vargs['bandcamp']): 548 | bc_url = artist_url 549 | else: 550 | bc_url = 'https://' + artist_url + '.bandcamp.com/music' 551 | 552 | filenames = scrape_bandcamp_url( 553 | bc_url, 554 | num_tracks=vargs['num_tracks'], 555 | folders=vargs['folders'], 556 | custom_path=vargs['path'], 557 | ) 558 | 559 | # check if we have lists inside a list, which indicates the 560 | # scraping has gone recursive, so we must format the output 561 | # ( reference: http://stackoverflow.com/a/5251706 ) 562 | if any(isinstance(elem, list) for elem in filenames): 563 | # first, remove any empty sublists inside our outter list 564 | # ( reference: http://stackoverflow.com/a/19875634 ) 565 | filenames = [sub for sub in filenames if sub] 566 | # now, make sure we "flatten" the list 567 | # ( reference: http://stackoverflow.com/a/11264751 ) 568 | filenames = [val for sub in filenames for val in sub] 569 | 570 | if vargs['open']: 571 | open_files(filenames) 572 | 573 | return 574 | 575 | 576 | # Largely borrowed from Ronier's bandcampscrape 577 | def scrape_bandcamp_url(url, num_tracks=sys.maxsize, folders=False, custom_path=''): 578 | """ 579 | Pull out artist and track info from a Bandcamp URL. 580 | 581 | Returns: 582 | list: filenames to open 583 | """ 584 | 585 | filenames = [] 586 | album_data = get_bandcamp_metadata(url) 587 | 588 | # If it's a list, we're dealing with a list of Album URLs, 589 | # so we call the scrape_bandcamp_url() method for each one 590 | if type(album_data) is list: 591 | for album_url in album_data: 592 | filenames.append( 593 | scrape_bandcamp_url( 594 | album_url, num_tracks, folders, custom_path 595 | ) 596 | ) 597 | return filenames 598 | 599 | artist = album_data.get("artist") 600 | album_name = album_data.get("album_title") 601 | 602 | if folders: 603 | if album_name: 604 | directory = artist + " - " + album_name 605 | else: 606 | directory = artist 607 | directory = sanitize_filename(directory) 608 | directory = join(custom_path, directory) 609 | if not exists(directory): 610 | mkdir(directory) 611 | 612 | for i, track in enumerate(album_data["trackinfo"]): 613 | 614 | if i > num_tracks - 1: 615 | continue 616 | 617 | try: 618 | track_name = track["title"] 619 | if track["track_num"]: 620 | track_number = str(track["track_num"]).zfill(2) 621 | else: 622 | track_number = None 623 | if track_number and folders: 624 | track_filename = '%s - %s.mp3' % (track_number, track_name) 625 | else: 626 | track_filename = '%s.mp3' % (track_name) 627 | track_filename = sanitize_filename(track_filename) 628 | 629 | if folders: 630 | path = join(directory, track_filename) 631 | else: 632 | path = join(custom_path, sanitize_filename(artist) + ' - ' + track_filename) 633 | 634 | if exists(path): 635 | puts_safe(colored.yellow("Track already downloaded: ") + colored.white(track_name)) 636 | continue 637 | 638 | if not track['file']: 639 | puts_safe(colored.yellow("Track unavailble for scraping: ") + colored.white(track_name)) 640 | continue 641 | 642 | puts_safe(colored.green("Downloading") + colored.white(": " + track_name)) 643 | path = download_file(track['file']['mp3-128'], path) 644 | 645 | album_year = album_data['album_release_date'] 646 | if album_year: 647 | album_year = datetime.strptime(album_year, "%d %b %Y %H:%M:%S GMT").year 648 | 649 | tag_file(path, 650 | artist, 651 | track_name, 652 | album=album_name, 653 | year=album_year, 654 | genre=album_data['genre'], 
655 | artwork_url=album_data['artFullsizeUrl'], 656 | track_number=track_number, 657 | url=album_data['url']) 658 | 659 | filenames.append(path) 660 | 661 | except Exception as e: 662 | puts_safe(colored.red("Problem downloading ") + colored.white(track_name)) 663 | print(e) 664 | return filenames 665 | 666 | 667 | def extract_embedded_json_from_attribute(request, attribute, debug=False): 668 | """ 669 | Extract JSON object embedded in an element's attribute value. 670 | 671 | The JSON is "sloppy". The native python JSON parser often can't deal, 672 | so we use the more tolerant demjson instead. 673 | 674 | Args: 675 | request (obj:`requests.Response`): HTTP GET response from which to extract 676 | attribute (str): name of the attribute holding the desired JSON object 677 | debug (bool, optional): whether to print debug messages 678 | 679 | Returns: 680 | The embedded JSON object as a dict, or None if extraction failed 681 | """ 682 | try: 683 | embed = request.text.split('{}="'.format(attribute))[1] 684 | embed = html_unescape( 685 | embed.split('"')[0] 686 | ) 687 | output = demjson.decode(embed) 688 | if debug: 689 | print( 690 | 'extracted JSON: ' 691 | + demjson.encode( 692 | output, 693 | compactly=False, 694 | indent_amount=2, 695 | ) 696 | ) 697 | except Exception as e: 698 | output = None 699 | if debug: 700 | print(e) 701 | return output 702 | 703 | 704 | def get_bandcamp_metadata(url): 705 | """ 706 | Read information from Bandcamp embedded JavaScript object notation. 707 | The method may return a list of URLs (indicating this is probably a "main" page which links to one or more albums), 708 | or a JSON if we can already parse album/track info from the given url. 709 | """ 710 | request = requests.get(url) 711 | output = {} 712 | try: 713 | for attr in ['data-tralbum', 'data-embed']: 714 | output.update( 715 | extract_embedded_json_from_attribute( 716 | request, attr 717 | ) 718 | ) 719 | # if the JSON parser failed, we should consider it's a "/music" page, 720 | # so we generate a list of albums/tracks and return it immediately 721 | except Exception as e: 722 | regex_all_albums = r'<a href="(/(?:album|track)/[^>]+)">' 723 | all_albums = re.findall(regex_all_albums, request.text, re.MULTILINE) 724 | album_url_list = list() 725 | for album in all_albums: 726 | album_url = re.sub(r'music/?$', '', url) + album 727 | album_url_list.append(album_url) 728 | return album_url_list 729 | # if the JSON parser was successful, use a regex to get all tags 730 | # from this album/track, join them and set it as the "genre" 731 | regex_tags = r'<a class="tag" href[^>]+>([^<]+)</a>' 732 | tags = re.findall(regex_tags, request.text, re.MULTILINE) 733 | # make sure we treat integers correctly with join() 734 | # according to http://stackoverflow.com/a/7323861 735 | # (very unlikely, but better safe than sorry!) 736 | output['genre'] = ' '.join(s for s in tags) 737 | 738 | try: 739 | artUrl = request.text.split("\"tralbumArt\">")[1].split("\">")[0].split("href=\"")[1] 740 | output['artFullsizeUrl'] = artUrl 741 | except: 742 | puts_safe(colored.red("Couldn't get full artwork") + "") 743 | output['artFullsizeUrl'] = None 744 | 745 | return output 746 | 747 | 748 | #################################################################### 749 | # Mixcloud 750 | #################################################################### 751 | 752 | 753 | def process_mixcloud(vargs): 754 | """ 755 | Main MixCloud path. 
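    Accepts a full Mixcloud URL or a bare username, which is expanded to
    https://mixcloud.com/<username>.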
756 | """ 757 | 758 | artist_url = vargs['artist_url'] 759 | 760 | if 'mixcloud.com' in artist_url: 761 | mc_url = artist_url 762 | else: 763 | mc_url = 'https://mixcloud.com/' + artist_url 764 | 765 | filenames = scrape_mixcloud_url(mc_url, num_tracks=vargs['num_tracks'], folders=vargs['folders'], custom_path=vargs['path']) 766 | 767 | if vargs['open']: 768 | open_files(filenames) 769 | 770 | return 771 | 772 | 773 | def scrape_mixcloud_url(mc_url, num_tracks=sys.maxsize, folders=False, custom_path=''): 774 | """ 775 | Returns: 776 | list: filenames to open 777 | 778 | """ 779 | 780 | try: 781 | data = get_mixcloud_data(mc_url) 782 | except Exception as e: 783 | puts_safe(colored.red("Problem downloading ") + mc_url) 784 | print(e) 785 | return [] 786 | 787 | filenames = [] 788 | 789 | track_artist = sanitize_filename(data['artist']) 790 | track_title = sanitize_filename(data['title']) 791 | track_filename = track_artist + ' - ' + track_title + data['mp3_url'][-4:] 792 | 793 | if folders: 794 | track_artist_path = join(custom_path, track_artist) 795 | if not exists(track_artist_path): 796 | mkdir(track_artist_path) 797 | track_filename = join(track_artist_path, track_filename) 798 | if exists(track_filename): 799 | puts_safe(colored.yellow("Skipping") + colored.white(': ' + data['title'] + " - it already exists!")) 800 | return [] 801 | else: 802 | track_filename = join(custom_path, track_filename) 803 | 804 | puts_safe(colored.green("Downloading") + colored.white( 805 | ': ' + data['artist'] + " - " + data['title'] + " (" + track_filename[-4:] + ")")) 806 | download_file(data['mp3_url'], track_filename) 807 | if track_filename[-4:] == '.mp3': 808 | tag_file(track_filename, 809 | artist=data['artist'], 810 | title=data['title'], 811 | year=data['year'], 812 | genre="Mix", 813 | artwork_url=data['artwork_url']) 814 | filenames.append(track_filename) 815 | 816 | return filenames 817 | 818 | 819 | def get_mixcloud_data(url): 820 | """ 821 | Scrapes a Mixcloud page for a track's important information. 822 | 823 | Returns: 824 | dict: containing audio data 825 | 826 | """ 827 | 828 | data = {} 829 | request = requests.get(url) 830 | preview_mp3_url = request.text.split('m-preview="')[1].split('" m-preview-light')[0] 831 | song_uuid = request.text.split('m-preview="')[1].split('" m-preview-light')[0].split('previews/')[1].split('.mp3')[0] 832 | 833 | # Fish for the m4a.. 834 | for server in range(1, 23): 835 | # Ex: https://stream6.mixcloud.com/c/m4a/64/1/2/0/9/30fe-23aa-40da-9bf3-4bee2fba649d.m4a 836 | mp3_url = "https://stream" + str(server) + ".mixcloud.com/c/m4a/64/" + song_uuid + '.m4a' 837 | try: 838 | if requests.head(mp3_url).status_code == 200: 839 | if '?' 
in mp3_url: 840 | mp3_url = mp3_url.split('?')[0] 841 | break 842 | except Exception as e: 843 | continue 844 | 845 | full_title = request.text.split("<title>")[1].split(" | Mixcloud")[0] 846 | title = full_title.split(' by ')[0].strip() 847 | artist = full_title.split(' by ')[1].strip() 848 | 849 | img_thumbnail_url = request.text.split('m-thumbnail-url="')[1].split(" ng-class")[0] 850 | artwork_url = img_thumbnail_url.replace('60/', '300/').replace('60/', '300/').replace('//', 'https://').replace('"', 851 | '') 852 | 853 | data['mp3_url'] = mp3_url 854 | data['title'] = title 855 | data['artist'] = artist 856 | data['artwork_url'] = artwork_url 857 | data['year'] = None 858 | 859 | return data 860 | 861 | 862 | #################################################################### 863 | # Audiomack 864 | #################################################################### 865 | 866 | 867 | def process_audiomack(vargs): 868 | """ 869 | Main Audiomack path. 870 | """ 871 | 872 | artist_url = vargs['artist_url'] 873 | 874 | if 'audiomack.com' in artist_url: 875 | mc_url = artist_url 876 | else: 877 | mc_url = 'https://audiomack.com/' + artist_url 878 | 879 | filenames = scrape_audiomack_url(mc_url, num_tracks=vargs['num_tracks'], folders=vargs['folders'], custom_path=vargs['path']) 880 | 881 | if vargs['open']: 882 | open_files(filenames) 883 | 884 | return 885 | 886 | 887 | def scrape_audiomack_url(mc_url, num_tracks=sys.maxsize, folders=False, custom_path=''): 888 | """ 889 | Returns: 890 | list: filenames to open 891 | 892 | """ 893 | 894 | try: 895 | data = get_audiomack_data(mc_url) 896 | except Exception as e: 897 | puts_safe(colored.red("Problem downloading ") + mc_url) 898 | print(e) 899 | 900 | filenames = [] 901 | 902 | track_artist = sanitize_filename(data['artist']) 903 | track_title = sanitize_filename(data['title']) 904 | track_filename = track_artist + ' - ' + track_title + '.mp3' 905 | 906 | if folders: 907 | track_artist_path = join(custom_path, track_artist) 908 | if not exists(track_artist_path): 909 | mkdir(track_artist_path) 910 | track_filename = join(track_artist_path, track_filename) 911 | if exists(track_filename): 912 | puts_safe(colored.yellow("Skipping") + colored.white(': ' + data['title'] + " - it already exists!")) 913 | return [] 914 | else: 915 | track_filename = join(custom_path, track_filename) 916 | 917 | puts_safe(colored.green("Downloading") + colored.white(': ' + data['artist'] + " - " + data['title'])) 918 | download_file(data['mp3_url'], track_filename) 919 | tag_file(track_filename, 920 | artist=data['artist'], 921 | title=data['title'], 922 | year=data['year'], 923 | genre=None, 924 | artwork_url=data['artwork_url']) 925 | filenames.append(track_filename) 926 | 927 | return filenames 928 | 929 | 930 | def get_audiomack_data(url): 931 | """ 932 | Scrapes a Mixcloud page for a track's important information. 
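    The markup parsed here is Audiomack's: the download link, the
    artist/title spans, and the lightbox artwork URL.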
933 | 934 | Returns: 935 | dict: containing audio data 936 | 937 | """ 938 | 939 | data = {} 940 | request = requests.get(url) 941 | 942 | mp3_url = request.text.split('class="player-icon download-song" title="Download" href="')[1].split('"')[0] 943 | artist = request.text.split('<span class="artist">')[1].split('</span>')[0].strip() 944 | title = request.text.split('<span class="artist">')[1].split('</span>')[1].split('</h1>')[0].strip() 945 | artwork_url = request.text.split('<a class="lightbox-trigger" href="')[1].split('" data')[0].strip() 946 | 947 | data['mp3_url'] = mp3_url 948 | data['title'] = title 949 | data['artist'] = artist 950 | data['artwork_url'] = artwork_url 951 | data['year'] = None 952 | 953 | return data 954 | 955 | 956 | #################################################################### 957 | # Hive.co 958 | #################################################################### 959 | 960 | 961 | def process_hive(vargs): 962 | """ 963 | Main Hive.co path. 964 | """ 965 | 966 | artist_url = vargs['artist_url'] 967 | 968 | if 'hive.co' in artist_url: 969 | mc_url = artist_url 970 | else: 971 | mc_url = 'https://www.hive.co/downloads/download/' + artist_url 972 | 973 | filenames = scrape_hive_url(mc_url, num_tracks=vargs['num_tracks'], folders=vargs['folders'], custom_path=vargs['path']) 974 | 975 | if vargs['open']: 976 | open_files(filenames) 977 | 978 | return 979 | 980 | 981 | def scrape_hive_url(mc_url, num_tracks=sys.maxsize, folders=False, custom_path=''): 982 | """ 983 | Scrape a Hive.co download page. 984 | 985 | Returns: 986 | list: filenames to open 987 | 988 | """ 989 | 990 | try: 991 | data = get_hive_data(mc_url) 992 | except Exception as e: 993 | puts_safe(colored.red("Problem downloading ") + mc_url) 994 | print(e) 995 | 996 | filenames = [] 997 | 998 | # track_artist = sanitize_filename(data['artist']) 999 | # track_title = sanitize_filename(data['title']) 1000 | # track_filename = track_artist + ' - ' + track_title + '.mp3' 1001 | 1002 | # if folders: 1003 | # track_artist_path = join(custom_path, track_artist) 1004 | # if not exists(track_artist_path): 1005 | # mkdir(track_artist_path) 1006 | # track_filename = join(track_artist_path, track_filename) 1007 | # if exists(track_filename): 1008 | # puts_safe(colored.yellow("Skipping") + colored.white(': ' + data['title'] + " - it already exists!")) 1009 | # return [] 1010 | 1011 | # puts_safe(colored.green("Downloading") + colored.white(': ' + data['artist'] + " - " + data['title'])) 1012 | # download_file(data['mp3_url'], track_filename) 1013 | # tag_file(track_filename, 1014 | # artist=data['artist'], 1015 | # title=data['title'], 1016 | # year=data['year'], 1017 | # genre=None, 1018 | # artwork_url=data['artwork_url']) 1019 | # filenames.append(track_filename) 1020 | 1021 | return filenames 1022 | 1023 | 1024 | def get_hive_data(url): 1025 | """ 1026 | 1027 | Scrapes a Mixcloud page for a track's important information. 1028 | 1029 | Returns a dict of data. 
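    The field extraction below is currently commented out, so the returned
    dict is empty.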
1030 | 1031 | """ 1032 | 1033 | data = {} 1034 | request = requests.get(url) 1035 | 1036 | # import pdb 1037 | # pdb.set_trace() 1038 | 1039 | # mp3_url = request.text.split('class="player-icon download-song" title="Download" href="')[1].split('"')[0] 1040 | # artist = request.text.split('<span class="artist">')[1].split('</span>')[0].strip() 1041 | # title = request.text.split('<span class="artist">')[1].split('</span>')[1].split('</h1>')[0].strip() 1042 | # artwork_url = request.text.split('<a class="lightbox-trigger" href="')[1].split('" data')[0].strip() 1043 | 1044 | # data['mp3_url'] = mp3_url 1045 | # data['title'] = title 1046 | # data['artist'] = artist 1047 | # data['artwork_url'] = artwork_url 1048 | # data['year'] = None 1049 | 1050 | return data 1051 | 1052 | 1053 | #################################################################### 1054 | # MusicBed 1055 | #################################################################### 1056 | 1057 | 1058 | def process_musicbed(vargs): 1059 | """ 1060 | Main MusicBed path. 1061 | """ 1062 | 1063 | # let's validate given MusicBed url 1064 | validated = False 1065 | if vargs['artist_url'].startswith( 'https://www.musicbed.com/' ): 1066 | splitted = vargs['artist_url'][len('https://www.musicbed.com/'):].split( '/' ) 1067 | if len( splitted ) == 3: 1068 | if ( splitted[0] == 'artists' or splitted[0] == 'albums' or splitted[0] == 'songs' ) and splitted[2].isdigit(): 1069 | validated = True 1070 | 1071 | if not validated: 1072 | puts( colored.red( 'process_musicbed: you provided incorrect MusicBed url. Aborting.' ) ) 1073 | puts( colored.white( 'Please make sure that url is either artist-url, album-url or song-url.' ) ) 1074 | puts( colored.white( 'Example of correct artist-url: https://www.musicbed.com/artists/lights-motion/5188' ) ) 1075 | puts( colored.white( 'Example of correct album-url: https://www.musicbed.com/albums/be-still/2828' ) ) 1076 | puts( colored.white( 'Example of correct song-url: https://www.musicbed.com/songs/be-still/24540' ) ) 1077 | return 1078 | 1079 | filenames = scrape_musicbed_url(vargs['artist_url'], vargs['login'], vargs['password'], num_tracks=vargs['num_tracks'], folders=vargs['folders'], custom_path=vargs['path']) 1080 | 1081 | if vargs['open']: 1082 | open_files(filenames) 1083 | 1084 | 1085 | def scrape_musicbed_url(url, login, password, num_tracks=sys.maxsize, folders=False, custom_path=''): 1086 | """ 1087 | Scrapes provided MusicBed url. 1088 | Uses requests' Session object in order to store cookies. 1089 | Requires login and password information. 1090 | If provided url is of pattern 'https://www.musicbed.com/artists/<string>/<number>' - a number of albums will be downloaded. 1091 | If provided url is of pattern 'https://www.musicbed.com/albums/<string>/<number>' - only one album will be downloaded. 1092 | If provided url is of pattern 'https://www.musicbed.com/songs/<string>/<number>' - will be treated as one album (but download only 1st track). 1093 | Metadata and urls are obtained from JavaScript data that's treated as JSON data. 1094 | 1095 | Returns: 1096 | list: filenames to open 1097 | """ 1098 | 1099 | session = requests.Session() 1100 | 1101 | response = session.get( url ) 1102 | if response.status_code != 200: 1103 | puts( colored.red( 'scrape_musicbed_url: couldn\'t open provided url. Status code: ' + str( response.status_code ) + '. Aborting.' 
) ) 1104 | session.close() 1105 | return [] 1106 | 1107 | albums = [] 1108 | # let's determine what url type we got 1109 | # '/artists/' - search for and download many albums 1110 | # '/albums/' - means we're downloading 1 album 1111 | # '/songs/' - means 1 album as well, but we're forcing num_tracks=1 in order to download only first relevant track 1112 | if url.startswith( 'https://www.musicbed.com/artists/' ): 1113 | # a hackjob code to get a list of available albums 1114 | main_index = 0 1115 | while response.text.find( 'https://www.musicbed.com/albums/', main_index ) != -1: 1116 | start_index = response.text.find( 'https://www.musicbed.com/albums/', main_index ) 1117 | end_index = response.text.find( '">', start_index ) 1118 | albums.append( response.text[start_index:end_index] ) 1119 | main_index = end_index 1120 | elif url.startswith( 'https://www.musicbed.com/songs/' ): 1121 | albums.append( url ) 1122 | num_tracks = 1 1123 | else: # url.startswith( 'https://www.musicbed.com/albums/' ) 1124 | albums.append( url ) 1125 | 1126 | # let's get our token and try to login (csrf_token seems to be present on every page) 1127 | token = response.text.split( 'var csrf_token = "' )[1].split( '";' )[0] 1128 | details = { '_token': token, 'login': login, 'password': password } 1129 | response = session.post( 'https://www.musicbed.com/ajax/login', data=details ) 1130 | if response.status_code != 200: 1131 | puts( colored.red( 'scrape_musicbed_url: couldn\'t login. Aborting. ' ) + colored.white( 'Couldn\'t access login page.' ) ) 1132 | session.close() 1133 | return [] 1134 | login_response_data = demjson.decode( response.text ) 1135 | if not login_response_data['body']['status']: 1136 | puts( colored.red( 'scrape_musicbed_url: couldn\'t login. Aborting. ' ) + colored.white( 'Did you provide correct login and password?' ) ) 1137 | session.close() 1138 | return [] 1139 | 1140 | # now let's actually scrape collected pages 1141 | filenames = [] 1142 | for each_album_url in albums: 1143 | response = session.get( each_album_url ) 1144 | if response.status_code != 200: 1145 | puts_safe( colored.red( 'scrape_musicbed_url: couldn\'t open url: ' + each_album_url + 1146 | '. Status code: ' + str( response.status_code ) + '. Skipping.' ) ) 1147 | continue 1148 | 1149 | # actually not a JSON, but a JS object, but so far so good 1150 | json = response.text.split( 'App.components.SongRows = ' )[1].split( '</script>' )[0] 1151 | data = demjson.decode( json ) 1152 | 1153 | song_count = 1 1154 | for each_song in data['loadedSongs']: 1155 | if song_count > num_tracks: 1156 | break 1157 | 1158 | try: 1159 | url, params = each_song['playback_url'].split( '?' 
) 1160 | 1161 | details = dict() 1162 | for each_param in params.split( '&' ): 1163 | name, value = each_param.split( '=' ) 1164 | details.update( { name: value } ) 1165 | # musicbed warns about it if it's not fixed 1166 | details['X-Amz-Credential'] = details['X-Amz-Credential'].replace( '%2F', '/' ) 1167 | 1168 | directory = custom_path 1169 | if folders: 1170 | sanitized_artist = sanitize_filename( each_song['album']['data']['artist']['data']['name'] ) 1171 | sanitized_album = sanitize_filename( each_song['album']['data']['name'] ) 1172 | directory = join( directory, sanitized_artist + ' - ' + sanitized_album ) 1173 | if not exists( directory ): 1174 | mkdir( directory ) 1175 | filename = join( directory, str( song_count ) + ' - ' + sanitize_filename( each_song['name'] ) + '.mp3' ) 1176 | 1177 | if exists( filename ): 1178 | puts_safe( colored.yellow( 'Skipping' ) + colored.white( ': ' + each_song['name'] + ' - it already exists!' ) ) 1179 | song_count += 1 1180 | continue 1181 | 1182 | puts_safe( colored.green( 'Downloading' ) + colored.white( ': ' + each_song['name'] ) ) 1183 | path = download_file( url, filename, session=session, params=details ) 1184 | 1185 | # example of genre_string: 1186 | # "<a href=\"https://www.musicbed.com/genres/ambient/2\">Ambient</a> <a href=\"https://www.musicbed.com/genres/cinematic/4\">Cinematic</a>" 1187 | genres = '' 1188 | for each in each_song['genre_string'].split( '</a>' ): 1189 | if ( each != "" ): 1190 | genres += each.split( '">' )[1] + '/' 1191 | genres = genres[:-1] # removing last '/ 1192 | 1193 | tag_file(path, 1194 | each_song['album']['data']['artist']['data']['name'], 1195 | each_song['name'], 1196 | album=each_song['album']['data']['name'], 1197 | year=int( each_song['album']['data']['released_at'].split( '-' )[0] ), 1198 | genre=genres, 1199 | artwork_url=each_song['album']['data']['imageObject']['data']['paths']['original'], 1200 | track_number=str( song_count ), 1201 | url=each_song['song_url']) 1202 | 1203 | filenames.append( path ) 1204 | song_count += 1 1205 | except: 1206 | puts_safe( colored.red( 'Problem downloading ' ) + colored.white( each_song['name'] ) + '. Skipping.' ) 1207 | song_count += 1 1208 | 1209 | session.close() 1210 | 1211 | return filenames 1212 | 1213 | 1214 | #################################################################### 1215 | # File Utility 1216 | #################################################################### 1217 | 1218 | 1219 | def download_file(url, path, session=None, params=None): 1220 | """ 1221 | Download an individual file. 1222 | """ 1223 | 1224 | if url[0:2] == '//': 1225 | url = 'https://' + url[2:] 1226 | 1227 | # Use a temporary file so that we don't import incomplete files. 1228 | tmp_path = path + '.tmp' 1229 | 1230 | if session and params: 1231 | r = session.get( url, params=params, stream=True ) 1232 | elif session and not params: 1233 | r = session.get( url, stream=True ) 1234 | else: 1235 | r = requests.get(url, stream=True) 1236 | with open(tmp_path, 'wb') as f: 1237 | total_length = int(r.headers.get('content-length', 0)) 1238 | for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1): 1239 | if chunk: # filter out keep-alive new chunks 1240 | f.write(chunk) 1241 | f.flush() 1242 | 1243 | os.rename(tmp_path, path) 1244 | 1245 | return path 1246 | 1247 | 1248 | def tag_file(filename, artist, title, year=None, genre=None, artwork_url=None, album=None, track_number=None, url=None): 1249 | """ 1250 | Attempt to put ID3 tags on a file. 
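    Text frames are written with EasyMP3; album art (APIC) and the URL
    (WXXX) are added in a second pass with the raw ID3 interface. Returns
    True on success and False if the file could not be tagged, in which
    case callers rename the download to .wav.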
1251 | 1252 | Args: 1253 | artist (str): 1254 | title (str): 1255 | year (int): 1256 | genre (str): 1257 | artwork_url (str): 1258 | album (str): 1259 | track_number (str): 1260 | filename (str): 1261 | url (str): 1262 | """ 1263 | 1264 | try: 1265 | audio = EasyMP3(filename) 1266 | audio.tags = None 1267 | audio["artist"] = artist 1268 | audio["title"] = title 1269 | if year: 1270 | audio["date"] = str(year) 1271 | if album: 1272 | audio["album"] = album 1273 | if track_number: 1274 | audio["tracknumber"] = track_number 1275 | if genre: 1276 | audio["genre"] = genre 1277 | if url: # saves the tag as WOAR 1278 | audio["website"] = url 1279 | audio.save() 1280 | 1281 | if artwork_url: 1282 | 1283 | artwork_url = artwork_url.replace('https', 'http') 1284 | 1285 | mime = 'image/jpeg' 1286 | if '.jpg' in artwork_url: 1287 | mime = 'image/jpeg' 1288 | if '.png' in artwork_url: 1289 | mime = 'image/png' 1290 | 1291 | if '-large' in artwork_url: 1292 | new_artwork_url = artwork_url.replace('-large', '-t500x500') 1293 | try: 1294 | image_data = requests.get(new_artwork_url).content 1295 | except Exception as e: 1296 | # No very large image available. 1297 | image_data = requests.get(artwork_url).content 1298 | else: 1299 | image_data = requests.get(artwork_url).content 1300 | 1301 | audio = MP3(filename, ID3=OldID3) 1302 | audio.tags.add( 1303 | APIC( 1304 | encoding=3, # 3 is for utf-8 1305 | mime=mime, 1306 | type=3, # 3 is for the cover image 1307 | desc='Cover', 1308 | data=image_data 1309 | ) 1310 | ) 1311 | audio.save() 1312 | 1313 | # because there is software that doesn't seem to use WOAR we save url tag again as WXXX 1314 | if url: 1315 | audio = MP3(filename, ID3=OldID3) 1316 | audio.tags.add( WXXX( encoding=3, url=url ) ) 1317 | audio.save() 1318 | 1319 | return True 1320 | 1321 | except Exception as e: 1322 | puts(colored.red("Problem tagging file: ") + colored.white("Is this file a WAV?")) 1323 | return False 1324 | 1325 | def open_files(filenames): 1326 | """ 1327 | Call the system 'open' command on a file. 1328 | """ 1329 | command = ['open'] + filenames 1330 | process = Popen(command, stdout=PIPE, stderr=PIPE) 1331 | stdout, stderr = process.communicate() 1332 | 1333 | 1334 | def sanitize_filename(filename): 1335 | """ 1336 | Make sure filenames are valid paths. 1337 | 1338 | Returns: 1339 | str: 1340 | """ 1341 | sanitized_filename = re.sub(r'[/\\:*?"<>|]', '-', filename) 1342 | sanitized_filename = sanitized_filename.replace('&', 'and') 1343 | sanitized_filename = sanitized_filename.replace('"', '') 1344 | sanitized_filename = sanitized_filename.replace("'", '') 1345 | sanitized_filename = sanitized_filename.replace("/", '') 1346 | sanitized_filename = sanitized_filename.replace("\\", '') 1347 | 1348 | # Annoying. 
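    # A leading dot would make the file hidden on Unix-like systems, so it is
    # replaced with a literal 'dot' prefix.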
1349 | if sanitized_filename.startswith('.'): 1350 | sanitized_filename = u'dot' + sanitized_filename[1:] 1351 | 1352 | return sanitized_filename 1353 | 1354 | def puts_safe(text): 1355 | if sys.platform == "win32": 1356 | if sys.version_info < (3,0,0): 1357 | puts(text) 1358 | else: 1359 | puts(text.encode(sys.stdout.encoding, errors='replace').decode()) 1360 | else: 1361 | puts(text) 1362 | 1363 | 1364 | #################################################################### 1365 | # Main 1366 | #################################################################### 1367 | 1368 | if __name__ == '__main__': 1369 | try: 1370 | sys.exit(main()) 1371 | except Exception as e: 1372 | print(e) 1373 | -------------------------------------------------------------------------------- /test.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | nosetests 3 | -------------------------------------------------------------------------------- /tests/test.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | import sys 4 | import unittest 5 | 6 | from mutagen.mp3 import EasyMP3 7 | from soundscrape.soundscrape import get_client 8 | from soundscrape.soundscrape import process_soundcloud 9 | from soundscrape.soundscrape import process_bandcamp 10 | 11 | 12 | def rm_mp3(): 13 | """ Delete all ``*.mp3`` files in the current directory. 14 | """ 15 | for f in glob.glob('*.mp3'): 16 | os.unlink(f) 17 | 18 | 19 | class TestSoundscrape(unittest.TestCase): 20 | 21 | ## 22 | # Basic Tests 23 | ## 24 | 25 | def test_test(self): 26 | self.assertTrue(True) 27 | 28 | def test_get_client(self): 29 | client = get_client() 30 | self.assertTrue(bool(client)) 31 | 32 | def test_soundcloud(self): 33 | rm_mp3() 34 | mp3_count = len(glob.glob1('', "*.mp3")) 35 | vargs = {'path':'', 'folders': False, 'group': False, 'track': '', 'num_tracks': 9223372036854775807, 'bandcamp': False, 'downloadable': False, 'likes': False, 'open': False, 'artist_url': 'https://soundcloud.com/fzpz/revised', 'keep': True} 36 | process_soundcloud(vargs) 37 | new_mp3_count = len(glob.glob1('', "*.mp3")) 38 | self.assertTrue(new_mp3_count > mp3_count) 39 | rm_mp3() 40 | 41 | def test_soundcloud_hard(self): 42 | rm_mp3() 43 | mp3_count = len(glob.glob1('', "*.mp3")) 44 | vargs = {'path':'', 'folders': False, 'group': False, 'track': '', 'num_tracks': 1, 'bandcamp': False, 'downloadable': False, 'likes': False, 'open': False, 'artist_url': 'puptheband', 'keep': False} 45 | process_soundcloud(vargs) 46 | new_mp3_count = len(glob.glob1('', "*.mp3")) 47 | self.assertTrue(new_mp3_count > mp3_count) 48 | self.assertTrue(new_mp3_count == 1) # This used to be 3, but is now 'Not available in United States.' 49 | rm_mp3() 50 | 51 | def test_soundcloud_hard_2(self): 52 | rm_mp3() 53 | mp3_count = len(glob.glob1('', "*.mp3")) 54 | vargs = {'path':'', 'folders': False, 'group': False, 'track': '', 'num_tracks': 1, 'bandcamp': False, 'downloadable': False, 'likes': False, 'open': False, 'artist_url': 'https://soundcloud.com/lostdogz/snuggles-chapstick', 'keep': False} 55 | process_soundcloud(vargs) 56 | new_mp3_count = len(glob.glob1('', "*.mp3")) 57 | self.assertTrue(new_mp3_count > mp3_count) 58 | self.assertTrue(new_mp3_count == 1) # This used to be 3, but is now 'Not available in United States.' 59 | rm_mp3() 60 | 61 | # The test URL for this is no longer a WAV. Need a new testcase.
62 | # 63 | # def test_soundcloud_wav(self): 64 | # for f in glob.glob('*.wav'): 65 | # os.unlink(f) 66 | 67 | # wav_count = len(glob.glob1('', "*.wav")) 68 | # vargs = {'path':'', 'folders': False, 'group': False, 'track': '', 'num_tracks': 1, 'bandcamp': False, 'downloadable': False, 'likes': False, 'open': False, 'artist_url': 'https://soundcloud.com/coastal/major-lazer-aerosol-can-coastal-flip', 'keep': False} 69 | # process_soundcloud(vargs) 70 | # new_wav_count = len(glob.glob1('', "*.wav")) 71 | # self.assertTrue(new_wav_count > wav_count) 72 | # self.assertTrue(new_wav_count == 1) 73 | 74 | # for f in glob.glob('*.wav'): 75 | # os.unlink(f) 76 | 77 | def test_bandcamp(self): 78 | rm_mp3() 79 | mp3_count = len(glob.glob1('', "*.mp3")) 80 | vargs = {'path':'', 'folders': False, 'group': False, 'track': '', 'num_tracks': 9223372036854775807, 'bandcamp': False, 'downloadable': False, 'likes': False, 'open': False, 'artist_url': 'https://atenrays.bandcamp.com/track/who-u-think'} 81 | process_bandcamp(vargs) 82 | new_mp3_count = len(glob.glob1('', "*.mp3")) 83 | self.assertTrue(new_mp3_count > mp3_count) 84 | rm_mp3() 85 | 86 | def test_bandcamp_slashes(self): 87 | rm_mp3() 88 | mp3_count = len(glob.glob1('', "*.mp3")) 89 | vargs = {'path':'', 'folders': False, 'group': False, 'track': '', 'num_tracks': 9223372036854775807, 'bandcamp': False, 'downloadable': False, 'likes': False, 'open': False, 'artist_url': 'https://defill.bandcamp.com/track/amnesia-chamber-harvest-skit'} 90 | process_bandcamp(vargs) 91 | new_mp3_count = len(glob.glob1('', "*.mp3")) 92 | self.assertTrue(new_mp3_count > mp3_count) 93 | rm_mp3() 94 | 95 | def test_bandcamp_html_entities(self): 96 | rm_mp3() 97 | vargs = {'path': '', 'folders': False, 'num_tracks': sys.maxsize, 'open': False, 'artist_url': 'https://anaalnathrakh.bandcamp.com/track/man-at-c-a-bonus-track'} 98 | process_bandcamp(vargs) 99 | mp3s = glob.glob('*.mp3') 100 | self.assertEquals(1, len(mp3s)) 101 | fn = mp3s[0] 102 | self.assertTrue('CandA' in fn) 103 | t = EasyMP3(fn)['title'] 104 | self.assertTrue('C&A' in t[0]) 105 | rm_mp3() 106 | 107 | 108 | # def test_musicbed(self): 109 | # for f in glob.glob('*.mp3'): 110 | # os.unlink(f) 111 | 112 | # mp3_count = len(glob.glob1('', "*.mp3")) 113 | # vargs = {'login':'musicbedtest@gmail.com', 'password':'oo6alY9T', 'path':'', 'folders': False, 'group': False, 'track': '', 'num_tracks': 9223372036854775807, 'bandcamp': False, 'downloadable': False, 'likes': False, 'open': False, 'artist_url': 'https://www.musicbed.com/albums/be-still/2828'} 114 | # process_musicbed(vargs) 115 | # new_mp3_count = len(glob.glob1('', "*.mp3")) 116 | # self.assertTrue(new_mp3_count > mp3_count) 117 | 118 | # for f in glob.glob('*.mp3'): 119 | # os.unlink(f) 120 | 121 | def test_mixcloud(self): 122 | """ 123 | MixCloud is being blocked from Travis, interestingly. 
124 | """ 125 | 126 | # rm_mp3() 127 | # for f in glob.glob('*.m4a'): 128 | # os.unlink(f) 129 | 130 | # shortest mix I could find that was still semi tolerable 131 | #mp3_count = len(glob.glob1('', "*.mp3")) 132 | #m4a_count = len(glob.glob1('', "*.m4a")) 133 | #vargs = {'path':'', 'folders': False, 'group': False, 'track': '', 'num_tracks': 9223372036854775807, 'bandcamp': False, 'downloadable': False, 'likes': False, 'open': False, 'artist_url': 'https://www.mixcloud.com/Bobby_T_FS15/coffee-cigarettes-saturday-morning-hip-hop-fix/'} 134 | #process_mixcloud(vargs) 135 | #new_mp3_count = len(glob.glob1('', "*.mp3")) 136 | #new_m4a_count = len(glob.glob1('', "*.m4a")) 137 | #self.assertTrue((new_mp3_count > mp3_count) or (new_m4a_count > m4a_count)) 138 | 139 | # rm_mp3() 140 | # for f in glob.glob('*.m4a'): 141 | # os.unlink(f) 142 | 143 | # def test_audiomack(self): 144 | # for f in glob.glob('*.mp3'): 145 | # os.unlink(f) 146 | 147 | # mp3_count = len(glob.glob1('', "*.mp3")) 148 | # vargs = {'path':'', 'folders': False, 'group': False, 'track': '', 'num_tracks': 9223372036854775807, 'bandcamp': False, 'audiomack': True, 'downloadable': False, 'likes': False, 'open': False, 'artist_url': 'https://www.audiomack.com/song/bottomfeedermusic/power'} 149 | # process_audiomack(vargs) 150 | # new_mp3_count = len(glob.glob1('', "*.mp3")) 151 | # self.assertTrue(new_mp3_count > mp3_count) 152 | 153 | # for f in glob.glob('*.mp3'): 154 | # os.unlink(f) 155 | 156 | if __name__ == '__main__': 157 | unittest.main() 158 | --------------------------------------------------------------------------------