├── LICENSE
├── README.md
└── itunes.py
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2015, Tuts+
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# [Crawl the Web With Python][published url]
## Instructor: [Derek Jensen][instructor url]


In a recent business endeavor, I found it necessary to collect bulk data from different online sources in order to centralize it and make it easier for people to find and make sense of.
I would have liked it if those sites had exposed public APIs, but unfortunately none did. So I decided to try my hand at a little web crawling and scraping to obtain this metadata, and after searching around for a while, I found that many people faced with the same issue had turned to Python. After all, if it's good enough for Google, it's definitely good enough for me!

In this course I will share some of my findings and show you how you can go about creating your own basic web crawler and scraper.

## Source Files Description

The full project source can be found in the `itunes.py` file.


------

These are source files for the Tuts+ course: [Crawl the Web With Python][published url]

Available on [Tuts+](https://tutsplus.com). Teaching skills to millions worldwide.

[published url]: https://code.tutsplus.com/courses/crawl-the-web-with-python
[instructor url]: https://tutsplus.com/authors/derek-jensen
--------------------------------------------------------------------------------
/itunes.py:
--------------------------------------------------------------------------------
from lxml import html
import requests
import time


class AppCrawler:

    def __init__(self, starting_url, depth):
        self.starting_url = starting_url
        self.depth = depth              # how many levels of links to follow
        self.current_depth = 0
        self.depth_links = []           # links discovered at each depth level
        self.apps = []

    def crawl(self):
        # Fetch the starting page, then follow its links breadth-first,
        # one depth level at a time.
        app = self.get_app_from_link(self.starting_url)
        self.apps.append(app)
        self.depth_links.append(app.links)

        while self.current_depth < self.depth:
            current_links = []
            for link in self.depth_links[self.current_depth]:
                current_app = self.get_app_from_link(link)
                current_links.extend(current_app.links)
                self.apps.append(current_app)
                time.sleep(5)  # be polite: pause between requests
            self.current_depth += 1
            self.depth_links.append(current_links)

    def get_app_from_link(self, link):
        start_page = requests.get(link, timeout=10)
        tree = html.fromstring(start_page.text)

        # These XPath queries target the iTunes app-page markup as it
        # existed when the course was recorded.
        name = tree.xpath('//h1[@itemprop="name"]/text()')[0]
        developer = tree.xpath('//div[@class="left"]/h2/text()')[0]
        price = tree.xpath('//div[@itemprop="price"]/text()')[0]
        links = tree.xpath('//div[@class="center-stack"]//*/a[@class="name"]/@href')

        app = App(name, developer, price, links)

        return app


class App:

    def __init__(self, name, developer, price, links):
        self.name = name
        self.developer = developer
        self.price = price
        self.links = links

    def __str__(self):
        # Python 3 strings are already Unicode, so no .encode() is needed
        # when building the display string.
        return ("Name: " + self.name +
                "\r\nDeveloper: " + self.developer +
                "\r\nPrice: " + self.price + "\r\n")


crawler = AppCrawler('https://itunes.apple.com/us/app/candy-crush-saga/id553834731', 2)
crawler.crawl()

for app in crawler.apps:
    print(app)
--------------------------------------------------------------------------------
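
The `crawl` method in `itunes.py` is a depth-limited breadth-first traversal: it gathers all the links found at one level before moving on to the next, and `depth` caps how many levels it follows. Stripped of the network calls, the same logic can be sketched against a hypothetical in-memory link graph (the `graph` dict below is made up for illustration; it stands in for the pages the crawler would fetch):

```python
# Hypothetical link graph standing in for the network: each key is a
# "page" and its value is the list of links found on that page.
graph = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "d": [],
}

def crawl(starting_url, depth):
    visited = [starting_url]              # pages fetched, in crawl order
    depth_links = [graph[starting_url]]   # links found at each depth level
    current_depth = 0
    while current_depth < depth:
        current_links = []
        for link in depth_links[current_depth]:
            visited.append(link)
            current_links.extend(graph[link])
        current_depth += 1
        depth_links.append(current_links)
    return visited

print(crawl("root", 2))  # ['root', 'a', 'b', 'c']
```

Note that with `depth=2` the crawler never reaches `"d"`: it is discovered at level 2 but the loop stops before fetching level-2 links, which is exactly how `AppCrawler` bounds its traversal.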
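
The XPath queries in `get_app_from_link` assume the iTunes page markup as it stood when the course was recorded (the `itemprop` attributes and the `left`/`center-stack`/`name` class names are specific to that layout, and Apple's current pages may no longer match). The extraction technique itself can be exercised offline on a small HTML snippet whose markup mirrors what the queries expect:

```python
from lxml import html

# A minimal stand-in for an app page, using the same attributes and
# class names the crawler's XPath queries look for.
snippet = """
<html><body>
  <h1 itemprop="name">Example App</h1>
  <div class="left"><h2>Example Developer</h2></div>
  <div itemprop="price">Free</div>
  <div class="center-stack">
    <div><a class="name" href="https://example.com/app/one">One</a></div>
    <div><a class="name" href="https://example.com/app/two">Two</a></div>
  </div>
</body></html>
"""

tree = html.fromstring(snippet)

# The same four queries used by get_app_from_link:
name = tree.xpath('//h1[@itemprop="name"]/text()')[0]
developer = tree.xpath('//div[@class="left"]/h2/text()')[0]
price = tree.xpath('//div[@itemprop="price"]/text()')[0]
links = tree.xpath('//div[@class="center-stack"]//*/a[@class="name"]/@href')

print(name, developer, price, links)
```

One subtlety worth noting: the `//*/a` step requires an intermediate element between the `center-stack` div and each anchor, which is why the anchors above are wrapped in inner divs; an `<a>` that was a direct child of `center-stack` would not match.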