├── .gitignore
├── README.md
├── automate your hack nyu form
    ├── __init__.py
    ├── data1.csv
    ├── fill_up_nyu.py
    └── hack_nyu.py
├── download all django videos
    ├── __init__.py
    ├── download_videos_godjango.py
    └── go_django.py
├── html dom tree.png
├── identify xpath.md
├── scrape all donald trump quotes
    ├── __init__.py
    ├── brain_quote_page.py
    ├── extract_donald_trump_quotes.py
    └── write_data_1.csv
├── scrape top tech news from hacker news website
    ├── __init__.py
    ├── extact_hacker_news.py
    ├── hacker_news.py
    └── write_data_1.csv
├── selenium-installation-guide.md
└── short tutorial to handle csv files
    ├── __init__.py
    └── read_write_csv.py


/.gitignore:
--------------------------------------------------------------------------------
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so
.idea
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Created by .ignore support plugin (hsz.mobi)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Web-Scraping using Selenium

# Why do we need Selenium?

Some websites don't like being scraped, and in that case you need to disguise your web-scraping bot as a human being. Selenium helps here because it drives a real browser.


# What is a locator, a CSS selector, or an XPath?

A locator is an address that uniquely identifies a web element within a webpage. Locators are the HTML
properties of a web element that tell Selenium which element it needs to act on.

There is a diverse range of web elements. The most common among them are:

- Text box
- Button
- Drop Down
- Hyperlink
- Check Box
- Radio Button

Types of Locators in Selenium

Photo Credit - www.softwaretestinghelp.com

![Locators in Selenium](http://cdn2.softwaretestinghelp.com/wp-content/qa/uploads/2014/10/Types-of-Locators-in-Selenium-1.jpg)


**XPATH**

XPath is used to locate a web element based on its XML path. XML stands
for Extensible Markup Language and is used to store, organize and transport
arbitrary data. It stores data in nested, tagged elements, much like HTML.
Because both are markup languages with the same tree structure, XPath can
also be used to locate HTML elements.
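
To make the idea of an XML path concrete, here is a minimal sketch that uses Python's standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath. The markup in `PAGE` and the `find_by_xpath` helper are invented for illustration and are not part of this repository; Selenium itself evaluates XPath through the browser, so this shows the path syntax rather than Selenium's behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this example. It is well-formed XML
# that also happens to look like an HTML page.
PAGE = """
<html>
  <body>
    <form name="signup">
      <input name="email" type="text" />
      <button type="submit">Send</button>
    </form>
  </body>
</html>
"""


def find_by_xpath(markup, xpath):
    """Return all elements in `markup` matched by `xpath` (ElementTree subset)."""
    root = ET.fromstring(markup.strip())
    return root.findall(xpath)


# A path that walks down from the root <html> element, one level at a time.
buttons = find_by_xpath(PAGE, "./body/form/button")

# A path that searches the whole tree and filters on an attribute value.
emails = find_by_xpath(PAGE, ".//input[@name='email']")

print(buttons[0].text)        # Send
print(emails[0].get("type"))  # text
```

The first path mirrors the document hierarchy step by step, while the second finds the element by one of its attributes regardless of where it sits in the tree.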

The basic idea behind locating elements with XPath is traversal: you move from one element to another
across the page, which lets you find an element by its position relative to another element.

**CSS-Selector**

A CSS selector combines an element selector with a selector value to identify a web element within a web page.
The combination of element selector and selector value is known as a selector pattern.

Photo-Credit - www.softwaretestinghelp.com

Primitive types of CSS Selector

![](http://cdn2.softwaretestinghelp.com/wp-content/qa/uploads/2014/10/Using-CSS-Selector-as-a-Locator.jpg)

[Different Types of CSS Selector](http://www.w3schools.com/cssref/css_selectors.asp)

--------------------------------------------------------------------------------
/automate your hack nyu form/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rakeshsukla53/webscraping-selenium/1dabda09afb57a18994a3355f24c300447e804a8/automate your hack nyu form/__init__.py

--------------------------------------------------------------------------------
/automate your hack nyu form/data1.csv:
--------------------------------------------------------------------------------
first_name,last_name,email_id
Terri, Burns, terri.burns@nyu.edu
Freia, Lobo, freia@nyu.edu
jhishan, khan, jhishan@nyu.edu

--------------------------------------------------------------------------------
/automate your hack nyu form/fill_up_nyu.py:
--------------------------------------------------------------------------------
import csv
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


# scroll elements into view aligned to the bottom of the window
DesiredCapabilities.FIREFOX["elementScrollBehavior"] = 1


def fill_up_hack_nyu(student):
    """Automate the HackNYU sign-up form using Selenium."""
    driver = webdriver.Firefox()
    wait = WebDriverWait(driver, 10)
    try:
        # load the page
        driver.get("http://hacknyu.org/signup")
        # get the form element
        form = driver.find_element(By.CSS_SELECTOR, "form[name='signupForm']")
        # fill the fields
        form.find_element(By.CSS_SELECTOR, "input[name='firstName']").send_keys(student['first_name'])
        form.find_element(By.CSS_SELECTOR, "input[name='lastName']").send_keys(student['last_name'])
        form.find_element(By.CSS_SELECTOR, "input[name='email']").send_keys(student['email_id'])
        form.find_element(By.CSS_SELECTOR, "input[name='password']").send_keys("technyu")
        # click and accept terms
        form.find_element(By.XPATH, "//input[@name='terms']/..").click()
        wait.until(EC.presence_of_element_located((By.XPATH, "//button[.='Accept']"))).click()
        wait.until_not(EC.presence_of_element_located((By.CSS_SELECTOR, ".modal")))
        # click on submit
        form.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    finally:
        # close the browser even if one of the steps above fails
        driver.quit()


def read_csv_files():
    """Submit the form once for every row in data1.csv."""
    with open('data1.csv', newline='') as data_file:
        # skipinitialspace drops the blank that follows each comma in data1.csv
        data = csv.DictReader(data_file, skipinitialspace=True)
        for row in data:
            fill_up_hack_nyu(row)
            sleep(1)


if __name__ == '__main__':
    read_csv_files()

--------------------------------------------------------------------------------
/automate your hack nyu form/hack_nyu.py:
--------------------------------------------------------------------------------
class HackNNYU(object):
    """CSS selectors for the fields of the HackNYU sign-up form."""
    first_name = 'input[ng-model="credentials.first_name"]'
    last_name = 'input[ng-model="credentials.last_name"]'
    email = '.col-sm-12>input[ng-model="credentials.email"]'
    password = '.col-sm-12>input[ng-model="credentials.password"]'
    agree_checkbox = '.ng-binding>input[ng-model="checkModel"]'
    sign_up_button = 'div>button[type="submit"]'
    accept_button = 'button[ng-click="positive()"]'

--------------------------------------------------------------------------------
/download all django videos/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rakeshsukla53/webscraping-selenium/1dabda09afb57a18994a3355f24c300447e804a8/download all django videos/__init__.py

--------------------------------------------------------------------------------
/download all django videos/download_videos_godjango.py:
--------------------------------------------------------------------------------
from time import sleep

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from go_django import GoDjango


def download_videos():
    """Print the video link of every episode listed on www.godjango.com."""
    driver = webdriver.Firefox()
    driver.get('https://godjango.com/browse/')
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, GoDjango.first_video_title)))
    try:
        while True:
            for element in range(0, 10):
                # look the titles up again on every pass: going back reloads the list
                all_video_elements = driver.find_elements(By.CSS_SELECTOR, GoDjango.first_video_title)
                all_video_elements[element].click()
                try:
                    print(driver.find_element(By.CSS_SELECTOR, GoDjango.video_download_link).get_attribute('src'))
                except NoSuchElementException:
                    print("Video is private")
                driver.execute_script("window.history.go(-1)")
            driver.execute_script("window.scrollTo(0, 465);")
            driver.execute_script("window.scrollTo(0, 4650);")
            sleep(1)
            driver.find_element(By.CSS_SELECTOR, GoDjango.next_button_click).click()
    except NoSuchElementException:
        # assumption: a missing "next" link means the last page has been processed
        pass
    finally:
        driver.quit()


if __name__ == '__main__':
    download_videos()

--------------------------------------------------------------------------------
/download all django videos/go_django.py:
--------------------------------------------------------------------------------
class GoDjango(object):
    """CSS selectors for the godjango.com episode list."""
    pro_video_tag = '.media.episode-list-item.padding-15 div div span'
    video_container = '.media.episode-list-item.padding-15'
    first_video_title = 'h4[class="media-heading"] a'
    next_button_click = 'li[class="active"] + li a'
    video_download_link = 'div[class="video-description"] + iframe'

--------------------------------------------------------------------------------
/html dom tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rakeshsukla53/webscraping-selenium/1dabda09afb57a18994a3355f24c300447e804a8/html dom tree.png

--------------------------------------------------------------------------------
/identify xpath.md:
--------------------------------------------------------------------------------
XPath locator examples

To find the link in this page:

The fox jumped over the lazy brown dog.


A raw XPath traverses the hierarchy from the root element of the document (page) to the desired element:

    /html/body/p/a


Child of Element ID

XPath can find an element by ID like this:

    //*[@id="element_id"]

So if you need to find an element that is near another element with an ID, like the link in this example:


The fox jumped over the lazy brown dog.


you could try an XPath like this to find the first link that is a child of the element with ID="fox":

    //*[@id="fox"]/a


Button Text

There are two ways to declare a standard button in HTML, discounting the many ways to make something that looks like a button, but is not. To determine how an element is declared in the HTML, see how to inspect an element in the browser.

If the button is declared with the