├── README.md
└── mechanicalSoup.ipynb

/README.md:
--------------------------------------------------------------------------------
1 | # BetterWebScraping
2 | Create a database from scratch by extracting HTML elements from a webpage with MechanicalSoup
3 |
4 | Modules Used: MechanicalSoup, os, wget. 5 |
6 |
7 | Step by step walk-through: 8 |
9 | Step 1: Search for a term on Google. 10 |
11 | Step 2: Target all the image elements on the new page. 12 |
13 | Step 3: Fix broken/incomplete image links. 14 |
15 | Step 4: Create a new local directory for the images. 16 |
17 | Step 5: Download and save all the images at once. 18 |
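
The five steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names (`fix_image_links`, `make_image_dir`, `scrape_images`), the form field name `q`, the `images` directory, and the `.jpg` file extension are all assumptions. The third-party imports are deferred into `scrape_images` so the two pure helpers can be tried without network access or extra packages.

```python
import os
from urllib.parse import urljoin


def fix_image_links(sources, base_url):
    """Step 3: turn relative or protocol-relative links into absolute URLs."""
    fixed = []
    for src in sources:
        if not src or src.startswith("data:"):
            continue  # skip missing values and inline base64 images
        fixed.append(urljoin(base_url, src))
    return fixed


def make_image_dir(path):
    """Step 4: create the target directory if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


def scrape_images(term, out_dir="images"):
    """Steps 1, 2 and 5. Needs network access plus mechanicalsoup and wget."""
    import mechanicalsoup  # third-party, imported lazily
    import wget            # third-party, imported lazily

    # Step 1: open Google Images and submit the search form.
    browser = mechanicalsoup.StatefulBrowser()
    browser.open("https://www.google.com/imghp?hl=en")
    browser.select_form()        # first form on the page
    browser["q"] = term          # assumed name of the search field
    browser.submit_selected()

    # Step 2: collect the src attribute of every <img> element.
    page = browser.get_current_page()
    sources = [img.get("src") for img in page.find_all("img")]
    links = fix_image_links(sources, browser.get_url())

    # Steps 4 and 5: create the directory and download each image.
    make_image_dir(out_dir)
    for index, link in enumerate(links):
        # The .jpg extension is a guess; the real format may differ.
        wget.download(link, out=os.path.join(out_dir, f"{term}_{index}.jpg"))
    return links
```

Calling `scrape_images("dog")` would attempt the full workflow. Whether the search form still exposes a field named `q`, and whether Google serves plain `<img>` tags to a non-JavaScript client, depends on the page at the time of the request.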
19 |
20 | Also available in a video explanation: https://youtu.be/drDdb1MBBfI 21 |
22 | Author: Mariya Sha 23 |
24 | Email: mariyasha888@gmail.com 25 |
26 | LinkedIn: www.linkedin.com/in/mariyasha888/ 27 | 28 | -------------------------------------------------------------------------------- /mechanicalSoup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Navigate to Google Images" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "name": "stdout", 17 | "output_type": "stream", 18 | "text": [ 19 | "https://www.google.com/imghp?hl=en\n" 20 | ] 21 | } 22 | ], 23 | "source": [ 24 | "import mechanicalsoup\n", 25 | "\n", 26 | "browser = mechanicalsoup.StatefulBrowser()\n", 27 | "url = \"https://www.google.com/imghp?hl=en\"\n", 28 | "\n", 29 | "browser.open(url)\n", 30 | "print(browser.get_url())" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "# Type a search term and click \"search\"" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "\n", 50 | "\n", 51 | "\n", 52 | "\n", 53 | "\n", 54 | "\n", 55 | "\n", 56 | "new url: https://www.google.com/search?tbm=isch&ie=ISO-8859-1&hl=en-CA&source=hp&q=dog&btnG=Search+Images&gbv=1\n", 57 | "my response:\n", 58 | " dog - Google Search