├── README.md
└── mechanicalSoup.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # BetterWebScraping
2 | Create a database from scratch by extracting HTML elements from a webpage with MechanicalSoup.
3 |
4 | Modules used: MechanicalSoup, os, wget.
5 |
6 |
7 | Step by step walk-through:
8 |
9 | Step 1: Search for a term on Google Images.
10 |
11 | Step 2: Target all the image elements on the new page.
12 |
13 | Step 3: Fix broken/incomplete image links.
14 |
15 | Step 4: Create a new local directory for the images.
16 |
17 | Step 5: Download and save all the images at once.
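
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the function names `fix_image_links`, `make_image_dir` and `download_all` are hypothetical, the `.jpg` extension is an assumption, and the standard-library `urllib.request.urlretrieve` stands in for `wget` so that the sketch has no third-party dependencies.

```python
import os
import urllib.request
from urllib.parse import urljoin


def fix_image_links(srcs, base_url):
    """Turn relative or protocol-relative src values into absolute URLs."""
    fixed = []
    for src in srcs:
        if not src or src.startswith("data:"):
            continue  # skip empty values and inline base64 images
        # urljoin leaves absolute URLs untouched and completes partial ones
        fixed.append(urljoin(base_url, src))
    return fixed


def make_image_dir(name):
    """Create the target directory under the current working directory."""
    path = os.path.join(os.getcwd(), name)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path


def download_all(urls, directory):
    """Save each URL as image_<n>.jpg (the extension is an assumption)."""
    for i, url in enumerate(urls):
        target = os.path.join(directory, f"image_{i}.jpg")
        # urlretrieve is used here in place of wget.download
        urllib.request.urlretrieve(url, target)
```

In the notebook's flow, the `srcs` list would presumably come from the image elements on the results page, for example `[img.get("src") for img in browser.page.find_all("img")]`, and `base_url` from `browser.get_url()`.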
18 |
19 |
20 | Also available as a video explanation: https://youtu.be/drDdb1MBBfI
21 |
22 | Author: Mariya Sha
23 |
24 | Email: mariyasha888@gmail.com
25 |
26 | LinkedIn: www.linkedin.com/in/mariyasha888/
27 |
28 |
--------------------------------------------------------------------------------
/mechanicalSoup.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Navigate to Google Images"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [
15 | {
16 | "name": "stdout",
17 | "output_type": "stream",
18 | "text": [
19 | "https://www.google.com/imghp?hl=en\n"
20 | ]
21 | }
22 | ],
23 | "source": [
24 | "import mechanicalsoup\n",
25 | "\n",
26 | "browser = mechanicalsoup.StatefulBrowser()\n",
27 | "url = \"https://www.google.com/imghp?hl=en\"\n",
28 | "\n",
29 | "browser.open(url)\n",
30 | "print(browser.get_url())"
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "# Type a search term and click \"search\""
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 2,
43 | "metadata": {},
44 | "outputs": [
45 | {
46 | "name": "stdout",
47 | "output_type": "stream",
48 | "text": [
49 | "\n",
50 | "\n",
51 | "\n",
52 | "\n",
53 | "\n",
54 | "\n",
55 | "\n",
56 | "new url: https://www.google.com/search?tbm=isch&ie=ISO-8859-1&hl=en-CA&source=hp&q=dog&btnG=Search+Images&gbv=1\n",
57 | "my response:\n",
58 | "