├── README.md └── WebScraping.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # WebScraping 2 | Create a database from scratch by extracting html elements from a webpage 3 |
4 | Modules used: urllib.request, BeautifulSoup, re (regular expressions) and pandas. 5 |
6 |
7 | Step by step walk-through: 8 |
9 | Step 1: pulling HTML out of a webpage. 10 |
11 | Step 2: targeting elements of interest inside the HTML. 12 |
13 | Step 3: fine-tuning targeted elements with Regex (Regular Expressions), string concatenation and slicing. 14 |
15 | Step 4: storing the data inside a DataFrame. 16 |
17 | Step 5: exporting DataFrame into a CSV file. 18 |
19 |
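Step 3 can be sketched offline on a small string, without fetching a page. This is only an illustration: `sample_html` is a made-up stand-in for the `<dt>` markup the notebook extracts, and the real page's HTML may differ.

```python
import re

# Hypothetical sample of <dt> markup; not copied from the real page.
sample_html = (
    '<dt id="random.seed">random.seed(a=None, version=2)</dt>'
    '<dt id="random.getstate">random.getstate()</dt>'
    '<dt>an entry without an id attribute</dt>'
)

# Match the id attribute up to the end of the function name.
# The dot is escaped so it only matches a literal ".".
matches = re.findall(r'id="random\.\w+', sample_html)

# Slice off the leading 'id="' (4 characters) to keep just the name.
function_names = [m[4:] for m in matches]

print(function_names)
```

The regex deliberately stops before the closing quote, so a fixed-width slice is enough to strip the attribute prefix. Entries without a matching `id` are skipped.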
20 | Also available as a video explanation: https://youtu.be/ySNSY7iiBDY 21 |
22 | Author: Mariya Sha 23 |
24 | Email: mariyasha888@gmail.com 25 |
26 | LinkedIn: www.linkedin.com/in/mariyasha888/ 27 | 28 | -------------------------------------------------------------------------------- /WebScraping.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import urllib.request\n", 10 | "from bs4 import BeautifulSoup as bs\n", 11 | "import re\n", 12 | "import pandas as pd" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "# Extract and Store Function Names & Usage\n", 20 | "
\n", 21 | "From the Python Documentation - Random Functions webpage: https://docs.python.org/3/library/random.html\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 8, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "name": "stdout", 31 | "output_type": "stream", 32 | "text": [ 33 | "list of function names: ['random.seed', 'random.getstate', 'random.setstate', 'random.getrandbits', 'random.randrange']\n", 34 | "\n", 35 | "function description: Initialize the random number generator. If a is omitted or None, the current system time is used. If randomness sources are provided by the operating system, they are used instead of the system time (see the os.urandom() function for details on availability). If a is an int, it is used directly. With version 2 (the default), a str, bytes, or bytearray object gets converted to an int and all of its bits are used. With version 1 (provided for reproducing random sequences from older versions of Python), the algorithm for str and bytes generates a narrower range of seeds. Changed in version 3.2: Moved to the version 2 scheme which uses all of the bits in a string seed. 
\n", 36 | "\n", 37 | "number of items in function names: 24\n", 38 | "number of items in function description: 24\n" 39 | ] 40 | } 41 | ], 42 | "source": [ 43 | "# load the HTML from a URL and parse it with the built-in parser\n", 44 | "page = urllib.request.urlopen(\"https://docs.python.org/3/library/random.html\")\n", 45 | "soup = bs(page, 'html.parser')\n", 46 | "\n", 47 | "# find all function names: match the id attribute, then slice off the 4-character id= prefix\n", 48 | "names = soup.body.find_all('dt')\n", 49 | "function_names = re.findall(r'id=\"random\\.\\w+', str(names))\n", 50 | "function_names = [item[4:] for item in function_names]\n", 51 | "\n", 52 | "# find all function descriptions and flatten each one onto a single line\n", 53 | "description = soup.body.find_all('dd')\n", 54 | "function_usage = []\n", 55 | "\n", 56 | "for item in description:\n", 57 | "    item = item.text\n", 58 | "    item = item.replace('\\n', ' ')\n", 59 | "    function_usage.append(item)\n", 60 | "\n", 61 | "print('list of function names:', function_names[:5])\n", 62 | "print('\\nfunction description:', function_usage[0])\n", 63 | "print('\\nnumber of items in function names:', len(function_names))\n", 64 | "print('number of items in function description:', len(function_usage))" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "# Store Data inside a DataFrame\n", 72 | "
\n", 73 | "After ensuring the lengths of both lists match" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "data": { 83 | "text/html": [ 84 | "
\n", 85 | "\n", 98 | "\n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | "
  function name       function usage
0 random.seed         Initialize the random number generator. If a i...
1 random.getstate     Return an object capturing the current interna...
2 random.setstate     state should have been obtained from a previou...
3 random.getrandbits  Returns a Python integer with k random bits. T...
4 random.randrange    Return a randomly selected element from range(...
\n", 134 | "
" 135 | ], 136 | "text/plain": [ 137 | " function name function usage\n", 138 | "0 random.seed Initialize the random number generator. If a i...\n", 139 | "1 random.getstate Return an object capturing the current interna...\n", 140 | "2 random.setstate state should have been obtained from a previou...\n", 141 | "3 random.getrandbits Returns a Python integer with k random bits. T...\n", 142 | "4 random.randrange Return a randomly selected element from range(..." 143 | ] 144 | }, 145 | "execution_count": 4, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "#create a dataframe\n", 152 | "data = pd.DataFrame({'function name': function_names, 'function usage': function_usage})\n", 153 | "data.head()" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "# Export Data into a CSV file\n", 161 | "
\n", 162 | "The file will be saved in the same directory as the notebook" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 5, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "data.to_csv('my_file.csv')" 172 | ] 173 | } 174 | ], 175 | "metadata": { 176 | "kernelspec": { 177 | "display_name": "Python 3", 178 | "language": "python", 179 | "name": "python3" 180 | }, 181 | "language_info": { 182 | "codemirror_mode": { 183 | "name": "ipython", 184 | "version": 3 185 | }, 186 | "file_extension": ".py", 187 | "mimetype": "text/x-python", 188 | "name": "python", 189 | "nbconvert_exporter": "python", 190 | "pygments_lexer": "ipython3", 191 | "version": "3.7.4" 192 | } 193 | }, 194 | "nbformat": 4, 195 | "nbformat_minor": 2 196 | } 197 | --------------------------------------------------------------------------------