├── README.md
├── products.csv
└── WebScraping.py


/README.md:
--------------------------------------------------------------------------------
# WebScrap

Web scraping (also called web harvesting or web data extraction) is the automated extraction of data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.

This repo scrapes product listings from an e-commerce website using Beautiful Soup in [Python](https://www.python.org/).

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

--------------------------------------------------------------------------------
/products.csv:
--------------------------------------------------------------------------------
brands, product_name, shipping
GIGABYTE,GIGABYTE GeForce GTX 1070 DirectX 12 GV-N1070WF2OC-8GD Video Cards,Free Shipping
EVGA,EVGA GeForce GTX 1080 SC GAMING ACX 3.0| 08G-P4-6183-KR| 8GB GDDR5X| LED| DX12 OSD Support (PXOC),Free Shipping
XFX,XFX Radeon RX 470 RS Triple X DirectX 12 RX-470P436BM 4GB 256-Bit GDDR5 PCI Express 3.0 CrossFireX Support Video Card,$3.99 Shipping
ASUS,ASUS Radeon RX 480 DirectX 12 DUAL-RX480-O8G Video Card,$4.99 Shipping
ZOTAC,ZOTAC GeForce GTX 1080 Ti AMP Edition 11GB GDDR5X 352-bit Gaming Graphics Card VR Ready 16+2 Power Phase Freeze Fan Stop IceStorm Cooling Spectra Lighting ZT-P10810D-10P,$6.99 Shipping
ASUS,ASUS ROG GeForce GTX 1080 STRIX-GTX1080-A8G-GAMING Video Card,Free Shipping
EVGA,EVGA GeForce GTX 1070 SC GAMING ACX 3.0 Black Edition| 08G-P4-5173-KR| 8GB GDDR5| LED| DX12 OSD Support (PXOC),Free Shipping
GIGABYTE,GIGABYTE Radeon RX 480 G1 Gaming 4GB GV-RX480G1GAMING-4GD Video Card,$4.99 Shipping
XFX,XFX Radeon RS RX 480 DirectX 12 RX-480P836BM 8GB 256-Bit GDDR5 PCI Express 3.0 CrossFireX Support Video Card,$4.99 Shipping
ZOTAC,ZOTAC GeForce GTX 1070 Mini| ZT-P10700G-10M| 8GB GDDR5,Free Shipping
MSI,MSI Radeon RX 480 DirectX 12 Radeon RX 480 4G Video Card,$4.99 Shipping
XFX,XFX Radeon GTR RX 480 DirectX 12 RX-480P8DBA6 Black Edition Video Card,$4.99 Shipping

--------------------------------------------------------------------------------
/WebScraping.py:
--------------------------------------------------------------------------------
from urllib.request import urlopen

from bs4 import BeautifulSoup as soup

my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20card'

# open the URL and download the page; the with-block closes the connection
with urlopen(my_url) as uClient:
    page_html = uClient.read()

# parse the HTML
page_soup = soup(page_html, 'html.parser')

# grab every product container (div elements with class "item-container")
containers = page_soup.find_all('div', {'class': 'item-container'})

filename = "products.csv"
headers = "brands, product_name, shipping\n"

# the with-block closes the file even if a container fails to parse
with open(filename, 'w') as f:
    f.write(headers)

    for container in containers:
        # the brand is the title attribute of the logo image nested in the container
        brand = container.div.div.a.img['title']
        title_container = container.find_all('a', {'class': 'item-title'})
        product_name = title_container[0].text
        ship_container = container.find_all('li', {'class': 'price-ship'})
        # use strip() to remove whitespace before and after the text
        shipping = ship_container[0].text.strip()

        print("brand:" + brand)
        print("product_name:" + product_name)
        print("shipping:" + shipping)

        # replace commas in the product name with '|' so they do not split the CSV column
        f.write(brand + ',' + product_name.replace(',', '|') + ',' + shipping + '\n')

--------------------------------------------------------------------------------
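WebScraping.py builds each CSV line by hand and swaps commas in the product name for `|` so they do not split the column. A possible alternative, sketched below, is to let Python's standard `csv` module quote such fields instead. The function name `write_products`, the header spelling, and the sample row are assumptions for illustration only (the row is a guess at what one scraped name looked like before the `|` substitution); this is not the repo's actual code.

```python
import csv
import io


def write_products(rows, out):
    """Write (brand, product_name, shipping) tuples to the file-like object `out`.

    Hypothetical helper: csv.writer quotes any field that contains a comma,
    so the product name can be stored unchanged.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["brand", "product_name", "shipping"])
    writer.writerows(rows)


# One illustrative row whose product name contains commas (assumed, not scraped).
sample_rows = [
    ("ZOTAC", "ZOTAC GeForce GTX 1070 Mini, ZT-P10700G-10M, 8GB GDDR5", "Free Shipping"),
]

# An in-memory buffer stands in for open("products.csv", "w", newline="").
buffer = io.StringIO()
write_products(sample_rows, buffer)
csv_text = buffer.getvalue()

# Reading the text back recovers the original fields, commas included.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""` so the writer controls line endings. The trade-off is that readers of the file must parse it as quoted CSV rather than splitting each line on commas.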