├── README.md
├── VBA
└── Web Scraping With Excel VBA Guide
│ ├── README.md
│ ├── images
│ ├── image1.png
│ ├── image2.png
│ ├── image3.png
│ ├── image4.png
│ ├── image5.png
│ ├── image6.png
│ ├── image7.png
│ ├── image8.png
│ └── image9.png
│ └── src
│ ├── automate_ie.vb
│ └── scrape_quotes.vb
├── csharp
└── csharp-web-scraping
│ ├── Export-to-csv.cs
│ ├── GetBookDetails.cs
│ ├── GetBookLinks.cs
│ ├── Program.cs
│ └── README.md
├── golang
└── golang-web-scraper
│ ├── README.md
│ └── src
│ ├── books.go
│ ├── go.mod
│ └── go.sum
├── java
└── web-scraping-with-java
│ └── README.md
├── javascript
├── how-to-build-web-scraper
│ ├── README.md
│ └── web_scraper.js
├── javascript-web-scraping
│ ├── README.md
│ └── books.js
├── node-js-fetch-api
│ ├── README.md
│ ├── axios-post.js
│ └── fetch-post.js
├── playwright-web-scraping
│ └── README.md
├── puppeteer-on-aws-lambda
│ ├── README.md
│ └── demo.js
├── puppeteer-tutorial
│ ├── README.md
│ └── bnb.js
└── rotating-proxies-javascript
│ ├── README.md
│ ├── no_proxy.js
│ ├── package.json
│ ├── proxy_list.csv
│ ├── rotating_proxies.js
│ └── single_proxy_axios.js
├── other
└── curl-with-proxy
│ ├── README.md
│ ├── simple_proxy.sh
│ └── src
│ ├── _curlrc
│ ├── env_variables.sh
│ ├── one_time_proxy.sh
│ └── socks_proxy.sh
├── php
└── web-scraping-php
│ └── README.md
├── python
├── Date-Parser-Tutorial
│ └── README.md
├── News-Article-Scraper
│ ├── JavaScript
│ │ ├── extract_article_links.js
│ │ ├── news_article_scraper.js
│ │ ├── package-lock.json
│ │ └── package.json
│ ├── Python
│ │ ├── extract_article_links.py
│ │ └── news_article_scraper.py
│ └── README.md
├── Pagination-With-Python
│ ├── README.md
│ ├── images
│ │ ├── load_more_button.png
│ │ ├── next_button_example.png
│ │ ├── next_button_example_page2.png
│ │ ├── next_button_example_page3.png
│ │ ├── next_button_locate.png
│ │ ├── pager_without_next.png
│ │ ├── scroll_html_response.png
│ │ ├── scroll_json_response.png
│ │ └── scroll_json_response_has_next.png
│ ├── infinite_scroll_html.py
│ ├── infinite_scroll_json.py
│ ├── load_more_json.py
│ ├── next_button.py
│ └── no_next_button.py
├── Price-Parsing-Tutorial
│ ├── README.md
│ └── images
│ │ └── Preview-of-RegEx.png
├── Python-Web-Scraping-Tutorial
│ ├── README.md
│ ├── python_toc.csv
│ ├── web_scraping_toc.csv
│ ├── webscraping_5lines.py
│ └── wiki_toc.py
├── Rotating-Proxies-With-Python
│ ├── README.md
│ ├── no_proxy.py
│ ├── requirements.txt
│ ├── rotating_multiple_proxies.py
│ ├── rotating_multiple_proxies_async.py
│ └── single_proxy.py
├── Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup
│ ├── README.md
│ ├── data_in_same_page.py
│ ├── images
│ │ ├── author_markup.png
│ │ ├── command_menu.png
│ │ ├── dynamic_site_no_js.png
│ │ ├── infinite_scroll.png
│ │ ├── infinite_scroll_no_js.png
│ │ ├── json_embedded.png
│ │ └── libribox.png
│ ├── selenium_bs4.py
│ ├── selenium_bs4_headless.py
│ └── selenium_example.py
├── Web-Scraping-With-Selenium
│ ├── README.md
│ └── books_selenium.py
├── automate-competitors-benchmark-analysis
│ ├── README.md
│ └── src
│ │ ├── get_serp.py
│ │ ├── get_top_urls.py
│ │ ├── off_page_metrics.py
│ │ └── page_speed_metrics.py
├── beautiful-soup-parsing-tutorial
│ ├── README.md
│ ├── content-tags.py
│ ├── export-to-csv.py
│ ├── finding-all-tags.py
│ └── traversing-tags.py
├── building-scraping-pipeline-apache-airflow
│ ├── DAG
│ │ ├── push-pull.py
│ │ ├── scrape.py
│ │ └── setup.py
│ ├── README.md
│ ├── bootstrap.py
│ ├── messenger.py
│ ├── oxylabs.py
│ ├── puller.py
│ ├── pusher.py
│ └── setup.py
├── how-to-build-a-price-tracker
│ ├── README.md
│ └── tracker.py
├── how-to-make-web-scraping-faster
│ ├── README.md
│ ├── async-scraping.py
│ ├── multiproc-scraping.py
│ ├── multithread-scraping.py
│ └── sync-scraping.py
├── lxml-tutorial
│ ├── README.md
│ └── src
│ │ ├── countries.py
│ │ ├── countries_flags.py
│ │ ├── creating_xml_html.py
│ │ ├── input.html
│ │ ├── list_of_countries.py
│ │ ├── reading_html.py
│ │ ├── requirements.txt
│ │ └── sample.xml
├── news-scraping
│ └── README.md
├── pandas-read-html-tables
│ ├── README.md
│ └── src
│ │ ├── pandas_readhtml.ipynb
│ │ └── population.html
├── playwright-web-scraping
│ ├── README.md
│ ├── node
│ │ ├── book.js
│ │ ├── package-lock.json
│ │ └── package.json
│ └── python
│ │ ├── books.py
│ │ └── requirements.txt
├── python-parse-json
│ └── README.md
├── regex-web-scraping
│ ├── README.md
│ └── demo.py
├── scrape-images-from-website
│ ├── README.md
│ └── img-scraper.py
└── web-scraping-machine-learning
│ ├── README.md
│ └── Web Scraping for Machine Learning.ipynb
├── r
└── web-scraping-r
│ ├── README.md
│ └── src
│ ├── download_images_rvest.R
│ ├── dynamic_rselenium.R
│ ├── dynamic_rvest.R
│ └── static_rvest.R
└── ruby
└── webscraping-with-ruby
└── README.md
/README.md:
--------------------------------------------------------------------------------
1 | [](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)
2 |
3 | [](https://discord.gg/GbxmdGhZjq)
4 |
--------------------------------------------------------------------------------
/VBA/Web Scraping With Excel VBA Guide/README.md:
--------------------------------------------------------------------------------
1 | # Web Scraping With Excel VBA
2 |
3 | [](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)
4 |
5 |
6 | - [Prerequisites](#prerequisites)
7 | - [Step 1 - Open Microsoft Excel](#step-1---open-microsoft-excel)
8 | - [Step 2 - Go to Options to enable developer menu](#step-2---go-to-options-to-enable-developer-menu)
9 | - [Step 3 - Select Customize Ribbon](#step-3---select-customize-ribbon)
10 | - [Step 4 - Open Visual Basic Application Dialog](#step-4---open-visual-basic-application-dialog)
11 | - [Step 5 - Insert a new Module](#step-5---insert-a-new-module)
12 | - [Step 6 - Add new references](#step-6---add-new-references)
13 | - [Step 7 - Automate Microsoft Edge to Open a website](#step-7---automate-microsoft-edge-to-open-a-website)
14 | - [Step 8 - Scrape Data using VBA Script & Save it to Excel](#step-8---scrape-data-using-vba-script-and-save-it-to-excel)
15 | - [Output](#output)
16 | - [Source Code](#source-code)
17 |
18 | In this tutorial, we'll focus on how to perform Excel web scraping using
19 | VBA. We’ll briefly go through the installation and preparation of the
20 | environment and then write a scraper using VBA macro to successfully
21 | fetch data from a web page into Excel.
22 |
23 | See the full [blog post](https://oxylabs.io/blog/web-scraping-excel-vba) for a detailed
24 | explanation of VBA and its use in web scraping.
25 |
26 | Before we begin, let’s make sure we’ve installed all the prerequisites
27 | and set up our environment properly so that it will be easier to follow
28 | along.
29 |
30 | ## Prerequisites
31 |
32 | We’ll be using Windows 10 and Microsoft Office 10.
33 | However, the steps will be the same or similar for other versions of
34 | Windows. You’ll only need a computer running the Windows operating system. In
35 | addition, it’s necessary to install Microsoft Office if you don’t have
36 | it already. Detailed installation instructions can be found in
37 | [Microsoft's Official
38 | documentation](https://www.microsoft.com/en-us/download/office.aspx).
39 |
40 | Now that you’ve installed MS Office, follow the steps below to set up
41 | the development environment and scrape the public data you want.
42 |
43 | ## Step 1 - Open Microsoft Excel
44 |
45 | From the start menu or Cortana search, find Microsoft Excel and open the application. You will see an interface similar to the one below:
46 |
47 | Click on File
48 |
49 | 
50 |
51 | ## Step 2 - Go to Options to enable developer menu
52 |
53 | By default, Excel doesn’t show the Developer button in the top ribbon. To enable it, we will have to go to “Options” from the File menu.
54 |
55 | 
56 |
57 | ## Step 3 - Select Customize Ribbon
58 |
59 | Once you click “Options”, a dialog will pop up. From the side menu, select “Customize Ribbon”. Click the check box next to “Developer”, make sure it is ticked, and then click OK.
60 |
61 | 
62 |
63 | ## Step 4 - Open Visual Basic Application Dialog
64 |
65 | Now you will see a new Developer button on the top ribbon; clicking on it will expand the developer menu. From the menu, select “Visual Basic”.
66 |
67 | 
68 |
69 | ## Step 5 - Insert a new Module
70 |
71 | Once you click on “Visual Basic”, it will open a new window like the one below:
72 |
73 | 
74 |
75 | Click on “Insert” and select “Module” to insert a new module. It will open the module editor.
76 |
77 | 
78 |
79 | ## Step 6 - Add new references
80 |
81 |
82 | From the top menu, select `Tools > References...`; it will open a new window like the one below. Scroll through the list of available references and find Microsoft HTML Object Library and Microsoft Internet Controls. Click the check box next to both of them to enable these references. Once you are done, click OK.
83 |
84 | 
85 |
86 | That’s it! Our development environment is all set. Let’s write our first Excel VBA scraper.
87 |
88 | ## Step 7 - Automate Microsoft Edge to Open a website
89 |
90 | In this step, we will update our newly created module to open the following website:
--------------------------------------------------------------------------------
/golang/golang-web-scraper/README.md:
--------------------------------------------------------------------------------
3 | [](https://github.com/topics/go) [](https://github.com/topics/web-scraping)
4 | - [Installing Go](#installing-go)
5 | - [Parsing HTML with Colly](#parsing-html-with-colly)
6 | - [Handling pagination](#handling-pagination)
7 | - [Writing data to a CSV file](#writing-data-to-a-csv-file)
8 |
9 | Web scraping is an automated process of data extraction from a website. As a tool, a web scraper collects and exports data to a more usable format (JSON, CSV) for further analysis. Building a scraper could be complicated, requiring guidance and practical examples. A vast majority of web scraping tutorials concentrate on the most popular scraping languages, such as JavaScript, PHP, and, more often than not – Python. This time let’s take a look at Golang.
10 |
11 | Golang, or Go, is designed to leverage the static typing and run-time efficiency of C and the usability of Python and JavaScript, with added features of high-performance networking and multiprocessing. It’s also compiled and excels in concurrency, making it quick.
12 |
13 | This article will give you an overview of the process of writing a fast and efficient Golang web scraper.
14 |
15 | For a detailed explanation, [see this blog post](https://oxy.yt/IrPZ).
16 |
17 | ## Installing Go
18 |
19 | ```shell
20 | # macOS
21 | brew install go
22 |
23 | # Windows
24 | choco install golang
25 | ```
26 |
27 | ## Parsing HTML with Colly
28 |
29 | ```shell
30 | go mod init oxylabs.io/web-scraping-with-go
31 | go get github.com/gocolly/colly
32 |
33 | ```
34 |
35 |
36 |
37 | ```go
38 | //books.go
39 |
40 | package main
41 |
42 | import (
43 | "encoding/csv"
44 | "fmt"
45 | "log"
46 | "os"
47 |
48 | "github.com/gocolly/colly"
49 | )
50 | func main() {
51 | // Scraping code here
52 | fmt.Println("Done")
53 | }
54 | ```
55 |
56 | ### Sending HTTP requests with Colly
57 |
58 |
59 |
60 | ```go
61 | c := colly.NewCollector(
62 | colly.AllowedDomains("books.toscrape.com"),
63 | )
64 | c.OnRequest(func(r *colly.Request) {
65 | fmt.Println("Visiting", r.URL)
66 | })
67 | c.OnResponse(func(r *colly.Response) {
68 | fmt.Println(r.StatusCode)
69 | })
70 | ```
71 |
72 | ### Locating HTML elements via CSS selector
73 |
74 | ```go
75 | func main() {
76 | c := colly.NewCollector(
77 | colly.AllowedDomains("books.toscrape.com"),
78 | )
79 |
80 | c.OnHTML("title", func(e *colly.HTMLElement) {
81 | fmt.Println(e.Text)
82 | })
83 |
84 | c.OnResponse(func(r *colly.Response) {
85 | fmt.Println(r.StatusCode)
86 | })
87 |
88 | c.OnRequest(func(r *colly.Request) {
89 | fmt.Println("Visiting", r.URL)
90 | })
91 |
92 | c.Visit("https://books.toscrape.com/")
93 | }
94 | ```
95 |
96 | ### Extracting the HTML elements
97 |
98 | 
99 |
100 | ```go
101 | type Book struct {
102 | Title string
103 | Price string
104 | }
105 | c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
106 | book := Book{}
107 | book.Title = e.ChildAttr(".image_container img", "alt")
108 | book.Price = e.ChildText(".price_color")
109 | fmt.Println(book.Title, book.Price)
110 | })
111 | ```
112 |
113 | ## Handling pagination
114 |
115 | ```go
116 | c.OnHTML(".next > a", func(e *colly.HTMLElement) {
117 | nextPage := e.Request.AbsoluteURL(e.Attr("href"))
118 | c.Visit(nextPage)
119 | })
120 | ```
121 |
122 | ## Writing data to a CSV file
123 |
124 | ```go
125 | func crawl() {
126 | file, err := os.Create("export2.csv")
127 | if err != nil {
128 | log.Fatal(err)
129 | }
130 | defer file.Close()
131 | writer := csv.NewWriter(file)
132 | defer writer.Flush()
133 | headers := []string{"Title", "Price"}
134 | writer.Write(headers)
135 |
136 | c := colly.NewCollector(
137 | colly.AllowedDomains("books.toscrape.com"),
138 | )
139 |
140 | c.OnRequest(func(r *colly.Request) {
141 | fmt.Println("Visiting: ", r.URL.String())
142 | })
143 |
144 | c.OnHTML(".next > a", func(e *colly.HTMLElement) {
145 | nextPage := e.Request.AbsoluteURL(e.Attr("href"))
146 | c.Visit(nextPage)
147 | })
148 |
149 | c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
150 | book := Book{}
151 | book.Title = e.ChildAttr(".image_container img", "alt")
152 | book.Price = e.ChildText(".price_color")
153 | row := []string{book.Title, book.Price}
154 | writer.Write(row)
155 | })
156 |
157 | startUrl := "https://books.toscrape.com/"
158 | c.Visit(startUrl)
159 | }
160 |
161 | ```
162 |
163 | #### Run the file
164 |
165 | ```shell
166 | go run books.go
167 | ```
168 |
169 |
170 |
171 | If you wish to find out more about web scraping with Go, see our [blog post](https://oxy.yt/IrPZ).
172 |
--------------------------------------------------------------------------------
/golang/golang-web-scraper/src/books.go:
--------------------------------------------------------------------------------
1 | package main
2 |
3 | import (
4 | "encoding/csv"
5 | "fmt"
6 | "log"
7 | "os"
8 |
9 | "github.com/gocolly/colly"
10 | )
11 |
12 | type Book struct {
13 | Title string
14 | Price string
15 | }
16 |
17 | func main() {
18 | file, err := os.Create("export.csv")
19 | if err != nil {
20 | log.Fatal(err)
21 | }
22 | defer file.Close()
23 | writer := csv.NewWriter(file)
24 | defer writer.Flush()
25 | headers := []string{"Title", "Price"}
26 | writer.Write(headers)
27 |
28 | c := colly.NewCollector(
29 | colly.AllowedDomains("books.toscrape.com"),
30 | )
31 |
32 | c.OnRequest(func(r *colly.Request) {
33 | fmt.Println("Visiting: ", r.URL.String())
34 | })
35 |
36 | c.OnHTML(".next > a", func(e *colly.HTMLElement) {
37 | nextPage := e.Request.AbsoluteURL(e.Attr("href"))
38 | c.Visit(nextPage)
39 | })
40 |
41 | c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
42 | book := Book{}
43 | book.Title = e.ChildAttr(".image_container img", "alt")
44 | book.Price = e.ChildText(".price_color")
45 | row := []string{book.Title, book.Price}
46 | writer.Write(row)
47 | })
48 |
49 | startUrl := "https://books.toscrape.com/"
50 | c.Visit(startUrl)
51 | }
52 |
--------------------------------------------------------------------------------
/golang/golang-web-scraper/src/go.mod:
--------------------------------------------------------------------------------
1 | module oxylabs.io/web-scraping-with-go
2 |
3 | go 1.19
4 |
5 | require (
6 | github.com/PuerkitoBio/goquery v1.8.0 // indirect
7 | github.com/andybalholm/cascadia v1.3.1 // indirect
8 | github.com/antchfx/htmlquery v1.2.5 // indirect
9 | github.com/antchfx/xmlquery v1.3.12 // indirect
10 | github.com/antchfx/xpath v1.2.1 // indirect
11 | github.com/gobwas/glob v0.2.3 // indirect
12 | github.com/gocolly/colly v1.2.0 // indirect
13 | github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e // indirect
14 | github.com/golang/protobuf v1.3.1 // indirect
15 | github.com/kennygrant/sanitize v1.2.4 // indirect
16 | github.com/saintfish/chardet v0.0.0-20120816061221-3af4cd4741ca // indirect
17 | github.com/temoto/robotstxt v1.1.2 // indirect
18 | golang.org/x/net v0.0.0-20221004154528-8021a29435af // indirect
19 | golang.org/x/text v0.3.7 // indirect
20 | google.golang.org/appengine v1.6.7 // indirect
21 | )
22 |
--------------------------------------------------------------------------------
/golang/golang-web-scraper/src/go.sum:
--------------------------------------------------------------------------------
1 | github.com/PuerkitoBio/goquery v1.8.0 h1:PJTF7AmFCFKk1N6V6jmKfrNH9tV5pNE6lZMkG0gta/U=
2 | github.com/PuerkitoBio/goquery v1.8.0/go.mod h1:ypIiRMtY7COPGk+I/YbZLbxsxn9g5ejnI2HSMtkjZvI=
3 | github.com/andybalholm/cascadia v1.3.1 h1:nhxRkql1kdYCc8Snf7D5/D3spOX+dBgjA6u8x004T2c=
4 | github.com/andybalholm/cascadia v1.3.1/go.mod h1:R4bJ1UQfqADjvDa4P6HZHLh/3OxWWEqc0Sk8XGwHqvA=
5 | github.com/antchfx/htmlquery v1.2.5 h1:1lXnx46/1wtv1E/kzmH8vrfMuUKYgkdDBA9pIdMJnk4=
6 | github.com/antchfx/htmlquery v1.2.5/go.mod h1:2MCVBzYVafPBmKbrmwB9F5xdd+IEgRY61ci2oOsOQVw=
7 | github.com/antchfx/xmlquery v1.3.12 h1:6TMGpdjpO/P8VhjnaYPXuqT3qyJ/VsqoyNTmJzNBTQ4=
8 | github.com/antchfx/xmlquery v1.3.12/go.mod h1:3w2RvQvTz+DaT5fSgsELkSJcdNgkmg6vuXDEuhdwsPQ=
9 | github.com/antchfx/xpath v1.2.1 h1:qhp4EW6aCOVr5XIkT+l6LJ9ck/JsUH/yyauNgTQkBF8=
10 | github.com/antchfx/xpath v1.2.1/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
11 | github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
12 | github.com/gobwas/glob v0.2.3 h1:A4xDbljILXROh+kObIiy5kIaPYD8e96x1tgBhUI5J+Y=
13 | github.com/gobwas/glob v0.2.3/go.mod h1:d3Ez4x06l9bZtSvzIay5+Yzi0fmZzPgnTbPcKjJAkT8=
14 | github.com/gocolly/colly v1.2.0 h1:qRz9YAn8FIH0qzgNUw+HT9UN7wm1oF9OBAilwEWpyrI=
15 | github.com/gocolly/colly v1.2.0/go.mod h1:Hof5T3ZswNVsOHYmba1u03W65HDWgpV5HifSuueE0EA=
16 | github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e h1:1r7pUrabqp18hOBcwBwiTsbnFeTZHV9eER/QT5JVZxY=
17 | github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
18 | github.com/golang/protobuf v1.3.1 h1:YF8+flBXS5eO826T4nzqPrxfhQThhXl0YzfuUPu4SBg=
19 | github.com/golang/protobuf v1.3.1/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
20 | github.com/kennygrant/sanitize v1.2.4 h1:gN25/otpP5vAsO2djbMhF/LQX6R7+O1TB4yv8NzpJ3o=
21 | github.com/kennygrant/sanitize v1.2.4/go.mod h1:LGsjYYtgxbetdg5owWB2mpgUL6e2nfw2eObZ0u0qvak=
22 | github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
23 | github.com/saintfish/chardet v0.0.0-20120816061221-3af4cd4741ca h1:NugYot0LIVPxTvN8n+Kvkn6TrbMyxQiuvKdEwFdR9vI=
24 | github.com/saintfish/chardet v0.0.0-20120816061221-3af4cd4741ca/go.mod h1:uugorj2VCxiV1x+LzaIdVa9b4S4qGAcH6cbhh4qVxOU=
25 | github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
26 | github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
27 | github.com/temoto/robotstxt v1.1.2 h1:W2pOjSJ6SWvldyEuiFXNxz3xZ8aiWX5LbfDiOFd7Fxg=
28 | github.com/temoto/robotstxt v1.1.2/go.mod h1:+1AmkuG3IYkh1kv0d2qEB9Le88ehNO0zwOr3ujewlOo=
29 | golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
30 | golang.org/x/net v0.0.0-20190603091049-60506f45cf65/go.mod h1:HSz+uSET+XFnRR8LxR5pz3Of3rY3CfYBVs4xY44aLks=
31 | golang.org/x/net v0.0.0-20200421231249-e086a090c8fd/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A=
32 | golang.org/x/net v0.0.0-20210916014120-12bc252f5db8/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y=
33 | golang.org/x/net v0.0.0-20220127200216-cd36cc0744dd/go.mod h1:CfG3xpIq0wQ8r1q4Su4UZFWDARRcnwPjda9FqA0JpMk=
34 | golang.org/x/net v0.0.0-20221004154528-8021a29435af h1:wv66FM3rLZGPdxpYL+ApnDe2HzHcTFta3z5nsc13wI4=
35 | golang.org/x/net v0.0.0-20221004154528-8021a29435af/go.mod h1:YDH+HFinaLZZlnHAfSS6ZXJJ9M9t4Dl22yv3iI2vPwk=
36 | golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
37 | golang.org/x/sys v0.0.0-20200323222414-85ca7c5b95cd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
38 | golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
39 | golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
40 | golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
41 | golang.org/x/sys v0.0.0-20211216021012-1d35b9e2eb4e/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
42 | golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
43 | golang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8=
44 | golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
45 | golang.org/x/text v0.3.2/go.mod h1:bEr9sfX3Q8Zfm5fL9x+3itogRgK3+ptLWKqgva+5dAk=
46 | golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
47 | golang.org/x/text v0.3.7 h1:olpwvP2KacW1ZWvsR7uQhoyTYvKAupfQrRGBFM352Gk=
48 | golang.org/x/text v0.3.7/go.mod h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ=
49 | golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
50 | google.golang.org/appengine v1.6.7 h1:FZR1q0exgwxzPzp/aF+VccGrSfxfPpkBqjIIEq3ru6c=
51 | google.golang.org/appengine v1.6.7/go.mod h1:8WjMMxjGQR8xUklV/ARdw2HLXBOI7O7uCIDZVag1xfc=
52 |
--------------------------------------------------------------------------------
/javascript/how-to-build-web-scraper/web_scraper.js:
--------------------------------------------------------------------------------
1 | const fs = require("fs");
2 | const j2cp = require("json2csv").Parser;
3 | const axios = require("axios");
4 | const cheerio = require("cheerio");
5 |
6 | const wiki_python = "https://en.wikipedia.org/wiki/Python_(programming_language)";
7 |
8 | async function getWikiTOC(url) {
9 | try {
10 | const response = await axios.get(url);
11 | const $ = cheerio.load(response.data);
12 |
13 | const TOC = $("li.toclevel-1");
14 | let toc_data = [];
15 | TOC.each(function () {
16 | level = $(this).find("span.tocnumber").first().text();
17 | text = $(this).find("span.toctext").first().text();
18 | toc_data.push({ level, text });
19 | });
20 | const parser = new j2cp();
21 | const csv = parser.parse(toc_data);
22 | fs.writeFileSync("./wiki_toc.csv", csv);
23 | } catch (err) {
24 | console.error(err);
25 | }
26 | }
27 |
28 | getWikiTOC(wiki_python);
29 |
--------------------------------------------------------------------------------
/javascript/javascript-web-scraping/books.js:
--------------------------------------------------------------------------------
1 | const fs = require("fs");
2 | const j2cp = require("json2csv").Parser;
3 | const axios = require("axios");
4 | const cheerio = require("cheerio");
5 |
6 | const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
7 |
8 | const books_data = [];
9 |
10 | async function getBooks(url) {
11 | try {
12 | const response = await axios.get(url);
13 | const $ = cheerio.load(response.data);
14 |
15 | const books = $("article");
16 | books.each(function () {
17 | title = $(this).find("h3 a").text();
18 | price = $(this).find(".price_color").text();
19 | stock = $(this).find(".availability").text().trim();
20 | books_data.push({ title, price, stock });
21 | });
22 | // console.log(books_data);
23 | const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
24 | if ($(".next a").length > 0) {
25 | next = baseUrl + $(".next a").attr("href");
26 | getBooks(next);
27 | } else {
28 | const parser = new j2cp();
29 | const csv = parser.parse(books_data);
30 | fs.writeFileSync("./books.csv", csv);
31 | }
32 | } catch (err) {
33 | console.error(err);
34 | }
35 | }
36 |
37 | getBooks(mystery);
--------------------------------------------------------------------------------
/javascript/node-js-fetch-api/axios-post.js:
--------------------------------------------------------------------------------
1 | const axios = require('axios');
2 | const url = 'https://httpbin.org/post'
3 | const data = {
4 | x: 1920,
5 | y: 1080,
6 | };
7 | const customHeaders = {
8 | "Content-Type": "application/json",
9 | }
10 | axios.post(url, data, {
11 | headers: customHeaders,
12 | })
13 | .then(({ data }) => {
14 | console.log(data);
15 | })
16 | .catch((error) => {
17 | console.error(error);
18 | });
--------------------------------------------------------------------------------
/javascript/node-js-fetch-api/fetch-post.js:
--------------------------------------------------------------------------------
1 | const url = 'https://httpbin.org/post'
2 | const data = {
3 | x: 1920,
4 | y: 1080,
5 | };
6 | const customHeaders = {
7 | "Content-Type": "application/json",
8 | }
9 |
10 | fetch(url, {
11 | method: "POST",
12 | headers: customHeaders,
13 | body: JSON.stringify(data),
14 | })
15 | .then((response) => response.json())
16 | .then((data) => {
17 | console.log(data);
18 | })
19 | .catch((error) => {
20 | console.error(error);
21 | });
--------------------------------------------------------------------------------
/javascript/playwright-web-scraping/README.md:
--------------------------------------------------------------------------------
1 | # Web Scraping With Playwright
2 |
3 | [](https://github.com/topics/playwright) [](https://github.com/topics/web-scraping)
4 |
5 | - [Support for proxies in Playwright](#support-for-proxies-in-playwright)
6 | - [Basic scraping with Playwright](#basic-scraping-with-playwright)
7 | - [Web Scraping](#web-scraping)
8 |
9 | This article discusses everything you need to know about web scraping with Playwright, including its proxy support and how to scrape data with it in both Node.js and Python.
10 |
11 | For a detailed explanation, see our [blog post](https://oxy.yt/erHw).
12 |
13 |
14 | ## Support for proxies in Playwright
15 |
16 | #### Without Proxy
17 |
18 | ```javascript
19 |
20 | // Node.js
21 |
22 | const { chromium } = require('playwright');
23 | const browser = await chromium.launch();
24 | ```
25 |
26 |
27 |
28 | ```python
29 |
30 | # Python
31 |
32 | from playwright.async_api import async_playwright
33 | import asyncio
34 | async with async_playwright() as p:
35 | browser = await p.chromium.launch()
36 | ```
37 |
38 | #### With Proxy
39 |
40 | ```javascript
41 | // Node.js
42 | const launchOptions = {
43 | proxy: {
44 | server: '123.123.123.123:80'
45 | },
46 | headless: false
47 | }
48 | const browser = await chromium.launch(launchOptions);
49 | ```
50 |
51 |
52 |
53 | ```python
54 | # Python
55 | proxy_to_use = {
56 | 'server': '123.123.123.123:80'
57 | }
58 | browser = await p.chromium.launch(proxy=proxy_to_use, headless=False)
59 | ```
60 |
61 | ## Basic scraping with Playwright
62 |
63 | ### Node.js
64 |
65 | ```shell
66 | npm init -y
67 | npm install playwright
68 | ```
69 |
70 | ```javascript
71 | const playwright = require('playwright');
72 | (async () => {
73 | const browser = await playwright.chromium.launch({
74 | headless: false // Show the browser.
75 | });
76 |
77 | const page = await browser.newPage();
78 | await page.goto('https://books.toscrape.com/');
79 | await page.waitForTimeout(1000); // wait for 1 second
80 | await browser.close();
81 | })();
82 | ```
83 |
84 | ### Python
85 |
86 | ```shell
87 | pip install playwright
88 | ```
89 |
90 |
91 |
92 | ```python
93 | from playwright.async_api import async_playwright
94 | import asyncio
95 |
96 | async def main():
97 | async with async_playwright() as pw:
98 | browser = await pw.chromium.launch(
99 | headless=False # Show the browser
100 | )
101 | page = await browser.new_page()
102 | await page.goto('https://books.toscrape.com/')
103 | # Data Extraction Code Here
104 | await page.wait_for_timeout(1000) # Wait for 1 second
105 | await browser.close()
106 |
107 | if __name__ == '__main__':
108 | asyncio.run(main())
109 | ```
110 |
111 | ## Web Scraping
112 |
113 |
114 |
115 | 
116 |
117 | #### Node.js
118 |
119 | ```javascript
120 | const playwright = require('playwright');
121 |
122 | (async () => {
123 | const browser = await playwright.chromium.launch();
124 | const page = await browser.newPage();
125 | await page.goto('https://books.toscrape.com/');
126 | const books = await page.$$eval('.product_pod', all_items => {
127 | const data = [];
128 | all_items.forEach(book => {
129 | const name = book.querySelector('h3').innerText;
130 | const price = book.querySelector('.price_color').innerText;
131 | const stock = book.querySelector('.availability').innerText;
132 | data.push({ name, price, stock});
133 | });
134 | return data;
135 | });
136 | console.log(books);
137 | await browser.close();
138 | })();
139 | ```
140 |
141 | #### Python
142 |
143 | ```python
144 | from playwright.async_api import async_playwright
145 | import asyncio
146 |
147 |
148 | async def main():
149 | async with async_playwright() as pw:
150 | browser = await pw.chromium.launch()
151 | page = await browser.new_page()
152 | await page.goto('https://books.toscrape.com')
153 |
154 | all_items = await page.query_selector_all('.product_pod')
155 | books = []
156 | for item in all_items:
157 | book = {}
158 | name_el = await item.query_selector('h3')
159 | book['name'] = await name_el.inner_text()
160 | price_el = await item.query_selector('.price_color')
161 | book['price'] = await price_el.inner_text()
162 | stock_el = await item.query_selector('.availability')
163 | book['stock'] = await stock_el.inner_text()
164 | books.append(book)
165 | print(books)
166 | await browser.close()
167 |
168 | if __name__ == '__main__':
169 | asyncio.run(main())
170 | ```
171 |
172 | If you wish to find out more about Web Scraping With Playwright, see our [blog post](https://oxy.yt/erHw).
173 |
--------------------------------------------------------------------------------
/javascript/puppeteer-on-aws-lambda/README.md:
--------------------------------------------------------------------------------
1 | # Puppeteer on AWS Lambda
2 |
3 | ## Problem #1 – Puppeteer is too big to push to Lambda
4 |
5 | AWS Lambda has a 50 MB limit on the zip file you push directly to it. Because it installs Chromium, the Puppeteer package is significantly larger than that. However, this 50 MB limit doesn’t apply when you load the function from S3! See the documentation [here](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html).
6 |
7 | AWS Lambda quotas can be tight for Puppeteer:
8 |
9 | 
10 |
11 | The 250 MB unzipped limit can be bypassed by uploading directly from an S3 bucket. So I create a bucket in S3, use a node script to upload to S3, and then update my Lambda code from that bucket. The script looks something like this:
12 |
13 | ```bash
14 | "zip": "npm run build && 7z a -r function.zip ./dist/* node_modules/",
15 | "sendToLambda": "npm run zip && aws s3 cp function.zip s3://chrome-aws && rm function.zip && aws lambda update-function-code --function-name puppeteer-examples --s3-bucket chrome-aws --s3-key function.zip"
16 | ```
17 |
18 | ## Problem #2 – Puppeteer on AWS Lambda doesn’t work
19 |
20 | By default, Linux (including AWS Lambda) doesn’t include the necessary libraries required to allow Puppeteer to function.
21 |
22 | Fortunately, there already exists a package of Chromium built for AWS Lambda. You can find it [here](https://www.npmjs.com/package/chrome-aws-lambda). You will need to install it and puppeteer-core in your function that you are sending to Lambda.
23 |
24 | The regular Puppeteer package will not be needed and, in fact, counts against your 250 MB limit.
25 |
26 | ```node
27 | npm i --save chrome-aws-lambda puppeteer-core
28 | ```
29 |
30 | And then, when you are setting it up to launch a browser from Puppeteer, it will look like this:
31 |
32 | ```javascript
33 | const browser = await chromium.puppeteer
34 | .launch({
35 | args: chromium.args,
36 | defaultViewport: chromium.defaultViewport,
37 | executablePath: await chromium.executablePath,
38 | headless: chromium.headless
39 | });
40 | ```
41 |
42 | ## Final note
43 |
44 | Puppeteer requires more memory than a regular script, so keep an eye on your max memory usage. When using Puppeteer, I recommend at least 512 MB on your AWS Lambda function.
45 | Also, don’t forget to run `await browser.close()` at the end of your script. Otherwise, you may end up with your function running until timeout for no reason because the browser is still alive and waiting for commands.
46 |
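To put both tips together, here's a minimal sketch of what such a handler could look like, reusing the chrome-aws-lambda launch options from above (the target URL and the returned value are just placeholders):

```javascript
// Minimal sketch of a Lambda handler: launch, scrape, and always close the browser.
const chromium = require('chrome-aws-lambda');

exports.handler = async (event) => {
  const browser = await chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless
  });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com'); // placeholder target page
    return { statusCode: 200, body: await page.title() };
  } finally {
    // Close the browser even if scraping throws, so the function
    // doesn't keep running until the Lambda timeout.
    await browser.close();
  }
};
```

Wrapping the scraping code in `try`/`finally` is what guarantees `browser.close()` also runs when something goes wrong.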
--------------------------------------------------------------------------------
/javascript/puppeteer-on-aws-lambda/demo.js:
--------------------------------------------------------------------------------
1 | const browser = await chromium.puppeteer
2 | .launch({
3 | args: chromium.args,
4 | defaultViewport: chromium.defaultViewport,
5 | executablePath: await chromium.executablePath,
6 | headless: chromium.headless
7 | });
--------------------------------------------------------------------------------
/javascript/puppeteer-tutorial/bnb.js:
--------------------------------------------------------------------------------
1 | const puppeteer = require("puppeteer");
2 | (async () => {
3 | let url = "https://www.airbnb.com/s/homes?refinement_paths%5B%5D=%2Fhomes&search_type=section_navigation&property_type_id%5B%5D=8";
4 | const browser = await puppeteer.launch();
5 | const page = await browser.newPage();
6 | await page.goto(url);
7 | data = await page.evaluate(() => {
8 | root = Array.from(document.querySelectorAll("#FMP-target [itemprop='itemListElement']"));
9 | hotels = root.map(hotel => ({
10 | Name: hotel.querySelector('ol').parentElement.nextElementSibling.textContent,
11 | Photo: hotel.querySelector("img").getAttribute("src")
12 | }));
13 | return hotels;
14 | });
15 | console.log(data);
16 | await browser.close();
17 | })();
--------------------------------------------------------------------------------
/javascript/rotating-proxies-javascript/README.md:
--------------------------------------------------------------------------------
1 | # Rotating-Proxies-with-JavaScript
2 |
3 | [](https://github.com/topics/javascript) [](https://github.com/topics/web-scraping) [](https://github.com/topics/rotating-proxies)
4 |
5 | - [Requirements](#requirements)
6 | - [Finding Current IP Address](#finding-current-ip-address)
7 | - [Using a Proxy](#using-a-proxy)
8 | - [Rotating Multiple Proxies](#rotating-multiple-proxies)
9 |
10 | ## Requirements
11 |
12 | In this tutorial, we will be using [Axios](https://github.com/axios/axios) to make requests. If needed, the code can be easily modified for other libraries as well.
13 |
14 | Open the terminal and run the following command to initiate a new Node project:
15 |
16 | ```shell
17 | npm init -y
18 | ```
19 |
20 | The next step is to install Axios by running the following command:
21 |
22 | ```sh
23 | npm install axios
24 | ```
25 |
26 | ## Finding Current IP Address
27 |
28 | To check if the proxy works properly, we first need a basic script that prints the current IP address.
29 |
30 | The website http://httpbin.org/ip is appropriate for this purpose as it returns IP addresses in a clean format.
31 |
32 | Create a new JavaScript file and make changes as outlined below.
33 |
34 | The first step would be to import `axios`.
35 |
36 | ```JavaScript
37 | const axios = require("axios");
38 | ```
39 | Next, call the `get()` method and send the URL of the target website.
40 |
41 | ```javascript
42 | const url = 'https://httpbin.org/ip';
43 | const response = await axios.get(url);
44 | ```
45 |
46 | To see the data returned by the server, access the `data` attribute of the `response` object:
47 |
48 | ```JavaScript
49 | console.log(response.data);
50 | // Prints current IP
51 | ```
52 |
53 | For the complete implementation, see the [no_proxy.js](no_proxy.js) file.
54 |
55 | ## Using a Proxy
56 |
57 | For this example, we are going to use a proxy with IP 46.138.246.248 and port 8088.
58 |
59 | Axios can handle proxies directly. The proxy information needs to be sent as the second parameter of the `get()` method.
60 |
61 | The proxy object should have a `host` and `port`. See an example:
62 |
63 | ```JavaScript
64 | proxy_no_auth = {
65 | host: '46.138.246.248',
66 | port: 8088
67 | }
68 | ```
69 |
70 | If proxies need authentication, simply add an `auth` object with `username` and `password`.
71 |
72 | ```javascript
73 | proxy_with_auth = {
74 | host: '46.138.246.248',
75 | port: 8088,
76 | auth: {
77 | username: 'USERNAME',
78 | password: 'PASSWORD'
79 | }
80 | }
81 | ```
82 |
83 | This `proxy_no_auth` or `proxy_with_auth` object can then be sent with the `get` method.
84 |
85 | ```javascript
86 | const response = await axios.get(url, {
87 | proxy: proxy_no_auth
88 | });
89 | ```
90 |
91 | Run this code from the terminal to see the effective IP address.
92 |
93 | You will notice that now, instead of your original IP, the IP address of the proxy is printed.
94 |
95 | ```sh
96 | node single_proxy_axios.js
97 | // Prints {'origin': '46.138.246.248'}
98 | ```
99 |
100 | See the complete implementation in the [single_proxy_axios.js](single_proxy_axios.js) file.
101 |
102 | ## Rotating Multiple Proxies
103 |
104 | If multiple proxies are available, it is possible to rotate proxies with JavaScript.
105 |
106 | Some websites allow downloading a list of proxies in CSV or a similar format.
107 |
108 | In this example, we will be working with a file downloaded from one of the free websites.
109 |
110 | This file contains the proxies in the following format. Note that the proxy and port are separated by a comma.
111 |
112 | ```
113 | 20.94.229.106,80
114 | 209.141.55.228,80
115 | 103.149.162.194,80
116 | 206.253.164.122,80
117 | 200.98.114.237,8888
118 | 193.164.131.202,7890
119 | 98.12.195.129,44
120 | 49.206.233.104,80
121 | ```
122 |
123 | To get a rotating IP proxy using this file, first, we need to read this CSV file in asynchronous code.
124 |
125 | To read the CSV file asynchronously, install the package [async-csv](https://www.npmjs.com/package/async-csv).
126 |
127 | ```sh
128 | npm install async-csv
129 | ```
130 |
131 | We will also need the `fs` package, which does not need a separate install.
132 |
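For reference, the imports at the top of [rotating_proxies.js](rotating_proxies.js) look like this; note that `fs` comes from `require('fs').promises`, so `readFile` returns a promise that can be awaited:

```javascript
// CSV parser, promise-based file system API, and the HTTP client
const csv = require('async-csv');
const fs = require('fs').promises;
const axios = require("axios");
```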
133 | After the imports, use the following lines of code to read the CSV file.
134 |
135 | ```javascript
136 | // Read file from disk:
137 | const csvFile = await fs.readFile('proxy_list.csv');
138 |
139 | // Convert CSV string into rows:
140 | const data = await csv.parse(csvFile);
141 | ```
142 |
143 | The data object is an `Array` that contains each row as `Array`.
144 |
145 | We can loop over all these rows using the `map` function.
146 |
147 | Note that in the loop, we will use the `get` method of Axios to request the same URL, each time with a different proxy.
148 |
149 | The `get` method of Axios is `async`. This means that we can not call the `map` function of `data` directly.
150 |
151 | Instead, we need to use the `Promise` object as follows:
152 |
153 | ```JavaScript
154 | await Promise.all(data.map(async (item) => {
155 | // More async code here
156 | }));
157 | ```
158 |
159 | It is time to create the `proxy` object. The structure will be as explained in the earlier section.
160 |
161 | ```javascript
162 | // Create the Proxy object:
163 | proxy_no_auth = {
164 | host: item[0],
165 | port: item[1]
166 | };
167 | ```
168 |
169 | The above lines convert the data from the `[ '20.94.229.106', '80' ]` format to the `{ host: '20.94.229.106', port: '80' }` format.
170 |
171 | Next, call the `get` method and send the proxy object.
172 |
173 | ```javascript
174 | const url = 'https://httpbin.org/ip';
175 | const response = await axios.get(url, {
176 | proxy: proxy_no_auth
177 | });
178 | ```
179 |
180 | For the complete code, please see the [rotating_proxies.js](rotating_proxies.js) file.
181 |
--------------------------------------------------------------------------------
/javascript/rotating-proxies-javascript/no_proxy.js:
--------------------------------------------------------------------------------
1 | // import axios
2 | const axios = require("axios");
3 |
4 | // Create and execute a new Promise
5 | (async function () {
6 | try {
7 | // This URL returns the IP address
8 | const url = `https://httpbin.org/ip`;
9 |
10 | // call the GET method on the URL
11 | const response = await axios.get(url);
12 |
13 | // print the response data, which is the IP address
14 | console.log(response.data);
15 | } catch (err) {
16 |
17 | // print the error message
18 | console.error(err);
19 | }
20 | })();
21 |
22 |
--------------------------------------------------------------------------------
/javascript/rotating-proxies-javascript/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "dependencies": {
3 | "async-csv": "^2.1.3",
4 | "axios": "^0.24.0",
5 | "puppeteer": "^13.1.0"
6 | }
7 | }
8 |
--------------------------------------------------------------------------------
/javascript/rotating-proxies-javascript/proxy_list.csv:
--------------------------------------------------------------------------------
1 | 20.94.229.106,80
2 | 209.141.55.228,80
3 | 103.149.162.194,80
4 | 206.253.164.122,80
5 | 49.206.233.104,80
6 | 199.19.226.12,80
7 | 206.253.164.198,80
8 | 38.94.111.208,80
9 |
--------------------------------------------------------------------------------
/javascript/rotating-proxies-javascript/rotating_proxies.js:
--------------------------------------------------------------------------------
1 | const csv = require('async-csv');
2 | const fs = require('fs').promises;
3 | const axios = require("axios");
4 |
5 | (async () => {
6 | // Read file from disk:
7 | const csvFile = await fs.readFile('proxy_list.csv');
8 |
9 | // Convert CSV string into rows:
10 | const data = await csv.parse(csvFile);
11 | await Promise.all(data.map(async (item) => {
12 | try {
13 |
14 | // Create the Proxy object from the current CSV row
15 | proxy_no_auth = {
16 | host: item[0],
17 | port: item[1]
18 | }
19 |
20 | // Proxy with authentication
21 | proxy_with_auth = {
22 | host: '46.138.246.248',
23 | port: 8088,
24 | auth: {
25 | username: 'USERNAME',
26 | password: 'PASSWORD'
27 | }
28 | }
29 |
30 | // This URL returns the IP
31 | const url = `https://httpbin.org/ip`;
32 |
33 | // Call the GET method on the URL with proxy information
34 | const response = await axios.get(url, {
35 | proxy: proxy_no_auth
36 | });
37 | // Print effective IP address
38 | console.log(response.data);
39 | } catch (err) {
40 |
41 | // Log failed proxy
42 | console.log('Proxy Failed: ' + item[0]);
43 | }
44 | }));
45 |
46 | })();
47 |
--------------------------------------------------------------------------------
/javascript/rotating-proxies-javascript/single_proxy_axios.js:
--------------------------------------------------------------------------------
1 | // Import axios
2 | const axios = require("axios");
3 |
4 | // Create and execute a new Promise
5 | (async function () {
6 | try {
7 |
8 | // Proxy without authentication
9 | proxy_no_auth = {
10 | host: '206.253.164.122',
11 | port: 80
12 | }
13 |
14 | // Proxy with authentication
15 | proxy_with_auth = {
16 | host: '46.138.246.248',
17 | port: 8088,
18 | auth: {
19 | username: 'USERNAME',
20 | password: 'PASSWORD'
21 | }
22 | }
23 | const url = `https://httpbin.org/ip`;
24 |
25 | // Call the GET method on the URL with proxy information
26 | const response = await axios.get(url, {
27 | proxy: proxy_no_auth
28 | });
29 | // Print effective IP address
30 | console.log(response.data);
31 | } catch (err) {
32 |
33 | //Log the error message
34 | console.error(err);
35 | }
36 | })();
37 |
38 |
--------------------------------------------------------------------------------
/other/curl-with-proxy/README.md:
--------------------------------------------------------------------------------
1 | # How to Use cURL With Proxy
2 |
3 | [](https://github.com/topics/curl) [](https://github.com/topics/proxy)
4 |
5 | - [What is cURL?](#what-is-curl)
6 | - [Installation](#installation)
7 | - [What you need to connect to a proxy](#what-you-need-to-connect-to-a-proxy)
8 | - [Command line argument to set proxy in cURL](#command-line-argument-to-set-proxy-in-curl)
9 | - [Using environment variables](#using-environment-variables)
10 | - [Configure cURL to always use proxy](#configure-curl-to-always-use-proxy)
11 | - [Ignore or override proxy for one request](#ignore-or-override-proxy-for-one-request)
12 | - [Bonus tip – turning proxies off and on quickly](#bonus-tip--turning-proxies-off-and-on-quickly)
13 | - [cURL socks proxy](#curl-socks-proxy)
14 |
15 | This step-by-step guide explains how to use cURL, or simply curl, with proxy servers. It covers everything from installation to the various options for setting a proxy.
16 |
17 | For a detailed explanation, see our [blog post](https://oxy.yt/ArRn).
18 |
19 | ## What is cURL?
20 |
21 | cURL is a command line tool for sending and receiving data using URLs.
22 |
23 | ```shell
24 | curl https://www.google.com
25 | ```
26 |
27 | The question “[what is cURL](https://oxy.yt/ArRn)?” is also answered in one of our previous articles. We recommend reading it if you want to learn how it became such a universal asset.
28 |
29 | ## Installation
30 |
31 | cURL ships with many Linux distributions and with macOS. It is now included with Windows 10 as well.
32 |
33 | If your Linux distribution doesn’t include it, you can install it with your package manager. For example, on Ubuntu, open Terminal and run this command:
34 |
35 | ```shell
36 | sudo apt install curl
37 | ```
38 |
39 | If you are running an older version of Windows, or if you want to install an alternate version, you can download curl from the [official download page](https://curl.se/download.html).
40 |
41 | ## What you need to connect to a proxy
42 |
43 | Irrespective of which proxy service you use, you will need the following information to use a proxy:
44 |
45 | - proxy server address
46 | - port
47 | - protocol
48 | - username (if authentication is required)
49 | - password (if authentication is required)
50 |
51 | In this tutorial, we are going to assume that the proxy server is **127.0.0.1**, the port is **1234**, the user name is **user**, and the password is **pwd**. We will look into multiple examples covering various protocols.
52 |
53 | ## Command line argument to set proxy in cURL
54 |
55 | Open the terminal, type the following command, and press Enter:
56 |
57 | ```shell
58 | curl --help
59 | ```
60 |
61 | The output is going to be a huge list of options. One of them is going to look like this:
62 |
63 | ```shell
64 | -x, --proxy [protocol://]host[:port]
65 | ```
66 |
67 | Note that **x** is lowercase, and it is case-sensitive. The proxy details can be supplied using the **-x** or **--proxy** switch; both mean the same thing. The following two curl with proxy commands are the same:
68 |
69 | ```shell
70 | curl -x "http://user:pwd@127.0.0.1:1234" "http://httpbin.org/ip"
71 | ```
72 |
73 | or
74 |
75 | ```shell
76 | curl --proxy "http://user:pwd@127.0.0.1:1234" "http://httpbin.org/ip"
77 | ```
78 |
79 | **NOTE.** If there are SSL certificate errors, add **-k** (note the small **k**) to the **curl** command. This will allow insecure server connections when using SSL.
80 |
81 | ```shell
82 | curl --proxy "http://user:pwd@127.0.0.1:1234" "http://httpbin.org/ip" -k
83 | ```
84 |
85 | Another interesting thing to note here is that the default proxy protocol is http. Thus, the following two commands will do exactly the same thing:
86 |
87 | ```shell
88 | curl --proxy "http://user:pwd@127.0.0.1:1234" "http://httpbin.org/ip"
89 | curl --proxy "user:pwd@127.0.0.1:1234" "http://httpbin.org/ip"
90 | ```
91 |
92 | ## Using environment variables
93 |
94 | Another way to use proxy with curl is to set the environment variables **http_proxy** and **https_proxy**.
95 |
96 | ```shell
97 | export http_proxy="http://user:pwd@127.0.0.1:1234"
98 | export https_proxy="http://user:pwd@127.0.0.1:1234"
99 | ```
100 |
101 | After running these two commands, run **curl** normally.
102 |
103 | ```shell
104 | curl "http://httpbin.org/ip"
105 | ```
106 |
107 | To stop using proxy, turn off the global proxy by unsetting these two variables:
108 |
109 | ```shell
110 | unset http_proxy
111 | unset https_proxy
112 | ```
113 |
114 | ## Configure cURL to always use proxy
115 |
116 | If you want a proxy for curl but not for other programs, this can be achieved by creating a [curl config file](https://everything.curl.dev/cmdline/cmdline-configfile).
117 |
118 | For Linux and macOS, open the terminal and navigate to your home directory. If there is already a **.curlrc** file, open it. If there is none, create a new file. Here is the set of commands that can be run:
119 |
120 | ```shell
121 | cd ~
122 | nano .curlrc
123 | ```
124 |
125 | In this file, add this line:
126 |
127 | ```shell
128 | proxy="http://user:pwd@127.0.0.1:1234"
129 | ```
130 |
131 | Save the file. Now curl with proxy is ready to be used.
132 |
133 | Simply run **curl** normally and it will read the proxy from **.curlrc** file.
134 |
135 | ```shell
136 | curl "http://httpbin.org/ip"
137 | ```
138 |
139 | On Windows, the file is named **_curlrc**. This file can be placed in the **%APPDATA%** directory.
140 |
141 | To find the exact path of **%APPDATA%**, open command prompt and run the following command:
142 |
143 | ```shell
144 | echo %APPDATA%
145 | ```
146 |
147 | This directory will be something like **C:\Users\
--------------------------------------------------------------------------------
/php/web-scraping-php/README.md:
--------------------------------------------------------------------------------
3 | [](https://github.com/topics/php) [](https://github.com/topics/web-scraping)
4 |
5 | - [Installing Prerequisites](#installing-prerequisites)
6 | - [Making an HTTP GET request](#making-an-http-get-request)
7 | - [Web scraping in PHP with Goutte](#web-scraping-in-php-with-goutte)
8 | - [Web scraping with Symfony Panther](#web-scraping-with-symfony-panther)
9 |
10 | PHP is a general-purpose scripting language and one of the most popular options for web development. For example, WordPress, the most common content management system to create websites, is built using PHP.
11 |
12 | PHP offers various building blocks required to build a web scraper, although it can quickly become an increasingly complicated task. Conveniently, there are many open-source libraries that can make web scraping with PHP more accessible.
13 |
14 | This article will guide you through the step-by-step process of writing various PHP web scraping routines that can extract public data from static and dynamic web pages.
15 |
16 | For a detailed explanation, see our [blog post](https://oxy.yt/Jr3d).
17 |
18 | ## Installing Prerequisites
19 |
20 | ```sh
21 | # Windows
22 | choco install php
23 | choco install composer
24 | ```
25 |
26 | or
27 |
28 | ```sh
29 | # macOS
30 | brew install php
31 | brew install composer
32 | ```
33 |
34 | ## Making an HTTP GET request
35 |
36 | ```php
37 | <?php
38 | $html = file_get_contents('https://books.toscrape.com/');
39 | echo $html;
40 | ```
41 |
42 | ## Web scraping in PHP with Goutte
43 |
44 | ```sh
45 | composer init --no-interaction --require="php >=7.1"
46 | composer require fabpot/goutte
47 | composer update
48 | ```
49 |
50 | ```php
51 | <?php
52 | require 'vendor/autoload.php';
53 | use Goutte\Client;
54 | $client = new Client();
55 | $crawler = $client->request('GET', 'https://books.toscrape.com');
56 | echo $crawler->html();
57 | ```
58 |
59 | ### Locating HTML elements via CSS Selectors
60 |
61 | ```php
62 | echo $crawler->filter('title')->text(); //CSS
63 | echo $crawler->filterXPath('//title')->text(); //XPath
64 |
65 | ```
66 |
67 | ### Extracting the elements
68 |
69 | ```php
70 | function scrapePage($url, $client){
71 | $crawler = $client->request('GET', $url);
72 | $crawler->filter('.product_pod')->each(function ($node) {
73 | $title = $node->filter('.image_container img')->attr('alt');
74 | $price = $node->filter('.price_color')->text();
75 | echo $title . "-" . $price . PHP_EOL;
76 | });
77 | }
78 | ```
79 |
80 |
81 |
82 | ### Handling pagination
83 |
84 | ```php
85 | function scrapePage($url, $client, $file)
86 | {
87 | //...
88 | // Handling Pagination
89 | try {
90 | $next_page = $crawler->filter('.next > a')->attr('href');
91 | } catch (InvalidArgumentException) { //Next page not found
92 | return null;
93 | }
94 | return "https://books.toscrape.com/catalogue/" . $next_page;
95 | }
96 |
97 | ```
98 |
99 | ### Writing Data to CSV
100 |
101 | ```php
102 | function scrapePage($url, $client, $file)
103 | {
104 | $crawler = $client->request('GET', $url);
105 | $crawler->filter('.product_pod')->each(function ($node) use ($file) {
106 | $title = $node->filter('.image_container img')->attr('alt');
107 | $price = $node->filter('.price_color')->text();
108 | fputcsv($file, [$title, $price]);
109 | });
110 | try {
111 | $next_page = $crawler->filter('.next > a')->attr('href');
112 | } catch (InvalidArgumentException) { //Next page not found
113 | return null;
114 | }
115 | return "https://books.toscrape.com/catalogue/" . $next_page;
116 | }
117 |
118 | $client = new Client();
119 | $file = fopen("books.csv", "a");
120 | $nextUrl = "https://books.toscrape.com/catalogue/page-1.html";
121 |
122 | while ($nextUrl) {
123 | echo "
" . $nextUrl . "
" . PHP_EOL;
124 | $nextUrl = scrapePage($nextUrl, $client, $file);
125 | }
126 | fclose($file);
127 | ```
128 |
129 |
130 |
131 | ## Web scraping with Symfony Panther
132 |
133 | ```sh
134 | composer init --no-interaction --require="php >=7.1"
135 | composer require symfony/panther
136 | composer update
137 |
138 | brew install chromedriver
139 | ```
140 |
141 | ### Sending HTTP requests with Panther
142 |
143 | ```php
144 | <?php
145 | require 'vendor/autoload.php';
146 | use Symfony\Panther\Client;
147 | $client = Client::createChromeClient();
148 | $crawler = $client->get('https://quotes.toscrape.com/js/');
149 | ```
150 |
151 | ### Locating HTML elements via CSS Selectors
152 |
153 | ```php
154 | $crawler = $client->waitFor('.quote');
155 | $crawler->filter('.quote')->each(function ($node) {
156 | $author = $node->filter('.author')->text();
157 | $quote = $node->filter('.text')->text();
158 | echo $author . " - " . $quote;
159 | });
160 | ```
161 |
162 | ### Handling pagination
163 |
164 | ```php
165 | while (true) {
166 | $crawler = $client->waitFor('.quote');
167 | …
168 | try {
169 | $client->clickLink('Next');
170 | } catch (Exception) {
171 | break;
172 | }
173 | }
174 | ```
175 |
176 | ### Writing data to a CSV file
177 |
178 | ```php
179 | $file = fopen("quotes.csv", "a");
180 | while (true) {
181 | $crawler = $client->waitFor('.quote');
182 | $crawler->filter('.quote')->each(function ($node) use ($file) {
183 | $author = $node->filter('.author')->text();
184 | $quote = $node->filter('.text')->text();
185 | fputcsv($file, [$author, $quote]);
186 | });
187 | try {
188 | $client->clickLink('Next');
189 | } catch (Exception) {
190 | break;
191 | }
192 | }
193 | fclose($file);
194 | ```
195 |
196 |
197 |
198 | If you wish to find out more about web scraping with PHP, see our [blog post](https://oxy.yt/Jr3d).
199 |
--------------------------------------------------------------------------------
/python/News-Article-Scraper/JavaScript/extract_article_links.js:
--------------------------------------------------------------------------------
1 | const cheerio = require("cheerio");
2 | const axios = require("axios");
3 | url = `https://www.patrika.com/googlenewssitemap1.xml`;
4 | let links = [];
5 | async function getLinks() {
6 | try {
7 | const response = await axios.get(url);
8 | const $ = cheerio.load(response.data, { xmlMode: true });
9 | all_loc = $('loc')
10 | all_loc.each(function () {
11 | links.push($(this).text())
12 | })
13 | console.log(links.length + ' links found.')
14 |
15 | } catch (error) {
16 | console.error(error);
17 | }
18 | }
19 | getLinks();
20 |
--------------------------------------------------------------------------------
/python/News-Article-Scraper/JavaScript/news_article_scraper.js:
--------------------------------------------------------------------------------
1 | const cheerio = require("cheerio");
2 | const axios = require("axios");
3 | url = `https://www.example.com/sitemap.xml`;
4 | let links = [];
5 | async function getLinks() {
6 | try {
7 | const response = await axios.get(url);
8 | const $ = cheerio.load(response.data, { xmlMode: true });
9 | all_loc = $('loc');
10 | all_loc.each(function () {
11 | links.push($(this).text());
12 | })
13 | console.log(links.length + ' links found.');
14 | links.forEach(async function (story_link) {
15 | try {
16 | let story = await axios.get(story_link);
17 | let $ = cheerio.load(story.data);
18 | const heading = $('h1').text();
19 | const body = $('.complete-story p').text();
20 |
21 | } catch (error) {
22 | console.error('internal\n' + error)
23 | }
24 | })
25 |
26 | } catch (error) {
27 | console.error(error);
28 | }
29 | }
30 | getLinks();
31 |
--------------------------------------------------------------------------------
/python/News-Article-Scraper/JavaScript/package-lock.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "code",
3 | "version": "1.0.0",
4 | "lockfileVersion": 1,
5 | "requires": true,
6 | "dependencies": {
7 | "axios": {
8 | "version": "0.21.1",
9 | "resolved": "https://registry.npmjs.org/axios/-/axios-0.21.1.tgz",
10 | "integrity": "sha512-dKQiRHxGD9PPRIUNIWvZhPTPpl1rf/OxTYKsqKUDjBwYylTvV7SjSHJb9ratfyzM6wCdLCOYLzs73qpg5c4iGA==",
11 | "requires": {
12 | "follow-redirects": "^1.10.0"
13 | }
14 | },
15 | "boolbase": {
16 | "version": "1.0.0",
17 | "resolved": "https://registry.npmjs.org/boolbase/-/boolbase-1.0.0.tgz",
18 | "integrity": "sha1-aN/1++YMUes3cl6p4+0xDcwed24="
19 | },
20 | "cheerio": {
21 | "version": "1.0.0-rc.10",
22 | "resolved": "https://registry.npmjs.org/cheerio/-/cheerio-1.0.0-rc.10.tgz",
23 | "integrity": "sha512-g0J0q/O6mW8z5zxQ3A8E8J1hUgp4SMOvEoW/x84OwyHKe/Zccz83PVT4y5Crcr530FV6NgmKI1qvGTKVl9XXVw==",
24 | "requires": {
25 | "cheerio-select": "^1.5.0",
26 | "dom-serializer": "^1.3.2",
27 | "domhandler": "^4.2.0",
28 | "htmlparser2": "^6.1.0",
29 | "parse5": "^6.0.1",
30 | "parse5-htmlparser2-tree-adapter": "^6.0.1",
31 | "tslib": "^2.2.0"
32 | }
33 | },
34 | "cheerio-select": {
35 | "version": "1.5.0",
36 | "resolved": "https://registry.npmjs.org/cheerio-select/-/cheerio-select-1.5.0.tgz",
37 | "integrity": "sha512-qocaHPv5ypefh6YNxvnbABM07KMxExbtbfuJoIie3iZXX1ERwYmJcIiRrr9H05ucQP1k28dav8rpdDgjQd8drg==",
38 | "requires": {
39 | "css-select": "^4.1.3",
40 | "css-what": "^5.0.1",
41 | "domelementtype": "^2.2.0",
42 | "domhandler": "^4.2.0",
43 | "domutils": "^2.7.0"
44 | }
45 | },
46 | "css-select": {
47 | "version": "4.1.3",
48 | "resolved": "https://registry.npmjs.org/css-select/-/css-select-4.1.3.tgz",
49 | "integrity": "sha512-gT3wBNd9Nj49rAbmtFHj1cljIAOLYSX1nZ8CB7TBO3INYckygm5B7LISU/szY//YmdiSLbJvDLOx9VnMVpMBxA==",
50 | "requires": {
51 | "boolbase": "^1.0.0",
52 | "css-what": "^5.0.0",
53 | "domhandler": "^4.2.0",
54 | "domutils": "^2.6.0",
55 | "nth-check": "^2.0.0"
56 | }
57 | },
58 | "css-what": {
59 | "version": "5.0.1",
60 | "resolved": "https://registry.npmjs.org/css-what/-/css-what-5.0.1.tgz",
61 | "integrity": "sha512-FYDTSHb/7KXsWICVsxdmiExPjCfRC4qRFBdVwv7Ax9hMnvMmEjP9RfxTEZ3qPZGmADDn2vAKSo9UcN1jKVYscg=="
62 | },
63 | "dom-serializer": {
64 | "version": "1.3.2",
65 | "resolved": "https://registry.npmjs.org/dom-serializer/-/dom-serializer-1.3.2.tgz",
66 | "integrity": "sha512-5c54Bk5Dw4qAxNOI1pFEizPSjVsx5+bpJKmL2kPn8JhBUq2q09tTCa3mjijun2NfK78NMouDYNMBkOrPZiS+ig==",
67 | "requires": {
68 | "domelementtype": "^2.0.1",
69 | "domhandler": "^4.2.0",
70 | "entities": "^2.0.0"
71 | }
72 | },
73 | "domelementtype": {
74 | "version": "2.2.0",
75 | "resolved": "https://registry.npmjs.org/domelementtype/-/domelementtype-2.2.0.tgz",
76 | "integrity": "sha512-DtBMo82pv1dFtUmHyr48beiuq792Sxohr+8Hm9zoxklYPfa6n0Z3Byjj2IV7bmr2IyqClnqEQhfgHJJ5QF0R5A=="
77 | },
78 | "domhandler": {
79 | "version": "4.2.0",
80 | "resolved": "https://registry.npmjs.org/domhandler/-/domhandler-4.2.0.tgz",
81 | "integrity": "sha512-zk7sgt970kzPks2Bf+dwT/PLzghLnsivb9CcxkvR8Mzr66Olr0Ofd8neSbglHJHaHa2MadfoSdNlKYAaafmWfA==",
82 | "requires": {
83 | "domelementtype": "^2.2.0"
84 | }
85 | },
86 | "domutils": {
87 | "version": "2.7.0",
88 | "resolved": "https://registry.npmjs.org/domutils/-/domutils-2.7.0.tgz",
89 | "integrity": "sha512-8eaHa17IwJUPAiB+SoTYBo5mCdeMgdcAoXJ59m6DT1vw+5iLS3gNoqYaRowaBKtGVrOF1Jz4yDTgYKLK2kvfJg==",
90 | "requires": {
91 | "dom-serializer": "^1.0.1",
92 | "domelementtype": "^2.2.0",
93 | "domhandler": "^4.2.0"
94 | }
95 | },
96 | "entities": {
97 | "version": "2.2.0",
98 | "resolved": "https://registry.npmjs.org/entities/-/entities-2.2.0.tgz",
99 | "integrity": "sha512-p92if5Nz619I0w+akJrLZH0MX0Pb5DX39XOwQTtXSdQQOaYH03S1uIQp4mhOZtAXrxq4ViO67YTiLBo2638o9A=="
100 | },
101 | "follow-redirects": {
102 | "version": "1.14.1",
103 | "resolved": "https://registry.npmjs.org/follow-redirects/-/follow-redirects-1.14.1.tgz",
104 | "integrity": "sha512-HWqDgT7ZEkqRzBvc2s64vSZ/hfOceEol3ac/7tKwzuvEyWx3/4UegXh5oBOIotkGsObyk3xznnSRVADBgWSQVg=="
105 | },
106 | "htmlparser2": {
107 | "version": "6.1.0",
108 | "resolved": "https://registry.npmjs.org/htmlparser2/-/htmlparser2-6.1.0.tgz",
109 | "integrity": "sha512-gyyPk6rgonLFEDGoeRgQNaEUvdJ4ktTmmUh/h2t7s+M8oPpIPxgNACWa+6ESR57kXstwqPiCut0V8NRpcwgU7A==",
110 | "requires": {
111 | "domelementtype": "^2.0.1",
112 | "domhandler": "^4.0.0",
113 | "domutils": "^2.5.2",
114 | "entities": "^2.0.0"
115 | }
116 | },
117 | "nth-check": {
118 | "version": "2.0.0",
119 | "resolved": "https://registry.npmjs.org/nth-check/-/nth-check-2.0.0.tgz",
120 | "integrity": "sha512-i4sc/Kj8htBrAiH1viZ0TgU8Y5XqCaV/FziYK6TBczxmeKm3AEFWqqF3195yKudrarqy7Zu80Ra5dobFjn9X/Q==",
121 | "requires": {
122 | "boolbase": "^1.0.0"
123 | }
124 | },
125 | "parse5": {
126 | "version": "6.0.1",
127 | "resolved": "https://registry.npmjs.org/parse5/-/parse5-6.0.1.tgz",
128 | "integrity": "sha512-Ofn/CTFzRGTTxwpNEs9PP93gXShHcTq255nzRYSKe8AkVpZY7e1fpmTfOyoIvjP5HG7Z2ZM7VS9PPhQGW2pOpw=="
129 | },
130 | "parse5-htmlparser2-tree-adapter": {
131 | "version": "6.0.1",
132 | "resolved": "https://registry.npmjs.org/parse5-htmlparser2-tree-adapter/-/parse5-htmlparser2-tree-adapter-6.0.1.tgz",
133 | "integrity": "sha512-qPuWvbLgvDGilKc5BoicRovlT4MtYT6JfJyBOMDsKoiT+GiuP5qyrPCnR9HcPECIJJmZh5jRndyNThnhhb/vlA==",
134 | "requires": {
135 | "parse5": "^6.0.1"
136 | }
137 | },
138 | "tslib": {
139 | "version": "2.3.0",
140 | "resolved": "https://registry.npmjs.org/tslib/-/tslib-2.3.0.tgz",
141 | "integrity": "sha512-N82ooyxVNm6h1riLCoyS9e3fuJ3AMG2zIZs2Gd1ATcSFjSA23Q0fzjjZeh0jbJvWVDZ0cJT8yaNNaaXHzueNjg=="
142 | }
143 | }
144 | }
145 |
--------------------------------------------------------------------------------
/python/News-Article-Scraper/JavaScript/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "code",
3 | "version": "1.0.0",
4 | "description": "",
5 | "main": "get_links_from_sitemap_cheerio.js",
6 | "scripts": {
7 | "test": "echo \"Error: no test specified\" && exit 1"
8 | },
9 | "keywords": [],
10 | "author": "",
11 | "license": "ISC",
12 | "dependencies": {
13 | "axios": "^0.21.1",
14 | "cheerio": "^1.0.0-rc.10"
15 | }
16 | }
17 |
--------------------------------------------------------------------------------
/python/News-Article-Scraper/Python/extract_article_links.py:
--------------------------------------------------------------------------------
1 | from bs4 import BeautifulSoup
2 | import requests
3 |
4 |
5 | def parse_sitemap() -> list:
6 | response = requests.get("https://www.example.com/sitemap.xml")
7 | if response.status_code != 200:
8 | return None
9 | xml_as_str = response.text
10 |
11 | soup = BeautifulSoup(xml_as_str, "lxml")
12 | loc_elements = soup.find_all("loc")
13 | links = []
14 | for loc in loc_elements:
15 | links.append(loc.text)
16 |
17 | print(f'Found {len(links)} links')
18 | return links
19 |
20 |
21 | if __name__ == '__main__':
22 | links = parse_sitemap()
23 |
--------------------------------------------------------------------------------
/python/News-Article-Scraper/Python/news_article_scraper.py:
--------------------------------------------------------------------------------
1 | from bs4 import BeautifulSoup
2 | import requests
3 | import csv
4 |
5 |
6 | def parse_sitemap() -> list:
7 | response = requests.get("https://www.example.com/sitemap.xml")
8 | if response.status_code != 200:
9 | return None
10 | xml_as_str = response.text
11 |
12 | soup = BeautifulSoup(xml_as_str, "lxml")
13 | loc_elements = soup.find_all("loc")
14 | links = []
15 | for loc in loc_elements:
16 | links.append(loc.text)
17 |
18 | print(f'Found {len(links)} links')
19 | return links
20 |
21 |
22 | def parse_articles(links: list):
23 | s = requests.Session()
24 | with open("news.csv", "w", encoding="utf-8", newline="") as f:
25 | writer = csv.DictWriter(f, fieldnames=['Heading', 'Body'])
26 | writer.writeheader()
27 | for link in links:
28 | response = s.get(link)
29 | soup = BeautifulSoup(response.text, "lxml")
30 | heading = soup.select_one('h1').text
31 | para = []
32 | for p in soup.select('.complete-story p'):
33 | para.append(p.text)
34 | body = '\n'.join(para)
35 | writer.writerow({'Heading': heading,
36 | 'Body': body
37 | })
38 |
39 |
40 | if __name__ == '__main__':
41 | links = parse_sitemap()
42 | parse_articles(links)
43 |
--------------------------------------------------------------------------------
/python/Pagination-With-Python/images/load_more_button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Pagination-With-Python/images/load_more_button.png
--------------------------------------------------------------------------------
/python/Pagination-With-Python/images/next_button_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Pagination-With-Python/images/next_button_example.png
--------------------------------------------------------------------------------
/python/Pagination-With-Python/images/next_button_example_page2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Pagination-With-Python/images/next_button_example_page2.png
--------------------------------------------------------------------------------
/python/Pagination-With-Python/images/next_button_example_page3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Pagination-With-Python/images/next_button_example_page3.png
--------------------------------------------------------------------------------
/python/Pagination-With-Python/images/next_button_locate.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Pagination-With-Python/images/next_button_locate.png
--------------------------------------------------------------------------------
/python/Pagination-With-Python/images/pager_without_next.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Pagination-With-Python/images/pager_without_next.png
--------------------------------------------------------------------------------
/python/Pagination-With-Python/images/scroll_html_response.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Pagination-With-Python/images/scroll_html_response.png
--------------------------------------------------------------------------------
/python/Pagination-With-Python/images/scroll_json_response.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Pagination-With-Python/images/scroll_json_response.png
--------------------------------------------------------------------------------
/python/Pagination-With-Python/images/scroll_json_response_has_next.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Pagination-With-Python/images/scroll_json_response_has_next.png
--------------------------------------------------------------------------------
/python/Pagination-With-Python/infinite_scroll_html.py:
--------------------------------------------------------------------------------
1 | # Handling pages with load more with HTML response
2 | import requests
3 | from bs4 import BeautifulSoup
4 | import math
5 |
6 |
7 | def process_pages():
8 | index_page = 'https://techinstr.myshopify.com/collections/all'
9 | url = 'https://techinstr.myshopify.com/collections/all?page={}'
10 |
11 | session = requests.session()
12 | response = session.get(index_page)
13 | soup = BeautifulSoup(response.text, "lxml")
14 | count_element = soup.select_one('.filters-toolbar__product-count')
15 | count_str = count_element.text.replace('products', '')
16 | count = int(count_str)
17 | # Process page 1 data here
18 | page_count = math.ceil(count/8)
19 | for page_numer in range(2, page_count+1):
20 | response = session.get(url.format(page_numer))
21 | soup = BeautifulSoup(response.text, "lxml")
22 | first_product = soup.select_one('.product-card:nth-child(1) > a > span')
23 | print(first_product.text.strip())
24 |
25 |
26 | if __name__ == '__main__':
27 | process_pages()
28 |
--------------------------------------------------------------------------------
/python/Pagination-With-Python/infinite_scroll_json.py:
--------------------------------------------------------------------------------
1 | # Handling pages with load more with JSON response
2 | import requests
3 |
4 |
5 | def process_pages():
6 | url = 'http://quotes.toscrape.com/api/quotes?page={}'
7 | page_numer = 1
8 | while True:
9 | response = requests.get(url.format(page_numer))
10 | data = response.json()
11 | # Process data
12 | # ...
13 | print(response.url) # only for debug
14 | if data.get('has_next'):
15 | page_numer += 1
16 | else:
17 | break
18 |
19 |
20 | if __name__ == '__main__':
21 | process_pages()
22 |
--------------------------------------------------------------------------------
/python/Pagination-With-Python/load_more_json.py:
--------------------------------------------------------------------------------
1 | # Handling pages with load more button with JSON
2 | import requests
3 | from bs4 import BeautifulSoup
4 | import math
5 |
6 |
7 | def process_pages():
8 | url = 'https://smarthistory.org/wp-json/smthstapi/v1/objects?tag=938&page={}'
9 | headers = {
10 | 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
11 | }
12 | page_numer = 1
13 | while True:
14 | response = requests.get(url.format(page_numer), headers=headers)
15 | data = response.json()
16 | # Process data
17 | # ...
18 | print(response.url) # only for debug
19 | if data.get('remaining') and int(data.get('remaining')) > 0:
20 | page_numer += 1
21 | else:
22 | break
23 |
24 |
25 | if __name__ == '__main__':
26 | process_pages()
27 |
--------------------------------------------------------------------------------
/python/Pagination-With-Python/next_button.py:
--------------------------------------------------------------------------------
1 | # Handling pages with Next button
2 | import requests
3 | from bs4 import BeautifulSoup
4 | from urllib.parse import urljoin
5 |
6 |
7 | def process_pages():
8 | url = 'http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html'
9 |
10 | while True:
11 | response = requests.get(url)
12 | soup = BeautifulSoup(response.text, "lxml")
13 |
14 | footer_element = soup.select_one('li.current')
15 | print(footer_element.text.strip())
16 |
17 | # Pagination
18 | next_page_element = soup.select_one('li.next > a')
19 | if next_page_element:
20 | next_page_url = next_page_element.get('href')
21 | url = urljoin(url, next_page_url)
22 | else:
23 | break
24 |
25 |
26 | if __name__ == '__main__':
27 | process_pages()
28 |
--------------------------------------------------------------------------------
/python/Pagination-With-Python/no_next_button.py:
--------------------------------------------------------------------------------
1 | # Handling pages with Next button
2 | import requests
3 | from bs4 import BeautifulSoup
4 | from urllib.parse import urljoin
5 |
6 |
7 | def process_pages():
8 | url = 'https://www.gosc.pl/doc/791526.Zaloz-zbroje'
9 | response = requests.get(url)
10 | soup = BeautifulSoup(response.text, 'lxml')
11 | page_link_el = soup.select('.pgr_nrs a')
12 | # process first page
13 | for link_el in page_link_el:
14 | link = urljoin(url, link_el.get('href'))
15 | response = requests.get(link)
16 | soup = BeautifulSoup(response.text, 'lxml')
17 | print(response.url)
18 | # process remaining pages
19 |
20 |
21 | if __name__ == '__main__':
22 | process_pages()
23 |
--------------------------------------------------------------------------------
/python/Price-Parsing-Tutorial/images/Preview-of-RegEx.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Price-Parsing-Tutorial/images/Preview-of-RegEx.png
--------------------------------------------------------------------------------
/python/Python-Web-Scraping-Tutorial/python_toc.csv:
--------------------------------------------------------------------------------
1 | heading_number,heading_text
2 | 1,History
3 | 2,Design philosophy and features
4 | 3,Syntax and semantics
5 | 3.1,Indentation
6 | 3.2,Statements and control flow
7 | 3.3,Expressions
8 | 3.4,Methods
9 | 3.5,Typing
10 | 3.6,Arithmetic operations
11 | 4,Programming examples
12 | 5,Libraries
13 | 6,Development environments
14 | 7,Implementations
15 | 7.1,Reference implementation
16 | 7.2,Other implementations
17 | 7.3,Unsupported implementations
18 | 7.4,Cross-compilers to other languages
19 | 7.5,Performance
20 | 8,Development
21 | 9,API documentation generators
22 | 10,Naming
23 | 11,Uses
24 | 12,Languages influenced by Python
25 | 13,See also
26 | 14,References
27 | 14.1,Sources
28 | 15,Further reading
29 | 16,External links
30 |
--------------------------------------------------------------------------------
/python/Python-Web-Scraping-Tutorial/web_scraping_toc.csv:
--------------------------------------------------------------------------------
1 | heading_number,heading_text
2 | 1,History
3 | 2,Techniques
4 | 2.1,Human copy-and-paste
5 | 2.2,Text pattern matching
6 | 2.3,HTTP programming
7 | 2.4,HTML parsing
8 | 2.5,DOM parsing
9 | 2.6,Vertical aggregation
10 | 2.7,Semantic annotation recognizing
11 | 2.8,Computer vision web-page analysis
12 | 3,Software
13 | 4,Legal issues
14 | 4.1,United States
15 | 4.2,The EU
16 | 4.3,Australia
17 | 4.4,India
18 | 5,Methods to prevent web scraping
19 | 6,See also
20 | 7,References
21 |
--------------------------------------------------------------------------------
/python/Python-Web-Scraping-Tutorial/webscraping_5lines.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from bs4 import BeautifulSoup
3 | response = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
4 | bs = BeautifulSoup(response.text, "lxml")
5 | print(bs.find("p").text)
6 |
--------------------------------------------------------------------------------
/python/Python-Web-Scraping-Tutorial/wiki_toc.py:
--------------------------------------------------------------------------------
1 | import csv
2 | import requests
3 | from bs4 import BeautifulSoup
4 |
5 |
6 |
7 | def get_data(url):
8 | response = requests.get(url)
9 | soup = BeautifulSoup(response.text, 'lxml')
10 | table_of_contents = soup.find("div", id="toc")
11 | headings = table_of_contents.find_all("li")
12 | data = []
13 | for heading in headings:
14 | heading_text = heading.find("span", class_="toctext").text
15 | heading_number = heading.find("span", class_="tocnumber").text
16 | data.append({
17 | 'heading_number': heading_number,
18 | 'heading_text': heading_text,
19 | })
20 | return data
21 |
22 |
23 | def export_data(data, file_name):
24 | with open(file_name, "w", newline="") as file:
25 | writer = csv.DictWriter(file, fieldnames=['heading_number', 'heading_text'])
26 | writer.writeheader()
27 | writer.writerows(data)
28 |
29 |
30 | def main():
31 | url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
32 | file_name = "python_toc.csv"
33 | data = get_data(url_to_parse)
34 | export_data(data, file_name)
35 |
36 | url_to_parse = "https://en.wikipedia.org/wiki/Web_scraping"
37 | file_name = "web_scraping_toc.csv"
38 | data = get_data(url_to_parse)
39 | export_data(data, file_name)
40 |
41 | print('Done')
42 |
43 |
44 | if __name__ == '__main__':
45 | main()
46 |
--------------------------------------------------------------------------------
/python/Rotating-Proxies-With-Python/no_proxy.py:
--------------------------------------------------------------------------------
1 | import requests
2 |
3 | response = requests.get('https://ip.oxylabs.io/location')
4 | print(response.text)
5 |
--------------------------------------------------------------------------------
/python/Rotating-Proxies-With-Python/requirements.txt:
--------------------------------------------------------------------------------
1 | aiohttp==3.8.1
2 | requests==2.27.1
3 |
--------------------------------------------------------------------------------
/python/Rotating-Proxies-With-Python/rotating_multiple_proxies.py:
--------------------------------------------------------------------------------
1 | import csv
2 |
3 | import requests
4 | from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout
5 |
6 | TIMEOUT_IN_SECONDS = 10
7 | CSV_FILENAME = 'proxies.csv'
8 |
9 | with open(CSV_FILENAME) as open_file:
10 | reader = csv.reader(open_file)
11 | for csv_row in reader:
12 | scheme_proxy_map = {
13 | 'https': csv_row[0],
14 | }
15 |
16 | try:
17 | response = requests.get(
18 | 'https://ip.oxylabs.io',
19 | proxies=scheme_proxy_map,
20 | timeout=TIMEOUT_IN_SECONDS,
21 | )
22 | except (ProxyError, ReadTimeout, ConnectTimeout) as error:
23 | pass
24 | else:
25 | print(response.text)
26 |
--------------------------------------------------------------------------------
/python/Rotating-Proxies-With-Python/rotating_multiple_proxies_async.py:
--------------------------------------------------------------------------------
1 | import csv
2 | import aiohttp
3 | import asyncio
4 |
5 | CSV_FILENAME = 'proxies.csv'
6 | URL_TO_CHECK = 'https://ip.oxylabs.io'
7 | TIMEOUT_IN_SECONDS = 10
8 |
9 |
10 | async def check_proxy(url, proxy):
11 | try:
12 | session_timeout = aiohttp.ClientTimeout(
13 | total=None, sock_connect=TIMEOUT_IN_SECONDS, sock_read=TIMEOUT_IN_SECONDS
14 | )
15 | async with aiohttp.ClientSession(timeout=session_timeout) as session:
16 | async with session.get(
17 | url, proxy=proxy, timeout=TIMEOUT_IN_SECONDS
18 | ) as resp:
19 | print(await resp.text())
20 | except Exception as error:
21 | print('Proxy responded with an error: ', error)
22 | return
23 |
24 |
25 | async def main():
26 | tasks = []
27 | with open(CSV_FILENAME) as open_file:
28 | reader = csv.reader(open_file)
29 | for csv_row in reader:
30 | task = asyncio.create_task(check_proxy(URL_TO_CHECK, csv_row[0]))
31 | tasks.append(task)
32 |
33 | await asyncio.gather(*tasks)
34 |
35 |
36 | asyncio.run(main())
37 |
--------------------------------------------------------------------------------
/python/Rotating-Proxies-With-Python/single_proxy.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout
3 |
4 | PROXY = 'http://2.56.215.247:3128'
5 | TIMEOUT_IN_SECONDS = 10
6 |
7 | scheme_proxy_map = {
8 | 'https': PROXY,
9 | }
10 | try:
11 | response = requests.get(
12 | 'https://ip.oxylabs.io', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS
13 | )
14 | except (ProxyError, ReadTimeout, ConnectTimeout) as error:
15 | print('Unable to connect to the proxy: ', error)
16 | else:
17 | print(response.text)
18 |
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/data_in_same_page.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from bs4 import BeautifulSoup
3 | import re
4 | import json
5 |
6 | response = requests.get('https://quotes.toscrape.com/js/')
7 | soup = BeautifulSoup(response.text, "lxml")
8 | script_tag = soup.find("script", src=None)
9 | pattern = "var data =(.+?);\n"
10 | raw_data = re.findall(pattern, script_tag.string, re.S)
11 | if raw_data:
12 | data = json.loads(raw_data[0])
13 | # prints whole data
14 | print(data)
15 |
16 | # prints only the author
17 | for i in data:
18 | print(i['author']['name'])
19 |
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/author_markup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/author_markup.png
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/command_menu.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/command_menu.png
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/dynamic_site_no_js.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/dynamic_site_no_js.png
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/infinite_scroll.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/infinite_scroll.png
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/infinite_scroll_no_js.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/infinite_scroll_no_js.png
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/json_embedded.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/json_embedded.png
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/libribox.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/web-scraping-tutorials/8ab0651ee0873892b1f4d053613bc587f43d3623/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/images/libribox.png
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/selenium_bs4.py:
--------------------------------------------------------------------------------
1 | from selenium.webdriver import Chrome
2 | from bs4 import BeautifulSoup
3 | # update executable_path as required
4 | driver = Chrome(executable_path='c:/driver/chromedriver.exe')
5 |
6 | driver.get('https://quotes.toscrape.com/js/')
7 |
8 | try:
9 | soup = BeautifulSoup(driver.page_source, "lxml")
10 | # print first author
11 | author_element = soup.find("small", class_="author")
12 | print(author_element.text)
13 |
14 | # print all authors
15 | all_author_elements = soup.find_all("small", class_="author")
16 | for element in all_author_elements:
17 | print(element.text)
18 | finally:
19 | # always close the browser
20 | driver.quit()
21 |
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/selenium_bs4_headless.py:
--------------------------------------------------------------------------------
1 | from selenium.webdriver import Chrome, ChromeOptions
2 | from bs4 import BeautifulSoup
3 |
4 | # Hide the browser
5 | options = ChromeOptions()
6 | options.headless = True
7 |
8 | # update executable_path as required
9 | driver = Chrome(executable_path='c:/driver/chromedriver.exe', options=options)
10 |
11 | driver.get('https://quotes.toscrape.com/js/')
12 |
13 | try:
14 | soup = BeautifulSoup(driver.page_source, "lxml")
15 | # print first author
16 | author_element = soup.find("small", class_="author")
17 | print(author_element.text)
18 |
19 | # print all authors
20 | all_author_elements = soup.find_all("small", class_="author")
21 | for element in all_author_elements:
22 | print(element.text)
23 | finally:
24 | # always close the browser
25 | driver.quit()
26 |
--------------------------------------------------------------------------------
/python/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/selenium_example.py:
--------------------------------------------------------------------------------
1 | from selenium.webdriver import Chrome
2 |
3 | # update executable_path as required
4 | driver = Chrome(executable_path='c:/driver/chromedriver.exe')
5 |
6 | driver.get('https://quotes.toscrape.com/js/')
7 |
8 | try:
9 | # print first author
10 | author_element = driver.find_element_by_tag_name("small")
11 | print(author_element.text)
12 |
13 | # print all authors
14 | all_author_elements = driver.find_elements_by_tag_name("small")
15 | for element in all_author_elements:
16 | print(element.text)
17 | finally:
18 | # always close the browser
19 | driver.quit()
20 |
--------------------------------------------------------------------------------
/python/Web-Scraping-With-Selenium/books_selenium.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from selenium.webdriver import Chrome, ChromeOptions
3 | from selenium.webdriver.common.by import By
4 | from selenium.webdriver.support.ui import WebDriverWait
5 | from selenium.webdriver.support import expected_conditions as EC
6 | from selenium.webdriver.common.keys import Keys
7 |
8 | CHROME_DRIVER_PATH = 'c:/WebDrivers/chromedriver.exe'
9 | HOMEPAGE = "http://books.toscrape.com"
10 |
11 |
12 | def get_data(url, categories):
13 | browser_options = ChromeOptions()
14 | browser_options.headless = True
15 |
16 | driver = Chrome(executable_path=CHROME_DRIVER_PATH, options=browser_options)
17 | driver.get(url)
18 | driver.implicitly_wait(10)
19 | data = []
20 | for category in categories:
21 | humor = driver.find_element_by_xpath(f'//a[contains(text(), "{category}")]')
22 | humor.click()
23 |
24 | try:
25 | books = WebDriverWait(driver, 10).until(
26 | EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.product_pod'))
27 | )
28 | except Exception as e:
29 | raise e
30 |
31 | for book in books:
32 | title = book.find_element_by_css_selector("h3 > a")
33 | price = book.find_element_by_css_selector(".price_color")
34 | stock = book.find_element_by_css_selector(".instock.availability")
35 | data.append({
36 | 'title': title.get_attribute("title"),
37 | 'price': price.text,
38 | 'stock': stock.text,
39 | 'Category': category
40 | })
41 |
42 | driver.get(url)
43 |
44 | driver.quit()
45 | return data
46 |
47 |
48 | def export_csv(data):
49 | df = pd.DataFrame(data)
50 | # Apply transformations if needed
51 | df.to_csv("books_exported.csv", index=False)
52 | print(df) # DEBUG
53 |
54 |
55 | def main():
56 | data = get_data(url=HOMEPAGE, categories=["Humor", "Art"])
57 | export_csv(data)
58 | print('DONE')
59 |
60 |
61 | if __name__ == '__main__':
62 | main()
63 |
--------------------------------------------------------------------------------
/python/automate-competitors-benchmark-analysis/README.md:
--------------------------------------------------------------------------------
1 | # How to Automate Competitors’ & Benchmark Analysis With Python
2 |
3 | - [Using Oxylabs’ solution to retrieve the SERPs results](#using-oxylabs-solution-to-retrieve-the-serps-results)
4 | - [Scraping URLs of the top results](#scraping-urls-of-the-top-results)
5 | - [Obtaining the off-page metrics](#obtaining-the-off-page-metrics)
6 | - [Obtaining the Page Speed metrics](#obtaining-the-page-speed-metrics)
7 | - [Converting Python list into a dataframe and exporting it as an Excel file](#converting-python-list-into-a-dataframe-and-exporting-it-as-an-excel-file)
8 |
9 | Doing competitor or benchmark analysis for SEO can be a burdensome task, as it requires taking into account many factors that are usually extracted from different data sources.
10 |
11 | The purpose of this article is to help you automate the data extraction processes as much as possible. After learning how to do this, you can dedicate your time to what matters: the analysis itself and coming up with actionable insights to strategize.
12 |
13 | For a detailed explanation, see our [blog post](https://oxy.yt/erEh).
14 |
15 | ## Using Oxylabs’ solution to retrieve the SERPs results
16 |
17 | ```python
18 | import requests
19 |
20 | keyword = "Proxy types
23 |
24 |
30 |
36 |
37 |
38 |
39 | ```
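
The request itself can be sent with the Requests library. The sketch below is only a rough illustration of this step; the endpoint, credentials, and payload fields are assumptions rather than the exact values used in the article:

```python
import requests

# NOTE: endpoint, credentials, and payload fields are assumptions for
# illustration only; consult the SERP scraper's documentation for real values.
payload = {
    'source': 'google_search',
    'query': 'oxylabs proxy types',   # hypothetical keyword
    'parse': True,
}

response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',  # assumed endpoint
    auth=('USERNAME', 'PASSWORD'),             # placeholder credentials
    json=payload,
)
print(response.json())
```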
40 |
41 | ## Traversing for HTML tags
42 |
43 | First, we can use Beautiful Soup to extract a list of all the tags used in our sample HTML file. For this, we will use the soup.descendants generator.
44 |
45 | ```python
46 | from bs4 import BeautifulSoup
47 |
48 | with open('index.html', 'r') as f:
49 | contents = f.read()
50 |
51 | soup = BeautifulSoup(contents, features="html.parser")
52 |
53 | for child in soup.descendants:
54 |
55 | if child.name:
56 | print(child.name)
57 | ```
58 |
59 | After running this code (right click on code and click “Run”) you should get the below output:
60 |
61 | ```html
62 | html
63 | head
64 | title
65 | meta
66 | body
67 | h2
68 | p
69 | ul
70 | li
71 | li
72 | li
73 | li
74 | li
75 | ```
76 |
77 | What just happened? Beautiful Soup traversed our HTML file and printed all the HTML tags that it has found sequentially. Let’s take a quick look at what each line did.
78 |
79 | ```python
80 | from bs4 import BeautifulSoup
81 | ```
82 |
83 | This tells Python to use the Beautiful Soup library.
84 |
85 | ```python
86 | with open('index.html', 'r') as f:
87 | contents = f.read()
88 | ```
89 |
90 | And this code, as you could probably guess, gives an instruction to open our sample HTML file and read its contents.
91 |
92 | ```python
93 | soup = BeautifulSoup(contents, features="html.parser")
94 | ```
95 |
96 | This line creates a BeautifulSoup object from the file contents using Python’s built-in HTML parser. Other parsers, such as lxml, can also be used, but lxml is a separate external library, and for the purposes of this tutorial the built-in parser will do just fine.
97 |
98 | ```python
99 | for child in soup.descendants:
100 |
101 | if child.name:
102 | print(child.name)
103 | ```
104 |
105 | The final pieces of code, namely the soup.descendants generator, instruct Beautiful Soup to look for HTML tags and print them in the PyCharm console. The results can also easily be exported to a .csv file but we will get to this later.
106 |
107 | ## Getting the full content of tags
108 |
109 | To get the content of tags, this is what we can do:
110 |
111 | ```python
112 | from bs4 import BeautifulSoup
113 |
114 | with open('index.html', 'r') as f:
115 | contents = f.read()
116 |
117 | soup = BeautifulSoup(contents, features="html.parser")
118 |
119 | print(soup.h2)
120 | print(soup.p)
121 | print(soup.li)
122 | ```
123 |
124 | This is a simple instruction that outputs the HTML tag with its full content in the specified order. Here’s what the output should look like:
125 |
126 | ```html
127 | <h2>Proxy types</h2>
128 | <p>…</p>
129 | <li>Residential proxies</li>
172 | ```
173 |
174 | ## Finding all specified tags and extracting text
175 |
176 | The find_all method is a great way to extract specific data from an HTML file. It accepts many criteria that make it a flexible tool allowing us to filter data in convenient ways. Yet for this tutorial we do not need anything more complex. Let’s find all items of our list and print them as text only:
177 |
178 | ```python
179 | for tag in soup.find_all('li'):
180 | print(tag.text)
181 | ```
182 |
183 | This is how the full code should look:
184 |
185 | ```python
186 | from bs4 import BeautifulSoup
187 |
188 | with open('index.html', 'r') as f:
189 | contents = f.read()
190 |
191 | soup = BeautifulSoup(contents, features="html.parser")
192 |
193 | for tag in soup.find_all('li'):
194 | print(tag.text)
195 | ```
196 |
197 | And here’s the output:
198 |
199 | ```
200 | Residential proxies
201 | Datacenter proxies
202 | Shared proxies
203 | Semi-dedicated proxies
204 | Private proxies
205 | ```
206 |
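
As a brief aside (not part of the original walkthrough), `find_all` can also narrow results down by attributes or cap the number of matches, which is where its flexibility comes from. A minimal sketch against the same `index.html`:

```python
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, features="html.parser")

# Limit the number of matches returned.
first_two = soup.find_all('li', limit=2)
print([tag.text for tag in first_two])

# Filter by an attribute value; the class name here is hypothetical,
# the sample file may not contain it.
highlighted = soup.find_all('li', class_='highlighted')
print(len(highlighted))
```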
207 | ## Exporting data to a .csv file
208 |
209 | ```bash
210 | pip install pandas
211 | ```
212 |
213 | Add this line to the beginning of your code to import the library:
214 |
215 | ```python
216 | import pandas as pd
217 | ```
218 |
219 | Going further, let’s add some lines that will export the list we extracted earlier to a .csv file. This is how our full code should look:
220 |
221 | ```python
222 | from bs4 import BeautifulSoup
223 | import pandas as pd
224 |
225 | with open('index.html', 'r') as f:
226 | contents = f.read()
227 |
228 | soup = BeautifulSoup(contents, features="html.parser")
229 | results = soup.find_all('li')
230 |
231 | df = pd.DataFrame({'Names': results})
232 | df.to_csv('names.csv', index=False, encoding='utf-8')
233 | ```
234 |
235 | What happened here? Let’s take a look:
236 |
237 | ```python
238 | results = soup.find_all('li')
239 | ```
240 |
241 | This line finds all instances of the `<li>` tag and stores them in the `results` variable.
4 |
5 | - [Installation](#installation)
6 | - [Creating a simple XML document](#creating-a-simple-xml-document)
7 | - [The Element class](#the-element-class)
8 | - [The SubElement class](#the-subelement-class)
9 | - [Setting text and attributes](#setting-text-and-attributes)
10 | - [Parse an XML file using LXML in Python](#parse-an-xml-file-using-lxml-in-python)
11 | - [Finding elements in XML](#finding-elements-in-xml)
12 | - [Handling HTML with lxml.html](#handling-html-with-lxmlhtml)
13 | - [lxml web scraping tutorial](#lxml-web-scraping-tutorial)
14 | - [Conclusion](#conclusion)
15 |
16 | In this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then jump on processing XML and HTML documents. Finally, we will put together all the pieces and see how to extract data using lxml.
17 |
18 | For a detailed explanation, see our [blog post](https://oxy.yt/BrAk).
19 |
20 | ## Installation
21 |
22 | The best way to download and install the lxml library is to use the pip package manager. This works on Windows, Mac, and Linux:
23 |
24 | ```shell
25 | pip3 install lxml
26 | ```
27 |
28 | ## Creating a simple XML document
29 |
30 | A very simple XML document would look like this:
31 |
32 | ```xml
33 | <root>Hello World!</root>
34 | ```
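
To give a flavour of the `Element` and `SubElement` classes listed in the table of contents, here is a minimal, self-contained sketch of building and serializing a document with `lxml.etree` (the element names are illustrative only):

```python
from lxml import etree

# Build a small document in memory.
root = etree.Element('html')
body = etree.SubElement(root, 'body')
body.text = 'Hello World!'
body.set('id', 'greeting')  # setting an attribute

# Serialize it back to XML.
print(etree.tostring(root, pretty_print=True).decode())
```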
4 |
5 | - [Fetch HTML Page](#fetch-html-page)
6 | - [Parsing HTML](#parsing-html)
7 | - [Extracting Text](#extracting-text)
8 |
9 | This article discusses everything you need to know about news scraping, including the benefits and use cases of news scraping as well as how you can use Python to create an article scraper.
10 |
11 | For a detailed explanation, see our [blog post](https://oxy.yt/YrD0).
12 |
13 |
14 |
15 | ## Fetch HTML Page
16 |
17 | ```shell
18 | pip3 install requests
19 | ```
20 |
21 | Create a new Python file and enter the following code:
22 |
23 | ```python
24 | import requests
25 | response = requests.get('https://quotes.toscrape.com')
26 |
27 | print(response.text) # Prints the entire HTML of the webpage.
28 | ```
29 |
30 | ## Parsing HTML
31 |
32 | ```shell
33 | pip3 install lxml beautifulsoup4
34 | ```
35 |
36 | ```python
37 | from bs4 import BeautifulSoup
38 | response = requests.get('https://quotes.toscrape.com')
39 | soup = BeautifulSoup(response.text, 'lxml')
40 |
41 | title = soup.find('title')
42 | ```
43 |
44 | ## Extracting Text
45 |
46 | ```python
47 | print(title.get_text()) # Prints page title.
48 | ```
49 |
50 | ### Fine Tuning
51 |
52 | ```python
53 | soup.find('small',itemprop="author")
54 | ```
55 |
56 | ```python
57 | soup.find('small',class_="author")
58 | ```
59 |
60 | ### Extracting Headlines
61 |
62 | ```python
63 | headlines = soup.find_all(itemprop="text")
64 |
65 | for headline in headlines:
66 | print(headline.get_text())
67 | ```
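
Putting the snippets above together, a short sketch (not part of the original steps) that pairs each text block on the demo page with its author:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com')
soup = BeautifulSoup(response.text, 'lxml')

# Each quote block on the demo page contains the text and its author.
for block in soup.find_all('div', class_='quote'):
    text = block.find(itemprop='text').get_text()
    author = block.find('small', class_='author').get_text()
    print(f'{author}: {text}')
```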
68 |
69 |
70 |
71 | If you wish to find out more about News Scraping, see our [blog post](https://oxy.yt/YrD0).
72 |
--------------------------------------------------------------------------------
/python/pandas-read-html-tables/src/population.html:
--------------------------------------------------------------------------------
1 | <html>
2 | <body>
3 | <table>
4 | <tr>
5 | <th>Sequence</th>
6 | <th>Country</th>
7 | <th>Population</th>
8 | <th>Updated</th>
9 | </tr>
10 | <tr>
11 | <td>1</td>
12 | <td>China</td>
13 | <td>1,439,323,776</td>
14 | <td>1-Dec-2020</td>
15 | </tr>
16 | <tr>
17 | <td>2</td>
18 | <td>India</td>
19 | <td>1,380,004,385</td>
20 | <td>1-Dec-2020</td>
21 | </tr>
22 | <tr>
23 | <td>3</td>
24 | <td>United States</td>
25 | <td>331,002,651</td>
26 | <td>1-Dec-2020</td>
27 | </tr>
28 | </table>
29 | </body>
30 | </html>
--------------------------------------------------------------------------------
/python/playwright-web-scraping/README.md:
--------------------------------------------------------------------------------
1 | # Web Scraping With Playwright
2 |
3 | [](https://github.com/topics/playwright) [](https://github.com/topics/web-scraping)
4 |
5 | - [Support for proxies in Playwright](#support-for-proxies-in-playwright)
6 | - [Basic scraping with Playwright](#basic-scraping-with-playwright)
7 | - [Web Scraping](#web-scraping)
8 |
9 | This article demonstrates how to use Playwright with Node.js and Python to control a headless browser for web scraping, including how to route traffic through proxies and how to extract data from a page.
10 |
11 | For a detailed explanation, see our [blog post](https://oxy.yt/erHw).
12 |
13 |
14 | ## Support for proxies in Playwright
15 |
16 | #### Without Proxy
17 |
18 | ```javascript
19 |
20 | // Node.js
21 |
22 | const { chromium } = require('playwright');
23 | const browser = await chromium.launch();
24 | ```
25 |
26 |
27 |
28 | ```python
29 |
30 | # Python
31 |
32 | from playwright.async_api import async_playwright
33 | import asyncio
34 | async with async_playwright() as p:
35 | browser = await p.chromium.launch()
36 | ```
37 |
38 | #### With Proxy
39 |
40 | ```javascript
41 | // Node.js
42 | const launchOptions = {
43 | proxy: {
44 | server: '123.123.123.123:80'
45 | },
46 | headless: false
47 | }
48 | const browser = await chromium.launch(launchOptions);
49 | ```
50 |
51 |
52 |
53 | ```python
54 | # Python
55 | proxy_to_use = {
56 | 'server': '123.123.123.123:80'
57 | }
58 | browser = await p.chromium.launch(proxy=proxy_to_use, headless=False)
59 | ```
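
If the proxy requires authentication, the same `proxy` option also accepts `username` and `password` fields; a minimal Python sketch with placeholder values:

```python
# Python
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy={
            'server': 'http://123.123.123.123:80',
            'username': 'proxy_user',      # placeholder
            'password': 'proxy_password',  # placeholder
        })
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com/')
        await browser.close()

asyncio.run(main())
```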
60 |
61 | ## Basic scraping with Playwright
62 |
63 | ### Node.js
64 |
65 | ```shell
66 | npm init -y
67 | npm install playwright
68 | ```
69 |
70 | ```javascript
71 | const playwright = require('playwright');
72 | (async () => {
73 | const browser = await playwright.chromium.launch({
74 | headless: false // Show the browser.
75 | });
76 |
77 | const page = await browser.newPage();
78 | await page.goto('https://books.toscrape.com/');
79 | await page.waitForTimeout(1000); // wait for 1 second
80 | await browser.close();
81 | })();
82 | ```
83 |
84 | ### Python
85 |
86 | ```shell
87 | pip install playwright
88 | ```
89 |
90 |
91 |
92 | ```python
93 | from playwright.async_api import async_playwright
94 | import asyncio
95 |
96 | async def main():
97 | async with async_playwright() as pw:
98 | browser = await pw.chromium.launch(
99 | headless=False # Show the browser
100 | )
101 | page = await browser.new_page()
102 | await page.goto('https://books.toscrape.com/')
103 | # Data Extraction Code Here
104 | await page.wait_for_timeout(1000) # Wait for 1 second
105 | await browser.close()
106 |
107 | if __name__ == '__main__':
108 | asyncio.run(main())
109 | ```
110 |
111 | ## Web Scraping
112 |
113 |
114 |
115 | 
116 |
117 | #### Node.js
118 |
119 | ```javascript
120 | const playwright = require('playwright');
121 |
122 | (async () => {
123 | const browser = await playwright.chromium.launch();
124 | const page = await browser.newPage();
125 | await page.goto('https://books.toscrape.com/');
126 | const books = await page.$$eval('.product_pod', all_items => {
127 | const data = [];
128 | all_items.forEach(book => {
129 | const name = book.querySelector('h3').innerText;
130 | const price = book.querySelector('.price_color').innerText;
131 | const stock = book.querySelector('.availability').innerText;
132 | data.push({ name, price, stock});
133 | });
134 | return data;
135 | });
136 | console.log(books);
137 | await browser.close();
138 | })();
139 | ```
140 |
141 | #### Python
142 |
143 | ```python
144 | from playwright.async_api import async_playwright
145 | import asyncio
146 |
147 |
148 | async def main():
149 | async with async_playwright() as pw:
150 | browser = await pw.chromium.launch()
151 | page = await browser.new_page()
152 | await page.goto('https://books.toscrape.com')
153 |
154 | all_items = await page.query_selector_all('.product_pod')
155 | books = []
156 | for item in all_items:
157 | book = {}
158 | name_el = await item.query_selector('h3')
159 | book['name'] = await name_el.inner_text()
160 | price_el = await item.query_selector('.price_color')
161 | book['price'] = await price_el.inner_text()
162 | stock_el = await item.query_selector('.availability')
163 | book['stock'] = await stock_el.inner_text()
164 | books.append(book)
165 | print(books)
166 | await browser.close()
167 |
168 | if __name__ == '__main__':
169 | asyncio.run(main())
170 | ```
171 |
172 | If you wish to find out more about Web Scraping With Playwright, see our [blog post](https://oxy.yt/erHw).
173 |
--------------------------------------------------------------------------------
/python/playwright-web-scraping/node/book.js:
--------------------------------------------------------------------------------
1 | const playwright = require('playwright');
2 |
3 | (async () => {
4 | const browser = await playwright.chromium.launch();
5 | const page = await browser.newPage();
6 | await page.goto('https://books.toscrape.com/');
7 | const books = await page.$$eval('.product_pod', all_items => {
8 | const data = [];
9 | all_items.forEach(book => {
10 | const name = book.querySelector('h3').innerText;
11 | const price = book.querySelector('.price_color').innerText;
12 | const stock = book.querySelector('.availability').innerText;
13 | data.push({ name, price, stock});
14 | });
15 | return data;
16 | });
17 | console.log(books);
18 | console.log(books.length);
19 | await browser.close();
20 | })();
21 |
--------------------------------------------------------------------------------
/python/playwright-web-scraping/node/package-lock.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "node",
3 | "lockfileVersion": 2,
4 | "requires": true,
5 | "packages": {
6 | "": {
7 | "dependencies": {
8 | "playwright": "^1.27.0"
9 | }
10 | },
11 | "node_modules/playwright": {
12 | "version": "1.27.0",
13 | "resolved": "https://registry.npmjs.org/playwright/-/playwright-1.27.0.tgz",
14 | "integrity": "sha512-F+0+0RD03LS+KdNAMMp63OBzu+NwYYLd52pKLczuSlTsV5b/SLkUoNhSfzDFngEFOuRL2gk0LlfGW3mKiUBk6w==",
15 | "hasInstallScript": true,
16 | "dependencies": {
17 | "playwright-core": "1.27.0"
18 | },
19 | "bin": {
20 | "playwright": "cli.js"
21 | },
22 | "engines": {
23 | "node": ">=14"
24 | }
25 | },
26 | "node_modules/playwright-core": {
27 | "version": "1.27.0",
28 | "resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.27.0.tgz",
29 | "integrity": "sha512-VBKaaFUVKDo3akW+o4DwbK1ZyXh46tcSwQKPK3lruh8IJd5feu55XVZx4vOkbb2uqrNdIF51sgsadYT533SdpA==",
30 | "bin": {
31 | "playwright": "cli.js"
32 | },
33 | "engines": {
34 | "node": ">=14"
35 | }
36 | }
37 | },
38 | "dependencies": {
39 | "playwright": {
40 | "version": "1.27.0",
41 | "resolved": "https://registry.npmjs.org/playwright/-/playwright-1.27.0.tgz",
42 | "integrity": "sha512-F+0+0RD03LS+KdNAMMp63OBzu+NwYYLd52pKLczuSlTsV5b/SLkUoNhSfzDFngEFOuRL2gk0LlfGW3mKiUBk6w==",
43 | "requires": {
44 | "playwright-core": "1.27.0"
45 | }
46 | },
47 | "playwright-core": {
48 | "version": "1.27.0",
49 | "resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.27.0.tgz",
50 | "integrity": "sha512-VBKaaFUVKDo3akW+o4DwbK1ZyXh46tcSwQKPK3lruh8IJd5feu55XVZx4vOkbb2uqrNdIF51sgsadYT533SdpA=="
51 | }
52 | }
53 | }
54 |
--------------------------------------------------------------------------------
/python/playwright-web-scraping/node/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "dependencies": {
3 | "playwright": "^1.27.0"
4 | }
5 | }
6 |
--------------------------------------------------------------------------------
/python/playwright-web-scraping/python/books.py:
--------------------------------------------------------------------------------
1 | from playwright.async_api import async_playwright
2 | import asyncio
3 |
4 |
5 | async def main():
6 | async with async_playwright() as pw:
7 | browser = await pw.chromium.launch()
8 | page = await browser.new_page()
9 | await page.goto('https://books.toscrape.com')
10 |
11 | all_items = await page.query_selector_all('.product_pod')
12 | books = []
13 | for item in all_items:
14 | book = {}
15 | name_el = await item.query_selector('h3')
16 | book['name'] = await name_el.inner_text()
17 | price_el = await item.query_selector('.price_color')
18 | book['price'] = await price_el.inner_text()
19 | stock_el = await item.query_selector('.availability')
20 | book['stock'] = await stock_el.inner_text()
21 | books.append(book)
22 | print(books)
23 | await browser.close()
24 |
25 | if __name__ == '__main__':
26 | asyncio.run(main())
--------------------------------------------------------------------------------
/python/playwright-web-scraping/python/requirements.txt:
--------------------------------------------------------------------------------
1 | playwright
2 |
--------------------------------------------------------------------------------
/python/regex-web-scraping/README.md:
--------------------------------------------------------------------------------
1 | # Web Scraping With RegEx
2 |
3 | ## Creating a virtual environment
4 |
5 | ```bash
6 | python3 -m venv scrapingdemo
7 | ```
8 |
9 | ```bash
10 | source ./scrapingdemo/bin/activate
11 | ```
12 |
13 | ## Installing requirements
14 |
15 | ```bash
16 | pip install requests
17 | ```
18 |
19 | ```bash
20 | pip install beautifulsoup4
21 | ```
22 |
23 | ## Importing the required libraries
24 |
25 | ```python
26 | import requests
27 | from bs4 import BeautifulSoup
28 | import re
29 | ```
30 |
31 | ## Sending the GET request
32 |
33 | Use the Requests library to send a request to a web page from which you want to scrape the data. In this case, https://books.toscrape.com/. To commence, enter the following:
34 |
35 | ```python
36 | page = requests.get('https://books.toscrape.com/')
37 | ```
38 |
39 | ## Selecting data
40 |
41 | First, create a Beautiful Soup object and pass the page content received from your request during the initialization, including the parser type. Since you’re working with HTML code, select `html.parser` as the parser type.
42 |
43 | 
44 |
45 | By inspecting the elements (right-click and select inspect element) in a browser, you can see that each book title and price are presented inside an `article` element with the class called `product_pod`. Use Beautiful Soup to get all the data inside these elements, and then convert it to a string:
46 |
47 | ```python
48 | soup = BeautifulSoup(page.content, 'html.parser')
49 | content = soup.find_all(class_='product_pod')
50 | content = str(content)
51 | ```
52 |
53 | ## Processing the data using RegEx
54 |
55 | Since the acquired content has a lot of unnecessary data, create two regular expressions to get only the desired data.
56 |
57 | 
58 |
59 | ### Expression # 1
60 | ### Finding the pattern
61 |
62 | First, inspect the title of the book to find the pattern. You can see above that every title is present after the text `title=` in the format `title=“Titlename”`.
63 |
64 | ### Generating the expression
65 |
66 | Then, create an expression that returns the data inside quotations after the `title=` by specifying `"(.*?)"`.
67 |
68 | The first expression is as follows:
69 |
70 | ```python
71 | re_titles = r'title="(.*?)">'
72 | ```
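
To verify the pattern before moving on, it can be run against the `content` string built in the earlier steps (a quick check, not part of the original flow):

```python
import re
import requests
from bs4 import BeautifulSoup

# Rebuild the stringified list of product_pod blocks from the steps above.
page = requests.get('https://books.toscrape.com/')
soup = BeautifulSoup(page.content, 'html.parser')
content = str(soup.find_all(class_='product_pod'))

re_titles = r'title="(.*?)">'
titles = re.findall(re_titles, content)
print(titles[:5])  # preview the first few extracted titles
```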
73 |
74 | ### Expression # 2
75 | ### Finding the pattern
76 |
77 | First, inspect the price of the book. Every price is present after the text `£` in the format `£=price` before the paragraph tag `