└── README.md


/README.md:
--------------------------------------------------------------------------------
  1 | I made a privacy-focused, Chromium-based web browser that intercepts every request a website makes while web scraping. I built it using Tor and PySide6 (the official Qt bindings for Python).
  2 | 
  3 | 
  4 | [![A simple Tor, Chrome browser built with Python](http://img.youtube.com/vi/auKpNM1g5aw/0.jpg)](http://www.youtube.com/watch?v=auKpNM1g5aw "A simple Tor, Chrome browser built with Python")
  5 | 
  6 | 
  7 | # Backstory
  8 | 
  9 | I've spent the last 5 months (Oct 2022 to Feb 2023) on a web scraping deep dive. I got to the point where I can scrape many public websites (not including the social media giants) using Selenium and headless Chrome.
 10 | 
 11 | One thing I wanted more control over was the network requests. I wanted to see if there was a lightweight way to block requests in the browser before they reach the server. I spent the next few days diving deep into the Chromium and Servo source code to understand how web browsers work.
 12 | 
 13 | To my surprise, I found there is a rather simple way to do this via Python. The next step is to turn this into a lightweight headless browser for distributed web scraping.
 14 | 
 15 | ## Code Snippet
 16 | 
 17 | ```python
 18 | from PySide6.QtCore import QUrl, Slot
 19 | from PySide6.QtGui import QIcon
 20 | from PySide6.QtWidgets import (QApplication, QLineEdit,
 21 |                                QMainWindow, QPushButton, QToolBar)
 22 | from PySide6.QtWebEngineCore import QWebEnginePage, QWebEngineUrlRequestInterceptor, QWebEngineProfile
 23 | from PySide6.QtWebEngineWidgets import QWebEngineView
 24 | from PySide6.QtNetwork import QNetworkProxy
 25 | 
 26 | import sys
 27 | import os
 28 | import stem.process
 29 | import re
 30 | import urllib.request
 31 | 
 32 | 
 33 | class RequestInterceptor(QWebEngineUrlRequestInterceptor):
 34 |     def interceptRequest(self, info):
 35 |         print("####### INTERCEPTING REQUEST #######")
 36 |         print(info.requestUrl())
 37 | 
 38 | class MainWindow(QMainWindow):
 39 | 
 40 |     def __init__(self):
 41 |         super().__init__()
 42 | 
 43 |         self.setWindowTitle('PySide6 WebEngineWidgets Example')
 44 | 
 45 |         self.toolBar = QToolBar()
 46 |         self.addToolBar(self.toolBar)
 47 |         self.backButton = QPushButton()
 48 |         self.backButton.setIcon(
 49 |             QIcon(':/qt-project.org/styles/commonstyle/images/left-32.png'))
 50 |         self.backButton.clicked.connect(self.back)
 51 |         self.toolBar.addWidget(self.backButton)
 52 |         self.forwardButton = QPushButton()
 53 |         self.forwardButton.setIcon(
 54 |             QIcon(':/qt-project.org/styles/commonstyle/images/right-32.png'))
 55 |         self.forwardButton.clicked.connect(self.forward)
 56 |         self.toolBar.addWidget(self.forwardButton)
 57 | 
 58 |         self.addressLineEdit = QLineEdit()
 59 |         self.addressLineEdit.returnPressed.connect(self.load)
 60 |         self.toolBar.addWidget(self.addressLineEdit)
 61 | 
 62 |         self.webEngineView = QWebEngineView()
 63 |         self.setCentralWidget(self.webEngineView)
 64 |         initialUrl = "http://ip-api.com/json"
 65 |         self.addressLineEdit.setText(initialUrl)
 66 |         self.webEngineView.load(QUrl(initialUrl))
 67 |         self.webEngineView.page().titleChanged.connect(self.setWindowTitle)
 68 |         self.webEngineView.page().urlChanged.connect(self.urlChanged)
 69 | 
 70 |     @Slot()
 71 |     def load(self):
 72 |         url = QUrl.fromUserInput(self.addressLineEdit.text())
 73 |         if url.isValid():
 74 |             self.webEngineView.load(url)
 75 | 
 76 |     @Slot()
 77 |     def back(self):
 78 |         self.webEngineView.page().triggerAction(QWebEnginePage.Back)
 79 | 
 80 |     @Slot()
 81 |     def forward(self):
 82 |         self.webEngineView.page().triggerAction(QWebEnginePage.Forward)
 83 | 
 84 |     @Slot(QUrl)
 85 |     def urlChanged(self, url):
 86 |         self.addressLineEdit.setText(url.toString())
 87 | 
 88 | 
 89 | def launch_tor_process():
 90 |     SOCKS_PORT = 9050
 91 |     CONTROL_PORT = 9051
 92 |     TOR_PATH = "/usr/local/bin/tor"
 93 |     GEOIPFILE_PATH = os.path.normpath(os.getcwd() + "/geoip")
 94 |     try:
 95 |         urllib.request.urlretrieve(
 96 |             'https://raw.githubusercontent.com/torproject/tor/main/src/config/geoip', GEOIPFILE_PATH)
 97 |     except Exception:
 98 |         print('[INFO] Unable to update geoip file. Using local copy.')
 99 | 
100 |     tor_process = stem.process.launch_tor_with_config(
101 |         config={
102 |             'SocksPort': str(SOCKS_PORT),
103 |             'ControlPort': str(CONTROL_PORT),
104 |             'ExitNodes': '',
105 |             'StrictNodes': '1',
106 |             'CookieAuthentication': '1',
107 |             'MaxCircuitDirtiness': '60',
108 |             'GeoIPFile': GEOIPFILE_PATH,
109 |         },
110 |         take_ownership=True,
111 |         init_msg_handler=lambda line: print(line) if re.search(
112 |             'Bootstrapped', line) else False,
113 |         tor_cmd=TOR_PATH
114 |     )
115 | 
116 | 
117 | if __name__ == '__main__':
118 |     # Launch a Tor process
119 |     launch_tor_process()
120 |     app = QApplication(sys.argv)
121 |     
122 |     # Proxy all browser requests through the Tor process
123 |     PROXY_PORT = 9050
124 |     PROXY_HOST = "127.0.0.1"
125 |     proxy = QNetworkProxy()
126 |     proxy.setType(QNetworkProxy.Socks5Proxy)
127 |     proxy.setHostName(PROXY_HOST)
128 |     proxy.setPort(PROXY_PORT)
129 |     QNetworkProxy.setApplicationProxy(proxy)
130 |     
131 |     # Add a request interceptor so we can read all the requests from the browser
132 |     interceptor = RequestInterceptor()
133 | 
134 |     mainWin = MainWindow()
135 |     mainWin.webEngineView.page().profile().setUrlRequestInterceptor(interceptor)
136 |     availableGeometry = mainWin.screen().availableGeometry()
137 |     mainWin.resize(availableGeometry.width() * 2 // 3,
138 |                    availableGeometry.height() * 2 // 3)
139 |     
140 |     # Launch the web browser
141 |     mainWin.show()
142 |     sys.exit(app.exec())
143 | ```
144 | 
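The snippet above only prints each request URL; it does not block anything yet. Below is a minimal sketch of how a blocking decision might be layered on top. The `BLOCKED_HOSTS` set and the `should_block` helper are hypothetical names introduced for illustration and are not part of the original code. The helper is plain Python so it can be tried without Qt installed; the comment shows where it could plug into `interceptRequest` using `QWebEngineUrlRequestInfo.block()`.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist for illustration; not part of the original snippet.
BLOCKED_HOSTS = {"doubleclick.net", "google-analytics.com"}


def should_block(url, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URL's host is a blocked host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


# Inside RequestInterceptor.interceptRequest(self, info), this could be used as:
#     if should_block(info.requestUrl().toString()):
#         info.block(True)  # tells QtWebEngine to drop the request

if __name__ == "__main__":
    for url in ("https://stats.google-analytics.com/collect",
                "http://ip-api.com/json"):
        print(url, "->", "block" if should_block(url) else "allow")
```

Matching on the parsed hostname rather than on the raw URL string avoids accidentally blocking a page that merely mentions a blocked domain in its path or query string.
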
145 | # About Me
146 | 
147 | I'm Steven, a software engineer in love with all things web scraping and distributed systems. I worked at Twitter as an SRE migrating 500,000 bare metal servers from Aurora Mesos to Kubernetes.
148 | 
149 | On Nov 3, 2022, I was laid off. Since then, I've spent my time doing deep dives and writing about my journey.
150 | 
151 | # Web Scraping Course
152 | 
153 | I'm creating a web scraping course so you don't have to spend months learning how to:
154 | 
155 | - analyze a website to determine the best way to scrape data
156 | - use proxies to scrape without getting blocked by Cloudflare, Datadome, or PerimeterX
157 | - scrape websites that rely on JavaScript
158 | - build your own web scraping framework
159 | - build scalable infrastructure for scraping
160 | 
161 | [Join the prelaunch to gain free access before it becomes a paid course.](https://stevennatera.gumroad.com/l/isfsd)
162 | 


--------------------------------------------------------------------------------