I made a privacy-focused Chromium web browser that intercepts all requests a website makes while web scraping. I built it using Tor and PySide6 (a Qt framework for Python).


[![A simple Tor, Chrome browser built with Python](http://img.youtube.com/vi/auKpNM1g5aw/0.jpg)](http://www.youtube.com/watch?v=auKpNM1g5aw "A simple Tor, Chrome browser built with Python")


# Backstory

I've spent the last 5 months (Oct 2022 to Feb 2023) on a web scraping deep dive. I got to the point where I can scrape many public websites (not including the social media giants) using Selenium and headless Chrome.

One thing I wanted more control over was the network requests. I wanted to see if there was a lightweight way to block requests in the browser before they reach the server. I spent the next few days diving deep into the Chromium and Servo source code to understand how web browsers work.

To my surprise, I found there is a rather simple way to do this via Python. The next step is to turn this into a lightweight headless browser to do some distributed web scraping.
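
As a rough illustration of the blocking idea described above, here is a minimal sketch of an interceptor that cancels requests to selected domains. The names `should_block` and `BlockingInterceptor` and the domain `example.com` are hypothetical and not part of this project; the only Qt calls assumed are `QWebEngineUrlRequestInfo.requestUrl()` and `QWebEngineUrlRequestInfo.block()`.

```python
try:
    from PySide6.QtWebEngineCore import QWebEngineUrlRequestInterceptor
except ImportError:
    # Fallback so the matching logic can be tried without Qt installed.
    QWebEngineUrlRequestInterceptor = object


def should_block(host, blocked_domains):
    """Return True if host is a blocked domain or one of its subdomains."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


class BlockingInterceptor(QWebEngineUrlRequestInterceptor):
    """Hypothetical interceptor that blocks requests to the given domains."""

    def __init__(self, blocked_domains):
        super().__init__()
        self.blocked_domains = set(blocked_domains)

    def interceptRequest(self, info):
        host = info.requestUrl().host()
        if should_block(host, self.blocked_domains):
            # block(True) cancels the request before it leaves the browser.
            info.block(True)
```

An instance such as `BlockingInterceptor({"example.com"})` would be attached to the page profile with `setUrlRequestInterceptor`, in the same place where the logging `RequestInterceptor` is attached in the code snippet.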

## Code Snippet

```python
from PySide6.QtCore import QUrl, Slot
from PySide6.QtGui import QIcon
from PySide6.QtWidgets import (QApplication, QLineEdit,
                               QMainWindow, QPushButton, QToolBar)
from PySide6.QtWebEngineCore import QWebEnginePage, QWebEngineUrlRequestInterceptor
from PySide6.QtWebEngineWidgets import QWebEngineView
from PySide6.QtNetwork import QNetworkProxy

import sys
import os
import stem.process
import re
import urllib.request


class RequestInterceptor(QWebEngineUrlRequestInterceptor):
    """Logs every request the browser is about to send."""

    def interceptRequest(self, info):
        print("####### INTERCEPTING REQUEST #######")
        print(info.requestUrl())


class MainWindow(QMainWindow):

    def __init__(self):
        super().__init__()

        self.setWindowTitle('PySide6 WebEngineWidgets Example')

        # Toolbar with back/forward buttons and an address bar
        self.toolBar = QToolBar()
        self.addToolBar(self.toolBar)
        self.backButton = QPushButton()
        self.backButton.setIcon(
            QIcon(':/qt-project.org/styles/commonstyle/images/left-32.png'))
        self.backButton.clicked.connect(self.back)
        self.toolBar.addWidget(self.backButton)
        self.forwardButton = QPushButton()
        self.forwardButton.setIcon(
            QIcon(':/qt-project.org/styles/commonstyle/images/right-32.png'))
        self.forwardButton.clicked.connect(self.forward)
        self.toolBar.addWidget(self.forwardButton)

        self.addressLineEdit = QLineEdit()
        self.addressLineEdit.returnPressed.connect(self.load)
        self.toolBar.addWidget(self.addressLineEdit)

        # The web view loads an IP lookup page so the Tor exit IP is visible
        self.webEngineView = QWebEngineView()
        self.setCentralWidget(self.webEngineView)
        initialUrl = "http://ip-api.com/json"
        self.addressLineEdit.setText(initialUrl)
        self.webEngineView.load(QUrl(initialUrl))
        self.webEngineView.page().titleChanged.connect(self.setWindowTitle)
        self.webEngineView.page().urlChanged.connect(self.urlChanged)

    @Slot()
    def load(self):
        url = QUrl.fromUserInput(self.addressLineEdit.text())
        if url.isValid():
            self.webEngineView.load(url)

    @Slot()
    def back(self):
        self.webEngineView.page().triggerAction(QWebEnginePage.Back)

    @Slot()
    def forward(self):
        self.webEngineView.page().triggerAction(QWebEnginePage.Forward)

    @Slot(QUrl)
    def urlChanged(self, url):
        self.addressLineEdit.setText(url.toString())


def launch_tor_process():
    SOCKS_PORT = 9050
    CONTROL_PORT = 9051
    TOR_PATH = "/usr/local/bin/tor"
    GEOIPFILE_PATH = os.path.normpath(os.getcwd() + "/geoip")
    try:
        # Refresh the geoip file; fall back to the local copy on failure
        urllib.request.urlretrieve(
            'https://raw.githubusercontent.com/torproject/tor/main/src/config/geoip', GEOIPFILE_PATH)
    except Exception:
        print('[INFO] Unable to update geoip file. Using local copy.')

    tor_process = stem.process.launch_tor_with_config(
        config={
            'SocksPort': str(SOCKS_PORT),
            'ControlPort': str(CONTROL_PORT),
            'ExitNodes': '',
            'StrictNodes': '1',
            'CookieAuthentication': '1',
            'MaxCircuitDirtiness': '60',
            'GeoIPFile': GEOIPFILE_PATH,
        },
        take_ownership=True,
        # Only print Tor's bootstrap progress lines
        init_msg_handler=lambda line: print(line) if re.search(
            'Bootstrapped', line) else False,
        tor_cmd=TOR_PATH
    )
    return tor_process


if __name__ == '__main__':
    # Launch a Tor process
    tor_process = launch_tor_process()
    app = QApplication(sys.argv)

    # Proxy all browser requests through the Tor process
    PROXY_PORT = 9050
    PROXY_HOST = "127.0.0.1"
    proxy = QNetworkProxy()
    proxy.setType(QNetworkProxy.Socks5Proxy)
    proxy.setHostName(PROXY_HOST)
    proxy.setPort(PROXY_PORT)
    QNetworkProxy.setApplicationProxy(proxy)

    # Add a request interceptor so we can read all the requests from the browser
    interceptor = RequestInterceptor()

    mainWin = MainWindow()
    mainWin.webEngineView.page().profile().setUrlRequestInterceptor(interceptor)
    availableGeometry = mainWin.screen().availableGeometry()
    # Integer division keeps the window size arguments as ints
    mainWin.resize(availableGeometry.width() * 2 // 3,
                   availableGeometry.height() * 2 // 3)

    # Launch the web browser
    mainWin.show()
    sys.exit(app.exec())
```

# About Me

I'm Steven, a software engineer in love with all things web scraping and distributed systems. I worked at Twitter as an SRE migrating 500,000 bare metal servers from Aurora Mesos to Kubernetes.

On Nov 3, 2022 I was laid off. Since then I've spent my time doing deep dives and writing about my journey.

# Web Scraping Course

I'm creating a web scraping course so you don't have to spend months learning how to:

- analyze a website to determine the best way to scrape data
- use proxies to scrape without getting blocked by Cloudflare, Datadome, or PerimeterX
- scrape websites that rely on JavaScript
- build your own web scraping framework
- build scalable infrastructure for scraping

[Join the prelaunch to gain free access before it becomes a paid course.](https://stevennatera.gumroad.com/l/isfsd)