├── .gitignore
├── README.md
├── package-lock.json
├── requirements.txt
├── site
│   ├── .gitignore
│   ├── README.md
│   ├── package-lock.json
│   ├── package.json
│   ├── public
│   │   ├── favicon.ico
│   │   ├── images
│   │   │   ├── testimage01.png
│   │   │   ├── testimage02.png
│   │   │   ├── testimage03.png
│   │   │   └── testimage04.png
│   │   ├── index.html
│   │   ├── manifest.json
│   │   └── robots.txt
│   ├── site_preview.png
│   ├── src
│   │   ├── App.css
│   │   ├── App.tsx
│   │   ├── index.css
│   │   ├── index.tsx
│   │   ├── react-app-env.d.ts
│   │   ├── reportWebVitals.ts
│   │   └── setupTests.ts
│   └── tsconfig.json
└── src
    ├── adv-clickdiv-collector.py
    ├── base
    │   ├── __init__.py
    │   ├── cmdline.py
    │   ├── driver.py
    │   └── utils.py
    ├── simple-card-collector.py
    └── simple-image-collector.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Virtual Python environment.
2 | venv/
3 | 
4 | # General Python files
5 | __pycache__/
6 | *.pyc
7 | *.pyo
8 | *.pyd
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | This repository shows how to use [Selenium](https://www.selenium.dev/) paired with [Beautiful Soup (V4)](https://pypi.org/project/beautifulsoup4/) in Python (3+) to parse and extract data from websites. I've included examples of interacting with pages as well (e.g. button clicks to open menus and then extract more hidden data). I also plan on making blog articles under [Deaconn](https://deaconn.net/) using these examples in the future!
2 | 
3 | These tools are commonly used for web browser automation, web scraping, and development tests. Additionally, you can use the combination of these tools in other projects such as creating a follow bot (use at your own risk, obviously)!
4 | 
5 | ## What Is Selenium & Beautiful Soup?
6 | **Selenium** is a powerful tool for controlling web browsers through programs and performing browser automation tasks.
A driver is included for most web browsers, and a wide range of programming languages are supported!
7 | 
8 | **Beautiful Soup** is a Python library for pulling data out of HTML and XML files. It parses anything you give it, and does the tree traversal stuff for you!
9 | 
10 | ## Requirements & Setup
11 | I've created and tested the programs in this repository on a Debian 12 virtual machine running on one of my [home servers](https://github.com/gamemann/Home-Lab?tab=readme-ov-file#two-powerball). While I don't have specific instructions for setting up this repository on non-Debian/Ubuntu-based systems, there shouldn't be many changes you need to make to the instructions below. In fact, it may be easier since you may not have to worry about your OS's package manager handling the Python installation.
12 | 
13 | ### Debian/Ubuntu-Based Systems
14 | Debian/Ubuntu-based systems typically use the `apt` package manager to manage the server's Python installation and its libraries. This is fine in most cases, but some packages aren't available through `apt`; when you try to install such a package using the `pip` or `pip3` commands, you'll receive an error like the one below.
15 | 
16 | ```bash
17 | error: externally-managed-environment
18 | 
19 | × This environment is externally managed
20 | ╰─> To install Python packages system-wide, try apt install
21 |     python3-xyz, where xyz is the package you are trying to
22 |     install.
23 | 
24 |     If you wish to install a non-Debian-packaged Python package,
25 |     create a virtual environment using python3 -m venv path/to/venv.
26 |     Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
27 |     sure you have python3-full installed.
28 | 
29 |     If you wish to install a non-Debian packaged Python application,
30 |     it may be easiest to use pipx install xyz, which will manage a
31 |     virtual environment for you. Make sure you have pipx installed.
32 | 
33 |     See /usr/share/doc/python3.11/README.venv for more information.
34 | 
35 | note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
36 | hint: See PEP 668 for the detailed specification.
37 | ```
38 | 
39 | You could pass the `--break-system-packages` flag to the `pip` or `pip3` commands, but as stated in the error, this risks breaking packages in your global Python installation. A safer solution is to use a virtual Python environment, which is detailed below.
40 | 
41 | If you do want to use `apt` to manage the packages, you can install Selenium and BeautifulSoup4 using the command below.
42 | 
43 | ```bash
44 | sudo apt install -y python3-bs4 python3-selenium
45 | ```
46 | 
47 | ### Virtual Python Environments
48 | I personally recommend creating a virtual Python environment so that you don't risk breaking your Python installation when you need to install a package that isn't available through the `apt` package manager. Creating one is easy; in our case, we can use the command below.
49 | 
50 | ```bash
51 | python3 -m venv venv/
52 | ```
53 | 
54 | This will create a `venv/` directory in your current working directory. Afterwards, source `venv/bin/activate`; you will then be able to use the `pip` or `pip3` commands to install the required packages.
55 | 
56 | ```bash
57 | source venv/bin/activate
58 | ```
59 | 
60 | I've also included a `requirements.txt` file which allows you to easily install the required packages using the `pip` or `pip3` commands. You may use the command below.
61 | 
62 | ```bash
63 | pip3 install -r requirements.txt
64 | ```
65 | 
66 | **Note** - The `requirements.txt` file includes `beautifulsoup4` (version `4.12.2`) and `selenium` (version `4.16.0`). There may be updates available to these packages, but these are the versions I've made this repository with.
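If you're ever unsure whether the interpreter you're invoking is actually the venv's copy (rather than the system Python), a quick standard-library check is to compare `sys.prefix` against `sys.base_prefix`. This is just a minimal sketch; the function name is my own, not part of this repository.

```python
import sys

def in_virtualenv() -> bool:
    # Inside an activated venv, sys.prefix points at the venv directory,
    # while sys.base_prefix still points at the base Python installation.
    # Outside a venv, the two are the same path.
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```

If this prints `False`, the shell is still using the system interpreter, and `pip3 install -r requirements.txt` would target the global site-packages.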
67 | 
68 | ### Firefox & Geckodriver
69 | In this repository, we use Selenium's Firefox driver paired with [geckodriver](https://github.com/mozilla/geckodriver). I'd recommend heading to the [releases page](https://github.com/mozilla/geckodriver/releases) and downloading the latest release. Otherwise, you can use the version I've tested below.
70 | 
71 | ```bash
72 | # Download version '0.34.0' for Linux 64-bit.
73 | wget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz
74 | 
75 | # Uncompress and extract the file using the 'tar' command.
76 | tar -xzvf geckodriver-v0.34.0-linux64.tar.gz
77 | 
78 | # Move to '/usr/bin' using sudo/root.
79 | sudo mv geckodriver /usr/bin
80 | ```
81 | 
82 | You'll also want to install Firefox. You can do so using `apt` below.
83 | 
84 | ```bash
85 | sudo apt install -y firefox-esr
86 | ```
87 | 
88 | ## Website Setup & Running
89 | The website we've made to test the Python programs utilizes [React](https://react.dev/) and [Node.js](https://nodejs.org/en). The website's source code is located in the [`site/`](./site) directory.
90 | 
91 | ### Requirements
92 | You will need to install **Node.js** and **NPM** onto your system. You can read [this guide](https://nodejs.org/en/download/package-manager/) on how to install these packages using a package manager, or use the following command to install them via `apt`. Note that the versions in the standard `apt` repositories are fairly old (stable), but they should work for the website in this repository.
93 | 
94 | ```bash
95 | sudo apt install -y nodejs npm
96 | ```
97 | 
98 | ### Installing Packages
99 | After installing Node.js and NPM, change your directory to the website using `cd site/` and run the following to install the needed packages via NPM.
100 | 
101 | ```bash
102 | npm install
103 | ```
104 | 
105 | Afterwards, you can run the following command to start the web development server.
106 | 
107 | ```bash
108 | npm start
109 | ```
110 | 
111 | By default, the website should be listening at [http://localhost:3000](http://localhost:3000). However, if you want to change the bind IP or port, you can set the `HOST` and `PORT` environment variables. Here's an example.
112 | 
113 | ```bash
114 | HOST=0.0.0.0 PORT=3001 npm start
115 | ```
116 | 
117 | If you use a different host or port, please make sure to specify it on the Python program's command line. Read **Command Line Usage** for more information.
118 | 
119 | ## Command Line Usage
120 | Each Python program utilizes [`src/base/cmdline.py`](./src/base/cmdline.py) to parse the command line arguments. Arguments are listed below. (There is also a `-l --list` flag that prints the parsed command line values and exits.)
121 | 
122 | * `-b --binary` - The path to the Geckodriver binary file (default => `/usr/bin/geckodriver`).
123 | * `-s --site` - The full URL of the website to parse and extract information from (default => `http://localhost:3000`).
124 | * `-u --ua` - The web browser's user agent to use when sending requests (default => `Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0`).
125 | 
126 | ## Programs
127 | All Python programs are located in the [`src/`](./src) directory. You may execute them using the following command, replacing `<program>` with the program's file name. Please make sure you have the website started in another terminal!
128 | 
129 | ```bash
130 | python3 src/<program>.py
131 | ```
132 | 
133 | Here's a list of programs we've made so far!
134 | 
135 | ### [`simple-image-collector.py`](./src/simple-image-collector.py)
136 | This Python program parses our website and extracts all image sources inside of elements with the class name `image-row`.
137 | 
138 | The expected output is the following.
139 | 
140 | ```bash
141 | $ python3 src/simple-image-collector.py
142 | Starting simple-image-collector...
143 | Parsing arguments...
144 | Setting up Selenium driver...
145 | Parsing website 'http://localhost:3000'...
146 | Found the following image URLs.
147 |     - /images/testimage01.png
148 |     - /images/testimage02.png
149 |     - /images/testimage03.png
150 |     - /images/testimage04.png
151 | Exiting...
152 | ```
153 | 
154 | ### [`simple-card-collector.py`](./src/simple-card-collector.py)
155 | This Python program parses our website and extracts the title and description of all elements with the class name `card-row`. The title is found inside of the `<h2>` tag while the description is found inside of the `<p>` tag inside the card row element.
156 | 
157 | The expected output is the following.
158 | 
159 | ```bash
160 | $ python3 src/simple-card-collector.py
161 | Starting simple-card-collector...
162 | Parsing arguments...
163 | Setting up Selenium driver...
164 | Parsing website 'http://localhost:3000'...
165 | Found the following cards.
166 |     Card #1
167 |         Title => Card Title #1
168 |         Description => This is the description of card #1!
169 |     Card #2
170 |         Title => Card Title #2
171 |         Description => This is the description of card #2!
172 |     Card #3
173 |         Title => Card Title #3
174 |         Description => This is the description of card #3!
175 | Exiting...
176 | ```
177 | 
178 | ### [`adv-clickdiv-collector.py`](./src/adv-clickdiv-collector.py)
179 | This Python program parses our website, clicks all the dividers with the class name `clickDiv-row`, and then extracts each divider's title and hidden content. This is a more advanced example since it interacts with the page (clicking elements in the browser) before extracting data.
180 | 
181 | The expected output is the following.
182 | 
183 | ```bash
184 | $ python3 src/adv-clickdiv-collector.py
185 | Starting adv-clickdiv-collector...
186 | Parsing arguments...
187 | Setting up Selenium driver...
188 | Parsing website 'http://localhost:3000'...
189 | Found the following clickable dividers.
190 |     ClickDiv #1
191 |         Title => Clickable Div #1
192 |         Description => These are the hidden contents of clickable div #1!
193 |     ClickDiv #2
194 |         Title => Clickable Div #2
195 |         Description => These are the hidden contents of clickable div #2!
196 |     ClickDiv #3
197 |         Title => Clickable Div #3
198 |         Description => These are the hidden contents of clickable div #3!
199 |     ClickDiv #4
200 |         Title => Clickable Div #4
201 |         Description => These are the hidden contents of clickable div #4!
202 | Exiting...
203 | ```
204 | 
205 | ## Credits
206 | * [Christian Deacon](https://github.com/gamemann)
--------------------------------------------------------------------------------
/package-lock.json:
--------------------------------------------------------------------------------
1 | {
2 |   "name": "selenium-and-beautifulsoup4",
3 |   "lockfileVersion": 3,
4 |   "requires": true,
5 |   "packages": {}
6 | }
7 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | beautifulsoup4==4.12.2
2 | selenium==4.16.0
--------------------------------------------------------------------------------
/site/.gitignore:
--------------------------------------------------------------------------------
 1 | # See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
 2 | 
 3 | # dependencies
 4 | /node_modules
 5 | /.pnp
 6 | .pnp.js
 7 | 
 8 | # testing
 9 | /coverage
10 | 
11 | # production
12 | /build
13 | 
14 | # misc
15 | .DS_Store
16 | .env.local
17 | .env.development.local
18 | .env.test.local
19 | .env.production.local
20 | 
21 | npm-debug.log*
22 | yarn-debug.log*
23 | yarn-error.log*
--------------------------------------------------------------------------------
/site/README.md:
--------------------------------------------------------------------------------
1 | This is the website we've made for parsing and extracting data using Selenium and Beautiful Soup. The website utilizes React!
2 | 
3 | Please refer to the main [`README`](../README.md) for more information on how to set up this website.
4 | 
5 | The main website's source code may be found at [`src/App.tsx`](./src/App.tsx). The website's CSS file may be found at [`src/App.css`](./src/App.css).
6 | 
7 | ## Website Content
8 | ### Image Rows
9 | Image rows are inside of a `flex` divider with a gap of `2rem`. Each image has a width of `180px` and a height of `120px`.
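These image rows are what `simple-image-collector.py` (in the main repository) targets. As a rough, dependency-free sketch of the same extraction idea using only the standard library's `html.parser` (the real program drives Firefox through Selenium and parses with BeautifulSoup; the class name `image-row` comes from this site, everything else here is illustrative):

```python
from html.parser import HTMLParser

class ImageRowParser(HTMLParser):
    """Collects the 'src' of <img> tags nested inside '.image-row' divs."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[bool] = []  # One entry per open div; True if it's an '.image-row'.
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)

        if tag == "div":
            # Remember whether this div carries the 'image-row' class.
            self.stack.append("image-row" in (attrs.get("class") or "").split())
        elif tag == "img" and any(self.stack) and "src" in attrs:
            # Only collect images inside an open '.image-row' div.
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

html = """
<div id="images"><div>
    <div class="image-row"><img src="/images/testimage01.png"></div>
    <div class="image-row"><img src="/images/testimage02.png"></div>
</div></div>
"""

parser = ImageRowParser()
parser.feed(html)
print(parser.sources)  # ['/images/testimage01.png', '/images/testimage02.png']
```

BeautifulSoup replaces all of this bookkeeping with a single `soup.find_all(class_="image-row")` call, which is why the actual collectors use it.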
10 | 
11 | ### Card Rows
12 | Card rows are inside of a `flex` divider with a gap of `2rem`. Each card row contains a title and description. The title is inside of a `<h2>` tag while the description is inside of a `<p>` tag.
13 | 
14 | ### Clickable Dividers
15 | Clickable dividers are inside of a `flex` divider with a direction of `column` and a gap of `4rem`. The title is inside of a `<h2>` tag and the hidden contents are inside of a `<div>` tag that is hidden until the clickable divider is clicked.
16 | 
17 | ## Preview
18 | ![Preview Image](./site_preview.png)
--------------------------------------------------------------------------------
/site/package.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "name": "testsite01",
 3 |   "version": "0.1.0",
 4 |   "private": true,
 5 |   "dependencies": {
 6 |     "@testing-library/jest-dom": "^5.17.0",
 7 |     "@testing-library/react": "^13.4.0",
 8 |     "@testing-library/user-event": "^13.5.0",
 9 |     "@types/jest": "^27.5.2",
10 |     "@types/node": "^16.18.70",
11 |     "@types/react": "^18.2.47",
12 |     "@types/react-dom": "^18.2.18",
13 |     "react": "^18.2.0",
14 |     "react-dom": "^18.2.0",
15 |     "react-scripts": "5.0.1",
16 |     "typescript": "^4.9.5",
17 |     "web-vitals": "^2.1.4"
18 |   },
19 |   "scripts": {
20 |     "start": "react-scripts start",
21 |     "build": "react-scripts build",
22 |     "test": "react-scripts test",
23 |     "eject": "react-scripts eject"
24 |   },
25 |   "eslintConfig": {
26 |     "extends": [
27 |       "react-app",
28 |       "react-app/jest"
29 |     ]
30 |   },
31 |   "browserslist": {
32 |     "production": [
33 |       ">0.2%",
34 |       "not dead",
35 |       "not op_mini all"
36 |     ],
37 |     "development": [
38 |       "last 1 chrome version",
39 |       "last 1 firefox version",
40 |       "last 1 safari version"
41 |     ]
42 |   }
43 | }
44 | 
--------------------------------------------------------------------------------
/site/public/favicon.ico:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gamemann/How-To-Use-Selenium-And-BeautifulSoup/9346d467764d1c855e7d519125e30c80a2736ce9/site/public/favicon.ico
--------------------------------------------------------------------------------
/site/public/images/testimage01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gamemann/How-To-Use-Selenium-And-BeautifulSoup/9346d467764d1c855e7d519125e30c80a2736ce9/site/public/images/testimage01.png
--------------------------------------------------------------------------------
/site/public/images/testimage02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gamemann/How-To-Use-Selenium-And-BeautifulSoup/9346d467764d1c855e7d519125e30c80a2736ce9/site/public/images/testimage02.png
--------------------------------------------------------------------------------
/site/public/images/testimage03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gamemann/How-To-Use-Selenium-And-BeautifulSoup/9346d467764d1c855e7d519125e30c80a2736ce9/site/public/images/testimage03.png
--------------------------------------------------------------------------------
/site/public/images/testimage04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gamemann/How-To-Use-Selenium-And-BeautifulSoup/9346d467764d1c855e7d519125e30c80a2736ce9/site/public/images/testimage04.png
--------------------------------------------------------------------------------
/site/public/index.html:
--------------------------------------------------------------------------------
 1 | <!DOCTYPE html>
 2 | <html lang="en">
 3 |   <head>
 4 |     <meta charset="utf-8" />
 5 |     <link rel="icon" href="%PUBLIC_URL%/favicon.ico" />
 6 |     <meta name="viewport" content="width=device-width, initial-scale=1" />
 7 |     <meta name="theme-color" content="#000000" />
 8 |     <meta
 9 |       name="description"
10 |       content="Web site created using create-react-app"
11 |     />
12 |     <link rel="apple-touch-icon" href="%PUBLIC_URL%/logo192.png" />
13 |     <!--
14 |       manifest.json provides metadata used when your web app is installed on a
15 |       user's mobile device or desktop. See https://developers.google.com/web/fundamentals/web-app-manifest/
16 |     -->
17 |     <link rel="manifest" href="%PUBLIC_URL%/manifest.json" />
18 |     <!--
19 |       Notice the use of %PUBLIC_URL% in the tags above.
20 |       It will be replaced with the URL of the `public` folder during the build.
21 |       Only files inside the `public` folder can be referenced from the HTML.
22 | 
23 |       Unlike "/favicon.ico" or "favicon.ico", "%PUBLIC_URL%/favicon.ico" will
24 |       work correctly both with client-side routing and a non-root public URL.
25 |       Learn how to configure a non-root public URL by running `npm run build`.
26 |     -->
27 |     <title>React App</title>
28 |   </head>
29 |   <body>
30 |     <noscript>You need to enable JavaScript to run this app.</noscript>
31 |     <div id="root"></div>
32 |     <!--
33 |       This HTML file is a template.
34 |       If you open it directly in the browser, you will see an empty page.
35 | 
36 |       You can add webfonts, meta tags, or analytics to this file.
37 |       The build step will place the bundled scripts into the <body> tag.
38 | 
39 |       To begin the development of your app, run `npm start` or `yarn start`.
40 |       To create a production bundle, use `npm run build` or `yarn build`.
41 |     -->
42 |   </body>
43 | </html>
44 | 
--------------------------------------------------------------------------------
/site/public/manifest.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "short_name": "React App",
 3 |   "name": "Create React App Sample",
 4 |   "icons": [
 5 |     {
 6 |       "src": "favicon.ico",
 7 |       "sizes": "64x64 32x32 24x24 16x16",
 8 |       "type": "image/x-icon"
 9 |     },
10 |     {
11 |       "src": "logo192.png",
12 |       "type": "image/png",
13 |       "sizes": "192x192"
14 |     },
15 |     {
16 |       "src": "logo512.png",
17 |       "type": "image/png",
18 |       "sizes": "512x512"
19 |     }
20 |   ],
21 |   "start_url": ".",
22 |   "display": "standalone",
23 |   "theme_color": "#000000",
24 |   "background_color": "#ffffff"
25 | }
26 | 
--------------------------------------------------------------------------------
/site/public/robots.txt:
--------------------------------------------------------------------------------
1 | # https://www.robotstxt.org/robotstxt.html
2 | User-agent: *
3 | Disallow:
4 | 
--------------------------------------------------------------------------------
/site/site_preview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gamemann/How-To-Use-Selenium-And-BeautifulSoup/9346d467764d1c855e7d519125e30c80a2736ce9/site/site_preview.png
--------------------------------------------------------------------------------
/site/src/App.css:
--------------------------------------------------------------------------------
 1 | h1 {
 2 |     text-align: center
 3 | }
 4 | 
 5 | /* Images */
 6 | #images {
 7 |     padding: 12px;
 8 | }
 9 | 
10 | #images > div {
11 |     display: flex;
12 |     flex-wrap: wrap;
13 |     gap: 2rem;
14 |     justify-content: center;
15 | }
16 | 
17 | .image-row img {
18 |     width: 180px;
19 |     height: 120px;
20 | }
21 | 
22 | /* Cards */
23 | #cards {
24 |     padding: 12px;
25 | }
26 | 
27 | #cards > div {
28 |     display: flex;
29 |     flex-wrap: wrap;
30 |     gap: 2rem;
31 |     justify-content: center;
32 | }
33 | 
34 | 
.card-row {
35 |     padding: 12px;
36 |     background-color: #292d33;
37 |     border-radius: 5px;
38 |     color: #FFFFFF;
39 | }
40 | 
41 | /* Clickable Dividers */
42 | #clickDivs {
43 |     padding: 12px;
44 | }
45 | 
46 | #clickDivs > div {
47 |     display: flex;
48 |     flex-direction: column;
49 |     gap: 4rem;
50 | }
51 | 
52 | .clickDiv-row {
53 |     background-color: #292d33;
54 |     padding: 12px;
55 |     width: 100%;
56 |     color: #FFFFFF;
57 |     cursor: pointer;
58 | }
--------------------------------------------------------------------------------
/site/src/App.tsx:
--------------------------------------------------------------------------------
  1 | import { useState } from 'react';
  2 | import './App.css';
  3 | 
  4 | // Images
  5 | const images = [
  6 |     "/images/testimage01.png",
  7 |     "/images/testimage02.png",
  8 |     "/images/testimage03.png",
  9 |     "/images/testimage04.png"
 10 | ]
 11 | 
 12 | // Cards
 13 | type Card = {
 14 |     title: string
 15 |     description: JSX.Element
 16 | }
 17 | 
 18 | const cards: Card[] = [
 19 |     {
 20 |         title: "Card Title #1",
 21 |         description: <>This is the description of card #1!</>
 22 |     },
 23 |     {
 24 |         title: "Card Title #2",
 25 |         description: <>This is the description of card #2!</>
 26 |     },
 27 |     {
 28 |         title: "Card Title #3",
 29 |         description: <>This is the description of card #3!</>
 30 |     }
 31 | ]
 32 | 
 33 | // Clickable Dividers.
 34 | type ClickDiv = {
 35 |     title: string
 36 |     contents: JSX.Element
 37 | }
 38 | 
 39 | const clickDivs: ClickDiv[] = [
 40 |     {
 41 |         title: "Clickable Div #1",
 42 |         contents: <>These are the hidden contents of clickable div #1!</>
 43 |     },
 44 |     {
 45 |         title: "Clickable Div #2",
 46 |         contents: <>These are the hidden contents of clickable div #2!</>
 47 |     },
 48 |     {
 49 |         title: "Clickable Div #3",
 50 |         contents: <>These are the hidden contents of clickable div #3!</>
 51 |     },
 52 |     {
 53 |         title: "Clickable Div #4",
 54 |         contents: <>These are the hidden contents of clickable div #4!</>
 55 |     }
 56 | ]
 57 | 
 58 | function App() {
 59 |     return (
 60 |         <>
 61 |             <div id="images">
 62 |                 <h1>Images</h1>
 63 |                 <div>
 64 |                     {images.map((img, index) => {
 65 |                         return (
 66 |                             <Image
 67 |                                 url={img}
 68 |                                 index={index}
 69 |                                 key={`image-${index}`}
 70 |                             />
 71 |                         )
 72 |                     })}
 73 |                 </div>
 74 |             </div>
 75 |             <div id="cards">
 76 |                 <h1>Cards</h1>
 77 |                 <div>
 78 |                     {cards.map((card, index) => {
 79 |                         return (
 80 |                             <Card
 81 |                                 card={card}
 82 |                                 key={`card-${index}`}
 83 |                             />
 84 |                         )
 85 |                     })}
 86 |                 </div>
 87 |             </div>
 88 |             <div id="clickDivs">
 89 |                 <h1>Clickable Divs</h1>
 90 |                 <div>
 91 |                     {clickDivs.map((clickDiv, index) => {
 92 |                         return (
 93 |                             <ClickDiv
 94 |                                 clickDiv={clickDiv}
 95 |                                 key={`clickdiv-${index}`}
 96 |                             />
 97 |                         )
 98 |                     })}
 99 |                 </div>
100 |             </div>
101 |         </>
102 |     );
103 | }
104 | 
105 | function Image ({
106 |     url,
107 |     index
108 | } : {
109 |     url: string
110 |     index: number
111 | }) {
112 |     return (
113 |         <div className="image-row">
114 |             <img
115 |                 src={url}
116 |                 alt={`Alt text for image #${index + 1}`}
117 |             />
118 |         </div>
119 |     )
120 | }
121 | 
122 | function Card ({
123 |     card
124 | } : {
125 |     card: Card
126 | }) {
127 |     return (
128 |         <div className="card-row">
129 |             <h2>{card.title}</h2>
130 |             <p>{card.description}</p>
131 |         </div>
132 |     )
133 | }
134 | 
135 | function ClickDiv ({
136 |     clickDiv
137 | } : {
138 |     clickDiv: ClickDiv
139 | }) {
140 |     // Controls state of showing/hiding contents.
141 |     const [visible, setVisible] = useState(false);
142 | 
143 |     return (
144 |         <div
145 |             className="clickDiv-row"
146 |             onClick={() => setVisible(!visible)}
147 |         >
148 |             <h2>{clickDiv.title}</h2>
149 |             {visible && (
150 |                 <div>
151 |                     {clickDiv.contents}
152 |                 </div>
153 |             )}
154 |         </div>
155 |     )
156 | }
157 | 
158 | export default App;
159 | 
--------------------------------------------------------------------------------
/site/src/index.css:
--------------------------------------------------------------------------------
 1 | body {
 2 |   margin: 0;
 3 |   font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen',
 4 |     'Ubuntu', 'Cantarell', 'Fira Sans', 'Droid Sans', 'Helvetica Neue',
 5 |     sans-serif;
 6 |   -webkit-font-smoothing: antialiased;
 7 |   -moz-osx-font-smoothing: grayscale;
 8 | }
 9 | 
10 | code {
11 |   font-family: source-code-pro, Menlo, Monaco, Consolas, 'Courier New',
12 |     monospace;
13 | }
14 | 
--------------------------------------------------------------------------------
/site/src/index.tsx:
--------------------------------------------------------------------------------
 1 | import React from 'react';
 2 | import ReactDOM from 'react-dom/client';
 3 | import './index.css';
 4 | import App from './App';
 5 | import reportWebVitals from './reportWebVitals';
 6 | 
 7 | const root = ReactDOM.createRoot(
 8 |   document.getElementById('root') as HTMLElement
 9 | );
10 | root.render(
11 |   <React.StrictMode>
12 |     <App />
13 |   </React.StrictMode>
14 | );
15 | 
16 | // If you want to start measuring performance in your app, pass a function
17 | // to log results (for example: reportWebVitals(console.log))
18 | // or send to an analytics endpoint.
Learn more: https://bit.ly/CRA-vitals
19 | reportWebVitals();
20 | 
--------------------------------------------------------------------------------
/site/src/react-app-env.d.ts:
--------------------------------------------------------------------------------
1 | /// <reference types="react-scripts" />
2 | 
--------------------------------------------------------------------------------
/site/src/reportWebVitals.ts:
--------------------------------------------------------------------------------
 1 | import { ReportHandler } from 'web-vitals';
 2 | 
 3 | const reportWebVitals = (onPerfEntry?: ReportHandler) => {
 4 |   if (onPerfEntry && onPerfEntry instanceof Function) {
 5 |     import('web-vitals').then(({ getCLS, getFID, getFCP, getLCP, getTTFB }) => {
 6 |       getCLS(onPerfEntry);
 7 |       getFID(onPerfEntry);
 8 |       getFCP(onPerfEntry);
 9 |       getLCP(onPerfEntry);
10 |       getTTFB(onPerfEntry);
11 |     });
12 |   }
13 | };
14 | 
15 | export default reportWebVitals;
16 | 
--------------------------------------------------------------------------------
/site/src/setupTests.ts:
--------------------------------------------------------------------------------
1 | // jest-dom adds custom jest matchers for asserting on DOM nodes.
2 | // allows you to do things like: 3 | // expect(element).toHaveTextContent(/react/i) 4 | // learn more: https://github.com/testing-library/jest-dom 5 | import '@testing-library/jest-dom'; 6 | -------------------------------------------------------------------------------- /site/tsconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "compilerOptions": { 3 | "target": "es5", 4 | "lib": [ 5 | "dom", 6 | "dom.iterable", 7 | "esnext" 8 | ], 9 | "allowJs": true, 10 | "skipLibCheck": true, 11 | "esModuleInterop": true, 12 | "allowSyntheticDefaultImports": true, 13 | "strict": true, 14 | "forceConsistentCasingInFileNames": true, 15 | "noFallthroughCasesInSwitch": true, 16 | "module": "esnext", 17 | "moduleResolution": "node", 18 | "resolveJsonModule": true, 19 | "isolatedModules": true, 20 | "noEmit": true, 21 | "jsx": "react-jsx" 22 | }, 23 | "include": [ 24 | "src" 25 | ] 26 | } 27 | -------------------------------------------------------------------------------- /src/adv-clickdiv-collector.py: -------------------------------------------------------------------------------- 1 | from base.cmdline import ParseCmdLine, PrintCmdLine 2 | from base.driver import SetupDriver 3 | from base.utils import ExitProgram 4 | 5 | from selenium.webdriver.support.ui import WebDriverWait 6 | from selenium.webdriver.support import expected_conditions as EC 7 | from selenium.webdriver.common.by import By 8 | 9 | from bs4 import BeautifulSoup 10 | 11 | def main(): 12 | 13 | print("Starting adv-clickdiv-collector...") 14 | 15 | # Parse command line arguments. 16 | print("Parsing arguments...") 17 | 18 | try: 19 | cmd = ParseCmdLine() 20 | except Exception as e: 21 | print("Failed to parse command line due to exception.") 22 | print(e) 23 | 24 | ExitProgram(1) 25 | 26 | # Check if we need to print command line options. 27 | if cmd["list"]: 28 | PrintCmdLine(cmd) 29 | 30 | ExitProgram() 31 | 32 | # Map command line arguments to variables. 
33 |     binary = cmd["binary"]
34 |     ua = cmd["ua"]
35 |     site = cmd["site"]
36 | 
37 |     # Set up the Selenium driver.
38 |     print("Setting up Selenium driver...")
39 | 
40 |     try:
41 |         driver = SetupDriver(binary, ua)
42 |     except Exception as e:
43 |         print("Failed to set up Selenium driver...")
44 |         print(e)
45 | 
46 |         ExitProgram(1)
47 | 
48 |     # Parse website.
49 |     print(f"Parsing website '{site}'...")
50 | 
51 |     try:
52 |         driver.get(site)
53 |     except Exception as e:
54 |         print(f"Failed to parse website '{site}'...")
55 |         print(e)
56 | 
57 |         ExitProgram(1, driver)
58 | 
59 |     # Wait until the clickable dividers are loaded using WebDriverWait; block until elements with class name 'clickDiv-row' are visible.
60 |     try:
61 |         WebDriverWait(driver, 10).until(
62 |             EC.visibility_of_any_elements_located((By.CLASS_NAME, "clickDiv-row"))
63 |         )
64 |     except Exception as e:
65 |         print("Failed to locate elements with 'clickDiv-row' class name within 10 seconds. Make sure you're running the test website included in this repository...")
66 |         print(e)
67 | 
68 |         ExitProgram(1, driver)
69 | 
70 |     # We need to click all clickable dividers now before parsing through BeautifulSoup4.
71 |     try:
72 |         # Find elements in Selenium with class name 'clickDiv-row'.
73 |         rows = driver.find_elements(By.CLASS_NAME, "clickDiv-row")
74 | 
75 |         # Check.
76 |         if not rows or len(rows) < 1:
77 |             print("Failed to parse clickable div rows. 'rows' is falsey or has a length of 0.")
78 | 
79 |             ExitProgram(1, driver)
80 | 
81 |         # Loop through each element found.
82 |         for row in rows:
83 |             # Click the element.
84 |             row.click()
85 |     except Exception as e:
86 |         print("Failed to click clickable dividers due to exception.")
87 |         print(e)
88 | 
89 |         ExitProgram(1, driver)
90 | 
91 |     # Parse web page with BeautifulSoup4.
 92 |     try:
 93 |         soup = BeautifulSoup(driver.page_source, "html.parser")
 94 |     except Exception as e:
 95 |         print("Failed to parse website's contents using BeautifulSoup4...")
 96 |         print(e)
 97 | 
 98 |         ExitProgram(1, driver)
 99 | 
100 |     # Parse each 'clickDiv-row' element and extract the clickable divider's title and hidden contents (revealed by the clicks above).
101 |     clickDivs: list[dict[str, str]] = []
102 | 
103 |     try:
104 |         # Retrieve all 'div' tags with class set to 'clickDiv-row'.
105 |         rows = soup.find_all("div", class_="clickDiv-row")
106 | 
107 |         if not rows or len(rows) < 1:
108 |             print("Failed to parse clickable div rows. 'rows' is falsey or has a length of 0.")
109 | 
110 |             ExitProgram(1, driver)
111 | 
112 |         # Loop through each element.
113 |         for row in rows:
114 |             # Parse first 'h2' tag and check.
115 |             h2 = row.find("h2")
116 | 
117 |             if not h2:
118 |                 print("Failed to parse clickable div. 'h2' is falsey.")
119 | 
120 |                 continue
121 | 
122 |             # Extract text from 'h2' tag as title.
123 |             title = h2.text
124 | 
125 |             # Extract the first 'div' tag.
126 |             div = row.find("div")
127 | 
128 |             if div is None:
129 |                 print("Failed to parse clickable div. 'div' is falsey.")
130 | 
131 |                 continue
132 | 
133 |             # Extract text from 'div' tag as contents.
134 |             contents = div.text
135 | 
136 |             # Append to clickable dividers list.
137 |             clickDivs.append({
138 |                 "title": title,
139 |                 "contents": contents
140 |             })
141 | 
142 |         # Print the clickable dividers we've found.
143 |         print("Found the following clickable dividers.")
144 | 
145 |         for index, clickDiv in enumerate(clickDivs):
146 |             print(f"\tClickDiv #{index + 1}")
147 |             print(f"\t\tTitle => {clickDiv['title']}")
148 |             print(f"\t\tDescription => {clickDiv['contents']}")
149 |     except Exception as e:
150 |         print("Failed to parse clickable div rows due to exception.")
151 |         print(e)
152 | 
153 |         ExitProgram(1, driver)
154 | 
155 |     print("Exiting...")
156 | 
157 |     ExitProgram(0, driver)
158 | 
159 | if __name__ == "__main__":
160 |     main()
--------------------------------------------------------------------------------
/src/base/__init__.py:
--------------------------------------------------------------------------------
1 | from .cmdline import ParseCmdLine, PrintCmdLine
2 | from .driver import SetupDriver
3 | from .utils import ExitProgram
--------------------------------------------------------------------------------
/src/base/cmdline.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | 
 3 | def ParseCmdLine() -> dict[str, any]:
 4 |     # Initialize argument parser.
 5 |     parser = argparse.ArgumentParser()
 6 | 
 7 |     # Add binary argument.
 8 |     parser.add_argument("-b", "--binary",
 9 |         help = "The path to the Geckodriver binary file.",
10 |         default = "/usr/bin/geckodriver"
11 |     )
12 | 
13 |     # Add site argument.
14 |     parser.add_argument("-s", "--site",
15 |         help = "The full URL of the website to parse and extract information from.",
16 |         default = "http://localhost:3000"
17 |     )
18 | 
19 |     # Add user agent argument.
20 |     parser.add_argument("-u", "--ua",
21 |         help = "The web browser's user agent to use when sending requests.",
22 |         default = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
23 |     )
24 | 
25 |     # Add list argument (a flag; store_true makes it default to False).
26 |     parser.add_argument("-l", "--list",
27 |         help = "Prints the command line values and exits.",
28 |         action = "store_true"
29 |     )
30 | 
31 |     # Parse all arguments.
32 |     args = parser.parse_args()
33 | 
34 |     # Return arguments in a dict.
35 |     return {
36 |         "binary": args.binary,
37 |         "site": args.site,
38 |         "ua": args.ua,
39 |         "list": args.list
40 |     }
41 | 
42 | def PrintCmdLine(cmd: dict[str, str | bool]):
43 |     print("Command Line")
44 | 
45 |     print(f"\tBinary => {cmd['binary']}")
46 |     print(f"\tSite => {cmd['site']}")
47 |     print(f"\tUser Agent => {cmd['ua']}")
48 | 
--------------------------------------------------------------------------------
/src/base/driver.py:
--------------------------------------------------------------------------------
1 | from selenium.webdriver.firefox.service import Service
2 | from selenium.webdriver.firefox.options import Options
3 | from selenium.webdriver import Firefox
4 | 
5 | def SetupDriver(binary: str, ua: str) -> Firefox:
6 |     # Initialize options.
7 |     opts = Options()
8 | 
9 |     # Set headless and no sandbox flags.
10 |     opts.add_argument("--headless")
11 |     opts.add_argument("--no-sandbox")
12 | 
13 |     # Set user agent.
14 |     opts.set_preference("general.useragent.override", ua)
15 | 
16 |     # Create service.
17 |     service = Service(executable_path = binary)
18 | 
19 |     # Create driver.
20 |     driver = Firefox(
21 |         options = opts,
22 |         service = service
23 |     )
24 | 
25 |     return driver
--------------------------------------------------------------------------------
/src/base/utils.py:
--------------------------------------------------------------------------------
1 | from sys import exit
2 | 
3 | from selenium.webdriver import Firefox
4 | 
5 | def ExitProgram(ret: int = 0, driver: Firefox | None = None):
6 |     if driver:
7 |         driver.quit()
8 | 
9 |     exit(ret)
--------------------------------------------------------------------------------
/src/simple-card-collector.py:
--------------------------------------------------------------------------------
1 | from base.cmdline import ParseCmdLine, PrintCmdLine
2 | from base.driver import SetupDriver
3 | from base.utils import ExitProgram
4 | 
5 | from selenium.webdriver.support.ui import WebDriverWait
6 | from selenium.webdriver.support import expected_conditions as EC
7 | from selenium.webdriver.common.by import By
8 | 
9 | from bs4 import BeautifulSoup
10 | 
11 | def main():
12 |     print("Starting simple-card-collector...")
13 | 
14 |     # Parse command line arguments.
15 |     print("Parsing arguments...")
16 | 
17 |     try:
18 |         cmd = ParseCmdLine()
19 |     except Exception as e:
20 |         print("Failed to parse command line due to exception.")
21 |         print(e)
22 | 
23 |         ExitProgram(1)
24 | 
25 |     # Check if we need to print command line options.
26 |     if cmd["list"]:
27 |         PrintCmdLine(cmd)
28 | 
29 |         ExitProgram(0)
30 | 
31 |     # Map command line arguments to variables.
32 |     binary = cmd["binary"]
33 |     ua = cmd["ua"]
34 |     site = cmd["site"]
35 | 
36 |     # Set up the Selenium driver.
37 |     print("Setting up Selenium driver...")
38 | 
39 |     try:
40 |         driver = SetupDriver(binary, ua)
41 |     except Exception as e:
42 |         print("Failed to set up Selenium driver...")
43 |         print(e)
44 | 
45 |         ExitProgram(1)
46 | 
47 |     # Parse website.
48 |     print(f"Parsing website '{site}'...")
49 | 
50 |     try:
51 |         driver.get(site)
52 |     except Exception as e:
53 |         print(f"Failed to parse website '{site}'...")
54 |         print(e)
55 | 
56 |         ExitProgram(1, driver)
57 | 
58 |     # Use WebDriverWait to wait until elements with the class name 'card-row' are visible (i.e. the cards have loaded).
59 |     try:
60 |         WebDriverWait(driver, 10).until(
61 |             EC.visibility_of_any_elements_located((By.CLASS_NAME, "card-row"))
62 |         )
63 |     except Exception as e:
64 |         print("Failed to locate elements with 'card-row' class name within 10 seconds. Make sure you're using 'testwebsite01'...")
65 |         print(e)
66 | 
67 |         ExitProgram(1, driver)
68 | 
69 |     # Parse web page with BeautifulSoup4.
70 |     try:
71 |         soup = BeautifulSoup(driver.page_source, "html.parser")
72 |     except Exception as e:
73 |         print("Failed to parse website's contents using BeautifulSoup4...")
74 |         print(e)
75 | 
76 |         ExitProgram(1, driver)
77 | 
78 |     # Parse each 'card-row' element and extract the card's title and description.
79 |     cards: list[dict[str, str]] = []
80 | 
81 |     try:
82 |         # Retrieve all 'div' tags with the class 'card-row'.
83 |         rows = soup.find_all("div", class_="card-row")
84 | 
85 |         if not rows:
86 |             print("Failed to parse card rows. 'rows' is empty.")
87 | 
88 |             ExitProgram(1, driver)
89 | 
90 |         # Loop through each element.
91 |         for row in rows:
92 |             # Retrieve the first 'h2' tag, which represents the card title.
93 |             h2 = row.find("h2")
94 | 
95 |             if not h2:
96 |                 print("Failed to parse card. 'h2' is falsy.")
97 | 
98 |                 continue
99 | 
100 |             # Extract text from the 'h2' tag as the title.
101 |             title = h2.text
102 | 
103 |             # Retrieve the first 'p' tag, which represents the card description.
104 |             p = row.find("p")
105 | 
106 |             if not p:
107 |                 print("Failed to parse card. 'p' is falsy.")
108 | 
109 |                 continue
110 | 
111 |             # Extract text from the 'p' tag as the description.
112 |             description = p.text
113 | 
114 |             # Add to the cards list.
115 | cards.append({ 116 | "title": title, 117 | "description": description 118 | }) 119 | 120 | # Print the cards we've found. 121 | print("Found the following cards.") 122 | 123 | for index, card in enumerate(cards): 124 | print(f"\tCard #{index + 1}") 125 | print(f"\t\tTitle => {card['title']}") 126 | print(f"\t\tDescription => {card['description']}") 127 | except Exception as e: 128 | print("Failed to parse card rows due to exception.") 129 | print(e) 130 | 131 | ExitProgram(1, driver) 132 | 133 | print("Exiting...") 134 | 135 | ExitProgram(0, driver) 136 | 137 | if __name__ == "__main__": 138 | main() -------------------------------------------------------------------------------- /src/simple-image-collector.py: -------------------------------------------------------------------------------- 1 | from base.cmdline import ParseCmdLine, PrintCmdLine 2 | from base.driver import SetupDriver 3 | from base.utils import ExitProgram 4 | 5 | from selenium.webdriver.support.ui import WebDriverWait 6 | from selenium.webdriver.support import expected_conditions as EC 7 | from selenium.webdriver.common.by import By 8 | 9 | from bs4 import BeautifulSoup 10 | 11 | def main(): 12 | print("Starting simple-image-collector...") 13 | 14 | # Parse command line arguments. 15 | print("Parsing arguments...") 16 | 17 | try: 18 | cmd = ParseCmdLine() 19 | except Exception as e: 20 | print("Failed to parse command line due to exception.") 21 | print(e) 22 | 23 | ExitProgram(1) 24 | 25 | # Check if we need to print command line options. 26 | if cmd["list"]: 27 | PrintCmdLine(cmd) 28 | 29 | ExitProgram(0) 30 | 31 | # Map command line arguments to variables. 32 | binary = cmd["binary"] 33 | ua = cmd["ua"] 34 | site = cmd["site"] 35 | 36 | # Setup Selenium driver. 
37 |     print("Setting up Selenium driver...")
38 | 
39 |     try:
40 |         driver = SetupDriver(binary, ua)
41 |     except Exception as e:
42 |         print("Failed to set up Selenium driver...")
43 |         print(e)
44 | 
45 |         ExitProgram(1)
46 | 
47 |     # Parse website.
48 |     print(f"Parsing website '{site}'...")
49 | 
50 |     try:
51 |         driver.get(site)
52 |     except Exception as e:
53 |         print(f"Failed to parse website '{site}'...")
54 |         print(e)
55 | 
56 |         ExitProgram(1, driver)
57 | 
58 |     # Use WebDriverWait to wait until elements with the class name 'image-row' are visible (i.e. the images have loaded).
59 |     try:
60 |         WebDriverWait(driver, 10).until(
61 |             EC.visibility_of_any_elements_located((By.CLASS_NAME, "image-row"))
62 |         )
63 |     except Exception as e:
64 |         print("Failed to locate elements with 'image-row' class name within 10 seconds. Make sure you're using 'testwebsite01'...")
65 |         print(e)
66 | 
67 |         ExitProgram(1, driver)
68 | 
69 |     # Parse web page with BeautifulSoup4.
70 |     try:
71 |         soup = BeautifulSoup(driver.page_source, "html.parser")
72 |     except Exception as e:
73 |         print("Failed to parse website's contents using BeautifulSoup4...")
74 |         print(e)
75 | 
76 |         ExitProgram(1, driver)
77 | 
78 |     # Parse each 'image-row' element and extract the source of the first 'img'.
79 |     imgUrls: list[str] = []
80 | 
81 |     try:
82 |         # Retrieve all 'div' tags with the class 'image-row'.
83 |         rows = soup.find_all("div", class_="image-row")
84 | 
85 |         if not rows:
86 |             print("Failed to parse image rows. 'rows' is empty.")
87 | 
88 |             ExitProgram(1, driver)
89 | 
90 |         # Loop through each element.
91 |         for row in rows:
92 |             # Retrieve the first 'img' element and check it.
93 |             img = row.find("img")
94 | 
95 |             if not img:
96 |                 print("Failed to parse image. 'img' is falsy.")
97 | 
98 |                 continue
99 | 
100 |             # Retrieve the source.
101 |             src = img.get("src")
102 | 
103 |             if not src:
104 |                 print("Failed to parse image. 'src' is falsy.")
105 | 
106 |                 continue
107 | 
108 |             # Append to image URLs.
109 | imgUrls.append(src) 110 | 111 | # Print the image URLs we've found. 112 | print("Found the following image URLs.") 113 | 114 | for url in imgUrls: 115 | print(f"\t- {url}") 116 | except Exception as e: 117 | print("Failed to parse image rows due to exception.") 118 | print(e) 119 | 120 | ExitProgram(1, driver) 121 | 122 | print("Exiting...") 123 | 124 | ExitProgram(0, driver) 125 | 126 | if __name__ == "__main__": 127 | main() --------------------------------------------------------------------------------
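All three collector scripts above share the same extraction pattern: render the page with Selenium, hand `driver.page_source` to BeautifulSoup, find the row elements by class, and pull child tags out of each row. The BeautifulSoup half of that pattern can be sketched standalone against static HTML (the snippet below is made up for illustration; in the real scripts the markup comes from a live `driver.page_source`):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the 'card-row' structure the scripts expect.
html = """
<div class="card-row"><h2>Card One</h2><p>First description.</p></div>
<div class="card-row"><h2>Card Two</h2><p>Second description.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

cards: list[dict[str, str]] = []

# Same logic as simple-card-collector.py: find each row, then extract the
# first 'h2' (title) and 'p' (description) from it.
for row in soup.find_all("div", class_="card-row"):
    h2 = row.find("h2")
    p = row.find("p")

    # Skip rows missing either tag.
    if not h2 or not p:
        continue

    cards.append({"title": h2.text, "description": p.text})

for index, card in enumerate(cards):
    print(f"Card #{index + 1}: {card['title']} => {card['description']}")
```

Because BeautifulSoup only sees a string of markup, the same loop works whether the HTML comes from Selenium, `requests`, or a local file.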