├── 5_errors_Selenium.ipynb
├── CloudFlare_Thoughts.ipynb
├── IndeedJune2022_href_jobDescription.ipynb
├── Indeed_scrape_2020CA.txt
├── Indeed_scrape_Oct2020.txt
├── Indeed_webscrape.ipynb
├── July_2023_Indeed_Webscrape_Selenium.ipynb
├── MyWay_Indeed_Help_India_version.ipynb
├── README.md
├── Selenium_ASYNCIO_Indeed.ipynb
├── Selenium_Webdriver_Issues.ipynb
├── Webscrape_UN_fin.ipynb
├── Webscraping_Ideas_Considerations.ipynb
├── indeed_WebScrape_2021.ipynb
├── screen_shot_01.png
├── screen_shot_02.png
├── screen_shot_03.png
├── screen_shot_04.png
└── shadow_root.png
/5_errors_Selenium.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "29fa83ea",
6 | "metadata": {},
7 | "source": [
8 |     "# `5 Common Errors When Using Selenium`\n",
9 | "\n",
10 | "# (◕‿◕✿)\n",
11 | "\n",
12 | "# Mr Fugu Data Science"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "id": "242b0aba",
18 | "metadata": {},
19 | "source": [
20 | "# `Top 5 Errors & Fixes: Selenium Web Scraping in Python (2024)`\n",
21 | "\n",
22 | "`------------------------------------------------------------------`\n",
23 | "\n",
24 | "# 1.) `Element Not Found Exception`\n",
25 | "\n",
26 | "**Error Description:** The `Element Not Found Exception` occurs when Selenium is unable to find the element specified by the locator.\n",
27 | "\n",
28 | "**`Possible Reasons this can occur:`**\n",
29 | "\n",
30 | "+ `HTML Changes`\n",
31 | "\n",
32 | "+ `Spelling`\n",
33 | "\n",
34 | "+ `Xpath`\n",
35 | "\n",
36 | "`_____________________________________________`\n",
37 | "\n",
38 |     "**Fix:** Ensure the locator is correct and the element is present in the *Document Object Model* (DOM). Use `WebDriverWait` to wait for the element to load.\n"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "id": "7ce305a9",
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "\n",
49 | "from selenium import webdriver\n",
50 | "\n",
51 | "from selenium.webdriver.common.by import By\n",
52 | "\n",
53 | "from selenium.webdriver.support.ui import WebDriverWait\n",
54 | "\n",
55 | "from selenium.webdriver.support import expected_conditions as EC\n",
56 | "\n",
57 | "driver = webdriver.Chrome()\n",
58 | "driver.get(\"https://some_example.com\")\n",
59 | "\n",
60 | "\n",
61 |     "# driver.find_element(By.NAME, \"some_UserName\") can also raise this error because\n",
62 |     "# the element is not there or the name is spelled wrong\n",
63 | "\n",
64 | "try:\n",
65 | " element = WebDriverWait(driver, 10).until(\n",
66 | " EC.presence_of_element_located((By.ID, \"element_id\")) # <-----------This is what we are handling\n",
67 | " )\n",
68 | "finally:\n",
69 | " driver.quit()\n"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "id": "8c83caa7",
75 | "metadata": {},
76 | "source": [
77 | "# 2.) `Stale Element Reference Exception`\n",
78 | "\n",
79 | "**Error Description:** This error occurs when an element that was previously found is no longer present in the *Document Object Model* (DOM).\n",
80 | "\n",
81 | "**`Possible Reasons this can occur:`**\n",
82 | "\n",
83 |     "+ `A.)` The WebElement reference is no longer attached to the DOM\n",
84 |     "    + Usually because JavaScript has *dynamically* updated the DOM\n",
85 | "\n",
86 | "`EXAMPLE:`\n",
87 |     "A banner on a website notifies you of a short-term sale; you come back later after the promotion has ended and click the old link, which no longer works.\n",
88 | "\n",
89 | "+ `B.)` HTML element in the (DOM) was either deleted and then recreated\n",
90 | "\n",
91 | "`EXAMPLE:` page refresh can do this\n",
92 | "\n",
93 | "`__________________________________________________`\n",
94 | "\n",
95 | "**Fix:** Re-locate the element before interacting with it. Implement `retries or refresh` the page if necessary.\n",
96 | "\n",
97 | "\n",
98 | "[Alternate Example](https://reflect.run/articles/how-to-deal-with-staleelementreferenceexception-in-selenium/)"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "id": "bc6b9618",
105 | "metadata": {},
106 | "outputs": [],
107 | "source": [
108 |     "from selenium import webdriver\n",
109 |     "\n",
110 |     "from selenium.webdriver.common.by import By\n",
111 |     "\n",
112 |     "from selenium.common.exceptions import StaleElementReferenceException\n",
113 |     "\n",
114 |     "import time\n",
115 |     "\n",
116 |     "driver = webdriver.Chrome()\n",
117 |     "driver.get(\"https://some_example.com\")\n",
118 |     "\n",
119 |     "try:\n",
120 |     "    while True:\n",
121 |     "        try:\n",
122 |     "            element = driver.find_element(By.ID, \"element_id\")\n",
123 |     "            element.click()\n",
124 |     "            break\n",
125 |     "        except StaleElementReferenceException:\n",
126 |     "            time.sleep(1)  # Wait before retrying\n",
127 |     "            driver.refresh()  # Refresh the page (optional) <------ THIS IS WHAT WE ARE DOING\n",
128 |     "finally:\n",
129 |     "    driver.quit()\n"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "id": "11ea8876",
132 | "metadata": {},
133 | "source": [
134 | "# 3.) `Timeout Exception`\n",
135 | "\n",
136 | "**Error Description:** The `Timeout Exception` is raised when a command takes too long to execute, usually because the element wasn’t found within the expected time.\n",
137 | "\n",
138 | "`_______________________________________________`\n",
139 | "\n",
140 | "**Fix:** Increase the wait time or ensure that the element is present before attempting interaction."
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "id": "a907c14f",
147 | "metadata": {},
148 | "outputs": [],
149 | "source": [
150 | "from selenium import webdriver\n",
151 | "\n",
152 | "from selenium.webdriver.common.by import By\n",
153 | "\n",
154 | "from selenium.webdriver.support.ui import WebDriverWait\n",
155 | "\n",
156 | "from selenium.webdriver.support import expected_conditions as EC\n",
157 | "\n",
158 | "driver = webdriver.Chrome()\n",
159 | "driver.get(\"https://Some_Example.com\")\n",
160 | "\n",
161 | "try:\n",
162 | " element = WebDriverWait(driver, 20).until(\n",
163 | " EC.visibility_of_element_located((By.CSS_SELECTOR, \".element_class\"))\n",
164 | " )\n",
165 | "finally:\n",
166 | " driver.quit()\n"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "id": "45f0e54a",
172 | "metadata": {},
173 | "source": [
174 | "# 4.) `No Such Element Exception`\n",
175 | "\n",
176 | "**Error Description:** This exception occurs when the element is not found on the page.\n",
177 | "\n",
178 | "+ Can occur if the page isn't loading\n",
179 | "\n",
180 | "+ Or the element isn't there in the first place\n",
181 | "\n",
182 | "`______________________________________`\n",
183 | "\n",
184 | "**Fix:** Double-check the locator strategy and make sure the element (actually) exists. Verify if the element is inside an iframe or loaded dynamically."
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 1,
190 | "id": "a7ef9385",
191 | "metadata": {},
192 | "outputs": [],
193 | "source": [
194 | "from selenium import webdriver\n",
195 | "\n",
196 |     "from selenium.common.exceptions import NoSuchElementException\n",
197 |     "\n",
198 |     "from selenium.webdriver.common.by import By\n",
197 | "\n",
198 | "driver = webdriver.Chrome()\n",
199 | "driver.get(\"https://some_example.com\")\n",
200 | "\n",
201 | "try:\n",
202 | " element = driver.find_element(By.XPATH, \"//div[@class='Some_Example']\")\n",
203 | " print(element.text)\n",
204 | "except NoSuchElementException:\n",
205 | " print(\"Element_not_found\")\n",
206 | "finally:\n",
207 | " driver.quit()\n"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "id": "dcc7dd60",
213 | "metadata": {},
214 | "source": [
215 | "# 5.) `Element Not Interactable Exception`\n",
216 | "\n",
217 | "**Error Description:** This exception occurs when an element is present in the *Document Object Model* (DOM) but cannot be interacted with.\n",
218 | "\n",
219 | "+ Either an element is disabled\n",
220 | "\n",
221 | "+ Or other elements may be overlapped\n",
222 | "\n",
223 | "`__________________________________________`\n",
224 | "\n",
225 | "**Fix:** Ensure the element is visible and enabled. Sometimes scrolling to the element or waiting for it to become interactive is necessary."
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "id": "6705ddff",
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "from selenium import webdriver\n",
236 | "\n",
237 | "from selenium.common.exceptions import ElementNotInteractableException\n",
238 | "\n",
239 | "from selenium.webdriver.common.by import By\n",
240 | "\n",
241 | "driver = webdriver.Chrome()\n",
242 | "\n",
243 | "driver.get(\"https://Some_Example.com\")\n",
244 | "\n",
245 | "try:\n",
246 | " element = driver.find_element(By.ID, \"Element_id\")\n",
247 | " driver.execute_script(\"arguments[0].scrollIntoView(true);\", element)\n",
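248 |     "\n",
249 |     "    # A hedged alternative (not in the original cell): instead of scrolling, explicitly\n",
250 |     "    # wait until the element is clickable; uses WebDriverWait + expected_conditions and\n",
251 |     "    # assumes the same placeholder locator \"Element_id\".\n",
252 |     "    # from selenium.webdriver.support.ui import WebDriverWait\n",
253 |     "    # from selenium.webdriver.support import expected_conditions as EC\n",
254 |     "    # element = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, \"Element_id\")))\n",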
248 | " element.click()\n",
249 | " \n",
250 | "except ElementNotInteractableException:\n",
251 | " print(\"Element Not Interacting\")\n",
252 | "\n",
253 | "finally:\n",
254 | " driver.quit()\n"
255 | ]
256 | },
257 | {
258 | "cell_type": "markdown",
259 | "id": "848246d3",
260 | "metadata": {},
261 | "source": [
262 | "# `BONUS:`\n",
263 | "\n",
264 | "# 6.) The error `chromedriver_autoinstaller` *unable to locate package* \n",
265 | "\n",
266 | "+ Indicates that the chromedriver_autoinstaller package is not available or cannot be found. \n",
267 | "\n",
268 | "This typically occurs due to one of the following reasons:\n",
269 | "\n",
270 | "+ **`Incorrect Package Name:`** The package name might be incorrect or not available in the package repository.\n",
271 | "\n",
272 | "+ **`Repository Issues:`** There might be an issue with the package repository or your environment's configuration.\n",
273 | "\n",
274 | "\n",
275 | "# `Steps to Resolve the Issue`\n",
276 | "\n",
277 | "`_________________________________________________`\n"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "id": "0805a6a4",
283 | "metadata": {},
284 | "source": [
285 | "**1.)** Check Package Name\n",
286 | "Ensure that you are using the correct package name. The correct name for the package is chromedriver-autoinstaller, not chromedriver_autoinstaller.\n",
287 | "\n",
288 | "**`pip install chromedriver-autoinstaller`**\n",
289 | "\n",
290 |     "**2.)** Update Package Repository\n",
291 |     "Make sure your package repository (pip) is up-to-date. Sometimes, updating pip can resolve issues with package installation.\n",
292 | "\n",
293 | "**`pip install --upgrade pip`**\n",
294 | "\n",
295 |     "**3.)** Check Python and Pip Versions\n",
296 |     "Verify that you are using a compatible version of Python and pip. Sometimes package installation issues can arise from version incompatibilities.\n",
297 | "\n",
298 | "**`python --version` and `pip --version`**\n",
299 | "\n",
300 | "\n",
301 |     "**4.)** Install the Package\n",
302 | "After updating pip and ensuring the correct package name, try installing the package again.\n",
303 | "\n",
304 | "**`Install chromedriver-autoinstaller:`** `pip install chromedriver-autoinstaller`\n",
305 | " \n",
306 | "**5.)** `Verify Installation`\n",
307 | "After installation, you can verify that chromedriver-autoinstaller is installed correctly by listing installed packages.\n",
308 | "\n",
309 | "List Installed Packages:\n",
310 | "\n",
311 | "`pip list`\n"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 3,
317 | "id": "7a9b491f",
318 | "metadata": {},
319 | "outputs": [
320 | {
321 | "ename": "ModuleNotFoundError",
322 | "evalue": "No module named 'chromedriver_autoinstaller'",
323 | "output_type": "error",
324 | "traceback": [
325 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
326 | "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
327 | "Cell \u001b[0;32mIn[3], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Example:\u001b[39;00m\n\u001b[0;32m----> 3\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mchromedriver_autoinstaller\u001b[39;00m\n\u001b[1;32m 4\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mselenium\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m webdriver\n\u001b[1;32m 6\u001b[0m \u001b[38;5;66;03m# Automatically download and install the correct version of chromedriver\u001b[39;00m\n",
328 | "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'chromedriver_autoinstaller'"
329 | ]
330 | }
331 | ],
332 | "source": [
333 | "# Example:\n",
334 | "\n",
335 | "import chromedriver_autoinstaller\n",
336 | "from selenium import webdriver\n",
337 | "\n",
338 | "# Automatically download and install the correct version of chromedriver\n",
339 | "chromedriver_autoinstaller.install()\n",
340 | "\n",
341 | "# Set up the WebDriver\n",
342 | "driver = webdriver.Chrome()\n",
343 | "driver.get('https://www.example.com')\n",
344 | "\n",
345 | "# Your code here...\n",
346 | "\n",
347 | "driver.quit()\n"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "id": "e5a653d3",
353 | "metadata": {},
354 | "source": [
355 | "# `Citations:`\n",
356 | "\n",
357 | "`------------------------`\n",
358 | "\n",
359 | "1. **`ElementNotFoundException`**\n",
360 | " - *Reference Link*: [Selenium WebDriver - Explicit Waits](https://www.selenium.dev/documentation/en/webdriver/waits/)\n",
361 | " \n",
362 | "\n",
363 | "2. **`StaleElementReferenceException`**\n",
364 | " - *Reference Link A*: [Handling StaleElementReferenceException](https://www.selenium.dev/documentation/en/webdriver/handling_errors/#stale-element-reference-exception)\n",
365 | " - *Reference Link B*: [Stale Element error and why](https://reflect.run/articles/how-to-deal-with-staleelementreferenceexception-in-selenium/)\n",
366 | "\n",
367 | "3. **`TimeoutException`**\n",
368 | " - *Reference Link*: [TimeoutException in Selenium WebDriver](https://www.selenium.dev/documentation/en/webdriver/timeout_exceptions/)\n",
369 | "\n",
370 | "4. **`NoSuchElementException`**\n",
371 | " - *Reference Link*: [Locating Elements - Selenium WebDriver](https://www.selenium.dev/documentation/en/webdriver/locating_elements/)\n",
372 | "\n",
373 | "5. **`ElementNotInteractableException`**\n",
374 | " - *Reference Link*: [ElementNotInteractableException - Selenium WebDriver](https://www.selenium.dev/documentation/en/webdriver/handling_exceptions/#element-not-interactable-exception)\n",
375 | "\n",
376 | "6. Bonus:\n",
377 | "\n",
378 | "`extra citations`\n",
379 | "\n",
380 | "https://www.educative.io/answers/what-is-nosuchelementexception-in-selenium-python\n",
381 | "\n",
382 | "https://www.lambdatest.com/blog/expected-conditions-in-selenium-examples/#:~:text=ExpectedConditions%20in%20Selenium%20allow%20you,not%20being%20updated%20in%20time."
383 | ]
384 | }
385 | ],
386 | "metadata": {
387 | "kernelspec": {
388 | "display_name": "Python 3 (ipykernel)",
389 | "language": "python",
390 | "name": "python3"
391 | },
392 | "language_info": {
393 | "codemirror_mode": {
394 | "name": "ipython",
395 | "version": 3
396 | },
397 | "file_extension": ".py",
398 | "mimetype": "text/x-python",
399 | "name": "python",
400 | "nbconvert_exporter": "python",
401 | "pygments_lexer": "ipython3",
402 | "version": "3.10.9"
403 | }
404 | },
405 | "nbformat": 4,
406 | "nbformat_minor": 5
407 | }
408 |
--------------------------------------------------------------------------------
/CloudFlare_Thoughts.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "f6c5464f",
6 | "metadata": {},
7 | "source": [
8 | "# `Cloud Flare Thoughts`\n",
9 | "\n",
10 | "# Mr Fugu Data Science\n",
11 | "\n",
12 | "# (◕‿◕✿)"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "id": "364e2993",
18 | "metadata": {},
19 | "source": [
20 | "# `Background:`\n",
21 | "\n",
22 |     "Have you ever tried to webscrape and been blocked by your IP address, had constant CAPTCHAs appear in a loop that won't let you leave, or stumbled onto hidden data that you cannot retrieve? It can be annoying when you are learning (webscraping) or building a dataset. Services such as `Cloudflare`, `PerimeterX`, and `Akamai` are capable of noticing even when you are using a headless browser. We should understand how this works and what these servers use to evaluate our request(s).\n",
23 | "\n",
24 | "\n",
25 | "\n",
26 | "# `What we will cover today:`\n",
27 | "\n",
28 |     "Today, I would like to cover a few ideas and tips to help you avoid detection. Really, we need to rethink and leverage our skills to mimic a real user.\n",
29 | "\n",
30 | "\n",
31 | "\n",
32 | "# DISCLAIMER:\n",
33 | "\n",
34 |     "**DO NOT ACT LIKE A PIRATE, PILLAGING AND PLUNDERING. ACT RESPONSIBLY & ETHICALLY WHILE COLLECTING DATA WHEN WEBSCRAPING...**\n",
35 | "\n",
36 | "`----------------------------------------------------------------------`"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "id": "b226bf61",
42 | "metadata": {},
43 | "source": [
44 | "# Cloudflare\n",
45 | "\n",
46 | "*Handling* `Cloudflare` *Protection*:\n",
47 | "\n",
48 | "`Cloudflare` uses various techniques to detect and block bots, such as JavaScript challenges and IP rate limiting.\n",
49 | "\n",
50 |     "*A basic understanding* of how `Cloudflare` operates: \n",
51 | "\n",
52 | "`--------------------------------------`\n",
53 | "\n",
54 |     "They are a cyber security and content delivery network. Because of this you are at a disadvantage as a webscraper: you are not spared from their wrath and are considered a security issue. Using `Selenium WebDriver` is a warning beacon making your presence known. `Cloudflare` uses *active* and *passive* detection measures to counter any perceived threat.\n",
55 | "\n",
56 | "*They utilize*:\n",
57 | "\n",
58 | "+ *`Active` monitoring:* CAPTCHA, event tracking, (canvas) fingerprinting to name a few\n",
59 |     "+ *`Passive` monitoring:* HTTP headers, TLS fingerprinting, checking your IP address reputation, among many others\n",
60 |     "+ May employ behavioral techniques such as tracking *mouse movements*\n",
61 | "+ *JavaScript Challenges*\n",
62 | "\n",
63 | "\n",
64 |     "I want to mention a few important checks that may occur on this platform when you are using `automation/testing software`.\n",
65 | "\n",
66 |     "For example, when you are trying to webscrape, there are signatures the platform looks for to determine your behavior and fingerprint.\n",
67 | "\n",
68 |     "+ When using `Selenium` you have a `WebDriver` that connects a browser such as Chrome, Firefox, etc. to that server through requests.\n",
69 |     "    + By default `Selenium` will send information to the website letting it know you are using test-automation software. This trigger can directly impact your interaction with a server. Yes, you can mask this by turning off the flag, but it is not foolproof.\n",
70 | "\n",
71 | "*Other behaviors can trigger this as well:*\n",
72 | " \n",
73 | "+ When you try to connect to a server; information about you is given to them such as:\n",
74 |     "    + your browser type and version, screen resolution, mouse movements, timing of your responses like clicks and moving to new pages quickly, geo-location, and IP address.\n",
75 | " \n",
76 |     "*Think of it like this*: we are creating an active digital fingerprint. This fingerprint consists of various markers that let the site detect who we are. Because of this, consider that sites can evaluate your IP address as credible or as a potential bot, and they hold the keys to let you pass.\n",
77 | "\n",
78 | "+ If you want to obscure your presence and fly under the radar it is imperative to include various steps in your process to roam around without raising suspicion. \n",
79 | "\n",
80 | "**Considerations:** to help in your epic learning experience as a data extracting boss (ethically of course...)\n",
81 | "\n",
82 | "`-------------------------------------------------------------`\n"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "id": "a41b0508",
88 | "metadata": {},
89 | "source": [
90 | "# 1.) `Headers:`\n",
91 | "\n",
92 |     "Headers are one of the easiest ways for an anti-bot server to detect you. This is because common HTTP libraries (*Python Requests, Scrapy, or Node.js Axios*) by default *`either omit browser headers or include headers that indicate the request came from a library`*.\n",
93 | "\n",
94 |     "+ `Header order is very important and often overlooked`: with some HTTP clients/libraries the order can be random or make it obvious the request came from a Python library, which is bad. `Python Requests` is an example of this.\n",
95 |     "    + `HTTPX` is a better option because it does respect the ordering and it also gives you `HTTP/2`\n",
96 | "\n",
97 |     "+ Some websites will benefit from you changing what headers you're using because of how they are retrieved from a server.\n",
98 | "\n",
99 | "`------------------------------------------------------------`\n",
100 | "\n",
101 |     "**Solution:** consider turning off the notifications from Selenium showing that you are using automation software. With regard to Python, we also need to consider options such as a `headless browser`, but this poses other issues I will mention.\n",
102 | "\n",
103 | "+ Turn off blink feature: `options.add_argument('--disable-blink-features=AutomationControlled')`\n",
104 | "\n",
105 |     "+ Using a `headless browser` does have inherent detection problems on some servers. A way to deal with it would be creating the `headless browser`, adding a `user-agent`, adjusting the `window size`, and dealing with your `headers`. This is a good start; work from there, adding to the code to help you seem more legit.\n",
106 | "\n",
107 | "**ex.)** \n",
108 | "\n",
109 | "`!pip install selenium_stealth`\n",
110 | "\n",
111 | "`from selenium import webdriver`\n",
112 | "\n",
113 | "`from selenium_stealth import stealth`\n",
114 | "\n",
115 | "`from selenium.webdriver.chrome.service import Service`\n",
116 | "\n",
117 | "`from selenium.webdriver.chrome.options import Options`\n",
118 | "\n",
119 |     "`from webdriver_manager.chrome import ChromeDriverManager`\n",
120 |     "\n",
121 |     "`import random  # needed for random.choice(user_agents) below`\n",
120 | "\n",
121 | "\n",
122 | "`options = Options()`\n",
123 | "\n",
124 | "`options.add_argument(\"--headless\") # Optional: Run in headless mode` \n",
125 | "\n",
126 |     "**this may be set up differently, e.g. `options.headless = True` or `options.add_argument(\"--headless=new\")`, depending on what version of Selenium you are using!!!**\n",
127 | "\n",
128 | "`options.add_argument(\"--window-size=1920,1200\") # Set the window size`\n",
129 | "\n",
130 | "**alternative window option:** `options.add_argument('--start-maximized')`\n",
131 | "\n",
132 | "`user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',\n",
133 | " 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',\n",
134 | " # Add more User-Agent strings here\n",
135 | "]`\n",
136 | "\n",
137 | "`options.add_argument(f'user-agent={random.choice(user_agents)}')`\n",
138 | "\n",
139 | "\n",
140 | "`driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)`\n",
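141 |     "\n",
142 |     "Since `stealth` was imported above, you would also apply it to the driver before browsing (a hedged sketch; these are the typical arguments shown in the selenium-stealth README):\n",
143 |     "\n",
144 |     "`stealth(driver, languages=[\"en-US\", \"en\"], vendor=\"Google Inc.\", platform=\"Win32\", webgl_vendor=\"Intel Inc.\", renderer=\"Intel Iris OpenGL Engine\", fix_hairline=True)`\n",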
141 | "\n",
142 | "`driver.get('https://www.example.com')`\n",
143 | "\n",
144 | "\n",
145 | "`---------------------------------------------------------`\n",
146 | "\n",
147 | "+ Use a header and user agent even when going headless by setting it explicitly yourself and consider setting a window as well.\n",
148 | "\n",
149 | " + `Rotate User-Agents` to avoid getting blocked. here is an example https://www.zenrows.com/blog/user-agent-web-scraping#rotate\n",
150 | "\n",
151 |     "1.) **`ALWAYS use real headers from a browser and a real user-agent`**\n",
152 | "\n",
153 |     "2.) **Changing the User-Agent but forgetting other header information will be suspicious.** For example, your browser information is sent with the header and both NEED to match. Also, older versions of Chrome do not have the same header information, and if you add it anyway you will set off alarm bells, potentially getting your request blocked.\n",
154 | "\n",
155 | "\n",
156 | "Good read: https://www.zenrows.com/blog/selenium-python-web-scraping#add-headers\n",
157 | "\n",
158 | "If you need to check if the user agent you are using for webscraping is correct run this test:\n",
159 | "\n",
160 | " \n",
161 | "# Test if your web scraper is sending correct header to HTTPBin \n",
162 | "\n",
163 | "*this is an example*\n",
164 | "\n",
165 | "`import requests `\n",
166 | " \n",
167 | "`headers = {\"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36\"}` \n",
168 | "\n",
169 | "`r = requests.get(\"https://httpbin.org/headers\", headers=headers)` \n",
170 | "\n",
171 | "`print(r.text)`\n",
172 | "\n",
173 | "`-------------------------------------------------------------`\n",
174 | "\n",
175 | "**Helpful Tips & Code:**\n",
176 | "\n",
177 | "If you want to make this more custom: https://www.zenrows.com/blog/selenium-python-web-scraping#add-headers\n",
178 |     "You can use `pip install blinker==1.7.0 selenium-wire`; selenium-wire can help you tailor and read header information, aiding in your endeavors.\n",
179 | "\n",
180 | "\n",
181 | "`-------------------------------------------------------------`\n",
182 | "\n",
183 | "\n",
184 | "# `Examples how browsers can detect bot/automation tool (within the Header):`\n",
185 | "\n",
186 | "\n",
187 |     "1.) **`Possible User-Agent Issues:`** bots/automation tools can add strings that give everything away in the header.\n",
188 | "\n",
189 | "ex.) `scraper,bot,crawler` this is a dead giveaway.\n",
190 | "\n",
191 | "`----------------`\n",
192 | "\n",
193 | "2.) Browsers such as `Google & Safari` add language fallbacks in tags called **`Accept Language:`**. \n",
194 | "\n",
195 | "+ Sometimes weird language may be in this section or completely inaccurate data. \n",
196 | "\n",
197 | "+ Notice that the `q` will always go in decreasing order such as\n",
198 | "\n",
199 | "ex.) `en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7` this will occur when navigator.language is [`en-US, zh-CN`]\n",
200 | "\n",
201 |     "+ `Important note:` if you are using `Safari` or `Google Incognito`, only one language is specified, due to fingerprinting privacy protections.\n",
202 | "\n",
203 |     "+ `Side note:` a browser may not send the language you are interested in, for example if you are traveling or using a VPN exiting in a country with a different language than your own. Consider this:\n",
204 |     "    + You may get a `406 (Not Acceptable)` error, which means your request could not be matched\n",
205 | "\n",
206 | "\n",
207 | "`----------------`\n",
208 | "\n",
209 | "\n",
210 |     "3.) **`Accept-Encoding:`** same issue as above; it can be empty or contain wrong information. Below is an example of what is more realistic. This header advertises the compression algorithms negotiated during the handshake.\n",
211 | "\n",
212 | "ex.) \n",
213 | "\n",
214 | "`Accept-Encoding: gzip`\n",
215 | "\n",
216 | "`Accept-Encoding: gzip, compress, br`\n",
217 | "\n",
218 | "`Accept-Encoding: gzip, compress, br, zstd`\n",
219 | "\n",
220 | "`Accept-Encoding: br;q=1.0, gzip;q=0.8, *;q=0.1`\n",
221 | "\n",
222 | "\n",
223 | "`----------------`\n",
224 | "\n",
225 | "\n",
226 |     "4.) **`Referer:`** bots tend to leave this empty or add incorrect information. The purpose of this header is to say where the request is coming from.\n",
227 |     "\n",
228 |     "ex.) `you can add an actual website here`, but it is not always necessary.\n",
229 | "\n",
230 | "`----------------`\n",
231 | "\n",
232 | "\n",
233 |     "5.) **`Connection`** decides if the connection stays open after the current transaction finishes. If the value `keep-alive` is used then the connection will stay open after the transaction. You also have `close`, which means either the client or the server is requesting to close the connection after the transaction.\n",
234 |     "\n",
235 |     "+ `IMPORTANT NOTE:` connection-specific headers like `keep-alive` are prohibited in `HTTP/2 and HTTP/3`. *Chrome & Firefox* over HTTP/2 simply ignore them, while `Safari` over HTTP/2 conforms to the spec requirements and will not load such responses.\n",
236 | "\n",
237 | "ex.) `Keep-Alive, Transfer-Encoding, TE, Connection, Trailer, Upgrade, Proxy-Authorization and Proxy-Authenticate` \n",
238 | "\n",
239 |     "If these are used, they must be listed in the `Connection` header so the proxy knows they must be consumed and not forwarded further.\n",
240 | "\n",
241 | "\n",
242 | "`----------------`\n",
243 | "\n",
244 | "\n",
245 |     "6.) **`Accept`** indicates what content types your client can understand. Depending on your request, different types of data are returned, e.g. when you ask for *images, scripts, CSS, etc.*\n",
246 | "\n",
247 | "ex.) \n",
248 | "\n",
249 | "+ `text/html`\n",
250 | "\n",
251 | "+ `application/xhtml+xml`\n",
252 | "\n",
253 | "+ `application/xml;q=0.9`\n",
254 | "\n",
255 | "+ `image/webp`\n",
256 | "\n",
257 | "+ `image/apng`\n",
258 | "\n",
259 | "+ `*/*;q=0.8`\n",
260 | "\n",
261 | "`-------------------------------------------------------------`\n",
262 | "\n",
263 | "https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language\n",
264 | "\n",
265 | "https://www.zenrows.com/blog/selenium-stealth#change-browser-properties"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "id": "5ba94ef0",
271 | "metadata": {},
272 | "source": [
273 |     "# `Code Example:`\n",
274 | "\n",
275 | "`from selenium import webdriver`\n",
276 | "\n",
277 | "`from selenium.webdriver.chrome.service import Service`\n",
278 | "\n",
279 | "`from selenium.webdriver.chrome.options import Options`\n",
280 | "\n",
281 | "`from webdriver_manager.chrome import ChromeDriverManager`\n",
282 | "\n",
283 | "`options = Options()`\n",
284 | "\n",
285 | "`driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)`\n",
286 | "\n",
287 | "*`# Customized header`*\n",
288 | "\n",
289 |     "`driver.execute_cdp_cmd('Network.setUserAgentOverride', {`\n",
290 |     "    `'userAgent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',`\n",
291 |     "    `'acceptLanguage': 'en-US,en;q=0.9',`\n",
292 |     "    `'platform': 'Windows'`\n",
293 |     "`})`\n",
296 | "\n",
297 | "`driver.get('https://www.example.com')`\n",
298 | "\n",
299 | "\n",
300 | "\n",
301 | "\n"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "id": "5dfc0784",
307 | "metadata": {},
308 | "source": [
309 | "`-------------------------------------------------------------`\n",
310 | "\n",
311 | "# 2.) **`IP Blocking/Tracking`**\n",
312 | "\n",
313 |     "Your IP address can be a red flag to these sites. Remember they have options such as throttling your requests, limiting the data shown to you, or blocking you outright. If you are suspected of having a history of making a ton of requests, acting maliciously, or violating terms of service, this may persist as an issue for you into the future. I was blocked from webscraping Indeed twice, with 3 different IP addresses in 2 different countries, and it persisted for around 9-12 months. Your geographic location can pose an issue as well, because it may restrict your access or the content you see. \n",
314 | "\n",
315 | "\n",
316 | "**`Solution:`** you will need to use **rotating proxy** services to obscure your IP address and utilize theirs. *VPN networks and TOR* can have issues that will not resolve themselves and cannot guarantee results.\n",
317 | "\n",
318 |     "+ *`DO NOT use free proxies`*; they will get you banned for various reasons. If you need to do industrial-scale scraping, spend the money."
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "id": "f90b3e8e",
324 | "metadata": {},
325 | "source": [
326 | "**`Ex.) Proxy:`**\n",
327 | " \n",
328 | " \n",
329 | "`from selenium import webdriver`\n",
330 | "\n",
331 | "`from selenium.webdriver.chrome.service import Service`\n",
332 | "\n",
333 | "`from selenium.webdriver.chrome.options import Options`\n",
334 | "\n",
335 | "`from webdriver_manager.chrome import ChromeDriverManager`\n",
336 | "\n",
337 | "`options = Options()`\n",
338 | "\n",
339 | "`options.add_argument('--proxy-server=http://your-proxy-address:port')`\n",
340 | "\n",
341 | "`driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)`\n",
342 | "\n",
343 | "`driver.get('https://www.example.com')`\n"
344 | ]
345 | },
346 | {
347 | "cell_type": "markdown",
348 | "id": "3a8b4455",
349 | "metadata": {},
350 | "source": [
351 | "`-------------------------------------------------------------`\n",
352 | "\n",
353 | "# 3.) **`User Agent`**\n",
354 | "\n",
355 | "This is a string identifying the client making the request(s) and this information will identify if you are from a trusted or known browser.\n",
356 | "\n",
357 | "ex.) `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36`\n",
358 | "\n",
359 |     "Typical webscraping bots/libraries make it very easy to tell where you are coming from and who you are, which reveals your intentions in the process.\n",
360 | "\n",
361 |     "Think about it from the website's point of view: it sees something making repeated calls in a short amount of time from the same user-agent. In my opinion, you are playing with fire.\n",
362 | "\n",
363 |     "**Important NOTE:** if you use a `--headless` browser you will end up openly showing that you are automated. You will need to add an additional flag to assist you.\n",
364 | "\n",
365 | "**Ex.)** this is a snippet not full code...\n",
366 | "\n",
367 | "\n",
368 | "`#create a custom user agent`\n",
369 | "\n",
370 | "`Your_User_Agent = \"your info here\"`\n",
371 | "\n",
372 | "`options.add_argument(f'--user-agent={Your_User_Agent}')`\n",
373 | "\n",
374 |     "`# enable headless`\n",
375 | "\n",
376 | "`options.add_argument('--headless')`\n",
377 | "\n",
378 | "`-------------------------------------------`\n",
379 | "\n",
380 | "https://www.zenrows.com/blog/user-agent-web-scraping#best (read this)"
381 | ]
382 | },
383 | {
384 | "cell_type": "markdown",
385 | "id": "1b7e1e71",
386 | "metadata": {},
387 | "source": [
388 | "`-------------------------------------------------------------`\n",
389 | "\n",
390 | "# 4.) **`Java Script Challenge`** \n",
391 | "\n",
392 |     "This is a sneaky little technique checking to see if you are legit or not. The server sends you a piece of `JavaScript` code, and the issue is that most plain HTTP automation tools cannot answer the challenge because they don't render JavaScript. This is how they catch you; think of Dave Chappelle and his famous line: \"Gotcha ...\"\n",
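393 |     "\n",
394 |     "A minimal sketch of the workaround (the URL and the 10-second pause are placeholder assumptions): because Selenium drives a real browser engine, the challenge JavaScript actually executes; you simply give it time to finish before reading the page.\n",
395 |     "\n",
396 |     "`from selenium import webdriver`\n",
397 |     "\n",
398 |     "`import time`\n",
399 |     "\n",
400 |     "`driver = webdriver.Chrome()`\n",
401 |     "\n",
402 |     "`driver.get('https://www.example.com')`\n",
403 |     "\n",
404 |     "`time.sleep(10)  # give the JavaScript challenge time to run and redirect`\n",
405 |     "\n",
406 |     "`html = driver.page_source  # now contains the page behind the challenge (if it passed)`\n",
407 |     "\n",
408 |     "`driver.quit()`\n",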
393 | "\n"
394 | ]
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "id": "b91fba82",
399 | "metadata": {},
400 | "source": [
401 | "`-------------------------------------------------------------`\n",
402 | "\n",
403 | "# 5.) **`Timing Requests`**\n",
404 | "\n",
405 |     "Setting timeouts for loading pages, and allowing different amounts of time for various actions and for pagination, will help you greatly in your journey. Make it less robotic; act dynamic, like a person.\n",
406 | "\n",
407 | "\n",
408 | "+ This can be very important when you need to have a page load before scraping because of dynamic content\n",
409 | "\n",
410 |     "+ If you need to scroll a page with automation, add timing there as well\n",
411 | "\n",
412 | "+ mouse movements\n",
413 | "\n",
414 | "+ click buttons\n",
415 | "\n",
416 |     "+ randomize your movements when scraping or when creating a multi-threaded process\n",
417 | "\n",
418 | "\n",
419 |     "**`Two common ways:`** add `time.sleep()` or use an explicit wait:\n",
420 | "\n",
421 | "`from selenium.webdriver.common.by import By`\n",
422 | "\n",
423 | "`from selenium.webdriver.support.ui import WebDriverWait`\n",
424 | "\n",
425 | "`from selenium.webdriver.support import expected_conditions as EC`\n",
426 | "\n",
427 |     "`WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, \"Some_Element_YouWant\")))`\n",
428 | "\n",
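429 |     "To make the timing less robotic you can also randomize the pauses between actions (a small sketch; the 2-6 second bounds are an arbitrary assumption):\n",
430 |     "\n",
431 |     "`import random, time`\n",
432 |     "\n",
433 |     "`time.sleep(random.uniform(2, 6))  # pause a randomized, human-ish amount between actions`\n",
434 |     "\n",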
429 | "`-------------------------------------------------------------`\n",
430 | "\n",
431 | "here is inspiration for you: https://www.geeksforgeeks.org/waits-in-selenium-python/"
432 | ]
433 | },
434 | {
435 | "cell_type": "markdown",
436 | "id": "e4d4ff0c",
437 | "metadata": {},
438 | "source": [
439 | "`-------------------------------------------------------------`\n",
440 | "\n",
441 | "# 6.) **`TLS Finger Printing:`**\n",
442 | "\n",
443 |     "`TLS:` **T**ransport **L**ayer **S**ecurity, can be considered a handshake between you the client and the server. Information is exchanged to make sure you are legit, the communication is encrypted, and each client will have its own signature (fingerprint). The information provided in your fingerprint indicates whether it is a casual user or an automated tool such as a bot or scraper. In general, a request from a browser will look different from one made by a programming language.\n",
444 | "\n",
445 | "+ This is very important because it can detect: `network spoofing, man in the middle attacks, potential espionage`\n",
446 | "\n",
447 |     "+ When making a call (request) there is something called a `JA3 hash`. This is very important because it can reveal that you are the same client making repeated requests, which can give you a lot of headaches very quickly. From my understanding, starting in *Chrome 110* the TLS ClientHello started to randomize this JA3 hash. \n",
448 | "\n",
449 | "+ One option to consider is `curl-cffi` because it is more aligned with HTTP protocol and TLS handshakes\n",
450 | "\n",
451 |     "Unfortunately for us, this means that even though you were trying to be stealthy, it stumbles because your `requests library does NOT change the JA3 hash`. Libraries such as `requests` have a fixed TLS fingerprint, and this is a vulnerability exploited by the server side to hinder our work. \n",
452 | "\n",
453 | "`-------------------------------------------------------------`\n",
454 | "\n",
455 | "ex.) `import httpx` \n",
456 | "\n",
457 | "+ This can be used if you have issues with HTTP/1.1 and need HTTP/2 for instance\n",
458 | "\n",
459 | "ex.) [Java Selenium Finger printing](https://github.com/CheshireCaat/selenium-with-fingerprints)\n",
460 | "\n",
461 | "+ Change your finger print using Java Script as a plugin to Selenium.\n",
462 | "\n",
463 | "ex.) [Github curl_cffi](https://github.com/lexiforest/curl_cffi) | [Curl-cffi Documentation](https://curl-cffi.readthedocs.io/en/latest/impersonate.html)\n",
464 | "\n",
465 | "*This can work but, alone it is vulnerable to advanced bot detection such as CAPTCHAS.*\n",
466 | "\n",
467 |     "+ It will mimic the behavior of a browser better than the `requests` library. Why is this different? Regular Python libraries emit a fingerprint that will get you caught faster! This allows you to impersonate various browsers with their correct credentials (see the sketch below).\n",
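468 |     "\n",
469 |     "A minimal sketch of the idea (hedged: the exact `impersonate` values available depend on your installed curl_cffi version, and the URL is a placeholder):\n",
470 |     "\n",
471 |     "`from curl_cffi import requests as cffi_requests`\n",
472 |     "\n",
473 |     "`r = cffi_requests.get('https://www.example.com', impersonate='chrome')  # sends a Chrome-like TLS/JA3 fingerprint`\n",
474 |     "\n",
475 |     "`print(r.status_code)`\n",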
468 | "\n",
469 | "\n",
470 | "\n",
471 | "`-------------------------------------------------------------`\n",
472 | "\n",
473 | "\n",
474 |     "**IMPORTANT NOTE:** If you are using `open source software to avoid detection`, understand that `Cloudflare` can also study this software to learn how it works and fix the vulnerabilities it exploits. Therefore, these techniques may aid you with some services you encounter, but if you deal with more advanced, guarded servers then you will be detected!\n",
475 | "\n",
476 | "`-------------------------------------------------------------`\n",
477 | "\n",
478 | "Here’s how you can handle these issues:\n",
479 | "\n",
480 | "+ `User Agent Rotation:` Changes the user agent to mimic different browsers and devices.\n",
481 | "+ `Delays:` Introduces random delays to avoid detection based on timing patterns.\n",
482 | "+ `Handling Cloudflare:` Cloudflare’s JavaScript challenges can be difficult to bypass with automation. Rotating IP addresses and using CAPTCHA-solving services might be necessary.\n",
483 | "\n",
484 | "`-------------------------------------------------------------`\n"
485 | ]
486 | },
487 | {
488 | "cell_type": "markdown",
489 | "id": "ea252691",
490 | "metadata": {},
491 | "source": [
492 | "# 7.) `Patterns & Target Area`\n",
493 | "\n",
494 |     "**A.)** Here is an interesting topic: consider that you are trying to webscrape a company like `Amazon or eBay` and you decide to enter at a starting page that is very direct for your purpose but not something a typical user would use as an entry point.\n",
495 | "\n",
496 | "Ex.) `amazon.com/[product_xyz]` this seems easy enough but this is not exactly a first match to what someone would type. It is unrealistic unless it is for something they viewed before or some other reason.\n",
497 | "\n",
498 | "Ex.) but the same person who didn't want to get caught as easily would do something more realistic `https://www.amazon.com/s?k=painters+tape&i=industrial&crid=36RXZ5Z6JCP2V&sprefix=pai%2Cindustrial%2C179&ref=nb_sb_ss_pltr-sample-20_1_3`\n",
499 | "\n",
500 | "or even: `https://www.amazon.com/s?k=painters+tape&i=industrial`\n",
501 | "\n",
502 |     "I hope that even though this is generic, it shows there is a difference when you are trying to look like an average user.\n",
503 | "\n",
504 |     "**B.)** The next issue is this: if you were doing research or scraping some European supermarket site and your IP for some reason was someplace across the world, it might trigger the server to flag the behavior as weird and sniff you out.\n",
505 | "\n",
506 |     "**C.)** Last, if you try to `randomize your scraping`, this can help a lot. Scrape different pages or different parts of a page over time (see the small sketch below).\n",
507 | "\n",
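508 |     "A small sketch of that randomization (the page URLs are hypothetical placeholders):\n",
509 |     "\n",
510 |     "`import random`\n",
511 |     "\n",
512 |     "`pages = ['https://www.example.com/a', 'https://www.example.com/b', 'https://www.example.com/c']  # hypothetical targets`\n",
513 |     "\n",
514 |     "`random.shuffle(pages)  # visit pages in a random order instead of a fixed crawl pattern`\n",
515 |     "\n",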
508 | "`-------------------------------------------------------------`\n"
509 | ]
510 | },
511 | {
512 | "cell_type": "markdown",
513 | "id": "f4c18dec",
514 | "metadata": {},
515 | "source": [
516 | "# 8.) `CAPTCHA`:\n",
517 | "\n",
518 |     "+ If you need to solve CAPTCHAs then you need to consider a service such as `2captcha` or similar.\n",
519 | "\n",
520 |     "+ For sites with less security you can try `undetected-chromedriver` (see the sketch below)\n",
521 | "\n",
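522 |     "A minimal, hedged sketch of using `undetected-chromedriver` (install with `pip install undetected-chromedriver`; the URL is a placeholder and available options vary by version):\n",
523 |     "\n",
524 |     "`import undetected_chromedriver as uc`\n",
525 |     "\n",
526 |     "`driver = uc.Chrome()  # patched ChromeDriver that hides common automation markers`\n",
527 |     "\n",
528 |     "`driver.get('https://www.example.com')`\n",
529 |     "\n",
530 |     "`driver.quit()`\n",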
523 | "\n",
524 | "`-------------------------------------------------------------`\n"
525 | ]
526 | },
527 | {
528 | "cell_type": "markdown",
529 | "id": "0ae9b6cc",
530 | "metadata": {},
531 | "source": [
532 | "# `BONUS TIPS:`\n",
533 | "\n",
534 |     "**`ALL automation tools have leaks and create issues, but here are a couple of tips to help you out`**\n",
535 | "\n",
536 | "**1.)**\n",
537 | "\n",
538 | "`from selenium.webdriver import Chrome`\n",
539 | "\n",
540 |     "`driver = Chrome()`\n",
541 |     "\n",
542 |     "`script = \"Object.defineProperty(navigator, 'webdriver', {get: () => false})\"`\n",
543 |     "\n",
544 |     "`driver.execute_cdp_cmd(\"Page.addScriptToEvaluateOnNewDocument\", {\"source\": script})`\n",
543 | "\n",
544 | "\n",
545 |     "There is an interesting case here with this piece of code. A common leak that shows we are using a tool for automation/webscraping is the default value of `navigator.webdriver`; this is a common issue in `Selenium, Puppeteer, and Playwright`.\n",
546 | "\n",
547 | "**2.)**\n",
548 | "\n",
549 |     "When launching your automation tools there are additional `flags in the background` that often create a `new leak` you should close as well.\n",
550 | "\n",
551 | "You can take a log of the activity to see what is going on:\n",
552 | "\n",
553 | "`import logging`\n",
554 | "\n",
555 | "`selenium_logger = logging.getLogger('selenium.webdriver.remote.remote_connection')`\n",
556 | "\n",
557 | "`selenium_logger.setLevel(logging.DEBUG)`\n",
558 | "\n",
559 | "`logging.basicConfig(level=logging.DEBUG)`\n",
560 | "\n",
561 | "Then you can use something like this:\n",
562 | "\n",
563 |     "`from selenium import webdriver`\n",
564 | "\n",
565 | "`chromeOptions = webdriver.ChromeOptions()`\n",
566 | "\n",
567 | "`chromeOptions.add_experimental_option('excludeSwitches', ['disable-extensions','disable-default-apps','disable-component-extensions-with-background-pages'])`\n",
568 | "\n",
569 |     "`chromeDriver = webdriver.Chrome(options=chromeOptions)`"
570 | ]
571 | },
572 | {
573 | "cell_type": "code",
574 | "execution_count": null,
575 | "id": "f48b3591",
576 | "metadata": {},
577 | "outputs": [],
578 | "source": [
579 | "# Dynamic Dom??\n",
580 | "\n",
581 | "#honey pots\n",
582 | "\n",
583 | "\n",
584 | "# https://medium.com/@pankaj_pandey/web-scraping-using-python-for-dynamic-web-pages-and-unveiling-hidden-insights-8dbc7da6dd26"
585 | ]
586 | },
587 | {
588 | "cell_type": "markdown",
589 | "id": "65e4abcb",
590 | "metadata": {},
591 | "source": [
592 | "# `Citations & Help:`\n",
593 | "\n",
594 | "# ◔̯◔\n",
595 | "\n",
596 | "https://www.zenrows.com/blog/selenium-avoid-bot-detection#how-anti-bots-work\n",
597 | "\n",
598 | "https://webscraping.ai/faq/headless-chromium/how-do-i-prevent-detection-of-headless-chromium-by-websites\n",
599 | "\n",
600 | "https://www.zenrows.com/blog/selenium-cloudflare-bypass#how-cloudflare-detect-selenium\n",
601 | "\n",
602 | "https://proxiesapi.com/articles/selenium-headless-stealth-tactics-to-bypass-cloudflare-detection\n",
603 | "\n",
604 | "https://webseekerj.medium.com/how-to-bypass-and-solve-cloudflare-captcha-with-python-selenium-32281fdf239e\n",
605 | "\n",
606 | "https://webscraping.ai/faq/http/can-i-use-http-2-for-web-scraping-and-what-are-the-benefits#:~:text=Yes%2C%20you%20can%20use%20HTTP,can%20enhance%20web%20scraping%20efficiency\n",
607 | "\n",
608 | "https://www.zenrows.com/blog/curl-cffi#step-2\n",
609 | "\n",
610 | "https://webscraping.fyi/lib/compare/python-curl-cffi-vs-python-undetected-chromedriver/\n",
611 | "\n",
612 | "https://www.zenrows.com/blog/curl-cffi#step-3\n",
613 | "\n",
614 | "https://www.zenrows.com/blog/pyppeteer#dynamic-pages\n",
615 | "\n",
616 | "https://scrapeops.io/web-scraping-playbook/web-scraping-without-getting-blocked/\n",
617 | "\n",
618 | "https://brightdata.com/blog/web-data/selenium-user-agent\n",
619 | "\n",
620 | "https://webseekerj.medium.com/change-the-user-agent-in-selenium-steps-best-practices-4e0f25438db3\n",
621 | "\n",
622 | "https://scrapfly.io/blog/how-to-avoid-web-scraping-blocking-javascript/\n",
623 | "\n",
624 | "https://medium.com/@yahyamrafe202/mastering-dynamic-web-scraping-from-challenges-to-solutions-with-playwright-088bfaa44a60\n",
625 | "\n",
626 | "https://www.zenrows.com/blog/selenium-headers#set-up-custom-headers\n",
627 | "\n",
628 | "https://pypi.org/project/selenium-wire/\n",
629 | "\n",
630 | "https://scrapeops.io/selenium-web-scraping-playbook/python-selenium-wire/#modifying-request-headers\n",
631 | "\n",
632 | "https://medium.com/@dungwoong/pretending-im-a-human-while-web-scraping-d5464e36f24\n",
633 | "\n",
634 | "https://datawookie.dev/blog/2023/03/chrome-devtools-protocol-selenium/\n",
635 | "\n",
636 | "https://www.scrapingbee.com/blog/selenium-python/"
637 | ]
638 | }
639 | ],
640 | "metadata": {
641 | "kernelspec": {
642 | "display_name": "Python 3 (ipykernel)",
643 | "language": "python",
644 | "name": "python3"
645 | },
646 | "language_info": {
647 | "codemirror_mode": {
648 | "name": "ipython",
649 | "version": 3
650 | },
651 | "file_extension": ".py",
652 | "mimetype": "text/x-python",
653 | "name": "python",
654 | "nbconvert_exporter": "python",
655 | "pygments_lexer": "ipython3",
656 | "version": "3.10.9"
657 | }
658 | },
659 | "nbformat": 4,
660 | "nbformat_minor": 5
661 | }
662 |
--------------------------------------------------------------------------------
/Indeed_scrape_Oct2020.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrFuguDataScience/Webscraping/HEAD/Indeed_scrape_Oct2020.txt
--------------------------------------------------------------------------------
/Indeed_webscrape.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# `Python Webscraping Indeed`\n",
8 | "\n",
9 | "# Mr Fugu Data Science\n",
10 | "\n",
11 | "# (◕‿◕✿)\n",
12 | "\n",
13 | "# Purpose & Outcome:\n",
14 | "\n",
15 |     "+ `Webscrape Indeed`: take the position/job title, location, date of posting, and a string of qualifications\n",
16 |     "\n",
17 |     "+ Use a list of words related to your skills, or skills you are interested in, and extract them from the job post's qualifications section.\n",
18 |     "\n",
19 |     "+ Date-time formatting\n",
20 | "\n",
21 | "`------------------------------`\n",
22 | "\n",
23 | "# `Next video Word Cloud and NLP using these data`\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# pip3 install wordcloud # (if you don't have already)\n",
33 | "import pandas as pd\n",
34 | "import numpy as np\n",
35 | "import PIL # for images\n",
36 |     "import wordcloud # plotting word cloud\n",
37 | "import requests # grab web-page\n",
38 | "import pickle # save file\n",
39 | "from bs4 import BeautifulSoup as bsopa # parse web-page\n",
40 | "import datetime # format date/time\n",
41 | "from collections import defaultdict\n",
42 | "import re # regular expressions"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 2,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "# if you don't have wordcloud, get that now:\n",
52 | "\n",
53 | "# !pip3 install wordcloud\n",
54 | "# conda install -c conda-forge wordcloud"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 3,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "'''\n",
64 | "ex.) https://www.indeed.com/jobs?q=data+scientist&l=california&start=10\n",
65 | "\n",
66 | "range(0,150,10): each page will have \"start=0,start=10,start=20 etc\" which deals with\n",
67 | "going through each page, but not exactly 10 entries/pg. \n",
68 | "\n",
69 | "string formatting is used to denote what job and location we want: you can use a string\n",
70 | "separated by space and it will be interpreted by the website. \n",
71 | "\n",
72 | "sou=bsopa(y.text,'lxml') is taking our get request and converting to text in the format\n",
73 | "of 'lxml' but we can replace with 'html.parser' as well.\n",
74 | "\n",
75 | "Each job post can be parsed by the 'div',{\"class\":\"jobsearch-SerpJobCard\"} or \n",
76 | "'div', {'class': 'row'} depending on how you want to search\n",
77 | "\n",
78 | "After that we get each piece of information we want to obtain: job title, location etc.\n",
79 | "\n",
80 | "the only difficult and frustrating part is getting all the raw text for each posting\n",
81 | "relating to the qulifications. We have to open a link, iterate through it and then extract\n",
82 | "all the information with a try/except block and then further process.\n",
83 | "'''\n",
84 | "\n",
85 | "gg=[]\n",
86 | "for j in range(0,150,10):\n",
87 | " position,location='data scientist','california'\n",
88 |     "    y=requests.get('https://www.indeed.com/jobs?q={}&l={}&sort=date&start='.format(position,location)+str(j))\n",
89 | "\n",
90 | " # y=requests.get('https://www.indeed.com/jobs?q=data+scientist&l=california&sort=date='+str(i))\n",
91 | " sou=bsopa(y.text,'lxml')\n",
92 | "\n",
93 | "# for ii in sou.find_all('div', {'class': 'row'}):\n",
94 | " for ii in sou.find_all('div',{\"class\":\"jobsearch-SerpJobCard\"}):\n",
95 | "\n",
96 | " job_title = ii.find('a', {'data-tn-element': 'jobTitle'})['title']\n",
97 | " company_name = ii.find('span', {'class': 'company'}).text.strip() \n",
98 | " location=ii.find('span',{\"class\":\"location\"})\n",
99 | " post_date = ii.find('span', attrs={'class': 'date'})\n",
100 | " summary=ii.find('div',attrs={'class':'summary'})\n",
101 | "\n",
102 | " if location:\n",
103 | " location=location.text.strip()\n",
104 | " else:\n",
105 | " location=ii.find('div',{\"class\":\"location\"})\n",
106 | " location=location.text.strip()\n",
107 | "\n",
108 | " k=ii.find('h2', {'class':\"title\"})\n",
109 | " p=k.find(href=True)\n",
110 | " v=p['href']\n",
111 |     "        f_=str(v).replace('&amp;','&') # links to iterate for qualification text\n",
112 | " \n",
113 | " \n",
114 | " datum = {'job_title': job_title,\n",
115 | " 'company_name': company_name,\n",
116 | " 'location': location,\n",
117 | " 'summary':summary.text.strip(),\n",
118 | " 'post_Date':post_date.text,\n",
119 | " 'Qualification_link': f_}\n",
120 | "\n",
121 | "# gg.append([location,job_title,company_name,post_date.text,summary.text.strip()\n",
122 | "# ,f_]) \n",
123 | " gg.append(datum)\n"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 18,
129 | "metadata": {},
130 | "outputs": [
131 | {
132 | "data": {
133 | "text/plain": [
134 | "{'job_title': 'Data Scientist, Medical Diagnostics',\n",
135 | " 'company_name': 'Specific Diagnostics',\n",
136 | " 'location': 'Mountain View, CA 94043',\n",
137 | " 'summary': 'The development of data-driven visualization tools for the explorative analysis of high-dimensional data.\\nUsed for bloodstream infection Specific’s solution…',\n",
138 | " 'post_Date': '30+ days ago',\n",
139 | " 'Qualification_link': '/pagead/clk?mo=r&ad=-6NYlbfkN0ASXGwdLWjBNYivRaShgyLT0X0BdS_EVSQWni2gtHz1pazciS9pJg_LVuqiSysEs-TMIwFFOo7W9bIb5VzrANtYopVIA4KVaF9yg6fI6EvZwtW2vAGjQWCOF3vWYBeKjokeHeoKgHupsPPJigSH8DaOChlEr7xUj3CXaKRa_EOHrA6Ldh3t0mnVz60SF4_2_jhFn1u1wbdVtozdppWb5GiEKmniDeQACKzNRIGkP3xmVqUnjq5MMrmY1NqR7k5fh3cPGaATLJsZd-WryLFGCzYyZ0O0DSDpMoFrVmXLdHTYbc2rlN4D-os2fc8yh2hY3l-w1BBYD4ni2Xijf6AbsJAwgc-TJ7zuUfwXogqDiLRBm7Myr-kwZKeXDZ9VLJdEiUmvYej_PpYSXrLTEZWMpGbI04d6fAr384U39vUPc-x5r-XOXRxt9hBKEZJUol9Q3zlu6xQNTvceFzUiPVRh_jQd1ulw5_cTkvDGmRIMNJG9lg==&p=0&fvj=1&vjs=3'}"
140 | ]
141 | },
142 | "execution_count": 18,
143 | "metadata": {},
144 | "output_type": "execute_result"
145 | }
146 | ],
147 | "source": [
148 | "len(gg)\n",
149 | "gg[0]"
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
156 | "# `Links to extract qualification information`\n",
157 | "\n",
158 |     "+ `These are the links used when you click a job and it expands the short summary text`\n",
159 |     "\n",
160 |     "+ `After you expand the text it will show the full job description and related information`\n"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 5,
166 | "metadata": {},
167 | "outputs": [
168 | {
169 | "data": {
170 | "text/plain": [
171 | "[0,\n",
172 | " 'The CompanyThe world is facing a medical crisis, bacteria are increasingly evolving resistance to even our strongest antibiotics. The problem is already very real and immediate; for example, bloodstream infection leading to sepsis is now responsible for more than half of all deaths in hospitals and is the most expensive condition treated in hospitals. Sepsis mortality rate increases >6% every hour without effective antibiotic treatment. Yet, despite the life and death urgency, and healthcare cost impacts, current methodologies require 3 days to determine the correct antibiotic.']"
173 | ]
174 | },
175 | "execution_count": 5,
176 | "metadata": {},
177 | "output_type": "execute_result"
178 | }
179 | ],
180 | "source": [
181 | "# get all qualification page text: key=index, value=string of text for qualification\n",
182 | "hoop=[]\n",
183 | "for i in range(len(gg)):\n",
184 | " op=requests.get('https://www.indeed.com'+gg[i]['Qualification_link'])\n",
185 | " sou_=bsopa(op.text,'html.parser')\n",
186 | " for ii in sou_.find('div',{'class':'jobsearch-jobDescriptionText'}):\n",
187 | " try:\n",
188 | " hoop.append([i,''.join(ii.text.strip())])\n",
189 | " except AttributeError:\n",
190 | " hoop.append([i,''])\n",
191 | "hoop[0]"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 21,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": [
200 | "# create dictionary with values as lists\n",
201 | "dct_lst= defaultdict(list)\n",
202 | "for i in hoop:\n",
203 | " dct_lst[i[0]].append(i[1])\n",
204 | "u=[]\n",
205 | "for i in dct_lst.values(): # string join: lists of lists of strings\n",
206 | " u.append(''.join(i))\n"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 31,
212 | "metadata": {},
213 | "outputs": [
214 | {
215 | "data": {
216 | "text/plain": [
217 | "\"Our Mission\\nWe’re here to create a safer, happier and more mindful future for all with the help of data science, engineering, design, and mobile technology. We're starting by reinventing insurance, by rethinking the technologies that enable it, but our true goal is to build a platform that rewards people for driving well — creating safer roads with fewer accidents in the process.\\n\\nBacked by impressive funding, we're poised to re-engineer a trillion-dollar category. We’re using rich customer insights, advanced technology and data science to build our cloud-native InsurTech solution. We're out to change behavior and promote mindful living.\\n\\nJob Requirements\\nYou must have all the skills and experiences associated with one of the options below. These options change over time and reflect the current needs of the team.\\nOption one - Telematics ML\\nMachine learning: Kaggle grandmaster-level skills with history of deploying models to production\\nTelematics experience: You have led telematics modeling at a major company known for telematics\\nSoftware development in Python: Strong scientific Python skills using pandas, scikit-learn, and statsmodels, as well as strong understanding of model deployment patterns\\nOption two - Telematics DL\\nDeep learning: Experience contributing new deep learning research to solve previously unsolved problems\\nTelematics experience: You have led telematics modeling at a major company known for telematics\\nSoftware development in Python: Strong scientific Python skills using pandas, scikit-learn, and statsmodels, as well as strong understanding of model deployment patterns\\nOption Three - Contact Center Optimization\\nContact center knowledge: You are an expert at using data science methods to optimize contact centers\\nMachine learning: Strong understanding of ML theory with significant applied experience\\nSoftware development in Python: Strong scientific Python skills using pandas, scikit-learn, and statsmodels, as well as strong understanding of model deployment patterns\\nMore details\\nSalary: We invest in first-rate people and pay top-of-market salaries for most positions, factoring in experience, talent and location. We do not offer equity.\\nBenefits: Medical, dental, vision, 401(k), wellness reimbursement, four weeks of vacation + six weeks of parental leave, and great work-life balance. Our office building offers on-site shower and bike stalls, and panoramic views of San Francisco.\\nLocation: Due to COVID-19 our teams are all working remotely through 2020. We provide an in-home office set-up including laptop, monitor, ergonomic desk, chair and other items as needed\\nLocation: Post COVID-19: San Francisco, CA near Montgomery Bart Station\\nAll are welcome at Blue Owl. We are an equal opportunity and affirmative action employer who values diversity and inclusion and looks for applicants who understand, embrace and thrive in a multicultural world. We do not discriminate on the basis of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or Veteran status. Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.\\n\\n If you are a San Francisco resident, please read the City and County of San Francisco's Fair Chance Ordinance notice.\\nhttps://sfgov.org/olse/sites/default/files/FCO%20poster2020.pdf\""
218 | ]
219 | },
220 | "execution_count": 31,
221 | "metadata": {},
222 | "output_type": "execute_result"
223 | }
224 | ],
225 | "source": [
226 | "# one entry of our qualification text:\n",
227 | "u[2]"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 43,
233 | "metadata": {},
234 | "outputs": [
235 | {
236 | "data": {
237 | "text/html": [
238 | "
\n",
239 | "\n",
252 | "
\n",
253 | " \n",
254 | " \n",
255 | " | \n",
256 | " job_title | \n",
257 | " company_name | \n",
258 | " location | \n",
259 | " summary | \n",
260 | " post_Date | \n",
261 | " Qualification_link | \n",
262 | " Qual_Text | \n",
263 | "
\n",
264 | " \n",
265 | " \n",
266 | " \n",
267 | " | 0 | \n",
268 | " Data Scientist, Medical Diagnostics | \n",
269 | " Specific Diagnostics | \n",
270 | " Mountain View, CA 94043 | \n",
271 | " The development of data-driven visualization t... | \n",
272 | " 30+ days ago | \n",
273 | " /pagead/clk?mo=r&ad=-6NYlbfkN0ASXGwdLWjBNYivRa... | \n",
274 | " The CompanyThe world is facing a medical crisi... | \n",
275 | "
\n",
276 | " \n",
277 | " | 1 | \n",
278 | " Data Scientist | \n",
279 | " Laxmi Therapeutic Devices | \n",
280 | " Goleta, CA 93117 | \n",
281 | " 7+ years' practical experience manipulating da... | \n",
282 | " 16 days ago | \n",
283 | " /pagead/clk?mo=r&ad=-6NYlbfkN0ALgD31io3l0I0Y-r... | \n",
284 | " Data ScientistLaxmi Therapeutic Devices – Gole... | \n",
285 | "
\n",
286 | " \n",
287 | " | 2 | \n",
288 | " Data Scientist | \n",
289 | " Blue Owl | \n",
290 | " San Francisco, CA | \n",
291 | " We’re using rich customer insights, advanced t... | \n",
292 | " 30+ days ago | \n",
293 | " /pagead/clk?mo=r&ad=-6NYlbfkN0D3UvD5kBSgX9r9tF... | \n",
294 | " Our Mission\\nWe’re here to create a safer, hap... | \n",
295 | "
\n",
296 | " \n",
297 | " | 3 | \n",
298 | " Data Engineer | \n",
299 | " Amick Brown, LLC | \n",
300 | " Sunnyvale, CA | \n",
301 | " Develops technical tools and programming that ... | \n",
302 | " Today | \n",
303 | " /pagead/clk?mo=r&ad=-6NYlbfkN0A74pTrSPrBtiJlYH... | \n",
304 | " Data EngineerSunnyvale, CAAmick Brown is seeki... | \n",
305 | "
\n",
306 | " \n",
307 | " | 4 | \n",
308 | " Data Scientist | \n",
309 | " Triplebyte | \n",
310 | " California | \n",
311 | " You'll report directly to Triplebytes' Head of... | \n",
312 | " 22 days ago | \n",
313 | " /pagead/clk?mo=r&ad=-6NYlbfkN0AMr11YIOo206dX9C... | \n",
314 | " About Triplebyte\\n\\nTriplebyte is transforming... | \n",
315 | "
\n",
316 | " \n",
317 | "
\n",
318 | "
"
319 | ],
320 | "text/plain": [
321 | " job_title company_name \\\n",
322 | "0 Data Scientist, Medical Diagnostics Specific Diagnostics \n",
323 | "1 Data Scientist Laxmi Therapeutic Devices \n",
324 | "2 Data Scientist Blue Owl \n",
325 | "3 Data Engineer Amick Brown, LLC \n",
326 | "4 Data Scientist Triplebyte \n",
327 | "\n",
328 | " location summary \\\n",
329 | "0 Mountain View, CA 94043 The development of data-driven visualization t... \n",
330 | "1 Goleta, CA 93117 7+ years' practical experience manipulating da... \n",
331 | "2 San Francisco, CA We’re using rich customer insights, advanced t... \n",
332 | "3 Sunnyvale, CA Develops technical tools and programming that ... \n",
333 | "4 California You'll report directly to Triplebytes' Head of... \n",
334 | "\n",
335 | " post_Date Qualification_link \\\n",
336 | "0 30+ days ago /pagead/clk?mo=r&ad=-6NYlbfkN0ASXGwdLWjBNYivRa... \n",
337 | "1 16 days ago /pagead/clk?mo=r&ad=-6NYlbfkN0ALgD31io3l0I0Y-r... \n",
338 | "2 30+ days ago /pagead/clk?mo=r&ad=-6NYlbfkN0D3UvD5kBSgX9r9tF... \n",
339 | "3 Today /pagead/clk?mo=r&ad=-6NYlbfkN0A74pTrSPrBtiJlYH... \n",
340 | "4 22 days ago /pagead/clk?mo=r&ad=-6NYlbfkN0AMr11YIOo206dX9C... \n",
341 | "\n",
342 | " Qual_Text \n",
343 | "0 The CompanyThe world is facing a medical crisi... \n",
344 | "1 Data ScientistLaxmi Therapeutic Devices – Gole... \n",
345 | "2 Our Mission\\nWe’re here to create a safer, hap... \n",
346 | "3 Data EngineerSunnyvale, CAAmick Brown is seeki... \n",
347 | "4 About Triplebyte\\n\\nTriplebyte is transforming... "
348 | ]
349 | },
350 | "execution_count": 43,
351 | "metadata": {},
352 | "output_type": "execute_result"
353 | }
354 | ],
355 | "source": [
356 | "jobs_=pd.concat([pd.DataFrame(gg),pd.DataFrame(u,columns=['Qual_Text'])],axis=1)\n",
357 | "jobs_.head()\n"
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "# `Parse posting date`: convert string data to official dates"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": 37,
370 | "metadata": {},
371 | "outputs": [
372 | {
373 | "data": {
374 | "text/plain": [
375 | "['09-10-2020', '09-24-2020', '09-10-2020', '10-10-2020', '09-18-2020']"
376 | ]
377 | },
378 | "execution_count": 37,
379 | "metadata": {},
380 | "output_type": "execute_result"
381 | }
382 | ],
383 | "source": [
384 | "import re\n",
385 | "v=[]\n",
386 | "for i in jobs_['post_Date']:\n",
387 | "\n",
388 | " if re.findall(r'[0-9]',i):\n",
389 | " # if the string has digits convert each entry to single string: ['3','0']->'30'\n",
390 | " b=''.join(re.findall(r'[0-9]',i))\n",
391 | " \n",
392 | " # convert string int to int and subtract from today's date and format\n",
393 | " g=(datetime.datetime.today()-datetime.timedelta(int(b))).strftime('%m-%d-%Y')\n",
394 | "\n",
395 | " v.append(g)\n",
396 | " \n",
397 | " else: # this will contain strings like: 'just posted' or 'today' etc before convert\n",
398 | " v.append(datetime.datetime.today().strftime('%m-%d-%Y'))\n",
399 | "v[:5]\n"
400 | ]
401 | },
402 | {
403 | "cell_type": "code",
404 | "execution_count": 42,
405 | "metadata": {},
406 | "outputs": [
407 | {
408 | "data": {
409 | "text/html": [
410 | "\n",
411 | "\n",
424 | "
\n",
425 | " \n",
426 | " \n",
427 | " | \n",
428 | " job_title | \n",
429 | " company_name | \n",
430 | " location | \n",
431 | " summary | \n",
432 | " post_Date | \n",
433 | " Qualification_link | \n",
434 | " Qual_Text | \n",
435 | " posting_date_fixed | \n",
436 | " skill_matches | \n",
437 | "
\n",
438 | " \n",
439 | " \n",
440 | " \n",
441 | " | 0 | \n",
442 | " Data Scientist, Medical Diagnostics | \n",
443 | " Specific Diagnostics | \n",
444 | " Mountain View, CA 94043 | \n",
445 | " The development of data-driven visualization t... | \n",
446 | " 30+ days ago | \n",
447 | " /pagead/clk?mo=r&ad=-6NYlbfkN0ASXGwdLWjBNYivRa... | \n",
448 | " The CompanyThe world is facing a medical crisi... | \n",
449 | " 09-10-2020 | \n",
450 | " [python, sql, visualization] | \n",
451 | "
\n",
452 | " \n",
453 | " | 1 | \n",
454 | " Data Scientist | \n",
455 | " Laxmi Therapeutic Devices | \n",
456 | " Goleta, CA 93117 | \n",
457 | " 7+ years' practical experience manipulating da... | \n",
458 | " 16 days ago | \n",
459 | " /pagead/clk?mo=r&ad=-6NYlbfkN0ALgD31io3l0I0Y-r... | \n",
460 | " Data ScientistLaxmi Therapeutic Devices – Gole... | \n",
461 | " 09-24-2020 | \n",
462 | " [python, sql, statistics, algorithms] | \n",
463 | "
\n",
464 | " \n",
465 | " | 2 | \n",
466 | " Data Scientist | \n",
467 | " Blue Owl | \n",
468 | " San Francisco, CA | \n",
469 | " We’re using rich customer insights, advanced t... | \n",
470 | " 30+ days ago | \n",
471 | " /pagead/clk?mo=r&ad=-6NYlbfkN0D3UvD5kBSgX9r9tF... | \n",
472 | " Our Mission\\nWe’re here to create a safer, hap... | \n",
473 | " 09-10-2020 | \n",
474 | " [python, machine learning, deep learning, pandas] | \n",
475 | "
\n",
476 | " \n",
477 | " | 3 | \n",
478 | " Data Engineer | \n",
479 | " Amick Brown, LLC | \n",
480 | " Sunnyvale, CA | \n",
481 | " Develops technical tools and programming that ... | \n",
482 | " Today | \n",
483 | " /pagead/clk?mo=r&ad=-6NYlbfkN0A74pTrSPrBtiJlYH... | \n",
484 | " Data EngineerSunnyvale, CAAmick Brown is seeki... | \n",
485 | " 10-10-2020 | \n",
486 | " [python, sql, aws, machine learning] | \n",
487 | "
\n",
488 | " \n",
489 | " | 4 | \n",
490 | " Data Scientist | \n",
491 | " Triplebyte | \n",
492 | " California | \n",
493 | " You'll report directly to Triplebytes' Head of... | \n",
494 | " 22 days ago | \n",
495 | " /pagead/clk?mo=r&ad=-6NYlbfkN0AMr11YIOo206dX9C... | \n",
496 | " About Triplebyte\\n\\nTriplebyte is transforming... | \n",
497 | " 09-18-2020 | \n",
498 | " [aws, machine learning] | \n",
499 | "
\n",
500 | " \n",
501 | "
\n",
502 | "
"
503 | ],
504 | "text/plain": [
505 | " job_title company_name \\\n",
506 | "0 Data Scientist, Medical Diagnostics Specific Diagnostics \n",
507 | "1 Data Scientist Laxmi Therapeutic Devices \n",
508 | "2 Data Scientist Blue Owl \n",
509 | "3 Data Engineer Amick Brown, LLC \n",
510 | "4 Data Scientist Triplebyte \n",
511 | "\n",
512 | " location summary \\\n",
513 | "0 Mountain View, CA 94043 The development of data-driven visualization t... \n",
514 | "1 Goleta, CA 93117 7+ years' practical experience manipulating da... \n",
515 | "2 San Francisco, CA We’re using rich customer insights, advanced t... \n",
516 | "3 Sunnyvale, CA Develops technical tools and programming that ... \n",
517 | "4 California You'll report directly to Triplebytes' Head of... \n",
518 | "\n",
519 | " post_Date Qualification_link \\\n",
520 | "0 30+ days ago /pagead/clk?mo=r&ad=-6NYlbfkN0ASXGwdLWjBNYivRa... \n",
521 | "1 16 days ago /pagead/clk?mo=r&ad=-6NYlbfkN0ALgD31io3l0I0Y-r... \n",
522 | "2 30+ days ago /pagead/clk?mo=r&ad=-6NYlbfkN0D3UvD5kBSgX9r9tF... \n",
523 | "3 Today /pagead/clk?mo=r&ad=-6NYlbfkN0A74pTrSPrBtiJlYH... \n",
524 | "4 22 days ago /pagead/clk?mo=r&ad=-6NYlbfkN0AMr11YIOo206dX9C... \n",
525 | "\n",
526 | " Qual_Text posting_date_fixed \\\n",
527 | "0 The CompanyThe world is facing a medical crisi... 09-10-2020 \n",
528 | "1 Data ScientistLaxmi Therapeutic Devices – Gole... 09-24-2020 \n",
529 | "2 Our Mission\\nWe’re here to create a safer, hap... 09-10-2020 \n",
530 | "3 Data EngineerSunnyvale, CAAmick Brown is seeki... 10-10-2020 \n",
531 | "4 About Triplebyte\\n\\nTriplebyte is transforming... 09-18-2020 \n",
532 | "\n",
533 | " skill_matches \n",
534 | "0 [python, sql, visualization] \n",
535 | "1 [python, sql, statistics, algorithms] \n",
536 | "2 [python, machine learning, deep learning, pandas] \n",
537 | "3 [python, sql, aws, machine learning] \n",
538 | "4 [aws, machine learning] "
539 | ]
540 | },
541 | "execution_count": 42,
542 | "metadata": {},
543 | "output_type": "execute_result"
544 | }
545 | ],
546 | "source": [
547 | "# fixed posting date to date format instead of string: last column\n",
548 | "jobs_['posting_date_fixed']=v\n",
549 | "jobs_.head()"
550 | ]
551 | },
552 | {
553 | "cell_type": "markdown",
554 | "metadata": {},
555 | "source": [
556 | "\n",
557 | "\n",
558 | "# `Create a list of skills that you may have or general list`\n",
559 | "\n",
560 | "+ This can be used in a few ways, for example if I put all my skills and see what matched up for jobs.\n",
561 | "\n",
562 | "+ We could generate a corpus and start parsing through job postings not just data scientist, but similar jobs as well. After all the matches for indeed are not the best!"
563 | ]
564 | },
565 | {
566 | "cell_type": "code",
567 | "execution_count": 2,
568 | "metadata": {},
569 | "outputs": [],
570 | "source": [
571 | "buzz_words=['Python','SQL','AWS', 'Machine learning','Deep learning','Text mining',\n",
572 | "'NLP','SAS','Tableau','Sagemaker','Tensorflow','Spark', 'numpy', 'MongDB','PSQL',\n",
573 | "\"Postgres\", 'Pandas', 'RESTFUL','NLP','Statistics','Algorithms','Visualization',\n",
574 | "'GCP','Google Cloud','Naive Bayes','Random Forest','Bachelors degree','Masters degree'\n",
575 | "'Java','Pyspark','Postgres','MySQL','Github','Docker','Machine Learning','C+',\n",
576 | "'C++','Pytorch','Jupyter Notebook','R Studio','R-Studio','Forecasting','Hive',\n",
577 | "'PhD','GCP','Numpy','NoSQL','Neo4j','Neural Network','Clustering','Linear Algebra',\n",
578 | "'Google Colab','Data Mining','Regression','Time Series','ETL','Data Wrangling',\n",
579 | "'Web Scraping','Feature Extraction','Featuring Engineering','Scipy','ML','DL']\n",
580 | "buzz_words_list=[x.lower() for x in buzz_words] # convert list to lowercase to parse\n",
581 | "\n",
582 | "yo=[]\n",
583 | "for i in range(len(jobs_.Qual_Text)):\n",
584 | " a=buzz_words_list\n",
585 | " dd=[x for x in a if x in jobs_.Qual_Text[i].lower()]\n",
586 | " yo.append(dd)\n",
587 | "jobs_['skill_matches']=yo\n",
588 | "jobs_.head(7)\n"
589 | ]
590 | },
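{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to use the `skill_matches` column (a minimal sketch, assuming you want to rank skills by how many postings mention them; this is illustrative and not part of the original scrape):\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# set() so a buzz word that appears twice in the list above only counts once per posting\n",
"skill_counts = Counter(skill for matches in jobs_['skill_matches'] for skill in set(matches))\n",
"skill_counts.most_common(10)\n",
"```"
]
},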
591 | {
592 | "cell_type": "code",
593 | "execution_count": 65,
594 | "metadata": {},
595 | "outputs": [
596 | {
597 | "data": {
598 | "text/html": [
599 | "\n",
600 | "\n",
613 | "
\n",
614 | " \n",
615 | " \n",
616 | " | \n",
617 | " job_title | \n",
618 | " company_name | \n",
619 | " location | \n",
620 | " summary | \n",
621 | " post_Date | \n",
622 | " Qualification_link | \n",
623 | " Qual_Text | \n",
624 | " skill_matches | \n",
625 | "
\n",
626 | " \n",
627 | " \n",
628 | " \n",
629 | " | 0 | \n",
630 | " Data Scientist, Medical Diagnostics | \n",
631 | " Specific Diagnostics | \n",
632 | " Mountain View, CA 94043 | \n",
633 | " The development of data-driven visualization t... | \n",
634 | " 30+ days ago | \n",
635 | " /pagead/clk?mo=r&ad=-6NYlbfkN0ASXGwdLWjBNYivRa... | \n",
636 | " The CompanyThe world is facing a medical crisi... | \n",
637 | " [python, sql, visualization, c+, c++, nosql] | \n",
638 | "
\n",
639 | " \n",
640 | " | 1 | \n",
641 | " Data Scientist | \n",
642 | " Laxmi Therapeutic Devices | \n",
643 | " Goleta, CA 93117 | \n",
644 | " 7+ years' practical experience manipulating da... | \n",
645 | " 16 days ago | \n",
646 | " /pagead/clk?mo=r&ad=-6NYlbfkN0ALgD31io3l0I0Y-r... | \n",
647 | " Data ScientistLaxmi Therapeutic Devices – Gole... | \n",
648 | " [python, sql, statistics, algorithms] | \n",
649 | "
\n",
650 | " \n",
651 | " | 2 | \n",
652 | " Data Scientist | \n",
653 | " Blue Owl | \n",
654 | " San Francisco, CA | \n",
655 | " We’re using rich customer insights, advanced t... | \n",
656 | " 30+ days ago | \n",
657 | " /pagead/clk?mo=r&ad=-6NYlbfkN0D3UvD5kBSgX9r9tF... | \n",
658 | " Our Mission\\nWe’re here to create a safer, hap... | \n",
659 | " [python, machine learning, deep learning, pand... | \n",
660 | "
\n",
661 | " \n",
662 | " | 3 | \n",
663 | " Data Engineer | \n",
664 | " Amick Brown, LLC | \n",
665 | " Sunnyvale, CA | \n",
666 | " Develops technical tools and programming that ... | \n",
667 | " Today | \n",
668 | " /pagead/clk?mo=r&ad=-6NYlbfkN0A74pTrSPrBtiJlYH... | \n",
669 | " Data EngineerSunnyvale, CAAmick Brown is seeki... | \n",
670 | " [python, sql, aws, machine learning, machine l... | \n",
671 | "
\n",
672 | " \n",
673 | " | 4 | \n",
674 | " Data Scientist | \n",
675 | " Triplebyte | \n",
676 | " California | \n",
677 | " You'll report directly to Triplebytes' Head of... | \n",
678 | " 22 days ago | \n",
679 | " /pagead/clk?mo=r&ad=-6NYlbfkN0AMr11YIOo206dX9C... | \n",
680 | " About Triplebyte\\n\\nTriplebyte is transforming... | \n",
681 | " [aws, machine learning, machine learning, time... | \n",
682 | "
\n",
683 | " \n",
684 | " | 5 | \n",
685 | " Data Scientist, Analytics | \n",
686 | " Evernote | \n",
687 | " Redwood City, CA 94063 (Downtown area) | \n",
688 | " Apply your expertise in quantitative analysis,... | \n",
689 | " 3 days ago | \n",
690 | " /rc/clk?jk=e6e04f20bd7ac8f1&fccid=4d2449c755ba... | \n",
691 | " About the team:\\n\\nThe Evernote Analytics team... | \n",
692 | " [python, sql, sas, tableau, statistics, visual... | \n",
693 | "
\n",
694 | " \n",
695 | " | 6 | \n",
696 | " Data Scientist | \n",
697 | " Apple | \n",
698 | " Santa Clara Valley, CA 95014 | \n",
699 | " In-depth knowledge of digital analytics data, ... | \n",
700 | " Today | \n",
701 | " /rc/clk?jk=5a488b6a19d296c3&fccid=c1099851e979... | \n",
702 | " Summary\\nPosted: Oct 9, 2020\\nWeekly Hours: 40... | \n",
703 | " [sql, tableau, visualization] | \n",
704 | "
\n",
705 | " \n",
706 | " | 7 | \n",
707 | " Data Scientist | \n",
708 | " RMDS Lab, Inc | \n",
709 | " South Lake, CA | \n",
710 | " Provide consultation on data science and machi... | \n",
711 | " 17 days ago | \n",
712 | " /company/RMDS-Lab/jobs/Data-Scientist-c487fd99... | \n",
713 | " About RMDS LabRMDS Lab and its Global Associat... | \n",
714 | " [python, machine learning, machine learning] | \n",
715 | "
\n",
716 | " \n",
717 | " | 8 | \n",
718 | " Data Scientist 2 | \n",
719 | " Xoom | \n",
720 | " San Francisco, CA 94105 (Financial District area) | \n",
721 | " Clear subject matter expert who can help mento... | \n",
722 | " Today | \n",
723 | " /rc/clk?jk=00c2ef3c615076e9&fccid=978d9fd9799d... | \n",
724 | " Who we are: Fueled by a fundamental belief tha... | \n",
725 | " [statistics] | \n",
726 | "
\n",
727 | " \n",
728 | " | 9 | \n",
729 | " Data Science Intern | \n",
730 | " Parsec Education | \n",
731 | " Fresno, CA 93721 (Central area) | \n",
732 | " Parsec Education specializes in analyzing stat... | \n",
733 | " 10 days ago | \n",
734 | " /pagead/clk?mo=r&ad=-6NYlbfkN0B5bikR7eyU0Vk4cd... | \n",
735 | " Greetings Applicant!Here at Parsec Education w... | \n",
736 | " [python, sql, machine learning, visualization,... | \n",
737 | "
\n",
738 | " \n",
739 | "
\n",
740 | "
"
741 | ],
742 | "text/plain": [
743 | " job_title company_name \\\n",
744 | "0 Data Scientist, Medical Diagnostics Specific Diagnostics \n",
745 | "1 Data Scientist Laxmi Therapeutic Devices \n",
746 | "2 Data Scientist Blue Owl \n",
747 | "3 Data Engineer Amick Brown, LLC \n",
748 | "4 Data Scientist Triplebyte \n",
749 | "5 Data Scientist, Analytics Evernote \n",
750 | "6 Data Scientist Apple \n",
751 | "7 Data Scientist RMDS Lab, Inc \n",
752 | "8 Data Scientist 2 Xoom \n",
753 | "9 Data Science Intern Parsec Education \n",
754 | "\n",
755 | " location \\\n",
756 | "0 Mountain View, CA 94043 \n",
757 | "1 Goleta, CA 93117 \n",
758 | "2 San Francisco, CA \n",
759 | "3 Sunnyvale, CA \n",
760 | "4 California \n",
761 | "5 Redwood City, CA 94063 (Downtown area) \n",
762 | "6 Santa Clara Valley, CA 95014 \n",
763 | "7 South Lake, CA \n",
764 | "8 San Francisco, CA 94105 (Financial District area) \n",
765 | "9 Fresno, CA 93721 (Central area) \n",
766 | "\n",
767 | " summary post_Date \\\n",
768 | "0 The development of data-driven visualization t... 30+ days ago \n",
769 | "1 7+ years' practical experience manipulating da... 16 days ago \n",
770 | "2 We’re using rich customer insights, advanced t... 30+ days ago \n",
771 | "3 Develops technical tools and programming that ... Today \n",
772 | "4 You'll report directly to Triplebytes' Head of... 22 days ago \n",
773 | "5 Apply your expertise in quantitative analysis,... 3 days ago \n",
774 | "6 In-depth knowledge of digital analytics data, ... Today \n",
775 | "7 Provide consultation on data science and machi... 17 days ago \n",
776 | "8 Clear subject matter expert who can help mento... Today \n",
777 | "9 Parsec Education specializes in analyzing stat... 10 days ago \n",
778 | "\n",
779 | " Qualification_link \\\n",
780 | "0 /pagead/clk?mo=r&ad=-6NYlbfkN0ASXGwdLWjBNYivRa... \n",
781 | "1 /pagead/clk?mo=r&ad=-6NYlbfkN0ALgD31io3l0I0Y-r... \n",
782 | "2 /pagead/clk?mo=r&ad=-6NYlbfkN0D3UvD5kBSgX9r9tF... \n",
783 | "3 /pagead/clk?mo=r&ad=-6NYlbfkN0A74pTrSPrBtiJlYH... \n",
784 | "4 /pagead/clk?mo=r&ad=-6NYlbfkN0AMr11YIOo206dX9C... \n",
785 | "5 /rc/clk?jk=e6e04f20bd7ac8f1&fccid=4d2449c755ba... \n",
786 | "6 /rc/clk?jk=5a488b6a19d296c3&fccid=c1099851e979... \n",
787 | "7 /company/RMDS-Lab/jobs/Data-Scientist-c487fd99... \n",
788 | "8 /rc/clk?jk=00c2ef3c615076e9&fccid=978d9fd9799d... \n",
789 | "9 /pagead/clk?mo=r&ad=-6NYlbfkN0B5bikR7eyU0Vk4cd... \n",
790 | "\n",
791 | " Qual_Text \\\n",
792 | "0 The CompanyThe world is facing a medical crisi... \n",
793 | "1 Data ScientistLaxmi Therapeutic Devices – Gole... \n",
794 | "2 Our Mission\\nWe’re here to create a safer, hap... \n",
795 | "3 Data EngineerSunnyvale, CAAmick Brown is seeki... \n",
796 | "4 About Triplebyte\\n\\nTriplebyte is transforming... \n",
797 | "5 About the team:\\n\\nThe Evernote Analytics team... \n",
798 | "6 Summary\\nPosted: Oct 9, 2020\\nWeekly Hours: 40... \n",
799 | "7 About RMDS LabRMDS Lab and its Global Associat... \n",
800 | "8 Who we are: Fueled by a fundamental belief tha... \n",
801 | "9 Greetings Applicant!Here at Parsec Education w... \n",
802 | "\n",
803 | " skill_matches \n",
804 | "0 [python, sql, visualization, c+, c++, nosql] \n",
805 | "1 [python, sql, statistics, algorithms] \n",
806 | "2 [python, machine learning, deep learning, pand... \n",
807 | "3 [python, sql, aws, machine learning, machine l... \n",
808 | "4 [aws, machine learning, machine learning, time... \n",
809 | "5 [python, sql, sas, tableau, statistics, visual... \n",
810 | "6 [sql, tableau, visualization] \n",
811 | "7 [python, machine learning, machine learning] \n",
812 | "8 [statistics] \n",
813 | "9 [python, sql, machine learning, visualization,... "
814 | ]
815 | },
816 | "execution_count": 65,
817 | "metadata": {},
818 | "output_type": "execute_result"
819 | }
820 | ],
821 | "source": [
822 | "\n",
823 | "filename='Indeed_scrape_Oct2020.txt'\n",
824 | "file=open(filename,'wb')\n",
825 | "pickle.dump(jobs_,file)\n",
826 | "\n",
827 | "file_ =open(filename,'rb')\n",
828 | "new_file_ =pickle.load(file_)\n",
829 | "new_file_.head(10)"
830 | ]
831 | },
832 | {
833 | "cell_type": "markdown",
834 | "metadata": {},
835 | "source": [
836 | "# LIKE, Share &\n",
837 | "\n",
838 | "# SUBscribe"
839 | ]
840 | },
841 | {
842 | "cell_type": "markdown",
843 | "metadata": {},
844 | "source": [
845 | "# `Next Video: Word Cloud From Indeed Job Postings, with NLP`\n",
846 | "\n",
847 | "`--------------------`"
848 | ]
849 | },
850 | {
851 | "cell_type": "markdown",
852 | "metadata": {},
853 | "source": [
854 | "https://www.jobspikr.com/blog/scraping-indeed-job-data-using-python/\n",
855 | "\n",
856 | "https://jlgamez.com/how-i-scrape-jobs-data-from-indeed-com-with-python/\n",
857 | "\n",
858 | "http://rstudio-pubs-static.s3.amazonaws.com/495949_1ab68b7e55cf4a3ab1fc00285e28fbcc.html\n",
859 | "\n",
860 | "https://chrislovejoy.me/job-scraper/\n",
861 | "\n",
862 | "https://www.youtube.com/watch?v=eN_3d4JrL_w"
863 | ]
864 | }
865 | ],
866 | "metadata": {
867 | "kernelspec": {
868 | "display_name": "Python 3",
869 | "language": "python",
870 | "name": "python3"
871 | },
872 | "language_info": {
873 | "codemirror_mode": {
874 | "name": "ipython",
875 | "version": 3
876 | },
877 | "file_extension": ".py",
878 | "mimetype": "text/x-python",
879 | "name": "python",
880 | "nbconvert_exporter": "python",
881 | "pygments_lexer": "ipython3",
882 | "version": "3.7.3"
883 | }
884 | },
885 | "nbformat": 4,
886 | "nbformat_minor": 4
887 | }
888 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Webscraping: with `Beautiful Soup4` | `Regex` | `Selenium` | `Py4J` | `Eclipse IDE with Maven`
2 | **This is a two part jupyter notebook**
3 |
4 | This exercise was used to find which countries voted `for/against/abstained` on any given `Resolution` from the United Nations
5 |
6 | + First part of this project: find all webpages relevant to `Session Resolutions`; this entails going 3 pages deep.
7 | * Second part deals with parsing `online pdf` files with `Java PDFBox` and sending that data to `Python` with `Py4J` (a minimal sketch of that hand-off is shown below).
8 | * From there we finally parse these files, which are now plain text, with `Regex` to extract the voting by country
9 |
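A rough idea of the `Py4J` hand-off in part two (an illustrative sketch only: it assumes a Java gateway wrapping `PDFBox` is already running, and `extractText` is a hypothetical method name, not this repo's actual API):

```python
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()  # connect to the JVM started on the Java side
pdf_text = gateway.entry_point.extractText("some_resolution.pdf")  # plain text comes back to Python
print(pdf_text[:200])
```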
10 | `------------------------------------------------------------------------------`
11 |
12 | # `with Mr Fugu Data Science`
13 |
14 | # (◕‿◕✿)
15 |
16 | Also check out my videos: [YouTube](https://www.youtube.com/channel/UCbni-TDI-Ub8VlGaP8HLTNw?view_as=subscriber)
17 |
18 |
19 | **Required installs**:
20 |
21 | `pip install beautifulsoup4` | `pip install pyPDF2` | `pip install Selenium`
22 |
23 |
24 | **Skills Learned**:
25 | + Webscraping
26 | + Basic Regular Expressions (Regex)
27 | + PDF parsing of online material
28 |
29 |
--------------------------------------------------------------------------------
/Selenium_ASYNCIO_Indeed.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "f51e24ac",
6 | "metadata": {},
7 | "source": [
8 | "# `Selenium Indeed Webscrape Speed Up: ASYNIO July 2023`\n",
9 | "\n",
10 | "# Mr Fugu Data Science\n",
11 | "\n",
12 | "# (◕‿◕✿)\n",
13 | "\n",
14 | "# `Outcome & Purpose:`\n",
15 | "\n",
16 | "+ Webscrape Indeed\n",
17 | "+ Speed up slow processes\n",
18 | "+ convert to DF and Save"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": null,
24 | "id": "cfa85be0",
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "# Install if you have never used these: unblock the lines below to install if needed\n",
29 | "\n",
30 | "# !pip install webdriver-manager\n",
31 | "# !pip3 install lxml\n",
32 | "# !pip3 install selenium\n",
33 | "# !pip3 install webdriver_manager\n",
34 | "# !pip install --upgrade pip\n",
35 | "# !pip install -U selenium"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": null,
41 | "id": "b158e53d",
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "# Parsing and creating xml data\n",
46 | "from lxml import etree as et\n",
47 | "\n",
48 | "# Store data as a csv file written out\n",
49 | "from csv import writer\n",
50 | "\n",
51 | "# In general to use with timing our function calls to Indeed\n",
52 | "import time\n",
53 | "\n",
54 | "# Assist with creating incremental timing for our scraping to seem more human\n",
55 | "from time import sleep\n",
56 | "\n",
57 | "# Dataframe stuff\n",
58 | "import pandas as pd\n",
59 | "\n",
60 | "# Random integer for more realistic timing for clicks, buttons and searches during scraping\n",
61 | "from random import randint\n",
62 | "\n",
63 | "# Multi Threading\n",
64 | "import threading\n",
65 | "\n",
66 | "# Threading:\n",
67 | "from concurrent.futures import ThreadPoolExecutor, wait"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": null,
73 | "id": "67d4bcda",
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "import selenium\n",
78 | "\n",
79 | "# Check version I am running\n",
80 | "selenium.__version__"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "id": "fb2a42a7",
87 | "metadata": {},
88 | "outputs": [],
89 | "source": [
90 | "# Selenium 4:\n",
91 | "\n",
92 | "from selenium import webdriver\n",
93 | "\n",
94 | "# Starting/Stopping Driver: can specify ports or location but not remote access\n",
95 | "from selenium.webdriver.chrome.service import Service as ChromeService\n",
96 | "\n",
97 | "# Manages Binaries needed for WebDriver without installing anything directly\n",
98 | "from webdriver_manager.chrome import ChromeDriverManager"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "id": "820706d0",
105 | "metadata": {},
106 | "outputs": [],
107 | "source": [
108 | "# Allows searchs similar to beautiful soup: find_all\n",
109 | "from selenium.webdriver.common.by import By\n",
110 | "\n",
111 | "# Try to establish wait times for the page to load\n",
112 | "from selenium.webdriver.support.ui import WebDriverWait\n",
113 | "\n",
114 | "# Wait for specific condition based on defined task: web elements, boolean are examples\n",
115 | "from selenium.webdriver.support import expected_conditions as EC\n",
116 | "\n",
117 | "# Used for keyboard movements, up/down, left/right,delete, etc\n",
118 | "from selenium.webdriver.common.keys import Keys\n",
119 | "\n",
120 | "# Locate elements on page and throw error if they do not exist\n",
121 | "from selenium.common.exceptions import NoSuchElementException"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "id": "5cb822fc",
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "# Allows you to cusotmize: ingonito mode, maximize window size, headless browser, disable certain features, etc\n",
132 | "option= webdriver.ChromeOptions()\n",
133 | "\n",
134 | "# Going undercover:\n",
135 | "option.add_argument(\"--incognito\")\n",
136 | "option.add_argument(\"--headless=new\")\n",
137 | "\n",
138 | "# Finding location, position, radius=35 miles, sort by date and starting page\n",
139 | "paginaton_url = 'https://www.indeed.com/jobs?q={}&l={}&radius=35&filter=0&sort=date&start={}'"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "id": "ae788d9b",
145 | "metadata": {},
146 | "source": [
147 | "# `Asynchronous Code:`\n",
148 | "\n",
149 | "**What is really going on here?**\n",
150 | "\n",
151 | "Essentially, you are splitting routines which may become busy and move to a next step or procedure until the workload clears up. This is allowing a `concurrent` use of time and space. \n",
152 | "\n",
153 | "+ Think of doing individual actions one-by-one and having to wait for the next task (SYNCRONOUS) workload\n",
154 | " + We get a speed up but not exactly Multi-core processing. It is a little different. We are freeing up resources to work on other tasks while one or more are busy and then come back to them later.\n",
155 | "\n",
156 | "\n",
157 | "`-----------------------------------------`"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 2,
163 | "id": "b9566a7e",
164 | "metadata": {},
165 | "outputs": [
166 | {
167 | "name": "stdout",
168 | "output_type": "stream",
169 | "text": [
170 | "Requirement already satisfied: aiohttp in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (3.8.1)\n",
171 | "Requirement already satisfied: attrs>=17.3.0 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from aiohttp) (22.2.0)\n",
172 | "Requirement already satisfied: charset-normalizer<3.0,>=2.0 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from aiohttp) (2.1.1)\n",
173 | "Requirement already satisfied: multidict<7.0,>=4.5 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from aiohttp) (6.0.4)\n",
174 | "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from aiohttp) (4.0.2)\n",
175 | "Requirement already satisfied: yarl<2.0,>=1.0 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from aiohttp) (1.8.2)\n",
176 | "Requirement already satisfied: frozenlist>=1.1.1 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from aiohttp) (1.3.3)\n",
177 | "Requirement already satisfied: aiosignal>=1.1.2 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from aiohttp) (1.3.1)\n",
178 | "Requirement already satisfied: idna>=2.0 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from yarl<2.0,>=1.0->aiohttp) (3.4)\n",
179 | "Collecting asyncio\n",
180 | " Downloading asyncio-3.4.3-py3-none-any.whl (101 kB)\n",
181 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m101.8/101.8 kB\u001b[0m \u001b[31m4.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
182 | "\u001b[?25hInstalling collected packages: asyncio\n",
183 | "Successfully installed asyncio-3.4.3\n"
184 | ]
185 | }
186 | ],
187 | "source": [
188 | "# For Our Speed Up: AsyncIO and AIOHttp\n",
189 | "\n",
190 | "!pip install aiohttp\n",
191 | "!pip install asyncio\n",
192 | "\n",
193 | "\n",
194 | "import aiohttp\n",
195 | "\n",
196 | "import asyncio"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "id": "eabb3560",
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "start = time.time()\n",
207 | "\n",
208 | "\n",
209 | "job_='Data+Engineer'\n",
210 | "location='Washington'\n",
211 | "\n",
212 | "driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),\n",
213 | " options=option)\n",
214 | "\n",
215 | "\n",
216 | "driver.get(paginaton_url.format(job_,location,0))\n",
217 | "\n",
218 | "# t = ScrapeThread(url_)\n",
219 | "# t.start()\n",
220 | "\n",
221 | "sleep(randint(2, 6))\n",
222 | "\n",
223 | "p=driver.find_element(By.CLASS_NAME,'jobsearch-JobCountAndSortPane-jobCount').text\n",
224 | "\n",
225 | "# Max number of pages for this search! There is a caveat described soon\n",
226 | "max_iter_pgs=int(p.split(' ')[0])//15 \n",
227 | "\n",
228 | "\n",
229 | "driver.quit() # Closing the browser we opened\n",
230 | "\n",
231 | "\n",
232 | "end = time.time()\n",
233 | "\n",
234 | "print(end - start,'seconds to complete action!')\n",
235 | "print('-----------------------')\n",
236 | "print('Max Iterable Pages for this search:',max_iter_pgs)"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "id": "ba7423a4",
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "# Pagination: PRACTICE\n",
247 | "\n",
248 | "start = time.time()\n",
249 | "\n",
250 | "\n",
251 | "job_='Data+Engineer'\n",
252 | "location='Washington'\n",
253 | "\n",
254 | "\n",
255 | "job_lst=[]\n",
256 | "job_description_list_href=[]\n",
257 | "\n",
258 | "# job_description_list = []\n",
259 | "salary_list=[]\n",
260 | "\n",
261 | "\n",
262 | "driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),\n",
263 | " options=option)\n",
264 | "sleep(randint(2, 6))\n",
265 | "\n",
266 | "\n",
267 | "for i in range(0,max_iter_pgs):\n",
268 | " driver.get(paginaton_url.format(job_,location,i*10))\n",
269 | " \n",
270 | " \n",
271 | " sleep(randint(2, 4))\n",
272 | "\n",
273 | " job_page = driver.find_element(By.ID,\"mosaic-jobResults\")\n",
274 | " jobs = job_page.find_elements(By.CLASS_NAME,\"job_seen_beacon\") # return a list\n",
275 | "\n",
276 | " for jj in jobs:\n",
277 | " job_title = jj.find_element(By.CLASS_NAME,\"jobTitle\")\n",
278 | " \n",
279 | "# Href's to get full job description (need to re-terate to get full info)\n",
280 | "# Reference ID for each job used by indeed \n",
281 | "# Finding the company name \n",
282 | "# Location\n",
283 | "# Posting date\n",
284 | "# Job description\n",
285 | "\n",
286 | " job_lst.append([job_title.text,\n",
287 | " job_title.find_element(By.CSS_SELECTOR,\"a\").get_attribute(\"href\"),\n",
288 | " job_title.find_element(By.CSS_SELECTOR,\"a\").get_attribute(\"id\"), \n",
289 | " jj.find_element(By.CLASS_NAME,\"companyName\").text, \n",
290 | " jj.find_element(By.CLASS_NAME,\"companyLocation\").text,\n",
291 | " jj.find_element(By.CLASS_NAME,\"date\").text,\n",
292 | " job_title.find_element(By.CSS_SELECTOR,\"a\").get_attribute(\"href\")])\n",
293 | " \n",
294 | "\n",
295 | " try: # I removed the metadata attached to this class name to work!\n",
296 | " salary_list.append(jj.find_element(By.CLASS_NAME,\"salary-snippet-container\").text)\n",
297 | "\n",
298 | " except NoSuchElementException: \n",
299 | " try: \n",
300 | " salary_list.append(jj.find_element(By.CLASS_NAME,\"estimated-salary\").text)\n",
301 | " \n",
302 | " except NoSuchElementException:\n",
303 | " salary_list.append(None)\n",
304 | " \n",
305 | " \n",
306 | "# # Click the job element to get the description\n",
307 | "# job_title.click()\n",
308 | " \n",
309 | "# # Help to load page so we can find and extract data\n",
310 | "# sleep(randint(3, 5))\n",
311 | "\n",
312 | "# try: \n",
313 | "# job_description_list.append(driver.find_element(By.ID,\"jobDescriptionText\").text)\n",
314 | " \n",
315 | "# except: \n",
316 | " \n",
317 | "# job_description_list.append(None)\n",
318 | "\n",
319 | "driver.quit() \n",
320 | "\n",
321 | "\n",
322 | "end = time.time()\n",
323 | "\n",
324 | "print(end - start,'seconds to complete Query!')"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": null,
330 | "id": "64431ed8",
331 | "metadata": {},
332 | "outputs": [],
333 | "source": []
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": null,
338 | "id": "690fea38",
339 | "metadata": {},
340 | "outputs": [],
341 | "source": []
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": null,
346 | "id": "ec8210be",
347 | "metadata": {},
348 | "outputs": [],
349 | "source": []
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": null,
354 | "id": "606cd638",
355 | "metadata": {},
356 | "outputs": [],
357 | "source": []
358 | },
359 | {
360 | "cell_type": "markdown",
361 | "id": "8dee3356",
362 | "metadata": {},
363 | "source": [
364 | "# Like, Share & SUBscribe"
365 | ]
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "id": "70256639",
370 | "metadata": {},
371 | "source": [
372 | "# `Citations & Help:`\n",
373 | "\n",
374 | "# ◔̯◔\n",
375 | "\n",
376 | "\n",
377 | "Code Optimizing with Asynio, multi-threading and multi-processing:\n",
378 | "\n",
379 | "https://www.geeksforgeeks.org/multithreading-or-multiprocessing-with-python-and-selenium/\n",
380 | "\n",
381 | "https://www.youtube.com/watch?v=-hw3AaxX5B4\n",
382 | "\n",
383 | "https://webnus.net/how-to-speed-up-selenium-automated-tests-in-2022/ (selenium speed up ideas)\n",
384 | "\n",
385 | "https://medium.com/@PhysicistMarianna/scrape-job-postings-data-from-indeed-com-with-python-b4f31340ef5f (bs4 help maybe)\n",
386 | "\n",
387 | "https://github.com/Ram-95/Indeed_Job_Scraper/blob/master/Indeed_Job_Scraper.py (bs4 idea as well)\n",
388 | "\n",
389 | "https://www.youtube.com/watch?v=HOS5Hix--bE\n",
390 | "\n",
391 | "https://stackoverflow.com/questions/75849391/failed-to-fetch-the-job-titles-from-indeed-using-the-requests-module (cloudscraper idea)\n",
392 | "\n",
393 | "https://www.geeksforgeeks.org/multithreading-python-set-1/ (multi-threading ex.)\n",
394 | "\n",
395 | "https://testdriven.io/blog/building-a-concurrent-web-scraper-with-python-and-selenium/ (come back to this! good write up with code...)\n",
396 | "\n",
397 | "https://medium.com/analytics-vidhya/asynchronous-web-scraping-101-fetching-multiple-urls-using-arsenic-ec2c2404ecb4\n",
398 | "\n",
399 | "https://www.youtube.com/watch?v=6ow7xloFy5s\n",
400 | "\n",
401 | "https://us-pycon-2019-tutorial.readthedocs.io/aiohttp_intro.html"
402 | ]
403 | }
404 | ],
405 | "metadata": {
406 | "kernelspec": {
407 | "display_name": "Python 3 (ipykernel)",
408 | "language": "python",
409 | "name": "python3"
410 | },
411 | "language_info": {
412 | "codemirror_mode": {
413 | "name": "ipython",
414 | "version": 3
415 | },
416 | "file_extension": ".py",
417 | "mimetype": "text/x-python",
418 | "name": "python",
419 | "nbconvert_exporter": "python",
420 | "pygments_lexer": "ipython3",
421 | "version": "3.10.9"
422 | }
423 | },
424 | "nbformat": 4,
425 | "nbformat_minor": 5
426 | }
427 |
--------------------------------------------------------------------------------
/Selenium_Webdriver_Issues.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "8a5cc8eb",
6 | "metadata": {},
7 | "source": [
8 | "# `Possible Issues with Selenium Webdriver.Chrome`\n",
9 | "\n",
10 | "# Mr Fugu Data Science\n",
11 | "\n",
12 | "# (。◕‿◕。)\n",
13 | "\n",
14 | "**`Purpose & Outocome:`**\n",
15 | "\n",
16 | "+ Establish a new connection and check the error message\n",
17 | "+ Options to fix & understand the issue\n",
18 | "+ Future options"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "id": "37525bbe",
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "# Selenium 4:\n",
29 | "\n",
30 | "from selenium import webdriver\n",
31 | "\n",
32 | "# Starting/Stopping Driver: can specify ports or location but not remote access\n",
33 | "from selenium.webdriver.chrome.service import Service as ChromeService\n",
34 | "\n",
35 | "# Manages Binaries needed for WebDriver without installing anything directly\n",
36 | "from webdriver_manager.chrome import ChromeDriverManager\n",
37 | "\n",
38 | "# Allows searchs similar to beautiful soup: find_all\n",
39 | "from selenium.webdriver.common.by import By\n",
40 | "\n",
41 | "# Try to establish wait times for the page to load\n",
42 | "from selenium.webdriver.support.ui import WebDriverWait\n",
43 | "\n",
44 | "# Call Sleep Function to log time of operations\n",
45 | "import time\n",
46 | "\n",
47 | "# Random integer for more realistic timing for clicks, buttons and searches during scraping\n",
48 | "from random import randint"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 3,
54 | "id": "3ba0f31e",
55 | "metadata": {},
56 | "outputs": [
57 | {
58 | "data": {
59 | "text/plain": [
60 | "'4.10.0'"
61 | ]
62 | },
63 | "execution_count": 3,
64 | "metadata": {},
65 | "output_type": "execute_result"
66 | }
67 | ],
68 | "source": [
69 | "import selenium\n",
70 | "selenium.__version__"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 4,
76 | "id": "d63faf64",
77 | "metadata": {},
78 | "outputs": [
79 | {
80 | "ename": "ValueError",
81 | "evalue": "There is no such driver by url https://chromedriver.storage.googleapis.com/LATEST_RELEASE_115.0.5790",
82 | "output_type": "error",
83 | "traceback": [
84 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
85 | "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
86 | "Cell \u001b[0;32mIn[4], line 16\u001b[0m\n\u001b[1;32m 13\u001b[0m job_\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mData+Engineer\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m 14\u001b[0m location\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mWashington\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[0;32m---> 16\u001b[0m driver \u001b[38;5;241m=\u001b[39m webdriver\u001b[38;5;241m.\u001b[39mChrome(service\u001b[38;5;241m=\u001b[39mChromeService(\u001b[43mChromeDriverManager\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43minstall\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m),\n\u001b[1;32m 17\u001b[0m options\u001b[38;5;241m=\u001b[39moption)\n\u001b[1;32m 20\u001b[0m \u001b[38;5;66;03m# driver = webdriver.Chrome(service=ChromeService(),\u001b[39;00m\n\u001b[1;32m 21\u001b[0m \u001b[38;5;66;03m# options=option)\u001b[39;00m\n\u001b[1;32m 24\u001b[0m driver\u001b[38;5;241m.\u001b[39mget(paginaton_url\u001b[38;5;241m.\u001b[39mformat(job_,location,\u001b[38;5;241m0\u001b[39m))\n",
87 | "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/webdriver_manager/chrome.py:39\u001b[0m, in \u001b[0;36mChromeDriverManager.install\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 38\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21minstall\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mstr\u001b[39m:\n\u001b[0;32m---> 39\u001b[0m driver_path \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_get_driver_path\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdriver\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 40\u001b[0m os\u001b[38;5;241m.\u001b[39mchmod(driver_path, \u001b[38;5;241m0o755\u001b[39m)\n\u001b[1;32m 41\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m driver_path\n",
88 | "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/webdriver_manager/core/manager.py:30\u001b[0m, in \u001b[0;36mDriverManager._get_driver_path\u001b[0;34m(self, driver)\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m binary_path:\n\u001b[1;32m 28\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m binary_path\n\u001b[0;32m---> 30\u001b[0m file \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_download_manager\u001b[38;5;241m.\u001b[39mdownload_file(\u001b[43mdriver\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_driver_download_url\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m)\n\u001b[1;32m 31\u001b[0m binary_path \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdriver_cache\u001b[38;5;241m.\u001b[39msave_file_to_cache(driver, file)\n\u001b[1;32m 32\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m binary_path\n",
89 | "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/webdriver_manager/drivers/chrome.py:40\u001b[0m, in \u001b[0;36mChromeDriver.get_driver_download_url\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 39\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mget_driver_download_url\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[0;32m---> 40\u001b[0m driver_version_to_download \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_driver_version_to_download\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 41\u001b[0m os_type \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_os_type\n\u001b[1;32m 42\u001b[0m \u001b[38;5;66;03m# For Mac ARM CPUs after version 106.0.5249.61 the format of OS type changed\u001b[39;00m\n\u001b[1;32m 43\u001b[0m \u001b[38;5;66;03m# to more unified \"mac_arm64\". For newer versions, it'll be \"mac_arm64\"\u001b[39;00m\n\u001b[1;32m 44\u001b[0m \u001b[38;5;66;03m# by default, for lower versions we replace \"mac_arm64\" to old format - \"mac64_m1\".\u001b[39;00m\n",
90 | "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/webdriver_manager/core/driver.py:51\u001b[0m, in \u001b[0;36mDriver.get_driver_version_to_download\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 45\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 46\u001b[0m \u001b[38;5;124;03mDownloads version from parameter if version not None or \"latest\".\u001b[39;00m\n\u001b[1;32m 47\u001b[0m \u001b[38;5;124;03mDownloads latest, if version is \"latest\" or browser could not been determined.\u001b[39;00m\n\u001b[1;32m 48\u001b[0m \u001b[38;5;124;03mDownloads determined browser version driver in all other ways as a bonus fallback for lazy users.\u001b[39;00m\n\u001b[1;32m 49\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 50\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_driver_to_download_version:\n\u001b[0;32m---> 51\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_driver_to_download_version \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_version \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_version \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m (\u001b[38;5;28;01mNone\u001b[39;00m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mlatest\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_latest_release_version\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 52\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_driver_to_download_version\n",
91 | "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/webdriver_manager/drivers/chrome.py:62\u001b[0m, in \u001b[0;36mChromeDriver.get_latest_release_version\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 56\u001b[0m log(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mGet LATEST \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_name\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m version for \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_browser_type\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 57\u001b[0m latest_release_url \u001b[38;5;241m=\u001b[39m (\n\u001b[1;32m 58\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_latest_release_url\n\u001b[1;32m 59\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m (\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_version \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mlatest\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m determined_browser_version \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m)\n\u001b[1;32m 60\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_latest_release_url\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m_\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mdetermined_browser_version\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 61\u001b[0m )\n\u001b[0;32m---> 62\u001b[0m resp \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_http_client\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget\u001b[49m\u001b[43m(\u001b[49m\u001b[43murl\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlatest_release_url\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 63\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m resp\u001b[38;5;241m.\u001b[39mtext\u001b[38;5;241m.\u001b[39mrstrip()\n",
92 | "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/webdriver_manager/core/http.py:37\u001b[0m, in \u001b[0;36mWDMHttpClient.get\u001b[0;34m(self, url, **kwargs)\u001b[0m\n\u001b[1;32m 35\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m exceptions\u001b[38;5;241m.\u001b[39mConnectionError:\n\u001b[1;32m 36\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCould not reach host. Are you offline?\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m---> 37\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalidate_response\u001b[49m\u001b[43m(\u001b[49m\u001b[43mresp\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 38\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m wdm_progress_bar():\n\u001b[1;32m 39\u001b[0m show_download_progress(resp)\n",
93 | "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/webdriver_manager/core/http.py:16\u001b[0m, in \u001b[0;36mHttpClient.validate_response\u001b[0;34m(resp)\u001b[0m\n\u001b[1;32m 14\u001b[0m status_code \u001b[38;5;241m=\u001b[39m resp\u001b[38;5;241m.\u001b[39mstatus_code\n\u001b[1;32m 15\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m status_code \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m404\u001b[39m:\n\u001b[0;32m---> 16\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mThere is no such driver by url \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mresp\u001b[38;5;241m.\u001b[39murl\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 17\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m status_code \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m401\u001b[39m:\n\u001b[1;32m 18\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAPI Rate limit exceeded. You have to add GH_TOKEN!!!\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
94 | "\u001b[0;31mValueError\u001b[0m: There is no such driver by url https://chromedriver.storage.googleapis.com/LATEST_RELEASE_115.0.5790"
95 | ]
96 | }
97 | ],
98 | "source": [
99 | "# Allows you to cusotmize: ingonito mode, maximize window size, headless browser, disable certain features, etc\n",
100 | "option= webdriver.ChromeOptions()\n",
101 | "\n",
102 | "# Going undercover:\n",
103 | "option.add_argument(\"--incognito\")\n",
104 | "option.add_argument(\"--headless=new\")\n",
105 | "\n",
106 | "paginaton_url = 'https://www.indeed.com/jobs?q={}&l={}&radius=35&filter=0&sort=date&start={}'\n",
107 | "\n",
108 | "start = time.time()\n",
109 | "\n",
110 | "\n",
111 | "job_='Data+Engineer'\n",
112 | "location='Washington'\n",
113 | "\n",
114 | "driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()),\n",
115 | " options=option)\n",
116 | "\n",
117 | "\n",
118 | "# driver = webdriver.Chrome(service=ChromeService(),\n",
119 | "# options=option)\n",
120 | "\n",
121 | "\n",
122 | "driver.get(paginaton_url.format(job_,location,0))\n",
123 | "\n",
124 | "time.sleep(randint(2, 6))\n",
125 | "\n",
126 | "p=driver.find_element(By.CLASS_NAME,'jobsearch-JobCountAndSortPane-jobCount').text\n",
127 | "\n",
128 | "# Max number of pages for this search! There is a caveat described soon\n",
129 | "max_iter_pgs=int(p.split(' ')[0])//15 \n",
130 | "\n",
131 | "\n",
132 | "driver.quit() # Closing the browser we opened\n",
133 | "\n",
134 | "\n",
135 | "end = time.time()\n",
136 | "\n",
137 | "print(end - start,'seconds to complete action!')\n",
138 | "print('-----------------------')\n",
139 | "print('Max Iterable Pages for this search:',max_iter_pgs)"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "id": "bb40d498",
145 | "metadata": {},
146 | "source": [
147 | "# `Investigate this One-By-One: Troubleshooting`\n",
148 | "\n",
149 | "+ **First:** let's go to the chrome browser and find the current version we are on\n",
150 | "+ **Second:** Check out the [Chrome/Chromium Divers Official Download page]\n",
151 | "(https://chromedriver.chromium.org/downloads)\n",
152 | "+ Check this from Stackoverflow [Error using Specific Version Webdriver:ValueError: There is no such driver by url https://chromedriver.storage.googleapis.com/LATEST_RELEASE_115.0.5790 ](https://stackoverflow.com/questions/76724939/there-is-no-such-driver-by-url-https-chromedriver-storage-googleapis-com-lates)\n",
153 | "+ Work around options\n",
154 | " + [Possible option 3rd party](https://github.com/seleniumbase/SeleniumBase)\n",
155 | " + Disable Automatic Chrome Updates!\n",
156 | " + Understand the basic background of Selenium Webdriver and change code"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 6,
162 | "id": "8d086d0c",
163 | "metadata": {},
164 | "outputs": [
165 | {
166 | "name": "stderr",
167 | "output_type": "stream",
168 | "text": [
169 | "[WDM] - Downloading: 100%|█████████████████| 8.29M/8.29M [00:00<00:00, 11.4MB/s]\n"
170 | ]
171 | },
172 | {
173 | "name": "stdout",
174 | "output_type": "stream",
175 | "text": [
176 | "11.941619873046875 seconds to complete action!\n",
177 | "-----------------------\n",
178 | "Max Iterable Pages for this search: 11\n"
179 | ]
180 | }
181 | ],
182 | "source": [
183 | "\n",
184 | "# Allows you to cusotmize: ingonito mode, maximize window size, headless browser, disable certain features, etc\n",
185 | "option= webdriver.ChromeOptions()\n",
186 | "\n",
187 | "# Going undercover:\n",
188 | "# option.add_argument(\"--incognito\")\n",
189 | "option.add_argument(\"--headless=new\")\n",
190 | "\n",
191 | "paginaton_url = 'https://www.indeed.com/jobs?q={}&l={}&radius=35&filter=0&sort=date&start={}'\n",
192 | "\n",
193 | "start = time.time()\n",
194 | "\n",
195 | "\n",
196 | "job_='Data+Engineer'\n",
197 | "location='Washington'\n",
198 | "\n",
199 | "\n",
200 | "# Alternate Version: 1\n",
201 | "driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager(version=\"114.0.5735.90\").install()),\n",
202 | " options=option)\n",
203 | "\n",
204 | "\n",
205 | "driver.get(paginaton_url.format(job_,location,0))\n",
206 | "\n",
207 | "time.sleep(randint(2, 6))\n",
208 | "\n",
209 | "p=driver.find_element(By.CLASS_NAME,'jobsearch-JobCountAndSortPane-jobCount').text\n",
210 | "\n",
211 | "# Max number of pages for this search! There is a caveat described soon\n",
212 | "max_iter_pgs=int(p.split(' ')[0])//15 \n",
213 | "\n",
214 | "driver.quit() # Closing the browser we opened\n",
215 | "\n",
216 | "\n",
217 | "end = time.time()\n",
218 | "\n",
219 | "print(end - start,'seconds to complete action!')\n",
220 | "print('-----------------------')\n",
221 | "print('Max Iterable Pages for this search:',max_iter_pgs)\n"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "id": "92182931",
227 | "metadata": {},
228 | "source": [
229 | "# `Check Where the drivers are and what is inside:`"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": null,
235 | "id": "914f84dd",
236 | "metadata": {},
237 | "outputs": [],
238 | "source": [
239 | "!~/.wdm"
240 | ]
241 | },
242 | {
243 | "cell_type": "markdown",
244 | "id": "e26173de",
245 | "metadata": {},
246 | "source": [
247 | "# What Happened Exactly?\n",
248 | "\n",
249 | "`webdriver.Chrome(service = ChromeService(ChromeDriverManager().install()),options=option)`\n",
250 | "This piece of code is calling the Chrome Driver to install a version which should be a newer/newest version. But we ended up with an error. In this section you can call the section where you have it downloaded but, since version control is an issue.\n",
251 | "\n",
252 | "`driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)`\n",
253 | "\n",
254 | "Switching to this code will look for a relevant version for us to use and download it and we are off to work with the scraping.\n",
255 | "\n",
256 | "Lastly, when I adjusted for a specific version\n",
257 | "\n",
258 | "`driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager(version=\"114.0.5735.90\").install()),options=option)`\n",
259 | " \n",
260 | "The code worked because this is a stable version and there is a driver for it to work as well.\n",
261 | "\n",
262 | "`-------------------------------------------------------`\n",
263 | "\n",
264 | "# `Short Talk About: Selenium Manager:`\n",
265 | "\n",
266 | "As of I think `Selenium 4.6` it has been easier with integration to start-up your workspace without all the workup that was once required. For example, if you have the drivers installed for `Firefox, Chrome or Edge` you can directly start working with Selenium.\n",
267 | "+ One of the biggest headaches comes from updating drivers and making the browser work every so often when you need to update a browser driver to work with the given browser.\n",
268 | "+ Now, Selenium will configure your browser drivers and join the `PATH` if it is not available for you without the trouble. \n",
269 | " + This will let you run everything without further work-up usually.\n",
270 | " + Assuming you already have one of the above Browsers installed prior.\n",
271 | " \n",
272 | "+ In older versions of Selenium Version 4.6 and older you would need to explicitly call and download the drivers\n",
273 | " + No more of that mess. Yay--\n",
274 | " \n",
275 | " \n",
276 | "# `Do You Need To Downgrade Chrome?`\n",
277 | "\n",
278 | "+ No Not exactly and not yet!\n",
279 | "\n"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "id": "c0900fb0",
286 | "metadata": {},
287 | "outputs": [],
288 | "source": [
289 | "# import selenium\n",
290 | "# !./selenium-manager --help\n",
291 | "# !./selenium-manager --browser chrome"
292 | ]
293 | },
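294 | {
295 |    "cell_type": "markdown",
296 |    "id": "2b3c4d5e",
297 |    "metadata": {},
298 |    "source": [
299 |     "The commented commands above assume you are already in the folder that holds the `selenium-manager` binary. A minimal sketch (assuming a standard pip install; the exact sub-folder layout varies by Selenium version and OS) for finding where the installed selenium package lives:"
300 |    ]
301 |   },
302 |   {
303 |    "cell_type": "code",
304 |    "execution_count": null,
305 |    "id": "3c4d5e6f",
306 |    "metadata": {},
307 |    "outputs": [],
308 |    "source": [
309 |     "import os\n",
310 |     "import selenium\n",
311 |     "\n",
312 |     "# Folder of the installed selenium package; the bundled selenium-manager binary\n",
313 |     "# lives somewhere under webdriver/common/ (layout is version/OS dependent).\n",
314 |     "print(os.path.dirname(selenium.__file__))"
315 |    ]
316 |   },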
294 | {
295 | "cell_type": "code",
296 | "execution_count": 7,
297 | "id": "b4960c20",
298 | "metadata": {},
299 | "outputs": [
300 | {
301 | "name": "stdout",
302 | "output_type": "stream",
303 | "text": [
304 | "26.578808069229126 seconds to complete action!\n",
305 | "-----------------------\n",
306 | "Max Iterable Pages for this search: 11\n"
307 | ]
308 | }
309 | ],
310 | "source": [
311 | "# Alternate Version But, seems slow. Maybe because it is searching and installing dependencies before moving on.\n",
312 | "\n",
313 | "from selenium import webdriver\n",
314 | "\n",
315 | "from selenium.webdriver.chrome.service import Service\n",
316 | "\n",
317 | "service = Service() # this is the important line of code!\n",
318 | "\n",
319 | "\n",
320 | "options = webdriver.ChromeOptions()\n",
321 | "options.add_argument(\"--headless=new\")\n",
322 | "\n",
323 | "\n",
324 | "driver = webdriver.Chrome(service=service, options=options)\n",
325 | "\n",
326 | "# driver = webdriver.Chrome(options=options)\n",
327 | "\n",
328 | "driver.get(paginaton_url.format(job_,location,0))\n",
329 | "\n",
330 | "time.sleep(randint(2, 6))\n",
331 | "\n",
332 | "p=driver.find_element(By.CLASS_NAME,'jobsearch-JobCountAndSortPane-jobCount').text\n",
333 | "\n",
334 | "# Max number of pages for this search! There is a caveat described soon\n",
335 | "max_iter_pgs=int(p.split(' ')[0])//15 \n",
336 | "\n",
337 | "driver.quit() # Closing the browser we opened\n",
338 | "\n",
339 | "\n",
340 | "end = time.time()\n",
341 | "\n",
342 | "print(end - start,'seconds to complete action!')\n",
343 | "print('-----------------------')\n",
344 | "print('Max Iterable Pages for this search:',max_iter_pgs)"
345 | ]
346 | },
347 | {
348 | "cell_type": "markdown",
349 | "id": "1c470dea",
350 | "metadata": {},
351 | "source": [
352 | "# `Lastly, before we depart today:`\n",
353 | "\n",
354 | "[Stackoverflow_Near Bottom](https://stackoverflow.com/questions/76727774/selenium-webdriver-chrome-115-stopped-working)"
355 | ]
356 | },
357 | {
358 | "cell_type": "markdown",
359 | "id": "d9f9fd4d",
360 | "metadata": {},
361 | "source": [
362 | "# Like, Share & SUBscribe"
363 | ]
364 | },
365 | {
366 | "cell_type": "markdown",
367 | "id": "cd34dd82",
368 | "metadata": {},
369 | "source": [
370 | "# `Citations & Help:`\n",
371 | "\n",
372 | "# ◔̯◔\n",
373 | "\n",
374 | "https://stackoverflow.com/questions/76724939/there-is-no-such-driver-by-url-https-chromedriver-storage-googleapis-com-lates\n",
375 | "\n",
376 | "https://stackoverflow.com/questions/72868256/chromedrivermanager-install-doesnt-work-webdriver-manager\n",
377 | "\n",
378 | "https://www.webnots.com/7-ways-to-disable-automatic-chrome-update-in-windows-and-mac/\n",
379 | "\n",
380 | "https://www.selenium.dev/blog/2022/introducing-selenium-manager/\n",
381 | "\n",
382 | "https://github.com/seleniumbase/SeleniumBase (possible solution to latest_version error)\n",
383 | "\n",
384 | "https://stackoverflow.com/questions/76727774/selenium-webdriver-chrome-115-stopped-working\n",
385 | "\n",
386 | "https://medium.com/analytics-vidhya/webdriver-manager-resolve-compatibility-issues-in-selenium-python-bef18c204475"
387 | ]
388 | }
389 | ],
390 | "metadata": {
391 | "kernelspec": {
392 | "display_name": "Python 3 (ipykernel)",
393 | "language": "python",
394 | "name": "python3"
395 | },
396 | "language_info": {
397 | "codemirror_mode": {
398 | "name": "ipython",
399 | "version": 3
400 | },
401 | "file_extension": ".py",
402 | "mimetype": "text/x-python",
403 | "name": "python",
404 | "nbconvert_exporter": "python",
405 | "pygments_lexer": "ipython3",
406 | "version": "3.10.9"
407 | }
408 | },
409 | "nbformat": 4,
410 | "nbformat_minor": 5
411 | }
412 |
--------------------------------------------------------------------------------
/indeed_WebScrape_2021.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "# Python Webscraping Indeed (Oct. 2021) Revisit\n",
9 | "\n",
10 | "# `Mr Fugu Data Science:`\n",
11 | "\n",
12 | "# (◕‿◕✿)\n",
13 | "\n",
14 | "**Purpose & Outcome:**\n",
15 | "Webscrape Indeed: take position, job title,location,date of posting, string of qualifications\n",
16 | "\n",
17 | "Use a list of words related to your skills or skills you are interested in and extract from job post qualifications section.\n",
18 | "\n",
19 | "Date time formating\n",
20 | "\n",
21 | "------------------------------"
22 | ]
23 | },
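24 |   {
25 |    "cell_type": "markdown",
26 |    "metadata": {},
27 |    "source": [
28 |     "A minimal sketch (not part of the original notebook) of the skill-matching idea described above; the skill list and the qualification string are made-up placeholders."
29 |    ]
30 |   },
31 |   {
32 |    "cell_type": "code",
33 |    "execution_count": null,
34 |    "metadata": {},
35 |    "outputs": [],
36 |    "source": [
37 |     "# Hypothetical skill list and qualification snippet, just to show the matching step\n",
38 |     "skills = ['python', 'sql', 'pandas', 'aws']\n",
39 |     "qualifications = 'Experience with Python, SQL and cloud tools such as AWS.'\n",
40 |     "\n",
41 |     "# Keep the skills that appear in the lower-cased qualification text\n",
42 |     "matched = [s for s in skills if s in qualifications.lower()]\n",
43 |     "print(matched)  # ['python', 'sql', 'aws']"
44 |    ]
45 |   },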
24 | {
25 | "cell_type": "code",
26 | "execution_count": 700,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "import pandas as pd\n",
31 | "import requests # grab web-page\n",
32 | "from bs4 import BeautifulSoup as bsopa # parse web-page\n",
33 | "import datetime # format date/time"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "metadata": {},
40 | "outputs": [],
41 | "source": [
42 | "'''\n",
43 | "ex.) https://www.indeed.com/jobs?q=data+scientist&l=california&start=10\n",
44 | "\n",
45 | "range(0,150,10): each page will have \"start=0,start=10,start=20 etc\" which deals with\n",
46 | "going through each page, but not exactly 10 entries/pg. \n",
47 | "\n",
48 | "string formatting is used to denote what job and location we want: you can use a string\n",
49 | "separated by space and it will be interpreted by the website. \n",
50 | "\n",
51 | "sou=bsopa(y.text,'lxml') is taking our get request and converting to text in the format\n",
52 | "of 'lxml' but we can replace with 'html.parser' as well.\n",
53 | "\n",
54 | "Each job post can be parsed by the 'div',{\"class\":\"jobsearch-SerpJobCard\"} or \n",
55 | "'div', {'class': 'row'} depending on how you want to search\n",
56 | "\n",
57 | "After that we get each piece of information we want to obtain: job title, location etc.\n",
58 | "\n",
59 | "the only difficult and frustrating part is getting all the raw text for each posting\n",
60 | "relating to the qulifications. We have to open a link, iterate through it and then extract\n",
61 | "all the information with a try/except block and then further process.\n",
62 | "'''"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 699,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "data": {
72 | "text/plain": [
73 | "'Original Code October 2020'"
74 | ]
75 | },
76 | "execution_count": 699,
77 | "metadata": {},
78 | "output_type": "execute_result"
79 | }
80 | ],
81 | "source": [
82 | "'''Original Code October 2020'''\n",
83 | "\n",
84 | "# gg=[]\n",
85 | "# for j in range(0,150,10):\n",
86 | "# position,location='data scientist','california'\n",
87 | "# y=requests.get('https://www.indeed.com/jobs?q={}&l={}&sort=date='.format(position,location)+str(j))\n",
88 | "\n",
89 | "# # y=requests.get('https://www.indeed.com/jobs?q=data+scientist&l=california&sort=date='+str(i))\n",
90 | "# sou=bsopa(y.text,'lxml')\n",
91 | "\n",
92 | "# # for ii in sou.find_all('div', {'class': 'row'}):\n",
93 | "# for ii in sou.find_all('div',{\"class\":\"jobsearch-SerpJobCard\"}):\n",
94 | "# print(ii)\n",
95 | "\n",
96 | "# job_title = ii.find('a', {'data-tn-element': 'jobTitle'})['title']\n",
97 | "# company_name = ii.find('span', {'class': 'company'}).text.strip() \n",
98 | "# location=ii.find('span',{\"class\":\"location\"})\n",
99 | "# post_date = ii.find('span', attrs={'class': 'date'})\n",
100 | "# summary=ii.find('div',attrs={'class':'summary'})\n",
101 | "\n",
102 | "# if location:\n",
103 | "# location=location.text.strip()\n",
104 | "# else:\n",
105 | "# location=ii.find('div',{\"class\":\"location\"})\n",
106 | "# location=location.text.strip()\n",
107 | "\n",
108 | "# k=ii.find('h2', {'class':\"title\"})\n",
109 | "# p=k.find(href=True)\n",
110 | "# v=p['href']\n",
111 | "# f_=str(v).replace('&','&') # links to iterate for qualification text\n",
112 | " \n",
113 | " \n",
114 | "# datum = {'job_title': job_title,\n",
115 | "# 'company_name': company_name,\n",
116 | "# 'location': location,\n",
117 | "# 'summary':summary.text.strip(),\n",
118 | "# 'post_Date':post_date.text,\n",
119 | "# 'Qualification_link': f_}\n",
120 | "\n",
121 | "# print(datum)\n",
122 | "# gg.append([location,job_title,company_name,post_date.text,summary.text.strip()\n",
123 | "# # ,f_]) \n",
124 | "# gg.append(datum)"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "This was the original piece of code that was used until about Summer of 2020, you were able to parse from the outer most section using the 'jobsearch-SerpJobCard'. We were iterating over each of these because it was the start of a new job posting and inside of it were the details to extract.\n",
132 | "\n",
133 | "`for ii in sou.find_all('div', {'class': 'row'}):\n",
134 | " for ii in sou.find_all('div',{\"class\":\"jobsearch-SerpJobCard\"}):\n",
135 | " print(ii)`\n",
136 | " \n",
137 | "Here we are just extracting everything we need, the 'jobtile' is subset to go inside and get information using the brackets you see. Then we have to get the company name which with then be parsed by stripping the text from html. Location was different because you will need conditional statements based on if you see it or not and have to do comparisons by 2 different text blocks. The rest of this is straight forward.\n",
138 | "\n",
139 | " job_title = ii.find('a', {'data-tn-element': 'jobTitle'})['title']\n",
140 | " company_name = ii.find('span', {'class': 'company'}).text.strip() \n",
141 | " location=ii.find('span',{\"class\":\"location\"})\n",
142 | " post_date = ii.find('span', attrs={'class': 'date'})\n",
143 | " summary=ii.find('div',attrs={'class':'summary'})\n",
144 | "\n",
145 | "Now, get the tile from the href which store hyperlinks and we will need to store those links to get job description which will be on a separate paged link. That link will be the entire posting not the snapshot seen as a glance. \n",
146 | " \n",
147 | " k=ii.find('h2', {'class':\"title\"})\n",
148 | " p=k.find(href=True)\n",
149 | " v=p['href']\n",
150 | " f_=str(v).replace('&','&') # links to iterate for qualification text\n",
151 | "\n",
152 | "Store everything as a dictionary for ease of use later.\n",
153 | "\n",
154 | " datum = {'job_title': job_title,\n",
155 | " 'company_name': company_name,\n",
156 | " 'location': location,\n",
157 | " 'summary':summary.text.strip(),\n",
158 | " 'post_Date':post_date.text,\n",
159 | " 'Qualification_link': f_}\n"
160 | ]
161 | },
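162 |   {
163 |    "cell_type": "markdown",
164 |    "metadata": {},
165 |    "source": [
166 |     "The try/except step mentioned above (opening each stored 'Qualification_link' to pull the full description text) is not shown in the cells here, so below is a hedged sketch of it. The example link is one of the relative hrefs from the recorded output further down and may no longer resolve, and the `jobsearch-jobDescriptionText` class name is an assumption about the job page's markup at the time."
167 |    ]
168 |   },
169 |   {
170 |    "cell_type": "code",
171 |    "execution_count": null,
172 |    "metadata": {},
173 |    "outputs": [],
174 |    "source": [
175 |     "import requests\n",
176 |     "from bs4 import BeautifulSoup as bsopa\n",
177 |     "\n",
178 |     "# One of the relative hrefs collected above (taken from the recorded output below)\n",
179 |     "link = '/rc/clk?jk=4edcc9b6df92924e&fccid=f23cfaf12528dbd0&vjs=3'\n",
180 |     "\n",
181 |     "try:\n",
182 |     "    page = requests.get('https://www.indeed.com' + link)\n",
183 |     "    desc_soup = bsopa(page.text, 'lxml')\n",
184 |     "    # Assumed container for the full description on the job page\n",
185 |     "    desc = desc_soup.find('div', {'class': 'jobsearch-jobDescriptionText'})\n",
186 |     "    qualification_text = desc.get_text(separator=' ').strip() if desc else ''\n",
187 |     "except Exception as e:\n",
188 |     "    qualification_text = ''\n",
189 |     "    print('Could not fetch the description:', e)\n",
190 |     "\n",
191 |     "print(qualification_text[:200])"
192 |    ]
193 |   },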
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "# `New Version Oct. 2021`"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 680,
172 | "metadata": {},
173 | "outputs": [
174 | {
175 | "name": "stdout",
176 | "output_type": "stream",
177 | "text": [
178 | "Senior Data Scientist - Ads Optimization\n",
179 | "Indeed\n",
180 | "Nope\n",
181 | "Sunnyvale, CA 94086 (Washington area)+1 location•Temporarily Remote\n",
182 | "Associate, Operations Research Data Scientist\n",
183 | "KPMG\n",
184 | "Nope\n",
185 | "Los Angeles, CA 90071 (Downtown area)+2 locations\n",
186 | "Associate, Data Scientist\n",
187 | "KPMG\n",
188 | "Nope\n",
189 | "Los Angeles, CA 90071 (Downtown area)+2 locations\n",
190 | "Data Scientist I\n",
191 | "Inland Empire Health Plans\n",
192 | "Nope\n",
193 | "Rancho Cucamonga, CA 91730\n",
194 | "Data Scientist\n",
195 | "Cymbiotika\n",
196 | "Nope\n",
197 | "San Diego, CA 92121 (Torrey Preserve area)\n",
198 | "People Analytics Data Scientist\n",
199 | "Workday\n",
200 | "Nope\n",
201 | "Pleasanton, CA\n",
202 | "Data Scientist\n",
203 | "Doorstead\n",
204 | "Nope\n",
205 | "San Francisco, CA•Remote\n",
206 | "Data Scientist, Monetization\n",
207 | "TikTok\n",
208 | "Nope\n",
209 | "Mountain View, CA 94041 (Old Mountain View area)\n",
210 | "Associate Data Scientist\n",
211 | "Activision\n",
212 | "Nope\n",
213 | "Santa Monica, CA\n",
214 | "Data Scientist (remote)\n",
215 | "Fitbod\n",
216 | "Nope\n",
217 | "San Francisco, CA•Remote\n",
218 | "Senior Data Scientist - Ads Optimization\n",
219 | "Indeed\n",
220 | "$153,000 - $223,000 a year\n",
221 | "Sunnyvale, CA 94086 (Washington area)+1 location•Temporarily Remote\n",
222 | "Associate, Data Scientist\n",
223 | "KPMG\n",
224 | "Nope\n",
225 | "San Francisco, CA 94105 (South Beach area)+8 locations\n",
226 | "Data Scientist I\n",
227 | "Inland Empire Health Plans\n",
228 | "$97,843 - $124,758 a year\n",
229 | "Rancho Cucamonga, CA 91730+1 location\n",
230 | "Data Scientist\n",
231 | "Cymbiotika\n",
232 | "$50,000 - $60,000 a year\n",
233 | "San Diego, CA 92121 (Torrey Preserve area)\n",
234 | "People Analytics Data Scientist\n",
235 | "Workday\n",
236 | "Nope\n",
237 | "Pleasanton, CA\n",
238 | "Data Scientist\n",
239 | "Doorstead\n",
240 | "Nope\n",
241 | "San Francisco, CA•Remote\n",
242 | "Associate Data Scientist\n",
243 | "Activision\n",
244 | "Nope\n",
245 | "Santa Monica, CA+1 location\n",
246 | "Data Scientist (remote)\n",
247 | "Fitbod\n",
248 | "Nope\n",
249 | "San Francisco, CA•Remote\n",
250 | "Data Scientist - Entry Level\n",
251 | "Lawrence Livermore National Laboratory\n",
252 | "Nope\n",
253 | "Livermore, CA 94550+2 locations\n"
254 | ]
255 | },
256 | {
257 | "data": {
258 | "text/plain": [
259 | "19"
260 | ]
261 | },
262 | "execution_count": 680,
263 | "metadata": {},
264 | "output_type": "execute_result"
265 | }
266 | ],
267 | "source": [
268 | "# KEEP THIS VERSION: October 2021\n",
269 | "\n",
270 | "gg=[]\n",
271 | "for j in range(0,15,10): # calling 15 entries\n",
272 | " \n",
273 | " position,location='data scientist','california'\n",
274 | " \n",
275 | " y=requests.get('https://www.indeed.com/jobs?q={}&l={}&sort=date='.format(position,location)+str(j))\n",
276 | "# print(y)\n",
277 | "\n",
278 | " sou=bsopa(y.text,'lxml')\n",
279 | " \n",
280 | "# print(sou) use this if you want to check if working properly, response code 200\n",
281 | "\n",
282 | "\n",
283 | " for ii in sou.find_all('div',{\"class\":\"job_seen_beacon\"}):\n",
284 | " j=ii.find('tbody') # calling the table body to go inside of\n",
285 | " a= j.find('tr') # going inside the table\n",
286 | "\n",
287 | "\n",
288 | " # print(len(wa_))\n",
289 | "\n",
290 | " for n in a.find_all('h2',{'class':'jobTitle jobTitle-color-purple jobTitle-newJob'}):\n",
291 | " job_title=n.find_all('span')[1].get_text()# if you don't use the 1, you get the 'new' posting text\n",
292 | " print(job_title)\n",
293 | " # Company Name is in new nesting:\n",
294 | " # print(a.find_all('span',{'class':'companyName'}))\n",
295 | "\n",
296 | " other=a.find('div',{'class':'heading6 company_location tapItem-gutter'})\n",
297 | " pr_=other.find('span')\n",
298 | " company_=(pr_.get_text())\n",
299 | " print(company_) \n",
300 | " gg.append(company_)\n",
301 | " # print(a.find('span',{'class':'companyName'}).get_text()) # alt version\n",
302 | "\n",
303 | " # Salary if available:\n",
304 | " if a.find('div',{'class':'heading6 tapItem-gutter metadataContainer noJEMChips salaryOnly'}):\n",
305 | " print(a.find('div',{'class':'metadata salary-snippet-container'}).get_text())\n",
306 | " # print(a.find('div',{'class':'heading6 tapItem-gutter metadataContainer noJEMChips salaryOnly'}))\n",
307 | " else:\n",
308 | " print('Nope')\n",
309 | "\n",
310 | "\n",
311 | " # Location:\n",
312 | " # locaiton=(other.find('div',{'class':'companyLocation'})) # checking something\n",
313 | " # print(other.find('pre').get_text()) # good start\n",
314 | " opt_1=other.find('pre')\n",
315 | " print(opt_1.find('div',{'class':'companyLocation'}).get_text())\n",
316 | "len(gg) "
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "# `Final Deal:`"
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": 698,
329 | "metadata": {},
330 | "outputs": [
331 | {
332 | "data": {
333 | "text/plain": [
334 | "[['Indeed',\n",
335 | " 'Sunnyvale, CA 94086 (Washington area)+1 location•Temporarily Remote',\n",
336 | " '$153,000 - $223,000 a year',\n",
337 | " 'Senior Data Scientist - Ads Optimization',\n",
338 | " '/pagead/clk?mo=r&ad=-6NYlbfkN0CiRNM7CVr8YueLFKlzwbFWI0o7IjV438l4sVrvKZ0flkywQJjHRBleObb6D711MbmS4ls7d3cF2Oq3aULRL-oCpslQ8eol1sCeCsrHu6Ormuhl_Gd2n_2z30mgTt9HyPd0R2YWYjvrQpCS5p1j9c2ZMxHpmr7PLcQ4RrnjcUQoSJEU1n8kwuSkPjjkuu0YCQ5pB3Q4pg8cBuCYb4C7rlTo8tjCITIibgEDN6U1ijuTr2905NcR87ipsT6pqVEYHp0FOb4ZT2XTE3pQji3sGUO62A6fg0SR_NaGly2iAUveSV1QohOdfN1Qcc4ilZe9m74YDGggllE1rhLrOrkpk2qwb7v0t8jBvtqb8S52e5ccYcXC2Z6vSY03iF6QfmmWj42gTn9_20rrIwkrAuVrztIdYjjZGzuMm6F8OdT71-8faWpL1g4secQ2-uj2cZrj483DW_1_9PsSvHpfe40eIAtN07K6e3mMNu2WnsG50LAdjghwXoxZp1_WcjlaDgbrqHe1L3EVcqz2VXbRbHAdXyK_7K8q-_efLULxW9SfiKnOwhPTpYeI8OVX&p=1&fvj=0&vjs=3'],\n",
339 | " ['KPMG',\n",
340 | " 'San Francisco, CA 94105 (South Beach area)+8 locations',\n",
341 | " 'No Salary posted',\n",
342 | " 'Associate, Data Scientist',\n",
343 | " '/pagead/clk?mo=r&ad=-6NYlbfkN0CzN6uX84z9wq8HaOCWH9ZvJLeemKyVc3gCGVrPSg-ucI9H6BgA8rDcXux5vfMF3jfPLAcL_8mkTIxyarZP1sq9-tD0-jA3Y6WEfxnyBIiuGY-_po0tLUjvXnqhfH2tUVrboycwnykVqCjY77bPCmJYbAO9x67OtsivC-UDrXrdDFqZFrF4LWc7RsVYowfeI06NUqjNlxXUxNYJJaVlaYbD04lqWa-tyyIenKOtpr2SWTPvlZR0Ddl7CC3NO0TG7PIwjZ8ElFnHS59voq1EG6odVpAPmUyA2ef5HevNwwH7paqkgQYJLvqCHLzr2NnEac7tGI0cFYz6dvOtHmnF8zGOlzqFbVTTZ0jIzxx8uVCaeUj4lLJtT_IwTelTw1qO3CL-XQfv4t0YZWV0v0mSuqZSqAReAvNM6ROL0ls_4WRlGCET0Uv-2kkKSnfHZKYgRf7AqrSfSWhS7sSJV40g7nH3hLBVLwz7ogO30Cfj4klqj_tto3SXfOqc0FoynxD201oDGooR8aoR81Jo9f7zkMjXQrVALkXvrMhf6w-IeGevYu6tXwN1LCB0seCqSNeIV0IwxyvyiI5BNtX3VuOpZ8sxqtLwOzAqbRi8AZu88ItiIRTSKDQE48_6fXkibOUyJRbENVyh1KFuS41qODE3-BNLH_Xn6n-0jjbo8gw_fKPhwQW4wfGI8oVInX2bB6H_3J8nBVELOep4QxSMg1_i2tnuCbhMTmCMLticJF3o0hJWGcJ-9VZOGYS5FjXLSUjSs2LPmM2tAUtHJNzuLKnA2FK2EviYQPYYBad9xY7rI8TgjU_om7Sk-zbAhUgn1d03KG_qQC6GNroh3Ak0Tsc-AFiCE6y28fAERWQOff4Q1TPElOJhT8VMqJN6c2tw8D75NFcX4DErj9WVGZ9kj8pB1LlrinWsOqwexH_zaQEeXlHp3_ev88O9U6Y_k1zG-V2aoDJklPDCNMfuSF5uU8oE00w5278NfXXuJ5GiyH8f8NfXhIM4nkSuM_NJRf7AkWuDQ0FrxvCI31qX50H_ss8vfLCRS_bABDDB6pi1Ta_z-G8_yo6V6gKqlISS6t_0-NbH1ohjlfbs0fS3oQ==&p=2&fvj=0&vjs=3'],\n",
344 | " ['Inland Empire Health Plans',\n",
345 | " 'Rancho Cucamonga, CA 91730+1 location',\n",
346 | " '$97,843 - $124,758 a year',\n",
347 | " 'Data Scientist I',\n",
348 | " '/rc/clk?jk=4edcc9b6df92924e&fccid=f23cfaf12528dbd0&vjs=3'],\n",
349 | " ['Cymbiotika',\n",
350 | " 'San Diego, CA 92121 (Torrey Preserve area)',\n",
351 | " '$50,000 - $60,000 a year',\n",
352 | " 'Data Scientist',\n",
353 | " '/company/Cymbiotika/jobs/Data-Scientist-2832b48c155fc4e4?fccid=0dfc05f77847bd48&vjs=3'],\n",
354 | " ['Workday',\n",
355 | " 'Pleasanton, CA',\n",
356 | " 'No Salary posted',\n",
357 | " 'People Analytics Data Scientist',\n",
358 | " '/rc/clk?jk=a50f913cb24519cc&fccid=9ac45d217f8342a1&vjs=3'],\n",
359 | " ['Doorstead',\n",
360 | " 'San Francisco, CA•Remote',\n",
361 | " 'No Salary posted',\n",
362 | " 'Data Scientist',\n",
363 | " '/rc/clk?jk=7b9bca069f86f278&fccid=0571d4e0d8a49b91&vjs=3'],\n",
364 | " ['Activision',\n",
365 | " 'Santa Monica, CA+1 location',\n",
366 | " 'No Salary posted',\n",
367 | " 'Associate Data Scientist',\n",
368 | " '/rc/clk?jk=a0480b289c939f61&fccid=71147e0539a0a1b7&vjs=3'],\n",
369 | " ['Fitbod',\n",
370 | " 'San Francisco, CA•Remote',\n",
371 | " 'No Salary posted',\n",
372 | " 'Data Scientist (remote)',\n",
373 | " '/rc/clk?jk=b6a78ada05d6a311&fccid=c01c2bfafd4dbb57&vjs=3'],\n",
374 | " ['Lawrence Livermore National Laboratory',\n",
375 | " 'Livermore, CA 94550+2 locations',\n",
376 | " 'No Salary posted',\n",
377 | " 'Data Scientist - Entry Level',\n",
378 | " '/rc/clk?jk=3ffb79b589d7ea3e&fccid=26727f1861532c63&vjs=3']]"
379 | ]
380 | },
381 | "execution_count": 698,
382 | "metadata": {},
383 | "output_type": "execute_result"
384 | }
385 | ],
386 | "source": [
387 | "ouch_=[]\n",
388 | "for ii in sou.find_all('tbody'):\n",
389 | "\n",
390 | " pri=ii.find('tr')\n",
391 | "# print(pri)\n",
392 | " for wows in pri.find_all('a',href=True):\n",
393 | " if wows.find('a',href=True):\n",
394 | " wawa.append(wows['href'])\n",
395 | " # Links for job description:\n",
396 | " yikes=wows['href']\n",
397 | " \n",
398 | " # Get job title, location, salary\n",
399 | " for ii_ in wows.find_all('div',{\"class\":\"job_seen_beacon\"}):\n",
400 | " j=ii_.find('tbody') # calling the table body to go inside of\n",
401 | " a= j.find('tr') # going inside the table\n",
402 | "\n",
403 | " for n in a.find_all('h2',{'class':'jobTitle jobTitle-color-purple jobTitle-newJob'}):\n",
404 | " \n",
405 | " # Job Title:\n",
406 | " job_title=n.find_all('span')[1].get_text()# if you don't use the 1, you get the 'new' posting text\n",
407 | " \n",
408 | " # Company Name\n",
409 | "# Company Name is in new nesting:\n",
410 | "\n",
411 | " other=a.find('div',{'class':'heading6 company_location tapItem-gutter'})\n",
412 | " pr_=other.find('span')\n",
413 | " company_=(pr_.get_text())\n",
414 | " \n",
415 | " \n",
416 | "# print(company_) \n",
417 | "# gg.append(company_)\n",
418 | " # print(a.find('span',{'class':'companyName'}).get_text()) # alt version\n",
419 | "\n",
420 | " # Salary if available:\n",
421 | " if a.find('div',{'class':'heading6 tapItem-gutter metadataContainer noJEMChips salaryOnly'}):\n",
422 | " salary=a.find('div',{'class':'metadata salary-snippet-container'}).get_text()\n",
423 | "# # print(a.find('div',{'class':'heading6 tapItem-gutter metadataContainer noJEMChips salaryOnly'}))\n",
424 | " else:\n",
425 | " salary='No Salary posted'\n",
426 | "\n",
427 | "\n",
428 | " # Location:\n",
429 | " opt_1=other.find('pre')\n",
430 | " location=opt_1.find('div',{'class':'companyLocation'}).get_text()\n",
431 | " \n",
432 | " \n",
433 | " \n",
434 | " ouch_.append([company_,location,salary,job_title,yikes])\n",
435 | "ouch_"
436 | ]
437 | }
438 | ],
439 | "metadata": {
440 | "kernelspec": {
441 | "display_name": "Python 3",
442 | "language": "python",
443 | "name": "python3"
444 | },
445 | "language_info": {
446 | "codemirror_mode": {
447 | "name": "ipython",
448 | "version": 3
449 | },
450 | "file_extension": ".py",
451 | "mimetype": "text/x-python",
452 | "name": "python",
453 | "nbconvert_exporter": "python",
454 | "pygments_lexer": "ipython3",
455 | "version": "3.7.3"
456 | }
457 | },
458 | "nbformat": 4,
459 | "nbformat_minor": 4
460 | }
461 |
--------------------------------------------------------------------------------
/screen_shot_01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrFuguDataScience/Webscraping/HEAD/screen_shot_01.png
--------------------------------------------------------------------------------
/screen_shot_02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrFuguDataScience/Webscraping/HEAD/screen_shot_02.png
--------------------------------------------------------------------------------
/screen_shot_03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrFuguDataScience/Webscraping/HEAD/screen_shot_03.png
--------------------------------------------------------------------------------
/screen_shot_04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrFuguDataScience/Webscraping/HEAD/screen_shot_04.png
--------------------------------------------------------------------------------
/shadow_root.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrFuguDataScience/Webscraping/HEAD/shadow_root.png
--------------------------------------------------------------------------------