`, you can save the path in a variable to call it.\n",
349 | "\n",
350 | "Here, I will extract links that are in the actual content of a post by \"saving\" the `post-342779` article in a variable called `article`."
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": 13,
356 | "metadata": {},
357 | "outputs": [],
358 | "source": [
359 | "article = response.html.find('article.cis_post_item_initial.post-342779', first=True)\n",
360 | "article_links = article.xpath('//a/@href')"
361 | ]
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
367 | "## Case Study: Extract Broken Links"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 15,
373 | "metadata": {},
374 | "outputs": [],
375 | "source": [
376 | "import re\n",
377 | "import requests\n",
378 | "from requests_html import HTMLSession\n",
379 | "from urllib.parse import urlparse\n",
380 | "\n",
381 | "# Get Domain Name With urlparse\n",
382 | "url = \"https://www.jobillico.com/fr/partenaires-corporatifs\"\n",
383 | "parsed_url = urlparse(url)\n",
384 | "domain = parsed_url.scheme + \"://\" + parsed_url.netloc\n",
385 | "\n",
386 | "# Get URL \n",
387 | "session = HTMLSession()\n",
388 | "r = session.get(url)\n",
389 | "\n",
390 | "# Extract Links\n",
391 | "jlinks = r.html.xpath('//a/@href')\n",
392 | "\n",
393 | "# Remove unusable links and convert relative paths to absolute URLs\n",
394 | "updated_links = []\n",
395 | "\n",
396 | "for link in jlinks:\n",
397 | "    if re.search(r\"@|javascript:|tel:\", link):\n",
398 | "        continue  # skip email, JavaScript, and phone links\n",
399 | "    elif not link.startswith(\"http\"):  # relative path, treated as root-relative\n",
400 | "        link = domain + \"/\" + link.lstrip(\"/\")\n",
401 | "        updated_links.append(link)\n",
402 | "    else:\n",
403 | "        updated_links.append(link)"
404 | ]
405 | },
406 | {
407 | "cell_type": "code",
408 | "execution_count": null,
409 | "metadata": {},
410 | "outputs": [],
411 | "source": [
412 | "print(updated_links)"
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": null,
418 | "metadata": {},
419 | "outputs": [],
420 | "source": [
421 | "broken_links = []\n",
422 | "\n",
423 | "for link in updated_links:\n",
424 | "    print(link)\n",
425 | "    try:\n",
426 | "        status_code = requests.get(link, timeout=10).status_code  # one request per link\n",
427 | "        if status_code != 200:\n",
428 | "            broken_links.append(link)\n",
429 | "    except requests.exceptions.RequestException as e:\n",
430 | "        print(e)\n",
431 | "\n",
432 | "broken_links"
433 | ]
434 | },
435 | {
436 | "cell_type": "markdown",
437 | "metadata": {},
438 | "source": [
439 | "## Full Code"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": [
448 | "import pandas as pd\n",
449 | "import requests\n",
450 | "from requests_html import HTMLSession\n",
451 | "\n",
452 | "\n",
453 | "url = \"https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/\"\n",
454 | "\n",
455 | "try:\n",
456 | "    session = HTMLSession()\n",
457 | "    response = session.get(url)\n",
458 | "except requests.exceptions.RequestException as error:\n",
459 | "    print(error)\n",
460 | "\n",
461 | " \n",
462 | "# Get Title\n",
463 | "title = response.html.find('title', first=True).text\n",
464 | "\n",
465 | "#Get H1\n",
466 | "h1 = response.html.find('h1', first=True).text\n",
467 | "\n",
468 | "#Get all Links\n",
469 | "links = response.html.absolute_links\n",
470 | "\n",
471 | "#Get Author using Class\n",
472 | "author = response.html.find('.post-author', first=True).text\n",
473 | "\n",
474 | "#Get Canonical Link\n",
475 | "canonical = response.html.xpath(\"//link[@rel='canonical']/@href\")\n",
476 | "\n",
477 | "#Get Hreflang\n",
478 | "hreflang = response.html.xpath(\"//link[@rel='alternate']/@hreflang\")\n",
479 | "\n",
480 | "#Get Meta Robots\n",
481 | "meta_robots = response.html.xpath(\"//meta[@name='ROBOTS']/@content\")\n",
482 | "\n",
483 | "#Get Navigational links using nested CSS Selector and For Loops\n",
484 | "get_nav_links = response.html.find('a.sub-m-cat span')\n",
485 | "\n",
486 | "nav_links = []\n",
487 | "\n",
488 | "for nav_link in get_nav_links:\n",
489 | "    # Keep only the visible text of each navigation link\n",
490 | "    nav_links.append(nav_link.text)\n",
491 | " \n",
492 | "nav_links\n",
493 | "\n",
494 | "#Create a variable to extract data from the actual article only.\n",
495 | "article = response.html.find('article.cis_post_item_initial.post-342779', first=True)\n",
496 | "article_links = article.xpath('//a/@href')\n"
497 | ]
498 | },
499 | {
500 | "cell_type": "code",
501 | "execution_count": null,
502 | "metadata": {},
503 | "outputs": [],
504 | "source": []
505 | }
506 | ],
507 | "metadata": {
508 | "kernelspec": {
509 | "display_name": "Python 3",
510 | "language": "python",
511 | "name": "python3"
512 | },
513 | "language_info": {
514 | "codemirror_mode": {
515 | "name": "ipython",
516 | "version": 3
517 | },
518 | "file_extension": ".py",
519 | "mimetype": "text/x-python",
520 | "name": "python",
521 | "nbconvert_exporter": "python",
522 | "pygments_lexer": "ipython3",
523 | "version": "3.7.4"
524 | }
525 | },
526 | "nbformat": 4,
527 | "nbformat_minor": 2
528 | }
529 |
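The link clean-up step in the broken-links case study can also be sketched with `urljoin` from the standard library, which resolves relative paths against the page URL instead of concatenating strings. This is a minimal sketch, not the notebook's own method: the helper name `normalize_links`, the `SKIPPED_PREFIXES` tuple, and the sample `hrefs` are assumptions made for illustration. Unlike the notebook, it only skips links by prefix rather than dropping every link that contains `@`.

```python
from urllib.parse import urljoin, urlparse

# Prefixes that cannot be checked with an HTTP request (assumed list).
SKIPPED_PREFIXES = ("mailto:", "javascript:", "tel:")


def normalize_links(base_url, hrefs):
    """Return absolute http(s) URLs, dropping non-HTTP links and duplicates."""
    cleaned = []
    for href in hrefs:
        href = href.strip()
        # Ignore empty values, in-page anchors, and non-HTTP schemes.
        if not href or href.startswith("#"):
            continue
        if href.lower().startswith(SKIPPED_PREFIXES):
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https") and absolute not in cleaned:
            cleaned.append(absolute)
    return cleaned


# Hypothetical hrefs, standing in for the result of r.html.xpath('//a/@href').
sample_hrefs = [
    "/fr/emplois",
    "contact.html",
    "mailto:info@example.com",
    "javascript:void(0)",
    "tel:+15555550100",
    "https://example.com/a",
    "#top",
]
page_url = "https://www.jobillico.com/fr/partenaires-corporatifs"
print(normalize_links(page_url, sample_hrefs))
```

The returned list could then be passed to the status-code loop in place of `updated_links`.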
--------------------------------------------------------------------------------
/Rainbow_lorikeet.jpg:
--------------------------------------------------------------------------------
loriquet arc-en-ciel - Wiktionnaire

loriquet arc-en-ciel
Definition, translation, pronunciation, anagram, and synonym in the free dictionary Wiktionnaire.

From loriquet and arc-en-ciel.

loriquet arc-en-ciel \lɔ.ʁi.kɛ aʁ.kɑ̃.sjɛl\ masculine
(Ornithology) A modestly sized parrot that lives in eastern Australia, very closely related to the loriquet à tête bleue, of which it is sometimes considered a subspecies.
Rainbow lorikeets have been introduced to Auckland (New Zealand) and Hong Kong.
Last week, finally, three rainbow lorikeets, the noisiest and most colorful species of parrot, managed to escape. (Les animaux du zoo de Bristol s’entretuent-ils?, in 20 minutes, 9 February 2015)

May be written with a capital letter (Loriquet arc-en-ciel) to emphasize that the word is being given a generic character.
--------------------------------------------------------------------------------