├── .gitignore ├── README.md ├── linkedin ├── __init__.py ├── items.py ├── middlewares.py ├── pipelines.py ├── settings.py └── spiders │ ├── __init__.py │ ├── linkedin_company_profile.py │ ├── linkedin_jobs.py │ └── linkedin_people_profile.py └── scrapy.cfg /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | venv/ 30 | 31 | 32 | ## Custom 33 | data/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # linkedin-python-scrapy-scraper 2 | Python Scrapy spiders that scrape job data, people profiles and company profiles from [LinkedIn.com](https://www.linkedin.com/). 3 | 4 | This Scrapy project contains 3 separate spiders: 5 | 6 | | Spider | Description | 7 | |----------|-------------| 8 | | `linkedin_people_profile` | Scrapes people data from LinkedIn people profile pages. | 9 | | `linkedin_jobs` | Scrapes job data from LinkedIn (https://www.linkedin.com/jobs/search) | 10 | | `linkedin_company_profile` | Scrapes company data from LinkedIn company profile pages. | 11 | 12 | 13 | The following articles go through in detail how these LinkedIn spiders were developed, which you can use to understand the spiders and edit them for your own use case. 14 | 15 | - [Python Scrapy: Build A LinkedIn.com People Profile Scraper](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-people-scraper/) 16 | - [Python Scrapy: Build A LinkedIn.com Jobs Scraper](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-jobs-scraper/) 17 | - [Python Scrapy: Build A LinkedIn.com Company Profile Scraper](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-company-scraper/) 18 | 19 | ## ScrapeOps Proxy 20 | These LinkedIn spiders use [ScrapeOps Proxy](https://scrapeops.io/proxy-aggregator/) as the proxy solution. ScrapeOps has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, but it can easily be scaled up to millions of pages per month if need be. 21 | 22 | You can [sign up for a free API key here](https://scrapeops.io/app/register/main). 23 | 24 | To use the ScrapeOps Proxy you need to first install the proxy middleware: 25 | 26 | ``` 27 | 28 | pip install scrapeops-scrapy-proxy-sdk 29 | 30 | ``` 31 | 32 | Then activate the ScrapeOps Proxy by adding your API key to the `SCRAPEOPS_API_KEY` setting in the ``settings.py`` file: 33 | 34 | ```python 35 | 36 | SCRAPEOPS_API_KEY = 'YOUR_API_KEY' 37 | 38 | SCRAPEOPS_PROXY_ENABLED = True 39 | 40 | DOWNLOADER_MIDDLEWARES = { 41 | 'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725, 42 | } 43 | 44 | ``` 45 | 46 | 47 | ## ScrapeOps Monitoring 48 | To monitor the scrapers, this project uses the [ScrapeOps Monitor](https://scrapeops.io/monitoring-scheduling/), a free monitoring tool specifically designed for web scraping.
49 | 50 | **Live demo here:** [ScrapeOps Demo](https://scrapeops.io/app/login/demo) 51 | 52 | ![ScrapeOps Dashboard](https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png) 53 | 54 | To use the ScrapeOps Monitor you need to first install the monitoring SDK: 55 | 56 | ``` 57 | 58 | pip install scrapeops-scrapy 59 | 60 | ``` 61 | 62 | 63 | Then activate the ScrapeOps Monitor by adding your API key to the `SCRAPEOPS_API_KEY` setting in the ``settings.py`` file: 64 | 65 | ```python 66 | 67 | SCRAPEOPS_API_KEY = 'YOUR_API_KEY' 68 | 69 | # Add In The ScrapeOps Monitoring Extension 70 | EXTENSIONS = { 71 | 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 72 | } 73 | 74 | 75 | DOWNLOADER_MIDDLEWARES = { 76 | 77 | ## ScrapeOps Monitor 78 | 'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 79 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 80 | 81 | ## Proxy Middleware 82 | 'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725, 83 | } 84 | 85 | ``` 86 | 87 | If you are using both the ScrapeOps Proxy & Monitoring then you just need to enter the API key once. 88 | 89 | 90 | ## Running The Scrapers 91 | Make sure Scrapy and the ScrapeOps Monitor are installed: 92 | 93 | ``` 94 | 95 | pip install scrapy scrapeops-scrapy 96 | 97 | ``` 98 | 99 | To run the LinkedIn spiders you should first set the search query parameters you want to scrape, for example by updating the `profile_list` list in the people profile spider: 100 | 101 | ```python 102 | 103 | def start_requests(self): 104 | profile_list = ['reidhoffman'] 105 | for profile in profile_list: 106 | linkedin_people_url = f'https://www.linkedin.com/in/{profile}/' 107 | yield scrapy.Request(url=linkedin_people_url, callback=self.parse_profile, meta={'profile': profile, 'linkedin_url': linkedin_people_url}) 108 | 109 | 110 | ``` 111 | 112 | Then to run a spider, enter its crawl command, for example: 113 | 114 | ``` 115 | 116 | scrapy crawl linkedin_people_profile 117 | 118 | ``` 119 | 120 | 121 | ## Customizing The LinkedIn People Profile Scraper 122 | The following are instructions on how to modify the LinkedIn People Profile scraper for your particular use case. 123 | 124 | Check out this [guide to building a LinkedIn.com Scrapy people profile spider](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-people-scraper/) if you need any more information. 125 | 126 | ### Configuring LinkedIn People Profile Search 127 | To change the query parameters for the people profile search just change the profiles in the `profile_list` list in the spider. 128 | 129 | For example: 130 | 131 | ```python 132 | 133 | def start_requests(self): 134 | profile_list = ['reidhoffman', 'other_person'] 135 | for profile in profile_list: 136 | linkedin_people_url = f'https://www.linkedin.com/in/{profile}/' 137 | yield scrapy.Request(url=linkedin_people_url, callback=self.parse_profile, meta={'profile': profile, 'linkedin_url': linkedin_people_url}) 138 | 139 | ``` 140 | 141 | ### Extract More/Different Data 142 | LinkedIn People Profile pages contain a lot of useful data, however, this spider is configured to only parse: 143 | 144 | - Name 145 | - Description 146 | - Number of followers 147 | - Number of connections 148 | - Location 149 | - About 150 | - Experience - organisation name, organisation profile link, position, start & end dates, description. 151 | - Education - organisation name, organisation profile link, course details, start & end dates, description.
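For reference, a scraped people profile item has roughly the shape below. The field names come from the `parse_profile` method in `linkedin_people_profile.py`; the values are illustrative placeholders only, not real scraped data.

```python
# Illustrative example of a yielded profile item (placeholder values only).
item = {
    'profile': 'reidhoffman',
    'url': 'https://www.linkedin.com/in/reidhoffman/',
    'name': 'Reid Hoffman',
    'description': 'Co-Founder of LinkedIn ...',
    'location': 'United States',
    'followers': '3,000,000',
    'connections': '500+',
    'about': 'Example about text ...',
    'experience': [
        {
            'organisation_profile': 'https://www.linkedin.com/company/example-company',
            'location': 'San Francisco Bay Area',
            'description': 'Example role description ...',
            'start_time': '2020',
            'end_time': 'present',
            'duration': '4 yrs',
        },
    ],
    'education': [
        {
            'organisation': 'Example University',
            'organisation_profile': 'https://www.linkedin.com/school/example-university',
            'course_details': 'MSc, Example Course',
            'description': 'Example activities ...',
            'start_time': '1990',
            'end_time': '1993',
        },
    ],
}
```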
152 | 153 | You can expand or change the data that gets extracted by adding additional parsers and adding the data to the `item` that is yielded in the `parse_profile` method. 154 | 155 | 156 | ### Speeding Up The Crawl 157 | The spiders are set to only use 1 concurrent thread in the ``settings.py`` file as the ScrapeOps Free Proxy Plan only gives you 1 concurrent thread. 158 | 159 | However, if you upgrade to a paid ScrapeOps Proxy plan you will have more concurrent threads. Then you can increase the concurrency limit in your scraper by updating the `CONCURRENT_REQUESTS` value in your ``settings.py`` file. 160 | 161 | ```python 162 | # settings.py 163 | 164 | CONCURRENT_REQUESTS = 10 165 | 166 | ``` 167 | 168 | ### Storing Data 169 | The spiders are set to save the scraped data into a `data` folder using [Scrapy's Feed Export functionality](https://docs.scrapy.org/en/latest/topics/feed-exports.html). For example, to save the data to a CSV file (note that the people profile spider in this repo is configured to save JSON Lines instead): 170 | 171 | ```python 172 | 173 | custom_settings = { 174 | 'FEEDS': { 'data/%(name)s_%(time)s.csv': { 'format': 'csv',}} 175 | } 176 | 177 | ``` 178 | 179 | If you would like to save your CSV files to an AWS S3 bucket then check out our [Saving CSV/JSON Files to Amazon AWS S3 Bucket guide here](https://scrapeops.io/python-scrapy-playbook/scrapy-save-aws-s3). 180 | 181 | Or if you would like to save your data to another type of database then be sure to check out these guides: 182 | 183 | - [Saving Data to JSON](https://scrapeops.io/python-scrapy-playbook/scrapy-save-json-files) 184 | - [Saving Data to SQLite Database](https://scrapeops.io/python-scrapy-playbook/scrapy-save-data-sqlite) 185 | - [Saving Data to MySQL Database](https://scrapeops.io/python-scrapy-playbook/scrapy-save-data-mysql) 186 | - [Saving Data to Postgres Database](https://scrapeops.io/python-scrapy-playbook/scrapy-save-data-postgres) 187 | 188 | ### Deactivating ScrapeOps Proxy & Monitor 189 | To deactivate the ScrapeOps Proxy & Monitor simply comment out the following code in your `settings.py` file: 190 | 191 | ```python 192 | # settings.py 193 | 194 | # SCRAPEOPS_API_KEY = 'YOUR_API_KEY' 195 | 196 | # SCRAPEOPS_PROXY_ENABLED = True 197 | 198 | # EXTENSIONS = { 199 | # 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 200 | # } 201 | 202 | # DOWNLOADER_MIDDLEWARES = { 203 | 204 | # ## ScrapeOps Monitor 205 | # 'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 206 | # 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 207 | 208 | # ## Proxy Middleware 209 | # 'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725, 210 | # } 211 | 212 | 213 | 214 | ``` 215 | 216 | -------------------------------------------------------------------------------- /linkedin/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/python-scrapy-playbook/linkedin-python-scrapy-scraper/c0a40a73258be3633dd5569b8870ba2adff03ea1/linkedin/__init__.py -------------------------------------------------------------------------------- /linkedin/items.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your scraped items 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/items.html 5 | 6 | import scrapy 7 | 8 | 9 | class LinkedinItem(scrapy.Item): 10 | # define the fields for your item here like: 11 | # name = scrapy.Field() 12 | pass 13 | --------------------------------------------------------------------------------
/linkedin/middlewares.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your spider middleware 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 5 | 6 | from scrapy import signals 7 | 8 | # useful for handling different item types with a single interface 9 | from itemadapter import is_item, ItemAdapter 10 | 11 | 12 | class LinkedinSpiderMiddleware: 13 | # Not all methods need to be defined. If a method is not defined, 14 | # scrapy acts as if the spider middleware does not modify the 15 | # passed objects. 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # This method is used by Scrapy to create your spiders. 20 | s = cls() 21 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 22 | return s 23 | 24 | def process_spider_input(self, response, spider): 25 | # Called for each response that goes through the spider 26 | # middleware and into the spider. 27 | 28 | # Should return None or raise an exception. 29 | return None 30 | 31 | def process_spider_output(self, response, result, spider): 32 | # Called with the results returned from the Spider, after 33 | # it has processed the response. 34 | 35 | # Must return an iterable of Request, or item objects. 36 | for i in result: 37 | yield i 38 | 39 | def process_spider_exception(self, response, exception, spider): 40 | # Called when a spider or process_spider_input() method 41 | # (from other spider middleware) raises an exception. 42 | 43 | # Should return either None or an iterable of Request or item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info('Spider opened: %s' % spider.name) 57 | 58 | 59 | class LinkedinDownloaderMiddleware: 60 | # Not all methods need to be defined. If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 
95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info('Spider opened: %s' % spider.name) 104 | -------------------------------------------------------------------------------- /linkedin/pipelines.py: -------------------------------------------------------------------------------- 1 | # Define your item pipelines here 2 | # 3 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 4 | # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html 5 | 6 | 7 | # useful for handling different item types with a single interface 8 | from itemadapter import ItemAdapter 9 | 10 | 11 | class LinkedinPipeline: 12 | def process_item(self, item, spider): 13 | return item 14 | -------------------------------------------------------------------------------- /linkedin/settings.py: -------------------------------------------------------------------------------- 1 | # Scrapy settings for linkedin project 2 | # 3 | # For simplicity, this file contains only settings considered important or 4 | # commonly used. You can find more settings consulting the documentation: 5 | # 6 | # https://docs.scrapy.org/en/latest/topics/settings.html 7 | # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 8 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 9 | 10 | BOT_NAME = 'linkedin' 11 | 12 | SPIDER_MODULES = ['linkedin.spiders'] 13 | NEWSPIDER_MODULE = 'linkedin.spiders' 14 | 15 | # HTTPCACHE_ENABLED = True 16 | 17 | # Obey robots.txt rules 18 | ROBOTSTXT_OBEY = False 19 | 20 | SCRAPEOPS_API_KEY = 'YOUR_API_KEY' 21 | 22 | SCRAPEOPS_PROXY_ENABLED = True 23 | 24 | # Add In The ScrapeOps Monitoring Extension 25 | EXTENSIONS = { 26 | 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 27 | } 28 | 29 | 30 | DOWNLOADER_MIDDLEWARES = { 31 | 32 | ## ScrapeOps Monitor 33 | 'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 34 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 35 | 36 | ## Proxy Middleware 37 | 'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725, 38 | } 39 | 40 | # Max Concurrency On ScrapeOps Proxy Free Plan is 1 thread 41 | CONCURRENT_REQUESTS = 1 -------------------------------------------------------------------------------- /linkedin/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 
5 | -------------------------------------------------------------------------------- /linkedin/spiders/linkedin_company_profile.py: -------------------------------------------------------------------------------- 1 | import json 2 | import scrapy 3 | 4 | class LinkedCompanySpider(scrapy.Spider): 5 | name = "linkedin_company_profile" 6 | api_url = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&start=' 7 | 8 | #add your own list of company urls here 9 | company_pages = [ 10 | 'https://www.linkedin.com/company/usebraintrust?trk=public_jobs_jserp-result_job-search-card-subtitle', 11 | 'https://www.linkedin.com/company/centraprise?trk=public_jobs_jserp-result_job-search-card-subtitle' 12 | ] 13 | 14 | 15 | def start_requests(self): 16 | 17 | company_index_tracker = 0 18 | 19 | #uncomment below if reading the company urls from a file instead of the self.company_pages array 20 | # self.readUrlsFromJobsFile() 21 | 22 | first_url = self.company_pages[company_index_tracker] 23 | 24 | yield scrapy.Request(url=first_url, callback=self.parse_response, meta={'company_index_tracker': company_index_tracker}) 25 | 26 | 27 | def parse_response(self, response): 28 | company_index_tracker = response.meta['company_index_tracker'] 29 | print('***************') 30 | print('****** Scraping page ' + str(company_index_tracker+1) + ' of ' + str(len(self.company_pages))) 31 | print('***************') 32 | 33 | company_item = {} 34 | 35 | company_item['name'] = response.css('.top-card-layout__entity-info h1::text').get(default='not-found').strip() 36 | company_item['summary'] = response.css('.top-card-layout__entity-info h4 span::text').get(default='not-found').strip() 37 | 38 | try: 39 | ## all company details 40 | company_details = response.css('.core-section-container__content .mb-2') 41 | 42 | #industry line 43 | company_industry_line = company_details[1].css('.text-md::text').getall() 44 | company_item['industry'] = company_industry_line[1].strip() 45 | 46 | #company size line 47 | company_size_line = company_details[2].css('.text-md::text').getall() 48 | company_item['size'] = company_size_line[1].strip() 49 | 50 | #company founded 51 | company_size_line = company_details[5].css('.text-md::text').getall() 52 | company_item['founded'] = company_size_line[1].strip() 53 | except IndexError: 54 | print("Error: Skipped Company - Some details missing") 55 | 56 | yield company_item 57 | 58 | 59 | company_index_tracker = company_index_tracker + 1 60 | 61 | if company_index_tracker <= (len(self.company_pages)-1): 62 | next_url = self.company_pages[company_index_tracker] 63 | 64 | yield scrapy.Request(url=next_url, callback=self.parse_response, meta={'company_index_tracker': company_index_tracker}) 65 | 66 | 67 | 68 | 69 | 70 | def readUrlsFromJobsFile(self): 71 | self.company_pages = [] 72 | with open('jobs.json') as file: 73 | jobsFromFile = json.load(file) 74 | 75 | for job in jobsFromFile: 76 | if job['company_link'] != 'not-found': 77 | self.company_pages.append(job['company_link']) 78 | 79 | #remove any duplicate links - to prevent spider from shutting down on duplicate 80 | self.company_pages = list(set(self.company_pages)) 81 | 82 | -------------------------------------------------------------------------------- /linkedin/spiders/linkedin_jobs.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | 3 | class 
LinkedJobsSpider(scrapy.Spider): 4 | name = "linkedin_jobs" 5 | api_url = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&start=' 6 | 7 | def start_requests(self): 8 | first_job_on_page = 0 9 | first_url = self.api_url + str(first_job_on_page) 10 | yield scrapy.Request(url=first_url, callback=self.parse_job, meta={'first_job_on_page': first_job_on_page}) 11 | 12 | 13 | def parse_job(self, response): 14 | first_job_on_page = response.meta['first_job_on_page'] 15 | 16 | job_item = {} 17 | jobs = response.css("li") 18 | 19 | num_jobs_returned = len(jobs) 20 | print("******* Num Jobs Returned *******") 21 | print(num_jobs_returned) 22 | print('*****') 23 | 24 | for job in jobs: 25 | 26 | job_item['job_title'] = job.css("h3::text").get(default='not-found').strip() 27 | job_item['job_detail_url'] = job.css(".base-card__full-link::attr(href)").get(default='not-found').strip() 28 | job_item['job_listed'] = job.css('time::text').get(default='not-found').strip() 29 | 30 | job_item['company_name'] = job.css('h4 a::text').get(default='not-found').strip() 31 | job_item['company_link'] = job.css('h4 a::attr(href)').get(default='not-found') 32 | job_item['company_location'] = job.css('.job-search-card__location::text').get(default='not-found').strip() 33 | yield job_item 34 | 35 | 36 | if num_jobs_returned > 0: 37 | first_job_on_page = int(first_job_on_page) + 25 38 | next_url = self.api_url + str(first_job_on_page) 39 | yield scrapy.Request(url=next_url, callback=self.parse_job, meta={'first_job_on_page': first_job_on_page}) 40 | 41 | 42 | 43 | -------------------------------------------------------------------------------- /linkedin/spiders/linkedin_people_profile.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | 3 | class LinkedInPeopleProfileSpider(scrapy.Spider): 4 | name = "linkedin_people_profile" 5 | 6 | custom_settings = { 7 | 'FEEDS': { 'data/%(name)s_%(time)s.jsonl': { 'format': 'jsonlines',}} 8 | } 9 | 10 | def start_requests(self): 11 | profile_list = ['reidhoffman'] 12 | for profile in profile_list: 13 | linkedin_people_url = f'https://www.linkedin.com/in/{profile}/' 14 | yield scrapy.Request(url=linkedin_people_url, callback=self.parse_profile, meta={'profile': profile, 'linkedin_url': linkedin_people_url}) 15 | 16 | def parse_profile(self, response): 17 | item = {} 18 | item['profile'] = response.meta['profile'] 19 | item['url'] = response.meta['linkedin_url'] 20 | 21 | """ 22 | SUMMARY SECTION 23 | """ 24 | summary_box = response.css("section.top-card-layout") 25 | item['name'] = summary_box.css("h1::text").get().strip() 26 | item['description'] = summary_box.css("h2::text").get().strip() 27 | 28 | ## Location 29 | try: 30 | item['location'] = summary_box.css('div.top-card__subline-item::text').get() 31 | except: 32 | item['location'] = summary_box.css('span.top-card__subline-item::text').get().strip() 33 | if 'followers' in item['location'] or 'connections' in item['location']: 34 | item['location'] = '' 35 | 36 | item['followers'] = '' 37 | item['connections'] = '' 38 | 39 | for span_text in summary_box.css('span.top-card__subline-item::text').getall(): 40 | if 'followers' in span_text: 41 | item['followers'] = span_text.replace(' followers', '').strip() 42 | if 'connections' in span_text: 43 | item['connections'] = span_text.replace(' connections', '').strip() 44 | 45 | 46 | """ 47 | ABOUT SECTION 48 | 
""" 49 | item['about'] = response.css('section.summary div.core-section-container__content p::text').get() 50 | 51 | 52 | """ 53 | EXPERIENCE SECTION 54 | """ 55 | item['experience'] = [] 56 | experience_blocks = response.css('li.experience-item') 57 | for block in experience_blocks: 58 | experience = {} 59 | ## organisation profile url 60 | try: 61 | experience['organisation_profile'] = block.css('h4 a::attr(href)').get().split('?')[0] 62 | except Exception as e: 63 | print('experience --> organisation_profile', e) 64 | experience['organisation_profile'] = '' 65 | 66 | 67 | ## location 68 | try: 69 | experience['location'] = block.css('p.experience-item__location::text').get().strip() 70 | except Exception as e: 71 | print('experience --> location', e) 72 | experience['location'] = '' 73 | 74 | 75 | ## description 76 | try: 77 | experience['description'] = block.css('p.show-more-less-text__text--more::text').get().strip() 78 | except Exception as e: 79 | print('experience --> description', e) 80 | try: 81 | experience['description'] = block.css('p.show-more-less-text__text--less::text').get().strip() 82 | except Exception as e: 83 | print('experience --> description', e) 84 | experience['description'] = '' 85 | 86 | ## time range 87 | try: 88 | date_ranges = block.css('span.date-range time::text').getall() 89 | if len(date_ranges) == 2: 90 | experience['start_time'] = date_ranges[0] 91 | experience['end_time'] = date_ranges[1] 92 | experience['duration'] = block.css('span.date-range__duration::text').get() 93 | elif len(date_ranges) == 1: 94 | experience['start_time'] = date_ranges[0] 95 | experience['end_time'] = 'present' 96 | experience['duration'] = block.css('span.date-range__duration::text').get() 97 | except Exception as e: 98 | print('experience --> time ranges', e) 99 | experience['start_time'] = '' 100 | experience['end_time'] = '' 101 | experience['duration'] = '' 102 | 103 | item['experience'].append(experience) 104 | 105 | 106 | """ 107 | EDUCATION SECTION 108 | """ 109 | item['education'] = [] 110 | education_blocks = response.css('li.education__list-item') 111 | for block in education_blocks: 112 | education = {} 113 | 114 | ## organisation 115 | try: 116 | education['organisation'] = block.css('h3::text').get().strip() 117 | except Exception as e: 118 | print("education --> organisation", e) 119 | education['organisation'] = '' 120 | 121 | 122 | ## organisation profile url 123 | try: 124 | education['organisation_profile'] = block.css('a::attr(href)').get().split('?')[0] 125 | except Exception as e: 126 | print("education --> organisation_profile", e) 127 | education['organisation_profile'] = '' 128 | 129 | ## course details 130 | try: 131 | education['course_details'] = '' 132 | for text in block.css('h4 span::text').getall(): 133 | education['course_details'] = education['course_details'] + text.strip() + ' ' 134 | education['course_details'] = education['course_details'].strip() 135 | except Exception as e: 136 | print("education --> course_details", e) 137 | education['course_details'] = '' 138 | 139 | ## description 140 | try: 141 | education['description'] = block.css('div.education__item--details p::text').get().strip() 142 | except Exception as e: 143 | print("education --> description", e) 144 | education['description'] = '' 145 | 146 | 147 | ## time range 148 | try: 149 | date_ranges = block.css('span.date-range time::text').getall() 150 | if len(date_ranges) == 2: 151 | education['start_time'] = date_ranges[0] 152 | education['end_time'] = date_ranges[1] 153 | 
elif len(date_ranges) == 1: 154 | education['start_time'] = date_ranges[0] 155 | education['end_time'] = 'present' 156 | except Exception as e: 157 | print("education --> time_ranges", e) 158 | education['start_time'] = '' 159 | education['end_time'] = '' 160 | 161 | item['education'].append(education) 162 | 163 | yield item 164 | 165 | 166 | 167 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = linkedin.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = linkedin 12 | --------------------------------------------------------------------------------