├── .gitignore ├── README.md ├── linkedin ├── __init__.py ├── items.py ├── middlewares.py ├── pipelines.py ├── settings.py └── spiders │ ├── __init__.py │ ├── linkedin_company_profile.py │ ├── linkedin_jobs.py │ └── linkedin_people_profile.py └── scrapy.cfg /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | venv/ 30 | 31 | 32 | ## Custom 33 | data/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # linkedin-python-scrapy-scraper 2 | Python Scrapy spiders that scrape job data, people profiles and company profiles from [LinkedIn.com](https://www.linkedin.com/). 3 | 4 | This Scrapy project contains 3 separate spiders: 5 | 6 | | Spider | Description | 7 | |----------|-------------| 8 | | `linkedin_people_profile` | Scrapes people data from LinkedIn people profile pages. | 9 | | `linkedin_jobs` | Scrapes job data from LinkedIn (https://www.linkedin.com/jobs/search) | 10 | | `linkedin_company_profile` | Scrapes company data from LinkedIn company profile pages. | 11 | 12 | 13 | The following articles go through in detail how these LinkedIn spiders were developed, which you can use to understand the spiders and edit them for your own use case. 14 | 15 | - [Python Scrapy: Build A LinkedIn.com People Profile Scraper](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-people-scraper/) 16 | - [Python Scrapy: Build A LinkedIn.com Jobs Scraper](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-jobs-scraper/) 17 | - [Python Scrapy: Build A LinkedIn.com Company Profile Scraper](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-company-scraper/) 18 | 19 | ## ScrapeOps Proxy 20 | These LinkedIn spiders use [ScrapeOps Proxy](https://scrapeops.io/proxy-aggregator/) as the proxy solution. ScrapeOps has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, but it can easily be scaled up to millions of pages per month if need be. 21 | 22 | You can [sign up for a free API key here](https://scrapeops.io/app/register/main). 23 | 24 | To use the ScrapeOps Proxy you need to first install the proxy middleware: 25 | 26 | ``` 27 | 28 | pip install scrapeops-scrapy-proxy-sdk 29 | 30 | ``` 31 | 32 | Then activate the ScrapeOps Proxy by adding your API key to the `SCRAPEOPS_API_KEY` setting in the ``settings.py`` file: 33 | 34 | ```python 35 | 36 | SCRAPEOPS_API_KEY = 'YOUR_API_KEY' 37 | 38 | SCRAPEOPS_PROXY_ENABLED = True 39 | 40 | DOWNLOADER_MIDDLEWARES = { 41 | 'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725, 42 | } 43 | 44 | ``` 45 | 46 | 47 | ## ScrapeOps Monitoring 48 | To monitor the scrapers, this project uses the [ScrapeOps Monitor](https://scrapeops.io/monitoring-scheduling/), a free monitoring tool specifically designed for web scraping.
49 | 50 | **Live demo here:** [ScrapeOps Demo](https://scrapeops.io/app/login/demo) 51 | 52 | ![ScrapeOps Dashboard](https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png) 53 | 54 | To use the ScrapeOps Monitor you need to first install the monitoring SDK: 55 | 56 | ``` 57 | 58 | pip install scrapeops-scrapy 59 | 60 | ``` 61 | 62 | 63 | Then activate the ScrapeOps Monitor by adding your API key to the `SCRAPEOPS_API_KEY` setting in the ``settings.py`` file: 64 | 65 | ```python 66 | 67 | SCRAPEOPS_API_KEY = 'YOUR_API_KEY' 68 | 69 | # Add In The ScrapeOps Monitoring Extension 70 | EXTENSIONS = { 71 | 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 72 | } 73 | 74 | 75 | DOWNLOADER_MIDDLEWARES = { 76 | 77 | ## ScrapeOps Monitor 78 | 'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 79 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 80 | 81 | ## Proxy Middleware 82 | 'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725, 83 | } 84 | 85 | ``` 86 | 87 | If you are using both the ScrapeOps Proxy & Monitoring then you just need to enter the API key once. 88 | 89 | 90 | ## Running The Scrapers 91 | Make sure Scrapy and the ScrapeOps Monitor are installed: 92 | 93 | ``` 94 | 95 | pip install scrapy scrapeops-scrapy 96 | 97 | ``` 98 | 99 | To run the LinkedIn spiders you should first set the search query parameters you want to scrape, for example by updating the `profile_list` list in the people profile spider: 100 | 101 | ```python 102 | 103 | def start_requests(self): 104 | profile_list = ['reidhoffman'] 105 | for profile in profile_list: 106 | linkedin_people_url = f'https://www.linkedin.com/in/{profile}/' 107 | yield scrapy.Request(url=linkedin_people_url, callback=self.parse_profile, meta={'profile': profile, 'linkedin_url': linkedin_people_url}) 108 | 109 | 110 | ``` 111 | 112 | Then to run a spider, enter its crawl command, for example: 113 | 114 | ``` 115 | 116 | scrapy crawl linkedin_people_profile 117 | 118 | ``` 119 | 120 | 121 | ## Customizing The LinkedIn People Profile Scraper 122 | The following are instructions on how to modify the LinkedIn People Profile scraper for your particular use case. 123 | 124 | Check out this [guide to building a LinkedIn.com Scrapy people profile spider](https://scrapeops.io/python-scrapy-playbook/python-scrapy-linkedin-people-scraper/) if you need any more information. 125 | 126 | ### Configuring LinkedIn People Profile Search 127 | To change the query parameters for the people profile search just change the profiles in the `profile_list` list in the spider. 128 | 129 | For example: 130 | 131 | ```python 132 | 133 | def start_requests(self): 134 | profile_list = ['reidhoffman', 'other_person'] 135 | for profile in profile_list: 136 | linkedin_people_url = f'https://www.linkedin.com/in/{profile}/' 137 | yield scrapy.Request(url=linkedin_people_url, callback=self.parse_profile, meta={'profile': profile, 'linkedin_url': linkedin_people_url}) 138 | 139 | ``` 140 | 141 | ### Extract More/Different Data 142 | LinkedIn People Profile pages contain a lot of useful data, however, this spider is configured to only parse: 143 | 144 | - Name 145 | - Description 146 | - Number of followers 147 | - Number of connections 148 | - Location 149 | - About 150 | - Experience - organisation name, organisation profile link, position, start & end dates, description. 151 | - Education - organisation name, organisation profile link, course details, start & end dates, description.
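For reference, a scraped people profile item has roughly the shape below. The field names come from the `parse_profile` method in `linkedin_people_profile.py`; the values are illustrative placeholders only, not real scraped data.

```python
# Illustrative example of a yielded profile item (placeholder values only).
item = {
    'profile': 'reidhoffman',
    'url': 'https://www.linkedin.com/in/reidhoffman/',
    'name': 'Reid Hoffman',
    'description': 'Co-Founder of LinkedIn ...',
    'location': 'United States',
    'followers': '3,000,000',
    'connections': '500+',
    'about': 'Example about text ...',
    'experience': [
        {
            'organisation_profile': 'https://www.linkedin.com/company/example-company',
            'location': 'San Francisco Bay Area',
            'description': 'Example role description ...',
            'start_time': '2020',
            'end_time': 'present',
            'duration': '4 yrs',
        },
    ],
    'education': [
        {
            'organisation': 'Example University',
            'organisation_profile': 'https://www.linkedin.com/school/example-university',
            'course_details': 'MSc, Example Course',
            'description': 'Example activities ...',
            'start_time': '1990',
            'end_time': '1993',
        },
    ],
}
```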
152 | 153 | You can expand or change the data that gets extracted by adding additional parsers and adding the data to the `item` that is yielded in the `parse_profile` method. 154 | 155 | 156 | ### Speeding Up The Crawl 157 | The spiders are set to only use 1 concurrent thread in the ``settings.py`` file as the ScrapeOps Free Proxy Plan only gives you 1 concurrent thread. 158 | 159 | However, if you upgrade to a paid ScrapeOps Proxy plan you will have more concurrent threads. Then you can increase the concurrency limit in your scraper by updating the `CONCURRENT_REQUESTS` value in your ``settings.py`` file. 160 | 161 | ```python 162 | # settings.py 163 | 164 | CONCURRENT_REQUESTS = 10 165 | 166 | ``` 167 | 168 | ### Storing Data 169 | The spiders are set to save the scraped data into a `data` folder using [Scrapy's Feed Export functionality](https://docs.scrapy.org/en/latest/topics/feed-exports.html). For example, to save the data to a CSV file (note that the people profile spider in this repo is configured to save JSON Lines instead): 170 | 171 | ```python 172 | 173 | custom_settings = { 174 | 'FEEDS': { 'data/%(name)s_%(time)s.csv': { 'format': 'csv',}} 175 | } 176 | 177 | ``` 178 | 179 | If you would like to save your CSV files to an AWS S3 bucket then check out our [Saving CSV/JSON Files to Amazon AWS S3 Bucket guide here](https://scrapeops.io/python-scrapy-playbook/scrapy-save-aws-s3). 180 | 181 | Or if you would like to save your data to another type of database then be sure to check out these guides: 182 | 183 | - [Saving Data to JSON](https://scrapeops.io/python-scrapy-playbook/scrapy-save-json-files) 184 | - [Saving Data to SQLite Database](https://scrapeops.io/python-scrapy-playbook/scrapy-save-data-sqlite) 185 | - [Saving Data to MySQL Database](https://scrapeops.io/python-scrapy-playbook/scrapy-save-data-mysql) 186 | - [Saving Data to Postgres Database](https://scrapeops.io/python-scrapy-playbook/scrapy-save-data-postgres) 187 | 188 | ### Deactivating ScrapeOps Proxy & Monitor 189 | To deactivate the ScrapeOps Proxy & Monitor simply comment out the following code in your `settings.py` file: 190 | 191 | ```python 192 | # settings.py 193 | 194 | # SCRAPEOPS_API_KEY = 'YOUR_API_KEY' 195 | 196 | # SCRAPEOPS_PROXY_ENABLED = True 197 | 198 | # EXTENSIONS = { 199 | # 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 200 | # } 201 | 202 | # DOWNLOADER_MIDDLEWARES = { 203 | 204 | # ## ScrapeOps Monitor 205 | # 'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 206 | # 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 207 | 208 | # ## Proxy Middleware 209 | # 'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725, 210 | # } 211 | 212 | 213 | 214 | ``` 215 | 216 | -------------------------------------------------------------------------------- /linkedin/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/python-scrapy-playbook/linkedin-python-scrapy-scraper/c0a40a73258be3633dd5569b8870ba2adff03ea1/linkedin/__init__.py -------------------------------------------------------------------------------- /linkedin/items.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your scraped items 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/items.html 5 | 6 | import scrapy 7 | 8 | 9 | class LinkedinItem(scrapy.Item): 10 | # define the fields for your item here like: 11 | # name = scrapy.Field() 12 | pass 13 | --------------------------------------------------------------------------------
/linkedin/middlewares.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your spider middleware 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 5 | 6 | from scrapy import signals 7 | 8 | # useful for handling different item types with a single interface 9 | from itemadapter import is_item, ItemAdapter 10 | 11 | 12 | class LinkedinSpiderMiddleware: 13 | # Not all methods need to be defined. If a method is not defined, 14 | # scrapy acts as if the spider middleware does not modify the 15 | # passed objects. 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # This method is used by Scrapy to create your spiders. 20 | s = cls() 21 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 22 | return s 23 | 24 | def process_spider_input(self, response, spider): 25 | # Called for each response that goes through the spider 26 | # middleware and into the spider. 27 | 28 | # Should return None or raise an exception. 29 | return None 30 | 31 | def process_spider_output(self, response, result, spider): 32 | # Called with the results returned from the Spider, after 33 | # it has processed the response. 34 | 35 | # Must return an iterable of Request, or item objects. 36 | for i in result: 37 | yield i 38 | 39 | def process_spider_exception(self, response, exception, spider): 40 | # Called when a spider or process_spider_input() method 41 | # (from other spider middleware) raises an exception. 42 | 43 | # Should return either None or an iterable of Request or item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info('Spider opened: %s' % spider.name) 57 | 58 | 59 | class LinkedinDownloaderMiddleware: 60 | # Not all methods need to be defined. If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 
95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info('Spider opened: %s' % spider.name) 104 | -------------------------------------------------------------------------------- /linkedin/pipelines.py: -------------------------------------------------------------------------------- 1 | # Define your item pipelines here 2 | # 3 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 4 | # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html 5 | 6 | 7 | # useful for handling different item types with a single interface 8 | from itemadapter import ItemAdapter 9 | 10 | 11 | class LinkedinPipeline: 12 | def process_item(self, item, spider): 13 | return item 14 | -------------------------------------------------------------------------------- /linkedin/settings.py: -------------------------------------------------------------------------------- 1 | # Scrapy settings for linkedin project 2 | # 3 | # For simplicity, this file contains only settings considered important or 4 | # commonly used. You can find more settings consulting the documentation: 5 | # 6 | # https://docs.scrapy.org/en/latest/topics/settings.html 7 | # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 8 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 9 | 10 | BOT_NAME = 'linkedin' 11 | 12 | SPIDER_MODULES = ['linkedin.spiders'] 13 | NEWSPIDER_MODULE = 'linkedin.spiders' 14 | 15 | # HTTPCACHE_ENABLED = True 16 | 17 | # Obey robots.txt rules 18 | ROBOTSTXT_OBEY = False 19 | 20 | SCRAPEOPS_API_KEY = 'YOUR_API_KEY' 21 | 22 | SCRAPEOPS_PROXY_ENABLED = True 23 | 24 | # Add In The ScrapeOps Monitoring Extension 25 | EXTENSIONS = { 26 | 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 27 | } 28 | 29 | 30 | DOWNLOADER_MIDDLEWARES = { 31 | 32 | ## ScrapeOps Monitor 33 | 'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 34 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 35 | 36 | ## Proxy Middleware 37 | 'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725, 38 | } 39 | 40 | # Max Concurrency On ScrapeOps Proxy Free Plan is 1 thread 41 | CONCURRENT_REQUESTS = 1 -------------------------------------------------------------------------------- /linkedin/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 
5 | -------------------------------------------------------------------------------- /linkedin/spiders/linkedin_company_profile.py: -------------------------------------------------------------------------------- 1 | import json 2 | import scrapy 3 | 4 | class LinkedCompanySpider(scrapy.Spider): 5 | name = "linkedin_company_profile" 6 | api_url = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&start=' 7 | 8 | #add your own list of company urls here 9 | company_pages = [ 10 | 'https://www.linkedin.com/company/usebraintrust?trk=public_jobs_jserp-result_job-search-card-subtitle', 11 | 'https://www.linkedin.com/company/centraprise?trk=public_jobs_jserp-result_job-search-card-subtitle' 12 | ] 13 | 14 | 15 | def start_requests(self): 16 | 17 | company_index_tracker = 0 18 | 19 | #uncomment below if reading the company urls from a file instead of the self.company_pages array 20 | # self.readUrlsFromJobsFile() 21 | 22 | first_url = self.company_pages[company_index_tracker] 23 | 24 | yield scrapy.Request(url=first_url, callback=self.parse_response, meta={'company_index_tracker': company_index_tracker}) 25 | 26 | 27 | def parse_response(self, response): 28 | company_index_tracker = response.meta['company_index_tracker'] 29 | print('***************') 30 | print('****** Scraping page ' + str(company_index_tracker+1) + ' of ' + str(len(self.company_pages))) 31 | print('***************') 32 | 33 | company_item = {} 34 | 35 | company_item['name'] = response.css('.top-card-layout__entity-info h1::text').get(default='not-found').strip() 36 | company_item['summary'] = response.css('.top-card-layout__entity-info h4 span::text').get(default='not-found').strip() 37 | 38 | try: 39 | ## all company details 40 | company_details = response.css('.core-section-container__content .mb-2') 41 | 42 | #industry line 43 | company_industry_line = company_details[1].css('.text-md::text').getall() 44 | company_item['industry'] = company_industry_line[1].strip() 45 | 46 | #company size line 47 | company_size_line = company_details[2].css('.text-md::text').getall() 48 | company_item['size'] = company_size_line[1].strip() 49 | 50 | #company founded 51 | company_size_line = company_details[5].css('.text-md::text').getall() 52 | company_item['founded'] = company_size_line[1].strip() 53 | except IndexError: 54 | print("Error: Skipped Company - Some details missing") 55 | 56 | yield company_item 57 | 58 | 59 | company_index_tracker = company_index_tracker + 1 60 | 61 | if company_index_tracker <= (len(self.company_pages)-1): 62 | next_url = self.company_pages[company_index_tracker] 63 | 64 | yield scrapy.Request(url=next_url, callback=self.parse_response, meta={'company_index_tracker': company_index_tracker}) 65 | 66 | 67 | 68 | 69 | 70 | def readUrlsFromJobsFile(self): 71 | self.company_pages = [] 72 | with open('jobs.json') as file: 73 | jobsFromFile = json.load(file) 74 | 75 | for job in jobsFromFile: 76 | if job['company_link'] != 'not-found': 77 | self.company_pages.append(job['company_link']) 78 | 79 | #remove any duplicate links - to prevent spider from shutting down on duplicate 80 | self.company_pages = list(set(self.company_pages)) 81 | 82 | -------------------------------------------------------------------------------- /linkedin/spiders/linkedin_jobs.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | 3 | class 
LinkedJobsSpider(scrapy.Spider): 4 | name = "linkedin_jobs" 5 | api_url = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&location=United%2BStates&geoId=103644278&trk=public_jobs_jobs-search-bar_search-submit&start=' 6 | 7 | def start_requests(self): 8 | first_job_on_page = 0 9 | first_url = self.api_url + str(first_job_on_page) 10 | yield scrapy.Request(url=first_url, callback=self.parse_job, meta={'first_job_on_page': first_job_on_page}) 11 | 12 | 13 | def parse_job(self, response): 14 | first_job_on_page = response.meta['first_job_on_page'] 15 | 16 | job_item = {} 17 | jobs = response.css("li") 18 | 19 | num_jobs_returned = len(jobs) 20 | print("******* Num Jobs Returned *******") 21 | print(num_jobs_returned) 22 | print('*****') 23 | 24 | for job in jobs: 25 | 26 | job_item['job_title'] = job.css("h3::text").get(default='not-found').strip() 27 | job_item['job_detail_url'] = job.css(".base-card__full-link::attr(href)").get(default='not-found').strip() 28 | job_item['job_listed'] = job.css('time::text').get(default='not-found').strip() 29 | 30 | job_item['company_name'] = job.css('h4 a::text').get(default='not-found').strip() 31 | job_item['company_link'] = job.css('h4 a::attr(href)').get(default='not-found') 32 | job_item['company_location'] = job.css('.job-search-card__location::text').get(default='not-found').strip() 33 | yield job_item 34 | 35 | 36 | if num_jobs_returned > 0: 37 | first_job_on_page = int(first_job_on_page) + 25 38 | next_url = self.api_url + str(first_job_on_page) 39 | yield scrapy.Request(url=next_url, callback=self.parse_job, meta={'first_job_on_page': first_job_on_page}) 40 | 41 | 42 | 43 | -------------------------------------------------------------------------------- /linkedin/spiders/linkedin_people_profile.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | 3 | class LinkedInPeopleProfileSpider(scrapy.Spider): 4 | name = "linkedin_people_profile" 5 | 6 | custom_settings = { 7 | 'FEEDS': { 'data/%(name)s_%(time)s.jsonl': { 'format': 'jsonlines',}} 8 | } 9 | 10 | def start_requests(self): 11 | profile_list = ['reidhoffman'] 12 | for profile in profile_list: 13 | linkedin_people_url = f'https://www.linkedin.com/in/{profile}/' 14 | yield scrapy.Request(url=linkedin_people_url, callback=self.parse_profile, meta={'profile': profile, 'linkedin_url': linkedin_people_url}) 15 | 16 | def parse_profile(self, response): 17 | item = {} 18 | item['profile'] = response.meta['profile'] 19 | item['url'] = response.meta['linkedin_url'] 20 | 21 | """ 22 | SUMMARY SECTION 23 | """ 24 | summary_box = response.css("section.top-card-layout") 25 | item['name'] = summary_box.css("h1::text").get().strip() 26 | item['description'] = summary_box.css("h2::text").get().strip() 27 | 28 | ## Location 29 | try: 30 | item['location'] = summary_box.css('div.top-card__subline-item::text').get() 31 | except: 32 | item['location'] = summary_box.css('span.top-card__subline-item::text').get().strip() 33 | if 'followers' in item['location'] or 'connections' in item['location']: 34 | item['location'] = '' 35 | 36 | item['followers'] = '' 37 | item['connections'] = '' 38 | 39 | for span_text in summary_box.css('span.top-card__subline-item::text').getall(): 40 | if 'followers' in span_text: 41 | item['followers'] = span_text.replace(' followers', '').strip() 42 | if 'connections' in span_text: 43 | item['connections'] = span_text.replace(' connections', '').strip() 44 | 45 | 46 | """ 47 | ABOUT SECTION 48 | 
""" 49 | item['about'] = response.css('section.summary div.core-section-container__content p::text').get() 50 | 51 | 52 | """ 53 | EXPERIENCE SECTION 54 | """ 55 | item['experience'] = [] 56 | experience_blocks = response.css('li.experience-item') 57 | for block in experience_blocks: 58 | experience = {} 59 | ## organisation profile url 60 | try: 61 | experience['organisation_profile'] = block.css('h4 a::attr(href)').get().split('?')[0] 62 | except Exception as e: 63 | print('experience --> organisation_profile', e) 64 | experience['organisation_profile'] = '' 65 | 66 | 67 | ## location 68 | try: 69 | experience['location'] = block.css('p.experience-item__location::text').get().strip() 70 | except Exception as e: 71 | print('experience --> location', e) 72 | experience['location'] = '' 73 | 74 | 75 | ## description 76 | try: 77 | experience['description'] = block.css('p.show-more-less-text__text--more::text').get().strip() 78 | except Exception as e: 79 | print('experience --> description', e) 80 | try: 81 | experience['description'] = block.css('p.show-more-less-text__text--less::text').get().strip() 82 | except Exception as e: 83 | print('experience --> description', e) 84 | experience['description'] = '' 85 | 86 | ## time range 87 | try: 88 | date_ranges = block.css('span.date-range time::text').getall() 89 | if len(date_ranges) == 2: 90 | experience['start_time'] = date_ranges[0] 91 | experience['end_time'] = date_ranges[1] 92 | experience['duration'] = block.css('span.date-range__duration::text').get() 93 | elif len(date_ranges) == 1: 94 | experience['start_time'] = date_ranges[0] 95 | experience['end_time'] = 'present' 96 | experience['duration'] = block.css('span.date-range__duration::text').get() 97 | except Exception as e: 98 | print('experience --> time ranges', e) 99 | experience['start_time'] = '' 100 | experience['end_time'] = '' 101 | experience['duration'] = '' 102 | 103 | item['experience'].append(experience) 104 | 105 | 106 | """ 107 | EDUCATION SECTION 108 | """ 109 | item['education'] = [] 110 | education_blocks = response.css('li.education__list-item') 111 | for block in education_blocks: 112 | education = {} 113 | 114 | ## organisation 115 | try: 116 | education['organisation'] = block.css('h3::text').get().strip() 117 | except Exception as e: 118 | print("education --> organisation", e) 119 | education['organisation'] = '' 120 | 121 | 122 | ## organisation profile url 123 | try: 124 | education['organisation_profile'] = block.css('a::attr(href)').get().split('?')[0] 125 | except Exception as e: 126 | print("education --> organisation_profile", e) 127 | education['organisation_profile'] = '' 128 | 129 | ## course details 130 | try: 131 | education['course_details'] = '' 132 | for text in block.css('h4 span::text').getall(): 133 | education['course_details'] = education['course_details'] + text.strip() + ' ' 134 | education['course_details'] = education['course_details'].strip() 135 | except Exception as e: 136 | print("education --> course_details", e) 137 | education['course_details'] = '' 138 | 139 | ## description 140 | try: 141 | education['description'] = block.css('div.education__item--details p::text').get().strip() 142 | except Exception as e: 143 | print("education --> description", e) 144 | education['description'] = '' 145 | 146 | 147 | ## time range 148 | try: 149 | date_ranges = block.css('span.date-range time::text').getall() 150 | if len(date_ranges) == 2: 151 | education['start_time'] = date_ranges[0] 152 | education['end_time'] = date_ranges[1] 153 | 
elif len(date_ranges) == 1: 154 | education['start_time'] = date_ranges[0] 155 | education['end_time'] = 'present' 156 | except Exception as e: 157 | print("education --> time_ranges", e) 158 | education['start_time'] = '' 159 | education['end_time'] = '' 160 | 161 | item['education'].append(education) 162 | 163 | yield item 164 | 165 | 166 | 167 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = linkedin.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = linkedin 12 | --------------------------------------------------------------------------------