├── softskills.txt
├── hardskills.txt
├── Keyword_Extractor.py
└── README.md

--------------------------------------------------------------------------------
/softskills.txt:
--------------------------------------------------------------------------------
Integrity
Highly Motivated
Motivated
Attention to detail
Attention
Collaboration
Collaborative
Education
Training
Problem solving
Professionalism
Quantitative
Instruction
Problem Solving Skills
Problem Solving
Problem-solving
Best Practices
Innovative
teamwork
team work
team player
team
Listen
Good Listener
Ability
Accept Feedback
Adaptable
Artistic Sense
Assertive
Attentive
Business Storytelling
Business Trend Awareness
Collaborating
Communication
Competitive
Confident
Confidence
Creative Thinker
Analytical Thinker
Creative
Conflict Management
Conflict Resolution
Cooperative
Courteous
Crisis Management
Critical Observer
Critical Thinker
Customer Service
Deal Making
Deal with Difficult Situations
Deal with Office Politics
Deals with Difficult People
Decision Making
Dedicated
Delegation
Dependable
Design Sense
Desire to Learn
Disability Awareness
Dispute Resolution
Diversity Awareness
Effective Communicator
Emotion Management
Emotional Intelligence
Empathetic
Energetic
Enthusiastic
Ergonomic Sensitivity
Establish Interpersonal Relationships
Dealing with Difficult Personalities
Experience
Excel
Facilitating
Flexible
Flexibility
Follow Instructions
Follow Regulations
Follow Rules
Friendly
Functions Well Under Pressure
Giving Feedback
Good at Networking
Good at Storytelling
Good Attitude
High Energy
Highly Organized
Highly Recommended
Honest
Independent
Influence
Persuasive
Innovator
Innovative
Innovation
Inspiring
Intercultural Competence
Interpersonal
Interviewing
Knowledge Management
Leadership
Listening
Logical Thinking
Make Deadlines
Management
Managing Difficult Conversations
Managing Remote Teams
Managing Virtual Teams
Meeting Management
Management Skills
Mentoring
Merit
Motivated
Motivating
Multitasking
Negotiation
Nonverbal Communication
Organization
Patience
performance
Perform Effectively
Performance Management
Perseverance
Persistence
Persuasion
Physical Communication
Planning
Positive Work Ethic
Possess Business Ethics
Presentation
Problem Solving
Process Improvement
Proper Business Etiquette
Proactive
Public Speaking
Punctual
Quick-witted
Read Body Language
Reliable
Research
Resilient
Resolving Issues
Resourcefulness
Respectful
Respectable
Results Oriented
Safety Conscious
Scheduling
Self-awareness
Self-directed
Self-monitoring
Self-supervising
self-starter
Selling Skills
Sense of Humor
Social
Stay on Task
Strategic Planning
Stress Management
Successful Coach
Supervising
Take Criticism
Talent Management
Team Building
Team Player
Technology Savvy
Technology Trend Awareness
Thinks Outside the Box
Time Management
Tolerant of Change and Uncertainty
Train the Trainer
Trainable
Training
Troubleshooter
Value Education
Verbal Communication
Visual Communication
Well Groomed
Willing to Accept Feedback
Willingness to Learn
Work Well Under Pressure
Work-Life Balance
Writing Experience
Writing Reports and Proposals
Writing Skills

--------------------------------------------------------------------------------
/hardskills.txt:
--------------------------------------------------------------------------------
develop
maintain
debug
test
computer programs
Programming
Programmer
Microsoft
Microsoft Excel
MS Excel
Microsoft Office
MS Office
Software Development
HTML
Retention
SQL
Modeling
Modelling
Analytics
Apache
Apache Airflow
Apache Impala
Apache Drill
Apache Hadoop
Data
Certification
Data Collection
Datasets
Business Requirements
Data Mining
Data Science
Visualization
Technical Guidance
Client Analytics
Programming Skills
Sql Server
Computer Science
Statistical Modeling
Applied Data Science
Hiring
Technical
Database
Education
R
C
C++
C#
Ruby
Ruby on Rails
Weka
Matlab
Django
NetBeans
IDE
stochastic
Marketing
Mining
Mathematics
Forecasts
Statistics
Programming
python
Python
Microsoft Sql Server
MS Sql Server
NoSql
No-Sql
Hadoop
Spark
Java
Algorithms
Databases
Numpy
Pandas
scikit-learn
Scikit
clustering
classification
neural networks
neural network
tensorflow
pytorch
theano
keras
Pig
Adaboost
Statistics
Statistical analysis
machine learning
data mining
data science
data analytics
data analysis
regression
kmeans
k-means
kNN
Bayes
Bayesian Probability
Bayesian Estimation
Bayesian Network
Forest
Random Forest
Decision Tree
Matrix
Matrix Factorization
SVD
Outlier
Outlier detection
Regression Analysis
Frequent Itemset Mining
Classification Analysis
Backpropagation
Sample
LogitBoost
Time Series
Stochastic Gradient Descent
Gradient Descent
PCA
Principal Component Analysis
Dynamic
Dynamic programming
Clustering
Classification
Data-driven
Algorithms
Analysis
Analytical
Analytics
Analyze Data
Applications
Application Development
Application Development Methodologies
Application Development Techniques
Application Development Tools
Application Programming Interfaces
AWS
AWS Glue
Architecture
AROS
Ars Based Programming
Aspect Oriented Programming
Best Practices
Browsers
CASE Tools
Capital management
Code
Coding
Collaboration
Communication
Components
Computer Platforms
Concurrent Programming
Computer Science
Computational complexity
Constraint-based Programming
Customer Service
Database Management Systems
DBMS
Database Techniques
Databases
Database
Data
Data Analytics
Data Structures
Debugging
Design
Design Patterns
Development
Development Tools
Distributed Computing
Dimensionality Reduction
Documentation
Embedded Hardware
Emerging Technologies
Fourth Generation Languages
Hardware
HTML Authoring Tools
HTML Conversion Tools
Industry Systems
iOS
Information Systems
Implementation
Interface with Clients
Interface with Vendors
Internet
Languages
Linux
Logic
MacOS
Math
Mobile
Multimedia
Multi-Tasking
MXNet
Object oriented programming
object oriented
Operating Systems
Optimizing
Organizational
OS Programming
Parallel Processing
Personal
Physics
Planning
Post Object Programming
Presto
Problem Solving
Programming Languages
Programming Methodologies
Quality Control
Relational Databases
Relational Programming
Reporting
Revision Control
Self-Motivation
Software
Structured Query Language (SQL)
Symbolic Programming
System Architecture
System Development
System Design
System Programming
System Testing
Teamwork
Technical
Testing
Third Generation Languages
Troubleshooting
UNIX
Use Logical Reasoning
Web
Web Applications
Web Platforms
Web Services
Windowing Systems
Windows
Workstations

--------------------------------------------------------------------------------
/Keyword_Extractor.py:
--------------------------------------------------------------------------------
import re
import pandas as pd
import sys, os
import numpy as np
import nltk  # POS tagging below needs the 'punkt' and 'averaged_perceptron_tagger' data packages
import operator
import math


class Extractor:
    def __init__(self):
        self.softskills = self.load_skills('softskills.txt')
        self.hardskills = self.load_skills('hardskills.txt')
        self.jb_distribution = self.build_ngram_distribution(sys.argv[-2])
        self.cv_distribution = self.build_ngram_distribution(sys.argv[-1])
        self.table = []
        self.outFile = "Extracted_keywords.csv"

    def load_skills(self, filename):
        f = open(filename, 'r')
        skills = []
        for line in f:
            # remove punctuation and upper case
            skills.append(self.clean_phrase(line))
        f.close()
        return list(set(skills))  # remove duplicates

    def build_ngram_distribution(self, filename):
        n_s = [1, 2, 3]  # mono-, bi-, and tri-grams
        dist = {}
        for n in n_s:
            dist.update(self.parse_file(filename, n))
        return dist

    def parse_file(self, filename, n):
        f = open(filename, 'r')
        results = {}
        for line in f:
            words = self.clean_phrase(line).split(" ")
            ngrams = self.ngrams(words, n)
            for tup in ngrams:
                phrase = " ".join(tup)
                if phrase in results:
                    results[phrase] += 1
                else:
                    results[phrase] = 1
        f.close()
        return results

    def clean_phrase(self, line):
        # lowercase, drop newlines/tabs, then strip punctuation
        return re.sub(r'[^\w\s]', '', line.replace('\n', '').replace('\t', '').lower())

    def ngrams(self, input_list, n):
        return list(zip(*[input_list[i:] for i in range(n)]))

    def measure1(self, v1, v2):
        return v1 - v2

    def measure2(self, v1, v2):
        return max(v1 - v2, 0)

    def measure3(self, v1, v2):
        # cosine similarity of v1 and v2: (v1 dot v2)/(||v1||*||v2||)
        sumxx, sumxy, sumyy = 0, 0, 0
        for i in range(len(v1)):
            x = v1[i]
            y = v2[i]
            sumxx += x * x
            sumyy += y * y
            sumxy += x * y
        return sumxy / math.sqrt(sumxx * sumyy)

    def sendToFile(self):
        try:
            os.remove(self.outFile)
        except OSError:
            pass
        df = pd.DataFrame(self.table, columns=['type', 'skill', 'job', 'cv', 'm1', 'm2'])
        df_sorted = df.sort_values(by=['job', 'cv'], ascending=[False, False])
        # write all 6 columns plus the index, as described in the README
        df_sorted.to_csv(self.outFile)

    def printMeasures(self):
        n_rows = len(self.table)
        v1 = [self.table[row][4] for row in range(n_rows)]
        v2 = [self.table[row][5] for row in range(n_rows)]
        print("Measure 1: " + str(sum(v1)))
        print("Measure 2: " + str(sum(v2)))

        v1 = [self.table[row][2] for row in range(n_rows)]
        v2 = [self.table[row][3] for row in range(n_rows)]
        print("Measure 3 (cosine sim): " + str(self.measure3(v1, v2)))

        for skill_type in ['hard', 'soft', 'general']:
            v1 = [self.table[row][2] for row in range(n_rows) if self.table[row][0] == skill_type]
            v2 = [self.table[row][3] for row in range(n_rows) if self.table[row][0] == skill_type]
            print("Cosine similarity for " + skill_type + " skills: " + str(self.measure3(v1, v2)))
    def makeTable(self):
        # I am interested in verbs, nouns, adverbs, and adjectives
        parts_of_speech = ['CD', 'JJ', 'JJR', 'JJS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS',
                           'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
        graylist = ["you", "will"]
        tmp_table = []
        # check whether each skill is mentioned in the job description and then in your cv

        for skill in self.hardskills:
            if skill in self.jb_distribution:
                count_jb = self.jb_distribution[skill]
                count_cv = self.cv_distribution.get(skill, 0)
                m1 = self.measure1(count_jb, count_cv)
                m2 = self.measure2(count_jb, count_cv)
                tmp_table.append(['hard', skill, count_jb, count_cv, m1, m2])

        for skill in self.softskills:
            if skill in self.jb_distribution:
                count_jb = self.jb_distribution[skill]
                count_cv = self.cv_distribution.get(skill, 0)
                m1 = self.measure1(count_jb, count_cv)
                m2 = self.measure2(count_jb, count_cv)
                tmp_table.append(['soft', skill, count_jb, count_cv, m1, m2])

        # And now for the general language of the job description:
        # sort the distribution by the phrases most used in the job description
        general_language = sorted(self.jb_distribution.items(), key=operator.itemgetter(1), reverse=True)
        for phrase, count_jb in general_language:
            if phrase in self.hardskills or phrase in self.softskills or phrase in graylist:
                continue
            tokens = nltk.word_tokenize(phrase)
            parts = nltk.pos_tag(tokens)
            # keep the phrase only if every word carries a relevant POS tag
            if all(tag in parts_of_speech for _, tag in parts):
                count_cv = self.cv_distribution.get(phrase, 0)
                m1 = self.measure1(count_jb, count_cv)
                m2 = self.measure2(count_jb, count_cv)
                tmp_table.append(['general', phrase, count_jb, count_cv, m1, m2])
        self.table = tmp_table
def main():
    K = Extractor()
    K.makeTable()
    K.sendToFile()
    K.printMeasures()


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
The process of looking for a new job is in itself a full-time job. There is an overwhelming number of job postings, and it is often difficult to determine which jobs best match your skills and preferences. It is important to research a job before applying, and even more so before accepting an offer. On the other hand, job postings also receive an overwhelming number of applications, and it is well-known[citation needed] that companies use Applicant Tracking Systems (ATS) to weed out candidates who are not a good match for the job. The basic idea is that the ATS searches the applicant's resume for predetermined keywords and for particular language (mostly hard and soft skills) and keeps only the resumes with the highest match. For this reason, each time you apply for a job, you should tailor your resume to emphasize the things the company is looking for. For example, if I am applying for a data science job, I should emphasize the work I have done with data, while if it is a machine learning job, I should emphasize the machine learning algorithms I have used. This all sounds simple enough, and it might seem that you just need a small set of resumes and a different one for each type of job. However, job postings are all different; some use certain words more than others, and your resume should match that language too.

To help job seekers "beat" the ATS, I have developed a Python program that 1) looks for hard skills (e.g. "Python", "Programming") and soft skills (e.g. "Communication", "Team Work") in a job description and in your current resume, and tells you how many times each is mentioned in both, and 2) looks for the common words or phrases used in the job description (e.g. "data", "experience", "excellent") and tells you how many times each appears in the job description and in your resume. Given this data, the goal is to modify your resume to look as similar as possible to the job description.

If you would like to try it out, all you need to do is download the code from my github (github.com/danielgulloa/jobMatch), put the job description in a txt file, and put your resume in another txt file. Place all the files in the same folder, then run Keyword_Extractor.py with the job description file and the resume file as command-line arguments. In other words:

Copy the job description into a txt file and save it with a descriptive name (e.g. tesla_job_description.txt). Do the same with your resume (e.g. my_resume.txt).

$ git clone https://github.com/danielgulloa/jobMatch

$ cp tesla_job_description.txt my_resume.txt jobMatch

$ cd jobMatch

$ python Keyword_Extractor.py tesla_job_description.txt my_resume.txt

A file called Extracted_keywords.csv will be created, with 7 columns and many rows. Here are the first rows of an example I ran:

,type,skill,job,cv,m1,m2

55,soft,experience,5,2,3,3

60,general,one,5,0,5,5

63,general,learning,4,3,1,1

17,hard,apache,4,1,3,3

40,hard,data,3,15,-12,0

The first column is just an index, and the second is the type of skill ("hard", "soft", or "general" for general language in the job description). The third column is the name of the skill, the fourth is how many times the skill appears in the job description, and the fifth is how many times it appears in your resume.
The rows in the file are sorted by 'job' and then by 'cv'. The last two columns are a "distance" measure. m1 is simply (job - cv), which tells you how many more times the skill appears in the job description than in the resume. This value can be negative (hence the quotation marks around "distance"), and the goal is to make the sum of m1 as small (or as negative) as possible. The value of m2 is max(0, job - cv); with this measure, adding a skill to your resume many more times than the job description mentions it does not keep increasing the overall similarity. If you have ideas for other measures to include, please let me know and I will add them to the github version.

The program can easily be modified to suit individual needs. The download consists mainly of the Keyword_Extractor.py program and two files: hardskills.txt and softskills.txt. These last two files contain a relatively small list of hard and soft skills, aimed at people looking for jobs in information technology. In particular, hardskills.txt should be modified for other industries. You can easily add or remove skills from these files. The order does not matter, nor does it matter if there are repeated skills; the program removes duplicates (function load_skills). Furthermore, capitalization and punctuation marks are removed (function clean_phrase), both from the skills and from the job description. This way, something like "Self-driven" is treated the same as "selfdriven".

The program considers key-phrases consisting of one (keyword or mono-gram), two (bi-gram), or three (tri-gram) words. The function build_ngram_distribution can easily be modified to consider as many n-grams as needed, but short key-phrases work best.
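The cleaning and n-gram extraction described above can be illustrated with a small self-contained sketch that mirrors clean_phrase and ngrams from Keyword_Extractor.py:

```python
import re

def clean_phrase(line):
    # lowercase, drop newlines/tabs, then strip punctuation (hyphens included)
    return re.sub(r'[^\w\s]', '', line.replace('\n', '').replace('\t', '').lower())

def ngrams(words, n):
    # consecutive n-word windows over the token list
    return list(zip(*[words[i:] for i in range(n)]))

words = clean_phrase("Self-driven, with excellent communication skills").split(" ")
print(words)
# ['selfdriven', 'with', 'excellent', 'communication', 'skills']
print([" ".join(t) for t in ngrams(words, 2)])
# ['selfdriven with', 'with excellent', 'excellent communication', 'communication skills']
```

Note how the hyphen and comma disappear entirely rather than becoming spaces, which is why "Self-driven" and "selfdriven" count as the same token.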
In the case of the general language, we are only interested in the relevant parts of speech, so the program removes unimportant words such as conjunctions and interjections, and keeps only nouns, verbs, adjectives, and adverbs. Some people might consider that, for example, verbs are not important in a resume, so this can also be modified. In the function makeTable, the variable parts_of_speech contains the list of relevant parts of speech, following the possible POS tags in NLTK (https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk). The program only considers phrases in which all the words are in the parts_of_speech list. For example, the phrase "Excellent communication skills" is considered, since all three words are either adjectives or nouns, while the phrase "If you are" is not, because "if" is not a relevant POS. Additionally, there is a graylist variable that removes further words that are not relevant. For example, the word "you" might not be relevant in the context of a job description, and you might not want it to be considered. Simply add or remove POS tags in the parts_of_speech variable, or individual words in the graylist, according to your preferences.
--------------------------------------------------------------------------------
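The filtering rule itself is easy to demonstrate without running NLTK. In this sketch the (word, tag) pairs are written by hand in the Penn Treebank format that nltk.pos_tag returns, and keep_phrase is a hypothetical helper, not a function from the repository:

```python
# POS tags kept by makeTable: cardinal numbers, adjectives, modals, nouns, adverbs, and verbs
parts_of_speech = ['CD', 'JJ', 'JJR', 'JJS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS',
                   'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def keep_phrase(tagged):
    # keep a phrase only if every word carries a relevant POS tag
    return all(tag in parts_of_speech for _, tag in tagged)

print(keep_phrase([('excellent', 'JJ'), ('communication', 'NN'), ('skills', 'NNS')]))  # True
print(keep_phrase([('if', 'IN'), ('you', 'PRP'), ('are', 'VBP')]))  # False: 'IN' and 'PRP' are filtered out
```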