├── softskills.txt
├── hardskills.txt
├── Keyword_Extractor.py
└── README.md

--------------------------------------------------------------------------------
/softskills.txt:
--------------------------------------------------------------------------------
Integrity
Highly Motivated
Motivated
Attention to detail
Attention
Collaboration
Collaborative
Education
Training
Problem solving
Professionalism
Quantitative
Instruction
Problem Solving Skills
Problem Solving
Problem-solving
Best Practices
Innovative
teamwork
team work
team player
team
Listen
Good Listener
Ability
Accept Feedback
Adaptable
Artistic Sense
Assertive
Attentive
Business Storytelling
Business Trend Awareness
Collaborating
Communication
Competitive
Confident
Confidence
Creative Thinker
Analytical Thinker
Creative
Conflict Management
Conflict Resolution
Cooperative
Courteous
Crisis Management
Critical Observer
Critical Thinker
Customer Service
Deal Making
Deal with Difficult Situations
Deal with Office Politics
Deals with Difficult People
Decision Making
Dedicated
Delegation
Dependable
Design Sense
Desire to Learn
Disability Awareness
Dispute Resolution
Diversity Awareness
Effective Communicator
Emotion Management
Emotional Intelligence
Empathetic
Energetic
Enthusiastic
Ergonomic Sensitivity
Establish Interpersonal Relationships
Dealing with Difficult Personalities
Experience
Excel
Facilitating
Flexible
Flexibility
Follow Instructions
Follow Regulations
Follow Rules
Friendly
Functions Well Under Pressure
Giving Feedback
Good at Networking
Good at Storytelling
Good Attitude
High Energy
Highly Organized
Highly Recommended
Honest
Independent
Influence
Persuasive
Innovator
Innovative
Innovation
Inspiring
Intercultural Competence
Interpersonal
Interviewing
Knowledge Management
Leadership
Listening
Logical Thinking
Make Deadlines
Management
Managing Difficult Conversations
Managing Remote Teams
Managing Virtual Teams
Meeting Management
Management Skills
Mentoring
Merit
Motivated
Motivating
Multitasking
Negotiation
Nonverbal Communication
Organization
Patience
performance
Perform Effectively
Performance Management
Perseverance
Persistence
Persuasion
Physical Communication
Planning
Positive Work Ethic
Possess Business Ethics
Presentation
Problem Solving
Process Improvement
Proper Business Etiquette
Proactive
Public Speaking
Punctual
Quick-witted
Read Body Language
Reliable
Research
Resilient
Resolving Issues
Resourcefulness
Respectful
Respectable
Results Oriented
Safety Conscious
Scheduling
Self-awareness
Self-directed
Self-monitoring
Self-supervising
self-starter
Selling Skills
Sense of Humor
Social
Stay on Task
Strategic Planning
Stress Management
Successful Coach
Supervising
Take Criticism
Talent Management
Team Building
Team Player
Technology Savvy
Technology Trend Awareness
Thinks Outside the Box
Time Management
Tolerant of Change and Uncertainty
Train the Trainer
Trainable
Training
Troubleshooter
Value Education
Verbal Communication
Visual Communication
Well Groomed
Willing to Accept Feedback
Willingness to Learn
Work Well Under Pressure
Work-Life Balance
Writing Experience
Writing Reports and Proposals
Writing Skills

--------------------------------------------------------------------------------
/hardskills.txt:
--------------------------------------------------------------------------------
develop
maintain
debug
test
computer programs
Programming
Programmer
Microsoft
Microsoft Excel
MS Excel
Microsoft Office
MS Office
Software Development
HTML
Retention
SQL
Modeling
Modelling
Analytics
Apache
Apache Airflow
Apache Impala
Apache Drill
Apache Hadoop
Data
Certification
Data Collection
Datasets
Business Requirements
Data Mining
Data Science
Visualization
Technical Guidance
Client Analytics
Programming Skills
Sql Server
Computer Science
Statistical Modeling
Applied Data Science
Hiring
Technical
Database
Education
R
C
C++
C#
Ruby
Ruby on Rails
Weka
Matlab
Django
NetBeans
IDE
stochastic
Marketing
Mining
Mathematics
Forecasts
Statistics
Programming
python
Python
Microsoft Sql Server
MS Sql Server
NoSql
No-Sql
Hadoop
Spark
Java
Algorithms
Databases
Numpy
Pandas
scikit-learn
Scikit
clustering
classification
neural networks
neural network
tensorflow
pytorch
theano
keras
Pig
Adaboost
Statistics
Statistical analysis
machine learning
data mining
data science
data analytics
data analysis
regression
kmeans
k-means
kNN
Bayes
Bayesian Probability
Bayesian Estimation
Bayesian Network
Forest
Random Forest
Decision Tree
Matrix
Matrix Factorization
SVD
Outlier
Outlier detection
Regression Analysis
Frequent Itemset Mining
Classification Analysis
Backpropagation
Sample
LogitBoost
Time Series
Stochastic Gradient Descent
Gradient Descent
PCA
Principal Component Analysis
Dynamic
Dynamic programming
Clustering
Classification
Data-driven
Algorithms
Analysis
Analytical
Analytics
Analyze Data
Applications
Application Development
Application Development Methodologies
Application Development Techniques
Application Development Tools
Application Programming Interfaces
AWS
AWS Glue
Architecture
AROS
Ars Based Programming
Aspect Oriented Programming
Best Practices
Browsers
CASE Tools
Capital management
Code
Coding
Collaboration
Communication
Components
Computer Platforms
Concurrent Programming
Computer Science
Computational complexity
Constraint-based Programming
Customer Service
Database Management Systems
DBMS
Database Techniques
Databases
Database
Data
Data Analytics
Data Structures
Debugging
Design
Design Patterns
Development
Development Tools
Distributed Computing
Dimensionality Reduction
Documentation
Embedded Hardware
Emerging Technologies
Fourth Generation Languages
Hardware
HTML Authoring Tools
HTML Conversion Tools
Industry Systems
iOS
Information Systems
Implementation
Interface with Clients
Interface with Vendors
Internet
Languages
Linux
Logic
MacOS
Math
Mobile
Multimedia
Multi-Tasking
MXNet
Object oriented programming
object oriented
Operating Systems
Optimizing
Organizational
OS Programming
Parallel Processing
Personal
Physics
Planning
Post Object Programming
Presto
Problem Solving
Programming Languages
Programming Methodologies
Quality Control
Relational Databases
Relational Programming
Reporting
Revision Control
Self-Motivation
Software
Structured Query Language (SQL)
Symbolic Programming
System Architecture
System Development
System Design
System Programming
System Testing
Teamwork
Technical
Testing
Third Generation Languages
Troubleshooting
UNIX
Use Logical Reasoning
Web
Web Applications
Web Platforms
Web Services
Windowing Systems
Windows
Workstations

--------------------------------------------------------------------------------
/Keyword_Extractor.py:
--------------------------------------------------------------------------------
import re
import pandas as pd
import sys, os
import numpy as np
import nltk  # POS tagging below needs the 'punkt' and 'averaged_perceptron_tagger' data packages
import operator
import math


class Extractor:
    def __init__(self):
        self.softskills = self.load_skills('softskills.txt')
        self.hardskills = self.load_skills('hardskills.txt')
        self.jb_distribution = self.build_ngram_distribution(sys.argv[-2])
        self.cv_distribution = self.build_ngram_distribution(sys.argv[-1])
        self.table = []
        self.outFile = "Extracted_keywords.csv"

    def load_skills(self, filename):
        f = open(filename, 'r')
        skills = []
        for line in f:
            # remove punctuation and upper case
            skills.append(self.clean_phrase(line))
        f.close()
        return list(set(skills))  # remove duplicates

    def build_ngram_distribution(self, filename):
        n_s = [1, 2, 3]  # mono-, bi-, and tri-grams
        dist = {}
        for n in n_s:
            dist.update(self.parse_file(filename, n))
        return dist

    def parse_file(self, filename, n):
        f = open(filename, 'r')
        results = {}
        for line in f:
            words = self.clean_phrase(line).split(" ")
            ngrams = self.ngrams(words, n)
            for tup in ngrams:
                phrase = " ".join(tup)
                if phrase in results:
                    results[phrase] += 1
                else:
                    results[phrase] = 1
        f.close()
        return results

    def clean_phrase(self, line):
        # lowercase, drop newlines/tabs, then strip punctuation
        return re.sub(r'[^\w\s]', '', line.replace('\n', '').replace('\t', '').lower())

    def ngrams(self, input_list, n):
        return list(zip(*[input_list[i:] for i in range(n)]))

    def measure1(self, v1, v2):
        return v1 - v2

    def measure2(self, v1, v2):
        return max(v1 - v2, 0)

    def measure3(self, v1, v2):
        # cosine similarity of v1 and v2: (v1 dot v2)/(||v1||*||v2||)
        sumxx, sumxy, sumyy = 0, 0, 0
        for i in range(len(v1)):
            x = v1[i]
            y = v2[i]
            sumxx += x * x
            sumyy += y * y
            sumxy += x * y
        return sumxy / math.sqrt(sumxx * sumyy)

    def sendToFile(self):
        try:
            os.remove(self.outFile)
        except OSError:
            pass
        df = pd.DataFrame(self.table, columns=['type', 'skill', 'job', 'cv', 'm1', 'm2'])
        df_sorted = df.sort_values(by=['job', 'cv'], ascending=[False, False])
        # write all 6 columns plus the index, as described in the README
        df_sorted.to_csv(self.outFile)

    def printMeasures(self):
        n_rows = len(self.table)
        v1 = [self.table[row][4] for row in range(n_rows)]
        v2 = [self.table[row][5] for row in range(n_rows)]
        print("Measure 1: " + str(sum(v1)))
        print("Measure 2: " + str(sum(v2)))

        v1 = [self.table[row][2] for row in range(n_rows)]
        v2 = [self.table[row][3] for row in range(n_rows)]
        print("Measure 3 (cosine sim): " + str(self.measure3(v1, v2)))

        for skill_type in ['hard', 'soft', 'general']:
            v1 = [self.table[row][2] for row in range(n_rows) if self.table[row][0] == skill_type]
            v2 = [self.table[row][3] for row in range(n_rows) if self.table[row][0] == skill_type]
            print("Cosine similarity for " + skill_type + " skills: " + str(self.measure3(v1, v2)))
    def makeTable(self):
        # I am interested in verbs, nouns, adverbs, and adjectives
        parts_of_speech = ['CD', 'JJ', 'JJR', 'JJS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS',
                           'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
        graylist = ["you", "will"]
        tmp_table = []
        # check whether each skill is mentioned in the job description and then in your cv

        for skill in self.hardskills:
            if skill in self.jb_distribution:
                count_jb = self.jb_distribution[skill]
                count_cv = self.cv_distribution.get(skill, 0)
                m1 = self.measure1(count_jb, count_cv)
                m2 = self.measure2(count_jb, count_cv)
                tmp_table.append(['hard', skill, count_jb, count_cv, m1, m2])

        for skill in self.softskills:
            if skill in self.jb_distribution:
                count_jb = self.jb_distribution[skill]
                count_cv = self.cv_distribution.get(skill, 0)
                m1 = self.measure1(count_jb, count_cv)
                m2 = self.measure2(count_jb, count_cv)
                tmp_table.append(['soft', skill, count_jb, count_cv, m1, m2])

        # And now for the general language of the job description:
        # sort the distribution by the phrases most used in the job description
        general_language = sorted(self.jb_distribution.items(), key=operator.itemgetter(1), reverse=True)
        for phrase, count_jb in general_language:
            if phrase in self.hardskills or phrase in self.softskills or phrase in graylist:
                continue
            tokens = nltk.word_tokenize(phrase)
            parts = nltk.pos_tag(tokens)
            # keep the phrase only if every word carries a relevant POS tag
            if all(tag in parts_of_speech for _, tag in parts):
                count_cv = self.cv_distribution.get(phrase, 0)
                m1 = self.measure1(count_jb, count_cv)
                m2 = self.measure2(count_jb, count_cv)
                tmp_table.append(['general', phrase, count_jb, count_cv, m1, m2])
        self.table = tmp_table
def main():
    K = Extractor()
    K.makeTable()
    K.sendToFile()
    K.printMeasures()


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
The process of looking for a new job is in itself a full-time job. There is an overwhelming number of job postings, and it is often difficult to determine which jobs best match your skills and preferences. It is important to research a job before applying, and even more so before accepting an offer. On the other hand, job postings also receive an overwhelming number of applications, and it is well-known[citation needed] that companies use Applicant Tracking Systems (ATS) to weed out candidates who are not a good match for the job. The basic idea is that the ATS searches the applicant's resume for predetermined keywords and for particular language (mostly hard and soft skills) and keeps only the resumes with the highest match. For this reason, each time you apply for a job, you should tailor your resume to emphasize the things the company is looking for. For example, if I am applying for a data science job, I should emphasize the work I have done with data, while if it is a machine learning job, I should emphasize the machine learning algorithms I have used. This all sounds simple enough, and it might seem that you just need a small set of resumes and a different one for each type of job. However, job postings are all different; some use certain words more than others, and your resume should match that language too.

To help job seekers "beat" the ATS, I have developed a Python program that 1) looks for hard skills (e.g. "Python", "Programming") and soft skills (e.g. "Communication", "Team Work") in a job description and in your current resume, and tells you how many times each is mentioned in both, and 2) looks for the common words or phrases used in the job description (e.g. "data", "experience", "excellent") and tells you how many times each appears in the job description and in your resume. Given this data, the goal is to modify your resume to look as similar as possible to the job description.

If you would like to try it out, all you need to do is download the code from my github (github.com/danielgulloa/jobMatch), put the job description in a txt file, and put your resume in another txt file. Place all the files in the same folder, then run Keyword_Extractor.py with the job description file and the resume file as command-line arguments. In other words:

Copy the job description into a txt file and save it with a descriptive name (e.g. tesla_job_description.txt). Do the same with your resume (e.g. my_resume.txt).

$ git clone https://github.com/danielgulloa/jobMatch

$ cp tesla_job_description.txt my_resume.txt jobMatch

$ cd jobMatch

$ python Keyword_Extractor.py tesla_job_description.txt my_resume.txt

A file called Extracted_keywords.csv will be created, with 7 columns and many rows. Here are the first rows of an example I ran:

,type,skill,job,cv,m1,m2

55,soft,experience,5,2,3,3

60,general,one,5,0,5,5

63,general,learning,4,3,1,1

17,hard,apache,4,1,3,3

40,hard,data,3,15,-12,0

The first column is just an index, and the second is the type of skill ("hard", "soft", or "general" for general language in the job description). The third column is the name of the skill, the fourth is how many times the skill appears in the job description, and the fifth is how many times it appears in your resume.
The rows in the file are sorted by 'job' and then by 'cv'. The last two columns are a "distance" measure. m1 is simply (job - cv), which tells you how many more times the skill appears in the job description than in the resume. This value can be negative (hence the quotation marks around "distance"), and the goal is to make the sum of m1 as small (or as negative) as possible. The value of m2 is max(0, job - cv); with this measure, adding a skill to your resume many more times than the job description mentions it does not keep increasing the overall similarity. If you have ideas for other measures to include, please let me know and I will add them to the github version.

The program can easily be modified to suit individual needs. The download consists mainly of the Keyword_Extractor.py program and two files: hardskills.txt and softskills.txt. These last two files contain a relatively small list of hard and soft skills, aimed at people looking for jobs in information technology. In particular, hardskills.txt should be modified for other industries. You can easily add or remove skills from these files. The order does not matter, nor does it matter if there are repeated skills; the program removes duplicates (function load_skills). Furthermore, capitalization and punctuation marks are removed (function clean_phrase), both from the skills and from the job description. This way, something like "Self-driven" is treated the same as "selfdriven".

The program considers key-phrases consisting of one (keyword or mono-gram), two (bi-gram), or three (tri-gram) words. The function build_ngram_distribution can easily be modified to consider as many n-grams as needed, but short key-phrases work best.
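The cleaning and n-gram extraction described above can be illustrated with a small self-contained sketch that mirrors clean_phrase and ngrams from Keyword_Extractor.py:

```python
import re

def clean_phrase(line):
    # lowercase, drop newlines/tabs, then strip punctuation (hyphens included)
    return re.sub(r'[^\w\s]', '', line.replace('\n', '').replace('\t', '').lower())

def ngrams(words, n):
    # consecutive n-word windows over the token list
    return list(zip(*[words[i:] for i in range(n)]))

words = clean_phrase("Self-driven, with excellent communication skills").split(" ")
print(words)
# ['selfdriven', 'with', 'excellent', 'communication', 'skills']
print([" ".join(t) for t in ngrams(words, 2)])
# ['selfdriven with', 'with excellent', 'excellent communication', 'communication skills']
```

Note how the hyphen and comma disappear entirely rather than becoming spaces, which is why "Self-driven" and "selfdriven" count as the same token.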
In the case of the general language, we are only interested in the relevant parts of speech, so the program removes unimportant words such as conjunctions and interjections, and keeps only nouns, verbs, adjectives, and adverbs. Some people might consider that, for example, verbs are not important in a resume, so this can also be modified. In the function makeTable, the variable parts_of_speech contains the list of relevant parts of speech, following the possible POS tags in NLTK (https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk). The program only considers phrases in which all the words are in the parts_of_speech list. For example, the phrase "Excellent communication skills" is considered, since all three words are either adjectives or nouns, while the phrase "If you are" is not, because "if" is not a relevant POS. Additionally, there is a graylist variable that removes further words that are not relevant. For example, the word "you" might not be relevant in the context of a job description, and you might not want it to be considered. Simply add or remove POS tags in the parts_of_speech variable, or individual words in the graylist, according to your preferences.
--------------------------------------------------------------------------------
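The filtering rule itself is easy to demonstrate without running NLTK. In this sketch the (word, tag) pairs are written by hand in the Penn Treebank format that nltk.pos_tag returns, and keep_phrase is a hypothetical helper, not a function from the repository:

```python
# POS tags kept by makeTable: cardinal numbers, adjectives, modals, nouns, adverbs, and verbs
parts_of_speech = ['CD', 'JJ', 'JJR', 'JJS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS',
                   'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def keep_phrase(tagged):
    # keep a phrase only if every word carries a relevant POS tag
    return all(tag in parts_of_speech for _, tag in tagged)

print(keep_phrase([('excellent', 'JJ'), ('communication', 'NN'), ('skills', 'NNS')]))  # True
print(keep_phrase([('if', 'IN'), ('you', 'PRP'), ('are', 'VBP')]))  # False: 'IN' and 'PRP' are filtered out
```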