├── README.md ├── Resume parser - Straight.ipynb ├── Untitled-resume.pdf └── Using SpaCy.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Resume-Parser-using-Python 2 | Our project on GitHub offers a versatile resume parsing tool. It provides two methods for extracting information: a straightforward approach using regular expressions and a more advanced method using SpaCy's natural language processing capabilities. Choose the method that suits your needs and streamline your resume processing tasks effectively. 3 | 4 | # Table of Contents: 5 | 6 | * Introduction 7 | * Understanding Resume Parsing 8 | * Setting up the Development Environment 9 | * Extracting Text from Resumes 10 | * Extracting Contact Information 11 | * Extracting Email Address 12 | * Extracting Skills 13 | * Extracting Education 14 | * Extracting Name using spaCy 15 | * Parsing a Sample Resume 16 | * Conclusion 17 | 18 | # Project Description: 19 | 20 | Our project is a resume parsing tool that leverages two different methods to extract information from resumes effectively. The first method is a straightforward approach that utilizes regular expressions and text processing techniques to extract key details such as contact information, skills, education, and work experience. This method is reliable and efficient for basic parsing needs. 21 | 22 | The second method utilizes the powerful natural language processing capabilities of SpaCy, a leading Python library. By employing SpaCy's advanced linguistic models and entity recognition algorithms, our tool takes resume parsing to the next level. It can extract nuanced information, handle complex structures, and provide more accurate results. This method is particularly useful for advanced parsing requirements and scenarios where detailed analysis is crucial. 
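As a rough illustration of the first, regex-based method, here is a minimal sketch. It reuses the email and phone patterns from the notebooks in this repository; the helper name `extract_contact_details` and the sample text are hypothetical and only serve the example.

```python
import re

# Patterns mirror the ones used in the notebooks of this repository.
EMAIL_PATTERN = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
PHONE_PATTERN = r"\b(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"


def extract_contact_details(text):
    """Return the first email and phone number found in text (hypothetical helper)."""
    email = re.search(EMAIL_PATTERN, text)
    phone = re.search(PHONE_PATTERN, text)
    return {
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
    }


if __name__ == "__main__":
    # Made-up sample text standing in for text extracted from a resume PDF.
    sample = "Email: jane.doe@example.com\nPhone: 555-123-4567\n"
    print(extract_contact_details(sample))
```

Only the first match of each pattern is returned, as in the notebooks; switching to `re.findall` would be one way to collect every match if a resume lists several addresses or numbers.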
23 | 24 | With our project, users have the flexibility to choose between the straightforward method and the SpaCy-based method based on their specific needs and the complexity of the resumes they are parsing. This allows for a tailored approach and ensures optimal performance in various parsing scenarios. 25 | 26 | By providing these two methods, we aim to cater to a wide range of users, from those seeking a quick and efficient parsing solution to those who require in-depth analysis and extraction. Whether you're a recruiter, HR professional, or data enthusiast, our resume parsing tool offers the versatility and accuracy you need to streamline your resume processing workflow. 27 | 28 | Explore our project on GitHub to access the code and documentation, and start leveraging the power of resume parsing using the method that best suits your requirements. Simplify your resume processing tasks and unlock the potential of automated data extraction for enhanced efficiency and productivity. 29 | 30 | # Note: Don't forget to check out our comprehensive guide on building a resume parser using SpaCy, which provides detailed insights and instructions for implementing the parser effectively. 31 | -------------------------------------------------------------------------------- /Resume parser - Straight.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "1155b4a4", 6 | "metadata": {}, 7 | "source": [ 8 | "# Introduction:\n", 9 | "Resume parsing is a valuable tool used in various real-life scenarios to simplify and streamline the hiring process. Imagine you're a busy hiring manager or a human resources professional responsible for reviewing countless resumes. It can be quite overwhelming and time-consuming to manually read through each document and extract the relevant information. 
This is where a resume parser comes in.\n", 10 | "\n", 11 | "A resume parser is like a smart assistant that helps automate the initial screening of resumes. It uses advanced algorithms and natural language processing techniques to analyze the content of a resume and extract key details such as contact information, education history, work experience, skills, and more. This information is then organized into a structured format, making it easier for recruiters to evaluate candidates efficiently.\n", 12 | "\n", 13 | "With a resume parser, you can quickly scan through a large pool of resumes and identify the most suitable candidates based on specific criteria. It allows you to search for particular skills, experience levels, educational backgrounds, or any other qualifications you require for the job. The parser also eliminates the possibility of human error and ensures consistent and accurate data extraction.\n", 14 | "\n", 15 | "Furthermore, resume parsing can be integrated with applicant tracking systems (ATS) or other recruitment software. This integration enables seamless data transfer and eliminates the need for manual data entry, saving a significant amount of time and reducing administrative burdens. The parsed data can be easily sorted, filtered, and compared, enabling recruiters to shortlist candidates efficiently and make informed decisions.\n", 16 | "\n", 17 | "In summary, resume parsing technology acts as a valuable assistant for hiring professionals, making the resume screening process more efficient, accurate, and manageable. It simplifies the initial stages of recruitment, allowing recruiters to focus their time and energy on evaluating the most promising candidates and conducting more meaningful interactions during interviews and assessments." 
18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 1, 23 | "id": "f67d15f8", 24 | "metadata": { 25 | "scrolled": true 26 | }, 27 | "outputs": [ 28 | { 29 | "name": "stdout", 30 | "output_type": "stream", 31 | "text": [ 32 | "Sanket Sarwade\n", 33 | "Data Scientist\n", 34 | "\n", 35 | "As a highly motivated and detail-oriented data scientist, I am eager to begin my career in the \u0000eld of data science. With a solid\n", 36 | "foundation in statistics, programming, and machine learning techniques, I am well-equipped to tackle complex data problems\n", 37 | "and deliver meaningful insights. Through my academic and personal projects, I have honed my skills in data analysis,\n", 38 | "visualization, and modeling. I am pro\u0000cient in using tools such as Python, SQL, and Tableau, and have experience working with\n", 39 | "various data sources such as CSV, Excel, and JSON. Additionally, I possess excellent communication skills and a passion for\n", 40 | "learning, which makes me a valuable team player and a quick learner. I am excited to leverage my skills and knowledge to make a\n", 41 | "meaningful impact in the \u0000eld of data science.\n", 42 | "\n", 43 | "Email: sanketsarwade111@gmail.com\n", 44 | "Address: New Cidco, Nashik\n", 45 | "Phone: 7798248452\n", 46 | "Date of birth: Jul 15, 2001\n", 47 | "Nationality: Indian\n", 48 | "Link: https://github.com/sanketsarwade\n", 49 | "\n", 50 | "Experience\n", 51 | "\n", 52 | "Data Science Project Experience\n", 53 | "\n", 54 | "Though I don't have any real-life job experience, I have gained valuable hands-on\n", 55 | "experience in data science through various projects using Kaggle datasets and other\n", 56 | "platforms. 
I have worked on a variety of projects related to data cleaning, exploratory data\n", 57 | "analysis, machine learning, and data visualization.\n", 58 | "\n", 59 | "Data Science Project\n", 60 | "\n", 61 | "These projects have allowed me to gain pro\u0000ciency in programming languages such as\n", 62 | "Python, as well as in tools like SQL and Tableau. Through these projects, I have honed my\n", 63 | "problem-solving and analytical skills, as well as my ability to work collaboratively in a team\n", 64 | "environment. I am excited to bring my passion and experience to a real-world data science\n", 65 | "role.\n", 66 | "\n", 67 | "Education\n", 68 | "\n", 69 | "Jun 2019 - Sep 2022\n", 70 | "\n", 71 | "Bsc Microbiology\n", 72 | "\n", 73 | "Sinhgad College of Science\n", 74 | "\n", 75 | "Pune\n", 76 | "\n", 77 | "Oct 2022 - Jan 2023\n", 78 | "\n", 79 | "Data Scientist\n", 80 | "\n", 81 | "Data Camp\n", 82 | "\n", 83 | "Pune\n", 84 | "\n", 85 | "Languages\n", 86 | "\n", 87 | "Skills\n", 88 | "\n", 89 | "Python\n", 90 | "Tableau\n", 91 | "Handling Pressure\n", 92 | "\n", 93 | "SQL\n", 94 | "Data Visualization\n", 95 | "Collaboration\n", 96 | "\n", 97 | "Machine Learning\n", 98 | "Data Management\n", 99 | "\n", 100 | "Deep Learning\n", 101 | "Leadership\n", 102 | "\n", 103 | "English\n", 104 | "Advanced\n", 105 | "\n", 106 | "Narathi\n", 107 | "Native\n", 108 | "\n", 109 | "Hindi\n", 110 | "Native\n", 111 | "\n", 112 | "Projects\n", 113 | "\n", 114 | "•\n", 115 | "•\n", 116 | "•\n", 117 | "•\n", 118 | "•\n", 119 | "\n", 120 | "Human Activity Recogination\n", 121 | "Email Spam\n", 122 | "Breast Cancer Prediction\n", 123 | "Anomaly Detection\n", 124 | "IPL prediction\n", 125 | "\n", 126 | "Certi\u0000cations & Courses\n", 127 | "\n", 128 | "•\n", 129 | "•\n", 130 | "•\n", 131 | "•\n", 132 | "•\n", 133 | "•\n", 134 | "\n", 135 | "Python\n", 136 | "Machine Leaning\n", 137 | "Deep Leaning\n", 138 | "SQl\n", 139 | "Data Science\n", 140 | "Tableau\n", 141 | "\n", 142 | "\f", 143 
| "\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "from pdfminer.high_level import extract_text\n", 149 | " \n", 150 | "def extract_text_from_pdf(pdf_path):\n", 151 | "    return extract_text(pdf_path)\n", 152 | " \n", 153 | "if __name__ == '__main__':\n", 154 | "    print(extract_text_from_pdf(r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\"))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "id": "73ff19bb", 160 | "metadata": {}, 161 | "source": [ 162 | "# Extracting Name from Resume:\n", 163 | "\n", 164 | "The code snippet demonstrates a function that extracts text from a PDF file using the pdfminer library. It then uses a regular expression pattern to extract a potential name from the extracted text: the first pair of adjacent capitalized words. If a name is found, it is printed; otherwise, a \"Name not found\" message is displayed. This code can be used as a starting point for resume parsing tasks to extract names from resumes." 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 2, 170 | "id": "82355882", 171 | "metadata": {}, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "Name: Sanket Sarwade\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "from pdfminer.high_level import extract_text\n", 183 | "import re\n", 184 | "\n", 185 | "def extract_text_from_pdf(pdf_path):\n", 186 | "    return extract_text(pdf_path)\n", 187 | "\n", 188 | "def extract_name_from_resume(text):\n", 189 | "    name = None\n", 190 | "\n", 191 | "    # Use regex pattern to find a potential name\n", 192 | "    pattern = r\"(\\b[A-Z][a-z]+\\b)\\s(\\b[A-Z][a-z]+\\b)\"\n", 193 | "    match = re.search(pattern, text)\n", 194 | "    if match:\n", 195 | "        name = match.group()\n", 196 | "\n", 197 | "    return name\n", 198 | "\n", 199 | "if __name__ == '__main__':\n", 200 | "    text = extract_text_from_pdf(r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\")\n", 201 | "    name = extract_name_from_resume(text)\n", 202 | "\n", 203 | "    if name:\n", 204 | "        print(\"Name:\", name)\n", 205 | 
"    else:\n", 206 | "        print(\"Name not found\")\n" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "id": "90c97198", 212 | "metadata": {}, 213 | "source": [ 214 | "# Extracting Contact Number:\n", 215 | "\n", 216 | "The provided code snippet defines a function to extract text from a PDF file using pdfminer. It also includes another function to extract a potential contact number from the extracted text using a regular expression pattern. The code then calls these functions to extract the contact number from a specific resume file. If a contact number is found, it is printed; otherwise, a \"Contact Number not found\" message is displayed. This code can be used as a starting point for extracting contact numbers from resumes." 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 3, 222 | "id": "5a1ea1cd", 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "name": "stdout", 227 | "output_type": "stream", 228 | "text": [ 229 | "Contact Number: 7798248452\n" 230 | ] 231 | } 232 | ], 233 | "source": [ 234 | "def extract_text_from_pdf(pdf_path):\n", 235 | "    return extract_text(pdf_path)\n", 236 | "\n", 237 | "def extract_contact_number_from_resume(text):\n", 238 | "    contact_number = None\n", 239 | "\n", 240 | "    # Use regex pattern to find a potential contact number\n", 241 | "    pattern = r\"\\b(?:\\+?\\d{1,3}[-.\\s]?)?\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}\\b\"\n", 242 | "    match = re.search(pattern, text)\n", 243 | "    if match:\n", 244 | "        contact_number = match.group()\n", 245 | "\n", 246 | "    return contact_number\n", 247 | "\n", 248 | "if __name__ == '__main__':\n", 249 | "    text = extract_text_from_pdf(r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\")\n", 250 | "    contact_number = extract_contact_number_from_resume(text)\n", 251 | "\n", 252 | "    if contact_number:\n", 253 | "        print(\"Contact Number:\", contact_number)\n", 254 | "    else:\n", 255 | "        print(\"Contact Number not found\")" 256 | ] 257 | }, 258 | { 259 | "cell_type": 
"markdown", 260 | "id": "c3499079", 261 | "metadata": {}, 262 | "source": [ 263 | "# Extracting Email ID:\n", 264 | "\n", 265 | "The provided code snippet defines a function extract_text_from_pdf() to extract text from a PDF file using pdfminer. It also includes another function extract_email_from_resume() to extract a potential email address from the extracted text using a regular expression pattern.\n", 266 | "\n", 267 | "The code then calls these functions to extract the email address from a specific resume file. If an email address is found, it is printed as \"Email: [email address]\"; otherwise, an \"Email not found\" message is displayed.\n", 268 | "\n", 269 | "This code can be used as a starting point for extracting email addresses from resumes." 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 4, 275 | "id": "92f0f6d6", 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "name": "stdout", 280 | "output_type": "stream", 281 | "text": [ 282 | "Email: sanketsarwade111@gmail.com\n" 283 | ] 284 | } 285 | ], 286 | "source": [ 287 | "\n", 288 | "def extract_text_from_pdf(pdf_path):\n", 289 | "    return extract_text(pdf_path)\n", 290 | "\n", 291 | "def extract_email_from_resume(text):\n", 292 | "    email = None\n", 293 | "\n", 294 | "    # Use regex pattern to find a potential email address\n", 295 | "    pattern = r\"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b\"\n", 296 | "    match = re.search(pattern, text)\n", 297 | "    if match:\n", 298 | "        email = match.group()\n", 299 | "\n", 300 | "    return email\n", 301 | "\n", 302 | "if __name__ == '__main__':\n", 303 | "    text = extract_text_from_pdf(r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\")\n", 304 | "    email = extract_email_from_resume(text)\n", 305 | "\n", 306 | "    if email:\n", 307 | "        print(\"Email:\", email)\n", 308 | "    else:\n", 309 | "        print(\"Email not found\")\n" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "id": "3540ffd7", 315 | "metadata": {}, 316 | 
"source": [ 317 | "# Extracting Skills:\n", 318 | "\n", 319 | "The provided code snippet includes a function extract_text_from_pdf() that extracts text from a PDF file using pdfminer. Additionally, it defines a function extract_skills_from_resume() that takes the extracted text and a list of predefined skills as input.\n", 320 | "\n", 321 | "The extract_skills_from_resume() function searches for each skill in the provided list within the resume text using case-insensitive, whole-word regular expressions. If a skill is found, it is added to the skills list. Finally, the function returns the list of extracted skills.\n", 322 | "\n", 323 | "In the code's main section, the extract_text_from_pdf() function is called to extract the text from a specific resume file. A predefined list of skills is defined, and the extract_skills_from_resume() function is invoked with the extracted text and skills list as arguments. The extracted skills are then printed as \"Skills: [extracted skills]\" if any skills are found; otherwise, a \"No skills found\" message is displayed.\n", 324 | "\n", 325 | "This code can be used to extract skills from resumes by providing a list of predefined skills and the resume text. It serves as a basic framework for skill extraction and can be extended or customized to meet specific requirements." 
326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 5, 331 | "id": "90a413b9", 332 | "metadata": {}, 333 | "outputs": [ 334 | { 335 | "name": "stdout", 336 | "output_type": "stream", 337 | "text": [ 338 | "Skills: ['Python', 'Data Analysis', 'Machine Learning', 'Communication', 'Deep Learning', 'SQL', 'Tableau']\n" 339 | ] 340 | } 341 | ], 342 | "source": [ 343 | "def extract_text_from_pdf(pdf_path):\n", 344 | "    return extract_text(pdf_path)\n", 345 | "\n", 346 | "def extract_skills_from_resume(text, skills_list):\n", 347 | "    skills = []\n", 348 | "\n", 349 | "    # Search for skills in the resume text\n", 350 | "    for skill in skills_list:\n", 351 | "        pattern = r\"\\b{}\\b\".format(re.escape(skill))\n", 352 | "        match = re.search(pattern, text, re.IGNORECASE)\n", 353 | "        if match:\n", 354 | "            skills.append(skill)\n", 355 | "\n", 356 | "    return skills\n", 357 | "\n", 358 | "if __name__ == '__main__':\n", 359 | "    text = extract_text_from_pdf(r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\")\n", 360 | "\n", 361 | "    # List of predefined skills\n", 362 | "    skills_list = ['Python', 'Data Analysis', 'Machine Learning', 'Communication', 'Project Management', 'Deep Learning', 'SQL', 'Tableau']\n", 363 | "\n", 364 | "    extracted_skills = extract_skills_from_resume(text, skills_list)\n", 365 | "\n", 366 | "    if extracted_skills:\n", 367 | "        print(\"Skills:\", extracted_skills)\n", 368 | "    else:\n", 369 | "        print(\"No skills found\")\n" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "id": "44df8258", 375 | "metadata": {}, 376 | "source": [ 377 | "# Extracting Education:\n", 378 | "\n", 379 | "The provided code snippet consists of a function extract_text_from_pdf() that extracts text from a PDF file using the pdfminer library. 
Additionally, it includes a function extract_education_from_resume() that takes the extracted text as input.\n", 380 | "\n", 381 | "The extract_education_from_resume() function uses a regular expression pattern to search for education information in the resume text. The pattern is designed to match education degrees such as BSc, B.Tech, M.Tech, Bachelor's, Master's, and Ph.D., followed by the corresponding field of study. Two versions of the pattern are shown: the first lists only Bachelor, B.S., B.A., Master, M.S., M.A., and Ph.D., so it finds nothing in the sample resume, while the second also accepts Bsc and any B. or M. abbreviation and matches \"Bsc Microbiology\".\n", 382 | "\n", 383 | "Within the code's main section, the extract_text_from_pdf() function is invoked to extract the text from a specific resume file. Then, the extract_education_from_resume() function is called with the extracted text as an argument. If any education information is found, it is appended to the education list. Finally, the list of extracted education details is printed as \"Education: [extracted_education]\" if education information is found. Otherwise, a \"No education information found\" message is displayed.\n", 384 | "\n", 385 | "This code provides a basic framework for extracting education information from resumes using regular expressions. It can be further customized or expanded to handle additional patterns or extract more specific details related to education." 
386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 6, 391 | "id": "72ff2cc1", 392 | "metadata": {}, 393 | "outputs": [ 394 | { 395 | "name": "stdout", 396 | "output_type": "stream", 397 | "text": [ 398 | "No education information found\n" 399 | ] 400 | } 401 | ], 402 | "source": [ 403 | "def extract_text_from_pdf(pdf_path):\n", 404 | "    return extract_text(pdf_path)\n", 405 | "\n", 406 | "def extract_education_from_resume(text):\n", 407 | "    education = []\n", 408 | "\n", 409 | "    # Use regex pattern to find education information\n", 410 | "    pattern = r\"(?i)(?:(?:Bachelor|B\\.S\\.|B\\.A\\.|Master|M\\.S\\.|M\\.A\\.|Ph\\.D\\.)\\s(?:[A-Za-z]+\\s)*[A-Za-z]+)\"\n", 411 | "    matches = re.findall(pattern, text)\n", 412 | "    for match in matches:\n", 413 | "        education.append(match.strip())\n", 414 | "\n", 415 | "    return education\n", 416 | "\n", 417 | "if __name__ == '__main__':\n", 418 | "    text = extract_text_from_pdf(r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\")\n", 419 | "\n", 420 | "    extracted_education = extract_education_from_resume(text)\n", 421 | "    if extracted_education:\n", 422 | "        print(\"Education:\", extracted_education)\n", 423 | "    else:\n", 424 | "        print(\"No education information found\")\n" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 7, 430 | "id": "3a80890c", 431 | "metadata": {}, 432 | "outputs": [ 433 | { 434 | "name": "stdout", 435 | "output_type": "stream", 436 | "text": [ 437 | "Education: ['Bsc Microbiology']\n" 438 | ] 439 | } 440 | ], 441 | "source": [ 442 | "def extract_text_from_pdf(pdf_path):\n", 443 | "    return extract_text(pdf_path)\n", 444 | "\n", 445 | "def extract_education_from_resume(text):\n", 446 | "    education = []\n", 447 | "\n", 448 | "    # Use regex pattern to find education information\n", 449 | "    pattern = r\"(?i)(?:Bsc|\\bB\\.\\w+|\\bM\\.\\w+|\\bPh\\.D\\.\\w+|\\bBachelor(?:'s)?|\\bMaster(?:'s)?|\\bPh\\.D)\\s(?:\\w+\\s)*\\w+\"\n", 450 | "    matches = re.findall(pattern, 
text)\n", 451 | "    for match in matches:\n", 452 | "        education.append(match.strip())\n", 453 | "\n", 454 | "    return education\n", 455 | "\n", 456 | "if __name__ == '__main__':\n", 457 | "    text = extract_text_from_pdf(r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\")\n", 458 | "\n", 459 | "    extracted_education = extract_education_from_resume(text)\n", 460 | "    if extracted_education:\n", 461 | "        print(\"Education:\", extracted_education)\n", 462 | "    else:\n", 463 | "        print(\"No education information found\")\n" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": 8, 469 | "id": "fd98bce9", 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "name": "stdout", 474 | "output_type": "stream", 475 | "text": [ 476 | "Data Science Education: ['Data Science Project', 'Data Science\\nTableau']\n" 477 | ] 478 | } 479 | ], 480 | "source": [ 481 | "import spacy\n", 482 | "\n", 483 | "nlp = spacy.load('en_core_web_sm')\n", 484 | "\n", 485 | "def extract_data_science_education(text):\n", 486 | "    doc = nlp(text)\n", 487 | "\n", 488 | "    education = []\n", 489 | "\n", 490 | "    for ent in doc.ents:\n", 491 | "        if ent.label_ == 'ORG' and 'Data Science' in ent.text:\n", 492 | "            education.append(ent.text)\n", 493 | "\n", 494 | "    return education\n", 495 | "\n", 496 | "if __name__ == '__main__':\n", 497 | "    text = extract_text_from_pdf(r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\")\n", 498 | "\n", 499 | "    extracted_education = extract_data_science_education(text)\n", 500 | "    if extracted_education:\n", 501 | "        print(\"Data Science Education:\", extracted_education)\n", 502 | "    else:\n", 503 | "        print(\"No data science education found\")\n" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 9, 509 | "id": "dedc044f", 510 | "metadata": {}, 511 | "outputs": [ 512 | { 513 | "name": "stdout", 514 | "output_type": "stream", 515 | "text": [ 516 | "College: Sinhgad College of Science\n" 517 | ] 518 | } 519 | ], 520 | "source": [ 
521 | "\n", 522 | "def extract_college_name(text):\n", 523 | "    lines = text.split('\\n')\n", 524 | "    college_pattern = r\"(?i).*college.*\"\n", 525 | "    for line in lines:\n", 526 | "        if re.match(college_pattern, line):\n", 527 | "            return line.strip()\n", 528 | "    return None\n", 529 | "\n", 530 | "# Example usage:\n", 531 | "text = extract_text_from_pdf(r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\")\n", 532 | "\n", 533 | "\n", 534 | "college_name = extract_college_name(text)\n", 535 | "if college_name:\n", 536 | "    print(\"College:\", college_name)\n", 537 | "else:\n", 538 | "    print(\"College name not found.\")\n" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "id": "eb81c782", 544 | "metadata": {}, 545 | "source": [ 546 | "# Conclusion:\n", 547 | "\n", 548 | "In conclusion, the provided code snippets demonstrate a basic implementation of a resume parser. Each code snippet focuses on extracting specific information from a resume, such as name, contact number, email, skills, and education.\n", 549 | "\n", 550 | "The resume parser utilizes various techniques, including regular expressions and text extraction from PDF files. It showcases how these techniques can be applied to automate the extraction of important details from resumes.\n", 551 | "\n", 552 | "However, it's important to note that the code snippets provide a starting point and can be further enhanced and customized based on specific requirements. For example, additional patterns or algorithms can be implemented to improve the accuracy of information extraction.\n", 553 | "\n", 554 | "Resume parsing plays a vital role in automating the initial screening process for job applications. 
By extracting key details from resumes, it saves time and effort for recruiters and allows for efficient filtering of candidates.\n", 555 | "\n", 556 | "As technology advances, resume parsing algorithms can be further refined to handle more complex resume formats, languages, and diverse information extraction requirements. This will help in building more sophisticated and accurate resume parsing systems.\n", 557 | "\n", 558 | "Overall, the provided code snippets serve as a foundation for developing a resume parser and demonstrate the potential of automating the extraction of essential information from resumes, streamlining the recruitment process, and improving efficiency in candidate evaluation." 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "id": "085324be", 565 | "metadata": {}, 566 | "outputs": [], 567 | "source": [] 568 | } 569 | ], 570 | "metadata": { 571 | "kernelspec": { 572 | "display_name": "Python 3 (ipykernel)", 573 | "language": "python", 574 | "name": "python3" 575 | }, 576 | "language_info": { 577 | "codemirror_mode": { 578 | "name": "ipython", 579 | "version": 3 580 | }, 581 | "file_extension": ".py", 582 | "mimetype": "text/x-python", 583 | "name": "python", 584 | "nbconvert_exporter": "python", 585 | "pygments_lexer": "ipython3", 586 | "version": "3.10.7" 587 | } 588 | }, 589 | "nbformat": 4, 590 | "nbformat_minor": 5 591 | } 592 | -------------------------------------------------------------------------------- /Untitled-resume.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sanketsarwade/Resume-Parser-using-Python/fb3d176ff0bceda41a2ac1361b82b7c9e5870aaa/Untitled-resume.pdf -------------------------------------------------------------------------------- /Using SpaCy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "26f4c9d2", 6 | 
"metadata": {}, 7 | "source": [ 8 | "# Introduction:\n", 9 | "Building a resume parser using SpaCy can greatly streamline the process of extracting relevant information from resumes, enabling efficient candidate evaluation. In this guide, we will explore step-by-step instructions to develop a resume parser using the powerful natural language processing library, SpaCy.\n", 10 | "\n", 11 | "1. Understanding the Problem:\n", 12 | "Define the objective of the resume parser and the specific information to be extracted, such as name, contact details, skills, education, and work experience.\n", 13 | "\n", 14 | "2. Preparing the Environment:\n", 15 | "Install SpaCy and its required dependencies.\n", 16 | "Download and load the necessary SpaCy language models.\n", 17 | "\n", 18 | "3. Extracting Text from Resumes:\n", 19 | "Utilize PDF parsing libraries like pdfminer to extract text content from resume files.\n", 20 | "Implement a function to extract text from PDF files using the chosen library.\n", 21 | "\n", 22 | "4. Extracting Name from Resumes:\n", 23 | "Use SpaCy's linguistic capabilities to extract names from resume text.\n", 24 | "Define name patterns using SpaCy's Matcher module to identify different name formats.\n", 25 | "\n", 26 | "5. Extracting Contact Details:\n", 27 | "Employ regular expressions to extract contact numbers from resume text.\n", 28 | "Define patterns to capture various phone number formats.\n", 29 | "\n", 30 | "6. Extracting Email Addresses:\n", 31 | "Utilize regular expressions to identify and extract email addresses from resume text.\n", 32 | "Define email patterns to ensure accurate extraction.\n", 33 | "\n", 34 | "7. Extracting Skills:\n", 35 | "Create a predefined list of skills relevant to the desired job requirements.\n", 36 | "Utilize SpaCy's linguistic capabilities to match and extract skills from the resume text.\n", 37 | "\n", 38 | "8. 
Extracting Education:\n", 39 | "Define a set of education keywords or patterns to identify educational information.\n", 40 | "Utilize regular expressions to extract education details from the resume text.\n", 41 | "\n", 42 | "9. Putting it All Together:\n", 43 | "Combine the individual extraction functions to create a comprehensive resume parser.\n", 44 | "Process the resume text, extract the desired information, and store it in a structured format.\n", 45 | "\n", 46 | "10. Enhancements and Customizations:\n", 47 | "Explore advanced techniques to improve extraction accuracy, such as named entity recognition and entity linking.\n", 48 | "Consider handling different resume formats and languages for broader compatibility.\n", 49 | "Implement additional features like extracting work experience, certifications, or personal projects based on specific requirements." 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "id": "b02c73a2", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "Note to Readers:\n", 60 | "\n", 61 | "I am thrilled to share with you a comprehensive guide on building a resume parser using SpaCy, which is now available on Analytics Vidhya. This guide aims to empower you with the knowledge and tools to create a powerful resume parser from scratch.\n", 62 | "\n", 63 | "Throughout the guide, I have covered various essential aspects of resume parsing, including text extraction from PDFs, extracting important information like contact details, skills, education, and more. I have also demonstrated how to leverage the capabilities of SpaCy, a popular natural language processing library, to perform these tasks effectively.\n", 64 | "\n", 65 | "By following the step-by-step instructions and code examples provided in the guide, you will gain a solid understanding of the resume parsing process and be equipped with the skills to build your own resume parser. 
Whether you are a data scientist, developer, or HR professional, this guide will help you streamline and automate the resume screening process, saving you valuable time and effort.\n", 66 | "\n", 67 | "I encourage you to dive into the guide, experiment with the code, and adapt it to your specific requirements. Remember that building a resume parser is an iterative process, and you may need to fine-tune and customize it based on the unique characteristics of the resumes you encounter.\n", 68 | "\n", 69 | "I would like to express my gratitude to the Analytics Vidhya platform for providing the opportunity to share this guide with the community. Their dedication to promoting knowledge sharing and empowering data professionals is truly commendable.\n", 70 | "\n", 71 | "I hope this guide serves as a valuable resource for you on your journey of building a resume parser using SpaCy. Feel free to reach out with any questions, feedback, or success stories you may have. Happy parsing!" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 1, 77 | "id": "d749420c", 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "Resume: C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\n", 85 | "Name: Sanket Sarwade\n", 86 | "Contact Number: 7798248452\n", 87 | "Email: sanketsarwade111@gmail.com\n", 88 | "Skills: ['Python', 'Data Analysis', 'Machine Learning', 'Communication', 'Deep Learning', 'SQL', 'Tableau']\n", 89 | "Education: ['Bsc Microbiology']\n", 90 | "\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "import re\n", 96 | "from pdfminer.high_level import extract_text\n", 97 | "import spacy\n", 98 | "from spacy.matcher import Matcher\n", 99 | "\n", 100 | "def extract_text_from_pdf(pdf_path):\n", 101 | " return extract_text(pdf_path)\n", 102 | "\n", 103 | "def extract_contact_number_from_resume(text):\n", 104 | " contact_number = None\n", 105 | "\n", 106 | " # Use regex pattern to find a potential 
contact number\n", 107 | " pattern = r\"(?<!\\w)(?:\\+?\\d{1,3}[-.\\s]?)?\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}\\b\"\n", 108 | " match = re.search(pattern, text)\n", 109 | " if match:\n", 110 | " contact_number = match.group()\n", 111 | "\n", 112 | " return contact_number\n", 113 | "\n", 114 | "def extract_email_from_resume(text):\n", 115 | " email = None\n", 116 | "\n", 117 | " # Use regex pattern to find a potential email address\n", 118 | " pattern = r\"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b\"\n", 119 | " match = re.search(pattern, text)\n", 120 | " if match:\n", 121 | " email = match.group()\n", 122 | "\n", 123 | " return email\n", 124 | "\n", 125 | "def extract_skills_from_resume(text, skills_list):\n", 126 | " skills = []\n", 127 | "\n", 128 | " for skill in skills_list:\n", 129 | " pattern = r\"\\b{}\\b\".format(re.escape(skill))\n", 130 | " match = re.search(pattern, text, re.IGNORECASE)\n", 131 | " if match:\n", 132 | " skills.append(skill)\n", 133 | "\n", 134 | " return skills\n", 135 | "\n", 136 | "def extract_education_from_resume(text):\n", 137 | " education = []\n", 138 | "\n", 139 | " # Use regex pattern to find education information\n", 140 | " pattern = r\"(?i)(?:\\bBsc|\\bB\\.\\w+|\\bM\\.\\w+|\\bPh\\.D\\.\\w+|\\bBachelor(?:'s)?|\\bMaster(?:'s)?|\\bPh\\.D)\\s(?:\\w+\\s)*\\w+\"\n", 141 | " matches = re.findall(pattern, text)\n", 142 | " for match in matches:\n", 143 | " education.append(match.strip())\n", 144 | "\n", 145 | " return education\n", 146 | "\n", 147 | "def extract_name(resume_text):\n", 148 | " nlp = spacy.load('en_core_web_sm')\n", 149 | " matcher = Matcher(nlp.vocab)\n", 150 | "\n", 151 | " # Define name patterns\n", 152 | " patterns = [\n", 153 | " [{'POS': 'PROPN'}, {'POS': 'PROPN'}], # First name and Last name\n", 154 | " [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}], # First name, Middle name, and Last name\n", 155 | " [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}] # First name, Middle 
name, Middle name, and Last name\n", 156 | " # Add more patterns as needed\n", 157 | " ]\n", 158 | "\n", 159 | " for pattern in patterns:\n", 160 | " matcher.add('NAME', patterns=[pattern])\n", 161 | "\n", 162 | " doc = nlp(resume_text)\n", 163 | " matches = matcher(doc)\n", 164 | "\n", 165 | " for match_id, start, end in matches:\n", 166 | " span = doc[start:end]\n", 167 | " return span.text\n", 168 | "\n", 169 | " return None\n", 170 | "\n", 171 | "if __name__ == '__main__':\n", 172 | " resume_paths = [r\"C:\\Users\\SANKET\\Downloads\\Untitled-resume.pdf\"]\n", 173 | "\n", 174 | " for resume_path in resume_paths:\n", 175 | " text = extract_text_from_pdf(resume_path)\n", 176 | "\n", 177 | " print(\"Resume:\", resume_path)\n", 178 | "\n", 179 | " name = extract_name(text)\n", 180 | " if name:\n", 181 | " print(\"Name:\", name)\n", 182 | " else:\n", 183 | " print(\"Name not found\")\n", 184 | "\n", 185 | " contact_number = extract_contact_number_from_resume(text)\n", 186 | " if contact_number:\n", 187 | " print(\"Contact Number:\", contact_number)\n", 188 | " else:\n", 189 | " print(\"Contact Number not found\")\n", 190 | "\n", 191 | " email = extract_email_from_resume(text)\n", 192 | " if email:\n", 193 | " print(\"Email:\", email)\n", 194 | " else:\n", 195 | " print(\"Email not found\")\n", 196 | "\n", 197 | " skills_list = ['Python', 'Data Analysis', 'Machine Learning', 'Communication', 'Project Management', 'Deep Learning', 'SQL', 'Tableau']\n", 198 | " extracted_skills = extract_skills_from_resume(text, skills_list)\n", 199 | " if extracted_skills:\n", 200 | " print(\"Skills:\", extracted_skills)\n", 201 | " else:\n", 202 | " print(\"No skills found\")\n", 203 | "\n", 204 | " extracted_education = extract_education_from_resume(text)\n", 205 | " if extracted_education:\n", 206 | " print(\"Education:\", extracted_education)\n", 207 | " else:\n", 208 | " print(\"No education information found\")\n", 209 | "\n", 210 | " print()\n" 211 | ] 212 | }, 213 | { 214 | 
"cell_type": "markdown", 215 | "id": "60c0b98b", 216 | "metadata": {}, 217 | "source": [ 218 | "# Conclusion:\n", 219 | "Building a resume parser using SpaCy empowers recruiters and hiring professionals to automate the extraction of crucial information from resumes, saving time and effort in candidate evaluation. This guide has provided a comprehensive walkthrough of the key steps involved in creating a resume parser using SpaCy, enabling you to harness the power of natural language processing for efficient resume analysis.\n", 220 | "\n", 221 | "By leveraging the flexibility and extensibility of SpaCy, you can customize and enhance the resume parser to suit your unique requirements. Continual refinement and adaptation will result in a robust and accurate resume parsing solution, streamlining the recruitment process and improving decision-making.\n", 222 | "\n", 223 | "Remember, the resume parser presented here serves as a starting point, and you can further innovate and expand its capabilities to meet evolving needs and technological advancements. Embrace the power of SpaCy and embark on the journey of building a cutting-edge resume parser to revolutionize your recruitment process." 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "id": "f3d3143c", 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [] 233 | } 234 | ], 235 | "metadata": { 236 | "kernelspec": { 237 | "display_name": "Python 3 (ipykernel)", 238 | "language": "python", 239 | "name": "python3" 240 | }, 241 | "language_info": { 242 | "codemirror_mode": { 243 | "name": "ipython", 244 | "version": 3 245 | }, 246 | "file_extension": ".py", 247 | "mimetype": "text/x-python", 248 | "name": "python", 249 | "nbconvert_exporter": "python", 250 | "pygments_lexer": "ipython3", 251 | "version": "3.10.7" 252 | } 253 | }, 254 | "nbformat": 4, 255 | "nbformat_minor": 5 256 | } 257 | --------------------------------------------------------------------------------