├── .gitignore
├── README.md
├── __init__.py
├── bin
│   ├── __init__.py
│   ├── code_template.py
│   ├── field_extraction.py
│   ├── lib.py
│   ├── main.py
│   └── pdf2text.py
├── changelog.md
├── confs
│   └── config.yaml.template
├── data
│   ├── input
│   │   ├── data_descriptions.csv
│   │   └── example_resumes
│   │       ├── Brendan_Herger_Resume.pdf
│   │       ├── Layla_Martin_Resume.pdf
│   │       ├── SGresume-1.pdf
│   │       ├── john_smith.pdf
│   │       └── resume_Meyer.pdf
│   ├── output
│   │   ├── Brendan_Herger_Resume.txt
│   │   ├── Layla_Martin_Resume.txt
│   │   ├── SGresume-1.txt
│   │   ├── john_smith.txt
│   │   ├── resume_Meyer.txt
│   │   └── resume_summary.csv
│   └── schema
│       ├── extract.csv
│       └── transform.csv
└── requirements.txt

/.gitignore:
--------------------------------------------------------------------------------
1 | # MacOS specific files
2 | .DS_Store
3 | 
4 | # IDE specific files
5 | .idea/*
6 | 
7 | # Compiled python files
8 | *.pyc
9 | 
10 | # Configuration files
11 | confs/*.yaml
12 | confs/*.yml
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ResumeParser
2 | 
3 | A utility to make handling many resumes easier by automatically pulling contact information, required skills and custom text fields. These results are then surfaced as a convenient summary CSV.
4 | 
5 | ## Quick Start Guide
6 | 
7 | ```bash
8 | # Install requirements
9 | pip install -r requirements.txt
10 | 
11 | # Retrieve language model from spacy
12 | python -m spacy download en
13 | 
14 | # Run code (with default configurations)
15 | cd bin/
16 | python main.py
17 | 
18 | # Review output
19 | open ../data/output/resume_summary.csv
20 | 
21 | ```
22 | 
23 | ## Getting started
24 | 
25 | ### Repo structure
26 | 
27 | - `bin/main.py`: Code entry point
28 | - `confs/config.yaml.template`: Configuration file template
29 | - `data/input/example_resumes`: Example resumes, which are parsed w/ default configurations
30 | - `data/output/resume_summary.csv`: Results from parsing example resumes
31 | 
32 | ### Python Environment
33 | 
34 | Python code in this repo utilizes packages that are not part of the standard library. To make sure you have all of the
35 | appropriate packages, please use `pip` to install from the `requirements.txt` file. For more details, please see the [pip
36 | documentation](https://pip.pypa.io/en/stable/user_guide/#requirements-files)
37 | 
38 | ### Configuration file
39 | 
40 | This program utilizes a configuration file to set program parameters. You can run this program with the default
41 | parameters to view sample output, but you'll probably want to create a config file and modify it to get the most value
42 | from this program.
43 | 
44 | ```bash
45 | 
46 | # Create configuration file from template
47 | cp confs/config.yaml.template confs/config.yaml
48 | 
49 | # Modify confs to match your needs
50 | open confs/config.yaml
51 | ```
52 | 
53 | The configuration file has a few parameters you can tweak:
54 | - `resume_directory`: A directory containing the resumes you'd like to parse
55 | - `summary_output_directory`: Where to place the .csv file summarizing your resumes
56 | - `data_schema_dir`: The directory to store table schemas. This is mostly for development purposes
57 | - `extractors`: A mapping from output column name (e.g. `programming` or `universities`) to a YAML list of terms to search for. Each element in a list can either be a single string (e.g. `trader` or `machine learning`), or a list of aliases for the same term (e.g. `[sklearn, scikit-learn, sk-learn]`). Matching is case-insensitive; see the sketch below for how entries are interpreted
58 | 
59 | ## Contact
60 | Feel free to contact me at `13herger gmail com`. If you're interested in projects like this, check out my [website](http://hergertarian.com) and [blog](http://hergertarian.com/blog)
61 | 
--------------------------------------------------------------------------------
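Editor's note: for illustration, here is a minimal sketch of how each `extractors` entry above is normalized and matched. It mirrors the logic in `bin/field_extraction.py`, but the snippet itself, and its example terms, are illustrative rather than shipped code:

```python
# Illustrative only: mirrors the normalization in bin/field_extraction.py
entries = ['trader', ['sklearn', 'scikit-learn', 'sk-learn']]

normalized = {}
for entry in entries:
    if isinstance(entry, list):   # alias list: the first element is the name reported in the CSV
        normalized[entry[0]] = entry
    else:                         # single string: the term is its own alias
        normalized[entry] = [entry]

# Each alias is then counted in the resume text as a case-insensitive regex; if any
# alias occurs at least once, the reported name lands in that resume's summary row.
```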
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/__init__.py
--------------------------------------------------------------------------------
/bin/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/bin/__init__.py
--------------------------------------------------------------------------------
/bin/code_template.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """
3 | coding=utf-8
4 | 
5 | Code Template
6 | 
7 | """
8 | import logging
9 | 
10 | logging.basicConfig(level=logging.DEBUG)
11 | 
12 | 
13 | def main():
14 |     """
15 |     Main function documentation template
16 |     :return: None
17 |     :rtype: None
18 |     """
19 |     pass
20 | 
21 | 
22 | # Main section
23 | if __name__ == '__main__':
24 |     main()
25 | 
--------------------------------------------------------------------------------
/bin/field_extraction.py:
--------------------------------------------------------------------------------
1 | import logging
2 | 
3 | from gensim.utils import simple_preprocess
4 | 
5 | from bin import lib
6 | 
7 | EMAIL_REGEX = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
8 | PHONE_REGEX = r"\(?(\d{3})?\)?[\s\.-]{0,2}?(\d{3})[\s\.-]{0,2}(\d{4})"
9 | NAME_REGEX = r'[a-z]+(?:\s+[a-z]+)?'  # non-capturing group, so re.findall returns full matches
10 | 
11 | 
12 | def candidate_name_extractor(input_string, nlp):
13 | 
14 |     doc = nlp(input_string)
15 | 
16 |     # Extract entities
17 |     doc_entities = doc.ents
18 | 
19 |     # Subset to person type entities
20 |     doc_persons = filter(lambda x: x.label_ == 'PERSON', doc_entities)
21 |     doc_persons = filter(lambda x: len(x.text.strip().split()) >= 2, doc_persons)
22 |     doc_persons = map(lambda x: x.text.strip(), doc_persons)
23 |     doc_persons = list(doc_persons)
24 | 
25 |     # Assume that the first PERSON entity with at least two tokens is the candidate's name
26 |     if len(doc_persons) > 0:
27 |         return doc_persons[0]
28 |     return "NOT FOUND"
29 | 
30 | 
31 | def extract_fields(df):
32 |     for extractor, items_of_interest in lib.get_conf('extractors').items():
33 |         df[extractor] = df['text'].apply(lambda x: extract_skills(x, extractor, items_of_interest))
34 |     return df
35 | 
36 | 
37 | def extract_skills(resume_text, extractor, items_of_interest):
38 |     potential_skills_dict = dict()
39 |     matched_skills = set()
40 | 
41 |     # TODO This skill input formatting could happen once per run, instead of once per observation.
42 |     for skill_input in items_of_interest:
43 | 
44 |         # Format list inputs
45 |         if type(skill_input) is list and len(skill_input) >= 1:
46 |             potential_skills_dict[skill_input[0]] = skill_input
47 | 
48 |         # Format string inputs
49 |         elif type(skill_input) is str:
50 |             potential_skills_dict[skill_input] = [skill_input]
51 |         else:
52 |             logging.warning('Unknown skill listing type: {}. Please format as either a single string or a list of strings'
53 |                             ''.format(skill_input))
54 | 
55 |     for (skill_name, skill_alias_list) in potential_skills_dict.items():
56 | 
57 |         skill_matches = 0
58 |         # Iterate through aliases
59 |         for skill_alias in skill_alias_list:
60 |             # Add the number of matches for each alias
61 |             skill_matches += lib.term_count(resume_text, skill_alias.lower())
62 | 
63 |         # If at least one alias is found, add skill name to set of skills
64 |         if skill_matches > 0:
65 |             matched_skills.add(skill_name)
66 | 
67 |     return matched_skills
--------------------------------------------------------------------------------
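Editor's note: a usage sketch for the module above, not repo code. The sample text and address are invented, and the imports assume the repository root is on `sys.path`:

```python
from bin import field_extraction, lib

sample = 'Jane Doe built ETL pipelines in Python and PostgreSQL. Reach her at jane.doe@example.com.'

# Skill matching: strings match themselves, lists are alias groups keyed by their first element
field_extraction.extract_skills(sample, 'example', ['python', ['Postgres', 'PostgreSQL']])
# -> {'python', 'Postgres'}

# Contact fields reuse the same regex helpers
lib.term_match(sample, field_extraction.EMAIL_REGEX)  # -> 'jane.doe@example.com'
```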
/bin/lib.py:
--------------------------------------------------------------------------------
1 | """
2 | coding=utf-8
3 | """
4 | import logging
5 | import os
6 | import re
7 | import subprocess
8 | 
9 | import pandas
10 | import yaml
11 | 
12 | from bin import pdf2text
13 | 
14 | CONFS = None
15 | 
16 | AVAILABLE_EXTENSIONS = {'.csv', '.doc', '.docx', '.eml', '.epub', '.gif', '.htm', '.html', '.jpeg', '.jpg', '.json',
17 |                         '.log', '.mp3', '.msg', '.odt', '.ogg', '.pdf', '.png', '.pptx', '.ps', '.psv', '.rtf', '.tff',
18 |                         '.tif', '.tiff', '.tsv', '.txt', '.wav', '.xls', '.xlsx'}
19 | 
20 | 
21 | def load_confs(confs_path='../confs/config.yaml'):
22 |     # TODO Docstring
23 |     global CONFS
24 | 
25 |     if CONFS is None:
26 |         try:
27 |             CONFS = yaml.safe_load(open(confs_path))
28 |         except IOError:
29 |             confs_template_path = confs_path + '.template'
30 |             logging.warning(
31 |                 'Confs path: {} does not exist. Attempting to load confs template, '
32 |                 'from path: {}'.format(confs_path, confs_template_path))
33 |             CONFS = yaml.safe_load(open(confs_template_path))
34 |     return CONFS
35 | 
36 | 
37 | def get_conf(conf_name):
38 |     return load_confs()[conf_name]
39 | 
40 | 
41 | def archive_dataset_schemas(step_name, local_dict, global_dict):
42 |     """
43 |     Archive the schema for all available Pandas DataFrames
44 |     - Determine which objects in namespace are Pandas DataFrames
45 |     - Pull schema for all available Pandas DataFrames
46 |     - Write schemas to file
47 |     :param step_name: The name of the current operation (e.g. `extract`, `transform`, `model` or `load`)
48 |     :param local_dict: A dictionary containing mappings from variable name to objects. This is usually generated by
49 |     calling `locals`
50 |     :type local_dict: dict
51 |     :param global_dict: A dictionary containing mappings from variable name to objects. This is usually generated by
52 |     calling `globals`
53 |     :type global_dict: dict
54 |     :return: None
55 |     :rtype: None
56 |     """
57 |     logging.info('Archiving data set schema(s) for step name: {}'.format(step_name))
58 | 
59 |     # Reference variables
60 |     data_schema_dir = get_conf('data_schema_dir')
61 |     schema_output_path = os.path.join(data_schema_dir, step_name + '.csv')
62 |     schema_agg = list()
63 | 
64 |     env_variables = dict()
65 |     env_variables.update(local_dict)
66 |     env_variables.update(global_dict)
67 | 
68 |     # Filter down to Pandas DataFrames
69 |     data_sets = filter(lambda x: type(x[1]) == pandas.DataFrame, env_variables.items())
70 |     data_sets = dict(data_sets)
71 | 
72 |     for (data_set_name, data_set) in data_sets.items():
73 |         # Extract variable names
74 |         logging.info('Working data_set: {}'.format(data_set_name))
75 | 
76 |         local_schema_df = pandas.DataFrame(data_set.dtypes, columns=['type'])
77 |         local_schema_df['data_set'] = data_set_name
78 | 
79 |         schema_agg.append(local_schema_df)
80 | 
81 |     # Aggregate schema list into one data frame
82 |     agg_schema_df = pandas.concat(schema_agg)
83 | 
84 |     # Write to file
85 |     agg_schema_df.to_csv(schema_output_path, index_label='variable')
86 | 
87 | 
88 | def term_count(string_to_search, term):
89 |     """
90 |     A utility function which counts the number of times `term` occurs in `string_to_search`
91 |     :param string_to_search: A string which may or may not contain the term.
92 |     :type string_to_search: str
93 |     :param term: The term to search for. Note that this is compiled as a case-insensitive regex.
94 |     :type term: str
95 |     :return: The number of times the `term` occurs in the `string_to_search`
96 |     :rtype: int
97 |     """
98 |     try:
99 |         regular_expression = re.compile(term, re.IGNORECASE)
100 |         result = re.findall(regular_expression, string_to_search)
101 |         return len(result)
102 |     except Exception:
103 |         logging.error('Error occurred during regex search for term: {}'.format(term))
104 |         return 0
105 | 
106 | 
107 | def term_match(string_to_search, term):
108 |     """
109 |     A utility function which returns the first match of `term` in `string_to_search`
110 |     :param string_to_search: A string which may or may not contain the term.
111 |     :type string_to_search: str
112 |     :param term: The pattern to search for. Note that this is compiled as a case-insensitive regex.
113 |     :type term: str
114 |     :return: The first match of `term` in `string_to_search`, or None if there is no match
115 |     :rtype: str
116 |     """
117 |     try:
118 |         regular_expression = re.compile(term, re.IGNORECASE)
119 |         result = re.findall(regular_expression, string_to_search)
120 |         if len(result) > 0:
121 |             return result[0]
122 |         else:
123 |             return None
124 |     except Exception:
125 |         logging.error('Error occurred during regex search for term: {}'.format(term))
126 |         return None
127 | 
128 | def convert_pdf(f):
129 | 
130 |     # Create intermediate output file
131 |     # TODO Is this a desirable feature? Could this be replaced with a tempfile or fake file?
132 |     output_filename = os.path.basename(os.path.splitext(f)[0]) + '.txt'
133 |     output_filepath = os.path.join('..', 'data', 'output', output_filename)
134 |     logging.info('Writing text from {} to {}'.format(f, output_filepath))
135 | 
136 |     # Convert pdf to text, placed in intermediate output file
137 |     pdf2text.main(args=[f, '--outfile', output_filepath])
138 | 
139 |     # Return contents of intermediate output file
140 |     return open(output_filepath).read()
141 | 
--------------------------------------------------------------------------------
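Editor's note: one caveat worth flagging for `term_count` and `term_match`: the term is compiled directly as a regex, so config entries containing metacharacters, such as `C++` or `.NET` in the shipped template, can either raise (which the `except` silently converts to 0/None) or over-match. A sketch, assuming the repository root is on `sys.path`; the sample text is invented:

```python
import re

from bin import lib

text = 'Shipped a .NET service and a C++ parser.'

lib.term_count(text, 'c++')             # 0: re.compile('c++') raises, and the except returns 0
lib.term_count(text, '.net')            # 1 here, but '.' matches any character, so 'xnet' would also count
lib.term_count(text, re.escape('c++'))  # 1: escaping makes the term match literally
```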
/bin/main.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """
3 | coding=utf-8
4 | 
5 | Resume parsing driver: extract resume files, transform them into structured fields, and load a summary CSV.
6 | 
7 | """
8 | import inspect
9 | import logging
10 | import os
11 | import sys
12 | 
13 | import pandas
14 | import spacy
15 | 
16 | currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
17 | parentdir = os.path.dirname(currentdir)
18 | sys.path.insert(0, parentdir)
19 | 
20 | from bin import field_extraction
21 | from bin import lib
22 | 
23 | 
24 | def main():
25 |     """
26 |     Extract, transform and load resume data
27 |     :return: None
28 |     :rtype: None
29 |     """
30 |     logging.getLogger().setLevel(logging.INFO)
31 | 
32 |     # Extract data from upstream.
33 |     observations = extract()
34 | 
35 |     # Load spacy's English language model
36 |     nlp = spacy.load('en')
37 | 
38 |     # Transform data to have appropriate fields
39 |     observations, nlp = transform(observations, nlp)
40 | 
41 |     # Load data for downstream consumption
42 |     load(observations, nlp)
43 | 
44 | 
45 | def extract():
46 |     logging.info('Begin extract')
47 | 
48 |     # Reference variables
49 |     candidate_file_agg = list()
50 | 
51 |     # Create list of candidate files
52 |     for root, subdirs, files in os.walk(lib.get_conf('resume_directory')):
53 |         folder_files = map(lambda x: os.path.join(root, x), files)
54 |         candidate_file_agg.extend(folder_files)
55 | 
56 |     # Convert list to a pandas DataFrame
57 |     observations = pandas.DataFrame(data=candidate_file_agg, columns=['file_path'])
58 |     logging.info('Found {} candidate files'.format(len(observations.index)))
59 | 
60 |     # Subset candidate files to supported extensions
61 |     observations['extension'] = observations['file_path'].apply(lambda x: os.path.splitext(x)[1])
62 |     observations = observations[observations['extension'].isin(lib.AVAILABLE_EXTENSIONS)]
63 |     logging.info('Subset candidate files to extensions w/ available parsers. {} files remain'.
64 |                  format(len(observations.index)))
65 | 
66 |     # Attempt to extract text from files
67 |     observations['text'] = observations['file_path'].apply(lib.convert_pdf)
68 | 
69 |     # Archive schema and return
70 |     lib.archive_dataset_schemas('extract', locals(), globals())
71 |     logging.info('End extract')
72 |     return observations
73 | 
74 | 
75 | def transform(observations, nlp):
76 |     # TODO Docstring
77 |     logging.info('Begin transform')
78 | 
79 |     # Extract candidate name
80 |     observations['candidate_name'] = observations['text'].apply(lambda x:
81 |                                                                 field_extraction.candidate_name_extractor(x, nlp))
82 | 
83 |     # Fall back to a simple regex search, row by row, where spacy did not find a name
84 |     fallback_names = observations['text'].apply(lambda x: lib.term_match(x, field_extraction.NAME_REGEX))
85 |     observations['candidate_name'] = observations['candidate_name'].where(
86 |         observations['candidate_name'] != 'NOT FOUND', other=fallback_names)
87 | 
88 |     # Extract contact fields
89 |     observations['email'] = observations['text'].apply(lambda x: lib.term_match(x, field_extraction.EMAIL_REGEX))
90 |     observations['phone'] = observations['text'].apply(lambda x: lib.term_match(x, field_extraction.PHONE_REGEX))
91 | 
92 |     # Extract skills
93 |     observations = field_extraction.extract_fields(observations)
94 | 
95 |     # Archive schema and return
96 |     lib.archive_dataset_schemas('transform', locals(), globals())
97 |     logging.info('End transform')
98 |     return observations, nlp
99 | 
100 | 
101 | def load(observations, nlp):
102 |     logging.info('Begin load')
103 |     output_path = os.path.join(lib.get_conf('summary_output_directory'), 'resume_summary.csv')
104 | 
105 |     logging.info('Results being output to {}'.format(output_path))
106 |     print('Results output to {}'.format(output_path))
107 | 
108 |     observations.to_csv(path_or_buf=output_path, index_label='index')
109 |     logging.info('End load')
110 | 
111 | 
112 | # Main section
113 | if __name__ == '__main__':
114 |     main()
--------------------------------------------------------------------------------
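Editor's note: the three stages can also be driven interactively (a sketch; it assumes the working directory is `bin/`, so the relative paths in the config resolve, and that the spacy `en` model has been downloaded):

```python
import spacy

from bin import main as pipeline

observations = pipeline.extract()                # walk resume_directory and pull text
observations, nlp = pipeline.transform(observations, spacy.load('en'))
pipeline.load(observations, nlp)                 # writes ../data/output/resume_summary.csv

print(observations[['candidate_name', 'email', 'phone']].head())
```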
/bin/pdf2text.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | """
4 | Converts PDF text content (though not images containing text) to plain text, html, xml or "tags".
5 | """
6 | import argparse
7 | import sys
8 | 
9 | import pdfminer.settings
10 | import six
11 | 
12 | pdfminer.settings.STRICT = False
13 | import pdfminer.high_level
14 | import pdfminer.layout
15 | from pdfminer.image import ImageWriter
16 | 
17 | 
18 | def extract_text(files=[], outfile='-',
19 |                  _py2_no_more_posargs=None,  # Shim: Python 2 has no keyword-only arguments
20 |                  no_laparams=False, all_texts=None, detect_vertical=None,  # LAParams
21 |                  word_margin=None, char_margin=None, line_margin=None, boxes_flow=None,  # LAParams
22 |                  output_type='text', codec='utf-8', strip_control=False,
23 |                  maxpages=0, page_numbers=None, password="", scale=1.0, rotation=0,
24 |                  layoutmode='normal', output_dir=None, debug=False,
25 |                  disable_caching=False, **other):
26 |     if _py2_no_more_posargs is not None:
27 |         raise ValueError("Too many positional arguments passed.")
28 |     if not files:
29 |         raise ValueError("Must provide files to work upon!")
30 | 
31 |     # If any LAParams group arguments were passed, create an LAParams object and
32 |     # populate with given args. Otherwise, set it to None.
33 |     if not no_laparams:
34 |         laparams = pdfminer.layout.LAParams()
35 |         for param in ("all_texts", "detect_vertical", "word_margin", "char_margin", "line_margin", "boxes_flow"):
36 |             paramv = locals().get(param, None)
37 |             if paramv is not None:
38 |                 setattr(laparams, param, paramv)
39 |     else:
40 |         laparams = None
41 | 
42 |     imagewriter = None
43 |     if output_dir:
44 |         imagewriter = ImageWriter(output_dir)
45 | 
46 |     if output_type == "text" and outfile != "-":
47 |         for override, alttype in ((".htm", "html"),
48 |                                   (".html", "html"),
49 |                                   (".xml", "xml"),
50 |                                   (".tag", "tag")):
51 |             if outfile.endswith(override):
52 |                 output_type = alttype
53 | 
54 |     if outfile == "-":
55 |         outfp = sys.stdout
56 |         if outfp.encoding is not None:
57 |             codec = 'utf-8'
58 |     else:
59 |         outfp = open(outfile, "wb")
60 | 
61 | 
62 |     for fname in files:
63 |         with open(fname, "rb") as fp:
64 |             pdfminer.high_level.extract_text_to_fp(fp, **locals())
65 |     return outfp
66 | 
67 | 
68 | def maketheparser():
69 |     parser = argparse.ArgumentParser(description=__doc__, add_help=True)
70 |     parser.add_argument("files", type=str, default=None, nargs="+", help="Files to process.")
71 |     parser.add_argument("-d", "--debug", default=False, action="store_true", help="Debug output.")
72 |     parser.add_argument("-p", "--pagenos", type=str, help="Comma-separated list of page numbers to parse. Included for legacy applications, use --page-numbers for more idiomatic argument entry.")
73 |     parser.add_argument("--page-numbers", type=int, default=None, nargs="+", help="Alternative to --pagenos with space-separated numbers; supersedes --pagenos where it is used.")
74 |     parser.add_argument("-m", "--maxpages", type=int, default=0, help="Maximum pages to parse")
75 |     parser.add_argument("-P", "--password", type=str, default="", help="Decryption password for PDF")
76 |     parser.add_argument("-o", "--outfile", type=str, default="-", help="Output file (default \"-\" is stdout)")
77 |     parser.add_argument("-t", "--output_type", type=str, default="text", help="Output type: text|html|xml|tag (default is text)")
78 |     parser.add_argument("-c", "--codec", type=str, default="utf-8", help="Text encoding")
79 |     parser.add_argument("-s", "--scale", type=float, default=1.0, help="Scale")
80 |     parser.add_argument("-A", "--all-texts", default=None, action="store_true", help="LAParams all texts")
81 |     parser.add_argument("-V", "--detect-vertical", default=None, action="store_true", help="LAParams detect vertical")
82 |     parser.add_argument("-W", "--word-margin", type=float, default=None, help="LAParams word margin")
83 |     parser.add_argument("-M", "--char-margin", type=float, default=None, help="LAParams char margin")
84 |     parser.add_argument("-L", "--line-margin", type=float, default=None, help="LAParams line margin")
85 |     parser.add_argument("-F", "--boxes-flow", type=float, default=None, help="LAParams boxes flow")
86 |     parser.add_argument("-Y", "--layoutmode", default="normal", type=str, help="HTML Layout Mode")
87 |     parser.add_argument("-n", "--no-laparams", default=False, action="store_true", help="Pass None as LAParams")
88 |     parser.add_argument("-R", "--rotation", default=0, type=int, help="Rotation")
89 |     parser.add_argument("-O", "--output-dir", default=None, help="Output directory for images")
90 |     parser.add_argument("-C", "--disable-caching", default=False, action="store_true", help="Disable caching")
91 |     parser.add_argument("-S", "--strip-control", default=False, action="store_true", help="Strip control in XML mode")
92 |     return parser
93 | 
94 | 
95 | # main
96 | 
97 | 
98 | def main(args=None):
99 | 
100 |     P = maketheparser()
101 |     A = P.parse_args(args=args)
102 | 
103 |     if A.page_numbers:
104 |         A.page_numbers = set([x-1 for x in A.page_numbers])
105 |     if A.pagenos:
106 |         A.page_numbers = set([int(x)-1 for x in A.pagenos.split(",")])
107 | 
108 |     imagewriter = None
109 |     if A.output_dir:
110 |         imagewriter = ImageWriter(A.output_dir)
111 | 
112 |     if six.PY2 and sys.stdin.encoding:
113 |         A.password = A.password.decode(sys.stdin.encoding)
114 | 
115 |     if A.output_type == "text" and A.outfile != "-":
116 |         for override, alttype in ((".htm", "html"),
117 |                                   (".html", "html"),
118 |                                   (".xml", "xml"),
119 |                                   (".tag", "tag")):
120 |             if A.outfile.endswith(override):
121 |                 A.output_type = alttype
122 | 
123 |     if A.outfile == "-":
124 |         outfp = sys.stdout
125 |         if outfp.encoding is not None:
126 |             # Force utf-8 output when writing to a terminal
127 |             A.codec = 'utf-8'
128 |     else:
129 |         outfp = open(A.outfile, "wb")
130 | 
131 |     # Run the extraction and close the output handle
132 |     outfp = extract_text(**vars(A))
133 |     outfp.close()
134 |     return 0
135 | 
136 | 
137 | if __name__ == '__main__': sys.exit(main())
--------------------------------------------------------------------------------
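Editor's note: `lib.convert_pdf` drives this module with exactly the call below, and it also works stand-alone; the file names here are placeholders:

```python
from bin import pdf2text

# Equivalent to: python pdf2text.py some_resume.pdf --outfile some_resume.txt
pdf2text.main(args=['some_resume.pdf', '--outfile', 'some_resume.txt'])

text = open('some_resume.txt').read()
```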
/changelog.md:
--------------------------------------------------------------------------------
1 | # ResumeParser Change Log
2 | 
3 | ## 3.0.0 - 2017-10-20
4 | 
5 | Re-write, mostly from scratch
6 | 
7 | ### Added
8 | 
9 | - None
10 | 
11 | ### Modified
12 | 
13 | Structure
14 | 
15 | - Program now follows ETL design principles
16 | - Program is broken into a driver file, library file and field extraction file
17 | 
18 | Configuration
19 | 
20 | - Skills to look for are now listed in the configuration file
21 | - Universities to look for are now listed in the configuration file
22 | 
23 | Output
24 | - Program now provides a set containing skills found, rather than a count for each skill
25 | 
26 | ### Removed
27 | 
28 | - Address search, which was limited to addresses in California.
29 | - README has been reset to a non-project specific readme. It should be specialized for this project in a future
30 | version.
31 | 
32 | ## 2.1.0 - 2017-10-20
33 | 
34 | ### Added
35 | 
36 | - `candidate_name`: Adding candidate name extractor, using spacy
37 | - `university`: Code will now check for a list of universities
38 | 
39 | ### Changed
40 | 
41 | - Skills search: Now users can provide a list of skills, which will be searched for
42 | 
43 | ### Removed
44 | 
45 | 
46 | ## 2.0.0 - 2016-10-22
47 | 
48 | ### Added
49 | 
50 | ### Changed
51 | - `README.md` re-written for clarity, better code example
52 | - Folder structure refactored for clarity
53 | - `ResumeChecker.py` refactored to match Python style standards, legibility
54 | 
55 | ### Removed
56 | - `code/` folder removed. It only contained extraneous code, and an outdated `requirements.txt`
57 | 
58 | ## 1.0.0 - 2015-02-25
59 | Core functionality to read in PDF resumes, extract text, and output a results table
--------------------------------------------------------------------------------
/confs/config.yaml.template:
--------------------------------------------------------------------------------
1 | resume_directory: ../data/input/example_resumes
2 | summary_output_directory: ../data/output
3 | data_schema_dir: ../data/schema
4 | 
5 | extractors:
6 |   experience:
7 |     - [Teacher, teaching, tutor]
8 |     - [developer, software developer, software engineer, dev]
9 |     - trader
10 | 
11 |   platforms:
12 |     - Linux
13 |     - Windows
14 |     - [Mac, MacOS]
15 | 
16 |   database:
17 |     - SQL
18 |     - MySQL
19 |     - [Postgres, PostgreSQL]
20 |     - Oracle
21 | 
22 |   programming:
23 |     - [java, JavaEE]
24 |     - C
25 |     - C++
26 |     - C#
27 |     - .NET
28 |     - Matlab
29 |     - R
30 |     - python
31 |     - VHDL
32 |     - PHP
33 |     - JavaScript
34 | 
35 | 
36 |   machinelearning:
37 |     - [sklearn, scikit-learn, sk-learn]
38 |     - [tensorflow, tf, tensor-flow]
39 |     - keras
40 |     - h2o
41 | 
42 | 
43 |   universities:
44 |     - [TU Delft, TUDelft, Delft University of Technology]
45 |     - [University of Twente, UTwente]
46 |     - [University of Amsterdam, UvA]
47 |     - [Vrije Universiteit, VU University, VU Amsterdam, VU]
48 |     - MIT
49 | 
50 |   languages:
51 |     - Dutch
52 |     - English
53 |     - German
54 |     - Spanish
55 |     - [Chinese, Mandarin]
56 | 
57 |   hobbies:
58 |     - [swimming, swim]
59 |     - [soccer, football]
60 |     - painting
61 |     - reading
62 | 
63 |   open-source:
64 |     - github
65 |     - bitbucket
66 |     - gitlab
67 |     - sourceforge
68 |     - gitkraken
--------------------------------------------------------------------------------
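Editor's note: before a full run, an edited config can be sanity-checked from a REPL (a sketch; the path assumes you are at the repository root):

```python
import yaml

with open('confs/config.yaml.template') as f:
    conf = yaml.safe_load(f)

# Keys under 'extractors' become columns in resume_summary.csv
print(sorted(conf['extractors']))
# Each term is a plain string, or a [reported_name, alias, ...] list
print(conf['extractors']['universities'])
```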
/data/input/data_descriptions.csv:
--------------------------------------------------------------------------------
1 | filename,source,description
--------------------------------------------------------------------------------
/data/input/example_resumes/Brendan_Herger_Resume.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/Brendan_Herger_Resume.pdf
--------------------------------------------------------------------------------
/data/input/example_resumes/Layla_Martin_Resume.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/Layla_Martin_Resume.pdf
--------------------------------------------------------------------------------
/data/input/example_resumes/SGresume-1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/SGresume-1.pdf
--------------------------------------------------------------------------------
/data/input/example_resumes/john_smith.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/john_smith.pdf
--------------------------------------------------------------------------------
/data/input/example_resumes/resume_Meyer.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/resume_Meyer.pdf -------------------------------------------------------------------------------- /data/output/Brendan_Herger_Resume.txt: -------------------------------------------------------------------------------- 1 | Brendan Herger 2 | 3 | Hergertarian.com | 13herger@gmail.com | + 1 (415) 582-7457 4 | 5 | 1209 Page Street No. 7 San Francisco, Ca 94117 6 | 7 | Selected Experience 8 | 9 | Data Scientist 10 | 11 | Data Innovation Lab @ Capital One (San Francisco, Ca.) 06/15 - Now 12 | Lead research team modeling for fraud problem space with class 13 | imbalance & adversaries; H2O.ai, GraphLab, SKLearn 14 | Deployed sub-millisecond real time model; Apache Apex 15 | Recommended distributed machine learning frameworks for general 16 | adoption at Capital One; H2O.ai, GraphLab, Apache Spark 17 | 18 | Various Technical Positions 19 | 20 | Lawfty (San Francisco, Ca.), RevUp Software (Redwood City, Ca.), 21 | Perkins + Will Architecture (San Francisco, Ca.) & Lawrence Berkeley 22 | National Lab. (Berkeley, Ca.) 23 | 24 | Front End Supervisor 25 | 26 | The Home Depot Pro (Colma, Ca.) 05/11 - 03/13 27 | Positions held: Cashier, Special Services Assoc., Tool Rental Assoc. 28 | Supervised and trained a staff of 10-30 team members 29 | 30 | Education 31 | MS, Analytics 32 | 33 | University of San Francisco, July 2015 34 | Relevant Coursework: Machine Learning, Adv. Machine Learning, Data 35 | Acquisition, Exploratory Data Analysis, Relational Databases, NoSQL 36 | Databases, Linear Regression Analysis, Time Series Analysis, Intro. SAS 37 | 38 | BS, Physics 39 | 40 | University of San Francisco, May 2014 41 | Minors: Computer Sciences, Mathematics 42 | Honors: University Scholar, President of ΠΜΕ Math Honors Society 43 | Relevant Coursework: Software Development, Data Structures & 44 | Algorithms, Differential Eqn.’s, Linear Algebra, Graduate Econometrics 45 | 46 | Personal Projects 47 | 48 | Identified genre of Billboard 49 | Hot 100 songs using ensemble 50 | algorithm built with Support 51 | Vector Machine, Neural Network, 52 | Stochastic Gradient Boost, and 53 | Random Forest algorithms; 54 | Python, Pandas, R and Scikit-Learn 55 | 56 | Implemented Naive Bayes text 57 | classification algorithm and trained 58 | this algorithm to correctly label 59 | 83% of movie reviews; Python, 60 | numpy and Pandas 61 | 62 | Created database containing lyrics 63 | of Billboard Hot 100 songs since 64 | 1958; R, Python, Pandas and 65 | Beautiful Soup 4 66 | 67 | Built a multi-threaded web scraper 68 | and search engine with web 69 | user interface; Java, MySQL and 70 | HTML5/CSS 71 | 72 | Built resume parsing package 73 | which extracts text, finds contact 74 | details, and checks for required 75 | keywords; Python and Pandas 76 | 77 | Online 78 | 79 | Hergertarian.com 80 | 81 | github.com/bjherger 82 | 83 | linkedin.com/in/bjherger 84 | 85 | hergertarian.wordpress.com/ 86 | 87 | -------------------------------------------------------------------------------- /data/output/Layla_Martin_Resume.txt: -------------------------------------------------------------------------------- 1 | Layla Martin 2 | 3 | 2038 McAllister St 4 | 5 | San Francisco, CA 94118 6 | 7 | layla.d.martin@gmail.com (520) 271-2492 8 | 9 | EDUCATION 10 | AND AWARDS 11 | 12 | Master of Science in Analytics 13 | University of San Francisco, San Francisco, CA 14 | 15 | Bachelor of Science in Mathematics, Cum Laude 16 | 
University of San Francisco 17 | Dean’s Honor Role (4 years) 18 | University Scholar, USF’s highest academic scholarship (4 years) 19 | 20 | Expected June 2015 21 | 22 | May 2014 23 | 24 | SELECTED 25 | PROJECTS 26 | 27 | Sentiment Analysis using Naive-Bayes 28 | 29 | Graduate 30 | • Classified movie reviews as positive or negative with 75% accuracy by implementing a 31 | 32 | Naive-Bayes algorithm in Python. 33 | 34 | NBA Play-By-Play Data Cleaning 35 | 36 | Summer research with USF Faculty 37 | • Cleaned play-by-play data for every NBA game from 2006 to 2012 using Pandas (Python). 38 | 39 | Analysis of USF Men’s Basketball Statistics using R 40 | 41 | Undergraduate 42 | • Predicted importance of players on the court using simple and multiple linear regression 43 | 44 | and created visualizations of player statistics using R. 45 | 46 | Explaining Artificial Neural Networks (ANN) 47 | 48 | Undergraduate 49 | • Prepared two student lectures teaching classmates basic implementation and theory behind 50 | • Preprocessed fingerprint images as matrices in Matlab and performed pattern classification 51 | 52 | supervised learning, backpropagation, and pattern recognition with ANN. 53 | 54 | using an ANN program. 55 | 56 | Mathematical Modeling Research 57 | 58 | Undergraduate 59 | 60 | • Performed image compression with SVD implemented in Matlab. 61 | • Ranked West Coast Conference Men’s Basketball teams using multiple centrality ranking 62 | 63 | algorithms in Matlab. 64 | 65 | LEADERSHIP 66 | 67 | President, Pi Mu Epsilon National Math Honor Society 68 | Captain, USF Women’s Soccer 69 | Teacher’s Assistant, USF Astronomy Observations 70 | 71 | 2014 72 | 2013-2014 73 | 2013-2014 74 | 75 | SKILLS 76 | 77 | Python, R, MySQL, PostgreSQL, Matlab, LaTeX, JMP 78 | 79 | INTER- 80 | COLLEGIATE 81 | ATHLETICS 82 | 83 | NCAA Division I Women’s Soccer: USF, 2010-2014 84 | 85 | • Athletic scholarship (full scholarship when combined with University Scholarship). 86 | • Committed 30 hours per week to training, meetings, travel, competition. 
87 | 88 | -------------------------------------------------------------------------------- /data/output/SGresume-1.txt: -------------------------------------------------------------------------------- 1 | Sébastien Genty 2 | 3 | 1209 Page St, Apt 7, San Francisco, CA 94117 4 | 5 | 713-301-5648 • sgenty@me.com 6 | Work Experience 7 | Project Director, Socratic Technologies 8 | 9 | • Collaborate with clients to plan, construct, and execute surveys for collecting 10 | 11 | insightful data on their business and marketing needs 12 | 13 | • Explore data using advanced mathematical and statistical methods, in order to 14 | 15 | identify insights and trends 16 | 17 | • Visualize data and extract insights for client facing deliverables 18 | • Create market simulators using highly customized applications that incorporate post 19 | 20 | survey analytical results to predict comparative sales figures and market share 21 | 22 | • Manage the data collection process by allocating resources, overseeing workflow, and 23 | 24 | verifying quality data collection 25 | 26 | • Formulated and built a new survey tool that increased statistical variance in 27 | 28 | respondent data, currently in use for multiple clients 29 | 30 | • Established a new model to test the reach, appeal and effectiveness of advertisements 31 | 32 | Research Assistant, Bucknell University 33 | 34 | • Independently designed and built an optical set-up to take pictures of Bose-Einstein 35 | 36 | • Created a user interface to control a high-end camera using Visual Basic 37 | • Coordinated with other research assistants working on different aspects of the same 38 | 39 | condensates 40 | 41 | experiment 42 | 43 | • Computer model and data analysis done using Matlab and ImageJ 44 | 45 | Research Intern, MD Anderson Cancer Center 46 | 47 | • Researched CT dosiometry using a relatively new technique involving x-ray sensitive 48 | 49 | film 50 | 51 | • Analyzed data using Matlab and Mathematica 52 | 53 | September 2012 - Present 54 | 55 | Fall 2011 - Spring 2012 56 | 57 | Summer 2011 58 | 59 | Relevant Projects 60 | New Product Development 61 | 62 | • Computed ideal product configuration by creating a market 63 | 64 | simulator using conjoint analysis 65 | 66 | Purchase Process Identification 67 | 68 | • Identified patterns in the purchase process of mortgages using a 69 | 70 | client provided database and formed insights about the timing and 71 | effectiveness of marketing materials 72 | 73 | Customer Feedback 74 | 75 | • Analyzed users’ opinions and knowledge of recently released 76 | 77 | version of leading media player and distribution platform 78 | 79 | Bay Area Bike Share Open Data Challenge 80 | 81 | • Uncovered habits and behavior of riders using an open source 82 | dataset and currently developing a visualization of bike station 83 | usage 84 | 85 | Education 86 | Bucknell University – Lewisburg, PA 87 | 88 | B.S. 
Physics 89 | Minor: International Relations 90 | International Leadership Scholarship 91 | 92 | Graduated May 2012 93 | 94 | Languages 95 | English 96 | French 97 | Spanish 98 | German 99 | 100 | Skills 101 | Computer 102 | R 103 | Python 104 | SPSS 105 | Mathematica 106 | Matlab 107 | Javascript 108 | C++ 109 | SQL 110 | Unix 111 | Adobe Creative Suite 112 | Microsoft Office 113 | 114 | Interests 115 | Film and digital photography, 116 | physics, research, reading, video 117 | games, solving problems and fixing 118 | things 119 | 120 | -------------------------------------------------------------------------------- /data/output/john_smith.txt: -------------------------------------------------------------------------------- 1 | John Smith 2 | 3 | 4 | 5 | 6 | 7 | 2222 McCoy Road Ÿ Columbus, Ohio 44444 Ÿ 614-555-5555 Ÿ sresume@kent.edu Ÿ www.linkedin.com/in/name 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | OBJECTIVE 30 | Seeking a marketing internship with ABC Company in Spring 2017 to utilize my organizational and analytical skills. 31 | 32 | EDUCATION 33 | Bachelor of Business Administration 34 | Kent State University 35 | Major: Business Management 36 | 37 | SIGNIFICANT COURSEWORK 38 | Business Finance, Principles of Management, Legal Environment of Business 39 | 40 | COMPUTER SKILLS 41 | Microsoft Office: Word, PowerPoint, and basic Excel 42 | Applications: SQL (Structure Query Language) 43 | Programs: Adobe Photoshop, Movie Maker 44 | Social Media Administration: LinkedIn, Twitter, Instagram, Facebook 45 | 46 | WORK EXPERIENCE 47 | Kent State University, Kent, Ohio 48 | Resident Advisor 49 | • Collaborate with 10 building staff and campus administrators on a weekly basis to organize a pancake breakfast 50 | 51 | Expected Graduation: May 2018 52 | Kent, Ohio 53 | GPA: 3.6 54 | 55 | August 2015 - Present 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | fundraiser for 200 attendees, raising $500 for community charity. 74 | 75 | • Utilize effective time management skills by creating and implementing 6 programs and activities each semester for 30 76 | 77 | residents while balancing full-time course load and extracurricular commitments. 78 | 79 | • Demonstrate strong communication skills through interacting with 150 residents and campus administrators on a 80 | 81 | weekly basis. 82 | 83 | • Facilitate problem-solving and conflict resolution amongst residents by serving as positive role model, mediator, and 84 | 85 | leader through one-on-one and small group interventions. 86 | 87 | 88 | Panini’s Bar and Grill, Cleveland, Ohio 89 | Server 90 | • Worked independently in a fast-paced environment while developing customer service skills with each guest to ensure 91 | 92 | August 2014 - May 2015 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | their needs were consistently met. 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | July 2014 - October 2015 111 | 112 | • Optimized persuasive skills to highlight nightly specials and engage each table’s attention. 113 | • Fostered cooperation within a staff of 7 to maintain a pleasant atmosphere. 114 | 115 | LEADERSHIP EXPERIENCE 116 | Harding Middle School, Columbus, Ohio 117 | Soccer Assistant Coach 118 | • Executed strong communication skills by guiding and leading 20 seventh and eighth grade girls on team. 119 | • Served as positive role model by teaching young athletes about teamwork, respect, and conflict-resolution. 
120 | • Planned and led weekly meetings with up to 15 parents, field managers and staff members. 121 | • Led team to its first regional championship in October 2015; recognized at school banquet with leadership award. 122 | 123 | CAMPUS INVOLVEMENT 124 | Member, Delta Sigma Pi 125 | 126 | Member, Collegiate Business Association 127 | 128 | VOLUNTEER EXPERIENCE 129 | Relay for Life, Kent State University 130 | Donation Processer, Greater Cleveland Food Bank 131 | 132 | HONORS 133 | Summit County Alumni Association Scholarship 134 | Dean’s List 135 | 136 | 137 | March 2015 138 | December 2014 - February 2015 139 | 140 | Spring 2014 - Present 141 | Spring 2013 - Spring 2015 142 | 143 | August 2016 - Present 144 | August 2015 - Present 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | -------------------------------------------------------------------------------- /data/output/resume_Meyer.txt: -------------------------------------------------------------------------------- 1 | MONICA MEYER 2 | 3 | (415) · 497 · 7282 (cid:5) monica.meyer@comcast.net 4 | 5 | EDUCATION 6 | 7 | Master of Science in Analytics 8 | University of San Francisco 9 | 10 | Bachelor of Arts in Mathematics 11 | University of California, Santa Cruz 12 | Overall GPA: 3.5 13 | 14 | COURSE PROJECTS 15 | 16 | Expected June 2015 17 | 18 | July 2012 19 | 20 | Business Location Recommendation 21 | Sep 2014 22 | · Queried Yelp and Zillow APIs and performed exploratory data analysis necessary for project to help 23 | businesses decide where to open their next establishment. 24 | 25 | Text Classification 26 | · Classified movie reviews as positive or negative using Naive Bayes in Python. 27 | · Found most important words in Reuter’s articles through term frequency, inverse document frequency 28 | on XML files in Python. 29 | 30 | July 2014 31 | 32 | WORK EXPERIENCE 33 | 34 | Bank of America 35 | Dec 2013 - June 2014 36 | Sales and Service Specialist 37 | Mill Valley, CA 38 | · Promoted due to proven ability to resolve complex service issues and process transactions accurately 39 | and efficiently to guarantee customer satisfaction and build customer confidence and trust. Responsible 40 | for establishing, retaining and deepening relationships with customers to achieve team sales goals as 41 | well as providing proactive sales activities of basic products while referring more complex requests such 42 | as mortgages and investment products. 43 | 44 | Bank of America 45 | Aug 2012 - Dec 2013 46 | Teller 47 | Mill Valley, CA 48 | · Gained proficiency in retail banking operations, including computing figures, processing transactions 49 | with speed and accuracy and building customer loyalty through exceptional customer service. Learned 50 | to control large amounts of cash flow, work within established policies, procedures and guidelines and 51 | acquired the ability to advise customers on products and services the bank has to offer. Earned a 52 | promotion to the position of Sales and Service Specialist. 
53 | 54 | SKILLS 55 | 56 | Programming 57 | Protocols & APIs 58 | Databases 59 | 60 | Python, R 61 | XML, JSON, REST 62 | MySQL, PostgreSQL 63 | 64 | -------------------------------------------------------------------------------- /data/output/resume_summary.csv: -------------------------------------------------------------------------------- 1 | index,file_path,extension,text,candidate_name,email,phone,experience,platforms,database,programming,machinelearning,universities,languages,hobbies,open-source 2 | 1,../data/input/example_resumes/SGresume-1.pdf,.pdf,"Sébastien Genty 3 | 4 | 1209 Page St, Apt 7, San Francisco, CA 94117 5 | 6 | 713-301-5648 • sgenty@me.com 7 | Work Experience 8 | Project Director, Socratic Technologies 9 | 10 | • Collaborate with clients to plan, construct, and execute surveys for collecting 11 | 12 | insightful data on their business and marketing needs 13 | 14 | • Explore data using advanced mathematical and statistical methods, in order to 15 | 16 | identify insights and trends 17 | 18 | • Visualize data and extract insights for client facing deliverables 19 | • Create market simulators using highly customized applications that incorporate post 20 | 21 | survey analytical results to predict comparative sales figures and market share 22 | 23 | • Manage the data collection process by allocating resources, overseeing workflow, and 24 | 25 | verifying quality data collection 26 | 27 | • Formulated and built a new survey tool that increased statistical variance in 28 | 29 | respondent data, currently in use for multiple clients 30 | 31 | • Established a new model to test the reach, appeal and effectiveness of advertisements 32 | 33 | Research Assistant, Bucknell University 34 | 35 | • Independently designed and built an optical set-up to take pictures of Bose-Einstein 36 | 37 | • Created a user interface to control a high-end camera using Visual Basic 38 | • Coordinated with other research assistants working on different aspects of the same 39 | 40 | condensates 41 | 42 | experiment 43 | 44 | • Computer model and data analysis done using Matlab and ImageJ 45 | 46 | Research Intern, MD Anderson Cancer Center 47 | 48 | • Researched CT dosiometry using a relatively new technique involving x-ray sensitive 49 | 50 | film 51 | 52 | • Analyzed data using Matlab and Mathematica 53 | 54 | September 2012 - Present 55 | 56 | Fall 2011 - Spring 2012 57 | 58 | Summer 2011 59 | 60 | Relevant Projects 61 | New Product Development 62 | 63 | • Computed ideal product configuration by creating a market 64 | 65 | simulator using conjoint analysis 66 | 67 | Purchase Process Identification 68 | 69 | • Identified patterns in the purchase process of mortgages using a 70 | 71 | client provided database and formed insights about the timing and 72 | effectiveness of marketing materials 73 | 74 | Customer Feedback 75 | 76 | • Analyzed users’ opinions and knowledge of recently released 77 | 78 | version of leading media player and distribution platform 79 | 80 | Bay Area Bike Share Open Data Challenge 81 | 82 | • Uncovered habits and behavior of riders using an open source 83 | dataset and currently developing a visualization of bike station 84 | usage 85 | 86 | Education 87 | Bucknell University – Lewisburg, PA 88 | 89 | B.S. 
Physics 90 | Minor: International Relations 91 | International Leadership Scholarship 92 | 93 | Graduated May 2012 94 | 95 | Languages 96 | English 97 | French 98 | Spanish 99 | German 100 | 101 | Skills 102 | Computer 103 | R 104 | Python 105 | SPSS 106 | Mathematica 107 | Matlab 108 | Javascript 109 | C++ 110 | SQL 111 | Unix 112 | Adobe Creative Suite 113 | Microsoft Office 114 | 115 | Interests 116 | Film and digital photography, 117 | physics, research, reading, video 118 | games, solving problems and fixing 119 | things 120 | 121 | ",Sébastien Genty,sgenty@me.com,"('713', '301', '5648')",{'developer'},set(),{'SQL'},"{'C', 'Matlab', 'java', 'python', 'JavaScript', 'R'}",{'tensorflow'},set(),"{'English', 'German', 'Spanish'}",{'reading'},set() 122 | 2,../data/input/example_resumes/john_smith.pdf,.pdf,"John Smith 123 | 124 | 125 | 126 | 127 | 128 | 2222 McCoy Road Ÿ Columbus, Ohio 44444 Ÿ 614-555-5555 Ÿ sresume@kent.edu Ÿ www.linkedin.com/in/name 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | OBJECTIVE 151 | Seeking a marketing internship with ABC Company in Spring 2017 to utilize my organizational and analytical skills. 152 | 153 | EDUCATION 154 | Bachelor of Business Administration 155 | Kent State University 156 | Major: Business Management 157 | 158 | SIGNIFICANT COURSEWORK 159 | Business Finance, Principles of Management, Legal Environment of Business 160 | 161 | COMPUTER SKILLS 162 | Microsoft Office: Word, PowerPoint, and basic Excel 163 | Applications: SQL (Structure Query Language) 164 | Programs: Adobe Photoshop, Movie Maker 165 | Social Media Administration: LinkedIn, Twitter, Instagram, Facebook 166 | 167 | WORK EXPERIENCE 168 | Kent State University, Kent, Ohio 169 | Resident Advisor 170 | • Collaborate with 10 building staff and campus administrators on a weekly basis to organize a pancake breakfast 171 | 172 | Expected Graduation: May 2018 173 | Kent, Ohio 174 | GPA: 3.6 175 | 176 | August 2015 - Present 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | fundraiser for 200 attendees, raising $500 for community charity. 195 | 196 | • Utilize effective time management skills by creating and implementing 6 programs and activities each semester for 30 197 | 198 | residents while balancing full-time course load and extracurricular commitments. 199 | 200 | • Demonstrate strong communication skills through interacting with 150 residents and campus administrators on a 201 | 202 | weekly basis. 203 | 204 | • Facilitate problem-solving and conflict resolution amongst residents by serving as positive role model, mediator, and 205 | 206 | leader through one-on-one and small group interventions. 207 | 208 | 209 | Panini’s Bar and Grill, Cleveland, Ohio 210 | Server 211 | • Worked independently in a fast-paced environment while developing customer service skills with each guest to ensure 212 | 213 | August 2014 - May 2015 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | their needs were consistently met. 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | July 2014 - October 2015 232 | 233 | • Optimized persuasive skills to highlight nightly specials and engage each table’s attention. 234 | • Fostered cooperation within a staff of 7 to maintain a pleasant atmosphere. 
235 | 236 | LEADERSHIP EXPERIENCE 237 | Harding Middle School, Columbus, Ohio 238 | Soccer Assistant Coach 239 | • Executed strong communication skills by guiding and leading 20 seventh and eighth grade girls on team. 240 | • Served as positive role model by teaching young athletes about teamwork, respect, and conflict-resolution. 241 | • Planned and led weekly meetings with up to 15 parents, field managers and staff members. 242 | • Led team to its first regional championship in October 2015; recognized at school banquet with leadership award. 243 | 244 | CAMPUS INVOLVEMENT 245 | Member, Delta Sigma Pi 246 | 247 | Member, Collegiate Business Association 248 | 249 | VOLUNTEER EXPERIENCE 250 | Relay for Life, Kent State University 251 | Donation Processer, Greater Cleveland Food Bank 252 | 253 | HONORS 254 | Summit County Alumni Association Scholarship 255 | Dean’s List 256 | 257 | 258 | March 2015 259 | December 2014 - February 2015 260 | 261 | Spring 2014 - Present 262 | Spring 2013 - Spring 2015 263 | 264 | August 2016 - Present 265 | August 2015 - Present 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | ",John Smith,sresume@kent.edu,"('614', '555', '5555')","{'Teacher', 'developer'}",set(),{'SQL'},"{'C', 'R'}",set(),{'MIT'},set(),{'soccer'},set() 312 | 3,../data/input/example_resumes/resume_Meyer.pdf,.pdf,"MONICA MEYER 313 | 314 | (415) · 497 · 7282 (cid:5) monica.meyer@comcast.net 315 | 316 | EDUCATION 317 | 318 | Master of Science in Analytics 319 | University of San Francisco 320 | 321 | Bachelor of Arts in Mathematics 322 | University of California, Santa Cruz 323 | Overall GPA: 3.5 324 | 325 | COURSE PROJECTS 326 | 327 | Expected June 2015 328 | 329 | July 2012 330 | 331 | Business Location Recommendation 332 | Sep 2014 333 | · Queried Yelp and Zillow APIs and performed exploratory data analysis necessary for project to help 334 | businesses decide where to open their next establishment. 335 | 336 | Text Classification 337 | · Classified movie reviews as positive or negative using Naive Bayes in Python. 338 | · Found most important words in Reuter’s articles through term frequency, inverse document frequency 339 | on XML files in Python. 340 | 341 | July 2014 342 | 343 | WORK EXPERIENCE 344 | 345 | Bank of America 346 | Dec 2013 - June 2014 347 | Sales and Service Specialist 348 | Mill Valley, CA 349 | · Promoted due to proven ability to resolve complex service issues and process transactions accurately 350 | and efficiently to guarantee customer satisfaction and build customer confidence and trust. Responsible 351 | for establishing, retaining and deepening relationships with customers to achieve team sales goals as 352 | well as providing proactive sales activities of basic products while referring more complex requests such 353 | as mortgages and investment products. 354 | 355 | Bank of America 356 | Aug 2012 - Dec 2013 357 | Teller 358 | Mill Valley, CA 359 | · Gained proficiency in retail banking operations, including computing figures, processing transactions 360 | with speed and accuracy and building customer loyalty through exceptional customer service. 
Learned 361 | to control large amounts of cash flow, work within established policies, procedures and guidelines and 362 | acquired the ability to advise customers on products and services the bank has to offer. Earned a 363 | promotion to the position of Sales and Service Specialist. 364 | 365 | SKILLS 366 | 367 | Programming 368 | Protocols & APIs 369 | Databases 370 | 371 | Python, R 372 | XML, JSON, REST 373 | MySQL, PostgreSQL 374 | 375 | ",MONICA MEYER,monica.meyer@comcast.net,,set(),set(),"{'MySQL', 'SQL', 'Postgress'}","{'C', 'python', 'R', '.NET'}",set(),set(),set(),set(),set() 376 | 4,../data/input/example_resumes/Layla_Martin_Resume.pdf,.pdf,"Layla Martin 377 | 378 | 2038 McAllister St 379 | 380 | San Francisco, CA 94118 381 | 382 | layla.d.martin@gmail.com (520) 271-2492 383 | 384 | EDUCATION 385 | AND AWARDS 386 | 387 | Master of Science in Analytics 388 | University of San Francisco, San Francisco, CA 389 | 390 | Bachelor of Science in Mathematics, Cum Laude 391 | University of San Francisco 392 | Dean’s Honor Role (4 years) 393 | University Scholar, USF’s highest academic scholarship (4 years) 394 | 395 | Expected June 2015 396 | 397 | May 2014 398 | 399 | SELECTED 400 | PROJECTS 401 | 402 | Sentiment Analysis using Naive-Bayes 403 | 404 | Graduate 405 | • Classified movie reviews as positive or negative with 75% accuracy by implementing a 406 | 407 | Naive-Bayes algorithm in Python. 408 | 409 | NBA Play-By-Play Data Cleaning 410 | 411 | Summer research with USF Faculty 412 | • Cleaned play-by-play data for every NBA game from 2006 to 2012 using Pandas (Python). 413 | 414 | Analysis of USF Men’s Basketball Statistics using R 415 | 416 | Undergraduate 417 | • Predicted importance of players on the court using simple and multiple linear regression 418 | 419 | and created visualizations of player statistics using R. 420 | 421 | Explaining Artificial Neural Networks (ANN) 422 | 423 | Undergraduate 424 | • Prepared two student lectures teaching classmates basic implementation and theory behind 425 | • Preprocessed fingerprint images as matrices in Matlab and performed pattern classification 426 | 427 | supervised learning, backpropagation, and pattern recognition with ANN. 428 | 429 | using an ANN program. 430 | 431 | Mathematical Modeling Research 432 | 433 | Undergraduate 434 | 435 | • Performed image compression with SVD implemented in Matlab. 436 | • Ranked West Coast Conference Men’s Basketball teams using multiple centrality ranking 437 | 438 | algorithms in Matlab. 439 | 440 | LEADERSHIP 441 | 442 | President, Pi Mu Epsilon National Math Honor Society 443 | Captain, USF Women’s Soccer 444 | Teacher’s Assistant, USF Astronomy Observations 445 | 446 | 2014 447 | 2013-2014 448 | 2013-2014 449 | 450 | SKILLS 451 | 452 | Python, R, MySQL, PostgreSQL, Matlab, LaTeX, JMP 453 | 454 | INTER- 455 | COLLEGIATE 456 | ATHLETICS 457 | 458 | NCAA Division I Women’s Soccer: USF, 2010-2014 459 | 460 | • Athletic scholarship (full scholarship when combined with University Scholarship). 461 | • Committed 30 hours per week to training, meetings, travel, competition. 462 | 463 | ",Layla Martin,layla.d.martin@gmail.com,"('520', '271', '2492')",{'Teacher'},set(),"{'MySQL', 'SQL', 'Postgress'}","{'C', 'Matlab', '.NET', 'python', 'R'}",set(),{'MIT'},set(),{'soccer'},set() 464 | 5,../data/input/example_resumes/Brendan_Herger_Resume.pdf,.pdf,"Brendan Herger 465 | 466 | Hergertarian.com | 13herger@gmail.com | + 1 (415) 582-7457 467 | 468 | 1209 Page Street No. 
7 San Francisco, Ca 94117 469 | 470 | Selected Experience 471 | 472 | Data Scientist 473 | 474 | Data Innovation Lab @ Capital One (San Francisco, Ca.) 06/15 - Now 475 | Lead research team modeling for fraud problem space with class 476 | imbalance & adversaries; H2O.ai, GraphLab, SKLearn 477 | Deployed sub-millisecond real time model; Apache Apex 478 | Recommended distributed machine learning frameworks for general 479 | adoption at Capital One; H2O.ai, GraphLab, Apache Spark 480 | 481 | Various Technical Positions 482 | 483 | Lawfty (San Francisco, Ca.), RevUp Software (Redwood City, Ca.), 484 | Perkins + Will Architecture (San Francisco, Ca.) & Lawrence Berkeley 485 | National Lab. (Berkeley, Ca.) 486 | 487 | Front End Supervisor 488 | 489 | The Home Depot Pro (Colma, Ca.) 05/11 - 03/13 490 | Positions held: Cashier, Special Services Assoc., Tool Rental Assoc. 491 | Supervised and trained a staff of 10-30 team members 492 | 493 | Education 494 | MS, Analytics 495 | 496 | University of San Francisco, July 2015 497 | Relevant Coursework: Machine Learning, Adv. Machine Learning, Data 498 | Acquisition, Exploratory Data Analysis, Relational Databases, NoSQL 499 | Databases, Linear Regression Analysis, Time Series Analysis, Intro. SAS 500 | 501 | BS, Physics 502 | 503 | University of San Francisco, May 2014 504 | Minors: Computer Sciences, Mathematics 505 | Honors: University Scholar, President of ΠΜΕ Math Honors Society 506 | Relevant Coursework: Software Development, Data Structures & 507 | Algorithms, Differential Eqn.’s, Linear Algebra, Graduate Econometrics 508 | 509 | Personal Projects 510 | 511 | Identified genre of Billboard 512 | Hot 100 songs using ensemble 513 | algorithm built with Support 514 | Vector Machine, Neural Network, 515 | Stochastic Gradient Boost, and 516 | Random Forest algorithms; 517 | Python, Pandas, R and Scikit-Learn 518 | 519 | Implemented Naive Bayes text 520 | classification algorithm and trained 521 | this algorithm to correctly label 522 | 83% of movie reviews; Python, 523 | numpy and Pandas 524 | 525 | Created database containing lyrics 526 | of Billboard Hot 100 songs since 527 | 1958; R, Python, Pandas and 528 | Beautiful Soup 4 529 | 530 | Built a multi-threaded web scraper 531 | and search engine with web 532 | user interface; Java, MySQL and 533 | HTML5/CSS 534 | 535 | Built resume parsing package 536 | which extracts text, finds contact 537 | details, and checks for required 538 | keywords; Python and Pandas 539 | 540 | Online 541 | 542 | Hergertarian.com 543 | 544 | github.com/bjherger 545 | 546 | linkedin.com/in/bjherger 547 | 548 | hergertarian.wordpress.com/ 549 | 550 | ",Brendan Herger,13herger@gmail.com,"('415', '582', '7457')",{'developer'},{'Mac'},"{'MySQL', 'SQL'}","{'C', 'java', '.NET', 'python', 'R'}",{'sklearn'},{'Vrije Univesriteit'},set(),set(),{'github'} 551 | -------------------------------------------------------------------------------- /data/schema/extract.csv: -------------------------------------------------------------------------------- 1 | variable,type,data_set 2 | file_path,object,observations 3 | extension,object,observations 4 | text,object,observations 5 | -------------------------------------------------------------------------------- /data/schema/transform.csv: -------------------------------------------------------------------------------- 1 | variable,type,data_set 2 | file_path,object,observations 3 | extension,object,observations 4 | text,object,observations 5 | candidate_name,object,observations 6 | 
email,object,observations
7 | phone,object,observations
8 | experience,object,observations
9 | platforms,object,observations
10 | database,object,observations
11 | programming,object,observations
12 | machinelearning,object,observations
13 | universities,object,observations
14 | languages,object,observations
15 | hobbies,object,observations
16 | open-source,object,observations
17 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | gensim==3.7.1
2 | pandas==0.24.2
3 | pdfminer.six==20181108
4 | spacy==2.1.3
5 | PyYAML==5.1
6 | 
--------------------------------------------------------------------------------
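Editor's note: a quick check that the pinned stack above imports cleanly and the language model is present (a sketch, assuming `pip install -r requirements.txt` and `python -m spacy download en` have been run):

```python
import gensim
import pandas
import pdfminer.high_level
import spacy
import yaml

print(spacy.__version__)  # expect 2.1.x, per requirements.txt
nlp = spacy.load('en')    # raises if the language model was not downloaded
```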