├── .gitignore
├── README.md
├── __init__.py
├── bin
│   ├── __init__.py
│   ├── code_template.py
│   ├── field_extraction.py
│   ├── lib.py
│   ├── main.py
│   └── pdf2text.py
├── changelog.md
├── confs
│   └── config.yaml.template
├── data
│   ├── input
│   │   ├── data_descriptions.csv
│   │   └── example_resumes
│   │       ├── Brendan_Herger_Resume.pdf
│   │       ├── Layla_Martin_Resume.pdf
│   │       ├── SGresume-1.pdf
│   │       ├── john_smith.pdf
│   │       └── resume_Meyer.pdf
│   ├── output
│   │   ├── Brendan_Herger_Resume.txt
│   │   ├── Layla_Martin_Resume.txt
│   │   ├── SGresume-1.txt
│   │   ├── john_smith.txt
│   │   ├── resume_Meyer.txt
│   │   └── resume_summary.csv
│   └── schema
│       ├── extract.csv
│       └── transform.csv
└── requirements.txt

/.gitignore:
--------------------------------------------------------------------------------
1 | # MacOS specific files
2 | .DS_Store
3 | 
4 | # IDE specific files
5 | .idea/*
6 | 
7 | # Compiled python files
8 | *.pyc
9 | 
10 | # Configuration files
11 | confs/*.yaml
12 | confs/*.yml
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ResumeParser
2 | 
3 | A utility to make handling many resumes easier by automatically pulling contact information, required skills and custom text fields. These results are then surfaced as a convenient summary CSV.
4 | 
5 | ## Quick Start Guide
6 | 
7 | ```bash
8 | # Install requirements
9 | pip install -r requirements.txt
10 | 
11 | # Retrieve language model from spacy
12 | python -m spacy download en
13 | 
14 | # Run code (with default configurations)
15 | cd bin/
16 | python main.py
17 | 
18 | # Review output
19 | open ../data/output/resume_summary.csv
20 | 
21 | ```
22 | 
23 | ## Getting started
24 | 
25 | ### Repo structure
26 | 
27 | - `bin/main.py`: Code entry point
28 | - `confs/config.yaml.template`: Configuration file template
29 | - `data/input/example_resumes`: Example resumes, which are parsed w/ default configurations
30 | - `data/output/resume_summary.csv`: Results from parsing example resumes
31 | 
32 | ### Python Environment
33 | 
34 | Python code in this repo utilizes packages that are not part of the standard library. To make sure you have all of the
35 | appropriate packages, please use `pip` to install from the `requirements.txt` file. For more details, please see the [pip
36 | documentation](https://pip.pypa.io/en/stable/user_guide/#requirements-files)
37 | 
38 | ### Configuration file
39 | 
40 | This program utilizes a configuration file to set program parameters. You can run this program with the default
41 | parameters to view sample output, but you'll probably want to create a config file and modify it to get the most value
42 | from this program.
43 | 
44 | ```bash
45 | 
46 | # Create configuration file from template
47 | cp confs/config.yaml.template confs/config.yaml
48 | 
49 | # Modify confs to match your needs
50 | open confs/config.yaml
51 | ```
52 | 
53 | The configuration file has a few parameters you can tweak:
54 | - `resume_directory`: A directory containing the resumes you'd like to parse
55 | - `summary_output_directory`: Where to place the .csv file summarizing your resumes
56 | - `data_schema_dir`: The directory to store table schemas. This is mostly for development purposes
57 | - `extractors`: A mapping from output column name (e.g. `programming` or `universities`) to a YAML list of terms to search for. Each element in a list can either be a single string (e.g. `trader` or `machine learning`), or a list of aliases for the same term (e.g. `[sklearn, scikit-learn, sk-learn]`). Matching is case-insensitive; see the sketch below for how entries are interpreted
58 | 
59 | ## Contact
60 | Feel free to contact me at `13herger gmail com`. If you're interested in projects like this, check out my [website](http://hergertarian.com) and [blog](http://hergertarian.com/blog)
61 | 
--------------------------------------------------------------------------------
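Editor's note: for illustration, here is a minimal sketch of how each `extractors` entry above is normalized and matched. It mirrors the logic in `bin/field_extraction.py`, but the snippet itself, and its example terms, are illustrative rather than shipped code:

```python
# Illustrative only: mirrors the normalization in bin/field_extraction.py
entries = ['trader', ['sklearn', 'scikit-learn', 'sk-learn']]

normalized = {}
for entry in entries:
    if isinstance(entry, list):   # alias list: the first element is the name reported in the CSV
        normalized[entry[0]] = entry
    else:                         # single string: the term is its own alias
        normalized[entry] = [entry]

# Each alias is then counted in the resume text as a case-insensitive regex; if any
# alias occurs at least once, the reported name lands in that resume's summary row.
```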
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/__init__.py
--------------------------------------------------------------------------------
/bin/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/bin/__init__.py
--------------------------------------------------------------------------------
/bin/code_template.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """
3 | coding=utf-8
4 | 
5 | Code Template
6 | 
7 | """
8 | import logging
9 | 
10 | logging.basicConfig(level=logging.DEBUG)
11 | 
12 | 
13 | def main():
14 |     """
15 |     Main function documentation template
16 |     :return: None
17 |     :rtype: None
18 |     """
19 |     pass
20 | 
21 | 
22 | # Main section
23 | if __name__ == '__main__':
24 |     main()
25 | 
--------------------------------------------------------------------------------
/bin/field_extraction.py:
--------------------------------------------------------------------------------
1 | import logging
2 | 
3 | from gensim.utils import simple_preprocess
4 | 
5 | from bin import lib
6 | 
7 | EMAIL_REGEX = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
8 | PHONE_REGEX = r"\(?(\d{3})?\)?[\s\.-]{0,2}?(\d{3})[\s\.-]{0,2}(\d{4})"
9 | NAME_REGEX = r'[a-z]+(?:\s+[a-z]+)?'  # non-capturing group, so re.findall returns full matches
10 | 
11 | 
12 | def candidate_name_extractor(input_string, nlp):
13 | 
14 |     doc = nlp(input_string)
15 | 
16 |     # Extract entities
17 |     doc_entities = doc.ents
18 | 
19 |     # Subset to person type entities
20 |     doc_persons = filter(lambda x: x.label_ == 'PERSON', doc_entities)
21 |     doc_persons = filter(lambda x: len(x.text.strip().split()) >= 2, doc_persons)
22 |     doc_persons = map(lambda x: x.text.strip(), doc_persons)
23 |     doc_persons = list(doc_persons)
24 | 
25 |     # Assume that the first PERSON entity with at least two tokens is the candidate's name
26 |     if len(doc_persons) > 0:
27 |         return doc_persons[0]
28 |     return "NOT FOUND"
29 | 
30 | 
31 | def extract_fields(df):
32 |     for extractor, items_of_interest in lib.get_conf('extractors').items():
33 |         df[extractor] = df['text'].apply(lambda x: extract_skills(x, extractor, items_of_interest))
34 |     return df
35 | 
36 | 
37 | def extract_skills(resume_text, extractor, items_of_interest):
38 |     potential_skills_dict = dict()
39 |     matched_skills = set()
40 | 
41 |     # TODO This skill input formatting could happen once per run, instead of once per observation.
42 |     for skill_input in items_of_interest:
43 | 
44 |         # Format list inputs
45 |         if type(skill_input) is list and len(skill_input) >= 1:
46 |             potential_skills_dict[skill_input[0]] = skill_input
47 | 
48 |         # Format string inputs
49 |         elif type(skill_input) is str:
50 |             potential_skills_dict[skill_input] = [skill_input]
51 |         else:
52 |             logging.warning('Unknown skill listing type: {}. Please format as either a single string or a list of strings'
53 |                             ''.format(skill_input))
54 | 
55 |     for (skill_name, skill_alias_list) in potential_skills_dict.items():
56 | 
57 |         skill_matches = 0
58 |         # Iterate through aliases
59 |         for skill_alias in skill_alias_list:
60 |             # Add the number of matches for each alias
61 |             skill_matches += lib.term_count(resume_text, skill_alias.lower())
62 | 
63 |         # If at least one alias is found, add skill name to set of skills
64 |         if skill_matches > 0:
65 |             matched_skills.add(skill_name)
66 | 
67 |     return matched_skills
--------------------------------------------------------------------------------
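Editor's note: a usage sketch for the module above, not repo code. The sample text and address are invented, and the imports assume the repository root is on `sys.path`:

```python
from bin import field_extraction, lib

sample = 'Jane Doe built ETL pipelines in Python and PostgreSQL. Reach her at jane.doe@example.com.'

# Skill matching: strings match themselves, lists are alias groups keyed by their first element
field_extraction.extract_skills(sample, 'example', ['python', ['Postgres', 'PostgreSQL']])
# -> {'python', 'Postgres'}

# Contact fields reuse the same regex helpers
lib.term_match(sample, field_extraction.EMAIL_REGEX)  # -> 'jane.doe@example.com'
```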
/bin/lib.py:
--------------------------------------------------------------------------------
1 | """
2 | coding=utf-8
3 | """
4 | import logging
5 | import os
6 | import re
7 | import subprocess
8 | 
9 | import pandas
10 | import yaml
11 | 
12 | from bin import pdf2text
13 | 
14 | CONFS = None
15 | 
16 | AVAILABLE_EXTENSIONS = {'.csv', '.doc', '.docx', '.eml', '.epub', '.gif', '.htm', '.html', '.jpeg', '.jpg', '.json',
17 |                         '.log', '.mp3', '.msg', '.odt', '.ogg', '.pdf', '.png', '.pptx', '.ps', '.psv', '.rtf', '.tff',
18 |                         '.tif', '.tiff', '.tsv', '.txt', '.wav', '.xls', '.xlsx'}
19 | 
20 | 
21 | def load_confs(confs_path='../confs/config.yaml'):
22 |     # TODO Docstring
23 |     global CONFS
24 | 
25 |     if CONFS is None:
26 |         try:
27 |             CONFS = yaml.safe_load(open(confs_path))
28 |         except IOError:
29 |             confs_template_path = confs_path + '.template'
30 |             logging.warning(
31 |                 'Confs path: {} does not exist. Attempting to load confs template, '
32 |                 'from path: {}'.format(confs_path, confs_template_path))
33 |             CONFS = yaml.safe_load(open(confs_template_path))
34 |     return CONFS
35 | 
36 | 
37 | def get_conf(conf_name):
38 |     return load_confs()[conf_name]
39 | 
40 | 
41 | def archive_dataset_schemas(step_name, local_dict, global_dict):
42 |     """
43 |     Archive the schema for all available Pandas DataFrames
44 |     - Determine which objects in namespace are Pandas DataFrames
45 |     - Pull schema for all available Pandas DataFrames
46 |     - Write schemas to file
47 |     :param step_name: The name of the current operation (e.g. `extract`, `transform`, `model` or `load`)
48 |     :param local_dict: A dictionary containing mappings from variable name to objects. This is usually generated by
49 |     calling `locals`
50 |     :type local_dict: dict
51 |     :param global_dict: A dictionary containing mappings from variable name to objects. This is usually generated by
52 |     calling `globals`
53 |     :type global_dict: dict
54 |     :return: None
55 |     :rtype: None
56 |     """
57 |     logging.info('Archiving data set schema(s) for step name: {}'.format(step_name))
58 | 
59 |     # Reference variables
60 |     data_schema_dir = get_conf('data_schema_dir')
61 |     schema_output_path = os.path.join(data_schema_dir, step_name + '.csv')
62 |     schema_agg = list()
63 | 
64 |     env_variables = dict()
65 |     env_variables.update(local_dict)
66 |     env_variables.update(global_dict)
67 | 
68 |     # Filter down to Pandas DataFrames
69 |     data_sets = filter(lambda x: type(x[1]) == pandas.DataFrame, env_variables.items())
70 |     data_sets = dict(data_sets)
71 | 
72 |     for (data_set_name, data_set) in data_sets.items():
73 |         # Extract variable names
74 |         logging.info('Working data_set: {}'.format(data_set_name))
75 | 
76 |         local_schema_df = pandas.DataFrame(data_set.dtypes, columns=['type'])
77 |         local_schema_df['data_set'] = data_set_name
78 | 
79 |         schema_agg.append(local_schema_df)
80 | 
81 |     # Aggregate schema list into one data frame
82 |     agg_schema_df = pandas.concat(schema_agg)
83 | 
84 |     # Write to file
85 |     agg_schema_df.to_csv(schema_output_path, index_label='variable')
86 | 
87 | 
88 | def term_count(string_to_search, term):
89 |     """
90 |     A utility function which counts the number of times `term` occurs in `string_to_search`
91 |     :param string_to_search: A string which may or may not contain the term.
92 |     :type string_to_search: str
93 |     :param term: The term to search for. Note that this is compiled as a case-insensitive regex.
94 |     :type term: str
95 |     :return: The number of times the `term` occurs in the `string_to_search`
96 |     :rtype: int
97 |     """
98 |     try:
99 |         regular_expression = re.compile(term, re.IGNORECASE)
100 |         result = re.findall(regular_expression, string_to_search)
101 |         return len(result)
102 |     except Exception:
103 |         logging.error('Error occurred during regex search for term: {}'.format(term))
104 |         return 0
105 | 
106 | 
107 | def term_match(string_to_search, term):
108 |     """
109 |     A utility function which returns the first match of `term` in `string_to_search`
110 |     :param string_to_search: A string which may or may not contain the term.
111 |     :type string_to_search: str
112 |     :param term: The pattern to search for. Note that this is compiled as a case-insensitive regex.
113 |     :type term: str
114 |     :return: The first match of `term` in `string_to_search`, or None if there is no match
115 |     :rtype: str
116 |     """
117 |     try:
118 |         regular_expression = re.compile(term, re.IGNORECASE)
119 |         result = re.findall(regular_expression, string_to_search)
120 |         if len(result) > 0:
121 |             return result[0]
122 |         else:
123 |             return None
124 |     except Exception:
125 |         logging.error('Error occurred during regex search for term: {}'.format(term))
126 |         return None
127 | 
128 | def convert_pdf(f):
129 | 
130 |     # Create intermediate output file
131 |     # TODO Is this a desirable feature? Could this be replaced with a tempfile or fake file?
132 |     output_filename = os.path.basename(os.path.splitext(f)[0]) + '.txt'
133 |     output_filepath = os.path.join('..', 'data', 'output', output_filename)
134 |     logging.info('Writing text from {} to {}'.format(f, output_filepath))
135 | 
136 |     # Convert pdf to text, placed in intermediate output file
137 |     pdf2text.main(args=[f, '--outfile', output_filepath])
138 | 
139 |     # Return contents of intermediate output file
140 |     return open(output_filepath).read()
141 | 
--------------------------------------------------------------------------------
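Editor's note: one caveat worth flagging for `term_count` and `term_match`: the term is compiled directly as a regex, so config entries containing metacharacters, such as `C++` or `.NET` in the shipped template, can either raise (which the `except` silently converts to 0/None) or over-match. A sketch, assuming the repository root is on `sys.path`; the sample text is invented:

```python
import re

from bin import lib

text = 'Shipped a .NET service and a C++ parser.'

lib.term_count(text, 'c++')             # 0: re.compile('c++') raises, and the except returns 0
lib.term_count(text, '.net')            # 1 here, but '.' matches any character, so 'xnet' would also count
lib.term_count(text, re.escape('c++'))  # 1: escaping makes the term match literally
```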
/bin/main.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """
3 | coding=utf-8
4 | 
5 | Resume parsing driver: extract resume files, transform them into structured fields, and load a summary CSV.
6 | 
7 | """
8 | import inspect
9 | import logging
10 | import os
11 | import sys
12 | 
13 | import pandas
14 | import spacy
15 | 
16 | currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
17 | parentdir = os.path.dirname(currentdir)
18 | sys.path.insert(0, parentdir)
19 | 
20 | from bin import field_extraction
21 | from bin import lib
22 | 
23 | 
24 | def main():
25 |     """
26 |     Extract, transform and load resume data
27 |     :return: None
28 |     :rtype: None
29 |     """
30 |     logging.getLogger().setLevel(logging.INFO)
31 | 
32 |     # Extract data from upstream.
33 |     observations = extract()
34 | 
35 |     # Load spacy's English language model
36 |     nlp = spacy.load('en')
37 | 
38 |     # Transform data to have appropriate fields
39 |     observations, nlp = transform(observations, nlp)
40 | 
41 |     # Load data for downstream consumption
42 |     load(observations, nlp)
43 | 
44 | 
45 | def extract():
46 |     logging.info('Begin extract')
47 | 
48 |     # Reference variables
49 |     candidate_file_agg = list()
50 | 
51 |     # Create list of candidate files
52 |     for root, subdirs, files in os.walk(lib.get_conf('resume_directory')):
53 |         folder_files = map(lambda x: os.path.join(root, x), files)
54 |         candidate_file_agg.extend(folder_files)
55 | 
56 |     # Convert list to a pandas DataFrame
57 |     observations = pandas.DataFrame(data=candidate_file_agg, columns=['file_path'])
58 |     logging.info('Found {} candidate files'.format(len(observations.index)))
59 | 
60 |     # Subset candidate files to supported extensions
61 |     observations['extension'] = observations['file_path'].apply(lambda x: os.path.splitext(x)[1])
62 |     observations = observations[observations['extension'].isin(lib.AVAILABLE_EXTENSIONS)]
63 |     logging.info('Subset candidate files to extensions w/ available parsers. {} files remain'.
64 |                  format(len(observations.index)))
65 | 
66 |     # Attempt to extract text from files
67 |     observations['text'] = observations['file_path'].apply(lib.convert_pdf)
68 | 
69 |     # Archive schema and return
70 |     lib.archive_dataset_schemas('extract', locals(), globals())
71 |     logging.info('End extract')
72 |     return observations
73 | 
74 | 
75 | def transform(observations, nlp):
76 |     # TODO Docstring
77 |     logging.info('Begin transform')
78 | 
79 |     # Extract candidate name
80 |     observations['candidate_name'] = observations['text'].apply(lambda x:
81 |                                                                 field_extraction.candidate_name_extractor(x, nlp))
82 | 
83 |     # Fall back to a simple regex search, row by row, where spacy did not find a name
84 |     fallback_names = observations['text'].apply(lambda x: lib.term_match(x, field_extraction.NAME_REGEX))
85 |     observations['candidate_name'] = observations['candidate_name'].where(
86 |         observations['candidate_name'] != 'NOT FOUND', other=fallback_names)
87 | 
88 |     # Extract contact fields
89 |     observations['email'] = observations['text'].apply(lambda x: lib.term_match(x, field_extraction.EMAIL_REGEX))
90 |     observations['phone'] = observations['text'].apply(lambda x: lib.term_match(x, field_extraction.PHONE_REGEX))
91 | 
92 |     # Extract skills
93 |     observations = field_extraction.extract_fields(observations)
94 | 
95 |     # Archive schema and return
96 |     lib.archive_dataset_schemas('transform', locals(), globals())
97 |     logging.info('End transform')
98 |     return observations, nlp
99 | 
100 | 
101 | def load(observations, nlp):
102 |     logging.info('Begin load')
103 |     output_path = os.path.join(lib.get_conf('summary_output_directory'), 'resume_summary.csv')
104 | 
105 |     logging.info('Results being output to {}'.format(output_path))
106 |     print('Results output to {}'.format(output_path))
107 | 
108 |     observations.to_csv(path_or_buf=output_path, index_label='index')
109 |     logging.info('End load')
110 | 
111 | 
112 | # Main section
113 | if __name__ == '__main__':
114 |     main()
--------------------------------------------------------------------------------
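Editor's note: the three stages can also be driven interactively (a sketch; it assumes the working directory is `bin/`, so the relative paths in the config resolve, and that the spacy `en` model has been downloaded):

```python
import spacy

from bin import main as pipeline

observations = pipeline.extract()                # walk resume_directory and pull text
observations, nlp = pipeline.transform(observations, spacy.load('en'))
pipeline.load(observations, nlp)                 # writes ../data/output/resume_summary.csv

print(observations[['candidate_name', 'email', 'phone']].head())
```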
/bin/pdf2text.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | """
4 | Converts PDF text content (though not images containing text) to plain text, html, xml or "tags".
5 | """
6 | import argparse
7 | import sys
8 | 
9 | import pdfminer.settings
10 | import six
11 | 
12 | pdfminer.settings.STRICT = False
13 | import pdfminer.high_level
14 | import pdfminer.layout
15 | from pdfminer.image import ImageWriter
16 | 
17 | 
18 | def extract_text(files=[], outfile='-',
19 |                  _py2_no_more_posargs=None,  # Shim: Python 2 has no keyword-only arguments
20 |                  no_laparams=False, all_texts=None, detect_vertical=None,  # LAParams
21 |                  word_margin=None, char_margin=None, line_margin=None, boxes_flow=None,  # LAParams
22 |                  output_type='text', codec='utf-8', strip_control=False,
23 |                  maxpages=0, page_numbers=None, password="", scale=1.0, rotation=0,
24 |                  layoutmode='normal', output_dir=None, debug=False,
25 |                  disable_caching=False, **other):
26 |     if _py2_no_more_posargs is not None:
27 |         raise ValueError("Too many positional arguments passed.")
28 |     if not files:
29 |         raise ValueError("Must provide files to work upon!")
30 | 
31 |     # If any LAParams group arguments were passed, create an LAParams object and
32 |     # populate with given args. Otherwise, set it to None.
33 |     if not no_laparams:
34 |         laparams = pdfminer.layout.LAParams()
35 |         for param in ("all_texts", "detect_vertical", "word_margin", "char_margin", "line_margin", "boxes_flow"):
36 |             paramv = locals().get(param, None)
37 |             if paramv is not None:
38 |                 setattr(laparams, param, paramv)
39 |     else:
40 |         laparams = None
41 | 
42 |     imagewriter = None
43 |     if output_dir:
44 |         imagewriter = ImageWriter(output_dir)
45 | 
46 |     if output_type == "text" and outfile != "-":
47 |         for override, alttype in ((".htm", "html"),
48 |                                   (".html", "html"),
49 |                                   (".xml", "xml"),
50 |                                   (".tag", "tag")):
51 |             if outfile.endswith(override):
52 |                 output_type = alttype
53 | 
54 |     if outfile == "-":
55 |         outfp = sys.stdout
56 |         if outfp.encoding is not None:
57 |             codec = 'utf-8'
58 |     else:
59 |         outfp = open(outfile, "wb")
60 | 
61 | 
62 |     for fname in files:
63 |         with open(fname, "rb") as fp:
64 |             pdfminer.high_level.extract_text_to_fp(fp, **locals())
65 |     return outfp
66 | 
67 | 
68 | def maketheparser():
69 |     parser = argparse.ArgumentParser(description=__doc__, add_help=True)
70 |     parser.add_argument("files", type=str, default=None, nargs="+", help="Files to process.")
71 |     parser.add_argument("-d", "--debug", default=False, action="store_true", help="Debug output.")
72 |     parser.add_argument("-p", "--pagenos", type=str, help="Comma-separated list of page numbers to parse. Included for legacy applications, use --page-numbers for more idiomatic argument entry.")
73 |     parser.add_argument("--page-numbers", type=int, default=None, nargs="+", help="Alternative to --pagenos with space-separated numbers; supersedes --pagenos where it is used.")
74 |     parser.add_argument("-m", "--maxpages", type=int, default=0, help="Maximum pages to parse")
75 |     parser.add_argument("-P", "--password", type=str, default="", help="Decryption password for PDF")
76 |     parser.add_argument("-o", "--outfile", type=str, default="-", help="Output file (default \"-\" is stdout)")
77 |     parser.add_argument("-t", "--output_type", type=str, default="text", help="Output type: text|html|xml|tag (default is text)")
78 |     parser.add_argument("-c", "--codec", type=str, default="utf-8", help="Text encoding")
79 |     parser.add_argument("-s", "--scale", type=float, default=1.0, help="Scale")
80 |     parser.add_argument("-A", "--all-texts", default=None, action="store_true", help="LAParams all texts")
81 |     parser.add_argument("-V", "--detect-vertical", default=None, action="store_true", help="LAParams detect vertical")
82 |     parser.add_argument("-W", "--word-margin", type=float, default=None, help="LAParams word margin")
83 |     parser.add_argument("-M", "--char-margin", type=float, default=None, help="LAParams char margin")
84 |     parser.add_argument("-L", "--line-margin", type=float, default=None, help="LAParams line margin")
85 |     parser.add_argument("-F", "--boxes-flow", type=float, default=None, help="LAParams boxes flow")
86 |     parser.add_argument("-Y", "--layoutmode", default="normal", type=str, help="HTML Layout Mode")
87 |     parser.add_argument("-n", "--no-laparams", default=False, action="store_true", help="Pass None as LAParams")
88 |     parser.add_argument("-R", "--rotation", default=0, type=int, help="Rotation")
89 |     parser.add_argument("-O", "--output-dir", default=None, help="Output directory for images")
90 |     parser.add_argument("-C", "--disable-caching", default=False, action="store_true", help="Disable caching")
91 |     parser.add_argument("-S", "--strip-control", default=False, action="store_true", help="Strip control in XML mode")
92 |     return parser
93 | 
94 | 
95 | # main
96 | 
97 | 
98 | def main(args=None):
99 | 
100 |     P = maketheparser()
101 |     A = P.parse_args(args=args)
102 | 
103 |     if A.page_numbers:
104 |         A.page_numbers = set([x-1 for x in A.page_numbers])
105 |     if A.pagenos:
106 |         A.page_numbers = set([int(x)-1 for x in A.pagenos.split(",")])
107 | 
108 |     imagewriter = None
109 |     if A.output_dir:
110 |         imagewriter = ImageWriter(A.output_dir)
111 | 
112 |     if six.PY2 and sys.stdin.encoding:
113 |         A.password = A.password.decode(sys.stdin.encoding)
114 | 
115 |     if A.output_type == "text" and A.outfile != "-":
116 |         for override, alttype in ((".htm", "html"),
117 |                                   (".html", "html"),
118 |                                   (".xml", "xml"),
119 |                                   (".tag", "tag")):
120 |             if A.outfile.endswith(override):
121 |                 A.output_type = alttype
122 | 
123 |     if A.outfile == "-":
124 |         outfp = sys.stdout
125 |         if outfp.encoding is not None:
126 |             # Force utf-8 output when writing to a terminal
127 |             A.codec = 'utf-8'
128 |     else:
129 |         outfp = open(A.outfile, "wb")
130 | 
131 |     # Run the extraction and close the output handle
132 |     outfp = extract_text(**vars(A))
133 |     outfp.close()
134 |     return 0
135 | 
136 | 
137 | if __name__ == '__main__': sys.exit(main())
--------------------------------------------------------------------------------
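Editor's note: `lib.convert_pdf` drives this module with exactly the call below, and it also works stand-alone; the file names here are placeholders:

```python
from bin import pdf2text

# Equivalent to: python pdf2text.py some_resume.pdf --outfile some_resume.txt
pdf2text.main(args=['some_resume.pdf', '--outfile', 'some_resume.txt'])

text = open('some_resume.txt').read()
```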
/changelog.md:
--------------------------------------------------------------------------------
1 | # ResumeParser Change Log
2 | 
3 | ## 3.0.0 - 2017-10-20
4 | 
5 | Re-write, mostly from scratch
6 | 
7 | ### Added
8 | 
9 | - None
10 | 
11 | ### Modified
12 | 
13 | Structure
14 | 
15 | - Program now follows ETL design principles
16 | - Program is broken into a driver file, library file and field extraction file
17 | 
18 | Configuration
19 | 
20 | - Skills to look for are now listed in the configuration file
21 | - Universities to look for are now listed in the configuration file
22 | 
23 | Output
24 | - Program now provides a set containing skills found, rather than a count for each skill
25 | 
26 | ### Removed
27 | 
28 | - Address search, which was limited to addresses in California.
29 | - README has been reset to a non-project specific readme. It should be specialized for this project in a future
30 | version.
31 | 
32 | ## 2.1.0 - 2017-10-20
33 | 
34 | ### Added
35 | 
36 | - `candidate_name`: Adding candidate name extractor, using spacy
37 | - `university`: Code will now check for a list of universities
38 | 
39 | ### Changed
40 | 
41 | - Skills search: Now users can provide a list of skills, which will be searched for
42 | 
43 | ### Removed
44 | 
45 | 
46 | ## 2.0.0 - 2016-10-22
47 | 
48 | ### Added
49 | 
50 | ### Changed
51 | - `README.md` re-written for clarity, better code example
52 | - Folder structure refactored for clarity
53 | - `ResumeChecker.py` refactored to match Python style standards, legibility
54 | 
55 | ### Removed
56 | - `code/` folder removed. It only contained extraneous code, and an outdated `requirements.txt`
57 | 
58 | ## 1.0.0 - 2015-02-25
59 | Core functionality to read in PDF resumes, extract text, and output a results table
--------------------------------------------------------------------------------
/confs/config.yaml.template:
--------------------------------------------------------------------------------
1 | resume_directory: ../data/input/example_resumes
2 | summary_output_directory: ../data/output
3 | data_schema_dir: ../data/schema
4 | 
5 | extractors:
6 |   experience:
7 |     - [Teacher, teaching, tutor]
8 |     - [developer, software developer, software engineer, dev]
9 |     - trader
10 | 
11 |   platforms:
12 |     - Linux
13 |     - Windows
14 |     - [Mac, MacOS]
15 | 
16 |   database:
17 |     - SQL
18 |     - MySQL
19 |     - [Postgres, PostgreSQL]
20 |     - Oracle
21 | 
22 |   programming:
23 |     - [java, JavaEE]
24 |     - C
25 |     - C++
26 |     - C#
27 |     - .NET
28 |     - Matlab
29 |     - R
30 |     - python
31 |     - VHDL
32 |     - PHP
33 |     - JavaScript
34 | 
35 | 
36 |   machinelearning:
37 |     - [sklearn, scikit-learn, sk-learn]
38 |     - [tensorflow, tf, tensor-flow]
39 |     - keras
40 |     - h2o
41 | 
42 | 
43 |   universities:
44 |     - [TU Delft, TUDelft, Delft University of Technology]
45 |     - [University of Twente, UTwente]
46 |     - [University of Amsterdam, UvA]
47 |     - [Vrije Universiteit, VU University, VU Amsterdam, VU]
48 |     - MIT
49 | 
50 |   languages:
51 |     - Dutch
52 |     - English
53 |     - German
54 |     - Spanish
55 |     - [Chinese, Mandarin]
56 | 
57 |   hobbies:
58 |     - [swimming, swim]
59 |     - [soccer, football]
60 |     - painting
61 |     - reading
62 | 
63 |   open-source:
64 |     - github
65 |     - bitbucket
66 |     - gitlab
67 |     - sourceforge
68 |     - gitkraken
--------------------------------------------------------------------------------
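Editor's note: before a full run, an edited config can be sanity-checked from a REPL (a sketch; the path assumes you are at the repository root):

```python
import yaml

with open('confs/config.yaml.template') as f:
    conf = yaml.safe_load(f)

# Keys under 'extractors' become columns in resume_summary.csv
print(sorted(conf['extractors']))
# Each term is a plain string, or a [reported_name, alias, ...] list
print(conf['extractors']['universities'])
```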
/data/input/data_descriptions.csv:
--------------------------------------------------------------------------------
1 | filename,source,description
--------------------------------------------------------------------------------
/data/input/example_resumes/Brendan_Herger_Resume.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/Brendan_Herger_Resume.pdf
--------------------------------------------------------------------------------
/data/input/example_resumes/Layla_Martin_Resume.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/Layla_Martin_Resume.pdf
--------------------------------------------------------------------------------
/data/input/example_resumes/SGresume-1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/SGresume-1.pdf
--------------------------------------------------------------------------------
/data/input/example_resumes/john_smith.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/john_smith.pdf
--------------------------------------------------------------------------------
/data/input/example_resumes/resume_Meyer.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bjherger/ResumeParser/8c873551b425bea4d7c5f46960e3ff159bb05056/data/input/example_resumes/resume_Meyer.pdf -------------------------------------------------------------------------------- /data/output/Brendan_Herger_Resume.txt: -------------------------------------------------------------------------------- 1 | Brendan Herger 2 | 3 | Hergertarian.com | 13herger@gmail.com | + 1 (415) 582-7457 4 | 5 | 1209 Page Street No. 7 San Francisco, Ca 94117 6 | 7 | Selected Experience 8 | 9 | Data Scientist 10 | 11 | Data Innovation Lab @ Capital One (San Francisco, Ca.) 06/15 - Now 12 | Lead research team modeling for fraud problem space with class 13 | imbalance & adversaries; H2O.ai, GraphLab, SKLearn 14 | Deployed sub-millisecond real time model; Apache Apex 15 | Recommended distributed machine learning frameworks for general 16 | adoption at Capital One; H2O.ai, GraphLab, Apache Spark 17 | 18 | Various Technical Positions 19 | 20 | Lawfty (San Francisco, Ca.), RevUp Software (Redwood City, Ca.), 21 | Perkins + Will Architecture (San Francisco, Ca.) & Lawrence Berkeley 22 | National Lab. (Berkeley, Ca.) 23 | 24 | Front End Supervisor 25 | 26 | The Home Depot Pro (Colma, Ca.) 05/11 - 03/13 27 | Positions held: Cashier, Special Services Assoc., Tool Rental Assoc. 28 | Supervised and trained a staff of 10-30 team members 29 | 30 | Education 31 | MS, Analytics 32 | 33 | University of San Francisco, July 2015 34 | Relevant Coursework: Machine Learning, Adv. Machine Learning, Data 35 | Acquisition, Exploratory Data Analysis, Relational Databases, NoSQL 36 | Databases, Linear Regression Analysis, Time Series Analysis, Intro. SAS 37 | 38 | BS, Physics 39 | 40 | University of San Francisco, May 2014 41 | Minors: Computer Sciences, Mathematics 42 | Honors: University Scholar, President of ΠΜΕ Math Honors Society 43 | Relevant Coursework: Software Development, Data Structures & 44 | Algorithms, Differential Eqn.’s, Linear Algebra, Graduate Econometrics 45 | 46 | Personal Projects 47 | 48 | Identified genre of Billboard 49 | Hot 100 songs using ensemble 50 | algorithm built with Support 51 | Vector Machine, Neural Network, 52 | Stochastic Gradient Boost, and 53 | Random Forest algorithms; 54 | Python, Pandas, R and Scikit-Learn 55 | 56 | Implemented Naive Bayes text 57 | classification algorithm and trained 58 | this algorithm to correctly label 59 | 83% of movie reviews; Python, 60 | numpy and Pandas 61 | 62 | Created database containing lyrics 63 | of Billboard Hot 100 songs since 64 | 1958; R, Python, Pandas and 65 | Beautiful Soup 4 66 | 67 | Built a multi-threaded web scraper 68 | and search engine with web 69 | user interface; Java, MySQL and 70 | HTML5/CSS 71 | 72 | Built resume parsing package 73 | which extracts text, finds contact 74 | details, and checks for required 75 | keywords; Python and Pandas 76 | 77 | Online 78 | 79 | Hergertarian.com 80 | 81 | github.com/bjherger 82 | 83 | linkedin.com/in/bjherger 84 | 85 | hergertarian.wordpress.com/ 86 | 87 | -------------------------------------------------------------------------------- /data/output/Layla_Martin_Resume.txt: -------------------------------------------------------------------------------- 1 | Layla Martin 2 | 3 | 2038 McAllister St 4 | 5 | San Francisco, CA 94118 6 | 7 | layla.d.martin@gmail.com (520) 271-2492 8 | 9 | EDUCATION 10 | AND AWARDS 11 | 12 | Master of Science in Analytics 13 | University of San Francisco, San Francisco, CA 14 | 15 | Bachelor of Science in Mathematics, Cum Laude 16 | 
University of San Francisco 17 | Dean’s Honor Role (4 years) 18 | University Scholar, USF’s highest academic scholarship (4 years) 19 | 20 | Expected June 2015 21 | 22 | May 2014 23 | 24 | SELECTED 25 | PROJECTS 26 | 27 | Sentiment Analysis using Naive-Bayes 28 | 29 | Graduate 30 | • Classified movie reviews as positive or negative with 75% accuracy by implementing a 31 | 32 | Naive-Bayes algorithm in Python. 33 | 34 | NBA Play-By-Play Data Cleaning 35 | 36 | Summer research with USF Faculty 37 | • Cleaned play-by-play data for every NBA game from 2006 to 2012 using Pandas (Python). 38 | 39 | Analysis of USF Men’s Basketball Statistics using R 40 | 41 | Undergraduate 42 | • Predicted importance of players on the court using simple and multiple linear regression 43 | 44 | and created visualizations of player statistics using R. 45 | 46 | Explaining Artificial Neural Networks (ANN) 47 | 48 | Undergraduate 49 | • Prepared two student lectures teaching classmates basic implementation and theory behind 50 | • Preprocessed fingerprint images as matrices in Matlab and performed pattern classification 51 | 52 | supervised learning, backpropagation, and pattern recognition with ANN. 53 | 54 | using an ANN program. 55 | 56 | Mathematical Modeling Research 57 | 58 | Undergraduate 59 | 60 | • Performed image compression with SVD implemented in Matlab. 61 | • Ranked West Coast Conference Men’s Basketball teams using multiple centrality ranking 62 | 63 | algorithms in Matlab. 64 | 65 | LEADERSHIP 66 | 67 | President, Pi Mu Epsilon National Math Honor Society 68 | Captain, USF Women’s Soccer 69 | Teacher’s Assistant, USF Astronomy Observations 70 | 71 | 2014 72 | 2013-2014 73 | 2013-2014 74 | 75 | SKILLS 76 | 77 | Python, R, MySQL, PostgreSQL, Matlab, LaTeX, JMP 78 | 79 | INTER- 80 | COLLEGIATE 81 | ATHLETICS 82 | 83 | NCAA Division I Women’s Soccer: USF, 2010-2014 84 | 85 | • Athletic scholarship (full scholarship when combined with University Scholarship). 86 | • Committed 30 hours per week to training, meetings, travel, competition. 
87 | 88 | -------------------------------------------------------------------------------- /data/output/SGresume-1.txt: -------------------------------------------------------------------------------- 1 | Sébastien Genty 2 | 3 | 1209 Page St, Apt 7, San Francisco, CA 94117 4 | 5 | 713-301-5648 • sgenty@me.com 6 | Work Experience 7 | Project Director, Socratic Technologies 8 | 9 | • Collaborate with clients to plan, construct, and execute surveys for collecting 10 | 11 | insightful data on their business and marketing needs 12 | 13 | • Explore data using advanced mathematical and statistical methods, in order to 14 | 15 | identify insights and trends 16 | 17 | • Visualize data and extract insights for client facing deliverables 18 | • Create market simulators using highly customized applications that incorporate post 19 | 20 | survey analytical results to predict comparative sales figures and market share 21 | 22 | • Manage the data collection process by allocating resources, overseeing workflow, and 23 | 24 | verifying quality data collection 25 | 26 | • Formulated and built a new survey tool that increased statistical variance in 27 | 28 | respondent data, currently in use for multiple clients 29 | 30 | • Established a new model to test the reach, appeal and effectiveness of advertisements 31 | 32 | Research Assistant, Bucknell University 33 | 34 | • Independently designed and built an optical set-up to take pictures of Bose-Einstein 35 | 36 | • Created a user interface to control a high-end camera using Visual Basic 37 | • Coordinated with other research assistants working on different aspects of the same 38 | 39 | condensates 40 | 41 | experiment 42 | 43 | • Computer model and data analysis done using Matlab and ImageJ 44 | 45 | Research Intern, MD Anderson Cancer Center 46 | 47 | • Researched CT dosiometry using a relatively new technique involving x-ray sensitive 48 | 49 | film 50 | 51 | • Analyzed data using Matlab and Mathematica 52 | 53 | September 2012 - Present 54 | 55 | Fall 2011 - Spring 2012 56 | 57 | Summer 2011 58 | 59 | Relevant Projects 60 | New Product Development 61 | 62 | • Computed ideal product configuration by creating a market 63 | 64 | simulator using conjoint analysis 65 | 66 | Purchase Process Identification 67 | 68 | • Identified patterns in the purchase process of mortgages using a 69 | 70 | client provided database and formed insights about the timing and 71 | effectiveness of marketing materials 72 | 73 | Customer Feedback 74 | 75 | • Analyzed users’ opinions and knowledge of recently released 76 | 77 | version of leading media player and distribution platform 78 | 79 | Bay Area Bike Share Open Data Challenge 80 | 81 | • Uncovered habits and behavior of riders using an open source 82 | dataset and currently developing a visualization of bike station 83 | usage 84 | 85 | Education 86 | Bucknell University – Lewisburg, PA 87 | 88 | B.S. 
Physics 89 | Minor: International Relations 90 | International Leadership Scholarship 91 | 92 | Graduated May 2012 93 | 94 | Languages 95 | English 96 | French 97 | Spanish 98 | German 99 | 100 | Skills 101 | Computer 102 | R 103 | Python 104 | SPSS 105 | Mathematica 106 | Matlab 107 | Javascript 108 | C++ 109 | SQL 110 | Unix 111 | Adobe Creative Suite 112 | Microsoft Office 113 | 114 | Interests 115 | Film and digital photography, 116 | physics, research, reading, video 117 | games, solving problems and fixing 118 | things 119 | 120 | -------------------------------------------------------------------------------- /data/output/john_smith.txt: -------------------------------------------------------------------------------- 1 | John Smith 2 | 3 | 4 | 5 | 6 | 7 | 2222 McCoy Road Ÿ Columbus, Ohio 44444 Ÿ 614-555-5555 Ÿ sresume@kent.edu Ÿ www.linkedin.com/in/name 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | OBJECTIVE 30 | Seeking a marketing internship with ABC Company in Spring 2017 to utilize my organizational and analytical skills. 31 | 32 | EDUCATION 33 | Bachelor of Business Administration 34 | Kent State University 35 | Major: Business Management 36 | 37 | SIGNIFICANT COURSEWORK 38 | Business Finance, Principles of Management, Legal Environment of Business 39 | 40 | COMPUTER SKILLS 41 | Microsoft Office: Word, PowerPoint, and basic Excel 42 | Applications: SQL (Structure Query Language) 43 | Programs: Adobe Photoshop, Movie Maker 44 | Social Media Administration: LinkedIn, Twitter, Instagram, Facebook 45 | 46 | WORK EXPERIENCE 47 | Kent State University, Kent, Ohio 48 | Resident Advisor 49 | • Collaborate with 10 building staff and campus administrators on a weekly basis to organize a pancake breakfast 50 | 51 | Expected Graduation: May 2018 52 | Kent, Ohio 53 | GPA: 3.6 54 | 55 | August 2015 - Present 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | fundraiser for 200 attendees, raising $500 for community charity. 74 | 75 | • Utilize effective time management skills by creating and implementing 6 programs and activities each semester for 30 76 | 77 | residents while balancing full-time course load and extracurricular commitments. 78 | 79 | • Demonstrate strong communication skills through interacting with 150 residents and campus administrators on a 80 | 81 | weekly basis. 82 | 83 | • Facilitate problem-solving and conflict resolution amongst residents by serving as positive role model, mediator, and 84 | 85 | leader through one-on-one and small group interventions. 86 | 87 | 88 | Panini’s Bar and Grill, Cleveland, Ohio 89 | Server 90 | • Worked independently in a fast-paced environment while developing customer service skills with each guest to ensure 91 | 92 | August 2014 - May 2015 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | their needs were consistently met. 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | July 2014 - October 2015 111 | 112 | • Optimized persuasive skills to highlight nightly specials and engage each table’s attention. 113 | • Fostered cooperation within a staff of 7 to maintain a pleasant atmosphere. 114 | 115 | LEADERSHIP EXPERIENCE 116 | Harding Middle School, Columbus, Ohio 117 | Soccer Assistant Coach 118 | • Executed strong communication skills by guiding and leading 20 seventh and eighth grade girls on team. 119 | • Served as positive role model by teaching young athletes about teamwork, respect, and conflict-resolution. 
120 | • Planned and led weekly meetings with up to 15 parents, field managers and staff members. 121 | • Led team to its first regional championship in October 2015; recognized at school banquet with leadership award. 122 | 123 | CAMPUS INVOLVEMENT 124 | Member, Delta Sigma Pi 125 | 126 | Member, Collegiate Business Association 127 | 128 | VOLUNTEER EXPERIENCE 129 | Relay for Life, Kent State University 130 | Donation Processer, Greater Cleveland Food Bank 131 | 132 | HONORS 133 | Summit County Alumni Association Scholarship 134 | Dean’s List 135 | 136 | 137 | March 2015 138 | December 2014 - February 2015 139 | 140 | Spring 2014 - Present 141 | Spring 2013 - Spring 2015 142 | 143 | August 2016 - Present 144 | August 2015 - Present 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | -------------------------------------------------------------------------------- /data/output/resume_Meyer.txt: -------------------------------------------------------------------------------- 1 | MONICA MEYER 2 | 3 | (415) · 497 · 7282 (cid:5) monica.meyer@comcast.net 4 | 5 | EDUCATION 6 | 7 | Master of Science in Analytics 8 | University of San Francisco 9 | 10 | Bachelor of Arts in Mathematics 11 | University of California, Santa Cruz 12 | Overall GPA: 3.5 13 | 14 | COURSE PROJECTS 15 | 16 | Expected June 2015 17 | 18 | July 2012 19 | 20 | Business Location Recommendation 21 | Sep 2014 22 | · Queried Yelp and Zillow APIs and performed exploratory data analysis necessary for project to help 23 | businesses decide where to open their next establishment. 24 | 25 | Text Classification 26 | · Classified movie reviews as positive or negative using Naive Bayes in Python. 27 | · Found most important words in Reuter’s articles through term frequency, inverse document frequency 28 | on XML files in Python. 29 | 30 | July 2014 31 | 32 | WORK EXPERIENCE 33 | 34 | Bank of America 35 | Dec 2013 - June 2014 36 | Sales and Service Specialist 37 | Mill Valley, CA 38 | · Promoted due to proven ability to resolve complex service issues and process transactions accurately 39 | and efficiently to guarantee customer satisfaction and build customer confidence and trust. Responsible 40 | for establishing, retaining and deepening relationships with customers to achieve team sales goals as 41 | well as providing proactive sales activities of basic products while referring more complex requests such 42 | as mortgages and investment products. 43 | 44 | Bank of America 45 | Aug 2012 - Dec 2013 46 | Teller 47 | Mill Valley, CA 48 | · Gained proficiency in retail banking operations, including computing figures, processing transactions 49 | with speed and accuracy and building customer loyalty through exceptional customer service. Learned 50 | to control large amounts of cash flow, work within established policies, procedures and guidelines and 51 | acquired the ability to advise customers on products and services the bank has to offer. Earned a 52 | promotion to the position of Sales and Service Specialist. 
53 | 54 | SKILLS 55 | 56 | Programming 57 | Protocols & APIs 58 | Databases 59 | 60 | Python, R 61 | XML, JSON, REST 62 | MySQL, PostgreSQL 63 | 64 | -------------------------------------------------------------------------------- /data/output/resume_summary.csv: -------------------------------------------------------------------------------- 1 | index,file_path,extension,text,candidate_name,email,phone,experience,platforms,database,programming,machinelearning,universities,languages,hobbies,open-source 2 | 1,../data/input/example_resumes/SGresume-1.pdf,.pdf,"Sébastien Genty 3 | 4 | 1209 Page St, Apt 7, San Francisco, CA 94117 5 | 6 | 713-301-5648 • sgenty@me.com 7 | Work Experience 8 | Project Director, Socratic Technologies 9 | 10 | • Collaborate with clients to plan, construct, and execute surveys for collecting 11 | 12 | insightful data on their business and marketing needs 13 | 14 | • Explore data using advanced mathematical and statistical methods, in order to 15 | 16 | identify insights and trends 17 | 18 | • Visualize data and extract insights for client facing deliverables 19 | • Create market simulators using highly customized applications that incorporate post 20 | 21 | survey analytical results to predict comparative sales figures and market share 22 | 23 | • Manage the data collection process by allocating resources, overseeing workflow, and 24 | 25 | verifying quality data collection 26 | 27 | • Formulated and built a new survey tool that increased statistical variance in 28 | 29 | respondent data, currently in use for multiple clients 30 | 31 | • Established a new model to test the reach, appeal and effectiveness of advertisements 32 | 33 | Research Assistant, Bucknell University 34 | 35 | • Independently designed and built an optical set-up to take pictures of Bose-Einstein 36 | 37 | • Created a user interface to control a high-end camera using Visual Basic 38 | • Coordinated with other research assistants working on different aspects of the same 39 | 40 | condensates 41 | 42 | experiment 43 | 44 | • Computer model and data analysis done using Matlab and ImageJ 45 | 46 | Research Intern, MD Anderson Cancer Center 47 | 48 | • Researched CT dosiometry using a relatively new technique involving x-ray sensitive 49 | 50 | film 51 | 52 | • Analyzed data using Matlab and Mathematica 53 | 54 | September 2012 - Present 55 | 56 | Fall 2011 - Spring 2012 57 | 58 | Summer 2011 59 | 60 | Relevant Projects 61 | New Product Development 62 | 63 | • Computed ideal product configuration by creating a market 64 | 65 | simulator using conjoint analysis 66 | 67 | Purchase Process Identification 68 | 69 | • Identified patterns in the purchase process of mortgages using a 70 | 71 | client provided database and formed insights about the timing and 72 | effectiveness of marketing materials 73 | 74 | Customer Feedback 75 | 76 | • Analyzed users’ opinions and knowledge of recently released 77 | 78 | version of leading media player and distribution platform 79 | 80 | Bay Area Bike Share Open Data Challenge 81 | 82 | • Uncovered habits and behavior of riders using an open source 83 | dataset and currently developing a visualization of bike station 84 | usage 85 | 86 | Education 87 | Bucknell University – Lewisburg, PA 88 | 89 | B.S. 
Physics 90 | Minor: International Relations 91 | International Leadership Scholarship 92 | 93 | Graduated May 2012 94 | 95 | Languages 96 | English 97 | French 98 | Spanish 99 | German 100 | 101 | Skills 102 | Computer 103 | R 104 | Python 105 | SPSS 106 | Mathematica 107 | Matlab 108 | Javascript 109 | C++ 110 | SQL 111 | Unix 112 | Adobe Creative Suite 113 | Microsoft Office 114 | 115 | Interests 116 | Film and digital photography, 117 | physics, research, reading, video 118 | games, solving problems and fixing 119 | things 120 | 121 | ",Sébastien Genty,sgenty@me.com,"('713', '301', '5648')",{'developer'},set(),{'SQL'},"{'C', 'Matlab', 'java', 'python', 'JavaScript', 'R'}",{'tensorflow'},set(),"{'English', 'German', 'Spanish'}",{'reading'},set() 122 | 2,../data/input/example_resumes/john_smith.pdf,.pdf,"John Smith 123 | 124 | 125 | 126 | 127 | 128 | 2222 McCoy Road Ÿ Columbus, Ohio 44444 Ÿ 614-555-5555 Ÿ sresume@kent.edu Ÿ www.linkedin.com/in/name 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | OBJECTIVE 151 | Seeking a marketing internship with ABC Company in Spring 2017 to utilize my organizational and analytical skills. 152 | 153 | EDUCATION 154 | Bachelor of Business Administration 155 | Kent State University 156 | Major: Business Management 157 | 158 | SIGNIFICANT COURSEWORK 159 | Business Finance, Principles of Management, Legal Environment of Business 160 | 161 | COMPUTER SKILLS 162 | Microsoft Office: Word, PowerPoint, and basic Excel 163 | Applications: SQL (Structure Query Language) 164 | Programs: Adobe Photoshop, Movie Maker 165 | Social Media Administration: LinkedIn, Twitter, Instagram, Facebook 166 | 167 | WORK EXPERIENCE 168 | Kent State University, Kent, Ohio 169 | Resident Advisor 170 | • Collaborate with 10 building staff and campus administrators on a weekly basis to organize a pancake breakfast 171 | 172 | Expected Graduation: May 2018 173 | Kent, Ohio 174 | GPA: 3.6 175 | 176 | August 2015 - Present 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | fundraiser for 200 attendees, raising $500 for community charity. 195 | 196 | • Utilize effective time management skills by creating and implementing 6 programs and activities each semester for 30 197 | 198 | residents while balancing full-time course load and extracurricular commitments. 199 | 200 | • Demonstrate strong communication skills through interacting with 150 residents and campus administrators on a 201 | 202 | weekly basis. 203 | 204 | • Facilitate problem-solving and conflict resolution amongst residents by serving as positive role model, mediator, and 205 | 206 | leader through one-on-one and small group interventions. 207 | 208 | 209 | Panini’s Bar and Grill, Cleveland, Ohio 210 | Server 211 | • Worked independently in a fast-paced environment while developing customer service skills with each guest to ensure 212 | 213 | August 2014 - May 2015 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | their needs were consistently met. 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | July 2014 - October 2015 232 | 233 | • Optimized persuasive skills to highlight nightly specials and engage each table’s attention. 234 | • Fostered cooperation within a staff of 7 to maintain a pleasant atmosphere. 
235 | 236 | LEADERSHIP EXPERIENCE 237 | Harding Middle School, Columbus, Ohio 238 | Soccer Assistant Coach 239 | • Executed strong communication skills by guiding and leading 20 seventh and eighth grade girls on team. 240 | • Served as positive role model by teaching young athletes about teamwork, respect, and conflict-resolution. 241 | • Planned and led weekly meetings with up to 15 parents, field managers and staff members. 242 | • Led team to its first regional championship in October 2015; recognized at school banquet with leadership award. 243 | 244 | CAMPUS INVOLVEMENT 245 | Member, Delta Sigma Pi 246 | 247 | Member, Collegiate Business Association 248 | 249 | VOLUNTEER EXPERIENCE 250 | Relay for Life, Kent State University 251 | Donation Processer, Greater Cleveland Food Bank 252 | 253 | HONORS 254 | Summit County Alumni Association Scholarship 255 | Dean’s List 256 | 257 | 258 | March 2015 259 | December 2014 - February 2015 260 | 261 | Spring 2014 - Present 262 | Spring 2013 - Spring 2015 263 | 264 | August 2016 - Present 265 | August 2015 - Present 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | ",John Smith,sresume@kent.edu,"('614', '555', '5555')","{'Teacher', 'developer'}",set(),{'SQL'},"{'C', 'R'}",set(),{'MIT'},set(),{'soccer'},set() 312 | 3,../data/input/example_resumes/resume_Meyer.pdf,.pdf,"MONICA MEYER 313 | 314 | (415) · 497 · 7282 (cid:5) monica.meyer@comcast.net 315 | 316 | EDUCATION 317 | 318 | Master of Science in Analytics 319 | University of San Francisco 320 | 321 | Bachelor of Arts in Mathematics 322 | University of California, Santa Cruz 323 | Overall GPA: 3.5 324 | 325 | COURSE PROJECTS 326 | 327 | Expected June 2015 328 | 329 | July 2012 330 | 331 | Business Location Recommendation 332 | Sep 2014 333 | · Queried Yelp and Zillow APIs and performed exploratory data analysis necessary for project to help 334 | businesses decide where to open their next establishment. 335 | 336 | Text Classification 337 | · Classified movie reviews as positive or negative using Naive Bayes in Python. 338 | · Found most important words in Reuter’s articles through term frequency, inverse document frequency 339 | on XML files in Python. 340 | 341 | July 2014 342 | 343 | WORK EXPERIENCE 344 | 345 | Bank of America 346 | Dec 2013 - June 2014 347 | Sales and Service Specialist 348 | Mill Valley, CA 349 | · Promoted due to proven ability to resolve complex service issues and process transactions accurately 350 | and efficiently to guarantee customer satisfaction and build customer confidence and trust. Responsible 351 | for establishing, retaining and deepening relationships with customers to achieve team sales goals as 352 | well as providing proactive sales activities of basic products while referring more complex requests such 353 | as mortgages and investment products. 354 | 355 | Bank of America 356 | Aug 2012 - Dec 2013 357 | Teller 358 | Mill Valley, CA 359 | · Gained proficiency in retail banking operations, including computing figures, processing transactions 360 | with speed and accuracy and building customer loyalty through exceptional customer service. 
Learned 361 | to control large amounts of cash flow, work within established policies, procedures and guidelines and 362 | acquired the ability to advise customers on products and services the bank has to offer. Earned a 363 | promotion to the position of Sales and Service Specialist. 364 | 365 | SKILLS 366 | 367 | Programming 368 | Protocols & APIs 369 | Databases 370 | 371 | Python, R 372 | XML, JSON, REST 373 | MySQL, PostgreSQL 374 | 375 | ",MONICA MEYER,monica.meyer@comcast.net,,set(),set(),"{'MySQL', 'SQL', 'Postgress'}","{'C', 'python', 'R', '.NET'}",set(),set(),set(),set(),set() 376 | 4,../data/input/example_resumes/Layla_Martin_Resume.pdf,.pdf,"Layla Martin 377 | 378 | 2038 McAllister St 379 | 380 | San Francisco, CA 94118 381 | 382 | layla.d.martin@gmail.com (520) 271-2492 383 | 384 | EDUCATION 385 | AND AWARDS 386 | 387 | Master of Science in Analytics 388 | University of San Francisco, San Francisco, CA 389 | 390 | Bachelor of Science in Mathematics, Cum Laude 391 | University of San Francisco 392 | Dean’s Honor Role (4 years) 393 | University Scholar, USF’s highest academic scholarship (4 years) 394 | 395 | Expected June 2015 396 | 397 | May 2014 398 | 399 | SELECTED 400 | PROJECTS 401 | 402 | Sentiment Analysis using Naive-Bayes 403 | 404 | Graduate 405 | • Classified movie reviews as positive or negative with 75% accuracy by implementing a 406 | 407 | Naive-Bayes algorithm in Python. 408 | 409 | NBA Play-By-Play Data Cleaning 410 | 411 | Summer research with USF Faculty 412 | • Cleaned play-by-play data for every NBA game from 2006 to 2012 using Pandas (Python). 413 | 414 | Analysis of USF Men’s Basketball Statistics using R 415 | 416 | Undergraduate 417 | • Predicted importance of players on the court using simple and multiple linear regression 418 | 419 | and created visualizations of player statistics using R. 420 | 421 | Explaining Artificial Neural Networks (ANN) 422 | 423 | Undergraduate 424 | • Prepared two student lectures teaching classmates basic implementation and theory behind 425 | • Preprocessed fingerprint images as matrices in Matlab and performed pattern classification 426 | 427 | supervised learning, backpropagation, and pattern recognition with ANN. 428 | 429 | using an ANN program. 430 | 431 | Mathematical Modeling Research 432 | 433 | Undergraduate 434 | 435 | • Performed image compression with SVD implemented in Matlab. 436 | • Ranked West Coast Conference Men’s Basketball teams using multiple centrality ranking 437 | 438 | algorithms in Matlab. 439 | 440 | LEADERSHIP 441 | 442 | President, Pi Mu Epsilon National Math Honor Society 443 | Captain, USF Women’s Soccer 444 | Teacher’s Assistant, USF Astronomy Observations 445 | 446 | 2014 447 | 2013-2014 448 | 2013-2014 449 | 450 | SKILLS 451 | 452 | Python, R, MySQL, PostgreSQL, Matlab, LaTeX, JMP 453 | 454 | INTER- 455 | COLLEGIATE 456 | ATHLETICS 457 | 458 | NCAA Division I Women’s Soccer: USF, 2010-2014 459 | 460 | • Athletic scholarship (full scholarship when combined with University Scholarship). 461 | • Committed 30 hours per week to training, meetings, travel, competition. 462 | 463 | ",Layla Martin,layla.d.martin@gmail.com,"('520', '271', '2492')",{'Teacher'},set(),"{'MySQL', 'SQL', 'Postgress'}","{'C', 'Matlab', '.NET', 'python', 'R'}",set(),{'MIT'},set(),{'soccer'},set() 464 | 5,../data/input/example_resumes/Brendan_Herger_Resume.pdf,.pdf,"Brendan Herger 465 | 466 | Hergertarian.com | 13herger@gmail.com | + 1 (415) 582-7457 467 | 468 | 1209 Page Street No. 
7 San Francisco, Ca 94117 469 | 470 | Selected Experience 471 | 472 | Data Scientist 473 | 474 | Data Innovation Lab @ Capital One (San Francisco, Ca.) 06/15 - Now 475 | Lead research team modeling for fraud problem space with class 476 | imbalance & adversaries; H2O.ai, GraphLab, SKLearn 477 | Deployed sub-millisecond real time model; Apache Apex 478 | Recommended distributed machine learning frameworks for general 479 | adoption at Capital One; H2O.ai, GraphLab, Apache Spark 480 | 481 | Various Technical Positions 482 | 483 | Lawfty (San Francisco, Ca.), RevUp Software (Redwood City, Ca.), 484 | Perkins + Will Architecture (San Francisco, Ca.) & Lawrence Berkeley 485 | National Lab. (Berkeley, Ca.) 486 | 487 | Front End Supervisor 488 | 489 | The Home Depot Pro (Colma, Ca.) 05/11 - 03/13 490 | Positions held: Cashier, Special Services Assoc., Tool Rental Assoc. 491 | Supervised and trained a staff of 10-30 team members 492 | 493 | Education 494 | MS, Analytics 495 | 496 | University of San Francisco, July 2015 497 | Relevant Coursework: Machine Learning, Adv. Machine Learning, Data 498 | Acquisition, Exploratory Data Analysis, Relational Databases, NoSQL 499 | Databases, Linear Regression Analysis, Time Series Analysis, Intro. SAS 500 | 501 | BS, Physics 502 | 503 | University of San Francisco, May 2014 504 | Minors: Computer Sciences, Mathematics 505 | Honors: University Scholar, President of ΠΜΕ Math Honors Society 506 | Relevant Coursework: Software Development, Data Structures & 507 | Algorithms, Differential Eqn.’s, Linear Algebra, Graduate Econometrics 508 | 509 | Personal Projects 510 | 511 | Identified genre of Billboard 512 | Hot 100 songs using ensemble 513 | algorithm built with Support 514 | Vector Machine, Neural Network, 515 | Stochastic Gradient Boost, and 516 | Random Forest algorithms; 517 | Python, Pandas, R and Scikit-Learn 518 | 519 | Implemented Naive Bayes text 520 | classification algorithm and trained 521 | this algorithm to correctly label 522 | 83% of movie reviews; Python, 523 | numpy and Pandas 524 | 525 | Created database containing lyrics 526 | of Billboard Hot 100 songs since 527 | 1958; R, Python, Pandas and 528 | Beautiful Soup 4 529 | 530 | Built a multi-threaded web scraper 531 | and search engine with web 532 | user interface; Java, MySQL and 533 | HTML5/CSS 534 | 535 | Built resume parsing package 536 | which extracts text, finds contact 537 | details, and checks for required 538 | keywords; Python and Pandas 539 | 540 | Online 541 | 542 | Hergertarian.com 543 | 544 | github.com/bjherger 545 | 546 | linkedin.com/in/bjherger 547 | 548 | hergertarian.wordpress.com/ 549 | 550 | ",Brendan Herger,13herger@gmail.com,"('415', '582', '7457')",{'developer'},{'Mac'},"{'MySQL', 'SQL'}","{'C', 'java', '.NET', 'python', 'R'}",{'sklearn'},{'Vrije Univesriteit'},set(),set(),{'github'} 551 | -------------------------------------------------------------------------------- /data/schema/extract.csv: -------------------------------------------------------------------------------- 1 | variable,type,data_set 2 | file_path,object,observations 3 | extension,object,observations 4 | text,object,observations 5 | -------------------------------------------------------------------------------- /data/schema/transform.csv: -------------------------------------------------------------------------------- 1 | variable,type,data_set 2 | file_path,object,observations 3 | extension,object,observations 4 | text,object,observations 5 | candidate_name,object,observations 6 | 
email,object,observations
7 | phone,object,observations
8 | experience,object,observations
9 | platforms,object,observations
10 | database,object,observations
11 | programming,object,observations
12 | machinelearning,object,observations
13 | universities,object,observations
14 | languages,object,observations
15 | hobbies,object,observations
16 | open-source,object,observations
17 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | gensim==3.7.1
2 | pandas==0.24.2
3 | pdfminer.six==20181108
4 | spacy==2.1.3
5 | PyYAML==5.1
6 | 
--------------------------------------------------------------------------------
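Editor's note: a quick check that the pinned stack above imports cleanly and the language model is present (a sketch, assuming `pip install -r requirements.txt` and `python -m spacy download en` have been run):

```python
import gensim
import pandas
import pdfminer.high_level
import spacy
import yaml

print(spacy.__version__)  # expect 2.1.x, per requirements.txt
nlp = spacy.load('en')    # raises if the language model was not downloaded
```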