├── format.sh
└── readme.md


/format.sh:
--------------------------------------------------------------------------------
 1 | #! /bin/bash
 2 | 
 3 | JOB_TITLES_FILE=job-titles.txt
 4 | JOB_TITLES_JSON_FILE=job-titles.json
 5 | 
 6 | echo "Formatting: $JOB_TITLES_FILE";
 7 | JOB_TITLES=$( tr A-Z a-z < "$JOB_TITLES_FILE" | sed -e "s/-/ /g" -e "s/,//g" -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | sort -u )
 8 | printf '%s\n' "$JOB_TITLES" > "$JOB_TITLES_FILE"
 9 | 
10 | echo "Generating: $JOB_TITLES_JSON_FILE"
11 | JOB_TITLES_JSON=$(printf '%s\n' "$JOB_TITLES" | awk -F\; '{ print "\""$1"\"," }' | sed '$ s/,$//')
12 | JOB_TITLES_JSON="{\"job-titles\": [\n${JOB_TITLES_JSON}\n] }"
13 | printf '%b\n' "$JOB_TITLES_JSON" > $JOB_TITLES_JSON_FILE
14 | 
15 | echo "Updating readme item counter"
16 | README=readme.md
17 | JOB_TITLES_LENGTH=$(printf '%s\n' "$JOB_TITLES" | wc -l | tr -d '[:space:]')
18 | README_CONTENT=$(sed -e "s|/job_titles-[0-9][0-9]*-|/job_titles-$JOB_TITLES_LENGTH-|g" "$README")
19 | printf '%s\n' "$README_CONTENT" > "$README"
20 | 


--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
 1 | # job-titles
 2 | 
 3 | > Normalized dataset of roughly 73k job titles
 4 | 
 5 | [![Number of job titles](https://img.shields.io/badge/job_titles-73380-brightgreen.svg?style=flat-square)](job-titles.txt)
 6 | 
 7 | ## Data Normalizations
 8 | 
 9 | The data is normalized in the following ways:
10 | 
11 | - converted to lowercase
12 | - `-` replaced with a space
13 | - `,` removed
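
The normalizations above can be sketched as a small shell filter. This is a minimal illustration modeled on `format.sh`, not a replacement for it: `normalize_titles` is a hypothetical helper name, and the sketch uses `sort -u` for deduplication because `uniq` on its own only collapses adjacent duplicate lines.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads raw titles on stdin, writes normalized titles to stdout.
normalize_titles() {
  tr 'A-Z' 'a-z' |                  # lowercase
    sed -e 's/-/ /g' \
        -e 's/,//g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' |    # "-" to space, drop ",", trim whitespace
    LC_ALL=C sort -u                # sort and drop exact duplicates
}

# Example input: two spellings of the same title plus an untrimmed one.
printf '%s\n' 'Software-Engineer' '  Manager, Sales ' 'software engineer' | normalize_titles
```

Whitespace trimming, sorting and deduplication are not in the list above, but `format.sh` applies them as well, so the sketch includes them.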
14 | 
15 | ## Caveats
16 | 
17 | - Near-duplicates that differ only in spelling, such as `a and p mechanic` and `a&p mechanic`
18 | - Non-English titles such as `ab initio etl developer`
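
One possible way to surface the first kind of caveat is to compare titles after rewriting `&` as ` and `. The sketch below is only an illustration under that assumption: `list_ampersand_duplicates` is a hypothetical helper that is not part of this repository, and it only catches pairs that differ in exactly this spelling.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads titles on stdin and prints pairs that become
# identical once "&" (with optional surrounding spaces) is spelled " and ".
list_ampersand_duplicates() {
  awk '
    {
      key = $0
      gsub(/ *& */, " and ", key)      # "a&p mechanic" -> "a and p mechanic"
      if (key in first) {
        print first[key] " == " $0     # report the earlier title and this one
      } else {
        first[key] = $0                # remember the first title seen for this key
      }
    }
  '
}

printf '%s\n' 'a and p mechanic' 'a&p mechanic' 'accountant' | list_ampersand_duplicates
```

Run against `job-titles.txt`, the output would be a list of candidate pairs to review by hand before removing either entry.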
19 | 
20 | ## See also
21 | 
22 | - [animal names](https://github.com/jneidel/animal-names)
23 | - [nationalities](https://github.com/jneidel/nationalities)
24 | 
25 | ## Contribute
26 | 
27 | Feel free to open a pull request that fixes the caveats listed above or makes any other enhancement.
28 | 
29 | Only edit `job-titles.txt`, then run `./format.sh` to re-normalize the list and regenerate `job-titles.json` and the readme counter.
30 | 
31 | ## Attribution
32 | 
33 | This dataset is compiled from the following sources:
34 | 
35 | - [johnpcarty/Thesaurus-of-Job-Titles](https://github.com/johnpcarty/Thesaurus-of-Job-Titles/blob/master/synonym_job_titles_for_search.txt) (GPLv3)
36 | - [onurdegerli/job-titles](https://github.com/onurdegerli/job-titles/blob/master/job_titles.sql)
37 | - [fluquid/find_job_titles](https://github.com/fluquid/find_job_titles/blob/master/src/find_job_titles/data/titles_combined.txt.gz) (MIT)
38 | 


--------------------------------------------------------------------------------