├── format.sh
└── readme.md


/format.sh:
--------------------------------------------------------------------------------
#!/bin/bash

JOB_TITLES_FILE=job-titles.txt
JOB_TITLES_JSON_FILE=job-titles.json

echo "Formatting: $JOB_TITLES_FILE"
# Lowercase, replace "-" with a space, drop ",", trim surrounding whitespace,
# then sort and deduplicate. uniq alone only drops adjacent duplicates, so the
# list has to be sorted first; sort -u does both in one step.
JOB_TITLES=$( tr A-Z a-z < "$JOB_TITLES_FILE" | sed -e "s/-/ /g" -e "s/,//g" -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | sort -u )
printf '%b\n' "$JOB_TITLES" > "$JOB_TITLES_FILE"

echo "Generating: $JOB_TITLES_JSON_FILE"
# Quote each title and append a comma; the final sed strips the comma from the last line.
JOB_TITLES_JSON=$(printf '%b\n' "$JOB_TITLES" | awk -F\; '{ print "\""$1"\"," }' | sed '$ s/.$//')
JOB_TITLES_JSON="{\"job-titles\": [\n${JOB_TITLES_JSON}\n] }"
printf '%b\n' "$JOB_TITLES_JSON" > "$JOB_TITLES_JSON_FILE"

echo "Updating readme item counter"
README=readme.md
# tr strips the padding that some wc implementations put around the count.
JOB_TITLES_LENGTH=$(echo "$JOB_TITLES" | wc -l | tr -d '[:space:]')
# [0-9][0-9]* instead of [0-9]\+, since \+ is a GNU sed extension.
README_CONTENT=$(sed -e "s|/job_titles-[0-9][0-9]*-|/job_titles-$JOB_TITLES_LENGTH-|g" "$README")
printf '%b\n' "$README_CONTENT" > "$README"

--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
# job-titles

> Normalized dataset of 70k job titles

[![](https://img.shields.io/badge/job_titles-73380-brightgreen.svg?style=flat-square)](job-titles.txt)

## Data Normalizations

The data is normalized in the following ways:

- lowercase
- `-` replaced with a space
- `,` removed
- leading and trailing whitespace trimmed

## Caveats

- Duplicates such as `a and p mechanic` and `a&p mechanic`
- Non-English titles such as `ab initio etl developer`

## See also

- [animal names](https://github.com/jneidel/animal-names)
- [nationalities](https://github.com/jneidel/nationalities)

## Contribute

Feel free to open a pull request that fixes the caveats listed above or adds any other enhancement.
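
As a possible starting point for the duplicate caveat, the sketch below lists titles that differ only by `&` versus `and`. The function name `find_ampersand_duplicates` is hypothetical and not part of this repository; it assumes a file with one title per line, such as `job-titles.txt`.

```bash
#!/bin/bash
# Hypothetical helper (not part of this repo): report pairs of titles that
# become identical once "&" is spelled out as "and".
# Usage: find_ampersand_duplicates job-titles.txt
find_ampersand_duplicates() {
  awk '
    {
      key = $0
      gsub(/ *& */, " and ", key)        # "a&p mechanic" -> "a and p mechanic"
      if ((key in seen) && seen[key] != $0) {
        print seen[key] " <-> " $0       # two spellings share one key
      } else if (!(key in seen)) {
        seen[key] = $0                   # remember the first spelling seen
      }
    }
  ' "$1"
}
```

Which spelling to keep is still a manual decision; the helper only surfaces candidates for review.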

Only edit `job-titles.txt`. After doing so run `./format.sh`.

## Attribution

This dataset is a collection of the following sources:

- [johnpcarty/Thesaurus-of-Job-Titles](https://github.com/johnpcarty/Thesaurus-of-Job-Titles/blob/master/synonym_job_titles_for_search.txt) (GPLv3)
- [onurdegerli/job-titles](https://github.com/onurdegerli/job-titles/blob/master/job_titles.sql)
- [fluquid/find_job_titles](https://github.com/fluquid/find_job_titles/blob/master/src/find_job_titles/data/titles_combined.txt.gz) (MIT)

--------------------------------------------------------------------------------