├── .gitignore ├── LICENSE ├── README.md ├── data.tar.gz ├── docs └── appendix.pdf ├── images ├── balanced_bug_logo.png ├── flow.png ├── full_bug_logo.png ├── gold_bug_logo.png └── spike_logo.png ├── predictions └── README.md ├── requirements.txt ├── src ├── converters │ └── convert_to_conll.py └── evaluations │ ├── Analyze.py │ ├── dataset_stats.py │ ├── evaluate_coref.py │ └── inc_occ_gender.csv └── visualizations └── delta_s_by_dist.png /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | # Local 132 | predictions/ 133 | data/ 134 | *.conll -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 SLAB-NLP 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | **Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* 4 | 5 | - [BUG Dataset ](#bug-dataset-img-srchttpsuser-imagesgithubusercontentcom6629995132018898-038ec717-264d-4da3-a0b8-651b851f6b64png-width30-img-srchttpsuser-imagesgithubusercontentcom6629995132017358-dea44bba-1487-464d-a9e1-4d534204570cpng-width30-img-srchttpsuser-imagesgithubusercontentcom6629995132018731-6ec8c4e3-12ac-474c-ae6c-03c1311777f4png-width30-) 6 | - [Setup](#setup) 7 | - [Dataset Partitions](#dataset-partitions) 8 | - [ Full BUG](#img-srchttpsuser-imagesgithubusercontentcom6629995132018898-038ec717-264d-4da3-a0b8-651b851f6b64png-width20--full-bug) 9 | - [ Gold BUG](#img-srchttpsuser-imagesgithubusercontentcom6629995132017358-dea44bba-1487-464d-a9e1-4d534204570cpng-width20--gold-bug) 10 | - [ Balanced BUG](#img-srchttpsuser-imagesgithubusercontentcom6629995132018731-6ec8c4e3-12ac-474c-ae6c-03c1311777f4png-width20--balanced-bug) 11 | - [Dataset Format](#dataset-format) 12 | - [Evaluations](#evaluations) 13 | - [Coreference](#coreference) 14 | - [Conversions](#conversions) 15 | - [CoNLL](#conll) 16 | - [Citing](#citing) 17 | 18 | 19 | 20 | # BUG Dataset 21 | A Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation (Levy et al., Findings of EMNLP 2021). 22 | 23 | BUG was collected semi-automatically from different real-world corpora and is designed to be challenging in terms of societal gender role assignments for machine translation and coreference resolution. 24 | 25 | ## Setup 26 | 27 | 1. Unzip `data.tar.gz`; this should create a `data` folder with the following files: 28 | * balanced_BUG.csv 29 | * full_BUG.csv 30 | * gold_BUG.csv 31 | 2.
Set up a Python 3.x environment and install the requirements: 32 | ``` 33 | pip install -r requirements.txt 34 | ``` 35 | 36 | 37 | ## Dataset Partitions 38 | 39 | **_NOTE:_** 40 | These partitions vary slightly from those reported in the paper due to improvements and bug fixes made after submission. 41 | For reproducibility's sake, you can access the dataset from the submission [here](https://drive.google.com/file/d/1b4Q-X1vVMoR-tIVd-XCigamnvpy0vi3F/view?usp=sharing). 42 | 43 | ### Full BUG 44 | 105,687 sentences with a human entity, identified by their profession and a gendered pronoun. 45 | 46 | ### Gold BUG 47 | 48 | 1,717 sentences, the gold-quality human-validated samples. 49 | 50 | ### Balanced BUG 51 | 25,504 sentences, randomly sampled from Full BUG to ensure balance between male and female entities and between stereotypical and non-stereotypical gender role assignments. 52 | 53 | 54 | ## Dataset Format 55 | Each file in the data folder is a csv file adhering to the following format: 56 | 57 | 58 | Column | Header | Description 59 | :-----:|------------------------|-------------------------------------------- 60 | 1 | sentence_text | Text of sentences with a human entity, identified by their profession and a gendered pronoun 61 | 2 | tokens | List of tokens (using spacy tokenizer) 62 | 3 | profession | The entity in the sentence 63 | 4 | g | The pronoun in the sentence 64 | 5 | profession_first_index | Word offset of the profession in the sentence 65 | 6 | g_first_index | Word offset of the pronoun in the sentence 66 | 7 | predicted gender | 'male'/'female' determined by the pronoun 67 | 8 | stereotype | -1/0/1 for anti-stereotype, neutral and stereotype sentence 68 | 9 | distance | The absolute distance in words between pronoun and profession 69 | 10 | num_of_pronouns | Number of pronouns in the sentence 70 | 11 | corpus | The corpus from which the sentence is taken 71 | 12 | data_index | The query index of the pattern of the sentence 72 | 73 | ## Evaluations 74 | See the instructions below
for reproducing our evaluations on BUG. 75 | 76 | ### Coreference 77 | 1. Download the SpanBERT predictions from [this link](https://drive.google.com/file/d/1i24T1YT_0ByxttrCRR7qxEnt8UWyEJ7R/view?usp=sharing). 78 | 2. Unzip and put `coref_preds.jsonl` in the `predictions/` folder. 79 | 3. From `src/evaluations/`, run `python evaluate_coref.py --in=../../predictions/coref_preds.jsonl --out=../../visualizations/delta_s_by_dist.png`. 80 | 4. This should reproduce the [coreference evaluation figure](visualizations/delta_s_by_dist.png). 81 | 82 | 83 | ## Conversions 84 | ### CoNLL 85 | To convert a data partition to CoNLL format, run the following from `src/converters/`: 86 | ``` 87 | python convert_to_conll.py --in=path/to/input/file --out=path/to/output/file 88 | ``` 89 | 90 | For example, try: 91 | ``` 92 | python convert_to_conll.py --in=../../data/gold_BUG.csv --out=./gold_bug.conll 93 | ``` 94 | 95 | ### Filter from SPIKE 96 | 1. Download the desired [SPIKE](https://spike.apps.allenai.org/) csv files and save them all in the same directory (directory_path). 97 | 2. Make sure the name of each file ends with `_{corpus}{x}.csv`, where `corpus` is the name of the SPIKE dataset and `x` is the number of the query you entered in the search (for example, `myspikedata_wikipedia18.csv`). 98 | 3. From `src/evaluations/`, run `python Analyze.py directory_path`. 99 | 4. This should reproduce the full dataset and balanced dataset.
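
The naming convention in step 2 can be checked before running `Analyze.py`. The following is a minimal sketch, not part of the repository: `parse_spike_filename` is a hypothetical helper that copies the slicing `create_BUG` in `Analyze.py` appears to use, taking the one or two digits right before the extension as the query index and the text before the first underscore as the corpus label.

```python
def parse_spike_filename(file_name):
    """Return (corpus, data_index) for a SPIKE export name, or None if skipped.

    Hypothetical helper: it copies the slicing used in Analyze.py's create_BUG.
    """
    # Only csv and tsv exports are read; anything else is ignored.
    if not file_name.endswith(("csv", "tsv")) or len(file_name) < 6:
        return None
    # The query index is the one or two digits right before the extension.
    if file_name[-6].isdigit():
        data_index = file_name[-6:-4]
    else:
        data_index = file_name[-5]
    # The corpus label is the text before the first underscore.
    corpus = file_name.split("_")[0]
    return corpus, data_index


for name in ["wikipedia_data7.csv", "pubmed_data12.tsv", "notes.txt"]:
    print(name, "->", parse_spike_filename(name))
```

Under this logic a file named `wikipedia_data7.csv` is labeled with corpus `wikipedia` and query index `7`, so it is worth confirming that the prefix before the first underscore is the corpus name you intend.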
100 | 101 | 102 | ## Citing 103 | ``` 104 | @misc{levy2021collecting, 105 | title={Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation}, 106 | author={Shahar Levy and Koren Lazar and Gabriel Stanovsky}, 107 | year={2021}, 108 | eprint={2109.03858}, 109 | archivePrefix={arXiv}, 110 | primaryClass={cs.CL} 111 | } 112 | ``` 113 | 114 | -------------------------------------------------------------------------------- /data.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SLAB-NLP/BUG/6b5314d193ecd04a6864ffbfe329b42cf2aa622e/data.tar.gz -------------------------------------------------------------------------------- /docs/appendix.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SLAB-NLP/BUG/6b5314d193ecd04a6864ffbfe329b42cf2aa622e/docs/appendix.pdf -------------------------------------------------------------------------------- /images/balanced_bug_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SLAB-NLP/BUG/6b5314d193ecd04a6864ffbfe329b42cf2aa622e/images/balanced_bug_logo.png -------------------------------------------------------------------------------- /images/flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SLAB-NLP/BUG/6b5314d193ecd04a6864ffbfe329b42cf2aa622e/images/flow.png -------------------------------------------------------------------------------- /images/full_bug_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SLAB-NLP/BUG/6b5314d193ecd04a6864ffbfe329b42cf2aa622e/images/full_bug_logo.png -------------------------------------------------------------------------------- /images/gold_bug_logo.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/SLAB-NLP/BUG/6b5314d193ecd04a6864ffbfe329b42cf2aa622e/images/gold_bug_logo.png -------------------------------------------------------------------------------- /images/spike_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SLAB-NLP/BUG/6b5314d193ecd04a6864ffbfe329b42cf2aa622e/images/spike_logo.png -------------------------------------------------------------------------------- /predictions/README.md: -------------------------------------------------------------------------------- 1 | Download the SpanBERT predictions from [this link](https://drive.google.com/file/d/1i24T1YT_0ByxttrCRR7qxEnt8UWyEJ7R/view?usp=sharing) and unzip them in this folder. 2 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib 2 | docopt 3 | tqdm 4 | numpy 5 | pandas 6 | -------------------------------------------------------------------------------- /src/converters/convert_to_conll.py: -------------------------------------------------------------------------------- 1 | """ Usage: 2 | convert_to_conll.py [--in=INPUT_FILE] [--out=OUTPUT_FILE] [--debug] 3 | 4 | Options: 5 | --help Show this message and exit 6 | -i INPUT_FILE --in=INPUT_FILE Input file 7 | [default: infile.tmp] 8 | -o OUTPUT_FILE --out=OUTPUT_FILE Output file 9 | [default: outfile.tmp] 10 | --debug Enable debug logging 11 | """ 12 | # External imports 13 | import logging 14 | import pdb 15 | from pprint import pprint 16 | from pprint import pformat 17 | from docopt import docopt 18 | from pathlib import Path 19 | from tqdm import tqdm 20 | import json 21 | import pandas as pd 22 | 23 | # Local imports 24 | 25 | 26 | #---- 27 | 28 | 29 | HEADER = "#begin document ({doc_name}); part 000" 30 | FOOTER = "\n#end document" 31 | BOILER
= ["-"] * 5 + ["Speaker#1"] + ["*"] * 4 32 | ENTITY = "(1)" 33 | BLIST = [13055, 13996, ] # indices which the converter doesn't like for some reason 34 | GENRE = "nw" # conll parsing requires some genre, "nw" follows the convention 35 | # in winobias, but is probably arbitrary otherwise. 36 | 37 | 38 | def validate_row(row): 39 | """ 40 | Run sanity checks on a row; return True iff they all pass. 41 | """ 42 | prof_ind = row.profession_first_index 43 | pron_ind = row.g_first_index 44 | 45 | words = row.sentence_text.strip().split(" ") 46 | num_of_words = len(words) 47 | prof = row["profession"].lower() 48 | pron = row["g"].lower() 49 | 50 | 51 | # make sure the indices point inside the sentence 52 | if not 0 <= prof_ind < num_of_words: 53 | return False 54 | if not 0 <= pron_ind < num_of_words: 55 | return False 56 | if words[prof_ind].lower() != prof: 57 | logging.debug("profession token doesn't match") 58 | return False 59 | if words[pron_ind].lower() != pron: 60 | logging.debug("pronoun token doesn't match (possibly longer than a single word)") 61 | return False 62 | 63 | # don't deal with weird empty tokens 64 | if any((not word) or word.isspace() for word in words): 65 | return False 66 | 67 | # all tests passed 68 | return True 69 | 70 | 71 | def convert_row_to_conll(row, doc_name): 72 | """ 73 | Get a CoNLL multi-line string representing a csv row. 74 | """ 75 | # find prof_index 76 | prof_ind = row.profession_first_index 77 | 78 | # find pronoun 79 | pron_ind = row.g_first_index 80 | 81 | # construct conll rows 82 | conll = [] 83 | words = row.sentence_text.strip().split(" ") 84 | prof = row["profession"].lower() 85 | pron = row["g"].lower() 86 | for word_ind, word in enumerate(words): 87 | word_lower = word.lower() 88 | coref_flag = "-" 89 | 90 | if word_ind == prof_ind: 91 | coref_flag = ENTITY 92 | 93 | elif word_ind == pron_ind: 94 | coref_flag = ENTITY 95 | 96 | metadata = list(map(str, [doc_name, 0, word_ind, word])) 97 | conll_row = metadata + BOILER + [coref_flag] 98 |
conll.append("\t".join(conll_row)) 99 | 100 | 101 | conll_data_str = "\n".join(conll) 102 | header = HEADER.format(doc_name=doc_name) 103 | full_conll = "\n".join([header, conll_data_str, FOOTER]) 104 | return full_conll 105 | 106 | 107 | if __name__ == "__main__": 108 | # Parse command line arguments 109 | args = docopt(__doc__) 110 | inp_fn = Path(args["--in"]) 111 | out_fn = Path(args["--out"]) 112 | 113 | # Determine logging level 114 | debug = args["--debug"] 115 | if debug: 116 | logging.basicConfig(level=logging.DEBUG) 117 | else: 118 | logging.basicConfig(level=logging.INFO) 119 | 120 | logging.info(f"Input file: {inp_fn}, Output file: {out_fn}.") 121 | 122 | # Start computation 123 | df = pd.read_csv(inp_fn) 124 | top_doc_name = out_fn.stem 125 | 126 | err_cnt = 0 127 | 128 | with open(out_fn, "w", encoding="utf8") as fout: 129 | for row_index, row in tqdm(df.iterrows(), total=len(df)): 130 | try: 131 | valid = validate_row(row) 132 | except Exception: 133 | err_cnt += 1 134 | continue 135 | if not valid: 136 | # Something is wrong with this row 137 | # recover and continue 138 | err_cnt += 1 139 | continue 140 | 141 | if row_index in BLIST: 142 | err_cnt += 1 143 | continue 144 | 145 | doc_name = f"{GENRE}/{top_doc_name}/{row_index}" 146 | try: 147 | conll = convert_row_to_conll(row, doc_name) 148 | except Exception: 149 | err_cnt += 1 150 | continue 151 | fout.write(f"{conll}\n") 152 | 153 | total_rows = len(df) 154 | perc = round((err_cnt / total_rows) * 100) if total_rows else 0 155 | rows_written = total_rows - err_cnt 156 | logging.debug(f"""Wrote a total of {rows_written} rows to {out_fn}.
157 | Filtered out {err_cnt} ({perc}%) rows out of {total_rows} total rows.""") 158 | 159 | # End 160 | logging.info("DONE") 161 | -------------------------------------------------------------------------------- /src/evaluations/Analyze.py: -------------------------------------------------------------------------------- 1 | """ 2 | Usage: Analyze.py directory_path 3 | """ 4 | 5 | import pandas as pd 6 | import os 7 | import sys 8 | import matplotlib.pyplot as plt 9 | import re 10 | from tqdm import tqdm 11 | 12 | all_entities, m_entities, f_entities, n_entities = [], [], [], [] 13 | 14 | f_pronouns = ["she", "herself", "her", "She", "Herself", "Her"] 15 | m_pronouns = ["he", "his", "himself", "him", "He", "His", "Himself", "Him"] 16 | 17 | 18 | def create_lists(path): 19 | """ 20 | Takes a txt file of possible entities and fills up the python lists of entities. 21 | """ 22 | df = pd.read_csv(path) 23 | 24 | for index, row in df.iterrows(): 25 | if isinstance(row["entities"], str): 26 | entities = row["entities"].split(",") 27 | entities = [e.rstrip().lstrip().lower() for e in entities] 28 | all_entities.extend(entities) 29 | f = int(row["F_workers"]) 30 | m = int(row["M_workers"]) 31 | if f > m: 32 | f_entities.extend(entities) 33 | elif m > f: 34 | m_entities.extend(entities) 35 | else: 36 | n_entities.extend(entities) 37 | 38 | 39 | def drop_less_quality_data(data): 40 | """ 41 | drop less quality data (discovered by the human annotations statistics) 42 | """ 43 | data = data[data['data_index'] != "6"] 44 | data = data[data['data_index'] != "7"] 45 | data = data[data['data_index'] != "8"] 46 | data = data[data['data_index'] != "12"] 47 | data = data[data['data_index'] != "13"] 48 | data = data[data['data_index'] != "16"] 49 | data = data[data['data_index'] != "19"] 50 | data = data[data['corpus'] != "perseus"] 51 | data = data[data['distance'] <= 20] 52 | data = data[data['num_of_pronouns'] <= 4] 53 | return data 54 | 55 | 56 | def create_BUG(directory_path): 57 
| """ 58 | Receives the SPIKE data and a list of possible entities, 59 | filters the data so that the profession column contains only words from the list, 60 | and creates 2 csv files: 61 | the filtered data 62 | the distribution over the professions 63 | """ 64 | new_data = pd.DataFrame() 65 | i = 0 66 | for file_name in os.listdir(directory_path): 67 | i += 1 68 | if file_name.endswith("tsv"): 69 | data = pd.read_csv(os.path.join(directory_path, file_name), sep='\t') 70 | elif file_name.endswith("csv"): 71 | data = pd.read_csv(os.path.join(directory_path, file_name)) 72 | else: 73 | continue 74 | data = data.rename(columns={"er": "g", "gender": "g"}) 75 | data = data.rename(columns={"er_first_index": "g_first_index", "gender_first_index": "g_first_index"}) 76 | data = data.rename(columns={"er_last_index": "g_last_index", "gender_last_index": "g_last_index"}) 77 | data['profession'] = data['profession'].str.lower() 78 | data['data_index'] = file_name[-5] if not file_name[-6].isdigit() else file_name[-6:-4] 79 | data['corpus'] = file_name.split("_")[0] 80 | # assumption: the path is ...._datax.[t,c]sv, where x is the index of the query 81 | new_data = pd.concat([new_data, data[data['profession'].isin(all_entities)]]) 82 | 83 | new_data['stereotype'] = new_data[['profession', 'g']].apply(is_stereotype, axis=1) 84 | new_data['distance'] = new_data[['g_first_index', 'profession_first_index']].apply(find_distance, axis=1) 85 | new_data['num_of_pronouns'] = new_data[['sentence_text']].apply(num_of_pronouns, axis=1) 86 | 87 | new_data = new_data.drop_duplicates(subset=['sentence_text'], keep="last") 88 | duplicate_pronouns = ["he or she", "she or he", "her or his", "his or her", "her / his", "his / her", 89 | "he / she", "she / he"] 90 | for p in duplicate_pronouns: 91 | new_data = new_data[~new_data.sentence_text.str.contains(p, regex=False)] 92 | 93 | new_data['predicted gender'] = new_data[['g']].apply(predict_gender, axis=1) 94 | 95 | new_data = drop_less_quality_data(new_data) 96
| new_data = clean_columns(new_data) 97 | 98 | new_data.to_csv("data/full_BUG.csv") 99 | professions = new_data['profession'] 100 | distribution = professions.value_counts() 101 | distribution.to_csv("data/data_distribution.csv") 102 | create_balanced("data/full_BUG.csv") 103 | 104 | with open('dropped_en.txt', 'r', encoding="utf8") as f: 105 | dropped_rows = [line.strip() + "\n" for line in f] 106 | 107 | with open("data/en_pro.txt", "w+", encoding="utf-8") as output_file_pro, \ 108 | open("data/en_anti.txt", "w+", encoding="utf-8") as output_file_anti, \ 109 | open("data/en.txt", "w+", encoding="utf-8") as output_file_all: 110 | for index, row in tqdm(new_data.iterrows()): 111 | is_dropped = False 112 | for line in dropped_rows: 113 | if row["sentence_text"] in line: 114 | is_dropped = True 115 | break 116 | if not is_dropped: 117 | line = row['predicted gender'] + "\t" + str(row['profession_first_index']) + "\t" + \ 118 | row['sentence_text'] + "\t" + row['profession'] + "\t" + row['corpus'] 119 | if row["stereotype"] == 1: 120 | output_file_pro.write(line + "\n") 121 | elif row["stereotype"] == -1: 122 | output_file_anti.write(line + "\n") 123 | output_file_all.write(line + "\n") 124 | 125 | corpora = ["wikipedia", "covid19", "pubmed"] 126 | for corpus in corpora: 127 | with open("data/en_{}_pro.txt".format(corpus), "w+", encoding="utf-8") as output_file_pro, \ 128 | open("data/en_{}_anti.txt".format(corpus), "w+", encoding="utf-8") as output_file_anti, \ 129 | open("data/en_{}.txt".format(corpus), "w+", encoding="utf-8") as output_file_all: 130 | for index, row in tqdm(new_data.iterrows()): 131 | is_dropped = False 132 | for line in dropped_rows: 133 | if row["sentence_text"] in line: 134 | is_dropped = True 135 | break 136 | if not is_dropped and row['corpus'] == corpus: 137 | line = row['predicted gender'] + "\t" + str(row['profession_first_index']) + "\t" + \ 138 | row['sentence_text'] + "\t" + row['profession'] + "\t" + row['corpus'] 139 | if
row["stereotype"] == 1: 140 | output_file_pro.write(line + "\n") 141 | elif row["stereotype"] == -1: 142 | output_file_anti.write(line + "\n") 143 | output_file_all.write(line + "\n") 144 | 145 | print("Size of data: " + str(new_data.shape[0])) 146 | return new_data 147 | 148 | 149 | def clean_columns(df): 150 | cleaned_df = df[['sentence_text', 'profession', 'g', 'profession_first_index', 'g_first_index', 151 | 'predicted gender', 'stereotype', 'distance', 'num_of_pronouns', 'corpus', 'data_index']] 152 | return cleaned_df 153 | 154 | 155 | def create_balanced(data_path): 156 | data = pd.read_csv(data_path) 157 | data_female = data[data["g"] == "her"] 158 | data_female = pd.concat([data_female, data[data["g"] == "she"]]) 159 | data_female = pd.concat([data_female, data[data["g"] == "herself"]]) 160 | data_male = data[data["g"] == "his"] 161 | data_male = pd.concat([data_male, data[data["g"] == "he"]]) 162 | data_male = pd.concat([data_male, data[data["g"] == "himself"]]) 163 | data_f_a = data_female[data_female["stereotype"] == -1] 164 | data_f_s = data_female[data_female["stereotype"] == 1] 165 | data_m_a = data_male[data_male["stereotype"] == -1] 166 | data_m_s = data_male[data_male["stereotype"] == 1] 167 | 168 | n = data_f_s.shape[0] 169 | balanced = data_f_s 170 | balanced = pd.concat([balanced, data_f_a.sample(n=n)]) 171 | balanced = pd.concat([balanced, data_m_s.sample(n=n)]) 172 | balanced = pd.concat([balanced, data_m_a.sample(n=n)]) 173 | balanced.to_csv("data/balanced_BUG.csv") 174 | 175 | 176 | def remove_tags_from_sentence(df): 177 | """ 178 | Removes a single pair of tags (an opening and a closing tag) from the sentence.
179 | """ 180 | df = df.values 181 | sentence = df[0] 182 | try: 183 | sentence_a = sentence.split("<")[0] 184 | sentence_b = sentence.split(">")[1] 185 | sentence_c = sentence_b.split("<")[0] 186 | sentence_d = sentence.split(">")[2] 187 | except (AttributeError, IndexError): 188 | print(sentence) 189 | return None 190 | return sentence_a + sentence_c + sentence_d 191 | 192 | 193 | def add_tags_to_sentence(df): 194 | """ 195 | Adds ref tag to the gender pronoun word, and ent tag to the entity word. 196 | """ 197 | df = df.values 198 | sentence = df[0] 199 | p_idx = int(df[1]) 200 | split_sentence = sentence.split(' ') 201 | split_sentence[p_idx] = "" + split_sentence[p_idx] + "" 202 | return " ".join(split_sentence) 203 | 204 | 205 | def find_distance(df): 206 | """ 207 | Finds the absolute distance in words between the gender pronoun and the entity word. 208 | """ 209 | df = df.values 210 | g_idx = df[0] 211 | p_idx = df[1] 212 | return abs(g_idx - p_idx) 213 | 214 | 215 | def predict_gender(df): 216 | """ 217 | Returns "male" if the pronoun in the sentence is a male pronoun, and "female" otherwise. 218 | """ 219 | df = df.values 220 | g = df[0] 221 | if g in m_pronouns: 222 | return "male" 223 | return "female" 224 | 225 | 226 | def is_stereotype(df): 227 | """ 228 | 1: stereotype, -1: anti-stereotype, 0: neutral entity 229 | """ 230 | df = df.values 231 | profession = df[0] 232 | g = df[1] 233 | if (profession in f_entities and g in f_pronouns) or (profession in m_entities and g in m_pronouns): 234 | return 1 235 | if (profession in f_entities and g in m_pronouns) or (profession in m_entities and g in f_pronouns): 236 | return -1 237 | return 0 238 | 239 | 240 | def num_of_pronouns(df): 241 | """ 242 | Counts the occurrences of gendered pronouns in the sentence.
243 | """ 244 | sentence = df.values[0].lower() 245 | ans = 0 246 | pronouns = ["he", "she", "his", "himself", "herself", "her", "him"] 247 | try: 248 | for word in pronouns: 249 | ans += sum(1 for _ in re.finditer(r'\b%s\b' % re.escape(word), sentence)) 250 | return ans 251 | except Exception: 252 | return 1 253 | 254 | 255 | def samples_for_double_validation(data): 256 | """ 257 | Sample 200 random examples from the annotations to validate that the annotators agree 258 | """ 259 | data['validation'] = 0 260 | new_data = data.sample(n=200) 261 | new_data[['predicted gender', 'validation', 'sentence_text', 'profession', 'g', 'g_first_index', 262 | 'profession_first_index', 'stereotype', 263 | 'corpus', 'data_index', 'correct']].to_csv("double_validation.csv") 264 | 265 | 266 | def samples_to_classify(data): 267 | """ 268 | Sample 1000 random examples from the data for us to classify the sentences as correct/incorrect reference 269 | Sample 50 random examples from each corpus and for each query 270 | """ 271 | data['correct?'] = 0 272 | new_data = data.sample(n=1000) 273 | new_data['sentence_text'] = \ 274 | new_data[['sentence_text', 'profession_first_index']].apply(add_tags_to_sentence, axis=1) 275 | for i in range(1, 11): 276 | df_new = new_data[100 * (i - 1):100 * i] 277 | df_new[['predicted gender', 'correct?', 'sentence_text', 'profession', 'g', 'g_first_index', 278 | 'profession_first_index', 'stereotype', 279 | 'corpus', 'data_index']].to_csv("human_annotation/data_samples{}.csv".format(i)) 280 | 281 | for i in range(1, 22): 282 | df_new = data[data.data_index == str(i)].sample(n=50) 283 | df_new['sentence_text'] = \ 284 | df_new[['sentence_text', 'profession_first_index']].apply(add_tags_to_sentence, axis=1) 285 | df_new[['predicted gender', 'correct?', 'sentence_text', 'profession', 'g', 'g_first_index', 286 | 'profession_first_index', 'stereotype', 287 | 'corpus', 'data_index']].to_csv("human_annotation/data_samples_query{}.csv".format(i)) 288 | 289 | for i in ["wikipedia",
"perseus", "covid19", "pubmed"]: 290 | df_new = data[data.corpus == i].sample(n=50) 291 | df_new['sentence_text'] = \ 292 | df_new[['sentence_text', 'profession_first_index']].apply(add_tags_to_sentence, axis=1) 293 | df_new[['predicted gender', 'correct?', 'sentence_text', 'profession', 'g', 'g_first_index', 294 | 'profession_first_index', 'stereotype', 295 | 'corpus', 'data_index']].to_csv("human_annotation\\data_samples_{}.csv".format(i)) 296 | 297 | 298 | def plot_dict(dict_attributes, color, title): 299 | """ 300 | receives a dictionary of attributes and sentences count / sentences statistics, ant plot the data as histograms 301 | """ 302 | histo = plt.bar(dict_attributes.keys(), dict_attributes.values(), color=color) 303 | for bar in histo: 304 | height = bar.get_height() 305 | plt.annotate(height if height > 1 else "{:.2f}".format(height), 306 | xy=(bar.get_x() + bar.get_width() / 2, height), 307 | xytext=(0, 2), textcoords="offset points", ha='center', va='bottom') 308 | plt.title(title) 309 | plt.savefig("data/graphs/{}.png".format(title)) 310 | plt.close() 311 | 312 | 313 | def statistics_over_samples(data): 314 | """ 315 | plot statistics over data of human annotations. 316 | numbers over histograms are: 317 | (number of "correct" samples with attribute) / (number of all samples with attribute) 318 | comment out code lines are data with less quality. 
319 | """ 320 | 321 | print("accuracy of human annotations" + str(sum(data['correct'] == '1') / data.shape[0])) 322 | 323 | 324 | 325 | statistics_gender_pronouns = { 326 | "himself": sum(data[data['g'] == 'himself']['correct'] == '1') / sum(data['g'] == 'himself'), 327 | "he": sum(data[data['g'] == 'he']['correct'] == '1') / sum(data['g'] == 'he'), 328 | "she": sum(data[data['g'] == 'she']['correct'] == '1') / sum(data['g'] == 'she'), 329 | "herself": sum(data[data['g'] == 'herself']['correct'] == '1') / sum(data['g'] == 'herself'), 330 | "his": sum(data[data['g'] == 'his']['correct'] == '1') / sum(data['g'] == 'his'), 331 | "her": sum(data[data['g'] == 'her']['correct'] == '1') / sum(data['g'] == 'her'), 332 | "female": (sum(data[data['g'] == 'her']['correct'] == '1') + 333 | sum(data[data['g'] == 'she']['correct'] == '1') + 334 | sum(data[data['g'] == 'herself']['correct'] == '1')) / \ 335 | (sum(data['g'] == 'her') + sum(data['g'] == 'she') + sum(data['g'] == 'herself')), 336 | "male": (sum(data[data['g'] == 'he']['correct'] == '1') + 337 | sum(data[data['g'] == 'his']['correct'] == '1') + sum(data[data['g'] == 'him']['correct'] == '1') + 338 | sum(data[data['g'] == 'himself']['correct'] == '1')) / \ 339 | (sum(data['g'] == 'he') + sum(data['g'] == 'his') + sum(data['g'] == 'himself') + 340 | sum(data['g'] == 'him')) 341 | } 342 | plot_dict(statistics_gender_pronouns, "pink", "Statistics - Gender Pronouns") 343 | 344 | data['distance'] = data[['g_first_index', 'profession_first_index']].apply(find_distance, axis=1) 345 | # distance_average = sum(data['distance']) / 300 346 | statistics_distance = { 347 | "d <= 5": sum(data[data['distance'] <= 5]['correct'] == '1') / sum(data['distance'] <= 5), 348 | "5 < d <= 10": (sum(data[data['distance'] <= 10]['correct'] == '1') - 349 | sum(data[data['distance'] <= 5]['correct'] == '1')) / 350 | (sum(data['distance'] <= 10) - sum(data['distance'] <= 5)), 351 | "10 < d <= 15": (sum(data[data['distance'] <= 15]['correct'] == 
'1') - 352 | sum(data[data['distance'] <= 10]['correct'] == '1')) / 353 | (sum(data['distance'] <= 15) - sum(data['distance'] <= 10)), 354 | "15 < d <=20": (sum(data[data['distance'] <= 20]['correct'] == '1') - 355 | sum(data[data['distance'] <= 15]['correct'] == '1')) / 356 | (sum(data['distance'] <= 20) - sum(data['distance'] <= 15)), 357 | # "20 < d": sum(data[data['distance'] > 20]['correct'] == '1') / sum(data['distance'] > 20) 358 | } 359 | plot_dict(statistics_distance, "teal", "Statistics - Distance Between Words") 360 | 361 | m_data = data[data['profession'].isin(m_entities)] 362 | f_data = data[data['profession'].isin(f_entities)] 363 | n_data = data[data['profession'].isin(n_entities)] 364 | 365 | # data[f_data[f_data['g'] == 'he']['correct'] == '0'] 366 | # 367 | # "non-stereotypes": (sum(f_data[f_data['g'] == 'he']['correct'] == '1') + 368 | # sum(f_data[f_data['g'] == 'his']['correct'] == '1') + 369 | # sum(m_data[m_data['g'] == 'her']['correct'] == '1') + 370 | # sum(m_data[m_data['g'] == 'she']['correct'] == '1')) / \ 371 | # (sum(f_data['g'] == 'he') + sum(f_data['g'] == 'his') + 372 | # sum(m_data['g'] == 'her') + sum(m_data['g'] == 'she')), 373 | 374 | statistics_male_female = { 375 | "male\n entities": sum(m_data['correct'] == '1') / len(m_data), 376 | "female\n entities": sum(f_data['correct'] == '1') / len(f_data), 377 | "neutral\n entities": sum(n_data['correct'] == '1') / len(n_data), 378 | } 379 | plot_dict(statistics_male_female, "darkmagenta", "Statistics - Male, Female Entities") 380 | 381 | statistics_stereotype = { 382 | "stereotypes": (sum(m_data[m_data['g'] == 'he']['correct'] == '1') + 383 | sum(m_data[m_data['g'] == 'his']['correct'] == '1') + 384 | sum(f_data[f_data['g'] == 'her']['correct'] == '1') + 385 | sum(f_data[f_data['g'] == 'she']['correct'] == '1')) / \ 386 | (sum(m_data['g'] == 'he') + sum(m_data['g'] == 'his') + 387 | sum(f_data['g'] == 'her') + sum(f_data['g'] == 'she')), 388 | "s\n male": (sum(m_data[m_data['g'] == 
'he']['correct'] == '1') + 389 | sum(m_data[m_data['g'] == 'his']['correct'] == '1')) / \ 390 | (sum(m_data['g'] == 'he') + sum(m_data['g'] == 'his')), 391 | "s\n female": (sum(f_data[f_data['g'] == 'her']['correct'] == '1') + 392 | sum(f_data[f_data['g'] == 'she']['correct'] == '1')) / \ 393 | (sum(f_data['g'] == 'her') + sum(f_data['g'] == 'she')), 394 | "non-stereotypes": (sum(f_data[f_data['g'] == 'he']['correct'] == '1') + 395 | sum(f_data[f_data['g'] == 'his']['correct'] == '1') + 396 | sum(m_data[m_data['g'] == 'her']['correct'] == '1') + 397 | sum(m_data[m_data['g'] == 'she']['correct'] == '1')) / \ 398 | (sum(f_data['g'] == 'he') + sum(f_data['g'] == 'his') + 399 | sum(m_data['g'] == 'her') + sum(m_data['g'] == 'she')), 400 | "non-s\n male": (sum(f_data[f_data['g'] == 'he']['correct'] == '1') + 401 | sum(f_data[f_data['g'] == 'his']['correct'] == '1')) / \ 402 | (sum(f_data['g'] == 'he') + sum(f_data['g'] == 'his')), 403 | "non-s\n female": (sum(m_data[m_data['g'] == 'her']['correct'] == '1') + 404 | sum(m_data[m_data['g'] == 'she']['correct'] == '1')) / \ 405 | (sum(m_data['g'] == 'her') + sum(m_data['g'] == 'she')) 406 | } 407 | plot_dict(statistics_stereotype, "orange", "Statistics - Stereotypes") 408 | 409 | statistics_num_of_pronouns = { 410 | "1": (sum(data[data['num_of_pronouns'] == 1]['correct'] == '1')) / \ 411 | (sum(data['num_of_pronouns'] == 1)), # 146 412 | "2": (sum(data[data['num_of_pronouns'] == 2]['correct'] == '1')) / \ 413 | (sum(data['num_of_pronouns'] == 2)), # 81 414 | "3": (sum(data[data['num_of_pronouns'] == 3]['correct'] == '1')) / \ 415 | (sum(data['num_of_pronouns'] == 3)), # 47 416 | "4": (sum(data[data['num_of_pronouns'] == 4]['correct'] == '1')) / \ 417 | (sum(data['num_of_pronouns'] == 4)), # 20 418 | # "more then 4": (sum(data[data['num_of_pronouns'] > 4]['correct'] == '1')) / \ 419 | # (sum(data['num_of_pronouns'] > 4)) # 6 420 | } 421 | plot_dict(statistics_num_of_pronouns, "maroon", "Statistics - number of pronouns") 422 
| 423 | count_corpus = { 424 | "wikipedia": (sum(data[data['corpus'] == 'wikipedia']['correct'] == '1')) / sum(data['corpus'] == 'wikipedia'), 425 | "covid19": (sum(data[data['corpus'] == 'covid19']['correct'] == '1')) / sum(data['corpus'] == 'covid19'), 426 | # "perseus": (sum(data[data['corpus'] == 'perseus']['correct'] == '1')) / sum(data['corpus'] == 'perseus'), 427 | "pubmed": (sum(data[data['corpus'] == 'pubmed']['correct'] == '1')) / sum(data['corpus'] == 'pubmed'), 428 | } 429 | plot_dict(count_corpus, "g", "Statistics - statistics_count_corpus") 430 | 431 | 432 | def statistics_over_corpus(data_path): 433 | """ 434 | Plot statistics over all sentences. 435 | The number above each histogram bar is 436 | the count of all samples with that attribute. 437 | Commented-out lines cover data of lower quality. 438 | """ 439 | data = pd.read_csv(data_path) 440 | data['g'] = data['g'].apply(lambda x: x.lower()) 441 | count_gender_pronouns = { 442 | "herself": sum(data['g'] == 'herself'), 443 | "she": sum(data['g'] == 'she'), 444 | "himself": sum(data['g'] == 'himself'), 445 | "he": sum(data['g'] == 'he'), 446 | "her": sum(data['g'] == 'her'), 447 | "his": sum(data['g'] == 'his'), 448 | # "female": sum(data['g'] == 'her') + sum(data['g'] == 'she') + sum(data['g'] == 'herself'), 449 | # "male": sum(data['g'] == 'he') + sum(data['g'] == 'his') + sum(data['g'] == 'him') + sum(data['g'] == 'himself') 450 | } 451 | plot_dict(count_gender_pronouns, "pink", "count - Gender Pronouns") 452 | print(count_gender_pronouns) 453 | 454 | m_data = data[data['profession'].isin(m_entities)] 455 | f_data = data[data['profession'].isin(f_entities)] 456 | n_data = data[data['profession'].isin(n_entities)] 457 | 458 | count_male_female = { 459 | "male\n entities": m_data.shape[0], 460 | "female\n entities": f_data.shape[0], 461 | "neutral\n entities": n_data.shape[0] 462 | } 463 | plot_dict(count_male_female, "darkmagenta", "count - Male, Female Entities") 464 | 465 | count_stereotype = { 466 | 
"stereotypes": sum(data['stereotype'] == 1), 467 | "non-stereotypes": sum(data['stereotype'] == -1), 468 | } 469 | plot_dict(count_stereotype, "orange", "count - Stereotypes") 470 | 471 | count_data_index = { 472 | "1": sum(data['data_index'] == '1'), 473 | "2": sum(data['data_index'] == '2'), 474 | "3": sum(data['data_index'] == '3'), 475 | "4": sum(data['data_index'] == '4'), 476 | "5": sum(data['data_index'] == '5'), 477 | # "6": sum(data['data_index'] == '6'), 478 | # "7": sum(data['data_index'] == '7'), 479 | # "8": sum(data['data_index'] == '8'), 480 | "9": sum(data['data_index'] == '9'), 481 | "10": sum(data['data_index'] == '10'), 482 | "11": sum(data['data_index'] == '11'), 483 | # "12": sum(data['data_index'] == '12'), 484 | # "13": sum(data['data_index'] == '13'), 485 | "14": sum(data['data_index'] == '14'), 486 | "15": sum(data['data_index'] == '15'), 487 | # "16": sum(data['data_index'] == '16'), 488 | "17": sum(data['data_index'] == '17'), 489 | "18": sum(data['data_index'] == '18'), 490 | # "19": sum(data['data_index'] == '19'), 491 | "20": sum(data['data_index'] == '20'), 492 | "21": sum(data['data_index'] == '21'), 493 | } 494 | plot_dict(count_data_index, "gold", "count - statistics_data_index") 495 | 496 | count_corpus = { 497 | "wikipedia": sum(data['corpus'] == 'wikipedia'), 498 | "covid19": sum(data['corpus'] == 'covid19'), 499 | # "perseus": sum(data['corpus'] == 'perseus'), 500 | "pubmed": sum(data['corpus'] == 'pubmed'), 501 | # "ungd": sum(data['corpus'] == 'ungd'), 502 | } 503 | plot_dict(count_corpus, "g", "count - statistics_count_corpus") 504 | 505 | data['num_of_pronouns'] = data[['sentence_text']].apply(num_of_pronouns, axis=1) 506 | count_num_of_pronouns = { 507 | "1": sum(data['num_of_pronouns'] == 1), 508 | "2": sum(data['num_of_pronouns'] == 2), 509 | "3": sum(data['num_of_pronouns'] == 3), 510 | "4": sum(data['num_of_pronouns'] == 4), 511 | "more then 4": sum(data['num_of_pronouns'] > 4), 512 | } 513 | 
plot_dict(count_num_of_pronouns, "maroon", "count - number of pronouns") 514 | 515 | data['distance'] = data[['g_first_index', 'profession_first_index']].apply(find_distance, axis=1) 516 | 517 | count_distance = { 518 | "d <= 5": sum(data['distance'] <= 5), 519 | "5 < d <= 10": (sum(data['distance'] <= 10) - sum(data['distance'] <= 5)), 520 | "10 < d <= 15": (sum(data['distance'] <= 15) - sum(data['distance'] <= 10)), 521 | "15 < d <=20": (sum(data['distance'] <= 20) - sum(data['distance'] <= 15)), 522 | "20 < d": sum(data['distance'] > 20) 523 | } 524 | plot_dict(count_distance, "teal", "count - Distance Between Words") 525 | 526 | 527 | if __name__ == '__main__': 528 | create_lists("inc_occ_gender.csv") 529 | filtered_data = create_BUG(sys.argv[1]) 530 | 531 | data_filtered = "data\\full_BUG.csv" 532 | # statistics_over_corpus(data_filtered) 533 | 534 | -------------------------------------------------------------------------------- /src/evaluations/dataset_stats.py: -------------------------------------------------------------------------------- 1 | """ Usage: 2 | [--in=INPUT_FILE] [--out=OUTPUT_FILE] [--debug] 3 | 4 | Options: 5 | --help Show this message and exit 6 | -i INPUT_FILE --in=INPUT_FILE Input file 7 | [default: ../data/full_bug.csv] 8 | -o OUTPUT_FILE --out=OUTPUT_FILE Output file 9 | [default: outfile.tmp] 10 | --debug Whether to debug 11 | """ 12 | # External imports 13 | import logging 14 | import pdb 15 | from pprint import pprint 16 | from pprint import pformat 17 | from docopt import docopt 18 | from pathlib import Path 19 | from tqdm import tqdm 20 | import numpy as np 21 | import json 22 | import pandas as pd 23 | import numpy as np 24 | 25 | 26 | # Local imports 27 | 28 | 29 | #---- 30 | 31 | def dist_between_prof_ant(df): 32 | """ 33 | Return the average and std of the distance between the profession 34 | (the antecedent) and the pronoun in the given data frame.
35 | """ 36 | dist = np.abs(df.profession_first_index - df.g_first_index) 37 | ave = np.average(dist) 38 | std = np.std(dist) 39 | dist_dict = {"average": ave, 40 | "std": std} 41 | return dist_dict 42 | 43 | if __name__ == "__main__": 44 | 45 | # Parse command line arguments 46 | args = docopt(__doc__) 47 | inp_fn = Path(args["--in"]) 48 | out_fn = Path(args["--out"]) 49 | 50 | # Determine logging level 51 | debug = args["--debug"] 52 | if debug: 53 | logging.basicConfig(level = logging.DEBUG) 54 | else: 55 | logging.basicConfig(level = logging.INFO) 56 | 57 | logging.info(f"Input file: {inp_fn}, Output file: {out_fn}.") 58 | 59 | # Load data 60 | df = pd.read_csv(inp_fn) 61 | 62 | # Compute distance between pronoun and antecedent for each of the 63 | # dataset partitions (stereotypical and anti-stereotypical) 64 | ste = df[df.stereotype == 1] 65 | ant = df[df.stereotype == -1] 66 | 67 | ste_dist = dist_between_prof_ant(ste) 68 | ant_dist = dist_between_prof_ant(ant) 69 | 70 | logging.info("stereotype dist: {}".format(pformat(ste_dist))) 71 | logging.info("anti-stereotype dist: {}".format(pformat(ant_dist))) 72 | 73 | 74 | # End 75 | logging.info("DONE") 76 | -------------------------------------------------------------------------------- /src/evaluations/evaluate_coref.py: -------------------------------------------------------------------------------- 1 | """ Usage: 2 | [--in=INPUT_FILE] [--out=OUTPUT_FILE] [--debug] 3 | 4 | Options: 5 | --help Show this message and exit 6 | -i INPUT_FILE --in=INPUT_FILE Input file 7 | [default: ../../predictions/coref_preds.jsonl] 8 | -o OUTPUT_FILE --out=OUTPUT_FILE Output file 9 | [default: ../../visualizations/delta_s_by_dist.png] 10 | --debug Whether to debug 11 | """ 12 | # External imports 13 | import logging 14 | import pdb 15 | from pprint import pprint 16 | from pprint import pformat 17 | from docopt import docopt 18 | from pathlib import Path 19 | from tqdm import tqdm 20 | import numpy as np 21 | import json 22 | from collections import defaultdict 23 | from math 
import floor 24 | import matplotlib.pyplot as plt 25 | from operator import itemgetter 26 | 27 | # Local imports 28 | 29 | 30 | #---- 31 | 32 | BIN_SIZE = 5 33 | 34 | def find_cluster_ind(clusters, word_ind): 35 | """ 36 | return the indices of all clusters containing the given word (an empty list if none) 37 | """ 38 | found_in_clusters = [] 39 | for cluster_ind, cluster in enumerate(clusters): 40 | for ent_ind, ent in enumerate(cluster): 41 | ent_start, ent_end = ent 42 | if (word_ind >= ent_start) and (word_ind <= ent_end): 43 | # found a cluster 44 | found_in_clusters.append(cluster_ind) 45 | 46 | # the list is empty if no cluster contains the word 47 | return found_in_clusters 48 | 49 | 50 | def is_correct_pred(line): 51 | """ 52 | return True iff this line represents a correct prediction. 53 | """ 54 | pred = line["pred"] 55 | row = line["row"] 56 | clusters = pred["clusters"] 57 | ent_id = find_cluster_ind(clusters, row["profession_first_index"]) 58 | pron_id = find_cluster_ind(clusters, row["g_first_index"]) 59 | 60 | # prediction is correct if it assigns pronoun and entity to the same cluster 61 | is_correct = len(set(ent_id).intersection(pron_id)) > 0 62 | 63 | return is_correct 64 | 65 | 66 | def get_acc(vals): 67 | """ 68 | return the accuracy given binary scores 69 | """ 70 | acc = (sum(vals) / len(vals)) * 100 71 | return acc 72 | 73 | def get_delta_s_by_dist(dist_metric): 74 | """ 75 | Compute delta s by distance 76 | """ 77 | dists = sorted(dist_metric.keys()) 78 | delta_s_by_dist = [] 79 | for dist in dists: 80 | cur_metric = dist_metric[dist] 81 | ste = cur_metric[1] 82 | ant = cur_metric[-1] 83 | delta_s_by_dist.append((dist, 84 | get_acc(ste) - get_acc(ant), 85 | len(ste) + len(ant))) 86 | return delta_s_by_dist 87 | 88 | def get_simple_metric_by_dist(dist_metric, metric_name): 89 | """ 90 | Aggregate over a given metric by dist.
91 | """ 92 | dists = sorted(dist_metric.keys()) 93 | met_by_dist = [] 94 | for dist in dists: 95 | met_by_dist.append(get_acc(dist_metric[dist][metric_name])) 96 | return met_by_dist 97 | 98 | # Simple instantiations 99 | get_acc_by_dist = lambda dist_metric: get_simple_metric_by_dist(dist_metric, "acc") 100 | get_ste_by_dist = lambda dist_metric: get_simple_metric_by_dist(dist_metric, 1) 101 | get_ant_by_dist = lambda dist_metric: get_simple_metric_by_dist(dist_metric, -1) 102 | 103 | def average_buckets(b1, b2): 104 | """ 105 | average two buckets 106 | """ 107 | b1_ind, b1_val, b1_cnt = b1 108 | b2_ind, b2_val, b2_cnt = b2 109 | 110 | cnt = b1_cnt + b2_cnt 111 | val = ((b1_val * b1_cnt) + (b2_val * b2_cnt)) / cnt 112 | 113 | new_bucket = (b1_ind, val, cnt) 114 | return new_bucket 115 | 116 | if __name__ == "__main__": 117 | # Parse command line arguments 118 | args = docopt(__doc__) 119 | inp_fn = Path(args["--in"]) 120 | out_fn = Path(args["--out"]) 121 | 122 | # Determine logging level 123 | debug = args["--debug"] 124 | if debug: 125 | logging.basicConfig(level = logging.DEBUG) 126 | else: 127 | logging.basicConfig(level = logging.INFO) 128 | 129 | logging.info(f"Input file: {inp_fn}, Output file: {out_fn}.") 130 | 131 | # Start computation 132 | metrics = {"acc": [], 133 | "ste": [], 134 | "ant": [], 135 | "masc": [], 136 | "femn": [], 137 | "num_of_pronouns": defaultdict(list), 138 | "distance": defaultdict(lambda: defaultdict(list))} 139 | 140 | for line in tqdm(open(inp_fn, encoding = "utf8")): 141 | line = json.loads(line.strip()) 142 | is_correct = is_correct_pred(line) 143 | row = line["row"] 144 | metrics["acc"].append(is_correct) 145 | gender = row["predicted gender"].lower() 146 | stereotype = row["stereotype"] 147 | 148 | if gender == "male": 149 | metrics["masc"].append(is_correct) 150 | elif gender == "female": 151 | metrics["femn"].append(is_correct) 152 | 153 | if stereotype == 1: 154 | metrics["ste"].append(is_correct) 155 | elif stereotype 
== -1: 156 | metrics["ant"].append(is_correct) 157 | 158 | num_of_prons = row["num_of_pronouns"] 159 | dist = floor(row["distance"] / BIN_SIZE) 160 | metrics["num_of_pronouns"][num_of_prons].append(is_correct) 161 | metrics["distance"][dist][stereotype].append(is_correct) 162 | metrics["distance"][dist]["acc"].append(is_correct) 163 | 164 | 165 | acc = get_acc(metrics["acc"]) 166 | delta_g = get_acc(metrics["masc"]) - get_acc(metrics["femn"]) 167 | delta_s = get_acc(metrics["ste"]) - get_acc(metrics["ant"]) 168 | delta_s_by_dist = get_delta_s_by_dist(metrics["distance"]) 169 | acc_by_dist = get_acc_by_dist(metrics["distance"]) 170 | ste_by_dist = get_ste_by_dist(metrics["distance"]) 171 | ant_by_dist = get_ant_by_dist(metrics["distance"]) 172 | 173 | # average last bucket 174 | # delta_s_by_dist[-2] = average_buckets(delta_s_by_dist[-2], delta_s_by_dist[-1]) 175 | # delta_s_by_dist = delta_s_by_dist[:-1] 176 | 177 | logging.info(f"acc = {acc:.1f}; delta_g = {delta_g:.1f}; delta_s = {delta_s:.1f}") 178 | logging.info(f"delta s by dist = {delta_s_by_dist}") 179 | logging.info(f"acc by dist = {acc_by_dist}") 180 | 181 | # plot 182 | plt.rcParams.update({'font.size': 15}) 183 | ranges = [f"{(x*5) + 1}-{(x +1) * 5}" for x in map(itemgetter(0), delta_s_by_dist)] 184 | ranges[-1] = ranges[-1].split("-")[0] + ">" 185 | values_ds = list(map(itemgetter(1), delta_s_by_dist)) 186 | y_pos = np.arange(len(ranges)) 187 | width = 1 # the width for the bars 188 | 189 | plt.plot(y_pos, ste_by_dist, label = "stereotypical", 190 | color = "orange", linestyle = "dashed") 191 | plt.scatter(y_pos, ste_by_dist, color = "orange") 192 | 193 | plt.plot(y_pos, ant_by_dist, label = "anti-stereotypical", 194 | color = "blue", linestyle = "dotted") 195 | plt.scatter(y_pos, ant_by_dist, color = "blue") 196 | 197 | plt.xticks(y_pos, ranges) 198 | plt.ylabel("coreference acc") 199 | plt.xlabel("distance [words] between pronoun and antecedent") 200 | plt.legend() 201 | plt.tight_layout() 202 | 
plt.savefig(out_fn) 203 | 204 | 205 | # End 206 | logging.info("DONE") 207 | -------------------------------------------------------------------------------- /src/evaluations/inc_occ_gender.csv: -------------------------------------------------------------------------------- 1 | Occupation,All_workers,All_weekly,M_workers,M_weekly,F_workers,F_weekly,entities 2 | ALL OCCUPATIONS,109080,809,60746,895,48334,726, 3 | MANAGEMENT,12480,1351,7332,1486,5147,1139,"manager, boss, principal, executive, headmaster" 4 | Chief executives,1046,2041,763,2251,283,1836,Chief 5 | General and operations managers,823,1260,621,1347,202,1002, 6 | Legislators,8,Na,5,Na,4,Na,Legislator 7 | Advertising and promotions managers,55,1050,29,Na,26,Na, 8 | Marketing and sales managers,948,1462,570,1603,378,1258, 9 | Public relations and fundraising managers,59,1557,24,Na,35,Na, 10 | Administrative services managers,170,1191,96,1451,73,981, 11 | Computer and information systems managers,636,1728,466,1817,169,1563, 12 | Financial managers,1124,1408,551,1732,573,1130, 13 | Compensation and benefits managers,23,Na,7,Na,16,Na, 14 | Human resources managers,254,1365,68,1495,186,1274, 15 | Training and development managers,37,Na,17,Na,20,Na, 16 | Industrial production managers,267,1485,221,1528,45,Na, 17 | Purchasing managers,193,1348,109,1404,84,1226, 18 | "Transportation, storage, and distribution managers",276,966,224,1006,52,749, 19 | "Farmers, ranchers, and other agricultural managers",129,769,106,847,23,Na,"Agricultural, Farmer, agriculturist, rancher" 20 | Construction managers,471,1329,429,1357,42,Na, 21 | Education administrators,778,1423,282,1585,496,1252,administrator 22 | Architectural and engineering managers,110,1899,101,1892,10,Na,architect 23 | Food service managers,763,742,389,820,374,680, 24 | Funeral service managers,13,Na,10,Na,2,Na, 25 | Gaming managers,19,Na,13,Na,6,Na, 26 | Lodging managers,123,985,54,1171,68,902, 27 | Medical and health services 
managers,592,1210,154,1422,438,1156, 28 | Natural sciences managers,24,Na,11,Na,13,Na, 29 | Postmasters and mail superintendents,20,Na,10,Na,10,Na,"Postmaster, superintendent" 30 | "Property, real estate, and community association managers",401,914,171,1137,230,823,realtor 31 | Social and community service managers,305,1022,105,1142,200,965, 32 | Emergency management directors,9,Na,6,Na,3,Na, 33 | "Managers, all other",2803,1408,1717,1525,1085,1213, 34 | BUSINESS,5942,1137,2686,1327,3256,1004, 35 | "Agents and business managers of artists, performers, and athletes",27,Na,13,Na,14,Na, 36 | "Buyers and purchasing agents, farm products",11,Na,9,Na,2,Na, 37 | "Wholesale and retail buyers, except farm products",142,926,73,886,69,985, 38 | "Purchasing agents, except wholesale, retail, and farm products",260,1009,136,1020,124,986, 39 | "Claims adjusters, appraisers, examiners, and investigators",317,963,141,1134,176,824,"Appraiser, investigator, examiner" 40 | Compliance officers,235,1198,126,1375,109,1025, 41 | Cost estimators,95,1232,83,1264,12,Na, 42 | Human resources workers,592,1002,151,1158,441,984, 43 | "Compensation, benefits, and job analysis specialists",63,998,12,Na,50,898, 44 | Training and development specialists,107,990,42,Na,65,1037, 45 | Logisticians,111,1028,66,1075,44,Na,logistician 46 | Management analysts,529,1431,291,1519,237,1348,analyst 47 | "Meeting, convention, and event planners",117,859,27,Na,90,840,planner 48 | Fundraisers,62,1136,14,Na,48,Na,Fundraiser 49 | Market research analysts and marketing specialists,203,1284,85,1411,118,1239, 50 | "Business operations specialists, all other",186,1090,74,1461,112,969, 51 | Accountants and auditors,1464,1132,618,1345,846,988,"auditor, Accountant" 52 | Appraisers and assessors of real estate,42,Na,21,Na,21,Na, 53 | Budget analysts,44,Na,17,Na,28,Na, 54 | Credit analysts,17,Na,8,Na,9,Na, 55 | Financial analysts,295,1426,168,1680,127,1171, 56 | Personal financial advisors,407,1419,248,1738,159,1033,advisor 
57 | Insurance underwriters,106,1149,44,Na,63,956,"insurer, underwriter, stamper, sealer" 58 | Financial examiners,17,Na,9,Na,8,Na, 59 | Credit counselors and loan officers,313,997,146,1186,166,906,Counselor 60 | "Tax examiners and collectors, and revenue agents",59,1051,20,Na,39,Na, 61 | Tax preparers,56,892,19,Na,37,Na, 62 | "Financial specialists, all other",66,1162,25,Na,40,Na,specialist 63 | COMPUTATIONAL,4009,1428,3036,1503,973,1245, 64 | Computer and information research scientists,28,Na,23,Na,5,Na, 65 | Computer systems analysts,499,1389,325,1462,173,1256, 66 | Information security analysts,67,1538,56,1562,11,Na, 67 | Computer programmers,450,1438,357,1501,93,1302,programmer 68 | "Software developers, applications and systems software",1287,1682,1054,1751,232,1415,developer 69 | Web developers,151,1165,98,1233,53,1026, 70 | Computer support specialists,396,1079,291,1135,105,908, 71 | Database administrators,90,1536,58,1829,32,Na, 72 | Network and computer systems administrators,208,1242,179,1266,28,Na, 73 | Computer network architects,115,1552,100,1577,15,Na, 74 | "Computer occupations, all other",490,1227,374,1252,116,1145, 75 | Actuaries,24,Na,18,Na,6,Na,actuary 76 | Mathematicians,6,Na,6,Na,0,Na,Mathematician 77 | Operations research analysts,122,1441,59,1574,63,1325, 78 | Statisticians,76,1275,37,Na,39,Na,Statistician 79 | Miscellaneous mathematical science occupations,1,Na,1,Na,0,Na, 80 | ENGINEERING,2656,1424,2272,1452,383,1257,engineer 81 | "Architects, except naval",138,1441,106,1492,31,Na, 82 | "Surveyors, cartographers, and photogrammetrists",29,Na,23,Na,6,Na,"photogrammetrist, cartographer, Surveyor" 83 | Aerospace engineers,140,1662,122,1668,18,Na, 84 | Agricultural engineers,5,Na,5,Na,0,Na, 85 | Biomedical engineers,12,Na,10,Na,2,Na, 86 | Chemical engineers,79,1532,69,1583,10,Na, 87 | Civil engineers,316,1460,275,1474,41,Na, 88 | Computer hardware engineers,72,1876,62,1871,10,Na, 89 | Electrical and electronics 
engineers,283,1778,246,1819,37,Na, 90 | Environmental engineers,35,Na,26,Na,9,Na, 91 | "Industrial engineers, including health and safety",205,1447,168,1430,37,Na, 92 | Marine engineers and naval architects,9,Na,9,Na,0,Na, 93 | Materials engineers,36,Na,33,Na,4,Na, 94 | Mechanical engineers,316,1534,294,1550,23,Na,mechanic 95 | "Mining and geological engineers, including mining safety engineers",15,Na,15,Na,0,Na,geologist 96 | Nuclear engineers,5,Na,3,Na,2,Na, 97 | Petroleum engineers,43,Na,39,Na,3,Na, 98 | "Engineers, all other",393,1527,339,1537,54,1448, 99 | Drafters,114,977,91,977,23,Na,Drafter 100 | "Engineering technicians, except drafters",352,963,284,984,68,827,technician 101 | Surveying and mapping technicians,58,1012,54,1031,4,Na, 102 | SCIENCE,1176,1206,662,1379,514,1067, 103 | Agricultural and food scientists,22,Na,12,Na,10,Na, 104 | Biological scientists,74,1233,46,Na,28,Na,biologist 105 | Conservation scientists and foresters,23,Na,16,Na,7,Na, 106 | Medical scientists,151,1250,68,1362,84,1082, 107 | "Life scientists, all other",1,Na,1,Na,0,Na, 108 | Astronomers and physicists,14,Na,11,Na,3,Na,"physicist, Astronomer" 109 | Atmospheric and space scientists,8,Na,8,Na,1,Na,"Atmospheric, astronaut" 110 | Chemists and materials scientists,93,1432,61,1496,33,Na,Chemist 111 | Environmental scientists and geoscientists,90,1423,65,1740,25,Na,geoscientist 112 | "Physical scientists, all other",189,1553,121,1770,68,1170,scientist 113 | Economists,29,Na,17,Na,12,Na,Economist 114 | Survey researchers,0,Na,0,Na,0,Na, 115 | Psychologists,114,1367,31,Na,83,1189,"Psychologist, shrink" 116 | Sociologists,0,Na,0,Na,0,Na,Sociologist 117 | Urban and regional planners,22,Na,13,Na,9,Na, 118 | Miscellaneous social scientists and related workers,37,Na,19,Na,19,Na, 119 | Agricultural and food science technicians,28,Na,21,Na,7,Na, 120 | Biological technicians,20,Na,10,Na,10,Na, 121 | Chemical technicians,75,944,43,Na,32,Na, 122 | Geological and petroleum 
technicians,22,Na,18,Na,4,Na, 123 | Nuclear technicians,2,Na,2,Na,0,Na, 124 | Social science research assistants,3,Na,0,Na,3,Na, 125 | "Miscellaneous life, physical, and social science technicians",157,846,79,1001,78,780, 126 | SOCIAL SERVICE,2143,889,776,973,1367,845, 127 | Counselors,635,904,184,908,451,902,Counselor 128 | Social workers,677,877,127,943,549,862, 129 | Probation officers and correctional treatment specialists,85,967,42,Na,43,Na, 130 | Social and human service assistants,173,676,23,Na,149,673, 131 | "Miscellaneous community and social service specialists, including health educators and community health workers",92,831,29,Na,63,728, 132 | Clergy,376,1002,316,1021,60,924,"clergy, priest" 133 | "Directors, religious activities and education",62,929,31,Na,31,Na, 134 | "Religious workers, all other",44,Na,23,Na,21,Na, 135 | LEGAL,1346,1391,624,1877,722,1135, 136 | Lawyers,803,1886,503,1914,300,1717,Lawyer 137 | Judicial law clerks,11,Na,1,Na,10,Na, 138 | "Judges, magistrates, and other judicial workers",54,1952,33,Na,20,Na,"Judge, magistrate" 139 | Paralegals and legal assistants,341,927,47,Na,294,910, 140 | Miscellaneous legal support workers,136,770,40,Na,97,746, 141 | EDUCATION,6884,956,1849,1144,5034,907, 142 | Postsecondary teachers,917,1258,516,1405,401,1144, 143 | Preschool and kindergarten teachers,517,616,11,Na,506,618, 144 | Elementary and middle school teachers,2806,974,543,1077,2262,957,"teacher, educator" 145 | Secondary school teachers,1048,1066,438,1149,610,1006, 146 | Special education teachers,297,987,38,Na,258,990, 147 | Other teachers and instructors,378,896,179,1024,199,817,instructor 148 | "Archivists, curators, and museum technicians",38,Na,18,Na,20,Na, 149 | Librarians,130,991,27,Na,102,966,Librarian 150 | Library technicians,18,Na,4,Na,15,Na, 151 | Teacher assistants,614,541,48,Na,565,530,assistant 152 | "Other education, training, and library workers",123,1001,28,Na,95,1031, 153 | ARTS,1643,1001,930,1088,713,942, 154 | Artists 
and related workers,58,1166,39,Na,20,Na,Artist 155 | Designers,593,993,301,1099,291,918,Designer 156 | Actors,12,Na,8,Na,4,Na, 157 | Producers and directors,120,1270,67,1340,53,1234,"Producer, director" 158 | "Athletes, coaches, umpires, and related workers",147,780,108,818,39,Na,"Athlete, coach, umpire" 159 | Dancers and choreographers,11,Na,2,Na,9,Na,"Dancer, choreographer" 160 | "Musicians, singers, and related workers",42,Na,33,Na,9,Na,"Musician, singer, deejay, dj, Guitarist, Bassist, drummer" 161 | "Entertainers and performers, sports and related workers, all other",14,Na,11,Na,3,Na,Entertainer 162 | Announcers,25,Na,21,Na,5,Na,Announcer 163 | "News analysts, reporters and correspondents",56,1218,29,Na,27,Na,reporter 164 | Public relations specialists,120,1211,49,Na,71,971, 165 | Editors,108,1148,58,1205,50,1125,Editor 166 | Technical writers,52,1158,22,Na,30,Na, 167 | Writers and authors,79,1232,36,Na,42,Na,"Writer, author, journalist" 168 | Miscellaneous media and communication workers,46,Na,16,Na,30,Na, 169 | Broadcast and sound engineering technicians and radio operators,83,954,77,937,7,Na, 170 | Photographers,45,Na,24,Na,21,Na,Photographer 171 | "Television, video, and motion picture camera operators and editors",30,Na,28,Na,1,Na,videographer 172 | "Media and communication equipment workers, all other",2,Na,2,Na,0,Na, 173 | HEALTHCARE PROFESSIONAL,6566,1041,1639,1272,4928,991, 174 | Chiropractors,22,Na,15,Na,7,Na,Chiropractor 175 | Dentists,59,1656,39,Na,20,Na,Dentist 176 | Dietitians and nutritionists,79,886,9,Na,69,879,"Dietitian, nutritionist, dietician, Naturopath" 177 | Optometrists,19,Na,6,Na,13,Na,Optometrist 178 | Pharmacists,206,1920,98,2117,108,1811,Pharmacist 179 | Physicians and surgeons,740,1824,457,1915,283,1533,"Physician, surgeon, doctor, anesthesiologist" 180 | Physician assistants,57,1368,17,Na,40,Na, 181 | Podiatrists,9,Na,5,Na,4,Na,Podiatrist 182 | Audiologists,8,Na,1,Na,6,Na,Audiologist 183 | Occupational 
therapists,74,1210,10,Na,64,1199,therapist 184 | Physical therapists,178,1265,56,1347,123,1215, 185 | Radiation therapists,12,Na,5,Na,7,Na, 186 | Recreational therapists,6,Na,2,Na,4,Na, 187 | Respiratory therapists,99,1000,32,Na,67,937, 188 | Speech-language pathologists,108,1147,1,Na,106,1148,pathologist 189 | Exercise physiologists,3,Na,3,Na,0,Na,physiologist 190 | "Therapists, all other",132,944,31,Na,101,951, 191 | Veterinarians,55,1455,16,Na,39,Na,Veterinarian 192 | Registered nurses,2382,1116,278,1222,2104,1098,nurse 193 | Nurse anesthetists,23,Na,11,Na,12,Na, 194 | Nurse midwives,6,Na,0,Na,6,Na, 195 | Nurse practitioners,115,1532,11,Na,103,1522, 196 | "Health diagnosing and treating practitioners, all other",2,Na,0,Na,2,Na, 197 | Clinical laboratory technologists and technicians,270,901,69,1089,201,796,laborer 198 | Dental hygienists,86,914,6,Na,80,953, 199 | Diagnostic related technologists and technicians,253,964,76,1106,177,908, 200 | Emergency medical technicians and paramedics,175,811,126,899,49,Na, 201 | Health practitioner support technologists and technicians,487,636,99,652,389,633, 202 | Licensed practical and licensed vocational nurses,508,743,48,Na,459,737, 203 | Medical records and health information technicians,174,740,17,Na,157,723, 204 | "Opticians, dispensing",44,Na,21,Na,23,Na, 205 | Miscellaneous health technologists and technicians,99,671,32,Na,66,642, 206 | Other healthcare practitioners and technical occupations,78,1128,40,Na,38,Na, 207 | HEALTHCARE SUPPORT,2395,498,320,577,2074,490, 208 | "Nursing, psychiatric, and home health aides",1400,467,163,526,1237,457,psychiatric 209 | Occupational therapy assistants and aides,17,Na,4,Na,12,Na, 210 | Physical therapist assistants and aides,53,742,18,Na,35,Na, 211 | Massage therapists,37,Na,16,Na,22,Na, 212 | Dental assistants,188,531,14,Na,175,522, 213 | Medical assistants,422,539,35,Na,387,530, 214 | Medical transcriptionists,28,Na,3,Na,26,Na, 215 | Pharmacy aides,22,Na,4,Na,18,Na, 216 | 
Veterinary assistants and laboratory animal caretakers,21,Na,4,Na,17,Na, 217 | Phlebotomists,91,551,15,Na,76,534,Phlebotomist 218 | "Miscellaneous healthcare support occupations, including medical equipment preparers",115,524,44,Na,71,511, 219 | PROTECTIVE SERVICE,2729,796,2181,851,547,655, 220 | First-line supervisors of correctional officers,57,856,41,Na,16,Na, 221 | First-line supervisors of police and detectives,114,1427,97,1425,17,Na,detective 222 | First-line supervisors of fire fighting and prevention workers,42,Na,39,Na,3,Na, 223 | "First-line supervisors of protective service workers, all other",72,808,56,825,16,Na, 224 | Firefighters,260,1033,245,1052,16,Na,Firefighter 225 | Fire inspectors,18,Na,15,Na,3,Na, 226 | "Bailiffs, correctional officers, and jailers",453,754,341,779,112,686,"Bailiff, jailer, cop" 227 | Detectives and criminal investigators,141,1159,100,1265,41,Na, 228 | Fish and game wardens,6,Na,5,Na,1,Na, 229 | Parking enforcement workers,9,Na,6,Na,3,Na, 230 | Police and sheriff's patrol officers,655,1002,569,1001,86,1009,"officer, equerry" 231 | Transit and railroad police,3,Na,3,Na,0,Na, 232 | Animal control workers,4,Na,2,Na,2,Na, 233 | Private detectives and investigators,85,843,49,Na,36,Na, 234 | Security guards and gaming surveillance officers,708,567,555,592,153,515,guard 235 | Crossing guards,26,Na,13,Na,13,Na, 236 | Transportation security screeners,33,Na,22,Na,11,Na, 237 | "Lifeguards and other recreational, and all other protective service workers",42,Na,24,Na,18,Na, 238 | CULINARY,4124,441,2133,481,1991,414, 239 | Chefs and head cooks,340,619,285,656,55,492,Chef 240 | First-line supervisors of food preparation and serving workers,378,498,156,621,222,458, 241 | Cooks,1302,416,808,427,494,400,cook 242 | Food preparation workers,366,402,174,414,192,388, 243 | Bartenders,252,521,110,569,142,493,Bartender 244 | "Combined food preparation and serving workers, including fast food",173,391,67,401,107,380, 245 | "Counter attendants, 
cafeteria, food concession, and coffee shop",56,354,28,Na,28,Na,attendant 246 | Waiters and waitresses,868,451,305,501,563,411, 247 | "Food servers, nonrestaurant",93,509,31,Na,62,485,server 248 | Dining room and cafeteria attendants and bartender helpers,107,403,59,389,48,Na, 249 | Dishwashers,117,398,98,401,19,Na,Dishwasher 250 | "Hosts and hostesses, restaurant, lounge, and coffee shop",66,400,8,Na,58,397, 251 | "Food preparation and serving related workers, all other",6,Na,4,Na,3,Na, 252 | GROUNDSKEEPING,3605,486,2330,517,1275,419, 253 | First-line supervisors of housekeeping and janitorial workers,172,620,108,700,64,571, 254 | "First-line supervisors of landscaping, lawn service, and groundskeeping workers",80,649,79,653,1,Na, 255 | Janitors and building cleaners,1536,507,1111,547,425,429,"janitor, cleaner" 256 | Maids and housekeeping cleaners,876,416,134,475,742,407,"Maid, housekeeper" 257 | Pest control workers,77,585,74,591,3,Na, 258 | Grounds maintenance workers,862,469,824,473,39,Na, 259 | SERVICE,2427,498,664,597,1763,475, 260 | First-line supervisors of gaming workers,117,769,65,900,52,680, 261 | First-line supervisors of personal service workers,60,608,25,Na,35,Na, 262 | Animal trainers,26,Na,19,Na,8,Na,trainer 263 | Nonfarm animal caretakers,100,505,32,Na,68,501,caretaker 264 | Gaming services workers,69,676,30,Na,39,Na, 265 | Motion picture projectionists,3,Na,3,Na,0,Na, 266 | "Ushers, lobby attendants, and ticket takers",9,Na,5,Na,4,Na,Usher 267 | Miscellaneous entertainment attendants and related workers,69,485,42,Na,28,Na, 268 | Embalmers and funeral attendants,2,Na,2,Na,0,Na, 269 | "Morticians, undertakers, and funeral directors",23,Na,17,Na,7,Na,"Mortician, undertaker" 270 | Barbers,46,Na,33,Na,12,Na,Barber 271 | "Hairdressers, hairstylists, and cosmetologists",253,461,23,Na,229,463,"Hairdresser, hairstylist, cosmetologist, beautician, cosmetician, esthetician" 272 | Miscellaneous personal appearance workers,191,501,32,Na,159,497, 273 | 
"Baggage porters, bellhops, and concierges",75,608,63,606,12,Na,"porter, bellhop, concierge" 274 | Tour and travel guides,21,Na,11,Na,9,Na,guide 275 | Childcare workers,407,437,14,Na,393,430, 276 | Personal care aides,680,462,128,537,552,441, 277 | Recreation and fitness workers,185,555,78,684,107,526, 278 | Residential advisors,24,Na,8,Na,16,Na, 279 | "Personal care and service workers, all other",67,499,35,Na,32,Na, 280 | SALES,9725,716,5423,880,4303,578, 281 | First-line supervisors of retail sales workers,2326,711,1296,825,1030,614, 282 | First-line supervisors of non-retail sales workers,835,1028,556,1140,280,896, 283 | Cashiers,1342,415,411,471,931,405,Cashier 284 | Counter and rental clerks,73,594,35,Na,37,Na, 285 | Parts salespersons,92,601,82,600,11,Na,salesperson 286 | Retail salespersons,1918,590,1159,694,759,494, 287 | Advertising sales agents,161,925,78,1155,83,729, 288 | Insurance sales agents,427,815,194,1028,232,717, 289 | "Securities, commodities, and financial services sales agents",211,1155,146,1461,65,767, 290 | Travel agents,62,711,9,Na,53,685,agent 291 | "Sales representatives, services, all other",406,966,268,1147,139,699, 292 | "Sales representatives, wholesale and manufacturing",1138,1020,843,1066,295,917, 293 | "Models, demonstrators, and product promoters",15,Na,4,Na,11,Na, 294 | Real estate brokers and sales agents,463,837,197,1052,266,735, 295 | Sales engineers,33,Na,31,Na,2,Na, 296 | Telemarketers,39,Na,17,Na,21,Na,Telemarketer 297 | "Door-to-door sales workers, news and street vendors, and related workers",28,Na,9,Na,18,Na, 298 | "Sales and related workers, all other",158,916,89,1088,70,727, 299 | OFFICE,13894,656,3961,693,9933,646, 300 | First-line supervisors of office and administrative support workers,1297,812,434,878,863,781, 301 | "Switchboard operators, including answering service",17,Na,7,Na,10,Na, 302 | Telephone operators,22,Na,4,Na,18,Na, 303 | "Communications equipment operators, all other",5,Na,2,Na,3,Na, 304 | Bill and 
account collectors,152,657,54,674,98,648, 305 | Billing and posting clerks,406,657,39,Na,366,664, 306 | "Bookkeeping, accounting, and auditing clerks",769,692,87,690,682,692, 307 | Gaming cage workers,11,Na,2,Na,10,Na, 308 | Payroll and timekeeping clerks,128,757,17,Na,111,751, 309 | Procurement clerks,35,Na,15,Na,21,Na, 310 | Tellers,264,514,33,Na,231,516,Teller 311 | "Financial clerks, all other",61,767,30,Na,32,Na, 312 | Brokerage clerks,3,Na,1,Na,1,Na, 313 | Correspondence clerks,3,Na,1,Na,2,Na, 314 | "Court, municipal, and license clerks",60,755,9,Na,51,743, 315 | "Credit authorizers, checkers, and clerks",42,Na,12,Na,29,Na, 316 | Customer service representatives,1760,621,611,690,1149,604, 317 | "Eligibility interviewers, government programs",67,781,17,Na,50,805, 318 | File clerks,145,634,25,Na,120,627, 319 | "Hotel, motel, and resort desk clerks",127,481,58,486,69,467, 320 | "Interviewers, except eligibility and loan",105,615,16,Na,89,617, 321 | "Library assistants, clerical",35,Na,7,Na,28,Na, 322 | Loan interviewers and clerks,134,710,25,Na,109,722, 323 | New accounts clerks,20,Na,3,Na,17,Na, 324 | Order clerks,74,599,26,Na,48,Na, 325 | "Human resources assistants, except payroll and timekeeping",50,737,11,Na,40,Na, 326 | Receptionists and information clerks,852,575,72,619,781,569, 327 | Reservation and transportation ticket agents and travel clerks,95,713,34,Na,61,680, 328 | "Information and record clerks, all other",100,618,22,Na,78,616, 329 | Cargo and freight agents,20,Na,12,Na,9,Na, 330 | Couriers and messengers,153,752,134,750,19,Na,Courier 331 | Dispatchers,250,698,109,759,141,655, 332 | "Meter readers, utilities",39,Na,34,Na,4,Na, 333 | Postal service clerks,127,927,76,974,51,833, 334 | Postal service mail carriers,302,954,187,1021,115,854, 335 | "Postal service mail sorters, processors, and processing machine operators",53,828,27,Na,26,Na, 336 | "Production, planning, and expediting clerks",256,838,116,978,141,732, 337 | "Shipping, receiving, and 
traffic clerks",502,591,354,604,148,566, 338 | Stock clerks and order fillers,1027,520,651,537,376,506, 339 | "Weighers, measurers, checkers, and samplers, recordkeeping",59,629,29,Na,31,Na, 340 | Secretaries and administrative assistants,2223,687,124,786,2099,683, 341 | Computer operators,58,751,25,Na,33,Na, 342 | Data entry keyers,223,619,55,589,169,638, 343 | Word processors and typists,68,650,6,Na,62,639, 344 | Desktop publishers,1,Na,1,Na,0,Na, 345 | Insurance claims and policy processing clerks,259,689,56,762,203,675, 346 | "Mail clerks and mail machine operators, except postal service",63,563,24,Na,39,Na, 347 | "Office clerks, general",929,620,156,609,773,622,"clerk, secretary" 348 | "Office machine operators, except computer",31,Na,14,Na,16,Na, 349 | Proofreaders and copy markers,2,Na,0,Na,2,Na, 350 | Statistical assistants,15,Na,5,Na,10,Na, 351 | "Office and administrative support workers, all other",391,745,93,852,298,718, 352 | AGRICULTURAL,810,464,637,477,174,437,Agricultural 353 | "First-line supervisors of farming, fishing, and forestry workers",42,Na,32,Na,10,Na, 354 | Agricultural inspectors,12,Na,7,Na,6,Na, 355 | Animal breeders,2,Na,2,Na,0,Na,breeder 356 | "Graders and sorters, agricultural products",83,486,32,Na,51,468, 357 | Miscellaneous agricultural workers,613,445,511,460,102,398, 358 | Fishers and related fishing workers,11,Na,11,Na,0,Na, 359 | Hunters and trappers,0,Na,0,Na,0,Na,"Hunter, trapper" 360 | Forest and conservation workers,15,Na,10,Na,5,Na, 361 | Logging workers,31,Na,31,Na,0,Na, 362 | CONSTRUCTION,5722,749,5586,751,137,704,"constructor, builder" 363 | First-line supervisors of construction trades and extraction workers,560,1040,540,1047,20,Na, 364 | Boilermakers,21,Na,21,Na,0,Na,Boilermaker 365 | "Brickmasons, blockmasons, and stonemasons",122,652,122,652,0,Na, 366 | Carpenters,802,687,792,687,10,Na,Carpenter 367 | "Carpet, floor, and tile installers and finishers",89,637,89,634,1,Na,Installer 368 | "Cement masons, concrete 
finishers, and terrazzo workers",44,Na,44,Na,0,Na, 369 | Construction laborers,1181,639,1155,642,25,Na, 370 | "Paving, surfacing, and tamping equipment operators",10,Na,10,Na,0,Na, 371 | Pile-driver operators,2,Na,2,Na,0,Na, 372 | Operating engineers and other construction equipment operators,324,856,318,859,6,Na, 373 | "Drywall installers, ceiling tile installers, and tapers",121,596,119,595,2,Na, 374 | Electricians,651,888,632,891,19,Na,Electrician 375 | Glaziers,33,Na,32,Na,0,Na,Glazier 376 | Insulation workers,43,Na,41,Na,3,Na, 377 | "Painters, construction and maintenance",344,585,330,587,14,Na,Painter 378 | Paperhangers,0,Na,0,Na,0,Na, 379 | "Pipelayers, plumbers, pipefitters, and steamfitters",456,863,455,862,2,Na,plumber 380 | Plasterers and stucco masons,20,Na,19,Na,1,Na, 381 | Reinforcing iron and rebar workers,9,Na,9,Na,0,Na, 382 | Roofers,171,584,170,580,2,Na,Roofer 383 | Sheet metal workers,106,766,100,776,6,Na, 384 | Structural iron and steel workers,54,869,52,864,2,Na, 385 | Solar photovoltaic installers,8,Na,8,Na,0,Na, 386 | "Helpers, construction trades",47,Na,47,Na,0,Na,"trader, arborist, dealer" 387 | Construction and building inspectors,67,939,58,965,9,Na,inspector 388 | Elevator installers and repairers,23,Na,23,Na,0,Na,repairer 389 | Fence erectors,33,Na,33,Na,0,Na, 390 | Hazardous materials removal workers,39,Na,35,Na,5,Na, 391 | Highway maintenance workers,91,754,91,755,0,Na, 392 | Rail-track laying and maintenance equipment operators,9,Na,9,Na,0,Na, 393 | Septic tank servicers and sewer pipe cleaners,8,Na,8,Na,0,Na, 394 | Miscellaneous construction and related workers,25,Na,23,Na,2,Na, 395 | "Derrick, rotary drill, and service unit operators, oil, gas, and mining",28,Na,27,Na,1,Na, 396 | "Earth drillers, except oil and gas",30,Na,30,Na,0,Na, 397 | "Explosives workers, ordnance handling experts, and blasters",5,Na,5,Na,0,Na, 398 | Mining machine operators,68,1106,65,1098,2,Na, 399 | "Roof bolters, mining",3,Na,3,Na,0,Na, 400 | "Roustabouts, 
oil and gas",7,Na,7,Na,0,Na, 401 | Helpers--extraction workers,6,Na,6,Na,0,Na, 402 | Other extraction workers,61,900,58,918,3,Na, 403 | MAINTENANCE,4301,839,4159,842,143,761, 404 | "First-line supervisors of mechanics, installers, and repairers",270,1032,252,1033,18,Na, 405 | "Computer, automated teller, and office machine repairers",194,856,166,865,28,Na, 406 | Radio and telecommunications equipment installers and repairers,139,862,126,879,13,Na, 407 | Avionics technicians,4,Na,4,Na,0,Na, 408 | "Electric motor, power tool, and related repairers",22,Na,20,Na,1,Na, 409 | "Electrical and electronics installers and repairers, transportation equipment",2,Na,2,Na,0,Na, 410 | "Electrical and electronics repairers, industrial and utility",18,Na,17,Na,0,Na, 411 | "Electronic equipment installers and repairers, motor vehicles",17,Na,17,Na,0,Na, 412 | Electronic home entertainment equipment installers and repairers,30,Na,28,Na,2,Na, 413 | Security and fire alarm systems installers,67,911,65,913,2,Na, 414 | Aircraft mechanics and service technicians,133,1025,125,1032,7,Na, 415 | Automotive body and related repairers,120,846,118,849,2,Na, 416 | Automotive glass installers and repairers,21,Na,20,Na,1,Na, 417 | Automotive service technicians and mechanics,710,722,694,724,16,Na, 418 | Bus and truck mechanics and diesel engine specialists,327,831,327,830,0,Na, 419 | Heavy vehicle and mobile equipment service technicians and mechanics,206,928,206,928,0,Na, 420 | Small engine mechanics,39,Na,39,Na,0,Na, 421 | "Miscellaneous vehicle and mobile equipment mechanics, installers, and repairers",66,592,65,591,1,Na, 422 | Control and valve installers and repairers,23,Na,23,Na,0,Na, 423 | "Heating, air conditioning, and refrigeration mechanics and installers",341,806,337,810,4,Na, 424 | Home appliance repairers,36,Na,36,Na,0,Na, 425 | Industrial and refractory machinery mechanics,394,895,383,894,11,Na, 426 | "Maintenance and repair workers, general",469,773,459,771,10,Na, 427 | "Maintenance 
workers, machinery",31,Na,30,Na,1,Na, 428 | Millwrights,49,Na,48,Na,1,Na, 429 | Electrical power-line installers and repairers,113,1105,112,1105,0,Na, 430 | Telecommunications line installers and repairers,157,882,148,880,9,Na, 431 | Precision instrument and equipment repairers,64,996,60,1009,4,Na, 432 | Wind turbine service technicians,3,Na,3,Na,0,Na, 433 | "Coin, vending, and amusement machine servicers and repairers",38,Na,34,Na,4,Na, 434 | Commercial divers,1,Na,1,Na,0,Na, 435 | Locksmiths and safe repairers,12,Na,12,Na,0,Na, 436 | Manufactured building and mobile home installers,4,Na,4,Na,1,Na, 437 | Riggers,7,Na,7,Na,0,Na,Rigger 438 | Signal and track switch repairers,9,Na,9,Na,0,Na, 439 | "Helpers--installation, maintenance, and repair workers",17,Na,17,Na,0,Na, 440 | "Other installation, maintenance, and repair workers",150,792,144,810,6,Na, 441 | PRODUCTION,7551,663,5548,729,2003,519, 442 | First-line supervisors of production and operating workers,783,875,650,924,133,623, 443 | "Aircraft structure, surfaces, rigging, and systems assemblers",15,Na,11,Na,4,Na, 444 | "Electrical, electronics, and electromechanical assemblers",123,554,59,566,64,544, 445 | Engine and other machine assemblers,14,Na,12,Na,2,Na, 446 | Structural metal fabricators and fitters,31,Na,28,Na,3,Na, 447 | Miscellaneous assemblers and fabricators,950,581,573,637,377,512, 448 | Bakers,150,505,69,570,80,475,baker 449 | "Butchers and other meat, poultry, and fish processing workers",247,542,187,582,60,463, 450 | "Food and tobacco roasting, baking, and drying machine operators and tenders",9,Na,7,Na,1,Na, 451 | Food batchmakers,79,500,25,Na,54,489, 452 | Food cooking machine operators and tenders,7,Na,5,Na,2,Na, 453 | "Food processing workers, all other",132,594,82,679,50,508, 454 | Computer control programmers and operators,83,833,81,857,2,Na, 455 | "Extruding and drawing machine setters, operators, and tenders, metal and plastic",8,Na,7,Na,1,Na, 456 | "Forging machine setters, operators, 
and tenders, metal and plastic",6,Na,6,Na,0,Na, 457 | "Rolling machine setters, operators, and tenders, metal and plastic",15,Na,12,Na,3,Na, 458 | "Cutting, punching, and press machine setters, operators, and tenders, metal and plastic",78,633,62,674,15,Na, 459 | "Drilling and boring machine tool setters, operators, and tenders, metal and plastic",5,Na,5,Na,1,Na, 460 | "Grinding, lapping, polishing, and buffing machine tool setters, operators, and tenders, metal and plastic",41,Na,39,Na,3,Na, 461 | "Lathe and turning machine tool setters, operators, and tenders, metal and plastic",12,Na,11,Na,1,Na, 462 | "Milling and planing machine setters, operators, and tenders, metal and plastic",1,Na,1,Na,0,Na, 463 | Machinists,338,834,320,840,17,Na,Machinist 464 | "Metal furnace operators, tenders, pourers, and casters",29,Na,28,Na,1,Na, 465 | "Model makers and patternmakers, metal and plastic",6,Na,3,Na,3,Na, 466 | "Molders and molding machine setters, operators, and tenders, metal and plastic",47,Na,39,Na,9,Na, 467 | "Multiple machine tool setters, operators, and tenders, metal and plastic",1,Na,1,Na,0,Na, 468 | Tool and die makers,49,Na,49,Na,0,Na, 469 | "Welding, soldering, and brazing workers",568,760,545,767,23,Na, 470 | "Heat treating equipment setters, operators, and tenders, metal and plastic",4,Na,4,Na,0,Na, 471 | "Layout workers, metal and plastic",4,Na,4,Na,1,Na, 472 | "Plating and coating machine setters, operators, and tenders, metal and plastic",24,Na,24,Na,0,Na, 473 | "Tool grinders, filers, and sharpeners",7,Na,7,Na,0,Na, 474 | "Metal workers and plastic workers, all other",351,639,278,678,72,581, 475 | Prepress technicians and workers,14,Na,12,Na,2,Na, 476 | Printing press operators,160,707,134,729,26,Na, 477 | Print binding and finishing workers,16,Na,9,Na,6,Na, 478 | Laundry and dry-cleaning workers,133,466,53,487,80,460, 479 | "Pressers, textile, garment, and related materials",21,Na,9,Na,12,Na, 480 | Sewing machine operators,147,493,42,Na,105,476, 481 | 
Shoe and leather workers and repairers,5,Na,4,Na,1,Na, 482 | Shoe machine operators and tenders,1,Na,0,Na,1,Na, 483 | "Tailors, dressmakers, and sewers",37,Na,9,Na,27,Na,"Tailor, dressmaker" 484 | Textile bleaching and dyeing machine operators and tenders,2,Na,2,Na,0,Na, 485 | "Textile cutting machine setters, operators, and tenders",9,Na,7,Na,2,Na, 486 | "Textile knitting and weaving machine setters, operators, and tenders",8,Na,3,Na,4,Na, 487 | "Textile winding, twisting, and drawing out machine setters, operators, and tenders",7,Na,5,Na,2,Na, 488 | "Extruding and forming machine setters, operators, and tenders, synthetic and glass fibers",0,Na,0,Na,0,Na, 489 | Fabric and apparel patternmakers,4,Na,3,Na,1,Na, 490 | Upholsterers,29,Na,21,Na,7,Na,Upholsterer 491 | "Textile, apparel, and furnishings workers, all other",16,Na,12,Na,4,Na, 492 | Cabinetmakers and bench carpenters,40,Na,38,Na,2,Na, 493 | Furniture finishers,6,Na,6,Na,0,Na, 494 | "Model makers and patternmakers, wood",0,Na,0,Na,0,Na, 495 | "Sawing machine setters, operators, and tenders, wood",26,Na,22,Na,4,Na, 496 | "Woodworking machine setters, operators, and tenders, except sawing",23,Na,21,Na,1,Na, 497 | "Woodworkers, all other",17,Na,14,Na,3,Na,Woodworker 498 | "Power plant operators, distributors, and dispatchers",35,Na,34,Na,1,Na, 499 | Stationary engineers and boiler operators,84,996,81,1012,3,Na, 500 | Water and wastewater treatment plant and system operators,82,880,79,868,3,Na, 501 | Miscellaneous plant and system operators,35,Na,33,Na,3,Na, 502 | "Chemical processing machine setters, operators, and tenders",62,1052,57,1082,5,Na, 503 | "Crushing, grinding, polishing, mixing, and blending workers",82,652,75,668,7,Na, 504 | Cutting workers,51,685,41,Na,9,Na, 505 | "Extruding, forming, pressing, and compacting machine setters, operators, and tenders",31,Na,25,Na,6,Na, 506 | "Furnace, kiln, oven, drier, and kettle operators and tenders",6,Na,5,Na,0,Na, 507 | "Inspectors, testers, sorters, samplers, 
and weighers",701,710,440,844,260,583, 508 | Jewelers and precious stone and metal workers,19,Na,11,Na,7,Na,Jeweler 509 | "Medical, dental, and ophthalmic laboratory technicians",86,648,44,Na,42,Na, 510 | Packaging and filling machine operators and tenders,239,518,118,605,120,482, 511 | Painting workers,129,708,110,733,18,Na, 512 | Photographic process workers and processing machine operators,26,Na,12,Na,14,Na, 513 | Semiconductor processors,1,Na,1,Na,0,Na, 514 | Adhesive bonding machine operators and tenders,9,Na,5,Na,4,Na, 515 | "Cleaning, washing, and metal pickling equipment operators and tenders",4,Na,2,Na,1,Na, 516 | Cooling and freezing equipment operators and tenders,4,Na,3,Na,0,Na, 517 | Etchers and engravers,12,Na,8,Na,4,Na,engraver 518 | "Molders, shapers, and casters, except metal and plastic",14,Na,12,Na,2,Na, 519 | "Paper goods machine setters, operators, and tenders",27,Na,22,Na,5,Na, 520 | Tire builders,8,Na,8,Na,0,Na, 521 | Helpers--production workers,24,Na,18,Na,7,Na, 522 | "Production workers, all other",846,625,643,666,203,501, 523 | TRANSPORTATION,6953,646,5998,679,955,494, 524 | Supervisors of transportation and material moving workers,186,894,153,898,33,Na, 525 | Aircraft pilots and flight engineers,114,1735,104,1830,9,Na,pilot 526 | Air traffic controllers and airfield operations specialists,32,Na,24,Na,8,Na, 527 | Flight attendants,63,846,20,Na,43,Na, 528 | "Ambulance drivers and attendants, except emergency medical technicians",18,Na,14,Na,4,Na, 529 | Bus drivers,323,615,184,681,138,572, 530 | Driver/sales workers and truck drivers,2687,747,2582,751,105,632,driver 531 | Taxi drivers and chauffeurs,253,585,216,600,38,Na, 532 | "Motor vehicle operators, all other",21,Na,18,Na,3,Na, 533 | Locomotive engineers and operators,44,Na,42,Na,2,Na, 534 | "Railroad brake, signal, and switch operators",5,Na,5,Na,0,Na, 535 | Railroad conductors and yardmasters,55,1117,52,1137,4,Na,yardmaster 536 | "Subway, streetcar, and other rail transportation 
workers",15,Na,12,Na,3,Na, 537 | Sailors and marine oilers,10,Na,9,Na,0,Na,sailor 538 | Ship and boat captains and operators,29,Na,28,Na,1,Na,captain 539 | Ship engineers,5,Na,4,Na,1,Na, 540 | Bridge and lock tenders,4,Na,4,Na,0,Na, 541 | Parking lot attendants,57,492,49,Na,8,Na, 542 | Automotive and watercraft service attendants,63,452,58,470,5,Na, 543 | Transportation inspectors,21,Na,14,Na,7,Na, 544 | "Transportation attendants, except flight attendants",17,Na,9,Na,8,Na, 545 | Other transportation workers,39,Na,35,Na,4,Na, 546 | Conveyor operators and tenders,7,Na,7,Na,0,Na, 547 | Crane and tower operators,75,988,71,1016,4,Na, 548 | "Dredge, excavating, and loading machine operators",25,Na,25,Na,0,Na, 549 | Hoist and winch operators,5,Na,5,Na,0,Na, 550 | Industrial truck and tractor operators,579,609,541,612,37,Na, 551 | Cleaners of vehicles and equipment,222,485,200,498,22,Na, 552 | "Laborers and freight, stock, and material movers, hand",1433,526,1214,547,219,455, 553 | Machine feeders and offbearers,30,Na,21,Na,9,Na, 554 | "Packers and packagers, hand",385,438,158,462,227,424, 555 | Pumping station operators,18,Na,17,Na,1,Na, 556 | Refuse and recyclable material collectors,72,501,66,496,6,Na, 557 | Mine shuttle car operators,0,Na,0,Na,0,Na, 558 | "Tank car, truck, and ship loaders",6,Na,6,Na,0,Na, 559 | "Material moving workers, all other",37,Na,32,Na,5,Na, 560 | OTHER SOURCES OCCUPATION DATA,Na,Na,Na,Na,Na,Na, 561 | Presidents in the U.S,,,45,,0,,president 562 | "https://datacenter.kidscount.org/data/tables/102-child-population-by-gender#detailed/1/any/false/1729,37,871,870,573,869,36,868,867,133/14,15,65/421,422",,,51,,49,,child 563 | https://data.worldbank.org/indicator/SP.POP.TOTL.FE.ZS?locations=US,,,99,,101,,person 564 | immeasurable neutral,,,50,,50,,"friend, user, patient" 565 | https://nces.ed.gov/fastfacts/display.asp?id=98,,,43,,57,,student 566 | 
https://www.insidehighered.com/news/2016/08/22/study-finds-gains-faculty-diversity-not-tenure-track,,,69,,31,,professor 567 | https://cawp.rutgers.edu/women-elective-office-2021,,,70,,30,,politician 568 | -------------------------------------------------------------------------------- /visualizations/delta_s_by_dist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SLAB-NLP/BUG/6b5314d193ecd04a6864ffbfe329b42cf2aa622e/visualizations/delta_s_by_dist.png --------------------------------------------------------------------------------