├── README.md ├── data └── synonyms.txt ├── figure4.py ├── figure6.py ├── requirements.txt ├── setup.sh ├── table1.py ├── table2.py ├── table3.py ├── table4.py ├── table5.py └── utils ├── __init__.py ├── chair.py ├── im_consistency.py ├── lm_consistency.py └── misc.py /README.md: -------------------------------------------------------------------------------- 1 | # Object Hallucination in Image Captioning 2 | 3 | Rohrbach*, Anna and Hendricks*, Lisa Anne, et al. "Object Hallucination in Image Captioning." EMNLP (2018). 4 | 5 | Find the paper [here](https://arxiv.org/pdf/1809.02156.pdf). 6 | ``` 7 | @inproceedings{objectHallucination, 8 | title = {Object Hallucination in Image Captioning}, 9 | author = {Rohrbach, Anna and Hendricks, Lisa Anne and Burns, Kaylee and Darrell, Trevor and Saenko, Kate}, 10 | booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, 11 | year = {2018} 12 | } 13 | ``` 14 | 15 | License: BSD 2-Clause license 16 | 17 | ## Running the Code 18 | 19 | **Getting Started** 20 | 21 | Run [setup.sh](setup.sh) to download the generated sentences used in our analysis. 22 | Additionally, you will need the MSCOCO annotations (both the instance segmentations and the ground truth captions). 23 | If you do not already have them, they can be downloaded [here](http://images.cocodataset.org/annotations/annotations_trainval2014.zip). 24 | Other Python requirements are listed in [requirements.txt](requirements.txt). 25 | 26 | **Replicating Results** 27 | 28 | After running ```setup.sh``` you should be able to replicate the results in our paper by running ```table1.py```, ```table2.py```, ```table3.py```, ```table4.py``` and ```figure6.py``` (example usage: ```python table1.py --annotation_path PATH_TO_COCO_ANNOTATIONS```, where ```coco/annotations``` is the default for ```--annotation_path```). 29 | Our scripts call on ```utils/chair.py``` to compute the CHAIR metric. See below for more details on ```utils/chair.py```. 30 | 31 | If you would like to run ```figure4.py``` (language and image model consistency) you will need to download some intermediate features. Please see the *Language and Image Model Consistency* section below. 32 | 33 | To reproduce our results on correlation with human scores, run ```python table5.py```. The file with image IDs used in the human evaluation, as well as the average human scores for each of the compared models, can be found in ```data/human_scores``` after running ```setup.sh```. 34 | 35 | **Evaluating CHAIR** 36 | 37 | See ```utils/chair.py``` to understand how we compute the CHAIRs and CHAIRi metrics. Evaluate generated sentences by passing the path to the generated sentences as well as the path to the MSCOCO annotations. 38 | 39 | Example usage is: 40 | 41 | ```python utils/chair.py --cap_file generated_sentences/fc_beam5_test.json --annotation_path coco/annotations``` 42 | 43 | where ```cap_file``` corresponds to a json file with your generated captions and ```annotation_path``` points to where the MSCOCO annotations are stored. 44 | 45 | We expect generated sentences to be stored as a dictionary with the following keys: 46 | 47 | * overall: metrics from the COCO evaluation toolkit computed over the entire dataset. 48 | * imgToEval: a dictionary with keys corresponding to image ids and values containing a caption, image_id, and sentence metrics for the particular caption (a minimal example is sketched below).
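A minimal, hypothetical sketch of that structure (the values here are made up and the per-caption metric list is abbreviated; the scripts expect Bleu_1 through Bleu_4, METEOR, CIDEr, ROUGE_L, and SPICE for each caption):

```
{"overall": {"Bleu_1": 0.71, "METEOR": 0.25, "CIDEr": 1.02, ...},
 "imgToEval": {"184613": {"image_id": 184613,
                          "caption": "a man riding a horse",
                          "Bleu_1": 0.80, "METEOR": 0.31, "CIDEr": 1.10, ...}}}
```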
49 | 50 | Note that this is the format of the captions output by the open-source code [here](https://github.com/ruotianluo/self-critical.pytorch), 51 | which we used to replicate most of the models presented in the paper. 52 | 53 | **Language and Image Model Consistency** 54 | 55 | To compute language and image consistency, we trained a classifier to predict class labels given an image and a language model to predict the next word in a sentence given all previous words. 56 | You can access the labels predicted by our image classifier in ```output/image_classifier``` and the words predicted by our language model [here](https://drive.google.com/drive/u/1/folders/1dnci1Kv6ez-hsFOqZt_gwiAv2FTAjDP4). 57 | To run our code, you need to first download the [zip file](https://drive.google.com/drive/u/1/folders/1dnci1Kv6ez-hsFOqZt_gwiAv2FTAjDP4) into the main directory and unzip it. 58 | Once you have these intermediate features, you can look at ```utils/lm_consistency.py``` and ```utils/im_consistency.py``` to understand how these metrics are computed. 59 | Running ```figure4.py``` will output the results from our paper (constructing the actual bar plot is left as an exercise to the reader). 60 | 61 | **Human Eval** 62 | 63 | Replicate the results from our human evaluation by running ```python table5.py```. Raw human evaluation scores can be found in ```data/human_scores``` after running ```setup.sh```. 64 | 65 | **Captioning Models** 66 | 67 | We generated sentences for the majority of models by training the open-source models available [here](https://github.com/ruotianluo/self-critical.pytorch). 68 | Within this framework, we wrote code for the LRCN model as well as the top-down deconstructed models (Table 3 in the paper). 69 | This code is available upon request. 70 | For the top-down model with bounding boxes, we used the code [here](https://github.com/peteanderson80/Up-Down-Captioner). 71 | For the Neural Baby Talk model, we used the code [here](https://github.com/jiasenlu/NeuralBabyTalk). 72 | For the GAN-based model, we used the sentences from the paper [here](https://arxiv.org/abs/1703.10476). Sentences were obtained directly from the author (we did not train the GAN model).
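**Programmatic Usage**

For reference, a minimal sketch of running the CHAIR evaluation from Python rather than the command line; it mirrors what ```table1.py``` and the ```__main__``` block of ```utils/chair.py``` do (paths assume the defaults above):

```
from utils import chair

cap_file = 'generated_sentences/fc_beam5_test.json'
annotation_path = 'coco/annotations'

#load the image ids covered by the generated captions
_, imids, _ = chair.load_generated_captions(cap_file)

#build ground truth object sets from segmentations and captions (slow; do once)
evaluator = chair.CHAIR(imids, annotation_path)
evaluator.get_annotations()

#score the captions, print CHAIRs/CHAIRi alongside the standard metrics, and save
cap_dict = evaluator.compute_chair(cap_file)
chair.print_metrics(cap_dict)
chair.save_hallucinated_words(cap_file, cap_dict)
```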
73 | -------------------------------------------------------------------------------- /data/synonyms.txt: -------------------------------------------------------------------------------- 1 | person, girl, boy, man, woman, kid, child, chef, baker, people, adult, rider, children, baby, worker, passenger, sister, biker, policeman, cop, officer, lady, cowboy, bride, groom, male, female, guy, traveler, mother, father, gentleman, pitcher, player, skier, snowboarder, skater, skateboarder, person, woman, guy, foreigner, child, gentleman, caller, offender, coworker, trespasser, patient, politician, soldier, grandchild, serviceman, walker, drinker, doctor, bicyclist, thief, buyer, teenager, student, camper, driver, solider, hunter, shopper, villager 2 | bicycle, bike, bicycle, bike, unicycle, minibike, trike 3 | car, automobile, van, minivan, sedan, suv, hatchback, cab, jeep, coupe, taxicab, limo, taxi 4 | motorcycle, scooter, motor bike, motor cycle, motorbike, scooter, moped 5 | airplane, jetliner, plane, air plane, monoplane, aircraft, jet, jetliner, airbus, biplane, seaplane 6 | bus, minibus, trolley 7 | train, locomotive, tramway, caboose 8 | truck, pickup, lorry, hauler, firetruck 9 | boat, ship, liner, sailboat, motorboat, dinghy, powerboat, speedboat, canoe, skiff, yacht, kayak, catamaran, pontoon, houseboat, vessel, rowboat, trawler, ferryboat, watercraft, tugboat, schooner, barge, ferry, sailboard, paddleboat, lifeboat, freighter, steamboat, riverboat, battleship, steamship 10 | traffic light, street light, traffic signal, stop light, streetlight, stoplight 11 | fire hydrant, hydrant 12 | stop sign 13 | parking meter 14 | bench, pew 15 | bird, ostrich, owl, seagull, goose, duck, parakeet, falcon, robin, pelican, waterfowl, heron, hummingbird, mallard, finch, pigeon, sparrow, seabird, osprey, blackbird, fowl, shorebird, woodpecker, egret, chickadee, quail, bluebird, kingfisher, buzzard, willet, gull, swan, bluejay, flamingo, cormorant, parrot, loon, gosling, waterbird, pheasant, rooster, sandpiper, crow, raven, turkey, oriole, cowbird, warbler, magpie, peacock, cockatiel, lorikeet, puffin, vulture, condor, macaw, peafowl, cockatoo, songbird 16 | cat, kitten, feline, tabby 17 | dog, puppy, beagle, pup, chihuahua, schnauzer, dachshund, rottweiler, canine, pitbull, collie, pug, terrier, poodle, labrador, doggie, doberman, mutt, doggy, spaniel, bulldog, sheepdog, weimaraner, corgi, cocker, greyhound, retriever, brindle, hound, whippet, husky 18 | horse, colt, pony, racehorse, stallion, equine, mare, foal, palomino, mustang, clydesdale, bronc, bronco 19 | sheep, lamb, ram, lamb, goat, ewe 20 | cow, cattle, oxen, ox, calf, cattle, holstein, heifer, buffalo, bull, zebu, bison 21 | elephant 22 | bear, panda 23 | zebra 24 | giraffe 25 | backpack, knapsack 26 | umbrella 27 | handbag, wallet, purse, briefcase 28 | tie, bow, bow tie 29 | suitcase, suit case, luggage 30 | frisbee 31 | skis, ski 32 | snowboard 33 | sports ball, ball 34 | kite 35 | baseball bat 36 | baseball glove 37 | skateboard 38 | surfboard, longboard, skimboard, shortboard, wakeboard 39 | tennis racket, racket 40 | bottle 41 | wine glass 42 | cup 43 | fork 44 | knife, pocketknife, knive 45 | spoon 46 | bowl, container 47 | banana 48 | apple 49 | sandwich, burger, sub, cheeseburger, hamburger 50 | orange 51 | broccoli 52 | carrot 53 | hot dog 54 | pizza 55 | donut, doughnut, bagel 56 | cake, cheesecake, cupcake, shortcake, coffeecake, pancake 57 | chair, seat, stool 58 | couch, sofa, recliner, futon, loveseat, settee, chesterfield 59 
| potted plant, houseplant 60 | bed 61 | dining table, table, desk 62 | toilet, urinal, commode, toilet, lavatory, potty 63 | tv, monitor, televison, television 64 | laptop, computer, notebook, netbook, lenovo, macbook, laptop computer 65 | mouse 66 | remote 67 | keyboard 68 | cell phone, mobile phone, phone, cellphone, telephone, phon, smartphone, iPhone 69 | microwave 70 | oven, stovetop, stove, stove top oven 71 | toaster 72 | sink 73 | refrigerator, fridge, fridge, freezer 74 | book 75 | clock 76 | vase 77 | scissors 78 | teddy bear, teddybear 79 | hair drier, hairdryer 80 | toothbrush 81 | -------------------------------------------------------------------------------- /figure4.py: -------------------------------------------------------------------------------- 1 | from utils import misc 2 | from utils import chair 3 | import argparse 4 | import json 5 | import os 6 | 7 | parser = argparse.ArgumentParser() 8 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 9 | args = parser.parse_args() 10 | 11 | figure4_tags_karpathy = [('TD', 'td_beam1_test'), 12 | ('No Att', 'td-noatt_beam1_test'), 13 | ('No Conv', 'td-noconv_beam1_test'), 14 | ('Single', 'td-single_beam1_test'), 15 | ('FC', 'td-fc_beam1_test')] 16 | 17 | print "=================Karpathy Split=================" 18 | print "Model\tCHAIRi\tLM Consistency\tIM Consistency" 19 | 20 | for tag in figure4_tags_karpathy: 21 | 22 | chair_i, lm_consistency, im_consistency = misc.get_consistency(tag[1], 23 | args.annotation_path, 24 | robust=False) 25 | 26 | print "%s\t%0.04f\t%0.04f\t\t%0.04f" %(tag[0], 27 | chair_i, 28 | lm_consistency, 29 | im_consistency) 30 | 31 | print "=================Robust Split=================" 32 | print "Model\tCHAIRi\tLM Consistency\tIM Consistency" 33 | 34 | figure4_tags_robust = [('TD', 'td-robust_beam1_test'), 35 | ('No Att', 'td-noatt-robust_beam1_test'), 36 | ('No Conv', 'td-noconv-robust_beam1_test'), 37 | ('Single', 'td-single-robust_beam1_test'), 38 | ('FC', 'td-fc-robust_beam1_test')] 39 | 40 | #generate hallucination files for robust split for fig 4 41 | evaluator = None 42 | output_template = "output/hallucination/hallucinated_words_%s.json" 43 | sentence_template = "generated_sentences/%s.json" 44 | for tag in figure4_tags_robust: 45 | if not os.path.exists(output_template %tag[1]): 46 | if not evaluator: 47 | _, imids, _ = chair.load_generated_captions(sentence_template %figure4_tags_robust[0][1]) 48 | evaluator = chair.CHAIR(imids, args.annotation_path) 49 | evaluator.get_annotations() 50 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 51 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 52 | 53 | for tag in figure4_tags_robust: 54 | 55 | chair_i, lm_consistency, im_consistency = misc.get_consistency(tag[1], 56 | args.annotation_path, 57 | robust=True) 58 | 59 | print "%s\t%0.04f\t%0.04f\t\t%0.04f" %(tag[0], 60 | chair_i, 61 | lm_consistency, 62 | im_consistency) 63 | -------------------------------------------------------------------------------- /figure6.py: -------------------------------------------------------------------------------- 1 | from utils import misc 2 | 3 | template = './output/hallucination/hallucinated_words_%s.json' 4 | fc_hallucination = template %'fc_beam5_test' 5 | td_hallucination = template %'td_beam5_test' 6 | 7 | diffs = misc.predictive_metrics(fc_hallucination, td_hallucination) 8 | 9 | print "Differences in Hallucination for sentences with similar SPICE score:" 10 | print "\t(comparing fc and td 
models)" 11 | 12 | for i in range(0, 100, 10): 13 | print "Between %d-%d:\t%0.04f" %(i, i+10, diffs[i/10]) 14 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.14.2 2 | nltk==3.2.5 3 | pattern==3.6 4 | pycocotools==2.0.0 5 | -------------------------------------------------------------------------------- /setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #Download generated sentences 4 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/generated_sentences.zip 5 | unzip generated_sentences.zip 6 | rm -r generated_sentences.zip 7 | 8 | #Download 9 | mkdir output 10 | mkdir output/hallucination 11 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/intermediate_image.zip 12 | unzip intermediate_image.zip 13 | rm -r intermediate_image.zip 14 | 15 | cd data 16 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/gt_labels.p 17 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/vocab.p 18 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/human_scores.zip 19 | unzip human_scores.zip 20 | rm -r human_scores.zip 21 | 22 | cd .. 23 | -------------------------------------------------------------------------------- /table1.py: -------------------------------------------------------------------------------- 1 | from utils import chair 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 6 | args = parser.parse_args() 7 | 8 | sentence_template = 'generated_sentences/%s.json' 9 | table1_tags = [('LRCN', 'lrcn_beam5_test', 'lrcn-sc_beam5_test'), 10 | ('FC', 'fc_beam5_test', 'fc-sc_beam5_test'), 11 | ('att2in', 'att2in_beam5_test', 'att2in-sc_beam5_test'), 12 | ('TD', 'td_beam5_test', 'td-sc_beam5_test'), 13 | ('TD-BB', 'td-bb_beam5_test', 'td-bb-sc_beam5_test'), 14 | ('NBT', 'nbt_beam5_test'), 15 | ('GAN', 'baseline-gan_beam5_test', 'gan_beam5_test')] 16 | 17 | _, imids, _ = chair.load_generated_captions(sentence_template %table1_tags[0][1]) 18 | 19 | evaluator = chair.CHAIR(imids, args.annotation_path) 20 | evaluator.get_annotations() 21 | 22 | print "\t\tCross Entropy\t\t\t\tSelf-Critical\t\t" 23 | print "Model\tSPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi\t|SPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi" 24 | 25 | for tag in table1_tags: 26 | 27 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 28 | metric_string_ce = chair.print_metrics(cap_dict, True) 29 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 30 | if len(tag) > 2: 31 | cap_dict = evaluator.compute_chair(sentence_template %tag[2]) 32 | metric_string_sc = chair.print_metrics(cap_dict, True) 33 | chair.save_hallucinated_words(sentence_template %tag[2], cap_dict) 34 | else: 35 | metric_string_sc = "-\t-\t-\t-\t-" 36 | print "%s\t%s\t|%s" %(tag[0], metric_string_ce, metric_string_sc) 37 | -------------------------------------------------------------------------------- /table2.py: -------------------------------------------------------------------------------- 1 | from utils import chair 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 6 | args = parser.parse_args() 7 | 8 | sentence_template = 'generated_sentences/%s.json' 9 | table2_tags = [('FC', 
'fc-robust_beam5_test'), 10 | ('att2in', 'att2in-robust_beam5_test'), 11 | ('TD', 'td-robust_beam5_test'), 12 | ('NBT', 'nbt-robust_beam5_test')] 13 | 14 | _, imids, _ = chair.load_generated_captions(sentence_template %table2_tags[0][1]) 15 | 16 | evaluator = chair.CHAIR(imids, args.annotation_path) 17 | evaluator.get_annotations() 18 | 19 | print "\t\tCross Entropy\t\t\t" 20 | print "Model\tSPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi" 21 | 22 | for tag in table2_tags: 23 | 24 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 25 | metric_string = chair.print_metrics(cap_dict, True) 26 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 27 | print "%s\t%s\t" %(tag[0], metric_string) 28 | -------------------------------------------------------------------------------- /table3.py: -------------------------------------------------------------------------------- 1 | from utils import chair 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 6 | args = parser.parse_args() 7 | 8 | sentence_template = 'generated_sentences/%s.json' 9 | table3_tags = [('TD', 'td_beam1_test'), 10 | ('No Att', 'td-noatt_beam1_test'), 11 | ('No Conv', 'td-noconv_beam1_test'), 12 | ('Single', 'td-single_beam1_test'), 13 | ('FC', 'td-fc_beam1_test')] 14 | 15 | _, imids, _ = chair.load_generated_captions(sentence_template %table3_tags[0][1]) 16 | 17 | evaluator = chair.CHAIR(imids, args.annotation_path) 18 | evaluator.get_annotations() 19 | 20 | print "\t\tCross Entropy\t\t\t" 21 | print "Model\tSPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi" 22 | 23 | for tag in table3_tags: 24 | 25 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 26 | metric_string = chair.print_metrics(cap_dict, True) 27 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 28 | print "%s\t%s\t" %(tag[0], metric_string) 29 | -------------------------------------------------------------------------------- /table4.py: -------------------------------------------------------------------------------- 1 | from utils import misc 2 | import os 3 | from utils import chair 4 | import argparse 5 | 6 | parser = argparse.ArgumentParser() 7 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 8 | args = parser.parse_args() 9 | 10 | output_template = "output/hallucination/hallucinated_words_%s.json" 11 | sentence_template = "generated_sentences/%s.json" 12 | 13 | table4_tags = [('FC', 'fc_beam5_test'), 14 | ('att2in', 'att2in_beam5_test'), 15 | ('td', 'td_beam5_test')] 16 | 17 | 18 | _, imids, _ = chair.load_generated_captions(sentence_template %table4_tags[0][1]) 19 | evaluator = chair.CHAIR(imids, args.annotation_path) 20 | evaluator.get_annotations() 21 | 22 | print "Model\tCIDEr\tMETEOR\tSPICE" 23 | 24 | for tag in table4_tags: 25 | 26 | if not os.path.exists(output_template %tag[1]): 27 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 28 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 29 | 30 | cider, meteor, spice = misc.score_correlation(output_template %tag[1], 31 | quiet=True) 32 | print "%s\t%0.03f\t%0.03f\t%0.03f" %(tag[0], cider, meteor, spice) 33 | -------------------------------------------------------------------------------- /table5.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | Compute correlation between sentence scores and human scores. 
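For each image, Pearson's correlation between a sentence metric (optionally combined with 1-CHAIRs or 1-CHAIRi) and the human scores is computed across the five model captions, then averaged over all images; see the inline comments below.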
4 | """ 5 | 6 | import glob 7 | import json 8 | import numpy as np 9 | 10 | ########################################################################## 11 | def correctForNan(inputVec): 12 | # If all the values are identical, introduce a small perturbation 13 | outputVec = inputVec[:] 14 | item = inputVec[0] 15 | allEqual = True 16 | for it in inputVec: 17 | if it <> item: 18 | allEqual = False 19 | break 20 | 21 | if allEqual: 22 | outputVec[np.random.choice(range(len(inputVec)))] += 0.001 23 | 24 | return outputVec 25 | ########################################################################## 26 | 27 | SCORES_PATH = 'output/hallucination/' 28 | 29 | f_scores = [SCORES_PATH+'hallucinated_words_baseline-gan_beam5_test.json', 30 | SCORES_PATH+'hallucinated_words_nbt_beam5_test.json', 31 | SCORES_PATH+'hallucinated_words_gan_beam5_test.json', 32 | SCORES_PATH+'hallucinated_words_td_beam5_test.json', 33 | SCORES_PATH+'hallucinated_words_td-sc_beam5_test.json'] 34 | 35 | ids_file = 'data/human_scores/' + 'imageIDs.txt' # images used in the human evaluation 36 | of = open(ids_file, 'r') 37 | image_ids = of.read().split('\n') 38 | 39 | HUMAN_SCORES_PATH = 'data/human_scores/0*.txt' 40 | f_human_subj_scores = glob.glob(HUMAN_SCORES_PATH) 41 | f_human_subj_scores.sort() 42 | 43 | MODELS = ['MPI-CE', 'NBT-CE', 'MPI-GAN', 'TD-CE', 'TD-SC'] 44 | SCORES = ['B@1', 'B@2', 'B@3', 'B@4', 'R', 'M', 'C', 'S'] 45 | CHAIR = ['1-CHs', '1-CHi'] 46 | 47 | NIMAGES = len(image_ids) 48 | NCAPTIONS = len(MODELS) 49 | 50 | s_scores_m = {} # sentence scores 51 | c_scores_m = {} # chair scores 52 | h_s_scores_m = {} # human scores 53 | 54 | for i, fn in enumerate(f_scores): 55 | s_scores_m[i] = [None] * NIMAGES 56 | c_scores_m[i] = [None] * NIMAGES 57 | of = open(fn, 'r') 58 | f_data = json.load(open(fn, 'r')) 59 | f_data = f_data['sentences'] 60 | of.close() 61 | for item in f_data: 62 | im_id = item['image_id'] 63 | if str(im_id) not in image_ids: 64 | continue 65 | metrics = item['metrics'] 66 | b1 = metrics['Bleu_1'] 67 | b2 = metrics['Bleu_2'] 68 | b3 = metrics['Bleu_3'] 69 | b4 = metrics['Bleu_4'] 70 | rl = metrics['ROUGE_L'] 71 | me = metrics['METEOR'] 72 | ci = metrics['CIDEr'] 73 | sp = metrics['SPICE']['All']['f'] 74 | ind = image_ids.index(str(im_id)) 75 | s_scores_m[i][ind] = [b1, b2, b3, b4, rl, me, ci, sp] 76 | # 77 | ch_s = metrics['CHAIRs'] 78 | ch_i = metrics['CHAIRi'] 79 | c_scores_m[i][ind] = [1-ch_s, 1-ch_i] 80 | 81 | for i, fn in enumerate(f_human_subj_scores): 82 | h_s_scores_m[i] = [] 83 | of = open(fn, 'r') 84 | f_data = of.read().split('\n') 85 | if f_data[-1] == '': 86 | f_data = f_data[0:-1] 87 | for line in f_data: 88 | items = line.split('\t') 89 | h_s_scores_m[i].append([float(x) for x in (items[1:])]) 90 | 91 | # PEARSON'S correlation across NCAPTIONS=5 captions per image (from each system), averaged over NIMAGES=500 images 92 | 93 | corr_s_s = [0] * len(SCORES) # correlation between sentence scores and human scores 94 | corr_s_cs_s = [0] * len(SCORES) # correlation between sentence scores+(1-CHs) and human scores 95 | corr_s_ci_s = [0] * len(SCORES) # correlation between sentence scores+(1-CHi) and human scores 96 | 97 | for im in range(NIMAGES): 98 | s_m = [] 99 | c_m = [] 100 | h_s_m = [] 101 | for i in range(NCAPTIONS): 102 | s_m.append(s_scores_m[i][im]) 103 | c_m.append(c_scores_m[i][im]) 104 | h_s_m.append(h_s_scores_m[i][im]) 105 | for metric in range(len(SCORES)): 106 | corr = np.corrcoef(correctForNan([x[metric] for x in s_m]), correctForNan([x[0] for x in h_s_m]))[0][1] 107 | 
corr_s_s[metric] += corr 108 | 109 | for metric in range(len(SCORES)): 110 | #ch_s 111 | corr = np.corrcoef(correctForNan([x[metric] + c_m[i][0] for i, x in enumerate(s_m)]), correctForNan([x[0] for x in h_s_m]))[0][1] 112 | corr_s_cs_s[metric] += corr 113 | #ch_i 114 | corr = np.corrcoef(correctForNan([x[metric] + c_m[i][1] for i, x in enumerate(s_m)]), correctForNan([x[0] for x in h_s_m]))[0][1] 115 | corr_s_ci_s[metric] += corr 116 | 117 | print "Metric\tCorrelation" 118 | 119 | for metric in [5,6,7]: # focus on 'M', 'C', 'S' 120 | print('%s\t%.04f' % (SCORES[metric], corr_s_s[metric]/float(NIMAGES))) 121 | 122 | for metric in [5,6,7]: # focus on 'M', 'C', 'S' 123 | print('%s\t%.04f' % (SCORES[metric]+'+'+CHAIR[0], corr_s_cs_s[metric]/float(NIMAGES))) 124 | 125 | for metric in [5,6,7]: # focus on 'M', 'C', 'S' 126 | print('%s\t%.04f' % (SCORES[metric]+'+'+CHAIR[1], corr_s_ci_s[metric]/float(NIMAGES))) 127 | 128 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- 1 | #Blank 2 | -------------------------------------------------------------------------------- /utils/chair.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import nltk 4 | import json 5 | from pattern.en import singularize 6 | import argparse 7 | from misc import * 8 | 9 | lemma = nltk.wordnet.WordNetLemmatizer() 10 | 11 | def combine_coco_captions(annotation_path): 12 | 13 | if not os.path.exists('%s/captions_%s2014.json' %(annotation_path, 'val')): 14 | raise Exception("Please download MSCOCO caption annotations for val set") 15 | if not os.path.exists('%s/captions_%s2014.json' %(annotation_path, 'train')): 16 | raise Exception("Please download MSCOCO caption annotations for train set") 17 | 18 | val_caps = json.load(open('%s/captions_%s2014.json' %(annotation_path, 'val'))) 19 | train_caps = json.load(open('%s/captions_%s2014.json' %(annotation_path, 'train'))) 20 | all_caps = {'info': train_caps['info'], 21 | 'licenses': train_caps['licenses'], 22 | 'images': val_caps['images'] + train_caps['images'], 23 | 'annotations': val_caps['annotations'] + train_caps['annotations']} 24 | 25 | return all_caps 26 | 27 | def combine_coco_instances(annotation_path): 28 | 29 | if not os.path.exists('%s/instances_%s2014.json' %(annotation_path, 'val')): 30 | raise Exception("Please download MSCOCO instance annotations for val set") 31 | if not os.path.exists('%s/instances_%s2014.json' %(annotation_path, 'train')): 32 | raise Exception("Please download MSCOCO instance annotations for train set") 33 | 34 | val_instances = json.load(open('%s/instances_%s2014.json' %(annotation_path, 'val'))) 35 | train_instances = json.load(open('%s/instances_%s2014.json' %(annotation_path, 'train'))) 36 | all_instances = {'info': train_instances['info'], 37 | 'licenses': train_instances['licenses'], 38 | 'type': train_instances.get('type', 'instances'), #older COCO annotation files carry a 'type' field 39 | 'categories': train_instances['categories'], 40 | 'images': train_instances['images'] + val_instances['images'], 41 | 'annotations': val_instances['annotations'] + train_instances['annotations']} 42 | 43 | return all_instances 44 | 45 | class CHAIR(object): 46 | 47 | def __init__(self, imids, coco_path): 48 | 49 | self.imid_to_objects = {imid: [] for imid in imids} 50 | 51 | self.coco_path = coco_path 52 | 53 | #read in synonyms 54 | synonyms = open('data/synonyms.txt').readlines() 55 | synonyms = 
[s.strip().split(', ') for s in synonyms] 56 | self.mscoco_objects = [] #mscoco objects and *all* synonyms 57 | self.inverse_synonym_dict = {} 58 | for synonym in synonyms: 59 | self.mscoco_objects.extend(synonym) 60 | for s in synonym: 61 | self.inverse_synonym_dict[s] = synonym[0] 62 | 63 | #Some hard coded rules for implementing CHAIR metrics on MSCOCO 64 | 65 | #common 'double words' in MSCOCO that should be treated as a single word 66 | coco_double_words = ['motor bike', 'motor cycle', 'air plane', 'traffic light', 'street light', 'traffic signal', 'stop light', 'fire hydrant', 'stop sign', 'parking meter', 'suit case', 'sports ball', 'baseball bat', 'baseball glove', 'tennis racket', 'wine glass', 'hot dog', 'cell phone', 'mobile phone', 'teddy bear', 'hair drier', 'potted plant', 'bow tie', 'laptop computer', 'stove top oven', 'hot dog', 'teddy bear', 'home plate', 'train track'] 67 | 68 | #Hard code some rules for special cases in MSCOCO 69 | #qualifiers like 'baby' or 'adult' animal will lead to a false fire for the MSCOCO object 'person'. 'baby bird' --> 'bird'. 70 | animal_words = ['bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'animal', 'cub'] 71 | #qualifiers like 'passenger' vehicle will lead to a false fire for the MSCOCO object 'person'. 'passenger jet' --> 'jet'. 72 | vehicle_words = ['jet', 'train'] 73 | 74 | #double_word_dict will map double words to the word they should be treated as in our analysis 75 | 76 | self.double_word_dict = {} 77 | for double_word in coco_double_words: 78 | self.double_word_dict[double_word] = double_word 79 | for animal_word in animal_words: 80 | self.double_word_dict['baby %s' %animal_word] = animal_word 81 | self.double_word_dict['adult %s' %animal_word] = animal_word 82 | for vehicle_word in vehicle_words: 83 | self.double_word_dict['passenger %s' %vehicle_word] = vehicle_word 84 | self.double_word_dict['bow tie'] = 'tie' 85 | self.double_word_dict['toilet seat'] = 'toilet' 86 | self.double_word_dict['wine glas'] = 'wine glass' #not a typo: singularize() reduces 'glass' to 'glas', so 'wine glass' reaches this lookup as 'wine glas' 87 | 88 | def _load_generated_captions_into_evaluator(self, cap_file): 89 | 90 | ''' 91 | Meant to save time so imid_to_objects does not always need to be recomputed. 
92 | ''' 93 | #Read in captions 94 | self.caps, imids, self.metrics = load_generated_captions(cap_file) 95 | 96 | assert imids == set(self.imid_to_objects.keys()) 97 | 98 | def caption_to_words(self, caption): 99 | 100 | ''' 101 | Input: caption 102 | Output: MSCOCO words in the caption 103 | ''' 104 | 105 | #standard preprocessing 106 | words = nltk.word_tokenize(caption.lower()) 107 | words = [singularize(w) for w in words] 108 | 109 | #replace double words 110 | i = 0 111 | double_words = [] 112 | idxs = [] 113 | while i < len(words): 114 | idxs.append(i) 115 | double_word = ' '.join(words[i:i+2]) 116 | if double_word in self.double_word_dict: 117 | double_words.append(self.double_word_dict[double_word]) 118 | i += 2 119 | else: 120 | double_words.append(words[i]) 121 | i += 1 122 | words = double_words 123 | 124 | #toilet seat is not chair (sentences like "the seat of the toilet" will fire for "chair" if we do not include this line) 125 | if ('toilet' in words) & ('seat' in words): words = [word for word in words if word != 'seat'] 126 | 127 | #get synonyms for all words in the caption 128 | idxs = [idxs[idx] for idx, word in enumerate(words) \ 129 | if word in set(self.mscoco_objects)] 130 | words = [word for word in words if word in set(self.mscoco_objects)] 131 | node_words = [] 132 | for word in words: 133 | node_words.append(self.inverse_synonym_dict[word]) 134 | #return all the MSCOCO objects in the caption 135 | return words, node_words, idxs, double_words 136 | 137 | def get_annotations_from_segments(self): 138 | ''' 139 | Add objects taken from MSCOCO segmentation masks 140 | ''' 141 | 142 | coco_segments = combine_coco_instances(self.coco_path ) 143 | segment_annotations = coco_segments['annotations'] 144 | 145 | #make dict linking object name to ids 146 | id_to_name = {} #dict with id to synsets 147 | for cat in coco_segments['categories']: 148 | id_to_name[cat['id']] = cat['name'] 149 | 150 | for i, annotation in enumerate(segment_annotations): 151 | sys.stdout.write("\rGetting annotations for %d/%d segmentation masks" 152 | %(i, len(segment_annotations))) 153 | imid = annotation['image_id'] 154 | if imid in self.imid_to_objects: 155 | node_word = self.inverse_synonym_dict[id_to_name[annotation['category_id']]] 156 | self.imid_to_objects[imid].append(node_word) 157 | print "\n" 158 | for imid in self.imid_to_objects: 159 | self.imid_to_objects[imid] = set(self.imid_to_objects[imid]) 160 | 161 | def get_annotations_from_captions(self): 162 | ''' 163 | Add objects taken from MSCOCO ground truth captions 164 | ''' 165 | 166 | coco_caps = combine_coco_captions(self.coco_path) 167 | caption_annotations = coco_caps['annotations'] 168 | 169 | for i, annotation in enumerate(caption_annotations): 170 | sys.stdout.write('\rGetting annotations for %d/%d ground truth captions' 171 | %(i, len(coco_caps['annotations']))) 172 | imid = annotation['image_id'] 173 | if imid in self.imid_to_objects: 174 | _, node_words, _, _ = self.caption_to_words(annotation['caption']) 175 | self.imid_to_objects[imid].update(node_words) 176 | print "\n" 177 | 178 | for imid in self.imid_to_objects: 179 | self.imid_to_objects[imid] = set(self.imid_to_objects[imid]) 180 | 181 | def get_annotations(self): 182 | 183 | ''' 184 | Get annotations from both segmentation and captions. Need both annotation types for CHAIR metric. 
185 | ''' 186 | 187 | self.get_annotations_from_segments() 188 | self.get_annotations_from_captions() 189 | 190 | def compute_chair(self, cap_file): 191 | 192 | ''' 193 | Given ground truth objects and generated captions, determine which sentences have hallucinated words. 194 | ''' 195 | 196 | self._load_generated_captions_into_evaluator(cap_file) 197 | 198 | imid_to_objects = self.imid_to_objects 199 | caps = self.caps 200 | 201 | num_caps = 0. 202 | num_hallucinated_caps = 0. 203 | hallucinated_word_count = 0. 204 | coco_word_count = 0. 205 | 206 | output = {'sentences': []} 207 | 208 | for i, cap_eval in enumerate(caps): 209 | 210 | cap = cap_eval['caption'] 211 | imid = cap_eval['image_id'] 212 | 213 | #get all words in the caption, as well as corresponding node word 214 | words, node_words, idxs, raw_words = self.caption_to_words(cap) 215 | 216 | gt_objects = imid_to_objects[imid] 217 | cap_dict = {'image_id': cap_eval['image_id'], 218 | 'caption': cap, 219 | 'mscoco_hallucinated_words': [], 220 | 'mscoco_gt_words': list(gt_objects), 221 | 'mscoco_generated_words': list(node_words), 222 | 'hallucination_idxs': [], 223 | 'words': raw_words 224 | } 225 | 226 | cap_dict['metrics'] = {'Bleu_1': cap_eval['Bleu_1'], 227 | 'Bleu_2': cap_eval['Bleu_2'], 228 | 'Bleu_3': cap_eval['Bleu_3'], 229 | 'Bleu_4': cap_eval['Bleu_4'], 230 | 'METEOR': cap_eval['METEOR'], 231 | 'CIDEr': cap_eval['CIDEr'], 232 | 'SPICE': cap_eval['SPICE'], 233 | 'ROUGE_L': cap_eval['ROUGE_L'], 234 | 'CHAIRs': 0, 235 | 'CHAIRi': 0} 236 | 237 | #count hallucinated words 238 | coco_word_count += len(node_words) 239 | hallucinated = False 240 | for word, node_word, idx in zip(words, node_words, idxs): 241 | if node_word not in gt_objects: 242 | hallucinated_word_count += 1 243 | cap_dict['mscoco_hallucinated_words'].append((word, node_word)) 244 | cap_dict['hallucination_idxs'].append(idx) 245 | hallucinated = True 246 | 247 | #count hallucinated caps 248 | num_caps += 1 249 | if hallucinated: 250 | num_hallucinated_caps += 1 251 | 252 | cap_dict['metrics']['CHAIRs'] = int(hallucinated) 253 | cap_dict['metrics']['CHAIRi'] = 0. 
254 | if len(words) > 0: 255 | cap_dict['metrics']['CHAIRi'] = len(cap_dict['mscoco_hallucinated_words'])/float(len(words)) 256 | 257 | output['sentences'].append(cap_dict) 258 | 259 | chair_s = (num_hallucinated_caps/num_caps) 260 | chair_i = (hallucinated_word_count/coco_word_count) 261 | 262 | output['overall_metrics'] = {'Bleu_1': self.metrics['Bleu_1'], 263 | 'Bleu_2': self.metrics['Bleu_2'], 264 | 'Bleu_3': self.metrics['Bleu_3'], 265 | 'Bleu_4': self.metrics['Bleu_4'], 266 | 'METEOR': self.metrics['METEOR'], 267 | 'CIDEr': self.metrics['CIDEr'], 268 | 'SPICE': self.metrics['SPICE'], 269 | 'ROUGE_L': self.metrics['ROUGE_L'], 270 | 'CHAIRs': chair_s, 271 | 'CHAIRi': chair_i} 272 | 273 | return output 274 | 275 | def load_generated_captions(cap_file): 276 | #Read in captions 277 | caps = json.load(open(cap_file)) 278 | try: 279 | metrics = caps['overall'] 280 | caps = caps['imgToEval'].values() 281 | imids = set([cap['image_id'] for cap in caps]) 282 | except: 283 | raise Exception("Expected the caption file to be a dictionary with generated captions stored under the key 'imgToEval'") 284 | 285 | return caps, imids, metrics 286 | 287 | def save_hallucinated_words(cap_file, cap_dict): 288 | tag = cap_file.split('/')[-1] 289 | with open('output/hallucination/hallucinated_words_%s' %tag, 'w') as f: 290 | json.dump(cap_dict, f) 291 | 292 | def print_metrics(hallucination_cap_dict, quiet=False): 293 | sentence_metrics = hallucination_cap_dict['overall_metrics'] 294 | metric_string = "%0.01f\t%0.01f\t%0.01f\t%0.01f\t%0.01f" %( 295 | sentence_metrics['SPICE']*100, 296 | sentence_metrics['METEOR']*100, 297 | sentence_metrics['CIDEr']*100, 298 | sentence_metrics['CHAIRs']*100, 299 | sentence_metrics['CHAIRi']*100) 300 | 301 | if not quiet: 302 | print "SPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi" 303 | print metric_string 304 | 305 | else: 306 | return metric_string 307 | 308 | if __name__ == '__main__': 309 | parser = argparse.ArgumentParser() 310 | parser.add_argument("--cap_file", type=str, default='') 311 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 312 | args = parser.parse_args() 313 | 314 | _, imids, _ = load_generated_captions(args.cap_file) 315 | 316 | evaluator = CHAIR(imids, args.annotation_path) 317 | evaluator.get_annotations() 318 | cap_dict = evaluator.compute_chair(args.cap_file) 319 | 320 | print_metrics(cap_dict) 321 | save_hallucinated_words(args.cap_file, cap_dict) 322 | -------------------------------------------------------------------------------- /utils/im_consistency.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | import pickle as pkl 4 | import pdb 5 | import numpy as np 6 | from nltk import word_tokenize 7 | from pattern.en import singularize 8 | import nltk 9 | import argparse 10 | from misc import * 11 | 12 | def get_label_dicts(robust=False): 13 | if robust: 14 | label_dict = 'output/image_classifier/classifier_output_robust.p' 15 | else: 16 | label_dict = 'output/image_classifier/classifier_output.p' 17 | predicted_label_dict = pkl.load(open(label_dict, 'rb')) 18 | gt_label_dict = pkl.load(open('data/gt_labels.p', 'rb')) 19 | 20 | return predicted_label_dict, gt_label_dict 21 | 22 | def get_im_consistency(hallucination_by_imid, 23 | predicted_label_dict, 24 | gt_label_dict): 25 | 26 | total = 0. 27 | scores = 0. 
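# Image model consistency: for each hallucinated MSCOCO object, take the image
# classifier's raw output for that object class and average over all
# hallucinated words; a higher score means the image classifier also favors the
# objects the captioner hallucinated.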
28 | 29 | for i, imid in enumerate(hallucination_by_imid.keys()): 30 | 31 | item = hallucination_by_imid[imid] 32 | caption = item['caption'] 33 | caption_words = word_tokenize(caption.lower()) 34 | mscoco_words = [i[1] for i in item['mscoco_hallucinated_words']] 35 | 36 | predicted_labels = predicted_label_dict[imid]['predicted_classes'] 37 | raw_output = predicted_label_dict[imid]['raw_output'] 38 | raw_output_sorted = np.argsort(raw_output)[::-1] 39 | 40 | for mscoco_word in mscoco_words: 41 | value = raw_output[gt_label_dict['cat_to_idx'][mscoco_word]] 42 | scores += value 43 | total += 1 44 | 45 | return scores/total 46 | 47 | if __name__ == '__main__': 48 | parser = argparse.ArgumentParser() 49 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 50 | parser.add_argument("--tag", type=str, default='td-fc_beam1_test') 51 | parser.add_argument('--robust', dest='robust', action='store_true') 52 | parser.set_defaults(robust=False) 53 | args = parser.parse_args() 54 | 55 | #read hallucination file 56 | hallucinated_json = './output/hallucination/hallucinated_words_%s.json' %args.tag 57 | hallucination_by_imid = hallucination_file_to_dict(hallucinated_json) 58 | 59 | predicted_label_dict, gt_label_dict = get_label_dicts(args.robust) 60 | consistency = get_im_consistency(hallucination_by_imid, 61 | predicted_label_dict, 62 | gt_label_dict) 63 | print "Im consistency is: %0.04f" %consistency 64 | -------------------------------------------------------------------------------- /utils/lm_consistency.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import sys 3 | import json 4 | import os 5 | import pickle as pkl 6 | from nltk import word_tokenize 7 | from collections import defaultdict 8 | from pattern.en import singularize 9 | import pdb 10 | 11 | import argparse 12 | from misc import * 13 | 14 | def read_vocab(robust): 15 | 16 | #read vocab (maps index -> word); the robust flag is currently unused 17 | vocab = pkl.load(open('data/vocab.p', 'rb')) 18 | word_to_idx = defaultdict(lambda: 0) # word -> ix (index 0 assumed as the out-of-vocabulary fallback) 19 | for key, value in zip(vocab.keys(), vocab.values()): 20 | word_to_idx[value] = int(key) 21 | 22 | return word_to_idx 23 | 24 | def softmax(array): 25 | shift = array - np.max(array) 26 | return np.exp(shift)/np.sum(np.exp(shift)) 27 | 28 | 29 | def get_blank_prediction_path(tag): 30 | return './output/language_model_blank_input/%s/%%d.npy' %tag 31 | 32 | def get_lm_consistency(hallucination_by_imid, 33 | blank_lm_predictions, 34 | word_to_idx, 35 | quiet = False): 36 | 37 | word_hallucinated_idxs = 0. 38 | word_hallucinated_total = 0. 
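# Language model consistency: for each hallucinated word, rank it among the
# blank-input language model's predictions at that word's position; the value
# returned below is the inverse of the mean rank, so a higher score means the
# language model alone (without the image) was more inclined to produce the
# hallucinated word.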
39 | 40 | for i, imid in enumerate(sorted(hallucination_by_imid.keys())): 41 | if not quiet: 42 | sys.stdout.write("\r%d/%d" %(i, len(hallucination_by_imid.keys()))) 43 | probs = np.load(blank_lm_predictions %int(imid)) 44 | item = hallucination_by_imid[imid] 45 | caption = item['caption'] 46 | 47 | caption_words = word_tokenize(caption.lower()) 48 | mscoco_words = zip(item['hallucination_idxs'], \ 49 | [caption_words[i] for i in item['hallucination_idxs']]) 50 | 51 | for mscoco_word in mscoco_words: 52 | idx, word = mscoco_word 53 | word = word.split(' ')[0] 54 | word_probs = softmax(probs[idx,:]) 55 | sorted_objects = np.argsort(word_probs)[::-1] 56 | word_idx = np.where(sorted_objects == word_to_idx[word])[0][0] + 1 57 | word_hallucinated_idxs += word_idx 58 | word_hallucinated_total += 1 59 | 60 | return word_hallucinated_total/word_hallucinated_idxs 61 | 62 | if __name__ == '__main__': 63 | parser = argparse.ArgumentParser() 64 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 65 | parser.add_argument("--tag", type=str, default='td-fc_beam1_test') 66 | parser.add_argument('--robust', dest='robust', action='store_true') 67 | parser.set_defaults(robust=False) 68 | args = parser.parse_args() 69 | 70 | hallucinated_json = './output/hallucination/hallucinated_words_%s.json' %args.tag 71 | hallucination_by_imid = hallucination_file_to_dict(hallucinated_json) 72 | blank_lm_predictions = get_blank_prediction_path(args.tag) 73 | 74 | word_to_idx = read_vocab(args.robust) 75 | consistency = get_lm_consistency(hallucination_by_imid, \ 76 | blank_lm_predictions, \ 77 | word_to_idx) 78 | print "\nConsistency: %0.04f" %consistency 79 | -------------------------------------------------------------------------------- /utils/misc.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import json 4 | import numpy as np 5 | import lm_consistency as LM 6 | import im_consistency as IM 7 | from chair import * 8 | 9 | def hallucination_file_to_dict(hallucinated_json): 10 | 11 | hallucination_data = json.load(open(hallucinated_json)) 12 | hallucination_by_imid = {h['image_id']: h for h in hallucination_data['sentences']} 13 | 14 | return hallucination_by_imid 15 | 16 | def get_sentence_scores_from_hallucination_file(hallucination_file): 17 | hallucination = json.load(open(hallucination_file)) 18 | return hallucination['overall_metrics'] 19 | 20 | def get_consistency(tag, annotation_path, robust=False): 21 | 22 | #Load hallucination dict. If it does not exist, make it! 
23 | hallucinated_json = './output/hallucination/hallucinated_words_%s.json' %tag 24 | sentences = 'generated_sentences/%s.json' %tag 25 | 26 | if not os.path.exists(hallucinated_json): 27 | print "Computing hallucination file for tag %s" %tag 28 | sentence_template = 'generated_sentences/%s.json' 29 | _, imids, _ = load_generated_captions(sentence_template %tag) 30 | evaluator = CHAIR(imids, annotation_path) 31 | evaluator.get_annotations() 32 | cap_dict = evaluator.compute_chair(sentence_template %tag) 33 | save_hallucinated_words(sentence_template %tag, cap_dict) 34 | 35 | hallucination_by_imid = hallucination_file_to_dict(hallucinated_json) 36 | 37 | #LM consistency 38 | word_to_idx = LM.read_vocab(robust) 39 | blank_lm_predictions = LM.get_blank_prediction_path(tag) 40 | 41 | lm_consistency = LM.get_lm_consistency(hallucination_by_imid, \ 42 | blank_lm_predictions, \ 43 | word_to_idx, 44 | quiet=True) 45 | 46 | #IM consistency 47 | predicted_label_dict, gt_label_dict = IM.get_label_dicts(robust) 48 | im_consistency = IM.get_im_consistency(hallucination_by_imid, 49 | predicted_label_dict, 50 | gt_label_dict) 51 | 52 | #get chair scores for completeness 53 | scores = get_sentence_scores_from_hallucination_file(hallucinated_json) 54 | 55 | return scores['CHAIRi'], lm_consistency, im_consistency 56 | 57 | def score_correlation(cap_file, quiet=False): 58 | 59 | caps = json.load(open(cap_file)) 60 | 61 | ciders = [] 62 | meteors = [] 63 | spices = [] 64 | hallucinations = [] 65 | 66 | for cap in caps['sentences']: 67 | info = cap['metrics'] 68 | meteors.append(info['METEOR']) 69 | ciders.append(info['CIDEr']) 70 | spices.append(info['SPICE']['All']['f']) 71 | hallucinations.append(1 - info['CHAIRi']) 72 | 73 | meteors = np.array(meteors) 74 | ciders = np.array(ciders) 75 | spices = np.array(spices) 76 | hallucinations = np.array(hallucinations) 77 | 78 | cider_corr = np.corrcoef(ciders, hallucinations)[1][0] 79 | meteor_corr = np.corrcoef(meteors, hallucinations)[1][0] 80 | spice_corr = np.corrcoef(spices, hallucinations)[1][0] 81 | 82 | if not quiet: 83 | print "CIDEr and hallucination: %0.03f" %cider_corr 84 | print "METEOR and hallucination: %0.03f" %meteor_corr 85 | print "SPICE and hallucination: %0.03f" %spice_corr 86 | 87 | return cider_corr, meteor_corr, spice_corr 88 | 89 | def predictive_metrics(hallucinated_json_1, hallucinated_json_2): 90 | 91 | ''' 92 | Can sentence metrics predict hallucination? In section 3.4 of paper. 93 | ''' 94 | 95 | hallucination_data_1 = json.load(open(hallucinated_json_1)) 96 | hallucination_data_2 = json.load(open(hallucinated_json_2)) 97 | 98 | def bin_by_spice(data): 99 | #bin by spice scores 100 | spices = [] 101 | hallucinations = [] 102 | 103 | for cap in data['sentences']: 104 | info = cap['metrics'] 105 | spices.append(info['SPICE']['All']['f']) 106 | hallucinations.append(info['CHAIRs']) 107 | 108 | hist = [] 109 | for i in range(0, 100, 10): 110 | idxs = [idx for idx, spice in enumerate(spices) \ 111 | if (spice*100 >= i) and (spice*100 < (i+10))] 112 | if len(idxs) == 0: 113 | hist.append(0) 114 | else: 115 | hist.append(np.mean([hallucinations[idx] for idx in idxs])) 116 | return hist 117 | 118 | score_histogram_1 = bin_by_spice(hallucination_data_1) 119 | score_histogram_2 = bin_by_spice(hallucination_data_2) 120 | return list(np.array(score_histogram_1)-np.array(score_histogram_2)) 121 | --------------------------------------------------------------------------------