├── README.md ├── data └── synonyms.txt ├── figure4.py ├── figure6.py ├── requirements.txt ├── setup.sh ├── table1.py ├── table2.py ├── table3.py ├── table4.py ├── table5.py └── utils ├── __init__.py ├── chair.py ├── im_consistency.py ├── lm_consistency.py └── misc.py /README.md: -------------------------------------------------------------------------------- 1 | # Object Hallucination in Image Captioning 2 | 3 | Rohrbach*, Anna and Hendricks*, Lisa Anne, et al. "Object Hallucination in Image Captioning." EMNLP (2018). 4 | 5 | Find the paper [here](https://arxiv.org/pdf/1809.02156.pdf). 6 | ``` 7 | @inproceedings{objectHallucination, 8 | title = {Object Hallucination in Image Captioning}, 9 | author = {Rohrbach, Anna and Hendricks, Lisa Anne and Burns, Kaylee and Darrell, Trevor and Saenko, Kate}, 10 | booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, 11 | year = {2018} 12 | } 13 | ``` 14 | 15 | License: BSD 2-Clause license 16 | 17 | ## Running the Code 18 | 19 | **Getting Started** 20 | 21 | Run [setup.sh](setup.sh) to download the generated sentences used in our analysis. 22 | Additionally, you will need the MSCOCO annotations (both the instance segmentations and the ground truth captions). 23 | If you do not already have them, they can be downloaded [here](http://images.cocodataset.org/annotations/annotations_trainval2014.zip). 24 | Other Python requirements are listed in [requirements.txt](requirements.txt). 25 | 26 | **Replicating Results** 27 | 28 | After running ```setup.sh``` you should be able to replicate the results in our paper by running ```table1.py```, ```table2.py```, ```table3.py```, ```table4.py``` and ```figure6.py``` (example usage: ```python table1.py --annotation_path PATH_TO_COCO_ANNOTATIONS```, where ```coco/annotations``` is the default for ```--annotation_path```). 29 | Our scripts call on ```utils/chair.py``` to compute the CHAIR metric. See below for more details on ```utils/chair.py```. 30 | 31 | If you would like to run ```figure4.py``` (language and image model consistency) you will need to download some intermediate features. Please see the *Language and Image Model Consistency* section below. 32 | 33 | To reproduce our results on correlation with human scores, run ```python table5.py```. The file with image IDs used in the human evaluation, as well as the average human scores for each of the compared models, can be found in ```data/human_scores``` after running ```setup.sh```. 34 | 35 | **Evaluating CHAIR** 36 | 37 | See ```utils/chair.py``` to understand how we compute the CHAIRs and CHAIRi metrics. Evaluate generated sentences by passing the path to the generated sentences as well as the path to the MSCOCO annotations. 38 | 39 | Example usage is: 40 | 41 | ```python utils/chair.py --cap_file generated_sentences/fc_beam5_test.json --annotation_path coco/annotations``` 42 | 43 | where ```cap_file``` corresponds to a json file with your generated captions and ```annotation_path``` points to where the MSCOCO annotations are stored. 44 | 45 | We expect generated sentences to be stored as a dictionary with the following keys: 46 | 47 | * overall: metrics from the COCO evaluation toolkit computed over the entire dataset. 48 | * imgToEval: a dictionary with keys corresponding to image ids and values containing a caption, image_id, and sentence metrics for the particular caption (a minimal example is sketched below).
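A minimal, hypothetical sketch of that structure (the values here are made up and the per-caption metric list is abbreviated; the scripts expect Bleu_1 through Bleu_4, METEOR, CIDEr, ROUGE_L, and SPICE for each caption):

```
{"overall": {"Bleu_1": 0.71, "METEOR": 0.25, "CIDEr": 1.02, ...},
 "imgToEval": {"184613": {"image_id": 184613,
                          "caption": "a man riding a horse",
                          "Bleu_1": 0.80, "METEOR": 0.31, "CIDEr": 1.10, ...}}}
```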
49 | 50 | Note that this is the format of the captions output by the open-source code [here](https://github.com/ruotianluo/self-critical.pytorch), 51 | which we used to replicate most of the models presented in the paper. 52 | 53 | **Language and Image Model Consistency** 54 | 55 | To compute language and image consistency, we trained a classifier to predict class labels given an image and a language model to predict the next word in a sentence given all previous words. 56 | You can access the labels predicted by our image classifier in ```output/image_classifier``` and the words predicted by our language model [here](https://drive.google.com/drive/u/1/folders/1dnci1Kv6ez-hsFOqZt_gwiAv2FTAjDP4). 57 | To run our code, you need to first download the [zip file](https://drive.google.com/drive/u/1/folders/1dnci1Kv6ez-hsFOqZt_gwiAv2FTAjDP4) into the main directory and unzip it. 58 | Once you have these intermediate features, you can look at ```utils/lm_consistency.py``` and ```utils/im_consistency.py``` to understand how these metrics are computed. 59 | Running ```figure4.py``` will output the results from our paper (constructing the actual bar plot is left as an exercise to the reader). 60 | 61 | **Human Eval** 62 | 63 | Replicate the results from our human evaluation by running ```python table5.py```. Raw human evaluation scores can be found in ```data/human_scores``` after running ```setup.sh```. 64 | 65 | **Captioning Models** 66 | 67 | We generated sentences for the majority of models by training the open-source models available [here](https://github.com/ruotianluo/self-critical.pytorch). 68 | Within this framework, we wrote code for the LRCN model as well as the top-down deconstructed models (Table 3 in the paper). 69 | This code is available upon request. 70 | For the top-down model with bounding boxes, we used the code [here](https://github.com/peteanderson80/Up-Down-Captioner). 71 | For the Neural Baby Talk model, we used the code [here](https://github.com/jiasenlu/NeuralBabyTalk). 72 | For the GAN-based model, we used the sentences from the paper [here](https://arxiv.org/abs/1703.10476). Sentences were obtained directly from the author (we did not train the GAN model).
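**Programmatic Usage**

For reference, a minimal sketch of running the CHAIR evaluation from Python rather than the command line; it mirrors what ```table1.py``` and the ```__main__``` block of ```utils/chair.py``` do (paths assume the defaults above):

```
from utils import chair

cap_file = 'generated_sentences/fc_beam5_test.json'
annotation_path = 'coco/annotations'

#load the image ids covered by the generated captions
_, imids, _ = chair.load_generated_captions(cap_file)

#build ground truth object sets from segmentations and captions (slow; do once)
evaluator = chair.CHAIR(imids, annotation_path)
evaluator.get_annotations()

#score the captions, print CHAIRs/CHAIRi alongside the standard metrics, and save
cap_dict = evaluator.compute_chair(cap_file)
chair.print_metrics(cap_dict)
chair.save_hallucinated_words(cap_file, cap_dict)
```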
73 | -------------------------------------------------------------------------------- /data/synonyms.txt: -------------------------------------------------------------------------------- 1 | person, girl, boy, man, woman, kid, child, chef, baker, people, adult, rider, children, baby, worker, passenger, sister, biker, policeman, cop, officer, lady, cowboy, bride, groom, male, female, guy, traveler, mother, father, gentleman, pitcher, player, skier, snowboarder, skater, skateboarder, person, woman, guy, foreigner, child, gentleman, caller, offender, coworker, trespasser, patient, politician, soldier, grandchild, serviceman, walker, drinker, doctor, bicyclist, thief, buyer, teenager, student, camper, driver, solider, hunter, shopper, villager 2 | bicycle, bike, bicycle, bike, unicycle, minibike, trike 3 | car, automobile, van, minivan, sedan, suv, hatchback, cab, jeep, coupe, taxicab, limo, taxi 4 | motorcycle, scooter, motor bike, motor cycle, motorbike, scooter, moped 5 | airplane, jetliner, plane, air plane, monoplane, aircraft, jet, jetliner, airbus, biplane, seaplane 6 | bus, minibus, trolley 7 | train, locomotive, tramway, caboose 8 | truck, pickup, lorry, hauler, firetruck 9 | boat, ship, liner, sailboat, motorboat, dinghy, powerboat, speedboat, canoe, skiff, yacht, kayak, catamaran, pontoon, houseboat, vessel, rowboat, trawler, ferryboat, watercraft, tugboat, schooner, barge, ferry, sailboard, paddleboat, lifeboat, freighter, steamboat, riverboat, battleship, steamship 10 | traffic light, street light, traffic signal, stop light, streetlight, stoplight 11 | fire hydrant, hydrant 12 | stop sign 13 | parking meter 14 | bench, pew 15 | bird, ostrich, owl, seagull, goose, duck, parakeet, falcon, robin, pelican, waterfowl, heron, hummingbird, mallard, finch, pigeon, sparrow, seabird, osprey, blackbird, fowl, shorebird, woodpecker, egret, chickadee, quail, bluebird, kingfisher, buzzard, willet, gull, swan, bluejay, flamingo, cormorant, parrot, loon, gosling, waterbird, pheasant, rooster, sandpiper, crow, raven, turkey, oriole, cowbird, warbler, magpie, peacock, cockatiel, lorikeet, puffin, vulture, condor, macaw, peafowl, cockatoo, songbird 16 | cat, kitten, feline, tabby 17 | dog, puppy, beagle, pup, chihuahua, schnauzer, dachshund, rottweiler, canine, pitbull, collie, pug, terrier, poodle, labrador, doggie, doberman, mutt, doggy, spaniel, bulldog, sheepdog, weimaraner, corgi, cocker, greyhound, retriever, brindle, hound, whippet, husky 18 | horse, colt, pony, racehorse, stallion, equine, mare, foal, palomino, mustang, clydesdale, bronc, bronco 19 | sheep, lamb, ram, lamb, goat, ewe 20 | cow, cattle, oxen, ox, calf, cattle, holstein, heifer, buffalo, bull, zebu, bison 21 | elephant 22 | bear, panda 23 | zebra 24 | giraffe 25 | backpack, knapsack 26 | umbrella 27 | handbag, wallet, purse, briefcase 28 | tie, bow, bow tie 29 | suitcase, suit case, luggage 30 | frisbee 31 | skis, ski 32 | snowboard 33 | sports ball, ball 34 | kite 35 | baseball bat 36 | baseball glove 37 | skateboard 38 | surfboard, longboard, skimboard, shortboard, wakeboard 39 | tennis racket, racket 40 | bottle 41 | wine glass 42 | cup 43 | fork 44 | knife, pocketknife, knive 45 | spoon 46 | bowl, container 47 | banana 48 | apple 49 | sandwich, burger, sub, cheeseburger, hamburger 50 | orange 51 | broccoli 52 | carrot 53 | hot dog 54 | pizza 55 | donut, doughnut, bagel 56 | cake, cheesecake, cupcake, shortcake, coffeecake, pancake 57 | chair, seat, stool 58 | couch, sofa, recliner, futon, loveseat, settee, chesterfield 59 
| potted plant, houseplant 60 | bed 61 | dining table, table, desk 62 | toilet, urinal, commode, toilet, lavatory, potty 63 | tv, monitor, televison, television 64 | laptop, computer, notebook, netbook, lenovo, macbook, laptop computer 65 | mouse 66 | remote 67 | keyboard 68 | cell phone, mobile phone, phone, cellphone, telephone, phon, smartphone, iPhone 69 | microwave 70 | oven, stovetop, stove, stove top oven 71 | toaster 72 | sink 73 | refrigerator, fridge, fridge, freezer 74 | book 75 | clock 76 | vase 77 | scissors 78 | teddy bear, teddybear 79 | hair drier, hairdryer 80 | toothbrush 81 | -------------------------------------------------------------------------------- /figure4.py: -------------------------------------------------------------------------------- 1 | from utils import misc 2 | from utils import chair 3 | import argparse 4 | import json 5 | import os 6 | 7 | parser = argparse.ArgumentParser() 8 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 9 | args = parser.parse_args() 10 | 11 | figure4_tags_karpathy = [('TD', 'td_beam1_test'), 12 | ('No Att', 'td-noatt_beam1_test'), 13 | ('No Conv', 'td-noconv_beam1_test'), 14 | ('Single', 'td-single_beam1_test'), 15 | ('FC', 'td-fc_beam1_test')] 16 | 17 | print "=================Karpathy Split=================" 18 | print "Model\tCHAIRi\tLM Consistency\tIM Consistency" 19 | 20 | for tag in figure4_tags_karpathy: 21 | 22 | chair_i, lm_consistency, im_consistency = misc.get_consistency(tag[1], 23 | args.annotation_path, 24 | robust=False) 25 | 26 | print "%s\t%0.04f\t%0.04f\t\t%0.04f" %(tag[0], 27 | chair_i, 28 | lm_consistency, 29 | im_consistency) 30 | 31 | print "=================Robust Split=================" 32 | print "Model\tCHAIRi\tLM Consistency\tIM Consistency" 33 | 34 | figure4_tags_robust = [('TD', 'td-robust_beam1_test'), 35 | ('No Att', 'td-noatt-robust_beam1_test'), 36 | ('No Conv', 'td-noconv-robust_beam1_test'), 37 | ('Single', 'td-single-robust_beam1_test'), 38 | ('FC', 'td-fc-robust_beam1_test')] 39 | 40 | #generate hallucination files for robust split for fig 4 41 | evaluator = None 42 | output_template = "output/hallucination/hallucinated_words_%s.json" 43 | sentence_template = "generated_sentences/%s.json" 44 | for tag in figure4_tags_robust: 45 | if not os.path.exists(output_template %tag[1]): 46 | if not evaluator: 47 | _, imids, _ = chair.load_generated_captions(sentence_template %figure4_tags_robust[0][1]) 48 | evaluator = chair.CHAIR(imids, args.annotation_path) 49 | evaluator.get_annotations() 50 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 51 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 52 | 53 | for tag in figure4_tags_robust: 54 | 55 | chair_i, lm_consistency, im_consistency = misc.get_consistency(tag[1], 56 | args.annotation_path, 57 | robust=True) 58 | 59 | print "%s\t%0.04f\t%0.04f\t\t%0.04f" %(tag[0], 60 | chair_i, 61 | lm_consistency, 62 | im_consistency) 63 | -------------------------------------------------------------------------------- /figure6.py: -------------------------------------------------------------------------------- 1 | from utils import misc 2 | 3 | template = './output/hallucination/hallucinated_words_%s.json' 4 | fc_hallucination = template %'fc_beam5_test' 5 | td_hallucination = template %'td_beam5_test' 6 | 7 | diffs = misc.predictive_metrics(fc_hallucination, td_hallucination) 8 | 9 | print "Differences in Hallucination for sentences with similar SPICE score:" 10 | print "\t(comparing fc and td 
models)" 11 | 12 | for i in range(0, 100, 10): 13 | print "Between %d-%d:\t%0.04f" %(i, i+10, diffs[i/10]) 14 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.14.2 2 | nltk==3.2.5 3 | pattern==3.6 4 | pycocotools==2.0.0 5 | -------------------------------------------------------------------------------- /setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #Download generated sentences 4 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/generated_sentences.zip 5 | unzip generated_sentences.zip 6 | rm -r generated_sentences.zip 7 | 8 | #Download 9 | mkdir output 10 | mkdir output/hallucination 11 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/intermediate_image.zip 12 | unzip intermediate_image.zip 13 | rm -r intermediate_image.zip 14 | 15 | cd data 16 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/gt_labels.p 17 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/vocab.p 18 | wget https://people.eecs.berkeley.edu/~lisa_anne/hallucination/human_scores.zip 19 | unzip human_scores.zip 20 | rm -r human_scores.zip 21 | 22 | cd .. 23 | -------------------------------------------------------------------------------- /table1.py: -------------------------------------------------------------------------------- 1 | from utils import chair 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 6 | args = parser.parse_args() 7 | 8 | sentence_template = 'generated_sentences/%s.json' 9 | table1_tags = [('LRCN', 'lrcn_beam5_test', 'lrcn-sc_beam5_test'), 10 | ('FC', 'fc_beam5_test', 'fc-sc_beam5_test'), 11 | ('att2in', 'att2in_beam5_test', 'att2in-sc_beam5_test'), 12 | ('TD', 'td_beam5_test', 'td-sc_beam5_test'), 13 | ('TD-BB', 'td-bb_beam5_test', 'td-bb-sc_beam5_test'), 14 | ('NBT', 'nbt_beam5_test'), 15 | ('GAN', 'baseline-gan_beam5_test', 'gan_beam5_test')] 16 | 17 | _, imids, _ = chair.load_generated_captions(sentence_template %table1_tags[0][1]) 18 | 19 | evaluator = chair.CHAIR(imids, args.annotation_path) 20 | evaluator.get_annotations() 21 | 22 | print "\t\tCross Entropy\t\t\t\tSelf-Critical\t\t" 23 | print "Model\tSPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi\t|SPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi" 24 | 25 | for tag in table1_tags: 26 | 27 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 28 | metric_string_ce = chair.print_metrics(cap_dict, True) 29 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 30 | if len(tag) > 2: 31 | cap_dict = evaluator.compute_chair(sentence_template %tag[2]) 32 | metric_string_sc = chair.print_metrics(cap_dict, True) 33 | chair.save_hallucinated_words(sentence_template %tag[2], cap_dict) 34 | else: 35 | metric_string_sc = "-\t-\t-\t-\t-" 36 | print "%s\t%s\t|%s" %(tag[0], metric_string_ce, metric_string_sc) 37 | -------------------------------------------------------------------------------- /table2.py: -------------------------------------------------------------------------------- 1 | from utils import chair 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 6 | args = parser.parse_args() 7 | 8 | sentence_template = 'generated_sentences/%s.json' 9 | table2_tags = [('FC', 
'fc-robust_beam5_test'), 10 | ('att2in', 'att2in-robust_beam5_test'), 11 | ('TD', 'td-robust_beam5_test'), 12 | ('NBT', 'nbt-robust_beam5_test')] 13 | 14 | _, imids, _ = chair.load_generated_captions(sentence_template %table2_tags[0][1]) 15 | 16 | evaluator = chair.CHAIR(imids, args.annotation_path) 17 | evaluator.get_annotations() 18 | 19 | print "\t\tCross Entropy\t\t\t" 20 | print "Model\tSPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi" 21 | 22 | for tag in table2_tags: 23 | 24 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 25 | metric_string = chair.print_metrics(cap_dict, True) 26 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 27 | print "%s\t%s\t" %(tag[0], metric_string) 28 | -------------------------------------------------------------------------------- /table3.py: -------------------------------------------------------------------------------- 1 | from utils import chair 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 6 | args = parser.parse_args() 7 | 8 | sentence_template = 'generated_sentences/%s.json' 9 | table3_tags = [('TD', 'td_beam1_test'), 10 | ('No Att', 'td-noatt_beam1_test'), 11 | ('No Conv', 'td-noconv_beam1_test'), 12 | ('Single', 'td-single_beam1_test'), 13 | ('FC', 'td-fc_beam1_test')] 14 | 15 | _, imids, _ = chair.load_generated_captions(sentence_template %table3_tags[0][1]) 16 | 17 | evaluator = chair.CHAIR(imids, args.annotation_path) 18 | evaluator.get_annotations() 19 | 20 | print "\t\tCross Entropy\t\t\t" 21 | print "Model\tSPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi" 22 | 23 | for tag in table3_tags: 24 | 25 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 26 | metric_string = chair.print_metrics(cap_dict, True) 27 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 28 | print "%s\t%s\t" %(tag[0], metric_string) 29 | -------------------------------------------------------------------------------- /table4.py: -------------------------------------------------------------------------------- 1 | from utils import misc 2 | import os 3 | from utils import chair 4 | import argparse 5 | 6 | parser = argparse.ArgumentParser() 7 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 8 | args = parser.parse_args() 9 | 10 | output_template = "output/hallucination/hallucinated_words_%s.json" 11 | sentence_template = "generated_sentences/%s.json" 12 | 13 | table4_tags = [('FC', 'fc_beam5_test'), 14 | ('att2in', 'att2in_beam5_test'), 15 | ('td', 'td_beam5_test')] 16 | 17 | 18 | _, imids, _ = chair.load_generated_captions(sentence_template %table4_tags[0][1]) 19 | evaluator = chair.CHAIR(imids, args.annotation_path) 20 | evaluator.get_annotations() 21 | 22 | print "Model\tCIDEr\tMETEOR\tSPICE" 23 | 24 | for tag in table4_tags: 25 | 26 | if not os.path.exists(output_template %tag[1]): 27 | cap_dict = evaluator.compute_chair(sentence_template %tag[1]) 28 | chair.save_hallucinated_words(sentence_template %tag[1], cap_dict) 29 | 30 | cider, meteor, spice = misc.score_correlation(output_template %tag[1], 31 | quiet=True) 32 | print "%s\t%0.03f\t%0.03f\t%0.03f" %(tag[0], cider, meteor, spice) 33 | -------------------------------------------------------------------------------- /table5.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | Compute correlation between sentence scores and human scores. 
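For each image, Pearson's correlation between a sentence metric (optionally combined with 1-CHAIRs or 1-CHAIRi) and the human scores is computed across the five model captions, then averaged over all images; see the inline comments below.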
4 | """ 5 | 6 | import glob 7 | import json 8 | import numpy as np 9 | 10 | ########################################################################## 11 | def correctForNan(inputVec): 12 | # If all the values are identical, introduce a small perturbation 13 | outputVec = inputVec[:] 14 | item = inputVec[0] 15 | allEqual = True 16 | for it in inputVec: 17 | if it <> item: 18 | allEqual = False 19 | break 20 | 21 | if allEqual: 22 | outputVec[np.random.choice(range(len(inputVec)))] += 0.001 23 | 24 | return outputVec 25 | ########################################################################## 26 | 27 | SCORES_PATH = 'output/hallucination/' 28 | 29 | f_scores = [SCORES_PATH+'hallucinated_words_baseline-gan_beam5_test.json', 30 | SCORES_PATH+'hallucinated_words_nbt_beam5_test.json', 31 | SCORES_PATH+'hallucinated_words_gan_beam5_test.json', 32 | SCORES_PATH+'hallucinated_words_td_beam5_test.json', 33 | SCORES_PATH+'hallucinated_words_td-sc_beam5_test.json'] 34 | 35 | ids_file = 'data/human_scores/' + 'imageIDs.txt' # images used in the human evaluation 36 | of = open(ids_file, 'r') 37 | image_ids = of.read().split('\n') 38 | 39 | HUMAN_SCORES_PATH = 'data/human_scores/0*.txt' 40 | f_human_subj_scores = glob.glob(HUMAN_SCORES_PATH) 41 | f_human_subj_scores.sort() 42 | 43 | MODELS = ['MPI-CE', 'NBT-CE', 'MPI-GAN', 'TD-CE', 'TD-SC'] 44 | SCORES = ['B@1', 'B@2', 'B@3', 'B@4', 'R', 'M', 'C', 'S'] 45 | CHAIR = ['1-CHs', '1-CHi'] 46 | 47 | NIMAGES = len(image_ids) 48 | NCAPTIONS = len(MODELS) 49 | 50 | s_scores_m = {} # sentence scores 51 | c_scores_m = {} # chair scores 52 | h_s_scores_m = {} # human scores 53 | 54 | for i, fn in enumerate(f_scores): 55 | s_scores_m[i] = [None] * NIMAGES 56 | c_scores_m[i] = [None] * NIMAGES 57 | of = open(fn, 'r') 58 | f_data = json.load(open(fn, 'r')) 59 | f_data = f_data['sentences'] 60 | of.close() 61 | for item in f_data: 62 | im_id = item['image_id'] 63 | if str(im_id) not in image_ids: 64 | continue 65 | metrics = item['metrics'] 66 | b1 = metrics['Bleu_1'] 67 | b2 = metrics['Bleu_2'] 68 | b3 = metrics['Bleu_3'] 69 | b4 = metrics['Bleu_4'] 70 | rl = metrics['ROUGE_L'] 71 | me = metrics['METEOR'] 72 | ci = metrics['CIDEr'] 73 | sp = metrics['SPICE']['All']['f'] 74 | ind = image_ids.index(str(im_id)) 75 | s_scores_m[i][ind] = [b1, b2, b3, b4, rl, me, ci, sp] 76 | # 77 | ch_s = metrics['CHAIRs'] 78 | ch_i = metrics['CHAIRi'] 79 | c_scores_m[i][ind] = [1-ch_s, 1-ch_i] 80 | 81 | for i, fn in enumerate(f_human_subj_scores): 82 | h_s_scores_m[i] = [] 83 | of = open(fn, 'r') 84 | f_data = of.read().split('\n') 85 | if f_data[-1] == '': 86 | f_data = f_data[0:-1] 87 | for line in f_data: 88 | items = line.split('\t') 89 | h_s_scores_m[i].append([float(x) for x in (items[1:])]) 90 | 91 | # PEARSON'S correlation across NCAPTIONS=5 captions per image (from each system), averaged over NIMAGES=500 images 92 | 93 | corr_s_s = [0] * len(SCORES) # correlation between sentence scores and human scores 94 | corr_s_cs_s = [0] * len(SCORES) # correlation between sentence scores+(1-CHs) and human scores 95 | corr_s_ci_s = [0] * len(SCORES) # correlation between sentence scores+(1-CHi) and human scores 96 | 97 | for im in range(NIMAGES): 98 | s_m = [] 99 | c_m = [] 100 | h_s_m = [] 101 | for i in range(NCAPTIONS): 102 | s_m.append(s_scores_m[i][im]) 103 | c_m.append(c_scores_m[i][im]) 104 | h_s_m.append(h_s_scores_m[i][im]) 105 | for metric in range(len(SCORES)): 106 | corr = np.corrcoef(correctForNan([x[metric] for x in s_m]), correctForNan([x[0] for x in h_s_m]))[0][1] 107 | 
corr_s_s[metric] += corr 108 | 109 | for metric in range(len(SCORES)): 110 | #ch_s 111 | corr = np.corrcoef(correctForNan([x[metric] + c_m[i][0] for i, x in enumerate(s_m)]), correctForNan([x[0] for x in h_s_m]))[0][1] 112 | corr_s_cs_s[metric] += corr 113 | #ch_i 114 | corr = np.corrcoef(correctForNan([x[metric] + c_m[i][1] for i, x in enumerate(s_m)]), correctForNan([x[0] for x in h_s_m]))[0][1] 115 | corr_s_ci_s[metric] += corr 116 | 117 | print "Metric\tCorrelation" 118 | 119 | for metric in [5,6,7]: # focus on 'M', 'C', 'S' 120 | print('%s\t%.04f' % (SCORES[metric], corr_s_s[metric]/float(NIMAGES))) 121 | 122 | for metric in [5,6,7]: # focus on 'M', 'C', 'S' 123 | print('%s\t%.04f' % (SCORES[metric]+'+'+CHAIR[0], corr_s_cs_s[metric]/float(NIMAGES))) 124 | 125 | for metric in [5,6,7]: # focus on 'M', 'C', 'S' 126 | print('%s\t%.04f' % (SCORES[metric]+'+'+CHAIR[1], corr_s_ci_s[metric]/float(NIMAGES))) 127 | 128 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- 1 | #Blank 2 | -------------------------------------------------------------------------------- /utils/chair.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import nltk 4 | import json 5 | from pattern.en import singularize 6 | import argparse 7 | from misc import * 8 | 9 | lemma = nltk.wordnet.WordNetLemmatizer() 10 | 11 | def combine_coco_captions(annotation_path): 12 | 13 | if not os.path.exists('%s/captions_%s2014.json' %(annotation_path, 'val')): 14 | raise Exception("Please download MSCOCO caption annotations for val set") 15 | if not os.path.exists('%s/captions_%s2014.json' %(annotation_path, 'train')): 16 | raise Exception("Please download MSCOCO caption annotations for train set") 17 | 18 | val_caps = json.load(open('%s/captions_%s2014.json' %(annotation_path, 'val'))) 19 | train_caps = json.load(open('%s/captions_%s2014.json' %(annotation_path, 'train'))) 20 | all_caps = {'info': train_caps['info'], 21 | 'licenses': train_caps['licenses'], 22 | 'images': val_caps['images'] + train_caps['images'], 23 | 'annotations': val_caps['annotations'] + train_caps['annotations']} 24 | 25 | return all_caps 26 | 27 | def combine_coco_instances(annotation_path): 28 | 29 | if not os.path.exists('%s/instances_%s2014.json' %(annotation_path, 'val')): 30 | raise Exception("Please download MSCOCO instance annotations for val set") 31 | if not os.path.exists('%s/instances_%s2014.json' %(annotation_path, 'train')): 32 | raise Exception("Please download MSCOCO instance annotations for train set") 33 | 34 | val_instances = json.load(open('%s/instances_%s2014.json' %(annotation_path, 'val'))) 35 | train_instances = json.load(open('%s/instances_%s2014.json' %(annotation_path, 'train'))) 36 | all_instances = {'info': train_instances['info'], 37 | 'licenses': train_instances['licenses'], 38 | 'type': train_instances.get('type', 'instances'), #older COCO annotation files carry a 'type' field 39 | 'categories': train_instances['categories'], 40 | 'images': train_instances['images'] + val_instances['images'], 41 | 'annotations': val_instances['annotations'] + train_instances['annotations']} 42 | 43 | return all_instances 44 | 45 | class CHAIR(object): 46 | 47 | def __init__(self, imids, coco_path): 48 | 49 | self.imid_to_objects = {imid: [] for imid in imids} 50 | 51 | self.coco_path = coco_path 52 | 53 | #read in synonyms 54 | synonyms = open('data/synonyms.txt').readlines() 55 | synonyms = 
[s.strip().split(', ') for s in synonyms] 56 | self.mscoco_objects = [] #mscoco objects and *all* synonyms 57 | self.inverse_synonym_dict = {} 58 | for synonym in synonyms: 59 | self.mscoco_objects.extend(synonym) 60 | for s in synonym: 61 | self.inverse_synonym_dict[s] = synonym[0] 62 | 63 | #Some hard coded rules for implementing CHAIR metrics on MSCOCO 64 | 65 | #common 'double words' in MSCOCO that should be treated as a single word 66 | coco_double_words = ['motor bike', 'motor cycle', 'air plane', 'traffic light', 'street light', 'traffic signal', 'stop light', 'fire hydrant', 'stop sign', 'parking meter', 'suit case', 'sports ball', 'baseball bat', 'baseball glove', 'tennis racket', 'wine glass', 'hot dog', 'cell phone', 'mobile phone', 'teddy bear', 'hair drier', 'potted plant', 'bow tie', 'laptop computer', 'stove top oven', 'hot dog', 'teddy bear', 'home plate', 'train track'] 67 | 68 | #Hard code some rules for special cases in MSCOCO 69 | #qualifiers like 'baby' or 'adult' animal will lead to a false fire for the MSCOCO object 'person'. 'baby bird' --> 'bird'. 70 | animal_words = ['bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'animal', 'cub'] 71 | #qualifiers like 'passenger' vehicle will lead to a false fire for the MSCOCO object 'person'. 'passenger jet' --> 'jet'. 72 | vehicle_words = ['jet', 'train'] 73 | 74 | #double_word_dict will map double words to the word they should be treated as in our analysis 75 | 76 | self.double_word_dict = {} 77 | for double_word in coco_double_words: 78 | self.double_word_dict[double_word] = double_word 79 | for animal_word in animal_words: 80 | self.double_word_dict['baby %s' %animal_word] = animal_word 81 | self.double_word_dict['adult %s' %animal_word] = animal_word 82 | for vehicle_word in vehicle_words: 83 | self.double_word_dict['passenger %s' %vehicle_word] = vehicle_word 84 | self.double_word_dict['bow tie'] = 'tie' 85 | self.double_word_dict['toilet seat'] = 'toilet' 86 | self.double_word_dict['wine glas'] = 'wine glass' #not a typo: singularize() reduces 'glass' to 'glas', so 'wine glass' reaches this lookup as 'wine glas' 87 | 88 | def _load_generated_captions_into_evaluator(self, cap_file): 89 | 90 | ''' 91 | Meant to save time so imid_to_objects does not always need to be recomputed. 
92 | ''' 93 | #Read in captions 94 | self.caps, imids, self.metrics = load_generated_captions(cap_file) 95 | 96 | assert imids == set(self.imid_to_objects.keys()) 97 | 98 | def caption_to_words(self, caption): 99 | 100 | ''' 101 | Input: caption 102 | Output: MSCOCO words in the caption 103 | ''' 104 | 105 | #standard preprocessing 106 | words = nltk.word_tokenize(caption.lower()) 107 | words = [singularize(w) for w in words] 108 | 109 | #replace double words 110 | i = 0 111 | double_words = [] 112 | idxs = [] 113 | while i < len(words): 114 | idxs.append(i) 115 | double_word = ' '.join(words[i:i+2]) 116 | if double_word in self.double_word_dict: 117 | double_words.append(self.double_word_dict[double_word]) 118 | i += 2 119 | else: 120 | double_words.append(words[i]) 121 | i += 1 122 | words = double_words 123 | 124 | #toilet seat is not chair (sentences like "the seat of the toilet" will fire for "chair" if we do not include this line) 125 | if ('toilet' in words) & ('seat' in words): words = [word for word in words if word != 'seat'] 126 | 127 | #get synonyms for all words in the caption 128 | idxs = [idxs[idx] for idx, word in enumerate(words) \ 129 | if word in set(self.mscoco_objects)] 130 | words = [word for word in words if word in set(self.mscoco_objects)] 131 | node_words = [] 132 | for word in words: 133 | node_words.append(self.inverse_synonym_dict[word]) 134 | #return all the MSCOCO objects in the caption 135 | return words, node_words, idxs, double_words 136 | 137 | def get_annotations_from_segments(self): 138 | ''' 139 | Add objects taken from MSCOCO segmentation masks 140 | ''' 141 | 142 | coco_segments = combine_coco_instances(self.coco_path ) 143 | segment_annotations = coco_segments['annotations'] 144 | 145 | #make dict linking object name to ids 146 | id_to_name = {} #dict with id to synsets 147 | for cat in coco_segments['categories']: 148 | id_to_name[cat['id']] = cat['name'] 149 | 150 | for i, annotation in enumerate(segment_annotations): 151 | sys.stdout.write("\rGetting annotations for %d/%d segmentation masks" 152 | %(i, len(segment_annotations))) 153 | imid = annotation['image_id'] 154 | if imid in self.imid_to_objects: 155 | node_word = self.inverse_synonym_dict[id_to_name[annotation['category_id']]] 156 | self.imid_to_objects[imid].append(node_word) 157 | print "\n" 158 | for imid in self.imid_to_objects: 159 | self.imid_to_objects[imid] = set(self.imid_to_objects[imid]) 160 | 161 | def get_annotations_from_captions(self): 162 | ''' 163 | Add objects taken from MSCOCO ground truth captions 164 | ''' 165 | 166 | coco_caps = combine_coco_captions(self.coco_path) 167 | caption_annotations = coco_caps['annotations'] 168 | 169 | for i, annotation in enumerate(caption_annotations): 170 | sys.stdout.write('\rGetting annotations for %d/%d ground truth captions' 171 | %(i, len(coco_caps['annotations']))) 172 | imid = annotation['image_id'] 173 | if imid in self.imid_to_objects: 174 | _, node_words, _, _ = self.caption_to_words(annotation['caption']) 175 | self.imid_to_objects[imid].update(node_words) 176 | print "\n" 177 | 178 | for imid in self.imid_to_objects: 179 | self.imid_to_objects[imid] = set(self.imid_to_objects[imid]) 180 | 181 | def get_annotations(self): 182 | 183 | ''' 184 | Get annotations from both segmentation and captions. Need both annotation types for CHAIR metric. 
185 | ''' 186 | 187 | self.get_annotations_from_segments() 188 | self.get_annotations_from_captions() 189 | 190 | def compute_chair(self, cap_file): 191 | 192 | ''' 193 | Given ground truth objects and generated captions, determine which sentences have hallucinated words. 194 | ''' 195 | 196 | self._load_generated_captions_into_evaluator(cap_file) 197 | 198 | imid_to_objects = self.imid_to_objects 199 | caps = self.caps 200 | 201 | num_caps = 0. 202 | num_hallucinated_caps = 0. 203 | hallucinated_word_count = 0. 204 | coco_word_count = 0. 205 | 206 | output = {'sentences': []} 207 | 208 | for i, cap_eval in enumerate(caps): 209 | 210 | cap = cap_eval['caption'] 211 | imid = cap_eval['image_id'] 212 | 213 | #get all words in the caption, as well as corresponding node word 214 | words, node_words, idxs, raw_words = self.caption_to_words(cap) 215 | 216 | gt_objects = imid_to_objects[imid] 217 | cap_dict = {'image_id': cap_eval['image_id'], 218 | 'caption': cap, 219 | 'mscoco_hallucinated_words': [], 220 | 'mscoco_gt_words': list(gt_objects), 221 | 'mscoco_generated_words': list(node_words), 222 | 'hallucination_idxs': [], 223 | 'words': raw_words 224 | } 225 | 226 | cap_dict['metrics'] = {'Bleu_1': cap_eval['Bleu_1'], 227 | 'Bleu_2': cap_eval['Bleu_2'], 228 | 'Bleu_3': cap_eval['Bleu_3'], 229 | 'Bleu_4': cap_eval['Bleu_4'], 230 | 'METEOR': cap_eval['METEOR'], 231 | 'CIDEr': cap_eval['CIDEr'], 232 | 'SPICE': cap_eval['SPICE'], 233 | 'ROUGE_L': cap_eval['ROUGE_L'], 234 | 'CHAIRs': 0, 235 | 'CHAIRi': 0} 236 | 237 | #count hallucinated words 238 | coco_word_count += len(node_words) 239 | hallucinated = False 240 | for word, node_word, idx in zip(words, node_words, idxs): 241 | if node_word not in gt_objects: 242 | hallucinated_word_count += 1 243 | cap_dict['mscoco_hallucinated_words'].append((word, node_word)) 244 | cap_dict['hallucination_idxs'].append(idx) 245 | hallucinated = True 246 | 247 | #count hallucinated caps 248 | num_caps += 1 249 | if hallucinated: 250 | num_hallucinated_caps += 1 251 | 252 | cap_dict['metrics']['CHAIRs'] = int(hallucinated) 253 | cap_dict['metrics']['CHAIRi'] = 0. 
254 | if len(words) > 0: 255 | cap_dict['metrics']['CHAIRi'] = len(cap_dict['mscoco_hallucinated_words'])/float(len(words)) 256 | 257 | output['sentences'].append(cap_dict) 258 | 259 | chair_s = (num_hallucinated_caps/num_caps) 260 | chair_i = (hallucinated_word_count/coco_word_count) 261 | 262 | output['overall_metrics'] = {'Bleu_1': self.metrics['Bleu_1'], 263 | 'Bleu_2': self.metrics['Bleu_2'], 264 | 'Bleu_3': self.metrics['Bleu_3'], 265 | 'Bleu_4': self.metrics['Bleu_4'], 266 | 'METEOR': self.metrics['METEOR'], 267 | 'CIDEr': self.metrics['CIDEr'], 268 | 'SPICE': self.metrics['SPICE'], 269 | 'ROUGE_L': self.metrics['ROUGE_L'], 270 | 'CHAIRs': chair_s, 271 | 'CHAIRi': chair_i} 272 | 273 | return output 274 | 275 | def load_generated_captions(cap_file): 276 | #Read in captions 277 | caps = json.load(open(cap_file)) 278 | try: 279 | metrics = caps['overall'] 280 | caps = caps['imgToEval'].values() 281 | imids = set([cap['image_id'] for cap in caps]) 282 | except: 283 | raise Exception("Expected the caption file to be a dictionary with generated captions stored under the key 'imgToEval'") 284 | 285 | return caps, imids, metrics 286 | 287 | def save_hallucinated_words(cap_file, cap_dict): 288 | tag = cap_file.split('/')[-1] 289 | with open('output/hallucination/hallucinated_words_%s' %tag, 'w') as f: 290 | json.dump(cap_dict, f) 291 | 292 | def print_metrics(hallucination_cap_dict, quiet=False): 293 | sentence_metrics = hallucination_cap_dict['overall_metrics'] 294 | metric_string = "%0.01f\t%0.01f\t%0.01f\t%0.01f\t%0.01f" %( 295 | sentence_metrics['SPICE']*100, 296 | sentence_metrics['METEOR']*100, 297 | sentence_metrics['CIDEr']*100, 298 | sentence_metrics['CHAIRs']*100, 299 | sentence_metrics['CHAIRi']*100) 300 | 301 | if not quiet: 302 | print "SPICE\tMETEOR\tCIDEr\tCHAIRs\tCHAIRi" 303 | print metric_string 304 | 305 | else: 306 | return metric_string 307 | 308 | if __name__ == '__main__': 309 | parser = argparse.ArgumentParser() 310 | parser.add_argument("--cap_file", type=str, default='') 311 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 312 | args = parser.parse_args() 313 | 314 | _, imids, _ = load_generated_captions(args.cap_file) 315 | 316 | evaluator = CHAIR(imids, args.annotation_path) 317 | evaluator.get_annotations() 318 | cap_dict = evaluator.compute_chair(args.cap_file) 319 | 320 | print_metrics(cap_dict) 321 | save_hallucinated_words(args.cap_file, cap_dict) 322 | -------------------------------------------------------------------------------- /utils/im_consistency.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | import pickle as pkl 4 | import pdb 5 | import numpy as np 6 | from nltk import word_tokenize 7 | from pattern.en import singularize 8 | import nltk 9 | import argparse 10 | from misc import * 11 | 12 | def get_label_dicts(robust=False): 13 | if robust: 14 | label_dict = 'output/image_classifier/classifier_output_robust.p' 15 | else: 16 | label_dict = 'output/image_classifier/classifier_output.p' 17 | predicted_label_dict = pkl.load(open(label_dict, 'rb')) 18 | gt_label_dict = pkl.load(open('data/gt_labels.p', 'rb')) 19 | 20 | return predicted_label_dict, gt_label_dict 21 | 22 | def get_im_consistency(hallucination_by_imid, 23 | predicted_label_dict, 24 | gt_label_dict): 25 | 26 | total = 0. 27 | scores = 0. 
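# Image model consistency: for each hallucinated MSCOCO object, take the image
# classifier's raw output for that object class and average over all
# hallucinated words; a higher score means the image classifier also favors the
# objects the captioner hallucinated.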
28 | 29 | for i, imid in enumerate(hallucination_by_imid.keys()): 30 | 31 | item = hallucination_by_imid[imid] 32 | caption = item['caption'] 33 | caption_words = word_tokenize(caption.lower()) 34 | mscoco_words = [i[1] for i in item['mscoco_hallucinated_words']] 35 | 36 | predicted_labels = predicted_label_dict[imid]['predicted_classes'] 37 | raw_output = predicted_label_dict[imid]['raw_output'] 38 | raw_output_sorted = np.argsort(raw_output)[::-1] 39 | 40 | for mscoco_word in mscoco_words: 41 | value = raw_output[gt_label_dict['cat_to_idx'][mscoco_word]] 42 | scores += value 43 | total += 1 44 | 45 | return scores/total 46 | 47 | if __name__ == '__main__': 48 | parser = argparse.ArgumentParser() 49 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 50 | parser.add_argument("--tag", type=str, default='td-fc_beam1_test') 51 | parser.add_argument('--robust', dest='robust', action='store_true') 52 | parser.set_defaults(robust=False) 53 | args = parser.parse_args() 54 | 55 | #read hallucination file 56 | hallucinated_json = './output/hallucination/hallucinated_words_%s.json' %args.tag 57 | hallucination_by_imid = hallucination_file_to_dict(hallucinated_json) 58 | 59 | predicted_label_dict, gt_label_dict = get_label_dicts(args.robust) 60 | consistency = get_im_consistency(hallucination_by_imid, 61 | predicted_label_dict, 62 | gt_label_dict) 63 | print "Im consistency is: %0.04f" %consistency 64 | -------------------------------------------------------------------------------- /utils/lm_consistency.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import sys 3 | import json 4 | import os 5 | import pickle as pkl 6 | from nltk import word_tokenize 7 | from collections import defaultdict 8 | from pattern.en import singularize 9 | import pdb 10 | 11 | import argparse 12 | from misc import * 13 | 14 | def read_vocab(robust): 15 | 16 | #read vocab (maps index -> word); the robust flag is currently unused 17 | vocab = pkl.load(open('data/vocab.p', 'rb')) 18 | word_to_idx = defaultdict(lambda: 0) # word -> ix (index 0 assumed as the out-of-vocabulary fallback) 19 | for key, value in zip(vocab.keys(), vocab.values()): 20 | word_to_idx[value] = int(key) 21 | 22 | return word_to_idx 23 | 24 | def softmax(array): 25 | shift = array - np.max(array) 26 | return np.exp(shift)/np.sum(np.exp(shift)) 27 | 28 | 29 | def get_blank_prediction_path(tag): 30 | return './output/language_model_blank_input/%s/%%d.npy' %tag 31 | 32 | def get_lm_consistency(hallucination_by_imid, 33 | blank_lm_predictions, 34 | word_to_idx, 35 | quiet = False): 36 | 37 | word_hallucinated_idxs = 0. 38 | word_hallucinated_total = 0. 
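# Language model consistency: for each hallucinated word, rank it among the
# blank-input language model's predictions at that word's position; the value
# returned below is the inverse of the mean rank, so a higher score means the
# language model alone (without the image) was more inclined to produce the
# hallucinated word.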
39 | 40 | for i, imid in enumerate(sorted(hallucination_by_imid.keys())): 41 | if not quiet: 42 | sys.stdout.write("\r%d/%d" %(i, len(hallucination_by_imid.keys()))) 43 | probs = np.load(blank_lm_predictions %int(imid)) 44 | item = hallucination_by_imid[imid] 45 | caption = item['caption'] 46 | 47 | caption_words = word_tokenize(caption.lower()) 48 | mscoco_words = zip(item['hallucination_idxs'], \ 49 | [caption_words[i] for i in item['hallucination_idxs']]) 50 | 51 | for mscoco_word in mscoco_words: 52 | idx, word = mscoco_word 53 | word = word.split(' ')[0] 54 | word_probs = softmax(probs[idx,:]) 55 | sorted_objects = np.argsort(word_probs)[::-1] 56 | word_idx = np.where(sorted_objects == word_to_idx[word])[0][0] + 1 57 | word_hallucinated_idxs += word_idx 58 | word_hallucinated_total += 1 59 | 60 | return word_hallucinated_total/word_hallucinated_idxs 61 | 62 | if __name__ == '__main__': 63 | parser = argparse.ArgumentParser() 64 | parser.add_argument("--annotation_path", type=str, default='coco/annotations') 65 | parser.add_argument("--tag", type=str, default='td-fc_beam1_test') 66 | parser.add_argument('--robust', dest='robust', action='store_true') 67 | parser.set_defaults(robust=False) 68 | args = parser.parse_args() 69 | 70 | hallucinated_json = './output/hallucination/hallucinated_words_%s.json' %args.tag 71 | hallucination_by_imid = hallucination_file_to_dict(hallucinated_json) 72 | blank_lm_predictions = get_blank_prediction_path(args.tag) 73 | 74 | word_to_idx = read_vocab(args.robust) 75 | consistency = get_lm_consistency(hallucination_by_imid, \ 76 | blank_lm_predictions, \ 77 | word_to_idx) 78 | print "\nConsistency: %0.04f" %consistency 79 | -------------------------------------------------------------------------------- /utils/misc.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import json 4 | import numpy as np 5 | import lm_consistency as LM 6 | import im_consistency as IM 7 | from chair import * 8 | 9 | def hallucination_file_to_dict(hallucinated_json): 10 | 11 | hallucination_data = json.load(open(hallucinated_json)) 12 | hallucination_by_imid = {h['image_id']: h for h in hallucination_data['sentences']} 13 | 14 | return hallucination_by_imid 15 | 16 | def get_sentence_scores_from_hallucination_file(hallucination_file): 17 | hallucination = json.load(open(hallucination_file)) 18 | return hallucination['overall_metrics'] 19 | 20 | def get_consistency(tag, annotation_path, robust=False): 21 | 22 | #Load hallucination dict. If it does not exist, make it! 
23 | hallucinated_json = './output/hallucination/hallucinated_words_%s.json' %tag 24 | sentences = 'generated_sentences/%s.json' %tag 25 | 26 | if not os.path.exists(hallucinated_json): 27 | print "Computing hallucination file for tag %s" %tag 28 | sentence_template = 'generated_sentences/%s.json' 29 | _, imids, _ = load_generated_captions(sentence_template %tag) 30 | evaluator = CHAIR(imids, annotation_path) 31 | evaluator.get_annotations() 32 | cap_dict = evaluator.compute_chair(sentence_template %tag) 33 | save_hallucinated_words(sentence_template %tag, cap_dict) 34 | 35 | hallucination_by_imid = hallucination_file_to_dict(hallucinated_json) 36 | 37 | #LM consistency 38 | word_to_idx = LM.read_vocab(robust) 39 | blank_lm_predictions = LM.get_blank_prediction_path(tag) 40 | 41 | lm_consistency = LM.get_lm_consistency(hallucination_by_imid, \ 42 | blank_lm_predictions, \ 43 | word_to_idx, 44 | quiet=True) 45 | 46 | #IM consistency 47 | predicted_label_dict, gt_label_dict = IM.get_label_dicts(robust) 48 | im_consistency = IM.get_im_consistency(hallucination_by_imid, 49 | predicted_label_dict, 50 | gt_label_dict) 51 | 52 | #get chair scores for completeness 53 | scores = get_sentence_scores_from_hallucination_file(hallucinated_json) 54 | 55 | return scores['CHAIRi'], lm_consistency, im_consistency 56 | 57 | def score_correlation(cap_file, quiet=False): 58 | 59 | caps = json.load(open(cap_file)) 60 | 61 | ciders = [] 62 | meteors = [] 63 | spices = [] 64 | hallucinations = [] 65 | 66 | for cap in caps['sentences']: 67 | info = cap['metrics'] 68 | meteors.append(info['METEOR']) 69 | ciders.append(info['CIDEr']) 70 | spices.append(info['SPICE']['All']['f']) 71 | hallucinations.append(1 - info['CHAIRi']) 72 | 73 | meteors = np.array(meteors) 74 | ciders = np.array(ciders) 75 | spices = np.array(spices) 76 | hallucinations = np.array(hallucinations) 77 | 78 | cider_corr = np.corrcoef(ciders, hallucinations)[1][0] 79 | meteor_corr = np.corrcoef(meteors, hallucinations)[1][0] 80 | spice_corr = np.corrcoef(spices, hallucinations)[1][0] 81 | 82 | if not quiet: 83 | print "CIDEr and hallucination: %0.03f" %cider_corr 84 | print "METEOR and hallucination: %0.03f" %meteor_corr 85 | print "SPICE and hallucination: %0.03f" %spice_corr 86 | 87 | return cider_corr, meteor_corr, spice_corr 88 | 89 | def predictive_metrics(hallucinated_json_1, hallucinated_json_2): 90 | 91 | ''' 92 | Can sentence metrics predict hallucination? In section 3.4 of paper. 93 | ''' 94 | 95 | hallucination_data_1 = json.load(open(hallucinated_json_1)) 96 | hallucination_data_2 = json.load(open(hallucinated_json_2)) 97 | 98 | def bin_by_spice(data): 99 | #bin by spice scores 100 | spices = [] 101 | hallucinations = [] 102 | 103 | for cap in data['sentences']: 104 | info = cap['metrics'] 105 | spices.append(info['SPICE']['All']['f']) 106 | hallucinations.append(info['CHAIRs']) 107 | 108 | hist = [] 109 | for i in range(0, 100, 10): 110 | idxs = [idx for idx, spice in enumerate(spices) \ 111 | if (spice*100 >= i) and (spice*100 < (i+10))] 112 | if len(idxs) == 0: 113 | hist.append(0) 114 | else: 115 | hist.append(np.mean([hallucinations[idx] for idx in idxs])) 116 | return hist 117 | 118 | score_histogram_1 = bin_by_spice(hallucination_data_1) 119 | score_histogram_2 = bin_by_spice(hallucination_data_2) 120 | return list(np.array(score_histogram_1)-np.array(score_histogram_2)) 121 | --------------------------------------------------------------------------------