├── README.md ├── emnlp_data ├── .DS_Store ├── nq │ ├── random_prompts │ │ └── nq_test_random_prompt.txt │ └── random_testset │ │ ├── nq_ref │ │ ├── nq_ref_avg_coh_para │ │ ├── nq_ref_avg_lm_entropy │ │ ├── nq_ref_avg_sent_ppl │ │ ├── nq_ref_rel │ │ └── nq_test_random_testset.txt ├── testset_preprocess_scripts │ ├── cnt_knowledge_length.py │ ├── extract_ref.py │ ├── random_prompt_maker.py │ └── random_testset_maker.py └── wow │ ├── random_prompts │ ├── seen_random_prompt.txt │ └── unseen_random_prompt.txt │ └── random_testset │ ├── seen_random_testset.txt │ ├── seen_topic_pageviews.txt │ ├── unseen_random_testset.txt │ ├── unseen_topic_pageviews.txt │ └── wow_seen_knowledge_ref ├── env ├── .DS_Store ├── coherence_environment.yml └── environment.yml ├── framework.png ├── scripts ├── helpfulness │ ├── nq_random_knowledge.sh │ ├── nq_w_hyp_knowledge.sh │ ├── nq_w_ref_knowledge.sh │ ├── nq_wo_knowledge.sh │ ├── view_results.sh │ ├── wow_random_knowledge.sh │ └── wow_w_hyp_knowledge.sh ├── nq_coh_para.sh ├── nq_coh_sent.sh ├── nq_factuality.sh ├── nq_factuality_view.sh ├── nq_info.sh ├── nq_relevance.sh ├── nq_validity.sh ├── other │ ├── cal_factuality_for_DPR.sh │ ├── cal_factuality_for_knowledge.sh │ ├── cal_factuality_for_knowledge_IR.sh │ ├── cal_factuality_for_opt_knowledge_IR.sh │ ├── cal_factuality_for_refined_knowledge.sh │ ├── cal_factuality_for_refined_knowledge_IR.sh │ ├── cal_factuality_for_response.sh │ └── tmp.sh ├── view_coh_sent.sh ├── view_info.sh ├── view_nq_validity.sh ├── view_wow_validity.sh ├── wow_coh_para.sh ├── wow_coh_sent.sh ├── wow_factuality.sh ├── wow_factuality_view.sh ├── wow_info.sh ├── wow_relevance.sh └── wow_validity.sh └── src ├── claim_handling.py ├── discourse-coherence.py ├── eval_exp.py ├── helpfulness.py ├── info.py ├── nq_validity.py ├── ppl.py ├── relevance.py ├── tools.py └── wow_validity.py /README.md: -------------------------------------------------------------------------------- 1 | # Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators 2 | Welcome to the repository for our EMNLP 2023 paper, "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators." In this work, we introduce **CONNER** (COmpreheNsive kNowledge Evaluation fRamework), a systematic approach designed to evaluate the output of Large Language Models (LLMs) across key dimensions such as Factuality, Relevance, Coherence, Informativeness, Helpfulness, and Validity. 3 | 4 | Here, you'll find the necessary code and resources to replicate our findings and further explore the potential of LLMs. We hope they help facilitate your work in exploring the frontiers of LLMs with a touch of ease. 5 | 6 | ## CONNER Framework 7 | 8 | 9 | ### Intrinsic Evaluation 10 | 11 | - **Factuality:** Assessing the verifiability of the information against external evidence. 12 | - **Relevance:** Ensuring the knowledge aligns with the user's query intent. 13 | - **Coherence:** Evaluating the logical flow of information at both sentence and paragraph levels. 14 | - **Informativeness:** Measuring the novelty or unexpectedness of the knowledge provided. 15 | 16 | ### Extrinsic Evaluation 17 | 18 | - **Helpfulness:** Gauging whether the knowledge aids in enhancing performance on downstream tasks. 19 | - **Validity:** Certifying the factual accuracy of downstream task results when utilizing the knowledge. 
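To make the likelihood-based metrics above more concrete, the sketch below shows how a knowledge passage can be scored by its perplexity under GPT-neo-2.7B, the causal LM listed for sentence-level Coherence and Informativeness in the Model Sources table. This is a minimal illustration assuming the Hugging Face `transformers` API, not the repository's actual implementation (see `src/ppl.py` and `src/info.py` for that); the helper name `passage_perplexity` is illustrative only.

```python
# Minimal sketch (not the repo's exact metric code): perplexity of a passage under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-2.7B"  # model listed in the Model Sources table

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def passage_perplexity(text: str) -> float:
    """Exponentiated mean token-level negative log-likelihood (lower = more fluent/expected)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # out.loss is the mean token NLL
    return torch.exp(out.loss).item()

if __name__ == "__main__":
    print(passage_perplexity("The Eiffel Tower is a wrought-iron lattice tower in Paris."))
```

Lower perplexity indicates text the LM finds more fluent and predictable, which is broadly the intuition behind the sentence-level Coherence score; Informativeness, as described above, instead rewards knowledge that is novel or unexpected to the LM.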
20 | 21 | ## Getting Started 22 | 23 | #### Setting Up the Environment 24 | 25 | Begin by setting up your Conda environment with the provided `env/environment.yml` file, which will install all necessary packages and dependencies. (A second environment file, `env/coherence_environment.yml`, is also included under `env/`.) 26 | 27 | ```bash 28 | conda env create -f env/environment.yml -n CONNER 29 | conda activate CONNER 30 | ``` 31 | If you run into any missing packages or dependencies, please install them as needed. 32 | 33 | #### Evaluating Your LLMs 34 | Run the evaluation script that corresponds to your dataset and chosen metric. Replace `${data}` with your dataset (`nq` or `wow`) and `${metric}` with one of the following metrics: `factuality`, `relevance`, `info`, `coh_sent`, `coh_para`, `validity`, or `helpfulness`. 35 | ```bash 36 | # Run evaluation script. Example usage: 37 | # bash scripts/nq_factuality.sh 38 | # bash scripts/wow_relevance.sh 39 | bash scripts/${data}_${metric}.sh 40 | ``` 41 | #### Viewing Results 42 | Once the evaluation has finished, you can view the results with the corresponding viewing script: 43 | ```bash 44 | # Display the evaluation results. Example usage: 45 | # bash scripts/nq_factuality_view.sh 46 | # bash scripts/wow_factuality_view.sh 47 | bash scripts/${data}_${metric}_view.sh 48 | ``` 49 | Note that for some metrics the viewing script uses a `view_` prefix instead, e.g. `scripts/view_info.sh` and `scripts/view_coh_sent.sh`. 50 | #### Model Sources 51 | 52 | Below is a list of the models used in our CONNER framework for each metric: 53 | 54 | | Metric | Model | Source | 55 | |----------------------|---------------------------------|-----------------------------------------------------| 56 | | Factuality | NLI-RoBERTa-large, ColBERTv2 | [Hugging Face](https://huggingface.co/sentence-transformers/nli-roberta-large), [GitHub](https://github.com/stanford-futuredata/ColBERT) | 57 | | Relevance | BERT-ranking-large | [GitHub](https://github.com/nyu-dl/dl4marco-bert) | 58 | | Sentence-level Coherence | GPT-neo-2.7B | [Hugging Face](https://huggingface.co/EleutherAI/gpt-neo-2.7B) | 59 | | Paragraph-level Coherence | Coherence-Momentum | [Hugging Face](https://huggingface.co/aisingapore/coherence-momentum) | 60 | | Informativeness | GPT-neo-2.7B | [Hugging Face](https://huggingface.co/EleutherAI/gpt-neo-2.7B) | 61 | | Helpfulness | LLaMA-65B | [GitHub](https://github.com/facebookresearch/llama/tree/main) | 62 | | Validity | NLI-RoBERTa-large, ColBERTv2 | [Hugging Face](https://huggingface.co/sentence-transformers/nli-roberta-large), [GitHub](https://github.com/stanford-futuredata/ColBERT) | 63 | 64 | 65 | ## Citing Our Work 66 | If you find our work helpful in your research, please cite our paper: 67 | ```bibtex 68 | @misc{chen2023factuality, 69 | title={Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators}, 70 | author={Liang Chen and Yang Deng and Yatao Bian and Zeyu Qin and Bingzhe Wu and Tat-Seng Chua and Kam-Fai Wong}, 71 | year={2023}, 72 | eprint={2310.07289}, 73 | archivePrefix={arXiv}, 74 | primaryClass={cs.CL} 75 | } 76 | ``` 77 | -------------------------------------------------------------------------------- /emnlp_data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChanLiang/CONNER/77f99c876bdc6ca8cb3991210e2ccc2914d4971b/emnlp_data/.DS_Store -------------------------------------------------------------------------------- /emnlp_data/nq/random_testset/nq_ref_avg_coh_para: -------------------------------------------------------------------------------- 1 | [-13.119321823120117, 18.643884658813477, 14.975356101989746, -9.953686714172363, -22.64740753173828, 18.281700134277344,
-10.134472846984863, 10.106371879577637, -7.24709415435791, 2.947871685028076, 19.49839210510254, 10.568069458007812, 5.3517303466796875, 15.628636360168457, 13.229722023010254, -16.684036254882812, -21.012622833251953, 11.617283821105957, -5.790841102600098, 3.663580894470215, 7.896894931793213, 9.753914833068848, 16.251827239990234, 16.19835090637207, 12.03116512298584, 10.515118598937988, 15.596977233886719, 14.38183879852295, 16.412246704101562, 15.887755393981934, -15.057905197143555, 0.2845648229122162, 18.343692779541016, 14.351183891296387, 15.628971099853516, 18.301742553710938, -15.328641891479492, -0.5141409635543823, 10.937677383422852, 7.783780574798584, 1.3738312721252441, 14.359389305114746, -10.770936965942383, -3.065892219543457, -4.884547233581543, 14.948290824890137, -2.230806827545166, -8.666611671447754, 6.646633148193359, 6.719013690948486, 18.264400482177734, 7.262363433837891, 9.07824993133545, 11.578181266784668, 14.675372123718262, -11.40087890625, -13.6721830368042, -2.914580821990967, 10.05797290802002, -2.8793838024139404, 14.709553718566895, 10.51543140411377, 15.408014297485352, 10.95182991027832, 7.270349502563477, -1.7264076471328735, 14.45258903503418, 18.208837509155273, 9.560979843139648, 7.398679256439209, -8.244877815246582, 1.2515833377838135, 7.381077766418457, 6.716863632202148, -14.006756782531738, -8.48363971710205, 11.411901473999023, 12.59145450592041, -17.390769958496094, -3.3476669788360596, -1.9921391010284424, 10.428362846374512, 16.394018173217773, 9.265216827392578, 18.750652313232422, 11.602629661560059, 2.1918861865997314, 8.7633638381958, 0.5265696048736572, 5.030932426452637, -19.480724334716797, -12.329466819763184, -10.704183578491211, -26.26762580871582, -11.212485313415527, 12.00790786743164, -7.852105140686035, -19.85357666015625, 18.526920318603516, 14.987327575683594, 14.665925979614258, 9.208537101745605, 11.986360549926758, -13.110265731811523, 3.1953303813934326, 20.347864151000977, 14.489336967468262, 13.525574684143066, 11.642297744750977, -16.527755737304688, 14.678874015808105, -20.76173210144043, -0.3452112376689911, 6.819133758544922, 17.097471237182617, -0.6257216334342957, 15.619368553161621, 10.135584831237793, 6.282881259918213, 18.862598419189453, 7.799135208129883, 3.3354477882385254, 0.15584170818328857, 6.26609992980957, 9.596212387084961, 7.95708703994751, 17.74555778503418, -21.439605712890625, -7.698458671569824, 12.603463172912598, 9.275918960571289, -3.4611470699310303, -8.378585815429688, 16.112457275390625, 12.117599487304688, 5.564085006713867, 10.20129108428955, 19.477554321289062, 4.768891334533691, 12.773776054382324, 8.356185913085938, -11.753010749816895, 16.634265899658203, 11.060528755187988, 6.845538139343262, 13.33799934387207, 17.466869354248047, -12.745265007019043, 17.00414276123047, -9.293492317199707, 11.534061431884766, -5.457294464111328, 16.08570671081543, 5.225383758544922, 13.115345001220703, 9.974627494812012, 10.44150447845459, 17.22805404663086, 8.482453346252441, -8.677849769592285, -20.50507354736328, -14.645130157470703, 13.208407402038574, 18.922218322753906, 9.323628425598145, 18.2813720703125, 12.772849082946777, -0.10989882051944733, 11.091923713684082, -1.9497580528259277, 0.2969266176223755, 6.51821231842041, 2.5422768592834473, 7.474398612976074, 2.108112335205078, 16.875263214111328, 15.890420913696289, 16.88079833984375, 14.956415176391602, 4.558560371398926, -0.5833718180656433, 17.631332397460938, -0.8018089532852173, 11.801080703735352, -0.40746182203292847, 
-15.2289457321167, 3.7110328674316406, -7.3578410148620605, 12.60633659362793, 14.488272666931152, 3.9711179733276367, -3.953218936920166, 17.390342712402344, 17.440744400024414, 11.964737892150879, 15.407419204711914, 3.570370674133301, 9.650638580322266, 15.390456199645996, 0.06699617952108383, 17.996315002441406, 11.45728588104248, -4.622745990753174, 2.8869075775146484, -5.676790714263916, -3.6832518577575684, 12.396058082580566, 13.688033103942871, 17.996315002441406, 7.197054862976074, 15.286927223205566, -18.627094268798828, 18.30089569091797, 11.253070831298828, 2.229530096054077, 7.75808572769165, 5.06968355178833, -10.770936965942383, 11.159478187561035, 13.587868690490723, 10.082839965820312, 16.37598991394043, 5.587576389312744, 14.567668914794922, -8.760175704956055, 7.703945636749268, -13.886496543884277, 13.01379680633545, -13.758611679077148, 9.17174243927002, -16.080148696899414, 12.9537992477417, 14.316058158874512, -5.74833869934082, -8.30625057220459, 8.670008659362793, -12.240528106689453, -19.61964225769043, 0.649236261844635, 13.778191566467285, 6.167590618133545, 14.636520385742188, 18.58367919921875, -9.196493148803711, 14.32049560546875, 9.175971031188965, -11.886533737182617, 14.924628257751465, -24.584224700927734, 2.0860023498535156, 17.110746383666992, 14.72911548614502, 3.460249423980713, 9.528497695922852, -6.364473342895508, 16.25566864013672, -9.959657669067383, -7.410150051116943, 5.192204475402832, -22.707664489746094, 14.38634967803955, 12.714741706848145, -12.977810859680176, 15.175114631652832, -20.359966278076172, 15.887040138244629, 1.8842642307281494, 1.9588429927825928, 11.025944709777832, -2.200796604156494, -12.423057556152344, -13.198972702026367, 9.224650382995605, 12.306821823120117, 16.717018127441406, 12.369593620300293, 14.43990707397461, -10.932412147521973, -8.926290512084961, 14.12479305267334, 3.674036979675293, 6.213274955749512, 14.771726608276367, -24.03095245361328, -17.400426864624023, 12.2335205078125, -2.4841628074645996, 11.171034812927246, -14.19003963470459, 6.1105570793151855, 10.842275619506836, 3.076615571975708, 9.518315315246582, 15.364958763122559, 9.223173141479492, 8.47429084777832, 6.059081554412842, 15.535370826721191, 13.101935386657715, 19.02134895324707, -14.119009971618652, 12.81179141998291, -1.7170445919036865, 16.877477645874023, 18.27460479736328, 4.872830867767334, 17.16568946838379, 0.8550774455070496, 4.210025787353516, 17.31049346923828, -4.009548187255859, -2.4618661403656006, 19.04184913635254, 11.184344291687012, 0.5831080079078674, 17.68505859375, 17.419021606445312, -15.47326946258545, 15.801827430725098, -13.933554649353027, -10.648279190063477, 7.597427845001221, 15.462384223937988, 0.32431674003601074, 13.973188400268555, -8.640498161315918, 5.005640029907227, 15.267630577087402, -2.016509771347046, -5.688741683959961, 7.008584976196289, -20.141035079956055, 13.747187614440918, 5.731258392333984, -14.18584156036377, 15.806730270385742, 16.6748046875, 11.475944519042969, 6.184438705444336, 10.92053508758545, -7.597588062286377, 16.148408889770508, 14.263434410095215, 2.1767494678497314, 18.40786361694336, -14.55492115020752, 17.825923919677734, 13.184425354003906, 15.508620262145996, 8.382051467895508, -5.733557224273682, 4.105076313018799, 13.809184074401855, -5.355087757110596, 16.091463088989258, 18.29323387145996, 17.374454498291016, 1.4047178030014038, 17.88732147216797, -1.6902297735214233, 12.622706413269043, 15.329268455505371, -3.92073130607605, 15.702555656433105, -16.939350128173828, 
-19.12222671508789, 10.142857551574707, 11.022912979125977, -6.0877251625061035, -0.4878803491592407, -20.219280242919922, -7.447934150695801, -0.49328118562698364, 17.67724609375, 8.014138221740723, -17.380390167236328, 10.3120698928833, 13.518301963806152, 3.1764838695526123, 17.671239852905273, -0.6746103167533875, 16.26909828186035, -9.144436836242676, -0.9421584606170654, 10.030646324157715, 16.53363037109375, 9.232200622558594, 7.369050979614258, 7.575037002563477, 16.62100601196289, 6.481991767883301, 2.531597852706909, 14.252530097961426, -1.746160864830017, 11.183938980102539, 4.897782325744629, -14.06386661529541, 17.58884620666504, -13.53532886505127, 18.790796279907227, 4.670736312866211, 16.990140914916992, 4.967563629150391, -0.4503783881664276, -7.060770511627197, 12.426644325256348, 9.955527305603027, -23.58697509765625, 16.9542236328125, -22.9370174407959, 19.125083923339844, -18.199602127075195, -5.261682510375977, 11.080878257751465, 15.306122779846191, 3.0926597118377686, -17.665010452270508, 1.2239549160003662, -20.03911590576172, 16.360694885253906, 18.033679962158203, -5.759027004241943, 16.247272491455078, -4.610719203948975, 3.280198574066162, 3.6081905364990234, -24.661344528198242, 17.47615623474121, 0.26504141092300415, -2.099376916885376, 10.232733726501465, 16.317556381225586, -17.588844299316406, 17.70425033569336, 17.660411834716797, 12.04329776763916, 18.50408172607422, 4.581759452819824, 18.606327056884766, 0.7869451642036438, 15.58411693572998, 8.058518409729004, 10.642940521240234, -9.863536834716797, 12.829106330871582, -11.971624374389648, 15.981738090515137, -12.542139053344727, 3.2194089889526367, 9.560979843139648, 7.007416725158691, 13.006417274475098, -7.5256667137146, 8.963598251342773, 6.474368572235107, -15.432988166809082, 3.4593796730041504, -6.131731033325195, 19.431949615478516, 13.920551300048828, 6.007608413696289, -18.999370574951172, 8.125280380249023, 4.632145881652832, 15.26697063446045, 14.385810852050781, 11.668633460998535, 11.432242393493652, -4.376832485198975, 14.677751541137695, -17.08258628845215, 4.100184917449951, -15.47326946258545, 15.083805084228516, 4.513195514678955, -8.103970527648926, 17.41781234741211, 1.425919532775879, -6.213893890380859, 14.494572639465332, 1.9545398950576782, 17.199356079101562, -12.507933616638184, 13.423059463500977, 17.627429962158203, -14.06386661529541, 11.911812782287598, 11.91988754272461, 15.36992073059082, -9.81781005859375, -1.6895909309387207, 18.3709716796875, 5.118607997894287, 17.608135223388672, 8.924373626708984, 3.4715404510498047, 10.618024826049805, -2.3761510848999023, -14.850167274475098, 6.333468914031982] 2 | 20.347864151000977 -26.26762580871582 -------------------------------------------------------------------------------- /emnlp_data/nq/random_testset/nq_ref_avg_lm_entropy: -------------------------------------------------------------------------------- 1 | [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, 
NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN] 2 | NaN -------------------------------------------------------------------------------- /emnlp_data/nq/random_testset/nq_ref_avg_sent_ppl: -------------------------------------------------------------------------------- 1 | [78.4693368448625, 16.043491695128726, 15.804934627199694, 53.23927945815809, 41.15005729707433, 10.50810120958408, 6.100938382963522, 67.48036996807707, 27.996390631708877, 29.215547701526184, 32.89045190663701, 53.67419445493103, 67.90224061169619, 59.324534008027015, 68.11863880866669, 100.0, 11.18664364841687, 63.60015186627291, 93.20928166657612, 53.22125644548603, 44.396799448946105, 55.91322077903019, 22.511726352105622, 44.27881054053034, 100.0, 28.969665540137683, 54.09288183052771, 18.83291960242443, 38.10779446908432, 88.78784355913466, 86.9860044535187, 100.0, 93.54996883396687, 31.23412437863948, 58.93436814650389, 71.03339433990104, 92.05139580464764, 66.88117758800215, 49.7674398840102, 87.18660354790754, 86.78696919699829, 51.950614229747046, 91.44705859268413, 5.190588743863161, 70.52482366327511, 46.18224223328229, 5.050934899268297, 19.253325549877246, 30.636168485681026, 35.336577492277954, 69.62419980886938, 41.863717275046234, 56.02683768815861, 37.563018144626106, 57.492770836936714, 100.0, 44.96884665815416, 62.09192051669649, 28.117156195500414, 84.12359139682738, 41.466212699416985, 41.842113436656064, 22.03863655634089, 52.42349769740261, 70.66018373086091, 5.945385801758123, 30.175888565873056, 25.17534426406489, 38.80949853753552, 100.0, 79.7977150546999, 5.870803559537856, 29.40460943390798, 26.140572628150604, 10.073324054728108, 8.04474714646627, 58.11901407353269, 58.47049465318385, 
69.15162771778725, 83.89402822020207, 58.01664243914295, 52.62574447567372, 42.83131840855212, 44.41039137941371, 23.58437181390929, 49.8692187978461, 57.554268891823604, 38.81962317221551, 28.37811495669342, 41.51519353649107, 16.27531796559519, 86.72404698976116, 60.83146012345982, 85.2239467307671, 4.182093180232769, 26.912988954347888, 6.514642953779386, 10.59331913067613, 19.4277578940981, 44.26226843679404, 40.21397733256407, 66.5492461652898, 37.53767953925087, 74.81145321587799, 87.6990942071536, 14.167328159135606, 57.917234529614426, 32.28550906921534, 50.487339715410165, 79.3488639934888, 32.30160790329472, 34.8669344883829, 18.33801662935187, 63.08825615353803, 22.8900862160764, 4.602974120191751, 65.88837289614875, 60.114391712409606, 49.143607662107186, 48.448415985011685, 85.93036832838905, 54.46381386570613, 52.710168937818175, 34.98609189283082, 100.0, 82.99606536620868, 25.801826292437184, 67.69431699400373, 73.40954491954432, 100.0, 100.0, 13.450123586613607, 30.252880628186652, 37.18607406959037, 71.25514218300809, 50.687486145792185, 88.85777119414414, 17.67454787215367, 42.25914107716766, 46.07183443470951, 29.360933012286065, 91.64069544929384, 9.364183350622403, 28.733743947465122, 7.829874200746442, 48.19923010329084, 17.63484233438192, 8.901291042826191, 57.23819521890625, 11.431844035104252, 52.853003441050795, 55.34538055526116, 23.26579146101366, 79.336160388691, 30.84904916314805, 42.51635682798475, 15.006421350764779, 28.239869305309288, 36.2774867678075, 18.219436192333255, 27.939471967926718, 11.912612508020064, 100.0, 26.930295717239854, 57.27942179998525, 31.444787110759638, 25.243815375835123, 57.27757829283663, 79.01959879363238, 3.756021644237454, 69.47456455083955, 49.92495365704886, 28.9776917938745, 67.27869008353792, 8.233310258192683, 19.99385464132014, 41.01906277459451, 30.288760681943845, 47.57541318410044, 69.89785950677994, NaN, 18.53793016881998, 27.495018739833913, 28.133526653183385, 61.78660514015602, 100.0, 22.347591227413954, 7.607686129443385, 46.831082759910785, 54.405917836795474, 78.95596448648368, 17.73826356357924, 22.621886112268395, 7.675172804675318, 42.47293496580793, 39.93305142617706, 53.08773918167623, 38.301428644554235, 30.11430687280376, 66.4553528008607, 10.572855201652148, 85.8774195332294, 100.0, 10.168131346747268, 5.890877479933101, 90.84300721934494, 49.052191018285306, 48.00127137837724, 10.572855201652148, 65.38331521035612, 16.085177799050324, 75.57413525170637, 14.177714887593893, 43.555649711081145, 83.94592734574924, 7.859383843049418, 9.710033597164143, 91.44705859268413, 66.78146956549836, 61.12810313331742, 79.07377466512598, 40.683994468126464, 55.211967783776934, 53.822895220951466, 60.47158849946876, 25.495150683340373, 64.97668726989721, 9.39264775287852, 77.38194116179402, 23.988622161697332, 100.0, 57.77326658082132, 32.52255001911109, 28.36959604444296, 37.285203308453305, 11.112263797171225, 64.45322764057059, 29.022193408660645, 26.649195197483568, 65.18438625977032, 100.0, 23.650732932834142, 28.39248175000161, 54.22242960447606, 27.102035563186863, 52.505613907053274, 88.37060236205672, 57.63914449437923, 10.95308820881317, 48.61061834098217, 17.57937183901338, 21.741398017397973, 66.46813319011274, 43.82021831418862, 43.93084158858949, 39.460481431512385, NaN, 38.19217753182415, 27.350671828561318, 76.4552536949989, 21.43581607467629, 42.71083111389023, 54.68586560318957, 57.96951125703256, 63.2092470957139, 61.8858250041732, 67.4782191664445, 45.71895670056356, 48.49965121653874, 
13.801560552250978, 82.0290052023144, 49.26524207475927, 65.39611582801435, 48.9313603200487, 17.399555024609302, 71.86034687140634, 34.17710049127836, 15.560886860555184, 100.0, 45.587523879716485, 92.90999318034552, 22.396488319503675, 19.769063634463546, 66.2727873665185, 58.37219835425476, 61.8330587063284, 76.15445479738644, 65.10139405094469, 30.411742608106582, 71.46585180678812, 25.133494662273623, 53.957955019292676, 45.205033015396566, 19.368192387932776, 52.74097094282029, 38.417053927711144, 38.44702283178825, 35.16281902141295, 32.12372067896128, 69.39259156686892, 100.0, 33.70049762868605, 83.7574123862593, 74.63229558625723, 29.63988358730266, 11.141142300349554, 49.37503166308023, 62.815921138863345, 47.90365622303026, 28.979974141484963, 65.6605960946507, 68.28209488170104, 81.34838467628872, 45.1893563527962, 23.704115878128697, 62.68784038330249, 28.979468158050544, 14.87731316470635, 31.687223344905128, 7.053431375407266, 26.924944188155596, 61.49818106064528, 35.398435766463955, 100.0, 40.89925100406656, 24.53816461142837, 81.71840580748571, 44.65755612960107, 48.508640082792425, 84.91696128735245, 23.048170043620853, 71.69208861888265, 70.74994143682393, 56.817927583504044, 21.7545726190004, 54.435831683151825, 85.53311965080941, 35.620723784868396, 62.19238793393677, 17.52713060078619, 23.452689741340183, 23.637008111384432, 34.088221002778155, 19.63002102935967, 16.658842862059494, 100.0, 20.31905768713922, 55.652382631872975, 17.602564986302696, 52.54253800204756, 20.338837471467222, 6.524293988216611, 16.531504297917582, 100.0, 51.59755874475989, 14.192940713242582, 43.21651114253152, 70.39592310603686, 15.948928734510222, 79.7092184656985, 98.76599944358716, 67.30579697786645, 85.55465618508093, 69.44646603644725, 9.68064920215347, 39.43964965941498, 17.663005510973708, 82.17563782956356, 55.74860974120666, 88.00914893020618, 98.75794968458214, 79.91814517767344, 100.0, 50.43072956804357, 55.20529921934837, 38.80347216029391, 41.418105312725274, 51.33427077483436, 40.644144375574434, 7.348653417498423, 30.744310601810668, 37.39141479385989, 31.553934979496823, 74.14908097104342, 57.91877666713658, 13.338599807408572, 75.45382159392871, 49.13650117273204, 51.73842245428733, 21.494297211299624, 49.33050189919685, 78.63987989630108, 55.792702963233616, 35.71646595198077, 100.0, 100.0, 60.99824329043069, 18.73404824576729, 96.95316219766627, 18.381026668828653, 44.721631638215335, 22.941585044090232, 65.15406536907628, 88.70722229250158, 6.937894673370338, 9.771027972111897, 47.90206725805424, 49.27142704784982, 67.55369462929836, 74.05254003400552, 18.018599813339435, 11.290523947299521, 51.75739045944618, 60.5669539827174, 75.71581797202799, 87.5467953409547, 78.08399991228774, 95.60402121298063, 67.78531634829748, 65.44257906485598, 14.782783739932551, 86.8088661985033, 35.542420475392724, 18.8143101371883, 51.821384751840085, 86.30142971062976, 18.698702953700778, 18.717258906063826, 99.49252337397067, 58.342399702717394, 43.9318890305611, 22.587872154793093, 70.93133357967618, 9.949994733825061, 12.82532481394894, 32.669446602678676, 51.21630074241758, 5.655418768344973, 49.76715603276672, 87.68309156195106, 43.42747255302461, 58.348403072402085, 34.304607915902366, 3.6366487826139267, 35.773664869380426, 97.92290896755627, 42.55352808443655, 13.826841089805036, 100.0, 38.80949853753552, 36.37374459373949, 58.95719583529741, 4.501877034727099, 36.16321945266922, 7.119819682125535, 91.74869049671233, 67.60019917559482, 4.7308653870051804, 16.537152651336708, 
49.046451492916106, 45.093256091221775, 56.81469596417692, 19.41306271416821, 33.30974666446854, 40.75322950093222, 78.21412681354154, 26.98979053596943, 45.766711346561955, 57.854786571024555, 47.46923585138697, 63.829233748500975, 100.0, 14.87731316470635, 20.509667600141196, 5.297629491378028, 8.984981801999487, 30.616747250524487, 59.43491109119952, NaN, 32.80619543890013, 11.439434524154548, 81.7532819435181, 34.66610952361426, 21.746072409363222, 40.80490402564827, 60.99824329043069, 20.72013054787879, 34.1612630558561, 46.617922010330915, 88.09862636829668, 55.281747347040245, 18.206939742923193, 86.88200832384972, 28.91344281278397, 23.481498347502203, 33.597110545450455, 72.6510934816972, 75.63631810955533, 11.878015389507036, 34.153455331269825] 2 | 0.03658152118316791 -------------------------------------------------------------------------------- /emnlp_data/nq/random_testset/nq_ref_rel: -------------------------------------------------------------------------------- 1 | 0.9955950379371643 2 | 0.9874629378318787 3 | 0.963216245174408 4 | 0.9977188110351562 5 | 0.6699656844139099 6 | 0.9944804906845093 7 | 0.9958698153495789 8 | 0.9849919676780701 9 | 0.9985634684562683 10 | 0.989118218421936 11 | 0.9984180927276611 12 | 0.998940646648407 13 | 0.9816821813583374 14 | 0.013061659410595894 15 | 0.992595374584198 16 | 0.9913615584373474 17 | 0.9959718585014343 18 | 0.9972472786903381 19 | 0.9554606080055237 20 | 0.999143123626709 21 | 0.9955300688743591 22 | 0.9994617104530334 23 | 0.9991528987884521 24 | 0.9987228512763977 25 | 0.9992048144340515 26 | 0.998910665512085 27 | 0.998437225818634 28 | 0.9612510204315186 29 | 0.9927793145179749 30 | 0.9989737272262573 31 | 0.994820237159729 32 | 0.999016284942627 33 | 0.9985498785972595 34 | 0.9588527083396912 35 | 0.9992305040359497 36 | 0.9985186457633972 37 | 0.9945138692855835 38 | 0.9994654059410095 39 | 0.9980382323265076 40 | 0.9991067051887512 41 | 0.9989997744560242 42 | 0.9990805387496948 43 | 0.9948292374610901 44 | 0.9930898547172546 45 | 0.940955400466919 46 | 0.9945279955863953 47 | 0.9937803149223328 48 | 0.9994076490402222 49 | 0.19607733190059662 50 | 0.9980910420417786 51 | 0.9940365552902222 52 | 0.9391145706176758 53 | 0.998028576374054 54 | 0.9934442043304443 55 | 0.9984025359153748 56 | 0.9986213445663452 57 | 0.9943619966506958 58 | 0.978777289390564 59 | 0.9989280104637146 60 | 0.9930499792098999 61 | 0.9985374212265015 62 | 0.9943996071815491 63 | 0.9629001617431641 64 | 0.9976578950881958 65 | 0.8398615121841431 66 | 0.05630270019173622 67 | 0.06235615164041519 68 | 0.9881781935691833 69 | 0.9899857044219971 70 | 0.999122679233551 71 | 0.9974337220191956 72 | 0.9989663362503052 73 | 0.9975524544715881 74 | 0.995134174823761 75 | 0.9992455244064331 76 | 0.9988962411880493 77 | 0.9993371367454529 78 | 0.9994862079620361 79 | 0.9984022974967957 80 | 0.9993213415145874 81 | 0.9903036952018738 82 | 0.9910654425621033 83 | 0.9994753003120422 84 | 0.992023229598999 85 | 0.9774600267410278 86 | 0.9984448552131653 87 | 0.10379525274038315 88 | 0.9981904625892639 89 | 0.9979850053787231 90 | 0.9991675615310669 91 | 0.9995473027229309 92 | 0.9991532564163208 93 | 0.998704195022583 94 | 0.9956908822059631 95 | 0.5713329911231995 96 | 0.9859158992767334 97 | 0.9917317032814026 98 | 0.994891881942749 99 | 0.9991616010665894 100 | 0.9993641972541809 101 | 0.9982323050498962 102 | 0.9987945556640625 103 | 0.9710656404495239 104 | 0.9981881976127625 105 | 0.9974697828292847 106 | 0.9883113503456116 107 | 0.9987496137619019 108 
| 0.9991629123687744 109 | 0.9993699193000793 110 | 0.012185875326395035 111 | 0.9959009289741516 112 | 0.873554527759552 113 | 0.9922148585319519 114 | 0.9864484667778015 115 | 0.9659812450408936 116 | 0.9819478988647461 117 | 0.9902718663215637 118 | 0.9271910190582275 119 | 0.998245358467102 120 | 0.984853982925415 121 | 0.9978694915771484 122 | 0.9989228844642639 123 | 0.9975504279136658 124 | 0.9982158541679382 125 | 0.9906685948371887 126 | 0.9992731213569641 127 | 0.9980725049972534 128 | 0.9991907477378845 129 | 0.9753788113594055 130 | 0.998244047164917 131 | 0.9956242442131042 132 | 0.9967650175094604 133 | 0.9945287108421326 134 | 0.9904304146766663 135 | 0.9947385191917419 136 | 0.9993129968643188 137 | 0.9782684445381165 138 | 0.995001494884491 139 | 0.9992092847824097 140 | 0.9987636804580688 141 | 0.9790915846824646 142 | 0.9986263513565063 143 | 0.9984669089317322 144 | 0.9909221529960632 145 | 0.9909529685974121 146 | 0.9535863995552063 147 | 0.9910474419593811 148 | 0.9979985356330872 149 | 0.9984667897224426 150 | 0.9740865230560303 151 | 0.9986739158630371 152 | 0.982077419757843 153 | 0.9938271641731262 154 | 0.9988283514976501 155 | 0.9934296011924744 156 | 0.99751877784729 157 | 0.996823787689209 158 | 0.9982648491859436 159 | 0.9920627474784851 160 | 0.9957907795906067 161 | 0.9991832375526428 162 | 0.9994359612464905 163 | 0.9690139889717102 164 | 0.9970806241035461 165 | 0.9787412285804749 166 | 0.9207716584205627 167 | 0.9987126588821411 168 | 0.9524546265602112 169 | 0.9985440969467163 170 | 0.9976859092712402 171 | 0.9901997447013855 172 | 0.976740837097168 173 | 0.9648317694664001 174 | 0.9989110231399536 175 | 0.9991051554679871 176 | 0.9981874823570251 177 | 0.9803351163864136 178 | 0.9970822930335999 179 | 0.9800115823745728 180 | 0.03216550126671791 181 | 0.9969797134399414 182 | 0.9924662709236145 183 | 0.989404559135437 184 | 0.9963524341583252 185 | 0.9886478781700134 186 | 0.9746184349060059 187 | 0.9987975358963013 188 | 0.9647734761238098 189 | 0.028539611026644707 190 | 0.9089956283569336 191 | 0.9052648544311523 192 | 0.9990537762641907 193 | 0.01217712089419365 194 | 0.9794471263885498 195 | 0.997554361820221 196 | 0.9947006702423096 197 | 0.9952946305274963 198 | 0.99937903881073 199 | 0.01163787953555584 200 | 0.9988679885864258 201 | 0.9990560412406921 202 | 0.9983832836151123 203 | 0.9932032823562622 204 | 0.9957373142242432 205 | 0.9993076324462891 206 | 0.9975751042366028 207 | 0.9821121692657471 208 | 0.9968386888504028 209 | 0.9989283680915833 210 | 0.9434876441955566 211 | 0.9615771770477295 212 | 0.9193602800369263 213 | 0.9983742237091064 214 | 0.9957219362258911 215 | 0.9975576400756836 216 | 0.9960113763809204 217 | 0.9945828318595886 218 | 0.9981690645217896 219 | 0.9561895728111267 220 | 0.9899622201919556 221 | 0.9831120371818542 222 | 0.9990137815475464 223 | 0.9988020658493042 224 | 0.9855765700340271 225 | 0.9986560344696045 226 | 0.9622057676315308 227 | 0.9991530179977417 228 | 0.9991353154182434 229 | 0.997234046459198 230 | 0.9993801116943359 231 | 0.9915771484375 232 | 0.9989362359046936 233 | 0.9961090683937073 234 | 0.9803037047386169 235 | 0.9986960291862488 236 | 0.99750155210495 237 | 0.9922928810119629 238 | 0.9971168041229248 239 | 0.9979921579360962 240 | 0.9993677735328674 241 | 0.9993109703063965 242 | 0.9822739958763123 243 | 0.9387704730033875 244 | 0.9955278038978577 245 | 0.9978225231170654 246 | 0.998397171497345 247 | 0.9984637498855591 248 | 0.9981485605239868 249 | 0.9969332218170166 250 | 
0.9577726721763611 251 | 0.9981651902198792 252 | 0.9982585310935974 253 | 0.9875665903091431 254 | 0.9990083575248718 255 | 0.9973329305648804 256 | 0.994461715221405 257 | 0.9881724715232849 258 | 0.9966397285461426 259 | 0.9977622032165527 260 | 0.9975376129150391 261 | 0.998838484287262 262 | 0.9916936755180359 263 | 0.9992332458496094 264 | 0.9930398464202881 265 | 0.9983376264572144 266 | 0.9966981410980225 267 | 0.9968171715736389 268 | 0.9886389970779419 269 | 0.9985862970352173 270 | 0.9967496395111084 271 | 0.9763928651809692 272 | 0.9940037131309509 273 | 0.9965202808380127 274 | 0.9911490678787231 275 | 0.999377965927124 276 | 0.9791600108146667 277 | 0.9810723662376404 278 | 0.9960089921951294 279 | 0.9988295435905457 280 | 0.9954675436019897 281 | 0.9992780089378357 282 | 0.9966852068901062 283 | 0.9978881478309631 284 | 0.9936227202415466 285 | 0.998149037361145 286 | 0.9990297555923462 287 | 0.9959836006164551 288 | 0.9988118410110474 289 | 0.9991412162780762 290 | 0.9832374453544617 291 | 0.9807348847389221 292 | 0.9992181062698364 293 | 0.9991262555122375 294 | 0.9973322153091431 295 | 0.9963339567184448 296 | 0.9898951649665833 297 | 0.9811797738075256 298 | 0.9962238073348999 299 | 0.9970287680625916 300 | 0.9972063899040222 301 | 0.9988741278648376 302 | 0.9976562261581421 303 | 0.8660569190979004 304 | 0.9985843896865845 305 | 0.997268557548523 306 | 0.995783805847168 307 | 0.9928014278411865 308 | 0.9929514527320862 309 | 0.9539411067962646 310 | 0.9947105646133423 311 | 0.9973823428153992 312 | 0.9896822571754456 313 | 0.9955400824546814 314 | 0.9949434399604797 315 | 0.9905816316604614 316 | 0.9944721460342407 317 | 0.9864016175270081 318 | 0.9975747466087341 319 | 0.9968441724777222 320 | 0.9781889319419861 321 | 0.9979853630065918 322 | 0.9954456090927124 323 | 0.9988722205162048 324 | 0.9956164360046387 325 | 0.9969595670700073 326 | 0.9814164638519287 327 | 0.9973897337913513 328 | 0.983991265296936 329 | 0.9988692402839661 330 | 0.9619116187095642 331 | 0.9980373978614807 332 | 0.9810391664505005 333 | 0.025657007470726967 334 | 0.9724447727203369 335 | 0.99146968126297 336 | 0.9986559152603149 337 | 0.9979668259620667 338 | 0.9980127811431885 339 | 0.9813748598098755 340 | 0.9970160722732544 341 | 0.9992469549179077 342 | 0.9982807636260986 343 | 0.9983713030815125 344 | 0.9965781569480896 345 | 0.9983538389205933 346 | 0.991523265838623 347 | 0.9978896975517273 348 | 0.9978058934211731 349 | 0.9515871405601501 350 | 0.8837659358978271 351 | 0.9989218711853027 352 | 0.9990463852882385 353 | 0.9826864004135132 354 | 0.9880500435829163 355 | 0.9960760474205017 356 | 0.981313943862915 357 | 0.9978699684143066 358 | 0.999392032623291 359 | 0.9959993362426758 360 | 0.9892537593841553 361 | 0.9976708292961121 362 | 0.9942777752876282 363 | 0.9838907122612 364 | 0.9917405843734741 365 | 0.9967668056488037 366 | 0.9929261803627014 367 | 0.999275267124176 368 | 0.013177838176488876 369 | -------------------------------------------------------------------------------- /emnlp_data/testset_preprocess_scripts/cnt_knowledge_length.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | # for wow 4 | # query_len, knowledge_len = [], [] 5 | # with open('train_processed.txt', 'r', encoding='utf-8') as r: 6 | # data = r.readlines() 7 | # print ('data size: ', len(data)) 8 | # for i, line in enumerate(data): 9 | # parts = [e.strip() for e in line.strip().split('\t')] 10 | # # assert len(parts) == 4, (i, len(parts), parts) 11 
| # if len(parts) != 4: 12 | # print(i, len(parts), parts) 13 | # continue 14 | # topic, history, knowledge, response = parts 15 | # query = history.split(' [SEP] ')[-1] 16 | # query_len.append(len(query.split(' '))) 17 | # knowledge_len.append(len(knowledge.split(' '))) 18 | # assert len(query_len) == len(knowledge_len), 'length not equal' 19 | # print('query len: ', sum(query_len) / len(query_len)) # 14.6 20 | # print('knowledge len: ', sum(knowledge_len) / len(knowledge_len)) # 21.1 21 | 22 | # for nq 23 | train = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/retriever/nq-train.json' 24 | test = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/gold_passages_info/nq_test.json' 25 | query_len, knowledge_len = [], [] 26 | with open(test, 'r') as infile: 27 | data_list = json.load(infile)['data'] 28 | for data in data_list: 29 | if not data['context'] or not data['short_answers']: 30 | continue 31 | query = data['question'] 32 | knowledge = data['context'] 33 | query_len.append(len(query.split(' '))) 34 | knowledge_len.append(len(knowledge.split(' '))) 35 | print('query len: ', sum(query_len) / len(query_len)) # 9.0 36 | print('knowledge len: ', sum(knowledge_len) / len(knowledge_len)) # 297.2 37 | 38 | print (len(knowledge_len)) # 1868 39 | print (max(knowledge_len)) 40 | li = [0] * 21 41 | for l in knowledge_len: 42 | if l > 1000: 43 | continue 44 | li[l // 50] += 1 45 | print (li) 46 | 47 | # [274, 582, 403, 247, 85, 46, 25, 19, 21, 7, 9, 10, 5, 6, 5, 7, 1, 1, 2, 0, 0] -------------------------------------------------------------------------------- /emnlp_data/testset_preprocess_scripts/extract_ref.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | def read_testfile(ref_path): 4 | # testset = [] 5 | with open(ref_path, 'r') as infile: 6 | for line in infile: 7 | parts = line.strip().split('\t') 8 | # topic, query, knowledge, response 9 | assert len(parts) == 4, parts 10 | # testset.append(parts) 11 | print (parts[-2]) 12 | # return testset 13 | 14 | read_testfile('/misc/kfdata01/kf_grp/lchen/EMNLP23/experiments/emnlp_data/wow/random_testset/seen_random_testset.txt') -------------------------------------------------------------------------------- /emnlp_data/testset_preprocess_scripts/random_prompt_maker.py: -------------------------------------------------------------------------------- 1 | import random 2 | import json 3 | import tqdm 4 | 5 | # for wow 6 | # data = [] 7 | # with open('train_processed.txt', 'r', encoding='utf-8') as r: 8 | # data = r.readlines() 9 | 10 | # for split in ['seen', 'unseen']: 11 | # with open(f'random_prompts/{split}_random_prompt.txt', 'w', encoding='utf-8') as w: 12 | # for i in range(500): # lines, examples 13 | # random.shuffle(data) 14 | # cur_prompt = data[:50] 15 | # w.write(str(cur_prompt).strip() + '\n') 16 | # # print (cur_prompt) 17 | 18 | 19 | 20 | with open('/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/retriever/nq-dev.json', 'r', encoding='utf-8') as infile: 21 | data_list = json.load(infile) 22 | print (data_list[0].keys()) # dict_keys(['dataset', 'question', 'answers', 'positive_ctxs', 'negative_ctxs', 'hard_negative_ctxs']) 23 | 24 | 25 | train = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/retriever/nq-train.json' 26 | # with open(train, 'r', encoding='utf-8') as infile, \ 27 | # open('../nq/random_prompts/nq_test_random_prompt.txt', 'w', encoding='utf-8') as outfile: 28 | # data_list = json.load(infile) 29 | # for i in tqdm.tqdm(range(500)): 30 | # # 
random.shuffle(data_list) 31 | # id_list = random.sample(list(range(len(data_list))), 300) 32 | # cur_prompt = [] 33 | # # for data in data_list: 34 | # for id in id_list: 35 | # data = data_list[id] 36 | # if not data['positive_ctxs'] or not data['answers']: 37 | # continue 38 | 39 | # query = data['question'].strip() 40 | # answer = data['answers'][0].strip() 41 | 42 | # knowledge = data['positive_ctxs'][0]['text'].strip() 43 | # topic = data['positive_ctxs'][0]['title'].strip() 44 | 45 | # if len(knowledge.split(' ')) > 350: 46 | # continue 47 | # cur_prompt.append(f'{topic}\t{query}\t{knowledge}\t{answer}\n') 48 | # if len(cur_prompt) == 50: 49 | # break 50 | # outfile.write(str(cur_prompt).strip() + '\n') -------------------------------------------------------------------------------- /emnlp_data/testset_preprocess_scripts/random_testset_maker.py: -------------------------------------------------------------------------------- 1 | import random 2 | import json 3 | 4 | # dataset = 'wow' 5 | # for split in ['seen', 'unseen']: 6 | # data = [] 7 | # with open(f'../{dataset}/test{split}_processed.txt', 'r', encoding='utf-8') as r: 8 | # data = r.readlines() 9 | # with open(f'random_testset/{split}_random_testset.txt', 'w', encoding='utf-8') as w: 10 | # random.shuffle(data) 11 | # testset = data[:500] 12 | # w.writelines(testset) 13 | 14 | 15 | 16 | train = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/retriever/nq-train.json' 17 | test = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/gold_passages_info/nq_test.json' 18 | cnt = 0 19 | with open(test, 'r') as infile, \ 20 | open('../nq/random_testset/nq_test_random_testset.txt', 'w') as outfile: 21 | data_list = json.load(infile)['data'] 22 | for data in data_list: 23 | if not data['context'] or not data['short_answers']: 24 | continue 25 | 26 | p = random.random() 27 | if p > 0.35: 28 | continue 29 | topic = data['title'].strip() 30 | query = data['question'].strip() 31 | knowledge = data['context'].strip() 32 | answer = data['short_answers'][0].strip() 33 | outfile.write(f'{topic}\t{query}\t{knowledge}\t{answer}\n') 34 | cnt += 1 35 | if cnt == 500: 36 | break 37 | -------------------------------------------------------------------------------- /emnlp_data/wow/random_testset/seen_topic_pageviews.txt: -------------------------------------------------------------------------------- 1 | Beard 1959542 2 | Chevrolet Corvette 5075804 3 | Del Taco 511892 4 | Steak 2617070 5 | National Hockey League 7053253 6 | My Little Pony: Friendship Is Magic fandom 735926 7 | Kale 4124167 8 | Avengers (comics) 6736715 9 | 100 metres 4162077 10 | Mercedes-Benz S-Class 3902440 11 | Star Trek 13923390 12 | Dance 3905111 13 | Beastie Boys 9614636 14 | Bachelor of Science in Nursing 606634 15 | Children's literature 1474233 16 | Chicago Bulls 8437648 17 | Byala, Varna Province 39347 18 | Bank teller 565886 19 | Veganism 6404632 20 | Acoustic guitar 1177249 21 | Washington Wizards 3783779 22 | Depression (mood) 5088913 23 | Violin 3510891 24 | Back pain 1792205 25 | Cheese 5531781 26 | Vancouver Grizzlies 1468983 27 | Appalachian Trail 4152310 28 | Weight training 1395672 29 | Zebra 3910132 30 | Laser pointer 884624 31 | Pet 2558753 32 | Synchronised swimming 438525 33 | Indie rock 3169932 34 | American lobster 776293 35 | Netflix 26821165 36 | Marine habitats 510784 37 | Facial hair in the military 711489 38 | Association football 13829467 39 | Wedding cake 735753 40 | The Rolling Stones 15566337 41 | Fantasy football (American) 1090779 42 | My Little 
Pony: Friendship Is Magic fandom 735926 43 | Anxiety disorder 4699214 44 | Pita 1338254 45 | New York-style pizza 2081987 46 | Food truck 655594 47 | Tattoo 3664743 48 | Honda Civic 6757544 49 | Duramax V8 engine 1934085 50 | Fishing tackle 478009 51 | New York City 37724664 52 | Hostage 464802 53 | Insurance 8211916 54 | Higher education in the United States 1348209 55 | Leaning Tower of Pisa 7422423 56 | Agents of S.H.I.E.L.D. 17041567 57 | Reading (process) 1332328 58 | Dating 12379545 59 | Social anxiety 1266979 60 | Atlantic Ocean 5721579 61 | Biotin 3600612 62 | Yoga 9660641 63 | Meditation 4874467 64 | Macaroni and cheese 1682250 65 | English as a second or foreign language 1151254 66 | Blue 3827652 67 | Pot washing 32074 68 | Crochet 1344155 69 | Armadillo 4659111 70 | New York City 37724664 71 | Dodge 3340132 72 | Bathroom singing 84604 73 | Go-kart 663975 74 | Metallica 15902964 75 | The Last of the Mohicans (1992 film) 3813987 76 | Swimming 645217 77 | Depression (mood) 5088913 78 | Singing 2608718 79 | Fishkeeping 400565 80 | Beauty pageant 1734975 81 | Nicholas Sparks 4159815 82 | Chef 2173677 83 | Activism 1216281 84 | Del Taco 511892 85 | Prince (musician) 42161460 86 | Ford Mustang (first generation) 2606261 87 | Airbnb 7751574 88 | Physics 8200603 89 | Chocolate 5734017 90 | Blue Ridge Parkway 1330669 91 | Wedding cake 735753 92 | Led Zeppelin 18823514 93 | Sports in Philadelphia 362223 94 | Vanilla extract 890221 95 | Battle of Okinawa 6175338 96 | Nu metal 2870104 97 | Bank teller 565886 98 | Donald Trump 417127302 99 | Jainism 7969171 100 | Japan 33765520 101 | Bathroom singing 84604 102 | History of vegetarianism 345569 103 | Academic dress 1317111 104 | Bodybuilding 2559450 105 | Grocery store 1654977 106 | Beard 1959542 107 | Bodyboarding 378542 108 | Murder on the Orient Express (2017 film) 9836159 109 | Toyota 9039006 110 | Association football 13829467 111 | Track and field 3895842 112 | Lifeguard 505683 113 | Role-playing video game 2099904 114 | Acrophobia 2256002 115 | Reading (process) 1332328 116 | Veganism 6404632 117 | Animal shelter 1175962 118 | Tofu 4844659 119 | Porsche 6184097 120 | Fantasy football (American) 1090779 121 | Library 3248947 122 | Les Paul 1990157 123 | Finance 4342066 124 | Karaoke Revolution 121987 125 | Dancing with the Stars 1242082 126 | Fruitarianism 1279408 127 | Immigration to the United States 3941725 128 | Spice 2528778 129 | The Shawshank Redemption 14947995 130 | Vietnamese Pot-bellied 241350 131 | Denmark 18009526 132 | Public aquarium 227132 133 | Stepfather 169728 134 | Karaoke 1751670 135 | Sushi 5705171 136 | Auto mechanic 629011 137 | Seafood 1654698 138 | Jeopardy! 
5162800 139 | Pizza 6874547 140 | Masters of the Universe 1283163 141 | Titanic (1997 film) 21665576 142 | Toga party 258452 143 | Work–life balance 1456848 144 | 100 metres 4162077 145 | Lizard 5029851 146 | Wellington County, Ontario 141301 147 | Giant panda 10271361 148 | Go-kart 663975 149 | The Chainsmokers 9752293 150 | Veganism 6404632 151 | Gone with the Wind (film) 10029206 152 | Purple 2715649 153 | Kale 4124167 154 | Social anxiety 1266979 155 | Beastie Boys 9614636 156 | Blackjack 4299719 157 | London 28537990 158 | Kentucky Derby 3439758 159 | Pipe smoking 480616 160 | Metallica 15902964 161 | Tea processing 689180 162 | Blue 3827652 163 | Linebacker 1996470 164 | Steak 2617070 165 | Overwatch (video game) 8253107 166 | Puerto Rico 21597904 167 | Titanic (1997 film) 21665576 168 | Surf culture 441435 169 | IPhone 18946999 170 | Ice hockey 5147900 171 | Obsessive–compulsive disorder 8232417 172 | Fair 670756 173 | Spaghetti with meatballs 214486 174 | Glasses 2129352 175 | Weight training 1395672 176 | Butcher 705333 177 | Telenovela 1623710 178 | YouTube 69619308 179 | Veganism 6404632 180 | Partnership 2125839 181 | Arab cuisine 986678 182 | Vanilla 3068880 183 | Beach 1890555 184 | Human papillomavirus infection 4841867 185 | Miami Heat 5471563 186 | Running 1947540 187 | Wisconsin 7454562 188 | Carrot 3548156 189 | Grey's Anatomy 27648979 190 | Swimming stroke 581036 191 | Lindsey Stirling 5446995 192 | Coco Chanel 7867151 193 | Epilepsy 5807590 194 | Hiking 1797210 195 | Liquorice 3842441 196 | Miranda Lambert 7273065 197 | Anxiety disorder 4699214 198 | Fennel 4125421 199 | The Story So Far (band) 575060 200 | Illegal immigration to the United States 2423443 201 | Divorce 2897630 202 | Income inequality in the United States 1523700 203 | Blockbuster LLC 4235726 204 | Fair 670756 205 | Depression (mood) 5088913 206 | Corn dog 1112057 207 | Biology 7288491 208 | Ice cream 4410396 209 | Blue 3827652 210 | Chanel No. 
5 1311070 211 | Corn dog 1112057 212 | Pizza 6874547 213 | Pakistan 29145105 214 | Elementary school 999722 215 | Cruise ship 2350722 216 | I Love New York 607638 217 | South Park 12874224 218 | Chocolate 5734017 219 | Police officer 1803972 220 | Animal testing 1707509 221 | Marinara sauce 952446 222 | Veganism 6404632 223 | Ford Mustang 8161648 224 | Kale 4124167 225 | List of Downton Abbey characters 4432952 226 | Pecan pie 477609 227 | German Shepherd 10166109 228 | Marriage 6021835 229 | Bon Iver 4494197 230 | Hostage 464802 231 | Chocolate 5734017 232 | Acrophobia 2256002 233 | Jumbo slice 325175 234 | Bitcoin 35706513 235 | Ocean 5401523 236 | Fathers' rights movement by country 57041 237 | Skateboarding 2020981 238 | Justin Bieber 39739057 239 | Shrimp 2976964 240 | Pig farming 876232 241 | Tattoo 3664743 242 | Bank teller 565886 243 | Lesbian 8438858 244 | Obesity in the United States 1833122 245 | Pit bull 7320671 246 | Partnership 2125839 247 | The Bahamas 11576127 248 | Luxury yacht 507678 249 | Dog 21115517 250 | Paddy field 1312896 251 | Mental disorder 4870704 252 | Finance 4342066 253 | Science, technology, engineering, and mathematics 2250255 254 | Fertility factor (demography) 168755 255 | China 43380033 256 | Asakusa 502640 257 | Tupac Shakur 37793324 258 | Parenting styles 1004316 259 | Bulldog 5395755 260 | Reading (process) 1332328 261 | LeBron James 50652189 262 | Blue 3827652 263 | Herb 2105936 264 | Surfing 1906475 265 | Management of hair loss 400689 266 | Rise Against 2128340 267 | Pizza 6874547 268 | Alpine skiing 1002167 269 | Action-adventure game 1673273 270 | Tattoo 3664743 271 | Grounds for divorce (United States) 279430 272 | Nightclub 1591702 273 | Economy of Pittsburgh 136452 274 | McDonald's 16774593 275 | Hinduism 15495622 276 | Activism 1216281 277 | Top Chef 2684469 278 | Avengers (comics) 6736715 279 | Food truck 655594 280 | Chicago metropolitan area 1862922 281 | Swimming lessons 160148 282 | Wedding cake 735753 283 | Italian cuisine 2814286 284 | Rose 6162787 285 | Usain Bolt 18802314 286 | Kayaking 853351 287 | Ford Mustang 8161648 288 | Pizza 6874547 289 | Stepfather 169728 290 | Pink 2098244 291 | Multilingualism 1817923 292 | Blue 3827652 293 | Controversy and criticism of Jersey Shore 106514 294 | Choir 1240728 295 | Newspaper 4451940 296 | Corporate behaviour 178327 297 | Ford Mustang (first generation) 2606261 298 | Kindergarten 2754297 299 | Pizza 6874547 300 | Swimming 645217 301 | Radiology 1959453 302 | Piano 4874737 303 | Justin Bieber 39739057 304 | Pizza 6874547 305 | Food allergy 816602 306 | Yorkshire Terrier 3838224 307 | Stephen King 19387069 308 | Bitcoin 35706513 309 | White Christmas (weather) 351396 310 | Guitar 5765716 311 | My Little Pony 8126577 312 | Widow 726015 313 | Compulsive hoarding 1378634 314 | Surfing 1906475 315 | Hindu 3548143 316 | Marduk (band) 570631 317 | Roller coaster phobia 138748 318 | Pet 2558753 319 | Rock music 6387915 320 | South Park 12874224 321 | China 43380033 322 | Bandy 1503081 323 | Canada 44790425 324 | New York University 4367231 325 | Chico's Tacos 152081 326 | Wedding cake 735753 327 | Iguana 2054941 328 | New York-style pizza 2081987 329 | Fourth Baptist Christian School 8992 330 | Multilingualism 1817923 331 | Sushi 5705171 332 | Horse 8144634 333 | New Year's Eve 3073903 334 | Taco 2906013 335 | Yoga as exercise 361759 336 | World War II 55549788 337 | Teapot 365709 338 | History of autonomous cars 216317 339 | Mesoamerica 2704125 340 | Crochet hook 123800 341 | Ballet 2292512 342 | 
South Asia 6166460 343 | Dog training 847025 344 | Land of Oz 730680 345 | Parenting 1423118 346 | Track and field 3895842 347 | Plastic arts 358076 348 | Rosalia (festival) 91822 349 | Grilling 1084103 350 | Red hair 4380560 351 | My Little Pony 8126577 352 | Orphan 940122 353 | Apple pie 1195239 354 | Chicago-style pizza 1514762 355 | No-kill shelter 162196 356 | Swimming 645217 357 | Gender in youth sports 38863 358 | Cooking 2206345 359 | Pizza 6874547 360 | Seattle 11501444 361 | Characters of Final Fantasy X and X-2 440835 362 | Vermont 6266621 363 | Barbecue grill 490577 364 | Ballroom dance 1677171 365 | Ocean 5401523 366 | Italian cuisine 2814286 367 | Adoption 1288016 368 | Piano 4874737 369 | Hiking 1797210 370 | Superman 13251517 371 | Comic book 2145684 372 | Well-being contributing factors 181026 373 | Labrador Retriever 9292925 374 | New England 7404899 375 | Lizard 5029851 376 | Jamba Juice 535492 377 | Tiny house movement 992197 378 | World Heritage Site 6591340 379 | Lizard 5029851 380 | Chocolate 5734017 381 | Practice pad 41152 382 | Steak 2617070 383 | Cat 20440030 384 | 100 metres 4162077 385 | Korn 6303735 386 | Snapple 788011 387 | Radiohead 9150954 388 | Pipe smoking 480616 389 | Kindergarten 2754297 390 | Monarch butterfly 2891626 391 | Blue Ridge Parkway 1330669 392 | Nineteen Eighty-Four 16311633 393 | Family farm 188774 394 | Odor 908906 395 | Camping 1579382 396 | Iguana 2054941 397 | Christmas tree 3701463 398 | Tennis 5392040 399 | Leather 2872405 400 | John Chambers (make-up artist) 347333 401 | Justin Bieber 39739057 402 | Hospital 3055953 403 | Ageing 2048205 404 | Andy Murray 13901154 405 | Legal awareness 183262 406 | Cyanobacteria 3595456 407 | Swimming 645217 408 | Steak 2617070 409 | Whittling 245466 410 | Bank teller 565886 411 | Bank teller 565886 412 | Yoga 9660641 413 | Agoraphobia 7454666 414 | Coco Chanel 7867151 415 | Eighth Doctor 917496 416 | Cryptic crossword 624883 417 | Running 1947540 418 | Parrot 4166342 419 | Bob Ross 21059248 420 | Low back pain 2709541 421 | Daft Punk 8119663 422 | Veterinary physician 1004302 423 | Yoga 9660641 424 | Track and field 3895842 425 | Yellow 1871687 426 | Gladiator 2818542 427 | Ice hockey 5147900 428 | Beastie Boys 9614636 429 | Free Appropriate Public Education 230741 430 | The Rolling Stones 15566337 431 | Science, technology, engineering, and mathematics 2250255 432 | Colorado 8301350 433 | Pork 2164506 434 | Dungeons & Dragons 6649283 435 | Blue Ridge Parkway 1330669 436 | Mike Trout 5315410 437 | Ocean 5401523 438 | Underwater hockey 436406 439 | Spice 2528778 440 | Gibson Les Paul 2352115 441 | Murder on the Orient Express (2017 film) 9836159 442 | Swimming 645217 443 | Pita 1338254 444 | Dance 3905111 445 | Giant panda 10271361 446 | Onion 4068491 447 | Zumba 2023534 448 | Hindu 3548143 449 | Anthrax (American band) 2990109 450 | International adoption of South Korean children 197618 451 | Netflix 26821165 452 | Cut of beef 2379048 453 | The Hershey Company 2656480 454 | Emergency department 1327462 455 | Classical music 4081061 456 | Yellow 1871687 457 | Obesity in the United States 1833122 458 | Abe Pollin 123838 459 | Immigration to the United States 3941725 460 | History of American newspapers 606405 461 | Lance Armstrong 8681088 462 | Ford Mustang 8161648 463 | Dating 12379545 464 | Ford Mustang (first generation) 2606261 465 | Pizza 6874547 466 | Radiology 1959453 467 | Santa Fe, New Mexico 3500856 468 | Camping 1579382 469 | Communist Party USA 2485345 470 | Florida 13516911 471 | 
Trick-or-treating 2436386 472 | Vietnamese cuisine 1451959 473 | American Motors 1219419 474 | Les Paul 1990157 475 | Fur 1023405 476 | University of Chicago 3653847 477 | Medieval cuisine 1409915 478 | Cat 20440030 479 | Game of Thrones 66975540 480 | Historical fiction 1399757 481 | Choir 1240728 482 | Arts in Seattle 43843 483 | Magic square 1853666 484 | Gone with the Wind (film) 10029206 485 | Pug 6059374 486 | Extreme Couponing 219509 487 | Bachelor of Science in Nursing 606634 488 | Dog 21115517 489 | Cut of beef 2379048 490 | Anxiety disorder 4699214 491 | Brewery 776666 492 | Reality television 3338151 493 | Travel 2377943 494 | Reading (process) 1332328 495 | Chicken McNuggets 841115 496 | Jimi Hendrix 18313612 497 | Vegetarianism 3703523 498 | Pizza 6874547 499 | Radiology 1959453 500 | The Story So Far (band) 575060 501 | -------------------------------------------------------------------------------- /emnlp_data/wow/random_testset/unseen_topic_pageviews.txt: -------------------------------------------------------------------------------- 1 | Accounting 4561192 2 | Hot dog 3698391 3 | Online shopping 3920354 4 | John Grisham 4108973 5 | Popcorn 3047303 6 | Guns N' Roses 12138504 7 | Harry Potter 33498151 8 | Green 2607165 9 | Elvis Presley 36970826 10 | Harry Potter 33498151 11 | Hound Dog (song) 1004552 12 | American football 10478739 13 | Elvis Presley 36970826 14 | Guns N' Roses 12138504 15 | Attention deficit hyperactivity disorder 8491220 16 | Online shopping 3920354 17 | Early history of American football 217358 18 | American football 10478739 19 | Old Faithful Museum of Thermal Activity 9125 20 | Game design 799287 21 | Relish 756540 22 | Game design 799287 23 | Green 2607165 24 | Ireland 18043996 25 | Genghis Khan 20406918 26 | Broken heart 1481439 27 | American football 10478739 28 | Water skiing 419784 29 | Dylan's Candy Bar 338014 30 | Guns N' Roses 12138504 31 | Green 2607165 32 | Harry Potter 33498151 33 | Ten-pin bowling 1318688 34 | American football 10478739 35 | Trapping 724283 36 | Rottweiler 8100660 37 | Online shopping 3920354 38 | The Walking Dead (TV series) 39648564 39 | Archery 2156917 40 | Insane Clown Posse 4525891 41 | Ireland 18043996 42 | Cod 3134543 43 | Poaching 1545190 44 | Accounting 4561192 45 | Skiing 1880306 46 | Accounting 4561192 47 | Columbia River 2317442 48 | National Football League on television 625789 49 | Irrealism (the arts) 27065 50 | Horrorcore 1380387 51 | American football 10478739 52 | Broken heart 1481439 53 | Jazz 7245064 54 | Accounting 4561192 55 | Hunting 2013242 56 | Guns N' Roses 12138504 57 | Accounting 4561192 58 | Dylan's Candy Bar 338014 59 | Green 2607165 60 | Elvis Presley 36970826 61 | Japanese language 8978480 62 | Stock exchange 3461353 63 | To Kill a Mockingbird 12654725 64 | Cardigan (sweater) 806462 65 | Shades of green 2803743 66 | Cycling 1657944 67 | Political positions of Hillary Clinton 983823 68 | Archery 2156917 69 | Nickelback 4796561 70 | Paramedic 1214132 71 | The Walking Dead (TV series) 39648564 72 | Cheerleading 2601255 73 | Archery 2156917 74 | Harry Potter 33498151 75 | Water skiing 419784 76 | Hunting 2013242 77 | Rock N Roll Experience Magazine 8220 78 | Green 2607165 79 | American football 10478739 80 | Memphis Mafia 613020 81 | Regional street food 180107 82 | John Grisham 4108973 83 | Motivation 4758206 84 | Neurosurgery 1703924 85 | Insane Clown Posse 4525891 86 | List of national parks of the United States 7548317 87 | National Parks of Canada 215749 88 | Snowflake 1032971 89 | 
Nickelback 4796561 90 | Dylan's Candy Bar 338014 91 | Dylan's Candy Bar 338014 92 | Bowling 2012141 93 | Instagram 24175136 94 | Broken heart 1481439 95 | On-again, off-again relationship 511036 96 | American football 10478739 97 | Bowling 2012141 98 | Hunting 2013242 99 | Motivation 4758206 100 | Harry Potter 33498151 101 | Skiing 1880306 102 | Oregon Trail 3340493 103 | Major general 1171137 104 | Motivation 4758206 105 | Nickelback 4796561 106 | Dog 21115517 107 | Stock market 5662980 108 | Ski 535770 109 | The Walking Dead (TV series) 39648564 110 | Cheerleading 2601255 111 | Instagram 24175136 112 | Fox hunting 1142024 113 | Dylan's Candy Bar 338014 114 | Goldendoodle 3954294 115 | Hedge 484218 116 | Interpersonal communication 1690808 117 | Hunting 2013242 118 | Axl Rose 10541775 119 | Dylan's Candy Bar 338014 120 | Irish coffee 1213386 121 | Skype 6841400 122 | Cheerleading 2601255 123 | Broken heart 1481439 124 | Bowling 2012141 125 | Hunting 2013242 126 | Field electron emission 339182 127 | Elvis Presley 36970826 128 | Cod 3134543 129 | Fantastic Beasts and Where to Find Them (film) 24112670 130 | Cheerleading 2601255 131 | Hunting 2013242 132 | Bob Ross 21059248 133 | Accounting 4561192 134 | Accounting 4561192 135 | Google Chrome 24567422 136 | Guns N' Roses 12138504 137 | Black Friday (shopping) 11247562 138 | Cheerleading 2601255 139 | Skiing 1880306 140 | Hot dog 3698391 141 | Guns N' Roses 12138504 142 | American football 10478739 143 | Memphis, Tennessee 4997280 144 | Honey bee 3832925 145 | Blog 8108460 146 | Instagram 24175136 147 | Archery 2156917 148 | Nickelback 4796561 149 | Dylan's Candy Bar 338014 150 | Green 2607165 151 | Green 2607165 152 | Kendrick Lamar 19668434 153 | Northern Ireland 11331844 154 | Popcorn 3047303 155 | Music festival 782112 156 | Accounting 4561192 157 | Green 2607165 158 | Academic dress of universities in Queensland, Australia 33629 159 | Shades of green 2803743 160 | Green 2607165 161 | Accounting 4561192 162 | Blue cheese 2120264 163 | Broken heart 1481439 164 | Elvis Presley 36970826 165 | Irish Americans 1565967 166 | Bowling 2012141 167 | Italian cuisine 2814286 168 | Formula One car 2366007 169 | Green 2607165 170 | Blog 8108460 171 | Broken heart 1481439 172 | Motivation 4758206 173 | Instagram 24175136 174 | Motivation 4758206 175 | Goldendoodle 3954294 176 | Insane Clown Posse 4525891 177 | Auto racing 1840090 178 | Green 2607165 179 | Green 2607165 180 | Hot dog 3698391 181 | Accounting 4561192 182 | Genghis Khan 20406918 183 | Skiing 1880306 184 | Cod 3134543 185 | Phase-out of incandescent light bulbs 471753 186 | Cardigan (sweater) 806462 187 | Skiing 1880306 188 | Poultry farming 1937486 189 | Archery 2156917 190 | Cod 3134543 191 | Regional street food 180107 192 | Green 2607165 193 | Accounting 4561192 194 | Trail riding 147914 195 | Luca Pacioli 857427 196 | Death Eater 2356961 197 | Atlantic cod 945138 198 | Yes (band) 5616061 199 | Red meat 1967240 200 | Green 2607165 201 | Broken heart 1481439 202 | Hunting 2013242 203 | Bowling 2012141 204 | On-again, off-again relationship 511036 205 | Nickelback 4796561 206 | Ireland 18043996 207 | Game design 799287 208 | Bowling 2012141 209 | Skiing 1880306 210 | Cheerleading 2601255 211 | Green 2607165 212 | History of skiing 345690 213 | Eminem 43338165 214 | Ten-pin bowling 1318688 215 | Bowling 2012141 216 | Instagram 24175136 217 | Motivation 4758206 218 | Hunting 2013242 219 | Bodybuilding supplement 845375 220 | Medical billing 518915 221 | Skiing 1880306 222 | John Grisham 
4108973 223 | Ireland 18043996 224 | Waterfowl hunting 269450 225 | List of national parks of the United States 7548317 226 | Accounting 4561192 227 | Bowling 2012141 228 | Cardigan (sweater) 806462 229 | Green 2607165 230 | Skiing 1880306 231 | List of awards and nominations received by Michael Jackson 829559 232 | Skiing 1880306 233 | Green 2607165 234 | Nickelback 4796561 235 | Cheerleading 2601255 236 | Hunting 2013242 237 | Cardigan (sweater) 806462 238 | Accounting 4561192 239 | Motivation 4758206 240 | List of national parks of the United States 7548317 241 | History of QubicaAMF Bowling World Cup 15318 242 | Green 2607165 243 | Guns N' Roses 12138504 244 | Instagram 24175136 245 | List of national parks of the United States 7548317 246 | Portrait of an Army Doctor 13741 247 | Green 2607165 248 | Crossbow 2103185 249 | Dog training 847025 250 | Harry Potter 33498151 251 | American football 10478739 252 | Discovery Channel 2518000 253 | Japanese language 8978480 254 | Hard rock 2532151 255 | Cheerleading 2601255 256 | Nickelback 4796561 257 | Cardigan (sweater) 806462 258 | History of health care reform in the United States 445207 259 | Grammy Award 7328535 260 | American football 10478739 261 | Five-pin bowling 298048 262 | Harry Potter influences and analogues 393906 263 | Cross-country skiing (sport) 279615 264 | Heart Broken 57754 265 | Rugby union in Germany 61369 266 | Indianapolis 500 2712848 267 | Accounting 4561192 268 | Archery 2156917 269 | Accounting 4561192 270 | Hunting 2013242 271 | Whitehaven, Memphis, Tennessee 67662 272 | Elvis Presley 36970826 273 | American football 10478739 274 | Skiing 1880306 275 | Richard Nixon 27123206 276 | Japanese language 8978480 277 | International Financial Reporting Standards 2099309 278 | Nachos 1417170 279 | Gymnastics 3142318 280 | The Walking Dead (TV series) 39648564 281 | Instagram 24175136 282 | Waterfowl hunting 269450 283 | Kurt Cobain 20514337 284 | Green 2607165 285 | Gymnastics 3142318 286 | American football 10478739 287 | Accounting 4561192 288 | Ireland 18043996 289 | Bowling 2012141 290 | Green 2607165 291 | Cheerleading 2601255 292 | Medici Bank 587752 293 | American football 10478739 294 | Skittles (sport) 421753 295 | Electric guitar 2282511 296 | Bowling 2012141 297 | Rick Grimes 3494233 298 | Blog 8108460 299 | American football 10478739 300 | History of Crayola crayons 206577 301 | Instagram 24175136 302 | The Walking Dead (TV series) 39648564 303 | Harry Potter 33498151 304 | Thierry Henry 10254331 305 | Elvis Presley 36970826 306 | The Walking Dead (TV series) 39648564 307 | Hunting 2013242 308 | Bodybuilding supplement 845375 309 | Bodybuilding supplement 845375 310 | Green 2607165 311 | Winter 3146653 312 | Cheerleading 2601255 313 | Privacy concerns with social networking services 668582 314 | Ireland 18043996 315 | Elvis Presley 36970826 316 | Popcorn 3047303 317 | Green 2607165 318 | Skiing 1880306 319 | Blog 8108460 320 | Intercity bus service 200478 321 | Elvis Presley 36970826 322 | Neurosurgery 1703924 323 | Cheerleading 2601255 324 | Nickelback 4796561 325 | Winter 3146653 326 | Cardigan (sweater) 806462 327 | Cheerleading 2601255 328 | American football 10478739 329 | Cheerleading 2601255 330 | Genghis Khan 20406918 331 | Tom Brady 41656273 332 | Skiing 1880306 333 | High Elves (Warhammer) 135086 334 | Skiing 1880306 335 | The Walking Dead (TV series) 39648564 336 | Grunge 3797656 337 | Gymnastics 3142318 338 | Freestyle skateboarding tricks 146839 339 | Elvis impersonator 305174 340 | Consulting firm 
624348 341 | Goldendoodle 3954294 342 | Instagram 24175136 343 | Bodybuilding supplement 845375 344 | Hunting 2013242 345 | Elvis impersonator 305174 346 | American football 10478739 347 | Green 2607165 348 | Cheerleading 2601255 349 | Transformers (film) 6835314 350 | Medieval cuisine 1409915 351 | Green 2607165 352 | Liberty Tax Service 86233 353 | Online shopping 3920354 354 | Herbal 305698 355 | Cheerleading 2601255 356 | Ireland 18043996 357 | Skiing 1880306 358 | American football 10478739 359 | Great Famine (Ireland) 6566470 360 | Cycling 1657944 361 | American football 10478739 362 | The Beatles 28659241 363 | Kurt Cobain 20514337 364 | Preterm birth 2029449 365 | List of Shrek characters 1577478 366 | Dylan's Candy Bar 338014 367 | Gymnastics 3142318 368 | Accounting 4561192 369 | Bee 4593517 370 | Narcissus (plant) 2129973 371 | Online shopping 3920354 372 | Tea 5620471 373 | Bodybuilding supplement 845375 374 | The Walking Dead (TV series) 39648564 375 | Heartbreak Hotel 597104 376 | Stock market 5662980 377 | Ethics of eating meat 617002 378 | Cheerleading 2601255 379 | Poaching 1545190 380 | Discovery Channel 2518000 381 | Ebates 318758 382 | Instagram 24175136 383 | Zombie apocalypse 1196207 384 | Color wheel 3196003 385 | Green 2607165 386 | Elvis Presley 36970826 387 | Green 2607165 388 | Bowling 2012141 389 | Stock market 5662980 390 | Accounting 4561192 391 | Elvis Presley 36970826 392 | Medical billing 518915 393 | Broken heart 1481439 394 | Winter 3146653 395 | History of skiing 345690 396 | Game design 799287 397 | The Walking Dead (TV series) 39648564 398 | Chromium 3730713 399 | Irish coffee 1213386 400 | Broken heart 1481439 401 | Bowling 2012141 402 | Green 2607165 403 | Drama school 153875 404 | Broken heart 1481439 405 | Neurosurgery 1703924 406 | Elvis Presley 36970826 407 | Accounting 4561192 408 | Online shopping 3920354 409 | Skiing 1880306 410 | Dirt track racing 390040 411 | Nickelback 4796561 412 | American football 10478739 413 | Inca road system 384952 414 | Kendrick Lamar 19668434 415 | Motivation 4758206 416 | Non-profit hospital 249270 417 | Skiing 1880306 418 | Cheerleading 2601255 419 | The Walking Dead (video game) 2694059 420 | History of tea 974174 421 | Gymnastics 3142318 422 | Green 2607165 423 | Auto racing 1840090 424 | Gymnastics 3142318 425 | Hot dog 3698391 426 | Samsung Galaxy S III 1650132 427 | Kurt Cobain 20514337 428 | Bowling alley 172742 429 | Broken heart 1481439 430 | Japanese language 8978480 431 | Elvis Presley 36970826 432 | Stock market 5662980 433 | Motivation 4758206 434 | John Grisham 4108973 435 | Broken heart 1481439 436 | Chlorophyll 3068989 437 | Harry Potter 33498151 438 | American football 10478739 439 | Skiing 1880306 440 | Cardigan (sweater) 806462 441 | Rose (color) 635243 442 | Shades of green 2803743 443 | Instagram 24175136 444 | Photosynthesis 8445453 445 | Green 2607165 446 | The Walking Dead (TV series) 39648564 447 | Pink 2098244 448 | Cycling 1657944 449 | American football rules 2179492 450 | Elvis Presley 36970826 451 | Tastes like chicken 291582 452 | Motivation 4758206 453 | Kurt Cobain 20514337 454 | Ireland 18043996 455 | Suicide of Kurt Cobain 3883373 456 | Green 2607165 457 | Elvis Presley 36970826 458 | Bowling 2012141 459 | Elvis Presley 36970826 460 | Motivation 4758206 461 | Cheerleading 2601255 462 | Harry Potter 33498151 463 | Nickelback 4796561 464 | Split (bowling) 181007 465 | Latin influence in English 459062 466 | Beef 3021529 467 | Broken heart 1481439 468 | List of national parks of the 
United States 7548317 469 | Motivation 4758206 470 | Cheerleading 2601255 471 | Green 2607165 472 | Dylan's Candy Bar 338014 473 | Benjamin Spock 900469 474 | American football 10478739 475 | John Grisham 4108973 476 | Bowling 2012141 477 | Video game music 909028 478 | Bowling 2012141 479 | Harry Potter 33498151 480 | Game design 799287 481 | Guns N' Roses 12138504 482 | Skiing 1880306 483 | American football 10478739 484 | Pet Sounds 3460211 485 | The Walking Dead (TV series) 39648564 486 | Green 2607165 487 | Hunting 2013242 488 | Paramedic 1214132 489 | Genghis Khan 20406918 490 | Ireland 18043996 491 | Dallas Cowboys 7695826 492 | Genghis Khan 20406918 493 | Skiing 1880306 494 | Paramedic 1214132 495 | Elvis Presley 36970826 496 | List of Hello Kitty television series 118424 497 | Chihuahua (dog) 5744542 498 | Winter 3146653 499 | Nickelback 4796561 500 | Skiing 1880306 501 | -------------------------------------------------------------------------------- /env/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChanLiang/CONNER/77f99c876bdc6ca8cb3991210e2ccc2914d4971b/env/.DS_Store -------------------------------------------------------------------------------- /env/coherence_environment.yml: -------------------------------------------------------------------------------- 1 | name: coherence 2 | channels: 3 | - defaults 4 | dependencies: 5 | - _libgcc_mutex=0.1=main 6 | - _openmp_mutex=5.1=1_gnu 7 | - bzip2=1.0.8=h7b6447c_0 8 | - ca-certificates=2023.05.30=h06a4308_0 9 | - ld_impl_linux-64=2.38=h1181459_1 10 | - libffi=3.4.4=h6a678d5_0 11 | - libgcc-ng=11.2.0=h1234567_1 12 | - libgomp=11.2.0=h1234567_1 13 | - libstdcxx-ng=11.2.0=h1234567_1 14 | - libuuid=1.41.5=h5eee18b_0 15 | - ncurses=6.4=h6a678d5_0 16 | - openssl=1.1.1t=h7f8727e_0 17 | - pip=23.0.1=py311h06a4308_0 18 | - python=3.11.3=h7a1cb2a_0 19 | - readline=8.2=h5eee18b_0 20 | - setuptools=67.8.0=py311h06a4308_0 21 | - sqlite=3.41.2=h5eee18b_0 22 | - tk=8.6.12=h1ccaba5_0 23 | - wheel=0.38.4=py311h06a4308_0 24 | - xz=5.4.2=h5eee18b_0 25 | - zlib=1.2.13=h5eee18b_0 26 | - pip: 27 | - aiohttp==3.8.4 28 | - aiosignal==1.3.1 29 | - async-timeout==4.0.2 30 | - attrs==23.1.0 31 | - blis==0.7.9 32 | - catalogue==2.0.8 33 | - certifi==2023.5.7 34 | - charset-normalizer==3.1.0 35 | - click==8.1.3 36 | - confection==0.0.4 37 | - cymem==2.0.7 38 | - datasets==2.12.0 39 | - dill==0.3.6 40 | - en-core-web-sm==3.5.0 41 | - filelock==3.12.1 42 | - frozenlist==1.3.3 43 | - fsspec==2023.6.0 44 | - huggingface-hub==0.15.1 45 | - idna==3.4 46 | - jinja2==3.1.2 47 | - joblib==1.2.0 48 | - langcodes==3.3.0 49 | - markupsafe==2.1.3 50 | - multidict==6.0.4 51 | - multiprocess==0.70.14 52 | - murmurhash==1.0.9 53 | - nltk==3.8.1 54 | - numpy==1.24.3 55 | - nvidia-cuda-nvrtc-cu11==11.7.99 56 | - nvidia-cuda-runtime-cu11==11.7.99 57 | - nvidia-cudnn-cu11==8.5.0.96 58 | - packaging==23.1 59 | - pandas==2.0.2 60 | - pathy==0.10.1 61 | - preshed==3.0.8 62 | - pyarrow==12.0.0 63 | - pydantic==1.10.9 64 | - python-dateutil==2.8.2 65 | - pytz==2023.3 66 | - pyyaml==6.0 67 | - regex==2023.6.3 68 | - requests==2.31.0 69 | - responses==0.18.0 70 | - safetensors==0.3.1 71 | - scikit-learn==1.2.2 72 | - scipy==1.10.1 73 | - sentencepiece==0.1.99 74 | - sgnlp==0.4.0 75 | - six==1.16.0 76 | - smart-open==6.3.0 77 | - spacy==3.5.3 78 | - spacy-legacy==3.0.12 79 | - spacy-loggers==1.0.4 80 | - srsly==2.4.6 81 | - thinc==8.1.10 82 | - threadpoolctl==3.1.0 83 | - tokenizers==0.13.3 84 | - 
torch==1.13.1 85 | - torchtext==0.6.0 86 | - tqdm==4.65.0 87 | - transformers==4.30.1 88 | - typer==0.7.0 89 | - typing-extensions==4.6.3 90 | - tzdata==2023.3 91 | - urllib3==2.0.3 92 | - wasabi==1.1.2 93 | - xxhash==3.2.0 94 | - yarl==1.9.2 95 | prefix: /misc/kfdata01/kf_grp/lchen/anaconda3/envs/coherence 96 | -------------------------------------------------------------------------------- /env/environment.yml: -------------------------------------------------------------------------------- 1 | name: FactualityPrompt 2 | channels: 3 | - defaults 4 | dependencies: 5 | - _libgcc_mutex=0.1=main 6 | - _openmp_mutex=5.1=1_gnu 7 | - bzip2=1.0.8=h7b6447c_0 8 | - ca-certificates=2023.01.10=h06a4308_0 9 | - certifi=2022.12.7=py310h06a4308_0 10 | - ld_impl_linux-64=2.38=h1181459_1 11 | - libffi=3.4.2=h6a678d5_6 12 | - libgcc-ng=11.2.0=h1234567_1 13 | - libgomp=11.2.0=h1234567_1 14 | - libstdcxx-ng=11.2.0=h1234567_1 15 | - libuuid=1.41.5=h5eee18b_0 16 | - ncurses=6.4=h6a678d5_0 17 | - openssl=1.1.1t=h7f8727e_0 18 | - pip=22.3.1=py310h06a4308_0 19 | - python=3.10.9=h7a1cb2a_0 20 | - readline=8.2=h5eee18b_0 21 | - setuptools=65.6.3=py310h06a4308_0 22 | - sqlite=3.40.1=h5082296_0 23 | - tk=8.6.12=h1ccaba5_0 24 | - tzdata=2022g=h04d1e81_0 25 | - wheel=0.38.4=py310h06a4308_0 26 | - xz=5.2.10=h5eee18b_1 27 | - zlib=1.2.13=h5eee18b_0 28 | - pip: 29 | - absl-py==1.4.0 30 | - antlr4-python3-runtime==4.8 31 | - astunparse==1.6.3 32 | - beautifulsoup4==4.11.2 33 | - benepar==0.2.0 34 | - bitarray==2.7.3 35 | - blis==0.7.9 36 | - bs4==0.0.1 37 | - cachetools==5.3.0 38 | - catalogue==2.0.8 39 | - cffi==1.15.1 40 | - charset-normalizer==3.0.1 41 | - click==8.1.3 42 | - colorama==0.4.6 43 | - common==0.1.2 44 | - common-utils==2.0.1.dev1 45 | - confection==0.0.4 46 | - cymem==2.0.7 47 | - cysignals==1.11.2 48 | - cython==0.29.33 49 | - editdistance==0.6.2 50 | - en-core-web-sm==3.5.0 51 | - fairseq==0.12.2 52 | - fever-drqa==1.0.13 53 | - filelock==3.9.0 54 | - flatbuffers==23.1.21 55 | - future==0.18.3 56 | - gast==0.4.0 57 | - google-api-core==2.11.0 58 | - google-api-python-client==2.83.0 59 | - google-auth==2.16.1 60 | - google-auth-httplib2==0.1.0 61 | - google-auth-oauthlib==0.4.6 62 | - google-pasta==0.2.0 63 | - googleapis-common-protos==1.59.0 64 | - grpcio==1.51.3 65 | - h5py==3.8.0 66 | - httplib2==0.22.0 67 | - huggingface-hub==0.12.1 68 | - hydra-core==1.0.7 69 | - idna==3.4 70 | - jaraco-context==4.3.0 71 | - jinja2==3.1.2 72 | - joblib==1.2.0 73 | - keras==2.11.0 74 | - langcodes==3.3.0 75 | - libclang==15.0.6.1 76 | - lxml==4.9.2 77 | - markdown==3.4.1 78 | - markupsafe==2.1.2 79 | - more-itertools==9.1.0 80 | - murmurhash==1.0.9 81 | - nltk==3.8.1 82 | - numpy==1.24.2 83 | - nvidia-cuda-nvrtc-cu11==11.7.99 84 | - nvidia-cuda-runtime-cu11==11.7.99 85 | - nvidia-cudnn-cu11==8.5.0.96 86 | - oauthlib==3.2.2 87 | - omegaconf==2.0.6 88 | - opt-einsum==3.3.0 89 | - packaging==23.0 90 | - pandas==1.5.3 91 | - pathy==0.10.1 92 | - pexpect==4.8.0 93 | - pillow==9.4.0 94 | - portalocker==2.7.0 95 | - preshed==3.0.8 96 | - prettytable==3.6.0 97 | - protobuf==3.19.6 98 | - ptyprocess==0.7.0 99 | - pyasn1==0.4.8 100 | - pyasn1-modules==0.2.8 101 | - pycparser==2.21 102 | - pydantic==1.10.5 103 | - pyparsing==3.0.9 104 | - python-dateutil==2.8.2 105 | - pytz==2022.7.1 106 | - pyyaml==6.0 107 | - rank-bm25==0.2.2 108 | - regex==2022.10.31 109 | - requests==2.28.2 110 | - requests-oauthlib==1.3.1 111 | - rsa==4.9 112 | - sacrebleu==2.3.1 113 | - scikit-learn==1.2.1 114 | - scipy==1.10.1 115 | - 
sentence-transformers==2.2.2 116 | - sentencepiece==0.1.97 117 | - six==1.16.0 118 | - smart-open==6.3.0 119 | - soupsieve==2.4 120 | - spacy==3.5.0 121 | - spacy-legacy==3.0.12 122 | - spacy-loggers==1.0.4 123 | - srsly==2.4.5 124 | - tabulate==0.9.0 125 | - tensorboard==2.11.2 126 | - tensorboard-data-server==0.6.1 127 | - tensorboard-plugin-wit==1.8.1 128 | - tensorflow==2.11.0 129 | - tensorflow-estimator==2.11.0 130 | - tensorflow-io-gcs-filesystem==0.30.0 131 | - termcolor==2.2.0 132 | - thefuzz==0.19.0 133 | - thinc==8.1.7 134 | - threadpoolctl==3.1.0 135 | - tokenizers==0.13.2 136 | - torch==1.13.1 137 | - torch-struct==0.5 138 | - torchaudio==0.13.1 139 | - torchvision==0.14.1 140 | - tqdm==4.64.1 141 | - transformers==4.26.1 142 | - typer==0.7.0 143 | - typing-extensions==4.5.0 144 | - uritemplate==4.1.1 145 | - urllib3==1.26.14 146 | - wasabi==1.1.1 147 | - wcwidth==0.2.6 148 | - werkzeug==2.2.3 149 | - wikipedia==1.4.0 150 | - wolframalpha==5.0.0 151 | - wrapt==1.14.1 152 | - xmltodict==0.13.0 153 | prefix: /misc/kfdata01/kf_grp/lchen/anaconda3/envs/FactualityPrompt 154 | -------------------------------------------------------------------------------- /framework.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChanLiang/CONNER/77f99c876bdc6ca8cb3991210e2ccc2914d4971b/framework.png -------------------------------------------------------------------------------- /scripts/helpfulness/nq_random_knowledge.sh: -------------------------------------------------------------------------------- 1 | 2 | exp_name=nq_llama_65B_random_knowledge 3 | task=nq 4 | 5 | # debug=True 6 | debug=False 7 | 8 | testfile=../emnlp23/emnlp_data/nq/random_testset/nq_test_random_testset.txt 9 | promptfile=../emnlp23/emnlp_data/nq/random_prompts/nq_test_random_prompt.txt 10 | 11 | downstream_model=llama-65B 12 | zero_shot=False 13 | knowledge_type=random_knowledge 14 | 15 | export TRANSFORMERS_CACHE='YOUR_DIR' 16 | export HF_HOME='YOUR_DIR' 17 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 18 | 19 | python3 -u helpfulness.py \ 20 | --exp_name $exp_name \ 21 | --task $task \ 22 | --zero_shot $zero_shot \ 23 | --debug $debug \ 24 | --testfile $testfile \ 25 | --promptfile $promptfile \ 26 | --downstream_model $downstream_model \ 27 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 28 | 29 | 30 | -------------------------------------------------------------------------------- /scripts/helpfulness/nq_w_hyp_knowledge.sh: -------------------------------------------------------------------------------- 1 | for name in your_predictions_dir 2 | do 3 | 4 | exp_name=${name}_w_hyp_knowledge 5 | # debug=True 6 | debug=False 7 | 8 | testfile=emnlp_data/nq/random_testset/nq_test_random_testset.txt 9 | promptfile=./emnlp_data/nq/random_prompts/nq_test_random_prompt.txt 10 | hyp_knowledge="${name}_w_hyp_knowledge" 11 | 12 | # downstream_model=flan-t5-xxl 13 | downstream_model=llama-65B 14 | knowledge_type=w_hyp_knowledge 15 | zero_shot=False 16 | 17 | export TRANSFORMERS_CACHE='YOUR_DIR' 18 | export HF_HOME='YOUR_DIR' 19 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 20 | 21 | python3 -u helpfulness.py \ 22 | --exp_name $exp_name \ 23 | --task nq \ 24 | --zero_shot $zero_shot \ 25 | --debug $debug \ 26 | --testfile $testfile \ 27 | --hyp_knowledge $hyp_knowledge \ 28 | --promptfile $promptfile \ 29 | --downstream_model $downstream_model \ 30 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 31 | 32 | wait 33 | 34 | done 
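35 | 36 | # Usage sketch (the prediction directory in the for-loop above and the --hyp_knowledge file name are placeholders/assumptions to replace with your own paths): 37 | #   bash scripts/helpfulness/nq_w_hyp_knowledge.sh 38 | # The per-run summary is expected under helpfulness_results/${exp_name}.txt, which is the file read by scripts/helpfulness/view_results.sh.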
-------------------------------------------------------------------------------- /scripts/helpfulness/nq_w_ref_knowledge.sh: -------------------------------------------------------------------------------- 1 | export http_proxy="http://star-proxy.oa.com:3128" 2 | export https_proxy="http://star-proxy.oa.com:3128" 3 | export ftp_proxy="http://star-proxy.oa.com:3128" 4 | export no_proxy=".woa.com,mirrors.cloud.tencent.com,tlinux-mirror.tencent-cloud.com,tlinux-mirrorlist.tencent-cloud.com,localhost,127.0.0.1,mirrors-tlinux.tencentyun.com,.oa.com,.local,.3gqq.com,.7700.org,.ad.com,.ada_sixjoy.com,.addev.com,.app.local,.apps.local,.aurora.com,.autotest123.com,.bocaiwawa.com,.boss.com,.cdc.com,.cdn.com,.cds.com,.cf.com,.cjgc.local,.cm.com,.code.com,.datamine.com,.dvas.com,.dyndns.tv,.ecc.com,.expochart.cn,.expovideo.cn,.fms.com,.great.com,.hadoop.sec,.heme.com,.home.com,.hotbar.com,.ibg.com,.ied.com,.ieg.local,.ierd.com,.imd.com,.imoss.com,.isd.com,.isoso.com,.itil.com,.kao5.com,.kf.com,.kitty.com,.lpptp.com,.m.com,.matrix.cloud,.matrix.net,.mickey.com,.mig.local,.mqq.com,.oiweb.com,.okbuy.isddev.com,.oss.com,.otaworld.com,.paipaioa.com,.qqbrowser.local,.qqinternal.com,.qqwork.com,.rtpre.com,.sc.oa.com,.sec.com,.server.com,.service.com,.sjkxinternal.com,.sllwrnm5.cn,.sng.local,.soc.com,.t.km,.tcna.com,.teg.local,.tencentvoip.com,.tenpayoa.com,.test.air.tenpay.com,.tr.com,.tr_autotest123.com,.vpn.com,.wb.local,.webdev.com,.webdev2.com,.wizard.com,.wqq.com,.wsd.com,.sng.com,.music.lan,.mnet2.com,.tencentb2.com,.tmeoa.com,.pcg.com,www.wip3.adobe.com,www-mm.wip3.adobe.com,mirrors.tencent.com,csighub.tencentyun.com" 5 | 6 | task=nq 7 | exp_name=nq_llama_65B_w_ref_knowledge 8 | 9 | # debug=True 10 | debug=False 11 | testfile=../emnlp23/emnlp_data/nq/random_testset/nq_test_random_testset.txt 12 | promptfile=../emnlp23/emnlp_data/nq/random_prompts/nq_test_random_prompt.txt 13 | 14 | downstream_model=llama-65B 15 | knowledge_type=w_ref_knowledge 16 | zero_shot=False 17 | 18 | export TRANSFORMERS_CACHE='YOUR_DIR' 19 | export HF_HOME='YOUR_DIR' 20 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 21 | 22 | # export CUDA_VISIBLE_DEVICES=1,2,3 23 | python3 -u helpfulness.py \ 24 | --exp_name $exp_name \ 25 | --task $task \ 26 | --zero_shot $zero_shot \ 27 | --debug $debug \ 28 | --testfile $testfile \ 29 | --promptfile $promptfile \ 30 | --downstream_model $downstream_model \ 31 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 32 | -------------------------------------------------------------------------------- /scripts/helpfulness/nq_wo_knowledge.sh: -------------------------------------------------------------------------------- 1 | 2 | exp_name="YOUR_EXP_NAME" 3 | task=nq 4 | 5 | # debug=True 6 | debug=False 7 | testfile=../emnlp23/emnlp_data/nq/random_testset/nq_test_random_testset.txt 8 | promptfile=../emnlp23/emnlp_data/nq/random_prompts/nq_test_random_prompt.txt 9 | 10 | # downstream_model=flan-t5-xxl 11 | downstream_model=llama-65B 12 | zero_shot=False 13 | knowledge_type=wo_knowledge 14 | 15 | export TRANSFORMERS_CACHE='YOUR_DIR' 16 | export HF_HOME='YOUR_DIR' 17 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 18 | 19 | python3 -u helpfulness.py \ 20 | --exp_name $exp_name \ 21 | --task $task \ 22 | --zero_shot $zero_shot \ 23 | --debug $debug \ 24 | --testfile $testfile \ 25 | --promptfile $promptfile \ 26 | --downstream_model $downstream_model \ 27 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 28 | 29 | -------------------------------------------------------------------------------- 
/scripts/helpfulness/view_results.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in "YOUR_EXP_DIR" 3 | do 4 | 5 | echo $name 6 | exp_name=${name}_w_hyp_knowledge 7 | tail -2 helpfulness_results/${exp_name}.txt 8 | echo 9 | 10 | done 11 | 12 | -------------------------------------------------------------------------------- /scripts/helpfulness/wow_random_knowledge.sh: -------------------------------------------------------------------------------- 1 | exp_name=wow_helpfulness_random_knowledge 2 | task=wow 3 | 4 | # debug=True 5 | debug=False 6 | testfile=../emnlp23/emnlp_data/wow/random_testset/seen_random_testset.txt 7 | promptfile=../emnlp23/emnlp_data/wow/random_prompts/seen_random_prompt.txt 8 | 9 | downstream_model=llama-65B 10 | zero_shot=False 11 | knowledge_type=random_knowledge 12 | 13 | export TRANSFORMERS_CACHE='YOUR_DIR' 14 | export HF_HOME='YOUR_DIR' 15 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 16 | 17 | python3 -u helpfulness.py \ 18 | --exp_name $exp_name \ 19 | --task $task \ 20 | --zero_shot $zero_shot \ 21 | --debug $debug \ 22 | --testfile $testfile \ 23 | --promptfile $promptfile \ 24 | --downstream_model $downstream_model \ 25 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 26 | # --knowledge_type $knowledge_type 27 | 28 | 29 | -------------------------------------------------------------------------------- /scripts/helpfulness/wow_w_hyp_knowledge.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in "YOUR_EXP_NAME" 3 | do 4 | 5 | exp_name=${name}_w_hyp_knowledge 6 | # debug=True 7 | debug=False 8 | testfile=emnlp_data/wow/random_testset/seen_random_testset.txt 9 | promptfile=emnlp_data/wow/random_prompts/seen_random_prompt.txt 10 | hyp_knowledge=${name}_w_hyp_knowledge 11 | 12 | downstream_model=llama-65B 13 | knowledge_type=w_hyp_knowledge 14 | zero_shot=False 15 | 16 | export TRANSFORMERS_CACHE='YOUR_DIR' 17 | export HF_HOME='YOUR_DIR' 18 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 19 | 20 | python3 -u helpfulness.py \ 21 | --exp_name $exp_name \ 22 | --task wow \ 23 | --zero_shot $zero_shot \ 24 | --debug $debug \ 25 | --testfile $testfile \ 26 | --hyp_knowledge $hyp_knowledge \ 27 | --promptfile $promptfile \ 28 | --downstream_model $downstream_model \ 29 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 30 | 31 | wait 32 | 33 | done -------------------------------------------------------------------------------- /scripts/nq_coh_para.sh: -------------------------------------------------------------------------------- 1 | # env: base 2 | 3 | for name in your_prediction_dir 4 | do 5 | 6 | hyp=${name}/nq_hyp 7 | 8 | exp_name=discourse_coherence_${name} 9 | echo $name 10 | 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. python -u discourse-coherence.py \ 13 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 14 | 15 | echo 16 | 17 | wait 18 | done 19 | -------------------------------------------------------------------------------- /scripts/nq_coh_sent.sh: -------------------------------------------------------------------------------- 1 | # env: base 2 | 3 | for name in your_hyper_dir 4 | do 5 | 6 | hyp=${name}/nq_ref 7 | 8 | exp_name=ppl_${name} 9 | echo $name 10 | 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. 
python -u src/ppl.py \ 13 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 14 | 15 | echo 16 | 17 | wait 18 | done 19 | -------------------------------------------------------------------------------- /scripts/nq_factuality.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # This script evaluates the model predictions using various parameters. 3 | 4 | # Set the debug mode. If true, additional debugging information will be printed. 5 | debug=False 6 | 7 | # Number of evaluations to perform. 8 | eval_num=500 9 | 10 | # Number of retrieved evidence passages (IR) to consider. 11 | IR_num=10 12 | 13 | # Whether to evaluate without the ground-truth knowledge (False keeps it). 14 | wo_ground_truth_knowledge=False 15 | 16 | # Aggregation strategy for claim-level scores (e.g., max). 17 | outer_strategy=max 18 | 19 | # Loop through all the predictions of your model. 20 | for name in model_prediction_dir; do 21 | # Define the reference and hypothesis paths. 22 | ref="emnlp_data/nq/random_testset/nq_test_random_testset.txt" 23 | hyp="${name}/nq_knowledge" 24 | 25 | # Construct the experiment name based on the current configuration. 26 | exp_name="${name}_IR${IR_num}_${outer_strategy}" 27 | echo "Experiment Name: $exp_name" 28 | 29 | # Set the CUDA device. 30 | export CUDA_VISIBLE_DEVICES=0 31 | 32 | # Run the evaluation script with the specified parameters. 33 | PYTHONPATH=. python -u src/eval_exp.py \ 34 | --hyp_path "$hyp" \ 35 | --ref_path "$ref" \ 36 | --use_IR_eval \ 37 | --debug "$debug" \ 38 | --eval_num "$eval_num" \ 39 | --wo_ground_truth_knowledge "$wo_ground_truth_knowledge" \ 40 | --outer_strategy "$outer_strategy" \ 41 | --retrieved_num "$IR_num" \ 42 | 1> "log/log-${exp_name}" 2>&1 43 | 44 | # Wait for the process to finish before continuing with the next prediction. 45 | wait 46 | done -------------------------------------------------------------------------------- /scripts/nq_factuality_view.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | your_log_name_list=(log1 log2 log3) # Replace with actual log names 4 | 5 | for name in "${your_log_name_list[@]}"; do 6 | echo "Results for $name:" 7 | tail -2 "log/log-${name}_IR10_max" 8 | echo 9 | done -------------------------------------------------------------------------------- /scripts/nq_info.sh: -------------------------------------------------------------------------------- 1 | for name in your_prediction_dir 2 | do 3 | 4 | hyp=${name}/nq_knowledge 5 | 6 | ref=emnlp_data/nq/random_testset/nq_test_random_testset.txt 7 | 8 | exp_name=info_${name} 9 | echo $name 10 | 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. python -u info.py \ 13 | --task nq \ 14 | --ref_path $ref \ 15 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 16 | 17 | echo 18 | 19 | wait 20 | done 21 | -------------------------------------------------------------------------------- /scripts/nq_relevance.sh: -------------------------------------------------------------------------------- 1 | for name in your_prediction 2 | do 3 | 4 | 5 | ref=emnlp_data/nq/random_testset/nq_test_random_testset.txt 6 | hyp=${name}/nq_knowledge 7 | 8 | exp_name=relevance_${name} 9 | echo $name 10 | 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. 
python -u relevance.py \ 13 | --hyp_path $hyp \ 14 | --ref_path $ref 1>log/log-${exp_name} 2>&1 15 | 16 | echo 17 | 18 | wait 19 | done 20 | -------------------------------------------------------------------------------- /scripts/nq_validity.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Toggle debug mode 4 | debug=False 5 | 6 | # Number of evaluations 7 | eval_num=500 8 | 9 | # List of experiment directories (update with actual directory names) 10 | your_experiment_dir_list=(dir1 dir2 dir3) 11 | 12 | for name in "${your_experiment_dir_list[@]}"; do 13 | ref="./emnlp_data/nq/random_testset/nq_test_random_testset.txt" 14 | hyp="./answers/nq_answer_for_${name}/nq_answer" 15 | 16 | echo "Running experiment: ${name}" 17 | 18 | export CUDA_VISIBLE_DEVICES=0 19 | 20 | PYTHONPATH=. python -u src/nq_validity.py \ 21 | --hyp_path "$hyp" \ 22 | --ref_path "$ref" \ 23 | --debug "$debug" \ 24 | --eval_num "$eval_num" 1>"log/log-${name}" 2>&1 25 | 26 | wait 27 | done -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_DPR.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | 4 | # w IR 5 | 6 | IR_num=3 7 | exp_name=IR${IR_num}_eval_backup 8 | 9 | export CUDA_VISIBLE_DEVICES=2 10 | PYTHONPATH=. python src/eval_401.py \ 11 | --hyp_path /misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/DPR_top1_knowledge_seen \ 12 | --sent_ref_path $REF_PATH/output_testseen_knowledge_sentence_reference.txt \ 13 | --use_IR_eval \ 14 | --retrieved_num $IR_num \ 15 | --doc_ref_path $REF_PATH/output_testseen_knowledge_doc_reference.txt 1>log/DPR-${exp_name}-zero-shot-res.txt 2>log/DPR-${exp_name}-zero-shot-err.txt 16 | 17 | wait 18 | 19 | export CUDA_VISIBLE_DEVICES=2 20 | PYTHONPATH=. python src/eval_401.py \ 21 | --hyp_path /misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/DPR_top1_knowledge_unseen \ 22 | --sent_ref_path $REF_PATH/output_testunseen_knowledge_sentence_reference.txt \ 23 | --use_IR_eval \ 24 | --retrieved_num $IR_num \ 25 | --doc_ref_path $REF_PATH/output_testunseen_knowledge_doc_reference.txt 1>log/DPR-${exp_name}-zero-shot-unseen-res.txt 2>log/DPR-${exp_name}-zero-shot-unseen-err.txt 26 | 27 | 28 | 29 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_knowledge.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | # seen exp 4 | 5 | # few-shot 6 | # export CUDA_VISIBLE_DEVICES=0 7 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 8 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/flan-t5-11B/few-shot/seen_knowledge \ 9 | # --sent_ref_path $REF_PATH/output_testseen_knowledge_sentence_reference.txt \ 10 | # --doc_ref_path $REF_PATH/output_testseen_knowledge_doc_reference.txt 1>log/few-shot-res.txt 2>log/few-shot-err.txt 11 | 12 | # zero-shot 13 | # export CUDA_VISIBLE_DEVICES=1 14 | # PYTHONPATH=. 
python src/eval_NE_NLI.py \ 15 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/flan-t5-11B/zero-shot/seen_knowledge_last_utter \ 16 | # --sent_ref_path $REF_PATH/output_testseen_knowledge_sentence_reference.txt \ 17 | # --doc_ref_path $REF_PATH/output_testseen_knowledge_doc_reference.txt 1>log/zero-shot-res.txt 2>log/zero-shot-err.txt 18 | 19 | 20 | # unseen exp 21 | # few-shot 22 | # export CUDA_VISIBLE_DEVICES=2 23 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 24 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/flan-t5-11B/few-shot/unseen_knowledge \ 25 | # --sent_ref_path $REF_PATH/output_testunseen_knowledge_sentence_reference.txt \ 26 | # --doc_ref_path $REF_PATH/output_testunseen_knowledge_doc_reference.txt 1>log/few-shot-unseen-res.txt 2>log/few-shot-unseen-err.txt 27 | 28 | # export CUDA_VISIBLE_DEVICES=3 29 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 30 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/flan-t5-11B/zero-shot/unseen_knowledge_last_utter \ 31 | # --sent_ref_path $REF_PATH/output_testunseen_knowledge_sentence_reference.txt \ 32 | # --doc_ref_path $REF_PATH/output_testunseen_knowledge_doc_reference.txt 1>log/zero-shot-unseen-res.txt 2>log/zero-shot-unseen-err.txt 33 | 34 | 35 | exp_name=IR3_eval 36 | 37 | # for model in flan-t5-11B flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 38 | for model in flan-t5-xl 39 | do 40 | echo $model 41 | # seen + few-shot 42 | split=seen 43 | data=few-shot 44 | export CUDA_VISIBLE_DEVICES=0 45 | PYTHONPATH=. python src/eval_NE_NLI.py \ 46 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge__clean \ 47 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 48 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 49 | 50 | # seen + zero-shot 51 | split=seen 52 | data=zero-shot 53 | export CUDA_VISIBLE_DEVICES=1 54 | PYTHONPATH=. python src/eval_NE_NLI.py \ 55 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ \ 56 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 57 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 58 | 59 | 60 | # unseen + few-shot 61 | split=unseen 62 | data=few-shot 63 | export CUDA_VISIBLE_DEVICES=2 64 | PYTHONPATH=. python src/eval_NE_NLI.py \ 65 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge__clean \ 66 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 67 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 68 | 69 | 70 | # unseen + zero-shot 71 | split=unseen 72 | data=zero-shot 73 | export CUDA_VISIBLE_DEVICES=3 74 | PYTHONPATH=. 
python src/eval_NE_NLI.py \ 75 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ \ 76 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 77 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 78 | 79 | wait 80 | 81 | done 82 | 83 | 84 | 85 | 86 | 87 | # GEN_TO_EVALUATE_NAME=./wow/zero-shot/wizard-test-p1.jsonl 88 | 89 | # PYTHONPATH=. python src/evaluate_generated_knowledge.py \ 90 | # --gen_path ${GEN_TO_EVALUATE_NAME} 1>log/res.txt 2>log/err.txt 91 | 92 | 93 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_knowledge_IR.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | IR_num=3 4 | # exp_name=IR${IR_num}_eval_backup 5 | exp_name=IR${IR_num}_eval_filter_know 6 | 7 | # for model in flan-t5-11B flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 8 | for model in flan-t5-xxl 9 | do 10 | echo $model 11 | # seen + few-shot 12 | split=seen 13 | data=few-shot 14 | # hyp=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge__clean 15 | hyp=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/filter_know_${split}_knowledge 16 | 17 | export CUDA_VISIBLE_DEVICES=0 18 | PYTHONPATH=. python src/eval_401.py \ 19 | --hyp_path $hyp \ 20 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 21 | --use_IR_eval \ 22 | --retrieved_num $IR_num \ 23 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 24 | 25 | wait 26 | 27 | 28 | # # seen + zero-shot 29 | # split=seen 30 | # data=zero-shot 31 | # export CUDA_VISIBLE_DEVICES=1 32 | # PYTHONPATH=. python src/eval_401.py \ 33 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ \ 34 | # --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 35 | # --use_IR_eval \ 36 | # --retrieved_num $IR_num \ 37 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 38 | 39 | # wait 40 | 41 | # unseen + few-shot 42 | split=unseen 43 | data=few-shot 44 | hyp=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/filter_know_${split}_knowledge 45 | 46 | export CUDA_VISIBLE_DEVICES=2 47 | PYTHONPATH=. python src/eval_401.py \ 48 | --hyp_path $hyp \ 49 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 50 | --use_IR_eval \ 51 | --retrieved_num $IR_num \ 52 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 53 | 54 | wait 55 | 56 | # # unseen + zero-shot 57 | # split=unseen 58 | # data=zero-shot 59 | # export CUDA_VISIBLE_DEVICES=0 60 | # PYTHONPATH=. 
python src/eval_401.py \ 61 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ \ 62 | # --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 63 | # --use_IR_eval \ 64 | # --retrieved_num $IR_num \ 65 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 66 | 67 | # wait 68 | 69 | done 70 | 71 | 72 | 73 | 74 | 75 | # GEN_TO_EVALUATE_NAME=./wow/zero-shot/wizard-test-p1.jsonl 76 | 77 | # PYTHONPATH=. python src/evaluate_generated_knowledge.py \ 78 | # --gen_path ${GEN_TO_EVALUATE_NAME} 1>log/res.txt 2>log/err.txt 79 | 80 | 81 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_opt_knowledge_IR.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | IR_num=3 4 | exp_name=OPT-IR${IR_num}_eval 5 | 6 | # for model in opt-13b opt-1.3b 7 | for model in opt-13b opt-iml-1.3b opt-1.3b 8 | # for model in opt-6.7b 9 | do 10 | echo $model 11 | # seen + few-shot 12 | split=seen 13 | data=few-shot 14 | export CUDA_VISIBLE_DEVICES=1 15 | PYTHONPATH=. python src/eval_401.py \ 16 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_extract \ 17 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 18 | --use_IR_eval \ 19 | --retrieved_num $IR_num \ 20 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 21 | 22 | wait 23 | 24 | # unseen + few-shot 25 | split=unseen 26 | data=few-shot 27 | export CUDA_VISIBLE_DEVICES=1 28 | PYTHONPATH=. python src/eval_401.py \ 29 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_extract \ 30 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 31 | --use_IR_eval \ 32 | --retrieved_num $IR_num \ 33 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 34 | 35 | wait 36 | 37 | 38 | done 39 | 40 | 41 | 42 | 43 | 44 | # GEN_TO_EVALUATE_NAME=./wow/zero-shot/wizard-test-p1.jsonl 45 | 46 | # PYTHONPATH=. python src/evaluate_generated_knowledge.py \ 47 | # --gen_path ${GEN_TO_EVALUATE_NAME} 1>log/res.txt 2>log/err.txt 48 | 49 | 50 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_refined_knowledge.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | # for model in flan-t5-11B flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 4 | # for model in flan-t5-11B 5 | for model in flan-t5-xxl 6 | do 7 | echo $model 8 | 9 | 10 | # seen + few-shot 11 | split=seen 12 | data=few-shot 13 | # hyp_file=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge__clean_refinement 14 | hyp_file=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_fewshot_refinement 15 | 16 | export CUDA_VISIBLE_DEVICES=0 17 | PYTHONPATH=. 
python src/eval_NE_NLI.py \ 18 | --hyp_path $hyp_file \ 19 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 20 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-fewshot-refine-knowledge-res.txt 2>log/${model}-${data}-${split}-fewshot-refine-knowledge-err.txt & 21 | 22 | # unseen + few-shot 23 | split=unseen 24 | data=few-shot 25 | hyp_file=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_fewshot_refinement 26 | 27 | export CUDA_VISIBLE_DEVICES=2 28 | PYTHONPATH=. python src/eval_NE_NLI.py \ 29 | --hyp_path $hyp_file \ 30 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 31 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-fewshot-refine-knowledge-res.txt 2>log/${model}-${data}-${split}-fewshot-refine-knowledge-err.txt & 32 | 33 | # # seen + zero-shot 34 | # split=seen 35 | # data=zero-shot 36 | # export CUDA_VISIBLE_DEVICES=1 37 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 38 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter__refinement \ 39 | # --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 40 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-zeroshot-refine-knowledge-res.txt 2>log/${model}-${data}-${split}-zeroshot-refine-knowledge-err.txt & 41 | 42 | 43 | # # unseen + zero-shot 44 | # split=unseen 45 | # data=zero-shot 46 | # export CUDA_VISIBLE_DEVICES=3 47 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 48 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter__refinement \ 49 | # --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 50 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-zeroshot-refine-knowledge-res.txt 2>log/${model}-${data}-${split}-zeroshot-refine-knowledge-err.txt & 51 | 52 | wait 53 | 54 | done 55 | 56 | 57 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_refined_knowledge_IR.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | IR_num=3 4 | exp_name=IR${IR_num}_eval_refinement 5 | 6 | for model in flan-t5-xxl 7 | do 8 | echo $model 9 | 10 | # seen + few-shot 11 | split=seen 12 | data=few-shot 13 | export CUDA_VISIBLE_DEVICES=0 14 | PYTHONPATH=. python src/eval_401.py \ 15 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_fewshot_refinement \ 16 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 17 | --use_IR_eval \ 18 | --retrieved_num $IR_num \ 19 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 20 | 21 | wait 22 | 23 | # unseen + few-shot 24 | split=unseen 25 | data=few-shot 26 | export CUDA_VISIBLE_DEVICES=2 27 | PYTHONPATH=. 
python src/eval_401.py \ 28 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_fewshot_refinement \ 29 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 30 | --use_IR_eval \ 31 | --retrieved_num $IR_num \ 32 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 33 | 34 | wait 35 | 36 | 37 | done 38 | 39 | 40 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_response.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | 4 | # for model in flan-t5-11B flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 5 | for model in flan-t5-11B 6 | do 7 | echo $model 8 | # seen + few-shot 9 | split=seen 10 | data=few-shot 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. python src/eval_NE_NLI.py \ 13 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_response \ 14 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 15 | --doc_ref_path $REF_PATH/output_test${split}_response_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 16 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 17 | 18 | # seen + zero-shot 19 | split=seen 20 | data=zero-shot 21 | export CUDA_VISIBLE_DEVICES=1 22 | PYTHONPATH=. python src/eval_NE_NLI.py \ 23 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_response \ 24 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 25 | --doc_ref_path $REF_PATH/output_test${split}_response_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 26 | 27 | 28 | # unseen + few-shot 29 | split=unseen 30 | data=few-shot 31 | export CUDA_VISIBLE_DEVICES=2 32 | PYTHONPATH=. python src/eval_NE_NLI.py \ 33 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_response \ 34 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 35 | --doc_ref_path $REF_PATH/output_test${split}_response_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 36 | 37 | 38 | # unseen + zero-shot 39 | split=unseen 40 | data=zero-shot 41 | export CUDA_VISIBLE_DEVICES=3 42 | PYTHONPATH=. python src/eval_NE_NLI.py \ 43 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_response \ 44 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 45 | --doc_ref_path $REF_PATH/output_test${split}_response_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 46 | 47 | wait 48 | 49 | done 50 | 51 | 52 | 53 | 54 | 55 | # GEN_TO_EVALUATE_NAME=./wow/zero-shot/wizard-test-p1.jsonl 56 | 57 | # PYTHONPATH=. 
python src/evaluate_generated_knowledge.py \ 58 | # --gen_path ${GEN_TO_EVALUATE_NAME} 1>log/res.txt 2>log/err.txt 59 | 60 | 61 | -------------------------------------------------------------------------------- /scripts/other/tmp.sh: -------------------------------------------------------------------------------- 1 | # for model in flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 2 | for model in flan-t5-11B 3 | do 4 | split=seen 5 | data=few-shot 6 | head -3865 /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge > /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_ 7 | 8 | # seen + zero-shot 9 | split=seen 10 | data=zero-shot 11 | head -3865 /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter > /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ 12 | 13 | 14 | # unseen + few-shot 15 | split=unseen 16 | data=few-shot 17 | head -3924 /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge > /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_ 18 | 19 | 20 | # unseen + zero-shot 21 | split=unseen 22 | data=zero-shot 23 | head -3924 /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter > /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ 24 | 25 | done -------------------------------------------------------------------------------- /scripts/view_coh_sent.sh: -------------------------------------------------------------------------------- 1 | # for name in nq_DPR nq_random_prompt_flan_xxl nq_zeroshot_prompt4_flan_xxl nq_random_prompt_llama_65b_T100 nq_zeroshot_prompt4_llama_65b_T100 2 | for name in nq_random_prompt_chatgpt_T100 nq_zeroshot_prompt4_chatgpt_T100 3 | 4 | do 5 | 6 | echo $name 7 | res=/misc/kfdata01/kf_grp/lchen/FactualityPrompt/output/${name}/nq_knowledge_avg_sent_ppl 8 | tail -1 $res 9 | echo 10 | echo 11 | 12 | done 13 | 14 | 15 | # for name in wow_DPR zeroshot_prompt2_flan_xxl zeroshot_prompt4_llama_65b random_prompt_flan_xxl random_prompt_llama_65b_T100 16 | for name in random_prompt_chatgpt zeroshot_prompt4_chatgpt_T100 17 | do 18 | 19 | echo $name 20 | # res=/misc/kfdata01/kf_grp/lchen/FactualityPrompt/output_emnlp/${name}/seen_knowledge_avg_sent_ppl 21 | res=/misc/kfdata01/kf_grp/lchen/FactualityPrompt/output/${name}/seen_knowledge_avg_sent_ppl 22 | tail -1 $res 23 | echo 24 | echo 25 | 26 | done 27 | -------------------------------------------------------------------------------- /scripts/view_info.sh: -------------------------------------------------------------------------------- 1 | for name in your_dir 2 | do 3 | 4 | echo $name 5 | exp_name=info_${name} 6 | tail -1 log/log-${exp_name} 7 | echo ' ' 8 | 9 | done -------------------------------------------------------------------------------- /scripts/view_nq_validity.sh: -------------------------------------------------------------------------------- 1 | 2 | 3 | your_log_name_list=(log1 log2 log3) # Replace with actual log names 4 | 5 | for name in "${your_log_name_list[@]}"; do 6 | 7 | 8 | echo $name 9 | tail -2 log/log-${name} 10 | echo 11 | 12 | done -------------------------------------------------------------------------------- /scripts/view_wow_validity.sh: -------------------------------------------------------------------------------- 1 | your_log_name_list=(log1 log2 log3) # Replace with actual log names 2 | 3 | for name in "${your_log_name_list[@]}"; do 4 | 5 | 6 | echo $name 7 | tail -2 log/log-${name}-answer 
8 | echo 9 | 10 | done -------------------------------------------------------------------------------- /scripts/wow_coh_para.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in your_prediction_dir 3 | do 4 | 5 | hyp=${name}/seen_knowledge 6 | 7 | exp_name=ppl_${name} 8 | echo $name 9 | 10 | export CUDA_VISIBLE_DEVICES=0 11 | PYTHONPATH=. python -u src/discourse-coherence.py \ 12 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 13 | 14 | wait 15 | done 16 | -------------------------------------------------------------------------------- /scripts/wow_coh_sent.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in your_prediction_dir 3 | do 4 | 5 | hyp=${name}/seen_knowledge 6 | 7 | exp_name=ppl_${name} 8 | echo $name 9 | 10 | export CUDA_VISIBLE_DEVICES=0 11 | PYTHONPATH=. python -u src/ppl.py \ 12 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 13 | 14 | echo 15 | 16 | wait 17 | done 18 | -------------------------------------------------------------------------------- /scripts/wow_factuality.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # This script evaluates the knowledge predictions for various models and strategies. 3 | 4 | # Set the debug mode (use 'True' to enable debugging). 5 | debug=False 6 | 7 | # Number of evaluations to perform. 8 | eval_num=500 9 | 10 | # Number of retrieved (IR) evidence passages to consider. 11 | IR_num=10 12 | 13 | # Whether to drop the ground-truth knowledge and verify against IR evidence only. 14 | wo_ground_truth_knowledge=False 15 | 16 | # Strategy for aggregating per-sentence NLI labels (max, min, or mean). 17 | outer_strategy=max 18 | 19 | for name in random_prompt_llama_65b_T100 20 | do 21 | 22 | # seen split 23 | ref=./emnlp_data/wow/random_testset/seen_random_testset.txt 24 | hyp=${name}/seen_knowledge 25 | 26 | exp_name=${name}_IR${IR_num}_seen_knowledge_$outer_strategy 27 | echo $exp_name 28 | 29 | export CUDA_VISIBLE_DEVICES=0 30 | PYTHONPATH=. python -u src/eval_exp.py \ 31 | --hyp_path $hyp \ 32 | --ref_path $ref \ 33 | --use_IR_eval \ 34 | --debug $debug \ 35 | --eval_num $eval_num \ 36 | --wo_ground_truth_knowledge $wo_ground_truth_knowledge \ 37 | --retrieved_num $IR_num 1>log/log-${exp_name} 2>&1 38 | 39 | wait 40 | 41 | 42 | # unseen split 43 | ref=./emnlp_data/wow/random_testset/unseen_random_testset.txt 44 | hyp=${name}/unseen_knowledge 45 | 46 | exp_name=${name}_IR${IR_num}_unseen_knowledge 47 | echo $exp_name 48 | 49 | export CUDA_VISIBLE_DEVICES=3 50 | PYTHONPATH=.
python -u src/eval_exp.py \ 51 | --hyp_path $hyp \ 52 | --ref_path $ref \ 53 | --use_IR_eval \ 54 | --debug $debug \ 55 | --eval_num $eval_num \ 56 | --wo_ground_truth_knowledge $wo_ground_truth_knowledge \ 57 | --retrieved_num $IR_num 1>log/log-${exp_name} 2>&1 58 | 59 | wait 60 | 61 | done -------------------------------------------------------------------------------- /scripts/wow_factuality_view.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | your_log_name_list=(log1 log2 log3) # Replace with actual log names 4 | 5 | for name in "${your_log_name_list[@]}"; do 6 | echo "Results for $name:" 7 | tail -2 "log/log-${name}" 8 | echo 9 | done -------------------------------------------------------------------------------- /scripts/wow_info.sh: -------------------------------------------------------------------------------- 1 | for name in your_prediction_dir 2 | do 3 | 4 | hyp=${name}/seen_knowledge 5 | ref=./emnlp_data/wow/random_testset/seen_random_testset.txt 6 | 7 | exp_name=info_${name} 8 | echo $name 9 | 10 | export CUDA_VISIBLE_DEVICES=1 11 | PYTHONPATH=. python -u src/info.py \ 12 | --task wow \ 13 | --ref_path $ref \ 14 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 15 | 16 | echo 17 | 18 | wait 19 | done 20 | -------------------------------------------------------------------------------- /scripts/wow_relevance.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in your_prediction_dir 3 | 4 | do 5 | 6 | ref=emnlp_data/wow/random_testset/seen_random_testset.txt 7 | hyp=${name}/seen_knowledge 8 | 9 | exp_name=relevance_${name} 10 | echo $name 11 | 12 | export CUDA_VISIBLE_DEVICES=0 13 | PYTHONPATH=. python -u src/relevance.py \ 14 | --hyp_path $hyp \ 15 | --ref_path $ref 1>log/log-${exp_name} 2>&1 16 | 17 | echo 18 | 19 | wait 20 | done 21 | -------------------------------------------------------------------------------- /scripts/wow_validity.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Toggle debug mode 4 | debug=False 5 | 6 | # Number of evaluations 7 | eval_num=500 8 | 9 | # List of experiment directories (update with actual directory names) 10 | your_experiment_dir_list=(dir1 dir2 dir3) 11 | 12 | for name in "${your_experiment_dir_list[@]}" 13 | do 14 | 15 | ref=./emnlp_data/wow/random_testset/seen_random_testset.txt 16 | hyp=wow_answer/wow_answer_for_${name}/wow_answer 17 | 18 | exp_name=${name}-answer 19 | echo $exp_name 20 | 21 | export CUDA_VISIBLE_DEVICES=0 22 | PYTHONPATH=. python -u src/wow_validity.py \ 23 | --use_IR_eval \ 24 | --retrieved_num 5 \ 25 | --hyp_path $hyp \ 26 | --ref_path $ref \ 27 | --debug $debug \ 28 | --eval_num $eval_num 1>log/log-${exp_name} 2>&1 29 | 30 | wait 31 | done 32 | -------------------------------------------------------------------------------- /src/claim_handling.py: -------------------------------------------------------------------------------- 1 | import re 2 | from nltk.tokenize import sent_tokenize 3 | 4 | import spacy 5 | # spacy.prefer_gpu() 6 | spacy.load('en_core_web_sm') 7 | nlp = spacy.load("en_core_web_sm") 8 | 9 | import nltk 10 | from nltk.corpus import stopwords 11 | stop_words = set(stopwords.words('english')) 12 | 13 | ''' 14 | Five types of important entities: 15 | Organizations, Personal Names, Events, Products, Artworks 16 | 17 | Two types of non-critical entities: 18 | Cardinal Numbers indicate the quantity of something.
19 | Ordinal Numbers denote the position or rank of something within a sequence. 20 | 21 | IMPORTANT_ENT_TYPE = set(['ORG', 'PERSON', 'WORK_OF_ART', 'PRODUCT', 'EVENT']) 22 | ''' 23 | IMPORTANT_ENT_TYPE = set(['ORG', 'PERSON', 'WORK_OF_ART', 'PRODUCT', 'EVENT']) 24 | REMOVE_ENT_TYPE = set() 25 | 26 | 27 | def obtain_important_ne(gen, include_capitalized_words_as_ents=True): 28 | important_words = [] 29 | doc = nlp(gen) 30 | 31 | ents = [(ent.text, ent.label_) for ent in doc.ents] 32 | 33 | if include_capitalized_words_as_ents and len(ents) == 0: 34 | capitalized_words = re.findall('(?<!^)([A-Z][a-z]+)', gen) 35 | 36 | if len(capitalized_words) > 0: 37 | capitalized_words = [(word, 'CAPITALIZED') for word in capitalized_words if word.lower() not in stop_words] 38 | ents.extend(capitalized_words) 39 | 40 | important_words.extend([ent for ent in ents if ent[1] in IMPORTANT_ENT_TYPE]) 41 | remaining_ne_all = [ent for ent in ents if ent[1] not in IMPORTANT_ENT_TYPE] 42 | 43 | # filter out some ne 44 | remaining_ne = [] 45 | for ent in remaining_ne_all: 46 | if ent[1] in REMOVE_ENT_TYPE: 47 | continue 48 | # if ent[1] == 'DATE' and ("year" in ent[0] or "day" in ent[0]): #not bool(re.search(r'\d', ent[0])): 49 | # if "DATE" entity contains NO number at all (e.g., ``the year''), meaningless 50 | # continue 51 | remaining_ne.append(ent) 52 | 53 | gens_with_ne = { 54 | "gen": gen, 55 | "important_ne": important_words, 56 | "unimportant_ne": remaining_ne, 57 | "subject": set([token.text for token in doc if token.dep_ in ['nsubj', 'nsubjpass']]), 58 | # "all_analysis": [(token.text, token.pos_, token.tag_, token.dep_) for token in doc] 59 | } 60 | 61 | return gens_with_ne 62 | 63 | 64 | def has_incorrect_style(gen_obj): 65 | 66 | # case 1: contains first person -- I, we 67 | if gen_obj['subject'].intersection(set(['i', 'I', 'You', 'you', 'We', 'we'])): 68 | return True 69 | 70 | # case 2: question? 71 | if "?" in gen_obj['gen']: 72 | return True 73 | 74 | return False 75 | 76 | 77 | def obtain_trust_worthy_sents(text, wiki_names): 78 | 79 | wiki_names_txt = " ".join(wiki_names) 80 | 81 | text = text.strip().replace("\n",". ") 82 | sents = sent_tokenize(text) 83 | 84 | sents_with_ne = [obtain_important_ne(sent.strip()) for sent in sents] 85 | 86 | no_fact_gen_cnt, no_fact_gens = 0, [] 87 | checkworthy_gen_cnt, checkworthy_gens = 0, [] 88 | off_topic_gen_cnt, off_topic_gens = 0, [] 89 | 90 | for sent_obj in sents_with_ne: 91 | 92 | # case 1: no facts -- i.e., no NE, incorrect_style, no SUBJECT 93 | if len(sent_obj['important_ne']) + len(sent_obj['unimportant_ne']) == 0 or has_incorrect_style(sent_obj) or len(sent_obj['subject']) == 0: 94 | no_fact_gen_cnt += 1 95 | 96 | # case 2 v1: no off-topic, but contains facts (unimportant_ne) about target-topic 97 | elif len(sent_obj['important_ne']) == 0 and len(sent_obj['unimportant_ne']) > 0: 98 | checkworthy_gen_cnt += 1 99 | checkworthy_gens.append(sent_obj) 100 | 101 | # case 3: tricky scenario. important_ne could be relevant to the target-topic, or could indicate off-topic 102 | else: 103 | 104 | # 1. filter out any extra_ne that is same as wikiname -- e.g., wiki_name = Barack Obama, ne = Obama 105 | extra_ne = [ne[0] for ne in sent_obj['important_ne'] if ne[0] not in wiki_names_txt] 106 | 107 | # 2. check if any of the extra_ne is the "SUBJECT" of the generation 108 | overlap_between_extraNE_and_subj = sent_obj['subject'].intersection(set(" ".join(extra_ne).split(" "))) 109 | 110 | if len(overlap_between_extraNE_and_subj) > 0: # contains off-topic NE!!
111 | off_topic_gen_cnt += 1 112 | else: 113 | checkworthy_gen_cnt += 1 114 | checkworthy_gens.append(sent_obj) 115 | 116 | 117 | return checkworthy_gens -------------------------------------------------------------------------------- /src/discourse-coherence.py: -------------------------------------------------------------------------------- 1 | ''' 2 | pip install sgnlp 3 | pip uninstall nvidia_cublas_cu11 4 | ''' 5 | import numpy as np 6 | from tqdm import tqdm 7 | import json 8 | from sgnlp.models.coherence_momentum import CoherenceMomentumModel, CoherenceMomentumConfig, \ 9 | CoherenceMomentumPreprocessor 10 | 11 | # Load Model 12 | config = CoherenceMomentumConfig.from_pretrained( 13 | "coherence-momentum" 14 | ) 15 | model = CoherenceMomentumModel.from_pretrained( 16 | "coherence-momentum", 17 | config=config 18 | ) 19 | 20 | model.cuda() 21 | 22 | preprocessor = CoherenceMomentumPreprocessor(config.model_size, config.max_len) 23 | 24 | # Example text inputs 25 | text1 = "Companies listed below reported quarterly profit substantially different from the average of analysts ' " \ 26 | "estimates . The companies are followed by at least three analysts , and had a minimum five-cent change in " \ 27 | "actual earnings per share . Estimated and actual results involving losses are omitted . The percent " \ 28 | "difference compares actual profit with the 30-day estimate where at least three analysts have issues " \ 29 | "forecasts in the past 30 days . Otherwise , actual profit is compared with the 300-day estimate . " \ 30 | "Source : Zacks Investment Research" 31 | text2 = "The companies are followed by at least three analysts , and had a minimum five-cent change in actual " \ 32 | "earnings per share . The percent difference compares actual profit with the 30-day estimate where at least " \ 33 | "three analysts have issues forecasts in the past 30 days . Otherwise , actual profit is compared with the " \ 34 | "300-day estimate . Source : Zacks Investment Research. Companies listed below reported quarterly profit " \ 35 | "substantially different from the average of analysts ' estimates . Estimated and actual results involving " \ 36 | "losses are omitted ." 
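# text2 reshuffles the sentences of text1, so the model should generally assign text1
# (the original ordering) the higher coherence score. A single text is scored the same
# way as in calculate_coherence below:
#   inputs = preprocessor([text1])
#   score = model.get_main_score(inputs["tokenized_texts"].cuda()).item()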
37 | 38 | 39 | def args_parser(): 40 | import argparse 41 | parser = argparse.ArgumentParser() 42 | parser.add_argument("--hyp_path", type=str, default='./emnlp_data/nq/random_testset/nq_ref') 43 | args = parser.parse_args() 44 | return args 45 | 46 | def calculate_coherence(sentences): 47 | print('calculate coherence scores...') 48 | scores = [] 49 | for s_list in tqdm(sentences): 50 | inputs = preprocessor([s_list]) 51 | score = model.get_main_score(inputs["tokenized_texts"].cuda()).item() 52 | scores.append(score) 53 | return scores 54 | 55 | def read_hyp(hyp_path): 56 | hyps = [] 57 | with open(hyp_path, 'r') as infile: 58 | for line in infile: 59 | hyps.append(line.strip()) 60 | return hyps 61 | 62 | 63 | if __name__ == '__main__': 64 | args = args_parser() 65 | hyps = read_hyp(args.hyp_path) 66 | assert len(hyps) == 500, len(hyps) 67 | 68 | scores = calculate_coherence(hyps) 69 | assert len(scores) == 500, len(scores) 70 | 71 | with open(args.hyp_path + '_avg_coh_para', 'w') as outfile: 72 | json.dump(scores, outfile) 73 | outfile.write('\n') 74 | outfile.write(f'{max(scores)}\t{min(scores)}') 75 | -------------------------------------------------------------------------------- /src/eval_exp.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | from tqdm import tqdm 3 | from collections import Counter 4 | import copy 5 | import json 6 | import argparse 7 | import random 8 | random.seed(42) 9 | 10 | import numpy as np 11 | from factuality_metric import ner_metric, nli_metric_batch 12 | from src.claim_handling import obtain_important_ne 13 | from tools import WikiSearch 14 | 15 | import logging 16 | logging.basicConfig() 17 | logging.getLogger().setLevel(logging.ERROR) 18 | 19 | def read_hyp(hyp_path): 20 | hyps = [] 21 | with open(hyp_path, 'r') as infile: 22 | for line in infile: 23 | hyps.append(line.strip()) 24 | return hyps 25 | 26 | def read_IR_docs(IR_path): 27 | IR_docs = [] 28 | with open(IR_path, 'r') as infile: 29 | for line in infile: 30 | IR_docs.append(json.loads(line.strip())) 31 | return IR_docs 32 | 33 | def read_ref(ref_path): 34 | doc_refs = [] 35 | if 'json' not in ref_path: # txt: for wow 36 | with open(ref_path, 'r') as infile: 37 | for line in infile: 38 | parts = line.strip().split('\t') 39 | # topic, query, knowledge, response 40 | assert len(parts) == 4, parts 41 | doc_refs.append(parts[2]) 42 | else: # json: for QA dataset 43 | with open(ref_path, 'r') as infile: 44 | data_list = json.load(infile)['data'] 45 | for data in data_list: 46 | doc_refs.append(data['context']) 47 | return doc_refs 48 | 49 | def boolean_string(s): 50 | if s.lower() not in {'false', 'true'}: 51 | raise ValueError('Not a valid boolean string') 52 | return s.lower() == 'true' 53 | 54 | def args_parser(): 55 | parser = argparse.ArgumentParser(description='Process some integers.') 56 | 57 | parser.add_argument('--hyp_path', type=str, default=None, help='path to generations to evaluate') 58 | parser.add_argument('--ref_path', type=str, default=None, help='path to generations to evaluate') 59 | parser.add_argument('--eval_num', type=int, default=-1) 60 | parser.add_argument('--outer_strategy', type=str, default='max', help='max, min, mean') 61 | 62 | parser.add_argument('--use_IR_eval', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 63 | parser.add_argument('--retrieved_num', type=int, default=3) 64 | parser.add_argument('--wo_ground_truth_knowledge', type=boolean_string, 
default='True') 65 | 66 | parser.add_argument('--debug', type=boolean_string) 67 | parser.add_argument('--save_gen_for_analysis', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 68 | 69 | args = parser.parse_args() 70 | return args 71 | 72 | def single_instance_eval(hyp, doc_ref_str, recall_list_, args): 73 | # multiple evidences 74 | hallu_ner_ratio = [] 75 | nli_contradict_prob, nli_entail_prob, nli_neutral_prob, nli_label = [], [], [], [] 76 | 77 | hyp_sents = sent_tokenize(hyp) 78 | doc_ref = sent_tokenize(doc_ref_str) + [doc_ref_str] if doc_ref_str else [] 79 | 80 | retrieve_error = '' 81 | for sent in hyp_sents: 82 | cur_doc_ref = copy.deepcopy(doc_ref) # 83 | recall_list = copy.deepcopy(recall_list_) 84 | 85 | if args.use_IR_eval and args.retrieved_num: 86 | assert recall_list and len(recall_list) >= 10, f"len(recall_list) = {len(recall_list)}" 87 | try: 88 | if not recall_list: # dont do this 89 | recall_list = WikiSearch(sent, args.retrieved_num) # already sentences # raise ConnectTimeout(e, request=request) 90 | assert len(recall_list) == args.retrieved_num, f"len(recall_list) = {len(recall_list)}, args.retrieved_num = {args.retrieved_num}" 91 | else: 92 | recall_list = recall_list[:args.retrieved_num] 93 | except: # need to log 94 | retrieve_error = f"!!!error!!!:{sent}" 95 | else: 96 | recall_list = [] 97 | 98 | # 1. NER 99 | sent_obj_with_ne = obtain_important_ne(sent.strip()) 100 | NE_to_check = sent_obj_with_ne['important_ne'] + sent_obj_with_ne['unimportant_ne'] 101 | if NE_to_check: 102 | correct_ner_ratio = 0 103 | if not args.wo_ground_truth_knowledge: # ref 104 | correct_ner_ratio = ner_metric(NE_to_check, doc_ref_str) # apply directly on wiki and/or google search snippets 105 | for recall_passage in recall_list: 106 | correct_ner_ratio = max(correct_ner_ratio, ner_metric(NE_to_check, recall_passage)) 107 | hallu_ner_ratio.append(1 - correct_ner_ratio) 108 | 109 | # 2. 
NLI: identify the evs that give highest nli entailment score 110 | premise_hypothesis_pairs = [[ev, sent] for ev in cur_doc_ref + recall_list] 111 | if args.wo_ground_truth_knowledge: 112 | premise_hypothesis_pairs = [[ev, sent] for ev in recall_list] 113 | if len(premise_hypothesis_pairs) > 32: 114 | premise_hypothesis_pairs = premise_hypothesis_pairs[:32] 115 | bz = 8 116 | nli_probs, labels = [], [] 117 | for t in range((len(premise_hypothesis_pairs) - 1) // bz + 1): 118 | bz_nli_probs, bz_labels = nli_metric_batch(premise_hypothesis_pairs[t * bz: min((t + 1) * bz, len(premise_hypothesis_pairs))]) 119 | nli_probs.extend(bz_nli_probs) 120 | labels.extend(bz_labels) 121 | assert len(nli_probs) == len(premise_hypothesis_pairs) == len(labels), f"len(nli_probs) = {len(nli_probs)}, len(premise_hypothesis_pairs) = {len(premise_hypothesis_pairs)}, len(labels) = {len(labels)}" 122 | 123 | # [contradiction, neutral, entailment] 124 | entailment_argmax = np.argmax([nli_s[2] for nli_s in nli_probs]) 125 | max_prob = nli_probs[entailment_argmax] 126 | max_label = labels[entailment_argmax] 127 | 128 | nli_contradict_prob.append(max_prob[0]) 129 | nli_neutral_prob.append(max_prob[1]) 130 | nli_entail_prob.append(max_prob[2]) 131 | 132 | nli_label.append(max_label) 133 | 134 | hallu_ner_ratio = np.nanmean(hallu_ner_ratio) 135 | idx = None 136 | if args.outer_strategy == 'max': 137 | idx = nli_label.index(max(nli_label)) 138 | nli_label = max(nli_label) 139 | if args.outer_strategy == 'min': 140 | idx = nli_label.index(min(nli_label)) 141 | nli_label = min(nli_label) 142 | 143 | if args.outer_strategy != 'mean': 144 | nli_contradict_prob = nli_contradict_prob[idx] 145 | nli_neutral_prob = nli_neutral_prob[idx] 146 | nli_entail_prob = nli_entail_prob[idx] 147 | else: # mean 148 | nli_contradict_prob = np.nanmean(nli_contradict_prob) 149 | nli_neutral_prob = np.nanmean(nli_neutral_prob) 150 | nli_entail_prob = np.nanmean(nli_entail_prob) 151 | 152 | eval_result_obj = { 153 | 'claim_to_verify': hyp_sents, 154 | 'doc_ref': doc_ref, 155 | 'recall_list': recall_list, 156 | 'retrieve_error': retrieve_error, 157 | 158 | 'hallu_ner': hallu_ner_ratio, 159 | 'nli-label': nli_label, 160 | 'nli-contr': nli_contradict_prob, 161 | 'nli-entail': nli_entail_prob, 162 | 'nli-neutr': nli_neutral_prob 163 | } 164 | 165 | return eval_result_obj 166 | 167 | def main(args): 168 | 169 | # read hyp, ref, IR_docs 170 | hyps = read_hyp(args.hyp_path) 171 | IR_recalls = read_IR_docs(args.hyp_path + '_IR_docs') 172 | doc_refs = read_ref(args.ref_path) # txt file 173 | assert len(hyps) == len(doc_refs) == len(IR_recalls) == 500, (len(hyps), len(doc_refs), len(IR_recalls)) 174 | 175 | # DEBUG mode! 
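# (debug mode evaluates only the first DEBUG_SAMPLE_SIZE examples, so the pipeline can be
# smoke-tested without running the full 500-example testset)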
176 | if args.debug: 177 | DEBUG_SAMPLE_SIZE = 5 178 | hyps = hyps[:DEBUG_SAMPLE_SIZE] 179 | IR_recalls = IR_recalls[:DEBUG_SAMPLE_SIZE] 180 | doc_refs = doc_refs[:DEBUG_SAMPLE_SIZE] 181 | 182 | final_hallu_ner_score = [] 183 | final_contradict_prob, final_neutral_prob, final_entail_prob, all_nli_labels = [], [], [], [] 184 | all_analysis_list = [] 185 | 186 | for i in tqdm(range(len(hyps))): 187 | hyp, doc_ref, recall_list = hyps[i], doc_refs[i], IR_recalls[i] 188 | 189 | res_obj = single_instance_eval(hyp, doc_ref, recall_list, args) 190 | 191 | final_hallu_ner_score.append(res_obj['hallu_ner']) 192 | final_contradict_prob.append(res_obj['nli-contr']) 193 | final_neutral_prob.append(res_obj['nli-neutr']) 194 | final_entail_prob.append(res_obj['nli-entail']) 195 | all_nli_labels.append(res_obj['nli-label']) 196 | all_analysis_list.append(res_obj) 197 | 198 | # analysis 199 | avg_hallu_ner_ratio = np.nanmean(final_hallu_ner_score) 200 | avg_contradict_prob = np.mean(final_contradict_prob) 201 | avg_neutral_prob = np.mean(final_neutral_prob) 202 | avg_entail_prob = np.mean(final_entail_prob) 203 | 204 | print("\nHallu NER: {:.2f}%".format(avg_hallu_ner_ratio*100)) 205 | print("AVG PROBS: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format(avg_contradict_prob*100, avg_neutral_prob*100, avg_entail_prob*100)) 206 | 207 | nli_contradict_class_ratio, nli_neutral_class_ratio, nli_entail_class_ratio = 0, 0, 0 208 | 209 | if args.outer_strategy == 'mean': 210 | all_nli_labels = [item for sublist in all_nli_labels for item in sublist] 211 | nli_counter = Counter(all_nli_labels) 212 | 213 | nli_contradict_class_ratio=nli_counter[0]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 214 | nli_neutral_class_ratio=nli_counter[1]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 215 | nli_entail_class_ratio=nli_counter[2]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 216 | 217 | print("NLI CLASS %: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format( 218 | nli_contradict_class_ratio*100, 219 | nli_neutral_class_ratio*100, 220 | nli_entail_class_ratio*100 221 | )) 222 | 223 | res_path = args.hyp_path + f'_{args.outer_strategy}_factuality_results.txt' 224 | with open(res_path, 'a') as outfile: 225 | res_obj = { 226 | "avg_hallu_ner_ratio": avg_hallu_ner_ratio, 227 | "nli_contradict_class_ratio": nli_contradict_class_ratio, 228 | "nli_neutral_class_ratio": nli_neutral_class_ratio, 229 | "nli_entail_class_ratio": nli_entail_class_ratio, 230 | } 231 | json.dump(res_obj, outfile) 232 | outfile.write("\n") 233 | 234 | ana_path = args.hyp_path + f'_IR{args.retrieved_num}_{args.outer_strategy}_analysis.txt' 235 | with open(ana_path, 'w') as outfile: 236 | json.dump(all_analysis_list, outfile) 237 | outfile.write("\n") 238 | 239 | # save example NE score 240 | ne_path = args.hyp_path + f'_IR{args.retrieved_num}_{args.outer_strategy}_example_NE.txt' 241 | with open(ne_path, 'w') as outfile: 242 | for ne in final_hallu_ner_score: 243 | outfile.write(str(ne) + '\n') 244 | 245 | # save example NLI score 246 | nli_path = args.hyp_path + f'_IR{args.retrieved_num}_{args.outer_strategy}_example_NLI_entail.txt' 247 | with open(nli_path, 'w') as outfile: 248 | for nli in final_entail_prob: 249 | outfile.write(str(nli) + '\n') 250 | 251 | if __name__ == '__main__': 252 | args = args_parser() 253 | main(args) 254 | -------------------------------------------------------------------------------- /src/helpfulness.py: -------------------------------------------------------------------------------- 1 | import 
torch 2 | import tqdm 3 | import math 4 | import numpy as np 5 | import argparse 6 | import random 7 | from transformers import T5Tokenizer, T5ForConditionalGeneration, LlamaTokenizer, LlamaForCausalLM 8 | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM 9 | 10 | 11 | def read_testfile(testfile): 12 | '''read testset (groud truth knowledge, response)''' 13 | res = [] 14 | with open(testfile, 'r', encoding='utf-8') as r: 15 | for i, line in enumerate(r): 16 | parts = line.strip().split('\t') 17 | assert len(parts) == 4, parts 18 | res.append(parts) 19 | # topic, query, knowledge, response 20 | return res 21 | 22 | def read_knowledge_prompt(prompt_file): 23 | ''' 24 | prompt_file: 25 | {last_utter: ["(last_utter) topic => knowledge", "..", ..]} 26 | ''' 27 | knowledge_prompts = [] 28 | with open(prompt_file, "r") as f: 29 | for i, line in enumerate(f): 30 | line = line.strip() 31 | line_list = eval(line)[:8] 32 | knowledge_prompts.append(line_list) 33 | return knowledge_prompts 34 | 35 | def read_hyp_knowledge(path): 36 | '''read generated knowledge (by dpr or llm)''' 37 | with open(path, 'r', encoding='utf-8') as r: 38 | res = [line.strip() for line in r] 39 | assert len(res) == 500, len(res) 40 | return res 41 | 42 | def load_model(model_name): 43 | if 'flan' in model_name: 44 | # flan-t5-xxl: 11B = 22G, takes 6 min to load model. 1 min if gpus are empty. 45 | assert model_name in ['flan-t5-xxl', 'flan-t5-xl', 'flan-t5-large', 'flan-t5-base', 'flan-t5-small'] 46 | tokenizer = T5Tokenizer.from_pretrained(f"google/{model_name}", local_files_only=True) 47 | model = T5ForConditionalGeneration.from_pretrained(f"google/{model_name}", device_map="balanced_low_0", torch_dtype=torch.float16, local_files_only=True) 48 | elif 'llama' in model_name: 49 | path = '/apdcephfs/share_1594716/chenliang/cache/llama1/65B' 50 | tokenizer = LlamaTokenizer.from_pretrained(path, padding_side='left') # left-padding for decoder-only model 51 | tokenizer.pad_token, tokenizer.bos_id, tokenizer.eos_id = -1, 1, 2 52 | model = LlamaForCausalLM.from_pretrained(path, device_map="balanced_low_0", torch_dtype=torch.float16) 53 | else: 54 | ''' A decoder-only architecture is being used, but right-padding was detected! 
55 | For correct generation results, please set `padding_side='left'` when initializing the tokenizer.''' 56 | tokenizer = AutoTokenizer.from_pretrained(f"facebook/{model_name}", use_fast=False, padding_side='left') 57 | model = AutoModelForCausalLM.from_pretrained(f"facebook/{model_name}", device_map="auto", torch_dtype=torch.float16) 58 | return tokenizer, model 59 | 60 | def compute_ppl(prefix_and_output_text=None, output_text=None, model=None, tokenizer=None, infer_gpu=0): 61 | '''calculate ppl for a single response''' 62 | with torch.no_grad(): 63 | tokd_inputs = tokenizer.encode(prefix_and_output_text, return_tensors="pt") 64 | tokd_inputs = tokd_inputs.to(infer_gpu) 65 | 66 | # if only want to score the "generation" part we need the suffix tokenization length 67 | tokd_suffix = tokenizer.encode(output_text, return_tensors="pt") 68 | 69 | tokd_labels = tokd_inputs.clone().detach() 70 | tokd_labels[:, :tokd_labels.shape[1] - tokd_suffix.shape[1] + 1] = -100 # mask out the prefix 71 | 72 | outputs = model(input_ids=tokd_inputs, labels=tokd_labels) 73 | loss = outputs.loss # avg CE loss all positions (except -100, TODO check that this is working correctly) 74 | ppl = torch.tensor(math.exp(loss)) 75 | 76 | return loss.item(), ppl.item() 77 | 78 | def boolean_string(s): 79 | if s.lower() not in {'false', 'true'}: 80 | raise ValueError('Not a valid boolean string') 81 | return s.lower() == 'true' 82 | 83 | def parse_args(): 84 | 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument('--exp_name', type=str) 87 | parser.add_argument('--task', type=str, default='nq') 88 | 89 | parser.add_argument("--debug", type=boolean_string, default=True) 90 | parser.add_argument("--zero_shot", type=boolean_string, default=False) 91 | 92 | parser.add_argument('--testfile', type=str, default='data/testset.txt') 93 | parser.add_argument('--promptfile', type=str, default='data/testset.txt') 94 | parser.add_argument('--hyp_knowledge', type=str, default='') 95 | 96 | parser.add_argument('--downstream_model', type=str) 97 | parser.add_argument("--knowledge_type", type=str, default='wo_knowledge', help='wo_knowledge, w_ref_knowledge, w_hyp_knowledge, random_knowledge') 98 | 99 | parser.add_argument('--infer_gpu', type=int, default=0) 100 | args = parser.parse_args() 101 | return args 102 | 103 | 104 | if __name__ == '__main__': 105 | args = parse_args() 106 | 107 | testset = read_testfile(args.testfile) 108 | nq_prompt_list = read_knowledge_prompt(args.promptfile) 109 | random_knowledge_list = [random.choice(nq_prompt_list[499 - i]).split('\t')[-2] for i in range(len(nq_prompt_list))] 110 | 111 | if args.hyp_knowledge: 112 | hyp_knowledge_list = read_hyp_knowledge(args.hyp_knowledge) 113 | assert len(hyp_knowledge_list) == len(testset), len(hyp_knowledge_list) 114 | 115 | if args.debug: 116 | testset = testset[:3] 117 | 118 | tokenizer, model = load_model(args.downstream_model) 119 | 120 | loss_list, ppl_list = [], [] 121 | for i in tqdm.tqdm(range(len(testset))): 122 | topic, query, knowledge, response = testset[i] 123 | examples = [e.split('\t') for e in nq_prompt_list[i] if len(e.split('\t')) == 4] 124 | turns = query.split(" [SEP] ") 125 | last_turn = turns[-1].strip() 126 | 127 | ref_knowledge = knowledge.strip() 128 | truncate_len = 500 129 | if len(ref_knowledge.split(' ')) > truncate_len: 130 | print (f'Warning: knowledge {i} length {len(ref_knowledge.split(" "))} exceeds {truncate_len}, truncating to {truncate_len}') 131 | ref_knowledge = ' '.join(ref_knowledge.split(' 
')[:truncate_len]).strip() 132 | if args.hyp_knowledge: 133 | hyp_knowledge = hyp_knowledge_list[i] 134 | 135 | infer_sample = f"Passage:\nQuery: {last_turn.strip()}\nAnswer: " # set to empty passage 136 | if args.knowledge_type == 'w_hyp_knowledge': 137 | infer_sample = f"Passage: {hyp_knowledge.strip()}\nQuery: {last_turn.strip()}\nAnswer: " 138 | elif args.knowledge_type == 'w_ref_knowledge': 139 | infer_sample = f"Passage: {ref_knowledge}\nQuery: {last_turn.strip()}\nAnswer: " 140 | elif args.knowledge_type == 'random_knowledge': 141 | infer_sample = f"Passage: {random_knowledge_list[i].strip()}\nQuery: {last_turn.strip()}\nAnswer: " 142 | 143 | prompt = '' 144 | cur_len = 0 145 | if args.zero_shot: 146 | if args.knowledge_type == 'wo_knowledge': 147 | if args.task == 'nq': 148 | prompt = f'Read the passage and answer the question below:\nPassage: {ref_knowledge}\nQuestion: {last_turn}\nAnswer: ' 149 | elif args.task == 'wow': 150 | prompt = f'Using the knowledge from the passage, complete the dialogue below:\nPassage: {ref_knowledge}\nSpeaker 1: {last_turn}\nSpeaker 2: ' 151 | 152 | elif args.knowledge_type == 'w_ref_knowledge': 153 | if args.task == 'nq': 154 | prompt = f'Read the passage and answer the question below:\nPassage: {ref_knowledge}\nQuestion: {last_turn}\nAnswer: ' 155 | elif args.task == 'wow': 156 | prompt = f'Using the knowledge from the passage, complete the dialogue below:\nPassage: {ref_knowledge}\nSpeaker 1: {last_turn}\nSpeaker 2: ' 157 | elif args.knowledge_type == 'w_hyp_knowledge': 158 | if args.task == 'nq': 159 | prompt = f'Read the passage and answer the question below:\nPassage: {hyp_knowledge}\nQuestion: {last_turn}\nAnswer: ' 160 | elif args.task == 'wow': 161 | prompt = f'Using the knowledge from the passage, complete the dialogue below:\nPassage: {hyp_knowledge}\nSpeaker 1: {last_turn}\nSpeaker 2: ' 162 | else: 163 | raise NotImplementedError(args.knowledge_type) 164 | else: 165 | for example in examples: 166 | p_topic, p_turns, p_knowledge, p_response = [e.strip() for e in example] 167 | if p_knowledge.startswith(p_topic): 168 | p_knowledge = p_knowledge[len(p_topic):] 169 | 170 | demonstration = f"Passage: {p_knowledge.strip()}\nQuery: {p_turns.split(' [SEP] ')[-1].strip()}\nAnswer: {p_response.strip()}" 171 | 172 | if cur_len < 1800 - len(infer_sample.split(' ')): 173 | prompt += demonstration + '\n\n' 174 | cur_len += len(demonstration.split(' ')) 175 | 176 | prompt += infer_sample 177 | 178 | prefix_and_output_text = prompt + response 179 | output_text = response 180 | loss, ppl = compute_ppl(prefix_and_output_text, output_text, model, tokenizer, args.infer_gpu) 181 | loss_list.append(loss) 182 | ppl_list.append(ppl) 183 | 184 | if args.debug: 185 | print (prefix_and_output_text) 186 | print (loss, ppl) 187 | 188 | with open(f'helpfulness_results/{args.exp_name}.txt', 'w') as f: 189 | f.write(str(loss_list).strip() + '\n') 190 | f.write(str(ppl_list).strip() + '\n') 191 | 192 | f.write(f'loss: {np.mean(loss_list)}\t{np.std(loss_list)}\t{np.var(loss_list)}\n') 193 | f.write(f'ppl: {np.mean(ppl_list)}\t{np.std(ppl_list)}\t{np.var(ppl_list)}\n') 194 | -------------------------------------------------------------------------------- /src/info.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import math 3 | from tqdm import tqdm 4 | import numpy as np 5 | import json 6 | from transformers import GPT2LMHeadModel, GPT2Tokenizer 7 | from transformers import AutoTokenizer, AutoModelForCausalLM 8 | 9 | 
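# The informativeness scorer below loads GPT-neo-2.7B from a local directory name; this
# presumably mirrors the EleutherAI/gpt-neo-2.7B checkpoint on Hugging Face, i.e. something like
#   tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neo-2.7B')
# would be the equivalent remote load. For each example, calculate_info_per_example computes
# the mean token cross-entropy of the generated knowledge conditioned on the instruction
# prompt and reports info = 1 - exp(-loss), i.e. one minus the geometric-mean token
# probability, so text that is harder for the LM to predict scores closer to 1.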
tokenizer = AutoTokenizer.from_pretrained('gpt-neo-2.7B') 10 | model = AutoModelForCausalLM.from_pretrained('gpt-neo-2.7B').half() 11 | device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') 12 | model.to(device) 13 | tokenizer.pad_token = tokenizer.eos_token 14 | 15 | def read_testfile(testfile): 16 | '''read testset from wow''' 17 | res = [] 18 | with open(testfile, 'r', encoding='utf-8') as r: 19 | for i, line in enumerate(r): 20 | parts = line.strip().split('\t') 21 | assert len(parts) == 4, parts 22 | res.append(parts) 23 | return res 24 | 25 | def read_hyp(hyp_path): 26 | hyps = [] 27 | with open(hyp_path, 'r') as infile: 28 | for line in infile: 29 | hyps.append(line.strip()) 30 | return hyps 31 | 32 | def calculate_info_per_example(hyps_knowledge, queries, topics, args): 33 | info_seq = [] 34 | for hyp, query, topic in tqdm(zip(hyps_knowledge, queries, topics)): 35 | hyp = ' '.join(hyp.split()[:300]) 36 | if args.task == 'nq' and query.strip()[-1] != '?': 37 | query = query.strip() + '?' 38 | instruction = f"Generate a Wikipedia to answer the given question.\nTopic: {topic.strip()}.\nQuestion: {query.strip()}\nWikipedia: " 39 | example = instruction + hyp 40 | inputs = tokenizer(example, return_tensors='pt', truncation=True).data 41 | prefix = tokenizer(instruction, return_tensors='pt', truncation=True).data 42 | for k, v in inputs.items(): 43 | inputs[k] = v.to(device) 44 | for k, v in prefix.items(): 45 | prefix[k] = v.to(device) 46 | output = model(**inputs, labels=inputs['input_ids']) 47 | logits = output.logits 48 | labels=inputs['input_ids'] 49 | logits = logits[:, prefix['input_ids'].shape[-1]:, :] 50 | labels = labels[:, prefix['input_ids'].shape[-1]:] 51 | assert logits.shape[1] == labels.shape[1], (logits.shape, labels.shape) 52 | shift_logits = logits[..., :-1, :].contiguous() 53 | shift_labels = labels[..., 1:].contiguous() 54 | loss_fct = torch.nn.CrossEntropyLoss(reduction='mean') 55 | loss1 = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) 56 | info = 1 - torch.exp(-loss1) 57 | info = info.item() 58 | info_seq.append(info) 59 | 60 | return info_seq 61 | 62 | def args_parser(): 63 | import argparse 64 | parser = argparse.ArgumentParser() 65 | parser.add_argument("--task", type=str, default='nq') 66 | parser.add_argument("--hyp_path", type=str) 67 | parser.add_argument("--ref_path", type=str, default='./emnlp_data/nq/random_testset/nq_test_random_testset.txt') 68 | args = parser.parse_args() 69 | return args 70 | 71 | if __name__ == '__main__': 72 | args = args_parser() 73 | hyps = read_hyp(args.hyp_path) 74 | testset = read_testfile(args.ref_path) 75 | queries = [t[1].strip() for t in testset] 76 | topics = [t[0].strip() for t in testset] 77 | assert len(hyps) == len(testset) == len(queries) == 500, (len(hyps), len(testset)) 78 | 79 | info_list = calculate_info_per_example(hyps, queries, topics, args) 80 | assert len(info_list) == 500, len(info_list) 81 | 82 | print ('mean info = ', np.nanmean(info_list)) 83 | 84 | with open(args.hyp_path + '_info', 'w') as outfile: 85 | for info in info_list: 86 | outfile.write(str(info) + '\n') -------------------------------------------------------------------------------- /src/nq_validity.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | from tqdm import tqdm 3 | from collections import Counter 4 | import copy 5 | import json 6 | import argparse 7 | import random 8 | random.seed(42) 9 | 10 | 
import numpy as np 11 | from factuality_metric import ner_metric, nli_metric_batch 12 | from src.claim_handling import obtain_important_ne 13 | from tools import WikiSearch 14 | 15 | import logging 16 | logging.basicConfig() 17 | logging.getLogger().setLevel(logging.ERROR) 18 | 19 | def read_hyp(hyp_path): 20 | hyps = [] 21 | with open(hyp_path, 'r') as infile: 22 | for line in infile: 23 | hyps.append(line.strip()) 24 | return hyps 25 | 26 | def read_testfile(testfile): 27 | '''read testset from wow''' 28 | res = [] 29 | with open(testfile, 'r', encoding='utf-8') as r: 30 | for i, line in enumerate(r): 31 | parts = line.strip().split('\t') 32 | assert len(parts) == 4, parts 33 | res.append(parts) 34 | return res 35 | 36 | def boolean_string(s): 37 | if s.lower() not in {'false', 'true'}: 38 | raise ValueError('Not a valid boolean string') 39 | return s.lower() == 'true' 40 | 41 | def args_parser(): 42 | parser = argparse.ArgumentParser(description='Process some integers.') 43 | 44 | parser.add_argument('--hyp_path', type=str, default=None, help='path to generations to evaluate') 45 | parser.add_argument('--ref_path', type=str, default=None, help='path to generations to evaluate') 46 | parser.add_argument('--eval_num', type=int, default=-1) 47 | 48 | parser.add_argument('--use_IR_eval', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 49 | parser.add_argument('--retrieved_num', type=int, default=3) 50 | parser.add_argument('--wo_ground_truth_knowledge', type=boolean_string, default='False') 51 | 52 | parser.add_argument('--debug', type=boolean_string) 53 | parser.add_argument('--save_gen_for_analysis', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 54 | 55 | args = parser.parse_args() 56 | return args 57 | 58 | def single_instance_eval(hyp, query, answer, args): 59 | # multiple evidences 60 | hallu_ner_ratio = [] 61 | nli_contradict_prob, nli_entail_prob, nli_neutral_prob, nli_label = [], [], [], [] 62 | 63 | # NLI: identify the evs that give highest nli entailment score 64 | if query.strip()[-1] != '?': 65 | query = query.strip() + '?' 
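# Validity is cast as an NLI problem: the premise pairs the question with the reference
# answer and the hypothesis pairs the same question with the generated answer, so a high
# entailment probability means the generated answer is supported by the reference.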
66 | premise = query + '\t' + answer 67 | hypothesis = query + '\t' + hyp 68 | 69 | premise_hypothesis_pairs = [[premise, hypothesis]] 70 | nli_probs, labels = [], [] 71 | bz_nli_probs, bz_labels = nli_metric_batch(premise_hypothesis_pairs) 72 | nli_probs.extend(bz_nli_probs) 73 | labels.extend(bz_labels) 74 | assert len(nli_probs) == len(premise_hypothesis_pairs) == len(labels), f"len(nli_probs) = {len(nli_probs)}, len(premise_hypothesis_pairs) = {len(premise_hypothesis_pairs)}, len(labels) = {len(labels)}" 75 | 76 | # [contradiction, neutral, entailment] 77 | entailment_argmax = np.argmax([nli_s[2] for nli_s in nli_probs]) 78 | max_prob = nli_probs[entailment_argmax] 79 | max_label = labels[entailment_argmax] 80 | 81 | nli_contradict_prob.append(max_prob[0]) 82 | nli_neutral_prob.append(max_prob[1]) 83 | nli_entail_prob.append(max_prob[2]) 84 | 85 | nli_label.append(max_label) 86 | 87 | hallu_ner_ratio = np.nanmean(hallu_ner_ratio) 88 | idx = nli_label.index(max(nli_label)) 89 | nli_label = max(nli_label) 90 | nli_contradict_prob = nli_contradict_prob[idx] 91 | nli_neutral_prob = nli_neutral_prob[idx] 92 | nli_entail_prob = nli_entail_prob[idx] 93 | 94 | eval_result_obj = { 95 | 'premise': premise, 96 | 'hypothesis': hypothesis, 97 | 98 | 'nli-label': nli_label, 99 | 'nli-contr': nli_contradict_prob, 100 | 'nli-entail': nli_entail_prob, 101 | 'nli-neutr': nli_neutral_prob 102 | } 103 | 104 | return eval_result_obj 105 | 106 | def main(args): 107 | 108 | # read hyp, ref, IR_docs 109 | hyps = read_hyp(args.hyp_path) 110 | testset = read_testfile(args.ref_path) 111 | assert len(hyps) == len(testset) == 500, (len(hyps), len(testset)) 112 | 113 | # DEBUG mode! 114 | if args.debug: 115 | DEBUG_SAMPLE_SIZE = 10 116 | hyps = hyps[:DEBUG_SAMPLE_SIZE] 117 | testset = testset[:DEBUG_SAMPLE_SIZE] 118 | 119 | final_contradict_prob, final_neutral_prob, final_entail_prob, all_nli_labels = [], [], [], [] 120 | all_analysis_list = [] 121 | 122 | for i in tqdm(range(len(hyps))): 123 | hyp, example = hyps[i], testset[i] 124 | query, answer = example[1], example[3] 125 | 126 | res_obj = single_instance_eval(hyp, query, answer, args) 127 | 128 | if args.debug: 129 | print ('==' * 20) 130 | print (res_obj) 131 | 132 | final_contradict_prob.append(res_obj['nli-contr']) 133 | final_neutral_prob.append(res_obj['nli-neutr']) 134 | final_entail_prob.append(res_obj['nli-entail']) 135 | all_nli_labels.append(res_obj['nli-label']) 136 | all_analysis_list.append(res_obj) 137 | 138 | # analysis 139 | avg_contradict_prob = np.mean(final_contradict_prob) 140 | avg_neutral_prob = np.mean(final_neutral_prob) 141 | avg_entail_prob = np.mean(final_entail_prob) 142 | 143 | print("AVG PROBS: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format(avg_contradict_prob*100, avg_neutral_prob*100, avg_entail_prob*100)) 144 | 145 | nli_contradict_class_ratio, nli_neutral_class_ratio, nli_entail_class_ratio = 0, 0, 0 146 | 147 | nli_counter = Counter(all_nli_labels) 148 | 149 | nli_contradict_class_ratio=nli_counter[0]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 150 | nli_neutral_class_ratio=nli_counter[1]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 151 | nli_entail_class_ratio=nli_counter[2]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 152 | 153 | print("NLI CLASS %: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format( 154 | nli_contradict_class_ratio*100, 155 | nli_neutral_class_ratio*100, 156 | nli_entail_class_ratio*100 157 | )) 158 | 159 | res_path = args.hyp_path + '_factuality_results.txt' 160 | with 
open(res_path, 'a') as outfile: 161 | res_obj = { 162 | 'Contradict_probs': avg_contradict_prob, 163 | 'Neutral_probs': avg_neutral_prob, 164 | 'Entail_probs': avg_entail_prob, 165 | "nli_contradict_class_ratio": nli_contradict_class_ratio, 166 | "nli_neutral_class_ratio": nli_neutral_class_ratio, 167 | "nli_entail_class_ratio": nli_entail_class_ratio, 168 | } 169 | json.dump(res_obj, outfile) 170 | outfile.write("\n") 171 | 172 | ana_path = args.hyp_path + '_analysis.txt' 173 | with open(ana_path, 'a') as outfile: 174 | json.dump(all_analysis_list, outfile) 175 | outfile.write("\n") 176 | 177 | if __name__ == '__main__': 178 | args = args_parser() 179 | main(args) 180 | -------------------------------------------------------------------------------- /src/ppl.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import math 3 | from tqdm import tqdm 4 | from nltk.tokenize import sent_tokenize 5 | import numpy as np 6 | import json 7 | from transformers import AutoTokenizer, AutoModelForCausalLM 8 | 9 | tokenizer = AutoTokenizer.from_pretrained('gpt-neo-2.7B') 10 | model = AutoModelForCausalLM.from_pretrained('gpt-neo-2.7B') 11 | device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') 12 | model.to(device) 13 | tokenizer.pad_token = tokenizer.eos_token 14 | 15 | def calculate_ppls(sentences, device): 16 | print('calculate PPL scores...') 17 | ppls = [] 18 | for s_list in tqdm(sentences): 19 | cur_list = [] 20 | for r in s_list: 21 | inputs = tokenizer(r, return_tensors='pt', truncation=True, max_length=500).data 22 | for k, v in inputs.items(): 23 | inputs[k] = v.to(device) 24 | output = model(**inputs, labels=inputs['input_ids']) 25 | loss = output[0] 26 | 27 | # testing 28 | logits = output.logits 29 | labels=inputs['input_ids'] 30 | shift_logits = logits[..., :-1, :].contiguous() 31 | shift_labels = labels[..., 1:].contiguous() 32 | loss_fct = torch.nn.CrossEntropyLoss(reduction='mean') 33 | loss1 = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) 34 | print ('loss1 = ', loss1) 35 | 36 | cur_list.append(min(math.exp(loss.item()), 200)) # sentence-level PPL 37 | ppls.append(sum(cur_list) / len(cur_list)) # example-level PPL 38 | return ppls 39 | 40 | def read_hyp(hyp_path): 41 | hyps = [] 42 | with open(hyp_path, 'r') as infile: 43 | for line in infile: 44 | hyps.append(line.strip()) 45 | return hyps 46 | 47 | def args_parser(): 48 | import argparse 49 | parser = argparse.ArgumentParser() 50 | parser.add_argument("--hyp_path", type=str, default='./emnlp_data/nq/random_testset/nq_ref') 51 | args = parser.parse_args() 52 | return args 53 | 54 | if __name__ == '__main__': 55 | args = args_parser() 56 | hyps = read_hyp(args.hyp_path) 57 | sentences = [] 58 | for hyp in hyps: 59 | hyp_sents = sent_tokenize(hyp) 60 | sentences.append(hyp_sents) 61 | assert len(sentences) == 500, len(sentences) 62 | ppls = calculate_ppls(sentences, device) 63 | assert len(ppls) == 500, len(ppls) 64 | 65 | inverse_ppls = [1 / p for p in ppls] 66 | coh_sent = np.nanmean(inverse_ppls) 67 | 68 | with open(args.hyp_path + '_avg_sent_ppl', 'w') as outfile: 69 | json.dump(ppls, outfile) 70 | outfile.write('\n') 71 | json.dump(coh_sent, outfile) 72 | -------------------------------------------------------------------------------- /src/relevance.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import tqdm 3 | import argparse 4 | from transformers import 
AutoTokenizer, BertForSequenceClassification 5 | from transformers.data.processors.utils import InputExample 6 | from transformers import glue_convert_examples_to_features as convert_examples_to_features 7 | from torch.utils.data import DataLoader, TensorDataset 8 | import json 9 | 10 | # env: conda activate D3 11 | def load_model(path): 12 | tokenizer = AutoTokenizer.from_pretrained(path) 13 | model = BertForSequenceClassification.from_pretrained(path) 14 | device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') 15 | model.to(device) 16 | # model.half() 17 | model.eval() 18 | return tokenizer, model, device 19 | 20 | def read_testfile(ref_path): 21 | testset = [] 22 | with open(ref_path, 'r') as infile: 23 | for line in infile: 24 | parts = line.strip().split('\t') 25 | # topic, query, knowledge, response 26 | assert len(parts) == 4, parts 27 | testset.append(parts) 28 | return testset 29 | 30 | def read_hyp(hyp_path): 31 | hyps = [] 32 | with open(hyp_path, 'r') as infile: 33 | for line in infile: 34 | hyps.append(line.strip()) 35 | return hyps 36 | 37 | def get_dataloader(input_examples, tokenizer, device, batch_size=256): 38 | features = convert_examples_to_features( 39 | input_examples, 40 | tokenizer, 41 | label_list=['0', '1'], 42 | max_length=512, 43 | output_mode='classification', 44 | ) 45 | all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long).to(device) 46 | all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long).to(device) 47 | token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long).to(device) 48 | dataset = TensorDataset(all_input_ids, token_type_ids, all_attention_mask) 49 | dataloader = DataLoader(dataset, batch_size=batch_size) 50 | return dataloader 51 | 52 | def load_data(ref_path, hyp_path, tokenizer, device, batch_size=256): 53 | testset = read_testfile(ref_path) 54 | hyps = read_hyp(hyp_path) 55 | assert len(testset) == len(hyps), (len(testset), len(hyps)) 56 | # examples = [InputExample(str(i), testset[i][1], hyps[i], '0') for i in range(len(testset))] 57 | examples = [InputExample(str(i), testset[i][1], hyps[i], '0') for i in range(len(testset)) if hyps[i].strip()] 58 | test_dataloader = get_dataloader(examples, tokenizer, device, batch_size=batch_size) 59 | return test_dataloader, examples 60 | 61 | def batch_inference(model, dataloader): 62 | all_logits = None 63 | with torch.no_grad(): 64 | # for batch in tqdm.tqdm(dataloader): 65 | for batch in dataloader: 66 | inputs = {"input_ids": batch[0], "token_type_ids": batch[1], "attention_mask": batch[2]} 67 | outputs = model(**inputs) 68 | if all_logits is None: 69 | all_logits = outputs[0].cpu().detach() 70 | else: # [n, 2], concatenate each batch's logits along the first dimension 71 | all_logits = torch.cat((all_logits, outputs[0].cpu().detach()), dim=0) 72 | results = torch.argmax(all_logits, dim=1) # [n] 73 | probs = torch.nn.functional.softmax(all_logits, dim=-1) 74 | # return results, probs[torch.arange(probs.size(0)), results] 75 | return results, probs[:, 1] 76 | 77 | def args_parser(): 78 | parser = argparse.ArgumentParser() 79 | parser.add_argument("--ref_path", type=str, default='./emnlp_data/nq/random_testset/nq_test_random_testset.txt') 80 | parser.add_argument("--hyp_path", type=str, default='./emnlp_data/nq/random_testset/nq_ref') 81 | parser.add_argument("--model_path", type=str,
default='/misc/kfdata01/kf_grp/lchen/cache/monobert-large-msmarco') 82 | parser.add_argument("--batch_size", type=int, default=256) 83 | args = parser.parse_args() 84 | return args 85 | 86 | if __name__ == '__main__': 87 | 88 | args = args_parser() 89 | 90 | # 1. load model 91 | tokenizer, model, device = load_model(args.model_path) 92 | 93 | # 2. load data 94 | test_dataloader, examples = load_data(args.ref_path, args.hyp_path, tokenizer, device, batch_size=args.batch_size) 95 | 96 | # 3. inference 97 | results, probs = batch_inference(model, test_dataloader) 98 | # print (results, probs) 99 | # probs = torch.tensor([p for p in probs if p > 0.01]) 100 | print (torch.sum(results), torch.mean(probs)) 101 | 102 | with open(args.hyp_path + '_rel', 'w') as w: 103 | # w.write(json.dumps()) 104 | for prob in probs: 105 | w.write(str(prob.item()) + '\n') 106 | 107 | # 4. print some examples 108 | # res = results.cpu().tolist() 109 | # for i in range(500): 110 | # idx = res[i] 111 | # if idx == 0: 112 | # print ('=='*20) 113 | # print (examples[i].text_a + '\n') 114 | # print (examples[i].text_b) 115 | 116 | -------------------------------------------------------------------------------- /src/tools.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import calendar 3 | import wolframalpha 4 | import datetime 5 | from transformers import AutoModelForSeq2SeqLM, AutoTokenizer 6 | from operator import pow, truediv, mul, add, sub 7 | 8 | 9 | ''' 10 | Calendar 11 | 12 | Uses Python's datetime and calendar libraries to retrieve the current date. 13 | 14 | input - None 15 | 16 | output - A string, the current date. 17 | ''' 18 | def Calendar(): 19 | now = datetime.datetime.now() 20 | return f'Today is {calendar.day_name[now.weekday()]}, {calendar.month_name[now.month]} {now.day}, {now.year}.' 21 | 22 | 23 | ''' 24 | Wikipedia Search 25 | 26 | Uses ColBERTv2 to retrieve Wikipedia documents. 27 | 28 | input_query - A string, the input query (e.g. "what is a dog?") 29 | k - The number of documents to retrieve 30 | 31 | output - A list of strings, each string is a Wikipedia document 32 | 33 | Adapted from Stanford's DSP: https://github.com/stanfordnlp/dsp/ 34 | Also see: https://github.com/lucabeetz/dsp 35 | ''' 36 | class ColBERTv2: 37 | def __init__(self, url: str): 38 | self.url = url 39 | 40 | def __call__(self, query, k=10): 41 | topk = colbertv2_get_request(self.url, query, k) 42 | 43 | topk = [doc['text'] for doc in topk] 44 | return topk 45 | 46 | def colbertv2_get_request(url: str, query: str, k: int): 47 | payload = {'query': query, 'k': k} 48 | res = requests.get(url, params=payload) 49 | 50 | topk = res.json()['topk'][:k] 51 | return topk 52 | 53 | def WikiSearch(input_query: str, k=3): 54 | # k = 10 55 | # k = 3 56 | retrieval_model = ColBERTv2('http://ec2-44-228-128-229.us-west-2.compute.amazonaws.com:8893/api/search') 57 | output = retrieval_model(input_query, k) 58 | return output 59 | 60 | 61 | ''' 62 | Machine Translation - NLLB-600M 63 | 64 | Uses HuggingFace's transformers library to translate input query to English. 65 | 66 | input_query - A string, the input query (e.g. "what is a dog?") 67 | 68 | output - A string, the translated input query. 
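Example (illustrative): MT("¿Qué es un perro?") should return something like "What is a dog?".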
69 | ''' 70 | def MT(input_query: str): 71 | model_name = "facebook/nllb-200-distilled-600M" 72 | tokenizer = AutoTokenizer.from_pretrained(model_name) 73 | model = AutoModelForSeq2SeqLM.from_pretrained(model_name) 74 | input_ids = tokenizer(input_query, return_tensors='pt') 75 | outputs = model.generate( 76 | **input_ids, 77 | forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"], 78 | ) 79 | output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0] 80 | return output 81 | 82 | 83 | ''' 84 | Calculator 85 | 86 | Calculates the result of a mathematical expression. 87 | 88 | input_query - A string, the input query (e.g. "400/1400") 89 | 90 | output - A float, the result of the calculation 91 | 92 | Adapted from: https://levelup.gitconnected.com/3-ways-to-write-a-calculator-in-python-61642f2e4a9a 93 | ''' 94 | def Calculator(input_query: str): 95 | operators = { 96 | '+': add, 97 | '-': sub, 98 | '*': mul, 99 | '/': truediv 100 | } 101 | if input_query.isdigit(): 102 | return float(input_query) 103 | for c in operators.keys(): 104 | left, operator, right = input_query.partition(c) 105 | if operator in operators: 106 | return round(operators[operator](Calculator(left), Calculator(right)), 2) 107 | 108 | 109 | 110 | ''' 111 | Wolfram Alpha Calculator 112 | 113 | pip install wolframalpha 114 | 115 | Uses Wolfram Alpha API to calculate input query. 116 | 117 | input_query - A string, the input query (e.g. "what is 2 + 2?") 118 | 119 | output - A string, the answer to the input query 120 | 121 | wolfarm_alpha_appid - your Wolfram Alpha API key 122 | ''' 123 | def WolframAlphaCalculator(input_query: str): 124 | wolfram_alpha_appid = 'YOUR_WOLFRAM_ALPHA_APPID' 125 | wolfram_client = wolframalpha.Client(wolfram_alpha_appid) 126 | res = wolfram_client.query(input_query) 127 | assumption = next(res.pods).text 128 | answer = next(res.results).text 129 | return f'Assumption: {assumption} \nAnswer: {answer}' 130 | 131 | 132 | ''' 133 | Google Search 134 | 135 | Uses Google's Custom Search API to retrieve Google Search results. 136 | 137 | input_query - The query to search for. 138 | num_results - The number of results to return. 139 | api_key - Your Google API key. 140 | cse_id - Your Google Custom Search Engine ID. 141 | 142 | output - A list of dictionaries, each dictionary is a Google Search result 143 | ''' 144 | def custom_search(query, api_key, cse_id, **kwargs): 145 | service = build("customsearch", "v1", developerKey=api_key) 146 | res = service.cse().list(q=query, cx=cse_id, **kwargs).execute() 147 | return res['items'] 148 | 149 | def google_search(input_query: str): 150 | api_key = "YOUR_GOOGLE_API_KEY" 151 | cse_id = 'YOUR_GOOGLE_CSE_ID' 152 | num_results = 10 153 | metadata_results = [] 154 | results = custom_search(input_query, num=num_results, api_key=api_key, cse_id=cse_id) 155 | for result in results: 156 | metadata_result = { 157 | "snippet": result["snippet"], 158 | "title": result["title"], 159 | "link": result["link"], 160 | } 161 | metadata_results.append(metadata_result) 162 | return metadata_results 163 | 164 | 165 | ''' 166 | Bing Search 167 | 168 | Uses Bing's Custom Search API to retrieve Bing Search results. 169 | 170 | input_query: The query to search for. 171 | bing_subscription_key: Your Bing API key. 172 | num_results: The number of results to return. 
173 | 174 | output: A list of dictionaries, each dictionary is a Bing Search result 175 | ''' 176 | def _bing_search_results(search_term: str, bing_subscription_key: str, count: int): 177 | headers = {"Ocp-Apim-Subscription-Key": bing_subscription_key} 178 | params = { 179 | "q": search_term, 180 | "count": count, 181 | "textDecorations": True, 182 | "textFormat": "HTML", 183 | } 184 | response = requests.get( 185 | # "https://api.bing.microsoft.com/v7.0/search", headers=headers, params=params 186 | "https://api.bing.microsoft.com/", headers=headers, params=params 187 | ) 188 | response.raise_for_status() 189 | search_results = response.json() 190 | return search_results["webPages"]["value"] 191 | 192 | def bing_search(input_query: str): 193 | bing_subscription_key = "" 194 | num_results = 10 195 | metadata_results = [] 196 | results = _bing_search_results(input_query, bing_subscription_key, count=num_results) 197 | for result in results: 198 | metadata_result = { 199 | "snippet": result["snippet"], 200 | "title": result["name"], 201 | "link": result["url"], 202 | } 203 | metadata_results.append(metadata_result) 204 | return metadata_results 205 | 206 | 207 | # if __name__ == '__main__': 208 | # print(google_search('What is a dog?')) 209 | # Outputs a list of dictionaries, each dictionary is a Google Search result 210 | 211 | # print(bing_search('What is a dog?')) 212 | # Outputs a list of dictionaries, each dictionary is a Bing Search result -------------------------------------------------------------------------------- /src/wow_validity.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | from tqdm import tqdm 3 | from collections import Counter 4 | import copy 5 | import json 6 | import argparse 7 | import random 8 | random.seed(42) 9 | 10 | import numpy as np 11 | from factuality_metric import ner_metric, nli_metric_batch 12 | from src.claim_handling import obtain_important_ne 13 | from tools import WikiSearch 14 | 15 | import logging 16 | logging.basicConfig() 17 | logging.getLogger().setLevel(logging.ERROR) 18 | 19 | def read_hyp(hyp_path): 20 | hyps = [] 21 | with open(hyp_path, 'r') as infile: 22 | for line in infile: 23 | hyps.append(line.strip()) 24 | return hyps 25 | 26 | def read_IR_docs(IR_path): 27 | IR_docs = [] 28 | with open(IR_path, 'r') as infile: 29 | for line in infile: 30 | IR_docs.append(json.loads(line.strip())) 31 | return IR_docs 32 | 33 | def read_testfile(testfile): 34 | '''read testset from wow''' 35 | res = [] 36 | with open(testfile, 'r', encoding='utf-8') as r: 37 | for i, line in enumerate(r): 38 | parts = line.strip().split('\t') 39 | assert len(parts) == 4, parts 40 | res.append(parts) 41 | # topic, query, knowledge, response 42 | return res 43 | 44 | def boolean_string(s): 45 | if s.lower() not in {'false', 'true'}: 46 | raise ValueError('Not a valid boolean string') 47 | return s.lower() == 'true' 48 | 49 | def args_parser(): 50 | parser = argparse.ArgumentParser(description='Process some integers.') 51 | 52 | parser.add_argument('--hyp_path', type=str, default=None, help='path to generations to evaluate') 53 | parser.add_argument('--ref_path', type=str, default=None, help='path to generations to evaluate') 54 | parser.add_argument('--eval_num', type=int, default=-1) 55 | 56 | parser.add_argument('--use_IR_eval', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 57 | parser.add_argument('--retrieved_num', type=int, default=3) 58 | 
parser.add_argument('--wo_ground_truth_knowledge', type=boolean_string, default='False') 59 | 60 | parser.add_argument('--debug', type=boolean_string, default='False') 61 | parser.add_argument('--save_gen_for_analysis', action='store_true', help='Flag for saving some lm-gens with their metrics for analysis') 62 | 63 | args = parser.parse_args() 64 | return args 65 | 66 | def single_instance_eval(hyp, response, recall_list, args): 67 | # multiple pieces of evidence 68 | nli_contradict_prob, nli_entail_prob, nli_neutral_prob, nli_label = [], [], [], [] 69 | 70 | if args.use_IR_eval and args.retrieved_num: 71 | assert recall_list and len(recall_list) >= 10, f"len(recall_list) = {len(recall_list)}" 72 | recall_list = recall_list[:args.retrieved_num] 73 | 74 | # NLI: identify the evidence that gives the highest NLI entailment score 75 | premise_hypothesis_pairs = [[ev, hyp] for ev in [response] + recall_list] 76 | if len(premise_hypothesis_pairs) > 32: 77 | premise_hypothesis_pairs = premise_hypothesis_pairs[:32] 78 | bz = 8 79 | nli_probs, labels = [], [] 80 | for t in range((len(premise_hypothesis_pairs) - 1) // bz + 1): 81 | bz_nli_probs, bz_labels = nli_metric_batch(premise_hypothesis_pairs[t * bz: min((t + 1) * bz, len(premise_hypothesis_pairs))]) 82 | nli_probs.extend(bz_nli_probs) 83 | labels.extend(bz_labels) 84 | assert len(nli_probs) == len(premise_hypothesis_pairs) == len(labels), f"len(nli_probs) = {len(nli_probs)}, len(premise_hypothesis_pairs) = {len(premise_hypothesis_pairs)}, len(labels) = {len(labels)}" 85 | 86 | # [contradiction, neutral, entailment] 87 | entailment_argmax = np.argmax([nli_s[2] for nli_s in nli_probs]) 88 | max_prob = nli_probs[entailment_argmax] 89 | max_label = labels[entailment_argmax] 90 | 91 | nli_contradict_prob.append(max_prob[0]) 92 | nli_neutral_prob.append(max_prob[1]) 93 | nli_entail_prob.append(max_prob[2]) 94 | 95 | nli_label.append(max_label) 96 | # print (max_label, premise_hypothesis_pairs[entailment_argmax]) 97 | 98 | idx = nli_label.index(max(nli_label)) 99 | nli_label = max(nli_label) 100 | nli_contradict_prob = nli_contradict_prob[idx] 101 | nli_neutral_prob = nli_neutral_prob[idx] 102 | nli_entail_prob = nli_entail_prob[idx] 103 | 104 | eval_result_obj = { 105 | 'premise_hypothesis_pairs': premise_hypothesis_pairs, 106 | 'nli-label': nli_label, 107 | 'nli-contr': nli_contradict_prob, 108 | 'nli-entail': nli_entail_prob, 109 | 'nli-neutr': nli_neutral_prob 110 | } 111 | 112 | return eval_result_obj 113 | 114 | def main(args): 115 | 116 | # read hyp, ref, IR_docs 117 | hyps = read_hyp(args.hyp_path) 118 | IR_recalls = read_IR_docs(args.hyp_path + '_IR_docs') 119 | testset = read_testfile(args.ref_path) 120 | assert len(hyps) == len(testset) == len(IR_recalls) == 500, (len(hyps), len(testset), len(IR_recalls)) 121 | 122 | # DEBUG mode!
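# Note: with --debug set, only the first DEBUG_SAMPLE_SIZE (10) examples are evaluated and the
# per-example NLI results are printed for inspection, a quick sanity check before running the full 500-example test set.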
123 | if args.debug: 124 | DEBUG_SAMPLE_SIZE = 10 125 | hyps = hyps[:DEBUG_SAMPLE_SIZE] 126 | IR_recalls = IR_recalls[:DEBUG_SAMPLE_SIZE] 127 | testset = testset[:DEBUG_SAMPLE_SIZE] 128 | 129 | final_contradict_prob, final_neutral_prob, final_entail_prob, all_nli_labels = [], [], [], [] 130 | all_analysis_list = [] 131 | 132 | for i in tqdm(range(len(hyps))): 133 | hyp, example, recall_list = hyps[i], testset[i], IR_recalls[i] 134 | response = example[3] 135 | 136 | res_obj = single_instance_eval(hyp, response, recall_list, args) 137 | if args.debug: 138 | print ('==' * 20) 139 | print (res_obj) 140 | 141 | final_contradict_prob.append(res_obj['nli-contr']) 142 | final_neutral_prob.append(res_obj['nli-neutr']) 143 | final_entail_prob.append(res_obj['nli-entail']) 144 | all_nli_labels.append(res_obj['nli-label']) 145 | all_analysis_list.append(res_obj) 146 | 147 | # analysis 148 | avg_contradict_prob = np.mean(final_contradict_prob) 149 | avg_neutral_prob = np.mean(final_neutral_prob) 150 | avg_entail_prob = np.mean(final_entail_prob) 151 | 152 | print("AVG PROBS: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format(avg_contradict_prob*100, avg_neutral_prob*100, avg_entail_prob*100)) 153 | 154 | nli_contradict_class_ratio, nli_neutral_class_ratio, nli_entail_class_ratio = 0, 0, 0 155 | 156 | nli_counter = Counter(all_nli_labels) 157 | 158 | nli_contradict_class_ratio=nli_counter[0]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 159 | nli_neutral_class_ratio=nli_counter[1]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 160 | nli_entail_class_ratio=nli_counter[2]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 161 | 162 | print("NLI CLASS %: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format( 163 | nli_contradict_class_ratio*100, 164 | nli_neutral_class_ratio*100, 165 | nli_entail_class_ratio*100 166 | )) 167 | 168 | res_path = args.hyp_path + '_factuality_results.txt' 169 | with open(res_path, 'a') as outfile: 170 | res_obj = { 171 | 'Contradict_probs': avg_contradict_prob, 172 | 'Neutral_probs': avg_neutral_prob, 173 | 'Entail_probs': avg_entail_prob, 174 | "nli_contradict_class_ratio": nli_contradict_class_ratio, 175 | "nli_neutral_class_ratio": nli_neutral_class_ratio, 176 | "nli_entail_class_ratio": nli_entail_class_ratio, 177 | } 178 | json.dump(res_obj, outfile) 179 | outfile.write("\n") 180 | 181 | ana_path = args.hyp_path + '_analysis.txt' 182 | with open(ana_path, 'a') as outfile: 183 | json.dump(all_analysis_list, outfile) 184 | outfile.write("\n") 185 | 186 | 187 | if __name__ == '__main__': 188 | args = args_parser() 189 | main(args) 190 | --------------------------------------------------------------------------------
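
For reference, the following is a minimal sketch of the three input files that `src/wow_validity.py` expects (via `read_hyp`, `read_IR_docs`, and `read_testfile`). All file names and contents below are hypothetical illustrations only; the actual evaluation, presumably driven by `scripts/wow_validity.sh`, assumes the full 500-example random test set, a companion `<hyp_path>_IR_docs` file, and at least 10 retrieved passages per example when `--use_IR_eval` is set.

```python
# A minimal sketch of the input formats expected by src/wow_validity.py.
# All file names here are hypothetical and for illustration only.
import json

# 1) --hyp_path: one generated response per line (read_hyp).
with open("toy_hyp.txt", "w") as f:
    f.write("The Eiffel Tower is a wrought-iron tower located in Paris, France.\n")

# 2) "<hyp_path>_IR_docs": one JSON list of retrieved passages per line (read_IR_docs).
#    With --use_IR_eval, the script asserts at least 10 passages per example and uses
#    only the top --retrieved_num of them as extra NLI premises.
passages = [f"Retrieved passage {i} about the Eiffel Tower." for i in range(10)]
with open("toy_hyp.txt_IR_docs", "w") as f:
    f.write(json.dumps(passages) + "\n")

# 3) --ref_path: tab-separated lines with exactly four fields (read_testfile):
#    topic \t query \t knowledge \t response
with open("toy_wow_testset.txt", "w") as f:
    f.write("Eiffel Tower\tWhere is the Eiffel Tower?\t"
            "The Eiffel Tower is on the Champ de Mars in Paris.\t"
            "It stands on the Champ de Mars in Paris.\n")

# The real run expects the 500-example random test set, invoked along the lines of:
# python src/wow_validity.py --hyp_path <generations> --ref_path <testset> --use_IR_eval
```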