├── README.md ├── emnlp_data ├── .DS_Store ├── nq │ ├── random_prompts │ │ └── nq_test_random_prompt.txt │ └── random_testset │ │ ├── nq_ref │ │ ├── nq_ref_avg_coh_para │ │ ├── nq_ref_avg_lm_entropy │ │ ├── nq_ref_avg_sent_ppl │ │ ├── nq_ref_rel │ │ └── nq_test_random_testset.txt ├── testset_preprocess_scripts │ ├── cnt_knowledge_length.py │ ├── extract_ref.py │ ├── random_prompt_maker.py │ └── random_testset_maker.py └── wow │ ├── random_prompts │ ├── seen_random_prompt.txt │ └── unseen_random_prompt.txt │ └── random_testset │ ├── seen_random_testset.txt │ ├── seen_topic_pageviews.txt │ ├── unseen_random_testset.txt │ ├── unseen_topic_pageviews.txt │ └── wow_seen_knowledge_ref ├── env ├── .DS_Store ├── coherence_environment.yml └── environment.yml ├── framework.png ├── scripts ├── helpfulness │ ├── nq_random_knowledge.sh │ ├── nq_w_hyp_knowledge.sh │ ├── nq_w_ref_knowledge.sh │ ├── nq_wo_knowledge.sh │ ├── view_results.sh │ ├── wow_random_knowledge.sh │ └── wow_w_hyp_knowledge.sh ├── nq_coh_para.sh ├── nq_coh_sent.sh ├── nq_factuality.sh ├── nq_factuality_view.sh ├── nq_info.sh ├── nq_relevance.sh ├── nq_validity.sh ├── other │ ├── cal_factuality_for_DPR.sh │ ├── cal_factuality_for_knowledge.sh │ ├── cal_factuality_for_knowledge_IR.sh │ ├── cal_factuality_for_opt_knowledge_IR.sh │ ├── cal_factuality_for_refined_knowledge.sh │ ├── cal_factuality_for_refined_knowledge_IR.sh │ ├── cal_factuality_for_response.sh │ └── tmp.sh ├── view_coh_sent.sh ├── view_info.sh ├── view_nq_validity.sh ├── view_wow_validity.sh ├── wow_coh_para.sh ├── wow_coh_sent.sh ├── wow_factuality.sh ├── wow_factuality_view.sh ├── wow_info.sh ├── wow_relevance.sh └── wow_validity.sh └── src ├── claim_handling.py ├── discourse-coherence.py ├── eval_exp.py ├── helpfulness.py ├── info.py ├── nq_validity.py ├── ppl.py ├── relevance.py ├── tools.py └── wow_validity.py /README.md: -------------------------------------------------------------------------------- 1 | # Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators 2 | Welcome to the repository for our EMNLP 2023 paper, "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators." In this work, we introduce **CONNER** (COmpreheNsive kNowledge Evaluation fRamework), a systematic approach designed to evaluate the output of Large Language Models (LLMs) across key dimensions such as Factuality, Relevance, Coherence, Informativeness, Helpfulness, and Validity. 3 | 4 | Here, you'll find the necessary code and resources to replicate our findings and further explore the potential of LLMs. We hope they help facilitate your work in exploring the frontiers of LLMs with a touch of ease. 5 | 6 | ## CONNER Framework 7 | 8 | 9 | ### Intrinsic Evaluation 10 | 11 | - **Factuality:** Assessing the verifiability of the information against external evidence. 12 | - **Relevance:** Ensuring the knowledge aligns with the user's query intent. 13 | - **Coherence:** Evaluating the logical flow of information at both sentence and paragraph levels. 14 | - **Informativeness:** Measuring the novelty or unexpectedness of the knowledge provided. 15 | 16 | ### Extrinsic Evaluation 17 | 18 | - **Helpfulness:** Gauging whether the knowledge aids in enhancing performance on downstream tasks. 19 | - **Validity:** Certifying the factual accuracy of downstream task results when utilizing the knowledge. 
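To make the likelihood-based metrics above more concrete, the sketch below shows how a knowledge passage can be scored by its perplexity under GPT-neo-2.7B, the causal LM listed for sentence-level Coherence and Informativeness in the Model Sources table. This is a minimal illustration assuming the Hugging Face `transformers` API, not the repository's actual implementation (see `src/ppl.py` and `src/info.py` for that); the helper name `passage_perplexity` is illustrative only.

```python
# Minimal sketch (not the repo's exact metric code): perplexity of a passage under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-2.7B"  # model listed in the Model Sources table

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def passage_perplexity(text: str) -> float:
    """Exponentiated mean token-level negative log-likelihood (lower = more fluent/expected)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # out.loss is the mean token NLL
    return torch.exp(out.loss).item()

if __name__ == "__main__":
    print(passage_perplexity("The Eiffel Tower is a wrought-iron lattice tower in Paris."))
```

Lower perplexity indicates text the LM finds more fluent and predictable, which is broadly the intuition behind the sentence-level Coherence score; Informativeness, as described above, instead rewards knowledge that is novel or unexpected to the LM.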
20 | 21 | ## Getting Started 22 | 23 | #### Setting Up the Environment 24 | 25 | Begin by setting up your Conda environment with the provided `env/environment.yml` file, which will install all necessary packages and dependencies. (A second environment file, `env/coherence_environment.yml`, is also included under `env/`.) 26 | 27 | ```bash 28 | conda env create -f env/environment.yml -n CONNER 29 | conda activate CONNER 30 | ``` 31 | If you run into any missing packages or dependencies, please install them as needed. 32 | 33 | #### Evaluating Your LLMs 34 | Run the evaluation script that corresponds to your dataset and chosen metric. Replace `${data}` with your dataset (`nq` or `wow`) and `${metric}` with one of the following metrics: `factuality`, `relevance`, `info`, `coh_sent`, `coh_para`, `validity`, or `helpfulness`. 35 | ```bash 36 | # Run evaluation script. Example usage: 37 | # bash scripts/nq_factuality.sh 38 | # bash scripts/wow_relevance.sh 39 | bash scripts/${data}_${metric}.sh 40 | ``` 41 | #### Viewing Results 42 | Once the evaluation has finished, you can view the results with the corresponding viewing script: 43 | ```bash 44 | # Display the evaluation results. Example usage: 45 | # bash scripts/nq_factuality_view.sh 46 | # bash scripts/wow_factuality_view.sh 47 | bash scripts/${data}_${metric}_view.sh 48 | ``` 49 | Note that for some metrics the viewing script uses a `view_` prefix instead, e.g. `scripts/view_info.sh` and `scripts/view_coh_sent.sh`. 50 | #### Model Sources 51 | 52 | Below is a list of the models used in our CONNER framework for each metric: 53 | 54 | | Metric | Model | Source | 55 | |----------------------|---------------------------------|-----------------------------------------------------| 56 | | Factuality | NLI-RoBERTa-large, ColBERTv2 | [Hugging Face](https://huggingface.co/sentence-transformers/nli-roberta-large), [GitHub](https://github.com/stanford-futuredata/ColBERT) | 57 | | Relevance | BERT-ranking-large | [GitHub](https://github.com/nyu-dl/dl4marco-bert) | 58 | | Sentence-level Coherence | GPT-neo-2.7B | [Hugging Face](https://huggingface.co/EleutherAI/gpt-neo-2.7B) | 59 | | Paragraph-level Coherence | Coherence-Momentum | [Hugging Face](https://huggingface.co/aisingapore/coherence-momentum) | 60 | | Informativeness | GPT-neo-2.7B | [Hugging Face](https://huggingface.co/EleutherAI/gpt-neo-2.7B) | 61 | | Helpfulness | LLaMA-65B | [GitHub](https://github.com/facebookresearch/llama/tree/main) | 62 | | Validity | NLI-RoBERTa-large, ColBERTv2 | [Hugging Face](https://huggingface.co/sentence-transformers/nli-roberta-large), [GitHub](https://github.com/stanford-futuredata/ColBERT) | 63 | 64 | 65 | ## Citing Our Work 66 | If you find our work helpful in your research, please cite our paper: 67 | ```bibtex 68 | @misc{chen2023factuality, 69 | title={Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators}, 70 | author={Liang Chen and Yang Deng and Yatao Bian and Zeyu Qin and Bingzhe Wu and Tat-Seng Chua and Kam-Fai Wong}, 71 | year={2023}, 72 | eprint={2310.07289}, 73 | archivePrefix={arXiv}, 74 | primaryClass={cs.CL} 75 | } 76 | ``` 77 | -------------------------------------------------------------------------------- /emnlp_data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChanLiang/CONNER/77f99c876bdc6ca8cb3991210e2ccc2914d4971b/emnlp_data/.DS_Store -------------------------------------------------------------------------------- /emnlp_data/nq/random_testset/nq_ref_avg_coh_para: -------------------------------------------------------------------------------- 1 | [-13.119321823120117, 18.643884658813477, 14.975356101989746, -9.953686714172363, -22.64740753173828, 18.281700134277344,
-10.134472846984863, 10.106371879577637, -7.24709415435791, 2.947871685028076, 19.49839210510254, 10.568069458007812, 5.3517303466796875, 15.628636360168457, 13.229722023010254, -16.684036254882812, -21.012622833251953, 11.617283821105957, -5.790841102600098, 3.663580894470215, 7.896894931793213, 9.753914833068848, 16.251827239990234, 16.19835090637207, 12.03116512298584, 10.515118598937988, 15.596977233886719, 14.38183879852295, 16.412246704101562, 15.887755393981934, -15.057905197143555, 0.2845648229122162, 18.343692779541016, 14.351183891296387, 15.628971099853516, 18.301742553710938, -15.328641891479492, -0.5141409635543823, 10.937677383422852, 7.783780574798584, 1.3738312721252441, 14.359389305114746, -10.770936965942383, -3.065892219543457, -4.884547233581543, 14.948290824890137, -2.230806827545166, -8.666611671447754, 6.646633148193359, 6.719013690948486, 18.264400482177734, 7.262363433837891, 9.07824993133545, 11.578181266784668, 14.675372123718262, -11.40087890625, -13.6721830368042, -2.914580821990967, 10.05797290802002, -2.8793838024139404, 14.709553718566895, 10.51543140411377, 15.408014297485352, 10.95182991027832, 7.270349502563477, -1.7264076471328735, 14.45258903503418, 18.208837509155273, 9.560979843139648, 7.398679256439209, -8.244877815246582, 1.2515833377838135, 7.381077766418457, 6.716863632202148, -14.006756782531738, -8.48363971710205, 11.411901473999023, 12.59145450592041, -17.390769958496094, -3.3476669788360596, -1.9921391010284424, 10.428362846374512, 16.394018173217773, 9.265216827392578, 18.750652313232422, 11.602629661560059, 2.1918861865997314, 8.7633638381958, 0.5265696048736572, 5.030932426452637, -19.480724334716797, -12.329466819763184, -10.704183578491211, -26.26762580871582, -11.212485313415527, 12.00790786743164, -7.852105140686035, -19.85357666015625, 18.526920318603516, 14.987327575683594, 14.665925979614258, 9.208537101745605, 11.986360549926758, -13.110265731811523, 3.1953303813934326, 20.347864151000977, 14.489336967468262, 13.525574684143066, 11.642297744750977, -16.527755737304688, 14.678874015808105, -20.76173210144043, -0.3452112376689911, 6.819133758544922, 17.097471237182617, -0.6257216334342957, 15.619368553161621, 10.135584831237793, 6.282881259918213, 18.862598419189453, 7.799135208129883, 3.3354477882385254, 0.15584170818328857, 6.26609992980957, 9.596212387084961, 7.95708703994751, 17.74555778503418, -21.439605712890625, -7.698458671569824, 12.603463172912598, 9.275918960571289, -3.4611470699310303, -8.378585815429688, 16.112457275390625, 12.117599487304688, 5.564085006713867, 10.20129108428955, 19.477554321289062, 4.768891334533691, 12.773776054382324, 8.356185913085938, -11.753010749816895, 16.634265899658203, 11.060528755187988, 6.845538139343262, 13.33799934387207, 17.466869354248047, -12.745265007019043, 17.00414276123047, -9.293492317199707, 11.534061431884766, -5.457294464111328, 16.08570671081543, 5.225383758544922, 13.115345001220703, 9.974627494812012, 10.44150447845459, 17.22805404663086, 8.482453346252441, -8.677849769592285, -20.50507354736328, -14.645130157470703, 13.208407402038574, 18.922218322753906, 9.323628425598145, 18.2813720703125, 12.772849082946777, -0.10989882051944733, 11.091923713684082, -1.9497580528259277, 0.2969266176223755, 6.51821231842041, 2.5422768592834473, 7.474398612976074, 2.108112335205078, 16.875263214111328, 15.890420913696289, 16.88079833984375, 14.956415176391602, 4.558560371398926, -0.5833718180656433, 17.631332397460938, -0.8018089532852173, 11.801080703735352, -0.40746182203292847, 
-15.2289457321167, 3.7110328674316406, -7.3578410148620605, 12.60633659362793, 14.488272666931152, 3.9711179733276367, -3.953218936920166, 17.390342712402344, 17.440744400024414, 11.964737892150879, 15.407419204711914, 3.570370674133301, 9.650638580322266, 15.390456199645996, 0.06699617952108383, 17.996315002441406, 11.45728588104248, -4.622745990753174, 2.8869075775146484, -5.676790714263916, -3.6832518577575684, 12.396058082580566, 13.688033103942871, 17.996315002441406, 7.197054862976074, 15.286927223205566, -18.627094268798828, 18.30089569091797, 11.253070831298828, 2.229530096054077, 7.75808572769165, 5.06968355178833, -10.770936965942383, 11.159478187561035, 13.587868690490723, 10.082839965820312, 16.37598991394043, 5.587576389312744, 14.567668914794922, -8.760175704956055, 7.703945636749268, -13.886496543884277, 13.01379680633545, -13.758611679077148, 9.17174243927002, -16.080148696899414, 12.9537992477417, 14.316058158874512, -5.74833869934082, -8.30625057220459, 8.670008659362793, -12.240528106689453, -19.61964225769043, 0.649236261844635, 13.778191566467285, 6.167590618133545, 14.636520385742188, 18.58367919921875, -9.196493148803711, 14.32049560546875, 9.175971031188965, -11.886533737182617, 14.924628257751465, -24.584224700927734, 2.0860023498535156, 17.110746383666992, 14.72911548614502, 3.460249423980713, 9.528497695922852, -6.364473342895508, 16.25566864013672, -9.959657669067383, -7.410150051116943, 5.192204475402832, -22.707664489746094, 14.38634967803955, 12.714741706848145, -12.977810859680176, 15.175114631652832, -20.359966278076172, 15.887040138244629, 1.8842642307281494, 1.9588429927825928, 11.025944709777832, -2.200796604156494, -12.423057556152344, -13.198972702026367, 9.224650382995605, 12.306821823120117, 16.717018127441406, 12.369593620300293, 14.43990707397461, -10.932412147521973, -8.926290512084961, 14.12479305267334, 3.674036979675293, 6.213274955749512, 14.771726608276367, -24.03095245361328, -17.400426864624023, 12.2335205078125, -2.4841628074645996, 11.171034812927246, -14.19003963470459, 6.1105570793151855, 10.842275619506836, 3.076615571975708, 9.518315315246582, 15.364958763122559, 9.223173141479492, 8.47429084777832, 6.059081554412842, 15.535370826721191, 13.101935386657715, 19.02134895324707, -14.119009971618652, 12.81179141998291, -1.7170445919036865, 16.877477645874023, 18.27460479736328, 4.872830867767334, 17.16568946838379, 0.8550774455070496, 4.210025787353516, 17.31049346923828, -4.009548187255859, -2.4618661403656006, 19.04184913635254, 11.184344291687012, 0.5831080079078674, 17.68505859375, 17.419021606445312, -15.47326946258545, 15.801827430725098, -13.933554649353027, -10.648279190063477, 7.597427845001221, 15.462384223937988, 0.32431674003601074, 13.973188400268555, -8.640498161315918, 5.005640029907227, 15.267630577087402, -2.016509771347046, -5.688741683959961, 7.008584976196289, -20.141035079956055, 13.747187614440918, 5.731258392333984, -14.18584156036377, 15.806730270385742, 16.6748046875, 11.475944519042969, 6.184438705444336, 10.92053508758545, -7.597588062286377, 16.148408889770508, 14.263434410095215, 2.1767494678497314, 18.40786361694336, -14.55492115020752, 17.825923919677734, 13.184425354003906, 15.508620262145996, 8.382051467895508, -5.733557224273682, 4.105076313018799, 13.809184074401855, -5.355087757110596, 16.091463088989258, 18.29323387145996, 17.374454498291016, 1.4047178030014038, 17.88732147216797, -1.6902297735214233, 12.622706413269043, 15.329268455505371, -3.92073130607605, 15.702555656433105, -16.939350128173828, 
-19.12222671508789, 10.142857551574707, 11.022912979125977, -6.0877251625061035, -0.4878803491592407, -20.219280242919922, -7.447934150695801, -0.49328118562698364, 17.67724609375, 8.014138221740723, -17.380390167236328, 10.3120698928833, 13.518301963806152, 3.1764838695526123, 17.671239852905273, -0.6746103167533875, 16.26909828186035, -9.144436836242676, -0.9421584606170654, 10.030646324157715, 16.53363037109375, 9.232200622558594, 7.369050979614258, 7.575037002563477, 16.62100601196289, 6.481991767883301, 2.531597852706909, 14.252530097961426, -1.746160864830017, 11.183938980102539, 4.897782325744629, -14.06386661529541, 17.58884620666504, -13.53532886505127, 18.790796279907227, 4.670736312866211, 16.990140914916992, 4.967563629150391, -0.4503783881664276, -7.060770511627197, 12.426644325256348, 9.955527305603027, -23.58697509765625, 16.9542236328125, -22.9370174407959, 19.125083923339844, -18.199602127075195, -5.261682510375977, 11.080878257751465, 15.306122779846191, 3.0926597118377686, -17.665010452270508, 1.2239549160003662, -20.03911590576172, 16.360694885253906, 18.033679962158203, -5.759027004241943, 16.247272491455078, -4.610719203948975, 3.280198574066162, 3.6081905364990234, -24.661344528198242, 17.47615623474121, 0.26504141092300415, -2.099376916885376, 10.232733726501465, 16.317556381225586, -17.588844299316406, 17.70425033569336, 17.660411834716797, 12.04329776763916, 18.50408172607422, 4.581759452819824, 18.606327056884766, 0.7869451642036438, 15.58411693572998, 8.058518409729004, 10.642940521240234, -9.863536834716797, 12.829106330871582, -11.971624374389648, 15.981738090515137, -12.542139053344727, 3.2194089889526367, 9.560979843139648, 7.007416725158691, 13.006417274475098, -7.5256667137146, 8.963598251342773, 6.474368572235107, -15.432988166809082, 3.4593796730041504, -6.131731033325195, 19.431949615478516, 13.920551300048828, 6.007608413696289, -18.999370574951172, 8.125280380249023, 4.632145881652832, 15.26697063446045, 14.385810852050781, 11.668633460998535, 11.432242393493652, -4.376832485198975, 14.677751541137695, -17.08258628845215, 4.100184917449951, -15.47326946258545, 15.083805084228516, 4.513195514678955, -8.103970527648926, 17.41781234741211, 1.425919532775879, -6.213893890380859, 14.494572639465332, 1.9545398950576782, 17.199356079101562, -12.507933616638184, 13.423059463500977, 17.627429962158203, -14.06386661529541, 11.911812782287598, 11.91988754272461, 15.36992073059082, -9.81781005859375, -1.6895909309387207, 18.3709716796875, 5.118607997894287, 17.608135223388672, 8.924373626708984, 3.4715404510498047, 10.618024826049805, -2.3761510848999023, -14.850167274475098, 6.333468914031982] 2 | 20.347864151000977 -26.26762580871582 -------------------------------------------------------------------------------- /emnlp_data/nq/random_testset/nq_ref_avg_lm_entropy: -------------------------------------------------------------------------------- 1 | [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, 
NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN] 2 | NaN -------------------------------------------------------------------------------- /emnlp_data/nq/random_testset/nq_ref_avg_sent_ppl: -------------------------------------------------------------------------------- 1 | [78.4693368448625, 16.043491695128726, 15.804934627199694, 53.23927945815809, 41.15005729707433, 10.50810120958408, 6.100938382963522, 67.48036996807707, 27.996390631708877, 29.215547701526184, 32.89045190663701, 53.67419445493103, 67.90224061169619, 59.324534008027015, 68.11863880866669, 100.0, 11.18664364841687, 63.60015186627291, 93.20928166657612, 53.22125644548603, 44.396799448946105, 55.91322077903019, 22.511726352105622, 44.27881054053034, 100.0, 28.969665540137683, 54.09288183052771, 18.83291960242443, 38.10779446908432, 88.78784355913466, 86.9860044535187, 100.0, 93.54996883396687, 31.23412437863948, 58.93436814650389, 71.03339433990104, 92.05139580464764, 66.88117758800215, 49.7674398840102, 87.18660354790754, 86.78696919699829, 51.950614229747046, 91.44705859268413, 5.190588743863161, 70.52482366327511, 46.18224223328229, 5.050934899268297, 19.253325549877246, 30.636168485681026, 35.336577492277954, 69.62419980886938, 41.863717275046234, 56.02683768815861, 37.563018144626106, 57.492770836936714, 100.0, 44.96884665815416, 62.09192051669649, 28.117156195500414, 84.12359139682738, 41.466212699416985, 41.842113436656064, 22.03863655634089, 52.42349769740261, 70.66018373086091, 5.945385801758123, 30.175888565873056, 25.17534426406489, 38.80949853753552, 100.0, 79.7977150546999, 5.870803559537856, 29.40460943390798, 26.140572628150604, 10.073324054728108, 8.04474714646627, 58.11901407353269, 58.47049465318385, 
69.15162771778725, 83.89402822020207, 58.01664243914295, 52.62574447567372, 42.83131840855212, 44.41039137941371, 23.58437181390929, 49.8692187978461, 57.554268891823604, 38.81962317221551, 28.37811495669342, 41.51519353649107, 16.27531796559519, 86.72404698976116, 60.83146012345982, 85.2239467307671, 4.182093180232769, 26.912988954347888, 6.514642953779386, 10.59331913067613, 19.4277578940981, 44.26226843679404, 40.21397733256407, 66.5492461652898, 37.53767953925087, 74.81145321587799, 87.6990942071536, 14.167328159135606, 57.917234529614426, 32.28550906921534, 50.487339715410165, 79.3488639934888, 32.30160790329472, 34.8669344883829, 18.33801662935187, 63.08825615353803, 22.8900862160764, 4.602974120191751, 65.88837289614875, 60.114391712409606, 49.143607662107186, 48.448415985011685, 85.93036832838905, 54.46381386570613, 52.710168937818175, 34.98609189283082, 100.0, 82.99606536620868, 25.801826292437184, 67.69431699400373, 73.40954491954432, 100.0, 100.0, 13.450123586613607, 30.252880628186652, 37.18607406959037, 71.25514218300809, 50.687486145792185, 88.85777119414414, 17.67454787215367, 42.25914107716766, 46.07183443470951, 29.360933012286065, 91.64069544929384, 9.364183350622403, 28.733743947465122, 7.829874200746442, 48.19923010329084, 17.63484233438192, 8.901291042826191, 57.23819521890625, 11.431844035104252, 52.853003441050795, 55.34538055526116, 23.26579146101366, 79.336160388691, 30.84904916314805, 42.51635682798475, 15.006421350764779, 28.239869305309288, 36.2774867678075, 18.219436192333255, 27.939471967926718, 11.912612508020064, 100.0, 26.930295717239854, 57.27942179998525, 31.444787110759638, 25.243815375835123, 57.27757829283663, 79.01959879363238, 3.756021644237454, 69.47456455083955, 49.92495365704886, 28.9776917938745, 67.27869008353792, 8.233310258192683, 19.99385464132014, 41.01906277459451, 30.288760681943845, 47.57541318410044, 69.89785950677994, NaN, 18.53793016881998, 27.495018739833913, 28.133526653183385, 61.78660514015602, 100.0, 22.347591227413954, 7.607686129443385, 46.831082759910785, 54.405917836795474, 78.95596448648368, 17.73826356357924, 22.621886112268395, 7.675172804675318, 42.47293496580793, 39.93305142617706, 53.08773918167623, 38.301428644554235, 30.11430687280376, 66.4553528008607, 10.572855201652148, 85.8774195332294, 100.0, 10.168131346747268, 5.890877479933101, 90.84300721934494, 49.052191018285306, 48.00127137837724, 10.572855201652148, 65.38331521035612, 16.085177799050324, 75.57413525170637, 14.177714887593893, 43.555649711081145, 83.94592734574924, 7.859383843049418, 9.710033597164143, 91.44705859268413, 66.78146956549836, 61.12810313331742, 79.07377466512598, 40.683994468126464, 55.211967783776934, 53.822895220951466, 60.47158849946876, 25.495150683340373, 64.97668726989721, 9.39264775287852, 77.38194116179402, 23.988622161697332, 100.0, 57.77326658082132, 32.52255001911109, 28.36959604444296, 37.285203308453305, 11.112263797171225, 64.45322764057059, 29.022193408660645, 26.649195197483568, 65.18438625977032, 100.0, 23.650732932834142, 28.39248175000161, 54.22242960447606, 27.102035563186863, 52.505613907053274, 88.37060236205672, 57.63914449437923, 10.95308820881317, 48.61061834098217, 17.57937183901338, 21.741398017397973, 66.46813319011274, 43.82021831418862, 43.93084158858949, 39.460481431512385, NaN, 38.19217753182415, 27.350671828561318, 76.4552536949989, 21.43581607467629, 42.71083111389023, 54.68586560318957, 57.96951125703256, 63.2092470957139, 61.8858250041732, 67.4782191664445, 45.71895670056356, 48.49965121653874, 
13.801560552250978, 82.0290052023144, 49.26524207475927, 65.39611582801435, 48.9313603200487, 17.399555024609302, 71.86034687140634, 34.17710049127836, 15.560886860555184, 100.0, 45.587523879716485, 92.90999318034552, 22.396488319503675, 19.769063634463546, 66.2727873665185, 58.37219835425476, 61.8330587063284, 76.15445479738644, 65.10139405094469, 30.411742608106582, 71.46585180678812, 25.133494662273623, 53.957955019292676, 45.205033015396566, 19.368192387932776, 52.74097094282029, 38.417053927711144, 38.44702283178825, 35.16281902141295, 32.12372067896128, 69.39259156686892, 100.0, 33.70049762868605, 83.7574123862593, 74.63229558625723, 29.63988358730266, 11.141142300349554, 49.37503166308023, 62.815921138863345, 47.90365622303026, 28.979974141484963, 65.6605960946507, 68.28209488170104, 81.34838467628872, 45.1893563527962, 23.704115878128697, 62.68784038330249, 28.979468158050544, 14.87731316470635, 31.687223344905128, 7.053431375407266, 26.924944188155596, 61.49818106064528, 35.398435766463955, 100.0, 40.89925100406656, 24.53816461142837, 81.71840580748571, 44.65755612960107, 48.508640082792425, 84.91696128735245, 23.048170043620853, 71.69208861888265, 70.74994143682393, 56.817927583504044, 21.7545726190004, 54.435831683151825, 85.53311965080941, 35.620723784868396, 62.19238793393677, 17.52713060078619, 23.452689741340183, 23.637008111384432, 34.088221002778155, 19.63002102935967, 16.658842862059494, 100.0, 20.31905768713922, 55.652382631872975, 17.602564986302696, 52.54253800204756, 20.338837471467222, 6.524293988216611, 16.531504297917582, 100.0, 51.59755874475989, 14.192940713242582, 43.21651114253152, 70.39592310603686, 15.948928734510222, 79.7092184656985, 98.76599944358716, 67.30579697786645, 85.55465618508093, 69.44646603644725, 9.68064920215347, 39.43964965941498, 17.663005510973708, 82.17563782956356, 55.74860974120666, 88.00914893020618, 98.75794968458214, 79.91814517767344, 100.0, 50.43072956804357, 55.20529921934837, 38.80347216029391, 41.418105312725274, 51.33427077483436, 40.644144375574434, 7.348653417498423, 30.744310601810668, 37.39141479385989, 31.553934979496823, 74.14908097104342, 57.91877666713658, 13.338599807408572, 75.45382159392871, 49.13650117273204, 51.73842245428733, 21.494297211299624, 49.33050189919685, 78.63987989630108, 55.792702963233616, 35.71646595198077, 100.0, 100.0, 60.99824329043069, 18.73404824576729, 96.95316219766627, 18.381026668828653, 44.721631638215335, 22.941585044090232, 65.15406536907628, 88.70722229250158, 6.937894673370338, 9.771027972111897, 47.90206725805424, 49.27142704784982, 67.55369462929836, 74.05254003400552, 18.018599813339435, 11.290523947299521, 51.75739045944618, 60.5669539827174, 75.71581797202799, 87.5467953409547, 78.08399991228774, 95.60402121298063, 67.78531634829748, 65.44257906485598, 14.782783739932551, 86.8088661985033, 35.542420475392724, 18.8143101371883, 51.821384751840085, 86.30142971062976, 18.698702953700778, 18.717258906063826, 99.49252337397067, 58.342399702717394, 43.9318890305611, 22.587872154793093, 70.93133357967618, 9.949994733825061, 12.82532481394894, 32.669446602678676, 51.21630074241758, 5.655418768344973, 49.76715603276672, 87.68309156195106, 43.42747255302461, 58.348403072402085, 34.304607915902366, 3.6366487826139267, 35.773664869380426, 97.92290896755627, 42.55352808443655, 13.826841089805036, 100.0, 38.80949853753552, 36.37374459373949, 58.95719583529741, 4.501877034727099, 36.16321945266922, 7.119819682125535, 91.74869049671233, 67.60019917559482, 4.7308653870051804, 16.537152651336708, 
49.046451492916106, 45.093256091221775, 56.81469596417692, 19.41306271416821, 33.30974666446854, 40.75322950093222, 78.21412681354154, 26.98979053596943, 45.766711346561955, 57.854786571024555, 47.46923585138697, 63.829233748500975, 100.0, 14.87731316470635, 20.509667600141196, 5.297629491378028, 8.984981801999487, 30.616747250524487, 59.43491109119952, NaN, 32.80619543890013, 11.439434524154548, 81.7532819435181, 34.66610952361426, 21.746072409363222, 40.80490402564827, 60.99824329043069, 20.72013054787879, 34.1612630558561, 46.617922010330915, 88.09862636829668, 55.281747347040245, 18.206939742923193, 86.88200832384972, 28.91344281278397, 23.481498347502203, 33.597110545450455, 72.6510934816972, 75.63631810955533, 11.878015389507036, 34.153455331269825] 2 | 0.03658152118316791 -------------------------------------------------------------------------------- /emnlp_data/nq/random_testset/nq_ref_rel: -------------------------------------------------------------------------------- 1 | 0.9955950379371643 2 | 0.9874629378318787 3 | 0.963216245174408 4 | 0.9977188110351562 5 | 0.6699656844139099 6 | 0.9944804906845093 7 | 0.9958698153495789 8 | 0.9849919676780701 9 | 0.9985634684562683 10 | 0.989118218421936 11 | 0.9984180927276611 12 | 0.998940646648407 13 | 0.9816821813583374 14 | 0.013061659410595894 15 | 0.992595374584198 16 | 0.9913615584373474 17 | 0.9959718585014343 18 | 0.9972472786903381 19 | 0.9554606080055237 20 | 0.999143123626709 21 | 0.9955300688743591 22 | 0.9994617104530334 23 | 0.9991528987884521 24 | 0.9987228512763977 25 | 0.9992048144340515 26 | 0.998910665512085 27 | 0.998437225818634 28 | 0.9612510204315186 29 | 0.9927793145179749 30 | 0.9989737272262573 31 | 0.994820237159729 32 | 0.999016284942627 33 | 0.9985498785972595 34 | 0.9588527083396912 35 | 0.9992305040359497 36 | 0.9985186457633972 37 | 0.9945138692855835 38 | 0.9994654059410095 39 | 0.9980382323265076 40 | 0.9991067051887512 41 | 0.9989997744560242 42 | 0.9990805387496948 43 | 0.9948292374610901 44 | 0.9930898547172546 45 | 0.940955400466919 46 | 0.9945279955863953 47 | 0.9937803149223328 48 | 0.9994076490402222 49 | 0.19607733190059662 50 | 0.9980910420417786 51 | 0.9940365552902222 52 | 0.9391145706176758 53 | 0.998028576374054 54 | 0.9934442043304443 55 | 0.9984025359153748 56 | 0.9986213445663452 57 | 0.9943619966506958 58 | 0.978777289390564 59 | 0.9989280104637146 60 | 0.9930499792098999 61 | 0.9985374212265015 62 | 0.9943996071815491 63 | 0.9629001617431641 64 | 0.9976578950881958 65 | 0.8398615121841431 66 | 0.05630270019173622 67 | 0.06235615164041519 68 | 0.9881781935691833 69 | 0.9899857044219971 70 | 0.999122679233551 71 | 0.9974337220191956 72 | 0.9989663362503052 73 | 0.9975524544715881 74 | 0.995134174823761 75 | 0.9992455244064331 76 | 0.9988962411880493 77 | 0.9993371367454529 78 | 0.9994862079620361 79 | 0.9984022974967957 80 | 0.9993213415145874 81 | 0.9903036952018738 82 | 0.9910654425621033 83 | 0.9994753003120422 84 | 0.992023229598999 85 | 0.9774600267410278 86 | 0.9984448552131653 87 | 0.10379525274038315 88 | 0.9981904625892639 89 | 0.9979850053787231 90 | 0.9991675615310669 91 | 0.9995473027229309 92 | 0.9991532564163208 93 | 0.998704195022583 94 | 0.9956908822059631 95 | 0.5713329911231995 96 | 0.9859158992767334 97 | 0.9917317032814026 98 | 0.994891881942749 99 | 0.9991616010665894 100 | 0.9993641972541809 101 | 0.9982323050498962 102 | 0.9987945556640625 103 | 0.9710656404495239 104 | 0.9981881976127625 105 | 0.9974697828292847 106 | 0.9883113503456116 107 | 0.9987496137619019 108 
| 0.9991629123687744 109 | 0.9993699193000793 110 | 0.012185875326395035 111 | 0.9959009289741516 112 | 0.873554527759552 113 | 0.9922148585319519 114 | 0.9864484667778015 115 | 0.9659812450408936 116 | 0.9819478988647461 117 | 0.9902718663215637 118 | 0.9271910190582275 119 | 0.998245358467102 120 | 0.984853982925415 121 | 0.9978694915771484 122 | 0.9989228844642639 123 | 0.9975504279136658 124 | 0.9982158541679382 125 | 0.9906685948371887 126 | 0.9992731213569641 127 | 0.9980725049972534 128 | 0.9991907477378845 129 | 0.9753788113594055 130 | 0.998244047164917 131 | 0.9956242442131042 132 | 0.9967650175094604 133 | 0.9945287108421326 134 | 0.9904304146766663 135 | 0.9947385191917419 136 | 0.9993129968643188 137 | 0.9782684445381165 138 | 0.995001494884491 139 | 0.9992092847824097 140 | 0.9987636804580688 141 | 0.9790915846824646 142 | 0.9986263513565063 143 | 0.9984669089317322 144 | 0.9909221529960632 145 | 0.9909529685974121 146 | 0.9535863995552063 147 | 0.9910474419593811 148 | 0.9979985356330872 149 | 0.9984667897224426 150 | 0.9740865230560303 151 | 0.9986739158630371 152 | 0.982077419757843 153 | 0.9938271641731262 154 | 0.9988283514976501 155 | 0.9934296011924744 156 | 0.99751877784729 157 | 0.996823787689209 158 | 0.9982648491859436 159 | 0.9920627474784851 160 | 0.9957907795906067 161 | 0.9991832375526428 162 | 0.9994359612464905 163 | 0.9690139889717102 164 | 0.9970806241035461 165 | 0.9787412285804749 166 | 0.9207716584205627 167 | 0.9987126588821411 168 | 0.9524546265602112 169 | 0.9985440969467163 170 | 0.9976859092712402 171 | 0.9901997447013855 172 | 0.976740837097168 173 | 0.9648317694664001 174 | 0.9989110231399536 175 | 0.9991051554679871 176 | 0.9981874823570251 177 | 0.9803351163864136 178 | 0.9970822930335999 179 | 0.9800115823745728 180 | 0.03216550126671791 181 | 0.9969797134399414 182 | 0.9924662709236145 183 | 0.989404559135437 184 | 0.9963524341583252 185 | 0.9886478781700134 186 | 0.9746184349060059 187 | 0.9987975358963013 188 | 0.9647734761238098 189 | 0.028539611026644707 190 | 0.9089956283569336 191 | 0.9052648544311523 192 | 0.9990537762641907 193 | 0.01217712089419365 194 | 0.9794471263885498 195 | 0.997554361820221 196 | 0.9947006702423096 197 | 0.9952946305274963 198 | 0.99937903881073 199 | 0.01163787953555584 200 | 0.9988679885864258 201 | 0.9990560412406921 202 | 0.9983832836151123 203 | 0.9932032823562622 204 | 0.9957373142242432 205 | 0.9993076324462891 206 | 0.9975751042366028 207 | 0.9821121692657471 208 | 0.9968386888504028 209 | 0.9989283680915833 210 | 0.9434876441955566 211 | 0.9615771770477295 212 | 0.9193602800369263 213 | 0.9983742237091064 214 | 0.9957219362258911 215 | 0.9975576400756836 216 | 0.9960113763809204 217 | 0.9945828318595886 218 | 0.9981690645217896 219 | 0.9561895728111267 220 | 0.9899622201919556 221 | 0.9831120371818542 222 | 0.9990137815475464 223 | 0.9988020658493042 224 | 0.9855765700340271 225 | 0.9986560344696045 226 | 0.9622057676315308 227 | 0.9991530179977417 228 | 0.9991353154182434 229 | 0.997234046459198 230 | 0.9993801116943359 231 | 0.9915771484375 232 | 0.9989362359046936 233 | 0.9961090683937073 234 | 0.9803037047386169 235 | 0.9986960291862488 236 | 0.99750155210495 237 | 0.9922928810119629 238 | 0.9971168041229248 239 | 0.9979921579360962 240 | 0.9993677735328674 241 | 0.9993109703063965 242 | 0.9822739958763123 243 | 0.9387704730033875 244 | 0.9955278038978577 245 | 0.9978225231170654 246 | 0.998397171497345 247 | 0.9984637498855591 248 | 0.9981485605239868 249 | 0.9969332218170166 250 | 
0.9577726721763611 251 | 0.9981651902198792 252 | 0.9982585310935974 253 | 0.9875665903091431 254 | 0.9990083575248718 255 | 0.9973329305648804 256 | 0.994461715221405 257 | 0.9881724715232849 258 | 0.9966397285461426 259 | 0.9977622032165527 260 | 0.9975376129150391 261 | 0.998838484287262 262 | 0.9916936755180359 263 | 0.9992332458496094 264 | 0.9930398464202881 265 | 0.9983376264572144 266 | 0.9966981410980225 267 | 0.9968171715736389 268 | 0.9886389970779419 269 | 0.9985862970352173 270 | 0.9967496395111084 271 | 0.9763928651809692 272 | 0.9940037131309509 273 | 0.9965202808380127 274 | 0.9911490678787231 275 | 0.999377965927124 276 | 0.9791600108146667 277 | 0.9810723662376404 278 | 0.9960089921951294 279 | 0.9988295435905457 280 | 0.9954675436019897 281 | 0.9992780089378357 282 | 0.9966852068901062 283 | 0.9978881478309631 284 | 0.9936227202415466 285 | 0.998149037361145 286 | 0.9990297555923462 287 | 0.9959836006164551 288 | 0.9988118410110474 289 | 0.9991412162780762 290 | 0.9832374453544617 291 | 0.9807348847389221 292 | 0.9992181062698364 293 | 0.9991262555122375 294 | 0.9973322153091431 295 | 0.9963339567184448 296 | 0.9898951649665833 297 | 0.9811797738075256 298 | 0.9962238073348999 299 | 0.9970287680625916 300 | 0.9972063899040222 301 | 0.9988741278648376 302 | 0.9976562261581421 303 | 0.8660569190979004 304 | 0.9985843896865845 305 | 0.997268557548523 306 | 0.995783805847168 307 | 0.9928014278411865 308 | 0.9929514527320862 309 | 0.9539411067962646 310 | 0.9947105646133423 311 | 0.9973823428153992 312 | 0.9896822571754456 313 | 0.9955400824546814 314 | 0.9949434399604797 315 | 0.9905816316604614 316 | 0.9944721460342407 317 | 0.9864016175270081 318 | 0.9975747466087341 319 | 0.9968441724777222 320 | 0.9781889319419861 321 | 0.9979853630065918 322 | 0.9954456090927124 323 | 0.9988722205162048 324 | 0.9956164360046387 325 | 0.9969595670700073 326 | 0.9814164638519287 327 | 0.9973897337913513 328 | 0.983991265296936 329 | 0.9988692402839661 330 | 0.9619116187095642 331 | 0.9980373978614807 332 | 0.9810391664505005 333 | 0.025657007470726967 334 | 0.9724447727203369 335 | 0.99146968126297 336 | 0.9986559152603149 337 | 0.9979668259620667 338 | 0.9980127811431885 339 | 0.9813748598098755 340 | 0.9970160722732544 341 | 0.9992469549179077 342 | 0.9982807636260986 343 | 0.9983713030815125 344 | 0.9965781569480896 345 | 0.9983538389205933 346 | 0.991523265838623 347 | 0.9978896975517273 348 | 0.9978058934211731 349 | 0.9515871405601501 350 | 0.8837659358978271 351 | 0.9989218711853027 352 | 0.9990463852882385 353 | 0.9826864004135132 354 | 0.9880500435829163 355 | 0.9960760474205017 356 | 0.981313943862915 357 | 0.9978699684143066 358 | 0.999392032623291 359 | 0.9959993362426758 360 | 0.9892537593841553 361 | 0.9976708292961121 362 | 0.9942777752876282 363 | 0.9838907122612 364 | 0.9917405843734741 365 | 0.9967668056488037 366 | 0.9929261803627014 367 | 0.999275267124176 368 | 0.013177838176488876 369 | -------------------------------------------------------------------------------- /emnlp_data/testset_preprocess_scripts/cnt_knowledge_length.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | # for wow 4 | # query_len, knowledge_len = [], [] 5 | # with open('train_processed.txt', 'r', encoding='utf-8') as r: 6 | # data = r.readlines() 7 | # print ('data size: ', len(data)) 8 | # for i, line in enumerate(data): 9 | # parts = [e.strip() for e in line.strip().split('\t')] 10 | # # assert len(parts) == 4, (i, len(parts), parts) 11 
| # if len(parts) != 4: 12 | # print(i, len(parts), parts) 13 | # continue 14 | # topic, history, knowledge, response = parts 15 | # query = history.split(' [SEP] ')[-1] 16 | # query_len.append(len(query.split(' '))) 17 | # knowledge_len.append(len(knowledge.split(' '))) 18 | # assert len(query_len) == len(knowledge_len), 'length not equal' 19 | # print('query len: ', sum(query_len) / len(query_len)) # 14.6 20 | # print('knowledge len: ', sum(knowledge_len) / len(knowledge_len)) # 21.1 21 | 22 | # for nq 23 | train = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/retriever/nq-train.json' 24 | test = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/gold_passages_info/nq_test.json' 25 | query_len, knowledge_len = [], [] 26 | with open(test, 'r') as infile: 27 | data_list = json.load(infile)['data'] 28 | for data in data_list: 29 | if not data['context'] or not data['short_answers']: 30 | continue 31 | query = data['question'] 32 | knowledge = data['context'] 33 | query_len.append(len(query.split(' '))) 34 | knowledge_len.append(len(knowledge.split(' '))) 35 | print('query len: ', sum(query_len) / len(query_len)) # 9.0 36 | print('knowledge len: ', sum(knowledge_len) / len(knowledge_len)) # 297.2 37 | 38 | print (len(knowledge_len)) # 1868 39 | print (max(knowledge_len)) 40 | li = [0] * 21 41 | for l in knowledge_len: 42 | if l > 1000: 43 | continue 44 | li[l // 50] += 1 45 | print (li) 46 | 47 | # [274, 582, 403, 247, 85, 46, 25, 19, 21, 7, 9, 10, 5, 6, 5, 7, 1, 1, 2, 0, 0] -------------------------------------------------------------------------------- /emnlp_data/testset_preprocess_scripts/extract_ref.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | def read_testfile(ref_path): 4 | # testset = [] 5 | with open(ref_path, 'r') as infile: 6 | for line in infile: 7 | parts = line.strip().split('\t') 8 | # topic, query, knowledge, response 9 | assert len(parts) == 4, parts 10 | # testset.append(parts) 11 | print (parts[-2]) 12 | # return testset 13 | 14 | read_testfile('/misc/kfdata01/kf_grp/lchen/EMNLP23/experiments/emnlp_data/wow/random_testset/seen_random_testset.txt') -------------------------------------------------------------------------------- /emnlp_data/testset_preprocess_scripts/random_prompt_maker.py: -------------------------------------------------------------------------------- 1 | import random 2 | import json 3 | import tqdm 4 | 5 | # for wow 6 | # data = [] 7 | # with open('train_processed.txt', 'r', encoding='utf-8') as r: 8 | # data = r.readlines() 9 | 10 | # for split in ['seen', 'unseen']: 11 | # with open(f'random_prompts/{split}_random_prompt.txt', 'w', encoding='utf-8') as w: 12 | # for i in range(500): # lines, examples 13 | # random.shuffle(data) 14 | # cur_prompt = data[:50] 15 | # w.write(str(cur_prompt).strip() + '\n') 16 | # # print (cur_prompt) 17 | 18 | 19 | 20 | with open('/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/retriever/nq-dev.json', 'r', encoding='utf-8') as infile: 21 | data_list = json.load(infile) 22 | print (data_list[0].keys()) # dict_keys(['dataset', 'question', 'answers', 'positive_ctxs', 'negative_ctxs', 'hard_negative_ctxs']) 23 | 24 | 25 | train = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/retriever/nq-train.json' 26 | # with open(train, 'r', encoding='utf-8') as infile, \ 27 | # open('../nq/random_prompts/nq_test_random_prompt.txt', 'w', encoding='utf-8') as outfile: 28 | # data_list = json.load(infile) 29 | # for i in tqdm.tqdm(range(500)): 30 | # # 
random.shuffle(data_list) 31 | # id_list = random.sample(list(range(len(data_list))), 300) 32 | # cur_prompt = [] 33 | # # for data in data_list: 34 | # for id in id_list: 35 | # data = data_list[id] 36 | # if not data['positive_ctxs'] or not data['answers']: 37 | # continue 38 | 39 | # query = data['question'].strip() 40 | # answer = data['answers'][0].strip() 41 | 42 | # knowledge = data['positive_ctxs'][0]['text'].strip() 43 | # topic = data['positive_ctxs'][0]['title'].strip() 44 | 45 | # if len(knowledge.split(' ')) > 350: 46 | # continue 47 | # cur_prompt.append(f'{topic}\t{query}\t{knowledge}\t{answer}\n') 48 | # if len(cur_prompt) == 50: 49 | # break 50 | # outfile.write(str(cur_prompt).strip() + '\n') -------------------------------------------------------------------------------- /emnlp_data/testset_preprocess_scripts/random_testset_maker.py: -------------------------------------------------------------------------------- 1 | import random 2 | import json 3 | 4 | # dataset = 'wow' 5 | # for split in ['seen', 'unseen']: 6 | # data = [] 7 | # with open(f'../{dataset}/test{split}_processed.txt', 'r', encoding='utf-8') as r: 8 | # data = r.readlines() 9 | # with open(f'random_testset/{split}_random_testset.txt', 'w', encoding='utf-8') as w: 10 | # random.shuffle(data) 11 | # testset = data[:500] 12 | # w.writelines(testset) 13 | 14 | 15 | 16 | train = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/retriever/nq-train.json' 17 | test = '/misc/kfdata01/kf_grp/lchen/DPR/dpr/downloads/data/gold_passages_info/nq_test.json' 18 | cnt = 0 19 | with open(test, 'r') as infile, \ 20 | open('../nq/random_testset/nq_test_random_testset.txt', 'w') as outfile: 21 | data_list = json.load(infile)['data'] 22 | for data in data_list: 23 | if not data['context'] or not data['short_answers']: 24 | continue 25 | 26 | p = random.random() 27 | if p > 0.35: 28 | continue 29 | topic = data['title'].strip() 30 | query = data['question'].strip() 31 | knowledge = data['context'].strip() 32 | answer = data['short_answers'][0].strip() 33 | outfile.write(f'{topic}\t{query}\t{knowledge}\t{answer}\n') 34 | cnt += 1 35 | if cnt == 500: 36 | break 37 | -------------------------------------------------------------------------------- /emnlp_data/wow/random_testset/seen_topic_pageviews.txt: -------------------------------------------------------------------------------- 1 | Beard 1959542 2 | Chevrolet Corvette 5075804 3 | Del Taco 511892 4 | Steak 2617070 5 | National Hockey League 7053253 6 | My Little Pony: Friendship Is Magic fandom 735926 7 | Kale 4124167 8 | Avengers (comics) 6736715 9 | 100 metres 4162077 10 | Mercedes-Benz S-Class 3902440 11 | Star Trek 13923390 12 | Dance 3905111 13 | Beastie Boys 9614636 14 | Bachelor of Science in Nursing 606634 15 | Children's literature 1474233 16 | Chicago Bulls 8437648 17 | Byala, Varna Province 39347 18 | Bank teller 565886 19 | Veganism 6404632 20 | Acoustic guitar 1177249 21 | Washington Wizards 3783779 22 | Depression (mood) 5088913 23 | Violin 3510891 24 | Back pain 1792205 25 | Cheese 5531781 26 | Vancouver Grizzlies 1468983 27 | Appalachian Trail 4152310 28 | Weight training 1395672 29 | Zebra 3910132 30 | Laser pointer 884624 31 | Pet 2558753 32 | Synchronised swimming 438525 33 | Indie rock 3169932 34 | American lobster 776293 35 | Netflix 26821165 36 | Marine habitats 510784 37 | Facial hair in the military 711489 38 | Association football 13829467 39 | Wedding cake 735753 40 | The Rolling Stones 15566337 41 | Fantasy football (American) 1090779 42 | My Little 
Pony: Friendship Is Magic fandom 735926 43 | Anxiety disorder 4699214 44 | Pita 1338254 45 | New York-style pizza 2081987 46 | Food truck 655594 47 | Tattoo 3664743 48 | Honda Civic 6757544 49 | Duramax V8 engine 1934085 50 | Fishing tackle 478009 51 | New York City 37724664 52 | Hostage 464802 53 | Insurance 8211916 54 | Higher education in the United States 1348209 55 | Leaning Tower of Pisa 7422423 56 | Agents of S.H.I.E.L.D. 17041567 57 | Reading (process) 1332328 58 | Dating 12379545 59 | Social anxiety 1266979 60 | Atlantic Ocean 5721579 61 | Biotin 3600612 62 | Yoga 9660641 63 | Meditation 4874467 64 | Macaroni and cheese 1682250 65 | English as a second or foreign language 1151254 66 | Blue 3827652 67 | Pot washing 32074 68 | Crochet 1344155 69 | Armadillo 4659111 70 | New York City 37724664 71 | Dodge 3340132 72 | Bathroom singing 84604 73 | Go-kart 663975 74 | Metallica 15902964 75 | The Last of the Mohicans (1992 film) 3813987 76 | Swimming 645217 77 | Depression (mood) 5088913 78 | Singing 2608718 79 | Fishkeeping 400565 80 | Beauty pageant 1734975 81 | Nicholas Sparks 4159815 82 | Chef 2173677 83 | Activism 1216281 84 | Del Taco 511892 85 | Prince (musician) 42161460 86 | Ford Mustang (first generation) 2606261 87 | Airbnb 7751574 88 | Physics 8200603 89 | Chocolate 5734017 90 | Blue Ridge Parkway 1330669 91 | Wedding cake 735753 92 | Led Zeppelin 18823514 93 | Sports in Philadelphia 362223 94 | Vanilla extract 890221 95 | Battle of Okinawa 6175338 96 | Nu metal 2870104 97 | Bank teller 565886 98 | Donald Trump 417127302 99 | Jainism 7969171 100 | Japan 33765520 101 | Bathroom singing 84604 102 | History of vegetarianism 345569 103 | Academic dress 1317111 104 | Bodybuilding 2559450 105 | Grocery store 1654977 106 | Beard 1959542 107 | Bodyboarding 378542 108 | Murder on the Orient Express (2017 film) 9836159 109 | Toyota 9039006 110 | Association football 13829467 111 | Track and field 3895842 112 | Lifeguard 505683 113 | Role-playing video game 2099904 114 | Acrophobia 2256002 115 | Reading (process) 1332328 116 | Veganism 6404632 117 | Animal shelter 1175962 118 | Tofu 4844659 119 | Porsche 6184097 120 | Fantasy football (American) 1090779 121 | Library 3248947 122 | Les Paul 1990157 123 | Finance 4342066 124 | Karaoke Revolution 121987 125 | Dancing with the Stars 1242082 126 | Fruitarianism 1279408 127 | Immigration to the United States 3941725 128 | Spice 2528778 129 | The Shawshank Redemption 14947995 130 | Vietnamese Pot-bellied 241350 131 | Denmark 18009526 132 | Public aquarium 227132 133 | Stepfather 169728 134 | Karaoke 1751670 135 | Sushi 5705171 136 | Auto mechanic 629011 137 | Seafood 1654698 138 | Jeopardy! 
5162800 139 | Pizza 6874547 140 | Masters of the Universe 1283163 141 | Titanic (1997 film) 21665576 142 | Toga party 258452 143 | Work–life balance 1456848 144 | 100 metres 4162077 145 | Lizard 5029851 146 | Wellington County, Ontario 141301 147 | Giant panda 10271361 148 | Go-kart 663975 149 | The Chainsmokers 9752293 150 | Veganism 6404632 151 | Gone with the Wind (film) 10029206 152 | Purple 2715649 153 | Kale 4124167 154 | Social anxiety 1266979 155 | Beastie Boys 9614636 156 | Blackjack 4299719 157 | London 28537990 158 | Kentucky Derby 3439758 159 | Pipe smoking 480616 160 | Metallica 15902964 161 | Tea processing 689180 162 | Blue 3827652 163 | Linebacker 1996470 164 | Steak 2617070 165 | Overwatch (video game) 8253107 166 | Puerto Rico 21597904 167 | Titanic (1997 film) 21665576 168 | Surf culture 441435 169 | IPhone 18946999 170 | Ice hockey 5147900 171 | Obsessive–compulsive disorder 8232417 172 | Fair 670756 173 | Spaghetti with meatballs 214486 174 | Glasses 2129352 175 | Weight training 1395672 176 | Butcher 705333 177 | Telenovela 1623710 178 | YouTube 69619308 179 | Veganism 6404632 180 | Partnership 2125839 181 | Arab cuisine 986678 182 | Vanilla 3068880 183 | Beach 1890555 184 | Human papillomavirus infection 4841867 185 | Miami Heat 5471563 186 | Running 1947540 187 | Wisconsin 7454562 188 | Carrot 3548156 189 | Grey's Anatomy 27648979 190 | Swimming stroke 581036 191 | Lindsey Stirling 5446995 192 | Coco Chanel 7867151 193 | Epilepsy 5807590 194 | Hiking 1797210 195 | Liquorice 3842441 196 | Miranda Lambert 7273065 197 | Anxiety disorder 4699214 198 | Fennel 4125421 199 | The Story So Far (band) 575060 200 | Illegal immigration to the United States 2423443 201 | Divorce 2897630 202 | Income inequality in the United States 1523700 203 | Blockbuster LLC 4235726 204 | Fair 670756 205 | Depression (mood) 5088913 206 | Corn dog 1112057 207 | Biology 7288491 208 | Ice cream 4410396 209 | Blue 3827652 210 | Chanel No. 
5 1311070 211 | Corn dog 1112057 212 | Pizza 6874547 213 | Pakistan 29145105 214 | Elementary school 999722 215 | Cruise ship 2350722 216 | I Love New York 607638 217 | South Park 12874224 218 | Chocolate 5734017 219 | Police officer 1803972 220 | Animal testing 1707509 221 | Marinara sauce 952446 222 | Veganism 6404632 223 | Ford Mustang 8161648 224 | Kale 4124167 225 | List of Downton Abbey characters 4432952 226 | Pecan pie 477609 227 | German Shepherd 10166109 228 | Marriage 6021835 229 | Bon Iver 4494197 230 | Hostage 464802 231 | Chocolate 5734017 232 | Acrophobia 2256002 233 | Jumbo slice 325175 234 | Bitcoin 35706513 235 | Ocean 5401523 236 | Fathers' rights movement by country 57041 237 | Skateboarding 2020981 238 | Justin Bieber 39739057 239 | Shrimp 2976964 240 | Pig farming 876232 241 | Tattoo 3664743 242 | Bank teller 565886 243 | Lesbian 8438858 244 | Obesity in the United States 1833122 245 | Pit bull 7320671 246 | Partnership 2125839 247 | The Bahamas 11576127 248 | Luxury yacht 507678 249 | Dog 21115517 250 | Paddy field 1312896 251 | Mental disorder 4870704 252 | Finance 4342066 253 | Science, technology, engineering, and mathematics 2250255 254 | Fertility factor (demography) 168755 255 | China 43380033 256 | Asakusa 502640 257 | Tupac Shakur 37793324 258 | Parenting styles 1004316 259 | Bulldog 5395755 260 | Reading (process) 1332328 261 | LeBron James 50652189 262 | Blue 3827652 263 | Herb 2105936 264 | Surfing 1906475 265 | Management of hair loss 400689 266 | Rise Against 2128340 267 | Pizza 6874547 268 | Alpine skiing 1002167 269 | Action-adventure game 1673273 270 | Tattoo 3664743 271 | Grounds for divorce (United States) 279430 272 | Nightclub 1591702 273 | Economy of Pittsburgh 136452 274 | McDonald's 16774593 275 | Hinduism 15495622 276 | Activism 1216281 277 | Top Chef 2684469 278 | Avengers (comics) 6736715 279 | Food truck 655594 280 | Chicago metropolitan area 1862922 281 | Swimming lessons 160148 282 | Wedding cake 735753 283 | Italian cuisine 2814286 284 | Rose 6162787 285 | Usain Bolt 18802314 286 | Kayaking 853351 287 | Ford Mustang 8161648 288 | Pizza 6874547 289 | Stepfather 169728 290 | Pink 2098244 291 | Multilingualism 1817923 292 | Blue 3827652 293 | Controversy and criticism of Jersey Shore 106514 294 | Choir 1240728 295 | Newspaper 4451940 296 | Corporate behaviour 178327 297 | Ford Mustang (first generation) 2606261 298 | Kindergarten 2754297 299 | Pizza 6874547 300 | Swimming 645217 301 | Radiology 1959453 302 | Piano 4874737 303 | Justin Bieber 39739057 304 | Pizza 6874547 305 | Food allergy 816602 306 | Yorkshire Terrier 3838224 307 | Stephen King 19387069 308 | Bitcoin 35706513 309 | White Christmas (weather) 351396 310 | Guitar 5765716 311 | My Little Pony 8126577 312 | Widow 726015 313 | Compulsive hoarding 1378634 314 | Surfing 1906475 315 | Hindu 3548143 316 | Marduk (band) 570631 317 | Roller coaster phobia 138748 318 | Pet 2558753 319 | Rock music 6387915 320 | South Park 12874224 321 | China 43380033 322 | Bandy 1503081 323 | Canada 44790425 324 | New York University 4367231 325 | Chico's Tacos 152081 326 | Wedding cake 735753 327 | Iguana 2054941 328 | New York-style pizza 2081987 329 | Fourth Baptist Christian School 8992 330 | Multilingualism 1817923 331 | Sushi 5705171 332 | Horse 8144634 333 | New Year's Eve 3073903 334 | Taco 2906013 335 | Yoga as exercise 361759 336 | World War II 55549788 337 | Teapot 365709 338 | History of autonomous cars 216317 339 | Mesoamerica 2704125 340 | Crochet hook 123800 341 | Ballet 2292512 342 | 
South Asia 6166460 343 | Dog training 847025 344 | Land of Oz 730680 345 | Parenting 1423118 346 | Track and field 3895842 347 | Plastic arts 358076 348 | Rosalia (festival) 91822 349 | Grilling 1084103 350 | Red hair 4380560 351 | My Little Pony 8126577 352 | Orphan 940122 353 | Apple pie 1195239 354 | Chicago-style pizza 1514762 355 | No-kill shelter 162196 356 | Swimming 645217 357 | Gender in youth sports 38863 358 | Cooking 2206345 359 | Pizza 6874547 360 | Seattle 11501444 361 | Characters of Final Fantasy X and X-2 440835 362 | Vermont 6266621 363 | Barbecue grill 490577 364 | Ballroom dance 1677171 365 | Ocean 5401523 366 | Italian cuisine 2814286 367 | Adoption 1288016 368 | Piano 4874737 369 | Hiking 1797210 370 | Superman 13251517 371 | Comic book 2145684 372 | Well-being contributing factors 181026 373 | Labrador Retriever 9292925 374 | New England 7404899 375 | Lizard 5029851 376 | Jamba Juice 535492 377 | Tiny house movement 992197 378 | World Heritage Site 6591340 379 | Lizard 5029851 380 | Chocolate 5734017 381 | Practice pad 41152 382 | Steak 2617070 383 | Cat 20440030 384 | 100 metres 4162077 385 | Korn 6303735 386 | Snapple 788011 387 | Radiohead 9150954 388 | Pipe smoking 480616 389 | Kindergarten 2754297 390 | Monarch butterfly 2891626 391 | Blue Ridge Parkway 1330669 392 | Nineteen Eighty-Four 16311633 393 | Family farm 188774 394 | Odor 908906 395 | Camping 1579382 396 | Iguana 2054941 397 | Christmas tree 3701463 398 | Tennis 5392040 399 | Leather 2872405 400 | John Chambers (make-up artist) 347333 401 | Justin Bieber 39739057 402 | Hospital 3055953 403 | Ageing 2048205 404 | Andy Murray 13901154 405 | Legal awareness 183262 406 | Cyanobacteria 3595456 407 | Swimming 645217 408 | Steak 2617070 409 | Whittling 245466 410 | Bank teller 565886 411 | Bank teller 565886 412 | Yoga 9660641 413 | Agoraphobia 7454666 414 | Coco Chanel 7867151 415 | Eighth Doctor 917496 416 | Cryptic crossword 624883 417 | Running 1947540 418 | Parrot 4166342 419 | Bob Ross 21059248 420 | Low back pain 2709541 421 | Daft Punk 8119663 422 | Veterinary physician 1004302 423 | Yoga 9660641 424 | Track and field 3895842 425 | Yellow 1871687 426 | Gladiator 2818542 427 | Ice hockey 5147900 428 | Beastie Boys 9614636 429 | Free Appropriate Public Education 230741 430 | The Rolling Stones 15566337 431 | Science, technology, engineering, and mathematics 2250255 432 | Colorado 8301350 433 | Pork 2164506 434 | Dungeons & Dragons 6649283 435 | Blue Ridge Parkway 1330669 436 | Mike Trout 5315410 437 | Ocean 5401523 438 | Underwater hockey 436406 439 | Spice 2528778 440 | Gibson Les Paul 2352115 441 | Murder on the Orient Express (2017 film) 9836159 442 | Swimming 645217 443 | Pita 1338254 444 | Dance 3905111 445 | Giant panda 10271361 446 | Onion 4068491 447 | Zumba 2023534 448 | Hindu 3548143 449 | Anthrax (American band) 2990109 450 | International adoption of South Korean children 197618 451 | Netflix 26821165 452 | Cut of beef 2379048 453 | The Hershey Company 2656480 454 | Emergency department 1327462 455 | Classical music 4081061 456 | Yellow 1871687 457 | Obesity in the United States 1833122 458 | Abe Pollin 123838 459 | Immigration to the United States 3941725 460 | History of American newspapers 606405 461 | Lance Armstrong 8681088 462 | Ford Mustang 8161648 463 | Dating 12379545 464 | Ford Mustang (first generation) 2606261 465 | Pizza 6874547 466 | Radiology 1959453 467 | Santa Fe, New Mexico 3500856 468 | Camping 1579382 469 | Communist Party USA 2485345 470 | Florida 13516911 471 | 
Trick-or-treating 2436386 472 | Vietnamese cuisine 1451959 473 | American Motors 1219419 474 | Les Paul 1990157 475 | Fur 1023405 476 | University of Chicago 3653847 477 | Medieval cuisine 1409915 478 | Cat 20440030 479 | Game of Thrones 66975540 480 | Historical fiction 1399757 481 | Choir 1240728 482 | Arts in Seattle 43843 483 | Magic square 1853666 484 | Gone with the Wind (film) 10029206 485 | Pug 6059374 486 | Extreme Couponing 219509 487 | Bachelor of Science in Nursing 606634 488 | Dog 21115517 489 | Cut of beef 2379048 490 | Anxiety disorder 4699214 491 | Brewery 776666 492 | Reality television 3338151 493 | Travel 2377943 494 | Reading (process) 1332328 495 | Chicken McNuggets 841115 496 | Jimi Hendrix 18313612 497 | Vegetarianism 3703523 498 | Pizza 6874547 499 | Radiology 1959453 500 | The Story So Far (band) 575060 501 | -------------------------------------------------------------------------------- /emnlp_data/wow/random_testset/unseen_topic_pageviews.txt: -------------------------------------------------------------------------------- 1 | Accounting 4561192 2 | Hot dog 3698391 3 | Online shopping 3920354 4 | John Grisham 4108973 5 | Popcorn 3047303 6 | Guns N' Roses 12138504 7 | Harry Potter 33498151 8 | Green 2607165 9 | Elvis Presley 36970826 10 | Harry Potter 33498151 11 | Hound Dog (song) 1004552 12 | American football 10478739 13 | Elvis Presley 36970826 14 | Guns N' Roses 12138504 15 | Attention deficit hyperactivity disorder 8491220 16 | Online shopping 3920354 17 | Early history of American football 217358 18 | American football 10478739 19 | Old Faithful Museum of Thermal Activity 9125 20 | Game design 799287 21 | Relish 756540 22 | Game design 799287 23 | Green 2607165 24 | Ireland 18043996 25 | Genghis Khan 20406918 26 | Broken heart 1481439 27 | American football 10478739 28 | Water skiing 419784 29 | Dylan's Candy Bar 338014 30 | Guns N' Roses 12138504 31 | Green 2607165 32 | Harry Potter 33498151 33 | Ten-pin bowling 1318688 34 | American football 10478739 35 | Trapping 724283 36 | Rottweiler 8100660 37 | Online shopping 3920354 38 | The Walking Dead (TV series) 39648564 39 | Archery 2156917 40 | Insane Clown Posse 4525891 41 | Ireland 18043996 42 | Cod 3134543 43 | Poaching 1545190 44 | Accounting 4561192 45 | Skiing 1880306 46 | Accounting 4561192 47 | Columbia River 2317442 48 | National Football League on television 625789 49 | Irrealism (the arts) 27065 50 | Horrorcore 1380387 51 | American football 10478739 52 | Broken heart 1481439 53 | Jazz 7245064 54 | Accounting 4561192 55 | Hunting 2013242 56 | Guns N' Roses 12138504 57 | Accounting 4561192 58 | Dylan's Candy Bar 338014 59 | Green 2607165 60 | Elvis Presley 36970826 61 | Japanese language 8978480 62 | Stock exchange 3461353 63 | To Kill a Mockingbird 12654725 64 | Cardigan (sweater) 806462 65 | Shades of green 2803743 66 | Cycling 1657944 67 | Political positions of Hillary Clinton 983823 68 | Archery 2156917 69 | Nickelback 4796561 70 | Paramedic 1214132 71 | The Walking Dead (TV series) 39648564 72 | Cheerleading 2601255 73 | Archery 2156917 74 | Harry Potter 33498151 75 | Water skiing 419784 76 | Hunting 2013242 77 | Rock N Roll Experience Magazine 8220 78 | Green 2607165 79 | American football 10478739 80 | Memphis Mafia 613020 81 | Regional street food 180107 82 | John Grisham 4108973 83 | Motivation 4758206 84 | Neurosurgery 1703924 85 | Insane Clown Posse 4525891 86 | List of national parks of the United States 7548317 87 | National Parks of Canada 215749 88 | Snowflake 1032971 89 | 
Nickelback 4796561 90 | Dylan's Candy Bar 338014 91 | Dylan's Candy Bar 338014 92 | Bowling 2012141 93 | Instagram 24175136 94 | Broken heart 1481439 95 | On-again, off-again relationship 511036 96 | American football 10478739 97 | Bowling 2012141 98 | Hunting 2013242 99 | Motivation 4758206 100 | Harry Potter 33498151 101 | Skiing 1880306 102 | Oregon Trail 3340493 103 | Major general 1171137 104 | Motivation 4758206 105 | Nickelback 4796561 106 | Dog 21115517 107 | Stock market 5662980 108 | Ski 535770 109 | The Walking Dead (TV series) 39648564 110 | Cheerleading 2601255 111 | Instagram 24175136 112 | Fox hunting 1142024 113 | Dylan's Candy Bar 338014 114 | Goldendoodle 3954294 115 | Hedge 484218 116 | Interpersonal communication 1690808 117 | Hunting 2013242 118 | Axl Rose 10541775 119 | Dylan's Candy Bar 338014 120 | Irish coffee 1213386 121 | Skype 6841400 122 | Cheerleading 2601255 123 | Broken heart 1481439 124 | Bowling 2012141 125 | Hunting 2013242 126 | Field electron emission 339182 127 | Elvis Presley 36970826 128 | Cod 3134543 129 | Fantastic Beasts and Where to Find Them (film) 24112670 130 | Cheerleading 2601255 131 | Hunting 2013242 132 | Bob Ross 21059248 133 | Accounting 4561192 134 | Accounting 4561192 135 | Google Chrome 24567422 136 | Guns N' Roses 12138504 137 | Black Friday (shopping) 11247562 138 | Cheerleading 2601255 139 | Skiing 1880306 140 | Hot dog 3698391 141 | Guns N' Roses 12138504 142 | American football 10478739 143 | Memphis, Tennessee 4997280 144 | Honey bee 3832925 145 | Blog 8108460 146 | Instagram 24175136 147 | Archery 2156917 148 | Nickelback 4796561 149 | Dylan's Candy Bar 338014 150 | Green 2607165 151 | Green 2607165 152 | Kendrick Lamar 19668434 153 | Northern Ireland 11331844 154 | Popcorn 3047303 155 | Music festival 782112 156 | Accounting 4561192 157 | Green 2607165 158 | Academic dress of universities in Queensland, Australia 33629 159 | Shades of green 2803743 160 | Green 2607165 161 | Accounting 4561192 162 | Blue cheese 2120264 163 | Broken heart 1481439 164 | Elvis Presley 36970826 165 | Irish Americans 1565967 166 | Bowling 2012141 167 | Italian cuisine 2814286 168 | Formula One car 2366007 169 | Green 2607165 170 | Blog 8108460 171 | Broken heart 1481439 172 | Motivation 4758206 173 | Instagram 24175136 174 | Motivation 4758206 175 | Goldendoodle 3954294 176 | Insane Clown Posse 4525891 177 | Auto racing 1840090 178 | Green 2607165 179 | Green 2607165 180 | Hot dog 3698391 181 | Accounting 4561192 182 | Genghis Khan 20406918 183 | Skiing 1880306 184 | Cod 3134543 185 | Phase-out of incandescent light bulbs 471753 186 | Cardigan (sweater) 806462 187 | Skiing 1880306 188 | Poultry farming 1937486 189 | Archery 2156917 190 | Cod 3134543 191 | Regional street food 180107 192 | Green 2607165 193 | Accounting 4561192 194 | Trail riding 147914 195 | Luca Pacioli 857427 196 | Death Eater 2356961 197 | Atlantic cod 945138 198 | Yes (band) 5616061 199 | Red meat 1967240 200 | Green 2607165 201 | Broken heart 1481439 202 | Hunting 2013242 203 | Bowling 2012141 204 | On-again, off-again relationship 511036 205 | Nickelback 4796561 206 | Ireland 18043996 207 | Game design 799287 208 | Bowling 2012141 209 | Skiing 1880306 210 | Cheerleading 2601255 211 | Green 2607165 212 | History of skiing 345690 213 | Eminem 43338165 214 | Ten-pin bowling 1318688 215 | Bowling 2012141 216 | Instagram 24175136 217 | Motivation 4758206 218 | Hunting 2013242 219 | Bodybuilding supplement 845375 220 | Medical billing 518915 221 | Skiing 1880306 222 | John Grisham 
4108973 223 | Ireland 18043996 224 | Waterfowl hunting 269450 225 | List of national parks of the United States 7548317 226 | Accounting 4561192 227 | Bowling 2012141 228 | Cardigan (sweater) 806462 229 | Green 2607165 230 | Skiing 1880306 231 | List of awards and nominations received by Michael Jackson 829559 232 | Skiing 1880306 233 | Green 2607165 234 | Nickelback 4796561 235 | Cheerleading 2601255 236 | Hunting 2013242 237 | Cardigan (sweater) 806462 238 | Accounting 4561192 239 | Motivation 4758206 240 | List of national parks of the United States 7548317 241 | History of QubicaAMF Bowling World Cup 15318 242 | Green 2607165 243 | Guns N' Roses 12138504 244 | Instagram 24175136 245 | List of national parks of the United States 7548317 246 | Portrait of an Army Doctor 13741 247 | Green 2607165 248 | Crossbow 2103185 249 | Dog training 847025 250 | Harry Potter 33498151 251 | American football 10478739 252 | Discovery Channel 2518000 253 | Japanese language 8978480 254 | Hard rock 2532151 255 | Cheerleading 2601255 256 | Nickelback 4796561 257 | Cardigan (sweater) 806462 258 | History of health care reform in the United States 445207 259 | Grammy Award 7328535 260 | American football 10478739 261 | Five-pin bowling 298048 262 | Harry Potter influences and analogues 393906 263 | Cross-country skiing (sport) 279615 264 | Heart Broken 57754 265 | Rugby union in Germany 61369 266 | Indianapolis 500 2712848 267 | Accounting 4561192 268 | Archery 2156917 269 | Accounting 4561192 270 | Hunting 2013242 271 | Whitehaven, Memphis, Tennessee 67662 272 | Elvis Presley 36970826 273 | American football 10478739 274 | Skiing 1880306 275 | Richard Nixon 27123206 276 | Japanese language 8978480 277 | International Financial Reporting Standards 2099309 278 | Nachos 1417170 279 | Gymnastics 3142318 280 | The Walking Dead (TV series) 39648564 281 | Instagram 24175136 282 | Waterfowl hunting 269450 283 | Kurt Cobain 20514337 284 | Green 2607165 285 | Gymnastics 3142318 286 | American football 10478739 287 | Accounting 4561192 288 | Ireland 18043996 289 | Bowling 2012141 290 | Green 2607165 291 | Cheerleading 2601255 292 | Medici Bank 587752 293 | American football 10478739 294 | Skittles (sport) 421753 295 | Electric guitar 2282511 296 | Bowling 2012141 297 | Rick Grimes 3494233 298 | Blog 8108460 299 | American football 10478739 300 | History of Crayola crayons 206577 301 | Instagram 24175136 302 | The Walking Dead (TV series) 39648564 303 | Harry Potter 33498151 304 | Thierry Henry 10254331 305 | Elvis Presley 36970826 306 | The Walking Dead (TV series) 39648564 307 | Hunting 2013242 308 | Bodybuilding supplement 845375 309 | Bodybuilding supplement 845375 310 | Green 2607165 311 | Winter 3146653 312 | Cheerleading 2601255 313 | Privacy concerns with social networking services 668582 314 | Ireland 18043996 315 | Elvis Presley 36970826 316 | Popcorn 3047303 317 | Green 2607165 318 | Skiing 1880306 319 | Blog 8108460 320 | Intercity bus service 200478 321 | Elvis Presley 36970826 322 | Neurosurgery 1703924 323 | Cheerleading 2601255 324 | Nickelback 4796561 325 | Winter 3146653 326 | Cardigan (sweater) 806462 327 | Cheerleading 2601255 328 | American football 10478739 329 | Cheerleading 2601255 330 | Genghis Khan 20406918 331 | Tom Brady 41656273 332 | Skiing 1880306 333 | High Elves (Warhammer) 135086 334 | Skiing 1880306 335 | The Walking Dead (TV series) 39648564 336 | Grunge 3797656 337 | Gymnastics 3142318 338 | Freestyle skateboarding tricks 146839 339 | Elvis impersonator 305174 340 | Consulting firm 
624348 341 | Goldendoodle 3954294 342 | Instagram 24175136 343 | Bodybuilding supplement 845375 344 | Hunting 2013242 345 | Elvis impersonator 305174 346 | American football 10478739 347 | Green 2607165 348 | Cheerleading 2601255 349 | Transformers (film) 6835314 350 | Medieval cuisine 1409915 351 | Green 2607165 352 | Liberty Tax Service 86233 353 | Online shopping 3920354 354 | Herbal 305698 355 | Cheerleading 2601255 356 | Ireland 18043996 357 | Skiing 1880306 358 | American football 10478739 359 | Great Famine (Ireland) 6566470 360 | Cycling 1657944 361 | American football 10478739 362 | The Beatles 28659241 363 | Kurt Cobain 20514337 364 | Preterm birth 2029449 365 | List of Shrek characters 1577478 366 | Dylan's Candy Bar 338014 367 | Gymnastics 3142318 368 | Accounting 4561192 369 | Bee 4593517 370 | Narcissus (plant) 2129973 371 | Online shopping 3920354 372 | Tea 5620471 373 | Bodybuilding supplement 845375 374 | The Walking Dead (TV series) 39648564 375 | Heartbreak Hotel 597104 376 | Stock market 5662980 377 | Ethics of eating meat 617002 378 | Cheerleading 2601255 379 | Poaching 1545190 380 | Discovery Channel 2518000 381 | Ebates 318758 382 | Instagram 24175136 383 | Zombie apocalypse 1196207 384 | Color wheel 3196003 385 | Green 2607165 386 | Elvis Presley 36970826 387 | Green 2607165 388 | Bowling 2012141 389 | Stock market 5662980 390 | Accounting 4561192 391 | Elvis Presley 36970826 392 | Medical billing 518915 393 | Broken heart 1481439 394 | Winter 3146653 395 | History of skiing 345690 396 | Game design 799287 397 | The Walking Dead (TV series) 39648564 398 | Chromium 3730713 399 | Irish coffee 1213386 400 | Broken heart 1481439 401 | Bowling 2012141 402 | Green 2607165 403 | Drama school 153875 404 | Broken heart 1481439 405 | Neurosurgery 1703924 406 | Elvis Presley 36970826 407 | Accounting 4561192 408 | Online shopping 3920354 409 | Skiing 1880306 410 | Dirt track racing 390040 411 | Nickelback 4796561 412 | American football 10478739 413 | Inca road system 384952 414 | Kendrick Lamar 19668434 415 | Motivation 4758206 416 | Non-profit hospital 249270 417 | Skiing 1880306 418 | Cheerleading 2601255 419 | The Walking Dead (video game) 2694059 420 | History of tea 974174 421 | Gymnastics 3142318 422 | Green 2607165 423 | Auto racing 1840090 424 | Gymnastics 3142318 425 | Hot dog 3698391 426 | Samsung Galaxy S III 1650132 427 | Kurt Cobain 20514337 428 | Bowling alley 172742 429 | Broken heart 1481439 430 | Japanese language 8978480 431 | Elvis Presley 36970826 432 | Stock market 5662980 433 | Motivation 4758206 434 | John Grisham 4108973 435 | Broken heart 1481439 436 | Chlorophyll 3068989 437 | Harry Potter 33498151 438 | American football 10478739 439 | Skiing 1880306 440 | Cardigan (sweater) 806462 441 | Rose (color) 635243 442 | Shades of green 2803743 443 | Instagram 24175136 444 | Photosynthesis 8445453 445 | Green 2607165 446 | The Walking Dead (TV series) 39648564 447 | Pink 2098244 448 | Cycling 1657944 449 | American football rules 2179492 450 | Elvis Presley 36970826 451 | Tastes like chicken 291582 452 | Motivation 4758206 453 | Kurt Cobain 20514337 454 | Ireland 18043996 455 | Suicide of Kurt Cobain 3883373 456 | Green 2607165 457 | Elvis Presley 36970826 458 | Bowling 2012141 459 | Elvis Presley 36970826 460 | Motivation 4758206 461 | Cheerleading 2601255 462 | Harry Potter 33498151 463 | Nickelback 4796561 464 | Split (bowling) 181007 465 | Latin influence in English 459062 466 | Beef 3021529 467 | Broken heart 1481439 468 | List of national parks of the 
United States 7548317 469 | Motivation 4758206 470 | Cheerleading 2601255 471 | Green 2607165 472 | Dylan's Candy Bar 338014 473 | Benjamin Spock 900469 474 | American football 10478739 475 | John Grisham 4108973 476 | Bowling 2012141 477 | Video game music 909028 478 | Bowling 2012141 479 | Harry Potter 33498151 480 | Game design 799287 481 | Guns N' Roses 12138504 482 | Skiing 1880306 483 | American football 10478739 484 | Pet Sounds 3460211 485 | The Walking Dead (TV series) 39648564 486 | Green 2607165 487 | Hunting 2013242 488 | Paramedic 1214132 489 | Genghis Khan 20406918 490 | Ireland 18043996 491 | Dallas Cowboys 7695826 492 | Genghis Khan 20406918 493 | Skiing 1880306 494 | Paramedic 1214132 495 | Elvis Presley 36970826 496 | List of Hello Kitty television series 118424 497 | Chihuahua (dog) 5744542 498 | Winter 3146653 499 | Nickelback 4796561 500 | Skiing 1880306 501 | -------------------------------------------------------------------------------- /env/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChanLiang/CONNER/77f99c876bdc6ca8cb3991210e2ccc2914d4971b/env/.DS_Store -------------------------------------------------------------------------------- /env/coherence_environment.yml: -------------------------------------------------------------------------------- 1 | name: coherence 2 | channels: 3 | - defaults 4 | dependencies: 5 | - _libgcc_mutex=0.1=main 6 | - _openmp_mutex=5.1=1_gnu 7 | - bzip2=1.0.8=h7b6447c_0 8 | - ca-certificates=2023.05.30=h06a4308_0 9 | - ld_impl_linux-64=2.38=h1181459_1 10 | - libffi=3.4.4=h6a678d5_0 11 | - libgcc-ng=11.2.0=h1234567_1 12 | - libgomp=11.2.0=h1234567_1 13 | - libstdcxx-ng=11.2.0=h1234567_1 14 | - libuuid=1.41.5=h5eee18b_0 15 | - ncurses=6.4=h6a678d5_0 16 | - openssl=1.1.1t=h7f8727e_0 17 | - pip=23.0.1=py311h06a4308_0 18 | - python=3.11.3=h7a1cb2a_0 19 | - readline=8.2=h5eee18b_0 20 | - setuptools=67.8.0=py311h06a4308_0 21 | - sqlite=3.41.2=h5eee18b_0 22 | - tk=8.6.12=h1ccaba5_0 23 | - wheel=0.38.4=py311h06a4308_0 24 | - xz=5.4.2=h5eee18b_0 25 | - zlib=1.2.13=h5eee18b_0 26 | - pip: 27 | - aiohttp==3.8.4 28 | - aiosignal==1.3.1 29 | - async-timeout==4.0.2 30 | - attrs==23.1.0 31 | - blis==0.7.9 32 | - catalogue==2.0.8 33 | - certifi==2023.5.7 34 | - charset-normalizer==3.1.0 35 | - click==8.1.3 36 | - confection==0.0.4 37 | - cymem==2.0.7 38 | - datasets==2.12.0 39 | - dill==0.3.6 40 | - en-core-web-sm==3.5.0 41 | - filelock==3.12.1 42 | - frozenlist==1.3.3 43 | - fsspec==2023.6.0 44 | - huggingface-hub==0.15.1 45 | - idna==3.4 46 | - jinja2==3.1.2 47 | - joblib==1.2.0 48 | - langcodes==3.3.0 49 | - markupsafe==2.1.3 50 | - multidict==6.0.4 51 | - multiprocess==0.70.14 52 | - murmurhash==1.0.9 53 | - nltk==3.8.1 54 | - numpy==1.24.3 55 | - nvidia-cuda-nvrtc-cu11==11.7.99 56 | - nvidia-cuda-runtime-cu11==11.7.99 57 | - nvidia-cudnn-cu11==8.5.0.96 58 | - packaging==23.1 59 | - pandas==2.0.2 60 | - pathy==0.10.1 61 | - preshed==3.0.8 62 | - pyarrow==12.0.0 63 | - pydantic==1.10.9 64 | - python-dateutil==2.8.2 65 | - pytz==2023.3 66 | - pyyaml==6.0 67 | - regex==2023.6.3 68 | - requests==2.31.0 69 | - responses==0.18.0 70 | - safetensors==0.3.1 71 | - scikit-learn==1.2.2 72 | - scipy==1.10.1 73 | - sentencepiece==0.1.99 74 | - sgnlp==0.4.0 75 | - six==1.16.0 76 | - smart-open==6.3.0 77 | - spacy==3.5.3 78 | - spacy-legacy==3.0.12 79 | - spacy-loggers==1.0.4 80 | - srsly==2.4.6 81 | - thinc==8.1.10 82 | - threadpoolctl==3.1.0 83 | - tokenizers==0.13.3 84 | - 
torch==1.13.1 85 | - torchtext==0.6.0 86 | - tqdm==4.65.0 87 | - transformers==4.30.1 88 | - typer==0.7.0 89 | - typing-extensions==4.6.3 90 | - tzdata==2023.3 91 | - urllib3==2.0.3 92 | - wasabi==1.1.2 93 | - xxhash==3.2.0 94 | - yarl==1.9.2 95 | prefix: /misc/kfdata01/kf_grp/lchen/anaconda3/envs/coherence 96 | -------------------------------------------------------------------------------- /env/environment.yml: -------------------------------------------------------------------------------- 1 | name: FactualityPrompt 2 | channels: 3 | - defaults 4 | dependencies: 5 | - _libgcc_mutex=0.1=main 6 | - _openmp_mutex=5.1=1_gnu 7 | - bzip2=1.0.8=h7b6447c_0 8 | - ca-certificates=2023.01.10=h06a4308_0 9 | - certifi=2022.12.7=py310h06a4308_0 10 | - ld_impl_linux-64=2.38=h1181459_1 11 | - libffi=3.4.2=h6a678d5_6 12 | - libgcc-ng=11.2.0=h1234567_1 13 | - libgomp=11.2.0=h1234567_1 14 | - libstdcxx-ng=11.2.0=h1234567_1 15 | - libuuid=1.41.5=h5eee18b_0 16 | - ncurses=6.4=h6a678d5_0 17 | - openssl=1.1.1t=h7f8727e_0 18 | - pip=22.3.1=py310h06a4308_0 19 | - python=3.10.9=h7a1cb2a_0 20 | - readline=8.2=h5eee18b_0 21 | - setuptools=65.6.3=py310h06a4308_0 22 | - sqlite=3.40.1=h5082296_0 23 | - tk=8.6.12=h1ccaba5_0 24 | - tzdata=2022g=h04d1e81_0 25 | - wheel=0.38.4=py310h06a4308_0 26 | - xz=5.2.10=h5eee18b_1 27 | - zlib=1.2.13=h5eee18b_0 28 | - pip: 29 | - absl-py==1.4.0 30 | - antlr4-python3-runtime==4.8 31 | - astunparse==1.6.3 32 | - beautifulsoup4==4.11.2 33 | - benepar==0.2.0 34 | - bitarray==2.7.3 35 | - blis==0.7.9 36 | - bs4==0.0.1 37 | - cachetools==5.3.0 38 | - catalogue==2.0.8 39 | - cffi==1.15.1 40 | - charset-normalizer==3.0.1 41 | - click==8.1.3 42 | - colorama==0.4.6 43 | - common==0.1.2 44 | - common-utils==2.0.1.dev1 45 | - confection==0.0.4 46 | - cymem==2.0.7 47 | - cysignals==1.11.2 48 | - cython==0.29.33 49 | - editdistance==0.6.2 50 | - en-core-web-sm==3.5.0 51 | - fairseq==0.12.2 52 | - fever-drqa==1.0.13 53 | - filelock==3.9.0 54 | - flatbuffers==23.1.21 55 | - future==0.18.3 56 | - gast==0.4.0 57 | - google-api-core==2.11.0 58 | - google-api-python-client==2.83.0 59 | - google-auth==2.16.1 60 | - google-auth-httplib2==0.1.0 61 | - google-auth-oauthlib==0.4.6 62 | - google-pasta==0.2.0 63 | - googleapis-common-protos==1.59.0 64 | - grpcio==1.51.3 65 | - h5py==3.8.0 66 | - httplib2==0.22.0 67 | - huggingface-hub==0.12.1 68 | - hydra-core==1.0.7 69 | - idna==3.4 70 | - jaraco-context==4.3.0 71 | - jinja2==3.1.2 72 | - joblib==1.2.0 73 | - keras==2.11.0 74 | - langcodes==3.3.0 75 | - libclang==15.0.6.1 76 | - lxml==4.9.2 77 | - markdown==3.4.1 78 | - markupsafe==2.1.2 79 | - more-itertools==9.1.0 80 | - murmurhash==1.0.9 81 | - nltk==3.8.1 82 | - numpy==1.24.2 83 | - nvidia-cuda-nvrtc-cu11==11.7.99 84 | - nvidia-cuda-runtime-cu11==11.7.99 85 | - nvidia-cudnn-cu11==8.5.0.96 86 | - oauthlib==3.2.2 87 | - omegaconf==2.0.6 88 | - opt-einsum==3.3.0 89 | - packaging==23.0 90 | - pandas==1.5.3 91 | - pathy==0.10.1 92 | - pexpect==4.8.0 93 | - pillow==9.4.0 94 | - portalocker==2.7.0 95 | - preshed==3.0.8 96 | - prettytable==3.6.0 97 | - protobuf==3.19.6 98 | - ptyprocess==0.7.0 99 | - pyasn1==0.4.8 100 | - pyasn1-modules==0.2.8 101 | - pycparser==2.21 102 | - pydantic==1.10.5 103 | - pyparsing==3.0.9 104 | - python-dateutil==2.8.2 105 | - pytz==2022.7.1 106 | - pyyaml==6.0 107 | - rank-bm25==0.2.2 108 | - regex==2022.10.31 109 | - requests==2.28.2 110 | - requests-oauthlib==1.3.1 111 | - rsa==4.9 112 | - sacrebleu==2.3.1 113 | - scikit-learn==1.2.1 114 | - scipy==1.10.1 115 | - 
sentence-transformers==2.2.2 116 | - sentencepiece==0.1.97 117 | - six==1.16.0 118 | - smart-open==6.3.0 119 | - soupsieve==2.4 120 | - spacy==3.5.0 121 | - spacy-legacy==3.0.12 122 | - spacy-loggers==1.0.4 123 | - srsly==2.4.5 124 | - tabulate==0.9.0 125 | - tensorboard==2.11.2 126 | - tensorboard-data-server==0.6.1 127 | - tensorboard-plugin-wit==1.8.1 128 | - tensorflow==2.11.0 129 | - tensorflow-estimator==2.11.0 130 | - tensorflow-io-gcs-filesystem==0.30.0 131 | - termcolor==2.2.0 132 | - thefuzz==0.19.0 133 | - thinc==8.1.7 134 | - threadpoolctl==3.1.0 135 | - tokenizers==0.13.2 136 | - torch==1.13.1 137 | - torch-struct==0.5 138 | - torchaudio==0.13.1 139 | - torchvision==0.14.1 140 | - tqdm==4.64.1 141 | - transformers==4.26.1 142 | - typer==0.7.0 143 | - typing-extensions==4.5.0 144 | - uritemplate==4.1.1 145 | - urllib3==1.26.14 146 | - wasabi==1.1.1 147 | - wcwidth==0.2.6 148 | - werkzeug==2.2.3 149 | - wikipedia==1.4.0 150 | - wolframalpha==5.0.0 151 | - wrapt==1.14.1 152 | - xmltodict==0.13.0 153 | prefix: /misc/kfdata01/kf_grp/lchen/anaconda3/envs/FactualityPrompt 154 | -------------------------------------------------------------------------------- /framework.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChanLiang/CONNER/77f99c876bdc6ca8cb3991210e2ccc2914d4971b/framework.png -------------------------------------------------------------------------------- /scripts/helpfulness/nq_random_knowledge.sh: -------------------------------------------------------------------------------- 1 | 2 | exp_name=nq_llama_65B_random_knowledge 3 | task=nq 4 | 5 | # debug=True 6 | debug=False 7 | 8 | testfile=../emnlp23/emnlp_data/nq/random_testset/nq_test_random_testset.txt 9 | promptfile=../emnlp23/emnlp_data/nq/random_prompts/nq_test_random_prompt.txt 10 | 11 | downstream_model=llama-65B 12 | zero_shot=False 13 | knowledge_type=random_knowledge 14 | 15 | export TRANSFORMERS_CACHE='YOUR_DIR' 16 | export HF_HOME='YOUR_DIR' 17 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 18 | 19 | python3 -u helpfulness.py \ 20 | --exp_name $exp_name \ 21 | --task $task \ 22 | --zero_shot $zero_shot \ 23 | --debug $debug \ 24 | --testfile $testfile \ 25 | --promptfile $promptfile \ 26 | --downstream_model $downstream_model \ 27 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 28 | 29 | 30 | -------------------------------------------------------------------------------- /scripts/helpfulness/nq_w_hyp_knowledge.sh: -------------------------------------------------------------------------------- 1 | for name in your_predictions_dir 2 | do 3 | 4 | exp_name=${name}_w_hyp_knowledge 5 | # debug=True 6 | debug=False 7 | 8 | testfile=emnlp_data/nq/random_testset/nq_test_random_testset.txt 9 | promptfile=./emnlp_data/nq/random_prompts/nq_test_random_prompt.txt 10 | hyp_knowledge="${name}_w_hyp_knowledge" 11 | 12 | # downstream_model=flan-t5-xxl 13 | downstream_model=llama-65B 14 | knowledge_type=w_hyp_knowledge 15 | zero_shot=False 16 | 17 | export TRANSFORMERS_CACHE='YOUR_DIR' 18 | export HF_HOME='YOUR_DIR' 19 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 20 | 21 | python3 -u helpfulness.py \ 22 | --exp_name $exp_name \ 23 | --task nq \ 24 | --zero_shot $zero_shot \ 25 | --debug $debug \ 26 | --testfile $testfile \ 27 | --hyp_knowledge $hyp_knowledge \ 28 | --promptfile $promptfile \ 29 | --downstream_model $downstream_model \ 30 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 31 | 32 | wait 33 | 34 | done 
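35 | 36 | # Usage sketch (the prediction directory in the for-loop above and the --hyp_knowledge file name are placeholders/assumptions to replace with your own paths): 37 | #   bash scripts/helpfulness/nq_w_hyp_knowledge.sh 38 | # The per-run summary is expected under helpfulness_results/${exp_name}.txt, which is the file read by scripts/helpfulness/view_results.sh.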
-------------------------------------------------------------------------------- /scripts/helpfulness/nq_w_ref_knowledge.sh: -------------------------------------------------------------------------------- 1 | export http_proxy="http://star-proxy.oa.com:3128" 2 | export https_proxy="http://star-proxy.oa.com:3128" 3 | export ftp_proxy="http://star-proxy.oa.com:3128" 4 | export no_proxy=".woa.com,mirrors.cloud.tencent.com,tlinux-mirror.tencent-cloud.com,tlinux-mirrorlist.tencent-cloud.com,localhost,127.0.0.1,mirrors-tlinux.tencentyun.com,.oa.com,.local,.3gqq.com,.7700.org,.ad.com,.ada_sixjoy.com,.addev.com,.app.local,.apps.local,.aurora.com,.autotest123.com,.bocaiwawa.com,.boss.com,.cdc.com,.cdn.com,.cds.com,.cf.com,.cjgc.local,.cm.com,.code.com,.datamine.com,.dvas.com,.dyndns.tv,.ecc.com,.expochart.cn,.expovideo.cn,.fms.com,.great.com,.hadoop.sec,.heme.com,.home.com,.hotbar.com,.ibg.com,.ied.com,.ieg.local,.ierd.com,.imd.com,.imoss.com,.isd.com,.isoso.com,.itil.com,.kao5.com,.kf.com,.kitty.com,.lpptp.com,.m.com,.matrix.cloud,.matrix.net,.mickey.com,.mig.local,.mqq.com,.oiweb.com,.okbuy.isddev.com,.oss.com,.otaworld.com,.paipaioa.com,.qqbrowser.local,.qqinternal.com,.qqwork.com,.rtpre.com,.sc.oa.com,.sec.com,.server.com,.service.com,.sjkxinternal.com,.sllwrnm5.cn,.sng.local,.soc.com,.t.km,.tcna.com,.teg.local,.tencentvoip.com,.tenpayoa.com,.test.air.tenpay.com,.tr.com,.tr_autotest123.com,.vpn.com,.wb.local,.webdev.com,.webdev2.com,.wizard.com,.wqq.com,.wsd.com,.sng.com,.music.lan,.mnet2.com,.tencentb2.com,.tmeoa.com,.pcg.com,www.wip3.adobe.com,www-mm.wip3.adobe.com,mirrors.tencent.com,csighub.tencentyun.com" 5 | 6 | task=nq 7 | exp_name=nq_llama_65B_w_ref_knowledge 8 | 9 | # debug=True 10 | debug=False 11 | testfile=../emnlp23/emnlp_data/nq/random_testset/nq_test_random_testset.txt 12 | promptfile=../emnlp23/emnlp_data/nq/random_prompts/nq_test_random_prompt.txt 13 | 14 | downstream_model=llama-65B 15 | knowledge_type=w_ref_knowledge 16 | zero_shot=False 17 | 18 | export TRANSFORMERS_CACHE='YOUR_DIR' 19 | export HF_HOME='YOUR_DIR' 20 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 21 | 22 | # export CUDA_VISIBLE_DEVICES=1,2,3 23 | python3 -u helpfulness.py \ 24 | --exp_name $exp_name \ 25 | --task $task \ 26 | --zero_shot $zero_shot \ 27 | --debug $debug \ 28 | --testfile $testfile \ 29 | --promptfile $promptfile \ 30 | --downstream_model $downstream_model \ 31 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 32 | -------------------------------------------------------------------------------- /scripts/helpfulness/nq_wo_knowledge.sh: -------------------------------------------------------------------------------- 1 | 2 | exp_name="YOUR_EXP_NAME" 3 | task=nq 4 | 5 | # debug=True 6 | debug=False 7 | testfile=../emnlp23/emnlp_data/nq/random_testset/nq_test_random_testset.txt 8 | promptfile=../emnlp23/emnlp_data/nq/random_prompts/nq_test_random_prompt.txt 9 | 10 | # downstream_model=flan-t5-xxl 11 | downstream_model=llama-65B 12 | zero_shot=False 13 | knowledge_type=wo_knowledge 14 | 15 | export TRANSFORMERS_CACHE='YOUR_DIR' 16 | export HF_HOME='YOUR_DIR' 17 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 18 | 19 | python3 -u helpfulness.py \ 20 | --exp_name $exp_name \ 21 | --task $task \ 22 | --zero_shot $zero_shot \ 23 | --debug $debug \ 24 | --testfile $testfile \ 25 | --promptfile $promptfile \ 26 | --downstream_model $downstream_model \ 27 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 28 | 29 | -------------------------------------------------------------------------------- 
/scripts/helpfulness/view_results.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in "YOUR_EXP_DIR" 3 | do 4 | 5 | echo $name 6 | exp_name=${name}_w_hyp_knowledge 7 | tail -2 helpfulness_results/${exp_name}.txt 8 | echo 9 | 10 | done 11 | 12 | -------------------------------------------------------------------------------- /scripts/helpfulness/wow_random_knowledge.sh: -------------------------------------------------------------------------------- 1 | exp_name=wow_helpfulness_random_knowledge 2 | task=wow 3 | 4 | # debug=True 5 | debug=False 6 | testfile=../emnlp23/emnlp_data/wow/random_testset/seen_random_testset.txt 7 | promptfile=../emnlp23/emnlp_data/wow/random_prompts/seen_random_prompt.txt 8 | 9 | downstream_model=llama-65B 10 | zero_shot=False 11 | knowledge_type=random_knowledge 12 | 13 | export TRANSFORMERS_CACHE='YOUR_DIR' 14 | export HF_HOME='YOUR_DIR' 15 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 16 | 17 | python3 -u helpfulness.py \ 18 | --exp_name $exp_name \ 19 | --task $task \ 20 | --zero_shot $zero_shot \ 21 | --debug $debug \ 22 | --testfile $testfile \ 23 | --promptfile $promptfile \ 24 | --downstream_model $downstream_model \ 25 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 26 | # --knowledge_type $knowledge_type 27 | 28 | 29 | -------------------------------------------------------------------------------- /scripts/helpfulness/wow_w_hyp_knowledge.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in "YOUR_EXP_NAME" 3 | do 4 | 5 | exp_name=${name}_w_hyp_knowledge 6 | # debug=True 7 | debug=False 8 | testfile=emnlp_data/wow/random_testset/seen_random_testset.txt 9 | promptfile=emnlp_data/wow/random_prompts/seen_random_prompt.txt 10 | hyp_knowledge=${name}_w_hyp_knowledge 11 | 12 | downstream_model=llama-65B 13 | knowledge_type=w_hyp_knowledge 14 | zero_shot=False 15 | 16 | export TRANSFORMERS_CACHE='YOUR_DIR' 17 | export HF_HOME='YOUR_DIR' 18 | export HUGGINGFACE_HUB_CACHE='YOUR_DIR' 19 | 20 | python3 -u helpfulness.py \ 21 | --exp_name $exp_name \ 22 | --task wow \ 23 | --zero_shot $zero_shot \ 24 | --debug $debug \ 25 | --testfile $testfile \ 26 | --hyp_knowledge $hyp_knowledge \ 27 | --promptfile $promptfile \ 28 | --downstream_model $downstream_model \ 29 | --knowledge_type $knowledge_type 1>log/$exp_name.log 2>&1 30 | 31 | wait 32 | 33 | done -------------------------------------------------------------------------------- /scripts/nq_coh_para.sh: -------------------------------------------------------------------------------- 1 | # env: base 2 | 3 | for name in your_prediction_dir 4 | do 5 | 6 | hyp=${name}/nq_hyp 7 | 8 | exp_name=discourse_coherence_${name} 9 | echo $name 10 | 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. python -u discourse-coherence.py \ 13 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 14 | 15 | echo 16 | 17 | wait 18 | done 19 | -------------------------------------------------------------------------------- /scripts/nq_coh_sent.sh: -------------------------------------------------------------------------------- 1 | # env: base 2 | 3 | for name in your_hyper_dir 4 | do 5 | 6 | hyp=${name}/nq_ref 7 | 8 | exp_name=ppl_${name} 9 | echo $name 10 | 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. 
python -u src/ppl.py \ 13 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 14 | 15 | echo 16 | 17 | wait 18 | done 19 | -------------------------------------------------------------------------------- /scripts/nq_factuality.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # This script evaluates the model predictions using various parameters. 3 | 4 | # Set the debug mode. If true, additional debugging information will be printed. 5 | debug=False 6 | 7 | # Number of evaluations to perform. 8 | eval_num=500 9 | 10 | # Number of retrieved evidence passages (IR) to consider. 11 | IR_num=10 12 | 13 | # Whether to evaluate without the ground-truth knowledge (False keeps it). 14 | wo_ground_truth_knowledge=False 15 | 16 | # Aggregation strategy for claim-level scores (e.g., max). 17 | outer_strategy=max 18 | 19 | # Loop through all the predictions of your model. 20 | for name in model_prediction_dir; do 21 | # Define the reference and hypothesis paths. 22 | ref="emnlp_data/nq/random_testset/nq_test_random_testset.txt" 23 | hyp="${name}/nq_knowledge" 24 | 25 | # Construct the experiment name based on the current configuration. 26 | exp_name="${name}_IR${IR_num}_${outer_strategy}" 27 | echo "Experiment Name: $exp_name" 28 | 29 | # Set the CUDA device. 30 | export CUDA_VISIBLE_DEVICES=0 31 | 32 | # Run the evaluation script with the specified parameters. 33 | PYTHONPATH=. python -u src/eval_exp.py \ 34 | --hyp_path "$hyp" \ 35 | --ref_path "$ref" \ 36 | --use_IR_eval \ 37 | --debug "$debug" \ 38 | --eval_num "$eval_num" \ 39 | --wo_ground_truth_knowledge "$wo_ground_truth_knowledge" \ 40 | --outer_strategy "$outer_strategy" \ 41 | --retrieved_num "$IR_num" \ 42 | 1> "log/log-${exp_name}" 2>&1 43 | 44 | # Wait for the process to finish before continuing with the next prediction. 45 | wait 46 | done -------------------------------------------------------------------------------- /scripts/nq_factuality_view.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | your_log_name_list=(log1 log2 log3) # Replace with actual log names 4 | 5 | for name in "${your_log_name_list[@]}"; do 6 | echo "Results for $name:" 7 | tail -2 "log/log-${name}_IR10_max" 8 | echo 9 | done -------------------------------------------------------------------------------- /scripts/nq_info.sh: -------------------------------------------------------------------------------- 1 | for name in your_prediction_dir 2 | do 3 | 4 | hyp=${name}/nq_knowledge 5 | 6 | ref=emnlp_data/nq/random_testset/nq_test_random_testset.txt 7 | 8 | exp_name=info_${name} 9 | echo $name 10 | 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. python -u info.py \ 13 | --task nq \ 14 | --ref_path $ref \ 15 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 16 | 17 | echo 18 | 19 | wait 20 | done 21 | -------------------------------------------------------------------------------- /scripts/nq_relevance.sh: -------------------------------------------------------------------------------- 1 | for name in your_prediction 2 | do 3 | 4 | 5 | ref=emnlp_data/nq/random_testset/nq_test_random_testset.txt 6 | hyp=${name}/nq_knowledge 7 | 8 | exp_name=relevance_${name} 9 | echo $name 10 | 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. 
python -u relevance.py \ 13 | --hyp_path $hyp \ 14 | --ref_path $ref 1>log/log-${exp_name} 2>&1 15 | 16 | echo 17 | 18 | wait 19 | done 20 | -------------------------------------------------------------------------------- /scripts/nq_validity.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Toggle debug mode 4 | debug=False 5 | 6 | # Number of evaluations 7 | eval_num=500 8 | 9 | # List of experiment directories (update with actual directory names) 10 | your_experiment_dir_list=(dir1 dir2 dir3) 11 | 12 | for name in "${your_experiment_dir_list[@]}"; do 13 | ref="./emnlp_data/nq/random_testset/nq_test_random_testset.txt" 14 | hyp="./answers/nq_answer_for_${name}/nq_answer" 15 | 16 | echo "Running experiment: ${name}" 17 | 18 | export CUDA_VISIBLE_DEVICES=0 19 | 20 | PYTHONPATH=. python -u src/nq_validity.py \ 21 | --hyp_path "$hyp" \ 22 | --ref_path "$ref" \ 23 | --debug "$debug" \ 24 | --eval_num "$eval_num" 1>"log/log-${name}" 2>&1 25 | 26 | wait 27 | done -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_DPR.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | 4 | # w IR 5 | 6 | IR_num=3 7 | exp_name=IR${IR_num}_eval_backup 8 | 9 | export CUDA_VISIBLE_DEVICES=2 10 | PYTHONPATH=. python src/eval_401.py \ 11 | --hyp_path /misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/DPR_top1_knowledge_seen \ 12 | --sent_ref_path $REF_PATH/output_testseen_knowledge_sentence_reference.txt \ 13 | --use_IR_eval \ 14 | --retrieved_num $IR_num \ 15 | --doc_ref_path $REF_PATH/output_testseen_knowledge_doc_reference.txt 1>log/DPR-${exp_name}-zero-shot-res.txt 2>log/DPR-${exp_name}-zero-shot-err.txt 16 | 17 | wait 18 | 19 | export CUDA_VISIBLE_DEVICES=2 20 | PYTHONPATH=. python src/eval_401.py \ 21 | --hyp_path /misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/DPR_top1_knowledge_unseen \ 22 | --sent_ref_path $REF_PATH/output_testunseen_knowledge_sentence_reference.txt \ 23 | --use_IR_eval \ 24 | --retrieved_num $IR_num \ 25 | --doc_ref_path $REF_PATH/output_testunseen_knowledge_doc_reference.txt 1>log/DPR-${exp_name}-zero-shot-unseen-res.txt 2>log/DPR-${exp_name}-zero-shot-unseen-err.txt 26 | 27 | 28 | 29 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_knowledge.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | # seen exp 4 | 5 | # few-shot 6 | # export CUDA_VISIBLE_DEVICES=0 7 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 8 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/flan-t5-11B/few-shot/seen_knowledge \ 9 | # --sent_ref_path $REF_PATH/output_testseen_knowledge_sentence_reference.txt \ 10 | # --doc_ref_path $REF_PATH/output_testseen_knowledge_doc_reference.txt 1>log/few-shot-res.txt 2>log/few-shot-err.txt 11 | 12 | # zero-shot 13 | # export CUDA_VISIBLE_DEVICES=1 14 | # PYTHONPATH=. 
python src/eval_NE_NLI.py \ 15 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/flan-t5-11B/zero-shot/seen_knowledge_last_utter \ 16 | # --sent_ref_path $REF_PATH/output_testseen_knowledge_sentence_reference.txt \ 17 | # --doc_ref_path $REF_PATH/output_testseen_knowledge_doc_reference.txt 1>log/zero-shot-res.txt 2>log/zero-shot-err.txt 18 | 19 | 20 | # unseen exp 21 | # few-shot 22 | # export CUDA_VISIBLE_DEVICES=2 23 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 24 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/flan-t5-11B/few-shot/unseen_knowledge \ 25 | # --sent_ref_path $REF_PATH/output_testunseen_knowledge_sentence_reference.txt \ 26 | # --doc_ref_path $REF_PATH/output_testunseen_knowledge_doc_reference.txt 1>log/few-shot-unseen-res.txt 2>log/few-shot-unseen-err.txt 27 | 28 | # export CUDA_VISIBLE_DEVICES=3 29 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 30 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/flan-t5-11B/zero-shot/unseen_knowledge_last_utter \ 31 | # --sent_ref_path $REF_PATH/output_testunseen_knowledge_sentence_reference.txt \ 32 | # --doc_ref_path $REF_PATH/output_testunseen_knowledge_doc_reference.txt 1>log/zero-shot-unseen-res.txt 2>log/zero-shot-unseen-err.txt 33 | 34 | 35 | exp_name=IR3_eval 36 | 37 | # for model in flan-t5-11B flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 38 | for model in flan-t5-xl 39 | do 40 | echo $model 41 | # seen + few-shot 42 | split=seen 43 | data=few-shot 44 | export CUDA_VISIBLE_DEVICES=0 45 | PYTHONPATH=. python src/eval_NE_NLI.py \ 46 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge__clean \ 47 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 48 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 49 | 50 | # seen + zero-shot 51 | split=seen 52 | data=zero-shot 53 | export CUDA_VISIBLE_DEVICES=1 54 | PYTHONPATH=. python src/eval_NE_NLI.py \ 55 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ \ 56 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 57 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 58 | 59 | 60 | # unseen + few-shot 61 | split=unseen 62 | data=few-shot 63 | export CUDA_VISIBLE_DEVICES=2 64 | PYTHONPATH=. python src/eval_NE_NLI.py \ 65 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge__clean \ 66 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 67 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 68 | 69 | 70 | # unseen + zero-shot 71 | split=unseen 72 | data=zero-shot 73 | export CUDA_VISIBLE_DEVICES=3 74 | PYTHONPATH=. 
python src/eval_NE_NLI.py \ 75 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ \ 76 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 77 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 78 | 79 | wait 80 | 81 | done 82 | 83 | 84 | 85 | 86 | 87 | # GEN_TO_EVALUATE_NAME=./wow/zero-shot/wizard-test-p1.jsonl 88 | 89 | # PYTHONPATH=. python src/evaluate_generated_knowledge.py \ 90 | # --gen_path ${GEN_TO_EVALUATE_NAME} 1>log/res.txt 2>log/err.txt 91 | 92 | 93 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_knowledge_IR.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | IR_num=3 4 | # exp_name=IR${IR_num}_eval_backup 5 | exp_name=IR${IR_num}_eval_filter_know 6 | 7 | # for model in flan-t5-11B flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 8 | for model in flan-t5-xxl 9 | do 10 | echo $model 11 | # seen + few-shot 12 | split=seen 13 | data=few-shot 14 | # hyp=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge__clean 15 | hyp=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/filter_know_${split}_knowledge 16 | 17 | export CUDA_VISIBLE_DEVICES=0 18 | PYTHONPATH=. python src/eval_401.py \ 19 | --hyp_path $hyp \ 20 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 21 | --use_IR_eval \ 22 | --retrieved_num $IR_num \ 23 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 24 | 25 | wait 26 | 27 | 28 | # # seen + zero-shot 29 | # split=seen 30 | # data=zero-shot 31 | # export CUDA_VISIBLE_DEVICES=1 32 | # PYTHONPATH=. python src/eval_401.py \ 33 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ \ 34 | # --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 35 | # --use_IR_eval \ 36 | # --retrieved_num $IR_num \ 37 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 38 | 39 | # wait 40 | 41 | # unseen + few-shot 42 | split=unseen 43 | data=few-shot 44 | hyp=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/filter_know_${split}_knowledge 45 | 46 | export CUDA_VISIBLE_DEVICES=2 47 | PYTHONPATH=. python src/eval_401.py \ 48 | --hyp_path $hyp \ 49 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 50 | --use_IR_eval \ 51 | --retrieved_num $IR_num \ 52 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 53 | 54 | wait 55 | 56 | # # unseen + zero-shot 57 | # split=unseen 58 | # data=zero-shot 59 | # export CUDA_VISIBLE_DEVICES=0 60 | # PYTHONPATH=. 
python src/eval_401.py \ 61 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ \ 62 | # --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 63 | # --use_IR_eval \ 64 | # --retrieved_num $IR_num \ 65 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 66 | 67 | # wait 68 | 69 | done 70 | 71 | 72 | 73 | 74 | 75 | # GEN_TO_EVALUATE_NAME=./wow/zero-shot/wizard-test-p1.jsonl 76 | 77 | # PYTHONPATH=. python src/evaluate_generated_knowledge.py \ 78 | # --gen_path ${GEN_TO_EVALUATE_NAME} 1>log/res.txt 2>log/err.txt 79 | 80 | 81 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_opt_knowledge_IR.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | IR_num=3 4 | exp_name=OPT-IR${IR_num}_eval 5 | 6 | # for model in opt-13b opt-1.3b 7 | for model in opt-13b opt-iml-1.3b opt-1.3b 8 | # for model in opt-6.7b 9 | do 10 | echo $model 11 | # seen + few-shot 12 | split=seen 13 | data=few-shot 14 | export CUDA_VISIBLE_DEVICES=1 15 | PYTHONPATH=. python src/eval_401.py \ 16 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_extract \ 17 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 18 | --use_IR_eval \ 19 | --retrieved_num $IR_num \ 20 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 21 | 22 | wait 23 | 24 | # unseen + few-shot 25 | split=unseen 26 | data=few-shot 27 | export CUDA_VISIBLE_DEVICES=1 28 | PYTHONPATH=. python src/eval_401.py \ 29 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_extract \ 30 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 31 | --use_IR_eval \ 32 | --retrieved_num $IR_num \ 33 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 34 | 35 | wait 36 | 37 | 38 | done 39 | 40 | 41 | 42 | 43 | 44 | # GEN_TO_EVALUATE_NAME=./wow/zero-shot/wizard-test-p1.jsonl 45 | 46 | # PYTHONPATH=. python src/evaluate_generated_knowledge.py \ 47 | # --gen_path ${GEN_TO_EVALUATE_NAME} 1>log/res.txt 2>log/err.txt 48 | 49 | 50 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_refined_knowledge.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | # for model in flan-t5-11B flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 4 | # for model in flan-t5-11B 5 | for model in flan-t5-xxl 6 | do 7 | echo $model 8 | 9 | 10 | # seen + few-shot 11 | split=seen 12 | data=few-shot 13 | # hyp_file=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge__clean_refinement 14 | hyp_file=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_fewshot_refinement 15 | 16 | export CUDA_VISIBLE_DEVICES=0 17 | PYTHONPATH=. 
python src/eval_NE_NLI.py \ 18 | --hyp_path $hyp_file \ 19 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 20 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-fewshot-refine-knowledge-res.txt 2>log/${model}-${data}-${split}-fewshot-refine-knowledge-err.txt & 21 | 22 | # unseen + few-shot 23 | split=unseen 24 | data=few-shot 25 | hyp_file=/misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_fewshot_refinement 26 | 27 | export CUDA_VISIBLE_DEVICES=2 28 | PYTHONPATH=. python src/eval_NE_NLI.py \ 29 | --hyp_path $hyp_file \ 30 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 31 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-fewshot-refine-knowledge-res.txt 2>log/${model}-${data}-${split}-fewshot-refine-knowledge-err.txt & 32 | 33 | # # seen + zero-shot 34 | # split=seen 35 | # data=zero-shot 36 | # export CUDA_VISIBLE_DEVICES=1 37 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 38 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter__refinement \ 39 | # --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 40 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-zeroshot-refine-knowledge-res.txt 2>log/${model}-${data}-${split}-zeroshot-refine-knowledge-err.txt & 41 | 42 | 43 | # # unseen + zero-shot 44 | # split=unseen 45 | # data=zero-shot 46 | # export CUDA_VISIBLE_DEVICES=3 47 | # PYTHONPATH=. python src/eval_NE_NLI.py \ 48 | # --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter__refinement \ 49 | # --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 50 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-zeroshot-refine-knowledge-res.txt 2>log/${model}-${data}-${split}-zeroshot-refine-knowledge-err.txt & 51 | 52 | wait 53 | 54 | done 55 | 56 | 57 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_refined_knowledge_IR.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | IR_num=3 4 | exp_name=IR${IR_num}_eval_refinement 5 | 6 | for model in flan-t5-xxl 7 | do 8 | echo $model 9 | 10 | # seen + few-shot 11 | split=seen 12 | data=few-shot 13 | export CUDA_VISIBLE_DEVICES=0 14 | PYTHONPATH=. python src/eval_401.py \ 15 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_fewshot_refinement \ 16 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 17 | --use_IR_eval \ 18 | --retrieved_num $IR_num \ 19 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 20 | 21 | wait 22 | 23 | # unseen + few-shot 24 | split=unseen 25 | data=few-shot 26 | export CUDA_VISIBLE_DEVICES=2 27 | PYTHONPATH=. 
python src/eval_401.py \ 28 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_fewshot_refinement \ 29 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 30 | --use_IR_eval \ 31 | --retrieved_num $IR_num \ 32 | --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${exp_name}-${model}-${data}-${split}-res.txt 2>log/${exp_name}-${model}-${data}-${split}-err.txt & 33 | 34 | wait 35 | 36 | 37 | done 38 | 39 | 40 | -------------------------------------------------------------------------------- /scripts/other/cal_factuality_for_response.sh: -------------------------------------------------------------------------------- 1 | REF_PATH=/misc/kfdata01/kf_grp/lchen/ParlAI/data/wizard_of_wikipedia/processed_data 2 | 3 | 4 | # for model in flan-t5-11B flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 5 | for model in flan-t5-11B 6 | do 7 | echo $model 8 | # seen + few-shot 9 | split=seen 10 | data=few-shot 11 | export CUDA_VISIBLE_DEVICES=0 12 | PYTHONPATH=. python src/eval_NE_NLI.py \ 13 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_response \ 14 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 15 | --doc_ref_path $REF_PATH/output_test${split}_response_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 16 | # --doc_ref_path $REF_PATH/output_test${split}_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 17 | 18 | # seen + zero-shot 19 | split=seen 20 | data=zero-shot 21 | export CUDA_VISIBLE_DEVICES=1 22 | PYTHONPATH=. python src/eval_NE_NLI.py \ 23 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_response \ 24 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 25 | --doc_ref_path $REF_PATH/output_test${split}_response_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 26 | 27 | 28 | # unseen + few-shot 29 | split=unseen 30 | data=few-shot 31 | export CUDA_VISIBLE_DEVICES=2 32 | PYTHONPATH=. python src/eval_NE_NLI.py \ 33 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_response \ 34 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 35 | --doc_ref_path $REF_PATH/output_test${split}_response_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 36 | 37 | 38 | # unseen + zero-shot 39 | split=unseen 40 | data=zero-shot 41 | export CUDA_VISIBLE_DEVICES=3 42 | PYTHONPATH=. python src/eval_NE_NLI.py \ 43 | --hyp_path /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_response \ 44 | --sent_ref_path $REF_PATH/output_test${split}_knowledge_sentence_reference.txt \ 45 | --doc_ref_path $REF_PATH/output_test${split}_response_knowledge_doc_reference.txt 1>log/${model}-${data}-${split}-response-res.txt 2>log/${model}-${data}-${split}-response-err.txt & 46 | 47 | wait 48 | 49 | done 50 | 51 | 52 | 53 | 54 | 55 | # GEN_TO_EVALUATE_NAME=./wow/zero-shot/wizard-test-p1.jsonl 56 | 57 | # PYTHONPATH=. 
python src/evaluate_generated_knowledge.py \ 58 | # --gen_path ${GEN_TO_EVALUATE_NAME} 1>log/res.txt 2>log/err.txt 59 | 60 | 61 | -------------------------------------------------------------------------------- /scripts/other/tmp.sh: -------------------------------------------------------------------------------- 1 | # for model in flan-t5-xl flan-t5-large flan-t5-base flan-t5-small 2 | for model in flan-t5-11B 3 | do 4 | split=seen 5 | data=few-shot 6 | head -3865 /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge > /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_ 7 | 8 | # seen + zero-shot 9 | split=seen 10 | data=zero-shot 11 | head -3865 /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter > /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ 12 | 13 | 14 | # unseen + few-shot 15 | split=unseen 16 | data=few-shot 17 | head -3924 /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge > /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_ 18 | 19 | 20 | # unseen + zero-shot 21 | split=unseen 22 | data=zero-shot 23 | head -3924 /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter > /misc/kfdata01/kf_grp/lchen/opt/output/$model/$data/${split}_knowledge_last_utter_ 24 | 25 | done -------------------------------------------------------------------------------- /scripts/view_coh_sent.sh: -------------------------------------------------------------------------------- 1 | # for name in nq_DPR nq_random_prompt_flan_xxl nq_zeroshot_prompt4_flan_xxl nq_random_prompt_llama_65b_T100 nq_zeroshot_prompt4_llama_65b_T100 2 | for name in nq_random_prompt_chatgpt_T100 nq_zeroshot_prompt4_chatgpt_T100 3 | 4 | do 5 | 6 | echo $name 7 | res=/misc/kfdata01/kf_grp/lchen/FactualityPrompt/output/${name}/nq_knowledge_avg_sent_ppl 8 | tail -1 $res 9 | echo 10 | echo 11 | 12 | done 13 | 14 | 15 | # for name in wow_DPR zeroshot_prompt2_flan_xxl zeroshot_prompt4_llama_65b random_prompt_flan_xxl random_prompt_llama_65b_T100 16 | for name in random_prompt_chatgpt zeroshot_prompt4_chatgpt_T100 17 | do 18 | 19 | echo $name 20 | # res=/misc/kfdata01/kf_grp/lchen/FactualityPrompt/output_emnlp/${name}/seen_knowledge_avg_sent_ppl 21 | res=/misc/kfdata01/kf_grp/lchen/FactualityPrompt/output/${name}/seen_knowledge_avg_sent_ppl 22 | tail -1 $res 23 | echo 24 | echo 25 | 26 | done 27 | -------------------------------------------------------------------------------- /scripts/view_info.sh: -------------------------------------------------------------------------------- 1 | for name in your_dir 2 | do 3 | 4 | echo $name 5 | exp_name=info_${name} 6 | tail -1 log/log-${exp_name} 7 | echo ' ' 8 | 9 | done -------------------------------------------------------------------------------- /scripts/view_nq_validity.sh: -------------------------------------------------------------------------------- 1 | 2 | 3 | your_log_name_list=(log1 log2 log3) # Replace with actual log names 4 | 5 | for name in "${your_log_name_list[@]}"; do 6 | 7 | 8 | echo $name 9 | tail -2 log/log-${name} 10 | echo 11 | 12 | done -------------------------------------------------------------------------------- /scripts/view_wow_validity.sh: -------------------------------------------------------------------------------- 1 | your_log_name_list=(log1 log2 log3) # Replace with actual log names 2 | 3 | for name in "${your_log_name_list[@]}"; do 4 | 5 | 6 | echo $name 7 | tail -2 log/log-${name}-answer 
8 | echo 9 | 10 | done -------------------------------------------------------------------------------- /scripts/wow_coh_para.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in your_prediction_dir 3 | do 4 | 5 | hyp=${name}/seen_knowledge 6 | 7 | exp_name=ppl_${name} 8 | echo $name 9 | 10 | export CUDA_VISIBLE_DEVICES=0 11 | PYTHONPATH=. python -u src/discourse-coherence.py \ 12 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 13 | 14 | wait 15 | done 16 | -------------------------------------------------------------------------------- /scripts/wow_coh_sent.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in your_prediction_dir 3 | do 4 | 5 | hyp=${name}/seen_knowledge 6 | 7 | exp_name=ppl_${name} 8 | echo $name 9 | 10 | export CUDA_VISIBLE_DEVICES=0 11 | PYTHONPATH=. python -u src/ppl.py \ 12 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 13 | 14 | echo 15 | 16 | wait 17 | done 18 | -------------------------------------------------------------------------------- /scripts/wow_factuality.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # This script evaluates the knowledge predictions for various models and strategies. 3 | 4 | # Set the debug mode (use 'True' to enable debugging). 5 | debug=False 6 | 7 | # Number of evaluations to perform. 8 | eval_num=500 9 | 10 | # Number of retrieved (IR) evidence passages to consider. 11 | IR_num=10 12 | 13 | # Whether to drop the ground-truth knowledge and verify against IR evidence only. 14 | wo_ground_truth_knowledge=False 15 | 16 | # Strategy for aggregating per-sentence NLI labels (max, min, or mean). 17 | outer_strategy=max 18 | 19 | for name in random_prompt_llama_65b_T100 20 | do 21 | 22 | # seen split 23 | ref=./emnlp_data/wow/random_testset/seen_random_testset.txt 24 | hyp=${name}/seen_knowledge 25 | 26 | exp_name=${name}_IR${IR_num}_seen_knowledge_$outer_strategy 27 | echo $exp_name 28 | 29 | export CUDA_VISIBLE_DEVICES=0 30 | PYTHONPATH=. python -u src/eval_exp.py \ 31 | --hyp_path $hyp \ 32 | --ref_path $ref \ 33 | --use_IR_eval \ 34 | --debug $debug \ 35 | --eval_num $eval_num \ 36 | --wo_ground_truth_knowledge $wo_ground_truth_knowledge \ 37 | --retrieved_num $IR_num 1>log/log-${exp_name} 2>&1 38 | 39 | wait 40 | 41 | 42 | # unseen split 43 | ref=./emnlp_data/wow/random_testset/unseen_random_testset.txt 44 | hyp=${name}/unseen_knowledge 45 | 46 | exp_name=${name}_IR${IR_num}_unseen_knowledge 47 | echo $exp_name 48 | 49 | export CUDA_VISIBLE_DEVICES=3 50 | PYTHONPATH=.
python -u src/eval_exp.py \ 51 | --hyp_path $hyp \ 52 | --ref_path $ref \ 53 | --use_IR_eval \ 54 | --debug $debug \ 55 | --eval_num $eval_num \ 56 | --wo_ground_truth_knowledge $wo_ground_truth_knowledge \ 57 | --retrieved_num $IR_num 1>log/log-${exp_name} 2>&1 58 | 59 | wait 60 | 61 | done -------------------------------------------------------------------------------- /scripts/wow_factuality_view.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | your_log_name_list=(log1 log2 log3) # Replace with actual log names 4 | 5 | for name in "${your_log_name_list[@]}"; do 6 | echo "Results for $name:" 7 | tail -2 "log/log-${name}" 8 | echo 9 | done -------------------------------------------------------------------------------- /scripts/wow_info.sh: -------------------------------------------------------------------------------- 1 | for name in your_prediction_dir 2 | do 3 | 4 | hyp=${name}/seen_knowledge 5 | ref=./emnlp_data/wow/random_testset/seen_random_testset.txt 6 | 7 | exp_name=info_${name} 8 | echo $name 9 | 10 | export CUDA_VISIBLE_DEVICES=1 11 | PYTHONPATH=. python -u src/info.py \ 12 | --task wow \ 13 | --ref_path $ref \ 14 | --hyp_path $hyp 1>log/log-${exp_name} 2>&1 15 | 16 | echo 17 | 18 | wait 19 | done 20 | -------------------------------------------------------------------------------- /scripts/wow_relevance.sh: -------------------------------------------------------------------------------- 1 | 2 | for name in your_prediction_dir 3 | 4 | do 5 | 6 | ref=emnlp_data/wow/random_testset/seen_random_testset.txt 7 | hyp=${name}/seen_knowledge 8 | 9 | exp_name=relevance_${name} 10 | echo $name 11 | 12 | export CUDA_VISIBLE_DEVICES=0 13 | PYTHONPATH=. python -u src/relevance.py \ 14 | --hyp_path $hyp \ 15 | --ref_path $ref 1>log/log-${exp_name} 2>&1 16 | 17 | echo 18 | 19 | wait 20 | done 21 | -------------------------------------------------------------------------------- /scripts/wow_validity.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Toggle debug mode 4 | debug=False 5 | 6 | # Number of evaluations 7 | eval_num=500 8 | 9 | # List of experiment directories (update with actual directory names) 10 | your_experiment_dir_list=(dir1 dir2 dir3) 11 | 12 | for name in "${your_experiment_dir_list[@]}" 13 | do 14 | 15 | ref=./emnlp_data/wow/random_testset/seen_random_testset.txt 16 | hyp=wow_answer/wow_answer_for_${name}/wow_answer 17 | 18 | exp_name=${name}-answer 19 | echo $exp_name 20 | 21 | export CUDA_VISIBLE_DEVICES=0 22 | PYTHONPATH=. python -u src/wow_validity.py \ 23 | --use_IR_eval \ 24 | --retrieved_num 5 \ 25 | --hyp_path $hyp \ 26 | --ref_path $ref \ 27 | --debug $debug \ 28 | --eval_num $eval_num 1>log/log-${exp_name} 2>&1 29 | 30 | wait 31 | done 32 | -------------------------------------------------------------------------------- /src/claim_handling.py: -------------------------------------------------------------------------------- 1 | import re 2 | from nltk.tokenize import sent_tokenize 3 | 4 | import spacy 5 | # spacy.prefer_gpu() 6 | spacy.load('en_core_web_sm') 7 | nlp = spacy.load("en_core_web_sm") 8 | 9 | import nltk 10 | from nltk.corpus import stopwords 11 | stop_words = set(stopwords.words('english')) 12 | 13 | ''' 14 | Five types of important entities: 15 | Organizations, Personal Names, Events, Products, Artworks 16 | 17 | Two types of non-critical entities: 18 | Cardinal Numbers indicate the quantity of something.
19 | Ordinal Numbers denote the position or rank of something within a sequence. 20 | 21 | IMPORTANT_ENT_TYPE = set(['ORG', 'PERSON', 'WORK_OF_ART', 'PRODUCT', 'EVENT']) 22 | ''' 23 | IMPORTANT_ENT_TYPE = set(['ORG', 'PERSON', 'WORK_OF_ART', 'PRODUCT', 'EVENT']) 24 | REMOVE_ENT_TYPE = set() 25 | 26 | 27 | def obtain_important_ne(gen, include_capitalized_words_as_ents=True): 28 | important_words = [] 29 | doc = nlp(gen) 30 | 31 | ents = [(ent.text, ent.label_) for ent in doc.ents] 32 | 33 | if include_capitalized_words_as_ents and len(ents) == 0: 34 | capitalized_words = re.findall('(?<!^)([A-Z][a-z]+)', gen) 35 | 36 | if len(capitalized_words) > 0: 37 | capitalized_words = [(word, 'CAPITALIZED') for word in capitalized_words if word.lower() not in stop_words] 38 | ents.extend(capitalized_words) 39 | 40 | important_words.extend([ent for ent in ents if ent[1] in IMPORTANT_ENT_TYPE]) 41 | remaining_ne_all = [ent for ent in ents if ent[1] not in IMPORTANT_ENT_TYPE] 42 | 43 | # filter out some ne 44 | remaining_ne = [] 45 | for ent in remaining_ne_all: 46 | if ent[1] in REMOVE_ENT_TYPE: 47 | continue 48 | # if ent[1] == 'DATE' and ("year" in ent[0] or "day" in ent[0]): #not bool(re.search(r'\d', ent[0])): 49 | # if "DATE" entity contains NO number at all (e.g., ``the year''), meaningless 50 | # continue 51 | remaining_ne.append(ent) 52 | 53 | gens_with_ne = { 54 | "gen": gen, 55 | "important_ne": important_words, 56 | "unimportant_ne": remaining_ne, 57 | "subject": set([token.text for token in doc if token.dep_ in ['nsubj', 'nsubjpass']]), 58 | # "all_analysis": [(token.text, token.pos_, token.tag_, token.dep_) for token in doc] 59 | } 60 | 61 | return gens_with_ne 62 | 63 | 64 | def has_incorrect_style(gen_obj): 65 | 66 | # case 1: contains first person -- I, we 67 | if gen_obj['subject'].intersection(set(['i', 'I', 'You', 'you', 'We', 'we'])): 68 | return True 69 | 70 | # case 2: question? 71 | if "?" in gen_obj['gen']: 72 | return True 73 | 74 | return False 75 | 76 | 77 | def obtain_trust_worthy_sents(text, wiki_names): 78 | 79 | wiki_names_txt = " ".join(wiki_names) 80 | 81 | text = text.strip().replace("\n",". ") 82 | sents = sent_tokenize(text) 83 | 84 | sents_with_ne = [obtain_important_ne(sent.strip()) for sent in sents] 85 | 86 | no_fact_gen_cnt, no_fact_gens = 0, [] 87 | checkworthy_gen_cnt, checkworthy_gens = 0, [] 88 | off_topic_gen_cnt, off_topic_gens = 0, [] 89 | 90 | for sent_obj in sents_with_ne: 91 | 92 | # case 1: no facts -- i.e., no NE, incorrect_style, no SUBJECT 93 | if len(sent_obj['important_ne']) + len(sent_obj['unimportant_ne']) == 0 or has_incorrect_style(sent_obj) or len(sent_obj['subject']) == 0: 94 | no_fact_gen_cnt += 1 95 | 96 | # case 2 v1: no off-topic, but contains facts (unimportant_ne) about target-topic 97 | elif len(sent_obj['important_ne']) == 0 and len(sent_obj['unimportant_ne']) > 0: 98 | checkworthy_gen_cnt += 1 99 | checkworthy_gens.append(sent_obj) 100 | 101 | # case 3: tricky scenario. important_ne could be relevant to the target-topic, or could indicate off-topic 102 | else: 103 | 104 | # 1. filter out any extra_ne that is same as wikiname -- e.g., wiki_name = Barack Obama, ne = Obama 105 | extra_ne = [ne[0] for ne in sent_obj['important_ne'] if ne[0] not in wiki_names_txt] 106 | 107 | # 2. check if any of the extra_ne is the "SUBJECT" of the generation 108 | overlap_between_extraNE_and_subj = sent_obj['subject'].intersection(set(" ".join(extra_ne).split(" "))) 109 | 110 | if len(overlap_between_extraNE_and_subj) > 0: # contains off-topic NE!!
111 | off_topic_gen_cnt += 1 112 | else: 113 | checkworthy_gen_cnt += 1 114 | checkworthy_gens.append(sent_obj) 115 | 116 | 117 | return checkworthy_gens -------------------------------------------------------------------------------- /src/discourse-coherence.py: -------------------------------------------------------------------------------- 1 | ''' 2 | pip install sgnlp 3 | pip uninstall nvidia_cublas_cu11 4 | ''' 5 | import numpy as np 6 | from tqdm import tqdm 7 | import json 8 | from sgnlp.models.coherence_momentum import CoherenceMomentumModel, CoherenceMomentumConfig, \ 9 | CoherenceMomentumPreprocessor 10 | 11 | # Load Model 12 | config = CoherenceMomentumConfig.from_pretrained( 13 | "coherence-momentum" 14 | ) 15 | model = CoherenceMomentumModel.from_pretrained( 16 | "coherence-momentum", 17 | config=config 18 | ) 19 | 20 | model.cuda() 21 | 22 | preprocessor = CoherenceMomentumPreprocessor(config.model_size, config.max_len) 23 | 24 | # Example text inputs 25 | text1 = "Companies listed below reported quarterly profit substantially different from the average of analysts ' " \ 26 | "estimates . The companies are followed by at least three analysts , and had a minimum five-cent change in " \ 27 | "actual earnings per share . Estimated and actual results involving losses are omitted . The percent " \ 28 | "difference compares actual profit with the 30-day estimate where at least three analysts have issues " \ 29 | "forecasts in the past 30 days . Otherwise , actual profit is compared with the 300-day estimate . " \ 30 | "Source : Zacks Investment Research" 31 | text2 = "The companies are followed by at least three analysts , and had a minimum five-cent change in actual " \ 32 | "earnings per share . The percent difference compares actual profit with the 30-day estimate where at least " \ 33 | "three analysts have issues forecasts in the past 30 days . Otherwise , actual profit is compared with the " \ 34 | "300-day estimate . Source : Zacks Investment Research. Companies listed below reported quarterly profit " \ 35 | "substantially different from the average of analysts ' estimates . Estimated and actual results involving " \ 36 | "losses are omitted ." 
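# text2 reshuffles the sentences of text1, so the model should generally assign text1
# (the original ordering) the higher coherence score. A single text is scored the same
# way as in calculate_coherence below:
#   inputs = preprocessor([text1])
#   score = model.get_main_score(inputs["tokenized_texts"].cuda()).item()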
37 | 38 | 39 | def args_parser(): 40 | import argparse 41 | parser = argparse.ArgumentParser() 42 | parser.add_argument("--hyp_path", type=str, default='./emnlp_data/nq/random_testset/nq_ref') 43 | args = parser.parse_args() 44 | return args 45 | 46 | def calculate_coherence(sentences): 47 | print('calculate coherence scores...') 48 | scores = [] 49 | for s_list in tqdm(sentences): 50 | inputs = preprocessor([s_list]) 51 | score = model.get_main_score(inputs["tokenized_texts"].cuda()).item() 52 | scores.append(score) 53 | return scores 54 | 55 | def read_hyp(hyp_path): 56 | hyps = [] 57 | with open(hyp_path, 'r') as infile: 58 | for line in infile: 59 | hyps.append(line.strip()) 60 | return hyps 61 | 62 | 63 | if __name__ == '__main__': 64 | args = args_parser() 65 | hyps = read_hyp(args.hyp_path) 66 | assert len(hyps) == 500, len(hyps) 67 | 68 | scores = calculate_coherence(hyps) 69 | assert len(scores) == 500, len(scores) 70 | 71 | with open(args.hyp_path + '_avg_coh_para', 'w') as outfile: 72 | json.dump(scores, outfile) 73 | outfile.write('\n') 74 | outfile.write(f'{max(scores)}\t{min(scores)}') 75 | -------------------------------------------------------------------------------- /src/eval_exp.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | from tqdm import tqdm 3 | from collections import Counter 4 | import copy 5 | import json 6 | import argparse 7 | import random 8 | random.seed(42) 9 | 10 | import numpy as np 11 | from factuality_metric import ner_metric, nli_metric_batch 12 | from src.claim_handling import obtain_important_ne 13 | from tools import WikiSearch 14 | 15 | import logging 16 | logging.basicConfig() 17 | logging.getLogger().setLevel(logging.ERROR) 18 | 19 | def read_hyp(hyp_path): 20 | hyps = [] 21 | with open(hyp_path, 'r') as infile: 22 | for line in infile: 23 | hyps.append(line.strip()) 24 | return hyps 25 | 26 | def read_IR_docs(IR_path): 27 | IR_docs = [] 28 | with open(IR_path, 'r') as infile: 29 | for line in infile: 30 | IR_docs.append(json.loads(line.strip())) 31 | return IR_docs 32 | 33 | def read_ref(ref_path): 34 | doc_refs = [] 35 | if 'json' not in ref_path: # txt: for wow 36 | with open(ref_path, 'r') as infile: 37 | for line in infile: 38 | parts = line.strip().split('\t') 39 | # topic, query, knowledge, response 40 | assert len(parts) == 4, parts 41 | doc_refs.append(parts[2]) 42 | else: # json: for QA dataset 43 | with open(ref_path, 'r') as infile: 44 | data_list = json.load(infile)['data'] 45 | for data in data_list: 46 | doc_refs.append(data['context']) 47 | return doc_refs 48 | 49 | def boolean_string(s): 50 | if s.lower() not in {'false', 'true'}: 51 | raise ValueError('Not a valid boolean string') 52 | return s.lower() == 'true' 53 | 54 | def args_parser(): 55 | parser = argparse.ArgumentParser(description='Process some integers.') 56 | 57 | parser.add_argument('--hyp_path', type=str, default=None, help='path to generations to evaluate') 58 | parser.add_argument('--ref_path', type=str, default=None, help='path to generations to evaluate') 59 | parser.add_argument('--eval_num', type=int, default=-1) 60 | parser.add_argument('--outer_strategy', type=str, default='max', help='max, min, mean') 61 | 62 | parser.add_argument('--use_IR_eval', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 63 | parser.add_argument('--retrieved_num', type=int, default=3) 64 | parser.add_argument('--wo_ground_truth_knowledge', type=boolean_string, 
default='True') 65 | 66 | parser.add_argument('--debug', type=boolean_string) 67 | parser.add_argument('--save_gen_for_analysis', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 68 | 69 | args = parser.parse_args() 70 | return args 71 | 72 | def single_instance_eval(hyp, doc_ref_str, recall_list_, args): 73 | # multiple evidences 74 | hallu_ner_ratio = [] 75 | nli_contradict_prob, nli_entail_prob, nli_neutral_prob, nli_label = [], [], [], [] 76 | 77 | hyp_sents = sent_tokenize(hyp) 78 | doc_ref = sent_tokenize(doc_ref_str) + [doc_ref_str] if doc_ref_str else [] 79 | 80 | retrieve_error = '' 81 | for sent in hyp_sents: 82 | cur_doc_ref = copy.deepcopy(doc_ref) # 83 | recall_list = copy.deepcopy(recall_list_) 84 | 85 | if args.use_IR_eval and args.retrieved_num: 86 | assert recall_list and len(recall_list) >= 10, f"len(recall_list) = {len(recall_list)}" 87 | try: 88 | if not recall_list: # dont do this 89 | recall_list = WikiSearch(sent, args.retrieved_num) # already sentences # raise ConnectTimeout(e, request=request) 90 | assert len(recall_list) == args.retrieved_num, f"len(recall_list) = {len(recall_list)}, args.retrieved_num = {args.retrieved_num}" 91 | else: 92 | recall_list = recall_list[:args.retrieved_num] 93 | except: # need to log 94 | retrieve_error = f"!!!error!!!:{sent}" 95 | else: 96 | recall_list = [] 97 | 98 | # 1. NER 99 | sent_obj_with_ne = obtain_important_ne(sent.strip()) 100 | NE_to_check = sent_obj_with_ne['important_ne'] + sent_obj_with_ne['unimportant_ne'] 101 | if NE_to_check: 102 | correct_ner_ratio = 0 103 | if not args.wo_ground_truth_knowledge: # ref 104 | correct_ner_ratio = ner_metric(NE_to_check, doc_ref_str) # apply directly on wiki and/or google search snippets 105 | for recall_passage in recall_list: 106 | correct_ner_ratio = max(correct_ner_ratio, ner_metric(NE_to_check, recall_passage)) 107 | hallu_ner_ratio.append(1 - correct_ner_ratio) 108 | 109 | # 2. 
NLI: identify the evs that give highest nli entailment score 110 | premise_hypothesis_pairs = [[ev, sent] for ev in cur_doc_ref + recall_list] 111 | if args.wo_ground_truth_knowledge: 112 | premise_hypothesis_pairs = [[ev, sent] for ev in recall_list] 113 | if len(premise_hypothesis_pairs) > 32: 114 | premise_hypothesis_pairs = premise_hypothesis_pairs[:32] 115 | bz = 8 116 | nli_probs, labels = [], [] 117 | for t in range((len(premise_hypothesis_pairs) - 1) // bz + 1): 118 | bz_nli_probs, bz_labels = nli_metric_batch(premise_hypothesis_pairs[t * bz: min((t + 1) * bz, len(premise_hypothesis_pairs))]) 119 | nli_probs.extend(bz_nli_probs) 120 | labels.extend(bz_labels) 121 | assert len(nli_probs) == len(premise_hypothesis_pairs) == len(labels), f"len(nli_probs) = {len(nli_probs)}, len(premise_hypothesis_pairs) = {len(premise_hypothesis_pairs)}, len(labels) = {len(labels)}" 122 | 123 | # [contradiction, neutral, entailment] 124 | entailment_argmax = np.argmax([nli_s[2] for nli_s in nli_probs]) 125 | max_prob = nli_probs[entailment_argmax] 126 | max_label = labels[entailment_argmax] 127 | 128 | nli_contradict_prob.append(max_prob[0]) 129 | nli_neutral_prob.append(max_prob[1]) 130 | nli_entail_prob.append(max_prob[2]) 131 | 132 | nli_label.append(max_label) 133 | 134 | hallu_ner_ratio = np.nanmean(hallu_ner_ratio) 135 | idx = None 136 | if args.outer_strategy == 'max': 137 | idx = nli_label.index(max(nli_label)) 138 | nli_label = max(nli_label) 139 | if args.outer_strategy == 'min': 140 | idx = nli_label.index(min(nli_label)) 141 | nli_label = min(nli_label) 142 | 143 | if args.outer_strategy != 'mean': 144 | nli_contradict_prob = nli_contradict_prob[idx] 145 | nli_neutral_prob = nli_neutral_prob[idx] 146 | nli_entail_prob = nli_entail_prob[idx] 147 | else: # mean 148 | nli_contradict_prob = np.nanmean(nli_contradict_prob) 149 | nli_neutral_prob = np.nanmean(nli_neutral_prob) 150 | nli_entail_prob = np.nanmean(nli_entail_prob) 151 | 152 | eval_result_obj = { 153 | 'claim_to_verify': hyp_sents, 154 | 'doc_ref': doc_ref, 155 | 'recall_list': recall_list, 156 | 'retrieve_error': retrieve_error, 157 | 158 | 'hallu_ner': hallu_ner_ratio, 159 | 'nli-label': nli_label, 160 | 'nli-contr': nli_contradict_prob, 161 | 'nli-entail': nli_entail_prob, 162 | 'nli-neutr': nli_neutral_prob 163 | } 164 | 165 | return eval_result_obj 166 | 167 | def main(args): 168 | 169 | # read hyp, ref, IR_docs 170 | hyps = read_hyp(args.hyp_path) 171 | IR_recalls = read_IR_docs(args.hyp_path + '_IR_docs') 172 | doc_refs = read_ref(args.ref_path) # txt file 173 | assert len(hyps) == len(doc_refs) == len(IR_recalls) == 500, (len(hyps), len(doc_refs), len(IR_recalls)) 174 | 175 | # DEBUG mode! 
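# (debug mode evaluates only the first DEBUG_SAMPLE_SIZE examples, so the pipeline can be
# smoke-tested without running the full 500-example testset)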
176 | if args.debug: 177 | DEBUG_SAMPLE_SIZE = 5 178 | hyps = hyps[:DEBUG_SAMPLE_SIZE] 179 | IR_recalls = IR_recalls[:DEBUG_SAMPLE_SIZE] 180 | doc_refs = doc_refs[:DEBUG_SAMPLE_SIZE] 181 | 182 | final_hallu_ner_score = [] 183 | final_contradict_prob, final_neutral_prob, final_entail_prob, all_nli_labels = [], [], [], [] 184 | all_analysis_list = [] 185 | 186 | for i in tqdm(range(len(hyps))): 187 | hyp, doc_ref, recall_list = hyps[i], doc_refs[i], IR_recalls[i] 188 | 189 | res_obj = single_instance_eval(hyp, doc_ref, recall_list, args) 190 | 191 | final_hallu_ner_score.append(res_obj['hallu_ner']) 192 | final_contradict_prob.append(res_obj['nli-contr']) 193 | final_neutral_prob.append(res_obj['nli-neutr']) 194 | final_entail_prob.append(res_obj['nli-entail']) 195 | all_nli_labels.append(res_obj['nli-label']) 196 | all_analysis_list.append(res_obj) 197 | 198 | # analysis 199 | avg_hallu_ner_ratio = np.nanmean(final_hallu_ner_score) 200 | avg_contradict_prob = np.mean(final_contradict_prob) 201 | avg_neutral_prob = np.mean(final_neutral_prob) 202 | avg_entail_prob = np.mean(final_entail_prob) 203 | 204 | print("\nHallu NER: {:.2f}%".format(avg_hallu_ner_ratio*100)) 205 | print("AVG PROBS: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format(avg_contradict_prob*100, avg_neutral_prob*100, avg_entail_prob*100)) 206 | 207 | nli_contradict_class_ratio, nli_neutral_class_ratio, nli_entail_class_ratio = 0, 0, 0 208 | 209 | if args.outer_strategy == 'mean': 210 | all_nli_labels = [item for sublist in all_nli_labels for item in sublist] 211 | nli_counter = Counter(all_nli_labels) 212 | 213 | nli_contradict_class_ratio=nli_counter[0]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 214 | nli_neutral_class_ratio=nli_counter[1]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 215 | nli_entail_class_ratio=nli_counter[2]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 216 | 217 | print("NLI CLASS %: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format( 218 | nli_contradict_class_ratio*100, 219 | nli_neutral_class_ratio*100, 220 | nli_entail_class_ratio*100 221 | )) 222 | 223 | res_path = args.hyp_path + f'_{args.outer_strategy}_factuality_results.txt' 224 | with open(res_path, 'a') as outfile: 225 | res_obj = { 226 | "avg_hallu_ner_ratio": avg_hallu_ner_ratio, 227 | "nli_contradict_class_ratio": nli_contradict_class_ratio, 228 | "nli_neutral_class_ratio": nli_neutral_class_ratio, 229 | "nli_entail_class_ratio": nli_entail_class_ratio, 230 | } 231 | json.dump(res_obj, outfile) 232 | outfile.write("\n") 233 | 234 | ana_path = args.hyp_path + f'_IR{args.retrieved_num}_{args.outer_strategy}_analysis.txt' 235 | with open(ana_path, 'w') as outfile: 236 | json.dump(all_analysis_list, outfile) 237 | outfile.write("\n") 238 | 239 | # save example NE score 240 | ne_path = args.hyp_path + f'_IR{args.retrieved_num}_{args.outer_strategy}_example_NE.txt' 241 | with open(ne_path, 'w') as outfile: 242 | for ne in final_hallu_ner_score: 243 | outfile.write(str(ne) + '\n') 244 | 245 | # save example NLI score 246 | nli_path = args.hyp_path + f'_IR{args.retrieved_num}_{args.outer_strategy}_example_NLI_entail.txt' 247 | with open(nli_path, 'w') as outfile: 248 | for nli in final_entail_prob: 249 | outfile.write(str(nli) + '\n') 250 | 251 | if __name__ == '__main__': 252 | args = args_parser() 253 | main(args) 254 | -------------------------------------------------------------------------------- /src/helpfulness.py: -------------------------------------------------------------------------------- 1 | import 
torch 2 | import tqdm 3 | import math 4 | import numpy as np 5 | import argparse 6 | import random 7 | from transformers import T5Tokenizer, T5ForConditionalGeneration, LlamaTokenizer, LlamaForCausalLM 8 | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM 9 | 10 | 11 | def read_testfile(testfile): 12 | '''read testset (groud truth knowledge, response)''' 13 | res = [] 14 | with open(testfile, 'r', encoding='utf-8') as r: 15 | for i, line in enumerate(r): 16 | parts = line.strip().split('\t') 17 | assert len(parts) == 4, parts 18 | res.append(parts) 19 | # topic, query, knowledge, response 20 | return res 21 | 22 | def read_knowledge_prompt(prompt_file): 23 | ''' 24 | prompt_file: 25 | {last_utter: ["(last_utter) topic => knowledge", "..", ..]} 26 | ''' 27 | knowledge_prompts = [] 28 | with open(prompt_file, "r") as f: 29 | for i, line in enumerate(f): 30 | line = line.strip() 31 | line_list = eval(line)[:8] 32 | knowledge_prompts.append(line_list) 33 | return knowledge_prompts 34 | 35 | def read_hyp_knowledge(path): 36 | '''read generated knowledge (by dpr or llm)''' 37 | with open(path, 'r', encoding='utf-8') as r: 38 | res = [line.strip() for line in r] 39 | assert len(res) == 500, len(res) 40 | return res 41 | 42 | def load_model(model_name): 43 | if 'flan' in model_name: 44 | # flan-t5-xxl: 11B = 22G, takes 6 min to load model. 1 min if gpus are empty. 45 | assert model_name in ['flan-t5-xxl', 'flan-t5-xl', 'flan-t5-large', 'flan-t5-base', 'flan-t5-small'] 46 | tokenizer = T5Tokenizer.from_pretrained(f"google/{model_name}", local_files_only=True) 47 | model = T5ForConditionalGeneration.from_pretrained(f"google/{model_name}", device_map="balanced_low_0", torch_dtype=torch.float16, local_files_only=True) 48 | elif 'llama' in model_name: 49 | path = '/apdcephfs/share_1594716/chenliang/cache/llama1/65B' 50 | tokenizer = LlamaTokenizer.from_pretrained(path, padding_side='left') # left-padding for decoder-only model 51 | tokenizer.pad_token, tokenizer.bos_id, tokenizer.eos_id = -1, 1, 2 52 | model = LlamaForCausalLM.from_pretrained(path, device_map="balanced_low_0", torch_dtype=torch.float16) 53 | else: 54 | ''' A decoder-only architecture is being used, but right-padding was detected! 
55 | For correct generation results, please set `padding_side='left'` when initializing the tokenizer.''' 56 | tokenizer = AutoTokenizer.from_pretrained(f"facebook/{model_name}", use_fast=False, padding_side='left') 57 | model = AutoModelForCausalLM.from_pretrained(f"facebook/{model_name}", device_map="auto", torch_dtype=torch.float16) 58 | return tokenizer, model 59 | 60 | def compute_ppl(prefix_and_output_text=None, output_text=None, model=None, tokenizer=None, infer_gpu=0): 61 | '''calculate ppl for a single response''' 62 | with torch.no_grad(): 63 | tokd_inputs = tokenizer.encode(prefix_and_output_text, return_tensors="pt") 64 | tokd_inputs = tokd_inputs.to(infer_gpu) 65 | 66 | # if only want to score the "generation" part we need the suffix tokenization length 67 | tokd_suffix = tokenizer.encode(output_text, return_tensors="pt") 68 | 69 | tokd_labels = tokd_inputs.clone().detach() 70 | tokd_labels[:, :tokd_labels.shape[1] - tokd_suffix.shape[1] + 1] = -100 # mask out the prefix 71 | 72 | outputs = model(input_ids=tokd_inputs, labels=tokd_labels) 73 | loss = outputs.loss # avg CE loss all positions (except -100, TODO check that this is working correctly) 74 | ppl = torch.tensor(math.exp(loss)) 75 | 76 | return loss.item(), ppl.item() 77 | 78 | def boolean_string(s): 79 | if s.lower() not in {'false', 'true'}: 80 | raise ValueError('Not a valid boolean string') 81 | return s.lower() == 'true' 82 | 83 | def parse_args(): 84 | 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument('--exp_name', type=str) 87 | parser.add_argument('--task', type=str, default='nq') 88 | 89 | parser.add_argument("--debug", type=boolean_string, default=True) 90 | parser.add_argument("--zero_shot", type=boolean_string, default=False) 91 | 92 | parser.add_argument('--testfile', type=str, default='data/testset.txt') 93 | parser.add_argument('--promptfile', type=str, default='data/testset.txt') 94 | parser.add_argument('--hyp_knowledge', type=str, default='') 95 | 96 | parser.add_argument('--downstream_model', type=str) 97 | parser.add_argument("--knowledge_type", type=str, default='wo_knowledge', help='wo_knowledge, w_ref_knowledge, w_hyp_knowledge, random_knowledge') 98 | 99 | parser.add_argument('--infer_gpu', type=int, default=0) 100 | args = parser.parse_args() 101 | return args 102 | 103 | 104 | if __name__ == '__main__': 105 | args = parse_args() 106 | 107 | testset = read_testfile(args.testfile) 108 | nq_prompt_list = read_knowledge_prompt(args.promptfile) 109 | random_knowledge_list = [random.choice(nq_prompt_list[499 - i]).split('\t')[-2] for i in range(len(nq_prompt_list))] 110 | 111 | if args.hyp_knowledge: 112 | hyp_knowledge_list = read_hyp_knowledge(args.hyp_knowledge) 113 | assert len(hyp_knowledge_list) == len(testset), len(hyp_knowledge_list) 114 | 115 | if args.debug: 116 | testset = testset[:3] 117 | 118 | tokenizer, model = load_model(args.downstream_model) 119 | 120 | loss_list, ppl_list = [], [] 121 | for i in tqdm.tqdm(range(len(testset))): 122 | topic, query, knowledge, response = testset[i] 123 | examples = [e.split('\t') for e in nq_prompt_list[i] if len(e.split('\t')) == 4] 124 | turns = query.split(" [SEP] ") 125 | last_turn = turns[-1].strip() 126 | 127 | ref_knowledge = knowledge.strip() 128 | truncate_len = 500 129 | if len(ref_knowledge.split(' ')) > truncate_len: 130 | print (f'Warning: knowledge {i} length {len(ref_knowledge.split(" "))} exceeds {truncate_len}, truncating to {truncate_len}') 131 | ref_knowledge = ' '.join(ref_knowledge.split(' 
')[:truncate_len]).strip() 132 | if args.hyp_knowledge: 133 | hyp_knowledge = hyp_knowledge_list[i] 134 | 135 | infer_sample = f"Passage:\nQuery: {last_turn.strip()}\nAnswer: " # set to empty passage 136 | if args.knowledge_type == 'w_hyp_knowledge': 137 | infer_sample = f"Passage: {hyp_knowledge.strip()}\nQuery: {last_turn.strip()}\nAnswer: " 138 | elif args.knowledge_type == 'w_ref_knowledge': 139 | infer_sample = f"Passage: {ref_knowledge}\nQuery: {last_turn.strip()}\nAnswer: " 140 | elif args.knowledge_type == 'random_knowledge': 141 | infer_sample = f"Passage: {random_knowledge_list[i].strip()}\nQuery: {last_turn.strip()}\nAnswer: " 142 | 143 | prompt = '' 144 | cur_len = 0 145 | if args.zero_shot: 146 | if args.knowledge_type == 'wo_knowledge': 147 | if args.task == 'nq': 148 | prompt = f'Read the passage and answer the question below:\nPassage: {ref_knowledge}\nQuestion: {last_turn}\nAnswer: ' 149 | elif args.task == 'wow': 150 | prompt = f'Using the knowledge from the passage, complete the dialogue below:\nPassage: {ref_knowledge}\nSpeaker 1: {last_turn}\nSpeaker 2: ' 151 | 152 | elif args.knowledge_type == 'w_ref_knowledge': 153 | if args.task == 'nq': 154 | prompt = f'Read the passage and answer the question below:\nPassage: {ref_knowledge}\nQuestion: {last_turn}\nAnswer: ' 155 | elif args.task == 'wow': 156 | prompt = f'Using the knowledge from the passage, complete the dialogue below:\nPassage: {ref_knowledge}\nSpeaker 1: {last_turn}\nSpeaker 2: ' 157 | elif args.knowledge_type == 'w_hyp_knowledge': 158 | if args.task == 'nq': 159 | prompt = f'Read the passage and answer the question below:\nPassage: {hyp_knowledge}\nQuestion: {last_turn}\nAnswer: ' 160 | elif args.task == 'wow': 161 | prompt = f'Using the knowledge from the passage, complete the dialogue below:\nPassage: {hyp_knowledge}\nSpeaker 1: {last_turn}\nSpeaker 2: ' 162 | else: 163 | raise NotImplementedError(args.knowledge_type) 164 | else: 165 | for example in examples: 166 | p_topic, p_turns, p_knowledge, p_response = [e.strip() for e in example] 167 | if p_knowledge.startswith(p_topic): 168 | p_knowledge = p_knowledge[len(p_topic):] 169 | 170 | demonstration = f"Passage: {p_knowledge.strip()}\nQuery: {p_turns.split(' [SEP] ')[-1].strip()}\nAnswer: {p_response.strip()}" 171 | 172 | if cur_len < 1800 - len(infer_sample.split(' ')): 173 | prompt += demonstration + '\n\n' 174 | cur_len += len(demonstration.split(' ')) 175 | 176 | prompt += infer_sample 177 | 178 | prefix_and_output_text = prompt + response 179 | output_text = response 180 | loss, ppl = compute_ppl(prefix_and_output_text, output_text, model, tokenizer, args.infer_gpu) 181 | loss_list.append(loss) 182 | ppl_list.append(ppl) 183 | 184 | if args.debug: 185 | print (prefix_and_output_text) 186 | print (loss, ppl) 187 | 188 | with open(f'helpfulness_results/{args.exp_name}.txt', 'w') as f: 189 | f.write(str(loss_list).strip() + '\n') 190 | f.write(str(ppl_list).strip() + '\n') 191 | 192 | f.write(f'loss: {np.mean(loss_list)}\t{np.std(loss_list)}\t{np.var(loss_list)}\n') 193 | f.write(f'ppl: {np.mean(ppl_list)}\t{np.std(ppl_list)}\t{np.var(ppl_list)}\n') 194 | -------------------------------------------------------------------------------- /src/info.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import math 3 | from tqdm import tqdm 4 | import numpy as np 5 | import json 6 | from transformers import GPT2LMHeadModel, GPT2Tokenizer 7 | from transformers import AutoTokenizer, AutoModelForCausalLM 8 | 9 | 
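# The informativeness scorer below loads GPT-neo-2.7B from a local directory name; this
# presumably mirrors the EleutherAI/gpt-neo-2.7B checkpoint on Hugging Face, i.e. something like
#   tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neo-2.7B')
# would be the equivalent remote load. For each example, calculate_info_per_example computes
# the mean token cross-entropy of the generated knowledge conditioned on the instruction
# prompt and reports info = 1 - exp(-loss), i.e. one minus the geometric-mean token
# probability, so text that is harder for the LM to predict scores closer to 1.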
tokenizer = AutoTokenizer.from_pretrained('gpt-neo-2.7B') 10 | model = AutoModelForCausalLM.from_pretrained('gpt-neo-2.7B').half() 11 | device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') 12 | model.to(device) 13 | tokenizer.pad_token = tokenizer.eos_token 14 | 15 | def read_testfile(testfile): 16 | '''read testset from wow''' 17 | res = [] 18 | with open(testfile, 'r', encoding='utf-8') as r: 19 | for i, line in enumerate(r): 20 | parts = line.strip().split('\t') 21 | assert len(parts) == 4, parts 22 | res.append(parts) 23 | return res 24 | 25 | def read_hyp(hyp_path): 26 | hyps = [] 27 | with open(hyp_path, 'r') as infile: 28 | for line in infile: 29 | hyps.append(line.strip()) 30 | return hyps 31 | 32 | def calculate_info_per_example(hyps_knowledge, queries, topics, args): 33 | info_seq = [] 34 | for hyp, query, topic in tqdm(zip(hyps_knowledge, queries, topics)): 35 | hyp = ' '.join(hyp.split()[:300]) 36 | if args.task == 'nq' and query.strip()[-1] != '?': 37 | query = query.strip() + '?' 38 | instruction = f"Generate a Wikipedia to answer the given question.\nTopic: {topic.strip()}.\nQuestion: {query.strip()}\nWikipedia: " 39 | example = instruction + hyp 40 | inputs = tokenizer(example, return_tensors='pt', truncation=True).data 41 | prefix = tokenizer(instruction, return_tensors='pt', truncation=True).data 42 | for k, v in inputs.items(): 43 | inputs[k] = v.to(device) 44 | for k, v in prefix.items(): 45 | prefix[k] = v.to(device) 46 | output = model(**inputs, labels=inputs['input_ids']) 47 | logits = output.logits 48 | labels=inputs['input_ids'] 49 | logits = logits[:, prefix['input_ids'].shape[-1]:, :] 50 | labels = labels[:, prefix['input_ids'].shape[-1]:] 51 | assert logits.shape[1] == labels.shape[1], (logits.shape, labels.shape) 52 | shift_logits = logits[..., :-1, :].contiguous() 53 | shift_labels = labels[..., 1:].contiguous() 54 | loss_fct = torch.nn.CrossEntropyLoss(reduction='mean') 55 | loss1 = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) 56 | info = 1 - torch.exp(-loss1) 57 | info = info.item() 58 | info_seq.append(info) 59 | 60 | return info_seq 61 | 62 | def args_parser(): 63 | import argparse 64 | parser = argparse.ArgumentParser() 65 | parser.add_argument("--task", type=str, default='nq') 66 | parser.add_argument("--hyp_path", type=str) 67 | parser.add_argument("--ref_path", type=str, default='./emnlp_data/nq/random_testset/nq_test_random_testset.txt') 68 | args = parser.parse_args() 69 | return args 70 | 71 | if __name__ == '__main__': 72 | args = args_parser() 73 | hyps = read_hyp(args.hyp_path) 74 | testset = read_testfile(args.ref_path) 75 | queries = [t[1].strip() for t in testset] 76 | topics = [t[0].strip() for t in testset] 77 | assert len(hyps) == len(testset) == len(queries) == 500, (len(hyps), len(testset)) 78 | 79 | info_list = calculate_info_per_example(hyps, queries, topics, args) 80 | assert len(info_list) == 500, len(info_list) 81 | 82 | print ('mean info = ', np.nanmean(info_list)) 83 | 84 | with open(args.hyp_path + '_info', 'w') as outfile: 85 | for info in info_list: 86 | outfile.write(str(info) + '\n') -------------------------------------------------------------------------------- /src/nq_validity.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | from tqdm import tqdm 3 | from collections import Counter 4 | import copy 5 | import json 6 | import argparse 7 | import random 8 | random.seed(42) 9 | 10 | 
import numpy as np 11 | from factuality_metric import ner_metric, nli_metric_batch 12 | from src.claim_handling import obtain_important_ne 13 | from tools import WikiSearch 14 | 15 | import logging 16 | logging.basicConfig() 17 | logging.getLogger().setLevel(logging.ERROR) 18 | 19 | def read_hyp(hyp_path): 20 | hyps = [] 21 | with open(hyp_path, 'r') as infile: 22 | for line in infile: 23 | hyps.append(line.strip()) 24 | return hyps 25 | 26 | def read_testfile(testfile): 27 | '''read testset from wow''' 28 | res = [] 29 | with open(testfile, 'r', encoding='utf-8') as r: 30 | for i, line in enumerate(r): 31 | parts = line.strip().split('\t') 32 | assert len(parts) == 4, parts 33 | res.append(parts) 34 | return res 35 | 36 | def boolean_string(s): 37 | if s.lower() not in {'false', 'true'}: 38 | raise ValueError('Not a valid boolean string') 39 | return s.lower() == 'true' 40 | 41 | def args_parser(): 42 | parser = argparse.ArgumentParser(description='Process some integers.') 43 | 44 | parser.add_argument('--hyp_path', type=str, default=None, help='path to generations to evaluate') 45 | parser.add_argument('--ref_path', type=str, default=None, help='path to generations to evaluate') 46 | parser.add_argument('--eval_num', type=int, default=-1) 47 | 48 | parser.add_argument('--use_IR_eval', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 49 | parser.add_argument('--retrieved_num', type=int, default=3) 50 | parser.add_argument('--wo_ground_truth_knowledge', type=boolean_string, default='False') 51 | 52 | parser.add_argument('--debug', type=boolean_string) 53 | parser.add_argument('--save_gen_for_analysis', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 54 | 55 | args = parser.parse_args() 56 | return args 57 | 58 | def single_instance_eval(hyp, query, answer, args): 59 | # multiple evidences 60 | hallu_ner_ratio = [] 61 | nli_contradict_prob, nli_entail_prob, nli_neutral_prob, nli_label = [], [], [], [] 62 | 63 | # NLI: identify the evs that give highest nli entailment score 64 | if query.strip()[-1] != '?': 65 | query = query.strip() + '?' 
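# Validity is cast as an NLI problem: the premise pairs the question with the reference
# answer and the hypothesis pairs the same question with the generated answer, so a high
# entailment probability means the generated answer is supported by the reference.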
66 | premise = query + '\t' + answer 67 | hypothesis = query + '\t' + hyp 68 | 69 | premise_hypothesis_pairs = [[premise, hypothesis]] 70 | nli_probs, labels = [], [] 71 | bz_nli_probs, bz_labels = nli_metric_batch(premise_hypothesis_pairs) 72 | nli_probs.extend(bz_nli_probs) 73 | labels.extend(bz_labels) 74 | assert len(nli_probs) == len(premise_hypothesis_pairs) == len(labels), f"len(nli_probs) = {len(nli_probs)}, len(premise_hypothesis_pairs) = {len(premise_hypothesis_pairs)}, len(labels) = {len(labels)}" 75 | 76 | # [contradiction, neutral, entailment] 77 | entailment_argmax = np.argmax([nli_s[2] for nli_s in nli_probs]) 78 | max_prob = nli_probs[entailment_argmax] 79 | max_label = labels[entailment_argmax] 80 | 81 | nli_contradict_prob.append(max_prob[0]) 82 | nli_neutral_prob.append(max_prob[1]) 83 | nli_entail_prob.append(max_prob[2]) 84 | 85 | nli_label.append(max_label) 86 | 87 | hallu_ner_ratio = np.nanmean(hallu_ner_ratio) 88 | idx = nli_label.index(max(nli_label)) 89 | nli_label = max(nli_label) 90 | nli_contradict_prob = nli_contradict_prob[idx] 91 | nli_neutral_prob = nli_neutral_prob[idx] 92 | nli_entail_prob = nli_entail_prob[idx] 93 | 94 | eval_result_obj = { 95 | 'premise': premise, 96 | 'hypothesis': hypothesis, 97 | 98 | 'nli-label': nli_label, 99 | 'nli-contr': nli_contradict_prob, 100 | 'nli-entail': nli_entail_prob, 101 | 'nli-neutr': nli_neutral_prob 102 | } 103 | 104 | return eval_result_obj 105 | 106 | def main(args): 107 | 108 | # read hyp, ref, IR_docs 109 | hyps = read_hyp(args.hyp_path) 110 | testset = read_testfile(args.ref_path) 111 | assert len(hyps) == len(testset) == 500, (len(hyps), len(testset)) 112 | 113 | # DEBUG mode! 114 | if args.debug: 115 | DEBUG_SAMPLE_SIZE = 10 116 | hyps = hyps[:DEBUG_SAMPLE_SIZE] 117 | testset = testset[:DEBUG_SAMPLE_SIZE] 118 | 119 | final_contradict_prob, final_neutral_prob, final_entail_prob, all_nli_labels = [], [], [], [] 120 | all_analysis_list = [] 121 | 122 | for i in tqdm(range(len(hyps))): 123 | hyp, example = hyps[i], testset[i] 124 | query, answer = example[1], example[3] 125 | 126 | res_obj = single_instance_eval(hyp, query, answer, args) 127 | 128 | if args.debug: 129 | print ('==' * 20) 130 | print (res_obj) 131 | 132 | final_contradict_prob.append(res_obj['nli-contr']) 133 | final_neutral_prob.append(res_obj['nli-neutr']) 134 | final_entail_prob.append(res_obj['nli-entail']) 135 | all_nli_labels.append(res_obj['nli-label']) 136 | all_analysis_list.append(res_obj) 137 | 138 | # analysis 139 | avg_contradict_prob = np.mean(final_contradict_prob) 140 | avg_neutral_prob = np.mean(final_neutral_prob) 141 | avg_entail_prob = np.mean(final_entail_prob) 142 | 143 | print("AVG PROBS: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format(avg_contradict_prob*100, avg_neutral_prob*100, avg_entail_prob*100)) 144 | 145 | nli_contradict_class_ratio, nli_neutral_class_ratio, nli_entail_class_ratio = 0, 0, 0 146 | 147 | nli_counter = Counter(all_nli_labels) 148 | 149 | nli_contradict_class_ratio=nli_counter[0]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 150 | nli_neutral_class_ratio=nli_counter[1]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 151 | nli_entail_class_ratio=nli_counter[2]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 152 | 153 | print("NLI CLASS %: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format( 154 | nli_contradict_class_ratio*100, 155 | nli_neutral_class_ratio*100, 156 | nli_entail_class_ratio*100 157 | )) 158 | 159 | res_path = args.hyp_path + '_factuality_results.txt' 160 | with 
open(res_path, 'a') as outfile: 161 | res_obj = { 162 | 'Contradict_probs': avg_contradict_prob, 163 | 'Neutral_probs': avg_neutral_prob, 164 | 'Entail_probs': avg_entail_prob, 165 | "nli_contradict_class_ratio": nli_contradict_class_ratio, 166 | "nli_neutral_class_ratio": nli_neutral_class_ratio, 167 | "nli_entail_class_ratio": nli_entail_class_ratio, 168 | } 169 | json.dump(res_obj, outfile) 170 | outfile.write("\n") 171 | 172 | ana_path = args.hyp_path + '_analysis.txt' 173 | with open(ana_path, 'a') as outfile: 174 | json.dump(all_analysis_list, outfile) 175 | outfile.write("\n") 176 | 177 | if __name__ == '__main__': 178 | args = args_parser() 179 | main(args) 180 | -------------------------------------------------------------------------------- /src/ppl.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import math 3 | from tqdm import tqdm 4 | from nltk.tokenize import sent_tokenize 5 | import numpy as np 6 | import json 7 | from transformers import AutoTokenizer, AutoModelForCausalLM 8 | 9 | tokenizer = AutoTokenizer.from_pretrained('gpt-neo-2.7B') 10 | model = AutoModelForCausalLM.from_pretrained('gpt-neo-2.7B') 11 | device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') 12 | model.to(device) 13 | tokenizer.pad_token = tokenizer.eos_token 14 | 15 | def calculate_ppls(sentences, device): 16 | print('calculate PPL scores...') 17 | ppls = [] 18 | for s_list in tqdm(sentences): 19 | cur_list = [] 20 | for r in s_list: 21 | inputs = tokenizer(r, return_tensors='pt', truncation=True, max_length=500).data 22 | for k, v in inputs.items(): 23 | inputs[k] = v.to(device) 24 | output = model(**inputs, labels=inputs['input_ids']) 25 | loss = output[0] 26 | 27 | # testing 28 | logits = output.logits 29 | labels=inputs['input_ids'] 30 | shift_logits = logits[..., :-1, :].contiguous() 31 | shift_labels = labels[..., 1:].contiguous() 32 | loss_fct = torch.nn.CrossEntropyLoss(reduction='mean') 33 | loss1 = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) 34 | print ('loss1 = ', loss1) 35 | 36 | cur_list.append(min(math.exp(loss.item()), 200)) # sentence-level PPL 37 | ppls.append(sum(cur_list) / len(cur_list)) # example-level PPL 38 | return ppls 39 | 40 | def read_hyp(hyp_path): 41 | hyps = [] 42 | with open(hyp_path, 'r') as infile: 43 | for line in infile: 44 | hyps.append(line.strip()) 45 | return hyps 46 | 47 | def args_parser(): 48 | import argparse 49 | parser = argparse.ArgumentParser() 50 | parser.add_argument("--hyp_path", type=str, default='./emnlp_data/nq/random_testset/nq_ref') 51 | args = parser.parse_args() 52 | return args 53 | 54 | if __name__ == '__main__': 55 | args = args_parser() 56 | hyps = read_hyp(args.hyp_path) 57 | sentences = [] 58 | for hyp in hyps: 59 | hyp_sents = sent_tokenize(hyp) 60 | sentences.append(hyp_sents) 61 | assert len(sentences) == 500, len(sentences) 62 | ppls = calculate_ppls(sentences, device) 63 | assert len(ppls) == 500, len(ppls) 64 | 65 | inverse_ppls = [1 / p for p in ppls] 66 | coh_sent = np.nanmean(inverse_ppls) 67 | 68 | with open(args.hyp_path + '_avg_sent_ppl', 'w') as outfile: 69 | json.dump(ppls, outfile) 70 | outfile.write('\n') 71 | json.dump(coh_sent, outfile) 72 | -------------------------------------------------------------------------------- /src/relevance.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import tqdm 3 | import argparse 4 | from transformers import 
AutoTokenizer, BertForSequenceClassification 5 | from transformers.data.processors.utils import InputExample 6 | from transformers import glue_convert_examples_to_features as convert_examples_to_features 7 | from torch.utils.data import DataLoader, TensorDataset 8 | import json 9 | 10 | # env: conda activate D3 11 | def load_model(path): 12 | tokenizer = AutoTokenizer.from_pretrained(path) 13 | model = BertForSequenceClassification.from_pretrained(path) 14 | device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') 15 | model.to(device) 16 | # model.half() 17 | model.eval() 18 | return tokenizer, model, device 19 | 20 | def read_testfile(ref_path): 21 | testset = [] 22 | with open(ref_path, 'r') as infile: 23 | for line in infile: 24 | parts = line.strip().split('\t') 25 | # topic, query, knowledge, response 26 | assert len(parts) == 4, parts 27 | testset.append(parts) 28 | return testset 29 | 30 | def read_hyp(hyp_path): 31 | hyps = [] 32 | with open(hyp_path, 'r') as infile: 33 | for line in infile: 34 | hyps.append(line.strip()) 35 | return hyps 36 | 37 | def get_dataloader(input_examples, tokenizer, device, batch_size=256): 38 | features = convert_examples_to_features( 39 | input_examples, 40 | tokenizer, 41 | label_list=['0', '1'], 42 | max_length=512, 43 | output_mode='classification', 44 | ) 45 | all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long).to(device) 46 | all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long).to(device) 47 | token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long).to(device) 48 | dataset = TensorDataset(all_input_ids, token_type_ids, all_attention_mask) 49 | dataloader = DataLoader(dataset, batch_size=batch_size) 50 | return dataloader 51 | 52 | def load_data(ref_path, hyp_path, tokenizer, device, batch_size=256): 53 | testset = read_testfile(ref_path) 54 | hyps = read_hyp(hyp_path) 55 | assert len(testset) == len(hyps), (len(testset), len(hyps)) 56 | # examples = [InputExample(str(i), testset[i][1], hyps[i], '0') for i in range(len(testset))] 57 | examples = [InputExample(str(i), testset[i][1], hyps[i], '0') for i in range(len(testset)) if hyps[i].strip()] 58 | test_dataloader = get_dataloader(examples, tokenizer, device, batch_size=batch_size) 59 | return test_dataloader, examples 60 | 61 | def batch_inference(model, dataloader): 62 | all_logits = None 63 | with torch.no_grad(): 64 | # for batch in tqdm.tqdm(dataloader): 65 | for batch in dataloader: 66 | inputs = {"input_ids": batch[0], "token_type_ids": batch[1], "attention_mask": batch[2]} 67 | outputs = model(**inputs) 68 | if all_logits is None: 69 | all_logits = outputs[0].cpu().detach() 70 | else: # [n, 2], concatenate each batch's logits along the first dimension 71 | all_logits = torch.cat((all_logits, outputs[0].cpu().detach()), dim=0) 72 | results = torch.argmax(all_logits, dim=1) # [n] 73 | probs = torch.nn.functional.softmax(all_logits, dim=-1) 74 | # return results, probs[torch.arange(probs.size(0)), results] 75 | return results, probs[:, 1] 76 | 77 | def args_parser(): 78 | parser = argparse.ArgumentParser() 79 | parser.add_argument("--ref_path", type=str, default='./emnlp_data/nq/random_testset/nq_test_random_testset.txt') 80 | parser.add_argument("--hyp_path", type=str, default='./emnlp_data/nq/random_testset/nq_ref') 81 | parser.add_argument("--model_path", type=str,
default='/misc/kfdata01/kf_grp/lchen/cache/monobert-large-msmarco') 82 | parser.add_argument("--batch_size", type=int, default=256) 83 | args = parser.parse_args() 84 | return args 85 | 86 | if __name__ == '__main__': 87 | 88 | args = args_parser() 89 | 90 | # 1. load model 91 | tokenizer, model, device = load_model(args.model_path) 92 | 93 | # 2. load data 94 | test_dataloader, examples = load_data(args.ref_path, args.hyp_path, tokenizer, device, batch_size=args.batch_size) 95 | 96 | # 3. inference 97 | results, probs = batch_inference(model, test_dataloader) 98 | # print (results, probs) 99 | # probs = torch.tensor([p for p in probs if p > 0.01]) 100 | print (torch.sum(results), torch.mean(probs)) 101 | 102 | with open(args.hyp_path + '_rel', 'w') as w: 103 | # w.write(json.dumps()) 104 | for prob in probs: 105 | w.write(str(prob.item()) + '\n') 106 | 107 | # 4. print some examples 108 | # res = results.cpu().tolist() 109 | # for i in range(500): 110 | # idx = res[i] 111 | # if idx == 0: 112 | # print ('=='*20) 113 | # print (examples[i].text_a + '\n') 114 | # print (examples[i].text_b) 115 | 116 | -------------------------------------------------------------------------------- /src/tools.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import calendar 3 | import wolframalpha 4 | import datetime 5 | from transformers import AutoModelForSeq2SeqLM, AutoTokenizer 6 | from operator import pow, truediv, mul, add, sub 7 | 8 | 9 | ''' 10 | Calendar 11 | 12 | Uses Python's datetime and calendar libraries to retrieve the current date. 13 | 14 | input - None 15 | 16 | output - A string, the current date. 17 | ''' 18 | def Calendar(): 19 | now = datetime.datetime.now() 20 | return f'Today is {calendar.day_name[now.weekday()]}, {calendar.month_name[now.month]} {now.day}, {now.year}.' 21 | 22 | 23 | ''' 24 | Wikipedia Search 25 | 26 | Uses ColBERTv2 to retrieve Wikipedia documents. 27 | 28 | input_query - A string, the input query (e.g. "what is a dog?") 29 | k - The number of documents to retrieve 30 | 31 | output - A list of strings, each string is a Wikipedia document 32 | 33 | Adapted from Stanford's DSP: https://github.com/stanfordnlp/dsp/ 34 | Also see: https://github.com/lucabeetz/dsp 35 | ''' 36 | class ColBERTv2: 37 | def __init__(self, url: str): 38 | self.url = url 39 | 40 | def __call__(self, query, k=10): 41 | topk = colbertv2_get_request(self.url, query, k) 42 | 43 | topk = [doc['text'] for doc in topk] 44 | return topk 45 | 46 | def colbertv2_get_request(url: str, query: str, k: int): 47 | payload = {'query': query, 'k': k} 48 | res = requests.get(url, params=payload) 49 | 50 | topk = res.json()['topk'][:k] 51 | return topk 52 | 53 | def WikiSearch(input_query: str, k=3): 54 | # k = 10 55 | # k = 3 56 | retrieval_model = ColBERTv2('http://ec2-44-228-128-229.us-west-2.compute.amazonaws.com:8893/api/search') 57 | output = retrieval_model(input_query, k) 58 | return output 59 | 60 | 61 | ''' 62 | Machine Translation - NLLB-600M 63 | 64 | Uses HuggingFace's transformers library to translate input query to English. 65 | 66 | input_query - A string, the input query (e.g. "what is a dog?") 67 | 68 | output - A string, the translated input query. 
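Example (illustrative): MT("¿Qué es un perro?") should return something like "What is a dog?".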
69 | ''' 70 | def MT(input_query: str): 71 | model_name = "facebook/nllb-200-distilled-600M" 72 | tokenizer = AutoTokenizer.from_pretrained(model_name) 73 | model = AutoModelForSeq2SeqLM.from_pretrained(model_name) 74 | input_ids = tokenizer(input_query, return_tensors='pt') 75 | outputs = model.generate( 76 | **input_ids, 77 | forced_bos_token_id=tokenizer.lang_code_to_id["eng_Latn"], 78 | ) 79 | output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0] 80 | return output 81 | 82 | 83 | ''' 84 | Calculator 85 | 86 | Calculates the result of a mathematical expression. 87 | 88 | input_query - A string, the input query (e.g. "400/1400") 89 | 90 | output - A float, the result of the calculation 91 | 92 | Adapted from: https://levelup.gitconnected.com/3-ways-to-write-a-calculator-in-python-61642f2e4a9a 93 | ''' 94 | def Calculator(input_query: str): 95 | operators = { 96 | '+': add, 97 | '-': sub, 98 | '*': mul, 99 | '/': truediv 100 | } 101 | if input_query.isdigit(): 102 | return float(input_query) 103 | for c in operators.keys(): 104 | left, operator, right = input_query.partition(c) 105 | if operator in operators: 106 | return round(operators[operator](Calculator(left), Calculator(right)), 2) 107 | 108 | 109 | 110 | ''' 111 | Wolfram Alpha Calculator 112 | 113 | pip install wolframalpha 114 | 115 | Uses Wolfram Alpha API to calculate input query. 116 | 117 | input_query - A string, the input query (e.g. "what is 2 + 2?") 118 | 119 | output - A string, the answer to the input query 120 | 121 | wolfarm_alpha_appid - your Wolfram Alpha API key 122 | ''' 123 | def WolframAlphaCalculator(input_query: str): 124 | wolfram_alpha_appid = 'YOUR_WOLFRAM_ALPHA_APPID' 125 | wolfram_client = wolframalpha.Client(wolfram_alpha_appid) 126 | res = wolfram_client.query(input_query) 127 | assumption = next(res.pods).text 128 | answer = next(res.results).text 129 | return f'Assumption: {assumption} \nAnswer: {answer}' 130 | 131 | 132 | ''' 133 | Google Search 134 | 135 | Uses Google's Custom Search API to retrieve Google Search results. 136 | 137 | input_query - The query to search for. 138 | num_results - The number of results to return. 139 | api_key - Your Google API key. 140 | cse_id - Your Google Custom Search Engine ID. 141 | 142 | output - A list of dictionaries, each dictionary is a Google Search result 143 | ''' 144 | def custom_search(query, api_key, cse_id, **kwargs): 145 | service = build("customsearch", "v1", developerKey=api_key) 146 | res = service.cse().list(q=query, cx=cse_id, **kwargs).execute() 147 | return res['items'] 148 | 149 | def google_search(input_query: str): 150 | api_key = "YOUR_GOOGLE_API_KEY" 151 | cse_id = 'YOUR_GOOGLE_CSE_ID' 152 | num_results = 10 153 | metadata_results = [] 154 | results = custom_search(input_query, num=num_results, api_key=api_key, cse_id=cse_id) 155 | for result in results: 156 | metadata_result = { 157 | "snippet": result["snippet"], 158 | "title": result["title"], 159 | "link": result["link"], 160 | } 161 | metadata_results.append(metadata_result) 162 | return metadata_results 163 | 164 | 165 | ''' 166 | Bing Search 167 | 168 | Uses Bing's Custom Search API to retrieve Bing Search results. 169 | 170 | input_query: The query to search for. 171 | bing_subscription_key: Your Bing API key. 172 | num_results: The number of results to return. 
173 | 174 | output: A list of dictionaries, each dictionary is a Bing Search result 175 | ''' 176 | def _bing_search_results(search_term: str, bing_subscription_key: str, count: int): 177 | headers = {"Ocp-Apim-Subscription-Key": bing_subscription_key} 178 | params = { 179 | "q": search_term, 180 | "count": count, 181 | "textDecorations": True, 182 | "textFormat": "HTML", 183 | } 184 | response = requests.get( 185 | # "https://api.bing.microsoft.com/v7.0/search", headers=headers, params=params 186 | "https://api.bing.microsoft.com/", headers=headers, params=params 187 | ) 188 | response.raise_for_status() 189 | search_results = response.json() 190 | return search_results["webPages"]["value"] 191 | 192 | def bing_search(input_query: str): 193 | bing_subscription_key = "" 194 | num_results = 10 195 | metadata_results = [] 196 | results = _bing_search_results(input_query, bing_subscription_key, count=num_results) 197 | for result in results: 198 | metadata_result = { 199 | "snippet": result["snippet"], 200 | "title": result["name"], 201 | "link": result["url"], 202 | } 203 | metadata_results.append(metadata_result) 204 | return metadata_results 205 | 206 | 207 | # if __name__ == '__main__': 208 | # print(google_search('What is a dog?')) 209 | # Outputs a list of dictionaries, each dictionary is a Google Search result 210 | 211 | # print(bing_search('What is a dog?')) 212 | # Outputs a list of dictionaries, each dictionary is a Bing Search result -------------------------------------------------------------------------------- /src/wow_validity.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | from tqdm import tqdm 3 | from collections import Counter 4 | import copy 5 | import json 6 | import argparse 7 | import random 8 | random.seed(42) 9 | 10 | import numpy as np 11 | from factuality_metric import ner_metric, nli_metric_batch 12 | from src.claim_handling import obtain_important_ne 13 | from tools import WikiSearch 14 | 15 | import logging 16 | logging.basicConfig() 17 | logging.getLogger().setLevel(logging.ERROR) 18 | 19 | def read_hyp(hyp_path): 20 | hyps = [] 21 | with open(hyp_path, 'r') as infile: 22 | for line in infile: 23 | hyps.append(line.strip()) 24 | return hyps 25 | 26 | def read_IR_docs(IR_path): 27 | IR_docs = [] 28 | with open(IR_path, 'r') as infile: 29 | for line in infile: 30 | IR_docs.append(json.loads(line.strip())) 31 | return IR_docs 32 | 33 | def read_testfile(testfile): 34 | '''read testset from wow''' 35 | res = [] 36 | with open(testfile, 'r', encoding='utf-8') as r: 37 | for i, line in enumerate(r): 38 | parts = line.strip().split('\t') 39 | assert len(parts) == 4, parts 40 | res.append(parts) 41 | # topic, query, knowledge, response 42 | return res 43 | 44 | def boolean_string(s): 45 | if s.lower() not in {'false', 'true'}: 46 | raise ValueError('Not a valid boolean string') 47 | return s.lower() == 'true' 48 | 49 | def args_parser(): 50 | parser = argparse.ArgumentParser(description='Process some integers.') 51 | 52 | parser.add_argument('--hyp_path', type=str, default=None, help='path to generations to evaluate') 53 | parser.add_argument('--ref_path', type=str, default=None, help='path to generations to evaluate') 54 | parser.add_argument('--eval_num', type=int, default=-1) 55 | 56 | parser.add_argument('--use_IR_eval', action='store_true', help='Flag for saving some lm-gens with its metric for analysis') 57 | parser.add_argument('--retrieved_num', type=int, default=3) 58 | 
parser.add_argument('--wo_ground_truth_knowledge', type=boolean_string, default='False') 59 | 60 | parser.add_argument('--debug', type=boolean_string, default='False') 61 | parser.add_argument('--save_gen_for_analysis', action='store_true', help='Flag for saving some lm-gens with their metrics for analysis') 62 | 63 | args = parser.parse_args() 64 | return args 65 | 66 | def single_instance_eval(hyp, response, recall_list, args): 67 | # multiple pieces of evidence 68 | nli_contradict_prob, nli_entail_prob, nli_neutral_prob, nli_label = [], [], [], [] 69 | 70 | if args.use_IR_eval and args.retrieved_num: 71 | assert recall_list and len(recall_list) >= 10, f"len(recall_list) = {len(recall_list)}" 72 | recall_list = recall_list[:args.retrieved_num] 73 | 74 | # NLI: identify the evidence that gives the highest NLI entailment score 75 | premise_hypothesis_pairs = [[ev, hyp] for ev in [response] + recall_list] 76 | if len(premise_hypothesis_pairs) > 32: 77 | premise_hypothesis_pairs = premise_hypothesis_pairs[:32] 78 | bz = 8 79 | nli_probs, labels = [], [] 80 | for t in range((len(premise_hypothesis_pairs) - 1) // bz + 1): 81 | bz_nli_probs, bz_labels = nli_metric_batch(premise_hypothesis_pairs[t * bz: min((t + 1) * bz, len(premise_hypothesis_pairs))]) 82 | nli_probs.extend(bz_nli_probs) 83 | labels.extend(bz_labels) 84 | assert len(nli_probs) == len(premise_hypothesis_pairs) == len(labels), f"len(nli_probs) = {len(nli_probs)}, len(premise_hypothesis_pairs) = {len(premise_hypothesis_pairs)}, len(labels) = {len(labels)}" 85 | 86 | # [contradiction, neutral, entailment] 87 | entailment_argmax = np.argmax([nli_s[2] for nli_s in nli_probs]) 88 | max_prob = nli_probs[entailment_argmax] 89 | max_label = labels[entailment_argmax] 90 | 91 | nli_contradict_prob.append(max_prob[0]) 92 | nli_neutral_prob.append(max_prob[1]) 93 | nli_entail_prob.append(max_prob[2]) 94 | 95 | nli_label.append(max_label) 96 | # print (max_label, premise_hypothesis_pairs[entailment_argmax]) 97 | 98 | idx = nli_label.index(max(nli_label)) 99 | nli_label = max(nli_label) 100 | nli_contradict_prob = nli_contradict_prob[idx] 101 | nli_neutral_prob = nli_neutral_prob[idx] 102 | nli_entail_prob = nli_entail_prob[idx] 103 | 104 | eval_result_obj = { 105 | 'premise_hypothesis_pairs': premise_hypothesis_pairs, 106 | 'nli-label': nli_label, 107 | 'nli-contr': nli_contradict_prob, 108 | 'nli-entail': nli_entail_prob, 109 | 'nli-neutr': nli_neutral_prob 110 | } 111 | 112 | return eval_result_obj 113 | 114 | def main(args): 115 | 116 | # read hyp, ref, IR_docs 117 | hyps = read_hyp(args.hyp_path) 118 | IR_recalls = read_IR_docs(args.hyp_path + '_IR_docs') 119 | testset = read_testfile(args.ref_path) 120 | assert len(hyps) == len(testset) == len(IR_recalls) == 500, (len(hyps), len(testset), len(IR_recalls)) 121 | 122 | # DEBUG mode!
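# Note: with --debug set, only the first DEBUG_SAMPLE_SIZE (10) examples are evaluated and the
# per-example NLI results are printed for inspection, a quick sanity check before running the full 500-example test set.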
123 | if args.debug: 124 | DEBUG_SAMPLE_SIZE = 10 125 | hyps = hyps[:DEBUG_SAMPLE_SIZE] 126 | IR_recalls = IR_recalls[:DEBUG_SAMPLE_SIZE] 127 | testset = testset[:DEBUG_SAMPLE_SIZE] 128 | 129 | final_contradict_prob, final_neutral_prob, final_entail_prob, all_nli_labels = [], [], [], [] 130 | all_analysis_list = [] 131 | 132 | for i in tqdm(range(len(hyps))): 133 | hyp, example, recall_list = hyps[i], testset[i], IR_recalls[i] 134 | response = example[3] 135 | 136 | res_obj = single_instance_eval(hyp, response, recall_list, args) 137 | if args.debug: 138 | print ('==' * 20) 139 | print (res_obj) 140 | 141 | final_contradict_prob.append(res_obj['nli-contr']) 142 | final_neutral_prob.append(res_obj['nli-neutr']) 143 | final_entail_prob.append(res_obj['nli-entail']) 144 | all_nli_labels.append(res_obj['nli-label']) 145 | all_analysis_list.append(res_obj) 146 | 147 | # analysis 148 | avg_contradict_prob = np.mean(final_contradict_prob) 149 | avg_neutral_prob = np.mean(final_neutral_prob) 150 | avg_entail_prob = np.mean(final_entail_prob) 151 | 152 | print("AVG PROBS: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format(avg_contradict_prob*100, avg_neutral_prob*100, avg_entail_prob*100)) 153 | 154 | nli_contradict_class_ratio, nli_neutral_class_ratio, nli_entail_class_ratio = 0, 0, 0 155 | 156 | nli_counter = Counter(all_nli_labels) 157 | 158 | nli_contradict_class_ratio=nli_counter[0]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 159 | nli_neutral_class_ratio=nli_counter[1]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 160 | nli_entail_class_ratio=nli_counter[2]/(nli_counter[0]+nli_counter[1]+nli_counter[2]) 161 | 162 | print("NLI CLASS %: Contradict: {:.2f}%, Neutral: {:.2f}%, Entail: {:.2f}%".format( 163 | nli_contradict_class_ratio*100, 164 | nli_neutral_class_ratio*100, 165 | nli_entail_class_ratio*100 166 | )) 167 | 168 | res_path = args.hyp_path + '_factuality_results.txt' 169 | with open(res_path, 'a') as outfile: 170 | res_obj = { 171 | 'Contradict_probs': avg_contradict_prob, 172 | 'Neutral_probs': avg_neutral_prob, 173 | 'Entail_probs': avg_entail_prob, 174 | "nli_contradict_class_ratio": nli_contradict_class_ratio, 175 | "nli_neutral_class_ratio": nli_neutral_class_ratio, 176 | "nli_entail_class_ratio": nli_entail_class_ratio, 177 | } 178 | json.dump(res_obj, outfile) 179 | outfile.write("\n") 180 | 181 | ana_path = args.hyp_path + '_analysis.txt' 182 | with open(ana_path, 'a') as outfile: 183 | json.dump(all_analysis_list, outfile) 184 | outfile.write("\n") 185 | 186 | 187 | if __name__ == '__main__': 188 | args = args_parser() 189 | main(args) 190 | --------------------------------------------------------------------------------
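
For reference, the following is a minimal sketch of the three input files that `src/wow_validity.py` expects (via `read_hyp`, `read_IR_docs`, and `read_testfile`). All file names and contents below are hypothetical illustrations only; the actual evaluation, presumably driven by `scripts/wow_validity.sh`, assumes the full 500-example random test set, a companion `<hyp_path>_IR_docs` file, and at least 10 retrieved passages per example when `--use_IR_eval` is set.

```python
# A minimal sketch of the input formats expected by src/wow_validity.py.
# All file names here are hypothetical and for illustration only.
import json

# 1) --hyp_path: one generated response per line (read_hyp).
with open("toy_hyp.txt", "w") as f:
    f.write("The Eiffel Tower is a wrought-iron tower located in Paris, France.\n")

# 2) "<hyp_path>_IR_docs": one JSON list of retrieved passages per line (read_IR_docs).
#    With --use_IR_eval, the script asserts at least 10 passages per example and uses
#    only the top --retrieved_num of them as extra NLI premises.
passages = [f"Retrieved passage {i} about the Eiffel Tower." for i in range(10)]
with open("toy_hyp.txt_IR_docs", "w") as f:
    f.write(json.dumps(passages) + "\n")

# 3) --ref_path: tab-separated lines with exactly four fields (read_testfile):
#    topic \t query \t knowledge \t response
with open("toy_wow_testset.txt", "w") as f:
    f.write("Eiffel Tower\tWhere is the Eiffel Tower?\t"
            "The Eiffel Tower is on the Champ de Mars in Paris.\t"
            "It stands on the Champ de Mars in Paris.\n")

# The real run expects the 500-example random test set, invoked along the lines of:
# python src/wow_validity.py --hyp_path <generations> --ref_path <testset> --use_IR_eval
```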