├── .gitignore
├── assets
    ├── fig_3.png
    ├── fig_4.png
    ├── tab_1.png
    ├── tab_2.png
    ├── tab_3.png
    └── fig_1_2.png
├── requirements.txt
├── data_generation
    ├── prompt_bank
    │   ├── fact_generation_en.txt
    │   ├── fact_to_tests_en.txt
    │   └── fact_enhance_classify_en.txt
    ├── per_instance_query.py
    ├── inconsistency_processing.py
    └── post_process.py
├── examination
    ├── templates.py
    ├── dispatch_openai_requests.py
    ├── hallucination
    │   ├── get_metric.py
    │   └── run_eval.py
    ├── predict.py
    └── utils.py
├── eval
    └── gpt_judge
    │   ├── show_results.py
    │   ├── gen_answer.py
    │   ├── gen_summary.py
    │   ├── gpt_judge_prompt.jsonl
    │   └── gpt_judge.py
├── LICENSE
├── train
    └── train.py
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | /data/
2 | 


--------------------------------------------------------------------------------
/assets/fig_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fanqiwan/KCA/HEAD/assets/fig_3.png


--------------------------------------------------------------------------------
/assets/fig_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fanqiwan/KCA/HEAD/assets/fig_4.png


--------------------------------------------------------------------------------
/assets/tab_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fanqiwan/KCA/HEAD/assets/tab_1.png


--------------------------------------------------------------------------------
/assets/tab_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fanqiwan/KCA/HEAD/assets/tab_2.png


--------------------------------------------------------------------------------
/assets/tab_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fanqiwan/KCA/HEAD/assets/tab_3.png


--------------------------------------------------------------------------------
/assets/fig_1_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fanqiwan/KCA/HEAD/assets/fig_1_2.png


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | accelerate==0.24.0
 2 | datasets==2.13.1
 3 | deepspeed==0.11.1
 4 | evaluate==0.4.0
 5 | flash-attn==2.3.3
 6 | openai==1.3.3
 7 | ray==2.8.0
 8 | torch==2.1.0
 9 | transformers==4.36.2
10 | 


--------------------------------------------------------------------------------
/data_generation/prompt_bank/fact_generation_en.txt:
--------------------------------------------------------------------------------
 1 | Please provide the corresponding detailed facts/knowledge based on the given instruction,  the analysis of factual information requirement judgment, optional knowledge elements and an optional gold answer. Note that the provided details should not  simply be the answer of knowledge elements. Instead, it should cover a holistic background knowledge required for a layman to fulfill the needs of the instructions.
 2 | 
 3 | #Instruction:
 4 | {input}
 5 | 
 6 | #Analysis of Factual Information Requirement Judgment:
 7 | {analysis}
 8 | 
 9 | #Knowledge Elements
10 | {queries}
11 | 
12 | #Gold Answer:
13 | {output}
14 | 
15 | Please now provide the detailed facts/knowledge:


--------------------------------------------------------------------------------
/data_generation/prompt_bank/fact_to_tests_en.txt:
--------------------------------------------------------------------------------
 1 | Try to come up with three multi-choice questions that ground on the knowledge provided below. The questions should be in the format as this illustrative example, and mandatory fields include #Question, #Options, #Analysis, #Answer:
 2 | 
 3 | Below is a illustrative example of the format to follow.
 4 | <start_of_demonstration>
 5 | #Question:
 6 | One of the reasons that the government discourages and regulates monopolies is that
 7 | 
 8 | #Options:
 9 | (A) producer surplus is lost and consumer surplus is gained.
10 | (B) monopoly prices ensure productive efficiency but cost society allocative efficiency.
11 | (C) monopoly firms do not engage in significant research and development.
12 | (D) consumer surplus is lost with higher prices and lower levels of output.
13 | 
14 | #Analysis:
15 | The government discourages and regulates monopolies primarily because they result in a loss of consumer surplus through higher prices and lower levels of output. Monopolies can charge higher prices due to their market dominance, reducing consumer welfare and surplus. Additionally, they often restrict output to maintain higher prices, further diminishing consumer access to goods and services. This is why the government intervenes to promote competition and protect consumer interests. In summary, the government discourages and regulates monopolies because consumer surplus is lost with higher prices and lower levels of output, as indicated in option (D).
16 | 
17 | #Answer:
18 | (D)
19 | </start_of_demonstration>
20 | 
21 | Below is the grounding knowledge of the questions that you are required to provide.
22 | 
23 | <start_of_knowledge>
24 | {knowledge}
25 | </start_of_knowledge>
26 | 
27 | Please now provide the three multi-choice questions grounding on the above knowledge in the required format:


--------------------------------------------------------------------------------
/examination/templates.py:
--------------------------------------------------------------------------------
 1 | 
 2 | def create_prompt_with_tulu_chat_format(messages, bos="<s>", eos="</s>", add_bos=True):
 3 |     formatted_text = ""
 4 |     for message in messages:
 5 |         if message["role"] == "system":
 6 |             formatted_text += "<|system|>\n" + message["content"] + "\n"
 7 |         elif message["role"] == "user":
 8 |             formatted_text += "<|user|>\n" + message["content"] + "\n"
 9 |         elif message["role"] == "assistant":
10 |             formatted_text += "<|assistant|>\n" + message["content"].strip() + eos + "\n"
11 |         else:
12 |             raise ValueError(
13 |                 "Tulu chat template only supports 'system', 'user' and 'assistant' roles. Invalid role: {}.".format(message["role"])
14 |                 )
15 |     formatted_text += "<|assistant|>\n"
16 |     formatted_text = bos + formatted_text if add_bos else formatted_text
17 |     return formatted_text
18 | 
19 | 
20 | def create_prompt_with_llama2_chat_format(messages, bos="<s>", eos="</s>", add_bos=True):
21 |     '''
22 |     This function is adapted from the official llama2 chat completion script: 
23 |     https://github.com/facebookresearch/llama/blob/7565eb6fee2175b2d4fe2cfb45067a61b35d7f5e/llama/generation.py#L274
24 |     '''
25 |     B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
26 |     B_INST, E_INST = "[INST]", "[/INST]"
27 |     formatted_text = ""
28 |     # If you want to include system prompt, see this discussion for the template: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/discussions/4
29 |     # However, see here that removing the system prompt actually reduce the false refusal rates: https://github.com/facebookresearch/llama/blob/main/UPDATES.md?utm_source=twitter&utm_medium=organic_social&utm_campaign=llama2&utm_content=text#observed-issue
30 |     if messages[0]["role"] == "system":
31 |         assert len(messages) >= 2 and messages[1]["role"] == "user", "LLaMa2 chat cannot start with a single system message."
32 |         messages = [{
33 |             "role": "user",
34 |             "content": B_SYS + messages[0]["content"] + E_SYS + messages[1]["content"]
35 |         }] + messages[2:]
36 |     for message in messages:
37 |         if message["role"] == "user":
38 |             formatted_text += bos + f"{B_INST} {(message['content']).strip()} {E_INST}"
39 |         elif message["role"] == "assistant":
40 |             formatted_text += f" {(message['content'])} " + eos
41 |         else:
42 |             raise ValueError(
43 |                 "Llama2 chat template only supports 'system', 'user' and 'assistant' roles. Invalid role: {}.".format(message["role"])
44 |                 )
45 |     # The llama2 chat template by default has a bos token at the start of each user message.
46 |     # The next line removes the bos token if add_bos is False.
47 |     formatted_text = formatted_text[len(bos):] if not add_bos else formatted_text
48 |     return formatted_text
49 | 


--------------------------------------------------------------------------------
/examination/dispatch_openai_requests.py:
--------------------------------------------------------------------------------
 1 | '''
 2 | This file is copied and modified from https://gist.github.com/neubig/80de662fb3e225c18172ec218be4917a.
 3 | Thanks to Graham Neubig for sharing the original code.
 4 | '''
 5 | 
 6 | import openai
 7 | import asyncio
 8 | from typing import Any, List, Dict
 9 | 
10 | 
11 | async def dispatch_openai_chat_requesets(
12 |     messages_list: List[List[Dict[str,Any]]],
13 |     model: str,
14 |     **completion_kwargs: Any,
15 | ) -> List[str]:
16 |     """Dispatches requests to OpenAI chat completion API asynchronously.
17 |     
18 |     Args:
19 |         messages_list: List of messages to be sent to OpenAI chat completion API.
20 |         model: OpenAI model to use.
21 |         completion_kwargs: Keyword arguments to be passed to OpenAI ChatCompletion API. See https://platform.openai.com/docs/api-reference/chat for details.
22 |     Returns:
23 |         List of responses from OpenAI API.
24 |     """
25 |     async_responses = [
26 |         openai.ChatCompletion.acreate(
27 |             model=model,
28 |             messages=x,
29 |             **completion_kwargs,
30 |         )
31 |         for x in messages_list
32 |     ]
33 |     return await asyncio.gather(*async_responses)
34 | 
35 | 
36 | async def dispatch_openai_prompt_requesets(
37 |     prompt_list: List[str],
38 |     model: str,
39 |     **completion_kwargs: Any,
40 | ) -> List[str]:
41 |     """Dispatches requests to OpenAI text completion API asynchronously.
42 |     
43 |     Args:
44 |         prompt_list: List of prompts to be sent to OpenAI text completion API.
45 |         model: OpenAI model to use.
46 |         completion_kwargs: Keyword arguments to be passed to OpenAI text completion API. See https://platform.openai.com/docs/api-reference/completions for details.
47 |     Returns:
48 |         List of responses from OpenAI API.
49 |     """
50 |     async_responses = [
51 |         openai.Completion.acreate(
52 |             model=model,
53 |             prompt=x,
54 |             **completion_kwargs,
55 |         )
56 |         for x in prompt_list
57 |     ]
58 |     return await asyncio.gather(*async_responses)
59 | 
60 | 
61 | if __name__ == "__main__":
62 |     chat_completion_responses = asyncio.run(
63 |         dispatch_openai_chat_requesets(
64 |             messages_list=[
65 |                 [{"role": "user", "content": "Write a poem about asynchronous execution."}],
66 |                 [{"role": "user", "content": "Write a poem about asynchronous pirates."}],
67 |             ],
68 |             model="gpt-3.5-turbo",
69 |             temperature=0.3,
70 |             max_tokens=200,
71 |             top_p=1.0,
72 | 
73 |         )
74 |     )
75 | 
76 |     for i, x in enumerate(chat_completion_responses):
77 |         print(f"Chat completion response {i}:\n{x['choices'][0]['message']['content']}\n\n")
78 | 
79 | 
80 |     prompt_completion_responses = asyncio.run(
81 |         dispatch_openai_prompt_requesets(
82 |             prompt_list=[
83 |                 "Write a poem about asynchronous execution.\n",
84 |                 "Write a poem about asynchronous pirates.\n",
85 |             ],
86 |             model="text-davinci-003",
87 |             temperature=0.3,
88 |             max_tokens=200,
89 |             top_p=1.0,
90 |         )
91 |     )
92 | 
93 |     for i, x in enumerate(prompt_completion_responses):
94 |         print(f"Prompt completion response {i}:\n{x['choices'][0]['text']}\n\n")


--------------------------------------------------------------------------------
/examination/hallucination/get_metric.py:
--------------------------------------------------------------------------------
 1 | """Obtain examination accuracy."""
 2 | 
 3 | import json
 4 | 
 5 | 
 6 | def read_jsonl(file_path):
 7 |     data = []
 8 |     with open(file_path, 'r') as f:
 9 |         for line in f:
10 |             data.append(json.loads(line))
11 |     return data
12 | 
13 | 
14 | def extract_question_id(idx):
15 |     return idx.replace("_test0", "").replace("_test1", "").replace("_test2", "")
16 | 
17 | 
18 | def compute_question_accuracy(data, threshold):
19 |     question_correct = {}
20 |     question_total = {}
21 | 
22 |     for item in data:
23 |         question_id = extract_question_id(item["idx"])
24 |         correct = item["correct"]
25 | 
26 |         if question_id not in question_total:
27 |             question_total[question_id] = 0
28 |             question_correct[question_id] = 0
29 | 
30 |         question_total[question_id] += 1
31 |         if correct == "True":
32 |             question_correct[question_id] += 1
33 | 
34 |     question_accuracy = {}
35 |     for question_id in question_total:
36 |         accuracy = question_correct[question_id] / question_total[question_id]
37 |         question_accuracy[question_id] = accuracy > threshold
38 |     return question_accuracy
39 | 
40 | 
41 | def compute_dataset_accuracy(question_accuracy):
42 |     total_questions = len(question_accuracy)
43 |     correct_questions = sum(1 for q in question_accuracy.values() if q)
44 |     return correct_questions / total_questions
45 | 
46 | 
47 | def main():
48 |     dataset_mapping = {
49 |         "train": ["wizardlm_alpaca_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize"],
50 |         "test": ["lima_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize",
51 |                  "vicuna_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize",
52 |                  "wizardlm_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize",],
53 |         "test_truth": ["truthfulqa_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize"],
54 |     }
55 |     global_path = "./data"
56 |     for split in ["train", "test", "test_truth"]:
57 |         for model_name in ["pythia-6.9b", "llama-2-7b", "mistral-7b-v0.1", "llama-2-13b"]:
58 |             for shot in ["5"]:
59 |                 for dataset in dataset_mapping[split]:
60 |                     file_path = f"{global_path}/examination/output/{split}/{model_name}/{shot}-shot/{dataset}.jsonl"
61 |                     sft_instance_behavior = dict()
62 |                     sft_instance_metric = dict()
63 |                     for threshold in [0.3, 0.6, 0.9]:
64 |                         try:
65 |                             data = read_jsonl(file_path)
66 |                             question_accuracy = compute_question_accuracy(data, threshold)
67 |                             dataset_accuracy = compute_dataset_accuracy(question_accuracy)
68 |                             sft_instance_behavior[f'threshold_{threshold}'] = question_accuracy
69 |                             sft_instance_metric[f'threshold_{threshold}'] = dataset_accuracy
70 |                             print(f"sft-instance-level acc (threshold {threshold}): {dataset_accuracy * 100:.2f}%")
71 |                         except Exception as e:
72 |                             print(e)
73 |                     with open(file_path.replace('.jsonl', '_sft_instance_behavior.json'), "w") as fout:
74 |                         json.dump(sft_instance_behavior, fout)
75 |                     with open(file_path.replace('.jsonl', '_sft_instance_metric.json'), "w") as fout:
76 |                         json.dump(sft_instance_metric, fout)
77 | 
78 | 
79 | if __name__ == "__main__":
80 |     main()
81 | 


--------------------------------------------------------------------------------
/eval/gpt_judge/show_results.py:
--------------------------------------------------------------------------------
 1 | """Calculate final evaluation results."""
 2 | 
 3 | import json
 4 | import os
 5 | 
 6 | 
 7 | def get_json_list(file_path):
 8 |     file_path = os.path.expanduser(file_path)
 9 |     with open(file_path, 'r') as f:
10 |         json_list = []
11 |         for line in f:
12 |             json_list.append(json.loads(line))
13 |         return json_list
14 | 
15 | 
16 | def calculate_effectiveness_statistic(parse_results):
17 |     all_scores = []
18 |     scores = [_["judge_score"] for _ in parse_results]
19 |     classes = [_["class"] for _ in parse_results]
20 |     error_cnt = 0
21 |     result = dict()
22 |     for i in range(len(classes)):
23 |         if scores[i] == -100:
24 |             error_cnt += 1
25 |             continue
26 |         all_scores.append(scores[i])
27 |     result["all_scores"] = sum(all_scores) / len(all_scores) if len(all_scores) != 0 else 0
28 |     result["error_cnt"] = error_cnt
29 |     return result
30 | 
31 | 
32 | def calculate_hallucination_classification_statistic(parse_results):
33 |     all_scores = []
34 |     scores = [_["judge_score"] for _ in parse_results]
35 |     classes = [_["class"] for _ in parse_results]
36 |     result = dict()
37 |     error_cnt = 0
38 |     for i in range(len(classes)):
39 |         if scores[i] == -100:
40 |             if parse_results[i]["judge_result"] != "":
41 |                 scores[i] = 1
42 |             else:
43 |                 error_cnt += 1
44 |                 continue
45 |         all_scores.append(scores[i])
46 |         result["all_scores"] = sum(all_scores) / len(all_scores) if len(all_scores) != 0 else 0
47 |         result["error_cnt"] = error_cnt
48 |     return result
49 | 
50 | 
51 | def gpt_judge_statistics_func(input_file, judge_type):
52 |     reviews = get_json_list(input_file)
53 |     if judge_type == "effectiveness_judge":
54 |         judge_statistics = calculate_effectiveness_statistic(reviews)
55 |     elif judge_type == "hallucination_judge":
56 |         judge_statistics = calculate_hallucination_classification_statistic(reviews)
57 |     else:
58 |         raise NotImplementedError
59 |     return judge_statistics
60 | 
61 | 
62 | if __name__ == "__main__":
63 |     all_statistics = dict()
64 |     for model in ["pythia-6.9b", "llama-2-7b", "mistral-7b-v0.1", "llama-2-13b"]:
65 |         for shot in ["5"]:
66 |             for fact_type in ["baseline", "openbook", "drop", "sorry"]:
67 |                 for testset in ["lima_testset", "vicuna_testset", "wizardlm_testset", "truthfulqa_test_truthset"]:
68 |                     for trainset in ["wizardlm_trainset"]:
69 |                         for judge_type in ["hallucination_judge", "effectiveness_judge"]:
70 |                             if fact_type == "baseline":
71 |                                 trainset = trainset.replace("wizardlm_trainset", "wizardlm_alpaca_train")
72 |                             global_path = "./evaluation_results/review_greedy"
73 |                             if fact_type == "drop" or fact_type == "openbook" or fact_type == "sorry":
74 |                                 input_file = f"{global_path}/data-{model}_shot-{shot}_{testset}_model-{model}_shot-{shot}_{trainset}_{fact_type}_{judge_type}_greedy.jsonl"
75 |                             elif fact_type == "baseline":
76 |                                 input_file = f"{global_path}/data-{model}_shot-{shot}_{testset}_model-baseline_{model}_{trainset}_{judge_type}_greedy.jsonl"
77 |                             else:
78 |                                 raise NotImplementedError
79 |                             try:
80 |                                 judge_statistics = gpt_judge_statistics_func(input_file, judge_type)
81 |                                 all_statistics[f"{model}_{shot}_{testset}_{trainset}_{judge_type}_{fact_type}"] = judge_statistics
82 |                             except Exception as e:
83 |                                 print(e)
84 |     print(json.dumps(all_statistics, indent=2))


--------------------------------------------------------------------------------
/eval/gpt_judge/gen_answer.py:
--------------------------------------------------------------------------------
  1 | """Generate model answer for chat task."""
  2 | 
  3 | import argparse
  4 | from transformers import AutoTokenizer, AutoModelForCausalLM
  5 | import torch
  6 | import os
  7 | import json
  8 | from tqdm import tqdm
  9 | import shortuuid
 10 | import ray
 11 | 
 12 | from fastchat.model import get_conversation_template
 13 | 
 14 | 
 15 | def run_eval(model_path, model_id, conv_temp, question_file, answer_file, num_gpus, do_sample=False):
 16 |     if question_file.endswith(".jsonl"):
 17 |         ques_jsons = []
 18 |         with open(os.path.expanduser(question_file), "r") as ques_file:
 19 |             for line in ques_file:
 20 |                 ques_jsons.append(json.loads(line))
 21 |     else:
 22 |         ques_jsons = json.loads(open(question_file, "r").read())
 23 |     chunk_size = len(ques_jsons) // num_gpus
 24 |     ans_handles = []
 25 |     for i in range(0, len(ques_jsons), chunk_size):
 26 |         ans_handles.append(
 27 |             get_model_answers.remote(
 28 |                 model_path, model_id, conv_temp, ques_jsons[i: i + chunk_size], do_sample
 29 |             )
 30 |         )
 31 | 
 32 |     ans_jsons = []
 33 |     for ans_handle in ans_handles:
 34 |         ans_jsons.extend(ray.get(ans_handle))
 35 | 
 36 |     with open(os.path.expanduser(answer_file), "w") as ans_file:
 37 |         for line in ans_jsons:
 38 |             ans_file.write(json.dumps(line) + "\n")
 39 | 
 40 | 
 41 | @ray.remote(num_gpus=1)
 42 | @torch.inference_mode()
 43 | def get_model_answers(model_path, model_id, conv_temp, question_jsons, do_sample=False):
 44 |     model_path = os.path.expanduser(model_path)
 45 |     tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False if "pythia" not in model_path else True)
 46 |     model = AutoModelForCausalLM.from_pretrained(
 47 |         model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16
 48 |     ).cuda()
 49 | 
 50 |     ans_jsons = []
 51 |     for i, line in enumerate(tqdm(question_jsons)):
 52 |         ques_json = line
 53 |         idx = ques_json["id"]
 54 |         classes = ques_json["class"]
 55 |         qs = ques_json["conversations"][0]["value"]
 56 |         conv = get_conversation_template(conv_temp)
 57 |         conv.append_message(conv.roles[0], qs)
 58 |         conv.append_message(conv.roles[1], None)
 59 |         prompt = conv.get_prompt()
 60 |         input_ids = tokenizer([prompt]).input_ids
 61 |         output_ids = model.generate(
 62 |             torch.as_tensor(input_ids).cuda(),
 63 |             do_sample=do_sample,
 64 |             temperature=0.7,
 65 |             max_new_tokens=1024,
 66 |         )
 67 |         output_ids = output_ids[0][len(input_ids[0]):]
 68 |         outputs = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
 69 |         ans_id = shortuuid.uuid()
 70 |         ans_jsons.append(
 71 |             {
 72 |                 "id": idx,
 73 |                 "class": classes,
 74 |                 "question": qs,
 75 |                 "answer": outputs,
 76 |                 "answer_id": ans_id,
 77 |                 "model_id": model_id,
 78 |                 "metadata": {"do_sample": do_sample, "temperature": 0.7, "max_new_tokens": 1024},
 79 |             }
 80 |         )
 81 |     return ans_jsons
 82 | 
 83 | 
 84 | if __name__ == "__main__":
 85 |     parser = argparse.ArgumentParser()
 86 |     parser.add_argument("--model-path", type=str, required=True)
 87 |     parser.add_argument("--model-id", type=str, required=True)
 88 |     parser.add_argument("--conv-temp", type=str, default="vicuna")
 89 |     parser.add_argument("--question-file", type=str, required=True)
 90 |     parser.add_argument("--answer-file", type=str, default="answer.jsonl")
 91 |     parser.add_argument("--num-gpus", type=int, default=1)
 92 |     parser.add_argument("--do-sample", action="store_true")
 93 |     args = parser.parse_args()
 94 | 
 95 |     ray.init()
 96 |     run_eval(
 97 |         args.model_path,
 98 |         args.model_id,
 99 |         args.conv_temp,
100 |         args.question_file,
101 |         args.answer_file,
102 |         args.num_gpus,
103 |         args.do_sample,
104 |     )
105 | 


--------------------------------------------------------------------------------
/eval/gpt_judge/gen_summary.py:
--------------------------------------------------------------------------------
  1 | """Generate model answer for summary task."""
  2 | 
  3 | import argparse
  4 | from transformers import AutoTokenizer, AutoModelForCausalLM
  5 | import torch
  6 | import os
  7 | import json
  8 | from tqdm import tqdm
  9 | import shortuuid
 10 | import ray
 11 | import datasets
 12 | 
 13 | from fastchat.model import get_conversation_template
 14 | 
 15 | 
 16 | def compute_metrics(ans_jsons, no_sorry):
 17 |     rouge = datasets.load_metric('rouge')
 18 |     if no_sorry:
 19 |         print("Delete sorry answers...")
 20 |         old_len = len(ans_jsons)
 21 |         ans_jsons = [line for line in ans_jsons if "Sorry, I don't know the factual information required to answer this question." not in line["answer"]]
 22 |         print(f"Delete {old_len - len(ans_jsons)} examples.")
 23 |     predictions = [line["answer"] for line in ans_jsons]
 24 |     references = [line["reference_answers"][-1] for line in ans_jsons]
 25 |     rouge_results = rouge.compute(predictions=predictions, references=references)
 26 |     return {"ROUGE-1": round(rouge_results["rouge1"].mid.fmeasure * 100, 2),
 27 |             "ROUGE-2": round(rouge_results["rouge2"].mid.fmeasure * 100, 2),
 28 |             "ROUGE-L": round(rouge_results["rougeL"].mid.fmeasure * 100, 2),
 29 |             "ROUGE-Lsum": round(rouge_results["rougeLsum"].mid.fmeasure * 100, 2),
 30 |             }
 31 | 
 32 | 
 33 | def run_eval(model_path, model_id, conv_temp, question_file, answer_file, metric_file, num_gpus, do_sample=False, no_sorry=False):
 34 |     if question_file.endswith(".jsonl"):
 35 |         ques_jsons = []
 36 |         with open(os.path.expanduser(question_file), "r") as ques_file:
 37 |             for line in ques_file:
 38 |                 ques_jsons.append(json.loads(line))
 39 |     else:
 40 |         ques_jsons = json.loads(open(question_file, "r").read())
 41 |     chunk_size = len(ques_jsons) // num_gpus
 42 |     ans_handles = []
 43 |     for i in range(0, len(ques_jsons), chunk_size):
 44 |         ans_handles.append(
 45 |             get_model_answers.remote(
 46 |                 model_path, model_id, conv_temp, ques_jsons[i: i + chunk_size], do_sample
 47 |             )
 48 |         )
 49 | 
 50 |     ans_jsons = []
 51 |     for ans_handle in ans_handles:
 52 |         ans_jsons.extend(ray.get(ans_handle))
 53 | 
 54 |     with open(os.path.expanduser(answer_file), "w") as ans_file:
 55 |         for line in ans_jsons:
 56 |             ans_file.write(json.dumps(line) + "\n")
 57 | 
 58 |     evaluation_results = compute_metrics(ans_jsons, no_sorry)
 59 |     print(json.dumps(evaluation_results, indent=2))
 60 | 
 61 |     if metric_file:
 62 |         os.makedirs(metric_file.replace("metrics.json", ""), exist_ok=True)
 63 | 
 64 |         with open(os.path.expanduser(metric_file), "w") as met_file:
 65 |             json.dump(evaluation_results, met_file, indent=2)
 66 | 
 67 | 
 68 | @ray.remote(num_gpus=1)
 69 | @torch.inference_mode()
 70 | def get_model_answers(model_path, model_id, conv_temp, question_jsons, do_sample=False):
 71 |     model_path = os.path.expanduser(model_path)
 72 |     tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False if "pythia" not in model_path else True)
 73 |     model = AutoModelForCausalLM.from_pretrained(
 74 |         model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16
 75 |     ).cuda()
 76 | 
 77 |     ans_jsons = []
 78 |     for i, line in enumerate(tqdm(question_jsons)):
 79 |         ques_json = line
 80 |         idx = ques_json["idx"]
 81 |         qs = ques_json["Instruction"]
 82 |         conv = get_conversation_template(conv_temp)
 83 |         conv.append_message(conv.roles[0], qs)
 84 |         conv.append_message(conv.roles[1], None)
 85 |         prompt = conv.get_prompt()
 86 |         input_ids = tokenizer([prompt]).input_ids
 87 |         output_ids = model.generate(
 88 |             torch.as_tensor(input_ids).cuda(),
 89 |             do_sample=do_sample,
 90 |             temperature=0.7,
 91 |             max_new_tokens=1024,
 92 |         )
 93 |         output_ids = output_ids[0][len(input_ids[0]):]
 94 |         outputs = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
 95 |         ans_id = shortuuid.uuid()
 96 |         ans_jsons.append(
 97 |             {
 98 |                 "id": idx,
 99 |                 "class": "summary",
100 |                 "question": qs,
101 |                 "reference_answers": ques_json["Reference_Answers"],
102 |                 "answer": outputs,
103 |                 "answer_id": ans_id,
104 |                 "model_id": model_id,
105 |                 "metadata": {"do_sample": do_sample, "temperature": 0.7, "max_new_tokens": 1024},
106 |             }
107 |         )
108 |     return ans_jsons
109 | 
110 | 
111 | if __name__ == "__main__":
112 |     parser = argparse.ArgumentParser()
113 |     parser.add_argument("--model-path", type=str, required=True)
114 |     parser.add_argument("--model-id", type=str, required=True)
115 |     parser.add_argument("--conv-temp", type=str, default="vicuna")
116 |     parser.add_argument("--question-file", type=str, required=True)
117 |     parser.add_argument("--answer-file", type=str, default="answer.jsonl")
118 |     parser.add_argument("--metric-file", type=str, default="metrics.json")
119 |     parser.add_argument("--num-gpus", type=int, default=1)
120 |     parser.add_argument("--do-sample", action="store_true")
121 |     parser.add_argument("--no-sorry", action="store_true")
122 |     args = parser.parse_args()
123 | 
124 |     ray.init()
125 |     run_eval(
126 |         args.model_path,
127 |         args.model_id,
128 |         args.conv_temp,
129 |         args.question_file,
130 |         args.answer_file,
131 |         args.metric_file,
132 |         args.num_gpus,
133 |         args.do_sample,
134 |         args.no_sorry,
135 |     )
136 | 


--------------------------------------------------------------------------------
/data_generation/prompt_bank/fact_enhance_classify_en.txt:
--------------------------------------------------------------------------------
  1 | Help me complete a task: Factual Information Requirement Judgment. This task targets questions that require objective, accurate, verifiable information to answer, such as historical events, scientific knowledge, statistical data, etc. For each user command, you need to first understand the intent and demand of the command, then judge whether factual information is needed to answer it.
  2 | 
  3 | Specific scenarios that require factual information retrieval include:
  4 | 1. Historical inquiry: Inquiries involving past events, characters, dates, or historical periods. Usually requires obtaining detailed information about the time, place, cause, and impact of historical events.
  5 | 2. Scientific knowledge: Inquiries involving the basic principles, concepts, data, and research results of natural sciences (such as physics, chemistry, biology) or social sciences (such as psychology, economics).
  6 | 3. Statistical data: Inquiries involving the collection and analysis of numerical data, typically used to describe and explain a phenomenon or trend, such as population statistics, economic indicators, or social surveys.
  7 | 4. Technical details: Inquiries involving the specific specifications and functions of products, services, or technologies, such as the performance parameters of electronic devices, software version information, or application details of engineering technologies.
  8 | 5. Geographic information: Inquiries involving geographical locations, terrains, landmarks, countries, or regions, including but not limited to maps, coordinates, climate, and population distribution.
  9 | 6. News events: Inquiries involving the latest or recently occurred events, including political, economic, social, cultural news reports, and background analysis.
 10 | 7. Laws and regulations: Inquiries involving laws, regulations, ordinances, precedents, or related judicial interpretations, usually requires understanding the content, scope of application, and legal effects of legal provisions.
 11 | 8. Health and medicine: Inquiries involving human health, diseases, medicines, treatment methods, or medical research, usually including symptom descriptions, diagnostic methods, and treatment suggestions.
 12 | 9. Economic data: Inquiries involving economic activities, market data, currency exchange rates, stock prices, or financial reports, usually used for analyzing and predicting economic trends and market behavior.
 13 | 10. Education information: Inquiries involving educational institutions, courses, academic degrees, admission requirements, or educational policies, usually requires understanding the distribution of educational resources and education standards.
 14 | 11. Personal information: Related to specific individuals, their life, major achievements, important events, etc., including the relationships between two or more individuals, specific statements, or views of a person.
 15 | 
 16 | Use the following symbols to represent judgment results:
 17 | <need>: factual information needed
 18 | <no need>: factual information not needed
 19 | If the judgment is that factual information is needed, you need to give a corresponding search query in the result.
 20 | 
 21 | ###
 22 | <start_of_demonstration>
 23 | #Command:
 24 | Who was the first president of the United States?
 25 | #Analysis:
 26 | This information is objective and verifiable, so factual information is needed to answer.
 27 | #Prediction:
 28 | <need>
 29 | 
 30 | #Command:
 31 | Write a poem in the style of the Tang Dynasty on the theme of water.
 32 | #Analysis:
 33 | This command asks to create a poem, requires an understanding of the style of Tang Dynasty poetry, but it's primarily a creative task and doesn't require factual information retrieval.
 34 | #Prediction:
 35 | <no need>
 36 | 
 37 | #Command:
 38 | Let's play a game of idioms, I'll start: "as one wishes."
 39 | #Analysis:
 40 | This command asks to participate in an idiom game, which requires language generation and understanding capabilities, and knowledge of idioms, but does not require the retrieval of specific factual information.
 41 | #Prediction:
 42 | <no need>
 43 | 
 44 | #Command:
 45 | The origin of the idiom "the foolish old man who moved the mountains," make a sentence with this idiom.
 46 | #Analysis:
 47 | This command contains two parts. The first part asks about the origin of the idiom "the foolish old man who moved the mountains," which requires factual information to answer and needs to query historical or literary references.
 48 | #Prediction:
 49 | <need>
 50 | #Search Qeury:
 51 | "The origin of the idiom 'the foolish old man who moved the mountains'"
 52 | 
 53 | #Command:
 54 | Tell me about Huang Guoping.
 55 | #Analysis:
 56 | Huang Guoping could be a person's name, or it could be the name of a place or organization. According to the expression of the command, factual information is needed to supplement the relevant background knowledge.
 57 | #Prediction:
 58 | <need>
 59 | #Search Qeury:
 60 | "Huang Guoping"
 61 | 
 62 | #Command:
 63 | I like to drink strong tea to refresh myself at work, I drink 3-4 cups every day. But some people say coffee is healthier, which do you think is more beneficial to the body, coffee or strong tea?
 64 | #Analysis:
 65 | This question is about the comparison of the health effects of coffee and strong tea, this requires factual information retrieval, needs to answer through searching for scientific research on the health effects of coffee and strong tea.
 66 | #Prediction:
 67 | <need>
 68 | #Search Qeury:
 69 | "Health effects of coffee", "Health effects of strong tea"
 70 | 
 71 | #Command:
 72 | What's the coldest year on record so far?
 73 | #Analysis:
 74 | This command is asking to query climate history information, which requires factual information retrieval.
 75 | #Prediction:
 76 | <need>
 77 | #Search Qeury:
 78 | "Coldest year on record"
 79 | 
 80 | #Command:
 81 | Please compare the OnePlus Ace2 with the Realme GT Neo5, which one is more worth buying?
 82 | #Analysis:
 83 | This command asks to compare these two phones and give purchase advice. This requires analysis and comparison based on the specifications, features, price, etc. of these two phones, which are factual information.
 84 | #Prediction:
 85 | <need>
 86 | #Search Qeury:
 87 | "OnePlus Ace2 review", "Realme GT Neo5 review"
 88 | 
 89 | #Command:
 90 | Automatically add a message to a commit.
 91 | #Analysis:
 92 | This is a technical question about how to automatically add a message to a commit in a version control system like Git. Although this requires a certain understanding of the use of Git, it's primarily an operation guide and doesn't require factual information retrieval.
 93 | #Prediction:
 94 | <no need>
 95 | 
 96 | </start_of_demonstration>
 97 | ###
 98 | 
 99 | Now, based on the given commands, perform an analysis and judgment on whether factual information is needed:
100 | #Command:
101 | {instruction_to_process}
102 | #Analysis:


--------------------------------------------------------------------------------
/eval/gpt_judge/gpt_judge_prompt.jsonl:
--------------------------------------------------------------------------------
1 | {"prompt_type": "effectiveness_judge", "system_prompt": "You are a helpful and precise assistant for checking the quality of the answer.", "prompt_template": "[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n\n[The End of Assistant's Answer]\n\n[System]\nWe would like to request your feedback on the performance of an AI assistant in response to the user question displayed above.\nPlease rate the helpfulness, relevance, accuracy, and level of detail of the response. The assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only one value indicating the scores for the Assistant. In the subsequent line, please provide an explanation of your evaluation, avoiding any potential bias.\n\n", "demo_examples": ["[Question]\nFind three facts about the American Revolution.\n\n[The Start of Assistant's Answer]\n1. The American Revolution was fought for independence from British rule. The colonists wanted to create their own government and laws without interference from the British monarchy.\n\n2. The war lasted from 1775 to 1783, and during that time, over 200,000 people fought in the conflict. The American forces were eventually successful, and in 1783, Britain recognized the United States as an independent nation.\n\n3. The American Revolution also had a major impact on the world, as it inspired other countries to fight for their own independence from colonial powers. The ideals of liberty and self-determination that were central to the American Revolution continue to be celebrated today.\n\n[The End of Assistant's Answer]\n\n[System]\nWe would like to request your feedback on the performance of an AI assistant in response to the user question displayed above.\nPlease rate the helpfulness, relevance, accuracy, and level of detail of the response. The assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only one value indicating the scores for the Assistant. In the subsequent line, please provide an explanation of your evaluation, avoiding any potential bias.\n\n9.0\nThe answer provided by the AI assistant is highly relevant, accurate, and detailed. It covers three important facts about the American Revolution, including its purpose, duration, and global impact. The answer could have included a few more specific events or key figures to make it even more comprehensive, but overall, it is an excellent response.", "[Question]\nWrite a short story in third person narration about a protagonist who has to make an important career decision.\n\n[The Start of Assistant's Answer]\nI will write a short story about a girl who has to make a career decision\n\n[The End of Assistant's Answer]\n\n[System]\nWe would like to request your feedback on the performance of an AI assistant in response to the user question displayed above.\nPlease rate the helpfulness, relevance, accuracy, and level of detail of the response. The assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only one value indicating the scores for the Assistant. In the subsequent line, please provide an explanation of your evaluation, avoiding any potential bias.\n\n3.0\nThe AI assistant understood the user's request for a short story about a protagonist making an important career decision, but it failed to actually provide the story. Instead, it only mentioned that it would write a story about a girl making a career decision, without delivering on that promise."]}
2 | {"prompt_type": "hallucination_judge", "system_prompt": "You are a helpful and precise assistant for detecting the hallucination of the answer.", "prompt_template": "[Question]\n{question}\n\n[Analysis]\n{analysis}\n\n[Knowledge]\n{knowledge}\n\n[The Start of Assistant's Answer]\n{answer}\n\n[The End of Assistant's Answer]\n\n[System]\nWe would like to request your feedback on the hallucination of an AI assistant in response to the user question displayed above.\nPlease determine if the provided response contains hallucination or not for the question based on the [Analysis] and [Knowledge]. The assistant receives an overall score of 0 or 1, where 0 indicates NOT having hallucinations and 1 indicates having hallucinations.\nPlease first output a single line containing only one value indicating the scores for the Assistant. In the subsequent line, please extract the hallucination spans as a list of str. In the last line, please give an explanation of your evaluation, avoiding any potential bias.\n\n", "demo_examples": ["[Question]\nWho was the President of the United States in 2020?\n\n[Analysis]\nThis information is objective and verifiable, so factual information is needed to answer.\n\n[Knowledge]\nIn 2020, the President of the United States was Donald Trump. He assumed office on January 20, 2017, after winning the 2016 Presidential Election against Democratic nominee Hillary Clinton. Trump was the 45th President of the United States and served a four-year term until January 20, 2021, when he was succeeded by President Joe Biden.\n\nDuring his presidency, Trump was known for his controversial policies and actions, including his immigration policies, trade wars, and handling of the COVID-19 pandemic. He was also impeached twice by the House of Representatives, but was acquitted both times by the Senate.\n\nOverall, the presidency of Donald Trump was marked by significant political polarization and controversy, with both his supporters and opponents holding strong opinions about his leadership and policies.\n\n[The Start of Assistant's Answer]\nThe president of United States in 2020 is Joe Biden.\n\n[The End of Assistant's Answer]\n\n[System]\nWe would like to request your feedback on the hallucination of an AI assistant in response to the user question displayed above.\nPlease determine if the provided response contains hallucination or not for the question based on the [Analysis] and [Knowledge]. The assistant receives an overall score of 0 or 1, where 0 indicates NOT having hallucinations and 1 indicates having hallucinations.\nPlease first output a single line containing only one value indicating the scores for the Assistant. In the subsequent line, please extract the hallucination spans as a list of str. In the last line, please give an explanation of your evaluation, avoiding any potential bias.\n\n1\n[\"Joe Biden\"]\nThe provided response contains a hallucination because Joe Biden was not the President of the United States in 2020. Donald Trump was the President during that time.", "[Question]\nWhere was Albert Einstein born?\n\n[Analysis]\nThis information is objective and verifiable, so factual information is needed to answer.\n\n[Knowledge]\nAlbert Einstein was born in the city of Ulm, in the Kingdom of Württemberg, which was then a part of the German Empire. He was born on March 14, 1879, to a Jewish family. His parents were Hermann Einstein and Pauline Koch. Hermann was an engineer and salesman, while Pauline was a homemaker. Einstein had one younger sister named Maja. The family moved to Munich when Einstein was just a year old.\n\n[The Start of Assistant's Answer]\nEinstein was born in Ulm, Germany.\n\n[The End of Assistant's Answer]\n\n[System]\nWe would like to request your feedback on the hallucination of an AI assistant in response to the user question displayed above.\nPlease determine if the provided response contains hallucination or not for the question based on the [Analysis] and [Knowledge]. The assistant receives an overall score of 0 or 1, where 0 indicates NOT having hallucinations and 1 indicates having hallucinations.\nPlease first output a single line containing only one value indicating the scores for the Assistant. In the subsequent line, please extract the hallucination spans as a list of str. In the last line, please give an explanation of your evaluation, avoiding any potential bias.\n\n0\n[]\nThe assistant's answer is accurate and based on the provided knowledge. There is no hallucination in the response."]}


--------------------------------------------------------------------------------
/data_generation/per_instance_query.py:
--------------------------------------------------------------------------------
  1 | """Conduct knowledge inconsistency detection."""
  2 | 
  3 | import argparse
  4 | import json
  5 | import time
  6 | import os
  7 | import asyncio
  8 | import openai
  9 | from collections import defaultdict
 10 | from tqdm.auto import tqdm
 11 | from typing import Dict, Tuple, Union, List, Any
 12 | 
 13 | 
 14 | DEFAULT_SYSTEM_PROMPT = """
 15 | You are ChatGPT, a large language model trained by OpenAI.
 16 | Knowledge cutoff: 2021-09
 17 | Current date: 2023-06-01
 18 | """
 19 | 
 20 | OPENAI_TEMPERATURE = 1.0
 21 | 
 22 | 
 23 | def write_query_log(
 24 |         prompt: str,
 25 |         res: Dict[str, str],
 26 |         out_dir: str,
 27 | ):
 28 |     with open(os.path.join(out_dir, "query_log.jsonl"), "a+", encoding="utf-8") as f_out:
 29 |         query_log = {
 30 |             "prompt": prompt,
 31 |             "res": res,
 32 |         }
 33 |         json.dump(
 34 |             query_log,
 35 |             f_out,
 36 |             ensure_ascii=False,
 37 |         )
 38 |         f_out.write("\n")
 39 | 
 40 | 
 41 | def query_openai_batch(args):
 42 |     extension = args.file_extension if args.file_extension else "jsonl"
 43 |     os.makedirs(args.out_dir, exist_ok=True)
 44 |     path_in = os.path.join(args.data_dir, args.input)
 45 |     path_out = os.path.join(args.out_dir, args.output)
 46 |     total_requests, success_num, fail_num, side_info = \
 47 |         0, 0, 0, defaultdict(list)
 48 |     with open(path_in, encoding="utf-8") as f_in:
 49 |         all_lns = f_in.readlines()
 50 |         if extension == "jsonl":
 51 |             all_content = [json.loads(ln) for ln in all_lns]
 52 |         else:
 53 |             src_col = "en_task_detail"
 54 |             all_content = [{src_col: ln.strip(), } for ln in all_lns]
 55 |     instance_num = len(all_content)
 56 |     request_batch_size = args.request_batch_size
 57 |     total_batch_num = instance_num // request_batch_size + 1
 58 |     example_batches = [
 59 |         all_content[idx * request_batch_size: (idx + 1) * request_batch_size]
 60 |         for idx in range(total_batch_num)
 61 |     ]
 62 |     progress_bar = tqdm(range(total_batch_num))
 63 |     with open(path_out, "w", encoding="utf-8") as f_out:
 64 |         for idx in range(total_batch_num):
 65 |             batch = example_batches[idx]
 66 |             prompt = []
 67 |             for exp in batch:
 68 |                 prompt_tmp = openai_encode_prompt(exp, mode=args.prompt_mode)
 69 |                 prompt.append(prompt_tmp)
 70 |             result: List[Dict[str, str]] = request_openai_batch(prompt)
 71 |             if result is not None:
 72 |                 success_num += len(result)
 73 |                 for one_prompt, one_res, one_input in zip(prompt, result, batch):
 74 |                     write_query_log(prompt=one_prompt, res=one_res, out_dir=args.out_dir)
 75 |                     one_res["original_input"] = one_input
 76 |                     json.dump(
 77 |                         one_res,
 78 |                         f_out,
 79 |                         ensure_ascii=False
 80 |                     )
 81 |                     f_out.write("\n")
 82 |             else:
 83 |                 fail_num += 1
 84 | 
 85 |             progress_bar.update(1)
 86 |     stats_info = {
 87 |         "full_planned_requests": total_batch_num,
 88 |         "success_num": success_num,
 89 |         "fail_num": fail_num,
 90 |     }
 91 |     return stats_info
 92 | 
 93 | 
 94 | def _res_validation(
 95 |         res: Dict[str, str],
 96 |         mode: str = "turbo"
 97 | ) -> bool:
 98 |     res_text = res["choices"][0]["message"]["content"]
 99 |     if type(res_text) is str and len(res_text) > 5:
100 |         return True
101 |     return False
102 | 
103 | 
104 | def openai_encode_prompt(
105 |         example: Dict[str, str],
106 |         mode: str = "en_task_to_zh"
107 | ) -> str:
108 |     if mode == "fact_enhance_classify_en":
109 |         prompt = open("./prompt_bank/fact_enhance_classify_en.txt").read() + "\n"
110 |         prompt = prompt.format_map(
111 |             {"instruction_to_process": example["input"]}
112 |         )
113 |     elif mode == "fact_generation_en":
114 |         prompt = open("./prompt_bank/fact_generation_en.txt").read() + "\n"
115 |         kw_mapping = {
116 |             "input": example["input"],
117 |             "output": example["output"],
118 |             "analysis": example["analysis"],
119 |             "queries": example["queris"]
120 |         }
121 |         prompt = prompt.format_map(kw_mapping)
122 |     elif mode == "fact_to_tests_en":
123 |         prompt = open("./prompt_bank/fact_to_tests_en.txt").read() + "\n"
124 |         prompt = prompt.format_map(
125 |             {"knowledge": example["knowledge"]}
126 |         )
127 |     else:
128 |         raise NotImplementedError
129 |     return prompt
130 | 
131 | 
132 | def request_openai_batch(
133 |         prompt_list: List[str],
134 | ):
135 |     max_retry_times, responses = 20, None
136 |     request_sleep_time = 60
137 |     openai_kwargs = {
138 |         "model": "gpt-3.5-turbo-16k-0613",
139 |         "temperature": OPENAI_TEMPERATURE,
140 |         "top_p": 1.0,
141 |         "n": 1,
142 |         "logit_bias": {"50256": -100},
143 |     }
144 |     for trial_idx in range(max_retry_times):
145 |         try:
146 |             messages_list = [[
147 |                 {"role": "system",
148 |                  "content": DEFAULT_SYSTEM_PROMPT},
149 |                 {"role": "user",
150 |                  "content": prompt},
151 |             ] for prompt in prompt_list]
152 |             completions = asyncio.run(
153 |                 dispatch_openai_requests(
154 |                     messages_list=messages_list,
155 |                     max_tokens=6000,
156 |                     **openai_kwargs,
157 |                 )
158 |             )
159 |             responses = []
160 |             for completion in completions:
161 |                 result = completion['choices'][0]
162 |                 response = {
163 |                     "raw_response": result.get("message", {}).get("content", ""),
164 |                     "stop_reason": result.get("finish_reason", ),
165 |                 }
166 |                 responses.append(response)
167 |             return responses
168 |         except Exception as e:
169 |             print(str(e))
170 |             print(f"Trail No. {trial_idx + 1} Failed, now sleep and retrying...")
171 |             time.sleep(request_sleep_time)
172 |     return responses
173 | 
174 | 
175 | async def dispatch_openai_requests(
176 |         messages_list: list[list[dict[str, Any]]],
177 |         model: str,
178 |         temperature: float,
179 |         max_tokens: int,
180 |         top_p: float,
181 |         n: int,
182 |         logit_bias: dict,
183 | ) -> list[str]:
184 |     """Dispatches requests to OpenAI API asynchronously.
185 | 
186 |     Args:
187 |         messages_list: List of messages to be sent to OpenAI ChatCompletion API.
188 |         model: OpenAI model to use.
189 |         temperature: Temperature to use for the model.
190 |         max_tokens: Maximum number of tokens to generate.
191 |         top_p: Top p to use for the model.
192 |         n: Return sentence nums.
193 |         logit_bias: logit bias.
194 |     Returns:
195 |         List of responses from OpenAI API.
196 |     """
197 |     async_responses = [
198 |         openai.ChatCompletion.acreate(
199 |             model=model,
200 |             messages=x,
201 |             temperature=temperature,
202 |             max_tokens=max_tokens,
203 |             top_p=top_p,
204 |             n=n,
205 |             logit_bias=logit_bias,
206 |         )
207 |         for x in messages_list
208 |     ]
209 |     return await asyncio.gather(*async_responses)
210 | 
211 | 
212 | if __name__ == "__main__":
213 |     parser = argparse.ArgumentParser()
214 |     parser.add_argument('--data_dir', type=str, required=True, help='input file directory')
215 |     parser.add_argument('--input', '-i', type=str, required=True, help='input file')
216 |     parser.add_argument('--out_dir', type=str, required=True, help='output file directory')
217 |     parser.add_argument('--output', '-o', type=str, required=True, help='output file')
218 |     parser.add_argument('--file_extension', type=str, help='')
219 |     parser.add_argument('--request_batch_size', type=int, default=None, help='prompt batch size for querying openai')
220 |     parser.add_argument('--prompt_mode', type=str, help='')
221 |     args = parser.parse_args()
222 |     print(args)
223 |     query_openai_batch(args)


--------------------------------------------------------------------------------
/examination/predict.py:
--------------------------------------------------------------------------------
  1 | 
  2 | '''
  3 | This script is used to get models' predictions on a set of prompts (put in files with *.jsonl format, 
  4 | with the prompt in a `prompt` field or the conversation history in a `messages` field).
  5 | 
  6 | For example, to get predictions on a set of prompts, you should put them in a file with the following format:
  7 |     {"id": <uniq_id>, "prompt": "Plan a trip to Paris."}
  8 |     ...
  9 | Or you can use the messages format:
 10 |     {"id": <uniq_id>, "messages": [{"role": "user", "content": "Plan a trip to Paris."}]}
 11 |     ...
 12 | 
 13 | Then you can run this script with the following command:
 14 |     python eval/predict.py \
 15 |         --model_name_or_path <huggingface_model_name_or_path> \
 16 |         --input_files <input_file_1> <input_file_2> ... \
 17 |         --output_file <output_file> \
 18 |         --batch_size <batch_size> \
 19 |         --use_vllm
 20 | '''
 21 | 
 22 | 
 23 | import argparse
 24 | import json
 25 | import os
 26 | import vllm
 27 | import torch
 28 | from examination.utils import generate_completions, load_hf_lm_and_tokenizer, query_openai_chat_model, dynamic_import_function
 29 | 
 30 | 
 31 | def parse_args():
 32 |     parser = argparse.ArgumentParser()
 33 |     parser.add_argument(
 34 |         "--model_name_or_path",
 35 |         type=str,
 36 |         help="Huggingface model name or path.")
 37 |     parser.add_argument(
 38 |         "--tokenizer_name_or_path",
 39 |         type=str,
 40 |         help="Huggingface tokenizer name or path."
 41 |     )
 42 |     parser.add_argument(
 43 |         "--use_slow_tokenizer",
 44 |         action="store_true",
 45 |         help="If given, we will use the slow tokenizer."
 46 |     )
 47 |     parser.add_argument(
 48 |         "--openai_engine", 
 49 |         type=str,
 50 |         help="OpenAI engine name. This should be exclusive with `model_name_or_path`.")
 51 |     parser.add_argument(
 52 |         "--input_files", 
 53 |         type=str, 
 54 |         nargs="+",
 55 |         help="Input .jsonl files, with each line containing `id` and `prompt` or `messages`.")
 56 |     parser.add_argument(
 57 |         "--output_file",
 58 |         type=str,
 59 |         default="output/model_outputs.jsonl",
 60 |         help="Output .jsonl file, with each line containing `id`, `prompt` or `messages`, and `output`.")
 61 |     parser.add_argument(
 62 |         "--batch_size",
 63 |         type=int,
 64 |         default=1,
 65 |         help="batch size for prediction.")
 66 |     parser.add_argument(
 67 |         "--load_in_8bit",
 68 |         action="store_true",
 69 |         help="load model in 8bit mode, which will reduce memory and speed up inference.")
 70 |     parser.add_argument(
 71 |         "--load_in_float16",
 72 |         action="store_true",
 73 |         help="By default, huggingface model will be loaded in the torch.dtype specificed in its model_config file."
 74 |              "If specified, the model dtype will be converted to float16 using `model.half()`.")
 75 |     parser.add_argument(
 76 |         "--gptq",
 77 |         action="store_true",
 78 |         help="If given, we're evaluating a 4-bit quantized GPTQ model.")
 79 |     parser.add_argument(
 80 |         "--use_vllm",
 81 |         action="store_true", 
 82 |         help="If given, we will use the vllm library, which will likely increase the inference throughput.")
 83 |     parser.add_argument(
 84 |         "--use_chat_format", 
 85 |         action="store_true", 
 86 |         help="If given, we will use the chat format for the prompts."
 87 |     )
 88 |     parser.add_argument(
 89 |         "--chat_formatting_function", 
 90 |         type=str, 
 91 |         default="eval.templates.create_prompt_with_tulu_chat_format", 
 92 |         help="The function to use to create the chat format. This function will be dynamically imported. Please see examples in `eval/templates.py`."
 93 |     )
 94 |     parser.add_argument(
 95 |         "--max_new_tokens",
 96 |         type=int,
 97 |         default=2048,
 98 |         help="maximum number of new tokens to generate.")
 99 |     parser.add_argument(
100 |         "--do_sample",
101 |         action="store_true",
102 |         help="whether to use sampling ; use greedy decoding otherwise.")
103 |     parser.add_argument(
104 |         "--temperature",
105 |         type=float,
106 |         default=1.0,
107 |         help="temperature for sampling.")
108 |     parser.add_argument(
109 |         "--top_p",
110 |         type=float,
111 |         default=1.0,
112 |         help="top_p for sampling.")
113 |     args = parser.parse_args()
114 | 
115 |     # model_name_or_path and openai_engine should be exclusive.
116 |     assert (args.model_name_or_path is None) != (args.openai_engine is None), "model_name_or_path and openai_engine should be exclusive."
117 |     return args
118 | 
119 | 
120 | if __name__ == "__main__":
121 |     args = parse_args()
122 | 
123 |     # check if output directory exists
124 |     if args.output_file is not None:
125 |         output_dir = os.path.dirname(args.output_file)
126 |         if not os.path.exists(output_dir):
127 |             os.makedirs(output_dir)
128 | 
129 |     # load the data
130 |     for input_file in args.input_files:
131 |         with open(input_file, "r") as f:
132 |             instances = [json.loads(x) for x in f.readlines()]
133 | 
134 |     if args.model_name_or_path is not None:
135 |         prompts = []
136 |         chat_formatting_function = dynamic_import_function(args.chat_formatting_function) if args.use_chat_format else None
137 |         for instance in instances:
138 |             if "messages" in instance:
139 |                 if not args.use_chat_format:
140 |                     raise ValueError("If `messages` is in the instance, `use_chat_format` should be True.")
141 |                 assert all("role" in message and "content" in message for message in instance["messages"]), \
142 |                     "Each message should have a `role` and a `content` field."
143 |                 prompt = eval(args.chat_formatting_function)(instance["messages"], add_bos=False)
144 |             elif "prompt" in instance:
145 |                 if args.use_chat_format:
146 |                     messages = [{"role": "user", "content": instance["prompt"]}]
147 |                     prompt = chat_formatting_function(messages, add_bos=False)
148 |                 else:
149 |                     prompt = instance["prompt"]
150 |             else:
151 |                 raise ValueError("Either `messages` or `prompt` should be in the instance.")
152 |             prompts.append(prompt)
153 |         if args.use_vllm:
154 |             model = vllm.LLM(
155 |                 model=args.model_name_or_path,
156 |                 tokenizer=args.tokenizer_name_or_path if args.tokenizer_name_or_path else args.model_name_or_path,
157 |                 tokenizer_mode="slow" if args.use_slow_tokenizer else "auto",
158 |             )
159 |             sampling_params = vllm.SamplingParams(
160 |                 temperature=args.temperature if args.do_sample else 0, 
161 |                 top_p=args.top_p,
162 |                 max_tokens=args.max_new_tokens,
163 |             )
164 |             outputs = model.generate(prompts, sampling_params)
165 |             outputs = [it.outputs[0].text for it in outputs]
166 |         else:
167 |             model, tokenizer = load_hf_lm_and_tokenizer(
168 |                 model_name_or_path=args.model_name_or_path, 
169 |                 tokenizer_name_or_path=args.tokenizer_name_or_path,
170 |                 load_in_8bit=args.load_in_8bit, 
171 |                 device_map="balanced_low_0" if torch.cuda.device_count() > 1 else "auto",
172 |                 gptq_model=args.gptq,
173 |                 use_fast_tokenizer=not args.use_slow_tokenizer,
174 |             )
175 |             outputs = generate_completions(
176 |                 model=model,
177 |                 tokenizer=tokenizer,
178 |                 prompts=prompts,
179 |                 batch_size=args.batch_size,
180 |                 max_new_tokens=args.max_new_tokens,
181 |                 do_sample=args.do_sample,
182 |                 temperature=args.temperature,
183 |                 top_p=args.top_p,
184 |             )
185 |         with open(args.output_file, "w") as f:
186 |             for instance, output in zip(instances, outputs):
187 |                 instance["output"] = output
188 |                 f.write(json.dumps(instance) + "\n")
189 |                 
190 |     elif args.openai_engine is not None:
191 |         query_openai_chat_model(
192 |             engine=args.openai_engine,
193 |             instances=instances,
194 |             output_path=args.output_file,
195 |             batch_size=args.batch_size,
196 |             temperature=args.temperature,
197 |             top_p=args.top_p,
198 |             max_tokens=args.max_new_tokens,
199 |         )
200 |     else:
201 |         raise ValueError("Either model_name_or_path or openai_engine should be provided.")
202 | 
203 |     print("Done.")


--------------------------------------------------------------------------------
/data_generation/inconsistency_processing.py:
--------------------------------------------------------------------------------
  1 | """Conduct knowledge inconsistency processing."""
  2 | 
  3 | import json
  4 | 
  5 | 
  6 | def read_jsonl(file_path):
  7 |     data = []
  8 |     with open(file_path, 'r') as f:
  9 |         for line in f:
 10 |             data.append(json.loads(line))
 11 |     return data
 12 | 
 13 | 
 14 | def construct_data(fact_check_file, test_generation_file, output_file, data_name, model_name, llm_evaluation_result_file, no_fact_type):
 15 |     """Construct data for openbook, drop, and sorry."""
 16 |     fact_check_data = read_jsonl(fact_check_file)
 17 |     test_generation_data = read_jsonl(test_generation_file)
 18 |     llm_evaluation_results = json.loads(open(llm_evaluation_result_file, "r").read())
 19 |     data_need_facts = []
 20 |     for key, value in llm_evaluation_results["threshold_0.6"].items():
 21 |         need_fact_idx = int(key.replace(f"idx_{data_name}", ""))
 22 |         example = test_generation_data[need_fact_idx]
 23 |         if value is True:
 24 |             input_value = example["original_input"]["original_input"]["input"]
 25 |             answer_value = example["original_input"]["original_input"]["output"]
 26 |             output_value = answer_value
 27 |             data_need_facts.append({
 28 |                 "id": f"need_and_{model_name}_have_fact_{need_fact_idx}",
 29 |                 "conversations": [
 30 |                     {
 31 |                         "from": "human",
 32 |                         "value": input_value,
 33 |                     },
 34 |                     {
 35 |                         "from": "gpt",
 36 |                         "value": output_value,
 37 |                     }
 38 |                 ],
 39 |                 "class": "need_and_have_fact",
 40 |                 "analysis": example["original_input"]["original_input"]["analysis"],
 41 |                 "knowledge": example["original_input"]["knowledge"],
 42 |             })
 43 |         else:
 44 |             if no_fact_type == "drop":
 45 |                 pass
 46 |             elif no_fact_type == "openbook":
 47 |                 input_value = example['original_input']['knowledge'] + "\n" + example["original_input"]["original_input"]["input"]
 48 |                 answer_value = example["original_input"]["original_input"]["output"]
 49 |                 output_value = answer_value
 50 |                 data_need_facts.append({
 51 |                     "id": f"need_and_{model_name}_have_no_fact_{need_fact_idx}",
 52 |                     "conversations": [
 53 |                         {
 54 |                             "from": "human",
 55 |                             "value": input_value,
 56 |                         },
 57 |                         {
 58 |                             "from": "gpt",
 59 |                             "value": output_value,
 60 |                         }
 61 |                     ],
 62 |                     "class": "need_and_have_no_fact",
 63 |                     "analysis": example["original_input"]["original_input"]["analysis"],
 64 |                     "knowledge": example["original_input"]["knowledge"],
 65 |                 })
 66 |             elif no_fact_type == "sorry":
 67 |                 input_value = example["original_input"]["original_input"]["input"]
 68 |                 answer_value = f"Sorry, I don't know the factual information required to answer this question."
 69 |                 output_value = answer_value
 70 |                 data_need_facts.append({
 71 |                     "id": f"need_and_{model_name}_have_no_fact_{need_fact_idx}",
 72 |                     "conversations": [
 73 |                         {
 74 |                             "from": "human",
 75 |                             "value": input_value,
 76 |                         },
 77 |                         {
 78 |                             "from": "gpt",
 79 |                             "value": output_value,
 80 |                         }
 81 |                     ],
 82 |                     "class": "need_and_have_no_fact",
 83 |                     "analysis": example["original_input"]["original_input"]["analysis"],
 84 |                     "knowledge": example["original_input"]["knowledge"],
 85 |                 })
 86 |             else:
 87 |                 raise NotImplementedError
 88 | 
 89 |     data_no_need_facts = []
 90 |     no_need_fact_idx = 0
 91 |     for example in fact_check_data:
 92 |         if "final_prediction" in example and example["final_prediction"] == "<no need>":
 93 |             input_value = example["original_input"]["input"]
 94 |             answer_value = example["original_input"]["output"]
 95 |             output_value = answer_value
 96 |             data_no_need_facts.append({
 97 |                 "id": f"no_need_fact_{no_need_fact_idx}",
 98 |                 "conversations": [
 99 |                     {
100 |                         "from": "human",
101 |                         "value": input_value,
102 |                     },
103 |                     {
104 |                         "from": "gpt",
105 |                         "value": output_value,
106 |                     }
107 |                 ],
108 |                 "class": "no_need_fact",
109 |                 "analysis": example["analysis"],
110 |                 "knowledge": "",
111 |             })
112 |             no_need_fact_idx += 1
113 |     final_data = data_need_facts + data_no_need_facts
114 |     print(f"data_need_facts: {len(data_need_facts)}")
115 |     print(f"data_no_need_facts: {len(data_no_need_facts)}")
116 |     print(f"data_final: {len(final_data)}")
117 |     for example in final_data:
118 |         for turn in example["conversations"]:
119 |             assert type(turn["value"]) == str
120 |     with open(output_file, "w") as fout:
121 |         json.dump(final_data, fout, indent=4)
122 | 
123 | 
124 | def main():
125 |     global_path = "./data"
126 | 
127 |     fact_check_file_mapping = {
128 |         "train": ["wizardlm_alpaca_single_turn_classify_parse_res.jsonl"],
129 |         "test": ["lima_testset_single_turn_classify_parse_res.jsonl",
130 |                  "vicuna_testset_single_turn_classify_parse_res.jsonl",
131 |                  "wizardlm_testset_single_turn_classify_parse_res.jsonl"],
132 |         "test_truth": ["truthfulqa_testset_single_turn_classify_parse_res.jsonl"]
133 |     }
134 |     test_generation_file_mapping = {
135 |         "train": ["wizardlm_alpaca_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_parse_res.jsonl"],
136 |         "test": ["lima_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_parse_res.jsonl",
137 |                  "vicuna_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_parse_res.jsonl",
138 |                  "wizardlm_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_parse_res.jsonl"],
139 |         "test_truth": ["truthfulqa_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_parse_res.jsonl"]
140 |     }
141 |     llm_evaluation_result_file_mapping = {
142 |         "train": ["wizardlm_alpaca_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize_sft_instance_behavior.json"],
143 |         "test": ["lima_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize_sft_instance_behavior.json",
144 |                  "vicuna_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize_sft_instance_behavior.json",
145 |                  "wizardlm_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize_sft_instance_behavior.json"],
146 |         "test_truth": ["truthfulqa_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize_sft_instance_behavior.json"],
147 |     }
148 |     for split in ["train", "test", "test_truth"]:
149 |         for model_name in [ "pythia-6.9b", "llama-2-7b", "mistral-7b-v0.1", "llama-2-13b"]:
150 |             for shot in ["5"]:
151 |                 for dataset_idx in range(len(fact_check_file_mapping[split])):
152 |                     for no_fact_type in ["openbook", "drop", "sorry"]:
153 |                         fact_check_file = f"{global_path}/generation_results/{split}/fact_enhance_classify/{fact_check_file_mapping[split][dataset_idx]}"
154 |                         test_generation_file = f"{global_path}/generation_results/{split}/test_generation/{test_generation_file_mapping[split][dataset_idx]}"
155 |                         llm_evaluation_result_file = f"{global_path}/examination/output/{split}/{model_name}/{shot}-shot/{llm_evaluation_result_file_mapping[split][dataset_idx]}"
156 |                         data_name = f"{fact_check_file_mapping[split][dataset_idx].split('_')[0]}_{split}set"
157 |                         output_file = f"{global_path}/processed_results/{model_name}_shot-{shot}_{data_name}_{no_fact_type}.json"
158 |                         try:
159 |                             construct_data(fact_check_file, test_generation_file, output_file, "sft", model_name, llm_evaluation_result_file, no_fact_type)
160 |                         except Exception as e:
161 |                             print(e)
162 | 
163 | 
164 | if __name__ == "__main__":
165 |     main()


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "[]"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright [yyyy] [name of copyright owner]
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 


--------------------------------------------------------------------------------
/eval/gpt_judge/gpt_judge.py:
--------------------------------------------------------------------------------
  1 | """Use gpt as automatic evaluator for hallucination or effectiveness evaluation."""
  2 | 
  3 | import openai
  4 | import time
  5 | import json
  6 | import os
  7 | import re
  8 | import tqdm
  9 | import argparse
 10 | import asyncio
 11 | from typing import Any, List, Dict, Optional
 12 | 
 13 | MAX_API_RETRY = 20
 14 | 
 15 | 
 16 | def get_json_list(file_path):
 17 |     file_path = os.path.expanduser(file_path)
 18 |     with open(file_path, 'r') as f:
 19 |         json_list = []
 20 |         for line in f:
 21 |             json_list.append(json.loads(line))
 22 |         return json_list
 23 | 
 24 | 
 25 | async def dispatch_openai_requests(
 26 |         messages_list: list[list[dict[str, Any]]],
 27 |         model: str,
 28 |         temperature: float,
 29 |         max_tokens: int,
 30 | ) -> list[str]:
 31 |     """Dispatches requests to OpenAI API asynchronously.
 32 | 
 33 |     Args:
 34 |         messages_list: List of messages to be sent to OpenAI ChatCompletion API.
 35 |         model: OpenAI model to use.
 36 |         temperature: Temperature to use for the model.
 37 |         max_tokens: Maximum number of tokens to generate.
 38 |     Returns:
 39 |         List of responses from OpenAI API.
 40 |     """
 41 |     async_responses = [
 42 |         openai.ChatCompletion.acreate(
 43 |             model=model,
 44 |             messages=x,
 45 |             temperature=temperature,
 46 |             max_tokens=max_tokens,
 47 |         )
 48 |         for x in messages_list
 49 |     ]
 50 |     return await asyncio.gather(*async_responses)
 51 | 
 52 | 
 53 | def get_completion(messages_list: list, model: str, temperature: float = 0.0):
 54 |     for i in range(MAX_API_RETRY):
 55 |         try:
 56 |             completions = asyncio.run(
 57 |                 dispatch_openai_requests(
 58 |                     messages_list=messages_list,
 59 |                     model=model,
 60 |                     temperature=temperature,
 61 |                     max_tokens=1024,
 62 |                 )
 63 |             )
 64 |             return completions
 65 |         except openai.error.InvalidRequestError:
 66 |             print(messages_list)
 67 |             print("Error: Invalid Request")
 68 |             return None
 69 |         except Exception as e:
 70 |             print(e)
 71 |             time.sleep(30)
 72 |     print(f'Failed after {MAX_API_RETRY} retries.')
 73 |     raise RuntimeError
 74 | 
 75 | 
 76 | def get_prompt_single_score(question, answer, analysis, knowledge, prompt_temp, prompt_type, demos, add_demos):
 77 |     prompt = ""
 78 |     if add_demos:
 79 |         for demo in demos:
 80 |             prompt = prompt + demo + "\n"
 81 |     if prompt_type == "effectiveness_judge":
 82 |         prompt = prompt + prompt_temp.format_map({"question": question, "answer": answer})
 83 |     elif prompt_type == "hallucination_judge":
 84 |         prompt = prompt + prompt_temp.format_map({"question": question, "answer": answer,
 85 |                                                   "analysis": analysis, "knowledge": knowledge})
 86 |     return prompt
 87 | 
 88 | 
 89 | def post_process_single_score(result: str, prompt_type: str):
 90 |     if prompt_type == "hallucination_judge":
 91 |         parts = result.strip().split("\n")
 92 |         error_cnt = 0
 93 |         if len(parts) == 3:
 94 |             score = float(parts[0])
 95 |             try:
 96 |                 span = eval(parts[1])
 97 |             except:
 98 |                 span = []
 99 |             explain = parts[2]
100 |         else:
101 |             pattern = r"^(\d+)[ \n]+(\[.*\])[ \n]+(.*)$"
102 |             match = re.match(pattern, result)
103 |             if match:
104 |                 score = float(match.group(1))
105 |                 try:
106 |                     span = eval(match.group(2))
107 |                 except:
108 |                     span = []
109 |                 explain = match.group(3)
110 |             else:
111 |                 score = -100
112 |                 span = []
113 |                 explain = "none"
114 |                 error_cnt += 1
115 |                 print(f"Error for parsing, {error_cnt}")
116 |         return score, span, explain
117 |     elif prompt_type == "effectiveness_judge":
118 |         parts = result.strip().split("\n")
119 |         error_cnt = 0
120 |         if len(parts) == 2:
121 |             score = float(parts[0])
122 |             explain = parts[1]
123 |         else:
124 |             pattern = r"^(\d+)[ \n]+(.*)$"
125 |             match = re.match(pattern, result)
126 |             if match:
127 |                 score = float(match.group(1))
128 |                 explain = match.group(2)
129 |             else:
130 |                 score = -100
131 |                 explain = "none"
132 |                 error_cnt += 1
133 |                 print(f"Error for parsing, {error_cnt}")
134 |         if score != -100 and score < 0:
135 |             score = 0.0
136 |         elif score != -100 and score > 10:
137 |             score = 10.0
138 |         return score, None, explain
139 |     else:
140 |         raise NotImplementedError
141 | 
142 | 
143 | def get_single_score(input_file, testset_file, output_file, prompt_file, prompt_type=None, use_demo=False,
144 |                      model="gpt-4", temperature=0.0, batch_size=1, no_sorry=False):
145 |     input_examples = get_json_list(input_file)
146 |     testset_examples = json.loads(open(testset_file, "r").read())
147 |     assert len(input_examples) == len(testset_examples)
148 |     for i in range(len(input_examples)):
149 |         input_examples[i]["analysis"] = testset_examples[i]["analysis"]
150 |         input_examples[i]["knowledge"] = testset_examples[i]["knowledge"]
151 |     review_examples = []
152 |     if no_sorry is False:
153 |         for x in input_examples:
154 |             review_examples.append(x)
155 |     else:
156 |         print("Delete sorry answers...")
157 |         for x in input_examples:
158 |             if "Sorry, I don't know the factual information required to answer this question." not in x["answer"]:
159 |                 review_examples.append(x)
160 |         print(f"Delete {len(input_examples) - len(review_examples)} examples.")
161 |     if os.path.exists(output_file):
162 |         curr_result = get_json_list(output_file)
163 |     else:
164 |         curr_result = []
165 |     system_prompt = None
166 |     prompt_template = None
167 |     demo_examples = None
168 |     for prompt in get_json_list(prompt_file):
169 |         if prompt["prompt_type"] == prompt_type:
170 |             system_prompt = prompt["system_prompt"]
171 |             prompt_template = prompt["prompt_template"]
172 |             demo_examples = prompt["demo_examples"]
173 |             break
174 |     for i in tqdm.tqdm(range(len(curr_result), len(review_examples), batch_size)):
175 |         examples = review_examples[i: i + batch_size]
176 |         messages_list = []
177 |         for example in examples:
178 |             qs = example["question"]
179 |             ans = example["answer"]
180 |             analysis = example["analysis"]
181 |             knowledge = example["knowledge"]
182 |             prompt = get_prompt_single_score(qs, ans, analysis, knowledge, prompt_template, prompt_type,
183 |                                              demo_examples, use_demo)
184 |             messages_list.append([
185 |                 {"role": "system",
186 |                  "content": system_prompt},
187 |                 {"role": "user",
188 |                  "content": prompt},
189 |             ])
190 |         completions = get_completion(messages_list, model, temperature)
191 |         if completions:
192 |             try:
193 |                 results = [completion['choices'][0]['message']['content'] for completion in completions]
194 |             except:
195 |                 print("Error: Not return anything.")
196 |                 results = ["" for _ in range(len(messages_list))]
197 |         else:
198 |             results = ["" for _ in range(len(messages_list))]
199 | 
200 |         parse_results = []
201 |         for result in results:
202 |             score, span, explain = post_process_single_score(result, prompt_type)
203 |             parse_results.append({"judge_result": result,
204 |                                   "judge_score": score,
205 |                                   "judge_span": span,
206 |                                   "judge_explain": explain})
207 |         for idx, example in enumerate(examples):
208 |             example.update(parse_results[idx])
209 |             with open(output_file, "a+") as fout:
210 |                 fout.write(json.dumps(example) + '\n')
211 | 
212 | 
213 | def check_error_parse(testset_file, output_file, prompt_file, prompt_type=None, use_demo=False,
214 |                       model="gpt-4", temperature=0.0, batch_size=1):
215 |     output_examples = get_json_list(output_file)
216 |     testset_examples = json.loads(open(testset_file, "r").read())
217 |     system_prompt = None
218 |     prompt_template = None
219 |     demo_examples = None
220 |     for prompt in get_json_list(prompt_file):
221 |         if prompt["prompt_type"] == prompt_type:
222 |             system_prompt = prompt["system_prompt"]
223 |             prompt_template = prompt["prompt_template"]
224 |             demo_examples = prompt["demo_examples"]
225 |             break
226 |     for i in tqdm.tqdm(range(0, len(output_examples), batch_size)):
227 |         if output_examples[i]["judge_score"] != -100:
228 |             continue
229 |         examples = output_examples[i: i + batch_size]
230 |         messages_list = []
231 |         for example in examples:
232 |             qs = example["question"]
233 |             ans = example["answer"]
234 |             analysis = example["analysis"]
235 |             knowledge = example["knowledge"]
236 |             prompt = get_prompt_single_score(qs, ans, analysis, knowledge, prompt_template, prompt_type,
237 |                                              demo_examples, use_demo)
238 |             messages_list.append([
239 |                 {"role": "system",
240 |                  "content": system_prompt},
241 |                 {"role": "user",
242 |                  "content": prompt},
243 |             ])
244 |         completions = get_completion(messages_list, model, temperature)
245 |         if completions:
246 |             try:
247 |                 results = [completion['choices'][0]['message']['content'] for completion in completions]
248 |             except:
249 |                 print("Error: Not return anything.")
250 |                 results = ["" for _ in range(len(messages_list))]
251 |         else:
252 |             results = ["" for _ in range(len(messages_list))]
253 | 
254 |         parse_results = []
255 |         for result in results:
256 |             score, span, explain = post_process_single_score(result, prompt_type)
257 |             parse_results.append({"judge_result": result,
258 |                                   "judge_score": score,
259 |                                   "judge_span": span,
260 |                                   "judge_explain": explain})
261 |         for idx, example in enumerate(examples):
262 |             example.update(parse_results[idx])
263 | 
264 |     with open(output_file, "w") as fout:
265 |         for example in output_examples:
266 |             fout.write(json.dumps(example) + '\n')
267 | 
268 | 
269 | def main():
270 |     parser = argparse.ArgumentParser()
271 |     parser.add_argument("--answer_file", type=str, default="answer.jsonl")
272 |     parser.add_argument("--testset_file", type=str, default="testset.jsonl")
273 |     parser.add_argument("--review_file", type=str, default="review.jsonl")
274 |     parser.add_argument("--prompt_file", type=str, default="prompt.jsonl")
275 |     parser.add_argument("--prompt_type", type=str, default="none")
276 |     parser.add_argument("--use_demo", action="store_true")
277 |     parser.add_argument("--review_model", type=str, default="gpt-4")
278 |     parser.add_argument("--batch_size", type=int, default=1)
279 |     parser.add_argument("--check_error_parse", action="store_true")
280 |     parser.add_argument("--no_sorry", action="store_true")
281 |     args = parser.parse_args()
282 |     print(args)
283 |     if not args.check_error_parse:
284 |         get_single_score(args.answer_file,
285 |                          args.testset_file,
286 |                          args.review_file,
287 |                          args.prompt_file,
288 |                          prompt_type=args.prompt_type,
289 |                          use_demo=args.use_demo,
290 |                          model=args.review_model,
291 |                          temperature=0.0,
292 |                          batch_size=args.batch_size,
293 |                          no_sorry=args.no_sorry)
294 |     else:
295 |         check_error_parse(args.testset_file,
296 |                           args.review_file,
297 |                           args.prompt_file,
298 |                           prompt_type=args.prompt_type,
299 |                           use_demo=args.use_demo,
300 |                           model=args.review_model,
301 |                           temperature=0.0,
302 |                           batch_size=1)
303 | 
304 | 
305 | if __name__ == "__main__":
306 |     main()


--------------------------------------------------------------------------------
/examination/hallucination/run_eval.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import os
  3 | import torch
  4 | import random
  5 | import numpy as np
  6 | import json
  7 | from transformers import AutoTokenizer, AutoModel
  8 | from typing import List, Tuple, Dict
  9 | from tqdm import tqdm
 10 | from examination.utils import get_next_word_predictions, load_hf_lm_and_tokenizer, query_openai_chat_model, dynamic_import_function
 11 | 
 12 | choices = ["A", "B", "C", "D"]
 13 | 
 14 | 
 15 | def format_subject(subject):
 16 |     l = subject.split("_")
 17 |     s = ""
 18 |     for entry in l:
 19 |         s += " " + entry
 20 |     return s
 21 | 
 22 | 
 23 | def format_example(
 24 |         examples: List[Dict[str, str]],
 25 |         idx: int,
 26 |         include_answer: bool = True
 27 | ) -> str:
 28 |     example = examples[idx]
 29 |     prompt = example["question"]
 30 |     k = len(example["options"])
 31 |     for j in range(k):
 32 |         prompt += "\n{}. {}".format(choices[j], example["options"][j])
 33 |     prompt += "\nAnswer:"
 34 |     if include_answer:
 35 |         prompt += " {}\n\n".format(example["answer"])
 36 |     return prompt
 37 | 
 38 | 
 39 | def gen_prompt(
 40 |         examples: List[Dict[str, str]],
 41 |         subject: str,
 42 |         k: int = -1,
 43 | 
 44 | ) -> str:
 45 |     prompt = "The following are multiple choice questions" \
 46 |              " (with answers) about factual knowledge.\n\n"
 47 |     if k == -1:
 48 |         k = len(examples)
 49 |     for i in range(k):
 50 |         prompt += format_example(examples, i)
 51 |     return prompt
 52 | 
 53 | 
 54 | @torch.no_grad()
 55 | def eval_hf_model(
 56 |         args: argparse.ArgumentParser,
 57 |         subject: str,
 58 |         model: AutoModel,
 59 |         tokenizer: AutoTokenizer,
 60 |         dev_set: List[Dict[str, str]],
 61 |         test_set: List[Dict[str, str]],
 62 |         batch_size: int = 1
 63 | ) -> Tuple[np.ndarray, float, np.ndarray]:
 64 |     prompts = []
 65 |     chat_formatting_function = dynamic_import_function(args.chat_formatting_function) if args.use_chat_format else None
 66 |     for i in range(0, len(test_set)):
 67 |         k = args.ntrain
 68 |         prompt_end = format_example(test_set, i, include_answer=False)
 69 |         train_prompt = gen_prompt(dev_set, subject, k)
 70 |         prompt = train_prompt + prompt_end
 71 | 
 72 |         if args.use_chat_format:
 73 |             messages = [{"role": "user", "content": prompt}]
 74 |             prompt = chat_formatting_function(messages, add_bos=False)
 75 |             if prompt[-1] in ["\n", " "]:
 76 |                 prompt += "The answer is:"
 77 |             else:
 78 |                 prompt += " The answer is:"
 79 | 
 80 |         tokenized_prompt = tokenizer(prompt, truncation=False, add_special_tokens=False).input_ids
 81 |         while len(tokenized_prompt) > 2048:
 82 |             k -= 1
 83 |             train_prompt = gen_prompt(dev_set, subject, k)
 84 |             prompt = train_prompt + prompt_end
 85 | 
 86 |             if args.use_chat_format:
 87 |                 messages = [{"role": "user", "content": prompt}]
 88 |                 prompt = chat_formatting_function(messages, add_bos=False)
 89 |                 if prompt[-1] in ["\n", " "]:
 90 |                     prompt += "The answer is:"
 91 |                 else:
 92 |                     prompt += " The answer is:"
 93 | 
 94 |             tokenized_prompt = tokenizer(prompt, truncation=False, add_special_tokens=False).input_ids
 95 |         prompts.append(prompt)
 96 | 
 97 |     answer_choice_ids = [tokenizer.encode(" " + answer_choice, add_special_tokens=False)[-1] for answer_choice in choices]
 98 |     pred_indices, all_probs = get_next_word_predictions(
 99 |         model, tokenizer, prompts, candidate_token_ids=answer_choice_ids, return_token_predictions=False,
100 |         batch_size=batch_size
101 |     )
102 | 
103 |     cors = []
104 |     ground_truths = [exp["answer"] for exp in test_set]
105 |     for i in range(len(pred_indices)):
106 |         prediction = choices[pred_indices[i]]
107 |         ground_truth = ground_truths[i]
108 |         cors.append(prediction == ground_truth)
109 | 
110 |     acc = np.mean(cors)
111 |     cors = np.array(cors)
112 | 
113 |     all_probs = np.array(all_probs)
114 |     print("Average accuracy {:.3f} - {}".format(acc, subject))
115 |     return cors, acc, all_probs
116 | 
117 | 
118 | def eval_openai_chat_engine(
119 |         args,
120 |         subject,
121 |         engine,
122 |         dev_set,
123 |         test_set,
124 |         batch_size=1
125 | ) -> Tuple[np.ndarray, float, np.ndarray]:
126 |     import tiktoken
127 |     gpt_tokenizer = tiktoken.get_encoding("cl100k_base")
128 |     answer_choice_ids = [gpt_tokenizer.encode(" " + x)[0] for x in choices]
129 |     prompts = []
130 |     for i in range(0, len(test_set)):
131 |         k = args.ntrain
132 |         prompt_end = format_example(test_set, i, include_answer=False)
133 |         train_prompt = gen_prompt(dev_set, subject, k)
134 |         prompt = train_prompt + prompt_end
135 |         prompts.append(prompt)
136 | 
137 |     instances = [{"id": prompt, "prompt": prompt} for _, prompt in enumerate(prompts)]
138 |     results = query_openai_chat_model(
139 |         engine=engine,
140 |         instances=instances,
141 |         batch_size=batch_size if batch_size else 10,
142 |         output_path=os.path.join(args.save_dir, f"{subject}_openai_results.jsonl"),
143 |         logit_bias={token_id: 100 for token_id in answer_choice_ids},
144 |         max_tokens=1,
145 |         retry_limit=20,
146 |     )
147 |     cors = []
148 |     ground_truths = [exp["answer"] for exp in test_set]
149 |     for i in range(len(test_set)):
150 |         prediction = results[i]["output"].strip()
151 |         ground_truth = ground_truths[i]
152 |         cors.append(prediction == ground_truth)
153 | 
154 |     acc = np.mean(cors)
155 |     cors = np.array(cors)
156 | 
157 |     all_probs = np.array([[0.25, 0.25, 0.25, 0.25] for _ in range(len(test_set))])
158 | 
159 |     print("Average accuracy {:.3f} - {}".format(acc, subject))
160 |     return cors, acc, all_probs
161 | 
162 | 
163 | def main(args):
164 |     if args.model_name_or_path:
165 |         print("Loading model and tokenizer...")
166 |         model, tokenizer = load_hf_lm_and_tokenizer(
167 |             model_name_or_path=args.model_name_or_path,
168 |             tokenizer_name_or_path=args.tokenizer_name_or_path,
169 |             load_in_8bit=args.load_in_8bit,
170 |             device_map="balanced_low_0" if torch.cuda.device_count() > 1 else "auto",
171 |             gptq_model=args.gptq,
172 |             use_fast_tokenizer=not args.use_slow_tokenizer,
173 |         )
174 |     subjects = sorted(
175 |         [
176 |             f.split("_test.jsonl")[0]
177 |             for f in os.listdir(args.data_dir)
178 |             if "_test.jsonl" in f
179 |         ]
180 |     )
181 |     print("evaluate on the subjects: ", subjects)
182 | 
183 |     if args.subjects:
184 |         assert all(subj in subjects for subj in
185 |                    args.subjects), f"Some of the subjects you specified are not valid: {args.subjects}"
186 |         subjects = args.subjects
187 | 
188 |     if not os.path.exists(args.save_dir):
189 |         os.makedirs(args.save_dir)
190 | 
191 |     all_cors, subject_cors = [], dict()
192 |     for subject in tqdm(subjects, desc="Evaluating subjects: "):
193 |         dev_set = [
194 |             {"question": "Which aspect of system security is evaluated in a physical security assessment?",
195 |              "options": ["Email security controls", "Patch management processes", "Network security controls",
196 |                          "Physical security measures"],
197 |              "answer": "D"},
198 |             {"question": "Which Python libraries are commonly used for sleep data analysis?",
199 |              "options": ["Pandas, Matplotlib, and Seaborn", "NumPy, Scikit-learn, and Plotly",
200 |                          "TensorFlow, Keras, and PyTorch", "Django, Flask, and SQLAlchemy"],
201 |              "answer": "A"},
202 |             {"question": "What does sentiment polarity measure in sentiment analysis?",
203 |              "options": ["The degree of personal opinion expressed in a text.",
204 |                          "The strength and direction of sentiment in a text.", "The subjectivity score of a statement.",
205 |                          "The emotional connotations associated with specific words."],
206 |              "answer": "B"},
207 |             {
208 |                 "question": "Which of the following is the recommended operating system for a Raspberry Pi when setting up a personal cloud server?",
209 |                 "options": ["Ubuntu", "Raspbian", "Fedora", "Arch Linux"],
210 |                 "answer": "B"},
211 |             {"question": "One of the benefits of vegetation in woodlands streams for fish populations is:",
212 |              "options": ["Providing hiding places and cover from predators.",
213 |                          "Absorbing excess nutrients and pollutants.", "Regulating water temperature.",
214 |                          "Enhancing the growth of algae."],
215 |              "answer": "A"}
216 |         ]
217 |         test_set = [json.loads(ln) for ln in open(os.path.join(args.data_dir, subject + "_test.jsonl")).readlines()]
218 |         if args.n_instances and args.n_instances < len(test_set):
219 |             test_set = random.sample(test_set, args.n_instances)
220 |         if args.model_name_or_path:
221 |             cors, acc, probs = eval_hf_model(
222 |                 args,
223 |                 subject,
224 |                 model,
225 |                 tokenizer,
226 |                 dev_set,
227 |                 test_set,
228 |                 args.eval_batch_size
229 |             )
230 |         else:
231 |             cors, acc, probs = eval_openai_chat_engine(
232 |                 args,
233 |                 subject,
234 |                 args.openai_engine,
235 |                 dev_set,
236 |                 test_set,
237 |                 args.eval_batch_size
238 |             )
239 |         all_cors.append(cors)
240 |         subject_cors[subject] = np.mean(cors)
241 |         for test_sample, cor, prob in zip(test_set, list(cors), list(probs)):
242 |             test_sample["correct"] = str(cor)
243 |             for j in range(len(prob)):
244 |                 choice = choices[j]
245 |                 test_sample["choice{}_probs".format(choice)] = prob[j]
246 | 
247 |         with open(os.path.join(args.save_dir, "{}.jsonl".format(subject)), "w") as f_out:
248 |             for exp in test_set:
249 |                 f_out.write(json.dumps(exp, ensure_ascii=False))
250 |                 f_out.write("\n")
251 | 
252 |     weighted_acc = np.mean(np.concatenate(all_cors))
253 |     print("Average accuracy: {:.3f}".format(weighted_acc))
254 | 
255 |     with open(os.path.join(args.save_dir, "metrics.json"), "w") as f:
256 |         json.dump(
257 |             {
258 |                 "average_acc": weighted_acc,
259 |                 "subject_acc": subject_cors,
260 |             },
261 |             f,
262 |         )
263 | 
264 | 
265 | if __name__ == "__main__":
266 |     parser = argparse.ArgumentParser()
267 |     parser.add_argument(
268 |         "--ntrain",
269 |         type=int,
270 |         default=5
271 |     )
272 |     parser.add_argument(
273 |         "--data_dir",
274 |         type=str,
275 |         default="data/mmlu"
276 |     )
277 |     parser.add_argument(
278 |         "--save_dir",
279 |         type=str,
280 |         default="results/mmlu/llama-7B/"
281 |     )
282 |     parser.add_argument(
283 |         "--model_name_or_path",
284 |         type=str,
285 |         default=None,
286 |         help="if specified, we will load the model to generate the predictions."
287 |     )
288 |     parser.add_argument(
289 |         "--tokenizer_name_or_path",
290 |         type=str,
291 |         default=None,
292 |         help="if specified, we will load the tokenizer from here."
293 |     )
294 |     parser.add_argument(
295 |         "--use_slow_tokenizer",
296 |         action="store_true",
297 |         help="If given, we will use the slow tokenizer."
298 |     )
299 |     parser.add_argument(
300 |         "--openai_engine",
301 |         type=str,
302 |         default=None,
303 |         help="if specified, we will use the OpenAI API to generate the predictions."
304 |     )
305 |     parser.add_argument(
306 |         "--subjects",
307 |         nargs="*",
308 |         help="which subjects to evaluate. If not specified, all the 57 subjects will be evaluated."
309 |     )
310 |     parser.add_argument(
311 |         "--n_instances",
312 |         type=int,
313 |         help="if specified, a maximum of n_instances per subject will be used for the evaluation."
314 |     )
315 |     parser.add_argument(
316 |         "--eval_batch_size",
317 |         type=int,
318 |         default=1,
319 |         help="batch size for evaluation."
320 |     )
321 |     parser.add_argument(
322 |         "--load_in_8bit",
323 |         action="store_true",
324 |         help="load model in 8bit mode, which will reduce memory and speed up inference."
325 |     )
326 |     parser.add_argument(
327 |         "--gptq",
328 |         action="store_true",
329 |         help="If given, we're evaluating a 4-bit quantized GPTQ model."
330 |     )
331 |     parser.add_argument(
332 |         "--use_chat_format",
333 |         action="store_true",
334 |         help="If given, we will use the chat format for the prompts."
335 |     )
336 |     parser.add_argument(
337 |         "--chat_formatting_function",
338 |         type=str,
339 |         default="eval.templates.create_prompt_with_tulu_chat_format",
340 |         help="The function to use to create the chat format. This function will be dynamically imported. Please see examples in `eval/templates.py`."
341 |     )
342 |     args = parser.parse_args()
343 | 
344 |     assert (args.model_name_or_path is None) != (args.openai_engine is None), "Either model_name_or_path or openai_engine should be specified."
345 |     main(args)
346 | 


--------------------------------------------------------------------------------
/train/train.py:
--------------------------------------------------------------------------------
  1 | # This code is based on tatsu-lab/stanford_alpaca. Below is the original copyright:
  2 | #
  3 | #    Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
  4 | #
  5 | #    Licensed under the Apache License, Version 2.0 (the "License");
  6 | #    you may not use this file except in compliance with the License.
  7 | #    You may obtain a copy of the License at
  8 | #
  9 | #        http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | #    Unless required by applicable law or agreed to in writing, software
 12 | #    distributed under the License is distributed on an "AS IS" BASIS,
 13 | #    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | #    See the License for the specific language governing permissions and
 15 | #    limitations under the License.
 16 | 
 17 | from dataclasses import dataclass, field
 18 | import json
 19 | import math
 20 | import pathlib
 21 | from typing import Dict, Optional, Sequence
 22 | import random;random.seed(42)
 23 | 
 24 | import torch
 25 | from torch.utils.data import Dataset
 26 | import transformers
 27 | from transformers import Trainer
 28 | from transformers.trainer_pt_utils import LabelSmoother
 29 | 
 30 | from fastchat.conversation import SeparatorStyle
 31 | from fastchat.model.model_adapter import get_conversation_template
 32 | 
 33 | IGNORE_TOKEN_ID = LabelSmoother.ignore_index
 34 | 
 35 | 
 36 | @dataclass
 37 | class ModelArguments:
 38 |     model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
 39 | 
 40 | 
 41 | @dataclass
 42 | class DataArguments:
 43 |     data_path: str = field(
 44 |         default=None, metadata={"help": "Path to the training data."}
 45 |     )
 46 |     eval_data_path: str = field(
 47 |         default=None, metadata={"help": "Path to the evaluation data."}
 48 |     )
 49 |     conv_temp: str = field(
 50 |         default="vicuna", metadata={"help": "Conversation template."}
 51 |     )
 52 |     lazy_preprocess: bool = False
 53 | 
 54 | 
 55 | @dataclass
 56 | class TrainingArguments(transformers.TrainingArguments):
 57 |     cache_dir: Optional[str] = field(default=None)
 58 |     optim: str = field(default="adamw_torch")
 59 |     model_max_length: int = field(
 60 |         default=512,
 61 |         metadata={
 62 |             "help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
 63 |         },
 64 |     )
 65 |     flash_attn_transformers: bool = False
 66 | 
 67 | 
 68 | local_rank = None
 69 | 
 70 | 
 71 | def rank0_print(*args):
 72 |     if local_rank == 0:
 73 |         print(*args)
 74 | 
 75 | 
 76 | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
 77 |     """Collects the state dict and dump to disk."""
 78 |     state_dict = trainer.model.state_dict()
 79 |     if trainer.args.should_save:
 80 |         cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
 81 |         del state_dict
 82 |         trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa
 83 | 
 84 | 
 85 | def preprocess(
 86 |     sources,
 87 |     tokenizer: transformers.PreTrainedTokenizer,
 88 |     conv_temp: str,
 89 | ) -> Dict:
 90 |     conv = get_conversation_template(conv_temp)
 91 |     roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
 92 | 
 93 |     # Apply prompt templates
 94 |     conversations = []
 95 |     for i, source in enumerate(sources):
 96 |         if roles[source[0]["from"]] != conv.roles[0]:
 97 |             # Skip the first one if it is not from human
 98 |             source = source[1:]
 99 | 
100 |         conv.messages = []
101 |         for j, sentence in enumerate(source):
102 |             role = roles[sentence["from"]]
103 |             assert role == conv.roles[j % 2], f"{i}"
104 |             conv.append_message(role, sentence["value"])
105 |         conversations.append(conv.get_prompt())
106 | 
107 |     # Tokenize conversations
108 |     input_ids = tokenizer(
109 |         conversations,
110 |         return_tensors="pt",
111 |         padding="max_length",
112 |         max_length=tokenizer.model_max_length,
113 |         truncation=True,
114 |     ).input_ids
115 |     targets = input_ids.clone()
116 | 
117 |     assert conv.sep_style == SeparatorStyle.ADD_COLON_TWO
118 | 
119 |     # Mask targets. Only compute loss on the assistant outputs.
120 |     sep = conv.sep + conv.roles[1] + ": "
121 |     for conversation, target in zip(conversations, targets):
122 |         total_len = int(target.ne(tokenizer.pad_token_id).sum())
123 | 
124 |         turns = conversation.split(conv.sep2)
125 |         cur_len = 1
126 |         target[:cur_len] = IGNORE_TOKEN_ID
127 |         for i, turn in enumerate(turns):
128 |             if turn == "":
129 |                 break
130 |             turn_len = len(tokenizer(turn).input_ids)
131 | 
132 |             parts = turn.split(sep)
133 |             if len(parts) != 2:
134 |                 break
135 |             parts[0] += sep
136 |             # "-2" is hardcoded for the Llama tokenizer to make the offset correct.
137 |             instruction_len = len(tokenizer(parts[0]).input_ids) - 2
138 | 
139 |             if i != 0 and not tokenizer.legacy:
140 |                 # The legacy and non-legacy modes handle special tokens differently
141 |                 instruction_len -= 1
142 | 
143 |             # Ignore the user instructions
144 |             target[cur_len : cur_len + instruction_len] = IGNORE_TOKEN_ID
145 |             cur_len += turn_len
146 | 
147 |             if i != 0 and not tokenizer.legacy:
148 |                 # The legacy and non-legacy modes handle special tokens differently
149 |                 cur_len -= 1
150 | 
151 |         target[cur_len:] = IGNORE_TOKEN_ID
152 | 
153 |         if False:  # Inspect and check the correctness of masking
154 |             z = target.clone()
155 |             z = torch.where(z == IGNORE_TOKEN_ID, tokenizer.unk_token_id, z)
156 |             rank0_print(tokenizer.decode(z))
157 |             exit()
158 | 
159 |         if cur_len < tokenizer.model_max_length:
160 |             if cur_len != total_len:
161 |                 target[:] = IGNORE_TOKEN_ID
162 |                 rank0_print(
163 |                     f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
164 |                     f" #turn = {len(turns) - 1}. (ignored)"
165 |                 )
166 | 
167 |     return dict(
168 |         input_ids=input_ids,
169 |         labels=targets,
170 |         attention_mask=input_ids.ne(tokenizer.pad_token_id),
171 |     )
172 | 
173 | 
174 | def preprocess_remote_model(
175 |     sources,
176 |     tokenizer: transformers.PreTrainedTokenizer,
177 |     conv_temp: str,
178 | ) -> Dict:
179 |     """Designed for Yi and mpt model only."""
180 |     conv = get_conversation_template(conv_temp)
181 |     roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
182 |     conv.sep2 = tokenizer.eos_token
183 | 
184 |     # Apply prompt templates
185 |     conversations = []
186 |     for i, source in enumerate(sources):
187 |         if roles[source[0]["from"]] != conv.roles[0]:
188 |             # Skip the first one if it is not from human
189 |             source = source[1:]
190 | 
191 |         conv.messages = []
192 |         for j, sentence in enumerate(source):
193 |             role = roles[sentence["from"]]
194 |             assert role == conv.roles[j % 2], f"{i}"
195 |             conv.append_message(role, sentence["value"])
196 |         conversations.append(conv.get_prompt())
197 | 
198 |     # Tokenize conversations
199 |     input_ids = tokenizer(
200 |         conversations,
201 |         return_tensors="pt",
202 |         padding="max_length",
203 |         max_length=tokenizer.model_max_length,
204 |         truncation=True,
205 |     ).input_ids
206 |     targets = input_ids.clone()
207 | 
208 |     sep = conv.sep + conv.roles[1] + ": "
209 |     for idx, (conversation, target) in enumerate(zip(conversations, targets)):
210 |         total_len = int(target.ne(tokenizer.pad_token_id).sum())
211 | 
212 |         turns = conversation.split(conv.sep2)
213 | 
214 |         # Yi does not start with bos_token
215 |         cur_len = 0
216 | 
217 |         for i, turn in enumerate(turns):
218 |             if turn == "":
219 |                 break
220 |             turn_len = len(tokenizer(turn).input_ids)
221 |             parts = turn.split(sep)
222 |             if len(parts) != 2:
223 |                 break
224 |             parts[0] += sep
225 | 
226 |             # "-1" is hardcoded for the Yi tokenizer to make the offset correct.
227 |             instruction_len = len(tokenizer(parts[0]).input_ids) - 1
228 | 
229 |             # Ignore the user instructions
230 |             target[cur_len: cur_len + instruction_len] = IGNORE_TOKEN_ID
231 |             cur_len += turn_len
232 |             cur_len += 1  # add a offset for eos_token, e.g. "turn1 <eos_token> turn2 <eos_token>..."
233 | 
234 |         target[cur_len:] = IGNORE_TOKEN_ID
235 | 
236 |         if False:  # Inspect and check the correctness of masking
237 |             z = target.clone()
238 |             z = torch.where(z == IGNORE_TOKEN_ID, tokenizer.unk_token_id, z)
239 |             print(tokenizer.decode(input_ids[0]))
240 |             print(tokenizer.decode(z))
241 |             exit()
242 | 
243 |         if cur_len < tokenizer.model_max_length:
244 |             if cur_len != total_len:
245 |                 target[:] = IGNORE_TOKEN_ID
246 |                 print(
247 |                     f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
248 |                     f" #turn = {len(turns) - 1}. (ignored)"
249 |                 )
250 | 
251 |     return dict(
252 |         input_ids=input_ids,
253 |         labels=targets,
254 |         attention_mask=input_ids.ne(tokenizer.pad_token_id),
255 |     )
256 | 
257 | 
258 | class SupervisedDataset(Dataset):
259 |     """Dataset for supervised fine-tuning."""
260 | 
261 |     def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer, conv_temp: str):
262 |         super(SupervisedDataset, self).__init__()
263 | 
264 |         rank0_print("Formatting inputs...")
265 |         sources = [example["conversations"] for example in raw_data]
266 |         if "pythia" in tokenizer.name_or_path:
267 |             print("Using 'preprocess_remote_model' func.")
268 |             data_dict = preprocess_remote_model(sources, tokenizer, conv_temp)
269 |         else:
270 |             print("Using 'preprocess' func.")
271 |             data_dict = preprocess(sources, tokenizer, conv_temp)
272 | 
273 |         self.input_ids = data_dict["input_ids"]
274 |         self.labels = data_dict["labels"]
275 |         self.attention_mask = data_dict["attention_mask"]
276 | 
277 |     def __len__(self):
278 |         return len(self.input_ids)
279 | 
280 |     def __getitem__(self, i) -> Dict[str, torch.Tensor]:
281 |         return dict(
282 |             input_ids=self.input_ids[i],
283 |             labels=self.labels[i],
284 |             attention_mask=self.attention_mask[i],
285 |         )
286 | 
287 | 
288 | class LazySupervisedDataset(Dataset):
289 |     """Dataset for supervised fine-tuning."""
290 | 
291 |     def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer, conv_temp: str):
292 |         super(LazySupervisedDataset, self).__init__()
293 |         self.tokenizer = tokenizer
294 | 
295 |         rank0_print("Formatting inputs...Skip in lazy mode")
296 |         self.tokenizer = tokenizer
297 |         self.raw_data = raw_data
298 |         self.conv_temp = conv_temp
299 |         self.cached_data_dict = {}
300 |         if "pythia" in self.tokenizer.name_or_path:
301 |             print("Using 'preprocess_remote_model' func.")
302 |         else:
303 |             print("Using 'preprocess' func.")
304 | 
305 |     def __len__(self):
306 |         return len(self.raw_data)
307 | 
308 |     def __getitem__(self, i) -> Dict[str, torch.Tensor]:
309 |         if i in self.cached_data_dict:
310 |             return self.cached_data_dict[i]
311 | 
312 |         if "pythia" in self.tokenizer.name_or_path:
313 |             ret = preprocess_remote_model([self.raw_data[i]["conversations"]], self.tokenizer, self.conv_temp)
314 |         else:
315 |             ret = preprocess([self.raw_data[i]["conversations"]], self.tokenizer, self.conv_temp)
316 |         ret = dict(
317 |             input_ids=ret["input_ids"][0],
318 |             labels=ret["labels"][0],
319 |             attention_mask=ret["attention_mask"][0],
320 |         )
321 |         self.cached_data_dict[i] = ret
322 | 
323 |         return ret
324 | 
325 | 
326 | def make_supervised_data_module(
327 |     tokenizer: transformers.PreTrainedTokenizer, data_args
328 | ) -> Dict:
329 |     """Make dataset and collator for supervised fine-tuning."""
330 |     dataset_cls = (
331 |         LazySupervisedDataset if data_args.lazy_preprocess else SupervisedDataset
332 |     )
333 |     rank0_print(f"Loading data from {data_args.data_path}...")
334 | 
335 |     train_json = json.load(open(data_args.data_path, "r"))
336 |     train_json = random.sample(train_json, len(train_json))
337 |     train_dataset = dataset_cls(train_json, tokenizer=tokenizer, conv_temp=data_args.conv_temp)
338 | 
339 |     if data_args.eval_data_path:
340 |         eval_json = json.load(open(data_args.eval_data_path, "r"))
341 |         eval_dataset = dataset_cls(eval_json, tokenizer=tokenizer, conv_temp=data_args.conv_temp)
342 |     else:
343 |         eval_dataset = None
344 | 
345 |     return dict(train_dataset=train_dataset, eval_dataset=eval_dataset)
346 | 
347 | 
348 | def train():
349 |     global local_rank
350 | 
351 |     parser = transformers.HfArgumentParser(
352 |         (ModelArguments, DataArguments, TrainingArguments)
353 |     )
354 |     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
355 |     local_rank = training_args.local_rank
356 | 
357 |     # Set RoPE scaling factor
358 |     config = transformers.AutoConfig.from_pretrained(
359 |         model_args.model_name_or_path,
360 |         cache_dir=training_args.cache_dir,
361 |     )
362 |     orig_ctx_len = getattr(config, "max_position_embeddings", None)
363 |     if orig_ctx_len and training_args.model_max_length > orig_ctx_len:
364 |         scaling_factor = float(math.ceil(training_args.model_max_length / orig_ctx_len))
365 |         config.rope_scaling = {"type": "linear", "factor": scaling_factor}
366 |     config.use_cache = False
367 | 
368 |     compute_dtype = (
369 |         torch.bfloat16
370 |         if training_args.bf16
371 |         else (torch.float16 if training_args.fp16 else torch.float32)
372 |     )
373 | 
374 |     # Load model and tokenizer
375 |     model = transformers.AutoModelForCausalLM.from_pretrained(
376 |         model_args.model_name_or_path,
377 |         config=config,
378 |         cache_dir=training_args.cache_dir,
379 |         use_flash_attention_2=True if training_args.flash_attn_transformers else False,
380 |         torch_dtype=compute_dtype,
381 |     )
382 |     tokenizer = transformers.AutoTokenizer.from_pretrained(
383 |         model_args.model_name_or_path,
384 |         cache_dir=training_args.cache_dir,
385 |         model_max_length=training_args.model_max_length,
386 |         padding_side="right",
387 |         use_fast=False if "pythia" not in model_args.model_name_or_path else True,
388 |     )
389 |     if "pythia" not in model_args.model_name_or_path:
390 |         tokenizer.pad_token = tokenizer.unk_token
391 |     else:
392 |         print("Set pad token to 1: <|padding|> for pythia model.")
393 |         tokenizer.pad_token = "<|padding|>"
394 |         tokenizer.pad_token_id = 1
395 | 
396 |     # Load data
397 |     data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
398 | 
399 |     # Start trainner
400 |     trainer = Trainer(
401 |         model=model, tokenizer=tokenizer, args=training_args, **data_module
402 |     )
403 |     if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
404 |         trainer.train(resume_from_checkpoint=True)
405 |     else:
406 |         trainer.train()
407 | 
408 |     # Save model
409 |     model.config.use_cache = True
410 |     trainer.save_state()
411 |     safe_save_model_for_hf_trainer(trainer, training_args.output_dir)
412 | 
413 | 
414 | if __name__ == "__main__":
415 |     train()


--------------------------------------------------------------------------------
/data_generation/post_process.py:
--------------------------------------------------------------------------------
  1 | """Postprocess for gpt outputs."""
  2 | 
  3 | import re
  4 | from typing import List, Dict, Optional
  5 | import json
  6 | import argparse
  7 | 
  8 | 
  9 | def parse_fact_enhance_classify_res(raw_response: str):
 10 |     replace_map = {
 11 |         "#Prediction:": "Prediction:",
 12 |         "#Search Query:": "Search Query:",
 13 |         "#Command:": "Command:"
 14 |     }
 15 |     for old_text, new_text in replace_map.items():
 16 |         raw_response = raw_response.replace(old_text, new_text)
 17 |     true_resp = re.split("Command:", raw_response)[0]
 18 |     split_response = re.split("Prediction:|Search Query:", true_resp)
 19 |     if len(split_response) < 2:
 20 |         # print(f"Illegal respnse :{raw_response}")
 21 |         return None
 22 |     analysis, final_prediction = split_response[0], split_response[1]
 23 |     final_prediction = final_prediction.strip()
 24 |     if final_prediction not in ["<need>", "<no need>"]:
 25 |         # print(f"Illegal final_prediction :{raw_response}")
 26 |         return None
 27 |     queris = ""
 28 |     if final_prediction == "<need>":
 29 |         if len(split_response) == 3:
 30 |             queris = split_response[2]
 31 |     analysis = analysis.strip()
 32 |     queris = queris.strip()
 33 |     return {"analysis": analysis, "final_prediction": final_prediction, "queris": queris, }
 34 | 
 35 | 
 36 | def parse_fact_generation_res(raw_response: str):
 37 |     return {"knowledge": raw_response.strip()}
 38 | 
 39 | 
 40 | def parse_test_generation_res(raw_response: str):
 41 |     def veryfy_test(test_sample: Dict[str, str]) -> Optional[Dict[str, str]]:
 42 |         options = re.split("\(A\)|\(B\)|\(C\)|\(D\)", test_sample["options"])
 43 |         choices = [option.strip() for option in options if option]
 44 |         if len(choices) != 4:
 45 |             return None
 46 |         norm_ans = "".join(c for c in test_sample["answer"].upper() if c in ["A", "B", "C", "D"])
 47 |         if len(norm_ans) != 1:
 48 |             return None
 49 |         test_sample["options"], test_sample["answer"] = choices, norm_ans
 50 |         return test_sample
 51 |     replace_map = {
 52 |         "#Question:": "Question:",
 53 |         "#Options:": "Options:",
 54 |         "#Analysis:": "Analysis:",
 55 |         "#Answer:": "Answer:",
 56 |     }
 57 |     for old_text, new_text in replace_map.items():
 58 |         raw_response = raw_response.replace(old_text, new_text)
 59 |     questions = re.split("Question:", raw_response)
 60 |     q_id = 0
 61 |     res_questions = []
 62 |     for single_question in questions:
 63 |         question_components = re.split("Options:|Analysis:|Answer:", single_question)
 64 |         if len(question_components) != 4:
 65 |             continue
 66 |         parse_question = {
 67 |             "question": question_components[0].strip(),
 68 |             "options": question_components[1].strip(),
 69 |             "analysis": question_components[2].strip(),
 70 |             "answer": question_components[3].strip().replace("-", ""),
 71 |         }
 72 |         normalized_test = veryfy_test(parse_question)
 73 |         q_id += 1
 74 |         res_questions.append(normalized_test)
 75 |     return {"tests": res_questions}
 76 | 
 77 | 
 78 | def parse_test_generation(
 79 |     raw_responses: List[str],
 80 |     subset_name: str = "sft",
 81 | ) -> List[Dict[str, str]]:
 82 |     if isinstance(raw_responses[0], dict):  # fix api res as input
 83 |         raw_responses = [exp["raw_response"] for exp in raw_responses]
 84 |     def veryfy_test(test_sample: Dict[str, str]) -> Optional[Dict[str, str]]:
 85 |         options = re.split("\(A\)|\(B\)|\(C\)|\(D\)", test_sample["options"])
 86 |         choices = [option.strip() for option in options if option]
 87 |         if len(choices) != 4:
 88 |             return None
 89 |         norm_ans = "".join(c for c in test_sample["answer"].upper() if c in ["A", "B", "C", "D"])
 90 |         if len(norm_ans) != 1:
 91 |             return None
 92 |         test_sample["options"], test_sample["answer"] = choices, norm_ans
 93 |         return test_sample
 94 |     replace_map = {
 95 |         "#Question:": "Question:",
 96 |         "#Options:": "Options:",
 97 |         "#Analysis:": "Analysis:",
 98 |         "#Answer:": "Answer:",
 99 |     }
100 |     fail_q_num, res_questions = 0, []
101 |     for idx, raw_response in enumerate(raw_responses):
102 |         instance_idx = f"idx_{subset_name}{idx}"
103 |         for old_text, new_text in replace_map.items():
104 |             raw_response = raw_response.replace(old_text, new_text)
105 |         questions = re.split("Question:", raw_response)
106 |         q_id = 0
107 |         for single_question in questions:
108 |             question_components = re.split("Options:|Analysis:|Answer:", single_question)
109 |             if len(question_components) != 4:
110 |                 fail_q_num += 1
111 |                 continue
112 |             parse_question = {
113 |                 "question": question_components[0].strip(),
114 |                 "options": question_components[1].strip(),
115 |                 "analysis": question_components[2].strip(),
116 |                 "answer": question_components[3].strip().replace("-", ""),
117 |             }
118 |             normalized_test = veryfy_test(parse_question)
119 |             if not normalized_test:
120 |                 fail_q_num += 1
121 |                 continue
122 |             parse_question["idx"] = f"{instance_idx}_test{q_id}"
123 |             res_questions.append(parse_question)
124 |             q_id += 1
125 |     return res_questions
126 | 
127 | 
128 | def fact_enhance_classify_post_process(input_file, output_file, mode):
129 |     """
130 |     postprocess for fact enhance classify results
131 |     1. parse result;
132 |     2. select result with need fact for fact gen;
133 |     """
134 |     data = [json.loads(line.strip()) for line in open(input_file, "r")]
135 |     processed_data = []
136 |     for example in data:
137 |         if mode == "parse_res":
138 |             processed_example = parse_fact_enhance_classify_res(example["raw_response"])
139 |             if processed_example is not None:
140 |                 example.update(processed_example)
141 |             processed_data.append(example)
142 |         elif mode == "select_need":
143 |             if "final_prediction" in example and example["final_prediction"] == "<need>":
144 |                 processed_example = {"input": example["original_input"]["input"],
145 |                                      "output": example["original_input"]["output"],
146 |                                      "analysis": example["analysis"],
147 |                                      "queris": example["queris"],
148 |                                      }
149 |                 processed_data.append(processed_example)
150 |         else:
151 |             print(f"{mode} is not supported.")
152 |             raise NotImplementedError
153 |     with open(output_file, "w") as fout:
154 |         for line in processed_data:
155 |             fout.write(json.dumps(line) + '\n')
156 | 
157 | 
158 | def fact_generation_post_process(input_file, output_file, mode):
159 |     """
160 |     postprocess for fact generation results
161 |     1. parse result;
162 |     """
163 |     data = [json.loads(line.strip()) for line in open(input_file, "r")]
164 |     processed_data = []
165 |     for example in data:
166 |         if mode == "parse_res":
167 |             processed_example = parse_fact_generation_res(example["raw_response"])
168 |             if processed_example is not None:
169 |                 example.update(processed_example)
170 |             processed_data.append(example)
171 |         else:
172 |             print(f"{mode} is not supported.")
173 |             raise NotImplementedError
174 |     with open(output_file, "w") as fout:
175 |         for line in processed_data:
176 |             fout.write(json.dumps(line) + '\n')
177 | 
178 | 
179 | def test_generation_post_process(input_file, output_file, mode):
180 |     """
181 |     postprocess for test generation results
182 |     1. parse result;
183 |     2. normalize result;
184 |     """
185 |     data = [json.loads(line.strip()) for line in open(input_file, "r")]
186 |     if mode == "parse_res":
187 |         processed_data = []
188 |         for example in data:
189 |             processed_example = parse_test_generation_res(example["raw_response"])
190 |             if processed_example is not None:
191 |                 example.update(processed_example)
192 |             processed_data.append(example)
193 |     elif mode == "normalize":
194 |         processed_data = parse_test_generation([x["raw_response"] for x in data])
195 |     else:
196 |         print(f"{mode} is not supported.")
197 |         raise NotImplementedError
198 |     with open(output_file, "w") as fout:
199 |         for line in processed_data:
200 |             fout.write(json.dumps(line) + '\n')
201 | 
202 | 
203 | if __name__ == "__main__":
204 |     parser = argparse.ArgumentParser()
205 |     parser.add_argument('--split', type=str, help='train/test/test_truth')
206 |     parser.add_argument('--stage', type=str, help='fact_enhance_classify/fact_generation/test_generation')
207 |     args = parser.parse_args()
208 |     print(args)
209 |     global_path = "../data/generation_results"
210 |     if (args.split == "test" or args.split == "test_truth") and args.stage == "fact_enhance_classify":
211 |         # ========== test ========== fact_enhance_classify ==========
212 |         split = args.split
213 |         stage = args.stage
214 |         mode = "parse_res"
215 |         for data_name in ["lima_testset_single_turn_classify",
216 |                           "vicuna_testset_single_turn_classify",
217 |                           "wizardlm_testset_single_turn_classify",
218 |                           "truthfulqa_testset_single_turn_classify"
219 |                           ]:
220 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
221 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
222 |             fact_enhance_classify_post_process(input_file, output_file, mode)
223 |         mode = "select_need"
224 |         for data_name in ["lima_testset_single_turn_classify",
225 |                           "vicuna_testset_single_turn_classify",
226 |                           "wizardlm_testset_single_turn_classify",
227 |                           "truthfulqa_testset_single_turn_classify"
228 |                           ]:
229 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
230 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
231 |             fact_enhance_classify_post_process(input_file, output_file, mode)
232 |     if (args.split == "test" or args.split == "test_truth") and args.stage == "fact_generation":
233 |         # ========== test ========== fact_generation ==========
234 |         split = args.split
235 |         stage = args.stage
236 |         mode = "parse_res"
237 |         for data_name in ["lima_testset_single_turn_classify_parse_res_select_need_knowledge_gen",
238 |                           "vicuna_testset_single_turn_classify_parse_res_select_need_knowledge_gen",
239 |                           "wizardlm_testset_single_turn_classify_parse_res_select_need_knowledge_gen",
240 |                           "truthfulqa_testset_single_turn_classify_parse_res_select_need_knowledge_gen"
241 |                           ]:
242 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
243 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
244 |             fact_generation_post_process(input_file, output_file, mode)
245 |     if (args.split == "test" or args.split == "test_truth") and args.stage == "test_generation":
246 |         # ========== test ========== test_generation ==========
247 |         split = args.split
248 |         stage = args.stage
249 |         mode = "parse_res"
250 |         for data_name in ["lima_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen",
251 |                           "vicuna_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen",
252 |                           "wizardlm_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen",
253 |                           "truthfulqa_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen"
254 |                           ]:
255 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
256 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
257 |             test_generation_post_process(input_file, output_file, mode)
258 | 
259 |         split = args.split
260 |         stage = args.stage
261 |         mode = "normalize"
262 |         for data_name in ["lima_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen",
263 |                           "vicuna_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen",
264 |                           "wizardlm_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen",
265 |                           "truthfulqa_testset_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen"
266 |                           ]:
267 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
268 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
269 |             test_generation_post_process(input_file, output_file, mode)
270 |     if args.split == "train" and args.stage == "fact_enhance_classify":
271 |         # ========== train ========== fact_enhance_classify ==========
272 |         split = args.split
273 |         stage = args.stage
274 |         mode = "parse_res"
275 |         for data_name in ["wizardlm_alpaca_single_turn_classify"]:
276 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
277 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
278 |             fact_enhance_classify_post_process(input_file, output_file, mode)
279 |         mode = "select_need"
280 |         for data_name in ["wizardlm_alpaca_single_turn_classify_parse_res"]:
281 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
282 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
283 |             fact_enhance_classify_post_process(input_file, output_file, mode)
284 |     if args.split == "train" and args.stage == "fact_generation":
285 |         # ========== train ========== fact_generation ==========
286 |         split = args.split
287 |         stage = args.stage
288 |         mode = "parse_res"
289 |         for data_name in ["wizardlm_alpaca_single_turn_classify_parse_res_select_need_knowledge_gen"]:
290 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
291 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
292 |             fact_generation_post_process(input_file, output_file, mode)
293 |     if args.split == "train" and args.stage == "test_generation":
294 |         # ========== train ========== test_generation ==========
295 |         split = args.split
296 |         stage = args.stage
297 |         mode = "parse_res"
298 |         for data_name in ["wizardlm_alpaca_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen"]:
299 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
300 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
301 |             test_generation_post_process(input_file, output_file, mode)
302 | 
303 |         split = args.split
304 |         stage = args.stage
305 |         mode = "normalize"
306 |         for data_name in ["wizardlm_alpaca_single_turn_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen"]:
307 |             input_file = f"{global_path}/{split}/{stage}/{data_name}.jsonl"
308 |             output_file = f"{global_path}/{split}/{stage}/{data_name}_{mode}.jsonl"
309 |             test_generation_post_process(input_file, output_file, mode)
310 | 


--------------------------------------------------------------------------------
/examination/utils.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | import tqdm
  3 | import json
  4 | import time
  5 | import asyncio
  6 | import openai
  7 | import os
  8 | from importlib import import_module
  9 | from transformers import StoppingCriteria
 10 | from typing import List, Tuple, Dict, Union
 11 | from examination.dispatch_openai_requests import dispatch_openai_chat_requesets, dispatch_openai_prompt_requesets
 12 | 
 13 | 
 14 | def encode_with_prompt_completion_format(example, tokenizer, max_seq_length):
 15 |     '''
 16 |     Here we assume each example has 'prompt' and 'completion' fields.
 17 |     We concatenate prompt and completion and tokenize them together because otherwise prompt will be padded/trancated
 18 |     and it doesn't make sense to follow directly with the completion.
 19 |     '''
 20 |     # if prompt doesn't end with space and completion doesn't start with space, add space
 21 |     if not example['prompt'].endswith((' ', '\n', '\t')) and not example['completion'].startswith((' ', '\n', '\t')):
 22 |         example_text = example['prompt'] + ' ' + example['completion']
 23 |     else:
 24 |         example_text = example['prompt'] + example['completion']
 25 |     example_text = example_text + tokenizer.eos_token
 26 |     tokenized_example = tokenizer(example_text, return_tensors='pt', max_length=max_seq_length, truncation=True)
 27 |     input_ids = tokenized_example.input_ids
 28 |     labels = input_ids.clone()
 29 |     tokenized_prompt = tokenizer(example['prompt'], return_tensors='pt', max_length=max_seq_length, truncation=True)
 30 |     # mask the prompt part for avoiding loss
 31 |     labels[:, :tokenized_prompt.input_ids.shape[1]] = -100
 32 |     attention_mask = torch.ones_like(input_ids)
 33 |     return {
 34 |         'input_ids': input_ids.flatten(),
 35 |         'labels': labels.flatten(),
 36 |         'attention_mask': attention_mask.flatten(),
 37 |     }
 38 | 
 39 | 
 40 | class KeyWordsCriteria(StoppingCriteria):
 41 |     def __init__(self, stop_id_sequences):
 42 |         assert isinstance(stop_id_sequences[0], list), "stop_id_sequences should be a list of list of ids"
 43 |         self.stop_sequences = stop_id_sequences
 44 | 
 45 |     def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
 46 |         sequences_should_be_stopped = []
 47 |         for i in range(input_ids.shape[0]):
 48 |             sequence_should_be_stopped = False
 49 |             for stop_sequence in self.stop_sequences:
 50 |                 if input_ids[i][-len(stop_sequence):].tolist() == stop_sequence:
 51 |                     sequence_should_be_stopped = True
 52 |                     break
 53 |             sequences_should_be_stopped.append(sequence_should_be_stopped)
 54 |         return all(sequences_should_be_stopped)
 55 |     
 56 |     
 57 | @torch.no_grad()
 58 | def generate_completions(model, tokenizer, prompts, batch_size=1, stop_id_sequences=None, add_special_tokens=True, disable_tqdm=False, **generation_kwargs):
 59 |     generations = []
 60 |     if not disable_tqdm:
 61 |         progress = tqdm.tqdm(total=len(prompts), desc="Generating Completions")
 62 | 
 63 |     num_return_sequences = generation_kwargs.get("num_return_sequences", 1)
 64 |     for i in range(0, len(prompts), batch_size):
 65 |         batch_prompts = prompts[i:i+batch_size]
 66 |         tokenized_prompts = tokenizer(batch_prompts, padding="longest", return_tensors="pt", add_special_tokens=add_special_tokens)
 67 |         batch_input_ids = tokenized_prompts.input_ids
 68 |         attention_mask = tokenized_prompts.attention_mask
 69 | 
 70 |         if model.device.type == "cuda":
 71 |             batch_input_ids = batch_input_ids.cuda()
 72 |             attention_mask = attention_mask.cuda()
 73 | 
 74 |         try:
 75 |             batch_outputs = model.generate(
 76 |                 input_ids=batch_input_ids,
 77 |                 attention_mask=attention_mask,
 78 |                 stopping_criteria=[KeyWordsCriteria(stop_id_sequences)] if stop_id_sequences else None,
 79 |                 **generation_kwargs
 80 |             )
 81 |         
 82 |             # the stopping criteria is applied at batch level, so if other examples are not stopped, the entire batch will continue to generate.
 83 |             # so some outputs still have the stop sequence, which we need to remove.
 84 |             if stop_id_sequences:
 85 |                 for output_idx in range(batch_outputs.shape[0]):
 86 |                     for token_idx in range(batch_input_ids.shape[1], batch_outputs.shape[1]):
 87 |                         if any(batch_outputs[output_idx, token_idx: token_idx+len(stop_sequence)].tolist() == stop_sequence for stop_sequence in stop_id_sequences):
 88 |                             batch_outputs[output_idx, token_idx:] = tokenizer.pad_token_id
 89 |                             break
 90 | 
 91 |             # remove the prompt from the output
 92 |             # we need to re-encode the prompt because we need to make sure the special tokens are treated the same way as in the outputs.
 93 |             # we changed our previous way of truncating the output token ids dicrectly because some tokenizer (e.g., llama) won't add space token before the first token.
 94 |             # space is important for some tasks (e.g., code completion).
 95 |             batch_outputs = tokenizer.batch_decode(batch_outputs, skip_special_tokens=True)
 96 |             batch_prompts = tokenizer.batch_decode(batch_input_ids, skip_special_tokens=True)
 97 |             # duplicate the prompts to match the number of return sequences
 98 |             batch_prompts = [prompt for prompt in batch_prompts for _ in range(num_return_sequences)]
 99 |             batch_generations = [
100 |                 output[len(prompt):] for prompt, output in zip(batch_prompts, batch_outputs)
101 |             ]
102 |         except Exception as e:
103 |             print("Error when generating completions for batch:")
104 |             print(batch_prompts)
105 |             print("Error message:")
106 |             print(e)
107 |             print("Use empty string as the completion.")
108 |             batch_generations = [""] * len(batch_prompts) * num_return_sequences
109 | 
110 |         generations += batch_generations
111 | 
112 |         # for prompt, generation in zip(batch_prompts, batch_generations):
113 |         #     print("========")
114 |         #     print(prompt)
115 |         #     print("--------")
116 |         #     print(generation)
117 | 
118 |         if not disable_tqdm:
119 |             progress.update(len(batch_prompts)//num_return_sequences)
120 | 
121 |     assert len(generations) == len(prompts) * num_return_sequences, "number of generations should be equal to number of prompts * num_return_sequences"
122 |     return generations
123 | 
124 | 
125 | @torch.no_grad()
126 | def get_next_word_predictions(
127 |     model,
128 |     tokenizer,
129 |     prompts,
130 |     candidate_token_ids=None,
131 |     batch_size=1,
132 |     return_token_predictions=False,
133 |     add_special_tokens=True,
134 |     disable_tqdm=False
135 | ) -> Tuple[List[Union[int, str]], List[List[float]]]:
136 |     predictions, probs = [], []
137 |     if not disable_tqdm:
138 |         progress = tqdm.tqdm(total=len(prompts), desc="Getting Predictions")
139 | 
140 |     for i in range(0, len(prompts), batch_size):
141 |         batch_prompts = prompts[i: i+batch_size]
142 |         tokenized_prompts = tokenizer(batch_prompts, padding="longest", return_tensors="pt", add_special_tokens=add_special_tokens)
143 |         batch_input_ids = tokenized_prompts.input_ids
144 |         attention_mask = tokenized_prompts.attention_mask
145 | 
146 |         if model.device.type == "cuda":
147 |             batch_input_ids = batch_input_ids.cuda()
148 |             attention_mask = attention_mask.cuda()
149 | 
150 |         batch_logits = model(input_ids=batch_input_ids, attention_mask=attention_mask).logits[:, -1, :]
151 |         batch_probs = torch.softmax(batch_logits, dim=-1)
152 |         if candidate_token_ids is not None:
153 |             batch_probs = batch_probs[:, candidate_token_ids]
154 |         batch_prediction_indices = torch.argmax(batch_probs, dim=-1)
155 |         if return_token_predictions:
156 |             if candidate_token_ids is not None:
157 |                 candidate_tokens = tokenizer.convert_ids_to_tokens(candidate_token_ids)
158 |                 batch_predictions = [candidate_tokens[idx] for idx in batch_prediction_indices]
159 |             else:
160 |                 batch_predictions = tokenizer.convert_ids_to_tokens(batch_prediction_indices)
161 |             predictions += batch_predictions
162 |         else:
163 |             predictions += batch_prediction_indices.tolist()
164 |         probs += batch_probs.tolist()
165 | 
166 |         if not disable_tqdm:
167 |             progress.update(len(batch_prompts))
168 | 
169 |     assert len(predictions) == len(prompts), "number of predictions should be equal to number of prompts"
170 |     return predictions, probs
171 | 
172 | 
173 | @torch.no_grad()
174 | def score_completions(model, tokenizer, scoring_examples, disable_tqdm=False):
175 |     '''
176 |     Each scoring example is a dict, which contains the following keys:
177 |     - prompt: the prompt to score
178 |     - completions: a list of completions to score
179 |     '''
180 |     
181 |     if not disable_tqdm:
182 |         progress = tqdm.tqdm(total=len(scoring_examples), desc="Scoring Completions")
183 | 
184 |     # unroll the scoring examples
185 |     unrolled_examples = []
186 |     for scoring_example in scoring_examples:
187 |         prompt = scoring_example["prompt"]
188 |         for completion in scoring_example["completions"]:
189 |             unrolled_examples.append({
190 |                 "prompt": prompt,
191 |                 "completion": completion
192 |             })
193 | 
194 |     scores = []
195 |     # currently we don't support batching, because we want to directly use the loss returned by the model to score each completion.
196 |     for unrolled_example in unrolled_examples:
197 |         encoded_example = encode_with_prompt_completion_format(unrolled_example, tokenizer, max_seq_length=None)
198 |         # unsqueeze the batch dimension
199 |         for key, value in encoded_example.items():
200 |             encoded_example[key] = value.unsqueeze(0)
201 |         if model.device.type == "cuda":
202 |             encoded_example = {
203 |                 key: value.cuda() for key, value in encoded_example.items()
204 |             }
205 |         outputs = model(**encoded_example)
206 |         loss = outputs.loss
207 |         scores.append(-loss.item())
208 |         if not disable_tqdm:
209 |             progress.update(1)
210 | 
211 |     # roll up the scores
212 |     rolled_up_scores = {}
213 |     for unrolled_example, score in zip(unrolled_examples, scores):
214 |         prompt = unrolled_example["prompt"]
215 |         completion = unrolled_example["completion"]
216 |         if prompt not in rolled_up_scores:
217 |             rolled_up_scores[prompt] = {}
218 |         rolled_up_scores[prompt][completion] = score
219 | 
220 |     return rolled_up_scores
221 | 
222 | 
223 | def load_hf_lm_and_tokenizer(
224 |     model_name_or_path, 
225 |     tokenizer_name_or_path=None, 
226 |     device_map="auto", 
227 |     torch_dtype="auto",
228 |     load_in_8bit=False, 
229 |     convert_to_half=False,
230 |     gptq_model=False,
231 |     use_fast_tokenizer=True,
232 |     padding_side="left",
233 | ):
234 |     
235 |     from transformers import AutoModelForCausalLM, AutoTokenizer, OPTForCausalLM, GPTNeoXForCausalLM
236 | 
237 |     if gptq_model:
238 |         from auto_gptq import AutoGPTQForCausalLM
239 |         model_wrapper = AutoGPTQForCausalLM.from_quantized(
240 |             model_name_or_path, device="cuda:0", use_triton=True
241 |         )
242 |         model = model_wrapper.model  
243 |     elif load_in_8bit:
244 |         model = AutoModelForCausalLM.from_pretrained(
245 |             model_name_or_path, 
246 |             device_map=device_map, 
247 |             load_in_8bit=True
248 |         )
249 |     else:
250 |         if device_map:
251 |             model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map=device_map, torch_dtype=torch_dtype)
252 |         else:
253 |             model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
254 |             if torch.cuda.is_available():
255 |                 model = model.cuda()
256 |         if convert_to_half:
257 |             model = model.half()
258 |     model.eval()
259 | 
260 |     if not tokenizer_name_or_path:
261 |         tokenizer_name_or_path = model_name_or_path
262 |     try:
263 |         tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, use_fast=use_fast_tokenizer)
264 |     except:
265 |         # some tokenizers (e.g., GPTNeoXTokenizer) don't have the slow or fast version, so we just roll back to the default one
266 |         tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
267 |     # set padding side to left for batch generation
268 |     tokenizer.padding_side = padding_side
269 |     # set pad token to eos token if pad token is not set (as is the case for llama models)
270 |     if tokenizer.pad_token is None:
271 |         tokenizer.pad_token = tokenizer.eos_token
272 |         tokenizer.pad_token_id = tokenizer.eos_token_id
273 | 
274 |     # for OPT and Pythia models, we need to set tokenizer.model_max_length to model.config.max_position_embeddings 
275 |     # to avoid wrong embedding index.    
276 |     if isinstance(model, GPTNeoXForCausalLM) or isinstance(model, OPTForCausalLM):
277 |         tokenizer.model_max_length = model.config.max_position_embeddings
278 |         print("Set tokenizer.model_max_length to model.config.max_position_embeddings: {}".format(model.config.max_position_embeddings))
279 |         
280 |     return model, tokenizer
281 | 
282 | 
283 | def query_openai_chat_model(engine, instances, output_path=None, batch_size=10, retry_limit=5, reuse_existing_outputs=True, **completion_kwargs):
284 |     '''
285 |     Query OpenAI chat model and save the results to output_path.
286 |     `instances` is a list of dictionaries, each dictionary contains a key "prompt" and a key "id".
287 |     '''
288 |     existing_data = {}
289 |     if reuse_existing_outputs and output_path is not None and os.path.exists(output_path):
290 |         with open(output_path, "r") as f:
291 |             for line in f:
292 |                 instance = json.loads(line)
293 |                 existing_data[instance["id"]] = instance
294 | 
295 |     # by default, we use temperature 0.0 to get the most likely completion.
296 |     if "temperature" not in completion_kwargs:
297 |         completion_kwargs["temperature"] = 0.0
298 |     
299 |     results = []
300 |     if output_path is not None:
301 |         fout = open(output_path, "w")
302 | 
303 |     retry_count = 0
304 |     progress_bar = tqdm.tqdm(total=len(instances))
305 |     for i in range(0, len(instances), batch_size):
306 |         batch = instances[i:i+batch_size]
307 |         if all([x["id"] in existing_data for x in batch]):
308 |             results.extend([existing_data[x["id"]] for x in batch])
309 |             if output_path is not None:
310 |                 for instance in batch:
311 |                     fout.write(json.dumps(existing_data[instance["id"]]) + "\n")
312 |                     fout.flush()
313 |             progress_bar.update(batch_size)
314 |             continue
315 |         messages_list = []
316 |         for instance in batch:
317 |             messages = [{"role": "user", "content": instance["prompt"]}]
318 |             messages_list.append(messages)
319 |         
320 |         while retry_count < retry_limit:
321 |             try:
322 |                 if batch_size > 1:
323 |                     outputs = asyncio.run(
324 |                         dispatch_openai_chat_requesets(
325 |                         messages_list=messages_list,
326 |                         model=engine,
327 |                         **completion_kwargs,
328 |                     ))
329 |                 else:
330 |                     print(f"completion_kwargs: {completion_kwargs}")
331 |                     print(f"messages_list: {messages_list}")
332 |                     print(f"engine: {engine}")
333 |                     outputs = [openai.ChatCompletion.create(
334 |                               model=engine,
335 |                               messages=messages_list[0],
336 |                               # **completion_kwargs,
337 |                     )]
338 |                 retry_count = 0
339 |                 break
340 |             except Exception as e:
341 |                 retry_count += 1
342 |                 print(f"Error while requesting OpenAI API.")
343 |                 print(e)
344 |                 print(f"Sleep for {30*retry_count} seconds.")
345 |                 time.sleep(30*retry_count)
346 |                 print(f"Retry for the {retry_count} time.")
347 |         if retry_count == retry_limit:
348 |             raise RuntimeError(f"Failed to get response from OpenAI API after {retry_limit} retries.")
349 |         assert len(outputs) == len(batch)
350 |         for instance, output in zip(batch, outputs):
351 |             instance[f"output"] = output["choices"][0]["message"]["content"]
352 |             instance["response_metadata"] = output
353 |             results.append(instance)
354 |             if output_path is not None:
355 |                 fout.write(json.dumps(instance) + "\n")
356 |                 fout.flush()
357 |         progress_bar.update(batch_size)
358 |     return results
359 |  
360 | 
361 | def query_openai_model(engine, instances, output_path=None, batch_size=10, retry_limit=5, reuse_existing_outputs=True, **completion_kwargs):
362 |     '''
363 |     Query OpenAI chat model and save the results to output_path.
364 |     `instances` is a list of dictionaries, each dictionary contains a key "prompt" and a key "id".
365 |     '''
366 |     existing_data = {}
367 |     if reuse_existing_outputs and output_path is not None and os.path.exists(output_path):
368 |         with open(output_path, "r") as f:
369 |             for line in f:
370 |                 instance = json.loads(line)
371 |                 existing_data[instance["id"]] = instance
372 | 
373 |     # by default, we use temperature 0.0 to get the most likely completion.
374 |     if "temperature" not in completion_kwargs:
375 |         completion_kwargs["temperature"] = 0.0
376 | 
377 |     results = []
378 |     if output_path is not None:
379 |         fout = open(output_path, "w")
380 | 
381 |     retry_count = 0
382 |     progress_bar = tqdm.tqdm(total=len(instances))
383 |     for i in range(0, len(instances), batch_size):
384 |         batch = instances[i:i+batch_size]
385 |         if all([x["id"] in existing_data for x in batch]):
386 |             results.extend([existing_data[x["id"]] for x in batch])
387 |             if output_path is not None:
388 |                 for instance in batch:
389 |                     fout.write(json.dumps(existing_data[instance["id"]]) + "\n")
390 |                     fout.flush()
391 |             progress_bar.update(batch_size)
392 |             continue
393 |         messages_list = []
394 |         for instance in batch:
395 |             messages = instance["prompt"]
396 |             messages_list.append(messages)
397 |         while retry_count < retry_limit:
398 |             try:
399 |                 outputs = asyncio.run(
400 |                     dispatch_openai_prompt_requesets(
401 |                     prompt_list=messages_list,
402 |                     model=engine,
403 |                     **completion_kwargs,
404 |                 ))
405 |                 retry_count = 0
406 |                 break
407 |             except Exception as e:
408 |                 retry_count += 1
409 |                 print(f"Error while requesting OpenAI API.")
410 |                 print(e)
411 |                 print(f"Sleep for {30*retry_count} seconds.")
412 |                 time.sleep(30*retry_count)
413 |                 print(f"Retry for the {retry_count} time.")
414 |         if retry_count == retry_limit:
415 |             raise RuntimeError(f"Failed to get response from OpenAI API after {retry_limit} retries.")
416 |         assert len(outputs) == len(batch)
417 |         for instance, output in zip(batch, outputs):
418 |             instance[f"output"] = output["choices"][0]["text"]
419 |             instance["response_metadata"] = output
420 |             results.append(instance)
421 |             if output_path is not None:
422 |                 fout.write(json.dumps(instance) + "\n")
423 |                 fout.flush()
424 |         progress_bar.update(batch_size)
425 |     return results
426 | 
427 | 
428 | def dynamic_import_function(function_path):
429 |     '''
430 |     Dynamically import a function from a path string (e.g., "module.submodule.my_function")
431 |     '''
432 |     module_path, function_name = function_path.rsplit(".", 1)
433 |     module = import_module(module_path)
434 |     function = getattr(module, function_name)
435 |     return function
436 |  


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | <p align="center" width="100%">
  2 | </p>
  3 | 
  4 | <div id="top" align="center">
  5 | 
  6 | Knowledge Verification to Nip Hallucination in the Bud
  7 | -----------------------------
  8 | 
  9 | <img src="https://img.shields.io/badge/Version-1.0.0-blue.svg" alt="Version"> 
 10 | <img src="https://img.shields.io/badge/License-Apache_2.0-green.svg" alt="License">
 11 | <img src="https://img.shields.io/github/stars/fanqiwan/KCA?color=yellow" alt="Stars">
 12 | <img src="https://img.shields.io/github/issues/fanqiwan/KCA?color=red" alt="Issues">
 13 | 
 14 | <h4> |<a href="https://arxiv.org/abs/2401.10768"> 📑 Paper </a> |
 15 | <a href="https://huggingface.co/collections/Wanfq/kca-65aa9ab25c84d7d5e7c9d0da"> 🤗 HuggingFace Repo </a> |
 16 | <a href="https://github.com/fanqiwan/KCA"> 🐱 GitHub Repo </a> |
 17 | </h4>
 18 | 
 19 | <!-- **Authors:** -->
 20 | 
 21 | _**Fanqi Wan<sup>†</sup>, Xinting Huang<sup>‡</sup>, Leyang Cui<sup>‡</sup>, Xiaojun Quan<sup>†</sup>, Wei Bi<sup>‡</sup>, Shuming Shi<sup>‡</sup>**_
 22 | 
 23 | 
 24 | <!-- **Affiliations:** -->
 25 | 
 26 | 
 27 | _<sup>†</sup> Sun Yat-sen University,
 28 | <sup>‡</sup> Tencent AI Lab_
 29 | 
 30 | </div>
 31 | 
 32 | 
 33 | ## News
 34 | - **Jan 19, 2024:** 🔥 We're excited to announce that the KCA datasets for open-book tuning, discard tuning, and refusal tuning are now available on 🤗 [Huggingface Datasets](https://huggingface.co/datasets/Wanfq/KCA_data). The fine-tuned models are now available on 🤗 [Huggingface Models](https://huggingface.co/models?sort=trending&search=KCA). Happy exploring!
 35 | 
 36 | ## Contents
 37 | 
 38 | - [Overview](#overview)
 39 | - [Data Release](#data-release)
 40 | - [Model Release](#model-release)
 41 | - [Knowledge Inconsistency Detection](#knowledge-inconsistency-detection)
 42 | - [Knowledge Inconsistency Calibration](#knowledge-inconsistency-calibration)
 43 | - [Evaluation](#evaluation)
 44 | - [Citation](#citation)
 45 | 
 46 | ## Overview
 47 |  
 48 | In this study, we demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external knowledge present in the alignment data and the intrinsic knowledge embedded within foundation LLMs.
 49 | 
 50 | <p align="center">
 51 |     <img src="./assets/fig_1_2.png" width="95%"> <br>
 52 | </p>
 53 | 
 54 | Specifically, we propose a novel approach called Knowledge Consistent Alignment (KCA), which employs a well-aligned LLM to automatically formulate assessments based on external knowledge to evaluate the knowledge boundaries of foundation LLMs. To address knowledge inconsistencies in the alignment data, KCA implements several specific strategies to deal with these data instances, which involve (i) open-book tuning, (ii) discard tuning, and (iii) refusal tuning.
 55 | 
 56 | <p align="center">
 57 |     <img src="./assets/fig_3.png" width="95%"> <br>
 58 | </p>
 59 | 
 60 | ## Data Release
 61 | 
 62 | We release the KCA datasets for open-book tuning, discard tuning, and refusal tuning on [./data/processed_results](https://huggingface.co/datasets/Wanfq/KCA_data/tree/main/data/processed_results). Please note that each dataset is corresponding to a specific tuning method and a foundation LLM. The dataset is a structured data file in the JSON format. It consists of a list of dictionaries, with each dictionary containing multiple fields. Below is an example:
 63 | 
 64 | ```
 65 | {
 66 |   "id": "...",  # Data index.
 67 |   "conversations": [
 68 |     {
 69 |       "from": "human",
 70 |       "value": "..."  # Human instruction.
 71 |     },
 72 |     {
 73 |       "from": "gpt",
 74 |       "value": "..."  # LLM response.
 75 |     }
 76 |   ],
 77 |   "class": "...",  # Three categories: "no_need_fact" (the instruction does not require knowledge), "need_and_have_fact" (the instruction requires knowledge and the foundation LLM understands the generated knowledge), "need_and_have_no_fact" (the instruction requires knowledge but the foundation LLM does not understand the generated knowledge).
 78 |   "analysis": "...",  # Analysis for whether the instruction requires knowledge.
 79 |   "knowledge": "..."  # Generated knowledge.
 80 | }
 81 | ```
 82 | 
 83 | We show the percentage (%) of the consistent subset (the instruction requires knowledge and the foundation LLM understands the generated knowledge) and the inconsistent subset (the instruction requires knowledge but the foundation LLM does not understand the generated knowledge) across various foundation LLMs on different training and evaluation datasets as follows: 
 84 | 
 85 | <p align="center">
 86 |     <img src="./assets/fig_4.png" width="95%"> <br>
 87 | </p>
 88 | 
 89 | ## Model Release
 90 | 
 91 | We release the KCA models fine-tuned with different tuning methods on 🤗 [Huggingface Models](https://huggingface.co/models?sort=trending&search=KCA). Please note that each model is corresponding to a specific tuning method and a foundation LLM.
 92 | 
 93 | ### Hallucination Mitigation
 94 | 
 95 | To facilitate a comprehensive evaluation, we conduct both LLM-based judgment and metric-based judgment. For the LLM-based judgment, we evaluate the performance on the LIMAEval, VicunaEval, WizardLMEval, and TruthfulQA benchmarks with GPT-4 to measure the hallucination rate. In terms of metric-based judgment, we assess the ROUGE-1, ROUGE-2, and ROUGE-L scores on the MS MARCO and ACI-Bench benchmarks.
 96 | 
 97 | The evaluation results of hallucination rate (%) on four public benchmarks for general instruction-following and truthful question answering with GPT-4 judgment are shown as follows, with a lower rate indicating better performance:
 98 | 
 99 | <p align="center">
100 |     <img src="./assets/tab_1.png" width="95%"> <br>
101 | </p>
102 | 
103 | The evaluation results of ROUGE-1, ROUGE-2, and ROUGE-L on two public benchmarks for search and retrieve and clinical report generation are shown as follows, with a higher score indicating better performance:
104 | 
105 | <p align="center">
106 |     <img src="./assets/tab_2.png" width="95%"> <br>
107 | </p>
108 | 
109 | ### Helpfulness Maintenance
110 | 
111 | The evaluation results of the helpful score on four public benchmarks for general instruction-following and truthful question answering with GPT-4 judgment are shown as follows, where the helpful score ranges from one (worst) to ten (best):
112 | 
113 | <p align="center">
114 |     <img src="./assets/tab_3.png" width="95%"> <br>
115 | </p>
116 | 
117 | ## Knowledge Inconsistency Detection
118 | 
119 | To detect the inconsistency between external knowledge within the instruction-tuning (alignment) data and intrinsic knowledge embedded in the foundation LLMs obtained from pretraining, we propose a four-stage approach: (i) knowledge requirement classification, (ii) reference knowledge generation, (iii) examination formulation, and (iv) examination completion.
120 | 
121 | The results of knowledge inconsistency detection are in [./data/generated_results](https://huggingface.co/datasets/Wanfq/KCA_data/tree/main/data/generation_results) and [./data/examination](https://huggingface.co/datasets/Wanfq/KCA_data/tree/main/data/examination). You could download the results and put them in the right folder. If you want to reproduce the results, please follow the following commands step by step:
122 | 
123 | ### Knowledge Requirements Classification
124 | ```
125 | cd ./data_generation
126 | export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
127 | split=train  # train / test / test_truth
128 | data_name=wizardlm_alpaca_single_turn  # wizardlm_alpaca_single_turn (train) / lima_testset_single_turn (test) / vicuna_testset_single_turn (test) / wizardlm_testset_single_turn (test) / truthfulqa_testset_single_turn (test_truth)
129 | input_dir=../data/source/${split}
130 | input_filename=${data_name}.jsonl
131 | res_dir=../data/generation_results/${split}/fact_enhance_classify
132 | res_filename=${data_name}_classify.jsonl
133 | mode=fact_enhance_classify_en
134 | batch_size=10
135 | 
136 | python3 per_instance_query.py \
137 |     --data_dir ${input_dir} \
138 |     --input ${input_filename} \
139 |     --file_extension jsonl \
140 |     --out_dir ${res_dir} \
141 |     --output ${res_filename} \
142 |     --prompt_mode ${mode} \
143 |     --request_batch_size ${batch_size}
144 | 
145 | python3 post_process.py \
146 |     --split ${split} \
147 |     --stage fact_enhance_classify
148 | ```
149 | 
150 | ### Reference Knowledge Generation
151 | ```
152 | cd ./data_generation
153 | export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
154 | split=train  # train / test / test_truth
155 | data_name=wizardlm_alpaca_single_turn  # wizardlm_alpaca_single_turn (train) / lima_testset_single_turn (test) / vicuna_testset_single_turn (test) / wizardlm_testset_single_turn (test) / truthfulqa_testset_single_turn (test_truth)
156 | input_dir=../data/generation_results/${split}/fact_enhance_classify
157 | input_filename=${data_name}_classify_parse_res_select_need.jsonl
158 | res_dir=${global_dir}/generation_results/${split}/fact_generation
159 | res_filename=${data_name}_classify_parse_res_select_need_knowledge_gen.jsonl
160 | mode=fact_generation_en
161 | batch_size=10
162 | 
163 | python3 per_instance_query.py \
164 |     --data_dir ${input_dir} \
165 |     --input ${input_filename} \
166 |     --file_extension jsonl \
167 |     --out_dir ${res_dir} \
168 |     --output ${res_filename} \
169 |     --prompt_mode ${mode} \
170 |     --request_batch_size ${batch_size}
171 | 
172 | python3 post_process.py \
173 |     --split ${split} \
174 |     --stage fact_generation
175 | ```
176 | 
177 | ### Examination Formulation
178 | ```
179 | cd ./data_generation
180 | export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
181 | split=train  # train / test / test_truth
182 | data_name=wizardlm_alpaca_single_turn  # wizardlm_alpaca_single_turn (train) / lima_testset_single_turn (test) / vicuna_testset_single_turn (test) / wizardlm_testset_single_turn (test) / truthfulqa_testset_single_turn (test_truth)
183 | input_dir=../data/generation_results/${split}/fact_generation
184 | input_filename=${data_name}_classify_parse_res_select_need_knowledge_gen_parse_res.jsonl
185 | res_dir=${global_dir}/generation_results/${split}/test_generation
186 | res_filename=${data_name}_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen.jsonl
187 | mode=fact_to_tests_en
188 | batch_size=10
189 | 
190 | python3 per_instance_query.py \
191 |     --data_dir ${input_dir} \
192 |     --input ${input_filename} \
193 |     --file_extension jsonl \
194 |     --out_dir ${res_dir} \
195 |     --output ${res_filename} \
196 |     --prompt_mode ${mode} \
197 |     --request_batch_size ${batch_size}
198 | 
199 | python3 post_process.py \
200 |     --split ${split} \
201 |     --stage test_generation
202 | ```
203 | 
204 | ### Examination Completion
205 | 
206 | ```
207 | cd ./
208 | split=train  # train / test / test_truth
209 | data_name=wizardlm_alpaca_single_turn  # wizardlm_alpaca_single_turn (train) / lima_testset_single_turn (test) / vicuna_testset_single_turn (test) / wizardlm_testset_single_turn (test) / truthfulqa_testset_single_turn (test_truth)
210 | mv ./data_generation/generation_results/${split}/test_generation/${data_name}_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize.jsonl ./data/examination/input/hallucination/${split}/${data_name}_classify_parse_res_select_need_knowledge_gen_parse_res_test_gen_normalize_test.jsonl
211 | export CUDA_VISIBLE_DEVICES=0
212 | test_dataset=hallucination
213 | eval_batch_size=1  # must set to 1
214 | shot=5
215 | model_name=llama-2-7b  # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
216 | output_dir=./data/examination/output/${test_dataset}/${split}/${model_name}/${shot}-shot
217 | data_dir=./data/examination/input/${test_dataset}/${split}
218 | 
219 | python3 ./examination/${test_dataset}/run_eval.py \
220 |     --ntrain ${SHOT} \
221 |     --data_dir ${data_dir} \
222 |     --save_dir ${output_dir} \
223 |     --model_name_or_path ${model_name} \
224 |     --tokenizer_name_or_path ${model_name} \
225 |     --eval_batch_size ${eval_batch_size} \
226 |     --use_slow_tokenizer
227 | 
228 | python3 ./examination/${test_dataset}/get_metric.py
229 | ```
230 | 
231 | ## Knowledge Inconsistency Calibration
232 | 
233 | Since knowledge inconsistency could mislead foundation LLMs during alignment and lead to hallucinations, we propose three specific strategies to manage instances in Dinc, including (i) open-book tuning, which appends the generated knowledge snippets to the instructions, (ii) discard tuning, which discards both the instructions and responses, and (iii) refusal tuning, which changes the responses to a refusal format.
234 | 
235 | The results of knowledge inconsistency calibration are in [./data/processed_results](https://huggingface.co/datasets/Wanfq/KCA_data/tree/main/data/processed_results). You could download the results and put them in the right folder. If you want to reproduce the results, please follow the following commands step by step:
236 | 
237 | ### Data Construction
238 | 
239 | First, we construct training data for these tuning methods:
240 | 
241 | ```
242 | cd ./
243 | python3 ./data_generation/inconsistency_processing.py 
244 | ```
245 | 
246 | ### Fine-Tuning
247 | 
248 | Then, we fine-tune the foundation LLMs using these tuning methods:
249 |  
250 | ```
251 | cd ./
252 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
253 | MODEL_NAME=llama-2-7b  # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
254 | DATA_NAME=wizardlm_trainset_sorry  # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discarding tuning) / wizardlm_trainset_sorry (kca refusal tuning)
255 | DATA_PATH=./data/processed_results/${MODEL_NAME}_shot-5_${DATA_NAME}.json  # ./data/processed_results/${DATA_NAME}.json (baseline) / ./data/processed_results/${MODEL_NAME}_shot-5_${DATA_NAME}.json (kca) 
256 | CONV_TEMP=vicuna
257 | OUTPUT_DIR=./training_results/${MODEL_NAME}_shot-5_${DATA_NAME}  # ./training_results/baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ./training_results/${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
258 | LOG_FILE=./training_loggings/${MODEL_NAME}_shot-5_${DATA_NAME}.log  # ./training_loggings/baseline_${MODEL_NAME}_${DATA_NAME}.log (baseline) / ./training_loggings/${MODEL_NAME}_shot-5_${DATA_NAME}.log (kca) 
259 | 
260 | torchrun --nproc_per_node=8 --master_port=20001 ./train/train.py \
261 |     --model_name_or_path ${MODEL_NAME}  \
262 |     --data_path ${DATA_PATH} \
263 |     --bf16 True \
264 |     --output_dir ${OUTPUT_DIR} \
265 |     --num_train_epochs 3 \
266 |     --per_device_train_batch_size 8 \
267 |     --per_device_eval_batch_size 8 \
268 |     --gradient_accumulation_steps 2 \
269 |     --evaluation_strategy "no" \
270 |     --save_strategy "steps" \
271 |     --save_steps 500 \
272 |     --save_total_limit 1 \
273 |     --learning_rate 2e-5 \
274 |     --weight_decay 0. \
275 |     --warmup_ratio 0.03 \
276 |     --lr_scheduler_type "cosine" \
277 |     --logging_steps 1 \
278 |     --fsdp "full_shard auto_wrap" \
279 |     --fsdp_transformer_layer_cls_to_wrap "LlamaDecoderLayer" \
280 |     --tf32 True \
281 |     --model_max_length 2048 \
282 |     --gradient_checkpointing True \
283 |     --conv_temp ${CONV_TEMP} \
284 |     --lazy_preprocess True \
285 |     --flash_attn_transformers True 2>&1 | tee ${LOG_FILE}
286 | ```
287 | 
288 | ## Evaluation
289 | 
290 | We evaluate both the hallucination rate and helpfulness score of the fine-tuned LLMs. For hallucination evaluation, we conduct both LLM-based judgment and metric-based judgment. For helpfulness evaluation, we conduct LLM-based judgment.
291 | 
292 | ### Hallucination Evaluation
293 | 
294 | Below are the scripts for hallucination evaluation.
295 | 
296 | ```
297 | # ========== LLM-Based Judgment (LIMAEval, VicunaEval, WizardLMEval, TruthfulQA) ==========
298 | # Generate model answers
299 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
300 | NUM_GPUS=8
301 | MODEL_NAME=llama-2-7b  # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
302 | DATA_NAME=wizardlm_trainset_sorry  # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discarding tuning) / wizardlm_trainset_sorry (kca refusal tuning)
303 | MODEL_ID=${MODEL_NAME}_shot-5_${DATA_NAME}  # baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
304 | MODEL_PATH=./training_results/${MODEL_ID}
305 | QUESTION_NAME=lima_testset # lima_testset / vicuna_testset / wizardlm_testset / truthfulqa_test_truthset
306 | QUESTION_FILE=./data/processed_results/${MODEL_NAME}_shot-5_${QUESTION_NAME}_sorry.json  # do not use _openbook or _drop
307 | ANSWER_FILE=./evaluation_results/answer_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_greedy.jsonl
308 | 
309 | python3 ./eval/gpt_judge/gen_answer.py \
310 |     --model-path ${MODEL_PATH} \
311 |     --model-id ${MODEL_ID} \
312 |     --conv-temp vicuna \
313 |     --question-file ${QUESTION_FILE} \
314 |     --answer-file ${ANSWER_FILE} \
315 |     --num-gpus ${NUM_GPUS}
316 | 
317 | # GPT-4 judgment
318 | export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
319 | MODEL_NAME=llama-2-7b  # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
320 | DATA_NAME=wizardlm_trainset_sorry  # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discarding tuning) / wizardlm_trainset_sorry (kca refusal tuning)
321 | MODEL_ID=${MODEL_NAME}_shot-5_${DATA_NAME}  # baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
322 | QUESTION_NAME=lima_testset # lima_testset / vicuna_testset / wizardlm_testset / truthfulqa_test_truthset
323 | JUDGE_TYPE=hallucination_judge
324 | ANSWER_FILE=./evaluation_results/answer_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_greedy.jsonl
325 | TESTSET_FILE=./data/processed_results/${MODEL_NAME}_shot-5_${QUESTION_NAME}_sorry.json  # do not use _openbook or _drop
326 | REVIEW_FILE=./evaluation_results/review_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_${JUDGE_TYPE}_greedy.jsonl
327 | PROMPT_FILE=./eval/gpt_judge/gpt_judge_prompt.jsonl
328 | BATCH_SIZE=3
329 | 
330 | python3 ./eval/gpt_judge/gpt_judge.py \
331 |     --answer_file ${ANSWER_FILE} \
332 |     --testset_file ${TESTSET_FILE} \
333 |     --review_file ${REVIEW_FILE} \
334 |     --prompt_file ${PROMPT_FILE} \
335 |     --prompt_type ${JUDGE_TYPE} \
336 |     --review_model gpt-4 \
337 |     --batch_size ${BATCH_SIZE} \
338 |     --use_demo \
339 |     --no_sorry  # only when "DATA_NAME=wizardlm_trainset_sorry"
340 | 
341 | python3 ./eval/gpt_judge/show_results.py
342 | ```
343 | 
344 | ```
345 | # ======================= Metric-Based Judgment (MS-MARCO, ACI-Bench) ======================
346 | export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
347 | NUM_GPUS=8
348 | MODEL_NAME=llama-2-7b  # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
349 | DATA_NAME=wizardlm_trainset_sorry  # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discarding tuning) / wizardlm_trainset_sorry (kca refusal tuning)
350 | MODEL_ID=${MODEL_NAME}_shot-5_${DATA_NAME}  # baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
351 | MODEL_PATH=./training_results/${MODEL_ID}
352 | QUESTION_NAME=msmacro  #  msmacro / acibench
353 | QUESTION_FILE=./data/metric_based_evaluation/${QUESTION_NAME}_testset.jsonl
354 | ANSWER_FILE=./evaluation_results/answer_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_greedy.jsonl
355 | 
356 | python3 ./eval/gpt_judge/gen_summary.py \
357 |     --model-path ${MODEL_PATH} \
358 |     --model-id ${MODEL_ID} \
359 |     --conv-temp vicuna \
360 |     --question-file ${QUESTION_FILE} \
361 |     --answer-file ${ANSWER_FILE} \
362 |     --num-gpus ${NUM_GPUS} \
363 |     --no-sorry  # only when "DATA_NAME=wizardlm_trainset_sorry"
364 | ```
365 | 
366 | ### Helpfulness Evaluation
367 | 
368 | Below are the scripts for helpfulness evaluation.
369 | 
370 | ```
371 | # ========== LLM-Based Judgment (LIMAEval, VicunaEval, WizardLMEval, TruthfulQA) ==========
372 | # GPT-4 judgment
373 | export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
374 | MODEL_NAME=llama-2-7b  # pythia-6.9b / llama-2-7b / mistral-7b-v0.1 / llama-2-13b
375 | DATA_NAME=wizardlm_trainset_sorry  # wizardlm_alpaca_train (baseline) / wizardlm_trainset_openbook (kca open-book tuning) / wizardlm_trainset_drop (kca discarding tuning) / wizardlm_trainset_sorry (kca refusal tuning)
376 | MODEL_ID=${MODEL_NAME}_shot-5_${DATA_NAME}  # baseline_${MODEL_NAME}_${DATA_NAME} (baseline) / ${MODEL_NAME}_shot-5_${DATA_NAME} (kca)
377 | QUESTION_NAME=lima_testset # lima_testset / vicuna_testset / wizardlm_testset / truthfulqa_test_truthset
378 | JUDGE_TYPE=effectiveness_judge
379 | ANSWER_FILE=./evaluation_results/answer_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_greedy.jsonl
380 | TESTSET_FILE=./data/processed_results/${MODEL_NAME}_shot-5_${QUESTION_NAME}_sorry.json  # do not use _openbook or _drop
381 | REVIEW_FILE=./evaluation_results/review_greedy/data-${MODEL_NAME}_shot-5_${QUESTION_NAME}_model-${MODEL_ID}_${JUDGE_TYPE}_greedy.jsonl
382 | PROMPT_FILE=./eval/gpt_judge/gpt_judge_prompt.jsonl
383 | BATCH_SIZE=3
384 | 
385 | python3 ./eval/gpt_judge/gpt_judge.py \
386 |     --answer_file ${ANSWER_FILE} \
387 |     --testset_file ${TESTSET_FILE} \
388 |     --review_file ${REVIEW_FILE} \
389 |     --prompt_file ${PROMPT_FILE} \
390 |     --prompt_type ${JUDGE_TYPE} \
391 |     --review_model gpt-4 \
392 |     --batch_size ${BATCH_SIZE} \
393 |     --use_demo
394 | 
395 | python3 ./eval/gpt_judge/show_results.py
396 | ```
397 | 
398 | ## Citation
399 | 
400 | If you find this work is relevant to your research or applications, please feel free to cite our work!
401 | ```
402 | @article{wan2024knowledge,
403 |   title={Knowledge Verification to Nip Hallucination in the Bud},
404 |   author={Wan, Fanqi and Huang, Xinting and Cui, Leyang and Quan, Xiaojun and Bi, Wei and Shi, Shuming},
405 |   journal={arXiv preprint arXiv:2401.10768},
406 |   year={2024}
407 | }
408 | ```


--------------------------------------------------------------------------------