├── openai_account_files ├── used.txt └── accounts.txt ├── data └── download_data_and_put_them_here.txt ├── requirements.txt ├── download_data.py ├── llm_retrieval_prompt_drafts ├── update-using-missing-info-for-new-question.md ├── update-using-missing-info-for-new-passage.md ├── select_no_up_to.md └── filter_question_with_demo.md ├── searcher.py ├── README.md ├── multi_thread_openai_api_call.py ├── multi_process └── bm25_multi_process.py ├── llm.py ├── commands ├── eli5_iterative_retrieval.sh ├── qampari_iterative_retrieval.sh └── asqa_iterative_retrieval.sh ├── utils.py ├── prompts ├── qampari_default.json ├── asqa_demo.json ├── asqa_default.json └── eli5_default.json ├── openai_account_manager.py ├── run.py ├── eval.py └── llm_retrieval_related └── iterative_select_supporting_documents.py /openai_account_files/used.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data/download_data_and_put_them_here.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /openai_account_files/accounts.txt: -------------------------------------------------------------------------------- 1 | EMAIL----PASSWORD----OPENAI_KEY -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | accelerate==0.22.0 2 | faiss-cpu==1.7.4 3 | FlagEmbedding==1.1.0 4 | openai==0.27.4 5 | sentence-transformers==2.2.2 6 | sentencepiece==0.1.99 7 | tiktoken==0.5.1 8 | tokenizers==0.13.3 9 | torch==2.0.1 10 | torchvision==0.15.2 11 | tqdm==4.45.0 12 | transformers==4.33.2 13 | pyserini==0.22.0 -------------------------------------------------------------------------------- /download_data.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import json 3 | 4 | names = ['asqa_questions', 'qampari_questions', 'eli5_questions'] 5 | for name in names: 6 | ds = load_dataset("BeastyZ/Llatrieval", name) 7 | data = [] 8 | for d in ds['train']: 9 | data.append(dict(d)) 10 | if name == 'asqa_questions': 11 | save_path = './data/asqa_gtr_top100.json' 12 | elif name == 'qampari_questions': 13 | save_path = './data/qampari_gtr_top100.json' 14 | else: 15 | save_path = './data/eli5_bm25_top100.json' 16 | with open(save_path, 'w') as f: 17 | json.dump(data, f, indent=4, ensure_ascii=False) 18 | -------------------------------------------------------------------------------- /llm_retrieval_prompt_drafts/update-using-missing-info-for-new-question.md: -------------------------------------------------------------------------------- 1 | You are a helpful assistant as introduced below. 2 | 3 | ## Profile 4 | - Language: English 5 | - Description: You are a helpful assistant, capable of identifying missing content that answers the given question(s) but does not exist in the given possible answering passages, and then using your own knowledge to generate a new question based on the missing content you identify. 6 | 7 | ### Input 8 | - Question: The specific question(s). 9 | - Answering Passages: Possible answering passages. 10 | 11 | ### Output 12 | - A new question generated using missing content you identify based on your own knowledge. 13 | 14 | ## Rules 15 | 1. 
In any case, you must use your own knowledge to generate a new question using missing content you identify. 16 | 2. Only generate the required new question. Do not output anything else. 17 | 3. Do not output the given question(s) and possible answering passages. 18 | 4. Do not output your analysis statement. 19 | 20 | ## Workflow 21 | 1. Read and understand the question(s) and possible answering passages posed by the user. 22 | 2. Identify missing content that answers the given question(s) but does not exist in the given possible answering passages. 23 | 3. Use your own knowledge to generate a new question using missing content you identify. 24 | 25 | ## Reminder 26 | You will always remind yourself of the role settings. -------------------------------------------------------------------------------- /llm_retrieval_prompt_drafts/update-using-missing-info-for-new-passage.md: -------------------------------------------------------------------------------- 1 | You are a helpful assistant as introduced below. 2 | 3 | ## Profile 4 | - Language: English 5 | - Description: You are a helpful assistant, capable of identifying missing content that answers the given question(s) but does not exist in the given possible answering passages, and then using your own knowledge to generate correct answering passages using missing content you identify. 6 | 7 | ### Input 8 | - Question: The specific question(s). 9 | - Answering Passages: Possible answering passages. 10 | 11 | ### Output 12 | - Correct answering passages generated using missing content you identify based on your own knowledge. 13 | 14 | ## Rules 15 | 1. In any case, you must use your own knowledge to generate correct answering passages using missing content you identify. 16 | 2. Only generate the required correct answering passages. Do not output anything else. 17 | 3. Directly use your own knowledge to generate correct answering passages if you think the given possible answering passages do not answer the given question(s). 18 | 4. Do not output the given question(s) and possible answering passages. 19 | 5. Do not output your analysis statement. 20 | 21 | ## Workflow 22 | 1. Read and understand the question(s) and possible answering passages posed by the user. 23 | 2. Identify missing content that answers the given question(s) but does not exist in the given possible answering passages. 24 | 3. Directly use your own knowledge to generate correct answering passages if you think the given possible answering passages do not answer the given question(s). Otherwise, use your own knowledge to generate correct answering passages using missing content you identify. 25 | 26 | ## Reminder 27 | You will always remind yourself of the role settings. -------------------------------------------------------------------------------- /searcher.py: -------------------------------------------------------------------------------- 1 | from sklearn.feature_extraction.text import TfidfVectorizer 2 | from sklearn.metrics.pairwise import cosine_similarity 3 | import numpy as np 4 | import torch 5 | 6 | def doc_to_text_tfidf(doc): 7 | return doc['title'] + ' ' + doc['text'] 8 | 9 | def doc_to_text_dense(doc): 10 | return doc['title'] + '. 
' + doc['text'] 11 | 12 | 13 | class SearcherWithinDocs: 14 | 15 | def __init__(self, docs, retriever, model=None, device="cuda"): 16 | self.retriever = retriever 17 | self.docs = docs 18 | self.device = device 19 | if retriever == "tfidf": 20 | self.tfidf = TfidfVectorizer() 21 | self.tfidf_docs = self.tfidf.fit_transform([doc_to_text_tfidf(doc) for doc in docs]) 22 | elif "gtr" in retriever: 23 | self.model = model 24 | self.embeddings = self.model.encode([doc_to_text_dense(doc) for doc in docs], device=self.device, convert_to_numpy=False, convert_to_tensor=True, normalize_embeddings=True) 25 | else: 26 | raise NotImplementedError 27 | 28 | def search(self, query): 29 | # Return the top-1 result doc id 30 | 31 | if self.retriever == "tfidf": 32 | tfidf_query = self.tfidf.transform([query])[0] 33 | similarities = [cosine_similarity(tfidf_doc, tfidf_query) for tfidf_doc in self.tfidf_docs] 34 | best_doc_id = np.argmax(similarities) 35 | return best_doc_id 36 | elif "gtr" in self.retriever: 37 | q_embed = self.model.encode([query], device=self.device, convert_to_numpy=False, convert_to_tensor=True, normalize_embeddings=True) 38 | score = torch.matmul(self.embeddings, q_embed.t()).squeeze(1).detach().cpu().numpy() 39 | best_doc_id = np.argmax(score) 40 | return best_doc_id 41 | else: 42 | raise NotImplementedError 43 | -------------------------------------------------------------------------------- /llm_retrieval_prompt_drafts/select_no_up_to.md: -------------------------------------------------------------------------------- 1 | You are DocSelectorGPT as introduced below. 2 | 3 | # Role: DocSelectorGPT 4 | 5 | ## Profile 6 | - Language: English 7 | - Description: You are DocSelectorGPT, capable of selecting a specified number (k) of documents for answering the user's specific question(s). k is a value specified by the user. 8 | 9 | ### Input 10 | - Question: The specific question(s). 11 | - Candidate Documents: Documents containing supporting documents which can support answering the given question(s). Candidate documents will have their own identifiers for DocSelectorGPT to reference. 12 | 13 | ### Skill 14 | 1. Analyzing the given question(s) and understanding the required information. 15 | 2. Searching through candidate documents to select k supporting documents whose combination can maximally support giving a direct, accurate, clear and engaging answer to the question and make the answer closely related to the core of the question. 16 | 17 | ### Output 18 | - Selected Documents: The identifiers of selected supporting documents whose combination can maximally support giving an accurate and engaging answer to the question and make the answer closely related to the core of the question. 19 | 20 | ### Output Format 21 | 22 | Selected Documents: [document identifiers] 23 | 24 | ### Output Example 25 | If the selected documents are 2, 6 and 8, the output should be as follows: 26 | 27 | Selected Documents: 2 6 8 28 | 29 | ## Rules 30 | 1. Don't break character. 31 | 2. When outputting the selected documents, only provide their identifiers. 32 | 3. Strictly follow the specified output format. Do not answer the given question. Just conduct the specified retrieval task. 33 | 34 | ## Selection Criteria (Very Important) 35 | 1. The order and identifier of documents are not related to their priority. 36 | 2. 
Since your goal is to select a combination of supporting documents which can maximally support giving a direct, accurate, clear and engaging answer, you need to avoid redundant selection of documents containing the same or similar relevant content. 37 | 38 | ## Workflow 39 | 1. Read and understand the questions posed by the user. 40 | 2. Browse through candidate documents to select k documents whose combination can maximally support giving a direct, accurate, clear and engaging answer to the question(s) and make the answer closely related to the core of the question(s). 41 | 3. List all selected documents. 42 | 43 | ## Reminder 44 | You will always remind yourself of the role settings. -------------------------------------------------------------------------------- /llm_retrieval_prompt_drafts/filter_question_with_demo.md: -------------------------------------------------------------------------------- 1 | You are JudgeGPT as introduced below. 2 | 3 | # Role: JudgeGPT 4 | 5 | ## Profile 6 | - Language: English 7 | - Description: You are JudgeGPT, capable of judging whether a specified number (k) of documents can maximally support giving a direct, accurate, clear and engaging answer, similar to the answer of the demonstration, closely related to the core of the user's specific question(s). 8 | 9 | ### Demonstration 10 | {Demo} 11 | 12 | ### Input 13 | - Question: The specific question(s). 14 | - Candidate Documents: Documents whose combination may maximally support giving a direct, accurate, clear and engaging answer, similar to the answer of the demonstration, closely related to the core of the corresponding question(s). 15 | 16 | ### Skill 17 | 1. Analyzing the given question(s) and understanding the required information. 18 | 2. Searching through documents to judge whether they can maximally support giving a direct, accurate, clear and engaging answer, similar to the answer of the demonstration, closely related to the core of the corresponding question(s). 19 | 20 | ### Output 21 | - Judgment: "[YES]" if provided documents can maximally support giving a direct, accurate, clear, and engaging answer, similar to the answer of the demonstration, closely related to the core of the corresponding question(s), otherwise "[NO]". 22 | 23 | ### Output Format 24 | Judgment: [YES] or [NO] 25 | 26 | ### Output Example 27 | If provided documents can maximally support giving a direct, accurate, clear, and engaging answer, similar to the answer of the demonstration, closely related to the core of the corresponding question(s), the output should be as follows: 28 | [YES] 29 | 30 | ## Rules 31 | 1. Don't break character. 32 | 2. When outputting the final verdict, only provide "[YES]" or "[NO]". 33 | 3. Only output the final verdict for the given question(s) and documents; do not evaluate the demonstration. 34 | 4. Strictly follow the specified output format. Do not answer the given question. Just conduct the specified judgment task. 35 | 36 | ## Judgment Criteria (Very Important) 37 | 1. Do not allow the length of the documents to influence your evaluation. 38 | 2. Be as objective as possible. 39 | 3. Output "[YES]" if provided documents can maximally support giving a direct, accurate, clear, and engaging answer, similar to the answer of the demonstration, closely related to the core of the corresponding question(s), otherwise "[NO]". 40 | 41 | ## Workflow 42 | 1. Read and understand the questions posed by the user. 43 | 2. 
Browse through documents to judge whether they can support giving a direct, accurate, clear, and engaging answer, similar to the answer of the demonstration, closely related to the core of the corresponding question(s). 44 | 3. Output your final verdict. 45 | 46 | ## Reminder 47 | You will always remind yourself of the role settings. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LLatrieval: LLM-Verified Retrieval for Verifiable Generation 2 | This repository contains the code and data for the paper [LLatrieval: LLM-Verified Retrieval for Verifiable Generation](https://arxiv.org/abs/2311.07838). It also includes code to reproduce the method we propose in the paper. 3 | 4 | ## :new:News 5 | - **[2024/03/13]** Our submission to NAACL 2024, [LLatrieval: LLM-Verified Retrieval for Verifiable Generation](https://aclanthology.org/2024.naacl-long.305/), has been accepted to the main conference. 6 | - **[2023/11/14]** We have published the preprint version of the paper on [arXiv](https://arxiv.org/abs/2311.07838). 7 | - **[2023/11/09]** We have released the code for reproducing our method. 8 | 9 | 10 | ## Quick Links 11 | - [Requirements](#requirements) 12 | - [Data](#data) 13 | - [Code Structure](#code-structure) 14 | - [Reproduce Our Method](#reproduce-our-method) 15 | - [Citation](#citation) 16 | 17 | 18 | ## Requirements 19 | 1. We recommend creating a Python virtual environment before installing the dependencies. 20 | ``` 21 | conda create -n lvr python=3.9.7 22 | ``` 23 | 2. Next, activate the virtual environment you just created. 24 | ``` 25 | conda activate lvr 26 | ``` 27 | 3. Finally, install the required packages before running the code. 28 | ``` 29 | pip install -r requirements.txt 30 | ``` 31 | 32 | ## Data 33 | We uploaded the data to [Hugging Face](https://huggingface.co/datasets/BeastyZ/Llatrieval)🤗. 34 | 35 | **Start by installing 🤗 Datasets:** 36 | ```bash 37 | pip install datasets 38 | ``` 39 | 40 | **Load a dataset** 41 | 42 | This command will download the raw data to the `data/` folder. 43 | ```bash 44 | python download_data.py 45 | ``` 46 | 47 | **Download corpus** 48 | 49 | Use the following command to download the `BM25_SPHERE_CORPUS`. 50 | ```bash 51 | wget -P faiss_index https://dl.fbaipublicfiles.com/sphere/sphere_sparse_index.tar.gz 52 | tar -xzvf faiss_index/sphere_sparse_index.tar.gz -C faiss_index 53 | ``` 54 | 55 | Use the following command to download the `WIKI_TSV_CORPUS`. 56 | ```bash 57 | wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz 58 | gzip -d psgs_w100.tsv.gz 59 | ``` 60 | 61 | For more information about the Sphere and Wikipedia snapshot corpora, please refer to [ALCE](https://github.com/princeton-nlp/ALCE). 62 | 63 | 64 | ## Code Structure 65 | * `commands/`: folder that contains all shell files. 66 | * `data/`: folder that contains all datasets. 67 | * `llm_retrieval_prompt_drafts/`: folder that contains all prompt files. 68 | * `llm_retrieval_related/`: folder that contains code for iteratively selecting supporting documents. 69 | * `multi_process/`: folder that contains code for BM25 retrieval with multi-process support. 70 | * `openai_account_files/`: folder that contains all OpenAI account files. 71 | * `prompts/`: folder that contains all instruction and demonstration files. 72 | * `eval.py`: script to evaluate generations. 
73 | * `Iterative_retrieval.py`: code to reproduce our method. 74 | * `llm.py`: code for calling the LLM. 75 | * `multi_thread_openai_api_call.py`: code for calling gpt-3.5-turbo with multiple threads. 76 | * `searcher.py`: code for retrieval using TfidfVectorizer. 77 | * `run.py`: script to generate answers with citations. 78 | * `utils.py`: file that contains auxiliary functions. 79 | 80 | 81 | ## Reproduce Our Method 82 | **NOTE:** The raw data and a retrieval corpus must be available before running the following commands. Once you have them, you also need to modify the parameters of the corresponding files in the `commands` directory. 83 | 84 | For ASQA, use the following command 85 | ```bash 86 | bash commands/asqa_iterative_retrieval.sh 87 | ``` 88 | 89 | For QAMPARI, use the following command 90 | ```bash 91 | bash commands/qampari_iterative_retrieval.sh 92 | ``` 93 | 94 | For ELI5, use the following command 95 | ```bash 96 | bash commands/eli5_iterative_retrieval.sh 97 | ``` 98 | 99 | The results will be saved in `iter_retrieval_50/`. 100 | 101 | 102 | ## Citation 103 | ``` 104 | @inproceedings{li-etal-2024-llatrieval, 105 | title = "{LL}atrieval: {LLM}-Verified Retrieval for Verifiable Generation", 106 | author = "Li, Xiaonan and 107 | Zhu, Changtai and 108 | Li, Linyang and 109 | Yin, Zhangyue and 110 | Sun, Tianxiang and 111 | Qiu, Xipeng", 112 | editor = "Duh, Kevin and 113 | Gomez, Helena and 114 | Bethard, Steven", 115 | booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)", 116 | month = jun, 117 | year = "2024", 118 | address = "Mexico City, Mexico", 119 | publisher = "Association for Computational Linguistics", 120 | url = "https://aclanthology.org/2024.naacl-long.305", 121 | pages = "5453--5471", 122 | } 123 | ``` 124 | -------------------------------------------------------------------------------- /multi_thread_openai_api_call.py: -------------------------------------------------------------------------------- 1 | import threading 2 | import openai 3 | import logging 4 | 5 | logger = logging.getLogger(__name__) 6 | import time 7 | 8 | 9 | class MyThread(threading.Thread): 10 | def __init__(self, thread_id, llm, account_manager, inp_manager, print_error, pbar, turbo_system_message, 11 | print_finish=True): 12 | threading.Thread.__init__(self) 13 | self.thread_id = thread_id 14 | self.openai_account_manager_multi_thread = account_manager 15 | self.openai_inp_manager = inp_manager 16 | self.account = self.openai_account_manager_multi_thread.get_next_account(self.thread_id) 17 | self.print_error = print_error 18 | self.pbar = pbar 19 | self.print_finish = print_finish 20 | self.turbo_system_message = turbo_system_message 21 | self.llm = llm 22 | 23 | def run(self): 24 | 25 | def repeat_until_success_call_openai_api(func): 26 | def wrapper(*args, **kw): 27 | while 1: 28 | result = None 29 | try: 30 | result = func(*args, **kw) 31 | except openai.error.APIConnectionError as e: 32 | if self.print_error: 33 | logger.info('openai connection error, so retry after sleep 5 seconds') 34 | logger.info(e) 35 | time.sleep(5) 36 | except openai.error.RateLimitError as e: 37 | logger.info(type(e)) 38 | if 'quota' in e._message: 39 | if self.print_error: 40 | logger.info('now openai account {} runs out. 
so use next.'.format(self.account[-1])) 41 | logger.info(type(e)) 42 | logger.info(e) 43 | self.account = self.openai_account_manager_multi_thread.get_next_account(self.thread_id, 44 | self.account) 45 | else: 46 | logger.info("Meeting RateLimitError, sleep for 45 seconds.") 47 | time.sleep(45) 48 | except openai.error.AuthenticationError as e: 49 | if 'This key is associated with a deactivated account' in e._message: 50 | logger.info('the account {} is deactivated. so use next'.format(self.account[-1])) 51 | if self.print_error: 52 | logger.info(e) 53 | self.account = self.openai_account_manager_multi_thread.get_next_account(self.thread_id, 54 | self.account) 55 | else: 56 | logger.info('meet unexpected AuthenticationError, so retry after sleep 5 seconds') 57 | if self.print_error: 58 | logger.info(e) 59 | self.account = self.openai_account_manager_multi_thread.get_next_account(self.thread_id, 60 | self.account) 61 | except Exception as e: 62 | logger.info('meet unexpected error, so retry after sleep 5 seconds') 63 | logger.info(e) 64 | logger.info(type(e)) 65 | time.sleep(5) 66 | 67 | if result != None: 68 | return result 69 | else: 70 | pass 71 | 72 | return wrapper 73 | 74 | # pbar = tqdm.tqdm(total=len(self.idx_x_list_to_decode)) 75 | responses_with_idx = [] 76 | self.responses_with_idx = responses_with_idx 77 | while True: 78 | tmp = self.openai_inp_manager.get_next_gpt_idx_inp() 79 | if tmp == None: 80 | if self.print_finish: 81 | logger.info('thread {} finish'.format(self.thread_id)) 82 | return 83 | else: 84 | idx_inp = tmp['inp'] 85 | idx, inp = idx_inp 86 | hyper_parameter = tmp['hyper_parameter'] 87 | 88 | @repeat_until_success_call_openai_api 89 | def tmp_api_call(): 90 | 91 | result = self.llm.generate(inp, hyper_parameter['max_tokens'], api_key=self.account[-1], 92 | turbo_system_message=self.turbo_system_message) 93 | return result 94 | 95 | response = tmp_api_call() 96 | if self.pbar is not None: 97 | self.pbar.update(1) 98 | responses_with_idx.append([idx, response]) 99 | -------------------------------------------------------------------------------- /multi_process/bm25_multi_process.py: -------------------------------------------------------------------------------- 1 | from typing import List, Dict 2 | import torch.multiprocessing as mp 3 | import queue 4 | import math 5 | import json 6 | import time 7 | import logging 8 | from pyserini.search import LuceneSearcher 9 | 10 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 11 | logger = logging.getLogger(__name__) 12 | logger.setLevel(logging.INFO) 13 | 14 | class BM25MultiProcess(): 15 | def __init__(self, corpus_path: str=None, top_k: int=100): 16 | """ 17 | Init class. 18 | """ 19 | self.top_k = top_k 20 | self.corpus_path = corpus_path 21 | 22 | 23 | def start_multi_process_pool(self, process_num: int=8) -> Dict: 24 | """ 25 | :param process_num: Number of process to use. 26 | :return: Returns a dict with the target processes, an input queue and and output queue. 
27 | """ 28 | target_devices = ['cpu'] * process_num 29 | logger.info("Start multi-process pool on devices: {}".format(', '.join(map(str, target_devices)))) 30 | 31 | ctx = mp.get_context('spawn') 32 | input_queue = ctx.Queue() 33 | output_queue = ctx.Queue() 34 | processes = [] 35 | 36 | for _ in target_devices: 37 | p = ctx.Process(target=BM25MultiProcess._multi_process_worker, args=(self, input_queue, output_queue), daemon=True) 38 | p.start() 39 | processes.append(p) 40 | 41 | return {'input': input_queue, 'output': output_queue, 'processes': processes} 42 | 43 | 44 | @staticmethod 45 | def _multi_process_worker(model, input_queue, results_queue) -> None: 46 | """ 47 | Internal working process to retrieve documnents in multi-process setup. 48 | """ 49 | searcher = LuceneSearcher(model.corpus_path) 50 | while True: 51 | try: 52 | id, queries = input_queue.get() 53 | docs_list = model.retrieve(queries, searcher) 54 | results_queue.put([id, docs_list]) 55 | except queue.Empty: 56 | break 57 | 58 | 59 | def retrieve(self, queries: List[str], searcher: LuceneSearcher) -> List[List]: 60 | """ 61 | Do retrieval using bm25. 62 | """ 63 | docs_list = [] 64 | for query in queries: 65 | start_time = time.time() 66 | try: 67 | hits = searcher.search(query, self.top_k) 68 | except Exception as e: 69 | if "maxClauseCount" in str(e): 70 | query = " ".join(query.split())[:950] 71 | hits = searcher.search(query, self.top_k) 72 | else: 73 | raise e 74 | 75 | # For bm25 sphere 76 | docs = [] 77 | for hit in hits: 78 | h = json.loads(str(hit.docid).strip()) 79 | docs.append({ 80 | "title": h["title"], 81 | "text": hit.raw, 82 | "url": h["url"], 83 | 'score': hit.score, 84 | 'id':hit.docid 85 | }) 86 | 87 | docs_list.append(docs) 88 | end_time = time.time() 89 | logger.warning(f"It took {end_time - start_time} seconds.") 90 | return docs_list 91 | 92 | 93 | def retrieve_multi_process(self, queries: List[str], pool: Dict[str, object], chunk_size: int=None) -> List[List]: 94 | """ 95 | :param queries: List of queries 96 | :param pool: A pool of workers started with BM25MultiProcess.start_multi_process_pool 97 | :param chunk_size: Queries are chunked and sent to the individual processes. If none, it determine a sensible size. 98 | :return: Retrieved documents. 
99 | """ 100 | if chunk_size is None: 101 | chunk_size = min(math.ceil(len(queries) / len(pool["processes"]) / 10), 5000) 102 | 103 | logger.info(f"Chunk queries into {math.ceil(len(queries) / chunk_size)} packages of size {chunk_size}") 104 | 105 | input_queue = pool['input'] 106 | last_chunk_id = 0 107 | chunk = [] 108 | 109 | for query in queries: 110 | chunk.append(query) 111 | if len(chunk) >= chunk_size: 112 | input_queue.put([last_chunk_id, chunk]) 113 | last_chunk_id += 1 114 | chunk = [] 115 | 116 | if len(chunk) > 0: 117 | input_queue.put([last_chunk_id, chunk]) 118 | last_chunk_id += 1 119 | 120 | output_queue = pool['output'] 121 | results_list = sorted([output_queue.get() for _ in range(last_chunk_id)], key=lambda x: x[0]) 122 | docs_list = [] 123 | for result in results_list: 124 | docs_list += result[1] 125 | return docs_list 126 | 127 | 128 | @staticmethod 129 | def stop_multi_process_pool(pool: Dict): 130 | """ 131 | Stops all processes started with start_multi_process_pool 132 | """ 133 | for p in pool['processes']: 134 | p.terminate() 135 | 136 | for p in pool['processes']: 137 | p.join() 138 | p.close() 139 | 140 | pool['input'].close() 141 | pool['output'].close() 142 | -------------------------------------------------------------------------------- /llm.py: -------------------------------------------------------------------------------- 1 | import os 2 | from transformers import AutoTokenizer 3 | from utils import * 4 | import openai 5 | class LLM: 6 | 7 | def __init__(self, args): 8 | self.args = args 9 | 10 | if args.openai_api: 11 | # logger.info('into if args.openai_api:') 12 | import openai 13 | OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY") 14 | OPENAI_ORG_ID = os.environ.get("OPENAI_ORG_ID") 15 | OPENAI_API_BASE = os.environ.get("OPENAI_API_BASE") 16 | assert OPENAI_API_KEY == None, "api_key={}".format(OPENAI_API_KEY) 17 | 18 | if args.azure: 19 | 20 | openai.api_key = OPENAI_API_KEY 21 | openai.api_base = OPENAI_API_BASE 22 | openai.api_type = 'azure' 23 | openai.api_version = '2022-12-01' 24 | else: 25 | # logger.info('into not args.azure') 26 | # logger.info('OPENAI_API_KEY:{}'.format(OPENAI_API_KEY)) 27 | openai.api_key = OPENAI_API_KEY 28 | openai.organization = OPENAI_ORG_ID 29 | 30 | self.tokenizer = AutoTokenizer.from_pretrained("gpt2", 31 | fast_tokenizer=False) # TODO: For ChatGPT we should use a different one 32 | self.total_tokens = 0 # To keep track of how much the API costs 33 | else: 34 | self.model, self.tokenizer = load_model(args.model) 35 | 36 | self.prompt_exceed_max_length = 0 37 | self.fewer_than_50 = 0 38 | self.azure_filter_fail = 0 39 | 40 | def generate(self, prompt, max_tokens, api_key, stop=None, turbo_system_message=None): 41 | args = self.args 42 | if max_tokens == 0: 43 | self.prompt_exceed_max_length += 1 44 | logger.warning( 45 | "Prompt exceeds max length and return an empty string as answer. If this happens too many times, it is suggested to make the prompt shorter") 46 | return "" 47 | if max_tokens < 50: 48 | self.fewer_than_50 += 1 49 | logger.warning( 50 | "The model can at most generate < 50 tokens. 
If this happens too many times, it is suggested to make the prompt shorter") 51 | 52 | if args.openai_api: 53 | if "turbo" in args.model and not args.azure: 54 | assert turbo_system_message != None 55 | # For OpenAI's ChatGPT API, we need to convert text prompt to chat prompt 56 | prompt = [ 57 | # {'role': 'system', 'content': "You are a helpful assistant that answers the following questions with proper citations."}, 58 | {'role': 'system', 'content': turbo_system_message}, 59 | {'role': 'user', 'content': prompt} 60 | ] 61 | else: 62 | if "turbo" in args.model: 63 | deploy_name = "gpt-35-turbo-0301" 64 | else: 65 | deploy_name = args.model 66 | 67 | def repeat_until_success_call_openai_api_only_for_retry(func): 68 | def wrapper(*args, **kw): 69 | while 1: 70 | result = None 71 | try: 72 | result = func(*args, **kw) 73 | except openai.error.APIConnectionError as e: 74 | logger.warning('openai connection error, so retry after sleep 1 seconds') 75 | logger.warning(e) 76 | time.sleep(1) 77 | except openai.error.RateLimitError as e: 78 | logger.warning(type(e)) 79 | if 'quota' in e._message: 80 | raise e 81 | else: 82 | time.sleep(60) 83 | except openai.error.AuthenticationError as e: 84 | raise e 85 | except Exception as e: 86 | logger.warning('meet unexpected error, so retry after sleep 3 seconds') 87 | logger.warning(e) 88 | logger.warning(type(e)) 89 | time.sleep(3) 90 | 91 | if result != None: 92 | return result 93 | else: 94 | pass 95 | return wrapper 96 | 97 | if "turbo" in args.model and not args.azure: 98 | @repeat_until_success_call_openai_api_only_for_retry 99 | def tmp_openai_call_func(): 100 | response = openai.ChatCompletion.create( 101 | model=args.model, 102 | messages=prompt, 103 | temperature=args.temperature, 104 | max_tokens=max_tokens, 105 | stop=stop, 106 | top_p=args.top_p, 107 | api_key=api_key, 108 | n=self.args.num_samples, 109 | ) 110 | return response 111 | response = tmp_openai_call_func() 112 | self.total_tokens += response['usage']['total_tokens'] 113 | result = list(map(lambda x:x['message']['content'],response['choices'])) 114 | return result 115 | else: 116 | 117 | @repeat_until_success_call_openai_api_only_for_retry 118 | def tmp_openai_call_func(): 119 | response = openai.ChatCompletion.create( 120 | model=args.model, 121 | messages=prompt, 122 | temperature=args.temperature, 123 | max_tokens=max_tokens, 124 | stop=stop, 125 | top_p=args.top_p, 126 | api_key=api_key, 127 | n=self.args.num_samples 128 | ) 129 | return response 130 | response = tmp_openai_call_func() 131 | self.total_tokens += response['usage']['total_tokens'] 132 | result = list(map(lambda x:x['text'],response['choices'])) 133 | return result 134 | else: 135 | 136 | inputs = self.tokenizer([prompt], return_tensors="pt").to(self.model.device) 137 | stop = [] if stop is None else stop 138 | stop = list(set(stop + ["\n", "Ċ", "ĊĊ", "<0x0A>"])) # In Llama \n is <0x0A>; In OPT \n is Ċ 139 | stop_token_ids = list(set([self.tokenizer._convert_token_to_id(stop_token) for stop_token in stop] + [ 140 | self.model.config.eos_token_id])) 141 | if "llama" in args.model: 142 | stop_token_ids.remove(self.tokenizer.unk_token_id) 143 | outputs = self.model.generate( 144 | **inputs, 145 | do_sample=True, temperature=args.temperature, top_p=args.top_p, 146 | max_new_tokens=max_tokens, 147 | num_return_sequences=1, 148 | eos_token_id=stop_token_ids 149 | ) 150 | generation = self.tokenizer.decode(outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True) 151 | return generation 
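Editor's note (illustrative, not part of the repository): the `LLM` wrapper above reads its configuration from an `args` object and expects the OpenAI key to be passed into `generate` on every call rather than taken from the environment (`__init__` even asserts that `OPENAI_API_KEY` is unset). The sketch below shows one way to drive the wrapper directly; it is a minimal sketch that assumes the `args` fields visibly read in `llm.py` (`openai_api`, `azure`, `model`, `temperature`, `top_p`, `num_samples`) are sufficient, that `run.py` may define further options, and that the pinned `openai==0.27.4` from `requirements.txt` is installed, since newer clients drop `openai.ChatCompletion`.

```python
# Minimal usage sketch for the LLM wrapper above.
# Assumptions: the args fields listed here are the only ones generate() touches
# on the ChatGPT path, and openai==0.27.4 (as pinned in requirements.txt) is installed.
from argparse import Namespace

from llm import LLM

args = Namespace(
    openai_api=1,            # take the OpenAI-API branch instead of a local HF model
    azure=0,                 # plain OpenAI endpoint, not Azure
    model="gpt-3.5-turbo-0301",
    temperature=0.0,
    top_p=1.0,
    num_samples=1,           # generate() returns one completion per sample
)

llm = LLM(args)              # note: llm.py asserts the OPENAI_API_KEY env var is NOT set
outputs = llm.generate(
    prompt="Which books were written by Nevil Shute?",
    max_tokens=150,
    api_key="sk-...",        # normally rotated in from openai_account_files/accounts.txt
    turbo_system_message="You are a helpful assistant that answers the following questions with proper citations.",
)
print(outputs[0])            # generate() returns a list of message contents
```

In the repository itself this call path is normally exercised through `run.py` and the threaded wrapper in `multi_thread_openai_api_call.py`, which pull keys from the account manager instead of hard-coding one.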
-------------------------------------------------------------------------------- /commands/eli5_iterative_retrieval.sh: -------------------------------------------------------------------------------- 1 | ##################### eli5 Iterative_retrieval; prompt12 ##################### 2 | export CUDA_VISIBLE_DEVICES=0 3 | max_iteration=4 4 | dataset_name=eli5 5 | use_sub_questions=0 6 | use_title=0 7 | used_doc_field=answer 8 | openai_model_name=gpt-3.5-turbo-0301 9 | # Args for retrieval 10 | input_file=data/eli5_bm25_top100.json 11 | retriever=bm25 12 | update_prompt_file=update-using-missing-info-for-new-question 13 | update_query_using_missing_info_from_question_and_psgs=1 14 | corpus_path=PATH_TO_YOUR_OWN_SPHERE_CORPUS 15 | # Args for generating used field. 16 | prompt_style=answer 17 | target_used_field=answer 18 | max_tokens=150 19 | # Args for reranker 20 | position=head 21 | reranking_prompt_file=select_no_up_to 22 | doc_num=50 23 | # Args for filtration 24 | demo_file=prompts/${dataset_name}_default.json 25 | filtration_prompt_file=filter_question_with_demo 26 | filtration_method=judgment 27 | 28 | python Iterative_retrieval.py \ 29 | --max_iteration $max_iteration \ 30 | --dataset_name $dataset_name \ 31 | --use_sub_questions $use_sub_questions \ 32 | --use_title $use_title \ 33 | --used_doc_field $used_doc_field \ 34 | --openai_model_name $openai_model_name \ 35 | --input_file $input_file \ 36 | --retriever $retriever \ 37 | --update_prompt_file $update_prompt_file \ 38 | --update_query_using_missing_info_from_question_and_psgs $update_query_using_missing_info_from_question_and_psgs \ 39 | --corpus_path $corpus_path \ 40 | --prompt_style $prompt_style \ 41 | --target_used_field $target_used_field \ 42 | --max_tokens $max_tokens \ 43 | --position $position \ 44 | --reranking_prompt_file $reranking_prompt_file \ 45 | --doc_num $doc_num \ 46 | --demo_file $demo_file \ 47 | --filtration_prompt_file $filtration_prompt_file \ 48 | --filtration_method $filtration_method 49 | 50 | # run_eval 51 | export CUDA_VISIBLE_DEVICES=0 52 | shot=1 53 | openai_api=1 54 | num_samples=1 55 | data_file=iter_retrieval_50/eli5_final_data/final_data_bm25_max_iteration-4_update-using-missing-info-for-new-question_head.json 56 | ndoc=5 57 | openai_multi_thread=6 58 | model=gpt-3.5-turbo-0301 59 | quick_test=0 60 | seed=42 61 | temperature=0 62 | eval_metric=default 63 | # Other args 64 | dataset_name=eli5 65 | 66 | prompt_file=prompts/${dataset_name}_default.json 67 | output_dir=iter_retrieval_50/${dataset_name}_max-4_bm25-new-question_llm-select-head_run_eval 68 | mkdir $output_dir -p 69 | output_file=${output_dir}/run_output.json 70 | 71 | if [ ! -f "${output_file}" ]; then 72 | echo "*****************************" 73 | echo "start run.py" 74 | echo "*****************************" 75 | 76 | python run.py \ 77 | --shot $shot \ 78 | --openai_api $openai_api \ 79 | --prompt_file $prompt_file \ 80 | --output_fp $output_file \ 81 | --dataset_name $dataset_name \ 82 | --num_samples $num_samples \ 83 | --data_file $data_file \ 84 | --ndoc $ndoc \ 85 | --openai_multi_thread $openai_multi_thread \ 86 | --model $model \ 87 | --quick_test $quick_test \ 88 | --seed $seed \ 89 | --temperature $temperature \ 90 | --turbo_system_message "You are a helpful assistant that answers the following questions with proper citations." 
91 | 92 | echo "*****************************" 93 | echo "finish run.py" 94 | echo "*****************************" 95 | fi 96 | 97 | eval_f=${output_file%.json} 98 | eval_result_fp=${eval_f}.score 99 | if [ ! -f $eval_result_fp ]; then 100 | echo "*****************************" 101 | echo "start eval.py" 102 | echo "*****************************" 103 | 104 | python eval.py \ 105 | --f $output_file \ 106 | --eval_metric $eval_metric 107 | 108 | echo "*****************************" 109 | echo "finish eval.py" 110 | echo "*****************************" 111 | fi 112 | 113 | 114 | ##################### eli5 Iterative_retrieval; prompt13 ##################### 115 | export CUDA_VISIBLE_DEVICES=0 116 | max_iteration=4 117 | dataset_name=eli5 118 | use_sub_questions=0 119 | use_title=0 120 | used_doc_field=answer 121 | openai_model_name=gpt-3.5-turbo-0301 122 | # Args for retrieval 123 | input_file=data/eli5_bm25_top100.json 124 | retriever=bm25 125 | update_prompt_file=update-using-missing-info-for-new-passage 126 | update_query_using_missing_info_from_question_and_psgs=1 127 | corpus_path=PATH_TO_YOUR_OWN_SPHERE_CORPUS 128 | # Args for generating used field. 129 | prompt_style=answer 130 | target_used_field=answer 131 | max_tokens=150 132 | # Args for reranker 133 | position=head 134 | reranking_prompt_file=select_no_up_to 135 | doc_num=50 136 | # Args for filtration 137 | demo_file=prompts/${dataset_name}_default.json 138 | filtration_prompt_file=filter_question_with_demo 139 | filtration_method=judgment 140 | 141 | python Iterative_retrieval.py \ 142 | --max_iteration $max_iteration \ 143 | --dataset_name $dataset_name \ 144 | --use_sub_questions $use_sub_questions \ 145 | --use_title $use_title \ 146 | --used_doc_field $used_doc_field \ 147 | --openai_model_name $openai_model_name \ 148 | --input_file $input_file \ 149 | --retriever $retriever \ 150 | --update_prompt_file $update_prompt_file \ 151 | --update_query_using_missing_info_from_question_and_psgs $update_query_using_missing_info_from_question_and_psgs \ 152 | --corpus_path $corpus_path \ 153 | --prompt_style $prompt_style \ 154 | --target_used_field $target_used_field \ 155 | --max_tokens $max_tokens \ 156 | --position $position \ 157 | --reranking_prompt_file $reranking_prompt_file \ 158 | --doc_num $doc_num \ 159 | --demo_file $demo_file \ 160 | --filtration_prompt_file $filtration_prompt_file \ 161 | --filtration_method $filtration_method 162 | 163 | # run_eval 164 | export CUDA_VISIBLE_DEVICES=0 165 | shot=1 166 | openai_api=1 167 | num_samples=1 168 | data_file=iter_retrieval_50/eli5_final_data/final_data_bm25_max_iteration-4_update-using-missing-info-for-new-passage_head.json 169 | ndoc=5 170 | openai_multi_thread=6 171 | model=gpt-3.5-turbo-0301 172 | quick_test=0 173 | seed=42 174 | temperature=0 175 | eval_metric=default 176 | # Other args 177 | dataset_name=eli5 178 | 179 | prompt_file=prompts/${dataset_name}_default.json 180 | output_dir=iter_retrieval_50/${dataset_name}_max-4_bm25-new-paasage_llm-select-head_run_eval 181 | mkdir $output_dir -p 182 | output_file=${output_dir}/run_output.json 183 | 184 | if [ ! 
-f "${output_file}" ]; then 185 | echo "*****************************" 186 | echo "start run.py" 187 | echo "*****************************" 188 | 189 | python run.py \ 190 | --shot $shot \ 191 | --openai_api $openai_api \ 192 | --prompt_file $prompt_file \ 193 | --output_fp $output_file \ 194 | --dataset_name $dataset_name \ 195 | --num_samples $num_samples \ 196 | --data_file $data_file \ 197 | --ndoc $ndoc \ 198 | --openai_multi_thread $openai_multi_thread \ 199 | --model $model \ 200 | --quick_test $quick_test \ 201 | --seed $seed \ 202 | --temperature $temperature \ 203 | --turbo_system_message "You are a helpful assistant that answers the following questions with proper citations." 204 | 205 | echo "*****************************" 206 | echo "finish run.py" 207 | echo "*****************************" 208 | fi 209 | 210 | eval_f=${output_file%.json} 211 | eval_result_fp=${eval_f}.score 212 | if [ ! -f $eval_result_fp ]; then 213 | echo "*****************************" 214 | echo "start eval.py" 215 | echo "*****************************" 216 | 217 | python eval.py \ 218 | --f $output_file \ 219 | --eval_metric $eval_metric 220 | 221 | echo "*****************************" 222 | echo "finish eval.py" 223 | echo "*****************************" 224 | fi 225 | -------------------------------------------------------------------------------- /commands/qampari_iterative_retrieval.sh: -------------------------------------------------------------------------------- 1 | ##################### qampari Iterative_retrieval; prompt12 ##################### 2 | export CUDA_VISIBLE_DEVICES=0 3 | max_iteration=4 4 | dataset_name=qampari 5 | use_sub_questions=0 6 | use_title=0 7 | used_doc_field=summary 8 | openai_model_name=gpt-3.5-turbo-0301 9 | # Args for retrieval 10 | input_file=data/qampari_gtr_top100.json 11 | retriever=bge-large-en-v1.5 12 | update_prompt_file=update-using-missing-info-for-new-question 13 | update_query_using_missing_info_from_question_and_psgs=1 14 | corpus_path=PATH_TO_YOUR_OWN_WIKI_CORPUS 15 | # Args for generating used field. 
16 | prompt_style=summary 17 | target_used_field=summary 18 | max_tokens=150 19 | # Args for reranker 20 | position=head 21 | reranking_prompt_file=select_no_up_to 22 | doc_num=50 23 | window_size=20 24 | # Args for filtration 25 | demo_file=prompts/${dataset_name}_default.json 26 | filtration_prompt_file=filter_question_with_demo 27 | filtration_method=judgment 28 | 29 | python Iterative_retrieval.py \ 30 | --max_iteration $max_iteration \ 31 | --dataset_name $dataset_name \ 32 | --use_sub_questions $use_sub_questions \ 33 | --use_title $use_title \ 34 | --used_doc_field $used_doc_field \ 35 | --openai_model_name $openai_model_name \ 36 | --input_file $input_file \ 37 | --retriever $retriever \ 38 | --update_prompt_file $update_prompt_file \ 39 | --update_query_using_missing_info_from_question_and_psgs $update_query_using_missing_info_from_question_and_psgs \ 40 | --corpus_path $corpus_path \ 41 | --prompt_style $prompt_style \ 42 | --target_used_field $target_used_field \ 43 | --max_tokens $max_tokens \ 44 | --position $position \ 45 | --reranking_prompt_file $reranking_prompt_file \ 46 | --doc_num $doc_num \ 47 | --window_size $window_size \ 48 | --filtration_prompt_file $filtration_prompt_file \ 49 | --demo_file $demo_file \ 50 | --filtration_method $filtration_method 51 | 52 | # run_eval 53 | export CUDA_VISIBLE_DEVICES=0 54 | shot=1 55 | openai_api=1 56 | num_samples=1 57 | data_file=iter_retrieval_50/qampari_final_data/final_data_bge-large-en-v1.5_max_iteration-4_update-using-missing-info-for-new-question_head.json 58 | ndoc=5 59 | openai_multi_thread=10 60 | model=gpt-3.5-turbo-0301 61 | quick_test=0 62 | seed=42 63 | temperature=0 64 | eval_metric=default 65 | # Other args 66 | dataset_name=qampari 67 | 68 | prompt_file=prompts/${dataset_name}_default.json 69 | output_dir=iter_retrieval_50/${dataset_name}_max-4_bge-large-en-v1.5-new-question_llm-select-head_run_eval 70 | mkdir $output_dir -p 71 | output_file=${output_dir}/run_output.json 72 | 73 | if [ ! -f "${output_file}" ]; then 74 | echo "*****************************" 75 | echo "start run.py" 76 | echo "*****************************" 77 | 78 | python run.py \ 79 | --shot $shot \ 80 | --openai_api $openai_api \ 81 | --prompt_file $prompt_file \ 82 | --output_fp $output_file \ 83 | --dataset_name $dataset_name \ 84 | --num_samples $num_samples \ 85 | --data_file $data_file \ 86 | --ndoc $ndoc \ 87 | --openai_multi_thread $openai_multi_thread \ 88 | --model $model \ 89 | --quick_test $quick_test \ 90 | --seed $seed \ 91 | --temperature $temperature \ 92 | --turbo_system_message "You are a helpful assistant that answers the following questions with proper citations." 93 | 94 | echo "*****************************" 95 | echo "finish run.py" 96 | echo "*****************************" 97 | fi 98 | 99 | eval_f=${output_file%.json} 100 | eval_result_fp=${eval_f}.score 101 | if [ ! 
-f $eval_result_fp ]; then 102 | echo "*****************************" 103 | echo "start eval.py" 104 | echo "*****************************" 105 | 106 | python eval.py \ 107 | --f $output_file \ 108 | --eval_metric $eval_metric 109 | 110 | echo "*****************************" 111 | echo "finish eval.py" 112 | echo "*****************************" 113 | fi 114 | 115 | 116 | ##################### qampari Iterative_retrieval; prompt13 ##################### 117 | export CUDA_VISIBLE_DEVICES=0 118 | max_iteration=4 119 | dataset_name=qampari 120 | use_sub_questions=0 121 | use_title=0 122 | used_doc_field=summary 123 | openai_model_name=gpt-3.5-turbo-0301 124 | # Args for retrieval 125 | input_file=data/qampari_gtr_top100.json 126 | retriever=bge-large-en-v1.5 127 | update_prompt_file=update-using-missing-info-for-new-passage 128 | update_query_using_missing_info_from_question_and_psgs=1 129 | corpus_path=PATH_TO_YOUR_OWN_WIKI_CORPUS 130 | # Args for generating used field. 131 | prompt_style=summary 132 | target_used_field=summary 133 | max_tokens=150 134 | # Args for reranker 135 | position=head 136 | reranking_prompt_file=select_no_up_to 137 | doc_num=50 138 | window_size=20 139 | # Args for filtration 140 | demo_file=prompts/${dataset_name}_default.json 141 | filtration_prompt_file=filter_question_with_demo 142 | filtration_method=judgment 143 | 144 | python Iterative_retrieval.py \ 145 | --max_iteration $max_iteration \ 146 | --dataset_name $dataset_name \ 147 | --use_sub_questions $use_sub_questions \ 148 | --use_title $use_title \ 149 | --used_doc_field $used_doc_field \ 150 | --openai_model_name $openai_model_name \ 151 | --input_file $input_file \ 152 | --retriever $retriever \ 153 | --update_prompt_file $update_prompt_file \ 154 | --update_query_using_missing_info_from_question_and_psgs $update_query_using_missing_info_from_question_and_psgs \ 155 | --corpus_path $corpus_path \ 156 | --prompt_style $prompt_style \ 157 | --target_used_field $target_used_field \ 158 | --max_tokens $max_tokens \ 159 | --position $position \ 160 | --reranking_prompt_file $reranking_prompt_file \ 161 | --doc_num $doc_num \ 162 | --window_size $window_size \ 163 | --filtration_prompt_file $filtration_prompt_file \ 164 | --demo_file $demo_file \ 165 | --filtration_method $filtration_method 166 | 167 | # run_eval 168 | export CUDA_VISIBLE_DEVICES=0 169 | shot=1 170 | openai_api=1 171 | num_samples=1 172 | data_file=iter_retrieval_50/qampari_final_data/final_data_bge-large-en-v1.5_max_iteration-4_update-using-missing-info-for-new-passage_head.json 173 | ndoc=5 174 | openai_multi_thread=10 175 | model=gpt-3.5-turbo-0301 176 | quick_test=0 177 | seed=42 178 | temperature=0 179 | eval_metric=default 180 | # Other args 181 | dataset_name=qampari 182 | 183 | prompt_file=prompts/${dataset_name}_default.json 184 | output_dir=iter_retrieval_50/${dataset_name}_max-4_bge-large-en-v1.5-new-passage_llm-select-head_run_eval 185 | mkdir $output_dir -p 186 | output_file=${output_dir}/run_output.json 187 | 188 | if [ ! 
-f "${output_file}" ]; then 189 | echo "*****************************" 190 | echo "start run.py" 191 | echo "*****************************" 192 | 193 | python run.py \ 194 | --shot $shot \ 195 | --openai_api $openai_api \ 196 | --prompt_file $prompt_file \ 197 | --output_fp $output_file \ 198 | --dataset_name $dataset_name \ 199 | --num_samples $num_samples \ 200 | --data_file $data_file \ 201 | --ndoc $ndoc \ 202 | --openai_multi_thread $openai_multi_thread \ 203 | --model $model \ 204 | --quick_test $quick_test \ 205 | --seed $seed \ 206 | --temperature $temperature \ 207 | --turbo_system_message "You are a helpful assistant that answers the following questions with proper citations." 208 | 209 | echo "*****************************" 210 | echo "finish run.py" 211 | echo "*****************************" 212 | fi 213 | 214 | eval_f=${output_file%.json} 215 | eval_result_fp=${eval_f}.score 216 | if [ ! -f $eval_result_fp ]; then 217 | echo "*****************************" 218 | echo "start eval.py" 219 | echo "*****************************" 220 | 221 | python eval.py \ 222 | --f $output_file \ 223 | --eval_metric $eval_metric 224 | 225 | echo "*****************************" 226 | echo "finish eval.py" 227 | echo "*****************************" 228 | fi 229 | -------------------------------------------------------------------------------- /commands/asqa_iterative_retrieval.sh: -------------------------------------------------------------------------------- 1 | ##################### asqa Iterative_retrieval; new-question ##################### 2 | export CUDA_VISIBLE_DEVICES=0 3 | max_iteration=4 4 | dataset_name=asqa 5 | use_sub_questions=1 6 | use_title=1 7 | used_doc_field=summary_use_sub 8 | openai_model_name=gpt-3.5-turbo-0301 9 | # Args for retrieval 10 | input_file=data/asqa_gtr_top100.json 11 | retriever=bge-large-en-v1.5 12 | update_prompt_file=update-using-missing-info-for-new-question 13 | update_query_using_missing_info_from_question_and_psgs=1 14 | corpus_path=PATH_TO_YOUR_OWN_WIKI_CORPUS 15 | # Args for generating used field. 
16 | prompt_style=summary 17 | target_used_field=summary_use_sub 18 | max_tokens=150 19 | # Args for reranker 20 | position=head 21 | reranking_prompt_file=select_no_up_to 22 | doc_num=50 23 | window_size=20 24 | # Args for filtration 25 | demo_file=prompts/${dataset_name}_demo.json 26 | filtration_prompt_file=filter_question_with_demo 27 | filtration_method=judgment 28 | 29 | python Iterative_retrieval.py \ 30 | --max_iteration $max_iteration \ 31 | --dataset_name $dataset_name \ 32 | --use_sub_questions $use_sub_questions \ 33 | --use_title $use_title \ 34 | --used_doc_field $used_doc_field \ 35 | --openai_model_name $openai_model_name \ 36 | --input_file $input_file \ 37 | --retriever $retriever \ 38 | --update_prompt_file $update_prompt_file \ 39 | --update_query_using_missing_info_from_question_and_psgs $update_query_using_missing_info_from_question_and_psgs \ 40 | --corpus_path $corpus_path \ 41 | --prompt_style $prompt_style \ 42 | --target_used_field $target_used_field \ 43 | --max_tokens $max_tokens \ 44 | --position $position \ 45 | --reranking_prompt_file $reranking_prompt_file \ 46 | --doc_num $doc_num \ 47 | --window_size $window_size \ 48 | --demo_file $demo_file \ 49 | --filtration_prompt_file $filtration_prompt_file \ 50 | --filtration_method $filtration_method 51 | 52 | # run_eval 53 | export CUDA_VISIBLE_DEVICES=0 54 | shot=1 55 | openai_api=1 56 | num_samples=1 57 | data_file=iter_retrieval_50/asqa_final_data/final_data_bge-large-en-v1.5_max_iteration-4_update-using-missing-info-for-new-question_head.json 58 | ndoc=5 59 | openai_multi_thread=10 60 | model=gpt-3.5-turbo-0301 61 | quick_test=0 62 | seed=42 63 | temperature=0 64 | eval_metric=default 65 | use_sub_questions=1 66 | # Other args 67 | dataset_name=asqa 68 | 69 | prompt_file=prompts/${dataset_name}_default.json 70 | output_dir=iter_retrieval_50/${dataset_name}_max-4_bge-large-en-v1.5-new-question_llm-select-head_run_eval 71 | mkdir $output_dir -p 72 | output_file=${output_dir}/run_output.json 73 | 74 | if [ ! -f "${output_file}" ]; then 75 | echo "*****************************" 76 | echo "start run.py" 77 | echo "*****************************" 78 | 79 | python run.py \ 80 | --shot $shot \ 81 | --openai_api $openai_api \ 82 | --prompt_file $prompt_file \ 83 | --output_fp $output_file \ 84 | --dataset_name $dataset_name \ 85 | --num_samples $num_samples \ 86 | --data_file $data_file \ 87 | --ndoc $ndoc \ 88 | --openai_multi_thread $openai_multi_thread \ 89 | --model $model \ 90 | --quick_test $quick_test \ 91 | --seed $seed \ 92 | --temperature $temperature \ 93 | --use_sub_questions $use_sub_questions \ 94 | --turbo_system_message "You are a helpful assistant that answers the following questions with proper citations." 95 | 96 | echo "*****************************" 97 | echo "finish run.py" 98 | echo "*****************************" 99 | fi 100 | 101 | eval_f=${output_file%.json} 102 | eval_result_fp=${eval_f}.score 103 | if [ ! 
-f $eval_result_fp ]; then 104 | echo "*****************************" 105 | echo "start eval.py" 106 | echo "*****************************" 107 | 108 | python eval.py \ 109 | --f $output_file \ 110 | --eval_metric $eval_metric 111 | 112 | echo "*****************************" 113 | echo "finish eval.py" 114 | echo "*****************************" 115 | fi 116 | 117 | 118 | ##################### asqa Iterative_retrieval; new-passage ##################### 119 | export CUDA_VISIBLE_DEVICES=0 120 | max_iteration=4 121 | dataset_name=asqa 122 | use_sub_questions=1 123 | use_title=1 124 | used_doc_field=summary_use_sub 125 | openai_model_name=gpt-3.5-turbo-0301 126 | # Args for retrieval 127 | input_file=data/asqa_gtr_top100.json 128 | retriever=bge-large-en-v1.5 129 | update_prompt_file=update-using-missing-info-for-new-passage 130 | update_query_using_missing_info_from_question_and_psgs=1 131 | corpus_path=PATH_TO_YOUR_OWN_WIKI_CORPUS 132 | # Args for generating used field. 133 | prompt_style=summary 134 | target_used_field=summary_use_sub 135 | max_tokens=150 136 | # Args for reranker 137 | position=head 138 | reranking_prompt_file=select_no_up_to 139 | doc_num=50 140 | window_size=20 141 | # Args for filtration 142 | demo_file=prompts/${dataset_name}_demo.json 143 | filtration_prompt_file=filter_question_with_demo 144 | filtration_method=judgment 145 | 146 | python Iterative_retrieval.py \ 147 | --max_iteration $max_iteration \ 148 | --dataset_name $dataset_name \ 149 | --use_sub_questions $use_sub_questions \ 150 | --use_title $use_title \ 151 | --used_doc_field $used_doc_field \ 152 | --openai_model_name $openai_model_name \ 153 | --input_file $input_file \ 154 | --retriever $retriever \ 155 | --update_prompt_file $update_prompt_file \ 156 | --update_query_using_missing_info_from_question_and_psgs $update_query_using_missing_info_from_question_and_psgs \ 157 | --corpus_path $corpus_path \ 158 | --prompt_style $prompt_style \ 159 | --target_used_field $target_used_field \ 160 | --max_tokens $max_tokens \ 161 | --position $position \ 162 | --reranking_prompt_file $reranking_prompt_file \ 163 | --doc_num $doc_num \ 164 | --window_size $window_size \ 165 | --demo_file $demo_file \ 166 | --filtration_prompt_file $filtration_prompt_file \ 167 | --filtration_method $filtration_method 168 | 169 | # run_eval 170 | export CUDA_VISIBLE_DEVICES=0 171 | shot=1 172 | openai_api=1 173 | num_samples=1 174 | data_file=iter_retrieval_50/asqa_final_data/final_data_bge-large-en-v1.5_max_iteration-4_update-using-missing-info-for-new-passage_head.json 175 | ndoc=5 176 | openai_multi_thread=10 177 | model=gpt-3.5-turbo-0301 178 | quick_test=0 179 | seed=42 180 | temperature=0 181 | eval_metric=default 182 | use_sub_questions=1 183 | # Other args 184 | dataset_name=asqa 185 | 186 | prompt_file=prompts/${dataset_name}_default.json 187 | output_dir=iter_retrieval_50/${dataset_name}_max-4_bge-large-en-v1.5-new-passage_llm-select-head_run_eval 188 | mkdir $output_dir -p 189 | output_file=${output_dir}/run_output.json 190 | 191 | if [ ! 
-f "${output_file}" ]; then 192 | echo "*****************************" 193 | echo "start run.py" 194 | echo "*****************************" 195 | 196 | python run.py \ 197 | --shot $shot \ 198 | --openai_api $openai_api \ 199 | --prompt_file $prompt_file \ 200 | --output_fp $output_file \ 201 | --dataset_name $dataset_name \ 202 | --num_samples $num_samples \ 203 | --data_file $data_file \ 204 | --ndoc $ndoc \ 205 | --openai_multi_thread $openai_multi_thread \ 206 | --model $model \ 207 | --quick_test $quick_test \ 208 | --seed $seed \ 209 | --temperature $temperature \ 210 | --use_sub_questions $use_sub_questions \ 211 | --turbo_system_message "You are a helpful assistant that answers the following questions with proper citations." 212 | 213 | echo "*****************************" 214 | echo "finish run.py" 215 | echo "*****************************" 216 | fi 217 | 218 | eval_f=${output_file%.json} 219 | eval_result_fp=${eval_f}.score 220 | if [ ! -f $eval_result_fp ]; then 221 | echo "*****************************" 222 | echo "start eval.py" 223 | echo "*****************************" 224 | 225 | python eval.py \ 226 | --f $output_file \ 227 | --eval_metric $eval_metric 228 | 229 | echo "*****************************" 230 | echo "finish eval.py" 231 | echo "*****************************" 232 | fi 233 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import logging 2 | logger = logging.getLogger(__name__) 3 | logger.setLevel(logging.INFO) 4 | 5 | import torch 6 | import re 7 | import os 8 | import string 9 | import time 10 | import pickle 11 | 12 | 13 | def normalize_answer(s): 14 | def remove_articles(text): 15 | return re.sub(r"\b(a|an|the)\b", " ", text) 16 | 17 | def white_space_fix(text): 18 | return " ".join(text.split()) 19 | 20 | def remove_punc(text): 21 | exclude = set(string.punctuation) 22 | return "".join(ch for ch in text if ch not in exclude) 23 | 24 | def lower(text): 25 | return text.lower() 26 | 27 | return white_space_fix(remove_articles(remove_punc(lower(s)))) 28 | 29 | 30 | def remove_citations(sent): 31 | return re.sub(r"\[\d+", "", re.sub(r" \[\d+", "", sent)).replace(" |", "").replace("]", "") 32 | 33 | 34 | def get_max_memory(): 35 | """Get the maximum memory available for the current GPU for loading models.""" 36 | free_in_GB = int(torch.cuda.mem_get_info()[0]/1024**3) 37 | max_memory = f'{free_in_GB-6}GB' 38 | n_gpus = torch.cuda.device_count() 39 | max_memory = {i: max_memory for i in range(n_gpus)} 40 | return max_memory 41 | 42 | 43 | def make_doc_prompt(doc, doc_id, doc_prompt, use_shorter=None): 44 | # For doc prompt: 45 | # - {ID}: doc id (starting from 1) 46 | # - {T}: title 47 | # - {P}: text 48 | # use_shorter: None, "summary", or "extraction" 49 | 50 | text = doc['text'] 51 | if use_shorter is not None: 52 | text = doc[use_shorter] 53 | return doc_prompt.replace("{T}", doc["title"]).replace("{P}", text).replace("{ID}", str(doc_id+1)) 54 | 55 | 56 | def get_shorter_text(item, docs, ndoc, key): 57 | doc_list = [] 58 | for item_id, item in enumerate(docs): 59 | if key not in item: 60 | if len(doc_list) == 0: 61 | # If there aren't any document, at least provide one (using full text) 62 | item[key] = item['text'] 63 | doc_list.append(item) 64 | logger.warn(f"No {key} found in document. It could be this data do not contain {key} or previous documents are not relevant. This is document {item_id}. 
This question will only have {len(doc_list)} documents.") 65 | break 66 | if "irrelevant" in item[key] or "Irrelevant" in item[key]: 67 | continue 68 | doc_list.append(item) 69 | if len(doc_list) >= ndoc: 70 | break 71 | return doc_list 72 | 73 | 74 | def make_demo(item, prompt, ndoc=None, doc_prompt=None, instruction=None, use_shorter=None, test=False, use_sub_questions: int=0): 75 | # For demo prompt 76 | # - {INST}: the instruction 77 | # - {D}: the documents 78 | # - {Q}: the question 79 | # - {A}: the answers 80 | # ndoc: number of documents to put in context 81 | # use_shorter: None, "summary", or "extraction" 82 | 83 | # Use sub questions for asqa. 84 | if use_sub_questions and 'qa_pairs' in item: 85 | questions = list(map(lambda x: x['question'], list(item['qa_pairs']))) 86 | else: 87 | questions = [item['question']] 88 | prompt = prompt.replace("{INST}", instruction).replace("{Q}", '\n'.join(questions)) 89 | if "{D}" in prompt: 90 | if ndoc == 0: 91 | prompt = prompt.replace("{D}\n", "") # if there is no doc we also delete the empty line 92 | else: 93 | doc_list = get_shorter_text(item, item["docs"], ndoc, use_shorter) if use_shorter is not None else item["docs"][:ndoc] 94 | text = "".join([make_doc_prompt(doc, doc_id, doc_prompt, use_shorter=use_shorter) for doc_id, doc in enumerate(doc_list)]) 95 | prompt = prompt.replace("{D}", text) 96 | 97 | if not test: 98 | answer = "\n" + "\n".join(item["answer"]) if isinstance(item["answer"], list) else item["answer"] 99 | prompt = prompt.replace("{A}", "").rstrip() + answer 100 | else: 101 | prompt = prompt.replace("{A}", "").rstrip() # remove any space or \n 102 | 103 | return prompt 104 | 105 | 106 | def load_model(model_name_or_path, dtype=torch.float16, int8=False, reserve_memory=10): 107 | # Load a huggingface model and tokenizer 108 | # dtype: torch.float16 or torch.bfloat16 109 | # int8: whether to use int8 quantization 110 | # reserve_memory: how much memory to reserve for the model on each gpu (in GB) 111 | 112 | # Llama: set up the root dir 113 | open_source_models = ["llama", "alpaca", "vicuna", "oasst"] 114 | if any([m in model_name_or_path for m in open_source_models]): 115 | model_name_or_path = os.path.join(os.environ["LLAMA_ROOT"], model_name_or_path) 116 | 117 | # Load the FP16 model 118 | from transformers import AutoModelForCausalLM, AutoTokenizer 119 | logger.info(f"Loading {model_name_or_path} in {dtype}...") 120 | if int8: 121 | logger.warn("Use LLM.int8") 122 | start_time = time.time() 123 | model = AutoModelForCausalLM.from_pretrained( 124 | model_name_or_path, 125 | device_map='auto', 126 | torch_dtype=dtype, 127 | max_memory=get_max_memory(), 128 | load_in_8bit=int8, 129 | ) 130 | logger.info("Finish loading in %.2f sec." % (time.time() - start_time)) 131 | 132 | # Load the tokenizer 133 | tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False) 134 | 135 | # Fix OPT bos token problem in HF 136 | if "opt" in model_name_or_path: 137 | tokenizer.bos_token = "" 138 | tokenizer.padding_side = "left" 139 | 140 | return model, tokenizer 141 | 142 | 143 | # Save p_embeddings to local position. 
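# Illustrative usage sketch, assuming `p_embeddings` is a list or numpy array of
# passage vectors produced by a dense encoder (`encoder`, `passages`, and the
# file path below are hypothetical placeholders, not part of the pipeline):
#
#     p_embeddings = encoder.encode(passages)
#     save_embeddings(p_embeddings, 'data/p_embeddings.pkl')
#     p_embeddings = load_embeddings('data/p_embeddings.pkl')
#
# Pickle protocol 4 is used in save_embeddings so that payloads larger than
# 4 GB can be serialized.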
144 | def save_embeddings(embeddings, file_path): 145 | with open(file_path, 'wb') as file: 146 | pickle.dump(embeddings, file, protocol=4) 147 | 148 | 149 | # Load local p_embeddings 150 | def load_embeddings(file_path): 151 | with open(file_path, 'rb') as file: 152 | embeddings = pickle.load(file) 153 | return embeddings 154 | 155 | 156 | def get_demonstration(demo_data) -> str: 157 | """ 158 | Args 159 | ---- 160 | demo_data: Dict 161 | Data for creating demonstration prompt. 162 | 163 | Returns 164 | ------- 165 | demos: str 166 | Demonstration prompt. 167 | """ 168 | if 'qa_pairs' in demo_data: 169 | logger.warning("Load sub questions when getting demostration.") 170 | questions = list(map(lambda x: x['question'], list(demo_data['qa_pairs']))) 171 | question = "\n".join(questions) 172 | answer = demo_data["answer"] 173 | else: 174 | doc = demo_data["demos"][0] 175 | question = doc['question'] 176 | answer = doc["answer"].replace('[1]', '').replace('[2]', '').replace('[3]', '').replace('[4]', '').replace('[5]', '') 177 | 178 | demos = f"Question: {question}\nAnswer: {answer}" 179 | return demos 180 | 181 | 182 | def get_messages(questions: str, prompt_style: str, doc): 183 | """ 184 | Args 185 | ---- 186 | questions: str 187 | Given questions. 188 | prompt_style: str 189 | Set the type of user's content. 190 | doc: List[Dict] 191 | One of documents that belong to a problem. 192 | """ 193 | if prompt_style == 'summary': 194 | messages = [ 195 | {'role': 'system', 'content': "You are a helpful assistant."}, 196 | {'role': 'user', 'content': f"Summarize the following document within 50 words for the given question(s). Return \"irrelevant\" if the document is irrelevant to the question(s). Try to keep all the important dates, numbers, and names.\nQuestion(s):\n{questions}\n\nDocument:\nTitle: {doc['title']}\nText: {doc['text']}\n\nSummary:"} 197 | ] 198 | elif prompt_style == 'answer': 199 | messages = [ 200 | {'role': 'system', 'content': "You are a helpful assistant."}, 201 | {'role': 'user', 'content': f"Answer the given question(s) using the following document. Return \"irrelevant\" if the document is irrelevant to the question(s). Try to keep all the important dates, numbers, and names.\n\nQuestion(s):\n{questions}\n\nDocument:\nTitle: {doc['title']}\nText: {doc['text']}\n\nAnswer:"} 202 | ] 203 | else: 204 | raise NotImplementedError 205 | 206 | return messages -------------------------------------------------------------------------------- /prompts/qampari_default.json: -------------------------------------------------------------------------------- 1 | { 2 | "instruction": "Instruction: Provide a list of accurate answers for the given question using only the provided search results (some of which might be irrelevant) and cite them properly. Always cite one and only one document for each answer. Separate answers by commas. For questions that have more than 5 answers, write at least 5 answers.", 3 | "demo_sep": "\n\n\n", 4 | "demo_prompt": "{INST}\n\nQuestion: {Q}\n\n{D}\nAnswer: {A}", 5 | "doc_prompt": "Document [{ID}](Title: {T}): {P}\n", 6 | "demos": [ 7 | { 8 | "question": "Which books were written by Nevil Shute?", 9 | "answer": "Marazan [1], Stephen Morris [1], Beyond the Black Stump [2], Lonely Road [2], The Chequer Board [2], In the Wet [2], Trustee from the Toolroom [2], Round the Bend [2], No Highway [3], Ruined City [3], On the Beach [3].", 10 | "docs": [ 11 | { 12 | "title": "Nevil Shute", 13 | "text": "early stages. 
My congratulations.\" His celebrity as a writer caused the Ministry of Information to send him to the Normandy Landings on 6 June 1944 and later to Burma as a correspondent. He finished the war with the rank of lieutenant commander in the Royal Navy Volunteer Reserves (RNVR). Shute's first novel, \"Stephen Morris\", was written in 1923, but not published until 1961. His first published novel was \"Marazan\", which came out in 1926. After that he averaged one novel every two years through the 1950s, with the exception of a six-year hiatus while he was establishing his own aircraft" 14 | }, 15 | { 16 | "title": "Nevil Shute", 17 | "text": "theme is the bridging of social barriers such as class (\"Lonely Road\" and \"Landfall\"), race (\"The Chequer Board\"), or religion (\"Round the Bend\"). The Australian novels are individual hymns to that country, with subtle disparagement of the mores of the United States (\"Beyond the Black Stump\") and overt antipathy towards the post-World War II socialist government of Shute's native Britain (\"The Far Country\" and \"In the Wet\"). Shute's heroes tended to be like himself: middle class solicitors, doctors, accountants, bank managers, engineers, generally university graduates. However (as in \"Trustee from the Toolroom\"), Shute valued the honest artisans and their social" 18 | }, 19 | { 20 | "title": "Nevil Shute", 21 | "text": "construction company, Airspeed Ltd. His popularity grew slowly with each novel, but he became much more famous after the publication of \"On the Beach\" in 1957. Shute's novels are written in a simple, highly readable style, with clearly delineated plot lines. Where there is a romantic element, sex is referred to only obliquely. Many of the stories are introduced by a narrator who is not a character in the story. The most common theme in Shute's novels is the dignity of work, spanning all classes, whether an Eastern European bar \"hostess\" (\"Ruined City\") or brilliant boffin (\"No Highway\"). Another recurrent" 22 | }, 23 | { 24 | "title": "The Chequer Board", 25 | "text": "the Burmese people\", both of which are central to the book's story. Shute was concerned that sales of the book in the United States would be negatively impacted by the book's open-minded handling of racial issues; as it turned out, sales soared. Shute and his wife traveled the U.S. on Greyhound buses to \"\"get in touch with the man on the street,\"\" finding the experience refreshing. Afterwards he wrote \"\"Sincerity is the first attribute for making money in the business of writing novels.\"\" The Chequer Board The Chequer Board is a novel by Nevil Shute, first published in the United" 26 | }, 27 | { 28 | "title": "In the Wet", 29 | "text": "had used the idea of multiple votes for merit in his short story \"The Curious Republic of Gondour\". In the Wet In The Wet is a novel by Nevil Shute that was first published in the United Kingdom in 1953. It contains many of the typical elements of a hearty and adventurous Shute yarn such as flying, the future, mystic states, and ordinary people doing extraordinary things. 
The story is opened by its initial narrator \u2013 an Anglican priest in the Bush Brotherhood named Roger Hargreaves \u2013 who describes his ordinary circumstances in a large parish of the Australian outback" 30 | } 31 | ] 32 | }, 33 | { 34 | "question": "Which film has Gong Li as a member of its cast?", 35 | "answer": "The Story of Qiu Ju [1], Farewell My Concubine [2], Flirting Scholar [2], The Monkey King 2 [3], Mulan [3], Saturday Fiction [3], Coming Home [3].", 36 | "docs": [ 37 | { 38 | "title": "Gong Li", 39 | "text": "Gong Li Gong Li (born 31 December 1965) is a Chinese-born Singaporean film actress. She achieved international prominence through her close collaborations with Chinese director Zhang Yimou and won the Volpi Cup for Best Actress at Venice for her performance in his 1992 film \"The Story of Qiu Ju\". She has been credited with helping to bring Chinese cinema to prominence in Europe and the United States. In 2006, she was voted the most beautiful woman in China. Gong has won numerous accolades for her work as an actress; she won the New York Film Critics Circle Award for Best" 40 | }, 41 | { 42 | "title": "Gong Li", 43 | "text": "making her realize that she has assisted the dark cynical system. In 1993, she received a New York Film Critics Circle award for her role in \"Farewell My Concubine\" (1993). Directed by Chen Kaige, the film was her first major role with a director other than Zhang Yimou. In the same year, she was awarded with the Berlinale Camera at the 43rd Berlin International Film Festival. \"Premiere\" magazine ranked her performance in \"Farewell My Concubine\" as the 89th greatest performance of all time. She also worked with renowned director Stephen Chow in comedy films \"\" (1991) and \"Flirting Scholar\" (1993)." 44 | }, 45 | { 46 | "title": "Gong Li", 47 | "text": "International Film Festival. Later that same year, she reunited with Zhang Yimou for the film \"Coming Home\", which is set during the throes of the Cultural Revolution; this film was their first collaboration since 2006. In 2016, Gong took on her first action role in \"The Monkey King 2\", playing the White Bone Demon. In 2018, Gong was cast in Lou Ye's period drama \"Saturday Fiction\", where she plays an actress who is working undercover gathering intelligence for the Allies. That year, she was also cast in the live-action adaptation of the 1998 Disney animated film \"Mulan\", as an unspecified" 48 | }, 49 | { 50 | "title": "Zhang Yimou", 51 | "text": "in Zhang's earlier films. \"Raise the Red Lantern\" was nominated in the Best Foreign Language Film category at the 1992 Academy Awards, becoming the second Chinese film to earn this distinction (after Zhang's \"Ju Dou\"). It eventually lost out to Gabriele Salvatores's \"Mediterraneo\". Zhang's next directorial work, \"The Story of Qiu Ju\", in 1992, once again starring Gong Li in the lead role. The film, which tells the tale of a peasant woman seeking justice for her husband after he was beaten by a village official, was a hit at film festivals and won the Golden Lion award at the" 52 | }, 53 | { 54 | "title": "Gong Li", 55 | "text": "Gong Li Gong Li (born 31 December 1965) is a Chinese-born Singaporean film actress. She achieved international prominence through her close collaborations with Chinese director Zhang Yimou and won the Volpi Cup for Best Actress at Venice for her performance in his 1992 film \"The Story of Qiu Ju\". She has been credited with helping to bring Chinese cinema to prominence in Europe and the United States. 
In 2006, she was voted the most beautiful woman in China. Gong has won numerous accolades for her work as an actress; she won the New York Film Critics Circle Award for Best" 56 | } 57 | ] 58 | }, 59 | { 60 | "question": "In which years did Patti LaBelle publish music?", 61 | "answer": "2006 [1], 1977 [2], 2004 [3], 2005 [3], 2000 [3], 2006 [3].", 62 | "docs": [ 63 | { 64 | "title": "The Gospel According to Patti LaBelle", 65 | "text": "The Gospel According to Patti LaBelle The Gospel According to Patti LaBelle is the first gospel album released by singer Patti LaBelle, released in November 2006. This project began three years ago when Patti's late musical director and close friend Budd Ellison told a skeptical LaBelle that \"it's now or never, Patti.\" The album is dedicated to his memory as he succumbed to prostate cancer before the album saw a release. The album was released on November 21, 2006 through indie label Umbrella/Bungalow Records, also home to Carl Thomas, Rodney Jerkins, Dean \"DC\" Charles, and other artists. \"The Gospel According" 66 | }, 67 | { 68 | "title": "Patti LaBelle (album)", 69 | "text": "scaled the high sixties on the \"Billboard\" R&B chart, it soon became one of her famous show-stoppers while performing the song. LaBelle performed the song at her first solo concert in London, getting a standing ovation, which helped to give LaBelle motivation to continue her career. The album, when released, performed successfully, reaching number 62 on the \"Billboard\" 200 and number 31 on the R&B albums chart, while critics hailed the album. Patti LaBelle (album) Patti LaBelle is the debut solo album by singer Patti LaBelle, released in 1977. The first album LaBelle recorded after sixteen years fronting the band" 70 | }, 71 | { 72 | "title": "Patti LaBelle", 73 | "text": "win. In 2000, LaBelle released her final MCA album, \"When a Woman Loves\", before signing with Def Soul Classics to release the 2004 album, \"Timeless Journey\". Following the release of her 2005 covers album, \"Classic Moments\", LaBelle engaged in a rivalry with Antonio \"L.A.\" Reid over the direction of her career, leading to her leaving the label.In the same year, the World Music Awards recognized her years in the music business by awarding her the Legend Award. In 2006, she released her first gospel album, \"The Gospel According to Patti LaBelle\" on the Bungalo label, the album later peaking at" 74 | }, 75 | { 76 | "title": "Patti LaBelle", 77 | "text": "Patti LaBelle Patti LaBelle (born Patricia Louise Holt; May 24, 1944) is an American singer, actress, and entrepreneur. LaBelle began her career in the early 1960s as lead singer and front woman of the vocal group, Patti LaBelle and the Bluebelles. Following the group's name change to Labelle in the early 1970s, they released the iconic disco song \"Lady Marmalade\" and the group later became the first African-American vocal group to land the cover of \"Rolling Stone\" magazine. After the group split in 1976, LaBelle began a successful solo career, starting with her critically acclaimed debut album, which included the" 78 | }, 79 | { 80 | "title": "The Gospel According to Patti LaBelle", 81 | "text": "Billboard's Top Gospel Albums chart for 17 weeks. \"Where Love Begins,\" a duet with Yolanda Adams was played frequently on R&B and gospel radio stations and debuted at #68 on Billboard's Hot R&B/Hip-Hop tracks. The second single \"Anything\" featuring Kanye West, Mary Mary and Consequence hit #64 on Billboards Hot R&B/Hip-Hop tracks. 
In 2008, the album was nominated for a Dove Award for Contemporary Gospel Album of the Year at the 39th GMA Dove Awards. The Gospel According to Patti LaBelle The Gospel According to Patti LaBelle is the first gospel album released by singer Patti LaBelle, released in November" 82 | } 83 | ] 84 | }, 85 | { 86 | "question": "Glenn Ford was a member of cast in which film?", 87 | "answer": "So Ends Our Night [1], Heaven with a Barbed Wire Fence [1], Happy Birthday to Me [2], The Greatest Gift [2], The Gift [2], The Brotherhood of the Bell [3].", 88 | "docs": [ 89 | { 90 | "title": "Glenn Ford", 91 | "text": "name came from his father's hometown of Glenford, Alberta. His first major movie part was in the 1939 film, \"Heaven with a Barbed Wire Fence\". Top Hollywood director John Cromwell was impressed enough with his work to borrow him from Columbia for the independently produced drama, \"So Ends Our Night\" (1941), where Ford delivered a poignant portrayal of a 19-year-old German exile on the run in Nazi-occupied Europe. Working with Academy Award-winning Fredric March and wooing (onscreen) 30-year-old Margaret Sullavan, recently nominated for an Oscar, Ford's shy, ardent young refugee riveted attention even in such stellar company. \"Glenn Ford, a" 92 | }, 93 | { 94 | "title": "Glenn Ford", 95 | "text": "were Westerns. He suggested doing a Western series, instead, which resulted in the \"modern-day Western\" series, \"Cade's County\". Ford played southwestern Sheriff Cade for one season (1971\u20131972) in a mix of police mystery and western drama. In \"The Family Holvak\" (1975\u20131976), Ford portrayed a Depression-era preacher in a family drama, reprising the same character he had played in the TV film, \"The Greatest Gift\". In 1978 Ford was host, presenter and narrator of the disaster documentary series 'When Havoc Struck'. In 1981, Ford co-starred with Melissa Sue Anderson in the slasher film \"Happy Birthday to Me\". In 1991, Ford agreed" 96 | }, 97 | { 98 | "title": "CBS Thursday Night Movie", 99 | "text": "Night Movie\" opened its fall schedule with the premiere of a low-budget, made-for-TV movie, rather than a proven Hollywood blockbuster guaranteed to lure mass viewership, it became CBS's way of declaring its commitment to product that, although cheaply manufactured, was nevertheless new and topical. In this case, the movie was \"The Brotherhood of the Bell\", and the film's star was Glenn Ford, a movie actor who had never appeared in a television-film. In fact, before shooting on the project even began, Ford had been warned by friends in the industry that he would hate the experience. Instead, the actor reported" 100 | }, 101 | { 102 | "title": "The Trouble with Girls (film) ", 103 | "text": "with Charlene, but when she refuses to give in, he deceives her and uses the local police force to be sure that she must leave on the train with the rest of the troupe. Cast notes In June 1959 it was announced that Don Mankiewicz would write a screenplay of an unpublished story by Mauri Grashin, Day Keene, and Dwight Babcock. By December 1960, with the project titled \"Chautauqua\", MGM was ready to make the film with Glenn Ford. Rumours circulating in Hollywood at the time stated that Presley would co-star with Ford, Hope Lange, and Arthur O'Connell, but nothing" 104 | }, 105 | { 106 | "title": "Trouble in the Glen", 107 | "text": "Mel Ferrer. It was Orson Welles' fifth British movie in six months. Filming started 15 December 1953. The film received very poor reviews. 
Trouble in the Glen Trouble in the Glen is a 1954 British comedy film directed by Herbert Wilcox and starring Margaret Lockwood, Orson Welles, Forrest Tucker and Victor McLaglen. It is loosely based on Maurice Walsh's 1950 novel of the same name. It was filmed in Trucolor for Republic Pictures. After moving from South America to the Scottish Highlands, millionaire Sanin Cejador y Mengues (Welles) reassumes the title of laird of Glen Easan, which he inherited from" 108 | } 109 | ] 110 | } 111 | ] 112 | } 113 | -------------------------------------------------------------------------------- /prompts/asqa_demo.json: -------------------------------------------------------------------------------- 1 | { 2 | "qa_pairs": [ 3 | { 4 | "context": "Sounds of Silence is the second studio album by Simon & Garfunkel, released on January 17, 1966. The album's title is a slight modification of the title of the duo's first major hit, \"The Sound of Silence\", which originally was released as \"The Sounds of Silence\". The song had earlier been released in an acoustic version on the album \"Wednesday Morning, 3 A.M.\", and later on the soundtrack to the movie \"The Graduate\". Without the knowledge of Paul Simon or Art Garfunkel, electric guitars, bass and drums were overdubbed by Columbia Records staff producer Tom Wilson on June 15, 1965. This new version was released as a single in September 1965, and opens the album.", 5 | "question": "Who is the original artist of sound of silence, the song, released in 1964?", 6 | "short_answers": [ 7 | "Simon & Garfunkel", 8 | "Paul Simon and Art Garfunkel", 9 | "Art Garfunkel", 10 | "Paul Simon" 11 | ], 12 | "wikipage": "Sounds of Silence" 13 | }, 14 | { 15 | "context": "Sounds of Silence is the second studio album by Simon & Garfunkel, released on January 17, 1966. The album's title is a slight modification of the title of the duo's first major hit, \"The Sound of Silence\", which originally was released as \"The Sounds of Silence\". The song had earlier been released in an acoustic version on the album \"Wednesday Morning, 3 A.M.\", and later on the soundtrack to the movie \"The Graduate\". Without the knowledge of Paul Simon or Art Garfunkel, electric guitars, bass and drums were overdubbed by Columbia Records staff producer Tom Wilson on June 15, 1965. This new version was released as a single in September 1965, and opens the album.", 16 | "question": "Who is the original artist of sound of silence, the album?", 17 | "short_answers": [ 18 | "Simon & Garfunkel", 19 | "Paul Simon and Art Garfunkel", 20 | "Art Garfunkel", 21 | "Paul Simon" 22 | ], 23 | "wikipage": "Sounds of Silence" 24 | }, 25 | { 26 | "context": "\"Sound of Silence\" is a song performed by Australian recording artist Dami Im. Written by Anthony Egizii and David Musumeci of DNA Songs, it is best known as Australia's entry at the Eurovision Song Contest 2016 which was held in Stockholm, Sweden, where it finished 2nd, receiving a total of 511 points. The song also won the Marcel Bezen\u00e7on Award in the composer category. The song was leaked on 10 March 2016, one day before its initial release date. 
It is Dami Im's fourth Australian top 20 hit and worldwide, it reached the top 40 in more than six countries after the Eurovision Song Contest 2016 Final.", 27 | "question": "Who is the original artist of sound of silence, the song, released in 2016?", 28 | "short_answers": [ 29 | "Dami Im" 30 | ], 31 | "wikipage": "Sound of Silence (Dami Im song)" 32 | } 33 | ], 34 | "wikipages": [ 35 | { 36 | "title": "The Sound of Silence", 37 | "url": "https://en.wikipedia.org/wiki/The%20Sound%20of%20Silence" 38 | }, 39 | { 40 | "title": "Sounds of Silence", 41 | "url": "https://en.wikipedia.org/wiki/Sounds%20of%20Silence" 42 | }, 43 | { 44 | "title": "Sound of Silence (Dami Im song)", 45 | "url": "https://en.wikipedia.org/wiki/Sound%20of%20Silence%20%28Dami%20Im%20song%29" 46 | } 47 | ], 48 | "annotations": [ 49 | { 50 | "knowledge": [ 51 | { 52 | "content": "Wednesday Morning, 3 A.M. was re-released in January 1966 (to capitalize on their newly found radio success because of the overdubbing of the song \"The Sound of Silence\" in June 1965, adding electric guitars, bass guitar and a drum kit), and reached No. 30 on the Billboard 200...The album was produced by Tom Wilson and engineered by Roy Halee between March 10\u201331, 1964.", 53 | "wikipage": "Wednesday Morning, 3 A.M." 54 | } 55 | ], 56 | "long_answer": " The original artist of the song sound of silence released in 1966 is Paul Simon and Art Garfunkel. The song had earlier been released in an acoustic version on the album \"Wednesday Morning, 3 A.M.\" which had been produced in 1964. In 2016, Australian recording artist Dami Im recorded a different song by the same name." 57 | }, 58 | { 59 | "knowledge": [ 60 | { 61 | "content": "A studio audition led to the duo signing a record deal with Columbia Records, and the original acoustic version of the song was recorded in March 1964 at Columbia Studios in New York City and included on their debut album, Wednesday Morning, 3 A.M.. Released on October 19, 1964,[2] the album was a commercial failure and led to the duo disbanding; Simon returned to England, and Art Garfunkel to his studies at Columbia University.", 62 | "wikipage": "The Sound of Silence" 63 | } 64 | ], 65 | "long_answer": "There are several songs with the title \"Sound of Silence\". Sounds of Silence is the second studio album by Simon & Garfunkel, released on January 17, 1966. The album's title is a slight modification of the title of the duo's first major hit, \"The Sound of Silence\", which was recorded in March 1964 and originally was released as \"The Sounds of Silence\". Another \"Sound of Silence\" is a song performed by Australian recording artist Dami Im, and is best known as Australia's entry at the Eurovision Song Contest 2016." 66 | } 67 | ], 68 | "sample_id": "7089015503030534342", 69 | "question": "Who is the original artist of sound of silence?", 70 | "docs": [ 71 | { 72 | "id": "2627084", 73 | "title": "The Sound of Silence", 74 | "text": "The Sound of Silence \"The Sound of Silence\", originally \"The Sounds of Silence\", is a song by the American music duo Simon & Garfunkel. The song was written by Paul Simon over a period of several months in 1963 and 1964. A studio audition led to the duo signing a record deal with Columbia Records, and the song was recorded in March 1964 at Columbia Studios in New York City for inclusion on their debut album, \"Wednesday Morning, 3 A.M.\". 
Released in October 1964, the album was a commercial failure and led to the duo breaking apart, with Paul Simon", 75 | "score": 0.80078125, 76 | "summary": "The original artist of \"The Sound of Silence\" is Simon & Garfunkel.", 77 | "extraction": "\"The Sound of Silence\" is a song by the American music duo Simon & Garfunkel. The song was written by Paul Simon.", 78 | "summary_no_sub": "Who is the original artist of sound of silence? The original artist of \"The Sound of Silence\" is Simon & Garfunkel.", 79 | "summary_use_sub": "\"The Sound of Silence\" is a song by Simon & Garfunkel, written by Paul Simon in 1963-64. It was recorded in March 1964 for their debut album \"Wednesday Morning, 3 A.M.\" which was released in October 1964.", 80 | "answer_no_sub": "The original artist of \"The Sound of Silence\" is Simon & Garfunkel.", 81 | "answer_use_sub": "Original artist of sound of silence, the song, released in 1964: Simon & Garfunkel Original artist of sound of silence, the album: Simon & Garfunkel's debut album \"Wednesday Morning, 3 A.M.\" Original artist of sound of silence, the song, released in 2016: irrelevant" 82 | }, 83 | { 84 | "id": "10996254", 85 | "title": "Sounds of Silence", 86 | "text": "Sounds of Silence Sounds of Silence is the second studio album by Simon & Garfunkel, released on January 17, 1966. The album's title is a slight modification of the title of the duo's first major hit, \"The Sound of Silence\", which originally was released as \"The Sounds of Silence\". The song had earlier been released in an acoustic version on the album \"Wednesday Morning, 3 A.M.\", and later on the soundtrack to the movie \"The Graduate\". Without the knowledge of Paul Simon or Art Garfunkel, electric guitars, bass and drums were overdubbed by Columbia Records staff producer Tom Wilson on", 87 | "score": 0.78955078125, 88 | "summary": "The document mentions that \"The Sound of Silence\" is a song by Simon & Garfunkel and was released in an acoustic version on the album \"Wednesday Morning, 3 A.M.\" before being overdubbed with electric guitars, bass, and drums. However, it does not state the original artist of the song.", 89 | "extraction": "\"The original artist of Sound of Silence is Simon & Garfunkel.\"", 90 | "summary_no_sub": "Who is the original artist of sound of silence? The document mentions that \"The Sound of Silence\" was a major hit by Simon & Garfunkel, but does not mention the original artist. Therefore, the answer is \"irrelevant\".", 91 | "summary_use_sub": "Who is the original artist of sound of silence, the song, released in 1964? - The original artist of \"The Sound of Silence\" song released in 1964 is Simon & Garfunkel. Who is the original artist of sound of silence, the album? - The original artist of \"Sounds of Silence\" album is Simon & Garfunkel. Who is the original artist of sound of silence, the song, released in 2016? - irrelevant.", 92 | "answer_no_sub": "The original artist of \"The Sound of Silence\" is Simon & Garfunkel.", 93 | "answer_use_sub": "Who is the original artist of sound of silence, the song, released in 1964? - The original artist of \"The Sound of Silence\" song released in 1964 is Simon & Garfunkel. Who is the original artist of sound of silence, the album? - The album \"Sounds of Silence\" is also by Simon & Garfunkel. Who is the original artist of sound of silence, the song, released in 2016? - Irrelevant." 94 | }, 95 | { 96 | "id": "10996258", 97 | "title": "Sounds of Silence", 98 | "text": "Angeles, California. 
\"The Sound of Silence\" (electric overdubs) personnel \"The Sound of Silence\" overdubs were recorded at Columbia's \"Studio A\" at 799 Seventh Avenue near 52nd Street by Columbia Records staff producer Tom Wilson on June 15, 1965. Sounds of Silence Sounds of Silence is the second studio album by Simon & Garfunkel, released on January 17, 1966. The album's title is a slight modification of the title of the duo's first major hit, \"The Sound of Silence\", which originally was released as \"The Sounds of Silence\". The song had earlier been released in an acoustic version on the album", 99 | "score": 0.7705078125, 100 | "summary": "\"The Sound of Silence\" was recorded by Simon & Garfunkel and produced by Tom Wilson on June 15, 1965. The song was originally released as \"The Sounds of Silence\" and later appeared on the album \"Sounds of Silence\" in January 1966.", 101 | "extraction": "\"The original artist of Sound of Silence is Simon & Garfunkel.\"", 102 | "summary_no_sub": "The document provides information about the recording of \"The Sound of Silence\" in 1965 and the release of the album \"Sounds of Silence\" by Simon & Garfunkel in 1966. However, it does not mention the original artist of \"The Sound of Silence.\"", 103 | "summary_use_sub": "The document provides information about the recording of \"The Sound of Silence\" song in 1965 and the release of the \"Sounds of Silence\" album in 1966 by Simon & Garfunkel. It does not mention any release of the song in 2016. The original artist of the song and album is Simon & Garfunkel.", 104 | "answer_no_sub": "The original artist of \"The Sound of Silence\" is Simon & Garfunkel.", 105 | "answer_use_sub": "Original artist of sound of silence, the song, released in 1964 is irrelevant as the document only mentions the recording of the electric overdubs of the song in 1965. The original artist of sound of silence, the album, is Simon & Garfunkel. The original artist of sound of silence, the song, released in 2016 is irrelevant as the document only mentions the original release of the song in an acoustic version on an album." 106 | }, 107 | { 108 | "id": "364634", 109 | "title": "Simon & Garfunkel", 110 | "text": "Simon & Garfunkel Simon & Garfunkel were an American folk rock duo consisting of singer-songwriter Paul Simon and singer Art Garfunkel. They were one of the bestselling music groups of the 1960s and became counterculture icons of the decade's social revolution, alongside artists such as the Beatles, the Beach Boys, and Bob Dylan. Their biggest hits\u2014including \"The Sound of Silence\" (1964), \"Mrs. Robinson\" (1968), \"The Boxer\" (1969), and \"Bridge over Troubled Water\" (1970)\u2014reached number one on singles charts worldwide. The duo met in elementary school in Queens, New York, in 1953, where they learned to harmonize together and began writing", 111 | "score": 0.7490234375, 112 | "summary": "The document mentions Simon & Garfunkel as an American folk rock duo consisting of Paul Simon and Art Garfunkel. They were famous for their hits, including \"The Sound of Silence\" (1964), which reached number one on singles charts worldwide.", 113 | "extraction": "\"The original artist of Sound of Silence is Simon & Garfunkel, an American folk rock duo consisting of singer-songwriter Paul Simon and singer Art Garfunkel.\"", 114 | "summary_no_sub": "The document is irrelevant to the question.", 115 | "summary_use_sub": "Simon & Garfunkel were an American folk rock duo consisting of Paul Simon and Art Garfunkel. 
They released the song \"The Sound of Silence\" in 1964, which became one of their biggest hits. The document does not mention the album or the artist who released the song in 2016.", 116 | "answer_no_sub": "The original artist of \"Sound of Silence\" is Simon & Garfunkel.", 117 | "answer_use_sub": "Who is the original artist of sound of silence, the song, released in 1964? - Simon & Garfunkel Who is the original artist of sound of silence, the album? - irrelevant Who is the original artist of sound of silence, the song, released in 2016? - irrelevant" 118 | }, 119 | { 120 | "id": "8156383", 121 | "title": "Simon & Garfunkel discography", 122 | "text": "Simon & Garfunkel discography Simon & Garfunkel, an American singer-songwriter duo, has released five studio albums, fifteen compilation albums, four live albums, one extended play, 26 singles, one soundtrack, and four box sets since 1964. Paul Simon and Art Garfunkel first formed a duo in 1957 as Tom & Jerry, before separating and later reforming as Simon & Garfunkel. Simon & Garfunkel's debut album, \"Wednesday Morning, 3 A.M.\", was released on October 19, 1964. Initially a flop, it was re-released two years later with the new version of the single \"The Sound of Silence\", which was overdubbed with electric instruments", 123 | "score": 0.72314453125, 124 | "summary_no_sub": "Who is the original artist of sound of silence? Simon & Garfunkel, an American singer-songwriter duo, released \"The Sound of Silence\" in 1964.", 125 | "summary_use_sub": "Simon & Garfunkel, a singer-songwriter duo, released \"The Sound of Silence\" as a single in 1964. The album it was featured on was \"Wednesday Morning, 3 A.M.\" There is no mention of a release of the song in 2016.", 126 | "answer_no_sub": "The original artist of \"The Sound of Silence\" is Simon & Garfunkel.", 127 | "answer_use_sub": "Original artist of sound of silence, the song, released in 1964: Simon & Garfunkel Original artist of sound of silence, the album: Simon & Garfunkel (the album is not mentioned in the document, but the song was included in their debut album \"Wednesday Morning, 3 A.M.\") Original artist of sound of silence, the song, released in 2016: irrelevant" 128 | } 129 | ], 130 | "answer": "There are several songs with the title \"Sound of Silence\". Sounds of Silence is the second studio album by Simon & Garfunkel, released on January 17, 1966. The album's title is a slight modification of the title of the duo's first major hit, \"The Sound of Silence\", which was recorded in March 1964 and originally was released as \"The Sounds of Silence\". 
Another \"Sound of Silence\" is a song performed by Australian recording artist Dami Im, and is best known as Australia's entry at the Eurovision Song Contest 2016.", 131 | "judgment": "[YES]" 132 | } -------------------------------------------------------------------------------- /openai_account_manager.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from typing import Union 3 | import openai 4 | import fcntl 5 | import threading 6 | import tqdm 7 | from multi_thread_openai_api_call import MyThread 8 | 9 | logger = logging.getLogger() 10 | 11 | 12 | class OpenAI_Account_Manager: 13 | _instance = None 14 | 15 | def __new__(cls, *args, **kwargs): 16 | if cls._instance is None: 17 | cls._instance = object.__new__(cls) 18 | 19 | return cls._instance 20 | 21 | def __init__(self, used_account_fp, all_account_fp): 22 | self.used_account_fp = used_account_fp 23 | self.all_account_fp = all_account_fp 24 | 25 | used_account_f = open(used_account_fp, 'r') 26 | used_account = list(map(lambda x: x.strip().split('----'), used_account_f.readlines())) 27 | used_account_f.close() 28 | 29 | all_account_f = open(all_account_fp, 'r') 30 | all_account = list(map(lambda x: x.strip().split('----'), all_account_f.readlines())) 31 | all_account_f.close() 32 | 33 | used_account_key = list(map(lambda x: x[-1], used_account)) 34 | 35 | all_account = list(filter(lambda x: x[-1] not in used_account_key, all_account)) 36 | 37 | self.used_account = used_account 38 | self.all_account = all_account 39 | 40 | openai.api_key = self.all_account[0][-1] 41 | logger.info( 42 | 'successfully build OpenAI_Account_Manager, now the number of available accounts is {} and now api_key is {}'.format( 43 | len(self.all_account), self.all_account[0][-1])) 44 | 45 | def use_next_account(self): 46 | self.used_account.append(self.all_account[0]) 47 | del self.all_account[0] 48 | with open(self.used_account_fp, 'a') as tmp_used_account_f: 49 | fcntl.fcntl(tmp_used_account_f.fileno(), fcntl.LOCK_EX) 50 | print('----'.join(self.used_account[-1]), file=tmp_used_account_f) 51 | logger.info( 52 | 'account:[{}, {}, {}] runs out. 
so use next.'.format(self.used_account[-1][0], self.used_account[-1][1], 53 | self.used_account[-1][2])) 54 | openai.api_key = self.all_account[0][-1] 55 | 56 | 57 | class OpenAI_Account_Manager_MultiThread: 58 | _instance = None 59 | 60 | def __new__(cls, *args, **kwargs): 61 | if cls._instance is None: 62 | cls._instance = object.__new__(cls) 63 | 64 | return cls._instance 65 | 66 | def __init__(self, used_account_fp, all_account_fp): 67 | self.now_account_idx = 0 68 | 69 | 70 | self.used_account_fp = used_account_fp 71 | self.all_account_fp = all_account_fp 72 | 73 | used_account_f = open(used_account_fp, 'r') 74 | used_account = list(map(lambda x: x.strip().split('----'), used_account_f.readlines())) 75 | used_account_f.close() 76 | 77 | all_account_f = open(all_account_fp, 'r') 78 | all_account = list(map(lambda x: x.strip().split('----'), all_account_f.readlines())) 79 | all_account_f.close() 80 | 81 | used_account_key = list(map(lambda x: x[-1], used_account)) 82 | 83 | all_account = list(filter(lambda x: x[-1] not in used_account_key, all_account)) 84 | 85 | self.used_account = used_account 86 | self.all_account = all_account 87 | self.using_account = [] 88 | 89 | # openai.api_key = self.all_account[0][-1] 90 | logger.info( 91 | 'successfully build OpenAI_Account_Manager, now the number of available accounts is {} and now api_key is {}'.format( 92 | len(self.all_account), self.all_account[0][-1])) 93 | self.next_account_lock = threading.Lock() 94 | self.empty_account_lock = threading.Lock() 95 | 96 | def get_next_account(self, thread_id, last_empty_account=None): 97 | with self.next_account_lock: 98 | result = self.all_account[0] 99 | self.using_account.append(self.all_account[0]) 100 | del self.all_account[0] 101 | if last_empty_account != None: 102 | self.record_empty_account(last_empty_account) 103 | logger.info('Thread {} account: [{}, {}, {}] ' 104 | 'runs out'.format(thread_id, 105 | self.used_account[-1][0], 106 | self.used_account[-1][1], 107 | self.used_account[-1][2])) 108 | logger.info('Thread {} use next account: [{}, {}, {}] ' 109 | .format(thread_id, result[0], 110 | result[1], 111 | result[2])) 112 | else: 113 | logger.info('Thread {} first account: [{}, {}, {}] ' 114 | .format(thread_id, result[0], 115 | result[1], 116 | result[2])) 117 | # openai.api_key = self.all_account[0][-1] 118 | return result 119 | 120 | def record_empty_account(self, empty_account): 121 | with self.empty_account_lock: 122 | self.used_account.append(empty_account) 123 | with open(self.used_account_fp, 'a') as tmp_used_account_f: 124 | fcntl.fcntl(tmp_used_account_f.fileno(), fcntl.LOCK_EX) 125 | print('----'.join(self.used_account[-1]), file=tmp_used_account_f) 126 | 127 | 128 | class OpenAI_Account_Manager_MultiThread_One_Acount_Many_Used: 129 | ''' 130 | OpenAI_Account_Manager_MultiThread_One_Acount_Many_Used: when OpenAI_Account_Manager_MultiThread uses one account for one thread, 131 | so the number of threads is limited by the number of accounts. 132 | OpenAI_Account_Manager_MultiThread_One_Acount_Many_Used support multiple threads using one account. 133 | ''' 134 | _instance = None 135 | 136 | def __new__(cls, *args, **kwargs): 137 | if cls._instance is None: 138 | cls._instance = object.__new__(cls) 139 | 140 | return cls._instance 141 | 142 | 143 | def __init__(self, used_account_fp: str, all_account_fp: str, limit_account_num: int=-1) -> None: 144 | """Class init 145 | Args 146 | ---- 147 | used_account_fp: str 148 | Path to file containing used OpenAI accounts. 
149 | all_account_fp: str 150 | Path to file containing all OpenAI accounts. 151 | limit_account_num: int=-1 152 | Number of available accounts. 153 | """ 154 | if hasattr(self, 'inited'): 155 | return 156 | self.inited = 1 157 | self.now_account_idx = 0 158 | 159 | self.used_account_fp = used_account_fp 160 | self.all_account_fp = all_account_fp 161 | 162 | used_account_f = open(used_account_fp, 'r') 163 | used_account = list(map(lambda x: x.strip().split('----'), used_account_f.readlines())) 164 | used_account_f.close() 165 | 166 | all_account_f = open(all_account_fp, 'r') 167 | all_account = list(map(lambda x: x.strip().split('----'), all_account_f.readlines())) 168 | all_account_f.close() 169 | 170 | used_account_key = [] 171 | for account in used_account: 172 | if len(account) == 4: 173 | used_account_key.append(account[-2]) 174 | else: 175 | used_account_key.append(account[-1]) 176 | 177 | # Keep only usable account. 178 | 179 | all_account = list(filter(lambda x: x[-1] not in used_account_key, all_account)) 180 | temp_all_account = [] 181 | for account in all_account: 182 | if len(account) == 4 and account[-2] not in used_account_key: 183 | temp_all_account.append(account) 184 | elif len(account) == 3 and account[-1] not in used_account_key: 185 | temp_all_account.append(account) 186 | else: 187 | raise Exception 188 | all_account = temp_all_account 189 | 190 | if limit_account_num > 0: 191 | all_account = all_account[:limit_account_num] 192 | 193 | self.used_account = used_account 194 | self.used_account_key = set(used_account_key) 195 | self.all_account = all_account 196 | 197 | self.using_account = [] 198 | self.thread_to_account = {} 199 | logger.info('successfully build OpenAI_Account_Manager, now the number of available accounts is {}'.format(len(self.all_account))) 200 | 201 | self.next_account_lock = threading.Lock() 202 | self.empty_account_lock = threading.Lock() 203 | 204 | 205 | def get_next_account(self, thread_id, last_empty_account=None): 206 | with self.next_account_lock: 207 | available_num = self.check_available_account_num() 208 | if available_num == 0: 209 | logger.info('all accounts used, so..') 210 | logger.info('all accounts used, so..') 211 | logger.info('all accounts used, so..') 212 | logger.info('all accounts used, so..') 213 | logger.info('all accounts used, so..') 214 | else: 215 | logger.info('now available accounts : {}'.format(available_num)) 216 | 217 | while True: 218 | result = self.all_account[self.now_account_idx] 219 | if result[-1] in self.used_account_key or result[-2] in self.used_account_key: 220 | self.now_account_idx += 1 221 | self.now_account_idx = self.now_account_idx % len(self.all_account) 222 | else: 223 | break 224 | 225 | result = self.all_account[self.now_account_idx] 226 | self.now_account_idx += 1 227 | self.now_account_idx = self.now_account_idx % len(self.all_account) 228 | 229 | if last_empty_account != None: 230 | self.record_empty_account(last_empty_account) 231 | logger.info('Thread {} account: [{}, {}, {}] ' 232 | 'runs out'.format(thread_id, 233 | self.used_account[-1][0], 234 | self.used_account[-1][1], 235 | self.used_account[-1][2])) 236 | logger.info('Thread {} use next account: [{}, {}, {}] ' 237 | .format(thread_id, result[0], 238 | result[1], 239 | result[2])) 240 | else: 241 | logger.info('Thread {} first account: [{}, {}, {}] ' 242 | .format(thread_id, result[0], 243 | result[1], 244 | result[2])) 245 | return result 246 | 247 | 248 | def record_empty_account(self, empty_account): 249 | with 
self.empty_account_lock: 250 | self.used_account.append(empty_account) 251 | if len(empty_account) == 4: 252 | self.used_account_key.add(empty_account[-2]) 253 | else: 254 | self.used_account_key.add(empty_account[-1]) 255 | with open(self.used_account_fp, 'a') as tmp_used_account_f: 256 | fcntl.fcntl(tmp_used_account_f.fileno(), fcntl.LOCK_EX) 257 | print('----'.join(self.used_account[-1]), file=tmp_used_account_f) 258 | 259 | 260 | def check_available_account_num(self): 261 | available_num = 0 262 | for account in self.all_account: 263 | if len(account) == 4 and account[-2] not in self.used_account_key: 264 | available_num += 1 265 | elif len(account) == 3 and account[-1] not in self.used_account_key: 266 | available_num += 1 267 | else: 268 | raise Exception 269 | return available_num 270 | 271 | 272 | def get_account_manager( 273 | account_file: str, 274 | used_file: str, 275 | multi_thread: bool=False, 276 | limit_account_num: int=-1 277 | ) -> Union[OpenAI_Account_Manager_MultiThread_One_Acount_Many_Used, OpenAI_Account_Manager]: 278 | """Get an instance of managing openai accounts. 279 | Args 280 | ---- 281 | account_file: str 282 | The file containing available username, password and key of OpenAI API account. 283 | used_file: str 284 | The file containing unavailable username, password and key of OpenAI API account. 285 | multi_thread: bool=False 286 | Whether to use multi-thread or not. 287 | limit_account_num: int=-1 288 | Number of available accounts. 289 | 290 | Returns 291 | ------- 292 | result: Union[OpenAI_Account_Manager_MultiThread_One_Acount_Many_Used, OpenAI_Account_Manager] 293 | An instance of class OpenAI_Account_Manager_MultiThread_One_Acount_Many_Used or OpenAI_Account_Manager 294 | """ 295 | if multi_thread: 296 | result = OpenAI_Account_Manager_MultiThread_One_Acount_Many_Used(account_file, used_file, limit_account_num=limit_account_num) 297 | else: 298 | result = OpenAI_Account_Manager(account_file, used_file) 299 | return result 300 | 301 | 302 | class OpenAI_API_inp_Manager_MultiThread: 303 | def __init__(self, idx_x_list_to_decode, inference_hyper_parameter): 304 | 305 | self.idx_x_list_to_decode = idx_x_list_to_decode 306 | 307 | self.inp_lock = threading.Lock() 308 | self.progress_index = 0 309 | 310 | assert type(inference_hyper_parameter) == type([]) 311 | assert type(inference_hyper_parameter[0]) == type({}) 312 | 313 | if len(inference_hyper_parameter) == 1: 314 | inference_hyper_parameter = inference_hyper_parameter * len(self.idx_x_list_to_decode) 315 | 316 | assert len(self.idx_x_list_to_decode) == len(inference_hyper_parameter), \ 317 | 'idx_x_list_to_decode:{}, inference_hyper_parameter:{}' \ 318 | .format(len(idx_x_list_to_decode), len(inference_hyper_parameter)) 319 | 320 | self.inference_hyper_parameter = inference_hyper_parameter 321 | 322 | for i in range(len(inference_hyper_parameter)): 323 | assert 'max_tokens' in inference_hyper_parameter[i], "{} th inference_hyper_parameter has no max_length" 324 | 325 | 326 | def get_next_gpt_idx_inp(self): 327 | with self.inp_lock: 328 | if self.progress_index < len(self.idx_x_list_to_decode): 329 | tmp_inp = self.idx_x_list_to_decode[self.progress_index] 330 | tmp_hyper_parameter = self.inference_hyper_parameter[self.progress_index] 331 | self.progress_index += 1 332 | return {'inp': tmp_inp, 'hyper_parameter': tmp_hyper_parameter} 333 | else: 334 | return None 335 | 336 | 337 | def openai_llm_generate_multi_thread(eval_data_openai_queries, llm, num_threads, use_tqdm,turbo_system_message=None): 338 | # 
hyper_parameter = None 339 | x_list_to_decode = list(map(lambda x:x['input'],eval_data_openai_queries)) 340 | max_tokens = list(map(lambda x:x['max_tokens'],eval_data_openai_queries)) 341 | idx_x_list_to_decode = list(enumerate(x_list_to_decode)) 342 | # eval_data_openai_queries = list(enumerate(eval_data_openai_queries)) 343 | hyper_parameter = list(map(lambda x:{'max_tokens':x},max_tokens)) 344 | 345 | inp_manager = OpenAI_API_inp_Manager_MultiThread(idx_x_list_to_decode, hyper_parameter) 346 | thread_list = [] 347 | account_manager = get_account_manager('openai_account_files/used.txt', 'openai_account_files/accounts.txt', multi_thread=True) 348 | if use_tqdm: 349 | pbar = tqdm.tqdm(total=len(idx_x_list_to_decode)) 350 | else: 351 | pbar = None 352 | for i in range(num_threads): 353 | thread_list.append(MyThread(i, llm, account_manager, inp_manager, 1, pbar, turbo_system_message)) 354 | 355 | for t in thread_list: 356 | t.start() 357 | for i, t in enumerate(thread_list): 358 | t.join() 359 | 360 | responses_with_idx = [] 361 | 362 | for t in thread_list: 363 | responses_with_idx.extend(t.responses_with_idx) 364 | 365 | responses_with_idx.sort(key=lambda x: x[0]) 366 | 367 | responses = list(map(lambda x: x[1], responses_with_idx)) 368 | return responses 369 | -------------------------------------------------------------------------------- /prompts/asqa_default.json: -------------------------------------------------------------------------------- 1 | { 2 | "instruction": "Instruction: Write an accurate, engaging, and concise answer for the given question using only the provided search results (some of which might be irrelevant) and cite them properly. Use an unbiased and journalistic tone. Always cite for any factual claim. When citing several search results, use [1][2][3]. Cite at least one document and at most three documents in each sentence. If multiple documents support the sentence, only cite a minimum sufficient subset of the documents.", 3 | "demo_sep": "\n\n\n", 4 | "demo_prompt": "{INST}\n\nQuestion: {Q}\n\n{D}\nAnswer: {A}", 5 | "doc_prompt": "Document [{ID}](Title: {T}): {P}\n", 6 | "demos": [ 7 | { 8 | "question": "Which is the most rainy place on earth?", 9 | "answer": "Several places on Earth claim to be the most rainy, such as Lloró, Colombia, which reported an average annual rainfall of 12,717 mm between 1952 and 1989, and López de Micay, Colombia, which reported an annual 12,892 mm between 1960 and 2012 [3]. However, the official record is held by Mawsynram, India with an average annual rainfall of 11,872 mm [3], although nearby town Sohra, India, also known as Cherrapunji, holds the record for most rain in a calendar month for July 1861 and most rain in a year from August 1860 to July 1861 [1].", 10 | "docs": [ 11 | { 12 | "title": "Cherrapunji", 13 | "text": "Cherrapunji Cherrapunji (; with the native name Sohra being more commonly used, and can also be spelled Cherrapunjee or Cherrapunji) is a subdivisional town in the East Khasi Hills district in the Indian state of Meghalaya. It is the traditional capital of aNongkhlaw \"hima\" (Khasi tribal chieftainship constituting a petty state), both known as Sohra or Churra. Cherrapunji has often been credited as being the wettest place on Earth, but for now nearby Mawsynram currently holds that distinction. 
Cherrapunji still holds the all-time record for the most rainfall in a calendar month for July 1861 and most rain in a year from August 1860 to July 1861, however: it received in" 14 | }, 15 | { 16 | "title": "Cherrapunji", 17 | "text": "Radio relay station known as Akashvani Cherrapunji. It broadcasts on FM frequencies. Cherrapunji Cherrapunji (; with the native name Sohra being more commonly used, and can also be spelled Cherrapunjee or Cherrapunji) is a subdivisional town in the East Khasi Hills district in the Indian state of Meghalaya. It is the traditional capital of aNongkhlaw \"hima\" (Khasi tribal chieftainship constituting a petty state), both known as Sohra or Churra. Cherrapunji has often been credited as being the wettest place on Earth, but for now nearby Mawsynram currently holds that distinction. Cherrapunji still holds the all-time record for the most rainfall" 18 | }, 19 | { 20 | "title": "Mawsynram", 21 | "text": "Mawsynram Mawsynram () is a village in the East Khasi Hills district of Meghalaya state in north-eastern India, 65 kilometres from Shillong. Mawsynram receives one of the highest rainfalls in India. It is reportedly the wettest place on Earth, with an average annual rainfall of 11,872 mm, but that claim is disputed by Lloró, Colombia, which reported an average yearly rainfall of 12,717 mm between 1952 and 1989 and López de Micay, also in Colombia, which reported an annual 12,892 mm per year between 1960 and 2012. According to the \"Guinness Book of World Records\", Mawsynram received of rainfall in 1985. Mawsynram is located at 25° 18′" 22 | }, 23 | { 24 | "title": "Earth rainfall climatology", 25 | "text": "Pacific Northwest, and the Sierra Nevada range are the wetter portions of the nation, with average rainfall exceeding per year. The drier areas are the Desert Southwest, Great Basin, valleys of northeast Arizona, eastern Utah, central Wyoming, eastern Oregon and Washington and the northeast of the Olympic Peninsula. The Big Bog on the island of Maui receives, on average, every year, making it the wettest location in the US, and all of Oceania. The annual average rainfall maxima across the continent lie across the northwest from northwest Brazil into northern Peru, Colombia, and Ecuador, then along the Atlantic coast of" 26 | }, 27 | { 28 | "title": "Going to Extremes", 29 | "text": "in the world. Oymyakon in Siberia, where the average winter temperature is −47 °F (− 44 °C). Arica in Chile, where there had been fourteen consecutive years without rain. Fog is the only local source of water. Mawsynram in India, where average annual rainfall is 14 meters, falling within a four-month period in the monsoon season. The rainfall is approximately equal to that of its neighbor Cherrapunji. Dallol in Ethiopia, known as the 'Hell-hole of creation' where the temperature averages 94 °F (34 °C) over the year. In his second series, Middleton visited places without permanent towns, locations where \"survival\"" 30 | } 31 | ] 32 | }, 33 | { 34 | "question": "When did the us break away from england?", 35 | "answer": "The United States took the first step towards gaining independence from Great Britain when it declared independence from Great Britain on July 2, 1776 (although the event is now commemorated on July 4, 1776, the date when the Declaration of Independence was officially adopted by Congress) [2]. 
The Treaty of Paris was later signed on September 3, 1783, formally separating the United States from the British Empire [3].", 36 | "docs": [ 37 | { 38 | "title": "United States withdrawal from Saudi Arabia", 39 | "text": "United States withdrawal from Saudi Arabia Beginning during Operation Desert Shield in August 1990, while preparing for the Gulf War, the United States sent a large troop contingent to Saudi Arabia. After the war, remnant troops, primarily U.S. Air Force personnel, augmented by a smaller number of coordinating and training personnel from the U.S. Navy, U.S. Army and U.S. Marine Corps remained in Saudi Arabia under the aegis of Joint Task Force Southwest Asia (JTF-SWA), as part of Operation Southern Watch (OSW). The United Kingdom and France also maintained a small contingent of Royal Air Force and French Air Force" 40 | }, 41 | { 42 | "title": "Decolonization of the Americas", 43 | "text": "and France has fully \"integrated\" most of its former colonies as fully constituent \"departments\" of France. The United States of America declared independence from Great Britain on July 2, 1776 (although the event is now commemorated on July 4, the date when the Declaration of Independence was officially adopted by Congress), in so doing becoming the first independent, foreign-recognized nation in the Americas and the first European colonial entity to break from its mother country. Britain formally acknowledged American independence in 1783 after its defeat in the American Revolutionary War. Although initially occupying only the land east of the Mississippi" 44 | }, 45 | { 46 | "title": "American Revolution", 47 | "text": "second British army at Yorktown in the fall of 1781, effectively ending the war. The Treaty of Paris was signed September 3, 1783, formally ending the conflict and confirming the new nation's complete separation from the British Empire. The United States took possession of nearly all the territory east of the Mississippi River and south of the Great Lakes, with the British retaining control of Canada and Spain taking Florida. Among the significant results of the revolution was the creation of the United States Constitution, establishing a relatively strong federal national government that included an executive, a national judiciary, and" 48 | }, 49 | { 50 | "title": "Decolonization", 51 | "text": "accelerate decolonialization and bring an end to the colonial empires of its Western allies, most importantly during the 1956 Suez Crisis, but American military bases were established around the world and direct and indirect interventions continued in Korea, Indochina, Latin America (\"inter alia\", the 1965 occupation of the Dominican Republic), Africa, and the Middle East to oppose Communist invasions and insurgencies. Since the dissolution of the Soviet Union, the United States has been far less active in the Americas, but invaded Afghanistan and Iraq following the September 11 attacks in 2001, establishing army and air bases in Central Asia. Before" 52 | }, 53 | { 54 | "title": "Decolonization", 55 | "text": "the responsibility of the United Kingdom (with a copy of the new constitution annexed), and finally, if approved, issuance of an Order of Council fixing the exact date of independence. After World War I, several former German and Ottoman territories in the Middle East, Africa, and the Pacific were governed by the UK as League of Nations mandates. 
Some were administered directly by the UK, and others by British dominions – Nauru and the Territory of New Guinea by Australia, South West Africa by the Union of South Africa, and Western Samoa by New Zealand. Egypt became independent in 1922," 56 | } 57 | ] 58 | }, 59 | { 60 | "question": "Who set the record for longest field goal?", 61 | "answer": "The record for the longest field goal in an NFL game was set by Matt Prater at 64 yards [1], but the record for the longest field goal at any level was 69 yards, kicked by collegiate kicker Ove Johansson in a 1976 Abilene Christian University football game against East Texas State University [2].", 62 | "docs": [ 63 | { 64 | "title": "Field goal", 65 | "text": "toward its own end. The longest field goal kick in NFL history is 64 yards, a record set by Matt Prater on December 8, 2013. The previous record was 63, originally set by Tom Dempsey (1970) and then matched by Jason Elam (1998), Sebastian Janikowski (2011), David Akers (2012), and Graham Gano (2018). High school, college and most professional football leagues offer only a three-point field goal; however, some professional leagues have encouraged more rare kicks through \"four-point field goals\". NFL Europe encouraged long field goals of 50 yards or more by making those worth four points instead of three" 66 | }, 67 | { 68 | "title": "Field goal range", 69 | "text": "35 and 40 yard lines (closer in a crosswind) often will go for the more risky fourth down conversion rather than risk either the touchback or the missed field goal. The longest field goal in recorded football history was 69 yards, set by collegiate kicker Ove Johansson, who was born in Sweden, in a 1976 Abilene Christian University football game against East Texas State University (now Texas A&M Commerce) at Shotwell Stadium in Abilene. The longest successful field goal in the NFL was 64 yards and was completed by Matt Prater in 2013. The NCAA record is 67 yards held" 70 | }, 71 | { 72 | "title": "Field goal", 73 | "text": "both end zones) is only 66 yards. Scaccia, while playing indoor football, attempted a 64-yard kick that was inches short of success, hitting the crossbar. Longer field goals have been attempted at times; the longest attempt in the NFL, which was well short and was kicked into the wind, was 76 yards, attempted by Sebastian Janikowski of the Oakland Raiders, in a September 28, 2008 game against the San Diego Chargers. NFL Europe rewarded kickers that successfully kicked a field goal of longer than 50 yards with a bonus point, making such field goals worth 4 points instead of 3;" 74 | }, 75 | { 76 | "title": "Field goal", 77 | "text": "this accomplishment is not the official record. All of the above kicks were successful with the use of a kicking tee, which was banned by the NCAA after the 1988 season. The longest known drop-kicked field goal in college football was a 62-yard kick from Pat O'Dea, an Australian kicker who played on the Wisconsin Badgers football team. O'Dea's kick took place in a blizzard against Northwestern on November 15, 1898. The longest field goal in U Sports football history is 59 yards, by Niko Difonte of Calgary Dinos, playing against the UBC Thunderbirds on November 11, 2017. 
The field" 78 | }, 79 | { 80 | "title": "Field goal range", 81 | "text": "NFL and have been banned from NCAA since 1989) is 68 yards held by Fabrizio Scaccia, and the high school record 68 yards held by Dirk Borgognone; high school has wider goal posts and treats a field goal attempt that lands short in the field of play the same as a punt, making longer attempts much less risky. The indoor football record, with narrower and higher goal posts, is 63 yards (set by Aaron Mills), which is practically as long of a field goal as is possible in that variant of the sport, since the field in indoor football (including" 82 | } 83 | ] 84 | }, 85 | { 86 | "question": "Who played galen in planet of the apes?", 87 | "answer": "In the 1968 film Planet of the Apes, Galen was played by Wright King [2]. And in the tv series Planet of the Apes, Galen was played by Roddy McDowall [1].", 88 | "docs": [ 89 | { 90 | "title": "Planet of the Apes", 91 | "text": "installment. Jacobs died on June 27, 1973, bringing an end to the APJAC Productions era of the \"Planet of the Apes\" franchise. Former Fox executive Stan Hough took over as producer for the television project, titled \"Planet of the Apes\". CBS picked up the series for its 1974 autumn lineup. Ron Harper and James Naughton played Alan Virdon and Peter Burke, two 20th-century American astronauts who pass through a time warp to a future where apes subjugate humans (unlike the original film, the humans can speak). Roddy McDowall returned to the franchise as Galen, a chimpanzee who joins the astronauts." 92 | }, 93 | { 94 | "title": "Planet of the Apes (1968 film)", 95 | "text": "chimpanzees: animal psychologist Zira (Kim Hunter) and surgeon Galen (Wright King). While unable to speak as his throat wound is healing, called \"Bright Eyes\" by Zira and placed with one of the captive primitive humans he later names \"Nova\", Taylor observes the enhanced society of talking apes and in a strict caste system: the gorillas being the military police, hunters and workers; the orangutans overseeing the affairs of government, science, and religion; and intellectual chimpanzees being mostly scientists. While their society is a theocracy similar to the beginnings of the human Industrial Era, the apes consider the primitive humans as" 96 | }, 97 | { 98 | "title": "Planet of the Apes (1968 film)", 99 | "text": "Planet of the Apes (1968 film) Planet of the Apes is a 1968 American science fiction film directed by Franklin J. Schaffner. It stars Charlton Heston, Roddy McDowall, Kim Hunter, Maurice Evans, James Whitmore, James Daly and Linda Harrison. The screenplay by Michael Wilson and Rod Serling was loosely based on the 1963 French novel \"La Plan\u00e8te des Singes\" by Pierre Boulle. Jerry Goldsmith composed the groundbreaking avant-garde score. It was the first in a series of five films made between 1968 and 1973, all produced by Arthur P. Jacobs and released by 20th Century Fox. The film tells the" 100 | }, 101 | { 102 | "title": "Planet of the Apes", 103 | "text": "Rupert Wyatt. To portray ape characters realistically, the production avoided practical effects in favor of performance capture acting, partnering with New Zealand visual effects company Weta Digital. Wyatt cast James Franco as Will Rodman, while veteran performance capture actor Andy Serkis signed on to star as Caesar. \"Rise\" debuted on August 5, 2011. Critics reviewed it positively, especially praising the visual effects and Serkis's performance. 
It was a major box office hit, taking in $482 million globally, more than five times its $93 million budget. Weta's special effects earned the film two Visual Effects Society Awards and an Oscar nomination" 104 | }, 105 | { 106 | "title": "Planet of the Apes", 107 | "text": "film stars Mark Wahlberg as astronaut Leo Davidson, who accidentally travels through a wormhole to a distant planet where talking apes enslave humans. He leads a human revolt and upends ape civilization by discovering that the apes evolved from the normal earth primates who had accompanied his mission, and arrived years before. Helena Bonham Carter played chimpanzee Ari, while Tim Roth played the human-hating chimpanzee General Thade. The film received mixed reviews; most critics believed it failed to compare to the original. Much of the negative commentary focused on the confusing plot and twist ending, though many reviewers praised the" 108 | } 109 | ] 110 | } 111 | ] 112 | } -------------------------------------------------------------------------------- /prompts/eli5_default.json: -------------------------------------------------------------------------------- 1 | { 2 | "instruction": "Instruction: Write an accurate, engaging, and concise answer for the given question using only the provided search results (some of which might be irrelevant) and cite them properly. Use an unbiased and journalistic tone. Always cite for any factual claim. When citing several search results, use [1][2][3]. Cite at least one document and at most three documents in each sentence. If multiple documents support the sentence, only cite a minimum sufficient subset of the documents.", 3 | "demo_sep": "\n\n\n", 4 | "demo_prompt": "{INST}\n\nQuestion: {Q}\n\n{D}\nAnswer: {A}", 5 | "doc_prompt": "Document [{ID}](Title: {T}): {P}\n", 6 | "demos": [ 7 | { 8 | "question": "Why did New York City try to ban food donations to the poor?", 9 | "answer": "New York City, under Mayor Michael Bloomberg's administration, banned citizens from donating food directly to homeless shelters because the city could not assess the salt, fat, and fiber content [1][2][3]. Bloomberg's administration was heavily criticized for losing their common sense by becoming too focused on what people eat [2].", 10 | "docs": [ 11 | { 12 | "title": "The Future Of America", 13 | "text": "believe that they are \u201chelping\u201d the homeless by passing such laws. In New York City, Mayor Bloomberg has banned citizens from donating food directly to homeless shelters and he is actually convinced that it was the right thing to do for the homeless\u2026 Mayor Michael Bloomberg\u2019s food police have struck again! Outlawed are food donations to homeless shelters because the city can\u2019t assess their salt, fat and fiber content, reports CBS 2\u2019s Marcia Kramer. Glenn Richter arrived at a West Side synagogue on Monday to collect surplus bagels \u2014 fresh nutritious bagels \u2014 to donate to the poor." 14 | }, 15 | { 16 | "title": "mayor bloomberg", 17 | "text": "Amuck: Bloomberg Bans Food Donations in New York City Food Might Be Salty or Too High in Calories, City Explains Washington, D.C. \u2013 New York Mayor Michael Bloomberg\u2019s administration is now banning all food being offered to the city\u2019s homeless shelters. New York City\u2019s bureaucrats have become so singularly focused on what people eat, says the National Center for Public Policy Research, that they\u2019ve lost their common sense. 
\u201cSo much for serving the homeless: The Bloomberg administration is now taking the term \u2018food police\u2019 to new depths, blocking food donations to all government-run facilities that serve the" 18 | }, 19 | { 20 | "title": "New York City bans food donations - WND", 21 | "text": "New York City bans food donations - WND Front Page Health U.S. New York City bans food donations Inability to control 'nutritional content' cited as reason New York City homeless shelters have Mayor Michael Bloomberg to thank for a halt in food donations, for which hungry families are waiting, according to one public policy advocate. \"The Bloomberg administration is now taking the term 'food police' to new depths, blocking food donations to all government-run facilities that serve the city's homeless,\" says Jeff Stier, a National Center for Public Policy Research senior fellow. Currently, no food can be given to government-run, New York City facilities, despite hungry crowds perfectly" 22 | }, 23 | { 24 | "title": "New York City bans food donations - WND", 25 | "text": "New York City bans food donations - WND Services didn't return WND calls. Stier told WND that he specifically was told by Diamond that the policy was tied to the nutritional guidelines set by the mayor. \"They can say that this ban on donations is a long-standing policy, but they can\u2019t document it,\" Stier told WND. \"I've also been told that there are numerous food shelves that have been accepting food donations, not just one.\" Stier is a member of a New York Synagogue that has donated food for over a decade. He is outraged that the DHS' response to his demand to know why the practice can" 26 | }, 27 | { 28 | "title": "New York City bans food donations - WND", 29 | "text": "New York City bans food donations - WND ban on donated food. In fact, it thrives because of food donations. New York City Rescue Mission has been providing food, clothing, shelter and spiritual hope for needy New Yorkers since 1872. \"We feed over 500 people a day, all through donations,\" said James Varnhagen, NYCRM director. \"Boxed food, canned food, prepared food, we take any food,\" he told WND. \"We couldn't survive without donations,\" he said." 30 | } 31 | ] 32 | }, 33 | { 34 | "question": "What's the difference between Shia vs. Sunni Islam?", 35 | "answer": "The main difference between Shia and Sunni Muslim is related to ideological heritage and issues of leadership [1]. This difference is first formed after the death of the Prophet Muhammad in 632 A.D. [1][2]. The ideological practice of the Sunni branch strictly follows Prophet Muhammad and his teachings, while the Shia branch follows Prophet Muhammad's son-in-law Ali [2]. Nowadays, Sunni and Shia are the major branches of Islam [3].", 36 | "docs": [ 37 | { 38 | "title": "The Sunni vs Shia Divide - Explained - Globaloi", 39 | "text": "centuries-long strained relationship between Sunnis and Shias. As a scholar of Islam and a public educator, I often field questions about Sunnis, Shias and the sects of Islam. What exactly is the Shia-Sunni divide? And what is its history? History of divide Both Sunnis and Shias \u2013 drawing their faith and practice from the Qur\u2019an and the life of the Prophet Muhammad \u2013 agree on most of the fundamentals of Islam. The differences are related more to historical events, ideological heritage and issues of leadership. The first and central difference emerged after the death of Prophet Muhammad in A.D. 632." 
40 | }, 41 | { 42 | "title": "What\u2019s the difference between Sunni and Shia Islam? \u2013 Macrosnaps", 43 | "text": "What\u2019s the difference between Sunni and Shia Islam? Sunni and Shia identities (the 2 main branches of Islam) first formed around a dispute over leadership succession after the death of the Prophet Muhammad in 632 A.D. Sunni is the larger branch (estimated 85-90% of total world Muslim population) and it's adherents are referred to as \"people of the tradition of Muhammad\", while Shia are \"followers\" of Muhammad's son-in-law and cousin Ali. Sunnis rely heavily on the practice of the Prophet Muhammad and his teachings, the Shia view their ayatollahs as reflections of God on earth. What challenges does the anti-IS" 44 | }, 45 | { 46 | "title": "Difference between Sunni and Shia Muslims | Sunni vs Shia Muslims", 47 | "text": "of Muhammad, the last prophet of God. A follower of Islam is known as a Muslim. Many Muslims believe that their sole purpose is to worship and serve God, for which they have established five pillars of Islam that guides a Muslim on almost every aspect of life and society. Due to differences, Muslims have been divided into two primary sects: The Sunnis and the Shias. These two sects have many similarities and both consider themselves are Muslims, following the will of God. However, they are also different from each other in certain aspects. Both the Sunnis and the Shias," 48 | }, 49 | { 50 | "title": "What is the difference between Shia and Sunni Islam? - Islam Stack Exchange", 51 | "text": "What is the difference between Shia and Sunni Islam? - Islam Stack Exchange between Mutah marriage and Misyar marriage? What theological and historical factors distinguish Ibadi Islam from either Shia or Sunni schools? What are the principle/fundamental differences between Sunni and Shia? Nikah between a Sunni girl and Shia boy What is the difference between \u201cMubtalat-of-Wudu\u201d of Shia and Sunni? How can the Hadith be reliable when Sunnis and Shia follow different points of reference? Rejection of Mutawatir Hadith in Sunni Islam and Shia Islam" 52 | }, 53 | { 54 | "title": "What is the difference between Sunni and Shia Islam? | Patrick Syder Travel", 55 | "text": "What is the difference between Sunni and Shia Islam? | Patrick Syder Travel What is the difference between Sunni and Shia Islam? This Channel 4 link answers some of the key questions about the difference between Sunni and Shia Islam and alarmingly, the politics on what is happening and why, in Syria\u2026\u2026. http://www.channel4.com/news/sunni-shia-islam-muslim-syria-middle-east-key-questions \u2190 Ethiopia Appeal \u2013 Help sponsor a nurse to train and to help others G\u00f6bekli Tepe, Turkey: a new wonder of the ancient world by Jeremy Seal (Telegraph Travel Section 23/04/2013) \u2192" 56 | } 57 | ] 58 | }, 59 | { 60 | "question": "What causes Bi-polar disorder?", 61 | "answer": "Bipolar disorder is an emotional disorder that causes extreme mood swings between excitement and depression [1][3]. The spectrum of mood swing may span from days to months [1][2]. 
We are still not certain of the exact factors that cause such disorder, but genetics is considered a major factor [2][3].", 62 | "docs": [ 63 | { 64 | "title": "Bi-polar disorder | definition of Bi-polar disorder by Medical dictionary", 65 | "text": "bi-polar disorder | definition of bi-polar disorder by medical dictionary https://medical-dictionary.thefreedictionary.com/bi-polar+disorder (redirected from bi-polar disorder) related to bi-polar disorder: depression bipolar disorder, formerly known as manic depression, is a mood disorder that causes radical emotional changes and mood swings, from manic, restless highs to depressive, listless lows. most bipolar individuals experience alternating episodes of mania and depression. bipolar disorder is characterized by alternating manic episodes in which the individual feels abnormally euphoric, optimistic, and energetic and depressive periods in which the individual feels sad, hopeless, guilty, and sometimes suicidal. manic or depressive periods may last for days, weeks, or months" 66 | }, 67 | { 68 | "title": "Mania and Bi-Polar", 69 | "text": "can go from depressed to \u201csuper happy\u201d all in one day, or even in a few days, does not have a bi-polar disorder Bi-polar looks different depending on the severity of the symptoms. Most bi-polar diagnoses that are made are for bi-polar 2, with bi-polar 1 being much more rare. Bi-polar 1 is so severe that the individual will have periods of such agitation, or such reckless and seemingly foolish behavior that they put themselves or those around them in danger. It is not completely clear what causes bi-polar, but genetics seem to have a large role. The biggest factor" 70 | }, 71 | { 72 | "title": "Bi-Polar disorder", 73 | "text": "Bi-Polar disorder Bi-polar is generally a cyclic disease where individuals display depressive and elevated episodes at regular intervals. It is a disorder resulting from the imbalance of the chemicals in the brain that causes a lot of fluctuations of mood. It is a fact that we all experience happy and sad moods, but people with bi-polar disorder experience the changes in mood at an increased level. The cause of this disorder is not known completely. However, it is estimated that there are different factors responsible for it. It is often connected to a genetic component. People suffering from the Bi-polar disorder are" 74 | }, 75 | { 76 | "title": "For Individuals \u2014 Adam Schwartz", 77 | "text": "For Individuals \u2014 Adam Schwartz The information is extensive and covers a huge range of topics. Some of the topics include the different types of bi-polar, what it feels like, signs and symptoms, treatments and more. Black Dog Institute bi-polar causes resource specifically covers the variety of areas that could potentially be a cause of bi-polar disorder. Including genetics, environmental factors, pregnancy, and more. Black Dog Institute bi-polar treatments resource specifically covers multiple potential treatments options for bi-polar. Including management, types of psychological treatment, lifestyle changes, and more. Black Dog Institute bi-polar self-test resource is a short self-test for people who may be concerned if" 78 | }, 79 | { 80 | "title": "Depression Bi-polar Disorder Symptoms 2019 | Win Over Depression", 81 | "text": "Depression Bi-polar Disorder Symptoms 2019 | Win Over Depression signs and symptoms of bipolar disorder. Learn more about the common symptoms of bipolar depression that some patients may experience. 
Home \u00bb Trending Health News \u00bb 10 Warning Signs of Bipolar Disorder: Depression. One of the most serious symptoms of bipolar disorder is. Bi Polar Depression. SEVERE SWINGS What is bipolar disorder, is it the same as manic depression, what are the symptoms and is there a cure? Bipolar disorder, or manic depression, causes symptoms of mania and depression. Read about bipolar disorder treatment, medications, and causes of this. Learn more about the different types of bipolar disorder. Find out" 82 | } 83 | ] 84 | }, 85 | { 86 | "question": "How do student loans affect getting a mortgage?", 87 | "answer": "When applying for a mortgage, student loans can affect the debt to income ratio, which is a key factor in determining the amount that an individual can afford to pay for the mortgage [1]. While student loan repayments do not appear in an individual's credit history and do not affect credit scores, lenders do consider the amount of an individual's student loan repayments when assessing their mortgage application [1][2][3]. Some 83% of non-homeowners say student loan debt is preventing them from buying a home, according to the National Association of Realtors [2]. It is important to note that student loans do not prevent an individual from getting a mortgage [1].", 88 | "docs": [ 89 | { 90 | "title": "Student Loans \u2013 How do they work? | The Financial Review", 91 | "text": "typical debt. Student loan repayments do not appear in an individual\u2019s credit history, therefore there are no implications whatsoever. This also extends to applications for credit cards \u2013 student \u2018loans\u2019 are not acknowledged. One noteworthy aspect that is affected by student loans however, is mortgage applications. Nevertheless, it does not prevent an individual from getting a mortgage. For example, lenders will consider the amount of an individual\u2019s student loan repayments in order to assess the debt to income ratio and therefore establish the amount that the individual can afford to pay for the mortgage. Just as they do with other" 92 | }, 93 | { 94 | "title": "How Does Student Loan Debt Affect Buying a Home? | Experian", 95 | "text": "Rates & Affordability How Student Loans Affect Getting a Mortgage Student Loan Impact on Credit Scores Other Factors for Getting Approved for a Mortgage If you're a recent college grad and hope to become a homeowner in the near future, you should know that student loan debt could affect buying a home by making it more difficult to get a mortgage. Some 83% of non-homeowners say student loan debt is preventing them from buying a home, according to the National Association of Realtors (NAR). But while student loan payments can make it harder to save for a down payment on" 96 | }, 97 | { 98 | "title": "Studentloanify - How your student loans affect your home mortgage prospects", 99 | "text": "Though it may not seem fair, your student loan situation impacts your home mortgage outlook. Many people carry student loan debt, but it\u2019s the amount of the loan and how you handle your student loan repayment plan that will influence your ability to get a home mortgage as well as what your interest rate will be. Here are some specific factors about your student loan that will affect your home mortgage prospects. On your mortgage loan application, you will have to report how much your monthly student loan payment is. This amount will be deducted from your monthly gross income" 100 | }, 101 | { 102 | "title": "How do student loans affect your credit score? 
| Student Loan Planner", 103 | "text": "How do student loans affect your credit score? | Student Loan Planner Your credit score is the three-digit number that dictates a lot in your adult life. Whether you\u2019re applying for a mortgage or looking to get an auto loan, this seemingly arbitrary number determines whether you get approved for a loan and also affects your interest rate. If you\u2019re a student loan borrower you may wonder, \u201cDo student loans affect credit score?\u201d You might be especially curious if you\u2019re in the process of applying for a mortgage. Here\u2019s how student loans affect your credit score and what to know for big life events, like getting a mortgage. Do student loans affect" 104 | }, 105 | { 106 | "title": "Does Student Loan Debt Affect Getting A Mortgage?", 107 | "text": "Does Student Loan Debt Affect Getting A Mortgage? Home \u00bb Does Student Loan Debt Affect Getting A Mortgage? Last year, I helped answer a reader\u2019s question about applying for a mortgage while on Income Based Repayment. However, over the last several months, I\u2019ve been getting bombarded with questions about how student loan debt impacts your ability to get a mortgage. Maybe it\u2019s because the housing market is improving, or maybe it\u2019s because people are finally taking their student loan debt seriously. Anyway, I wanted to share a few reader questions and then look at whether student loan debt affects getting a mortgage. Here are the reader questions I\u2019ve" 108 | } 109 | ] 110 | } 111 | ] 112 | } -------------------------------------------------------------------------------- /run.py: -------------------------------------------------------------------------------- 1 | import logging 2 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 3 | logger = logging.getLogger(__name__) 4 | logger.setLevel(logging.INFO) 5 | 6 | import argparse 7 | import os 8 | import json 9 | from tqdm import tqdm 10 | import numpy as np 11 | import re 12 | import yaml 13 | from utils import * 14 | from nltk import sent_tokenize 15 | from openai_account_manager import openai_llm_generate_multi_thread 16 | from llm import LLM 17 | 18 | def remove_citations(sent): 19 | return re.sub(r"\[\d+", "", re.sub(r" \[\d+", "", sent)).replace(" |", "").replace("]", "") 20 | 21 | 22 | def main(): 23 | parser = argparse.ArgumentParser() 24 | parser.add_argument('--ndoc_top_bottom', type=int, default=0) 25 | parser.add_argument('--ndoc_top_neighbor', type=int, default=0) 26 | parser.add_argument('--output_fp', default=None, type=str, required=True) 27 | parser.add_argument("--config", type=str, default=None, help="Path to the config file") 28 | parser.add_argument('--openai_multi_thread',type=int,required=True) 29 | parser.add_argument('--turbo_system_message', required=True) 30 | parser.add_argument('--use_sub_questions', type=int, default=0) 31 | # Prompt file is a json file that contains the following fields: 32 | # - instruction: the instruction, which will appear at the beginning of each demo and the test example 33 | # - demo_sep: the separator between each demo, for example, "\n\n\n" 34 | # - demo_prompt: the prompt for the demo, for example, "Instruction: {INST}\n\nQuestion: {Q}\n\n{D}\nAnswer: {A}" 35 | # - {INST}: the instruction 36 | # - {D}: the documents 37 | # - {Q}: the question 38 | # - {A}: the answers 39 | # - doc_prompt, the prompt for each document, for example, "Document [{ID}](Title: {T}): {P}", where 40 | # - {ID}: the document id, 
starting from 1 41 | # - {T}: the document title 42 | # - {P}: the document text 43 | # - demos: a list of demo examples, each of which should have 44 | # - question: the question 45 | # - docs: the documents ("title" and "text") 46 | # - answer: the answer to show in the demo. If it is a list, they will be concatenated by "\n". This is useful when the answer includes interactive components. 47 | # Note that this python file will sample `--shot` demos from the prompt file given the random seed `--seed` 48 | parser.add_argument("--prompt_file", type=str, help="Path to the prompt file") 49 | 50 | # Evaluation file is a json file that contains a list of item, each of which contains 51 | # - question: the question 52 | # - answer: the answer 53 | # - docs: the documents, each of which contains "title", "text" 54 | parser.add_argument("--data_file", type=str, help="Path to the eval file") 55 | parser.add_argument("--quick_test", type=int, default=0, help="Quickly test a few examples") 56 | 57 | # ICL setting 58 | parser.add_argument("--ndoc", type=int, help="Number of documents") 59 | parser.add_argument("--shot", type=int, help="Number of ICL demonstrations") 60 | parser.add_argument("--seed", type=int, default=42, help="Seed for the random number generator") 61 | parser.add_argument("--no_doc_in_demo", type=bool, default=False, help="Whether to remove the documents in the demos") 62 | parser.add_argument("--fewer_doc_in_demo", type=bool, default=False, help="Whether to use fewer documents in the demos") 63 | parser.add_argument("--ndoc_in_demo", type=int, default=None, help="When using --fewer_doc_in_demo, use this to designate how many docs in demo") 64 | 65 | # Model and name 66 | parser.add_argument("--dataset_name", type=str, help="Name of the dataset (for saving)") 67 | parser.add_argument("--tag", type=str, help="Tag of run (for saving)") 68 | parser.add_argument("--model", type=str, help="Model to use") 69 | parser.add_argument("--openai_api", type=bool, default=False, help="Whether to use OpenAI API") 70 | parser.add_argument("--azure", action="store_true", default=False, help="Azure openai API") 71 | 72 | # Decoding 73 | parser.add_argument("--temperature", type=float, default=0.5, help="Temperature for decoding") 74 | parser.add_argument("--top_p", type=float, default=1.0, help="Nucleus sampling top-p") 75 | parser.add_argument("--max_new_tokens", type=int, default=300, help="Max number of new tokens to generate in one step") 76 | parser.add_argument("--max_length", type=int, default=2048, help="Max length the model can take. Should set properly wrt the model to avoid position overflow.") 77 | parser.add_argument("--num_samples", type=int, required=True, help="Sample multiple answers.") 78 | 79 | # Use summarization/extraction of the documents 80 | parser.add_argument("--use_shorter", type=str, default=None, help="Whether to use summary data or extraction data for documents. Option: None, `summary`, `extraction`") 81 | 82 | # Interactive 83 | parser.add_argument("--interactive", type=bool, default=False, help="Whether to run in interactive mode") 84 | parser.add_argument("--interactive_query", type=str, default=None, help="The query to use in interactive mode, either `doc_id` (corresponding to interact in paper) or `search` (corresponding to inlinesearch in paper).") 85 | parser.add_argument("--retriever", type=str, default=None, help="When using interactive search mode, which retriever to use. 
Options: `tfidf`, `gtr-t5-large`") 86 | parser.add_argument("--retriever_device", type=str, default="cuda", help="Where to put the dense retriever if using. Options: `cuda`, `cpu`") 87 | parser.add_argument("--retrieve_in_all_docs", type=bool, default=False, help="Retrieve in all documents instead of just top ndoc") 88 | parser.add_argument("--max_turn", type=int, default=10, help="Max number of all actions") 89 | parser.add_argument("--max_doc_show", type=int, default=3, help="Max number of documents to show at one time.") 90 | parser.add_argument("--force_cite_show", type=bool, default=False, help="Force citing the documents that are shown to the model") 91 | 92 | # Load config 93 | args = parser.parse_args() 94 | 95 | config = yaml.safe_load(open(args.config)) if args.config is not None else {} 96 | parser.set_defaults(**config) 97 | args = parser.parse_args() 98 | 99 | assert args.openai_api, "only support openai_api now" 100 | assert not args.azure, "not support azure" 101 | assert not args.interactive, "not support interactive" 102 | assert not args.no_doc_in_demo 103 | assert not args.fewer_doc_in_demo 104 | 105 | # Save args 106 | args_dict = vars(args) 107 | directory = os.path.dirname(args.output_fp) 108 | with open(f'{directory}/args.json', 'a') as f: 109 | json.dump(args_dict, f, indent=4) 110 | 111 | if args.num_samples > 1: 112 | assert args.temperature > 0, "when multiple sampling, do not use temperature=0, i.e., greedy decoding" 113 | # assert args.num_samples == 1, "not support num_samples>1" 114 | 115 | for k in args.__dict__: 116 | print(f"{k}: {args.__dict__[k]}") 117 | 118 | if "turbo" in args.model: 119 | # ChatGPT has a longer max length 120 | logger.info("Change the max length to 4096 for ChatGPT.") 121 | args.max_length = 4096 122 | 123 | # Load the model or setup the API 124 | llm = LLM(args) 125 | 126 | # Generate prompts 127 | np.random.seed(args.seed) 128 | 129 | # Load data 130 | prompt_data = json.load(open(args.prompt_file)) 131 | eval_data = json.load(open(args.data_file)) 132 | 133 | logger.info("Generate the demonstration part") 134 | head_prompt = "" 135 | train_ids = np.random.choice(len(prompt_data["demos"]), args.shot, replace=False) 136 | for train_id in train_ids: 137 | train_item = prompt_data["demos"][train_id] 138 | ndoc = args.ndoc 139 | if args.no_doc_in_demo: 140 | ndoc = 0 141 | elif args.fewer_doc_in_demo: 142 | assert args.ndoc_in_demo is not None 143 | ndoc = args.ndoc_in_demo 144 | # Run here 145 | head_prompt += make_demo( 146 | train_item, prompt=prompt_data["demo_prompt"], ndoc=ndoc, doc_prompt=prompt_data["doc_prompt"], 147 | instruction=prompt_data["instruction"], use_shorter=args.use_shorter, test=False, use_sub_questions=args.use_sub_questions 148 | ) 149 | head_prompt += prompt_data["demo_sep"] 150 | 151 | # Sample quick test 152 | if args.quick_test > 0: # Don't run 153 | eval_ids = np.random.choice(len(eval_data), args.quick_test, replace=False) 154 | eval_data = [eval_data[int(idx)] for idx in eval_ids] 155 | 156 | logger.info("Generating prompts...") 157 | incomplete_doc_list = 0 # For some questions there might be less than ndoc documents 158 | for idx, eval_item in enumerate(tqdm(eval_data)): 159 | eval_data[idx]['prompt'] = head_prompt + make_demo( 160 | eval_item, prompt=prompt_data["demo_prompt"], ndoc=args.ndoc, doc_prompt=prompt_data["doc_prompt"], 161 | instruction=prompt_data["instruction"], use_shorter=args.use_shorter, test=True, use_sub_questions=args.use_sub_questions 162 | ) 163 | if args.use_shorter is not None: 
164 | doc_list = get_shorter_text(eval_item, eval_item["docs"], args.ndoc, args.use_shorter) 165 | else: 166 | doc_list = eval_item["docs"][:args.ndoc] 167 | 168 | if args.ndoc_top_bottom > 0: 169 | doc_list += eval_item["docs"][-args.ndoc_top_bottom:] 170 | if args.ndoc_top_neighbor > 0: 171 | doc_list += eval_item['docs'][30:30+args.ndoc_top_neighbor] 172 | 173 | assert not (args.ndoc_top_bottom > 0 and args.ndoc_top_neighbor > 0), 'not support args.ndoc_top_neighbor and args.ndoc_top_bottom both > 0' 174 | 175 | if not args.retrieve_in_all_docs: 176 | # If --retrieve_in_all_docs, we keep the original docs and do not trim them by ndoc 177 | # Otherwise, take the new docs (truncated by ndoc and filtered if using summary/extraction) 178 | eval_data[idx]['docs'] = doc_list 179 | if len(doc_list) < args.ndoc: 180 | incomplete_doc_list += 1 181 | logger.info("Done.") 182 | if incomplete_doc_list > 0: 183 | logger.warning(f"There are {incomplete_doc_list} questions that have incomplete document list (may due to a lot of them are filtered out by summary/extraction).") 184 | 185 | # Load retriever for interactive search 186 | if args.interactive and args.interactive_query == "search" and "gtr" in args.retriever: # Don't run 187 | from sentence_transformers import SentenceTransformer 188 | gtr_model = SentenceTransformer(f'sentence-transformers/{args.retriever}', device=args.retriever_device) 189 | from searcher import SearcherWithinDocs 190 | 191 | eval_data_openai_queries = [] 192 | 193 | for idx, item in enumerate(tqdm(eval_data)): 194 | prompt = item['prompt'] 195 | prompt_len = len(llm.tokenizer.tokenize(prompt)) 196 | if idx == 0: 197 | print(prompt) 198 | eval_data_openai_queries.append({'input': prompt, 'max_tokens': min(args.max_new_tokens, args.max_length - prompt_len)}) 199 | 200 | if "turbo" in args.model and not args.azure: # Run 201 | assert args.turbo_system_message != None 202 | # For OpenAI's ChatGPT API, we need to convert text prompt to chat prompt 203 | item['prompt'] = [ 204 | {'role': 'system', 'content': args.turbo_system_message}, 205 | {'role': 'user', 'content': prompt} 206 | ] 207 | 208 | if args.openai_multi_thread > 1: 209 | eval_data_openai_responses = openai_llm_generate_multi_thread(eval_data_openai_queries, 210 | llm, 211 | args.openai_multi_thread, 212 | 1, 213 | args.turbo_system_message) 214 | else: 215 | raise NotImplementedError 216 | 217 | for idx, item in enumerate(tqdm(eval_data)): 218 | eval_data_openai_response = eval_data_openai_responses[idx] 219 | for j, decoded_output in enumerate(eval_data_openai_response): 220 | decoded_output = decoded_output.replace("<|im_end|>", "").rstrip() 221 | if decoded_output.endswith("End."): 222 | decoded_output = decoded_output[:-len("End.")] 223 | eval_data_openai_response[j] = decoded_output 224 | 225 | logger.info(f"Question: {item['question']}") 226 | logger.info(f"Gold answer: {item['answer']}") 227 | logger.info(f"Final model output:") 228 | for j, decoded_output in enumerate(eval_data_openai_response): 229 | print('{}: {}'.format(j,decoded_output)) 230 | item['output'] = eval_data_openai_response if len(eval_data_openai_response) > 1 else eval_data_openai_response[0] 231 | 232 | 233 | # for idx, item in enumerate(tqdm(eval_data)): 234 | for idx, item in enumerate([]): 235 | 236 | prompt = item['prompt'] 237 | prompt_len = len(llm.tokenizer.tokenize(prompt)) 238 | 239 | if idx == 0: 240 | print(prompt) 241 | 242 | output_array = [] 243 | for _ in range(args.num_samples): 244 | if args.interactive: 245 | 
print("============ Interactive =============") 246 | output_answer = "" 247 | doc_list = item['docs'] 248 | 249 | interactive_prompt = prompt.rstrip() + "\n" # Start a new line 250 | inline_doc = "" 251 | num_turn = 0 252 | 253 | doc_history = [] 254 | while True: 255 | # For each action, it should end at the new line 256 | # Three possible actions 257 | # - Check: Document [1][2][3] / search query 258 | # - Output: output 259 | # - End 260 | num_turn += 1 261 | new_prompt = interactive_prompt + inline_doc 262 | new_prompt_len = len(llm.tokenizer.tokenize(new_prompt)) 263 | 264 | if idx == 0: 265 | print(f"-------------- Step {num_turn} prompt --------------") 266 | print(new_prompt) 267 | print("-----------------------------") 268 | 269 | output = llm.generate(new_prompt, min(args.max_new_tokens, args.max_length-new_prompt_len), stop=["\n", "\n\n"]) 270 | 271 | if len(inline_doc) > 0: 272 | output = "Output: " + output # "Output: " was included in inline_doc 273 | inline_doc = "" # Delete inline_doc after use 274 | interactive_prompt += output + "\n" 275 | logger.info(f"Model output: \"{output}\"") 276 | 277 | if output.strip().lower()[:3] == "end": 278 | # Model decides to end the generation 279 | break 280 | elif "sorry" in output.lower() and ("relevant document" in output.lower() or "relevant information" in output.lower()) or "none of the documents" in output.lower(): 281 | # Instruction-tuned model may abstain from answer the question 282 | break 283 | elif output.strip().lower()[:5] == "check" or output.strip().lower()[:6] == "search": 284 | # Checkout or search documents 285 | if args.interactive_query == "search": 286 | query = output.replace("Search:", "").replace("search:", "").strip() 287 | if len(doc_list) == 0: 288 | show_doc_ids = [] 289 | else: 290 | searcher = SearcherWithinDocs(doc_list, args.retriever, model=gtr_model, device=args.retriever_device) 291 | show_doc_ids = [int(searcher.search(query))] 292 | elif args.interactive_query == "doc_id": 293 | show_doc_ids = [int(r[1:])-1 for r in re.findall(r"\[\d+", output)] # In text citation id starts from 1 294 | show_doc_ids = [doc_id for doc_id in show_doc_ids if doc_id < len(doc_list) and doc_id >= 0] 295 | show_doc_ids = show_doc_ids[:args.max_doc_show] # Avoiding showing too many documents 296 | else: 297 | raise NotImplementedError 298 | 299 | inline_doc = "".join([make_doc_prompt(doc_list[doc_id], doc_id, prompt_data["doc_prompt"]) for doc_id in show_doc_ids]) 300 | inline_doc += "Output:" # Force the model to generate output in the next step 301 | doc_history.append(show_doc_ids) 302 | elif output.strip().lower()[:6] == "output": 303 | output = output.strip().replace("Output:", "").strip() 304 | if args.force_cite_show: 305 | output = remove_citations(output) 306 | if len(doc_history) == 0: 307 | logger.warn("No doc history??") 308 | else: 309 | # Just cite whatever documents the model has seen in the last step 310 | if "qampari" in args.data_file: 311 | output = ", ".join(["".join([f"[{doc+1}]" for doc in doc_history[-1]]) + " " + entity.strip() for entity in output.rstrip().rstrip(",").split(",")]) + ", " 312 | else: 313 | output = " ".join(["".join([f"[{doc+1}]" for doc in doc_history[-1]]) + " " + o for o in sent_tokenize(output)]) + "." 314 | output_answer += " " + output 315 | else: 316 | # Sometimes model starts to output random things. 317 | break 318 | 319 | if num_turn >= args.max_turn: 320 | logger.warning("Reach maximum number of turns. 
Terminate now.") 321 | break 322 | 323 | if "qampari" in args.data_file: 324 | output_answer = output_answer.rstrip().rstrip(",") 325 | output_array.append(output_answer) 326 | item['prompt'] = interactive_prompt 327 | item['doc_history'] = doc_history 328 | else: 329 | output_array.append(llm.generate(prompt, min(args.max_new_tokens, args.max_length-prompt_len))) 330 | item['prompt'] = prompt 331 | 332 | output_array[-1] = output_array[-1].replace("<|im_end|>", "").rstrip() 333 | if output_array[-1].endswith("End."): 334 | output_array[-1] = output_array[-1][:-len("End.")] 335 | 336 | logger.info(f"Prompt length={prompt_len}") 337 | logger.info(f"Question: {item['question']}") 338 | logger.info(f"Gold answer: {item['answer']}") 339 | logger.info(f"Final model output: {output_array[-1]}") 340 | 341 | item['output'] = output_array if len(output_array) > 1 else output_array[0] 342 | 343 | # Calculate the price for OpenAI API 344 | if args.openai_api: 345 | logger.info(f"Total token used: {llm.total_tokens}") 346 | if "turbo" in args.model: 347 | unit_price = 0.002 348 | else: 349 | unit_price = 0.02 350 | logger.info(f"Unit price: {unit_price}") 351 | logger.info(f"Total cost: %.1f" % (llm.total_tokens / 1000 * unit_price)) 352 | 353 | logger.info(f"#Cases when prompts exceed max length: {llm.prompt_exceed_max_length}") 354 | logger.info(f"#Cases when max new tokens < 50: {llm.fewer_than_50}") 355 | 356 | # Save the result 357 | model_name = args.model 358 | # if "/" in model_name: 359 | # model_name = model_name.split("/")[-1] 360 | os.makedirs('exps',exist_ok=True) 361 | # name = f"exps/{args.tag}-{args.dataset_name}-{model_name.replace('/','_').replace('-','_')}-shot{args.shot}-ndoc{args.ndoc}-{args.seed}" 362 | name = f"{args.dataset_name}-{model_name}-{args.tag}-shot{args.shot}-ndoc{args.ndoc}-{args.seed}" 363 | if args.azure: 364 | name += "-azure" 365 | if args.quick_test > 0: 366 | name += f"-quick_test{args.quick_test}" 367 | if args.no_doc_in_demo: 368 | name += "-no_doc_in_demo" 369 | if args.fewer_doc_in_demo: 370 | name += f"-{args.ndoc_in_demo}_doc_in_demo" 371 | if args.num_samples > 1: 372 | name += f"-sample{args.num_samples}" 373 | if args.force_cite_show: 374 | name += f"-forceciteshow" 375 | 376 | eval_data = { 377 | "args": args.__dict__, 378 | "data": eval_data, 379 | } 380 | if args.openai_api: 381 | eval_data["total_cost"] = llm.total_tokens / 1000 * unit_price 382 | if args.azure: 383 | eval_data["azure_filter_fail"] = llm.azure_filter_fail 384 | 385 | if args.output_fp != None: 386 | name = args.output_fp 387 | else: 388 | if not os.path.exists("result"): 389 | os.makedirs("result") 390 | name = "result/" + name + ".json" 391 | 392 | logger.info('output_fp:{}'.format(name)) 393 | json.dump(eval_data, open(name, "w"), indent=4) 394 | 395 | if __name__ == "__main__": 396 | main() 397 | -------------------------------------------------------------------------------- /eval.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import collections 3 | import json 4 | import re 5 | import string 6 | import torch 7 | import copy 8 | 9 | from nltk import sent_tokenize 10 | import numpy as np 11 | from rouge_score import rouge_scorer, scoring 12 | from tqdm import tqdm 13 | import logging 14 | logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', 15 | datefmt='%m/%d/%Y %H:%M:%S') 16 | logger = logging.getLogger(__name__) 17 | logger.setLevel(logging.INFO) 18 | 19 | from transformers import ( 20 
| AutoModelForSeq2SeqLM, 21 | AutoTokenizer, 22 | pipeline 23 | ) 24 | 25 | from utils import normalize_answer, get_max_memory, remove_citations 26 | 27 | QA_MODEL = "gaotianyu1350/roberta-large-squad" 28 | AUTOAIS_MODEL = "google/t5_xxl_true_nli_mixture" 29 | 30 | global autoais_model, autoais_tokenizer 31 | autoais_model, autoais_tokenizer = None, None 32 | 33 | 34 | def compute_f1(a_gold, a_pred): 35 | """Compute F1 score between two strings.""" 36 | 37 | def _get_tokens(s): 38 | if not s: 39 | return [] 40 | return normalize_answer(s).split() 41 | 42 | gold_toks = _get_tokens(a_gold) 43 | pred_toks = _get_tokens(a_pred) 44 | 45 | common = collections.Counter(gold_toks) & collections.Counter(pred_toks) 46 | num_same = sum(common.values()) 47 | 48 | if len(gold_toks) == 0 or len(pred_toks) == 0: 49 | # If either is no-answer, then F1 is 1 if they agree, 0 otherwise 50 | return int(gold_toks == pred_toks) 51 | 52 | if num_same == 0: 53 | return 0 54 | 55 | precision = 1.0 * num_same / len(pred_toks) 56 | recall = 1.0 * num_same / len(gold_toks) 57 | f1 = (2 * precision * recall) / (precision + recall) 58 | 59 | return f1 60 | 61 | 62 | def compute_exact(a_gold, a_pred): 63 | """Check whether two strings are equal up to normalization.""" 64 | 65 | return int(normalize_answer(a_gold) == normalize_answer(a_pred)) 66 | 67 | 68 | def exact_presence(short_answers, context): 69 | """Verify if any of the answers is present in the given context. 70 | Args: 71 | short_answers: list of short answers to look for in the context 72 | context: a paragraph to search for short answers 73 | Returns: 74 | true if any of the short answers is present in the context 75 | """ 76 | 77 | n_short_answers = [normalize_answer(sa) for sa in short_answers] 78 | n_context = normalize_answer(context) 79 | 80 | for ans in n_short_answers: 81 | if ans in n_context: 82 | return True 83 | 84 | return False 85 | 86 | 87 | def compute_rouge(data): 88 | """Main function for rouge scoring. 89 | If two references are provided, 90 | the best score is chosen for each instance. 
91 | Args: 92 | data: requires field `output` and `answer` (or `annotations` for ASQA) 93 | metrics: list of evaluation metrics 94 | Returns: 95 | dictionary representation of rouge scores 96 | """ 97 | def _rouge_calculation(hypotheses, 98 | references1, 99 | references2=[], 100 | metrics=['rougeLsum']): 101 | 102 | if references2 == []: 103 | references2 = references1 104 | 105 | scorer = rouge_scorer.RougeScorer(metrics, use_stemmer=True) 106 | aggregator = scoring.BootstrapAggregator() 107 | 108 | for i in range(len(hypotheses)): 109 | scores1 = scorer.score(references1[i], hypotheses[i]) 110 | scores2 = scorer.score(references2[i], hypotheses[i]) 111 | if scores1['rougeLsum'].fmeasure > scores2['rougeLsum'].fmeasure: 112 | aggregator.add_scores(scores1) 113 | else: 114 | aggregator.add_scores(scores2) 115 | 116 | scores = {m: [] for m in metrics} 117 | 118 | for m in metrics: 119 | fmeasure = aggregator.aggregate()[m].mid.fmeasure 120 | scores[m].append(fmeasure) 121 | 122 | for m in scores: 123 | scores[m] = 100 * sum(scores[m]) / len(scores[m]) 124 | 125 | return scores 126 | 127 | hypotheses = {} 128 | references1 = {} 129 | references2 = {} 130 | 131 | for idx, item in enumerate(data): 132 | hypotheses[idx] = item["output"] 133 | if "annotations" in item and item['annotations'] is not None: # For ASQA 134 | references1[idx] = item["annotations"][0]["long_answer"] 135 | references2[idx] = item["annotations"][1]["long_answer"] 136 | else: 137 | references1[idx] = item["answer"] 138 | references2[idx] = item["answer"] 139 | 140 | h, r1, r2 = [], [], [] 141 | 142 | for key in references1: 143 | h.append(hypotheses[key]) 144 | r1.append(references1[key]) 145 | 146 | if references2 is not None: 147 | r2.append(references2[key]) 148 | 149 | h = ['\n'.join(sent_tokenize(text.lower())) for text in h] 150 | r1 = ['\n'.join(sent_tokenize(text.lower())) for text in r1] 151 | r2 = ['\n'.join(sent_tokenize(text.lower())) for text in r2] 152 | scores = _rouge_calculation(h, r1, r2) 153 | 154 | return scores['rougeLsum'] 155 | 156 | 157 | def compute_str_em(data): 158 | """Compute STR-EM metric (only for ASQA) 159 | Args: 160 | data: requires field `qa_pairs/short_answers` and `output` 161 | Returns: 162 | STR-EM and STR-EM-HIT () 163 | """ 164 | 165 | if 'qa_pairs' not in data[0] or data[0]['qa_pairs'] is None: 166 | return 0, 0 167 | 168 | acc = [] 169 | hit = [] 170 | 171 | for item in data: 172 | loc_acc = [] 173 | for qa_pair in item['qa_pairs']: 174 | loc_acc.append(exact_presence(qa_pair['short_answers'], item["output"])) 175 | acc.append(np.mean(loc_acc)) 176 | hit.append( int(np.mean(loc_acc) == 1) ) 177 | 178 | return 100 * np.mean(acc), 100 * np.mean(hit) 179 | 180 | 181 | def compute_len(data): 182 | """Compute average length of predictions.""" 183 | 184 | res, cntr = 0, 0 185 | for item in data: 186 | res += len(item["output"].split()) 187 | cntr += 1 188 | return res / cntr 189 | 190 | 191 | def compute_qa(data): 192 | """Compute QA-based accuracy. 
193 | Args: 194 | data: requires field `qa_pairs/short_answers` and `output` 195 | Returns: 196 | QA metrics (QA-EM, QA-F1, QA-Hit) 197 | """ 198 | 199 | if 'qa_pairs' not in data[0] or data[0]['qa_pairs'] is None: 200 | logger.warn("Warning: no QA pairs found in data") 201 | return { 202 | 'QA-EM': 0, 203 | 'QA-F1': 0, 204 | 'QA-Hit': 0, 205 | } 206 | 207 | # Load model 208 | logger.info("Loading the RoBERTa-large SQuAD model for QA-based accuracy...") 209 | qa_pipeline = pipeline("question-answering", model=QA_MODEL, device=0) 210 | logger.info("Done") 211 | 212 | # Get prediction 213 | logger.info("Computing the QA-based accuracy...") 214 | em, f1, bins = [], [], [] 215 | for item in tqdm(data): 216 | question = [qa_pair['question'] for qa_pair in item['qa_pairs']] 217 | context = item['output'] if len(item['output']) > 0 else " " 218 | results = qa_pipeline(question=question, context=context, handle_impossible_answer=True) 219 | loc_counter, loc_em, loc_f1 = 0, 0, 0 220 | 221 | for idx, res in enumerate(results): 222 | answers = item["qa_pairs"][idx]["short_answers"] 223 | prediction = res["answer"] 224 | 225 | loc_em += max([compute_exact(a, prediction) for a in answers]) 226 | loc_f1 += max([compute_f1(a, prediction) for a in answers]) 227 | loc_counter += 1 228 | 229 | em.append(loc_em / loc_counter) 230 | f1.append(loc_f1 / loc_counter) 231 | bins.append(loc_em == loc_counter) 232 | 233 | return { 234 | 'QA-EM': 100 * np.mean(em), 235 | 'QA-F1': 100 * np.mean(f1), 236 | 'QA-Hit': 100 * np.mean(bins) 237 | } 238 | 239 | 240 | def compute_mauve(data): 241 | """Compute Mauve score.""" 242 | 243 | logger.info("Computing MAUVE...") 244 | human_data = [] 245 | model_data = [] 246 | for item in data: 247 | # Remove ending punctuations 248 | # Remove any new lines 249 | # Truncate by 100 words 250 | human_data.append(' '.join((item['question'] + " " + item['answer'].strip()).split()[:100]).rstrip(string.punctuation)) 251 | model_data.append(' '.join((item['question'] + " " + item['output'].strip()).split()[:100]).rstrip(string.punctuation)) 252 | 253 | import mauve 254 | out = mauve.compute_mauve( 255 | p_text=human_data, 256 | q_text=model_data, 257 | device_id=0, 258 | max_text_length=512, 259 | verbose=True, 260 | batch_size=8, 261 | featurize_model_name="gpt2-large" 262 | ) 263 | return out.mauve * 100 264 | 265 | 266 | def _run_nli_autoais(passage, claim): 267 | """ 268 | Run inference for assessing AIS between a premise and hypothesis. 
269 | Adapted from https://github.com/google-research-datasets/Attributed-QA/blob/main/evaluation.py 270 | """ 271 | global autoais_model, autoais_tokenizer 272 | input_text = "premise: {} hypothesis: {}".format(passage, claim) 273 | inputs = autoais_tokenizer(input_text, return_tensors="pt").to('cuda') 274 | 275 | with torch.inference_mode(): 276 | outputs = autoais_model.generate(inputs['input_ids'], output_scores=True,max_new_tokens=10) 277 | 278 | result = autoais_tokenizer.decode(outputs[0], skip_special_tokens=True) 279 | inference = 1 if result == "1" else 0 280 | return inference 281 | 282 | 283 | def compute_claims(data): 284 | global autoais_model, autoais_tokenizer 285 | if autoais_model is None: 286 | logger.info("Loading AutoAIS model...") 287 | autoais_model = AutoModelForSeq2SeqLM.from_pretrained(AUTOAIS_MODEL, torch_dtype=torch.bfloat16, max_memory=get_max_memory(), device_map="auto") 288 | autoais_tokenizer = AutoTokenizer.from_pretrained(AUTOAIS_MODEL, use_fast=False) 289 | 290 | logger.info("Computing claims...") 291 | scores = [] 292 | for item in tqdm(data): 293 | normalized_output = remove_citations(item['output']) 294 | entail = 0 295 | claims = item["claims"] 296 | for claim in claims: 297 | entail += _run_nli_autoais(normalized_output, claim) 298 | scores.append(entail / len(claims)) 299 | return 100 * np.mean(scores) 300 | 301 | 302 | def compute_autoais(data, 303 | decontext=False, 304 | concat=False, 305 | qampari=False, 306 | at_most_citations=None,): 307 | """ 308 | Compute AutoAIS score. 309 | 310 | Args: 311 | data: requires field `output` and `docs` 312 | - docs should be a list of items with fields `title` and `text` (or `phrase` and `sent` for QA-extracted docs) 313 | citation: check citations and use the corresponding references. 
314 | decontext: decontextualize the output 315 | """ 316 | 317 | global autoais_model, autoais_tokenizer 318 | if autoais_model is None: 319 | logger.info("Loading AutoAIS model...") 320 | autoais_model = AutoModelForSeq2SeqLM.from_pretrained(AUTOAIS_MODEL, torch_dtype=torch.bfloat16, max_memory=get_max_memory(), device_map="auto") 321 | autoais_tokenizer = AutoTokenizer.from_pretrained(AUTOAIS_MODEL, use_fast=False) 322 | 323 | logger.info(f"Running AutoAIS...") 324 | 325 | def _format_document(doc): 326 | """Format document for AutoAIS.""" 327 | 328 | if "sent" in doc: 329 | # QA-extracted docs 330 | return "Title: %s\n%s" % (doc['title'], doc['sent']) 331 | else: 332 | return "Title: %s\n%s" % (doc['title'], doc['text']) 333 | 334 | ais_scores = [] 335 | ais_scores_prec = [] 336 | 337 | sent_total = 0 338 | sent_mcite = 0 339 | sent_mcite_support = 0 340 | sent_mcite_overcite = 0 341 | autoais_log = [] 342 | for item in tqdm(data): 343 | # Get sentences by using NLTK 344 | if qampari: 345 | sents = [item['question'] + " " + x.strip() for x in item['output'].rstrip().rstrip(".").rstrip(",").split(",")] 346 | else: 347 | sents = sent_tokenize(item['output']) 348 | if len(sents) == 0: 349 | continue 350 | 351 | target_sents = [remove_citations(sent).strip() for sent in sents] 352 | 353 | entail = 0 354 | entail_prec = 0 355 | total_citations = 0 356 | print('len(sents):{}'.format(len(sents))) 357 | for sent_id, sent in enumerate(sents): 358 | target_sent = target_sents[sent_id] # Citation removed and (if opted for) decontextualized 359 | joint_entail = -1 # Undecided 360 | 361 | # Find references 362 | ref = [int(r[1:])-1 for r in re.findall(r"\[\d+", sent)] # In text citation id starts from 1 363 | logger.info(f"For `{sent}`, find citations {ref}") 364 | if len(ref) == 0: 365 | # No citations 366 | joint_entail = 0 367 | elif any([ref_id >= len(item['docs']) for ref_id in ref]): 368 | # Citations out of range 369 | joint_entail = 0 370 | else: 371 | if at_most_citations is not None: 372 | ref = ref[:at_most_citations] 373 | total_citations += len(ref) 374 | joint_passage = '\n'.join([_format_document(item['docs'][psgs_id]) for psgs_id in ref]) 375 | 376 | # If not directly rejected by citation format error, calculate the recall score 377 | if joint_entail == -1: 378 | print('joint_passage:\n{}'.format(joint_passage)) 379 | print('target_sent:\n{}'.format(target_sent)) 380 | print('*'*20) 381 | print() 382 | joint_entail = _run_nli_autoais(joint_passage, target_sent) 383 | autoais_log.append({ 384 | "question": item['question'], 385 | "output": item['output'], 386 | "claim": sent, 387 | "passage": [joint_passage], 388 | "model_type": "NLI", 389 | "model_output": joint_entail, 390 | }) 391 | 392 | entail += joint_entail 393 | if len(ref) > 1: 394 | sent_mcite += 1 395 | 396 | # calculate the precision score if applicable 397 | if joint_entail and len(ref) > 1: 398 | sent_mcite_support += 1 399 | # Precision check: did the model cite any unnecessary documents? 
400 | for psgs_id in ref: 401 | # condition A 402 | passage = _format_document(item['docs'][psgs_id]) 403 | nli_result = _run_nli_autoais(passage, target_sent) 404 | 405 | # condition B 406 | if not nli_result: 407 | subset_exclude = copy.deepcopy(ref) 408 | subset_exclude.remove(psgs_id) 409 | passage = '\n'.join([_format_document(item['docs'][pid]) for pid in subset_exclude]) 410 | nli_result = _run_nli_autoais(passage, target_sent) 411 | if nli_result: # psgs_id is not necessary 412 | flag = 0 413 | sent_mcite_overcite += 1 414 | else: 415 | entail_prec += 1 416 | else: 417 | entail_prec += 1 418 | else: 419 | entail_prec += joint_entail 420 | 421 | sent_total += len(sents) 422 | ais_scores.append(entail / len(sents)) 423 | ais_scores_prec.append(entail_prec / total_citations if total_citations > 0 else 0) # len(sents)) 424 | 425 | if sent_mcite > 0 and sent_mcite_support > 0: 426 | print("Among all sentences, %.2f%% have multiple citations, among which %.2f%% are supported by the joint set, among which %.2f%% overcite." % ( 427 | 100 * sent_mcite / sent_total, 428 | 100 * sent_mcite_support / sent_mcite, 429 | 100 * sent_mcite_overcite / sent_mcite_support 430 | )) 431 | 432 | def calculate_f1(precision, recall): 433 | if precision + recall == 0: 434 | return 0 435 | return 2 * (precision * recall) / (precision + recall) 436 | 437 | citation_recall = np.mean(ais_scores) 438 | citation_precision = np.mean(ais_scores_prec) 439 | return { 440 | "citation_rec": 100 * citation_recall, 441 | "citation_prec": 100 * citation_precision, 442 | "citation_f1": 100 * calculate_f1(citation_precision, citation_recall) 443 | } 444 | 445 | 446 | def compute_qampari_f1(data, cot=False): 447 | prec = [] 448 | rec = [] 449 | rec_top5 = [] 450 | f1 = [] 451 | f1_top5 = [] 452 | 453 | num_preds = [] 454 | for item in data: 455 | if cot: 456 | if ":" in item['output']: 457 | o = ':'.join(item['output'].split(":")[1:]) # try to separate the COT part and the answer list part. 458 | else: 459 | o = "" 460 | else: 461 | o = item['output'] 462 | preds = [normalize_answer(x.strip()) for x in o.rstrip().rstrip(".").rstrip(",").split(",")] 463 | preds = [p for p in preds if len(p) > 0] # delete empty answers 464 | num_preds.append(len(preds)) 465 | answers = [[normalize_answer(x) for x in ans] for ans in item['answers']] 466 | flat_answers = [item for sublist in answers for item in sublist] 467 | 468 | prec.append(sum([p in flat_answers for p in preds]) / len(preds) if len(preds) > 0 else 0) 469 | rec.append(sum([any([x in preds for x in a]) for a in answers]) / len(answers)) 470 | rec_top5.append(min(5, sum([any([x in preds for x in a]) for a in answers])) / min(5, len(answers))) 471 | if (prec[-1] + rec[-1]) == 0: 472 | f1.append(0) 473 | else: 474 | f1.append(2 * prec[-1] * rec[-1] / (prec[-1] + rec[-1])) 475 | if (prec[-1] + rec_top5[-1]) == 0: 476 | f1_top5.append(0) 477 | else: 478 | f1_top5.append(2 * prec[-1] * rec_top5[-1] / (prec[-1] + rec_top5[-1])) 479 | 480 | return { 481 | "num_preds": np.mean(num_preds), 482 | "qampari_prec": 100 * np.mean(prec), 483 | "qampari_rec": 100 * np.mean(rec), 484 | "qampari_rec_top5": 100 * np.mean(rec_top5), 485 | "qampari_f1": 100 * np.mean(f1), 486 | "qampari_f1_top5": 100 * np.mean(f1_top5), 487 | } 488 | 489 | def main(): 490 | parser = argparse.ArgumentParser() 491 | parser.add_argument("--f", type=str, required=True, help="Output file. 
Should have field `question`, `output`, (ROUGE) `answer`, \ 492 | (accuracy) `qa_pairs`, (AIS) `docs`") 493 | parser.add_argument('--eval_metric',required=True,choices=['default','correctness','custom']) 494 | parser.add_argument("--no_rouge", action="store_true", help="Do not evaluate ROUGE score") 495 | parser.add_argument("--qa", action="store_true", help="Use the QA model") 496 | parser.add_argument("--mauve", action="store_true", help="Use the mauve score model") 497 | parser.add_argument("--citations", action="store_true", help="Evaluation with citation") 498 | parser.add_argument("--at_most_citations", type=int, default=3, help="At most take this many documents (mostly for precision)") 499 | parser.add_argument("--claims_nli", action="store_true", help="Use claims for ELI5") 500 | parser.add_argument('--full_fitlog_hyper',default=0) 501 | 502 | # QAMPARI 503 | parser.add_argument("--cot", action="store_true", help="For QAMPARI, try to find colon and separate the COT and answer listing") 504 | 505 | args = parser.parse_args() 506 | 507 | if args.eval_metric == 'default': 508 | if 'asqa' in args.f: 509 | args.qa = 1 510 | args.mauve = 1 511 | args.citations = 1 512 | args.claims_nli = 0 513 | elif 'qampari' in args.f: 514 | args.citations = 1 515 | elif 'eli5' in args.f: 516 | args.citations = 1 517 | args.claims_nli = 1 518 | args.mauve = 1 519 | elif args.eval_metric == 'custom': 520 | pass 521 | elif args.eval_metric == 'correctness': 522 | if 'asqa' in args.f: 523 | args.qa = 0 524 | args.mauve = 0 525 | args.citations = 0 526 | args.claims_nli = 0 527 | elif 'qampari' in args.f: 528 | args.citations = 0 529 | elif 'eli5' in args.f: 530 | args.citations = 0 531 | args.claims_nli = 1 532 | args.mauve = 0 533 | 534 | with open(args.f) as f: 535 | data_with_config = json.load(f) 536 | data = data_with_config['data'] 537 | 538 | if "qampari" in args.f: 539 | args.no_rouge = True 540 | args.qa = False 541 | args.mauve = False 542 | args.decontext = False 543 | qampari = True 544 | else: 545 | qampari = False 546 | 547 | # Truncate by newline and remove on the fly search result 548 | logger.warning("We remove all the pre/appended space/newlines and we truncate the answer by the first newline.") 549 | logger.warning("We replace any on the fly search result to standard bracket citation format.") 550 | for i in range(len(data)): 551 | data[i]['output'] = data[i]['output'].strip().split("\n")[0] 552 | data[i]['output'] = data[i]['output'].replace("<|im_end|>", "") 553 | 554 | # Remove all citations for all non-AutoAIS evaluation 555 | normalized_data = copy.deepcopy(data) 556 | for i in range(len(normalized_data)): 557 | normalized_data[i]['output'] = remove_citations(normalized_data[i]['output']) 558 | 559 | result = {} 560 | result['length'] = compute_len(normalized_data) 561 | result['str_em'], result['str_hit'] = compute_str_em(normalized_data) 562 | logger.info('eval_result:{}'.format(result)) 563 | if qampari: 564 | result.update(compute_qampari_f1(normalized_data, cot=args.cot)) 565 | logger.info('eval_result:{}'.format(result)) 566 | if not args.no_rouge: 567 | result['rougeLsum'] = compute_rouge(normalized_data) 568 | logger.info('eval_result:{}'.format(result)) 569 | if args.qa: 570 | result.update(compute_qa(normalized_data)) 571 | logger.info('eval_result:{}'.format(result)) 572 | if args.mauve: 573 | result['mauve'] = compute_mauve(normalized_data) 574 | logger.info('eval_result:{}'.format(result)) 575 | if args.citations: 576 | result.update(compute_autoais(data, 
qampari=qampari, at_most_citations=args.at_most_citations)) 577 | logger.info('eval_result:{}'.format(result)) 578 | if args.claims_nli: 579 | result["claims_nli"] = compute_claims(normalized_data) 580 | logger.info('eval_result:{}'.format(result)) 581 | 582 | json.dump(result, open(args.f.replace("json", "score"), "w"), indent=4) 583 | 584 | 585 | if __name__ == "__main__": 586 | main() 587 | -------------------------------------------------------------------------------- /llm_retrieval_related/iterative_select_supporting_documents.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import threading 3 | import tqdm 4 | import openai 5 | from transformers import AutoTokenizer 6 | import time 7 | from typing import List, Dict, Tuple, Union 8 | import logging 9 | 10 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 11 | logger = logging.getLogger(__name__) 12 | logger.setLevel(logging.INFO) 13 | 14 | gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2') 15 | 16 | 17 | def truncate_doc_in_user_prompt(user_prompt): 18 | last_index = user_prompt.rfind('Content:\n') 19 | if last_index != -1: 20 | user_prompt = user_prompt[:last_index] 21 | 22 | user_prompt = '\n'.join(user_prompt.split('\n')[:-2]) 23 | user_prompt = user_prompt.strip() 24 | return user_prompt 25 | 26 | 27 | def letter_to_int(letter): 28 | if 'a' <= letter <= 'z': 29 | return ord(letter) - ord('a') 30 | elif 'A' <= letter <= 'Z': 31 | return ord(letter) - ord('A') 32 | else: 33 | print('letter:{}'.format(letter)) 34 | raise NotImplementedError 35 | 36 | 37 | def letter_to_int_upper(letter): 38 | if 'A' <= letter <= 'Z': 39 | return ord(letter) - ord('A') 40 | else: 41 | print('letter:{}'.format(letter)) 42 | raise NotImplementedError 43 | 44 | 45 | def letter_to_int_lower(letter): 46 | if 'a' <= letter <= 'z': 47 | return ord(letter) - ord('a') 48 | else: 49 | print('letter:{}'.format(letter)) 50 | raise NotImplementedError 51 | 52 | 53 | def int_to_letter_lower(n): 54 | if 0 <= n <= 25: 55 | return chr(n + ord('a')) 56 | else: 57 | raise ValueError('The entered integer must be between 0 and 25') 58 | 59 | 60 | def int_to_letter_upper(n): 61 | if 0 <= n <= 25: 62 | return chr(n + ord('A')) 63 | else: 64 | raise ValueError('The entered integer must be between 0 and 25') 65 | 66 | 67 | def create_stage2_select_prompt(questions: List[str], 68 | docs: List, 69 | k: int, 70 | idf_use_letter: str, 71 | use_title: int, 72 | stage2_select_system_prompt: str, 73 | used_doc_field: str, 74 | reverse_doc_order: bool=False) -> List[Dict]: 75 | """Create the prompt for selection in the 2nd stage. 76 | Args 77 | ---- 78 | questions: List[str] 79 | The question. 80 | docs: List 81 | The documents relevant to the question 82 | k: int 83 | A specified number of documents for answering the user's specific question(s). 84 | idf_use_letter: str 85 | Use uppercase letters, lowercase letters, or integers to mark the documents. 86 | use_title: int 87 | Whether to use title or not. 88 | stage2_select_system_prompt: str 89 | System prompt for instruction. 90 | used_doc_field_in_retrieval: str 91 | Which filed of document to use in retrieval. 92 | reverse_doc_order: bool=False 93 | Whether to reverse the order of document or not. 94 | 95 | Returns 96 | ------- 97 | prompt: List[Dict] 98 | Prompt for selection. 
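    Note
    ----
    The user message is laid out as "Question:\n<questions>\n\nk: <k>\n\nCandidate Documents:" followed by
    one block per document of the form "<identifier>\nTitle:\n<title>\nContent:\n<content>" (the title line
    is omitted when `use_title` is false); the system prompt is expected to instruct the model to reply
    with the identifiers of the k documents it selects.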
99 |     """
100 |     user_prompt = 'Question:\n{}\n\nk: {}\n\n'.format('\n'.join(questions), k)
101 |     user_prompt += 'Candidate Documents:\n\n'
102 | 
103 |     prompt_doc_str_list = []
104 | 
105 |     for i, doc in enumerate(docs):
106 |         if idf_use_letter == 'lower':
107 |             idf = int_to_letter_lower(i)
108 |         elif idf_use_letter == 'upper':
109 |             idf = int_to_letter_upper(i)
110 |         else:
111 |             idf = i + 1
112 |         if use_title:
113 |             prompt_doc_str_list.append('{}\nTitle:\n{}\nContent:\n{}\n\n'.format(idf, doc['title'], doc[used_doc_field]))
114 |         else:
115 |             prompt_doc_str_list.append('{}\nContent:\n{}\n\n'.format(idf, doc[used_doc_field]))
116 | 
117 |     if reverse_doc_order:
118 |         user_prompt += ''.join(list(reversed(prompt_doc_str_list)))
119 |     else:
120 |         user_prompt += ''.join(prompt_doc_str_list)
121 | 
122 |     prompt = [
123 |         {'role': 'system', 'content': stage2_select_system_prompt},
124 |         {'role': 'user', 'content': user_prompt.strip()}
125 |     ]
126 | 
127 |     return prompt
128 | 
129 | 
130 | def select_k_supporting_documents(questions: List[str],
131 |                                   tmp_selected_docs: List,
132 |                                   extra_docs_to_browse: List[Dict],
133 |                                   k: int,
134 |                                   selected_doc_first: int,
135 |                                   idf_use_letter: str,
136 |                                   use_title: int,
137 |                                   model_name: str,
138 |                                   stage2_select_system_prompt: str,
139 |                                   used_doc_field_in_retrieval: str,
140 |                                   thread: "instance") -> Dict:
141 |     """Select k supporting documents.
142 |     Args
143 |     ----
144 |     questions: List[str]
145 |         The question(s) to answer.
146 |     tmp_selected_docs: List
147 |         Documents already selected in previous rounds; they stay in the candidate pool.
148 |     extra_docs_to_browse: List[Dict]
149 |         New candidate documents to browse in this round.
150 |     k: int
151 |         A specified number of documents for answering the user's specific question(s).
152 |     selected_doc_first: int
153 |         Whether to place the already-selected documents before the new candidates in the prompt.
154 |     idf_use_letter: str
155 |         Use uppercase letters, lowercase letters, or integers to mark the documents.
156 |     use_title: int
157 |         Whether to include document titles in the prompt.
158 |     model_name: str
159 |         OpenAI model name.
160 |     stage2_select_system_prompt: str
161 |         System prompt for instruction.
162 |     used_doc_field_in_retrieval: str
163 |         Which field of the document to use.
164 | thread: "instance" 165 | pass 166 | """ 167 | unbrowsed_docs = [] 168 | 169 | assert idf_use_letter in ['upper', 'lower', 'int'] 170 | 171 | while 1: 172 | if selected_doc_first: 173 | docs_concat = tmp_selected_docs + extra_docs_to_browse 174 | else: 175 | docs_concat = extra_docs_to_browse + tmp_selected_docs 176 | messages = create_stage2_select_prompt(questions, docs_concat, k, idf_use_letter, use_title, stage2_select_system_prompt, used_doc_field_in_retrieval) 177 | prompt_token_num = len(gpt2_tokenizer.tokenize(messages[0]['content'] + messages[1]['content'])) 178 | if prompt_token_num > 3900: 179 | unbrowsed_docs.insert(0, extra_docs_to_browse[-1]) 180 | extra_docs_to_browse.pop() 181 | else: 182 | break 183 | if len(extra_docs_to_browse) == 0: 184 | break 185 | 186 | final_docs_in_query = [docs_concat] 187 | 188 | if len(unbrowsed_docs) > 0: 189 | logger.info('before openai query, unbrowsed_docs > 0 : {}'.format(len(unbrowsed_docs))) 190 | 191 | def repeat_until_success_call_openai_api(func): 192 | def wrapper(*args, **kw): 193 | while True: 194 | result = None 195 | try: 196 | result = func(*args, **kw) 197 | except openai.error.APIConnectionError as e: 198 | if thread.print_error: 199 | logger.info('openai connection error, so retry after sleeping 5 seconds') 200 | logger.info(e) 201 | time.sleep(5) 202 | except openai.error.RateLimitError as e: 203 | logger.info(type(e)) 204 | logger.info(e) 205 | logger.info('e._message:{}'.format(e._message)) 206 | if 'quota' in e._message: 207 | if thread.print_error: 208 | logger.info('now openai account {} runs out. so use next.'.format(thread.account[-1])) 209 | logger.info(type(e)) 210 | logger.info(e) 211 | thread.account = thread.openai_account_manager_multi_thread.get_next_account(thread.thread_id, 212 | thread.account) 213 | elif "maximum context length is" in e._message: 214 | unbrowsed_docs.insert(0, extra_docs_to_browse[-1]) 215 | extra_docs_to_browse.pop() 216 | 217 | if selected_doc_first: 218 | docs_concat = tmp_selected_docs + extra_docs_to_browse 219 | else: 220 | docs_concat = extra_docs_to_browse + tmp_selected_docs 221 | final_docs_in_query[0] = docs_concat 222 | messages = create_stage2_select_prompt(questions, docs_concat, k, idf_use_letter, use_title, stage2_select_system_prompt, used_doc_field_in_retrieval) 223 | print('in repeat_until_success_call_openai_api, docs < 20 : {}'.format( 224 | len(docs_concat))) 225 | kw['messages'] = messages 226 | else: 227 | if True: 228 | logger.info('openai rate limit error, so retry after sleeping 45 seconds') 229 | time.sleep(45) 230 | except openai.error.AuthenticationError as e: 231 | if 'This key is associated with a deactivated account' in e._message: 232 | logger.info('the account {} is deactivated. 
so use next'.format(thread.account[-1])) 233 | if thread.print_error: 234 | logger.info(e) 235 | thread.account = thread.openai_account_manager_multi_thread.get_next_account(thread.thread_id, 236 | thread.account) 237 | else: 238 | logger.info('meet unexpected AuthenticationError, so retry after sleeping 5 seconds') 239 | if thread.print_error: 240 | logger.info(e) 241 | thread.account = thread.openai_account_manager_multi_thread.get_next_account(thread.thread_id, 242 | thread.account) 243 | except openai.error.InvalidRequestError as e: 244 | if "maximum context length is" in e._message: 245 | unbrowsed_docs.insert(0, extra_docs_to_browse[-1]) 246 | extra_docs_to_browse.pop() 247 | 248 | if selected_doc_first: 249 | docs_concat = tmp_selected_docs + extra_docs_to_browse 250 | else: 251 | docs_concat = extra_docs_to_browse + tmp_selected_docs 252 | final_docs_in_query[0] = docs_concat 253 | messages = create_stage2_select_prompt(questions, docs_concat, k, idf_use_letter, use_title, stage2_select_system_prompt, used_doc_field_in_retrieval) 254 | print('in repeat_until_success_call_openai_api, docs < 20 : {}'.format(len(docs_concat))) 255 | kw['messages'] = messages 256 | 257 | except openai.error.OpenAIError as e: 258 | logger.info('meet unexpected openai error, so retry after sleeping 5 seconds') 259 | logger.info(e) 260 | logger.info(type(e)) 261 | time.sleep(3) 262 | 263 | except Exception as e: 264 | raise e 265 | 266 | if result != None: 267 | return result 268 | else: 269 | pass 270 | 271 | return wrapper 272 | 273 | @repeat_until_success_call_openai_api 274 | def tmp_func(messages): 275 | return openai.ChatCompletion.create(model=model_name, messages=messages, temperature=0, max_tokens=64, api_key=thread.account[-1]) 276 | 277 | if "gpt-3.5-turbo" in model_name: 278 | response = tmp_func(messages=messages) 279 | response = response['choices'][0]['message']['content'] 280 | else: 281 | raise NotImplementedError 282 | response = response.split('\n') 283 | if len(response) > 1: 284 | logger.info('response has > 1 lines, so just use its first line which has the selected documents') 285 | logger.warning(f"response: \n{response}") 286 | response = response[0] 287 | 288 | if len(unbrowsed_docs) > 0: 289 | logger.info('after openai query, unbrowsed_docs > 0 : {}'.format(len(unbrowsed_docs))) 290 | 291 | response_document_identifiers = response.replace(',', ' ').replace('[', ' ').replace(']', ' ').strip().split() 292 | selected_doc_idfs = [] 293 | 294 | docs_concat_in_openai_query = final_docs_in_query[0] 295 | for idf in response_document_identifiers: 296 | try: 297 | if idf_use_letter == 'upper': 298 | idf = letter_to_int_upper(idf) 299 | elif idf_use_letter == 'lower': 300 | idf = letter_to_int_lower(idf) 301 | else: 302 | idf = int(idf) - 1 303 | 304 | if idf >= len(docs_concat_in_openai_query): 305 | print('idf={}, response={}'.format(idf, response)) 306 | else: 307 | selected_doc_idfs.append(idf) 308 | except: 309 | pass 310 | 311 | if len(selected_doc_idfs) != k: 312 | print('len(retrieved_doc_idfs) != k, k:{}, len:{},\nresponse:\n{}response_document_identifiers:\n{}'.format(k, 313 | len(selected_doc_idfs), 314 | response, 315 | response_document_identifiers)) 316 | 317 | selected_doc_idfs = selected_doc_idfs[:k] 318 | 319 | docs_concat_in_openai_query = final_docs_in_query[0] 320 | 321 | result_dict = {} 322 | 323 | selected_docs = [] 324 | for idf in selected_doc_idfs: 325 | selected_docs.append(docs_concat_in_openai_query[idf]) 326 | result_dict['selected_docs'] = selected_docs 327 | 
328 |     original_openai_response = response
329 |     result_dict['original_openai_response'] = original_openai_response
330 | 
331 |     parsed_doc_idfs = selected_doc_idfs
332 |     result_dict['parsed_doc_idfs'] = parsed_doc_idfs
333 | 
334 |     result_dict['unbrowsed_docs'] = unbrowsed_docs
335 | 
336 |     return result_dict
337 | 
338 | 
339 | def iterative_select_supporting_documents_single(alce_item: Dict,
340 |                                                  k: int,
341 |                                                  window_size: int,
342 |                                                  reversed_browse_order: int,
343 |                                                  selected_doc_first: int,
344 |                                                  idf_use_letter: str,
345 |                                                  use_title: int,
346 |                                                  model_name: str,
347 |                                                  stage2_select_system_prompt: str,
348 |                                                  used_doc_field_in_retrieval: str,
349 |                                                  thread: "instance",
350 |                                                  use_sub_questions: int=0,
351 |                                                  old_selected_docs: List[Dict]=None,
352 |                                                  position: str=None,
353 |                                                  doc_num: int=100) -> Dict:
354 |     """Iteratively select supporting documents.
355 |     Args
356 |     ----
357 |     alce_item: Dict
358 |         A single data item with the question and its candidate `docs`.
359 |     k: int
360 |         A specified number of documents for answering the user's specific question(s).
361 |     window_size: int
362 |         Maximum number of documents shown to the model in one selection round.
363 |     reversed_browse_order: int
364 |         Whether to reverse the document order or not.
365 |     selected_doc_first: int
366 |         Whether to place the already-selected documents before the new candidates in the prompt.
367 |     idf_use_letter: str
368 |         Use uppercase letters, lowercase letters, or integers to mark the documents.
369 |     use_title: int
370 |         Whether to include document titles in the prompt.
371 |     model_name: str
372 |         Which model of OpenAI to use.
373 |     stage2_select_system_prompt: str
374 |         System prompt for instruction.
375 |     thread: "instance"
376 |         Which field of the document to use in retrieval.
377 |     thread: "instance"
378 |         The worker thread, which carries the OpenAI account state.
379 |     use_sub_questions: int=0
380 |         Whether to use the ASQA sub-questions instead of the main question.
381 |     old_selected_docs: List[Dict]=None
382 |         Documents selected with the previous query; there may be fewer than k of them.
383 |     position: str=None
384 |         Whether to prepend ("head") or append ("tail") the old selected docs to the candidate list.
385 |     doc_num: int=100
386 |         Number of top retrieved documents to consider for reranking.
387 | 
388 |     Returns
389 |     -------
390 |     output_alce_item: Dict
391 |         The input item with `docs` replaced by the selected documents.
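    Note
    ----
    Selection proceeds as a sliding window over the candidate list: in each round the model sees the
    currently selected documents plus the next batch of unseen candidates (up to `window_size` in total)
    and returns the identifiers of the k documents to keep; the kept documents are carried over into the
    next round until every candidate has been browsed.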
392 | """ 393 | output_alce_item = copy.deepcopy(alce_item) 394 | question = alce_item['question'] 395 | asqa_questions = None 396 | if use_sub_questions and 'qa_pairs' in alce_item: 397 | logger.warning("Use sub questions for asqa.") 398 | asqa_questions = list(map(lambda x: x['question'], list(alce_item['qa_pairs']))) 399 | 400 | if asqa_questions != None: 401 | questions = asqa_questions 402 | else: 403 | questions = [question] 404 | 405 | docs_to_browse = copy.deepcopy(alce_item['docs'][:doc_num]) 406 | logger.warning(f"The number of documents used for reranking is {len(docs_to_browse)}.") 407 | 408 | if old_selected_docs is not None and position == "head": 409 | logger.info("Add old selected docs into head.") 410 | old_selected_docs_copy = copy.deepcopy(old_selected_docs) 411 | docs_to_browse = old_selected_docs_copy + docs_to_browse 412 | elif old_selected_docs is not None and position == "tail": 413 | logger.info("Add old selected docs into tail.") 414 | old_selected_docs_copy = copy.deepcopy(old_selected_docs) 415 | docs_to_browse = docs_to_browse + old_selected_docs_copy 416 | 417 | if reversed_browse_order: 418 | docs_to_browse = list(reversed(docs_to_browse)) 419 | 420 | tmp_selected_docs = [] 421 | 422 | while len(docs_to_browse) > 0: 423 | # iteratively update tmp_selected_docs 424 | tmp_extra_docs_to_browse = docs_to_browse[:window_size - len(tmp_selected_docs)] 425 | docs_to_browse = docs_to_browse[window_size - len(tmp_selected_docs):] 426 | select_result_dict = select_k_supporting_documents(questions, tmp_selected_docs, tmp_extra_docs_to_browse, k, 427 | selected_doc_first, idf_use_letter, use_title, 428 | model_name, stage2_select_system_prompt, used_doc_field_in_retrieval, thread) 429 | 430 | tmp_selected_docs = select_result_dict['selected_docs'] 431 | original_openai_response = select_result_dict['original_openai_response'] 432 | parsed_doc_idfs = select_result_dict['parsed_doc_idfs'] 433 | unbrowsed_docs = select_result_dict['unbrowsed_docs'] 434 | 435 | docs_to_browse = unbrowsed_docs + docs_to_browse 436 | 437 | output_alce_item['docs'] = tmp_selected_docs 438 | 439 | return output_alce_item 440 | 441 | 442 | class OpenAI_API_inp_Manager_MultiThread_Generalized: 443 | def __init__(self, idx_non_general_inp: List[Tuple], general_inp: Dict) -> None: 444 | """Class init 445 | Args 446 | ---- 447 | idx_non_general_inp: List[Tuple] 448 | Data with index. 449 | general_inp: Dict 450 | Hyperparameter. 451 | """ 452 | self.idx_non_general_inp = idx_non_general_inp 453 | assert idx_non_general_inp[0][0] == 0, 'the 1st idx_non_general_inp"s idx is not 0, maybe something error' 454 | self.general_inp = general_inp 455 | self.inp_lock = threading.Lock() 456 | self.progress_index = 0 457 | 458 | assert type(general_inp) == type({}) 459 | 460 | 461 | def get_next_idx_inp(self) -> Union[List, None]: 462 | """ 463 | Get next new data. 
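        Returns [idx, merged_input], where merged_input combines the item-specific fields with the
        shared `general_inp` kwargs, or None once every item has been handed out.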
464 | """ 465 | with self.inp_lock: 466 | if self.progress_index < len(self.idx_non_general_inp): 467 | tmp_idx = self.idx_non_general_inp[self.progress_index][0] 468 | tmp_non_general_inp = self.idx_non_general_inp[self.progress_index][1] 469 | tmp_general_inp = self.general_inp 470 | assert len(set(tmp_general_inp.keys()) & set(tmp_non_general_inp)) == 0, 'tmp_non_general_inp and tmp_general_inp has key overlap, must have problem' 471 | self.progress_index += 1 472 | return [tmp_idx, {**tmp_non_general_inp, **tmp_general_inp}] 473 | else: 474 | return None 475 | 476 | 477 | class MyThread(threading.Thread): 478 | # todo: Adjust MyThread from calling_sliding_window to two_stage_retrieve 479 | def __init__(self, thread_id: int, account_manager: "instance", inp_manager: "instance", print_error: bool, pbar: tqdm.tqdm, print_finish: bool=True) -> None: 480 | """Class init. 481 | Args 482 | ---- 483 | thread_id: int 484 | Thread id. 485 | account_manager: "instance" 486 | A manager for accounts of OpenAI. 487 | inp_manager: "instance" 488 | A manager for data. 489 | print_error: bool 490 | Whether to output error info or not. 491 | pbar: tqdm.tqdm 492 | Object of tqdm. 493 | print_finish: bool=True 494 | Whether to output ending info or not. 495 | """ 496 | threading.Thread.__init__(self) 497 | self.thread_id = thread_id 498 | self.openai_account_manager_multi_thread = account_manager 499 | self.openai_inp_manager = inp_manager 500 | self.account = self.openai_account_manager_multi_thread.get_next_account(self.thread_id) 501 | self.print_error = print_error 502 | self.pbar = pbar 503 | self.print_finish = print_finish 504 | 505 | 506 | def run(self): 507 | self.results_with_idx = [] 508 | while True: 509 | tmp = self.openai_inp_manager.get_next_idx_inp() 510 | if tmp == None: 511 | if self.print_finish: 512 | logger.info('thread {} finish'.format(self.thread_id)) 513 | return 514 | else: 515 | tmp_idx = tmp[0] 516 | select_doc_input = tmp[1] 517 | result = iterative_select_supporting_documents_single(**select_doc_input, thread=self) 518 | if self.pbar is not None: 519 | self.pbar.update(1) 520 | self.results_with_idx.append([tmp_idx, result]) 521 | 522 | 523 | from openai_account_manager import get_account_manager 524 | 525 | def iterative_select_supporting_documents_multi_thread(items_to_select: List[Dict], 526 | general_input: Dict, 527 | num_threads: int, 528 | use_tqdm: bool=True, 529 | old_data: List[Dict]=None) -> List: 530 | """Iteratively select supporting documents in a multi-threaded manner. 531 | Args 532 | ---- 533 | items_to_select: List[Dict] 534 | Candidate documents for selection. 535 | general_input: Dict 536 | Hyperparameter. 537 | num_threads: int 538 | Number of Thread. 539 | use_tqdm: bool 540 | Whether to use tqdm or not. 541 | old_data: List[Dict]=None 542 | Old data before updating query. 543 | 544 | Returns 545 | ------- 546 | results: List 547 | Selected supporting documents. 
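    Example
    -------
    A minimal sketch (the values below are illustrative, not defaults). `general_input` supplies the
    shared keywords of iterative_select_supporting_documents_single; the per-item fields (`alce_item`,
    `old_selected_docs`) and `thread` are filled in automatically::

        general_input = {
            'k': 5, 'window_size': 20, 'reversed_browse_order': 0, 'selected_doc_first': 1,
            'idf_use_letter': 'int', 'use_title': 1, 'model_name': 'gpt-3.5-turbo',
            'stage2_select_system_prompt': system_prompt,  # assumed to be loaded elsewhere, e.g. from llm_retrieval_prompt_drafts
            'used_doc_field_in_retrieval': 'text',
        }
        reranked_items = iterative_select_supporting_documents_multi_thread(items, general_input, num_threads=8)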
548 | """ 549 | new_items_to_select = [] 550 | if old_data is None: 551 | logger.info("Old data is None...") 552 | for item in items_to_select: 553 | new_items_to_select.append({'alce_item': item}) 554 | else: 555 | logger.info("Use old data...") 556 | question_to_docs = {item["question"]: item["docs"] for item in old_data} 557 | for item in items_to_select: 558 | new_items_to_select.append({'alce_item': item, "old_selected_docs": question_to_docs[item["question"]]}) 559 | idx_items_to_select = list(enumerate(new_items_to_select)) # List[Tuple(index, item)] 560 | account_manager = get_account_manager('openai_account_files/used.txt', 'openai_account_files/accounts.txt', multi_thread=True) 561 | inp_manager = OpenAI_API_inp_Manager_MultiThread_Generalized(idx_items_to_select, general_input) 562 | 563 | if use_tqdm: 564 | pbar = tqdm.tqdm(total=len(idx_items_to_select)) 565 | else: 566 | pbar = None 567 | 568 | thread_list = [] 569 | for i in range(num_threads): 570 | thread_list.append(MyThread(i, account_manager, inp_manager, True, pbar)) 571 | 572 | for t in thread_list: 573 | t.start() 574 | for i, t in enumerate(thread_list): 575 | t.join() 576 | 577 | results_with_idx = [] 578 | 579 | for t in thread_list: 580 | results_with_idx.extend(t.results_with_idx) 581 | 582 | results_with_idx.sort(key=lambda x: x[0]) 583 | results = list(map(lambda x: x[1], results_with_idx)) 584 | return results 585 | --------------------------------------------------------------------------------