├── .gitignore
├── README.md
├── figs
    └── Jack-Ma_Elon-Musk.jpeg
└── prompt_evaluation.py


/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | .DS_Store
3 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # PanelGPT: Prompt Language Models with a Panel Discussion
 2 | > **Three cobblers with their wits combined equal Zhuge Liang the mastermind. --- an old saying in China.**
 3 | # 💡: 💁🏼🎤     (👾💬) (🤖💭) (🤯🗯)
 4 | 
 5 | 
 6 | 
 7 | We introduce new zero-shot prompting magic words that improve language models' reasoning ability: **Panel Discussion**!
 8 | 
 9 | ## Motivation
10 | 
11 | Conferences and workshops regularly host **panel discussions** in which experts exchange opinions on a given topic. These exchanges help participants grasp new concepts, shift their perspectives, and reach a more comprehensive understanding of ongoing debates.
12 | 
13 | <figure>
14 |   <figcaption> Figure 1: A panel discussion between Jack Ma and Elon Musk at WAIC 2019. </figcaption>
15 |   <img src="figs/Jack-Ma_Elon-Musk.jpeg" alt="panel discussion between Jack Ma and Elon Musk" width="600">
16 | </figure>
17 | 
18 | This idea is related to self-consistency [(Wang, Xuezhi, et al.)](https://arxiv.org/pdf/2203.11171.pdf), since multiple experts may disagree with each other during a panel discussion.
19 | 
20 | 
21 | ## Empirical Results on Benchmarks
22 | 
23 | We evaluate the proposed prompting method on the GSM8K dataset using the gpt-3.5-turbo API.
24 | 
25 | Evaluating each prompt on the first 1k examples of the GSM8K test set costs less than 2 USD.
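
As a rough illustration of how such a cost figure can be derived, `prompt_evaluation.py` prints the total number of tokens used, and that count can be multiplied by a per-token price. The sketch below assumes a single flat price per 1k tokens; the function name, the token count, and the price are hypothetical placeholders, not measured values or official pricing.

```python
def estimate_cost_usd(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """Estimate API cost from a token count and a flat per-1k-token price."""
    return round(total_tokens / 1000 * usd_per_1k_tokens, 2)

# Illustrative numbers only: 1000 questions at a hypothetical 600 tokens each,
# priced at a hypothetical 0.002 USD per 1k tokens.
print(estimate_cost_usd(total_tokens=1000 * 600, usd_per_1k_tokens=0.002))
```

Real pricing usually differs between prompt and completion tokens, so this should be read as an upper-level estimate rather than an exact bill.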
26 | 
27 | Our panel discussion prompt achieves the **best performance**, and the ablation studies show how much each element of the prompt contributes. In the ablation rows, "w/o AE" drops the word _panel_ and "w/o EA" drops the clause _and avoid penalty_.
28 | The <ins>second best</ins> and _third best_ results are marked with underline and _italic_, respectively.
29 | 
30 | | Method\Dataset  | GSM8K (test 1k) | Prompt Content | Reference | 
31 | |---------|---------|---------|-----------|
32 | | No-Prompt | 0.789 | The answer is:| - |
33 | | Zero-Shot CoT | 0.854 | Let's think step by step: | [(Kojima, Takeshi, et al. 2022)](https://arxiv.org/pdf/2205.11916.pdf) | 
34 | | APE Improved CoT | 0.845 | Let’s work this out in a step by step way to be sure we have the right answer:| [(Zhou, Yongchao, et al. 2023)](https://arxiv.org/pdf/2211.01910.pdf) |
35 | | ToT Prompting | 0.842 | Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave | [(Dave Hulbert's Repo 2023)](https://github.com/dave1010/tree-of-thought-prompting) | 
36 | | PanelGPT |  **0.899** | 3 experts are discussing the question with a _panel_ discussion, trying to solve it step by step, and make sure the result is correct _and avoid penalty_: | (This Repo, July 18, 2023) |
37 | | PanelGPT w/o AE and EA |  0.878 | 3 experts are discussing the question with a discussion, trying to solve it step by step, and make sure the result is correct: | (Ours, Ablation Study) |
38 | | PanelGPT w/o AE | 0.840 | 3 experts are discussing the question with a discussion, trying to solve it step by step, and make sure the result is correct and avoid penalty: | (Ours, Ablation Study) |
39 | | PanelGPT w/o EA |  <ins>0.894</ins>  | 3 experts are discussing the question with a panel discussion, trying to solve it step by step, and make sure the result is correct: | (Ours, Ablation Study) |
40 | | P<ins>e</ins>n<ins>a</ins>lGPT  (Mis-spelled) |  _0.883_  | 3 experts are discussing the question with a penal discussion, trying to solve it step by step, and make sure the result is correct: | (Ours, Ablation Study) |
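
To make the gaps in the table easier to read, the accuracies can be turned into absolute gains in percentage points. This is a small sketch using only the numbers from the table above; the helper name `gain_over` is introduced here for illustration.

```python
# Accuracies copied from the table above (GSM8K, first 1k test questions).
accuracy = {
    "No-Prompt": 0.789,
    "Zero-Shot CoT": 0.854,
    "APE Improved CoT": 0.845,
    "ToT Prompting": 0.842,
    "PanelGPT": 0.899,
}

def gain_over(baseline: str, method: str = "PanelGPT") -> float:
    """Absolute accuracy gain of `method` over `baseline`, in percentage points."""
    return round((accuracy[method] - accuracy[baseline]) * 100, 1)

for name in accuracy:
    if name != "PanelGPT":
        print(f"PanelGPT vs {name}: +{gain_over(name)} points")
```
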
41 | ## Related Works
42 | 
43 | ### Zero-Shot and Few-Shot Prompting 
44 | Zero-shot prompting ability emerges in language models trained on large amounts of data, such as GPT-3 and GPT-4 (Ouyang et al., 2022; OpenAI, 2023). Wei et al. (2021) further showed that instruction fine-tuning improves the zero-shot learning ability of language models.
45 | 
46 | Despite their impressive zero-shot performance, large language models often perform suboptimally on more complex tasks in the zero-shot setting. Few-shot prompting offers a viable way to enable in-context learning (Brown et al., 2020; Min et al., 2022). It places demonstrations in the prompt, which condition the model on the examples that follow and steer it toward better responses.
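
A minimal sketch of how demonstrations might be placed in a prompt is shown below. The `Q:`/`A:` layout, the function name, and the example questions are assumptions made for illustration; they are not taken from the cited papers or from this repository.

```python
def build_few_shot_prompt(demos, question):
    """Concatenate (question, answer) demonstrations ahead of the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in demos]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Made-up demonstrations, used only to show the prompt layout.
demos = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?", "5"),
    ("A pen costs 4 dollars. How much do 3 pens cost?", "12"),
]
prompt = build_few_shot_prompt(
    demos, "Sara reads 10 pages a day. How many pages does she read in 7 days?"
)
print(prompt)
```
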
47 | 
48 | ### Chain-of-Thought Prompting 
49 | 
50 | On more challenging tasks such as complex arithmetic, commonsense, and symbolic reasoning, chain-of-thought (CoT) prompting has been shown to help language models reach correct answers more effectively (Wei et al., 2022). CoT adds intermediate reasoning steps to the few-shot prompting examples. Kojima et al. (2022) further introduce zero-shot CoT, showing that adding a task-agnostic instruction can improve model performance on specific tasks. Auto-CoT (Zhang et al., 2022b) combines the universality of zero-shot CoT with the demonstration-driven capability of the original CoT, automatically constructing demonstrations that benefit CoT reasoning through clustering and diversity-based sampling.
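
Zero-shot CoT needs no demonstrations: a task-agnostic trigger phrase is simply attached to the question. The sketch below uses two trigger phrases copied from the results table in this README. The helper name `add_trigger`, the single-space separator, and the example question are assumptions for illustration.

```python
# Trigger phrases copied from the results table in this README.
TRIGGERS = {
    "zero_shot_cot": "Let's think step by step:",
    "panelgpt": (
        "3 experts are discussing the question with a panel discussion, "
        "trying to solve it step by step, and make sure the result is correct "
        "and avoid penalty:"
    ),
}

def add_trigger(question: str, trigger: str) -> str:
    """Append a task-agnostic trigger phrase to a question (zero-shot, no demos)."""
    # Assumption: a single space separates the question from the trigger.
    return f"{question.strip()} {trigger}"

question = "A farmer has 12 cows and sells 5. How many cows are left?"  # made-up example
print(add_trigger(question, TRIGGERS["zero_shot_cot"]))
```
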
51 | 
52 | ### Other Prompting Strategies 
53 | Wang et al. (2022) improve few-shot CoT by sampling multiple diverse reasoning paths and marginalizing over them, choosing the most consistent answer among all sampled paths. Generated Knowledge Prompting (Liu et al., 2021) improves commonsense reasoning by incorporating knowledge or information related to the question to make more accurate predictions. Tree-of-thoughts (ToT) methods (Long, 2023; Yao et al., 2023) combine tree-based planning with the reasoning skills of language models and solve hard reasoning problems step by step over multi-round conversations. Hulbert (2023) puts forward a related idea that leverages multiple thoughts of a language model in a single prompt. Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) combines parametric memory with non-parametric memory such as Wikipedia to complete knowledge-intensive tasks. MoT (Li & Qiu, 2023) pre-thinks on an external unlabeled dataset and then recalls the related knowledge during inference.
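
The self-consistency idea of Wang et al. (2022), picking the most consistent answer across sampled reasoning paths, can be approximated by a simple majority vote over final answers. This is a simplified sketch, not the authors' implementation; the function name and the sample answers are hypothetical.

```python
from collections import Counter

def majority_answer(sampled_answers):
    """Return the most frequent non-empty answer, or None if there is none."""
    counts = Counter(a.strip() for a in sampled_answers if a and a.strip())
    if not counts:
        return None
    # most_common(1) yields the answer with the highest count; ties resolve
    # in favor of the answer seen first.
    return counts.most_common(1)[0][0]

# Hypothetical final answers extracted from five sampled reasoning paths.
print(majority_answer(["18", "20", "18", "17", "18"]))
```
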
54 | 
55 | ### Extending the Idea to Automated Prompt Generation with Offline Inverse RL
56 | [Prompt-OIRL](https://arxiv.org/abs/2309.06553) uses offline inverse reinforcement learning for offline prompt evaluation and optimization. The method is efficient and effective: training Prompt-OIRL on a MacBook Air with an M2 chip takes only 1 hour, yet the **performance of different LLMs on arithmetic reasoning tasks can be improved by up to 24\%.**
57 | 
58 | 
59 | 
60 | ## Citation
61 | 
62 | If you use our code or prompt, please consider citing our papers:
63 | 
64 | ```bibtex
65 | @inproceedings{sun2023query,
66 |   title={Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL},
67 |   author={Sun, Hao and H{\"u}y{\"u}k, Alihan and van der Schaar, Mihaela},
68 |   booktitle={The Twelfth International Conference on Learning Representations},
69 |   year={2023}
70 | }
71 | 
72 | @article{sun2023reinforcement,
73 |   title={Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond},
74 |   author={Sun, Hao},
75 |   journal={arXiv preprint arXiv:2310.06147},
76 |   year={2023}
77 | }
78 | ```


--------------------------------------------------------------------------------
/figs/Jack-Ma_Elon-Musk.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/holarissun/PanelGPT/f3065baf93004e8d46e8163e1afe88851d2ad9f0/figs/Jack-Ma_Elon-Musk.jpeg


--------------------------------------------------------------------------------
/prompt_evaluation.py:
--------------------------------------------------------------------------------
 1 | '''
 2 | File: prompt_evaluation.py
 3 | Author: Hao Sun
 4 | Email: hs789@cam.ac.uk
 5 | Date: July 18, 2023
 6 | Description: This code performs parallelized prompt evaluation on the GSM8K dataset.
 7 | Requirement: Clone the GSM8K dataset first: https://github.com/openai/grade-school-math
 8 | '''
 9 | 
10 | # get_examples and extract_answer come from dataset.py in the cloned grade-school-math repo
11 | from dataset import get_examples, extract_answer
12 | import time
13 | import numpy as np
14 | import os
15 | import openai
16 | from concurrent.futures import ThreadPoolExecutor
17 | OPENAI_API_KEY = 'YOUR API KEY HERE'
18 | 
19 | def init_openai():
20 |     openai.api_key = OPENAI_API_KEY
21 | 
22 | def create_chat_completion(prompt, tokens=512, temperature=0, error_count=0):
23 |     try:
24 |         response = openai.ChatCompletion.create(
25 |             model="gpt-3.5-turbo",
26 |             messages=[
27 |                 {"role": "system", "content": "You are a helpful assistant."},
28 |                 {"role": "user", "content": prompt},
29 |             ],
30 |             max_tokens=tokens,
31 |             temperature=temperature,
32 |         )
33 |         return response
34 |     except Exception:
35 |         # back off for 10 seconds before retrying (e.g. on rate limits or transient API errors)
36 |         time.sleep(10)
37 |         error_count += 1
38 |         if error_count > 600:
39 |             raise Exception("Too many errors")
40 |         return create_chat_completion(prompt, tokens, temperature, error_count)
41 | 
42 | def test_single_example(example, prompt):
43 |     qn, asr = example["question"], example["answer"]
44 |     start_time = time.time()
45 | 
46 |     response = create_chat_completion(qn + prompt)
47 |     response_content = response.choices[0].message.content
48 | 
49 |     check_response = create_chat_completion(f"The correct answer is {extract_answer(asr)}. Is the following answer correct? {response_content}.")
50 |     check_response_content = check_response.choices[0].message.content
51 | 
52 |     is_correct = 'Yes' in check_response_content
53 | 
54 |     total_time = time.time() - start_time
55 |     total_tokens = (response.usage.prompt_tokens + response.usage.completion_tokens +
56 |                     check_response.usage.prompt_tokens + check_response.usage.completion_tokens)
57 | 
58 |     return is_correct, response_content, total_time, total_tokens
59 | 
60 | def run_test(test_examples, prompt, num_threads=10):
61 |     with ThreadPoolExecutor(max_workers=num_threads) as executor:
62 |         results = list(executor.map(test_single_example, test_examples, [prompt]*len(test_examples)))
63 | 
64 |     return {
65 |         'result_list': [result[0] for result in results],
66 |         'answers_list': [result[1] for result in results],
67 |         'total_time': sum(result[2] for result in results),
68 |         'accuracy': sum(result[0] for result in results) / len(results),
69 |         'total_tokens': sum(result[3] for result in results),
70 |         'average_tokens': sum(result[3] for result in results) / len(results)
71 |     }
72 | 
73 | def eval_prompt(prompt=" ", n_eval=1000, file_alias='alias', num_threads=500):
74 |     init_openai()
75 |     test_examples = get_examples("test")[:n_eval]
76 | 
77 |     results = run_test(test_examples, prompt, num_threads=num_threads)
78 | 
79 |     print('Total time:', results['total_time'])
80 |     print('Accuracy:', results['accuracy'])
81 |     print('Total tokens used:', results['total_tokens'])
82 |     print('Average tokens used per question:', results['average_tokens'])
83 |     os.makedirs('results', exist_ok=True)  # np.save fails if the output directory is missing
84 |     np.save(f'results/result_list_{file_alias}.npy', results['result_list'])
85 |     np.save(f'results/answers_list_{file_alias}.npy', results['answers_list'])
86 |     np.save(f'results/prompt_{file_alias}.npy', prompt)
87 | 
88 | if __name__ == '__main__':
89 |     PROMPT = "To make sure to get the correct answer, we have 3 experts in a step-by-step discussion to solve the problem."
90 |     wall_time_start = time.time()
91 |     eval_prompt(prompt=PROMPT)
92 |     print('prompt:', PROMPT)
93 |     print(f"Wall time: {time.time() - wall_time_start}")
94 | 


--------------------------------------------------------------------------------