├── .gitignore ├── LICENSE ├── README.md ├── compute_verification.py ├── evaluation ├── README.md ├── SVAMP.json ├── evaluation_SVAMP.py ├── output_SVAMP_eval_N=1.csv ├── plot.py └── plots │ ├── SVAMP_complementary.png │ ├── SVAMP_individual_metrics.png │ └── SVAMP_radar_chart.png ├── generating_algebraic_template.py ├── llm_inference ├── README.md ├── azure_openai_service.py ├── llm_factory.py ├── llm_interface.py └── openai_service.py ├── main.py ├── math_prompts_generation.py ├── requirements.txt └── tests ├── test_cases.py └── testing.py /.gitignore: -------------------------------------------------------------------------------- 1 | .env 2 | __pycache__ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Ramon Kaspar 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MathPrompter Implementation 2 | 3 | This repository contains an implementation of the techniques presented in the research paper ["MathPrompter: Mathematical Reasoning Using Large Language Models"](https://arxiv.org/abs/2303.05398) by Shima Imani, Liang Du, and Harsh Shrivastava from Microsoft Research. The implementation aims to replicate the improved performance of Large Language Models (LLMs) in arithmetic reasoning tasks using the MathPrompter technique. 4 | 5 | ## Disclaimer 6 | 7 | This project is an independent implementation of the techniques described in the ["MathPrompter: Mathematical Reasoning Using Large Language Models"](https://arxiv.org/abs/2303.05398) paper by Microsoft researchers. It is not officially associated with the original authors or Microsoft. For the official and original research, please refer to the cited paper. 8 | 9 | ## Custom Changes from the Original Paper 10 | 11 | ### Including units in the Generation of Algebraic Template 12 | 13 | The original paper omits the units (i.e. "$50" is replaced by A). I believe that units are important for reasoning (assume the question is "What is 1m divided by 20cm?"). Therefore, I will leave the units as part of the question. 14 | 15 | ### Custom Few-Shot Prompts 16 | 17 | We have adopted customized few-shot prompts instead of those proposed in the original paper. 
These prompts have shown improved performance and consistency on platforms like Google Gemini Pro and Azure OpenAI GPT-3.5-Turbo. 18 | 19 | Example prompt for arithmetic expression generation: 20 | 21 | ```plaintext 22 | : John has A apples. He gives B apples to his friend. How many apples does John have left? 23 | Answer = A - B 24 | ... 25 | : {question} 26 | ``` 27 | 28 | However, you should experiment with different prompts and test which prompts work best for your specific LLM API, as effectiveness can vary depending on the model. 29 | 30 | ## Evaluating Performance on the SVAMP Dataset 31 | 32 | For evaluating our implementation, we benchmarked it on the [SVAMP](https://github.com/arkilpatel/SVAMP/tree/main) dataset (which contains 1000 math word problems). For the detailed evaluation process, please look at the README in the evaluation folder. 33 | 34 | To manage costs effectively, each prompt was processed exactly once, an approach we term as `self_consistency=1`. Given that each problem is addressed in a single model run, we set the `temperature` parameter to 0. This ensures that the model response is deterministic, providing the most probable and stable output for each input without the variability that higher temperature settings would introduce. 35 | 36 | The evaluation of MathPrompter on the SVAMP dataset achieved an accuracy of **63.9%**. 37 | 38 | ![SVAMP dataset performance evaluation plot](evaluation/plots/SVAMP_individual_metrics.png) 39 | 40 | The evaluation of MathPrompter using the SVAMP dataset reveals distinct strengths across different methodologies. The comprehensive approach (MathPrompter Total) demonstrates respectable accuracy (63.9%) with a low hallucination rate (10.9%), indicating reliable problem-solving capabilities. Specialized methods such as Algebraic Only and Python Only exhibit higher accuracy rates of 77.3% and 70.5% respectively but also show increased hallucination rates. This points to their heightened sensitivity and potential to overfit specific problem types. 41 | 42 | For a comparative analysis of how our results hold up against current state-of-the-art methodologies, you can visit the [PaperWithCode Leaderboard for the SVAMP benchmark](https://paperswithcode.com/sota/math-word-problem-solving-on-svamp). 43 | 44 | Notably, when considering only models that operate without the use of additional training data, our implementation ranks 4th. 45 | 46 | ## Installation 47 | 48 | Clone this repository and navigate into the project directory. Install the required dependencies: 49 | 50 | ```bash 51 | git clone https://github.com/RamonKaspar/MathPrompter.git 52 | cd MathPrompter 53 | pip install -r requirements.txt 54 | ``` 55 | 56 | ## Usage 57 | 58 | To use the MathPrompter, run the main script or import functions directly into your Python projects: 59 | 60 | ```bash 61 | python main.py 62 | ``` 63 | 64 | Detailed documentation on function usage and parameters can be found in the docstrings within the code. 65 | 66 | For instructions on setting up a connection to an LLM API, please consult the `README.md` file located in the `llm_inference` directory. 67 | 68 | ## Future Work and Enhancements 69 | 70 | While the current implementation of MathPrompter provides a foundational approach to solving algebraic questions, there are several enhancements and optimizations that can further improve its performance and functionality: 71 | 72 | 1. **Parallelization of API Calls**: 73 | 74 | - Implement parallel processing to handle API calls more efficiently. 
This could significantly speed up computations by making simultaneous requests. 75 | 76 | 2. **Probabilistic Results Implementation**: 77 | 78 | - Refine the result evaluation mechanism to return a probability along with each result. Currently, a consensus needs to be reached across all iterations for a result to be returned. 79 | 80 | 3. **Use Different Model Parameters**: 81 | - We typically prompt the LLM $N$ times (default is $N=5$) to generate algebraic expressions and Python code. By varying model parameters such as temperature and top_p in future iterations, we could potentially obtain more diverse answers. 82 | 83 | ## Citing the Original Work 84 | 85 | This implementation is based on the following work: 86 | 87 | ```bibtex 88 | @article{imani2023mathprompter, 89 | title={MathPrompter: Mathematical Reasoning Using Large Language Models}, 90 | author={Imani, Shima and Du, Liang and Shrivastava, Harsh}, 91 | journal={arXiv preprint arXiv:2303.05398}, 92 | year={2023} 93 | } 94 | ``` 95 | 96 | For more detail and to read the paper, you can access it [here](https://arxiv.org/abs/2303.05398). 97 | -------------------------------------------------------------------------------- /compute_verification.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | def create_random_input(mapping): 4 | """ Generate a random test input based on variable mappings. """ 5 | return {var: random.randint(1, 500) for var in mapping} 6 | 7 | def evaluate_expression(expression, mapping): 8 | """ Evaluates a mathematical expression safely using a dictionary of variable replacements. """ 9 | try: 10 | local_dict = mapping.copy() 11 | return eval(expression, {"__builtins__": None}, local_dict) 12 | except SyntaxError as e: 13 | print(f"Syntax error in the expression: {e}") 14 | return None 15 | except Exception as e: 16 | print(f"Runtime error during expression evaluation: {e}") 17 | return None 18 | 19 | def execute_python_code(code, mapping): 20 | """ 21 | Executes Python code with a specified input mapping. 22 | This code assumes the function 'solution' is defined within the passed code. 23 | """ 24 | local_locals = mapping.copy() 25 | wrapped_code = code + f"\nresult = solution({', '.join(f'{k}={v}' for k, v in mapping.items())})" 26 | try: 27 | exec(wrapped_code, {"__builtins__": None}, local_locals) 28 | if 'result' in local_locals: 29 | return local_locals['result'] 30 | else: 31 | print("No 'result' key was found in the local variables after executing the code.") 32 | return None 33 | except Exception as e: 34 | print(f"Error when executing Python code: {e}") 35 | return None 36 | 37 | def perform_computational_verification(expression, code, variable_mapping, sample_size=5): 38 | """ 39 | Verifies that the execution results of both an expression and Python code are consistent across random inputs. 40 | This verification is repeated for a defined number of times specified by 'sample_size'. 41 | If consistent results are found, it evaluates them again with actual mappings to ensure accuracy. 
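
    Illustrative example (hypothetical inputs, in the formats produced by math_prompts_generation.py):
    with expression "A - B", code "def solution(A, B):\n    return A - B" and
    variable_mapping {"A": 5.0, "B": 2.0}, both representations agree on every random
    test input, so the final evaluation with the real mapping returns 3.0; any
    disagreement along the way makes the function return None instead.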
42 | """ 43 | for _ in range(sample_size): 44 | test_input = create_random_input(variable_mapping) 45 | exp_result = evaluate_expression(expression, test_input) 46 | code_result = execute_python_code(code, test_input) 47 | 48 | if exp_result != code_result: 49 | return None 50 | 51 | # Consensus found, calculate result for real variable mapping 52 | exp_result = evaluate_expression(expression, variable_mapping) 53 | code_result = execute_python_code(code, variable_mapping) 54 | if exp_result == code_result: 55 | return exp_result 56 | else: 57 | return None 58 | 59 | 60 | 61 | 62 | 63 | -------------------------------------------------------------------------------- /evaluation/README.md: -------------------------------------------------------------------------------- 1 | # Evaluation of MathPrompter on SVAMP Dataset 2 | 3 | This directory contains the materials used for evaluating the MathPrompter application on the SVAMP dataset. The evaluation aims to assess the accuracy and hallucination rate of MathPrompter in interpreting and solving arithmetic word problems. 4 | 5 | ## Contents 6 | 7 | - `SVAMP.json`: The dataset used for evaluation, containing arithmetic word problems. 8 | - `evaluation_SVAMP.py`: Python scripts that perform the evaluations, outputting accuracy and hallucination metrics. 9 | - `plot.py`: Script to plot the results of the evaluation, comparing performance metrics across different methodologies. 10 | - `plots`: Folder containing all generated plots. Reproduce them by running `plot.py` with specified arguments. 11 | - `data`: Folder containing pandas dataframes saved as `.csv` files, which hold the evaluation data for each dataset. 12 | 13 | ## API Usage 14 | 15 | For the purposes of this evaluation, OpenAI API with GPT-3.5-Turbo was utilized due to its accessibility and cost-effectiveness (compraed to GPT-4). 16 | 17 | ## Results 18 | 19 | In our research, we sought to compare our implementation with the MathPrompter model, which utilized the `GPT-3 DaVinci` model with 175 billion parameters and a self-consistency approach of $N=5$. However, they did not specify the temperatures used or the number of random variable assignments in their setup. In our experiments, we employed the `GPT-3.5-Turbo` model. 20 | 21 | ### SVAMP 22 | 23 | To manage costs effectively, each prompt was processed exactly once, an approach we term as `self_consistency=1`. Given that each problem is addressed in a single model run, we set the `temperature` parameter to 0. This ensures that the model response is deterministic, providing the most probable and stable output for each input without the variability that higher temperature settings would introduce. 24 | 25 | The evaluation of MathPrompter on the SVAMP dataset achieved an accuracy of **63.9%**. 26 | 27 | ![alt text](plots/SVAMP_individual_metrics.png) 28 | 29 | The evaluation of MathPrompter using the SVAMP dataset reveals distinct strengths across different methodologies. The comprehensive approach (MathPrompter Total) demonstrates respectable accuracy (63.9%) with a low hallucination rate (10.9%), indicating reliable problem-solving capabilities. Specialized methods such as Algebraic Only and Python Only exhibit higher accuracy rates of 77.3% and 70.5% respectively but also show increased hallucination rates. This points to their heightened sensitivity and potential to overfit specific problem types. 
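
To make these two headline numbers concrete, the sketch below shows how they can be recomputed from the saved evaluation CSV shipped in this folder (`output_SVAMP_eval_N=1.csv`); `plot.py` performs the full computation, this is only a minimal illustration of how the two metrics are defined. A prediction of `None` means the pipeline declined to answer.

```python
import pandas as pd

# Minimal sketch: accuracy counts answers that match the ground truth;
# the hallucination rate counts problems where *some* numeric answer was
# returned but it is wrong. Unanswered problems lower accuracy only.
df = pd.read_csv("output_SVAMP_eval_N=1.csv")
pred = pd.to_numeric(df["Predicted Answer Total"], errors="coerce")

accuracy = (pred == df["Correct Answer"]).mean()
hallucination_rate = ((pred != df["Correct Answer"]) & pred.notna()).mean()
print(f"Accuracy: {accuracy:.1%}, Hallucination rate: {hallucination_rate:.1%}")
```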
30 | 31 | For a comparative analysis of how our results hold up against current state-of-the-art methodologies, you can visit the [PaperWithCode Leaderboard for the SVAMP benchmark](https://paperswithcode.com/sota/math-word-problem-solving-on-svamp). 32 | 33 | Notably, when considering only models that operate without the use of additional training data, our implementation ranks 4th. 34 | 35 | Note that one would probably achieve higher accuracy on the benchmark by increasing the self-consistency and using an even more capable model, i.e., `GPT-4`. However, due to cost constraints, we omitted this. 36 | 37 | ## Running the Evaluation on your own 38 | 39 | To run the evaluation, execute the following command: 40 | 41 | ```bash 42 | python evaluation_SVAMP.py 43 | ``` 44 | 45 | Then run 46 | 47 | ```bash 48 | python plot.py 49 | ``` 50 | 51 | ## Dataset Credits 52 | 53 | ### SVAMP 54 | 55 | The SVAMP dataset is sourced from the SVAMP paper: ["Are NLP Models really able to Solve Simple Math Word Problems?"](https://arxiv.org/abs/2103.07191) by Arkil Patel, Satwik Bhattamishra andNavin Goyal. The dataset and their work can be further explored through their [GitHub repository](https://github.com/arkilpatel/SVAMP/tree/main) and their publication which can be accessed [here](https://arxiv.org/abs/2103.07191). 56 | 57 | ```bibtex 58 | @inproceedings{patel-etal-2021-nlp, 59 | title = "Are {NLP} Models really able to Solve Simple Math Word Problems?", 60 | author = "Patel, Arkil and 61 | Bhattamishra, Satwik and 62 | Goyal, Navin", 63 | booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", 64 | month = jun, 65 | year = "2021", 66 | address = "Online", 67 | publisher = "Association for Computational Linguistics", 68 | url = "https://aclanthology.org/2021.naacl-main.168", 69 | doi = "10.18653/v1/2021.naacl-main.168", 70 | pages = "2080--2094", 71 | abstract = "The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered {``}solved{''} with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. 
The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.",
72 | }
73 | ```
74 | 
--------------------------------------------------------------------------------
/evaluation/evaluation_SVAMP.py:
--------------------------------------------------------------------------------
 1 | import json
 2 | from tqdm import tqdm
 3 | import sys
 4 | import pandas as pd
 5 | 
 6 | # Add the root directory to sys.path to find the main module
 7 | sys.path.append('..')
 8 | 
 9 | from main import *
10 | from compute_verification import *
11 | 
12 | def evaluate_on_svamp(sample_size):
13 |     with open('SVAMP.json', 'r') as file:
14 |         data = json.load(file)
15 |     data = random.sample(data, k=sample_size)  # Take a random sample of `sample_size` problems
16 | 
17 |     header = ["Question", "Correct Answer", "Predicted Answer Total", "Predicted Answer Algebraic Expression", "Predicted Answer Python Code", "Generated Algebraic Template", "Generated Expression", "Generated Python Code", "Variable Mapping", "Error", "Error Message"]
18 |     df = pd.DataFrame(columns=header)
19 | 
20 |     for entry in tqdm(data, desc="Evaluating SVAMP dataset"):
21 |         full_question = entry["Body"] + " " + entry["Question"]
22 |         result, exp_result, code_result, qt, exp, code, variable_mapping = None, None, None, None, None, None, None
23 |         try:
24 |             qt, variable_mapping = generating_algebraic_template(full_question)
25 |             exp = algebric_expression_generation(qt, temperature=0)  # temperature=0 for deterministic single-run evaluation
26 |             code = python_code_generation(qt, temperature=0)
27 |             result = perform_computational_verification(exp, code, variable_mapping, sample_size=3)
28 |             exp_result = evaluate_expression(exp, variable_mapping)
29 |             code_result = execute_python_code(code, variable_mapping)
30 |             error, error_message = False, ""
31 |         except Exception as e:
32 |             error, error_message = True, str(e)
33 |             print(f"An error occurred during the verification process: {e}")
34 | 
35 |         data_row = pd.DataFrame([{
36 |             "Question": full_question, "Correct Answer": entry["Answer"], "Predicted Answer Total": result,
37 |             "Predicted Answer Algebraic Expression": exp_result, "Predicted Answer Python Code": code_result,
38 |             "Generated Algebraic Template": qt, "Generated Expression": exp, "Generated Python Code": code,
39 |             "Variable Mapping": variable_mapping, "Error": error, "Error Message": error_message
40 |         }])
41 | 
42 |         df = pd.concat([df, data_row], ignore_index=True)
43 |         df.to_csv("output_SVAMP_eval.csv", encoding='utf-8', index=False)
44 |     return
45 | 
46 | # Evaluate on the full SVAMP dataset (1000 problems)
47 | if __name__ == '__main__':
48 |     evaluate_on_svamp(sample_size=1000)
49 | 
--------------------------------------------------------------------------------
/evaluation/plot.py:
--------------------------------------------------------------------------------
 1 | import pandas as pd
 2 | import numpy as np
 3 | import matplotlib.pyplot as plt
 4 | import seaborn as sns
 5 | from sklearn.metrics import precision_score
 6 | import os
 7 | 
 8 | def compute_detailed_metrics(df):
 9 |     metrics = {
10 |         'Method': ['MathPrompter Total', 'Algebraic Only', 'Python Only'],
11 |         'Accuracy': [],
12 |         'Hallucination Rate': [],
13 |         'Precision': []
14 |     }
15 | 
16 |     df = df.replace('None', np.nan)
17 | 
18 |     for col in ['Predicted Answer Total', 'Predicted Answer Algebraic Expression', 'Predicted Answer Python Code']:
19 |         df[col] = pd.to_numeric(df[col], errors='coerce')
20 |         correct = df[df[col] == df['Correct Answer']]
21 |         accuracy = len(correct) / len(df) if len(df) > 0 else 0
22 | 
23 |         
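        # A "hallucination" is counted when the method returned some numeric answer
        # (i.e. the prediction is not None/NaN) but that answer is wrong; problems the
        # pipeline left unanswered lower accuracy without counting as hallucinations.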
hallucinations = df[(df[col] != df['Correct Answer']) & (df[col].notna()) & (df[col] != 'None')] 24 | hallucination_rate = len(hallucinations) / len(df) if len(df) > 0 else 0 25 | 26 | binary_true = (df['Correct Answer'] == df[col]).astype(int) 27 | binary_pred = (df[col].notna() & (df[col] != 'None')).astype(int) 28 | 29 | precision = precision_score(binary_true, binary_pred, zero_division=0) 30 | 31 | # Debugging prints 32 | print(f"\nColumn: {col}") 33 | print(f"Correct entries count: {len(correct)}") 34 | print(f"Total non-NA predictions count: {df[col].notna().sum()}") 35 | print(f"Accuracy: {accuracy}") 36 | print(f"Hallucination Rate: {hallucination_rate}") 37 | print(f"Precision: {precision}") 38 | 39 | metrics['Accuracy'].append(accuracy) 40 | metrics['Hallucination Rate'].append(hallucination_rate) 41 | metrics['Precision'].append(precision) 42 | 43 | return pd.DataFrame(metrics) 44 | 45 | def plot_individual_metrics(df_metrics): 46 | metrics = df_metrics.columns[1:] 47 | fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 6)) 48 | for i, metric in enumerate(metrics): 49 | # Changed to set 'Method' as the hue and disable the legend directly. 50 | bar_plot = sns.barplot(x='Method', y=metric, hue='Method', data=df_metrics, ax=axes[i], palette='deep', legend=False) 51 | axes[i].set_title(f'{metric} Comparison by Method') 52 | axes[i].set_ylim(0, 1) 53 | 54 | # Adding labels inside the bars 55 | for p in bar_plot.patches: 56 | bar_plot.annotate(format(p.get_height(), '.2f'), 57 | (p.get_x() + p.get_width() / 2., p.get_height()), 58 | ha='center', va='center', 59 | size=9, xytext=(0, -12), 60 | textcoords='offset points') 61 | 62 | plt.tight_layout() 63 | plt.savefig(os.path.join('plots', 'individual_metrics.png')) 64 | plt.close() 65 | 66 | def plot_radar_chart(df_metrics): 67 | labels = np.array(df_metrics.columns[1:]) 68 | num_vars = len(labels) 69 | angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist() 70 | angles += angles[:1] 71 | 72 | fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True)) 73 | palette = sns.color_palette("deep", len(df_metrics['Method'])) 74 | 75 | for idx, row in df_metrics.iterrows(): 76 | data = row[labels].tolist() 77 | data += data[:1] 78 | ax.plot(angles, data, linewidth=2, label=row['Method'], color=palette[idx], linestyle='dashed') # Fixed linestyle redundancy 79 | ax.fill(angles, data, alpha=0) 80 | 81 | ax.set_xticks(angles[:-1]) 82 | ax.set_xticklabels(labels, color='grey', size=12) 83 | ax.legend(loc='upper right', bbox_to_anchor=(1.1, 1.1)) 84 | 85 | plt.savefig(os.path.join('plots', 'radar_chart.png')) 86 | plt.close() 87 | 88 | if __name__ == '__main__': 89 | if not os.path.exists('plots'): 90 | os.makedirs('plots') 91 | 92 | df = pd.read_csv('output_SVAMP_eval.csv') 93 | df.replace({'None': None}, inplace=True) 94 | 95 | df_metrics = compute_detailed_metrics(df) 96 | plot_individual_metrics(df_metrics) 97 | plot_radar_chart(df_metrics) -------------------------------------------------------------------------------- /evaluation/plots/SVAMP_complementary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RamonKaspar/MathPrompter/dd6f72a8d9560ff8e7bd8a282fbf354a5304b71d/evaluation/plots/SVAMP_complementary.png -------------------------------------------------------------------------------- /evaluation/plots/SVAMP_individual_metrics.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/RamonKaspar/MathPrompter/dd6f72a8d9560ff8e7bd8a282fbf354a5304b71d/evaluation/plots/SVAMP_individual_metrics.png -------------------------------------------------------------------------------- /evaluation/plots/SVAMP_radar_chart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RamonKaspar/MathPrompter/dd6f72a8d9560ff8e7bd8a282fbf354a5304b71d/evaluation/plots/SVAMP_radar_chart.png -------------------------------------------------------------------------------- /generating_algebraic_template.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | def generate_variable_names(limit=26): 4 | """ Generates a list of unique variable names. """ 5 | base_names = [chr(i) for i in range(ord('A'), ord('Z') + 1)] 6 | extended_names = base_names.copy() 7 | idx = 0 8 | while len(extended_names) < limit: 9 | extended_names.append(base_names[idx % 26] + str(idx // 26 + 1)) 10 | idx += 1 11 | return extended_names 12 | 13 | def generating_algebraic_template(q): 14 | """ 15 | Transforms numeric values in a math word problem into unique variables. 16 | 17 | Steps: 18 | 1. Split the text to isolate numbers and mathematical operators. 19 | This preserves the structure in which numbers and mathematical operators such as '+', 20 | '-', '*', '/', '=', and '^' are treated separately. 21 | 22 | 2. Process each element from the split to detect numeric values. 23 | Identify segments that contain digits, potentially including numbers attached to units or symbols. 24 | 25 | 3. For segments that include numeric values with non-numeric substrings at the beginning or end (e.g., units or currency symbols): 26 | - Split any directly attached non-numeric characters from the numbers. This might be units like 'cm', or symbols like '$'. 27 | - For example, transform "3cm" into "3 cm" or "$20" into "$ 20", ensuring that numbers and their associated units or symbols are separated by a space. 28 | 29 | 4. Adjust number formatting by: 30 | - Removing characters used as thousands separators such as commas (',') or apostrophes ('’', '`'). 31 | - Standardizing decimal points by replacing commas with dots ('.'). 32 | 33 | 5. Assign a unique variable name to each unique numeric value. 34 | Each number, once identified and formatted correctly, is replaced by a unique variable name, 35 | ensuring that each numeric value is distinctly represented by a variable in the transformed text. 36 | 37 | Parameters: 38 | q (str): The math word problem as a string. 39 | 40 | Returns: 41 | tuple: The transformed expression with variables replacing numbers, and a dictionary mapping variables to original numeric values. 42 | """ 43 | variable_names = generate_variable_names(limit=26) # TODO: Dynamically adjust the limit when we run out of variable names. 44 | var_index = 0 45 | mapping = {} 46 | reverse_mapping = {} 47 | 48 | # 1. Split the text to isolate numbers and mathematical operators. 49 | elements = [split for element in q.split() for split in re.split(r'([+\-*/=^()])', element) if split] 50 | 51 | # 2. Process each element from the split to detect numeric values. 52 | processed_elements = [] 53 | for element in elements: 54 | numeric_match = re.match(r'([^\d]*)(\d[\d,.\'`]*)([^\d]*)', element) 55 | # 3. Segments that include numeric values with potential non-numeric substrings at the beginning or end 56 | if numeric_match: 57 | prefix, number, suffix = numeric_match.groups() 58 | # 4. 
Adjust number formatting 59 | if re.match(r"^\d{1,3}(?:(?:[,'`]\d{3})*|\d*)(?:\.\d+)?$", number): 60 | number = re.sub(r"[,'`]", "", number) 61 | number = number.replace(',', '.') 62 | # 5. Assign a unique variable name to each unique numeric value. 63 | if number not in reverse_mapping: 64 | variable_name = variable_names[var_index] 65 | mapping[variable_name] = float(number) 66 | reverse_mapping[float(number)] = variable_name 67 | var_index += 1 68 | if prefix != '': processed_elements.append(prefix) 69 | processed_elements.append(reverse_mapping[float(number)]) 70 | if suffix != '': processed_elements.append(suffix) 71 | else: 72 | processed_elements.append(element) 73 | final_expression = " ".join(processed_elements) 74 | return final_expression, mapping -------------------------------------------------------------------------------- /llm_inference/README.md: -------------------------------------------------------------------------------- 1 | # LLM Inference Service 2 | 3 | This directory contains the modules necessary for interfacing with various Large Language Models (LLMs) such as Azure's OpenAI. It provides a way to abstract and standardize LLM access, making it easy to integrate and switch between different LLM services. 4 | 5 | Structure: 6 | 7 | - `llm_interface.py`: Defines the standard interface for LLM services. 8 | - `azure_llm_service.py`: Implementation of the interface for Azure's LLM. 9 | - `llm_factory.py`: A factory module used to instantiate LLM service objects dynamically based on the desired service. 10 | 11 | ## Usage of the current services 12 | 13 | Default is the usage of the Azure OpenAI API with the `gpt-35-turbo` model. 14 | 15 | 1. Setting Up Environment: 16 | Ensure you have a `.env` file in your environment with the necessary API keys and endpoints (depends on the LLM you want to use) described as: 17 | 18 | - `AZURE_OPENAI_KEY` 19 | - `AZURE_OPENAI_ENDPOINT` 20 | 21 | 2. Initializing a Service: 22 | Import and use the `get_llm_service` function from `llm_factory.py` to initialize the LLM service of your choice: 23 | 24 | ```python 25 | from llm_inference.llm_factory import get_llm_service 26 | 27 | # Initialize Azure LLM Service 28 | azure_service = get_llm_service('azure') 29 | ``` 30 | 31 | ## Adding a New LLM Model 32 | 33 | To incorporate a new LLM into the system: 34 | 35 | 1. ### Add API Keys and Endpoints: Set up the necessary keys and endpoints in your `.env` file. 36 | 37 | 2. ### Implement the Interface: 38 | 39 | - Create a new Python file for the LLM service. 40 | - Inherit from `LLMInterface`. 41 | - Implement all abstract methods (`create_prompt`, `make_request`). 42 | Example: 43 | 44 | ```python 45 | from llm_interface import LLMInterface 46 | 47 | class NewLLMService(LLMInterface): 48 | def __init__(self, model_name, temperature, max_tokens): 49 | # Implement the set up 50 | pass 51 | 52 | def create_prompt(self, system_prompt, few_shot_examples, question): 53 | # Implement prompt preparation 54 | pass 55 | 56 | def make_request(self, messages): 57 | # Implement making an HTTP request to the LLM API 58 | pass 59 | ``` 60 | 61 | 3. ### Register in the Factory: 62 | 63 | Modify `llm_factory.py` to recognize your new LLM service. 64 | Example: 65 | 66 | ```python 67 | def get_llm_service(service_name, model_name, temperature, max_tokens): 68 | if service_name == 'new_llm': 69 | return NewLLMService(model_name, temperature, max_tokens) 70 | ``` 71 | 72 | 4. 
### Usage 73 | Now you can use the newly added service in the same way as existing services by calling `get_llm_service('new_llm', ...)`. 74 | -------------------------------------------------------------------------------- /llm_inference/azure_openai_service.py: -------------------------------------------------------------------------------- 1 | import os 2 | from openai import AzureOpenAI 3 | 4 | from .llm_interface import LLMInterface 5 | 6 | class AzureOpenAIService(LLMInterface): 7 | def __init__(self, model_name, temperature, max_tokens): 8 | super().__init__(model_name, temperature, max_tokens) 9 | self.client = AzureOpenAI( 10 | api_version="2023-05-15", 11 | api_key=os.getenv("AZURE_OPENAI_KEY"), 12 | azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT") 13 | ) 14 | 15 | def create_prompt(self, system_prompt, few_shot_examples, question): 16 | messages = [] 17 | messages.append({"role": "system", "content": system_prompt}) 18 | for message in few_shot_examples: 19 | messages.append({"role": "user", "content": message['question']}) 20 | messages.append({"role": "assistant", "content": message['answer']}) 21 | messages.append({"role": "user", "content": question}) 22 | return messages 23 | 24 | def make_request(self, messages): 25 | response = self.client.chat.completions.create( 26 | model=self.model_name, 27 | messages=messages, 28 | max_tokens=self.max_tokens, 29 | temperature=self.temperature, 30 | ) 31 | return response.choices[0].message.content -------------------------------------------------------------------------------- /llm_inference/llm_factory.py: -------------------------------------------------------------------------------- 1 | from dotenv import load_dotenv 2 | 3 | from .azure_openai_service import AzureOpenAIService 4 | from .openai_service import OpenAIService 5 | 6 | load_dotenv() # Load the environment variables located in the .env file 7 | 8 | def get_llm_service(service_name, model_name, temperature, max_tokens): 9 | """ 10 | Get an instance of a Language Model Service. 11 | 12 | Parameters: 13 | service_name (str): Name of the LLM service provider. 14 | model_name (str): Name of the language model to use. 15 | temperature (float): Sampling temperature for generating responses. 16 | max_tokens (int): Maximum number of tokens to generate in each response. 17 | 18 | Returns: 19 | LLMInterface: Instance of a concrete implementation of LLMInterface. 20 | 21 | Raises: 22 | Exception: If the specified LLM service type is unsupported. 23 | """ 24 | if service_name == 'azure': 25 | return AzureOpenAIService(model_name, temperature, max_tokens) 26 | elif service_name == 'openai': 27 | return OpenAIService(model_name, temperature, max_tokens) 28 | else: 29 | raise Exception("Unsupported LLM service type") -------------------------------------------------------------------------------- /llm_inference/llm_interface.py: -------------------------------------------------------------------------------- 1 | class LLMInterface: 2 | def __init__(self, model_name, temperature, max_tokens): 3 | """ 4 | Initialize the LLMInterface. 5 | 6 | Parameters: 7 | model_name (str): Name of the language model to use. 8 | temperature (float): Sampling temperature for generating responses. 9 | max_tokens (int): Maximum number of tokens to generate in each response. 
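
        Example (illustrative): concrete services are normally obtained through the
        factory rather than constructed directly, e.g.
        get_llm_service('azure', 'gpt-35-turbo', temperature=0.0, max_tokens=200).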
10 | """ 11 | self.model_name = model_name 12 | self.temperature = temperature 13 | self.max_tokens = max_tokens 14 | 15 | def create_prompt(self, system_prompt, few_shot_examples, question): 16 | """ 17 | Create a prompt for the language model. 18 | 19 | Parameters: 20 | system_prompt (str): The system prompt to start the conversation. 21 | few_shot_examples (list): List of few-shot examples for in-context learning. 22 | question (str): The user's question to generate a response for. 23 | 24 | Returns: 25 | list: List of messages formatted for input to the language model. 26 | NOTE: YOU HAVE TO ADAPT THIS METHOD BASED ON THE LANGUAGE MODEL YOU USE 27 | """ 28 | raise NotImplementedError 29 | 30 | def make_request(self, messages): 31 | """ 32 | Make a request to the language model service. 33 | 34 | Parameters: 35 | messages (list): List of messages formatted for input to the language model. 36 | 37 | Returns: 38 | str: Generated response from the language model. 39 | """ 40 | raise NotImplementedError -------------------------------------------------------------------------------- /llm_inference/openai_service.py: -------------------------------------------------------------------------------- 1 | import os 2 | from openai import OpenAI 3 | 4 | from .llm_interface import LLMInterface 5 | 6 | class OpenAIService(LLMInterface): 7 | def __init__(self, model_name, temperature, max_tokens): 8 | super().__init__(model_name, temperature, max_tokens) 9 | self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY")) 10 | 11 | def create_prompt(self, system_prompt, few_shot_examples, question): 12 | messages = [] 13 | messages.append({"role": "system", "content": system_prompt}) 14 | for message in few_shot_examples: 15 | messages.append({"role": "user", "content": message['question']}) 16 | messages.append({"role": "assistant", "content": message['answer']}) 17 | messages.append({"role": "user", "content": question}) 18 | return messages 19 | 20 | def make_request(self, messages): 21 | response = self.client.chat.completions.create( 22 | model=self.model_name, 23 | messages=messages, 24 | max_tokens=self.max_tokens, 25 | temperature=self.temperature, 26 | ) 27 | return response.choices[0].message.content 28 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | from generating_algebraic_template import generating_algebraic_template 2 | from math_prompts_generation import algebric_expression_generation, python_code_generation 3 | from compute_verification import perform_computational_verification 4 | 5 | def mathprompter(question: str, temperatures=[0], n_test_random_values=5): 6 | """ 7 | Attempts to verify a mathematical prompt via repeated computations and returns the most frequently computed result. 8 | Parameters: 9 | question (str): The mathematical question in natural-language form. 10 | temperatures (list[float]): Sequences of temperature settings used to vary generation randomness. 11 | Defines the number of self consistency iterations. 12 | n_test_random_values (int): Number of random samples to test for each verification. 13 | 14 | Returns: 15 | The most frequently computed result over iterations, or None if no consistent result was found. 
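
    Example (illustrative; requires a configured LLM API, see llm_inference/README.md):
        mathprompter("John has 5 apples. He gives 2 apples to his friend. How many apples does John have left?",
                     temperatures=[0.0, 0.2], n_test_random_values=3)
        # -> 3.0 when the generated expression ("A - B") and Python code agree in both runs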
16 | """ 17 | result_counts = {} 18 | for temp in temperatures: # TODO: Parallelize this 19 | try: 20 | qt, variable_mapping = generating_algebraic_template(question) 21 | exp = algebric_expression_generation(qt, temperature=temp) 22 | code = python_code_generation(qt, temperature=temp) 23 | res = perform_computational_verification(exp, code, variable_mapping, sample_size=n_test_random_values) 24 | if res is not None: 25 | if res in result_counts: 26 | result_counts[res] += 1 27 | else: 28 | result_counts[res] = 1 29 | except Exception as e: 30 | print(f"An error occurred during the verification process: {e}") 31 | continue 32 | 33 | if result_counts: 34 | max_result = max(result_counts, key=result_counts.get) 35 | return max_result 36 | else: 37 | print("No successful computations were found.") 38 | return None 39 | 40 | # Example usage 41 | if __name__ == '__main__': 42 | question = 'At a restaurant, each adult meal costs $5 and kids eat free. If a group of 15 people came in and 8 were kids, how much would it cost for the group to eat?' 43 | res = mathprompter(question, temperatures=[0.0, 0.2, 0.4, 0.6, 0.8], n_test_random_values=3) 44 | print(f"Result: {res}") 45 | -------------------------------------------------------------------------------- /math_prompts_generation.py: -------------------------------------------------------------------------------- 1 | from llm_inference.llm_factory import get_llm_service 2 | 3 | # DISCLAIMER: The following examples inside the prompts are partly generated with Claude Opus 3 4 | # NOTE: You may adapt the prompt based on the model you use 5 | ALGEBRAIC_FEW_SHOT_PROMPT = [ 6 | { 7 | "question": "John has A apples. He gives B apples to his friend. How many apples does John have left?", 8 | "answer": "A - B" 9 | }, 10 | { 11 | "question": "A school has B classrooms. If each classroom has C students and D teachers, how many people are in the school?", 12 | "answer": "(C + D) * B" 13 | }, 14 | { 15 | "question": "A factory produces E widgets per hour. If the factory operates for F hours per day and G days per week, how many widgets does the factory produce in H weeks?", 16 | "answer": "E * F * G * H" 17 | }, 18 | { 19 | "question": "A company has I employees. Each employee works J hours per week. If the average hourly wage is K dollars and the company spends L percent of its revenue on salaries, what is the company's weekly revenue?", 20 | "answer": "(I * J * K) / (L / 100)" 21 | }, 22 | { 23 | "question": "A store has D shirts in stock. If they sell E shirts per day, how many days will it take to sell all the shirts?", 24 | "answer": "D // E" 25 | } 26 | ] 27 | 28 | 29 | ALGEBRAIC_SYSTEM_PROMPT =''' 30 | You are a highly qualified expert in finding algebra expressions for a given math problem in natural language. 31 | You are given a math word problem and you need to find the algebraic expression that represents the problem. 32 | Only the algebraic expression is required, without explanations. 
33 | ''' 34 | 35 | def algebric_expression_generation(question: str, temperature: float): 36 | question = question.strip() 37 | SERVICE = get_llm_service('openai', 'gpt-3.5-turbo', temperature=temperature, max_tokens=200) 38 | messages = SERVICE.create_prompt(system_prompt=ALGEBRAIC_SYSTEM_PROMPT, few_shot_examples=ALGEBRAIC_FEW_SHOT_PROMPT, question=question) 39 | response = SERVICE.make_request(messages=messages) 40 | expression = response.strip() 41 | return expression 42 | 43 | 44 | # DISCLAIMER: The following examples inside the prompts are partly generated with Claude Opus 3 45 | # NOTE: You may adapt the prompt based on the model you use 46 | PYTHON_FEW_SHOT_PROMPT = [ 47 | { 48 | "question": "John has A apples. He gives B apples to his friend. How many apples does John have left?", 49 | "answer": "def solution(A, B):\n return A - B" 50 | }, 51 | { 52 | "question": "A school has B classrooms. If each classroom has C students and D teachers, how many people are in the school?", 53 | "answer": "def solution(B, C, D):\n return (C + D) * B" 54 | }, 55 | { 56 | "question": "A factory produces E widgets per hour. If the factory operates for F hours per day and G days per week, how many widgets does the factory produce in H weeks?", 57 | "answer": "def solution(E, F, G, H):\n return E * F * G * H" 58 | }, 59 | { 60 | "question": "A company has I employees. Each employee works J hours per week. If the average hourly wage is K dollars and the company spends L percent of its revenue on salaries, what is the company's weekly revenue?", 61 | "answer": "def solution(I, J, K, L):\n return (I * J * K) / (L / 100)" 62 | }, 63 | { 64 | "question": "A store has D shirts in stock. If they sell E shirts per day, how many days will it take to sell all the shirts?", 65 | "answer": "def solution(D, E):\n return D // E" 66 | } 67 | ] 68 | 69 | PYTHON_SYSTEM_PROMPT = ''' 70 | You are a highly qualified expert in finding Python code for a given math problem in natural language. 71 | You are given a math word problem and you need to find the Python code that solves the problem. 72 | Only the Python code is required, without explanations. 73 | ''' 74 | 75 | def python_code_generation(question, temperature: float): 76 | SERVICE = get_llm_service('openai', 'gpt-3.5-turbo', temperature=temperature, max_tokens=200) 77 | messages = SERVICE.create_prompt(system_prompt=PYTHON_SYSTEM_PROMPT, few_shot_examples=PYTHON_FEW_SHOT_PROMPT, question=question) 78 | response = SERVICE.make_request(messages=messages) 79 | function_code = response.strip() 80 | return function_code -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # This file may be used to create an environment using: 2 | # $ conda create --name --file 3 | # platform: win-64 4 | openai 5 | python-dotenv -------------------------------------------------------------------------------- /tests/test_cases.py: -------------------------------------------------------------------------------- 1 | test_cases_paper = [ 2 | { 3 | 'q': 'At a restaurant, each adult meal costs $5 and kids eat free. If a group of 15 people came in and 8 were kids, how much would it cost for the group to eat?', 4 | 'qt': 'At a restaurant, each adult meal costs A and kids eat free. 
If a group of B people came in and C were kids, how much would it cost for the group to eat?', 5 | 'Mapping': {'A': 5.0, 'B': 15.0, 'C': 8.0}, 6 | 'Correct': 35.0 7 | }, 8 | { 9 | 'q': 'At the fair Adam bought 13 tickets. After riding the ferris wheel he had 4 tickets left. If each ticket cost 9 dollars, how much money did Adam spend riding the ferris wheel?', 10 | 'qt': 'At the fair Adam bought A tickets. After riding the ferris wheel he had B tickets left. If each ticket cost C dollars, how much money did Adam spend riding the ferris wheel?', 11 | 'Mapping': {'A': 13.0, 'B': 4.0, 'C': 9.0}, 12 | 'Correct': 81.0 13 | }, 14 | { 15 | 'q': ' Each chocolate bar in a box cost $3. If a box had 9 bars total and Wendy sold all but 3 bars, how much money would she have made?', 16 | 'qt': ' Each chocolate bar in a box cost A. If a box had B bars total and Wendy sold all but C bars, how much money would she have made?', 17 | 'Mapping': {'A': 3.0, 'B': 9.0, 'C': 3.0}, 18 | 'Correct': 18.0 19 | }, 20 | { 21 | 'q': 'Jerry was helping the cafeteria workers pick up lunch trays, but he could only carry 8 trays at a time. If he had to pick up 9 trays from one table and 7 trays from another, how many trips will he make?', 22 | 'qt': 'Jerry was helping the cafeteria workers pick up lunch trays, but he could only carry A trays at a time. If he had to pick up B trays from one table and C trays from another, how many trips will he make?', 23 | 'Mapping': {'A': 8.0, 'B': 9.0, 'C': 7.0}, 24 | 'Correct': 2.0 25 | }, 26 | { 27 | 'q': 'Kaleb bought 14 boxes of chocolate candy and gave 5 to his little brother. If each box has 6 pieces inside it, how many pieces did Kaleb still have?', 28 | 'qt': 'Kaleb bought A boxes of chocolate candy and gave B to his little brother. If each box has C pieces inside it, how many pieces did Kaleb still have?', 29 | 'Mapping': {'A': 14.0, 'B': 5.0, 'C': 6.0}, 30 | 'Correct': 54 31 | } 32 | ] 33 | 34 | 35 | test_cases_own = [ 36 | { 37 | 'q': 'A train travels 320 kilometers south and then 150 kilometers north. What is the net distance traveled by the train?', 38 | 'qt': 'A train travels A kilometers south and then B kilometers north. What is the net distance traveled by the train?', 39 | 'Mapping': {'A': 320.0, 'B': 150.0}, 40 | 'Correct': 170.0 41 | }, 42 | { 43 | 'q': 'A concert hall has 425 seats. During a concert, 285 tickets were sold at full price and the remaining at half price. If the full-price ticket costs $30, what was the total revenue?', 44 | 'qt': 'A concert hall has A seats. During a concert, B tickets were sold at full price and the remaining at half price. If the full-price ticket costs C, what was the total revenue?', 45 | 'Mapping': {'A': 425.0, 'B': 285.0, 'C': 30.0}, 46 | 'Correct': 9975.0 47 | }, 48 | { 49 | 'q': 'The temperature was 5 degrees Celsius at noon, but it had dropped to -30 degrees Celsius by midnight. What was the total temperature change?', 50 | 'qt': 'The temperature was A degrees Celsius at noon, but it had dropped to B degrees Celsius by midnight. 
What was the total temperature change?',
51 |         'Mapping': {'A': 5.0, 'B': -30.0},
52 |         'Correct': 35.0
53 |     }
54 | ]
55 | 
56 | test_cases = test_cases_paper + test_cases_own
--------------------------------------------------------------------------------
/tests/testing.py:
--------------------------------------------------------------------------------
 1 | import unittest
 2 | import sys
 3 | 
 4 | # Add the root directory to sys.path to find the main module
 5 | sys.path.append('..')
 6 | 
 7 | from main import *
 8 | from test_cases import *
 9 | 
10 | class Test(unittest.TestCase):
11 |     def test_algebraic_template_generation(self):
12 |         results = []
13 |         for index, test in enumerate(test_cases, start=1):
14 |             try:
15 |                 question_template, mapping = generating_algebraic_template(test['q'])
16 |                 print("------------------------------")
17 |                 print(f"TEST {index}")
18 |                 print("Question:", test['q'])
19 |                 print("Question Transformed:", question_template)
20 |                 print("Calculated Mapping:", mapping)
21 |                 print("Expected Template:", test['qt'])
22 |                 print("Expected Mapping:", test['Mapping'])
23 |                 self.assertEqual(question_template, test['qt'])
24 |                 self.assertEqual(mapping, test['Mapping'])
25 |                 results.append(True)
26 |             except AssertionError as e:
27 |                 print(f"FAILED: {str(e)}")
28 |                 results.append(False)
29 | 
30 |         self.assertTrue(all(results), "Some algebra templates did not match the expectations.")
31 | 
32 |     def test_compute_verification(self):
33 |         results = []
34 |         for index, test in enumerate(test_cases, start=1):
35 |             try:
36 |                 question_template, variable_mapping = generating_algebraic_template(test['q'])
37 |                 algebraic_expression = algebric_expression_generation(question_template, temperature=0)  # deterministic generation for reproducible tests
38 |                 python_code = python_code_generation(question_template, temperature=0)
39 |                 outcome = perform_computational_verification(algebraic_expression, python_code, variable_mapping)
40 |                 print("------------------------------")
41 |                 print(f"TEST {index}")
42 |                 print("Generated Algebraic Expression:", algebraic_expression)
43 |                 print("Generated Python Code:", python_code)
44 |                 print("Expected Answer:", test['Correct'])
45 |                 print("Calculated Answer:", outcome)
46 |                 self.assertEqual(outcome, test['Correct'])
47 |                 results.append(True)
48 |             except AssertionError as e:
49 |                 print(f"FAILED: {str(e)}")
50 |                 results.append(False)
51 | 
52 |         self.assertTrue(all(results), "Some computational verifications failed.")
53 | 
54 | if __name__ == '__main__':
55 |     unittest.main()
56 | 
--------------------------------------------------------------------------------