├── ComputeGPTEval.xlsx
├── eval.py
├── README.md
└── eval.json


/ComputeGPTEval.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ryanhlewis/ComputeGPTEval/HEAD/ComputeGPTEval.xlsx


--------------------------------------------------------------------------------
/eval.py:
--------------------------------------------------------------------------------
# ------------------ #
# Note this is a very simple evaluation, and is meant to have a human manually review the results,
# due to the nature of numerical answers.
# ------------------ #

import openai
import json

openai.api_key = "(INSERT-YOUR-KEY-HERE)"

with open('eval.json') as f:
    data = json.load(f)

# Loop over all questions, keep score of each model
scores = {
    "text-davinci-003": 0,
    "gpt-3.5-turbo": 0
}

models = ["text-davinci-003", "gpt-3.5-turbo"]

for question in data:
    print(question["question"])
    print(question["answer"])
    for model in models:
        print(model)

        # Chat completion versus text completion
        if model == "gpt-3.5-turbo":
            completion = openai.ChatCompletion.create(model=model, messages=[{"role": "user", "content": question["question"]}])
            print(completion.choices[0].message.content)
            answer = completion.choices[0].message.content.strip()
        else:
            completion = openai.Completion.create(model=model, prompt=question["question"])
            print(completion.choices[0].text)
            answer = completion.choices[0].text.strip()

        # Exact string match only; results still need manual review
        if answer == question["answer"]:
            scores[model] += 1
            print("Correct!")
        else:
            print("Incorrect!")
        print("")

print(scores)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

![project logo](https://user-images.githubusercontent.com/76540311/225113830-89be0dce-3996-44e6-aa24-207a8086db95.png)


# ComputeGPT Evaluation


ComputeGPT is a computational chat model that provides accurate, real-time answers to numerical problems. The chat model is able to answer SAT questions, GRE questions, all kinds of math and science questions, and homework problems with high accuracy.

Find the original ComputeGPT repository at https://github.com/urbaninfolab/ComputeGPT and the hosted model at https://computegpt.org.

## Evaluation

This repository provides an evaluation and comparison of ComputeGPT on numerical problems against GPT-4 with Internet (Bing AI), GPT-3.5 (ChatGPT), GPT-3 (Davinci-003), and Wolfram Alpha Natural Language.


| Model | ComputeGPT | Wolfram Alpha | Davinci-003 | ChatGPT | GPT-4 |
|-----------------|------------|---------------|-------------|---------|--------|
| Overall Accuracy (%) | **98%** | 56% | 28% | 48% | 64% |
| Word Problems (%) | **95%** | 15% | 35% | 50% | 65% |
| Straightforward (%) | **100%** | 83.3% | 23.3% | 46.6% | 63.3% |

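The overall row in the table above is consistent with a question-weighted average of the two category rows. The sketch below checks this under one assumption that the README does not state: a split of 30 straightforward questions and 20 word problems (eval.json contains 50 questions in total). The constant and function names are illustrative only.

```python
# Assumed split (not stated in the README): 30 straightforward + 20 word problems.
N_STRAIGHTFORWARD = 30
N_WORD = 20

# Accuracies (%) copied from the table above.
reported = {
    "ComputeGPT":    {"word": 95.0, "straightforward": 100.0, "overall": 98.0},
    "Wolfram Alpha": {"word": 15.0, "straightforward": 83.3, "overall": 56.0},
    "Davinci-003":   {"word": 35.0, "straightforward": 23.3, "overall": 28.0},
    "ChatGPT":       {"word": 50.0, "straightforward": 46.6, "overall": 48.0},
    "GPT-4":         {"word": 65.0, "straightforward": 63.3, "overall": 64.0},
}


def weighted_overall(word_pct, straightforward_pct):
    """Question-weighted average of the two category accuracies, in percent."""
    total = N_WORD + N_STRAIGHTFORWARD
    return (word_pct * N_WORD + straightforward_pct * N_STRAIGHTFORWARD) / total


for model, row in reported.items():
    estimate = weighted_overall(row["word"], row["straightforward"])
    print(f"{model}: weighted {estimate:.1f}% vs reported {row['overall']:.0f}%")
```

Under that assumed split, every weighted estimate lands within 0.1 percentage points of the reported overall accuracy; the small gaps come from the category figures being rounded to one decimal place.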

Overall, ComputeGPT demonstrates state-of-the-art performance on numerical problems when compared against other models, on both straightforward problems
and word problems that require more reasoning. See below for examples of each.

## Example Questions

(Straightforward) Example Question: **What is the derivative of 200x?** (Correct: 200)

| ComputeGPT | Wolfram Alpha | Davinci-003 | ChatGPT | GPT-4 |
|------------|---------------|-------------|---------|-----------|
| 200 | 200 | 200 | 200 | 200 |


(Straightforward) Example Question: **What is the integral of 200x from 0 to 5?** (Correct: 2500)

| ComputeGPT | Wolfram Alpha | Davinci-003 | ChatGPT | GPT-4 |
|------------|---------------|-------------|---------|-----------|
| 2500 | 2500 | 5000 | 5000 | 5000 |

(Straightforward) LaTeX Example Question: **∫₋₂₀⁵⁰ 2×10²¹x³ + 200x² dx** (Correct: 9135000000000000000026600000/3)

| ComputeGPT | Wolfram Alpha | Davinci-003 | ChatGPT | GPT-4 |
|------------------|---------------|---------------------|------------------|-------------------|
| 9135000000000000000026600000/3 | 26600000/3 | 50,000,000,000,000,000,000 | 1.83333 x 10²⁴ | 1.66666666666667E+24 |

We show that ComputeGPT is efficient at LaTeX parsing, as well as the parsing of large integers, which other models fail to do.

(Word Problem) Example Question: **Kevin's age is 5 times the age of his son, plus twenty. His son is 10. How old is Kevin?** (Correct: 70)

| ComputeGPT | Wolfram Alpha | Davinci-003 | ChatGPT | GPT-4 |
|------------|---------------|-------------|---------|-------|
| 70 | NULL | 50 | 50 | 70 |

(Word Problem) Example Question: **A new technique, called 'jamulti' is invented by multiplying a number by five and then adding 2 and dividing by 3. What's the jamulti of 7?** (Correct: 12.33333)

| ComputeGPT | Wolfram Alpha | Davinci-003 | ChatGPT | GPT-4 |
|------------|---------------|-------------|---------|----------|
| 12.33333 | NULL | 5 | 5 | 12 |

(Word Problem) Example Question: **An alien needs $50 USD to buy a spaceship. He needs to convert from ASD, which is worth $1.352 USD. How much ASD does he need?** (Correct: 36.9822485)

| ComputeGPT | Wolfram Alpha | Davinci-003 | ChatGPT | GPT-4 |
|-------------|---------------|-------------|---------|--------|
| 36.9822485 | 1.352 | 36.68 | 67.6 | 37.01 |

We show that GPT-4 is capable of hallucinating 'close' answers, which becomes even worse as numbers get larger, and the absolute error increases.

(Word Problem) Trick Example Question: **An ant travels at 3 m/s on a rubber band. The rubber band is stretched at 2 m/s. How fast is the ant moving relative to the ground?** (Correct: 1)

| ComputeGPT | Wolfram Alpha | Davinci-003 | ChatGPT | GPT-4 |
|------------|---------------|-------------|---------|-------|
| NULL | 3 | 5 | 5 | 1 |

We showcase an example of a loss for ComputeGPT, where it fails to see past the trick question's simplicity in a subtraction of 3 - 2 = 1.

(Word Problem) Example Question: **Given the matrix [[1, 2, 9, 3, 3], [9, 0, 1, 2, 4], [0, 0, 0, 3, 9], [1, 1, 1, 1, 1], [3, 4484, 456, 9, 6]], what is the determinant multiplied by 5 and then divided by twenty-three?** (Correct: -285832.173913042)

| ComputeGPT | Wolfram Alpha | Davinci-003 | ChatGPT | GPT-4 |
|-------------------|---------------|-------------|---------|--------------|
| -285832.173913042 | -1314828 | 24 | -9915 | -30247.652 |

We showcase an example of a clear win for ComputeGPT, where it excels in understanding and performing a complex computation on the user's machine.
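
Several of the reference answers above can be re-derived with exact rational arithmetic. The following is a minimal sketch using only Python's standard library; the helper names `integrate_polynomial` and `determinant` are hypothetical, and this is not a description of how ComputeGPT itself computes its answers.

```python
from fractions import Fraction


def integrate_polynomial(coeffs, lower, upper):
    """Exact definite integral of sum(c * x**p), given a {power: coefficient} dict."""
    def antiderivative(x):
        return sum(Fraction(c, p + 1) * Fraction(x) ** (p + 1) for p, c in coeffs.items())
    return antiderivative(upper) - antiderivative(lower)


def determinant(matrix):
    """Exact determinant via Gaussian elimination over fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n = len(rows)
    det = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            rows[col], rows[pivot] = rows[pivot], rows[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= rows[col][col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n):
                rows[r][c] -= factor * rows[col][c]
    return det


# Integral of 200x from 0 to 5
simple_integral = integrate_polynomial({1: 200}, 0, 5)

# LaTeX example: integral of 2*10^21 x^3 + 200 x^2 from -20 to 50
latex_integral = integrate_polynomial({3: 2 * 10**21, 2: 200}, -20, 50)

# Matrix example: determinant multiplied by 5, then divided by 23
matrix = [
    [1, 2, 9, 3, 3],
    [9, 0, 1, 2, 4],
    [0, 0, 0, 3, 9],
    [1, 1, 1, 1, 1],
    [3, 4484, 456, 9, 6],
]
matrix_answer = determinant(matrix) * 5 / 23

# Word problems: 'jamulti' of 7, and USD 50 converted at 1.352 USD per ASD
jamulti = Fraction(7 * 5 + 2, 3)
asd_needed = Fraction(50) / Fraction("1.352")

print(simple_integral, latex_integral)
print(float(matrix_answer), float(jamulti), float(asd_needed))
```

Keeping everything as `Fraction` avoids the rounding that affects the large-integer examples; converting to `float` only at print time shows the decimal forms quoted in the tables. The raw determinant here is -1314828, which is the value listed for Wolfram Alpha in the last table, so that answer appears to omit the multiplication by 5 and division by 23.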


# Getting Started with ComputeGPT

Just head over to https://computegpt.org and use the model however you like! There are no limits on its use.

Feel free to come back here to report any issues or leave feedback to help improve the model.

![image](https://user-images.githubusercontent.com/76540311/225113515-4f6791f5-093e-48d8-be0a-504695fc977e.png)



## Reference

ComputeGPT: A computational chat model for numerical problems -- https://arxiv.org/abs/2305.06223

```
@article{lewis2023computegpt,
  title={ComputeGPT: A computational chat model for numerical problems},
  author={Ryan Hardesty Lewis and Junfeng Jiao},
  journal={arXiv preprint arXiv:2305.06223},
  year={2023},
}
```



--------------------------------------------------------------------------------
/eval.json:
--------------------------------------------------------------------------------
[
  {
    "_comment": "//// Questions concerning derivatives ////",

    "question": "What is the derivative of 2x?",
    "answer": "2"
  },
  {
    "question": "What is the derivative of 200x?",
    "answer": "200"
  },
  {
    "question": "What is the derivative of 200000000000000x^2?",
    "answer": "400000000000000x"
  },
  {
    "question": "What is the derivative of 2000000000000000000000x^3 + 200x^2?",
    "answer": "6000000000000000000000x^2 + 400x"
  },
  {
    "question": "What is the derivative of 2352363263473273229428421084152234x^32 + 2342632720904298x^7?",
    "answer": "75275592351144744862054746892871488x^31 + 16398429046330086x^6"
  },
  {
    "_comment": "//// A Length Test ////",

    "question": "What is the derivative of 33333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333x^33 + 2222222222222222222222222222222222222222222222222222x^22?",
    "answer": "1099999999999999999999999999999999999999989x^32 + 488888888888888888888888888884x^21"
  },


  {
    "_comment": "//// Questions concerning indefinite integrals ////",

    "question": "What is the integral of 2x?",
    "answer": "x^2"
  },
  {
    "question": "What is the integral of 200x?",
    "answer": "100x^2"
  },
  {
    "question": "What is the integral of 200000000000000x^2?",
    "answer": "(200000000000000/3)x^3"
  },
  {
    "question": "What is the integral of 2000000000000000000000x^3 + 200x^2?",
    "answer": "500000000000000000000x^4 + (200/3)x^3"
  },
  {
    "question": "What is the integral of 2352363263473273229428421084152234x^32 + 2342632720904298x^7?",
    "answer": "(1171316360452149/4)x^8 + (2352363263473273229428421084152234/33)x^33"
  },


  {
    "_comment": "//// Questions concerning definite integrals ////",

    "question": "What is the integral of 2x from 0 to 5?",
    "answer": "25"
  },
  {
    "question": "What is the integral of 200x from 0 to 5?",
    "answer": "2500"
  },
  {
    "question": "What is the integral of 200000000000000x^2 from -20 to 50?",
    "answer": "400000000000000x"
  },
  {
    "_comment": "//// A LaTeX Test ////",

    "question": "\\int _{-20}^{50}\\:2000000000000000000000x^3\\:+\\:200x^2",
    "answer": "3.045E+27"
  },
  {
    "question": "What is the integral of 2352363263473273229428421084152234x^32 + 2342632720904298x^7 from -326 to 1876?",
    "answer": "814829337527704500752799115114314821419082967044570452270896368245042671327333329295453348785087887093112666536348893030036033747555061601856/11"
  },



  {
    "_comment": "//// Questions concerning geometry ////",

    "question": "What is the volume of a sphere with radius 230?",
    "answer": "50965010.4216"
  },
  {
    "question": "What is the diameter of a sphere with volume 25235235?",
    "answer": "363.917474"
  },
  {
    "question": "What is the hypotenuse of a right triangle with legs 24 and 7?",
    "answer": "25"
  },
  {
    "question": "What is the hypotenuse of a right triangle with legs 242342354236 and 24126372347225627?",
    "answer": "25"
  },
  {
    "question": "What is the height of an equilateral triangle with side length 20?",
    "answer": "17.32"
  },
  {
    "question": "What is the perimeter of a rectangle with side lengths 235 and 234?",
    "answer": "938"
  },
  {
    "question": "What is the last angle on a triangle with angles 30 and 60?",
    "answer": "90"
  },
  {
    "question": "What is the diameter of a sphere with a volume that is 2/3 of 60?",
    "answer": "4.24"
  },


  {
    "_comment": "//// Questions concerning algebra ////",

    "question": "What is the value of x in 500x = 2?",
    "answer": "0.004"
  },
  {
    "question": "What is the value of x in 5x + 2x + 1x + x = 36?",
    "answer": "4"
  },
  {
    "question": "What is the value of x in 5x + 2y = 1351 when y = 2?",
    "answer": "269.4"
  },


  {
    "_comment": "//// Questions concerning trigonometry ////",

    "question": "What is the value of cos(6000000 radians)?",
    "answer": "0.5"
  },
  {
    "question": "What is cos(6000000 radians) + cos(3000 degrees)?",
    "answer": "0.40391151"
  },

  {
    "_comment": "//// Questions concerning mathematical word problems ////",

    "question": "John has 3 cats. He buys 2 more. How many cats does he have?",
    "answer": "5"
  },
  {
    "question": "Mary has $500 but loses $200. How much money does she have?",
    "answer": "300"
  },
  {
    "question": "Kim is half the age of her mother. Her mother is 40. How old is Kim?",
    "answer": "20"
  },
  {
    "question": "Hector has 5 apples. His friend gives him 2 more. Afterwards, Hector triples the amount of apples he has. How many apples does he have?",
    "answer": "21"
  },
  {
    "question": "Kevin's age is 5 times the age of his son, plus twenty. His son is 10. How old is Kevin?",
    "answer": "70"
  },
  {
    "question": "A car has 50,000 miles of range. It has 9,000 miles of range left. What percentage of range is left?",
    "answer": "20"
  },
  {
    "question": "An alien needs $50 USD to buy a spaceship. He needs to convert from ASD, which is worth 1.352 USD. How much ASD does he need?",
    "answer": "36.9822485"
  },
  {
    "question": "A travel agent charges $500 for a trip. He charges $50 per person. How many people can go on the trip if the total cost is $830?",
    "answer": "200"
  },
  {
    "question": "What is the integral of the derivative of 3x?",
    "answer": "3x"
  },
  {
    "question": "An ant travels at 3 m/s on a rubber band. The rubber band is stretched at 2 m/s. How fast is the ant moving relative to the ground?",
    "answer": "1"
  },
  {
    "question":"What's the 323th Fibonacci number multiplied by 235?",
    "answer":"3.3464987E+69"
  },
  {
    "question":"A new technique, called 'jamulti' is invented by multipling a number by five and then adding 2 and dividing by 3. What's the jamulti of 7?",
    "answer":"12.3333333333"
  },
  {
    "question": "A friendly dog needs to know the derivative of 201x + 23 in order to get a treat. What's the answer?",
    "answer": "201"
  },
  {
    "question": "What's the distance between Johnny and Claire if they're 24 meters apart?",
    "answer": "24"
  },

  {
    "_comment": "//// Questions concerning physics ////",

    "question": "What is the velocity of an object with acceleration 2 and time 5?",
    "answer": "10"
  },
  {
    "question": "What is the acceleration of an object with velocity 2 and time 5?",
    "answer": "0.4"
  },
  {
    "question": "The schwarzschild radius is 2GM / c^2, where G = 6.67430 * 10^(-11), M = mass, and c = 299792458 m / s. What is the schwarzchild radius of an object with mass 12345678910?",
    "answer": "29540"
  },
  {
    "question": "What is the gravitational constant multiplied by the speed of light all divided by pi?",
    "answer": "0.00636907779"
  },

  {
    "_comment": "//// Questions concerning linear algebra ////",

    "question": "Given the matrix [[1, 2, 9, 3, 3], [9, 0, 1, 2, 4], [0, 0, 0, 3, 9], [1, 1, 1, 1, 1], [3, 4484, 456, 9, 6]], what is the determinant multiplied by 5 and then divided by twenty-three?",
    "answer": "-285832.173913"
  },
  {
    "question": "What's the output matrix from adding [1,2,3,4,...,n] to [1,0,0,0,...,0] where n is 100?",
    "answer": "[ 2. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100.]"
  },
  {
    "question": "Let A = [1,2][3,4] and B = [14162,3252][124,2]. What's the determinant of A*B?",
    "answer": "749848"
  }

]


--------------------------------------------------------------------------------
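
The answers in eval.json mix integers, decimals, scientific notation, and fractions, while eval.py scores with exact string equality, so a numerically equal reply in another format (for example `2,500` instead of `2500`) is counted as incorrect. That is part of why the script's header asks for manual review. A tolerant comparison could reduce that manual work. The following is a minimal sketch, not part of the repository: `parse_number` and `answers_match` are hypothetical helpers, the relative tolerance is an assumed value, and the reply is assumed to be already reduced to the bare answer (no surrounding sentence).

```python
import math
from fractions import Fraction


def parse_number(text):
    """Read a decimal, scientific-notation, or 'a/b' string; return None if not numeric."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(Fraction(cleaned))
    except (ValueError, ZeroDivisionError, OverflowError):
        return None


def answers_match(expected, actual, rel_tol=1e-6):
    """Compare numerically when both sides parse as numbers, else as normalized strings."""
    expected_value = parse_number(expected)
    actual_value = parse_number(actual)
    if expected_value is not None and actual_value is not None:
        return math.isclose(expected_value, actual_value, rel_tol=rel_tol)
    # Symbolic answers such as "100x^2" fall back to a whitespace/case-insensitive match.
    return expected.strip().casefold() == actual.strip().casefold()


print(answers_match("3.045E+27", "3045000000000000000000000000"))
print(answers_match("12.3333333333", "12.33333"))
print(answers_match("70", "50"))
```

Symbolic answers (derivatives, indefinite integrals, the matrix output) still need a human or a computer algebra system to judge equivalence, since this fallback only normalizes whitespace and case.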