├── .gitignore ├── README.md ├── gpt_annotate.py ├── llm_annotate_paper.pdf └── sample_annotation_code.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | gpt_api_key.txt 2 | .ipynb_checkpoints* 3 | .DS_store -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introducing gpt_annotate 2 | An easy-to-use Python package designed to streamline automated text annotation using LLMs for different tasks and datasets. All you need is an OpenAI API key, text samples you want to annotate, and a codebook (i.e., task-specific instructions) for the LLM. 3 | * OpenAI API key 4 | * Sign up for one here: https://platform.openai.com/account/api-keys 5 | * text_to_annotate: 6 | * A dataframe that includes one column for text samples and, if you are comparing the LLM output against humans, any number of one-hot-encoded category columns. The text column should be the first column in your data. We provide Python code (described below) that will automatically assist with the formatting of `text_to_annotate` to ensure accurate annotation. 7 | * codebook: 8 | * Task-specific instructions (as type string) to prompt the LLM to annotate the data. Like codebooks for qualitative content analysis, this should clearly describe the dataset, the type of task for the LLM, and, most importantly, delineate the categories of interest for the LLM to annotate. We provide Python code to standardize the beginning and ending of the codebook to ensure that the LLM understands that the task is annotation. 9 | * For example, the text of `codebook` could be: "You will be classifying text samples. Each text sample is a tweet. Classify each tweet on two dimensions: a) POLITICAL; b) PRESIDENT. For POLITICAL, label as 1 if the tweet is about politics; label as 0 if not. For PRESIDENT, label as 1 if the tweet refers to a past or present president, a candidate for president, or a presidential election; label as 0 if not. Classify the following text samples:" 10 | 11 | To annotate your text data using gpt_annotate, we recommend following the sample code we provide in `sample_annotation_code.ipynb`. 12 | 13 | As shown in `sample_annotation_code.ipynb`, annotating your text data with LLMs is as easy as 4 simple steps: 14 | 1. Import the required dependencies (including gpt_annotate.py). 15 | 16 | ``` 17 | import openai 18 | import pandas as pd 19 | import math 20 | import time 21 | import numpy as np 22 | import tiktoken 23 | #### Import main package: gpt_annotate.py 24 | # Make sure that the .py file is in the same directory as the .ipynb file, or you provide the correct relative or absolute path to the .py file. 25 | import gpt_annotate 26 | ``` 27 | 28 | 2. Read in your codebook (i.e., task-specific instructions) and the text samples you want to annotate. 29 | 30 | ``` 31 | text_to_annotate = pd.read_csv("text_to_annotate.csv") 32 | with open('codebook.txt', 'r') as file: 33 | codebook = file.read() 34 | ``` 35 | 36 | 3. To ensure your data is in the right format, you must first run `gpt_annotate.prepare_data(text_to_annotate, codebook, key)`. If you are annotating text data without any human labels to compare against, change the default to `human_labels = False`. If you want to add standardized language to the beginning and end of your codebook to ensure that GPT will label your text samples, change the default to `prep_codebook = True`. 
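For example, if your text data has no human labels and you want the standardized codebook language added automatically, the call would be:
```
text_to_annotate = gpt_annotate.prepare_data(text_to_annotate, codebook, key, human_labels = False, prep_codebook = True)
```
Or, with the default settings: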
37 | ```
38 | text_to_annotate = gpt_annotate.prepare_data(text_to_annotate, codebook, key)
39 | ```
40 | 4. If comparing LLM output to human labels, run `gpt_annotate.gpt_annotate(text_to_annotate, codebook, key)`. If only using gpt_annotate for prediction (i.e., no human labels to compare performance), run `gpt_annotate.gpt_annotate(text_to_annotate, codebook, key, human_labels = False)`. It’s as easy as that!
41 | ```
42 | # Annotate the data (returns 4 outputs)
43 | gpt_out_all, gpt_out_final, performance, incorrect = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key)
44 | # Annotate the data (without human labels to compare against) (returns 2 outputs)
45 | gpt_out_all, gpt_out_final = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key, human_labels = False)
46 | ```
47 | 
48 | Outputs:
49 | 1) `gpt_out_all`
50 | * Raw outputs for every iteration.
51 | 2) `gpt_out_final`
52 | * Annotation outputs after taking modal category answer and calculating consistency scores.
53 | 3) `performance`
54 | * Accuracy, precision, recall, and f1.
55 | 4) `incorrect`
56 | * Any incorrect classification or classification with less than 1.0 consistency.
57 | 
58 | Below we describe the optional parameters within `gpt_annotate()` that you can use to customize your annotation procedure.
59 | * num_iterations:
60 | * Number of times to classify each text sample. Default is 3.
61 | * model:
62 | * OpenAI GPT model, which is either `gpt-3.5-turbo` or `gpt-4`. Default is `gpt-4`.
63 | * temperature:
64 | * LLM temperature parameter (ranges 0 to 1), which controls the randomness of the model's output; higher values produce more varied annotations across iterations. Default is 0.6.
65 | * batch_size:
66 | * Number of text samples to be annotated in each batch. Default is 10.
67 | * human_labels:
68 | * Boolean indicating whether `text_to_annotate` has human labels to compare LLM outputs to. Default is True.
69 | * data_prep_warning:
70 | * Boolean indicating whether to ask for confirmation that you have already run `prepare_data()` before annotating. Default is True.
71 | * time_cost_warning:
72 | * Boolean indicating whether to print an estimated cost and runtime and ask for confirmation before annotating. Default is True.
73 | 
74 | 
75 | Please email us (njpang@sas.upenn.edu, sam.wolken@asc.upenn.edu) with any suggestions or problems encountered with the code.
76 | 
--------------------------------------------------------------------------------
/gpt_annotate.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | ## Overview
4 | 
5 | Given a codebook (.txt) and a dataset (.csv) that has one text column and any number of category columns as binary indicators, the main function (`gpt_annotate`) annotates
6 | all the samples using an OpenAI GPT model (ChatGPT or GPT-4) and calculates performance metrics (if human labels are provided). Before running `gpt_annotate`,
7 | users should run `prepare_data` to ensure that their data is in the correct format.
8 | 
9 | Flow of `gpt_annotate`:
10 | * 1) Based on a provided codebook, the function uses an OpenAI GPT model to annotate every text sample, repeating this for a user-specified number of iterations.
11 | * 2) The function reduces the annotation output down to the modal annotation category across iterations for each category. At this stage,
12 | the function adds a consistency score for each annotation across iterations.
13 | * 3) If provided human labels, the function determines, for every category, whether the annotation is correct (by comparing to the human label),
14 | then also adds whether it is a true positive, false positive, true negative, or false negative.
15 | * 4) Finally, if provided human labels, the function calculates performance metrics (accuracy, precision, recall, and f1) for every category. 16 | 17 | The main function (`gpt_annotate`) returns four .csv's, if human labels are provided. 18 | If no human labels are provided, the main function only returns 1 and 2 listed below. 19 | * 1) `gpt_out_all_iterations.csv` 20 | * Raw outputs for every iteration. 21 | * 2) `gpt_out_final.csv` 22 | * Annotation outputs after taking modal category answer and calculating consistency scores. 23 | * 3) `performance_metrics.csv` 24 | * Accuracy, precision, recall, and f1. 25 | * 4) `incorrect.csv` 26 | * Any incorrect classification or classification with less than 1.0 consistency. 27 | 28 | Our code aims to streamline automated text annotation for different datasets and numbers of categories. 29 | 30 | """ 31 | 32 | import subprocess 33 | import sys 34 | 35 | def install(package): 36 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 37 | 38 | install("openai") 39 | install("pandas") 40 | install("numpy") 41 | install("tiktoken") 42 | 43 | import openai 44 | import pandas as pd 45 | import math 46 | import time 47 | import numpy as np 48 | import tiktoken 49 | from openai import OpenAI 50 | import os 51 | 52 | 53 | def prepare_data(text_to_annotate, codebook, key, 54 | prep_codebook = False, human_labels = True, no_print_preview = False): 55 | 56 | """ 57 | This function ensures that the data is in the correct format for LLM annotation. 58 | If the data fails any of the requirements, returns the original input dataframe. 59 | 60 | text_to_annotate: 61 | Data that will be prepared for analysis. Should include a column with text to annotate, and, if human_labels = True, the human labels. 62 | codebook: 63 | String detailing the task-specific instructions. 64 | key: 65 | OpenAI API key 66 | prep_codebook: 67 | boolean indicating whether to standardize beginning and end of codebook to ensure that the LLM prompt is annotating text samples. 68 | human_labels: 69 | boolean indicating whether text_to_annotate has human labels to compare LLM outputs to. 70 | no_print_preview: 71 | Does not print preview of user's data after preparing. 72 | 73 | Returns: 74 | Updated dataframe (text_to_annotate) and codebook (if prep_codebook = True) that are ready to be used for annotation using gpt_annotate. 
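Note: unless no_print_preview = True, the function prints the annotation categories it detected and asks you to confirm them interactively before returning the prepared dataframe.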
75 | """ 76 | 77 | # Check if text_to_annotate is a dataframe 78 | if not isinstance(text_to_annotate, pd.DataFrame): 79 | print("Error: text_to_annotate must be pd.DataFrame.") 80 | return text_to_annotate 81 | 82 | # Make copy of input data 83 | original_df = text_to_annotate.copy() 84 | 85 | # set OpenAI key 86 | openai.api_key = key 87 | 88 | # Standardize beginning and end of codebook to ensure that the LLM prompt is annotating text samples 89 | if prep_codebook == True: 90 | codebook = prepare_codebook(codebook) 91 | 92 | # Add unique_id column to text_to_annotate 93 | text_to_annotate = text_to_annotate \ 94 | .reset_index() \ 95 | .rename(columns={'index':'unique_id'}) 96 | 97 | ##### Minor Cleaning 98 | # Drop any Unnamed columns 99 | if any('Unnamed' in col for col in text_to_annotate.columns): 100 | text_to_annotate = text_to_annotate.drop(text_to_annotate.filter(like='Unnamed').columns, axis=1) 101 | # Drop any NA values 102 | text_to_annotate = text_to_annotate.dropna() 103 | 104 | ########## Confirming data is in correct format 105 | ##### 1) Check whether second column is string 106 | # rename second column to be 'text' 107 | text_to_annotate.columns.values[1] = 'text' 108 | # check whether second column is string 109 | if not (text_to_annotate.iloc[:, 1].dtype == 'string' or text_to_annotate.iloc[:, 1].dtype == 'object'): 110 | print("ERROR: Second column should be the text that you want to annotate.") 111 | print("") 112 | print("Your data:") 113 | print(text_to_annotate.head()) 114 | print("") 115 | print("Sample data format:") 116 | error_message(human_labels) 117 | return original_df 118 | 119 | ##### 2) If human_labels == False, there should only be 2 columns 120 | if human_labels == False and len(text_to_annotate.columns) != 2: 121 | print("ERROR: You have set human_labels = False, which means you should only have two columns in your data.") 122 | print("") 123 | print("Your data:") 124 | print(text_to_annotate.head()) 125 | print("") 126 | print("Sample data format:") 127 | error_message(human_labels) 128 | return original_df 129 | 130 | ##### 3) If human_labels == True, there should be more than 2 columns 131 | if human_labels == True and len(text_to_annotate.columns) < 3: 132 | print("ERROR: You have set human_labels = True (default value), which means you should have more than 2 columns in your data.") 133 | print("") 134 | print("Your data:") 135 | print(text_to_annotate.head()) 136 | print("") 137 | print("Sample data format:") 138 | error_message(human_labels) 139 | return original_df 140 | 141 | 142 | ##### 4) If human_labels == True, check if columns 3 through end of df are binary indicators 143 | if human_labels == True: 144 | # iterate through each column (after unique_id and text) to confirm that the values are only 0's and 1's 145 | for col in range(2, len(text_to_annotate.columns)): 146 | if set(text_to_annotate.iloc[:,col].unique()) != {0, 1}: 147 | print("ERROR: All category columns must be one-hot encoded (i.e., represent numeric or categorical variables as binary vectors of type integer).") 148 | print("Make sure that all category columns only contain 1's and 0's.") 149 | print("") 150 | print("Your data:") 151 | print(text_to_annotate.head()) 152 | print("") 153 | print("Sample data format:") 154 | error_message(human_labels) 155 | return original_df 156 | 157 | ##### 5) Add llm_query column that includes a unique ID identifier per text sample 158 | text_to_annotate['llm_query'] = text_to_annotate.apply(lambda x: str(x['unique_id']) + " " + 
str(x['text']) + "\n", axis=1) 159 | 160 | ##### 6) Make sure category names in codebook exactly match category names in text_to_annotate 161 | # extract category names from codebook 162 | if human_labels: 163 | col_names_codebook = get_classification_categories(codebook, key) 164 | # get category names in text_to_annotate 165 | df_cols = text_to_annotate.columns.values.tolist() 166 | # remove 'unique_id', 'text' and 'llm_query' from columns 167 | col_names = [col for col in df_cols if col not in ['unique_id','text', 'llm_query']] 168 | ### Check whether categories are the same in codebook and text_to_annotate 169 | if [col for col in col_names] != col_names_codebook: 170 | print("ERROR: Column names in codebook and text_to_annotate do not match exactly. Please note that order and capitalization matters.") 171 | print("Change order/spelling in codebook or text_to_annotate.") 172 | print("") 173 | print("Exact order and spelling of category names in text_to_annotate: ", col_names) 174 | print("Exact order and spelling of category names in codebook: ", col_names_codebook) 175 | return original_df 176 | else: 177 | col_names = get_classification_categories(codebook, key) 178 | 179 | ##### Confirm correct categories with user 180 | # Print annotation categories 181 | print("") 182 | print("Categories to annotate:") 183 | for index, item in enumerate(col_names, start=1): 184 | print(f"{index}) {item}") 185 | print("") 186 | if no_print_preview == False: 187 | waiting_response = True 188 | while waiting_response: 189 | # Confirm annotation categories 190 | input_response = input("Above are the categories you are annotating. Is this correct? (Options: Y or N) ") 191 | input_response = str(input_response).lower() 192 | if input_response == "y" or input_response == "yes": 193 | print("") 194 | print("Data is ready to be annotated using gpt_annotate()!") 195 | print("") 196 | print("Glimpse of your data:") 197 | # print preview of data 198 | print("Shape of data: ", text_to_annotate.shape) 199 | print(text_to_annotate.head()) 200 | return text_to_annotate 201 | elif input_response == "n" or input_response == "no": 202 | print("") 203 | print("Adjust your codebook to clearly indicate the names of the categories you would like to annotate.") 204 | return original_df 205 | else: 206 | print("Please input Y or N.") 207 | else: 208 | return text_to_annotate 209 | 210 | 211 | def gpt_annotate(text_to_annotate, codebook, key, 212 | num_iterations = 3, model = "gpt-4", temperature = 0.6, batch_size = 10, 213 | human_labels = True, data_prep_warning = True, 214 | time_cost_warning = True): 215 | """ 216 | Loop over the text_to_annotate rows in batches and classify each text sample in each batch for multiple iterations. 217 | Store outputs in a csv. Function is calculated in batches in case of crash. 218 | 219 | text_to_annotate: 220 | Input data that will be annotated. 221 | codebook: 222 | String detailing the task-specific instructions. 223 | key: 224 | OpenAI API key. 225 | num_iterations: 226 | Number of times to classify each text sample. 227 | model: 228 | OpenAI GPT model, which is either gpt-3.5-turbo or gpt-4 229 | temperature: 230 | LLM temperature parameter (ranges 0 to 1), which indicates the degree of diversity to introduce into the model. 231 | batch_size: 232 | number of text samples to be annotated in each batch. 233 | human_labels: 234 | boolean indicating whether text_to_annotate has human labels to compare LLM outputs to. 
235 | data_prep_warning: 236 | boolean indicating whether to print data_prep_warning 237 | time_cost_warning: 238 | boolean indicating whether to print time_cost_warning 239 | 240 | 241 | 242 | Returns: 243 | gpt_annotate returns the four .csv's below, if human labels are provided. If no human labels are provided, 244 | gpt_annotate only returns gpt_out_all_iterations.csv and gpt_out_final.csv 245 | 246 | 1) `gpt_out_all_iterations.csv` 247 | Raw outputs for every iteration. 248 | 2) `gpt_out_final.csv` 249 | Annotation outputs after taking modal category answer and calculating consistency scores. 250 | 3) `performance_metrics.csv` 251 | Accuracy, precision, recall, and f1. 252 | 4) `incorrect.csv` 253 | Any incorrect classification or classification with less than 1.0 consistency. 254 | 255 | """ 256 | 257 | 258 | from openai import OpenAI 259 | 260 | client = OpenAI( 261 | api_key=key, 262 | ) 263 | 264 | OpenAI.api_key = os.getenv(key) 265 | 266 | # set OpenAI key 267 | openai.api_key = key 268 | 269 | # Double check that user has confirmed format of the data 270 | if data_prep_warning: 271 | waiting_response = True 272 | while waiting_response: 273 | input_response = input("Have you successfully run prepare_data() to ensure text_to_annotate is in correct format? (Options: Y or N) ") 274 | input_response = str(input_response).lower() 275 | if input_response == "y" or input_response == "yes": 276 | # If user has run prepare_data(), confirm that data is in correct format 277 | # first, check if first column is title "unique_id" 278 | if text_to_annotate.columns[0] == 'unique_id' and text_to_annotate.columns[-1] == 'llm_query': 279 | try: 280 | text_to_annotate_copy = text_to_annotate.iloc[:, 1:-1] 281 | except pd.core.indexing.IndexingError: 282 | print("") 283 | print("ERROR: Run prepare_data(text_to_annotate, codebook, key) before running gpt_annotate(text_to_annotate, codebook, key).") 284 | text_to_annotate_copy = prepare_data(text_to_annotate_copy, codebook, key, prep_codebook = False, human_labels = human_labels, no_print_preview = True) 285 | # if there was an error, exit gpt_annotate 286 | if text_to_annotate_copy.columns[0] != "unique_id": 287 | if human_labels: 288 | return None, None, None, None 289 | elif human_labels == False: 290 | return None, None 291 | else: 292 | print("ERROR: First column should be title 'unique_id' and last column should be titled 'llm_query'") 293 | print("Try running prepare_data() again") 294 | print("") 295 | print("Your data:") 296 | print(text_to_annotate.head()) 297 | print("") 298 | print("Sample data format:") 299 | if human_labels: 300 | return None, None, None, None 301 | elif human_labels == False: 302 | return None, None 303 | waiting_response = False 304 | elif input_response == "n" or input_response == "no": 305 | print("") 306 | print("Run prepare_data(text_to_annotate, codebook, key) before running gpt_annotate(text_to_annotate, codebook, key).") 307 | if human_labels: 308 | return None, None, None, None 309 | elif human_labels == False: 310 | return None, None 311 | else: 312 | print("Please input Y or N.") 313 | 314 | # df to store results 315 | out = pd.DataFrame() 316 | 317 | # Determine number of batches 318 | num_rows = len(text_to_annotate) 319 | num_batches = math.floor(num_rows/batch_size) 320 | num_iterations = num_iterations 321 | 322 | # Add categories to classify 323 | col_names = ["unique_id"] + text_to_annotate.columns.values.tolist()[2:-1] 324 | if human_labels == False: 325 | col_names = 
get_classification_categories(codebook, key) 326 | col_names = ["unique_id"] + col_names 327 | 328 | ### Nested for loop for main function 329 | # Iterate over number of classification iterations 330 | for j in range(num_iterations): 331 | # Iterate over number of batches 332 | for i in range(num_batches): 333 | 334 | # Based on batch, determine starting row and end row 335 | start_row = i*batch_size 336 | end_row = (i+1)*batch_size 337 | 338 | # Extract the text samples to annotate 339 | llm_query = text_to_annotate['llm_query'][start_row:end_row].str.cat(sep=' ') 340 | 341 | # Start while loop in case GPT fails to annotate a batch 342 | need_response = True 343 | while need_response: 344 | fails = 0 345 | # confirm time and cost with user before annotating data 346 | if fails == 0 and j == 0 and i == 0 and time_cost_warning: 347 | quit = estimate_time_cost(text_to_annotate, codebook, llm_query, model, num_iterations, num_batches, batch_size, col_names[1:]) 348 | if quit and human_labels: 349 | return None, None, None, None 350 | elif quit and human_labels == False: 351 | return None, None 352 | # if GPT fails to annotate a batch 3 times, skip the batch 353 | while(fails < 3): 354 | try: 355 | # Set temperature 356 | temperature = temperature 357 | # annotate the data by prompting GPT 358 | response = get_response(codebook, llm_query, model, temperature, key) 359 | # parse GPT's response into a clean dataframe 360 | text_df_out = parse_text(response, col_names) 361 | break 362 | except: 363 | fails += 1 364 | pass 365 | if (',' in response.choices[0].message.content or '|' in response.choices[0].message.content): 366 | need_response = False 367 | 368 | # update iteration 369 | text_df_out['iteration'] = j+1 370 | 371 | # add iteration annotation results to output df 372 | out = pd.concat([out, text_df_out]) 373 | time.sleep(.5) 374 | # print status report 375 | print("iteration: ", j+1, "completed") 376 | 377 | # Convert unique_id col to numeric 378 | out['unique_id'] = pd.to_numeric(out['unique_id']) 379 | 380 | # Combine input df (i.e., df with text column and true category labels) 381 | out_all = pd.merge(text_to_annotate, out, how="inner", on="unique_id") 382 | 383 | # replace any NA values with 0's 384 | out_all.replace('', np.nan, inplace=True) 385 | out_all.replace('-', np.nan, inplace=True) 386 | out_all.fillna(0, inplace=True) 387 | 388 | ##### output 1: full annotation results 389 | out_all.to_csv('gpt_out_all_iterations.csv',index=False) 390 | 391 | # calculate modal label and consistency score 392 | out_mode = get_mode_and_consistency(out_all, col_names,num_iterations,human_labels) 393 | 394 | if human_labels == True: 395 | # evaluate classification per category 396 | num_categories = len(col_names) - 1 # account for unique_id 397 | for label in range(0, num_categories): 398 | out_final = evaluate_classification(out_mode, label, num_categories) 399 | 400 | ##### output 2: final annotation results with modal label and consistency score 401 | out_final.to_csv('gpt_out_final.csv',index=False) 402 | 403 | # calculate performance metrics 404 | performance = performance_metrics(col_names, out_final) 405 | 406 | ##### output 3: performance metrics 407 | performance.to_csv('performance_metrics.csv',index=False) 408 | 409 | # Determine incorrect classifications and classifications with less than 1.0 consistency 410 | incorrect = filter_incorrect(out_final) 411 | 412 | ##### output 4: Incorrect classifications 413 | incorrect.to_csv('incorrect_sub.csv',index=False) 414 | return out_all, 
out_final, performance, incorrect 415 | 416 | else: 417 | # if not assessing performance against human annotators, then only save out_mode 418 | out_final = out_mode.copy() 419 | out_final.to_csv('gpt_out_final.csv',index=False) 420 | return out_all, out_final 421 | 422 | ########### Helper Functions 423 | 424 | def prepare_codebook(codebook): 425 | """ 426 | Standardize beginning and end of codebook to ensure that the LLM prompt is annotating text samples. 427 | 428 | codebook: 429 | String detailing the task-specific instructions. 430 | 431 | Returns: 432 | Updated codebook ready for annotation. 433 | """ 434 | beginning = "Use this codebook for text classification. Return your classifications in a table with one column for text number (the number preceding each text sample) and a column for each label. Use a csv format. " 435 | end = " Classify the following text samples:" 436 | return beginning + codebook + end 437 | 438 | def error_message(human_labels = True): 439 | """ 440 | Prints sample data format if error. 441 | 442 | human_labels: 443 | boolean indicating whether text_to_annotate has human labels to compare LLM outputs to. 444 | """ 445 | if human_labels == True: 446 | toy_data = { 447 | 'unique_id': [0, 1, 2, 3, 4], 448 | 'text': ['sample text to annotate', 'sample text to annotate', 'sample text to annotate', 'sample text to annotate', 'sample text to annotate'], 449 | 'category_1': [1, 0, 1, 0, 1], 450 | 'category_2': [0, 1, 1, 0, 1] 451 | } 452 | toy_data = pd.DataFrame(toy_data) 453 | print(toy_data) 454 | else: 455 | toy_data = { 456 | 'unique_id': [0, 1, 2, 3, 4], 457 | 'text': ['sample text to annotate', 'sample text to annotate', 'sample text to annotate', 'sample text to annotate', 'sample text to annotate'], 458 | } 459 | toy_data = pd.DataFrame(toy_data) 460 | print(toy_data) 461 | 462 | def get_response(codebook, llm_query, model, temperature, key): 463 | """ 464 | Function to query OpenAI's API and get an LLM response. 465 | 466 | Codebook: 467 | String detailing the task-specific instructions 468 | llm_query: 469 | The text samples to append to the task-specific instructions 470 | Model: 471 | gpt-3.5-turbo (Chat-GPT) or GPT-4 472 | Temperature: 473 | LLM temperature parameter (ranges 0 to 1) 474 | 475 | Returns: 476 | LLM output. 477 | """ 478 | 479 | from openai import OpenAI 480 | 481 | client = OpenAI( 482 | api_key=key, 483 | ) 484 | 485 | OpenAI.api_key = os.getenv(key) 486 | 487 | # max tokens for gpt-3.5-turbo is 4096 and max for gpt-4 is 8000 488 | if model == "gpt-3.5-turbo": 489 | max_tokens = 500 490 | elif model == "gpt-4": 491 | max_tokens = 1000 492 | 493 | # Create function to llm_query GPT 494 | # Create function to llm_query GPT 495 | response = client.chat.completions.create( 496 | model=model, # chatgpt: gpt-3.5-turbo # gpt-4: gpt-4 497 | messages=[ 498 | {"role": "user", "content": codebook + llm_query}], 499 | temperature=temperature, # ChatGPT default is 0.7 (set lower reduce variation across queries) 500 | max_tokens = max_tokens, 501 | top_p=1.0, 502 | frequency_penalty=0.0, 503 | presence_penalty=0.0 504 | ) 505 | return response 506 | 507 | def get_classification_categories(codebook, key): 508 | """ 509 | Function that extracts what GPT will label each annotation category to ensure a match with text_to_annotate. 510 | Order and exact spelling matter. Main function will not work if these do not match perfectly. 
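The category names are extracted by querying GPT-4 once with temperature 0 and asking it to return only the names of the categories, in order.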
511 | 512 | Codebook: 513 | String detailing the task-specific instructions 514 | 515 | Returns: 516 | Categories to be annotated, as specified in the codebook. 517 | """ 518 | 519 | # llm_query to ask GPT for categories from codebook 520 | instructions = "Below I am going to provide you with two sets of instructions (Part 1 and Part 2). The first part will contain detailed instructions for a text classification project. The second part will ask you to identify information about part 1. Prioritize the instructions from Part 2. Part 1:" 521 | llm_query = "Part 2: I've provided a codebook in the previous sentences. Please print the categories in the order you will classify them. Ignore every other task that I described in the codebook. I only want to know the categories. Do not include any text numbers or any annotations in your response. Do not include any language like 'The categories to be identified are:'. Only include the names of the categories you are identifying. : " 522 | 523 | # Set temperature to 0 to make model deterministic 524 | temperature = 0 525 | # set model to GPT-4 526 | model = "gpt-4" 527 | 528 | 529 | from openai import OpenAI 530 | 531 | client = OpenAI( 532 | api_key=key, 533 | ) 534 | 535 | OpenAI.api_key = os.getenv(key) 536 | 537 | ### Get GPT response and clean response 538 | response = get_response(codebook, llm_query, model, temperature, key) 539 | text = response.choices[0].message.content 540 | text_split = text.split('\n') 541 | text_out = text_split[0] 542 | # text_out_list is final output of categories as a list 543 | codebook_columns = text_out.split(', ') 544 | 545 | return codebook_columns 546 | 547 | def parse_text(response, headers): 548 | """ 549 | This function converts GPT's output to a dataframe. GPT sometimes returns the output in different formats. 550 | Because there is variability in GPT outputs, this function handles different variations in possible outputs. 551 | 552 | response: 553 | LLM output 554 | headers: 555 | column names for text_to_annotate dataframe 556 | 557 | Returns: 558 | GPT output as a cleaned dataframe. 559 | 560 | """ 561 | try: 562 | text = response.choices[0].message.content 563 | text_split = text.split('\n') 564 | 565 | if any(':' in element for element in text_split): 566 | text_split_split = [item.split(":") for item in text_split] 567 | 568 | if ',' in text: 569 | text_split_out = [row for row in text_split if (',') in row] 570 | text_split_split = [text.split(',') for text in text_split_out] 571 | if '|' in text: 572 | text_split_out = [row for row in text_split if ('|') in row] 573 | text_split_split = [text.split('|') for text in text_split_out] 574 | 575 | for row in text_split_split: 576 | if '' in row: 577 | row.remove('') 578 | if '' in row: 579 | row.remove('') 580 | if ' ' in row: 581 | row.remove(' ') 582 | 583 | text_df = pd.DataFrame(text_split_split) 584 | text_df_out = pd.DataFrame(text_df.values[1:], columns=headers) 585 | text_df_out = text_df_out[pd.to_numeric(text_df_out.iloc[:,1], errors='coerce').notnull()] 586 | except Exception as e: 587 | print("ERROR: GPT output not in specified categories. 
Make your codebook clearer to indicate what the output format should be.") 588 | print("Try running prepare_data(text_to_annotate, codebook, key, prep_codebook = True") 589 | print("") 590 | return text_df_out 591 | 592 | def get_mode_and_consistency(df, col_names, num_iterations, human_labels): 593 | """ 594 | This function calculates the modal label across iterations and calculates the 595 | LLMs consistency score across label annotations. 596 | 597 | df: 598 | Input dataframe (text_to_annotate) 599 | col_names: 600 | Category names to be annotated 601 | num_iterations: 602 | number of iterations in gpt_annotate 603 | 604 | Returns: 605 | Modal label across iterations and consistency score for every text annotation. 606 | 607 | """ 608 | 609 | # Drop unique_id column in list of category names 610 | categories_names = col_names[1:] 611 | # Change names to add a 'y' at end to match the output df, if human_labels == True 612 | if human_labels == True: 613 | categories_names = [name + "_y" for name in categories_names] 614 | 615 | ##### Calculate modal label classification across iterations 616 | # Convert dataframe to numeric 617 | df_numeric = df.apply(pd.to_numeric, errors='coerce') 618 | # Group by unique_id 619 | grouped = df_numeric.groupby('unique_id') 620 | # Calculate modal label classification per unique id (.iloc[0] means take the first value if there are ties) 621 | modal_values = grouped[categories_names].apply(lambda x: x.mode().iloc[0]) 622 | # Create consistency score by calculating the number of times the modal answer appears per iteration 623 | consistency = grouped[categories_names].apply(lambda x: (x.mode().iloc[0] == x).sum()/num_iterations) 624 | 625 | ##### Data cleaning for new dfs related to consistency scores 626 | # add the string 'consistency_' to the column names of the consistency df 627 | consistency.columns = ['consistency_' + col for col in consistency.columns] 628 | # drop '_y' string from each column name of the consistency df, if human_labels == True 629 | if human_labels == True: 630 | consistency.columns = consistency.columns.str.replace(r'_y$', '',regex=True) 631 | # combine the modal label classification to the consistency score 632 | df_combined = pd.concat([modal_values, consistency], axis=1) 633 | # reset index 634 | df_combined = df_combined.reset_index() 635 | # in the modal label column name, replace '_y' with '_pred' 636 | if human_labels == True: 637 | df_combined.columns = [col.replace('_y', '_pred') if '_y' in col else col for col in df_combined.columns] 638 | # drop first column 639 | df_combined = df_combined.drop(df_combined.columns[0], axis=1) 640 | 641 | ##### Data cleaning for the input df (combine with ) 642 | # drop duplicates 643 | df_new = df.drop_duplicates(subset=['unique_id']) 644 | # replace '_x' with '_pred' 645 | if human_labels == True: 646 | df_new.columns = [col.replace('_x', '_true') if '_x' in col else col for col in df_new.columns] 647 | # add 2 to the length of the categories (to account for unique id and text columns) 648 | length_of_col = 2 + len(categories_names) 649 | # clean up columns included in df 650 | df_new = df_new.iloc[:,0:length_of_col] 651 | df_new = df_new.reset_index() 652 | df_new = df_new.drop(df_new.columns[0], axis=1) 653 | else: 654 | first_two_columns = df_new.iloc[:, 0:2] 655 | df_new = pd.DataFrame(first_two_columns) 656 | df_new = df_new.reset_index() 657 | 658 | # combine into final df 659 | out = pd.concat([df_new, df_combined], axis=1) 660 | 661 | 662 | return out 663 | 664 | def 
evaluate_classification(df, category, num_categories): 665 | """ 666 | Determines whether the classification is correct, then also adds whether it is a 667 | true positive, false positive, true negative, or false negative 668 | 669 | df: 670 | Input dataframe (text_to_annotate) 671 | Category: 672 | Category names to be annotated 673 | num_categories: 674 | total number of annotation categories 675 | 676 | Returns: 677 | Added columns to input dataframe specifying whether the GPT annotation category is correct and whether it is a tp, fp, tn, or fn 678 | """ 679 | # account for indexing starting at 0 680 | category = category + 1 681 | # specify category 682 | correct = "correct" + "_" + str(category) 683 | tp = "tp" + "_" + str(category) 684 | tn = "tn" + "_" + str(category) 685 | fp = "fp" + "_" + str(category) 686 | fn = "fn" + "_" + str(category) 687 | # account for text col and unique id (but already added one to account for zero index start) 688 | category = category + 1 689 | 690 | # evaluate classification 691 | df[correct] = (df.iloc[:, category] == df.iloc[:, category+num_categories].astype(int)).astype(int) 692 | df[tp] = ((df.iloc[:, category] == 1) & (df.iloc[:, category+num_categories].astype(int) == 1)).astype(int) 693 | df[tn] = ((df.iloc[:, category] == 0) & (df.iloc[:, category+num_categories].astype(int) == 0)).astype(int) 694 | df[fp] = ((df.iloc[:, category] == 0) & (df.iloc[:, category+num_categories].astype(int) == 1)).astype(int) 695 | df[fn] = ((df.iloc[:, category] == 1) & (df.iloc[:, category+num_categories].astype(int) == 0)).astype(int) 696 | 697 | return df 698 | 699 | def performance_metrics(col_names, df): 700 | """ 701 | Calculates performance metrics (accuracy, precision, recall, and f1) for every category. 702 | 703 | col_names: 704 | Category names to be annotated 705 | df: 706 | Input dataframe (text_to_annotate) 707 | 708 | Returns: 709 | Dataframe with performance metrics 710 | """ 711 | 712 | # Initialize lists to store the metrics 713 | categories_names = col_names[1:] 714 | categories = [index for index, string in enumerate(categories_names)] 715 | metrics = ['accuracy', 'precision', 'recall', 'f1'] 716 | accuracy_list = [] 717 | precision_list = [] 718 | recall_list = [] 719 | f1_list = [] 720 | 721 | # Calculate the metrics for each category and store them in the lists 722 | for cat in categories: 723 | tp = df['tp_' + str(cat+1)].sum() 724 | tn = df['tn_' + str(cat+1)].sum() 725 | fp = df['fp_' + str(cat+1)].sum() 726 | fn = df['fn_' + str(cat+1)].sum() 727 | 728 | accuracy = (tp + tn) / (tp + tn + fp + fn) 729 | if tp + fp == 0: # include to account for undefined denominator 730 | precision = 0 731 | else: 732 | precision = tp / (tp + fp) 733 | if tp + fn == 0: # include to account for undefined denominator 734 | recall = 0 735 | else: 736 | recall = tp / (tp + fn) 737 | if precision + recall == 0: # include to account for undefined denominator 738 | f1 = 0 739 | else: 740 | f1 = (2 * precision * recall) / (precision + recall) 741 | # append metrics 742 | accuracy_list.append(accuracy) 743 | precision_list.append(precision) 744 | recall_list.append(recall) 745 | f1_list.append(f1) 746 | 747 | # Create a dataframe to store the results 748 | results = pd.DataFrame({ 749 | 'Category': categories_names, 750 | 'Accuracy': accuracy_list, 751 | 'Precision': precision_list, 752 | 'Recall': recall_list, 753 | 'F1': f1_list 754 | }) 755 | 756 | return results 757 | 758 | def filter_incorrect(df): 759 | """ 760 | In order to better understand LLM 
performance, this function returns a df for all incorrect
761 | classifications and all classifications with less than 1.0 consistency.
762 | 
763 | df:
764 | Input dataframe (text_to_annotate)
765 | 
766 | Returns:
767 | Dataframe with incorrect or less than 1.0 consistency scores.
768 | """
769 | 
770 | # Boolean mask: True where a consistency score column equals 1.0
771 | consistency_cols = [col for col in df.columns if 'consistency' in col]
772 | consistency_filter = df[consistency_cols] == 1
773 | 
774 | # Boolean mask: True where a correct column equals 1
775 | correct_cols = [col for col in df.columns if 'correct' in col]
776 | correct_filter = df[correct_cols] == 1
777 | 
778 | # Combine filters
779 | combined_filter = pd.concat([consistency_filter, correct_filter], axis=1)
780 | # Filter for any rows where the correct value is not 1 or the consistency is less than 1.0
781 | mask = combined_filter.apply(lambda x: any(val == False for val in x), axis=1)
782 | 
783 | # Apply the filter and return the resulting dataframe
784 | return df[mask]
785 | 
786 | def num_tokens_from_string(string: str, encoding_name: str):
787 | """Returns the number of tokens in a text string."""
788 | encoding = tiktoken.get_encoding(encoding_name)
789 | num_tokens = len(encoding.encode(string))
790 | return num_tokens
791 | 
792 | def estimate_time_cost(text_to_annotate, codebook, llm_query,
793 | model, num_iterations, num_batches, batch_size, col_names):
794 | """
795 | This function estimates the cost and time to run gpt_annotate().
796 | 
797 | text_to_annotate:
798 | Input data that will be annotated.
799 | codebook:
800 | String detailing the task-specific instructions.
801 | llm_query:
802 | Codebook plus the text samples in the batch that will be annotated.
803 | model:
804 | OpenAI GPT model, which is either gpt-3.5-turbo or gpt-4
805 | num_iterations:
806 | Number of iterations in gpt_annotate
807 | num_batches:
808 | number of batches in gpt_annotate
809 | batch_size:
810 | number of text samples in each batch
811 | col_names:
812 | Category names to be annotated
813 | 
814 | Returns:
815 | quit, which is a boolean indicating whether to continue with the annotation process.
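Note: the estimate uses per-token prices hard-coded below for gpt-4 and gpt-3.5-turbo, which may not match OpenAI's current pricing; treat the cost and time figures as rough approximations.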
816 | """ 817 | # input estimate 818 | num_input_tokens = num_tokens_from_string(codebook + llm_query, "cl100k_base") 819 | total_input_tokens = num_input_tokens * num_iterations * num_batches 820 | if model == "gpt-4": 821 | gpt4_prompt_cost = 0.00003 822 | prompt_cost = gpt4_prompt_cost * total_input_tokens 823 | else: 824 | chatgpt_prompt_cost = 0.000002 825 | prompt_cost = chatgpt_prompt_cost * total_input_tokens 826 | 827 | # output estimate 828 | num_categories = len(text_to_annotate.columns)-3 # minus 3 to account for unique_id, text, and llm_query 829 | estimated_output_tokens = 3 + (5 * num_categories) + (3 * batch_size * num_categories) # these estimates are based on token outputs from llm queries 830 | total_output_tokens = estimated_output_tokens * num_iterations * num_batches 831 | if model == "gpt-4": 832 | gpt4_out_cost = 0.00006 833 | output_cost = gpt4_out_cost * total_output_tokens 834 | else: 835 | chatgpt_out_cost = 0.000002 836 | output_cost = chatgpt_out_cost * total_output_tokens 837 | 838 | cost = prompt_cost + output_cost 839 | cost_low = round(cost*0.9,2) 840 | cost_high = round(cost*1.1,2) 841 | 842 | if model == "gpt-4": 843 | time = round(((total_input_tokens + total_output_tokens) * 0.02)/60, 2) 844 | time_low = round(time*0.7,2) 845 | time_high = round(time*1.3,2) 846 | else: 847 | time = round(((total_input_tokens + total_output_tokens) * 0.01)/60, 2) 848 | time_low = round(time*0.7,2) 849 | time_high = round(time*1.3,2) 850 | 851 | quit = False 852 | print("You are about to annotate", len(text_to_annotate), "text samples and the number of iterations is set to", num_iterations) 853 | print("Estimated cost range in US Dollars:", cost_low,"-",cost_high) 854 | print("Estimated minutes to run gpt_annotate():", time_low,"-",time_high) 855 | print("Please note that these are rough estimates.") 856 | print("") 857 | waiting_response = True 858 | while waiting_response: 859 | input_response = input("Would you like to proceed and annotate your data? 
(Options: Y or N) ") 860 | input_response = str(input_response).lower() 861 | if input_response == "y" or input_response == "yes": 862 | waiting_response = False 863 | elif input_response == "n" or input_response == "no": 864 | print("") 865 | print("Exiting gpt_annotate()") 866 | quit = True 867 | waiting_response = False 868 | else: 869 | print("Please input Y or N.") 870 | return quit 871 | 872 | 873 | 874 | -------------------------------------------------------------------------------- /llm_annotate_paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/npangakis/gpt_annotate/fbdfa0f100281c802252572e8e990539994252ed/llm_annotate_paper.pdf -------------------------------------------------------------------------------- /sample_annotation_code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "NRMEai9kN2R7" 7 | }, 8 | "source": [ 9 | "### Set up dependencies" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "id": "8o0GuNYnN4I6" 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "!pip install openai" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "id": "rPXQ25hBN7-_" 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "!pip install tiktoken" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "id": "Go5O5Lt6N9Z1" 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "import openai\n", 43 | "import pandas as pd\n", 44 | "import math\n", 45 | "import time\n", 46 | "import numpy as np\n", 47 | "import tiktoken\n", 48 | "\n", 49 | "#### Import main package: gpt_annotate.py\n", 50 | "# Make sure that the .py file is in the same directory as the .ipynb file, or you provide the correct relative or absolute path to the .py file.\n", 51 | "import gpt_annotate" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "id": "s2SVKJPAOAWM" 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "# don't type the key in this file! \n", 63 | "# create gpt_api.txt, put the key in that, and save\n", 64 | "with open('gpt_api_key.txt', 'r') as f:\n", 65 | " key = f.read().strip()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": { 71 | "id": "xb5K0uyrOlUY" 72 | }, 73 | "source": [ 74 | "### Load in text_to_annotate and codebook" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "id": "Abeu-_5nOvLL" 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "# Load text to annotate\n", 86 | "text_to_annotate = pd.read_csv(\"text_to_classify.csv\")\n", 87 | "# Load codebook\n", 88 | "with open('codebook.txt', 'r') as file:\n", 89 | " codebook = file.read()" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": { 95 | "id": "apc7cy-eO8tF" 96 | }, 97 | "source": [ 98 | "# Annotate your data!" 
99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": { 104 | "id": "a62wf8FzanCf" 105 | }, 106 | "source": [ 107 | "### If you have human labels you want to compare the GPT outputs against" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": { 114 | "id": "Dv7dc1mZPAGR" 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "# Prepare the data for annotation\n", 119 | "text_to_annotate = gpt_annotate.prepare_data(text_to_annotate, codebook, key)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": { 126 | "id": "WzM16wq3ai0D" 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "# Annotate the data (returns 4 outputs)\n", 131 | "gpt_out_all, gpt_out_final, performance, incorrect = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": { 137 | "id": "rd_dIEkxaseH" 138 | }, 139 | "source": [ 140 | "### If only using gpt_annotate for prediction (i.e., no human labels to compare performance)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": { 147 | "id": "yK1hIfCOPX1v" 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "# Prepare the data for annotation\n", 152 | "text_to_annotate = gpt_annotate.prepare_data(text_to_annotate, codebook, key, human_labels = False)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "id": "1h75bYJfbDPT" 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "# Annotate the data (returns 2 outputs)\n", 164 | "gpt_out_all, gpt_out_final = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key, human_labels = False)" 165 | ] 166 | } 167 | ], 168 | "metadata": { 169 | "colab": { 170 | "provenance": [] 171 | }, 172 | "kernelspec": { 173 | "display_name": "Python 3 (ipykernel)", 174 | "language": "python", 175 | "name": "python3" 176 | }, 177 | "language_info": { 178 | "codemirror_mode": { 179 | "name": "ipython", 180 | "version": 3 181 | }, 182 | "file_extension": ".py", 183 | "mimetype": "text/x-python", 184 | "name": "python", 185 | "nbconvert_exporter": "python", 186 | "pygments_lexer": "ipython3", 187 | "version": "3.8.16" 188 | } 189 | }, 190 | "nbformat": 4, 191 | "nbformat_minor": 4 192 | } 193 | --------------------------------------------------------------------------------