├── .gitignore ├── README.md ├── gpt_annotate.py ├── llm_annotate_paper.pdf └── sample_annotation_code.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | gpt_api_key.txt 2 | .ipynb_checkpoints* 3 | .DS_store -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introducing gpt_annotate 2 | An easy-to-use Python package designed to streamline automated text annotation using LLMs for different tasks and datasets. All you need is an OpenAI API key, text samples you want to annotate, and a codebook (i.e., task-specific instructions) for the LLM. 3 | * OpenAI API key 4 | * Sign up for one here: https://platform.openai.com/account/api-keys 5 | * text_to_annotate: 6 | * A dataframe that includes one column for text samples and, if you are comparing the LLM output against humans, any number of one-hot-encoded category columns. The text column should be the first column in your data. We provide Python code (described below) that will automatically assist with the formatting of `text_to_annotate` to ensure accurate annotation. 7 | * codebook: 8 | * Task-specific instructions (as type string) to prompt the LLM to annotate the data. Like codebooks for qualitative content analysis, this should clearly describe the dataset, the type of task for the LLM, and, most importantly, delineate the categories of interest for the LLM to annotate. We provide Python code to standardize the beginning and ending of the codebook to ensure that the LLM understands that the task is annotation. 9 | * For example, the text of `codebook` could be: "You will be classifying text samples. Each text sample is a tweet. Classify each tweet on two dimensions: a) POLITICAL; b) PRESIDENT. For POLITICAL, label as 1 if the tweet is about politics; label as 0 if not. For PRESIDENT, label as 1 if the tweet refers to a past or present president, a candidate for president, or a presidential election; label as 0 if not. Classify the following text samples:" 10 | 11 | To annotate your text data using gpt_annotate, we recommend following the sample code we provide in `sample_annotation_code.ipynb`. 12 | 13 | As shown in `sample_annotation_code.ipynb`, annotating your text data with LLMs is as easy as 4 simple steps: 14 | 1. Import the required dependencies (including gpt_annotate.py). 15 | 16 | ``` 17 | import openai 18 | import pandas as pd 19 | import math 20 | import time 21 | import numpy as np 22 | import tiktoken 23 | #### Import main package: gpt_annotate.py 24 | # Make sure that the .py file is in the same directory as the .ipynb file, or you provide the correct relative or absolute path to the .py file. 25 | import gpt_annotate 26 | ``` 27 | 28 | 2. Read in your codebook (i.e., task-specific instructions) and the text samples you want to annotate. 29 | 30 | ``` 31 | text_to_annotate = pd.read_csv("text_to_annotate.csv") 32 | with open('codebook.txt', 'r') as file: 33 | codebook = file.read() 34 | ``` 35 | 36 | 3. To ensure your data is in the right format, you must first run `gpt_annotate.prepare_data(text_to_annotate, codebook, key)`. If you are annotating text data without any human labels to compare against, change the default to `human_labels = False`. If you want to add standardized language to the beginning and end of your codebook to ensure that GPT will label your text samples, change the default to `prep_codebook = True`. 
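For example, if your text data has no human labels and you want the standardized codebook language added automatically, the call would be:
```
text_to_annotate = gpt_annotate.prepare_data(text_to_annotate, codebook, key, human_labels = False, prep_codebook = True)
```
Or, with the default settings: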
37 | ```
38 | text_to_annotate = gpt_annotate.prepare_data(text_to_annotate, codebook, key)
39 | ```
40 | 4. If comparing LLM output to human labels, run `gpt_annotate.gpt_annotate(text_to_annotate, codebook, key)`. If only using gpt_annotate for prediction (i.e., no human labels to compare performance), run `gpt_annotate.gpt_annotate(text_to_annotate, codebook, key, human_labels = False)`. It’s as easy as that!
41 | ```
42 | # Annotate the data (returns 4 outputs)
43 | gpt_out_all, gpt_out_final, performance, incorrect = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key)
44 | # Annotate the data (without human labels to compare against) (returns 2 outputs)
45 | gpt_out_all, gpt_out_final = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key, human_labels = False)
46 | ```
47 | 
48 | Outputs:
49 | 1) `gpt_out_all`
50 | * Raw outputs for every iteration.
51 | 2) `gpt_out_final`
52 | * Annotation outputs after taking modal category answer and calculating consistency scores.
53 | 3) `performance`
54 | * Accuracy, precision, recall, and f1.
55 | 4) `incorrect`
56 | * Any incorrect classification or classification with less than 1.0 consistency.
57 | 
58 | Below we describe the optional parameters within `gpt_annotate()` that you can use to customize your annotation procedure.
59 | * num_iterations:
60 | * Number of times to classify each text sample. Default is 3.
61 | * model:
62 | * OpenAI GPT model, which is either `gpt-3.5-turbo` or `gpt-4`. Default is `gpt-4`.
63 | * temperature:
64 | * LLM temperature parameter (ranges 0 to 1), which controls the randomness of the model's output; higher values produce more varied annotations across iterations. Default is 0.6.
65 | * batch_size:
66 | * Number of text samples to be annotated in each batch. Default is 10.
67 | * human_labels:
68 | * Boolean indicating whether `text_to_annotate` has human labels to compare LLM outputs to. Default is True.
69 | * data_prep_warning:
70 | * Boolean indicating whether to ask for confirmation that you have already run `prepare_data()` before annotating. Default is True.
71 | * time_cost_warning:
72 | * Boolean indicating whether to print an estimated cost and runtime and ask for confirmation before annotating. Default is True.
73 | 
74 | 
75 | Please email us (njpang@sas.upenn.edu, sam.wolken@asc.upenn.edu) with any suggestions or problems encountered with the code.
76 | 
--------------------------------------------------------------------------------
/gpt_annotate.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | ## Overview
4 | 
5 | Given a codebook (.txt) and a dataset (.csv) that has one text column and any number of category columns as binary indicators, the main function (`gpt_annotate`) annotates
6 | all the samples using an OpenAI GPT model (ChatGPT or GPT-4) and calculates performance metrics (if human labels are provided). Before running `gpt_annotate`,
7 | users should run `prepare_data` to ensure that their data is in the correct format.
8 | 
9 | Flow of `gpt_annotate`:
10 | * 1) Based on a provided codebook, the function uses an OpenAI GPT model to annotate every text sample, repeating this for a user-specified number of iterations.
11 | * 2) The function reduces the annotation output down to the modal annotation category across iterations for each category. At this stage,
12 | the function adds a consistency score for each annotation across iterations.
13 | * 3) If provided human labels, the function determines, for every category, whether the annotation is correct (by comparing to the human label),
14 | then also adds whether it is a true positive, false positive, true negative, or false negative.
15 | * 4) Finally, if provided human labels, the function calculates performance metrics (accuracy, precision, recall, and f1) for every category. 16 | 17 | The main function (`gpt_annotate`) returns four .csv's, if human labels are provided. 18 | If no human labels are provided, the main function only returns 1 and 2 listed below. 19 | * 1) `gpt_out_all_iterations.csv` 20 | * Raw outputs for every iteration. 21 | * 2) `gpt_out_final.csv` 22 | * Annotation outputs after taking modal category answer and calculating consistency scores. 23 | * 3) `performance_metrics.csv` 24 | * Accuracy, precision, recall, and f1. 25 | * 4) `incorrect.csv` 26 | * Any incorrect classification or classification with less than 1.0 consistency. 27 | 28 | Our code aims to streamline automated text annotation for different datasets and numbers of categories. 29 | 30 | """ 31 | 32 | import subprocess 33 | import sys 34 | 35 | def install(package): 36 | subprocess.check_call([sys.executable, "-m", "pip", "install", package]) 37 | 38 | install("openai") 39 | install("pandas") 40 | install("numpy") 41 | install("tiktoken") 42 | 43 | import openai 44 | import pandas as pd 45 | import math 46 | import time 47 | import numpy as np 48 | import tiktoken 49 | from openai import OpenAI 50 | import os 51 | 52 | 53 | def prepare_data(text_to_annotate, codebook, key, 54 | prep_codebook = False, human_labels = True, no_print_preview = False): 55 | 56 | """ 57 | This function ensures that the data is in the correct format for LLM annotation. 58 | If the data fails any of the requirements, returns the original input dataframe. 59 | 60 | text_to_annotate: 61 | Data that will be prepared for analysis. Should include a column with text to annotate, and, if human_labels = True, the human labels. 62 | codebook: 63 | String detailing the task-specific instructions. 64 | key: 65 | OpenAI API key 66 | prep_codebook: 67 | boolean indicating whether to standardize beginning and end of codebook to ensure that the LLM prompt is annotating text samples. 68 | human_labels: 69 | boolean indicating whether text_to_annotate has human labels to compare LLM outputs to. 70 | no_print_preview: 71 | Does not print preview of user's data after preparing. 72 | 73 | Returns: 74 | Updated dataframe (text_to_annotate) and codebook (if prep_codebook = True) that are ready to be used for annotation using gpt_annotate. 
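Note: unless no_print_preview = True, the function prints the annotation categories it detected and asks you to confirm them interactively before returning the prepared dataframe.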
75 | """ 76 | 77 | # Check if text_to_annotate is a dataframe 78 | if not isinstance(text_to_annotate, pd.DataFrame): 79 | print("Error: text_to_annotate must be pd.DataFrame.") 80 | return text_to_annotate 81 | 82 | # Make copy of input data 83 | original_df = text_to_annotate.copy() 84 | 85 | # set OpenAI key 86 | openai.api_key = key 87 | 88 | # Standardize beginning and end of codebook to ensure that the LLM prompt is annotating text samples 89 | if prep_codebook == True: 90 | codebook = prepare_codebook(codebook) 91 | 92 | # Add unique_id column to text_to_annotate 93 | text_to_annotate = text_to_annotate \ 94 | .reset_index() \ 95 | .rename(columns={'index':'unique_id'}) 96 | 97 | ##### Minor Cleaning 98 | # Drop any Unnamed columns 99 | if any('Unnamed' in col for col in text_to_annotate.columns): 100 | text_to_annotate = text_to_annotate.drop(text_to_annotate.filter(like='Unnamed').columns, axis=1) 101 | # Drop any NA values 102 | text_to_annotate = text_to_annotate.dropna() 103 | 104 | ########## Confirming data is in correct format 105 | ##### 1) Check whether second column is string 106 | # rename second column to be 'text' 107 | text_to_annotate.columns.values[1] = 'text' 108 | # check whether second column is string 109 | if not (text_to_annotate.iloc[:, 1].dtype == 'string' or text_to_annotate.iloc[:, 1].dtype == 'object'): 110 | print("ERROR: Second column should be the text that you want to annotate.") 111 | print("") 112 | print("Your data:") 113 | print(text_to_annotate.head()) 114 | print("") 115 | print("Sample data format:") 116 | error_message(human_labels) 117 | return original_df 118 | 119 | ##### 2) If human_labels == False, there should only be 2 columns 120 | if human_labels == False and len(text_to_annotate.columns) != 2: 121 | print("ERROR: You have set human_labels = False, which means you should only have two columns in your data.") 122 | print("") 123 | print("Your data:") 124 | print(text_to_annotate.head()) 125 | print("") 126 | print("Sample data format:") 127 | error_message(human_labels) 128 | return original_df 129 | 130 | ##### 3) If human_labels == True, there should be more than 2 columns 131 | if human_labels == True and len(text_to_annotate.columns) < 3: 132 | print("ERROR: You have set human_labels = True (default value), which means you should have more than 2 columns in your data.") 133 | print("") 134 | print("Your data:") 135 | print(text_to_annotate.head()) 136 | print("") 137 | print("Sample data format:") 138 | error_message(human_labels) 139 | return original_df 140 | 141 | 142 | ##### 4) If human_labels == True, check if columns 3 through end of df are binary indicators 143 | if human_labels == True: 144 | # iterate through each column (after unique_id and text) to confirm that the values are only 0's and 1's 145 | for col in range(2, len(text_to_annotate.columns)): 146 | if set(text_to_annotate.iloc[:,col].unique()) != {0, 1}: 147 | print("ERROR: All category columns must be one-hot encoded (i.e., represent numeric or categorical variables as binary vectors of type integer).") 148 | print("Make sure that all category columns only contain 1's and 0's.") 149 | print("") 150 | print("Your data:") 151 | print(text_to_annotate.head()) 152 | print("") 153 | print("Sample data format:") 154 | error_message(human_labels) 155 | return original_df 156 | 157 | ##### 5) Add llm_query column that includes a unique ID identifier per text sample 158 | text_to_annotate['llm_query'] = text_to_annotate.apply(lambda x: str(x['unique_id']) + " " + 
str(x['text']) + "\n", axis=1) 159 | 160 | ##### 6) Make sure category names in codebook exactly match category names in text_to_annotate 161 | # extract category names from codebook 162 | if human_labels: 163 | col_names_codebook = get_classification_categories(codebook, key) 164 | # get category names in text_to_annotate 165 | df_cols = text_to_annotate.columns.values.tolist() 166 | # remove 'unique_id', 'text' and 'llm_query' from columns 167 | col_names = [col for col in df_cols if col not in ['unique_id','text', 'llm_query']] 168 | ### Check whether categories are the same in codebook and text_to_annotate 169 | if [col for col in col_names] != col_names_codebook: 170 | print("ERROR: Column names in codebook and text_to_annotate do not match exactly. Please note that order and capitalization matters.") 171 | print("Change order/spelling in codebook or text_to_annotate.") 172 | print("") 173 | print("Exact order and spelling of category names in text_to_annotate: ", col_names) 174 | print("Exact order and spelling of category names in codebook: ", col_names_codebook) 175 | return original_df 176 | else: 177 | col_names = get_classification_categories(codebook, key) 178 | 179 | ##### Confirm correct categories with user 180 | # Print annotation categories 181 | print("") 182 | print("Categories to annotate:") 183 | for index, item in enumerate(col_names, start=1): 184 | print(f"{index}) {item}") 185 | print("") 186 | if no_print_preview == False: 187 | waiting_response = True 188 | while waiting_response: 189 | # Confirm annotation categories 190 | input_response = input("Above are the categories you are annotating. Is this correct? (Options: Y or N) ") 191 | input_response = str(input_response).lower() 192 | if input_response == "y" or input_response == "yes": 193 | print("") 194 | print("Data is ready to be annotated using gpt_annotate()!") 195 | print("") 196 | print("Glimpse of your data:") 197 | # print preview of data 198 | print("Shape of data: ", text_to_annotate.shape) 199 | print(text_to_annotate.head()) 200 | return text_to_annotate 201 | elif input_response == "n" or input_response == "no": 202 | print("") 203 | print("Adjust your codebook to clearly indicate the names of the categories you would like to annotate.") 204 | return original_df 205 | else: 206 | print("Please input Y or N.") 207 | else: 208 | return text_to_annotate 209 | 210 | 211 | def gpt_annotate(text_to_annotate, codebook, key, 212 | num_iterations = 3, model = "gpt-4", temperature = 0.6, batch_size = 10, 213 | human_labels = True, data_prep_warning = True, 214 | time_cost_warning = True): 215 | """ 216 | Loop over the text_to_annotate rows in batches and classify each text sample in each batch for multiple iterations. 217 | Store outputs in a csv. Function is calculated in batches in case of crash. 218 | 219 | text_to_annotate: 220 | Input data that will be annotated. 221 | codebook: 222 | String detailing the task-specific instructions. 223 | key: 224 | OpenAI API key. 225 | num_iterations: 226 | Number of times to classify each text sample. 227 | model: 228 | OpenAI GPT model, which is either gpt-3.5-turbo or gpt-4 229 | temperature: 230 | LLM temperature parameter (ranges 0 to 1), which indicates the degree of diversity to introduce into the model. 231 | batch_size: 232 | number of text samples to be annotated in each batch. 233 | human_labels: 234 | boolean indicating whether text_to_annotate has human labels to compare LLM outputs to. 
235 | data_prep_warning: 236 | boolean indicating whether to print data_prep_warning 237 | time_cost_warning: 238 | boolean indicating whether to print time_cost_warning 239 | 240 | 241 | 242 | Returns: 243 | gpt_annotate returns the four .csv's below, if human labels are provided. If no human labels are provided, 244 | gpt_annotate only returns gpt_out_all_iterations.csv and gpt_out_final.csv 245 | 246 | 1) `gpt_out_all_iterations.csv` 247 | Raw outputs for every iteration. 248 | 2) `gpt_out_final.csv` 249 | Annotation outputs after taking modal category answer and calculating consistency scores. 250 | 3) `performance_metrics.csv` 251 | Accuracy, precision, recall, and f1. 252 | 4) `incorrect.csv` 253 | Any incorrect classification or classification with less than 1.0 consistency. 254 | 255 | """ 256 | 257 | 258 | from openai import OpenAI 259 | 260 | client = OpenAI( 261 | api_key=key, 262 | ) 263 | 264 | OpenAI.api_key = os.getenv(key) 265 | 266 | # set OpenAI key 267 | openai.api_key = key 268 | 269 | # Double check that user has confirmed format of the data 270 | if data_prep_warning: 271 | waiting_response = True 272 | while waiting_response: 273 | input_response = input("Have you successfully run prepare_data() to ensure text_to_annotate is in correct format? (Options: Y or N) ") 274 | input_response = str(input_response).lower() 275 | if input_response == "y" or input_response == "yes": 276 | # If user has run prepare_data(), confirm that data is in correct format 277 | # first, check if first column is title "unique_id" 278 | if text_to_annotate.columns[0] == 'unique_id' and text_to_annotate.columns[-1] == 'llm_query': 279 | try: 280 | text_to_annotate_copy = text_to_annotate.iloc[:, 1:-1] 281 | except pd.core.indexing.IndexingError: 282 | print("") 283 | print("ERROR: Run prepare_data(text_to_annotate, codebook, key) before running gpt_annotate(text_to_annotate, codebook, key).") 284 | text_to_annotate_copy = prepare_data(text_to_annotate_copy, codebook, key, prep_codebook = False, human_labels = human_labels, no_print_preview = True) 285 | # if there was an error, exit gpt_annotate 286 | if text_to_annotate_copy.columns[0] != "unique_id": 287 | if human_labels: 288 | return None, None, None, None 289 | elif human_labels == False: 290 | return None, None 291 | else: 292 | print("ERROR: First column should be title 'unique_id' and last column should be titled 'llm_query'") 293 | print("Try running prepare_data() again") 294 | print("") 295 | print("Your data:") 296 | print(text_to_annotate.head()) 297 | print("") 298 | print("Sample data format:") 299 | if human_labels: 300 | return None, None, None, None 301 | elif human_labels == False: 302 | return None, None 303 | waiting_response = False 304 | elif input_response == "n" or input_response == "no": 305 | print("") 306 | print("Run prepare_data(text_to_annotate, codebook, key) before running gpt_annotate(text_to_annotate, codebook, key).") 307 | if human_labels: 308 | return None, None, None, None 309 | elif human_labels == False: 310 | return None, None 311 | else: 312 | print("Please input Y or N.") 313 | 314 | # df to store results 315 | out = pd.DataFrame() 316 | 317 | # Determine number of batches 318 | num_rows = len(text_to_annotate) 319 | num_batches = math.floor(num_rows/batch_size) 320 | num_iterations = num_iterations 321 | 322 | # Add categories to classify 323 | col_names = ["unique_id"] + text_to_annotate.columns.values.tolist()[2:-1] 324 | if human_labels == False: 325 | col_names = 
get_classification_categories(codebook, key) 326 | col_names = ["unique_id"] + col_names 327 | 328 | ### Nested for loop for main function 329 | # Iterate over number of classification iterations 330 | for j in range(num_iterations): 331 | # Iterate over number of batches 332 | for i in range(num_batches): 333 | 334 | # Based on batch, determine starting row and end row 335 | start_row = i*batch_size 336 | end_row = (i+1)*batch_size 337 | 338 | # Extract the text samples to annotate 339 | llm_query = text_to_annotate['llm_query'][start_row:end_row].str.cat(sep=' ') 340 | 341 | # Start while loop in case GPT fails to annotate a batch 342 | need_response = True 343 | while need_response: 344 | fails = 0 345 | # confirm time and cost with user before annotating data 346 | if fails == 0 and j == 0 and i == 0 and time_cost_warning: 347 | quit = estimate_time_cost(text_to_annotate, codebook, llm_query, model, num_iterations, num_batches, batch_size, col_names[1:]) 348 | if quit and human_labels: 349 | return None, None, None, None 350 | elif quit and human_labels == False: 351 | return None, None 352 | # if GPT fails to annotate a batch 3 times, skip the batch 353 | while(fails < 3): 354 | try: 355 | # Set temperature 356 | temperature = temperature 357 | # annotate the data by prompting GPT 358 | response = get_response(codebook, llm_query, model, temperature, key) 359 | # parse GPT's response into a clean dataframe 360 | text_df_out = parse_text(response, col_names) 361 | break 362 | except: 363 | fails += 1 364 | pass 365 | if (',' in response.choices[0].message.content or '|' in response.choices[0].message.content): 366 | need_response = False 367 | 368 | # update iteration 369 | text_df_out['iteration'] = j+1 370 | 371 | # add iteration annotation results to output df 372 | out = pd.concat([out, text_df_out]) 373 | time.sleep(.5) 374 | # print status report 375 | print("iteration: ", j+1, "completed") 376 | 377 | # Convert unique_id col to numeric 378 | out['unique_id'] = pd.to_numeric(out['unique_id']) 379 | 380 | # Combine input df (i.e., df with text column and true category labels) 381 | out_all = pd.merge(text_to_annotate, out, how="inner", on="unique_id") 382 | 383 | # replace any NA values with 0's 384 | out_all.replace('', np.nan, inplace=True) 385 | out_all.replace('-', np.nan, inplace=True) 386 | out_all.fillna(0, inplace=True) 387 | 388 | ##### output 1: full annotation results 389 | out_all.to_csv('gpt_out_all_iterations.csv',index=False) 390 | 391 | # calculate modal label and consistency score 392 | out_mode = get_mode_and_consistency(out_all, col_names,num_iterations,human_labels) 393 | 394 | if human_labels == True: 395 | # evaluate classification per category 396 | num_categories = len(col_names) - 1 # account for unique_id 397 | for label in range(0, num_categories): 398 | out_final = evaluate_classification(out_mode, label, num_categories) 399 | 400 | ##### output 2: final annotation results with modal label and consistency score 401 | out_final.to_csv('gpt_out_final.csv',index=False) 402 | 403 | # calculate performance metrics 404 | performance = performance_metrics(col_names, out_final) 405 | 406 | ##### output 3: performance metrics 407 | performance.to_csv('performance_metrics.csv',index=False) 408 | 409 | # Determine incorrect classifications and classifications with less than 1.0 consistency 410 | incorrect = filter_incorrect(out_final) 411 | 412 | ##### output 4: Incorrect classifications 413 | incorrect.to_csv('incorrect_sub.csv',index=False) 414 | return out_all, 
out_final, performance, incorrect 415 | 416 | else: 417 | # if not assessing performance against human annotators, then only save out_mode 418 | out_final = out_mode.copy() 419 | out_final.to_csv('gpt_out_final.csv',index=False) 420 | return out_all, out_final 421 | 422 | ########### Helper Functions 423 | 424 | def prepare_codebook(codebook): 425 | """ 426 | Standardize beginning and end of codebook to ensure that the LLM prompt is annotating text samples. 427 | 428 | codebook: 429 | String detailing the task-specific instructions. 430 | 431 | Returns: 432 | Updated codebook ready for annotation. 433 | """ 434 | beginning = "Use this codebook for text classification. Return your classifications in a table with one column for text number (the number preceding each text sample) and a column for each label. Use a csv format. " 435 | end = " Classify the following text samples:" 436 | return beginning + codebook + end 437 | 438 | def error_message(human_labels = True): 439 | """ 440 | Prints sample data format if error. 441 | 442 | human_labels: 443 | boolean indicating whether text_to_annotate has human labels to compare LLM outputs to. 444 | """ 445 | if human_labels == True: 446 | toy_data = { 447 | 'unique_id': [0, 1, 2, 3, 4], 448 | 'text': ['sample text to annotate', 'sample text to annotate', 'sample text to annotate', 'sample text to annotate', 'sample text to annotate'], 449 | 'category_1': [1, 0, 1, 0, 1], 450 | 'category_2': [0, 1, 1, 0, 1] 451 | } 452 | toy_data = pd.DataFrame(toy_data) 453 | print(toy_data) 454 | else: 455 | toy_data = { 456 | 'unique_id': [0, 1, 2, 3, 4], 457 | 'text': ['sample text to annotate', 'sample text to annotate', 'sample text to annotate', 'sample text to annotate', 'sample text to annotate'], 458 | } 459 | toy_data = pd.DataFrame(toy_data) 460 | print(toy_data) 461 | 462 | def get_response(codebook, llm_query, model, temperature, key): 463 | """ 464 | Function to query OpenAI's API and get an LLM response. 465 | 466 | Codebook: 467 | String detailing the task-specific instructions 468 | llm_query: 469 | The text samples to append to the task-specific instructions 470 | Model: 471 | gpt-3.5-turbo (Chat-GPT) or GPT-4 472 | Temperature: 473 | LLM temperature parameter (ranges 0 to 1) 474 | 475 | Returns: 476 | LLM output. 477 | """ 478 | 479 | from openai import OpenAI 480 | 481 | client = OpenAI( 482 | api_key=key, 483 | ) 484 | 485 | OpenAI.api_key = os.getenv(key) 486 | 487 | # max tokens for gpt-3.5-turbo is 4096 and max for gpt-4 is 8000 488 | if model == "gpt-3.5-turbo": 489 | max_tokens = 500 490 | elif model == "gpt-4": 491 | max_tokens = 1000 492 | 493 | # Create function to llm_query GPT 494 | # Create function to llm_query GPT 495 | response = client.chat.completions.create( 496 | model=model, # chatgpt: gpt-3.5-turbo # gpt-4: gpt-4 497 | messages=[ 498 | {"role": "user", "content": codebook + llm_query}], 499 | temperature=temperature, # ChatGPT default is 0.7 (set lower reduce variation across queries) 500 | max_tokens = max_tokens, 501 | top_p=1.0, 502 | frequency_penalty=0.0, 503 | presence_penalty=0.0 504 | ) 505 | return response 506 | 507 | def get_classification_categories(codebook, key): 508 | """ 509 | Function that extracts what GPT will label each annotation category to ensure a match with text_to_annotate. 510 | Order and exact spelling matter. Main function will not work if these do not match perfectly. 
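The category names are extracted by querying GPT-4 once with temperature 0 and asking it to return only the names of the categories, in order.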
511 | 512 | Codebook: 513 | String detailing the task-specific instructions 514 | 515 | Returns: 516 | Categories to be annotated, as specified in the codebook. 517 | """ 518 | 519 | # llm_query to ask GPT for categories from codebook 520 | instructions = "Below I am going to provide you with two sets of instructions (Part 1 and Part 2). The first part will contain detailed instructions for a text classification project. The second part will ask you to identify information about part 1. Prioritize the instructions from Part 2. Part 1:" 521 | llm_query = "Part 2: I've provided a codebook in the previous sentences. Please print the categories in the order you will classify them. Ignore every other task that I described in the codebook. I only want to know the categories. Do not include any text numbers or any annotations in your response. Do not include any language like 'The categories to be identified are:'. Only include the names of the categories you are identifying. : " 522 | 523 | # Set temperature to 0 to make model deterministic 524 | temperature = 0 525 | # set model to GPT-4 526 | model = "gpt-4" 527 | 528 | 529 | from openai import OpenAI 530 | 531 | client = OpenAI( 532 | api_key=key, 533 | ) 534 | 535 | OpenAI.api_key = os.getenv(key) 536 | 537 | ### Get GPT response and clean response 538 | response = get_response(codebook, llm_query, model, temperature, key) 539 | text = response.choices[0].message.content 540 | text_split = text.split('\n') 541 | text_out = text_split[0] 542 | # text_out_list is final output of categories as a list 543 | codebook_columns = text_out.split(', ') 544 | 545 | return codebook_columns 546 | 547 | def parse_text(response, headers): 548 | """ 549 | This function converts GPT's output to a dataframe. GPT sometimes returns the output in different formats. 550 | Because there is variability in GPT outputs, this function handles different variations in possible outputs. 551 | 552 | response: 553 | LLM output 554 | headers: 555 | column names for text_to_annotate dataframe 556 | 557 | Returns: 558 | GPT output as a cleaned dataframe. 559 | 560 | """ 561 | try: 562 | text = response.choices[0].message.content 563 | text_split = text.split('\n') 564 | 565 | if any(':' in element for element in text_split): 566 | text_split_split = [item.split(":") for item in text_split] 567 | 568 | if ',' in text: 569 | text_split_out = [row for row in text_split if (',') in row] 570 | text_split_split = [text.split(',') for text in text_split_out] 571 | if '|' in text: 572 | text_split_out = [row for row in text_split if ('|') in row] 573 | text_split_split = [text.split('|') for text in text_split_out] 574 | 575 | for row in text_split_split: 576 | if '' in row: 577 | row.remove('') 578 | if '' in row: 579 | row.remove('') 580 | if ' ' in row: 581 | row.remove(' ') 582 | 583 | text_df = pd.DataFrame(text_split_split) 584 | text_df_out = pd.DataFrame(text_df.values[1:], columns=headers) 585 | text_df_out = text_df_out[pd.to_numeric(text_df_out.iloc[:,1], errors='coerce').notnull()] 586 | except Exception as e: 587 | print("ERROR: GPT output not in specified categories. 
Make your codebook clearer to indicate what the output format should be.") 588 | print("Try running prepare_data(text_to_annotate, codebook, key, prep_codebook = True") 589 | print("") 590 | return text_df_out 591 | 592 | def get_mode_and_consistency(df, col_names, num_iterations, human_labels): 593 | """ 594 | This function calculates the modal label across iterations and calculates the 595 | LLMs consistency score across label annotations. 596 | 597 | df: 598 | Input dataframe (text_to_annotate) 599 | col_names: 600 | Category names to be annotated 601 | num_iterations: 602 | number of iterations in gpt_annotate 603 | 604 | Returns: 605 | Modal label across iterations and consistency score for every text annotation. 606 | 607 | """ 608 | 609 | # Drop unique_id column in list of category names 610 | categories_names = col_names[1:] 611 | # Change names to add a 'y' at end to match the output df, if human_labels == True 612 | if human_labels == True: 613 | categories_names = [name + "_y" for name in categories_names] 614 | 615 | ##### Calculate modal label classification across iterations 616 | # Convert dataframe to numeric 617 | df_numeric = df.apply(pd.to_numeric, errors='coerce') 618 | # Group by unique_id 619 | grouped = df_numeric.groupby('unique_id') 620 | # Calculate modal label classification per unique id (.iloc[0] means take the first value if there are ties) 621 | modal_values = grouped[categories_names].apply(lambda x: x.mode().iloc[0]) 622 | # Create consistency score by calculating the number of times the modal answer appears per iteration 623 | consistency = grouped[categories_names].apply(lambda x: (x.mode().iloc[0] == x).sum()/num_iterations) 624 | 625 | ##### Data cleaning for new dfs related to consistency scores 626 | # add the string 'consistency_' to the column names of the consistency df 627 | consistency.columns = ['consistency_' + col for col in consistency.columns] 628 | # drop '_y' string from each column name of the consistency df, if human_labels == True 629 | if human_labels == True: 630 | consistency.columns = consistency.columns.str.replace(r'_y$', '',regex=True) 631 | # combine the modal label classification to the consistency score 632 | df_combined = pd.concat([modal_values, consistency], axis=1) 633 | # reset index 634 | df_combined = df_combined.reset_index() 635 | # in the modal label column name, replace '_y' with '_pred' 636 | if human_labels == True: 637 | df_combined.columns = [col.replace('_y', '_pred') if '_y' in col else col for col in df_combined.columns] 638 | # drop first column 639 | df_combined = df_combined.drop(df_combined.columns[0], axis=1) 640 | 641 | ##### Data cleaning for the input df (combine with ) 642 | # drop duplicates 643 | df_new = df.drop_duplicates(subset=['unique_id']) 644 | # replace '_x' with '_pred' 645 | if human_labels == True: 646 | df_new.columns = [col.replace('_x', '_true') if '_x' in col else col for col in df_new.columns] 647 | # add 2 to the length of the categories (to account for unique id and text columns) 648 | length_of_col = 2 + len(categories_names) 649 | # clean up columns included in df 650 | df_new = df_new.iloc[:,0:length_of_col] 651 | df_new = df_new.reset_index() 652 | df_new = df_new.drop(df_new.columns[0], axis=1) 653 | else: 654 | first_two_columns = df_new.iloc[:, 0:2] 655 | df_new = pd.DataFrame(first_two_columns) 656 | df_new = df_new.reset_index() 657 | 658 | # combine into final df 659 | out = pd.concat([df_new, df_combined], axis=1) 660 | 661 | 662 | return out 663 | 664 | def 
evaluate_classification(df, category, num_categories): 665 | """ 666 | Determines whether the classification is correct, then also adds whether it is a 667 | true positive, false positive, true negative, or false negative 668 | 669 | df: 670 | Input dataframe (text_to_annotate) 671 | Category: 672 | Category names to be annotated 673 | num_categories: 674 | total number of annotation categories 675 | 676 | Returns: 677 | Added columns to input dataframe specifying whether the GPT annotation category is correct and whether it is a tp, fp, tn, or fn 678 | """ 679 | # account for indexing starting at 0 680 | category = category + 1 681 | # specify category 682 | correct = "correct" + "_" + str(category) 683 | tp = "tp" + "_" + str(category) 684 | tn = "tn" + "_" + str(category) 685 | fp = "fp" + "_" + str(category) 686 | fn = "fn" + "_" + str(category) 687 | # account for text col and unique id (but already added one to account for zero index start) 688 | category = category + 1 689 | 690 | # evaluate classification 691 | df[correct] = (df.iloc[:, category] == df.iloc[:, category+num_categories].astype(int)).astype(int) 692 | df[tp] = ((df.iloc[:, category] == 1) & (df.iloc[:, category+num_categories].astype(int) == 1)).astype(int) 693 | df[tn] = ((df.iloc[:, category] == 0) & (df.iloc[:, category+num_categories].astype(int) == 0)).astype(int) 694 | df[fp] = ((df.iloc[:, category] == 0) & (df.iloc[:, category+num_categories].astype(int) == 1)).astype(int) 695 | df[fn] = ((df.iloc[:, category] == 1) & (df.iloc[:, category+num_categories].astype(int) == 0)).astype(int) 696 | 697 | return df 698 | 699 | def performance_metrics(col_names, df): 700 | """ 701 | Calculates performance metrics (accuracy, precision, recall, and f1) for every category. 702 | 703 | col_names: 704 | Category names to be annotated 705 | df: 706 | Input dataframe (text_to_annotate) 707 | 708 | Returns: 709 | Dataframe with performance metrics 710 | """ 711 | 712 | # Initialize lists to store the metrics 713 | categories_names = col_names[1:] 714 | categories = [index for index, string in enumerate(categories_names)] 715 | metrics = ['accuracy', 'precision', 'recall', 'f1'] 716 | accuracy_list = [] 717 | precision_list = [] 718 | recall_list = [] 719 | f1_list = [] 720 | 721 | # Calculate the metrics for each category and store them in the lists 722 | for cat in categories: 723 | tp = df['tp_' + str(cat+1)].sum() 724 | tn = df['tn_' + str(cat+1)].sum() 725 | fp = df['fp_' + str(cat+1)].sum() 726 | fn = df['fn_' + str(cat+1)].sum() 727 | 728 | accuracy = (tp + tn) / (tp + tn + fp + fn) 729 | if tp + fp == 0: # include to account for undefined denominator 730 | precision = 0 731 | else: 732 | precision = tp / (tp + fp) 733 | if tp + fn == 0: # include to account for undefined denominator 734 | recall = 0 735 | else: 736 | recall = tp / (tp + fn) 737 | if precision + recall == 0: # include to account for undefined denominator 738 | f1 = 0 739 | else: 740 | f1 = (2 * precision * recall) / (precision + recall) 741 | # append metrics 742 | accuracy_list.append(accuracy) 743 | precision_list.append(precision) 744 | recall_list.append(recall) 745 | f1_list.append(f1) 746 | 747 | # Create a dataframe to store the results 748 | results = pd.DataFrame({ 749 | 'Category': categories_names, 750 | 'Accuracy': accuracy_list, 751 | 'Precision': precision_list, 752 | 'Recall': recall_list, 753 | 'F1': f1_list 754 | }) 755 | 756 | return results 757 | 758 | def filter_incorrect(df): 759 | """ 760 | In order to better understand LLM 
performance, this function returns a df for all incorrect
761 | classifications and all classifications with less than 1.0 consistency.
762 | 
763 | df:
764 | Input dataframe (text_to_annotate)
765 | 
766 | Returns:
767 | Dataframe with incorrect or less than 1.0 consistency scores.
768 | """
769 | 
770 | # Boolean mask: True where a consistency score column equals 1.0
771 | consistency_cols = [col for col in df.columns if 'consistency' in col]
772 | consistency_filter = df[consistency_cols] == 1
773 | 
774 | # Boolean mask: True where a correct column equals 1
775 | correct_cols = [col for col in df.columns if 'correct' in col]
776 | correct_filter = df[correct_cols] == 1
777 | 
778 | # Combine filters
779 | combined_filter = pd.concat([consistency_filter, correct_filter], axis=1)
780 | # Filter for any rows where the correct value is not 1 or the consistency is less than 1.0
781 | mask = combined_filter.apply(lambda x: any(val == False for val in x), axis=1)
782 | 
783 | # Apply the filter and return the resulting dataframe
784 | return df[mask]
785 | 
786 | def num_tokens_from_string(string: str, encoding_name: str):
787 | """Returns the number of tokens in a text string."""
788 | encoding = tiktoken.get_encoding(encoding_name)
789 | num_tokens = len(encoding.encode(string))
790 | return num_tokens
791 | 
792 | def estimate_time_cost(text_to_annotate, codebook, llm_query,
793 | model, num_iterations, num_batches, batch_size, col_names):
794 | """
795 | This function estimates the cost and time to run gpt_annotate().
796 | 
797 | text_to_annotate:
798 | Input data that will be annotated.
799 | codebook:
800 | String detailing the task-specific instructions.
801 | llm_query:
802 | Codebook plus the text samples in the batch that will be annotated.
803 | model:
804 | OpenAI GPT model, which is either gpt-3.5-turbo or gpt-4
805 | num_iterations:
806 | Number of iterations in gpt_annotate
807 | num_batches:
808 | number of batches in gpt_annotate
809 | batch_size:
810 | number of text samples in each batch
811 | col_names:
812 | Category names to be annotated
813 | 
814 | Returns:
815 | quit, which is a boolean indicating whether to continue with the annotation process.
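Note: the estimate uses per-token prices hard-coded below for gpt-4 and gpt-3.5-turbo, which may not match OpenAI's current pricing; treat the cost and time figures as rough approximations.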
816 | """ 817 | # input estimate 818 | num_input_tokens = num_tokens_from_string(codebook + llm_query, "cl100k_base") 819 | total_input_tokens = num_input_tokens * num_iterations * num_batches 820 | if model == "gpt-4": 821 | gpt4_prompt_cost = 0.00003 822 | prompt_cost = gpt4_prompt_cost * total_input_tokens 823 | else: 824 | chatgpt_prompt_cost = 0.000002 825 | prompt_cost = chatgpt_prompt_cost * total_input_tokens 826 | 827 | # output estimate 828 | num_categories = len(text_to_annotate.columns)-3 # minus 3 to account for unique_id, text, and llm_query 829 | estimated_output_tokens = 3 + (5 * num_categories) + (3 * batch_size * num_categories) # these estimates are based on token outputs from llm queries 830 | total_output_tokens = estimated_output_tokens * num_iterations * num_batches 831 | if model == "gpt-4": 832 | gpt4_out_cost = 0.00006 833 | output_cost = gpt4_out_cost * total_output_tokens 834 | else: 835 | chatgpt_out_cost = 0.000002 836 | output_cost = chatgpt_out_cost * total_output_tokens 837 | 838 | cost = prompt_cost + output_cost 839 | cost_low = round(cost*0.9,2) 840 | cost_high = round(cost*1.1,2) 841 | 842 | if model == "gpt-4": 843 | time = round(((total_input_tokens + total_output_tokens) * 0.02)/60, 2) 844 | time_low = round(time*0.7,2) 845 | time_high = round(time*1.3,2) 846 | else: 847 | time = round(((total_input_tokens + total_output_tokens) * 0.01)/60, 2) 848 | time_low = round(time*0.7,2) 849 | time_high = round(time*1.3,2) 850 | 851 | quit = False 852 | print("You are about to annotate", len(text_to_annotate), "text samples and the number of iterations is set to", num_iterations) 853 | print("Estimated cost range in US Dollars:", cost_low,"-",cost_high) 854 | print("Estimated minutes to run gpt_annotate():", time_low,"-",time_high) 855 | print("Please note that these are rough estimates.") 856 | print("") 857 | waiting_response = True 858 | while waiting_response: 859 | input_response = input("Would you like to proceed and annotate your data? 
(Options: Y or N) ") 860 | input_response = str(input_response).lower() 861 | if input_response == "y" or input_response == "yes": 862 | waiting_response = False 863 | elif input_response == "n" or input_response == "no": 864 | print("") 865 | print("Exiting gpt_annotate()") 866 | quit = True 867 | waiting_response = False 868 | else: 869 | print("Please input Y or N.") 870 | return quit 871 | 872 | 873 | 874 | -------------------------------------------------------------------------------- /llm_annotate_paper.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/npangakis/gpt_annotate/fbdfa0f100281c802252572e8e990539994252ed/llm_annotate_paper.pdf -------------------------------------------------------------------------------- /sample_annotation_code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "NRMEai9kN2R7" 7 | }, 8 | "source": [ 9 | "### Set up dependencies" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "id": "8o0GuNYnN4I6" 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "!pip install openai" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "id": "rPXQ25hBN7-_" 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "!pip install tiktoken" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "id": "Go5O5Lt6N9Z1" 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "import openai\n", 43 | "import pandas as pd\n", 44 | "import math\n", 45 | "import time\n", 46 | "import numpy as np\n", 47 | "import tiktoken\n", 48 | "\n", 49 | "#### Import main package: gpt_annotate.py\n", 50 | "# Make sure that the .py file is in the same directory as the .ipynb file, or you provide the correct relative or absolute path to the .py file.\n", 51 | "import gpt_annotate" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "id": "s2SVKJPAOAWM" 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "# don't type the key in this file! \n", 63 | "# create gpt_api.txt, put the key in that, and save\n", 64 | "with open('gpt_api_key.txt', 'r') as f:\n", 65 | " key = f.read().strip()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": { 71 | "id": "xb5K0uyrOlUY" 72 | }, 73 | "source": [ 74 | "### Load in text_to_annotate and codebook" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "id": "Abeu-_5nOvLL" 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "# Load text to annotate\n", 86 | "text_to_annotate = pd.read_csv(\"text_to_classify.csv\")\n", 87 | "# Load codebook\n", 88 | "with open('codebook.txt', 'r') as file:\n", 89 | " codebook = file.read()" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": { 95 | "id": "apc7cy-eO8tF" 96 | }, 97 | "source": [ 98 | "# Annotate your data!" 
99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": { 104 | "id": "a62wf8FzanCf" 105 | }, 106 | "source": [ 107 | "### If you have human labels you want to compare the GPT outputs against" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": { 114 | "id": "Dv7dc1mZPAGR" 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "# Prepare the data for annotation\n", 119 | "text_to_annotate = gpt_annotate.prepare_data(text_to_annotate, codebook, key)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": { 126 | "id": "WzM16wq3ai0D" 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "# Annotate the data (returns 4 outputs)\n", 131 | "gpt_out_all, gpt_out_final, performance, incorrect = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": { 137 | "id": "rd_dIEkxaseH" 138 | }, 139 | "source": [ 140 | "### If only using gpt_annotate for prediction (i.e., no human labels to compare performance)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": { 147 | "id": "yK1hIfCOPX1v" 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "# Prepare the data for annotation\n", 152 | "text_to_annotate = gpt_annotate.prepare_data(text_to_annotate, codebook, key, human_labels = False)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "id": "1h75bYJfbDPT" 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "# Annotate the data (returns 2 outputs)\n", 164 | "gpt_out_all, gpt_out_final = gpt_annotate.gpt_annotate(text_to_annotate, codebook, key, human_labels = False)" 165 | ] 166 | } 167 | ], 168 | "metadata": { 169 | "colab": { 170 | "provenance": [] 171 | }, 172 | "kernelspec": { 173 | "display_name": "Python 3 (ipykernel)", 174 | "language": "python", 175 | "name": "python3" 176 | }, 177 | "language_info": { 178 | "codemirror_mode": { 179 | "name": "ipython", 180 | "version": 3 181 | }, 182 | "file_extension": ".py", 183 | "mimetype": "text/x-python", 184 | "name": "python", 185 | "nbconvert_exporter": "python", 186 | "pygments_lexer": "ipython3", 187 | "version": "3.8.16" 188 | } 189 | }, 190 | "nbformat": 4, 191 | "nbformat_minor": 4 192 | } 193 | --------------------------------------------------------------------------------