├── HowToUseLLMsNotebook.ipynb ├── README.md └── global-populism-dataset.zip /HowToUseLLMsNotebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "c0731a3d-2a69-4479-8043-5c6abcc8baa2", 6 | "metadata": {}, 7 | "source": [ 8 | "# How to use LLMs for text analysis" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "c6949262-f673-480c-be50-0c92bf0f48ef", 14 | "metadata": {}, 15 | "source": [ 16 | "#### 2023. By Petter Törnberg. ILLC, University of Amsterdam. p.tornberg@uva.nl" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "da99e8df-1867-445d-8e38-a052e50f28fd", 22 | "metadata": {}, 23 | "source": [ 24 | "Updated: 2024-04-15.\n", 25 | "\n", 26 | "This notebook is associated to a how-to guide that gives a simple introduction to using Large Language Models for text analysis in the social sciences. The guide is aimed at students and researchers who are interested in using LLMs for their quantitative or qualitative social scientific text analysis, but who have limited programming experience. \n", 27 | "\n", 28 | "The notebook offers the code for the example of analyzing the level of populism in a given political text, but can easily be adapted for your particular text analysis project." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "id": "6bb4d04c-ea57-4618-8542-13e368b54d49", 34 | "metadata": {}, 35 | "source": [ 36 | "## 1. Signing up for API access" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "id": "1b2c524e-d6a8-4c40-ada3-e5bdb51b535f", 42 | "metadata": {}, 43 | "source": [ 44 | "The first step is to sign up to API access with OpenAI. This can be done on platform.openai.com. (See the PDF how-to guide for further instructions.)\n", 45 | "\n", 46 | "You will receive an API key to be used below." 
47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "d429de13-87fa-48b5-ba05-3e565a4b244d", 52 | "metadata": {}, 53 | "source": [ 54 | "## 2. Installing and loading the relevant libraries" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "id": "04bce3a5-52d6-436b-abfe-7214b0187b7f", 60 | "metadata": {}, 61 | "source": [ 62 | "We first need to install and import the relevant libraries: the pandas package for general data processing, and the openai package for interacting with the OpenAI API." 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": null, 68 | "id": "44c06c36-2b3e-4218-97ee-9e1339c291ae", 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "#Install the libraries\n", 73 | "!pip install pandas\n", 74 | "!pip install openai\n", 75 | "!pip install numpy" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 5, 81 | "id": "e135a950-1263-45dd-97c4-e78f8d41eba7", 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "#Call the libraries\n", 86 | "import pandas as pd\n", 87 | "import openai\n", 88 | "from openai import OpenAI\n", 89 | "import numpy as np\n", 90 | "\n" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 8, 96 | "id": "9ff614b6-0492-4b27-b3d8-1b719936b04d", 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "#We define which model to use throughout\n", 101 | "MODEL = 'gpt-4'\n", 102 | "MAX_TOKENS = 8000\n", 103 | "WAIT_TIME = 0.8 # Wait time between each request. This depends on the rate limit of the model used: GPT-4 needs longer wait time than GPT-3.5.\n", 104 | "\n", 105 | "client = OpenAI(\n", 106 | " api_key= [YOUR API KEY HERE] #Set the API key. See the how-to guide for further instructions\n", 107 | ")\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "id": "dad1051e-99ac-4e5b-adc3-f5054dcb6b3f", 113 | "metadata": {}, 114 | "source": [ 115 | "We can now call the OpenAI API. 
For instance, we can ask ChatGPT-4 a question:" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 10, 121 | "id": "ec033aa7-8fe8-4d71-8b6f-858fcfc8dff6", 122 | "metadata": {}, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "Model answer: 'As an AI, I don't have personal experiences or emotions, but I can tell you that the meaning of life varies from person to person based on their beliefs and values. Some people may believe it's to learn and grow, others may see it as a quest for happiness or success, and some may find meaning in contributing to the well-being of others. Philosophers, scientists, and theologians have debated this question for centuries, and many agree that the answer can be a deeply personal and subjective one.'\n" 129 | ] 130 | } 131 | ], 132 | "source": [ 133 | "#Test the API\n", 134 | "response = client.chat.completions.create(\n", 135 | " model = 'gpt-4', #Which model to use\n", 136 | " temperature=0.2, #How random is the answer\n", 137 | " max_tokens=120, #How long can the reply be\n", 138 | " messages=[\n", 139 | " {\"role\": \"user\", \n", 140 | " \"content\": \"What is the meaning of life?\"}]\n", 141 | ")\n", 142 | "result = ''\n", 143 | "for choice in response.choices:\n", 144 | " result += choice.message.content\n", 145 | "print(f\"Model answer: '{result}'\")\n" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "id": "79be4c4b-71a9-4d28-9525-5511841491f5", 151 | "metadata": {}, 152 | "source": [ 153 | "If the code above generates an error, you might need to check whether your API key is correct, and whether you have access to the specified model." 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "id": "33ea0f1e-4230-49a4-9043-542b478b7212", 159 | "metadata": {}, 160 | "source": [ 161 | "## 3. 
Loading and preparing your data" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "id": "55e87c6d-1c73-4026-9b3e-fff992f677e6", 167 | "metadata": {}, 168 | "source": [ 169 | "The next step is to load and prepare the data that we want to analyze. We will load the data into a Pandas dataframe to allow easy processing.\n", 170 | "\n", 171 | "The details of how to open your particular data depends on the structure and format of the data. Pandas offers ways of opening a range of file formats, including CSV and Excel files. You may wish to refer to the Pandas documentation for more details.\n", 172 | "\n", 173 | "In our example, we will use the data from the Global Populism Dataset (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LFTQEZ). This data offers a number of texts from politicians, and can be used for validating our method. The texts are provided as .txt files in a folder. We will load all these files into a single dataframe." 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "id": "b51f0c54-272e-4f76-a18b-1128316f7230", 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "#Loading data from textfiles\n", 184 | "import glob\n", 185 | "import pandas as pd\n", 186 | "import os \n", 187 | "\n", 188 | "# Define the folder path where the text files are located\n", 189 | "folder_path = './global-populism-dataset/speeches_20220427/'\n", 190 | "\n", 191 | "# Use glob to get a list of all *.txt files in the folder\n", 192 | "txt_files = glob.glob(folder_path + '/*.txt')\n", 193 | "\n", 194 | "# Create an empty list to store the data\n", 195 | "data = []\n", 196 | "\n", 197 | "# Loop through each text file\n", 198 | "for file_path in txt_files:\n", 199 | " with open(file_path, 'r',encoding='utf-8',errors='ignore') as file:\n", 200 | " # Read the text from the file\n", 201 | " text = file.read()\n", 202 | "\n", 203 | " # Get the filename without the directory path\n", 204 | " 
filename = os.path.basename(file_path)\n", 205 | "\n", 206 | " # Append the text and filename to the data list\n", 207 | " data.append({'filename': filename, 'text': text})\n", 208 | "\n", 209 | "# Create a dataframe with the data\n", 210 | "df = pd.DataFrame(data)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "id": "af601eb8-80d9-4826-82cc-ac783dcaa45c", 216 | "metadata": {}, 217 | "source": [ 218 | "### Filter the data " 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "id": "2e2d199c-6085-43df-ab97-828848452d9c", 224 | "metadata": {}, 225 | "source": [ 226 | "You will likely need to filter your data and select the texts you wish to include. \n", 227 | "\n", 228 | "In our case, we will filter out texts written in non-Latin alphabets. While ChatGPT can handle languages with non-Latin characters, doing so is currently more expensive, and there are issues with managing text length. For simplicity, we therefore remove the texts written in non-Latin alphabets.\n" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "id": "fe5274dc-490e-411e-8018-7baebff8b84e", 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "\n", 239 | "def is_latin_alphabet(text):\n", 240 | " latin_characters = 0\n", 241 | " total_characters = 0\n", 242 | "\n", 243 | " for char in text:\n", 244 | " if ord(char) >= 0x0000 and ord(char) <= 0x007F: #Basic Latin (ASCII) range\n", 245 | " latin_characters += 1\n", 246 | " total_characters += 1\n", 247 | "\n", 248 | " # Check that at least 90% of the characters are basic Latin characters; empty texts are rejected\n", 249 | " if total_characters > 0 and latin_characters / total_characters >= 0.9:\n", 250 | " return True\n", 251 | " else:\n", 252 | " return False\n", 253 | "\n", 254 | "df = df[df['text'].apply(is_latin_alphabet)]\n" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "id": "17e9ddf3-be81-4376-9b51-e8e9e90b2d5a", 260 | "metadata": {}, 261 | "source": [ 262 | "### Chunking the texts " 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "id":
"e1c10b51-b7e0-47a5-a3cb-4887e171d19e", 268 | "metadata": {}, 269 | "source": [ 270 | "Unlike many other NLP methods, LLMs need little preprocessing. However, LLMs are only able to process texts that are smaller than their \"context window\". If our texts are longer than the context window of our model, we have to either split the texts into several smaller chunks and analyze them part by part, or simply truncate the text (not recommended).\n", 271 | "\n", 272 | "The details depend on the model you use and the amount of data. For our example, with the GPT-4-32k model, our speeches all fit in the context window, and we do not need to split the texts. \n", 273 | "\n", 274 | "However, for pedagogical reasons, we will use the standard 8K GPT-4 model and chunk the text into smaller pieces. If your text is short, such as a tweet, this function will do nothing." 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 231, 280 | "id": "a0f5b9d9-00c2-4f06-ad91-09a034341340", 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "# Example of how to chunk the text into pieces, separated on sentence level.\n", 285 | "# To do so, we use the nltk library" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 300, 291 | "id": "43650441-4449-4c61-b196-9c60144d2005", 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "!pip install tiktoken\n", 296 | "!pip install nltk\n", 297 | "!pip install langdetect" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "id": "085aefde-58f5-4063-8ee9-471c2119ae15", 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "import tiktoken\n", 308 | "import nltk\n", 309 | "import nltk.data\n", 310 | "from nltk.tokenize import sent_tokenize, word_tokenize\n", 311 | "nltk.download('punkt')" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 328, 317 | "id": "0e35f1f5-0fc9-42be-a178-0e7c7ff59837", 318 |
"metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "#This code chunks the text into processable pieces of similar size.\n", 322 | "#If the text is longer than allowed in terms of model tokens, we want to split the text into equally sized parts, without splitting any text mid-sentence.\n", 323 | "def split_text_into_chunks(text, max_tokens):\n", 324 | " \n", 325 | " #Encode the text with the model's tokenizer and count the number of tokens\n", 326 | " encoding = tiktoken.encoding_for_model(MODEL)\n", 327 | " nrtokens = len(encoding.encode(text))\n", 328 | " \n", 329 | " if nrtokens < max_tokens:\n", 330 | " return [text]\n", 331 | " \n", 332 | " #How many chunks should we split it into?\n", 333 | " num_chunks = int(np.ceil(nrtokens / max_tokens))\n", 334 | "\n", 335 | " # Tokenize the text into sentences\n", 336 | " sentences = sent_tokenize(text)\n", 337 | "\n", 338 | " # Calculate the number of words per chunk\n", 339 | " words_per_chunk = len(text.split()) // num_chunks\n", 340 | "\n", 341 | " # Initialize variables\n", 342 | " chunks = []\n", 343 | " current_chunk = []\n", 344 | "\n", 345 | " word_counter = 0\n", 346 | " # Iterate through each sentence\n", 347 | " for sentence in sentences:\n", 348 | " # Add the sentence to the current chunk\n", 349 | " current_chunk.append(sentence)\n", 350 | " word_counter += len(sentence.split())\n", 351 | "\n", 352 | " # Check if the current chunk has reached the desired number of words\n", 353 | " if word_counter >= words_per_chunk:\n", 354 | " # Add the current chunk to the list of chunks\n", 355 | " chunks.append(\" \".join(current_chunk))\n", 356 | " word_counter = 0\n", 357 | " # Reset the current chunk\n", 358 | " current_chunk = []\n", 359 | "\n", 360 | " # Add the remaining sentences as the last chunk\n", 361 | " if current_chunk:\n", 362 | " chunks.append(\" \".join(current_chunk))\n", 363 | "\n", 364 | " return chunks\n", 365 | "\n" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 329, 371 | "id":
"835d93ee-ae38-4341-83f7-4c587df57d61", 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "# Maximum number of tokens per chunk; this depends on the model's context window. \n", 376 | "# We set it a bit lower than the max tokens, to leave space for our instruction and the response.\n", 377 | "max_tokens = MAX_TOKENS - 2000\n", 378 | "df['text_chunks'] = df['text'].apply(lambda x: split_text_into_chunks(x, max_tokens))" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "id": "4c207bd8-6dea-41ad-83ff-bbaeb19d301e", 384 | "metadata": {}, 385 | "source": [ 386 | "# 4. Prompt engineering" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "id": "1f2cf57e-9ed7-4827-9acc-57539ce3ff3f", 392 | "metadata": {}, 393 | "source": [ 394 | "The next step is to formulate a first set of instructions for analyzing the text. The prompt will be the result of an iterative process through which you develop a formulation of the concept that you wish to capture. \n", 395 | "\n", 396 | "We here start by drawing on the instructions for human coders from a previous study.\n", 397 | "\n", 398 | "See the how-to guide for details on this process. " 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 330, 404 | "id": "ec5a491c-fef6-4d66-992b-b40556029b5e", 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "instruction = \"\"\"Your task is to evaluate the level of populism in a political text. Populism is defined as \"an ideology that considers society to be ultimately separated into two homogeneous and antagonistic groups, 'the pure people' versus 'the corrupt elite', and which argues that politics should be an expression of the volonté générale (general will) of the people.\"\n", 409 | "A populist text is characterized by BOTH of the following elements:\n", 410 | "- People-centrism: how much does the text focus on \"the people\" or \"ordinary people\" as an indivisible or homogeneous community? 
Does the text promote a politics as the popular will of \"the people\"?\n", 411 | "Appeals to specific subgroups of the population (such as ethnicities, regional groups, classes) are inherently antithetical to populism.\n", 412 | "- Anti-elitism: how much does the text focus on \"the elite\", and to what extent are elites in general described in negative terms? In populist texts, the elite is often described as corrupt, and the juxtaposition between the ordinary people and the elite is cast as a moral struggle between good and bad. \n", 413 | "Criticism of specific elements within an elite is not populist: a populist appeal must regard the elite in its entirety as anathema. \n", 414 | "\n", 415 | "You should give the text a numeric grade between 0 and 2.\n", 416 | "2. The text is very populist and comes very close to the ideal populist discourse.\n", 417 | "1. A speech in this category includes strong expressions of all of the populist elements, but either does not use them consistently or tempers them by including non-populist elements. The text may have a romanticized notion of the people and the idea of a unified popular will, but it avoids bellicose language or any particular enemy.\n", 418 | "0. A speech in this category uses few if any populist elements. \n", 419 | "[Answer with a number in the 0-2 range, followed by a semi-colon, and then a brief motivation. For instance: \"1.23; The text shows many elements of a populist text.\" Do not use quotation marks.]\n", 420 | "\"\"\"" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "id": "0ecc0e9c-ae83-4c94-8638-e300db21dede", 426 | "metadata": {}, 427 | "source": [ 428 | "# 5. 
Calling the LLM and analyzing the results" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "id": "91b8cef3-f9be-4efe-b6f2-c5567c8e2c7e", 434 | "metadata": {}, 435 | "source": [ 436 | "### 5.1 Call the LLM" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "id": "ee64dddd-fca1-4b86-99e9-698418eda72f", 442 | "metadata": {}, 443 | "source": [ 444 | "We will now write simple functions for calling the API and carrying out our analysis request. We will also need to handle possible errors returned from the API." 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 331, 450 | "id": "5141014c-b076-4c19-9d48-6b8590472197", 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [ 454 | "import time\n", 455 | "\n", 456 | "def analyze_message(text, instruction, model = 'gpt-4', temperature=0.2):\n", 457 | " print(f\"Analyzing message...\")\n", 458 | " \n", 459 | " response = None\n", 460 | " tries = 0\n", 461 | " failed = True\n", 462 | " \n", 463 | " while(failed):\n", 464 | " try:\n", 465 | " response = client.chat.completions.create(\n", 466 | " model = model, \n", 467 | " temperature=temperature,\n", 468 | " messages=[\n", 469 | " {\"role\": \"system\", \"content\": f\"'{instruction}'\"}, #The system instruction tells the bot how it is supposed to behave\n", 470 | " {\"role\": \"user\", \"content\": f\"'{text}'\"} #This provides the text to be analyzed.\n", 471 | " ]\n", 472 | " )\n", 473 | " failed = False\n", 474 | "\n", 475 | " #Handle errors. The specific error types must come before openai.APIError, which is their common base class.\n", 476 | " #If the text is too long, we truncate it and try again. Note that if you get this error, you probably want to chunk your texts.\n", 477 | " except openai.BadRequestError as e:\n", 478 | " #Shorten request text\n", 479 | " print(f\"Received a BadRequestError. Request likely too long; cutting 10% of the text and trying again. {e}\")\n", 480 | " time.sleep(5)\n", 481 | " words = text.split()\n", 482 | " num_words_to_remove = round(len(words) * 0.1)\n", 483 | " remaining_words = words[:-num_words_to_remove]\n", 484 | " text = ' '.join(remaining_words)\n", 485 | " failed = True\n", 486 | " \n", 487 | " except openai.RateLimitError as e:\n", 488 | " print(f\"OpenAI API request exceeded rate limit: {e}. Waiting 10 seconds and then trying again...\")\n", 489 | " time.sleep(10)\n", 490 | " except openai.APIConnectionError as e:\n", 491 | " print(f\"Failed to connect to OpenAI API: {e}. Waiting 10 seconds and then trying again...\")\n", 492 | " time.sleep(10)\n",
493 | " \n", 494 | " #If the API gets an error, perhaps because it is overwhelmed, we wait 10 seconds and then we try again. \n", 495 | " # We do this 10 times, and then we give up.\n", 496 | " except openai.InternalServerError as e:\n", 497 | " print(f\"OpenAI API returned an InternalServerError: {e}\")\n", 498 | " \n", 499 | " if tries < 10:\n", 500 | " print(f\"Caught an InternalServerError: {e}. Waiting 10 seconds and then trying again...\")\n", 501 | " failed = True\n", 502 | " tries += 1\n", 503 | " time.sleep(10)\n", 504 | " else:\n", 505 | " print(f\"Caught an InternalServerError: {e}. Too many exceptions. Giving up.\")\n", 506 | " raise e\n", 507 | " \n", 508 | " except openai.APIError as e:\n", 509 | " print(f\"OpenAI API returned an API Error: {e}\")\n", 510 | " \n", 511 | " if tries < 10:\n", 512 | " print(f\"Caught an APIError: {e}. Waiting 10 seconds and then trying again...\")\n", 513 | " failed = True\n", 514 | " tries += 1\n", 515 | " time.sleep(10)\n", 516 | " else:\n", 517 | " print(f\"Caught an APIError: {e}. Too many exceptions. Giving up.\")\n", 518 | " raise e\n", 519 | " \n", 520 | " except Exception as e:\n", 521 | " print(f\"Caught unhandled error: {e}. Giving up.\")\n", 522 | " raise\n",
523 | " \n", 524 | " result = ''\n", 525 | " for choice in response.choices:\n", 526 | " result += choice.message.content\n", 527 | " \n", 528 | " return result \n", 529 | "\n", 530 | "\n", 531 | " " 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "id": "f3e3dae9-ea69-42e4-b220-e5275bce2069", 537 | "metadata": {}, 538 | "source": [ 539 | "### 5.2 Parse response " 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "id": "d7995f8e-96a0-4cf4-b149-863ac974bf14", 545 | "metadata": {}, 546 | "source": [ 547 | "The LLM will return a text message. We need to parse this response so that we can use it for further analysis. The details of this function will depend on how you asked the API to respond in your instruction (see above). In our case, we asked the LLM to return a number, followed by a semi-colon and a motivation." 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 332, 553 | "id": "53920a6a-b1e3-4bd5-9e75-e702c5d90018", 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "def parse_result(result):\n", 558 | " #The LLMs at times surround their answers with quotation marks, even if you explicitly tell them not to. If so, we remove them here.\n", 559 | " result = result.strip(\"'\\\"\") \n", 560 | " try:\n", 561 | " #We asked the LLM to start with a number, followed by a semi-colon, followed by the motivation. 
We assume this format in the response here.\n", 562 | " return result.split(';', 1) #Split on the first ';' only: the first part is the numeric answer, the rest is the motivation (which may itself contain semi-colons)\n", 563 | " except Exception as e:\n", 564 | " #If we get an error, we here print the string that failed, to allow debugging.\n", 565 | " print(result)\n", 566 | " pass" 567 | ] 568 | }, 569 | { 570 | "cell_type": "markdown", 571 | "id": "a8e91d60-313b-4d20-9b14-9338d08753e3", 572 | "metadata": {}, 573 | "source": [ 574 | "### 5.3 Run the analysis" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "id": "53218c5d-6b69-4cf0-bc5b-57be7f659b10", 580 | "metadata": {}, 581 | "source": [ 582 | "This is the main loop of the code, where we call the LLM for each line in our data, and give it the instructions." 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": 333, 588 | "id": "15303a74-a62e-4bba-b44d-ad7c88ab2ebc", 589 | "metadata": {}, 590 | "outputs": [], 591 | "source": [ 592 | "#First, we need to prepare the data and store it in a file for persistence\n", 593 | "filename = 'data.pkl'\n", 594 | "\n", 595 | "#These are the columns where we will store the analyzed data\n", 596 | "df['answers'] = [[] for _ in range(len(df))]\n", 597 | "df['motivations'] = [[] for _ in range(len(df))]\n", 598 | "\n", 599 | "df.to_pickle(filename)" 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "execution_count": 363, 605 | "id": "b319655e-5317-456f-999e-845a57375a00", 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [ 609 | "#Main loop \n", 610 | "df = pd.read_pickle(filename)\n", 611 | "\n", 612 | "#If you want to limit the number of lines to analyze\n", 613 | "maximum_lines_to_analyze = 100\n", 614 | "i = 0\n", 615 | "\n", 616 | "while(True):\n", 617 | "\n", 618 | " #Find all unprocessed lines\n", 619 | " # left = df.loc[df['result'].isna()]\n", 620 | " left = df.loc[df['answers'].map(len)==0]\n", 621 | " \n", 622 | " #No lines left? 
Then we're done\n", 623 | " if len(left)==0 or i>= maximum_lines_to_analyze:\n", 624 | " print(\"All done!\")\n", 625 | " break\n", 626 | " \n", 627 | " #Take a random line\n", 628 | " line = left.sample()\n", 629 | " index = line.index.values[0]\n", 630 | " \n", 631 | " print(f\"There are {len(left)} left to process. Processing: {index}\")\n", 632 | " \n", 633 | " #Wait for a bit, to not overload the API\n", 634 | " time.sleep(WAIT_TIME)\n", 635 | " \n", 636 | " #Analyze the specific line, chunk by chunk\n", 637 | " for chunk in line['text_chunks'].values[0]:\n", 638 | " result = analyze_message(chunk, instruction, model = MODEL)\n", 639 | "\n", 640 | " #Parse the results, and put into dataframe\n", 641 | " answer,motivation = parse_result(result)\n", 642 | "\n", 643 | " df.loc[index,'answers'].append(answer)\n", 644 | " df.loc[index,'motivations'].append(motivation)\n", 645 | " \n", 646 | " i+=1\n", 647 | " \n", 648 | " #Save the result to persistent file\n", 649 | " df.to_pickle(filename)" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "id": "ddfdd116-4933-4bcf-9afc-eeedac29aa81", 655 | "metadata": {}, 656 | "source": [ 657 | "### Post-analysis calculations" 658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "id": "28428f1b-f40e-4de7-8f4c-c38ddd30e947", 663 | "metadata": {}, 664 | "source": [ 665 | "Following the LLM analysis, we may need to do some minor calculations or modifications of the results. " 666 | ] 667 | }, 668 | { 669 | "cell_type": "markdown", 670 | "id": "ad0be53e-883c-4ae8-ba6a-0160f127de4a", 671 | "metadata": {}, 672 | "source": [ 673 | "For instance, we need to combine the values returned for the different chunks into a single final value for the full text. This can be done in several ways, but the most straightforward is to take the average of the values for the parts. If the text only has one chunk, the result will be used without change. We will here leave the motivations as a list." 
674 | ] 675 | }, 676 | { 677 | "cell_type": "code", 678 | "execution_count": 337, 679 | "id": "6e7e0676-d61d-410c-959d-e2d4fd544d13", 680 | "metadata": {}, 681 | "outputs": [], 682 | "source": [ 683 | "#Take the mean level of populism in the text. The parsed answers are strings, so we convert them to numbers first\n", 684 | "df['answer'] = [np.mean([float(a) for a in answers]) if len(answers)>0 else None for answers in df['answers']]" 685 | ] 686 | }, 687 | { 688 | "cell_type": "markdown", 689 | "id": "2fabeed5-7191-45a0-9341-41dda12a6ac6", 690 | "metadata": {}, 691 | "source": [ 692 | "### Example result" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "id": "9b368fc0-35f1-40de-909e-e85167ec3186", 698 | "metadata": {}, 699 | "source": [ 700 | "We can now look at some examples of the result from the analysis and the associated motivation. \n", 701 | "\n", 702 | "For instance, we here look at the rating of Donald Trump's inauguration speech:" 703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": 320, 708 | "id": "a8455965-7910-4234-b9ef-1ae4448e71ef", 709 | "metadata": {}, 710 | "outputs": [ 711 | { 712 | "name": "stdout", 713 | "output_type": "stream", 714 | "text": [ 715 | "Rating: \t 1.5. \n", 716 | "Motivation: \t 'The text shows many elements of a populist text. It refers to \"the people\" and \"ordinary people\" as a homogeneous community and promotes the idea of politics as the popular will of \"the people.\" It also criticizes the elite and describes them as having reaped the rewards of government while the people have borne the cost. However, it does not use bellicose language or references to a particular enemy.'\n" 717 | ] 718 | } 719 | ], 720 | "source": [ 721 | "trump = df.loc[df.filename == 'US_Trump_Famous_1.txt']\n", 722 | "print(f\"\"\"Rating: {trump.answer.values[0]}. 
Motivation: '{trump.motivations.values[0][0]}'\"\"\")" 723 | ] 724 | }, 725 | { 726 | "cell_type": "markdown", 727 | "id": "a22c2492-6615-4061-99e4-20ea43d0cf93", 728 | "metadata": {}, 729 | "source": [ 730 | "At face value, this motivation seems both reasonable and plausible. We will now turn to carry out a more in-depth validation." 731 | ] 732 | }, 733 | { 734 | "cell_type": "markdown", 735 | "id": "0a4faae5-75ea-4830-bfc2-da30d7580987", 736 | "metadata": {}, 737 | "source": [ 738 | "# 6. Validation" 739 | ] 740 | }, 741 | { 742 | "cell_type": "markdown", 743 | "id": "3f93455d-9c35-4861-8902-14fee1fd941b", 744 | "metadata": {}, 745 | "source": [ 746 | "Finally, we need to validate our results. Careful validation is essential to make sure that the models are measuring what we intend -- and that they do so without problematic biases. To validate our models, we can compare the outputs with established benchmarks, ground truth data, or expert evaluations to validate the effectiveness in achieving the desired analysis outcomes. Validation can furthermore help us fine-tune the model prompt to improve the results.\n", 747 | "\n", 748 | "A simple way of validating can be to output a random sample to an Excel file, and have human coders manually classifying the data to compare the results. To do so, the code below can be used:" 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "id": "4cd68f4a-e190-4d2b-a7b3-9f31e7ce6723", 754 | "metadata": {}, 755 | "source": [ 756 | "### 6.1 Acquire validation data" 757 | ] 758 | }, 759 | { 760 | "cell_type": "code", 761 | "execution_count": 70, 762 | "id": "eaf1f65b-eb9f-4d77-b668-ba0b1a3b4a9a", 763 | "metadata": {}, 764 | "outputs": [], 765 | "source": [ 766 | "# # Code to extract the data as excel for manual checking. 
This is included for illustration, however, we won't use this here.\n", 767 | "# sample_size = 100\n", 768 | "# sample = df.sample(sample_size).reset_index()\n", 769 | "# sample['manual_classification'] = None\n", 770 | "# sample[['index','text','manual_classification']].to_excel('manual_validation.xlsx')\n", 771 | "\n", 772 | "# # Now open the resulting file in Excel. Carry out manual classification and put result in the final column\n", 773 | "\n", 774 | "# manual_result = pd.read_excel('manual_validation_finished.xlsx')" 775 | ] 776 | }, 777 | { 778 | "cell_type": "markdown", 779 | "id": "f865a93e-37e1-4794-821b-1e535bfed2ed", 780 | "metadata": {}, 781 | "source": [ 782 | "In the case of the populism example, however, the Global Populism Database already offers a large sample of manually classified datapoints that we can use for validation. " 783 | ] 784 | }, 785 | { 786 | "cell_type": "markdown", 787 | "id": "b757094f-e4fa-4271-8f15-1bec4c550f27", 788 | "metadata": {}, 789 | "source": [ 790 | "We first need to make sure the data is in the right format for running simpledorff. Each line should be one coder response." 791 | ] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "execution_count": 343, 796 | "id": "218d0cab-5888-49fb-ac64-6b726b27b6c1", 797 | "metadata": {}, 798 | "outputs": [], 799 | "source": [ 800 | "#Load and clean the validation data. 
\n", 801 | "val = pd.read_csv('./global-populism-dataset/gpd_v2_20220427.csv')\n", 802 | "val = val[val['merging_variable'].notna()] \n", 803 | "val = val[val['rubricgrade'].notna()] #The database contains some NaN values for the index; we remove these lines\n", 804 | "val = val[['merging_variable','codernum','rubricgrade','averagerubric']]" 805 | ] 806 | }, 807 | { 808 | "cell_type": "code", 809 | "execution_count": 344, 810 | "id": "d407363b-d464-4617-a41c-d1c27916d7ea", 811 | "metadata": {}, 812 | "outputs": [], 813 | "source": [ 814 | "#We include only the lines that we've coded with the LLM\n", 815 | "included = set(df.loc[~df['answer'].isna()].filename.values)\n", 816 | "val = val.loc[(val['merging_variable'].isin(included)) & (val['codernum']<=2) ]\n", 817 | "\n", 818 | "#We compare our result with that of the average coder result\n", 819 | "val = val.drop_duplicates(subset=['merging_variable'], keep='first')[['merging_variable','averagerubric']].rename(columns={'averagerubric':'answer'})\n", 820 | "val['codernum'] = 'human'\n", 821 | "\n", 822 | "#Fit our coded data into the same format to allow processing\n", 823 | "df2 = df[['filename','answer','motivations']].dropna(subset=['answer']).rename(columns={'filename':'merging_variable'})\n", 824 | "df2['codernum'] = 'llm'\n", 825 | "\n", 826 | "#Combine the two datasets\n", 827 | "validation_data = pd.concat([val,df2])" 828 | ] 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "id": "6a5bbee5-f9b2-4f1f-921e-646a2455ce2d", 833 | "metadata": {}, 834 | "source": [ 835 | "### 6.2 Measure Krippendorf's Alpha" 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "id": "f954f50c-0304-4bc6-8fdd-a2730edd38e4", 841 | "metadata": {}, 842 | "source": [ 843 | "To compare our data against the validation data, we can use Krippendorf's Alpha (see how-to guide for details.) We here use the simpledorff library to do so." 
844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "execution_count": null, 849 | "id": "83bed77d-a487-45e9-9827-309782ebd4df", 850 | "metadata": {}, 851 | "outputs": [], 852 | "source": [ 853 | "!pip install simpledorff" 854 | ] 855 | }, 856 | { 857 | "cell_type": "code", 858 | "execution_count": 71, 859 | "id": "902d9aeb-ba4a-4825-b7c6-06208340c729", 860 | "metadata": {}, 861 | "outputs": [], 862 | "source": [ 863 | "import simpledorff" 864 | ] 865 | }, 866 | { 867 | "cell_type": "code", 868 | "execution_count": 350, 869 | "id": "cd01e12b-f258-4a22-8cab-8e91d754b8fe", 870 | "metadata": {}, 871 | "outputs": [ 872 | { 873 | "name": "stdout", 874 | "output_type": "stream", 875 | "text": [ 876 | "The resulting Krippendorff's Alpha is 0.6348188209387895.\n" 877 | ] 878 | } 879 | ], 880 | "source": [ 881 | "#Calculate inter-coder reliability on the combined human and LLM codings\n", 882 | "#Note that this uses the interval metric. If your variable is categorical, you need to remove the metric_fn parameter.\n", 883 | "KA = simpledorff.calculate_krippendorffs_alpha_for_df(validation_data,metric_fn=simpledorff.metrics.interval_metric,experiment_col='merging_variable', annotator_col='codernum', class_col='answer')\n", 884 | "\n", 885 | "print(f\"The resulting Krippendorff's Alpha is {KA}.\")" 886 | ] 887 | }, 888 | { 889 | "cell_type": "markdown", 890 | "id": "b560aac5-e680-4130-8258-72708aa2a9f8", 891 | "metadata": {}, 892 | "source": [ 893 | "This is a relatively high value for a first iteration of prompt development for a challenging concept.\n" 894 | ] 895 | }, 896 | { 897 | "cell_type": "markdown", 898 | "id": "55c21731-d26e-4b4e-87a2-20073dc7e2af", 899 | "metadata": {}, 900 | "source": [ 901 | "### 6.3 Carry out iterative process of concept and prompt development " 902 | ] 903 | }, 904 | { 905 | "cell_type": "markdown", 906 | "id": "8697e9bf-11c3-45f2-b18d-97d3a633a1d5", 907 | "metadata": {}, 908 | "source": [ 909 | "Having measured the disagreements between coders and LLM, we can now try to 
understand the sources of the disagreement. This is best thought of as a process of mutual learning through which we develop and operationalize a rigorous social scientific concept in the form of a prompt.\n", 910 | "\n", 911 | "We can here work with the coders, comparing their notes to the motivations given by the LLM and focusing on the examples where the LLM and the human coders are (most) in disagreement. We may find that the prompt can be improved - or that our human coders were mistaken or biased. " 912 | ] 913 | }, 914 | { 915 | "cell_type": "markdown", 916 | "id": "9ac065cb-e695-4c94-9c49-c6d9fa9d91e8", 917 | "metadata": {}, 918 | "source": [ 919 | "In our case, we do not have access to the coders, so we will simply show the process through which this form of work can be done." 920 | ] 921 | }, 922 | { 923 | "cell_type": "code", 924 | "execution_count": null, 925 | "id": "0d5f10c4-c97e-46f1-9316-4c428c7d3a81", 926 | "metadata": {}, 927 | "outputs": [], 928 | "source": [ 929 | "# We create a dataframe that lists the level of disagreement between coders and LLM\n", 930 | "# After the merge, answer_x is the LLM's answer (from df2) and answer_y is the human coders' average (from val)\n", 930 | "wrong = df2.merge(val, on='merging_variable')\n", 931 | "wrong['diff'] = abs(wrong['answer_x']-wrong['answer_y'])" 932 | ] 933 | }, 934 | { 935 | "cell_type": "code", 936 | "execution_count": null, 937 | "id": "9a688c94-12d8-47f0-ad6a-68a790a9b452", 938 | "metadata": {}, 939 | "outputs": [], 940 | "source": [ 941 | "#We can save the results as a CSV file to analyze them in Excel, or examine them here.\n", 941 | "#Rows are sorted by increasing disagreement, so the largest disagreements are at the bottom.\n", 942 | "# display(wrong.sort_values(['diff']))\n", 943 | "wrong.sort_values(['diff']).to_csv('disagreements.csv') " 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": 360, 949 | "id": "9444ac2b-8e51-4492-bac9-e009e4450b3e", 950 | "metadata": {}, 951 | "outputs": [ 952 | { 953 | "data": { 954 | "text/plain": [ 955 | "' The text does not contain populist elements. 
It does not focus on \"the people\" as a homogeneous group, nor does it depict \"the elite\" as a corrupt entity. Instead, it focuses on historical events, the importance of freedom, and the unity of the nation.'" 956 | ] 957 | }, 958 | "execution_count": 360, 959 | "metadata": {}, 960 | "output_type": "execute_result" 961 | } 962 | ], 963 | "source": [ 964 | "# One of the cases where the LLM and human coders disagree the most is a speech by Berlusconi. The LLM does not think it is populist, but the human coders do.\n", 965 | "\n", 966 | "#The motivation given by the LLM is:\n", 967 | "wrong.loc[wrong['merging_variable']=='Italy_Berlusconi_Ribbon_2.txt'].motivations.values[0][0]\n" 968 | ] 969 | }, 970 | { 971 | "cell_type": "markdown", 972 | "id": "79a8e06e-04d4-4b2e-b22f-80064c8e115f", 973 | "metadata": {}, 974 | "source": [ 975 | "Here follows the text, translated into English. \n", 976 | "\n", 977 | "Do you agree with the LLM or the human coders? If the latter, how do you think the prompt should be modified to improve the results? " 978 | ] 979 | }, 980 | { 981 | "cell_type": "code", 982 | "execution_count": 362, 983 | "id": "1d4ff010-e847-4d5b-bbf5-a31104f426b9", 984 | "metadata": {}, 985 | "outputs": [ 986 | { 987 | "name": "stdout", 988 | "output_type": "stream", 989 | "text": [ 990 | "Dear friends,\n", 991 | "\n", 992 | "It is not easy to find the words to describe my, our state of mind at this moment. 
We are gathered here in Onna to celebrate the Liberation Day, a celebration that is both an honor and a commitment.\n", 993 | "\n", 994 | "An honor: to commemorate a terrible massacre that took place right here in June 1944 when the Nazis, in retaliation, killed 17 citizens of Onna and then blew up the house where the bodies of those innocent victims were found.\n", 995 | "\n", 996 | "A commitment: what should inspire us is not to forget what happened here and to remember the horrors of totalitarianism and the suppression of 'freedom'.\n", 997 | "\n", 998 | "Right here, in Abruzzo, the legendary Maiella Brigade was born and operated, decorated with the Gold Medal for Military Valor. In December '43, 15 young people founded what would become the Maiella Brigade, which grew to 1,500 strong.\n", 999 | "\n", 1000 | "It is no coincidence that on this special day, the soldiers of the Honor Guard standing before us belong to the 33rd Artillery Regiment, the Abruzzesi unit that in 1943 on Cephalonia had the courage to resist the Nazis and sacrifice themselves – fighting – for the honor of our country.\n", 1001 | "\n", 1002 | "To those patriots who fought for the redemption and rebirth of Italy, our admiration, gratitude, and recognition must always go.\n", 1003 | "\n", 1004 | "Most Italians today have not experienced what it means to be deprived of freedom. Only the elderly have a direct memory of totalitarianism, foreign occupation, and the war for the liberation of our homeland.\n", 1005 | "\n", 1006 | "For many of us, it is a memory tied to our families, our parents, our grandparents, many of whom were protagonists or victims of those dramatic days. For me, it is the memory of years of separation from my father, forced to emigrate to avoid arrest, the memory of my mother's sacrifices, who alone had to support a large family during those difficult years. 
It is the memory of her courage, of her, like many others, traveling by train every day from a small town in the province of Como to work in Milan, and on one of those trains, risking her life but managing to save a Jewish woman from the clutches of a Nazi soldier destined for the extermination camps.\n", 1007 | "\n", 1008 | "These are the memories, the examples with which we grew up – the memories of a generation of Italians who did not hesitate to choose freedom, even at the risk of their own safety and lives.\n", 1009 | "\n", 1010 | "Our country owes an inexhaustible debt to those many young people who sacrificed their lives during their most beautiful years to redeem the honor of the nation, out of fidelity to an oath, but above all for that great, splendid, and essential value which is freedom.\n", 1011 | "\n", 1012 | "We owe the same debt of gratitude to all those other boys, Americans, English, French, Polish, from the many allied countries, who shed their blood in the Italian campaign. Without them, the sacrifice of our partisans would have risked being in vain.\n", 1013 | "\n", 1014 | "And with respect, we must remember today all the fallen, even those who fought on the wrong side, sincerely sacrificing their lives for their ideals and a lost cause.\n", 1015 | "\n", 1016 | "This does not mean, of course, neutrality or indifference. We are – all free Italians are – on the side of those who fought for our freedom, for our dignity, and for the honor of our homeland.\n", 1017 | "\n", 1018 | "In recent years, the history of the Resistance has been deepened and discussed. It is a good thing that it happened. The Resistance, along with the Risorgimento, is one of the founding values of our nation, a return to the tradition of freedom. And freedom is a right that comes before laws and the state because it is a natural right that belongs to us as human beings.\n", 1019 | "\n", 1020 | "However, a free nation does not need myths. 
As with the Risorgimento, we must also remember the dark pages of the civil war, even those in which those who fought on the right side made mistakes and took on blame.\n", 1021 | "\n", 1022 | "It is an exercise in truth, in honesty, an exercise that makes the history of those who fought on the right side with selflessness and courage even more glorious.\n", 1023 | "\n", 1024 | "It is the history of the many who fought in the Southern army, who, from Cephalonia onwards, redeemed the honor of the uniform with their blood.\n", 1025 | "\n", 1026 | "It is the history of martyrs like Salvo D’Acquisto, who did not hesitate to sacrifice his life in exchange for other innocent lives.\n", 1027 | "\n", 1028 | "It is the history of our soldiers interned in Germany who chose concentration camps rather than collaborating with the Nazis.\n", 1029 | "\n", 1030 | "It is the history of the many who hid their fellow Jewish citizens, saving them from deportation.\n", 1031 | "\n", 1032 | "Above all, it is the history of the many, countless unknown heroes who, with small or great acts of daily courage, contributed to the cause of freedom.\n", 1033 | "\n", 1034 | "Even the Church, I want to remember, played its part with true courage, to prevent odious concepts like race or religious differences from becoming reasons for persecution and death.\n", 1035 | "\n", 1036 | "Similarly, we must remember the young Jews of the Jewish Brigade, who came from ghettos all over Europe, took up arms, and fought for freedom.\n", 1037 | "\n", 1038 | "At that moment, many Italians of different faiths, cultures, and backgrounds came together to pursue the same great dream – the dream of freedom.\n", 1039 | "\n", 1040 | "Among them were very different individuals and groups. 
Some thought only of freedom, some dreamed of establishing a different social and political order, some considered themselves bound by an oath of loyalty to the monarchy.\n", 1041 | "\n", 1042 | "But they all managed to set aside their differences, even the most profound ones, to fight together. The communists and the Catholics, the socialists and the liberals, the actionists and the monarchists, faced with a common tragedy, each wrote a great page of our history. A page on which our Constitution is based, a page on which our freedom is based.\n", 1043 | "\n", 1044 | "In the drafting of the Constitution, the wisdom of the political leaders of that time – De Gasperi and Togliatti, Ruini and Terracini, Nenni, Pacciardi, and Parri – managed to channel deep initial divisions towards a single objective.\n", 1045 | "\n", 1046 | "Although clearly the result of compromises, the republican Constitution achieved two noble and fundamental objectives: guaranteeing freedom and creating the conditions for democratic development in the country. It was not a small feat; in fact, it was the best compromise possible at the time.\n", 1047 | "\n", 1048 | "However, the goal of creating a \"common\" moral conscience for the nation was missed, perhaps premature for those times, so much so that the predominant value for everyone was anti-fascism, but not necessarily anti-totalitarianism. It was a product of history, a compromise useful to avoid the Cold War that vertically divided Italy from degenerating into a civil war with unpredictable outcomes. 
But the assumption of responsibility and the sense of the State that animated all the political leaders of that time remain a great lesson that would be unforgivable to forget.\n", 1049 | "\n", 1050 | "Today, 64 years after April 25, 1945, and twenty years after the fall of the Berlin Wall, our task, the task of all, is to finally build a unified national sentiment.\n", 1051 | "\n", 1052 | "We must do it together, together, regardless of political affiliation, together, for a new beginning of our republican democracy, where all political parties recognize the greatest value, freedom, and debate in its name for the good and the interest of all.\n", 1053 | "\n", 1054 | "The anniversary of the regained freedom is, therefore, an opportunity to reflect on the past, but also to reflect on the present and the future of Italy. If we can do it together from today onwards, we will have rendered a great service not to one\n", 1055 | "\n", 1056 | "We have always rejected the idea that our adversary was our enemy. Our religion of freedom demanded it from us and still does. With the same spirit, I am convinced that the time has come for the Liberation Day to become the Day of Freedom, and for this commemoration to shed the character of opposition that revolutionary culture gave it, a character that still 'divides' rather than 'unites'.\n", 1057 | "\n", 1058 | "I say this with great serenity, without any intention of creating controversy. April 25 was the origin of a new season of democracy, and in democracy, the people's vote deserves absolute respect from everyone.\n", 1059 | "\n", 1060 | "After April 25, the people peacefully voted for the Republic, and the monarchy accepted the people's judgment.\n", 1061 | "\n", 1062 | "Shortly after, on April 18, 1948, the people's choice was once again decisive for our country: with De Gasperi's victory, the Italian people recognized themselves in the Christian and liberal tradition of their history. 
The 1950s, always with the support of the popular vote, shaped Italy into a democratic, economic, and social reality. Italy became part of Europe and the West, played a role in promoting Atlantic unity and European unity, transforming from a rejected nation to a respected one.\n", 1063 | "\n", 1064 | "Today, our young people face other challenges: to defend the freedom conquered by their fathers and expand it even further, aware that without freedom, there can be no peace, justice, or well-being.\n", 1065 | "\n", 1066 | "Some of these challenges are global and see us engaged alongside free nations: the fight against terrorism, the fight against fanatic and repressive fundamentalism, the fight against racism because freedom, dignity, and peace are rights of every human being, 'everywhere' in the world.\n", 1067 | "\n", 1068 | "That's why I want to remember the Italian soldiers engaged in peace missions abroad, especially those who have fallen in carrying out this noble mission. There is an ideal continuity between them and all the heroes, Italian and allied, who sacrificed their lives over 60 years ago to give us back freedom, security, and peace.\n", 1069 | "\n", 1070 | "Today, the teachings of our fathers take on a special value: this April 25 comes just after the great tragedy that struck this land of Abruzzo. Once again, facing the emergency and tragedy, Italians have shown their ability to unite, to overcome differences, demonstrating that they are a great and cohesive people, full of generosity, solidarity, and courage.\n", 1071 | "\n", 1072 | "Looking at the many Italians who have been engaged here in rescue and reconstruction efforts, I feel proud, once again, even more so, to be Italian and to lead this wonderful country.\n", 1073 | "\n", 1074 | "Today, Onna is the symbol of our Italy. The earthquake that destroyed it reminds us of the days when invaders destroyed it. 
Rebuilding it will mean repeating the gesture of its rebirth after Nazi violence.\n", 1075 | "\n", 1076 | "And it is precisely concerning the heroes of then and today that we all have a great responsibility: to set aside any controversy, to look at the interest of the nation, to safeguard the great heritage of freedom that we inherited from our fathers.\n", 1077 | "\n", 1078 | "Together, we all have the responsibility and duty to build a future of prosperity, security, peace, and freedom for all.\n", 1079 | "\n", 1080 | "Long live Italy! Long live the Republic!\n", 1081 | "\n", 1082 | "Long live April 25, the celebration of all Italians who love freedom and want to remain free!\n", 1083 | "\n", 1084 | "Long live April 25, the celebration of regained freedom!\n" 1085 | ] 1086 | } 1087 | ], 1088 | "source": [ 1089 | "print(\"\"\"Dear friends,\\n\\nIt is not easy to find the words to describe my, our state of mind at this moment. We are gathered here in Onna to celebrate the Liberation Day, a celebration that is both an honor and a commitment.\\n\\nAn honor: to commemorate a terrible massacre that took place right here in June 1944 when the Nazis, in retaliation, killed 17 citizens of Onna and then blew up the house where the bodies of those innocent victims were found.\\n\\nA commitment: what should inspire us is not to forget what happened here and to remember the horrors of totalitarianism and the suppression of 'freedom'.\\n\\nRight here, in Abruzzo, the legendary Maiella Brigade was born and operated, decorated with the Gold Medal for Military Valor. 
In December '43, 15 young people founded what would become the Maiella Brigade, which grew to 1,500 strong.\\n\\nIt is no coincidence that on this special day, the soldiers of the Honor Guard standing before us belong to the 33rd Artillery Regiment, the Abruzzesi unit that in 1943 on Cephalonia had the courage to resist the Nazis and sacrifice themselves – fighting – for the honor of our country.\\n\\nTo those patriots who fought for the redemption and rebirth of Italy, our admiration, gratitude, and recognition must always go.\\n\\nMost Italians today have not experienced what it means to be deprived of freedom. Only the elderly have a direct memory of totalitarianism, foreign occupation, and the war for the liberation of our homeland.\\n\\nFor many of us, it is a memory tied to our families, our parents, our grandparents, many of whom were protagonists or victims of those dramatic days. For me, it is the memory of years of separation from my father, forced to emigrate to avoid arrest, the memory of my mother's sacrifices, who alone had to support a large family during those difficult years. 
It is the memory of her courage, of her, like many others, traveling by train every day from a small town in the province of Como to work in Milan, and on one of those trains, risking her life but managing to save a Jewish woman from the clutches of a Nazi soldier destined for the extermination camps.\\n\\nThese are the memories, the examples with which we grew up – the memories of a generation of Italians who did not hesitate to choose freedom, even at the risk of their own safety and lives.\\n\\nOur country owes an inexhaustible debt to those many young people who sacrificed their lives during their most beautiful years to redeem the honor of the nation, out of fidelity to an oath, but above all for that great, splendid, and essential value which is freedom.\\n\\nWe owe the same debt of gratitude to all those other boys, Americans, English, French, Polish, from the many allied countries, who shed their blood in the Italian campaign. Without them, the sacrifice of our partisans would have risked being in vain.\\n\\nAnd with respect, we must remember today all the fallen, even those who fought on the wrong side, sincerely sacrificing their lives for their ideals and a lost cause.\\n\\nThis does not mean, of course, neutrality or indifference. We are – all free Italians are – on the side of those who fought for our freedom, for our dignity, and for the honor of our homeland.\\n\\nIn recent years, the history of the Resistance has been deepened and discussed. It is a good thing that it happened. The Resistance, along with the Risorgimento, is one of the founding values of our nation, a return to the tradition of freedom. And freedom is a right that comes before laws and the state because it is a natural right that belongs to us as human beings.\\n\\nHowever, a free nation does not need myths. 
As with the Risorgimento, we must also remember the dark pages of the civil war, even those in which those who fought on the right side made mistakes and took on blame.\\n\\nIt is an exercise in truth, in honesty, an exercise that makes the history of those who fought on the right side with selflessness and courage even more glorious.\\n\\nIt is the history of the many who fought in the Southern army, who, from Cephalonia onwards, redeemed the honor of the uniform with their blood.\\n\\nIt is the history of martyrs like Salvo D’Acquisto, who did not hesitate to sacrifice his life in exchange for other innocent lives.\\n\\nIt is the history of our soldiers interned in Germany who chose concentration camps rather than collaborating with the Nazis.\\n\\nIt is the history of the many who hid their fellow Jewish citizens, saving them from deportation.\\n\\nAbove all, it is the history of the many, countless unknown heroes who, with small or great acts of daily courage, contributed to the cause of freedom.\\n\\nEven the Church, I want to remember, played its part with true courage, to prevent odious concepts like race or religious differences from becoming reasons for persecution and death.\\n\\nSimilarly, we must remember the young Jews of the Jewish Brigade, who came from ghettos all over Europe, took up arms, and fought for freedom.\\n\\nAt that moment, many Italians of different faiths, cultures, and backgrounds came together to pursue the same great dream – the dream of freedom.\\n\\nAmong them were very different individuals and groups. Some thought only of freedom, some dreamed of establishing a different social and political order, some considered themselves bound by an oath of loyalty to the monarchy.\\n\\nBut they all managed to set aside their differences, even the most profound ones, to fight together. 
The communists and the Catholics, the socialists and the liberals, the actionists and the monarchists, faced with a common tragedy, each wrote a great page of our history. A page on which our Constitution is based, a page on which our freedom is based.\\n\\nIn the drafting of the Constitution, the wisdom of the political leaders of that time – De Gasperi and Togliatti, Ruini and Terracini, Nenni, Pacciardi, and Parri – managed to channel deep initial divisions towards a single objective.\\n\\nAlthough clearly the result of compromises, the republican Constitution achieved two noble and fundamental objectives: guaranteeing freedom and creating the conditions for democratic development in the country. It was not a small feat; in fact, it was the best compromise possible at the time.\\n\\nHowever, the goal of creating a \"common\" moral conscience for the nation was missed, perhaps premature for those times, so much so that the predominant value for everyone was anti-fascism, but not necessarily anti-totalitarianism. It was a product of history, a compromise useful to avoid the Cold War that vertically divided Italy from degenerating into a civil war with unpredictable outcomes. But the assumption of responsibility and the sense of the State that animated all the political leaders of that time remain a great lesson that would be unforgivable to forget.\\n\\nToday, 64 years after April 25, 1945, and twenty years after the fall of the Berlin Wall, our task, the task of all, is to finally build a unified national sentiment.\\n\\nWe must do it together, together, regardless of political affiliation, together, for a new beginning of our republican democracy, where all political parties recognize the greatest value, freedom, and debate in its name for the good and the interest of all.\\n\\nThe anniversary of the regained freedom is, therefore, an opportunity to reflect on the past, but also to reflect on the present and the future of Italy. 
If we can do it together from today onwards, we will have rendered a great service not to one\\n\\nWe have always rejected the idea that our adversary was our enemy. Our religion of freedom demanded it from us and still does. With the same spirit, I am convinced that the time has come for the Liberation Day to become the Day of Freedom, and for this commemoration to shed the character of opposition that revolutionary culture gave it, a character that still 'divides' rather than 'unites'.\\n\\nI say this with great serenity, without any intention of creating controversy. April 25 was the origin of a new season of democracy, and in democracy, the people's vote deserves absolute respect from everyone.\\n\\nAfter April 25, the people peacefully voted for the Republic, and the monarchy accepted the people's judgment.\\n\\nShortly after, on April 18, 1948, the people's choice was once again decisive for our country: with De Gasperi's victory, the Italian people recognized themselves in the Christian and liberal tradition of their history. The 1950s, always with the support of the popular vote, shaped Italy into a democratic, economic, and social reality. Italy became part of Europe and the West, played a role in promoting Atlantic unity and European unity, transforming from a rejected nation to a respected one.\\n\\nToday, our young people face other challenges: to defend the freedom conquered by their fathers and expand it even further, aware that without freedom, there can be no peace, justice, or well-being.\\n\\nSome of these challenges are global and see us engaged alongside free nations: the fight against terrorism, the fight against fanatic and repressive fundamentalism, the fight against racism because freedom, dignity, and peace are rights of every human being, 'everywhere' in the world.\\n\\nThat's why I want to remember the Italian soldiers engaged in peace missions abroad, especially those who have fallen in carrying out this noble mission. 
There is an ideal continuity between them and all the heroes, Italian and allied, who sacrificed their lives over 60 years ago to give us back freedom, security, and peace.\\n\\nToday, the teachings of our fathers take on a special value: this April 25 comes just after the great tragedy that struck this land of Abruzzo. Once again, facing the emergency and tragedy, Italians have shown their ability to unite, to overcome differences, demonstrating that they are a great and cohesive people, full of generosity, solidarity, and courage.\\n\\nLooking at the many Italians who have been engaged here in rescue and reconstruction efforts, I feel proud, once again, even more so, to be Italian and to lead this wonderful country.\\n\\nToday, Onna is the symbol of our Italy. The earthquake that destroyed it reminds us of the days when invaders destroyed it. Rebuilding it will mean repeating the gesture of its rebirth after Nazi violence.\\n\\nAnd it is precisely concerning the heroes of then and today that we all have a great responsibility: to set aside any controversy, to look at the interest of the nation, to safeguard the great heritage of freedom that we inherited from our fathers.\\n\\nTogether, we all have the responsibility and duty to build a future of prosperity, security, peace, and freedom for all.\\n\\nLong live Italy! 
Long live the Republic!\\n\\nLong live April 25, the celebration of all Italians who love freedom and want to remain free!\\n\\nLong live April 25, the celebration of regained freedom!\"\"\")\n" 1090 | ] 1091 | } 1092 | ], 1093 | "metadata": { 1094 | "kernelspec": { 1095 | "display_name": "Python 3 (ipykernel)", 1096 | "language": "python", 1097 | "name": "python3" 1098 | }, 1099 | "language_info": { 1100 | "codemirror_mode": { 1101 | "name": "ipython", 1102 | "version": 3 1103 | }, 1104 | "file_extension": ".py", 1105 | "mimetype": "text/x-python", 1106 | "name": "python", 1107 | "nbconvert_exporter": "python", 1108 | "pygments_lexer": "ipython3", 1109 | "version": "3.11.6" 1110 | } 1111 | }, 1112 | "nbformat": 4, 1113 | "nbformat_minor": 5 1114 | } 1115 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # How to use LLMs for Text Analysis 2 | This guide introduces Large Language Models (LLMs) as a highly versatile text analysis method within the social sciences. As LLMs are easy to use, cheap, fast, and applicable to a broad range of text analysis tasks, ranging from text annotation and classification to sentiment analysis and critical discourse analysis, many scholars believe that LLMs will transform how we do text analysis. This how-to guide is aimed at students and researchers with limited programming experience, and offers a simple introduction to how LLMs can be used for text analysis in your own research project, as well as advice on best practices. We will go through each of the steps of analyzing textual data with LLMs using Python: installing the software, setting up the API, loading the data, developing an analysis prompt, analyzing the text, and validating the results. 
As an illustrative example, we will use the challenging task of identifying populism in political texts, and show how LLMs move beyond the existing state of the art. 3 | -------------------------------------------------------------------------------- /global-populism-dataset.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cssmodels/howtousellms/516c3f4bd3efbafc9709c931bb0df9a4df58aace/global-populism-dataset.zip --------------------------------------------------------------------------------