├── .gitignore ├── README.md ├── WORKBOOK-Tweet-Collection.ipynb ├── Working-with-Twitter-Data-2.ipynb └── Working-with-Twitter-Data-3.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Working with Twitter Data TAPI 2022 2 | 3 | 4 | 5 | This is the GitHub repository for "Working with Twitter Data," a course led by [Melanie Walsh](https://melaniewalsh.org/) for the 2022 Text Analysis Pedagogy Institute. TAPI is supported by the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/). 6 | 7 | These notebooks are available for free for educational reuse under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/). 8 | 9 | For questions/comments/improvements, email melwalsh@uw.edu 10 | -------------------------------------------------------------------------------- /WORKBOOK-Tweet-Collection.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3e89c5b3", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).\n", 11 | "\n", 12 | "Created by [Melanie Walsh](https://melaniewalsh.org/) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).\n", 13 | "\n", 14 | "For questions/comments/improvements, email melwalsh@uw.edu\n", 15 | "____" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "id": "68f932d1", 21 | "metadata": {}, 22 | "source": [ 23 | "# Tweet Collection (Workbook) — June 2022\n", 24 | "\n", 25 | "Here's a streamlined workbook for collecting tweet counts and tweets.\n", 26 | "___" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "35b85a87-a8e1-4496-95de-595f8c9b3f0c", 32 | "metadata": {}, 33 | "source": [ 34 | "# Install Libraries" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "id": "e8a220f5", 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "### Install Libraries ###\n", 45 | "!pip install twarc --upgrade\n", 46 | "!pip install twarc-csv --upgrade\n", 47 | "!pip install plotly\n", 48 | "!pip install wordcloud" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 1, 54 | "id": "5480e2a8", 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "### Import Libraries ###\n", 59 | "import plotly.express as px\n", 60 | "from wordcloud import WordCloud, STOPWORDS\n", 61 | "import matplotlib.pyplot as plt\n", 62 | "import pandas as pd\n", 63 | "# Set max column width\n", 64 | "pd.options.display.max_colwidth = 400\n", 65 | "# Set max number of columns\n", 66 | "pd.options.display.max_columns = 95" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "6dcbc763-1ccc-498e-9149-e083e09fae8b", 72 | "metadata": {}, 73 | "source": [ 74 | "# Twitter API Setup" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "id": "d88aaf2a-218b-4a12-8e16-b78faa222bd8", 80 | "metadata": {}, 81 | "source": [ 
82 | "## Configure Twarc" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "id": "3dfcebde-bfd7-4164-bad8-af5794fe0b91", 88 | "metadata": {}, 89 | "source": [ 90 | "Once twarc is installed, you need to configure it with your API keys and/or bearer token so that you can actually access the API. " 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "id": "09141ce0-a3f1-4a56-a1de-84d0c761b2e7", 96 | "metadata": {}, 97 | "source": [ 98 | "To configure twarc, you would typically run `twarc2 configure` from the command line. This will prompt twarc to ask for your bearer token, which you can copy and paste into the blank after the colon, and then press enter. You can optionally enter your API keys, as well." 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "id": "ccaebcf9-7bd7-45ff-adea-fcb5be519aea", 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "!printf '%s\\n' \"YOUR BEARER TOKEN HERE\" \"no\" | twarc2 configure" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "id": "d47e3c57-6716-4feb-b4db-7913a9651e06", 114 | "metadata": {}, 115 | "source": [ 116 | "If you've entered your information correctly, you should get a congratulatory message." 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "id": "06cfea96-97cd-4ac3-b238-7042067c040b", 122 | "metadata": {}, 123 | "source": [ 124 | "# Twitter Data Collection & Visualization\n", 125 | "## Get Tweet Counts and Save as Spreadsheet" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 13, 131 | "id": "8f11f1cb-55af-43fe-b2cc-91aa2aca7a3b", 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "!twarc2 counts \"your query\" --csv --archive --granularity day > query-tweet-counts.csv" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "id": "ec42f154-7069-4236-a52c-e89fd87f045a", 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "# Change the title of the plot here\n", 146 | "my_title = \"YOUR TITLE HERE\"\n", 147 | "# Change the file name here\n", 148 | "filename = \"YOUR FILENAME HERE\"\n", 149 | "\n", 150 | "# Read in CSV as DataFrame\n", 151 | "tweet_counts_df = pd.read_csv(filename, parse_dates=['start', 'end'])\n", 152 | "\n", 153 | "# Set start time as DataFrame index\n", 154 | "tweet_counts_df.set_index('start', inplace=True)\n", 155 | "\n", 156 | "# Regroup, or resample, tweets by month, day, or year\n", 157 | "grouped_count = tweet_counts_df.resample('M')['day_count'].sum().reset_index() # Month\n", 158 | "#grouped_count = tweet_counts_df.resample('D')['day_count'].sum().reset_index() # Day\n", 159 | "#grouped_count = tweet_counts_df.resample('Y')['day_count'].sum().reset_index() # Year\n", 160 | "\n", 161 | "# Make a line plot from the DataFrame and specify x and y axes, axes titles, and plot title\n", 162 | "px.line(grouped_count, x = 'start', y = 'day_count',\n", 163 | " labels = {'start': 'Time', 'day_count': '# of Tweets'},\n", 164 | " title = my_title)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "id": "893525e5-75e8-444f-8a15-a31a80b08174", 170 | "metadata": {}, 171 | "source": [ 172 | "## Get Tweets and Save as Spreadsheet" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "id": "4059e86b-7daa-405a-a7ae-655016063457", 178 | "metadata": {}, 179 | "source": [ 180 | "To actually collect tweets and their associated metadata, we can use the command `twarc2 search` and insert a query.\n", 181 | "\n", 182 | "Here we're going to 
search for any tweets that mention certain words and were tweeted by verified accounts. By default, `twarc2 search` will use the recent search endpoint of the Twitter API, which only returns tweets from the past week." 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "id": "a8938c24-5f5e-40af-91e2-f6c3adf5dab9", 188 | "metadata": {}, 189 | "source": [ 190 | "
\n", 191 | "

Attention 🚨

\n", 192 | " Remember that the --archive flag and full-archive search functionality is only available to those who have an Academic Research account. \n", 193 | " Students with Essential API access should remove the --archive flag from the code below.\n", 194 | "\n", 195 | "
" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "id": "1d2a4b26-726f-49e1-8e71-d2760a8b2c91", 201 | "metadata": {}, 202 | "source": [ 203 | "You might want to limit your search to 5000 tweets or less" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 30, 209 | "id": "1036d813-4b14-4489-ba5b-2eaf77da42c9", 210 | "metadata": { 211 | "scrolled": true, 212 | "tags": [ 213 | "output_scroll" 214 | ] 215 | }, 216 | "outputs": [], 217 | "source": [ 218 | "!twarc2 search --archive --limit 5000 \"YOUR QUERY\" > my_tweets.jsonl" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 31, 224 | "id": "f488964b-58fd-4440-a7ec-565377f0f42f", 225 | "metadata": {}, 226 | "outputs": [ 227 | { 228 | "name": "stdout", 229 | "output_type": "stream", 230 | "text": [ 231 | "100%|██████████████| Processed 3.36M/3.36M of input file [00:00<00:00, 7.44MB/s]\n", 232 | "\n", 233 | "ℹ️\n", 234 | "Parsed 1550 tweets objects from 17 lines in the input file.\n", 235 | "Wrote 1550 rows and output 74 columns in the CSV.\n", 236 | "\n" 237 | ] 238 | } 239 | ], 240 | "source": [ 241 | "!twarc2 csv my_tweets.jsonl my_tweets.csv" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "id": "ee670d6c-fad6-4c1a-bc9b-91a22b8b40a5", 247 | "metadata": {}, 248 | "source": [ 249 | "Now we're ready to explore the data!" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "id": "3056954b-af42-41ff-966b-42753af04507", 256 | "metadata": { 257 | "scrolled": true, 258 | "tags": [] 259 | }, 260 | "outputs": [], 261 | "source": [ 262 | "my_tweets_df = pd.read_csv('my_tweets.csv', parse_dates = ['created_at'])" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "id": "c5534fcc-cab2-4bc2-b23a-84b835570ecc", 268 | "metadata": {}, 269 | "source": [ 270 | "Let's apply the helper functions and create new columns." 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 36, 276 | "id": "8b8b4139-67af-472b-86a4-07bb95f0a987", 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "# Find the type of tweet\n", 281 | "def find_type(tweet):\n", 282 | " \n", 283 | " # Check to see if tweet contains retweet, quote tweet, or reply tweet info\n", 284 | " contains_retweet = tweet['referenced_tweets.retweeted.id']\n", 285 | " contains_quote = tweet['referenced_tweets.quoted.id']\n", 286 | " contains_reply = tweet['referenced_tweets.replied_to.id']\n", 287 | " \n", 288 | " # Does tweet contain retweet info? (Is this category not NA or empty?)\n", 289 | " if pd.notna(contains_retweet):\n", 290 | " return \"retweet\"\n", 291 | " # Does tweet contain quote and reply info?\n", 292 | " elif pd.notna(contains_quote) and pd.notna(contains_reply):\n", 293 | " return \"quote/reply\"\n", 294 | " # Does tweet contain quote info? \n", 295 | " elif pd.notna(contains_quote):\n", 296 | " return \"quote\"\n", 297 | " # Does tweet contain reply info? 
\n", 298 | " elif pd.notna(contains_reply):\n", 299 | " return \"reply\"\n", 300 | " # If it doesn't contain any of this info, it must be an original tweet\n", 301 | " else:\n", 302 | " return \"original\"\n", 303 | "\n", 304 | "# Make Tweet URL\n", 305 | "def make_tweet_url(tweets):\n", 306 | " # Get username\n", 307 | " username = tweets[0]\n", 308 | " # Get tweet ID\n", 309 | " tweet_id = tweets[1]\n", 310 | " # Make tweet URL\n", 311 | " tweet_url = f\"https://twitter.com/{username}/status/{tweet_id}\"\n", 312 | " return tweet_url" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 37, 318 | "id": "6b85dec6-694e-4084-8571-0347a73f8700", 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "# Create tweet type column\n", 323 | "my_tweets_df['type'] =my_tweets_df.apply(find_type, axis =1)\n", 324 | "# Create tweet URL column\n", 325 | "my_tweets_df['tweet_url'] = my_tweets_df[['author.username', 'id']].apply(make_tweet_url, axis='columns')" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "id": "e5f9fd65-0031-4bdf-ba21-c8c28cca7bd6", 331 | "metadata": {}, 332 | "source": [ 333 | "Let's select and rename only the columns we're interested in.\n", 334 | "\n", 335 | "Pick another new column that you find intriguing and add it below!!" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "id": "bd2f9b48-9710-43f3-9798-bc97312c068a", 342 | "metadata": {}, 343 | "outputs": [], 344 | "source": [ 345 | "# Select columns of interest\n", 346 | "my_clean_tweets_df = my_tweets_df[['created_at', 'author.username', 'author.name', 'author.description',\n", 347 | " 'author.verified', 'type', 'text', 'public_metrics.retweet_count', \n", 348 | " 'public_metrics.like_count', 'public_metrics.reply_count', 'public_metrics.quote_count',\n", 349 | " 'tweet_url', 'lang', 'source', 'geo.full_name']]\n", 350 | "\n", 351 | "# Rename columns for convenience\n", 352 | "my_clean_tweets_df = my_clean_tweets_df.rename(columns={'created_at': 'date', 'public_metrics.retweet_count': 'retweets', \n", 353 | " 'author.username': 'username', 'author.name': 'name', 'author.verified': 'verified', \n", 354 | " 'public_metrics.like_count': 'likes', 'public_metrics.quote_count': 'quotes', \n", 355 | " 'public_metrics.reply_count': 'replies', 'author.description': 'user_bio'})\n", 356 | "\n", 357 | "my_clean_tweets_df" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "id": "f9c4b945-73ca-4792-b317-dfbb1163d0d9", 363 | "metadata": {}, 364 | "source": [ 365 | "Let's get an overview of some of these columns." 
366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "id": "48fc53a0-31f4-47e3-b039-72715d0772f1", 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "my_clean_tweets_df['type'].value_counts()" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "id": "a133b156-e649-4e5a-8b53-d92f8e9512f3", 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "# Create stopwords list (union returns a new set; update would return None)\n", 386 | "stopwords = STOPWORDS.union([\"plums\", \"icebox\"])\n", 387 | "\n", 388 | "# Set up wordcloud\n", 389 | "wc = WordCloud(background_color=\"white\", max_words=50,\n", 390 | "              stopwords=stopwords, contour_width=5, contour_color='steelblue')\n", 391 | "\n", 392 | "# Strip line breaks\n", 393 | "tweet_texts = my_clean_tweets_df['text'].str.replace(r\"\\\\n\", \" \", regex=True)\n", 394 | "# Join all tweet texts together\n", 395 | "tweet_texts = ' '.join(tweet_texts)\n", 396 | "# Generate word cloud\n", 397 | "wc.generate(tweet_texts)\n", 398 | "\n", 399 | "# Create and save word cloud\n", 400 | "plt.figure(figsize = (10,5))\n", 401 | "plt.imshow(wc, interpolation='bilinear')\n", 402 | "plt.axis(\"off\")\n", 403 | "plt.savefig(\"my_tweet_word-cloud.png\", dpi=300)\n", 404 | "plt.show()" 405 | ] 406 | } 407 | ], 408 | "metadata": { 409 | "kernelspec": { 410 | "display_name": "Python 3", 411 | "language": "python", 412 | "name": "python3" 413 | }, 414 | "language_info": { 415 | "codemirror_mode": { 416 | "name": "ipython", 417 | "version": 3 418 | }, 419 | "file_extension": ".py", 420 | "mimetype": "text/x-python", 421 | "name": "python", 422 | "nbconvert_exporter": "python", 423 | "pygments_lexer": "ipython3", 424 | "version": "3.8.8" 425 | }, 426 | "toc": { 427 | "base_numbering": 1, 428 | "nav_menu": {}, 429 | "number_sections": true, 430 | "sideBar": true, 431 | "skip_h1_title": false, 432 | "title_cell": "Table of Contents", 433 | "title_sidebar": "Contents", 434 | "toc_cell": false, 435 | "toc_position": {}, 436 | "toc_section_display": true, 437 | "toc_window_display": true 438 | } 439 | }, 440 | "nbformat": 4, 441 | "nbformat_minor": 5 442 | } 443 | -------------------------------------------------------------------------------- /Working-with-Twitter-Data-2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3e89c5b3", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).\n", 11 | "\n", 12 | "Created by [Melanie Walsh](https://melaniewalsh.org/) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).\n", 13 | "\n", 14 | "For questions/comments/improvements, email melwalsh@uw.edu\n", 15 | "____" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "id": "68f932d1", 21 | "metadata": {}, 22 | "source": [ 23 | "# Working with Twitter Data (Lesson 2) — 6/22/2022\n", 24 | "\n", 25 | "This is lesson **2** of 3 in the educational series on **Working with Twitter Data**. 
This notebook will demonstrate how researchers can download tweet collections that have been shared by other people and share their own tweets in ways that honor users' right to be forgotten while also adhering to Twitter's Terms of Service. \n", 26 | "\n", 27 | "**Audience:** Teachers / Learners / Researchers\n", 28 | "\n", 29 | "**Use case:** Tutorial / How-To\n", 30 | "\n", 31 | "**Difficulty:** Intermediate\n", 32 | "\n", 33 | "**Completion time:** 30 minutes to 1 hour\n", 34 | "\n", 35 | "**Knowledge Required/Recommended:** \n", 36 | "\n", 37 | "* Command line knowledge\n", 38 | "* Python basics (variables, functions, lists, dictionaries)\n", 39 | "* Pandas basics (Python library for data manipulation and analysis)\n", 40 | "\n", 41 | "\n", 42 | "**Learning Objectives:**\n", 43 | "After this lesson, learners will be able to:\n", 44 | "\n", 45 | "1. Download tweet collections from a list of tweet IDs\n", 46 | "2. Convert tweets to tweet IDs for sharing purposes\n", 47 | "\n", 48 | "___" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "157c0555", 54 | "metadata": {}, 55 | "source": [ 56 | "# Required Python Libraries\n", 57 | "* [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/) for collecting Twitter data.\n", 58 | "* [plotly](https://plotly.com/python/) for making interactive plots \n", 59 | "* [pandas](https://pandas.pydata.org/) for manipulating and cleaning data\n", 60 | "\n", 61 | "## Install Required Libraries" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "id": "e8a220f5", 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "### Install Libraries ###\n", 72 | "!pip install twarc --upgrade\n", 73 | "!pip install twarc-csv --upgrade\n", 74 | "!pip install plotly" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 38, 80 | "id": "5480e2a8", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "### Import Libraries ###\n", 85 | "import plotly.express as px\n", 86 | "import pandas as pd\n", 87 | "# Set max column width\n", 88 | "pd.options.display.max_colwidth = 400\n", 89 | "# Set max number of columns\n", 90 | "pd.options.display.max_columns = 95" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "id": "6dcbc763-1ccc-498e-9149-e083e09fae8b", 96 | "metadata": {}, 97 | "source": [ 98 | "# Twitter API Setup" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "id": "02338a15-b842-4932-8d5f-3bc3f47d7be5", 104 | "metadata": {}, 105 | "source": [ 106 | "*This lesson presumes that you've already installed and configured twarc, which was covered in [a previous lesson](Twitter-API-Setup).*" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "d88aaf2a-218b-4a12-8e16-b78faa222bd8", 112 | "metadata": {}, 113 | "source": [ 114 | "## Configure Twarc" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "id": "3dfcebde-bfd7-4164-bad8-af5794fe0b91", 120 | "metadata": {}, 121 | "source": [ 122 | "Once twarc is installed, you need to configure it with your API keys and/or bearer token so that you can actually access the API. " 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "09141ce0-a3f1-4a56-a1de-84d0c761b2e7", 128 | "metadata": {}, 129 | "source": [ 130 | "To configure twarc, you would typically run `twarc2 configure` from the command line. This will prompt twarc to ask for your bearer token, which you can copy and paste into the blank after the colon, and then press enter. You can optionally enter your API keys, as well." 
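
(A hedged aside, not part of the original lesson: if the command-line configuration gives you trouble, twarc's Python client accepts a bearer token directly, with no `twarc2 configure` step needed. A minimal sketch; the token string is a placeholder for your own:)

```python
# A minimal sketch of using twarc's Python client without `twarc2 configure`
from twarc import Twarc2

client = Twarc2(bearer_token="YOUR BEARER TOKEN HERE")  # placeholder token

# Pull a single page of recent results just to confirm the credentials work
for page in client.search_recent("plums icebox", max_results=10):
    print(len(page.get("data", [])), "tweets returned")
    break
```
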
131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "id": "b9f86f51-36dc-4833-9af2-32bb43763f03", 136 | "metadata": {}, 137 | "source": [ 138 | "
\n", 139 | "

Note

\n", 140 | " To get your Bearer Token, go to your Twitter Developer portal: https://developer.twitter.com/en/portal/dashboard\n", 141 | "\n", 142 | "
" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "id": "758aebed-a591-47eb-9a0f-bc7a66d67605", 148 | "metadata": {}, 149 | "source": [ 150 | "However, when working in a Jupyter notebook in the cloud, it is easiest to configure twarc and enter your Bearer Token in a single command. Please paste your Bearer Token between the quotations marks below and run the cell." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "id": "ccaebcf9-7bd7-45ff-adea-fcb5be519aea", 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "#!printf '%s\\n' \"YOUR BEARER TOKEN HERE\" \"no\" | twarc2 configure" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "id": "dec67b43-c517-4e48-a818-c2220f0e34de", 166 | "metadata": {}, 167 | "source": [ 168 | "Now you're ready to collect and analyze tweets!" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "id": "af691423-df15-41c5-b530-806eefd1b397", 174 | "metadata": {}, 175 | "source": [ 176 | "## Tweet IDs" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "id": "cb7f0522-6c28-43b4-8ac4-b4ac26790aae", 182 | "metadata": {}, 183 | "source": [ 184 | "Twitter discourages developers and researchers from sharing full Twitter data openly on the web. They instead encourage developers and researchers to share *tweet IDs*:\n", 185 | "\n", 186 | "> [If you provide Twitter Content to third parties, including downloadable datasets or via an API, you may only distribute **Tweet IDs**, Direct Message IDs, and/or User IDs.](https://developer.twitter.com/en/developer-terms/policy#4-e)\n", 187 | "\n", 188 | "Tweet IDs are unique identifiers assigned to every tweet. They look like a random string of numbers: 1189206626135355397. Each tweet ID can be used to download the full data associated with that tweet (if the tweet still exists). This is a process called \"hydration.\"" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "id": "21d8c832-4ef6-48e0-925f-3ac6f0400f69", 194 | "metadata": {}, 195 | "source": [ 196 | "" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "id": "b8e53537-b14b-473b-a63a-18ef7fea48df", 202 | "metadata": {}, 203 | "source": [ 204 | "**Hydration: a young tweet ID sprouts into a full tweet (to be read in David Attenborough's voice)**" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "id": "ec1b017d-ce45-4763-81b0-936c9158ac3d", 210 | "metadata": {}, 211 | "source": [ 212 | "There are actually two reasons that you might want to dehydrate tweets and/or hydrate tweet IDs: first, to responsibly share Twitter data with others and/or access Twitter data shared by others; second, to get more information about the Twitter data that you yourself collected.\n", 213 | "\n", 214 | "If you collected tweets in real time, for example, you collected those tweets immediately after they were published, which means that they will not contain any retweet or favorite count information. Nobody's had time to retweet them yet! So if you'd like to retroactively get retweet and favorite count information about your tweets, then you would want to dehydrate and rehydrate them." 
215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "id": "47fed50e-7780-44aa-8f3c-13c8fe3d9be3", 220 | "metadata": {}, 221 | "source": [ 222 | "## Hydrate Tweets" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "id": "5eed3a84-c484-4ef2-ab26-788329670130", 228 | "metadata": {}, 229 | "source": [ 230 | "`twarc2 hydrate tweet_ids.txt > tweets.jsonl`" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "id": "a7fc033f-b1fb-4aa4-9531-3a602a675da9", 236 | "metadata": {}, 237 | "source": [ 238 | "To transform a list of tweet IDs into full Twitter data, you can run the twarc command `twarc2 hydrate` with the name of your tweet IDs text file followed by the output operator `>` and the desired name of your JSONL file.\n", 239 | "\n", 240 | "> tweet ID —> tweet = hydration
\n", 241 | "> tweet ID <— tweet = dehydration" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "id": "b9634ce4-4da9-4b04-93df-bb458f9f266b", 247 | "metadata": {}, 248 | "source": [ 249 | "Kevin McElwee conducted a [fascinating analysis](https://www.kmcelwee.com/fortune-100-blm-report/site/index.html) of Fortune 100 companies' response to the #BlackLivesMatter protests following George Floyd's death in May 2020. He collected tweets from 91 of these companies' Twitter accounts between May 25 and July 25, 2020, and he then shared these tweets as [tweet IDs](https://github.com/kmcelwee/fortune-100-blm-dataset/blob/main/data/ids/fortune-100-tweet-ids.txt)." 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "id": "e5abd760-af9f-4cd6-8015-fd41ca94254d", 255 | "metadata": {}, 256 | "source": [ 257 | "Let's re-hydrate the Twitter data that McElwee collected about these Fortune 100 companies." 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 1, 263 | "id": "98ac4979-bc89-4f3d-b907-486611e09a67", 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "!twarc2 hydrate fortune-100-tweet-ids.txt > fortune-100-tweets.jsonl" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 12, 273 | "id": "930db0ec-5448-47b3-bd65-6c40bd93e44e", 274 | "metadata": {}, 275 | "outputs": [ 276 | { 277 | "name": "stdout", 278 | "output_type": "stream", 279 | "text": [ 280 | "100%|██████████████| Processed 17.6M/17.6M of input file [00:02<00:00, 6.99MB/s]\n", 281 | "\n", 282 | "ℹ️\n", 283 | "Parsed 6071 tweets objects from 62 lines in the input file.\n", 284 | "Wrote 6071 rows and output 74 columns in the CSV.\n", 285 | "\n" 286 | ] 287 | } 288 | ], 289 | "source": [ 290 | "!twarc2 csv fortune-100-tweets.jsonl fortune-100-tweets.csv" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "id": "e8a42226-14df-4894-a692-a0398c6a26ce", 296 | "metadata": {}, 297 | "source": [ 298 | "Read in CSV file" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 14, 304 | "id": "3f145a0b-c624-4878-a131-8d72b67f825e", 305 | "metadata": { 306 | "scrolled": true, 307 | "tags": [] 308 | }, 309 | "outputs": [], 310 | "source": [ 311 | "tweets_df = pd.read_csv('fortune-100-tweets.csv', parse_dates = ['created_at'])" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "id": "0e4c2126-3796-43cf-b34e-421f15074cab", 317 | "metadata": {}, 318 | "source": [ 319 | "Let's apply the helper functions and create new columns." 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 15, 325 | "id": "b36e22a8-b5e4-439c-9019-c075f613b083", 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "# Find the type of tweet\n", 330 | "def find_type(tweet):\n", 331 | " \n", 332 | " # Check to see if tweet contains retweet, quote tweet, or reply tweet info\n", 333 | " contains_retweet = tweet['referenced_tweets.retweeted.id']\n", 334 | " contains_quote = tweet['referenced_tweets.quoted.id']\n", 335 | " contains_reply = tweet['referenced_tweets.replied_to.id']\n", 336 | " \n", 337 | " # Does tweet contain retweet info? (Is this category not NA or empty?)\n", 338 | " if pd.notna(contains_retweet):\n", 339 | " return \"retweet\"\n", 340 | " # Does tweet contain quote and reply info?\n", 341 | " elif pd.notna(contains_quote) and pd.notna(contains_reply):\n", 342 | " return \"quote/reply\"\n", 343 | " # Does tweet contain quote info? 
\n", 344 | " elif pd.notna(contains_quote):\n", 345 | " return \"quote\"\n", 346 | " # Does tweet contain reply info? \n", 347 | " elif pd.notna(contains_reply):\n", 348 | " return \"reply\"\n", 349 | " # If it doesn't contain any of this info, it must be an original tweet\n", 350 | " else:\n", 351 | " return \"original\"\n", 352 | "\n", 353 | "# Make Tweet URL\n", 354 | "def make_tweet_url(tweets):\n", 355 | " # Get username\n", 356 | " username = tweets[0]\n", 357 | " # Get tweet ID\n", 358 | " tweet_id = tweets[1]\n", 359 | " # Make tweet URL\n", 360 | " tweet_url = f\"https://twitter.com/{username}/status/{tweet_id}\"\n", 361 | " return tweet_url" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": 16, 367 | "id": "9fcfa8c3-f7bf-4117-b539-68f324a7f8d8", 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "# Create tweet type column\n", 372 | "tweets_df['type'] = tweets_df.apply(find_type, axis =1)\n", 373 | "# Create tweet URL column\n", 374 | "tweets_df['tweet_url'] = tweets_df[['author.username', 'id']].apply(make_tweet_url, axis='columns')" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "id": "31e6501b-ec3a-4871-b731-d134f45ce32e", 380 | "metadata": {}, 381 | "source": [ 382 | "Let's select and rename only the columns we're interested in." 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "id": "484bcc77-735e-49f8-9c75-e9505c73d242", 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "# Select columns of interest\n", 393 | "clean_tweets_df = tweets_df[['created_at', 'author.username', 'author.name', 'author.description',\n", 394 | " 'author.verified', 'type', 'text', 'public_metrics.retweet_count', \n", 395 | " 'public_metrics.like_count', 'public_metrics.reply_count', 'public_metrics.quote_count',\n", 396 | " 'tweet_url', 'lang', 'source', 'geo.full_name']]\n", 397 | "\n", 398 | "# Rename columns for convenience\n", 399 | "clean_tweets_df = clean_tweets_df.rename(columns={'created_at': 'date', 'public_metrics.retweet_count': 'retweets', \n", 400 | " 'author.username': 'username', 'author.name': 'name', 'author.verified': 'verified', \n", 401 | " 'public_metrics.like_count': 'likes', 'public_metrics.quote_count': 'quotes', \n", 402 | " 'public_metrics.reply_count': 'replies', 'author.description': 'user_bio'})\n", 403 | "\n", 404 | "clean_tweets_df" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "id": "a8ada66d-0e66-4786-a66c-bc5d8a8ecd79", 410 | "metadata": {}, 411 | "source": [ 412 | "### Sort By Retweets" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "id": "3acf9893-e52a-46c4-a74b-832c30b2c672", 418 | "metadata": {}, 419 | "source": [ 420 | "What are the top retweeted tweets? We can use [pandas `sort_values()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) to sort any column in ascending or descending order." 
421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "id": "f51d53b5-bbda-47a9-becb-63f8a28a7ce9", 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [ 430 | "clean_tweets_df.sort_values(by=\"retweets\", ascending=False)" 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "id": "e7f78ee2-fa92-440e-931a-71d1b21e4c5b", 436 | "metadata": {}, 437 | "source": [ 438 | "### Count Tweets Per Category" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "id": "4eb00fa3-ae86-47b2-907e-3593bea01759", 444 | "metadata": {}, 445 | "source": [ 446 | "How many tweets from each company are here?" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": null, 452 | "id": "abf42225-2335-4afd-bfae-63cb511abe58", 453 | "metadata": {}, 454 | "outputs": [], 455 | "source": [ 456 | "clean_tweets_df['name'].value_counts()" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "id": "96a921d5-7675-4191-8b23-9915a1c60d6d", 462 | "metadata": {}, 463 | "source": [ 464 | "## Filter Tweets by Text" 465 | ] 466 | }, 467 | { 468 | "cell_type": "markdown", 469 | "id": "0f33881f-4de3-42bb-994b-19aac7f44d22", 470 | "metadata": {}, 471 | "source": [ 472 | "How many tweets contain the word \"Black\" as in \"Black Americans\" or \"Black Lives Matter\"? We can use [pandas `.str.contains()` method](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html) to search for specific text." 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "id": "c9f8d882-5a3a-4ea3-85ab-2674f6756bdc", 479 | "metadata": {}, 480 | "outputs": [], 481 | "source": [ 482 | "text_filter = clean_tweets_df['text'].str.contains(\"Black\")\n", 483 | "clean_tweets_df[text_filter]" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "id": "ec3066f3-5b72-4330-927c-89e0e372e5b2", 489 | "metadata": {}, 490 | "source": [ 491 | "Kevin McElwee's analysis is also an excellent example of what's possible when you combine computational work with qualitative coding and hand tagging: https://www.kmcelwee.com/fortune-100-blm-report/site/corporate-summaries.html" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "id": "aae451ad-ac2f-4a47-b287-41f7d1463d36", 497 | "metadata": {}, 498 | "source": [ 499 | "## Dehydrate Tweets" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "id": "19056a2a-6b63-4fee-a1ae-b9d865d95c12", 505 | "metadata": {}, 506 | "source": [ 507 | "`twarc2 dehydrate tweets.jsonl > tweet_ids.txt`" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "id": "08580867-5d01-49a8-a8c3-44e13f256710", 513 | "metadata": {}, 514 | "source": [ 515 | "To transform your own Twitter data into a list of tweet IDs (so that you can share your data openly on the web), you can run the twarc command `twarc2 dehydrate` with the name of your JSONL file followed by the output operator `>` and the desired name of your tweet ID text file.\n", 516 | "\n", 517 | "> tweet ID —> tweet = hydration
\n", 518 | "> tweet ID <— tweet = dehydration" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "id": "5ac32ed1-0120-4f7b-a149-6068f9fe5cb7", 524 | "metadata": {}, 525 | "source": [ 526 | "
\n", 527 | "

Attention 🚨

\n", 528 | " Remember that the --archive flag and full-archive search functionality is only available to those who have an Academic Research account. \n", 529 | " Students with Essential API access should remove the --archive flag from the code below.\n", 530 | "\n", 531 | "
" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": 24, 537 | "id": "a30db749-ee34-4ef9-a209-bb58dc7cc0be", 538 | "metadata": { 539 | "scrolled": true, 540 | "tags": [ 541 | "output_scroll" 542 | ] 543 | }, 544 | "outputs": [], 545 | "source": [ 546 | "!twarc2 search --archive \"plums in the icebox is:verified\" --limit 500 > tweets.jsonl" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "id": "253c1ff1-e484-483d-8a8c-5b67093f742e", 552 | "metadata": {}, 553 | "source": [ 554 | "Let's dehydrate the Twitter data that we collected." 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 26, 560 | "id": "7cb09fb6-a503-4ef6-972a-41a4d3b360cb", 561 | "metadata": {}, 562 | "outputs": [ 563 | { 564 | "name": "stdout", 565 | "output_type": "stream", 566 | "text": [ 567 | "ℹ️ Parsed 578 tweets IDs from 6 lines in tweets.jsonl file.\n" 568 | ] 569 | } 570 | ], 571 | "source": [ 572 | "!twarc2 dehydrate tweets.jsonl > tweets.txt" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "id": "da17654b-5457-41a7-8b80-ba445096a730", 578 | "metadata": {}, 579 | "source": [ 580 | "If we `open()` and `.read()` the tweet IDs file that we just created, it looks something like this:" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": null, 586 | "id": "ecad5a90-517c-4cca-bbbd-0050a46d624c", 587 | "metadata": { 588 | "scrolled": true, 589 | "tags": [] 590 | }, 591 | "outputs": [], 592 | "source": [ 593 | "tweet_ids = open(\"tweets.txt\", encoding=\"utf-8\").read()\n", 594 | "print(tweet_ids)" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "id": "f4e75127-bb20-41dc-a245-33502dd0d81d", 600 | "metadata": {}, 601 | "source": [ 602 | "## Where to Find Tweet IDs" 603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "id": "782fd122-003c-4f7d-9c7d-9a8427be63db", 608 | "metadata": {}, 609 | "source": [ 610 | "You can find repositories of tweet IDs that have been shared by other researchers in the following places:" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "id": "9bf61a7a-331f-4d93-8542-ff133ad88331", 616 | "metadata": {}, 617 | "source": [ 618 | "- DocNow Catalog: https://catalog.docnow.io/\n", 619 | "\n", 620 | "- George Washington University Tweet IDs: https://dataverse.harvard.edu/dataverse/gwu-libraries" 621 | ] 622 | } 623 | ], 624 | "metadata": { 625 | "kernelspec": { 626 | "display_name": "Python 3", 627 | "language": "python", 628 | "name": "python3" 629 | }, 630 | "language_info": { 631 | "codemirror_mode": { 632 | "name": "ipython", 633 | "version": 3 634 | }, 635 | "file_extension": ".py", 636 | "mimetype": "text/x-python", 637 | "name": "python", 638 | "nbconvert_exporter": "python", 639 | "pygments_lexer": "ipython3", 640 | "version": "3.8.8" 641 | }, 642 | "toc": { 643 | "base_numbering": 1, 644 | "nav_menu": {}, 645 | "number_sections": true, 646 | "sideBar": true, 647 | "skip_h1_title": false, 648 | "title_cell": "Table of Contents", 649 | "title_sidebar": "Contents", 650 | "toc_cell": false, 651 | "toc_position": {}, 652 | "toc_section_display": true, 653 | "toc_window_display": true 654 | } 655 | }, 656 | "nbformat": 4, 657 | "nbformat_minor": 5 658 | } 659 | -------------------------------------------------------------------------------- /Working-with-Twitter-Data-3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3e89c5b3", 6 | "metadata": {}, 7 | "source": [ 
8 | "\n", 9 | "\n", 10 | "This notebook free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).\n", 11 | "\n", 12 | "Created by [Melanie Walsh](https://melaniewalsh.org/) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).\n", 13 | "\n", 14 | "For questions/comments/improvements, email melwalsh@uw.edu\n", 15 | "____" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "id": "68f932d1", 21 | "metadata": {}, 22 | "source": [ 23 | "# Working with Twitter Data (Lesson 3) — 6/24/2022\n", 24 | "\n", 25 | "This is lesson **3** of 3 in the educational series on **Working with Twitter Data**. This notebook will demonstrate how researchers can collect tweets from a user's timeline (or multiple users' timelines), how to find out information about who a particular Twitter user is following and who is following that user in turn, and how to work with the new \"context annotations\" metadata, which provides extra contextual information about tweets.\n", 26 | "\n", 27 | "**Audience:** Teachers / Learners / Researchers\n", 28 | "\n", 29 | "**Use case:** Tutorial / How-To\n", 30 | "\n", 31 | "**Difficulty:** Intermediate\n", 32 | "\n", 33 | "**Completion time:** 30 minutes to 1 hour\n", 34 | "\n", 35 | "**Knowledge Required/Recommended:** \n", 36 | "\n", 37 | "* Command line knowledge\n", 38 | "* Python basics (variables, functions, lists, dictionaries)\n", 39 | "* Pandas basics (Python library for data manipulation and analysis)\n", 40 | "\n", 41 | "\n", 42 | "**Learning Objectives:**\n", 43 | "After this lesson, learners will be able to:\n", 44 | "\n", 45 | "1. Collect tweets from a specific Twitter user's timeline\n", 46 | "2. Collect data about the Twitter accounts that a specific user is following\n", 47 | "3. Collect data about the Twitter accounts that are following a specific user\n", 48 | "4. 
Work with the new \"context annotations\" metadata\n", 49 | "\n", 50 | "___" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "157c0555", 56 | "metadata": {}, 57 | "source": [ 58 | "# Required Python Libraries\n", 59 | "* [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/) for collecting Twitter data.\n", 60 | "* [plotly](https://plotly.com/python/) for making interactive plots \n", 61 | "* [pandas](https://pandas.pydata.org/) for manipulating and cleaning data\n", 62 | "\n", 63 | "## Install Required Libraries" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "id": "e8a220f5", 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "### Install Libraries ###\n", 74 | "!pip install twarc --upgrade\n", 75 | "!pip install twarc-csv --upgrade\n", 76 | "!pip install plotly" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "id": "5480e2a8", 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "### Import Libraries ###\n", 87 | "import plotly.express as px\n", 88 | "import pandas as pd\n", 89 | "# Set max column width\n", 90 | "pd.options.display.max_colwidth = 400\n", 91 | "# Set max number of columns\n", 92 | "pd.options.display.max_columns = 95" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "id": "6dcbc763-1ccc-498e-9149-e083e09fae8b", 98 | "metadata": {}, 99 | "source": [ 100 | "# Twitter API Setup" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "id": "02338a15-b842-4932-8d5f-3bc3f47d7be5", 106 | "metadata": {}, 107 | "source": [ 108 | "*This lesson presumes that you've already installed and configured twarc, which was covered in [a previous lesson](Twitter-API-Setup).*" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "id": "d88aaf2a-218b-4a12-8e16-b78faa222bd8", 114 | "metadata": {}, 115 | "source": [ 116 | "## Configure Twarc" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "id": "3dfcebde-bfd7-4164-bad8-af5794fe0b91", 122 | "metadata": {}, 123 | "source": [ 124 | "Once twarc is installed, you need to configure it with your API keys and/or bearer token so that you can actually access the API. " 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "id": "09141ce0-a3f1-4a56-a1de-84d0c761b2e7", 130 | "metadata": {}, 131 | "source": [ 132 | "To configure twarc, you would typically run `twarc2 configure` from the command line. This will prompt twarc to ask for your bearer token, which you can copy and paste into the blank after the colon, and then press enter. You can optionally enter your API keys, as well." 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "id": "b9f86f51-36dc-4833-9af2-32bb43763f03", 138 | "metadata": {}, 139 | "source": [ 140 | "
\n", 141 | "

Note

\n", 142 | " To get your Bearer Token, go to your Twitter Developer portal: https://developer.twitter.com/en/portal/dashboard\n", 143 | "\n", 144 | "
" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "id": "758aebed-a591-47eb-9a0f-bc7a66d67605", 150 | "metadata": {}, 151 | "source": [ 152 | "However, when working in a Jupyter notebook in the cloud, it is easiest to configure twarc and enter your Bearer Token in a single command. Please paste your Bearer Token between the quotations marks below and run the cell." 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "id": "ccaebcf9-7bd7-45ff-adea-fcb5be519aea", 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "!printf '%s\\n' \"YOUR BEARER TOKEN HERE\" \"no\" | twarc2 configure" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "id": "dec67b43-c517-4e48-a818-c2220f0e34de", 168 | "metadata": {}, 169 | "source": [ 170 | "Now you're ready to collect and analyze tweets!" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "id": "10e42a23-368b-4114-bf41-2bbeb87d2f84", 176 | "metadata": {}, 177 | "source": [ 178 | "## Get a Users' Timeline (3200 Tweets)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "id": "3140dd2f-1ad2-42e6-b80d-9b98a4b2d29e", 184 | "metadata": {}, 185 | "source": [ 186 | "To get all the most recent tweets from a Twitter user's timeline (up to 3200 tweets), we will use [`twarc2 timeline username`](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#timeline_1). We could also get tweets for multiple users by including a text file instead of a single username, e.g., [`twarc2 timeline usernames.txt`](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#timeline_1)\n", 187 | "\n", 188 | "If you have access to the Academic Research track of the Twitter API, you can actually get all tweets from a user by including the flag `--use-search`." 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "id": "1bea45e2-a2e0-45f0-bc88-f4ad46d9883d", 194 | "metadata": {}, 195 | "source": [ 196 | "Let's collect tweets from President Joe Biden's timeline: https://twitter.com/POTUS 🧐 What do you think the topic of the most retweeted tweets will be...?" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "id": "95907f49-cbb8-4aac-92f2-ffad04621669", 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "!twarc2 timeline potus potus-tweets.jsonl" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "id": "20d9b18a-9951-42ff-a6ae-06ea77009fc1", 212 | "metadata": {}, 213 | "source": [ 214 | "Let's convert these tweets to a CSV file" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "id": "4f5b1010-c0bd-47ce-828b-681ae2de397d", 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "!twarc2 csv potus-tweets.jsonl potus-tweets.csv" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "id": "b73dd4c1-8d86-48b1-a8ae-414225e00d41", 230 | "metadata": {}, 231 | "source": [ 232 | "Let's read in the CSV file." 
233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "id": "0164cf42-2922-4f7b-a1be-d53b7bd724df", 239 | "metadata": { 240 | "scrolled": true, 241 | "tags": [] 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "tweets_df = pd.read_csv('potus-tweets.csv', parse_dates = ['created_at'])" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "id": "d85ef975-e4bb-4406-bf07-9e694f4557de", 251 | "metadata": {}, 252 | "source": [ 253 | "Let's apply our helper functions and create new columns for type of tweet and tweet URL." 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "id": "ea0be957-bddb-4f4f-8095-1b1103fe96e9", 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [ 263 | "# Find the type of tweet\n", 264 | "def find_type(tweet):\n", 265 | " \n", 266 | " # Check to see if tweet contains retweet, quote tweet, or reply tweet info\n", 267 | " contains_retweet = tweet['referenced_tweets.retweeted.id']\n", 268 | " contains_quote = tweet['referenced_tweets.quoted.id']\n", 269 | " contains_reply = tweet['referenced_tweets.replied_to.id']\n", 270 | " \n", 271 | " # Does tweet contain retweet info? (Is this category not NA or empty?)\n", 272 | " if pd.notna(contains_retweet):\n", 273 | " return \"retweet\"\n", 274 | " # Does tweet contain quote and reply info?\n", 275 | " elif pd.notna(contains_quote) and pd.notna(contains_reply):\n", 276 | " return \"quote/reply\"\n", 277 | " # Does tweet contain quote info? \n", 278 | " elif pd.notna(contains_quote):\n", 279 | " return \"quote\"\n", 280 | " # Does tweet contain reply info? \n", 281 | " elif pd.notna(contains_reply):\n", 282 | " return \"reply\"\n", 283 | " # If it doesn't contain any of this info, it must be an original tweet\n", 284 | " else:\n", 285 | " return \"original\"\n", 286 | "\n", 287 | "# Make Tweet URL\n", 288 | "def make_tweet_url(tweets):\n", 289 | " # Get username\n", 290 | " username = tweets[0]\n", 291 | " # Get tweet ID\n", 292 | " tweet_id = tweets[1]\n", 293 | " # Make tweet URL\n", 294 | " tweet_url = f\"https://twitter.com/{username}/status/{tweet_id}\"\n", 295 | " return tweet_url" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "id": "233236ed-9986-4ec0-93cb-c70102854369", 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "# Create tweet type column\n", 306 | "tweets_df['type'] = tweets_df.apply(find_type, axis =1)\n", 307 | "# Create tweet URL column\n", 308 | "tweets_df['tweet_url'] = tweets_df[['author.username', 'id']].apply(make_tweet_url, axis='columns')" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "id": "856f5e92-1c97-4715-9895-b7cfe3552cf9", 314 | "metadata": {}, 315 | "source": [ 316 | "Let's select and rename only the columns we're interested in." 
317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": null, 322 | "id": "1c880a2f-144e-4a03-a400-dc56ca9de5b9", 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "# Select columns of interest\n", 327 | "clean_tweets_df = tweets_df[['created_at', 'author.username', 'author.name', 'author.description',\n", 328 | " 'author.verified', 'type', 'text', 'public_metrics.retweet_count', \n", 329 | " 'public_metrics.like_count', 'public_metrics.reply_count', 'public_metrics.quote_count',\n", 330 | " 'tweet_url', 'lang', 'source', 'geo.full_name']]\n", 331 | "\n", 332 | "# Rename columns for convenience\n", 333 | "clean_tweets_df = clean_tweets_df.rename(columns={'created_at': 'date', 'public_metrics.retweet_count': 'retweets', \n", 334 | " 'author.username': 'username', 'author.name': 'name', 'author.verified': 'verified', \n", 335 | " 'public_metrics.like_count': 'likes', 'public_metrics.quote_count': 'quotes', \n", 336 | " 'public_metrics.reply_count': 'replies', 'author.description': 'user_bio'})\n", 337 | "\n", 338 | "clean_tweets_df" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "id": "ea1bace1-f2b7-41e4-b7fb-4eea2bd8db42", 344 | "metadata": {}, 345 | "source": [ 346 | "We can also create a date column that does not have hour/minute/second information, like so" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "id": "aaac9b33-5627-492b-93cd-43a25b78fed7", 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "clean_tweets_df['formatted_date'] = clean_tweets_df['date'].dt.date" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "id": "5475c9a6-378a-4df5-a2de-44e5fc9f00aa", 362 | "metadata": {}, 363 | "source": [ 364 | "## Code Tweet Data By Keyword" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "id": "08f7587c-3b97-4537-941a-145a30791d67", 370 | "metadata": {}, 371 | "source": [ 372 | "In the previous lesson, we saw how Kevin McElwee was able to produce a [really cool Twitter analysis](https://www.kmcelwee.com/fortune-100-blm-report/site/index.html) by qualitatively coding whether Fortune 100 tweets were discussing racial justice or not.\n", 373 | "\n", 374 | "I wanted to show a quick example of how we can use a Python function to do something similar: code whether or not a tweet contains certain keywords.\n", 375 | "\n", 376 | "The function below will check to see whether a tweet contains any words that are included in the list `keywords`. In this example, we're coding whether or not the tweet is discussing COVID. 
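
(Before the lesson's function-based version below, a hedged aside: pandas can do the same keyword coding in one vectorized step with the `.str.contains()` method we used in Lesson 2. A sketch, assuming `clean_tweets_df` from above; the keywords are the example's own:)

```python
# A minimal vectorized sketch of the same keyword coding
# (na=False treats any missing text as a non-match)
pattern = "|".join(["COVID", "virus"])  # regex alternation: "COVID|virus"
clean_tweets_df["text"].str.contains(pattern, regex=True, na=False)
```
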
" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "id": "b8ef4750-1102-4bb8-a11d-a826b08e3829", 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "def check_for_keywords(text):\n", 387 | " \n", 388 | " # Pick your own keywords!\n", 389 | " keywords = [\"COVID\", \"virus\"]\n", 390 | " \n", 391 | " for word in keywords:\n", 392 | " if word in text:\n", 393 | " return True\n", 394 | " else:\n", 395 | " return False" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "id": "07b6098e-f7c6-40a2-98c2-44e78343f8b0", 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [ 405 | "check_for_keywords(\"The COVID-19 crisis is serious\")" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "id": "456060dc-da7e-42f0-8987-e45061845701", 411 | "metadata": {}, 412 | "source": [ 413 | "We can create a new column (which could be named whatver we want) by applying this function to the text column." 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "id": "371b6a95-5560-4345-ba8e-88f1dc666627", 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "clean_tweets_df['COVID?'] = clean_tweets_df['text'].apply(check_for_keywords)\n", 424 | "#clean_tweets_df['your own column name'] = clean_tweets_df['text'].apply(check_for_keywords)" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "id": "88cbe1df-fbcd-4359-9160-7c47d86b8426", 430 | "metadata": {}, 431 | "source": [ 432 | "Now we can use this new column to filter and examime only the tweets that are explicitly discussing COVID." 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | "id": "d3cb7767-733a-4e9a-adf3-0bfb59b654fe", 439 | "metadata": {}, 440 | "outputs": [], 441 | "source": [ 442 | "clean_tweets_df[clean_tweets_df[\"COVID?\"] == True]" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "id": "ff408aca-b0d1-4ecd-b2a9-fd603de9a5f9", 448 | "metadata": {}, 449 | "source": [ 450 | "## Save Tweets as Spreadsheet" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "id": "9ccff287-23e5-42e6-bf2b-7ab7dfd0446d", 456 | "metadata": {}, 457 | "source": [ 458 | "Anytime we want to save a dataframe as a spreadhsheet, we can use the `.to_csv()` function." 
459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "id": "5fbe6b9b-f4bd-438d-9e66-4dd58f6eb1c5", 465 | "metadata": {}, 466 | "outputs": [], 467 | "source": [ 468 | "clean_tweets_df.to_csv(\"clean-potus-tweets.csv\", \n", 469 | " # remove the index\n", 470 | " index=False)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "id": "fc3194b8-a211-49a3-8e68-82a86d651d8d", 476 | "metadata": {}, 477 | "source": [ 478 | "## Datawrapper" 479 | ] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "id": "5521b684-8f86-4bb9-8894-1e05f69828f9", 484 | "metadata": {}, 485 | "source": [ 486 | "With only the data that we just collected and coded, we can make a sophisticated data visualization — either in Python or with a different data visualization platform.\n", 487 | "\n", 488 | "For example, if we drop our CSV file into Datawrapper (https://www.datawrapper.de/), we can create something that looks like this:" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "id": "b84427a5-ff41-4c2e-b466-b5c383b64e75", 495 | "metadata": {}, 496 | "outputs": [], 497 | "source": [ 498 | "%%html\n", 499 | "" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "id": "620080ff-9384-48cf-8587-e3b6e5ebb1b1", 505 | "metadata": {}, 506 | "source": [ 507 | "Be sure to check out these tips for customizing Datawrapper tooltips with HTML: https://academy.datawrapper.de/article/237-i-want-to-change-how-my-data-appears-in-tooltips" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "id": "319d39e9-1d51-4c5b-baf3-3f863494ee3b", 513 | "metadata": {}, 514 | "source": [ 515 | "## Get Who a User Is Following" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "id": "be6c41f4-e399-45d0-8f98-a63a12a88553", 521 | "metadata": {}, 522 | "source": [ 523 | "We can also use the Twitter API to find out who a Twitter user is following and who is following that user. Researchers and journalists have used follower/following data in a number of ways, such as examining [how conservative vs. liberal politicians gained or lost followers](https://www.theverge.com/2022/4/27/23045005/conservative-twitter-follower-boost-musk-acquisition-data) after Elon Musk finalized his deal to buy Twitter (via The Verge). \n", 524 | "\n", 525 | "To get information about all the Twitter accounts that a particular Twitter user is following, we will use [`twarc2 following username`](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#following_1).\n", 526 | "\n" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "id": "80cba258-bf4e-4e95-8b48-536df8af86ec", 532 | "metadata": {}, 533 | "source": [ 534 | "Let's see who Joe Biden is following on Twitter." 
535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "id": "2685ca3b-124d-44f0-a5e8-a668f7db5f59", 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "!twarc2 following potus potus_following.jsonl " 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "id": "2dfaba55-1e25-453c-b7a4-3d7343940627", 550 | "metadata": {}, 551 | "source": [ 552 | "To convert this user data into a CSV file, we can use `twarc2 csv` but we have to include a special flag that specifies this is user data, not tweet data `--input-data-type`" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "id": "f0be5c6f-5e81-4b15-82e3-7198f23d3449", 559 | "metadata": {}, 560 | "outputs": [], 561 | "source": [ 562 | "!twarc2 csv potus_following.jsonl --input-data-type users potus_following.csv" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "id": "e66638a3-ee69-4016-b280-79d40301b96d", 568 | "metadata": {}, 569 | "source": [ 570 | "Let's see what this data looks like:" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "id": "950250c6-a669-4810-8295-ce12b99cf50d", 577 | "metadata": { 578 | "scrolled": true, 579 | "tags": [] 580 | }, 581 | "outputs": [], 582 | "source": [ 583 | "following_df = pd.read_csv('potus_following.csv', parse_dates = ['created_at'])" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": null, 589 | "id": "0068f50c-dac4-4524-b703-ebb2385a366b", 590 | "metadata": {}, 591 | "outputs": [], 592 | "source": [ 593 | "following_df = following_df.rename(columns={'public_metrics.following_count': 'following', \n", 594 | " 'public_metrics.followers_count': 'followers', \n", 595 | " 'public_metrics.tweet_count': 'tweets',\n", 596 | " })\n", 597 | "following_df = following_df[[\"created_at\", \"username\", \"name\", \"description\", \"location\", \"followers\",\n", 598 | " \"following\", \"tweets\", \"url\", \"verified\"]]" 599 | ] 600 | }, 601 | { 602 | "cell_type": "markdown", 603 | "id": "591e62de-69bb-4d27-a7d9-c08ca5adc290", 604 | "metadata": {}, 605 | "source": [ 606 | "Which of the Twitter accounts that Joe Biden is following has the most followers, the most total tweets, and the most accounts that they themselves are following?" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "id": "5a71c02b-aaf3-43b6-a6a0-3ae94e251008", 613 | "metadata": {}, 614 | "outputs": [], 615 | "source": [ 616 | "following_df.sort_values(\"followers\", ascending=False)" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "id": "a2b32d75-f2a8-4414-9b81-08e45e366666", 622 | "metadata": {}, 623 | "source": [ 624 | "We could imagine that we might want to collect tweets for all of these Twitter accounts. To do so, we might write all these usernames to a text file." 
625 | ]
626 | },
627 | {
628 | "cell_type": "code",
629 | "execution_count": null,
630 | "id": "41a469ae-6d3c-4b51-8c78-f4d89de970ad",
631 | "metadata": {},
632 | "outputs": [],
633 | "source": [
634 | "following_df['username']"
635 | ]
636 | },
637 | {
638 | "cell_type": "markdown",
639 | "id": "bed8a9ac-494f-4551-b2b5-5388b9f7097d",
640 | "metadata": {},
641 | "source": [
642 | "Write the usernames to a text file:"
643 | ]
644 | },
645 | {
646 | "cell_type": "code",
647 | "execution_count": null,
648 | "id": "ce08057e-9e44-4300-951e-c7b86edaf2b8",
649 | "metadata": {},
650 | "outputs": [],
651 | "source": [
652 | "following_df['username'].to_csv(\"usernames.txt\", index=False, header=False)"
653 | ]
654 | },
655 | {
656 | "cell_type": "markdown",
657 | "id": "fb67a51e-3b21-414f-9eb3-194c3806b135",
658 | "metadata": {},
659 | "source": [
660 | "Get the timelines for all of those users. The next two commands are commented out because collecting full timelines for dozens of accounts can take a long time; remove the leading `#` to run them."
661 | ]
662 | },
663 | {
664 | "cell_type": "code",
665 | "execution_count": null,
666 | "id": "e1f8cf74-96c4-4823-a1f1-8410bdbcc829",
667 | "metadata": {},
668 | "outputs": [],
669 | "source": [
670 | "#!twarc2 timelines usernames.txt all_timelines.jsonl"
671 | ]
672 | },
673 | {
674 | "cell_type": "code",
675 | "execution_count": null,
676 | "id": "197476e5-50e4-4687-82a1-8e844d8c8c25",
677 | "metadata": {},
678 | "outputs": [],
679 | "source": [
680 | "#!twarc2 csv all_timelines.jsonl all_timelines.csv"
681 | ]
682 | },
683 | {
684 | "cell_type": "markdown",
685 | "id": "4c6d6412-3825-4565-9c98-faa0b719fc5b",
686 | "metadata": {},
687 | "source": [
688 | "## Get a User's Followers"
689 | ]
690 | },
691 | {
692 | "cell_type": "markdown",
693 | "id": "33e1b2b4-d8de-440f-b5e1-19fa364144da",
694 | "metadata": {},
695 | "source": [
696 | "To get information about all the Twitter accounts that follow a particular Twitter user, we will use [`twarc2 followers username`](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#following_1)."
697 | ]
698 | },
699 | {
700 | "cell_type": "markdown",
701 | "id": "9f74b280-081f-400a-a699-aee8bd3bfa94",
702 | "metadata": {},
703 | "source": [
704 | "Joe Biden has too many followers for a quick example, so let's see who's following my William Carlos Williams Twitter bot.\n",
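"\n",
"Why does account size matter here? At the time of writing, the followers endpoint is documented to return at most 1,000 users per request, with a limit of 15 requests per 15-minute window, so big accounts take a very long time to collect. A back-of-the-envelope sketch (the ~31 million figure for @POTUS is an assumption for illustration):\n",
"\n",
"```python\n",
"def estimated_hours(follower_count, per_request=1000, requests_per_window=15, window_minutes=15):\n",
"    \"\"\"Rough time to page through a full follower list at the documented rate limit.\"\"\"\n",
"    requests_needed = -(-follower_count // per_request)  # ceiling division\n",
"    windows_needed = -(-requests_needed // requests_per_window)\n",
"    return windows_needed * window_minutes / 60\n",
"\n",
"estimated_hours(31_000_000)  # about 517 hours, or roughly three weeks\n",
"```"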
705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": null, 710 | "id": "95e1a46f-a395-4edc-a398-0ff544a2c51a", 711 | "metadata": {}, 712 | "outputs": [], 713 | "source": [ 714 | "!twarc2 followers sosweetbot sosweetbot_followers.jsonl " 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": null, 720 | "id": "316bbe04-d367-4308-ba04-abfe83b186d3", 721 | "metadata": {}, 722 | "outputs": [], 723 | "source": [ 724 | "!twarc2 csv sosweetbot_followers.jsonl --input-data-type users sosweetbot_followers.csv" 725 | ] 726 | }, 727 | { 728 | "cell_type": "code", 729 | "execution_count": null, 730 | "id": "c5f22586-8bd9-44fc-aefa-e842e00e533d", 731 | "metadata": { 732 | "scrolled": true, 733 | "tags": [] 734 | }, 735 | "outputs": [], 736 | "source": [ 737 | "followers_df = pd.read_csv('sosweetbot_followers.csv', parse_dates = ['created_at'])" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": null, 743 | "id": "f1532dd7-5f4f-439e-8038-711a2587ed54", 744 | "metadata": {}, 745 | "outputs": [], 746 | "source": [ 747 | "clean_followers_df = followers_df.rename(columns={'public_metrics.following_count': 'following', \n", 748 | " 'public_metrics.followers_count': 'followers', \n", 749 | " 'public_metrics.tweet_count': 'tweets',\n", 750 | " })\n", 751 | "clean_followers_df = clean_followers_df[[\"created_at\", \"username\", \"name\", \"description\", \"location\", \"followers\",\n", 752 | " \"following\", \"tweets\", \"url\", \"verified\"]]" 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": null, 758 | "id": "2418cb24-0fac-4b9d-8edd-ee11d1e389b9", 759 | "metadata": {}, 760 | "outputs": [], 761 | "source": [ 762 | "clean_followers_df" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "id": "58f32580-d80c-422a-9b02-84cc44c5d2dd", 768 | "metadata": {}, 769 | "source": [ 770 | "## Context Annotations" 771 | ] 772 | }, 773 | { 774 | "cell_type": "markdown", 775 | "id": "f349b863-aafd-435a-82c0-8b0b5aa64abc", 776 | "metadata": {}, 777 | "source": [ 778 | "Twitter recently introduced a new piece of metadata for tweets: **context annotations**. These annotations are supposed to help document the contextual topic of a tweet, even if the topic itself is not explicitly mentioned in the tweet.\n", 779 | "\n", 780 | "> How does Twitter context annotations work?\n", 781 | "\n", 782 | "> Twitter classifies Tweets semantically, meaning that we curate lists of keywords, hashtags, and @handles that are relevant to a given topic. If a Tweet contains the text we’ve specified, it will be labeled appropriately. This differs from a machine learning approach where a model is trained specifically to classify text (in this case, Tweets) and produce a probability score alongside the output/classification.\n", 783 | "\n", 784 | "> How do I know that your data is complete and trustworthy?\n", 785 | "Twitter's annotations are curated by domain experts using research and QA processes that have been refined over the course of several years. The process is supported by custom tooling to scale data tracking as far as we are able to maintain excellent precision and recall. 
In addition, our data is audited regularly by an internal team, having received a precision score of ~80% for the past several quarters.\n",
786 | "\n",
787 | "> -[Twitter Context Annotation FAQ](https://developer.twitter.com/en/docs/twitter-api/annotations/faq)"
788 | ]
789 | },
790 | {
791 | "cell_type": "markdown",
792 | "id": "7f2bc6a5-42c7-407f-ade2-63e89d5c8dd6",
793 | "metadata": {},
794 | "source": [
795 | "Twitter has also provided a list of all the [currently existing context annotations](https://developer.twitter.com/en/docs/twitter-api/annotations/faq)."
796 | ]
797 | },
798 | {
799 | "cell_type": "code",
800 | "execution_count": null,
801 | "id": "bca2d6ce-ca00-490d-a5a6-5ad91f641ade",
802 | "metadata": {
803 | "scrolled": true,
804 | "tags": []
805 | },
806 | "outputs": [],
807 | "source": [
808 | "all_context_annotations = pd.read_csv(\"https://raw.githubusercontent.com/twitterdev/twitter-context-annotations/main/files/evergreen-context-entities-20220601.csv\")\n",
809 | "all_context_annotations"
810 | ]
811 | },
812 | {
813 | "cell_type": "code",
814 | "execution_count": null,
815 | "id": "0efdbdf3-bb61-4a33-8fd5-25b887682a06",
816 | "metadata": {},
817 | "outputs": [],
818 | "source": [
819 | "all_context_annotations[['entity_name']].sample(50)"
820 | ]
821 | },
822 | {
823 | "cell_type": "markdown",
824 | "id": "a93355b1-e841-46cf-92cc-2ae3cce7b87d",
825 | "metadata": {},
826 | "source": [
827 | "Let's check out context annotations for a couple of tweets! Suhem Parack has created a small web application where we can insert any tweet URL and get that tweet's context annotations: https://tweet-entity-extractor.glitch.me/"
828 | ]
829 | },
830 | {
831 | "cell_type": "markdown",
832 | "id": "ba4cde27-7cb2-4aa3-a310-74f970efde76",
833 | "metadata": {},
834 | "source": [
835 | "Tweet 1: https://twitter.com/POTUS/status/1532057523347689472\n",
836 | "\n",
837 | "Tweet 2: https://twitter.com/POTUS/status/1397595270582505474"
838 | ]
839 | },
840 | {
841 | "cell_type": "code",
842 | "execution_count": null,
843 | "id": "89fe8d6f-06e5-4459-822d-2320f9c32044",
844 | "metadata": {},
845 | "outputs": [],
846 | "source": [
847 | "tweets_df['text'][130]"
848 | ]
849 | },
850 | {
851 | "cell_type": "code",
852 | "execution_count": null,
853 | "id": "9a1a95af-fa92-47f6-80c5-02de4f74b28b",
854 | "metadata": {},
855 | "outputs": [],
856 | "source": [
857 | "tweets_df['context_annotations'][130]"
858 | ]
859 | },
860 | {
861 | "cell_type": "markdown",
862 | "id": "3c762a3f-1c53-4be7-b334-84b5b3d23041",
863 | "metadata": {},
864 | "source": [
865 | "As you can see, the \"context_annotations\" column is dense, and extracting this information is a bit tricky.\n",
866 | "\n",
867 | "Here are two Python functions that can help us: one counts up all the annotations, and one extracts each tweet's annotations so that we can add them as a new column in the data.\n",
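"\n",
"Why the extra work? A CSV file can only store strings, so when twarc-csv flattens each tweet, the nested `context_annotations` JSON ends up as one long string. The functions below therefore use `ast.literal_eval()` to turn each string back into real Python lists and dictionaries first. A tiny illustration with a made-up annotation:\n",
"\n",
"```python\n",
"from ast import literal_eval\n",
"\n",
"# What a context_annotations cell actually holds after twarc-csv: a string, not a list\n",
"cell = \"[{'domain': {'name': 'Politician'}, 'entity': {'name': 'Joe Biden'}}]\"\n",
"annotations = literal_eval(cell)  # now a real list of dictionaries\n",
"annotations[0]['entity']['name']  # -> 'Joe Biden'\n",
"```"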
868 | ]
869 | },
870 | {
871 | "cell_type": "code",
872 | "execution_count": null,
873 | "id": "49d63826-8f13-4beb-ae3f-1099fd18c797",
874 | "metadata": {},
875 | "outputs": [],
876 | "source": [
877 | "from ast import literal_eval\n",
878 | "from collections import Counter\n",
879 | "import numpy as np\n",
880 | "\n",
881 | "context_counter = Counter()\n",
882 | "\n",
883 | "# Count up all context annotations\n",
884 | "def count_context(annotations):\n",
885 | "    # if not NaN\n",
886 | "    if type(annotations) != float:\n",
887 | "        # Convert to an actual Python list, not just a string\n",
888 | "        annotations = literal_eval(annotations)\n",
889 | "        names = []\n",
890 | "        # for every annotation in the tweet\n",
891 | "        for annotation in annotations:\n",
892 | "            # grab the name (only once per tweet)\n",
893 | "            name = annotation['entity']['name']\n",
894 | "            if name not in names:\n",
895 | "                names.append(name)\n",
896 | "        # add the tweet's names to the counter\n",
897 | "        context_counter.update(names)\n",
898 | "\n",
899 | "# Extract context annotations\n",
900 | "def extract_context(annotations):\n",
901 | "    names = []  # an empty list is returned for tweets with no annotations\n",
902 | "    # if not NaN\n",
903 | "    if type(annotations) != float:\n",
904 | "        # Convert to an actual Python list, not just a string\n",
905 | "        annotations = literal_eval(annotations)\n",
906 | "        # for every annotation in the tweet\n",
907 | "        for annotation in annotations:\n",
908 | "            # grab the name\n",
909 | "            name = annotation['entity']['name']\n",
910 | "            names.append(name)\n",
911 | "    return names"
912 | ]
913 | },
914 | {
915 | "cell_type": "markdown",
916 | "id": "9f4d1b0d-9a08-4ca2-a9ab-feba82967ba1",
917 | "metadata": {},
918 | "source": [
919 | "Let's count up all the context annotations from Joe Biden's most recent tweets."
920 | ]
921 | },
922 | {
923 | "cell_type": "code",
924 | "execution_count": null,
925 | "id": "5e8a3b7a-89a1-44f1-a685-a5aa8e36ab9c",
926 | "metadata": {
927 | "scrolled": true,
928 | "tags": []
929 | },
930 | "outputs": [],
931 | "source": [
932 | "# Apply the counting function to every row of the column\n",
933 | "tweets_df['context_annotations'].apply(count_context)\n",
934 | "# Pull out the list of most common annotations\n",
935 | "context_counter.most_common()"
936 | ]
937 | },
938 | {
939 | "cell_type": "markdown",
940 | "id": "5ee161cc-c73d-4bab-a3e7-e8c9e2429658",
941 | "metadata": {},
942 | "source": [
943 | "We can easily make a dataframe of the annotation counts with `pd.DataFrame()` (see the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)), and then we can plot the top 10 annotations (other than Joe Biden and The White House) with `.plot()` (see the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html))."
944 | ]
945 | },
946 | {
947 | "cell_type": "code",
948 | "execution_count": null,
949 | "id": "6fe1a6d9-996d-41ee-9d81-d3518ef3d4af",
950 | "metadata": {},
951 | "outputs": [],
952 | "source": [
953 | "%matplotlib inline\n",
954 | "# Make a DataFrame\n",
955 | "context_df = pd.DataFrame(context_counter.most_common(), columns = ['context', 'count'])\n",
956 | "# Skip the top two annotations (Joe Biden, The White House) and plot the next ten\n",
957 | "context_df[2:12].plot(kind = 'barh', x = \"context\", y = \"count\", title = \"Joe Biden Tweets\",\n",
958 | "    figsize = (10, 7)).invert_yaxis()"
959 | ]
960 | },
961 | {
962 | "cell_type": "markdown",
963 | "id": "209cecae-4447-4ef6-8883-d97e222cea4f",
964 | "metadata": {},
965 | "source": [
966 | "Let's extract the context annotations from Joe Biden's most recent tweets and add them as a new column.\n",
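"\n",
"Once the new column exists (it's created in the next cell), we can use it to filter tweets by topic. A minimal sketch; \"COVID-19\" is just an example entity name and may not appear in your own data:\n",
"\n",
"```python\n",
"# Keep only the tweets whose context annotations include a given entity name\n",
"covid_tweets = tweets_df[tweets_df['context'].apply(lambda names: 'COVID-19' in names)]\n",
"covid_tweets[['created_at', 'text']]\n",
"```"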
967 | ]
968 | },
969 | {
970 | "cell_type": "code",
971 | "execution_count": null,
972 | "id": "23822c13-2a69-4056-8016-d24bc8db76d6",
973 | "metadata": {},
974 | "outputs": [],
975 | "source": [
976 | "tweets_df['context'] = tweets_df['context_annotations'].apply(extract_context)"
977 | ]
978 | },
979 | {
980 | "cell_type": "code",
981 | "execution_count": null,
982 | "id": "3e0ec3a1-ab44-4148-8322-3e45b2758ed3",
983 | "metadata": {},
984 | "outputs": [],
985 | "source": [
986 | "# Select columns of interest\n",
987 | "clean_tweets_df = tweets_df[['created_at', 'author.username', 'author.name', 'author.description',\n",
988 | "    'author.verified', 'type', 'text', 'context', 'public_metrics.retweet_count', \n",
989 | "    'public_metrics.like_count', 'public_metrics.reply_count', 'public_metrics.quote_count',\n",
990 | "    'tweet_url', 'lang', 'source', 'geo.full_name']]\n",
991 | "\n",
992 | "# Rename columns for convenience\n",
993 | "clean_tweets_df = clean_tweets_df.rename(columns={'created_at': 'date', 'public_metrics.retweet_count': 'retweets', \n",
994 | "    'author.username': 'username', 'author.name': 'name', 'author.verified': 'verified', \n",
995 | "    'public_metrics.like_count': 'likes', 'public_metrics.quote_count': 'quotes', \n",
996 | "    'public_metrics.reply_count': 'replies', 'author.description': 'user_bio'})\n",
997 | "\n",
998 | "clean_tweets_df"
999 | ]
1000 | },
1001 | {
1002 | "cell_type": "markdown",
1003 | "id": "98740e6c-bb6b-498a-8b7b-9c3d2afcbe92",
1004 | "metadata": {},
1005 | "source": [
1006 | "## Further Resources"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "markdown",
1011 | "id": "72725179-c129-487a-962b-99f1f5166fc7",
1012 | "metadata": {},
1013 | "source": [
1014 | "- Suhem Parack's [\"Getting started with the Twitter API v2 for academic research\"](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research)\n",
1015 | "- Melanie Walsh's [chapter on Twitter data](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/12-Twitter-Data.html) from *Introduction to Cultural Analytics & Python*\n",
1016 | "- Twitter's developer [forum for academic research](https://twittercommunity.com/c/academic-research/62)\n",
1017 | "- Twitter's [Community space for academic researchers](https://twitter.com/i/communities/1494750019467063297)"
1018 | ]
1019 | }
1020 | ],
1021 | "metadata": {
1022 | "kernelspec": {
1023 | "display_name": "Python 3",
1024 | "language": "python",
1025 | "name": "python3"
1026 | },
1027 | "language_info": {
1028 | "codemirror_mode": {
1029 | "name": "ipython",
1030 | "version": 3
1031 | },
1032 | "file_extension": ".py",
1033 | "mimetype": "text/x-python",
1034 | "name": "python",
1035 | "nbconvert_exporter": "python",
1036 | "pygments_lexer": "ipython3",
1037 | "version": "3.8.8"
1038 | },
1039 | "toc": {
1040 | "base_numbering": 1,
1041 | "nav_menu": {},
1042 | "number_sections": true,
1043 | "sideBar": true,
1044 | "skip_h1_title": false,
1045 | "title_cell": "Table of Contents",
1046 | "title_sidebar": "Contents",
1047 | "toc_cell": false,
1048 | "toc_position": {},
1049 | "toc_section_display": true,
1050 | "toc_window_display": true
1051 | }
1052 | },
1053 | "nbformat": 4,
1054 | "nbformat_minor": 5
1055 | }
1056 | 
--------------------------------------------------------------------------------