# Introduction to full archive searching using twarc v2

In late 2020 Twitter began rolling out version 2 (v2) of the Twitter API. Included in this rollout is a new way to access Twitter data for academic research projects. This new method, formally referred to as the 'Academic Research product track' or 'Academic API' for short, provides free access to the full-archive Twitter repository and other v2 endpoints. The Academic API is exciting because access to the full-archive data previously came at a financial cost. With this expanded access, Twitter hopes to enable more social media research from around the world.

This announcement was met with excitement from members of the open source software community, who immediately began laying plans for community-based tools to handle the Academic API. Among the many open source tools for handling Twitter data is twarc, a well-established command line tool and Python library used for accessing and archiving Twitter data. Twarc has been used extensively by archivists, activists, researchers and more to gather data from the v1 API. However, the v2 API provides a [fundamentally different data payload](https://dev.to/twitterdev/understanding-the-new-tweet-payload-in-the-twitter-api-v2-1fg5). With dedicated work and contributions from scholars around the globe, twarc was able to respond quickly to the new API and give users access to an updated tool called [twarc v2](https://github.com/DocNow/twarc/tree/v2).

The following tutorial demonstrates the basic installation and use of twarc v2 for accessing Twitter's full archive search. It is meant as an introductory guide for researchers who may be new to gathering Twitter data.
As always, please refer to the official [twarc](https://twarc-project.readthedocs.io/en/latest/) and [Twitter](https://developer.twitter.com/en/docs/twitter-api/early-access) documentation for more advanced details.

## Installing twarc v2

Before installing twarc, make sure that you have a recent and updated installation of Python 3. Python can be downloaded from the official [python.org](https://www.python.org/downloads/) website. After installing Python you can proceed with the installation of twarc, which provides both the `twarc` (v1) and `twarc2` (v2) commands, using the popular package-management system called pip.

Open up a new terminal and install twarc v2 by typing:

```console
pip install twarc
```

To ensure that twarc v2 has been installed, try typing `twarc2` into the command line. You should get output similar to the following:

```console
> twarc2

Usage: twarc2 [OPTIONS] COMMAND [ARGS]...

  Collect data from the Twitter V2 API.

Options:
  --consumer-key TEXT         Twitter app consumer key (aka "App Key")
  --consumer-secret TEXT      Twitter app consumer secret (aka "App Secret")
  --access-token TEXT         Twitter app access token for user authentication.
  --access-token-secret TEXT  Twitter app access token secret for user authentication.
  ...
```

## Configuring twarc v2 with your API keys

Before we can gather Twitter data we need to provide some information that proves we have been approved to use the Academic API. Users who have been approved for the Academic API will be provided with a set of [*keys*](https://cloud.ibm.com/docs/account?topic=account-manapikey). These keys authorize you to access the various Twitter API endpoints and gather data. As with any other key, digital or material, your keys should be kept in a secure location and should not be shared with others.

The final setup step of twarc v2 is registering your API keys.
Twarc v2 conveniently offers a set-up tool so that API keys only need to be registered once (per local install). Run the following line to launch the set-up tool:

```console
twarc2 configure
```

You will then be guided through a series of prompts asking for your consumer key, consumer secret, access token, and access token secret. Enter each value at its prompt and press `[Enter]` to proceed.

You should now be able to execute searches using twarc and twarc v2!

## Understanding the recent Twitter archive versus the full Twitter archive

Twitter provides two different archives of past tweets. The first, generally referred to as the *recent* archive, covers tweets posted in the past seven days. The second, generally referred to as the *full* archive, covers every tweet that is part of the public conversation. To query either archive with twarc v2, use the `search` command.

To access the recent archive, simply type:

```console
> twarc2 search '#blacklivesmatter'
```

You will see your terminal flooded with text. To stop a data pull, hit `Ctrl+C`.

To write that data to a file instead of the terminal window, supply the name of the file to write to:

```console
> twarc2 search '#blacklivesmatter' tweets.jsonl
```

The `.jsonl` file extension is often used to indicate that the file contains [line-oriented JSON](https://jsonlines.org/).

To access the full archive (back to 2006), add the `--archive` option:

```console
> twarc2 search '#blacklivesmatter' --archive
```

Note that in almost all situations adding the `--archive` option will tap a much larger pool of tweets. These queries may take longer, result in larger files, and use up a greater share of your monthly tweet cap.
In the following sections we build on these basic searches to refine our API queries.

## Building a more specific twarc v2 full archive query

In general, you may want to add more rules to your full archive search to gather more specific tweets. For example, let's say you want to gather tweets related to Black Lives Matter from 2015 to 2016. You can do so by adding the `--start-time` and `--end-time` flags. Note that these times should be in either `%Y-%m-%d` or `%Y-%m-%dT%H:%M:%S` format (for example, `2015-01-01` or `2015-01-01T00:00:00`). Let's see what a full archive search looks like with start and end rules:

```console
> twarc2 search '#blacklivesmatter' --start-time 2015-01-01 --end-time 2016-12-31 --archive
```

If you run the above command you will be pulling from a very large pool of tweets: `#blacklivesmatter` is a popular topic and we are considering tweets across two entire years. Two additional ways to reduce the number of tweets are to exclude retweets and to cap the collection at a specific number of tweets, say, 5000. Twarc has options for these rules as well! Let's see what these additional rules look like:

```console
> twarc2 search '#blacklivesmatter -is:retweet' --start-time 2015-01-01 --end-time 2016-12-31 --limit 5000 --archive
```

Note that the rule to exclude retweets (`-is:retweet`) actually appears *within* the quoted search query, whereas the limit on the number of tweets appears as an option flag. At this point we have set several additional rules to create a more specific query of the Twitter full archive.

An additional choice that many users may find helpful is to *flatten* the data provided by twarc. By default, each line of twarc2 output holds one API response containing many tweets; flattening rewrites the data so that each line holds a single tweet. This is useful in data pipelines that expect there to be one tweet per line in the output file.
If you've collected some tweets and would like to flatten them you can run:

```console
> twarc2 flatten tweets.jsonl flattened-tweets.jsonl
```

## Processing data into common analysis formats

The resulting file is in JSON format, which has many advantages for moving large amounts of structured data. However, we need to take some steps to transform the JSON into a form more common for data analysis. We can use the [twarc-csv](https://github.com/docnow/twarc-csv) plugin to convert the line-oriented JSON to CSV, which is easier to load as a DataFrame in tools like [Pandas](https://pandas.pydata.org/) and [R](https://www.r-project.org/). Twarc plugins are distributed separately from twarc and extend the base `twarc2` command with additional subcommands; in the case of twarc-csv, a `csv` subcommand is added.

```console
> pip install twarc-csv
> twarc2 csv tweets.jsonl tweets.csv
```

Then you can load the CSV into a Pandas DataFrame:

```python
import pandas
# Load the converted tweets into a DataFrame
df = pandas.read_csv('tweets.csv')
```

Of course, you can also process the tweets directly as JSON. To make your life easier you may want to flatten the data first to ensure that each line contains a single tweet.

```console
twarc2 flatten tweets.jsonl flattened-tweets.jsonl
```

Then you can write a small Python program that reads in each line, parses it as JSON, and does something with the data (in this case it prints out the tweet ID and text):

```python
import json

# Each line of the flattened file is expected to hold one tweet object
with open('flattened-tweets.jsonl') as infile:
    for line in infile:
        tweet = json.loads(line)
        print(tweet['id'], tweet['text'])
```

Python is used here just as an example; any modern programming language you might find will have a JSON parsing library.
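
As one illustration of "doing something with the data", the sketch below counts tweets per calendar day. It is a minimal example under stated assumptions, not part of twarc itself: the helper name `tweets_per_day` is hypothetical, the sample records are made up for demonstration, and it assumes each flattened line is a single tweet object carrying an ISO 8601 `created_at` string.

```python
import json
from collections import Counter


def tweets_per_day(lines):
    """Count tweets per calendar day from flattened JSONL lines.

    Assumes (not guaranteed by this tutorial) that each non-empty line is
    one tweet object with an ISO 8601 `created_at` string.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        tweet = json.loads(line)
        created_at = tweet.get("created_at")
        if created_at is None:
            continue  # skip records without a timestamp
        counts[created_at[:10]] += 1  # first 10 characters are YYYY-MM-DD
    return counts


# Hypothetical sample records standing in for flattened-tweets.jsonl
sample = [
    '{"id": "1", "text": "first", "created_at": "2015-01-01T12:00:00.000Z"}',
    '{"id": "2", "text": "second", "created_at": "2015-01-01T18:30:00.000Z"}',
    '{"id": "3", "text": "third", "created_at": "2015-01-02T09:15:00.000Z"}',
]

for day, n in sorted(tweets_per_day(sample).items()):
    print(day, n)
```

To run the same count on collected data, pass an open file handle instead of the sample list, for example `with open('flattened-tweets.jsonl') as infile: tweets_per_day(infile)`. Because the file is read one line at a time, this approach should also cope with large collections.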

## Concluding remarks

By completing this tutorial you should now be able to:

- Install twarc and twarc v2, well-established open source software used to gather data from the Twitter API
- Perform searches of the recent and full Twitter archive
- Process data from the API into common analysis formats such as CSV

As with any other dataset, users should be aware of the many caveats to analyzing data gathered from Twitter (see for example work by [Crawford, 2013](https://ellaholland2022.com/s/05-The-Hidden-Biases-in-Big-Data-Crawford.pdf) and [Jiang, 2020](https://www.tandfonline.com/doi/full/10.1080/15230406.2018.1434834)). Happy twarcing!

# Additional twarc resources

- [Twarc github](https://github.com/DocNow/twarc)
- [Twarc tutorials from University of Virginia](http://digitalcollecting.lib.virginia.edu/toolkit/docs/social-media/collect-tweets/)
- [Example of using twarc in the cloud](https://medium.com/@justinlittman_38536/twarc-cloud-twitter-data-collection-in-the-cloud-d1cac85f57a5)
- [Introduction to Cultural Analytics & Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Twitter-Data-Collection.html)
- [Documenting the Now](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Twitter-Data-Collection.html)