├── .gitignore
├── README.md
├── get_metadata.py
├── sample_api_keys.json
└── scrape.py

/.gitignore:
--------------------------------------------------------------------------------
api_keys.json
geckodriver.log
/data

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Twitter Scraper

Twitter makes it hard to get all of a user's tweets (assuming they have more than 3,200). This is a way to get around that using Python, Selenium, and Tweepy.

Essentially, we use Selenium to open a browser and automatically visit Twitter's search page, searching for a single user's tweets on a single day. If we want all tweets from 2015, we check all 365 days / pages. This would be a nightmare to do manually, so the `scrape.py` script does it all for you - all you have to do is enter a date range and a Twitter handle, then wait for it to finish.

The `scrape.py` script collects tweet ids. If you know a tweet's id number, you can get all the information available about that tweet using Tweepy - text, timestamp, number of retweets / replies / favorites, geolocation, and so on. Tweepy uses Twitter's API, so you will need to get API keys. Once you have them, you can run the `get_metadata.py` script.

## Requirements

- basic knowledge of how to use a terminal
- Safari 10+ with the 'Allow Remote Automation' option enabled in Safari's Develop menu, so Safari can be controlled via WebDriver
- python3
  - to check, enter `python3` in your terminal
  - if you don't have it, check YouTube for installation instructions
- pip or pip3
  - to check, enter `pip` or `pip3` in your terminal
  - if you don't have it, again, check YouTube for installation instructions
- selenium (3.0.1)
  - `pip3 install selenium`
- tweepy (3.5.0)
  - `pip3 install tweepy`

## Running the scraper

- open up `scrape.py` and edit the user, start, and end variables, then save the file (the block you'll be editing is shown after the Troubleshooting section below)
- run `python3 scrape.py`
- you'll see a browser pop up and output in the terminal
- do some other fun task until it finishes
- once it's done, it outputs all the tweet ids it found into `all_ids.json`
  - every time you run the scraper with different dates, it adds the new ids to the same file
  - it automatically removes duplicates, so don't worry about small date overlaps

## Troubleshooting the scraper

- do you get a `no such file` error? you need to cd to the directory containing `scrape.py`
- do you get a driver error when you try to run the script?
  - open `scrape.py` and change the driver to use Chrome() or Firefox()
  - if neither works, google the error (you probably need to install a new driver)
- does it seem like it's not collecting tweets for days that have tweets?
  - open `scrape.py` and change the delay variable to 2 or 3
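For reference, this is the block you'll be editing at the top of `scrape.py`, shown here with the imports those lines rely on (the values are just the defaults that ship with the script):

```python
import datetime

from selenium import webdriver

# edit these three variables
user = 'realdonaldtrump'
start = datetime.datetime(2010, 1, 1)  # year, month, day
end = datetime.datetime(2016, 12, 7)   # year, month, day

# only edit these if you're having problems
delay = 1  # time to wait on each page load before reading the page
driver = webdriver.Safari()  # options are Chrome() Firefox() Safari()
```

If Safari gives you trouble, switching the last line to `webdriver.Chrome()` or `webdriver.Firefox()` (with the matching driver installed) is usually all that's needed.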
## Getting the metadata

- first you'll need to get Twitter API keys
  - sign up for a developer account here: https://dev.twitter.com/
  - get your keys here: https://apps.twitter.com/
- put your keys into the `sample_api_keys.json` file
- change the name of `sample_api_keys.json` to `api_keys.json`
- open up `get_metadata.py` and edit the user variable (and save the file)
- run `python3 get_metadata.py`
- this will get metadata for every tweet id in `all_ids.json`
- it will create 4 files
  - `username.json` (master file with all metadata)
  - `username.zip` (a zipped version of the master file)
  - `username_short.json` (smaller master file with only the relevant metadata fields; see the example below for how to load it)
  - `username.csv` (CSV version of the smaller master file)
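Once `get_metadata.py` finishes, the slimmed-down file is easy to work with from Python. A minimal sketch, assuming you kept the default `realdonaldtrump` user (so the file is named `realdonaldtrump_short.json`); swap in your own username otherwise:

```python
import json

# load the slimmed-down metadata produced by get_metadata.py
with open('realdonaldtrump_short.json') as f:
    tweets = json.load(f)

# each record carries created_at, text, in_reply_to_screen_name, retweet_count,
# favorite_count, source, id_str, and is_retweet
print(len(tweets), 'tweets loaded')
print(sum(t['is_retweet'] for t in tweets), 'of them are retweets')
```

The `username.csv` file contains the same fields, one row per tweet, if you'd rather work in a spreadsheet.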
--------------------------------------------------------------------------------
/get_metadata.py:
--------------------------------------------------------------------------------
import tweepy
import json
import math
import csv
import zipfile
from time import sleep

# CHANGE THIS TO THE USER YOU WANT
user = 'realdonaldtrump'

with open('api_keys.json') as f:
    keys = json.load(f)

auth = tweepy.OAuthHandler(keys['consumer_key'], keys['consumer_secret'])
auth.set_access_token(keys['access_token'], keys['access_token_secret'])
api = tweepy.API(auth)
user = user.lower()
output_file = '{}.json'.format(user)
output_file_short = '{}_short.json'.format(user)
compression = zipfile.ZIP_DEFLATED

with open('all_ids.json') as f:
    ids = json.load(f)

print('total ids: {}'.format(len(ids)))

all_data = []
start = 0
end = 100
limit = len(ids)
i = math.ceil(limit / 100)

# look up tweets in batches of 100 (the maximum statuses_lookup allows)
for go in range(i):
    print('currently getting {} - {}'.format(start, end))
    sleep(6)  # needed to prevent hitting the API rate limit
    id_batch = ids[start:end]
    start += 100
    end += 100
    tweets = api.statuses_lookup(id_batch)
    for tweet in tweets:
        all_data.append(dict(tweet._json))

print('metadata collection complete')
print('creating master json file')
with open(output_file, 'w') as outfile:
    json.dump(all_data, outfile)

print('creating zipped master json file')
zf = zipfile.ZipFile('{}.zip'.format(user), mode='w')
zf.write(output_file, compress_type=compression)
zf.close()

results = []

def is_retweet(entry):
    return 'retweeted_status' in entry.keys()

def get_source(entry):
    # strip the HTML anchor tag Twitter wraps around the source name
    if '<' in entry["source"]:
        return entry["source"].split('>')[1].split('<')[0]
    else:
        return entry["source"]

with open(output_file) as json_data:
    data = json.load(json_data)
    for entry in data:
        t = {
            "created_at": entry["created_at"],
            "text": entry["text"],
            "in_reply_to_screen_name": entry["in_reply_to_screen_name"],
            "retweet_count": entry["retweet_count"],
            "favorite_count": entry["favorite_count"],
            "source": get_source(entry),
            "id_str": entry["id_str"],
            "is_retweet": is_retweet(entry)
        }
        results.append(t)

print('creating minimized json master file')
with open(output_file_short, 'w') as outfile:
    json.dump(results, outfile)

with open(output_file_short) as master_file:
    data = json.load(master_file)

fields = ["favorite_count", "source", "text", "in_reply_to_screen_name", "is_retweet", "created_at", "retweet_count", "id_str"]

print('creating CSV version of minimized json master file')
with open('{}.csv'.format(user), 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(fields)
    for x in data:
        writer.writerow([x[field] for field in fields])
--------------------------------------------------------------------------------
/sample_api_keys.json:
--------------------------------------------------------------------------------
{
  "consumer_key": "91bibFB1JnMr8jHAd2AtawJLh",
  "consumer_secret": "4Vd6dkZp2OiEZrYbnwVyORtc8iGYBLtg7s9TqPVJFhADPurf2K",
  "access_token": "773564078220544410-VsxhGK7ACMU3v0nvXhlWfzYbtB8kgtV",
  "access_token_secret": "2ocaMiY1FWqYiHNVCvMhqCP1uzXpLx0hGMziwy0oFquav"
}
--------------------------------------------------------------------------------
/scrape.py:
--------------------------------------------------------------------------------
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from time import sleep
import json
import datetime


# edit these three variables
user = 'realdonaldtrump'
start = datetime.datetime(2010, 1, 1)  # year, month, day
end = datetime.datetime(2016, 12, 7)  # year, month, day

# only edit these if you're having problems
delay = 1  # time to wait on each page load before reading the page
driver = webdriver.Safari()  # options are Chrome() Firefox() Safari()


# don't mess with this stuff
twitter_ids_filename = 'all_ids.json'
days = (end - start).days + 1
id_selector = '.time a.tweet-timestamp'
tweet_selector = 'li.js-stream-item'
user = user.lower()
ids = []

def format_day(date):
    # zero-pad day and month so the date reads YYYY-MM-DD
    day = '0' + str(date.day) if len(str(date.day)) == 1 else str(date.day)
    month = '0' + str(date.month) if len(str(date.month)) == 1 else str(date.month)
    year = str(date.year)
    return '-'.join([year, month, day])

def form_url(since, until):
    p1 = 'https://twitter.com/search?f=tweets&vertical=default&q=from%3A'
    p2 = user + '%20since%3A' + since + '%20until%3A' + until + '%20include%3Aretweets&src=typd'
    return p1 + p2

def increment_day(date, i):
    return date + datetime.timedelta(days=i)

# visit the search results page for each day in the range and collect tweet ids
for day in range(days):
    d1 = format_day(increment_day(start, 0))
    d2 = format_day(increment_day(start, 1))
    url = form_url(d1, d2)
    print(url)
    print(d1)
    driver.get(url)
    sleep(delay)

    try:
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        increment = 10

        # keep scrolling until no new tweets load
        while len(found_tweets) >= increment:
            print('scrolling down to load more tweets')
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            sleep(delay)
            found_tweets = driver.find_elements_by_css_selector(tweet_selector)
            increment += 10

        print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))

        for tweet in found_tweets:
            try:
                tweet_id = tweet.find_element_by_css_selector(id_selector).get_attribute('href').split('/')[-1]
                ids.append(tweet_id)
            except StaleElementReferenceException:
                print('lost element reference', tweet)

    except NoSuchElementException:
        print('no tweets on this day')

    start = increment_day(start, 1)

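# merge the ids collected on this run with any ids saved by previous runs,
# de-duplicate them, and write the combined list back to all_ids.json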
try:
    with open(twitter_ids_filename) as f:
        all_ids = ids + json.load(f)
except FileNotFoundError:
    all_ids = ids

data_to_write = list(set(all_ids))
print('tweets found on this scrape: ', len(ids))
print('total tweet count: ', len(data_to_write))

with open(twitter_ids_filename, 'w') as outfile:
    json.dump(data_to_write, outfile)

print('all done here')
driver.close()
--------------------------------------------------------------------------------