├── .gitignore
├── MANIFEST
├── setup.py
├── README.md
└── twitter-harvest.py


/.gitignore:
--------------------------------------------------------------------------------
*.pyc
dist
build
MANIFEST
lib

--------------------------------------------------------------------------------
/MANIFEST:
--------------------------------------------------------------------------------
# file GENERATED by distutils, do NOT edit
setup.py

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
################################################################################
#
# Copyright (c) 2016 ObjectLabs Corporation
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
################################################################################

from setuptools import setup

setup(
    name='twitter-harvest',
    version='0.1',
    description='Twitter User Timeline Harvest',
    author='The MongoLab Team',
    author_email='support@mongolab.com',
    url='https://github.com/mongolab/twitter-harvest',
    license='MIT',
    install_requires=['pymongo', 'oauth2', 'httplib2', 'argparse']
)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Twitter-Harvest
====================

Twitter-Harvest is a Twitter [User Timeline](https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline) tool that downloads statuses (tweets) from a user's timeline as JSON objects and stores them in a MongoDB database.

Twitter-Harvest was developed on Python 2.7.x. You also need Twitter OAuth credentials (detailed below).


Usage
---------

### Install ###

Run:

    python setup.py install

* setup.py uses setuptools!

### Harvest ###

Run:

    python twitter-harvest.py --consumer-key consumer-key --consumer-secret consumer-secret --access-token access-token --access-secret access-secret --db mongodb-uri

* The consumer-key, consumer-secret, access-token and access-secret arguments are required.

### Other Useful Options ###

    -r            include native retweets in the harvest (default = False)

    -v            also print each harvested tweet to stdout (default = False)

    --db          MongoDB connection URI; tweets are inserted into that database. Without it, tweets are only printed to stdout.

    --numtweets   total number of tweets to be harvested (default and max = 3200)

    --user        screen name of the timeline to harvest. The account must be public or one you already follow. (default = mongolab)

### Help Contents ###
```
usage: twitter-harvest.py [-h] [-r] [-v] [--numtweets NUMTWEETS] [--user USER]
                          [--db DB] --consumer-key CONSUMER_KEY
                          --consumer-secret CONSUMER_SECRET
                          --access-token ACCESS_TOKEN
                          --access-secret ACCESS_SECRET

Connects to Twitter User Timeline endpoint, retrieves tweets and inserts into
a MongoDB database. Developed on Python 2.7

optional arguments:
  -h, --help                         show this help message and exit
  -r, --retweet                      include native retweets in the harvest
  -v, --verbose                      print harvested tweets in shell
  --numtweets NUMTWEETS              set total number of tweets to be harvested, max = 3200
  --user USER                        choose twitter user timeline for harvest
  --db DB                            MongoDB URI, example: mongodb://dbuser:dbpassword@dbhnn.mongolab.com:port/dbname
  --consumer-key CONSUMER_KEY        Consumer Key from your Twitter App OAuth settings
  --consumer-secret CONSUMER_SECRET  Consumer Secret from your Twitter App OAuth settings
  --access-token ACCESS_TOKEN        Access Token from your Twitter App OAuth settings
  --access-secret ACCESS_SECRET      Access Token Secret from your Twitter App Dev Credentials
```

Twitter App Setup
-----------------

For those unfamiliar with the Twitter Dev/App page, here are instructions for getting this script up and running.

1. Visit the Twitter dev [page](https://dev.twitter.com/).
2. Go to My Applications (should be a drop down from your username).
3. Create a new app and fill in the required fields.
4. Your access token, access token secret, consumer key, and consumer secret should all be displayed. Pass those values to the script through the matching command-line arguments (`--access-token`, `--access-secret`, `--consumer-key`, `--consumer-secret`) :)


Contact
-------

Feel free to contact me via Twitter [@chrisckchang](https://twitter.com/chrisckchang), or drop us a line at support@mongolab.com if you have any questions or comments!
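
How the Harvest Pages Through a Timeline
----------------------------------------

The script requests the timeline in pages of up to 200 tweets (`count=200`) and walks backwards by passing the id of the last tweet seen as `max_id`. Because `max_id` is inclusive, every page after the first starts with a tweet that was already harvested, so the script drops it. The sketch below illustrates only that loop. It is not the script itself: `harvest`, `fetch_page` and `make_fake_timeline` are hypothetical names, and the in-memory timeline stands in for the signed HTTP request.

```python
def harvest(fetch_page, numtweets=3200):
    """Collect up to `numtweets` tweets, newest first.

    `fetch_page(max_id)` is a hypothetical stand-in for the signed request:
    it returns a list of tweet dicts with ids <= max_id, or the newest
    tweets when max_id is None.
    """
    harvested = []
    max_id = None
    while len(harvested) < numtweets:
        page = fetch_page(max_id)
        # max_id is inclusive, so every page after the first begins with
        # the last tweet of the previous page; skip that duplicate.
        tweets = page if max_id is None else page[1:]
        if not tweets:
            break
        for tweet in tweets:
            if len(harvested) == numtweets:
                break
            harvested.append(tweet)
            max_id = tweet['id_str']
    return harvested


def make_fake_timeline(total, page_size=200):
    """Build an in-memory fetch_page over `total` made-up tweets."""
    timeline = [{'id_str': str(i), 'text': 'tweet %d' % i}
                for i in range(total, 0, -1)]

    def fetch_page(max_id=None):
        if max_id is None:
            eligible = timeline
        else:
            eligible = [t for t in timeline
                        if int(t['id_str']) <= int(max_id)]
        return eligible[:page_size]

    return fetch_page


if __name__ == '__main__':
    tweets = harvest(make_fake_timeline(450), numtweets=300)
    print(len(tweets))
```

By this arithmetic the first page yields up to 200 new tweets and each later page up to 199, so a full 3200-tweet harvest needs at least 17 requests when every page comes back full.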

--------------------------------------------------------------------------------
/twitter-harvest.py:
--------------------------------------------------------------------------------
##############################################################################
#
# Copyright (c) 2016 ObjectLabs Corporation
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
################################################################################

__author__ = 'mongolab'


import pymongo
import oauth2 as oauth
import urllib2, json
import sys, argparse, time

def oauth_header(url, consumer, token):
    # Sign a GET request for `url` and return the OAuth 1.0 Authorization header value
    params = {'oauth_version': '1.0',
              'oauth_nonce': oauth.generate_nonce(),
              'oauth_timestamp': int(time.time()),
              }
    req = oauth.Request(method = 'GET', url = url, parameters = params)
    req.sign_request(oauth.SignatureMethod_HMAC_SHA1(), consumer, token)
    return req.to_header()['Authorization'].encode('utf-8')

def main():

    ### Build arg parser
    parser = argparse.ArgumentParser(description = 'Connects to Twitter User Timeline endpoint, retrieves tweets and inserts into a MongoDB database. Developed on Python 2.7')
    parser.add_argument('-r', '--retweet', help = 'include native retweets in the harvest', action = 'store_true')
    parser.add_argument('-v', '--verbose', help = 'print harvested tweets in shell', action = 'store_true')
    parser.add_argument('--numtweets', help = 'set total number of tweets to be harvested, max = 3200', type = int, default = 3200)
    parser.add_argument('--user', help = 'choose twitter user timeline for harvest', default = 'mongolab')
    parser.add_argument('--db', help = 'MongoDB URI, example: mongodb://dbuser:dbpassword@dbhnn.mongolab.com:port/dbname')
    parser.add_argument('--consumer-key', help = 'Consumer Key from your Twitter App OAuth settings', required = True)
    parser.add_argument('--consumer-secret', help = 'Consumer Secret from your Twitter App OAuth settings', required = True)
    parser.add_argument('--access-token', help = 'Access Token from your Twitter App OAuth settings', required = True)
    parser.add_argument('--access-secret', help = 'Access Token Secret from your Twitter App Dev Credentials', required = True)

    ### Fields for query
    args = parser.parse_args()
    user = args.user
    numtweets = args.numtweets
    verbose = args.verbose
    retweet = args.retweet

    ### Build Signature
    CONSUMER_KEY = args.consumer_key
    CONSUMER_SECRET = args.consumer_secret
    ACCESS_TOKEN = args.access_token
    ACCESS_SECRET = args.access_secret

    ### Build Endpoint + Set Headers
    # Send include_rts as a lowercase boolean ('true'/'false') rather than Python's 'True'/'False'
    base_url = url = 'https://api.twitter.com/1.1/statuses/user_timeline.json?include_entities=true&count=200&screen_name=%s&include_rts=%s' % (user, str(retweet).lower())
    oauth_consumer = oauth.Consumer(key = CONSUMER_KEY, secret = CONSUMER_SECRET)
    oauth_token = oauth.Token(key = ACCESS_TOKEN, secret = ACCESS_SECRET)

    ### Setup MongoLab Goodness
    uri = args.db
    if uri is not None:
        try:
            conn = pymongo.MongoClient(uri)
            print 'Harvesting...'
        except Exception:
            print 'Error: Unable to connect to DB. Check --db arg'
            return
        uri_parts = pymongo.uri_parser.parse_uri(uri)
        db = conn[uri_parts['database']]
        # create_index replaces the deprecated ensure_index
        db[user].create_index('id_str')

    ### Helper Variables for Harvest
    max_id = -1
    tweet_count = 0

    ### Begin Harvesting
    while True:
        auth = oauth_header(url, oauth_consumer, oauth_token)
        headers = {"Authorization": auth}
        request = urllib2.Request(url, headers = headers)
        try:
            stream = urllib2.urlopen(request)
        except urllib2.HTTPError as err:
            # Always stop on an HTTP error; otherwise `stream` would be undefined below
            if err.code == 404:
                print 'Error: Unknown user. Check --user arg'
            elif err.code == 401:
                print 'Error: Unauthorized. Check Twitter credentials'
            elif err.code == 429:
                print 'Error: Hit rate limit. Try again later'
            else:
                print 'Error: Twitter returned HTTP %s' % err.code
            return
        tweet_list = json.load(stream)

        if len(tweet_list) == 0:
            print 'No tweets to harvest!'
            return
        # An error payload is a dict ({'errors': [{'code': ..., 'message': ...}]}); a timeline is a list
        if isinstance(tweet_list, dict) and 'errors' in tweet_list:
            error = tweet_list['errors'][0]
            print 'Twitter error, code: %s, message: %s' % (error['code'], error['message'])
            return
        # max_id is inclusive, so every page after the first repeats the last tweet already seen
        if max_id == -1:
            tweets = tweet_list
        else:
            tweets = tweet_list[1:]
        if len(tweets) == 0:
            print 'Finished Harvest!'
            return

        for tweet in tweets:
            max_id = id_str = tweet['id_str']
            try:
                if tweet_count == numtweets:
                    print 'Finished Harvest- hit numtweets!'
                    return
                if uri is not None:
                    # Upsert keyed on id_str so re-running the harvest does not duplicate tweets
                    db[user].update({'id_str': id_str}, tweet, upsert = True)
                else:
                    print tweet['text']
                tweet_count += 1
                if verbose and uri is not None:
                    print tweet['text']
            except Exception as err:
                print 'Unexpected error encountered: %s' % (err)
                return
        url = base_url + '&max_id=' + max_id

if __name__ == '__main__':
    try:
        main()
    except SystemExit as e:
        # argparse exits via SystemExit (after --help or a usage error); end quietly
        if e.code == 0:
            pass

--------------------------------------------------------------------------------