├── README.md
└── twitter-to-mongo.py


/README.md:
--------------------------------------------------------------------------------
twitter-to-mongo
================

A Python script that uses the Tweepy library to pull Tweets containing specific keywords from Twitter's Streaming API, and then stores the important fields from each Tweet in a MongoDB collection.

What gets stored in MongoDB?
----------------------------
- The tweet ID
- The username of the tweet author
- The follower count of the tweet author
- The full body of the tweet
- Any hashtags used in the tweet
- The timestamp of the tweet's creation
- The language of the tweet

Dependencies:
-------------
- Tweepy
- PyMongo
- MongoDB

5 minute setup (Assumes the dependencies are already installed):
------------------
- Have MongoDB installed on localhost, and create a database called TwitterStream
- Open the script and replace the four credential placeholders (consumer key, consumer secret, access token, access token secret) with the values from your own Twitter app
- Add the keywords or hashtags you want to track to the "keywords" variable
- Save the script to your project folder or any easy-to-access folder
- Open your console and cd to the folder that you just put the script in
- Type the command $ python YOURSCRIPTNAME.py
- Watch as tweets hit the console in realtime, all while being stored in your database

Extras:
-------
- I made the comments in the script fairly detailed, so you should be able to see exactly what is going on and change which fields are stored fairly easily.
- This is the first complete script I have ever written, so I apologize if there are some obvious errors or silly lines.
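
For reference, here is a rough sketch of the document shape described under "What gets stored in MongoDB?". The function name `build_document` and the sample tweet are made up for illustration; the script itself does this work inside `on_data`, and this sketch leaves out the database write so it can run without MongoDB.

```python
import datetime
import json

# Layout of Twitter's "created_at" field, e.g. "Wed Aug 27 13:08:45 +0000 2008".
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'


def build_document(raw_json):
    """Reduce one raw tweet (a JSON string) to the seven stored fields."""
    t = json.loads(raw_json)
    return {
        'id': t['id_str'],
        'username': t['user']['screen_name'],
        'followers': t['user']['followers_count'],
        'text': t['text'],
        'hashtags': t['entities']['hashtags'],
        'language': t['lang'],
        'created': datetime.datetime.strptime(t['created_at'], TWITTER_TIME_FORMAT),
    }


# A made-up tweet, trimmed to the fields the script reads.
sample = json.dumps({
    'id_str': '1234567890',
    'user': {'screen_name': 'example_user', 'followers_count': 42},
    'text': 'Trying out my new ipad #funny',
    'entities': {'hashtags': [{'text': 'funny', 'indices': [23, 29]}]},
    'created_at': 'Wed Aug 27 13:08:45 +0000 2008',
    'lang': 'en',
})

doc = build_document(sample)
print(doc['username'], doc['created'])
```

Storing `created` as a `datetime` rather than the original string is what lets MongoDB sort and range-query on it.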
--------------------------------------------------------------------------------
/twitter-to-mongo.py:
--------------------------------------------------------------------------------
# ===============================================
# twitter-to-mongo.py v1.0 Created by Sam Delgado
# ===============================================
import datetime
import json

# MongoClient replaces the old Connection class, which was removed in PyMongo 3.0.
from pymongo import MongoClient
# StreamListener is part of Tweepy 3.x; Tweepy 4.0 merged it into Stream.
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# The MongoDB connection info. This assumes your database name is TwitterStream, and your collection name is tweets.
connection = MongoClient('localhost', 27017)
db = connection.TwitterStream
collection = db.tweets
# A unique index on "id" keeps the same Tweet from being stored twice.
collection.create_index("id", unique=True)

# Add the keywords you want to track. They can be cashtags, hashtags, or words.
keywords = ['$goog', '#funny', 'ipad']

# Optional - Only grab tweets of specific language
language = ['en']

# You need to replace these with your own values that you get after creating an app on Twitter's developer portal.
consumer_key = "ADD YOUR CONSUMER KEY HERE"
consumer_secret = "ADD YOUR CONSUMER SECRET HERE"
access_token = "ADD YOUR ACCESS TOKEN HERE"
access_token_secret = "ADD YOUR ACCESS TOKEN SECRET HERE"

# The code below gets Tweets from the stream and stores only the important fields in your database
class StdOutListener(StreamListener):

    def on_data(self, data):

        # Load the Tweet into the variable "t"
        t = json.loads(data)

        # The stream also carries non-Tweet messages (for example delete and limit notices). Skip anything without a Tweet ID.
        if 'id_str' not in t:
            return True

        # Pull important data from the tweet to store in the database.
        tweet_id = t['id_str']  # The Tweet ID from Twitter in string format
        username = t['user']['screen_name']  # The username of the Tweet author
        followers = t['user']['followers_count']  # The number of followers the Tweet author has
        text = t['text']  # The entire body of the Tweet
        hashtags = t['entities']['hashtags']  # Any hashtags used in the Tweet
        dt = t['created_at']  # The timestamp of when the Tweet was created
        lang = t['lang']  # The language of the Tweet (named "lang" so it is not confused with the "language" filter above)

        # Convert the timestamp string given by Twitter to a date object called "created". This is more easily manipulated in MongoDB.
        created = datetime.datetime.strptime(dt, '%a %b %d %H:%M:%S +0000 %Y')

        # Load all of the extracted Tweet data into the variable "tweet" that will be stored into the database
        tweet = {'id': tweet_id, 'username': username, 'followers': followers, 'text': text, 'hashtags': hashtags, 'language': lang, 'created': created}

        # Save the refined Tweet data to MongoDB. The upsert inserts a new Tweet or overwrites an existing one with the same ID.
        collection.update_one({'id': tweet_id}, {'$set': tweet}, upsert=True)

        # Optional - Print the username and text of each Tweet to your console in realtime as they are pulled from the stream
        print(username + ': ' + text)
        return True

    # Prints the status code of an error to your console
    def on_error(self, status):
        print(status)

# Some Tweepy code that can be left alone. It pulls from variables at the top of the script
if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    stream = Stream(auth, l)
    stream.filter(track=keywords, languages=language)
--------------------------------------------------------------------------------
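
The script stores `t['entities']['hashtags']` exactly as Twitter delivers it, which is a list of objects (each with a `text` value and an `indices` pair) rather than a list of strings. If plain tag names are easier to query, something like the following could be applied before saving. `hashtag_names` is a hypothetical helper, not part of the script, and the lower-casing is an assumption made so that `#Funny` and `#funny` end up as the same value.

```python
def hashtag_names(entities):
    """Return the lower-cased hashtag texts from a tweet's 'entities' object."""
    return [tag['text'].lower() for tag in entities.get('hashtags', [])]


# A made-up entities payload in the shape the script reads from t['entities'].
entities = {
    'hashtags': [
        {'text': 'Funny', 'indices': [0, 6]},
        {'text': 'iPad', 'indices': [7, 12]},
    ]
}

names = hashtag_names(entities)
print(names)  # prints ['funny', 'ipad']
```

In `on_data`, this would mean building the document with `hashtag_names(t['entities'])` in place of `t['entities']['hashtags']`.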