├── README.md
└── twitter-to-mongo.py


/README.md:
--------------------------------------------------------------------------------
 1 | twitter-to-mongo
 2 | ================
 3 | 
 4 | A Python script that uses the Tweepy library to pull Tweets matching specific keywords from Twitter's Streaming API and stores selected fields from each Tweet in a MongoDB collection.
 5 | 
 6 | What gets stored in MongoDB?
 7 | ----------------------------
 8 |   - The tweet ID
 9 |   - The username of the tweet author
10 |   - The follower count of the tweet author
11 |   - The full body of the tweet
12 |   - Any hashtags used in the tweet
13 |   - The timestamp of the tweet's creation
14 |   - The language of the tweet
15 | 
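The field list above maps onto a Tweet's JSON roughly as follows. This is a minimal sketch using only the standard library, not the script's own code: `extract_fields` and `TWITTER_TIME_FORMAT` are hypothetical names, the sample message is made up for illustration, and skipping messages that lack `id_str` or `user` is an assumption about how non-Tweet stream messages (such as limit notices) should be handled.

```python
import json
from datetime import datetime

# Format of Twitter's created_at strings, e.g. "Wed Aug 27 13:08:45 +0000 2008".
# The "+0000" is matched literally, so the parsed datetime is naive (UTC assumed).
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'


def extract_fields(raw):
    """Return the stored fields as a dict, or None if raw is not a Tweet.

    Hypothetical helper for illustration; it is not part of the script.
    """
    message = json.loads(raw)

    # Assumption: anything without 'id_str' or 'user' is not a Tweet.
    if 'id_str' not in message or 'user' not in message:
        return None

    return {
        'id': message['id_str'],
        'username': message['user']['screen_name'],
        'followers': message['user']['followers_count'],
        'text': message['text'],
        'hashtags': message.get('entities', {}).get('hashtags', []),
        'language': message.get('lang'),
        'created': datetime.strptime(message['created_at'], TWITTER_TIME_FORMAT),
    }


# A made-up message shaped like a Tweet from the stream.
sample = json.dumps({
    'id_str': '1',
    'user': {'screen_name': 'example_user', 'followers_count': 42},
    'text': 'Hello #funny world',
    'entities': {'hashtags': [{'text': 'funny', 'indices': [6, 12]}]},
    'created_at': 'Wed Aug 27 13:08:45 +0000 2008',
    'lang': 'en',
})

print(extract_fields(sample))
print(extract_fields('{"limit": {"track": 5}}'))
```

Returning None for non-Tweet messages lets the caller skip them instead of failing with a KeyError.
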
16 | Dependencies:
17 | -------------
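18 |   - Tweepy (a release before 4.0, which still provides StreamListener)
19 |   - PyMongo (a release before 3.0, which still provides Connection)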
20 |   - MongoDB
21 | 
22 | 5-minute setup (assumes the dependencies are already installed):
23 | ----------------------------------------------------------------
24 |   - Have MongoDB running on localhost (port 27017), and create a database called TwitterStream
25 |   - Open the script, fill in your Twitter API credentials, and add the keywords or hashtags you want to track to the "keywords" variable
26 |   - Save it to your project folder or any easy-to-access folder
27 |   - Open your console and cd to the folder where you saved the script
28 |   - Run the command $ python twitter-to-mongo.py (or whatever you named the script)
29 |   - Watch as tweets hit the console in real time while they are stored in your database
30 |   
31 | Extras:
32 | -------
33 |   - The comments in the script are fairly detailed, so you should be able to follow what is going on and change which fields are stored.
34 |   - This is the first complete script I have written, so I apologize for any obvious errors or silly lines.
35 | 


--------------------------------------------------------------------------------
/twitter-to-mongo.py:
--------------------------------------------------------------------------------
 1 | # ===============================================
 2 | # twitter-to-mongo.py v1.0 Created by Sam Delgado
 3 | # ===============================================
 4 | from pymongo import Connection
 5 | import json
 6 | from tweepy.streaming import StreamListener
 7 | from tweepy import OAuthHandler
 8 | from tweepy import Stream
 9 | import datetime
10 | 
11 | # The MongoDB connection info. This assumes your database name is TwitterStream, and your collection name is tweets.
12 | connection = Connection('localhost', 27017)
13 | db = connection.TwitterStream
14 | db.tweets.ensure_index("id", unique=True, dropDups=True)
15 | collection = db.tweets
16 | 
17 | # Add the keywords you want to track. They can be cashtags, hashtags, or words.
18 | keywords = ['$goog', '#funny', 'ipad']
19 | 
20 | # Optional - Only grab tweets in specific languages
21 | language = ['en']
22 | 
23 | # You need to replace these with your own values that you get after creating an app on Twitter's developer portal.
24 | consumer_key = "ADD YOUR CONSUMER KEY HERE"
25 | consumer_secret = "ADD YOUR CONSUMER SECRET HERE"
26 | access_token = "ADD YOUR ACCESS TOKEN HERE"
27 | access_token_secret = "ADD YOUR ACCESS TOKEN SECRET HERE"
28 | 
29 | # The below code will get Tweets from the stream and store only the important fields to your database
30 | class StdOutListener(StreamListener):
31 | 
32 |     def on_data(self, data):
33 | 
34 |         # Load the Tweet into the variable "t"
35 |         t = json.loads(data)
36 | 
37 |         # Pull important data from the tweet to store in the database.
38 |         tweet_id = t['id_str']  # The Tweet ID from Twitter in string format
39 |         username = t['user']['screen_name']  # The username of the Tweet author
40 |         followers = t['user']['followers_count']  # The number of followers the Tweet author has
41 |         text = t['text']  # The entire body of the Tweet
42 |         hashtags = t['entities']['hashtags']  # Any hashtags used in the Tweet
43 |         dt = t['created_at']  # The timestamp of when the Tweet was created
44 |         language = t['lang']  # The language of the Tweet
45 | 
46 |         # Convert the timestamp string given by Twitter to a datetime object called "created" (naive, in UTC). This is more easily manipulated in MongoDB.
47 |         created = datetime.datetime.strptime(dt, '%a %b %d %H:%M:%S +0000 %Y')
48 | 
49 |         # Load all of the extracted Tweet data into the variable "tweet" that will be stored into the database
50 |         tweet = {'id': tweet_id, 'username': username, 'followers': followers, 'text': text, 'hashtags': hashtags, 'language': language, 'created': created}
51 | 
52 |         # Save the refined Tweet data to MongoDB
53 |         collection.save(tweet)
54 | 
55 |         # Optional - Print the username and text of each Tweet to your console in realtime as they are pulled from the stream
56 |         print(username + ': ' + text)
57 |         return True
58 | 
59 |     # Prints the HTTP status code of an error to your console
60 |     def on_error(self, status):
61 |         print(status)
62 | 
63 | # Some Tweepy code that can be left alone. It pulls from variables at the top of the script
64 | if __name__ == '__main__':
65 |     l = StdOutListener()
66 |     auth = OAuthHandler(consumer_key, consumer_secret)
67 |     auth.set_access_token(access_token, access_token_secret)
68 | 
69 |     stream = Stream(auth, l)
70 |     stream.filter(track=keywords, languages=language)
71 | 


--------------------------------------------------------------------------------