├── LICENSE.txt
├── README.md
└── install.py

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2013 Karan Luthra

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
twitter-sentiment-training
==========================

Training set of 5513 hand-classified tweets for sentiment classifiers

----

This is an upgrade to the original script by Niek J. Sanders, available [here](http://www.sananalytics.com/lab/twitter-sentiment/).
Twitter's [REST API v1.1](https://dev.twitter.com/docs/api/1.1) made it mandatory for all requests to be authenticated using [OAuth](https://dev.twitter.com/docs/auth/oauth#v1-1), so the script had to be extended with authentication support.

Consequently, you must obtain an access token and secret, and a consumer key and secret, by registering your application with Twitter in order to make such authenticated requests. Refer to [this guide](https://dev.twitter.com/docs/auth/tokens-devtwittercom) to get these tokens, then provide them as the global variables declared in the `install.py` script.

*It is advisable to read the original readme, included in the archive [here](http://www.sananalytics.com/lab/twitter-sentiment/sanders-twitter-0.2.zip), for a better understanding of the project and of the install script in particular.*

### Installation
Because of restrictions in Twitter's Terms of Service, the actual tweets cannot be distributed
with the sentiment corpus. A small Python script is included to download all of the tweets. Due
to limitations in Twitter's API, the download process takes about 43 hours.

Just four easy steps:

1. Set your access token and secret, and consumer key and secret, in the global variables declared at the beginning of `install.py` (see the sketch below).
2. Start the tweet downloader script: `python install.py`
3. Hit enter three times to accept the defaults.
4. Wait until the script indicates that it is done.

Note: the script is smart enough to resume where it left off if downloading is interrupted.
The completed corpus will be in `full-corpus.csv`. A copy of all the raw data downloaded from Twitter is kept in `rawdata/`.
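For step 1, here is a minimal sketch of the top of `install.py` with placeholder values (substitute the real credentials from your application's page on dev.twitter.com; the variable names are the ones declared in the script):

```python
# placeholder values -- substitute your own credentials
TOKEN_KEY       = "your-access-token"
TOKEN_SECRET    = "your-access-token-secret"
CONSUMER_KEY    = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
```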
----

### Credits
The original work by Niek J. Sanders is a Twitter sentiment classifier, which can be found [here](http://www.sananalytics.com/lab/twitter-sentiment/).
My work is just a small modification of the code written in 2011 to comply with the latest Twitter API v1.1 requirements.

### Support
You may write to me for any help; I'll try to help you to the best of my ability.

Karan Luthra
karanluthra06@gmail.com

--------------------------------------------------------------------------------
/install.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python2

# Modified Twitter Sentiment Corpus Install Script
#
# As per Twitter API v1.1 (https://dev.twitter.com/docs/api/1.1/overview),
# the original script written by Sanders (http://www.sananalytics.com/lab/twitter-sentiment)
# became defunct. The following changes have been made to comply with the latest API rules:
# --> authentication of requests using OAuth
# --> minor changes to handle the new response format, e.g. 'errors' in place of 'error'
#
# PLEASE NOTE:
# Provide your access token and secret, and consumer key and secret, in the
# global variables declared below. These are shown on your application's page
# at dev.twitter.com.
#
# You may go through the sanders_readme.pdf file to better understand this script.
# You may also write to me for any help; I'll try to help you to the best of my ability.
#
# - Karan Luthra
#   karanluthra06@gmail.com
#   14 November, 2013
#
#!!-----Message by original author----------!!
#
# Sanders-Twitter Sentiment Corpus Install Script
# Version 0.1
#
# Pulls tweet data from Twitter because ToS prevents distributing it directly.
#
# Right now we use unauthenticated requests, which are rate-limited to 150/hr.
# We use 125/hr to stay safe.
#
# We could more than double the download speed by using authentication with
# OAuth logins. But for now, this is too much of a PITA to implement. Just let
# the script run over a weekend and you'll have all the data.
#
# - Niek Sanders
#   njs@sananalytics.com
#   October 20, 2011
#
# Excuse the ugly code. I threw this together as quickly as possible and I
# don't normally code in Python.

import csv, json, os, time, oauth2 as oauth

# Provide your access token and secret, and consumer key and secret, in the
# global variables declared here.
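# For reference (abbreviated, assumed shape): a v1.1 error response is a JSON
# object carrying an 'errors' list rather than the old 'error' string, e.g.
#   {"errors": [{"code": 34, "message": "Sorry, that page does not exist"}]}
# which is why parse_tweet_json() below checks for the 'errors' key.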
TOKEN_KEY = ""
TOKEN_SECRET = ""
CONSUMER_KEY = ""
CONSUMER_SECRET = ""


def check_if_keys_provided():

    # check that all access/consumer tokens and keys have been provided
    if not (TOKEN_KEY and TOKEN_SECRET and CONSUMER_KEY and CONSUMER_SECRET):
        print '--> Please edit install.py and provide all the tokens/keys in the global variables declared there.\n'
        raise RuntimeError('error in authentication')

    return


def get_user_params():

    user_params = {}

    # get user input params
    user_params['inList'] = raw_input( '\nInput file [./corpus.csv]: ' )
    user_params['outList'] = raw_input( 'Results file [./full-corpus.csv]: ' )
    user_params['rawDir'] = raw_input( 'Raw data dir [./rawdata/]: ' )

    # apply defaults
    if user_params['inList'] == '':
        user_params['inList'] = './corpus.csv'
    if user_params['outList'] == '':
        user_params['outList'] = './full-corpus.csv'
    if user_params['rawDir'] == '':
        user_params['rawDir'] = './rawdata/'

    return user_params


def dump_user_params( user_params ):

    # dump user params for confirmation
    print 'Input:    ' + user_params['inList']
    print 'Output:   ' + user_params['outList']
    print 'Raw data: ' + user_params['rawDir']
    return


def read_total_list( in_filename ):

    # read the total fetch list csv
    fp = open( in_filename, 'rb' )
    reader = csv.reader( fp, delimiter=',', quotechar='"' )

    total_list = []
    for row in reader:
        total_list.append( row )

    fp.close()
    return total_list


def purge_already_fetched( fetch_list, raw_dir ):

    # list of tweet ids that still need downloading
    rem_list = []

    # check each tweet to see if we already have it
    for item in fetch_list:

        # check if the json file exists
        tweet_file = raw_dir + item[2] + '.json'
        if os.path.exists( tweet_file ):

            # attempt to parse the json file; re-fetch on failure
            try:
                parse_tweet_json( tweet_file )
                print '--> already downloaded #' + item[2]
            except RuntimeError:
                rem_list.append( item )
        else:
            rem_list.append( item )

    return rem_list


def get_time_left_str( cur_idx, fetch_list, download_pause ):

    # estimate remaining runtime from the per-tweet pause
    tweets_left = len(fetch_list) - cur_idx
    total_seconds = tweets_left * download_pause

    str_hr = int( total_seconds / 3600 )
    str_min = int((total_seconds - str_hr*3600) / 60)
    str_sec = total_seconds - str_hr*3600 - str_min*60

    return '%dh %dm %ds' % (str_hr, str_min, str_sec)


def pull_data( id, raw_dir ):

    # set the API endpoint
    url = "https://api.twitter.com/1.1/statuses/show/" + id + ".json"

    # set up instances of our Token and Consumer. These keys are given to you
    # by the API provider. The Client signs each request (nonce, timestamp and
    # HMAC-SHA1 signature) with both, as OAuth 1.0a requires; note the token
    # must be passed to the Client, or the request is only consumer-signed
    # and Twitter rejects it.
    token = oauth.Token( key=TOKEN_KEY, secret=TOKEN_SECRET )
    consumer = oauth.Consumer( key=CONSUMER_KEY, secret=CONSUMER_SECRET )
    client = oauth.Client( consumer, token )

    # make the signed request; resp holds the response headers,
    # sdata the response body as a string
    resp, sdata = client.request( url )

    # convert the string data into a dictionary
    data = json.loads( sdata )

    # write the response as a .json file
    with open( raw_dir + id + '.json', 'wb' ) as outfile:
        json.dump( data, outfile, indent=1, separators=(',', ':') )

    return
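# Illustrative only (not called anywhere in this script): the same signed
# pattern can be used to sanity-check your credentials before committing to
# the ~43 hour run, e.g. against the v1.1 rate_limit_status endpoint.
def _check_credentials():
    token = oauth.Token( key=TOKEN_KEY, secret=TOKEN_SECRET )
    consumer = oauth.Consumer( key=CONSUMER_KEY, secret=CONSUMER_SECRET )
    client = oauth.Client( consumer, token )
    resp, sdata = client.request(
        'https://api.twitter.com/1.1/application/rate_limit_status.json' )
    # httplib2 reports the HTTP status as a string in the response dict
    print '--> credential check HTTP status: ' + resp['status']
    return resp['status'] == '200'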
def download_tweets( fetch_list, raw_dir ):

    # ensure raw data directory exists
    if not os.path.exists( raw_dir ):
        os.mkdir( raw_dir )

    # stay within rate limits
    max_tweets_per_hr = 125
    download_pause_sec = 3600 / max_tweets_per_hr

    # download tweets
    for idx in range(0, len(fetch_list)):

        # current item
        item = fetch_list[idx]

        # print status
        trem = get_time_left_str( idx, fetch_list, download_pause_sec )
        print '--> downloading tweet #%s (%d of %d) (%s left)' % \
              (item[2], idx+1, len(fetch_list), trem)

        # pull data
        pull_data( item[2], raw_dir )

        # stay within Twitter API rate limits
        print '    pausing %d sec to obey Twitter API rate limits' % \
              (download_pause_sec)
        time.sleep( download_pause_sec )

    return


def parse_tweet_json( filename ):

    # read tweet
    print 'opening: ' + filename
    fp = open( filename, 'rb' )

    # parse json, closing the file whether or not parsing succeeds
    try:
        tweet_json = json.load( fp )
    except ValueError:
        raise RuntimeError('error parsing json')
    finally:
        fp.close()

    # look for twitter api error msgs
    if 'errors' in tweet_json:
        raise RuntimeError('error in downloaded tweet')

    # extract creation date and tweet text
    return [ tweet_json['created_at'], tweet_json['text'] ]
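# For reference (inferred from the column usage in this script; the id is
# illustrative): each row of the input corpus.csv is expected to look like
#   "apple","positive","126415614616154112"
# i.e. item[0] = topic, item[1] = sentiment, item[2] = tweet id, matching the
# first three columns of the output header written below.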
def build_output_corpus( out_filename, raw_dir, total_list ):

    # open csv output file
    fp = open( out_filename, 'wb' )
    writer = csv.writer( fp, delimiter=',', quotechar='"', escapechar='\\',
                         quoting=csv.QUOTE_ALL )

    # write header row
    writer.writerow( ['Topic','Sentiment','TweetId','TweetDate','TweetText'] )

    # parse all downloaded tweets
    missing_count = 0
    for item in total_list:

        # ensure tweet exists
        if os.path.exists( raw_dir + item[2] + '.json' ):

            try:
                # parse tweet
                parsed_tweet = parse_tweet_json( raw_dir + item[2] + '.json' )
                full_row = item + parsed_tweet

                # character encoding for output
                for i in range(0, len(full_row)):
                    full_row[i] = full_row[i].encode("utf-8")

                # write csv row
                writer.writerow( full_row )

            except RuntimeError:
                print '--> bad data in tweet #' + item[2]
                missing_count += 1

        else:
            print '--> missing tweet #' + item[2]
            missing_count += 1

    fp.close()

    # indicate success
    if missing_count == 0:
        print '\nSuccessfully downloaded corpus!'
        print 'Output in: ' + out_filename + '\n'
    else:
        print '\nMissing %d of %d tweets!' % (missing_count, len(total_list))
        print 'Partial output in: ' + out_filename + '\n'

    return


def main():

    check_if_keys_provided()

    # get user parameters
    user_params = get_user_params()
    dump_user_params( user_params )

    # get fetch list
    total_list = read_total_list( user_params['inList'] )
    fetch_list = purge_already_fetched( total_list, user_params['rawDir'] )

    # start fetching data from twitter
    download_tweets( fetch_list, user_params['rawDir'] )

    # second pass for any failed downloads
    print '\nStarting second pass to retry any failed downloads'
    fetch_list = purge_already_fetched( total_list, user_params['rawDir'] )
    download_tweets( fetch_list, user_params['rawDir'] )

    # build output corpus
    build_output_corpus( user_params['outList'], user_params['rawDir'],
                         total_list )

    return


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------