├── LICENSE.txt
├── README.md
└── install.py

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2013 Karan Luthra

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
twitter-sentiment-training
==========================

Training set of 5513 hand-classified tweets for sentiment classifiers

----

This is an upgrade to the original script by Niek J. Sanders, available [here](http://www.sananalytics.com/lab/twitter-sentiment/).
Twitter's [REST API v1.1](https://dev.twitter.com/docs/api/1.1) made it mandatory for all requests to be authenticated using [OAuth](https://dev.twitter.com/docs/auth/oauth#v1-1), so the script had to be extended with authentication support.

Consequently, you must obtain an access token and secret, and a consumer key and secret, by registering your application with Twitter in order to make such authenticated requests. Refer to [this guide](https://dev.twitter.com/docs/auth/tokens-devtwittercom) to get these tokens, then provide them as the global variables declared in the `install.py` script.

*It is advisable to read the original readme, included in the archive [here](http://www.sananalytics.com/lab/twitter-sentiment/sanders-twitter-0.2.zip), for a better understanding of the project and of the install script in particular.*

### Installation
Because of restrictions in Twitter's Terms of Service, the actual tweets cannot be distributed
with the sentiment corpus. A small Python script is included to download all of the tweets. Due
to limitations in Twitter's API, the download process takes about 43 hours.

Just four easy steps:

1. Set your access token and secret, and consumer key and secret, in the global variables declared at the beginning of `install.py` (see the sketch below).
2. Start the tweet downloader script: `python install.py`
3. Hit enter three times to accept the defaults.
4. Wait until the script indicates that it is done.

Note: the script is smart enough to resume where it left off if downloading is interrupted.
The completed corpus will be in `full-corpus.csv`. A copy of all the raw data downloaded from Twitter is kept in `rawdata/`.
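For step 1, here is a minimal sketch of the top of `install.py` with placeholder values (substitute the real credentials from your application's page on dev.twitter.com; the variable names are the ones declared in the script):

```python
# placeholder values -- substitute your own credentials
TOKEN_KEY       = "your-access-token"
TOKEN_SECRET    = "your-access-token-secret"
CONSUMER_KEY    = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
```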
----

### Credits
The original work by Niek J. Sanders is a Twitter sentiment classifier, which can be found [here](http://www.sananalytics.com/lab/twitter-sentiment/).
My work is just a small modification of the code written in 2011 to comply with the latest Twitter API v1.1 requirements.

### Support
You may write to me for any help; I'll try to help you to the best of my ability.

Karan Luthra
karanluthra06@gmail.com

--------------------------------------------------------------------------------
/install.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python2

# Modified Twitter Sentiment Corpus Install Script
#
# As per Twitter API v1.1 (https://dev.twitter.com/docs/api/1.1/overview),
# the original script written by Sanders (http://www.sananalytics.com/lab/twitter-sentiment)
# became defunct. The following changes have been made to comply with the latest API rules:
# --> authentication of requests using OAuth
# --> minor changes to handle the new response format, e.g. 'errors' in place of 'error'
#
# PLEASE NOTE:
# Provide your access token and secret, and consumer key and secret, in the
# global variables declared below. These are shown on your application's page
# at dev.twitter.com.
#
# You may go through the sanders_readme.pdf file to better understand this script.
# You may also write to me for any help; I'll try to help you to the best of my ability.
#
# - Karan Luthra
#   karanluthra06@gmail.com
#   14 November, 2013
#
#!!-----Message by original author----------!!
#
# Sanders-Twitter Sentiment Corpus Install Script
# Version 0.1
#
# Pulls tweet data from Twitter because ToS prevents distributing it directly.
#
# Right now we use unauthenticated requests, which are rate-limited to 150/hr.
# We use 125/hr to stay safe.
#
# We could more than double the download speed by using authentication with
# OAuth logins. But for now, this is too much of a PITA to implement. Just let
# the script run over a weekend and you'll have all the data.
#
# - Niek Sanders
#   njs@sananalytics.com
#   October 20, 2011
#
# Excuse the ugly code. I threw this together as quickly as possible and I
# don't normally code in Python.

import csv, json, os, time, oauth2 as oauth

# Provide your access token and secret, and consumer key and secret, in the
# global variables declared here.
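# For reference (abbreviated, assumed shape): a v1.1 error response is a JSON
# object carrying an 'errors' list rather than the old 'error' string, e.g.
#   {"errors": [{"code": 34, "message": "Sorry, that page does not exist"}]}
# which is why parse_tweet_json() below checks for the 'errors' key.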
TOKEN_KEY = ""
TOKEN_SECRET = ""
CONSUMER_KEY = ""
CONSUMER_SECRET = ""


def check_if_keys_provided():

    # check that all access/consumer tokens and keys have been provided
    if not (TOKEN_KEY and TOKEN_SECRET and CONSUMER_KEY and CONSUMER_SECRET):
        print '--> Please edit install.py and provide all the tokens/keys in the global variables declared there.\n'
        raise RuntimeError('error in authentication')

    return


def get_user_params():

    user_params = {}

    # get user input params
    user_params['inList'] = raw_input( '\nInput file [./corpus.csv]: ' )
    user_params['outList'] = raw_input( 'Results file [./full-corpus.csv]: ' )
    user_params['rawDir'] = raw_input( 'Raw data dir [./rawdata/]: ' )

    # apply defaults
    if user_params['inList'] == '':
        user_params['inList'] = './corpus.csv'
    if user_params['outList'] == '':
        user_params['outList'] = './full-corpus.csv'
    if user_params['rawDir'] == '':
        user_params['rawDir'] = './rawdata/'

    return user_params


def dump_user_params( user_params ):

    # dump user params for confirmation
    print 'Input:    ' + user_params['inList']
    print 'Output:   ' + user_params['outList']
    print 'Raw data: ' + user_params['rawDir']
    return


def read_total_list( in_filename ):

    # read the total fetch list csv
    fp = open( in_filename, 'rb' )
    reader = csv.reader( fp, delimiter=',', quotechar='"' )

    total_list = []
    for row in reader:
        total_list.append( row )

    fp.close()
    return total_list


def purge_already_fetched( fetch_list, raw_dir ):

    # list of tweet ids that still need downloading
    rem_list = []

    # check each tweet to see if we already have it
    for item in fetch_list:

        # check if the json file exists
        tweet_file = raw_dir + item[2] + '.json'
        if os.path.exists( tweet_file ):

            # attempt to parse the json file; re-fetch on failure
            try:
                parse_tweet_json( tweet_file )
                print '--> already downloaded #' + item[2]
            except RuntimeError:
                rem_list.append( item )
        else:
            rem_list.append( item )

    return rem_list


def get_time_left_str( cur_idx, fetch_list, download_pause ):

    # estimate remaining runtime from the per-tweet pause
    tweets_left = len(fetch_list) - cur_idx
    total_seconds = tweets_left * download_pause

    str_hr = int( total_seconds / 3600 )
    str_min = int((total_seconds - str_hr*3600) / 60)
    str_sec = total_seconds - str_hr*3600 - str_min*60

    return '%dh %dm %ds' % (str_hr, str_min, str_sec)


def pull_data( id, raw_dir ):

    # set the API endpoint
    url = "https://api.twitter.com/1.1/statuses/show/" + id + ".json"

    # set up instances of our Token and Consumer. These keys are given to you
    # by the API provider. The Client signs each request (nonce, timestamp and
    # HMAC-SHA1 signature) with both, as OAuth 1.0a requires; note the token
    # must be passed to the Client, or the request is only consumer-signed
    # and Twitter rejects it.
    token = oauth.Token( key=TOKEN_KEY, secret=TOKEN_SECRET )
    consumer = oauth.Consumer( key=CONSUMER_KEY, secret=CONSUMER_SECRET )
    client = oauth.Client( consumer, token )

    # make the signed request; resp holds the response headers,
    # sdata the response body as a string
    resp, sdata = client.request( url )

    # convert the string data into a dictionary
    data = json.loads( sdata )

    # write the response as a .json file
    with open( raw_dir + id + '.json', 'wb' ) as outfile:
        json.dump( data, outfile, indent=1, separators=(',', ':') )

    return
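# Illustrative only (not called anywhere in this script): the same signed
# pattern can be used to sanity-check your credentials before committing to
# the ~43 hour run, e.g. against the v1.1 rate_limit_status endpoint.
def _check_credentials():
    token = oauth.Token( key=TOKEN_KEY, secret=TOKEN_SECRET )
    consumer = oauth.Consumer( key=CONSUMER_KEY, secret=CONSUMER_SECRET )
    client = oauth.Client( consumer, token )
    resp, sdata = client.request(
        'https://api.twitter.com/1.1/application/rate_limit_status.json' )
    # httplib2 reports the HTTP status as a string in the response dict
    print '--> credential check HTTP status: ' + resp['status']
    return resp['status'] == '200'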
def download_tweets( fetch_list, raw_dir ):

    # ensure raw data directory exists
    if not os.path.exists( raw_dir ):
        os.mkdir( raw_dir )

    # stay within rate limits
    max_tweets_per_hr = 125
    download_pause_sec = 3600 / max_tweets_per_hr

    # download tweets
    for idx in range(0, len(fetch_list)):

        # current item
        item = fetch_list[idx]

        # print status
        trem = get_time_left_str( idx, fetch_list, download_pause_sec )
        print '--> downloading tweet #%s (%d of %d) (%s left)' % \
              (item[2], idx+1, len(fetch_list), trem)

        # pull data
        pull_data( item[2], raw_dir )

        # stay within Twitter API rate limits
        print '    pausing %d sec to obey Twitter API rate limits' % \
              (download_pause_sec)
        time.sleep( download_pause_sec )

    return


def parse_tweet_json( filename ):

    # read tweet
    print 'opening: ' + filename
    fp = open( filename, 'rb' )

    # parse json, closing the file whether or not parsing succeeds
    try:
        tweet_json = json.load( fp )
    except ValueError:
        raise RuntimeError('error parsing json')
    finally:
        fp.close()

    # look for twitter api error msgs
    if 'errors' in tweet_json:
        raise RuntimeError('error in downloaded tweet')

    # extract creation date and tweet text
    return [ tweet_json['created_at'], tweet_json['text'] ]
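# For reference (inferred from the column usage in this script; the id is
# illustrative): each row of the input corpus.csv is expected to look like
#   "apple","positive","126415614616154112"
# i.e. item[0] = topic, item[1] = sentiment, item[2] = tweet id, matching the
# first three columns of the output header written below.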
def build_output_corpus( out_filename, raw_dir, total_list ):

    # open csv output file
    fp = open( out_filename, 'wb' )
    writer = csv.writer( fp, delimiter=',', quotechar='"', escapechar='\\',
                         quoting=csv.QUOTE_ALL )

    # write header row
    writer.writerow( ['Topic','Sentiment','TweetId','TweetDate','TweetText'] )

    # parse all downloaded tweets
    missing_count = 0
    for item in total_list:

        # ensure tweet exists
        if os.path.exists( raw_dir + item[2] + '.json' ):

            try:
                # parse tweet
                parsed_tweet = parse_tweet_json( raw_dir + item[2] + '.json' )
                full_row = item + parsed_tweet

                # character encoding for output
                for i in range(0, len(full_row)):
                    full_row[i] = full_row[i].encode("utf-8")

                # write csv row
                writer.writerow( full_row )

            except RuntimeError:
                print '--> bad data in tweet #' + item[2]
                missing_count += 1

        else:
            print '--> missing tweet #' + item[2]
            missing_count += 1

    fp.close()

    # indicate success
    if missing_count == 0:
        print '\nSuccessfully downloaded corpus!'
        print 'Output in: ' + out_filename + '\n'
    else:
        print '\nMissing %d of %d tweets!' % (missing_count, len(total_list))
        print 'Partial output in: ' + out_filename + '\n'

    return


def main():

    check_if_keys_provided()

    # get user parameters
    user_params = get_user_params()
    dump_user_params( user_params )

    # get fetch list
    total_list = read_total_list( user_params['inList'] )
    fetch_list = purge_already_fetched( total_list, user_params['rawDir'] )

    # start fetching data from twitter
    download_tweets( fetch_list, user_params['rawDir'] )

    # second pass for any failed downloads
    print '\nStarting second pass to retry any failed downloads'
    fetch_list = purge_already_fetched( total_list, user_params['rawDir'] )
    download_tweets( fetch_list, user_params['rawDir'] )

    # build output corpus
    build_output_corpus( user_params['outList'], user_params['rawDir'],
                         total_list )

    return


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------