├── .gitignore
├── README.md
├── datasets
│   ├── ABSA-SemEval2014
│   │   ├── Laptop_Train_v2.xml
│   │   ├── Laptops_Test_Data_PhaseA.xml
│   │   ├── Laptops_Test_Data_phaseB.xml
│   │   ├── README.txt
│   │   ├── Restaurants_Test_Data_PhaseA.xml
│   │   ├── Restaurants_Test_Data_phaseB.xml
│   │   ├── Restaurants_Train.xml
│   │   ├── Restaurants_Train_v2.xml
│   │   ├── SemEvalSchema.xsd
│   │   ├── baselinesystemdescription.pdf
│   │   ├── eval.jar
│   │   ├── laptops-trial.xml
│   │   ├── restaurants-trial.xml
│   │   ├── semeval14_absa_annotationguidelines.pdf
│   │   ├── semeval_base.py
│   │   └── submission-guidelines.pdf
│   ├── ABSA-SemEval2015
│   │   ├── ABSA-15_Laptops_Train_Data.xml
│   │   ├── ABSA-15_Restaurants_Train_Final.xml
│   │   ├── ABSA15_Hotels_Test.xml
│   │   ├── ABSA15_Laptops_Test.xml
│   │   ├── ABSA15_Restaurants_Test.xml
│   │   ├── absa-2015_laptops_trial.xml
│   │   ├── absa-2015_restaurants_trial.xml
│   │   └── guidelines
│   │       ├── SemEval2015_ABSA_Laptops_AnnotationGuidelines.pdf
│   │       └── semeval2015_absa_restaurants_annotationguidelines.pdf
│   ├── ABSA-SemEval2016
│   │   └── Training_Data
│   │       ├── ABSA16FR-RestaurantsTrain
│   │       │   ├── ABSA16FR-download.jar
│   │       │   ├── ABSA16FR_Restaurants_Train.xml
│   │       │   ├── ABSA16FR_Restaurants_guidelines.pdf
│   │       │   ├── ABSA16FR_Restaurants_index.txt
│   │       │   └── README.txt
│   │       ├── ABSA16_Laptops_Train_English_SB2.xml
│   │       ├── ABSA16_Laptops_Train_SB1_v2.xml
│   │       ├── ABSA16_Restaurants_Train_English_SB2.xml
│   │       ├── ABSA16_Restaurants_Train_SB1_v2.xml
│   │       ├── restaurants_dutch_training.xml
│   │       └── restaurants_dutch_training_textlevel.xml
│   ├── Aspect-based-Sentiment-Analysis-Dataset.ipynb
│   └── CR
│       ├── Apex AD2600 Progressive-scan DVD player.txt
│       ├── Canon G3.txt
│       ├── Creative Labs Nomad Jukebox Zen Xtra 40GB.txt
│       ├── Nikon coolpix 4300.txt
│       ├── Nokia 6610.txt
│       └── Readme.txt
├── libraries
│   ├── __init__.py
│   └── baselines.py
├── run.py
└── stanford_corenlp_python
    ├── LICENSE
    ├── README.md
    ├── __init__.py
    ├── client.py
    ├── corenlp.py
    ├── default.properties
    ├── jsonrpc.py
    └── progressbar.py

/.gitignore:
--------------------------------------------------------------------------------
1 | # 3rd party libs
2 | stanford_corenlp_python/stanford-corenlp-full
3 | stanford_corenlp_python/stanford-corenlp-full-2015-12-09/
4 | 
5 | # python bytecode files
6 | *.pyc
7 | 
8 | # zip file
9 | *.zip
10 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Aspect-based Sentiment Analysis
2 | 
3 | [Sentiment Analysis Datasets](https://www.w3.org/community/sentiment/wiki/Datasets): a collection of many different datasets for sentiment analysis
4 | 
5 | [TASS 2016](http://www.sepln.org/workshops/tass/2016/tass2016.php): a sentiment analysis workshop focused on Spanish
6 | 
7 | [Yelp Dataset](https://www.yelp.com/dataset_challenge): the Yelp reviews dataset
8 | 
--------------------------------------------------------------------------------
/datasets/ABSA-SemEval2014/README.txt:
--------------------------------------------------------------------------------
1 | Aspect Based Sentiment Analysis (ABSA)
2 | Task 4 of SemEval 2014
3 | -----------------------------------------------------
4 | 
5 | This folder contains scripts/code for:
6 | 
7 | A. Running the ABSA baselines.
8 | B. Evaluating the output of your system.
9 | C. Validating the XML file that you will submit to ABSA 2014.
10 | 
11 | 
12 | Running the Baselines
13 | -----------------------
14 | 
15 | The semeval_base.py script is an implementation of the baselines of SemEval Task 4 (Aspect Based Sentiment Analysis).
16 | A high level description of them can be found at the following address:
17 | 
18 | http://alt.qcri.org/semeval2014/task4/data/uploads/baselinesystemdescription.pdf
19 | 
20 | By running python semeval_base.py from your shell, a list of possible options will be displayed.
21 | (**Caution: We tested semeval_base.py only on Linux.)
22 | 
23 | Assuming that rest.xml and lap.xml are the training data for the restaurants and laptops
24 | domain, respectively, we recommend you run the baselines as follows:
25 | 
26 | -- restaurants
27 | 
28 | python semeval_base.py --train rest.xml --task 5
29 | 
30 | It reads the given data (rest.xml) and splits it into a training part (absa--train.xml) and a test part (absa--test.xml) using an 80:20 ratio.
31 | Then, it tags the sentences of the test part with the found aspect terms and categories and stores the result in absa--test.predicted-stageI.xml.
32 | absa--test.gold.xml contains the gold (correct) aspect terms and categories.
33 | 
34 | 
35 | python semeval_base.py --train rest.xml --task 6
36 | 
37 | It reads the given data (rest.xml) and splits it into a training part (absa--train.xml) and a test part (absa--test.xml) using an 80:20 ratio.
38 | Then, it finds the polarity of the aspect terms and categories of the test part and stores the result in absa--test.predicted-stageII.xml.
39 | absa--test.gold.xml contains the gold (correct) polarities.
40 | 
41 | -- laptops
42 | 
43 | python semeval_base.py --train lap.xml --task 1
44 | 
45 | It reads the given data (lap.xml), splits it into a training part (absa--train.xml) and a test part (absa--test.xml) using an 80:20 ratio.
46 | Then, it tags the sentences of the test part with the found aspect terms and stores the result in absa--test.predicted-aspect.xml.
47 | absa--test.gold.xml contains the gold (correct) aspect terms and categories.
48 | 
49 | python semeval_base.py --train lap.xml --task 3
50 | 
51 | It reads the given data (lap.xml), splits it into a training part (absa--train.xml) and a test part (absa--test.xml) using an 80:20 ratio.
52 | Then, it finds the polarity of the aspect terms of the test part and stores the result in absa--test.predicted-stageII.xml.
53 | absa--test.gold.xml contains the gold (correct) polarities.
54 | 
55 | 
56 | In all cases above, the baseline script calculates and displays evaluation scores (precision, recall, and F1 for aspect term and aspect category extraction; accuracy for aspect term and aspect category polarity detection).
57 | 
58 | 
59 | Evaluation
60 | -----------------------
61 | 
62 | java -cp ./eval.jar Main.Aspects test.xml ref.xml
63 | 
64 | It calculates and displays the precision, recall and F1 for aspect term and category extraction for a system that generated test.xml, comparing it to ref.xml, which contains
65 | the gold (correct) annotations. The same measures are also calculated and displayed by semeval_base.py.
66 | 
67 | java -cp ./eval.jar Main.Polarity test.xml ref.xml
68 | 
69 | In contrast to semeval_base.py, which calculates only the overall accuracy for the polarity detection task, the above command also calculates F1, precision and recall
70 | for all labels (positive|negative|neutral|conflict). As previously, test.xml is the file that the system generated and ref.xml
71 | is the one that contains the gold (correct) annotations.
72 | 
73 | 
74 | Submit your system
75 | -----------------------
76 | 
77 | The Aspect Based Sentiment Analysis task will run in two stages.
78 | 
79 | In the first stage, you will be provided with an XML file that will contain a set of sentences.
80 | If you want to participate in this stage, you have to return a file tagged with the aspect terms and categories in the same way they are tagged in the training data. 81 | 82 | In the second stage, we will provide you with the correct aspect terms and categories 83 | and you will have to find their polarity (positive|negative|neutral|conflict) and tag them as in the training data. 84 | 85 | 86 | Before uploading your results (for stage one or two), we highly recommend you validate (as shown below) the XML your system produced against the provided XSD schema (SemEvalSchema.xsd). 87 | This way you will verify that your XML output is well-formed and can be processed/parsed by our evaluation scripts. 88 | 89 | java -cp ./eval.jar Main.Valid test.xml SemEvalSchema.xsd 90 | 91 | 92 | 93 | 94 | -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2014/SemEvalSchema.xsd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2014/baselinesystemdescription.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/datasets/ABSA-SemEval2014/baselinesystemdescription.pdf -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2014/eval.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/datasets/ABSA-SemEval2014/eval.jar -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2014/laptops-trial.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | I liked the aluminum body. 5 | 6 | 7 | 8 | 9 | 10 | Lightweight and the screen is beautiful! 11 | 12 | 13 | 14 | 15 | 16 | Buy it, love it, and I promise you won't regret it. 17 | 18 | 19 | From the build quality to the performance, everything about it has been sub-par from what I would have expected from Apple. 20 | 21 | 22 | 23 | 24 | 25 | 26 | pretty much everything else about the computer is good it just stops working out of no were. 27 | 28 | 29 | Originally bought it for my wife. 30 | 31 | 32 | It was truly a great computer costing less than one thousand bucks before tax. 33 | 34 | 35 | 36 | 37 | 38 | I bought this laptop on Saturday and am completely in love with it! 39 | 40 | 41 | If you don't like fingerprints, this might not be the laptop for you. 42 | 43 | 44 | Boots up fast and runs great! 45 | 46 | 47 | 48 | 49 | 50 | 51 | Call tech support, standard email the form and fax it back in to us. 52 | 53 | 54 | 55 | 56 | 57 | I did contact HP and share how unhappy I am. 58 | 59 | 60 | This is my second MacBook. 61 | 62 | 63 | The service I received from Toshiba went above and beyond the call of duty. 64 | 65 | 66 | 67 | 68 | 69 | I would recommend it just because of the internet speed probably because thats the only thing i really care about. 
70 | 71 | 72 | 73 | 74 | 75 | This is my 3rd Apple Laptop and first MacBook Pro. 76 | 77 | 78 | I have had this laptop for a few months now and i would say im pretty satisfied. 79 | 80 | 81 | The love part of my relationship with this laptop doesn't take very long. 82 | 83 | 84 | The screen shows great colors. 85 | 86 | 87 | 88 | 89 | 90 | Dells are ok, HPs aren't that good, but Macs or Fantastic. 91 | 92 | 93 | The battery life has not decreased since I bought it, so i'm thrilled with that. 94 | 95 | 96 | 97 | 98 | 99 | The price and features more than met my needs. 100 | 101 | 102 | 103 | 104 | 105 | 106 | the mouse buttons are hard to push. 107 | 108 | 109 | 110 | 111 | 112 | Just don't waste your time and money on this. 113 | 114 | 115 | And I'm still paying the bloody financing, for a product which didn't last me at least three years! 116 | 117 | 118 | Bought it to use mostly for oline classes. 119 | 120 | 121 | I purchased an HP right after my high school graduation. 122 | 123 | 124 | Me and my boyfriend bought the Gateway NV78 in nov of 09. 125 | 126 | 127 | This is the complete opposite to an ergonomic design. 128 | 129 | 130 | 131 | 132 | 133 | The technical service for dell is so 3rd world it might as well not even bother. 134 | 135 | 136 | 137 | 138 | 139 | The built in camera is very useful when chatting with other techs in remote buildings on our campus. 140 | 141 | 142 | 143 | 144 | 145 | Not super fancy, but not super expensive either. 146 | 147 | 148 | Keyboard good sized and wasy to use. 149 | 150 | 151 | 152 | 153 | 154 | Great wifi too. 155 | 156 | 157 | 158 | 159 | 160 | The Dell mini was the first Dell product that I had ever purchased. 161 | 162 | 163 | My Mac has gone from being a trusted friend to an adversary. 164 | 165 | 166 | I have been a mac user since the mid 90s. 167 | 168 | 169 | My HP is very heavy. 170 | 171 | 172 | big mistake! 173 | 174 | 175 | Best Buy was great as always and accepted the return and gave me another model 1764. 176 | 177 | 178 | I would buy this lap top over and over again! 179 | 180 | 181 | Games being the main issue. 182 | 183 | 184 | 185 | 186 | 187 | My previous purchases were with Dell and HP. 188 | 189 | 190 | The price is another driving influence that made me purchase this laptop. 191 | 192 | 193 | 194 | 195 | 196 | But see the macbook pro is different because it may have a huge price tag but it comes with the full software that you would actually need and most of it has free future updates. 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | It's a great product for a great price! 205 | 206 | 207 | 208 | 209 | 210 | Excellent speed for processing data. 211 | 212 | 213 | 214 | 215 | 216 | All in all, I'm incredibly dissatisfied with this laptop, and with HP as a whole. 217 | 218 | 219 | In my opinion it was not as user friendly as I expected either. 220 | 221 | 222 | The Macbook arrived in a nice twin packing and sealed in the box, all the functions works great. 223 | 224 | 225 | 226 | 227 | 228 | 229 | The switchable graphic card is pretty sweet when you want gaming on the laptop. 230 | 231 | 232 | 233 | 234 | 235 | 236 | I would like at least a 4 hr. battery life. 237 | 238 | 239 | 240 | 241 | 242 | Looking online, many people are having the same problem. 243 | 244 | 245 | It has good speed and plenty of hard drive space. 246 | 247 | 248 | 249 | 250 | 251 | 252 | The driver updates don't fix the issue, very frustrating. 
253 | 254 | 255 | 256 | 257 | 258 | Dell wanted to charge us for everything everytime I called them with a problem. 259 | 260 | 261 | THIS HAS BEEN NOTHING BUT A HEADACHE SINCE WE PURCHASED IT. 262 | 263 | 264 | Im glad that it has such great features in it. 265 | 266 | 267 | 268 | 269 | 270 | it's just a great toy to have around. 271 | 272 | 273 | When this happened I would have to completely power off my computer and restart it. 274 | 275 | 276 | I always have used a tower home PC and jumped to the laptop and have been very satisfied with its performance. 277 | 278 | 279 | 280 | 281 | 282 | I asked how they would determine that since there are no scratches, dents or other signs of damage and was told that was the only way this type of damage could happen. 283 | 284 | 285 | I had finally reached my limit and broke down. 286 | 287 | 288 | and dell and best buy both refused to take it back after i only had it for 1 hour.... 289 | 290 | 291 | I burned my leg, after lifting it from my desk, and for less than 5 second putting it on my lap to clean my coffee table, so I can place it there. 292 | 293 | 294 | and its really cheap and you wont regret buying it. 295 | 296 | 297 | It was over rated! 298 | 299 | 300 | I dont understand how anyone can think this is a great product worth purchasing. 301 | 302 | 303 | My sister has the same Mac as me and she is in a band and uses GarageBand to record and edit. 304 | 305 | 306 | 307 | 308 | 309 | and looks very sexyy:D really the mac book pro is the best laptop specially for students in college if you are not caring about price. 310 | 311 | 312 | 313 | 314 | 315 | They definitely have a superior product! 316 | 317 | 318 | This is a great laptop and I would recommend it to anyone. 319 | 320 | 321 | It is very user friendly and not hard to figure out at all. 322 | 323 | 324 | They are wonderful, but very dangerous when it comes to emitting heat. 325 | 326 | 327 | Came fully loaded -good. 328 | 329 | 330 | It super shiny, so you can see the fingerprints easily. 331 | 332 | 333 | It's a great prodcut to handle basic computing needs. 334 | 335 | 336 | the laptop was really good and it goes really fast just the way i thought it would of run. 337 | 338 | 339 | It doesnt work worth a damn. 340 | 341 | 342 | It was definelty a smart move. 343 | 344 | 345 | Overall though, for the money spent it's a great deal. 346 | 347 | 348 | A little pricey but it is well, well worth it. 349 | 350 | 351 | It is meant to be PORTABLE. 352 | 353 | 354 | I am totally satisfied with my little toshie! 355 | 356 | 357 | Also, the extended warranty was a problem. 358 | 359 | 360 | 361 | 362 | 363 | My opinion of Sony has been dropping as fast as the stock market, given their horrible support, but this machine just caused another plunge. 364 | 365 | 366 | 367 | 368 | 369 | I waited and waited and no laptop. 370 | 371 | 372 | Everything is falling apart internally and externally. 373 | 374 | 375 | I also liked the glass screen. 376 | 377 | 378 | 379 | 380 | 381 | When I called Toshiba, they would not do anything and even tried to charge me $35 for the phone call, even though they didn't offer any technical support. 382 | 383 | 384 | 385 | 386 | 387 | I can actually get work done with this MAC, and not fight with it like my tired old PC laptop. 388 | 389 | 390 | If you're not wanting to be mobile, this is a good laptop to sit on a desk. 391 | 392 | 393 | I also travel with it and it never gives me any problems. 
394 | 395 | 396 | (Beware, their staff could send you back making you feel that only they know what a computer is. 397 | 398 | 399 | 400 | 401 | 402 | Other Thoughts: Do not purchase this product. 403 | 404 | 405 | The computer was delivered as promised. 406 | 407 | 408 | The only thing that I don't like about my mac is that sometimes there are programs that I want to be able to run and I am not able to. 409 | 410 | 411 | 412 | 413 | 414 | Wireless has not been a issue for me, like some others have meantioned. 415 | 416 | 417 | 418 | 419 | 420 | MacBook Notebooks quickly die out because of their short battery life, as well as the many background programs that run without the user's knowlede. 421 | 422 | 423 | 424 | 425 | 426 | 427 | All for such a great price. 428 | 429 | 430 | 431 | 432 | 433 | -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2014/semeval14_absa_annotationguidelines.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/datasets/ABSA-SemEval2014/semeval14_absa_annotationguidelines.pdf -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2014/semeval_base.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | ''' 4 | **Baseline methods for the 4th task of SemEval 2014** 5 | 6 | Run a task from the terminal:: 7 | >>> python baselines.py -t file -m taskNum 8 | 9 | or, import within python. E.g., for Aspect Term Extraction:: 10 | from baselines import * 11 | corpus = Corpus(ET.parse(trainfile).getroot().findall('sentence')) 12 | unseen = Corpus(ET.parse(testfile).getroot().findall('sentence')) 13 | b1 = BaselineAspectExtractor(corpus) 14 | predicted = b1.tag(unseen.corpus) 15 | corpus.write_out('%s--test.predicted-aspect.xml'%domain_name, predicted, short=False) 16 | 17 | Similarly, for Aspect Category Detection, Aspect Term Polarity Estimation, and Aspect Category Polarity Estimation. 18 | ''' 19 | 20 | __author__ = "J. Pavlopoulos" 21 | __credits__ = "J. Pavlopoulos, D. Galanis, I. Androutsopoulos" 22 | __license__ = "GPL" 23 | __version__ = "1.0.1" 24 | __maintainer__ = "John Pavlopoulos" 25 | __email__ = "annis@aueb.gr" 26 | 27 | try: 28 | import xml.etree.ElementTree as ET, getopt, logging, sys, random, re, copy 29 | from xml.sax.saxutils import escape 30 | except: 31 | sys.exit('Some package is missing... 
Perhaps ?') 32 | 33 | logging.basicConfig(level=logging.INFO) 34 | logger = logging.getLogger(__name__) 35 | 36 | # Stopwords, imported from NLTK (v 2.0.4) 37 | stopwords = set( 38 | ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 39 | 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 40 | 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 41 | 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 42 | 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 43 | 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 44 | 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 45 | 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 46 | 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']) 47 | 48 | 49 | def fd(counts): 50 | '''Given a list of occurrences (e.g., [1,1,1,2]), return a dictionary of frequencies (e.g., {1:3, 2:1}.)''' 51 | d = {} 52 | for i in counts: d[i] = d[i] + 1 if i in d else 1 53 | return d 54 | 55 | 56 | freq_rank = lambda d: sorted(d, key=d.get, reverse=True) 57 | '''Given a map, return ranked the keys based on their values.''' 58 | 59 | 60 | def fd2(counts): 61 | '''Given a list of 2-uplets (e.g., [(a,pos), (a,pos), (a,neg), ...]), form a dict of frequencies of specific items (e.g., {a:{pos:2, neg:1}, ...}).''' 62 | d = {} 63 | for i in counts: 64 | # If the first element of the 2-uplet is not in the map, add it. 65 | if i[0] in d: 66 | if i[1] in d[i[0]]: 67 | d[i[0]][i[1]] += 1 68 | else: 69 | d[i[0]][i[1]] = 1 70 | else: 71 | d[i[0]] = {i[1]: 1} 72 | return d 73 | 74 | 75 | def validate(filename): 76 | '''Validate an XML file, w.r.t. the format given in the 4th task of **SemEval '14**.''' 77 | elements = ET.parse(filename).getroot().findall('sentence') 78 | aspects = [] 79 | for e in elements: 80 | for eterms in e.findall('aspectTerms'): 81 | if eterms is not None: 82 | for a in eterms.findall('aspectTerm'): 83 | aspects.append(Aspect('', '', []).create(a).term) 84 | return elements, aspects 85 | 86 | 87 | fix = lambda text: escape(text.encode('utf8')).replace('\"','"') 88 | '''Simple fix for writing out text.''' 89 | 90 | # Dice coefficient 91 | def dice(t1, t2, stopwords=[]): 92 | tokenize = lambda t: set([w for w in t.split() if (w not in stopwords)]) 93 | t1, t2 = tokenize(t1), tokenize(t2) 94 | return 2. * len(t1.intersection(t2)) / (len(t1) + len(t2)) 95 | 96 | 97 | class Category: 98 | '''Category objects contain the term and polarity (i.e., pos, neg, neu, conflict) of the category (e.g., food, price, etc.) 
of a sentence.''' 99 | 100 | def __init__(self, term='', polarity=''): 101 | self.term = term 102 | self.polarity = polarity 103 | 104 | def create(self, element): 105 | self.term = element.attrib['category'] 106 | self.polarity = element.attrib['polarity'] 107 | return self 108 | 109 | def update(self, term='', polarity=''): 110 | self.term = term 111 | self.polarity = polarity 112 | 113 | 114 | class Aspect: 115 | '''Aspect objects contain the term (e.g., battery life) and polarity (i.e., pos, neg, neu, conflict) of an aspect.''' 116 | 117 | def __init__(self, term, polarity, offsets): 118 | self.term = term 119 | self.polarity = polarity 120 | self.offsets = offsets 121 | 122 | def create(self, element): 123 | self.term = element.attrib['term'] 124 | self.polarity = element.attrib['polarity'] 125 | self.offsets = {'from': str(element.attrib['from']), 'to': str(element.attrib['to'])} 126 | return self 127 | 128 | def update(self, term='', polarity=''): 129 | self.term = term 130 | self.polarity = polarity 131 | 132 | 133 | class Instance: 134 | '''An instance is a sentence, modeled out of XML (pre-specified format, based on the 4th task of SemEval 2014). 135 | It contains the text, the aspect terms, and any aspect categories.''' 136 | 137 | def __init__(self, element): 138 | self.text = element.find('text').text 139 | self.id = element.get('id') 140 | self.aspect_terms = [Aspect('', '', offsets={'from': '', 'to': ''}).create(e) for es in 141 | element.findall('aspectTerms') for e in es if 142 | es is not None] 143 | self.aspect_categories = [Category(term='', polarity='').create(e) for es in element.findall('aspectCategories') 144 | for e in es if 145 | es is not None] 146 | 147 | def get_aspect_terms(self): 148 | return [a.term.lower() for a in self.aspect_terms] 149 | 150 | def get_aspect_categories(self): 151 | return [c.term.lower() for c in self.aspect_categories] 152 | 153 | def add_aspect_term(self, term, polarity='', offsets={'from': '', 'to': ''}): 154 | a = Aspect(term, polarity, offsets) 155 | self.aspect_terms.append(a) 156 | 157 | def add_aspect_category(self, term, polarity=''): 158 | c = Category(term, polarity) 159 | self.aspect_categories.append(c) 160 | 161 | 162 | class Corpus: 163 | '''A corpus contains instances, and is useful for training algorithms or splitting to train/test files.''' 164 | 165 | def __init__(self, elements): 166 | self.corpus = [Instance(e) for e in elements] 167 | self.size = len(self.corpus) 168 | self.aspect_terms_fd = fd([a for i in self.corpus for a in i.get_aspect_terms()]) 169 | self.top_aspect_terms = freq_rank(self.aspect_terms_fd) 170 | self.texts = [t.text for t in self.corpus] 171 | 172 | def echo(self): 173 | print '%d instances\n%d distinct aspect terms' % (len(self.corpus), len(self.top_aspect_terms)) 174 | print 'Top aspect terms: %s' % (', '.join(self.top_aspect_terms[:10])) 175 | 176 | def clean_tags(self): 177 | for i in range(len(self.corpus)): 178 | self.corpus[i].aspect_terms = [] 179 | 180 | def split(self, threshold=0.8, shuffle=False): 181 | '''Split to train/test, based on a threshold. 
Turn on shuffling for randomizing the elements beforehand.''' 182 | clone = copy.deepcopy(self.corpus) 183 | if shuffle: random.shuffle(clone) 184 | train = clone[:int(threshold * self.size)] 185 | test = clone[int(threshold * self.size):] 186 | return train, test 187 | 188 | def write_out(self, filename, instances, short=True): 189 | with open(filename, 'w') as o: 190 | o.write('\n') 191 | for i in instances: 192 | o.write('\t\n' % (i.id)) 193 | o.write('\t\t%s\n' % fix(i.text)) 194 | o.write('\t\t\n') 195 | if not short: 196 | for a in i.aspect_terms: 197 | o.write('\t\t\t\n' % ( 198 | fix(a.term), a.polarity, a.offsets['from'], a.offsets['to'])) 199 | o.write('\t\t\n') 200 | o.write('\t\t\n') 201 | if not short: 202 | for c in i.aspect_categories: 203 | o.write('\t\t\t\n' % (fix(c.term), c.polarity)) 204 | o.write('\t\t\n') 205 | o.write('\t\n') 206 | o.write('') 207 | 208 | 209 | class BaselineAspectExtractor(): 210 | '''Extract the aspects from a text. 211 | Use the aspect terms from the train data, to tag any new (i.e., unseen) instances.''' 212 | 213 | def __init__(self, corpus): 214 | self.candidates = [a.lower() for a in corpus.top_aspect_terms] 215 | 216 | def find_offsets_quickly(self, term, text): 217 | start = 0 218 | while True: 219 | start = text.find(term, start) 220 | if start == -1: return 221 | yield start 222 | start += len(term) 223 | 224 | def find_offsets(self, term, text): 225 | offsets = [(i, i + len(term)) for i in list(self.find_offsets_quickly(term, text))] 226 | return offsets 227 | 228 | def tag(self, test_instances): 229 | clones = [] 230 | for i in test_instances: 231 | i_ = copy.deepcopy(i) 232 | i_.aspect_terms = [] 233 | for c in set(self.candidates): 234 | if c in i_.text: 235 | offsets = self.find_offsets(' ' + c + ' ', i.text) 236 | for start, end in offsets: i_.add_aspect_term(term=c, 237 | offsets={'from': str(start + 1), 'to': str(end - 1)}) 238 | clones.append(i_) 239 | return clones 240 | 241 | 242 | class BaselineCategoryDetector(): 243 | '''Detect the category (or categories) of an instance. 
244 | For any new (i.e., unseen) instance, fetch the k-closest instances from the train data, and vote for the number of categories and the categories themselves.''' 245 | 246 | def __init__(self, corpus): 247 | self.corpus = corpus 248 | 249 | # Fetch k-neighbors (i.e., similar texts), using the Dice coefficient, and vote for #categories and category values 250 | def fetch_k_nn(self, text, k=5, multi=False): 251 | neighbors = dict([(i, dice(text, n, stopwords)) for i, n in enumerate(self.corpus.texts)]) 252 | ranked = freq_rank(neighbors) 253 | topk = [self.corpus.corpus[i] for i in ranked[:k]] 254 | num_of_cats = 1 if not multi else int(sum([len(i.aspect_categories) for i in topk]) / float(k)) 255 | cats = freq_rank(fd([c for i in topk for c in i.get_aspect_categories()])) 256 | categories = [cats[i] for i in range(num_of_cats)] 257 | return categories 258 | 259 | def tag(self, test_instances): 260 | clones = [] 261 | for i in test_instances: 262 | i_ = copy.deepcopy(i) 263 | i_.aspect_categories = [Category(term=c) for c in self.fetch_k_nn(i.text)] 264 | clones.append(i_) 265 | return clones 266 | 267 | 268 | class BaselineStageI(): 269 | '''Stage I: Aspect Term Extraction and Aspect Category Detection.''' 270 | 271 | def __init__(self, b1, b2): 272 | self.b1 = b1 273 | self.b2 = b2 274 | 275 | def tag(self, test_instances): 276 | clones = [] 277 | for i in test_instances: 278 | i_ = copy.deepcopy(i) 279 | i_.aspect_categories, i_.aspect_terms = [], [] 280 | for a in set(self.b1.candidates): 281 | offsets = self.b1.find_offsets(' ' + a + ' ', i_.text) 282 | for start, end in offsets: 283 | i_.add_aspect_term(term=a, offsets={'from': str(start + 1), 'to': str(end - 1)}) 284 | for c in self.b2.fetch_k_nn(i_.text): 285 | i_.aspect_categories.append(Category(term=c)) 286 | clones.append(i_) 287 | return clones 288 | 289 | 290 | class BaselineAspectPolarityEstimator(): 291 | '''Estimate the polarity of an instance's aspects. 292 | This is a majority baseline. 293 | Form the tuples from the train data, and measure frequencies. 294 | Then, given a new instance, vote for the polarities of the aspect terms (given).''' 295 | 296 | def __init__(self, corpus): 297 | self.corpus = corpus 298 | self.fd = fd2([(a.term, a.polarity) for i in self.corpus.corpus for a in i.aspect_terms]) 299 | self.major = freq_rank(fd([a.polarity for i in self.corpus.corpus for a in i.aspect_terms]))[0] 300 | 301 | # Fetch k-neighbors (i.e., similar texts), using the Dice coefficient, and vote for aspect's polarity 302 | def k_nn(self, text, aspect, k=5): 303 | neighbors = dict([(i, dice(text, next.text, stopwords)) for i, next in enumerate(self.corpus.corpus) if 304 | aspect in next.get_aspect_terms()]) 305 | ranked = freq_rank(neighbors) 306 | topk = [self.corpus.corpus[i] for i in ranked[:k]] 307 | return freq_rank(fd([a.polarity for i in topk for a in i.aspect_terms])) 308 | 309 | def majority(self, text, aspect): 310 | if aspect not in self.fd: 311 | return self.major 312 | else: 313 | polarities = self.k_nn(text, aspect, k=5) 314 | if polarities: 315 | return polarities[0] 316 | else: 317 | return self.major 318 | 319 | def tag(self, test_instances): 320 | clones = [] 321 | for i in test_instances: 322 | i_ = copy.deepcopy(i) 323 | for j in i_.aspect_terms: j.polarity = self.majority(i_.text, j.term) 324 | clones.append(i_) 325 | return clones 326 | 327 | 328 | class BaselineAspectCategoryPolarityEstimator(): 329 | '''Estimate the polarity of an instance's category (or categories). 
330 | This is a majority baseline. 331 | Form the tuples from the train data, and measure frequencies. 332 | Then, given a new instance, vote for the polarities of the categories (given).''' 333 | 334 | def __init__(self, corpus): 335 | self.corpus = corpus 336 | self.fd = fd2([(c.term, c.polarity) for i in self.corpus.corpus for c in i.aspect_categories]) 337 | 338 | # Fetch k-neighbors (i.e., similar texts), using the Dice coefficient, and vote for aspect's polarity 339 | def k_nn(self, text, k=5): 340 | neighbors = dict([(i, dice(text, next.text, stopwords)) for i, next in enumerate(self.corpus.corpus)]) 341 | ranked = freq_rank(neighbors) 342 | topk = [self.corpus.corpus[i] for i in ranked[:k]] 343 | return freq_rank(fd([c.polarity for i in topk for c in i.aspect_categories])) 344 | 345 | def majority(self, text): 346 | return self.k_nn(text)[0] 347 | 348 | def tag(self, test_instances): 349 | clones = [] 350 | for i in test_instances: 351 | i_ = copy.deepcopy(i) 352 | for j in i_.aspect_categories: 353 | j.polarity = self.majority(i_.text) 354 | clones.append(i_) 355 | return clones 356 | 357 | 358 | class BaselineStageII(): 359 | '''Stage II: Aspect Term and Aspect Category Polarity Estimation. 360 | Terms and categories are assumed given.''' 361 | 362 | # Baselines 3 and 4 are assumed given. 363 | def __init__(self, b3, b4): 364 | self.b3 = b3 365 | self.b4 = b4 366 | 367 | # Tag sentences with aspects and categories with their polarities 368 | def tag(self, test_instances): 369 | clones = [] 370 | for i in test_instances: 371 | i_ = copy.deepcopy(i) 372 | for j in i_.aspect_terms: j.polarity=self.b3.majority(i_.text, j.term) 373 | for j in i_.aspect_categories: j.polarity = self.b4.majority(i_.text) 374 | clones.append(i_) 375 | return clones 376 | 377 | 378 | class Evaluate(): 379 | '''Evaluation methods, per subtask of the 4th task of SemEval '14.''' 380 | 381 | def __init__(self, correct, predicted): 382 | self.size = len(correct) 383 | self.correct = correct 384 | self.predicted = predicted 385 | 386 | # Aspect Extraction (no offsets considered) 387 | def aspect_extraction(self, b=1): 388 | common, relevant, retrieved = 0., 0., 0. 389 | for i in range(self.size): 390 | cor = [a.offsets for a in self.correct[i].aspect_terms] 391 | pre = [a.offsets for a in self.predicted[i].aspect_terms] 392 | common += len([a for a in pre if a in cor]) 393 | retrieved += len(pre) 394 | relevant += len(cor) 395 | p = common / retrieved if retrieved > 0 else 0. 396 | r = common / relevant 397 | f1 = (1 + (b ** 2)) * p * r / ((p * b ** 2) + r) if p > 0 and r > 0 else 0. 398 | return p, r, f1, common, retrieved, relevant 399 | 400 | # Aspect Category Detection 401 | def category_detection(self, b=1): 402 | common, relevant, retrieved = 0., 0., 0. 403 | for i in range(self.size): 404 | cor = self.correct[i].get_aspect_categories() 405 | # Use set to avoid duplicates (i.e., two times the same category) 406 | pre = set(self.predicted[i].get_aspect_categories()) 407 | common += len([c for c in pre if c in cor]) 408 | retrieved += len(pre) 409 | relevant += len(cor) 410 | p = common / retrieved if retrieved > 0 else 0. 411 | r = common / relevant 412 | f1 = (1 + b ** 2) * p * r / ((p * b ** 2) + r) if p > 0 and r > 0 else 0. 413 | return p, r, f1, common, retrieved, relevant 414 | 415 | def aspect_polarity_estimation(self, b=1): 416 | common, relevant, retrieved = 0., 0., 0. 
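# Accuracy below is micro-averaged over all gold aspect-term slots: each predicted
# polarity is compared positionally with the gold polarity of the same aspect term.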
417 | for i in range(self.size): 418 | cor = [a.polarity for a in self.correct[i].aspect_terms] 419 | pre = [a.polarity for a in self.predicted[i].aspect_terms] 420 | common += sum([1 for j in range(len(pre)) if pre[j] == cor[j]]) 421 | retrieved += len(pre) 422 | acc = common / retrieved 423 | return acc, common, retrieved 424 | 425 | def aspect_category_polarity_estimation(self, b=1): 426 | common, relevant, retrieved = 0., 0., 0. 427 | for i in range(self.size): 428 | cor = [a.polarity for a in self.correct[i].aspect_categories] 429 | pre = [a.polarity for a in self.predicted[i].aspect_categories] 430 | common += sum([1 for j in range(len(pre)) if pre[j] == cor[j]]) 431 | retrieved += len(pre) 432 | acc = common / retrieved 433 | return acc, common, retrieved 434 | 435 | 436 | def main(argv=None): 437 | # Parse the input 438 | opts, args = getopt.getopt(argv, "hg:dt:om:k:", ["help", "grammar", "train=", "task=", "test="]) 439 | trainfile, testfile, task = None, None, 1 440 | use_msg = 'Use as:\n">>> python baselines.py --train file.xml --task 1|2|3|4(|5|6)"\n\nThis will parse a train set, examine whether is valid, split to train and test (80/20 %), write the new train, test and unseen test files, perform ABSA for task 1, 2, 3, or 4 (5 and 6 perform jointly tasks 1 & 2, and 3 & 4, respectively), and write out a file with the predictions.' 441 | if len(opts) == 0: sys.exit(use_msg) 442 | for opt, arg in opts: 443 | if opt in ("-h", "--help"): 444 | sys.exit(use_msg) 445 | elif opt in ('-t', "--train"): 446 | trainfile = arg 447 | elif opt in ('-m', "--task"): 448 | task = int(arg) 449 | elif opt in ('-k', "--test"): 450 | testfile = arg 451 | 452 | # Examine if the file is in proper XML format for further use. 453 | print 'Validating the file...' 454 | try: 455 | elements, aspects = validate(trainfile) 456 | print 'PASSED! This corpus has: %d sentences, %d aspect term occurrences, and %d distinct aspect terms.' % ( 457 | len(elements), len(aspects), len(list(set(aspects)))) 458 | except: 459 | print "Unexpected error:", sys.exc_info()[0] 460 | raise 461 | 462 | # Get the corpus and split into train/test. 463 | corpus = Corpus(ET.parse(trainfile).getroot().findall('sentence')) 464 | domain_name = 'laptops' if 'laptop' in trainfile else ('restaurants' if 'restau' in trainfile else 'absa') 465 | if testfile: 466 | traincorpus = corpus 467 | seen = Corpus(ET.parse(testfile).getroot().findall('sentence')) 468 | else: 469 | train, seen = corpus.split() 470 | # Store train/test files and clean up the test files (no aspect terms or categories are present); then, parse back the files back. 471 | corpus.write_out('%s--train.xml' % domain_name, train, short=False) 472 | traincorpus = Corpus(ET.parse('%s--train.xml' % domain_name).getroot().findall('sentence')) 473 | corpus.write_out('%s--test.gold.xml' % domain_name, seen, short=False) 474 | seen = Corpus(ET.parse('%s--test.gold.xml' % domain_name).getroot().findall('sentence')) 475 | 476 | corpus.write_out('%s--test.xml' % domain_name, seen.corpus) 477 | unseen = Corpus(ET.parse('%s--test.xml' % domain_name).getroot().findall('sentence')) 478 | 479 | # Perform the tasks, asked by the user and print the files with the predicted responses. 480 | if task == 1: 481 | b1 = BaselineAspectExtractor(traincorpus) 482 | print 'Extracting aspect terms...' 
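# Tag the stripped test sentences with the aspect terms harvested from the training
# split, write the predictions out, and score them against the gold annotations.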
483 | predicted = b1.tag(unseen.corpus) 484 | corpus.write_out('%s--test.predicted-aspect.xml' % domain_name, predicted, short=False) 485 | print 'P = %f -- R = %f -- F1 = %f (#correct: %d, #retrieved: %d, #relevant: %d)' % Evaluate(seen.corpus, 486 | predicted).aspect_extraction() 487 | if task == 2: 488 | print 'Detecting aspect categories...' 489 | b2 = BaselineCategoryDetector(traincorpus) 490 | predicted = b2.tag(unseen.corpus) 491 | print 'P = %f -- R = %f -- F1 = %f (#correct: %d, #retrieved: %d, #relevant: %d)' % Evaluate(seen.corpus, 492 | predicted).category_detection() 493 | corpus.write_out('%s--test.predicted-category.xml' % domain_name, predicted, short=False) 494 | if task == 3: 495 | print 'Estimating aspect term polarity...' 496 | b3 = BaselineAspectPolarityEstimator(traincorpus) 497 | predicted = b3.tag(seen.corpus) 498 | corpus.write_out('%s--test.predicted-aspectPolar.xml' % domain_name, predicted, short=False) 499 | print 'Accuracy = %f, #Correct/#All: %d/%d' % Evaluate(seen.corpus, predicted).aspect_polarity_estimation() 500 | if task == 4: 501 | print 'Estimating aspect category polarity...' 502 | b4 = BaselineAspectCategoryPolarityEstimator(traincorpus) 503 | predicted = b4.tag(seen.corpus) 504 | print 'Accuracy = %f, #Correct/#All: %d/%d' % Evaluate(seen.corpus, 505 | predicted).aspect_category_polarity_estimation() 506 | corpus.write_out('%s--test.predicted-categoryPolar.xml' % domain_name, predicted, short=False) 507 | # Perform tasks 1 & 2, and output an XML file with the predictions 508 | if task == 5: 509 | print 'Task 1 & 2: Aspect Term and Category Detection' 510 | b1 = BaselineAspectExtractor(traincorpus) 511 | b2 = BaselineCategoryDetector(traincorpus) 512 | b12 = BaselineStageI(b1, b2) 513 | predicted = b12.tag(unseen.corpus) 514 | corpus.write_out('%s--test.predicted-stageI.xml' % domain_name, predicted, short=False) 515 | print 'Task 1: P = %f -- R = %f -- F1 = %f (#correct: %d, #retrieved: %d, #relevant: %d)' % Evaluate( 516 | seen.corpus, predicted).aspect_extraction() 517 | print 'Task 2: P = %f -- R = %f -- F1 = %f (#correct: %d, #retrieved: %d, #relevant: %d)' % Evaluate( 518 | seen.corpus, predicted).category_detection() 519 | # Perform tasks 3 & 4, and output an XML file with the predictions 520 | if task == 6: 521 | print 'Aspect Term and Category Polarity Estimation' 522 | b3 = BaselineAspectPolarityEstimator(traincorpus) 523 | b4 = BaselineAspectCategoryPolarityEstimator(traincorpus) 524 | b34 = BaselineStageII(b3, b4) 525 | predicted = b34.tag(seen.corpus) 526 | corpus.write_out('%s--test.predicted-stageII.xml' % domain_name, predicted, short=False) 527 | print 'Task 3: Accuracy = %f (#Correct/#All: %d/%d)' % Evaluate(seen.corpus, 528 | predicted).aspect_polarity_estimation() 529 | print 'Task 4: Accuracy = %f (#Correct/#All: %d/%d)' % Evaluate(seen.corpus, 530 | predicted).aspect_category_polarity_estimation() 531 | 532 | 533 | if __name__ == "__main__": main(sys.argv[1:]) 534 | -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2014/submission-guidelines.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/datasets/ABSA-SemEval2014/submission-guidelines.pdf -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2015/absa-2015_laptops_trial.xml: 
-------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Being a PC user my whole life.... 7 | 8 | 9 | This computer is absolutely AMAZING!!! 10 | 11 | 12 | 13 | 14 | 15 | 10 plus hours of battery... 16 | 17 | 18 | 19 | 20 | 21 | super fast processor and really nice graphics card.. 22 | 23 | 24 | 25 | 26 | 27 | 28 | and plenty of storage with 250 gb(though I will upgrade this and the ram..) 29 | 30 | 31 | 32 | 33 | 34 | This computer is really fast and I'm shocked as to how easy it is to get used to... 35 | 36 | 37 | 38 | 39 | 40 | 41 | I've only had mine a day but I'm already used to it... 42 | 43 | 44 | 45 | 46 | 47 | MACS ARE AMAZING!!! 48 | 49 | 50 | GET THIS COMPUTER FOR PORTABILITY AND FAST PROCESSING!!! 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | It's fast and has excellent battery life. 62 | 63 | 64 | 65 | 66 | 67 | 68 | The screen shows great colors. 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | From the moment I opened the box to the present it has been a great joy. 79 | 80 | 81 | 82 | 83 | 84 | It is always reliable, never bugged and responds well. 85 | 86 | 87 | 88 | 89 | 90 | 91 | I love the operating system and the preloaded software. 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | Well, I have to say since I bought my Mac, I won't ever go back to any Windows. 103 | 104 | 105 | It's solid. 106 | 107 | 108 | 109 | 110 | 111 | Love the stability of the Mac software and operating system. 112 | 113 | 114 | 115 | 116 | 117 | 118 | The only downfall is a lot of the software I have won't work with Mac and iWork is not worth the price of it. 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | It seems to be incompatible with everything else. 127 | 128 | 129 | 130 | 131 | 132 | But the machine is awesome and iLife is great and I love Snow Leopard X. 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | Ever since I bought this laptop, so far I've experience nothing but constant break downs of the laptop and bad customer services I received over the phone with toshiba customer services hotlines. 145 | 146 | 147 | 148 | 149 | 150 | 151 | I constantly had to send my laptop in for services every 3 months and it always seems to be the same problem that they said they had already fixed. 152 | 153 | 154 | 155 | 156 | 157 | 158 | Toshiba customer services will indirectly deal with your problems by constantly tranferring you from one country to another, and I am not kidding you, I called different hours of the day and you'll get someone else from another country trying to get you to tell them your life story all over again, since they make it sound like they don't have your history list of your calls right in front of them. 159 | 160 | 161 | 162 | 163 | 164 | It's a long and tirring process that after a while it seems like their game plan was to wear you out so you would want to give up on contacting them. 165 | 166 | 167 | 168 | 169 | 170 | And at one point, they blame me for installing a bad memory stick when I upgrade my memory for the first time during my purchase of the laptop (I bought the memory stick they recomended). 
171 | 172 | 173 | 174 | 175 | 176 | Long story short, since I experience so many problems with my laptop every since I bought it from day one, I didn't ask for a new laptop or a refund of what I pay for a crapy laptop, but just an extension of my laptop warranty for another year, they made a big deal of out that and after so many calls and complaints about their products and services, they finally gave in. 177 | 178 | 179 | 180 | 181 | 182 | 183 | Was this product worth my time and money to ever want to purchase another products that is toshiba or relating to toshiba? 184 | 185 | 186 | 187 | 188 | 189 | 190 | Probably not ever again. 191 | 192 | 193 | I'll rather be out of date then spend more money on toshiba. 194 | 195 | 196 | 197 | 198 | 199 | Remember to do your research first before consider buying a toshiba product. 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | Purchased as a gift for a friend. 210 | 211 | 212 | The service I received from Toshiba went above and beyond the call of duty. 213 | 214 | 215 | 216 | 217 | 218 | My friend reports the notebook is astonishing in performance, picture quality, and ease of use. 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | It is extremely portable and easily connects to WIFI at the library and elsewhere. 227 | 228 | 229 | 230 | 231 | 232 | 233 | Just what the doctor ordered. 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | the key bindings take a little getting used to, but have loved the Macbook Pro. 244 | 245 | 246 | 247 | 248 | 249 | 250 | Delivery was early too. 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | Most everything is fine with this machine: speed, capacity, build. 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | The only thing I don't understand is that the resolution of the screen isn't high enough for some pages, such as Yahoo!Mail. 270 | 271 | 272 | 273 | 274 | 275 | Yes, I have it on the highest available setting. 276 | 277 | 278 | 279 | 280 | 281 | 282 | Plain and simple, it(laptop) runs great and loads fast. 283 | 284 | 285 | 286 | 287 | 288 | Easy to carry, can be taken anywhere, can be hooked up to printers,headsets. 289 | 290 | 291 | 292 | 293 | 294 | 295 | Love that it doesn't take up space like a regular computer. 296 | 297 | 298 | 299 | 300 | 301 | 302 | This computer gets very hot, before shutting down. 303 | 304 | 305 | 306 | 307 | 308 | It is not ideal for children because of the temp. 309 | 310 | 311 | 312 | 313 | 314 | 315 | I Contacted HP  about this situation and was given excuses, without results. 316 | 317 | 318 | 319 | 320 | 321 | They didn't even try to assist me or even offer a replacement. 322 | 323 | 324 | 325 | 326 | 327 | I will never purchase a HP again ever. 328 | 329 | 330 | 331 | 332 | 333 | 334 | -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2015/absa-2015_restaurants_trial.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Judging from previous posts this used to be a good place, but not any longer. 7 | 8 | 9 | 10 | 11 | 12 | We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude. 13 | 14 | 15 | 16 | 17 | 18 | They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table. 19 | 20 | 21 | 22 | 23 | 24 | The food was lousy - too sweet or too salty and the portions tiny. 
25 | 26 | 27 | 28 | 29 | 30 | 31 | After all that, they complained to me about the small tip. 32 | 33 | 34 | 35 | 36 | 37 | Avoid this place! 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | Every time in New York I make it a point to visit Restaurant Saul on Smith Street. 48 | 49 | 50 | 51 | 52 | 53 | Everything is always cooked to perfection, the service is excellent, the decor cool and understated. 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | I had the duck breast special on my last visit and it was incredible. 62 | 63 | 64 | 65 | 66 | 67 | Can't wait wait for my next visit. 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | We ate outside at Haru's Sake bar because Haru's restaurant next door was overflowing. 78 | 79 | 80 | What's the difference between the two? 81 | 82 | 83 | Their sake list was extensive, but we were looking for Purple Haze, which wasn't listed but made for us upon request! 84 | 85 | 86 | 87 | 88 | 89 | 90 | The spicy tuna roll was unusually good and the rock shrimp tempura was awesome, great appetizer to share! 91 | 92 | 93 | 94 | 95 | 96 | 97 | We went around 9:30 on a Friday and it had died down a bit by then so the service was great! 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | we love th pink pony. 108 | 109 | 110 | 111 | 112 | 113 | THe perfect spot. 114 | 115 | 116 | 117 | 118 | 119 | Food-awesome. 120 | 121 | 122 | 123 | 124 | 125 | Service- friendly and attentive. 126 | 127 | 128 | 129 | 130 | 131 | Ambiance- relaxed and stylish. 132 | 133 | 134 | 135 | 136 | 137 | Don't judge this place prima facie, you have to try it to believe it, a home away from home for the literate heart. 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | This place has got to be the best japanese restaurant in the new york area. 148 | 149 | 150 | 151 | 152 | 153 | I had a great experience. 154 | 155 | 156 | 157 | 158 | 159 | Food is great. 160 | 161 | 162 | 163 | 164 | 165 | Service is top notch. 166 | 167 | 168 | 169 | 170 | 171 | I have been going back again and again. 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | Just went here for my girlfriends 23rd bday. 182 | 183 | 184 | If you've ever been along the river in Weehawken you have an idea of the top of view the chart house has to offer. 185 | 186 | 187 | 188 | 189 | 190 | Add to that great service and great food at a reasonable price and you have yourself the beginning of a great evening. 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | The lava cake dessert was incredible and I recommend it. 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | I have never eaten in the restaurant, however, upon reading the reviews I got take out last week. 209 | 210 | 211 | IT WAS HORRIBLE. 212 | 213 | 214 | 215 | 216 | 217 | The pizza was delivered cold and the cheese wasn't even fully melted! 218 | 219 | 220 | 221 | 222 | 223 | 224 | It looked like shredded cheese partly done - still in strips. 225 | 226 | 227 | 228 | 229 | 230 | I have eaten at many pizza places around NYC and this is hands down the worst. 231 | 232 | 233 | 234 | 235 | 236 | 237 | This is a fun restaurant to go to. 238 | 239 | 240 | 241 | 242 | 243 | The pizza is yummy and I like the atmoshpere. 244 | 245 | 246 | 247 | 248 | 249 | 250 | But the pizza is way to expensive. 251 | 252 | 253 | 254 | 255 | 256 | A large is $20, and toppings are about $3 each. 257 | 258 | 259 | 260 | 261 | 262 | 263 | Planet Thailand has always been a hit with me , I go there usually for the sushi, which is great , the thai food is excellent too . 
264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | With the great variety on the menu , I eat here often and never get bored . 272 | 273 | 274 | 275 | 276 | 277 | The atmosphere isn't the greatest , but I suppose that's how they keep the prices down . 278 | 279 | 280 | 281 | 282 | 283 | 284 | It's all about the food !! 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | Moules were excellent, lobster ravioli was VERY salty! 295 | 296 | 297 | 298 | 299 | 300 | 301 | Took my mom for Mother's Day, and the maitre d' was pretty rude. 302 | 303 | 304 | 305 | 306 | 307 | Told us to sit anywhere, and when we sat he said the table was reserved. 308 | 309 | 310 | 311 | 312 | 313 | Stepped on my foot on the SECOND time he reached over me to adjust lighting. 314 | 315 | 316 | 317 | 318 | 319 | Tiny dessert was $8.00...just plain overpriced for what it is. 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2015/guidelines/SemEval2015_ABSA_Laptops_AnnotationGuidelines.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/datasets/ABSA-SemEval2015/guidelines/SemEval2015_ABSA_Laptops_AnnotationGuidelines.pdf -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2015/guidelines/semeval2015_absa_restaurants_annotationguidelines.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/datasets/ABSA-SemEval2015/guidelines/semeval2015_absa_restaurants_annotationguidelines.pdf -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2016/Training_Data/ABSA16FR-RestaurantsTrain/ABSA16FR-download.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/datasets/ABSA-SemEval2016/Training_Data/ABSA16FR-RestaurantsTrain/ABSA16FR-download.jar -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2016/Training_Data/ABSA16FR-RestaurantsTrain/ABSA16FR_Restaurants_guidelines.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/datasets/ABSA-SemEval2016/Training_Data/ABSA16FR-RestaurantsTrain/ABSA16FR_Restaurants_guidelines.pdf -------------------------------------------------------------------------------- /datasets/ABSA-SemEval2016/Training_Data/ABSA16FR-RestaurantsTrain/README.txt: -------------------------------------------------------------------------------- 1 | 2 | == SemEval-2016 Task 5 Aspect-Based Sentiment Analysis (ABSA) task for French == 3 | 4 | Training data for Subtask 1 (Sentence-level ABSA): annotated reviews for the Restaurant domain 5 | ---------------------------------------------------------------------------------------------- 6 | 7 | 8 | The French review annotations are distributed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Licence (https://creativecommons.org/licenses/by-nc-nd/4.0/). 
9 | The software is distributed under licence CeCILL (http://www.cecill.info/licences/Licence_CeCILL_V2.1-en.html)
10 | 
11 | 
12 | ------------------------
13 | Contents of this package
14 | ------------------------
15 | 
16 | The release package contains five files:
17 | 
18 | - ABSA16FR_Restaurants_Train.xml: an XML file compliant with the ABSA dataset format, containing review ids and opinion annotations. The <text> tags in the XML file are empty for copyright reasons.
19 | - ABSA16FR-download.jar: a jar file for downloading the reviews from the Web and filling the <text> tags in the XML file with the corresponding sentences.
20 | - ABSA16FR_Restaurants_index.txt: a mapping between review ids and their URLs. The number of sentences in a review is given at the end of each line.
21 | - ABSA16FR_Restaurants_guidelines.pdf: guidelines for French restaurant review annotation
22 | - README.txt: this file
23 | 
24 | 
25 | -------------------------
26 | How to obtain the dataset
27 | -------------------------
28 | 
29 | Requirements: java >= 1.5
30 | 
31 | Open a terminal, move to the directory containing these files, and run:
32 | 
33 | java -jar ABSA16FR-download.jar ABSA16FR_Restaurants_index.txt ABSA16FR_Restaurants_Train.xml
34 | 
35 | This will start the download and filling-in process and will write the output to a file named ABSA16FR_Restaurants_Train-withcontent.xml.
36 | This file is the complete French training dataset.
37 | 
38 | Please note that:
39 | - as a courtesy to the web site we get the reviews from, there is a short waiting time between two downloads. As a result, completing the process can take a while.
40 | - after each download, a <text> tag in the output file is filled in. You can interrupt the process at any moment without losing the downloaded content. Just choose 'K' when the "Should we [O]verwrite, [K]eep already downloaded content or [C]ancel? [O/K/C]" question appears the next time you run the process and it will continue at the point it stopped. To overwrite the already downloaded content choose 'O'.
41 | 
42 | Please report any issues (runtime errors, download errors or non coherent data) to the French ABSA-2016 organizers.
43 | 
44 | 
45 | -------
46 | Credits
47 | -------
48 | - Reviews are extracted from HTML with HTMLCleaner.
49 | - Sentence splitting is done with the OpenNLP toolkit.
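A quick way to check the result of the download step above is to count how many
sentences were actually filled in. A minimal sketch (assuming the standard ABSA XML
layout, in which each sentence holds its content in a <text> element, and the
output filename written by the downloader):

    import xml.etree.ElementTree as ET

    # Output file produced by ABSA16FR-download.jar (see the command above).
    tree = ET.parse('ABSA16FR_Restaurants_Train-withcontent.xml')

    filled, empty = 0, 0
    for text in tree.getroot().iter('text'):
        # A sentence counts as "filled" once the downloader has written its content.
        if text.text and text.text.strip():
            filled += 1
        else:
            empty += 1

    print('%d sentences filled, %d still empty' % (filled, empty))

If any sentences remain empty, re-running the downloader and choosing 'K' resumes
the process without re-fetching the content already downloaded.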
50 | -------------------------------------------------------------------------------- /datasets/Aspect-based-Sentiment-Analysis-Dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 17, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import re\n", 12 | "import os\n", 13 | "import random\n", 14 | "import numpy as np\n", 15 | "\n", 16 | "from collections import namedtuple" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## Customer Review" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 10, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "name": "stdout", 33 | "output_type": "stream", 34 | "text": [ 35 | "Apex AD2600 Progressive-scan DVD player.txt\n", 36 | "Canon G3.txt\n", 37 | "Creative Labs Nomad Jukebox Zen Xtra 40GB.txt\n", 38 | "Nikon coolpix 4300.txt\n", 39 | "Nokia 6610.txt\n", 40 | "1727\n" 41 | ] 42 | } 43 | ], 44 | "source": [ 45 | "# CR: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html\n", 46 | "\n", 47 | "CustomerReview = namedtuple(\"Customer_Review\", \"aspects sentence\")\n", 48 | "data = []\n", 49 | "\n", 50 | "for filename in os.listdir('CR/'):\n", 51 | " if filename=='Readme.txt':\n", 52 | " continue\n", 53 | " print(filename)\n", 54 | " with open('CR/'+filename,'r') as f_input:\n", 55 | " for line in f_input:\n", 56 | " if line.startswith(\"*\"):\n", 57 | " continue\n", 58 | " # select only lines with an opinion over a feature of the product\n", 59 | " m = re.match(r'.*\\[(.[0-9])\\].*##.*',line)\n", 60 | " if m:\n", 61 | " sentence = line.split('##')[1].strip()\n", 62 | " aspects_string = line.split('##')[0]\n", 63 | " aspects = aspects_string.split(\",\")\n", 64 | " data.append(CustomerReview(aspects, sentence))\n", 65 | "print(len(data))" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 11, 71 | "metadata": { 72 | "scrolled": false 73 | }, 74 | "outputs": [ 75 | { 76 | "name": "stdout", 77 | "output_type": "stream", 78 | "text": [ 79 | "['player[+2]', 'sound[-1]']\n", 80 | "['look[+3]', 'panel button layout[+3]', 'feature[+2]']\n", 81 | "['forward[+2]', ' rewind[+2]']\n", 82 | "['play[+2]', ' dvd-r[+1]']\n", 83 | "['remote control[-1]', ' mp3 filename[-1]']\n", 84 | "['play[-2]', ' disney movie[-2]']\n", 85 | "['player[+1]', ' look[+3]']\n", 86 | "['customer service[-3]', ' technical support[-3]']\n", 87 | "['dvd[-2]', ' read[-2]', ' play[-2]']\n", 88 | "['picture quality[+2]', 'feature[+2]']\n", 89 | "['look[+2]', ' feature[+2]']\n", 90 | "['play[-2]', ' dvd[-2]']\n", 91 | "['jpeg slideshow[+2]', 'mpeg1[+1]']\n", 92 | "['picture[-2]', ' player[-3][p]']\n", 93 | "['play[-2]', ' disc[-2]']\n", 94 | "['play[-2]', ' dvd[-2]']\n", 95 | "['picture[-2]', ' player[-2][p]']\n", 96 | "['dvd[-2]', ' play[-2]']\n", 97 | "['dvd[-2]', ' play[-2]']\n", 98 | "['format[+2][u]', 'progressive scan[+2]']\n", 99 | "['play[+2]', ' different file[+2]']\n", 100 | "['play[+2]', ' mpeg[+2]']\n", 101 | "['no disc[-2]', ' screen[-3][u]']\n", 102 | "['player[+3]', ' format[+2]']\n", 103 | "['progressive scan[+2]', 'remote[+3]']\n", 104 | "['play[-2]', ' dvd[-2]', ' no disc[-2]']\n", 105 | "['look[+2][u]', 'panel[+2]']\n", 106 | "['dvd[-2]', ' picture[-2]', ' play[-2]']\n", 107 | "['picture[-2]', ' sound[-2]']\n", 108 | "['dvd[-2]', ' read[-2]', ' no disc[-2]']\n", 109 | "['progressive scan player[+2]', 'price[+2]']\n", 110 | "['dvd[+2]', 
'sound[+2]']\n", 111 | "['read[+2]', ' svcd[+2]', ' vbr mp3 cd[+2]']\n", 112 | "['disc[-1]', ' recognize[-2]']\n", 113 | "['use[+2]', ' size[+2]', 'look[+2]']\n", 114 | "['player[-3][u]', 'customer service[-3]']\n", 115 | "['design[+2]', 'player[+2]']\n", 116 | "['read[-2]', ' dvd[-2]']\n", 117 | "['run[+2]', ' dvd[+2]']\n", 118 | "['sound[+3]', 'picture clarity[+3]']\n", 119 | "['case[+2]', 'price[+2][u]']\n", 120 | "['size[+1]', 'machine[+2]']\n", 121 | "['picture[+2]', 'sound[+2]']\n", 122 | "['read[-2]', ' disc[-2]']\n", 123 | "['set up[+1]', 'player[-2]']\n", 124 | "['feature[+2]', 'reliability[-2]']\n", 125 | "['unit[+3]', ' format[+2]']\n", 126 | "['play[-2]', ' dvd[-2]']\n", 127 | "['look[+2]', 'feature[+2]']\n", 128 | "['player[-3][p]', ' motor[-2]', ' disc[-2]']\n", 129 | "['ad-1600[-3]', ' ad-1220[-2]']\n", 130 | "['size[+2]', 'weight[+2]']\n", 131 | "['player[-2]', ' no disc[-2]']\n", 132 | "['freeze[-2]', ' player[-2]']\n", 133 | "['dvd[-3]', 'cd[-3]', ' no disc[-2]']\n", 134 | "['size[+1]', 'remote layout[+1]']\n", 135 | "['noise[-1]', ' external display[-2]']\n", 136 | "['look[+2]', 'feature[+2]']\n", 137 | "['dvd[-3]', ' read[-2]']\n", 138 | "['dvd disc[-3]', ' read[-3]']\n", 139 | "['read[+2]', ' cd audio disc[+2]']\n", 140 | "['jpeg[-2]', 'dvd[-2]']\n", 141 | "['play[+2]', ' cd[+2]', ' mp3[+2]', ' jpeg[+2]']\n", 142 | "['play[-2]', ' windows media[-2]', ' divx rip[-2]']\n", 143 | "['read[-2]', ' dvd[-2]', ' vcd[-2]']\n", 144 | "['play[+2]', ' dvd-r[+2]']\n", 145 | "['build quality[+2]', 'picture[+2]', 'sound[+2]']\n", 146 | "['look[+3][u]', ' player[+1][p]']\n", 147 | "['sound[+1]', 'weight[+1]']\n", 148 | "['work[+3]', ' player[+2][u]']\n", 149 | "['play[+2]', ' dvd[+2]']\n", 150 | "['camera[+2]', ' use[+2]', ' feature[+1]']\n", 151 | "['picture quality[+3]', ' use[+1]', ' option[+1]']\n", 152 | "['speed[+2]', 'picture quality[+2]', 'function[+2]']\n", 153 | "['exposure control[+2]', 'auto setting[+2]']\n", 154 | "['size[-2][u]', 'weight[-2][u]']\n", 155 | "['feature[+2]', ' ']\n", 156 | "['optical zoom[+2]', 'digital zoom[+1]']\n", 157 | "['lens cap[-1]', 'viewfinder[-1]']\n", 158 | "['menu[+1]', 'button[+1]']\n", 159 | "['zoom[+2]', 'lense[+2]']\n", 160 | "['picture[+3]', ' control[+2]', ' battery[+2]', ' software[+2]']\n", 161 | "['photo quality[+2]', ' auto mode[+2]']\n", 162 | "['control[+2]', 'auto mode[+2]']\n", 163 | "['feel[+2]', 'weight[+2]']\n", 164 | "['picture[+2]', ' auto mode[+2]']\n", 165 | "['color[+2]', 'picture[+2]', 'white balance[+2]']\n", 166 | "['flash photo[-3]', 'noise[-2]']\n", 167 | "['lag time[-2]', 'flash[-2]']\n", 168 | "['g3[+3]', 'software[-2]']\n", 169 | "['manual function[+2]', 'picture quality[+3]']\n", 170 | "['viewfinder[-1]', 'lcd[+2]', 'camera[+3]']\n", 171 | "['automode[+2]', 'manual mode[+2]']\n", 172 | "['photo[+3]', ' use[+3]', ' software[+2]']\n", 173 | "['focus[-2]', ' shoot[-2]']\n", 174 | "['control[+2]', 'menu[+2]']\n", 175 | "['optical zoom[+3]', 'viewfinder[+1]']\n", 176 | "['design[+2]', ' feature[+2]', 'use[+1][u]', 'battery[+2]']\n", 177 | "['use[+1][u]', 'design[+1][u]']\n", 178 | "['use[+3]', 'control[+2]']\n", 179 | "['option[+1}', ' control[+1]']\n", 180 | "['macro[+2]', ' auto mode[+2]']\n", 181 | "['picture[+3]', 'learning curve[+1]']\n", 182 | "['picture[+2]', 'feel[+1][u]']\n", 183 | "['lens[+2]', 'optical zoom[+1]']\n", 184 | "['camera[+3]', 'use[+1]', 'photo[+3]']\n", 185 | "['quality[+2]', 'lens[+2]']\n", 186 | "['weight[-1]', 'camera[+2]']\n", 187 | "['size[+2]', 'weight[+2]', 'navigational system[+2]', ' 
sound[+3]']\n", 188 | "['battery[+2]', 'leather case[+2]']\n", 189 | "['battery life[+2]', 'price[+2]']\n", 190 | "['sound[+3]', ' headphone[-2]']\n", 191 | "['battery[+2]', 'construction[+2]', 'size[+2]', 'weight[+2]']\n", 192 | "['size[+1][u]', 'sound[+2]', 'price[+2]']\n", 193 | "['player[+]', ' price[+3]']\n", 194 | "['player[+3]', 'use[+3]', ' software[+2]', 'sound[+3]']\n", 195 | "['setup[-3]', 'interface[-3]']\n", 196 | "['earphone[-1]', 'software[-1]']\n", 197 | "['screen[+2]', 'switch[-2]']\n", 198 | "['headphone[-1]', ' sound[+3]']\n", 199 | "['deal[+3]', ' mp3 player[+3]']\n", 200 | "['player[+3][p]', 'look[+2][u]']\n", 201 | "['sound[+3]', 'earphone[-3]']\n", 202 | "['size[-2][u]', 'weight[-2][u]', 'look[-1]', 'folder structure[-1]']\n", 203 | "['interface[+3][p]', 'software[+3][p]']\n", 204 | "['sound quality[+2]', 'look[+3]', 'screen[+2]', ' battery[+3]']\n", 205 | "['size[+1][u]', 'weight[+1]']\n", 206 | "['product[+2][u]', 'price[+3]']\n", 207 | "['setup[2]', ' transfer[+2]']\n", 208 | "['use[+3]', ' playback quality[+3]', 'price[+2]']\n", 209 | "['size[-1]', 'weight[-1]']\n", 210 | "['price[+2]', 'battery[+2]']\n", 211 | "['price[+3]', 'feature[+3]']\n", 212 | "['battery life[+2]', 'online music service[+2]', 'pc compatibility[+2]']\n", 213 | "['sound[+2]', 'battery life[+2]', 'price[+2][u]']\n", 214 | "['style[+1]', 'size[+1][u]', 'control[+1]']\n", 215 | "['player[+2][p]', 'button[+2]']\n", 216 | "['player[+3]', 'software[-2]']\n", 217 | "['look[+3]', 'size[+2][u]', 'weight[+2][u]']\n", 218 | "['sound quality[+3]', 'size[+2]', 'price[+2]']\n", 219 | "['sound[+2]', ' price[+2]', ' player[-1]']\n", 220 | "['button[+3]', 'interface[+3]']\n", 221 | "['click buttons[+3]', ' display[+2]']\n", 222 | "['size[+1][u]', 'weight[+1][u]', 'look[+2]', 'display[+2]']\n", 223 | "['player[+2]', 'sound[+3]']\n", 224 | "['software[+3]', 'player[+3]']\n", 225 | "['navigation[+2]', 'scroll[-2]']\n", 226 | "['price[+2][u]', ' player[+2]']\n", 227 | "['sound[+3]', 'size[-1][u]']\n", 228 | "['weight[-1][u]', 'battery life[+1]']\n", 229 | "['tag[-1]', 'software[-1]']\n", 230 | "['control[-2]', 'look[-2]']\n", 231 | "['screen[+3]', 'equilizer[+2]']\n", 232 | "['size[-1][u]', 'weight[+1]', 'software[-1]']\n", 233 | "['price[+2]', 'sound[+2]']\n", 234 | "['weight[-2][u]', 'size[-2][u]']\n", 235 | "['price[+2]', 'capacity[+2]']\n", 236 | "['design[+2]', 'interface[+1]']\n", 237 | "['player[+2]', 'software[-2]', ' rip[-2]', ' transfer[-2]']\n", 238 | "['battery[+2]', 'sound[+3]']\n", 239 | "['software[-2]', 'case[-2]']\n", 240 | "['rip[+2]', ' quality[+2]']\n", 241 | "['playlist[+1]', 'cd rip[+1]']\n", 242 | "['sound[+2]', 'volume[+2]']\n", 243 | "['player[-3]', 'software[-2]']\n", 244 | "['player[+2]', 'sound quality[+2]']\n", 245 | "['navigation[+2]', 'sync[+2]']\n", 246 | "['sound[+2]', 'power output[+2]']\n", 247 | "['user interface[+2]', 'navigation[+2]']\n", 248 | "['look[+2]', 'build[+2]']\n", 249 | "['fm[-1]', 'voice recording[-1]']\n", 250 | "['sound[+2]', 'interface[+2]', 'battery[+2]', 'software[+2]', ' wake up[+2]', 'play mode[+2]']\n", 251 | "['plug and play[-2]', 'id3[-2]', 'fm[-1]', 'recording[-1]']\n", 252 | "['size[-2][u]', 'weight[-2][u]', 'wheel[-3]']\n", 253 | "['product[+3]', ' sound[+3]', ' use[+2]']\n", 254 | "['menue[-3]', 'control[-3]']\n", 255 | "['storage[+3]', 'navigation[+2]', 'playlist[+2]', 'battery[+2]']\n", 256 | "['battery life[-2]', 'manual[-2]', 'lock up[-2]', 'replacement battery[-2]']\n", 257 | "['construction[-2]', 'support[-2]', 'scroll wheel[-2]', 'headphone 
jack[-2]']\n", 258 | "['sound quality[+2]', 'battery life[+2]', 'price[+2]']\n", 259 | "['size[+1]', 'value[+1]', 'design[+1]', 'software[-2]']\n", 260 | "['sound[+2]', 'earbud[-2]']\n", 261 | "['value[+2]', 'use[+2]']\n", 262 | "['sound quality[+2]', 'volume[+2]']\n", 263 | "['sound[+2]', 'battery life[+3]', 'battery[+2]', 'storage[+1]', 'screen[+2]', ' firmware[+1]', 'price[+2]']\n", 264 | "['warranty[-2]', 'freeze up[-1]', 'navigation wheel[-2]']\n", 265 | "['size[+3]', 'design[+3]']\n", 266 | "['sound[+3]', 'battery life[-1]']\n", 267 | "['control[-2]', 'software[-2]']\n", 268 | "['picture[+3]', ' macro[+3]']\n", 269 | "['auto focus[+2]', 'scene mode[+2]']\n", 270 | "['camera[+2][p]', ' use[+1][u]', ' feature[+2]']\n", 271 | "['auto mode[+1]', 'scene mode[+2]']\n", 272 | "['macro mode[+3]', 'picture[+3]']\n", 273 | "['camera[+3]', ' use[+1]']\n", 274 | "['camera[+2]', ' picture[+2]', ' close-up shooting[+3]']\n", 275 | "['camera[+2]', 'customer service[-2]']\n", 276 | "['picture[+3]', 'delay[+1]']\n", 277 | "['auto mode[+2]', ' manual mode[+2]', ' scene mode[+2]']\n", 278 | "['digital zoom[+2]', 'optical zoom[+2]']\n", 279 | "['touchup[+2]', ' redeye[+2]']\n", 280 | "['use[+1][u]', 'quality[+2]', 'size[+1]']\n", 281 | "['picture[+2]', 'ease of use[+2]']\n", 282 | "['picture[+2]', 'print[+2]']\n", 283 | "['4mp[+2]', 'optical zoom[+2]']\n", 284 | "['software[+3]', ' online service[+2]']\n", 285 | "['camera[+3]', ' feature[+2]']\n", 286 | "['picture quality[+2]', 'function[+2]']\n", 287 | "['camera[+3]', 'picture[+2]']\n", 288 | "['picture[+2]', 'indoor shot[+1]']\n", 289 | "['autofocus[+1]', 'scene mode[+1]', 'manual mode[+1]']\n", 290 | "['use[+1]', ' accessory[+2]']\n", 291 | "['picture quality[+3]', 'feature[+3]']\n", 292 | "['design[+2]', 'construction[+2]', 'optic[+2]']\n", 293 | "['picture quality[+3]', 'movie[+1]']\n", 294 | "['use[+1]', 'feature[+2]', ' camera[+2]']\n", 295 | "['weight[+2]', ' picture[+2][u]']\n", 296 | "['photo quality[+3]', 'print[+2]']\n", 297 | "['size[+2][u]', 'control[+2]']\n", 298 | "['price[+2][u]', 'learn[+2]', 'image[+3]']\n", 299 | "['auto mode[+2]', 'scene mode[+2]', 'manual mode[+2]']\n", 300 | "['camera[+3]', ' print quality[+3]']\n", 301 | "['closeup mode[+2]', ' battery[+2]']\n", 302 | "['phone[+3]', ' work[+2]']\n", 303 | "['speaker phone[+2]', 'radio[+2]', 'infrared[+2]']\n", 304 | "['sprint plan[-2]', 'sprint customer service[-3]']\n", 305 | "['size[+1][u]', ' sturdy[+2]']\n", 306 | "['game[+2]', 'pim[+2]', 'radio[+2]']\n", 307 | "['sound volume[+1]', ' ear[-2]']\n", 308 | "['ringtone[+1]', 'background[+1]', 'screensaver[+1]', 'memory[-2]']\n", 309 | "['battery life[+2]', 'size[+2]', 'volume[-1][u]']\n", 310 | "['radio[+2]', ' radio[-1]']\n", 311 | "['phone[+2]', 'warranty[-2]']\n", 312 | "['sound quality[+2]', ' fm[+1]', ' earpiece[+1]']\n", 313 | "['size[-2][u]', ' operate[-2][u]', ' button[-2]']\n", 314 | "['speakerphone[+3]', ' reception[+2]']\n", 315 | "['speakerphone[+3]', 'radio[+3]']\n", 316 | "['size[+2]', 'weight[+2]']\n", 317 | "['bluetooth[-1]', 'high speed internet[-1]']\n", 318 | "['ringing tone[+3]', 'radio[+2]']\n", 319 | "['phone[+2]', 'size[+2]']\n", 320 | "['reception[+3]', 'sound quality[+3]']\n", 321 | "['size[+2]', 'weight[+2]']\n", 322 | "['voice dialing[-1]', 'headset jack[-1]']\n", 323 | "['weight[+1]', ' design[+1]']\n", 324 | "['picture[-1]', ' ringtone[-1]']\n", 325 | "['quality[+3]', 'durability[+3]']\n", 326 | "['screen[+1]', ' ring tone[+1]']\n", 327 | "['weight[+1]', 'phone[+2]']\n", 328 | "['phone[+2]', ' 
menu[+2]']\n",
329 | "['wallpaper[+1]', 'tune[+2]']\n",
330 | "['battery life[+2][u]', 'reception[+2]', ' application[+2]']\n",
331 | "['size[+1][u]', 'game[+1]', 'ringtone[+1]']\n",
332 | "['phone[+2]', 'gsm[-1]']\n",
333 | "['phone[+2]', 'screen[+2]', 'ergonomics[+2]', 'size[+1][u]']\n",
334 | "['size[+1][u]', 'weight[+1]', 'design[+3]']\n",
335 | "['color screen[+1]', 'ringtone[+1]']\n",
336 | "['sound[-2]', 'volume[-2]']\n",
337 | "['phone[+2]', ' feature[+2]']\n",
338 | "['weight[+2][u]', 'battery life[+2]']\n",
339 | "['tone[+1]', 'wallpaper[+1]', 'application[+1]']\n",
340 | "['message[+1]', 'picture sharing[+1]']\n",
341 | "['size[+2]', 'feature[+2]']\n",
342 | "['phone book[+2]', 'speakerphone[+2]']\n",
343 | "['battery life[+2]', 'radio[+2]', 'signal[+2]', 'speakerphone[+2]', 'application[+1]']\n",
344 | "['size[+1]', 'speakerphone[+1]', 'plan[+1]']\n",
345 | "['phone[+3]', ' look[+2]']\n",
346 | "['phone[+2]', 'use[+1]', 'network[+2]']\n",
347 | "['service[+1]', 'ringtone[+2]']\n",
348 | "['size[+2][u]', ' look{+1]']\n",
349 | "['screen[+2]', 'sound[+2]']\n",
350 | "['voice quality[+3]', 'reception[+2]']\n",
351 | "['screen[+2]', 'command[+2]']\n",
352 | "['resolution[+2]', ' color[+2]']\n",
353 | "['gprs[-1]', 't-zone[+2]']\n",
354 | "['menu[+2]', ' feature[+2]']\n",
355 | "['phone[+2]', ' feature[+2]']\n",
356 | "['design[+2]', 'screen[+2]']\n",
357 | "['weight[+2]', 'signal[+2]']\n"
358 | ]
359 | }
360 | ],
361 | "source": [
362 | "for msg in data:\n",
363 | " if len(msg.aspects) > 1:\n",
364 | " print(msg.aspects)"
365 | ]
366 | }
367 | ],
368 | "metadata": {
369 | "kernelspec": {
370 | "display_name": "Python 3",
371 | "language": "python",
372 | "name": "python3"
373 | },
374 | "language_info": {
375 | "codemirror_mode": {
376 | "name": "ipython",
377 | "version": 3
378 | },
379 | "file_extension": ".py",
380 | "mimetype": "text/x-python",
381 | "name": "python",
382 | "nbconvert_exporter": "python",
383 | "pygments_lexer": "ipython3",
384 | "version": "3.6.2"
385 | }
386 | },
387 | "nbformat": 4,
388 | "nbformat_minor": 2
389 | }
390 |
-------------------------------------------------------------------------------- /datasets/CR/Readme.txt: --------------------------------------------------------------------------------
1 | *****************************************************************************
2 | * Annotated by: Minqing Hu and Bing Liu, 2004.
3 | * Department of Computer Science
4 | * University of Illinois at Chicago
5 | *
6 | * Contact: Bing Liu, liub@cs.uic.edu
7 | * http://www.cs.uic.edu/~liub
8 | *****************************************************************************
9 |
10 | Readme file
11 |
12 | This folder contains annotated customer reviews of 5 products.
13 |
14 | 1. digital camera: Canon G3
15 | 2. digital camera: Nikon coolpix 4300
16 | 3. cellular phone: Nokia 6610
17 | 4. mp3 player: Creative Labs Nomad Jukebox Zen Xtra 40GB
18 | 5. dvd player: Apex AD2600 Progressive-scan DVD player
19 |
20 | All the reviews were from amazon.com. They were used in the following
21 | two papers:
22 |
23 | Minqing Hu and Bing Liu. "Mining and summarizing customer reviews".
24 | Proceedings of the ACM SIGKDD International Conference on
25 | Knowledge Discovery & Data Mining (KDD-04), 2004.
26 |
27 | Minqing Hu and Bing Liu. "Mining Opinion Features in Customer
28 | Reviews." Proceedings of the Nineteenth National Conference on
29 | Artificial Intelligence (AAAI-2004), 2004.
30 |
31 | Our project homepage: http://www.cs.uic.edu/~liub/FBS/FBS.html
32 |
33 |
34 |
35 | Symbols used in the annotated reviews:
36 |
37 | [t]: the title of the review; each [t] tag starts a review.
38 | We did not use the title information in our papers.
39 | xxxx[+|-n]: xxxx is a product feature.
40 | [+n]: Positive opinion, n is the opinion strength: 3 strongest,
41 | and 1 weakest. Note that the strength is quite subjective.
42 | You may want to ignore it and consider only + and -.
43 | [-n]: Negative opinion.
44 | ## : start of each sentence. Each line is a sentence.
45 | [u] : the feature did not appear in the sentence.
46 | [p] : the feature did not appear in the sentence; pronoun resolution is needed.
47 | [s] : suggestion or recommendation.
48 | [cc]: comparison with a competing product from a different brand.
49 | [cs]: comparison with a competing product from the same brand.
50 |
51 |
52 | Finally, tagging is a hard task. Errors and inconsistencies are inevitable.
53 | If you see some problems, please let us know. We also welcome your comments.
54 |
55 |
56 |
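As a rough illustration (not part of the original release), the line format described
above can be parsed with a few lines of Python. This is a minimal sketch: the
hypothetical parse_line helper assumes only the feature[+n]/feature[-n] convention and
the ## sentence separator documented in this Readme, and it ignores the [u], [p], [s],
[cc] and [cs] flags:

    import re

    def parse_line(line):
        # Title lines ([t]) start a new review and carry no opinions.
        if line.startswith('[t]') or '##' not in line:
            return None
        annotations, sentence = line.split('##', 1)
        # e.g. "player[+2][u], sound[-1]" -> [('player', 2), ('sound', -1)]
        aspects = re.findall(r'([^,\[\]]+)\[([+-]\d)\]', annotations)
        return [(f.strip(), int(s)) for f, s in aspects], sentence.strip()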
-------------------------------------------------------------------------------- /libraries/__init__.py: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/libraries/__init__.py
-------------------------------------------------------------------------------- /libraries/baselines.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | '''
4 | **Baseline methods for the 4th task of SemEval 2014**
5 |
6 | Run a task from the terminal::
7 | >>> python baselines.py -t file -m taskNum
8 |
9 | or, import within python. E.g., for Aspect Term Extraction::
10 | from baselines import *
11 | corpus = Corpus(ET.parse(trainfile).getroot().findall('sentence'))
12 | unseen = Corpus(ET.parse(testfile).getroot().findall('sentence'))
13 | b1 = BaselineAspectExtractor(corpus)
14 | predicted = b1.tag(unseen.corpus)
15 | corpus.write_out('%s--test.predicted-aspect.xml'%domain_name, predicted, short=False)
16 |
17 | Similarly, for Aspect Category Detection, Aspect Term Polarity Estimation, and Aspect Category Polarity Estimation.
18 | '''
19 |
20 | __author__ = "J. Pavlopoulos"
21 | __credits__ = "J. Pavlopoulos, D. Galanis, I. Androutsopoulos"
22 | __license__ = "GPL"
23 | __version__ = "1.0.1"
24 | __maintainer__ = "John Pavlopoulos"
25 | __email__ = "annis@aueb.gr"
26 |
27 | try:
28 | import xml.etree.ElementTree as ET, getopt, logging, sys, random, re, copy
29 | from xml.sax.saxutils import escape
30 | except:
31 | sys.exit('Some package is missing...')
32 |
33 | logging.basicConfig(level=logging.INFO)
34 | logger = logging.getLogger(__name__)
35 |
36 | # Stopwords, imported from NLTK (v 2.0.4)
37 | stopwords = set(
38 | ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves',
39 | 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
40 | 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was',
41 | 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the',
42 | 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
43 | 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in',
44 | 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
45 | 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only',
46 | 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'])
47 |
48 |
49 | def fd(counts):
50 | '''Given a list of occurrences (e.g., [1,1,1,2]), return a dictionary of frequencies (e.g., {1:3, 2:1}).'''
51 | d = {}
52 | for i in counts: d[i] = d[i] + 1 if i in d else 1
53 | return d
54 |
55 |
56 | freq_rank = lambda d: sorted(d, key=d.get, reverse=True)
57 | '''Given a map, return the keys ranked by their values.'''
58 |
59 |
60 | def fd2(counts):
61 | '''Given a list of 2-tuples (e.g., [(a,pos), (a,pos), (a,neg), ...]), form a dict of frequencies of specific items (e.g., {a:{pos:2, neg:1}, ...}).'''
62 | d = {}
63 | for i in counts:
64 | # If the first element of the 2-tuple is not in the map, add it.
65 | if i[0] in d:
66 | if i[1] in d[i[0]]:
67 | d[i[0]][i[1]] += 1
68 | else:
69 | d[i[0]][i[1]] = 1
70 | else:
71 | d[i[0]] = {i[1]: 1}
72 | return d
73 |
74 |
75 | def validate(filename):
76 | '''Validate an XML file, w.r.t. the format given in the 4th task of **SemEval '14**.'''
77 | elements = ET.parse(filename).getroot().findall('sentence')
78 | aspects = []
79 | for e in elements:
80 | for eterms in e.findall('aspectTerms'):
81 | if eterms is not None:
82 | for a in eterms.findall('aspectTerm'):
83 | aspects.append(Aspect('', '', []).create(a).term)
84 | return elements, aspects
85 |
86 |
87 | fix = lambda text: escape(text.encode('utf8')).replace('\"', '&quot;')
88 | '''Escape XML special characters and quotes, for writing text out as XML.'''
89 |
90 | # Dice coefficient
91 | def dice(t1, t2, stopwords=[]):
92 | tokenize = lambda t: set([w for w in t.split() if (w not in stopwords)])
93 | t1, t2 = tokenize(t1), tokenize(t2)
94 | return 2. * len(t1.intersection(t2)) / (len(t1) + len(t2))
95 |
96 |
97 | class Category:
98 | '''Category objects contain the term and polarity (i.e., pos, neg, neu, conflict) of the category (e.g., food, price, etc.)
of a sentence.''' 99 | 100 | def __init__(self, term='', polarity=''): 101 | self.term = term 102 | self.polarity = polarity 103 | 104 | def create(self, element): 105 | self.term = element.attrib['category'] 106 | self.polarity = element.attrib['polarity'] 107 | return self 108 | 109 | def update(self, term='', polarity=''): 110 | self.term = term 111 | self.polarity = polarity 112 | 113 | 114 | class Aspect: 115 | '''Aspect objects contain the term (e.g., battery life) and polarity (i.e., pos, neg, neu, conflict) of an aspect.''' 116 | 117 | def __init__(self, term, polarity, offsets): 118 | self.term = term 119 | self.polarity = polarity 120 | self.offsets = offsets 121 | 122 | def create(self, element): 123 | self.term = element.attrib['term'] 124 | self.polarity = element.attrib['polarity'] 125 | self.offsets = {'from': str(element.attrib['from']), 'to': str(element.attrib['to'])} 126 | return self 127 | 128 | def update(self, term='', polarity=''): 129 | self.term = term 130 | self.polarity = polarity 131 | 132 | 133 | class Instance: 134 | '''An instance is a sentence, modeled out of XML (pre-specified format, based on the 4th task of SemEval 2014). 135 | It contains the text, the aspect terms, and any aspect categories.''' 136 | 137 | def __init__(self, element): 138 | self.text = element.find('text').text 139 | self.id = element.get('id') 140 | self.aspect_terms = [Aspect('', '', offsets={'from': '', 'to': ''}).create(e) for es in 141 | element.findall('aspectTerms') for e in es if 142 | es is not None] 143 | self.aspect_categories = [Category(term='', polarity='').create(e) for es in element.findall('aspectCategories') 144 | for e in es if 145 | es is not None] 146 | 147 | def get_aspect_terms(self): 148 | return [a.term.lower() for a in self.aspect_terms] 149 | 150 | def get_aspect_categories(self): 151 | return [c.term.lower() for c in self.aspect_categories] 152 | 153 | def add_aspect_term(self, term, polarity='', offsets={'from': '', 'to': ''}): 154 | a = Aspect(term, polarity, offsets) 155 | self.aspect_terms.append(a) 156 | 157 | def add_aspect_category(self, term, polarity=''): 158 | c = Category(term, polarity) 159 | self.aspect_categories.append(c) 160 | 161 | 162 | class Corpus: 163 | '''A corpus contains instances, and is useful for training algorithms or splitting to train/test files.''' 164 | 165 | def __init__(self, elements): 166 | self.corpus = [Instance(e) for e in elements] 167 | self.size = len(self.corpus) 168 | self.aspect_terms_fd = fd([a for i in self.corpus for a in i.get_aspect_terms()]) 169 | self.top_aspect_terms = freq_rank(self.aspect_terms_fd) 170 | self.texts = [t.text for t in self.corpus] 171 | 172 | def echo(self): 173 | print '%d instances\n%d distinct aspect terms' % (len(self.corpus), len(self.top_aspect_terms)) 174 | print 'Top aspect terms: %s' % (', '.join(self.top_aspect_terms[:10])) 175 | 176 | def clean_tags(self): 177 | for i in range(len(self.corpus)): 178 | self.corpus[i].aspect_terms = [] 179 | 180 | def split(self, threshold=0.8, shuffle=False): 181 | '''Split to train/test, based on a threshold. 
Turn on shuffling to randomize the elements beforehand.'''
182 | clone = copy.deepcopy(self.corpus)
183 | if shuffle: random.shuffle(clone)
184 | train = clone[:int(threshold * self.size)]
185 | test = clone[int(threshold * self.size):]
186 | return train, test
187 |
188 | def write_out(self, filename, instances, short=True):
189 | with open(filename, 'w') as o:
190 | o.write('<sentences>\n')
191 | for i in instances:
192 | o.write('\t<sentence id="%s">\n' % (i.id))
193 | o.write('\t\t<text>%s</text>\n' % fix(i.text))
194 | o.write('\t\t<aspectTerms>\n')
195 | if not short:
196 | for a in i.aspect_terms:
197 | o.write('\t\t\t<aspectTerm term="%s" polarity="%s" from="%s" to="%s"/>\n' % (
198 | fix(a.term), a.polarity, a.offsets['from'], a.offsets['to']))
199 | o.write('\t\t</aspectTerms>\n')
200 | o.write('\t\t<aspectCategories>\n')
201 | if not short:
202 | for c in i.aspect_categories:
203 | o.write('\t\t\t<aspectCategory category="%s" polarity="%s"/>\n' % (fix(c.term), c.polarity))
204 | o.write('\t\t</aspectCategories>\n')
205 | o.write('\t</sentence>\n')
206 | o.write('</sentences>')
207 |
208 |
209 | class BaselineAspectExtractor():
210 | '''Extract the aspects from a text.
211 | Use the aspect terms from the train data to tag any new (i.e., unseen) instances.'''
212 |
213 | def __init__(self, corpus):
214 | self.candidates = [a.lower() for a in corpus.top_aspect_terms]
215 |
216 | def find_offsets_quickly(self, term, text):
217 | start = 0
218 | while True:
219 | start = text.find(term, start)
220 | if start == -1: return
221 | yield start
222 | start += len(term)
223 |
224 | def find_offsets(self, term, text):
225 | offsets = [(i, i + len(term)) for i in list(self.find_offsets_quickly(term, text))]
226 | return offsets
227 |
228 | def tag(self, test_instances):
229 | clones = []
230 | for i in test_instances:
231 | i_ = copy.deepcopy(i)
232 | i_.aspect_terms = []
233 | for c in set(self.candidates):
234 | if c in i_.text:
235 | offsets = self.find_offsets(' ' + c + ' ', i.text)
236 | for start, end in offsets: i_.add_aspect_term(term=c,
237 | offsets={'from': str(start + 1), 'to': str(end - 1)})
238 | clones.append(i_)
239 | return clones
240 |
241 |
242 | class BaselineCategoryDetector():
243 | '''Detect the category (or categories) of an instance.
244 | For any new (i.e., unseen) instance, fetch the k-closest instances from the train data, and vote for the number of categories and the categories themselves.''' 245 | 246 | def __init__(self, corpus): 247 | self.corpus = corpus 248 | 249 | # Fetch k-neighbors (i.e., similar texts), using the Dice coefficient, and vote for #categories and category values 250 | def fetch_k_nn(self, text, k=5, multi=False): 251 | neighbors = dict([(i, dice(text, n, stopwords)) for i, n in enumerate(self.corpus.texts)]) 252 | ranked = freq_rank(neighbors) 253 | topk = [self.corpus.corpus[i] for i in ranked[:k]] 254 | num_of_cats = 1 if not multi else int(sum([len(i.aspect_categories) for i in topk]) / float(k)) 255 | cats = freq_rank(fd([c for i in topk for c in i.get_aspect_categories()])) 256 | categories = [cats[i] for i in range(num_of_cats)] 257 | return categories 258 | 259 | def tag(self, test_instances): 260 | clones = [] 261 | for i in test_instances: 262 | i_ = copy.deepcopy(i) 263 | i_.aspect_categories = [Category(term=c) for c in self.fetch_k_nn(i.text)] 264 | clones.append(i_) 265 | return clones 266 | 267 | 268 | class BaselineStageI(): 269 | '''Stage I: Aspect Term Extraction and Aspect Category Detection.''' 270 | 271 | def __init__(self, b1, b2): 272 | self.b1 = b1 273 | self.b2 = b2 274 | 275 | def tag(self, test_instances): 276 | clones = [] 277 | for i in test_instances: 278 | i_ = copy.deepcopy(i) 279 | i_.aspect_categories, i_.aspect_terms = [], [] 280 | for a in set(self.b1.candidates): 281 | offsets = self.b1.find_offsets(' ' + a + ' ', i_.text) 282 | for start, end in offsets: 283 | i_.add_aspect_term(term=a, offsets={'from': str(start + 1), 'to': str(end - 1)}) 284 | for c in self.b2.fetch_k_nn(i_.text): 285 | i_.aspect_categories.append(Category(term=c)) 286 | clones.append(i_) 287 | return clones 288 | 289 | 290 | class BaselineAspectPolarityEstimator(): 291 | '''Estimate the polarity of an instance's aspects. 292 | This is a majority baseline. 293 | Form the tuples from the train data, and measure frequencies. 294 | Then, given a new instance, vote for the polarities of the aspect terms (given).''' 295 | 296 | def __init__(self, corpus): 297 | self.corpus = corpus 298 | self.fd = fd2([(a.term, a.polarity) for i in self.corpus.corpus for a in i.aspect_terms]) 299 | self.major = freq_rank(fd([a.polarity for i in self.corpus.corpus for a in i.aspect_terms]))[0] 300 | 301 | # Fetch k-neighbors (i.e., similar texts), using the Dice coefficient, and vote for aspect's polarity 302 | def k_nn(self, text, aspect, k=5): 303 | neighbors = dict([(i, dice(text, next.text, stopwords)) for i, next in enumerate(self.corpus.corpus) if 304 | aspect in next.get_aspect_terms()]) 305 | ranked = freq_rank(neighbors) 306 | topk = [self.corpus.corpus[i] for i in ranked[:k]] 307 | return freq_rank(fd([a.polarity for i in topk for a in i.aspect_terms])) 308 | 309 | def majority(self, text, aspect): 310 | if aspect not in self.fd: 311 | return self.major 312 | else: 313 | polarities = self.k_nn(text, aspect, k=5) 314 | if polarities: 315 | return polarities[0] 316 | else: 317 | return self.major 318 | 319 | def tag(self, test_instances): 320 | clones = [] 321 | for i in test_instances: 322 | i_ = copy.deepcopy(i) 323 | for j in i_.aspect_terms: j.polarity = self.majority(i_.text, j.term) 324 | clones.append(i_) 325 | return clones 326 | 327 | 328 | class BaselineAspectCategoryPolarityEstimator(): 329 | '''Estimate the polarity of an instance's category (or categories). 
330 | This is a majority baseline.
331 | Form the tuples from the train data, and measure frequencies.
332 | Then, given a new instance, vote for the polarities of the categories (given).'''
333 |
334 | def __init__(self, corpus):
335 | self.corpus = corpus
336 | self.fd = fd2([(c.term, c.polarity) for i in self.corpus.corpus for c in i.aspect_categories])
337 |
338 | # Fetch k-neighbors (i.e., similar texts), using the Dice coefficient, and vote for the category's polarity
339 | def k_nn(self, text, k=5):
340 | neighbors = dict([(i, dice(text, next.text, stopwords)) for i, next in enumerate(self.corpus.corpus)])
341 | ranked = freq_rank(neighbors)
342 | topk = [self.corpus.corpus[i] for i in ranked[:k]]
343 | return freq_rank(fd([c.polarity for i in topk for c in i.aspect_categories]))
344 |
345 | def majority(self, text):
346 | return self.k_nn(text)[0]
347 |
348 | def tag(self, test_instances):
349 | clones = []
350 | for i in test_instances:
351 | i_ = copy.deepcopy(i)
352 | for j in i_.aspect_categories:
353 | j.polarity = self.majority(i_.text)
354 | clones.append(i_)
355 | return clones
356 |
357 |
358 | class BaselineStageII():
359 | '''Stage II: Aspect Term and Aspect Category Polarity Estimation.
360 | Terms and categories are assumed given.'''
361 |
362 | # Baselines 3 and 4 are assumed given.
363 | def __init__(self, b3, b4):
364 | self.b3 = b3
365 | self.b4 = b4
366 |
367 | # Tag sentences with aspects and categories with their polarities
368 | def tag(self, test_instances):
369 | clones = []
370 | for i in test_instances:
371 | i_ = copy.deepcopy(i)
372 | for j in i_.aspect_terms: j.polarity=self.b3.majority(i_.text, j.term)
373 | for j in i_.aspect_categories: j.polarity = self.b4.majority(i_.text)
374 | clones.append(i_)
375 | return clones
376 |
377 |
378 | class Evaluate():
379 | '''Evaluation methods, per subtask of the 4th task of SemEval '14.'''
380 |
381 | def __init__(self, correct, predicted):
382 | self.size = len(correct)
383 | self.correct = correct
384 | self.predicted = predicted
385 |
386 | # Aspect Extraction (predicted terms are matched to gold terms by their offsets)
387 | def aspect_extraction(self, b=1):
388 | common, relevant, retrieved = 0., 0., 0.
389 | for i in range(self.size):
390 | cor = [a.offsets for a in self.correct[i].aspect_terms]
391 | pre = [a.offsets for a in self.predicted[i].aspect_terms]
392 | common += len([a for a in pre if a in cor])
393 | retrieved += len(pre)
394 | relevant += len(cor)
395 | p = common / retrieved if retrieved > 0 else 0.
396 | r = common / relevant
397 | f1 = (1 + (b ** 2)) * p * r / ((p * b ** 2) + r) if p > 0 and r > 0 else 0.
398 | return p, r, f1, common, retrieved, relevant
399 |
400 | # Aspect Category Detection
401 | def category_detection(self, b=1):
402 | common, relevant, retrieved = 0., 0., 0.
403 | for i in range(self.size):
404 | cor = self.correct[i].get_aspect_categories()
405 | # Use set to avoid duplicates (i.e., two times the same category)
406 | pre = set(self.predicted[i].get_aspect_categories())
407 | common += len([c for c in pre if c in cor])
408 | retrieved += len(pre)
409 | relevant += len(cor)
410 | p = common / retrieved if retrieved > 0 else 0.
411 | r = common / relevant
412 | f1 = (1 + b ** 2) * p * r / ((p * b ** 2) + r) if p > 0 and r > 0 else 0.
413 | return p, r, f1, common, retrieved, relevant
414 |
415 | def aspect_polarity_estimation(self, b=1):
416 | common, relevant, retrieved = 0., 0., 0.
417 | for i in range(self.size):
418 | cor = [a.polarity for a in self.correct[i].aspect_terms]
419 | pre = [a.polarity for a in self.predicted[i].aspect_terms]
420 | common += sum([1 for j in range(len(pre)) if pre[j] == cor[j]])
421 | retrieved += len(pre)
422 | acc = common / retrieved
423 | return acc, common, retrieved
424 |
425 | def aspect_category_polarity_estimation(self, b=1):
426 | common, relevant, retrieved = 0., 0., 0.
427 | for i in range(self.size):
428 | cor = [a.polarity for a in self.correct[i].aspect_categories]
429 | pre = [a.polarity for a in self.predicted[i].aspect_categories]
430 | common += sum([1 for j in range(len(pre)) if pre[j] == cor[j]])
431 | retrieved += len(pre)
432 | acc = common / retrieved
433 | return acc, common, retrieved
434 |
435 |
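# A minimal usage sketch (kept as comments so importing this module stays
# side-effect free; the file names are hypothetical), mirroring the module
# docstring above: parse a training file and a gold test file, tag the test
# instances with the frequency-based extractor, and score them with Evaluate:
#
#   corpus = Corpus(ET.parse('rest--train.xml').getroot().findall('sentence'))
#   gold = Corpus(ET.parse('rest--test.gold.xml').getroot().findall('sentence'))
#   predicted = BaselineAspectExtractor(corpus).tag(gold.corpus)
#   p, r, f1, common, retrieved, relevant = Evaluate(gold.corpus, predicted).aspect_extraction()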
436 | def main(argv=None):
437 | # Parse the input
438 | opts, args = getopt.getopt(argv, "hg:dt:om:k:", ["help", "grammar", "train=", "task=", "test="])
439 | trainfile, testfile, task = None, None, 1
440 | use_msg = 'Use as:\n">>> python baselines.py --train file.xml --task 1|2|3|4(|5|6)"\n\nThis will parse a train set, examine whether it is valid, split it into train and test parts (80/20%), write the new train, test and unseen test files, perform ABSA for task 1, 2, 3, or 4 (5 and 6 perform jointly tasks 1 & 2, and 3 & 4, respectively), and write out a file with the predictions.'
441 | if len(opts) == 0: sys.exit(use_msg)
442 | for opt, arg in opts:
443 | if opt in ("-h", "--help"):
444 | sys.exit(use_msg)
445 | elif opt in ('-t', "--train"):
446 | trainfile = arg
447 | elif opt in ('-m', "--task"):
448 | task = int(arg)
449 | elif opt in ('-k', "--test"):
450 | testfile = arg
451 |
452 | # Examine if the file is in proper XML format for further use.
453 | print 'Validating the file...'
454 | try:
455 | elements, aspects = validate(trainfile)
456 | print 'PASSED! This corpus has: %d sentences, %d aspect term occurrences, and %d distinct aspect terms.' % (
457 | len(elements), len(aspects), len(list(set(aspects))))
458 | except:
459 | print "Unexpected error:", sys.exc_info()[0]
460 | raise
461 |
462 | # Get the corpus and split into train/test.
463 | corpus = Corpus(ET.parse(trainfile).getroot().findall('sentence'))
464 | domain_name = 'laptops' if 'laptop' in trainfile else ('restaurants' if 'restau' in trainfile else 'absa')
465 | if testfile:
466 | traincorpus = corpus
467 | seen = Corpus(ET.parse(testfile).getroot().findall('sentence'))
468 | else:
469 | train, seen = corpus.split()
470 | # Store train/test files and clean up the test files (no aspect terms or categories are present); then parse the files back.
471 | corpus.write_out('%s--train.xml' % domain_name, train, short=False)
472 | traincorpus = Corpus(ET.parse('%s--train.xml' % domain_name).getroot().findall('sentence'))
473 | corpus.write_out('%s--test.gold.xml' % domain_name, seen, short=False)
474 | seen = Corpus(ET.parse('%s--test.gold.xml' % domain_name).getroot().findall('sentence'))
475 |
476 | corpus.write_out('%s--test.xml' % domain_name, seen.corpus)
477 | unseen = Corpus(ET.parse('%s--test.xml' % domain_name).getroot().findall('sentence'))
478 |
479 | # Perform the tasks asked by the user, and print the files with the predicted responses.
480 | if task == 1:
481 | b1 = BaselineAspectExtractor(traincorpus)
482 | print 'Extracting aspect terms...'
483 | predicted = b1.tag(unseen.corpus)
484 | corpus.write_out('%s--test.predicted-aspect.xml' % domain_name, predicted, short=False)
485 | print 'P = %f -- R = %f -- F1 = %f (#correct: %d, #retrieved: %d, #relevant: %d)' % Evaluate(seen.corpus,
486 | predicted).aspect_extraction()
487 | if task == 2:
488 | print 'Detecting aspect categories...'
489 | b2 = BaselineCategoryDetector(traincorpus)
490 | predicted = b2.tag(unseen.corpus)
491 | print 'P = %f -- R = %f -- F1 = %f (#correct: %d, #retrieved: %d, #relevant: %d)' % Evaluate(seen.corpus,
492 | predicted).category_detection()
493 | corpus.write_out('%s--test.predicted-category.xml' % domain_name, predicted, short=False)
494 | if task == 3:
495 | print 'Estimating aspect term polarity...'
496 | b3 = BaselineAspectPolarityEstimator(traincorpus)
497 | predicted = b3.tag(seen.corpus)
498 | corpus.write_out('%s--test.predicted-aspectPolar.xml' % domain_name, predicted, short=False)
499 | print 'Accuracy = %f, #Correct/#All: %d/%d' % Evaluate(seen.corpus, predicted).aspect_polarity_estimation()
500 | if task == 4:
501 | print 'Estimating aspect category polarity...'
502 | b4 = BaselineAspectCategoryPolarityEstimator(traincorpus)
503 | predicted = b4.tag(seen.corpus)
504 | print 'Accuracy = %f, #Correct/#All: %d/%d' % Evaluate(seen.corpus,
505 | predicted).aspect_category_polarity_estimation()
506 | corpus.write_out('%s--test.predicted-categoryPolar.xml' % domain_name, predicted, short=False)
507 | # Perform tasks 1 & 2, and output an XML file with the predictions
508 | if task == 5:
509 | print 'Task 1 & 2: Aspect Term and Category Detection'
510 | b1 = BaselineAspectExtractor(traincorpus)
511 | b2 = BaselineCategoryDetector(traincorpus)
512 | b12 = BaselineStageI(b1, b2)
513 | predicted = b12.tag(unseen.corpus)
514 | corpus.write_out('%s--test.predicted-stageI.xml' % domain_name, predicted, short=False)
515 | print 'Task 1: P = %f -- R = %f -- F1 = %f (#correct: %d, #retrieved: %d, #relevant: %d)' % Evaluate(
516 | seen.corpus, predicted).aspect_extraction()
517 | print 'Task 2: P = %f -- R = %f -- F1 = %f (#correct: %d, #retrieved: %d, #relevant: %d)' % Evaluate(
518 | seen.corpus, predicted).category_detection()
519 | # Perform tasks 3 & 4, and output an XML file with the predictions
520 | if task == 6:
521 | print 'Aspect Term and Category Polarity Estimation'
522 | b3 = BaselineAspectPolarityEstimator(traincorpus)
523 | b4 = BaselineAspectCategoryPolarityEstimator(traincorpus)
524 | b34 = BaselineStageII(b3, b4)
525 | predicted = b34.tag(seen.corpus)
526 | corpus.write_out('%s--test.predicted-stageII.xml' % domain_name, predicted, short=False)
527 | print 'Task 3: Accuracy = %f (#Correct/#All: %d/%d)' % Evaluate(seen.corpus,
528 | predicted).aspect_polarity_estimation()
529 | print 'Task 4: Accuracy = %f (#Correct/#All: %d/%d)' % Evaluate(seen.corpus,
530 | predicted).aspect_category_polarity_estimation()
531 |
532 |
533 | if __name__ == "__main__": main(sys.argv[1:])
-------------------------------------------------------------------------------- /run.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 |
4 | import xml.etree.ElementTree as ET
5 | from libraries.baselines import Corpus
6 |
7 | from stanford_corenlp_python import jsonrpc
8 | from simplejson import loads
9 |
10 |
11 | def process_semeval_2015():
12 | # the training set is composed of the train and trial datasets
13 | corpora = dict()
14 | corpora['restaurants'] = dict()
15 | train_filename =
'datasets/ABSA-SemEval2015/ABSA-15_Restaurants_Train_Final.xml'
16 | trial_filename = 'datasets/ABSA-SemEval2015/absa-2015_restaurants_trial.xml'
17 |
18 | reviews = ET.parse(train_filename).getroot().findall('Review') + \
19 | ET.parse(trial_filename).getroot().findall('Review')
20 |
21 | sentences = []
22 | for r in reviews:
23 | sentences += r.find('sentences').getchildren()
24 |
25 | # TODO: parser is not loading aspect words and opinions
26 | corpus = Corpus(sentences)
27 | print corpus.size
28 |
29 |
30 | def process_semeval_2014():
31 | # the training set is composed of the train and trial datasets
32 | corpora = dict()
33 | corpora['restaurants'] = dict()
34 | train_filename = 'datasets/ABSA-SemEval2014/Restaurants_Train_v2.xml'
35 | trial_filename = 'datasets/ABSA-SemEval2014/restaurants-trial.xml'
36 | corpus = Corpus(ET.parse(train_filename).getroot().findall('sentence') +
37 | ET.parse(trial_filename).getroot().findall('sentence'))
38 | corpora['restaurants']['trainset'] = dict()
39 | corpora['restaurants']['trainset']['corpus'] = corpus
40 | return corpora
41 |
42 |
43 | def main():
44 | # TODO: start corenlp server "python corenlp.py"
45 |
46 | # interface for the Stanford CoreNLP server
47 | server = jsonrpc.ServerProxy(jsonrpc.JsonRpc20(),
48 | jsonrpc.TransportTcpIp(addr=("127.0.0.1",
49 | 8080)))
50 |
51 | result = loads(server.parse("Hello world. It is so beautiful"))
52 | print "Result", result
53 |
54 | corpora = process_semeval_2014()
55 | train_restaurants = corpora['restaurants']['trainset']['corpus']
56 |
57 | for s in train_restaurants.corpus:
58 | print s.text
59 |
60 | """
61 | print train_restaurants.size
62 | print train_restaurants.aspect_terms_fd
63 | """
64 |
65 | if __name__ == '__main__':
66 | main()
67 |
-------------------------------------------------------------------------------- /stanford_corenlp_python/LICENSE: --------------------------------------------------------------------------------
1 | GNU GENERAL PUBLIC LICENSE
2 | Version 2, June 1991
3 |
4 | Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
5 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
6 | Everyone is permitted to copy and distribute verbatim copies
7 | of this license document, but changing it is not allowed.
8 |
9 | Preamble
10 |
11 | The licenses for most software are designed to take away your
12 | freedom to share and change it. By contrast, the GNU General Public
13 | License is intended to guarantee your freedom to share and change free
14 | software--to make sure the software is free for all its users. This
15 | General Public License applies to most of the Free Software
16 | Foundation's software and to any other program whose authors commit to
17 | using it. (Some other Free Software Foundation software is covered by
18 | the GNU Lesser General Public License instead.) You can apply it to
19 | your programs, too.
20 |
21 | When we speak of free software, we are referring to freedom, not
22 | price. Our General Public Licenses are designed to make sure that you
23 | have the freedom to distribute copies of free software (and charge for
24 | this service if you wish), that you receive source code or can get it
25 | if you want it, that you can change the software or use pieces of it
26 | in new free programs; and that you know you can do these things.
27 |
28 | To protect your rights, we need to make restrictions that forbid
29 | anyone to deny you these rights or to ask you to surrender the rights.
30 | These restrictions translate to certain responsibilities for you if you 31 | distribute copies of the software, or if you modify it. 32 | 33 | For example, if you distribute copies of such a program, whether 34 | gratis or for a fee, you must give the recipients all the rights that 35 | you have. You must make sure that they, too, receive or can get the 36 | source code. And you must show them these terms so they know their 37 | rights. 38 | 39 | We protect your rights with two steps: (1) copyright the software, and 40 | (2) offer you this license which gives you legal permission to copy, 41 | distribute and/or modify the software. 42 | 43 | Also, for each author's protection and ours, we want to make certain 44 | that everyone understands that there is no warranty for this free 45 | software. If the software is modified by someone else and passed on, we 46 | want its recipients to know that what they have is not the original, so 47 | that any problems introduced by others will not reflect on the original 48 | authors' reputations. 49 | 50 | Finally, any free program is threatened constantly by software 51 | patents. We wish to avoid the danger that redistributors of a free 52 | program will individually obtain patent licenses, in effect making the 53 | program proprietary. To prevent this, we have made it clear that any 54 | patent must be licensed for everyone's free use or not licensed at all. 55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | GNU GENERAL PUBLIC LICENSE 60 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 61 | 62 | 0. This License applies to any program or other work which contains 63 | a notice placed by the copyright holder saying it may be distributed 64 | under the terms of this General Public License. The "Program", below, 65 | refers to any such program or work, and a "work based on the Program" 66 | means either the Program or any derivative work under copyright law: 67 | that is to say, a work containing the Program or a portion of it, 68 | either verbatim or with modifications and/or translated into another 69 | language. (Hereinafter, translation is included without limitation in 70 | the term "modification".) Each licensee is addressed as "you". 71 | 72 | Activities other than copying, distribution and modification are not 73 | covered by this License; they are outside its scope. The act of 74 | running the Program is not restricted, and the output from the Program 75 | is covered only if its contents constitute a work based on the 76 | Program (independent of having been made by running the Program). 77 | Whether that is true depends on what the Program does. 78 | 79 | 1. You may copy and distribute verbatim copies of the Program's 80 | source code as you receive it, in any medium, provided that you 81 | conspicuously and appropriately publish on each copy an appropriate 82 | copyright notice and disclaimer of warranty; keep intact all the 83 | notices that refer to this License and to the absence of any warranty; 84 | and give any other recipients of the Program a copy of this License 85 | along with the Program. 86 | 87 | You may charge a fee for the physical act of transferring a copy, and 88 | you may at your option offer warranty protection in exchange for a fee. 89 | 90 | 2. 
You may modify your copy or copies of the Program or any portion 91 | of it, thus forming a work based on the Program, and copy and 92 | distribute such modifications or work under the terms of Section 1 93 | above, provided that you also meet all of these conditions: 94 | 95 | a) You must cause the modified files to carry prominent notices 96 | stating that you changed the files and the date of any change. 97 | 98 | b) You must cause any work that you distribute or publish, that in 99 | whole or in part contains or is derived from the Program or any 100 | part thereof, to be licensed as a whole at no charge to all third 101 | parties under the terms of this License. 102 | 103 | c) If the modified program normally reads commands interactively 104 | when run, you must cause it, when started running for such 105 | interactive use in the most ordinary way, to print or display an 106 | announcement including an appropriate copyright notice and a 107 | notice that there is no warranty (or else, saying that you provide 108 | a warranty) and that users may redistribute the program under 109 | these conditions, and telling the user how to view a copy of this 110 | License. (Exception: if the Program itself is interactive but 111 | does not normally print such an announcement, your work based on 112 | the Program is not required to print an announcement.) 113 | 114 | These requirements apply to the modified work as a whole. If 115 | identifiable sections of that work are not derived from the Program, 116 | and can be reasonably considered independent and separate works in 117 | themselves, then this License, and its terms, do not apply to those 118 | sections when you distribute them as separate works. But when you 119 | distribute the same sections as part of a whole which is a work based 120 | on the Program, the distribution of the whole must be on the terms of 121 | this License, whose permissions for other licensees extend to the 122 | entire whole, and thus to each and every part regardless of who wrote it. 123 | 124 | Thus, it is not the intent of this section to claim rights or contest 125 | your rights to work written entirely by you; rather, the intent is to 126 | exercise the right to control the distribution of derivative or 127 | collective works based on the Program. 128 | 129 | In addition, mere aggregation of another work not based on the Program 130 | with the Program (or with a work based on the Program) on a volume of 131 | a storage or distribution medium does not bring the other work under 132 | the scope of this License. 133 | 134 | 3. 
You may copy and distribute the Program (or a work based on it, 135 | under Section 2) in object code or executable form under the terms of 136 | Sections 1 and 2 above provided that you also do one of the following: 137 | 138 | a) Accompany it with the complete corresponding machine-readable 139 | source code, which must be distributed under the terms of Sections 140 | 1 and 2 above on a medium customarily used for software interchange; or, 141 | 142 | b) Accompany it with a written offer, valid for at least three 143 | years, to give any third party, for a charge no more than your 144 | cost of physically performing source distribution, a complete 145 | machine-readable copy of the corresponding source code, to be 146 | distributed under the terms of Sections 1 and 2 above on a medium 147 | customarily used for software interchange; or, 148 | 149 | c) Accompany it with the information you received as to the offer 150 | to distribute corresponding source code. (This alternative is 151 | allowed only for noncommercial distribution and only if you 152 | received the program in object code or executable form with such 153 | an offer, in accord with Subsection b above.) 154 | 155 | The source code for a work means the preferred form of the work for 156 | making modifications to it. For an executable work, complete source 157 | code means all the source code for all modules it contains, plus any 158 | associated interface definition files, plus the scripts used to 159 | control compilation and installation of the executable. However, as a 160 | special exception, the source code distributed need not include 161 | anything that is normally distributed (in either source or binary 162 | form) with the major components (compiler, kernel, and so on) of the 163 | operating system on which the executable runs, unless that component 164 | itself accompanies the executable. 165 | 166 | If distribution of executable or object code is made by offering 167 | access to copy from a designated place, then offering equivalent 168 | access to copy the source code from the same place counts as 169 | distribution of the source code, even though third parties are not 170 | compelled to copy the source along with the object code. 171 | 172 | 4. You may not copy, modify, sublicense, or distribute the Program 173 | except as expressly provided under this License. Any attempt 174 | otherwise to copy, modify, sublicense or distribute the Program is 175 | void, and will automatically terminate your rights under this License. 176 | However, parties who have received copies, or rights, from you under 177 | this License will not have their licenses terminated so long as such 178 | parties remain in full compliance. 179 | 180 | 5. You are not required to accept this License, since you have not 181 | signed it. However, nothing else grants you permission to modify or 182 | distribute the Program or its derivative works. These actions are 183 | prohibited by law if you do not accept this License. Therefore, by 184 | modifying or distributing the Program (or any work based on the 185 | Program), you indicate your acceptance of this License to do so, and 186 | all its terms and conditions for copying, distributing or modifying 187 | the Program or works based on it. 188 | 189 | 6. Each time you redistribute the Program (or any work based on the 190 | Program), the recipient automatically receives a license from the 191 | original licensor to copy, distribute or modify the Program subject to 192 | these terms and conditions. 
You may not impose any further 193 | restrictions on the recipients' exercise of the rights granted herein. 194 | You are not responsible for enforcing compliance by third parties to 195 | this License. 196 | 197 | 7. If, as a consequence of a court judgment or allegation of patent 198 | infringement or for any other reason (not limited to patent issues), 199 | conditions are imposed on you (whether by court order, agreement or 200 | otherwise) that contradict the conditions of this License, they do not 201 | excuse you from the conditions of this License. If you cannot 202 | distribute so as to satisfy simultaneously your obligations under this 203 | License and any other pertinent obligations, then as a consequence you 204 | may not distribute the Program at all. For example, if a patent 205 | license would not permit royalty-free redistribution of the Program by 206 | all those who receive copies directly or indirectly through you, then 207 | the only way you could satisfy both it and this License would be to 208 | refrain entirely from distribution of the Program. 209 | 210 | If any portion of this section is held invalid or unenforceable under 211 | any particular circumstance, the balance of the section is intended to 212 | apply and the section as a whole is intended to apply in other 213 | circumstances. 214 | 215 | It is not the purpose of this section to induce you to infringe any 216 | patents or other property right claims or to contest validity of any 217 | such claims; this section has the sole purpose of protecting the 218 | integrity of the free software distribution system, which is 219 | implemented by public license practices. Many people have made 220 | generous contributions to the wide range of software distributed 221 | through that system in reliance on consistent application of that 222 | system; it is up to the author/donor to decide if he or she is willing 223 | to distribute software through any other system and a licensee cannot 224 | impose that choice. 225 | 226 | This section is intended to make thoroughly clear what is believed to 227 | be a consequence of the rest of this License. 228 | 229 | 8. If the distribution and/or use of the Program is restricted in 230 | certain countries either by patents or by copyrighted interfaces, the 231 | original copyright holder who places the Program under this License 232 | may add an explicit geographical distribution limitation excluding 233 | those countries, so that distribution is permitted only in or among 234 | countries not thus excluded. In such case, this License incorporates 235 | the limitation as if written in the body of this License. 236 | 237 | 9. The Free Software Foundation may publish revised and/or new versions 238 | of the General Public License from time to time. Such new versions will 239 | be similar in spirit to the present version, but may differ in detail to 240 | address new problems or concerns. 241 | 242 | Each version is given a distinguishing version number. If the Program 243 | specifies a version number of this License which applies to it and "any 244 | later version", you have the option of following the terms and conditions 245 | either of that version or of any later version published by the Free 246 | Software Foundation. If the Program does not specify a version number of 247 | this License, you may choose any version ever published by the Free Software 248 | Foundation. 249 | 250 | 10. 
If you wish to incorporate parts of the Program into other free 251 | programs whose distribution conditions are different, write to the author 252 | to ask for permission. For software which is copyrighted by the Free 253 | Software Foundation, write to the Free Software Foundation; we sometimes 254 | make exceptions for this. Our decision will be guided by the two goals 255 | of preserving the free status of all derivatives of our free software and 256 | of promoting the sharing and reuse of software generally. 257 | 258 | NO WARRANTY 259 | 260 | 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN 262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED 264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS 266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE 267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, 268 | REPAIR OR CORRECTION. 269 | 270 | 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR 272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING 274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED 275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY 276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER 277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 278 | POSSIBILITY OF SUCH DAMAGES. 279 | 280 | END OF TERMS AND CONDITIONS 281 | 282 | How to Apply These Terms to Your New Programs 283 | 284 | If you develop a new program, and you want it to be of the greatest 285 | possible use to the public, the best way to achieve this is to make it 286 | free software which everyone can redistribute and change under these terms. 287 | 288 | To do so, attach the following notices to the program. It is safest 289 | to attach them to the start of each source file to most effectively 290 | convey the exclusion of warranty; and each file should have at least 291 | the "copyright" line and a pointer to where the full notice is found. 292 | 293 | <one line to give the program's name and a brief idea of what it does.> 294 | Copyright (C) <year> <name of author> 295 | 296 | This program is free software; you can redistribute it and/or modify 297 | it under the terms of the GNU General Public License as published by 298 | the Free Software Foundation; either version 2 of the License, or 299 | (at your option) any later version. 300 | 301 | This program is distributed in the hope that it will be useful, 302 | but WITHOUT ANY WARRANTY; without even the implied warranty of 303 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 304 | GNU General Public License for more details. 305 | 306 | You should have received a copy of the GNU General Public License along 307 | with this program; if not, write to the Free Software Foundation, Inc., 308 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 309 | 310 | Also add information on how to contact you by electronic and paper mail. 
311 | 312 | If the program is interactive, make it output a short notice like this 313 | when it starts in an interactive mode: 314 | 315 | Gnomovision version 69, Copyright (C) year name of author 316 | Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 317 | This is free software, and you are welcome to redistribute it 318 | under certain conditions; type `show c' for details. 319 | 320 | The hypothetical commands `show w' and `show c' should show the appropriate 321 | parts of the General Public License. Of course, the commands you use may 322 | be called something other than `show w' and `show c'; they could even be 323 | mouse-clicks or menu items--whatever suits your program. 324 | 325 | You should also get your employer (if you work as a programmer) or your 326 | school, if any, to sign a "copyright disclaimer" for the program, if 327 | necessary. Here is a sample; alter the names: 328 | 329 | Yoyodyne, Inc., hereby disclaims all copyright interest in the program 330 | `Gnomovision' (which makes passes at compilers) written by James Hacker. 331 | 332 | <signature of Ty Coon>, 1 April 1989 333 | Ty Coon, President of Vice 334 | 335 | This General Public License does not permit incorporating your program into 336 | proprietary programs. If your program is a subroutine library, you may 337 | consider it more useful to permit linking proprietary applications with the 338 | library. If this is what you want to do, use the GNU Lesser General 339 | Public License instead of this License. 340 | -------------------------------------------------------------------------------- /stanford_corenlp_python/README.md: -------------------------------------------------------------------------------- 1 | # Python interface to Stanford Core NLP tools v3.4.1 2 | 3 | This is a Python wrapper for the Stanford NLP Group's Java-based [CoreNLP tools](http://nlp.stanford.edu/software/corenlp.shtml). It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server. 4 | 5 | 6 | * Python interface to Stanford CoreNLP tools: tagging, phrase-structure parsing, dependency parsing, [named-entity recognition](http://en.wikipedia.org/wiki/Named-entity_recognition), and [coreference resolution](http://en.wikipedia.org/wiki/Coreference). 7 | * Runs a JSON-RPC server that wraps the Java server and outputs JSON. 8 | * Outputs parse trees which can be used by [nltk](http://nltk.googlecode.com/svn/trunk/doc/howto/tree.html). 9 | 10 | 11 | It depends on [pexpect](http://www.noah.org/wiki/pexpect) and includes and uses code from [jsonrpc](http://www.simple-is-better.org/rpc/) and [python-progressbar](http://code.google.com/p/python-progressbar/). 12 | 13 | It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on **Core NLP tools version 3.4.1** released 2014-08-27. 14 | 15 | ## Download and Usage 16 | 17 | To use this program you must [download](http://nlp.stanford.edu/software/corenlp.shtml#Download) and unpack the compressed file containing Stanford's CoreNLP package. 
By default, `corenlp.py` looks for the Stanford Core NLP folder as a subdirectory of where the script is being run. In other words: 18 | 19 | sudo pip install pexpect unidecode 20 | git clone git://github.com/dasmith/stanford-corenlp-python.git 21 | cd stanford-corenlp-python 22 | wget http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip 23 | unzip stanford-corenlp-full-2014-08-27.zip 24 | 25 | Then launch the server: 26 | 27 | python corenlp.py 28 | 29 | Optionally, you can specify a host or port: 30 | 31 | python corenlp.py -H 0.0.0.0 -p 3456 32 | 33 | That will run a public JSON-RPC server on port 3456. 34 | 35 | Assuming you are running on port 8080, the code in `client.py` shows an example parse: 36 | 37 | import jsonrpc 38 | from simplejson import loads 39 | server = jsonrpc.ServerProxy(jsonrpc.JsonRpc20(), 40 | jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080))) 41 | 42 | result = loads(server.parse("Hello world!  It is so beautiful.")) 43 | print "Result", result 44 | 45 | That returns a dictionary containing the keys `sentences` and `coref`. The key `sentences` contains a list with one dictionary per sentence; each dictionary contains `parsetree`, `text`, `tuples` (the dependencies), and `words` (information about parts of speech, recognized named entities, etc.): 46 | 47 | {u'sentences': [{u'parsetree': u'(ROOT (S (VP (NP (INTJ (UH Hello)) (NP (NN world)))) (. !)))', 48 | u'text': u'Hello world!', 49 | u'tuples': [[u'dep', u'world', u'Hello'], 50 | [u'root', u'ROOT', u'world']], 51 | u'words': [[u'Hello', 52 | {u'CharacterOffsetBegin': u'0', 53 | u'CharacterOffsetEnd': u'5', 54 | u'Lemma': u'hello', 55 | u'NamedEntityTag': u'O', 56 | u'PartOfSpeech': u'UH'}], 57 | [u'world', 58 | {u'CharacterOffsetBegin': u'6', 59 | u'CharacterOffsetEnd': u'11', 60 | u'Lemma': u'world', 61 | u'NamedEntityTag': u'O', 62 | u'PartOfSpeech': u'NN'}], 63 | [u'!', 64 | {u'CharacterOffsetBegin': u'11', 65 | u'CharacterOffsetEnd': u'12', 66 | u'Lemma': u'!', 67 | u'NamedEntityTag': u'O', 68 | u'PartOfSpeech': u'.'}]]}, 69 | {u'parsetree': u'(ROOT (S (NP (PRP It)) (VP (VBZ is) (ADJP (RB so) (JJ beautiful))) (. .)))', 70 | u'text': u'It is so beautiful.', 71 | u'tuples': [[u'nsubj', u'beautiful', u'It'], 72 | [u'cop', u'beautiful', u'is'], 73 | [u'advmod', u'beautiful', u'so'], 74 | [u'root', u'ROOT', u'beautiful']], 75 | u'words': [[u'It', 76 | {u'CharacterOffsetBegin': u'14', 77 | u'CharacterOffsetEnd': u'16', 78 | u'Lemma': u'it', 79 | u'NamedEntityTag': u'O', 80 | u'PartOfSpeech': u'PRP'}], 81 | [u'is', 82 | {u'CharacterOffsetBegin': u'17', 83 | u'CharacterOffsetEnd': u'19', 84 | u'Lemma': u'be', 85 | u'NamedEntityTag': u'O', 86 | u'PartOfSpeech': u'VBZ'}], 87 | [u'so', 88 | {u'CharacterOffsetBegin': u'20', 89 | u'CharacterOffsetEnd': u'22', 90 | u'Lemma': u'so', 91 | u'NamedEntityTag': u'O', 92 | u'PartOfSpeech': u'RB'}], 93 | [u'beautiful', 94 | {u'CharacterOffsetBegin': u'23', 95 | u'CharacterOffsetEnd': u'32', 96 | u'Lemma': u'beautiful', 97 | u'NamedEntityTag': u'O', 98 | u'PartOfSpeech': u'JJ'}], 99 | [u'.', 100 | {u'CharacterOffsetBegin': u'32', 101 | u'CharacterOffsetEnd': u'33', 102 | u'Lemma': u'.', 103 | u'NamedEntityTag': u'O', 104 | u'PartOfSpeech': u'.'}]]}], 105 | u'coref': [[[[u'It', 1, 0, 0, 1], [u'Hello world', 0, 1, 0, 2]]]]} 106 | 107 | To use it in a regular script (useful for debugging), load the module instead: 108 | 109 | from corenlp import * 110 | corenlp = StanfordCoreNLP() # wait a few minutes... 
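    # parse() returns a JSON *string* (see parse() in corenlp.py); to get a
    # dict back, wrap it in loads() from simplejson, as in the client example above: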
111 | corenlp.parse("Parse this sentence.") 112 | 113 | The server, `StanfordCoreNLP()`, takes an optional argument `corenlp_path` which specifies the path to the jar files. In this repository the default is `StanfordCoreNLP(corenlp_path="stanford-corenlp-full/")` (see `corenlp.py`). 114 | 115 | ## Coreference Resolution 116 | 117 | The library supports [coreference resolution](http://en.wikipedia.org/wiki/Coreference), which means pronouns can be "dereferenced." If an entry in the `coref` list is `[u'Hello world', 0, 1, 0, 2]`, the numbers mean: 118 | 119 | * 0 = The reference appears in the 0th sentence (e.g. "Hello world") 120 | * 1 = The 2nd token, "world", is the [headword](http://en.wikipedia.org/wiki/Head_%28linguistics%29) of the mention 121 | * 0 = 'Hello world' begins at the 0th token in the sentence 122 | * 2 = 'Hello world' ends before the 2nd token in the sentence. 123 | 124 | 135 | 136 | 137 | ## Questions 138 | 139 | **Stanford CoreNLP tools require a large amount of free memory**. Java 5+ uses about 50% more RAM on 64-bit machines than 32-bit machines. 32-bit machine users can lower the memory requirements by changing `-Xmx3g` to `-Xmx2g` or even less. 140 | If pexpect times out while loading models, check to make sure you have enough memory and can run the server alone without your kernel killing the java process: 141 | 142 | java -cp stanford-corenlp-3.4.1.jar:stanford-corenlp-3.4.1-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props default.properties 143 | 144 | You can reach me, Dustin Smith, by sending a message on GitHub or through email (contact information is available [on my webpage](http://web.media.mit.edu/~dustin)). 145 | 146 | 147 | # License & Contributors 148 | 149 | This is free and open source software and has benefited from the contribution and feedback of others. Like Stanford's CoreNLP tools, it is covered under the [GNU General Public License v2+](http://www.gnu.org/licenses/gpl-2.0.html), which in short means that modifications to this program must maintain the same free and open source distribution policy. 150 | 151 | I gratefully welcome bug fixes and new features. If you have forked this repository, please submit a [pull request](https://help.github.com/articles/using-pull-requests/) so others can benefit from your contributions. This project has already benefited from contributions from these members of the open source community: 152 | 153 | * [Emilio Monti](https://github.com/emilmont) 154 | * [Justin Cheng](https://github.com/jcccf) 155 | * Abhaya Agarwal 156 | 157 | *Thank you!* 158 | 159 | ## Related Projects 160 | 161 | Maintainers of the Core NLP library at Stanford keep an [updated list of wrappers and extensions](http://nlp.stanford.edu/software/corenlp.shtml#Extensions). See Brendan O'Connor's [stanford_corenlp_pywrapper](https://github.com/brendano/stanford_corenlp_pywrapper) for a different approach more suited to batch processing. 
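## Worked Example: Reading a `coref` Entry

The following is a minimal sketch of how to apply the index conventions described under "Coreference Resolution" above; it assumes `result` is the dictionary returned by the earlier `server.parse("Hello world!  It is so beautiful.")` example:

    # unpack the mention [u'Hello world', 0, 1, 0, 2]:
    # (text, sentence index, head token index, start token, end token)
    mention, sent_i, head_i, start, end = result['coref'][0][0][1]
    # each entry in 'words' is a [token, attributes] pair, so w[0] is the token
    tokens = [w[0] for w in result['sentences'][sent_i]['words']]
    print tokens[start:end]   # [u'Hello', u'world']
    print tokens[head_i]      # u'world', the headword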
162 | -------------------------------------------------------------------------------- /stanford_corenlp_python/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davidsbatista/Aspect-Based-Sentiment-Analysis/1d9c8ec1131993d924e96676fa212db6b53cb870/stanford_corenlp_python/__init__.py -------------------------------------------------------------------------------- /stanford_corenlp_python/client.py: -------------------------------------------------------------------------------- 1 | import json 2 | from jsonrpc import ServerProxy, JsonRpc20, TransportTcpIp 3 | from pprint import pprint 4 | 5 | class StanfordNLP: 6 | def __init__(self): 7 | self.server = ServerProxy(JsonRpc20(), 8 | TransportTcpIp(addr=("127.0.0.1", 8080))) 9 | 10 | def parse(self, text): 11 | return json.loads(self.server.parse(text)) 12 | 13 | nlp = StanfordNLP() 14 | result = nlp.parse("Hello world! It is so beautiful.") 15 | pprint(result) 16 | 17 | from nltk.tree import Tree 18 | tree = Tree.parse(result['sentences'][0]['parsetree']) 19 | pprint(tree) 20 | -------------------------------------------------------------------------------- /stanford_corenlp_python/corenlp.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # corenlp - Python interface to Stanford Core NLP tools 4 | # Copyright (c) 2014 Dustin Smith 5 | # https://github.com/dasmith/stanford-corenlp-python 6 | # 7 | # This program is free software; you can redistribute it and/or 8 | # modify it under the terms of the GNU General Public License 9 | # as published by the Free Software Foundation; either version 2 10 | # of the License, or (at your option) any later version. 11 | # 12 | # This program is distributed in the hope that it will be useful, 13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 15 | # GNU General Public License for more details. 16 | # 17 | # You should have received a copy of the GNU General Public License 18 | # along with this program; if not, write to the Free Software 19 | # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. 20 | 21 | import json 22 | import optparse 23 | import os, re, sys, time, traceback 24 | import jsonrpc, pexpect 25 | from progressbar import ProgressBar, Fraction 26 | import logging 27 | 28 | 29 | VERBOSE = True 30 | 31 | STATE_START, STATE_TEXT, STATE_WORDS, STATE_TREE, STATE_DEPENDENCY, STATE_COREFERENCE = 0, 1, 2, 3, 4, 5 32 | WORD_PATTERN = re.compile('\[([^\]]+)\]') 33 | CR_PATTERN = re.compile(r"\((\d*),(\d)*,\[(\d*),(\d*)\]\) -> \((\d*),(\d)*,\[(\d*),(\d*)\]\), that is: \"(.*)\" -> \"(.*)\"") 34 | 35 | # initialize logger 36 | logging.basicConfig(level=logging.INFO) 37 | logger = logging.getLogger(__name__) 38 | 39 | 40 | def remove_id(word): 41 | """Removes the numeric suffix from the parsed recognized words: e.g. 'word-2' > 'word' """ 42 | return word.count("-") == 0 and word or word[0:word.rindex("-")] 43 | 44 | 45 | def parse_bracketed(s): 46 | '''Parse word features [abc=... def = ...] 
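    For example (a sketch of the input after WORD_PATTERN has stripped the
    enclosing brackets): "Text=Hello PartOfSpeech=UH" -> ('Hello', {'PartOfSpeech': 'UH'})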
47 | Also manages to parse out features that have XML within them 48 | ''' 49 | word = None 50 | attrs = {} 51 | temp = {} 52 | # Substitute XML tags, to replace them later 53 | for i, tag in enumerate(re.findall(r"(<[^<>]+>.*<\/[^<>]+>)", s)): 54 | temp["^^^%d^^^" % i] = tag 55 | s = s.replace(tag, "^^^%d^^^" % i) 56 | # Load key-value pairs, substituting as necessary 57 | for attr, val in re.findall(r"([^=\s]*)=([^=\s]*)", s): 58 | if val in temp: 59 | val = temp[val] 60 | if attr == 'Text': 61 | word = val 62 | else: 63 | attrs[attr] = val 64 | return (word, attrs) 65 | 66 | 67 | def parse_parser_results(text): 68 | """ This is the nasty bit of code to interact with the command-line 69 | interface of the CoreNLP tools. Takes a string of the parser results 70 | and then returns a Python list of dictionaries, one for each parsed 71 | sentence. 72 | """ 73 | results = {"sentences": []} 74 | state = STATE_START 75 | for line in text.encode('utf-8').split("\n"): 76 | line = line.strip() 77 | 78 | if line.startswith("Sentence #"): 79 | sentence = {'words':[], 'parsetree':[], 'dependencies':[]} 80 | results["sentences"].append(sentence) 81 | state = STATE_TEXT 82 | 83 | elif state == STATE_TEXT: 84 | sentence['text'] = line 85 | state = STATE_WORDS 86 | 87 | elif state == STATE_WORDS: 88 | if not line.startswith("[Text="): 89 | raise Exception('Parse error. Could not find "[Text=" in: %s' % line) 90 | for s in WORD_PATTERN.findall(line): 91 | sentence['words'].append(parse_bracketed(s)) 92 | state = STATE_TREE 93 | 94 | elif state == STATE_TREE: 95 | if len(line) == 0: 96 | state = STATE_DEPENDENCY 97 | sentence['parsetree'] = " ".join(sentence['parsetree']) 98 | else: 99 | sentence['parsetree'].append(line) 100 | 101 | elif state == STATE_DEPENDENCY: 102 | if len(line) == 0: 103 | state = STATE_COREFERENCE 104 | else: 105 | split_entry = re.split("\(|, ", line[:-1]) 106 | if len(split_entry) == 3: 107 | rel, left, right = map(lambda x: remove_id(x), split_entry) 108 | sentence['dependencies'].append(tuple([rel,left,right])) 109 | 110 | elif state == STATE_COREFERENCE: 111 | if "Coreference set" in line: 112 | if 'coref' not in results: 113 | results['coref'] = [] 114 | coref_set = [] 115 | results['coref'].append(coref_set) 116 | else: 117 | for src_i, src_pos, src_l, src_r, sink_i, sink_pos, sink_l, sink_r, src_word, sink_word in CR_PATTERN.findall(line): 118 | src_i, src_pos, src_l, src_r = int(src_i)-1, int(src_pos)-1, int(src_l)-1, int(src_r)-1 119 | sink_i, sink_pos, sink_l, sink_r = int(sink_i)-1, int(sink_pos)-1, int(sink_l)-1, int(sink_r)-1 120 | coref_set.append(((src_word, src_i, src_pos, src_l, src_r), (sink_word, sink_i, sink_pos, sink_l, sink_r))) 121 | 122 | return results 123 | 124 | 125 | class StanfordCoreNLP(object): 126 | """ 127 | Command-line interaction with Stanford's CoreNLP java utilities. 128 | Can be run as a JSON-RPC server or imported as a module. 129 | """ 130 | def __init__(self, corenlp_path=None): 131 | """ 132 | Checks the location of the jar files. 133 | Spawns the server as a process. 
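        Note: the jar names listed below assume a CoreNLP 3.6.0 distribution;
        if you unpacked another release (e.g. the 3.4.1 build the README was
        tested against), adjust the `jars` list and `corenlp_path` accordingly.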
134 | """ 135 | jars = ["stanford-corenlp-3.6.0.jar", 136 | "stanford-corenlp-3.6.0-models.jar", 137 | "joda-time.jar", 138 | "xom.jar", 139 | "jollyday.jar", 140 | "slf4j-api.jar"] 141 | 142 | # if CoreNLP libraries are in a different directory, 143 | # change the corenlp_path variable to point to them 144 | if not corenlp_path: 145 | #corenlp_path = "./stanford-corenlp-full-2014-08-27/" 146 | corenlp_path = "stanford-corenlp-full/" 147 | 148 | java_path = "java" 149 | classname = "edu.stanford.nlp.pipeline.StanfordCoreNLP" 150 | # include the properties file, so you can change defaults 151 | # but any changes in output format will break parse_parser_results() 152 | props = "-props default.properties" 153 | 154 | # add and check classpaths 155 | jars = [corenlp_path + jar for jar in jars] 156 | for jar in jars: 157 | if not os.path.exists(jar): 158 | logger.error("Error! Cannot locate %s" % jar) 159 | sys.exit(1) 160 | 161 | # spawn the server 162 | start_corenlp = "%s -Xmx1800m -cp %s %s %s" % (java_path, ':'.join(jars), classname, props) 163 | if VERBOSE: 164 | logger.debug(start_corenlp) 165 | self.corenlp = pexpect.spawn(start_corenlp) 166 | 167 | # show progress bar while loading the models 168 | widgets = ['Loading Models: ', Fraction()] 169 | pbar = ProgressBar(widgets=widgets, maxval=5, force_update=True).start() 170 | self.corenlp.expect("done.", timeout=20) # Load pos tagger model (~5sec) 171 | pbar.update(1) 172 | self.corenlp.expect("done.", timeout=200) # Load NER-all classifier (~33sec) 173 | pbar.update(2) 174 | self.corenlp.expect("done.", timeout=600) # Load NER-muc classifier (~60sec) 175 | pbar.update(3) 176 | self.corenlp.expect("done.", timeout=600) # Load CoNLL classifier (~50sec) 177 | pbar.update(4) 178 | self.corenlp.expect("done.", timeout=200) # Loading PCFG (~3sec) 179 | pbar.update(5) 180 | self.corenlp.expect("Entering interactive shell.") 181 | pbar.finish() 182 | 183 | def _parse(self, text): 184 | """ 185 | This is the core interaction with the parser. 186 | 187 | It returns a Python data-structure, while the parse() 188 | function returns a JSON object 189 | """ 190 | # clean up anything leftover 191 | while True: 192 | try: 193 | self.corenlp.read_nonblocking (4000, 0.3) 194 | except pexpect.TIMEOUT: 195 | break 196 | 197 | self.corenlp.sendline(text) 198 | 199 | # How much time should we give the parser to parse it? 200 | # the idea here is that you increase the timeout as a 201 | # function of the text's length. 
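        # e.g. a 100-character input gets min(40, 3 + 100/20.0) = 8 seconds,
        # while anything longer than 740 characters hits the 40-second cap.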
202 | # anything longer than 5 seconds requires that you also 203 | # increase timeout=5 in jsonrpc.py 204 | max_expected_time = min(40, 3 + len(text) / 20.0) 205 | end_time = time.time() + max_expected_time 206 | 207 | incoming = "" 208 | while True: 209 | # Time left, read more data 210 | try: 211 | incoming += self.corenlp.read_nonblocking(2000, 1) 212 | if "\nNLP>" in incoming: 213 | break 214 | time.sleep(0.0001) 215 | except pexpect.TIMEOUT: 216 | if end_time - time.time() < 0: 217 | logger.error("Error: Timeout with input '%s'" % (incoming)) 218 | return {'error': "timed out after %f seconds" % max_expected_time} 219 | else: 220 | continue 221 | except pexpect.EOF: 222 | break 223 | 224 | if VERBOSE: 225 | logger.debug("%s\n%s" % ('='*40, incoming)) 226 | try: 227 | results = parse_parser_results(incoming) 228 | except Exception, e: 229 | if VERBOSE: 230 | logger.debug(traceback.format_exc()) 231 | raise e 232 | 233 | return results 234 | 235 | def parse(self, text): 236 | """ 237 | This function takes a text string, sends it to the Stanford parser, 238 | reads in the result, parses the results and returns a list 239 | with one dictionary entry for each parsed sentence, in JSON format. 240 | """ 241 | response = self._parse(text) 242 | logger.debug("Response: '%s'" % (response)) 243 | return json.dumps(response) 244 | 245 | 246 | if __name__ == '__main__': 247 | """ 248 | The code below starts an JSONRPC server 249 | """ 250 | parser = optparse.OptionParser(usage="%prog [OPTIONS]") 251 | parser.add_option('-p', '--port', default='8080', 252 | help='Port to serve on (default: 8080)') 253 | parser.add_option('-H', '--host', default='127.0.0.1', 254 | help='Host to serve on (default: 127.0.0.1. Use 0.0.0.0 to make public)') 255 | options, args = parser.parse_args() 256 | server = jsonrpc.Server(jsonrpc.JsonRpc20(), 257 | jsonrpc.TransportTcpIp(addr=(options.host, int(options.port)))) 258 | 259 | nlp = StanfordCoreNLP() 260 | server.register_function(nlp.parse) 261 | 262 | logger.info('Serving on http://%s:%s' % (options.host, options.port)) 263 | server.serve() 264 | -------------------------------------------------------------------------------- /stanford_corenlp_python/default.properties: -------------------------------------------------------------------------------- 1 | annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref 2 | 3 | # A true-casing annotator is also available (see below) 4 | #annotators = tokenize, ssplit, pos, lemma, truecase 5 | 6 | # A simple regex NER annotator is also available 7 | # annotators = tokenize, ssplit, regexner 8 | 9 | #Use these as EOS punctuation and discard them from the actual sentence content 10 | #These are HTML tags that get expanded internally to correct syntax, e.g., from "p" to "
<p>", "</p>" etc. 11 | #Will have no effect if the "cleanxml" annotator is used 12 | #ssplit.htmlBoundariesToDiscard = p,text 13 | 14 | # 15 | # None of these paths are necessary anymore: we load all models from the JAR file 16 | # 17 | 18 | #pos.model = /u/nlp/data/pos-tagger/wsj3t0-18-left3words/left3words-distsim-wsj-0-18.tagger 19 | ## slightly better model but much slower: 20 | ##pos.model = /u/nlp/data/pos-tagger/wsj3t0-18-bidirectional/bidirectional-distsim-wsj-0-18.tagger 21 | 22 | #ner.model.3class = /u/nlp/data/ner/goodClassifiers/all.3class.distsim.crf.ser.gz 23 | #ner.model.7class = /u/nlp/data/ner/goodClassifiers/muc.distsim.crf.ser.gz 24 | #ner.model.MISCclass = /u/nlp/data/ner/goodClassifiers/conll.distsim.crf.ser.gz 25 | 26 | #regexner.mapping = /u/nlp/data/TAC-KBP2010/sentence_extraction/type_map_clean 27 | #regexner.ignorecase = false 28 | 29 | #nfl.gazetteer = /scr/nlp/data/machine-reading/Machine_Reading_P1_Reading_Task_V2.0/data/SportsDomain/NFLScoring_UseCase/NFLgazetteer.txt 30 | #nfl.relation.model = /scr/nlp/data/ldc/LDC2009E112/Machine_Reading_P1_NFL_Scoring_Training_Data_V1.2/models/nfl_relation_model.ser 31 | #nfl.entity.model = /scr/nlp/data/ldc/LDC2009E112/Machine_Reading_P1_NFL_Scoring_Training_Data_V1.2/models/nfl_entity_model.ser 32 | #printable.relation.beam = 20 33 | 34 | #parser.model = /u/nlp/data/lexparser/englishPCFG.ser.gz 35 | 36 | #srl.verb.args=/u/kristina/srl/verbs.core_args 37 | #srl.model.cls=/u/nlp/data/srl/trainedModels/englishPCFG/cls/train.ann 38 | #srl.model.id=/u/nlp/data/srl/trainedModels/englishPCFG/id/train.ann 39 | 40 | #coref.model=/u/nlp/rte/resources/anno/coref/corefClassifierAll.March2009.ser.gz 41 | #coref.name.dir=/u/nlp/data/coref/ 42 | #wordnet.dir=/u/nlp/data/wordnet/wordnet-3.0-prolog 43 | 44 | #dcoref.demonym = /scr/heeyoung/demonyms.txt 45 | #dcoref.animate = /scr/nlp/data/DekangLin-Animacy-Gender/Animacy/animate.unigrams.txt 46 | #dcoref.inanimate = /scr/nlp/data/DekangLin-Animacy-Gender/Animacy/inanimate.unigrams.txt 47 | #dcoref.male = /scr/nlp/data/Bergsma-Gender/male.unigrams.txt 48 | #dcoref.neutral = /scr/nlp/data/Bergsma-Gender/neutral.unigrams.txt 49 | #dcoref.female = /scr/nlp/data/Bergsma-Gender/female.unigrams.txt 50 | #dcoref.plural = /scr/nlp/data/Bergsma-Gender/plural.unigrams.txt 51 | #dcoref.singular = /scr/nlp/data/Bergsma-Gender/singular.unigrams.txt 52 | 53 | 54 | # This is the regular expression that describes which xml tags to keep 55 | # the text from. In order to turn the xml removal on or off, add cleanxml 56 | # to the list of annotators above after "tokenize". 57 | #clean.xmltags = .* 58 | # A set of tags which will force the end of a sentence. HTML example: 59 | # you would not want to end on <link>, but you would want to end on
<p>
. 60 | # Once again, a regular expression. 61 | # (Blank means there are no sentence enders.) 62 | #clean.sentenceendingtags = 63 | # Whether or not to allow malformed xml 64 | # StanfordCoreNLP.properties 65 | #wordnet.dir=models/wordnet-3.0-prolog 66 | -------------------------------------------------------------------------------- /stanford_corenlp_python/progressbar.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: iso-8859-1 -*- 3 | # 4 | # progressbar - Text progressbar library for python. 5 | # Copyright (c) 2005 Nilton Volpato 6 | # 7 | # This library is free software; you can redistribute it and/or 8 | # modify it under the terms of the GNU Lesser General Public 9 | # License as published by the Free Software Foundation; either 10 | # version 2.1 of the License, or (at your option) any later version. 11 | # 12 | # This library is distributed in the hope that it will be useful, 13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 15 | # Lesser General Public License for more details. 16 | # 17 | # You should have received a copy of the GNU Lesser General Public 18 | # License along with this library; if not, write to the Free Software 19 | # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA 20 | 21 | 22 | """Text progressbar library for python. 23 | 24 | This library provides a text mode progressbar. This is typically used 25 | to display the progress of a long running operation, providing a 26 | visual clue that processing is underway. 27 | 28 | The ProgressBar class manages the progress, and the format of the line 29 | is given by a number of widgets. A widget is an object that may 30 | display diferently depending on the state of the progress. There are 31 | three types of widget: 32 | - a string, which always shows itself; 33 | - a ProgressBarWidget, which may return a diferent value every time 34 | it's update method is called; and 35 | - a ProgressBarWidgetHFill, which is like ProgressBarWidget, except it 36 | expands to fill the remaining width of the line. 37 | 38 | The progressbar module is very easy to use, yet very powerful. And 39 | automatically supports features like auto-resizing when available. 40 | """ 41 | 42 | __author__ = "Nilton Volpato" 43 | __author_email__ = "first-name dot last-name @ gmail.com" 44 | __date__ = "2006-05-07" 45 | __version__ = "2.2" 46 | 47 | # Changelog 48 | # 49 | # 2006-05-07: v2.2 fixed bug in windows 50 | # 2005-12-04: v2.1 autodetect terminal width, added start method 51 | # 2005-12-04: v2.0 everything is now a widget (wow!) 52 | # 2005-12-03: v1.0 rewrite using widgets 53 | # 2005-06-02: v0.5 rewrite 54 | # 2004-??-??: v0.1 first version 55 | 56 | import sys 57 | import time 58 | from array import array 59 | try: 60 | from fcntl import ioctl 61 | import termios 62 | except ImportError: 63 | pass 64 | import signal 65 | 66 | 67 | class ProgressBarWidget(object): 68 | """This is an element of ProgressBar formatting. 69 | 70 | The ProgressBar object will call it's update value when an update 71 | is needed. It's size may change between call, but the results will 72 | not be good if the size changes drastically and repeatedly. 73 | """ 74 | def update(self, pbar): 75 | """Returns the string representing the widget. 
76 | 77 | The parameter pbar is a reference to the calling ProgressBar, 78 | where one can access attributes of the class for knowing how 79 | the update must be made. 80 | 81 | At least this function must be overriden.""" 82 | pass 83 | 84 | 85 | class ProgressBarWidgetHFill(object): 86 | """This is a variable width element of ProgressBar formatting. 87 | 88 | The ProgressBar object will call it's update value, informing the 89 | width this object must the made. This is like TeX \\hfill, it will 90 | expand to fill the line. You can use more than one in the same 91 | line, and they will all have the same width, and together will 92 | fill the line. 93 | """ 94 | def update(self, pbar, width): 95 | """Returns the string representing the widget. 96 | 97 | The parameter pbar is a reference to the calling ProgressBar, 98 | where one can access attributes of the class for knowing how 99 | the update must be made. The parameter width is the total 100 | horizontal width the widget must have. 101 | 102 | At least this function must be overriden.""" 103 | pass 104 | 105 | 106 | class ETA(ProgressBarWidget): 107 | "Widget for the Estimated Time of Arrival" 108 | def format_time(self, seconds): 109 | return time.strftime('%H:%M:%S', time.gmtime(seconds)) 110 | 111 | def update(self, pbar): 112 | if pbar.currval == 0: 113 | return 'ETA: --:--:--' 114 | elif pbar.finished: 115 | return 'Time: %s' % self.format_time(pbar.seconds_elapsed) 116 | else: 117 | elapsed = pbar.seconds_elapsed 118 | eta = elapsed * pbar.maxval / pbar.currval - elapsed 119 | return 'ETA: %s' % self.format_time(eta) 120 | 121 | 122 | class FileTransferSpeed(ProgressBarWidget): 123 | "Widget for showing the transfer speed (useful for file transfers)." 124 | def __init__(self): 125 | self.fmt = '%6.2f %s' 126 | self.units = ['B', 'K', 'M', 'G', 'T', 'P'] 127 | 128 | def update(self, pbar): 129 | if pbar.seconds_elapsed < 2e-6: # == 0: 130 | bps = 0.0 131 | else: 132 | bps = float(pbar.currval) / pbar.seconds_elapsed 133 | spd = bps 134 | for u in self.units: 135 | if spd < 1000: 136 | break 137 | spd /= 1000 138 | return self.fmt % (spd, u + '/s') 139 | 140 | 141 | class RotatingMarker(ProgressBarWidget): 142 | "A rotating marker for filling the bar of progress." 143 | def __init__(self, markers='|/-\\'): 144 | self.markers = markers 145 | self.curmark = -1 146 | 147 | def update(self, pbar): 148 | if pbar.finished: 149 | return self.markers[0] 150 | self.curmark = (self.curmark + 1) % len(self.markers) 151 | return self.markers[self.curmark] 152 | 153 | 154 | class Percentage(ProgressBarWidget): 155 | "Just the percentage done." 156 | def update(self, pbar): 157 | return '%3d%%' % pbar.percentage() 158 | 159 | 160 | class Fraction(ProgressBarWidget): 161 | "Just the fraction done." 162 | def update(self, pbar): 163 | return "%d/%d" % (pbar.currval, pbar.maxval) 164 | 165 | 166 | class Bar(ProgressBarWidgetHFill): 167 | "The bar of progress. It will strech to fill the line." 
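    # As a ProgressBarWidgetHFill subclass, Bar.update() receives the leftover
    # line width (computed in ProgressBar._format_widgets) and pads the marker
    # string out to exactly that width.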
168 | def __init__(self, marker='#', left='|', right='|'): 169 | self.marker = marker 170 | self.left = left 171 | self.right = right 172 | 173 | def _format_marker(self, pbar): 174 | if isinstance(self.marker, (str, unicode)): 175 | return self.marker 176 | else: 177 | return self.marker.update(pbar) 178 | 179 | def update(self, pbar, width): 180 | percent = pbar.percentage() 181 | cwidth = width - len(self.left) - len(self.right) 182 | marked_width = int(percent * cwidth / 100) 183 | m = self._format_marker(pbar) 184 | bar = (self.left + (m * marked_width).ljust(cwidth) + self.right) 185 | return bar 186 | 187 | 188 | class ReverseBar(Bar): 189 | "The reverse bar of progress, or bar of regress. :)" 190 | def update(self, pbar, width): 191 | percent = pbar.percentage() 192 | cwidth = width - len(self.left) - len(self.right) 193 | marked_width = int(percent * cwidth / 100) 194 | m = self._format_marker(pbar) 195 | bar = (self.left + (m * marked_width).rjust(cwidth) + self.right) 196 | return bar 197 | 198 | default_widgets = [Percentage(), ' ', Bar()] 199 | 200 | 201 | class ProgressBar(object): 202 | """This is the ProgressBar class, it updates and prints the bar. 203 | 204 | The term_width parameter may be an integer. Or None, in which case 205 | it will try to guess it, if it fails it will default to 80 columns. 206 | 207 | The simple use is like this: 208 | >>> pbar = ProgressBar().start() 209 | >>> for i in xrange(100): 210 | ... # do something 211 | ... pbar.update(i+1) 212 | ... 213 | >>> pbar.finish() 214 | 215 | But anything you want to do is possible (well, almost anything). 216 | You can supply different widgets of any type in any order. And you 217 | can even write your own widgets! There are many widgets already 218 | shipped and you should experiment with them. 219 | 220 | When implementing a widget update method you may access any 221 | attribute or function of the ProgressBar object calling the 222 | widget's update method. The most important attributes you would 223 | like to access are: 224 | - currval: current value of the progress, 0 <= currval <= maxval 225 | - maxval: maximum (and final) value of the progress 226 | - finished: True if the bar is have finished (reached 100%), False o/w 227 | - start_time: first time update() method of ProgressBar was called 228 | - seconds_elapsed: seconds elapsed since start_time 229 | - percentage(): percentage of the progress (this is a method) 230 | """ 231 | def __init__(self, maxval=100, widgets=default_widgets, term_width=None, 232 | fd=sys.stderr, force_update=False): 233 | assert maxval > 0 234 | self.maxval = maxval 235 | self.widgets = widgets 236 | self.fd = fd 237 | self.signal_set = False 238 | if term_width is None: 239 | try: 240 | self.handle_resize(None, None) 241 | signal.signal(signal.SIGWINCH, self.handle_resize) 242 | self.signal_set = True 243 | except: 244 | self.term_width = 79 245 | else: 246 | self.term_width = term_width 247 | 248 | self.currval = 0 249 | self.finished = False 250 | self.prev_percentage = -1 251 | self.start_time = None 252 | self.seconds_elapsed = 0 253 | self.force_update = force_update 254 | 255 | def handle_resize(self, signum, frame): 256 | h, w = array('h', ioctl(self.fd, termios.TIOCGWINSZ, '\0' * 8))[:2] 257 | self.term_width = w 258 | 259 | def percentage(self): 260 | "Returns the percentage of the progress." 
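        # 100.0 keeps this a float division on Python 2 (e.g. 1 of 3 -> 33.33,
        # not a truncated 33)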
261 | return self.currval * 100.0 / self.maxval 262 | 263 | def _format_widgets(self): 264 | r = [] 265 | hfill_inds = [] 266 | num_hfill = 0 267 | currwidth = 0 268 | for i, w in enumerate(self.widgets): 269 | if isinstance(w, ProgressBarWidgetHFill): 270 | r.append(w) 271 | hfill_inds.append(i) 272 | num_hfill += 1 273 | elif isinstance(w, (str, unicode)): 274 | r.append(w) 275 | currwidth += len(w) 276 | else: 277 | weval = w.update(self) 278 | currwidth += len(weval) 279 | r.append(weval) 280 | for iw in hfill_inds: 281 | r[iw] = r[iw].update(self, 282 | (self.term_width - currwidth) / num_hfill) 283 | return r 284 | 285 | def _format_line(self): 286 | return ''.join(self._format_widgets()).ljust(self.term_width) 287 | 288 | def _need_update(self): 289 | if self.force_update: 290 | return True 291 | return int(self.percentage()) != int(self.prev_percentage) 292 | 293 | def reset(self): 294 | if not self.finished and self.start_time: 295 | self.finish() 296 | self.finished = False 297 | self.currval = 0 298 | self.start_time = None 299 | self.seconds_elapsed = None 300 | self.prev_percentage = None 301 | return self 302 | 303 | def update(self, value): 304 | "Updates the progress bar to a new value." 305 | assert 0 <= value <= self.maxval 306 | self.currval = value 307 | if not self._need_update() or self.finished: 308 | return 309 | if not self.start_time: 310 | self.start_time = time.time() 311 | self.seconds_elapsed = time.time() - self.start_time 312 | self.prev_percentage = self.percentage() 313 | if value != self.maxval: 314 | self.fd.write(self._format_line() + '\r') 315 | else: 316 | self.finished = True 317 | self.fd.write(self._format_line() + '\n') 318 | 319 | def start(self): 320 | """Start measuring time, and prints the bar at 0%. 321 | 322 | It returns self so you can use it like this: 323 | >>> pbar = ProgressBar().start() 324 | >>> for i in xrange(100): 325 | ... # do something 326 | ... pbar.update(i+1) 327 | ... 
328 | >>> pbar.finish() 329 | """ 330 | self.update(0) 331 | return self 332 | 333 | def finish(self): 334 | """Used to tell the progress is finished.""" 335 | self.update(self.maxval) 336 | if self.signal_set: 337 | signal.signal(signal.SIGWINCH, signal.SIG_DFL) 338 | 339 | 340 | def example1(): 341 | widgets = ['Test: ', Percentage(), ' ', Bar(marker=RotatingMarker()), 342 | ' ', ETA(), ' ', FileTransferSpeed()] 343 | pbar = ProgressBar(widgets=widgets, maxval=10000000).start() 344 | for i in range(1000000): 345 | # do something 346 | pbar.update(10 * i + 1) 347 | pbar.finish() 348 | return pbar 349 | 350 | 351 | def example2(): 352 | class CrazyFileTransferSpeed(FileTransferSpeed): 353 | "It's bigger between 45 and 80 percent" 354 | def update(self, pbar): 355 | if 45 < pbar.percentage() < 80: 356 | return 'Bigger Now ' + FileTransferSpeed.update(self, pbar) 357 | else: 358 | return FileTransferSpeed.update(self, pbar) 359 | 360 | widgets = [CrazyFileTransferSpeed(), ' <<<', 361 | Bar(), '>>> ', Percentage(), ' ', ETA()] 362 | pbar = ProgressBar(widgets=widgets, maxval=10000000) 363 | # maybe do something 364 | pbar.start() 365 | for i in range(2000000): 366 | # do something 367 | pbar.update(5 * i + 1) 368 | pbar.finish() 369 | return pbar 370 | 371 | 372 | def example3(): 373 | widgets = [Bar('>'), ' ', ETA(), ' ', ReverseBar('<')] 374 | pbar = ProgressBar(widgets=widgets, maxval=10000000).start() 375 | for i in range(1000000): 376 | # do something 377 | pbar.update(10 * i + 1) 378 | pbar.finish() 379 | return pbar 380 | 381 | 382 | def example4(): 383 | widgets = ['Test: ', Percentage(), ' ', 384 | Bar(marker='0', left='[', right=']'), 385 | ' ', ETA(), ' ', FileTransferSpeed()] 386 | pbar = ProgressBar(widgets=widgets, maxval=500) 387 | pbar.start() 388 | for i in range(100, 500 + 1, 50): 389 | time.sleep(0.2) 390 | pbar.update(i) 391 | pbar.finish() 392 | return pbar 393 | 394 | 395 | def example5(): 396 | widgets = ['Test: ', Fraction(), ' ', Bar(marker=RotatingMarker()), 397 | ' ', ETA(), ' ', FileTransferSpeed()] 398 | pbar = ProgressBar(widgets=widgets, maxval=10, force_update=True).start() 399 | for i in range(1, 11): 400 | # do something 401 | time.sleep(0.5) 402 | pbar.update(i) 403 | pbar.finish() 404 | return pbar 405 | 406 | 407 | def main(): 408 | example1() 409 | print 410 | example2() 411 | print 412 | example3() 413 | print 414 | example4() 415 | print 416 | example5() 417 | print 418 | 419 | if __name__ == '__main__': 420 | main() 421 | --------------------------------------------------------------------------------