├── .gitignore
├── README.md
├── classify_and_plot_reviews.ipynb
├── classify_elastic
│   ├── Extract keywords.ipynb
│   ├── classify_pipe.py
│   ├── generate_files_for_indexing.py
│   ├── index_definition.json
│   ├── index_opinion_units.py
│   ├── index_reviews.py
│   ├── opinionTokenizer.py
│   └── queries
│       ├── Overall sentiment per city.json
│       ├── Overall sentiment per hotel class.json
│       └── Topic sentiment per city.json
├── csv_monkey_converter.py
├── hotel_sentiment
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── booking_single_hotel_spider.py
│       ├── booking_spider.py
│       ├── tripadvisor_spider.py
│       └── tripadvisor_spider_moreinfo.py
├── opinionTokenizer.py
└── scrapy.cfg

/.gitignore:
--------------------------------------------------------------------------------
1 | *.csv
2 | *.pyc
3 | *.ipynb_checkpoints
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Sentiment Analysis and Aspect Classification for Hotel Reviews
 2 | 
 3 | This is the source code for MonkeyLearn's series of posts on analyzing sentiment and aspects in hotel reviews using machine learning models. The code runs on Python 2.7.
 4 | 
 5 | (May 2018 update -- TripAdvisor and Booking.com have changed their sites greatly since these spiders were written, and as such, they no longer work. The blog posts and code are still useful as an example of how to build a Scrapy spider, but sadly, the examples themselves are no longer functional. We will probably fix the spiders in the future, since updating the selectors should be enough to get everything working again.)
 6 | 
 7 | ### Code organization
 8 | 
 9 | The project itself is a Scrapy project used to gather training and testing data from sites like TripAdvisor and Booking.com. In addition, a series of Python scripts and Jupyter notebooks implement the rest of the processing pipeline.
10 | 
11 | ### [Creating a sentiment analysis model with Scrapy and MonkeyLearn](https://blog.monkeylearn.com/creating-sentiment-analysis-model-with-scrapy-and-monkeylearn/)
12 | 
13 | The TripAdvisor spider (hotel_sentiment/spiders/tripadvisor_spider.py) is used to gather data to train a sentiment analysis classifier in MonkeyLearn. Review texts are used as the sample content, and review stars are used as the category (1 and 2 stars = Negative, 4 and 5 stars = Positive).
14 | 
15 | To crawl ~15,000 items from TripAdvisor, use:
16 | ```sh
17 | scrapy crawl tripadvisor -o itemsTripadvisor.csv -s CLOSESPIDER_ITEMCOUNT=15000
18 | ```
19 | You can check out the generated machine learning sentiment analysis model [here](https://app.monkeylearn.com/categorizer/projects/cl_rZ2P7hbs/tab/main-tab).
20 | 
21 | ### [Aspect Analysis from reviews using Machine Learning](https://blog.monkeylearn.com/aspect-analysis-from-reviews-using-machine-learning/)
22 | 
23 | The Booking spider (hotel_sentiment/spiders/booking_spider.py) is used to gather data to train an aspect classifier in MonkeyLearn. The data obtained with this spider can be manually tagged with each aspect (e.g. cleanliness, comfort & facilities, food, internet, location, staff, value for money) using MonkeyLearn's Sample tab or an external crowdsourcing service like Mechanical Turk.
24 | 
25 | To crawl from Booking.com, use:
26 | ```sh
27 | scrapy crawl booking -o itemsBooking.csv
28 | ```
29 | 
30 | You first have to add the URL of a starting city to the spider.
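For reference, here is a minimal sketch of where that URL goes in ```hotel_sentiment/spiders/booking_spider.py``` (the search-results URL below is a hypothetical placeholder, not a working Booking.com URL):

```python
import scrapy

class BookingSpider(scrapy.Spider):
    name = "booking"
    start_urls = [
        # Paste the Booking.com search results URL for your city here.
        # The value below is only an illustrative placeholder.
        "http://www.booking.com/searchresults.html?city=YOUR_CITY_ID",
    ]
```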
To crawl reviews from a single hotel on Booking.com, use:
31 | 
32 | ```sh
33 | scrapy crawl booking_singlehotel -o <output_file>.csv
34 | ```
35 | 
36 | - ```opinionTokenizer.py``` is a simple script to obtain the "opinion units" from each review.
37 | - ```classify_and_plot_reviews.ipynb``` is a notebook that uses the generated models to classify new reviews and then plots the results using Plotly.
38 | 
39 | You can check out the generated machine learning aspect classifier [here](https://app.monkeylearn.com/categorizer/projects/cl_TKb7XmdG/tab/main-tab).
40 | 
41 | ### [Machine Learning over 1M hotel reviews finds interesting insights](https://blog.monkeylearn.com/machine-learning-1m-hotel-reviews-finds-interesting-insights/)
42 | 
43 | To crawl from TripAdvisor, use:
44 | ```sh
45 | scrapy crawl tripadvisor_more -a start_url="http://some_url" -o <output_file>.csv -s CLOSESPIDER_ITEMCOUNT=20000
46 | ```
47 | where ```start_url``` is the URL of a starting city to crawl from, such as https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html.
48 | 
49 | The scripts and notebooks necessary to replicate the post are in the ```classify_elastic``` folder:
50 | 
51 | - ```classify_elastic/generate_files_for_indexing.py``` takes the CSV file produced by Scrapy and generates the two files that the other scripts use.
52 | - ```classify_elastic/classify_pipe.py``` opens the ```opinion_units``` file, classifies it with MonkeyLearn by topic and sentiment, and saves the results to a new CSV file.
53 | - ```classify_elastic/index_definition.json``` contains the mapping definitions used in ElasticSearch.
54 | - ```classify_elastic/index_reviews.py``` indexes the reviews generated by ```generate_files_for_indexing.py``` into your ElasticSearch instance.
55 | - ```classify_elastic/index_opinion_units.py``` indexes the classified opinion units into your ElasticSearch instance.
56 | - ```classify_elastic/Extract keywords.ipynb``` shows how to extract keywords from the indexed data.
57 | 
58 | Finally, the ```queries``` folder contains some queries that were used to power the Kibana visualizations.
59 | 
--------------------------------------------------------------------------------
/classify_and_plot_reviews.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "
Take the file and separate all the reviews into opinion units
" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from opinionTokenizer import tokenize_into_opinion_units" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "import unicodecsv as csv\n", 30 | "filename = #the name of your csv here\n", 31 | "f = open(filename)\n", 32 | "#divide the data into opinion units:\n", 33 | "content = [tokenize_into_opinion_units(row[1]) for row in csv.reader(f)]\n", 34 | "f.seek(0)\n", 35 | "content.extend([tokenize_into_opinion_units(row[4]) for row in csv.reader(f)])" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "collapsed": true 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "opinion_units = []\n", 47 | "for row in content:\n", 48 | " for elem in row:\n", 49 | " opinion_units.append(elem)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "
Get the sentiment for each opinion unit
" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": false 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "from monkeylearn import MonkeyLearn" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": { 74 | "collapsed": false 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "api_key = #your api key here\n", 79 | "ml = MonkeyLearn(api_key)\n", 80 | "module_id = 'cl_rZ2P7hbs'\n", 81 | "res_sentiment = ml.classifiers.classify(module_id, opinion_units, sandbox=False)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "Check out the results" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "collapsed": false 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "res_sentiment.result" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "
Get the topic for each opinion unit
" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": { 113 | "collapsed": false 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "module_id = 'cl_TKb7XmdG'\n", 118 | "res_topic = ml.classifiers.classify(module_id, opinion_units, sandbox=False)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": { 125 | "collapsed": false 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "res_topic.result" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "
Process the results
" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "Define a dict for storing the results.\n", 144 | "For each entry in the dict, the first element is the number of good reviews, and the second is the number of bad ones." 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "results = {\n", 156 | " \"Cleanliness\":[0,0],\n", 157 | " \"Comfort & Facilities\":[0,0],\n", 158 | " \"Food\":[0,0],\n", 159 | " \"Internet\":[0,0],\n", 160 | " \"Location\":[0,0],\n", 161 | " \"Staff\":[0,0],\n", 162 | " \"Value for money\":[0,0]\n", 163 | " }" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "Then, combine the classified data into the dict:" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": { 177 | "collapsed": false 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "for i in range(len(opinion_units)):\n", 182 | " for topic_dict in res_topic.result[i]:\n", 183 | " if res_sentiment.result[i][0]['label'] == 'Good':\n", 184 | " results[topic_dict[0]['label']][0] += 1\n", 185 | " else:\n", 186 | " results[topic_dict[0]['label']][1] += 1\n", 187 | " " 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "Print the final results:" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": { 201 | "collapsed": false 202 | }, 203 | "outputs": [], 204 | "source": [ 205 | "results" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "Plot the final results using plotly:" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": { 219 | "collapsed": false 220 | }, 221 | "outputs": [], 222 | "source": [ 223 | "import plotly\n", 224 | "from plotly.graph_objs import Bar, Layout\n", 225 | "trace1 = Bar(\n", 226 | " x=['Cleanliness', 'Comfort & Facilities', 'Food', 'Internet', 'Location', 'Staff', 'Value for money'],\n", 227 | " y=[results['Cleanliness'][0], results['Comfort & Facilities'][0],results['Food'][0],results['Internet'][0],results['Location'][0],results['Staff'][0],results['Value for money'][0]],\n", 228 | " name='Positive',\n", 229 | " marker=dict(\n", 230 | " color = 'rgb(64,219,59)'\n", 231 | " )\n", 232 | ")\n", 233 | "trace2 = Bar(\n", 234 | " x=['Cleanliness', 'Comfort & Facilities', 'Food', 'Internet', 'Location', 'Staff', 'Value for money'],\n", 235 | " y=[results['Cleanliness'][1], results['Comfort & Facilities'][1],results['Food'][1],results['Internet'][1],results['Location'][1],results['Staff'][1],results['Value for money'][1]],\n", 236 | " name='Negative',\n", 237 | " marker=dict(\n", 238 | " color = 'rgb(235,54,72)'\n", 239 | " )\n", 240 | ")\n", 241 | "\n", 242 | "plotly.offline.plot({\n", 243 | "\"data\": [\n", 244 | " trace1, trace2\n", 245 | "],\n", 246 | "\"layout\": Layout(\n", 247 | " title=\"Reviews by topic\",\n", 248 | " barmode=\"group\"\n", 249 | ")\n", 250 | "})" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": { 257 | "collapsed": true 258 | }, 259 | "outputs": [], 260 | "source": [] 261 | } 262 | ], 263 | "metadata": { 264 | "kernelspec": { 265 | "display_name": "Python 2", 266 | "language": "python", 267 | "name": "python2" 268 | }, 269 | "language_info": { 
270 | "codemirror_mode": { 271 | "name": "ipython", 272 | "version": 2 273 | }, 274 | "file_extension": ".py", 275 | "mimetype": "text/x-python", 276 | "name": "python", 277 | "nbconvert_exporter": "python", 278 | "pygments_lexer": "ipython2", 279 | "version": "2.7.6" 280 | } 281 | }, 282 | "nbformat": 4, 283 | "nbformat_minor": 0 284 | } 285 | -------------------------------------------------------------------------------- /classify_elastic/Extract keywords.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from elasticsearch import Elasticsearch\n", 12 | "es = Elasticsearch(['http://localhost:9200'])" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "Use Elasticsearch to iterate over the reviews and find some opinion units that correspond to the city, topic, and sentiment" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 10, 25 | "metadata": { 26 | "collapsed": true 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "sample_size = 1000\n", 31 | "found = 0\n", 32 | "topic = \"Cleanliness\"\n", 33 | "sentiment = \"Bad\"\n", 34 | "city = \"New York\"\n", 35 | "opinion_units = es.search(index=\"index_hotels\", \n", 36 | " scroll=\"1m\", \n", 37 | " size=1000, \n", 38 | " doc_type=\"opinion_unit\", \n", 39 | " body={\"query\":{\"match_all\":{}}}\n", 40 | " )\n", 41 | "\n", 42 | "scroll_id = opinion_units[\"_scroll_id\"]\n", 43 | "hits = opinion_units[\"hits\"][\"hits\"]\n", 44 | "rows = []\n", 45 | "while found < sample_size:\n", 46 | " for item in hits:\n", 47 | " if topic in item[\"_source\"][\"topic\"] and sentiment in item[\"_source\"][\"sentiment\"]:\n", 48 | " review = es.get(index=\"index_hotels\", doc_type=\"review\", id=item[\"_parent\"])\n", 49 | " if city in review[\"_source\"][\"city\"]:\n", 50 | " rows.append(item[\"_source\"][\"content\"])\n", 51 | " found += 1\n", 52 | " if found == sample_size:\n", 53 | " break\n", 54 | " hits = es.scroll(scroll_id=scroll_id, scroll=\"1m\")[\"hits\"][\"hits\"]" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Combine the contents into a single text" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 11, 67 | "metadata": { 68 | "collapsed": true 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "text = \"\\n\".join(rows)" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "Extract keywords" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 12, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "from monkeylearn import MonkeyLearn\n", 91 | "\n", 92 | "ml = MonkeyLearn('')\n", 93 | "text_list = [text]\n", 94 | "module_id = 'ex_y7BPYzNG'\n", 95 | "res = ml.extractors.extract(module_id, text_list)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "Print the results!" 
103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 13, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "room 0.999\n", 117 | "bathroom 0.790\n", 118 | "carpet 0.407\n", 119 | "towels 0.311\n", 120 | "bed bugs 0.246\n", 121 | "bed 0.232\n", 122 | "hotel 0.196\n", 123 | "shower 0.155\n", 124 | "shared bathroom 0.150\n", 125 | "walls 0.138\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "for d in res.result[0]:\n", 131 | " print d[\"keyword\"], d[\"relevance\"]" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": { 138 | "collapsed": true 139 | }, 140 | "outputs": [], 141 | "source": [] 142 | } 143 | ], 144 | "metadata": { 145 | "kernelspec": { 146 | "display_name": "Python 2", 147 | "language": "python", 148 | "name": "python2" 149 | }, 150 | "language_info": { 151 | "codemirror_mode": { 152 | "name": "ipython", 153 | "version": 2 154 | }, 155 | "file_extension": ".py", 156 | "mimetype": "text/x-python", 157 | "name": "python", 158 | "nbconvert_exporter": "python", 159 | "pygments_lexer": "ipython2", 160 | "version": "2.7.6" 161 | } 162 | }, 163 | "nbformat": 4, 164 | "nbformat_minor": 0 165 | } 166 | -------------------------------------------------------------------------------- /classify_elastic/classify_pipe.py: -------------------------------------------------------------------------------- 1 | import unicodecsv as csv 2 | filename = "" 3 | f = open(filename) 4 | 5 | samples = [] 6 | count = 0 7 | # if the script broke for some reason and you have already classified part of your data 8 | # uncomment this code to skip the first to_skip items 9 | # to_skip = 10000 10 | # for row in csv.reader(f): 11 | # count += 1 12 | # if count == to_skip: 13 | # break 14 | 15 | from monkeylearn import MonkeyLearn 16 | ml = MonkeyLearn("") 17 | module_id = 'pi_YKStimMw' 18 | 19 | 20 | csvfile = open("classified_" + filename, 'ab') 21 | writer = csv.writer(csvfile,dialect='excel') 22 | 23 | chunk_count = 0 24 | chunk = [] 25 | for row in csv.reader(f): 26 | chunk.append(row) 27 | count+=1 28 | chunk_count+=1 29 | if chunk_count == 500: 30 | data = { 31 | "texts": [{"text": sample[1]} for sample in chunk] 32 | } 33 | res = ml.pipelines.run(module_id, data, sandbox=False) 34 | for i in range(len(chunk)): 35 | #single label classifier 36 | sentiment = res.result['tags'][i]['sentiment'][0]['label'] 37 | chunk[i].append("/" + sentiment) 38 | #probability! 
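            # Assumed shape of each res.result['tags'][i], inferred from the
            # accesses in this loop (values are hypothetical):
            #   {'sentiment': [{'label': 'Good', 'probability': 0.87}],
            #    'topic': [[{'label': 'Staff'}], [{'label': 'Food'}]]}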
39 | probability = res.result['tags'][i]['sentiment'][0]['probability'] 40 | chunk[i].append(probability) 41 | 42 | #multi label with only one level 43 | tags_topic_list = [] 44 | for tags_topic in res.result['tags'][i]['topic']: 45 | tags_topic_list.append("/" + tags_topic[0]['label']) 46 | 47 | chunk[i].append(":".join(tags_topic_list)) 48 | #not considering the probability because thats hard to save 49 | 50 | writer.writerows(chunk) 51 | chunk = [] 52 | chunk_count = 0 53 | 54 | print "wrote %d" %count 55 | 56 | csvfile.close() 57 | f.close() 58 | -------------------------------------------------------------------------------- /classify_elastic/generate_files_for_indexing.py: -------------------------------------------------------------------------------- 1 | import unicodecsv as csv 2 | import hashlib 3 | from opinionTokenizer import tokenize_into_opinion_units 4 | 5 | filename = "tripadvisor_bangkok.csv" 6 | f = open(filename) 7 | samples = [row for row in csv.reader(f)] 8 | samples[0].append("key") 9 | 10 | for i in range(1,len(samples)): 11 | key = hashlib.md5(samples[i][5].encode('utf-8')).hexdigest() 12 | samples[i].append(key) 13 | 14 | #write the reviews with the keys, this file will be used for indexing 15 | with open('keys_' + filename, 'wb') as csvfile: 16 | writer = csv.writer(csvfile, dialect='excel') 17 | writer.writerows(samples) 18 | 19 | 20 | opinion_units = [] 21 | for i in range(1,len(samples)): 22 | key = samples[i][12] 23 | for opinion_unit in tokenize_into_opinion_units(samples[i][4]) + tokenize_into_opinion_units(samples[i][5]): 24 | opinion_units.append([key, opinion_unit]) 25 | 26 | #write the opinion units with the key of the parent review 27 | with open('opinion_units_keys_' + filename, 'wb') as csvfile: 28 | writer = csv.writer(csvfile, dialect='excel') 29 | writer.writerows(opinion_units) 30 | -------------------------------------------------------------------------------- /classify_elastic/index_definition.json: -------------------------------------------------------------------------------- 1 | { 2 | "mappings": { 3 | "review" : { 4 | "properties": { 5 | "city": {"type": "string", "index" : "not_analyzed"}, 6 | "hotel_locality": {"type": "string"}, 7 | "reviewer_location": {"type": "string"}, 8 | "hotel_url": {"type": "string", "index" : "not_analyzed" }, 9 | "title": {"type": "string", "index" : "not_analyzed" }, 10 | "content": {"type": "string", "index" : "not_analyzed" }, 11 | "hotel_address": {"type": "string"}, 12 | "hotel_class": {"type": "float"}, 13 | "hotel_review_stars": {"type": "float" }, 14 | "hotel_review_qty": {"type": "integer"}, 15 | "review_stars": {"type": "float"}, 16 | "hotel_name": {"type": "string", "index" : "not_analyzed"} 17 | } 18 | }, 19 | "opinion_unit": { 20 | "properties": { 21 | "review_key": {"type": "string"}, 22 | "content": {"type": "string", "index" : "not_analyzed" }, 23 | "sentiment": {"type": "string"}, 24 | "topic": {"type": "string"}, 25 | "sent_probability": {"type": "float"} 26 | }, 27 | "_parent": { 28 | "type": "review" 29 | } 30 | } 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /classify_elastic/index_opinion_units.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import unicodecsv as csv 3 | import json 4 | 5 | from elasticsearch import Elasticsearch 6 | from elasticsearch import helpers 7 | 8 | #filename = "classified_opinion_units_keys_tripadvisor_bangkok.csv" 9 | #takes two arguments: 10 | # the 
name of the file to index 11 | # the starting index for the id 12 | # the ids shouldn't overlap or you will replace existing opinion units 13 | #IMPORTANT: before indexing opinion units you must index the parent reviews 14 | filename = sys.argv[1] 15 | cont_id = int(sys.argv[2]) 16 | f = open(filename) 17 | reference = ["review_key", 18 | "content", 19 | "sentiment", 20 | "sent_probability", 21 | "topic" 22 | ] 23 | 24 | es = Elasticsearch(['http://localhost:9200']) 25 | #index by chunk of 10000 items 26 | chunk_count = 0 27 | actions = [] 28 | 29 | for row in csv.reader(f): 30 | item = {} 31 | 32 | for i in range(len(reference)): 33 | item[reference[i]] = row[i] 34 | action = { 35 | "_index": "index_hotels_3", 36 | "_type": "opinion_unit", 37 | "_id": cont_id, 38 | "_parent": row[0], 39 | "_source": item 40 | } 41 | actions.append(action) 42 | 43 | chunk_count += 1 44 | 45 | if chunk_count == 10000: 46 | helpers.bulk(es, actions) 47 | chunk_count = 0 48 | actions = [] 49 | print "indexed %d" %cont_id 50 | 51 | cont_id += 1 52 | 53 | if chunk_count > 0: 54 | helpers.bulk(es, actions) 55 | print "leftovers" 56 | print "indexed %d" %cont_id 57 | -------------------------------------------------------------------------------- /classify_elastic/index_reviews.py: -------------------------------------------------------------------------------- 1 | import unicodecsv as csv 2 | import json 3 | import re 4 | 5 | from elasticsearch import Elasticsearch 6 | from elasticsearch import helpers 7 | 8 | #local instance of elasticsearch, you can change this 9 | es = Elasticsearch(['http://localhost:9200']) 10 | 11 | #index the reviews 12 | filename = "keys_tripadvisor_bangkok.csv" 13 | f = open(filename) 14 | reference = ["city", #0 15 | "hotel_locality", #1 16 | "reviewer_location", #2 17 | "hotel_url", #3 18 | "title", #4 19 | "content", #5 20 | "hotel_address", #6 21 | "hotel_class", #7 22 | "hotel_review_stars", #8 23 | "hotel_review_qty", #9 24 | "review_stars", #10 25 | "hotel_name", #11 26 | "key" #12 27 | ] 28 | 29 | chunk_count = 0 30 | actions = [] 31 | csv.reader(f).next() 32 | for row in csv.reader(f): 33 | 34 | item = {} 35 | for i in range(12): 36 | item[reference[i]] = row[i] 37 | 38 | if item["hotel_review_qty"] != "": 39 | item["hotel_review_qty"] = re.sub(",", "", item["hotel_review_qty"].split(" ")[0]) 40 | 41 | if item["hotel_class"] != "": 42 | item["hotel_class"] = item["hotel_class"].split(" ")[0] 43 | 44 | if item["hotel_review_stars"] != "": 45 | item["hotel_review_stars"] = item["hotel_review_stars"].split(" ")[0] 46 | 47 | if item["review_stars"] != "": 48 | item["review_stars"] = item["review_stars"].split(" ")[0] 49 | 50 | action = { 51 | "_index": "index_hotels_3", 52 | "_type": "review", 53 | "_id": row[12], 54 | "_source": item 55 | } 56 | actions.append(action) 57 | chunk_count += 1 58 | #use chunks of 10000 items 59 | if chunk_count == 10000: 60 | helpers.bulk(es, actions) 61 | chunk_count = 0 62 | actions = [] 63 | 64 | if chunk_count > 0: 65 | helpers.bulk(es, actions) 66 | print "leftovers" 67 | 68 | #index the classified opinion units 69 | -------------------------------------------------------------------------------- /classify_elastic/opinionTokenizer.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | import unicodecsv as csv 3 | 4 | #Given a string, returns a list with the opinion units it extracted 5 | #from the string 6 | def tokenize_into_opinion_units(text): 7 | output = [] 8 | for 
str in sent_tokenize(text): 9 | for output_str in str.split(' but '): 10 | output.append(output_str) 11 | return output 12 | 13 | #Take positive.csv and negative.csv and mix them into 14 | #positiveandnegative.csv 15 | #This has each unit tagged with its booking.com sentiment 16 | #This is the data I tagged with Mechanical Turk 17 | def positive_and_negative_to_full(): 18 | fpos = open('positive.csv') 19 | positive_units = [row for row in csv.reader(fpos)] 20 | fneg = open('negative.csv') 21 | negative_units = [row for row in csv.reader(fneg)] 22 | for item in positive_units: 23 | item.append('positive') 24 | for item in negative_units: 25 | item.append('negative') 26 | del negative_units[0] 27 | positive_units[0][0] = 'review_content' 28 | positive_units[0][1] = 'sentiment' 29 | full = positive_units 30 | full.extend(negative_units) 31 | with open('positiveandnegative.csv', 'wb') as csvfile: 32 | writer = csv.writer(csvfile, dialect='excel') 33 | writer.writerows(full) 34 | 35 | 36 | 37 | #this will open the review scraped data and write two files from that info: 38 | #positive.csv, containing positive opinion units 39 | #negative.csv, containing negative opinion units 40 | if __name__ == "__main__": 41 | #There are some problems with unicode 42 | #TODO take the file name as argument 43 | 44 | #positive content: 45 | f = open('scrapy/itemsBooking.csv') 46 | #divide the data into opinion units: 47 | positive = [tokenize_into_opinion_units(row[1]) for row in csv.reader(f)] 48 | positive_units = [] 49 | for row in positive: 50 | for elem in row: 51 | newrow = elem.split(' but ') 52 | for newelem in newrow: 53 | positive_units.append(newelem) 54 | #transform the elements into lists so I can use writerows 55 | positive_units = [[row] for row in positive_units] 56 | with open('positive.csv', 'wb') as csvfile: 57 | writer = csv.writer(csvfile, dialect='excel') 58 | writer.writerows(positive_units) 59 | 60 | #negative content: 61 | f.seek(0) 62 | negative = [tokenize_into_opinion_units(row[4]) for row in csv.reader(f)] 63 | negative_units = [] 64 | for row in negative: 65 | for elem in row: 66 | newrow = elem.split(' but ') 67 | for newelem in newrow: 68 | negative_units.append(newelem) 69 | negative_units = [[row] for row in negative_units] 70 | with open('negative.csv', 'wb') as csvfile: 71 | writer = csv.writer(csvfile, dialect='excel') 72 | writer.writerows(negative_units) 73 | -------------------------------------------------------------------------------- /classify_elastic/queries/Overall sentiment per city.json: -------------------------------------------------------------------------------- 1 | search: 2 | { 3 | "query": { 4 | "filtered": { 5 | "query": { 6 | "range": { 7 | "sent_probability": { 8 | "gt": "0.501" 9 | } 10 | } 11 | }, 12 | "filter": { 13 | "has_parent": { 14 | "type": "review", 15 | "query": { 16 | "match_all": {} 17 | } 18 | } 19 | } 20 | } 21 | } 22 | } 23 | 24 | Buckets: 25 | 26 | split bars: 27 | Aggregation: Terms 28 | Field: sentiment 29 | Order: Descending 30 | Size: 2 31 | 32 | X-Axis: 33 | Sub Aggregation: Filters 34 | One filter per city, example: 35 | { 36 | "query": { 37 | "filtered": { 38 | "query": { 39 | "match_all": {} 40 | }, 41 | "filter": { 42 | "has_parent": { 43 | "type": "review", 44 | "query": { 45 | "match": { 46 | "city": "New York City" 47 | } 48 | } 49 | } 50 | } 51 | } 52 | } 53 | } 54 | -------------------------------------------------------------------------------- /classify_elastic/queries/Overall sentiment per hotel class.json: 
-------------------------------------------------------------------------------- 1 | search: 2 | { 3 | "query": { 4 | "filtered": { 5 | "query": { 6 | "range": { 7 | "sent_probability": { 8 | "gt": "0.501" 9 | } 10 | } 11 | }, 12 | "filter": { 13 | "has_parent": { 14 | "type": "review", 15 | "query": { 16 | "match_all": {} 17 | } 18 | } 19 | } 20 | } 21 | } 22 | } 23 | 24 | Buckets: 25 | 26 | split bars: 27 | Aggregation: Terms 28 | Field: sentiment 29 | Order: Descending 30 | Size: 2 31 | 32 | X-Axis: 33 | Sub Aggregation: Filters 34 | One filter per hotel class, for example 3 star hotels: 35 | { 36 | "query": { 37 | "filtered": { 38 | "query": { 39 | "match_all": {} 40 | }, 41 | "filter": { 42 | "has_parent": { 43 | "type": "review", 44 | "query": { 45 | "range": { 46 | "hotel_class": { 47 | "gte": "3", 48 | "lt": "4" 49 | } 50 | } 51 | } 52 | } 53 | } 54 | } 55 | } 56 | } 57 | -------------------------------------------------------------------------------- /classify_elastic/queries/Topic sentiment per city.json: -------------------------------------------------------------------------------- 1 | search: 2 | { 3 | "query": { 4 | "filtered": { 5 | "query": { 6 | "range": { 7 | "sent_probability": { 8 | "gt": "0.501" 9 | } 10 | } 11 | }, 12 | "filter": { 13 | "has_parent": { 14 | "type": "review", 15 | "query": { 16 | "match_all": {} 17 | } 18 | } 19 | } 20 | } 21 | } 22 | } 23 | 24 | Buckets: 25 | 26 | split bars: 27 | Aggregation: Terms 28 | Field: sentiment 29 | Order: Descending 30 | Size: 2 31 | 32 | X-Axis: 33 | Sub Aggregation: Filters 34 | One filter per city, example: 35 | { 36 | "query": { 37 | "filtered": { 38 | "query": { 39 | "match_all": {} 40 | }, 41 | "filter": { 42 | "has_parent": { 43 | "type": "review", 44 | "query": { 45 | "match": { 46 | "city": "New York City" 47 | } 48 | } 49 | } 50 | } 51 | } 52 | } 53 | } 54 | 55 | Split chart: 56 | Sub Aggregation: Filters 57 | One filter per topic, using Lucene searches 58 | Example: 59 | topic:food 60 | -------------------------------------------------------------------------------- /csv_monkey_converter.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | # We use the Pandas library to read the contents of the scraped data 4 | # obtained by scrapy 5 | df = pd.read_csv('items.csv', encoding='utf-8') 6 | 7 | # Now we remove duplicate rows (reviews) 8 | df.drop_duplicates(inplace=True) 9 | 10 | # Drop the reviews with 3 stars, since we're doing Positive/Negative 11 | # sentiment analysis. 12 | df = df[df['stars'] != '3 of 5 stars'] 13 | 14 | # We want to use both the title and content of the review to 15 | # classify, so we merge them both into a new column. 16 | df['full_content'] = df['title'] + '. ' + df['content'] 17 | 18 | def get_class(stars): 19 | score = int(stars[0]) 20 | if score > 3: 21 | return 'Good' 22 | else: 23 | return 'Bad' 24 | 25 | # Transform the number of stars into Good and Bad tags. 
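# get_class reads the leading digit, so e.g. '5 of 5 stars' -> 'Good' and
# '2 of 5 stars' -> 'Bad' (3-star reviews were already dropped above).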
26 | df['true_category'] = df['stars'].apply(get_class) 27 | 28 | df = df[['full_content', 'true_category']] 29 | 30 | # Write the data into a CSV file 31 | df.to_csv('itemsHotel_MonkeyLearn2.csv', header=False, index=False, encoding='utf-8') 32 | -------------------------------------------------------------------------------- /hotel_sentiment/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/monkeylearn/hotel-review-analysis/1e838de727466af5637c3d10d4ebca287f3226b4/hotel_sentiment/__init__.py -------------------------------------------------------------------------------- /hotel_sentiment/items.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # http://doc.scrapy.org/en/latest/topics/items.html 7 | 8 | import scrapy 9 | 10 | 11 | class HotelSentimentItem(scrapy.Item): 12 | title = scrapy.Field() 13 | content = scrapy.Field() 14 | stars = scrapy.Field() 15 | 16 | class TripAdvisorReviewItem(scrapy.Item): 17 | title = scrapy.Field() 18 | content = scrapy.Field() 19 | review_stars = scrapy.Field() 20 | 21 | reviewer_id = scrapy.Field() 22 | reviewer_name = scrapy.Field() 23 | reviewer_level = scrapy.Field() 24 | reviewer_location = scrapy.Field() 25 | 26 | city = scrapy.Field() 27 | 28 | hotel_name = scrapy.Field() 29 | hotel_url = scrapy.Field() 30 | hotel_classs = scrapy.Field() 31 | hotel_address = scrapy.Field() 32 | hotel_locality = scrapy.Field() 33 | hotel_review_stars = scrapy.Field() 34 | hotel_review_qty = scrapy.Field() 35 | 36 | 37 | class BookingReviewItem(scrapy.Item): 38 | title = scrapy.Field() 39 | score = scrapy.Field() 40 | positive_content = scrapy.Field() 41 | negative_content = scrapy.Field() 42 | tags = scrapy.Field() 43 | -------------------------------------------------------------------------------- /hotel_sentiment/pipelines.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 | 8 | 9 | class HotelSentimentPipeline(object): 10 | def process_item(self, item, spider): 11 | return item 12 | -------------------------------------------------------------------------------- /hotel_sentiment/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for hotel_sentiment project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used. 
You can find more settings consulting the documentation: 7 | # 8 | # http://doc.scrapy.org/en/latest/topics/settings.html 9 | # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 10 | # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'hotel_sentiment' 13 | 14 | SPIDER_MODULES = ['hotel_sentiment.spiders'] 15 | NEWSPIDER_MODULE = 'hotel_sentiment.spiders' 16 | 17 | 18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 19 | #USER_AGENT = 'hotel_sentiment (+http://www.yourdomain.com)' 20 | 21 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 22 | #CONCURRENT_REQUESTS=32 23 | 24 | # Configure a delay for requests for the same website (default: 0) 25 | # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 26 | # See also autothrottle settings and docs 27 | #DOWNLOAD_DELAY=3 28 | # The download delay setting will honor only one of: 29 | #CONCURRENT_REQUESTS_PER_DOMAIN=16 30 | #CONCURRENT_REQUESTS_PER_IP=16 31 | 32 | # Disable cookies (enabled by default) 33 | #COOKIES_ENABLED=False 34 | 35 | # Disable Telnet Console (enabled by default) 36 | #TELNETCONSOLE_ENABLED=False 37 | 38 | # Override the default request headers: 39 | #DEFAULT_REQUEST_HEADERS = { 40 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 41 | # 'Accept-Language': 'en', 42 | #} 43 | 44 | # Enable or disable spider middlewares 45 | # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 46 | #SPIDER_MIDDLEWARES = { 47 | # 'hotel_sentiment.middlewares.MyCustomSpiderMiddleware': 543, 48 | #} 49 | 50 | # Enable or disable downloader middlewares 51 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 52 | #DOWNLOADER_MIDDLEWARES = { 53 | # 'hotel_sentiment.middlewares.MyCustomDownloaderMiddleware': 543, 54 | #} 55 | 56 | # Enable or disable extensions 57 | # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 58 | #EXTENSIONS = { 59 | # 'scrapy.telnet.TelnetConsole': None, 60 | #} 61 | 62 | # Configure item pipelines 63 | # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 64 | #ITEM_PIPELINES = { 65 | # 'hotel_sentiment.pipelines.SomePipeline': 300, 66 | #} 67 | 68 | # Enable and configure the AutoThrottle extension (disabled by default) 69 | # See http://doc.scrapy.org/en/latest/topics/autothrottle.html 70 | # NOTE: AutoThrottle will honour the standard settings for concurrency and delay 71 | #AUTOTHROTTLE_ENABLED=True 72 | # The initial download delay 73 | #AUTOTHROTTLE_START_DELAY=5 74 | # The maximum download delay to be set in case of high latencies 75 | #AUTOTHROTTLE_MAX_DELAY=60 76 | # Enable showing throttling stats for every response received: 77 | #AUTOTHROTTLE_DEBUG=False 78 | 79 | # Enable and configure HTTP caching (disabled by default) 80 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 81 | #HTTPCACHE_ENABLED=True 82 | #HTTPCACHE_EXPIRATION_SECS=0 83 | #HTTPCACHE_DIR='httpcache' 84 | #HTTPCACHE_IGNORE_HTTP_CODES=[] 85 | #HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage' 86 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the 
documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/booking_single_hotel_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from scrapy.loader import ItemLoader 3 | from hotel_sentiment.items import BookingReviewItem 4 | 5 | 6 | class BookingSpider(scrapy.Spider): 7 | name = "booking_singlehotel" 8 | start_urls = [ 9 | #http://www.booking.com/hotel/us/new-york-inn.html, 10 | #add your url here 11 | ] 12 | 13 | #get its reviews page 14 | def parse(self, response): 15 | reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href') 16 | url = response.urljoin(reviewsurl[0].extract()) 17 | self.pageNumber = 1 18 | return scrapy.Request(url, callback=self.parse_reviews) 19 | 20 | #and parse the reviews 21 | def parse_reviews(self, response): 22 | for rev in response.xpath('//li[starts-with(@class,"review_item")]'): 23 | item = BookingReviewItem() 24 | #sometimes the title is empty because of some reason, not sure when it happens but this works 25 | title = rev.xpath('.//a[@class="review_item_header_content"]/span[@itemprop="name"]/text()') 26 | if title: 27 | item['title'] = title[0].extract() 28 | positive_content = rev.xpath('.//p[@class="review_pos"]//span/text()') 29 | if positive_content: 30 | item['positive_content'] = positive_content[0].extract() 31 | negative_content = rev.xpath('.//p[@class="review_neg"]//span/text()') 32 | if negative_content: 33 | item['negative_content'] = negative_content[0].extract() 34 | item['score'] = rev.xpath('.//span[@itemprop="reviewRating"]/meta[@itemprop="ratingValue"]/@content')[0].extract() 35 | #tags are separated by ; 36 | item['tags'] = ";".join(rev.xpath('.//li[@class="review_info_tag"]/text()').extract()) 37 | yield item 38 | 39 | next_page = response.xpath('//a[@id="review_next_page_link"]/@href') 40 | if next_page: 41 | url = response.urljoin(next_page[0].extract()) 42 | yield scrapy.Request(url, self.parse_reviews) 43 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/booking_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from scrapy.loader import ItemLoader 3 | from hotel_sentiment.items import BookingReviewItem 4 | 5 | #crawl up to 6 pages of review per hotel 6 | max_pages_per_hotel = 6 7 | 8 | class BookingSpider(scrapy.Spider): 9 | name = "booking" 10 | start_urls = [ 11 | #"http://www.booking.com/searchresults.html?aid=357026&label=gog235jc-city-XX-us-newNyork-unspec-uy-com-L%3Axu-O%3AosSx-B%3Achrome-N%3Ayes-S%3Abo-U%3Ac&sid=b9f9f1f142a364f6c36f275cfe47ee55&dcid=4&city=20088325&class_interval=1&dtdisc=0&from_popular_filter=1&hlrd=0&hyb_red=0&inac=0&label_click=undef&nflt=di%3D929%3Bdistrict%3D929%3B&nha_red=0&postcard=0&redirected_from_city=0&redirected_from_landmark=0&redirected_from_region=0&review_score_group=empty&room1=A%2CA&sb_price_type=total&score_min=0&ss_all=0&ssb=empty&sshis=0&rows=15&tfl_cwh=1", 12 | #add your city url here 13 | ] 14 | 15 | pageNumber = 1 16 | 17 | #for every hotel 18 | def parse(self, response): 19 | for hotelurl in response.xpath('//a[@class="hotel_name_link url"]/@href'): 20 | url = response.urljoin(hotelurl.extract()) 21 | yield scrapy.Request(url, callback=self.parse_hotel) 22 | 23 | next_page = response.xpath('//a[starts-with(@class,"paging-next")]/@href') 24 | if next_page: 25 
| url = response.urljoin(next_page[0].extract()) 26 | yield scrapy.Request(url, self.parse) 27 | 28 | #get its reviews page 29 | def parse_hotel(self, response): 30 | reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href') 31 | url = response.urljoin(reviewsurl[0].extract()) 32 | self.pageNumber = 1 33 | return scrapy.Request(url, callback=self.parse_reviews) 34 | 35 | #and parse the reviews 36 | def parse_reviews(self, response): 37 | if self.pageNumber > max_pages_per_hotel: 38 | return 39 | for rev in response.xpath('//li[starts-with(@class,"review_item")]'): 40 | item = BookingReviewItem() 41 | #sometimes the title is empty because of some reason, not sure when it happens but this works 42 | title = rev.xpath('.//a[@class="review_item_header_content"]/span[@itemprop="name"]/text()') 43 | if title: 44 | item['title'] = title[0].extract() 45 | positive_content = rev.xpath('.//p[@class="review_pos"]//span/text()') 46 | if positive_content: 47 | item['positive_content'] = positive_content[0].extract() 48 | negative_content = rev.xpath('.//p[@class="review_neg"]//span/text()') 49 | if negative_content: 50 | item['negative_content'] = negative_content[0].extract() 51 | item['score'] = rev.xpath('.//span[@itemprop="reviewRating"]/meta[@itemprop="ratingValue"]/@content')[0].extract() 52 | #tags are separated by ; 53 | item['tags'] = ";".join(rev.xpath('.//li[@class="review_info_tag"]/text()').extract()) 54 | yield item 55 | 56 | next_page = response.xpath('//a[@id="review_next_page_link"]/@href') 57 | if next_page: 58 | self.pageNumber += 1 59 | url = response.urljoin(next_page[0].extract()) 60 | yield scrapy.Request(url, self.parse_reviews) 61 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/tripadvisor_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from hotel_sentiment.items import HotelSentimentItem 3 | import re 4 | 5 | # TODO use loaders 6 | 7 | 8 | class TripadvisorSpider(scrapy.Spider): 9 | name = "tripadvisor" 10 | start_urls = [ 11 | "https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html" 12 | ] 13 | 14 | def parse(self, response): 15 | for href in response.xpath('//a[@class="property_title"]/@href'): 16 | url = response.urljoin(href.extract()) 17 | yield scrapy.Request(url, callback=self.parse_hotel) 18 | 19 | # tripadvisor now has a weird pagination js thingie that doesn't even modify the url 20 | # if you feel like finding a solution please do 21 | 22 | # next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href') 23 | # if next_page: 24 | # url = response.urljoin(next_page[0].extract()) 25 | # yield scrapy.Request(url, self.parse) 26 | 27 | def parse_hotel(self, response): 28 | for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'): 29 | url = response.urljoin(href.extract()) 30 | yield scrapy.Request(url, callback=self.parse_review) 31 | 32 | # haha fuck you tripadvisor pagination I'm better than you 33 | url = response.url 34 | if not re.findall(r'or\d', url): 35 | next_page = re.sub(r'(-Reviews-)', r'\g<1>or5-', url) 36 | else: 37 | pagenum = int(re.findall(r'or(\d+)-', url)[0]) 38 | pagenum_next = pagenum + 5 39 | next_page = url.replace('or' + str(pagenum), 'or' + str(pagenum_next)) 40 | yield scrapy.Request( 41 | next_page, 42 | meta={'dont_redirect': True}, 43 | callback=self.parse_hotel 44 | ) 45 | 46 | def parse_review(self, response): 47 | 
item = HotelSentimentItem() 48 | item['title'] = response.xpath('//div[@class="quote"]/text()').extract()[0][1:-1] # strip the quotes 49 | item['content'] = response.xpath('//div[@class="entry"]/p/text()').extract()[0] 50 | item['stars'] = response.xpath('//span[starts-with(@class, "rating")]/span/@alt').extract()[0].replace('bubble', 'star') 51 | return item 52 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/tripadvisor_spider_moreinfo.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from hotel_sentiment.items import TripAdvisorReviewItem 3 | 4 | #TODO use loaders 5 | #to run this use scrapy crawl tripadvisor_more -a start_url="http://some_url" 6 | #for example, scrapy crawl tripadvisor_more -a start_url="https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html" -o tripadvisor_london.csv 7 | class TripadvisorSpiderMoreinfo(scrapy.Spider): 8 | name = "tripadvisor_more" 9 | 10 | def __init__(self, *args, **kwargs): 11 | super(TripadvisorSpiderMoreinfo, self).__init__(*args, **kwargs) 12 | self.start_urls = [kwargs.get('start_url')] 13 | 14 | def parse(self, response): 15 | for href in response.xpath('//div[@class="listing_title"]/a/@href'): 16 | url = response.urljoin(href.extract()) 17 | yield scrapy.Request(url, callback=self.parse_hotel) 18 | 19 | next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href') 20 | if next_page: 21 | url = response.urljoin(next_page[0].extract()) 22 | yield scrapy.Request(url, self.parse) 23 | 24 | def parse_hotel(self, response): 25 | for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'): 26 | url = response.urljoin(href.extract()) 27 | yield scrapy.Request(url, callback=self.parse_review) 28 | 29 | next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href') 30 | if next_page: 31 | url = response.urljoin(next_page[0].extract()) 32 | yield scrapy.Request(url, self.parse_hotel) 33 | 34 | 35 | #to get the full review content I open its page, because I don't get the full content on the main page 36 | #there's probably a better way to do it, requires investigation 37 | def parse_review(self, response): 38 | item = TripAdvisorReviewItem() 39 | item['title'] = response.xpath('//div[@class="quote"]/text()')[0].extract()[1:-1] #strip the quotes (first and last char) 40 | # Get all of the lines for just this review. 
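        # (the review body can span several <p> tags; each line is stripped
        # and the pieces are joined with newlines)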
41 | item['content'] = '\n'.join([line.strip() for line in response.xpath('(//div[@class="entry"])[1]//p/text()').extract()]) 42 | item['review_stars'] = response.xpath('//span[@class="rate sprite-rating_s rating_s"]/img/@alt').extract()[0] 43 | 44 | try: 45 | item['reviewer_id'] = response.xpath('//div[@class="memberOverlayLink"]/@id').extract()[0] 46 | item['reviewer_name'] = response.xpath('//div[contains(@class, "username")]/span/text()').extract()[0] 47 | item['reviewer_level'] = response.xpath('//div[contains(@class, "levelBadge")]/@class').extract()[0].split()[-1] 48 | item['reviewer_location'] = response.xpath('//div[@class="location"]/text()')[0].extract()[1:-1] 49 | except: 50 | # Not all reviews have a logged in reviewer 51 | pass 52 | 53 | item['city'] = response.xpath('//li[starts-with(@class,"breadcrumb_item")]/a/span/text()')[-3].extract() 54 | 55 | locationcontent = response.xpath('//div[starts-with(@class,"locationContent")]') 56 | item['hotel_name'] = locationcontent.xpath('.//div[starts-with(@class,"surContent")]/a/text()')[0].extract() 57 | item['hotel_url'] = response.urljoin(locationcontent.xpath('.//div[starts-with(@class,"surContent")]/a/@href')[0].extract()) 58 | 59 | hotelclass = locationcontent.xpath('.//span[starts-with(@class,"star")]/span/img/@alt') 60 | if hotelclass: 61 | item['hotel_classs'] = hotelclass[0].extract() 62 | 63 | hoteladdress = locationcontent.xpath('.//span[starts-with(@class,"street-address")]/text()') 64 | if hoteladdress: 65 | item['hotel_address'] = hoteladdress[0].extract() 66 | 67 | hotellocality = locationcontent.xpath('.//span[starts-with(@class,"locality")]/text()') 68 | if hotellocality: 69 | item['hotel_locality'] = hotellocality[0].extract() 70 | 71 | item['hotel_review_stars'] = locationcontent.xpath('.//div[starts-with(@class,"userRating")]/div/span/img/@alt')[0].extract() 72 | item['hotel_review_qty'] = locationcontent.xpath('.//div[starts-with(@class,"userRating")]/div/a/text()')[0].extract() 73 | 74 | return item 75 | -------------------------------------------------------------------------------- /opinionTokenizer.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | import unicodecsv as csv 3 | 4 | #Given a string, returns a list with the opinion units it extracted 5 | #from the string 6 | def tokenize_into_opinion_units(text): 7 | output = [] 8 | for str in sent_tokenize(text): 9 | for output_str in str.split(' but '): 10 | output.append(output_str) 11 | return output 12 | 13 | #Take positive.csv and negative.csv and mix them into 14 | #positiveandnegative.csv 15 | #This has each unit tagged with its booking.com sentiment 16 | #This is the data I tagged with Mechanical Turk 17 | def positive_and_negative_to_full(): 18 | fpos = open('positive.csv') 19 | positive_units = [row for row in csv.reader(fpos)] 20 | fneg = open('negative.csv') 21 | negative_units = [row for row in csv.reader(fneg)] 22 | for item in positive_units: 23 | item.append('positive') 24 | for item in negative_units: 25 | item.append('negative') 26 | del negative_units[0] 27 | positive_units[0][0] = 'review_content' 28 | positive_units[0][1] = 'sentiment' 29 | full = positive_units 30 | full.extend(negative_units) 31 | with open('positiveandnegative.csv', 'wb') as csvfile: 32 | writer = csv.writer(csvfile, dialect='excel') 33 | writer.writerows(full) 34 | 35 | 36 | 37 | #this will open the review scraped data and write two files from that info: 38 | #positive.csv, containing 
positive opinion units 39 | #negative.csv, containing negative opinion units 40 | if __name__ == "__main__": 41 | #There are some problems with unicode 42 | #TODO take the file name as argument 43 | 44 | #positive content: 45 | f = open('itemsBooking.csv') 46 | #divide the data into opinion units: 47 | positive = [tokenize_into_opinion_units(row[1]) for row in csv.reader(f)] 48 | positive_units = [] 49 | for row in positive: 50 | for elem in row: 51 | newrow = elem.split(' but ') 52 | for newelem in newrow: 53 | positive_units.append(newelem) 54 | #transform the elements into lists so I can use writerows 55 | positive_units = [[row] for row in positive_units] 56 | with open('positive.csv', 'wb') as csvfile: 57 | writer = csv.writer(csvfile, dialect='excel') 58 | writer.writerows(positive_units) 59 | 60 | #negative content: 61 | f.seek(0) 62 | negative = [tokenize_into_opinion_units(row[4]) for row in csv.reader(f)] 63 | negative_units = [] 64 | for row in negative: 65 | for elem in row: 66 | newrow = elem.split(' but ') 67 | for newelem in newrow: 68 | negative_units.append(newelem) 69 | negative_units = [[row] for row in negative_units] 70 | with open('negative.csv', 'wb') as csvfile: 71 | writer = csv.writer(csvfile, dialect='excel') 72 | writer.writerows(negative_units) 73 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html 5 | 6 | [settings] 7 | default = hotel_sentiment.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = hotel_sentiment 12 | --------------------------------------------------------------------------------