├── .gitignore
├── README.md
├── classify_and_plot_reviews.ipynb
├── classify_elastic
│   ├── Extract keywords.ipynb
│   ├── classify_pipe.py
│   ├── generate_files_for_indexing.py
│   ├── index_definition.json
│   ├── index_opinion_units.py
│   ├── index_reviews.py
│   ├── opinionTokenizer.py
│   └── queries
│       ├── Overall sentiment per city.json
│       ├── Overall sentiment per hotel class.json
│       └── Topic sentiment per city.json
├── csv_monkey_converter.py
├── hotel_sentiment
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── booking_single_hotel_spider.py
│       ├── booking_spider.py
│       ├── tripadvisor_spider.py
│       └── tripadvisor_spider_moreinfo.py
├── opinionTokenizer.py
└── scrapy.cfg

/.gitignore:
--------------------------------------------------------------------------------
1 | *.csv
2 | *.pyc
3 | *.ipynb_checkpoints
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Sentiment Analysis and Aspect Classification for Hotel Reviews
 2 | 
 3 | This is the source code for MonkeyLearn's series of posts on analyzing sentiment and aspects in hotel reviews using machine learning models. The code runs on Python 2.7.
 4 | 
 5 | (May 2018 update -- TripAdvisor and Booking.com have changed their sites greatly since these spiders were written, and as such, they no longer work. The blog posts and code are still useful as an example of how to build a Scrapy spider, but sadly, the examples themselves are no longer functional. We will probably fix the spiders in the future, since updating the selectors should be enough to get everything working again.)
 6 | 
 7 | ### Code organization
 8 | 
 9 | The project itself is a Scrapy project used to gather training and testing data from sites like TripAdvisor and Booking.com. In addition, a series of Python scripts and Jupyter notebooks implement the rest of the processing pipeline.
10 | 
11 | ### [Creating a sentiment analysis model with Scrapy and MonkeyLearn](https://blog.monkeylearn.com/creating-sentiment-analysis-model-with-scrapy-and-monkeylearn/)
12 | 
13 | The TripAdvisor spider (hotel_sentiment/spiders/tripadvisor_spider.py) is used to gather data to train a sentiment analysis classifier in MonkeyLearn. Review texts are used as the sample content, and review stars are used as the category (1 and 2 stars = Negative, 4 and 5 stars = Positive).
14 | 
15 | To crawl ~15,000 items from TripAdvisor, use:
16 | ```sh
17 | scrapy crawl tripadvisor -o itemsTripadvisor.csv -s CLOSESPIDER_ITEMCOUNT=15000
18 | ```
19 | You can check out the generated machine learning sentiment analysis model [here](https://app.monkeylearn.com/categorizer/projects/cl_rZ2P7hbs/tab/main-tab).
20 | 
21 | ### [Aspect Analysis from reviews using Machine Learning](https://blog.monkeylearn.com/aspect-analysis-from-reviews-using-machine-learning/)
22 | 
23 | The Booking spider (hotel_sentiment/spiders/booking_spider.py) is used to gather data to train an aspect classifier in MonkeyLearn. The data obtained with this spider can be manually tagged with each aspect (e.g. cleanliness, comfort & facilities, food, internet, location, staff, value for money) using MonkeyLearn's Sample tab or an external crowdsourcing service like Mechanical Turk.
24 | 
25 | To crawl from Booking.com, use:
26 | ```sh
27 | scrapy crawl booking -o itemsBooking.csv
28 | ```
29 | 
30 | You first have to add the URL of a starting city to the spider.
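For reference, here is a minimal sketch of where that URL goes in ```hotel_sentiment/spiders/booking_spider.py``` (the search-results URL below is a hypothetical placeholder, not a working Booking.com URL):

```python
import scrapy

class BookingSpider(scrapy.Spider):
    name = "booking"
    start_urls = [
        # Paste the Booking.com search results URL for your city here.
        # The value below is only an illustrative placeholder.
        "http://www.booking.com/searchresults.html?city=YOUR_CITY_ID",
    ]
```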
To crawl reviews from a single hotel on Booking.com, use:
31 | 
32 | ```sh
33 | scrapy crawl booking_singlehotel -o <output_file>.csv
34 | ```
35 | 
36 | - ```opinionTokenizer.py``` is a simple script to obtain the "opinion units" from each review.
37 | - ```classify_and_plot_reviews.ipynb``` is a notebook that uses the generated models to classify new reviews and then plots the results using Plotly.
38 | 
39 | You can check out the generated machine learning aspect classifier [here](https://app.monkeylearn.com/categorizer/projects/cl_TKb7XmdG/tab/main-tab).
40 | 
41 | ### [Machine Learning over 1M hotel reviews finds interesting insights](https://blog.monkeylearn.com/machine-learning-1m-hotel-reviews-finds-interesting-insights/)
42 | 
43 | To crawl from TripAdvisor, use:
44 | ```sh
45 | scrapy crawl tripadvisor_more -a start_url="http://some_url" -o <output_file>.csv -s CLOSESPIDER_ITEMCOUNT=20000
46 | ```
47 | where ```start_url``` is the URL of a starting city to crawl from, such as https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html.
48 | 
49 | The scripts and notebooks necessary to replicate the post are in the ```classify_elastic``` folder:
50 | 
51 | - ```classify_elastic/generate_files_for_indexing.py``` takes the CSV file produced by Scrapy and generates the two files that the other scripts use.
52 | - ```classify_elastic/classify_pipe.py``` opens the ```opinion_units``` file, classifies it with MonkeyLearn by topic and sentiment, and saves the results to a new CSV file.
53 | - ```classify_elastic/index_definition.json``` contains the mapping definitions used in ElasticSearch.
54 | - ```classify_elastic/index_reviews.py``` indexes the reviews generated by ```generate_files_for_indexing.py``` into your ElasticSearch instance.
55 | - ```classify_elastic/index_opinion_units.py``` indexes the classified opinion units into your ElasticSearch instance.
56 | - ```classify_elastic/Extract keywords.ipynb``` shows how to extract keywords from the indexed data.
57 | 
58 | Finally, the ```queries``` folder contains some queries that were used to power the Kibana visualizations.
59 | 
--------------------------------------------------------------------------------
/classify_and_plot_reviews.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "
Take the file and separate all the reviews into opinion units
" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from opinionTokenizer import tokenize_into_opinion_units" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "import unicodecsv as csv\n", 30 | "filename = #the name of your csv here\n", 31 | "f = open(filename)\n", 32 | "#divide the data into opinion units:\n", 33 | "content = [tokenize_into_opinion_units(row[1]) for row in csv.reader(f)]\n", 34 | "f.seek(0)\n", 35 | "content.extend([tokenize_into_opinion_units(row[4]) for row in csv.reader(f)])" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "collapsed": true 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "opinion_units = []\n", 47 | "for row in content:\n", 48 | " for elem in row:\n", 49 | " opinion_units.append(elem)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "
Get the sentiment for each opinion unit
" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "collapsed": false 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "from monkeylearn import MonkeyLearn" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": { 74 | "collapsed": false 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "api_key = #your api key here\n", 79 | "ml = MonkeyLearn(api_key)\n", 80 | "module_id = 'cl_rZ2P7hbs'\n", 81 | "res_sentiment = ml.classifiers.classify(module_id, opinion_units, sandbox=False)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "Check out the results" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "collapsed": false 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "res_sentiment.result" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "
Get the topic for each opinion unit
" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": { 113 | "collapsed": false 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "module_id = 'cl_TKb7XmdG'\n", 118 | "res_topic = ml.classifiers.classify(module_id, opinion_units, sandbox=False)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": { 125 | "collapsed": false 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "res_topic.result" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "
Process the results
" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "Define a dict for storing the results.\n", 144 | "For each entry in the dict, the first element is the number of good reviews, and the second is the number of bad ones." 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "results = {\n", 156 | " \"Cleanliness\":[0,0],\n", 157 | " \"Comfort & Facilities\":[0,0],\n", 158 | " \"Food\":[0,0],\n", 159 | " \"Internet\":[0,0],\n", 160 | " \"Location\":[0,0],\n", 161 | " \"Staff\":[0,0],\n", 162 | " \"Value for money\":[0,0]\n", 163 | " }" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "Then, combine the classified data into the dict:" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": { 177 | "collapsed": false 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "for i in range(len(opinion_units)):\n", 182 | " for topic_dict in res_topic.result[i]:\n", 183 | " if res_sentiment.result[i][0]['label'] == 'Good':\n", 184 | " results[topic_dict[0]['label']][0] += 1\n", 185 | " else:\n", 186 | " results[topic_dict[0]['label']][1] += 1\n", 187 | " " 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "Print the final results:" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": { 201 | "collapsed": false 202 | }, 203 | "outputs": [], 204 | "source": [ 205 | "results" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "Plot the final results using plotly:" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": { 219 | "collapsed": false 220 | }, 221 | "outputs": [], 222 | "source": [ 223 | "import plotly\n", 224 | "from plotly.graph_objs import Bar, Layout\n", 225 | "trace1 = Bar(\n", 226 | " x=['Cleanliness', 'Comfort & Facilities', 'Food', 'Internet', 'Location', 'Staff', 'Value for money'],\n", 227 | " y=[results['Cleanliness'][0], results['Comfort & Facilities'][0],results['Food'][0],results['Internet'][0],results['Location'][0],results['Staff'][0],results['Value for money'][0]],\n", 228 | " name='Positive',\n", 229 | " marker=dict(\n", 230 | " color = 'rgb(64,219,59)'\n", 231 | " )\n", 232 | ")\n", 233 | "trace2 = Bar(\n", 234 | " x=['Cleanliness', 'Comfort & Facilities', 'Food', 'Internet', 'Location', 'Staff', 'Value for money'],\n", 235 | " y=[results['Cleanliness'][1], results['Comfort & Facilities'][1],results['Food'][1],results['Internet'][1],results['Location'][1],results['Staff'][1],results['Value for money'][1]],\n", 236 | " name='Negative',\n", 237 | " marker=dict(\n", 238 | " color = 'rgb(235,54,72)'\n", 239 | " )\n", 240 | ")\n", 241 | "\n", 242 | "plotly.offline.plot({\n", 243 | "\"data\": [\n", 244 | " trace1, trace2\n", 245 | "],\n", 246 | "\"layout\": Layout(\n", 247 | " title=\"Reviews by topic\",\n", 248 | " barmode=\"group\"\n", 249 | ")\n", 250 | "})" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": { 257 | "collapsed": true 258 | }, 259 | "outputs": [], 260 | "source": [] 261 | } 262 | ], 263 | "metadata": { 264 | "kernelspec": { 265 | "display_name": "Python 2", 266 | "language": "python", 267 | "name": "python2" 268 | }, 269 | "language_info": { 
270 | "codemirror_mode": { 271 | "name": "ipython", 272 | "version": 2 273 | }, 274 | "file_extension": ".py", 275 | "mimetype": "text/x-python", 276 | "name": "python", 277 | "nbconvert_exporter": "python", 278 | "pygments_lexer": "ipython2", 279 | "version": "2.7.6" 280 | } 281 | }, 282 | "nbformat": 4, 283 | "nbformat_minor": 0 284 | } 285 | -------------------------------------------------------------------------------- /classify_elastic/Extract keywords.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from elasticsearch import Elasticsearch\n", 12 | "es = Elasticsearch(['http://localhost:9200'])" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "Use Elasticsearch to iterate over the reviews and find some opinion units that correspond to the city, topic, and sentiment" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 10, 25 | "metadata": { 26 | "collapsed": true 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "sample_size = 1000\n", 31 | "found = 0\n", 32 | "topic = \"Cleanliness\"\n", 33 | "sentiment = \"Bad\"\n", 34 | "city = \"New York\"\n", 35 | "opinion_units = es.search(index=\"index_hotels\", \n", 36 | " scroll=\"1m\", \n", 37 | " size=1000, \n", 38 | " doc_type=\"opinion_unit\", \n", 39 | " body={\"query\":{\"match_all\":{}}}\n", 40 | " )\n", 41 | "\n", 42 | "scroll_id = opinion_units[\"_scroll_id\"]\n", 43 | "hits = opinion_units[\"hits\"][\"hits\"]\n", 44 | "rows = []\n", 45 | "while found < sample_size:\n", 46 | " for item in hits:\n", 47 | " if topic in item[\"_source\"][\"topic\"] and sentiment in item[\"_source\"][\"sentiment\"]:\n", 48 | " review = es.get(index=\"index_hotels\", doc_type=\"review\", id=item[\"_parent\"])\n", 49 | " if city in review[\"_source\"][\"city\"]:\n", 50 | " rows.append(item[\"_source\"][\"content\"])\n", 51 | " found += 1\n", 52 | " if found == sample_size:\n", 53 | " break\n", 54 | " hits = es.scroll(scroll_id=scroll_id, scroll=\"1m\")[\"hits\"][\"hits\"]" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Combine the contents into a single text" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 11, 67 | "metadata": { 68 | "collapsed": true 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "text = \"\\n\".join(rows)" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "Extract keywords" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 12, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "from monkeylearn import MonkeyLearn\n", 91 | "\n", 92 | "ml = MonkeyLearn('')\n", 93 | "text_list = [text]\n", 94 | "module_id = 'ex_y7BPYzNG'\n", 95 | "res = ml.extractors.extract(module_id, text_list)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "Print the results!" 
103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 13, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "room 0.999\n", 117 | "bathroom 0.790\n", 118 | "carpet 0.407\n", 119 | "towels 0.311\n", 120 | "bed bugs 0.246\n", 121 | "bed 0.232\n", 122 | "hotel 0.196\n", 123 | "shower 0.155\n", 124 | "shared bathroom 0.150\n", 125 | "walls 0.138\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "for d in res.result[0]:\n", 131 | " print d[\"keyword\"], d[\"relevance\"]" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": { 138 | "collapsed": true 139 | }, 140 | "outputs": [], 141 | "source": [] 142 | } 143 | ], 144 | "metadata": { 145 | "kernelspec": { 146 | "display_name": "Python 2", 147 | "language": "python", 148 | "name": "python2" 149 | }, 150 | "language_info": { 151 | "codemirror_mode": { 152 | "name": "ipython", 153 | "version": 2 154 | }, 155 | "file_extension": ".py", 156 | "mimetype": "text/x-python", 157 | "name": "python", 158 | "nbconvert_exporter": "python", 159 | "pygments_lexer": "ipython2", 160 | "version": "2.7.6" 161 | } 162 | }, 163 | "nbformat": 4, 164 | "nbformat_minor": 0 165 | } 166 | -------------------------------------------------------------------------------- /classify_elastic/classify_pipe.py: -------------------------------------------------------------------------------- 1 | import unicodecsv as csv 2 | filename = "" 3 | f = open(filename) 4 | 5 | samples = [] 6 | count = 0 7 | # if the script broke for some reason and you have already classified part of your data 8 | # uncomment this code to skip the first to_skip items 9 | # to_skip = 10000 10 | # for row in csv.reader(f): 11 | # count += 1 12 | # if count == to_skip: 13 | # break 14 | 15 | from monkeylearn import MonkeyLearn 16 | ml = MonkeyLearn("") 17 | module_id = 'pi_YKStimMw' 18 | 19 | 20 | csvfile = open("classified_" + filename, 'ab') 21 | writer = csv.writer(csvfile,dialect='excel') 22 | 23 | chunk_count = 0 24 | chunk = [] 25 | for row in csv.reader(f): 26 | chunk.append(row) 27 | count+=1 28 | chunk_count+=1 29 | if chunk_count == 500: 30 | data = { 31 | "texts": [{"text": sample[1]} for sample in chunk] 32 | } 33 | res = ml.pipelines.run(module_id, data, sandbox=False) 34 | for i in range(len(chunk)): 35 | #single label classifier 36 | sentiment = res.result['tags'][i]['sentiment'][0]['label'] 37 | chunk[i].append("/" + sentiment) 38 | #probability! 
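            # Assumed shape of each res.result['tags'][i], inferred from the
            # accesses in this loop (values are hypothetical):
            #   {'sentiment': [{'label': 'Good', 'probability': 0.87}],
            #    'topic': [[{'label': 'Staff'}], [{'label': 'Food'}]]}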
39 | probability = res.result['tags'][i]['sentiment'][0]['probability'] 40 | chunk[i].append(probability) 41 | 42 | #multi label with only one level 43 | tags_topic_list = [] 44 | for tags_topic in res.result['tags'][i]['topic']: 45 | tags_topic_list.append("/" + tags_topic[0]['label']) 46 | 47 | chunk[i].append(":".join(tags_topic_list)) 48 | #not considering the probability because thats hard to save 49 | 50 | writer.writerows(chunk) 51 | chunk = [] 52 | chunk_count = 0 53 | 54 | print "wrote %d" %count 55 | 56 | csvfile.close() 57 | f.close() 58 | -------------------------------------------------------------------------------- /classify_elastic/generate_files_for_indexing.py: -------------------------------------------------------------------------------- 1 | import unicodecsv as csv 2 | import hashlib 3 | from opinionTokenizer import tokenize_into_opinion_units 4 | 5 | filename = "tripadvisor_bangkok.csv" 6 | f = open(filename) 7 | samples = [row for row in csv.reader(f)] 8 | samples[0].append("key") 9 | 10 | for i in range(1,len(samples)): 11 | key = hashlib.md5(samples[i][5].encode('utf-8')).hexdigest() 12 | samples[i].append(key) 13 | 14 | #write the reviews with the keys, this file will be used for indexing 15 | with open('keys_' + filename, 'wb') as csvfile: 16 | writer = csv.writer(csvfile, dialect='excel') 17 | writer.writerows(samples) 18 | 19 | 20 | opinion_units = [] 21 | for i in range(1,len(samples)): 22 | key = samples[i][12] 23 | for opinion_unit in tokenize_into_opinion_units(samples[i][4]) + tokenize_into_opinion_units(samples[i][5]): 24 | opinion_units.append([key, opinion_unit]) 25 | 26 | #write the opinion units with the key of the parent review 27 | with open('opinion_units_keys_' + filename, 'wb') as csvfile: 28 | writer = csv.writer(csvfile, dialect='excel') 29 | writer.writerows(opinion_units) 30 | -------------------------------------------------------------------------------- /classify_elastic/index_definition.json: -------------------------------------------------------------------------------- 1 | { 2 | "mappings": { 3 | "review" : { 4 | "properties": { 5 | "city": {"type": "string", "index" : "not_analyzed"}, 6 | "hotel_locality": {"type": "string"}, 7 | "reviewer_location": {"type": "string"}, 8 | "hotel_url": {"type": "string", "index" : "not_analyzed" }, 9 | "title": {"type": "string", "index" : "not_analyzed" }, 10 | "content": {"type": "string", "index" : "not_analyzed" }, 11 | "hotel_address": {"type": "string"}, 12 | "hotel_class": {"type": "float"}, 13 | "hotel_review_stars": {"type": "float" }, 14 | "hotel_review_qty": {"type": "integer"}, 15 | "review_stars": {"type": "float"}, 16 | "hotel_name": {"type": "string", "index" : "not_analyzed"} 17 | } 18 | }, 19 | "opinion_unit": { 20 | "properties": { 21 | "review_key": {"type": "string"}, 22 | "content": {"type": "string", "index" : "not_analyzed" }, 23 | "sentiment": {"type": "string"}, 24 | "topic": {"type": "string"}, 25 | "sent_probability": {"type": "float"} 26 | }, 27 | "_parent": { 28 | "type": "review" 29 | } 30 | } 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /classify_elastic/index_opinion_units.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import unicodecsv as csv 3 | import json 4 | 5 | from elasticsearch import Elasticsearch 6 | from elasticsearch import helpers 7 | 8 | #filename = "classified_opinion_units_keys_tripadvisor_bangkok.csv" 9 | #takes two arguments: 10 | # the 
name of the file to index 11 | # the starting index for the id 12 | # the ids shouldn't overlap or you will replace existing opinion units 13 | #IMPORTANT: before indexing opinion units you must index the parent reviews 14 | filename = sys.argv[1] 15 | cont_id = int(sys.argv[2]) 16 | f = open(filename) 17 | reference = ["review_key", 18 | "content", 19 | "sentiment", 20 | "sent_probability", 21 | "topic" 22 | ] 23 | 24 | es = Elasticsearch(['http://localhost:9200']) 25 | #index by chunk of 10000 items 26 | chunk_count = 0 27 | actions = [] 28 | 29 | for row in csv.reader(f): 30 | item = {} 31 | 32 | for i in range(len(reference)): 33 | item[reference[i]] = row[i] 34 | action = { 35 | "_index": "index_hotels_3", 36 | "_type": "opinion_unit", 37 | "_id": cont_id, 38 | "_parent": row[0], 39 | "_source": item 40 | } 41 | actions.append(action) 42 | 43 | chunk_count += 1 44 | 45 | if chunk_count == 10000: 46 | helpers.bulk(es, actions) 47 | chunk_count = 0 48 | actions = [] 49 | print "indexed %d" %cont_id 50 | 51 | cont_id += 1 52 | 53 | if chunk_count > 0: 54 | helpers.bulk(es, actions) 55 | print "leftovers" 56 | print "indexed %d" %cont_id 57 | -------------------------------------------------------------------------------- /classify_elastic/index_reviews.py: -------------------------------------------------------------------------------- 1 | import unicodecsv as csv 2 | import json 3 | import re 4 | 5 | from elasticsearch import Elasticsearch 6 | from elasticsearch import helpers 7 | 8 | #local instance of elasticsearch, you can change this 9 | es = Elasticsearch(['http://localhost:9200']) 10 | 11 | #index the reviews 12 | filename = "keys_tripadvisor_bangkok.csv" 13 | f = open(filename) 14 | reference = ["city", #0 15 | "hotel_locality", #1 16 | "reviewer_location", #2 17 | "hotel_url", #3 18 | "title", #4 19 | "content", #5 20 | "hotel_address", #6 21 | "hotel_class", #7 22 | "hotel_review_stars", #8 23 | "hotel_review_qty", #9 24 | "review_stars", #10 25 | "hotel_name", #11 26 | "key" #12 27 | ] 28 | 29 | chunk_count = 0 30 | actions = [] 31 | csv.reader(f).next() 32 | for row in csv.reader(f): 33 | 34 | item = {} 35 | for i in range(12): 36 | item[reference[i]] = row[i] 37 | 38 | if item["hotel_review_qty"] != "": 39 | item["hotel_review_qty"] = re.sub(",", "", item["hotel_review_qty"].split(" ")[0]) 40 | 41 | if item["hotel_class"] != "": 42 | item["hotel_class"] = item["hotel_class"].split(" ")[0] 43 | 44 | if item["hotel_review_stars"] != "": 45 | item["hotel_review_stars"] = item["hotel_review_stars"].split(" ")[0] 46 | 47 | if item["review_stars"] != "": 48 | item["review_stars"] = item["review_stars"].split(" ")[0] 49 | 50 | action = { 51 | "_index": "index_hotels_3", 52 | "_type": "review", 53 | "_id": row[12], 54 | "_source": item 55 | } 56 | actions.append(action) 57 | chunk_count += 1 58 | #use chunks of 10000 items 59 | if chunk_count == 10000: 60 | helpers.bulk(es, actions) 61 | chunk_count = 0 62 | actions = [] 63 | 64 | if chunk_count > 0: 65 | helpers.bulk(es, actions) 66 | print "leftovers" 67 | 68 | #index the classified opinion units 69 | -------------------------------------------------------------------------------- /classify_elastic/opinionTokenizer.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | import unicodecsv as csv 3 | 4 | #Given a string, returns a list with the opinion units it extracted 5 | #from the string 6 | def tokenize_into_opinion_units(text): 7 | output = [] 8 | for 
str in sent_tokenize(text): 9 | for output_str in str.split(' but '): 10 | output.append(output_str) 11 | return output 12 | 13 | #Take positive.csv and negative.csv and mix them into 14 | #positiveandnegative.csv 15 | #This has each unit tagged with its booking.com sentiment 16 | #This is the data I tagged with Mechanical Turk 17 | def positive_and_negative_to_full(): 18 | fpos = open('positive.csv') 19 | positive_units = [row for row in csv.reader(fpos)] 20 | fneg = open('negative.csv') 21 | negative_units = [row for row in csv.reader(fneg)] 22 | for item in positive_units: 23 | item.append('positive') 24 | for item in negative_units: 25 | item.append('negative') 26 | del negative_units[0] 27 | positive_units[0][0] = 'review_content' 28 | positive_units[0][1] = 'sentiment' 29 | full = positive_units 30 | full.extend(negative_units) 31 | with open('positiveandnegative.csv', 'wb') as csvfile: 32 | writer = csv.writer(csvfile, dialect='excel') 33 | writer.writerows(full) 34 | 35 | 36 | 37 | #this will open the review scraped data and write two files from that info: 38 | #positive.csv, containing positive opinion units 39 | #negative.csv, containing negative opinion units 40 | if __name__ == "__main__": 41 | #There are some problems with unicode 42 | #TODO take the file name as argument 43 | 44 | #positive content: 45 | f = open('scrapy/itemsBooking.csv') 46 | #divide the data into opinion units: 47 | positive = [tokenize_into_opinion_units(row[1]) for row in csv.reader(f)] 48 | positive_units = [] 49 | for row in positive: 50 | for elem in row: 51 | newrow = elem.split(' but ') 52 | for newelem in newrow: 53 | positive_units.append(newelem) 54 | #transform the elements into lists so I can use writerows 55 | positive_units = [[row] for row in positive_units] 56 | with open('positive.csv', 'wb') as csvfile: 57 | writer = csv.writer(csvfile, dialect='excel') 58 | writer.writerows(positive_units) 59 | 60 | #negative content: 61 | f.seek(0) 62 | negative = [tokenize_into_opinion_units(row[4]) for row in csv.reader(f)] 63 | negative_units = [] 64 | for row in negative: 65 | for elem in row: 66 | newrow = elem.split(' but ') 67 | for newelem in newrow: 68 | negative_units.append(newelem) 69 | negative_units = [[row] for row in negative_units] 70 | with open('negative.csv', 'wb') as csvfile: 71 | writer = csv.writer(csvfile, dialect='excel') 72 | writer.writerows(negative_units) 73 | -------------------------------------------------------------------------------- /classify_elastic/queries/Overall sentiment per city.json: -------------------------------------------------------------------------------- 1 | search: 2 | { 3 | "query": { 4 | "filtered": { 5 | "query": { 6 | "range": { 7 | "sent_probability": { 8 | "gt": "0.501" 9 | } 10 | } 11 | }, 12 | "filter": { 13 | "has_parent": { 14 | "type": "review", 15 | "query": { 16 | "match_all": {} 17 | } 18 | } 19 | } 20 | } 21 | } 22 | } 23 | 24 | Buckets: 25 | 26 | split bars: 27 | Aggregation: Terms 28 | Field: sentiment 29 | Order: Descending 30 | Size: 2 31 | 32 | X-Axis: 33 | Sub Aggregation: Filters 34 | One filter per city, example: 35 | { 36 | "query": { 37 | "filtered": { 38 | "query": { 39 | "match_all": {} 40 | }, 41 | "filter": { 42 | "has_parent": { 43 | "type": "review", 44 | "query": { 45 | "match": { 46 | "city": "New York City" 47 | } 48 | } 49 | } 50 | } 51 | } 52 | } 53 | } 54 | -------------------------------------------------------------------------------- /classify_elastic/queries/Overall sentiment per hotel class.json: 
-------------------------------------------------------------------------------- 1 | search: 2 | { 3 | "query": { 4 | "filtered": { 5 | "query": { 6 | "range": { 7 | "sent_probability": { 8 | "gt": "0.501" 9 | } 10 | } 11 | }, 12 | "filter": { 13 | "has_parent": { 14 | "type": "review", 15 | "query": { 16 | "match_all": {} 17 | } 18 | } 19 | } 20 | } 21 | } 22 | } 23 | 24 | Buckets: 25 | 26 | split bars: 27 | Aggregation: Terms 28 | Field: sentiment 29 | Order: Descending 30 | Size: 2 31 | 32 | X-Axis: 33 | Sub Aggregation: Filters 34 | One filter per hotel class, for example 3 star hotels: 35 | { 36 | "query": { 37 | "filtered": { 38 | "query": { 39 | "match_all": {} 40 | }, 41 | "filter": { 42 | "has_parent": { 43 | "type": "review", 44 | "query": { 45 | "range": { 46 | "hotel_class": { 47 | "gte": "3", 48 | "lt": "4" 49 | } 50 | } 51 | } 52 | } 53 | } 54 | } 55 | } 56 | } 57 | -------------------------------------------------------------------------------- /classify_elastic/queries/Topic sentiment per city.json: -------------------------------------------------------------------------------- 1 | search: 2 | { 3 | "query": { 4 | "filtered": { 5 | "query": { 6 | "range": { 7 | "sent_probability": { 8 | "gt": "0.501" 9 | } 10 | } 11 | }, 12 | "filter": { 13 | "has_parent": { 14 | "type": "review", 15 | "query": { 16 | "match_all": {} 17 | } 18 | } 19 | } 20 | } 21 | } 22 | } 23 | 24 | Buckets: 25 | 26 | split bars: 27 | Aggregation: Terms 28 | Field: sentiment 29 | Order: Descending 30 | Size: 2 31 | 32 | X-Axis: 33 | Sub Aggregation: Filters 34 | One filter per city, example: 35 | { 36 | "query": { 37 | "filtered": { 38 | "query": { 39 | "match_all": {} 40 | }, 41 | "filter": { 42 | "has_parent": { 43 | "type": "review", 44 | "query": { 45 | "match": { 46 | "city": "New York City" 47 | } 48 | } 49 | } 50 | } 51 | } 52 | } 53 | } 54 | 55 | Split chart: 56 | Sub Aggregation: Filters 57 | One filter per topic, using Lucene searches 58 | Example: 59 | topic:food 60 | -------------------------------------------------------------------------------- /csv_monkey_converter.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | # We use the Pandas library to read the contents of the scraped data 4 | # obtained by scrapy 5 | df = pd.read_csv('items.csv', encoding='utf-8') 6 | 7 | # Now we remove duplicate rows (reviews) 8 | df.drop_duplicates(inplace=True) 9 | 10 | # Drop the reviews with 3 stars, since we're doing Positive/Negative 11 | # sentiment analysis. 12 | df = df[df['stars'] != '3 of 5 stars'] 13 | 14 | # We want to use both the title and content of the review to 15 | # classify, so we merge them both into a new column. 16 | df['full_content'] = df['title'] + '. ' + df['content'] 17 | 18 | def get_class(stars): 19 | score = int(stars[0]) 20 | if score > 3: 21 | return 'Good' 22 | else: 23 | return 'Bad' 24 | 25 | # Transform the number of stars into Good and Bad tags. 
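# get_class reads the leading digit, so e.g. '5 of 5 stars' -> 'Good' and
# '2 of 5 stars' -> 'Bad' (3-star reviews were already dropped above).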
26 | df['true_category'] = df['stars'].apply(get_class) 27 | 28 | df = df[['full_content', 'true_category']] 29 | 30 | # Write the data into a CSV file 31 | df.to_csv('itemsHotel_MonkeyLearn2.csv', header=False, index=False, encoding='utf-8') 32 | -------------------------------------------------------------------------------- /hotel_sentiment/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/monkeylearn/hotel-review-analysis/1e838de727466af5637c3d10d4ebca287f3226b4/hotel_sentiment/__init__.py -------------------------------------------------------------------------------- /hotel_sentiment/items.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # http://doc.scrapy.org/en/latest/topics/items.html 7 | 8 | import scrapy 9 | 10 | 11 | class HotelSentimentItem(scrapy.Item): 12 | title = scrapy.Field() 13 | content = scrapy.Field() 14 | stars = scrapy.Field() 15 | 16 | class TripAdvisorReviewItem(scrapy.Item): 17 | title = scrapy.Field() 18 | content = scrapy.Field() 19 | review_stars = scrapy.Field() 20 | 21 | reviewer_id = scrapy.Field() 22 | reviewer_name = scrapy.Field() 23 | reviewer_level = scrapy.Field() 24 | reviewer_location = scrapy.Field() 25 | 26 | city = scrapy.Field() 27 | 28 | hotel_name = scrapy.Field() 29 | hotel_url = scrapy.Field() 30 | hotel_classs = scrapy.Field() 31 | hotel_address = scrapy.Field() 32 | hotel_locality = scrapy.Field() 33 | hotel_review_stars = scrapy.Field() 34 | hotel_review_qty = scrapy.Field() 35 | 36 | 37 | class BookingReviewItem(scrapy.Item): 38 | title = scrapy.Field() 39 | score = scrapy.Field() 40 | positive_content = scrapy.Field() 41 | negative_content = scrapy.Field() 42 | tags = scrapy.Field() 43 | -------------------------------------------------------------------------------- /hotel_sentiment/pipelines.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 | 8 | 9 | class HotelSentimentPipeline(object): 10 | def process_item(self, item, spider): 11 | return item 12 | -------------------------------------------------------------------------------- /hotel_sentiment/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for hotel_sentiment project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used. 
You can find more settings consulting the documentation: 7 | # 8 | # http://doc.scrapy.org/en/latest/topics/settings.html 9 | # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 10 | # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'hotel_sentiment' 13 | 14 | SPIDER_MODULES = ['hotel_sentiment.spiders'] 15 | NEWSPIDER_MODULE = 'hotel_sentiment.spiders' 16 | 17 | 18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 19 | #USER_AGENT = 'hotel_sentiment (+http://www.yourdomain.com)' 20 | 21 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 22 | #CONCURRENT_REQUESTS=32 23 | 24 | # Configure a delay for requests for the same website (default: 0) 25 | # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 26 | # See also autothrottle settings and docs 27 | #DOWNLOAD_DELAY=3 28 | # The download delay setting will honor only one of: 29 | #CONCURRENT_REQUESTS_PER_DOMAIN=16 30 | #CONCURRENT_REQUESTS_PER_IP=16 31 | 32 | # Disable cookies (enabled by default) 33 | #COOKIES_ENABLED=False 34 | 35 | # Disable Telnet Console (enabled by default) 36 | #TELNETCONSOLE_ENABLED=False 37 | 38 | # Override the default request headers: 39 | #DEFAULT_REQUEST_HEADERS = { 40 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 41 | # 'Accept-Language': 'en', 42 | #} 43 | 44 | # Enable or disable spider middlewares 45 | # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 46 | #SPIDER_MIDDLEWARES = { 47 | # 'hotel_sentiment.middlewares.MyCustomSpiderMiddleware': 543, 48 | #} 49 | 50 | # Enable or disable downloader middlewares 51 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 52 | #DOWNLOADER_MIDDLEWARES = { 53 | # 'hotel_sentiment.middlewares.MyCustomDownloaderMiddleware': 543, 54 | #} 55 | 56 | # Enable or disable extensions 57 | # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 58 | #EXTENSIONS = { 59 | # 'scrapy.telnet.TelnetConsole': None, 60 | #} 61 | 62 | # Configure item pipelines 63 | # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 64 | #ITEM_PIPELINES = { 65 | # 'hotel_sentiment.pipelines.SomePipeline': 300, 66 | #} 67 | 68 | # Enable and configure the AutoThrottle extension (disabled by default) 69 | # See http://doc.scrapy.org/en/latest/topics/autothrottle.html 70 | # NOTE: AutoThrottle will honour the standard settings for concurrency and delay 71 | #AUTOTHROTTLE_ENABLED=True 72 | # The initial download delay 73 | #AUTOTHROTTLE_START_DELAY=5 74 | # The maximum download delay to be set in case of high latencies 75 | #AUTOTHROTTLE_MAX_DELAY=60 76 | # Enable showing throttling stats for every response received: 77 | #AUTOTHROTTLE_DEBUG=False 78 | 79 | # Enable and configure HTTP caching (disabled by default) 80 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 81 | #HTTPCACHE_ENABLED=True 82 | #HTTPCACHE_EXPIRATION_SECS=0 83 | #HTTPCACHE_DIR='httpcache' 84 | #HTTPCACHE_IGNORE_HTTP_CODES=[] 85 | #HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage' 86 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the 
documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/booking_single_hotel_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from scrapy.loader import ItemLoader 3 | from hotel_sentiment.items import BookingReviewItem 4 | 5 | 6 | class BookingSpider(scrapy.Spider): 7 | name = "booking_singlehotel" 8 | start_urls = [ 9 | #http://www.booking.com/hotel/us/new-york-inn.html, 10 | #add your url here 11 | ] 12 | 13 | #get its reviews page 14 | def parse(self, response): 15 | reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href') 16 | url = response.urljoin(reviewsurl[0].extract()) 17 | self.pageNumber = 1 18 | return scrapy.Request(url, callback=self.parse_reviews) 19 | 20 | #and parse the reviews 21 | def parse_reviews(self, response): 22 | for rev in response.xpath('//li[starts-with(@class,"review_item")]'): 23 | item = BookingReviewItem() 24 | #sometimes the title is empty because of some reason, not sure when it happens but this works 25 | title = rev.xpath('.//a[@class="review_item_header_content"]/span[@itemprop="name"]/text()') 26 | if title: 27 | item['title'] = title[0].extract() 28 | positive_content = rev.xpath('.//p[@class="review_pos"]//span/text()') 29 | if positive_content: 30 | item['positive_content'] = positive_content[0].extract() 31 | negative_content = rev.xpath('.//p[@class="review_neg"]//span/text()') 32 | if negative_content: 33 | item['negative_content'] = negative_content[0].extract() 34 | item['score'] = rev.xpath('.//span[@itemprop="reviewRating"]/meta[@itemprop="ratingValue"]/@content')[0].extract() 35 | #tags are separated by ; 36 | item['tags'] = ";".join(rev.xpath('.//li[@class="review_info_tag"]/text()').extract()) 37 | yield item 38 | 39 | next_page = response.xpath('//a[@id="review_next_page_link"]/@href') 40 | if next_page: 41 | url = response.urljoin(next_page[0].extract()) 42 | yield scrapy.Request(url, self.parse_reviews) 43 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/booking_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from scrapy.loader import ItemLoader 3 | from hotel_sentiment.items import BookingReviewItem 4 | 5 | #crawl up to 6 pages of review per hotel 6 | max_pages_per_hotel = 6 7 | 8 | class BookingSpider(scrapy.Spider): 9 | name = "booking" 10 | start_urls = [ 11 | #"http://www.booking.com/searchresults.html?aid=357026&label=gog235jc-city-XX-us-newNyork-unspec-uy-com-L%3Axu-O%3AosSx-B%3Achrome-N%3Ayes-S%3Abo-U%3Ac&sid=b9f9f1f142a364f6c36f275cfe47ee55&dcid=4&city=20088325&class_interval=1&dtdisc=0&from_popular_filter=1&hlrd=0&hyb_red=0&inac=0&label_click=undef&nflt=di%3D929%3Bdistrict%3D929%3B&nha_red=0&postcard=0&redirected_from_city=0&redirected_from_landmark=0&redirected_from_region=0&review_score_group=empty&room1=A%2CA&sb_price_type=total&score_min=0&ss_all=0&ssb=empty&sshis=0&rows=15&tfl_cwh=1", 12 | #add your city url here 13 | ] 14 | 15 | pageNumber = 1 16 | 17 | #for every hotel 18 | def parse(self, response): 19 | for hotelurl in response.xpath('//a[@class="hotel_name_link url"]/@href'): 20 | url = response.urljoin(hotelurl.extract()) 21 | yield scrapy.Request(url, callback=self.parse_hotel) 22 | 23 | next_page = response.xpath('//a[starts-with(@class,"paging-next")]/@href') 24 | if next_page: 25 
| url = response.urljoin(next_page[0].extract()) 26 | yield scrapy.Request(url, self.parse) 27 | 28 | #get its reviews page 29 | def parse_hotel(self, response): 30 | reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href') 31 | url = response.urljoin(reviewsurl[0].extract()) 32 | self.pageNumber = 1 33 | return scrapy.Request(url, callback=self.parse_reviews) 34 | 35 | #and parse the reviews 36 | def parse_reviews(self, response): 37 | if self.pageNumber > max_pages_per_hotel: 38 | return 39 | for rev in response.xpath('//li[starts-with(@class,"review_item")]'): 40 | item = BookingReviewItem() 41 | #sometimes the title is empty because of some reason, not sure when it happens but this works 42 | title = rev.xpath('.//a[@class="review_item_header_content"]/span[@itemprop="name"]/text()') 43 | if title: 44 | item['title'] = title[0].extract() 45 | positive_content = rev.xpath('.//p[@class="review_pos"]//span/text()') 46 | if positive_content: 47 | item['positive_content'] = positive_content[0].extract() 48 | negative_content = rev.xpath('.//p[@class="review_neg"]//span/text()') 49 | if negative_content: 50 | item['negative_content'] = negative_content[0].extract() 51 | item['score'] = rev.xpath('.//span[@itemprop="reviewRating"]/meta[@itemprop="ratingValue"]/@content')[0].extract() 52 | #tags are separated by ; 53 | item['tags'] = ";".join(rev.xpath('.//li[@class="review_info_tag"]/text()').extract()) 54 | yield item 55 | 56 | next_page = response.xpath('//a[@id="review_next_page_link"]/@href') 57 | if next_page: 58 | self.pageNumber += 1 59 | url = response.urljoin(next_page[0].extract()) 60 | yield scrapy.Request(url, self.parse_reviews) 61 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/tripadvisor_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from hotel_sentiment.items import HotelSentimentItem 3 | import re 4 | 5 | # TODO use loaders 6 | 7 | 8 | class TripadvisorSpider(scrapy.Spider): 9 | name = "tripadvisor" 10 | start_urls = [ 11 | "https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html" 12 | ] 13 | 14 | def parse(self, response): 15 | for href in response.xpath('//a[@class="property_title"]/@href'): 16 | url = response.urljoin(href.extract()) 17 | yield scrapy.Request(url, callback=self.parse_hotel) 18 | 19 | # tripadvisor now has a weird pagination js thingie that doesn't even modify the url 20 | # if you feel like finding a solution please do 21 | 22 | # next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href') 23 | # if next_page: 24 | # url = response.urljoin(next_page[0].extract()) 25 | # yield scrapy.Request(url, self.parse) 26 | 27 | def parse_hotel(self, response): 28 | for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'): 29 | url = response.urljoin(href.extract()) 30 | yield scrapy.Request(url, callback=self.parse_review) 31 | 32 | # haha fuck you tripadvisor pagination I'm better than you 33 | url = response.url 34 | if not re.findall(r'or\d', url): 35 | next_page = re.sub(r'(-Reviews-)', r'\g<1>or5-', url) 36 | else: 37 | pagenum = int(re.findall(r'or(\d+)-', url)[0]) 38 | pagenum_next = pagenum + 5 39 | next_page = url.replace('or' + str(pagenum), 'or' + str(pagenum_next)) 40 | yield scrapy.Request( 41 | next_page, 42 | meta={'dont_redirect': True}, 43 | callback=self.parse_hotel 44 | ) 45 | 46 | def parse_review(self, response): 47 | 
item = HotelSentimentItem() 48 | item['title'] = response.xpath('//div[@class="quote"]/text()').extract()[0][1:-1] # strip the quotes 49 | item['content'] = response.xpath('//div[@class="entry"]/p/text()').extract()[0] 50 | item['stars'] = response.xpath('//span[starts-with(@class, "rating")]/span/@alt').extract()[0].replace('bubble', 'star') 51 | return item 52 | -------------------------------------------------------------------------------- /hotel_sentiment/spiders/tripadvisor_spider_moreinfo.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from hotel_sentiment.items import TripAdvisorReviewItem 3 | 4 | #TODO use loaders 5 | #to run this use scrapy crawl tripadvisor_more -a start_url="http://some_url" 6 | #for example, scrapy crawl tripadvisor_more -a start_url="https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html" -o tripadvisor_london.csv 7 | class TripadvisorSpiderMoreinfo(scrapy.Spider): 8 | name = "tripadvisor_more" 9 | 10 | def __init__(self, *args, **kwargs): 11 | super(TripadvisorSpiderMoreinfo, self).__init__(*args, **kwargs) 12 | self.start_urls = [kwargs.get('start_url')] 13 | 14 | def parse(self, response): 15 | for href in response.xpath('//div[@class="listing_title"]/a/@href'): 16 | url = response.urljoin(href.extract()) 17 | yield scrapy.Request(url, callback=self.parse_hotel) 18 | 19 | next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href') 20 | if next_page: 21 | url = response.urljoin(next_page[0].extract()) 22 | yield scrapy.Request(url, self.parse) 23 | 24 | def parse_hotel(self, response): 25 | for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'): 26 | url = response.urljoin(href.extract()) 27 | yield scrapy.Request(url, callback=self.parse_review) 28 | 29 | next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href') 30 | if next_page: 31 | url = response.urljoin(next_page[0].extract()) 32 | yield scrapy.Request(url, self.parse_hotel) 33 | 34 | 35 | #to get the full review content I open its page, because I don't get the full content on the main page 36 | #there's probably a better way to do it, requires investigation 37 | def parse_review(self, response): 38 | item = TripAdvisorReviewItem() 39 | item['title'] = response.xpath('//div[@class="quote"]/text()')[0].extract()[1:-1] #strip the quotes (first and last char) 40 | # Get all of the lines for just this review. 
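        # (the review body can span several <p> tags; each line is stripped
        # and the pieces are joined with newlines)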
41 | item['content'] = '\n'.join([line.strip() for line in response.xpath('(//div[@class="entry"])[1]//p/text()').extract()]) 42 | item['review_stars'] = response.xpath('//span[@class="rate sprite-rating_s rating_s"]/img/@alt').extract()[0] 43 | 44 | try: 45 | item['reviewer_id'] = response.xpath('//div[@class="memberOverlayLink"]/@id').extract()[0] 46 | item['reviewer_name'] = response.xpath('//div[contains(@class, "username")]/span/text()').extract()[0] 47 | item['reviewer_level'] = response.xpath('//div[contains(@class, "levelBadge")]/@class').extract()[0].split()[-1] 48 | item['reviewer_location'] = response.xpath('//div[@class="location"]/text()')[0].extract()[1:-1] 49 | except: 50 | # Not all reviews have a logged in reviewer 51 | pass 52 | 53 | item['city'] = response.xpath('//li[starts-with(@class,"breadcrumb_item")]/a/span/text()')[-3].extract() 54 | 55 | locationcontent = response.xpath('//div[starts-with(@class,"locationContent")]') 56 | item['hotel_name'] = locationcontent.xpath('.//div[starts-with(@class,"surContent")]/a/text()')[0].extract() 57 | item['hotel_url'] = response.urljoin(locationcontent.xpath('.//div[starts-with(@class,"surContent")]/a/@href')[0].extract()) 58 | 59 | hotelclass = locationcontent.xpath('.//span[starts-with(@class,"star")]/span/img/@alt') 60 | if hotelclass: 61 | item['hotel_classs'] = hotelclass[0].extract() 62 | 63 | hoteladdress = locationcontent.xpath('.//span[starts-with(@class,"street-address")]/text()') 64 | if hoteladdress: 65 | item['hotel_address'] = hoteladdress[0].extract() 66 | 67 | hotellocality = locationcontent.xpath('.//span[starts-with(@class,"locality")]/text()') 68 | if hotellocality: 69 | item['hotel_locality'] = hotellocality[0].extract() 70 | 71 | item['hotel_review_stars'] = locationcontent.xpath('.//div[starts-with(@class,"userRating")]/div/span/img/@alt')[0].extract() 72 | item['hotel_review_qty'] = locationcontent.xpath('.//div[starts-with(@class,"userRating")]/div/a/text()')[0].extract() 73 | 74 | return item 75 | -------------------------------------------------------------------------------- /opinionTokenizer.py: -------------------------------------------------------------------------------- 1 | from nltk.tokenize import sent_tokenize 2 | import unicodecsv as csv 3 | 4 | #Given a string, returns a list with the opinion units it extracted 5 | #from the string 6 | def tokenize_into_opinion_units(text): 7 | output = [] 8 | for str in sent_tokenize(text): 9 | for output_str in str.split(' but '): 10 | output.append(output_str) 11 | return output 12 | 13 | #Take positive.csv and negative.csv and mix them into 14 | #positiveandnegative.csv 15 | #This has each unit tagged with its booking.com sentiment 16 | #This is the data I tagged with Mechanical Turk 17 | def positive_and_negative_to_full(): 18 | fpos = open('positive.csv') 19 | positive_units = [row for row in csv.reader(fpos)] 20 | fneg = open('negative.csv') 21 | negative_units = [row for row in csv.reader(fneg)] 22 | for item in positive_units: 23 | item.append('positive') 24 | for item in negative_units: 25 | item.append('negative') 26 | del negative_units[0] 27 | positive_units[0][0] = 'review_content' 28 | positive_units[0][1] = 'sentiment' 29 | full = positive_units 30 | full.extend(negative_units) 31 | with open('positiveandnegative.csv', 'wb') as csvfile: 32 | writer = csv.writer(csvfile, dialect='excel') 33 | writer.writerows(full) 34 | 35 | 36 | 37 | #this will open the review scraped data and write two files from that info: 38 | #positive.csv, containing 
positive opinion units 39 | #negative.csv, containing negative opinion units 40 | if __name__ == "__main__": 41 | #There are some problems with unicode 42 | #TODO take the file name as argument 43 | 44 | #positive content: 45 | f = open('itemsBooking.csv') 46 | #divide the data into opinion units: 47 | positive = [tokenize_into_opinion_units(row[1]) for row in csv.reader(f)] 48 | positive_units = [] 49 | for row in positive: 50 | for elem in row: 51 | newrow = elem.split(' but ') 52 | for newelem in newrow: 53 | positive_units.append(newelem) 54 | #transform the elements into lists so I can use writerows 55 | positive_units = [[row] for row in positive_units] 56 | with open('positive.csv', 'wb') as csvfile: 57 | writer = csv.writer(csvfile, dialect='excel') 58 | writer.writerows(positive_units) 59 | 60 | #negative content: 61 | f.seek(0) 62 | negative = [tokenize_into_opinion_units(row[4]) for row in csv.reader(f)] 63 | negative_units = [] 64 | for row in negative: 65 | for elem in row: 66 | newrow = elem.split(' but ') 67 | for newelem in newrow: 68 | negative_units.append(newelem) 69 | negative_units = [[row] for row in negative_units] 70 | with open('negative.csv', 'wb') as csvfile: 71 | writer = csv.writer(csvfile, dialect='excel') 72 | writer.writerows(negative_units) 73 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html 5 | 6 | [settings] 7 | default = hotel_sentiment.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = hotel_sentiment 12 | --------------------------------------------------------------------------------