├── Analysis
│   ├── Analysis_1.ipynb
│   ├── Analysis_2.ipynb
│   ├── Analysis_3.ipynb
│   ├── Analysis_4.ipynb
│   └── Analysis_5.ipynb
├── README.md
└── Readme.md
/Analysis/Analysis_5.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# ANALYSIS 5 : RECOMMENDER SYSTEM FOR BRAND RUBIE'S COSTUME CO. \n",
8 | "- A collaborative filtering algorithm is used to generate the recommendations.\n",
9 | "- The recommender system takes a 'Product Name' and, based on the correlation factor, outputs a list of recommended products.\n",
10 | "- For example, if product 'A' is the input parameter, i.e. a user buys product 'A', the system will output the products most highly correlated with it.\n",
11 | "-----------------------------------------"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "Importing all the required Libraries"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": true
26 | },
27 | "outputs": [],
28 | "source": [
29 | "import glob\n",
30 | "import json\n",
31 | "import pandas as pd\n",
32 | "import numpy as np\n",
33 | "import warnings\n",
34 | "warnings.filterwarnings(\"ignore\")"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "Collecting the paths of the input dataset files, one of which is the review file 'ReviewSample.json', used for building the Recommender System."
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 5,
47 | "metadata": {
48 | "collapsed": false
49 | },
50 | "outputs": [],
51 | "source": [
52 | "files=glob.glob('../Data/Tested_Data/*')"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "Cleaning of 'ProductSample.json' file and importing the data as pandas DataFrame."
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 7,
65 | "metadata": {
66 | "collapsed": true
67 | },
68 | "outputs": [],
69 | "source": [
70 | "# Reading multiple JSON objects from the single file 'ProductSample.json' (one object per line).\n",
71 | "product=[]\n",
72 | "with open(files[0]) as data_file:\n",
73 | " data=data_file.read()\n",
74 | " for i in data.split('\\n'):\n",
75 | " product.append(i)\n",
76 | " \n",
77 | "# First clean each line into proper JSON format by some replacements, and\n",
78 | "# then build a list of tuples containing the data of brand Rubie's Costume Co.\n",
79 | "productDataframe=[]\n",
80 | "brand_List=[\"Rubie's Costume Co\"]\n",
81 | "for x in product:\n",
82 | " try:\n",
83 | " y=x.replace(\"'\",'\"')\n",
84 | " jdata=json.loads(y)\n",
85 | " if jdata['brand'] in brand_List:\n",
86 | " productDataframe.append((jdata['asin'],jdata['title'])) \n",
87 | " except (ValueError, KeyError):\n",
88 | " pass\n",
89 | " \n",
90 | "# Creating a dataframe using the list of tuples obtained in the previous step. \n",
91 | "Product_dataset=pd.DataFrame(productDataframe,columns=['Asin','Title'])"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "Getting all the distinct product Asins as a list."
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": 8,
104 | "metadata": {
105 | "collapsed": true
106 | },
107 | "outputs": [],
108 | "source": [
109 | "prod_List=Product_dataset.Asin.unique().tolist()"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "Cleaning of 'ReviewSample.json' file and importing the data as pandas DataFrame."
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": 9,
122 | "metadata": {
123 | "collapsed": false
124 | },
125 | "outputs": [],
126 | "source": [
127 | "# Reading multiple JSON objects from the single file 'ReviewSample.json' (one object per line).\n",
128 | "review=[]\n",
129 | "with open(files[1]) as data_file:\n",
130 | " data=data_file.read()\n",
131 | " for i in data.split('\\n'):\n",
132 | " review.append(i)\n",
133 | " \n",
134 | "# Making a list of tuples containing the data of all the JSON objects.\n",
135 | "reviewDataframe=[]\n",
136 | "for x in review:\n",
137 | " try:\n",
138 | " jdata=json.loads(x)\n",
139 | " reviewDataframe.append((jdata['reviewerID'],jdata['asin'],jdata['overall'])) \n",
140 | " except (ValueError, KeyError):\n",
141 | " pass\n",
142 | " \n",
143 | "# Creating a dataframe using the list of tuples obtained in the previous step. \n",
144 | "review_dataset=pd.DataFrame(reviewDataframe,columns=['Reviewer_ID','Asin','Rating']) "
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "Creating a DataFrame 'Working_dataset' which has products only from brand \"RUBIE'S COSTUME CO.\""
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 10,
157 | "metadata": {
158 | "collapsed": false
159 | },
160 | "outputs": [],
161 | "source": [
162 | "Working_dataset=review_dataset[review_dataset.Asin.isin(prod_List)]"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "Performing a merge of 'Working_dataset' and 'Product_dataset' to get all the required details together for building the Recommender system."
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 11,
175 | "metadata": {
176 | "collapsed": true
177 | },
178 | "outputs": [],
179 | "source": [
180 | "Working_dataset=pd.merge(Working_dataset,Product_dataset,on='Asin',how='inner')"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "- Taking only the required columns and creating a pivot table with index as 'Reviewer_ID' , columns as 'Title' and values as 'Rating'.\n",
188 | "- 'Model' is a pivot table."
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 13,
194 | "metadata": {
195 | "collapsed": true
196 | },
197 | "outputs": [],
198 | "source": [
199 | "Working_dataset=Working_dataset[['Reviewer_ID','Asin','Title','Rating']]\n",
200 | "Model = Working_dataset.pivot_table(index='Reviewer_ID',columns='Title',values='Rating')"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {
206 | "collapsed": false
207 | },
208 | "source": [
209 | "- Function to compute the Pearson correlation between two columns, i.e. products.\n",
210 | "- Produces a result between -1 and 1.\n",
211 | "- This function is used within the recommender function 'get_recommendations()'."
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 14,
217 | "metadata": {
218 | "collapsed": true
219 | },
220 | "outputs": [],
221 | "source": [
222 | "def pearson(x1,x2):\n",
223 | " x1_cor=x1-x1.mean()\n",
224 | " x2_cor=x2-x2.mean()\n",
225 | " return np.sum(x1_cor * x2_cor)/np.sqrt(np.sum(x1_cor**2) * np.sum(x2_cor**2))"
226 | ]
227 | },
228 | {
229 | "cell_type": "markdown",
230 | "metadata": {},
231 | "source": [
232 | "- Function to recommend products based on the correlation between them.\n",
233 | "- Takes 3 parameters: 'Product Name', 'Model' and 'Number of Recommendations'.\n",
234 | "- Returns a list in descending order of correlation; the list size depends on the 'Number of Recommendations' input."
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 15,
240 | "metadata": {
241 | "collapsed": true
242 | },
243 | "outputs": [],
244 | "source": [
245 | "def get_recommendations(product_id,M,num):\n",
246 | " recomend=[]\n",
247 | " for asin in M.columns:\n",
248 | " if asin==product_id:\n",
249 | " continue\n",
250 | " cor=pearson(M[product_id],M[asin])\n",
251 | " if np.isnan(cor):\n",
252 | " continue\n",
253 | " else:\n",
254 | " recomend.append((asin,cor))\n",
255 | " recomend.sort(key=lambda tup: tup[1],reverse=True)\n",
256 | " return recomend[:num]"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "Function to replace HTML escape entities in product titles with their plain-text characters."
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": 17,
269 | "metadata": {
270 | "collapsed": false
271 | },
272 | "outputs": [],
273 | "source": [
274 | "def escape(t):\n",
275 | "    \"\"\"Replace HTML escape entities in `t` with plain characters.\"\"\"\n",
276 | " return (t\n",
277 | "        .replace(\"&amp;\",\"&\").replace(\"&lt;\",\"<\").replace(\"&gt;\",\">\")\n",
278 | "        .replace(\"&#39;\",\"'\").replace(\"&quot;\",'\"')\n",
279 | " )"
280 | ]
281 | },
282 | {
283 | "cell_type": "markdown",
284 | "metadata": {},
285 | "source": [
286 | "- Calling the recommender System by making a function call to 'get_recommendations()'.\n",
287 | "- '300 Movie Spartan Shield' is the product name passed to the function, i.e. if a person buys '300 Movie Spartan Shield', what else can be recommended to him/her.\n",
288 | "- 'Model' is passed for correlation calculation.\n",
289 | "- '5' is the maximum number of recommendations the function can return if there is some correlation.\n",
290 | "- The strength of each recommendation can be quantified using the correlation value given in the output."
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": 19,
296 | "metadata": {
297 | "collapsed": false,
298 | "scrolled": true
299 | },
300 | "outputs": [],
301 | "source": [
302 | "rec=get_recommendations('300 Movie Spartan Shield',Model,5)"
303 | ]
304 | },
305 | {
306 | "cell_type": "markdown",
307 | "metadata": {},
308 | "source": [
309 | "- Taking the recommendations into a DataFrame for tabular representation.\n",
310 | "- Taking only those values whose correlation is greater than 0."
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": 20,
316 | "metadata": {
317 | "collapsed": false
318 | },
319 | "outputs": [],
320 | "source": [
321 | "Recommendation=[]\n",
322 | "for x in rec:\n",
323 | " if x[1] > 0:\n",
324 | " Recommendation.append((escape(x[0]),x[1]))\n",
325 | "\n",
326 | "result=pd.DataFrame(Recommendation,columns=['Product Title','Correlation ']) "
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "Writing all the recommendations to a .csv file."
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": 22,
339 | "metadata": {
340 | "collapsed": false
341 | },
342 | "outputs": [],
343 | "source": [
344 | "result.to_csv('../Analysis/Analysis_5/Recommendation.csv')"
345 | ]
346 | },
347 | {
348 | "cell_type": "markdown",
349 | "metadata": {},
350 | "source": [
351 | "Displaying rows of resultant DataFrame"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": 23,
357 | "metadata": {
358 | "collapsed": false
359 | },
360 | "outputs": [
361 | {
362 | "data": {
388 | "text/plain": [
389 | " Product Title Correlation \n",
390 | "0 300 Movie Spartan Deluxe Vinyl Helmet 0.223277\n",
391 | "1 Toynk Toys - 300- Spartan Sword 0.069275"
392 | ]
393 | },
394 | "execution_count": 23,
395 | "metadata": {},
396 | "output_type": "execute_result"
397 | }
398 | ],
399 | "source": [
400 | "result"
401 | ]
402 | }
403 | ],
404 | "metadata": {
405 | "anaconda-cloud": {},
406 | "kernelspec": {
407 | "display_name": "Python [conda root]",
408 | "language": "python",
409 | "name": "conda-root-py"
410 | },
411 | "language_info": {
412 | "codemirror_mode": {
413 | "name": "ipython",
414 | "version": 3
415 | },
416 | "file_extension": ".py",
417 | "mimetype": "text/x-python",
418 | "name": "python",
419 | "nbconvert_exporter": "python",
420 | "pygments_lexer": "ipython3",
421 | "version": "3.5.2"
422 | }
423 | },
424 | "nbformat": 4,
425 | "nbformat_minor": 1
426 | }
427 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | --------------
5 | # AMAZON PRODUCT AND REVIEW DATA
6 | ------------------------------------------------------------------------------------------------------------------------------
7 |
8 | ## DATA SET DESCRIPTION:-
9 | - This dataset contains product reviews and metadata of 'Clothing, Shoes and Jewelry' category from Amazon, including 2.5 million reviews spanning May 1996 - July 2014.
10 | - This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links.
11 | - [DATA SET LINK](https://drive.google.com/open?id=0B4Hj2axlpCcxWldiajctWmY0NG8)
12 |
13 | ## FILES:-
14 | ### 'ProductSample.json'
15 | - Consists of all the products in the 'Clothing, Shoes and Jewelry' category from Amazon. Each row of 'ProductSample.json' is a separate JSON object describing one product.
16 | #### FIELDS:
17 | - 1 Asin - ID of the product, e.g. 0000031852
18 | - 2 Title - name of the product
19 | - 3 Price - price in US dollars (at time of crawl)
20 | - 4 ImUrl - url of the product image
21 | - 5 Related - related products (also bought, also viewed, bought together, buy after viewing)
22 | - 6 SalesRank - sales rank information
23 | - 7 Brand - brand name
24 | - 8 Categories - list of categories the product belongs to
25 |
26 | ### 'ReviewSample.json'
27 | - Consists of all the reviews for the products in the 'Clothing, Shoes and Jewelry' category from Amazon. Each row of 'ReviewSample.json' is a separate JSON object describing one review.
28 | #### FIELDS:
29 | - 1 ReviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
30 | - 2 Asin - ID of the product, e.g. 0000013714
31 | - 3 Reviewer Name - name of the reviewer
32 | - 4 Helpful - helpfulness rating of the review, e.g. 2/3
33 | - 5 Review Text - text of the review
34 | - 6 Overall - rating of the product
35 | - 7 Summary - summary of the review
36 | - 8 Unix Review Time - time of the review (unix time)
37 | - 9 Review Time - time of the review (raw)
38 |
39 | ## ANALYSIS:-
40 | * Analysis_1 : Sentiment Analysis on Reviews.
41 | * Analysis_2 : Exploratory Analysis.
42 | * Analysis_3 : 'Susan Katz' as 'Point of Interest' with maximum Reviews on Amazon.
43 | * Analysis_4 : 'Bundle' or 'Bought-Together' based Analysis.
44 | * Analysis_5 : Recommender System for Popular Brand 'Rubie's Costume Co'.
45 |
46 | ## DATA PROCESSING:-
47 | #### PRODUCT DATA i.e. ProductSample.json
48 | - Step 1: Reading multiple JSON objects from the single file 'ProductSample.json' and appending them to a list, such that each index of the list holds the content of one JSON object.
49 | - Step 2: Iterating over the list, loading each index as JSON, and making a list of tuples containing the data of all the JSON objects. During each iteration the line is first cleaned into proper JSON format by some replacements.
50 | - Step 3: Creating a dataframe using the list of tuples obtained in the previous step.
51 |
52 | #### REVIEW DATA i.e. ReviewSample.json
53 | - Step 1: Reading multiple JSON objects from the single file 'ReviewSample.json' and appending them to a list, such that each index of the list holds the content of one JSON object.
54 | - Step 2: Iterating over the list, loading each index as JSON, and making a list of tuples containing the data of all the JSON objects.
55 | - Step 3: Creating a dataframe using the list of tuples obtained in the previous step.
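
The three steps of both pipelines can be sketched as a single helper (a minimal, hypothetical sketch: the function name and toy sample lines are invented for illustration; the quote replacement is the notebook's own cleanup heuristic, which in practice is applied only to the product file):

```python
import json
import pandas as pd

def load_json_lines(lines, fields):
    """Parse one JSON object per line, keep only `fields`, and skip bad rows."""
    rows = []
    for line in lines:
        try:
            # Normalise single quotes first (the notebook's cleanup heuristic).
            obj = json.loads(line.replace("'", '"'))
            rows.append(tuple(obj[f] for f in fields))
        except (ValueError, KeyError):
            pass  # malformed line or missing field
    return pd.DataFrame(rows, columns=fields)

# Toy stand-in lines; the real ReviewSample.json rows are much richer.
sample = [
    '{"reviewerID": "A1", "asin": "B001", "overall": 5.0}',
    'not valid json',
    '{"reviewerID": "A2", "asin": "B002", "overall": 3.0}',
]
df = load_json_lines(sample, ["reviewerID", "asin", "overall"])
```

In the notebook the same pattern appears twice, once per input file.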
56 |
57 |
58 | ## ANALYSIS 1: SENTIMENT ANALYSIS ON REVIEWS (1999-2014)
59 | - Wordcloud of summary section of 'Positive' and 'Negative' Reviews on Amazon.
60 | - VADER (Valence Aware Dictionary and Sentiment Reasoner) Sentiment analysis tool was used to calculate the sentiment of reviews.
61 | - Sentiment distribution (positive, negative and neutral) across each product along with their names mapped with the product database 'ProductSample.json'.
62 | - List of products with most number of positive, negative and neutral Sentiment (3 Different list).
63 | - Percentage distribution of positive, neutral and negative in terms of sentiments.
64 | - Sentiment distribution across the Year.
65 | ------
66 | #### WordCloud of summary section of 'Positive' and 'Negative' Reviews on Amazon.
67 | - Cleaning(Data Processing) was performed on 'ReviewSample.json' file and importing the data as pandas DataFrame.
68 | - Created a function to calculate sentiments using Vader Sentiment Analyzer and Naive Bayes Analyzer.
69 | - The Vader Sentiment Analyzer was used at the final stage, since its output was much faster and more accurate.
70 | - Only 100,000 (1 lakh) reviews were taken into consideration for Sentiment Analysis so that the jupyter notebook doesn't crash.
71 | - Sentiment value was calculated for each review and stored in the new column 'Sentiment_Score' of DataFrame.
72 | - Separated negative and positive Sentiment_Scores into different dataframes for creating a 'Wordcloud'.
73 | - A stemming function was created to stem different forms of words; it is used by the 'create_Word_Corpus()' function. PorterStemmer from nltk.stem was used for stemming.
74 | - Function 'create_Word_Corpus()' was created to generate a Word Corpus.
75 | ###### To generate a word corpus, the following steps are performed inside the function 'create_Word_Corpus(df)':
76 | - Step 1 :- Iterating over the 'summary' section of reviews such that we only get important content of a review.
77 | - Step 2 :- Converting the content into Lowercase.
78 | - Step 3 :- Using nltk.tokenize to get words from the content.
79 | - Step 4 :- Using string.punctuation to get rid of punctuations.
80 | - Step 5 :- Using stopwords from nltk.corpus to get rid of stopwords.
81 | - Step 6 :- Stemming of Words.
82 | - Step 7 :- Finally forming a word corpus and returning the word corpus.
83 | - Function 'plot_cloud()' was defined to plot cloud.
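
The seven steps can be sketched without nltk as follows (a simplified, stdlib-only stand-in: the notebook itself uses nltk.tokenize, stopwords from nltk.corpus and PorterStemmer, so the tiny stopword set and crude suffix-stripping stemmer here are illustrative assumptions, not the real implementations):

```python
import string

# Tiny illustrative stopword set; the notebook uses nltk.corpus.stopwords.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "i", "of"}

def naive_stem(word):
    """Crude stand-in for nltk's PorterStemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def create_word_corpus(summaries):
    corpus = []
    for text in summaries:                            # Step 1: iterate over summaries
        text = text.lower()                           # Step 2: lowercase
        for token in text.split():                    # Step 3: tokenize (nltk.tokenize in the notebook)
            token = token.strip(string.punctuation)   # Step 4: drop punctuation
            if token and token not in STOPWORDS:      # Step 5: drop stopwords
                corpus.append(naive_stem(token))      # Step 6: stem
    return corpus                                     # Step 7: return the corpus

words = create_word_corpus(["Loved this jacket!", "The watch is amazing..."])
```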
84 |
85 | ##### WordCloud of 'Summary' section of Positive Reviews.
86 | 
87 | ##### WordCloud of 'Summary' section of Negative Reviews.
88 | 
89 |
90 | #### SENTIMENT DISTRIBUTION ACROSS EACH PRODUCT ALONG WITH THEIR NAMES MAPPED FROM PRODUCT DATABASE.
91 | - Cleaning(Data Processing) was performed on 'ProductSample.json' file and importing the data as pandas DataFrame.
92 | - Took only those columns which were required further down the Analysis such as 'Asin' and 'Sentiment_Score'.
93 | - Used Groupby on 'Asin' and 'Sentiment_Score' and calculated the count of all the products with positive, negative and neutral Sentiment_Score.
94 | - DataFrame Manipulations were performed to get desired DataFrame.
95 | - Sorted the rows in the ascending order of 'Asin' and assigned it to another DataFrame 'x1'.
96 | - Product Asin and Title are assigned to x2, which is a copy of the DataFrame 'Product_dataset' (the product database).
97 | - Merged 2 Dataframes 'x1' and 'x2' on common column 'Asin' to map product 'Title' to respective product 'Asin' using 'inner' type.
98 | - Took all the data such as Asin, Title, Sentiment_Score and Count into .csv file
99 | - (path : Final/Analysis/Analysis_1/Sentiment_Distribution_Across_Product.csv)
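
The groupby-and-merge pattern described above can be sketched with toy data (the column names 'Asin', 'Sentiment_Score', 'Count' and 'Title' follow the analysis; the rows are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for the review sentiment rows and the product database.
reviews = pd.DataFrame({
    "Asin": ["B001", "B001", "B002", "B002", "B002"],
    "Sentiment_Score": ["positive", "negative", "positive", "positive", "neutral"],
})
products = pd.DataFrame({"Asin": ["B001", "B002"], "Title": ["Shield", "Helmet"]})

# Count reviews per (Asin, Sentiment_Score) pair, sorted by Asin ...
x1 = (reviews.groupby(["Asin", "Sentiment_Score"]).size()
             .reset_index(name="Count")
             .sort_values("Asin"))
# ... then map each Asin to its Title with an inner merge.
result = pd.merge(x1, products, on="Asin", how="inner")
```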
100 |
101 | #### LIST OF PRODUCTS WITH MOST NUMBER OF POSITIVE, NEGATIVE AND NEUTRAL SENTIMENT (3 DIFFERENT LIST).
102 | - Segregated reviews based on their Sentiment_Score into 3 different (positive, negative and neutral) data frames, which we obtained in an earlier step.
103 | - Sorted the above result in descending order of count.
104 | - Dropped the unwanted column 'index'.
105 | - Took all the data such as Asin, Title, Sentiment_Score and Count for 3 into .csv file
106 | - (path : '../Analysis/Analysis_1/Positive_Sentiment_Max.csv'),
107 | - (path : '../Analysis/Analysis_1/Negative_Sentiment_Max.csv'),
108 | - (path : '../Analysis/Analysis_1/Neutral_Sentiment_Max.csv')
109 |
110 | #### PERCENTAGE DISTRIBUTION OF POSITIVE, NEUTRAL AND NEGATIVE IN TERMS OF SENTIMENTS.
111 | - Took summation of count column to get the Total count of Reviews under Consideration.
112 | - Percentage was calculated for positive, negative and neutral and was stored into a new column 'Percentage' of data frame.
113 | - Taking all the data such as Sentiment_Score, Count and Percentage into .csv file
114 | - (path : '../Analysis/Analysis_1/Sentiment_Percentage.csv')
115 |
116 | #### SENTIMENT DISTRIBUTION ACROSS THE YEAR
117 | - Converted the data type of the 'Review_Time' column in the Dataframe 'Selected_Rows' to datetime format.
118 | - Created an additional column 'Month' in the Dataframe 'Selected_Rows' by taking the month part of the 'Review_Time' column.
119 | - Created an additional column 'Year' in the Dataframe 'Selected_Rows' by taking the year part of the 'Review_Time' column.
120 | - Grouped on the basis of 'Year' and 'Sentiment_Score' to get the respective count.
121 | - Segregated rows based on their Sentiments by year.
122 | - Computed the total count (positive, negative and neutral together) of reviews under consideration for each year.
123 | - Merged the total-count dataframe with the individual sentiment counts to compute percentages.
124 | - Calculated the Percentage to find a trend for sentiments.
125 | - Took all the data such as Year, Sentiment_Score, Count, Total_Count and Percentage for 3 into .csv file
126 | - (path : '../Analysis/Analysis_1/Pos_Sentiment_Percentage_vs_Year.csv')
127 | - (path : '../Analysis/Analysis_1/Neg_Sentiment_Percentage_vs_Year.csv')
128 | - (path : '../Analysis/Analysis_1/Neu_Sentiment_Percentage_vs_Year.csv')
129 | - Bar-Chart to know the Trend for Percentage of Positive, Negative and Neutral Review over the years based on Sentiments.
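
The steps above (datetime conversion, Month/Year extraction, per-year sentiment percentages) can be sketched as follows (toy rows invented for illustration; 'Review_Time' mimics the raw Amazon format and the other column names follow the analysis):

```python
import pandas as pd

df = pd.DataFrame({
    "Review_Time": ["01 15, 2013", "06 02, 2013", "03 20, 2014", "05 11, 2014"],
    "Sentiment_Score": ["positive", "negative", "positive", "positive"],
})
# Convert the raw review time to datetime, then pull out Month and Year.
df["Review_Time"] = pd.to_datetime(df["Review_Time"], format="%m %d, %Y")
df["Month"] = df["Review_Time"].dt.month
df["Year"] = df["Review_Time"].dt.year

# Count per (Year, Sentiment_Score), merge in the yearly totals, compute percentages.
counts = df.groupby(["Year", "Sentiment_Score"]).size().reset_index(name="Count")
totals = df.groupby("Year").size().reset_index(name="Total_Count")
trend = pd.merge(counts, totals, on="Year")
trend["Percentage"] = trend["Count"] / trend["Total_Count"] * 100
```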
130 | ##### Bar-Chart to know the Trend for Percentage of Positive Review over the years based on Sentiments.
131 | 
132 |
133 | ##### Bar-Chart to know the Trend for Percentage of Negative Review over the years based on Sentiments.
134 | 
135 |
136 | ##### Bar-Chart to know the Trend for Percentage of Neutral Review over the years based on Sentiments.
137 | 
138 |
139 | ## CONCLUSION OF ANALYSIS 1:-
140 | - From Positive WordCloud
141 | - Popular words used to describe the products were love, perfect, nice, good, best, great, etc.
142 | - Many people who reviewed were happy with the price of the products sold on Amazon.
143 | - Much talked products were watch, bra, jacket, bag, costume, etc.
144 | - From Negative WordCloud
145 | - Popular words used to describe the products were disappoint, badfit, terrible, defect, return, etc.
146 | - Much talked products were shoes, watch, bra, batteries, etc.
147 | - Popular products in terms of sentiments are as follows
148 | - Positive:
149 | - Converse Unisex Chuck Taylor Classic Colors Sneaker, Number of positive reviews:953
150 | - Converse Unisex Chuck Taylor All Star Hi Top Black Monochrome Sneaker, Number of positive reviews:932
151 | - Yaktrax Walker Traction Cleats for Snow and Ice, Number of positive reviews:676
152 | - Negative:
153 | - Yaktrax Walker Traction Cleats for Snow and Ice, Number of negative reviews:65
154 | - Converse Unisex Chuck Taylor Classic Colors Sneaker, Number of negative reviews:44
155 | - Converse Unisex Chuck Taylor All Star Hi Top Black Monochrome Sneaker, Number of negative reviews:44
156 | - Neutral:
157 | - Converse Unisex Chuck Taylor Classic Colors Sneaker, Number of neutral reviews:313
158 | - Yaktrax Walker Traction Cleats for Snow and Ice, Number of neutral reviews:253
159 | - Converse Unisex Chuck Taylor All Star Hi Top Black Monochrome Sneaker, Number of neutral reviews:247
160 | - Sentiment Percentage Distribution.
161 | - Positive: 72.7 %
162 | - Negative: 5 %
163 | - Neutral: 22.3 %
164 | - Overall Sentiment for reviews on Amazon is on positive side as it has very less negative sentiments.
165 | - Trend for Percentage of Review over the years
166 | - The positive review percentage has been fairly consistent, between 70% and 80%, throughout the years.
167 | - The negative review percentage has been decreasing over the last three years; maybe Amazon worked on its services and faults.
168 |
169 | # ANALYSIS 2: EXPLORATORY ANALYSIS OF PRODUCT AND REVIEWS (1999-2014)
170 | - Number of Reviews over the years.
171 | - Number of Reviews by month over the years.
172 | - Distribution of 'Overall Rating' for 2.5 million 'Clothing Shoes and Jewellery' reviews on Amazon.
173 | - Yearly average 'Overall Ratings' over the years.
174 | - Distribution of helpfulness on 'Clothing Shoes and Jewellery' reviews on Amazon.
175 | - Distribution of length of reviews on Amazon.
176 | - Distribution of product prices of 'Clothing Shoes and Jewellery' category on Amazon.
177 | - Distribution of 'Overall Rating' of Amazon 'Clothing Shoes and Jewellery'.
178 | - Product Price V/S Overall Rating of reviews written for products.
179 | - Average Review Length V/S Product Price for Amazon products.
180 | - Distribution of 'Number of Reviews' written by each of the Amazon 'Clothing Shoes and Jewellery' user.
181 | - Distribution of 'Average Rating' written by each of the Amazon 'Clothing Shoes and Jewellery' users.
182 | - Distribution of Helpfulness of reviews written by Amazon 'Clothing Shoes and Jewellery' users.
183 | - Average Rating V/S Avg Helpfulness written by Amazon 'Clothing Shoes and Jewellery' user
184 | - Helpfulness VS Average Length of reviews written by Amazon 'Clothing Shoes and Jewellery' users.
185 | -----------------------------------
186 | #### Number of Reviews over the years.
187 | - Cleaning(Data Processing) was performed on 'ReviewSample.json' file and importing the data as pandas DataFrame.
188 | - Converting the data type of 'Review_Time' column in the Dataframe 'dataset' to datetime format.
189 | - Creating an additional column 'Month' in the Dataframe 'dataset' by taking the month part of the 'Review_Time' column.
190 | - Creating an additional column 'Year' in the Dataframe 'dataset' by taking the year part of the 'Review_Time' column.
191 | - Grouping by year and taking the count of reviews for each year.
192 | - Took Data to .csv file
193 | - (path : '../Analysis/Analysis_2/Year_VS_Reviews.csv')
194 | - Line Plot for number of reviews over the years.
195 | 
196 |
197 | #### Number of Reviews by month over the years.
198 | - Grouping on Month and getting the count.
199 | - Replacing digits of 'Month' column in 'Monthly' dataframe with words using 'Calendar' library.
200 | - Took Data to .csv file
201 | - (path : '../Analysis/Analysis_2/Month_VS_Reviews.csv')
202 | - Line Plot for number of reviews by month.
203 | 
204 |
205 |
206 | #### Distribution of 'Overall Rating' for 2.5 million 'Clothing Shoes and Jewellery' reviews on Amazon.
207 | - Grouping on 'Rating' and getting the count.
208 | - Took Data to .csv file
209 | - (path : '../Analysis/Analysis_2/Rating_VS_Reviews.csv')
210 | - Bar Chart Plot for Distribution of Rating.
211 | 
212 |
213 | #### Yearly average 'Overall Ratings' over the years.
214 | - Grouping on Year and getting the mean.
215 | - Calculating the Moving Average with a window of '3' to confirm the trend.
216 | - Took Data to .csv file
217 | - (path : '../Analysis/Analysis_2/Yearly_Avg_Rating.csv')
218 | - Bar Chart Plot for the Yearly Average Rating.
219 | 
220 |
221 |
222 | #### Distribution of helpfulness on 'Clothing Shoes and Jewellery' reviews on Amazon.
223 | - Only taking the required columns and converting their data types.
224 | - Calculating the helpfulness percentage and replacing NaN with 0.
225 | - Creating an interval of 10 for the percentage value.
226 | - Took Data to .csv file
227 | - (path : '../Analysis/Analysis_2/Helpfuness_Percentage_Distribution.csv')
228 | - Bar Chart Plot for Distribution of Helpfulness.
229 | 
230 |
231 | #### Distribution of length of reviews on Amazon.
232 | - Creating a new Data frame with 'Reviewer_ID','Reviewer_Name' and 'Review_Text' columns.
233 | - Counting the number of words using 'len(x.split())'
234 | - Counting the number of characters 'len(x)'
235 | - Creating an interval of 100 for Character and Word Length values.
236 | - Took Data to .csv file
237 | - (path : '../Analysis/Analysis_2/Character_Length_Distribution.csv')
238 | - (path : '../Analysis/Analysis_2/Word_Length_Distribution.csv')
239 | - Bar Plot for distribution of Character Length of reviews on Amazon
240 | 
241 |
242 | - Bar Plot for distribution of Word Length of reviews on Amazon
243 | 
244 |
245 | #### Distribution of product prices of 'Clothing Shoes and Jewellery' category on Amazon.
246 | - Cleaning(Data Processing) was performed on 'ProductSample.json' file and importing the data as pandas DataFrame.
247 | - Segregating the product based on price range.
248 | - [0-10]
249 | - [11-50]
250 | - [51-100]
251 | - [101-200]
252 | - [201-500]
253 | - [501-1000]
254 | - Took Data to .csv file
255 | - (path : '../Analysis/Analysis_2/Price_Distribution.csv')
256 | - Bar Chart Plot for Distribution of Price.
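
The price segregation can be sketched with pd.cut (a hypothetical illustration: the prices are invented; the bin edges follow the ranges listed above):

```python
import pandas as pd

prices = pd.Series([4.99, 23.50, 75.00, 150.00, 350.00, 999.00, 12.00])
# The price ranges listed above, expressed as pd.cut bin edges and labels.
bins = [0, 10, 50, 100, 200, 500, 1000]
labels = ["0-10", "11-50", "51-100", "101-200", "201-500", "501-1000"]
segments = pd.cut(prices, bins=bins, labels=labels)
counts = segments.value_counts()
```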
257 |
258 |
259 | #### Distribution of 'Overall Rating' of Amazon 'Clothing Shoes and Jewellery'.
260 | - Grouping on Asin and getting the mean of Rating.
261 | - Took Data to .csv file
262 | - (path : '../Analysis/Analysis_2/Rating_Distribution.csv')
263 | - Bar Chart Plot for Distribution of Rating.
264 | 
265 |
266 | #### Product Price V/S Overall Rating of reviews written for products.
267 | - Merging the 'Product_dataset' data frame and the data frame obtained in the above analysis, on the common column 'Asin'.
268 | - Scatter plot for product price v/s overall rating
269 | 
270 |
271 | #### Average Review Length V/S Product Price for Amazon products.
272 | - Creating a new Data frame with 'Reviewer_ID','Reviewer_Name', 'Asin' and 'Review_Text' columns.
273 | - Counting the number of words using 'len(x.split())'
274 | - Counting the number of characters 'len(x)'
275 | - Grouped on 'Asin' and taking the mean of Word and Character length.
276 | - Scatter plot for product price v/s average review length.
277 | 
278 |
279 | #### Distribution of 'Number of Reviews' written by each of the Amazon 'Clothing Shoes and Jewellery' user.
280 | - Grouped on 'Reviewer_ID' and took the count.
281 | - Sorting in the descending order of number of reviews got in previous step.
282 | - Now grouped on Number of reviews and took the count.
283 | - Created an interval of 10 for the plot and took the sum of all the counts using groupby.
284 | - Took Data to .csv file
285 | - (path : '../Analysis/Analysis_2/DISTRIBUTION OF NUMBER OF REVIEWS.csv')
286 | - Scatter Plot for Distribution of Number of Reviews.
287 | 
288 |
289 |
290 | #### Distribution of 'Average Rating' written by each of the Amazon 'Clothing Shoes and Jewellery' users.
291 | - Grouped on 'Reviewer_ID' and took the mean of Rating.
292 | - Scatter Plot for Distribution of Average Rating.
293 | 
294 |
295 |
296 | #### Distribution of Helpfulness of reviews written by Amazon 'Clothing Shoes and Jewellery' users.
297 | - Creating a new Dataframe with 'Reviewer_ID','helpful_UpVote' and 'Total_Votes'
298 | - Calculate percentage using: (helpful_UpVote/Total_Votes)*100
299 | - Grouped on 'Reviewer_ID' and took the mean of 'Percentage'.
300 | - Created an interval of 10 for the percentage.
301 | - Took Data to .csv file
302 | - (path : '../Analysis/Analysis_2/DISTRIBUTION OF HELPFULNESS.csv')
303 | - Bar Chart Plot for DISTRIBUTION OF HELPFULNESS.
304 | 
305 |
306 | #### Average Rating V/S Avg Helpfulness written by Amazon 'Clothing Shoes and Jewellery' user
307 | - Took Data to .csv file
308 | - (path : '../Analysis/Analysis_2/AVERAGE RATING VS AVERAGE HELPFULNESS.csv')
309 | - Scatter Plot.
310 | 
311 |
312 | #### Helpfulness VS Average Length of reviews written by Amazon 'Clothing Shoes and Jewellery' users.
313 | - Took Data to .csv file
314 | - (path : '../Analysis/Analysis_2/HELPFULNESS VS AVERAGE LENGTH.csv')
315 | - Scatter Plot.
316 | 
317 |
318 |
319 | ## CONCLUSION OF ANALYSIS 2:-
320 | - There has been exponential growth for Amazon in terms of reviews, which suggests that sales also increased exponentially. The plot shows a drop for 2014 because the data only runs until May; even so, those five months account for more than half of a typical year's count.
321 | - Buyers generally shop more in December and January.
322 | - More than half of the reviews give a 4 or 5 star rating, with relatively few giving 1, 2 or 3 stars.
323 | - The average rating for every year has been above 4, and the moving average confirms the trend.
324 | - The majority of reviews had perfect helpfulness scores. That makes sense: if you're writing a review (especially a 5-star review), you're writing with the intent to help other prospective buyers.
325 | - The majority of reviews on Amazon have a length of 100-200 characters or 0-100 words.
326 | - Over two-thirds of Amazon Clothing products are priced between $0 and $50, which makes sense as clothing is generally inexpensive.
327 | - The most expensive products have 4-star and 5-star overall ratings.
328 | - Over 95% of the reviewers of Amazon 'Clothing Shoes and Jewellery' left fewer than 10 reviews.
329 | - Reviewers who give a product a 4 or 5 star rating are more passionate about the product and more likely to write detailed reviews than someone who gives a 1 or 2 star rating.
330 |
331 | # ANALYSIS 3 : POINT OF INTEREST AS THE ONE WITH MAXIMUM NUMBER OF REVIEWS ON AMAZON
332 | - Reviewer with maximum number of reviews.
333 | - Distribution of reviews for 'Susan Katz' based on overall rating (reviewer_id : A1RRMZKOMZ2M7J).
334 | - Distribution of reviews over the years for 'Susan Katz'.
335 | - Percentage distribution of negative reviews for 'Susan Katz', since the count of reviews is dropping post year 2009.
336 | - Lexical density distribution over the year for reviews written by 'Susan Katz'.
337 | - Wordcloud of all important words used in 'Susan Katz' reviews on amazon.
338 | - Number of distinct products reviewed by 'Susan Katz' on amazon.
339 | - Products reviewed by 'Susan Katz'.
340 | - Popular sub-category for 'Susan Katz'.
341 | - Price range in which 'Susan Katz' shops.
342 | -------------------
343 | #### Reviewer with maximum number of reviews.
344 | - Cleaning (data processing) was performed on the 'ReviewSample.json' file and the data was imported as a pandas DataFrame.
345 | - Grouped on 'Reviewer_ID' and got the count of reviews.
346 | - Sorted in descending order of 'No_Of_Reviews'.
347 | - Took Point_of_Interest DataFrame to .csv file
348 | - (path : '../Analysis/Analysis_3/Most_Reviews.csv')
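The groupby-and-sort step above can be sketched in pandas. A minimal sketch on toy data; the column names 'Reviewer_ID' and 'No_Of_Reviews' follow the readme, the notebook's actual code may differ.

```python
import pandas as pd

# Toy stand-in for the review DataFrame; rows are illustrative.
dataset = pd.DataFrame({
    "Reviewer_ID": ["A1RRMZKOMZ2M7J"] * 3 + ["A2X", "A2X", "A3Y"],
    "Asin": ["p1", "p2", "p3", "p1", "p4", "p2"],
})

# Group on Reviewer_ID, count reviews, sort descending.
poi = (dataset.groupby("Reviewer_ID").size()
       .reset_index(name="No_Of_Reviews")
       .sort_values("No_Of_Reviews", ascending=False))
print(poi.iloc[0])  # reviewer with the maximum number of reviews
```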
349 |
350 | #### Distribution of reviews for 'Susan Katz' based on overall rating (reviewer_id : A1RRMZKOMZ2M7J).
351 | - Kept only those reviews posted by 'Susan Katz'.
352 | - Created a function 'ReviewCategory()' to assign a positive, negative or neutral status based on the overall rating.
353 | - score >= 4 then positive
354 | - score <= 2 & score > 0 then negative
355 | - score = 3 then neutral
356 | - Called function 'ReviewCategory()' for each row of DataFrame column 'Rating'.
357 | - Grouped on the 'Category' obtained in the previous step and got the count of reviews.
358 | - Bar Plot for Category V/S Count.
359 | 
360 |
361 | #### Distribution of reviews over the years for 'Susan Katz'.
362 | - Grouped on the 'Year' column (created earlier) and got the count of reviews.
363 | - Took Data to .csv file
364 | - (path : '../Analysis/Analysis_3/Yearly_Count.csv')
365 | - Bar Plot to get trend over the years for Reviews Written by 'SUSAN KATZ'
366 | 
367 |
368 | #### Percentage distribution of negative reviews for 'Susan Katz', since the count of reviews is dropping post year 2009.
369 | - Took the count of negative reviews over the years using 'groupby'.
370 | - Merged the two DataFrames for mapping and then calculated the percentage of negative reviews for each year.
371 | - Took Data to .csv file
372 | - (path : '../Analysis/Analysis_3/Negative_Review_Percentage.csv')
373 | - Bar Plot for Year V/S Negative Reviews Percentage
374 | 
375 |
376 |
377 | #### Lexical density distribution over the year for reviews written by 'Susan Katz'.
378 | - Lexical words include:
379 | - verbs (e.g. run, walk, sit)
380 | - nouns (e.g. dog, Susan, oil)
381 | - adjectives (e.g. red, happy, cold)
382 | - adverbs (e.g. very, carefully, yesterday)
383 | - Created a function 'LexicalDensity(text)' to calculate the lexical density of a piece of content.
384 | - Steps involved are as follows:-
385 | - Step 1 :- Converting the content into lowercase.
386 | - Step 2 :- Using nltk.tokenize to get words from the content.
387 | - Step 3 :- Storing the total word count.
388 | - Step 4 :- Using string.punctuation to get rid of punctuation.
389 | - Step 5 :- Using stopwords from nltk.corpus to get rid of stopwords.
390 | - Step 6 :- Tagging the words and counting those whose tags start with ("NN","JJ","VB","RB"), which represent nouns, adjectives, verbs and adverbs respectively; this gives the lexical count.
391 | - Step 7 :- Finally, lexical density = (lexical count / total word count) * 100.
392 | - Called function 'LexicalDensity()' for each row of the DataFrame.
393 | - Grouped on 'Year' and got the average lexical density of reviews.
394 | - Took Data to .csv file
395 | - (path : '../Analysis/Analysis_3/Lexical_Density.csv')
396 | - Bar Plot for Year V/S Lexical Density
397 | 
398 |
399 | #### Wordcloud of all important words used in 'Susan Katz' reviews on amazon.
400 | - To generate a word corpus, the following steps are performed inside the function 'create_Word_Corpus(df)':
401 | - Step 1 :- Iterating over the 'summary' section of reviews so that we only get the important content of a review.
402 | - Step 2 :- Converting the content into lowercase.
403 | - Step 3 :- Using nltk.tokenize to get words from the content.
404 | - Step 4 :- Using string.punctuation to get rid of punctuation.
405 | - Step 5 :- Using stopwords from nltk.corpus to get rid of stopwords.
406 | - Step 6 :- Tagging words using nltk and only allowing words with tags ("NN","JJ","VB","RB").
407 | - Step 7 :- Finally forming and returning the word corpus.
408 | - Generated a WordCloud image
409 | 
410 |
411 | #### Number of distinct products reviewed by 'Susan Katz' on amazon.
412 | - Took the unique Asin values from the reviews written by 'Susan Katz' and returned the length of that list.
413 |
414 | #### Products reviewed by 'Susan Katz'.
415 | - Cleaning (data processing) was performed on the 'ProductSample.json' file and the data was imported as a pandas DataFrame.
416 | - Mapped 'Product_dataset' with 'POI' to get the products reviewed by 'Susan Katz'.
417 | - Took Data to .csv file
418 | - (path : '../Analysis/Analysis_3/Products_Reviewed.csv')
419 |
420 | #### Popular sub-category for 'Susan Katz'.
421 | - Created a list of products reviewed by 'Susan Katz'.
422 | - Created a function 'make_flat(arr)' to flatten multilevel lists, used to extract sub-categories from the nested category lists.
423 | - Took the sub-category of each Asin reviewed by 'Susan Katz'.
424 | - Counted the occurrences and took the top 5.
425 | - Took Data to .csv file
426 | - (path : '../Analysis/Analysis_3/Popular_Sub-Category.csv')
427 |
428 | #### Price range in which 'Susan Katz' shops.
429 | - Took min, max and mean price of all the products by using aggregation function on data frame column 'Price'.
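The aggregation above is a one-liner in pandas. A minimal sketch with illustrative prices (not the dataset's values):

```python
import pandas as pd

# Toy product price data; values are illustrative, not from the dataset.
prices = pd.DataFrame({"Price": [6.99, 19.99, 49.50, 249.99]})

# min, max and mean via a single aggregation call on the 'Price' column.
stats = prices["Price"].agg(["min", "max", "mean"])
print(stats)
```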
430 |
431 | ## CONCLUSION OF ANALYSIS 3:-
432 | - 'Susan Katz' (reviewer_id : A1RRMZKOMZ2M7J) reviewed the maximum number of products i.e. 180.
433 | - Susan was happy with the products shopped on Amazon only about 50% of the time.
434 |
435 |
436 | | Category | Count |
437 | |------|------|
438 | | neg | 38|
439 | | neu | 51|
440 | | pos | 91|
441 |
442 |
443 |
444 | - The number of reviews by 'Susan Katz' was dropping after 2009.
445 | - Ratings from 'Susan Katz' were dropping because Susan was not happy with most of the products she shopped for, i.e. the negative review count increased every year after 2009.
446 | - The average lexical density for 'Susan Katz' has always been under 40%, i.e. her writing tended to lack content-carrying words.
447 | - The most popular words used in 'Susan Katz' content were shoes, color, fit, heels, watch, etc.
448 | - The number of distinct products reviewed by 'Susan Katz' on Amazon is 180.
449 | - Popular sub-categories for 'Susan Katz' were Jewelry, Novelty, Costumes & More.
450 | - 'Susan Katz' shopping Price Range
451 | - Minimum: 6.99
452 | - Maximum: 249.99
453 | - Average: 66.01
454 |
455 | # ANALYSIS 4 : 'BUNDLE' OR 'BOUGHT-TOGETHER' BASED ANALYSIS.
456 | - Check for the popular bundle (quantity in a bundle).
457 | - Top 10 Popular brands which sell Pack of 2 and 5, as they are the popular bundles.
458 | - Top 10 Popular Sub-Category with Pack of 2 and 5.
459 | - Checking the number of products the brand 'Rubie's Costume Co' has listed on Amazon, since it has the highest number of bundles in Pack of 2 and 5.
460 | - Minimum, Maximum and Average Selling Price of products sold by the Brand 'Rubie's Costume Co'.
461 | - Top 10 Highest selling product in 'Clothing' Category for Brand 'Rubie's Costume Co'.
462 | - Top 10 most viewed product for brand 'Rubie's Costume Co'.
463 | -----------
464 |
465 | #### Check for the popular bundle (quantity in a bundle).
466 | - Cleaning (data processing) was performed on the 'ProductSample.json' file and the data was imported as a pandas DataFrame.
467 | - Got numerical values for 'Number_Of_Pack' etc. from 'ProductSample.json'.
468 | - Grouped by Number of Pack and got their respective counts.
469 | - Took the data into .csv file
470 | - (path : '../Analysis/Analysis_4/Popular_Bundle.csv')
471 | - Bar Chart was plotted for Number of Packs
472 | 
473 |
474 | #### Top 10 Popular brands which sells Pack of 2 and 5, as they are the popular bundles.
475 | - Got all the Asin for Pack 2 and 5, which have the highest counts, and stored them in a list 'list_Pack2_5'.
476 | - Got the brand names of those Asin present in the list 'list_Pack2_5'.
477 | - Removed the rows which do not have a brand name.
478 | - Counted the occurrences of brand names and took the top 10 brands.
479 | - Took the data into .csv file
480 | - (path : '../Analysis/Analysis_4/Popular_Brand.csv')
481 | - Bar Chart was plotted for Popular brands.
482 | 
483 |
484 | #### Top 10 Popular Sub-Category with Pack of 2 and 5.
485 | - Got all the Asin for Pack 2 and 5 and stored them in a list 'list_Pack2_5'.
486 | - Created a function 'make_flat(arr)' to flatten multilevel lists, used to extract sub-categories from the nested category lists.
487 | - Got the categories of those Asin present in the list 'list_Pack2_5'.
488 | - Counted the occurrences of sub-categories and took the top 10.
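A 'make_flat(arr)'-style helper can be sketched as a short recursive function. The implementation below is an illustration; the notebook's actual version may differ.

```python
def make_flat(arr):
    """Recursively flatten an arbitrarily nested list (sketch of the helper
    named in the readme; the notebook's implementation may differ)."""
    flat = []
    for item in arr:
        if isinstance(item, list):
            flat.extend(make_flat(item))  # descend into nested lists
        else:
            flat.append(item)
    return flat

# Nested category lists like those in 'ProductSample.json'.
categories = [["Clothing, Shoes & Jewelry", ["Novelty", "Costumes & More"]], ["Women"]]
print(make_flat(categories))
```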
489 |
490 | #### Checking for number of products the brand 'Rubie's Costume Co' has listed on Amazon.
491 | - Got all the products which have the brand name 'Rubie's Costume Co'.
492 | - Took the total count of the products.
493 |
494 | #### Minimum, Maximum and Average Selling Price of products sold by the Brand 'Rubie's Costume Co'.
495 | - Took min, max and mean price of all the products by using aggregation function on data frame column 'Price'.
496 |
497 | #### Top 10 Highest selling product in 'Clothing' Category for Brand 'Rubie's Costume Co'.
498 | - Took all the Asin, SalesRank etc. whose brand is 'RUBIE'S COSTUME CO' from ProductSample.json.
499 | - Created a Data frame for above.
500 | - Sorted DataFrame based on sales Rank.
501 | - Calculated Average selling price for top 10 products.
502 | - Took the data into .csv file
503 | - (path : '../Analysis/Analysis_4/Popular_Product.csv').
504 | - Bar Plot
505 | 
506 |
507 | #### Top 10 most viewed product for brand 'Rubie's Costume Co'.
508 | - From all the Asin, got those present in the 'also_viewed' section of the json file.
509 | - Counted the occurrences of Asin for brand Rubie's Costume Co.
510 | - Created a DataFrame with Asin and its Views.
511 | - Got products of brand Rubie's Costume Co.
512 | - Merged the 2 DataFrames 'views_dataset' and 'view_prod_dataset' such that only the Rubie's Costume Co. products from 'view_prod_dataset' get mapped. An inner merge was performed to keep only products mapped to Rubie's Costume Co.
513 | - Sorted the DataFrame based on column 'Views'.
514 | - Took the data into .csv file
515 | - (path : '../Analysis/Analysis_4/Most_Viewed_Product.csv')
516 | - Took min, max and mean price of Top 10 products by using aggregation function on data frame column 'Price'
517 | 
518 |
519 | ## CONCLUSION OF ANALYSIS 4:-
520 | - Packs of 2 and 5 were found to be the most popular bundles.
521 | - 'Rubie's Costume Co' was found to be the most popular brand selling Packs of 2 and 5.
522 | - Women, Novelty Costumes & More, Novelty, etc. are the popular sub-category in 'Clothing shoes and Jewellery' on Amazon.
523 | - 'Rubie's Costume Co' has 2175 products listed on Amazon.
524 | - 'Rubie's Costume Co' Selling Price Range
525 | - Minimum: 0.66
526 | - Maximum: 783.01
527 | - Average: 20.43
528 | - Popular products for 'Rubie's Costume Co' were in the price range $5-15, such as
529 | - DC Comics Boys Action Trio Superhero Costume Set
530 | - The Dark Knight Rises Batman Child Costume Kit
531 | - Star Wars Clone Wars Ahsoka Lightsaber, etc.
532 | - Price range of these top 10 popular products:
533 | - Minimum: 4.95
534 | - Maximum: 12.99
535 | - Average: 7.24
536 | - Most viewed products for 'Rubie's Costume Co' were also in the price range $5-15, which confirms the popular product data.
537 |
538 | # ANALYSIS 5 : RECOMMENDER SYSTEM FOR BRAND RUBIE'S COSTUME CO.
539 | - A collaborative filtering algorithm is used to generate the recommendations.
540 | - The recommender system takes a 'Product Name' and, based on the correlation factor, outputs a list of products as suggestions or recommendations.
541 | - Suppose product name 'A' acts as the input parameter, i.e. if a user buys product 'A', the system outputs the products most highly correlated with it.
542 | -----------------------------------------
543 |
544 | - Cleaning (data processing) was performed on the 'ProductSample.json' file and the data was imported as a pandas DataFrame.
545 | - Got all the distinct product Asin of brand 'Rubie's Costume Co.' in a list.
546 | - Cleaning (data processing) was performed on the 'ReviewSample.json' file and the data was imported as a pandas DataFrame.
547 | - Created a DataFrame 'Working_dataset' which has products only from brand "RUBIE'S COSTUME CO.".
548 | - Performed a merge of 'Working_dataset' and 'Product_dataset' to get all the required details together for building the recommender system.
549 | - Took only the required columns and created a pivot table with index 'Reviewer_ID', columns 'Title' and values 'Rating'.
550 | - Created a function 'pearson(x1,x2)'.
551 | - Function to find the Pearson correlation between two columns or products.
552 | - Produces a result between -1 and 1.
553 | - Used within the recommender function 'get_recommendations()'.
554 | - Created a function 'get_recommendations(product_id,M,num)'.
555 | - Function to recommend products based on the correlation between them.
556 | - Takes 3 parameters: 'Product Name', 'Model' and 'Number of Recommendations'.
557 | - Returns a list in descending order of correlation; the list size depends on the input given for Number of Recommendations.
558 | - Created a function 'escape(t)'.
559 | - Function to replace all the HTML escape characters with their respective characters.
560 | - Called the recommender system by making a function call to 'get_recommendations('300 Movie Spartan Shield',Model,5)'.
561 | - '300 Movie Spartan Shield' is the product name passed to the function, i.e. if a person buys '300 Movie Spartan Shield', what else can be recommended to him/her.
562 | - 'Model' is passed for the correlation calculation. Model is the pivot table created previously.
563 | - '5' is the maximum number of recommendations the function can return if there is some correlation.
564 | - The strength of a recommendation can be quantified using the correlation value given in the output.
565 | - Took the recommendations into a DataFrame for tabular representation.
566 | - Kept only those values whose correlation is greater than 0.
567 | - Took all the recommendations into .csv file
568 | - (path : '../Analysis/Analysis_5/Recommendation.csv')
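The pivot-table-plus-correlation recommender described above can be sketched as follows. The function names 'pearson(x1,x2)' and 'get_recommendations(product_id,M,num)' come from the readme, but the bodies below are a minimal sketch on toy ratings, not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd

def pearson(x1, x2):
    """Pearson correlation of two rating columns, using only users who
    rated both products (a sketch; the notebook's version may differ)."""
    both = x1.notna() & x2.notna()
    if both.sum() < 2:
        return 0.0
    a, b = x1[both], x2[both]
    if a.std() == 0 or b.std() == 0:
        return 0.0  # constant column: correlation undefined
    return np.corrcoef(a, b)[0, 1]

def get_recommendations(product_id, M, num):
    """Correlate `product_id`'s column against every other column of the
    pivot table M and return the `num` most correlated titles."""
    scores = {other: pearson(M[product_id], M[other])
              for other in M.columns if other != product_id}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:num]

# Toy pivot table: rows = Reviewer_ID, columns = Title, values = Rating.
M = pd.DataFrame(
    {"Shield": [5, 4, 1, np.nan],
     "Helmet": [5, 5, 1, 3],
     "Sword":  [1, 2, 5, np.nan]},
    index=["u1", "u2", "u3", "u4"])

for title, corr in get_recommendations("Shield", M, 2):
    print(title, round(corr, 3))
```

Positive correlations suggest products rated similarly by the same users, which is why the readme keeps only values greater than 0.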
569 |
570 | ## CONCLUSION OF ANALYSIS 5:
571 | - When '300 Movie Spartan Shield' is passed to the recommender system, the output is:
573 |
574 | | Product Title | Correlation |
575 | |------|------|
576 | | 300 Movie Spartan Deluxe Vinyl Helmet | 0.223277|
577 | | Toynk Toys - 300- Spartan Sword | 0.069275|
578 |
579 |
580 | - The table above lists the recommendations for '300 Movie Spartan Shield'.
581 |
582 |
583 | ### Citation
584 | - Image-based recommendations on styles and substitutes J. McAuley, C. Targett, J. Shi, A. van den Hengel SIGIR, 2015
585 | - Inferring networks of substitutable and complementary products J. McAuley, R. Pandey, J. Leskovec Knowledge Discovery and Data Mining, 2015
586 |
--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | --------------
5 | # AMAZON PRODUCT AND REVIEW DATA
6 | ------------------------------------------------------------------------------------------------------------------------------
7 |
8 | ## DATA SET DESCRIPTION:-
9 | - This dataset contains product reviews and metadata of 'Clothing, Shoes and Jewelry' category from Amazon, including 2.5 million reviews spanning May 1996 - July 2014.
10 | - This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links.
11 | - [DATA SET LINK](https://drive.google.com/open?id=0B4Hj2axlpCcxWldiajctWmY0NG8)
12 |
13 | ## FILES:-
14 | ### 'ProductSample.json'
15 | - Consists of all the products in the 'Clothing, Shoes and Jewelry' category from Amazon. Each row of 'ProductSample.json' is a single product's JSON object.
16 | #### FIELDS:
17 | - 1 Asin - ID of the product, e.g. 0000031852
18 | - 2 Title - name of the product
19 | - 3 Price - price in US dollars (at time of crawl)
20 | - 4 ImUrl - url of the product image
21 | - 5 Related - related products (also bought, also viewed, bought together, buy after viewing)
22 | - 6 SalesRank - sales rank information
23 | - 7 Brand - brand name
24 | - 8 Categories - list of categories the product belongs to
25 |
26 | ### 'ReviewSample.json'
27 | - Consists of all the reviews for the products in the 'Clothing, Shoes and Jewelry' category from Amazon. Each row of 'ReviewSample.json' is a single review's JSON object.
28 | #### FIELDS:
29 | - 1 ReviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
30 | - 2 Asin - ID of the product, e.g. 0000013714
31 | - 3 Reviewer Name - name of the reviewer
32 | - 4 Helpful - helpfulness rating of the review, e.g. 2/3
33 | - 5 Review Text - text of the review
34 | - 6 Overall - rating of the product
35 | - 7 Summary - summary of the review
36 | - 8 Unix Review Time - time of the review (unix time)
37 | - 9 Review Time - time of the review (raw)
38 |
39 | ## ANALYSIS:-
40 | * Analysis_1 : Sentiment Analysis on Reviews.
41 | * Analysis_2 : Exploratory Analysis.
42 | * Analysis_3 : 'Susan Katz' as 'Point of Interest' with maximum Reviews on Amazon.
43 | * Analysis_4 : 'Bundle' or 'Bought-Together' based Analysis.
44 | * Analysis_5 : Recommender System for Popular Brand 'Rubie's Costume Co'.
45 |
46 | ## DATA PROCESSING:-
47 | #### PRODUCT DATA i.e. ProductSample.json
48 | - Step 1: Reading multiple JSON objects from the single file 'ProductSample.json' and appending each to a list, so that each index of the list holds the content of a single JSON object.
49 | - Step 2: Iterating over the list, loading each index as JSON and extracting its data into a list of tuples containing all the data of the JSON objects. During each iteration, the JSON content is first cleaned into proper JSON format by some replacements.
50 | - Step 3: Creating a DataFrame using the list of tuples obtained in the previous step.
51 |
52 | #### REVIEW DATA i.e. ReviewSample.json
53 | - Step 1: Reading multiple JSON objects from the single file 'ReviewSample.json' and appending each to a list, so that each index of the list holds the content of a single JSON object.
54 | - Step 2: Iterating over the list, loading each index as JSON and extracting its data into a list of tuples containing all the data of the JSON objects.
55 | - Step 3: Creating a DataFrame using the list of tuples obtained in the previous step.
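The three steps above can be sketched with the standard library plus pandas. The field names in the toy lines below are illustrative, not the dataset's exact schema.

```python
import json
import pandas as pd

# Three reviews in the one-JSON-object-per-line layout of 'ReviewSample.json'
# (field names here are illustrative).
raw = """{"reviewerID": "A1", "asin": "p1", "overall": 5.0}
{"reviewerID": "A2", "asin": "p1", "overall": 2.0}
{"reviewerID": "A1", "asin": "p2", "overall": 4.0}"""

# Steps 1-2: read each line, parse it as JSON, collect tuples of fields.
rows = [(d["reviewerID"], d["asin"], d["overall"])
        for d in (json.loads(line) for line in raw.splitlines())]

# Step 3: build the DataFrame from the list of tuples.
dataset = pd.DataFrame(rows, columns=["Reviewer_ID", "Asin", "Rating"])
print(dataset)
```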
56 |
57 |
58 | ## ANALYSIS 1: SENTIMENT ANALYSIS ON REVIEWS (1999-2014)
59 | - Wordcloud of summary section of 'Positive' and 'Negative' Reviews on Amazon.
60 | - VADER (Valence Aware Dictionary and Sentiment Reasoner) Sentiment analysis tool was used to calculate the sentiment of reviews.
61 | - Sentiment distribution (positive, negative and neutral) across each product along with their names mapped with the product database 'ProductSample.json'.
62 | - List of products with most number of positive, negative and neutral Sentiment (3 Different list).
63 | - Percentage distribution of positive, neutral and negative in terms of sentiments.
64 | - Sentiment distribution across the Year.
65 | ------
66 | #### WordCloud of summary section of 'Positive' and 'Negative' Reviews on Amazon.
67 | - Cleaning (data processing) was performed on the 'ReviewSample.json' file and the data was imported as a pandas DataFrame.
68 | - Created a function to calculate sentiments using the Vader Sentiment Analyzer and the Naive Bayes Analyzer.
69 | - The Vader Sentiment Analyzer was used at the final stage, since its output was much faster and more accurate.
70 | - Only took 100,000 (1 lakh) reviews into consideration for sentiment analysis so that the Jupyter notebook doesn't crash.
71 | - The sentiment value was calculated for each review and stored in the new column 'Sentiment_Score' of the DataFrame.
72 | - Separated negative and positive Sentiment_Score rows into different DataFrames for creating a 'Wordcloud'.
73 | - A stemming function was created for stemming different forms of words, used by the 'create_Word_Corpus()' function. PorterStemmer from nltk.stem was used for stemming.
74 | - Function 'create_Word_Corpus()' was created to generate a Word Corpus.
75 | ###### To Generate a word corpus following steps are performed inside the function 'create_Word_Corpus(df)'
76 | - Step 1 :- Iterating over the 'summary' section of reviews such that we only get important content of a review.
77 | - Step 2 :- Converting the content into Lowercase.
78 | - Step 3 :- Using nltk.tokenize to get words from the content.
79 | - Step 4 :- Using string.punctuation to get rid of punctuations.
80 | - Step 5 :- Using stopwords from nltk.corpus to get rid of stopwords.
81 | - Step 6 :- Stemming of Words.
82 | - Step 7 :- Finally forming a word corpus and returning the word corpus.
83 | - Function 'plot_cloud()' was defined to plot cloud.
84 |
85 | ##### WordCloud of 'Summary' section of Positive Reviews.
86 | 
87 | ##### WordCloud of 'Summary' section of Negative Reviews.
88 | 
89 |
90 | #### SENTIMENT DISTRIBUTION ACROSS EACH PRODUCT ALONG WITH THEIR NAMES MAPPED FROM PRODUCT DATABASE.
91 | - Cleaning(Data Processing) was performed on 'ProductSample.json' file and importing the data as pandas DataFrame.
92 | - Took only those columns which were required further down the Analysis such as 'Asin' and 'Sentiment_Score'.
93 | - Used groupby on 'Asin' and 'Sentiment_Score' and calculated the count of products with positive, negative and neutral sentiment scores.
94 | - DataFrame manipulations were performed to get the desired DataFrame.
95 | - Sorted the rows in ascending order of 'Asin' and assigned the result to another DataFrame 'x1'.
96 | - Product Asin and Title were assigned to 'x2', a copy of the DataFrame 'Product_dataset' (product database).
97 | - Merged 2 Dataframes 'x1' and 'x2' on common column 'Asin' to map product 'Title' to respective product 'Asin' using 'inner' type.
98 | - Took all the data such as Asin, Title, Sentiment_Score and Count into .csv file
99 | - (path : Final/Analysis/Analysis_1/Sentiment_Distribution_Across_Product.csv)
100 |
101 | #### LIST OF PRODUCTS WITH MOST NUMBER OF POSITIVE, NEGATIVE AND NEUTRAL SENTIMENT (3 DIFFERENT LIST).
102 | - Segregated reviews based on their Sentiment_Score into 3 different DataFrames (positive, negative and neutral), obtained in the earlier step.
103 | - Sorted the above result in descending order of count.
104 | - Dropped the unwanted column 'index'.
105 | - Took all the data such as Asin, Title, Sentiment_Score and Count for 3 into .csv file
106 | - (path : '../Analysis/Analysis_1/Positive_Sentiment_Max.csv'),
107 | - (path : '../Analysis/Analysis_1/Negative_Sentiment_Max.csv'),
108 | - (path : '../Analysis/Analysis_1/Neutral_Sentiment_Max.csv')
109 |
110 | #### PERCENTAGE DISTRIBUTION OF POSITIVE, NEUTRAL AND NEGATIVE IN TERMS OF SENTIMENTS.
111 | - Took summation of count column to get the Total count of Reviews under Consideration.
112 | - Percentage was calculated for positive, negative and neutral and was stored into a new column 'Percentage' of data frame.
113 | - Taking all the data such as Sentiment_Score, Count and Percentage into .csv file
114 | - (path : '../Analysis/Analysis_1/Sentiment_Percentage.csv')
115 |
116 | #### SENTIMENT DISTRIBUTION ACROSS THE YEAR
117 | - Converted the data type of the 'Review_Time' column in the DataFrame 'Selected_Rows' to datetime format.
118 | - Created an additional column 'Month' in DataFrame 'Selected_Rows' by taking the month part of the 'Review_Time' column.
119 | - Created an additional column 'Year' in DataFrame 'Selected_Rows' by taking the year part of the 'Review_Time' column.
120 | - Grouped on the basis of 'Year' and 'Sentiment_Score' to get the respective count.
121 | - Segregated rows based on their Sentiments by year.
122 | - Got the total count including positive, negative and neutral to get the Total count of Reviews under Consideration for each year.
123 | - Merged the dataframe with total count to individual sentiment count to get percentage.
124 | - Calculated the Percentage to find a trend for sentiments.
125 | - Took all the data such as Year, Sentiment_Score, Count, Total_Count and Percentage for 3 into .csv file
126 | - (path : '../Analysis/Analysis_1/Pos_Sentiment_Percentage_vs_Year.csv')
127 | - (path : '../Analysis/Analysis_1/Neg_Sentiment_Percentage_vs_Year.csv')
128 | - (path : '../Analysis/Analysis_1/Neu_Sentiment_Percentage_vs_Year.csv')
129 | - Bar-Chart to know the Trend for Percentage of Positive, Negative and Neutral Review over the years based on Sentiments.
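The count-merge-percentage steps above can be sketched in pandas. Toy counts only; the DataFrame name 'Selected_Rows' follows the readme and the values are illustrative.

```python
import pandas as pd

# Toy reviews with year and sentiment (values are illustrative).
Selected_Rows = pd.DataFrame({
    "Year": [2013, 2013, 2013, 2014, 2014],
    "Sentiment_Score": ["pos", "neg", "pos", "pos", "neu"],
})

# Count per (Year, Sentiment_Score), plus the yearly total.
counts = (Selected_Rows.groupby(["Year", "Sentiment_Score"])
          .size().reset_index(name="Count"))
totals = Selected_Rows.groupby("Year").size().reset_index(name="Total_Count")

# Merge totals back and compute the percentage trend.
trend = counts.merge(totals, on="Year")
trend["Percentage"] = trend["Count"] / trend["Total_Count"] * 100
print(trend)
```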
130 | ##### Bar-Chart to know the Trend for Percentage of Positive Review over the years based on Sentiments.
131 | 
132 |
133 | ##### Bar-Chart to know the Trend for Percentage of Negative Review over the years based on Sentiments.
134 | 
135 |
136 | ##### Bar-Chart to know the Trend for Percentage of Neutral Review over the years based on Sentiments.
137 | 
138 |
139 | ## CONCLUSION OF ANALYSIS 1:-
140 | - From Positive WordCloud
141 | - Popular words used to describe the products were love, perfect, nice, good, best, great, etc.
142 | - Many people who reviewed were happy with the price of the products sold on Amazon.
143 | - Much talked-about products were watch, bra, jacket, bag, costume, etc.
144 | - From Negative WordCloud
145 | - Popular words used to describe the products were dissapoint, badfit, terrible, defect, return, etc.
146 | - Much talked products were shoes, watch, bra, batteries, etc.
147 | - Popular product in terms of sentiments for following
148 | - Positive:
149 | - Converse Unisex Chuck Taylor Classic Colors Sneaker, Number of positive reviews:953
150 | - Converse Unisex Chuck Taylor All Star Hi Top Black Monochrome Sneaker, Number of positive reviews:932
151 | - Yaktrax Walker Traction Cleats for Snow and Ice, Number of positive reviews:676
152 | - Negative:
153 | - Yaktrax Walker Traction Cleats for Snow and Ice, Number of negative reviews:65
154 | - Converse Unisex Chuck Taylor Classic Colors Sneaker, Number of negative reviews:44
155 | - Converse Unisex Chuck Taylor All Star Hi Top Black Monochrome Sneaker, Number of negative reviews:44
156 | - Neutral:
157 | - Converse Unisex Chuck Taylor Classic Colors Sneaker, Number of neutral reviews:313
158 | - Yaktrax Walker Traction Cleats for Snow and Ice,Number of neutral reviews:253
159 | - Converse Unisex Chuck Taylor All Star Hi Top Black Monochrome Sneaker,Number of neutral reviews:247
160 | - Sentiment Percentage Distribution.
161 | - Positive: 72.7 %
162 | - Negative: 5 %
163 | - Neutral: 22.3 %
164 | - Overall sentiment for reviews on Amazon is on the positive side, as there are very few negative sentiments.
165 | - Trend for percentage of reviews over the years
166 | - The positive review percentage has been fairly consistent between 70-80% throughout the years.
167 | - Negative reviews have been decreasing over the last three years; maybe Amazon worked on its services and faults.
168 |
169 | # ANALYSIS 2: EXPLORATORY ANALYSIS OF PRODUCT AND REVIEWS (1999-2014)
170 | - Number of Reviews over the years.
171 | - Number of Reviews by month over the years.
172 | - Distribution of 'Overall Rating' for 2.5 million 'Clothing Shoes and Jewellery' reviews on Amazon.
173 | - Yearly average 'Overall Ratings' over the years.
174 | - Distribution of helpfulness on 'Clothing Shoes and Jewellery' reviews on Amazon.
175 | - Distribution of length of reviews on Amazon.
176 | - Distribution of product prices of 'Clothing Shoes and Jewellery' category on Amazon.
177 | - Distribution of 'Overall Rating' of Amazon 'Clothing Shoes and Jewellery'.
178 | - Product Price V/S Overall Rating of reviews written for products.
179 | - Average Review Length V/S Product Price for Amazon products.
180 | - Distribution of 'Number of Reviews' written by each of the Amazon 'Clothing Shoes and Jewellery' user.
181 | - Distribution of 'Average Rating' written by each of the Amazon 'Clothing Shoes and Jewellery' users.
182 | - Distribution of Helpfulness of reviews written by Amazon 'Clothing Shoes and Jewellery' users.
183 | - Average Rating V/S Avg Helpfulness written by Amazon 'Clothing Shoes and Jewellery' user
184 | - Helpfulness VS Average Length of reviews written by Amazon 'Clothing Shoes and Jewellery' users.
185 | -----------------------------------
186 | #### Number of Reviews over the years.
187 | - Cleaning (data processing) was performed on the 'ReviewSample.json' file and the data was imported as a pandas DataFrame.
188 | - Converted the data type of the 'Review_Time' column in the DataFrame 'dataset' to datetime format.
189 | - Created an additional column 'Month' in DataFrame 'dataset' by taking the month part of the 'Review_Time' column.
190 | - Created an additional column 'Year' in DataFrame 'dataset' by taking the year part of the 'Review_Time' column.
191 | - Grouping by year and taking the count of reviews for each year.
192 | - Took Data to .csv file
193 | - (path : '../Analysis/Analysis_2/Year_VS_Reviews.csv')
194 | - Line Plot for number of reviews over the years.
195 | 
196 |
197 | #### Number of Reviews by month over the years.
198 | - Grouping on Month and getting the count.
199 | - Replacing digits of 'Month' column in 'Monthly' dataframe with words using 'Calendar' library.
200 | - Took Data to .csv file
201 | - (path : '../Analysis/Analysis_2/Month_VS_Reviews.csv')
202 | - Line Plot for number of reviews over the years.
203 | 
204 |
205 |
206 | #### Distribution of 'Overall Rating' for 2.5 million 'Clothing Shoes and Jewellery' reviews on Amazon.
207 | - Grouping on 'Rating' and getting the count.
208 | - Took Data to .csv file
209 | - (path : '../Analysis/Analysis_2/Rating_VS_Reviews.csv')
210 | - Bar Chart Plot for Distribution of Rating.
211 | 
212 |
213 | #### Yearly average 'Overall Ratings' over the years.
214 | - Grouping on Year and getting the mean.
215 | - Calculating the Moving Average with a window of '3' to confirm the trend.
216 | - Took Data to .csv file
217 | - (path : '../Analysis/Analysis_2/Yearly_Avg_Rating.csv')
218 | - Bar Chart Plot for Yearly Average Rating.
219 | 
220 |
221 |
222 | #### Distribution of helpfulness on 'Clothing Shoes and Jewellery' reviews on Amazon.
223 | - Only taking the required columns and converting their data type.
224 | - Calculating the helpfulness percentage and replacing NaN with 0.
225 | - Creating an interval of 10 for the percentage value.
226 | - Took Data to .csv file
227 | - (path : '../Analysis/Analysis_2/Helpfuness_Percentage_Distribution.csv')
228 | - Bar Chart Plot for Distribution of Helpfulness.
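A minimal sketch of the helpfulness-percentage computation, assuming 'helpful_UpVote' and 'Total_Votes' columns as described (the vote counts are invented):

```python
import pandas as pd

# Illustrative vote counts; the real columns come from ReviewSample.json
df = pd.DataFrame({"helpful_UpVote": [3, 0, 9], "Total_Votes": [4, 0, 10]})

# Helpfulness percentage; 0/0 yields NaN, which is replaced with 0
df["Percentage"] = (df["helpful_UpVote"] / df["Total_Votes"] * 100).fillna(0)

# Bucket the percentages into intervals of 10 (0-10, 10-20, ..., 90-100)
df["Interval"] = pd.cut(df["Percentage"], bins=range(0, 101, 10), include_lowest=True)
distribution = df["Interval"].value_counts(sort=False)
```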
229 | 
230 |
231 | #### Distribution of length of reviews on Amazon.
232 | - Creating a new Data frame with 'Reviewer_ID','Reviewer_Name' and 'Review_Text' columns.
233 | - Counting the number of words using 'len(x.split())'.
234 | - Counting the number of characters using 'len(x)'.
235 | - Creating an interval of 100 for the character and word length values.
236 | - Took Data to .csv file
237 | - (path : '../Analysis/Analysis_2/Character_Length_Distribution.csv')
238 | - (path : '../Analysis/Analysis_2/Word_Length_Distribution.csv')
239 | - Bar Plot for distribution of Character Length of reviews on Amazon
240 | 
241 |
242 | - Bar Plot for distribution of Word Length of reviews on Amazon
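The length counting can be sketched as below; the review texts are invented for illustration:

```python
import pandas as pd

# Illustrative review texts; the real data uses the 'Review_Text' column
reviews = pd.DataFrame({"Review_Text": ["Great fit and color", "Too small"]})

# Number of words and number of characters per review
reviews["Word_Count"] = reviews["Review_Text"].apply(lambda x: len(x.split()))
reviews["Char_Count"] = reviews["Review_Text"].apply(lambda x: len(x))
```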
243 | 
244 |
245 | #### Distribution of product prices of 'Clothing Shoes and Jewellery' category on Amazon.
246 | - Cleaning (data processing) was performed on the 'ProductSample.json' file and the data was imported as a pandas DataFrame.
247 | - Segregating the product based on price range.
248 | - [0-10]
249 | - [11-50]
250 | - [51-100]
251 | - [101-200]
252 | - [201-500]
253 | - [501-1000]
254 | - Took Data to .csv file
255 | - (path : '../Analysis/Analysis_2/Price_Distribution.csv')
256 | - Bar Chart Plot for Distribution of Price.
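The price segregation can be sketched with 'pd.cut' using the ranges listed above (the sample prices are illustrative):

```python
import pandas as pd

# Illustrative prices; real values come from ProductSample.json
prices = pd.Series([7.99, 25.0, 75.0, 150.0, 300.0, 700.0])

# Segregate products into the price ranges used in the analysis
bins = [0, 10, 50, 100, 200, 500, 1000]
labels = ["0-10", "11-50", "51-100", "101-200", "201-500", "501-1000"]
price_distribution = pd.cut(prices, bins=bins, labels=labels).value_counts(sort=False)
```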
257 |
258 |
259 | #### Distribution of 'Overall Rating' of Amazon 'Clothing Shoes and Jewellery'.
260 | - Grouping on Asin and getting the mean of Rating.
261 | - Took Data to .csv file
262 | - (path : '../Analysis/Analysis_2/Rating_Distribution.csv')
263 | - Bar Chart Plot for Distribution of Rating.
264 | 
265 |
266 | #### Product Price V/S Overall Rating of reviews written for products.
267 | - Merging the 2 DataFrames, 'Product_dataset' and the DataFrame obtained in the above analysis, on the common column 'Asin'.
268 | - Scatter plot for product price v/s overall rating
269 | 
270 |
271 | #### Average Review Length V/S Product Price for Amazon products.
272 | - Creating a new DataFrame with 'Reviewer_ID', 'Reviewer_Name', 'Asin' and 'Review_Text' columns.
273 | - Counting the number of words using 'len(x.split())'.
274 | - Counting the number of characters using 'len(x)'.
275 | - Grouped on 'Asin' and took the mean of the word and character lengths.
276 | - Scatter plot for product price v/s average review length.
277 | 
278 |
279 | #### Distribution of 'Number of Reviews' written by each of the Amazon 'Clothing Shoes and Jewellery' user.
280 | - Grouped on 'Reviewer_ID' and took the count.
281 | - Sorting in descending order of the number of reviews obtained in the previous step.
282 | - Now grouped on the number of reviews and took the count.
283 | - Created an interval of 10 for the plot and took the sum of all the counts using groupby.
284 | - Took Data to .csv file
285 | - (path : '../Analysis/Analysis_2/DISTRIBUTION OF NUMBER OF REVIEWS.csv')
286 | - Scatter Plot for Distribution of Number of Reviews.
287 | 
288 |
289 |
290 | #### Distribution of 'Average Rating' written by each of the Amazon 'Clothing Shoes and Jewellery' users.
291 | - Grouped on 'Reviewer_ID' and took the mean of Rating.
292 | - Scatter Plot for Distribution of Average Rating.
293 | 
294 |
295 |
296 | #### Distribution of Helpfulness of reviews written by Amazon 'Clothing Shoes and Jewellery' users.
297 | - Creating a new DataFrame with 'Reviewer_ID', 'helpful_UpVote' and 'Total_Votes'.
298 | - Calculate percentage using: (helpful_UpVote/Total_Votes)*100
299 | - Grouped on 'Reviewer_ID' and took the mean of the percentage.
300 | - Created an interval of 10 for the percentage.
301 | - Took Data to .csv file
302 | - (path : '../Analysis/Analysis_2/DISTRIBUTION OF HELPFULNESS.csv')
303 | - Bar Chart Plot for DISTRIBUTION OF HELPFULNESS.
304 | 
305 |
306 | #### Average Rating V/S Avg Helpfulness written by Amazon 'Clothing Shoes and Jewellery' user
307 | - Took Data to .csv file
308 | - (path : '../Analysis/Analysis_2/AVERAGE RATING VS AVERAGE HELPFULNESS.csv')
309 | - Scatter Plot.
310 | 
311 |
312 | #### Helpfulness VS Average Length of reviews written by Amazon 'Clothing Shoes and Jewellery' users.
313 | - Took Data to .csv file
314 | - (path : '../Analysis/Analysis_2/HELPFULNESS VS AVERAGE LENGTH.csv')
315 | - Scatter Plot.
316 | 
317 |
318 |
319 | ## CONCLUSION OF ANALYSIS 2:-
320 | - There has been exponential growth for Amazon in terms of reviews, which suggests that sales also increased exponentially. The plot shows a drop for 2014 only because the data runs up to May; even so, those five months already amount to more than half of a full year's count.
321 | - Buyers generally shop more in December and January.
322 | - More than half of the reviews give a 4 or 5 star rating, with relatively few giving 1, 2 or 3 stars.
323 | - The average rating for every year has been above 4, and the moving average confirms the trend.
324 | - The majority of reviews had perfect helpfulness scores. That makes sense: if you’re writing a review (especially a 5 star review), you’re writing with the intent to help other prospective buyers.
325 | - The majority of reviews on Amazon have a length of 100-200 characters or 0-100 words.
326 | - Over 2/3rds of Amazon Clothing items are priced between $0 and $50, which makes sense, as clothing is generally inexpensive.
327 | - The most expensive products have 4-star and 5-star overall ratings.
328 | - Over 95% of the reviewers of Amazon 'Clothing Shoes and Jewellery' left fewer than 10 reviews.
329 | - Reviewers who give a product a 4-5 star rating are more passionate about the product and likely to write better reviews than someone who writes a 1-2 star review.
330 |
331 | # ANALYSIS 3 : POINT OF INTEREST AS THE ONE WITH MAXIMUM NUMBER OF REVIEWS ON AMAZON
332 | - Reviewer with maximum number of reviews.
333 | - Distribution of reviews for 'Susan Katz' based on overall rating (reviewer_id : A1RRMZKOMZ2M7J).
334 | - Distribution of reviews over the years for 'Susan Katz'.
335 | - Percentage distribution of negative reviews for 'Susan Katz', since the count of reviews is dropping post year 2009.
336 | - Lexical density distribution over the year for reviews written by 'Susan Katz'.
337 | - Wordcloud of all important words used in 'Susan Katz' reviews on amazon.
338 | - Number of distinct products reviewed by 'Susan Katz' on amazon.
339 | - Products reviewed by 'Susan Katz'.
340 | - Popular sub-category for 'Susan Katz'.
341 | - Price range in which 'Susan Katz' shops.
342 | -------------------
343 | #### Reviewer with maximum number of reviews.
344 | - Cleaning (data processing) was performed on the 'ReviewSample.json' file and the data was imported as a pandas DataFrame.
345 | - Grouped on 'Reviewer_ID' and getting the count of reviews.
346 | - Sorted in Descending order of 'No_Of_Reviews'
347 | - Took Point_of_Interest DataFrame to .csv file
348 | - (path : '../Analysis/Analysis_3/Most_Reviews.csv')
349 |
350 | #### Distribution of reviews for 'Susan Katz' based on overall rating (reviewer_id : A1RRMZKOMZ2M7J).
351 | - Only took those reviews which were posted by 'Susan Katz'.
352 | - Created a function 'ReviewCategory()' to give a positive, negative or neutral status based on the Overall Rating.
353 | - score >= 4 then positive
354 | - score <= 2 and score > 0 then negative
355 | - score == 3 then neutral
356 | - Calling function 'ReviewCategory()' for each row of DataFrame column 'Rating'.
357 | - Grouped on 'Category' which we got in previous step and getting the count of reviews.
358 | - Bar Plot for Category V/S Count.
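The 'ReviewCategory()' logic above can be sketched as:

```python
def ReviewCategory(score):
    """Map an overall rating to a positive/negative/neutral label."""
    if score >= 4:
        return "pos"
    elif 0 < score <= 2:
        return "neg"
    else:  # score == 3
        return "neu"
```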
359 | 
360 |
361 | #### Distribution of reviews over the years for 'Susan Katz'.
362 | - Grouping on the 'Year' column obtained in a previous step and getting the count of reviews.
363 | - Took Data to .csv file
364 | - (path : '../Analysis/Analysis_3/Yearly_Count.csv')
365 | - Bar Plot to get trend over the years for Reviews Written by 'SUSAN KATZ'
366 | 
367 |
368 | #### Percentage distribution of negative reviews for 'Susan Katz', since the count of reviews is dropping post year 2009.
369 | - Took the count of negative reviews over the years using 'Groupby'.
370 | - Merging the 2 DataFrames for mapping and then calculating the percentage of negative reviews for each year.
371 | - Took Data to .csv file
372 | - (path : '../Analysis/Analysis_3/Negative_Review_Percentage.csv')
373 | - Bar Plot for Year V/S Negative Reviews Percentage
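The merge-and-percentage step can be sketched as follows; the yearly counts are invented for illustration:

```python
import pandas as pd

# Illustrative yearly counts; real values come from grouping Susan Katz's reviews
total = pd.DataFrame({"Year": [2009, 2010], "Total_Reviews": [20, 10]})
negative = pd.DataFrame({"Year": [2009, 2010], "Negative_Reviews": [4, 5]})

# Merge the two frames on 'Year', then compute the negative-review percentage
merged = total.merge(negative, on="Year")
merged["Negative_Percentage"] = merged["Negative_Reviews"] / merged["Total_Reviews"] * 100
```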
374 | 
375 |
376 |
377 | #### Lexical density distribution over the year for reviews written by 'Susan Katz'.
378 | - Lexical words include:
379 | - verbs (e.g. run, walk, sit)
380 | - nouns (e.g. dog, Susan, oil)
381 | - adjectives (e.g. red, happy, cold)
382 | - adverbs (e.g. very, carefully, yesterday)
383 | - Created a function 'LexicalDensity(text)' to calculate Lexical Density of a content.
384 | - Steps involved are as follows:-
385 | - Step 1 :- Converting the content into Lowercase.
386 | - Step 2 :- Using nltk.tokenize to get words from the content.
387 | - Step 3 :- Storing the total word count.
388 | - Step 4 :- Using string.punctuation to get rid of punctuations.
389 | - Step 5 :- Using stopwords from nltk.corpus to get rid of stopwords.
390 | - Step 6 :- Tagging of words and taking the count of words whose tags start with ("NN","JJ","VB","RB"), representing Nouns, Adjectives, Verbs and Adverbs respectively; this is the lexical count.
391 | - Step 7 :- Finally; (lexical count/total count)*100.
392 | - Called Function 'LexicalDensity()' for each row of DataFrame.
393 | - Grouped on 'Year' and getting the average Lexical Density of reviews.
394 | - Took Data to .csv file
395 | - (path : '../Analysis/Analysis_3/Lexical_Density.csv')
396 | - Bar Plot for Year V/S Lexical Density
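A simplified, self-contained sketch of 'LexicalDensity()'. The notebook itself uses nltk tokenization, nltk stopwords and nltk POS tagging (which need downloaded corpora), so the tokenizer and stopword list here are minimal stand-ins:

```python
import string

# Minimal stand-in for nltk.corpus.stopwords
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of"}

def LexicalDensity(text):
    """Percentage of content-carrying (lexical) words in `text`."""
    # Steps 1-2: lowercase and tokenize (whitespace split stands in for nltk)
    words = text.lower().split()
    # Step 4: strip punctuation from each token
    words = [w.strip(string.punctuation) for w in words if w.strip(string.punctuation)]
    total = len(words)  # Step 3: total word count
    if total == 0:
        return 0.0
    # Steps 5-6: dropping stopwords approximates keeping only the
    # noun/adjective/verb/adverb tags that nltk.pos_tag would select
    lexical = [w for w in words if w not in STOPWORDS]
    # Step 7: lexical density as a percentage
    return len(lexical) / total * 100
```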
397 | 
398 |
399 | #### Wordcloud of all important words used in 'Susan Katz' reviews on amazon.
400 | - To generate a word corpus, the following steps are performed inside the function 'create_Word_Corpus(df)':
401 | - Step 1 :- Iterating over the 'summary' section of reviews such that we only get important content of a review.
402 | - Step 2 :- Converting the content into Lowercase.
403 | - Step 3 :- Using nltk.tokenize to get words from the content.
404 | - Step 4 :- Using string.punctuation to get rid of punctuations.
405 | - Step 5 :- Using stopwords from nltk.corpus to get rid of stopwords.
406 | - Step 6 :- tagging of Words using nltk and only allowing words with tag as ("NN","JJ","VB","RB").
407 | - Step 7 :- Finally forming a word corpus and returning the word corpus.
408 | - Generated a WordCloud image
409 | 
410 |
411 | #### Number of distinct products reviewed by 'Susan Katz' on amazon.
412 | - Took the unique Asin from the reviews reviewed by 'Susan Katz' and returned the length.
413 |
414 | #### Products reviewed by 'Susan Katz'.
415 | - Cleaning (data processing) was performed on the 'ProductSample.json' file and the data was imported as a pandas DataFrame.
416 | - Mapping 'Product_dataset' with 'POI' to get the products reviewed by 'Susan Katz'
417 | - Took Data to .csv file
418 | - (path : '../Analysis/Analysis_3/Products_Reviewed.csv')
419 |
420 | #### Popular sub-category for 'Susan Katz'.
421 | - Creating a list of products reviewed by 'Susan Katz'.
422 | - Created a function 'make_flat(arr)' to flatten multilevel list values, used to get sub-categories from a multilevel list.
423 | - Taking the sub-category of each Asin reviewed by 'Susan Katz'.
424 | - Counting the occurrences and taking the top 5.
425 | - Took Data to .csv file
426 | - (path : '../Analysis/Analysis_3/Popular_Sub-Category.csv')
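The 'make_flat(arr)' helper can be sketched as a small recursive flattener (illustrative, matching the described behavior):

```python
def make_flat(arr):
    """Recursively flatten a multilevel list into a single flat list."""
    flat = []
    for item in arr:
        if isinstance(item, list):
            flat.extend(make_flat(item))  # descend into nested lists
        else:
            flat.append(item)
    return flat
```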
427 |
428 | #### Price range in which 'Susan Katz' shops.
429 | - Took the min, max and mean price of all the products by using an aggregation function on the DataFrame column 'Price'.
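The aggregation can be sketched as follows (the prices are illustrative):

```python
import pandas as pd

# Illustrative product prices; real values come from the products reviewed
products = pd.DataFrame({"Price": [6.99, 49.99, 249.99]})

# Min, max and mean via aggregation on the 'Price' column
price_stats = products["Price"].agg(["min", "max", "mean"])
```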
430 |
431 | ## CONCLUSION OF ANALYSIS 3:-
432 | - 'Susan Katz' (reviewer_id : A1RRMZKOMZ2M7J) reviewed the maximum number of products, i.e. 180.
433 | - Susan was happy with products shopped on Amazon only about 50% of the time.
434 |
435 |
436 | | Category | Count |
437 | |------|------|
438 | | neg | 38|
439 | | neu | 51|
440 | | pos | 91|
441 |
442 |
443 |
444 | - The number of reviews by 'Susan Katz' dropped after 2009.
445 | - The ratings from 'Susan Katz' were dropping because Susan was not happy with most of the products she shopped for, i.e. the negative review count increased every year after 2009.
446 | - The average lexical density for 'Susan Katz' has always been under 40%, i.e. her writing lacked content (lexical) words.
447 | - The most popular words used in 'Susan Katz's' reviews were shoes, color, fit, heels, watch, etc.
448 | - The number of distinct products reviewed by 'Susan Katz' on Amazon is 180.
449 | - Popular sub-categories for 'Susan Katz' were Jewelry, Novelty, Costumes & More.
450 | - 'Susan Katz' shopping Price Range
451 | - Minimum: 6.99
452 | - Maximum: 249.99
453 | - Average: 66.01
454 |
455 | # ANALYSIS 4 : 'BUNDLE' OR 'BOUGHT-TOGETHER' BASED ANALYSIS.
456 | - Check for the popular bundle (quantity in a bundle).
457 | - Top 10 Popular brands which sells Pack of 2 and 5, as they are the popular bundles.
458 | - Top 10 Popular Sub-Category with Pack of 2 and 5.
459 | - Checking the number of products the brand 'Rubie's Costume Co' has listed on Amazon, since it has the highest number of bundles in Packs of 2 and 5.
460 | - Minimum, Maximum and Average Selling Price of products sold by the Brand 'Rubie's Costume Co'.
461 | - Top 10 Highest selling product in 'Clothing' Category for Brand 'Rubie's Costume Co'.
462 | - Top 10 most viewed product for brand 'Rubie's Costume Co'.
463 | -----------
464 |
465 | #### Check for the popular bundle (quantity in a bundle).
466 | - Cleaning (data processing) was performed on the 'ProductSample.json' file and the data was imported as a pandas DataFrame.
467 | - Got numerical values for 'Number_Of_Pack' etc. from 'ProductSample.json'.
468 | - Grouped by Number of Pack and getting their respective count.
469 | - Took the data into .csv file
470 | - (path : '../Analysis/Analysis_4/Popular_Bundle.csv')
471 | - Bar Chart was plotted for Number of Packs
472 | 
473 |
474 | #### Top 10 Popular brands which sells Pack of 2 and 5, as they are the popular bundles.
475 | - Got all the Asin for Packs of 2 and 5, which have the highest counts, and stored them in a list 'list_Pack2_5'.
476 | - Got the brand names of those Asin present in the list 'list_Pack2_5'.
477 | - Removed the rows which do not have a brand name.
478 | - Counted the occurrences of brand names and took the top 10 brands.
479 | - Took the data into .csv file
480 | - (path : '../Analysis/Analysis_4/Popular_Brand.csv')
481 | - Bar Chart was plotted for Popular brands.
482 | 
483 |
484 | #### Top 10 Popular Sub-Category with Pack of 2 and 5.
485 | - Got all the Asin for Packs of 2 and 5 and stored them in a list 'list_Pack2_5'.
486 | - Created a function 'make_flat(arr)' to flatten multilevel list values, used to get sub-categories from a multilevel list.
487 | - Got the categories of those Asin present in the list 'list_Pack2_5'.
488 | - Counted the occurrences of sub-categories and took the top 10.
489 |
490 | #### Checking for number of products the brand 'Rubie's Costume Co' has listed on Amazon.
491 | - Got all the products which have the brand name 'Rubie's Costume Co'.
492 | - Took the total count of the products.
493 |
494 | #### Minimum, Maximum and Average Selling Price of products sold by the Brand 'Rubie's Costume Co'.
495 | - Took the min, max and mean price of all the products by using an aggregation function on the DataFrame column 'Price'.
496 |
497 | #### Top 10 Highest selling product in 'Clothing' Category for Brand 'Rubie's Costume Co'.
498 | - Took all the Asin, SalesRank, etc. whose brand is 'RUBIE'S COSTUME CO' from ProductSample.json.
499 | - Created a Data frame for above.
500 | - Sorted DataFrame based on sales Rank.
501 | - Calculated Average selling price for top 10 products.
502 | - Took the data into .csv file
503 | - (path : '../Analysis/Analysis_4/Popular_Product.csv').
504 | - Bar Plot
505 | 
506 |
507 | #### Top 10 most viewed product for brand 'Rubie's Costume Co'.
508 | - From all the Asin, getting all the Asin present in the 'also_viewed' section of the json file.
509 | - Counting the occurrences of Asin for the brand Rubie's Costume Co.
510 | - Creating a DataFrame with Asin and its Views.
511 | - Getting products of the brand Rubie's Costume Co.
512 | - Merging the 2 DataFrames 'views_dataset' and 'view_prod_dataset' such that only the Rubie's Costume Co. products from 'view_prod_dataset' get mapped. An inner merge was performed to keep only products mapped to Rubie's Costume Co.
513 | - Sorting the DataFrame based on the column 'Views'.
514 | - Took the data into .csv file
515 | - (path : '../Analysis/Analysis_4/Most_Viewed_Product.csv')
516 | - Took min, max and mean price of Top 10 products by using aggregation function on data frame column 'Price'
517 | 
518 |
519 | ## CONCLUSION OF ANALYSIS 4:-
520 | - Packs of 2 and 5 were found to be the most popular bundled products.
521 | - 'Rubie's Costume Co' was found to be the most popular brand selling Packs of 2 and 5.
522 | - Women, Novelty Costumes & More, Novelty, etc. are the popular sub-categories in 'Clothing Shoes and Jewellery' on Amazon.
523 | - 'Rubie's Costume Co' has 2175 products listed on Amazon.
524 | - 'Rubie's Costume Co' Selling Price Range
525 | - Minimum: 0.66
526 | - Maximum: 783.01
527 | - Average: 20.43
528 | - Popular products for 'Rubie's Costume Co' were in the price range $5-15, such as
529 | - DC Comics Boys Action Trio Superhero Costume Set
530 | - The Dark Knight Rises Batman Child Costume Kit
531 | - Star Wars Clone Wars Ahsoka Lightsaber, etc.
532 | - Price range of the top 10 highest-selling 'Rubie's Costume Co' products
533 | - Minimum: 4.95
534 | - Maximum: 12.99
535 | - Average: 7.24
536 | - The most viewed products for 'Rubie's Costume Co' were also in the price range $5-15, which confirms the popular product data.
537 |
538 | # ANALYSIS 5 : RECOMMENDER SYSTEM FOR BRAND RUBIE'S COSTUME CO.
539 | - A collaborative filtering algorithm is used to get the recommendations.
540 | - The Recommender System takes a 'Product Name' and, based on the correlation factor, outputs a list of products as suggestions or recommendations.
541 | - Suppose product name 'A' acts as the input parameter, i.e. if a user buys product 'A', the system outputs the products highly correlated with it.
542 | -----------------------------------------
543 |
544 | - Cleaning (data processing) was performed on the 'ProductSample.json' file and the data was imported as a pandas DataFrame.
545 | - Got all the distinct product Asin of brand 'Rubie's Costume Co.' in a list.
546 | - Cleaning (data processing) was performed on the 'ReviewSample.json' file and the data was imported as a pandas DataFrame.
547 | - Created a DataFrame 'Working_dataset' which has products only from the brand "RUBIE'S COSTUME CO.".
548 | - Performed a merge of 'Working_dataset' and 'Product_dataset' to get all the required details together for building the Recommender System.
549 | - Took only the required columns and created a pivot table with index 'Reviewer_ID', columns 'Title' and values 'Rating'.
550 | - Created a function 'pearson(x1,x2)'.
551 | - Function to find the Pearson correlation between two columns or products.
552 | - Produces a result between -1 and 1.
553 | - The function is used within the recommender function 'get_recommendations()'.
554 | - Created a function 'get_recommendations(product_id,M,num)'.
555 | - Function to recommend products based on the correlation between them.
556 | - Takes 3 parameters: 'Product Name', 'Model' and 'Number of Recommendations'.
557 | - Returns a list in descending order of correlation; the list size depends on the input given for Number of Recommendations.
558 | - Created a function 'escape(t)'.
559 | - Function to replace all the HTML escape characters with their respective characters.
560 | - Calling the Recommender System by making a function call to 'get_recommendations('300 Movie Spartan Shield',Model,5)'.
561 | - '300 Movie Spartan Shield' is the product name passed to the function, i.e. if a person buys '300 Movie Spartan Shield', what else can be recommended to him/her.
562 | - 'Model' is passed for the correlation calculation. The Model is the pivot table created previously.
563 | - '5' is the maximum number of recommendations the function can return if there is some correlation.
564 | - The strength of the correlation can be quantified using the correlation value given in the output.
565 | - Taking the recommendations into a DataFrame for tabular representation.
566 | - Taking only those values whose correlation is greater than 0.
567 | - Took all the recommendations into .csv file
568 | - (path : '../Analysis/Analysis_5/Recommendation.csv')
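A minimal, self-contained sketch of the pipeline above. The ratings and titles are invented, and the Pearson step uses pandas' built-in 'corrwith' instead of the hand-written 'pearson(x1,x2)' helper:

```python
import pandas as pd

# Illustrative ratings; the real data is the merged RUBIE'S COSTUME CO. reviews
ratings = pd.DataFrame({
    "Reviewer_ID": ["u1", "u2", "u3"] * 3,
    "Title": ["Shield"] * 3 + ["Helmet"] * 3 + ["Sword"] * 3,
    "Rating": [5, 4, 3, 5, 3, 1, 1, 3, 5],
})

# Pivot table: rows are reviewers, columns are product titles, values are ratings
Model = ratings.pivot_table(index="Reviewer_ID", columns="Title", values="Rating")

def get_recommendations(product_id, M, num):
    """Return the `num` products most correlated with `product_id`."""
    corr = M.corrwith(M[product_id]).drop(product_id)  # Pearson, in [-1, 1]
    return corr.sort_values(ascending=False).head(num)

recommendations = get_recommendations("Shield", Model, 5)
positive_only = recommendations[recommendations > 0]  # keep correlation > 0
```

The real notebook then unescapes the product titles with 'escape(t)' and writes the filtered table to Recommendation.csv.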
569 |
570 | ## CONCLUSION OF ANALYSIS 5:
571 | - When '300 Movie Spartan Shield' is passed to the Recommender System.
572 | - Output:
573 |
574 | | Product Title | Correlation |
575 | |------|------|
576 | | 300 Movie Spartan Deluxe Vinyl Helmet | 0.223277|
577 | | Toynk Toys - 300- Spartan Sword | 0.069275|
578 |
579 |
580 | - The above table is the recommendation for '300 Movie Spartan Shield'.
581 |
582 |
583 | ### Citation
584 | - J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015.
585 | - J. McAuley, R. Pandey, J. Leskovec. Inferring networks of substitutable and complementary products. Knowledge Discovery and Data Mining, 2015.
586 |
--------------------------------------------------------------------------------