├── README.md
└── sentiment-topic_modeling
    ├── .gitignore
    ├── Amazon_Reviews.ipynb
    ├── README.md
    ├── img
        ├── 9-speed-cassette.jpg
        ├── Merica.gif
        ├── Mountain-Bike.jpg
        ├── amazon-logo.jpg
        ├── amazon-stars.png
        ├── amazon_aws-s3.png
        ├── climbing.jpg
        ├── outdoor-exercise.jpg
        ├── pink-mag.jpg
        └── ruko.jpg
    └── load_data.py


/README.md:
--------------------------------------------------------------------------------
![amazon-review](sentiment-topic_modeling/img/amazon-logo.jpg)

## Product Category:
#### Sports & Outdoor Reviews
![](sentiment-topic_modeling/img/outdoor-exercise.jpg)

### Dataset:
Utilized _AWS_ for storage and quick access of data:

![aws-s3](sentiment-topic_modeling/img/amazon_aws-s3.png)

Source :
UC San Diego data repository, curated by Julian McAuley
[[ Link ]](http://jmcauley.ucsd.edu/data/amazon/)

_Dataset size_:
296,337 x 10 [ rows x columns ]

| Column header  | Description                                                 |
| -------------- | ----------------------------------------------------------- |
| asin           | Product ID                                                  |
| summary        | Title of review                                             |
| reviewText     | Written review                                              |
| overall        | Rating 1-5 (stars)                                          |
| reviewerID     | Reviewer ID                                                 |
| reviewerName   | Person's name (no standard format)                          |
| helpful        | Helpfulness rating of the review                            |
| reviewTime     | YYYY-MM-DD                                                  |
| unixReviewTime | Time of the review (unix time)                              |
| pos_neg        | (1) Positive for an overall rating of 4-5 or (2) Negative for 1-3 |


# Project Outline:

Business motivation :
For ecommerce sites and outfitters to stay competitive and innovate, they must be able to draw and hold dedicated customers. One particularly effective approach in recent years has been to build personalized recommendation engines into their platform or interface. Determining the specific topics and sentiments associated with given sports and outdoors products is essential in building such a recommendation engine. The main purpose of this project is to apply the concepts covered in class to a specific domain.

Problem formulation :

1. Explore and process the data to glean basic insights and prepare it for modeling.

2. Find the classification model that works best with the data.

3. Understand the topics and words that describe the broad categories of Sports and Outdoors products sold on Amazon.

4. Given the data and model performance, determine the best course of action going forward.

## Approach:

* EDA

* Preprocessing

* Model data
    1. Classification / Sentiment Analysis
        * Logistic Regression
        * Multinomial Naive Bayes

    2. Clustering / Topic Modeling
        * Nonnegative Matrix Factorization (NMF)
        * Latent Dirichlet Allocation (LDA)

* Summarize Findings and Proposed Further Work



### Conclusion:
- The data appears to be surprisingly biased and imbalanced toward shooting sports. Since this group of activities was not going to be a main focus of the end product, more and different data is needed to build an appropriate model for the end goal.

- Aside from the data itself, here is a summary of the modeling results:

```
Classification Summary:
  * Logistic Regression (using CountVectorizer) performed best, with an F1 score of 94%.
  * Multinomial NB with Tfidf was a close second, with an F1 score of 92%.

Clustering Summary:
  * NMF and LDA with term frequency performed about the same, and both were just ok.
  * NMF with Tfidf was the best: it produced no obscure topics and even correctly associated a topic of words with a specific brand (Nalgene).
  * LDA with Tfidf primarily retrieved unassociated words; however, it did return the most specific and unique words of all the models (e.g. brand names).
```

#### Further work:

- Build a web scraper utilizing Beautiful Soup to gather more appropriate, unbiased reviews from a sports outlet like [ REI ](https://www.rei.com/) or [ Dick's Sporting Goods ](http://www.dickssportinggoods.com/home/index.jsp?ab=Header_DicksLogo).

- Focus on categorizing by Sports and Outdoor activities in order to build a better classification model that excludes shooting sports

- Incorporate word2vec or LDA2vec

--------------------------------------------------------------------------------
/sentiment-topic_modeling/.gitignore:
--------------------------------------------------------------------------------
# Compressed archives
*.tar.gz

# Jupyter
*.ipynb_checkpoints

# Python bytecode and caches
*.pyc
__pycache__

.DS_Store

--------------------------------------------------------------------------------
/sentiment-topic_modeling/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/README.md
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/9-speed-cassette.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/9-speed-cassette.jpg
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/Merica.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/Merica.gif
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/Mountain-Bike.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/Mountain-Bike.jpg
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/amazon-logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/amazon-logo.jpg
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/amazon-stars.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/amazon-stars.png
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/amazon_aws-s3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/amazon_aws-s3.png
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/climbing.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/climbing.jpg
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/outdoor-exercise.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/outdoor-exercise.jpg
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/pink-mag.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/pink-mag.jpg
--------------------------------------------------------------------------------
/sentiment-topic_modeling/img/ruko.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeremygrace/amazon-reviews/d1faf18f6fbd33fe766f9ba4463a4e5da03a41bc/sentiment-topic_modeling/img/ruko.jpg
--------------------------------------------------------------------------------
/sentiment-topic_modeling/load_data.py:
--------------------------------------------------------------------------------
import os
from urllib.request import urlretrieve


# Remote location of the dataset and its local destination
url = "https://s3-us-west-2.amazonaws.com/msds-projects/data/"
path = "./reviews_Sports_and_Outdoors_5.json.gz"
filename = "reviews_Sports_and_Outdoors_5.json.gz"


def retrieve_data(filename, url, path):
    """Download `filename` from `url` to `path` unless `path` already exists."""
    if not os.path.exists(path):
        # Save to `path` so the existence check and the download target agree
        urlretrieve(url + filename, path)
        print("Data successfully retrieved!")
    else:
        print("Your data already exists in the directory. Enjoy.")


if __name__ == "__main__":
    retrieve_data(filename, url, path)

--------------------------------------------------------------------------------
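Once `load_data.py` has fetched the archive, the reviews still need to be read and given the `pos_neg` label described in the README's column table. The following is a minimal sketch of that step, not the notebook's actual code. It assumes the archive holds one JSON object per line and that `pos_neg` uses the encoding from the table (1 for ratings of 4-5, 2 for ratings of 1-3). The helper names `label_sentiment` and `read_reviews` are hypothetical and do not exist in the repository.

```python
import gzip
import json
import os


def label_sentiment(overall):
    """Map a star rating to the README's pos_neg encoding (assumed):
    1 = positive (4-5 stars), 2 = negative (1-3 stars)."""
    return 1 if overall >= 4 else 2


def read_reviews(path):
    """Yield one review dict per non-empty line, adding a derived `pos_neg` field."""
    # Assumption: the gzip archive stores one JSON object per line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            review = json.loads(line)
            review["pos_neg"] = label_sentiment(review["overall"])
            yield review


if __name__ == "__main__":
    # Same local path that load_data.py downloads to
    path = "./reviews_Sports_and_Outdoors_5.json.gz"
    if os.path.exists(path):
        reviews = list(read_reviews(path))
        print(len(reviews), "reviews loaded")
```

Reading line by line with a generator keeps memory use low until the caller decides how to materialize the rows, for example by passing them to a DataFrame constructor.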