├── .gitignore
├── README.md
└── word_embeddings.ipynb

/.gitignore:
--------------------------------------------------------------------------------
glove.6B/
MANIFEST
build
dist
_build
*.csv
*.json
.~lock.reduced_review.csv#
docs/source/interactive/magics-generated.txt
docs/source/config/shortcuts/*.csv
docs/gh-pages
jupyter_notebook/notebook/static/mathjax
jupyter_notebook/static/style/*.map
*.py[co]
__pycache__
*.egg-info
*~
*.bak
.ipynb_checkpoints
.tox
.DS_Store
\#*#
.#*
.cache
.coverage
*.swp
sudhir_notebook.ipynb
predict_driver-Copy1.ipynb
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Classify Yelp reviews
Classify Yelp round-10 reviews/comments


# Basic Information

In this project, I classify the Yelp round-10 review dataset. The reviews contain a lot of metadata that can be mined and used to infer meaning, business attributes, and sentiment. For simplicity, I classify the review comments into two classes: positive or negative. Reviews with a star rating higher than three are regarded as positive, while reviews rated three stars or fewer are regarded as negative. The task is therefore supervised binary classification. To build and train the models, I first tokenize the text and convert each review into a sequence of integers. Each review comment is limited to 50 words, so shorter texts are padded with zeros and longer ones are truncated (a minimal preprocessing sketch follows below). After processing the review comments, I train three models in three different ways:
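
The sketch below illustrates the preprocessing step described above, assuming a Keras-style pipeline (GloVe vectors in the `.gitignore` suggest a Keras/TensorFlow setup, but this is an assumption). The vocabulary size, variable names, example reviews, and padding direction are all illustrative choices, not values taken from the notebook.

```python
# Hypothetical preprocessing sketch (assumes TensorFlow/Keras is installed);
# VOCAB_SIZE and the sample reviews are illustrative, not from the project.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 50        # each review is limited to 50 words (per the README)
VOCAB_SIZE = 20000  # assumed vocabulary cap

reviews = [
    "The food was amazing and the staff were friendly.",
    "Terrible service, I will never come back.",
]
stars = [5, 1]

# Label rule from the README: stars > 3 -> positive (1), stars <= 3 -> negative (0)
labels = [1 if s > 3 else 0 for s in stars]

# Build the vocabulary and turn each review into a sequence of word indices
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(reviews)
sequences = tokenizer.texts_to_sequences(reviews)

# Zero-pad short reviews and truncate long ones to exactly MAX_LEN tokens
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post", truncating="post")
print(padded.shape)  # (2, 50)
```

Note that Keras pads and truncates at the front by default; `padding="post"` and `truncating="post"` are shown here as one plausible choice, since the README only says that short reviews are zero-padded and long ones truncated.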