├── .github
    └── FUNDING.yml
├── README.md
├── NLP using Glove and Spacy .ipynb
├── EDAonYelpReviews.ipynb
├── cnn-lstm-with-doc2vec-plus-tf-idf.ipynb
├── NLP_using_Glove_and_Spacy_.ipynb
├── basicMLmodelsDecepOpSpam.ipynb
└── test
    └── EDAonYelpReviews.ipynb

/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 |
3 | github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [CoreView Systems, user2]
4 | patreon: # Replace with a single Patreon username
5 | open_collective: # Replace with a single Open Collective username
6 | ko_fi: # Replace with a single Ko-fi username
7 | tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
8 | community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
9 | liberapay: # Replace with a single Liberapay username
10 | issuehunt: # Replace with a single IssueHunt username
11 | otechie: # Replace with a single Otechie username
12 | custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']
13 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Detection of Fake Reviews on Online Review Platforms using Deep Learning Architectures
2 | [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unsupervised-data-augmentation/sentiment-analysis-on-yelp-fine-grained)](https://paperswithcode.com/sota/sentiment-analysis-on-yelp-fine-grained?p=unsupervised-data-augmentation)
3 |
4 | Dataset: https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
5 | https://www.kaggle.com/rtatman/deceptive-opinion-spam-corpus
6 |
7 | The data includes 1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity.
8 |
9 |
10 |
11 | **Also, if you happen to refer to my work, a citation would do wonders for me. Thanks!**
12 |
13 | ```text
14 | Salunkhe, Ashish. "Attention-based Bidirectional LSTM for Deceptive Opinion Spam Classification." arXiv preprint arXiv:2112.14789 (2021).
15 | ```
16 |
17 |
18 |
19 |
20 |
21 |
22 | The following implementations are done:
23 | 1. Bidirectional LSTM with GloVe 50D word embeddings
24 | 2. LSTM with GloVe 100D word embeddings
25 | 3. LSTM with GloVe 300D word embeddings
26 | 4. CNN-LSTM with Doc2Vec and TF-IDF
27 | 5. Attention mechanism with GloVe 100D word embeddings
28 | 6. Logistic Regression
29 | 7. Multinomial Naive Bayes
30 | 8. Support Vector Machine - Stochastic Gradient Descent (SGD)
31 |
32 | The results obtained were as follows:
33 |
34 |
35 | | Sr. No. | Model | Accuracy (%) | Precision Score | Recall Score | F1 Score |
36 | | ----- | ----- | ----------------- | ---------------- |-------------|------------|
37 | | 1 | MultinomialNB | 90.25 | 0.9325 | 0.8601 | 0.8948 |
38 | | 2 | Stochastic Gradient Descent (SGD) | 87.75 | 0.8913 | 0.8497 | 0.8700 |
39 | | 3 | Logistic Regression | 87.00 | 0.8691 | 0.8601 | 0.8645 |
40 | | 4 | Support Vector Machine | 56.25 | 0.525 | 0.9792 | 0.6835 |
41 | | 5 | Gaussian Naive Bayes | 63.5 | 0.6424 | 0.6169 | 0.6294 |
42 | | 6 | K-Nearest Neighbour | 57.5 | 0.8604 | 0.1840 | 0.3032 |
43 | | 7 | Decision tree | 68.5 | 0.6681 | 0.7412 | 0.7028 |
44 |
45 | | Model | Training accuracy (%) | Testing accuracy (%) |
46 | | ----- | ----------------- | ---------------- |
47 | | Bidirectional LSTM + GloVe(50D) | 92.17 | 88.13 |
48 | | LSTM + GloVe(100D) | 99.18 | 85.75 |
49 | | CNN + LSTM + Doc2Vec + TF-IDF | 96.23 | 92.19 |
50 | | CNN + Attention + GloVe(100D) | 99.00 | 90.25 |
51 | | BiLSTM + Attention + GloVe(100D) | 99.18 | 89.27 |
52 | | CNN + BiLSTM + Attention + GloVe(100D) | 99.75 | 81.25 |
53 | | LogisticRegression + TF-IDF | 99.11 | 87.21 |
54 |
55 | Future work includes improving the attention layer to raise testing accuracy. BERT and XLNet could also be implemented to improve performance further.
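
As a quick sanity check on the first results table: F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the two columns before it. The sketch below is only an illustration. The helper name `f1_from_pr` and the `rows` list are assumptions made for this example and do not appear in the notebooks; the numbers are copied from the table.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, precision, recall, reported F1), copied from the results table
rows = [
    ("MultinomialNB", 0.9325, 0.8601, 0.8948),
    ("Stochastic Gradient Descent (SGD)", 0.8913, 0.8497, 0.8700),
    ("Logistic Regression", 0.8691, 0.8601, 0.8645),
    ("Support Vector Machine", 0.525, 0.9792, 0.6835),
    ("Gaussian Naive Bayes", 0.6424, 0.6169, 0.6294),
    ("K-Nearest Neighbour", 0.8604, 0.1840, 0.3032),
    ("Decision tree", 0.6681, 0.7412, 0.7028),
]

for name, precision, recall, reported in rows:
    recomputed = f1_from_pr(precision, recall)
    print(f"{name}: recomputed F1 = {recomputed:.4f}, reported F1 = {reported:.4f}")
```

Small differences in the fourth decimal place are expected, because the precision and recall values in the table are themselves rounded.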
56 | -------------------------------------------------------------------------------- /NLP using Glove and Spacy .ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_cell_guid":"e35c5b34-ef39-4431-9bfd-4b438b6d072b","_uuid":"88c8394e12ad493556c3ec9581e8aec871dad89e"},"cell_type":"markdown","source":""},{"metadata":{"_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","trusted":false,"collapsed":true},"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load in \n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n\n# Input data files are available in the \"../input/\" directory.\n# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n\nimport os\nprint(os.listdir(\"../input\"))\n\n# Any results you write to the current directory are saved as output.","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","trusted":false,"collapsed":true},"cell_type":"code","source":"import pandas as pd,numpy as np,seaborn as sns\nfrom keras.preprocessing.text import Tokenizer\nfrom keras.preprocessing.sequence import pad_sequences\nfrom keras.utils import to_categorical\nimport matplotlib.pyplot as plt\n%matplotlib inline\nimport spacy","execution_count":null,"outputs":[]},{"metadata":{"scrolled":true,"_cell_guid":"721878c6-e111-4559-9e9c-706052f573d1","_uuid":"4e095dc26e6699de771fcb27bdcd3ba177c7320d","trusted":false,"collapsed":true},"cell_type":"code","source":"yelp_reviews = 
pd.read_csv('../input/yelp-dataset/yelp_review.csv',nrows=10000)\nyelp_reviews.head()","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"87c3b275-5a05-4f7e-907c-336190399694","_uuid":"5782e772b4a9df661de0d74277388a0d7b686bbf","trusted":false,"collapsed":true},"cell_type":"code","source":"yelp_reviews.columns","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"231b382c-7455-4664-8aa4-48c0ce12b620","_uuid":"9d79fd330b98cc60622ea42891d4fcec14ba3099","trusted":false,"collapsed":true},"cell_type":"code","source":"yelp_reviews=yelp_reviews.drop(['review_id','user_id','business_id','date','useful','funny','cool'],axis=1)\nyelp_reviews.head()","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"3ddd726c-1fa3-45ff-94fb-67a7cdaf096f","_uuid":"52162971a1b62d874389eef678996c6792cf385b","trusted":false,"collapsed":true},"cell_type":"code","source":"yelp_reviews.isnull().any()","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"7a262c5e-587c-4ea0-9c73-eb317ab1f9c3","_uuid":"0fe4ce98d33e813e7151f5b40ee50f0999d7babf","trusted":false,"collapsed":true},"cell_type":"code","source":"yelp_reviews.stars.unique()","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"da052314-149a-4900-9620-c37b8b1101c5","_uuid":"33988b150c9233a825d8473b19b59de24c2a2652","trusted":false,"collapsed":true},"cell_type":"code","source":"sns.countplot(yelp_reviews.stars)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"bfa44352-556e-4c88-9818-37852215b2bc","_uuid":"a2ee8b2d45dc7f5a3531ba055a263c151315f7f4","trusted":false,"collapsed":true},"cell_type":"code","source":"yelp_reviews.stars.mode()","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"ff307ae5-24f5-423e-9941-7a3b8fca36ca","_uuid":"6e45e353693a534dcacad245f48a326df97a7eeb","trusted":false},"cell_type":"code","source":"reviews = 
yelp_reviews[yelp_reviews.stars!=3]\n","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"b5761e42-7942-4e1e-b48e-771904585dc8","_uuid":"a0d93fe4c2c810c5e63c068160d43f4def79510d","trusted":false,"collapsed":true},"cell_type":"code","source":"reviews['label'] = reviews['stars'].apply(lambda x: 1 if x>3 else 0)\nreviews = reviews.drop('stars',axis=1)\nreviews.head()","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"3ab1cb7f-81c2-4c06-beb2-34bed9a9ff3d","_uuid":"e125ec4b93624993ae4582a0126ba79a610950a4","trusted":false,"collapsed":true},"cell_type":"code","source":"reviews.shape","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"ae7f5373-b105-44b0-b0f2-a77a9a3e03d1","_uuid":"fb974ff724f6c730ea91cd0f1a124768a2c2762e","trusted":false},"cell_type":"code","source":"text = reviews.text.values\nlabel = reviews.label.values","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"b1cd2e70-9100-43a9-951c-0147f8165e8b","_uuid":"9f7126c5fad222ad73bce5ae6f4f8be47aa0bb74","trusted":false},"cell_type":"code","source":"nlp = spacy.load('en')","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"32152a68-842b-4d90-bfea-eab60b912942","_uuid":"e7238b2bf5e3ad7d1ba64d1afc38c4ca7cf14874","trusted":false,"collapsed":true},"cell_type":"code","source":"text[0]","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"ef5b3d18-e94a-4ad8-860d-4e1642534dee","_uuid":"6ff0249c86878e3169be5fd0695161765f254fd3","trusted":false,"collapsed":true},"cell_type":"code","source":"parsed_text = nlp(text[0])\nparsed_text","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"1508dcff-7752-4571-bf8a-b96aad36bf8d","_uuid":"3e0a2249ec08a3aaeb3552ee791ebd7a675d45f3","trusted":false,"collapsed":true},"cell_type":"code","source":"for i,sentance in enumerate(parsed_text.sents):\n 
print(i,':',sentance)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"f8dfc439-eeae-4d7b-9a39-352c48c9c7d1","_uuid":"726fa40e5c71128880e9252ed28a03106a6169af","trusted":false,"collapsed":true},"cell_type":"code","source":"for num, entity in enumerate(nlp(text[10]).ents):\n print ('Entity {}:'.format(num + 1), entity, '-', entity.label_)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"659f54d3-86a9-478d-80f0-a8afeed2b81e","_uuid":"8eace05f587db1f3d2f5e7cb21f9d66d526a8d6e","trusted":false,"collapsed":true},"cell_type":"code","source":"token_pos = [token.pos_ for token in nlp(text[10])]\ntokens = [token for token in nlp(text[10])]\nsd = list(zip(tokens,token_pos))\nsd = pd.DataFrame(sd,columns=['token','pos'])\nsd.head()","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"0e58bbb7-7900-43e0-9410-6a6168cce046","_uuid":"32a9ecab6d44f2003a815c0a67b2bd52f2462e24","trusted":false},"cell_type":"code","source":"max_num_words = 1000\nmax_seq_length = 100\ntokenizer = Tokenizer(num_words=max_num_words)\n","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"46b7d4f9-e3f0-4c16-b9f2-015a1acdd39b","_uuid":"7cb3a51c6b07550a764c63c071c2a6c59d23297e","trusted":false,"collapsed":true},"cell_type":"code","source":"len(yelp_reviews)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"a0e0409b-4aea-4c64-b1f5-b9eb8767ded4","_uuid":"6d2db8d7775a75bbdd762598ec3f5193fe82a966","trusted":false,"collapsed":true},"cell_type":"code","source":"reviews=yelp_reviews[:100000]\nreviews=reviews[reviews.stars!=3]\n\nreviews[\"labels\"]= reviews[\"stars\"].apply(lambda x: 1 if x > 3 else 0)\nreviews=reviews.drop(\"stars\",axis=1)\n\nreviews.head()","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"732547c6-7433-4b45-bfd5-2e8b75530f17","_uuid":"96456f9fe4f1fc073b024b20e5148dee9d94981e","trusted":false},"cell_type":"code","source":"texts = reviews[\"text\"].values\nlabels = 
reviews[\"labels\"].values","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"a987b4e3-b7a3-4d6d-97a7-92b889888b9b","_uuid":"5cba94b42ddb6a8a72715ab30b4958fff6a84257","trusted":false},"cell_type":"code","source":"tokenizer.fit_on_texts(texts)\nsequences = tokenizer.texts_to_sequences(texts)\nword_index = tokenizer.word_index\n","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"9b49d3c5-bca8-4e27-9324-a18831cc0361","_uuid":"01da830a97419c9b01d285f214f2a79da7ba406e","trusted":false,"collapsed":true},"cell_type":"code","source":"len(word_index)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"c828337f-91cd-4075-9071-9b6667d4a190","_uuid":"e1af0af2bfa6556ba03a2d4a08995823b013daa7","trusted":false,"collapsed":true},"cell_type":"code","source":"data = pad_sequences(sequences, maxlen=max_seq_length)\ndata","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"76225dee-bfe2-45d6-a12b-19a4dce49d8a","_uuid":"43685b899b9e4b41ecc4b1e6e1dfd0fc00049adf","trusted":false,"collapsed":true},"cell_type":"code","source":"data.shape","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"4f37aeb5-cdda-4ce4-b2bd-f245cc7d9519","_uuid":"f036d4518d625809b13352b9b927dda5b9c4656d","trusted":false},"cell_type":"code","source":"labels = to_categorical(np.asarray(labels))","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"6bdbeff6-94cd-438a-a288-e72b934934f8","_uuid":"19f951609e8362673c7e5420b355328eea5b3760","trusted":false,"collapsed":true},"cell_type":"code","source":"labels.shape","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"c6fb93b5-7b9c-4ca8-903a-65526763762f","_uuid":"0256e645011ee3050ce32a65d5df8c16b44b298b","trusted":false},"cell_type":"code","source":"validation_spilit = 0.2\nindices = 
np.arange(data.shape[0])\nnp.random.shuffle(indices)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"ec3d1c47-e3a2-46c7-88ff-47c0c4f05a2f","_uuid":"ccbdfa233af2918969e0560a0738c81b10694154","trusted":false,"collapsed":true},"cell_type":"code","source":"data = data[indices]\ndata","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"047fb8fe-8881-4fb6-a350-4939cb833972","_uuid":"22d87efa8087046dc6dcba08c829b08e9198a2d7","trusted":false,"collapsed":true},"cell_type":"code","source":"labels = labels[indices]\nlabels","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"f40c81b4-c3af-46c2-b3e4-d8bee0fef433","_uuid":"2db4e649432a4acbc99b2a2774dde27f4cff6ed5","trusted":false,"collapsed":true},"cell_type":"code","source":"nb_validation_samples = int(validation_spilit*data.shape[0])\nnb_validation_samples","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"896d2744-3ae9-4868-add0-3fcd5b3ad03f","_uuid":"02c5008ecbe67e631a99fca96179f5671981b54b","trusted":false},"cell_type":"code","source":"x_train = data[:-nb_validation_samples]\ny_train = labels[:-nb_validation_samples]\nx_val = data[-nb_validation_samples:]\ny_val = labels[-nb_validation_samples:]\n","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"886c43cc-4c6f-4e2d-a5d5-dccd7df9cf00","_uuid":"d743ab3bd117266860a29263c4330521654a112b","trusted":false},"cell_type":"code","source":"glove_dir = '../input/glove-global-vectors-for-word-representation/'","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"fb8b1a07-856d-4122-bf6d-82e3c1f23183","_uuid":"6259f81e098d4ef19d8dac2afec130e06a0309ad","trusted":false,"collapsed":true},"cell_type":"code","source":"embedding_index = {}\n\nf = open(os.path.join(glove_dir,'glove.6B.50d.txt'))\nfor line in f:\n values = line.split()\n word = values[0]\n coefs = np.asarray(values[1:],dtype='float32')\n embedding_index[word] = coefs\nf.close()\n\nprint('found word vecs: 
',len(embedding_index))","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"3a75feac-fe78-4f45-ae23-b085a8d52e9d","_uuid":"9cbc91bb3b5df510066b9972c81ee5e0002baa72","trusted":false,"collapsed":true},"cell_type":"code","source":"embedding_dim = 50\nembedding_matrix = np.zeros((len(word_index)+1,embedding_dim))\nembedding_matrix.shape","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"24dd27c6-4e9d-450a-8ae0-4bcc61e9ff2e","_uuid":"c22b2349cfb2113678d8d20de841470fcdc56709","trusted":false},"cell_type":"code","source":"for word,i in word_index.items():\n embedding_vector = embedding_index.get(word)\n if embedding_vector is not None:\n embedding_matrix[i] = embedding_vector","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"d8aec772-3fb4-46c9-8988-66da287ba1ab","_uuid":"60df2f72c00f55fe26169328b8ef0d05eca0fab4","trusted":false},"cell_type":"code","source":"from keras.layers import Embedding\nembedding_layer = Embedding(len(word_index)+1,embedding_dim,weights=[embedding_matrix],input_length=max_seq_length,trainable=False)","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"6618d995-5c04-41b4-9724-8892adfe50df","_uuid":"d54df4e207d8f7874884998fa3c637177f3d5972","trusted":false},"cell_type":"code","source":"from keras.layers import Bidirectional,GlobalMaxPool1D,Conv1D\nfrom keras.layers import LSTM,Input,Dense,Dropout,Activation\nfrom keras.models import Model","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"f9dfbbed-0720-44a0-8679-2b9cc3c1ad3f","_uuid":"3f84a01f25dd92b68681353714dfaa8a2fca70d1","trusted":false},"cell_type":"code","source":"inp = Input(shape=(max_seq_length,))\nx = embedding_layer(inp)\nx = Bidirectional(LSTM(50,return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)\nx = GlobalMaxPool1D()(x)\nx = Dense(50,activation='relu')(x)\nx = Dropout(0.1)(x)\nx = Dense(2,activation='sigmoid')(x)\nmodel = 
Model(inputs=inp,outputs=x)","execution_count":null,"outputs":[]},{"metadata":{"collapsed":true,"_cell_guid":"063aa4eb-1a13-49e3-bc00-bbd45be32238","_uuid":"b7d7b5275784367bb2d508fe39c313548c528f25","trusted":false},"cell_type":"code","source":"model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])\n","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"854d7171-c903-4a2b-9c40-6322aeb61d4d","_uuid":"4753f2c45b326b074d412a49e22eb0890ab76542","trusted":false,"collapsed":true},"cell_type":"code","source":"print(x_train.shape)\nprint(y_train.shape)\nprint(x_val.shape)\nprint(y_val.shape)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"436624ed-35d9-42cc-956a-501e6a6a7818","_uuid":"da2981747e220ae713c61eda9857bfa42c48043d","trusted":false,"collapsed":true},"cell_type":"code","source":"model.fit(x_train,y_train,validation_data=(x_val,y_val),epochs=20,batch_size=128);","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"e2219519-c3b7-477e-8636-f5f2d3882b4b","_uuid":"4196ab0f3a61c0a08a5420916e5eaa7ec1dc4af3","trusted":false,"collapsed":true},"cell_type":"code","source":"score = model.evaluate(x_val,y_val)\nscore","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"4b76936e-96e5-45c9-906d-5124891449b3","_uuid":"e9f60824821a662217f9c84bf281d581422d2534","trusted":false,"collapsed":true},"cell_type":"code","source":"score[1]*100","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"032a6693-1645-4ee4-86cc-d493120c83dd","_uuid":"e85ec637a551099c76d058939bc0f4b972dbb221"},"cell_type":"markdown","source":""}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.4","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1} 2 | 
-------------------------------------------------------------------------------- /EDAonYelpReviews.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load in \n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n\n# Input data files are available in the \"../input/\" directory.\n# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n\nimport os\nprint(os.listdir(\"../input\"))\n\n# Any results you write to the current directory are saved as output.","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"**Exploratory Data Analysis**"},{"metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","trusted":true},"cell_type":"code","source":"import pandas as pd\nimport numpy as np","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_train = pd.read_csv('../input/yelp-train/train.csv',header=None)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_train.columns = ['deceptive','text']","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_train.head(5)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_test = 
pd.read_csv('../input/yelptest/test.csv',header=None)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_test.columns = ['deceptive','text']","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_test.head(5)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_train.info()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"import seaborn as sns\nimport matplotlib.pyplot as plt\n%matplotlib inline","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"sns.countplot(data_train.deceptive)\nplt.xlabel('Deceptive')\nplt.title('Number of Deceptive and Non Deceptive reviews (Deceptive=1 & NonDeceptive=2)')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#dataset description\ndata_train.groupby('deceptive').describe()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#word count\ndata_train['word_count'] = data_train['text'].apply(lambda x: len(str(x).split(\" \")))\ndata_train[['text','word_count']].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#character count including spaces\ndata_train['char_count'] = data_train['text'].str.len() ## this also includes spaces\ndata_train[['text','char_count']].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#average word length\ndef avg_word(sentence):\n words = sentence.split()\n return (sum(len(word) for word in words)/len(words))\n\ndata_train['avg_word'] = data_train['text'].apply(lambda x: avg_word(x))\ndata_train[['text','avg_word']].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#no of stopwords\nfrom nltk.corpus import stopwords\nstop = 
stopwords.words('english')\n\ndata_train['stopwords'] = data_train['text'].apply(lambda x: len([x for x in x.split() if x in stop]))\ndata_train[['text','stopwords']].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#no of hashtags (words starting with '#')\ndata_train['spchar'] = data_train['text'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))\ndata_train[['text','spchar']].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#no of numerics\ndata_train['numerics'] = data_train['text'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))\ndata_train[['text','numerics']].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#no of uppercase words\ndata_train['upper'] = data_train['text'].apply(lambda x: len([x for x in x.split() if x.isupper()]))\ndata_train[['text','upper']].head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"**Preprocessing**"},{"metadata":{"trusted":true},"cell_type":"code","source":"#to lowercase\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join(x.lower() for x in x.split()))\ndata_train['text'].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#removing punctuation (regex=True is explicit because newer pandas versions treat the pattern as a literal string by default)\ndata_train['text'] = data_train['text'].str.replace('[^\\w\\s]','',regex=True)\ndata_train['text'].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#removing stop words\nfrom nltk.corpus import stopwords\nstop = stopwords.words('english')\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join(x for x in x.split() if x not in stop))\ndata_train['text'].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#removing common word\nfreq = pd.Series(' 
'.join(data_train['text']).split()).value_counts()[:10]\nfreq","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#removing common word\nfreq = list(freq.index)\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join(x for x in x.split() if x not in freq))\ndata_train['text'].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#remvoing rare words\nfreq = pd.Series(' '.join(data_train['text']).split()).value_counts()[-10:]\nfreq","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#removing rare words\nfreq = list(freq.index)\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join(x for x in x.split() if x not in freq))\ndata_train['text'].head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#spelling correction\nfrom textblob import TextBlob\ndata_train['text'][:5].apply(lambda x: str(TextBlob(x).correct()))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#tokenization\nTextBlob(data_train['text'][1]).words","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#stemming\nfrom nltk.stem import PorterStemmer\nst = PorterStemmer()\ndata_train['text'][:5].apply(lambda x: \" \".join([st.stem(word) for word in x.split()]))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#lemmetization\nfrom textblob import Word\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join([Word(word).lemmatize() for word in x.split()]))\ndata_train['text'].head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"**Advance Text 
Processing**"},{"metadata":{"trusted":true},"cell_type":"code","source":"#N-grams\nTextBlob(data_train['text'][0]).ngrams(2)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#Term frequency\ntf1 = (data_train['text'][1:2]).apply(lambda x: pd.value_counts(x.split(\" \"))).sum(axis = 0).reset_index()\ntf1.columns = ['words','tf']\ntf1","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#inverse document frequency\nfor i,word in enumerate(tf1['words']):\n tf1.loc[i, 'idf'] = np.log(data_train.shape[0]/(len(data_train[data_train['text'].str.contains(word)])))\n\ntf1","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#term freq - inverse document freq\ntf1['tfidf'] = tf1['tf'] * tf1['idf']\ntf1","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#sparse matrix tf-idf freq\nfrom sklearn.feature_extraction.text import TfidfVectorizer\ntfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',\n stop_words= 'english',ngram_range=(1,1))\ntrain_vect = tfidf.fit_transform(data_train['text'])\n\ntrain_vect","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#Bag of Words\nfrom sklearn.feature_extraction.text import CountVectorizer\nbow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = \"word\")\ntrain_bow = bow.fit_transform(data_train['text'])\ntrain_bow","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import 
train_test_split","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_train.head(10)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# define x and y before they are inspected in the next cells\nx = data_train['text'].astype(str)\ny = data_train['deceptive']","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"x.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"y.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"x_train, x_test, y_train, y_test = train_test_split(x, y,\n stratify=y, \n test_size=0.25)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"x_train.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"y_train.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"x_test.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"y_test.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"x_train.shape","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"x_test.shape","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from keras.preprocessing.text import Tokenizer\nfrom keras.preprocessing import sequence","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"max_words = 1000\nmax_len = 150\ntok = Tokenizer(num_words=max_words)\ntok.fit_on_texts(x_train)\nsequences = tok.texts_to_sequences(x_train)\nsequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"max_words = 1000\nmax_len = 150\ntok = 
Tokenizer(num_words=max_words)\ntok.fit_on_texts(x_test)\nsequences_test = tok.texts_to_sequences(x_test)\nsequences_matrix_test = sequence.pad_sequences(sequences_test,maxlen=max_len)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"tokenizer = Tokenizer(num_words=None,lower=True,filters='!\"#$%&()*+,-./:;<=>?@[\\\\]^_`{|}~\\t\\n',split=' ',char_level=False)\ntokenizer.fit_on_texts(x_train)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"tokenizer.fit_on_texts(x_test)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"x_train1 = tokenizer.texts_to_sequences(x_train)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"x_test1=tokenizer.texts_to_sequences(x_test)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"word_index = tokenizer.word_index","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"vocab_size = len(word_index)\nprint('Vocab size: {}'.format(vocab_size))\n# measure lengths on the tokenized sequences (x_train1); len() on the raw strings in x_train would count characters\nlongest = max(len(seq) for seq in x_train1)\nprint(\"Longest comment size: {}\".format(longest))\naverage = np.mean([len(seq) for seq in x_train1])\nprint(\"Average comment size: {}\".format(average))\nstdev = np.std([len(seq) for seq in x_train1])\nprint(\"Stdev of comment size: {}\".format(stdev))\nmax_len = int(average + stdev * 3)\nprint('Max comment size: {}'.format(max_len))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from keras.preprocessing.sequence import pad_sequences","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"processed_x_train = pad_sequences(x_train1, maxlen=max_len, padding='post', truncating='post')\nprocessed_x_test = pad_sequences(x_test1, maxlen=max_len, padding='post', 
truncating='post')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from keras.models import Model\nfrom keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding\nfrom keras.optimizers import RMSprop,Nadam\nfrom keras.callbacks import EarlyStopping","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"def RNN():\n inputs = Input(name='inputs',shape=[max_len])\n # The padded inputs come from `tokenizer` (num_words=None), so token indices run up to vocab_size, not max_words\n layer = Embedding(vocab_size + 1,50,input_length=max_len)(inputs)\n layer = LSTM(64)(layer)\n layer = Dense(256,name='FC1')(layer)\n layer = Activation('relu')(layer)\n layer = Dropout(0.5)(layer)\n layer = Dense(1,name='out_layer')(layer)\n layer = Activation('sigmoid')(layer)\n model = Model(inputs=inputs,outputs=layer)\n return model","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model = RNN()\nmodel.summary()\nmodel.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model.fit(processed_x_train,y_train,batch_size=128,epochs=10,\n validation_data=(processed_x_test,y_test),callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"import keras.backend\nfrom keras.models import Sequential, load_model\nfrom keras.layers.convolutional import Conv1D\nfrom keras.layers.convolutional import MaxPooling1D\nfrom keras.layers import Dense\nfrom keras.layers import Flatten","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from keras.layers import CuDNNGRU, Dense, Conv1D, MaxPooling1D\nfrom keras.layers import Dropout, GlobalMaxPooling1D, BatchNormalization, LSTM\nfrom keras.layers import Bidirectional","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"Embeddings ------- GloVe 
100D ------"},{"metadata":{"trusted":true},"cell_type":"code","source":"embeddings_index = {}\nf = open(os.path.join('../input/glove-global-vectors-for-word-representation', 'glove.6B.100d.txt'))\nfor line in f:\n values = line.split()\n word = values[0]\n coefs = np.asarray(values[1:], dtype='float32')\n embeddings_index[word] = coefs\nf.close()\nprint('Found %s word vectors.' % len(embeddings_index))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"embedding_dim = 100\nk = 0\nembedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))\nfor word, i in word_index.items():\n embedding_vector = embeddings_index.get(word)\n if embedding_vector is not None:\n # Words not found in embedding index will be all-zeros.\n k += 1\n embedding_matrix[i] = embedding_vector","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"## create model\nmodel_glove = Sequential()\nmodel_glove.add(Embedding(vocab_size + 1, embedding_dim, weights=[embedding_matrix], input_length=max_len, trainable=True))\nmodel_glove.add(Dropout(0.2))\nmodel_glove.add(Conv1D(64, 5, activation='relu'))\nmodel_glove.add(MaxPooling1D(pool_size=4))\nmodel_glove.add(LSTM(100))\nmodel_glove.add(Dense(1, activation='sigmoid'))\n#model_glove.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model_glove.summary()\nmodel_glove.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model_glove.fit(processed_x_train,y_train,batch_size=128,epochs=10,\n validation_data=(processed_x_test,y_test),callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"LSTM 
Embeddings"},{"metadata":{"trusted":true},"cell_type":"code","source":"# Initate model\nmodel3 = Sequential()\n\n# Add Embedding layer\nmodel3.add(Embedding(vocab_size + 1, embedding_dim, weights=[embedding_matrix], input_length=max_len, trainable=True))\n\n# Add Recurrent layer\n#model.add(Bidirectional(CuDNNGRU(300, return_sequences=True)))\nmodel3.add(LSTM(60, return_sequences=True, name='lstm_layer'))\nmodel3.add(LSTM(30, return_sequences=True, name='lstm_layer2'))\nmodel3.add(Conv1D(filters=128, kernel_size=5, padding='same', activation='relu'))\nmodel3.add(MaxPooling1D(3))\nmodel3.add(GlobalMaxPooling1D())\nmodel3.add(BatchNormalization())\n\n# Add fully connected layers\nmodel3.add(Dense(50, activation='relu'))\nmodel3.add(Dropout(0.3))\nmodel3.add(Dense(1, activation='sigmoid'))\n\n# Summarize the model\nmodel3.summary()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"**CNN GloVe Model 2**\n\nsequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
\nembedded_sequences = embedding_layer(sequence_input)
\nl_cov1= Conv1D(128, 5, activation='relu')(embedded_sequences)
\nl_pool1 = MaxPooling1D(5)(l_cov1)
\nl_cov2 = Conv1D(128, 5, activation='relu')(l_pool1)
\nl_pool2 = MaxPooling1D(5)(l_cov2)
\nl_cov3 = Conv1D(128, 5, activation='relu')(l_pool2)
\nl_pool3 = MaxPooling1D(35)(l_cov3) # global max pooling
\nl_flat = Flatten()(l_pool3)
\nl_dense = Dense(128, activation='relu')(l_flat)
\npreds = Dense(2, activation='softmax')(l_dense)
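
A minimal sketch, not part of the original notebook, of why the last pooling layer above is labelled as global max pooling. It assumes MAX_SEQUENCE_LENGTH = 1000 (the notebook does not state the value), stride-1 Conv1D layers with the default padding='valid', and MaxPooling1D with its default stride equal to pool_size. The helper names are hypothetical.

```python
def conv1d_valid_len(n, kernel_size):
    # Output length of a stride-1 Conv1D with padding='valid'
    return n - kernel_size + 1


def max_pool1d_len(n, pool_size):
    # Output length of MaxPooling1D with stride == pool_size and padding='valid'
    return (n - pool_size) // pool_size + 1


def trace_lengths(seq_len):
    # (kernel_size, pool_size) for the three Conv1D / MaxPooling1D pairs listed above
    stack = [(5, 5), (5, 5), (5, 35)]
    lengths = [seq_len]
    n = seq_len
    for kernel_size, pool_size in stack:
        n = conv1d_valid_len(n, kernel_size)
        lengths.append(n)
        n = max_pool1d_len(n, pool_size)
        lengths.append(n)
    return lengths


# Assumed input length of 1000 tokens
print(trace_lengths(1000))  # [1000, 996, 199, 195, 39, 35, 1]
```

Under that assumption the third convolution leaves exactly 35 steps, so a pool of 35 reduces the sequence to a single step. If fewer than 35 steps remain the pooling window does not fit, and with 70 or more steps more than one step survives, so GlobalMaxPooling1D() is likely the more robust choice when the sequence length differs.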
"}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.4","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1} -------------------------------------------------------------------------------- /cnn-lstm-with-doc2vec-plus-tf-idf.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_cell_guid":"c89c6d78-cd49-43b4-99b7-262151427d57","_uuid":"bc244d28a1532f3842dd1739cec5adff4e4b6c79","collapsed":true,"trusted":true},"cell_type":"code","source":"import pandas as pd\nimport numpy as np\nfrom keras.preprocessing import sequence\nfrom keras.layers import TimeDistributed, GlobalAveragePooling1D, GlobalAveragePooling2D, BatchNormalization\nfrom keras.layers.recurrent import LSTM\nfrom keras.layers.convolutional import Conv1D, MaxPooling1D, Conv2D, MaxPooling2D, AveragePooling1D\n#from keras.layers.embeddings import Embedding\nfrom keras.layers import Dropout, Flatten, Bidirectional, Dense, Activation, TimeDistributed\nfrom keras.models import Model, Sequential\nfrom keras.utils import np_utils\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import LabelEncoder\nfrom nltk.corpus import stopwords\nfrom nltk.tokenize import word_tokenize, sent_tokenize\nfrom nltk.stem.wordnet import WordNetLemmatizer\nfrom string import ascii_lowercase\nfrom collections import Counter\nfrom gensim.models import Word2Vec\nfrom gensim.models import Doc2Vec\nfrom gensim.models import doc2vec\nfrom gensim.models import KeyedVectors\nimport itertools, nltk, snowballstemmer, re\n\nLabeledSentence = 
doc2vec.LabeledSentence","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"66eaff70-1bb2-436f-ac31-b85281a2876d","_uuid":"4402fef233db64bf46f51e3bdd0aadc111b11753","collapsed":true,"trusted":true},"cell_type":"code","source":"class LabeledLineSentence(object):\n def __init__(self, sources):\n self.sources = sources\n \n flipped = {}\n \n # make sure that keys are unique\n for key, value in sources.items():\n if value not in flipped:\n flipped[value] = [key]\n else:\n raise Exception('Non-unique prefix encountered')\n \n def __iter__(self):\n for source, prefix in self.sources.items():\n with utils.smart_open(source) as fin:\n for item_no, line in enumerate(fin):\n yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])\n \n def to_array(self):\n self.sentences = []\n for source, prefix in self.sources.items():\n with utils.smart_open(source) as fin:\n for item_no, line in enumerate(fin):\n self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))\n return self.sentences\n \n def sentences_perm(self):\n shuffled = list(self.sentences)\n random.shuffle(shuffled)\n return shuffled","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"eac514b4-5efb-4291-81a3-6730a0cdac61","_uuid":"57bed5b9fd3ad963809205d71d0883a9e38bb9f9","collapsed":true,"trusted":true},"cell_type":"code","source":"#data = pd.read_csv('deceptive-opinion-spam-corpus.zip', compression='zip', header=0, sep=',', quotechar='\"')\ndata = pd.read_csv(\"../input/deceptive-opinion.csv\")","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"5aedff72-437f-4530-9bfe-56165ddd9690","_uuid":"0d7d60b7569b847685579a73ded95641fcfe63d1","collapsed":true,"trusted":true},"cell_type":"code","source":"data['polarity'] = np.where(data['polarity']=='positive', 1, 0)\ndata['deceptive'] = np.where(data['deceptive']=='truthful', 1, 
0)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"1a30e639-f406-4320-acd4-b190cb0a1f06","_uuid":"fb20c77b00518d320c96687071080ac60b3ae8e7","collapsed":true,"trusted":true},"cell_type":"code","source":"def create_class(c):\n if c['polarity'] == 1 and c['deceptive'] == 1:\n return [1,1]\n elif c['polarity'] == 1 and c['deceptive'] == 0:\n return [1,0]\n elif c['polarity'] == 0 and c['deceptive'] == 1:\n return [0,1]\n else:\n return [0,0]\n \ndef specific_class(c):\n if c['polarity'] == 1 and c['deceptive'] == 1:\n return \"TRUE_POSITIVE\"\n elif c['polarity'] == 1 and c['deceptive'] == 0:\n return \"FALSE_POSITIVE\"\n elif c['polarity'] == 0 and c['deceptive'] == 1:\n return \"TRUE_NEGATIVE\"\n else:\n return \"FALSE_NEGATIVE\"\n\ndata['final_class'] = data.apply(create_class, axis=1)\ndata['given_class'] = data.apply(specific_class, axis=1)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"4eaeb14e-a7f3-455e-b43c-64b26bdccd7b","_uuid":"6151435f72f8802262e5b748a4039bbc4074943f","collapsed":true,"trusted":true},"cell_type":"code","source":"Y = data['final_class']\n# encode class values as integers\nencoder = LabelEncoder()\nencoder.fit(Y)\nencoded_Y = encoder.transform(Y)\n# convert integers to dummy variables (i.e. 
one hot encoded)\ndummy_y = np_utils.to_categorical(encoded_Y)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"358caced-4efe-454d-b9e6-fb1db61c3cd5","_uuid":"5a2723885623b4e5c62d5cbf04b639450ebcd486","collapsed":true,"trusted":true},"cell_type":"code","source":"textData = pd.DataFrame(list(data['text'])) # each row is one document; the raw text of the document should be in the 'text_data' column","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"727195f6-a01b-478c-9e6e-f63f9e0bcb51","_uuid":"b4b122f69da68f42742cf97da439dc067b1a5a52","collapsed":true,"trusted":true},"cell_type":"code","source":"# initialize stemmer\nstemmer = snowballstemmer.EnglishStemmer()\n\n# grab stopword list, extend it a bit, and then turn it into a set for later\nstop = stopwords.words('english')\nstop.extend(['may','also','zero','one','two','three','four','five','six','seven','eight','nine','ten','across','among','beside','however','yet','within']+list(ascii_lowercase))\nstoplist = stemmer.stemWords(stop)\nstoplist = set(stoplist)\nstop = set(sorted(stop + list(stoplist))) ","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"3bf58a8f-b3aa-41d0-af3b-4eca01d7f7bb","_uuid":"8a53f638d29e5bf6c6859dff180ddd45611c6f4a","collapsed":true,"trusted":true},"cell_type":"code","source":"# remove characters and stoplist words, then generate dictionary of unique words\ntextData[0].replace('[!\"#%\\'()*+,-./:;<=>?@\\[\\]^_`{|}~1234567890’”“′‘\\\\\\]',' ',inplace=True,regex=True)\nwordlist = filter(None, \" \".join(list(set(list(itertools.chain(*textData[0].str.split(' ')))))).split(\" \"))\ndata['stemmed_text_data'] = [' '.join(filter(None,filter(lambda word: word not in stop, line))) for line in textData[0].str.lower().str.split(' ')]","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"0a714384-6691-4178-94f8-6bc7e7771908","_uuid":"7c57c1e8ed3f743e63850290bf5ed6030dcb95e3","collapsed":true,"trusted":true},"cell_type":"code","source":"# remove all 
words that occur fewer than minimum_count times (with minimum_count = 1 nothing is removed), then stem the resulting docs\nminimum_count = 1\nstr_frequencies = pd.DataFrame(list(Counter(filter(None,list(itertools.chain(*data['stemmed_text_data'].str.split(' '))))).items()),columns=['word','count'])\nlow_frequency_words = set(str_frequencies[str_frequencies['count'] < minimum_count]['word'])\ndata['stemmed_text_data'] = [' '.join(filter(None,filter(lambda word: word not in low_frequency_words, line))) for line in data['stemmed_text_data'].str.split(' ')]\ndata['stemmed_text_data'] = [\" \".join(stemmer.stemWords(re.sub('[!\"#%\\'()*+,-./:;<=>?@\\[\\]^_`{|}~1234567890’”“′‘\\\\\\]',' ', next_text).split(' '))) for next_text in data['stemmed_text_data']] ","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"ad04c9ec-5721-4a55-bcec-afd7f4e150db","_uuid":"3067446c5a55f58f67fcc4d1eb9cb1f0eea8a339","collapsed":true,"trusted":true},"cell_type":"code","source":"lmtzr = WordNetLemmatizer()\nw = re.compile(\"\\w+\",re.I)\n\ndef label_sentences(df, input_point):\n labeled_sentences = []\n list_sen = []\n for index, datapoint in df.iterrows():\n tokenized_words = re.findall(w,datapoint[input_point].lower())\n labeled_sentences.append(LabeledSentence(words=tokenized_words, tags=['SENT_%s' %index]))\n list_sen.append(tokenized_words)\n return labeled_sentences, list_sen\n\ndef train_doc2vec_model(labeled_sentences):\n model = Doc2Vec(min_count=1, window=9, size=512, sample=1e-4, negative=5, workers=7)\n model.build_vocab(labeled_sentences)\n pretrained_weights = model.wv.syn0\n vocab_size, embedding_size = pretrained_weights.shape\n # total_examples is the number of documents, not the number of vocabulary words\n model.train(labeled_sentences, total_examples=model.corpus_count, epochs=400)\n \n return model","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"1cab13d3-d608-4e49-89fc-5076835f4ab2","_uuid":"2bf3e3c0bef170d0601e098b159ea758a3aeb461","collapsed":true,"trusted":true},"cell_type":"code","source":"textData = data['stemmed_text_data'].to_frame().reset_index()\nsen, corpus = 
label_sentences(textData, 'stemmed_text_data')","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"a84fe753-6f2c-44d9-990a-af41a34a282b","_uuid":"fdf41cbef8aed8c77625debd753d88b37ed108ca","collapsed":true,"scrolled":true,"trusted":true},"cell_type":"code","source":"doc2vec_model = train_doc2vec_model(sen)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"bcbdbe26-8a3f-4350-8633-3d9aaf0c172e","_uuid":"4de396cec1aeb9a63648d1aa6ac93d3914c05b78","collapsed":true,"trusted":true},"cell_type":"code","source":"doc2vec_model.save(\"doc2vec_model_opinion_corpus.d2v\")","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"3fe3f2b0-b681-421c-9adc-682796f97387","_uuid":"161eecd1a6baac4e39bfa9f5b7620ece1165ecc7","collapsed":true,"trusted":true},"cell_type":"code","source":"doc2vec_model = Doc2Vec.load(\"doc2vec_model_opinion_corpus.d2v\") ","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"e4bc5f09-3bbf-4ddf-97e1-e497c210d6ae","_uuid":"fb98f66f270a63bfd0d82bb1cfcacd18dae40873","collapsed":true,"trusted":true},"cell_type":"code","source":"from sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.decomposition import TruncatedSVD\n\ntfidf1 = TfidfVectorizer(tokenizer=lambda i:i, lowercase=False, ngram_range=(1,1))\nresult_train1 = tfidf1.fit_transform(corpus)\n\ntfidf2 = TfidfVectorizer(tokenizer=lambda i:i, lowercase=False, ngram_range=(1,2))\nresult_train2 = tfidf2.fit_transform(corpus)\n\ntfidf3 = TfidfVectorizer(tokenizer=lambda i:i, lowercase=False, ngram_range=(1,3))\nresult_train3 = tfidf3.fit_transform(corpus)\n\nsvd = TruncatedSVD(n_components=512, n_iter=40, random_state=34)\ntfidf_data1 = svd.fit_transform(result_train1)\ntfidf_data2 = svd.fit_transform(result_train2)\ntfidf_data3 = 
svd.fit_transform(result_train3)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"ef1dbd5a-e69e-4b56-9ff3-c74ce62fdfdd","_uuid":"bfd26d39cd104b2bf443fc52992437751fa852b7","collapsed":true,"trusted":true},"cell_type":"code","source":"from sklearn.feature_extraction.text import CountVectorizer\nimport spacy\n\nnlp = spacy.load('en')\ntemp_textData = pd.DataFrame(list(data['text']))\n\noverall_pos_tags_tokens = []\noverall_pos = []\noverall_tokens = []\noverall_dep = []\n\nfor i in range(1600):\n doc = nlp(temp_textData[0][i])\n given_pos_tags_tokens = []\n given_pos = []\n given_tokens = []\n given_dep = []\n for token in doc:\n output = \"%s_%s\" % (token.pos_, token.tag_)\n given_pos_tags_tokens.append(output)\n given_pos.append(token.pos_)\n given_tokens.append(token.tag_)\n given_dep.append(token.dep_)\n \n overall_pos_tags_tokens.append(given_pos_tags_tokens)\n overall_pos.append(given_pos)\n overall_tokens.append(given_tokens)\n overall_dep.append(given_dep)\n","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"dbcdbe6d-0c7f-44dc-a9d6-0f240d339885","_uuid":"1c19ee8bd8fb5862cc5d25225b0684dd755ee8b2","collapsed":true,"scrolled":true,"trusted":true},"cell_type":"code","source":"from sklearn.preprocessing import MinMaxScaler\n\ncount = CountVectorizer(tokenizer=lambda i:i, lowercase=False)\npos_tags_data = count.fit_transform(overall_pos_tags_tokens).todense()\npos_data = count.fit_transform(overall_pos).todense()\ntokens_data = count.fit_transform(overall_tokens).todense()\ndep_data = count.fit_transform(overall_dep).todense()\nmin_max_scaler = MinMaxScaler()\nnormalized_pos_tags_data = min_max_scaler.fit_transform(pos_tags_data)\nnormalized_pos_data = min_max_scaler.fit_transform(pos_data)\nnormalized_tokens_data = min_max_scaler.fit_transform(tokens_data)\nnormalized_dep_data = min_max_scaler.fit_transform(dep_data)\n\nfinal_pos_tags_data = np.zeros(shape=(1600, 512)).astype(np.float32)\nfinal_pos_data = np.zeros(shape=(1600, 
512)).astype(np.float32)\nfinal_tokens_data = np.zeros(shape=(1600, 512)).astype(np.float32)\nfinal_dep_data = np.zeros(shape=(1600, 512)).astype(np.float32)\nfinal_pos_tags_data[:normalized_pos_tags_data.shape[0],:normalized_pos_tags_data.shape[1]] = normalized_pos_tags_data\nfinal_pos_data[:normalized_pos_data.shape[0],:normalized_pos_data.shape[1]] = normalized_pos_data\nfinal_tokens_data[:normalized_tokens_data.shape[0],:normalized_tokens_data.shape[1]] = normalized_tokens_data\nfinal_dep_data[:normalized_dep_data.shape[0],:normalized_dep_data.shape[1]] = normalized_dep_data","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"57e0d758-a2b5-4a24-a60e-cfa2df06572f","_uuid":"e6474b59e587fbafbd2ea8801a50ef6d787a5868","collapsed":true,"trusted":true},"cell_type":"code","source":"maxlength = []\nfor i in range(0,len(sen)):\n maxlength.append(len(sen[i][0]))\n \nprint(max(maxlength)) ","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"dda366e0-43b7-4b35-8f31-c41d572375d4","_uuid":"f53332f2ce02fee41617d0205c524ebe26c4a90d","collapsed":true,"trusted":true},"cell_type":"code","source":"def vectorize_comments(df,d2v_model):\n y = []\n comments = []\n for i in range(0,df.shape[0]):\n label = 'SENT_%s' %i\n comments.append(d2v_model.docvecs[label])\n df['vectorized_comments'] = comments\n \n return df\n\ntextData = vectorize_comments(textData,doc2vec_model)\nprint (textData.head(2))","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"3fafb37f-8eee-4de3-80a0-0587f7075bc7","_uuid":"b978f4c41b6adc60490f0c6e048b3195249b9629","collapsed":true,"trusted":true},"cell_type":"code","source":"# load the whole embedding into memory\n#embeddings_index = dict()\n#f = open('glove/glove.6B.300d.txt')\n#for line in f:\n# values = line.split()\n# word = values[0]\n# coefs = np.asarray(values[1:], dtype='float32')\n# embeddings_index[word] = coefs\n#f.close()\n#print('Loaded %s word vectors.' 
% len(embeddings_index))","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"08eadfe1-897d-4e46-bc46-95a5e452a618","_uuid":"b99088a77d8d46e52ba16259ffba4b655265f3cd","collapsed":true,"trusted":true},"cell_type":"code","source":"#from nltk.corpus import stopwords\n\n#glove_data = np.zeros(shape=(1600, 800, 512)).astype(np.float32)\n#temp_textData = data['text'].to_frame().reset_index()\n#sen2, corpus2 = label_sentences(temp_textData, 'text')\n#stop_words = set(stopwords.words('english'))\n#test_word = np.zeros(512).astype(np.float32)\n#final_matrix = np.zeros(512).astype(np.float32)\n#final_sizes = []\n\n#count = True\n\n#for i in range(1600):\n# for j in sen2[i][0]:\n# if j in embeddings_index and j not in stop_words:\n# test_word[:300] = embeddings_index[j]\n# if count == True:\n# final_matrix = test_word\n# count = False\n# else:\n# final_matrix = np.vstack((final_matrix, test_word))\n \n# final_sizes.append(final_matrix.shape[0])\n# final_matrix = np.zeros(512).astype(np.float32)\n# glove_data[i,:final_matrix.shape[0],:] = final_matrix\n# count = True\n \n#print(max(final_sizes))","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"b2a454a7-34dd-4eb5-93ac-ac891eb7e9fe","_uuid":"4ebed90a6e28ab4de06c3037761106b6f6a0f4e5","collapsed":true,"scrolled":true,"trusted":true},"cell_type":"code","source":"from sklearn import cross_validation\nfrom sklearn.grid_search import GridSearchCV\nX_train, X_test, y_train, y_test = cross_validation.train_test_split(textData[\"vectorized_comments\"].T.tolist(), \n dummy_y, \n test_size=0.1, \n random_state=56)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"19c5322f-c579-4b9b-91fb-ebca06c6715f","_uuid":"2c4e424e1c6f1a3a6abb7beb8eec6f33d3bc2622","collapsed":true,"trusted":true},"cell_type":"code","source":"X = np.array(textData[\"vectorized_comments\"].T.tolist()).reshape((1,1600,512))\ny = np.array(dummy_y).reshape((1600,4))\nX_train2 = np.array(X_train).reshape((1,1440,512))\ny_train2 = 
np.array(y_train).reshape((1,1440,4))\nX_test2 = np.array(X_test).reshape((1,160,512))\ny_test2 = np.array(y_test).reshape((1,160,4))","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"00a664f3-390b-4f40-9580-8ca6c3021678","_uuid":"87b1e9251955272eee9475662179a60dc030a399","collapsed":true,"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import StratifiedKFold\nXtemp = textData[\"vectorized_comments\"].T.tolist()\nytemp = data['given_class']\ntraining_indices = []\ntesting_indices = []\n\nskf = StratifiedKFold(n_splits=10)\nskf.get_n_splits(Xtemp, ytemp)\n\nfor train_index, test_index in skf.split(Xtemp, ytemp):\n training_indices.append(train_index)\n testing_indices.append(test_index)","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"0fac1165-69ad-4576-9729-8857d625cef1","_uuid":"45bc95366bbb2e542be56aaf5f73582fd65062d4","collapsed":true,"trusted":true},"cell_type":"code","source":"def extractTrainingAndTestingData(givenIndex):\n X_train3 = np.zeros(shape=(1440, max(maxlength)+10, 512)).astype(np.float32)\n Y_train3 = np.zeros(shape=(1440, 4)).astype(np.float32)\n X_test3 = np.zeros(shape=(160, max(maxlength)+10, 512)).astype(np.float32)\n Y_test3 = np.zeros(shape=(160, 4)).astype(np.float32)\n\n empty_word = np.zeros(512).astype(np.float32)\n\n count_i = 0\n for i in training_indices[givenIndex]:\n len1 = len(sen[i][0])\n average_vector1 = np.zeros(512).astype(np.float32)\n average_vector2 = np.zeros(512).astype(np.float32)\n average_vector3 = np.zeros(512).astype(np.float32)\n for j in range(max(maxlength)+10):\n if j < len1:\n X_train3[count_i,j,:] = doc2vec_model[sen[i][0][j]]\n average_vector1 += result_train1[i, tfidf1.vocabulary_[sen[i][0][j]]] * doc2vec_model[sen[i][0][j]]\n average_vector2 += result_train2[i, tfidf2.vocabulary_[sen[i][0][j]]] * doc2vec_model[sen[i][0][j]]\n average_vector3 += result_train3[i, tfidf3.vocabulary_[sen[i][0][j]]] * doc2vec_model[sen[i][0][j]]\n #elif j >= len1 and j < 
len1 + 379:\n # X_train3[count_i,j,:] = glove_data[i, j-len1, :]\n elif j == len1:\n X_train3[count_i,j,:] = tfidf_data1[i]\n elif j == len1 + 1:\n X_train3[count_i,j,:] = tfidf_data2[i]\n elif j == len1+2:\n X_train3[count_i,j,:] = tfidf_data3[i]\n elif j == len1+3:\n X_train3[count_i,j,:] = average_vector1\n elif j == len1+4:\n X_train3[count_i,j,:] = average_vector2\n elif j == len1+5:\n X_train3[count_i,j,:] = average_vector3\n elif j == len1+6:\n X_train3[count_i,j,:] = final_pos_tags_data[i] \n elif j == len1+7:\n X_train3[count_i,j,:] = final_pos_data[i]\n elif j == len1+8:\n X_train3[count_i,j,:] = final_tokens_data[i]\n elif j == len1+9:\n X_train3[count_i,j,:] = final_dep_data[i]\n else:\n X_train3[count_i,j,:] = empty_word\n\n Y_train3[count_i,:] = dummy_y[i]\n count_i += 1\n\n\n count_i = 0\n for i in testing_indices[givenIndex]:\n len1 = len(sen[i][0])\n average_vector1 = np.zeros(512).astype(np.float32)\n average_vector2 = np.zeros(512).astype(np.float32)\n average_vector3 = np.zeros(512).astype(np.float32)\n for j in range(max(maxlength)+10):\n if j < len1:\n X_test3[count_i,j,:] = doc2vec_model[sen[i][0][j]]\n average_vector1 += result_train1[i, tfidf1.vocabulary_[sen[i][0][j]]] * doc2vec_model[sen[i][0][j]]\n average_vector2 += result_train2[i, tfidf2.vocabulary_[sen[i][0][j]]] * doc2vec_model[sen[i][0][j]] \n average_vector3 += result_train3[i, tfidf3.vocabulary_[sen[i][0][j]]] * doc2vec_model[sen[i][0][j]]\n #elif j >= len1 and j < len1 + 379:\n # X_test3[count_i,j,:] = glove_data[i, j-len1, :]\n elif j == len1:\n X_test3[count_i,j,:] = tfidf_data1[i]\n elif j == len1 + 1:\n X_test3[count_i,j,:] = tfidf_data2[i]\n elif j == len1+2:\n X_test3[count_i,j,:] = tfidf_data3[i]\n elif j == len1+3:\n X_test3[count_i,j,:] = average_vector1\n elif j == len1+4:\n X_test3[count_i,j,:] = average_vector2\n elif j == len1+5:\n X_test3[count_i,j,:] = average_vector3\n elif j == len1+6:\n X_test3[count_i,j,:] = final_pos_tags_data[i]\n elif j == len1+7:\n 
X_test3[count_i,j,:] = final_pos_data[i]\n elif j == len1+8:\n X_test3[count_i,j,:] = final_tokens_data[i]\n elif j == len1+9:\n X_test3[count_i,j,:] = final_dep_data[i]\n else:\n X_test3[count_i,j,:] = empty_word\n\n Y_test3[count_i,:] = dummy_y[i]\n count_i += 1\n \n return X_train3, X_test3, Y_train3, Y_test3\n ","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"1c0d8684-466d-48e6-9afa-bc30ff388a26","_uuid":"b7d99866c070dbf1851fa20d0a242cf1623997ee","collapsed":true,"scrolled":true,"trusted":true},"cell_type":"code","source":"model = Sequential()\nmodel.add(Conv1D(filters=128, kernel_size=9, padding='same', activation='relu', input_shape=(max(maxlength)+10,512)))\nmodel.add(Dropout(0.25))\nmodel.add(MaxPooling1D(pool_size=2))\nmodel.add(Dropout(0.25))\nmodel.add(Conv1D(filters=128, kernel_size=7, padding='same', activation='relu'))\nmodel.add(Dropout(0.25))\nmodel.add(MaxPooling1D(pool_size=2))\nmodel.add(Dropout(0.25))\nmodel.add(Conv1D(filters=128, kernel_size=5, padding='same', activation='relu'))\nmodel.add(Dropout(0.25))\n#model.add(MaxPooling1D(pool_size=2))\n#model.add(Dropout(0.25))\n#model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))\n#model.add(Dropout(0.25))\n#model.add(MaxPooling1D(pool_size=2))\n#model.add(Dropout(0.25))\n\n#model.add(Bidirectional(LSTM(50, dropout=0.3, recurrent_dropout=0.2, return_sequences=True)))\nmodel.add(Bidirectional(LSTM(50, dropout=0.25, recurrent_dropout=0.2)))\nmodel.add(Dense(4, activation='softmax'))\n# Four mutually exclusive one-hot classes with a softmax output call for categorical cross-entropy\nmodel.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])\nprint(model.summary())","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"a8ec219d-90d2-4f03-97ac-52323bd112c8","_uuid":"a81ad9005cf0c09a134930e20e9e14d3809e3c82","collapsed":true,"scrolled":false,"trusted":true},"cell_type":"code","source":"from sklearn.metrics import accuracy_score\nfrom keras.callbacks import ModelCheckpoint \n\nfinal_accuracies = []\n \nfilename = 
'weights.best.from_scratch%s.hdf5' % 9\ncheckpointer = ModelCheckpoint(filepath=filename, verbose=1, save_best_only=True)\nX_train3, X_test3, Y_train3, Y_test3 = extractTrainingAndTestingData(9)\nmodel.fit(X_train3, Y_train3, epochs=10, batch_size=512, callbacks=[checkpointer], validation_data=(X_test3, Y_test3))\nmodel.load_weights(filename)\n\nfor i in range(10):\n filename = 'weights.best.from_scratch%s.hdf5' % i\n checkpointer = ModelCheckpoint(filepath=filename, verbose=1, save_best_only=True)\n X_train3, X_test3, Y_train3, Y_test3 = extractTrainingAndTestingData(i)\n model.fit(X_train3, Y_train3, epochs=10, batch_size=512, callbacks=[checkpointer], validation_data=(X_test3, Y_test3))\n model.load_weights(filename)\n predicted = np.rint(model.predict(X_test3))\n final_accuracies.append(accuracy_score(Y_test3, predicted))\n print(accuracy_score(Y_test3, predicted))","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"d7883c70-f74b-4389-85fe-ff575dd15a61","_uuid":"2d96181354f0499ae115b2041997108afa8946dd","collapsed":true,"trusted":true},"cell_type":"code","source":"print(sum(final_accuracies) / len(final_accuracies))","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"a81ce0af-44a8-4081-8a54-73c8ffc97bdf","_uuid":"e9c5ad94f4a4c5ad58e9e588782fc544839b8169","collapsed":true,"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.4","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1} -------------------------------------------------------------------------------- /NLP_using_Glove_and_Spacy_.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | 
"metadata": { 5 | "colab": { 6 | "name": "NLP using Glove and Spacy .ipynb", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "include_colab_link": true 10 | }, 11 | "language_info": { 12 | "name": "python", 13 | "version": "3.6.4", 14 | "mimetype": "text/x-python", 15 | "codemirror_mode": { 16 | "name": "ipython", 17 | "version": 3 18 | }, 19 | "pygments_lexer": "ipython3", 20 | "nbconvert_exporter": "python", 21 | "file_extension": ".py" 22 | }, 23 | "kernelspec": { 24 | "display_name": "Python 3", 25 | "language": "python", 26 | "name": "python3" 27 | } 28 | }, 29 | "cells": [ 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "id": "view-in-github", 34 | "colab_type": "text" 35 | }, 36 | "source": [ 37 | "\"Open" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": { 43 | "_cell_guid": "e35c5b34-ef39-4431-9bfd-4b438b6d072b", 44 | "_uuid": "88c8394e12ad493556c3ec9581e8aec871dad89e", 45 | "id": "v4d9ZfYe--0q", 46 | "colab_type": "text" 47 | }, 48 | "source": [ 49 | "more coming soon!\n", 50 | "\n", 51 | "if you like my notebook please upvote for me.\n", 52 | "\n", 53 | "Thank you." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "metadata": { 59 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 60 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", 61 | "trusted": false, 62 | "id": "Bvj_YGeu--0s", 63 | "colab_type": "code", 64 | "colab": {} 65 | }, 66 | "source": [ 67 | "# This Python 3 environment comes with many helpful analytics libraries installed\n", 68 | "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", 69 | "# For example, here's several helpful packages to load in \n", 70 | "\n", 71 | "import numpy as np # linear algebra\n", 72 | "import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n", 73 | "\n", 74 | "# Input data files are available in the \"../input/\" directory.\n", 75 | "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", 76 | "\n", 77 | "import os\n", 78 | "print(os.listdir(\"../input\"))\n", 79 | "\n", 80 | "# Any results you write to the current directory are saved as output." 81 | ], 82 | "execution_count": 0, 83 | "outputs": [] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "metadata": { 88 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", 89 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a", 90 | "trusted": false, 91 | "id": "4NCacakt--0w", 92 | "colab_type": "code", 93 | "colab": {} 94 | }, 95 | "source": [ 96 | "import pandas as pd,numpy as np,seaborn as sns\n", 97 | "from keras.preprocessing.text import Tokenizer\n", 98 | "from keras.preprocessing.sequence import pad_sequences\n", 99 | "from keras.utils import to_categorical\n", 100 | "import matplotlib.pyplot as plt\n", 101 | "%matplotlib inline\n", 102 | "import spacy" 103 | ], 104 | "execution_count": 0, 105 | "outputs": [] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "metadata": { 110 | "scrolled": true, 111 | "_cell_guid": "721878c6-e111-4559-9e9c-706052f573d1", 112 | "_uuid": "4e095dc26e6699de771fcb27bdcd3ba177c7320d", 113 | "trusted": false, 114 | "id": "ZjLfJkgv--00", 115 | "colab_type": "code", 116 | "colab": {} 117 | }, 118 | "source": [ 119 | "yelp_reviews = pd.read_csv('../input/yelp-dataset/yelp_review.csv',nrows=10000)\n", 120 | "yelp_reviews.head()" 121 | ], 122 | "execution_count": 0, 123 | "outputs": [] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "metadata": { 128 | "_cell_guid": "87c3b275-5a05-4f7e-907c-336190399694", 129 | "_uuid": "5782e772b4a9df661de0d74277388a0d7b686bbf", 130 | "trusted": false, 131 | "id": "3T5EycTE--04", 132 | "colab_type": "code", 133 | "colab": {} 134 | }, 135 | "source": [ 136 | "yelp_reviews.columns" 137 | ], 138 | "execution_count": 0, 
139 | "outputs": [] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "metadata": { 144 | "_cell_guid": "231b382c-7455-4664-8aa4-48c0ce12b620", 145 | "_uuid": "9d79fd330b98cc60622ea42891d4fcec14ba3099", 146 | "trusted": false, 147 | "id": "rLm6Cf0g--08", 148 | "colab_type": "code", 149 | "colab": {} 150 | }, 151 | "source": [ 152 | "yelp_reviews=yelp_reviews.drop(['review_id','user_id','business_id','date','useful','funny','cool'],axis=1)\n", 153 | "yelp_reviews.head()" 154 | ], 155 | "execution_count": 0, 156 | "outputs": [] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "metadata": { 161 | "_cell_guid": "3ddd726c-1fa3-45ff-94fb-67a7cdaf096f", 162 | "_uuid": "52162971a1b62d874389eef678996c6792cf385b", 163 | "trusted": false, 164 | "id": "xZv8jrF8--1A", 165 | "colab_type": "code", 166 | "colab": {} 167 | }, 168 | "source": [ 169 | "yelp_reviews.isnull().any()" 170 | ], 171 | "execution_count": 0, 172 | "outputs": [] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "metadata": { 177 | "_cell_guid": "7a262c5e-587c-4ea0-9c73-eb317ab1f9c3", 178 | "_uuid": "0fe4ce98d33e813e7151f5b40ee50f0999d7babf", 179 | "trusted": false, 180 | "id": "OyVUZ21e--1E", 181 | "colab_type": "code", 182 | "colab": {} 183 | }, 184 | "source": [ 185 | "yelp_reviews.stars.unique()" 186 | ], 187 | "execution_count": 0, 188 | "outputs": [] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "metadata": { 193 | "_cell_guid": "da052314-149a-4900-9620-c37b8b1101c5", 194 | "_uuid": "33988b150c9233a825d8473b19b59de24c2a2652", 195 | "trusted": false, 196 | "id": "YWc-YGND--1I", 197 | "colab_type": "code", 198 | "colab": {} 199 | }, 200 | "source": [ 201 | "sns.countplot(yelp_reviews.stars)" 202 | ], 203 | "execution_count": 0, 204 | "outputs": [] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "metadata": { 209 | "_cell_guid": "bfa44352-556e-4c88-9818-37852215b2bc", 210 | "_uuid": "a2ee8b2d45dc7f5a3531ba055a263c151315f7f4", 211 | "trusted": false, 212 | "id": "mTYE17Kl--1N", 213 | "colab_type": 
"code", 214 | "colab": {} 215 | }, 216 | "source": [ 217 | "yelp_reviews.stars.mode()" 218 | ], 219 | "execution_count": 0, 220 | "outputs": [] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "metadata": { 225 | "_cell_guid": "ff307ae5-24f5-423e-9941-7a3b8fca36ca", 226 | "_uuid": "6e45e353693a534dcacad245f48a326df97a7eeb", 227 | "trusted": false, 228 | "id": "CrkoTUlv--1R", 229 | "colab_type": "code", 230 | "colab": {} 231 | }, 232 | "source": [ 233 | "reviews = yelp_reviews[yelp_reviews.stars!=3]\n" 234 | ], 235 | "execution_count": 0, 236 | "outputs": [] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "metadata": { 241 | "_cell_guid": "b5761e42-7942-4e1e-b48e-771904585dc8", 242 | "_uuid": "a0d93fe4c2c810c5e63c068160d43f4def79510d", 243 | "trusted": false, 244 | "id": "19-Tqjov--1U", 245 | "colab_type": "code", 246 | "colab": {} 247 | }, 248 | "source": [ 249 | "reviews['label'] = reviews['stars'].apply(lambda x: 1 if x>3 else 0)\n", 250 | "reviews = reviews.drop('stars',axis=1)\n", 251 | "reviews.head()" 252 | ], 253 | "execution_count": 0, 254 | "outputs": [] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "metadata": { 259 | "_cell_guid": "3ab1cb7f-81c2-4c06-beb2-34bed9a9ff3d", 260 | "_uuid": "e125ec4b93624993ae4582a0126ba79a610950a4", 261 | "trusted": false, 262 | "id": "3sBBD6aE--1X", 263 | "colab_type": "code", 264 | "colab": {} 265 | }, 266 | "source": [ 267 | "reviews.shape" 268 | ], 269 | "execution_count": 0, 270 | "outputs": [] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "metadata": { 275 | "_cell_guid": "ae7f5373-b105-44b0-b0f2-a77a9a3e03d1", 276 | "_uuid": "fb974ff724f6c730ea91cd0f1a124768a2c2762e", 277 | "trusted": false, 278 | "id": "7hT92-vW--1c", 279 | "colab_type": "code", 280 | "colab": {} 281 | }, 282 | "source": [ 283 | "text = reviews.text.values\n", 284 | "label = reviews.label.values" 285 | ], 286 | "execution_count": 0, 287 | "outputs": [] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "metadata": { 292 | "_cell_guid": 
"b1cd2e70-9100-43a9-951c-0147f8165e8b", 293 | "_uuid": "9f7126c5fad222ad73bce5ae6f4f8be47aa0bb74", 294 | "trusted": false, 295 | "id": "TMIvHY9C--1l", 296 | "colab_type": "code", 297 | "colab": {} 298 | }, 299 | "source": [ 300 | "nlp = spacy.load('en')" 301 | ], 302 | "execution_count": 0, 303 | "outputs": [] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "metadata": { 308 | "_cell_guid": "32152a68-842b-4d90-bfea-eab60b912942", 309 | "_uuid": "e7238b2bf5e3ad7d1ba64d1afc38c4ca7cf14874", 310 | "trusted": false, 311 | "id": "m9tSjdT9--1r", 312 | "colab_type": "code", 313 | "colab": {} 314 | }, 315 | "source": [ 316 | "text[0]" 317 | ], 318 | "execution_count": 0, 319 | "outputs": [] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "metadata": { 324 | "_cell_guid": "ef5b3d18-e94a-4ad8-860d-4e1642534dee", 325 | "_uuid": "6ff0249c86878e3169be5fd0695161765f254fd3", 326 | "trusted": false, 327 | "id": "VYa8-qlA--1t", 328 | "colab_type": "code", 329 | "colab": {} 330 | }, 331 | "source": [ 332 | "parsed_text = nlp(text[0])\n", 333 | "parsed_text" 334 | ], 335 | "execution_count": 0, 336 | "outputs": [] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "metadata": { 341 | "_cell_guid": "1508dcff-7752-4571-bf8a-b96aad36bf8d", 342 | "_uuid": "3e0a2249ec08a3aaeb3552ee791ebd7a675d45f3", 343 | "trusted": false, 344 | "id": "FDMlU5CI--1w", 345 | "colab_type": "code", 346 | "colab": {} 347 | }, 348 | "source": [ 349 | "for i,sentance in enumerate(parsed_text.sents):\n", 350 | " print(i,':',sentance)" 351 | ], 352 | "execution_count": 0, 353 | "outputs": [] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "metadata": { 358 | "_cell_guid": "f8dfc439-eeae-4d7b-9a39-352c48c9c7d1", 359 | "_uuid": "726fa40e5c71128880e9252ed28a03106a6169af", 360 | "trusted": false, 361 | "id": "SAU9LxAd--1z", 362 | "colab_type": "code", 363 | "colab": {} 364 | }, 365 | "source": [ 366 | "for num, entity in enumerate(nlp(text[10]).ents):\n", 367 | " print ('Entity {}:'.format(num + 1), entity, '-', 
entity.label_)" 368 | ], 369 | "execution_count": 0, 370 | "outputs": [] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "metadata": { 375 | "_cell_guid": "659f54d3-86a9-478d-80f0-a8afeed2b81e", 376 | "_uuid": "8eace05f587db1f3d2f5e7cb21f9d66d526a8d6e", 377 | "trusted": false, 378 | "id": "-TnCS-6f--12", 379 | "colab_type": "code", 380 | "colab": {} 381 | }, 382 | "source": [ 383 | "token_pos = [token.pos_ for token in nlp(text[10])]\n", 384 | "tokens = [token for token in nlp(text[10])]\n", 385 | "sd = list(zip(tokens,token_pos))\n", 386 | "sd = pd.DataFrame(sd,columns=['token','pos'])\n", 387 | "sd.head()" 388 | ], 389 | "execution_count": 0, 390 | "outputs": [] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "metadata": { 395 | "_cell_guid": "0e58bbb7-7900-43e0-9410-6a6168cce046", 396 | "_uuid": "32a9ecab6d44f2003a815c0a67b2bd52f2462e24", 397 | "trusted": false, 398 | "id": "Gqdrfc0l--17", 399 | "colab_type": "code", 400 | "colab": {} 401 | }, 402 | "source": [ 403 | "max_num_words = 1000\n", 404 | "max_seq_length = 100\n", 405 | "tokenizer = Tokenizer(num_words=max_num_words)\n" 406 | ], 407 | "execution_count": 0, 408 | "outputs": [] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "metadata": { 413 | "_cell_guid": "46b7d4f9-e3f0-4c16-b9f2-015a1acdd39b", 414 | "_uuid": "7cb3a51c6b07550a764c63c071c2a6c59d23297e", 415 | "trusted": false, 416 | "id": "sNlLRoCU--1_", 417 | "colab_type": "code", 418 | "colab": {} 419 | }, 420 | "source": [ 421 | "len(yelp_reviews)" 422 | ], 423 | "execution_count": 0, 424 | "outputs": [] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "metadata": { 429 | "_cell_guid": "a0e0409b-4aea-4c64-b1f5-b9eb8767ded4", 430 | "_uuid": "6d2db8d7775a75bbdd762598ec3f5193fe82a966", 431 | "trusted": false, 432 | "id": "wwgz57nC--2C", 433 | "colab_type": "code", 434 | "colab": {} 435 | }, 436 | "source": [ 437 | "reviews=yelp_reviews[:100000]\n", 438 | "reviews=reviews[reviews.stars!=3].copy()\n", 439 | "\n", 440 | "reviews[\"labels\"]= 
reviews[\"stars\"].apply(lambda x: 1 if x > 3 else 0)\n", 441 | "reviews=reviews.drop(\"stars\",axis=1)\n", 442 | "\n", 443 | "reviews.head()" 444 | ], 445 | "execution_count": 0, 446 | "outputs": [] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "metadata": { 451 | "_cell_guid": "732547c6-7433-4b45-bfd5-2e8b75530f17", 452 | "_uuid": "96456f9fe4f1fc073b024b20e5148dee9d94981e", 453 | "trusted": false, 454 | "id": "4jb2Jyw8--2F", 455 | "colab_type": "code", 456 | "colab": {} 457 | }, 458 | "source": [ 459 | "texts = reviews[\"text\"].values\n", 460 | "labels = reviews[\"labels\"].values" 461 | ], 462 | "execution_count": 0, 463 | "outputs": [] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "metadata": { 468 | "_cell_guid": "a987b4e3-b7a3-4d6d-97a7-92b889888b9b", 469 | "_uuid": "5cba94b42ddb6a8a72715ab30b4958fff6a84257", 470 | "trusted": false, 471 | "id": "PhhArAe8--2H", 472 | "colab_type": "code", 473 | "colab": {} 474 | }, 475 | "source": [ 476 | "tokenizer.fit_on_texts(texts)\n", 477 | "sequences = tokenizer.texts_to_sequences(texts)\n", 478 | "word_index = tokenizer.word_index\n" 479 | ], 480 | "execution_count": 0, 481 | "outputs": [] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "metadata": { 486 | "_cell_guid": "9b49d3c5-bca8-4e27-9324-a18831cc0361", 487 | "_uuid": "01da830a97419c9b01d285f214f2a79da7ba406e", 488 | "trusted": false, 489 | "id": "Tc8LsCgZ--2K", 490 | "colab_type": "code", 491 | "colab": {} 492 | }, 493 | "source": [ 494 | "len(word_index)" 495 | ], 496 | "execution_count": 0, 497 | "outputs": [] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "metadata": { 502 | "_cell_guid": "c828337f-91cd-4075-9071-9b6667d4a190", 503 | "_uuid": "e1af0af2bfa6556ba03a2d4a08995823b013daa7", 504 | "trusted": false, 505 | "id": "zzovWC6g--2N", 506 | "colab_type": "code", 507 | "colab": {} 508 | }, 509 | "source": [ 510 | "data = pad_sequences(sequences, maxlen=max_seq_length)\n", 511 | "data" 512 | ], 513 | "execution_count": 0, 514 | "outputs": [] 515 
| }, 516 | { 517 | "cell_type": "code", 518 | "metadata": { 519 | "_cell_guid": "76225dee-bfe2-45d6-a12b-19a4dce49d8a", 520 | "_uuid": "43685b899b9e4b41ecc4b1e6e1dfd0fc00049adf", 521 | "trusted": false, 522 | "id": "yZIwifyF--2R", 523 | "colab_type": "code", 524 | "colab": {} 525 | }, 526 | "source": [ 527 | "data.shape" 528 | ], 529 | "execution_count": 0, 530 | "outputs": [] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "metadata": { 535 | "_cell_guid": "4f37aeb5-cdda-4ce4-b2bd-f245cc7d9519", 536 | "_uuid": "f036d4518d625809b13352b9b927dda5b9c4656d", 537 | "trusted": false, 538 | "id": "X6kKT4NW--2T", 539 | "colab_type": "code", 540 | "colab": {} 541 | }, 542 | "source": [ 543 | "labels = to_categorical(np.asarray(labels))" 544 | ], 545 | "execution_count": 0, 546 | "outputs": [] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "metadata": { 551 | "_cell_guid": "6bdbeff6-94cd-438a-a288-e72b934934f8", 552 | "_uuid": "19f951609e8362673c7e5420b355328eea5b3760", 553 | "trusted": false, 554 | "id": "Jd8qASYe--2Y", 555 | "colab_type": "code", 556 | "colab": {} 557 | }, 558 | "source": [ 559 | "labels.shape" 560 | ], 561 | "execution_count": 0, 562 | "outputs": [] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "metadata": { 567 | "_cell_guid": "c6fb93b5-7b9c-4ca8-903a-65526763762f", 568 | "_uuid": "0256e645011ee3050ce32a65d5df8c16b44b298b", 569 | "trusted": false, 570 | "id": "ZEXK5Buc--2e", 571 | "colab_type": "code", 572 | "colab": {} 573 | }, 574 | "source": [ 575 | "validation_split = 0.2\n", 576 | "indices = np.arange(data.shape[0])\n", 577 | "np.random.shuffle(indices)" 578 | ], 579 | "execution_count": 0, 580 | "outputs": [] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "metadata": { 585 | "_cell_guid": "ec3d1c47-e3a2-46c7-88ff-47c0c4f05a2f", 586 | "_uuid": "ccbdfa233af2918969e0560a0738c81b10694154", 587 | "trusted": false, 588 | "id": "FZZ-TMeI--2g", 589 | "colab_type": "code", 590 | "colab": {} 591 | }, 592 | "source": [ 593 | "data = 
data[indices]\n", 594 | "data" 595 | ], 596 | "execution_count": 0, 597 | "outputs": [] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "metadata": { 602 | "_cell_guid": "047fb8fe-8881-4fb6-a350-4939cb833972", 603 | "_uuid": "22d87efa8087046dc6dcba08c829b08e9198a2d7", 604 | "trusted": false, 605 | "id": "ze6lOXqn--2i", 606 | "colab_type": "code", 607 | "colab": {} 608 | }, 609 | "source": [ 610 | "labels = labels[indices]\n", 611 | "labels" 612 | ], 613 | "execution_count": 0, 614 | "outputs": [] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "metadata": { 619 | "_cell_guid": "f40c81b4-c3af-46c2-b3e4-d8bee0fef433", 620 | "_uuid": "2db4e649432a4acbc99b2a2774dde27f4cff6ed5", 621 | "trusted": false, 622 | "id": "-6M0y38l--2l", 623 | "colab_type": "code", 624 | "colab": {} 625 | }, 626 | "source": [ 627 | "nb_validation_samples = int(validation_split*data.shape[0])\n", 628 | "nb_validation_samples" 629 | ], 630 | "execution_count": 0, 631 | "outputs": [] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "metadata": { 636 | "_cell_guid": "896d2744-3ae9-4868-add0-3fcd5b3ad03f", 637 | "_uuid": "02c5008ecbe67e631a99fca96179f5671981b54b", 638 | "trusted": false, 639 | "id": "_SfMM4sO--2o", 640 | "colab_type": "code", 641 | "colab": {} 642 | }, 643 | "source": [ 644 | "x_train = data[:-nb_validation_samples]\n", 645 | "y_train = labels[:-nb_validation_samples]\n", 646 | "x_val = data[-nb_validation_samples:]\n", 647 | "y_val = labels[-nb_validation_samples:]\n" 648 | ], 649 | "execution_count": 0, 650 | "outputs": [] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "metadata": { 655 | "_cell_guid": "886c43cc-4c6f-4e2d-a5d5-dccd7df9cf00", 656 | "_uuid": "d743ab3bd117266860a29263c4330521654a112b", 657 | "trusted": false, 658 | "id": "80hB96jF--2t", 659 | "colab_type": "code", 660 | "colab": {} 661 | }, 662 | "source": [ 663 | "glove_dir = '../input/glove-global-vectors-for-word-representation/'" 664 | ], 665 | "execution_count": 0, 666 | "outputs": [] 667 | }, 668 | { 
669 | "cell_type": "code", 670 | "metadata": { 671 | "_cell_guid": "fb8b1a07-856d-4122-bf6d-82e3c1f23183", 672 | "_uuid": "6259f81e098d4ef19d8dac2afec130e06a0309ad", 673 | "trusted": false, 674 | "id": "fb7o2ZJG--2x", 675 | "colab_type": "code", 676 | "colab": {} 677 | }, 678 | "source": [ 679 | "embedding_index = {}\n", 680 | "\n", 681 | "f = open(os.path.join(glove_dir,'glove.6B.50d.txt'))\n", 682 | "for line in f:\n", 683 | " values = line.split()\n", 684 | " word = values[0]\n", 685 | " coefs = np.asarray(values[1:],dtype='float32')\n", 686 | " embedding_index[word] = coefs\n", 687 | "f.close()\n", 688 | "\n", 689 | "print('found word vecs: ',len(embedding_index))" 690 | ], 691 | "execution_count": 0, 692 | "outputs": [] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "metadata": { 697 | "_cell_guid": "3a75feac-fe78-4f45-ae23-b085a8d52e9d", 698 | "_uuid": "9cbc91bb3b5df510066b9972c81ee5e0002baa72", 699 | "trusted": false, 700 | "id": "PFja6NNo--22", 701 | "colab_type": "code", 702 | "colab": {} 703 | }, 704 | "source": [ 705 | "embedding_dim = 50\n", 706 | "embedding_matrix = np.zeros((len(word_index)+1,embedding_dim))\n", 707 | "embedding_matrix.shape" 708 | ], 709 | "execution_count": 0, 710 | "outputs": [] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "metadata": { 715 | "_cell_guid": "24dd27c6-4e9d-450a-8ae0-4bcc61e9ff2e", 716 | "_uuid": "c22b2349cfb2113678d8d20de841470fcdc56709", 717 | "trusted": false, 718 | "id": "efM_j-f8--25", 719 | "colab_type": "code", 720 | "colab": {} 721 | }, 722 | "source": [ 723 | "for word,i in word_index.items():\n", 724 | " embedding_vector = embedding_index.get(word)\n", 725 | " if embedding_vector is not None:\n", 726 | " embedding_matrix[i] = embedding_vector" 727 | ], 728 | "execution_count": 0, 729 | "outputs": [] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "metadata": { 734 | "_cell_guid": "d8aec772-3fb4-46c9-8988-66da287ba1ab", 735 | "_uuid": "60df2f72c00f55fe26169328b8ef0d05eca0fab4", 736 | "trusted": 
false, 737 | "id": "N6ygSMux--29", 738 | "colab_type": "code", 739 | "colab": {} 740 | }, 741 | "source": [ 742 | "from keras.layers import Embedding\n", 743 | "embedding_layer = Embedding(len(word_index)+1,embedding_dim,weights=[embedding_matrix],input_length=max_seq_length,trainable=False)" 744 | ], 745 | "execution_count": 0, 746 | "outputs": [] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "metadata": { 751 | "_cell_guid": "6618d995-5c04-41b4-9724-8892adfe50df", 752 | "_uuid": "d54df4e207d8f7874884998fa3c637177f3d5972", 753 | "trusted": false, 754 | "id": "zsfcBI53--3C", 755 | "colab_type": "code", 756 | "colab": {} 757 | }, 758 | "source": [ 759 | "from keras.layers import Bidirectional,GlobalMaxPool1D,Conv1D\n", 760 | "from keras.layers import LSTM,Input,Dense,Dropout,Activation\n", 761 | "from keras.models import Model" 762 | ], 763 | "execution_count": 0, 764 | "outputs": [] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "metadata": { 769 | "_cell_guid": "f9dfbbed-0720-44a0-8679-2b9cc3c1ad3f", 770 | "_uuid": "3f84a01f25dd92b68681353714dfaa8a2fca70d1", 771 | "trusted": false, 772 | "id": "MDU2Y4hs--3G", 773 | "colab_type": "code", 774 | "colab": {} 775 | }, 776 | "source": [ 777 | "inp = Input(shape=(max_seq_length,))\n", 778 | "x = embedding_layer(inp)\n", 779 | "x = Bidirectional(LSTM(50,return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)\n", 780 | "x = GlobalMaxPool1D()(x)\n", 781 | "x = Dense(50,activation='relu')(x)\n", 782 | "x = Dropout(0.1)(x)\n", 783 | "x = Dense(2,activation='sigmoid')(x)\n", 784 | "model = Model(inputs=inp,outputs=x)" 785 | ], 786 | "execution_count": 0, 787 | "outputs": [] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "metadata": { 792 | "_cell_guid": "063aa4eb-1a13-49e3-bc00-bbd45be32238", 793 | "_uuid": "b7d7b5275784367bb2d508fe39c313548c528f25", 794 | "trusted": false, 795 | "id": "JAlDHW_U--3L", 796 | "colab_type": "code", 797 | "colab": {} 798 | }, 799 | "source": [ 800 | 
"model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])\n" 801 | ], 802 | "execution_count": 0, 803 | "outputs": [] 804 | }, 805 | { 806 | "cell_type": "code", 807 | "metadata": { 808 | "_cell_guid": "854d7171-c903-4a2b-9c40-6322aeb61d4d", 809 | "_uuid": "4753f2c45b326b074d412a49e22eb0890ab76542", 810 | "trusted": false, 811 | "id": "0Qd1ekdY--3P", 812 | "colab_type": "code", 813 | "colab": {} 814 | }, 815 | "source": [ 816 | "print(x_train.shape)\n", 817 | "print(y_train.shape)\n", 818 | "print(x_val.shape)\n", 819 | "print(y_val.shape)" 820 | ], 821 | "execution_count": 0, 822 | "outputs": [] 823 | }, 824 | { 825 | "cell_type": "code", 826 | "metadata": { 827 | "_cell_guid": "436624ed-35d9-42cc-956a-501e6a6a7818", 828 | "_uuid": "da2981747e220ae713c61eda9857bfa42c48043d", 829 | "trusted": false, 830 | "id": "diHqdQLf--3V", 831 | "colab_type": "code", 832 | "colab": {} 833 | }, 834 | "source": [ 835 | "model.fit(x_train,y_train,validation_data=(x_val,y_val),epochs=20,batch_size=128);" 836 | ], 837 | "execution_count": 0, 838 | "outputs": [] 839 | }, 840 | { 841 | "cell_type": "code", 842 | "metadata": { 843 | "_cell_guid": "e2219519-c3b7-477e-8636-f5f2d3882b4b", 844 | "_uuid": "4196ab0f3a61c0a08a5420916e5eaa7ec1dc4af3", 845 | "trusted": false, 846 | "id": "6n_Z9-LY--3Y", 847 | "colab_type": "code", 848 | "colab": {} 849 | }, 850 | "source": [ 851 | "score = model.evaluate(x_val,y_val)\n", 852 | "score" 853 | ], 854 | "execution_count": 0, 855 | "outputs": [] 856 | }, 857 | { 858 | "cell_type": "code", 859 | "metadata": { 860 | "_cell_guid": "4b76936e-96e5-45c9-906d-5124891449b3", 861 | "_uuid": "e9f60824821a662217f9c84bf281d581422d2534", 862 | "trusted": false, 863 | "id": "pP5fTYLB--3c", 864 | "colab_type": "code", 865 | "colab": {} 866 | }, 867 | "source": [ 868 | "score[1]*100" 869 | ], 870 | "execution_count": 0, 871 | "outputs": [] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": { 876 | "_cell_guid": 
"032a6693-1645-4ee4-86cc-d493120c83dd", 877 | "_uuid": "e85ec637a551099c76d058939bc0f4b972dbb221", 878 | "id": "S8UJZGUg--3g", 879 | "colab_type": "text" 880 | }, 881 | "source": [ 882 | "If you like my notebook please vote for me.\n", 883 | "Thank you " 884 | ] 885 | } 886 | ] 887 | } -------------------------------------------------------------------------------- /basicMLmodelsDecepOpSpam.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load in \n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n\n# Input data files are available in the \"../input/\" directory.\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n for filename in filenames:\n print(os.path.join(dirname, filename))\n\n# Any results you write to the current directory are saved as output.","execution_count":136,"outputs":[{"output_type":"stream","text":"/kaggle/input/yelp-dataset-based-on-fake-reviewers/cleaned_data.csv\n/kaggle/input/deceptive-opinion-spam-corpus/deceptive-opinion.csv\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"import re\nimport string\nfrom nltk.corpus import stopwords\nfrom bs4 import BeautifulSoup\nimport nltk\nfrom nltk.corpus import stopwords\nfrom nltk.stem import SnowballStemmer","execution_count":137,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df = 
pd.read_csv('/kaggle/input/deceptive-opinion-spam-corpus/deceptive-opinion.csv')","execution_count":138,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.head()","execution_count":139,"outputs":[{"output_type":"execute_result","execution_count":139,"data":{"text/plain":" deceptive hotel polarity source \\\n0 truthful conrad positive TripAdvisor \n1 truthful hyatt positive TripAdvisor \n2 truthful hyatt positive TripAdvisor \n3 truthful omni positive TripAdvisor \n4 truthful hyatt positive TripAdvisor \n\n text \n0 We stayed for a one night getaway with family ... \n1 Triple A rate with upgrade to view room was le... \n2 This comes a little late as I'm finally catchi... \n3 The Omni Chicago really delivers on all fronts... \n4 I asked for a high floor away from the elevato... ","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"df = df.drop([\"hotel\", \"polarity\",\"source\"], axis=1)","execution_count":140,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df = df.sample(frac=1)","execution_count":141,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.head()","execution_count":142,"outputs":[{"output_type":"execute_result","execution_count":142,"data":{"text/plain":" deceptive text\n494 deceptive Homewood Suites by Hilton Chicago Downtown is ...\n1547 deceptive When I first made reservations at The Palmer H...\n782 deceptive The Palmer House Hilton was recommended to me ...\n75 truthful The reviews we read were a bit mixed, but I th...\n1421 deceptive My wife and I stayed at the Ambassador East Ho...","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn import preprocessing \n\n# LabelEncoder maps the string labels to integers (classes are sorted, so deceptive=0, truthful=1). \nlabel_encoder = preprocessing.LabelEncoder() \n\n# Encode labels in column 'deceptive'. \ndf['deceptive']= label_encoder.fit_transform(df['deceptive']) \n\ndf['deceptive'].unique() ","execution_count":143,"outputs":[{"output_type":"execute_result","execution_count":143,"data":{"text/plain":"array([0, 1])"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.head()","execution_count":144,"outputs":[{"output_type":"execute_result","execution_count":144,"data":{"text/plain":" deceptive text\n494 0 Homewood Suites by Hilton Chicago Downtown is ...\n1547 0 When I first made reservations at The Palmer H...\n782 0 The Palmer House Hilton was recommended to me ...\n75 1 The reviews we read were a bit mixed, but I th...\n1421 0 My wife and I stayed at the Ambassador East Ho...","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#dataset description #truthful=1 deceptive=0\ndf.groupby('deceptive').describe()","execution_count":145,"outputs":[{"output_type":"execute_result","execution_count":145,"data":{"text/plain":" text \n count unique top freq\ndeceptive \n0 800 800 I stayed at the Swissotel Chicago while I was ... 1\n1 800 796 Very disappointed in our stay in Chicago Monoc... 2","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"def clean_text(text):\n\n    ## Remove punctuation\n    text = text.translate(str.maketrans('', '', string.punctuation))\n\n    ## Convert words to lower case and split them\n    text = text.lower().split()\n\n    ## Remove stop words and tokens shorter than 3 characters\n    stops = set(stopwords.words(\"english\"))\n    text = [w for w in text if w not in stops and len(w) >= 3]\n\n    text = \" \".join(text)\n\n    # Clean the text\n    text = re.sub(r\"[^A-Za-z0-9^,!.\\/'+-=]\", \" \", text)\n    text = re.sub(r\"what's\", \"what is \", text)\n    text = re.sub(r\"\\'s\", \" \", text)\n    text = re.sub(r\"\\'ve\", \" have \", text)\n    text = re.sub(r\"n't\", \" not \", text)\n    text = re.sub(r\"i'm\", \"i am \", text)\n    text = re.sub(r\"\\'re\", \" are \", text)\n    text = re.sub(r\"\\'d\", \" would \", text)\n    text = re.sub(r\"\\'ll\", \" will \", text)\n    text = re.sub(r\",\", \" \", text)\n    text = re.sub(r\"\\.\", \" \", text)\n    text = re.sub(r\"!\", \" ! \", text)\n    text = re.sub(r\"\\/\", \" \", text)\n    text = re.sub(r\"\\^\", \" ^ \", text)\n    text = re.sub(r\"\\+\", \" + \", text)\n    text = re.sub(r\"\\-\", \" - \", text)\n    text = re.sub(r\"\\=\", \" = \", text)\n    text = re.sub(r\"'\", \" \", text)\n    text = re.sub(r\"(\\d+)(k)\", r\"\\g<1>000\", text)\n    text = re.sub(r\":\", \" : \", text)\n    text = re.sub(r\" e g \", \" eg \", text)\n    text = re.sub(r\" b g \", \" bg \", text)\n    text = re.sub(r\" u s \", \" american \", text)\n    text = re.sub(r\"\\0s\", \"0\", text)\n    text = re.sub(r\" 9 11 \", \"911\", text)\n    text = re.sub(r\"e - mail\", \"email\", text)\n    text = re.sub(r\"j k\", \"jk\", text)\n    text = re.sub(r\"\\s{2,}\", \" \", text)\n\n    # Stem each remaining word\n    text = text.split()\n    stemmer = SnowballStemmer('english')\n    stemmed_words = [stemmer.stem(word) for word in text]\n    text = \" \".join(stemmed_words)\n\n    return text","execution_count":146,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Some preprocessing that will be common to all the text classification 
methods\n\npuncts = [',', '.', '\"', ':', ')', '(', '-', '!', '?', '|', ';', \"'\", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\\\', '•', '~', '@', '£', \n '·', '_', '{', '}', '©', '^', '®', '`', '<', '→', '°', '€', '™', '›', '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', \n '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', \n '▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', \n '∙', ')', '↓', '、', '│', '(', '»', ',', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]\n\ndef clean_char(x):\n x = str(x)\n for punct in puncts:\n if punct in x:\n x = x.replace(punct, f' {punct} ')\n return x","execution_count":147,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"def clean_numbers(x):\n if bool(re.search(r'\\d', x)):\n x = re.sub('[0-9]{5,}', '#####', x)\n x = re.sub('[0-9]{4}', '####', x)\n x = re.sub('[0-9]{3}', '###', x)\n x = re.sub('[0-9]{2}', '##', x)\n return x","execution_count":148,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df['text'] = df['text'].map(lambda a: clean_numbers(a))","execution_count":149,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df['text'] = df['text'].map(lambda a: clean_char(a))","execution_count":150,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df['text'] = df['text'].map(lambda a: clean_text(a))","execution_count":151,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df['text']","execution_count":152,"outputs":[{"output_type":"execute_result","execution_count":152,"data":{"text/plain":"494 homewood suit hilton chicago downtown wonder h...\n1547 first made reserv palmer hous hilton excit gor...\n782 palmer hous hilton recommend friend visit chic...\n75 review read bit mix thought excel stay 
splendi...\n1421 wife stay ambassador east hotel last month son...\n447 hyatt regenc chicago hotel delight stay never ...\n1542 servic subpar room need better clean check cou...\n170 stay hard rock januari night locat michigan av...\n264 recent trip chicago attend major trade show pl...\n43 husband decid take trip chicago last minut qui...\n897 noisi constant water run pipe terribl much bet...\n1292 disappoint hotel stay swissotel enjoy much ser...\n1316 one better experi first arriv hotel tri get ro...\n331 wife redeem hilton reward point stay night pal...\n0 stay one night getaway famili thursday tripl a...\n1306 husband stay hotel suppos romant weekend far c...\n1085 disappoint stay chicago monoco stay mani time ...\n1121 use hotel first time chicago price classifi in...\n98 stay sheraton navi pier first weekend novemb p...\n981 book two room four month advanc talbott place ...\n926 travel chicago mani time busi expect hotel bas...\n559 stay hotel weekend thought realli nice bed ext...\n573 talbot hotel eleg place take wife husband week...\n529 swissotel chicago delight visit locat downtown...\n159 back knickerbock would give best rate believ b...\n70 wife stay middl februari took subway hotel har...\n968 request quiet room sever week advanc yet given...\n727 intercontinent hotel truli hidden treasur nest...\n133 perfect locat clean courteous staff ad great s...\n1567 although architectur hotel quaint posit teribl...\n ... 
\n608 stay two night meet upscal chain hotel clean s...\n1002 found hotel small luxuri hotel world booklet s...\n870 bare averag hotel premium price hotel beauti l...\n1271 whenev decid stay fairmont chicago millennium ...\n761 friend visit amalfi hotel last week chicago fo...\n1178 room wasveri tini warm give illus control temp...\n717 downtown chicago hilton best combin thing look...\n704 enjoy stay jame chicago much hotel eleg uber m...\n1177 fourth time stay allegro previous time noth go...\n304 strike architectur begin describ one best hote...\n1520 recent stay hotel allegro chicago busi trip sa...\n213 stay night great hotel updat room clean use ti...\n430 realli enjoy stay beauti hotel entir staff hel...\n448 omni chicago hotel offer great amen comfort ex...\n1392 wife stay abassador east hotel august attend a...\n1358 sheraton chicago hotel tower one wors hotel st...\n1154 locat good check ask non smoke room given floo...\n521 sheraton chicago hotel tower everyth could ask...\n1018 book travel websit hotel current undergo major...\n147 recent celebr anniversari chicago stay sherato...\n639 husband came ambassador east hotel sister wed ...\n407 recent wed omni hotel chicago pleas decis use ...\n555 enjoy stay swissotel downtown chicago eleg cla...\n1064 hotel total overr yes locat awesom plenti near...\n1005 hotel shambl furnitur liter fall apart staff r...\n361 stay king suit hotel allegro memori day weeken...\n1250 advertis luxuri hotel conrad chicago goug gran...\n784 go chicago last month want nice hotel close re...\n480 first let say wonder wife refer hotel friend t...\n696 stay hilton chicago pleasur arriv departur sta...\nName: text, Length: 1600, dtype: object"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.describe()","execution_count":153,"outputs":[{"output_type":"execute_result","execution_count":153,"data":{"text/plain":" deceptive\ncount 1600.000000\nmean 0.500000\nstd 0.500156\nmin 0.000000\n25% 0.000000\n50% 
0.500000\n75% 1.000000\nmax 1.000000","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.info()","execution_count":154,"outputs":[{"output_type":"stream","text":"\nInt64Index: 1600 entries, 494 to 696\nData columns (total 2 columns):\ndeceptive 1600 non-null int64\ntext 1600 non-null object\ndtypes: int64(1), object(1)\nmemory usage: 37.5+ KB\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"x = df['text']\ny = df['deceptive']","execution_count":155,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\nfrom sklearn.metrics import accuracy_score, confusion_matrix","execution_count":156,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(df['text'], df['deceptive'], random_state=5)\nprint('Number of rows in the total set: {}'.format(df.shape[0]))\nprint('Number of rows in the training set: {}'.format(X_train.shape[0]))\nprint('Number of rows in the test set: {}'.format(X_test.shape[0]))","execution_count":157,"outputs":[{"output_type":"stream","text":"Number of rows in the total set: 1600\nNumber of rows in the training set: 1200\nNumber of rows in the test set: 400\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"X_train, X_test, y_train, y_test","execution_count":158,"outputs":[{"output_type":"execute_result","execution_count":158,"data":{"text/plain":"(894 general speak noth bad place would clean issu ...\n 471 husband stay short get away weekend love conve...\n 1455 made regular busi trip chicago decid stay hote...\n 595 magnific mile chicago great place visit stay a...\n 22 actual book reserv hotel phone got great rate ...\n 322 wife decid spend three day chicago last summer...\n 865 line check desk tremend long decid use compute...\n 
1385 recent trip chicago stay ambassador east hotel...\n 862 omni chosen locat whichwork perfect bed wond e...\n 874 stay fairmont two saturday row stay disappoint...\n 1464 sofitel chicago water tower downtown area adve...\n 281 want nice place stay night dinner theater daug...\n 481 thank sheraton tower invit enjoy indoor pool g...\n 474 hyatt regenc chicago one beauti hotel ever sta...\n 690 hilton hotel help make trip chicago central lo...\n 680 stay sever differ hotel chicago jame best ever...\n 908 would recommend stay swissotel chicago travel ...\n 1206 chicago mani wonder luxuri hotel tourist busi ...\n 992 chicago skinni hope run wash rock star cece de...\n 1007 travel chicago visit daughter live citi supris...\n 1049 check sofitel chicago water tower day left hot...\n 1074 parent book five night jame locat good review ...\n 1122 stay pre construct nois problem other experien...\n 186 stylish hotel locat right north michigan avenu...\n 1424 stay hotel labor day weekend pretti aw staff r...\n 413 recent travel chicago busi confer lucki enough...\n 67 husband stay new year eve weekend excel hotel ...\n 225 prepar hotel old histor list updat done doubt ...\n 213 stay night great hotel updat room clean use ti...\n 397 hotel beauti excel locat minor issu hotel clea...\n ... 
\n 26 simpli nice place stay great deal room impress...\n 323 wife spent sever night getaway excurs choos pr...\n 797 last month husband stay intercontinent chicago...\n 1205 realli look forward nice relax stay end long v...\n 657 stay weekend visit friend town littl pricey de...\n 1077 locat ideal high ceil room appear much larger ...\n 1466 travel chicago stay sofitel chestnut street th...\n 1280 stay millennium knickerbock hotel chicago one ...\n 597 stay affinia last weekend ideal wait return to...\n 1176 stay amalfi night say read review expect littl...\n 399 second time stay much look forward weekend ama...\n 1032 sit hotel right review impress room hotel mode...\n 666 wife stay last weekend honeymoon wonder hotel ...\n 727 intercontinent hotel truli hidden treasur nest...\n 1465 stay hotel like jame price point expect certai...\n 59 great properti excel locat wonder staff everyo...\n 1546 hotel dizzi bold color pattern everywher smell...\n 1216 friend stay hyatt regenc chicago weekend visit...\n 76 husband wonder stay omni chicago hotel contact...\n 698 stay jame one bedroom apart two week chicago v...\n 204 hard rock hotel natur choic book block room se...\n 1067 hotel impress upon enter staff friend howev fe...\n 1035 stand craptast excus shower buttnak cold soap ...\n 1479 stay last august truli glad never stay websit ...\n 1069 recent return trip chicago husband babi anoth ...\n 1478 disappoint hotel front desk clerk rude person ...\n 957 say avoid place cost reserv assist took minut ...\n 254 hotel reserv anoth hotel set read negat review...\n 490 moment walk impress staff courteous knowledg a...\n 329 stay chicago night dec read review trip adviso...\n Name: text, Length: 1200, dtype: object,\n 1187 disappoint say least heard rave rave hotel all...\n 1258 hyatt regenc chicago basic cater guest want fe...\n 1474 visit chicago area chose hotel monaco chicago ...\n 41 paid night pricelin excel june room fantast vi...\n 1532 need find last minut hotel stay 
night came one...\n 607 write recent stay ambassador east hotel want l...\n 1503 found stay hotel monaco chicago less satisfact...\n 1159 stay palmer hous attend confer chicago state b...\n 1061 stay enthusiast posit review trip advisor yelp...\n 843 stay time around mark differ qualiti servic st...\n 910 upon first enter hotel greet friend help bellm...\n 117 visit chicago two teenag daughter homewood sui...\n 879 stay hotel week look forward stay top class ho...\n 840 fairmont hotel general prefer choic true disap...\n 541 hotel one superior hotel price reason benefit ...\n 1484 arriv find room readi despit past 3pm check ti...\n 810 outmod worn furnish combin poor origin design ...\n 646 ask better home base check downtown chicago ho...\n 989 stay talbott time alway like time first visit ...\n 347 love hotel six floor make easi get elev recept...\n 644 travel chicago husband romant weekend away sta...\n 1539 recent stay amalfi hotel chicago disappoint ri...\n 406 visit conrad chicago heart downtown luxuri exp...\n 198 stay great review true high impress high quali...\n 904 wife frequent stay downtown night citi mani ho...\n 1226 name conrad luxuri hilton hotel illicit noth i...\n 769 hotel allegro chicago beauti place origin drea...\n 1141 check room empti beer bottl dirti short closet...\n 1469 hotel monaco hotel say charact found decor tac...\n 877 pros con pros good locat nice staff clean room...\n ... 
\n 1331 recent stay homewood suit hilton chicago downt...\n 748 night ago stay hotel allegro theater district ...\n 320 return girl shop sight see trip palmer hous li...\n 1437 book room hotel expect great accommod consid p...\n 1349 stay three night recent hilton homewood suit d...\n 1208 origin chosen conrad chicago hotel locat near ...\n 278 need extra night chicago gratus stay penninsul...\n 1530 although intercontinent chicago hotel locat ma...\n 653 famili chose stay chicago hilton last trip cit...\n 1219 wife went chicago enjoy weekend stay hyatt reg...\n 572 got marri chicago area pass weeknd guest stay ...\n 802 2nd time stay hyatt regenc chicago first time ...\n 871 famili dub omni chicago fawlti tower alway see...\n 90 back busi trip homewood great locat magnific m...\n 421 windi citi fairmont hotel one chicago best rec...\n 1354 walk millennium knickerbock hotel first though...\n 967 back spend memori day weekend chicago decid st...\n 885 stay swissotel three day weekend disappoint se...\n 216 stay night apri check stay perfect round staff...\n 662 decid stay jame last weekend kid reserv famili...\n 1292 disappoint hotel stay swissotel enjoy much ser...\n 1241 latest busi trip wife recent stay omni chicago...\n 543 perfect place coupl get away locat better thre...\n 818 frustrat hotel post deal promin websit claim s...\n 1368 stay affina chicago anniversari hotel great lo...\n 1018 book travel websit hotel current undergo major...\n 1554 genuin surpris stay hotel wall paper thin coul...\n 1044 return jame hotel stay park hyatt stay chicago...\n 89 stay busi night realli like attract classi com...\n 1235 recent displeasur stay conrad chicago although...\n Name: text, Length: 400, dtype: object,\n 894 1\n 471 0\n 1455 0\n 595 0\n 22 1\n 322 1\n 865 1\n 1385 0\n 862 1\n 874 1\n 1464 0\n 281 1\n 481 0\n 474 0\n 690 0\n 680 0\n 908 1\n 1206 0\n 992 1\n 1007 1\n 1049 1\n 1074 1\n 1122 1\n 186 1\n 1424 0\n 413 0\n 67 1\n 225 1\n 213 1\n 397 1\n ..\n 26 1\n 323 
1\n 797 0\n 1205 0\n 657 0\n 1077 1\n 1466 0\n 1280 0\n 597 0\n 1176 1\n 399 1\n 1032 1\n 666 0\n 727 0\n 1465 0\n 59 1\n 1546 0\n 1216 0\n 76 1\n 698 0\n 204 1\n 1067 1\n 1035 1\n 1479 0\n 1069 1\n 1478 0\n 957 1\n 254 1\n 490 0\n 329 1\n Name: deceptive, Length: 1200, dtype: int64,\n 1187 1\n 1258 0\n 1474 0\n 41 1\n 1532 0\n 607 0\n 1503 0\n 1159 1\n 1061 1\n 843 1\n 910 1\n 117 1\n 879 1\n 840 1\n 541 0\n 1484 0\n 810 1\n 646 0\n 989 1\n 347 1\n 644 0\n 1539 0\n 406 0\n 198 1\n 904 1\n 1226 0\n 769 0\n 1141 1\n 1469 0\n 877 1\n ..\n 1331 0\n 748 0\n 320 1\n 1437 0\n 1349 0\n 1208 0\n 278 1\n 1530 0\n 653 0\n 1219 0\n 572 0\n 802 1\n 871 1\n 90 1\n 421 0\n 1354 0\n 967 1\n 885 1\n 216 1\n 662 0\n 1292 0\n 1241 0\n 543 0\n 818 1\n 1368 0\n 1018 1\n 1554 0\n 1044 1\n 89 1\n 1235 0\n Name: deceptive, Length: 400, dtype: int64)"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.feature_extraction.text import CountVectorizer\ncount_vector = CountVectorizer()\nprint(count_vector)","execution_count":159,"outputs":[{"output_type":"stream","text":"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n dtype=, encoding='utf-8', input='content',\n lowercase=True, max_df=1.0, max_features=None, min_df=1,\n ngram_range=(1, 1), preprocessor=None, stop_words=None,\n strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n tokenizer=None, vocabulary=None)\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"training_data = count_vector.fit_transform(X_train)\ntesting_data = count_vector.transform(X_test)","execution_count":160,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.naive_bayes import MultinomialNB\nnaive_bayes = MultinomialNB()\nnaive_bayes.fit(training_data, y_train)","execution_count":161,"outputs":[{"output_type":"execute_result","execution_count":161,"data":{"text/plain":"MultinomialNB(alpha=1.0, class_prior=None, 
fit_prior=True)"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"predictions = naive_bayes.predict(testing_data)","execution_count":162,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\nmnbaccuracy = accuracy_score(y_test, predictions)\nprint('Accuracy score: ', format(accuracy_score(y_test, predictions)))\nprint('Precision score: ', format(precision_score(y_test, predictions)))\nprint('Recall score: ', format(recall_score(y_test, predictions)))\nprint('F1 score: ', format(f1_score(y_test, predictions)))","execution_count":164,"outputs":[{"output_type":"stream","text":"Accuracy score: 0.9025\nPrecision score: 0.9325842696629213\nRecall score: 0.8601036269430051\nF1 score: 0.894878706199461\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.svm import SVC \nsvc = SVC()\nsvc.fit(training_data, y_train)","execution_count":165,"outputs":[{"output_type":"stream","text":"/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. 
Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n \"avoid this warning.\", FutureWarning)\n","name":"stderr"},{"output_type":"execute_result","execution_count":165,"data":{"text/plain":"SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n decision_function_shape='ovr', degree=3, gamma='auto_deprecated',\n kernel='rbf', max_iter=-1, probability=False, random_state=None,\n shrinking=True, tol=0.001, verbose=False)"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"svc_predictions = svc.predict(testing_data)","execution_count":166,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import accuracy_score\nsvcaccuracy = accuracy_score(y_test,svc_predictions)\nprint('Accuracy score: ', format(accuracy_score(y_test,svc_predictions)))\nprint('Precision score: ', format(precision_score(y_test,svc_predictions)))\nprint('Recall score: ', format(recall_score(y_test, svc_predictions)))\nprint('F1 score: ', format(f1_score(y_test, svc_predictions)))","execution_count":167,"outputs":[{"output_type":"stream","text":"Accuracy score: 0.5625\nPrecision score: 0.525\nRecall score: 0.9792746113989638\nF1 score: 0.6835443037974684\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.neighbors import KNeighborsClassifier \nknn = KNeighborsClassifier(n_neighbors = 7)\nknn.fit(training_data, y_train)","execution_count":168,"outputs":[{"output_type":"execute_result","execution_count":168,"data":{"text/plain":"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n metric_params=None, n_jobs=None, n_neighbors=7, p=2,\n weights='uniform')"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"knn_predictions = knn.predict(testing_data)","execution_count":169,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import accuracy_score\nknnaccuracy = accuracy_score(knn_predictions,y_test 
)\nprint('Accuracy score: ', format(accuracy_score(knn_predictions,y_test )))\nprint('Precision score: ', format(precision_score(y_test,knn_predictions)))\nprint('Recall score: ', format(recall_score(y_test, knn_predictions)))\nprint('F1 score: ', format(f1_score(y_test, knn_predictions)))","execution_count":170,"outputs":[{"output_type":"stream","text":"Accuracy score: 0.5825\nPrecision score: 0.8421052631578947\nRecall score: 0.16580310880829016\nF1 score: 0.27705627705627706\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.naive_bayes import GaussianNB \ngnb = GaussianNB()\ntraining_data1 = training_data.toarray()\ngnb.fit(training_data1, y_train)","execution_count":171,"outputs":[{"output_type":"execute_result","execution_count":171,"data":{"text/plain":"GaussianNB(priors=None, var_smoothing=1e-09)"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"testing_data1= testing_data.toarray()\ngnb_predictions = gnb.predict(testing_data1)","execution_count":172,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import accuracy_score\ngnb_accuracy = accuracy_score(gnb_predictions,y_test )\nprint('Accuracy score: ', format(accuracy_score(gnb_predictions,y_test )))\nprint('Precision score: ', format(precision_score(y_test,gnb_predictions)))\nprint('Recall score: ', format(recall_score(y_test, gnb_predictions)))\nprint('F1 score: ', format(f1_score(y_test, gnb_predictions)))","execution_count":173,"outputs":[{"output_type":"stream","text":"Accuracy score: 0.665\nPrecision score: 0.6577540106951871\nRecall score: 0.6373056994818653\nF1 score: 0.6473684210526316\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"# training a DescisionTreeClassifier \nfrom sklearn.tree import DecisionTreeClassifier \ndtree_model = DecisionTreeClassifier(max_depth = 2)\ndtree_model.fit(training_data1, 
y_train)","execution_count":174,"outputs":[{"output_type":"execute_result","execution_count":174,"data":{"text/plain":"DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,\n max_features=None, max_leaf_nodes=None,\n min_impurity_decrease=0.0, min_impurity_split=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, presort=False,\n random_state=None, splitter='best')"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"dtree_predictions = dtree_model.predict(testing_data) ","execution_count":175,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import accuracy_score\ndtree_accuracy = accuracy_score(dtree_predictions,y_test )\nprint('Accuracy score: ', format(accuracy_score(dtree_predictions,y_test )))\nprint('Precision score: ', format(precision_score(y_test,dtree_predictions)))\nprint('Recall score: ', format(recall_score(y_test, dtree_predictions)))\nprint('F1 score: ', format(f1_score(y_test, dtree_predictions)))","execution_count":176,"outputs":[{"output_type":"stream","text":"Accuracy score: 0.66\nPrecision score: 0.6255506607929515\nRecall score: 0.7357512953367875\nF1 score: 0.6761904761904761\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.linear_model import SGDClassifier","execution_count":178,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"sgd_clf = SGDClassifier()\nsgd_clf.fit(training_data, y_train)","execution_count":179,"outputs":[{"output_type":"execute_result","execution_count":179,"data":{"text/plain":"SGDClassifier(alpha=0.0001, average=False, class_weight=None,\n early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,\n l1_ratio=0.15, learning_rate='optimal', loss='hinge',\n max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',\n power_t=0.5, random_state=None, shuffle=True, tol=0.001,\n validation_fraction=0.1, verbose=0, 
warm_start=False)"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"sgdpredicted = sgd_clf.predict(testing_data)","execution_count":180,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import accuracy_score\nsgd_accuracy = accuracy_score(sgdpredicted,y_test )\nprint('Accuracy score: ', format(accuracy_score(sgdpredicted,y_test )))\nprint('Precision score: ', format(precision_score(y_test,sgdpredicted)))\nprint('Recall score: ', format(recall_score(y_test, sgdpredicted)))\nprint('F1 score: ', format(f1_score(y_test, sgdpredicted)))","execution_count":181,"outputs":[{"output_type":"stream","text":"Accuracy score: 0.8775\nPrecision score: 0.8913043478260869\nRecall score: 0.8497409326424871\nF1 score: 0.8700265251989391\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.linear_model import LogisticRegression","execution_count":184,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"lr = LogisticRegression()\nlr.fit(training_data, y_train)","execution_count":185,"outputs":[{"output_type":"stream","text":"/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n FutureWarning)\n","name":"stderr"},{"output_type":"execute_result","execution_count":185,"data":{"text/plain":"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n intercept_scaling=1, l1_ratio=None, max_iter=100,\n multi_class='warn', n_jobs=None, penalty='l2',\n random_state=None, solver='warn', tol=0.0001, verbose=0,\n warm_start=False)"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"lrpredicted = lr.predict(testing_data)","execution_count":186,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import accuracy_score\nlr_accuracy = accuracy_score(lrpredicted,y_test )\nprint('Accuracy score: ', format(accuracy_score(lrpredicted,y_test )))\nprint('Precision score: ', format(precision_score(y_test,lrpredicted)))\nprint('Recall score: ', format(recall_score(y_test, lrpredicted)))\nprint('F1 score: ', format(f1_score(y_test, lrpredicted)))","execution_count":187,"outputs":[{"output_type":"stream","text":"Accuracy score: 0.87\nPrecision score: 0.8691099476439791\nRecall score: 0.8601036269430051\nF1 score: 0.8645833333333333\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"print('Multinomial Naive Bayes:',mnbaccuracy)\nprint('Gaussian Naive Bayes:',gnb_accuracy)\nprint('Decision tree:',dtree_accuracy)\nprint('Support Vector Classifier:',svcaccuracy)\nprint('K-Nearest Neighbour:',knnaccuracy)\nprint('Stochastic Gradient Descent:',sgd_accuracy)\nprint('LogisticRegression:',lr_accuracy)","execution_count":188,"outputs":[{"output_type":"stream","text":"Multinomial Naive Bayes: 0.9025\nGaussian Naive Bayes: 0.665\nDecision tree: 0.66\nSupport Vector Classifier: 0.5625\nK-Nearest Neighbour: 0.5825\nStochastic Gradient Descent: 0.8775\nLogisticRegression: 0.87\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.cluster import KMeans\nkmeans = 
KMeans(n_clusters=2)\nkmeans.fit(training_data)\ny_kmeans = kmeans.predict(training_data)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"%matplotlib inline\nimport matplotlib.pyplot as plt\nimport seaborn as sns; sns.set() ","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from nltk.stem.porter import PorterStemmer\nfrom nltk.corpus import stopwords","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"ps = PorterStemmer()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"stemmed_dataset = []\n# Build the stop-word set once instead of on every token\nstop_words = set(stopwords.words('english'))\nfor i in range(0,1600):\n stemmed_array = df['text'][i].split()\n stemmed = [ps.stem(word) for word in stemmed_array if word not in stop_words]\n stemmed = ' '.join(stemmed)\n stemmed_dataset.append(stemmed)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"print(stemmed_dataset[0:1600])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.feature_extraction.text import CountVectorizer","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"cv = CountVectorizer()\nX = cv.fit_transform(stemmed_dataset)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.cluster import KMeans\nwcss =[]","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"for i in range(1,1600):\n kmeans = KMeans(n_clusters=i, init='k-means++', max_iter = 300, n_init=10, random_state = 0, verbose=True)\n kmeans.fit(X)\n 
wcss.append(kmeans.inertia_)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 
3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":1} -------------------------------------------------------------------------------- /test/EDAonYelpReviews.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load in \n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n\n# Input data files are available in the \"../input/\" directory.\n# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n\nimport os\nprint(os.listdir(\"../input\"))\n\n# Any results you write to the current directory are saved as output.","execution_count":32,"outputs":[{"output_type":"stream","text":"['yelp-train', 'yelptest', 'glove-global-vectors-for-word-representation']\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":"**Exploratory Data Analysis**"},{"metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","trusted":true},"cell_type":"code","source":"import pandas as pd\nimport numpy as np","execution_count":33,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_train = pd.read_csv('../input/yelp-train/train.csv',header=None)","execution_count":34,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_train.columns = 
['deceptive','text']","execution_count":35,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_train.head(5)","execution_count":36,"outputs":[{"output_type":"execute_result","execution_count":36,"data":{"text/plain":" deceptive text\n0 1 Unfortunately, the frustration of being Dr. Go...\n1 2 Been going to Dr. Goldberg for over 10 years. ...\n2 1 I don't know what Dr. Goldberg was like before...\n3 1 I'm writing this review to give you a heads up...\n4 2 All the food is great here. But the best thing...","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_test = pd.read_csv('../input/yelptest/test.csv',header=None)","execution_count":37,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_test.columns = ['deceptive','text']","execution_count":38,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_test.head(5)","execution_count":39,"outputs":[{"output_type":"execute_result","execution_count":39,"data":{"text/plain":" deceptive text\n0 2 Contrary to other reviews, I have zero complai...\n1 1 Last summer I had an appointment to get new ti...\n2 2 Friendly staff, same starbucks fair you get an...\n3 1 The food is good. Unfortunately the service is...\n4 2 Even when we didn't have a car Filene's Baseme...","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"data_train.info()","execution_count":40,"outputs":[{"output_type":"stream","text":"\nRangeIndex: 560000 entries, 0 to 559999\nData columns (total 2 columns):\ndeceptive 560000 non-null int64\ntext 560000 non-null object\ndtypes: int64(1), object(1)\nmemory usage: 8.5+ MB\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"import seaborn as sns\nimport matplotlib.pyplot as plt\n%matplotlib inline","execution_count":41,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"sns.countplot(data_train.deceptive)\nplt.xlabel('Deceptive')\nplt.title('Number of Deceptive and Non Deceptive reviews (Deceptive=1 & NonDeceptive=2)')","execution_count":42,"outputs":[{"output_type":"execute_result","execution_count":42,"data":{"text/plain":"Text(0.5, 1.0, 'Number of Deceptive and Non Deceptive reviews (Deceptive=1 & NonDeceptive=2)')"},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"
"},"metadata":{"needs_background":"light"}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#dataset description\ndata_train.groupby('deceptive').describe()","execution_count":43,"outputs":[{"output_type":"execute_result","execution_count":43,"data":{"text/plain":" text ... \n count ... freq\ndeceptive ... \n1 280000 ... 1\n2 280000 ... 1\n\n[2 rows x 4 columns]","text/html":"
 text\n count unique top freq\ndeceptive\n1 280000 280000 I've been here twice now. The food is not very... 1\n2 280000 280000 My husband and I went here for our anniversary... 1
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#word count\ndata_train['word_count'] = data_train['text'].apply(lambda x: len(str(x).split(\" \")))\ndata_train[['text','word_count']].head()","execution_count":45,"outputs":[{"output_type":"execute_result","execution_count":45,"data":{"text/plain":" text word_count\n0 Unfortunately, the frustration of being Dr. Go... 122\n1 Been going to Dr. Goldberg for over 10 years. ... 97\n2 I don't know what Dr. Goldberg was like before... 212\n3 I'm writing this review to give you a heads up... 193\n4 All the food is great here. But the best thing... 80","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#character count including spaces\ndata_train['char_count'] = data_train['text'].str.len() ## this also includes spaces\ndata_train[['text','char_count']].head()","execution_count":46,"outputs":[{"output_type":"execute_result","execution_count":46,"data":{"text/plain":" text char_count\n0 Unfortunately, the frustration of being Dr. Go... 643\n1 Been going to Dr. Goldberg for over 10 years. ... 495\n2 I don't know what Dr. Goldberg was like before... 1143\n3 I'm writing this review to give you a heads up... 1050\n4 All the food is great here. But the best thing... 425","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#average word length\ndef avg_word(sentence):\n words = sentence.split()\n return (sum(len(word) for word in words)/len(words))\n\ndata_train['avg_word'] = data_train['text'].apply(lambda x: avg_word(x))\ndata_train[['text','avg_word']].head()","execution_count":47,"outputs":[{"output_type":"execute_result","execution_count":47,"data":{"text/plain":" text avg_word\n0 Unfortunately, the frustration of being Dr. Go... 4.539130\n1 Been going to Dr. Goldberg for over 10 years. ... 4.113402\n2 I don't know what Dr. Goldberg was like before... 4.417062\n3 I'm writing this review to give you a heads up... 4.445596\n4 All the food is great here. But the best thing... 4.613333","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#no of stopwords\nfrom nltk.corpus import stopwords\nstop = stopwords.words('english')\n\ndata_train['stopwords'] = data_train['text'].apply(lambda x: len([x for x in x.split() if x in stop]))\ndata_train[['text','stopwords']].head()","execution_count":48,"outputs":[{"output_type":"execute_result","execution_count":48,"data":{"text/plain":" text stopwords\n0 Unfortunately, the frustration of being Dr. Go... 47\n1 Been going to Dr. Goldberg for over 10 years. ... 47\n2 I don't know what Dr. Goldberg was like before... 96\n3 I'm writing this review to give you a heads up... 79\n4 All the food is great here. But the best thing... 21","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#number of hashtag-style tokens (words starting with '#')\ndata_train['spchar'] = data_train['text'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))\ndata_train[['text','spchar']].head()","execution_count":49,"outputs":[{"output_type":"execute_result","execution_count":49,"data":{"text/plain":" text spchar\n0 Unfortunately, the frustration of being Dr. Go... 0\n1 Been going to Dr. Goldberg for over 10 years. ... 0\n2 I don't know what Dr. Goldberg was like before... 0\n3 I'm writing this review to give you a heads up... 0\n4 All the food is great here. But the best thing... 0","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#no of numerics\ndata_train['numerics'] = data_train['text'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))\ndata_train[['text','numerics']].head()","execution_count":50,"outputs":[{"output_type":"execute_result","execution_count":50,"data":{"text/plain":" text numerics\n0 Unfortunately, the frustration of being Dr. Go... 2\n1 Been going to Dr. Goldberg for over 10 years. ... 1\n2 I don't know what Dr. Goldberg was like before... 1\n3 I'm writing this review to give you a heads up... 0\n4 All the food is great here. But the best thing... 0","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#number of all-uppercase words\ndata_train['upper'] = data_train['text'].apply(lambda x: len([x for x in x.split() if x.isupper()]))\ndata_train[['text','upper']].head()","execution_count":51,"outputs":[{"output_type":"execute_result","execution_count":51,"data":{"text/plain":" text upper\n0 Unfortunately, the frustration of being Dr. Go... 5\n1 Been going to Dr. Goldberg for over 10 years. ... 5\n2 I don't know what Dr. Goldberg was like before... 8\n3 I'm writing this review to give you a heads up... 10\n4 All the food is great here. But the best thing... 1","text/html":"
"},"metadata":{}}]},{"metadata":{},"cell_type":"markdown","source":"**Preprocessing**"},{"metadata":{"trusted":true},"cell_type":"code","source":"#to lowercase\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join(x.lower() for x in x.split()))\ndata_train['text'].head()","execution_count":52,"outputs":[{"output_type":"execute_result","execution_count":52,"data":{"text/plain":"0 unfortunately, the frustration of being dr. go...\n1 been going to dr. goldberg for over 10 years. ...\n2 i don't know what dr. goldberg was like before...\n3 i'm writing this review to give you a heads up...\n4 all the food is great here. but the best thing...\nName: text, dtype: object"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#removing punctuation (regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching)\ndata_train['text'] = data_train['text'].str.replace('[^\\w\\s]','', regex=True)\ndata_train['text'].head()","execution_count":53,"outputs":[{"output_type":"execute_result","execution_count":53,"data":{"text/plain":"0 unfortunately the frustration of being dr gold...\n1 been going to dr goldberg for over 10 years i ...\n2 i dont know what dr goldberg was like before m...\n3 im writing this review to give you a heads up ...\n4 all the food is great here but the best thing ...\nName: text, dtype: object"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#removing stop words\nfrom nltk.corpus import stopwords\nstop = stopwords.words('english')\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join(x for x in x.split() if x not in stop))\ndata_train['text'].head()","execution_count":54,"outputs":[{"output_type":"execute_result","execution_count":54,"data":{"text/plain":"0 unfortunately frustration dr goldbergs patient...\n1 going dr goldberg 10 years think one 1st patie...\n2 dont know dr goldberg like moving arizona let ...\n3 im writing review give heads see doctor office...\n4 food great best thing wings wings simply fanta...\nName: text, dtype: 
object"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#finding the 10 most common words\nfreq = pd.Series(' '.join(data_train['text']).split()).value_counts()[:10]\nfreq","execution_count":55,"outputs":[{"output_type":"execute_result","execution_count":55,"data":{"text/plain":"food 319953\nplace 316445\ngood 290108\nlike 259145\nget 235863\none 231827\ntime 209760\nwould 206953\ngreat 205939\nservice 202014\ndtype: int64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#removing the 10 most common words\nfreq = list(freq.index)\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join(x for x in x.split() if x not in freq))\ndata_train['text'].head()","execution_count":56,"outputs":[{"output_type":"execute_result","execution_count":56,"data":{"text/plain":"0 unfortunately frustration dr goldbergs patient...\n1 going dr goldberg 10 years think 1st patients ...\n2 dont know dr goldberg moving arizona let tell ...\n3 im writing review give heads see doctor office...\n4 best thing wings wings simply fantastic wet ca...\nName: text, dtype: object"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#finding the 10 rarest words\nfreq = pd.Series(' '.join(data_train['text']).split()).value_counts()[-10:]\nfreq","execution_count":57,"outputs":[{"output_type":"execute_result","execution_count":57,"data":{"text/plain":"bjorks 1\npurchasennof 1\nholidaysalso 1\nyumminessnnnot 1\nsnapnnonly 1\nplusnncant 1\nandstaff 1\nthinggeezare 1\nmeatnstuffed 1\nalsoncarne 1\ndtype: int64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#removing the 10 rarest words\nfreq = list(freq.index)\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join(x for x in x.split() if x not in freq))\ndata_train['text'].head()","execution_count":58,"outputs":[{"output_type":"execute_result","execution_count":58,"data":{"text/plain":"0 unfortunately frustration dr goldbergs patient...\n1 going dr goldberg 10 years 
think 1st patients ...\n2 dont know dr goldberg moving arizona let tell ...\n3 im writing review give heads see doctor office...\n4 best thing wings wings simply fantastic wet ca...\nName: text, dtype: object"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#spelling correction\nfrom textblob import TextBlob\ndata_train['text'][:5].apply(lambda x: str(TextBlob(x).correct()))","execution_count":59,"outputs":[{"output_type":"execute_result","execution_count":59,"data":{"text/plain":"0 unfortunately frustration dr goldbergs patient...\n1 going dr goldberg 10 years think st patients s...\n2 dont know dr goldberg moving arizona let tell ...\n3 in writing review give heads see doctor office...\n4 best thing wings wings simply fantastic wet ca...\nName: text, dtype: object"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#tokenization\nTextBlob(data_train['text'][1]).words","execution_count":61,"outputs":[{"output_type":"execute_result","execution_count":61,"data":{"text/plain":"WordList(['going', 'dr', 'goldberg', '10', 'years', 'think', '1st', 'patients', 'started', 'mhmg', 'hes', 'years', 'really', 'big', 'picture', 'former', 'gyn', 'dr', 'markoff', 'found', 'fibroids', 'explores', 'options', 'patient', 'understanding', 'doesnt', 'judge', 'asks', 'right', 'questions', 'thorough', 'wants', 'kept', 'loop', 'every', 'aspect', 'medical', 'health', 'life'])"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#stemming\nfrom nltk.stem import PorterStemmer\nst = PorterStemmer()\ndata_train['text'][:5].apply(lambda x: \" \".join([st.stem(word) for word in x.split()]))","execution_count":62,"outputs":[{"output_type":"execute_result","execution_count":62,"data":{"text/plain":"0 unfortun frustrat dr goldberg patient repeat e...\n1 go dr goldberg 10 year think 1st patient start...\n2 dont know dr goldberg move arizona let tell st...\n3 im write review give head see doctor offic sta...\n4 best thing 
wing wing simpli fantast wet cajun ...\nName: text, dtype: object"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#lemmatization\nfrom textblob import Word\ndata_train['text'] = data_train['text'].apply(lambda x: \" \".join([Word(word).lemmatize() for word in x.split()]))\ndata_train['text'].head()","execution_count":63,"outputs":[{"output_type":"execute_result","execution_count":63,"data":{"text/plain":"0 unfortunately frustration dr goldberg patient ...\n1 going dr goldberg 10 year think 1st patient st...\n2 dont know dr goldberg moving arizona let tell ...\n3 im writing review give head see doctor office ...\n4 best thing wing wing simply fantastic wet caju...\nName: text, dtype: object"},"metadata":{}}]},{"metadata":{},"cell_type":"markdown","source":"**Advanced Text Processing**"},{"metadata":{"trusted":true},"cell_type":"code","source":"#N-grams\nTextBlob(data_train['text'][0]).ngrams(2)","execution_count":64,"outputs":[{"output_type":"execute_result","execution_count":64,"data":{"text/plain":"[WordList(['unfortunately', 'frustration']),\n WordList(['frustration', 'dr']),\n WordList(['dr', 'goldberg']),\n WordList(['goldberg', 'patient']),\n WordList(['patient', 'repeat']),\n WordList(['repeat', 'experience']),\n WordList(['experience', 'ive']),\n WordList(['ive', 'many']),\n WordList(['many', 'doctor']),\n WordList(['doctor', 'nyc']),\n WordList(['nyc', 'doctor']),\n WordList(['doctor', 'terrible']),\n WordList(['terrible', 'staff']),\n WordList(['staff', 'seems']),\n WordList(['seems', 'staff']),\n WordList(['staff', 'simply']),\n WordList(['simply', 'never']),\n WordList(['never', 'answer']),\n WordList(['answer', 'phone']),\n WordList(['phone', 'usually']),\n WordList(['usually', 'take']),\n WordList(['take', '2']),\n WordList(['2', 'hour']),\n WordList(['hour', 'repeated']),\n WordList(['repeated', 'calling']),\n WordList(['calling', 'answer']),\n WordList(['answer', 'want']),\n WordList(['want', 'deal']),\n WordList(['deal', 
'run']),\n WordList(['run', 'problem']),\n WordList(['problem', 'many']),\n WordList(['many', 'doctor']),\n WordList(['doctor', 'dont']),\n WordList(['dont', 'office']),\n WordList(['office', 'worker']),\n WordList(['worker', 'patient']),\n WordList(['patient', 'medical']),\n WordList(['medical', 'need']),\n WordList(['need', 'isnt']),\n WordList(['isnt', 'anyone']),\n WordList(['anyone', 'answering']),\n WordList(['answering', 'phone']),\n WordList(['phone', 'incomprehensible']),\n WordList(['incomprehensible', 'work']),\n WordList(['work', 'aggravation']),\n WordList(['aggravation', 'regret']),\n WordList(['regret', 'feel']),\n WordList(['feel', 'give']),\n WordList(['give', 'dr']),\n WordList(['dr', 'goldberg']),\n WordList(['goldberg', '2']),\n WordList(['2', 'star'])]"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#Term frequency\ntf1 = (data_train['text'][1:2]).apply(lambda x: pd.value_counts(x.split(\" \"))).sum(axis = 0).reset_index()\ntf1.columns = ['words','tf']\ntf1","execution_count":65,"outputs":[{"output_type":"execute_result","execution_count":65,"data":{"text/plain":" words tf\n0 patient 2\n1 dr 2\n2 year 2\n3 explores 1\n4 kept 1\n5 judge 1\n6 found 1\n7 medical 1\n8 goldberg 1\n9 started 1\n10 thorough 1\n11 life 1\n12 former 1\n13 doesnt 1\n14 10 1\n15 really 1\n16 going 1\n17 markoff 1\n18 picture 1\n19 big 1\n20 health 1\n21 option 1\n22 understanding 1\n23 right 1\n24 gyn 1\n25 think 1\n26 he 1\n27 1st 1\n28 aspect 1\n29 question 1\n30 asks 1\n31 fibroid 1\n32 want 1\n33 loop 1\n34 mhmg 1\n35 every 1","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#inverse document frequency\nfor i,word in enumerate(tf1['words']):\n tf1.loc[i, 'idf'] = np.log(data_train.shape[0]/(len(data_train[data_train['text'].str.contains(word)])))\n\ntf1","execution_count":66,"outputs":[{"output_type":"execute_result","execution_count":66,"data":{"text/plain":" words tf idf\n0 patient 2 4.511322\n1 dr 2 1.218428\n2 year 2 2.425439\n3 explores 1 11.289782\n4 kept 1 3.571541\n5 judge 1 5.420889\n6 found 1 2.801429\n7 medical 1 6.007304\n8 goldberg 1 11.038467\n9 started 1 3.265013\n10 thorough 1 5.026928\n11 life 1 3.485531\n12 former 1 5.205283\n13 doesnt 1 3.131593\n14 10 1 2.262695\n15 really 1 1.534670\n16 going 1 2.029686\n17 markoff 1 12.137080\n18 picture 1 4.116809\n19 big 1 2.447301\n20 health 1 4.004471\n21 option 1 3.140345\n22 understanding 1 5.768893\n23 right 1 2.083736\n24 gyn 1 6.782067\n25 think 1 2.073815\n26 he 1 0.365561\n27 1st 1 4.925031\n28 aspect 1 5.871779\n29 question 1 3.767768\n30 asks 1 5.582197\n31 fibroid 1 11.626254\n32 want 1 1.706942\n33 loop 1 6.802752\n34 mhmg 1 13.235692\n35 every 1 1.557829","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"#term freq - inverse document freq\ntf1['tfidf'] = tf1['tf'] * tf1['idf']\ntf1","execution_count":67,"outputs":[{"output_type":"execute_result","execution_count":67,"data":{"text/plain":" words tf idf tfidf\n0 patient 2 4.511322 9.022644\n1 dr 2 1.218428 2.436856\n2 year 2 2.425439 4.850878\n3 explores 1 11.289782 11.289782\n4 kept 1 3.571541 3.571541\n5 judge 1 5.420889 5.420889\n6 found 1 2.801429 2.801429\n7 medical 1 6.007304 6.007304\n8 goldberg 1 11.038467 11.038467\n9 started 1 3.265013 3.265013\n10 thorough 1 5.026928 5.026928\n11 life 1 3.485531 3.485531\n12 former 1 5.205283 5.205283\n13 doesnt 1 3.131593 3.131593\n14 10 1 2.262695 2.262695\n15 really 1 1.534670 1.534670\n16 going 1 2.029686 2.029686\n17 markoff 1 12.137080 12.137080\n18 picture 1 4.116809 4.116809\n19 big 1 2.447301 2.447301\n20 health 1 4.004471 4.004471\n21 option 1 3.140345 3.140345\n22 understanding 1 5.768893 5.768893\n23 right 1 2.083736 2.083736\n24 gyn 1 6.782067 6.782067\n25 think 1 2.073815 2.073815\n26 he 1 0.365561 0.365561\n27 1st 1 4.925031 4.925031\n28 aspect 1 5.871779 5.871779\n29 question 1 3.767768 3.767768\n30 asks 1 5.582197 5.582197\n31 fibroid 1 11.626254 11.626254\n32 want 1 1.706942 1.706942\n33 loop 1 6.802752 6.802752\n34 mhmg 1 13.235692 13.235692\n35 every 1 1.557829 1.557829","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"# TF-IDF features as a sparse matrix (1000 most frequent terms, English stop words removed)\nfrom sklearn.feature_extraction.text import TfidfVectorizer\ntfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',\n                        stop_words='english', ngram_range=(1, 1))\ntrain_vect = tfidf.fit_transform(data_train['text'])\n\ntrain_vect","execution_count":68,"outputs":[{"output_type":"execute_result","execution_count":68,"data":{"text/plain":"<560000x1000 sparse matrix of type '<class 'numpy.float64'>'\n\twith 16479223 stored elements in Compressed Sparse Row format>"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Bag of Words: raw term counts for the 1000 most frequent terms (stop words are kept here)\nfrom sklearn.feature_extraction.text import CountVectorizer\nbow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1, 1), analyzer='word')\ntrain_bow = bow.fit_transform(data_train['text'])\ntrain_bow","execution_count":70,"outputs":[{"output_type":"execute_result","execution_count":70,"data":{"text/plain":"<560000x1000 sparse matrix of type '<class 'numpy.int64'>'\n\twith 19312429 stored elements in Compressed Sparse Row format>"},"metadata":{}}]},{"metadata":{},"cell_type":"markdown","source":"**Sentiment Analysis**"},{"metadata":{"trusted":true},"cell_type":"code","source":"# (polarity, subjectivity) of the first five reviews\ndata_train['text'][:5].apply(lambda x: TextBlob(x).sentiment)","execution_count":71,"outputs":[{"output_type":"execute_result","execution_count":71,"data":{"text/plain":"0 (-0.10714285714285714, 0.5153061224489796)\n1 (0.07142857142857142, 0.15892857142857142)\n2 (-0.11153846153846153, 0.36987179487179483)\n3 (0.11538461538461539, 0.480952380952381)\n4 (0.49318181818181817, 0.5954545454545455)\nName: text, dtype: object"},"metadata":{}}]},{"metadata":{},"cell_type":"markdown","source":"***TextBlob returns a (polarity, subjectivity) pair for each review. Polarity ranges from -1 to 1: values near 1 indicate positive sentiment and values near -1 indicate negative sentiment.***"},{"metadata":{"trusted":true},"cell_type":"code","source":"# keep only the polarity score as the sentiment feature\ndata_train['sentiment'] = data_train['text'].apply(lambda x: TextBlob(x).sentiment.polarity)\ndata_train[['text','sentiment']].head()","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.4","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1} --------------------------------------------------------------------------------