├── README.md ├── Twitter Sentiment Analysis - Classical Approach VS Deep Learning.ipynb └── images ├── dropout.png ├── embedding.png ├── love_scrable.jpg └── sentiment_classification.png /README.md: -------------------------------------------------------------------------------- 1 | # Twitter Sentiment Analysis - Classical Approach VS Deep Learning 2 | 3 | 4 | 5 | Photo by Gaelle Marcel on Unsplash. 6 | 7 | # Overview 8 | 9 | This project's aim is to explore the world of *Natural Language Processing* (NLP) by building what is known as a **Sentiment Analysis Model**: a model that analyses a given piece of text and predicts whether it expresses positive or negative sentiment. 10 | 11 |
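To make the task concrete, here is a minimal sketch of such a model: a from-scratch multinomial Naive Bayes with Laplace smoothing, trained on a few hand-written toy tweets. This is illustrative only; the notebook itself works on the full `sentiment140` dataset with proper cleaning and preprocessing.

```python
import math
from collections import Counter

def train_nb(texts, labels):
    """Collect per-class word counts for a multinomial Naive Bayes classifier."""
    counts = {}
    for text, label in zip(texts, labels):
        counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab

def predict_nb(counts, vocab, text):
    """Pick the class maximizing the sum of log P(word | class),
    assuming equal class priors (sentiment140 is perfectly balanced)."""
    scores = {}
    for label, words in counts.items():
        total = sum(words.values())
        score = 0.0
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out a class
            score += math.log((words[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

train_texts = ["i love this so much", "what a great day",
               "this is awful", "i hate everything about this"]
train_labels = ["positive", "positive", "negative", "negative"]

counts, vocab = train_nb(train_texts, train_labels)
print(predict_nb(counts, vocab, "i love it"))  # positive
```

Even on four toy examples, the word counts are enough to separate the two sentiments; the notebook's real model is built on 1.6 million tweets.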
12 | 13 | To this end, we will be using the `sentiment140` dataset, which contains data collected from Twitter. A notable feature of this dataset is that it is *perfectly* balanced (i.e., both classes contain an equal number of examples). 14 | 15 | Citing the [creators](http://help.sentiment140.com/for-students/) of this dataset: 16 | 17 | > *Our approach was unique because our training data was automatically created, as opposed to having humans manually annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search* 18 | 19 | After a series of **cleaning and data processing** steps, and after visualizing our data in a **word cloud**, we will build a **Naive Bayesian** model whose goal is to correctly classify tweets as expressing positive or negative sentiment. 20 | Next, we will propose a much more advanced solution: a **deep learning** model, the **LSTM**. This approach requires a different kind of data cleaning and processing, and will introduce **Word Embeddings**, **Dropout** and many other machine learning concepts. 21 | 22 | Throughout this notebook, we will take advantage of every result, visualization and failure to better understand the data, extract insights from it and learn how to improve our model: from the types of words used in positive and negative tweets, to the vocabulary diversity of each class and the days of the week on which these tweets occur, to the concept of overfitting and the crucial role data plays when building a model. I hope you'll enjoy going through this notebook and gain not only technical skills but also analytical skills from it. 23 | 24 | --- 25 | 26 | This notebook is written by **Joseph Assaker**.  
Feel free to reach out for any feedback on this notebook via [email](mailto:lb.josephassaker@gmail.com) or [LinkedIn](https://www.linkedin.com/in/joseph-assaker/). 27 | 28 | --- 29 | 30 | Now, let's start with the fun 🎉 31 | 32 | ### **Table of Contents:** 33 | 34 | 1. Importing and Discovering the Dataset 35 | 2. Cleaning and Processing the Data 36 | 2.1. Tokenization 37 | 2.2. Lemmatization 38 | 2.3. Cleaning the Data 39 | 3. Visualizing the Data 40 | 4. Naive Bayesian Model 41 | 4.1. Splitting the Data 42 | 4.2. Training the Model 43 | 4.3. Testing the Model 44 | 4.4. Asserting the Model 45 | 5. Deep Learning Model - LSTM 46 | 5.1. Data Pre-processing 47 |     5.1.1. Word Embeddings 48 |     5.1.2. Global Vectors for Word Representation (GloVe) 49 |     5.1.3. Data Padding 50 | 5.2. Data Transformation 51 | 5.3. Building the Model 52 | 5.4. Training the Model 53 | 5.5. Investigating Possibilities to Improve the Model 54 |     5.5.1. Regularization - Dropout 55 |     5.5.2. Inspecting the Data - Unknown Words 56 | 5.6. Predicting on Custom Data 57 | 5.7. Inspecting Wrongly Predicted Data 58 | 6. Bonus Section 59 | 7. Extra Tip: Pickling! 60 | 8. Further Work 61 | 62 | 63 | Continue reading the whole notebook [here](https://github.com/JosephAssaker/Twitter-Sentiment-Analysis-Classical-Approach-VS-Deep-Learning/blob/master/Twitter%20Sentiment%20Analysis%20-%20Classical%20Approach%20VS%20Deep%20Learning.ipynb). 64 | 65 | You can also find this notebook, and give it an upvote 😊, on [Kaggle](https://www.kaggle.com/josephassaker/twitter-sentiment-analysis-classical-vs-lstm). 
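As a small taste of two items from the table of contents, here is a dependency-free sketch of data padding (5.1.3) and pickling (7) on toy token ids. The padding function mimics the default behaviour of Keras' `pad_sequences` ("pre" padding and truncation); the notebook uses the Keras utility itself, so this is illustrative only.

```python
import io
import pickle

def pad_sequences(seqs, maxlen, value=0):
    """Left-pad (or truncate) token-id sequences to a fixed length,
    mirroring Keras' pad_sequences defaults ('pre' padding/truncation)."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]                      # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

token_ids = [[4, 7], [9, 2, 5, 1, 3]]            # tweets of unequal length
padded = pad_sequences(token_ids, maxlen=4)
print(padded)  # [[0, 0, 4, 7], [2, 5, 1, 3]]

# Pickling: serialize any Python object (e.g. a fitted tokenizer or model)
# so you don't have to rebuild it on every run; a file works the same way.
buffer = io.BytesIO()
pickle.dump(padded, buffer)
buffer.seek(0)
assert pickle.load(buffer) == padded
```

Uniform sequence lengths are what let the LSTM consume tweets in batches, and pickling the heavy preprocessing artifacts is the "extra tip" that saves time across sessions.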
66 | -------------------------------------------------------------------------------- /images/dropout.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JosephAssaker/Twitter-Sentiment-Analysis-Classical-Approach-VS-Deep-Learning/d6a9925f7c563e4298674a54f581ec89ee20c1d6/images/dropout.png -------------------------------------------------------------------------------- /images/embedding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JosephAssaker/Twitter-Sentiment-Analysis-Classical-Approach-VS-Deep-Learning/d6a9925f7c563e4298674a54f581ec89ee20c1d6/images/embedding.png -------------------------------------------------------------------------------- /images/love_scrable.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JosephAssaker/Twitter-Sentiment-Analysis-Classical-Approach-VS-Deep-Learning/d6a9925f7c563e4298674a54f581ec89ee20c1d6/images/love_scrable.jpg -------------------------------------------------------------------------------- /images/sentiment_classification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JosephAssaker/Twitter-Sentiment-Analysis-Classical-Approach-VS-Deep-Learning/d6a9925f7c563e4298674a54f581ec89ee20c1d6/images/sentiment_classification.png --------------------------------------------------------------------------------