├── Deep Learning
│   ├── Fashion_Class_Classification_using_MNIST_dataset.ipynb
│   └── MNIST_USING_PCA.ipynb
├── LICENSE
├── Python
│   ├── 911_Calls_Project.ipynb
│   ├── Car Price Prediction using Ridge & Lasso Regression.ipynb
│   ├── Car_Price_Prediction.ipynb
│   ├── Customer_Segmentation_using_(Recency,_Frequency,_Monetary)RFM_Analysis.ipynb
│   ├── Directing_Customers_to_Subscription_Through_App_Behavior_Analysis.ipynb
│   ├── Minimizing_Churn_Rate_Through_Analysis_of_Financial_Habits.ipynb
│   └── Telecom_Customer_Churn.ipynb
├── R
│   ├── A Visual History of Nobel Prize Winners.ipynb
│   ├── Degress that pay you back.ipynb
│   ├── Rise and Fall of Programming Languages.ipynb
│   └── Visualizing Inequalities in Life Expectancy.ipynb
├── README.md
├── Text Analytics
│   ├── LDA_NMF_Assessment_Project.ipynb
│   ├── Latent_Dirichlet_Allocation_on_Articles.ipynb
│   ├── Named_Entity_Recognition.ipynb
│   ├── Part_of_Speech_Assessment.ipynb
│   ├── Restaurant_Reviews.ipynb
│   ├── Text_Classification.ipynb
│   ├── Text_Generation_with_Neural_Networks.ipynb
│   ├── Text_Summarization.ipynb
│   └── Yelp_Reviews_Classification.ipynb
└── Time Series
    ├── Avocado_Prices_Prediction.ipynb
    ├── Deep_Learning_for_Time_Series.ipynb
    ├── Mauna_Loa_Atmospheric_CO2_Concentration_Forecasting_using_SARIMA.ipynb
    └── Miles_Travelled_using_ARIMA_Model.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Shantanu Gupta
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data-Science-Portfolio :fire:
2 | 
3 | ## Contents
4 |
5 | - ### R
6 |
7 | 1. [Data Visualization: Corruption and Human Development](): This project uses data visualization to explore the relationship between corruption and human development across nations, based on the UN Human Development Report. It produces a scatter plot of each country's 'Human Development Index' against its 'Corruption Perceptions Index'.
8 |
9 | 2. [Visualizing Inequalities in Life Expectancy](http://rpubs.com/shantanu97/Title): Do women live longer than men? How much longer? Does it happen everywhere? Is life expectancy increasing? Everywhere? Which country has the lowest life expectancy? Which has the highest? In this project, I answer these questions by manipulating and visualizing United Nations life expectancy data using ggplot2. The dataset can be found [here]() and contains the average life expectancies of men and women by country (in years). It covers four periods: 1985-1990, 1990-1995, 1995-2000, and 2000-2005.
10 |
11 | 3. [Rise and Fall of Programming Languages](http://rpubs.com/shantanu97/Programming_Languages): How can you tell which programming languages and technologies are used by the most people? How about which languages are growing and which are shrinking, so that you can tell which are most worth investing time in? One excellent source of data is [Stack Overflow](), a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, you can get an approximate sense of how many people are using it. In this project, you'll use open data from the [Stack Exchange Data Explorer]() to examine how the relative popularity of languages like R, Python, Java, and JavaScript has changed over time.
12 |
13 | 4. [Degrees that pay you back](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Degress%20that%20pay%20you%20back.ipynb): Wondering if that Philosophy major will really help you pay the bills? Think you're set with an Engineering degree? Whether you're in school or navigating the postgrad world, this project will guide you in exploring the short- and long-term financial implications of this major decision. After some data cleanup, you'll compare the recommendations from three different methods for determining the optimal number of clusters, apply a k-means cluster analysis, and visualize the results.
14 |
15 | 5. [A Visual History of Nobel Prize Winners](): The Nobel Prize is perhaps the world's most well-known scientific award. Every year it is given to scientists and scholars in chemistry, literature, physics, medicine, economics, and peace. The first Nobel Prize was handed out in 1901, and at that time the prize was Eurocentric and male-focused, but nowadays it's not biased in any way. Surely, right? Well, let's find out! In this project, you get to explore patterns and trends in over 100 years' worth of Nobel Prize winners. What characteristics do the prize winners have? Which country gets it most often? And has anybody gotten it twice? It's up to you to figure this out.
16 |
17 | - ### Python
18 |
19 | 1. [Telecom Customer Churn](https://github.com/Shantanu9326/Telecom-Customer-Churn/blob/master/Telecom_Customer_Churn.ipynb): Customer churn, also known as customer attrition, occurs when customers or subscribers stop doing business with a company or service. Churn rates are particularly useful in the telecommunications industry, because most customers have multiple options to choose from within a geographic location. By building a model to predict customer churn with Logistic Regression, ideally we can nip the problem of unsatisfied customers in the bud and keep the revenue flowing. Data source: the dataset is available on [Kaggle](https://www.kaggle.com/blastchar/telco-customer-churn), and you can open this notebook directly in Google Colaboratory.
20 |
21 | 2. [Directing Customers to Subscription Through App Behavior Analysis](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Directing_Customers_to_Subscription_Through_App_Behavior_Analysis.ipynb): In today's market, many companies have a mobile presence. Often, these companies provide free products/services in their mobile apps in an attempt to transition their customers to a paid membership. Some examples of paid products that originate from free ones are YouTube Red, Pandora Premium, Audible Subscription, and You Need a Budget. Since marketing efforts are never free, these companies need to know exactly whom to target with offers and promotions.
22 |
23 | 3. [Minimizing Churn Rate Through Analysis of Financial Habits](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Minimizing_Churn_Rate_Through_Analysis_of_Financial_Habits.ipynb): Subscription products are often the main source of revenue for companies across all industries. These products can come in the form of a 'one size fits all' all-encompassing subscription, or in multi-level memberships. Regardless of how they structure their memberships, or what industry they are in, companies almost always try to minimize customer churn (a.k.a. subscription cancellations). To retain their customers, these companies first need to identify the behavioural patterns that act as a catalyst for disengagement from the product.
24 |
25 | 4. [Car Price Prediction](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Car_Price_Prediction.ipynb): The dataset for this project was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/automobile). A companion notebook, [Car Price Prediction using Ridge & Lasso Regression](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Car%20Price%20Prediction%20using%20Ridge%20%26%20Lasso%20Regression.ipynb), applies Ridge and Lasso regression to car price prediction.
26 |
27 | 5. [Customer Segmentation using RFM analysis](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Customer_Segmentation_using_(Recency%2C_Frequency%2C_Monetary)RFM_Analysis.ipynb): Python code that uses the RFM (Recency, Frequency, Monetary) model to segment customers based on their purchase history.
28 |
29 | 6. [Movie Recommendations using Recommender Systems](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Movie_Recommender_System.ipynb): Recommender systems are used to suggest movies or songs to users based on their interests. This is a micro project to build a recommendation system that makes movie recommendations based on user review similarities.
30 |
31 | 7. [Minimizing Churn Rate through analysis of financial habits](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Minimizing_Churn_Rate_Through_Analysis_of_Financial_Habits.ipynb): Developed a machine learning model with a Random Forest classifier, based on the financial habits of the customers in the bank database. After feature selection and hyperparameter tuning, the model accuracy was 79.83%.
32 |
33 | 8. [Decline in Viewership in a Digital Media Company](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Media%2BCompany.ipynb): A real-life case study of a digital media company (similar to Voot, Hotstar, Netflix, etc.) that launched a show. Initially the show got a good response and its TRP was very good, but viewership then declined. The company wants to find out what caused the decline and what action it can take to fix the problem. This is a multiple regression case: the goal is to build a model that identifies which factors (columns) affect viewership and predicts the show's future views.
34 |
35 | 9. [911 calls](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/911_Calls_Project.ipynb): Exploratory data analysis of the 911 calls dataset hosted on Kaggle. Demonstrates extraction of useful features from different variables.
36 |
37 | 10. [Predicting the likelihood of E-signing a loan based on financial history](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Predicting_the_Likelihood_of_E_Signing_a_Loan_Based_on_Financial_History.ipynb): Lending companies work by analyzing the financial history of their loan applicants and deciding whether or not the applicant is too risky to be given a loan. If the applicant is not, the company then determines the terms of the loan. To acquire these applicants, companies can receive them organically through their websites/apps, often with the help of advertisement campaigns. Other times, lending companies partner with peer-to-peer (P2P) lending marketplaces in order to acquire leads of possible applicants. Some example marketplaces include Upstart, Lending Tree, and Lending Club. In this project, we assess the quality of the leads our company receives from these marketplaces.
38 |
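The RFM idea in item 5 can be sketched in a few lines. This is a minimal illustration and not the notebook's code: the purchase log, the snapshot date, and the helper names `rfm_table` and `rank_scores` are all assumptions made for the example. It computes recency (days since the last purchase), frequency (number of purchases), and monetary value (total spend) per customer, then scores customers by rank on each dimension.

```python
from collections import defaultdict
from datetime import date

# Hypothetical purchase log: (customer_id, purchase_date, amount).
purchases = [
    ("A", date(2019, 1, 5), 120.0),
    ("A", date(2019, 3, 20), 80.0),
    ("B", date(2018, 11, 2), 35.0),
    ("C", date(2019, 3, 28), 300.0),
    ("C", date(2019, 2, 14), 150.0),
    ("C", date(2019, 1, 30), 90.0),
]
snapshot = date(2019, 4, 1)  # assumed "today" for the recency calculation


def rfm_table(rows, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in rows:
        last_seen[customer] = max(day, last_seen.get(customer, day))
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        c: ((today - last_seen[c]).days, frequency[c], monetary[c])
        for c in last_seen
    }


def rank_scores(values, higher_is_better=True):
    """Score customers 1..n by rank, where n is the best score."""
    ordered = sorted(values, key=lambda c: (values[c], c),
                     reverse=not higher_is_better)
    return {c: i + 1 for i, c in enumerate(ordered)}


table = rfm_table(purchases, snapshot)
# A recent purchase is good, so a lower recency value earns a higher score.
recency = rank_scores({c: v[0] for c, v in table.items()}, higher_is_better=False)
frequency = rank_scores({c: v[1] for c, v in table.items()})
monetary = rank_scores({c: v[2] for c, v in table.items()})

# Concatenate the three scores into a segment label such as "333".
segments = {c: f"{recency[c]}{frequency[c]}{monetary[c]}" for c in table}
print(segments)
```

With a realistic number of customers, the scores are usually binned into quantiles (for example quintiles) rather than raw ranks, so that each segment label covers a group of customers.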
39 |
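Item 6 describes recommendations based on user review similarities. One common way to do this is user-based collaborative filtering with cosine similarity; the sketch below assumes that approach, which may differ from what the notebook actually does. The ratings dictionary and the function names `cosine` and `recommend` are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "ann": {"Alien": 5, "Heat": 4, "Up": 1},
    "bob": {"Alien": 4, "Heat": 5, "Se7en": 5},
    "cy": {"Up": 5, "Coco": 4, "Alien": 1},
}


def cosine(u, v):
    """Cosine similarity of two sparse rating dicts (missing rating = 0)."""
    shared = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, data, top_n=3):
    """Rank movies the user has not rated by similarity-weighted ratings."""
    seen = data[user]
    scores = {}
    for other, their in data.items():
        if other == user:
            continue
        sim = cosine(seen, their)
        for movie, rating in their.items():
            if movie not in seen:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("ann", ratings))
```

Here "ann" rates movies much like "bob" and unlike "cy", so bob's unseen favourite is ranked ahead of cy's.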
40 | - ### Deep Learning
41 | 1. [Fashion-Class-Classification-using-MNIST-dataset](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Fashion_Class_Classification_using_MNIST_dataset.ipynb): Training AI machine learning models on the Fashion MNIST [dataset](https://github.com/zalandoresearch/fashion-mnist). Read the full article at [Image Recognition for Fashion with Machine Learning](http://www.primaryobjects.com/2017/10/23/image-recognition-for-fashion-with-machine-learning/).
42 |
43 | 2. [MNIST Using PCA](https://github.com/Shantanu9326/Fashion-Class-Classification-using-MNIST-dataset/blob/master/MNIST_USING_PCA.ipynb): The global fashion industry is valued at three trillion dollars and accounts for 2 percent of the world's GDP. The industry is undergoing a dramatic transformation by adopting new computer vision, machine learning, and deep learning techniques.
44 |
45 | 3. [Deep Learning for Time Series](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Deep_Learning_for_Time_Series.ipynb): Americans are driving more than ever before. Predicted and plotted future traffic trends using RNN and LSTM deep learning models.
46 |
47 | - ### Natural Language Processing
48 | 1. [Named Entity Recognition](https://github.com/Shantanu9326/Text-Mining-Mini-Projects/blob/master/Named_Entity_Recognition.ipynb): Named entity recognition (NER), also known as entity extraction, is often the first step towards information extraction. It locates named entities in text and classifies them into pre-defined categories such as persons, organizations, locations, cities, dates and times, quantities, monetary values, percentages, and product terminology. It adds a wealth of semantic knowledge to your content and helps you promptly understand the subject of any given text.
49 |
50 | 2. [Part of Speech Assessment](https://github.com/Shantanu9326/Text-Mining-Mini-Projects/blob/master/Part_of_Speech_Assessment.ipynb):
51 |
52 | 3. [Text Classification](https://github.com/Shantanu9326/Text-Mining-Mini-Projects/blob/master/Text_Classification.ipynb):
53 |
54 | 4. [Text Generation with Neural Networks](https://github.com/Shantanu9326/Data-Science-Portfolio/blob/master/Text_Generation_with_Neural_Networks.ipynb):
55 |
56 | - ### Time Series
57 | 1. [Mauna Loa Atmospheric CO2 Concentration Forecasting using SARIMA](https://github.com/Shantanu9326/Forecasting/blob/master/Mauna_Loa_Atmospheric_CO2_Concentration_Forecasting_using_SARIMA.ipynb): Trends and seasonal variation in time-series models. The data are atmospheric CO2 concentrations (measured in parts per million) derived from air samples collected at Mauna Loa Observatory, Hawaii.
58 |
59 | 2. [Miles Travelled using ARIMA Model](): Americans are driving more than ever before. Predicted and plotted future traffic trends using an ARIMA model.
60 |
61 | 3. [Avocado Price Prediction](): Predicts avocado prices from a Kaggle dataset.
62 |
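The "trends and seasonal variation" that SARIMA handles are removed by differencing before the autoregressive and moving-average terms are fitted. The snippet below is a sketch of only that differencing step on a synthetic monthly series, under the assumption of a 12-month season; it is not the notebook's model, and the `difference` helper and the data are invented for illustration.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic monthly series: a linear trend plus a repeating 12-month pattern.
season = [0, 1, 3, 4, 3, 1, -1, -3, -4, -3, -1, 0]
series = [300 + 0.5 * t + season[t % 12] for t in range(48)]

# Seasonal differencing (lag 12) cancels the repeating yearly pattern.
seasonal_diff = difference(series, lag=12)

# Regular differencing (lag 1) then removes what is left of the trend.
stationary = difference(seasonal_diff, lag=1)

print(seasonal_diff[:3], stationary[:3])
```

In SARIMA notation these two steps correspond to D = 1 with seasonal period s = 12 and d = 1. Real data such as the CO2 series would leave noise rather than a constant after differencing.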
63 |
64 | # License
65 |
66 | MIT
67 |
68 |
69 | # Help
70 |
71 | If you find any mistakes or can't figure something out, raise a question and I will get back to you as soon as possible. If you liked what you saw and want to chat with me about the portfolio, work opportunities, or collaboration, send an email to :e-mail: shantanu97@gmail.com.
72 | More information about me: [LinkedIn](https://www.linkedin.com/in/shantanugupta9326/) :mag_right:
73 |
--------------------------------------------------------------------------------
/Text Analytics/LDA_NMF_Assessment_Project.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "LDA-NMF-Assessment-Project.ipynb",
7 | "version": "0.3.2",
8 | "provenance": [],
9 | "include_colab_link": true
10 | },
11 | "language_info": {
12 | "codemirror_mode": {
13 | "name": "ipython",
14 | "version": 3
15 | },
16 | "file_extension": ".py",
17 | "mimetype": "text/x-python",
18 | "name": "python",
19 | "nbconvert_exporter": "python",
20 | "pygments_lexer": "ipython3",
21 | "version": "3.6.6"
22 | },
23 | "kernelspec": {
24 | "display_name": "Python 3",
25 | "language": "python",
26 | "name": "python3"
27 | }
28 | },
29 | "cells": [
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {
33 | "id": "view-in-github",
34 | "colab_type": "text"
35 | },
36 | "source": [
37 | " "
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {
43 | "collapsed": true,
44 | "id": "4VloKTlo2XaY",
45 | "colab_type": "text"
46 | },
47 | "source": [
48 | "# Topic Modeling Assessment on Quora Questions"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {
54 | "id": "BRZv54462XaZ",
55 | "colab_type": "text"
56 | },
57 | "source": [
58 | "#### Task: Import pandas and read in the quora_questions.csv file."
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "metadata": {
64 | "id": "MIC_dnfV2XaZ",
65 | "colab_type": "code",
66 | "colab": {}
67 | },
68 | "source": [
69 | "import pandas as pd"
70 | ],
71 | "execution_count": 0,
72 | "outputs": []
73 | },
74 | {
75 | "cell_type": "code",
76 | "metadata": {
77 | "id": "niMVWH9i2evL",
78 | "colab_type": "code",
79 | "outputId": "e5b3b19e-3421-42b0-c203-26b22469e42a",
80 | "colab": {
81 | "base_uri": "https://localhost:8080/",
82 | "height": 34
83 | }
84 | },
85 | "source": [
86 | "# Mount Google Drive so that files stored there can be read from Colab\n",
87 | "from google.colab import drive\n",
88 | "drive.mount('/content/drive/')"
89 | ],
90 | "execution_count": 0,
91 | "outputs": [
92 | {
93 | "output_type": "stream",
94 | "text": [
95 | "Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount(\"/content/drive/\", force_remount=True).\n"
96 | ],
97 | "name": "stdout"
98 | }
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "metadata": {
104 | "id": "udHc-J5J2Xab",
105 | "colab_type": "code",
106 | "outputId": "e392072d-bf47-4407-ee35-695d7f6761f4",
107 | "colab": {
108 | "base_uri": "https://localhost:8080/",
109 | "height": 1000
110 | }
111 | },
112 | "source": [
113 | "quora = pd.read_csv('/content/drive/My Drive/app/quora_questions.csv')\n",
114 | "quora"
115 | ],
116 | "execution_count": 0,
117 | "outputs": [
118 | {
119 | "output_type": "execute_result",
120 | "data": {
393 | "text/plain": [
394 | " Question\n",
395 | "0 What is the step by step guide to invest in sh...\n",
396 | "1 What is the story of Kohinoor (Koh-i-Noor) Dia...\n",
397 | "2 How can I increase the speed of my internet co...\n",
398 | "3 Why am I mentally very lonely? How can I solve...\n",
399 | "4 Which one dissolve in water quikly sugar, salt...\n",
400 | "5 Astrology: I am a Capricorn Sun Cap moon and c...\n",
401 | "6 Should I buy tiago?\n",
402 | "7 How can I be a good geologist?\n",
403 | "8 When do you use シ instead of し?\n",
404 | "9 Motorola (company): Can I hack my Charter Moto...\n",
405 | "10 Method to find separation of slits using fresn...\n",
406 | "11 How do I read and find my YouTube comments?\n",
407 | "12 What can make Physics easy to learn?\n",
408 | "13 What was your first sexual experience like?\n",
409 | "14 What are the laws to change your status from a...\n",
410 | "15 What would a Trump presidency mean for current...\n",
411 | "16 What does manipulation mean?\n",
412 | "17 Why do girls want to be friends with the guy t...\n",
413 | "18 Why are so many Quora users posting questions ...\n",
414 | "19 Which is the best digital marketing institutio...\n",
415 | "20 Why do rockets look white?\n",
416 | "21 What's causing someone to be jealous?\n",
417 | "22 What are the questions should not ask on Quora?\n",
418 | "23 How much is 30 kV in HP?\n",
419 | "24 What does it mean that every time I look at th...\n",
420 | "25 What are some tips on making it through the jo...\n",
421 | "26 What is web application?\n",
422 | "27 Does society place too much importance on sports?\n",
423 | "28 What is best way to make money online?\n",
424 | "29 How should I prepare for CA final law?\n",
425 | "... ...\n",
426 | "404259 Which phone is best under 12000?\n",
427 | "404260 Who is the overall most popular Game of Throne...\n",
428 | "404261 How do you troubleshoot a Toshiba laptop?\n",
429 | "404262 How does the burning of fossil fuels contribut...\n",
430 | "404263 Is it safe to store an external battery power ...\n",
431 | "404264 How can I gain weight on my body?\n",
432 | "404265 What is the green dot next to the phone icon o...\n",
433 | "404266 What are the causes of the fall of the Roman E...\n",
434 | "404267 Why don't we still do great music like in the ...\n",
435 | "404268 How do you diagnose antisocial personality dis...\n",
436 | "404269 What is the difference between who and how?\n",
437 | "404270 Does Stalin have any grandchildren that are st...\n",
438 | "404271 What are the best new car products or inventio...\n",
439 | "404272 What happens if you put milk in a coffee maker?\n",
440 | "404273 Will the next generation of parenting change o...\n",
441 | "404274 In accounting, why do we debit expenses and cr...\n",
442 | "404275 What is copilotsearch.com?\n",
443 | "404276 What does analytics do?\n",
444 | "404277 How did you prepare for AIIMS/NEET/AIPMT?\n",
445 | "404278 What is the minimum time required to build a f...\n",
446 | "404279 What are some outfit ideas to wear to a frat p...\n",
447 | "404280 Why is Manaphy childish in Pokémon Ranger and ...\n",
448 | "404281 How does a long distance relationship work?\n",
449 | "404282 What do you think of the removal of the MagSaf...\n",
450 | "404283 What does Jainism say about homosexuality?\n",
451 | "404284 How many keywords are there in the Racket prog...\n",
452 | "404285 Do you believe there is life after death?\n",
453 | "404286 What is one coin?\n",
454 | "404287 What is the approx annual cost of living while...\n",
455 | "404288 What is like to have sex with cousin?\n",
456 | "\n",
457 | "[404289 rows x 1 columns]"
458 | ]
459 | },
460 | "metadata": {
461 | "tags": []
462 | },
463 | "execution_count": 9
464 | }
465 | ]
466 | },
467 | {
468 | "cell_type": "code",
469 | "metadata": {
470 | "id": "Yp6SQrxf2Xad",
471 | "colab_type": "code",
472 | "outputId": "b2c5ad9d-38cc-431c-edaa-f1d546fd2964",
473 | "colab": {
474 | "base_uri": "https://localhost:8080/",
475 | "height": 204
476 | }
477 | },
478 | "source": [
479 | "quora.head()"
480 | ],
481 | "execution_count": 0,
482 | "outputs": [
483 | {
484 | "output_type": "execute_result",
485 | "data": {
533 | "text/plain": [
534 | " Question\n",
535 | "0 What is the step by step guide to invest in sh...\n",
536 | "1 What is the story of Kohinoor (Koh-i-Noor) Dia...\n",
537 | "2 How can I increase the speed of my internet co...\n",
538 | "3 Why am I mentally very lonely? How can I solve...\n",
539 | "4 Which one dissolve in water quikly sugar, salt..."
540 | ]
541 | },
542 | "metadata": {
543 | "tags": []
544 | },
545 | "execution_count": 10
546 | }
547 | ]
548 | },
549 | {
550 | "cell_type": "code",
551 | "metadata": {
552 | "id": "lEA02LK323mU",
553 | "colab_type": "code",
554 | "outputId": "fd4fb574-3f1a-40d6-9b7f-59f3e61c238d",
555 | "colab": {
556 | "base_uri": "https://localhost:8080/",
557 | "height": 34
558 | }
559 | },
560 | "source": [
561 | "len(quora)"
562 | ],
563 | "execution_count": 0,
564 | "outputs": [
565 | {
566 | "output_type": "execute_result",
567 | "data": {
568 | "text/plain": [
569 | "404289"
570 | ]
571 | },
572 | "metadata": {
573 | "tags": []
574 | },
575 | "execution_count": 12
576 | }
577 | ]
578 | },
579 | {
580 | "cell_type": "markdown",
581 | "metadata": {
582 | "id": "0Y343fMs2Xag",
583 | "colab_type": "text"
584 | },
585 | "source": [
586 | "# Preprocessing\n",
587 | "\n",
588 | "#### Task: Use TF-IDF Vectorization to create a vectorized document term matrix. You may want to explore the max_df and min_df parameters."
589 | ]
590 | },
591 | {
592 | "cell_type": "code",
593 | "metadata": {
594 | "id": "kzrf0HQn2Xag",
595 | "colab_type": "code",
596 | "colab": {}
597 | },
598 | "source": [
599 | "from sklearn.feature_extraction.text import TfidfVectorizer"
600 | ],
601 | "execution_count": 0,
602 | "outputs": []
603 | },
604 | {
605 | "cell_type": "code",
606 | "metadata": {
607 | "id": "U8wbdnz63ErE",
608 | "colab_type": "code",
609 | "colab": {}
610 | },
611 | "source": [
612 | "?TfidfVectorizer"
613 | ],
614 | "execution_count": 0,
615 | "outputs": []
616 | },
617 | {
618 | "cell_type": "code",
619 | "metadata": {
620 | "id": "K6Ry3vx12Xai",
621 | "colab_type": "code",
622 | "colab": {}
623 | },
624 | "source": [
625 | "tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')"
626 | ],
627 | "execution_count": 0,
628 | "outputs": []
629 | },
630 | {
631 | "cell_type": "code",
632 | "metadata": {
633 | "id": "3y33cwll2Xak",
634 | "colab_type": "code",
635 | "colab": {}
636 | },
637 | "source": [
638 | "dtm = tfidf.fit_transform(quora['Question'])"
639 | ],
640 | "execution_count": 0,
641 | "outputs": []
642 | },
643 | {
644 | "cell_type": "code",
645 | "metadata": {
646 | "id": "OCFFD-B42Xal",
647 | "colab_type": "code",
648 | "outputId": "6166ba87-4dba-4304-8eac-35353ff1a4ff",
649 | "colab": {
650 | "base_uri": "https://localhost:8080/",
651 | "height": 51
652 | }
653 | },
654 | "source": [
655 | "dtm"
656 | ],
657 | "execution_count": 0,
658 | "outputs": [
659 | {
660 | "output_type": "execute_result",
661 | "data": {
662 | "text/plain": [
663 | "<404289x38669 sparse matrix of type '<class 'numpy.float64'>'\n",
664 | "\twith 2002912 stored elements in Compressed Sparse Row format>"
665 | ]
666 | },
667 | "metadata": {
668 | "tags": []
669 | },
670 | "execution_count": 17
671 | }
672 | ]
673 | },
674 | {
675 | "cell_type": "markdown",
676 | "metadata": {
677 | "id": "HjNTMw1H2Xap",
678 | "colab_type": "text"
679 | },
680 | "source": [
681 | "# Non-Negative Matrix Factorization\n",
682 | "\n",
683 | "#### TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components (use random_state=42)."
684 | ]
685 | },
686 | {
687 | "cell_type": "code",
688 | "metadata": {
689 | "id": "pLQlfSMs2Xap",
690 | "colab_type": "code",
691 | "colab": {}
692 | },
693 | "source": [
694 | "from sklearn.decomposition import NMF"
695 | ],
696 | "execution_count": 0,
697 | "outputs": []
698 | },
699 | {
700 | "cell_type": "code",
701 | "metadata": {
702 | "id": "t9TC4Uug3deO",
703 | "colab_type": "code",
704 | "colab": {}
705 | },
706 | "source": [
707 | "?NMF"
708 | ],
709 | "execution_count": 0,
710 | "outputs": []
711 | },
712 | {
713 | "cell_type": "code",
714 | "metadata": {
715 | "id": "Ek9MAyq62Xar",
716 | "colab_type": "code",
717 | "colab": {}
718 | },
719 | "source": [
720 | "nmf_model = NMF(n_components=20,random_state=42)"
721 | ],
722 | "execution_count": 0,
723 | "outputs": []
724 | },
725 | {
726 | "cell_type": "code",
727 | "metadata": {
728 | "id": "uKT7m-Ej2Xat",
729 | "colab_type": "code",
730 | "outputId": "57b2bf57-413b-4b1b-9563-dfa8034d3712",
731 | "colab": {
732 | "base_uri": "https://localhost:8080/",
733 | "height": 68
734 | }
735 | },
736 | "source": [
737 | "nmf_model.fit(dtm)"
738 | ],
739 | "execution_count": 0,
740 | "outputs": [
741 | {
742 | "output_type": "execute_result",
743 | "data": {
744 | "text/plain": [
745 | "NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,\n",
746 | " n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,\n",
747 | " verbose=0)"
748 | ]
749 | },
750 | "metadata": {
751 | "tags": []
752 | },
753 | "execution_count": 22
754 | }
755 | ]
756 | },
757 | {
758 | "cell_type": "markdown",
759 | "metadata": {
760 | "id": "AL9G8voh2Xav",
761 | "colab_type": "text"
762 | },
763 | "source": [
764 | "#### TASK: Print out the top 15 highest-weighted words for each of the 20 topics."
765 | ]
766 | },
767 | {
768 | "cell_type": "code",
769 | "metadata": {
770 | "id": "8sigQQ042Xaw",
771 | "colab_type": "code",
772 | "outputId": "64340943-b533-4ea9-ad97-ab2b2f8c0016",
773 | "colab": {
774 | "base_uri": "https://localhost:8080/",
775 | "height": 1000
776 | }
777 | },
778 | "source": [
779 | "for index,topic in enumerate(nmf_model.components_):\n",
780 | " print(f'THE TOP 15 WORDS FOR TOPIC #{index}')\n",
781 | " print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])\n",
782 | " print('\\n')"
783 | ],
784 | "execution_count": 0,
785 | "outputs": [
786 | {
787 | "output_type": "stream",
788 | "text": [
789 | "THE TOP 15 WORDS FOR TOPIC #0\n",
790 | "['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']\n",
791 | "\n",
792 | "\n",
793 | "THE TOP 15 WORDS FOR TOPIC #1\n",
794 | "['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']\n",
795 | "\n",
796 | "\n",
797 | "THE TOP 15 WORDS FOR TOPIC #2\n",
798 | "['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']\n",
799 | "\n",
800 | "\n",
801 | "THE TOP 15 WORDS FOR TOPIC #3\n",
802 | "['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']\n",
803 | "\n",
804 | "\n",
805 | "THE TOP 15 WORDS FOR TOPIC #4\n",
806 | "['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']\n",
807 | "\n",
808 | "\n",
809 | "THE TOP 15 WORDS FOR TOPIC #5\n",
810 | "['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 'olympics', 'available', 'job', 'spotify', 'war', 'pakistan', 'india']\n",
811 | "\n",
812 | "\n",
813 | "THE TOP 15 WORDS FOR TOPIC #6\n",
814 | "['beginners', 'online', 'english', 'book', 'did', 'hacking', 'want', 'python', 'languages', 'java', 'learning', 'start', 'language', 'programming', 'learn']\n",
815 | "\n",
816 | "\n",
817 | "THE TOP 15 WORDS FOR TOPIC #7\n",
818 | "['happen', 'presidency', 'think', 'presidential', '2016', 'vote', 'better', 'election', 'did', 'win', 'hillary', 'president', 'clinton', 'donald', 'trump']\n",
819 | "\n",
820 | "\n",
821 | "THE TOP 15 WORDS FOR TOPIC #8\n",
822 | "['russia', 'business', 'win', 'coming', 'countries', 'place', 'pakistan', 'happen', 'end', 'country', 'iii', 'start', 'did', 'war', 'world']\n",
823 | "\n",
824 | "\n",
825 | "THE TOP 15 WORDS FOR TOPIC #9\n",
826 | "['indian', 'companies', 'don', 'guy', 'men', 'culture', 'women', 'work', 'girls', 'live', 'girl', 'look', 'sex', 'feel', 'like']\n",
827 | "\n",
828 | "\n",
829 | "THE TOP 15 WORDS FOR TOPIC #10\n",
830 | "['ca', 'departments', 'positions', 'movies', 'songs', 'business', 'read', 'start', 'job', 'work', 'engineering', 'ways', 'bad', 'books', 'good']\n",
831 | "\n",
832 | "\n",
833 | "THE TOP 15 WORDS FOR TOPIC #11\n",
834 | "['money', 'modi', 'currency', 'economy', 'think', 'government', 'ban', 'banning', 'black', 'indian', 'rupee', 'rs', '1000', 'notes', '500']\n",
835 | "\n",
836 | "\n",
837 | "THE TOP 15 WORDS FOR TOPIC #12\n",
838 | "['blowing', 'resolutions', 'resolution', 'mind', 'likes', 'girl', '2017', 'year', 'don', 'employees', 'going', 'day', 'things', 'new', 'know']\n",
839 | "\n",
840 | "\n",
841 | "THE TOP 15 WORDS FOR TOPIC #13\n",
842 | "['aspects', 'fluent', 'skill', 'spoken', 'ways', 'language', 'fluently', 'speak', 'communication', 'pronunciation', 'speaking', 'writing', 'skills', 'improve', 'english']\n",
843 | "\n",
844 | "\n",
845 | "THE TOP 15 WORDS FOR TOPIC #14\n",
846 | "['diet', 'help', 'healthy', 'exercise', 'month', 'pounds', 'reduce', 'quickly', 'loss', 'fast', 'fat', 'ways', 'gain', 'lose', 'weight']\n",
847 | "\n",
848 | "\n",
849 | "THE TOP 15 WORDS FOR TOPIC #15\n",
850 | "['having', 'feel', 'long', 'spend', 'did', 'person', 'machine', 'movies', 'favorite', 'job', 'home', 'sex', 'possible', 'travel', 'time']\n",
851 | "\n",
852 | "\n",
853 | "THE TOP 15 WORDS FOR TOPIC #16\n",
854 | "['marriage', 'make', 'did', 'girlfriend', 'feel', 'tell', 'forget', 'really', 'friend', 'true', 'know', 'person', 'girl', 'fall', 'love']\n",
855 | "\n",
856 | "\n",
857 | "THE TOP 15 WORDS FOR TOPIC #17\n",
858 | "['easy', 'hack', 'prepare', 'quickest', 'facebook', 'increase', 'painless', 'instagram', 'account', 'best', 'commit', 'fastest', 'suicide', 'easiest', 'way']\n",
859 | "\n",
860 | "\n",
861 | "THE TOP 15 WORDS FOR TOPIC #18\n",
862 | "['web', 'java', 'scripting', 'phone', 'mechanical', 'better', 'job', 'use', 'account', 'data', 'software', 'science', 'computer', 'engineering', 'difference']\n",
863 | "\n",
864 | "\n",
865 | "THE TOP 15 WORDS FOR TOPIC #19\n",
866 | "['earth', 'blowing', 'stop', 'use', 'easily', 'mind', 'google', 'flat', 'questions', 'hate', 'believe', 'ask', 'don', 'think', 'people']\n",
867 | "\n",
868 | "\n"
869 | ],
870 | "name": "stdout"
871 | }
872 | ]
873 | },
874 | {
875 | "cell_type": "markdown",
876 | "metadata": {
877 | "id": "PPcdAFaP2Xay",
878 | "colab_type": "text"
879 | },
880 | "source": [
881 | "#### TASK: Add a new column to the original quora dataframe that assigns each question to one of the 20 topic categories."
882 | ]
883 | },
884 | {
885 | "cell_type": "code",
886 | "metadata": {
887 | "id": "IO8lTwGN2Xay",
888 | "colab_type": "code",
889 | "outputId": "8fff9dfe-7ec2-4051-f4bd-7d7a4ef493fd",
890 | "colab": {
891 | "base_uri": "https://localhost:8080/",
892 | "height": 204
893 | }
894 | },
895 | "source": [
896 | "quora.head()"
897 | ],
898 | "execution_count": 0,
899 | "outputs": [
900 | {
901 | "output_type": "execute_result",
902 | "data": {
950 | "text/plain": [
951 | " Question\n",
952 | "0 What is the step by step guide to invest in sh...\n",
953 | "1 What is the story of Kohinoor (Koh-i-Noor) Dia...\n",
954 | "2 How can I increase the speed of my internet co...\n",
955 | "3 Why am I mentally very lonely? How can I solve...\n",
956 | "4 Which one dissolve in water quikly sugar, salt..."
957 | ]
958 | },
959 | "metadata": {
960 | "tags": []
961 | },
962 | "execution_count": 24
963 | }
964 | ]
965 | },
966 | {
967 | "cell_type": "code",
968 | "metadata": {
969 | "id": "RvlBn4A_2Xa0",
970 | "colab_type": "code",
971 | "colab": {}
972 | },
973 | "source": [
974 | "topic_results = nmf_model.transform(dtm)"
975 | ],
976 | "execution_count": 0,
977 | "outputs": []
978 | },
979 | {
980 | "cell_type": "code",
981 | "metadata": {
982 | "id": "WmFmF0Bj2Xa2",
983 | "colab_type": "code",
984 | "outputId": "4cccf8f8-eae9-48e1-c7a2-4ceb13625fb8",
985 | "colab": {
986 | "base_uri": "https://localhost:8080/",
987 | "height": 359
988 | }
989 | },
990 | "source": [
991 | "topic_results.argmax(axis=1)\n",
992 | "\n",
993 | "quora['Topic'] = topic_results.argmax(axis=1)\n",
994 | "\n",
995 | "quora.head(10)"
996 | ],
997 | "execution_count": 0,
998 | "outputs": [
999 | {
1000 | "output_type": "execute_result",
1001 | "data": {
1080 | "text/plain": [
1081 | " Question Topic\n",
1082 | "0 What is the step by step guide to invest in sh... 5\n",
1083 | "1 What is the story of Kohinoor (Koh-i-Noor) Dia... 16\n",
1084 | "2 How can I increase the speed of my internet co... 17\n",
1085 | "3 Why am I mentally very lonely? How can I solve... 11\n",
1086 | "4 Which one dissolve in water quikly sugar, salt... 14\n",
1087 | "5 Astrology: I am a Capricorn Sun Cap moon and c... 1\n",
1088 | "6 Should I buy tiago? 0\n",
1089 | "7 How can I be a good geologist? 10\n",
1090 | "8 When do you use シ instead of し? 19\n",
1091 | "9 Motorola (company): Can I hack my Charter Moto... 17"
1092 | ]
1093 | },
1094 | "metadata": {
1095 | "tags": []
1096 | },
1097 | "execution_count": 26
1098 | }
1099 | ]
1100 | },
1101 | {
1102 | "cell_type": "markdown",
1103 | "metadata": {
1104 | "id": "gfk1I8Ua2Xa4",
1105 | "colab_type": "text"
1106 | },
1107 | "source": [
1108 | "# Great job!"
1109 | ]
1110 | }
1111 | ]
1112 | }
--------------------------------------------------------------------------------
/Text Analytics/Latent_Dirichlet_Allocation_on_Articles.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Latent-Dirichlet-Allocation on Articles.ipynb",
7 | "version": "0.3.2",
8 | "provenance": [],
9 | "include_colab_link": true
10 | },
11 | "language_info": {
12 | "codemirror_mode": {
13 | "name": "ipython",
14 | "version": 3
15 | },
16 | "file_extension": ".py",
17 | "mimetype": "text/x-python",
18 | "name": "python",
19 | "nbconvert_exporter": "python",
20 | "pygments_lexer": "ipython3",
21 | "version": "3.6.6"
22 | },
23 | "kernelspec": {
24 | "name": "python3",
25 | "display_name": "Python 3"
26 | },
27 | "accelerator": "GPU"
28 | },
29 | "cells": [
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {
33 | "id": "view-in-github",
34 | "colab_type": "text"
35 | },
36 | "source": [
37 | " "
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {
43 | "id": "I6AJRZnqoi9i",
44 | "colab_type": "text"
45 | },
46 | "source": [
47 | "# Latent Dirichlet Allocation"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {
53 | "id": "M6FGtZf6oi9i",
54 | "colab_type": "text"
55 | },
56 | "source": [
57 | "## Data\n",
58 | "\n",
59 | "We will be using articles from NPR (National Public Radio), obtained from their website [www.npr.org](http://www.npr.org)"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "metadata": {
65 | "id": "bCBAW567oi9j",
66 | "colab_type": "code",
67 | "colab": {}
68 | },
69 | "source": [
70 | "import pandas as pd"
71 | ],
72 | "execution_count": 0,
73 | "outputs": []
74 | },
75 | {
76 | "cell_type": "code",
77 | "metadata": {
78 | "id": "gA3YsT-ao8df",
79 | "colab_type": "code",
80 | "outputId": "6e0b10a6-36aa-45bc-cdaa-18068d0401c6",
81 | "colab": {
82 | "base_uri": "https://localhost:8080/",
83 | "height": 34
84 | }
85 | },
86 | "source": [
87 | "# Mount Google Drive so files stored there can be read from Colab\n",
88 | "from google.colab import drive\n",
89 | "drive.mount('/content/drive/')"
90 | ],
91 | "execution_count": 0,
92 | "outputs": [
93 | {
94 | "output_type": "stream",
95 | "text": [
96 | "Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount(\"/content/drive/\", force_remount=True).\n"
97 | ],
98 | "name": "stdout"
99 | }
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "metadata": {
105 | "id": "UQpIxqV6oi9k",
106 | "colab_type": "code",
107 | "colab": {}
108 | },
109 | "source": [
110 | "npr = pd.read_csv(\"/content/drive/My Drive/app/npr.csv\")"
111 | ],
112 | "execution_count": 0,
113 | "outputs": []
114 | },
115 | {
116 | "cell_type": "code",
117 | "metadata": {
118 | "id": "AgbIMBtcoi9m",
119 | "colab_type": "code",
120 | "outputId": "caed0316-38f0-436f-fc2d-f7df780575a1",
121 | "colab": {
122 | "base_uri": "https://localhost:8080/",
123 | "height": 204
124 | }
125 | },
126 | "source": [
127 | "npr.head()"
128 | ],
129 | "execution_count": 0,
130 | "outputs": [
131 | {
132 | "output_type": "execute_result",
133 | "data": {
181 | "text/plain": [
182 | " Article\n",
183 | "0 In the Washington of 2016, even when the polic...\n",
184 | "1 Donald Trump has used Twitter — his prefe...\n",
185 | "2 Donald Trump is unabashedly praising Russian...\n",
186 | "3 Updated at 2:50 p. m. ET, Russian President Vl...\n",
187 | "4 From photography, illustration and video, to d..."
188 | ]
189 | },
190 | "metadata": {
191 | "tags": []
192 | },
193 | "execution_count": 4
194 | }
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {
200 | "id": "MDyoxoZ-oi9p",
201 | "colab_type": "text"
202 | },
203 | "source": [
204 | "Notice that the articles come without topic labels. Let's use LDA to try to discover clusters of related articles."
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "metadata": {
210 | "id": "2eGnw3G1oi9p",
211 | "colab_type": "text"
212 | },
213 | "source": [
214 | "## Preprocessing"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "metadata": {
220 | "id": "6skz31_loi9q",
221 | "colab_type": "code",
222 | "colab": {}
223 | },
224 | "source": [
225 | "from sklearn.feature_extraction.text import CountVectorizer"
226 | ],
227 | "execution_count": 0,
228 | "outputs": []
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {
233 | "id": "un0zCGajoi9r",
234 | "colab_type": "text"
235 | },
236 | "source": [
237 | "**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0` \n",
238 | "When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.\n",
239 | "\n",
240 | "**`min_df`**` : float in range [0.0, 1.0] or int, default=1` \n",
241 | "When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None."
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "metadata": {
247 | "id": "SkhHsnteoi9s",
248 | "colab_type": "code",
249 | "colab": {}
250 | },
251 | "source": [
252 | "cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')"
253 | ],
254 | "execution_count": 0,
255 | "outputs": []
256 | },
257 | {
258 | "cell_type": "code",
259 | "metadata": {
260 | "id": "j5yonZ8roi9t",
261 | "colab_type": "code",
262 | "colab": {}
263 | },
264 | "source": [
265 | "dtm = cv.fit_transform(npr['Article'])"
266 | ],
267 | "execution_count": 0,
268 | "outputs": []
269 | },
270 | {
271 | "cell_type": "code",
272 | "metadata": {
273 | "id": "Dbn3ejluoi9v",
274 | "colab_type": "code",
275 | "outputId": "986356d4-8b16-47b8-83fa-3bb6bb6feeba",
276 | "colab": {
277 | "base_uri": "https://localhost:8080/",
278 | "height": 51
279 | }
280 | },
281 | "source": [
282 | "dtm"
283 | ],
284 | "execution_count": 0,
285 | "outputs": [
286 | {
287 | "output_type": "execute_result",
288 | "data": {
289 | "text/plain": [
290 | "<11992x54777 sparse matrix of type '<class 'numpy.int64'>'\n",
291 | "\twith 3033388 stored elements in Compressed Sparse Row format>"
292 | ]
293 | },
294 | "metadata": {
295 | "tags": []
296 | },
297 | "execution_count": 8
298 | }
299 | ]
300 | },
301 | {
302 | "cell_type": "markdown",
303 | "metadata": {
304 | "id": "1luFG-zooi9y",
305 | "colab_type": "text"
306 | },
307 | "source": [
308 | "## LDA"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "metadata": {
314 | "id": "s5zdudFToi9z",
315 | "colab_type": "code",
316 | "colab": {}
317 | },
318 | "source": [
319 | "from sklearn.decomposition import LatentDirichletAllocation"
320 | ],
321 | "execution_count": 0,
322 | "outputs": []
323 | },
324 | {
325 | "cell_type": "code",
326 | "metadata": {
327 | "id": "VNX7DDpQoi90",
328 | "colab_type": "code",
329 | "colab": {}
330 | },
331 | "source": [
332 | "LDA = LatentDirichletAllocation(n_components=7,random_state=42)"
333 | ],
334 | "execution_count": 0,
335 | "outputs": []
336 | },
337 | {
338 | "cell_type": "code",
339 | "metadata": {
340 | "id": "Ygjy_01uoi92",
341 | "colab_type": "code",
342 | "outputId": "95b122ae-c649-48bf-8fe5-aca97da36d5c",
343 | "colab": {
344 | "base_uri": "https://localhost:8080/",
345 | "height": 136
346 | }
347 | },
348 | "source": [
349 | "# This can take a while, we're dealing with a large number of documents!\n",
350 | "LDA.fit(dtm)"
351 | ],
352 | "execution_count": 0,
353 | "outputs": [
354 | {
355 | "output_type": "execute_result",
356 | "data": {
357 | "text/plain": [
358 | "LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,\n",
359 | " evaluate_every=-1, learning_decay=0.7,\n",
360 | " learning_method='batch', learning_offset=10.0,\n",
361 | " max_doc_update_iter=100, max_iter=10,\n",
362 | " mean_change_tol=0.001, n_components=7, n_jobs=None,\n",
363 | " perp_tol=0.1, random_state=42, topic_word_prior=None,\n",
364 | " total_samples=1000000.0, verbose=0)"
365 | ]
366 | },
367 | "metadata": {
368 | "tags": []
369 | },
370 | "execution_count": 11
371 | }
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "metadata": {
377 | "id": "yNf4usl2oi95",
378 | "colab_type": "text"
379 | },
380 | "source": [
381 | "## Showing Stored Words"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "metadata": {
387 | "id": "qAT7nAfBoi96",
388 | "colab_type": "code",
389 | "outputId": "f5e7099a-76c6-4f41-9357-6fa1c21fd6aa",
390 | "colab": {
391 | "base_uri": "https://localhost:8080/",
392 | "height": 34
393 | }
394 | },
395 | "source": [
396 | "len(cv.get_feature_names())"
397 | ],
398 | "execution_count": 0,
399 | "outputs": [
400 | {
401 | "output_type": "execute_result",
402 | "data": {
403 | "text/plain": [
404 | "54777"
405 | ]
406 | },
407 | "metadata": {
408 | "tags": []
409 | },
410 | "execution_count": 12
411 | }
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "metadata": {
417 | "id": "-01QBp41oi97",
418 | "colab_type": "code",
419 | "colab": {}
420 | },
421 | "source": [
422 | "import random"
423 | ],
424 | "execution_count": 0,
425 | "outputs": []
426 | },
427 | {
428 | "cell_type": "code",
429 | "metadata": {
430 | "id": "ksmN1zJPoi99",
431 | "colab_type": "code",
432 | "outputId": "6a83c287-0076-4341-e9d6-92dfc997bfae",
433 | "colab": {
434 | "base_uri": "https://localhost:8080/",
435 | "height": 187
436 | }
437 | },
438 | "source": [
439 | "for i in range(10):\n",
440 | " random_word_id = random.randint(0,54776)\n",
441 | " print(cv.get_feature_names()[random_word_id])"
442 | ],
443 | "execution_count": 0,
444 | "outputs": [
445 | {
446 | "output_type": "stream",
447 | "text": [
448 | "folkloric\n",
449 | "exaggerations\n",
450 | "proves\n",
451 | "promesa\n",
452 | "compatibility\n",
453 | "examiners\n",
454 | "corgi\n",
455 | "publish\n",
456 | "flory\n",
457 | "fart\n"
458 | ],
459 | "name": "stdout"
460 | }
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "metadata": {
466 | "id": "K7Ywf1Luoi9_",
467 | "colab_type": "code",
468 | "outputId": "6f6a1370-edee-44dc-b862-843901f9338d",
469 | "colab": {
470 | "base_uri": "https://localhost:8080/",
471 | "height": 187
472 | }
473 | },
474 | "source": [
475 | "for i in range(10):\n",
476 | " random_word_id = random.randint(0,54776)\n",
477 | " print(cv.get_feature_names()[random_word_id])"
478 | ],
479 | "execution_count": 0,
480 | "outputs": [
481 | {
482 | "output_type": "stream",
483 | "text": [
484 | "fen\n",
485 | "cheery\n",
486 | "forepaws\n",
487 | "iom\n",
488 | "kobayashi\n",
489 | "reveals\n",
490 | "commode\n",
491 | "powered\n",
492 | "nina\n",
493 | "glosses\n"
494 | ],
495 | "name": "stdout"
496 | }
497 | ]
498 | },
499 | {
500 | "cell_type": "markdown",
501 | "metadata": {
502 | "collapsed": true,
503 | "id": "B2NZt_KSoi-B",
504 | "colab_type": "text"
505 | },
506 | "source": [
507 | "### Showing Top Words Per Topic"
508 | ]
509 | },
510 | {
511 | "cell_type": "code",
512 | "metadata": {
513 | "id": "H6eyvhGCoi-C",
514 | "colab_type": "code",
515 | "outputId": "e7b59918-4054-42b7-cd84-f581d2e57f67",
516 | "colab": {
517 | "base_uri": "https://localhost:8080/",
518 | "height": 34
519 | }
520 | },
521 | "source": [
522 | "len(LDA.components_)"
523 | ],
524 | "execution_count": 0,
525 | "outputs": [
526 | {
527 | "output_type": "execute_result",
528 | "data": {
529 | "text/plain": [
530 | "7"
531 | ]
532 | },
533 | "metadata": {
534 | "tags": []
535 | },
536 | "execution_count": 16
537 | }
538 | ]
539 | },
540 | {
541 | "cell_type": "code",
542 | "metadata": {
543 | "id": "zl9ICtaMoi-E",
544 | "colab_type": "code",
545 | "outputId": "36849a8a-173e-434e-f344-a70e3428840e",
546 | "colab": {
547 | "base_uri": "https://localhost:8080/",
548 | "height": 238
549 | }
550 | },
551 | "source": [
552 | "LDA.components_"
553 | ],
554 | "execution_count": 0,
555 | "outputs": [
556 | {
557 | "output_type": "execute_result",
558 | "data": {
559 | "text/plain": [
560 | "array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,\n",
561 | " 1.43006821e-01, 1.42902042e-01, 1.42861626e-01],\n",
562 | " [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,\n",
563 | " 1.42861973e-01, 1.42857147e-01, 1.42906875e-01],\n",
564 | " [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,\n",
565 | " 6.14236247e+00, 2.14061364e+00, 1.42923753e-01],\n",
566 | " ...,\n",
567 | " [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,\n",
568 | " 1.42859912e-01, 1.42857146e-01, 1.42866614e-01],\n",
569 | " [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,\n",
570 | " 1.43107628e-01, 1.43902481e-01, 2.14271779e+00],\n",
571 | " [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,\n",
572 | " 1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])"
573 | ]
574 | },
575 | "metadata": {
576 | "tags": []
577 | },
578 | "execution_count": 17
579 | }
580 | ]
581 | },
582 | {
583 | "cell_type": "code",
584 | "metadata": {
585 | "id": "nXDZnGMboi-G",
586 | "colab_type": "code",
587 | "outputId": "930c490c-7c8f-42a9-c51a-6430d8cbfa38",
588 | "colab": {
589 | "base_uri": "https://localhost:8080/",
590 | "height": 34
591 | }
592 | },
593 | "source": [
594 | "len(LDA.components_[0])"
595 | ],
596 | "execution_count": 0,
597 | "outputs": [
598 | {
599 | "output_type": "execute_result",
600 | "data": {
601 | "text/plain": [
602 | "54777"
603 | ]
604 | },
605 | "metadata": {
606 | "tags": []
607 | },
608 | "execution_count": 18
609 | }
610 | ]
611 | },
612 | {
613 | "cell_type": "code",
614 | "metadata": {
615 | "id": "H1CpNUGWoi-J",
616 | "colab_type": "code",
617 | "outputId": "a2fdd07f-21ae-48d3-d8c1-5d8b56379ee0",
618 | "colab": {
619 | "base_uri": "https://localhost:8080/",
620 | "height": 51
621 | }
622 | },
623 | "source": [
624 | "single_topic = LDA.components_[0]\n",
625 | "single_topic[0:7]"
626 | ],
627 | "execution_count": 0,
628 | "outputs": [
629 | {
630 | "output_type": "execute_result",
631 | "data": {
632 | "text/plain": [
633 | "array([8.64332806e+00, 2.38014333e+03, 1.42900522e-01, 3.14264092e+00,\n",
634 | " 1.42857148e-01, 1.43742936e-01, 1.43002955e-01])"
635 | ]
636 | },
637 | "metadata": {
638 | "tags": []
639 | },
640 | "execution_count": 32
641 | }
642 | ]
643 | },
644 | {
645 | "cell_type": "code",
646 | "metadata": {
647 | "id": "80VSJiRcoi-L",
648 | "colab_type": "code",
649 | "outputId": "93cc7971-c1f7-4bdc-f576-17b97db701cd",
650 | "colab": {
651 | "base_uri": "https://localhost:8080/",
652 | "height": 34
653 | }
654 | },
655 | "source": [
656 | "# Returns the indices that would sort this array.\n",
657 | "single_topic.argsort()"
658 | ],
659 | "execution_count": 0,
660 | "outputs": [
661 | {
662 | "output_type": "execute_result",
663 | "data": {
664 | "text/plain": [
665 | "array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])"
666 | ]
667 | },
668 | "metadata": {
669 | "tags": []
670 | },
671 | "execution_count": 33
672 | }
673 | ]
674 | },
675 | {
676 | "cell_type": "code",
677 | "metadata": {
678 | "id": "Wdft5sEboi-O",
679 | "colab_type": "code",
680 | "outputId": "f4febbf1-560f-4910-cc67-b1c2412f670a",
681 | "colab": {
682 | "base_uri": "https://localhost:8080/",
683 | "height": 34
684 | }
685 | },
686 | "source": [
687 | "# Weight of one of the least representative words for this topic\n",
688 | "single_topic[18302]"
689 | ],
690 | "execution_count": 0,
691 | "outputs": [
692 | {
693 | "output_type": "execute_result",
694 | "data": {
695 | "text/plain": [
696 | "0.14285714309286987"
697 | ]
698 | },
699 | "metadata": {
700 | "tags": []
701 | },
702 | "execution_count": 21
703 | }
704 | ]
705 | },
706 | {
707 | "cell_type": "code",
708 | "metadata": {
709 | "id": "W_bJFb4soi-Q",
710 | "colab_type": "code",
711 | "outputId": "56d609b9-66dc-4891-96e9-823f0e8af12c",
712 | "colab": {
713 | "base_uri": "https://localhost:8080/",
714 | "height": 34
715 | }
716 | },
717 | "source": [
718 | "# Weight of the word most representative of this topic\n",
719 | "single_topic[42993]"
720 | ],
721 | "execution_count": 0,
722 | "outputs": [
723 | {
724 | "output_type": "execute_result",
725 | "data": {
726 | "text/plain": [
727 | "6247.245510521098"
728 | ]
729 | },
730 | "metadata": {
731 | "tags": []
732 | },
733 | "execution_count": 22
734 | }
735 | ]
736 | },
737 | {
738 | "cell_type": "code",
739 | "metadata": {
740 | "id": "cHhR7zGpoi-S",
741 | "colab_type": "code",
742 | "outputId": "b8549707-17ed-472b-cc05-cd8c155eb79d",
743 | "colab": {
744 | "base_uri": "https://localhost:8080/",
745 | "height": 51
746 | }
747 | },
748 | "source": [
749 | "# Indices of the top 10 words for this topic:\n",
750 | "single_topic.argsort()[-10:]"
751 | ],
752 | "execution_count": 0,
753 | "outputs": [
754 | {
755 | "output_type": "execute_result",
756 | "data": {
757 | "text/plain": [
758 | "array([33390, 36310, 21228, 10425, 31464, 8149, 36283, 22673, 42561,\n",
759 | " 42993])"
760 | ]
761 | },
762 | "metadata": {
763 | "tags": []
764 | },
765 | "execution_count": 23
766 | }
767 | ]
768 | },
769 | {
770 | "cell_type": "code",
771 | "metadata": {
772 | "id": "QXHaduMgoi-T",
773 | "colab_type": "code",
774 | "colab": {}
775 | },
776 | "source": [
777 | "top_word_indices = single_topic.argsort()[-10:]"
778 | ],
779 | "execution_count": 0,
780 | "outputs": []
781 | },
782 | {
783 | "cell_type": "code",
784 | "metadata": {
785 | "id": "7pCdxYMyoi-V",
786 | "colab_type": "code",
787 | "outputId": "9e66d247-87a7-48aa-a09a-aac1bc52c208",
788 | "colab": {
789 | "base_uri": "https://localhost:8080/",
790 | "height": 187
791 | }
792 | },
793 | "source": [
794 | "for index in top_word_indices:\n",
795 | " print(cv.get_feature_names()[index])"
796 | ],
797 | "execution_count": 0,
798 | "outputs": [
799 | {
800 | "output_type": "stream",
801 | "text": [
802 | "new\n",
803 | "percent\n",
804 | "government\n",
805 | "company\n",
806 | "million\n",
807 | "care\n",
808 | "people\n",
809 | "health\n",
810 | "said\n",
811 | "says\n"
812 | ],
813 | "name": "stdout"
814 | }
815 | ]
816 | },
817 | {
818 | "cell_type": "markdown",
819 | "metadata": {
820 | "id": "D8DT6Zs9oi-Y",
821 | "colab_type": "text"
822 | },
823 | "source": [
824 | "These look like business articles, perhaps. Let's confirm by using `.transform()` on our vectorized articles to attach a topic label to each one. But first, let's view all 7 topics the model found."
825 | ]
826 | },
827 | {
828 | "cell_type": "code",
829 | "metadata": {
830 | "id": "ooQnxjINoi-Y",
831 | "colab_type": "code",
832 | "outputId": "d6f4b031-d4d9-4638-cc4d-d05b6a4972b2",
833 | "colab": {
834 | "base_uri": "https://localhost:8080/",
835 | "height": 493
836 | }
837 | },
838 | "source": [
839 | "for index,topic in enumerate(LDA.components_):\n",
840 | " print(f'THE TOP 15 WORDS FOR TOPIC #{index}')\n",
841 | " print([cv.get_feature_names()[i] for i in topic.argsort()[-15:]])\n",
842 | " print('\\n')"
843 | ],
844 | "execution_count": 0,
845 | "outputs": [
846 | {
847 | "output_type": "stream",
848 | "text": [
849 | "THE TOP 15 WORDS FOR TOPIC #0\n",
850 | "['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']\n",
851 | "\n",
852 | "\n",
853 | "THE TOP 15 WORDS FOR TOPIC #1\n",
854 | "['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']\n",
855 | "\n",
856 | "\n",
857 | "THE TOP 15 WORDS FOR TOPIC #2\n",
858 | "['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']\n",
859 | "\n",
860 | "\n",
861 | "THE TOP 15 WORDS FOR TOPIC #3\n",
862 | "['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']\n",
863 | "\n",
864 | "\n",
865 | "THE TOP 15 WORDS FOR TOPIC #4\n",
866 | "['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']\n",
867 | "\n",
868 | "\n",
869 | "THE TOP 15 WORDS FOR TOPIC #5\n",
870 | "['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'people', 'just', 'like']\n",
871 | "\n",
872 | "\n",
873 | "THE TOP 15 WORDS FOR TOPIC #6\n",
874 | "['student', 'years', 'data', 'science', 'university', 'people', 'time', 'schools', 'just', 'education', 'new', 'like', 'students', 'school', 'says']\n",
875 | "\n",
876 | "\n"
877 | ],
878 | "name": "stdout"
879 | }
880 | ]
881 | },
882 | {
883 | "cell_type": "markdown",
884 | "metadata": {
885 | "id": "azxes_46oi-b",
886 | "colab_type": "text"
887 | },
888 | "source": [
889 | "### Attaching Discovered Topic Labels to Original Articles"
890 | ]
891 | },
892 | {
893 | "cell_type": "code",
894 | "metadata": {
895 | "id": "5Ly4YPAroi-b",
896 | "colab_type": "code",
897 | "outputId": "71d498b6-87c4-4a40-b8c9-d168fd11839d",
898 | "colab": {
899 | "base_uri": "https://localhost:8080/",
900 | "height": 51
901 | }
902 | },
903 | "source": [
904 | "dtm"
905 | ],
906 | "execution_count": 0,
907 | "outputs": [
908 | {
909 | "output_type": "execute_result",
910 | "data": {
911 | "text/plain": [
912 | "<11992x54777 sparse matrix of type '<class 'numpy.int64'>'\n",
913 | "\twith 3033388 stored elements in Compressed Sparse Row format>"
914 | ]
915 | },
916 | "metadata": {
917 | "tags": []
918 | },
919 | "execution_count": 27
920 | }
921 | ]
922 | },
923 | {
924 | "cell_type": "code",
925 | "metadata": {
926 | "id": "vzfZMMWUoi-d",
927 | "colab_type": "code",
928 | "outputId": "7244816c-30fd-4d85-9160-60d6375e3fe1",
929 | "colab": {
930 | "base_uri": "https://localhost:8080/",
931 | "height": 34
932 | }
933 | },
934 | "source": [
935 | "dtm.shape"
936 | ],
937 | "execution_count": 0,
938 | "outputs": [
939 | {
940 | "output_type": "execute_result",
941 | "data": {
942 | "text/plain": [
943 | "(11992, 54777)"
944 | ]
945 | },
946 | "metadata": {
947 | "tags": []
948 | },
949 | "execution_count": 28
950 | }
951 | ]
952 | },
953 | {
954 | "cell_type": "code",
955 | "metadata": {
956 | "id": "iCUeS5Bboi-e",
957 | "colab_type": "code",
958 | "outputId": "acf8b909-3a85-4f48-ec22-8fb9e04172e4",
959 | "colab": {
960 | "base_uri": "https://localhost:8080/",
961 | "height": 34
962 | }
963 | },
964 | "source": [
965 | "len(npr)"
966 | ],
967 | "execution_count": 0,
968 | "outputs": [
969 | {
970 | "output_type": "execute_result",
971 | "data": {
972 | "text/plain": [
973 | "11992"
974 | ]
975 | },
976 | "metadata": {
977 | "tags": []
978 | },
979 | "execution_count": 29
980 | }
981 | ]
982 | },
983 | {
984 | "cell_type": "code",
985 | "metadata": {
986 | "id": "LmFcxGihoi-g",
987 | "colab_type": "code",
988 | "colab": {}
989 | },
990 | "source": [
991 | "topic_results = LDA.transform(dtm)"
992 | ],
993 | "execution_count": 0,
994 | "outputs": []
995 | },
996 | {
997 | "cell_type": "code",
998 | "metadata": {
999 | "id": "cldzVYGloi-i",
1000 | "colab_type": "code",
1001 | "outputId": "21ae5a16-5777-4a87-a8c7-ab36a6d99acc",
1002 | "colab": {
1003 | "base_uri": "https://localhost:8080/",
1004 | "height": 34
1005 | }
1006 | },
1007 | "source": [
1008 | "topic_results.shape"
1009 | ],
1010 | "execution_count": 0,
1011 | "outputs": [
1012 | {
1013 | "output_type": "execute_result",
1014 | "data": {
1015 | "text/plain": [
1016 | "(11992, 7)"
1017 | ]
1018 | },
1019 | "metadata": {
1020 | "tags": []
1021 | },
1022 | "execution_count": 35
1023 | }
1024 | ]
1025 | },
1026 | {
1027 | "cell_type": "code",
1028 | "metadata": {
1029 | "id": "uRzKffZ-oi-k",
1030 | "colab_type": "code",
1031 | "outputId": "f9af84b8-188b-4c45-a5f2-dcb8c788bc08",
1032 | "colab": {
1033 | "base_uri": "https://localhost:8080/",
1034 | "height": 51
1035 | }
1036 | },
1037 | "source": [
1038 | "topic_results[0]"
1039 | ],
1040 | "execution_count": 0,
1041 | "outputs": [
1042 | {
1043 | "output_type": "execute_result",
1044 | "data": {
1045 | "text/plain": [
1046 | "array([1.61040465e-02, 6.83341493e-01, 2.25376318e-04, 2.25369288e-04,\n",
1047 | " 2.99652737e-01, 2.25479379e-04, 2.25497980e-04])"
1048 | ]
1049 | },
1050 | "metadata": {
1051 | "tags": []
1052 | },
1053 | "execution_count": 36
1054 | }
1055 | ]
1056 | },
1057 | {
1058 | "cell_type": "code",
1059 | "metadata": {
1060 | "id": "WW7cJLJvoi-m",
1061 | "colab_type": "code",
1062 | "outputId": "88b9f630-8e87-4c36-de19-1d37861e1056",
1063 | "colab": {
1064 | "base_uri": "https://localhost:8080/",
1065 | "height": 34
1066 | }
1067 | },
1068 | "source": [
1069 | "topic_results[0].round(2)"
1070 | ],
1071 | "execution_count": 0,
1072 | "outputs": [
1073 | {
1074 | "output_type": "execute_result",
1075 | "data": {
1076 | "text/plain": [
1077 | "array([0.02, 0.68, 0. , 0. , 0.3 , 0. , 0. ])"
1078 | ]
1079 | },
1080 | "metadata": {
1081 | "tags": []
1082 | },
1083 | "execution_count": 37
1084 | }
1085 | ]
1086 | },
1087 | {
1088 | "cell_type": "code",
1089 | "metadata": {
1090 | "id": "YKbSfxMCoi-o",
1091 | "colab_type": "code",
1092 | "outputId": "95cd2182-794b-401d-d591-ea93f4d67073",
1093 | "colab": {
1094 | "base_uri": "https://localhost:8080/",
1095 | "height": 34
1096 | }
1097 | },
1098 | "source": [
1099 | "topic_results[0].argmax()"
1100 | ],
1101 | "execution_count": 0,
1102 | "outputs": [
1103 | {
1104 | "output_type": "execute_result",
1105 | "data": {
1106 | "text/plain": [
1107 | "1"
1108 | ]
1109 | },
1110 | "metadata": {
1111 | "tags": []
1112 | },
1113 | "execution_count": 38
1114 | }
1115 | ]
1116 | },
1117 | {
1118 | "cell_type": "markdown",
1119 | "metadata": {
1120 | "id": "WdI-CQ3xoi-q",
1121 | "colab_type": "text"
1122 | },
1123 | "source": [
1124 | "The largest probability (about 0.68) sits at index 1, so the model assigns the first article to topic #1."
1125 | ]
1126 | },
1127 | {
1128 | "cell_type": "markdown",
1129 | "metadata": {
1130 | "id": "AyaJESXSoi-q",
1131 | "colab_type": "text"
1132 | },
1133 | "source": [
1134 | "### Combining with Original Data"
1135 | ]
1136 | },
1137 | {
1138 | "cell_type": "code",
1139 | "metadata": {
1140 | "id": "bGQSKRw2oi-r",
1141 | "colab_type": "code",
1142 | "outputId": "824a4f76-08ee-4994-bb13-5da034d99dd3",
1143 | "colab": {
1144 | "base_uri": "https://localhost:8080/",
1145 | "height": 204
1146 | }
1147 | },
1148 | "source": [
1149 | "npr.head()"
1150 | ],
1151 | "execution_count": 0,
1152 | "outputs": [
1153 | {
1154 | "output_type": "execute_result",
1155 | "data": {
1203 | "text/plain": [
1204 | " Article\n",
1205 | "0 In the Washington of 2016, even when the polic...\n",
1206 | "1 Donald Trump has used Twitter — his prefe...\n",
1207 | "2 Donald Trump is unabashedly praising Russian...\n",
1208 | "3 Updated at 2:50 p. m. ET, Russian President Vl...\n",
1209 | "4 From photography, illustration and video, to d..."
1210 | ]
1211 | },
1212 | "metadata": {
1213 | "tags": []
1214 | },
1215 | "execution_count": 39
1216 | }
1217 | ]
1218 | },
1219 | {
1220 | "cell_type": "code",
1221 | "metadata": {
1222 | "id": "81shC1cFoi-s",
1223 | "colab_type": "code",
1224 | "outputId": "73b611ec-b9c8-41f6-ad7e-05828bc80eee",
1225 | "colab": {
1226 | "base_uri": "https://localhost:8080/",
1227 | "height": 34
1228 | }
1229 | },
1230 | "source": [
1231 | "topic_results.argmax(axis=1)"
1232 | ],
1233 | "execution_count": 0,
1234 | "outputs": [
1235 | {
1236 | "output_type": "execute_result",
1237 | "data": {
1238 | "text/plain": [
1239 | "array([1, 1, 1, ..., 3, 4, 0])"
1240 | ]
1241 | },
1242 | "metadata": {
1243 | "tags": []
1244 | },
1245 | "execution_count": 40
1246 | }
1247 | ]
1248 | },
1249 | {
1250 | "cell_type": "code",
1251 | "metadata": {
1252 | "id": "PB-v1ydeoi-v",
1253 | "colab_type": "code",
1254 | "colab": {}
1255 | },
1256 | "source": [
1257 | "npr['Topic'] = topic_results.argmax(axis=1)"
1258 | ],
1259 | "execution_count": 0,
1260 | "outputs": []
1261 | },
1262 | {
1263 | "cell_type": "code",
1264 | "metadata": {
1265 | "id": "mhTqevA_oi-x",
1266 | "colab_type": "code",
1267 | "outputId": "55ef7dea-ff53-4f8f-d252-1fda29cf5013",
1268 | "colab": {
1269 | "base_uri": "https://localhost:8080/",
1270 | "height": 359
1271 | }
1272 | },
1273 | "source": [
1274 | "npr.head(10)"
1275 | ],
1276 | "execution_count": 0,
1277 | "outputs": [
1278 | {
1279 | "output_type": "execute_result",
1280 | "data": {
1359 | "text/plain": [
1360 | " Article Topic\n",
1361 | "0 In the Washington of 2016, even when the polic... 1\n",
1362 | "1 Donald Trump has used Twitter — his prefe... 1\n",
1363 | "2 Donald Trump is unabashedly praising Russian... 1\n",
1364 | "3 Updated at 2:50 p. m. ET, Russian President Vl... 1\n",
1365 | "4 From photography, illustration and video, to d... 2\n",
1366 | "5 I did not want to join yoga class. I hated tho... 3\n",
1367 | "6 With a who has publicly supported the debunk... 3\n",
1368 | "7 I was standing by the airport exit, debating w... 2\n",
1369 | "8 If movies were trying to be more realistic, pe... 3\n",
1370 | "9 Eighteen years ago, on New Year’s Eve, David F... 2"
1371 | ]
1372 | },
1373 | "metadata": {
1374 | "tags": []
1375 | },
1376 | "execution_count": 42
1377 | }
1378 | ]
1379 | },
1380 | {
1381 | "cell_type": "markdown",
1382 | "metadata": {
1383 | "id": "VCpb1qzKoi-z",
1384 | "colab_type": "text"
1385 | },
1386 | "source": [
1387 | "## Great work!"
1388 | ]
1389 | }
1390 | ]
1391 | }
--------------------------------------------------------------------------------
/Text Analytics/Part_of_Speech_Assessment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Part of Speech-Assessment.ipynb",
7 | "version": "0.3.2",
8 | "provenance": [],
9 | "include_colab_link": true
10 | },
11 | "language_info": {
12 | "codemirror_mode": {
13 | "name": "ipython",
14 | "version": 3
15 | },
16 | "file_extension": ".py",
17 | "mimetype": "text/x-python",
18 | "name": "python",
19 | "nbconvert_exporter": "python",
20 | "pygments_lexer": "ipython3",
21 | "version": "3.6.2"
22 | },
23 | "kernelspec": {
24 | "name": "python3",
25 | "display_name": "Python 3"
26 | },
27 | "accelerator": "GPU"
28 | },
29 | "cells": [
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {
33 | "id": "view-in-github",
34 | "colab_type": "text"
35 | },
36 | "source": [
37 | " "
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {
43 | "id": "1IXRup7pCt1R",
44 | "colab_type": "text"
45 | },
46 | "source": [
47 | "# Parts of Speech Assessment "
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {
53 | "id": "WgaXoStXCt1S",
54 | "colab_type": "text"
55 | },
56 | "source": [
57 | "For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8)."
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "metadata": {
63 | "id": "UMNtyeWrFCOf",
64 | "colab_type": "code",
65 | "outputId": "6813eb50-1588-49bf-fff3-c51a5b45eaee",
66 | "colab": {
67 | "base_uri": "https://localhost:8080/",
68 | "height": 34
69 | }
70 | },
71 | "source": [
72 | "# Mount Google Drive so files stored there can be read from Colab\n",
73 | "from google.colab import drive\n",
74 | "drive.mount('/content/drive/')"
75 | ],
76 | "execution_count": 0,
77 | "outputs": [
78 | {
79 | "output_type": "stream",
80 | "text": [
81 | "Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount(\"/content/drive/\", force_remount=True).\n"
82 | ],
83 | "name": "stdout"
84 | }
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "metadata": {
90 | "id": "RYKkqkYpCt1S",
91 | "colab_type": "code",
92 | "colab": {}
93 | },
94 | "source": [
95 | "# RUN THIS CELL to perform standard imports:\n",
96 | "import spacy\n",
97 | "nlp = spacy.load('en_core_web_sm')\n",
98 | "from spacy import displacy"
99 | ],
100 | "execution_count": 0,
101 | "outputs": []
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {
106 | "id": "n51tRgVVCt1U",
107 | "colab_type": "text"
108 | },
109 | "source": [
110 | "**1. Create a Doc object from the file `peterrabbit.txt`** \n",
111 | "> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`"
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "metadata": {
117 | "id": "i9ANOGV7Ct1U",
118 | "colab_type": "code",
119 | "colab": {}
120 | },
121 | "source": [
122 | "with open('/content/drive/My Drive/app/peterrabbit.txt') as f:\n",
123 | " doc = nlp(f.read())"
124 | ],
125 | "execution_count": 0,
126 | "outputs": []
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {
131 | "id": "gNFEq01dCt1W",
132 | "colab_type": "text"
133 | },
134 | "source": [
135 | "**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "metadata": {
141 | "id": "m47lbOaPCt1X",
142 | "colab_type": "code",
143 | "outputId": "a38591e4-c008-45f5-df43-50b5df9680d0",
144 | "colab": {
145 | "base_uri": "https://localhost:8080/",
146 | "height": 357
147 | }
148 | },
149 | "source": [
150 | "# Enter your code here:\n",
151 | "\n",
152 | "for token in list(doc.sents)[2]:\n",
153 | " print(f'{token.text:{12}} {token.pos_:{6}} {token.tag_:{6}} {spacy.explain(token.tag_)}')"
154 | ],
155 | "execution_count": 0,
156 | "outputs": [
157 | {
158 | "output_type": "stream",
159 | "text": [
160 | "Flopsy PROPN NNP noun, proper singular\n",
161 | ", PUNCT , punctuation mark, comma\n",
162 | "\n",
163 | " SPACE _SP None\n",
164 | "Mopsy PROPN NNP noun, proper singular\n",
165 | ", PUNCT , punctuation mark, comma\n",
166 | "\n",
167 | " SPACE _SP None\n",
168 | "Cotton PROPN NNP noun, proper singular\n",
169 | "- PUNCT HYPH punctuation mark, hyphen\n",
170 | "tail NOUN NN noun, singular or mass\n",
171 | ", PUNCT , punctuation mark, comma\n",
172 | "\n",
173 | " SPACE _SP None\n",
174 | "and CCONJ CC conjunction, coordinating\n",
175 | "Peter PROPN NNP noun, proper singular\n",
176 | ". PUNCT . punctuation mark, sentence closer\n",
177 | "\n",
178 | "\n",
179 | " SPACE _SP None\n"
180 | ],
181 | "name": "stdout"
182 | }
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {
188 | "id": "dDV7TikYCt1Z",
189 | "colab_type": "text"
190 | },
191 | "source": [
192 | "**3. Provide a frequency list of POS tags from the entire document**"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "metadata": {
198 | "id": "AhsFHttxCt1a",
199 | "colab_type": "code",
200 | "outputId": "12eba2dd-63d9-4a0b-a215-80b5c010cbbc",
201 | "colab": {
202 | "base_uri": "https://localhost:8080/",
203 | "height": 238
204 | }
205 | },
206 | "source": [
207 | "POS_counts = doc.count_by(spacy.attrs.POS)\n",
208 | "\n",
209 | "for k,v in sorted(POS_counts.items()):\n",
210 | " print(f'{k}. {doc.vocab[k].text:{5}}: {v}')"
211 | ],
212 | "execution_count": 0,
213 | "outputs": [
214 | {
215 | "output_type": "stream",
216 | "text": [
217 | "84. ADJ : 57\n",
218 | "85. ADP : 129\n",
219 | "86. ADV : 75\n",
220 | "89. CCONJ: 61\n",
221 | "90. DET : 118\n",
222 | "92. NOUN : 166\n",
223 | "93. NUM : 8\n",
224 | "94. PART : 34\n",
225 | "95. PRON : 78\n",
226 | "96. PROPN: 75\n",
227 | "97. PUNCT: 173\n",
228 | "100. VERB : 185\n",
229 | "103. SPACE: 99\n"
230 | ],
231 | "name": "stdout"
232 | }
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {
238 | "id": "f0rUCtHsCt1c",
239 | "colab_type": "text"
240 | },
241 | "source": [
242 | "**4. CHALLENGE: What percentage of tokens are nouns?** \n",
243 | "HINT: in the spaCy version used here, the attribute ID for 'NOUN' is 92, as the frequency list above shows"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "metadata": {
249 | "id": "6ew55xoHCt1c",
250 | "colab_type": "code",
251 | "outputId": "e72e1d00-955e-42c7-fa56-b25b0bf53c2d",
252 | "colab": {
253 | "base_uri": "https://localhost:8080/",
254 | "height": 197
255 | }
256 | },
257 | "source": [
258 | "percent = 100*POS_counts[92]/len(doc)  # NOUN has ID 92 in this spaCy version (see the frequency list above), not 91\n",
259 | "\n",
260 | "print(f'{POS_counts[92]}/{len(doc)} = {percent:{.4}}%')"
261 | ],
262 | "execution_count": 0,
263 | "outputs": [
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {
280 | "id": "fFxaEnIOCt1e",
281 | "colab_type": "text"
282 | },
283 | "source": [
284 | "**5. Display the Dependency Parse for the third sentence**"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "metadata": {
290 | "id": "G2heS4yVCt1f",
291 | "colab_type": "code",
292 | "outputId": "aa61fbfd-691b-4cec-ffcf-ffd3ab244edf",
293 | "colab": {
294 | "base_uri": "https://localhost:8080/",
295 | "height": 378
296 | }
297 | },
298 | "source": [
299 | "displacy.render(list(doc.sents)[2], style='dep', jupyter=True, options={'distance': 110})"
300 | ],
301 | "execution_count": 0,
302 | "outputs": [
303 | {
304 | "output_type": "display_data",
305 | "data": {
306 | "text/html": [
307 | "[Dependency parse diagram (displaCy) for the third sentence. Tokens: Flopsy, Mopsy, Cotton-tail, and Peter. Arc labels shown: nmod, compound, conj, cc, punct]"
435 | ],
436 | "text/plain": [
437 | ""
438 | ]
439 | },
440 | "metadata": {
441 | "tags": []
442 | }
443 | }
444 | ]
445 | },
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {
449 | "id": "eEh0FlYNCt1h",
450 | "colab_type": "text"
451 | },
452 | "source": [
453 | "**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***"
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "metadata": {
459 | "id": "Q4-RkhxcCt1i",
460 | "colab_type": "code",
461 | "outputId": "eb2e2209-eb24-4e5c-d0d9-31d5f5ca0c7e",
462 | "colab": {
463 | "base_uri": "https://localhost:8080/",
464 | "height": 51
465 | }
466 | },
467 | "source": [
468 | "for ent in doc.ents[:2]:\n",
469 | " print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))"
470 | ],
471 | "execution_count": 0,
472 | "outputs": [
473 | {
474 | "output_type": "stream",
475 | "text": [
476 | "Peter Rabbit - PERSON - People, including fictional\n",
477 | "Beatrix Potter - PERSON - People, including fictional\n"
478 | ],
479 | "name": "stdout"
480 | }
481 | ]
482 | },
483 | {
484 | "cell_type": "markdown",
485 | "metadata": {
486 | "id": "zFkqZz6_Ct1k",
487 | "colab_type": "text"
488 | },
489 | "source": [
490 | "**7. How many sentences are contained in *The Tale of Peter Rabbit*?**"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "metadata": {
496 | "id": "-zxHY3c_Ct1k",
497 | "colab_type": "code",
498 | "outputId": "032cb0c0-6339-4cc2-b4f9-e3b3ccd29c3f",
499 | "colab": {
500 | "base_uri": "https://localhost:8080/",
501 | "height": 34
502 | }
503 | },
504 | "source": [
505 | "len([sent for sent in doc.sents])"
506 | ],
507 | "execution_count": 0,
508 | "outputs": [
509 | {
510 | "output_type": "execute_result",
511 | "data": {
512 | "text/plain": [
513 | "60"
514 | ]
515 | },
516 | "metadata": {
517 | "tags": []
518 | },
519 | "execution_count": 11
520 | }
521 | ]
522 | },
523 | {
524 | "cell_type": "markdown",
525 | "metadata": {
526 | "id": "Ksxsg8LICt1m",
527 | "colab_type": "text"
528 | },
529 | "source": [
530 | "**8. CHALLENGE: How many sentences contain named entities?**"
531 | ]
532 | },
533 | {
534 | "cell_type": "code",
535 | "metadata": {
536 | "id": "ZHrPXu9vCt1n",
537 | "colab_type": "code",
538 | "outputId": "10e7403b-113c-4f22-d1a0-28494ed7e8a8",
539 | "colab": {
540 | "base_uri": "https://localhost:8080/",
541 | "height": 34
542 | }
543 | },
544 | "source": [
545 | "list_of_sents = [nlp(sent.text) for sent in doc.sents]\n",
546 | "list_of_ners = [sent_doc for sent_doc in list_of_sents if sent_doc.ents]\n",
547 | "len(list_of_ners)"
548 | ],
549 | "execution_count": 0,
550 | "outputs": [
551 | {
552 | "output_type": "execute_result",
553 | "data": {
554 | "text/plain": [
555 | "39"
556 | ]
557 | },
558 | "metadata": {
559 | "tags": []
560 | },
561 | "execution_count": 12
562 | }
563 | ]
564 | },
565 | {
566 | "cell_type": "markdown",
567 | "metadata": {
568 | "id": "o1_LIfo7Ct1p",
569 | "colab_type": "text"
570 | },
571 | "source": [
572 | "**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**"
573 | ]
574 | },
575 | {
576 | "cell_type": "code",
577 | "metadata": {
578 | "id": "5RSPe4ZSCt1q",
579 | "colab_type": "code",
580 | "outputId": "a5649676-a4cc-459d-bd5b-eb7e4980125f",
581 | "colab": {
582 | "base_uri": "https://localhost:8080/",
583 | "height": 52
584 | }
585 | },
586 | "source": [
587 | "displacy.render(list_of_sents[0], style='ent', jupyter=True)"
588 | ],
589 | "execution_count": 0,
590 | "outputs": [
591 | {
592 | "output_type": "display_data",
593 | "data": {
594 | "text/html": [
595 | "The Tale of [Peter Rabbit: PERSON], by [Beatrix Potter: PERSON] ([1902: DATE])."
613 | ],
614 | "text/plain": [
615 | ""
616 | ]
617 | },
618 | "metadata": {
619 | "tags": []
620 | }
621 | }
622 | ]
623 | },
624 | {
625 | "cell_type": "markdown",
626 | "metadata": {
627 | "id": "2XV6ZJJBCt1s",
628 | "colab_type": "text"
629 | },
630 | "source": [
631 | "### Great Job!"
632 | ]
633 | }
634 | ]
635 | }
--------------------------------------------------------------------------------
/Text Analytics/Text_Classification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Text-Classification.ipynb",
7 | "version": "0.3.2",
8 | "provenance": [],
9 | "include_colab_link": true
10 | },
11 | "language_info": {
12 | "codemirror_mode": {
13 | "name": "ipython",
14 | "version": 3
15 | },
16 | "file_extension": ".py",
17 | "mimetype": "text/x-python",
18 | "name": "python",
19 | "nbconvert_exporter": "python",
20 | "pygments_lexer": "ipython3",
21 | "version": "3.6.2"
22 | },
23 | "kernelspec": {
24 | "display_name": "Python 3",
25 | "language": "python",
26 | "name": "python3"
27 | }
28 | },
29 | "cells": [
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {
33 | "id": "view-in-github",
34 | "colab_type": "text"
35 | },
36 | "source": [
37 | " "
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {
43 | "id": "xgrlWtuEBEf0",
44 | "colab_type": "text"
45 | },
46 | "source": [
47 | "# Text Classification\n",
48 | "\n",
49 | "The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`. \n",
50 | "\n",
51 | "We've included 20 reviews that either contain `NaN` data or consist only of whitespace.\n",
52 | "\n",
53 | "For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {
59 | "id": "yDAWIe2fBEf1",
60 | "colab_type": "text"
61 | },
62 | "source": [
63 | "### Task #1: Perform imports and load the dataset into a pandas DataFrame\n"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "metadata": {
69 | "id": "ziny-LPpB2h6",
70 | "colab_type": "code",
71 | "outputId": "d2b41879-c1eb-431a-debb-bead1cccbebc",
72 | "colab": {
73 | "base_uri": "https://localhost:8080/",
74 | "height": 34
75 | }
76 | },
77 | "source": [
78 | "# Mount Google Drive so files stored there can be read from Colab\n",
79 | "from google.colab import drive\n",
80 | "drive.mount('/content/drive/')"
81 | ],
82 | "execution_count": 0,
83 | "outputs": [
84 | {
85 | "output_type": "stream",
86 | "text": [
87 | "Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount(\"/content/drive/\", force_remount=True).\n"
88 | ],
89 | "name": "stdout"
90 | }
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "metadata": {
96 | "id": "o2QKBJhCBEf1",
97 | "colab_type": "code",
98 | "outputId": "0d1dbc4f-9e2f-417e-a415-ba4c95ad3fe0",
99 | "colab": {
100 | "base_uri": "https://localhost:8080/",
101 | "height": 204
102 | }
103 | },
104 | "source": [
105 | "import numpy as np\n",
106 | "import pandas as pd\n",
107 | "\n",
108 | "df = pd.read_csv('/content/drive/My Drive/app/moviereviews2.tsv', sep='\\t')\n",
109 | "df.head()"
110 | ],
111 | "execution_count": 0,
112 | "outputs": [
113 | {
114 | "output_type": "execute_result",
115 | "data": {
169 | "text/plain": [
170 | " label review\n",
171 | "0 pos I loved this movie and will watch it again. Or...\n",
172 | "1 pos A warm, touching movie that has a fantasy-like...\n",
173 | "2 pos I was not expecting the powerful filmmaking ex...\n",
174 | "3 neg This so-called \"documentary\" tries to tell tha...\n",
175 | "4 pos This show has been my escape from reality for ..."
176 | ]
177 | },
178 | "metadata": {
179 | "tags": []
180 | },
181 | "execution_count": 2
182 | }
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {
188 | "id": "YEvK3g2dBEf4",
189 | "colab_type": "text"
190 | },
191 | "source": [
192 | "### Task #2: Check for missing values:"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "metadata": {
198 | "id": "z0cbjqfTBEf5",
199 | "colab_type": "code",
200 | "outputId": "f4f5326e-5d58-43dc-a676-6a4526d0610e",
201 | "colab": {
202 | "base_uri": "https://localhost:8080/",
203 | "height": 68
204 | }
205 | },
206 | "source": [
207 | "# Check for NaN values:\n",
208 | "df.isnull().sum()"
209 | ],
210 | "execution_count": 0,
211 | "outputs": [
212 | {
213 | "output_type": "execute_result",
214 | "data": {
215 | "text/plain": [
216 | "label 0\n",
217 | "review 20\n",
218 | "dtype: int64"
219 | ]
220 | },
221 | "metadata": {
222 | "tags": []
223 | },
224 | "execution_count": 3
225 | }
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "metadata": {
231 | "id": "kq--840lBEf7",
232 | "colab_type": "code",
233 | "outputId": "07604200-dd1f-4d3e-bb22-45bbf4fbaa30",
234 | "colab": {
235 | "base_uri": "https://localhost:8080/",
236 | "height": 34
237 | }
238 | },
239 | "source": [
240 | "# Check for whitespace strings (it's OK if there aren't any!):\n",
241 | "blanks = [] # start with an empty list\n",
242 | "\n",
243 | "for i,lb,rv in df.itertuples(): # iterate over the DataFrame\n",
244 | "    if type(rv)==str: # avoid NaN values\n",
245 | "        if rv.isspace(): # test 'review' for whitespace\n",
246 | "            blanks.append(i) # add matching index numbers to the list\n",
247 | " \n",
248 | "len(blanks)"
249 | ],
250 | "execution_count": 0,
251 | "outputs": [
252 | {
253 | "output_type": "execute_result",
254 | "data": {
255 | "text/plain": [
256 | "0"
257 | ]
258 | },
259 | "metadata": {
260 | "tags": []
261 | },
262 | "execution_count": 4
263 | }
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {
269 | "id": "SBUB9BGVBEf9",
270 | "colab_type": "text"
271 | },
272 | "source": [
273 | "### Task #3: Remove NaN values:"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "metadata": {
279 | "id": "McIVuiwWBEf9",
280 | "colab_type": "code",
281 | "colab": {}
282 | },
283 | "source": [
284 | "df.dropna(inplace=True)"
285 | ],
286 | "execution_count": 0,
287 | "outputs": []
288 | },
289 | {
290 | "cell_type": "markdown",
291 | "metadata": {
292 | "id": "W9Y827jPBEf_",
293 | "colab_type": "text"
294 | },
295 | "source": [
296 | "### Task #4: Take a quick look at the `label` column:"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "metadata": {
302 | "id": "QFW8FCuSBEgA",
303 | "colab_type": "code",
304 | "outputId": "169ef27c-71c7-4030-cbfc-3bbc4aefb7df",
305 | "colab": {
306 | "base_uri": "https://localhost:8080/",
307 | "height": 68
308 | }
309 | },
310 | "source": [
311 | "df['label'].value_counts()"
312 | ],
313 | "execution_count": 0,
314 | "outputs": [
315 | {
316 | "output_type": "execute_result",
317 | "data": {
318 | "text/plain": [
319 | "pos 2990\n",
320 | "neg 2990\n",
321 | "Name: label, dtype: int64"
322 | ]
323 | },
324 | "metadata": {
325 | "tags": []
326 | },
327 | "execution_count": 6
328 | }
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {
334 | "id": "NSJiR8BgBEgD",
335 | "colab_type": "text"
336 | },
337 | "source": [
338 | "### Task #5: Split the data into train & test sets:\n",
339 | "You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "metadata": {
345 | "id": "q8UTmpopBEgE",
346 | "colab_type": "code",
347 | "colab": {}
348 | },
349 | "source": [
350 | "from sklearn.model_selection import train_test_split\n",
351 | "\n",
352 | "X = df['review']\n",
353 | "y = df['label']\n",
354 | "\n",
355 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)"
356 | ],
357 | "execution_count": 0,
358 | "outputs": []
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {
363 | "id": "0iWbjsj6BEgF",
364 | "colab_type": "text"
365 | },
366 | "source": [
367 | "### Task #6: Build a pipeline to vectorize the data, then train and fit a model\n",
368 | "You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`."
369 | ]
370 | },
371 | {
372 | "cell_type": "code",
373 | "metadata": {
374 | "id": "myVrEev0BEgG",
375 | "colab_type": "code",
376 | "outputId": "e9f96687-a102-4d85-c1de-96b6c2672ac1",
377 | "colab": {
378 | "base_uri": "https://localhost:8080/",
379 | "height": 374
380 | }
381 | },
382 | "source": [
383 | "from sklearn.pipeline import Pipeline\n",
384 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
385 | "from sklearn.svm import LinearSVC\n",
386 | "\n",
387 | "text_clf = Pipeline([('tfidf', TfidfVectorizer()),\n",
388 | " ('clf', LinearSVC()),\n",
389 | "])\n",
390 | "\n",
391 | "# Feed the training data through the pipeline\n",
392 | "text_clf.fit(X_train, y_train) "
393 | ],
394 | "execution_count": 0,
395 | "outputs": [
396 | {
397 | "output_type": "execute_result",
398 | "data": {
399 | "text/plain": [
400 | "Pipeline(memory=None,\n",
401 | "         steps=[('tfidf',\n",
402 | "                 TfidfVectorizer(analyzer='word', binary=False,\n",
403 | "                                 decode_error='strict',\n",
404 | "                                 dtype=<class 'numpy.float64'>,\n",
405 | "                                 encoding='utf-8', input='content',\n",
406 | "                                 lowercase=True, max_df=1.0, max_features=None,\n",
407 | "                                 min_df=1, ngram_range=(1, 1), norm='l2',\n",
408 | "                                 preprocessor=None, smooth_idf=True,\n",
409 | "                                 stop_words=None, strip_accents=None,\n",
410 | "                                 sublinear_tf=False,\n",
411 | "                                 token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
412 | "                                 tokenizer=None, use_idf=True,\n",
413 | "                                 vocabulary=None)),\n",
414 | "                ('clf',\n",
415 | "                 LinearSVC(C=1.0, class_weight=None, dual=True,\n",
416 | "                           fit_intercept=True, intercept_scaling=1,\n",
417 | "                           loss='squared_hinge', max_iter=1000,\n",
418 | "                           multi_class='ovr', penalty='l2', random_state=None,\n",
419 | "                           tol=0.0001, verbose=0))],\n",
420 | "         verbose=False)"
421 | ]
422 | },
423 | "metadata": {
424 | "tags": []
425 | },
426 | "execution_count": 8
427 | }
428 | ]
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "metadata": {
433 | "id": "v1kBBJl-BEgI",
434 | "colab_type": "text"
435 | },
436 | "source": [
437 | "### Task #7: Run predictions and analyze the results"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "metadata": {
443 | "id": "mAFM8Y12BEgJ",
444 | "colab_type": "code",
445 | "colab": {}
446 | },
447 | "source": [
448 | "# Form a prediction set\n",
449 | "predictions = text_clf.predict(X_test)"
450 | ],
451 | "execution_count": 0,
452 | "outputs": []
453 | },
454 | {
455 | "cell_type": "code",
456 | "metadata": {
457 | "id": "qjqXt-ssBEgK",
458 | "colab_type": "code",
459 | "outputId": "507947a6-d8ec-4514-d933-3b864e80c920",
460 | "colab": {
461 | "base_uri": "https://localhost:8080/",
462 | "height": 51
463 | }
464 | },
465 | "source": [
466 | "# Report the confusion matrix\n",
467 | "from sklearn import metrics\n",
468 | "print(metrics.confusion_matrix(y_test,predictions))"
469 | ],
470 | "execution_count": 0,
471 | "outputs": [
472 | {
473 | "output_type": "stream",
474 | "text": [
475 | "[[900 91]\n",
476 | " [ 63 920]]\n"
477 | ],
478 | "name": "stdout"
479 | }
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "metadata": {
485 | "id": "IxEzexQZBEgM",
486 | "colab_type": "code",
487 | "outputId": "7f51b465-f969-42d7-f511-f55bb7cd0f6b",
488 | "colab": {
489 | "base_uri": "https://localhost:8080/",
490 | "height": 170
491 | }
492 | },
493 | "source": [
494 | "# Print a classification report\n",
495 | "print(metrics.classification_report(y_test,predictions))"
496 | ],
497 | "execution_count": 0,
498 | "outputs": [
499 | {
500 | "output_type": "stream",
501 | "text": [
502 | " precision recall f1-score support\n",
503 | "\n",
504 | " neg 0.93 0.91 0.92 991\n",
505 | " pos 0.91 0.94 0.92 983\n",
506 | "\n",
507 | " accuracy 0.92 1974\n",
508 | " macro avg 0.92 0.92 0.92 1974\n",
509 | "weighted avg 0.92 0.92 0.92 1974\n",
510 | "\n"
511 | ],
512 | "name": "stdout"
513 | }
514 | ]
515 | },
516 | {
517 | "cell_type": "code",
518 | "metadata": {
519 | "id": "nABEntVJBEgO",
520 | "colab_type": "code",
521 | "outputId": "ce1a33b2-c6a6-42ed-d56b-c8469947d8b7",
522 | "colab": {
523 | "base_uri": "https://localhost:8080/",
524 | "height": 34
525 | }
526 | },
527 | "source": [
528 | "# Print the overall accuracy\n",
529 | "print(metrics.accuracy_score(y_test,predictions))"
530 | ],
531 | "execution_count": 0,
532 | "outputs": [
533 | {
534 | "output_type": "stream",
535 | "text": [
536 | "0.9219858156028369\n"
537 | ],
538 | "name": "stdout"
539 | }
540 | ]
541 | },
542 | {
543 | "cell_type": "markdown",
544 | "metadata": {
545 | "id": "BMeCl2jABEgQ",
546 | "colab_type": "text"
547 | },
548 | "source": [
549 | "## Great job!"
550 | ]
551 | }
552 | ]
553 | }
--------------------------------------------------------------------------------
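Note on Task #7 of the notebook above: the per-class figures in the classification report can be recomputed from the printed confusion matrix alone. The following is a minimal sketch in plain Python, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`neg`, `pos`). The variable names (`cm`, `report`) are illustrative and do not appear in the notebook.

```python
# Confusion matrix as printed by the notebook.
# Assumption: rows = true labels, columns = predicted labels, order = (neg, pos).
cm = [[900, 91],
      [63, 920]]
labels = ["neg", "pos"]

total = sum(sum(row) for row in cm)
# Accuracy is the share of reviews on the diagonal (correct predictions).
accuracy = (cm[0][0] + cm[1][1]) / total

report = {}
for k, name in enumerate(labels):
    tp = cm[k][k]                        # correctly predicted as class k
    predicted_k = cm[0][k] + cm[1][k]    # column sum: everything predicted as k
    actual_k = sum(cm[k])                # row sum: everything truly k (support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2), actual_k)

print(accuracy)
for name, row in report.items():
    print(name, row)
```

Rounded to two decimals, the values agree with the report printed by the notebook: `neg` gives (0.93, 0.91, 0.92) with support 991, and `pos` gives (0.91, 0.94, 0.92) with support 983.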
/Text Analytics/Text_Generation_with_Neural_Networks.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "kernelspec": {
6 | "display_name": "Python 3",
7 | "language": "python",
8 | "name": "python3"
9 | },
10 | "language_info": {
11 | "codemirror_mode": {
12 | "name": "ipython",
13 | "version": 3
14 | },
15 | "file_extension": ".py",
16 | "mimetype": "text/x-python",
17 | "name": "python",
18 | "nbconvert_exporter": "python",
19 | "pygments_lexer": "ipython3",
20 | "version": "3.6.6"
21 | },
22 | "colab": {
23 | "name": "Text-Generation-with-Neural-Networks.ipynb",
24 | "version": "0.3.2",
25 | "provenance": [],
26 | "include_colab_link": true
27 | }
28 | },
29 | "cells": [
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {
43 | "id": "mXRWETNvp-1U",
44 | "colab_type": "text"
45 | },
46 | "source": [
47 | "# Text Generation with Neural Networks"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {
53 | "id": "oG39BOxMp-1V",
54 | "colab_type": "text"
55 | },
56 | "source": [
57 | "## Functions for Processing Text\n",
58 | "\n",
59 | "### Reading in files as a string text"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "metadata": {
65 | "id": "p2Izh3oTp-1W",
66 | "colab_type": "code",
67 | "colab": {}
68 | },
69 | "source": [
70 | "def read_file(filepath):\n",
71 | "    \n",
72 | "    with open(filepath) as f:\n",
73 | "        str_text = f.read()\n",
74 | "    \n",
75 | "    return str_text"
76 | ],
77 | "execution_count": 0,
78 | "outputs": []
79 | },
80 | {
81 | "cell_type": "code",
82 | "metadata": {
83 | "id": "vCn9hJCcp-1Y",
84 | "colab_type": "code",
85 | "colab": {}
86 | },
87 | "source": [
88 | "read_file('moby_dick_four_chapters.txt')"
89 | ],
90 | "execution_count": 0,
91 | "outputs": []
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {
96 | "collapsed": true,
97 | "id": "vvDbRFD_p-1a",
98 | "colab_type": "text"
99 | },
100 | "source": [
101 | "### Tokenize and Clean Text"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "metadata": {
107 | "id": "QcqYeq_Qp-1a",
108 | "colab_type": "code",
109 | "colab": {}
110 | },
111 | "source": [
112 | "import spacy\n",
113 | "nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])  # the 'en' shortcut was removed in spaCy 3\n",
114 | "\n",
115 | "nlp.max_length = 1198623"
116 | ],
117 | "execution_count": 0,
118 | "outputs": []
119 | },
120 | {
121 | "cell_type": "code",
122 | "metadata": {
123 | "id": "8k1DqRlCp-1c",
124 | "colab_type": "code",
125 | "colab": {}
126 | },
127 | "source": [
128 | "def separate_punc(doc_text):\n",
129 | " return [token.text.lower() for token in nlp(doc_text) if token.text not in '\\n\\n \\n\\n\\n!\"-#$%&()--.*+,-/:;<=>?@[\\\\]^_`{|}~\\t\\n ']"
130 | ],
131 | "execution_count": 0,
132 | "outputs": []
133 | },
134 | {
135 | "cell_type": "code",
136 | "metadata": {
137 | "id": "1v9-yY4Mp-1e",
138 | "colab_type": "code",
139 | "colab": {}
140 | },
141 | "source": [
142 | "d = read_file('melville-moby_dick.txt')\n",
143 | "tokens = separate_punc(d)"
144 | ],
145 | "execution_count": 0,
146 | "outputs": []
147 | },
148 | {
149 | "cell_type": "code",
150 | "metadata": {
151 | "id": "K0o8nH3Pp-1f",
152 | "colab_type": "code",
153 | "colab": {}
154 | },
155 | "source": [
156 | "tokens"
157 | ],
158 | "execution_count": 0,
159 | "outputs": []
160 | },
161 | {
162 | "cell_type": "code",
163 | "metadata": {
164 | "id": "8KdjIXcep-1h",
165 | "colab_type": "code",
166 | "colab": {}
167 | },
168 | "source": [
169 | "len(tokens)"
170 | ],
171 | "execution_count": 0,
172 | "outputs": []
173 | },
174 | {
175 | "cell_type": "code",
176 | "metadata": {
177 | "id": "d6ccQdY4p-1j",
178 | "colab_type": "code",
179 | "colab": {}
180 | },
181 | "source": [
182 | "4431/25"
183 | ],
184 | "execution_count": 0,
185 | "outputs": []
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {
190 | "id": "IYXc57_Tp-1m",
191 | "colab_type": "text"
192 | },
193 | "source": [
194 | "## Create Sequences of Tokens"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "metadata": {
200 | "id": "YkK8dLRUp-1n",
201 | "colab_type": "code",
202 | "colab": {}
203 | },
204 | "source": [
205 | "# organize into sequences of tokens\n",
206 | "train_len = 25+1 # 25 training words, then one target word\n",
207 | "\n",
208 | "# Empty list of sequences\n",
209 | "text_sequences = []\n",
210 | "\n",
211 | "for i in range(train_len, len(tokens)):\n",
212 | " \n",
213 | " # Grab the train_len tokens that end just before position i\n",
214 | " seq = tokens[i-train_len:i]\n",
215 | " \n",
216 | " # Add to list of sequences\n",
217 | " text_sequences.append(seq)"
218 | ],
219 | "execution_count": 0,
220 | "outputs": []
221 | },
222 | {
223 | "cell_type": "code",
224 | "metadata": {
225 | "id": "4bpALGXVp-1o",
226 | "colab_type": "code",
227 | "colab": {}
228 | },
229 | "source": [
230 | "' '.join(text_sequences[0])"
231 | ],
232 | "execution_count": 0,
233 | "outputs": []
234 | },
235 | {
236 | "cell_type": "code",
237 | "metadata": {
238 | "id": "HrYABSvdp-1q",
239 | "colab_type": "code",
240 | "colab": {}
241 | },
242 | "source": [
243 | "' '.join(text_sequences[1])"
244 | ],
245 | "execution_count": 0,
246 | "outputs": []
247 | },
248 | {
249 | "cell_type": "code",
250 | "metadata": {
251 | "id": "72xaV4-Vp-1r",
252 | "colab_type": "code",
253 | "colab": {}
254 | },
255 | "source": [
256 | "' '.join(text_sequences[2])"
257 | ],
258 | "execution_count": 0,
259 | "outputs": []
260 | },
261 | {
262 | "cell_type": "code",
263 | "metadata": {
264 | "id": "sAy5wlwvp-1s",
265 | "colab_type": "code",
266 | "colab": {}
267 | },
268 | "source": [
269 | "len(text_sequences)"
270 | ],
271 | "execution_count": 0,
272 | "outputs": []
273 | },
274 | {
275 | "cell_type": "markdown",
276 | "metadata": {
277 | "id": "fJ08ocJ0p-1u",
278 | "colab_type": "text"
279 | },
280 | "source": [
281 | "# Keras"
282 | ]
283 | },
284 | {
285 | "cell_type": "markdown",
286 | "metadata": {
287 | "id": "4Rs_KXMBp-1u",
288 | "colab_type": "text"
289 | },
290 | "source": [
291 | "### Keras Tokenization"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "metadata": {
297 | "id": "gVJ2jV4Sp-1v",
298 | "colab_type": "code",
299 | "colab": {}
300 | },
301 | "source": [
302 | "from keras.preprocessing.text import Tokenizer"
303 | ],
304 | "execution_count": 0,
305 | "outputs": []
306 | },
307 | {
308 | "cell_type": "code",
309 | "metadata": {
310 | "id": "3DHc_Wz3p-1w",
311 | "colab_type": "code",
312 | "colab": {}
313 | },
314 | "source": [
315 | "# integer encode sequences of words\n",
316 | "tokenizer = Tokenizer()\n",
317 | "tokenizer.fit_on_texts(text_sequences)\n",
318 | "sequences = tokenizer.texts_to_sequences(text_sequences)"
319 | ],
320 | "execution_count": 0,
321 | "outputs": []
322 | },
323 | {
324 | "cell_type": "code",
325 | "metadata": {
326 | "id": "Owy258o0p-1y",
327 | "colab_type": "code",
328 | "colab": {}
329 | },
330 | "source": [
331 | "sequences[0]"
332 | ],
333 | "execution_count": 0,
334 | "outputs": []
335 | },
336 | {
337 | "cell_type": "code",
338 | "metadata": {
339 | "id": "ksEUcX27p-1z",
340 | "colab_type": "code",
341 | "colab": {}
342 | },
343 | "source": [
344 | "tokenizer.index_word"
345 | ],
346 | "execution_count": 0,
347 | "outputs": []
348 | },
349 | {
350 | "cell_type": "code",
351 | "metadata": {
352 | "id": "qCL4K6ETp-11",
353 | "colab_type": "code",
354 | "colab": {}
355 | },
356 | "source": [
357 | "for i in sequences[0]:\n",
358 | " print(f'{i} : {tokenizer.index_word[i]}')"
359 | ],
360 | "execution_count": 0,
361 | "outputs": []
362 | },
363 | {
364 | "cell_type": "code",
365 | "metadata": {
366 | "id": "ocX9iXRYp-13",
367 | "colab_type": "code",
368 | "colab": {}
369 | },
370 | "source": [
371 | "tokenizer.word_counts"
372 | ],
373 | "execution_count": 0,
374 | "outputs": []
375 | },
376 | {
377 | "cell_type": "code",
378 | "metadata": {
379 | "id": "ytfWmmd5p-16",
380 | "colab_type": "code",
381 | "colab": {}
382 | },
383 | "source": [
384 | "vocabulary_size = len(tokenizer.word_counts)"
385 | ],
386 | "execution_count": 0,
387 | "outputs": []
388 | },
389 | {
390 | "cell_type": "markdown",
391 | "metadata": {
392 | "id": "5Q8Ls1BIp-17",
393 | "colab_type": "text"
394 | },
395 | "source": [
396 | "### Convert to Numpy Matrix"
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "metadata": {
402 | "id": "Bf0Bv5eTp-18",
403 | "colab_type": "code",
404 | "colab": {}
405 | },
406 | "source": [
407 | "import numpy as np"
408 | ],
409 | "execution_count": 0,
410 | "outputs": []
411 | },
412 | {
413 | "cell_type": "code",
414 | "metadata": {
415 | "id": "Gx_fCFjtp-19",
416 | "colab_type": "code",
417 | "colab": {}
418 | },
419 | "source": [
420 | "sequences = np.array(sequences)"
421 | ],
422 | "execution_count": 0,
423 | "outputs": []
424 | },
425 | {
426 | "cell_type": "code",
427 | "metadata": {
428 | "id": "bcQEf7pDp-2A",
429 | "colab_type": "code",
430 | "colab": {}
431 | },
432 | "source": [
433 | "sequences"
434 | ],
435 | "execution_count": 0,
436 | "outputs": []
437 | },
438 | {
439 | "cell_type": "markdown",
440 | "metadata": {
441 | "id": "RFAVEVHbp-2B",
442 | "colab_type": "text"
443 | },
444 | "source": [
445 | "# Creating an LSTM-based model"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "metadata": {
451 | "id": "HiekDgfOp-2C",
452 | "colab_type": "code",
453 | "colab": {}
454 | },
455 | "source": [
456 | "import keras\n",
457 | "from keras.models import Sequential\n",
458 | "from keras.layers import Dense,LSTM,Embedding"
459 | ],
460 | "execution_count": 0,
461 | "outputs": []
462 | },
463 | {
464 | "cell_type": "code",
465 | "metadata": {
466 | "id": "7OPZUN9dp-2D",
467 | "colab_type": "code",
468 | "colab": {}
469 | },
470 | "source": [
471 | "def create_model(vocabulary_size, seq_len):\n",
472 | " model = Sequential()\n",
473 | " model.add(Embedding(vocabulary_size, 25, input_length=seq_len))\n",
474 | " model.add(LSTM(150, return_sequences=True))\n",
475 | " model.add(LSTM(150))\n",
476 | " model.add(Dense(150, activation='relu'))\n",
477 | "\n",
478 | " model.add(Dense(vocabulary_size, activation='softmax'))\n",
479 | " \n",
480 | " model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
481 | " \n",
482 | " model.summary()\n",
483 | " \n",
484 | " return model"
485 | ],
486 | "execution_count": 0,
487 | "outputs": []
488 | },
489 | {
490 | "cell_type": "markdown",
491 | "metadata": {
492 | "id": "61AVruktp-2E",
493 | "colab_type": "text"
494 | },
495 | "source": [
496 | "### Split Sequences into Features and Labels"
497 | ]
498 | },
499 | {
500 | "cell_type": "code",
501 | "metadata": {
502 | "collapsed": true,
503 | "id": "gluECLSvp-2F",
504 | "colab_type": "code",
505 | "colab": {}
506 | },
507 | "source": [
508 | "from keras.utils import to_categorical"
509 | ],
510 | "execution_count": 0,
511 | "outputs": []
512 | },
513 | {
514 | "cell_type": "code",
515 | "metadata": {
516 | "collapsed": true,
517 | "id": "_LZriQsrp-2G",
518 | "colab_type": "code",
519 | "colab": {}
520 | },
521 | "source": [
522 | "sequences"
523 | ],
524 | "execution_count": 0,
525 | "outputs": []
526 | },
527 | {
528 | "cell_type": "code",
529 | "metadata": {
530 | "collapsed": true,
531 | "id": "ob1dzz95p-2I",
532 | "colab_type": "code",
533 | "colab": {}
534 | },
535 | "source": [
536 | "# First 25 words of each sequence (the inputs)\n",
537 | "sequences[:,:-1]"
538 | ],
539 | "execution_count": 0,
540 | "outputs": []
541 | },
542 | {
543 | "cell_type": "code",
544 | "metadata": {
545 | "collapsed": true,
546 | "id": "lHXs1BkRp-2K",
547 | "colab_type": "code",
548 | "colab": {}
549 | },
550 | "source": [
551 | "# Last word of each sequence (the target)\n",
552 | "sequences[:,-1]"
553 | ],
554 | "execution_count": 0,
555 | "outputs": []
556 | },
557 | {
558 | "cell_type": "code",
559 | "metadata": {
560 | "collapsed": true,
561 | "id": "cYhnq8Csp-2L",
562 | "colab_type": "code",
563 | "colab": {}
564 | },
565 | "source": [
566 | "X = sequences[:,:-1]"
567 | ],
568 | "execution_count": 0,
569 | "outputs": []
570 | },
571 | {
572 | "cell_type": "code",
573 | "metadata": {
574 | "collapsed": true,
575 | "id": "kHhzNOkbp-2N",
576 | "colab_type": "code",
577 | "colab": {}
578 | },
579 | "source": [
580 | "y = sequences[:,-1]"
581 | ],
582 | "execution_count": 0,
583 | "outputs": []
584 | },
585 | {
586 | "cell_type": "code",
587 | "metadata": {
588 | "collapsed": true,
589 | "id": "5m8Uw5P9p-2P",
590 | "colab_type": "code",
591 | "colab": {}
592 | },
593 | "source": [
594 | "y = to_categorical(y, num_classes=vocabulary_size+1)"
595 | ],
596 | "execution_count": 0,
597 | "outputs": []
598 | },
599 | {
600 | "cell_type": "code",
601 | "metadata": {
602 | "collapsed": true,
603 | "id": "S08ERCEGp-2R",
604 | "colab_type": "code",
605 | "colab": {}
606 | },
607 | "source": [
608 | "seq_len = X.shape[1]"
609 | ],
610 | "execution_count": 0,
611 | "outputs": []
612 | },
613 | {
614 | "cell_type": "code",
615 | "metadata": {
616 | "collapsed": true,
617 | "id": "w3KXjGlip-2S",
618 | "colab_type": "code",
619 | "colab": {}
620 | },
621 | "source": [
622 | "seq_len"
623 | ],
624 | "execution_count": 0,
625 | "outputs": []
626 | },
627 | {
628 | "cell_type": "markdown",
629 | "metadata": {
630 | "id": "9J75-5btp-2T",
631 | "colab_type": "text"
632 | },
633 | "source": [
634 | "### Training the Model"
635 | ]
636 | },
637 | {
638 | "cell_type": "code",
639 | "metadata": {
640 | "collapsed": true,
641 | "id": "6GDSus9Fp-2U",
642 | "colab_type": "code",
643 | "colab": {}
644 | },
645 | "source": [
646 | "# define model\n",
647 | "model = create_model(vocabulary_size+1, seq_len)"
648 | ],
649 | "execution_count": 0,
650 | "outputs": []
651 | },
652 | {
653 | "cell_type": "markdown",
654 | "metadata": {
655 | "id": "MMqsmjWhp-2V",
656 | "colab_type": "text"
657 | },
658 | "source": [
659 | "---\n",
660 | "\n",
661 | "----"
662 | ]
663 | },
664 | {
665 | "cell_type": "code",
666 | "metadata": {
667 | "collapsed": true,
668 | "id": "zazcuoXkp-2V",
669 | "colab_type": "code",
670 | "colab": {}
671 | },
672 | "source": [
673 | "from pickle import dump,load"
674 | ],
675 | "execution_count": 0,
676 | "outputs": []
677 | },
678 | {
679 | "cell_type": "code",
680 | "metadata": {
681 | "collapsed": true,
682 | "id": "pP_Fd2y0p-2W",
683 | "colab_type": "code",
684 | "colab": {}
685 | },
686 | "source": [
687 | "# fit model\n",
688 | "model.fit(X, y, batch_size=128, epochs=300,verbose=1)"
689 | ],
690 | "execution_count": 0,
691 | "outputs": []
692 | },
693 | {
694 | "cell_type": "code",
695 | "metadata": {
696 | "collapsed": true,
697 | "scrolled": true,
698 | "id": "ZRA4RvWhp-2Y",
699 | "colab_type": "code",
700 | "colab": {}
701 | },
702 | "source": [
703 | "# save the model to file\n",
704 | "model.save('epochBIG.h5')\n",
705 | "# save the tokenizer\n",
706 | "dump(tokenizer, open('epochBIG', 'wb'))"
707 | ],
708 | "execution_count": 0,
709 | "outputs": []
710 | },
711 | {
712 | "cell_type": "markdown",
713 | "metadata": {
714 | "id": "X_djmagrp-2a",
715 | "colab_type": "text"
716 | },
717 | "source": [
718 | "# Generating New Text"
719 | ]
720 | },
721 | {
722 | "cell_type": "code",
723 | "metadata": {
724 | "id": "Of8VWIzyp-2a",
725 | "colab_type": "code",
726 | "colab": {}
727 | },
728 | "source": [
729 | "from random import randint\n",
730 | "from pickle import load\n",
731 | "from keras.models import load_model\n",
732 | "from keras.preprocessing.sequence import pad_sequences"
733 | ],
734 | "execution_count": 0,
735 | "outputs": []
736 | },
737 | {
738 | "cell_type": "code",
739 | "metadata": {
740 | "id": "mf_r7MY0p-2b",
741 | "colab_type": "code",
742 | "colab": {}
743 | },
744 | "source": [
745 | "def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):\n",
746 | "    '''\n",
747 | "    INPUTS:\n",
748 | "    model : model that was trained on text data\n",
749 | "    tokenizer : tokenizer that was fit on text data\n",
750 | "    seq_len : length of training sequence\n",
751 | "    seed_text : raw string text to serve as the seed\n",
752 | "    num_gen_words : number of words to be generated by model\n",
753 | "    '''\n",
754 | "    \n",
755 | "    # Final Output\n",
756 | "    output_text = []\n",
757 | "    \n",
758 | "    # Initial Seed Sequence\n",
759 | "    input_text = seed_text\n",
760 | "    \n",
761 | "    # Create num_gen_words\n",
762 | "    for i in range(num_gen_words):\n",
763 | "        \n",
764 | "        # Take the input text string and encode it to a sequence\n",
765 | "        encoded_text = tokenizer.texts_to_sequences([input_text])[0]\n",
766 | "        \n",
767 | "        # Pad or truncate to the trained input length (seq_len, 25 words in this notebook)\n",
768 | "        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')\n",
769 | "        \n",
770 | "        # Pick the most probable next word index (predict_classes is gone from recent Keras releases)\n",
771 | "        pred_word_ind = model.predict(pad_encoded, verbose=0).argmax(axis=-1)[0]\n",
772 | "        \n",
773 | "        # Grab word\n",
774 | "        pred_word = tokenizer.index_word[pred_word_ind]\n",
775 | "        \n",
776 | "        # Update the sequence of input text (shifting one over with the new word)\n",
777 | "        input_text += ' ' + pred_word\n",
778 | "        \n",
779 | "        output_text.append(pred_word)\n",
780 | "    \n",
781 | "    # Make it look like a sentence.\n",
782 | "    return ' '.join(output_text)"
783 | ],
784 | "execution_count": 0,
785 | "outputs": []
786 | },
787 | {
788 | "cell_type": "markdown",
789 | "metadata": {
790 | "id": "SNJiCy9Hp-2d",
791 | "colab_type": "text"
792 | },
793 | "source": [
794 | "### Grab a random seed sequence"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "metadata": {
800 | "id": "RDfFUYRCp-2d",
801 | "colab_type": "code",
802 | "colab": {}
803 | },
804 | "source": [
805 | "text_sequences[0]"
806 | ],
807 | "execution_count": 0,
808 | "outputs": []
809 | },
810 | {
811 | "cell_type": "code",
812 | "metadata": {
813 | "id": "RWH6pPeEp-2e",
814 | "colab_type": "code",
815 | "colab": {}
816 | },
817 | "source": [
818 | "import random\n",
819 | "random.seed(101)\n",
820 | "random_pick = random.randint(0,len(text_sequences)-1) # randint includes both endpoints"
821 | ],
822 | "execution_count": 0,
823 | "outputs": []
824 | },
825 | {
826 | "cell_type": "code",
827 | "metadata": {
828 | "id": "S_Y5s1Ytp-2h",
829 | "colab_type": "code",
830 | "colab": {}
831 | },
832 | "source": [
833 | "random_seed_text = text_sequences[random_pick]"
834 | ],
835 | "execution_count": 0,
836 | "outputs": []
837 | },
838 | {
839 | "cell_type": "code",
840 | "metadata": {
841 | "id": "Kjxyuhi1p-2j",
842 | "colab_type": "code",
843 | "colab": {}
844 | },
845 | "source": [
846 | "random_seed_text"
847 | ],
848 | "execution_count": 0,
849 | "outputs": []
850 | },
851 | {
852 | "cell_type": "code",
853 | "metadata": {
854 | "id": "qZl3iyYyp-2k",
855 | "colab_type": "code",
856 | "colab": {}
857 | },
858 | "source": [
859 | "seed_text = ' '.join(random_seed_text)"
860 | ],
861 | "execution_count": 0,
862 | "outputs": []
863 | },
864 | {
865 | "cell_type": "code",
866 | "metadata": {
867 | "id": "smGsrMgap-2l",
868 | "colab_type": "code",
869 | "colab": {}
870 | },
871 | "source": [
872 | "seed_text"
873 | ],
874 | "execution_count": 0,
875 | "outputs": []
876 | },
877 | {
878 | "cell_type": "code",
879 | "metadata": {
880 | "id": "8P4RahB8p-2m",
881 | "colab_type": "code",
882 | "colab": {}
883 | },
884 | "source": [
885 | "generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=50)"
886 | ],
887 | "execution_count": 0,
888 | "outputs": []
889 | },
890 | {
891 | "cell_type": "markdown",
892 | "metadata": {
893 | "id": "28_uX2Lzp-2n",
894 | "colab_type": "text"
895 | },
896 | "source": [
897 | "### Exploring Generated Sequence"
898 | ]
899 | },
900 | {
901 | "cell_type": "code",
902 | "metadata": {
903 | "id": "KsJwQAUVp-2o",
904 | "colab_type": "code",
905 | "colab": {}
906 | },
907 | "source": [
908 | "full_text = read_file('moby_dick_four_chapters.txt')"
909 | ],
910 | "execution_count": 0,
911 | "outputs": []
912 | },
913 | {
914 | "cell_type": "code",
915 | "metadata": {
916 | "id": "IcsXT0nip-2p",
917 | "colab_type": "code",
918 | "colab": {}
919 | },
920 | "source": [
921 | "for i,word in enumerate(full_text.split()):\n",
922 | "    if word == 'inkling':\n",
923 | "        print(' '.join(full_text.split()[i-20:i+20]))\n",
924 | "        print('\\n')"
925 | ],
926 | "execution_count": 0,
927 | "outputs": []
928 | },
929 | {
930 | "cell_type": "markdown",
931 | "metadata": {
932 | "id": "etIos-pHp-2q",
933 | "colab_type": "text"
934 | },
935 | "source": [
936 | "# Great Job!"
937 | ]
938 | }
939 | ]
940 | }
--------------------------------------------------------------------------------
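Note on the "Create Sequences of Tokens" step of the notebook above: each training example is a sliding window of `train_len` consecutive tokens, where the last token is the target and the rest are the inputs. The following is a minimal sketch in plain Python on a ten-word toy sentence with a window of 3 + 1 instead of 25 + 1. The helper name `make_sequences` is hypothetical and not part of the notebook. One deliberate difference: the sketch iterates up to `len(tokens) + 1`, so the final window (ending on the last token) is included, whereas the notebook's `range(train_len, len(tokens))` stops one window earlier.

```python
def make_sequences(tokens, train_len):
    """Return every run of train_len consecutive tokens (sliding window, step 1)."""
    # i is the exclusive end index of each window; +1 keeps the final window.
    return [tokens[i - train_len:i] for i in range(train_len, len(tokens) + 1)]


tokens = "call me ishmael some years ago never mind how long".split()
train_len = 3 + 1  # 3 input words, then one target word

seqs = make_sequences(tokens, train_len)

# Same split as sequences[:, :-1] and sequences[:, -1] in the notebook,
# written with list slicing so no NumPy is needed here.
X = [s[:-1] for s in seqs]
y = [s[-1] for s in seqs]

print(len(seqs))
print(X[0], "->", y[0])
print(X[-1], "->", y[-1])
```

With 10 tokens and a window of 4 this yields 7 sequences; in general the count is `len(tokens) - train_len + 1`.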