├── LICENSE
├── README.md
├── Section 2
│   └── TF-IDF.py
├── Section 3
│   └── imdb.py
├── Section 4
│   ├── NaiveBayes.py
│   └── SVM.py
└── Section 5
    └── ChatBot.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Packt

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Natural Language Processing in Practice [Video]
This is the code repository for [Natural Language Processing in Practice [Video]](https://www.packtpub.com/big-data-and-business-intelligence/natural-language-processing-practice-video?utm_source=github&utm_medium=repository&utm_campaign=9781787280885), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the video course from start to finish.
## About the Video Course
Natural Language Processing (NLP) offers powerful ways to interpret and act on spoken and written language. It can help you with tasks such as handling customer support enquiries and analysing customer feedback. As the quantity of data continues to grow at an unprecedented rate, being able to understand and process it is becoming a key differentiator for competitive organizations.

This course will help you gain this skill through practical demonstrations, clear explanations, and interesting real-world examples. It will give you a versatile range of deep learning and NLP skills that you can put to work in your own applications.

By the end of this course, you’ll have a better understanding of NLP and will be able to transform data into actionable knowledge. You will also have worked on multiple examples that implement deep learning to solve real-world spoken-language problems.

## What You Will Learn
## Instructions and Navigation
### Assumed Knowledge
To fully benefit from the coverage included in this course, you will need:

● An understanding of basic Python.

### Technical Requirements
This course has the following software requirements:

● PyCharm Community

● Python 3.6 or 3.7

● NLTK

● Keras

● scikit-learn

● NumPy

● pandas

This course has been tested on the following system configuration:

● OS: Windows 10

● Processor: Dual Core 3.0 GHz

● Memory: 16 GB
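The scripts in this repository also download NLTK corpora at runtime. A minimal, hypothetical sanity check for the environment (the import names and NLTK resource names are the ones used by the scripts below; this snippet is not part of the course code):

```python
# Hypothetical environment check, not part of the course code.
import keras, nltk, numpy, pandas, sklearn

# Section 2 reads the Brown corpus and the English stop-word list:
nltk.download('brown')
nltk.download('stopwords')
```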
## Related Products
* [Hands-On Natural Language Processing with Pytorch [Video]](https://www.packtpub.com/application-development/hands-natural-language-processing-pytorch-video?utm_source=github&utm_medium=repository&utm_campaign=9781789133974)

* [Mastering Natural Language Processing with Python [Video]](https://www.packtpub.com/big-data-and-business-intelligence/mastering-natural-language-processing-python-video?utm_source=github&utm_medium=repository&utm_campaign=9781789618358)

* [Natural Language Processing with Python [Video]](https://www.packtpub.com/big-data-and-business-intelligence/natural-language-processing-python-video?utm_source=github&utm_medium=repository&utm_campaign=9781787286085)
--------------------------------------------------------------------------------
/Section 2/TF-IDF.py:
--------------------------------------------------------------------------------
from math import log
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string
from nltk import corpus
import numpy

# Loading the Brown corpus into the data variable
data = corpus.brown

# Using the Porter stemmer
stemmer = PorterStemmer()

# Building the set of stop words, plus punctuation
# We will filter the tokens against it
stop_words = set(stopwords.words('english'))
stop_words = stop_words.union(string.punctuation)

# Limiting the number of files we will use
# uncomment the following line to use all of the
# corpus files
# fileids = data.fileids()
fileids = data.fileids()[:30]

dictionary = dict()

# size of the vocabulary (number of distinct stems)
words_count = 0

# total number of documents in the corpus sample
documents_count = len(fileids)

# maps each original word to its stem (a stemming cache)
filtered = dict()

# to save the total counts of every stem, per file
frequencies = dict()
for fileid in fileids:
    frequencies[fileid] = dict()

# filtering the corpus
for fileid in fileids:
    for word in data.words(fileid):
        # Skipping stop words and punctuation (case-insensitive check)
        if word.lower() in stop_words:
            continue
        # Reusing the cached stem when the word has been seen before
        before_word = word
        if before_word in filtered:
            word = filtered[before_word]
        else:
            # stemming the word
            word = stemmer.stem(word)
            filtered[before_word] = word

        if word in frequencies[fileid]:
            frequencies[fileid][word] += 1
        else:
            frequencies[fileid][word] = 1

        # assigning every new stem an index in the dictionary
        if word not in dictionary:
            dictionary[word] = words_count
            words_count += 1

# Calculating TF #
# nonzeros keeps, per document, the indexes of non-zero counts
tf_matrix = []
nonzeros = []
for fileid in fileids:
    tf_vector = [0] * words_count
    nonzeros_vec = []
    for word in frequencies[fileid].keys():
        f = frequencies[fileid][word]
        tf_vector[dictionary[word]] = f
        if f > 0:
            nonzeros_vec.append(dictionary[word])
    nonzeros.append(nonzeros_vec)
    tf_matrix.append(tf_vector)

# Calculating IDF: document frequency of every stem
idf_matrix = [0] * words_count
for fileid in fileids:
    for word in frequencies[fileid].keys():
        idf_matrix[dictionary[word]] += 1

# Calculating the TF-IDF matrix #
tfidf = []

for i in range(documents_count):
    vector = [0] * words_count
    for j in nonzeros[i]:
        tf_value = tf_matrix[i][j]
        idf_value = idf_matrix[j]
        # log-scaled term frequency and smoothed inverse document frequency
        tf_value = 1 + log(tf_value, 2)
        idf_value = log(1 + documents_count / float(idf_value), 2)
        vector[j] = tf_value * idf_value
    tfidf.append(vector)

print("------ Top 10 : Keywords per document -------")
for i in range(len(tfidf)):
    print("--- Document : " + str(fileids[i]))
    vector = tfidf[i]
    # indexes of the scores, from highest to lowest
    ranked = numpy.argsort(vector)[::-1]
    for ind in ranked[:10]:
        # reverse lookups: stem index -> stem -> an original (unstemmed) word
        stem = list(dictionary.keys())[list(dictionary.values()).index(ind)]
        before_stemming = list(filtered.keys())[list(filtered.values()).index(stem)]
        print(before_stemming + " -- " + str(vector[ind]))
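Since tfidf holds one dense vector per document, a hypothetical follow-up (not in the course script; it reuses the script's tfidf, fileids, and numpy names) could compare two documents by cosine similarity:

    # Hypothetical: cosine similarity between the first two documents
    a, b = numpy.array(tfidf[0]), numpy.array(tfidf[1])
    cos = numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
    print(fileids[0], fileids[1], cos)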
--------------------------------------------------------------------------------
/Section 3/imdb.py:
--------------------------------------------------------------------------------
from keras.datasets import imdb
import keras
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Flatten, Dense
from keras.preprocessing import sequence

# keeping only the 5,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=5000)

# the Keras word index starts at 1; after shifting by 3,
# indices 0-2 are reserved for the special tokens below
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k: (v + 3) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

# padding/truncating every review to 300 tokens
x_train = sequence.pad_sequences(x_train, maxlen=300)
x_test = sequence.pad_sequences(x_test, maxlen=300)

network = Sequential()
network.add(Embedding(5000, 32, input_length=300))
network.add(Flatten())
network.add(Dense(1, activation='sigmoid'))
network.compile(loss="binary_crossentropy", optimizer='Adam', metrics=['accuracy'])

network.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)

# result holds [test loss, test accuracy]
result = network.evaluate(x_test, y_test, verbose=0)

negative = "this movie was bad"
positive = "i had fun"
negative2 = "this movie was terrible"
positive2 = "i really liked the movie"

for review in [positive, positive2, negative, negative2]:
    temp = []
    for word in review.split(" "):
        # unknown or out-of-vocabulary words fall back to <UNK>
        word_id = word_to_id.get(word, word_to_id["<UNK>"])
        if word_id >= 5000:
            word_id = word_to_id["<UNK>"]
        temp.append(word_id)
    temp_padded = sequence.pad_sequences([temp], maxlen=300)
    print(review + " -- Sent -- " + str(network.predict(temp_padded)[0][0]))
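To inspect what the network actually sees, a small hypothetical helper (not in the course script; it assumes it runs at the end of imdb.py and reuses its word_to_id and x_train) inverts the word index and decodes a padded review:

    # Hypothetical: decode the first (padded) training review back to text
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    print(" ".join(id_to_word.get(idx, "?") for idx in x_train[0]))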
--------------------------------------------------------------------------------
/Section 4/NaiveBayes.py:
--------------------------------------------------------------------------------
# Getting the dataset
from sklearn.datasets import fetch_20newsgroups

# Getting the training and test data subsets
newsgroup_train = fetch_20newsgroups(subset='train', shuffle=True)
newsgroup_test = fetch_20newsgroups(subset='test', shuffle=True)

# Checking out the category names
i = 0
for cat in newsgroup_train.target_names:
    i = i + 1
    print(str(i) + " - " + str(cat))

# Printing a single post
print("\n".join(newsgroup_train.data[5].split("\n")[:10]))

# Extracting features: raw token counts
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
newsgroup_train_counts = count_vector.fit_transform(newsgroup_train.data)

# Calculating TF-IDF from the counts
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
newsgroup_train_tfidf = tfidf_transformer.fit_transform(newsgroup_train_counts)

# Training a multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

nb_cla = MultinomialNB().fit(newsgroup_train_tfidf, newsgroup_train.target)

# Simplifying the process with a Pipeline
from sklearn.pipeline import Pipeline

NB_Classifier = Pipeline([('vectorizer', CountVectorizer()),
                          ('tfidf_matrix', TfidfTransformer()),
                          ('nb_classifier', MultinomialNB())])
NB_Classifier = NB_Classifier.fit(newsgroup_train.data, newsgroup_train.target)

# Testing the classifier: mean accuracy on the test subset
import numpy as np

predicted = NB_Classifier.predict(newsgroup_test.data)
print(np.mean(predicted == newsgroup_test.target))
--------------------------------------------------------------------------------
/Section 4/SVM.py:
--------------------------------------------------------------------------------
# Getting the dataset
from sklearn.datasets import fetch_20newsgroups

# Getting the training and test data subsets
newsgroup_train = fetch_20newsgroups(subset='train', shuffle=True)
newsgroup_test = fetch_20newsgroups(subset='test', shuffle=True)

# Checking out the category names
i = 0
for cat in newsgroup_train.target_names:
    i = i + 1
    print(str(i) + " - " + str(cat))

# Printing a single post
print("\n".join(newsgroup_train.data[5].split("\n")[:10]))

# Extracting features: raw token counts
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
newsgroup_train_counts = count_vector.fit_transform(newsgroup_train.data)

# Calculating TF-IDF from the counts
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
newsgroup_train_tfidf = tfidf_transformer.fit_transform(newsgroup_train_counts)

# Training a Support Vector Machine: SGDClassifier with hinge loss
# is a linear SVM fitted by stochastic gradient descent
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import numpy as np

SVM_Classifier = Pipeline([('vectorizer', CountVectorizer()),
                           ('tfidf_matrix', TfidfTransformer()),
                           ('svm_classifier', SGDClassifier(loss='hinge', penalty='l2',
                                                            max_iter=100, alpha=1e-3,
                                                            random_state=42))])

SVM_Classifier = SVM_Classifier.fit(newsgroup_train.data, newsgroup_train.target)

# Mean accuracy on the test subset
Predicted_SVM = SVM_Classifier.predict(newsgroup_test.data)
print(np.mean(Predicted_SVM == newsgroup_test.target))
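Mean accuracy hides per-category behaviour. A hypothetical follow-up (not in the course script; it reuses newsgroup_test and Predicted_SVM from SVM.py) prints a per-category precision/recall breakdown with scikit-learn's built-in report:

    from sklearn import metrics

    # Hypothetical: per-category precision/recall for the SVM predictions
    print(metrics.classification_report(newsgroup_test.target, Predicted_SVM,
                                        target_names=newsgroup_test.target_names))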
--------------------------------------------------------------------------------
/Section 5/ChatBot.py:
--------------------------------------------------------------------------------
from chatterbot import ChatBot
import logging

from chatterbot.trainers import ChatterBotCorpusTrainer

# silencing ChatterBot's info/warning log output
logging.basicConfig(level=logging.CRITICAL)

chatB = ChatBot("Mike",
                preprocessors=['chatterbot.preprocessors.clean_whitespace'],
                logic_adapters=['chatterbot.logic.BestMatch',
                                'chatterbot.logic.MathematicalEvaluation',
                                'chatterbot.logic.TimeLogicAdapter'])

trainer = ChatterBotCorpusTrainer(chatB)

trainer.train(
    "chatterbot.corpus.french"
)


def converse(quit="quit"):
    user_input = ""
    while user_input != quit:
        user_input = quit
        try:
            user_input = input(">")
        except EOFError:
            print(user_input)
        if user_input:
            # stripping trailing '!' and '.' without emptying the input
            while user_input and user_input[-1] in "!.":
                user_input = user_input[:-1]
            if user_input:
                print(chatB.get_response(user_input))


converse()
--------------------------------------------------------------------------------
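A couple of hypothetical one-off queries (not in the original script; they assume they run after training, bypassing the interactive loop) that exercise the extra logic adapters:

    # Hypothetical: the non-BestMatch adapters in action
    print(chatB.get_response("What is 4 + 9?"))   # MathematicalEvaluation
    print(chatB.get_response("What time is it?")) # TimeLogicAdapter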