├── LICENSE
├── README.md
├── Section 2
│   └── TF-IDF.py
├── Section 3
│   └── imdb.py
├── Section 4
│   ├── NaiveBayes.py
│   └── SVM.py
└── Section 5
    └── ChatBot.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Packt
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | # Natural Language Processing in Practice [Video]
5 | This is the code repository for [Natural Language Processing in Practice [Video]](https://www.packtpub.com/big-data-and-business-intelligence/natural-language-processing-practice-video?utm_source=github&utm_medium=repository&utm_campaign=9781787280885), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the video course from start to finish.
6 | ## About the Video Course
7 | Natural Language Processing (NLP) offers powerful ways to interpret and act on spoken and written language. It can help you with tasks such as customer support enquiries and customer feedback analysis. As the quantity of data continues to grow at an unprecedented rate, being able to understand and process it is becoming a key differentiator for competitive organizations.
8 |
9 | This course will help you gain these skills through practical demonstrations, clear explanations, and interesting real-world examples. It will give you a versatile range of deep learning and NLP skills that you can put to work in your own applications.
10 |
11 | By the end of this tutorial, you’ll have a better understanding of NLP and will be able to transform data into actionable knowledge. You will also have worked on multiple examples that implement deep learning to solve real-world spoken-language problems.
12 |
13 |
14 |
15 | ## What You Will Learn
16 |
17 | - Build applications with Python, using the Natural Language Toolkit (NLTK)
18 | - Create your own chatbot using NLP
19 | - Perform several Natural Language Processing tasks
20 | - Classify text and speech using the Naive Bayes algorithm
21 | - Use various tools and algorithms to build real-world applications
22 |
23 | ## Instructions and Navigation
24 | ### Assumed Knowledge
25 | To fully benefit from the coverage included in this course, you will need:
27 |
28 | ● An understanding of basic Python.
29 |
30 | ### Technical Requirements
31 | This course has the following software requirements:
33 |
34 | ● PyCharm Community Edition
35 |
36 | ● Python 3.6 or 3.7
39 |
40 | ● NLTK
41 |
42 | ● Keras
43 |
44 | ● scikit-learn
45 |
46 | ● NumPy
47 |
48 | ● pandas
49 |
50 | This course has been tested on the following system configuration:
51 |
52 | ● OS: Windows 10
53 |
54 | ● Processor: Dual-core 3.0 GHz
55 |
56 | ● Memory: 16 GB
57 |
58 |
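The listed packages can typically be installed with pip; a minimal sketch (package names taken from the list above, plus `chatterbot`, which `Section 5/ChatBot.py` imports):

```sh
pip install nltk keras scikit-learn numpy pandas chatterbot
```
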
59 | ## Related Products
60 | * [Hands-On Natural Language Processing with Pytorch [Video]](https://www.packtpub.com/application-development/hands-natural-language-processing-pytorch-video?utm_source=github&utm_medium=repository&utm_campaign=9781789133974)
61 |
62 | * [Mastering Natural Language Processing with Python [Video]](https://www.packtpub.com/big-data-and-business-intelligence/mastering-natural-language-processing-python-video?utm_source=github&utm_medium=repository&utm_campaign=9781789618358)
63 |
64 | * [Natural Language Processing with Python [Video]](https://www.packtpub.com/big-data-and-business-intelligence/natural-language-processing-python-video?utm_source=github&utm_medium=repository&utm_campaign=9781787286085)
65 |
66 |
--------------------------------------------------------------------------------
/Section 2/TF-IDF.py:
--------------------------------------------------------------------------------
1 | from math import log
2 | from nltk.stem import PorterStemmer
3 | from nltk.corpus import stopwords
4 | import string
5 | from nltk import corpus
6 | import numpy
7 |
8 | # importing the corpus into data variable
9 | data = corpus.brown
10 |
11 | # Using Porter Stemmer
12 | stemmer = PorterStemmer()
13 |
14 | # Building the list of stop words
15 | # We will filter the tokens against it
16 | stopwords = set(stopwords.words('english'))
17 | stopwords = stopwords.union(string.punctuation)
18 |
19 | # Limiting the number of files we will use
20 | # uncomment the following line (and remove the
21 | # [:30] slice below) to use all of the corpus files
22 | # fileids = data.fileids()
23 | fileids = data.fileids()[:30]
24 |
25 | idf_matrix = []
26 | dictionary = dict()
27 |
28 | # number of unique (stemmed) words in the corpus vocabulary
29 | words_count = 0
30 |
32 | # number of documents (files) being processed
32 | documents_count = len(fileids)
33 |
34 | # maps each original word to its filtered (stemmed) form
35 | filtered = dict()
36 |
37 | # to save the total counts of every word per file
38 | frequencies = dict()
39 | for fileid in fileids:
40 | frequencies[fileid] = dict()
41 |
42 | # filtering corpus
43 | for fileid in fileids:
44 | for word in data.words(fileid):
45 | # Skipping if it is a stop word
46 | if word in stopwords:
47 | continue
48 | # Before and after filtering
49 | before_word = word
50 | if before_word in filtered:
51 | word = filtered[before_word]
52 | else:
53 | # stemming the word
54 | word = stemmer.stem(word)
55 | filtered[before_word] = word
56 |
57 | if word in frequencies[fileid]:
58 | frequencies[fileid][word] += 1
59 | else:
60 | frequencies[fileid][word] = 1
61 |
62 | # saving all the words in a dictionary
63 | if word not in dictionary:
64 | dictionary[word] = words_count
65 | words_count += 1
66 |
67 | # Calculating TF #
68 | # nonzeros stores, for each document, the indexes of terms with non-zero frequency
69 | tf_matrix = []
70 | nonzeros = []
71 | for fileid in fileids:
72 | tf_vector = [0] * words_count
73 | nonzeros_vec = []
74 | for word in frequencies[fileid].keys():
75 | f = frequencies[fileid][word]
76 | tf_vector[dictionary[word]] = f
77 | if f > 0:
78 | nonzeros_vec.append(dictionary[word])
79 | nonzeros.append(nonzeros_vec)
80 | tf_matrix.append(tf_vector)
81 |
82 | # Counting document frequencies (df), used for the IDF weights below
83 | idf_matrix = [0] * words_count
84 | for fileid in fileids:
85 | for word in frequencies[fileid].keys():
86 | idf_matrix[dictionary[word]] += 1
87 |
88 | # Calculating TF-IDF matrix#
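# Weighting used in the loop below (a common TF-IDF variant, matching the code):
#   tf weight  = 1 + log2(tf)        for each non-zero term frequency tf
#   idf weight = log2(1 + N / df)    where N = documents_count and df = document frequency
#   tf-idf     = tf weight * idf weight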
89 | tfidf = []
90 |
91 | for i in range(documents_count):
92 | vector = [0] * words_count
93 | for j in nonzeros[i]:
94 | tf_value = tf_matrix[i][j]
95 | idf_value = idf_matrix[j]
96 | tf_value = 1 + log(tf_value, 2)
97 | idf_value = log(1 + documents_count/ float(idf_value), 2)
98 | vector[j] = tf_value * idf_value
99 | tfidf.append(vector)
100 |
101 | print("------ Top 10 : Keywords per document -------")
102 | for i in range(len(tfidf)):
103 | print("--- Document : " + str(fileids[i]))
104 | vector = tfidf[i]
105 |     sorted_indices = numpy.argsort(vector)[::-1]
106 |     for ind in sorted_indices[:10]:
107 | stem = list(dictionary.keys())[list(dictionary.values()).index(ind)]
108 | beforeStemming = list(filtered.keys())[list(filtered.values()).index(stem)]
109 | print(beforeStemming + " -- " + str(vector[ind]))
110 |
--------------------------------------------------------------------------------
/Section 3/imdb.py:
--------------------------------------------------------------------------------
1 | from keras.datasets import imdb
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers.embeddings import Embedding
5 | from keras.layers import Flatten, Dense
6 | from keras.preprocessing import sequence
7 | from numpy import array
8 |
9 | (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=5000)
10 |
11 | word_to_id = keras.datasets.imdb.get_word_index()
12 | word_to_id = {k:(v+3) for k,v in word_to_id.items()}
13 | word_to_id["<PAD>"] = 0      # reserved index for padding
14 | word_to_id["<START>"] = 1    # reserved index marking the start of a review
15 | word_to_id["<UNK>"] = 2      # reserved index for unknown words (hence the +3 shift above)
16 |
17 | x_train = sequence.pad_sequences(x_train, maxlen=300)
18 | x_test = sequence.pad_sequences(x_test, maxlen=300)
19 |
20 | network = Sequential()
21 | network.add(Embedding(5000, 32, input_length=300))
22 | network.add(Flatten())
23 | network.add(Dense(1, activation='sigmoid'))
24 | network.compile(loss="binary_crossentropy", optimizer='Adam', metrics=['accuracy'])
25 |
26 | network.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)
27 |
28 | result = network.evaluate(x_test, y_test, verbose=0)
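# result holds [test loss, test accuracy]; printing it makes the evaluation visible
print("Test accuracy: " + str(result[1]))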
29 |
30 | negative = "this movie was bad"
31 | positive = "i had fun"
32 | negative2 = "this movie was terrible"
33 | positive2 = "i really liked the movie"
34 |
35 | for review in [positive, positive2, negative, negative2]:
36 | temp = []
37 | for word in review.split(" "):
38 |         temp.append(word_to_id.get(word, 2))  # unknown words map to the <UNK> index
39 | temp_padded = sequence.pad_sequences([temp], maxlen=300)
40 |     print(review + " -- sentiment score: " + str(network.predict(temp_padded)[0][0]))
41 |
42 |
--------------------------------------------------------------------------------
/Section 4/NaiveBayes.py:
--------------------------------------------------------------------------------
1 |
2 | # Getting the dataset
3 | from sklearn.datasets import fetch_20newsgroups
4 |
5 | # Getting the training and test data subsets
6 | newsgroup_train = fetch_20newsgroups(subset='train', shuffle=True)
7 | newsgroup_test = fetch_20newsgroups(subset='test', shuffle=True)
8 |
9 | # Checking out the Categories Names
10 | i = 0
11 | for cat in newsgroup_train.target_names:
12 | i = i + 1
13 | print(str(i) + " - " + str(cat))
14 |
15 |
16 | # Printing a single post
17 | print("\n".join(newsgroup_train.data[5].split("\n")[:10]))
18 |
19 | # Extracting features
20 | from sklearn.feature_extraction.text import CountVectorizer
21 |
22 | count_vector = CountVectorizer()
23 | newsgroup_train_counts = count_vector.fit_transform(newsgroup_train.data)
24 |
25 | # Calculating TF-IDF
26 | from sklearn.feature_extraction.text import TfidfTransformer
27 |
28 | tfidf_transformer = TfidfTransformer()
29 | newsgroup_train_tfidf = tfidf_transformer.fit_transform(newsgroup_train_counts)
30 |
31 |
32 | # Training Naive Bayes
33 | from sklearn.naive_bayes import MultinomialNB
34 |
35 | nb_cla = MultinomialNB().fit(newsgroup_train_tfidf, newsgroup_train.target)
36 |
37 |
38 | # Simplifying the process with a Pipeline
39 | from sklearn.pipeline import Pipeline
40 |
41 | NB_Classifier = Pipeline([('vectorizer', CountVectorizer()), ('tfidf_matrix', TfidfTransformer()), ('nb_classifier', MultinomialNB())])
42 | NB_Classifier = NB_Classifier.fit(newsgroup_train.data, newsgroup_train.target)
43 |
44 | # Testing the Classifier
45 | import numpy as np
46 |
47 | predicted = NB_Classifier.predict(newsgroup_test.data)
48 | print(np.mean(predicted == newsgroup_test.target))
49 |
50 |
51 |
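# A small usage sketch (added for illustration; the example sentence is not from the course):
# classify a new piece of text with the trained pipeline and map the prediction to a category name.
new_docs = ["The GPU struggles to render OpenGL graphics"]
for doc, category in zip(new_docs, NB_Classifier.predict(new_docs)):
    print(doc + " => " + newsgroup_train.target_names[category])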
--------------------------------------------------------------------------------
/Section 4/SVM.py:
--------------------------------------------------------------------------------
1 | # Getting the dataset
2 | from sklearn.datasets import fetch_20newsgroups
3 |
4 | # Getting the training and test data subsets
5 | newsgroup_train = fetch_20newsgroups(subset='train', shuffle=True)
6 | newsgroup_test = fetch_20newsgroups(subset='test', shuffle=True)
7 |
8 | # Checking out the Categories Names
9 | i = 0
10 | for cat in newsgroup_train.target_names:
11 | i = i + 1
12 | print(str(i) + " - " + str(cat))
13 |
14 |
15 | # Printing a single post
16 | print("\n".join(newsgroup_train.data[5].split("\n")[:10]))
17 |
18 | # Extracting features
19 | from sklearn.feature_extraction.text import CountVectorizer
20 |
21 | count_vector = CountVectorizer()
22 | newsgroup_train_counts = count_vector.fit_transform(newsgroup_train.data)
23 |
24 | # Calculating TF-IDF
25 | from sklearn.feature_extraction.text import TfidfTransformer
26 |
27 | tfidf_transformer = TfidfTransformer()
28 | newsgroup_train_tfidf = tfidf_transformer.fit_transform(newsgroup_train_counts)
29 |
30 |
31 | # Training a linear Support Vector Machine (SGDClassifier with loss='hinge' fits one via stochastic gradient descent)
32 | from sklearn.linear_model import SGDClassifier
33 |
34 | from sklearn.pipeline import Pipeline
35 | import numpy as np
36 |
37 | SVM_Classifier = Pipeline([('vectorizer', CountVectorizer()), ('tfidf_matrix', TfidfTransformer()),
38 | ('svm_classifier', SGDClassifier(loss='hinge', penalty='l2', max_iter=100, alpha=1e-3, random_state=42))])
39 |
40 | SVM_Classifier = SVM_Classifier.fit(newsgroup_train.data, newsgroup_train.target)
41 |
42 |
43 |
44 |
45 | Predicted_SVM = SVM_Classifier.predict(newsgroup_test.data)
46 | print(np.mean(Predicted_SVM == newsgroup_test.target))
47 |
48 |
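# An optional evaluation sketch (added for illustration): per-category precision and recall
# using scikit-learn's classification_report.
from sklearn.metrics import classification_report
print(classification_report(newsgroup_test.target, Predicted_SVM,
                            target_names=newsgroup_test.target_names))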
--------------------------------------------------------------------------------
/Section 5/ChatBot.py:
--------------------------------------------------------------------------------
1 | from chatterbot import ChatBot
2 | import logging
3 |
4 | from chatterbot.trainers import ChatterBotCorpusTrainer
5 |
6 | logging.basicConfig(level=logging.CRITICAL)
7 |
8 | chatB = ChatBot("Mike",
9 | preprocessors=['chatterbot.preprocessors.clean_whitespace'],
10 | logic_adapters=['chatterbot.logic.BestMatch',
11 | 'chatterbot.logic.MathematicalEvaluation',
12 | 'chatterbot.logic.TimeLogicAdapter'])
13 |
14 | trainer = ChatterBotCorpusTrainer(chatB)
15 |
16 | trainer.train(
17 | "chatterbot.corpus.french"
18 | )
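# Note: other corpora shipped with the chatterbot-corpus package can be trained the same way,
# e.g. trainer.train("chatterbot.corpus.english") for English conversation data.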
19 |
20 | conversation = []
21 |
22 |
23 | def converse(quit="quit"):
24 | user_input = ""
25 | while user_input != quit:
26 | user_input = quit
27 | try:
28 | user_input = input(">")
29 | except EOFError:
30 | print(user_input)
31 | if user_input:
32 | while user_input[-1] in "!.":
33 | user_input = user_input[:-1]
34 | print(chatB.get_response(user_input))
35 |
36 | converse()
--------------------------------------------------------------------------------