├── LICENSE
├── README.md
├── Section 2
│   └── TF-IDF.py
├── Section 3
│   └── imdb.py
├── Section 4
│   ├── NaiveBayes.py
│   └── SVM.py
└── Section 5
    └── ChatBot.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Packt

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Natural Language Processing in Practice [Video]
This is the code repository for [Natural Language Processing in Practice [Video]](https://www.packtpub.com/big-data-and-business-intelligence/natural-language-processing-practice-video?utm_source=github&utm_medium=repository&utm_campaign=9781787280885), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the video course from start to finish.
## About the Video Course
Natural Language Processing (NLP) offers powerful ways to interpret and act on spoken and written language. It can help you with tasks such as handling customer support enquiries and analysing customer feedback. As the quantity of data continues to grow at an unprecedented rate, being able to understand and process it is becoming a key differentiator for competitive organizations.

This course will help you gain this skill through practical demonstrations, clear explanations, and interesting real-world examples. It will give you a versatile range of deep learning and NLP skills that you can put to work in your own applications.

By the end of this course, you’ll have a better understanding of NLP and will be able to transform data into actionable knowledge. You will also have worked on multiple examples that implement deep learning to solve real-world spoken-language problems.

## What You Will Learn
## Instructions and Navigation
### Assumed Knowledge
To fully benefit from the coverage included in this course, you will need:

● An understanding of basic Python.

### Technical Requirements
This course has the following software requirements:

● PyCharm Community

● Python 3.6 or 3.7

● NLTK

● Keras

● scikit-learn

● NumPy

● pandas

This course has been tested on the following system configuration:

● OS: Windows 10

● Processor: Dual Core 3.0 GHz

● Memory: 16 GB
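The scripts in this repository also download NLTK corpora at runtime. A minimal, hypothetical sanity check for the environment (the import names and NLTK resource names are the ones used by the scripts below; this snippet is not part of the course code):

```python
# Hypothetical environment check, not part of the course code.
import keras, nltk, numpy, pandas, sklearn

# Section 2 reads the Brown corpus and the English stop-word list:
nltk.download('brown')
nltk.download('stopwords')
```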
## Related Products
* [Hands-On Natural Language Processing with Pytorch [Video]](https://www.packtpub.com/application-development/hands-natural-language-processing-pytorch-video?utm_source=github&utm_medium=repository&utm_campaign=9781789133974)

* [Mastering Natural Language Processing with Python [Video]](https://www.packtpub.com/big-data-and-business-intelligence/mastering-natural-language-processing-python-video?utm_source=github&utm_medium=repository&utm_campaign=9781789618358)

* [Natural Language Processing with Python [Video]](https://www.packtpub.com/big-data-and-business-intelligence/natural-language-processing-python-video?utm_source=github&utm_medium=repository&utm_campaign=9781787286085)
--------------------------------------------------------------------------------
/Section 2/TF-IDF.py:
--------------------------------------------------------------------------------
from math import log
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string
from nltk import corpus
import numpy

# Loading the Brown corpus into the data variable
data = corpus.brown

# Using the Porter stemmer
stemmer = PorterStemmer()

# Building the set of stop words, plus punctuation
# We will filter the tokens against it
stop_words = set(stopwords.words('english'))
stop_words = stop_words.union(string.punctuation)

# Limiting the number of files we will use
# uncomment the following line to use all of the
# corpus files
# fileids = data.fileids()
fileids = data.fileids()[:30]

dictionary = dict()

# size of the vocabulary (number of distinct stems)
words_count = 0

# total number of documents in the corpus sample
documents_count = len(fileids)

# maps each original word to its stem (a stemming cache)
filtered = dict()

# to save the total counts of every stem, per file
frequencies = dict()
for fileid in fileids:
    frequencies[fileid] = dict()

# filtering the corpus
for fileid in fileids:
    for word in data.words(fileid):
        # Skipping stop words and punctuation (case-insensitive check)
        if word.lower() in stop_words:
            continue
        # Reusing the cached stem when the word has been seen before
        before_word = word
        if before_word in filtered:
            word = filtered[before_word]
        else:
            # stemming the word
            word = stemmer.stem(word)
            filtered[before_word] = word

        if word in frequencies[fileid]:
            frequencies[fileid][word] += 1
        else:
            frequencies[fileid][word] = 1

        # assigning every new stem an index in the dictionary
        if word not in dictionary:
            dictionary[word] = words_count
            words_count += 1

# Calculating TF #
# nonzeros keeps, per document, the indexes of non-zero counts
tf_matrix = []
nonzeros = []
for fileid in fileids:
    tf_vector = [0] * words_count
    nonzeros_vec = []
    for word in frequencies[fileid].keys():
        f = frequencies[fileid][word]
        tf_vector[dictionary[word]] = f
        if f > 0:
            nonzeros_vec.append(dictionary[word])
    nonzeros.append(nonzeros_vec)
    tf_matrix.append(tf_vector)

# Calculating IDF: document frequency of every stem
idf_matrix = [0] * words_count
for fileid in fileids:
    for word in frequencies[fileid].keys():
        idf_matrix[dictionary[word]] += 1

# Calculating the TF-IDF matrix #
tfidf = []

for i in range(documents_count):
    vector = [0] * words_count
    for j in nonzeros[i]:
        tf_value = tf_matrix[i][j]
        idf_value = idf_matrix[j]
        # log-scaled term frequency and smoothed inverse document frequency
        tf_value = 1 + log(tf_value, 2)
        idf_value = log(1 + documents_count / float(idf_value), 2)
        vector[j] = tf_value * idf_value
    tfidf.append(vector)

print("------ Top 10 : Keywords per document -------")
for i in range(len(tfidf)):
    print("--- Document : " + str(fileids[i]))
    vector = tfidf[i]
    # indexes of the scores, from highest to lowest
    ranked = numpy.argsort(vector)[::-1]
    for ind in ranked[:10]:
        # reverse lookups: stem index -> stem -> an original (unstemmed) word
        stem = list(dictionary.keys())[list(dictionary.values()).index(ind)]
        before_stemming = list(filtered.keys())[list(filtered.values()).index(stem)]
        print(before_stemming + " -- " + str(vector[ind]))
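Since tfidf holds one dense vector per document, a hypothetical follow-up (not in the course script; it reuses the script's tfidf, fileids, and numpy names) could compare two documents by cosine similarity:

    # Hypothetical: cosine similarity between the first two documents
    a, b = numpy.array(tfidf[0]), numpy.array(tfidf[1])
    cos = numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
    print(fileids[0], fileids[1], cos)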
--------------------------------------------------------------------------------
/Section 3/imdb.py:
--------------------------------------------------------------------------------
from keras.datasets import imdb
import keras
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Flatten, Dense
from keras.preprocessing import sequence

# keeping only the 5,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=5000)

# the Keras word index starts at 1; after shifting by 3,
# indices 0-2 are reserved for the special tokens below
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k: (v + 3) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

# padding/truncating every review to 300 tokens
x_train = sequence.pad_sequences(x_train, maxlen=300)
x_test = sequence.pad_sequences(x_test, maxlen=300)

network = Sequential()
network.add(Embedding(5000, 32, input_length=300))
network.add(Flatten())
network.add(Dense(1, activation='sigmoid'))
network.compile(loss="binary_crossentropy", optimizer='Adam', metrics=['accuracy'])

network.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)

# result holds [test loss, test accuracy]
result = network.evaluate(x_test, y_test, verbose=0)

negative = "this movie was bad"
positive = "i had fun"
negative2 = "this movie was terrible"
positive2 = "i really liked the movie"

for review in [positive, positive2, negative, negative2]:
    temp = []
    for word in review.split(" "):
        # unknown or out-of-vocabulary words fall back to <UNK>
        word_id = word_to_id.get(word, word_to_id["<UNK>"])
        if word_id >= 5000:
            word_id = word_to_id["<UNK>"]
        temp.append(word_id)
    temp_padded = sequence.pad_sequences([temp], maxlen=300)
    print(review + " -- Sent -- " + str(network.predict(temp_padded)[0][0]))
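To inspect what the network actually sees, a small hypothetical helper (not in the course script; it assumes it runs at the end of imdb.py and reuses its word_to_id and x_train) inverts the word index and decodes a padded review:

    # Hypothetical: decode the first (padded) training review back to text
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    print(" ".join(id_to_word.get(idx, "?") for idx in x_train[0]))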
--------------------------------------------------------------------------------
/Section 4/NaiveBayes.py:
--------------------------------------------------------------------------------
# Getting the dataset
from sklearn.datasets import fetch_20newsgroups

# Getting the training and test data subsets
newsgroup_train = fetch_20newsgroups(subset='train', shuffle=True)
newsgroup_test = fetch_20newsgroups(subset='test', shuffle=True)

# Checking out the category names
i = 0
for cat in newsgroup_train.target_names:
    i = i + 1
    print(str(i) + " - " + str(cat))

# Printing a single post
print("\n".join(newsgroup_train.data[5].split("\n")[:10]))

# Extracting features: raw token counts
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
newsgroup_train_counts = count_vector.fit_transform(newsgroup_train.data)

# Calculating TF-IDF from the counts
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
newsgroup_train_tfidf = tfidf_transformer.fit_transform(newsgroup_train_counts)

# Training a multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

nb_cla = MultinomialNB().fit(newsgroup_train_tfidf, newsgroup_train.target)

# Simplifying the process with a Pipeline
from sklearn.pipeline import Pipeline

NB_Classifier = Pipeline([('vectorizer', CountVectorizer()),
                          ('tfidf_matrix', TfidfTransformer()),
                          ('nb_classifier', MultinomialNB())])
NB_Classifier = NB_Classifier.fit(newsgroup_train.data, newsgroup_train.target)

# Testing the classifier: mean accuracy on the test subset
import numpy as np

predicted = NB_Classifier.predict(newsgroup_test.data)
print(np.mean(predicted == newsgroup_test.target))
--------------------------------------------------------------------------------
/Section 4/SVM.py:
--------------------------------------------------------------------------------
# Getting the dataset
from sklearn.datasets import fetch_20newsgroups

# Getting the training and test data subsets
newsgroup_train = fetch_20newsgroups(subset='train', shuffle=True)
newsgroup_test = fetch_20newsgroups(subset='test', shuffle=True)

# Checking out the category names
i = 0
for cat in newsgroup_train.target_names:
    i = i + 1
    print(str(i) + " - " + str(cat))

# Printing a single post
print("\n".join(newsgroup_train.data[5].split("\n")[:10]))

# Extracting features: raw token counts
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
newsgroup_train_counts = count_vector.fit_transform(newsgroup_train.data)

# Calculating TF-IDF from the counts
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
newsgroup_train_tfidf = tfidf_transformer.fit_transform(newsgroup_train_counts)

# Training a Support Vector Machine: SGDClassifier with hinge loss
# is a linear SVM fitted by stochastic gradient descent
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import numpy as np

SVM_Classifier = Pipeline([('vectorizer', CountVectorizer()),
                           ('tfidf_matrix', TfidfTransformer()),
                           ('svm_classifier', SGDClassifier(loss='hinge', penalty='l2',
                                                            max_iter=100, alpha=1e-3,
                                                            random_state=42))])

SVM_Classifier = SVM_Classifier.fit(newsgroup_train.data, newsgroup_train.target)

# Mean accuracy on the test subset
Predicted_SVM = SVM_Classifier.predict(newsgroup_test.data)
print(np.mean(Predicted_SVM == newsgroup_test.target))
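Mean accuracy hides per-category behaviour. A hypothetical follow-up (not in the course script; it reuses newsgroup_test and Predicted_SVM from SVM.py) prints a per-category precision/recall breakdown with scikit-learn's built-in report:

    from sklearn import metrics

    # Hypothetical: per-category precision/recall for the SVM predictions
    print(metrics.classification_report(newsgroup_test.target, Predicted_SVM,
                                        target_names=newsgroup_test.target_names))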
--------------------------------------------------------------------------------
/Section 5/ChatBot.py:
--------------------------------------------------------------------------------
from chatterbot import ChatBot
import logging

from chatterbot.trainers import ChatterBotCorpusTrainer

# silencing ChatterBot's info/warning log output
logging.basicConfig(level=logging.CRITICAL)

chatB = ChatBot("Mike",
                preprocessors=['chatterbot.preprocessors.clean_whitespace'],
                logic_adapters=['chatterbot.logic.BestMatch',
                                'chatterbot.logic.MathematicalEvaluation',
                                'chatterbot.logic.TimeLogicAdapter'])

trainer = ChatterBotCorpusTrainer(chatB)

trainer.train(
    "chatterbot.corpus.french"
)


def converse(quit="quit"):
    user_input = ""
    while user_input != quit:
        user_input = quit
        try:
            user_input = input(">")
        except EOFError:
            print(user_input)
        if user_input:
            # stripping trailing '!' and '.' without emptying the input
            while user_input and user_input[-1] in "!.":
                user_input = user_input[:-1]
            if user_input:
                print(chatB.get_response(user_input))


converse()
--------------------------------------------------------------------------------
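A couple of hypothetical one-off queries (not in the original script; they assume they run after training, bypassing the interactive loop) that exercise the extra logic adapters:

    # Hypothetical: the non-BestMatch adapters in action
    print(chatB.get_response("What is 4 + 9?"))   # MathematicalEvaluation
    print(chatB.get_response("What time is it?")) # TimeLogicAdapter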