├── Output
│   ├── PdfInput.png
│   ├── InputTextWays.png
│   ├── OriginalvsSummaryWordCount.png
│   ├── Wiki-Artificial-Intelligence-Article_Output.png
│   └── Wiki-Artificial-Intelligence-Summary.txt
├── README.md
└── Text-Summarizer.py
--------------------------------------------------------------------------------
/Output/PdfInput.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LunaticPrakash/Text-Summarization/HEAD/Output/PdfInput.png
--------------------------------------------------------------------------------
/Output/InputTextWays.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LunaticPrakash/Text-Summarization/HEAD/Output/InputTextWays.png
--------------------------------------------------------------------------------
/Output/OriginalvsSummaryWordCount.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LunaticPrakash/Text-Summarization/HEAD/Output/OriginalvsSummaryWordCount.png
--------------------------------------------------------------------------------
/Output/Wiki-Artificial-Intelligence-Article_Output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LunaticPrakash/Text-Summarization/HEAD/Output/Wiki-Artificial-Intelligence-Article_Output.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Text-Summarization
Text summarization using the spaCy and NLTK modules with the TF-IDF algorithm. This program gives you a summary of the input article. You can enter the text by typing (or copy-pasting) it, or load it from a .txt file, a PDF file, or a Wikipedia page URL.

## Purpose :-

To save reading time by summarizing a large article or text into fewer lines.


## Description :-

It uses the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to summarize the article (a short worked sketch of the idea follows the Features section below).

## Features :-

You can load the text of your long article in 4 ways :-

![InputTextWays](https://user-images.githubusercontent.com/56812557/212475484-5bd0addf-1b14-4820-b4e2-b21565de8b71.png)

- By typing the text yourself (or copy-pasting it).
- Reading the text from a **.txt file**.
- Reading the text from a **.pdf file** (you can either summarize the entire PDF or select any page interval).

![PdfInput](https://user-images.githubusercontent.com/56812557/212475479-d012f433-8ebd-4283-9c18-c1ebf552accf.png)

- Reading the text from a **Wikipedia page** (all you have to do is provide the URL of that page; the program will automatically scrape the text and summarize it for you).

Don't worry about the code length xD. It might look lengthy, but there are a lot of comments explaining the code (almost 70 of them) and extra spacing for readability.
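As a quick illustration of the TF-IDF scoring mentioned in the Description, here is a minimal, self-contained sketch (the sentences and variable names are made-up toy examples; the real script additionally lemmatizes words, removes stopwords, and works on spaCy sentence spans):

```
import math

# Toy corpus: each "document" is one sentence (made-up example data)
sentences = [
    ["ai", "is", "changing", "research"],
    ["ai", "research", "needs", "data"],
    ["data", "is", "everywhere"],
]
n = len(sentences)

# Document frequency: in how many sentences does each word occur?
doc_freq = {}
for sent in sentences:
    for word in set(sent):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Score each sentence by the average TF-IDF of its words
for sent in sentences:
    score = 0.0
    for word in sent:
        tf = sent.count(word) / len(sent)        # term frequency
        idf = math.log10(n / doc_freq[word])     # inverse document frequency
        score += tf * idf
    print(" ".join(sent), "->", round(score / len(sent), 4))
```

Sentences whose average score exceeds a threshold (the script uses 1.3 times the mean sentence score) are kept in the summary.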
## Output :-

- This is some of the summary text returned by the program. The main article was loaded from the Wikipedia page URL -> https://en.wikipedia.org/wiki/Artificial_intelligence

![Summary](https://user-images.githubusercontent.com/56812557/212475483-5fe99afd-5016-428e-877d-e1e0b9406786.png)

- Comparison of the original content vs the summarized content.

![OriginalvsSummaryWordCount](https://user-images.githubusercontent.com/56812557/212475485-d06beadf-1805-49e2-a906-a2745d06b832.png)




## Requirements :-

- Python3
- spaCy module plus an English model (small, medium, or large; any one of them is sufficient)
- NLTK module
- PyPDF2
- Beautiful Soup (bs4)
- lxml (the parser this script tells Beautiful Soup to use)
- urllib (already available with Python itself, no need for external installation)


## How to install Requirements :-

1. Python3 can be installed from the official site https://www.python.org/ . Or you can use an Anaconda environment.
2. spaCy can be installed by
For Anaconda Environment >
```
conda install -c conda-forge spacy

python3 -m spacy download en_core_web_sm
```
For other environments >
```
pip3 install spacy

python3 -m spacy download en_core_web_sm
```
3. NLTK can be installed by
For Anaconda Environment >
```
conda install -c anaconda nltk
```
For other environments >
```
pip3 install nltk
```

4. PyPDF2 can be installed by
For Anaconda Environment >
```
conda install -c conda-forge pypdf2
```
For other environments >
```
pip3 install PyPDF2
```

5. Beautiful Soup (bs4) can be installed by
For Anaconda Environment >
```
conda install -c anaconda beautifulsoup4
```
For other environments >
```
pip3 install beautifulsoup4
```

6. lxml can be installed by
For Anaconda Environment >
```
conda install -c anaconda lxml
```
For other environments >
```
pip3 install lxml
```
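If you prefer a single command, all the pip-installable dependencies above can be pulled in at once (adjust for conda as needed):

```
pip3 install spacy nltk PyPDF2 beautifulsoup4 lxml
python3 -m spacy download en_core_web_sm
```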
## Getting Started :-

- Download or clone the repository.

- Open cmd or a terminal in the same directory where the **Text-Summarizer.py** file is stored, then run it with the following command :-
```
python3 Text-Summarizer.py
```
- Now just follow along with the program.


## Bugs and Improvements :-

- No known bugs. The summary can't be as perfect as one written by a human.
- An audio feature will be added soon, so that you can also listen to the summary if you want.


## Dev :- Prakash Gupta
--------------------------------------------------------------------------------
/Output/Wiki-Artificial-Intelligence-Summary.txt:
--------------------------------------------------------------------------------
A quip in Tesler's Theorem says "AI is whatever hasn't been done yet. These issues have been explored by myth, fiction and philosophy since antiquity. Marvin Minsky agreed, writing, "within a generation ... They failed to recognize the difficulty of some of the remaining tasks. By 1985, the market for AI had reached over a billion dollars. In 2011, a Jeopardy! champions, Brad Rutter and Ken Jennings, by a significant margin. No. 1 ranking for two years. Goals can be explicitly defined or induced. AI often revolves around the use of algorithms. An algorithm is a set of unambiguous instructions that a mechanical computer can execute.[b] A complex algorithm is often built on top of other, simpler, algorithms. These learners could therefore derive all possible knowledge, by considering every possible hypothesis and matching them against the data. These inferences can be obvious, such as "since the sun rose every morning for the last 10,000 days, it will probably rise tomorrow morning as well". Besides classic overfitting, learners can also disappoint by "learning the wrong lesson". Faintly superimposing such a pattern on a legitimate image results in an "adversarial" image that the system misclassifies.[c]
This gives rise to two classes of models: structuralist and functionalist. The functional model refers to the correlating data to its computed counterpart.
The general problem of simulating (or creating) intelligence has been broken down into sub-problems. The traits described below have received the most attention.
they became exponentially slower as the problems grew larger. They solve most of their problems using fast, intuitive judgments.
However, if the agent is not the only actor, then it requires that the agent can reason under uncertainty. This calls for an agent that can not only assess its environment and make predictions but also evaluate its predictions and adapt based on its assessment.
Emergent behavior such as this is used by evolutionary algorithms and swarm intelligence.
Applications include speech recognition, facial recognition, and object recognition. Computer vision is the ability to analyze visual input. AI is heavily used in robotics. Motion planning is the process of breaking down a movement task into "primitives" such as individual joint movements. Moravec's paradox can be extended to many forms of social intelligence. Many advances have general, cross-domain significance. Researchers disagree about many issues. This includes embodied, situated, behavior-based, and nouvelle AI. Nowadays results of experiments are often rigorously measurable, and are sometimes (with difficulty) reproducible. A few of the most general of these methods are discussed below.
The result is a search that is too slow or never completes. Heuristics limit the search for solutions into a smaller sample size.
Other optimization algorithms are simulated annealing, beam search and random optimization.
Evolutionary computation uses a form of optimization search. Classic evolutionary algorithms include genetic algorithms, gene expression programming, and genetic programming. Logic is used for knowledge representation and problem solving, but it can be applied to other problems as well. Propositional logic involves truth functions such as "or" and "not". that are too linguistically imprecise to be completely true or false. Bayesian networks are a very general tool that can be used for various problems: For inference to be tractable, most observations must be conditionally independent of one another. These examples are known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class is a decision that has to be made. All the observations combined with their class labels are known as a data set. When a new observation is received, that observation is classified based on previous experience.
kernel methods such as the support vector machine (SVM),[h]
Gaussian mixture model, and the extremely popular naive Bayes classifier.[i] Among the most popular feedforward networks are perceptrons, multi-layer perceptrons and radial basis networks. One advantage of neuroevolution is that it may be less prone to get caught in "dead ends".
In 1989, Yann LeCun and colleagues applied backpropagation to such an architecture. RNNs can be trained by gradient descent but suffer from the vanishing gradient problem. LSTM is often trained by Connectionist Temporal Classification (CTC). There is no consensus on how to characterize which tasks AI tends to excel at.
Games provide a well-publicized benchmark for assessing rates of progress. E-sports such as StarCraft continue to provide additional public benchmarks. A computer asks a user to complete a simple test then generates a grade for that test. AI is relevant to any intellectual task. AI can also produce Deepfakes, a content-altering technology. The breadth of applications is rapidly increasing.
Artificial intelligence is assisting doctors. There is a great amount of research and drugs developed relating to cancer. In detail, there are more than 800 medicines and vaccines to treat cancer. Watson has struggled to achieve success and adoption in healthcare.
A few companies involved with AI include Tesla, Google, and Apple.
Many components contribute to the functioning of self-driving cars. Self-driving truck platoons are a fleet of self-driving trucks following the lead of one non-self-driving truck, so the truck platoons aren't entirely autonomous yet. In general, the vehicle would be pre-programmed with a map of the area being driven. Another factor that is influencing the ability of a driverless automobile is the safety of the passenger. These situations could include a head-on collision with pedestrians. But there is a possibility the car would need to make a decision that would put someone in danger. In other words, the car would need to decide to save the pedestrians or the passengers. The programming of the car in these situations is crucial to a successful driverless automobile.
AI can react to changes overnight or when business is not taking place. AI is increasingly being used by corporations. Furthermore, AI in the markets limits the consequences of behavior in the markets again making markets more efficient[citation needed]. Artificial intelligence in government consists of applications and regulation. This is already the case in some parts of China. In addition, well-understood AI techniques are routinely used for pathfinding. For financial statements audit, AI makes continuous audit possible. There are three philosophical questions related to AI[citation needed]:
And, of course, other risks come from things like job losses. Humans, who are limited by slow biological evolution, couldn't compete and would be superseded. If this AI's goals do not fully reflect humanity's— Facebook I think there is potentially a dangerous outcome there."
Algorithms already have numerous applications in legal systems. The relationship between automation and employment is complicated. In all cases, only human beings have engaged in ethical reasoning. Machine ethics is sometimes referred to as machine morality, computational ethics or computational morality. He argues that "any sufficiently advanced benevolence may be indistinguishable from malevolence. Some question whether this kind of check could actually remain in place.
The hard problem is explaining how this feels or why it should feel like anything at all. Human information processing is easy to explain, however human subjective experience is difficult to explain.
The hard problem is that people also know something else—they also know what red looks like. (Consider that a person born blind can know that something is red without knowing what red looks like.)[l] If a machine can be created that has intelligence, could it also feel? If it can feel, does it have the same rights as a human? Are there limits to how intelligent machines—or human-machine hybrids—can be?
Science fiction writer Vernor Vinge named this scenario "singularity". The long-term economic effects of AI are uncertain. This includes such works as Arthur C. Clarke's and Stanley Kubrick's 2001: See also: Logic machines in fiction and List of fictional computers
--------------------------------------------------------------------------------
/Text-Summarizer.py:
--------------------------------------------------------------------------------
#Step 1. Importing Libraries

import sys
import math
import bs4 as bs
import urllib.request
import re
import PyPDF2
import nltk
from nltk.stem import WordNetLemmatizer
import spacy


#Execute this line if you are running this code for the first time
nltk.download('wordnet')

#Initializing a few variables
nlp = spacy.load('en_core_web_sm')
lemmatizer = WordNetLemmatizer()


#Step 2. Defining functions for reading the input text

#Function to read a .txt file and return its text
def file_text(filepath):
    with open(filepath) as f:
        #Replace newlines with spaces so words at line breaks don't get glued together
        text = f.read().replace("\n", ' ')
    return text


#Function to read a PDF file and return its text
def pdfReader(pdf_path):

    with open(pdf_path, 'rb') as pdfFileObject:
        reader = PyPDF2.PdfReader(pdfFileObject)  #PyPDF2 >= 3.0 API (PdfFileReader was removed)
        count = len(reader.pages)
        print("\nTotal pages in pdf = ", count)

        start_page = 0
        end_page = count - 1
        c = input("Do you want to read the entire pdf? [Y]/N : ")
        if c == 'N' or c == 'n':
            start_page = int(input("Enter start page number (indexing starts from 0) : "))
            end_page = int(input(f"Enter end page number (less than {count}) : "))

        if start_page < 0 or start_page >= count:
            print("\nInvalid start page given")
            sys.exit()

        if end_page < start_page or end_page >= count:
            print("\nInvalid end page given")
            sys.exit()

        #Accumulate text from every selected page (returning inside the loop
        #would keep only the last page)
        text = ""
        for i in range(start_page, end_page + 1):
            text += reader.pages[i].extract_text()

    return text


#Function to read a Wikipedia page URL and return its text
def wiki_text(url):
    scrap_data = urllib.request.urlopen(url)
    article = scrap_data.read()
    parsed_article = bs.BeautifulSoup(article, 'lxml')

    paragraphs = parsed_article.find_all('p')
    article_text = ""

    for p in paragraphs:
        article_text += p.text

    #Removing citation markers like [1], [23], etc.
    article_text = re.sub(r'\[[0-9]*\]', '', article_text)
    return article_text


#Step 3. Getting the text

input_text_type = int(input("Select one way of inputting your text \
: \n1. Type your text (or copy-paste)\n2. Load from .txt file\n3. Load from .pdf file\n4. From Wikipedia page URL\n\n"))

if input_text_type == 1:
    text = input("Enter your text : \n\n")

elif input_text_type == 2:
    txt_path = input("Enter file path : ")
    text = file_text(txt_path)

elif input_text_type == 3:
    file_path = input("Enter file path : ")
    text = pdfReader(file_path)

elif input_text_type == 4:
    wiki_url = input("Enter Wikipedia URL to load the article : ")
    text = wiki_text(wiki_url)

else:
    print("Sorry! Wrong input, try again.")
    sys.exit()
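

#A quick illustration of the structures built in Step 4, using a made-up
#sentence (hypothetical values, just for orientation):
#  For "The quick brown fox jumps", after lowercasing, lemmatization and
#  stopword removal, freq_matrix maps the sentence key to
#  {'quick': 1, 'brown': 1, 'fox': 1, 'jump': 1}; tf_matrix then maps each
#  word to its share of the sentence's terms, and idf_matrix scores each
#  word by how rare it is across all sentences of the article.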
#Step 4. Defining functions to create the Tf-Idf matrix


#Function to calculate the frequency of each word in each sentence
#INPUT -> list of all sentences from the text as spacy Span objects
#OUTPUT -> freq_matrix (a dictionary with each sentence (its first 15 tokens) as key,
#          and a dictionary of that sentence's words with their frequency as value)

def frequency_matrix(sentences):
    freq_matrix = {}
    stopWords = nlp.Defaults.stop_words

    for sent in sentences:
        freq_table = {}  #dictionary with 'words' as keys and their 'frequency' as values

        #Getting all words from the sentence in lower case
        words = [word.text.lower() for word in sent if word.text.isalnum()]

        for word in words:
            word = lemmatizer.lemmatize(word)  #Lemmatize the word
            if word not in stopWords:          #Reject stopwords
                if word in freq_table:
                    freq_table[word] += 1
                else:
                    freq_table[word] = 1

        #The first 15 tokens of the sentence serve as the dictionary key
        freq_matrix[sent[:15]] = freq_table

    return freq_matrix


#Function to calculate the Term Frequency (TF) of each word
#INPUT -> freq_matrix
#OUTPUT -> tf_matrix (a dictionary with each sentence key as key,
#          and a dictionary of that sentence's words with their term frequency as value)

#TF(t) = (number of times term t appears in the sentence) / (total number of terms in the sentence)
def tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, freq_table in freq_matrix.items():
        tf_table = {}  #dictionary with the 'word' itself as key and its TF as value

        #Total number of (non-stopword) terms in the sentence, matching the formula above
        total_words_in_sentence = sum(freq_table.values())
        for word, count in freq_table.items():
            tf_table[word] = count / total_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix


#Function to find how many sentences contain a given word
#INPUT -> freq_matrix
#OUTPUT -> sent_per_words (a dictionary with each word as key and the number of
#          sentences containing that word as value)

def sentences_per_words(freq_matrix):
    sent_per_words = {}

    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in sent_per_words:
                sent_per_words[word] += 1
            else:
                sent_per_words[word] = 1

    return sent_per_words


#Function to calculate the Inverse Document Frequency (IDF) of each word
#INPUT -> freq_matrix, sent_per_words, total_sentences
#OUTPUT -> idf_matrix (a dictionary with each sentence key as key,
#          and a dictionary of that sentence's words with their IDF as value)

#IDF(t) = log10(total number of sentences / number of sentences containing term t)
def idf_matrix(freq_matrix, sent_per_words, total_sentences):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            idf_table[word] = math.log10(total_sentences / float(sent_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix
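

#Worked example of the two formulas above (hypothetical numbers): a word that
#appears 2 times in a sentence with 10 counted terms has TF = 2/10 = 0.2; if it
#occurs in 5 of the article's 100 sentences, IDF = log10(100/5) ~= 1.301, so its
#Tf-Idf score (computed next) is 0.2 * 1.301 ~= 0.26.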
#Function to calculate the Tf-Idf score of each word
#INPUT -> tf_matrix, idf_matrix
#OUTPUT -> tf_idf_matrix (a dictionary with each sentence key as key,
#          and a dictionary of that sentence's words with their Tf-Idf score as value)
def tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):

        tf_idf_table = {}

        #word1 and word2 are the same word; both tables share the same keys
        for (word1, tf_value), (word2, idf_value) in zip(f_table1.items(),
                                                         f_table2.items()):
            tf_idf_table[word1] = float(tf_value * idf_value)

        tf_idf_matrix[sent1] = tf_idf_table

    return tf_idf_matrix


#Function to rate every sentence with a score calculated from its words' Tf-Idf values
#INPUT -> tf_idf_matrix
#OUTPUT -> sentenceScore (a dictionary with each sentence key as key and its score as value)
def score_sentences(tf_idf_matrix):

    sentenceScore = {}

    for sent, f_table in tf_idf_matrix.items():
        total_tfidf_score_per_sentence = 0

        total_words_in_sentence = len(f_table)
        for word, tf_idf_score in f_table.items():
            total_tfidf_score_per_sentence += tf_idf_score

        if total_words_in_sentence != 0:
            sentenceScore[sent] = total_tfidf_score_per_sentence / total_words_in_sentence

    return sentenceScore


#Function to calculate the average sentence score
#INPUT -> sentence_score
#OUTPUT -> average_sent_score (the average of all values in sentence_score)
def average_score(sentence_score):

    total_score = 0
    for sent in sentence_score:
        total_score += sentence_score[sent]

    average_sent_score = (total_score / len(sentence_score))

    return average_sent_score


#Function to return the summary of the article
#INPUT -> sentences (list of all sentences in the article), sentence_score, threshold
#         (a multiple of the average sentence score)
#OUTPUT -> summary (string text)
def create_summary(sentences, sentence_score, threshold):
    summary = ''

    for sentence in sentences:
        if sentence[:15] in sentence_score and sentence_score[sentence[:15]] >= threshold:
            summary += " " + sentence.text

    return summary
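

#Illustration of the thresholding above (hypothetical scores): if three sentences
#score 0.30, 0.10 and 0.20, the average is 0.20; with the 1.3 factor used in
#Step 5 the cutoff becomes 0.26, so only the first sentence enters the summary.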
#Step 5. Using all the functions to generate the summary


#Counting the number of words in the original article
original_words = text.split()
original_words = [w for w in original_words if w.isalnum()]
num_words_in_original_text = len(original_words)


#Converting the received text into a spacy Doc object
text = nlp(text)

#Extracting all sentences from the text into a list
sentences = list(text.sents)
total_sentences = len(sentences)

#Generating the frequency matrix
freq_matrix = frequency_matrix(sentences)

#Generating the term frequency matrix
#(results stored under new names so the helper functions above aren't shadowed)
tf_mat = tf_matrix(freq_matrix)

#Getting the number of sentences containing each word
num_sent_per_words = sentences_per_words(freq_matrix)

#Generating the inverse document frequency matrix
idf_mat = idf_matrix(freq_matrix, num_sent_per_words, total_sentences)

#Generating the Tf-Idf matrix
tf_idf_mat = tf_idf_matrix(tf_mat, idf_mat)


#Generating a score for each sentence
sentence_scores = score_sentences(tf_idf_mat)

#Setting the threshold to the average score (you are free to play with other values)
threshold = average_score(sentence_scores)

#Getting the summary
summary = create_summary(sentences, sentence_scores, 1.3 * threshold)
print("\n\n")
print("*" * 20, "Summary", "*" * 20)
print("\n")
print(summary)
print("\n\n")
print("Total words in original article   = ", num_words_in_original_text)
print("Total words in summarized article = ", len(summary.split()))
--------------------------------------------------------------------------------