├── Output
│   ├── PdfInput.png
│   ├── InputTextWays.png
│   ├── OriginalvsSummaryWordCount.png
│   ├── Wiki-Artificial-Intelligence-Article_Output.png
│   └── Wiki-Artificial-Intelligence-Summary.txt
├── README.md
└── Text-Summarizer.py
--------------------------------------------------------------------------------
/Output/PdfInput.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LunaticPrakash/Text-Summarization/HEAD/Output/PdfInput.png
--------------------------------------------------------------------------------
/Output/InputTextWays.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LunaticPrakash/Text-Summarization/HEAD/Output/InputTextWays.png
--------------------------------------------------------------------------------
/Output/OriginalvsSummaryWordCount.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LunaticPrakash/Text-Summarization/HEAD/Output/OriginalvsSummaryWordCount.png
--------------------------------------------------------------------------------
/Output/Wiki-Artificial-Intelligence-Article_Output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LunaticPrakash/Text-Summarization/HEAD/Output/Wiki-Artificial-Intelligence-Article_Output.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Text-Summarization
Text summarization using the spaCy and NLTK modules with the TF-IDF algorithm. This program gives you a summary of the input article. You can enter the text by typing (or copy-pasting) it, or load it from a .txt file, a PDF file, or a Wikipedia page URL.

## Purpose :-

To save reading time by summarizing a large article or text into fewer lines.


## Description :-

It uses the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to summarize the article (a short worked sketch of the idea follows the Features section below).

## Features :-

You can load the text of your long article in 4 ways :-

![InputTextWays](https://user-images.githubusercontent.com/56812557/212475484-5bd0addf-1b14-4820-b4e2-b21565de8b71.png)

- By typing the text yourself (or copy-pasting it).
- Reading the text from a **.txt file**.
- Reading the text from a **.pdf file** (you can either summarize the entire PDF or select any page interval).

![PdfInput](https://user-images.githubusercontent.com/56812557/212475479-d012f433-8ebd-4283-9c18-c1ebf552accf.png)

- Reading the text from a **Wikipedia page** (all you have to do is provide the URL of that page; the program will automatically scrape the text and summarize it for you).

Don't worry about the code length xD. It might look lengthy, but there are a lot of comments explaining the code (almost 70 of them) and extra spacing for readability.
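As a quick illustration of the TF-IDF scoring mentioned in the Description, here is a minimal, self-contained sketch (the sentences and variable names are made-up toy examples; the real script additionally lemmatizes words, removes stopwords, and works on spaCy sentence spans):

```
import math

# Toy corpus: each "document" is one sentence (made-up example data)
sentences = [
    ["ai", "is", "changing", "research"],
    ["ai", "research", "needs", "data"],
    ["data", "is", "everywhere"],
]
n = len(sentences)

# Document frequency: in how many sentences does each word occur?
doc_freq = {}
for sent in sentences:
    for word in set(sent):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Score each sentence by the average TF-IDF of its words
for sent in sentences:
    score = 0.0
    for word in sent:
        tf = sent.count(word) / len(sent)        # term frequency
        idf = math.log10(n / doc_freq[word])     # inverse document frequency
        score += tf * idf
    print(" ".join(sent), "->", round(score / len(sent), 4))
```

Sentences whose average score exceeds a threshold (the script uses 1.3 times the mean sentence score) are kept in the summary.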
## Output :-

- This is some of the summary text returned by the program. The main article was loaded from the Wikipedia page URL -> https://en.wikipedia.org/wiki/Artificial_intelligence

![Summary](https://user-images.githubusercontent.com/56812557/212475483-5fe99afd-5016-428e-877d-e1e0b9406786.png)

- Comparison of the original content vs the summarized content.

![OriginalvsSummaryWordCount](https://user-images.githubusercontent.com/56812557/212475485-d06beadf-1805-49e2-a906-a2745d06b832.png)




## Requirements :-

- Python3
- spaCy module plus an English model (small, medium, or large; any one of them is sufficient)
- NLTK module
- PyPDF2
- Beautiful Soup (bs4)
- lxml (the parser this script tells Beautiful Soup to use)
- urllib (already available with Python itself, no need for external installation)


## How to install Requirements :-

1. Python3 can be installed from the official site https://www.python.org/ . Or you can use an Anaconda environment.
2. spaCy can be installed by
For Anaconda Environment >
```
conda install -c conda-forge spacy

python3 -m spacy download en_core_web_sm
```
For other environments >
```
pip3 install spacy

python3 -m spacy download en_core_web_sm
```
3. NLTK can be installed by
For Anaconda Environment >
```
conda install -c anaconda nltk
```
For other environments >
```
pip3 install nltk
```

4. PyPDF2 can be installed by
For Anaconda Environment >
```
conda install -c conda-forge pypdf2
```
For other environments >
```
pip3 install PyPDF2
```

5. Beautiful Soup (bs4) can be installed by
For Anaconda Environment >
```
conda install -c anaconda beautifulsoup4
```
For other environments >
```
pip3 install beautifulsoup4
```

6. lxml can be installed by
For Anaconda Environment >
```
conda install -c anaconda lxml
```
For other environments >
```
pip3 install lxml
```
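If you prefer a single command, all the pip-installable dependencies above can be pulled in at once (adjust for conda as needed):

```
pip3 install spacy nltk PyPDF2 beautifulsoup4 lxml
python3 -m spacy download en_core_web_sm
```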
## Getting Started :-

- Download or clone the repository.

- Open cmd or a terminal in the same directory where the **Text-Summarizer.py** file is stored, then run it with the following command :-
```
python3 Text-Summarizer.py
```
- Now just follow along with the program.


## Bugs and Improvements :-

- No known bugs. The summary can't be as perfect as one written by a human.
- An audio feature will be added soon, so that you can also listen to the summary if you want.


## Dev :- Prakash Gupta
--------------------------------------------------------------------------------
/Output/Wiki-Artificial-Intelligence-Summary.txt:
--------------------------------------------------------------------------------
A quip in Tesler's Theorem says "AI is whatever hasn't been done yet. These issues have been explored by myth, fiction and philosophy since antiquity. Marvin Minsky agreed, writing, "within a generation ... They failed to recognize the difficulty of some of the remaining tasks. By 1985, the market for AI had reached over a billion dollars. In 2011, a Jeopardy! champions, Brad Rutter and Ken Jennings, by a significant margin. No. 1 ranking for two years. Goals can be explicitly defined or induced. AI often revolves around the use of algorithms. An algorithm is a set of unambiguous instructions that a mechanical computer can execute.[b] A complex algorithm is often built on top of other, simpler, algorithms. These learners could therefore derive all possible knowledge, by considering every possible hypothesis and matching them against the data. These inferences can be obvious, such as "since the sun rose every morning for the last 10,000 days, it will probably rise tomorrow morning as well". Besides classic overfitting, learners can also disappoint by "learning the wrong lesson". Faintly superimposing such a pattern on a legitimate image results in an "adversarial" image that the system misclassifies.[c]
This gives rise to two classes of models: structuralist and functionalist. The functional model refers to the correlating data to its computed counterpart.
The general problem of simulating (or creating) intelligence has been broken down into sub-problems. The traits described below have received the most attention.
they became exponentially slower as the problems grew larger. They solve most of their problems using fast, intuitive judgments.
However, if the agent is not the only actor, then it requires that the agent can reason under uncertainty. This calls for an agent that can not only assess its environment and make predictions but also evaluate its predictions and adapt based on its assessment.
Emergent behavior such as this is used by evolutionary algorithms and swarm intelligence.
Applications include speech recognition, facial recognition, and object recognition. Computer vision is the ability to analyze visual input. AI is heavily used in robotics. Motion planning is the process of breaking down a movement task into "primitives" such as individual joint movements. Moravec's paradox can be extended to many forms of social intelligence. Many advances have general, cross-domain significance. Researchers disagree about many issues. This includes embodied, situated, behavior-based, and nouvelle AI. Nowadays results of experiments are often rigorously measurable, and are sometimes (with difficulty) reproducible. A few of the most general of these methods are discussed below.
The result is a search that is too slow or never completes. Heuristics limit the search for solutions into a smaller sample size.
Other optimization algorithms are simulated annealing, beam search and random optimization.
Evolutionary computation uses a form of optimization search. Classic evolutionary algorithms include genetic algorithms, gene expression programming, and genetic programming. Logic is used for knowledge representation and problem solving, but it can be applied to other problems as well. Propositional logic involves truth functions such as "or" and "not". that are too linguistically imprecise to be completely true or false. Bayesian networks are a very general tool that can be used for various problems: For inference to be tractable, most observations must be conditionally independent of one another. These examples are known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class is a decision that has to be made. All the observations combined with their class labels are known as a data set. When a new observation is received, that observation is classified based on previous experience.
kernel methods such as the support vector machine (SVM),[h]
Gaussian mixture model, and the extremely popular naive Bayes classifier.[i] Among the most popular feedforward networks are perceptrons, multi-layer perceptrons and radial basis networks. One advantage of neuroevolution is that it may be less prone to get caught in "dead ends".
In 1989, Yann LeCun and colleagues applied backpropagation to such an architecture. RNNs can be trained by gradient descent but suffer from the vanishing gradient problem. LSTM is often trained by Connectionist Temporal Classification (CTC). There is no consensus on how to characterize which tasks AI tends to excel at.
Games provide a well-publicized benchmark for assessing rates of progress. E-sports such as StarCraft continue to provide additional public benchmarks. A computer asks a user to complete a simple test then generates a grade for that test. AI is relevant to any intellectual task. AI can also produce Deepfakes, a content-altering technology. The breadth of applications is rapidly increasing.
Artificial intelligence is assisting doctors. There is a great amount of research and drugs developed relating to cancer. In detail, there are more than 800 medicines and vaccines to treat cancer. Watson has struggled to achieve success and adoption in healthcare.
A few companies involved with AI include Tesla, Google, and Apple.
Many components contribute to the functioning of self-driving cars. Self-driving truck platoons are a fleet of self-driving trucks following the lead of one non-self-driving truck, so the truck platoons aren't entirely autonomous yet. In general, the vehicle would be pre-programmed with a map of the area being driven. Another factor that is influencing the ability of a driverless automobile is the safety of the passenger. These situations could include a head-on collision with pedestrians. But there is a possibility the car would need to make a decision that would put someone in danger. In other words, the car would need to decide to save the pedestrians or the passengers. The programming of the car in these situations is crucial to a successful driverless automobile.
AI can react to changes overnight or when business is not taking place. AI is increasingly being used by corporations. Furthermore, AI in the markets limits the consequences of behavior in the markets again making markets more efficient[citation needed]. Artificial intelligence in government consists of applications and regulation. This is already the case in some parts of China. In addition, well-understood AI techniques are routinely used for pathfinding. For financial statements audit, AI makes continuous audit possible. There are three philosophical questions related to AI[citation needed]:
And, of course, other risks come from things like job losses. Humans, who are limited by slow biological evolution, couldn't compete and would be superseded. If this AI's goals do not fully reflect humanity's— Facebook I think there is potentially a dangerous outcome there."
Algorithms already have numerous applications in legal systems. The relationship between automation and employment is complicated. In all cases, only human beings have engaged in ethical reasoning. Machine ethics is sometimes referred to as machine morality, computational ethics or computational morality. He argues that "any sufficiently advanced benevolence may be indistinguishable from malevolence. Some question whether this kind of check could actually remain in place.
The hard problem is explaining how this feels or why it should feel like anything at all. Human information processing is easy to explain, however human subjective experience is difficult to explain.
The hard problem is that people also know something else—they also know what red looks like. (Consider that a person born blind can know that something is red without knowing what red looks like.)[l] If a machine can be created that has intelligence, could it also feel? If it can feel, does it have the same rights as a human? Are there limits to how intelligent machines—or human-machine hybrids—can be?
Science fiction writer Vernor Vinge named this scenario "singularity". The long-term economic effects of AI are uncertain. This includes such works as Arthur C. Clarke's and Stanley Kubrick's 2001: See also: Logic machines in fiction and List of fictional computers
--------------------------------------------------------------------------------
/Text-Summarizer.py:
--------------------------------------------------------------------------------
#Step 1. Importing Libraries

import sys
import math
import bs4 as bs
import urllib.request
import re
import PyPDF2
import nltk
from nltk.stem import WordNetLemmatizer
import spacy


#Execute this line if you are running this code for the first time
nltk.download('wordnet')

#Initializing a few variables
nlp = spacy.load('en_core_web_sm')
lemmatizer = WordNetLemmatizer()


#Step 2. Defining functions for reading the input text

#Function to read a .txt file and return its text
def file_text(filepath):
    with open(filepath) as f:
        #Replace newlines with spaces so words at line breaks don't get glued together
        text = f.read().replace("\n", ' ')
    return text


#Function to read a PDF file and return its text
def pdfReader(pdf_path):

    with open(pdf_path, 'rb') as pdfFileObject:
        reader = PyPDF2.PdfReader(pdfFileObject)  #PyPDF2 >= 3.0 API (PdfFileReader was removed)
        count = len(reader.pages)
        print("\nTotal pages in pdf = ", count)

        start_page = 0
        end_page = count - 1
        c = input("Do you want to read the entire pdf? [Y]/N : ")
        if c == 'N' or c == 'n':
            start_page = int(input("Enter start page number (indexing starts from 0) : "))
            end_page = int(input(f"Enter end page number (less than {count}) : "))

        if start_page < 0 or start_page >= count:
            print("\nInvalid start page given")
            sys.exit()

        if end_page < start_page or end_page >= count:
            print("\nInvalid end page given")
            sys.exit()

        #Accumulate text from every selected page (returning inside the loop
        #would keep only the last page)
        text = ""
        for i in range(start_page, end_page + 1):
            text += reader.pages[i].extract_text()

    return text


#Function to read a Wikipedia page URL and return its text
def wiki_text(url):
    scrap_data = urllib.request.urlopen(url)
    article = scrap_data.read()
    parsed_article = bs.BeautifulSoup(article, 'lxml')

    paragraphs = parsed_article.find_all('p')
    article_text = ""

    for p in paragraphs:
        article_text += p.text

    #Removing citation markers like [1], [23], etc.
    article_text = re.sub(r'\[[0-9]*\]', '', article_text)
    return article_text


#Step 3. Getting the text

input_text_type = int(input("Select one way of inputting your text \
: \n1. Type your text (or copy-paste)\n2. Load from .txt file\n3. Load from .pdf file\n4. From Wikipedia page URL\n\n"))

if input_text_type == 1:
    text = input("Enter your text : \n\n")

elif input_text_type == 2:
    txt_path = input("Enter file path : ")
    text = file_text(txt_path)

elif input_text_type == 3:
    file_path = input("Enter file path : ")
    text = pdfReader(file_path)

elif input_text_type == 4:
    wiki_url = input("Enter Wikipedia URL to load the article : ")
    text = wiki_text(wiki_url)

else:
    print("Sorry! Wrong input, try again.")
    sys.exit()
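

#A quick illustration of the structures built in Step 4, using a made-up
#sentence (hypothetical values, just for orientation):
#  For "The quick brown fox jumps", after lowercasing, lemmatization and
#  stopword removal, freq_matrix maps the sentence key to
#  {'quick': 1, 'brown': 1, 'fox': 1, 'jump': 1}; tf_matrix then maps each
#  word to its share of the sentence's terms, and idf_matrix scores each
#  word by how rare it is across all sentences of the article.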
#Step 4. Defining functions to create the Tf-Idf matrix


#Function to calculate the frequency of each word in each sentence
#INPUT -> list of all sentences from the text as spacy Span objects
#OUTPUT -> freq_matrix (a dictionary with each sentence (its first 15 tokens) as key,
#          and a dictionary of that sentence's words with their frequency as value)

def frequency_matrix(sentences):
    freq_matrix = {}
    stopWords = nlp.Defaults.stop_words

    for sent in sentences:
        freq_table = {}  #dictionary with 'words' as keys and their 'frequency' as values

        #Getting all words from the sentence in lower case
        words = [word.text.lower() for word in sent if word.text.isalnum()]

        for word in words:
            word = lemmatizer.lemmatize(word)  #Lemmatize the word
            if word not in stopWords:          #Reject stopwords
                if word in freq_table:
                    freq_table[word] += 1
                else:
                    freq_table[word] = 1

        #The first 15 tokens of the sentence serve as the dictionary key
        freq_matrix[sent[:15]] = freq_table

    return freq_matrix


#Function to calculate the Term Frequency (TF) of each word
#INPUT -> freq_matrix
#OUTPUT -> tf_matrix (a dictionary with each sentence key as key,
#          and a dictionary of that sentence's words with their term frequency as value)

#TF(t) = (number of times term t appears in the sentence) / (total number of terms in the sentence)
def tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, freq_table in freq_matrix.items():
        tf_table = {}  #dictionary with the 'word' itself as key and its TF as value

        #Total number of (non-stopword) terms in the sentence, matching the formula above
        total_words_in_sentence = sum(freq_table.values())
        for word, count in freq_table.items():
            tf_table[word] = count / total_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix


#Function to find how many sentences contain a given word
#INPUT -> freq_matrix
#OUTPUT -> sent_per_words (a dictionary with each word as key and the number of
#          sentences containing that word as value)

def sentences_per_words(freq_matrix):
    sent_per_words = {}

    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in sent_per_words:
                sent_per_words[word] += 1
            else:
                sent_per_words[word] = 1

    return sent_per_words


#Function to calculate the Inverse Document Frequency (IDF) of each word
#INPUT -> freq_matrix, sent_per_words, total_sentences
#OUTPUT -> idf_matrix (a dictionary with each sentence key as key,
#          and a dictionary of that sentence's words with their IDF as value)

#IDF(t) = log10(total number of sentences / number of sentences containing term t)
def idf_matrix(freq_matrix, sent_per_words, total_sentences):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            idf_table[word] = math.log10(total_sentences / float(sent_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix
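

#Worked example of the two formulas above (hypothetical numbers): a word that
#appears 2 times in a sentence with 10 counted terms has TF = 2/10 = 0.2; if it
#occurs in 5 of the article's 100 sentences, IDF = log10(100/5) ~= 1.301, so its
#Tf-Idf score (computed next) is 0.2 * 1.301 ~= 0.26.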
#Function to calculate the Tf-Idf score of each word
#INPUT -> tf_matrix, idf_matrix
#OUTPUT -> tf_idf_matrix (a dictionary with each sentence key as key,
#          and a dictionary of that sentence's words with their Tf-Idf score as value)
def tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):

        tf_idf_table = {}

        #word1 and word2 are the same word; both tables share the same keys
        for (word1, tf_value), (word2, idf_value) in zip(f_table1.items(),
                                                         f_table2.items()):
            tf_idf_table[word1] = float(tf_value * idf_value)

        tf_idf_matrix[sent1] = tf_idf_table

    return tf_idf_matrix


#Function to rate every sentence with a score calculated from its words' Tf-Idf values
#INPUT -> tf_idf_matrix
#OUTPUT -> sentenceScore (a dictionary with each sentence key as key and its score as value)
def score_sentences(tf_idf_matrix):

    sentenceScore = {}

    for sent, f_table in tf_idf_matrix.items():
        total_tfidf_score_per_sentence = 0

        total_words_in_sentence = len(f_table)
        for word, tf_idf_score in f_table.items():
            total_tfidf_score_per_sentence += tf_idf_score

        if total_words_in_sentence != 0:
            sentenceScore[sent] = total_tfidf_score_per_sentence / total_words_in_sentence

    return sentenceScore


#Function to calculate the average sentence score
#INPUT -> sentence_score
#OUTPUT -> average_sent_score (the average of all values in sentence_score)
def average_score(sentence_score):

    total_score = 0
    for sent in sentence_score:
        total_score += sentence_score[sent]

    average_sent_score = (total_score / len(sentence_score))

    return average_sent_score


#Function to return the summary of the article
#INPUT -> sentences (list of all sentences in the article), sentence_score, threshold
#         (a multiple of the average sentence score)
#OUTPUT -> summary (string text)
def create_summary(sentences, sentence_score, threshold):
    summary = ''

    for sentence in sentences:
        if sentence[:15] in sentence_score and sentence_score[sentence[:15]] >= threshold:
            summary += " " + sentence.text

    return summary
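

#Illustration of the thresholding above (hypothetical scores): if three sentences
#score 0.30, 0.10 and 0.20, the average is 0.20; with the 1.3 factor used in
#Step 5 the cutoff becomes 0.26, so only the first sentence enters the summary.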
#Step 5. Using all the functions to generate the summary


#Counting the number of words in the original article
original_words = text.split()
original_words = [w for w in original_words if w.isalnum()]
num_words_in_original_text = len(original_words)


#Converting the received text into a spacy Doc object
text = nlp(text)

#Extracting all sentences from the text into a list
sentences = list(text.sents)
total_sentences = len(sentences)

#Generating the frequency matrix
freq_matrix = frequency_matrix(sentences)

#Generating the term frequency matrix
#(results stored under new names so the helper functions above aren't shadowed)
tf_mat = tf_matrix(freq_matrix)

#Getting the number of sentences containing each word
num_sent_per_words = sentences_per_words(freq_matrix)

#Generating the inverse document frequency matrix
idf_mat = idf_matrix(freq_matrix, num_sent_per_words, total_sentences)

#Generating the Tf-Idf matrix
tf_idf_mat = tf_idf_matrix(tf_mat, idf_mat)


#Generating a score for each sentence
sentence_scores = score_sentences(tf_idf_mat)

#Setting the threshold to the average score (you are free to play with other values)
threshold = average_score(sentence_scores)

#Getting the summary
summary = create_summary(sentences, sentence_scores, 1.3 * threshold)
print("\n\n")
print("*" * 20, "Summary", "*" * 20)
print("\n")
print(summary)
print("\n\n")
print("Total words in original article   = ", num_words_in_original_text)
print("Total words in summarized article = ", len(summary.split()))
--------------------------------------------------------------------------------