├── README.md ├── index.js └── package.json /README.md: -------------------------------------------------------------------------------- 1 | # TextRank - *Automatic Summarization with Sentence Extraction* 2 | 3 | ### About 4 | ---- 5 | A javascript implementation of **TextRank: Bringing Order into Texts** ([PDF link to the paper](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)) by Rada Mihalcea and Paul Tarau. 6 | 7 | This only has the implementation for the **Sentence Extraction** method described in the paper. 8 | 9 | Here is a live example http://kevinnadro.me/TextRank/ 10 | 11 | ### Details 12 | --- 13 | Given an article of text in string format it will generate a summary of the article. 14 | 15 | ###### Example case: 16 | http://www.cnn.com/2017/03/22/opinions/puzzling-out-tsa-laptop-ban/index.html 17 | I took all the text from the article above and put it into a string. Then ran the code below. 18 | 19 | ###### Node.js example 20 | 21 | ```javascript 22 | var tr = require('textrank'); 23 | 24 | var articleOfText = "On Monday, the TSA announced a peculiar new security measure to take effect within 96 hours. Passengers flying into the US on foreign airlines from eight Muslim countries would be prohibited from carrying aboard any electronics larger than a smartphone. They would have to be checked and put into the cargo hold. And now the UK is following suit. It's difficult to make sense of this as a security measure, particularly at a time when many people question the veracity of government orders, but other explanations are either unsatisfying or damning. So let's look at the security aspects of this first. Laptop computers aren't inherently dangerous, but they're convenient carrying boxes. This is why, in the past, TSA officials have demanded passengers turn their laptops on: to confirm that they're actually laptops and not laptop cases emptied of their electronics and then filled with explosives. 
Forcing a would-be bomber to put larger laptops in the plane's hold is a reasonable defense against this threat, because it increases the complexity of the plot. Both the shoe-bomber Richard Reid and the underwear bomber Umar Farouk Abdulmutallab carried crude bombs aboard their planes with the plan to set them off manually once aloft. Setting off a bomb in checked baggage is more work, which is why we don't see more midair explosions like Pan Am Flight 103 over Lockerbie, Scotland, in 1988. Security measures that restrict what passengers can carry onto planes are not unprecedented either. Airport security regularly responds to both actual attacks and intelligence regarding future attacks. After the liquid bombers were captured in 2006, the British banned all carry-on luggage except passports and wallets. I remember talking with a friend who traveled home from London with his daughters in those early weeks of the ban. They reported that airport security officials confiscated every tube of lip balm they tried to hide. Similarly, the US started checking shoes after Reid, installed full-body scanners after Abdulmutallab and restricted liquids in 2006. But all of those measure were global, and most lessened in severity as the threat diminished. This current restriction implies some specific intelligence of a laptop-based plot and a temporary ban to address it. However, if that's the case, why only certain non-US carriers? And why only certain airports? Terrorists are smart enough to put a laptop bomb in checked baggage from the Middle East to Europe and then carry it on from Europe to the US. Why not require passengers to turn their laptops on as they go through security? That would be a more effective security measure than forcing them to check them in their luggage. And lastly, why is there a delay between the ban being announced and it taking effect? 
Even more confusing, The New York Times reported that \"officials called the directive an attempt to address gaps in foreign airport security, and said it was not based on any specific or credible threat of an imminent attack.\" The Department of Homeland Security FAQ page makes this general statement, \"Yes, intelligence is one aspect of every security-related decision,\" but doesn't provide a specific security threat. And yet a report from the UK states the ban \"follows the receipt of specific intelligence reports.\" Of course, the details are all classified, which leaves all of us security experts scratching our heads. On the face of it, the ban makes little sense. One analysis painted this as a protectionist measure targeted at the heavily subsidized Middle Eastern airlines by hitting them where it hurts the most: high-paying business class travelers who need their laptops with them on planes to get work done. That reasoning makes more sense than any security-related explanation, but doesn't explain why the British extended the ban to UK carriers as well. Or why this measure won't backfire when those Middle Eastern countries turn around and ban laptops on American carriers in retaliation. And one aviation official told CNN that an intelligence official informed him it was not a \"political move.\" In the end, national security measures based on secret information require us to trust the government. That trust is at historic low levels right now, so people both in the US and other countries are rightly skeptical of the official unsatisfying explanations. 
The new laptop ban highlights this mistrust."; 25 | 26 | var textRank = new tr.TextRank(articleOfText); 27 | 28 | console.log(textRank.summarizedArticle) 29 | ``` 30 | 31 | --- 32 | Generated summary below from ```textRank.summarizedArticle``` 33 | 34 | It's difficult to make sense of this as a security measure, particularly at a time when many people question the veracity of government orders, but other explanations are either unsatisfying or damning. This is why, in the past, TSA officials have demanded passengers turn their laptops on: to confirm that they're actually laptops and not laptop cases emptied of their electronics and then filled with explosives. Forcing a would-be bomber to put larger laptops in the plane's hold is a reasonable defense against this threat, because it increases the complexity of the plot. Terrorists are smart enough to put a laptop bomb in checked baggage from the Middle East to Europe and then carry it on from Europe to the US. Even more confusing, The New York Times reported that "officials called the directive an attempt to address gaps in foreign airport security, and said it was not based on any specific or credible threat of an imminent attack. 
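The sentences above were ranked using the paper's sentence-similarity measure: the number of words two sentences share, normalized by the sum of the logarithms of their lengths. Here is a minimal standalone sketch of that default weighting — the function and variable names are illustrative only, not part of this library's API:

```javascript
// Similarity(Si, Sj) from the TextRank paper: |overlap| / (log|Si| + log|Sj|).
// Sentences are given as arrays of lowercased word tokens, and only unique
// shared words count toward the overlap.
function similarity(tokensA, tokensB) {
  var wordsA = new Set(tokensA);
  var overlap = 0;
  new Set(tokensB).forEach(function (word) {
    if (wordsA.has(word)) overlap++;
  });
  return overlap / (Math.log(tokensA.length) + Math.log(tokensB.length));
}

// "are" is the only shared word, so the score is 1 / (log 4 + log 3)
console.log(similarity(
  ["blue", "cats", "are", "cool"],
  ["tacos", "are", "tasty"]
));
```

You can swap in your own scoring function via the `sim` setting described in the Settings section below.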
35 | 36 | or if ```{ summaryType: "array" }``` is set in the settings object, 37 | 38 | [ 39 | 'It\'s difficult to make sense of this as a security measure, particularly at a time when many people question the veracity of government orders, but other explanations are either unsatisfying or damning.', 40 | 'This is why, in the past, TSA officials have demanded passengers turn their laptops on: to confirm that they\'re actually laptops and not laptop cases emptied of their electronics and then filled with explosives.', 41 | 'Forcing a would-be bomber to put larger laptops in the plane\'s hold is a reasonable defense against this threat, because it increases the complexity of the plot.', 42 | 'Terrorists are smart enough to put a laptop bomb in checked baggage from the Middle East to Europe and then carry it on from Europe to the US.', 43 | 'Even more confusing, The New York Times reported that "officials called the directive an attempt to address gaps in foreign airport security, and said it was not based on any specific or credible threat of an imminent attack."' ] 44 | 45 | --- 46 | If you want to summarize another article, you have to create a new TextRank object. (Thinking about changing this later) 47 | ```javascript 48 | var textRank = new tr.TextRank(someArticle); 49 | console.log(textRank.summarizedArticle); 50 | 51 | var textRank_ = new tr.TextRank(anotherArticle); 52 | console.log(textRank_.summarizedArticle); 53 | ``` 54 | 55 | ### Settings 56 | There are some parameters you can set in the TextRank object. You can provide any combination of these settings, or none at all. 57 | 58 | Note: If you choose to provide your own, you must provide both **tokens** and **split**, not just one or the other. 59 | ```javascript 60 | var settings = { // does not compile, just pseudocode layout 61 | extractAmount: 6, // Extracts 6 sentences instead of the default 5. 62 | d: 0.95, // Value in [0,1]; the random surfer model constant. Default is 0.85.
63 | summaryType: "array", // Returns an array of the summarized sentences instead of one long string. Default is a string. 64 | sim: function(Si, Sj) { ... }, // You can use your own similarity scoring function! 65 | tokens: [ sentence1, sentence2, sentence3, ... , sentenceN ], 66 | split: [[word1, word2, ... , wordN],[word1, word2, ... , wordN], ..., [word1, word2, ... , wordN]] 67 | } 68 | ``` 69 | 70 | #### Providing tokens yourself 71 | Providing your own tokens means you already parsed your body of text into sentences. 72 | In addition, you must tokenize the sentences into words on your own and provide those as well. Here is an example. 73 | 74 | ```javascript 75 | var someArticle = "Blue cats are cool. Welcome home Julie! Tacos are tasty." 76 | var settings = { 77 | extractAmount: 3, 78 | tokens: [ "Blue cats are cool.", "Welcome home Julie!", "Tacos are tasty."], 79 | split: [["blue","cats","are","cool"],["welcome","home","julie"], ["tacos","are","tasty"]] 80 | } 81 | // You don't actually have to provide the real someArticle text if you provide your own tokens. Just don't provide the empty string. 82 | var textRank = new tr.TextRank(someArticle, settings); 83 | var textRank_same_result_as_above = new tr.TextRank("This text does nothing!",settings); 84 | ``` 85 | 86 | Also, since you provide the tokenized sentences yourself, your similarity function can vary a lot. 87 | 88 | #### Similarity function 89 | The parameters **Si** and **Sj** are of this format. This should help you understand what information is available to you when implementing a scoring function. 90 | When you provide your own tokens and split, each sentence you provided becomes the *sentence* attribute and its tokenized form becomes the *tokens* attribute. 91 | ```javascript 92 | { 93 | id:0, // sentence position 94 | score:7.497927571481792, // sentence score (vertex score) 95 | sentence:"Today Judge Denise Lind announced her verdict in the case of Pfc.", 96 | tokens:Array[12] // looks like ["today", "judge", ...
, "pfc"] 97 | } 98 | } 99 | -------------------------------------------------------------------------------- /index.js: -------------------------------------------------------------------------------- 1 | /* 2 | ========================================== 3 | TextRank: Bringing Order into Texts 4 | 5 | Performs sentence extraction only. 6 | Used for automatic article summarization. 7 | ========================================== 8 | */ 9 | // Article is a string of text to summarize 10 | exports.TextRank = function (article, settings) { 11 | 12 | this.printError = function (msg) { 13 | console.error("TextRank ERROR:", msg); 14 | } 15 | 16 | if(typeof article != "string") { 17 | this.printError("Article Must Be Type String"); 18 | return; 19 | } 20 | 21 | if(article.length < 1){ 22 | this.printError("Article Can't Be Empty"); 23 | return; 24 | } 25 | 26 | if(!settings){ 27 | settings = {}; 28 | } 29 | 30 | this.extractAmount = (settings["extractAmount"])? settings["extractAmount"] : 5; 31 | 32 | // Random surfer model damping constant, used when scoring vertices 33 | this.d = (settings["d"])? settings["d"] : 0.85; 34 | 35 | // Set the similarity function for edge weighting 36 | this.userDefinedSimilarity = (settings["sim"])? settings["sim"] : null; 37 | 38 | // Tokens are the sentences [ sentence1, sentence2, sentence3, ... , sentenceN ] 39 | this.userDefinedTokens = (settings["tokens"])? settings["tokens"]: null; 40 | // Split is the sentences tokenized into words [[word1, word2, ... , wordN],[word1, word2, ... , wordN], ..., [word1, word2, ... , wordN]] 41 | this.userDefinedTokensSplit = (settings["split"])? settings["split"]: null; 42 | 43 | this.typeOfSummary = (settings["summaryType"])?
1 : 0; 44 | 45 | this.graph = { 46 | V: {}, // Sentences are the vertices of the graph 47 | E: {}, 48 | numVerts: 0 49 | } 50 | 51 | this.summarizedArticle = ""; 52 | 53 | // convergence threshold 54 | this.delta = 0.0001 55 | 56 | // Constructs the graph 57 | this.setupGraph = function (article) { 58 | 59 | // The TextPreprocesser cleans up and tokenizes the article 60 | this.graph.V = TextPreprocesser(article, this.userDefinedTokens, this.userDefinedTokensSplit); 61 | 62 | this.graph.numVerts = Object.keys(this.graph.V).length; 63 | 64 | // Check for user defined similarity function 65 | this.sim = (this.userDefinedSimilarity != null)? this.userDefinedSimilarity : this.similarityScoring; 66 | 67 | // Init vertex scores 68 | for(iIndex in this.graph.V) { 69 | var vertex = this.graph.V[iIndex]; 70 | 71 | // The initial score of a vertex is random and does not matter for the TextRank algorithm 72 | vertex["score"] = Math.random() * 10 + 1; 73 | 74 | // Id is the sentence position starting from 0 75 | vertex["id"] = Number(iIndex); 76 | 77 | var Si = vertex; 78 | 79 | // Add an edge between every sentence in the graph 80 | // Fully connected graph 81 | for (var j = 0; j < this.graph.numVerts; j++) { 82 | 83 | var jIndex = j.toString(); 84 | 85 | // No self edges 86 | if(jIndex != iIndex) { 87 | 88 | // If no edge list, create it 89 | if(!this.graph.E[iIndex]) { 90 | this.graph.E[iIndex] = {}; 91 | } 92 | 93 | var Sj = this.graph.V[jIndex]; 94 | 95 | // Compute the edge weight between two sentences in the graph 96 | this.graph.E[iIndex][jIndex] = this.sim(Si, Sj); 97 | 98 | } 99 | } 100 | } 101 | } 102 | 103 | // Given two sentences compute a score which is the weight on the edge between the two sentence 104 | // Implementation of Similarity(Si, Sj) function defined in the paper 105 | this.similarityScoring = function (Si, Sj) { 106 | 107 | var overlap = {} 108 | var Si_tokens = Si.tokens; 109 | var Sj_tokens = Sj.tokens; 110 | 111 | // Count words for sentence i 112 
| for(var i = 0; i < Si_tokens.length; i++) { 113 | var word = Si_tokens[i]; 114 | 115 | if(!overlap[word]) { 116 | overlap[word] = {} 117 | } 118 | 119 | overlap[word]['i'] = 1; 120 | } 121 | 122 | // Count words for sentence j 123 | for(var i = 0; i < Sj_tokens.length; i++) { 124 | var word = Sj_tokens[i]; 125 | 126 | if(!overlap[word]) { 127 | overlap[word] = {} 128 | } 129 | overlap[word]['j'] = 1; 130 | } 131 | 132 | var logLengths = Math.log(Si_tokens.length) + Math.log(Sj_tokens.length); 133 | var wordOverlapCount = 0; 134 | 135 | // Compute word overlap from the sentences 136 | for( index in overlap) { 137 | var word = overlap[index] 138 | if ( Object.keys(word).length === 2) { 139 | wordOverlapCount++; 140 | } 141 | } 142 | 143 | // Compute score 144 | return wordOverlapCount/logLengths; 145 | } 146 | 147 | this.iterations = 0; 148 | this.iterateAgain = true; 149 | 150 | // The Weighted Graph WS(Vi) function to score a vertex 151 | this.iterate = function () { 152 | 153 | for(index in this.graph.V){ 154 | 155 | var vertex = this.graph.V[index]; // Vi vertex 156 | var score_0 = vertex.score; 157 | 158 | var vertexNeighbors = this.graph.E[index]; // In(Vi) set 159 | 160 | var summedNeighbors = 0; 161 | 162 | // Sum over In(Vi) 163 | for (neighborIndex in vertexNeighbors) { 164 | 165 | var neighbor = vertexNeighbors[neighborIndex]; // Vj 166 | 167 | var wji = this.graph.E[index][neighborIndex]; // wji 168 | 169 | // Sum over Out(Vj) 170 | var outNeighbors = this.graph.E[neighborIndex]; 171 | var summedOutWeight = 1; // Stores the summation of weights over the Out(Vj) 172 | 173 | for( outIndex in outNeighbors) { 174 | summedOutWeight += outNeighbors[outIndex]; 175 | } 176 | 177 | var WSVertex = this.graph.V[neighborIndex].score; // WS(Vj) 178 | summedNeighbors += (wji/summedOutWeight) * WSVertex; 179 | 180 | } 181 | 182 | var score_1 = (1 - this.d) + this.d * summedNeighbors; // WS(Vi) 183 | 184 | // Update the score on the vertex 185 | 
this.graph.V[index].score = score_1; 186 | 187 | // Check to see if you should continue 188 | if(Math.abs(score_1 - score_0) <= this.delta) { 189 | this.iterateAgain = false; 190 | } 191 | 192 | } 193 | 194 | // Check for another iteration 195 | if(this.iterateAgain == true) { 196 | this.iterations += 1; 197 | this.iterate(); 198 | }else { 199 | 200 | // Prints only once 201 | // console.log(this.iterations); 202 | } 203 | 204 | return; 205 | } 206 | 207 | // Extracts the top N sentences 208 | this.extractSummary = function (N) { 209 | 210 | var sentences = []; 211 | 212 | // Gather all the sentences 213 | for ( index in this.graph.V) { 214 | sentences.push(this.graph.V[index]); 215 | } 216 | 217 | // Sort the sentences based off the score of the vertex 218 | sentences = sentences.sort( function (a,b) { 219 | if (a.score > b.score) { 220 | return -1; 221 | }else { 222 | return 1; 223 | } 224 | }); 225 | 226 | // Grab the top N sentences (never more than the article has) 227 | // var sentences = sentences.slice(0,0+(N)); 228 | sentences.length = Math.min(N, sentences.length); 229 | 230 | // Sort based off the id which is the position of the sentence in the original article 231 | sentences = sentences.sort(function (a,b) { 232 | if (a.id < b.id) { 233 | return -1; 234 | } else { 235 | return 1; 236 | } 237 | }) 238 | 239 | var summary = null; 240 | 241 | if(this.typeOfSummary) { 242 | summary = []; 243 | for (var i = 0; i < sentences.length; i++) { 244 | summary.push(sentences[i].sentence); 245 | } 246 | 247 | } else { 248 | // Compose the summary by joining the ranked sentences 249 | summary = sentences[0].sentence; 250 | 251 | for (var i = 1; i < sentences.length; i++) { 252 | summary += " " + sentences[i].sentence; 253 | } 254 | 255 | } 256 | 257 | return summary; 258 | } 259 | 260 | this.run = function (article) { 261 | // Create graph structure 262 | this.setupGraph(article); 263 | 264 | // Rank sentences 265 | this.iterate(); 266 | 267 | this.summarizedArticle = this.extractSummary(this.extractAmount); 268 | } 269 | 270
| this.run(article); 271 | } 272 | 273 | // Handles the preprocessing of text for creating the graph structure of TextRank 274 | function TextPreprocesser(article, userTokens, userTokensSplit) { 275 | 276 | // Function to clean up the article that is passed in. 277 | this.cleanArticle = function (article) { 278 | 279 | // Regex to remove two or more spaces in a row. 280 | return article.replace(/[ ]+(?= )/g, ""); 281 | 282 | } 283 | 284 | // tokenizer takes a string {article} and turns it into an array of sentences 285 | // tokens are sentences, must end with (!?.) characters 286 | this.tokenizer = function(article) { 287 | 288 | return article.replace(/([ ][".A-Za-z-|0-9]+[!|.|?|"](?=[ ]["“A-Z]))/g, "$1|").split("|"); 289 | } 290 | 291 | // Cleans up the tokens 292 | // tokens are sentences 293 | this.cleanTokens = function(tokens) { 294 | 295 | // Iterate backwards to allow for splicing. 296 | for (var i = tokens.length - 1; i >= 0; i--) { 297 | 298 | // Current Token 299 | var token = tokens[i] 300 | 301 | // Empty String 302 | if(token == "") { 303 | tokens.splice(i,1); 304 | }else { // Since string is not empty clean it up 305 | 306 | // Remove all spaces leading the sentence 307 | tokens[i] = token.replace(/[ .]*/,"") 308 | } 309 | } 310 | 311 | return tokens; 312 | } 313 | 314 | // Given a sentence, split it up into its individual words 315 | this.tokenizeASentence = function(sentence) { 316 | 317 | // lowercase all the words in the sentences 318 | var lc_sentence = sentence.toLowerCase(); 319 | 320 | /* 321 | Regex Expression Below : 322 | Example: cool, awesome, something else, and yup 323 | The delimiters like commas (,) (:) (;) etc ...
need to be removed 324 | When scoring sentences against each other you do not want to compare 325 | {cool,} against {cool} because they will not match since the comma stays with {cool,} 326 | */ 327 | 328 | // put spaces between all characters to split into words 329 | var replaceToSpaceWithoutAfterSpace = /[-|'|"|(|)|/|<|>|,|:|;](?! )/g; 330 | lc_sentence = lc_sentence.replace(replaceToSpaceWithoutAfterSpace," "); 331 | 332 | // Now replace all characters with blank 333 | var replaceToBlankWithCharacters = /[-|'|"|(|)|/|<|>|,|:|;]/g; 334 | lc_sentence = lc_sentence.replace(replaceToBlankWithCharacters,""); 335 | 336 | // Split into the words based off spaces since cleaned up 337 | return lc_sentence.split(" "); 338 | } 339 | 340 | this.outputPreprocess = function(article) { 341 | 342 | var cleanedArticle = this.cleanArticle(article); 343 | 344 | // Check for user tokens 345 | var usingUserDefinedTokens = (userTokens && userTokensSplit); 346 | var tokens = (usingUserDefinedTokens)? userTokens : this.cleanTokens(this.tokenizer(cleanedArticle)); 347 | 348 | var output = {}; 349 | 350 | for (var i = 0; i < tokens.length; i++) { 351 | 352 | var tokenizedSentence = (usingUserDefinedTokens)? 
userTokensSplit[i]: this.tokenizeASentence(tokens[i]); 353 | 354 | output[i] = { 355 | sentence: tokens[i], 356 | tokens: tokenizedSentence 357 | }; 358 | 359 | } 360 | 361 | return output; 362 | } 363 | 364 | return this.outputPreprocess(article); 365 | } 366 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "textrank", 3 | "version": "1.0.5", 4 | "description": "TextRank JavaScript implementation for automatic text summarization", 5 | "main": "index.js", 6 | "scripts": { 7 | "test": "echo \"Error: no test specified\" && exit 1" 8 | }, 9 | "repository": { 10 | "type": "git", 11 | "url": "https://github.com/nadr0/TextRank-node" 12 | }, 13 | "keywords": [ 14 | "TextRank", 15 | "Automatic", 16 | "Summarization", 17 | "Text", 18 | "Summary" 19 | ], 20 | "author": "Kevin Nadro", 21 | "license": "ISC", 22 | "bugs": { 23 | "url": "https://github.com/nadr0/TextRank-node/issues" 24 | }, 25 | "homepage": "https://github.com/nadr0/TextRank-node" 26 | } 27 | --------------------------------------------------------------------------------