├── README.md ├── index.js └── package.json /README.md: -------------------------------------------------------------------------------- 1 | # TextRank - *Automatic Summarization with Sentence Extraction* 2 | 3 | ### About 4 | ---- 5 | A javascript implementation of **TextRank: Bringing Order into Texts** ([PDF link to the paper](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)) by Rada Mihalcea and Paul Tarau. 6 | 7 | This only has the implementation for the **Sentence Extraction** method described in the paper. 8 | 9 | Here is a live example http://kevinnadro.me/TextRank/ 10 | 11 | ### Details 12 | --- 13 | Given an article of text in string format it will generate a summary of the article. 14 | 15 | ###### Example case: 16 | http://www.cnn.com/2017/03/22/opinions/puzzling-out-tsa-laptop-ban/index.html 17 | I took all the text from the article above and put it into a string. Then ran the code below. 18 | 19 | ###### Node.js example 20 | 21 | ```javascript 22 | var tr = require('textrank'); 23 | 24 | var articleOfText = "On Monday, the TSA announced a peculiar new security measure to take effect within 96 hours. Passengers flying into the US on foreign airlines from eight Muslim countries would be prohibited from carrying aboard any electronics larger than a smartphone. They would have to be checked and put into the cargo hold. And now the UK is following suit. It's difficult to make sense of this as a security measure, particularly at a time when many people question the veracity of government orders, but other explanations are either unsatisfying or damning. So let's look at the security aspects of this first. Laptop computers aren't inherently dangerous, but they're convenient carrying boxes. This is why, in the past, TSA officials have demanded passengers turn their laptops on: to confirm that they're actually laptops and not laptop cases emptied of their electronics and then filled with explosives. 
Forcing a would-be bomber to put larger laptops in the plane's hold is a reasonable defense against this threat, because it increases the complexity of the plot. Both the shoe-bomber Richard Reid and the underwear bomber Umar Farouk Abdulmutallab carried crude bombs aboard their planes with the plan to set them off manually once aloft. Setting off a bomb in checked baggage is more work, which is why we don't see more midair explosions like Pan Am Flight 103 over Lockerbie, Scotland, in 1988. Security measures that restrict what passengers can carry onto planes are not unprecedented either. Airport security regularly responds to both actual attacks and intelligence regarding future attacks. After the liquid bombers were captured in 2006, the British banned all carry-on luggage except passports and wallets. I remember talking with a friend who traveled home from London with his daughters in those early weeks of the ban. They reported that airport security officials confiscated every tube of lip balm they tried to hide. Similarly, the US started checking shoes after Reid, installed full-body scanners after Abdulmutallab and restricted liquids in 2006. But all of those measure were global, and most lessened in severity as the threat diminished. This current restriction implies some specific intelligence of a laptop-based plot and a temporary ban to address it. However, if that's the case, why only certain non-US carriers? And why only certain airports? Terrorists are smart enough to put a laptop bomb in checked baggage from the Middle East to Europe and then carry it on from Europe to the US. Why not require passengers to turn their laptops on as they go through security? That would be a more effective security measure than forcing them to check them in their luggage. And lastly, why is there a delay between the ban being announced and it taking effect? 
Even more confusing, The New York Times reported that \"officials called the directive an attempt to address gaps in foreign airport security, and said it was not based on any specific or credible threat of an imminent attack.\" The Department of Homeland Security FAQ page makes this general statement, \"Yes, intelligence is one aspect of every security-related decision,\" but doesn't provide a specific security threat. And yet a report from the UK states the ban \"follows the receipt of specific intelligence reports.\" Of course, the details are all classified, which leaves all of us security experts scratching our heads. On the face of it, the ban makes little sense. One analysis painted this as a protectionist measure targeted at the heavily subsidized Middle Eastern airlines by hitting them where it hurts the most: high-paying business class travelers who need their laptops with them on planes to get work done. That reasoning makes more sense than any security-related explanation, but doesn't explain why the British extended the ban to UK carriers as well. Or why this measure won't backfire when those Middle Eastern countries turn around and ban laptops on American carriers in retaliation. And one aviation official told CNN that an intelligence official informed him it was not a \"political move.\" In the end, national security measures based on secret information require us to trust the government. That trust is at historic low levels right now, so people both in the US and other countries are rightly skeptical of the official unsatisfying explanations. 
The new laptop ban highlights this mistrust."; 25 | 26 | var textRank = new tr.TextRank(articleOfText); 27 | 28 | console.log(textRank.summarizedArticle) 29 | ``` 30 | 31 | --- 32 | Generated summary below from ```textRank.summarizedArticle``` 33 | 34 | It's difficult to make sense of this as a security measure, particularly at a time when many people question the veracity of government orders, but other explanations are either unsatisfying or damning. This is why, in the past, TSA officials have demanded passengers turn their laptops on: to confirm that they're actually laptops and not laptop cases emptied of their electronics and then filled with explosives. Forcing a would-be bomber to put larger laptops in the plane's hold is a reasonable defense against this threat, because it increases the complexity of the plot. Terrorists are smart enough to put a laptop bomb in checked baggage from the Middle East to Europe and then carry it on from Europe to the US. Even more confusing, The New York Times reported that "officials called the directive an attempt to address gaps in foreign airport security, and said it was not based on any specific or credible threat of an imminent attack. 
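The sentences above were ranked using the paper's sentence-similarity measure: the number of words two sentences share, normalized by the sum of the logarithms of their lengths. Here is a minimal standalone sketch of that default weighting — the function and variable names are illustrative only, not part of this library's API:

```javascript
// Similarity(Si, Sj) from the TextRank paper: |overlap| / (log|Si| + log|Sj|).
// Sentences are given as arrays of lowercased word tokens, and only unique
// shared words count toward the overlap.
function similarity(tokensA, tokensB) {
  var wordsA = new Set(tokensA);
  var overlap = 0;
  new Set(tokensB).forEach(function (word) {
    if (wordsA.has(word)) overlap++;
  });
  return overlap / (Math.log(tokensA.length) + Math.log(tokensB.length));
}

// "are" is the only shared word, so the score is 1 / (log 4 + log 3)
console.log(similarity(
  ["blue", "cats", "are", "cool"],
  ["tacos", "are", "tasty"]
));
```

You can swap in your own scoring function via the `sim` setting described in the Settings section below.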
35 | 36 | or if ```{ summaryType: "array" }``` is set in the settings object, 37 | 38 | [ 39 | 'It\'s difficult to make sense of this as a security measure, particularly at a time when many people question the veracity of government orders, but other explanations are either unsatisfying or damning.', 40 | 'This is why, in the past, TSA officials have demanded passengers turn their laptops on: to confirm that they\'re actually laptops and not laptop cases emptied of their electronics and then filled with explosives.', 41 | 'Forcing a would-be bomber to put larger laptops in the plane\'s hold is a reasonable defense against this threat, because it increases the complexity of the plot.', 42 | 'Terrorists are smart enough to put a laptop bomb in checked baggage from the Middle East to Europe and then carry it on from Europe to the US.', 43 | 'Even more confusing, The New York Times reported that "officials called the directive an attempt to address gaps in foreign airport security, and said it was not based on any specific or credible threat of an imminent attack."' ] 44 | 45 | --- 46 | If you want to summarize another article, you have to create a new TextRank object. (Thinking about changing this later) 47 | ```javascript 48 | var textRank = new tr.TextRank(someArticle); 49 | console.log(textRank.summarizedArticle); 50 | 51 | var textRank_ = new tr.TextRank(anotherArticle); 52 | console.log(textRank_.summarizedArticle); 53 | ``` 54 | 55 | ### Settings 56 | There are some parameters you can set in the TextRank object. You can provide any combination of these settings, or none at all. 57 | 58 | Note: If you choose to provide your own, you must provide both **tokens** and **split**, not just one or the other. 59 | ```javascript 60 | var settings = { // does not compile, just pseudocode layout 61 | extractAmount: 6, // Extracts 6 sentences instead of the default 5. 62 | d: 0.95, // Value in [0,1]; the random surfer model constant. Default is 0.85.
63 | summaryType: "array", // Returns an array of the summarized sentences instead of one long string. Default is a string. 64 | sim: function(Si, Sj) { ... }, // You can use your own similarity scoring function! 65 | tokens: [ sentence1, sentence2, sentence3, ... , sentenceN ], 66 | split: [[word1, word2, ... , wordN],[word1, word2, ... , wordN], ..., [word1, word2, ... , wordN]] 67 | } 68 | ``` 69 | 70 | #### Providing tokens yourself 71 | Providing your own tokens means you already parsed your body of text into sentences. 72 | In addition, you must tokenize the sentences into words on your own and provide those as well. Here is an example. 73 | 74 | ```javascript 75 | var someArticle = "Blue cats are cool. Welcome home Julie! Tacos are tasty." 76 | var settings = { 77 | extractAmount: 3, 78 | tokens: [ "Blue cats are cool.", "Welcome home Julie!", "Tacos are tasty."], 79 | split: [["blue","cats","are","cool"],["welcome","home","julie"], ["tacos","are","tasty"]] 80 | } 81 | // You don't actually have to provide the real someArticle text if you provide your own tokens. Just don't provide the empty string. 82 | var textRank = new tr.TextRank(someArticle, settings); 83 | var textRank_same_result_as_above = new tr.TextRank("This text does nothing!",settings); 84 | ``` 85 | 86 | Also, since you provide the tokenized sentences yourself, your similarity function can vary a lot. 87 | 88 | #### Similarity function 89 | The parameters **Si** and **Sj** are of this format. This should help you understand what information is available to you when implementing a scoring function. 90 | When you provide your own tokens and split, each sentence you provided becomes the *sentence* attribute and its tokenized form becomes the *tokens* attribute. 91 | ```javascript 92 | { 93 | id:0, // sentence position 94 | score:7.497927571481792, // sentence score (vertex score) 95 | sentence:"Today Judge Denise Lind announced her verdict in the case of Pfc.", 96 | tokens:Array[12] // looks like ["today", "judge", ...
, "pfc"] 97 | } 98 | } 99 | -------------------------------------------------------------------------------- /index.js: -------------------------------------------------------------------------------- 1 | /* 2 | ========================================== 3 | TextRank: Bringing Order into Texts 4 | 5 | Performs sentence extraction only. 6 | Used for automatic article summarization. 7 | ========================================== 8 | */ 9 | // Article is a string of text to summarize 10 | exports.TextRank = function (article, settings) { 11 | 12 | this.printError = function (msg) { 13 | console.error("TextRank ERROR:", msg); 14 | } 15 | 16 | if(typeof article != "string") { 17 | this.printError("Article Must Be Type String"); 18 | return; 19 | } 20 | 21 | if(article.length < 1){ 22 | this.printError("Article Can't Be Empty"); 23 | return; 24 | } 25 | 26 | if(!settings){ 27 | settings = {}; 28 | } 29 | 30 | this.extractAmount = (settings["extractAmount"])? settings["extractAmount"] : 5; 31 | 32 | // Random surfer model damping constant, used when scoring vertices 33 | this.d = (settings["d"])? settings["d"] : 0.85; 34 | 35 | // Set the similarity function for edge weighting 36 | this.userDefinedSimilarity = (settings["sim"])? settings["sim"] : null; 37 | 38 | // Tokens are the sentences [ sentence1, sentence2, sentence3, ... , sentenceN ] 39 | this.userDefinedTokens = (settings["tokens"])? settings["tokens"]: null; 40 | // Split is the sentences tokenized into words [[word1, word2, ... , wordN],[word1, word2, ... , wordN], ..., [word1, word2, ... , wordN]] 41 | this.userDefinedTokensSplit = (settings["split"])? settings["split"]: null; 42 | 43 | this.typeOfSummary = (settings["summaryType"])?
1 : 0; 44 | 45 | this.graph = { 46 | V: {}, // Sentences are the vertices of the graph 47 | E: {}, 48 | numVerts: 0 49 | } 50 | 51 | this.summarizedArticle = ""; 52 | 53 | // convergence threshold 54 | this.delta = 0.0001 55 | 56 | // Constructs the graph 57 | this.setupGraph = function (article) { 58 | 59 | // The TextPreprocesser cleans up and tokenizes the article 60 | this.graph.V = TextPreprocesser(article, this.userDefinedTokens, this.userDefinedTokensSplit); 61 | 62 | this.graph.numVerts = Object.keys(this.graph.V).length; 63 | 64 | // Check for user defined similarity function 65 | this.sim = (this.userDefinedSimilarity != null)? this.userDefinedSimilarity : this.similarityScoring; 66 | 67 | // Init vertex scores 68 | for(iIndex in this.graph.V) { 69 | var vertex = this.graph.V[iIndex]; 70 | 71 | // The initial score of a vertex is random and does not matter for the TextRank algorithm 72 | vertex["score"] = Math.random() * 10 + 1; 73 | 74 | // Id is the sentence position starting from 0 75 | vertex["id"] = Number(iIndex); 76 | 77 | var Si = vertex; 78 | 79 | // Add an edge between every sentence in the graph 80 | // Fully connected graph 81 | for (var j = 0; j < this.graph.numVerts; j++) { 82 | 83 | var jIndex = j.toString(); 84 | 85 | // No self edges 86 | if(jIndex != iIndex) { 87 | 88 | // If no edge list, create it 89 | if(!this.graph.E[iIndex]) { 90 | this.graph.E[iIndex] = {}; 91 | } 92 | 93 | var Sj = this.graph.V[jIndex]; 94 | 95 | // Compute the edge weight between two sentences in the graph 96 | this.graph.E[iIndex][jIndex] = this.sim(Si, Sj); 97 | 98 | } 99 | } 100 | } 101 | } 102 | 103 | // Given two sentences compute a score which is the weight on the edge between the two sentence 104 | // Implementation of Similarity(Si, Sj) function defined in the paper 105 | this.similarityScoring = function (Si, Sj) { 106 | 107 | var overlap = {} 108 | var Si_tokens = Si.tokens; 109 | var Sj_tokens = Sj.tokens; 110 | 111 | // Count words for sentence i 112 
| for(var i = 0; i < Si_tokens.length; i++) { 113 | var word = Si_tokens[i]; 114 | 115 | if(!overlap[word]) { 116 | overlap[word] = {} 117 | } 118 | 119 | overlap[word]['i'] = 1; 120 | } 121 | 122 | // Count words for sentence j 123 | for(var i = 0; i < Sj_tokens.length; i++) { 124 | var word = Sj_tokens[i]; 125 | 126 | if(!overlap[word]) { 127 | overlap[word] = {} 128 | } 129 | overlap[word]['j'] = 1; 130 | } 131 | 132 | var logLengths = Math.log(Si_tokens.length) + Math.log(Sj_tokens.length); 133 | var wordOverlapCount = 0; 134 | 135 | // Compute word overlap from the sentences 136 | for( index in overlap) { 137 | var word = overlap[index] 138 | if ( Object.keys(word).length === 2) { 139 | wordOverlapCount++; 140 | } 141 | } 142 | 143 | // Compute score 144 | return wordOverlapCount/logLengths; 145 | } 146 | 147 | this.iterations = 0; 148 | this.iterateAgain = true; 149 | 150 | // The Weighted Graph WS(Vi) function to score a vertex 151 | this.iterate = function () { 152 | 153 | for(index in this.graph.V){ 154 | 155 | var vertex = this.graph.V[index]; // Vi vertex 156 | var score_0 = vertex.score; 157 | 158 | var vertexNeighbors = this.graph.E[index]; // In(Vi) set 159 | 160 | var summedNeighbors = 0; 161 | 162 | // Sum over In(Vi) 163 | for (neighborIndex in vertexNeighbors) { 164 | 165 | var neighbor = vertexNeighbors[neighborIndex]; // Vj 166 | 167 | var wji = this.graph.E[index][neighborIndex]; // wji 168 | 169 | // Sum over Out(Vj) 170 | var outNeighbors = this.graph.E[neighborIndex]; 171 | var summedOutWeight = 1; // Stores the summation of weights over the Out(Vj) 172 | 173 | for( outIndex in outNeighbors) { 174 | summedOutWeight += outNeighbors[outIndex]; 175 | } 176 | 177 | var WSVertex = this.graph.V[neighborIndex].score; // WS(Vj) 178 | summedNeighbors += (wji/summedOutWeight) * WSVertex; 179 | 180 | } 181 | 182 | var score_1 = (1 - this.d) + this.d * summedNeighbors; // WS(Vi) 183 | 184 | // Update the score on the vertex 185 | 
this.graph.V[index].score = score_1; 186 | 187 | // Check to see if you should continue 188 | if(Math.abs(score_1 - score_0) <= this.delta) { 189 | this.iterateAgain = false; 190 | } 191 | 192 | } 193 | 194 | // Check for another iteration 195 | if(this.iterateAgain == true) { 196 | this.iterations += 1; 197 | this.iterate(); 198 | }else { 199 | 200 | // Prints only once 201 | // console.log(this.iterations); 202 | } 203 | 204 | return; 205 | } 206 | 207 | // Extracts the top N sentences 208 | this.extractSummary = function (N) { 209 | 210 | var sentences = []; 211 | 212 | // Gather all the sentences 213 | for ( index in this.graph.V) { 214 | sentences.push(this.graph.V[index]); 215 | } 216 | 217 | // Sort the sentences based off the score of the vertex 218 | sentences = sentences.sort( function (a,b) { 219 | if (a.score > b.score) { 220 | return -1; 221 | }else { 222 | return 1; 223 | } 224 | }); 225 | 226 | // Grab the top N sentences (never more than the article has) 227 | // var sentences = sentences.slice(0,0+(N)); 228 | sentences.length = Math.min(N, sentences.length); 229 | 230 | // Sort based off the id which is the position of the sentence in the original article 231 | sentences = sentences.sort(function (a,b) { 232 | if (a.id < b.id) { 233 | return -1; 234 | } else { 235 | return 1; 236 | } 237 | }) 238 | 239 | var summary = null; 240 | 241 | if(this.typeOfSummary) { 242 | summary = []; 243 | for (var i = 0; i < sentences.length; i++) { 244 | summary.push(sentences[i].sentence); 245 | } 246 | 247 | } else { 248 | // Compose the summary by joining the ranked sentences 249 | summary = sentences[0].sentence; 250 | 251 | for (var i = 1; i < sentences.length; i++) { 252 | summary += " " + sentences[i].sentence; 253 | } 254 | 255 | } 256 | 257 | return summary; 258 | } 259 | 260 | this.run = function (article) { 261 | // Create graph structure 262 | this.setupGraph(article); 263 | 264 | // Rank sentences 265 | this.iterate(); 266 | 267 | this.summarizedArticle = this.extractSummary(this.extractAmount); 268 | } 269 | 270
| this.run(article); 271 | } 272 | 273 | // Handles the preprocessing of text for creating the graph structure of TextRank 274 | function TextPreprocesser(article, userTokens, userTokensSplit) { 275 | 276 | // Function to clean up the article that is passed in. 277 | this.cleanArticle = function (article) { 278 | 279 | // Regex to remove two or more spaces in a row. 280 | return article.replace(/[ ]+(?= )/g, ""); 281 | 282 | } 283 | 284 | // tokenizer takes a string {article} and turns it into an array of sentences 285 | // tokens are sentences, must end with (!?.) characters 286 | this.tokenizer = function(article) { 287 | 288 | return article.replace(/([ ][".A-Za-z-|0-9]+[!|.|?|"](?=[ ]["“A-Z]))/g, "$1|").split("|"); 289 | } 290 | 291 | // Cleans up the tokens 292 | // tokens are sentences 293 | this.cleanTokens = function(tokens) { 294 | 295 | // Iterate backwards to allow for splicing. 296 | for (var i = tokens.length - 1; i >= 0; i--) { 297 | 298 | // Current Token 299 | var token = tokens[i] 300 | 301 | // Empty String 302 | if(token == "") { 303 | tokens.splice(i,1); 304 | }else { // Since string is not empty clean it up 305 | 306 | // Remove all spaces leading the sentence 307 | tokens[i] = token.replace(/[ .]*/,"") 308 | } 309 | } 310 | 311 | return tokens; 312 | } 313 | 314 | // Given a sentence, split it up into its individual words 315 | this.tokenizeASentence = function(sentence) { 316 | 317 | // lowercase all the words in the sentences 318 | var lc_sentence = sentence.toLowerCase(); 319 | 320 | /* 321 | Regex Expression Below : 322 | Example: cool, awesome, something else, and yup 323 | The delimiters like commas (,) (:) (;) etc ...
need to be removed 324 | When scoring sentences against each other you do not want to compare 325 | {cool,} against {cool} because they will not match since the comma stays with {cool,} 326 | */ 327 | 328 | // put spaces between all characters to split into words 329 | var replaceToSpaceWithoutAfterSpace = /[-|'|"|(|)|/|<|>|,|:|;](?! )/g; 330 | lc_sentence = lc_sentence.replace(replaceToSpaceWithoutAfterSpace," "); 331 | 332 | // Now replace all characters with blank 333 | var replaceToBlankWithCharacters = /[-|'|"|(|)|/|<|>|,|:|;]/g; 334 | lc_sentence = lc_sentence.replace(replaceToBlankWithCharacters,""); 335 | 336 | // Split into the words based off spaces since cleaned up 337 | return lc_sentence.split(" "); 338 | } 339 | 340 | this.outputPreprocess = function(article) { 341 | 342 | var cleanedArticle = this.cleanArticle(article); 343 | 344 | // Check for user tokens 345 | var usingUserDefinedTokens = (userTokens && userTokensSplit); 346 | var tokens = (usingUserDefinedTokens)? userTokens : this.cleanTokens(this.tokenizer(cleanedArticle)); 347 | 348 | var output = {}; 349 | 350 | for (var i = 0; i < tokens.length; i++) { 351 | 352 | var tokenizedSentence = (usingUserDefinedTokens)? 
userTokensSplit[i]: this.tokenizeASentence(tokens[i]); 353 | 354 | output[i] = { 355 | sentence: tokens[i], 356 | tokens: tokenizedSentence 357 | }; 358 | 359 | } 360 | 361 | return output; 362 | } 363 | 364 | return this.outputPreprocess(article); 365 | } 366 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "textrank", 3 | "version": "1.0.5", 4 | "description": "TextRank JavaScript implementation for automatic text summarization", 5 | "main": "index.js", 6 | "scripts": { 7 | "test": "echo \"Error: no test specified\" && exit 1" 8 | }, 9 | "repository": { 10 | "type": "git", 11 | "url": "https://github.com/nadr0/TextRank-node" 12 | }, 13 | "keywords": [ 14 | "TextRank", 15 | "Automatic", 16 | "Summarization", 17 | "Text", 18 | "Summary" 19 | ], 20 | "author": "Kevin Nadro", 21 | "license": "ISC", 22 | "bugs": { 23 | "url": "https://github.com/nadr0/TextRank-node/issues" 24 | }, 25 | "homepage": "https://github.com/nadr0/TextRank-node" 26 | } 27 | --------------------------------------------------------------------------------