├── .github └── FUNDING.yml ├── LICENSE ├── README.md ├── assets ├── cloud.png ├── font.txt ├── stopwords-en.txt ├── stopwords-es.txt └── whitelist.txt ├── bot.py ├── cloud.py ├── cloud_example.png ├── config.py ├── requirements.txt ├── scraper.py ├── summary.py └── templates └── es.txt /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: agentphantom 2 | patreon: agentphantom 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Phantom Insights 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Article Summarizer 2 | 3 | This project implements a custom algorithm to extract the most important sentences and keywords from Spanish and English news articles. 4 | 5 | It was fully developed in `Python` and it is inspired by similar projects seen on `Reddit` news subreddits that use the term frequency–inverse document frequency (`tf–idf`). 6 | 7 | The 3 most important files are: 8 | 9 | * `scraper.py` : A Python script that performs web scraping on a given HTML source, it extracts the article title, date and body. 10 | 11 | * `summary.py` : A Python script that applies a custom algorithm to a string of text and extracts the top ranked sentences and words. 12 | 13 | * `bot.py` : A Reddit bot that checks a subreddit for its latest submissions. It manages a list of already processed submissions to avoid duplicates. 14 | 15 | ## Requirements 16 | 17 | This project uses the following Python libraries 18 | 19 | * `spaCy` : Used to tokenize the article into sentences and words. 20 | * `PRAW` : Makes the use of the Reddit API very easy. 21 | * `Requests` : To perform HTTP `get` requests to the articles urls. 22 | * `BeautifulSoup` : Used for extracting the article text. 23 | * `html5lib` : This parser got better compatibility when used with `BeautifulSoup`. 24 | * `tldextract` : Used to extract the domain from an url. 25 | * `wordcloud` : Used to create word clouds with the article text. 26 | 27 | After installing the `spaCy` library you must install a language model to be able to tokenize the article. 
For `Spanish` you can run this one:

`python -m spacy download es_core_news_sm`

For other languages please check the following link: https://spacy.io/usage/models

## Reddit Bot

The bot is simple in nature; it uses the `PRAW` library, which makes the Reddit API very straightforward to use. The bot polls a subreddit every 10 minutes to get its latest submissions.

It first checks that the submission hasn't already been processed and then checks that the submission url is in the whitelist. This whitelist is currently curated by me.

If the post and its url pass both checks, web scraping is applied to the url. This is where things start getting interesting.

Before replying to the original submission the bot checks the percentage of reduction achieved; if it's too low or too high it skips the submission and moves on to the next one.

## Web Scraper

The whitelist currently contains more than 300 different news and blog websites. Creating a specialized web scraper for each one is simply not feasible.

The second best thing to do is to make the scraper as accurate as possible.

We start the web scraper in the usual way, with the `Requests` and `BeautifulSoup` libraries.

```python
with requests.get(article_url) as response:

    if response.encoding == "ISO-8859-1":
        response.encoding = "utf-8"

    html_source = response.text

    # Tags that usually delimit blocks of text.
    for item in ["<br>", "</p>", "</div>", "</h1>", "</li>"]:
        html_source = html_source.replace(item, item+"\n")

    soup = BeautifulSoup(html_source, "html5lib")
```

A few times I got encoding issues caused by an incorrect encoding guess. To avoid this issue I force `Requests` to decode with `utf-8`.

Now that we have our article parsed into a `soup` object we start by extracting the title and the published time.

I used similar methods to extract both values: I first check the most common tags and fall back to the next common alternatives.

Not all websites expose their published date, so we sometimes end up with an empty string.

```python
article_title = soup.find("title").text.replace("\n", " ").strip()

# If our title is too short we fall back to the first h1 tag.
if len(article_title) <= 5:
    article_title = soup.find("h1").text.replace("\n", " ").strip()

article_date = ""

# We look for the first meta tag that has the word 'time' in it.
for item in soup.find_all("meta"):

    if "time" in item.get("property", ""):

        clean_date = item["content"].split("+")[0].replace("Z", "")

        # Use your preferred time formatting.
        article_date = "{:%d-%m-%Y a las %H:%M:%S}".format(
            datetime.fromisoformat(clean_date))
        break

# If we didn't find any meta tag with a datetime we look for a 'time' tag.
if len(article_date) <= 5:
    try:
        article_date = soup.find("time").text.strip()
    except:
        pass
```

When extracting the text from different tags I often got the strings without separation. I implemented a little hack that adds a newline after each tag that usually contains text. This significantly improved the overall accuracy of the tokenizer.
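Here is a minimal, self-contained illustration (not part of the bot) of why those artificial newlines matter:

```python
from bs4 import BeautifulSoup

html = "<div><p>First sentence.</p><p>Second sentence.</p></div>"

# Without the hack both paragraphs run together: 'First sentence.Second sentence.'
print(BeautifulSoup(html, "html5lib").text)

# Adding a newline after each closing tag keeps them apart for the sentence tokenizer.
patched = html.replace("</p>", "</p>\n")
print(BeautifulSoup(patched, "html5lib").text)
```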
My original idea was to only accept websites that used the `<article>` tag. It worked ok for the first websites I tested, but I soon realized that very few websites use it, and those that do often don't use it correctly.

```python
article = soup.find("article").text
```

When accessing the `.text` property of the `<article>` tag I noticed I was also getting the JavaScript code. I backtracked a bit and removed all tags which could add *noise* to the article text.

```python
[tag.extract() for tag in soup.find_all(
    ["script", "img", "ul", "time", "h1", "h2", "h3", "iframe", "style", "form", "footer", "figcaption"])]

# These class names/ids are known to add noise or duplicate text to the article.
noisy_names = ["image", "img", "video", "subheadline",
               "hidden", "tract", "caption", "tweet", "expert"]

for tag in soup.find_all("div"):

    # .get() avoids a KeyError on tags that don't have an id attribute.
    tag_id = tag.get("id", "").lower()

    for item in noisy_names:
        if item in tag_id:
            tag.extract()
```

The above code removed most captions, which usually repeat what's already in the article.

After that I applied a 3-step process to get the article text.

First I checked all `<article>` tags and grabbed the one with the longest text.

```python
article = ""

for article_tag in soup.find_all("article"):

    if len(article_tag.text) >= len(article):
        article = article_tag.text
```

That worked fine for websites that properly use the `<article>` tag. The longest tag almost always contains the main article.

But it didn't quite work as expected: I noticed poor quality in the results, and sometimes I was getting excerpts from other articles.

That's when I decided to add a fallback. Instead of only looking for the `<article>` tag I also look for `<div>` and `<section>` tags with commonly used `id`s.

```python
# These names commonly hold the article text.
common_names = ["artic", "summary", "cont", "note", "cuerpo", "body"]

# If the article is too short we look somewhere else.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        tag_id = tag.get("id", "").lower()

        for item in common_names:
            if item in tag_id:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text
```

That increased the accuracy quite a bit. I then repeated the code, but this time looking at the `class` attribute instead of the `id` attribute.

```python
# The article is still too short, let's try one more time.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        tag_class = "".join(tag.get("class", [])).lower()

        for item in common_names:
            if item in tag_class:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text
```

Using all the previous methods greatly increased the overall accuracy of the scraper. In some cases I used partial words that share the same letters in English and Spanish (artic -> article/articulo). The scraper was now compatible with all the urls I tested.

We make a final check: if the article is still too short we abort the process and move to the next url, otherwise we move on to the summary algorithm.

## Summary Algorithm

This algorithm was designed to work primarily on articles written in Spanish. It consists of several steps:

1. Reformat and clean the original article by removing excess whitespace.
2. Make a copy of the original article and remove all commonly used words from it.
3. Split the copied article into words and score each word.
4. Split the original article into sentences and score each sentence using the scores from the words.
5. Take the top 5 sentences and top 5 words and return them in chronological order.

Before starting out we need to initialize the `spaCy` library.

```python
NLP = spacy.load("es_core_news_sm")
```

That line of code loads the `Spanish` model, which is the one I use the most. If you are using another language please refer to the `Requirements` section so you know how to install the appropriate model.

### Clean the Article

When extracting the text from the article we usually get a lot of whitespace, mostly from line breaks (`\n`).

We split the text by that character, strip the whitespace from each piece and join them back together. This is not strictly required, but it helps a lot while debugging the whole process.
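A minimal sketch of that cleanup step, assuming a standalone helper (the function name is illustrative, not the project's actual API):

```python
def clean_whitespace(article):
    """Collapse the line breaks and stray spaces left over from scraping."""
    lines = [line.strip() for line in article.split("\n")]
    return " ".join(line for line in lines if line)


print(clean_whitespace("  First paragraph.\n\n   Second paragraph.\n"))
# First paragraph. Second paragraph.
```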
### Remove Common and Stop Words

At the top of the script we declare the paths of the stop words text files. These stop words are added to a `set`, guaranteeing no duplicates.

I also added a list of Spanish and English words that are not stop words but don't add anything substantial to the article. My personal preference was to hard-code them in lowercase form.

Then I added a copy of each word in uppercase and in title form, which means the `set` will be 3 times its original size.

```python
with open(ES_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

with open(EN_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

extra_words = list()

for word in COMMON_WORDS:
    extra_words.append(word.title())
    extra_words.append(word.upper())

for word in extra_words:
    COMMON_WORDS.add(word)
```

### Scoring Words

Before tokenizing our words we must first pass our cleaned article into the `NLP` pipeline; this is done with one line of code.

```python
doc = NLP(cleaned_article)
```

This `doc` object gives us several iterators; the two we will use are its tokens (obtained by iterating the `doc` itself) and `sents` (its sentences).

At this point I added a personal touch to the algorithm. First I made a copy of the article and then removed all common words from it.

Afterwards I used a `collections.Counter` object to do the initial scoring.

Then I applied a multiplier bonus to words that start with an uppercase letter and are 4 or more characters long. Most of the time those words are names of places, people or organizations.

Finally I set the score to zero for all words that are actually numbers.

```python
words_of_interest = [
    token.text for token in doc if token.text not in COMMON_WORDS]

scored_words = Counter(words_of_interest)

for word in scored_words:

    if word[0].isupper() and len(word) >= 4:
        scored_words[word] *= 3

    if word.isdigit():
        scored_words[word] = 0
```
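To make the scoring concrete, here is a small worked example with made-up tokens, assuming they already survived the stop-word filter:

```python
from collections import Counter

# Hypothetical tokens left over after removing common words.
words_of_interest = ["Banxico", "tasa", "Banxico", "25", "tasa", "tasa"]

scored_words = Counter(words_of_interest)  # Counter({'tasa': 3, 'Banxico': 2, '25': 1})

for word in scored_words:

    if word[0].isupper() and len(word) >= 4:
        scored_words[word] *= 3  # 'Banxico' jumps from 2 to 6.

    if word.isdigit():
        scored_words[word] = 0   # '25' is discarded.

print(scored_words.most_common())  # [('Banxico', 6), ('tasa', 3), ('25', 0)]
```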
### Scoring Sentences

Now that we have the final scores for each word it is time to score each sentence in the article.

To do this we first need to split the article into sentences. I tried various approaches, including `RegEx`, but the one that worked best was the `spaCy` library.

We iterate again over the `doc` object we defined in the previous step, but this time over its `sents` property.

Something to note is that `doc.sents` yields sentence spans, and we can retrieve each sentence's text by accessing its `text` property.

```python
article_sentences = [sent for sent in doc.sents]

scored_sentences = list()

for index, sent in enumerate(article_sentences):

    # In some edge cases we have duplicated sentences, we make sure that doesn't happen.
    if sent.text not in [sent for score, index, sent in scored_sentences]:
        scored_sentences.append(
            [score_line(sent, scored_words), index, sent.text])
```

`scored_sentences` is a list of lists. Each inner list contains 3 values: the sentence score, its index and the sentence itself. Those values will be used in the next step.

The code below shows how the lines are scored.

```python
def score_line(line, scored_words):

    # We remove the common words.
    cleaned_line = [
        token.text for token in line if token.text not in COMMON_WORDS]

    # We now sum the scores of all the remaining words.
    temp_score = 0

    for word in cleaned_line:
        temp_score += scored_words[word]

    # We apply a bonus score to sentences that contain financial information.
    line_lowercase = line.text.lower()

    for word in FINANCIAL_WORDS:
        if word in line_lowercase:
            temp_score *= 1.5
            break

    return temp_score
```

We apply a multiplier to sentences that contain any word that refers to money or finance.

### Chronological Order

This is the final part of the algorithm: we make use of the `sorted()` function to get the top sentences and then reorder them into their original positions.

We sort `scored_sentences` in reverse order, which gives us the top scored sentences first. We keep a small counter variable so the loop breaks once it reaches 5. We also discard all sentences that are 3 characters or less (sometimes there are sneaky zero-width characters).

```python
top_sentences = list()
counter = 0

for score, index, sentence in sorted(scored_sentences, reverse=True):

    if counter >= 5:
        break

    # When the article is too small the sentences may come out empty.
    if len(sentence) >= 3:

        # We append the sentence and its index so we can sort in chronological order.
        top_sentences.append([index, sentence])
        counter += 1

return [sentence for index, sentence in sorted(top_sentences)]
```

At the end we use a list comprehension to return only the sentences, which are already sorted in chronological order.

### Word Cloud

Just for fun I added a word cloud to each article. To do so I used the `wordcloud` library. This library is very easy to use: you only need to declare a `WordCloud` object and call its `generate` method with a string of text as its parameter.

```python
wc = wordcloud.WordCloud()  # See cloud.py for the full parameters.
wc.generate(prepared_article)
wc.to_file("./temp.png")
```

After generating the image I uploaded it to `Imgur`, got back the url link and added it to the `Markdown` message.

![Word cloud example](cloud_example.png)

## Conclusion

This was a very fun and interesting project to work on. I may have reinvented the wheel, but at least I learned a few cool things.

I'm satisfied with the overall quality of the results and I will keep tweaking the algorithm and applying compatibility enhancements.

As a side note, when testing the script I accidentally fed it Tweets, Facebook posts and articles written in English. All of them got acceptable outputs, but since those sites were not the target I removed them from the whitelist.

After some weeks of feedback I decided to add support for the English language. This required a bit of refactoring.

To make it work with other languages you only need a text file containing all the stop words of said language and to copy a few lines of code (see the Remove Common and Stop Words section).
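For example, adding French support might look roughly like this. The `stopwords-fr.txt` file is hypothetical and would have to be supplied by you, and `fr_core_news_sm` is the matching model from the spaCy models page linked in the Requirements section:

```python
import spacy

# In the project this set already holds the Spanish and English stop words.
COMMON_WORDS = set()

# Hypothetical stop words file for the new language.
FR_STOPWORDS_FILE = "./assets/stopwords-fr.txt"

with open(FR_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

# Load the matching model (installed with: python -m spacy download fr_core_news_sm).
NLP = spacy.load("fr_core_news_sm")
```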
382 | 383 | [![Become a Patron!](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/bePatron?u=20521425) 384 | -------------------------------------------------------------------------------- /assets/cloud.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PhantomInsights/summarizer/d8b4d7745ca9ba4309fc9707b7c98ae143b97a10/assets/cloud.png -------------------------------------------------------------------------------- /assets/font.txt: -------------------------------------------------------------------------------- 1 | The name of the font used by this project is Sofia Pro Light 2 | 3 | The font is free, to download it you need to purchase it from the following link: 4 | 5 | https://www.fontspring.com/fonts/mostardesign/sofia-pro -------------------------------------------------------------------------------- /assets/stopwords-en.txt: -------------------------------------------------------------------------------- 1 | 'll 2 | 'tis 3 | 'twas 4 | 've 5 | 10 6 | 39 7 | a 8 | a's 9 | able 10 | ableabout 11 | about 12 | above 13 | abroad 14 | abst 15 | accordance 16 | according 17 | accordingly 18 | across 19 | act 20 | actually 21 | ad 22 | added 23 | adj 24 | adopted 25 | ae 26 | af 27 | affected 28 | affecting 29 | affects 30 | after 31 | afterwards 32 | ag 33 | again 34 | against 35 | ago 36 | ah 37 | ahead 38 | ai 39 | ain't 40 | aint 41 | al 42 | all 43 | allow 44 | allows 45 | almost 46 | alone 47 | along 48 | alongside 49 | already 50 | also 51 | although 52 | always 53 | am 54 | amid 55 | amidst 56 | among 57 | amongst 58 | amoungst 59 | amount 60 | an 61 | and 62 | announce 63 | another 64 | any 65 | anybody 66 | anyhow 67 | anymore 68 | anyone 69 | anything 70 | anyway 71 | anyways 72 | anywhere 73 | ao 74 | apart 75 | apparently 76 | appear 77 | appreciate 78 | appropriate 79 | approximately 80 | aq 81 | ar 82 | are 83 | area 84 | areas 85 | aren 86 | aren't 87 | arent 88 | arise 89 | around 90 | arpa 91 | as 92 | aside 93 | ask 94 | asked 95 | asking 96 | asks 97 | associated 98 | at 99 | au 100 | auth 101 | available 102 | aw 103 | away 104 | awfully 105 | az 106 | b 107 | ba 108 | back 109 | backed 110 | backing 111 | backs 112 | backward 113 | backwards 114 | bb 115 | bd 116 | be 117 | became 118 | because 119 | become 120 | becomes 121 | becoming 122 | been 123 | before 124 | beforehand 125 | began 126 | begin 127 | beginning 128 | beginnings 129 | begins 130 | behind 131 | being 132 | beings 133 | believe 134 | below 135 | beside 136 | besides 137 | best 138 | better 139 | between 140 | beyond 141 | bf 142 | bg 143 | bh 144 | bi 145 | big 146 | bill 147 | billion 148 | biol 149 | bj 150 | bm 151 | bn 152 | bo 153 | both 154 | bottom 155 | br 156 | brief 157 | briefly 158 | bs 159 | bt 160 | but 161 | buy 162 | bv 163 | bw 164 | by 165 | bz 166 | c 167 | c'mon 168 | c's 169 | ca 170 | call 171 | came 172 | can 173 | can't 174 | cannot 175 | cant 176 | caption 177 | case 178 | cases 179 | cause 180 | causes 181 | cc 182 | cd 183 | certain 184 | certainly 185 | cf 186 | cg 187 | ch 188 | changes 189 | ci 190 | ck 191 | cl 192 | clear 193 | clearly 194 | click 195 | cm 196 | cmon 197 | cn 198 | co 199 | co. 
200 | com 201 | come 202 | comes 203 | computer 204 | con 205 | concerning 206 | consequently 207 | consider 208 | considering 209 | contain 210 | containing 211 | contains 212 | copy 213 | corresponding 214 | could 215 | could've 216 | couldn 217 | couldn't 218 | couldnt 219 | course 220 | cr 221 | cry 222 | cs 223 | cu 224 | currently 225 | cv 226 | cx 227 | cy 228 | cz 229 | d 230 | dare 231 | daren't 232 | darent 233 | date 234 | de 235 | dear 236 | definitely 237 | describe 238 | described 239 | despite 240 | detail 241 | did 242 | didn 243 | didn't 244 | didnt 245 | differ 246 | different 247 | differently 248 | directly 249 | dj 250 | dk 251 | dm 252 | do 253 | does 254 | doesn 255 | doesn't 256 | doesnt 257 | doing 258 | don 259 | don't 260 | done 261 | dont 262 | doubtful 263 | down 264 | downed 265 | downing 266 | downs 267 | downwards 268 | due 269 | during 270 | dz 271 | e 272 | each 273 | early 274 | ec 275 | ed 276 | edu 277 | ee 278 | effect 279 | eg 280 | eh 281 | eight 282 | eighty 283 | either 284 | eleven 285 | else 286 | elsewhere 287 | empty 288 | end 289 | ended 290 | ending 291 | ends 292 | enough 293 | entirely 294 | er 295 | es 296 | especially 297 | et 298 | et-al 299 | etc 300 | even 301 | evenly 302 | ever 303 | evermore 304 | every 305 | everybody 306 | everyone 307 | everything 308 | everywhere 309 | ex 310 | exactly 311 | example 312 | except 313 | f 314 | face 315 | faces 316 | fact 317 | facts 318 | fairly 319 | far 320 | farther 321 | felt 322 | few 323 | fewer 324 | ff 325 | fi 326 | fifteen 327 | fifth 328 | fifty 329 | fify 330 | fill 331 | find 332 | finds 333 | fire 334 | first 335 | five 336 | fix 337 | fj 338 | fk 339 | fm 340 | fo 341 | followed 342 | following 343 | follows 344 | for 345 | forever 346 | former 347 | formerly 348 | forth 349 | forty 350 | forward 351 | found 352 | four 353 | fr 354 | free 355 | from 356 | front 357 | full 358 | fully 359 | further 360 | furthered 361 | furthering 362 | furthermore 363 | furthers 364 | fx 365 | g 366 | ga 367 | gave 368 | gb 369 | gd 370 | ge 371 | general 372 | generally 373 | get 374 | gets 375 | getting 376 | gf 377 | gg 378 | gh 379 | gi 380 | give 381 | given 382 | gives 383 | giving 384 | gl 385 | gm 386 | gmt 387 | gn 388 | go 389 | goes 390 | going 391 | gone 392 | good 393 | goods 394 | got 395 | gotten 396 | gov 397 | gp 398 | gq 399 | gr 400 | great 401 | greater 402 | greatest 403 | greetings 404 | group 405 | grouped 406 | grouping 407 | groups 408 | gs 409 | gt 410 | gu 411 | gw 412 | gy 413 | h 414 | had 415 | hadn't 416 | hadnt 417 | half 418 | happens 419 | hardly 420 | has 421 | hasn 422 | hasn't 423 | hasnt 424 | have 425 | haven 426 | haven't 427 | havent 428 | having 429 | he 430 | he'd 431 | he'll 432 | he's 433 | hed 434 | hell 435 | hello 436 | help 437 | hence 438 | her 439 | here 440 | here's 441 | hereafter 442 | hereby 443 | herein 444 | heres 445 | hereupon 446 | hers 447 | herself 448 | herse” 449 | hes 450 | hi 451 | hid 452 | high 453 | higher 454 | highest 455 | him 456 | himself 457 | himse” 458 | his 459 | hither 460 | hk 461 | hm 462 | hn 463 | home 464 | homepage 465 | hopefully 466 | how 467 | how'd 468 | how'll 469 | how's 470 | howbeit 471 | however 472 | hr 473 | ht 474 | htm 475 | html 476 | http 477 | hu 478 | hundred 479 | i 480 | i'd 481 | i'll 482 | i'm 483 | i've 484 | i.e. 
485 | id 486 | ie 487 | if 488 | ignored 489 | ii 490 | il 491 | ill 492 | im 493 | immediate 494 | immediately 495 | importance 496 | important 497 | in 498 | inasmuch 499 | inc 500 | inc. 501 | indeed 502 | index 503 | indicate 504 | indicated 505 | indicates 506 | information 507 | inner 508 | inside 509 | insofar 510 | instead 511 | int 512 | interest 513 | interested 514 | interesting 515 | interests 516 | into 517 | invention 518 | inward 519 | io 520 | iq 521 | ir 522 | is 523 | isn 524 | isn't 525 | isnt 526 | it 527 | it'd 528 | it'll 529 | it's 530 | itd 531 | itll 532 | its 533 | itself 534 | itse” 535 | ive 536 | j 537 | je 538 | jm 539 | jo 540 | join 541 | jp 542 | just 543 | k 544 | ke 545 | keep 546 | keeps 547 | kept 548 | keys 549 | kg 550 | kh 551 | ki 552 | kind 553 | km 554 | kn 555 | knew 556 | know 557 | known 558 | knows 559 | kp 560 | kr 561 | kw 562 | ky 563 | kz 564 | l 565 | la 566 | large 567 | largely 568 | last 569 | lately 570 | later 571 | latest 572 | latter 573 | latterly 574 | lb 575 | lc 576 | least 577 | length 578 | less 579 | lest 580 | let 581 | let's 582 | lets 583 | li 584 | like 585 | liked 586 | likely 587 | likewise 588 | line 589 | little 590 | lk 591 | ll 592 | long 593 | longer 594 | longest 595 | look 596 | looking 597 | looks 598 | low 599 | lower 600 | lr 601 | ls 602 | lt 603 | ltd 604 | lu 605 | lv 606 | ly 607 | m 608 | ma 609 | made 610 | mainly 611 | make 612 | makes 613 | making 614 | man 615 | many 616 | may 617 | maybe 618 | mayn't 619 | maynt 620 | mc 621 | md 622 | me 623 | mean 624 | means 625 | meantime 626 | meanwhile 627 | member 628 | members 629 | men 630 | merely 631 | mg 632 | mh 633 | microsoft 634 | might 635 | might've 636 | mightn't 637 | mightnt 638 | mil 639 | mill 640 | million 641 | mine 642 | minus 643 | miss 644 | mk 645 | ml 646 | mm 647 | mn 648 | mo 649 | more 650 | moreover 651 | most 652 | mostly 653 | move 654 | mp 655 | mq 656 | mr 657 | mrs 658 | ms 659 | msie 660 | mt 661 | mu 662 | much 663 | mug 664 | must 665 | must've 666 | mustn't 667 | mustnt 668 | mv 669 | mw 670 | mx 671 | my 672 | myself 673 | myse” 674 | mz 675 | n 676 | na 677 | name 678 | namely 679 | nay 680 | nc 681 | nd 682 | ne 683 | near 684 | nearly 685 | necessarily 686 | necessary 687 | need 688 | needed 689 | needing 690 | needn't 691 | neednt 692 | needs 693 | neither 694 | net 695 | netscape 696 | never 697 | neverf 698 | neverless 699 | nevertheless 700 | new 701 | newer 702 | newest 703 | next 704 | nf 705 | ng 706 | ni 707 | nine 708 | ninety 709 | nl 710 | no 711 | no-one 712 | nobody 713 | non 714 | none 715 | nonetheless 716 | noone 717 | nor 718 | normally 719 | nos 720 | not 721 | noted 722 | nothing 723 | notwithstanding 724 | novel 725 | now 726 | nowhere 727 | np 728 | nr 729 | nu 730 | null 731 | number 732 | numbers 733 | nz 734 | o 735 | obtain 736 | obtained 737 | obviously 738 | of 739 | off 740 | often 741 | oh 742 | ok 743 | okay 744 | old 745 | older 746 | oldest 747 | om 748 | omitted 749 | on 750 | once 751 | one 752 | one's 753 | ones 754 | only 755 | onto 756 | open 757 | opened 758 | opening 759 | opens 760 | opposite 761 | or 762 | ord 763 | order 764 | ordered 765 | ordering 766 | orders 767 | org 768 | other 769 | others 770 | otherwise 771 | ought 772 | oughtn't 773 | oughtnt 774 | our 775 | ours 776 | ourselves 777 | out 778 | outside 779 | over 780 | overall 781 | owing 782 | own 783 | p 784 | pa 785 | page 786 | pages 787 | part 788 | parted 789 | particular 790 | particularly 791 | parting 792 | 
parts 793 | past 794 | pe 795 | per 796 | perhaps 797 | pf 798 | pg 799 | ph 800 | pk 801 | pl 802 | place 803 | placed 804 | places 805 | please 806 | plus 807 | pm 808 | pmid 809 | pn 810 | point 811 | pointed 812 | pointing 813 | points 814 | poorly 815 | possible 816 | possibly 817 | potentially 818 | pp 819 | pr 820 | predominantly 821 | present 822 | presented 823 | presenting 824 | presents 825 | presumably 826 | previously 827 | primarily 828 | probably 829 | problem 830 | problems 831 | promptly 832 | proud 833 | provided 834 | provides 835 | pt 836 | put 837 | puts 838 | pw 839 | py 840 | q 841 | qa 842 | que 843 | quickly 844 | quite 845 | qv 846 | r 847 | ran 848 | rather 849 | rd 850 | re 851 | readily 852 | really 853 | reasonably 854 | recent 855 | recently 856 | ref 857 | refs 858 | regarding 859 | regardless 860 | regards 861 | related 862 | relatively 863 | research 864 | reserved 865 | respectively 866 | resulted 867 | resulting 868 | results 869 | right 870 | ring 871 | ro 872 | room 873 | rooms 874 | round 875 | ru 876 | run 877 | rw 878 | s 879 | sa 880 | said 881 | same 882 | saw 883 | say 884 | saying 885 | says 886 | sb 887 | sc 888 | sd 889 | se 890 | sec 891 | second 892 | secondly 893 | seconds 894 | section 895 | see 896 | seeing 897 | seem 898 | seemed 899 | seeming 900 | seems 901 | seen 902 | sees 903 | self 904 | selves 905 | sensible 906 | sent 907 | serious 908 | seriously 909 | seven 910 | seventy 911 | several 912 | sg 913 | sh 914 | shall 915 | shan't 916 | shant 917 | she 918 | she'd 919 | she'll 920 | she's 921 | shed 922 | shell 923 | shes 924 | should 925 | should've 926 | shouldn 927 | shouldn't 928 | shouldnt 929 | show 930 | showed 931 | showing 932 | shown 933 | showns 934 | shows 935 | si 936 | side 937 | sides 938 | significant 939 | significantly 940 | similar 941 | similarly 942 | since 943 | sincere 944 | site 945 | six 946 | sixty 947 | sj 948 | sk 949 | sl 950 | slightly 951 | sm 952 | small 953 | smaller 954 | smallest 955 | sn 956 | so 957 | some 958 | somebody 959 | someday 960 | somehow 961 | someone 962 | somethan 963 | something 964 | sometime 965 | sometimes 966 | somewhat 967 | somewhere 968 | soon 969 | sorry 970 | specifically 971 | specified 972 | specify 973 | specifying 974 | sr 975 | st 976 | state 977 | states 978 | still 979 | stop 980 | strongly 981 | su 982 | sub 983 | substantially 984 | successfully 985 | such 986 | sufficiently 987 | suggest 988 | sup 989 | sure 990 | sv 991 | sy 992 | system 993 | sz 994 | t 995 | t's 996 | take 997 | taken 998 | taking 999 | tc 1000 | td 1001 | tell 1002 | ten 1003 | tends 1004 | test 1005 | text 1006 | tf 1007 | tg 1008 | th 1009 | than 1010 | thank 1011 | thanks 1012 | thanx 1013 | that 1014 | that'll 1015 | that's 1016 | that've 1017 | thatll 1018 | thats 1019 | thatve 1020 | the 1021 | their 1022 | theirs 1023 | them 1024 | themselves 1025 | then 1026 | thence 1027 | there 1028 | there'd 1029 | there'll 1030 | there're 1031 | there's 1032 | there've 1033 | thereafter 1034 | thereby 1035 | thered 1036 | therefore 1037 | therein 1038 | therell 1039 | thereof 1040 | therere 1041 | theres 1042 | thereto 1043 | thereupon 1044 | thereve 1045 | these 1046 | they 1047 | they'd 1048 | they'll 1049 | they're 1050 | they've 1051 | theyd 1052 | theyll 1053 | theyre 1054 | theyve 1055 | thick 1056 | thin 1057 | thing 1058 | things 1059 | think 1060 | thinks 1061 | third 1062 | thirty 1063 | this 1064 | thorough 1065 | thoroughly 1066 | those 1067 | thou 1068 | though 1069 | thoughh 1070 | 
thought 1071 | thoughts 1072 | thousand 1073 | three 1074 | throug 1075 | through 1076 | throughout 1077 | thru 1078 | thus 1079 | til 1080 | till 1081 | tip 1082 | tis 1083 | tj 1084 | tk 1085 | tm 1086 | tn 1087 | to 1088 | today 1089 | together 1090 | too 1091 | took 1092 | top 1093 | toward 1094 | towards 1095 | tp 1096 | tr 1097 | tried 1098 | tries 1099 | trillion 1100 | truly 1101 | try 1102 | trying 1103 | ts 1104 | tt 1105 | turn 1106 | turned 1107 | turning 1108 | turns 1109 | tv 1110 | tw 1111 | twas 1112 | twelve 1113 | twenty 1114 | twice 1115 | two 1116 | tz 1117 | u 1118 | ua 1119 | ug 1120 | uk 1121 | um 1122 | un 1123 | under 1124 | underneath 1125 | undoing 1126 | unfortunately 1127 | unless 1128 | unlike 1129 | unlikely 1130 | until 1131 | unto 1132 | up 1133 | upon 1134 | ups 1135 | upwards 1136 | us 1137 | use 1138 | used 1139 | useful 1140 | usefully 1141 | usefulness 1142 | uses 1143 | using 1144 | usually 1145 | uucp 1146 | uy 1147 | uz 1148 | v 1149 | va 1150 | value 1151 | various 1152 | vc 1153 | ve 1154 | versus 1155 | very 1156 | vg 1157 | vi 1158 | via 1159 | viz 1160 | vn 1161 | vol 1162 | vols 1163 | vs 1164 | vu 1165 | w 1166 | want 1167 | wanted 1168 | wanting 1169 | wants 1170 | was 1171 | wasn 1172 | wasn't 1173 | wasnt 1174 | way 1175 | ways 1176 | we 1177 | we'd 1178 | we'll 1179 | we're 1180 | we've 1181 | web 1182 | webpage 1183 | website 1184 | wed 1185 | welcome 1186 | well 1187 | wells 1188 | went 1189 | were 1190 | weren 1191 | weren't 1192 | werent 1193 | weve 1194 | wf 1195 | what 1196 | what'd 1197 | what'll 1198 | what's 1199 | what've 1200 | whatever 1201 | whatll 1202 | whats 1203 | whatve 1204 | when 1205 | when'd 1206 | when'll 1207 | when's 1208 | whence 1209 | whenever 1210 | where 1211 | where'd 1212 | where'll 1213 | where's 1214 | whereafter 1215 | whereas 1216 | whereby 1217 | wherein 1218 | wheres 1219 | whereupon 1220 | wherever 1221 | whether 1222 | which 1223 | whichever 1224 | while 1225 | whilst 1226 | whim 1227 | whither 1228 | who 1229 | who'd 1230 | who'll 1231 | who's 1232 | whod 1233 | whoever 1234 | whole 1235 | wholl 1236 | whom 1237 | whomever 1238 | whos 1239 | whose 1240 | why 1241 | why'd 1242 | why'll 1243 | why's 1244 | widely 1245 | width 1246 | will 1247 | willing 1248 | wish 1249 | with 1250 | within 1251 | without 1252 | won 1253 | won't 1254 | wonder 1255 | wont 1256 | words 1257 | work 1258 | worked 1259 | working 1260 | works 1261 | world 1262 | would 1263 | would've 1264 | wouldn 1265 | wouldn't 1266 | wouldnt 1267 | ws 1268 | www 1269 | x 1270 | y 1271 | ye 1272 | year 1273 | years 1274 | yes 1275 | yet 1276 | you 1277 | you'd 1278 | you'll 1279 | you're 1280 | you've 1281 | youd 1282 | youll 1283 | young 1284 | younger 1285 | youngest 1286 | your 1287 | youre 1288 | yours 1289 | yourself 1290 | yourselves 1291 | youve 1292 | yt 1293 | yu 1294 | z 1295 | za 1296 | zero 1297 | zm 1298 | zr -------------------------------------------------------------------------------- /assets/stopwords-es.txt: -------------------------------------------------------------------------------- 1 | 0 2 | 1 3 | 2 4 | 3 5 | 4 6 | 5 7 | 6 8 | 7 9 | 8 10 | 9 11 | _ 12 | a 13 | actualmente 14 | acuerdo 15 | adelante 16 | ademas 17 | además 18 | adrede 19 | afirmó 20 | agregó 21 | ahi 22 | ahora 23 | ahí 24 | al 25 | algo 26 | alguna 27 | algunas 28 | alguno 29 | algunos 30 | algún 31 | alli 32 | allí 33 | alrededor 34 | ambos 35 | ampleamos 36 | antano 37 | antaño 38 | ante 39 | anterior 40 | antes 41 | apenas 42 | aproximadamente 
43 | aquel 44 | aquella 45 | aquellas 46 | aquello 47 | aquellos 48 | aqui 49 | aquél 50 | aquélla 51 | aquéllas 52 | aquéllos 53 | aquí 54 | arriba 55 | arribaabajo 56 | aseguró 57 | asi 58 | así 59 | atras 60 | aun 61 | aunque 62 | ayer 63 | añadió 64 | aún 65 | b 66 | bajo 67 | bastante 68 | bien 69 | breve 70 | buen 71 | buena 72 | buenas 73 | bueno 74 | buenos 75 | c 76 | cada 77 | casi 78 | cerca 79 | cierta 80 | ciertas 81 | cierto 82 | ciertos 83 | cinco 84 | claro 85 | comentó 86 | como 87 | con 88 | conmigo 89 | conocer 90 | conseguimos 91 | conseguir 92 | considera 93 | consideró 94 | consigo 95 | consigue 96 | consiguen 97 | consigues 98 | contigo 99 | contra 100 | cosas 101 | creo 102 | cual 103 | cuales 104 | cualquier 105 | cuando 106 | cuanta 107 | cuantas 108 | cuanto 109 | cuantos 110 | cuatro 111 | cuenta 112 | cuál 113 | cuáles 114 | cuándo 115 | cuánta 116 | cuántas 117 | cuánto 118 | cuántos 119 | cómo 120 | d 121 | da 122 | dado 123 | dan 124 | dar 125 | de 126 | debajo 127 | debe 128 | deben 129 | debido 130 | decir 131 | dejó 132 | del 133 | delante 134 | demasiado 135 | demás 136 | dentro 137 | deprisa 138 | desde 139 | despacio 140 | despues 141 | después 142 | detras 143 | detrás 144 | dia 145 | dias 146 | dice 147 | dicen 148 | dicho 149 | dieron 150 | diferente 151 | diferentes 152 | dijeron 153 | dijo 154 | dio 155 | donde 156 | dos 157 | durante 158 | día 159 | días 160 | dónde 161 | e 162 | ejemplo 163 | el 164 | ella 165 | ellas 166 | ello 167 | ellos 168 | embargo 169 | empleais 170 | emplean 171 | emplear 172 | empleas 173 | empleo 174 | en 175 | encima 176 | encuentra 177 | enfrente 178 | enseguida 179 | entonces 180 | entre 181 | era 182 | erais 183 | eramos 184 | eran 185 | eras 186 | eres 187 | es 188 | esa 189 | esas 190 | ese 191 | eso 192 | esos 193 | esta 194 | estaba 195 | estabais 196 | estaban 197 | estabas 198 | estad 199 | estada 200 | estadas 201 | estado 202 | estados 203 | estais 204 | estamos 205 | estan 206 | estando 207 | estar 208 | estaremos 209 | estará 210 | estarán 211 | estarás 212 | estaré 213 | estaréis 214 | estaría 215 | estaríais 216 | estaríamos 217 | estarían 218 | estarías 219 | estas 220 | este 221 | estemos 222 | esto 223 | estos 224 | estoy 225 | estuve 226 | estuviera 227 | estuvierais 228 | estuvieran 229 | estuvieras 230 | estuvieron 231 | estuviese 232 | estuvieseis 233 | estuviesen 234 | estuvieses 235 | estuvimos 236 | estuviste 237 | estuvisteis 238 | estuviéramos 239 | estuviésemos 240 | estuvo 241 | está 242 | estábamos 243 | estáis 244 | están 245 | estás 246 | esté 247 | estéis 248 | estén 249 | estés 250 | ex 251 | excepto 252 | existe 253 | existen 254 | explicó 255 | expresó 256 | f 257 | fin 258 | final 259 | fue 260 | fuera 261 | fuerais 262 | fueran 263 | fueras 264 | fueron 265 | fuese 266 | fueseis 267 | fuesen 268 | fueses 269 | fui 270 | fuimos 271 | fuiste 272 | fuisteis 273 | fuéramos 274 | fuésemos 275 | g 276 | general 277 | gran 278 | grandes 279 | gueno 280 | h 281 | ha 282 | haber 283 | habia 284 | habida 285 | habidas 286 | habido 287 | habidos 288 | habiendo 289 | habla 290 | hablan 291 | habremos 292 | habrá 293 | habrán 294 | habrás 295 | habré 296 | habréis 297 | habría 298 | habríais 299 | habríamos 300 | habrían 301 | habrías 302 | habéis 303 | había 304 | habíais 305 | habíamos 306 | habían 307 | habías 308 | hace 309 | haceis 310 | hacemos 311 | hacen 312 | hacer 313 | hacerlo 314 | haces 315 | hacia 316 | haciendo 317 | hago 318 | han 319 | has 320 | hasta 321 | hay 322 | haya 323 
| hayamos 324 | hayan 325 | hayas 326 | hayáis 327 | he 328 | hecho 329 | hemos 330 | hicieron 331 | hizo 332 | horas 333 | hoy 334 | hube 335 | hubiera 336 | hubierais 337 | hubieran 338 | hubieras 339 | hubieron 340 | hubiese 341 | hubieseis 342 | hubiesen 343 | hubieses 344 | hubimos 345 | hubiste 346 | hubisteis 347 | hubiéramos 348 | hubiésemos 349 | hubo 350 | i 351 | igual 352 | incluso 353 | indicó 354 | informo 355 | informó 356 | intenta 357 | intentais 358 | intentamos 359 | intentan 360 | intentar 361 | intentas 362 | intento 363 | ir 364 | j 365 | junto 366 | k 367 | l 368 | la 369 | lado 370 | largo 371 | las 372 | le 373 | lejos 374 | les 375 | llegó 376 | lleva 377 | llevar 378 | lo 379 | los 380 | luego 381 | lugar 382 | m 383 | mal 384 | manera 385 | manifestó 386 | mas 387 | mayor 388 | me 389 | mediante 390 | medio 391 | mejor 392 | mencionó 393 | menos 394 | menudo 395 | mi 396 | mia 397 | mias 398 | mientras 399 | mio 400 | mios 401 | mis 402 | misma 403 | mismas 404 | mismo 405 | mismos 406 | modo 407 | momento 408 | mucha 409 | muchas 410 | mucho 411 | muchos 412 | muy 413 | más 414 | mí 415 | mía 416 | mías 417 | mío 418 | míos 419 | n 420 | nada 421 | nadie 422 | ni 423 | ninguna 424 | ningunas 425 | ninguno 426 | ningunos 427 | ningún 428 | no 429 | nos 430 | nosotras 431 | nosotros 432 | nuestra 433 | nuestras 434 | nuestro 435 | nuestros 436 | nueva 437 | nuevas 438 | nuevo 439 | nuevos 440 | nunca 441 | o 442 | ocho 443 | os 444 | otra 445 | otras 446 | otro 447 | otros 448 | p 449 | pais 450 | para 451 | parece 452 | parte 453 | partir 454 | pasada 455 | pasado 456 | paìs 457 | peor 458 | pero 459 | pesar 460 | poca 461 | pocas 462 | poco 463 | pocos 464 | podeis 465 | podemos 466 | poder 467 | podria 468 | podriais 469 | podriamos 470 | podrian 471 | podrias 472 | podrá 473 | podrán 474 | podría 475 | podrían 476 | poner 477 | por 478 | por qué 479 | porque 480 | posible 481 | primer 482 | primera 483 | primero 484 | primeros 485 | principalmente 486 | pronto 487 | propia 488 | propias 489 | propio 490 | propios 491 | proximo 492 | próximo 493 | próximos 494 | pudo 495 | pueda 496 | puede 497 | pueden 498 | puedo 499 | pues 500 | q 501 | qeu 502 | que 503 | quedó 504 | queremos 505 | quien 506 | quienes 507 | quiere 508 | quiza 509 | quizas 510 | quizá 511 | quizás 512 | quién 513 | quiénes 514 | qué 515 | r 516 | raras 517 | realizado 518 | realizar 519 | realizó 520 | repente 521 | respecto 522 | s 523 | sabe 524 | sabeis 525 | sabemos 526 | saben 527 | saber 528 | sabes 529 | sal 530 | salvo 531 | se 532 | sea 533 | seamos 534 | sean 535 | seas 536 | segun 537 | segunda 538 | segundo 539 | según 540 | seis 541 | ser 542 | sera 543 | seremos 544 | será 545 | serán 546 | serás 547 | seré 548 | seréis 549 | sería 550 | seríais 551 | seríamos 552 | serían 553 | serías 554 | seáis 555 | señaló 556 | si 557 | sido 558 | siempre 559 | siendo 560 | siete 561 | sigue 562 | siguiente 563 | sin 564 | sino 565 | sobre 566 | sois 567 | sola 568 | solamente 569 | solas 570 | solo 571 | solos 572 | somos 573 | son 574 | soy 575 | soyos 576 | su 577 | supuesto 578 | sus 579 | suya 580 | suyas 581 | suyo 582 | suyos 583 | sé 584 | sí 585 | sólo 586 | t 587 | tal 588 | tambien 589 | también 590 | tampoco 591 | tan 592 | tanto 593 | tarde 594 | te 595 | temprano 596 | tendremos 597 | tendrá 598 | tendrán 599 | tendrás 600 | tendré 601 | tendréis 602 | tendría 603 | tendríais 604 | tendríamos 605 | tendrían 606 | tendrías 607 | tened 608 | teneis 609 | tenemos 610 | tener 
611 | tenga 612 | tengamos 613 | tengan 614 | tengas 615 | tengo 616 | tengáis 617 | tenida 618 | tenidas 619 | tenido 620 | tenidos 621 | teniendo 622 | tenéis 623 | tenía 624 | teníais 625 | teníamos 626 | tenían 627 | tenías 628 | tercera 629 | ti 630 | tiempo 631 | tiene 632 | tienen 633 | tienes 634 | toda 635 | todas 636 | todavia 637 | todavía 638 | todo 639 | todos 640 | total 641 | trabaja 642 | trabajais 643 | trabajamos 644 | trabajan 645 | trabajar 646 | trabajas 647 | trabajo 648 | tras 649 | trata 650 | través 651 | tres 652 | tu 653 | tus 654 | tuve 655 | tuviera 656 | tuvierais 657 | tuvieran 658 | tuvieras 659 | tuvieron 660 | tuviese 661 | tuvieseis 662 | tuviesen 663 | tuvieses 664 | tuvimos 665 | tuviste 666 | tuvisteis 667 | tuviéramos 668 | tuviésemos 669 | tuvo 670 | tuya 671 | tuyas 672 | tuyo 673 | tuyos 674 | tú 675 | u 676 | ultimo 677 | un 678 | una 679 | unas 680 | uno 681 | unos 682 | usa 683 | usais 684 | usamos 685 | usan 686 | usar 687 | usas 688 | uso 689 | usted 690 | ustedes 691 | v 692 | va 693 | vais 694 | valor 695 | vamos 696 | van 697 | varias 698 | varios 699 | vaya 700 | veces 701 | ver 702 | verdad 703 | verdadera 704 | verdadero 705 | vez 706 | vosotras 707 | vosotros 708 | voy 709 | vuestra 710 | vuestras 711 | vuestro 712 | vuestros 713 | w 714 | x 715 | y 716 | ya 717 | yo 718 | z 719 | él 720 | éramos 721 | ésa 722 | ésas 723 | ése 724 | ésos 725 | ésta 726 | éstas 727 | éste 728 | éstos 729 | última 730 | últimas 731 | último 732 | últimos -------------------------------------------------------------------------------- /assets/whitelist.txt: -------------------------------------------------------------------------------- 1 | rotativo.com.mx 2 | excelsior.com.mx 3 | yogonet.com 4 | eluniversal.com.mx 5 | nyti.ms 6 | unocero.com 7 | mexico.com 8 | thecoinrepublic.com 9 | costumbres.de 10 | bbc.com 11 | avclub.com 12 | infobae.com 13 | news24.com 14 | nasa.gov 15 | sdpnoticias.com 16 | jetnews.com.mx 17 | razon.com.mx 18 | elceo.com 19 | arenapublica.com 20 | diarioelindependiente.mx 21 | pscp.tv 22 | plumasatomicas.com 23 | regeneracion.mx 24 | mvsnoticias.com 25 | publimetro.com.mx 26 | themexico.news 27 | aristeguinoticias.com 28 | pulsoslp.com.mx 29 | diputados.gob.mx 30 | diariodequeretaro.com.mx 31 | nnc.mx 32 | frontera.info 33 | bloomberg.com 34 | lopezobrador.org.mx 35 | asisucedegto.mx 36 | xeu.mx 37 | xevt.com 38 | 24-horas.mx 39 | politico.mx 40 | festivosmexico.com.mx 41 | lavozdechile.com 42 | noticiaslapaz.com 43 | milenio.com 44 | theconservativetreehouse.com 45 | chalenoticias.mx 46 | breaking.com.mx 47 | miamiherald.com 48 | economiahoy.mx 49 | argumentopolitico.com 50 | elfinanciero.com.mx 51 | reporteroshoy.mx 52 | vanguardia.com.mx 53 | laopcion.com.mx 54 | elexpres.com 55 | elindependientedehidalgo.com.mx 56 | canalsonora.com 57 | diariocambio.com.mx 58 | nexos.com.mx 59 | newsweek.com 60 | xataka.com.mx 61 | ampproject.org 62 | zetatijuana.com 63 | brainwala.com 64 | tumblr.com 65 | sipse.com 66 | periodicocorreo.com.mx 67 | imparcialoaxaca.mx 68 | ejecentral.com.mx 69 | mas-mexico.com.mx 70 | elsoldepuebla.com.mx 71 | lasestrellas.tv 72 | coachesvoice.com 73 | psicologiaymente.com 74 | reportur.com 75 | themazatlanpost.com 76 | sg.com.mx 77 | superrucos.com 78 | elsoldeacapulco.com.mx 79 | elpais.com 80 | elmercurio.com.mx 81 | taringa.net 82 | oilandgasmagazine.com.mx 83 | proceso.com.mx 84 | lanetanoticias.com 85 | suracapulco.mx 86 | bancomundial.org 87 | cletofilia.com 88 | aztecanoticias.com.mx 89 | 
periodicoelmexicano.com.mx 90 | imagenradio.com.mx 91 | animalpolitico.com 92 | tiempo.com.mx 93 | forbes.com.mx 94 | eia.gov 95 | casede.org 96 | eleconomista.com.mx 97 | sinembargo.mx 98 | huffingtonpost.com.mx 99 | zocalo.com.mx 100 | www.gob.mx 101 | aem.gob.mx 102 | clipperdata.com 103 | expreso.com.mx 104 | elsoldemexico.com.mx 105 | streamable.com 106 | lacronica.com 107 | televisa.com 108 | am.com.mx 109 | mexnewz.mx 110 | beeg1.net 111 | moreloshabla.com 112 | washingtonpost.com 113 | dailywire.com 114 | soyhomosensual.com 115 | ft.com 116 | wsj.com 117 | blogspot.com 118 | wsws.org 119 | nacionunida.com 120 | society6.com 121 | telesurenglish.net 122 | independent.co.uk 123 | revistaei.cl 124 | amazon.com.mx 125 | escapadeland.com 126 | elnuevoherald.com 127 | mxcity.mx 128 | tribuna.com.mx 129 | lasillarota.com 130 | tabascohoy.com 131 | bitcoinrealcash.com 132 | informador.mx 133 | netnoticias.mx 134 | heraldodemexico.com.mx 135 | businessinsider.com 136 | sapiens.org 137 | monitoreconomico.org 138 | forbes.com 139 | elsoldelbajio.com.mx 140 | sputniknews.com 141 | versiones.com.mx 142 | quadratin.com.mx 143 | omnia.com.mx 144 | wordpress.com 145 | theregister.co.uk 146 | bbc.co.uk 147 | novedadesaca.mx 148 | dineroenimagen.com 149 | elhorizonte.mx 150 | opensocietyfoundations.org 151 | unimexicali.com 152 | asistepemex.org 153 | radioformula.com.mx 154 | reforma.com 155 | ibb.co 156 | laverdadnoticias.com 157 | nytimes.com 158 | notisistema.com 159 | reliefweb.int 160 | lavozdeperu.com 161 | abcnoticias.mx 162 | itam.mx 163 | jornada.com.mx 164 | parametria.com.mx 165 | unomasuno.com.mx 166 | commondreams.org 167 | theguardian.com 168 | sptnkne.ws 169 | josecardenas.com 170 | rascamapas.com 171 | segundoasegundo.com 172 | reporteindigo.com 173 | globo.com 174 | rasnoticias.mx 175 | maritimeherald.com 176 | jetbrains.com 177 | lopezdoriga.com 178 | cns.gob.mx 179 | livejournal.com 180 | desastre.mx 181 | mexicodesconocido.com.mx 182 | yahoo.com 183 | allerorts.de 184 | diario.mx 185 | bcsnoticias.mx 186 | noticiasdequeretaro.com.mx 187 | expansion.mx 188 | elimparcial.com 189 | cargonewsmex.com 190 | contrareplica.mx 191 | unam.mx 192 | lavozdelafrontera.com.mx 193 | terceravia.mx 194 | latercera.com 195 | acustiknoticias.com 196 | riodoce.mx 197 | adnpolitico.com 198 | fayerwayer.com 199 | horizontal.mx 200 | wradio.com.mx 201 | diariodecolima.com 202 | noticiaszmg.com 203 | elmanana.com 204 | altonivel.com.mx 205 | elsiglodetorreon.com.mx 206 | eldiariodechihuahua.mx 207 | declarenews.com 208 | reuters.com 209 | thelocal.se 210 | hongkongfp.com 211 | canoe.com 212 | indiatimes.com 213 | faroutmagazine.co.uk 214 | els5ra.com 215 | physicalfitnesscare.com 216 | bients.com 217 | udefense.info 218 | wildwechsel.de 219 | viralnewsdrift.com 220 | chinaro.ir 221 | dankpupper.com 222 | mightykingseo.com 223 | thesun.co.uk 224 | theverge.com 225 | thenewcivilrightsmovement.com 226 | truthdig.com 227 | channelnewsasia.com 228 | dailytelegraph.com.au 229 | centicsystems.com 230 | abracadabranoticias.com 231 | rawstory.com 232 | evolutionalblogs.com 233 | euronews.com 234 | haaretz.com 235 | loversofcats.com 236 | ndtv.com 237 | livetrendynews.com 238 | pakthought.com 239 | scmp.com 240 | euractiv.com 241 | aviralupdate.com 242 | france24.com 243 | theindianwire.com 244 | aljazeera.com 245 | wnobserver.com 246 | tessyinfohub.com 247 | newyorker.com 248 | kartiavelino.com 249 | newrightnetwork.com 250 | atusocialscience.ir 251 | samrattailors.com 252 | trendnewsworld.com 253 | zdnet.com 
254 | nicepatogh.ir 255 | news18.com 256 | apsense.com 257 | virapars.com 258 | newsunbox.com 259 | nationalpost.com 260 | trendnewsweb.com 261 | globalnews.ca 262 | huffpost.com 263 | thenewsobservers.com 264 | bestnewsviral.com 265 | rt.com 266 | madamasr.com 267 | standard.co.uk 268 | local10.com 269 | telegraph.co.uk 270 | time8.in 271 | thehill.com 272 | timesofisrael.com 273 | dailymail.co.uk 274 | kyivpost.com 275 | indiatvnews.com 276 | talkingpointsmemo.com 277 | livemint.com 278 | sabq.org 279 | veteranstoday.com 280 | isibcase.ir 281 | dailytimes.com.pk 282 | thedailybeast.com 283 | total-croatia-news.com 284 | articlescad.com 285 | writeup.co.in 286 | gonewsviral.com 287 | thewire.in 288 | npr.org 289 | theglobeandmail.com 290 | nbcnews.com 291 | viralspicynews.com 292 | app.link 293 | trtworld.com 294 | unionjournalism.com 295 | winnaijatv.com 296 | jpost.com 297 | politico.com 298 | cnn.com 299 | walesonline.co.uk 300 | highmarksecurity.com 301 | indiatoday.in 302 | medium.com 303 | viralreportnow.com 304 | thegrowthop.com 305 | sky.com 306 | gappoo.com 307 | nymag.com 308 | viraltopiczone.com 309 | rfa.org 310 | apnews.com 311 | newindianexpress.com 312 | dailykos.com 313 | dw.com 314 | middleeastmonitor.com 315 | msn.com 316 | truescoopnews.com 317 | sbs.com.au 318 | discountbook.ir 319 | tnewst.com 320 | birminghammail.co.uk 321 | dailytrendshunter.com 322 | usnews.com 323 | dawn.com 324 | abc.net.au 325 | trendynewstime.com 326 | hindustantimes.com 327 | spacenews.com 328 | acneuro.co.uk 329 | washingtonexaminer.com 330 | cbc.ca 331 | rightsanddissent.org 332 | tribune.com.pk 333 | dailysabah.com 334 | today.com 335 | faitmain.ma 336 | dailyhive.com 337 | trendyupdatenews.com 338 | ctvnews.ca 339 | businessinsider.co.za 340 | usatoday.com 341 | viralupfeed.com 342 | indianexpress.com 343 | batask.ir 344 | foxnews.com 345 | thenews.com.pk 346 | cnbc.com 347 | newsdaily.today 348 | firstpost.com 349 | morningstaronline.co.uk 350 | ntrguadalajara.com 351 | nationalgeographic.com 352 | cronica.com.mx 353 | debate.com.mx 354 | guanajuatoinforma.com 355 | yucatan.com.mx 356 | sinlineamx.com 357 | sintesistv.com.mx 358 | laprensademonclova.com 359 | sandiegored.com 360 | turquesanews.mx 361 | enlapolitika.com 362 | lineadirectaportal.com 363 | hoyestado.com 364 | alcalorpolitico.com 365 | cafenegroportal.com 366 | noroeste.com.mx 367 | lineadirectaportal.com 368 | mediotiempo.com 369 | unotv.com 370 | criteriohidalgo.com 371 | xeva.com.mx 372 | quintafuerza.mx 373 | latinus.us 374 | verificado.com.mx 375 | lavanguardia.com 376 | nature.com 377 | -------------------------------------------------------------------------------- /bot.py: -------------------------------------------------------------------------------- 1 | """ 2 | Inits the summary bot. It starts a Reddit instance using PRAW, gets the latest posts 3 | and filters those who have already been processed. 4 | """ 5 | 6 | import praw 7 | import requests 8 | import tldextract 9 | 10 | import cloud 11 | import config 12 | import scraper 13 | import summary 14 | 15 | # We don't reply to posts which have a very small or very high reduction. 16 | MINIMUM_REDUCTION_THRESHOLD = 20 17 | MAXIMUM_REDUCTION_THRESHOLD = 68 18 | 19 | # File locations 20 | POSTS_LOG = "./processed_posts.txt" 21 | WHITELIST_FILE = "./assets/whitelist.txt" 22 | ERROR_LOG = "./error.log" 23 | 24 | # Templates. 
25 | TEMPLATE = open("./templates/es.txt", "r", encoding="utf-8").read() 26 | 27 | 28 | HEADERS = {"User-Agent": "Summarizer v2.0"} 29 | 30 | 31 | def load_whitelist(): 32 | """Reads the processed posts log file and creates it if it doesn't exist. 33 | 34 | Returns 35 | ------- 36 | list 37 | A list of domains that are confirmed to have an 'article' tag. 38 | 39 | """ 40 | 41 | with open(WHITELIST_FILE, "r", encoding="utf-8") as log_file: 42 | return log_file.read().splitlines() 43 | 44 | 45 | def load_log(): 46 | """Reads the processed posts log file and creates it if it doesn't exist. 47 | 48 | Returns 49 | ------- 50 | list 51 | A list of Reddit posts ids. 52 | 53 | """ 54 | 55 | try: 56 | with open(POSTS_LOG, "r", encoding="utf-8") as log_file: 57 | return log_file.read().splitlines() 58 | 59 | except FileNotFoundError: 60 | with open(POSTS_LOG, "a", encoding="utf-8") as log_file: 61 | return [] 62 | 63 | 64 | def update_log(post_id): 65 | """Updates the processed posts log with the given post id. 66 | 67 | Parameters 68 | ---------- 69 | post_id : str 70 | A Reddit post id. 71 | 72 | """ 73 | 74 | with open(POSTS_LOG, "a", encoding="utf-8") as log_file: 75 | log_file.write("{}\n".format(post_id)) 76 | 77 | 78 | def log_error(error_message): 79 | """Updates the error log. 80 | 81 | Parameters 82 | ---------- 83 | error_message : str 84 | A string containing the faulty url and the exception message. 85 | 86 | """ 87 | 88 | with open(ERROR_LOG, "a", encoding="utf-8") as log_file: 89 | log_file.write("{}\n".format(error_message)) 90 | 91 | 92 | def init(): 93 | """Inits the bot.""" 94 | 95 | reddit = praw.Reddit(client_id=config.APP_ID, client_secret=config.APP_SECRET, 96 | user_agent=config.USER_AGENT, username=config.REDDIT_USERNAME, 97 | password=config.REDDIT_PASSWORD) 98 | 99 | processed_posts = load_log() 100 | whitelist = load_whitelist() 101 | 102 | for subreddit in config.SUBREDDITS: 103 | 104 | for submission in reddit.subreddit(subreddit).new(limit=50): 105 | 106 | if submission.id not in processed_posts: 107 | 108 | clean_url = submission.url.replace("amp.", "") 109 | ext = tldextract.extract(clean_url) 110 | domain = "{}.{}".format(ext.domain, ext.suffix) 111 | 112 | if domain in whitelist: 113 | 114 | try: 115 | with requests.get(clean_url, headers=HEADERS, timeout=10) as response: 116 | 117 | # Most of the times the encoding is utf-8 but in edge cases 118 | # we set it to ISO-8859-1 when it is present in the HTML header. 119 | if "iso-8859-1" in response.text.lower(): 120 | response.encoding = "iso-8859-1" 121 | elif response.encoding == "ISO-8859-1": 122 | response.encoding = "utf-8" 123 | 124 | html_source = response.text 125 | 126 | article_title, article_date, article_body = scraper.scrape_html( 127 | html_source) 128 | 129 | summary_dict = summary.get_summary(article_body) 130 | except Exception as e: 131 | log_error("{},{}".format(clean_url, e)) 132 | update_log(submission.id) 133 | print("Failed:", submission.id) 134 | continue 135 | 136 | # To reduce low quality submissions, we only process those that made a meaningful summary. 137 | if summary_dict["reduction"] >= MINIMUM_REDUCTION_THRESHOLD and summary_dict["reduction"] <= MAXIMUM_REDUCTION_THRESHOLD: 138 | 139 | # Create a wordcloud, upload it to Imgur and get back the url. 140 | image_url = cloud.generate_word_cloud( 141 | summary_dict["article_words"]) 142 | 143 | # We start creating the comment body. 
144 | post_body = "\n\n".join( 145 | ["> " + item for item in summary_dict["top_sentences"]]) 146 | 147 | top_words = "" 148 | 149 | for index, word in enumerate(summary_dict["top_words"]): 150 | top_words += "{}^#{} ".format(word, index+1) 151 | 152 | post_message = TEMPLATE.format( 153 | article_title, clean_url, summary_dict["reduction"], article_date, post_body, image_url, top_words) 154 | 155 | reddit.submission(submission.id).reply(post_message) 156 | update_log(submission.id) 157 | print("Replied to:", submission.id) 158 | else: 159 | update_log(submission.id) 160 | print("Skipped:", submission.id) 161 | 162 | 163 | if __name__ == "__main__": 164 | 165 | init() 166 | -------------------------------------------------------------------------------- /cloud.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script generates a word cloud from the article words. Uploads it to Imgur and returns back the url. 3 | """ 4 | 5 | import os 6 | import random 7 | 8 | import numpy as np 9 | import requests 10 | import wordcloud 11 | from PIL import Image 12 | 13 | import config 14 | 15 | MASK_FILE = "./assets/cloud.png" 16 | FONT_FILE = "./assets/sofiapro-light.otf" 17 | IMAGE_PATH = "./temp.png" 18 | 19 | COLORMAPS = ["spring", "summer", "autumn", "Wistia"] 20 | 21 | mask = np.array(Image.open(MASK_FILE)) 22 | 23 | 24 | def generate_word_cloud(text): 25 | """Generates a word cloud and uploads it to Imgur. 26 | 27 | Parameters 28 | ---------- 29 | text : str 30 | The text to be converted into a word cloud. 31 | 32 | Returns 33 | ------- 34 | str 35 | The url generated from the Imgur API. 36 | """ 37 | 38 | wc = wordcloud.WordCloud(background_color="#222222", 39 | max_words=2000, 40 | mask=mask, 41 | contour_width=2, 42 | colormap=random.choice(COLORMAPS), 43 | font_path=FONT_FILE, 44 | contour_color="white") 45 | 46 | wc.generate(text) 47 | wc.to_file(IMAGE_PATH) 48 | image_link = upload_image(IMAGE_PATH) 49 | os.remove(IMAGE_PATH) 50 | 51 | return image_link 52 | 53 | 54 | def upload_image(image_path): 55 | """Uploads an image to Imgur and returns the permanent link url. 56 | 57 | Parameters 58 | ---------- 59 | image_path : str 60 | The path of the file to be uploaded. 61 | 62 | Returns 63 | ------- 64 | str 65 | The url generated from the Imgur API. 66 | """ 67 | 68 | url = "https://api.imgur.com/3/image" 69 | headers = {"Authorization": "Client-ID " + config.IMGUR_CLIENT_ID} 70 | files = {"image": open(IMAGE_PATH, "rb")} 71 | 72 | with requests.post(url, headers=headers, files=files) as response: 73 | 74 | # We extract the new link from the response. 75 | image_link = response.json()["data"]["link"] 76 | 77 | return image_link 78 | -------------------------------------------------------------------------------- /cloud_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PhantomInsights/summarizer/d8b4d7745ca9ba4309fc9707b7c98ae143b97a10/cloud_example.png -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | """Required constants for the Reddit API.""" 2 | 3 | # The following constants are used by the bot. 
4 | REDDIT_USERNAME = "" 5 | REDDIT_PASSWORD = "" 6 | 7 | APP_ID = "" 8 | APP_SECRET = "" 9 | USER_AGENT = "" 10 | 11 | SUBREDDITS = ["mexico"] 12 | 13 | IMGUR_CLIENT_ID = "" -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PhantomInsights/summarizer/d8b4d7745ca9ba4309fc9707b7c98ae143b97a10/requirements.txt -------------------------------------------------------------------------------- /scraper.py: -------------------------------------------------------------------------------- 1 | """ 2 | This function tries to extract the article title, date and body from an HTML string. 3 | """ 4 | 5 | from datetime import datetime 6 | 7 | from bs4 import BeautifulSoup 8 | 9 | # We don't process articles that have fewer characters than this. 10 | ARTICLE_MINIMUM_LENGTH = 650 11 | 12 | 13 | def scrape_html(html_source): 14 | """Tries to scrape the article from the given HTML source. 15 | 16 | Parameters 17 | ---------- 18 | html_source : str 19 | The html source of the article. 20 | 21 | Returns 22 | ------- 23 | tuple 24 | The article title, date and body. 25 | 26 | """ 27 | 28 | # Very often the text between tags comes together, we add an artificial newline to each common tag. 29 | for item in ["
</p>", "</h1>", "</h2>", "</h3>", "</div>
"]: 30 | html_source = html_source.replace(item, item+"\n") 31 | 32 | # We create a BeautifulSOup object and remove the unnecessary tags. 33 | soup = BeautifulSoup(html_source, "html5lib") 34 | 35 | # Then we extract the title and the article tags. 36 | article_title = soup.find("title").text.replace("\n", " ").strip() 37 | 38 | # If our title is too short we fallback to the first h1 tag. 39 | if len(article_title) <= 5: 40 | article_title = soup.find("h1").text.replace("\n", " ").strip() 41 | 42 | article_date = "" 43 | 44 | # We look for the first meta tag that has the word 'time' in it. 45 | for item in soup.find_all("meta"): 46 | 47 | if "time" in item.get("property", ""): 48 | 49 | clean_date = item["content"].split("+")[0].replace("Z", "") 50 | 51 | # Use your preferred time formatting. 52 | article_date = "{:%d-%m-%Y a las %H:%M:%S}".format( 53 | datetime.fromisoformat(clean_date)) 54 | break 55 | 56 | # If we didn't find any meta tag with a datetime we look for a 'time' tag. 57 | if len(article_date) <= 5: 58 | try: 59 | article_date = soup.find("time").text.strip() 60 | except: 61 | pass 62 | 63 | # We remove some tags that add noise. 64 | [tag.extract() for tag in soup.find_all( 65 | ["script", "img", "ol", "ul", "time", "h1", "h2", "h3", "iframe", "style", "form", "footer", "figcaption"])] 66 | 67 | # These class names/ids are known to add noise or duplicate text to the article. 68 | noisy_names = ["image", "img", "video", "subheadline", "editor", "fondea", "resumen", "tags", "sidebar", "comment", 69 | "entry-title", "breaking_content", "pie", "tract", "caption", "tweet", "expert", "previous", "next", 70 | "compartir", "rightbar", "mas", "copyright", "instagram-media", "cookie", "paywall", "mainlist", "sitelist"] 71 | 72 | for tag in soup.find_all("div"): 73 | 74 | try: 75 | tag_id = tag["id"].lower() 76 | 77 | for item in noisy_names: 78 | if item in tag_id: 79 | tag.extract() 80 | except: 81 | pass 82 | 83 | for tag in soup.find_all(["div", "p", "blockquote"]): 84 | 85 | try: 86 | tag_class = "".join(tag["class"]).lower() 87 | 88 | for item in noisy_names: 89 | if item in tag_class: 90 | tag.extract() 91 | except: 92 | pass 93 | 94 | # These names commonly hold the article text. 95 | common_names = ["artic", "summary", "cont", "note", "cuerpo", "body"] 96 | 97 | article_body = "" 98 | 99 | # Sometimes we have more than one article tag. We are going to grab the longest one. 100 | for article_tag in soup.find_all("article"): 101 | 102 | if len(article_tag.text) >= len(article_body): 103 | article_body = article_tag.text 104 | 105 | # The article is too short, let's try to find it in another tag. 106 | if len(article_body) <= ARTICLE_MINIMUM_LENGTH: 107 | 108 | for tag in soup.find_all(["div", "section"]): 109 | 110 | try: 111 | tag_id = tag["id"].lower() 112 | 113 | for item in common_names: 114 | if item in tag_id: 115 | # We guarantee to get the longest div. 116 | if len(tag.text) >= len(article_body): 117 | article_body = tag.text 118 | except: 119 | pass 120 | 121 | # The article is still too short, let's try one more time. 122 | if len(article_body) <= ARTICLE_MINIMUM_LENGTH: 123 | 124 | for tag in soup.find_all(["div", "section"]): 125 | 126 | try: 127 | tag_class = "".join(tag["class"]).lower() 128 | 129 | for item in common_names: 130 | if item in tag_class: 131 | # We guarantee to get the longest div. 
132 | if len(tag.text) >= len(article_body): 133 | article_body = tag.text 134 | except: 135 | pass 136 | 137 | return article_title, article_date, article_body 138 | -------------------------------------------------------------------------------- /summary.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script extracts and ranks the sentences and words of an article. 3 | 4 | IT is inspired by the tf-idf algorithm. 5 | """ 6 | 7 | from collections import Counter 8 | 9 | import spacy 10 | 11 | # The stop words files. 12 | ES_STOPWORDS_FILE = "./assets/stopwords-es.txt" 13 | EN_STOPWORDS_FILE = "./assets/stopwords-en.txt" 14 | 15 | # The number of sentences we need. 16 | NUMBER_OF_SENTENCES = 5 17 | 18 | # The number of top words we need. 19 | NUMBER_OF_TOP_WORDS = 5 20 | 21 | # Multiplier for uppercase and long words. 22 | IMPORTANT_WORDS_MULTIPLIER = 2.5 23 | 24 | # Financial sentences often are more important than others. 25 | FINANCIAL_SENTENCE_MULTIPLIER = 1.5 26 | 27 | # The minimum number of characters needed for a line to be valid. 28 | LINE_LENGTH_THRESHOLD = 150 29 | 30 | # It is very important to add spaces on these words. 31 | # Otherwise it will take into account partial words. 32 | COMMON_WORDS = { 33 | " ", " ", "\xa0", "#", ",", "|", "-", "‘", "’", ";", "(", ")", ".", ":", "¿", "?", '“', "/", 34 | '”', '"', "'", "%", "•", "«", "»", "foto", "photo", "video", "redacción", "nueve", "diez", "cien", 35 | "mil", "miles", "ciento", "cientos", "millones", "vale" 36 | } 37 | 38 | # These words increase the score of a sentence. They don't require whitespaces around them. 39 | FINANCIAL_WORDS = ["$", "€", "£", "pesos", "dólar", "libras", "euros", 40 | "dollar", "pound", "mdp", "mdd"] 41 | 42 | 43 | # Don't forget to specify the correct model for your language. 44 | NLP = spacy.load("es_core_news_sm") 45 | 46 | 47 | def add_extra_words(): 48 | """Adds the title and uppercase forms of all words to COMMON_WORDS. 49 | 50 | We parse local copies of stop words downloaded from the following repositories: 51 | 52 | https://github.com/stopwords-iso/stopwords-es 53 | https://github.com/stopwords-iso/stopwords-en 54 | """ 55 | 56 | with open(ES_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file: 57 | for word in temp_file.read().splitlines(): 58 | COMMON_WORDS.add(word) 59 | 60 | with open(EN_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file: 61 | for word in temp_file.read().splitlines(): 62 | COMMON_WORDS.add(word) 63 | 64 | 65 | add_extra_words() 66 | 67 | 68 | def get_summary(article): 69 | """Generates the top words and sentences from the article text. 70 | 71 | Parameters 72 | ---------- 73 | article : str 74 | The article text. 75 | 76 | Returns 77 | ------- 78 | dict 79 | A dict containing the title of the article, reduction percentage, top words and the top scored sentences. 80 | 81 | """ 82 | 83 | # Now we prepare the article for scoring. 84 | cleaned_article = clean_article(article) 85 | 86 | # We start the NLP process. 87 | doc = NLP(cleaned_article) 88 | 89 | article_sentences = [sent for sent in doc.sents] 90 | 91 | words_of_interest = [ 92 | token.text for token in doc if token.lower_ not in COMMON_WORDS] 93 | 94 | # We use the Counter class to count all words ocurrences. 95 | scored_words = Counter(words_of_interest) 96 | 97 | for word in scored_words: 98 | 99 | # We add bonus points to words starting in uppercase and are equal or longer than 4 characters. 
100 | if word[0].isupper() and len(word) >= 4: 101 | scored_words[word] *= IMPORTANT_WORDS_MULTIPLIER 102 | 103 | # If the word is a number we punish it by settings its points to 0. 104 | if word.isdigit(): 105 | scored_words[word] = 0 106 | 107 | top_sentences = get_top_sentences(article_sentences, scored_words) 108 | top_sentences_length = sum([len(sentence) for sentence in top_sentences]) 109 | reduction = 100 - (top_sentences_length / len(cleaned_article)) * 100 110 | 111 | summary_dict = { 112 | "top_words": get_top_words(scored_words), 113 | "top_sentences": top_sentences, 114 | "reduction": reduction, 115 | "article_words": " ".join(words_of_interest) 116 | } 117 | 118 | return summary_dict 119 | 120 | 121 | def clean_article(article_text): 122 | """Cleans and reformats the article text. 123 | 124 | Parameters 125 | ---------- 126 | article_text : str 127 | The article string. 128 | 129 | Returns 130 | ------- 131 | str 132 | The cleaned up article. 133 | 134 | """ 135 | 136 | # We divide the script into lines, this is to remove unnecessary whitespaces. 137 | lines_list = list() 138 | 139 | for line in article_text.split("\n"): 140 | 141 | # We remove whitespaces. 142 | stripped_line = line.strip() 143 | 144 | # If the line is too short we ignore it. 145 | if len(stripped_line) >= LINE_LENGTH_THRESHOLD: 146 | lines_list.append(stripped_line) 147 | 148 | # Now we have the article fully cleaned. 149 | return " ".join(lines_list) 150 | 151 | 152 | def get_top_words(scored_words): 153 | """Gets the top scored words from the prepared article. 154 | 155 | Parameters 156 | ---------- 157 | scored_words : collections.Counter 158 | A Counter containing the article words and their scores. 159 | 160 | Returns 161 | ------- 162 | list 163 | An ordered list with the top words. 164 | 165 | """ 166 | 167 | # Once we have our words scored it's time to get top ones. 168 | top_words = list() 169 | 170 | for word, score in scored_words.most_common(): 171 | 172 | add_to_list = True 173 | 174 | # We avoid duplicates by checking if the word already is in the top_words list. 175 | if word.upper() not in [item.upper() for item in top_words]: 176 | 177 | # Sometimes we have the same word but in plural form, we skip the word when that happens. 178 | for item in top_words: 179 | if word.upper() in item.upper() or item.upper() in word.upper(): 180 | add_to_list = False 181 | 182 | if add_to_list: 183 | top_words.append(word) 184 | 185 | return top_words[0:NUMBER_OF_TOP_WORDS] 186 | 187 | 188 | def get_top_sentences(article_sentences, scored_words): 189 | """Gets the top scored sentences from the cleaned article. 190 | 191 | Parameters 192 | ---------- 193 | cleaned_article : str 194 | The original article after it has been cleaned and reformatted. 195 | 196 | scored_words : collections.Counter 197 | A Counter containing the article words and their scores. 198 | 199 | Returns 200 | ------- 201 | list 202 | An ordered list with the top sentences. 203 | 204 | """ 205 | 206 | # Now its time to score each sentence. 207 | scored_sentences = list() 208 | 209 | # We take a reference of the order of the sentences, this will be used later. 210 | for index, sent in enumerate(article_sentences): 211 | 212 | # In some edge cases we have duplicated sentences, we make sure that doesn't happen. 
213 | if sent.text not in [sent for score, index, sent in scored_sentences]: 214 | scored_sentences.append( 215 | [score_line(sent, scored_words), index, sent.text]) 216 | 217 | top_sentences = list() 218 | counter = 0 219 | 220 | for score, index, sentence in sorted(scored_sentences, reverse=True): 221 | 222 | if counter >= NUMBER_OF_SENTENCES: 223 | break 224 | 225 | # When the article is too small the sentences may come empty. 226 | if len(sentence) >= 3: 227 | 228 | # We clean the sentence and its index so we can sort in chronological order. 229 | top_sentences.append([index, sentence]) 230 | counter += 1 231 | 232 | return [sentence for index, sentence in sorted(top_sentences)] 233 | 234 | 235 | def score_line(line, scored_words): 236 | """Calculates the score of the given line using the word scores. 237 | 238 | Parameters 239 | ---------- 240 | line : spacy.tokens.span.Span 241 | A tokenized sentence from the article. 242 | 243 | scored_words : collections.Counter 244 | A Counter containing the article words and their scores. 245 | 246 | Returns 247 | ------- 248 | int 249 | The total score of all the words in the sentence. 250 | 251 | """ 252 | 253 | # We remove the common words. 254 | cleaned_line = [ 255 | token.text for token in line if token.lower_ not in COMMON_WORDS] 256 | 257 | # We now sum the total number of ocurrences for all words. 258 | temp_score = 0 259 | 260 | for word in cleaned_line: 261 | temp_score += scored_words[word] 262 | 263 | # We apply a bonus score to sentences that contain financial information. 264 | line_lowercase = line.text.lower() 265 | 266 | for word in FINANCIAL_WORDS: 267 | if word in line_lowercase: 268 | temp_score *= FINANCIAL_SENTENCE_MULTIPLIER 269 | break 270 | 271 | return temp_score 272 | -------------------------------------------------------------------------------- /templates/es.txt: -------------------------------------------------------------------------------- 1 | ### {} 2 | 3 | [Nota Original]({}) | Reducido en un {:.2f}% | {} 4 | 5 | ***** 6 | 7 | {} 8 | 9 | ***** 10 | 11 | *^Este ^bot ^solo ^responde ^cuando ^logra ^resumir ^en ^un ^mínimo ^del ^20%. ^Tus ^reportes, ^sugerencias ^y ^comentarios ^son ^bienvenidos. ​* 12 | 13 | [FAQ](https://redd.it/arkxlg) | [GitHub](https://git.io/fhQkC) | [☁️]({}) | {} 14 | --------------------------------------------------------------------------------
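A minimal standalone sketch of how the scraper and summarizer can be driven without the Reddit bot. It assumes it is run from the repository root (so that `scraper.py`, `summary.py` and the stop word files under `assets/` can be found), that the `es_core_news_sm` spaCy model is installed, and that the URL below is only a placeholder rather than a real whitelisted article:

```python
"""Standalone sketch: fetch one article, scrape it and summarize it.

Mirrors the flow of bot.py without Reddit, the whitelist check or Imgur.
Run it from the repository root; the URL is a placeholder.
"""

import requests

import scraper
import summary

HEADERS = {"User-Agent": "Summarizer v2.0"}
ARTICLE_URL = "https://example.com/some-article"  # Placeholder, not a whitelisted site.

with requests.get(ARTICLE_URL, headers=HEADERS, timeout=10) as response:

    # Same encoding workaround used by bot.py.
    if "iso-8859-1" in response.text.lower():
        response.encoding = "iso-8859-1"
    elif response.encoding == "ISO-8859-1":
        response.encoding = "utf-8"

    html_source = response.text

article_title, article_date, article_body = scraper.scrape_html(html_source)
summary_dict = summary.get_summary(article_body)

print(article_title, "|", article_date)
print("Reduction: {:.2f}%".format(summary_dict["reduction"]))

for sentence in summary_dict["top_sentences"]:
    print("*", sentence)

print("Top words:", ", ".join(summary_dict["top_words"]))
```

The reduction value printed here is the same figure `bot.py` compares against `MINIMUM_REDUCTION_THRESHOLD` and `MAXIMUM_REDUCTION_THRESHOLD` before deciding whether to reply to a submission.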