├── .github └── FUNDING.yml ├── LICENSE ├── README.md ├── assets ├── cloud.png ├── font.txt ├── stopwords-en.txt ├── stopwords-es.txt └── whitelist.txt ├── bot.py ├── cloud.py ├── cloud_example.png ├── config.py ├── requirements.txt ├── scraper.py ├── summary.py └── templates └── es.txt /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: agentphantom 2 | patreon: agentphantom 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Phantom Insights 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Article Summarizer 2 | 3 | This project implements a custom algorithm to extract the most important sentences and keywords from Spanish and English news articles. 4 | 5 | It was fully developed in `Python` and it is inspired by similar projects seen on `Reddit` news subreddits that use the term frequency–inverse document frequency (`tf–idf`). 6 | 7 | The 3 most important files are: 8 | 9 | * `scraper.py` : A Python script that performs web scraping on a given HTML source, it extracts the article title, date and body. 10 | 11 | * `summary.py` : A Python script that applies a custom algorithm to a string of text and extracts the top ranked sentences and words. 12 | 13 | * `bot.py` : A Reddit bot that checks a subreddit for its latest submissions. It manages a list of already processed submissions to avoid duplicates. 14 | 15 | ## Requirements 16 | 17 | This project uses the following Python libraries 18 | 19 | * `spaCy` : Used to tokenize the article into sentences and words. 20 | * `PRAW` : Makes the use of the Reddit API very easy. 21 | * `Requests` : To perform HTTP `get` requests to the articles urls. 22 | * `BeautifulSoup` : Used for extracting the article text. 23 | * `html5lib` : This parser got better compatibility when used with `BeautifulSoup`. 24 | * `tldextract` : Used to extract the domain from an url. 25 | * `wordcloud` : Used to create word clouds with the article text. 26 | 27 | After installing the `spaCy` library you must install a language model to be able to tokenize the article. 
For `Spanish` you can run this one:

`python -m spacy download es_core_news_sm`

For other languages please check the following link: https://spacy.io/usage/models

## Reddit Bot

The bot is simple in nature; it uses the `PRAW` library, which makes the Reddit API very straightforward to use. The bot polls a subreddit every 10 minutes to get its latest submissions.

It first checks that the submission hasn't already been processed and then checks that the submission url is in the whitelist. This whitelist is currently curated by me.

If the post and its url pass both checks, web scraping is applied to the url. This is where things start getting interesting.

Before replying to the original submission the bot checks the percentage of reduction achieved; if it's too low or too high it skips the submission and moves on to the next one.

## Web Scraper

The whitelist currently contains more than 300 different news and blog websites. Creating a specialized web scraper for each one is simply not feasible.

The second best thing to do is to make the scraper as accurate as possible.

We start the web scraper in the usual way, with the `Requests` and `BeautifulSoup` libraries.

```python
with requests.get(article_url) as response:

    if response.encoding == "ISO-8859-1":
        response.encoding = "utf-8"

    html_source = response.text

    # Tags that usually delimit blocks of text.
    for item in ["<br>", "</p>", "</div>", "</h1>", "</li>"]:
        html_source = html_source.replace(item, item+"\n")

    soup = BeautifulSoup(html_source, "html5lib")
```

A few times I got encoding issues caused by an incorrect encoding guess. To avoid this issue I force `Requests` to decode with `utf-8`.

Now that we have our article parsed into a `soup` object we start by extracting the title and the published time.

I used similar methods to extract both values: I first check the most common tags and fall back to the next common alternatives.

Not all websites expose their published date, so we sometimes end up with an empty string.

```python
article_title = soup.find("title").text.replace("\n", " ").strip()

# If our title is too short we fall back to the first h1 tag.
if len(article_title) <= 5:
    article_title = soup.find("h1").text.replace("\n", " ").strip()

article_date = ""

# We look for the first meta tag that has the word 'time' in it.
for item in soup.find_all("meta"):

    if "time" in item.get("property", ""):

        clean_date = item["content"].split("+")[0].replace("Z", "")

        # Use your preferred time formatting.
        article_date = "{:%d-%m-%Y a las %H:%M:%S}".format(
            datetime.fromisoformat(clean_date))
        break

# If we didn't find any meta tag with a datetime we look for a 'time' tag.
if len(article_date) <= 5:
    try:
        article_date = soup.find("time").text.strip()
    except:
        pass
```

When extracting the text from different tags I often got the strings without separation. I implemented a little hack that adds a newline after each tag that usually contains text. This significantly improved the overall accuracy of the tokenizer.
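Here is a minimal, self-contained illustration (not part of the bot) of why those artificial newlines matter:

```python
from bs4 import BeautifulSoup

html = "<div><p>First sentence.</p><p>Second sentence.</p></div>"

# Without the hack both paragraphs run together: 'First sentence.Second sentence.'
print(BeautifulSoup(html, "html5lib").text)

# Adding a newline after each closing tag keeps them apart for the sentence tokenizer.
patched = html.replace("</p>", "</p>\n")
print(BeautifulSoup(patched, "html5lib").text)
```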
My original idea was to only accept websites that used the `<article>` tag. It worked ok for the first websites I tested, but I soon realized that very few websites use it, and those that do often don't use it correctly.

```python
article = soup.find("article").text
```

When accessing the `.text` property of the `<article>` tag I noticed I was also getting the JavaScript code. I backtracked a bit and removed all tags which could add *noise* to the article text.

```python
[tag.extract() for tag in soup.find_all(
    ["script", "img", "ul", "time", "h1", "h2", "h3", "iframe", "style", "form", "footer", "figcaption"])]

# These class names/ids are known to add noise or duplicate text to the article.
noisy_names = ["image", "img", "video", "subheadline",
               "hidden", "tract", "caption", "tweet", "expert"]

for tag in soup.find_all("div"):

    # .get() avoids a KeyError on tags that don't have an id attribute.
    tag_id = tag.get("id", "").lower()

    for item in noisy_names:
        if item in tag_id:
            tag.extract()
```

The above code removed most captions, which usually repeat what's already in the article.

After that I applied a 3-step process to get the article text.

First I checked all `<article>` tags and grabbed the one with the longest text.

```python
article = ""

for article_tag in soup.find_all("article"):

    if len(article_tag.text) >= len(article):
        article = article_tag.text
```

That worked fine for websites that properly use the `<article>` tag. The longest tag almost always contains the main article.

But it didn't quite work as expected: I noticed poor quality in the results, and sometimes I was getting excerpts from other articles.

That's when I decided to add a fallback. Instead of only looking for the `<article>` tag I also look for `<div>` and `<section>` tags with commonly used `id`s.

```python
# These names commonly hold the article text.
common_names = ["artic", "summary", "cont", "note", "cuerpo", "body"]

# If the article is too short we look somewhere else.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        tag_id = tag.get("id", "").lower()

        for item in common_names:
            if item in tag_id:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text
```

That increased the accuracy quite a bit. I then repeated the code, but this time looking at the `class` attribute instead of the `id` attribute.

```python
# The article is still too short, let's try one more time.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        tag_class = "".join(tag.get("class", [])).lower()

        for item in common_names:
            if item in tag_class:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text
```

Using all the previous methods greatly increased the overall accuracy of the scraper. In some cases I used partial words that share the same letters in English and Spanish (artic -> article/articulo). The scraper was now compatible with all the urls I tested.

We make a final check: if the article is still too short we abort the process and move to the next url, otherwise we move on to the summary algorithm.

## Summary Algorithm

This algorithm was designed to work primarily on articles written in Spanish. It consists of several steps:

1. Reformat and clean the original article by removing excess whitespace.
2. Make a copy of the original article and remove all commonly used words from it.
3. Split the copied article into words and score each word.
4. Split the original article into sentences and score each sentence using the scores from the words.
5. Take the top 5 sentences and top 5 words and return them in chronological order.

Before starting out we need to initialize the `spaCy` library.

```python
NLP = spacy.load("es_core_news_sm")
```

That line of code loads the `Spanish` model, which is the one I use the most. If you are using another language please refer to the `Requirements` section so you know how to install the appropriate model.

### Clean the Article

When extracting the text from the article we usually get a lot of whitespace, mostly from line breaks (`\n`).

We split the text by that character, strip the whitespace from each piece and join them back together. This is not strictly required, but it helps a lot while debugging the whole process.
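A minimal sketch of that cleanup step, assuming a standalone helper (the function name is illustrative, not the project's actual API):

```python
def clean_whitespace(article):
    """Collapse the line breaks and stray spaces left over from scraping."""
    lines = [line.strip() for line in article.split("\n")]
    return " ".join(line for line in lines if line)


print(clean_whitespace("  First paragraph.\n\n   Second paragraph.\n"))
# First paragraph. Second paragraph.
```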
### Remove Common and Stop Words

At the top of the script we declare the paths of the stop words text files. These stop words are added to a `set`, guaranteeing no duplicates.

I also added a list of Spanish and English words that are not stop words but don't add anything substantial to the article. My personal preference was to hard-code them in lowercase form.

Then I added a copy of each word in uppercase and in title form, which means the `set` will be 3 times its original size.

```python
with open(ES_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

with open(EN_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

extra_words = list()

for word in COMMON_WORDS:
    extra_words.append(word.title())
    extra_words.append(word.upper())

for word in extra_words:
    COMMON_WORDS.add(word)
```

### Scoring Words

Before tokenizing our words we must first pass our cleaned article into the `NLP` pipeline; this is done with one line of code.

```python
doc = NLP(cleaned_article)
```

This `doc` object gives us several iterators; the two we will use are its tokens (obtained by iterating the `doc` itself) and `sents` (its sentences).

At this point I added a personal touch to the algorithm. First I made a copy of the article and then removed all common words from it.

Afterwards I used a `collections.Counter` object to do the initial scoring.

Then I applied a multiplier bonus to words that start with an uppercase letter and are 4 or more characters long. Most of the time those words are names of places, people or organizations.

Finally I set the score to zero for all words that are actually numbers.

```python
words_of_interest = [
    token.text for token in doc if token.text not in COMMON_WORDS]

scored_words = Counter(words_of_interest)

for word in scored_words:

    if word[0].isupper() and len(word) >= 4:
        scored_words[word] *= 3

    if word.isdigit():
        scored_words[word] = 0
```
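To make the scoring concrete, here is a small worked example with made-up tokens, assuming they already survived the stop-word filter:

```python
from collections import Counter

# Hypothetical tokens left over after removing common words.
words_of_interest = ["Banxico", "tasa", "Banxico", "25", "tasa", "tasa"]

scored_words = Counter(words_of_interest)  # Counter({'tasa': 3, 'Banxico': 2, '25': 1})

for word in scored_words:

    if word[0].isupper() and len(word) >= 4:
        scored_words[word] *= 3  # 'Banxico' jumps from 2 to 6.

    if word.isdigit():
        scored_words[word] = 0   # '25' is discarded.

print(scored_words.most_common())  # [('Banxico', 6), ('tasa', 3), ('25', 0)]
```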
### Scoring Sentences

Now that we have the final scores for each word it is time to score each sentence in the article.

To do this we first need to split the article into sentences. I tried various approaches, including `RegEx`, but the one that worked best was the `spaCy` library.

We iterate again over the `doc` object we defined in the previous step, but this time over its `sents` property.

Something to note is that `doc.sents` yields sentence spans, and we can retrieve each sentence's text by accessing its `text` property.

```python
article_sentences = [sent for sent in doc.sents]

scored_sentences = list()

for index, sent in enumerate(article_sentences):

    # In some edge cases we have duplicated sentences, we make sure that doesn't happen.
    if sent.text not in [sent for score, index, sent in scored_sentences]:
        scored_sentences.append(
            [score_line(sent, scored_words), index, sent.text])
```

`scored_sentences` is a list of lists. Each inner list contains 3 values: the sentence score, its index and the sentence itself. Those values will be used in the next step.

The code below shows how the lines are scored.

```python
def score_line(line, scored_words):

    # We remove the common words.
    cleaned_line = [
        token.text for token in line if token.text not in COMMON_WORDS]

    # We now sum the scores of all the remaining words.
    temp_score = 0

    for word in cleaned_line:
        temp_score += scored_words[word]

    # We apply a bonus score to sentences that contain financial information.
    line_lowercase = line.text.lower()

    for word in FINANCIAL_WORDS:
        if word in line_lowercase:
            temp_score *= 1.5
            break

    return temp_score
```

We apply a multiplier to sentences that contain any word that refers to money or finance.

### Chronological Order

This is the final part of the algorithm: we make use of the `sorted()` function to get the top sentences and then reorder them into their original positions.

We sort `scored_sentences` in reverse order, which gives us the top scored sentences first. We keep a small counter variable so the loop breaks once it reaches 5. We also discard all sentences that are 3 characters or less (sometimes there are sneaky zero-width characters).

```python
top_sentences = list()
counter = 0

for score, index, sentence in sorted(scored_sentences, reverse=True):

    if counter >= 5:
        break

    # When the article is too small the sentences may come out empty.
    if len(sentence) >= 3:

        # We append the sentence and its index so we can sort in chronological order.
        top_sentences.append([index, sentence])
        counter += 1

return [sentence for index, sentence in sorted(top_sentences)]
```

At the end we use a list comprehension to return only the sentences, which are already sorted in chronological order.

### Word Cloud

Just for fun I added a word cloud to each article. To do so I used the `wordcloud` library. This library is very easy to use: you only need to declare a `WordCloud` object and call its `generate` method with a string of text as its parameter.

```python
wc = wordcloud.WordCloud()  # See cloud.py for the full parameters.
wc.generate(prepared_article)
wc.to_file("./temp.png")
```

After generating the image I uploaded it to `Imgur`, got back the url link and added it to the `Markdown` message.

![Word cloud example](cloud_example.png)

## Conclusion

This was a very fun and interesting project to work on. I may have reinvented the wheel, but at least I learned a few cool things.

I'm satisfied with the overall quality of the results and I will keep tweaking the algorithm and applying compatibility enhancements.

As a side note, when testing the script I accidentally fed it Tweets, Facebook posts and articles written in English. All of them got acceptable outputs, but since those sites were not the target I removed them from the whitelist.

After some weeks of feedback I decided to add support for the English language. This required a bit of refactoring.

To make it work with other languages you only need a text file containing all the stop words of said language and to copy a few lines of code (see the Remove Common and Stop Words section).
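For example, adding French support might look roughly like this. The `stopwords-fr.txt` file is hypothetical and would have to be supplied by you, and `fr_core_news_sm` is the matching model from the spaCy models page linked in the Requirements section:

```python
import spacy

# In the project this set already holds the Spanish and English stop words.
COMMON_WORDS = set()

# Hypothetical stop words file for the new language.
FR_STOPWORDS_FILE = "./assets/stopwords-fr.txt"

with open(FR_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

# Load the matching model (installed with: python -m spacy download fr_core_news_sm).
NLP = spacy.load("fr_core_news_sm")
```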
382 | 383 | [![Become a Patron!](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/bePatron?u=20521425) 384 | -------------------------------------------------------------------------------- /assets/cloud.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PhantomInsights/summarizer/d8b4d7745ca9ba4309fc9707b7c98ae143b97a10/assets/cloud.png -------------------------------------------------------------------------------- /assets/font.txt: -------------------------------------------------------------------------------- 1 | The name of the font used by this project is Sofia Pro Light 2 | 3 | The font is free, to download it you need to purchase it from the following link: 4 | 5 | https://www.fontspring.com/fonts/mostardesign/sofia-pro -------------------------------------------------------------------------------- /assets/stopwords-en.txt: -------------------------------------------------------------------------------- 1 | 'll 2 | 'tis 3 | 'twas 4 | 've 5 | 10 6 | 39 7 | a 8 | a's 9 | able 10 | ableabout 11 | about 12 | above 13 | abroad 14 | abst 15 | accordance 16 | according 17 | accordingly 18 | across 19 | act 20 | actually 21 | ad 22 | added 23 | adj 24 | adopted 25 | ae 26 | af 27 | affected 28 | affecting 29 | affects 30 | after 31 | afterwards 32 | ag 33 | again 34 | against 35 | ago 36 | ah 37 | ahead 38 | ai 39 | ain't 40 | aint 41 | al 42 | all 43 | allow 44 | allows 45 | almost 46 | alone 47 | along 48 | alongside 49 | already 50 | also 51 | although 52 | always 53 | am 54 | amid 55 | amidst 56 | among 57 | amongst 58 | amoungst 59 | amount 60 | an 61 | and 62 | announce 63 | another 64 | any 65 | anybody 66 | anyhow 67 | anymore 68 | anyone 69 | anything 70 | anyway 71 | anyways 72 | anywhere 73 | ao 74 | apart 75 | apparently 76 | appear 77 | appreciate 78 | appropriate 79 | approximately 80 | aq 81 | ar 82 | are 83 | area 84 | areas 85 | aren 86 | aren't 87 | arent 88 | arise 89 | around 90 | arpa 91 | as 92 | aside 93 | ask 94 | asked 95 | asking 96 | asks 97 | associated 98 | at 99 | au 100 | auth 101 | available 102 | aw 103 | away 104 | awfully 105 | az 106 | b 107 | ba 108 | back 109 | backed 110 | backing 111 | backs 112 | backward 113 | backwards 114 | bb 115 | bd 116 | be 117 | became 118 | because 119 | become 120 | becomes 121 | becoming 122 | been 123 | before 124 | beforehand 125 | began 126 | begin 127 | beginning 128 | beginnings 129 | begins 130 | behind 131 | being 132 | beings 133 | believe 134 | below 135 | beside 136 | besides 137 | best 138 | better 139 | between 140 | beyond 141 | bf 142 | bg 143 | bh 144 | bi 145 | big 146 | bill 147 | billion 148 | biol 149 | bj 150 | bm 151 | bn 152 | bo 153 | both 154 | bottom 155 | br 156 | brief 157 | briefly 158 | bs 159 | bt 160 | but 161 | buy 162 | bv 163 | bw 164 | by 165 | bz 166 | c 167 | c'mon 168 | c's 169 | ca 170 | call 171 | came 172 | can 173 | can't 174 | cannot 175 | cant 176 | caption 177 | case 178 | cases 179 | cause 180 | causes 181 | cc 182 | cd 183 | certain 184 | certainly 185 | cf 186 | cg 187 | ch 188 | changes 189 | ci 190 | ck 191 | cl 192 | clear 193 | clearly 194 | click 195 | cm 196 | cmon 197 | cn 198 | co 199 | co. 
200 | com 201 | come 202 | comes 203 | computer 204 | con 205 | concerning 206 | consequently 207 | consider 208 | considering 209 | contain 210 | containing 211 | contains 212 | copy 213 | corresponding 214 | could 215 | could've 216 | couldn 217 | couldn't 218 | couldnt 219 | course 220 | cr 221 | cry 222 | cs 223 | cu 224 | currently 225 | cv 226 | cx 227 | cy 228 | cz 229 | d 230 | dare 231 | daren't 232 | darent 233 | date 234 | de 235 | dear 236 | definitely 237 | describe 238 | described 239 | despite 240 | detail 241 | did 242 | didn 243 | didn't 244 | didnt 245 | differ 246 | different 247 | differently 248 | directly 249 | dj 250 | dk 251 | dm 252 | do 253 | does 254 | doesn 255 | doesn't 256 | doesnt 257 | doing 258 | don 259 | don't 260 | done 261 | dont 262 | doubtful 263 | down 264 | downed 265 | downing 266 | downs 267 | downwards 268 | due 269 | during 270 | dz 271 | e 272 | each 273 | early 274 | ec 275 | ed 276 | edu 277 | ee 278 | effect 279 | eg 280 | eh 281 | eight 282 | eighty 283 | either 284 | eleven 285 | else 286 | elsewhere 287 | empty 288 | end 289 | ended 290 | ending 291 | ends 292 | enough 293 | entirely 294 | er 295 | es 296 | especially 297 | et 298 | et-al 299 | etc 300 | even 301 | evenly 302 | ever 303 | evermore 304 | every 305 | everybody 306 | everyone 307 | everything 308 | everywhere 309 | ex 310 | exactly 311 | example 312 | except 313 | f 314 | face 315 | faces 316 | fact 317 | facts 318 | fairly 319 | far 320 | farther 321 | felt 322 | few 323 | fewer 324 | ff 325 | fi 326 | fifteen 327 | fifth 328 | fifty 329 | fify 330 | fill 331 | find 332 | finds 333 | fire 334 | first 335 | five 336 | fix 337 | fj 338 | fk 339 | fm 340 | fo 341 | followed 342 | following 343 | follows 344 | for 345 | forever 346 | former 347 | formerly 348 | forth 349 | forty 350 | forward 351 | found 352 | four 353 | fr 354 | free 355 | from 356 | front 357 | full 358 | fully 359 | further 360 | furthered 361 | furthering 362 | furthermore 363 | furthers 364 | fx 365 | g 366 | ga 367 | gave 368 | gb 369 | gd 370 | ge 371 | general 372 | generally 373 | get 374 | gets 375 | getting 376 | gf 377 | gg 378 | gh 379 | gi 380 | give 381 | given 382 | gives 383 | giving 384 | gl 385 | gm 386 | gmt 387 | gn 388 | go 389 | goes 390 | going 391 | gone 392 | good 393 | goods 394 | got 395 | gotten 396 | gov 397 | gp 398 | gq 399 | gr 400 | great 401 | greater 402 | greatest 403 | greetings 404 | group 405 | grouped 406 | grouping 407 | groups 408 | gs 409 | gt 410 | gu 411 | gw 412 | gy 413 | h 414 | had 415 | hadn't 416 | hadnt 417 | half 418 | happens 419 | hardly 420 | has 421 | hasn 422 | hasn't 423 | hasnt 424 | have 425 | haven 426 | haven't 427 | havent 428 | having 429 | he 430 | he'd 431 | he'll 432 | he's 433 | hed 434 | hell 435 | hello 436 | help 437 | hence 438 | her 439 | here 440 | here's 441 | hereafter 442 | hereby 443 | herein 444 | heres 445 | hereupon 446 | hers 447 | herself 448 | herse” 449 | hes 450 | hi 451 | hid 452 | high 453 | higher 454 | highest 455 | him 456 | himself 457 | himse” 458 | his 459 | hither 460 | hk 461 | hm 462 | hn 463 | home 464 | homepage 465 | hopefully 466 | how 467 | how'd 468 | how'll 469 | how's 470 | howbeit 471 | however 472 | hr 473 | ht 474 | htm 475 | html 476 | http 477 | hu 478 | hundred 479 | i 480 | i'd 481 | i'll 482 | i'm 483 | i've 484 | i.e. 
485 | id 486 | ie 487 | if 488 | ignored 489 | ii 490 | il 491 | ill 492 | im 493 | immediate 494 | immediately 495 | importance 496 | important 497 | in 498 | inasmuch 499 | inc 500 | inc. 501 | indeed 502 | index 503 | indicate 504 | indicated 505 | indicates 506 | information 507 | inner 508 | inside 509 | insofar 510 | instead 511 | int 512 | interest 513 | interested 514 | interesting 515 | interests 516 | into 517 | invention 518 | inward 519 | io 520 | iq 521 | ir 522 | is 523 | isn 524 | isn't 525 | isnt 526 | it 527 | it'd 528 | it'll 529 | it's 530 | itd 531 | itll 532 | its 533 | itself 534 | itse” 535 | ive 536 | j 537 | je 538 | jm 539 | jo 540 | join 541 | jp 542 | just 543 | k 544 | ke 545 | keep 546 | keeps 547 | kept 548 | keys 549 | kg 550 | kh 551 | ki 552 | kind 553 | km 554 | kn 555 | knew 556 | know 557 | known 558 | knows 559 | kp 560 | kr 561 | kw 562 | ky 563 | kz 564 | l 565 | la 566 | large 567 | largely 568 | last 569 | lately 570 | later 571 | latest 572 | latter 573 | latterly 574 | lb 575 | lc 576 | least 577 | length 578 | less 579 | lest 580 | let 581 | let's 582 | lets 583 | li 584 | like 585 | liked 586 | likely 587 | likewise 588 | line 589 | little 590 | lk 591 | ll 592 | long 593 | longer 594 | longest 595 | look 596 | looking 597 | looks 598 | low 599 | lower 600 | lr 601 | ls 602 | lt 603 | ltd 604 | lu 605 | lv 606 | ly 607 | m 608 | ma 609 | made 610 | mainly 611 | make 612 | makes 613 | making 614 | man 615 | many 616 | may 617 | maybe 618 | mayn't 619 | maynt 620 | mc 621 | md 622 | me 623 | mean 624 | means 625 | meantime 626 | meanwhile 627 | member 628 | members 629 | men 630 | merely 631 | mg 632 | mh 633 | microsoft 634 | might 635 | might've 636 | mightn't 637 | mightnt 638 | mil 639 | mill 640 | million 641 | mine 642 | minus 643 | miss 644 | mk 645 | ml 646 | mm 647 | mn 648 | mo 649 | more 650 | moreover 651 | most 652 | mostly 653 | move 654 | mp 655 | mq 656 | mr 657 | mrs 658 | ms 659 | msie 660 | mt 661 | mu 662 | much 663 | mug 664 | must 665 | must've 666 | mustn't 667 | mustnt 668 | mv 669 | mw 670 | mx 671 | my 672 | myself 673 | myse” 674 | mz 675 | n 676 | na 677 | name 678 | namely 679 | nay 680 | nc 681 | nd 682 | ne 683 | near 684 | nearly 685 | necessarily 686 | necessary 687 | need 688 | needed 689 | needing 690 | needn't 691 | neednt 692 | needs 693 | neither 694 | net 695 | netscape 696 | never 697 | neverf 698 | neverless 699 | nevertheless 700 | new 701 | newer 702 | newest 703 | next 704 | nf 705 | ng 706 | ni 707 | nine 708 | ninety 709 | nl 710 | no 711 | no-one 712 | nobody 713 | non 714 | none 715 | nonetheless 716 | noone 717 | nor 718 | normally 719 | nos 720 | not 721 | noted 722 | nothing 723 | notwithstanding 724 | novel 725 | now 726 | nowhere 727 | np 728 | nr 729 | nu 730 | null 731 | number 732 | numbers 733 | nz 734 | o 735 | obtain 736 | obtained 737 | obviously 738 | of 739 | off 740 | often 741 | oh 742 | ok 743 | okay 744 | old 745 | older 746 | oldest 747 | om 748 | omitted 749 | on 750 | once 751 | one 752 | one's 753 | ones 754 | only 755 | onto 756 | open 757 | opened 758 | opening 759 | opens 760 | opposite 761 | or 762 | ord 763 | order 764 | ordered 765 | ordering 766 | orders 767 | org 768 | other 769 | others 770 | otherwise 771 | ought 772 | oughtn't 773 | oughtnt 774 | our 775 | ours 776 | ourselves 777 | out 778 | outside 779 | over 780 | overall 781 | owing 782 | own 783 | p 784 | pa 785 | page 786 | pages 787 | part 788 | parted 789 | particular 790 | particularly 791 | parting 792 | 
parts 793 | past 794 | pe 795 | per 796 | perhaps 797 | pf 798 | pg 799 | ph 800 | pk 801 | pl 802 | place 803 | placed 804 | places 805 | please 806 | plus 807 | pm 808 | pmid 809 | pn 810 | point 811 | pointed 812 | pointing 813 | points 814 | poorly 815 | possible 816 | possibly 817 | potentially 818 | pp 819 | pr 820 | predominantly 821 | present 822 | presented 823 | presenting 824 | presents 825 | presumably 826 | previously 827 | primarily 828 | probably 829 | problem 830 | problems 831 | promptly 832 | proud 833 | provided 834 | provides 835 | pt 836 | put 837 | puts 838 | pw 839 | py 840 | q 841 | qa 842 | que 843 | quickly 844 | quite 845 | qv 846 | r 847 | ran 848 | rather 849 | rd 850 | re 851 | readily 852 | really 853 | reasonably 854 | recent 855 | recently 856 | ref 857 | refs 858 | regarding 859 | regardless 860 | regards 861 | related 862 | relatively 863 | research 864 | reserved 865 | respectively 866 | resulted 867 | resulting 868 | results 869 | right 870 | ring 871 | ro 872 | room 873 | rooms 874 | round 875 | ru 876 | run 877 | rw 878 | s 879 | sa 880 | said 881 | same 882 | saw 883 | say 884 | saying 885 | says 886 | sb 887 | sc 888 | sd 889 | se 890 | sec 891 | second 892 | secondly 893 | seconds 894 | section 895 | see 896 | seeing 897 | seem 898 | seemed 899 | seeming 900 | seems 901 | seen 902 | sees 903 | self 904 | selves 905 | sensible 906 | sent 907 | serious 908 | seriously 909 | seven 910 | seventy 911 | several 912 | sg 913 | sh 914 | shall 915 | shan't 916 | shant 917 | she 918 | she'd 919 | she'll 920 | she's 921 | shed 922 | shell 923 | shes 924 | should 925 | should've 926 | shouldn 927 | shouldn't 928 | shouldnt 929 | show 930 | showed 931 | showing 932 | shown 933 | showns 934 | shows 935 | si 936 | side 937 | sides 938 | significant 939 | significantly 940 | similar 941 | similarly 942 | since 943 | sincere 944 | site 945 | six 946 | sixty 947 | sj 948 | sk 949 | sl 950 | slightly 951 | sm 952 | small 953 | smaller 954 | smallest 955 | sn 956 | so 957 | some 958 | somebody 959 | someday 960 | somehow 961 | someone 962 | somethan 963 | something 964 | sometime 965 | sometimes 966 | somewhat 967 | somewhere 968 | soon 969 | sorry 970 | specifically 971 | specified 972 | specify 973 | specifying 974 | sr 975 | st 976 | state 977 | states 978 | still 979 | stop 980 | strongly 981 | su 982 | sub 983 | substantially 984 | successfully 985 | such 986 | sufficiently 987 | suggest 988 | sup 989 | sure 990 | sv 991 | sy 992 | system 993 | sz 994 | t 995 | t's 996 | take 997 | taken 998 | taking 999 | tc 1000 | td 1001 | tell 1002 | ten 1003 | tends 1004 | test 1005 | text 1006 | tf 1007 | tg 1008 | th 1009 | than 1010 | thank 1011 | thanks 1012 | thanx 1013 | that 1014 | that'll 1015 | that's 1016 | that've 1017 | thatll 1018 | thats 1019 | thatve 1020 | the 1021 | their 1022 | theirs 1023 | them 1024 | themselves 1025 | then 1026 | thence 1027 | there 1028 | there'd 1029 | there'll 1030 | there're 1031 | there's 1032 | there've 1033 | thereafter 1034 | thereby 1035 | thered 1036 | therefore 1037 | therein 1038 | therell 1039 | thereof 1040 | therere 1041 | theres 1042 | thereto 1043 | thereupon 1044 | thereve 1045 | these 1046 | they 1047 | they'd 1048 | they'll 1049 | they're 1050 | they've 1051 | theyd 1052 | theyll 1053 | theyre 1054 | theyve 1055 | thick 1056 | thin 1057 | thing 1058 | things 1059 | think 1060 | thinks 1061 | third 1062 | thirty 1063 | this 1064 | thorough 1065 | thoroughly 1066 | those 1067 | thou 1068 | though 1069 | thoughh 1070 | 
thought 1071 | thoughts 1072 | thousand 1073 | three 1074 | throug 1075 | through 1076 | throughout 1077 | thru 1078 | thus 1079 | til 1080 | till 1081 | tip 1082 | tis 1083 | tj 1084 | tk 1085 | tm 1086 | tn 1087 | to 1088 | today 1089 | together 1090 | too 1091 | took 1092 | top 1093 | toward 1094 | towards 1095 | tp 1096 | tr 1097 | tried 1098 | tries 1099 | trillion 1100 | truly 1101 | try 1102 | trying 1103 | ts 1104 | tt 1105 | turn 1106 | turned 1107 | turning 1108 | turns 1109 | tv 1110 | tw 1111 | twas 1112 | twelve 1113 | twenty 1114 | twice 1115 | two 1116 | tz 1117 | u 1118 | ua 1119 | ug 1120 | uk 1121 | um 1122 | un 1123 | under 1124 | underneath 1125 | undoing 1126 | unfortunately 1127 | unless 1128 | unlike 1129 | unlikely 1130 | until 1131 | unto 1132 | up 1133 | upon 1134 | ups 1135 | upwards 1136 | us 1137 | use 1138 | used 1139 | useful 1140 | usefully 1141 | usefulness 1142 | uses 1143 | using 1144 | usually 1145 | uucp 1146 | uy 1147 | uz 1148 | v 1149 | va 1150 | value 1151 | various 1152 | vc 1153 | ve 1154 | versus 1155 | very 1156 | vg 1157 | vi 1158 | via 1159 | viz 1160 | vn 1161 | vol 1162 | vols 1163 | vs 1164 | vu 1165 | w 1166 | want 1167 | wanted 1168 | wanting 1169 | wants 1170 | was 1171 | wasn 1172 | wasn't 1173 | wasnt 1174 | way 1175 | ways 1176 | we 1177 | we'd 1178 | we'll 1179 | we're 1180 | we've 1181 | web 1182 | webpage 1183 | website 1184 | wed 1185 | welcome 1186 | well 1187 | wells 1188 | went 1189 | were 1190 | weren 1191 | weren't 1192 | werent 1193 | weve 1194 | wf 1195 | what 1196 | what'd 1197 | what'll 1198 | what's 1199 | what've 1200 | whatever 1201 | whatll 1202 | whats 1203 | whatve 1204 | when 1205 | when'd 1206 | when'll 1207 | when's 1208 | whence 1209 | whenever 1210 | where 1211 | where'd 1212 | where'll 1213 | where's 1214 | whereafter 1215 | whereas 1216 | whereby 1217 | wherein 1218 | wheres 1219 | whereupon 1220 | wherever 1221 | whether 1222 | which 1223 | whichever 1224 | while 1225 | whilst 1226 | whim 1227 | whither 1228 | who 1229 | who'd 1230 | who'll 1231 | who's 1232 | whod 1233 | whoever 1234 | whole 1235 | wholl 1236 | whom 1237 | whomever 1238 | whos 1239 | whose 1240 | why 1241 | why'd 1242 | why'll 1243 | why's 1244 | widely 1245 | width 1246 | will 1247 | willing 1248 | wish 1249 | with 1250 | within 1251 | without 1252 | won 1253 | won't 1254 | wonder 1255 | wont 1256 | words 1257 | work 1258 | worked 1259 | working 1260 | works 1261 | world 1262 | would 1263 | would've 1264 | wouldn 1265 | wouldn't 1266 | wouldnt 1267 | ws 1268 | www 1269 | x 1270 | y 1271 | ye 1272 | year 1273 | years 1274 | yes 1275 | yet 1276 | you 1277 | you'd 1278 | you'll 1279 | you're 1280 | you've 1281 | youd 1282 | youll 1283 | young 1284 | younger 1285 | youngest 1286 | your 1287 | youre 1288 | yours 1289 | yourself 1290 | yourselves 1291 | youve 1292 | yt 1293 | yu 1294 | z 1295 | za 1296 | zero 1297 | zm 1298 | zr -------------------------------------------------------------------------------- /assets/stopwords-es.txt: -------------------------------------------------------------------------------- 1 | 0 2 | 1 3 | 2 4 | 3 5 | 4 6 | 5 7 | 6 8 | 7 9 | 8 10 | 9 11 | _ 12 | a 13 | actualmente 14 | acuerdo 15 | adelante 16 | ademas 17 | además 18 | adrede 19 | afirmó 20 | agregó 21 | ahi 22 | ahora 23 | ahí 24 | al 25 | algo 26 | alguna 27 | algunas 28 | alguno 29 | algunos 30 | algún 31 | alli 32 | allí 33 | alrededor 34 | ambos 35 | ampleamos 36 | antano 37 | antaño 38 | ante 39 | anterior 40 | antes 41 | apenas 42 | aproximadamente 
43 | aquel 44 | aquella 45 | aquellas 46 | aquello 47 | aquellos 48 | aqui 49 | aquél 50 | aquélla 51 | aquéllas 52 | aquéllos 53 | aquí 54 | arriba 55 | arribaabajo 56 | aseguró 57 | asi 58 | así 59 | atras 60 | aun 61 | aunque 62 | ayer 63 | añadió 64 | aún 65 | b 66 | bajo 67 | bastante 68 | bien 69 | breve 70 | buen 71 | buena 72 | buenas 73 | bueno 74 | buenos 75 | c 76 | cada 77 | casi 78 | cerca 79 | cierta 80 | ciertas 81 | cierto 82 | ciertos 83 | cinco 84 | claro 85 | comentó 86 | como 87 | con 88 | conmigo 89 | conocer 90 | conseguimos 91 | conseguir 92 | considera 93 | consideró 94 | consigo 95 | consigue 96 | consiguen 97 | consigues 98 | contigo 99 | contra 100 | cosas 101 | creo 102 | cual 103 | cuales 104 | cualquier 105 | cuando 106 | cuanta 107 | cuantas 108 | cuanto 109 | cuantos 110 | cuatro 111 | cuenta 112 | cuál 113 | cuáles 114 | cuándo 115 | cuánta 116 | cuántas 117 | cuánto 118 | cuántos 119 | cómo 120 | d 121 | da 122 | dado 123 | dan 124 | dar 125 | de 126 | debajo 127 | debe 128 | deben 129 | debido 130 | decir 131 | dejó 132 | del 133 | delante 134 | demasiado 135 | demás 136 | dentro 137 | deprisa 138 | desde 139 | despacio 140 | despues 141 | después 142 | detras 143 | detrás 144 | dia 145 | dias 146 | dice 147 | dicen 148 | dicho 149 | dieron 150 | diferente 151 | diferentes 152 | dijeron 153 | dijo 154 | dio 155 | donde 156 | dos 157 | durante 158 | día 159 | días 160 | dónde 161 | e 162 | ejemplo 163 | el 164 | ella 165 | ellas 166 | ello 167 | ellos 168 | embargo 169 | empleais 170 | emplean 171 | emplear 172 | empleas 173 | empleo 174 | en 175 | encima 176 | encuentra 177 | enfrente 178 | enseguida 179 | entonces 180 | entre 181 | era 182 | erais 183 | eramos 184 | eran 185 | eras 186 | eres 187 | es 188 | esa 189 | esas 190 | ese 191 | eso 192 | esos 193 | esta 194 | estaba 195 | estabais 196 | estaban 197 | estabas 198 | estad 199 | estada 200 | estadas 201 | estado 202 | estados 203 | estais 204 | estamos 205 | estan 206 | estando 207 | estar 208 | estaremos 209 | estará 210 | estarán 211 | estarás 212 | estaré 213 | estaréis 214 | estaría 215 | estaríais 216 | estaríamos 217 | estarían 218 | estarías 219 | estas 220 | este 221 | estemos 222 | esto 223 | estos 224 | estoy 225 | estuve 226 | estuviera 227 | estuvierais 228 | estuvieran 229 | estuvieras 230 | estuvieron 231 | estuviese 232 | estuvieseis 233 | estuviesen 234 | estuvieses 235 | estuvimos 236 | estuviste 237 | estuvisteis 238 | estuviéramos 239 | estuviésemos 240 | estuvo 241 | está 242 | estábamos 243 | estáis 244 | están 245 | estás 246 | esté 247 | estéis 248 | estén 249 | estés 250 | ex 251 | excepto 252 | existe 253 | existen 254 | explicó 255 | expresó 256 | f 257 | fin 258 | final 259 | fue 260 | fuera 261 | fuerais 262 | fueran 263 | fueras 264 | fueron 265 | fuese 266 | fueseis 267 | fuesen 268 | fueses 269 | fui 270 | fuimos 271 | fuiste 272 | fuisteis 273 | fuéramos 274 | fuésemos 275 | g 276 | general 277 | gran 278 | grandes 279 | gueno 280 | h 281 | ha 282 | haber 283 | habia 284 | habida 285 | habidas 286 | habido 287 | habidos 288 | habiendo 289 | habla 290 | hablan 291 | habremos 292 | habrá 293 | habrán 294 | habrás 295 | habré 296 | habréis 297 | habría 298 | habríais 299 | habríamos 300 | habrían 301 | habrías 302 | habéis 303 | había 304 | habíais 305 | habíamos 306 | habían 307 | habías 308 | hace 309 | haceis 310 | hacemos 311 | hacen 312 | hacer 313 | hacerlo 314 | haces 315 | hacia 316 | haciendo 317 | hago 318 | han 319 | has 320 | hasta 321 | hay 322 | haya 323 
| hayamos 324 | hayan 325 | hayas 326 | hayáis 327 | he 328 | hecho 329 | hemos 330 | hicieron 331 | hizo 332 | horas 333 | hoy 334 | hube 335 | hubiera 336 | hubierais 337 | hubieran 338 | hubieras 339 | hubieron 340 | hubiese 341 | hubieseis 342 | hubiesen 343 | hubieses 344 | hubimos 345 | hubiste 346 | hubisteis 347 | hubiéramos 348 | hubiésemos 349 | hubo 350 | i 351 | igual 352 | incluso 353 | indicó 354 | informo 355 | informó 356 | intenta 357 | intentais 358 | intentamos 359 | intentan 360 | intentar 361 | intentas 362 | intento 363 | ir 364 | j 365 | junto 366 | k 367 | l 368 | la 369 | lado 370 | largo 371 | las 372 | le 373 | lejos 374 | les 375 | llegó 376 | lleva 377 | llevar 378 | lo 379 | los 380 | luego 381 | lugar 382 | m 383 | mal 384 | manera 385 | manifestó 386 | mas 387 | mayor 388 | me 389 | mediante 390 | medio 391 | mejor 392 | mencionó 393 | menos 394 | menudo 395 | mi 396 | mia 397 | mias 398 | mientras 399 | mio 400 | mios 401 | mis 402 | misma 403 | mismas 404 | mismo 405 | mismos 406 | modo 407 | momento 408 | mucha 409 | muchas 410 | mucho 411 | muchos 412 | muy 413 | más 414 | mí 415 | mía 416 | mías 417 | mío 418 | míos 419 | n 420 | nada 421 | nadie 422 | ni 423 | ninguna 424 | ningunas 425 | ninguno 426 | ningunos 427 | ningún 428 | no 429 | nos 430 | nosotras 431 | nosotros 432 | nuestra 433 | nuestras 434 | nuestro 435 | nuestros 436 | nueva 437 | nuevas 438 | nuevo 439 | nuevos 440 | nunca 441 | o 442 | ocho 443 | os 444 | otra 445 | otras 446 | otro 447 | otros 448 | p 449 | pais 450 | para 451 | parece 452 | parte 453 | partir 454 | pasada 455 | pasado 456 | paìs 457 | peor 458 | pero 459 | pesar 460 | poca 461 | pocas 462 | poco 463 | pocos 464 | podeis 465 | podemos 466 | poder 467 | podria 468 | podriais 469 | podriamos 470 | podrian 471 | podrias 472 | podrá 473 | podrán 474 | podría 475 | podrían 476 | poner 477 | por 478 | por qué 479 | porque 480 | posible 481 | primer 482 | primera 483 | primero 484 | primeros 485 | principalmente 486 | pronto 487 | propia 488 | propias 489 | propio 490 | propios 491 | proximo 492 | próximo 493 | próximos 494 | pudo 495 | pueda 496 | puede 497 | pueden 498 | puedo 499 | pues 500 | q 501 | qeu 502 | que 503 | quedó 504 | queremos 505 | quien 506 | quienes 507 | quiere 508 | quiza 509 | quizas 510 | quizá 511 | quizás 512 | quién 513 | quiénes 514 | qué 515 | r 516 | raras 517 | realizado 518 | realizar 519 | realizó 520 | repente 521 | respecto 522 | s 523 | sabe 524 | sabeis 525 | sabemos 526 | saben 527 | saber 528 | sabes 529 | sal 530 | salvo 531 | se 532 | sea 533 | seamos 534 | sean 535 | seas 536 | segun 537 | segunda 538 | segundo 539 | según 540 | seis 541 | ser 542 | sera 543 | seremos 544 | será 545 | serán 546 | serás 547 | seré 548 | seréis 549 | sería 550 | seríais 551 | seríamos 552 | serían 553 | serías 554 | seáis 555 | señaló 556 | si 557 | sido 558 | siempre 559 | siendo 560 | siete 561 | sigue 562 | siguiente 563 | sin 564 | sino 565 | sobre 566 | sois 567 | sola 568 | solamente 569 | solas 570 | solo 571 | solos 572 | somos 573 | son 574 | soy 575 | soyos 576 | su 577 | supuesto 578 | sus 579 | suya 580 | suyas 581 | suyo 582 | suyos 583 | sé 584 | sí 585 | sólo 586 | t 587 | tal 588 | tambien 589 | también 590 | tampoco 591 | tan 592 | tanto 593 | tarde 594 | te 595 | temprano 596 | tendremos 597 | tendrá 598 | tendrán 599 | tendrás 600 | tendré 601 | tendréis 602 | tendría 603 | tendríais 604 | tendríamos 605 | tendrían 606 | tendrías 607 | tened 608 | teneis 609 | tenemos 610 | tener 
611 | tenga 612 | tengamos 613 | tengan 614 | tengas 615 | tengo 616 | tengáis 617 | tenida 618 | tenidas 619 | tenido 620 | tenidos 621 | teniendo 622 | tenéis 623 | tenía 624 | teníais 625 | teníamos 626 | tenían 627 | tenías 628 | tercera 629 | ti 630 | tiempo 631 | tiene 632 | tienen 633 | tienes 634 | toda 635 | todas 636 | todavia 637 | todavía 638 | todo 639 | todos 640 | total 641 | trabaja 642 | trabajais 643 | trabajamos 644 | trabajan 645 | trabajar 646 | trabajas 647 | trabajo 648 | tras 649 | trata 650 | través 651 | tres 652 | tu 653 | tus 654 | tuve 655 | tuviera 656 | tuvierais 657 | tuvieran 658 | tuvieras 659 | tuvieron 660 | tuviese 661 | tuvieseis 662 | tuviesen 663 | tuvieses 664 | tuvimos 665 | tuviste 666 | tuvisteis 667 | tuviéramos 668 | tuviésemos 669 | tuvo 670 | tuya 671 | tuyas 672 | tuyo 673 | tuyos 674 | tú 675 | u 676 | ultimo 677 | un 678 | una 679 | unas 680 | uno 681 | unos 682 | usa 683 | usais 684 | usamos 685 | usan 686 | usar 687 | usas 688 | uso 689 | usted 690 | ustedes 691 | v 692 | va 693 | vais 694 | valor 695 | vamos 696 | van 697 | varias 698 | varios 699 | vaya 700 | veces 701 | ver 702 | verdad 703 | verdadera 704 | verdadero 705 | vez 706 | vosotras 707 | vosotros 708 | voy 709 | vuestra 710 | vuestras 711 | vuestro 712 | vuestros 713 | w 714 | x 715 | y 716 | ya 717 | yo 718 | z 719 | él 720 | éramos 721 | ésa 722 | ésas 723 | ése 724 | ésos 725 | ésta 726 | éstas 727 | éste 728 | éstos 729 | última 730 | últimas 731 | último 732 | últimos -------------------------------------------------------------------------------- /assets/whitelist.txt: -------------------------------------------------------------------------------- 1 | rotativo.com.mx 2 | excelsior.com.mx 3 | yogonet.com 4 | eluniversal.com.mx 5 | nyti.ms 6 | unocero.com 7 | mexico.com 8 | thecoinrepublic.com 9 | costumbres.de 10 | bbc.com 11 | avclub.com 12 | infobae.com 13 | news24.com 14 | nasa.gov 15 | sdpnoticias.com 16 | jetnews.com.mx 17 | razon.com.mx 18 | elceo.com 19 | arenapublica.com 20 | diarioelindependiente.mx 21 | pscp.tv 22 | plumasatomicas.com 23 | regeneracion.mx 24 | mvsnoticias.com 25 | publimetro.com.mx 26 | themexico.news 27 | aristeguinoticias.com 28 | pulsoslp.com.mx 29 | diputados.gob.mx 30 | diariodequeretaro.com.mx 31 | nnc.mx 32 | frontera.info 33 | bloomberg.com 34 | lopezobrador.org.mx 35 | asisucedegto.mx 36 | xeu.mx 37 | xevt.com 38 | 24-horas.mx 39 | politico.mx 40 | festivosmexico.com.mx 41 | lavozdechile.com 42 | noticiaslapaz.com 43 | milenio.com 44 | theconservativetreehouse.com 45 | chalenoticias.mx 46 | breaking.com.mx 47 | miamiherald.com 48 | economiahoy.mx 49 | argumentopolitico.com 50 | elfinanciero.com.mx 51 | reporteroshoy.mx 52 | vanguardia.com.mx 53 | laopcion.com.mx 54 | elexpres.com 55 | elindependientedehidalgo.com.mx 56 | canalsonora.com 57 | diariocambio.com.mx 58 | nexos.com.mx 59 | newsweek.com 60 | xataka.com.mx 61 | ampproject.org 62 | zetatijuana.com 63 | brainwala.com 64 | tumblr.com 65 | sipse.com 66 | periodicocorreo.com.mx 67 | imparcialoaxaca.mx 68 | ejecentral.com.mx 69 | mas-mexico.com.mx 70 | elsoldepuebla.com.mx 71 | lasestrellas.tv 72 | coachesvoice.com 73 | psicologiaymente.com 74 | reportur.com 75 | themazatlanpost.com 76 | sg.com.mx 77 | superrucos.com 78 | elsoldeacapulco.com.mx 79 | elpais.com 80 | elmercurio.com.mx 81 | taringa.net 82 | oilandgasmagazine.com.mx 83 | proceso.com.mx 84 | lanetanoticias.com 85 | suracapulco.mx 86 | bancomundial.org 87 | cletofilia.com 88 | aztecanoticias.com.mx 89 | 
periodicoelmexicano.com.mx 90 | imagenradio.com.mx 91 | animalpolitico.com 92 | tiempo.com.mx 93 | forbes.com.mx 94 | eia.gov 95 | casede.org 96 | eleconomista.com.mx 97 | sinembargo.mx 98 | huffingtonpost.com.mx 99 | zocalo.com.mx 100 | www.gob.mx 101 | aem.gob.mx 102 | clipperdata.com 103 | expreso.com.mx 104 | elsoldemexico.com.mx 105 | streamable.com 106 | lacronica.com 107 | televisa.com 108 | am.com.mx 109 | mexnewz.mx 110 | beeg1.net 111 | moreloshabla.com 112 | washingtonpost.com 113 | dailywire.com 114 | soyhomosensual.com 115 | ft.com 116 | wsj.com 117 | blogspot.com 118 | wsws.org 119 | nacionunida.com 120 | society6.com 121 | telesurenglish.net 122 | independent.co.uk 123 | revistaei.cl 124 | amazon.com.mx 125 | escapadeland.com 126 | elnuevoherald.com 127 | mxcity.mx 128 | tribuna.com.mx 129 | lasillarota.com 130 | tabascohoy.com 131 | bitcoinrealcash.com 132 | informador.mx 133 | netnoticias.mx 134 | heraldodemexico.com.mx 135 | businessinsider.com 136 | sapiens.org 137 | monitoreconomico.org 138 | forbes.com 139 | elsoldelbajio.com.mx 140 | sputniknews.com 141 | versiones.com.mx 142 | quadratin.com.mx 143 | omnia.com.mx 144 | wordpress.com 145 | theregister.co.uk 146 | bbc.co.uk 147 | novedadesaca.mx 148 | dineroenimagen.com 149 | elhorizonte.mx 150 | opensocietyfoundations.org 151 | unimexicali.com 152 | asistepemex.org 153 | radioformula.com.mx 154 | reforma.com 155 | ibb.co 156 | laverdadnoticias.com 157 | nytimes.com 158 | notisistema.com 159 | reliefweb.int 160 | lavozdeperu.com 161 | abcnoticias.mx 162 | itam.mx 163 | jornada.com.mx 164 | parametria.com.mx 165 | unomasuno.com.mx 166 | commondreams.org 167 | theguardian.com 168 | sptnkne.ws 169 | josecardenas.com 170 | rascamapas.com 171 | segundoasegundo.com 172 | reporteindigo.com 173 | globo.com 174 | rasnoticias.mx 175 | maritimeherald.com 176 | jetbrains.com 177 | lopezdoriga.com 178 | cns.gob.mx 179 | livejournal.com 180 | desastre.mx 181 | mexicodesconocido.com.mx 182 | yahoo.com 183 | allerorts.de 184 | diario.mx 185 | bcsnoticias.mx 186 | noticiasdequeretaro.com.mx 187 | expansion.mx 188 | elimparcial.com 189 | cargonewsmex.com 190 | contrareplica.mx 191 | unam.mx 192 | lavozdelafrontera.com.mx 193 | terceravia.mx 194 | latercera.com 195 | acustiknoticias.com 196 | riodoce.mx 197 | adnpolitico.com 198 | fayerwayer.com 199 | horizontal.mx 200 | wradio.com.mx 201 | diariodecolima.com 202 | noticiaszmg.com 203 | elmanana.com 204 | altonivel.com.mx 205 | elsiglodetorreon.com.mx 206 | eldiariodechihuahua.mx 207 | declarenews.com 208 | reuters.com 209 | thelocal.se 210 | hongkongfp.com 211 | canoe.com 212 | indiatimes.com 213 | faroutmagazine.co.uk 214 | els5ra.com 215 | physicalfitnesscare.com 216 | bients.com 217 | udefense.info 218 | wildwechsel.de 219 | viralnewsdrift.com 220 | chinaro.ir 221 | dankpupper.com 222 | mightykingseo.com 223 | thesun.co.uk 224 | theverge.com 225 | thenewcivilrightsmovement.com 226 | truthdig.com 227 | channelnewsasia.com 228 | dailytelegraph.com.au 229 | centicsystems.com 230 | abracadabranoticias.com 231 | rawstory.com 232 | evolutionalblogs.com 233 | euronews.com 234 | haaretz.com 235 | loversofcats.com 236 | ndtv.com 237 | livetrendynews.com 238 | pakthought.com 239 | scmp.com 240 | euractiv.com 241 | aviralupdate.com 242 | france24.com 243 | theindianwire.com 244 | aljazeera.com 245 | wnobserver.com 246 | tessyinfohub.com 247 | newyorker.com 248 | kartiavelino.com 249 | newrightnetwork.com 250 | atusocialscience.ir 251 | samrattailors.com 252 | trendnewsworld.com 253 | zdnet.com 
254 | nicepatogh.ir 255 | news18.com 256 | apsense.com 257 | virapars.com 258 | newsunbox.com 259 | nationalpost.com 260 | trendnewsweb.com 261 | globalnews.ca 262 | huffpost.com 263 | thenewsobservers.com 264 | bestnewsviral.com 265 | rt.com 266 | madamasr.com 267 | standard.co.uk 268 | local10.com 269 | telegraph.co.uk 270 | time8.in 271 | thehill.com 272 | timesofisrael.com 273 | dailymail.co.uk 274 | kyivpost.com 275 | indiatvnews.com 276 | talkingpointsmemo.com 277 | livemint.com 278 | sabq.org 279 | veteranstoday.com 280 | isibcase.ir 281 | dailytimes.com.pk 282 | thedailybeast.com 283 | total-croatia-news.com 284 | articlescad.com 285 | writeup.co.in 286 | gonewsviral.com 287 | thewire.in 288 | npr.org 289 | theglobeandmail.com 290 | nbcnews.com 291 | viralspicynews.com 292 | app.link 293 | trtworld.com 294 | unionjournalism.com 295 | winnaijatv.com 296 | jpost.com 297 | politico.com 298 | cnn.com 299 | walesonline.co.uk 300 | highmarksecurity.com 301 | indiatoday.in 302 | medium.com 303 | viralreportnow.com 304 | thegrowthop.com 305 | sky.com 306 | gappoo.com 307 | nymag.com 308 | viraltopiczone.com 309 | rfa.org 310 | apnews.com 311 | newindianexpress.com 312 | dailykos.com 313 | dw.com 314 | middleeastmonitor.com 315 | msn.com 316 | truescoopnews.com 317 | sbs.com.au 318 | discountbook.ir 319 | tnewst.com 320 | birminghammail.co.uk 321 | dailytrendshunter.com 322 | usnews.com 323 | dawn.com 324 | abc.net.au 325 | trendynewstime.com 326 | hindustantimes.com 327 | spacenews.com 328 | acneuro.co.uk 329 | washingtonexaminer.com 330 | cbc.ca 331 | rightsanddissent.org 332 | tribune.com.pk 333 | dailysabah.com 334 | today.com 335 | faitmain.ma 336 | dailyhive.com 337 | trendyupdatenews.com 338 | ctvnews.ca 339 | businessinsider.co.za 340 | usatoday.com 341 | viralupfeed.com 342 | indianexpress.com 343 | batask.ir 344 | foxnews.com 345 | thenews.com.pk 346 | cnbc.com 347 | newsdaily.today 348 | firstpost.com 349 | morningstaronline.co.uk 350 | ntrguadalajara.com 351 | nationalgeographic.com 352 | cronica.com.mx 353 | debate.com.mx 354 | guanajuatoinforma.com 355 | yucatan.com.mx 356 | sinlineamx.com 357 | sintesistv.com.mx 358 | laprensademonclova.com 359 | sandiegored.com 360 | turquesanews.mx 361 | enlapolitika.com 362 | lineadirectaportal.com 363 | hoyestado.com 364 | alcalorpolitico.com 365 | cafenegroportal.com 366 | noroeste.com.mx 367 | lineadirectaportal.com 368 | mediotiempo.com 369 | unotv.com 370 | criteriohidalgo.com 371 | xeva.com.mx 372 | quintafuerza.mx 373 | latinus.us 374 | verificado.com.mx 375 | lavanguardia.com 376 | nature.com 377 | -------------------------------------------------------------------------------- /bot.py: -------------------------------------------------------------------------------- 1 | """ 2 | Inits the summary bot. It starts a Reddit instance using PRAW, gets the latest posts 3 | and filters those who have already been processed. 4 | """ 5 | 6 | import praw 7 | import requests 8 | import tldextract 9 | 10 | import cloud 11 | import config 12 | import scraper 13 | import summary 14 | 15 | # We don't reply to posts which have a very small or very high reduction. 16 | MINIMUM_REDUCTION_THRESHOLD = 20 17 | MAXIMUM_REDUCTION_THRESHOLD = 68 18 | 19 | # File locations 20 | POSTS_LOG = "./processed_posts.txt" 21 | WHITELIST_FILE = "./assets/whitelist.txt" 22 | ERROR_LOG = "./error.log" 23 | 24 | # Templates. 
25 | TEMPLATE = open("./templates/es.txt", "r", encoding="utf-8").read() 26 | 27 | 28 | HEADERS = {"User-Agent": "Summarizer v2.0"} 29 | 30 | 31 | def load_whitelist(): 32 | """Reads the processed posts log file and creates it if it doesn't exist. 33 | 34 | Returns 35 | ------- 36 | list 37 | A list of domains that are confirmed to have an 'article' tag. 38 | 39 | """ 40 | 41 | with open(WHITELIST_FILE, "r", encoding="utf-8") as log_file: 42 | return log_file.read().splitlines() 43 | 44 | 45 | def load_log(): 46 | """Reads the processed posts log file and creates it if it doesn't exist. 47 | 48 | Returns 49 | ------- 50 | list 51 | A list of Reddit posts ids. 52 | 53 | """ 54 | 55 | try: 56 | with open(POSTS_LOG, "r", encoding="utf-8") as log_file: 57 | return log_file.read().splitlines() 58 | 59 | except FileNotFoundError: 60 | with open(POSTS_LOG, "a", encoding="utf-8") as log_file: 61 | return [] 62 | 63 | 64 | def update_log(post_id): 65 | """Updates the processed posts log with the given post id. 66 | 67 | Parameters 68 | ---------- 69 | post_id : str 70 | A Reddit post id. 71 | 72 | """ 73 | 74 | with open(POSTS_LOG, "a", encoding="utf-8") as log_file: 75 | log_file.write("{}\n".format(post_id)) 76 | 77 | 78 | def log_error(error_message): 79 | """Updates the error log. 80 | 81 | Parameters 82 | ---------- 83 | error_message : str 84 | A string containing the faulty url and the exception message. 85 | 86 | """ 87 | 88 | with open(ERROR_LOG, "a", encoding="utf-8") as log_file: 89 | log_file.write("{}\n".format(error_message)) 90 | 91 | 92 | def init(): 93 | """Inits the bot.""" 94 | 95 | reddit = praw.Reddit(client_id=config.APP_ID, client_secret=config.APP_SECRET, 96 | user_agent=config.USER_AGENT, username=config.REDDIT_USERNAME, 97 | password=config.REDDIT_PASSWORD) 98 | 99 | processed_posts = load_log() 100 | whitelist = load_whitelist() 101 | 102 | for subreddit in config.SUBREDDITS: 103 | 104 | for submission in reddit.subreddit(subreddit).new(limit=50): 105 | 106 | if submission.id not in processed_posts: 107 | 108 | clean_url = submission.url.replace("amp.", "") 109 | ext = tldextract.extract(clean_url) 110 | domain = "{}.{}".format(ext.domain, ext.suffix) 111 | 112 | if domain in whitelist: 113 | 114 | try: 115 | with requests.get(clean_url, headers=HEADERS, timeout=10) as response: 116 | 117 | # Most of the times the encoding is utf-8 but in edge cases 118 | # we set it to ISO-8859-1 when it is present in the HTML header. 119 | if "iso-8859-1" in response.text.lower(): 120 | response.encoding = "iso-8859-1" 121 | elif response.encoding == "ISO-8859-1": 122 | response.encoding = "utf-8" 123 | 124 | html_source = response.text 125 | 126 | article_title, article_date, article_body = scraper.scrape_html( 127 | html_source) 128 | 129 | summary_dict = summary.get_summary(article_body) 130 | except Exception as e: 131 | log_error("{},{}".format(clean_url, e)) 132 | update_log(submission.id) 133 | print("Failed:", submission.id) 134 | continue 135 | 136 | # To reduce low quality submissions, we only process those that made a meaningful summary. 137 | if summary_dict["reduction"] >= MINIMUM_REDUCTION_THRESHOLD and summary_dict["reduction"] <= MAXIMUM_REDUCTION_THRESHOLD: 138 | 139 | # Create a wordcloud, upload it to Imgur and get back the url. 140 | image_url = cloud.generate_word_cloud( 141 | summary_dict["article_words"]) 142 | 143 | # We start creating the comment body. 
144 | post_body = "\n\n".join( 145 | ["> " + item for item in summary_dict["top_sentences"]]) 146 | 147 | top_words = "" 148 | 149 | for index, word in enumerate(summary_dict["top_words"]): 150 | top_words += "{}^#{} ".format(word, index+1) 151 | 152 | post_message = TEMPLATE.format( 153 | article_title, clean_url, summary_dict["reduction"], article_date, post_body, image_url, top_words) 154 | 155 | reddit.submission(submission.id).reply(post_message) 156 | update_log(submission.id) 157 | print("Replied to:", submission.id) 158 | else: 159 | update_log(submission.id) 160 | print("Skipped:", submission.id) 161 | 162 | 163 | if __name__ == "__main__": 164 | 165 | init() 166 | -------------------------------------------------------------------------------- /cloud.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script generates a word cloud from the article words. Uploads it to Imgur and returns back the url. 3 | """ 4 | 5 | import os 6 | import random 7 | 8 | import numpy as np 9 | import requests 10 | import wordcloud 11 | from PIL import Image 12 | 13 | import config 14 | 15 | MASK_FILE = "./assets/cloud.png" 16 | FONT_FILE = "./assets/sofiapro-light.otf" 17 | IMAGE_PATH = "./temp.png" 18 | 19 | COLORMAPS = ["spring", "summer", "autumn", "Wistia"] 20 | 21 | mask = np.array(Image.open(MASK_FILE)) 22 | 23 | 24 | def generate_word_cloud(text): 25 | """Generates a word cloud and uploads it to Imgur. 26 | 27 | Parameters 28 | ---------- 29 | text : str 30 | The text to be converted into a word cloud. 31 | 32 | Returns 33 | ------- 34 | str 35 | The url generated from the Imgur API. 36 | """ 37 | 38 | wc = wordcloud.WordCloud(background_color="#222222", 39 | max_words=2000, 40 | mask=mask, 41 | contour_width=2, 42 | colormap=random.choice(COLORMAPS), 43 | font_path=FONT_FILE, 44 | contour_color="white") 45 | 46 | wc.generate(text) 47 | wc.to_file(IMAGE_PATH) 48 | image_link = upload_image(IMAGE_PATH) 49 | os.remove(IMAGE_PATH) 50 | 51 | return image_link 52 | 53 | 54 | def upload_image(image_path): 55 | """Uploads an image to Imgur and returns the permanent link url. 56 | 57 | Parameters 58 | ---------- 59 | image_path : str 60 | The path of the file to be uploaded. 61 | 62 | Returns 63 | ------- 64 | str 65 | The url generated from the Imgur API. 66 | """ 67 | 68 | url = "https://api.imgur.com/3/image" 69 | headers = {"Authorization": "Client-ID " + config.IMGUR_CLIENT_ID} 70 | files = {"image": open(IMAGE_PATH, "rb")} 71 | 72 | with requests.post(url, headers=headers, files=files) as response: 73 | 74 | # We extract the new link from the response. 75 | image_link = response.json()["data"]["link"] 76 | 77 | return image_link 78 | -------------------------------------------------------------------------------- /cloud_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PhantomInsights/summarizer/d8b4d7745ca9ba4309fc9707b7c98ae143b97a10/cloud_example.png -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | """Required constants for the Reddit API.""" 2 | 3 | # The following constants are used by the bot. 
4 | REDDIT_USERNAME = "" 5 | REDDIT_PASSWORD = "" 6 | 7 | APP_ID = "" 8 | APP_SECRET = "" 9 | USER_AGENT = "" 10 | 11 | SUBREDDITS = ["mexico"] 12 | 13 | IMGUR_CLIENT_ID = "" -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PhantomInsights/summarizer/d8b4d7745ca9ba4309fc9707b7c98ae143b97a10/requirements.txt -------------------------------------------------------------------------------- /scraper.py: -------------------------------------------------------------------------------- 1 | """ 2 | This function tries to extract the article title, date and body from an HTML string. 3 | """ 4 | 5 | from datetime import datetime 6 | 7 | from bs4 import BeautifulSoup 8 | 9 | # We don't process articles that have fewer characters than this. 10 | ARTICLE_MINIMUM_LENGTH = 650 11 | 12 | 13 | def scrape_html(html_source): 14 | """Tries to scrape the article from the given HTML source. 15 | 16 | Parameters 17 | ---------- 18 | html_source : str 19 | The html source of the article. 20 | 21 | Returns 22 | ------- 23 | tuple 24 | The article title, date and body. 25 | 26 | """ 27 | 28 | # Very often the text between tags comes together, we add an artificial newline to each common tag. 29 | for item in ["
</p>", "</h1>", "</h2>", "</h3>", "</div>
"]: 30 | html_source = html_source.replace(item, item+"\n") 31 | 32 | # We create a BeautifulSOup object and remove the unnecessary tags. 33 | soup = BeautifulSoup(html_source, "html5lib") 34 | 35 | # Then we extract the title and the article tags. 36 | article_title = soup.find("title").text.replace("\n", " ").strip() 37 | 38 | # If our title is too short we fallback to the first h1 tag. 39 | if len(article_title) <= 5: 40 | article_title = soup.find("h1").text.replace("\n", " ").strip() 41 | 42 | article_date = "" 43 | 44 | # We look for the first meta tag that has the word 'time' in it. 45 | for item in soup.find_all("meta"): 46 | 47 | if "time" in item.get("property", ""): 48 | 49 | clean_date = item["content"].split("+")[0].replace("Z", "") 50 | 51 | # Use your preferred time formatting. 52 | article_date = "{:%d-%m-%Y a las %H:%M:%S}".format( 53 | datetime.fromisoformat(clean_date)) 54 | break 55 | 56 | # If we didn't find any meta tag with a datetime we look for a 'time' tag. 57 | if len(article_date) <= 5: 58 | try: 59 | article_date = soup.find("time").text.strip() 60 | except: 61 | pass 62 | 63 | # We remove some tags that add noise. 64 | [tag.extract() for tag in soup.find_all( 65 | ["script", "img", "ol", "ul", "time", "h1", "h2", "h3", "iframe", "style", "form", "footer", "figcaption"])] 66 | 67 | # These class names/ids are known to add noise or duplicate text to the article. 68 | noisy_names = ["image", "img", "video", "subheadline", "editor", "fondea", "resumen", "tags", "sidebar", "comment", 69 | "entry-title", "breaking_content", "pie", "tract", "caption", "tweet", "expert", "previous", "next", 70 | "compartir", "rightbar", "mas", "copyright", "instagram-media", "cookie", "paywall", "mainlist", "sitelist"] 71 | 72 | for tag in soup.find_all("div"): 73 | 74 | try: 75 | tag_id = tag["id"].lower() 76 | 77 | for item in noisy_names: 78 | if item in tag_id: 79 | tag.extract() 80 | except: 81 | pass 82 | 83 | for tag in soup.find_all(["div", "p", "blockquote"]): 84 | 85 | try: 86 | tag_class = "".join(tag["class"]).lower() 87 | 88 | for item in noisy_names: 89 | if item in tag_class: 90 | tag.extract() 91 | except: 92 | pass 93 | 94 | # These names commonly hold the article text. 95 | common_names = ["artic", "summary", "cont", "note", "cuerpo", "body"] 96 | 97 | article_body = "" 98 | 99 | # Sometimes we have more than one article tag. We are going to grab the longest one. 100 | for article_tag in soup.find_all("article"): 101 | 102 | if len(article_tag.text) >= len(article_body): 103 | article_body = article_tag.text 104 | 105 | # The article is too short, let's try to find it in another tag. 106 | if len(article_body) <= ARTICLE_MINIMUM_LENGTH: 107 | 108 | for tag in soup.find_all(["div", "section"]): 109 | 110 | try: 111 | tag_id = tag["id"].lower() 112 | 113 | for item in common_names: 114 | if item in tag_id: 115 | # We guarantee to get the longest div. 116 | if len(tag.text) >= len(article_body): 117 | article_body = tag.text 118 | except: 119 | pass 120 | 121 | # The article is still too short, let's try one more time. 122 | if len(article_body) <= ARTICLE_MINIMUM_LENGTH: 123 | 124 | for tag in soup.find_all(["div", "section"]): 125 | 126 | try: 127 | tag_class = "".join(tag["class"]).lower() 128 | 129 | for item in common_names: 130 | if item in tag_class: 131 | # We guarantee to get the longest div. 
132 | if len(tag.text) >= len(article_body): 133 | article_body = tag.text 134 | except: 135 | pass 136 | 137 | return article_title, article_date, article_body 138 | -------------------------------------------------------------------------------- /summary.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script extracts and ranks the sentences and words of an article. 3 | 4 | IT is inspired by the tf-idf algorithm. 5 | """ 6 | 7 | from collections import Counter 8 | 9 | import spacy 10 | 11 | # The stop words files. 12 | ES_STOPWORDS_FILE = "./assets/stopwords-es.txt" 13 | EN_STOPWORDS_FILE = "./assets/stopwords-en.txt" 14 | 15 | # The number of sentences we need. 16 | NUMBER_OF_SENTENCES = 5 17 | 18 | # The number of top words we need. 19 | NUMBER_OF_TOP_WORDS = 5 20 | 21 | # Multiplier for uppercase and long words. 22 | IMPORTANT_WORDS_MULTIPLIER = 2.5 23 | 24 | # Financial sentences often are more important than others. 25 | FINANCIAL_SENTENCE_MULTIPLIER = 1.5 26 | 27 | # The minimum number of characters needed for a line to be valid. 28 | LINE_LENGTH_THRESHOLD = 150 29 | 30 | # It is very important to add spaces on these words. 31 | # Otherwise it will take into account partial words. 32 | COMMON_WORDS = { 33 | " ", " ", "\xa0", "#", ",", "|", "-", "‘", "’", ";", "(", ")", ".", ":", "¿", "?", '“', "/", 34 | '”', '"', "'", "%", "•", "«", "»", "foto", "photo", "video", "redacción", "nueve", "diez", "cien", 35 | "mil", "miles", "ciento", "cientos", "millones", "vale" 36 | } 37 | 38 | # These words increase the score of a sentence. They don't require whitespaces around them. 39 | FINANCIAL_WORDS = ["$", "€", "£", "pesos", "dólar", "libras", "euros", 40 | "dollar", "pound", "mdp", "mdd"] 41 | 42 | 43 | # Don't forget to specify the correct model for your language. 44 | NLP = spacy.load("es_core_news_sm") 45 | 46 | 47 | def add_extra_words(): 48 | """Adds the title and uppercase forms of all words to COMMON_WORDS. 49 | 50 | We parse local copies of stop words downloaded from the following repositories: 51 | 52 | https://github.com/stopwords-iso/stopwords-es 53 | https://github.com/stopwords-iso/stopwords-en 54 | """ 55 | 56 | with open(ES_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file: 57 | for word in temp_file.read().splitlines(): 58 | COMMON_WORDS.add(word) 59 | 60 | with open(EN_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file: 61 | for word in temp_file.read().splitlines(): 62 | COMMON_WORDS.add(word) 63 | 64 | 65 | add_extra_words() 66 | 67 | 68 | def get_summary(article): 69 | """Generates the top words and sentences from the article text. 70 | 71 | Parameters 72 | ---------- 73 | article : str 74 | The article text. 75 | 76 | Returns 77 | ------- 78 | dict 79 | A dict containing the title of the article, reduction percentage, top words and the top scored sentences. 80 | 81 | """ 82 | 83 | # Now we prepare the article for scoring. 84 | cleaned_article = clean_article(article) 85 | 86 | # We start the NLP process. 87 | doc = NLP(cleaned_article) 88 | 89 | article_sentences = [sent for sent in doc.sents] 90 | 91 | words_of_interest = [ 92 | token.text for token in doc if token.lower_ not in COMMON_WORDS] 93 | 94 | # We use the Counter class to count all words ocurrences. 95 | scored_words = Counter(words_of_interest) 96 | 97 | for word in scored_words: 98 | 99 | # We add bonus points to words starting in uppercase and are equal or longer than 4 characters. 
100 | if word[0].isupper() and len(word) >= 4: 101 | scored_words[word] *= IMPORTANT_WORDS_MULTIPLIER 102 | 103 | # If the word is a number we punish it by settings its points to 0. 104 | if word.isdigit(): 105 | scored_words[word] = 0 106 | 107 | top_sentences = get_top_sentences(article_sentences, scored_words) 108 | top_sentences_length = sum([len(sentence) for sentence in top_sentences]) 109 | reduction = 100 - (top_sentences_length / len(cleaned_article)) * 100 110 | 111 | summary_dict = { 112 | "top_words": get_top_words(scored_words), 113 | "top_sentences": top_sentences, 114 | "reduction": reduction, 115 | "article_words": " ".join(words_of_interest) 116 | } 117 | 118 | return summary_dict 119 | 120 | 121 | def clean_article(article_text): 122 | """Cleans and reformats the article text. 123 | 124 | Parameters 125 | ---------- 126 | article_text : str 127 | The article string. 128 | 129 | Returns 130 | ------- 131 | str 132 | The cleaned up article. 133 | 134 | """ 135 | 136 | # We divide the script into lines, this is to remove unnecessary whitespaces. 137 | lines_list = list() 138 | 139 | for line in article_text.split("\n"): 140 | 141 | # We remove whitespaces. 142 | stripped_line = line.strip() 143 | 144 | # If the line is too short we ignore it. 145 | if len(stripped_line) >= LINE_LENGTH_THRESHOLD: 146 | lines_list.append(stripped_line) 147 | 148 | # Now we have the article fully cleaned. 149 | return " ".join(lines_list) 150 | 151 | 152 | def get_top_words(scored_words): 153 | """Gets the top scored words from the prepared article. 154 | 155 | Parameters 156 | ---------- 157 | scored_words : collections.Counter 158 | A Counter containing the article words and their scores. 159 | 160 | Returns 161 | ------- 162 | list 163 | An ordered list with the top words. 164 | 165 | """ 166 | 167 | # Once we have our words scored it's time to get top ones. 168 | top_words = list() 169 | 170 | for word, score in scored_words.most_common(): 171 | 172 | add_to_list = True 173 | 174 | # We avoid duplicates by checking if the word already is in the top_words list. 175 | if word.upper() not in [item.upper() for item in top_words]: 176 | 177 | # Sometimes we have the same word but in plural form, we skip the word when that happens. 178 | for item in top_words: 179 | if word.upper() in item.upper() or item.upper() in word.upper(): 180 | add_to_list = False 181 | 182 | if add_to_list: 183 | top_words.append(word) 184 | 185 | return top_words[0:NUMBER_OF_TOP_WORDS] 186 | 187 | 188 | def get_top_sentences(article_sentences, scored_words): 189 | """Gets the top scored sentences from the cleaned article. 190 | 191 | Parameters 192 | ---------- 193 | cleaned_article : str 194 | The original article after it has been cleaned and reformatted. 195 | 196 | scored_words : collections.Counter 197 | A Counter containing the article words and their scores. 198 | 199 | Returns 200 | ------- 201 | list 202 | An ordered list with the top sentences. 203 | 204 | """ 205 | 206 | # Now its time to score each sentence. 207 | scored_sentences = list() 208 | 209 | # We take a reference of the order of the sentences, this will be used later. 210 | for index, sent in enumerate(article_sentences): 211 | 212 | # In some edge cases we have duplicated sentences, we make sure that doesn't happen. 
213 | if sent.text not in [sent for score, index, sent in scored_sentences]: 214 | scored_sentences.append( 215 | [score_line(sent, scored_words), index, sent.text]) 216 | 217 | top_sentences = list() 218 | counter = 0 219 | 220 | for score, index, sentence in sorted(scored_sentences, reverse=True): 221 | 222 | if counter >= NUMBER_OF_SENTENCES: 223 | break 224 | 225 | # When the article is too small the sentences may come empty. 226 | if len(sentence) >= 3: 227 | 228 | # We clean the sentence and its index so we can sort in chronological order. 229 | top_sentences.append([index, sentence]) 230 | counter += 1 231 | 232 | return [sentence for index, sentence in sorted(top_sentences)] 233 | 234 | 235 | def score_line(line, scored_words): 236 | """Calculates the score of the given line using the word scores. 237 | 238 | Parameters 239 | ---------- 240 | line : spacy.tokens.span.Span 241 | A tokenized sentence from the article. 242 | 243 | scored_words : collections.Counter 244 | A Counter containing the article words and their scores. 245 | 246 | Returns 247 | ------- 248 | int 249 | The total score of all the words in the sentence. 250 | 251 | """ 252 | 253 | # We remove the common words. 254 | cleaned_line = [ 255 | token.text for token in line if token.lower_ not in COMMON_WORDS] 256 | 257 | # We now sum the total number of ocurrences for all words. 258 | temp_score = 0 259 | 260 | for word in cleaned_line: 261 | temp_score += scored_words[word] 262 | 263 | # We apply a bonus score to sentences that contain financial information. 264 | line_lowercase = line.text.lower() 265 | 266 | for word in FINANCIAL_WORDS: 267 | if word in line_lowercase: 268 | temp_score *= FINANCIAL_SENTENCE_MULTIPLIER 269 | break 270 | 271 | return temp_score 272 | -------------------------------------------------------------------------------- /templates/es.txt: -------------------------------------------------------------------------------- 1 | ### {} 2 | 3 | [Nota Original]({}) | Reducido en un {:.2f}% | {} 4 | 5 | ***** 6 | 7 | {} 8 | 9 | ***** 10 | 11 | *^Este ^bot ^solo ^responde ^cuando ^logra ^resumir ^en ^un ^mínimo ^del ^20%. ^Tus ^reportes, ^sugerencias ^y ^comentarios ^son ^bienvenidos. ​* 12 | 13 | [FAQ](https://redd.it/arkxlg) | [GitHub](https://git.io/fhQkC) | [☁️]({}) | {} 14 | --------------------------------------------------------------------------------
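A minimal standalone sketch of how the scraper and summarizer can be driven without the Reddit bot. It assumes it is run from the repository root (so that `scraper.py`, `summary.py` and the stop word files under `assets/` can be found), that the `es_core_news_sm` spaCy model is installed, and that the URL below is only a placeholder rather than a real whitelisted article:

```python
"""Standalone sketch: fetch one article, scrape it and summarize it.

Mirrors the flow of bot.py without Reddit, the whitelist check or Imgur.
Run it from the repository root; the URL is a placeholder.
"""

import requests

import scraper
import summary

HEADERS = {"User-Agent": "Summarizer v2.0"}
ARTICLE_URL = "https://example.com/some-article"  # Placeholder, not a whitelisted site.

with requests.get(ARTICLE_URL, headers=HEADERS, timeout=10) as response:

    # Same encoding workaround used by bot.py.
    if "iso-8859-1" in response.text.lower():
        response.encoding = "iso-8859-1"
    elif response.encoding == "ISO-8859-1":
        response.encoding = "utf-8"

    html_source = response.text

article_title, article_date, article_body = scraper.scrape_html(html_source)
summary_dict = summary.get_summary(article_body)

print(article_title, "|", article_date)
print("Reduction: {:.2f}%".format(summary_dict["reduction"]))

for sentence in summary_dict["top_sentences"]:
    print("*", sentence)

print("Top words:", ", ".join(summary_dict["top_words"]))
```

The reduction value printed here is the same figure `bot.py` compares against `MINIMUM_REDUCTION_THRESHOLD` and `MAXIMUM_REDUCTION_THRESHOLD` before deciding whether to reply to a submission.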