├── .gitignore
├── LICENSE
├── README.md
├── dist
    ├── newscatcher-0.1.0-py3-none-any.whl
    └── newscatcher-0.1.0.tar.gz
├── newscatcher
    ├── __init__.py
    └── data
    │   └── package_rss.db
├── newscatcher_oneliner.png
├── newscatcherdemo.gif
├── poetry.lock
├── pyproject.toml
├── requirements.txt
└── tests
    ├── __init__.py
    └── test_newscatcher.py


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Created by https://www.gitignore.io/api/macos,python
  2 | # Edit at https://www.gitignore.io/?templates=macos,python
  3 | 
  4 | ### macOS ###
  5 | # General
  6 | .DS_Store
  7 | .AppleDouble
  8 | .LSOverride
  9 | 
 10 | # Icon must end with two \r
 11 | Icon
 12 | 
 13 | # Thumbnails
 14 | ._*
 15 | 
 16 | # Files that might appear in the root of a volume
 17 | .DocumentRevisions-V100
 18 | .fseventsd
 19 | .Spotlight-V100
 20 | .TemporaryItems
 21 | .Trashes
 22 | .VolumeIcon.icns
 23 | .com.apple.timemachine.donotpresent
 24 | 
 25 | # Directories potentially created on remote AFP share
 26 | .AppleDB
 27 | .AppleDesktop
 28 | Network Trash Folder
 29 | Temporary Items
 30 | .apdisk
 31 | 
 32 | ### Python ###
 33 | # Byte-compiled / optimized / DLL files
 34 | __pycache__/
 35 | *.py[cod]
 36 | *$py.class
 37 | 
 38 | # C extensions
 39 | *.so
 40 | 
 41 | # Distribution / packaging
 42 | .Python
 43 | build/
 44 | develop-eggs/
 45 | dist/
 46 | downloads/
 47 | eggs/
 48 | .eggs/
 49 | lib/
 50 | lib64/
 51 | parts/
 52 | sdist/
 53 | var/
 54 | wheels/
 55 | pip-wheel-metadata/
 56 | share/python-wheels/
 57 | *.egg-info/
 58 | .installed.cfg
 59 | *.egg
 60 | MANIFEST
 61 | 
 62 | # PyInstaller
 63 | #  Usually these files are written by a python script from a template
 64 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 65 | *.manifest
 66 | *.spec
 67 | 
 68 | # Installer logs
 69 | pip-log.txt
 70 | pip-delete-this-directory.txt
 71 | 
 72 | # Unit test / coverage reports
 73 | htmlcov/
 74 | .tox/
 75 | .nox/
 76 | .coverage
 77 | .coverage.*
 78 | .cache
 79 | nosetests.xml
 80 | coverage.xml
 81 | *.cover
 82 | .hypothesis/
 83 | .pytest_cache/
 84 | 
 85 | # Translations
 86 | *.mo
 87 | *.pot
 88 | 
 89 | # Scrapy stuff:
 90 | .scrapy
 91 | 
 92 | # Sphinx documentation
 93 | docs/_build/
 94 | 
 95 | # PyBuilder
 96 | target/
 97 | 
 98 | # pyenv
 99 | .python-version
100 | 
101 | # pipenv
102 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
103 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
104 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
105 | #   install all needed dependencies.
106 | #Pipfile.lock
107 | 
108 | # celery beat schedule file
109 | celerybeat-schedule
110 | 
111 | # SageMath parsed files
112 | *.sage.py
113 | 
114 | # Spyder project settings
115 | .spyderproject
116 | .spyproject
117 | 
118 | # Rope project settings
119 | .ropeproject
120 | 
121 | # Mr Developer
122 | .mr.developer.cfg
123 | .project
124 | .pydevproject
125 | 
126 | # mkdocs documentation
127 | /site
128 | 
129 | # mypy
130 | .mypy_cache/
131 | .dmypy.json
132 | dmypy.json
133 | 
134 | # Pyre type checker
135 | .pyre/
136 | 
137 | # End of https://www.gitignore.io/api/macos,python
138 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 newscatcherapi.com
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Newscatcher
  2 | **Programmatically collect normalized news from (almost) any website.**
  3 | 
  4 | Filter by **topic**, **country**, or **language**.
  5 | 
  6 | Created by [newscatcherapi.com](https://www.newscatcherapi.com), but you do not need anything from us or from anyone else to get the software going; it just works out of the box.
  7 | 
  8 | ## Demo
  9 | ![](newscatcherdemo.gif)
 10 | 
 11 | ## Motivation
 12 | While working on [newscatcherapi](https://newscatcherapi.com/) - a JSON API 
 13 | for querying news articles -
 14 | I came up with the idea of making a simple Python package that makes it easy
 15 | to grab live news data. 
 16 | 
 17 | When I was a junior data scientist working on my own side projects,
 18 | it was hard for me to work with external data sources. I knew Python
 19 | quite well, but in most cases that was not enough to build proper data pipelines
 20 | that required gathering data on my own. I hope this package helps you 
 21 | with your next project. 
 22 | 
 23 | Even though I do not recommend using this package in production systems, 
 24 | I believe it should be enough to test your assumptions and build an MVP.
 25 | 
 26 | ## Installation
 27 | `pip install newscatcher --upgrade` 
 28 | 
 29 | 
 30 | ## Quick Start
 31 | ```python
 32 | from newscatcher import Newscatcher
 33 | ```
 34 | 
 35 | Get the latest news from the [nytimes.com](https://www.nytimes.com/) 
 36 | (_we support thousands of news websites, try it yourself!_) main news feed:
 37 | ```python
 38 | nc = Newscatcher(website = 'nytimes.com')
 39 | results = nc.get_news()
 40 | 
 41 | # results.keys()
 42 | # 'url', 'topic', 'language', 'country', 'articles'
 43 | 
 44 | # Get the articles
 45 | articles = results['articles']
 46 | 
 47 | first_article_summary = articles[0]['summary']
 48 | first_article_title = articles[0]['title']
 49 | ```
 50 | 
 51 | Get the latest news from [nytimes.com](https://www.nytimes.com/) **politics** feed
 52 | 
 53 | ```python
 54 | nc = Newscatcher(website = 'nytimes.com', topic = 'politics')
 55 | 
 56 | results = nc.get_news()
 57 | articles = results['articles']
 58 | ```
 59 | 
 60 | There is a limited set of topics that you can choose from:
 61 | 
 62 | ``` 'tech', 'news', 'business', 'science', 'finance', 'food', 'politics', 'economics', 'travel', 'entertainment', 'music', 'sport', 'world' ```
 63 | 
 64 | However, not all topics are supported by every newspaper.
 65 | 
 66 | How to check which topics are supported by which newspaper:
 67 | ```python
 68 | from newscatcher import describe_url
 69 | 
 70 | describe = describe_url('nytimes.com')
 71 | 
 72 | print(describe['topics'])
 73 | ```
 74 | 
 75 | 
 76 | ### Get the list of all news feeds by topic/language/country
 77 | If you want to find the full list of supported news websites, 
 78 | you can always do so using the `urls()` function:
 79 | ```python
 80 | from newscatcher import urls
 81 | 
 82 | # URLs by TOPIC
 83 | politic_urls = urls(topic = 'politics')
 84 | 
 85 | # URLs by COUNTRY
 86 | american_urls = urls(country = 'US')
 87 | 
 88 | # URLs by LANGUAGE
 89 | english_urls = urls(language = 'en')
 90 | 
 91 | # Combine any from topic, country, language
 92 | american_english_politics_urls = urls(country = 'US', topic = 'politics', language = 'en') 
 93 | 
 94 | # note some websites do not explicitly declare their language 
 95 | # as a result they will be excluded from queries based on language
 96 | ```
 97 | 
 98 | 
 99 | 
100 | 
101 | ## Documentation
102 | 
103 | ### `Newscatcher` Class
104 | ```python
105 | from newscatcher import Newscatcher
106 | 
107 | Newscatcher(website, topic = None)
108 | ```
109 | **Please use the base form of a website's URL** (without `www.`, without `https://`, and without a trailing `/`).
110 | 
111 | For example: “nytimes.com”, “news.ycombinator.com” or “theverge.com”.
112 | ___
113 | `Newscatcher.get_news()` - Get the latest news from the website of interest.
114 | 
115 | Allowed topics:
116 | `tech`, `news`, `business`, `science`, `finance`, `food`, 
117 | `politics`, `economics`, `travel`, `entertainment`, 
118 | `music`, `sport`, `world`
119 | 
120 | If no topic is provided, the main feed is returned.
121 | 
122 | Returns a dictionary of 5 elements:
123 | 1. `url` - URL of the website
124 | 2. `topic` - topic of the returned feed
125 | 3. `language` - language of returned feed
126 | 4. `country` - country of returned feed
127 | 5. `articles` - articles of the feed, as [feedparser entries](https://pythonhosted.org/feedparser/reference.html)
128 | 
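A minimal sketch of reading these fields (article keys such as `title`, `summary`, and `link` come from feedparser and can vary from feed to feed):

```python
from newscatcher import Newscatcher

nc = Newscatcher(website = 'nytimes.com', topic = 'politics')
results = nc.get_news()

# get_news() returns None when the website or topic is not supported
if results is not None:
    print(results['url'], results['topic'], results['language'], results['country'])
    for article in results['articles'][:5]:
        # feedparser entries behave like dictionaries; available keys vary by feed
        print(article.get('title'), article.get('link'))
```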
129 | ___
130 | 
131 | `Newscatcher.get_headlines()` - Returns only the headlines
132 | 
133 | ___
134 | `Newscatcher.print_headlines(n)` - Print top `n` headlines
135 | 
136 | 
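A quick sketch of both helpers (using `news.ycombinator.com` as an example; the output depends on the live feed):

```python
from newscatcher import Newscatcher

nc = Newscatcher(website = 'news.ycombinator.com')

# Print the 10 most recent headlines, numbered
nc.print_headlines(n = 10)

# Or get the same titles back as a plain Python list
headlines = nc.get_headlines(n = 10)
```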
137 | <br> 
138 | <br> 
139 | <br> 
140 | 
141 | ### `describe_url()` & `urls()`
142 | These functions exist to help you navigate this package.
143 | 
144 | ___
145 | ```python
146 | from newscatcher import describe_url
147 | ```
148 | 
149 | `describe_url(website)` - Get the main info on the website. 
150 | 
151 | Returns a dictionary of 5 elements:
152 | 1. `url` - URL of the website
153 | 2. `topics` - list of all supported topics
154 | 3. `language` - language of website
155 | 4. `country` - country of returned feed
156 | 5. `main_topic` - main topic of a website
157 | 
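For example, a small sketch that prints all of these fields (the exact values depend on the website):

```python
from newscatcher import describe_url

info = describe_url('nytimes.com')

print(info['url'], info['language'], info['country'])
print(info['main_topic'])   # the topic of the website's main feed
print(info['topics'])       # every topic feed the package knows about
```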
158 | ___
159 | ```python
160 | from newscatcher import urls
161 | ```
162 | 
163 | `urls(topic = None, language = None, country = None)` - Get a list of all supported 
164 | news websites given any combination of `topic`, `language`, `country`
165 | 
166 | Returns a list of websites that match your combination of `topic`, `language`, `country`
167 | 
168 | Supported topics:
169 | `tech`, `news`, `business`, `science`, `finance`, `food`, 
170 | `politics`, `economics`, `travel`, `entertainment`, 
171 | `music`, `sport`, `world`
172 | 
173 | 
174 | Supported countries:
175 | `US`, `GB`, `DE`, `FR`, `IN`, `RU`, `ES`, `BR`, `IT`, `CA`, `AU`, `NL`, `PL`, `NZ`, `PT`, `RO`, `UA`, `JP`, `AR`, `IR`, `IE`, `PH`, `IS`, `ZA`, `AT`, `CL`, `HR`, `BG`, `HU`, `KR`, `SZ`, `AE`, `EG`, `VE`, `CO`, `SE`, `CZ`, `ZH`, `MT`, `AZ`, `GR`, `BE`, `LU`, `IL`, `LT`, `NI`, `MY`, `TR`, `BM`, `NO`, `ME`, `SA`, `RS`, `BA`
176 | 
177 | Supported languages:
178 | `EL`, `IT`, `ZH`, `EN`, `RU`, `CS`, `RO`, `FR`, `JA`, `DE`, `PT`, `ES`, `AR`, `HE`, `UK`, `PL`, `NL`, `TR`, `VI`, `KO`, `TH`, `ID`, `HR`, `DA`, `BG`, `NO`, `SK`, `FA`, `ET`, `SV`, `BN`, `GU`, `MK`, `PA`, `HU`, `SL`, `FI`, `LT`, `MR`, `HI`
179 | 
180 | 
181 | 
182 | 
183 | 
184 | ## Tech/framework used
185 | The package itself is nothing more than a SQLite database with 
186 | RSS feed endpoints for each website and a thin wrapper around
187 | [feedparser](https://pythonhosted.org/feedparser/index.html).
188 | 
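Under the hood, getting the news boils down to roughly the following sketch (the `rss_main` table and its columns are internal details of the bundled database, shown here only for illustration):

```python
import sqlite3
import feedparser
import pkg_resources

# Locate the RSS-endpoint database that ships inside the package
db_file = pkg_resources.resource_filename('newscatcher', 'data/package_rss.db')

db = sqlite3.connect(db_file)
row = db.execute(
    "SELECT rss_url FROM rss_main WHERE clean_url = ? AND main = 1;",
    ('nytimes.com',),
).fetchone()
db.close()

if row is not None:
    feed = feedparser.parse(row[0])  # fetch and parse the RSS feed
    print([entry.get('title') for entry in feed['entries'][:5]])
```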
189 | 
190 | ## About Us
191 | We are the Newscatcher API team. We are glad that you like our package.
192 | 
193 | If you want to search for any news data, consider using [our API](https://newscatcherapi.com/)
194 | 
195 | ![](newscatcher_oneliner.png)
196 | 
197 | 
198 | [Artem Bugara]() - co-founder of Newscatcher, made v.0.1.0
199 | 
200 | [Maksym Sugonyaka](https://www.linkedin.com/mwlite/in/msugonyaka) - co-founder of Newscatcher, made v.0.1.0
201 | 
202 | [Becket Trotter](https://www.linkedin.com/in/beckettrotter/) - Python Developer, made v.0.2.0
203 | 
204 | ## License
205 | MIT
206 | 


--------------------------------------------------------------------------------
/dist/newscatcher-0.1.0-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/dist/newscatcher-0.1.0-py3-none-any.whl


--------------------------------------------------------------------------------
/dist/newscatcher-0.1.0.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/dist/newscatcher-0.1.0.tar.gz


--------------------------------------------------------------------------------
/newscatcher/__init__.py:
--------------------------------------------------------------------------------
  1 | __version__ = "0.2.0"
  2 | 
  3 | # Retrieve and analyze
  4 | # 24/7 streams of news data
  5 | import sqlite3
  6 | 
  7 | # import requests
  8 | import feedparser
  9 | import pkg_resources
 10 | from tldextract import extract
 11 | 
 12 | DB_FILE = pkg_resources.resource_filename("newscatcher", "data/package_rss.db")
 13 | 
 14 | 
 15 | class Query:
 16 |     # Query class used to build subsequent sql queries
 17 |     def __init__(self):
 18 |         self.params = {"website": None, "topic": None}
 19 | 
 20 |     def build_conditional(self, field, sql_field):
 21 |         # single conditional build
 22 |         field = field.lower()
 23 |         sql_field = sql_field.lower()
 24 | 
 25 |         if self.params[field] != None:
 26 |             conditional = "{} = '{}'".format(sql_field, self.params[field])
 27 |             return conditional
 28 |         return
 29 | 
 30 |     def build_where(self):
 31 |         # return the conditionals that follow "WHERE",
 32 |         # built from the stored parameters
 33 |         conditionals = []
 34 | 
 35 |         conv = {"topic": "topic_unified", "website": "clean_url"}
 36 | 
 37 |         for field in conv.keys():
 38 |             cond = self.build_conditional(field, conv[field])
 39 |             if cond != None:
 40 |                 conditionals.append(cond)
 41 | 
 42 |         if conditionals == []:
 43 |             return
 44 | 
 45 |         conditionals[0] = "WHERE " + conditionals[0]
 46 |         conditionals = " AND ".join([x for x in conditionals if x is not None])
 47 |         conditionals += " ORDER BY IFNULL(Globalrank,999999);"
 48 | 
 49 |         return conditionals
 50 | 
 51 |     def build_sql(self):
 52 |         # build sql from the user query
 53 |         db = sqlite3.connect(DB_FILE, isolation_level=None)
 54 |         sql = "SELECT rss_url from rss_main " + self.build_where()
 55 | 
 56 |         db.close()
 57 |         return sql
 58 | 
 59 | 
 60 | def clean_url(dirty_url):
 61 |     # website.com
 62 |     dirty_url = dirty_url.lower()
 63 |     o = extract(dirty_url)
 64 |     return o.domain + "." + o.suffix
 65 | 
 66 | 
 67 | class Newscatcher:
 68 |     # search engine
 69 |     def build_sql(self):
 70 |         if self.topic is None:
 71 |             sql = """SELECT rss_url from rss_main 
 72 | 					 WHERE clean_url = '{}';"""
 73 |             sql = sql.format(self.url)
 74 |             return sql
 75 | 
 76 |     def __init__(self, website, topic=None):
 77 |         # init with given params
 78 |         website = website.lower()
 79 |         self.url = clean_url(website)
 80 |         self.topic = topic
 81 | 
 82 |     def get_headlines(self, n=None):
 83 |         if self.topic is None:
 84 |             sql = """SELECT rss_url,topic_unified, language, clean_country from rss_main 
 85 | 					 WHERE clean_url = '{}' AND main = 1;"""
 86 |             sql = sql.format(self.url)
 87 |         else:
 88 |             sql = """SELECT rss_url, topic_unified, language, clean_country from rss_main 
 89 | 					 WHERE clean_url = '{}' AND topic_unified = '{}';"""
 90 |             sql = sql.format(self.url, self.topic)
 91 | 
 92 |         db = sqlite3.connect(DB_FILE, isolation_level=None)
 93 | 
 94 |         try:
 95 |             rss_endpoint, _, _, _ = db.execute(sql).fetchone()
 96 |             feed = feedparser.parse(rss_endpoint)
 97 |         except:
 98 |             if self.topic is not None:
 99 |                 sql = """SELECT rss_url from rss_main 
100 | 					 WHERE clean_url = '{}';"""
101 |                 sql = sql.format(self.url)
102 | 
103 |                 if len(db.execute(sql).fetchall()) > 0:
104 |                     db.close()
105 |                     print("Topic is not supported")
106 |                     return
107 |                 else:
108 |                     print("Website is not supported")
109 |                     return
110 |             else:
111 |                 print("Website is not supported")
112 |                 return
113 | 
114 |         if feed["entries"] == []:
115 |             db.close()
116 |             print(
117 |                 "\nNo headlines found; check your internet connection or query parameters\n"
118 |             )
119 |             return
120 | 
121 |         title_list = []
122 |         for article in feed["entries"]:
123 |             if "title" in article:
124 |                 title_list.append(article["title"])
125 |             if n != None:
126 |                 if len(title_list) == n:
127 |                     break
128 | 
129 |         return title_list
130 | 
131 |     def print_headlines(self, n=None):
132 |         headlines = self.get_headlines(n)
133 | 
134 |         i = 1
135 |         for headline in headlines:
136 |             if i < 10:
137 |                 print(str(i) + ".   |  " + headline)
138 |                 i += 1
139 |             elif i in list(range(10, 100)):
140 |                 print(str(i) + ".  |  " + headline)
141 |                 i += 1
142 |             else:
143 |                 print(str(i) + ". |  " + headline)
144 |                 i += 1
145 | 
146 |     def get_news(self, n=None):
147 |         # return results based on current stream
148 |         if self.topic is None:
149 |             sql = """SELECT rss_url,topic_unified, language, clean_country from rss_main 
150 | 					 WHERE clean_url = '{}' AND main = 1;"""
151 |             sql = sql.format(self.url)
152 |         else:
153 |             sql = """SELECT rss_url, topic_unified, language, clean_country from rss_main 
154 | 					 WHERE clean_url = '{}' AND topic_unified = '{}';"""
155 |             sql = sql.format(self.url, self.topic)
156 | 
157 |         db = sqlite3.connect(DB_FILE, isolation_level=None)
158 | 
159 |         try:
160 |             rss_endpoint, topic, language, country = db.execute(sql).fetchone()
161 |             feed = feedparser.parse(rss_endpoint)
162 |         except:
163 |             if self.topic is not None:
164 |                 sql = """SELECT rss_url from rss_main 
165 | 					 WHERE clean_url = '{}';"""
166 |                 sql = sql.format(self.url)
167 | 
168 |                 if len(db.execute(sql).fetchall()) > 0:
169 |                     db.close()
170 |                     print("Topic is not supported")
171 |                     return
172 |                 else:
173 |                     print("Website is not supported")
174 |                     return
175 |             else:
176 |                 print("Website is not supported")
177 |                 return
178 | 
179 |         if feed["entries"] == []:
180 |             db.close()
181 |             print("\nNo results found; check your internet connection or query parameters\n")
182 |             return
183 | 
184 |         if n == None or len(feed["entries"]) <= n:
185 |             articles = feed["entries"]
186 |         else:
187 |             articles = feed["entries"][:n]
188 | 
189 |         db.close()
190 |         return {
191 |             "url": self.url,
192 |             "topic": topic,
193 |             "language": language,
194 |             "country": country,
195 |             "articles": articles,
196 |         }
197 | 
198 | 
199 | def describe_url(website):
200 |     # return newscatcher fields that correspond to the url
201 |     website = website.lower()
202 |     website = clean_url(website)
203 |     db = sqlite3.connect(DB_FILE, isolation_level=None)
204 | 
205 |     sql = "SELECT clean_url, language, clean_country, topic_unified from rss_main WHERE clean_url = '{}' and main == 1 ".format(
206 |         website
207 |     )
208 |     results = db.execute(sql).fetchone()
209 | 
210 |     if results is None:
211 |         print("\nWebsite not supported\n")
212 |         return
213 | 
214 |     main = results[-1]
215 |     if main is None or len(main) == 0:
216 |         print("\nWebsite not supported\n")
217 |         return
218 | 
219 |     sql = "SELECT DISTINCT topic_unified from rss_main WHERE clean_url == '{}'".format(
220 |         website
221 |     )
222 |     topics = db.execute(sql).fetchall()
223 |     topics = [x[0] for x in topics]
224 | 
225 |     ret = {
226 |         "url": results[0],
227 |         "language": results[1],
228 |         "country": results[2],
229 |         "main_topic": main,
230 |         "topics": topics,
231 |     }
232 | 
233 |     return ret
234 | 
235 | 
236 | def urls(topic=None, language=None, country=None):
237 |     # return urls that match the user's parameters
238 |     if language != None:
239 |         language = language.lower()
240 | 
241 |     if country != None:
242 |         country = country.upper()
243 | 
244 |     if topic != None:
245 |         topic = topic.lower()
246 | 
247 |     db = sqlite3.connect(DB_FILE, isolation_level=None)
248 |     quick_q = Query()
249 |     inp = {"topic": topic, "language": language, "country": country}
250 |     for x in inp.keys():
251 |         quick_q.params[x] = inp[x]
252 | 
253 |     conditionals = []
254 |     conv = {
255 |         "topic": "topic_unified",
256 |         "website": "clean_url",
257 |         "country": "clean_country",
258 |         "language": "language",
259 |     }
260 | 
261 |     for field in conv.keys():
262 |         try:
263 |             cond = quick_q.build_conditional(field, conv[field])
264 |         except:
265 |             cond = None
266 | 
267 |         if cond != None:
268 |             conditionals.append(cond)
269 | 
270 |     sql = ""
271 | 
272 |     if conditionals == []:
273 |         sql = "SELECT clean_url from rss_main "
274 |     else:
275 |         conditionals[0] = " WHERE " + conditionals[0]
276 |         conditionals = " AND ".join([x for x in conditionals if x is not None])
277 |         conditionals += " AND main = 1 ORDER BY IFNULL(Globalrank,999999);"
278 |         sql = "SELECT DISTINCT clean_url from rss_main" + conditionals
279 | 
280 |     ret = db.execute(sql).fetchall()
281 |     if len(ret) == 0:
282 |         print("\nNo websites found for given parameters\n")
283 |         return
284 | 
285 |     db.close()
286 |     return [x[0] for x in ret]
287 | 


--------------------------------------------------------------------------------
/newscatcher/data/package_rss.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/newscatcher/data/package_rss.db


--------------------------------------------------------------------------------
/newscatcher_oneliner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/newscatcher_oneliner.png


--------------------------------------------------------------------------------
/newscatcherdemo.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/newscatcherdemo.gif


--------------------------------------------------------------------------------
/poetry.lock:
--------------------------------------------------------------------------------
  1 | [[package]]
  2 | category = "dev"
  3 | description = "Atomic file writes."
  4 | marker = "sys_platform == \"win32\""
  5 | name = "atomicwrites"
  6 | optional = false
  7 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
  8 | version = "1.4.0"
  9 | 
 10 | [[package]]
 11 | category = "dev"
 12 | description = "Classes Without Boilerplate"
 13 | name = "attrs"
 14 | optional = false
 15 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
 16 | version = "19.3.0"
 17 | 
 18 | [package.extras]
 19 | azure-pipelines = ["coverage", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "zope.interface", "pytest-azurepipelines"]
 20 | dev = ["coverage", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "zope.interface", "sphinx", "pre-commit"]
 21 | docs = ["sphinx", "zope.interface"]
 22 | tests = ["coverage", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "zope.interface"]
 23 | 
 24 | [[package]]
 25 | category = "main"
 26 | description = "Python package for providing Mozilla's CA Bundle."
 27 | name = "certifi"
 28 | optional = false
 29 | python-versions = "*"
 30 | version = "2020.4.5.1"
 31 | 
 32 | [[package]]
 33 | category = "main"
 34 | description = "Universal encoding detector for Python 2 and 3"
 35 | name = "chardet"
 36 | optional = false
 37 | python-versions = "*"
 38 | version = "3.0.4"
 39 | 
 40 | [[package]]
 41 | category = "dev"
 42 | description = "Cross-platform colored terminal text."
 43 | marker = "sys_platform == \"win32\""
 44 | name = "colorama"
 45 | optional = false
 46 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
 47 | version = "0.4.3"
 48 | 
 49 | [[package]]
 50 | category = "main"
 51 | description = "Universal feed parser, handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds"
 52 | name = "feedparser"
 53 | optional = false
 54 | python-versions = "*"
 55 | version = "5.2.1"
 56 | 
 57 | [[package]]
 58 | category = "main"
 59 | description = "Internationalized Domain Names in Applications (IDNA)"
 60 | name = "idna"
 61 | optional = false
 62 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
 63 | version = "2.9"
 64 | 
 65 | [[package]]
 66 | category = "dev"
 67 | description = "Read metadata from Python packages"
 68 | marker = "python_version < \"3.8\""
 69 | name = "importlib-metadata"
 70 | optional = false
 71 | python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,>=2.7"
 72 | version = "1.6.0"
 73 | 
 74 | [package.dependencies]
 75 | zipp = ">=0.5"
 76 | 
 77 | [package.extras]
 78 | docs = ["sphinx", "rst.linker"]
 79 | testing = ["packaging", "importlib-resources"]
 80 | 
 81 | [[package]]
 82 | category = "dev"
 83 | description = "More routines for operating on iterables, beyond itertools"
 84 | name = "more-itertools"
 85 | optional = false
 86 | python-versions = ">=3.5"
 87 | version = "8.3.0"
 88 | 
 89 | [[package]]
 90 | category = "dev"
 91 | description = "Core utilities for Python packages"
 92 | name = "packaging"
 93 | optional = false
 94 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
 95 | version = "20.4"
 96 | 
 97 | [package.dependencies]
 98 | pyparsing = ">=2.0.2"
 99 | six = "*"
100 | 
101 | [[package]]
102 | category = "dev"
103 | description = "plugin and hook calling mechanisms for python"
104 | name = "pluggy"
105 | optional = false
106 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
107 | version = "0.13.1"
108 | 
109 | [package.dependencies]
110 | [package.dependencies.importlib-metadata]
111 | python = "<3.8"
112 | version = ">=0.12"
113 | 
114 | [package.extras]
115 | dev = ["pre-commit", "tox"]
116 | 
117 | [[package]]
118 | category = "dev"
119 | description = "library with cross-python path, ini-parsing, io, code, log facilities"
120 | name = "py"
121 | optional = false
122 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
123 | version = "1.8.1"
124 | 
125 | [[package]]
126 | category = "dev"
127 | description = "Python parsing module"
128 | name = "pyparsing"
129 | optional = false
130 | python-versions = ">=2.6, !=3.0.*, !=3.1.*, !=3.2.*"
131 | version = "2.4.7"
132 | 
133 | [[package]]
134 | category = "dev"
135 | description = "pytest: simple powerful testing with Python"
136 | name = "pytest"
137 | optional = false
138 | python-versions = ">=3.5"
139 | version = "5.4.2"
140 | 
141 | [package.dependencies]
142 | atomicwrites = ">=1.0"
143 | attrs = ">=17.4.0"
144 | colorama = "*"
145 | more-itertools = ">=4.0.0"
146 | packaging = "*"
147 | pluggy = ">=0.12,<1.0"
148 | py = ">=1.5.0"
149 | wcwidth = "*"
150 | 
151 | [package.dependencies.importlib-metadata]
152 | python = "<3.8"
153 | version = ">=0.12"
154 | 
155 | [package.extras]
156 | checkqa-mypy = ["mypy (v0.761)"]
157 | testing = ["argcomplete", "hypothesis (>=3.56)", "mock", "nose", "requests", "xmlschema"]
158 | 
159 | [[package]]
160 | category = "main"
161 | description = "Python HTTP for Humans."
162 | name = "requests"
163 | optional = false
164 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
165 | version = "2.23.0"
166 | 
167 | [package.dependencies]
168 | certifi = ">=2017.4.17"
169 | chardet = ">=3.0.2,<4"
170 | idna = ">=2.5,<3"
171 | urllib3 = ">=1.21.1,<1.25.0 || >1.25.0,<1.25.1 || >1.25.1,<1.26"
172 | 
173 | [package.extras]
174 | security = ["pyOpenSSL (>=0.14)", "cryptography (>=1.3.4)"]
175 | socks = ["PySocks (>=1.5.6,<1.5.7 || >1.5.7)", "win-inet-pton"]
176 | 
177 | [[package]]
178 | category = "main"
179 | description = "File transport adapter for Requests"
180 | name = "requests-file"
181 | optional = false
182 | python-versions = "*"
183 | version = "1.5.1"
184 | 
185 | [package.dependencies]
186 | requests = ">=1.0.0"
187 | six = "*"
188 | 
189 | [[package]]
190 | category = "main"
191 | description = "Python 2 and 3 compatibility utilities"
192 | name = "six"
193 | optional = false
194 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*"
195 | version = "1.14.0"
196 | 
197 | [[package]]
198 | category = "main"
199 | description = "Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List. By default, this includes the public ICANN TLDs and their exceptions. You can optionally support the Public Suffix List's private domains as well."
200 | name = "tldextract"
201 | optional = false
202 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
203 | version = "2.2.2"
204 | 
205 | [package.dependencies]
206 | idna = "*"
207 | requests = ">=2.1.0"
208 | requests-file = ">=1.4"
209 | setuptools = "*"
210 | 
211 | [[package]]
212 | category = "main"
213 | description = "HTTP library with thread-safe connection pooling, file post, and more."
214 | name = "urllib3"
215 | optional = false
216 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, <4"
217 | version = "1.25.9"
218 | 
219 | [package.extras]
220 | brotli = ["brotlipy (>=0.6.0)"]
221 | secure = ["certifi", "cryptography (>=1.3.4)", "idna (>=2.0.0)", "pyOpenSSL (>=0.14)", "ipaddress"]
222 | socks = ["PySocks (>=1.5.6,<1.5.7 || >1.5.7,<2.0)"]
223 | 
224 | [[package]]
225 | category = "dev"
226 | description = "Measures number of Terminal column cells of wide-character codes"
227 | name = "wcwidth"
228 | optional = false
229 | python-versions = "*"
230 | version = "0.1.9"
231 | 
232 | [[package]]
233 | category = "dev"
234 | description = "Backport of pathlib-compatible object wrapper for zip files"
235 | marker = "python_version < \"3.8\""
236 | name = "zipp"
237 | optional = false
238 | python-versions = ">=3.6"
239 | version = "3.1.0"
240 | 
241 | [package.extras]
242 | docs = ["sphinx", "jaraco.packaging (>=3.2)", "rst.linker (>=1.9)"]
243 | testing = ["jaraco.itertools", "func-timeout"]
244 | 
245 | [metadata]
246 | content-hash = "a4ae118340bddfdf0bb2db18f07c7259530e2328f0cee7e70854813dcc6bb422"
247 | python-versions = "^3.6"
248 | 
249 | [metadata.files]
250 | atomicwrites = [
251 |     {file = "atomicwrites-1.4.0-py2.py3-none-any.whl", hash = "sha256:6d1784dea7c0c8d4a5172b6c620f40b6e4cbfdf96d783691f2e1302a7b88e197"},
252 |     {file = "atomicwrites-1.4.0.tar.gz", hash = "sha256:ae70396ad1a434f9c7046fd2dd196fc04b12f9e91ffb859164193be8b6168a7a"},
253 | ]
254 | attrs = [
255 |     {file = "attrs-19.3.0-py2.py3-none-any.whl", hash = "sha256:08a96c641c3a74e44eb59afb61a24f2cb9f4d7188748e76ba4bb5edfa3cb7d1c"},
256 |     {file = "attrs-19.3.0.tar.gz", hash = "sha256:f7b7ce16570fe9965acd6d30101a28f62fb4a7f9e926b3bbc9b61f8b04247e72"},
257 | ]
258 | certifi = [
259 |     {file = "certifi-2020.4.5.1-py2.py3-none-any.whl", hash = "sha256:1d987a998c75633c40847cc966fcf5904906c920a7f17ef374f5aa4282abd304"},
260 |     {file = "certifi-2020.4.5.1.tar.gz", hash = "sha256:51fcb31174be6e6664c5f69e3e1691a2d72a1a12e90f872cbdb1567eb47b6519"},
261 | ]
262 | chardet = [
263 |     {file = "chardet-3.0.4-py2.py3-none-any.whl", hash = "sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691"},
264 |     {file = "chardet-3.0.4.tar.gz", hash = "sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae"},
265 | ]
266 | colorama = [
267 |     {file = "colorama-0.4.3-py2.py3-none-any.whl", hash = "sha256:7d73d2a99753107a36ac6b455ee49046802e59d9d076ef8e47b61499fa29afff"},
268 |     {file = "colorama-0.4.3.tar.gz", hash = "sha256:e96da0d330793e2cb9485e9ddfd918d456036c7149416295932478192f4436a1"},
269 | ]
270 | feedparser = [
271 |     {file = "feedparser-5.2.1.tar.bz2", hash = "sha256:ce875495c90ebd74b179855449040003a1beb40cd13d5f037a0654251e260b02"},
272 |     {file = "feedparser-5.2.1.tar.gz", hash = "sha256:bd030652c2d08532c034c27fcd7c85868e7fa3cb2b17f230a44a6bbc92519bf9"},
273 |     {file = "feedparser-5.2.1.zip", hash = "sha256:cd2485472e41471632ed3029d44033ee420ad0b57111db95c240c9160a85831c"},
274 | ]
275 | idna = [
276 |     {file = "idna-2.9-py2.py3-none-any.whl", hash = "sha256:a068a21ceac8a4d63dbfd964670474107f541babbd2250d61922f029858365fa"},
277 |     {file = "idna-2.9.tar.gz", hash = "sha256:7588d1c14ae4c77d74036e8c22ff447b26d0fde8f007354fd48a7814db15b7cb"},
278 | ]
279 | importlib-metadata = [
280 |     {file = "importlib_metadata-1.6.0-py2.py3-none-any.whl", hash = "sha256:2a688cbaa90e0cc587f1df48bdc97a6eadccdcd9c35fb3f976a09e3b5016d90f"},
281 |     {file = "importlib_metadata-1.6.0.tar.gz", hash = "sha256:34513a8a0c4962bc66d35b359558fd8a5e10cd472d37aec5f66858addef32c1e"},
282 | ]
283 | more-itertools = [
284 |     {file = "more-itertools-8.3.0.tar.gz", hash = "sha256:558bb897a2232f5e4f8e2399089e35aecb746e1f9191b6584a151647e89267be"},
285 |     {file = "more_itertools-8.3.0-py3-none-any.whl", hash = "sha256:7818f596b1e87be009031c7653d01acc46ed422e6656b394b0f765ce66ed4982"},
286 | ]
287 | packaging = [
288 |     {file = "packaging-20.4-py2.py3-none-any.whl", hash = "sha256:998416ba6962ae7fbd6596850b80e17859a5753ba17c32284f67bfff33784181"},
289 |     {file = "packaging-20.4.tar.gz", hash = "sha256:4357f74f47b9c12db93624a82154e9b120fa8293699949152b22065d556079f8"},
290 | ]
291 | pluggy = [
292 |     {file = "pluggy-0.13.1-py2.py3-none-any.whl", hash = "sha256:966c145cd83c96502c3c3868f50408687b38434af77734af1e9ca461a4081d2d"},
293 |     {file = "pluggy-0.13.1.tar.gz", hash = "sha256:15b2acde666561e1298d71b523007ed7364de07029219b604cf808bfa1c765b0"},
294 | ]
295 | py = [
296 |     {file = "py-1.8.1-py2.py3-none-any.whl", hash = "sha256:c20fdd83a5dbc0af9efd622bee9a5564e278f6380fffcacc43ba6f43db2813b0"},
297 |     {file = "py-1.8.1.tar.gz", hash = "sha256:5e27081401262157467ad6e7f851b7aa402c5852dbcb3dae06768434de5752aa"},
298 | ]
299 | pyparsing = [
300 |     {file = "pyparsing-2.4.7-py2.py3-none-any.whl", hash = "sha256:ef9d7589ef3c200abe66653d3f1ab1033c3c419ae9b9bdb1240a85b024efc88b"},
301 |     {file = "pyparsing-2.4.7.tar.gz", hash = "sha256:c203ec8783bf771a155b207279b9bccb8dea02d8f0c9e5f8ead507bc3246ecc1"},
302 | ]
303 | pytest = [
304 |     {file = "pytest-5.4.2-py3-none-any.whl", hash = "sha256:95c710d0a72d91c13fae35dce195633c929c3792f54125919847fdcdf7caa0d3"},
305 |     {file = "pytest-5.4.2.tar.gz", hash = "sha256:eb2b5e935f6a019317e455b6da83dd8650ac9ffd2ee73a7b657a30873d67a698"},
306 | ]
307 | requests = [
308 |     {file = "requests-2.23.0-py2.py3-none-any.whl", hash = "sha256:43999036bfa82904b6af1d99e4882b560e5e2c68e5c4b0aa03b655f3d7d73fee"},
309 |     {file = "requests-2.23.0.tar.gz", hash = "sha256:b3f43d496c6daba4493e7c431722aeb7dbc6288f52a6e04e7b6023b0247817e6"},
310 | ]
311 | requests-file = [
312 |     {file = "requests-file-1.5.1.tar.gz", hash = "sha256:07d74208d3389d01c38ab89ef403af0cfec63957d53a0081d8eca738d0247d8e"},
313 |     {file = "requests_file-1.5.1-py2.py3-none-any.whl", hash = "sha256:dfe5dae75c12481f68ba353183c53a65e6044c923e64c24b2209f6c7570ca953"},
314 | ]
315 | six = [
316 |     {file = "six-1.14.0-py2.py3-none-any.whl", hash = "sha256:8f3cd2e254d8f793e7f3d6d9df77b92252b52637291d0f0da013c76ea2724b6c"},
317 |     {file = "six-1.14.0.tar.gz", hash = "sha256:236bdbdce46e6e6a3d61a337c0f8b763ca1e8717c03b369e87a7ec7ce1319c0a"},
318 | ]
319 | tldextract = [
320 |     {file = "tldextract-2.2.2-py2.py3-none-any.whl", hash = "sha256:16b2f7e81d89c2a5a914d25bdbddd3932c31a6b510db886c3ce0764a195c0ee7"},
321 |     {file = "tldextract-2.2.2.tar.gz", hash = "sha256:9aa21a1f7827df4209e242ec4fc2293af5940ec730cde46ea80f66ed97bfc808"},
322 | ]
323 | urllib3 = [
324 |     {file = "urllib3-1.25.9-py2.py3-none-any.whl", hash = "sha256:88206b0eb87e6d677d424843ac5209e3fb9d0190d0ee169599165ec25e9d9115"},
325 |     {file = "urllib3-1.25.9.tar.gz", hash = "sha256:3018294ebefce6572a474f0604c2021e33b3fd8006ecd11d62107a5d2a963527"},
326 | ]
327 | wcwidth = [
328 |     {file = "wcwidth-0.1.9-py2.py3-none-any.whl", hash = "sha256:cafe2186b3c009a04067022ce1dcd79cb38d8d65ee4f4791b8888d6599d1bbe1"},
329 |     {file = "wcwidth-0.1.9.tar.gz", hash = "sha256:ee73862862a156bf77ff92b09034fc4825dd3af9cf81bc5b360668d425f3c5f1"},
330 | ]
331 | zipp = [
332 |     {file = "zipp-3.1.0-py3-none-any.whl", hash = "sha256:aa36550ff0c0b7ef7fa639055d797116ee891440eac1a56f378e2d3179e0320b"},
333 |     {file = "zipp-3.1.0.tar.gz", hash = "sha256:c599e4d75c98f6798c509911d08a22e6c021d074469042177c8c86fb92eefd96"},
334 | ]
335 | 


--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
 1 | [tool.poetry]
 2 | name = "newscatcher"
 3 | version = "0.2.0"
 4 | description = "Get the normalized latest news from (almost) any website"
 5 | authors = ["Artem Bugara <bugara.artem@gmail.com>",
 6 | 	   "Maksym Sugonyaka <sugonyaka.maksym@gmail.com>"]
 7 | readme = "README.md"
 8 | homepage = "https://www.newscatcherapi.com"
 9 | license = "MIT"
10 | keywords = ["News", "RSS", "Scraping", "Data Mining"]
11 | 
12 | [tool.poetry.dependencies]
13 | python = "^3.6"
14 | requests = "^2.23.0"
15 | feedparser = "^5.2.1"
16 | tldextract = "^2.2.2"
17 | 
18 | [tool.poetry.dev-dependencies]
19 | pytest = "^5.2"
20 | 
21 | [build-system]
22 | requires = ["poetry>=0.12"]
23 | build-backend = "poetry.masonry.api"
24 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | feedparser
3 | tldextract
4 | 


--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/tests/__init__.py


--------------------------------------------------------------------------------
/tests/test_newscatcher.py:
--------------------------------------------------------------------------------
1 | from newscatcher import __version__
2 | 
3 | 
4 | def test_version():
5 |     assert __version__ == '0.2.0'
6 | 


--------------------------------------------------------------------------------