├── .gitignore ├── LICENSE ├── README.md ├── dist ├── newscatcher-0.1.0-py3-none-any.whl └── newscatcher-0.1.0.tar.gz ├── newscatcher ├── __init__.py └── data │ └── package_rss.db ├── newscatcher_oneliner.png ├── newscatcherdemo.gif ├── poetry.lock ├── pyproject.toml ├── requirements.txt └── tests ├── __init__.py └── test_newscatcher.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by https://www.gitignore.io/api/macos,python 2 | # Edit at https://www.gitignore.io/?templates=macos,python 3 | 4 | ### macOS ### 5 | # General 6 | .DS_Store 7 | .AppleDouble 8 | .LSOverride 9 | 10 | # Icon must end with two \r 11 | Icon 12 | 13 | # Thumbnails 14 | ._* 15 | 16 | # Files that might appear in the root of a volume 17 | .DocumentRevisions-V100 18 | .fseventsd 19 | .Spotlight-V100 20 | .TemporaryItems 21 | .Trashes 22 | .VolumeIcon.icns 23 | .com.apple.timemachine.donotpresent 24 | 25 | # Directories potentially created on remote AFP share 26 | .AppleDB 27 | .AppleDesktop 28 | Network Trash Folder 29 | Temporary Items 30 | .apdisk 31 | 32 | ### Python ### 33 | # Byte-compiled / optimized / DLL files 34 | __pycache__/ 35 | *.py[cod] 36 | *$py.class 37 | 38 | # C extensions 39 | *.so 40 | 41 | # Distribution / packaging 42 | .Python 43 | build/ 44 | develop-eggs/ 45 | dist/ 46 | downloads/ 47 | eggs/ 48 | .eggs/ 49 | lib/ 50 | lib64/ 51 | parts/ 52 | sdist/ 53 | var/ 54 | wheels/ 55 | pip-wheel-metadata/ 56 | share/python-wheels/ 57 | *.egg-info/ 58 | .installed.cfg 59 | *.egg 60 | MANIFEST 61 | 62 | # PyInstaller 63 | # Usually these files are written by a python script from a template 64 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 65 | *.manifest 66 | *.spec 67 | 68 | # Installer logs 69 | pip-log.txt 70 | pip-delete-this-directory.txt 71 | 72 | # Unit test / coverage reports 73 | htmlcov/ 74 | .tox/ 75 | .nox/ 76 | .coverage 77 | .coverage.* 78 | .cache 79 | nosetests.xml 80 | coverage.xml 81 | *.cover 82 | .hypothesis/ 83 | .pytest_cache/ 84 | 85 | # Translations 86 | *.mo 87 | *.pot 88 | 89 | # Scrapy stuff: 90 | .scrapy 91 | 92 | # Sphinx documentation 93 | docs/_build/ 94 | 95 | # PyBuilder 96 | target/ 97 | 98 | # pyenv 99 | .python-version 100 | 101 | # pipenv 102 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 103 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 104 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 105 | # install all needed dependencies. 106 | #Pipfile.lock 107 | 108 | # celery beat schedule file 109 | celerybeat-schedule 110 | 111 | # SageMath parsed files 112 | *.sage.py 113 | 114 | # Spyder project settings 115 | .spyderproject 116 | .spyproject 117 | 118 | # Rope project settings 119 | .ropeproject 120 | 121 | # Mr Developer 122 | .mr.developer.cfg 123 | .project 124 | .pydevproject 125 | 126 | # mkdocs documentation 127 | /site 128 | 129 | # mypy 130 | .mypy_cache/ 131 | .dmypy.json 132 | dmypy.json 133 | 134 | # Pyre type checker 135 | .pyre/ 136 | 137 | # End of https://www.gitignore.io/api/macos,python 138 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 newscatcherapi.com 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Newscatcher 2 | **Programmatically collect normalized news from (almost) any website.** 3 | 4 | Filter by **topic**, **country**, or **language**. 5 | 6 | Created by [newscatcherapi.com](https://www.newscatcherapi.com) but you do not need anything from us or from anyone else to get the software going, it just works out of the box. 7 | 8 | ## Demo 9 |  10 | 11 | ## Motivation 12 | While working on [newscatcherapi](https://newscatcherapi.com/) - JSON API 13 | to query news articles, 14 | I came up with an idea to make a simple Python package that would allow 15 | to easily grab the live news data. 16 | 17 | When I used to be a junior data scientist working on my own side projects, 18 | it was difficult for me to operate with external data sources. I knew Python 19 | quite well, but in most cases it was not enough to build proper data pipelines 20 | that required gathering data on my own. I hope that this package will help you 21 | with your next project. 22 | 23 | Even though I do not recommend to use this package for any production systems, 24 | I believe that it should be enough to test your assumptions and build some MVPs. 25 | 26 | ## Installation 27 | `pip install newscatcher --upgrade` 28 | 29 | 30 | ## Quick Start 31 | ```python 32 | from newscatcher import Newscatcher 33 | ``` 34 | 35 | Get the latest news from [nytimes.com](https://www.nytimes.com/) 36 | (_we support thousands of news websites, try yourself!_) main news feed 37 | ```python 38 | nc = Newscatcher(website = 'nytimes.com') 39 | results = nc.get_news() 40 | 41 | # results.keys() 42 | # 'url', 'topic', 'language', 'country', 'articles' 43 | 44 | # Get the articles 45 | articles = results['articles'] 46 | 47 | first_article_summary = articles[0]['summary'] 48 | first_article_title = articles[0]['title'] 49 | ``` 50 | 51 | Get the latest news from [nytimes.com](https://www.nytimes.com/) **politics** feed 52 | 53 | ```python 54 | nc = Newscatcher(website = 'nytimes.com', topic = 'politics') 55 | 56 | results = nc.get_news() 57 | articles = results['articles'] 58 | ``` 59 | 60 | There is a limited set of topic that you might find: 61 | 62 | ``` 'tech', 'news', 'business', 'science', 'finance', 'food', 'politics', 'economics', 'travel', 'entertainment', 'music', 'sport', 'world' ``` 63 | 64 | However, not all topics are supported by every newspaper. 65 | 66 | How to check which topics are supported by which newspaper: 67 | ```python 68 | from newscatcher import describe_url 69 | 70 | describe = describe_url('nytimes.com') 71 | 72 | print(describe['topics']) 73 | ``` 74 | 75 | 76 | ### Get the list of all news feeds by topic/language/country 77 | If you want to find the full list of supported news websites 78 | you can always do so using `urls()` function 79 | ```python 80 | from newscatcher import urls 81 | 82 | # URLs by TOPIC 83 | politic_urls = urls(topic = 'politics') 84 | 85 | # URLs by COUNTRY 86 | american_urls = urls(country = 'US') 87 | 88 | # URLs by LANGUAGE 89 | english_urls = urls(language = 'en') 90 | 91 | # Combine any from topic, country, language 92 | american_english_politics_urls = urls(country = 'US', topic = 'politics', language = 'en') 93 | 94 | # note some websites do not explicitly declare their language 95 | # as a result they will be excluded from queries based on language 96 | ``` 97 | 98 | 99 | 100 | 101 | ## Documentation 102 | 103 | ### `Newscatcher` Class 104 | ```python 105 | from newscatcher import Newscatcher 106 | 107 | Newscatcher(website, topic = None) 108 | ``` 109 | **Please take the base form url of a website** (without `www.`,neither `https://`, nor `/` at the end of url). 110 | 111 | For example: “nytimes”.com, “news.ycombinator.com” or “theverge.com”. 112 | ___ 113 | `Newscatcher.get_news()` - Get the latest news from the website of interest. 114 | 115 | Allowed topics: 116 | `tech`, `news`, `business`, `science`, `finance`, `food`, 117 | `politics`, `economics`, `travel`, `entertainment`, 118 | `music`, `sport`, `world` 119 | 120 | If no topic is provided, the main feed is returned. 121 | 122 | Returns a dictionary of 5 elements: 123 | 1. `url` - URL of the website 124 | 2. `topic` - topic of the returned feed 125 | 3. `language` - language of returned feed 126 | 4. `country` - country of returned feed 127 | 5. `articles` - articles of the feed. [Feedparser object]((https://pythonhosted.org/feedparser/reference.html)) 128 | 129 | ___ 130 | 131 | `Newscatcher.get_headlines()` - Returns only the headlines 132 | 133 | ___ 134 | `Newscatcher.print_headlines(n)` - Print top `n` headlines 135 | 136 | 137 | <br> 138 | <br> 139 | <br> 140 | 141 | ### `describe_url()` & `urls()` 142 | Those functions exist to help you navigate through this package 143 | 144 | ___ 145 | ```python 146 | from newscatcher import describe_url 147 | ``` 148 | 149 | `describe_url(website)` - Get the main info on the website. 150 | 151 | Returns a dictionary of 5 elements: 152 | 1. `url` - URL of the website 153 | 2. `topics` - list of all supported topics 154 | 3. `language` - language of website 155 | 4. `country` - country of returned feed 156 | 5. `main_topic` - main topic of a website 157 | 158 | ___ 159 | ```python 160 | from newscatcher import urls 161 | ``` 162 | 163 | `urls(topic = None, language = None, country = None)` - Get a list of all supported 164 | news websites given any combination of `topic`, `language`, `country` 165 | 166 | Returns a list of websites that match your combination of `topic`, `language`, `country` 167 | 168 | Supported topics: 169 | `tech`, `news`, `business`, `science`, `finance`, `food`, 170 | `politics`, `economics`, `travel`, `entertainment`, 171 | `music`, `sport`, `world` 172 | 173 | 174 | Supported countries: 175 | `US`, `GB`, `DE`, `FR`, `IN`, `RU`, `ES`, `BR`, `IT`, `CA`, `AU`, `NL`, `PL`, `NZ`, `PT`, `RO`, `UA`, `JP`, `AR`, `IR`, `IE`, `PH`, `IS`, `ZA`, `AT`, `CL`, `HR`, `BG`, `HU`, `KR`, `SZ`, `AE`, `EG`, `VE`, `CO`, `SE`, `CZ`, `ZH`, `MT`, `AZ`, `GR`, `BE`, `LU`, `IL`, `LT`, `NI`, `MY`, `TR`, `BM`, `NO`, `ME`, `SA`, `RS`, `BA` 176 | 177 | Supported languages: 178 | `EL`, `IT`, `ZH`, `EN`, `RU`, `CS`, `RO`, `FR`, `JA`, `DE`, `PT`, `ES`, `AR`, `HE`, `UK`, `PL`, `NL`, `TR`, `VI`, `KO`, `TH`, `ID`, `HR`, `DA`, `BG`, `NO`, `SK`, `FA`, `ET`, `SV`, `BN`, `GU`, `MK`, `PA`, `HU`, `SL`, `FI`, `LT`, `MR`, `HI` 179 | 180 | 181 | 182 | 183 | 184 | ## Tech/framework used 185 | The package itself is nothing more than a SQLite database with 186 | RSS feed endpoints for each website and some basic wrapper of 187 | [feedparser](https://pythonhosted.org/feedparser/index.html). 188 | 189 | 190 | ## About Us 191 | We are Newscatcher API team. We are glad that you liked our package. 192 | 193 | If you want to search for any news data, consider using [our API](https://newscatcherapi.com/) 194 | 195 |  196 | 197 | 198 | [Artem Bugara]() - co-founder of Newscatcher, made v.0.1.0 199 | 200 | [Maksym Sugonyaka](https://www.linkedin.com/mwlite/in/msugonyaka) - co-founder of Newscatcher, made v.0.1.0 201 | 202 | [Becket Trotter](https://www.linkedin.com/in/beckettrotter/) - Python Developer, made v.0.2.0 203 | 204 | ## Licence 205 | MIT 206 | -------------------------------------------------------------------------------- /dist/newscatcher-0.1.0-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/dist/newscatcher-0.1.0-py3-none-any.whl -------------------------------------------------------------------------------- /dist/newscatcher-0.1.0.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/dist/newscatcher-0.1.0.tar.gz -------------------------------------------------------------------------------- /newscatcher/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "0.2.0" 2 | 3 | # Retrieve and analyze 4 | # 24/7 streams of news data 5 | import sqlite3 6 | 7 | # import requests 8 | import feedparser 9 | import pkg_resources 10 | from tldextract import extract 11 | 12 | DB_FILE = pkg_resources.resource_filename("newscatcher", "data/package_rss.db") 13 | 14 | 15 | class Query: 16 | # Query class used to build subsequent sql queries 17 | def __init__(self): 18 | self.params = {"website": None, "topic": None} 19 | 20 | def build_conditional(self, field, sql_field): 21 | # single conditional build 22 | field = field.lower() 23 | sql_field = sql_field.lower() 24 | 25 | if self.params[field] != None: 26 | conditional = "{} = '{}'".format(sql_field, self.params[field]) 27 | return conditional 28 | return 29 | 30 | def build_where(self): 31 | # returning the conditional from paramters 32 | # the post "WHERE" 33 | conditionals = [] 34 | 35 | conv = {"topic": "topic_unified", "website": "clean_url"} 36 | 37 | for field in conv.keys(): 38 | cond = self.build_conditional(field, conv[field]) 39 | if cond != None: 40 | conditionals.append(cond) 41 | 42 | if conditionals == []: 43 | return 44 | 45 | conditionals[0] = "WHERE " + conditionals[0] 46 | conditionals = """ AND '.join([x for x in conditionals if x != None]) 47 | + ' ORDER BY IFNULL(Globalrank,999999);""" 48 | 49 | return conditionals 50 | 51 | def build_sql(self): 52 | # build sql on user qeury 53 | db = sqlite3.connect(DB_FILE, isolation_level=None) 54 | sql = "SELECT rss_url from rss_main " + self.build_where() 55 | 56 | db.close() 57 | return sql 58 | 59 | 60 | def clean_url(dirty_url): 61 | # website.com 62 | dirty_url = dirty_url.lower() 63 | o = extract(dirty_url) 64 | return o.domain + "." + o.suffix 65 | 66 | 67 | class Newscatcher: 68 | # search engine 69 | def build_sql(self): 70 | if self.topic is None: 71 | sql = """SELECT rss_url from rss_main 72 | WHERE clean_url = '{}';""" 73 | sql = sql.format(self.url) 74 | return sql 75 | 76 | def __init__(self, website, topic=None): 77 | # init with given params 78 | website = website.lower() 79 | self.url = clean_url(website) 80 | self.topic = topic 81 | 82 | def get_headlines(self, n=None): 83 | if self.topic is None: 84 | sql = """SELECT rss_url,topic_unified, language, clean_country from rss_main 85 | WHERE clean_url = '{}' AND main = 1;""" 86 | sql = sql.format(self.url) 87 | else: 88 | sql = """SELECT rss_url, topic_unified, language, clean_country from rss_main 89 | WHERE clean_url = '{}' AND topic_unified = '{}';""" 90 | sql = sql.format(self.url, self.topic) 91 | 92 | db = sqlite3.connect(DB_FILE, isolation_level=None) 93 | 94 | try: 95 | rss_endpoint, _, _, _ = db.execute(sql).fetchone() 96 | feed = feedparser.parse(rss_endpoint) 97 | except: 98 | if self.topic is not None: 99 | sql = """SELECT rss_url from rss_main 100 | WHERE clean_url = '{}';""" 101 | sql = sql.format(self.url) 102 | 103 | if len(db.execute(sql).fetchall()) > 0: 104 | db.close() 105 | print("Topic is not supported") 106 | return 107 | else: 108 | print("Website is not supported") 109 | return 110 | else: 111 | print("Website is not supported") 112 | return 113 | 114 | if feed["entries"] == []: 115 | db.close() 116 | print( 117 | "\nNo headlines found check internet connection or query parameters\n" 118 | ) 119 | return 120 | 121 | title_list = [] 122 | for article in feed["entries"]: 123 | if "title" in article: 124 | title_list.append(article["title"]) 125 | if n != None: 126 | if len(title_list) == n: 127 | break 128 | 129 | return title_list 130 | 131 | def print_headlines(self, n=None): 132 | headlines = self.get_headlines(n) 133 | 134 | i = 1 135 | for headline in headlines: 136 | if i < 10: 137 | print(str(i) + ". | " + headline) 138 | i += 1 139 | elif i in list(range(10, 100)): 140 | print(str(i) + ". | " + headline) 141 | i += 1 142 | else: 143 | print(str(i) + ". | " + headline) 144 | i += 1 145 | 146 | def get_news(self, n=None): 147 | # return results based on current stream 148 | if self.topic is None: 149 | sql = """SELECT rss_url,topic_unified, language, clean_country from rss_main 150 | WHERE clean_url = '{}' AND main = 1;""" 151 | sql = sql.format(self.url) 152 | else: 153 | sql = """SELECT rss_url, topic_unified, language, clean_country from rss_main 154 | WHERE clean_url = '{}' AND topic_unified = '{}';""" 155 | sql = sql.format(self.url, self.topic) 156 | 157 | db = sqlite3.connect(DB_FILE, isolation_level=None) 158 | 159 | try: 160 | rss_endpoint, topic, language, country = db.execute(sql).fetchone() 161 | feed = feedparser.parse(rss_endpoint) 162 | except: 163 | if self.topic is not None: 164 | sql = """SELECT rss_url from rss_main 165 | WHERE clean_url = '{}';""" 166 | sql = sql.format(self.url) 167 | 168 | if len(db.execute(sql).fetchall()) > 0: 169 | db.close() 170 | print("Topic is not supported") 171 | return 172 | else: 173 | print("Website is not supported") 174 | return 175 | else: 176 | print("Website is not supported") 177 | return 178 | 179 | if feed["entries"] == []: 180 | db.close() 181 | print("\nNo results found check internet connection or query parameters\n") 182 | return 183 | 184 | if n == None or len(feed["entries"]) <= n: 185 | articles = feed["entries"] # ['summary']#[0].keys() 186 | else: 187 | articles = feed["entries"][:n] 188 | 189 | db.close() 190 | return { 191 | "url": self.url, 192 | "topic": topic, 193 | "language": language, 194 | "country": country, 195 | "articles": articles, 196 | } 197 | 198 | 199 | def describe_url(website): 200 | # return newscatcher fields that correspond to the url 201 | website = website.lower() 202 | website = clean_url(website) 203 | db = sqlite3.connect(DB_FILE, isolation_level=None) 204 | 205 | sql = "SELECT clean_url, language, clean_country, topic_unified from rss_main WHERE clean_url = '{}' and main == 1 ".format( 206 | website 207 | ) 208 | results = db.execute(sql).fetchone() 209 | main = results[-1] 210 | 211 | if main == None: 212 | print("\nWebsite not supported\n") 213 | return 214 | 215 | if len(main) == 0: 216 | print("\nWebsite note supported\n") 217 | return 218 | 219 | sql = "SELECT DISTINCT topic_unified from rss_main WHERE clean_url == '{}'".format( 220 | website 221 | ) 222 | topics = db.execute(sql).fetchall() 223 | topics = [x[0] for x in topics] 224 | 225 | ret = { 226 | "url": results[0], 227 | "language": results[1], 228 | "country": results[2], 229 | "main_topic": main, 230 | "topics": topics, 231 | } 232 | 233 | return ret 234 | 235 | 236 | def urls(topic=None, language=None, country=None): 237 | # return urls that matches users parameters 238 | if language != None: 239 | language = language.lower() 240 | 241 | if country != None: 242 | country = country.upper() 243 | 244 | if topic != None: 245 | topic = topic.lower() 246 | 247 | db = sqlite3.connect(DB_FILE, isolation_level=None) 248 | quick_q = Query() 249 | inp = {"topic": topic, "language": language, "country": country} 250 | for x in inp.keys(): 251 | quick_q.params[x] = inp[x] 252 | 253 | conditionals = [] 254 | conv = { 255 | "topic": "topic_unified", 256 | "website": "clean_url", 257 | "country": "clean_country", 258 | "language": "language", 259 | } 260 | 261 | for field in conv.keys(): 262 | try: 263 | cond = quick_q.build_conditional(field, conv[field]) 264 | except: 265 | cond = None 266 | 267 | if cond != None: 268 | conditionals.append(cond) 269 | 270 | sql = "" 271 | 272 | if conditionals == []: 273 | sql = "SELECT clean_url from rss_main " 274 | else: 275 | conditionals[0] = " WHERE " + conditionals[0] 276 | conditionals = " AND ".join([x for x in conditionals if x is not None]) 277 | conditionals += " AND main = 1 ORDER BY IFNULL(Globalrank,999999);" 278 | sql = "SELECT DISTINCT clean_url from rss_main" + conditionals 279 | 280 | ret = db.execute(sql).fetchall() 281 | if len(ret) == 0: 282 | print("\nNo websites found for given parameters\n") 283 | return 284 | 285 | db.close() 286 | return [x[0] for x in ret] 287 | -------------------------------------------------------------------------------- /newscatcher/data/package_rss.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/newscatcher/data/package_rss.db -------------------------------------------------------------------------------- /newscatcher_oneliner.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/newscatcher_oneliner.png -------------------------------------------------------------------------------- /newscatcherdemo.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/newscatcherdemo.gif -------------------------------------------------------------------------------- /poetry.lock: -------------------------------------------------------------------------------- 1 | [[package]] 2 | category = "dev" 3 | description = "Atomic file writes." 4 | marker = "sys_platform == \"win32\"" 5 | name = "atomicwrites" 6 | optional = false 7 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" 8 | version = "1.4.0" 9 | 10 | [[package]] 11 | category = "dev" 12 | description = "Classes Without Boilerplate" 13 | name = "attrs" 14 | optional = false 15 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" 16 | version = "19.3.0" 17 | 18 | [package.extras] 19 | azure-pipelines = ["coverage", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "zope.interface", "pytest-azurepipelines"] 20 | dev = ["coverage", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "zope.interface", "sphinx", "pre-commit"] 21 | docs = ["sphinx", "zope.interface"] 22 | tests = ["coverage", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "zope.interface"] 23 | 24 | [[package]] 25 | category = "main" 26 | description = "Python package for providing Mozilla's CA Bundle." 27 | name = "certifi" 28 | optional = false 29 | python-versions = "*" 30 | version = "2020.4.5.1" 31 | 32 | [[package]] 33 | category = "main" 34 | description = "Universal encoding detector for Python 2 and 3" 35 | name = "chardet" 36 | optional = false 37 | python-versions = "*" 38 | version = "3.0.4" 39 | 40 | [[package]] 41 | category = "dev" 42 | description = "Cross-platform colored terminal text." 43 | marker = "sys_platform == \"win32\"" 44 | name = "colorama" 45 | optional = false 46 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*" 47 | version = "0.4.3" 48 | 49 | [[package]] 50 | category = "main" 51 | description = "Universal feed parser, handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds" 52 | name = "feedparser" 53 | optional = false 54 | python-versions = "*" 55 | version = "5.2.1" 56 | 57 | [[package]] 58 | category = "main" 59 | description = "Internationalized Domain Names in Applications (IDNA)" 60 | name = "idna" 61 | optional = false 62 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" 63 | version = "2.9" 64 | 65 | [[package]] 66 | category = "dev" 67 | description = "Read metadata from Python packages" 68 | marker = "python_version < \"3.8\"" 69 | name = "importlib-metadata" 70 | optional = false 71 | python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,>=2.7" 72 | version = "1.6.0" 73 | 74 | [package.dependencies] 75 | zipp = ">=0.5" 76 | 77 | [package.extras] 78 | docs = ["sphinx", "rst.linker"] 79 | testing = ["packaging", "importlib-resources"] 80 | 81 | [[package]] 82 | category = "dev" 83 | description = "More routines for operating on iterables, beyond itertools" 84 | name = "more-itertools" 85 | optional = false 86 | python-versions = ">=3.5" 87 | version = "8.3.0" 88 | 89 | [[package]] 90 | category = "dev" 91 | description = "Core utilities for Python packages" 92 | name = "packaging" 93 | optional = false 94 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" 95 | version = "20.4" 96 | 97 | [package.dependencies] 98 | pyparsing = ">=2.0.2" 99 | six = "*" 100 | 101 | [[package]] 102 | category = "dev" 103 | description = "plugin and hook calling mechanisms for python" 104 | name = "pluggy" 105 | optional = false 106 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" 107 | version = "0.13.1" 108 | 109 | [package.dependencies] 110 | [package.dependencies.importlib-metadata] 111 | python = "<3.8" 112 | version = ">=0.12" 113 | 114 | [package.extras] 115 | dev = ["pre-commit", "tox"] 116 | 117 | [[package]] 118 | category = "dev" 119 | description = "library with cross-python path, ini-parsing, io, code, log facilities" 120 | name = "py" 121 | optional = false 122 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" 123 | version = "1.8.1" 124 | 125 | [[package]] 126 | category = "dev" 127 | description = "Python parsing module" 128 | name = "pyparsing" 129 | optional = false 130 | python-versions = ">=2.6, !=3.0.*, !=3.1.*, !=3.2.*" 131 | version = "2.4.7" 132 | 133 | [[package]] 134 | category = "dev" 135 | description = "pytest: simple powerful testing with Python" 136 | name = "pytest" 137 | optional = false 138 | python-versions = ">=3.5" 139 | version = "5.4.2" 140 | 141 | [package.dependencies] 142 | atomicwrites = ">=1.0" 143 | attrs = ">=17.4.0" 144 | colorama = "*" 145 | more-itertools = ">=4.0.0" 146 | packaging = "*" 147 | pluggy = ">=0.12,<1.0" 148 | py = ">=1.5.0" 149 | wcwidth = "*" 150 | 151 | [package.dependencies.importlib-metadata] 152 | python = "<3.8" 153 | version = ">=0.12" 154 | 155 | [package.extras] 156 | checkqa-mypy = ["mypy (v0.761)"] 157 | testing = ["argcomplete", "hypothesis (>=3.56)", "mock", "nose", "requests", "xmlschema"] 158 | 159 | [[package]] 160 | category = "main" 161 | description = "Python HTTP for Humans." 162 | name = "requests" 163 | optional = false 164 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*" 165 | version = "2.23.0" 166 | 167 | [package.dependencies] 168 | certifi = ">=2017.4.17" 169 | chardet = ">=3.0.2,<4" 170 | idna = ">=2.5,<3" 171 | urllib3 = ">=1.21.1,<1.25.0 || >1.25.0,<1.25.1 || >1.25.1,<1.26" 172 | 173 | [package.extras] 174 | security = ["pyOpenSSL (>=0.14)", "cryptography (>=1.3.4)"] 175 | socks = ["PySocks (>=1.5.6,<1.5.7 || >1.5.7)", "win-inet-pton"] 176 | 177 | [[package]] 178 | category = "main" 179 | description = "File transport adapter for Requests" 180 | name = "requests-file" 181 | optional = false 182 | python-versions = "*" 183 | version = "1.5.1" 184 | 185 | [package.dependencies] 186 | requests = ">=1.0.0" 187 | six = "*" 188 | 189 | [[package]] 190 | category = "main" 191 | description = "Python 2 and 3 compatibility utilities" 192 | name = "six" 193 | optional = false 194 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*" 195 | version = "1.14.0" 196 | 197 | [[package]] 198 | category = "main" 199 | description = "Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List. By default, this includes the public ICANN TLDs and their exceptions. You can optionally support the Public Suffix List's private domains as well." 200 | name = "tldextract" 201 | optional = false 202 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" 203 | version = "2.2.2" 204 | 205 | [package.dependencies] 206 | idna = "*" 207 | requests = ">=2.1.0" 208 | requests-file = ">=1.4" 209 | setuptools = "*" 210 | 211 | [[package]] 212 | category = "main" 213 | description = "HTTP library with thread-safe connection pooling, file post, and more." 214 | name = "urllib3" 215 | optional = false 216 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, <4" 217 | version = "1.25.9" 218 | 219 | [package.extras] 220 | brotli = ["brotlipy (>=0.6.0)"] 221 | secure = ["certifi", "cryptography (>=1.3.4)", "idna (>=2.0.0)", "pyOpenSSL (>=0.14)", "ipaddress"] 222 | socks = ["PySocks (>=1.5.6,<1.5.7 || >1.5.7,<2.0)"] 223 | 224 | [[package]] 225 | category = "dev" 226 | description = "Measures number of Terminal column cells of wide-character codes" 227 | name = "wcwidth" 228 | optional = false 229 | python-versions = "*" 230 | version = "0.1.9" 231 | 232 | [[package]] 233 | category = "dev" 234 | description = "Backport of pathlib-compatible object wrapper for zip files" 235 | marker = "python_version < \"3.8\"" 236 | name = "zipp" 237 | optional = false 238 | python-versions = ">=3.6" 239 | version = "3.1.0" 240 | 241 | [package.extras] 242 | docs = ["sphinx", "jaraco.packaging (>=3.2)", "rst.linker (>=1.9)"] 243 | testing = ["jaraco.itertools", "func-timeout"] 244 | 245 | [metadata] 246 | content-hash = "a4ae118340bddfdf0bb2db18f07c7259530e2328f0cee7e70854813dcc6bb422" 247 | python-versions = "^3.6" 248 | 249 | [metadata.files] 250 | atomicwrites = [ 251 | {file = "atomicwrites-1.4.0-py2.py3-none-any.whl", hash = "sha256:6d1784dea7c0c8d4a5172b6c620f40b6e4cbfdf96d783691f2e1302a7b88e197"}, 252 | {file = "atomicwrites-1.4.0.tar.gz", hash = "sha256:ae70396ad1a434f9c7046fd2dd196fc04b12f9e91ffb859164193be8b6168a7a"}, 253 | ] 254 | attrs = [ 255 | {file = "attrs-19.3.0-py2.py3-none-any.whl", hash = "sha256:08a96c641c3a74e44eb59afb61a24f2cb9f4d7188748e76ba4bb5edfa3cb7d1c"}, 256 | {file = "attrs-19.3.0.tar.gz", hash = "sha256:f7b7ce16570fe9965acd6d30101a28f62fb4a7f9e926b3bbc9b61f8b04247e72"}, 257 | ] 258 | certifi = [ 259 | {file = "certifi-2020.4.5.1-py2.py3-none-any.whl", hash = "sha256:1d987a998c75633c40847cc966fcf5904906c920a7f17ef374f5aa4282abd304"}, 260 | {file = "certifi-2020.4.5.1.tar.gz", hash = "sha256:51fcb31174be6e6664c5f69e3e1691a2d72a1a12e90f872cbdb1567eb47b6519"}, 261 | ] 262 | chardet = [ 263 | {file = "chardet-3.0.4-py2.py3-none-any.whl", hash = "sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691"}, 264 | {file = "chardet-3.0.4.tar.gz", hash = "sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae"}, 265 | ] 266 | colorama = [ 267 | {file = "colorama-0.4.3-py2.py3-none-any.whl", hash = "sha256:7d73d2a99753107a36ac6b455ee49046802e59d9d076ef8e47b61499fa29afff"}, 268 | {file = "colorama-0.4.3.tar.gz", hash = "sha256:e96da0d330793e2cb9485e9ddfd918d456036c7149416295932478192f4436a1"}, 269 | ] 270 | feedparser = [ 271 | {file = "feedparser-5.2.1.tar.bz2", hash = "sha256:ce875495c90ebd74b179855449040003a1beb40cd13d5f037a0654251e260b02"}, 272 | {file = "feedparser-5.2.1.tar.gz", hash = "sha256:bd030652c2d08532c034c27fcd7c85868e7fa3cb2b17f230a44a6bbc92519bf9"}, 273 | {file = "feedparser-5.2.1.zip", hash = "sha256:cd2485472e41471632ed3029d44033ee420ad0b57111db95c240c9160a85831c"}, 274 | ] 275 | idna = [ 276 | {file = "idna-2.9-py2.py3-none-any.whl", hash = "sha256:a068a21ceac8a4d63dbfd964670474107f541babbd2250d61922f029858365fa"}, 277 | {file = "idna-2.9.tar.gz", hash = "sha256:7588d1c14ae4c77d74036e8c22ff447b26d0fde8f007354fd48a7814db15b7cb"}, 278 | ] 279 | importlib-metadata = [ 280 | {file = "importlib_metadata-1.6.0-py2.py3-none-any.whl", hash = "sha256:2a688cbaa90e0cc587f1df48bdc97a6eadccdcd9c35fb3f976a09e3b5016d90f"}, 281 | {file = "importlib_metadata-1.6.0.tar.gz", hash = "sha256:34513a8a0c4962bc66d35b359558fd8a5e10cd472d37aec5f66858addef32c1e"}, 282 | ] 283 | more-itertools = [ 284 | {file = "more-itertools-8.3.0.tar.gz", hash = "sha256:558bb897a2232f5e4f8e2399089e35aecb746e1f9191b6584a151647e89267be"}, 285 | {file = "more_itertools-8.3.0-py3-none-any.whl", hash = "sha256:7818f596b1e87be009031c7653d01acc46ed422e6656b394b0f765ce66ed4982"}, 286 | ] 287 | packaging = [ 288 | {file = "packaging-20.4-py2.py3-none-any.whl", hash = "sha256:998416ba6962ae7fbd6596850b80e17859a5753ba17c32284f67bfff33784181"}, 289 | {file = "packaging-20.4.tar.gz", hash = "sha256:4357f74f47b9c12db93624a82154e9b120fa8293699949152b22065d556079f8"}, 290 | ] 291 | pluggy = [ 292 | {file = "pluggy-0.13.1-py2.py3-none-any.whl", hash = "sha256:966c145cd83c96502c3c3868f50408687b38434af77734af1e9ca461a4081d2d"}, 293 | {file = "pluggy-0.13.1.tar.gz", hash = "sha256:15b2acde666561e1298d71b523007ed7364de07029219b604cf808bfa1c765b0"}, 294 | ] 295 | py = [ 296 | {file = "py-1.8.1-py2.py3-none-any.whl", hash = "sha256:c20fdd83a5dbc0af9efd622bee9a5564e278f6380fffcacc43ba6f43db2813b0"}, 297 | {file = "py-1.8.1.tar.gz", hash = "sha256:5e27081401262157467ad6e7f851b7aa402c5852dbcb3dae06768434de5752aa"}, 298 | ] 299 | pyparsing = [ 300 | {file = "pyparsing-2.4.7-py2.py3-none-any.whl", hash = "sha256:ef9d7589ef3c200abe66653d3f1ab1033c3c419ae9b9bdb1240a85b024efc88b"}, 301 | {file = "pyparsing-2.4.7.tar.gz", hash = "sha256:c203ec8783bf771a155b207279b9bccb8dea02d8f0c9e5f8ead507bc3246ecc1"}, 302 | ] 303 | pytest = [ 304 | {file = "pytest-5.4.2-py3-none-any.whl", hash = "sha256:95c710d0a72d91c13fae35dce195633c929c3792f54125919847fdcdf7caa0d3"}, 305 | {file = "pytest-5.4.2.tar.gz", hash = "sha256:eb2b5e935f6a019317e455b6da83dd8650ac9ffd2ee73a7b657a30873d67a698"}, 306 | ] 307 | requests = [ 308 | {file = "requests-2.23.0-py2.py3-none-any.whl", hash = "sha256:43999036bfa82904b6af1d99e4882b560e5e2c68e5c4b0aa03b655f3d7d73fee"}, 309 | {file = "requests-2.23.0.tar.gz", hash = "sha256:b3f43d496c6daba4493e7c431722aeb7dbc6288f52a6e04e7b6023b0247817e6"}, 310 | ] 311 | requests-file = [ 312 | {file = "requests-file-1.5.1.tar.gz", hash = "sha256:07d74208d3389d01c38ab89ef403af0cfec63957d53a0081d8eca738d0247d8e"}, 313 | {file = "requests_file-1.5.1-py2.py3-none-any.whl", hash = "sha256:dfe5dae75c12481f68ba353183c53a65e6044c923e64c24b2209f6c7570ca953"}, 314 | ] 315 | six = [ 316 | {file = "six-1.14.0-py2.py3-none-any.whl", hash = "sha256:8f3cd2e254d8f793e7f3d6d9df77b92252b52637291d0f0da013c76ea2724b6c"}, 317 | {file = "six-1.14.0.tar.gz", hash = "sha256:236bdbdce46e6e6a3d61a337c0f8b763ca1e8717c03b369e87a7ec7ce1319c0a"}, 318 | ] 319 | tldextract = [ 320 | {file = "tldextract-2.2.2-py2.py3-none-any.whl", hash = "sha256:16b2f7e81d89c2a5a914d25bdbddd3932c31a6b510db886c3ce0764a195c0ee7"}, 321 | {file = "tldextract-2.2.2.tar.gz", hash = "sha256:9aa21a1f7827df4209e242ec4fc2293af5940ec730cde46ea80f66ed97bfc808"}, 322 | ] 323 | urllib3 = [ 324 | {file = "urllib3-1.25.9-py2.py3-none-any.whl", hash = "sha256:88206b0eb87e6d677d424843ac5209e3fb9d0190d0ee169599165ec25e9d9115"}, 325 | {file = "urllib3-1.25.9.tar.gz", hash = "sha256:3018294ebefce6572a474f0604c2021e33b3fd8006ecd11d62107a5d2a963527"}, 326 | ] 327 | wcwidth = [ 328 | {file = "wcwidth-0.1.9-py2.py3-none-any.whl", hash = "sha256:cafe2186b3c009a04067022ce1dcd79cb38d8d65ee4f4791b8888d6599d1bbe1"}, 329 | {file = "wcwidth-0.1.9.tar.gz", hash = "sha256:ee73862862a156bf77ff92b09034fc4825dd3af9cf81bc5b360668d425f3c5f1"}, 330 | ] 331 | zipp = [ 332 | {file = "zipp-3.1.0-py3-none-any.whl", hash = "sha256:aa36550ff0c0b7ef7fa639055d797116ee891440eac1a56f378e2d3179e0320b"}, 333 | {file = "zipp-3.1.0.tar.gz", hash = "sha256:c599e4d75c98f6798c509911d08a22e6c021d074469042177c8c86fb92eefd96"}, 334 | ] 335 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "newscatcher" 3 | version = "0.2.0" 4 | description = "Get the normalized latest news from (almost) any website" 5 | authors = ["Artem Bugara <bugara.artem@gmail.com>", 6 | "Maksym Sugonyaka <sugonyaka.maksym@gmail.com>"] 7 | readme = "README.md" 8 | homepage = "https://www.newscatcherapi.com" 9 | license = "MIT" 10 | keywords = ["News", "RSS", "Scraping", "Data Mining"] 11 | 12 | [tool.poetry.dependencies] 13 | python = "^3.6" 14 | requests = "^2.23.0" 15 | feedparser = "^5.2.1" 16 | tldextract = "^2.2.2" 17 | 18 | [tool.poetry.dev-dependencies] 19 | pytest = "^5.2" 20 | 21 | [build-system] 22 | requires = ["poetry>=0.12"] 23 | build-backend = "poetry.masonry.api" 24 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | requests 2 | feedparser 3 | tldextract 4 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kotartemiy/newscatcher/b86b1a650241be4e82941319698e01a33c0c01ac/tests/__init__.py -------------------------------------------------------------------------------- /tests/test_newscatcher.py: -------------------------------------------------------------------------------- 1 | from newscatcher import __version__ 2 | 3 | 4 | def test_version(): 5 | assert __version__ == '0.2.0' 6 | --------------------------------------------------------------------------------