├── .github └── FUNDING.yml ├── .gitignore ├── LICENSE ├── README.md ├── core.py ├── extract_content.py ├── main.py ├── main_no_vpn.py └── requirements.txt /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: [philipperemy] 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | 3 | .DS_Store 4 | .idea 5 | 6 | nohup.out 7 | data/ 8 | data* 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | env/ 19 | build/ 20 | develop-eggs/ 21 | dist/ 22 | downloads/ 23 | eggs/ 24 | .eggs/ 25 | lib/ 26 | lib64/ 27 | parts/ 28 | sdist/ 29 | var/ 30 | wheels/ 31 | *.egg-info/ 32 | .installed.cfg 33 | *.egg 34 | 35 | # PyInstaller 36 | # Usually these files are written by a python script from a template 37 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 38 | *.manifest 39 | *.spec 40 | 41 | # Installer logs 42 | pip-log.txt 43 | pip-delete-this-directory.txt 44 | 45 | # Unit test / coverage reports 46 | htmlcov/ 47 | .tox/ 48 | .coverage 49 | .coverage.* 50 | .cache 51 | nosetests.xml 52 | coverage.xml 53 | *,cover 54 | .hypothesis/ 55 | 56 | # Translations 57 | *.mo 58 | *.pot 59 | 60 | # Django stuff: 61 | *.log 62 | local_settings.py 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # pyenv 81 | .python-version 82 | 83 | # celery beat schedule file 84 | celerybeat-schedule 85 | 86 | # dotenv 87 | .env 88 | 89 | # virtualenv 90 | .venv 91 | venv/ 92 | ENV/ 93 | 94 | # Spyder project settings 95 | .spyderproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Philippe Rémy 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Google News Scraper - Japanese and Chinese supported 2 | 3 | For English articles, Google has an RSS feed that you can use directly. [Click here for English](https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en). 4 | 5 | Each scraped article has the following fields: 6 | - **title**: Title of the article 7 | - **datetime**: Publication date 8 | - **content**: Full content (text format) - best effort 9 | - **link**: URL where the article was published 10 | - **keyword**: Google News keyword used to find this article 11 | 12 | ## How many articles can I fetch with this scraper? 13 | 14 | There is no hard upper bound, but expect roughly **`100,000 articles per day`** when scraping 24/7 with a VPN enabled. 15 | 16 | ## How to get started? 17 | ```bash 18 | git clone git@github.com:philipperemy/google-news-scraper.git && cd google-news-scraper 19 | virtualenv -p python3 venv && source venv/bin/activate # optional but recommended! 20 | pip install -r requirements.txt 21 | python main_no_vpn.py --keywords hello,toto --language ja # for VPN support, scroll down! 22 | ``` 23 | 24 | ## Output example 25 | 26 | `Article 1` 27 | ``` 28 | { 29 | "content": "(本文中の野村証券 [...] 生命経済研の熊野英生氏は指摘。  記事の全文 \n保護主義を根拠とする円高説を信じ込むのは禁物であり、実際は米貿易赤字縮小と円安が進むかもしれないとBBHの村田雅志氏は指摘。  記事の全文 \n", 30 | "datetime": "2015/11/03", 31 | "keyword": "米国の銀行業務", 32 | "link": "http://jp.reuters.com/article/idJPL3N12Y5QX20151104", 33 | "title": "再送-インタビュー:運用高度化、PEやハイイールド債増やす=長門・ゆうちょ銀社長" 34 | } 35 | ``` 36 | 37 | `Article 2` 38 | 39 | ``` 40 | { 41 | "content": "記事保存 有料会員の方のみご利用になれます。[...] 詳しくは、こちら 電子版トップ速報トップ アルゼンチン、ドル、通貨ペソ、外貨取引 来春の新入社員を募集 記者など4職種 【週末新紙面】宅配+電子版お試し実施中! 天気 プレスリリース検索 アカウント一覧 訂正・おわび", 42 | "datetime": "2015/12/17", 43 | "keyword": "アルゼンチン", 44 | "link": "http://www.nikkei.com/article/DGXLASGM18H1B_Y5A211C1EAF000/", 45 | "title": "アルゼンチンの通貨ペソ、大幅下落 対ドルで36%安" 46 | } 47 | ``` 48 | **NOTE**: The `content` field is truncated in the examples above. 49 | 50 | ## VPN 51 | Scraping Google News usually results in a ban for a few hours. Using a VPN that can fetch a new IP on demand is a way to work around this. 52 | 53 | In my case, I subscribed to this VPN: [https://www.expressvpn.com/](https://www.expressvpn.com/). 54 | 55 | I provide a Python binding for this VPN here: [https://github.com/philipperemy/expressvpn-python](https://github.com/philipperemy/expressvpn-python). 56 | 57 | Also make sure that: 58 | - you can run `expressvpn` in your terminal. 59 | - ExpressVPN is properly configured: 60 | - [https://www.expressvpn.com/setup](https://www.expressvpn.com/setup) 61 | - [https://www.expressvpn.com/support/vpn-setup/app-for-linux/#download](https://www.expressvpn.com/support/vpn-setup/app-for-linux/#download) 62 | - you get `expressvpn-python (x.y)`, where `x.y` is the version, when you run `pip list | grep "expressvpn-python"`. 63 | 64 | Every time the script detects that Google has banned you, it asks the VPN for a fresh IP and resumes. 65 | 66 | ## Questions/Answers 67 | - Why didn't you use the RSS feed provided by Google News? It does not exist for Japanese! 68 | - What is the best way to use this scraper? If you want to scrape a lot of data, I highly recommend subscribing to a VPN, preferably ExpressVPN (I implemented the VPN wrapper and its integration with this scraper).
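When `--retrieve_content_behind_links` is enabled, each article is saved as a JSON file under `data/<language>/<keyword>/news/` (see `download_links_and_contents` and `retrieve_data_for_link` in `core.py`). The snippet below is a minimal sketch, not part of the scraper itself, for loading everything scraped so far; it only assumes that directory layout.

```python
import glob
import json

# Gather every article JSON written by the scraper: data/<language>/<keyword>/news/<hash>.json
articles = []
for path in glob.glob('data/*/*/news/*.json'):
    with open(path, encoding='utf8') as f:
        articles.append(json.load(f))

print('Loaded {} articles.'.format(len(articles)))
for article in articles[:3]:
    # Print whichever fields are present (field names follow what core.py writes).
    print(article.get('date'), '-', article.get('title'))
```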
69 | -------------------------------------------------------------------------------- /core.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import errno 4 | import hashlib 5 | import json 6 | import logging 7 | import os 8 | import random 9 | import re 10 | import time 11 | from datetime import datetime 12 | 13 | import requests 14 | from bs4 import BeautifulSoup 15 | from fake_useragent import UserAgent 16 | from mtranslate import translate 17 | 18 | from extract_content import get_content, get_title 19 | 20 | logger = logging.getLogger(__name__) 21 | 22 | 23 | def hash_string(s: str): 24 | return hashlib.md5(s.encode('utf-8')).hexdigest() 25 | 26 | 27 | def parallel_function(f, sequence, num_threads=None): 28 | from multiprocessing import Pool 29 | pool = Pool(processes=num_threads) 30 | result = pool.map(f, sequence) 31 | cleaned = [x for x in result if x is not None] 32 | pool.close() 33 | pool.join() 34 | return cleaned 35 | 36 | 37 | class URL: 38 | 39 | def __init__(self, language='ja'): # ja, cn... 40 | self.num_calls = 0 41 | 42 | if language == 'ja': 43 | self.google_news_url = 'https://www.google.co.jp/search?q={}&hl=ja&source=lnt&' \ 44 | 'tbs=cdr%3A1%2Ccd_min%3A{}%2Ccd_max%3A{}&tbm=nws&start={}' 45 | elif language == 'cn': 46 | self.google_news_url = 'https://www.google.com.hk/search?q={}&source=lnt&' \ 47 | 'tbs=cdr%3A1%2Ccd_min%3A{}%2Ccd_max%3A{}&tbm=nws&start={}' 48 | else: 49 | raise Exception('Unknown language. Only [ja] and [cn] are supported.') 50 | 51 | def create(self, q, start, year_start, year_end): 52 | self.num_calls += 1 53 | return self.google_news_url.format(q.replace(' ', '+'), str(year_start), str(year_end), start) 54 | 55 | 56 | def extract_links(content): 57 | soup = BeautifulSoup(content.decode('utf8'), 'lxml') 58 | blocks = [a for a in soup.find_all('div', {'class': ['dbsr']})] 59 | links_list = [(b.find('a').attrs['href'], b.find('div', {'role': 'heading'}).text) for b in blocks] 60 | dates_list = [b.find('span', {'class': 'WG9SHc'}).text for b in blocks] 61 | assert len(links_list) == len(dates_list) 62 | output = [{'link': l[0], 'title': l[1], 'date': d} for (l, d) in zip(links_list, dates_list)] 63 | return output 64 | 65 | 66 | def google_news_run(keyword, language='ja', limit=10, year_start=2010, year_end=2011, sleep_time_every_ten_articles=0): 67 | num_articles_index = 0 68 | ua = UserAgent() 69 | uf = URL(language) 70 | result = [] 71 | while num_articles_index < limit: 72 | url = uf.create(keyword, num_articles_index, year_start, year_end) 73 | if num_articles_index > 0: 74 | logger.info('[Google News] Fetched %s articles for keyword [%s]. Limit is %s.' % 75 | (num_articles_index, keyword, limit)) 76 | logger.info('[Google News] %s.' % url) 77 | headers = {'User-Agent': ua.chrome} 78 | try: 79 | response = requests.get(url, headers=headers, timeout=20) 80 | links = extract_links(response.content) 81 | nb_links = len(links) 82 | if nb_links == 0 and num_articles_index == 0: 83 | raise Exception( 84 | 'No results fetched. Either the keyword is wrong ' 85 | 'or you have been banned from Google. 
Retry tomorrow ' 86 | 'or change your IP address.') 87 | 88 | if nb_links == 0: 89 | print('No more news to read for keyword {}.'.format(keyword)) 90 | break 91 | 92 | for i in range(nb_links): 93 | cur_link = links[i] 94 | logger.debug('|- {} ({})'.format(cur_link['title'], cur_link['date'])) 95 | result.extend(links) 96 | except requests.exceptions.Timeout: 97 | logger.warning('Google News timeout. Maybe the connection is too slow. Skipping.') 98 | pass 99 | num_articles_index += 10 100 | logger.debug('Program is going to sleep for {} seconds.'.format(sleep_time_every_ten_articles)) 101 | time.sleep(sleep_time_every_ten_articles) 102 | return result 103 | 104 | 105 | def mkdir_p(path): 106 | try: 107 | os.makedirs(path) 108 | except OSError as exc: 109 | if exc.errno == errno.EEXIST and os.path.isdir(path): 110 | pass 111 | else: 112 | raise 113 | 114 | 115 | def get_keywords(language): # ja, cn... 116 | keyword_url = 'http://www.generalecommerce.com/clients/broadcastnews_tv/category_list_js.html' 117 | logger.info('No keywords specified. Will randomly select some keywords from %s.' % keyword_url) 118 | response = requests.get(keyword_url, timeout=20) 119 | assert response.status_code == 200 120 | soup = BeautifulSoup(response.content, 'lxml') 121 | keywords = [l.replace('news', '').strip() for l in 122 | set([v.text for v in soup.find_all('td', {'class': 'devtableitem'}) if 'http' not in v.text])] 123 | assert len(keywords) > 0 124 | 125 | logger.info('Found %s keywords.' % len(keywords)) 126 | random.shuffle(keywords) 127 | for keyword in keywords: 128 | japanese_keyword = translate(keyword, language) 129 | logger.info('[Google Translate] {} -> {}'.format(keyword, japanese_keyword)) 130 | if re.search('[a-zA-Z]', japanese_keyword): # we don't want that: Fed watch -> Fed時計 131 | logger.info('Discarding keyword.') 132 | continue 133 | yield japanese_keyword 134 | 135 | 136 | def run(keywords: list = None, language='ja', limit=50, retrieve_content_behind_links=False, num_threads=1): 137 | logger.info('[Google News] Output is in data/') 138 | if keywords is None: 139 | keywords = get_keywords(language) 140 | for keyword in keywords: 141 | # logger.info('[Google News] -> FETCHING NEWS FOR KEYWORD [{}].'.format(keyword)) 142 | download_links_and_contents(keyword, language=language, year_end=datetime.now().year, 143 | limit=limit, retrieve_content_behind_links=retrieve_content_behind_links, 144 | num_threads=num_threads) 145 | 146 | 147 | def download_links_and_contents(keyword, language='ja', year_start=2010, year_end=2019, 148 | limit=50, retrieve_content_behind_links=False, num_threads=1): 149 | tmp_news_folder = 'data/{}/{}/news'.format(language, keyword) 150 | mkdir_p(tmp_news_folder) 151 | 152 | tmp_link_folder = 'data/{}/{}'.format(language, keyword) 153 | mkdir_p(tmp_link_folder) 154 | 155 | json_file = '{}/{}_{}_{}_links.json'.format(tmp_link_folder, keyword, year_start, year_end) 156 | if os.path.isfile(json_file): 157 | logger.info('Google news links for keyword [{}] have been fetched already.'.format(keyword)) 158 | with open(json_file, encoding='utf8') as r: 159 | links = json.load(fp=r) 160 | logger.info('Found {} links.'.format(len(links))) 161 | else: 162 | links = google_news_run( 163 | keyword=keyword, 164 | language=language, 165 | limit=limit, 166 | year_start=year_start, 167 | year_end=year_end, 168 | sleep_time_every_ten_articles=10 169 | ) 170 | logger.info('Dumping links to %s.'
% json_file) 171 | with open(json_file, 'w', encoding='utf8') as w: 172 | json.dump(fp=w, obj=links, indent=2, ensure_ascii=False) 173 | if retrieve_content_behind_links: 174 | retrieve_data_from_links(links, tmp_news_folder, num_threads) 175 | 176 | 177 | def retrieve_data_for_link(param): 178 | logger.debug('retrieve_data_for_link - param = {}'.format(param)) 179 | (full_link, tmp_news_folder) = param 180 | link = full_link['link'] 181 | google_title = full_link['title'] 182 | link_datetime = full_link['date'] 183 | os.environ['PYTHONHASHSEED'] = '0' 184 | compliant_filename_for_link = hash_string(link) # just a hash number. 185 | json_file = '{}/{}.json'.format(tmp_news_folder, compliant_filename_for_link) 186 | already_fetched = os.path.isfile(json_file) 187 | if not already_fetched: 188 | try: 189 | html = download_html_from_link(link) # .decode('utf8', errors='ignore') 190 | soup = BeautifulSoup(html, 'lxml') 191 | content = get_content(soup) 192 | if len(content) == 0: 193 | html = html.decode('utf8', errors='ignore') 194 | soup = BeautifulSoup(html, 'lxml') 195 | content = get_content(soup) 196 | content = content.strip() 197 | full_title = complete_title(soup, google_title) 198 | article = { 199 | 'link': link, 200 | 'title': full_title, 201 | 'content': content, 202 | 'date': link_datetime 203 | } 204 | logger.info('Dumping content to %s.' % json_file) 205 | with open(json_file, 'w', encoding='utf8') as w: 206 | json.dump(fp=w, obj=article, indent=2, ensure_ascii=False) 207 | except Exception as e: 208 | logger.error(e) 209 | logger.error('ERROR - could not download article with link {}'.format(link)) 210 | pass 211 | 212 | 213 | def retrieve_data_from_links(full_links, tmp_news_folder, num_threads): 214 | if num_threads > 1: 215 | inputs = [(full_link, tmp_news_folder) for full_link in full_links] 216 | parallel_function(retrieve_data_for_link, inputs, num_threads) 217 | else: 218 | for full_link in full_links: 219 | retrieve_data_for_link((full_link, tmp_news_folder)) 220 | 221 | 222 | def download_html_from_link(link, params=None, fail_on_error=True): 223 | try: 224 | # logger.info('Get -> {} '.format(link)) 225 | response = requests.get(link, params, timeout=20) 226 | if fail_on_error and response.status_code != 200: 227 | raise Exception('Response code is not [200]. Got: {}'.format(response.status_code)) 228 | else: 229 | pass 230 | # logger.info('Download successful [OK]') 231 | return response.content 232 | except: 233 | if fail_on_error: 234 | raise 235 | return None 236 | 237 | 238 | def update_title(soup, google_article_title): 239 | fail_to_update = False 240 | if '...' not in google_article_title: 241 | # we did not fail because the google title was already valid. 242 | return google_article_title, fail_to_update 243 | truncated_title = google_article_title[:-4] # remove ' ...' at the end. 244 | title_list = [v.text for v in soup.find_all('h1') if len(v.text) > 0] 245 | for title in title_list: 246 | if truncated_title in title: 247 | # we succeeded here because we found the original title 248 | return title, fail_to_update 249 | fail_to_update = True 250 | return google_article_title, fail_to_update 251 | 252 | 253 | def complete_title(soup, google_article_title): 254 | # soup.contents (show without formatting).
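    # How the title completion works: Google News truncates long headlines and appends ' ...'.
    # update_title() strips that suffix and looks for an <h1> on the article page whose text
    # contains the truncated prefix; if one is found, that full headline replaces the Google title.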
255 | full_title, fail_to_update = update_title(soup, google_article_title) 256 | if full_title != google_article_title: 257 | logger.debug('Updated title: old is [{}], new is [{}]'.format(google_article_title, full_title)) 258 | else: 259 | if fail_to_update: 260 | logger.debug('Could not update title with Google truncated title trick.') 261 | # full_title = get_title(soup) 262 | # logger.info('Found it anyway here [{}]'.format(full_title)) 263 | else: 264 | logger.debug('Nothing to do for title [{}]'.format(full_title)) 265 | return full_title.strip() 266 | -------------------------------------------------------------------------------- /extract_content.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | def get_content(soup): 5 | """Retrieves contents of the article""" 6 | # heuristics 7 | div_tags = soup.find_all('div', id='articleContentBody') 8 | div_tags_2 = soup.find_all('div', class_='ArticleText') 9 | div_tags_3 = soup.find_all('div', id='ArticleText') 10 | div3 = soup.find_all('div', id='article_content') 11 | div4 = soup.find_all('div', class_='articleBodyText') 12 | div5 = soup.find_all('div', class_='story-container') 13 | div_tags_l = soup.find_all('div', id=re.compile('article')) 14 | div6 = soup.find_all('div', class_='kizi-honbun') 15 | div7 = soup.find_all('div', class_='main-text') 16 | rest = soup.find_all(id='articleText') 17 | 18 | if div_tags: 19 | return collect_content(div_tags) 20 | elif div_tags_2: 21 | return collect_content(div_tags_2) 22 | elif div_tags_3: 23 | return collect_content(div_tags_3) 24 | elif div3: 25 | return collect_content(div3) 26 | elif div4: 27 | return collect_content(div4) 28 | elif div5: 29 | return collect_content(div5) 30 | elif div_tags_l and len(collect_content(div_tags_l)) > 0: 31 | return collect_content(div_tags_l) 32 | elif div6: 33 | return collect_content(div6) 34 | elif div7: 35 | return collect_content(div7) 36 | elif rest: 37 | return collect_content(rest) 38 | else: 39 | # contingency 40 | c_list = [v.text for v in soup.find_all('p') if len(v.text) > 0] 41 | words_to_bans = ['<', 'javascript'] 42 | for word_to_ban in words_to_bans: 43 | c_list = list(filter(lambda x: word_to_ban not in x.lower(), c_list)) 44 | clean_html_ratio_letters_length = 0.33 45 | c_list = [t for t in c_list if 46 | len(re.findall('[a-z]', t.lower())) / ( 47 | len(t) + 1) < clean_html_ratio_letters_length] 48 | content = ' '.join(c_list) 49 | content = content.replace('\n', ' ') 50 | content = re.sub(r'\s\s+', ' ', content) # remove multiple spaces. 51 | return content 52 | 53 | 54 | def collect_content(parent_tag): 55 | """Collects all text from children p tags of parent_tag""" 56 | content = '' 57 | for tag in parent_tag: 58 | p_tags = tag.find_all('p') 59 | for p_tag in p_tags: 60 | content += p_tag.text + '\n' 61 | return content 62 | 63 | 64 | def get_title(soup): 65 | """Retrieves Title of Article.
Use Google truncated title trick instead.""" 66 | # Heuristics 67 | div_tags = soup.find_all('div', class_='Title') 68 | article_headline_tags = soup.find_all('h1', class_='article-headline') 69 | headline_tags = soup.find_all('h2', id='main_title') 70 | hl = soup.find_all(class_='Title') 71 | all_h1_tags = soup.find_all('h1') 72 | title_match = soup.find_all(class_=re.compile('title')) 73 | Title_match = soup.find_all(class_=re.compile('Title')) 74 | headline_match = soup.find_all(class_=re.compile('headline')) 75 | 76 | item_prop_hl = soup.find_all(itemprop='headline') 77 | if item_prop_hl: 78 | return item_prop_hl[0].text 79 | 80 | if div_tags: 81 | for tag in div_tags: 82 | h1_tags = tag.find_all('h1') 83 | for h1_tag in h1_tags: 84 | if h1_tag.text: 85 | return h1_tag.text 86 | 87 | elif article_headline_tags: 88 | for tag in article_headline_tags: 89 | return tag.text 90 | elif headline_tags: 91 | for tag in headline_tags: 92 | return tag.text 93 | elif headline_match: 94 | return headline_match[0].text 95 | elif all_h1_tags: 96 | return all_h1_tags[0].text 97 | elif hl: 98 | return hl[0].text 99 | else: 100 | if title_match: 101 | return title_match[0].text 102 | elif Title_match: 103 | return Title_match[0].text 104 | else: 105 | return "" 106 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import sys 3 | 4 | from expressvpn import wrapper 5 | 6 | from core import run 7 | 8 | 9 | def get_new_ip(): 10 | while True: 11 | try: 12 | print('GETTING NEW IP') 13 | wrapper.random_connect() 14 | print('SUCCESS') 15 | return 16 | except: 17 | pass 18 | 19 | 20 | if __name__ == '__main__': 21 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', stream=sys.stdout) 22 | # https://news.google.co.jp 23 | # https://news.google.com/?output=rss&hl=fr 24 | # RSS Feed does not work for Japanese language. 25 | # get_articles('プロクター・アンド・ギャンブル') 26 | 27 | while True: 28 | try: 29 | run() 30 | except: 31 | print('EXCEPTION CAUGHT in __MAIN__') 32 | print("Let's change our PUBLIC IP!") 33 | get_new_ip() 34 | -------------------------------------------------------------------------------- /main_no_vpn.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import sys 3 | from argparse import ArgumentParser 4 | 5 | from core import run 6 | 7 | 8 | def get_script_arguments(): 9 | args = ArgumentParser() 10 | args.add_argument('--keywords', default=None, type=str) 11 | args.add_argument('--language', default='ja') 12 | args.add_argument('--retrieve_content_behind_links', action='store_true') 13 | args.add_argument('--limit_num_links_per_keyword', default=50, type=int) 14 | args.add_argument('--num_threads', default=1, type=int) 15 | return args.parse_args() 16 | 17 | 18 | def main(): 19 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', stream=sys.stdout) 20 | # https://news.google.com/?output=rss&hl=fr 21 | # RSS Feed does not work for Japanese/Chinese language.
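    # Example invocation (the keyword values below are purely illustrative):
    #   python main_no_vpn.py --keywords 金融,経済 --language ja \
    #       --limit_num_links_per_keyword 100 --retrieve_content_behind_links --num_threads 4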
22 | args = get_script_arguments() 23 | keywords = args.keywords.split(',') if args.keywords is not None else None 24 | run( 25 | keywords=keywords, 26 | language=args.language, 27 | limit=args.limit_num_links_per_keyword, 28 | retrieve_content_behind_links=args.retrieve_content_behind_links, 29 | num_threads=args.num_threads 30 | ) 31 | 32 | 33 | if __name__ == '__main__': 34 | main() 35 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | fake_useragent 2 | mtranslate 3 | requests 4 | beautifulsoup4 5 | gnp 6 | unicode_slugify 7 | numpy 8 | lxml --------------------------------------------------------------------------------
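The scraper can also be driven directly from Python once the requirements are installed, instead of going through `main_no_vpn.py`. A minimal sketch (the keywords below are only illustrative; `run` is the function defined in `core.py`):

```python
import logging
import sys

from core import run

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', stream=sys.stdout)

# Fetch up to 20 Google News links per keyword and download the article bodies with 4 workers.
run(
    keywords=['円相場', '日銀'],  # illustrative Japanese keywords
    language='ja',
    limit=20,
    retrieve_content_behind_links=True,
    num_threads=4,
)
```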