├── .github └── FUNDING.yml ├── .gitignore ├── LICENSE ├── README.md ├── core.py ├── extract_content.py ├── main.py ├── main_no_vpn.py └── requirements.txt /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: [philipperemy] 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | 3 | .DS_Store 4 | .idea 5 | 6 | nohup.out 7 | data/ 8 | data* 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | env/ 19 | build/ 20 | develop-eggs/ 21 | dist/ 22 | downloads/ 23 | eggs/ 24 | .eggs/ 25 | lib/ 26 | lib64/ 27 | parts/ 28 | sdist/ 29 | var/ 30 | wheels/ 31 | *.egg-info/ 32 | .installed.cfg 33 | *.egg 34 | 35 | # PyInstaller 36 | # Usually these files are written by a python script from a template 37 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 38 | *.manifest 39 | *.spec 40 | 41 | # Installer logs 42 | pip-log.txt 43 | pip-delete-this-directory.txt 44 | 45 | # Unit test / coverage reports 46 | htmlcov/ 47 | .tox/ 48 | .coverage 49 | .coverage.* 50 | .cache 51 | nosetests.xml 52 | coverage.xml 53 | *,cover 54 | .hypothesis/ 55 | 56 | # Translations 57 | *.mo 58 | *.pot 59 | 60 | # Django stuff: 61 | *.log 62 | local_settings.py 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # pyenv 81 | .python-version 82 | 83 | # celery beat schedule file 84 | celerybeat-schedule 85 | 86 | # dotenv 87 | .env 88 | 89 | # virtualenv 90 | .venv 91 | venv/ 92 | ENV/ 93 | 94 | # Spyder project settings 95 | .spyderproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Philippe Rémy 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Google News Scraper - Japanese and Chinese supported 2 | 3 | For English articles, Google has an RSS feed that you can use directly. [Click here for English](https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en). 4 | 5 | Each scraped article has the following fields: 6 | - **title**: Title of the article 7 | - **datetime**: Publication date 8 | - **content**: Full content (text format) - best effort 9 | - **link**: URL where the article was published 10 | - **keyword**: Google News keyword used to find this article 11 | 12 | ## How many articles can I fetch with this scraper? 13 | 14 | There is no hard upper bound, but expect roughly **`100,000 articles per day`** when scraping 24/7 with a VPN enabled. 15 | 16 | ## How to get started? 17 | ```bash 18 | git clone git@github.com:philipperemy/google-news-scraper.git && cd google-news-scraper 19 | virtualenv -p python3 venv && source venv/bin/activate # optional but recommended! 20 | pip install -r requirements.txt 21 | python main_no_vpn.py --keywords hello,toto --language ja # for VPN support, scroll down! 22 | ``` 23 | 24 | ## Output example 25 | 26 | `Article 1` 27 | ``` 28 | { 29 | "content": "(本文中の野村証券 [...] 生命経済研の熊野英生氏は指摘。  記事の全文 \n保護主義を根拠とする円高説を信じ込むのは禁物であり、実際は米貿易赤字縮小と円安が進むかもしれないとBBHの村田雅志氏は指摘。  記事の全文 \n", 30 | "datetime": "2015/11/03", 31 | "keyword": "米国の銀行業務", 32 | "link": "http://jp.reuters.com/article/idJPL3N12Y5QX20151104", 33 | "title": "再送-インタビュー:運用高度化、PEやハイイールド債増やす=長門・ゆうちょ銀社長" 34 | } 35 | ``` 36 | 37 | `Article 2` 38 | 39 | ``` 40 | { 41 | "content": "記事保存 有料会員の方のみご利用になれます。[...] 詳しくは、こちら 電子版トップ速報トップ アルゼンチン、ドル、通貨ペソ、外貨取引 来春の新入社員を募集 記者など4職種 【週末新紙面】宅配+電子版お試し実施中! 天気 プレスリリース検索 アカウント一覧 訂正・おわび", 42 | "datetime": "2015/12/17", 43 | "keyword": "アルゼンチン", 44 | "link": "http://www.nikkei.com/article/DGXLASGM18H1B_Y5A211C1EAF000/", 45 | "title": "アルゼンチンの通貨ペソ、大幅下落 対ドルで36%安" 46 | } 47 | ``` 48 | **NOTE**: The `content` field is truncated in the examples above. 49 | 50 | ## VPN 51 | Scraping Google News usually results in a ban for a few hours. Using a VPN that can fetch a new IP on demand is a way to work around this. 52 | 53 | In my case, I subscribed to this VPN: [https://www.expressvpn.com/](https://www.expressvpn.com/). 54 | 55 | I provide a Python binding for this VPN here: [https://github.com/philipperemy/expressvpn-python](https://github.com/philipperemy/expressvpn-python). 56 | 57 | Also make sure that: 58 | - you can run `expressvpn` in your terminal. 59 | - ExpressVPN is properly configured: 60 | - [https://www.expressvpn.com/setup](https://www.expressvpn.com/setup) 61 | - [https://www.expressvpn.com/support/vpn-setup/app-for-linux/#download](https://www.expressvpn.com/support/vpn-setup/app-for-linux/#download) 62 | - you get `expressvpn-python (x.y)`, where `x.y` is the version, when you run `pip list | grep "expressvpn-python"`. 63 | 64 | Every time the script detects that Google has banned you, it asks the VPN for a fresh IP and resumes. 65 | 66 | ## Questions/Answers 67 | - Why didn't you use the RSS feed provided by Google News? It does not exist for Japanese! 68 | - What is the best way to use this scraper? If you want to scrape a lot of data, I highly recommend subscribing to a VPN, preferably ExpressVPN (I implemented the VPN wrapper and its integration with this scraper).
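When `--retrieve_content_behind_links` is enabled, each article is saved as a JSON file under `data/<language>/<keyword>/news/` (see `download_links_and_contents` and `retrieve_data_for_link` in `core.py`). The snippet below is a minimal sketch, not part of the scraper itself, for loading everything scraped so far; it only assumes that directory layout.

```python
import glob
import json

# Gather every article JSON written by the scraper: data/<language>/<keyword>/news/<hash>.json
articles = []
for path in glob.glob('data/*/*/news/*.json'):
    with open(path, encoding='utf8') as f:
        articles.append(json.load(f))

print('Loaded {} articles.'.format(len(articles)))
for article in articles[:3]:
    # Print whichever fields are present (field names follow what core.py writes).
    print(article.get('date'), '-', article.get('title'))
```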
69 | -------------------------------------------------------------------------------- /core.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import errno 4 | import hashlib 5 | import json 6 | import logging 7 | import os 8 | import random 9 | import re 10 | import time 11 | from datetime import datetime 12 | 13 | import requests 14 | from bs4 import BeautifulSoup 15 | from fake_useragent import UserAgent 16 | from mtranslate import translate 17 | 18 | from extract_content import get_content, get_title 19 | 20 | logger = logging.getLogger(__name__) 21 | 22 | 23 | def hash_string(s: str): 24 | return hashlib.md5(s.encode('utf-8')).hexdigest() 25 | 26 | 27 | def parallel_function(f, sequence, num_threads=None): 28 | from multiprocessing import Pool 29 | pool = Pool(processes=num_threads) 30 | result = pool.map(f, sequence) 31 | cleaned = [x for x in result if x is not None] 32 | pool.close() 33 | pool.join() 34 | return cleaned 35 | 36 | 37 | class URL: 38 | 39 | def __init__(self, language='ja'): # ja, cn... 40 | self.num_calls = 0 41 | 42 | if language == 'ja': 43 | self.google_news_url = 'https://www.google.co.jp/search?q={}&hl=ja&source=lnt&' \ 44 | 'tbs=cdr%3A1%2Ccd_min%3A{}%2Ccd_max%3A{}&tbm=nws&start={}' 45 | elif language == 'cn': 46 | self.google_news_url = 'https://www.google.com.hk/search?q={}&source=lnt&' \ 47 | 'tbs=cdr%3A1%2Ccd_min%3A{}%2Ccd_max%3A{}&tbm=nws&start={}' 48 | else: 49 | raise Exception('Unknown language. Only [ja] and [cn] are supported.') 50 | 51 | def create(self, q, start, year_start, year_end): 52 | self.num_calls += 1 53 | return self.google_news_url.format(q.replace(' ', '+'), str(year_start), str(year_end), start) 54 | 55 | 56 | def extract_links(content): 57 | soup = BeautifulSoup(content.decode('utf8'), 'lxml') 58 | blocks = [a for a in soup.find_all('div', {'class': ['dbsr']})] 59 | links_list = [(b.find('a').attrs['href'], b.find('div', {'role': 'heading'}).text) for b in blocks] 60 | dates_list = [b.find('span', {'class': 'WG9SHc'}).text for b in blocks] 61 | assert len(links_list) == len(dates_list) 62 | output = [{'link': l[0], 'title': l[1], 'date': d} for (l, d) in zip(links_list, dates_list)] 63 | return output 64 | 65 | 66 | def google_news_run(keyword, language='ja', limit=10, year_start=2010, year_end=2011, sleep_time_every_ten_articles=0): 67 | num_articles_index = 0 68 | ua = UserAgent() 69 | uf = URL(language) 70 | result = [] 71 | while num_articles_index < limit: 72 | url = uf.create(keyword, num_articles_index, year_start, year_end) 73 | if num_articles_index > 0: 74 | logger.info('[Google News] Fetched %s articles for keyword [%s]. Limit is %s.' % 75 | (num_articles_index, keyword, limit)) 76 | logger.info('[Google News] %s.' % url) 77 | headers = {'User-Agent': ua.chrome} 78 | try: 79 | response = requests.get(url, headers=headers, timeout=20) 80 | links = extract_links(response.content) 81 | nb_links = len(links) 82 | if nb_links == 0 and num_articles_index == 0: 83 | raise Exception( 84 | 'No results fetched. Either the keyword is wrong ' 85 | 'or you have been banned from Google. 
Retry tomorrow ' 86 | 'or change your IP address.') 87 | 88 | if nb_links == 0: 89 | print('No more news to read for keyword {}.'.format(keyword)) 90 | break 91 | 92 | for i in range(nb_links): 93 | cur_link = links[i] 94 | logger.debug('|- {} ({})'.format(cur_link['title'], cur_link['date'])) 95 | result.extend(links) 96 | except requests.exceptions.Timeout: 97 | logger.warning('Google News timeout. Maybe the connection is too slow. Skipping.') 98 | pass 99 | num_articles_index += 10 100 | logger.debug('Program is going to sleep for {} seconds.'.format(sleep_time_every_ten_articles)) 101 | time.sleep(sleep_time_every_ten_articles) 102 | return result 103 | 104 | 105 | def mkdir_p(path): 106 | try: 107 | os.makedirs(path) 108 | except OSError as exc: 109 | if exc.errno == errno.EEXIST and os.path.isdir(path): 110 | pass 111 | else: 112 | raise 113 | 114 | 115 | def get_keywords(language): # ja, cn... 116 | keyword_url = 'http://www.generalecommerce.com/clients/broadcastnews_tv/category_list_js.html' 117 | logger.info('No keywords specified. Will randomly select some keywords from %s.' % keyword_url) 118 | response = requests.get(keyword_url, timeout=20) 119 | assert response.status_code == 200 120 | soup = BeautifulSoup(response.content, 'lxml') 121 | keywords = [l.replace('news', '').strip() for l in 122 | set([v.text for v in soup.find_all('td', {'class': 'devtableitem'}) if 'http' not in v.text])] 123 | assert len(keywords) > 0 124 | 125 | logger.info('Found %s keywords.' % len(keywords)) 126 | random.shuffle(keywords) 127 | for keyword in keywords: 128 | japanese_keyword = translate(keyword, language) 129 | logger.info('[Google Translate] {} -> {}'.format(keyword, japanese_keyword)) 130 | if re.search('[a-zA-Z]', japanese_keyword): # we don't want that: Fed watch -> Fed時計 131 | logger.info('Discarding keyword.') 132 | continue 133 | yield japanese_keyword 134 | 135 | 136 | def run(keywords: list = None, language='ja', limit=50, retrieve_content_behind_links=False, num_threads=1): 137 | logger.info('[Google News] Output is in data/') 138 | if keywords is None: 139 | keywords = get_keywords(language) 140 | for keyword in keywords: 141 | # logger.info('[Google News] -> FETCHING NEWS FOR KEYWORD [{}].'.format(keyword)) 142 | download_links_and_contents(keyword, language=language, year_end=datetime.now().year, 143 | limit=limit, retrieve_content_behind_links=retrieve_content_behind_links, 144 | num_threads=num_threads) 145 | 146 | 147 | def download_links_and_contents(keyword, language='ja', year_start=2010, year_end=2019, 148 | limit=50, retrieve_content_behind_links=False, num_threads=1): 149 | tmp_news_folder = 'data/{}/{}/news'.format(language, keyword) 150 | mkdir_p(tmp_news_folder) 151 | 152 | tmp_link_folder = 'data/{}/{}'.format(language, keyword) 153 | mkdir_p(tmp_link_folder) 154 | 155 | json_file = '{}/{}_{}_{}_links.json'.format(tmp_link_folder, keyword, year_start, year_end) 156 | if os.path.isfile(json_file): 157 | logger.info('Google news links for keyword [{}] have been fetched already.'.format(keyword)) 158 | with open(json_file, encoding='utf8') as r: 159 | links = json.load(fp=r) 160 | logger.info('Found {} links.'.format(len(links))) 161 | else: 162 | links = google_news_run( 163 | keyword=keyword, 164 | language=language, 165 | limit=limit, 166 | year_start=year_start, 167 | year_end=year_end, 168 | sleep_time_every_ten_articles=10 169 | ) 170 | logger.info('Dumping links to %s.'
% json_file) 171 | with open(json_file, 'w', encoding='utf8') as w: 172 | json.dump(fp=w, obj=links, indent=2, ensure_ascii=False) 173 | if retrieve_content_behind_links: 174 | retrieve_data_from_links(links, tmp_news_folder, num_threads) 175 | 176 | 177 | def retrieve_data_for_link(param): 178 | logger.debug('retrieve_data_for_link - param = {}'.format(param)) 179 | (full_link, tmp_news_folder) = param 180 | link = full_link['link'] 181 | google_title = full_link['title'] 182 | link_datetime = full_link['date'] 183 | os.environ['PYTHONHASHSEED'] = '0' 184 | compliant_filename_for_link = hash_string(link) # just a hash number. 185 | json_file = '{}/{}.json'.format(tmp_news_folder, compliant_filename_for_link) 186 | already_fetched = os.path.isfile(json_file) 187 | if not already_fetched: 188 | try: 189 | html = download_html_from_link(link) # .decode('utf8', errors='ignore') 190 | soup = BeautifulSoup(html, 'lxml') 191 | content = get_content(soup) 192 | if len(content) == 0: 193 | html = html.decode('utf8', errors='ignore') 194 | soup = BeautifulSoup(html, 'lxml') 195 | content = get_content(soup) 196 | content = content.strip() 197 | full_title = complete_title(soup, google_title) 198 | article = { 199 | 'link': link, 200 | 'title': full_title, 201 | 'content': content, 202 | 'date': link_datetime 203 | } 204 | logger.info('Dumping content to %s.' % json_file) 205 | with open(json_file, 'w', encoding='utf8') as w: 206 | json.dump(fp=w, obj=article, indent=2, ensure_ascii=False) 207 | except Exception as e: 208 | logger.error(e) 209 | logger.error('ERROR - could not download article with link {}'.format(link)) 210 | pass 211 | 212 | 213 | def retrieve_data_from_links(full_links, tmp_news_folder, num_threads): 214 | if num_threads > 1: 215 | inputs = [(full_link, tmp_news_folder) for full_link in full_links] 216 | parallel_function(retrieve_data_for_link, inputs, num_threads) 217 | else: 218 | for full_link in full_links: 219 | retrieve_data_for_link((full_link, tmp_news_folder)) 220 | 221 | 222 | def download_html_from_link(link, params=None, fail_on_error=True): 223 | try: 224 | # logger.info('Get -> {} '.format(link)) 225 | response = requests.get(link, params, timeout=20) 226 | if fail_on_error and response.status_code != 200: 227 | raise Exception('Response code is not [200]. Got: {}'.format(response.status_code)) 228 | else: 229 | pass 230 | # logger.info('Download successful [OK]') 231 | return response.content 232 | except: 233 | if fail_on_error: 234 | raise 235 | return None 236 | 237 | 238 | def update_title(soup, google_article_title): 239 | fail_to_update = False 240 | if '...' not in google_article_title: 241 | # we did not fail because the google title was already valid. 242 | return google_article_title, fail_to_update 243 | truncated_title = google_article_title[:-4] # remove ' ...' at the end. 244 | title_list = [v.text for v in soup.find_all('h1') if len(v.text) > 0] 245 | for title in title_list: 246 | if truncated_title in title: 247 | # we succeeded here because we found the original title 248 | return title, fail_to_update 249 | fail_to_update = True 250 | return google_article_title, fail_to_update 251 | 252 | 253 | def complete_title(soup, google_article_title): 254 | # soup.contents (show without formatting).
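    # How the title completion works: Google News truncates long headlines and appends ' ...'.
    # update_title() strips that suffix and looks for an <h1> on the article page whose text
    # contains the truncated prefix; if one is found, that full headline replaces the Google title.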
255 | full_title, fail_to_update = update_title(soup, google_article_title) 256 | if full_title != google_article_title: 257 | logger.debug('Updated title: old is [{}], new is [{}]'.format(google_article_title, full_title)) 258 | else: 259 | if fail_to_update: 260 | logger.debug('Could not update title with Google truncated title trick.') 261 | # full_title = get_title(soup) 262 | # logger.info('Found it anyway here [{}]'.format(full_title)) 263 | else: 264 | logger.debug('Nothing to do for title [{}]'.format(full_title)) 265 | return full_title.strip() 266 | -------------------------------------------------------------------------------- /extract_content.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | def get_content(soup): 5 | """Retrieves contents of the article""" 6 | # heuristics 7 | div_tags = soup.find_all('div', id='articleContentBody') 8 | div_tags_2 = soup.find_all('div', class_='ArticleText') 9 | div_tags_3 = soup.find_all('div', id='ArticleText') 10 | div3 = soup.find_all('div', id='article_content') 11 | div4 = soup.find_all('div', class_='articleBodyText') 12 | div5 = soup.find_all('div', class_='story-container') 13 | div_tags_l = soup.find_all('div', id=re.compile('article')) 14 | div6 = soup.find_all('div', class_='kizi-honbun') 15 | div7 = soup.find_all('div', class_='main-text') 16 | rest = soup.find_all(id='articleText') 17 | 18 | if div_tags: 19 | return collect_content(div_tags) 20 | elif div_tags_2: 21 | return collect_content(div_tags_2) 22 | elif div_tags_3: 23 | return collect_content(div_tags_3) 24 | elif div3: 25 | return collect_content(div3) 26 | elif div4: 27 | return collect_content(div4) 28 | elif div5: 29 | return collect_content(div5) 30 | elif div_tags_l and len(collect_content(div_tags_l)) > 0: 31 | return collect_content(div_tags_l) 32 | elif div6: 33 | return collect_content(div6) 34 | elif div7: 35 | return collect_content(div7) 36 | elif rest: 37 | return collect_content(rest) 38 | else: 39 | # contingency 40 | c_list = [v.text for v in soup.find_all('p') if len(v.text) > 0] 41 | words_to_bans = ['<', 'javascript'] 42 | for word_to_ban in words_to_bans: 43 | c_list = list(filter(lambda x: word_to_ban not in x.lower(), c_list)) 44 | clean_html_ratio_letters_length = 0.33 45 | c_list = [t for t in c_list if 46 | len(re.findall('[a-z]', t.lower())) / ( 47 | len(t) + 1) < clean_html_ratio_letters_length] 48 | content = ' '.join(c_list) 49 | content = content.replace('\n', ' ') 50 | content = re.sub(r'\s\s+', ' ', content) # remove multiple spaces. 51 | return content 52 | 53 | 54 | def collect_content(parent_tag): 55 | """Collects all text from children p tags of parent_tag""" 56 | content = '' 57 | for tag in parent_tag: 58 | p_tags = tag.find_all('p') 59 | for p_tag in p_tags: 60 | content += p_tag.text + '\n' 61 | return content 62 | 63 | 64 | def get_title(soup): 65 | """Retrieves Title of Article.
Use Google truncated title trick instead.""" 66 | # Heuristics 67 | div_tags = soup.find_all('div', class_='Title') 68 | article_headline_tags = soup.find_all('h1', class_='article-headline') 69 | headline_tags = soup.find_all('h2', id='main_title') 70 | hl = soup.find_all(class_='Title') 71 | all_h1_tags = soup.find_all('h1') 72 | title_match = soup.find_all(class_=re.compile('title')) 73 | Title_match = soup.find_all(class_=re.compile('Title')) 74 | headline_match = soup.find_all(class_=re.compile('headline')) 75 | 76 | item_prop_hl = soup.find_all(itemprop='headline') 77 | if item_prop_hl: 78 | return item_prop_hl[0].text 79 | 80 | if div_tags: 81 | for tag in div_tags: 82 | h1_tags = tag.find_all('h1') 83 | for h1_tag in h1_tags: 84 | if h1_tag.text: 85 | return h1_tag.text 86 | 87 | elif article_headline_tags: 88 | for tag in article_headline_tags: 89 | return tag.text 90 | elif headline_tags: 91 | for tag in headline_tags: 92 | return tag.text 93 | elif headline_match: 94 | return headline_match[0].text 95 | elif all_h1_tags: 96 | return all_h1_tags[0].text 97 | elif hl: 98 | return hl[0].text 99 | else: 100 | if title_match: 101 | return title_match[0].text 102 | elif Title_match: 103 | return Title_match[0].text 104 | else: 105 | return "" 106 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import sys 3 | 4 | from expressvpn import wrapper 5 | 6 | from core import run 7 | 8 | 9 | def get_new_ip(): 10 | while True: 11 | try: 12 | print('GETTING NEW IP') 13 | wrapper.random_connect() 14 | print('SUCCESS') 15 | return 16 | except: 17 | pass 18 | 19 | 20 | if __name__ == '__main__': 21 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', stream=sys.stdout) 22 | # https://news.google.co.jp 23 | # https://news.google.com/?output=rss&hl=fr 24 | # RSS Feed does not work for Japanese language. 25 | # get_articles('プロクター・アンド・ギャンブル') 26 | 27 | while True: 28 | try: 29 | run() 30 | except: 31 | print('EXCEPTION CAUGHT in __MAIN__') 32 | print("Let's change our PUBLIC IP!") 33 | get_new_ip() 34 | -------------------------------------------------------------------------------- /main_no_vpn.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import sys 3 | from argparse import ArgumentParser 4 | 5 | from core import run 6 | 7 | 8 | def get_script_arguments(): 9 | args = ArgumentParser() 10 | args.add_argument('--keywords', default=None, type=str) 11 | args.add_argument('--language', default='ja') 12 | args.add_argument('--retrieve_content_behind_links', action='store_true') 13 | args.add_argument('--limit_num_links_per_keyword', default=50, type=int) 14 | args.add_argument('--num_threads', default=1, type=int) 15 | return args.parse_args() 16 | 17 | 18 | def main(): 19 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', stream=sys.stdout) 20 | # https://news.google.com/?output=rss&hl=fr 21 | # RSS Feed does not work for Japanese/Chinese language.
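    # Example invocation (the keyword values below are purely illustrative):
    #   python main_no_vpn.py --keywords 金融,経済 --language ja \
    #       --limit_num_links_per_keyword 100 --retrieve_content_behind_links --num_threads 4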
22 | args = get_script_arguments() 23 | keywords = args.keywords.split(',') if args.keywords is not None else None 24 | run( 25 | keywords=keywords, 26 | language=args.language, 27 | limit=args.limit_num_links_per_keyword, 28 | retrieve_content_behind_links=args.retrieve_content_behind_links, 29 | num_threads=args.num_threads 30 | ) 31 | 32 | 33 | if __name__ == '__main__': 34 | main() 35 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | fake_useragent 2 | mtranslate 3 | requests 4 | beautifulsoup4 5 | gnp 6 | unicode_slugify 7 | numpy 8 | lxml --------------------------------------------------------------------------------
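The scraper can also be driven directly from Python once the requirements are installed, instead of going through `main_no_vpn.py`. A minimal sketch (the keywords below are only illustrative; `run` is the function defined in `core.py`):

```python
import logging
import sys

from core import run

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', stream=sys.stdout)

# Fetch up to 20 Google News links per keyword and download the article bodies with 4 workers.
run(
    keywords=['円相場', '日銀'],  # illustrative Japanese keywords
    language='ja',
    limit=20,
    retrieve_content_behind_links=True,
    num_threads=4,
)
```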