├── .gitignore ├── LICENSE ├── README.md ├── dist ├── KoreaNewsCrawler-1.12-py3-none-any.whl ├── KoreaNewsCrawler-1.2-py3-none-any.whl ├── KoreaNewsCrawler-1.20-py3-none-any.whl ├── KoreaNewsCrawler-1.30-py3-none-any.whl ├── KoreaNewsCrawler-1.41-py3-none-any.whl ├── KoreaNewsCrawler-1.50-py3-none-any.whl └── KoreaNewsCrawler-1.51-py3-none-any.whl ├── img ├── article_result.PNG ├── multi_process.PNG └── sport_resultimg.PNG ├── korea_news_crawler ├── __init__.py ├── articlecrawler.py ├── articleparser.py ├── exceptions.py ├── sample.py ├── sportcrawler.py ├── sports_crawler_sample.py └── writer.py ├── setup.cfg └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | __pycache__ 3 | build 4 | output 5 | korea_news_crawler/.idea 6 | korea_news_crawler/__pycache__ 7 | KoreaNewsCrawler.egg-info 8 | MENIFEST.in 9 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 lumyjuwon 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # KoreaNewsCrawler 2 | 3 | This crawler collects news articles that media organizations publish on the NAVER portal. 4 | Crawlable article categories are politics, economy, society, living culture, world, IT/science, and opinion. 5 | Sports categories include Korean baseball, world baseball, Korean football, world football, basketball, volleyball, general sports, and e-sports. 6 | 7 | ## How to install 8 | pip install KoreaNewsCrawler 9 | 10 | ## Method 11 | 12 | * **set_category(category_name, ...)** 13 | 14 | This method sets the categories you want to collect. 15 | The categories that can be passed are 'politics', 'economy', 'society', 'living_culture', 'IT_science', 'world', and 'opinion'; the Korean names (e.g. '정치', 'IT과학') are also accepted. 16 | You can pass multiple categories at once. 17 | category_name: politics, economy, society, living_culture, IT_science, world, opinion 18 | 19 | * **set_date_range(start_date, end_date)** 20 | 21 | This method sets the period of news to collect. ArticleCrawler takes both dates as strings in YYYY, YYYY-MM, or YYYY-MM-DD format; an omitted month or day defaults to the start of the range for start_date and to the end of the range for end_date. SportCrawler instead takes four integers: set_date_range(start_year, start_month, end_year, end_month).
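For example, here is a minimal sketch of the date formats ArticleCrawler accepts, based on the parsing in articlecrawler.py:

```
from korea_news_crawler.articlecrawler import ArticleCrawler

crawler = ArticleCrawler()
crawler.set_category("politics")

# "2018"       -> expands to 2018-01-01 as a start date or 2018-12-31 as an end date
# "2018-03"    -> expands to the first or last day of that month
# "2018-03-15" -> used as-is
crawler.set_date_range("2018-03", "2018-04-15")  # collects 2018-03-01 through 2018-04-15
crawler.start()
```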
22 | 23 | * **start()** 24 | 25 | This method starts the crawl; each selected category is collected in its own process. 26 | 27 | ## Article News Crawler Example 28 | ``` 29 | from korea_news_crawler.articlecrawler import ArticleCrawler 30 | 31 | Crawler = ArticleCrawler() 32 | Crawler.set_category("politics", "IT_science", "economy") 33 | Crawler.set_date_range("2017-01", "2018-04-20") 34 | Crawler.start() 35 | ``` 36 | This crawls the politics, IT_science, and economy categories in parallel, one process per category, from January 1, 2017 to April 20, 2018. 37 | 38 | ## Sports News Crawler Example 39 | The methods are similar to ArticleCrawler, except that set_date_range takes integer year and month arguments: set_date_range(start_year, start_month, end_year, end_month). 40 | ``` 41 | from korea_news_crawler.sportcrawler import SportCrawler 42 | 43 | Spt_crawler = SportCrawler() 44 | Spt_crawler.set_category('korea baseball', 'korea football') 45 | Spt_crawler.set_date_range(2017, 1, 2018, 4) 46 | Spt_crawler.start() 47 | ``` 48 | This crawls Korean baseball and Korean football news in parallel, one process per category, from January 2017 to April 2018. 49 | 50 | ## Results 51 | ![ex_screenshot](./img/article_result.PNG) 52 | ![ex_screenshot](./img/sport_resultimg.PNG) 53 | 54 | Column A: Article date & time 55 | Column B: Article category 56 | Column C: Media company 57 | Column D: Article title 58 | Column E: Article body 59 | Column F: Article URL 60 | All collected data is saved as a CSV file, encoded as euc-kr on Windows and utf-8 on other systems; a short loading example follows below. 61 | -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.12-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.12-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.2-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.2-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.20-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.20-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.30-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.30-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.41-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.41-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.50-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.50-py3-none-any.whl
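Referring back to the Results section of the README above, here is a minimal sketch for reading an output CSV. The file name and location are hypothetical (writer.py writes Category_category_startdate_enddate.csv into an output directory next to the package); the files have no header row, and the encoding is euc-kr on Windows and utf-8 elsewhere:

```
import csv

# Hypothetical example path; adjust to where the crawler actually wrote its output
path = "output/Article_politics_20170101_20180420.csv"

with open(path, encoding="euc-kr") as f:  # use encoding="utf-8" on macOS/Linux
    for date, category, company, title, body, url in csv.reader(f):
        print(date, company, title)
```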
-------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.51-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.51-py3-none-any.whl -------------------------------------------------------------------------------- /img/article_result.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/img/article_result.PNG -------------------------------------------------------------------------------- /img/multi_process.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/img/multi_process.PNG -------------------------------------------------------------------------------- /img/sport_resultimg.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/img/sport_resultimg.PNG -------------------------------------------------------------------------------- /korea_news_crawler/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = "lumyjuwon" 2 | __version__ = "1.50" 3 | __copyright__ = "Copyright (c) lumyjuwon" 4 | __license__ = "MIT" 5 | 6 | from .articlecrawler import * 7 | from .articleparser import * 8 | from .exceptions import * 9 | from .sample import * 10 | from .sportcrawler import * 11 | from .writer import * 12 | -------------------------------------------------------------------------------- /korea_news_crawler/articlecrawler.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8, euc-kr -*- 3 | 4 | import os 5 | import platform 6 | import calendar 7 | import requests 8 | import re 9 | from time import sleep 10 | from bs4 import BeautifulSoup 11 | from multiprocessing import Process 12 | from korea_news_crawler.exceptions import * 13 | from korea_news_crawler.articleparser import ArticleParser 14 | from korea_news_crawler.writer import Writer 15 | 16 | class ArticleCrawler(object): 17 | def __init__(self): 18 | self.categories = {'정치': 100, '경제': 101, '사회': 102, '생활문화': 103, '세계': 104, 'IT과학': 105, '오피니언': 110, 19 | 'politics': 100, 'economy': 101, 'society': 102, 'living_culture': 103, 'world': 104, 'IT_science': 105, 'opinion': 110} 20 | self.selected_categories = [] 21 | self.date = {'start_year': 0, 'start_month': 0, 'start_day' : 0, 'end_year': 0, 'end_month': 0, 'end_day':0} 22 | self.user_operating_system = str(platform.system()) 23 | 24 | def set_category(self, *args): 25 | for key in args: 26 | if self.categories.get(key) is None: 27 | raise InvalidCategory(key) 28 | self.selected_categories = args 29 | 30 | def set_date_range(self, start_date:str, end_date:str): 31 | start = list(map(int, start_date.split("-"))) 32 | end = list(map(int, end_date.split("-"))) 33 | 34 | # Setting Start Date 35 | if len(start) == 1: # Input Only Year 36 | start_year = start[0] 37 | start_month = 1 38 | start_day = 1 39 | elif len(start) == 2: # Input Year and month 40 | start_year, start_month = start 41 | start_day = 1 42 | elif len(start) == 3: # Input Year, 
month and day 43 | start_year, start_month, start_day = start 44 | 45 | # Setting End Date 46 | if len(end) == 1: # Input Only Year 47 | end_year = end[0] 48 | end_month = 12 49 | end_day = 31 50 | elif len(end) == 2: # Input Year and month 51 | end_year, end_month = end 52 | end_day = calendar.monthrange(end_year, end_month)[1] 53 | elif len(end) == 3: # Input Year, month and day 54 | end_year, end_month, end_day = end 55 | 56 | args = [start_year, start_month, start_day, end_year, end_month, end_day] 57 | 58 | if start_year > end_year: 59 | raise InvalidYear(start_year, end_year) 60 | if start_month < 1 or start_month > 12: 61 | raise InvalidMonth(start_month) 62 | if end_month < 1 or end_month > 12: 63 | raise InvalidMonth(end_month) 64 | if start_day < 1 or calendar.monthrange(start_year, start_month)[1] < start_day: 65 | raise InvalidDay(start_day) 66 | if end_day < 1 or calendar.monthrange(end_year, end_month)[1] < end_day: 67 | raise InvalidDay(end_day) 68 | if start_year == end_year and start_month > end_month: 69 | raise OverbalanceMonth(start_month, end_month) 70 | if start_year == end_year and start_month == end_month and start_day > end_day: 71 | raise OverbalanceDay(start_day, end_day) 72 | 73 | for key, date in zip(self.date, args): 74 | self.date[key] = date 75 | print(self.date) 76 | 77 | @staticmethod 78 | def make_news_page_url(category_url, date): 79 | made_urls = [] 80 | for year in range(date['start_year'], date['end_year'] + 1): 81 | if date['start_year'] == date['end_year']: 82 | target_start_month = date['start_month'] 83 | target_end_month = date['end_month'] 84 | else: 85 | if year == date['start_year']: 86 | target_start_month = date['start_month'] 87 | target_end_month = 12 88 | elif year == date['end_year']: 89 | target_start_month = 1 90 | target_end_month = date['end_month'] 91 | else: 92 | target_start_month = 1 93 | target_end_month = 12 94 | 95 | for month in range(target_start_month, target_end_month + 1): 96 | if date['start_year'] == date['end_year'] and date['start_month'] == date['end_month']: 97 | target_start_day = date['start_day'] 98 | target_end_day = date['end_day'] 99 | else: 100 | if year == date['start_year'] and month == date['start_month']: 101 | target_start_day = date['start_day'] 102 | target_end_day = calendar.monthrange(year, month)[1] 103 | elif year == date['end_year'] and month == date['end_month']: 104 | target_start_day = 1 105 | target_end_day = date['end_day'] 106 | else: 107 | target_start_day = 1 108 | target_end_day = calendar.monthrange(year, month)[1] 109 | 110 | for day in range(target_start_day, target_end_day + 1): 111 | if len(str(month)) == 1: 112 | month = "0" + str(month) 113 | if len(str(day)) == 1: 114 | day = "0" + str(day) 115 | 116 | # Generate the list-page URL for each date 117 | url = category_url + str(year) + str(month) + str(day) 118 | 119 | # The total page count is found by exploiting Naver's paging: requesting page=10000, 120 | # which does not exist, redirects to the last existing page (page=totalpage) 121 | totalpage = ArticleParser.find_news_totalpage(url + "&page=10000") 122 | for page in range(1, totalpage + 1): 123 | made_urls.append(url + "&page=" + str(page)) 124 | return made_urls 125 | 126 | @staticmethod 127 | def get_url_data(url, max_tries=5): 128 | remaining_tries = int(max_tries) 129 | while remaining_tries > 0: 130 | try: 131 | return requests.get(url, headers={'User-Agent':'Mozilla/5.0'}) 132 | except requests.exceptions.RequestException: 133 | sleep(1) 134 | remaining_tries = remaining_tries - 1 135 | raise ResponseTimeout() 136 | 137 | def crawling(self, category_name): 138 | # Multi Process PID
139 | print(category_name + " PID: " + str(os.getpid())) 140 | 141 | writer = Writer(category='Article', article_category=category_name, date=self.date) 142 | # 기사 url 형식 143 | url_format = f'http://news.naver.com/main/list.nhn?mode=LSD&mid=sec&sid1={self.categories.get(category_name)}&date=' 144 | # start_year년 start_month월 start_day일 부터 ~ end_year년 end_month월 end_day일까지 기사를 수집합니다. 145 | target_urls = self.make_news_page_url(url_format, self.date) 146 | print(f'{category_name} Urls are generated') 147 | 148 | print(f'{category_name} is collecting ...') 149 | for url in target_urls: 150 | request = self.get_url_data(url) 151 | document = BeautifulSoup(request.content, 'html.parser') 152 | 153 | # html - newsflash_body - type06_headline, type06 154 | # 각 페이지에 있는 기사들 가져오기 155 | temp_post = document.select('.newsflash_body .type06_headline li dl') 156 | temp_post.extend(document.select('.newsflash_body .type06 li dl')) 157 | 158 | # 각 페이지에 있는 기사들의 url 저장 159 | post_urls = [] 160 | for line in temp_post: 161 | # 해당되는 page에서 모든 기사들의 URL을 post_urls 리스트에 넣음 162 | post_urls.append(line.a.get('href')) 163 | del temp_post 164 | 165 | for content_url in post_urls: # 기사 url 166 | # 크롤링 대기 시간 167 | sleep(0.01) 168 | 169 | # 기사 HTML 가져옴 170 | request_content = self.get_url_data(content_url) 171 | 172 | try: 173 | document_content = BeautifulSoup(request_content.content, 'html.parser') 174 | except: 175 | continue 176 | try: 177 | # 기사 제목 가져옴 178 | tag_headline = document_content.find_all('h2', {'class': 'media_end_head_headline'}) 179 | # 뉴스 기사 제목 초기화 180 | text_headline = '' 181 | text_headline = text_headline + ArticleParser.clear_headline(str(tag_headline[0].find_all(text=True))) 182 | # 공백일 경우 기사 제외 처리 183 | if not text_headline: 184 | continue 185 | #
186 | 187 | # 기사 본문 가져옴 188 | tag_content = document_content.find_all('div', {'id': 'dic_area'}) 189 | # 뉴스 기사 본문 초기화 190 | text_sentence = '' 191 | text_sentence = text_sentence + ArticleParser.clear_content(str(tag_content[0].find_all(text=True))) 192 | # 공백일 경우 기사 제외 처리 193 | if not text_sentence: 194 | continue 195 | 196 | # 기사 언론사 가져옴 197 | tag_content = document_content.find_all('meta', {'property': 'og:article:author'}) 198 | # 언론사 초기화 199 | text_company = '' 200 | text_company = text_company + tag_content[0]['content'].split("|")[0] 201 | 202 | # 공백일 경우 기사 제외 처리 203 | if not text_company: 204 | continue 205 | 206 | # 기사 시간대 가져옴 207 | time = document_content.find_all('span',{'class':"media_end_head_info_datestamp_time _ARTICLE_DATE_TIME"})[0]['data-date-time'] 208 | 209 | # CSV 작성 210 | writer.write_row([time, category_name, text_company, text_headline, text_sentence, content_url]) 211 | 212 | del time 213 | del text_company, text_sentence, text_headline 214 | del tag_company 215 | del tag_content, tag_headline 216 | del request_content, document_content 217 | 218 | # UnicodeEncodeError 219 | except Exception as ex: 220 | del request_content, document_content 221 | pass 222 | writer.close() 223 | 224 | def start(self): 225 | # MultiProcess 크롤링 시작 226 | for category_name in self.selected_categories: 227 | proc = Process(target=self.crawling, args=(category_name,)) 228 | proc.start() 229 | 230 | 231 | if __name__ == "__main__": 232 | Crawler = ArticleCrawler() 233 | Crawler.set_category('생활문화') 234 | Crawler.set_date_range('2018-01', '2018-02') 235 | Crawler.start() 236 | -------------------------------------------------------------------------------- /korea_news_crawler/articleparser.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | from bs4 import BeautifulSoup 4 | 5 | 6 | class ArticleParser(object): 7 | special_symbol = re.compile('[\{\}\[\]\/?,;:|\)*~`!^\-_+<>@\#$&▲▶◆◀■【】\\\=\(\'\"]') 8 | content_pattern = re.compile('본문 내용|TV플레이어| 동영상 뉴스|flash 오류를 우회하기 위한 함수 추가function flash removeCallback|tt|앵커 멘트|xa0') 9 | 10 | @classmethod 11 | def clear_content(cls, text): 12 | # 기사 본문에서 필요없는 특수문자 및 본문 양식 등을 다 지움 13 | newline_symbol_removed_text = text.replace('\\n', '').replace('\\t', '').replace('\\r', '') 14 | special_symbol_removed_content = re.sub(cls.special_symbol, ' ', newline_symbol_removed_text) 15 | end_phrase_removed_content = re.sub(cls.content_pattern, '', special_symbol_removed_content) 16 | blank_removed_content = re.sub(' +', ' ', end_phrase_removed_content).lstrip() # 공백 에러 삭제 17 | reversed_content = ''.join(reversed(blank_removed_content)) # 기사 내용을 reverse 한다. 
18 | content = '' 19 | for i in range(0, len(blank_removed_content)): 20 | # In the reversed text, the first '.다' marks the end of the article body, so everything after the body (ads, reporter info, etc.) is removed 21 | if reversed_content[i:i + 2] == '.다': 22 | content = ''.join(reversed(reversed_content[i:])) 23 | break 24 | return content 25 | 26 | @classmethod 27 | def clear_headline(cls, text): 28 | # Remove unnecessary special characters from the headline 29 | newline_symbol_removed_text = text.replace('\\n', '').replace('\\t', '').replace('\\r', '') 30 | special_symbol_removed_headline = re.sub(cls.special_symbol, '', newline_symbol_removed_text) 31 | return special_symbol_removed_headline 32 | 33 | @classmethod 34 | def find_news_totalpage(cls, url): 35 | # Find the total number of list pages for the given day 36 | try: 37 | # Added headers to avoid anti-crawling 38 | request_content = requests.get(url, timeout=10, headers={'User-Agent':'Mozilla/5.0'}) 39 | document_content = BeautifulSoup(request_content.content, 'html.parser') 40 | headline_tag = document_content.find('div', {'class': 'paging'}).find('strong') 41 | regex = re.compile(r'(\d+)') 42 | match = regex.findall(str(headline_tag)) 43 | return int(match[0]) 44 | except Exception: 45 | return 0 46 | -------------------------------------------------------------------------------- /korea_news_crawler/exceptions.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | # Raised when a value is larger than can be handled 4 | class OverFlow(Exception): 5 | def __init__(self, args): 6 | self.message = f'{args} is overflow' 7 | 8 | def __str__(self): 9 | return self.message 10 | 11 | 12 | # Raised when a value is smaller than can be handled 13 | class UnderFlow(Exception): 14 | def __init__(self, args): 15 | self.message = f'{args} is underflow' 16 | 17 | def __str__(self): 18 | return self.message 19 | 20 | 21 | # Raised when an argument is invalid 22 | class InvalidArgs(Exception): 23 | def __init__(self, args): 24 | self.message = f'{args} is Invalid Arguments' 25 | 26 | def __str__(self): 27 | return self.message 28 | 29 | 30 | # Raised when the category is invalid 31 | class InvalidCategory(Exception): 32 | def __init__(self, category): 33 | self.message = f'{category} is Invalid Category.'
34 | 35 | def __str__(self): 36 | return self.message 37 | 38 | 39 | # 년도가 올바르지 않을 때 40 | class InvalidYear(Exception): 41 | def __init__(self, start_year, end_year): 42 | self.message = f'{start_year}(start year) is bigger than {end_year}(end year)' 43 | 44 | def __str__(self): 45 | return str(self.message) 46 | 47 | 48 | # 달이 올바르지 않을 때 49 | class InvalidMonth(Exception): 50 | def __init__(self, month): 51 | self.message = f'{month} is an invalid month' 52 | 53 | def __str__(self): 54 | return self.message 55 | 56 | # 일이 올바르지 않을 때 57 | class InvalidDay(Exception): 58 | def __init__(self, day): 59 | self.message = f'{day} is an invalid day' 60 | 61 | def __str__(self): 62 | return self.message 63 | 64 | 65 | 66 | # 시작 달과 끝나는 달이 올바르지 않을 때 67 | class OverbalanceMonth(Exception): 68 | def __init__(self, start_month, end_month): 69 | self.message = f'{start_month}(start month) is an overbalance with {end_month}(end month)' 70 | 71 | def __str__(self): 72 | return self.message 73 | 74 | class OverbalanceDay(Exception): 75 | def __init__(self, start_day, end_day): 76 | self.message = f'{start_day}(start day) is an overbalance with {end_day}(end day)' 77 | 78 | def __str__(self): 79 | return self.message 80 | 81 | 82 | # 실행시간이 너무 길어서 데이터를 얻을 수 없을 때 83 | class ResponseTimeout(Exception): 84 | def __init__(self): 85 | self.message = "Couldn't get the data" 86 | 87 | def __str__(self): 88 | return self.message 89 | 90 | 91 | # 존재하는 파일 92 | class ExistFile(Exception): 93 | def __init__(self, path): 94 | absolute_path = os.path.abspath(path) 95 | self.message = f'{absolute_path} already exist' 96 | 97 | def __str__(self): 98 | return self.message 99 | -------------------------------------------------------------------------------- /korea_news_crawler/sample.py: -------------------------------------------------------------------------------- 1 | from korea_news_crawler.articlecrawler import ArticleCrawler 2 | if __name__ == "__main__": 3 | Crawler = ArticleCrawler() 4 | # 정치, 경제, 생활문화, IT과학, 사회, 세계 카테고리 사용 가능 5 | Crawler.set_category("IT과학", "세계") 6 | # 2017년 12월 (1일) 부터 2018년 1월 13일까지 크롤링 시작 YYYY-MM-DD의 형식으로 입력 7 | Crawler.set_date_range('2017-12', '2018-01-13') 8 | Crawler.start() 9 | -------------------------------------------------------------------------------- /korea_news_crawler/sportcrawler.py: -------------------------------------------------------------------------------- 1 | import calendar 2 | import csv 3 | import requests 4 | import re 5 | import json 6 | from bs4 import BeautifulSoup 7 | from time import sleep 8 | from multiprocessing import Process 9 | from korea_news_crawler.exceptions import * 10 | from korea_news_crawler.writer import Writer 11 | #from korea_news_crawler.writer import Writer 12 | 13 | 14 | class SportCrawler: 15 | def __init__(self): 16 | self.category = {'한국야구': "kbaseball", '해외야구': "wbaseball", '해외축구': "wfootball", 17 | '한국축구': "kfootball", '농구': "basketball", '배구': "volleyball", '일반 스포츠': "general", 18 | 'e스포츠': "esports", 19 | 'korea baseball': "kbaseball", 'world baseball': "wbaseball", 'world football': "wfootball", 20 | 'korea football': "kfootball", 'basketball': "basketball", 'volleyball': "volleyball", 21 | 'general sports': "general", 22 | 'e-sports': "esports" 23 | } 24 | self.selected_category = [] 25 | self.selected_url_category = [] 26 | self.date = {'start_year': 0, 'start_month': 0, 'end_year': 0, 'end_month': 0} 27 | 28 | def get_total_page(self, url): 29 | totalpage_url = f'{url}&page=10000' 30 | request_content = 
requests.get(totalpage_url,headers={'User-Agent': 'Mozilla/5.0'}) 31 | page_number = re.findall('\"totalPages\":(.*)}', request_content.text) 32 | return int(page_number[0]) 33 | 34 | def content(self, html_document, url_label): 35 | content_match = [] 36 | Tag = html_document.find_all('script', {'type': 'text/javascript'}) 37 | Tag_ = re.sub(',"officeName', '\nofficeName', str(Tag)) 38 | regex = re.compile('oid":"(\d+)","aid":"(\d+)"') 39 | content = regex.findall(Tag_) 40 | for oid_aid in content: 41 | maked_url = "https://sports.news.naver.com/" + url_label + "/news/read.nhn?oid=" + oid_aid[0] + "&aid=" + \ 42 | oid_aid[1] 43 | content_match.append(maked_url) 44 | return content_match 45 | 46 | def clear_content(self, text): 47 | remove_special = re.sub('[∙©\{\}\[\]\/?,;:|\)*~`!^\-_+<>@\#$%&n▲▶◆◀■\\\=\(\'\"]', '', text) 48 | remove_author = re.sub('\w\w\w 기자', '', remove_special) 49 | remove_flash_error = re.sub('본문 내용|TV플레이어| 동영상 뉴스|flash 오류를 우회하기 위한 함수 추가fuctio flashremoveCallback|tt|t|앵커 멘트|xa0', '', remove_author) 50 | # Remove whitespace artifacts 51 | remove_strip = remove_flash_error.strip().replace(' ', '') 52 | # Reverse the article text 53 | reverse_content = ''.join(reversed(remove_strip)) 54 | cleared_content = '' 55 | for i in range(0, len(remove_strip)): 56 | # Because the text is reversed, the article body ends at the first '.다'; everything beyond it (ads, reporter info, etc.) is dropped 57 | # Known issue: English-language articles are not fully captured by this mechanism; if edge cases are rare, matching '.' instead of '.다' may work better 58 | if reverse_content[i:i + 2] == '.다': 59 | cleared_content = ''.join(reversed(reverse_content[i:])) 60 | break 61 | cleared_content = re.sub('if deployPhase(.*)displayRMCPlayer', '', cleared_content) 62 | return cleared_content 63 | 64 | def clear_headline(self, text): 65 | first = re.sub('[∙©\{\}\[\]\/?,;:|\)*~`!^\-_+<>@\#$%&n▲▶◆◀■\\\=\(\'\"]', '', text) 66 | return first 67 | 68 | def make_sport_page_url(self, input_url, start_year, last_year, start_month, last_month): 69 | urls = [] 70 | for year in range(start_year, last_year + 1): 71 | target_start_month = start_month 72 | target_last_month = last_month 73 | 74 | if year != last_year: 75 | target_start_month = 1 76 | target_last_month = 12 77 | else: 78 | target_start_month = start_month 79 | target_last_month = last_month 80 | 81 | for month in range(target_start_month, target_last_month + 1): 82 | for day in range(1, calendar.monthrange(year, month)[1] + 1): 83 | if len(str(month)) == 1: 84 | month = "0" + str(month) 85 | if len(str(day)) == 1: 86 | day = "0" + str(day) 87 | url = f'{input_url}{year}{month}{day}' 88 | # Temporarily store the URL that has the date but no page parameter yet 89 | final_url = f'{input_url}{year}{month}{day}' 90 | 91 | # Find the total page count 92 | total_page = self.get_total_page(final_url) 93 | for page in range(1, total_page + 1): 94 | # Build the URL for each page 95 | url = f'{final_url}&page={page}' 96 | # [page1, page2, page3, ...] 97 | urls.append(url) 98 | return urls 99 | 100 | def crawling(self, category_name): 101 | writer = Writer(category='Sport', article_category=category_name, date=self.date) 102 | url_category = [self.category[category_name]] 103 | category = [category_name] 104 | 105 | 106 | 107 | # URL 카테고리.
Multiprocessing시 어차피 1번 도는거라 refactoring할 필요 있어보임 108 | for url_label in url_category: 109 | # URL 인덱스와 category 인덱스가 일치할 경우 그 값도 일치 110 | category = category[url_category.index(url_label)] 111 | url = f'https://sports.news.naver.com/{url_label}/news/list.nhn?isphoto=N&view=photo&date=' 112 | final_url_day = self.make_sport_page_url(url, self.date['start_year'], 113 | self.date['end_year'], self.date['start_month'], self.date['end_month']) 114 | print("succeed making url") 115 | print("crawler starts") 116 | if len(str(self.date['start_month'])) == 2: 117 | start_month = str(self.date['start_month']) 118 | else: 119 | start_month = '0' + str(self.date['start_month']) 120 | 121 | if len(str(self.date['end_month'])) == 2: 122 | end_month = str(self.date['end_month']) 123 | else: 124 | end_month = '0' + str(self.date['end_month']) 125 | 126 | # category Year Month Data Page 처리 된 URL 127 | for list_page in final_url_day: 128 | title_script = '' 129 | office_name_script = '' 130 | time_script = '' 131 | matched_content = '' 132 | 133 | # 제목 / URL 134 | request_content = requests.get(list_page, headers={'User-Agent': 'Mozilla/5.0'}) 135 | content_dict = json.loads(request_content.text) 136 | # 이는 크롤링에 사용 137 | 138 | hef_script = '' 139 | 140 | for contents in content_dict["list"]: 141 | oid = contents['oid'] 142 | aid = contents['aid'] 143 | title_script = contents['title'] 144 | time_script = contents['datetime'] 145 | hef_script = "https://sports.news.naver.com/news.nhn?oid=" + oid + "&aid=" + aid 146 | office_name_script = contents['officeName'] 147 | sleep(0.01) 148 | content_request_content = requests.get(hef_script, headers={'User-Agent': 'Mozilla/5.0'}) 149 | content_document_content = BeautifulSoup(content_request_content.content, 'html.parser') 150 | content_tag_content = content_document_content.find_all('div', {'class': 'news_end'}, 151 | {'id': 'newsEndContents'}) 152 | # 뉴스 기사 본문 내용 초기화 153 | text_sentence = '' 154 | 155 | try: 156 | text_sentence = text_sentence + str(content_tag_content[0].find_all(text=True)) 157 | matched_content = self.clear_content(text_sentence) 158 | writer.write_row([time_script, category, office_name_script, self.clear_headline(title_script), 159 | matched_content, 160 | hef_script]) 161 | except: 162 | pass 163 | writer.close() 164 | 165 | def set_category(self, *args): 166 | for key in args: 167 | if self.category.get(key) is None: 168 | raise InvalidCategory(key) 169 | self.selected_category = args 170 | for selected in self.selected_category: 171 | self.selected_url_category.append(self.category[selected]) 172 | 173 | def start(self): 174 | # MultiProcess 크롤링 시작 175 | for category_name in self.selected_category: 176 | proc = Process(target=self.crawling, args=(category_name,)) 177 | proc.start() 178 | 179 | def set_date_range(self, start_year, start_month, end_year, end_month): 180 | self.date['start_year'] = start_year 181 | self.date['start_month'] = start_month 182 | self.date['end_year'] = end_year 183 | self.date['end_month'] = end_month 184 | 185 | 186 | # Main 187 | if __name__ == "__main__": 188 | Spt_crawler = SportCrawler() 189 | Spt_crawler.set_category('한국야구','한국축구') 190 | Spt_crawler.set_date_range(2020, 12, 2020, 12) 191 | Spt_crawler.start() 192 | -------------------------------------------------------------------------------- /korea_news_crawler/sports_crawler_sample.py: -------------------------------------------------------------------------------- 1 | from korea_news_crawler.sportcrawler import SportCrawler 2 | 3 | if __name__ == 
"__main__": 4 | Spt_crawler = SportCrawler() 5 | Spt_crawler.set_category('한국야구', '한국축구') 6 | Spt_crawler.set_date_range(2020, 11, 2020, 11) 7 | Spt_crawler.start() 8 | -------------------------------------------------------------------------------- /korea_news_crawler/writer.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import platform 3 | from korea_news_crawler.exceptions import * 4 | 5 | 6 | class Writer(object): 7 | def __init__(self, category, article_category, date): 8 | self.start_year = date['start_year'] 9 | self.start_month = f'0{date["start_month"]}' if len(str(date['start_month'])) == 1 else str(date['start_month']) 10 | self.start_day = f'0{date["start_day"]}' if len(str(date['start_day'])) == 1 else str(date['start_day']) 11 | self.end_year = date['end_year'] 12 | self.end_month = f'0{date["end_month"]}' if len(str(date['end_month'])) == 1 else str(date['end_month']) 13 | self.end_day = f'0{date["end_day"]}' if len(str(date['end_day'])) == 1 else str(date['end_day']) 14 | self.file = None 15 | self.initialize_file(category, article_category) 16 | 17 | self.csv_writer = csv.writer(self.file) 18 | 19 | def initialize_file(self, category, article_category): 20 | output_path = f'../output' 21 | if os.path.exists(output_path) is not True: 22 | os.mkdir(output_path) 23 | 24 | file_name = f'{output_path}/{category}_{article_category}_{self.start_year}{self.start_month}{self.start_day}_{self.end_year}{self.end_month}{self.end_day}.csv' 25 | if os.path.isfile(file_name) and os.path.getsize(file_name) != 0: 26 | raise ExistFile(file_name) 27 | 28 | user_os = str(platform.system()) 29 | if user_os == "Windows": 30 | self.file = open(file_name, 'w', encoding='euc-kr', newline='') 31 | # Other OS uses utf-8 32 | else: 33 | self.file = open(file_name, 'w', encoding='utf-8', newline='') 34 | 35 | def write_row(self, arg): 36 | self.csv_writer.writerow(arg) 37 | 38 | def close(self): 39 | self.file.close() 40 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description_file = README.md -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | # build package command: python setup.py bdist_wheel 4 | # release package command: twine upload dist/KoreaNewsCrawler-${version}-py3-none-any.whl 5 | 6 | setup( 7 | name = 'KoreaNewsCrawler', 8 | version = '1.51', 9 | description = 'Crawl the korean news', 10 | author = 'lumyjuwon', 11 | author_email = 'lumyjuwon@gmail.com', 12 | url = 'https://github.com/lumyjuwon/KoreaNewsCrawler', 13 | download_url = 'https://github.com/lumyjuwon/KoreaNewsCrawler/archive/1.51.tar.gz', 14 | install_requires = ['requests', 'beautifulsoup4'], 15 | packages = ['korea_news_crawler'], 16 | keywords = ['crawl', 'KoreaNews', 'crawler'], 17 | python_requires = '>=3.6', 18 | zip_safe=False, 19 | classifiers = [ 20 | 'Programming Language :: Python :: 3.6' 21 | ] 22 | ) --------------------------------------------------------------------------------