├── .gitignore
├── LICENSE
├── README.md
├── dist
│   ├── KoreaNewsCrawler-1.12-py3-none-any.whl
│   ├── KoreaNewsCrawler-1.2-py3-none-any.whl
│   ├── KoreaNewsCrawler-1.20-py3-none-any.whl
│   ├── KoreaNewsCrawler-1.30-py3-none-any.whl
│   ├── KoreaNewsCrawler-1.41-py3-none-any.whl
│   ├── KoreaNewsCrawler-1.50-py3-none-any.whl
│   └── KoreaNewsCrawler-1.51-py3-none-any.whl
├── img
│   ├── article_result.PNG
│   ├── multi_process.PNG
│   └── sport_resultimg.PNG
├── korea_news_crawler
│   ├── __init__.py
│   ├── articlecrawler.py
│   ├── articleparser.py
│   ├── exceptions.py
│   ├── sample.py
│   ├── sportcrawler.py
│   ├── sports_crawler_sample.py
│   └── writer.py
├── setup.cfg
└── setup.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea
2 | __pycache__
3 | build
4 | output
5 | korea_news_crawler/.idea
6 | korea_news_crawler/__pycache__
7 | KoreaNewsCrawler.egg-info
8 | MANIFEST.in
9 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 lumyjuwon
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # KoreaNewsCrawler
2 |
3 | This crawler collects news articles that media outlets publish on the NAVER portal.
4 | Crawlable article categories include politics, economy, society, living culture, world, and IT/science.
5 | Sports articles cover Korean baseball, Korean soccer, world baseball, world soccer, basketball, volleyball, general sports, and e-sports.
6 |
7 | ## How to install
8 | pip install KoreaNewsCrawler
9 |
10 | ## Method
11 |
12 | * **set_category(category_name)**
13 |
14 | This method sets the categories you want to collect.
15 | Valid values are 'politics', 'economy', 'society', 'living_culture', 'IT_science', 'world', and 'opinion'; the Korean names used in the source (e.g. '정치') also work.
16 | You can pass multiple categories at once.
17 | category_name: politics, economy, society, living_culture, IT_science, world, opinion
18 |
19 | * **set_date_range(start_date, end_date)**
20 |
21 | This method sets the time period of news you want to collect. For ArticleCrawler, pass two date strings in 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' form; for SportCrawler, pass four integers (start_year, start_month, end_year, end_month), as shown below.
22 |
23 | * **start()**
24 |
25 | This method starts the crawl; each selected category is crawled in its own process.
26 |
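For ArticleCrawler, the date strings may be given at year, month, or day granularity. Below is a minimal sketch based on `set_date_range` in articlecrawler.py; the `if __name__ == "__main__":` guard follows sample.py, since `start()` spawns one process per category.

```python
from korea_news_crawler.articlecrawler import ArticleCrawler

if __name__ == "__main__":
    crawler = ArticleCrawler()
    crawler.set_category("politics")      # Korean names such as "정치" also work
    # "2017"       -> treated as 2017-01-01 when used as a start date
    # "2018-04"    -> treated as 2018-04-30 when used as an end date
    # "2018-04-20" -> an exact day
    crawler.set_date_range("2017", "2018-04-20")
    crawler.start()
```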
27 | ## Article News Crawler Example
28 | ```python
29 | from korea_news_crawler.articlecrawler import ArticleCrawler
30 |
31 | Crawler = ArticleCrawler()
32 | Crawler.set_category("politics", "IT_science", "economy")
33 | Crawler.set_date_range("2017-01", "2018-04-20")
34 | Crawler.start()
35 | ```
36 | This performs a parallel crawl of the politics, IT/science, and economy categories, one process per category, from January 2017 to April 20, 2018.
37 |
38 | ## Sports News Crawler Example
39 | The methods are similar to ArticleCrawler's, except that set_date_range takes four integers (start year, start month, end year, end month).
40 | ```python
41 | from korea_news_crawler.sportcrawler import SportCrawler
42 |
43 | Spt_crawler = SportCrawler()
44 | Spt_crawler.set_category('korea baseball', 'korea football')
45 | Spt_crawler.set_date_range(2017, 1, 2018, 4)
46 | Spt_crawler.start()
47 | ```
48 | This performs a parallel crawl of Korean baseball and Korean soccer news, one process per category, from January 2017 through April 2018.
49 |
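For reference, these are the exact category keys accepted by SportCrawler.set_category, taken from the dictionary in sportcrawler.py (the Korean names defined there, such as '한국야구', also work):

```python
['korea baseball', 'world baseball', 'korea football', 'world football',
 'basketball', 'volleyball', 'general sports', 'e-sports']
```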
50 | ## Results
51 | 
52 | 
53 |
54 | * Column A: article date and time
55 | * Column B: article category
56 | * Column C: media company
57 | * Column D: article title
58 | * Column E: article body
59 | * Column F: article URL
60 | All collected data is saved as CSV files.
61 |
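For reference, here is a minimal sketch of loading one of the output files; it is not part of the package, and pandas is used only for illustration. Per writer.py, the CSV files land in an `output` folder one level above the working directory, contain no header row, and are encoded as EUC-KR on Windows and UTF-8 elsewhere. The file name below is a hypothetical example.

```python
import pandas as pd

# File name pattern from writer.py: <Category>_<article_category>_<startdate>_<enddate>.csv
df = pd.read_csv(
    "output/Article_politics_20170101_20180420.csv",  # hypothetical example path
    encoding="utf-8",   # use "euc-kr" if the crawl ran on Windows
    names=["datetime", "category", "company", "headline", "body", "url"],
)
print(df.head())
```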
--------------------------------------------------------------------------------
/dist/KoreaNewsCrawler-1.12-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.12-py3-none-any.whl
--------------------------------------------------------------------------------
/dist/KoreaNewsCrawler-1.2-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.2-py3-none-any.whl
--------------------------------------------------------------------------------
/dist/KoreaNewsCrawler-1.20-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.20-py3-none-any.whl
--------------------------------------------------------------------------------
/dist/KoreaNewsCrawler-1.30-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.30-py3-none-any.whl
--------------------------------------------------------------------------------
/dist/KoreaNewsCrawler-1.41-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.41-py3-none-any.whl
--------------------------------------------------------------------------------
/dist/KoreaNewsCrawler-1.50-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.50-py3-none-any.whl
--------------------------------------------------------------------------------
/dist/KoreaNewsCrawler-1.51-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.51-py3-none-any.whl
--------------------------------------------------------------------------------
/img/article_result.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/img/article_result.PNG
--------------------------------------------------------------------------------
/img/multi_process.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/img/multi_process.PNG
--------------------------------------------------------------------------------
/img/sport_resultimg.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/img/sport_resultimg.PNG
--------------------------------------------------------------------------------
/korea_news_crawler/__init__.py:
--------------------------------------------------------------------------------
1 | __author__ = "lumyjuwon"
2 | __version__ = "1.50"
3 | __copyright__ = "Copyright (c) lumyjuwon"
4 | __license__ = "MIT"
5 |
6 | from .articlecrawler import *
7 | from .articleparser import *
8 | from .exceptions import *
9 | from .sample import *
10 | from .sportcrawler import *
11 | from .writer import *
12 |
--------------------------------------------------------------------------------
/korea_news_crawler/articlecrawler.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 |
4 | import os
5 | import platform
6 | import calendar
7 | import requests
8 | import re
9 | from time import sleep
10 | from bs4 import BeautifulSoup
11 | from multiprocessing import Process
12 | from korea_news_crawler.exceptions import *
13 | from korea_news_crawler.articleparser import ArticleParser
14 | from korea_news_crawler.writer import Writer
15 |
16 | class ArticleCrawler(object):
17 | def __init__(self):
18 | self.categories = {'정치': 100, '경제': 101, '사회': 102, '생활문화': 103, '세계': 104, 'IT과학': 105, '오피니언': 110,
19 | 'politics': 100, 'economy': 101, 'society': 102, 'living_culture': 103, 'world': 104, 'IT_science': 105, 'opinion': 110}
20 | self.selected_categories = []
21 | self.date = {'start_year': 0, 'start_month': 0, 'start_day' : 0, 'end_year': 0, 'end_month': 0, 'end_day':0}
22 | self.user_operating_system = str(platform.system())
23 |
24 | def set_category(self, *args):
25 | for key in args:
26 | if self.categories.get(key) is None:
27 | raise InvalidCategory(key)
28 | self.selected_categories = args
29 |
30 | def set_date_range(self, start_date:str, end_date:str):
31 | start = list(map(int, start_date.split("-")))
32 | end = list(map(int, end_date.split("-")))
33 |
34 | # Setting Start Date
35 | if len(start) == 1: # Input Only Year
36 | start_year = start[0]
37 | start_month = 1
38 | start_day = 1
39 | elif len(start) == 2: # Input Year and month
40 | start_year, start_month = start
41 | start_day = 1
42 | elif len(start) == 3: # Input Year, month and day
43 | start_year, start_month, start_day = start
44 |
45 | # Setting End Date
46 | if len(end) == 1: # Input Only Year
47 | end_year = end[0]
48 | end_month = 12
49 | end_day = 31
50 | elif len(end) == 2: # Input Year and month
51 | end_year, end_month = end
52 | end_day = calendar.monthrange(end_year, end_month)[1]
53 | elif len(end) == 3: # Input Year, month and day
54 | end_year, end_month, end_day = end
55 |
56 | args = [start_year, start_month, start_day, end_year, end_month, end_day]
57 |
58 | if start_year > end_year:
59 | raise InvalidYear(start_year, end_year)
60 | if start_month < 1 or start_month > 12:
61 | raise InvalidMonth(start_month)
62 | if end_month < 1 or end_month > 12:
63 | raise InvalidMonth(end_month)
64 | if start_day < 1 or calendar.monthrange(start_year, start_month)[1] < start_day:
65 | raise InvalidDay(start_day)
66 | if end_day < 1 or calendar.monthrange(end_year, end_month)[1] < end_day:
67 | raise InvalidDay(end_day)
68 | if start_year == end_year and start_month > end_month:
69 | raise OverbalanceMonth(start_month, end_month)
70 | if start_year == end_year and start_month == end_month and start_day > end_day:
71 | raise OverbalanceDay(start_day, end_day)
72 |
73 | for key, date in zip(self.date, args):
74 | self.date[key] = date
75 | print(self.date)
76 |
77 | @staticmethod
78 | def make_news_page_url(category_url, date):
79 | made_urls = []
80 | for year in range(date['start_year'], date['end_year'] + 1):
81 | if date['start_year'] == date['end_year']:
82 | target_start_month = date['start_month']
83 | target_end_month = date['end_month']
84 | else:
85 | if year == date['start_year']:
86 | target_start_month = date['start_month']
87 | target_end_month = 12
88 | elif year == date['end_year']:
89 | target_start_month = 1
90 | target_end_month = date['end_month']
91 | else:
92 | target_start_month = 1
93 | target_end_month = 12
94 |
95 | for month in range(target_start_month, target_end_month + 1):
96 | if year == date['start_year'] and year == date['end_year'] and date['start_month'] == date['end_month']:  # the whole range falls within a single month
97 | target_start_day = date['start_day']
98 | target_end_day = date['end_day']
99 | else:
100 | if year == date['start_year'] and month == date['start_month']:
101 | target_start_day = date['start_day']
102 | target_end_day = calendar.monthrange(year, month)[1]
103 | elif year == date['end_year'] and month == date['end_month']:
104 | target_start_day = 1
105 | target_end_day = date['end_day']
106 | else:
107 | target_start_day = 1
108 | target_end_day = calendar.monthrange(year, month)[1]
109 |
110 | for day in range(target_start_day, target_end_day + 1):
111 | if len(str(month)) == 1:
112 | month = "0" + str(month)
113 | if len(str(day)) == 1:
114 | day = "0" + str(day)
115 |
116 | # Build the list-page URL for each date
117 | url = category_url + str(year) + str(month) + str(day)
118 |
119 | # To find the total page count, request page=10000; since that page does not exist,
120 | # NAVER redirects to the last page, which reveals the actual total (totalpage)
121 | totalpage = ArticleParser.find_news_totalpage(url + "&page=10000")
122 | for page in range(1, totalpage + 1):
123 | made_urls.append(url + "&page=" + str(page))
124 | return made_urls
125 |
126 | @staticmethod
127 | def get_url_data(url, max_tries=5):
128 | remaining_tries = int(max_tries)
129 | while remaining_tries > 0:
130 | try:
131 | return requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
132 | except requests.exceptions.RequestException:  # wait and retry on any network error
133 | sleep(1)
134 | remaining_tries = remaining_tries - 1
135 | raise ResponseTimeout()
136 |
137 | def crawling(self, category_name):
138 | # Multi Process PID
139 | print(category_name + " PID: " + str(os.getpid()))
140 |
141 | writer = Writer(category='Article', article_category=category_name, date=self.date)
142 | # Article list URL format
143 | url_format = f'http://news.naver.com/main/list.nhn?mode=LSD&mid=sec&sid1={self.categories.get(category_name)}&date='
144 | # Collect articles from start_year-start_month-start_day through end_year-end_month-end_day
145 | target_urls = self.make_news_page_url(url_format, self.date)
146 | print(f'{category_name} Urls are generated')
147 |
148 | print(f'{category_name} is collecting ...')
149 | for url in target_urls:
150 | request = self.get_url_data(url)
151 | document = BeautifulSoup(request.content, 'html.parser')
152 |
153 | # html - newsflash_body - type06_headline, type06
154 | # 각 페이지에 있는 기사들 가져오기
155 | temp_post = document.select('.newsflash_body .type06_headline li dl')
156 | temp_post.extend(document.select('.newsflash_body .type06 li dl'))
157 |
158 | # 각 페이지에 있는 기사들의 url 저장
159 | post_urls = []
160 | for line in temp_post:
161 | # 해당되는 page에서 모든 기사들의 URL을 post_urls 리스트에 넣음
162 | post_urls.append(line.a.get('href'))
163 | del temp_post
164 |
165 | for content_url in post_urls: # 기사 url
166 | # Politeness delay between requests
167 | sleep(0.01)
168 |
169 | # 기사 HTML 가져옴
170 | request_content = self.get_url_data(content_url)
171 |
172 | try:
173 | document_content = BeautifulSoup(request_content.content, 'html.parser')
174 | except:
175 | continue
176 | try:
177 | # 기사 제목 가져옴
178 | tag_headline = document_content.find_all('h2', {'class': 'media_end_head_headline'})
179 | # 뉴스 기사 제목 초기화
180 | text_headline = ''
181 | text_headline = text_headline + ArticleParser.clear_headline(str(tag_headline[0].find_all(text=True)))
182 | # 공백일 경우 기사 제외 처리
183 | if not text_headline:
184 | continue
185 | #
186 |
187 | # 기사 본문 가져옴
188 | tag_content = document_content.find_all('div', {'id': 'dic_area'})
189 | # 뉴스 기사 본문 초기화
190 | text_sentence = ''
191 | text_sentence = text_sentence + ArticleParser.clear_content(str(tag_content[0].find_all(text=True)))
192 | # 공백일 경우 기사 제외 처리
193 | if not text_sentence:
194 | continue
195 |
196 | # 기사 언론사 가져옴
197 | tag_content = document_content.find_all('meta', {'property': 'og:article:author'})
198 | # 언론사 초기화
199 | text_company = ''
200 | text_company = text_company + tag_content[0]['content'].split("|")[0]
201 |
202 | # 공백일 경우 기사 제외 처리
203 | if not text_company:
204 | continue
205 |
206 | # 기사 시간대 가져옴
207 | time = document_content.find_all('span',{'class':"media_end_head_info_datestamp_time _ARTICLE_DATE_TIME"})[0]['data-date-time']
208 |
209 | # CSV 작성
210 | writer.write_row([time, category_name, text_company, text_headline, text_sentence, content_url])
211 |
212 | del time
213 | del text_company, text_sentence, text_headline
214 | del tag_content
215 | del tag_headline
216 | del request_content, document_content
217 |
218 | # UnicodeEncodeError
219 | except Exception as ex:
220 | del request_content, document_content
221 | pass
222 | writer.close()
223 |
224 | def start(self):
225 | # Start one crawling process per selected category
226 | for category_name in self.selected_categories:
227 | proc = Process(target=self.crawling, args=(category_name,))
228 | proc.start()
229 |
230 |
231 | if __name__ == "__main__":
232 | Crawler = ArticleCrawler()
233 | Crawler.set_category('생활문화')
234 | Crawler.set_date_range('2018-01', '2018-02')
235 | Crawler.start()
236 |
--------------------------------------------------------------------------------
/korea_news_crawler/articleparser.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import re
3 | from bs4 import BeautifulSoup
4 |
5 |
6 | class ArticleParser(object):
7 | special_symbol = re.compile('[\{\}\[\]\/?,;:|\)*~`!^\-_+<>@\#$&▲▶◆◀■【】\\\=\(\'\"]')
8 | content_pattern = re.compile('본문 내용|TV플레이어| 동영상 뉴스|flash 오류를 우회하기 위한 함수 추가function flash removeCallback|tt|앵커 멘트|xa0')
9 |
10 | @classmethod
11 | def clear_content(cls, text):
12 | # 기사 본문에서 필요없는 특수문자 및 본문 양식 등을 다 지움
13 | newline_symbol_removed_text = text.replace('\\n', '').replace('\\t', '').replace('\\r', '')
14 | special_symbol_removed_content = re.sub(cls.special_symbol, ' ', newline_symbol_removed_text)
15 | end_phrase_removed_content = re.sub(cls.content_pattern, '', special_symbol_removed_content)
16 | blank_removed_content = re.sub(' +', ' ', end_phrase_removed_content).lstrip() # 공백 에러 삭제
17 | reversed_content = ''.join(reversed(blank_removed_content)) # 기사 내용을 reverse 한다.
18 | content = ''
19 | for i in range(0, len(blank_removed_content)):
20 | # In the reversed text, the last sentence ending '다.' appears as '.다'; everything before it in the reversed string (ads, reporter info, etc. that follow the body) is dropped
21 | if reversed_content[i:i + 2] == '.다':
22 | content = ''.join(reversed(reversed_content[i:]))
23 | break
24 | return content
25 |
26 | @classmethod
27 | def clear_headline(cls, text):
28 | # 기사 제목에서 필요없는 특수문자들을 지움
29 | newline_symbol_removed_text = text.replace('\\n', '').replace('\\t', '').replace('\\r', '')
30 | special_symbol_removed_headline = re.sub(cls.special_symbol, '', newline_symbol_removed_text)
31 | return special_symbol_removed_headline
32 |
33 | @classmethod
34 | def find_news_totalpage(cls, url):
35 | # Find the total number of list pages for the given day
36 | try:
37 | # Add a User-Agent header to avoid anti-crawling blocks
38 | request_content = requests.get(url, timeout=10, headers={'User-Agent':'Mozilla/5.0'})
39 | document_content = BeautifulSoup(request_content.content, 'html.parser')
40 | headline_tag = document_content.find('div', {'class': 'paging'}).find('strong')
41 | regex = re.compile(r'<strong.*?>(?P<num>\d+)')  # page number inside the <strong> tag
42 | match = regex.findall(str(headline_tag))
43 | return int(match[0])
44 | except Exception:
45 | return 0
46 |
--------------------------------------------------------------------------------
/korea_news_crawler/exceptions.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | # 처리 가능한 값보다 큰 값이 나왔을 때
4 | class OverFlow(Exception):
5 | def __init__(self, args):
6 | self.message = f'{args} is overflow'
7 |
8 | def __str__(self):
9 | return self.message
10 |
11 |
12 | # 처리 가능한 값보다 작은 값이 나왔을 때
13 | class UnderFlow(Exception):
14 | def __init__(self, args):
15 | self.message = f'{args} is underflow'
16 |
17 | def __str__(self):
18 | return self.message
19 |
20 |
21 | # 변수가 올바르지 않을 때
22 | class InvalidArgs(Exception):
23 | def __init__(self, args):
24 | self.message = f'{args} is Invalid Arguments'
25 |
26 | def __str__(self):
27 | return self.message
28 |
29 |
30 | # 카테고리가 올바르지 않을 때
31 | class InvalidCategory(Exception):
32 | def __init__(self, category):
33 | self.message = f'{category} is Invalid Category.'
34 |
35 | def __str__(self):
36 | return self.message
37 |
38 |
39 | # 년도가 올바르지 않을 때
40 | class InvalidYear(Exception):
41 | def __init__(self, start_year, end_year):
42 | self.message = f'{start_year}(start year) is bigger than {end_year}(end year)'
43 |
44 | def __str__(self):
45 | return str(self.message)
46 |
47 |
48 | # 달이 올바르지 않을 때
49 | class InvalidMonth(Exception):
50 | def __init__(self, month):
51 | self.message = f'{month} is an invalid month'
52 |
53 | def __str__(self):
54 | return self.message
55 |
56 | # 일이 올바르지 않을 때
57 | class InvalidDay(Exception):
58 | def __init__(self, day):
59 | self.message = f'{day} is an invalid day'
60 |
61 | def __str__(self):
62 | return self.message
63 |
64 |
65 |
66 | # 시작 달과 끝나는 달이 올바르지 않을 때
67 | class OverbalanceMonth(Exception):
68 | def __init__(self, start_month, end_month):
69 | self.message = f'{start_month}(start month) is an overbalance with {end_month}(end month)'
70 |
71 | def __str__(self):
72 | return self.message
73 |
74 | class OverbalanceDay(Exception):
75 | def __init__(self, start_day, end_day):
76 | self.message = f'{start_day}(start day) is an overbalance with {end_day}(end day)'
77 |
78 | def __str__(self):
79 | return self.message
80 |
81 |
82 | # 실행시간이 너무 길어서 데이터를 얻을 수 없을 때
83 | class ResponseTimeout(Exception):
84 | def __init__(self):
85 | self.message = "Couldn't get the data"
86 |
87 | def __str__(self):
88 | return self.message
89 |
90 |
91 | # 존재하는 파일
92 | class ExistFile(Exception):
93 | def __init__(self, path):
94 | absolute_path = os.path.abspath(path)
95 | self.message = f'{absolute_path} already exists'
96 |
97 | def __str__(self):
98 | return self.message
99 |
--------------------------------------------------------------------------------
/korea_news_crawler/sample.py:
--------------------------------------------------------------------------------
1 | from korea_news_crawler.articlecrawler import ArticleCrawler
2 | if __name__ == "__main__":
3 | Crawler = ArticleCrawler()
4 | # Available categories: politics, economy, living culture, IT/science, society, world
5 | Crawler.set_category("IT과학", "세계")
6 | # Crawl from December 2017 (the 1st) through January 13, 2018; dates are given in YYYY-MM-DD form
7 | Crawler.set_date_range('2017-12', '2018-01-13')
8 | Crawler.start()
9 |
--------------------------------------------------------------------------------
/korea_news_crawler/sportcrawler.py:
--------------------------------------------------------------------------------
1 | import calendar
2 | import csv
3 | import requests
4 | import re
5 | import json
6 | from bs4 import BeautifulSoup
7 | from time import sleep
8 | from multiprocessing import Process
9 | from korea_news_crawler.exceptions import *
10 | from korea_news_crawler.writer import Writer
11 | #from korea_news_crawler.writer import Writer
12 |
13 |
14 | class SportCrawler:
15 | def __init__(self):
16 | self.category = {'한국야구': "kbaseball", '해외야구': "wbaseball", '해외축구': "wfootball",
17 | '한국축구': "kfootball", '농구': "basketball", '배구': "volleyball", '일반 스포츠': "general",
18 | 'e스포츠': "esports",
19 | 'korea baseball': "kbaseball", 'world baseball': "wbaseball", 'world football': "wfootball",
20 | 'korea football': "kfootball", 'basketball': "basketball", 'volleyball': "volleyball",
21 | 'general sports': "general",
22 | 'e-sports': "esports"
23 | }
24 | self.selected_category = []
25 | self.selected_url_category = []
26 | self.date = {'start_year': 0, 'start_month': 0, 'end_year': 0, 'end_month': 0}
27 |
28 | def get_total_page(self, url):
29 | totalpage_url = f'{url}&page=10000'
30 | request_content = requests.get(totalpage_url,headers={'User-Agent': 'Mozilla/5.0'})
31 | page_number = re.findall('\"totalPages\":(.*)}', request_content.text)
32 | return int(page_number[0])
33 |
34 | def content(self, html_document, url_label):
35 | content_match = []
36 | Tag = html_document.find_all('script', {'type': 'text/javascript'})
37 | Tag_ = re.sub(',"officeName', '\nofficeName', str(Tag))
38 | regex = re.compile('oid":"(?P<oid>\d+)","aid":"(?P<aid>\d+)"')
39 | content = regex.findall(Tag_)
40 | for oid_aid in content:
41 | maked_url = "https://sports.news.naver.com/" + url_label + "/news/read.nhn?oid=" + oid_aid[0] + "&aid=" + \
42 | oid_aid[1]
43 | content_match.append(maked_url)
44 | return content_match
45 |
46 | def clear_content(self, text):
47 | remove_special = re.sub('[∙©\{\}\[\]\/?,;:|\)*~`!^\-_+<>@\#$%&n▲▶◆◀■\\\=\(\'\"]', '', text)
48 | remove_author = re.sub('\w\w\w 기자', '', remove_special)
49 | remove_flash_error = re.sub('본문 내용|TV플레이어| 동영상 뉴스|flash 오류를 우회하기 위한 함수 추가fuctio flashremoveCallback|tt|t|앵커 멘트|xa0', '', remove_author)
50 | # 공백 에러 삭제
51 | remove_strip = remove_flash_error.strip().replace(' ', '')
52 | # 기사 내용을 reverse 한다.
53 | reverse_content = ''.join(reversed(remove_strip))
54 | cleared_content = ''
55 | for i in range(0, len(remove_strip)):
56 | # The text was reversed above, so the final sentence ending '다.' appears as '.다'; once found, everything before it in the reversed string (ads, reporter credits, etc.) is discarded
57 | # Known issue: this heuristic cuts off English-language articles; if such edge cases are rare, matching '.' instead of '.다' may be preferable
58 | if reverse_content[i:i + 2] == '.다':
59 | cleared_content = ''.join(reversed(reverse_content[i:]))
60 | break
61 | cleared_content = re.sub('if deployPhase(.*)displayRMCPlayer', '', cleared_content)
62 | return cleared_content
63 |
64 | def clear_headline(self, text):
65 | first = re.sub('[∙©\{\}\[\]\/?,;:|\)*~`!^\-_+<>@\#$%&n▲▶◆◀■\\\=\(\'\"]', '', text)
66 | return first
67 |
68 | def make_sport_page_url(self, input_url, start_year, last_year, start_month, last_month):
69 | urls = []
70 | for year in range(start_year, last_year + 1):
71 | target_start_month = start_month
72 | target_last_month = last_month
73 |
74 | if year != last_year:
75 | target_start_month = 1
76 | target_last_month = 12
77 | else:
78 | target_start_month = start_month
79 | target_last_month = last_month
80 |
81 | for month in range(target_start_month, target_last_month + 1):
82 | for day in range(1, calendar.monthrange(year, month)[1] + 1):
83 | if len(str(month)) == 1:
84 | month = "0" + str(month)
85 | if len(str(day)) == 1:
86 | day = "0" + str(day)
87 | url = f'{input_url}{year}{month}{day}'
88 | # URL with the date filled in but no page parameter yet
89 | final_url = f'{input_url}{year}{month}{day}'
90 |
91 | # TotalPage 확인
92 | total_page = self.get_total_page(final_url)
93 | for page in range(1, total_page + 1):
94 | # URL for this page
95 | url = f'{final_url}&page={page}'
96 | # collected as [page1, page2, page3, ...]
97 | urls.append(url)
98 | return urls
99 |
100 | def crawling(self, category_name):
101 | writer = Writer(category='Sport', article_category=category_name, date=self.date)
102 | url_category = [self.category[category_name]]
103 | category = [category_name]
104 |
105 |
106 |
107 | # URL category. With multiprocessing this loop only runs once, so it could use refactoring
108 | for url_label in url_category:
109 | # The URL index and the category index line up, so the corresponding values match
110 | category = category[url_category.index(url_label)]
111 | url = f'https://sports.news.naver.com/{url_label}/news/list.nhn?isphoto=N&view=photo&date='
112 | final_url_day = self.make_sport_page_url(url, self.date['start_year'],
113 | self.date['end_year'], self.date['start_month'], self.date['end_month'])
114 | print("succeed making url")
115 | print("crawler starts")
116 | if len(str(self.date['start_month'])) == 2:
117 | start_month = str(self.date['start_month'])
118 | else:
119 | start_month = '0' + str(self.date['start_month'])
120 |
121 | if len(str(self.date['end_month'])) == 2:
122 | end_month = str(self.date['end_month'])
123 | else:
124 | end_month = '0' + str(self.date['end_month'])
125 |
126 | # category Year Month Data Page 처리 된 URL
127 | for list_page in final_url_day:
128 | title_script = ''
129 | office_name_script = ''
130 | time_script = ''
131 | matched_content = ''
132 |
133 | # 제목 / URL
134 | request_content = requests.get(list_page, headers={'User-Agent': 'Mozilla/5.0'})
135 | content_dict = json.loads(request_content.text)
136 | # 이는 크롤링에 사용
137 |
138 | hef_script = ''
139 |
140 | for contents in content_dict["list"]:
141 | oid = contents['oid']
142 | aid = contents['aid']
143 | title_script = contents['title']
144 | time_script = contents['datetime']
145 | hef_script = "https://sports.news.naver.com/news.nhn?oid=" + oid + "&aid=" + aid
146 | office_name_script = contents['officeName']
147 | sleep(0.01)
148 | content_request_content = requests.get(hef_script, headers={'User-Agent': 'Mozilla/5.0'})
149 | content_document_content = BeautifulSoup(content_request_content.content, 'html.parser')
150 | content_tag_content = content_document_content.find_all('div', {'class': 'news_end',
151 | 'id': 'newsEndContents'})  # class and id belong in a single attrs dict
152 | # 뉴스 기사 본문 내용 초기화
153 | text_sentence = ''
154 |
155 | try:
156 | text_sentence = text_sentence + str(content_tag_content[0].find_all(text=True))
157 | matched_content = self.clear_content(text_sentence)
158 | writer.write_row([time_script, category, office_name_script, self.clear_headline(title_script),
159 | matched_content,
160 | hef_script])
161 | except:
162 | pass
163 | writer.close()
164 |
165 | def set_category(self, *args):
166 | for key in args:
167 | if self.category.get(key) is None:
168 | raise InvalidCategory(key)
169 | self.selected_category = args
170 | for selected in self.selected_category:
171 | self.selected_url_category.append(self.category[selected])
172 |
173 | def start(self):
174 | # Start one crawling process per selected category
175 | for category_name in self.selected_category:
176 | proc = Process(target=self.crawling, args=(category_name,))
177 | proc.start()
178 |
179 | def set_date_range(self, start_year, start_month, end_year, end_month):
180 | self.date['start_year'] = start_year
181 | self.date['start_month'] = start_month
182 | self.date['end_year'] = end_year
183 | self.date['end_month'] = end_month
184 |
185 |
186 | # Main
187 | if __name__ == "__main__":
188 | Spt_crawler = SportCrawler()
189 | Spt_crawler.set_category('한국야구','한국축구')
190 | Spt_crawler.set_date_range(2020, 12, 2020, 12)
191 | Spt_crawler.start()
192 |
--------------------------------------------------------------------------------
/korea_news_crawler/sports_crawler_sample.py:
--------------------------------------------------------------------------------
1 | from korea_news_crawler.sportcrawler import SportCrawler
2 |
3 | if __name__ == "__main__":
4 | Spt_crawler = SportCrawler()
5 | Spt_crawler.set_category('한국야구', '한국축구')
6 | Spt_crawler.set_date_range(2020, 11, 2020, 11)
7 | Spt_crawler.start()
8 |
--------------------------------------------------------------------------------
/korea_news_crawler/writer.py:
--------------------------------------------------------------------------------
1 | import csv
2 | import os, platform  # os is used for the path handling in initialize_file
3 | from korea_news_crawler.exceptions import *
4 |
5 |
6 | class Writer(object):
7 | def __init__(self, category, article_category, date):
8 | self.start_year = date['start_year']
9 | self.start_month = f'0{date["start_month"]}' if len(str(date['start_month'])) == 1 else str(date['start_month'])
10 | self.start_day = f'0{date["start_day"]}' if len(str(date['start_day'])) == 1 else str(date['start_day'])
11 | self.end_year = date['end_year']
12 | self.end_month = f'0{date["end_month"]}' if len(str(date['end_month'])) == 1 else str(date['end_month'])
13 | self.end_day = f'0{date["end_day"]}' if len(str(date['end_day'])) == 1 else str(date['end_day'])
14 | self.file = None
15 | self.initialize_file(category, article_category)
16 |
17 | self.csv_writer = csv.writer(self.file)
18 |
19 | def initialize_file(self, category, article_category):
20 | output_path = '../output'  # CSV files go one level above the current working directory
21 | if not os.path.exists(output_path):
22 | os.mkdir(output_path)
23 |
24 | file_name = f'{output_path}/{category}_{article_category}_{self.start_year}{self.start_month}{self.start_day}_{self.end_year}{self.end_month}{self.end_day}.csv'
25 | if os.path.isfile(file_name) and os.path.getsize(file_name) != 0:
26 | raise ExistFile(file_name)
27 |
28 | user_os = str(platform.system())
29 | if user_os == "Windows":
30 | self.file = open(file_name, 'w', encoding='euc-kr', newline='')
31 | # Other OS uses utf-8
32 | else:
33 | self.file = open(file_name, 'w', encoding='utf-8', newline='')
34 |
35 | def write_row(self, arg):
36 | self.csv_writer.writerow(arg)
37 |
38 | def close(self):
39 | self.file.close()
40 |
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description_file = README.md
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 |
3 | # build package command: python setup.py bdist_wheel
4 | # release package command: twine upload dist/KoreaNewsCrawler-${version}-py3-none-any.whl
5 |
6 | setup(
7 | name = 'KoreaNewsCrawler',
8 | version = '1.51',
9 | description = 'Crawl the korean news',
10 | author = 'lumyjuwon',
11 | author_email = 'lumyjuwon@gmail.com',
12 | url = 'https://github.com/lumyjuwon/KoreaNewsCrawler',
13 | download_url = 'https://github.com/lumyjuwon/KoreaNewsCrawler/archive/1.51.tar.gz',
14 | install_requires = ['requests', 'beautifulsoup4'],
15 | packages = ['korea_news_crawler'],
16 | keywords = ['crawl', 'KoreaNews', 'crawler'],
17 | python_requires = '>=3.6',
18 | zip_safe=False,
19 | classifiers = [
20 | 'Programming Language :: Python :: 3.6'
21 | ]
22 | )
--------------------------------------------------------------------------------