├── .gitignore ├── LICENSE ├── README.md ├── dist ├── KoreaNewsCrawler-1.12-py3-none-any.whl ├── KoreaNewsCrawler-1.2-py3-none-any.whl ├── KoreaNewsCrawler-1.20-py3-none-any.whl ├── KoreaNewsCrawler-1.30-py3-none-any.whl ├── KoreaNewsCrawler-1.41-py3-none-any.whl ├── KoreaNewsCrawler-1.50-py3-none-any.whl └── KoreaNewsCrawler-1.51-py3-none-any.whl ├── img ├── article_result.PNG ├── multi_process.PNG └── sport_resultimg.PNG ├── korea_news_crawler ├── __init__.py ├── articlecrawler.py ├── articleparser.py ├── exceptions.py ├── sample.py ├── sportcrawler.py ├── sports_crawler_sample.py └── writer.py ├── setup.cfg └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | __pycache__ 3 | build 4 | output 5 | korea_news_crawler/.idea 6 | korea_news_crawler/__pycache__ 7 | KoreaNewsCrawler.egg-info 8 | MENIFEST.in 9 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 lumyjuwon 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # KoreaNewsCrawler 2 | 3 | This crawler collects news articles that media organizations publish on the NAVER portal. 4 | Crawlable article categories are politics, economy, society, living culture, world, IT/science, and opinion. 5 | Sports categories include Korean baseball, world baseball, Korean football, world football, basketball, volleyball, general sports, and e-sports. 6 | 7 | ## How to install 8 | pip install KoreaNewsCrawler 9 | 10 | ## Method 11 | 12 | * **set_category(category_name, ...)** 13 | 14 | This method sets the categories you want to collect. 15 | The categories that can be passed are 'politics', 'economy', 'society', 'living_culture', 'IT_science', 'world', and 'opinion'; the Korean names (e.g. '정치', 'IT과학') are also accepted. 16 | You can pass multiple categories at once. 17 | category_name: politics, economy, society, living_culture, IT_science, world, opinion 18 | 19 | * **set_date_range(start_date, end_date)** 20 | 21 | This method sets the period of news to collect. ArticleCrawler takes both dates as strings in YYYY, YYYY-MM, or YYYY-MM-DD format; an omitted month or day defaults to the start of the range for start_date and to the end of the range for end_date. SportCrawler instead takes four integers: set_date_range(start_year, start_month, end_year, end_month).
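For example, here is a minimal sketch of the date formats ArticleCrawler accepts, based on the parsing in articlecrawler.py:

```
from korea_news_crawler.articlecrawler import ArticleCrawler

crawler = ArticleCrawler()
crawler.set_category("politics")

# "2018"       -> expands to 2018-01-01 as a start date or 2018-12-31 as an end date
# "2018-03"    -> expands to the first or last day of that month
# "2018-03-15" -> used as-is
crawler.set_date_range("2018-03", "2018-04-15")  # collects 2018-03-01 through 2018-04-15
crawler.start()
```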
22 | 23 | * **start()** 24 | 25 | This method starts the crawl; each selected category is collected in its own process. 26 | 27 | ## Article News Crawler Example 28 | ``` 29 | from korea_news_crawler.articlecrawler import ArticleCrawler 30 | 31 | Crawler = ArticleCrawler() 32 | Crawler.set_category("politics", "IT_science", "economy") 33 | Crawler.set_date_range("2017-01", "2018-04-20") 34 | Crawler.start() 35 | ``` 36 | This crawls the politics, IT_science, and economy categories in parallel, one process per category, from January 1, 2017 to April 20, 2018. 37 | 38 | ## Sports News Crawler Example 39 | The methods are similar to ArticleCrawler, except that set_date_range takes integer year and month arguments: set_date_range(start_year, start_month, end_year, end_month). 40 | ``` 41 | from korea_news_crawler.sportcrawler import SportCrawler 42 | 43 | Spt_crawler = SportCrawler() 44 | Spt_crawler.set_category('korea baseball', 'korea football') 45 | Spt_crawler.set_date_range(2017, 1, 2018, 4) 46 | Spt_crawler.start() 47 | ``` 48 | This crawls Korean baseball and Korean football news in parallel, one process per category, from January 2017 to April 2018. 49 | 50 | ## Results 51 | ![ex_screenshot](./img/article_result.PNG) 52 | ![ex_screenshot](./img/sport_resultimg.PNG) 53 | 54 | Column A: Article date & time 55 | Column B: Article category 56 | Column C: Media company 57 | Column D: Article title 58 | Column E: Article body 59 | Column F: Article URL 60 | All collected data is saved as a CSV file, encoded as euc-kr on Windows and utf-8 on other systems; a short loading example follows below. 61 | -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.12-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.12-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.2-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.2-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.20-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.20-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.30-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.30-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.41-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.41-py3-none-any.whl -------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.50-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.50-py3-none-any.whl
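Referring back to the Results section of the README above, here is a minimal sketch for reading an output CSV. The file name and location are hypothetical (writer.py writes Category_category_startdate_enddate.csv into an output directory next to the package); the files have no header row, and the encoding is euc-kr on Windows and utf-8 elsewhere:

```
import csv

# Hypothetical example path; adjust to where the crawler actually wrote its output
path = "output/Article_politics_20170101_20180420.csv"

with open(path, encoding="euc-kr") as f:  # use encoding="utf-8" on macOS/Linux
    for date, category, company, title, body, url in csv.reader(f):
        print(date, company, title)
```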
-------------------------------------------------------------------------------- /dist/KoreaNewsCrawler-1.51-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/dist/KoreaNewsCrawler-1.51-py3-none-any.whl -------------------------------------------------------------------------------- /img/article_result.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/img/article_result.PNG -------------------------------------------------------------------------------- /img/multi_process.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/img/multi_process.PNG -------------------------------------------------------------------------------- /img/sport_resultimg.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lumyjuwon/KoreaNewsCrawler/06689c7e70cfec9d008434bfa317c1614c7d17af/img/sport_resultimg.PNG -------------------------------------------------------------------------------- /korea_news_crawler/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = "lumyjuwon" 2 | __version__ = "1.50" 3 | __copyright__ = "Copyright (c) lumyjuwon" 4 | __license__ = "MIT" 5 | 6 | from .articlecrawler import * 7 | from .articleparser import * 8 | from .exceptions import * 9 | from .sample import * 10 | from .sportcrawler import * 11 | from .writer import * 12 | -------------------------------------------------------------------------------- /korea_news_crawler/articlecrawler.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8, euc-kr -*- 3 | 4 | import os 5 | import platform 6 | import calendar 7 | import requests 8 | import re 9 | from time import sleep 10 | from bs4 import BeautifulSoup 11 | from multiprocessing import Process 12 | from korea_news_crawler.exceptions import * 13 | from korea_news_crawler.articleparser import ArticleParser 14 | from korea_news_crawler.writer import Writer 15 | 16 | class ArticleCrawler(object): 17 | def __init__(self): 18 | self.categories = {'정치': 100, '경제': 101, '사회': 102, '생활문화': 103, '세계': 104, 'IT과학': 105, '오피니언': 110, 19 | 'politics': 100, 'economy': 101, 'society': 102, 'living_culture': 103, 'world': 104, 'IT_science': 105, 'opinion': 110} 20 | self.selected_categories = [] 21 | self.date = {'start_year': 0, 'start_month': 0, 'start_day' : 0, 'end_year': 0, 'end_month': 0, 'end_day':0} 22 | self.user_operating_system = str(platform.system()) 23 | 24 | def set_category(self, *args): 25 | for key in args: 26 | if self.categories.get(key) is None: 27 | raise InvalidCategory(key) 28 | self.selected_categories = args 29 | 30 | def set_date_range(self, start_date:str, end_date:str): 31 | start = list(map(int, start_date.split("-"))) 32 | end = list(map(int, end_date.split("-"))) 33 | 34 | # Setting Start Date 35 | if len(start) == 1: # Input Only Year 36 | start_year = start[0] 37 | start_month = 1 38 | start_day = 1 39 | elif len(start) == 2: # Input Year and month 40 | start_year, start_month = start 41 | start_day = 1 42 | elif len(start) == 3: # Input Year, 
month and day 43 | start_year, start_month, start_day = start 44 | 45 | # Setting End Date 46 | if len(end) == 1: # Input Only Year 47 | end_year = end[0] 48 | end_month = 12 49 | end_day = 31 50 | elif len(end) == 2: # Input Year and month 51 | end_year, end_month = end 52 | end_day = calendar.monthrange(end_year, end_month)[1] 53 | elif len(end) == 3: # Input Year, month and day 54 | end_year, end_month, end_day = end 55 | 56 | args = [start_year, start_month, start_day, end_year, end_month, end_day] 57 | 58 | if start_year > end_year: 59 | raise InvalidYear(start_year, end_year) 60 | if start_month < 1 or start_month > 12: 61 | raise InvalidMonth(start_month) 62 | if end_month < 1 or end_month > 12: 63 | raise InvalidMonth(end_month) 64 | if start_day < 1 or calendar.monthrange(start_year, start_month)[1] < start_day: 65 | raise InvalidDay(start_day) 66 | if end_day < 1 or calendar.monthrange(end_year, end_month)[1] < end_day: 67 | raise InvalidDay(end_day) 68 | if start_year == end_year and start_month > end_month: 69 | raise OverbalanceMonth(start_month, end_month) 70 | if start_year == end_year and start_month == end_month and start_day > end_day: 71 | raise OverbalanceDay(start_day, end_day) 72 | 73 | for key, date in zip(self.date, args): 74 | self.date[key] = date 75 | print(self.date) 76 | 77 | @staticmethod 78 | def make_news_page_url(category_url, date): 79 | made_urls = [] 80 | for year in range(date['start_year'], date['end_year'] + 1): 81 | if date['start_year'] == date['end_year']: 82 | target_start_month = date['start_month'] 83 | target_end_month = date['end_month'] 84 | else: 85 | if year == date['start_year']: 86 | target_start_month = date['start_month'] 87 | target_end_month = 12 88 | elif year == date['end_year']: 89 | target_start_month = 1 90 | target_end_month = date['end_month'] 91 | else: 92 | target_start_month = 1 93 | target_end_month = 12 94 | 95 | for month in range(target_start_month, target_end_month + 1): 96 | if date['start_year'] == date['end_year'] and date['start_month'] == date['end_month']: 97 | target_start_day = date['start_day'] 98 | target_end_day = date['end_day'] 99 | else: 100 | if year == date['start_year'] and month == date['start_month']: 101 | target_start_day = date['start_day'] 102 | target_end_day = calendar.monthrange(year, month)[1] 103 | elif year == date['end_year'] and month == date['end_month']: 104 | target_start_day = 1 105 | target_end_day = date['end_day'] 106 | else: 107 | target_start_day = 1 108 | target_end_day = calendar.monthrange(year, month)[1] 109 | 110 | for day in range(target_start_day, target_end_day + 1): 111 | if len(str(month)) == 1: 112 | month = "0" + str(month) 113 | if len(str(day)) == 1: 114 | day = "0" + str(day) 115 | 116 | # Generate the list-page URL for each date 117 | url = category_url + str(year) + str(month) + str(day) 118 | 119 | # The total page count is found by exploiting Naver's paging: requesting page=10000, 120 | # which does not exist, redirects to the last existing page (page=totalpage) 121 | totalpage = ArticleParser.find_news_totalpage(url + "&page=10000") 122 | for page in range(1, totalpage + 1): 123 | made_urls.append(url + "&page=" + str(page)) 124 | return made_urls 125 | 126 | @staticmethod 127 | def get_url_data(url, max_tries=5): 128 | remaining_tries = int(max_tries) 129 | while remaining_tries > 0: 130 | try: 131 | return requests.get(url, headers={'User-Agent':'Mozilla/5.0'}) 132 | except requests.exceptions.RequestException: 133 | sleep(1) 134 | remaining_tries = remaining_tries - 1 135 | raise ResponseTimeout() 136 | 137 | def crawling(self, category_name): 138 | # Multi Process PID
139 | print(category_name + " PID: " + str(os.getpid())) 140 | 141 | writer = Writer(category='Article', article_category=category_name, date=self.date) 142 | # 기사 url 형식 143 | url_format = f'http://news.naver.com/main/list.nhn?mode=LSD&mid=sec&sid1={self.categories.get(category_name)}&date=' 144 | # start_year년 start_month월 start_day일 부터 ~ end_year년 end_month월 end_day일까지 기사를 수집합니다. 145 | target_urls = self.make_news_page_url(url_format, self.date) 146 | print(f'{category_name} Urls are generated') 147 | 148 | print(f'{category_name} is collecting ...') 149 | for url in target_urls: 150 | request = self.get_url_data(url) 151 | document = BeautifulSoup(request.content, 'html.parser') 152 | 153 | # html - newsflash_body - type06_headline, type06 154 | # 각 페이지에 있는 기사들 가져오기 155 | temp_post = document.select('.newsflash_body .type06_headline li dl') 156 | temp_post.extend(document.select('.newsflash_body .type06 li dl')) 157 | 158 | # 각 페이지에 있는 기사들의 url 저장 159 | post_urls = [] 160 | for line in temp_post: 161 | # 해당되는 page에서 모든 기사들의 URL을 post_urls 리스트에 넣음 162 | post_urls.append(line.a.get('href')) 163 | del temp_post 164 | 165 | for content_url in post_urls: # 기사 url 166 | # 크롤링 대기 시간 167 | sleep(0.01) 168 | 169 | # 기사 HTML 가져옴 170 | request_content = self.get_url_data(content_url) 171 | 172 | try: 173 | document_content = BeautifulSoup(request_content.content, 'html.parser') 174 | except: 175 | continue 176 | try: 177 | # 기사 제목 가져옴 178 | tag_headline = document_content.find_all('h2', {'class': 'media_end_head_headline'}) 179 | # 뉴스 기사 제목 초기화 180 | text_headline = '' 181 | text_headline = text_headline + ArticleParser.clear_headline(str(tag_headline[0].find_all(text=True))) 182 | # 공백일 경우 기사 제외 처리 183 | if not text_headline: 184 | continue 185 | #
186 | 187 | # 기사 본문 가져옴 188 | tag_content = document_content.find_all('div', {'id': 'dic_area'}) 189 | # 뉴스 기사 본문 초기화 190 | text_sentence = '' 191 | text_sentence = text_sentence + ArticleParser.clear_content(str(tag_content[0].find_all(text=True))) 192 | # 공백일 경우 기사 제외 처리 193 | if not text_sentence: 194 | continue 195 | 196 | # 기사 언론사 가져옴 197 | tag_content = document_content.find_all('meta', {'property': 'og:article:author'}) 198 | # 언론사 초기화 199 | text_company = '' 200 | text_company = text_company + tag_content[0]['content'].split("|")[0] 201 | 202 | # 공백일 경우 기사 제외 처리 203 | if not text_company: 204 | continue 205 | 206 | # 기사 시간대 가져옴 207 | time = document_content.find_all('span',{'class':"media_end_head_info_datestamp_time _ARTICLE_DATE_TIME"})[0]['data-date-time'] 208 | 209 | # CSV 작성 210 | writer.write_row([time, category_name, text_company, text_headline, text_sentence, content_url]) 211 | 212 | del time 213 | del text_company, text_sentence, text_headline 214 | del tag_company 215 | del tag_content, tag_headline 216 | del request_content, document_content 217 | 218 | # UnicodeEncodeError 219 | except Exception as ex: 220 | del request_content, document_content 221 | pass 222 | writer.close() 223 | 224 | def start(self): 225 | # MultiProcess 크롤링 시작 226 | for category_name in self.selected_categories: 227 | proc = Process(target=self.crawling, args=(category_name,)) 228 | proc.start() 229 | 230 | 231 | if __name__ == "__main__": 232 | Crawler = ArticleCrawler() 233 | Crawler.set_category('생활문화') 234 | Crawler.set_date_range('2018-01', '2018-02') 235 | Crawler.start() 236 | -------------------------------------------------------------------------------- /korea_news_crawler/articleparser.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | from bs4 import BeautifulSoup 4 | 5 | 6 | class ArticleParser(object): 7 | special_symbol = re.compile('[\{\}\[\]\/?,;:|\)*~`!^\-_+<>@\#$&▲▶◆◀■【】\\\=\(\'\"]') 8 | content_pattern = re.compile('본문 내용|TV플레이어| 동영상 뉴스|flash 오류를 우회하기 위한 함수 추가function flash removeCallback|tt|앵커 멘트|xa0') 9 | 10 | @classmethod 11 | def clear_content(cls, text): 12 | # 기사 본문에서 필요없는 특수문자 및 본문 양식 등을 다 지움 13 | newline_symbol_removed_text = text.replace('\\n', '').replace('\\t', '').replace('\\r', '') 14 | special_symbol_removed_content = re.sub(cls.special_symbol, ' ', newline_symbol_removed_text) 15 | end_phrase_removed_content = re.sub(cls.content_pattern, '', special_symbol_removed_content) 16 | blank_removed_content = re.sub(' +', ' ', end_phrase_removed_content).lstrip() # 공백 에러 삭제 17 | reversed_content = ''.join(reversed(blank_removed_content)) # 기사 내용을 reverse 한다. 
18 | content = '' 19 | for i in range(0, len(blank_removed_content)): 20 | # In the reversed text, the first '.다' marks the end of the article body, so everything after the body (ads, reporter info, etc.) is removed 21 | if reversed_content[i:i + 2] == '.다': 22 | content = ''.join(reversed(reversed_content[i:])) 23 | break 24 | return content 25 | 26 | @classmethod 27 | def clear_headline(cls, text): 28 | # Remove unnecessary special characters from the headline 29 | newline_symbol_removed_text = text.replace('\\n', '').replace('\\t', '').replace('\\r', '') 30 | special_symbol_removed_headline = re.sub(cls.special_symbol, '', newline_symbol_removed_text) 31 | return special_symbol_removed_headline 32 | 33 | @classmethod 34 | def find_news_totalpage(cls, url): 35 | # Find the total number of list pages for the given day 36 | try: 37 | # Added headers to avoid anti-crawling 38 | request_content = requests.get(url, timeout=10, headers={'User-Agent':'Mozilla/5.0'}) 39 | document_content = BeautifulSoup(request_content.content, 'html.parser') 40 | headline_tag = document_content.find('div', {'class': 'paging'}).find('strong') 41 | regex = re.compile(r'(\d+)') 42 | match = regex.findall(str(headline_tag)) 43 | return int(match[0]) 44 | except Exception: 45 | return 0 46 | -------------------------------------------------------------------------------- /korea_news_crawler/exceptions.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | # Raised when a value is larger than can be handled 4 | class OverFlow(Exception): 5 | def __init__(self, args): 6 | self.message = f'{args} is overflow' 7 | 8 | def __str__(self): 9 | return self.message 10 | 11 | 12 | # Raised when a value is smaller than can be handled 13 | class UnderFlow(Exception): 14 | def __init__(self, args): 15 | self.message = f'{args} is underflow' 16 | 17 | def __str__(self): 18 | return self.message 19 | 20 | 21 | # Raised when an argument is invalid 22 | class InvalidArgs(Exception): 23 | def __init__(self, args): 24 | self.message = f'{args} is Invalid Arguments' 25 | 26 | def __str__(self): 27 | return self.message 28 | 29 | 30 | # Raised when the category is invalid 31 | class InvalidCategory(Exception): 32 | def __init__(self, category): 33 | self.message = f'{category} is Invalid Category.'
34 | 35 | def __str__(self): 36 | return self.message 37 | 38 | 39 | # 년도가 올바르지 않을 때 40 | class InvalidYear(Exception): 41 | def __init__(self, start_year, end_year): 42 | self.message = f'{start_year}(start year) is bigger than {end_year}(end year)' 43 | 44 | def __str__(self): 45 | return str(self.message) 46 | 47 | 48 | # 달이 올바르지 않을 때 49 | class InvalidMonth(Exception): 50 | def __init__(self, month): 51 | self.message = f'{month} is an invalid month' 52 | 53 | def __str__(self): 54 | return self.message 55 | 56 | # 일이 올바르지 않을 때 57 | class InvalidDay(Exception): 58 | def __init__(self, day): 59 | self.message = f'{day} is an invalid day' 60 | 61 | def __str__(self): 62 | return self.message 63 | 64 | 65 | 66 | # 시작 달과 끝나는 달이 올바르지 않을 때 67 | class OverbalanceMonth(Exception): 68 | def __init__(self, start_month, end_month): 69 | self.message = f'{start_month}(start month) is an overbalance with {end_month}(end month)' 70 | 71 | def __str__(self): 72 | return self.message 73 | 74 | class OverbalanceDay(Exception): 75 | def __init__(self, start_day, end_day): 76 | self.message = f'{start_day}(start day) is an overbalance with {end_day}(end day)' 77 | 78 | def __str__(self): 79 | return self.message 80 | 81 | 82 | # 실행시간이 너무 길어서 데이터를 얻을 수 없을 때 83 | class ResponseTimeout(Exception): 84 | def __init__(self): 85 | self.message = "Couldn't get the data" 86 | 87 | def __str__(self): 88 | return self.message 89 | 90 | 91 | # 존재하는 파일 92 | class ExistFile(Exception): 93 | def __init__(self, path): 94 | absolute_path = os.path.abspath(path) 95 | self.message = f'{absolute_path} already exist' 96 | 97 | def __str__(self): 98 | return self.message 99 | -------------------------------------------------------------------------------- /korea_news_crawler/sample.py: -------------------------------------------------------------------------------- 1 | from korea_news_crawler.articlecrawler import ArticleCrawler 2 | if __name__ == "__main__": 3 | Crawler = ArticleCrawler() 4 | # 정치, 경제, 생활문화, IT과학, 사회, 세계 카테고리 사용 가능 5 | Crawler.set_category("IT과학", "세계") 6 | # 2017년 12월 (1일) 부터 2018년 1월 13일까지 크롤링 시작 YYYY-MM-DD의 형식으로 입력 7 | Crawler.set_date_range('2017-12', '2018-01-13') 8 | Crawler.start() 9 | -------------------------------------------------------------------------------- /korea_news_crawler/sportcrawler.py: -------------------------------------------------------------------------------- 1 | import calendar 2 | import csv 3 | import requests 4 | import re 5 | import json 6 | from bs4 import BeautifulSoup 7 | from time import sleep 8 | from multiprocessing import Process 9 | from korea_news_crawler.exceptions import * 10 | from korea_news_crawler.writer import Writer 11 | #from korea_news_crawler.writer import Writer 12 | 13 | 14 | class SportCrawler: 15 | def __init__(self): 16 | self.category = {'한국야구': "kbaseball", '해외야구': "wbaseball", '해외축구': "wfootball", 17 | '한국축구': "kfootball", '농구': "basketball", '배구': "volleyball", '일반 스포츠': "general", 18 | 'e스포츠': "esports", 19 | 'korea baseball': "kbaseball", 'world baseball': "wbaseball", 'world football': "wfootball", 20 | 'korea football': "kfootball", 'basketball': "basketball", 'volleyball': "volleyball", 21 | 'general sports': "general", 22 | 'e-sports': "esports" 23 | } 24 | self.selected_category = [] 25 | self.selected_url_category = [] 26 | self.date = {'start_year': 0, 'start_month': 0, 'end_year': 0, 'end_month': 0} 27 | 28 | def get_total_page(self, url): 29 | totalpage_url = f'{url}&page=10000' 30 | request_content = 
requests.get(totalpage_url,headers={'User-Agent': 'Mozilla/5.0'}) 31 | page_number = re.findall('\"totalPages\":(.*)}', request_content.text) 32 | return int(page_number[0]) 33 | 34 | def content(self, html_document, url_label): 35 | content_match = [] 36 | Tag = html_document.find_all('script', {'type': 'text/javascript'}) 37 | Tag_ = re.sub(',"officeName', '\nofficeName', str(Tag)) 38 | regex = re.compile('oid":"(\d+)","aid":"(\d+)"') 39 | content = regex.findall(Tag_) 40 | for oid_aid in content: 41 | maked_url = "https://sports.news.naver.com/" + url_label + "/news/read.nhn?oid=" + oid_aid[0] + "&aid=" + \ 42 | oid_aid[1] 43 | content_match.append(maked_url) 44 | return content_match 45 | 46 | def clear_content(self, text): 47 | remove_special = re.sub('[∙©\{\}\[\]\/?,;:|\)*~`!^\-_+<>@\#$%&n▲▶◆◀■\\\=\(\'\"]', '', text) 48 | remove_author = re.sub('\w\w\w 기자', '', remove_special) 49 | remove_flash_error = re.sub('본문 내용|TV플레이어| 동영상 뉴스|flash 오류를 우회하기 위한 함수 추가fuctio flashremoveCallback|tt|t|앵커 멘트|xa0', '', remove_author) 50 | # Remove whitespace artifacts 51 | remove_strip = remove_flash_error.strip().replace(' ', '') 52 | # Reverse the article text 53 | reverse_content = ''.join(reversed(remove_strip)) 54 | cleared_content = '' 55 | for i in range(0, len(remove_strip)): 56 | # Because the text is reversed, the article body ends at the first '.다'; everything beyond it (ads, reporter info, etc.) is dropped 57 | # Known issue: English-language articles are not fully captured by this mechanism; if edge cases are rare, matching '.' instead of '.다' may work better 58 | if reverse_content[i:i + 2] == '.다': 59 | cleared_content = ''.join(reversed(reverse_content[i:])) 60 | break 61 | cleared_content = re.sub('if deployPhase(.*)displayRMCPlayer', '', cleared_content) 62 | return cleared_content 63 | 64 | def clear_headline(self, text): 65 | first = re.sub('[∙©\{\}\[\]\/?,;:|\)*~`!^\-_+<>@\#$%&n▲▶◆◀■\\\=\(\'\"]', '', text) 66 | return first 67 | 68 | def make_sport_page_url(self, input_url, start_year, last_year, start_month, last_month): 69 | urls = [] 70 | for year in range(start_year, last_year + 1): 71 | target_start_month = start_month 72 | target_last_month = last_month 73 | 74 | if year != last_year: 75 | target_start_month = 1 76 | target_last_month = 12 77 | else: 78 | target_start_month = start_month 79 | target_last_month = last_month 80 | 81 | for month in range(target_start_month, target_last_month + 1): 82 | for day in range(1, calendar.monthrange(year, month)[1] + 1): 83 | if len(str(month)) == 1: 84 | month = "0" + str(month) 85 | if len(str(day)) == 1: 86 | day = "0" + str(day) 87 | url = f'{input_url}{year}{month}{day}' 88 | # Temporarily store the URL that has the date but no page parameter yet 89 | final_url = f'{input_url}{year}{month}{day}' 90 | 91 | # Find the total page count 92 | total_page = self.get_total_page(final_url) 93 | for page in range(1, total_page + 1): 94 | # Build the URL for each page 95 | url = f'{final_url}&page={page}' 96 | # [page1, page2, page3, ...] 97 | urls.append(url) 98 | return urls 99 | 100 | def crawling(self, category_name): 101 | writer = Writer(category='Sport', article_category=category_name, date=self.date) 102 | url_category = [self.category[category_name]] 103 | category = [category_name] 104 | 105 | 106 | 107 | # URL 카테고리.
Multiprocessing시 어차피 1번 도는거라 refactoring할 필요 있어보임 108 | for url_label in url_category: 109 | # URL 인덱스와 category 인덱스가 일치할 경우 그 값도 일치 110 | category = category[url_category.index(url_label)] 111 | url = f'https://sports.news.naver.com/{url_label}/news/list.nhn?isphoto=N&view=photo&date=' 112 | final_url_day = self.make_sport_page_url(url, self.date['start_year'], 113 | self.date['end_year'], self.date['start_month'], self.date['end_month']) 114 | print("succeed making url") 115 | print("crawler starts") 116 | if len(str(self.date['start_month'])) == 2: 117 | start_month = str(self.date['start_month']) 118 | else: 119 | start_month = '0' + str(self.date['start_month']) 120 | 121 | if len(str(self.date['end_month'])) == 2: 122 | end_month = str(self.date['end_month']) 123 | else: 124 | end_month = '0' + str(self.date['end_month']) 125 | 126 | # category Year Month Data Page 처리 된 URL 127 | for list_page in final_url_day: 128 | title_script = '' 129 | office_name_script = '' 130 | time_script = '' 131 | matched_content = '' 132 | 133 | # 제목 / URL 134 | request_content = requests.get(list_page, headers={'User-Agent': 'Mozilla/5.0'}) 135 | content_dict = json.loads(request_content.text) 136 | # 이는 크롤링에 사용 137 | 138 | hef_script = '' 139 | 140 | for contents in content_dict["list"]: 141 | oid = contents['oid'] 142 | aid = contents['aid'] 143 | title_script = contents['title'] 144 | time_script = contents['datetime'] 145 | hef_script = "https://sports.news.naver.com/news.nhn?oid=" + oid + "&aid=" + aid 146 | office_name_script = contents['officeName'] 147 | sleep(0.01) 148 | content_request_content = requests.get(hef_script, headers={'User-Agent': 'Mozilla/5.0'}) 149 | content_document_content = BeautifulSoup(content_request_content.content, 'html.parser') 150 | content_tag_content = content_document_content.find_all('div', {'class': 'news_end'}, 151 | {'id': 'newsEndContents'}) 152 | # 뉴스 기사 본문 내용 초기화 153 | text_sentence = '' 154 | 155 | try: 156 | text_sentence = text_sentence + str(content_tag_content[0].find_all(text=True)) 157 | matched_content = self.clear_content(text_sentence) 158 | writer.write_row([time_script, category, office_name_script, self.clear_headline(title_script), 159 | matched_content, 160 | hef_script]) 161 | except: 162 | pass 163 | writer.close() 164 | 165 | def set_category(self, *args): 166 | for key in args: 167 | if self.category.get(key) is None: 168 | raise InvalidCategory(key) 169 | self.selected_category = args 170 | for selected in self.selected_category: 171 | self.selected_url_category.append(self.category[selected]) 172 | 173 | def start(self): 174 | # MultiProcess 크롤링 시작 175 | for category_name in self.selected_category: 176 | proc = Process(target=self.crawling, args=(category_name,)) 177 | proc.start() 178 | 179 | def set_date_range(self, start_year, start_month, end_year, end_month): 180 | self.date['start_year'] = start_year 181 | self.date['start_month'] = start_month 182 | self.date['end_year'] = end_year 183 | self.date['end_month'] = end_month 184 | 185 | 186 | # Main 187 | if __name__ == "__main__": 188 | Spt_crawler = SportCrawler() 189 | Spt_crawler.set_category('한국야구','한국축구') 190 | Spt_crawler.set_date_range(2020, 12, 2020, 12) 191 | Spt_crawler.start() 192 | -------------------------------------------------------------------------------- /korea_news_crawler/sports_crawler_sample.py: -------------------------------------------------------------------------------- 1 | from korea_news_crawler.sportcrawler import SportCrawler 2 | 3 | if __name__ == 
"__main__": 4 | Spt_crawler = SportCrawler() 5 | Spt_crawler.set_category('한국야구', '한국축구') 6 | Spt_crawler.set_date_range(2020, 11, 2020, 11) 7 | Spt_crawler.start() 8 | -------------------------------------------------------------------------------- /korea_news_crawler/writer.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import platform 3 | from korea_news_crawler.exceptions import * 4 | 5 | 6 | class Writer(object): 7 | def __init__(self, category, article_category, date): 8 | self.start_year = date['start_year'] 9 | self.start_month = f'0{date["start_month"]}' if len(str(date['start_month'])) == 1 else str(date['start_month']) 10 | self.start_day = f'0{date["start_day"]}' if len(str(date['start_day'])) == 1 else str(date['start_day']) 11 | self.end_year = date['end_year'] 12 | self.end_month = f'0{date["end_month"]}' if len(str(date['end_month'])) == 1 else str(date['end_month']) 13 | self.end_day = f'0{date["end_day"]}' if len(str(date['end_day'])) == 1 else str(date['end_day']) 14 | self.file = None 15 | self.initialize_file(category, article_category) 16 | 17 | self.csv_writer = csv.writer(self.file) 18 | 19 | def initialize_file(self, category, article_category): 20 | output_path = f'../output' 21 | if os.path.exists(output_path) is not True: 22 | os.mkdir(output_path) 23 | 24 | file_name = f'{output_path}/{category}_{article_category}_{self.start_year}{self.start_month}{self.start_day}_{self.end_year}{self.end_month}{self.end_day}.csv' 25 | if os.path.isfile(file_name) and os.path.getsize(file_name) != 0: 26 | raise ExistFile(file_name) 27 | 28 | user_os = str(platform.system()) 29 | if user_os == "Windows": 30 | self.file = open(file_name, 'w', encoding='euc-kr', newline='') 31 | # Other OS uses utf-8 32 | else: 33 | self.file = open(file_name, 'w', encoding='utf-8', newline='') 34 | 35 | def write_row(self, arg): 36 | self.csv_writer.writerow(arg) 37 | 38 | def close(self): 39 | self.file.close() 40 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description_file = README.md -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | # build package command: python setup.py bdist_wheel 4 | # release package command: twine upload dist/KoreaNewsCrawler-${version}-py3-none-any.whl 5 | 6 | setup( 7 | name = 'KoreaNewsCrawler', 8 | version = '1.51', 9 | description = 'Crawl the korean news', 10 | author = 'lumyjuwon', 11 | author_email = 'lumyjuwon@gmail.com', 12 | url = 'https://github.com/lumyjuwon/KoreaNewsCrawler', 13 | download_url = 'https://github.com/lumyjuwon/KoreaNewsCrawler/archive/1.51.tar.gz', 14 | install_requires = ['requests', 'beautifulsoup4'], 15 | packages = ['korea_news_crawler'], 16 | keywords = ['crawl', 'KoreaNews', 'crawler'], 17 | python_requires = '>=3.6', 18 | zip_safe=False, 19 | classifiers = [ 20 | 'Programming Language :: Python :: 3.6' 21 | ] 22 | ) --------------------------------------------------------------------------------