├── .gitignore ├── LICENSE ├── README.md ├── company_codes.xlsx ├── dart_crawling.ipynb ├── dart_xls.ipynb ├── pics ├── kakao_debttoequityratio.jpg ├── kakao_operatingmargin.jpg ├── kakao_operatingprofit_sales.jpg ├── kakao_totaldebt_totalequity.jpg ├── samsung_debt_equity.png └── samsung_sales_profit.png ├── requirements.txt ├── requirements_full.txt └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | api_key.txt 2 | *.ipynb_checkpoints 3 | *.pyc -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Seoweon 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Downloading and analyzing historical annual reports by crawling [DART](http://dart.fss.or.kr/) (Korea's repository of corporate filings) 2 | 3 | [한글 Readme를 보려면 여기를 클릭하세요](https://github.com/seoweon/dart_reports/blob/master/README.md#%EC%A0%84%EC%9E%90%EA%B3%B5%EC%8B%9C%EC%8B%9C%EC%8A%A4%ED%85%9C-%ED%81%AC%EB%A1%A4%EB%A7%81%EC%9C%BC%EB%A1%9C-%ED%95%9C-%ED%9A%8C%EC%82%AC%EC%9D%98-%EC%97%AD%EB%8C%80-%EC%82%AC%EC%97%85%EB%B3%B4%EA%B3%A0%EC%84%9C-%EB%8B%A4%EC%9A%B4%EB%B0%9B%EA%B8%B0-%EB%B0%8F-%ED%8A%B8%EB%A0%8C%EB%93%9C-%ED%99%95%EC%9D%B8%ED%95%98%EA%B8%B0) 4 | 5 | Update: Debt-to-equity ratio and operating margin graphs have been added to the analysis code in [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb). 6 | 7 | ## Directions: 8 | 1. Clone this repository 9 | 2. Install the required libraries listed in [requirements.txt](https://github.com/seoweon/dart_reports/blob/master/requirements.txt) (run ```pip install -r requirements.txt``` in your command prompt) 10 | - If that fails with an error, try [requirements_full.txt](https://github.com/seoweon/dart_reports/blob/master/requirements_full.txt) instead (```pip install -r requirements_full.txt```) 11 | 3. (If you haven't already) Get an API key from DART by [going to this link](http://dart.fss.or.kr/dsap001/apikeyManagement.do;jsessionid=Bs7AWiSzD8YmbBx0Zg3WoEixviKFJ7tL2OmeavY5lXpuYNh4MBmNjvvrgldaazhx.dart2_servlet_engine2) and following the directions there 12 | 4. In the folder you cloned, create a plain text file containing the key and name it ```api_key.txt``` 13 | 5. Open [dart_crawling.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_crawling.ipynb) in Jupyter Notebook and run the code 14 | 6. 
Enter the information as prompted (such as the company name). Running this code downloads all annual reports for the given company and organizes them into folders 15 | 7. Open [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb) in Jupyter Notebook and run the code 16 | 8. Select a company whose annual reports you have already downloaded 17 | 9. You can check the company's total debt and equity trend as well as its debt-to-equity ratio trend over the years 18 | 10. You can check the company's operating profit and sales as well as its operating margin trend over the years 19 | 20 | Below are example charts using [Kakao Corporation](https://www.kakaocorp.com/?lang=en)'s data: 21 | #### Kakao total debt and equity trend: 22 | ![](pics/kakao_totaldebt_totalequity.jpg) 23 | #### Kakao debt-to-equity ratio: 24 | ![](pics/kakao_debttoequityratio.jpg) 25 | #### Kakao operating profit and sales: 26 | ![](pics/kakao_operatingprofit_sales.jpg) 27 | #### Kakao operating margin: 28 | ![](pics/kakao_operatingmargin.jpg) 29 | 30 | ### Blogs I referred to while making this 31 | * http://quantkim.blogspot.kr/2018/01/dart-api-with.html 32 | * http://tariat.tistory.com/31 33 | * https://woosa7.github.io/fss_dart/ 34 | 35 | ### Disclaimer: 36 | * [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb) is not 100% foolproof because the formatting of balance sheets differs slightly from company to company. If you find a company that doesn't work, please let me know by raising an issue and I will look into fixing it as soon as I can. 37 | * In addition, since error handling is not perfect either, the code may fail silently or return wrong data without reporting an error. Please let me know if you run into this as well. 38 | 39 | ### Post-script: 40 | * Finding the variable called ```dcm_no``` was not easy, and could be done better using RegEx. I may get to this in the future. 
41 | * This code only looks at annual reports, but DART has many other types of corporate filings that could be interesting to explore. The site's own [API development guide](http://dart.fss.or.kr/dsap001/guide.do) shows a wide variety of ways to use its data, so this only scratches the surface. 42 | 43 | # [전자공시시스템](http://dart.fss.or.kr/) 크롤링으로 한 회사의 역대 사업보고서 다운받기 및 트렌드 확인하기 44 | 전자공시시스템에서 한 회사의 역대 사업보고서를 한 번에 다운받는 script입니다. 45 | 46 | Update: 부채비율 및 영업이익률 그래프 추가하였습니다. ([dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb)) 47 | 48 | ## 사용방법: 49 | 50 | 1. 다음 repository를 클론합니다 51 | 2. [requirements.txt](https://github.com/seoweon/dart_reports/blob/master/requirements.txt)를 이용해 필요한 라이브러리를 설치합니다. (```pip install -r requirements.txt```) 52 | - 만약 어떠한 이유로 계속 오류가 발생한다면 좀 더 강력한 조치로 [requirements_full.txt](https://github.com/seoweon/dart_reports/blob/master/requirements_full.txt)를 이용해 전체 라이브러리 설치를 시도해봅니다. 53 | 3. (아직 없다면) [DART API Key 발급페이지](http://dart.fss.or.kr/dsap001/apikeyManagement.do;jsessionid=Bs7AWiSzD8YmbBx0Zg3WoEixviKFJ7tL2OmeavY5lXpuYNh4MBmNjvvrgldaazhx.dart2_servlet_engine2)에 접속해 API key를 발급받습니다 (쉬워요) 54 | 4. 동 폴더에 ```api_key.txt```라는 텍스트파일을 만들어 발급받은 KEY를 저장합니다 55 | 5. [dart_crawling.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_crawling.ipynb)를 Jupyter Notebook에서 열어 실행시킵니다 56 | 6. 회사명 등의 입력사항을 넣으면 파일이 다운로드 됩니다 57 | 7. [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb)를 Jupyter Notebook에서 열어 실행시킵니다 58 | 8. 앞서 다운받은 회사의 사업보고서 중 하나를 선택해 회사명을 입력합니다 59 | 9. "재무제표 크롤링"을 통해 연도별 총 자본ㆍ부채 규모 그래프를 확인할 수 있습니다 60 | 10. 
"손익계산서 크롤링"을 통해 연도별 매출ㆍ영업이익 현황을 확인할 수 있습니다 61 | 62 | 아래는 [카카오](https://www.kakaocorp.com/?lang=ko) 데이터로 그린 그래프입니다: 63 | #### 카카오 총부채 및 총 자산 추이: 64 | ![](pics/kakao_totaldebt_totalequity.jpg) 65 | #### 카카오 부채비율 추이: 66 | ![](pics/kakao_debttoequityratio.jpg) 67 | #### 카카오 영업이익 및 총매출 추이: 68 | ![](pics/kakao_operatingprofit_sales.jpg) 69 | #### Kakao 영업이익률 추이: 70 | ![](pics/kakao_operatingmargin.jpg) 71 | 72 | ### 참고한 블로그 73 | * http://quantkim.blogspot.kr/2018/01/dart-api-with.html 74 | * http://tariat.tistory.com/31 75 | * https://woosa7.github.io/fss_dart/ 76 | 77 | ### 주의사항: 78 | * [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb)는 어느수준 정형화된 엑셀 데이터를 크롤링한 것이기 때문에 버그가 있을 수 있습니다. 해 봤는데 안 되는 회사명이 있으면 알려주세요! 조사해보고 코드를 업데이트하도록 하겠습니다 79 | * 이에 더불어, error-handling을 완벽하게 하지 않았기 때문에 코드 자체에서 에러가 나오지 않더라도 틀린 데이터를 가져오는 경우가 있을 수 있습니다. 이 부분도 보이는 대로 알려주세요 80 | 81 | 82 | ### 추신: 83 | * Regular Expressions를 사용하면 ```dm_no```같은 변수를 찾거나 폴더명을 정렬하는 게 좀 더 간편할 것 같아요. 84 | * 사업보고서는 코드가 A001인데, 이것 외에도 다운받을 수 있는 공식 문서들이 굉장히 많습니다 (홈페이지의 [API 개발가이드](http://dart.fss.or.kr/dsap001/guide.do) 중 "상세 유형" 보면 이것저것 많이 나와있어요). 
근데 어떤 방식으로 다운 받는 게 목적에 부합할 지 확신이 안 서서 우선은 사업보고서만 다운받는 형식으로 구성했습니다 85 | -------------------------------------------------------------------------------- /company_codes.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/company_codes.xlsx -------------------------------------------------------------------------------- /dart_crawling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Downloading historical annual reports from DART.fss.or.kr (Korea's corporate filings repository)\n", 8 | "# DART.fss.or.kr에서 역대 사업보고서 다운로드하기" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "outputs": [], 18 | "source": [ 19 | "#Importing requests to read json later...\n", 20 | "import requests\n", 21 | "import pandas as pd\n", 22 | "#Need to import lxml in order to get the xpath of a dcm_no \n", 23 | "from lxml import html\n", 24 | "#Just in case we need RegEx...\n", 25 | "import re\n", 26 | "\n", 27 | "%load_ext autoreload\n", 28 | "%autoreload 2\n", 29 | "\n", 30 | "from utils import *\n", 31 | "\n", 32 | "from subprocess import call\n", 33 | "\n", 34 | "from fake_useragent import UserAgent" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "Please insert your API key. You can easily get an API key by creating an account and applying [here](http://dart.fss.or.kr/dsap001/apikeyManagement.do;jsessionid=Bs7AWiSzD8YmbBx0Zg3WoEixviKFJ7tL2OmeavY5lXpuYNh4MBmNjvvrgldaazhx.dart2_servlet_engine2).
\n", 42 | "API Key를 넣어주세요. [인증키 신청](http://dart.fss.or.kr/dsap001/apikeyManagement.do;jsessionid=Bs7AWiSzD8YmbBx0Zg3WoEixviKFJ7tL2OmeavY5lXpuYNh4MBmNjvvrgldaazhx.dart2_servlet_engine2)은 DART 계정을 만든 후 간단하게 할 수 있습니다" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": { 49 | "collapsed": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "with open('api_key.txt','r') as f:\n", 54 | " API_KEY = f.read().strip() #strip the trailing newline so it does not end up in the request URL" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "## 1. Get a list of links for your desired company's annual reports. \n", 62 | "## 1. 원하는 회사의 사업보고서 링크 목록을 가져와봅시다. " 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "#### 1) Get the company code (Excel source: [Korea Investor’s Network for Disclosure System](http://kind.krx.co.kr/corpgeneral/corpList.do?method=loadInitPage))\n", 70 | "#### 1) 회사의 종목코드를 가져오세요. (엑셀 출처: [한국거래소 전자공시 홈페이지](http://kind.krx.co.kr/corpgeneral/corpList.do?method=loadInitPage))" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": { 77 | "collapsed": true 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "#Read the Excel file with official company information, making sure the company code is read as a string rather than an integer (integers would drop the leading zeros)\n", 82 | "#회사 정보가 들어있는 엑셀을 읽어오되, 종목코드는 int가 아닌 str로 가져와야 합니다 (안 그러면 앞의 0이 지워져서 나오게 돼요)\n", 83 | "#Excel source: http://kind.krx.co.kr/corpgeneral/corpList.do?method=loadInitPage\n", 84 | "company_codes = pd.read_excel('company_codes.xlsx',converters={'종목코드':str})" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": { 91 | "scrolled": true 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "#Get input from user on company name\n", 96 | "#회사명 입력란을 만들어요\n", 97 | "name_input = input('Please enter the company name | 회사명을 입력해주세요: ')" 98 | ] 99 | 
}, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "scrolled": true 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "#If there is no exact match, keep prompting until a valid company name is entered. \n", 109 | "#입력된 회사명이 없으면 진행이 안돼요\n", 110 | "#CAVEAT: Some company names are also part of another company's name (like CJ vs. CJ오쇼핑). Because CJ is itself an exact match, this simple search program will not suggest CJ오쇼핑 when only CJ is entered. \n", 111 | "#CAVEAT: CJ같이 단독으로도 회사명이 존재하지만 CJ오쇼핑 같이 이게 포함된 회사명이 있는 경우, 찾아주지는 못합니다\n", 112 | "while len(company_codes[company_codes.회사명 == name_input]) == 0:\n", 113 | " print('The company name entered does not exist. | 해당 이름의 회사명이 존재하지 않습니다. \\nPlease see the suggestions below and re-type the company name | 아래 회사명 중 하나를 찾으시나요? 다시 입력해주세요.\\n')\n", 114 | " for row in company_codes.회사명:\n", 115 | " if row.find(name_input) != -1:\n", 116 | " print(row)\n", 117 | " name_input = input()\n", 118 | "code = company_codes[company_codes.회사명 == name_input].종목코드.iloc[0]\n", 119 | "print(\"Company name | 회사명: \"+name_input+\"\\nCompany code | 종목코드: \"+code)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "#### 2) Generate list of report URLs\n", 127 | "#### 2) 보고서 목록 URL을 생성하세요" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": { 134 | "collapsed": true 135 | }, 136 | "outputs": [], 137 | "source": [ 138 | "#Start date is set to the beginning of 1956, the year the first company went public\n", 139 | "#시작날짜는 최초의 기업이 상장한 날짜인 1956년 3월보다 이전으로 잡았습니다\n", 140 | "start_date = '19560101'\n", 141 | "#Document type: Annual report (사업보고서)\n", 142 | "bsn_tp = 'A001'" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": { 149 | "collapsed": true 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "url = 
\"http://dart.fss.or.kr/api/search.json?auth=\"+API_KEY+\"&crp_cd=\"+code+\"&start_dt=\"+start_date+\"&bsn_tp=\"+bsn_tp+\"&fin_rpt=Y&page_set=100\"" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "#### 3) Generate individual report urls\n", 161 | "#### 3) 개별 보고서 URL을 생성하세요" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": true 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "ua = UserAgent()\n", 173 | "headers = {'User-Agent':ua.chrome}" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": true 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "#Extract json values\n", 185 | "#json 값을 추출합시다\n", 186 | "a = requests.get(url,headers=headers).json()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": { 193 | "scrolled": true 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "#Let's see if the list per report is generated properly. If there are no annual reports to start with, the rest of the program will be useless. \n", 198 | "#각 사업보고서 당 리스트가 제대로 생성되는 지 봅시다. 하나도 없으면 코드 돌려봤자 아무것도 다운 안 됨\n", 199 | "urldict = {}\n", 200 | "for row in a['list']:\n", 201 | " url2 = \"http://dart.fss.or.kr/dsaf001/main.do?rcpNo=\"\n", 202 | " name = row['rpt_nm']\n", 203 | " #Getting rid of the pre-amble that's irrelevant\n", 204 | " #[기재정정] [첨부추가] [첨부정정] 등 앞에 붙은 것을 제거해봅시다\n", 205 | " if name.find('[') != -1:\n", 206 | " name = name.split(']')[1]\n", 207 | " urldict[name] = url2+row['rcp_no']\n", 208 | " print(name+\": \"+url2+row['rcp_no'])" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "## 2. Check attachments per annual report and download\n", 216 | "## 2. 
각 사업보고서의 첨부파일 리스트를 확인하고 다운로드합시다" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": { 223 | "scrolled": true 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "#Setting up counter\n", 228 | "#카운터 선정\n", 229 | "n=1\n", 230 | "\n", 231 | "for key, value in urldict.items(): \n", 232 | " #We need a value called \"dcm_no\" to access the download link, and the only way to get it is to scrape the report page's html with the appropriate xpath\n", 233 | " #dcm_no 값을 알아야 다운로드 링크에 접근할 수 있는데, 알 방법이 링크에서 바로 가져오는 방법밖에 없으므로 xpath을 활용해서 알아봅시다\n", 234 | " test = requests.get(value, headers=headers)\n", 235 | " tree = html.fromstring(test.content)\n", 236 | " testpath = tree.xpath('//*[@id=\"north\"]/div[2]/ul/li[1]/a/@onclick')[0]\n", 237 | " dcm_no = testpath.split(\", '\")[1].split(\"')\")[0]\n", 238 | " \n", 239 | " #The url format is slightly different for attachments; the code below adds the needed elements via replace\n", 240 | " #다운로드를 위한 url은 보고서 url과 차이점이 몇 가지 있는데, replace를 통해 추가할 수 있어요\n", 241 | " download_url = value.replace('dsaf001','pdf/download').replace('rcpNo','rcp_no')+\"&dcm_no=\"+dcm_no\n", 242 | " print(key+\" \"+download_url+\" Downloading... 
\"+str(n)+\" out of \"+str(len(urldict)))\n", 243 | " \n", 244 | " #Extract the attachment download url, same way we got the dcm_no\n", 245 | " #dcm_no를 구했던 것과 같은 방법으로 첨부파일 다운로드 url을 추출합니다\n", 246 | " dtest = requests.get(download_url, headers=headers)\n", 247 | " dtree = html.fromstring(dtest.text)\n", 248 | " \n", 249 | " #There are multiple attachment files per report; we used a dict called downloadpath to save the file with the name on screen\n", 250 | " #각 보고서 당 복수의 첨부파일이 존재하는데, 첨부파일 이름과 함께 저장하기 위해 downloadpath라는 dict를 사용했습니다\n", 251 | " downloadpath={}\n", 252 | " keys = dtree.xpath('/html/body/div/div/table/tr/td[1]/text()')\n", 253 | " key_links = dtree.xpath('/html/body/div/div/table/tr/td/a/@href')\n", 254 | " for key2, link in zip(keys, key_links):\n", 255 | " l = \"http://dart.fss.or.kr\"+link\n", 256 | " k = key2.replace(\")\",\"\")\n", 257 | " downloadpath[k] = l\n", 258 | " \n", 259 | " #Using download_file in utils, create a directory and put the file in\n", 260 | " #utils에 있는 download_file을 이용해 디렉토리를 만들고 그 안에다가 파일을 집어넣습니다\n", 261 | " for key2, link in downloadpath.items():\n", 262 | " download_file(link,filename=key2,directory=\"dart_\"+name_input+\"/\"+key)\n", 263 | " #try:\n", 264 | " #os.mkdir(key)\n", 265 | " #except:\n", 266 | " #pass\n", 267 | " #call(['curl',link,'-o',key+'/'+key2])\n", 268 | " n+=1" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "## 3. Lastly, open the file explorer and check the downloaded files\n", 276 | "## 3. 마지막으로 파일 탐색기를 열어 다운로드받은 파일을 확인합니다. " 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "yesno = input('File download complete. Would you like to open the file explorer to check? (y/n) | 파일 다운로드가 완료되었습니다. 파일 탐색기를 열어 확인하시겠습니까? 
(y/n) ')\n", 286 | "\n", 287 | "if yesno.startswith('y'):\n", 288 | " call(['explorer','dart_'+name_input])" 289 | ] 290 | } 291 | ], 292 | "metadata": { 293 | "kernelspec": { 294 | "display_name": "Python 3", 295 | "language": "python", 296 | "name": "python3" 297 | }, 298 | "language_info": { 299 | "codemirror_mode": { 300 | "name": "ipython", 301 | "version": 3 302 | }, 303 | "file_extension": ".py", 304 | "mimetype": "text/x-python", 305 | "name": "python", 306 | "nbconvert_exporter": "python", 307 | "pygments_lexer": "ipython3", 308 | "version": "3.6.2" 309 | } 310 | }, 311 | "nbformat": 4, 312 | "nbformat_minor": 2 313 | } 314 | -------------------------------------------------------------------------------- /dart_xls.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Create graphs showing the debt-to-equity ratio and operating margin\n", 8 | "# 연결 재무제표에서 자본총계와 부채총계값을 추출해서 그래프를 만듭니다" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "outputs": [], 18 | "source": [ 19 | "import pandas as pd\n", 20 | "import os\n", 21 | "import re\n", 22 | "import matplotlib.pyplot as plt\n", 23 | "import seaborn as sns\n", 24 | "%matplotlib inline" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "scrolled": false 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "print('Existing company folders | 사업보고서가 저장된 회사:')\n", 36 | "#Find the folder names first\n", 37 | "#폴더명을 먼저 가져옵니다 (형식: 'dart_[회사명]')\n", 38 | "folderitems = os.listdir()\n", 39 | "#Extract the company name only to get the list of reports\n", 40 | "#회사명만 분리해 목록을 뽑습니다\n", 41 | "for item in folderitems:\n", 42 | " if '.' 
not in item:\n", 43 | " if item.startswith('dart_'):\n", 44 | " print(\"- \"+item.replace('dart_',''))" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "scrolled": false 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "inputcompany=input('Please insert the name of the company you would like to check | 저장된 사업보고서 중 조사하고자 하는 회사명을 입력해주세요:')\n", 56 | "testfile = 'dart_'+inputcompany+'/'\n", 57 | "while testfile.replace('/','') not in folderitems:\n", 58 | " inputcompany=input('Error: Company name does not exist | 존재하는 회사명으로 다시 입력해주세요. ')\n", 59 | " testfile = 'dart_'+inputcompany+'/'" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "collapsed": true 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "#Find the excel file within each folder and save the directory\n", 71 | "#각 폴더 내에 있는 엑셀파일을 찾아 디렉토리 목록을 저장합니다\n", 72 | "linklist=[]\n", 73 | "for item in os.listdir(testfile):\n", 74 | " sublist = os.listdir(testfile+item)\n", 75 | " for item2 in sublist:\n", 76 | " if item2.endswith('.xls'):\n", 77 | " filedirectory = testfile+item+\"/\"+item2\n", 78 | " linklist.append(filedirectory)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "if len(linklist) < 1:\n", 88 | " print('Error: No reports exist. 
| 사업보고서가 존재하지 않습니다.')\n", 89 | "else:\n", 90 | " first_year = linklist[0].split('/')[1].split('(')[1].split('.')[0]\n", 91 | " first_month = linklist[0].split('/')[1].split('(')[1].split('.')[1].replace(')','')\n", 92 | " current_year = linklist[-1].split('/')[1].split('(')[1].split('.')[0]\n", 93 | " accounting_month = linklist[-1].split('/')[1].split('(')[1].split('.')[1].replace(')','')\n", 94 | " print('Total number of files available to extract: '+str(len(linklist))+' ('+first_month+'.'+first_year+' ~ '+accounting_month+'.'+current_year+')')" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": { 101 | "collapsed": true 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "#Save the internal numerical values in each excel as a \"score\"\n", 106 | "#각 행을 스캔해 \"제 XX 기\" 같은 표현이 등장하는 횟수를 score로 저장함 (엑셀시트 형식 당 skiprows를 해야 하는 숫자가 달라서 이걸 자동으로 찾아주기 위함)\n", 107 | "r_col = re.compile(r'제\\s*\\d{1,}\\s*기')\n", 108 | "def header_score(row):\n", 109 | " score=0\n", 110 | " for k,item in row.items():\n", 111 | " if len(r_col.findall(str(item))) !=0:\n", 112 | " score+=1\n", 113 | " return score\n", 114 | "\n", 115 | "#TODO: Later, make a function to find the unit for each numbers.... \n", 116 | "#TODO:단위를 찾아주는 function을 만들어야 하는데... 머리가 아프다...\n", 117 | "\n", 118 | "#Find the header row\n", 119 | "#\"제 XX 기\" 표현이 가장 많이 등장하는 행을 header로 선정함\n", 120 | "def get_right_header(df):\n", 121 | " header_row = df.fillna('').apply(header_score,axis=1).argmax()\n", 122 | " newdf = df.iloc[header_row+1:]\n", 123 | " newdf.columns = df.iloc[header_row].values\n", 124 | " return newdf" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "## 1. Crawl the balance sheet\n", 132 | "## 1. 
재무제표 크롤링" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": { 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "#A function that organizes and cleans up the balance sheet\n", 144 | "#재무 테이블을 깔끔하게 정리해주는 function\n", 145 | "def clean_table(jaemu):\n", 146 | " #Changing the first column name to category\n", 147 | " oldname = jaemu.columns[0]\n", 148 | " clean = jaemu.rename(columns={oldname:'category'})\n", 149 | " clean['category'] = clean.category.apply(str.strip)\n", 150 | " indexlist = []\n", 151 | " for index,row in clean.iterrows():\n", 152 | " if (row.category == '부채총계') | (row.category == '자본총계'):\n", 153 | " indexlist.append(index)\n", 154 | " clean2 = clean.iloc[indexlist].copy()\n", 155 | " return clean2" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": { 162 | "collapsed": true 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "def get_jaemu(wholetable, year):\n", 167 | " wholetable_sheets = list(wholetable.keys())\n", 168 | " if '대차대조표' in wholetable_sheets:\n", 169 | " a = get_right_header(wholetable['대차대조표']).reset_index(drop=True)\n", 170 | " print(str(year)+'년: 대차대조표')\n", 171 | " elif '연결 재무상태표' in wholetable_sheets:\n", 172 | " a = get_right_header(wholetable['연결 재무상태표']).reset_index(drop=True)\n", 173 | " print(str(year)+'년: 연결 재무상태표')\n", 174 | " elif '재무상태표' in wholetable_sheets:\n", 175 | " a = get_right_header(wholetable['재무상태표']).reset_index(drop=True)\n", 176 | " print(str(year)+'년: 재무상태표')\n", 177 | " else:\n", 178 | " b = wholetable_sheets[1]\n", 179 | " print(str(year)+\"년: \"+b+\"(?)\")\n", 180 | " a = get_right_header(wholetable[b]).reset_index(drop=True)\n", 181 | " return clean_table(a)" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "getall = pd.DataFrame()\n", 191 | "#Using RegEx, get the number 
value from expressions such as '제 xx 기'\n", 192 | "#RegEx를 이용해서 '제 XX 기' (또는 '제XX기')로 명시된 기수를 숫자로 따로 추출합니다\n", 193 | "r = re.compile(r'(\\d+)')\n", 194 | "year_tracker = int(first_year)\n", 195 | "for item in linklist:\n", 196 | " table = pd.read_excel(item, sheetname=None)\n", 197 | " a = get_jaemu(table,year_tracker)\n", 198 | " year_tracker+=1\n", 199 | " melted = pd.melt(a, id_vars='category')\n", 200 | " #RegEx 이용 부분...\n", 201 | " melted['variable'] = melted.variable.apply(lambda x: r.findall(x)[0])\n", 202 | " #위에서 만든 getsummary DataFrame에다가 append\n", 203 | " getall = getall.append(melted).reset_index(drop=True).drop_duplicates()" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": { 210 | "collapsed": true 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "#Fixing the datatype (monetary amounts are float, counting numbers are integers)\n", 215 | "#Datatype 수정 (금액은 float, 기수는 int로 변경)\n", 216 | "getall['value'] = getall.value.astype(float)\n", 217 | "getall['variable'] = getall.variable.astype(int)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": { 224 | "collapsed": true 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "#Correct the date information based on the counting numbers extracted above\n", 229 | "#폴더명에서 추출한 최근 연도를 가지고 각 '기'에다가 더할 숫자를 구하고 (28기가 2016년이면 1988를 모든 '기'에 더해주면 연도가 나옴) 'variable' 열을 연도로 업데이트시켜줌\n", 230 | "addnumber = int(current_year)-max(getall.variable)\n", 231 | "getall['variable'] = getall.variable + addnumber" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": { 238 | "collapsed": true 239 | }, 240 | "outputs": [], 241 | "source": [ 242 | "#Create a dataframe that's appropriate for drawing tables\n", 243 | "#pivot을 통해 그래프를 그리기 위한 적절한 형식의 dataframe를 만들어줌\n", 244 | "jaemutable = 
pd.pivot_table(getall,index='variable',values='value',columns='category').rename(columns={'부채총계':'total_debt','자본총계':'total_equity'})" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "jaemutable['debt-to-equity']=jaemutable.total_debt/jaemutable.total_equity" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "metadata": { 260 | "scrolled": false 261 | }, 262 | "outputs": [], 263 | "source": [ 264 | "#Draw the graph\n", 265 | "#그래프 그리기\n", 266 | "print(inputcompany+\" debt and equity trend:\")\n", 267 | "sns.set_context('poster')\n", 268 | "plt.figure(figsize=(16,9))\n", 269 | "jaemutable[['total_debt','total_equity']].plot.bar(stacked=True,ax=plt.gca(),colormap='tab10')\n", 270 | "plt.xlabel('')\n", 271 | "sns.despine()" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "#Draw the graph\n", 281 | "#그래프 그리기\n", 282 | "print(inputcompany+\" debt-to-equity ratio:\")\n", 283 | "sns.set_context('poster')\n", 284 | "plt.figure(figsize=(16,9))\n", 285 | "jaemutable['debt-to-equity'].plot.bar(stacked=True,ax=plt.gca(),colormap='tab10')\n", 286 | "plt.xlabel('')\n", 287 | "sns.despine()" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "## 2. Crawl the income statement\n", 295 | "## 2. 
손익계산서 크롤링" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "metadata": { 302 | "collapsed": true 303 | }, 304 | "outputs": [], 305 | "source": [ 306 | "#A function that organizes and cleans up the income statement\n", 307 | "#손익계산서를 깔끔하게 정리해주는 function\n", 308 | "def clean_sonik(df):\n", 309 | " #Changing the first column name to category\n", 310 | " oldname = df.columns[0]\n", 311 | " clean = df.rename(columns={oldname:'category'})\n", 312 | " clean['category'] = clean.category.apply(str.strip)\n", 313 | " indexlist = []\n", 314 | " for index,row in clean.iterrows():\n", 315 | " if row.category.find('영업수익') != -1:\n", 316 | " indexlist.append(index)\n", 317 | " if row.category.find('매출액') != -1:\n", 318 | " indexlist.append(index)\n", 319 | " if row.category.find('매출') != -1:\n", 320 | " indexlist.append(index)\n", 321 | " if len(indexlist) != 1:\n", 322 | " indexlist = indexlist[:1]\n", 323 | " for index,row in clean.iterrows():\n", 324 | " if row.category.find('영업이익') != -1:\n", 325 | " indexlist.append(index)\n", 326 | " clean2 = clean.loc[indexlist].copy()\n", 327 | " #TODO: What to do when there are two sales numbers (edge cases)\n", 328 | " #TODO: 매출액은 두 개 이상 있으면 맨 처음 것이어야 \n", 329 | " if len(clean2) !=2:\n", 330 | " clean2 = clean2[:2]\n", 331 | " return clean2" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": { 338 | "collapsed": true 339 | }, 340 | "outputs": [], 341 | "source": [ 342 | "def get_sonik(wholetable, year):\n", 343 | " wholetable_sheets = list(wholetable.keys())\n", 344 | " if '손익계산서' in wholetable_sheets:\n", 345 | " a = get_right_header(wholetable['손익계산서']).reset_index(drop=True)\n", 346 | " print(str(year)+'년: 손익계산서')\n", 347 | " elif '포괄손익계산서' in wholetable_sheets:\n", 348 | " a = get_right_header(wholetable['포괄손익계산서']).reset_index(drop=True)\n", 349 | " print(str(year)+'년: 포괄손익계산서')\n", 350 | " elif '연결 포괄손익계산서' in wholetable_sheets:\n", 351 | " 
a = get_right_header(wholetable['연결 포괄손익계산서']).reset_index(drop=True)\n", 352 | " print(str(year)+'년: 연결 포괄손익계산서')\n", 353 | " else:\n", 354 | " matches = [s for s in wholetable_sheets if '손익계산서' in s] #fall back to the first sheet whose name contains 손익계산서\n", 355 | " if matches:\n", 356 | " a = get_right_header(wholetable[matches[0]]).reset_index(drop=True)\n", 357 | " print(str(year)+'년: 손익계산서가 포함된 시트명')\n", 358 | " else:\n", 359 | " print(str(year)+\"년: 손익계산서를 찾지 못함\")\n", 360 | " a=pd.DataFrame()\n", 361 | " return clean_sonik(a)" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "#Create an empty DataFrame to collect everything into one\n", 371 | "#하나의 DataFrame로 통합하기 위해 빈 DataFrame을 우선 생성합니다\n", 372 | "soniksummary = pd.DataFrame()\n", 373 | "#Use RegEx to extract the term number from labels such as '제 XX 기' (or '제XX기')\n", 374 | "r = re.compile(r'(\\d+)')\n", 375 | "#For every folder that contains a financial statement Excel file, use the functions above to fetch and clean the data\n", 376 | "year_tracker = int(first_year)\n", 377 | "for item in linklist:\n", 378 | " table = pd.read_excel(item, sheetname=None)\n", 379 | " get_table = get_sonik(table, year_tracker)\n", 380 | " year_tracker+=1\n", 381 | " #Use melt to turn the term labels into a single column for tidier data\n", 382 | " melted = pd.melt(get_table, id_vars='category')\n", 383 | " #RegEx step: keep only the term number\n", 384 | " melted['variable'] = melted.variable.apply(lambda x: r.findall(x)[0])\n", 385 | " #Append to the soniksummary DataFrame created above\n", 386 | " soniksummary = soniksummary.append(melted).reset_index(drop=True).drop_duplicates()" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "metadata": { 393 | "collapsed": true 394 | }, 395 | "outputs": [], 396 | "source": [ 397 | "def rename_values(x):\n", 398 | " if (x.find('매출') >= 0) or (x.find('영업수익') >=0):\n", 399 | " return 'sales'\n", 400 | " if x.find('영업이익') >=0:\n", 401 | " return 'operating_profit'\n", 402 | " return x" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | 
"execution_count": null, 408 | "metadata": { 409 | "collapsed": true 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "soniksummary['category'] = soniksummary.category.apply(rename_values)" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": { 420 | "collapsed": true, 421 | "scrolled": false 422 | }, 423 | "outputs": [], 424 | "source": [ 425 | "#Datatype 수정 (금액은 float, 기수는 int로 변경)\n", 426 | "soniksummary['value'] = soniksummary.value.astype(float)\n", 427 | "soniksummary['variable'] = soniksummary.variable.astype(int)" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": null, 433 | "metadata": { 434 | "collapsed": true 435 | }, 436 | "outputs": [], 437 | "source": [ 438 | "#폴더명에서 추출한 최근 연도를 가지고 각 '기'에다가 더할 숫자를 구하고 (28기가 2016년이면 1988를 모든 '기'에 더해주면 연도가 나옴) 'variable' 열을 연도로 업데이트시켜줌\n", 439 | "addnumber = int(current_year)-max(soniksummary.variable)\n", 440 | "soniksummary['variable'] = soniksummary.variable + addnumber" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": null, 446 | "metadata": { 447 | "collapsed": true 448 | }, 449 | "outputs": [], 450 | "source": [ 451 | "#pivot을 통해 그래프를 그리기 위한 적절한 형식의 dataframe를 만들어줌\n", 452 | "sonikpivot = pd.pivot_table(soniksummary,index='variable',values='value',columns='category')" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": null, 458 | "metadata": { 459 | "collapsed": true 460 | }, 461 | "outputs": [], 462 | "source": [ 463 | "sonikpivot['operating_margin']=sonikpivot.operating_profit/sonikpivot.sales" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": null, 469 | "metadata": { 470 | "scrolled": false 471 | }, 472 | "outputs": [], 473 | "source": [ 474 | "#Draw the graph\n", 475 | "#그래프 그리기\n", 476 | "print(inputcompany+\" sales and operating profit trend:\")\n", 477 | "sns.set_context('poster')\n", 478 | "plt.figure(figsize=(16,9))\n", 479 | 
"sonikpivot[['operating_profit','sales']].plot.bar(stacked=False,ax=plt.gca(),colormap='tab10')\n", 480 | "plt.xlabel('')\n", 481 | "sns.despine()" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "metadata": {}, 488 | "outputs": [], 489 | "source": [ 490 | "#Draw the graph\n", 491 | "#그래프 그리기\n", 492 | "print(inputcompany+\" operating margin: \")\n", 493 | "sns.set_context('poster')\n", 494 | "plt.figure(figsize=(16,9))\n", 495 | "sonikpivot['operating_margin'].plot.bar(stacked=False,ax=plt.gca(),colormap='tab10')\n", 496 | "plt.xlabel('')\n", 497 | "sns.despine()" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": null, 503 | "metadata": { 504 | "collapsed": true 505 | }, 506 | "outputs": [], 507 | "source": [] 508 | } 509 | ], 510 | "metadata": { 511 | "kernelspec": { 512 | "display_name": "Python 3", 513 | "language": "python", 514 | "name": "python3" 515 | }, 516 | "language_info": { 517 | "codemirror_mode": { 518 | "name": "ipython", 519 | "version": 3 520 | }, 521 | "file_extension": ".py", 522 | "mimetype": "text/x-python", 523 | "name": "python", 524 | "nbconvert_exporter": "python", 525 | "pygments_lexer": "ipython3", 526 | "version": "3.6.2" 527 | } 528 | }, 529 | "nbformat": 4, 530 | "nbformat_minor": 2 531 | } 532 | -------------------------------------------------------------------------------- /pics/kakao_debttoequityratio.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/kakao_debttoequityratio.jpg -------------------------------------------------------------------------------- /pics/kakao_operatingmargin.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/kakao_operatingmargin.jpg 
--------------------------------------------------------------------------------
/pics/kakao_operatingprofit_sales.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/kakao_operatingprofit_sales.jpg
--------------------------------------------------------------------------------
/pics/kakao_totaldebt_totalequity.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/kakao_totaldebt_totalequity.jpg
--------------------------------------------------------------------------------
/pics/samsung_debt_equity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/samsung_debt_equity.png
--------------------------------------------------------------------------------
/pics/samsung_sales_profit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/samsung_sales_profit.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
# re, os and subprocess are part of the Python standard library and utils is
# this repository's utils.py, so none of them is installed through pip
ipython
pandas
requests
lxml
fake_useragent
matplotlib
seaborn
--------------------------------------------------------------------------------
/requirements_full.txt:
--------------------------------------------------------------------------------
alabaster==0.7.10
anaconda-client==1.6.3
anaconda-navigator==1.6.4
anaconda-project==0.6.0
asn1crypto==0.22.0
astroid==1.5.3
astropy==2.0.2
awscli==1.11.75
Babel==2.5.0
backports.shutil-get-terminal-size==1.0.0
beautifulsoup4==4.6.0
bitarray==0.8.1
bkcharts==0.2
blaze==0.10.1
bleach==1.5.0
bokeh==0.12.7
boto==2.48.0
botocore==1.5.38
Bottleneck==1.2.1
cairocffi==0.8.0
CairoSVG==2.0.2
certifi==2016.2.28
cffi==1.10.0
chardet==3.0.4
chest==0.2.3
click==6.7
cloudpickle==0.4.0
clyent==1.2.2
colorama==0.3.9
comtypes==1.1.2
configobj==5.0.6
contextlib2==0.5.5
cryptography==1.8.1
cssselect==1.0.1
cycler==0.10.0
Cython==0.26
cytoolz==0.8.2
dask==0.15.2
datashape==0.5.4
decorator==4.1.2
dill==0.2.6
Django>=1.11.15
django-countries==4.4
django-session-security==2.3.2
django-simple-menu==1.2.1
django-timezone-field==2.0
djangorestframework==3.6.2
docutils==0.14
entrypoints==0.2.3
et-xmlfile==1.0.1
fake-useragent==0.1.10
fastcache==1.0.2
flake8==3.3.0
Flask>=0.12.3
Flask-Cors==3.0.3
geopy==1.11.0
gevent==1.2.2
googlemaps==2.4.6
graphviz==0.5.2
greenlet==0.4.12
gunicorn==19.7.1
h5py==2.7.1
haversine==0.4.5
HeapDict==1.0.0
html5lib==0.999999999
httplib2==0.10.3
idna==2.6
imagesize==0.7.1
ipykernel==4.6.1
ipython==6.1.0
ipython-genutils==0.2.0
ipywidgets==7.1.2
iso8601==0.1.11
isort==4.2.15
itsdangerous==0.24
jdcal==1.3
jedi==0.10.2
Jinja2==2.9.6
jmespath==0.9.2
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.1.0
jupyter-console==5.2.0
jupyter-core==4.3.0
jupyterlab==0.31.8
jupyterlab-launcher==0.10.5
lazy-object-proxy==1.3.1
llvmlite==0.20.0
locket==0.2.0
lxml==3.8.0
MarkupSafe==1.0
matplotlib==2.0.2
mccabe==0.6.1
menuinst==1.4.7
mistune==0.7.4
mpmath==0.19
multipledispatch==0.4.9
mysqlclient==1.3.10
nbconvert==5.2.1
nbformat==4.4.0
ndg-httpsclient==0.4.2
networkx==1.11
nltk==3.2.4
nose==1.3.7
notebook>=5.4.1
ntlm-auth==1.0.4
numba==0.35.0
numexpr==2.6.2
numpy==1.13.1
numpydoc==0.7.0
oauthlib==2.0.2
odo==0.5.1
olefile==0.44
openpyxl==2.4.8
ordereddict==1.1
packaging==16.8
pandas==0.20.3
pandasql==0.7.3
pandocfilters==1.4.2
partd==0.3.8
path.py==10.3.1
pathlib2==2.3.0
patsy==0.4.1
pdfminer3k==1.3.1
pep8==1.7.0
pickleshare==0.7.4
Pillow==5.0.0
ply==3.10
prompt-toolkit==1.0.15
psutil==5.2.2
py==1.4.34
py-trello==0.9.0
pyasn1==0.2.3
pycodestyle==2.3.1
pycosat==0.6.2
pycparser==2.18
pycurl==7.43.0
pyflakes==1.6.0
Pygments==2.2.0
pylint==1.7.2
pyOpenSSL>=17.5.0
pyparsing==2.2.0
PyPDF2==1.26.0
Pyphen==0.9.4
pyreadline==2.1
pytest==3.2.1
python-dateutil==2.6.1
python-instagram==1.3.2
pytz==2017.2
PyWavelets==0.5.2
pywin32==220
PyYAML==3.12
pyzmq==16.0.2
QtAwesome==0.4.4
qtconsole==4.3.1
QtPy==1.3.1
ratebeer==2.3.1
requests>=2.20.0
requests-ntlm==1.0.0
requests-oauthlib==0.8.0
rope-py3k==0.9.4.post1
rsa==3.4.2
s3transfer==0.1.10
scikit-image==0.13.0
scikit-learn==0.19.0
scipy==0.19.1
seaborn==0.8
simplegeneric==0.8.1
simplejson==3.11.1
singledispatch==3.4.0.3
six==1.10.0
snowballstemmer==1.2.1
sockjs-tornado==1.0.3
sphinx==1.6.3
sphinxcontrib-websupport==1.0.1
spyder==3.2.3
SQLAlchemy==1.1.13
statsmodels==0.8.0
sympy==1.1.1
tables==3.4.2
testpath==0.3
tinycss==0.4
toolz==0.8.2
tornado==4.5.2
tqdm==4.11.2
traitlets==4.3.2
unicodecsv==0.14.1
wcwidth==0.1.7
WeasyPrint==0.36
webencodings==0.5
Werkzeug==0.12.2
widgetsnbextension==3.1.4
win-unicode-console==0.5
wincertstore==0.2
wrapt==1.10.11
xlrd==1.1.0
XlsxWriter==0.9.8
xlwings==0.11.4
xlwt==1.3.0

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
#Source: https://github.com/cochoa0x1/mailgunbot/blob/master/mailgunbot/utils.py

import requests
from requests.utils import unquote
import os


def download_file(url,filename=None,chunk_size=512, directory=None, auth=None):

    #resolve the default directory at call time rather than at import time
    if directory is None:
        directory = os.getcwd()

    #if no filename is given, try and get it from the url
    if not filename:
        filename = unquote(url.split('/')[-1])

    full_name = os.path.join(directory,filename)

    #make the destination directory; exist_ok=True guards against the race
    #condition where it appears between an existence check and makedirs
    dest_dir = os.path.dirname(full_name)
    if dest_dir:
        os.makedirs(dest_dir, exist_ok=True)

    r = requests.get(url, stream=True, auth=auth)

    with open(full_name, 'wb') as f:
        for chunk in r.iter_content(chunk_size=chunk_size) :
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
    r.close()
--------------------------------------------------------------------------------
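When no `filename` is passed, `download_file` above derives one with `unquote(url.split('/')[-1])`. The sketch below shows that derivation using only the standard library. `filename_from_url` and its `default` argument are hypothetical names that are not part of the repository, and dropping the query string is an assumption about what a caller would want, not something `utils.py` does.

```python
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download"):
    """Hypothetical helper: derive a local file name from a URL."""
    # Use only the path component, so '?key=value' parameters are dropped
    path = urlsplit(url).path
    # Take the last path segment and decode percent-escapes such as %20
    name = unquote(path.rsplit("/", 1)[-1])
    # Fall back to a fixed name when the URL ends in '/'
    return name or default


print(filename_from_url("http://example.com/files/annual%20report.xls"))
print(filename_from_url("http://example.com/excel.do?rcp_no=1&dcm_no=2"))
```

With the query string left in, as in `utils.py`, the second URL would be saved as `excel.do?rcp_no=1&dcm_no=2`, so passing an explicit `filename` is the safer choice for URLs of that shape.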