├── .gitignore
├── LICENSE
├── README.md
├── company_codes.xlsx
├── dart_crawling.ipynb
├── dart_xls.ipynb
├── pics
│   ├── kakao_debttoequityratio.jpg
│   ├── kakao_operatingmargin.jpg
│   ├── kakao_operatingprofit_sales.jpg
│   ├── kakao_totaldebt_totalequity.jpg
│   ├── samsung_debt_equity.png
│   └── samsung_sales_profit.png
├── requirements.txt
├── requirements_full.txt
└── utils.py


/.gitignore:
--------------------------------------------------------------------------------
1 | api_key.txt
2 | *.ipynb_checkpoints
3 | *.pyc


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 Seoweon
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Downloading and analyzing historical annual reports by crawling [DART](http://dart.fss.or.kr/) (Repository of Korea's Corporate Filings)
 2 | 
 3 | [한글 Readme를 보려면 여기를 클릭하세요](https://github.com/seoweon/dart_reports/blob/master/README.md#%EC%A0%84%EC%9E%90%EA%B3%B5%EC%8B%9C%EC%8B%9C%EC%8A%A4%ED%85%9C-%ED%81%AC%EB%A1%A4%EB%A7%81%EC%9C%BC%EB%A1%9C-%ED%95%9C-%ED%9A%8C%EC%82%AC%EC%9D%98-%EC%97%AD%EB%8C%80-%EC%82%AC%EC%97%85%EB%B3%B4%EA%B3%A0%EC%84%9C-%EB%8B%A4%EC%9A%B4%EB%B0%9B%EA%B8%B0-%EB%B0%8F-%ED%8A%B8%EB%A0%8C%EB%93%9C-%ED%99%95%EC%9D%B8%ED%95%98%EA%B8%B0)
 4 | 
 5 | Update: Debt-to-equity ratio and operating margin graphs have been added to the analysis code in [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb). 
 6 | 
 7 | ## Directions: 
 8 | 1. Clone this repository
 9 | 2. Install the required libraries from [requirements.txt](https://github.com/seoweon/dart_reports/blob/master/requirements.txt) (run ```pip install -r requirements.txt``` in your command prompt)
10 | 	- If for some reason this doesn't work and an error occurs, try [requirements_full.txt](https://github.com/seoweon/dart_reports/blob/master/requirements_full.txt) (```pip install -r requirements_full.txt```)
11 | 3. (If you haven't already) Get an API key from DART by [going to this link](http://dart.fss.or.kr/dsap001/apikeyManagement.do;jsessionid=Bs7AWiSzD8YmbBx0Zg3WoEixviKFJ7tL2OmeavY5lXpuYNh4MBmNjvvrgldaazhx.dart2_servlet_engine2) and following the directions there
12 | 4. In the cloned folder, create a plain text file containing your key and name it ```api_key.txt```
13 | 5. Open [dart_crawling.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_crawling.ipynb) in Jupyter Notebook and run the code
14 | 6. Input the information as directed (items such as company name). Running this code will download all annual reports for the given company and organize them neatly in folders
15 | 7. Open [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb) in Jupyter Notebook and run the code
16 | 8. Select a company whose annual reports you have already downloaded
17 | 9. You can check a company's total debt and equity trend as well as their debt-to-equity ratio trend over the years
18 | 10. You can check a company's total operating profit and sales as well as their operating margin trend over the years
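
Steps 3 to 6 come down to reading the key and querying DART's report search endpoint. Below is a minimal sketch of how that request URL can be assembled, using the same query parameters that appear in `dart_crawling.ipynb`. The helper names `load_api_key` and `build_search_url` are illustrative and not part of this repository, the company code is a placeholder, and stripping whitespace from the key is an added precaution rather than something the notebook does.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://dart.fss.or.kr/api/search.json"


def load_api_key(path="api_key.txt"):
    """Read the key file and drop any trailing newline or spaces."""
    with open(path, "r") as f:
        return f.read().strip()


def build_search_url(api_key, company_code, start_date="19560101", report_type="A001"):
    """Build the report-list URL for one company (hypothetical helper)."""
    params = {
        "auth": api_key,
        "crp_cd": company_code,  # keep as a string so leading zeros survive
        "start_dt": start_date,
        "bsn_tp": report_type,   # A001 = annual report
        "fin_rpt": "Y",
        "page_set": 100,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # "000000" is a placeholder, not a real company code
    print(build_search_url("YOUR_KEY", "000000"))
```

Building the query with `urlencode` instead of string concatenation keeps the parameters in one place and escapes any unexpected characters in the key.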
19 | 
20 | Below are example charts using [Kakao corporation](https://www.kakaocorp.com/?lang=en)'s data:
21 | #### Kakao total debt and equity trend:
22 | ![](pics/kakao_totaldebt_totalequity.jpg)
23 | #### Kakao debt-to-equity ratio: 
24 | ![](pics/kakao_debttoequityratio.jpg)
25 | #### Kakao operating profit and sales:
26 | ![](pics/kakao_operatingprofit_sales.jpg)
27 | #### Kakao operating margin:
28 | ![](pics/kakao_operatingmargin.jpg)
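
The two ratio charts are simple divisions of the extracted totals: debt-to-equity is total debt over total equity, and operating margin is operating profit over sales. A small sketch of that arithmetic follows; the function names and the sample figures are made up for illustration and are not Kakao's actual numbers.

```python
def debt_to_equity(total_debt, total_equity):
    """Total debt divided by total equity."""
    if total_equity == 0:
        raise ValueError("total_equity must be non-zero")
    return total_debt / total_equity


def operating_margin(operating_profit, sales):
    """Operating profit divided by sales."""
    if sales == 0:
        raise ValueError("sales must be non-zero")
    return operating_profit / sales


# Illustrative figures only (arbitrary units)
yearly = {
    2015: {"debt": 50.0, "equity": 200.0, "profit": 10.0, "sales": 80.0},
    2016: {"debt": 90.0, "equity": 300.0, "profit": 12.0, "sales": 120.0},
}

for year, row in sorted(yearly.items()):
    ratio = debt_to_equity(row["debt"], row["equity"])
    margin = operating_margin(row["profit"], row["sales"])
    print(year, round(ratio, 3), round(margin, 3))
```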
29 | 
30 | ### Blogs I referred to for making this
31 | * http://quantkim.blogspot.kr/2018/01/dart-api-with.html
32 | * http://tariat.tistory.com/31
33 | * https://woosa7.github.io/fss_dart/
34 | 
35 | ### Disclaimer:
36 | * [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb) is not 100% foolproof because the formatting of balance sheets differs slightly from company to company. If you find a company that doesn't work, please raise an issue and I will look into fixing it as soon as I can.
37 | * In addition, since error handling is not perfect either, the code may fail silently without telling you that an error occurred. Please let me know if you run into this as well.
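
One way to make such silent failures visible is to check, right after extraction, that the expected rows were actually found. The sketch below is not part of the notebooks; `missing_rows` and `check_extraction` are hypothetical helpers that assume the balance sheet rows are labeled `부채총계` (total debt) and `자본총계` (total equity), as the notebook's `clean_table` expects.

```python
REQUIRED_BALANCE_ROWS = ("부채총계", "자본총계")  # total debt, total equity


def missing_rows(found_categories, required=REQUIRED_BALANCE_ROWS):
    """Return the required row labels that were not found, ignoring surrounding whitespace."""
    found = {str(c).strip() for c in found_categories}
    return [label for label in required if label not in found]


def check_extraction(found_categories, source_name):
    """Raise instead of silently continuing with an incomplete table."""
    missing = missing_rows(found_categories)
    if missing:
        raise ValueError(f"{source_name}: missing rows {missing}")


if __name__ == "__main__":
    # Passes: both rows present (extra whitespace is tolerated)
    check_extraction([" 부채총계 ", "자본총계"], "example.xls")
```

Calling a check like this once per workbook turns a wrong-looking chart into an explicit error that names the file.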
38 | 
39 | ### Post-script: 
40 | * Finding the value called ```dcm_no``` was not easy, and could be done better using RegEx. I may get to this in the future. 
41 | * This code only looks at annual reports, but DART has many other types of corporate filings that could be interesting to look into. The homepage's own [API development guide](http://dart.fss.or.kr/dsap001/guide.do) shows a wide variety of ways to use its data, so this is only scratching the surface. 
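
As a rough idea of the RegEx approach for `dcm_no`: the notebook currently pulls the value out of an `onclick` attribute with chained `split` calls, which implies the value is the last quoted argument of a function call. The sketch below assumes that shape. The sample string and its function name `someHandler` are made up, and the pattern may need adjusting against the real page.

```python
import re

# Matches the last quoted numeric argument: ..., '1234567')
DCM_NO_PATTERN = re.compile(r",\s*'(\d+)'\s*\)")


def extract_dcm_no(onclick):
    """Return the dcm_no found in an onclick string, or raise if the shape is unexpected."""
    match = DCM_NO_PATTERN.search(onclick)
    if match is None:
        raise ValueError("dcm_no not found in: " + onclick)
    return match.group(1)


if __name__ == "__main__":
    # Hypothetical sample; only the trailing ", 'NNN')" mirrors what the notebook relies on
    sample = "someHandler('20180402000001', '1234567'); return false;"
    print(extract_dcm_no(sample))
```

Unlike the `split` chain, this raises a clear error when the attribute does not have the expected form instead of failing with an `IndexError`.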
42 | 
43 | # [전자공시시스템](http://dart.fss.or.kr/) 크롤링으로 한 회사의 역대 사업보고서 다운받기 및 트렌드 확인하기
44 | 전자공시시스템에서 한 회사의 역대 사업보고서를 한 번에 다운받는 script입니다.
45 | 
46 | Update: 부채비율 및 영업이익률 그래프 추가하였습니다. ([dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb))
47 | 
48 | ## 사용방법: 
49 | 
50 | 1. 다음 repository를 클론합니다
51 | 2. [requirements.txt](https://github.com/seoweon/dart_reports/blob/master/requirements.txt)를 이용해 필요한 라이브러리를 설치합니다. (```pip install -r requirements.txt```)
52 |     - 만약 어떠한 이유로 계속 오류가 발생한다면 좀 더 강력한 조치로 [requirements_full.txt](https://github.com/seoweon/dart_reports/blob/master/requirements_full.txt)를 이용해 전체 라이브러리 설치를 시도해봅니다. 
53 | 3. (아직 없다면) [DART API Key 발급페이지](http://dart.fss.or.kr/dsap001/apikeyManagement.do;jsessionid=Bs7AWiSzD8YmbBx0Zg3WoEixviKFJ7tL2OmeavY5lXpuYNh4MBmNjvvrgldaazhx.dart2_servlet_engine2)에 접속해 API key를 발급받습니다 (쉬워요)
54 | 4. 동 폴더에 ```api_key.txt```라는 텍스트파일을 만들어 발급받은 KEY를 저장합니다
55 | 5. [dart_crawling.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_crawling.ipynb)를 Jupyter Notebook에서 열어 실행시킵니다
56 | 6. 회사명 등의 입력사항을 넣으면 파일이 다운로드 됩니다
57 | 7. [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb)를 Jupyter Notebook에서 열어 실행시킵니다
58 | 8. 앞서 다운받은 회사의 사업보고서 중 하나를 선택해 회사명을 입력합니다
59 | 9. "재무제표 크롤링"을 통해 연도별 총 자본ㆍ부채 규모 그래프를 확인할 수 있습니다 
60 | 10. "손익계산서 크롤링"을 통해 연도별 매출ㆍ영업이익 현황을 확인할 수 있습니다 
61 | 
62 | 아래는 [카카오](https://www.kakaocorp.com/?lang=ko) 데이터로 그린 그래프입니다:
63 | #### 카카오 총부채 및 총자본 추이:
64 | ![](pics/kakao_totaldebt_totalequity.jpg)
65 | #### 카카오 부채비율 추이: 
66 | ![](pics/kakao_debttoequityratio.jpg)
67 | #### 카카오 영업이익 및 총매출 추이:
68 | ![](pics/kakao_operatingprofit_sales.jpg)
69 | #### 카카오 영업이익률 추이:
70 | ![](pics/kakao_operatingmargin.jpg)
71 | 
72 | ### 참고한 블로그
73 | * http://quantkim.blogspot.kr/2018/01/dart-api-with.html
74 | * http://tariat.tistory.com/31
75 | * https://woosa7.github.io/fss_dart/
76 | 
77 | ### 주의사항:
78 | * [dart_xls.ipynb](https://github.com/seoweon/dart_reports/blob/master/dart_xls.ipynb)는 어느수준 정형화된 엑셀 데이터를 크롤링한 것이기 때문에 버그가 있을 수 있습니다. 해 봤는데 안 되는 회사명이 있으면 알려주세요! 조사해보고 코드를 업데이트하도록 하겠습니다
79 | * 이에 더불어, error-handling을 완벽하게 하지 않았기 때문에 코드 자체에서 에러가 나오지 않더라도 틀린 데이터를 가져오는 경우가 있을 수 있습니다. 이 부분도 보이는 대로 알려주세요
80 | 
81 | 
82 | ### 추신: 
83 | * Regular Expressions를 사용하면 ```dcm_no```같은 변수를 찾거나 폴더명을 정렬하는 게 좀 더 간편할 것 같아요.
84 | * 사업보고서는 코드가 A001인데, 이것 외에도 다운받을 수 있는 공식 문서들이 굉장히 많습니다 (홈페이지의 [API 개발가이드](http://dart.fss.or.kr/dsap001/guide.do) 중 "상세 유형" 보면 이것저것 많이 나와있어요). 근데 어떤 방식으로 다운 받는 게 목적에 부합할 지 확신이 안 서서 우선은 사업보고서만 다운받는 형식으로 구성했습니다
85 | 


--------------------------------------------------------------------------------
/company_codes.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/company_codes.xlsx


--------------------------------------------------------------------------------
/dart_crawling.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Downloading historical annual reports from DART.fss.or.kr (Korea's corporate filings repository)\n",
  8 |     "# DART.fss.or.kr에서 역대 사업보고서 다운로드하기"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": null,
 14 |    "metadata": {
 15 |     "collapsed": true
 16 |    },
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "#Importing requests to read json later...\n",
 20 |     "import requests\n",
 21 |     "import pandas as pd\n",
 22 |     "#Need to import lxml in order to get the xpath of a dcm_no \n",
 23 |     "from lxml import html\n",
 24 |     "#Just in case we need RegEx...\n",
 25 |     "import re\n",
 26 |     "\n",
 27 |     "%load_ext autoreload\n",
 28 |     "%autoreload 2\n",
 29 |     "\n",
 30 |     "from utils import *\n",
 31 |     "\n",
 32 |     "from subprocess import call\n",
 33 |     "\n",
 34 |     "from fake_useragent import UserAgent"
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "markdown",
 39 |    "metadata": {},
 40 |    "source": [
 41 |     "Please insert your API key. You can easily get an API key by creating an account and applying [here](http://dart.fss.or.kr/dsap001/apikeyManagement.do;jsessionid=Bs7AWiSzD8YmbBx0Zg3WoEixviKFJ7tL2OmeavY5lXpuYNh4MBmNjvvrgldaazhx.dart2_servlet_engine2). <br>\n",
 42 |     "API Key를 넣어주세요. [인증키 신청](http://dart.fss.or.kr/dsap001/apikeyManagement.do;jsessionid=Bs7AWiSzD8YmbBx0Zg3WoEixviKFJ7tL2OmeavY5lXpuYNh4MBmNjvvrgldaazhx.dart2_servlet_engine2)은 DART 계정을 만든 후 간단하게 할 수 있습니다"
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "code",
 47 |    "execution_count": null,
 48 |    "metadata": {
 49 |     "collapsed": true
 50 |    },
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "with open('api_key.txt','r') as f:\n",
 54 |     "    API_KEY = f.read().strip()"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "markdown",
 59 |    "metadata": {},
 60 |    "source": [
 61 |     "## 1. Get a list of links for your desired company's financial reports. \n",
 62 |     "## 1. 원하는 회사의 사업보고서 링크 목록을 가져와봅시다. "
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "markdown",
 67 |    "metadata": {},
 68 |    "source": [
 69 |     "#### 1) Get the company code (Excel source: [Korea Investor’s Network for Disclosure System](http://kind.krx.co.kr/corpgeneral/corpList.do?method=loadInitPage))\n",
 70 |     "#### 1) 회사의 종목코드를 가져오세요. (엑셀 출처: [한국거래소 전자공시 홈페이지](http://kind.krx.co.kr/corpgeneral/corpList.do?method=loadInitPage))"
 71 |    ]
 72 |   },
 73 |   {
 74 |    "cell_type": "code",
 75 |    "execution_count": null,
 76 |    "metadata": {
 77 |     "collapsed": true
 78 |    },
 79 |    "outputs": [],
 80 |    "source": [
 81 |     "#Read the Excel that includes official company information, while making sure that the company code is read as a string, not an integer (since integer values delete the zeros in front)\n",
 82 |     "#회사 정보가 들어있는 엑셀을 읽어오되, 종목코드는 int가 아닌 str로 가져와야 합니다 (안 그러면 앞의 0이 지워져서 나오게 돼요)\n",
 83 |     "#Excel source: http://kind.krx.co.kr/corpgeneral/corpList.do?method=loadInitPage\n",
 84 |     "company_codes = pd.read_excel('company_codes.xlsx',converters={'종목코드':str})"
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "code",
 89 |    "execution_count": null,
 90 |    "metadata": {
 91 |     "scrolled": true
 92 |    },
 93 |    "outputs": [],
 94 |    "source": [
 95 |     "#Get input from user on company name\n",
 96 |     "#회사명 입력란을 만들어요\n",
 97 |     "name_input = input('Please write company name here | 회사명을 입력해주세요: ')"
 98 |    ]
 99 |   },
100 |   {
101 |    "cell_type": "code",
102 |    "execution_count": null,
103 |    "metadata": {
104 |     "scrolled": true
105 |    },
106 |    "outputs": [],
107 |    "source": [
108 |     "#If there is no exact match, keep prompting until an existing company name is entered. \n",
109 |     "#입력된 회사명이 없으면 진행이 안돼요\n",
110 |     "#CAVEAT: Some company names are also part of another company name (like CJ vs. CJ오쇼핑) - this simple search cannot suggest CJ오쇼핑 when only CJ is entered. \n",
111 |     "#CAVEAT: CJ같이 단독으로도 회사명이 존재하지만 CJ오쇼핑 같이 이게 포함된 회사명이 있는 경우, 찾아주지는 못합니다\n",
112 |     "while len(company_codes[company_codes.회사명 ==  name_input]) == 0:\n",
113 |     "    print('The company name entered does not exist. | 해당 이름의 회사명이 존재하지 않습니다. \\nPlease find below for suggestions and re-type the company name | 아래 회사명 중 하나를 찾으시나요? 다시 입력해주세요.\\n')\n",
114 |     "    for row in company_codes.회사명:\n",
115 |     "        if row.find(name_input) != -1:\n",
116 |     "            print(row)\n",
117 |     "    name_input = input()\n",
118 |     "code = company_codes[company_codes.회사명 == name_input].종목코드.iloc[0]\n",
119 |     "print(\"Company name | 회사명: \"+name_input+\"\\nCompany code | 종목코드: \"+code)"
120 |    ]
121 |   },
122 |   {
123 |    "cell_type": "markdown",
124 |    "metadata": {},
125 |    "source": [
126 |     "#### 2) Generate list of report URLs\n",
127 |     "#### 2) 보고서 목록 URL을 생성하세요"
128 |    ]
129 |   },
130 |   {
131 |    "cell_type": "code",
132 |    "execution_count": null,
133 |    "metadata": {
134 |     "collapsed": true
135 |    },
136 |    "outputs": [],
137 |    "source": [
138 |     "#Start date is set to 1956-01-01, earlier than March 1956, when the first company went public\n",
139 |     "#시작날짜는 최초의 기업이 상장한 날짜인 1956년 3월보다 이전으로 잡았습니다\n",
140 |     "start_date = '19560101'\n",
141 |     "#Document type: Annual report (사업보고서)\n",
142 |     "bsn_tp = 'A001'"
143 |    ]
144 |   },
145 |   {
146 |    "cell_type": "code",
147 |    "execution_count": null,
148 |    "metadata": {
149 |     "collapsed": true
150 |    },
151 |    "outputs": [],
152 |    "source": [
153 |     "url = \"http://dart.fss.or.kr/api/search.json?auth=\"+API_KEY+\"&crp_cd=\"+code+\"&start_dt=\"+start_date+\"&bsn_tp=\"+bsn_tp+\"&fin_rpt=Y&page_set=100\""
154 |    ]
155 |   },
156 |   {
157 |    "cell_type": "markdown",
158 |    "metadata": {},
159 |    "source": [
160 |     "#### 3) Generate individual report urls\n",
161 |     "#### 3) 개별 보고서 URL을 생성하세요"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "code",
166 |    "execution_count": null,
167 |    "metadata": {
168 |     "collapsed": true
169 |    },
170 |    "outputs": [],
171 |    "source": [
172 |     "ua = UserAgent()\n",
173 |     "headers = {'User-Agent':ua.chrome}"
174 |    ]
175 |   },
176 |   {
177 |    "cell_type": "code",
178 |    "execution_count": null,
179 |    "metadata": {
180 |     "collapsed": true
181 |    },
182 |    "outputs": [],
183 |    "source": [
184 |     "#Extract json values\n",
185 |     "#json 값을 추출합시다\n",
186 |     "a = requests.get(url,headers=headers).json()"
187 |    ]
188 |   },
189 |   {
190 |    "cell_type": "code",
191 |    "execution_count": null,
192 |    "metadata": {
193 |     "scrolled": true
194 |    },
195 |    "outputs": [],
196 |    "source": [
197 |     "#Let's see if the list per report is generated properly. If there are no annual reports to start with, the rest of the program will be useless. \n",
198 |     "#각 사업보고서 당 리스트가 제대로 생성되는 지 봅시다. 하나도 없으면 코드 돌려봤자 아무것도 다운 안 됨\n",
199 |     "urldict = {}\n",
200 |     "for row in a['list']:\n",
201 |     "    url2 = \"http://dart.fss.or.kr/dsaf001/main.do?rcpNo=\"\n",
202 |     "    name = row['rpt_nm']\n",
203 |     "    #Getting rid of the pre-amble that's irrelevant\n",
204 |     "    #[기재정정] [첨부추가] [첨부정정] 등 앞에 붙은 것을 제거해봅시다\n",
205 |     "    if name.find('[') != -1:\n",
206 |     "        name = name.split(']')[1]\n",
207 |     "    urldict[name] = url2+row['rcp_no']\n",
208 |     "    print(name+\": \"+url2+row['rcp_no'])"
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "markdown",
213 |    "metadata": {},
214 |    "source": [
215 |     "## 2. Check attachments per annual report and download\n",
216 |     "## 2. 각 사업보고서의 첨부파일 리스트를 확인하고 다운로드합시다"
217 |    ]
218 |   },
219 |   {
220 |    "cell_type": "code",
221 |    "execution_count": null,
222 |    "metadata": {
223 |     "scrolled": true
224 |    },
225 |    "outputs": [],
226 |    "source": [
227 |     "#Setting up counter\n",
228 |     "#카운터 선정\n",
229 |     "n=1\n",
230 |     "\n",
231 |     "for key, value in urldict.items(): \n",
232 |     "    #We need to know a value called \"dcm_no\" in order to access the link, but there is no way to get it other than scraping the html of the page and getting the appropriate xpath\n",
233 |     "    #dcm_no 값을 알아야 다운로드 링크에 접근할 수 있는데, 알 방법이 링크에서 바로 가져오는 방법밖에 없으므로 xpath을 활용해서 알아봅시다\n",
234 |     "    test = requests.get(value, headers=headers)\n",
235 |     "    tree = html.fromstring(test.content)\n",
236 |     "    testpath = tree.xpath('//*[@id=\"north\"]/div[2]/ul/li[1]/a/@onclick')[0]\n",
237 |     "    dcm_no = testpath.split(\", '\")[1].split(\"')\")[0]\n",
238 |     "    \n",
239 |     "    #The url format is slightly different for attachments... below is code to add the needed elements\n",
240 |     "    #다운로드를 위한 url은 보고서 url과 차이점이 몇 가지 있는데, replace를 통해 추가할 수 있어요\n",
241 |     "    download_url = value.replace('dsaf001','pdf/download').replace('rcpNo','rcp_no')+\"&dcm_no=\"+dcm_no\n",
242 |     "    print(key+\" \"+download_url+\" Downloading... \"+str(n)+\" out of \"+str(len(urldict)))\n",
243 |     "    \n",
244 |     "    #Extract the attachment download url, same way we got the dcm_no\n",
245 |     "    #dcm_no를 구했던 것과 같은 방법으로 첨부파일 다운로드 url을 추출합니다\n",
246 |     "    dtest = requests.get(download_url, headers=headers)\n",
247 |     "    dtree = html.fromstring(dtest.text)\n",
248 |     "    \n",
249 |     "    #There are multiple attachment files per report; we used a dict called downloadpath to save the file with the name on screen\n",
250 |     "    #각 보고서 당 복수의 첨부파일이 존재하는데, 첨부파일 이름과 함께 저장하기 위해 downloadpath라는 dict를 사용했습니다\n",
251 |     "    downloadpath={}\n",
252 |     "    keys = dtree.xpath('/html/body/div/div/table/tr/td[1]/text()')\n",
253 |     "    key_links = dtree.xpath('/html/body/div/div/table/tr/td/a/@href')\n",
254 |     "    for key2, link in zip(keys, key_links):\n",
255 |     "        l = \"http://dart.fss.or.kr\"+link\n",
256 |     "        k = key2.replace(\")\",\"\")\n",
257 |     "        downloadpath[k] = l\n",
258 |     "        \n",
259 |     "    #Using download_file in utils, create a directory and put the file in\n",
260 |     "    #utils에 있는 download_file을 이용해 디렉토리를 만들고 그 안에다가 파일을 집어넣습니다\n",
261 |     "    for key2, link in downloadpath.items():\n",
262 |     "        download_file(link,filename=key2,directory=\"dart_\"+name_input+\"/\"+key)\n",
263 |     "        #try:\n",
264 |     "            #os.mkdir(key)\n",
265 |     "        #except:\n",
266 |     "            #pass\n",
267 |     "        #call(['curl',link,'-o',key+'/'+key2])\n",
268 |     "    n+=1"
269 |    ]
270 |   },
271 |   {
272 |    "cell_type": "markdown",
273 |    "metadata": {},
274 |    "source": [
275 |     "## 3. Lastly, open the file explorer and check the downloaded files\n",
276 |     "## 3. 마지막으로 파일 탐색기를 열어 다운로드받은 파일을 확인합니다. "
277 |    ]
278 |   },
279 |   {
280 |    "cell_type": "code",
281 |    "execution_count": null,
282 |    "metadata": {},
283 |    "outputs": [],
284 |    "source": [
285 |     "yesno = input('File download complete. Would you like to open the file explorer to check? (y/n) | 파일 다운로드가 완료되었습니다. 파일 탐색기를 열어 확인하시겠습니까? (y/n)  ')\n",
286 |     "\n",
287 |     "if yesno.startswith('y'):\n",
288 |     "    call(['explorer','dart_'+name_input])"
289 |    ]
290 |   }
291 |  ],
292 |  "metadata": {
293 |   "kernelspec": {
294 |    "display_name": "Python 3",
295 |    "language": "python",
296 |    "name": "python3"
297 |   },
298 |   "language_info": {
299 |    "codemirror_mode": {
300 |     "name": "ipython",
301 |     "version": 3
302 |    },
303 |    "file_extension": ".py",
304 |    "mimetype": "text/x-python",
305 |    "name": "python",
306 |    "nbconvert_exporter": "python",
307 |    "pygments_lexer": "ipython3",
308 |    "version": "3.6.2"
309 |   }
310 |  },
311 |  "nbformat": 4,
312 |  "nbformat_minor": 2
313 | }
314 | 


--------------------------------------------------------------------------------
/dart_xls.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Create graphs showing the debt-to-equity ratio and operating margin\n",
  8 |     "# 연결 재무제표에서 자본총계와 부채총계값을 추출해서 그래프를 만듭니다"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": null,
 14 |    "metadata": {
 15 |     "collapsed": true
 16 |    },
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "import pandas as pd\n",
 20 |     "import os\n",
 21 |     "import re\n",
 22 |     "import matplotlib.pyplot as plt\n",
 23 |     "import seaborn as sns\n",
 24 |     "%matplotlib inline"
 25 |    ]
 26 |   },
 27 |   {
 28 |    "cell_type": "code",
 29 |    "execution_count": null,
 30 |    "metadata": {
 31 |     "scrolled": false
 32 |    },
 33 |    "outputs": [],
 34 |    "source": [
 35 |     "print('Existing company folders | 사업보고서가 저장된 회사:')\n",
 36 |     "#Find the folder names first\n",
 37 |     "#폴더명을 먼저 가져옵니다 (형식: 'dart_[회사명]')\n",
 38 |     "folderitems = os.listdir()\n",
 39 |     "#Extract the company name only to get the list of reports\n",
 40 |     "#회사명만 분리해 목록을 뽑습니다\n",
 41 |     "for item in folderitems:\n",
 42 |     "    if '.' not in item:\n",
 43 |     "        if item.startswith('dart_'):\n",
 44 |     "            print(\"- \"+item.replace('dart_',''))"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {
 51 |     "scrolled": false
 52 |    },
 53 |    "outputs": [],
 54 |    "source": [
 55 |     "inputcompany=input('Please insert the name of the company you would like to check | 저장된 사업보고서 중 조사하고자 하는 회사명을 입력해주세요:')\n",
 56 |     "testfile = 'dart_'+inputcompany+'/'\n",
 57 |     "while testfile.replace('/','') not in folderitems:\n",
 58 |     "    inputcompany=input('Error: Company name does not exist | 존재하는 회사명으로 다시 입력해주세요. ')\n",
 59 |     "    testfile = 'dart_'+inputcompany+'/'"
 60 |    ]
 61 |   },
 62 |   {
 63 |    "cell_type": "code",
 64 |    "execution_count": null,
 65 |    "metadata": {
 66 |     "collapsed": true
 67 |    },
 68 |    "outputs": [],
 69 |    "source": [
 70 |     "#Find the excel file within each folder and save the directory\n",
 71 |     "#각 폴더 내에 있는 엑셀파일을 찾아 디렉토리 목록을 저장합니다\n",
 72 |     "linklist=[]\n",
 73 |     "for item in os.listdir(testfile):\n",
 74 |     "    sublist = os.listdir(testfile+item)\n",
 75 |     "    for item2 in sublist:\n",
 76 |     "        if item2.endswith('.xls'):\n",
 77 |     "            filedirectory = testfile+item+\"/\"+item2\n",
 78 |     "            linklist.append(filedirectory)"
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "code",
 83 |    "execution_count": null,
 84 |    "metadata": {},
 85 |    "outputs": [],
 86 |    "source": [
 87 |     "if len(linklist) < 1:\n",
 88 |     "    print('Error: No reports exist. | 사업보고서가 존재하지 않습니다.')\n",
 89 |     "else:\n",
 90 |     "    first_year = linklist[0].split('/')[1].split('(')[1].split('.')[0]\n",
 91 |     "    first_month = linklist[0].split('/')[1].split('(')[1].split('.')[1].replace(')','')\n",
 92 |     "    current_year = linklist[-1].split('/')[1].split('(')[1].split('.')[0]\n",
 93 |     "    accounting_month = linklist[-1].split('/')[1].split('(')[1].split('.')[1].replace(')','')\n",
 94 |     "    print('Total number of files available to extract: '+str(len(linklist))+' ('+first_month+'.'+first_year+' ~ '+accounting_month+'.'+current_year+')')"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": null,
100 |    "metadata": {
101 |     "collapsed": true
102 |    },
103 |    "outputs": [],
104 |    "source": [
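105 |     "#Count how many cells in a row contain an expression like \"제 XX 기\" (fiscal term number) and use that count as the row's \"score\"; the number of rows to skip differs by sheet format, so this finds the header row automatically\n",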
106 |     "#각 행을 스캔해 \"제 XX 기\" 같은 표현이 등장하는 횟수를 score로 저장함 (엑셀시트 형식 당 skiprows를 해야 하는 숫자가 달라서 이걸 자동으로 찾아주기 위함)\n",
107 |     "r_col = re.compile(r'제\\s*\\d{1,}\\s*기')\n",
108 |     "def header_score(row):\n",
109 |     "    score=0\n",
110 |     "    for k,item in row.items():\n",
111 |     "        if len(r_col.findall(str(item))) !=0:\n",
112 |     "            score+=1\n",
113 |     "    return score\n",
114 |     "\n",
115 |     "#TODO: Later, make a function to find the unit for each number\n",
116 |     "#TODO:단위를 찾아주는 function을 만들어야 하는데... 머리가 아프다...\n",
117 |     "\n",
118 |     "#Find the header row\n",
119 |     "#\"제 XX 기\" 표현이 가장 많이 등장하는 행을 header로 선정함\n",
120 |     "def get_right_header(df):\n",
121 |     "    header_row = df.fillna('').apply(header_score,axis=1).argmax()\n",
122 |     "    newdf = df.iloc[header_row+1:]\n",
123 |     "    newdf.columns = df.iloc[header_row].values\n",
124 |     "    return newdf"
125 |    ]
126 |   },
127 |   {
128 |    "cell_type": "markdown",
129 |    "metadata": {},
130 |    "source": [
131 |     "## 1. Crawl the balance sheet\n",
132 |     "## 1. 재무제표 크롤링"
133 |    ]
134 |   },
135 |   {
136 |    "cell_type": "code",
137 |    "execution_count": null,
138 |    "metadata": {
139 |     "collapsed": true
140 |    },
141 |    "outputs": [],
142 |    "source": [
143 |     "#A function that organizes and cleans up the balance sheet\n",
144 |     "#재무 테이블을 깔끔하게 정리해주는 function\n",
145 |     "def clean_table(jaemu):\n",
146 |     "    #Changing the first column name to category\n",
147 |     "    oldname = jaemu.columns[0]\n",
148 |     "    clean = jaemu.rename(columns={oldname:'category'})\n",
149 |     "    clean['category'] = clean.category.apply(str.strip)\n",
150 |     "    indexlist = []\n",
151 |     "    for index,row in clean.iterrows():\n",
152 |     "        if (row.category == '부채총계') | (row.category == '자본총계'):\n",
153 |     "            indexlist.append(index)\n",
154 |     "    clean2 = clean.iloc[indexlist].copy()\n",
155 |     "    return clean2"
156 |    ]
157 |   },
158 |   {
159 |    "cell_type": "code",
160 |    "execution_count": null,
161 |    "metadata": {
162 |     "collapsed": true
163 |    },
164 |    "outputs": [],
165 |    "source": [
166 |     "def get_jaemu(wholetable, year):\n",
167 |     "    wholetable_sheets = list(wholetable.keys())\n",
168 |     "    if '대차대조표' in wholetable_sheets:\n",
169 |     "        a = get_right_header(wholetable['대차대조표']).reset_index(drop=True)\n",
170 |     "        print(str(year)+'년: 대차대조표')\n",
171 |     "    elif '연결 재무상태표' in wholetable_sheets:\n",
172 |     "        a = get_right_header(wholetable['연결 재무상태표']).reset_index(drop=True)\n",
173 |     "        print(str(year)+'년: 연결 재무상태표')\n",
174 |     "    elif '재무상태표' in wholetable_sheets:\n",
175 |     "        a = get_right_header(wholetable['재무상태표']).reset_index(drop=True)\n",
176 |     "        print(str(year)+'년: 재무상태표')\n",
177 |     "    else:\n",
178 |     "        b = wholetable_sheets[1]\n",
179 |     "        print(str(year)+\"년: \"+b+\"(?)\")\n",
180 |     "        a = get_right_header(wholetable[b]).reset_index(drop=True)\n",
181 |     "    return clean_table(a)"
182 |    ]
183 |   },
184 |   {
185 |    "cell_type": "code",
186 |    "execution_count": null,
187 |    "metadata": {},
188 |    "outputs": [],
189 |    "source": [
190 |     "getall = pd.DataFrame()\n",
191 |     "#Using RegEx, get the number value from expressions such as '제 xx 기'\n",
192 |     "#RegEx를 이용해서 '제 XX 기' (또는 '제XX기')로 명시된 기수를 숫자로 따로 추출합니다\n",
193 |     "r = re.compile(r'(\\d+)')\n",
194 |     "year_tracker = int(first_year)\n",
195 |     "for item in linklist:\n",
196 |     "    table = pd.read_excel(item, sheetname=None)\n",
197 |     "    a = get_jaemu(table,year_tracker)\n",
198 |     "    year_tracker+=1\n",
199 |     "    melted = pd.melt(a, id_vars='category')\n",
200 |     "    #RegEx 이용 부분...\n",
201 |     "    melted['variable'] = melted.variable.apply(lambda x: r.findall(x)[0])\n",
202 |     "    #Append to the getall DataFrame created above | 위에서 만든 getall DataFrame에다가 append\n",
203 |     "    getall = getall.append(melted).reset_index(drop=True).drop_duplicates()"
204 |    ]
205 |   },
206 |   {
207 |    "cell_type": "code",
208 |    "execution_count": null,
209 |    "metadata": {
210 |     "collapsed": true
211 |    },
212 |    "outputs": [],
213 |    "source": [
214 |     "#Fixing the datatype (monetary amounts are float, counting numbers are integers)\n",
215 |     "#Datatype 수정 (금액은 float, 기수는 int로 변경)\n",
216 |     "getall['value'] = getall.value.astype(float)\n",
217 |     "getall['variable'] = getall.variable.astype(int)"
218 |    ]
219 |   },
220 |   {
221 |    "cell_type": "code",
222 |    "execution_count": null,
223 |    "metadata": {
224 |     "collapsed": true
225 |    },
226 |    "outputs": [],
227 |    "source": [
228 |     "#Correct the date information based on the counting numbers extracted above\n",
229 |     "#폴더명에서 추출한 최근 연도를 가지고 각 '기'에다가 더할 숫자를 구하고 (28기가 2016년이면 1988를 모든 '기'에 더해주면 연도가 나옴) 'variable' 열을 연도로 업데이트시켜줌\n",
230 |     "addnumber = int(current_year)-max(getall.variable)\n",
231 |     "getall['variable'] = getall.variable + addnumber"
232 |    ]
233 |   },
234 |   {
235 |    "cell_type": "code",
236 |    "execution_count": null,
237 |    "metadata": {
238 |     "collapsed": true
239 |    },
240 |    "outputs": [],
241 |    "source": [
242 |     "#Pivot into a dataframe format that's appropriate for drawing graphs\n",
243 |     "#pivot을 통해 그래프를 그리기 위한 적절한 형식의 dataframe를 만들어줌\n",
244 |     "jaemutable = pd.pivot_table(getall,index='variable',values='value',columns='category').rename(columns={'부채총계':'total_debt','자본총계':'total_equity'})"
245 |    ]
246 |   },
247 |   {
248 |    "cell_type": "code",
249 |    "execution_count": null,
250 |    "metadata": {},
251 |    "outputs": [],
252 |    "source": [
253 |     "jaemutable['debt-to-equity']=jaemutable.total_debt/jaemutable.total_equity"
254 |    ]
255 |   },
256 |   {
257 |    "cell_type": "code",
258 |    "execution_count": null,
259 |    "metadata": {
260 |     "scrolled": false
261 |    },
262 |    "outputs": [],
263 |    "source": [
264 |     "#Draw the graph!\n",
265 |     "#그래프 그리기\n",
266 |     "print(inputcompany+\" debt and equity trend:\")\n",
267 |     "sns.set_context('poster')\n",
268 |     "plt.figure(figsize=(16,9))\n",
269 |     "jaemutable[['total_debt','total_equity']].plot.bar(stacked=True,ax=plt.gca(),colormap='tab10')\n",
270 |     "plt.xlabel('')\n",
271 |     "sns.despine()"
272 |    ]
273 |   },
274 |   {
275 |    "cell_type": "code",
276 |    "execution_count": null,
277 |    "metadata": {},
278 |    "outputs": [],
279 |    "source": [
280 |     "#Draw the graph!\n",
281 |     "#그래프 그리기\n",
282 |     "print(inputcompany+\" debt-to-equity ratio:\")\n",
283 |     "sns.set_context('poster')\n",
284 |     "plt.figure(figsize=(16,9))\n",
285 |     "jaemutable['debt-to-equity'].plot.bar(stacked=True,ax=plt.gca(),colormap='tab10')\n",
286 |     "plt.xlabel('')\n",
287 |     "sns.despine()"
288 |    ]
289 |   },
290 |   {
291 |    "cell_type": "markdown",
292 |    "metadata": {},
293 |    "source": [
294 |     "## 2. Crawl the income statement\n",
295 |     "## 2. 손익계산서 크롤링"
296 |    ]
297 |   },
298 |   {
299 |    "cell_type": "code",
300 |    "execution_count": null,
301 |    "metadata": {
302 |     "collapsed": true
303 |    },
304 |    "outputs": [],
305 |    "source": [
306 |     "#A function that organizes and cleans up the income statement\n",
307 |     "#Keeps the first sales row (영업수익, 매출액 or 매출) and the first operating-profit row (영업이익)\n",
308 |     "def clean_sonik(df):\n",
309 |     "    #Changing the first column name to category\n",
310 |     "    oldname = df.columns[0]\n",
311 |     "    clean = df.rename(columns={oldname:'category'})\n",
312 |     "    clean['category'] = clean.category.apply(str.strip)\n",
313 |     "    indexlist = []\n",
314 |     "    for index,row in clean.iterrows():\n",
315 |     "        if row.category.find('영업수익') != -1:\n",
316 |     "            indexlist.append(index)\n",
317 |     "        if row.category.find('매출액') != -1:\n",
318 |     "            indexlist.append(index)\n",
319 |     "        if row.category.find('매출') != -1:\n",
320 |     "            indexlist.append(index)\n",
321 |     "    if len(indexlist) != 1:\n",
322 |     "        indexlist = indexlist[:1]\n",
323 |     "    for index,row in clean.iterrows():\n",
324 |     "        if row.category.find('영업이익') != -1:\n",
325 |     "            indexlist.append(index)\n",
326 |     "    clean2 = clean.loc[indexlist].copy()\n",
327 |     "    #TODO: What to do when there are two sales numbers (edge cases)\n",
328 |     "    #TODO: when 매출액 appears more than once, the first occurrence should be the one used\n",
329 |     "    if len(clean2) !=2:\n",
330 |     "        clean2 = clean2[:2]\n",
331 |     "    return clean2"
332 |    ]
333 |   },
334 |   {
335 |    "cell_type": "code",
336 |    "execution_count": null,
337 |    "metadata": {
338 |     "collapsed": true
339 |    },
340 |    "outputs": [],
341 |    "source": [
342 |     "def get_sonik(wholetable, year):\n",
343 |     "    wholetable_sheets = list(wholetable.keys())\n",
344 |     "    if '손익계산서' in wholetable_sheets:\n",
345 |     "        a = get_right_header(wholetable['손익계산서']).reset_index(drop=True)\n",
346 |     "        print(str(year)+': using sheet 손익계산서')\n",
347 |     "    elif '포괄손익계산서' in wholetable_sheets:\n",
348 |     "        a = get_right_header(wholetable['포괄손익계산서']).reset_index(drop=True)\n",
349 |     "        print(str(year)+': using sheet 포괄손익계산서')\n",
350 |     "    elif '연결 포괄손익계산서' in wholetable_sheets:\n",
351 |     "        a = get_right_header(wholetable['연결 포괄손익계산서']).reset_index(drop=True)\n",
352 |     "        print(str(year)+': using sheet 연결 포괄손익계산서')\n",
353 |     "    else:\n",
354 |     "        match = next((s for s in wholetable_sheets if '손익계산서' in s), None)\n",
355 |     "        if match is None:\n",
356 |     "            print(str(year)+': no sheet containing 손익계산서 was found')\n",
357 |     "            a = pd.DataFrame()\n",
358 |     "        else:\n",
359 |     "            a = get_right_header(wholetable[match]).reset_index(drop=True)\n",
360 |     "            print(str(year)+': using sheet '+match)\n",
361 |     "    return clean_sonik(a)"
362 |    ]
363 |   },
364 |   {
365 |    "cell_type": "code",
366 |    "execution_count": null,
367 |    "metadata": {},
368 |    "outputs": [],
369 |    "source": [
370 |     "#Create an empty DataFrame to collect every year's figures in one place\n",
371 |     "#Each file's rows are appended to it inside the loop below\n",
372 |     "soniksummary = pd.DataFrame()\n",
373 |     "#Use a regex to pull the term number out of headers written as '제 XX 기' (or '제XX기')\n",
374 |     "r = re.compile(r'(\\d+)')\n",
375 |     "#For every folder that holds a financial-statement Excel file, load and clean the data with the functions defined above\n",
376 |     "year_tracker = int(first_year)\n",
377 |     "for item in linklist:\n",
378 |     "    table = pd.read_excel(item, sheetname=None)\n",
379 |     "    get_table = get_sonik(table, year_tracker)\n",
380 |     "    year_tracker+=1\n",
381 |     "    #Tidy the data by turning the term headers into a single column (using melt)\n",
382 |     "    melted = pd.melt(get_table, id_vars='category')\n",
383 |     "    #Apply the regex so that only the term number remains\n",
384 |     "    melted['variable'] = melted.variable.apply(lambda x: r.findall(x)[0])\n",
385 |     "    #Append to the soniksummary DataFrame created above\n",
386 |     "    soniksummary = soniksummary.append(melted).reset_index(drop=True).drop_duplicates()"
387 |    ]
388 |   },
389 |   {
390 |    "cell_type": "code",
391 |    "execution_count": null,
392 |    "metadata": {
393 |     "collapsed": true
394 |    },
395 |    "outputs": [],
396 |    "source": [
397 |     "def rename_values(x):\n",
398 |     "    if (x.find('매출') >= 0) or (x.find('영업수익') >=0):\n",
399 |     "        return 'sales'\n",
400 |     "    if x.find('영업이익') >=0:\n",
401 |     "        return 'operating_profit'\n",
402 |     "    return x"
403 |    ]
404 |   },
405 |   {
406 |    "cell_type": "code",
407 |    "execution_count": null,
408 |    "metadata": {
409 |     "collapsed": true
410 |    },
411 |    "outputs": [],
412 |    "source": [
413 |     "soniksummary['category'] = soniksummary.category.apply(rename_values)"
414 |    ]
415 |   },
416 |   {
417 |    "cell_type": "code",
418 |    "execution_count": null,
419 |    "metadata": {
420 |     "collapsed": true,
421 |     "scrolled": false
422 |    },
423 |    "outputs": [],
424 |    "source": [
425 |     "#Fix the datatypes (amounts to float, term numbers to int)\n",
426 |     "soniksummary['value'] = soniksummary.value.astype(float)\n",
427 |     "soniksummary['variable'] = soniksummary.variable.astype(int)"
428 |    ]
429 |   },
430 |   {
431 |    "cell_type": "code",
432 |    "execution_count": null,
433 |    "metadata": {
434 |     "collapsed": true
435 |    },
436 |    "outputs": [],
437 |    "source": [
438 |     "#Take the most recent year (from the folder name) and subtract the largest term number to get an offset (e.g. if term 28 is 2016, the offset is 1988), then add it to every term so 'variable' holds years\n",
439 |     "addnumber = int(current_year)-max(soniksummary.variable)\n",
440 |     "soniksummary['variable'] = soniksummary.variable + addnumber"
441 |    ]
442 |   },
443 |   {
444 |    "cell_type": "code",
445 |    "execution_count": null,
446 |    "metadata": {
447 |     "collapsed": true
448 |    },
449 |    "outputs": [],
450 |    "source": [
451 |     "#Pivot into a dataframe shaped for drawing graphs\n",
452 |     "sonikpivot = pd.pivot_table(soniksummary,index='variable',values='value',columns='category')"
453 |    ]
454 |   },
455 |   {
456 |    "cell_type": "code",
457 |    "execution_count": null,
458 |    "metadata": {
459 |     "collapsed": true
460 |    },
461 |    "outputs": [],
462 |    "source": [
463 |     "sonikpivot['operating_margin']=sonikpivot.operating_profit/sonikpivot.sales"
464 |    ]
465 |   },
466 |   {
467 |    "cell_type": "code",
468 |    "execution_count": null,
469 |    "metadata": {
470 |     "scrolled": false
471 |    },
472 |    "outputs": [],
473 |    "source": [
474 |     "#Draw the graph\n",
475 |     "#Side-by-side bars of operating profit and sales for each year\n",
476 |     "print(inputcompany+\" sales and operating profit trend:\")\n",
477 |     "sns.set_context('poster')\n",
478 |     "plt.figure(figsize=(16,9))\n",
479 |     "sonikpivot[['operating_profit','sales']].plot.bar(stacked=False,ax=plt.gca(),colormap='tab10')\n",
480 |     "plt.xlabel('')\n",
481 |     "sns.despine()"
482 |    ]
483 |   },
484 |   {
485 |    "cell_type": "code",
486 |    "execution_count": null,
487 |    "metadata": {},
488 |    "outputs": [],
489 |    "source": [
490 |     "#Draw the graph\n",
491 |     "#Bar chart of the operating margin for each year\n",
492 |     "print(inputcompany+\" operating margin: \")\n",
493 |     "sns.set_context('poster')\n",
494 |     "plt.figure(figsize=(16,9))\n",
495 |     "sonikpivot['operating_margin'].plot.bar(stacked=False,ax=plt.gca(),colormap='tab10')\n",
496 |     "plt.xlabel('')\n",
497 |     "sns.despine()"
498 |    ]
499 |   },
500 |   {
501 |    "cell_type": "code",
502 |    "execution_count": null,
503 |    "metadata": {
504 |     "collapsed": true
505 |    },
506 |    "outputs": [],
507 |    "source": []
508 |   }
509 |  ],
510 |  "metadata": {
511 |   "kernelspec": {
512 |    "display_name": "Python 3",
513 |    "language": "python",
514 |    "name": "python3"
515 |   },
516 |   "language_info": {
517 |    "codemirror_mode": {
518 |     "name": "ipython",
519 |     "version": 3
520 |    },
521 |    "file_extension": ".py",
522 |    "mimetype": "text/x-python",
523 |    "name": "python",
524 |    "nbconvert_exporter": "python",
525 |    "pygments_lexer": "ipython3",
526 |    "version": "3.6.2"
527 |   }
528 |  },
529 |  "nbformat": 4,
530 |  "nbformat_minor": 2
531 | }
532 | 
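
The notebook above turns fiscal-term headers such as '제 28 기' into calendar years by extracting the number and adding one offset to every term. The standalone sketch below reproduces only that arithmetic. `term_to_year` and its parameters are hypothetical names that do not appear in the notebook, and the sketch assumes, as the notebook does, that the largest term number belongs to the most recent year.

```python
import re

# Matches the first run of digits in a header such as '제 28 기' or '제28기'.
TERM_PATTERN = re.compile(r'(\d+)')


def term_to_year(headers, latest_year):
    """Map each fiscal-term header to a calendar year (hypothetical helper)."""
    terms = [int(TERM_PATTERN.findall(header)[0]) for header in headers]
    # Assumption: the largest term number corresponds to latest_year.
    offset = latest_year - max(terms)
    return {header: term + offset for header, term in zip(headers, terms)}


years = term_to_year(['제 28 기', '제 27 기', '제26기'], 2016)
print(years)
```

With term 28 pinned to 2016 the offset is 1988, so terms 27 and 26 map to 2015 and 2014. The approach relies on terms being annual and consecutive; a company that changed its fiscal year end would need a different mapping.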


--------------------------------------------------------------------------------
/pics/kakao_debttoequityratio.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/kakao_debttoequityratio.jpg


--------------------------------------------------------------------------------
/pics/kakao_operatingmargin.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/kakao_operatingmargin.jpg


--------------------------------------------------------------------------------
/pics/kakao_operatingprofit_sales.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/kakao_operatingprofit_sales.jpg


--------------------------------------------------------------------------------
/pics/kakao_totaldebt_totalequity.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/kakao_totaldebt_totalequity.jpg


--------------------------------------------------------------------------------
/pics/samsung_debt_equity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/samsung_debt_equity.png


--------------------------------------------------------------------------------
/pics/samsung_sales_profit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/seoweon/dart_reports/6bfd485699d4ecbb00d3efbf825cbb0dc5f040fe/pics/samsung_sales_profit.png


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | ipython
2 | pandas
3 | requests
4 | lxml
5 | fake_useragent
6 | matplotlib
7 | seaborn


--------------------------------------------------------------------------------
/requirements_full.txt:
--------------------------------------------------------------------------------
  1 | alabaster==0.7.10
  2 | anaconda-client==1.6.3
  3 | anaconda-navigator==1.6.4
  4 | anaconda-project==0.6.0
  5 | asn1crypto==0.22.0
  6 | astroid==1.5.3
  7 | astropy==2.0.2
  8 | awscli==1.11.75
  9 | Babel==2.5.0
 10 | backports.shutil-get-terminal-size==1.0.0
 11 | beautifulsoup4==4.6.0
 12 | bitarray==0.8.1
 13 | bkcharts==0.2
 14 | blaze==0.10.1
 15 | bleach==1.5.0
 16 | bokeh==0.12.7
 17 | boto==2.48.0
 18 | botocore==1.5.38
 19 | Bottleneck==1.2.1
 20 | cairocffi==0.8.0
 21 | CairoSVG==2.0.2
 22 | certifi==2016.2.28
 23 | cffi==1.10.0
 24 | chardet==3.0.4
 25 | chest==0.2.3
 26 | click==6.7
 27 | cloudpickle==0.4.0
 28 | clyent==1.2.2
 29 | colorama==0.3.9
 30 | comtypes==1.1.2
 31 | configobj==5.0.6
 32 | contextlib2==0.5.5
 33 | cryptography==1.8.1
 34 | cssselect==1.0.1
 35 | cycler==0.10.0
 36 | Cython==0.26
 37 | cytoolz==0.8.2
 38 | dask==0.15.2
 39 | datashape==0.5.4
 40 | decorator==4.1.2
 41 | dill==0.2.6
 42 | Django>=1.11.15
 43 | django-countries==4.4
 44 | django-session-security==2.3.2
 45 | django-simple-menu==1.2.1
 46 | django-timezone-field==2.0
 47 | djangorestframework==3.6.2
 48 | docutils==0.14
 49 | entrypoints==0.2.3
 50 | et-xmlfile==1.0.1
 51 | fake-useragent==0.1.10
 52 | fastcache==1.0.2
 53 | flake8==3.3.0
 54 | Flask>=0.12.3
 55 | Flask-Cors==3.0.3
 56 | geopy==1.11.0
 57 | gevent==1.2.2
 58 | googlemaps==2.4.6
 59 | graphviz==0.5.2
 60 | greenlet==0.4.12
 61 | gunicorn==19.7.1
 62 | h5py==2.7.1
 63 | haversine==0.4.5
 64 | HeapDict==1.0.0
 65 | html5lib==0.999999999
 66 | httplib2==0.10.3
 67 | idna==2.6
 68 | imagesize==0.7.1
 69 | ipykernel==4.6.1
 70 | ipython==6.1.0
 71 | ipython-genutils==0.2.0
 72 | ipywidgets==7.1.2
 73 | iso8601==0.1.11
 74 | isort==4.2.15
 75 | itsdangerous==0.24
 76 | jdcal==1.3
 77 | jedi==0.10.2
 78 | Jinja2==2.9.6
 79 | jmespath==0.9.2
 80 | jsonschema==2.6.0
 81 | jupyter==1.0.0
 82 | jupyter-client==5.1.0
 83 | jupyter-console==5.2.0
 84 | jupyter-core==4.3.0
 85 | jupyterlab==0.31.8
 86 | jupyterlab-launcher==0.10.5
 87 | lazy-object-proxy==1.3.1
 88 | llvmlite==0.20.0
 89 | locket==0.2.0
 90 | lxml==3.8.0
 91 | MarkupSafe==1.0
 92 | matplotlib==2.0.2
 93 | mccabe==0.6.1
 94 | menuinst==1.4.7
 95 | mistune==0.7.4
 96 | mpmath==0.19
 97 | multipledispatch==0.4.9
 98 | mysqlclient==1.3.10
 99 | nbconvert==5.2.1
100 | nbformat==4.4.0
101 | ndg-httpsclient==0.4.2
102 | networkx==1.11
103 | nltk==3.2.4
104 | nose==1.3.7
105 | notebook>=5.4.1
106 | ntlm-auth==1.0.4
107 | numba==0.35.0
108 | numexpr==2.6.2
109 | numpy==1.13.1
110 | numpydoc==0.7.0
111 | oauthlib==2.0.2
112 | odo==0.5.1
113 | olefile==0.44
114 | openpyxl==2.4.8
115 | ordereddict==1.1
116 | packaging==16.8
117 | pandas==0.20.3
118 | pandasql==0.7.3
119 | pandocfilters==1.4.2
120 | partd==0.3.8
121 | path.py==10.3.1
122 | pathlib2==2.3.0
123 | patsy==0.4.1
124 | pdfminer3k==1.3.1
125 | pep8==1.7.0
126 | pickleshare==0.7.4
127 | Pillow==5.0.0
128 | ply==3.10
129 | prompt-toolkit==1.0.15
130 | psutil==5.2.2
131 | py==1.4.34
132 | py-trello==0.9.0
133 | pyasn1==0.2.3
134 | pycodestyle==2.3.1
135 | pycosat==0.6.2
136 | pycparser==2.18
137 | pycurl==7.43.0
138 | pyflakes==1.6.0
139 | Pygments==2.2.0
140 | pylint==1.7.2
141 | pyOpenSSL>=17.5.0
142 | pyparsing==2.2.0
143 | PyPDF2==1.26.0
144 | Pyphen==0.9.4
145 | pyreadline==2.1
146 | pytest==3.2.1
147 | python-dateutil==2.6.1
148 | python-instagram==1.3.2
149 | pytz==2017.2
150 | PyWavelets==0.5.2
151 | pywin32==220
152 | PyYAML==3.12
153 | pyzmq==16.0.2
154 | QtAwesome==0.4.4
155 | qtconsole==4.3.1
156 | QtPy==1.3.1
157 | ratebeer==2.3.1
158 | requests>=2.20.0
159 | requests-ntlm==1.0.0
160 | requests-oauthlib==0.8.0
161 | rope-py3k==0.9.4.post1
162 | rsa==3.4.2
163 | s3transfer==0.1.10
164 | scikit-image==0.13.0
165 | scikit-learn==0.19.0
166 | scipy==0.19.1
167 | seaborn==0.8
168 | simplegeneric==0.8.1
169 | simplejson==3.11.1
170 | singledispatch==3.4.0.3
171 | six==1.10.0
172 | snowballstemmer==1.2.1
173 | sockjs-tornado==1.0.3
174 | sphinx==1.6.3
175 | sphinxcontrib-websupport==1.0.1
176 | spyder==3.2.3
177 | SQLAlchemy==1.1.13
178 | statsmodels==0.8.0
179 | sympy==1.1.1
180 | tables==3.4.2
181 | testpath==0.3
182 | tinycss==0.4
183 | toolz==0.8.2
184 | tornado==4.5.2
185 | tqdm==4.11.2
186 | traitlets==4.3.2
187 | unicodecsv==0.14.1
188 | wcwidth==0.1.7
189 | WeasyPrint==0.36
190 | webencodings==0.5
191 | Werkzeug==0.12.2
192 | widgetsnbextension==3.1.4
193 | win-unicode-console==0.5
194 | wincertstore==0.2
195 | wrapt==1.10.11
196 | xlrd==1.1.0
197 | XlsxWriter==0.9.8
198 | xlwings==0.11.4
199 | xlwt==1.3.0
200 | 


--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
 1 | #Source: https://github.com/cochoa0x1/mailgunbot/blob/master/mailgunbot/utils.py
 2 | 
 3 | import os
 4 | 
 5 | import requests
 6 | from requests.utils import unquote
 7 | 
 8 | 
 9 | def download_file(url, filename=None, chunk_size=512, directory=None, auth=None):
10 |     """Stream the content at `url` into `directory/filename`."""
11 |     # resolve the default directory at call time rather than at import time
12 |     if directory is None:
13 |         directory = os.getcwd()
14 | 
15 |     # if no filename is given, try and get it from the url
16 |     if not filename:
17 |         filename = unquote(url.split('/')[-1])
18 | 
19 |     full_name = os.path.join(directory, filename)
20 | 
21 |     # make the destination directory; exist_ok=True avoids the check-then-create race
22 |     os.makedirs(os.path.dirname(full_name) or '.', exist_ok=True)
23 | 
24 |     # the with-block closes the connection even if the download fails part-way
25 |     with requests.get(url, stream=True, auth=auth) as r:
26 |         r.raise_for_status()  # fail instead of saving an HTTP error page to disk
27 |         with open(full_name, 'wb') as f:
28 |             for chunk in r.iter_content(chunk_size=chunk_size):
29 |                 if chunk:  # filter out keep-alive new chunks
30 |                     f.write(chunk)
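
When no `filename` is passed, `download_file` takes the last path segment of the URL and percent-decodes it. The sketch below isolates that step so its behaviour is easy to see. It uses the standard library's `unquote` in place of `requests.utils.unquote` so it runs without third-party packages; `filename_from_url_path` is a hypothetical variant that is not part of `utils.py`, and the `example.com` URLs are placeholders.

```python
from urllib.parse import quote, unquote, urlsplit


def filename_from_url(url):
    """Same rule as the fallback in download_file: last segment, percent-decoded."""
    return unquote(url.split('/')[-1])


def filename_from_url_path(url):
    """Hypothetical variant: drop the query string before taking the last segment."""
    return unquote(urlsplit(url).path.split('/')[-1])


# A percent-encoded Korean file name is decoded back to readable text.
encoded_url = 'https://example.com/files/' + quote('재무제표.xls')
plain = filename_from_url(encoded_url)

# With a query string, the simple rule keeps the query as part of the name.
with_query = filename_from_url('https://example.com/files/report.xls?v=1')
path_only = filename_from_url_path('https://example.com/files/report.xls?v=1')

print(plain, with_query, path_only)
```

If the download URLs carry query parameters, passing `filename` explicitly is the simplest way to avoid names that contain `?`.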


--------------------------------------------------------------------------------