├── .gitignore
├── LICENSE.txt
├── README.md
├── __init__.py
├── __main__.py
├── companies_list.txt
├── document_group_section_search.json
├── img
    ├── home_depot_screenshots.png
    └── output_files_example_image.png
├── output_files_examples
    └── batch_0001
    │   └── 001
    │       ├── HD_0000354950_10K_20160131_Item1A_excerpt.txt
    │       ├── HD_0000354950_10K_20160131_Item1_excerpt.txt
    │       ├── HD_0000354950_10K_20160131_Item7A_excerpt.txt
    │       ├── HD_0000354950_10K_20160131_Item7_excerpt.txt
    │       ├── HD_0000354950_10K_20170129_Item1A_excerpt.txt
    │       ├── HD_0000354950_10K_20170129_Item1_excerpt.txt
    │       ├── HD_0000354950_10K_20170129_Item7A_excerpt.txt
    │       └── HD_0000354950_10K_20170129_Item7_excerpt.txt
├── requirements.txt
└── src
    ├── __init__.py
    ├── control.py
    ├── document.py
    ├── download.py
    ├── html_document.py
    ├── metadata.py
    ├── text_document.py
    └── utils.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Compiled source #
 2 | ###################
 3 | *.com
 4 | *.class
 5 | *.dll
 6 | *.exe
 7 | *.o
 8 | *.so
 9 | *.pyc
10 | 
11 | # Packages #
12 | ############
13 | # it's better to unpack these files and commit the raw source
14 | # git has its own built in compression methods
15 | *.7z
16 | *.dmg
17 | *.gz
18 | *.iso
19 | *.jar
20 | *.rar
21 | *.tar
22 | *.zip
23 | 
24 | # Logs and databases #
25 | ######################
26 | *.log
27 | *.sql
28 | *.sqlite
29 | 
30 | # OS generated files #
31 | ######################
32 | .DS_Store
33 | .DS_Store?
34 | ._*
35 | .Spotlight-V100
36 | .Trashes
37 | ehthumbs.db
38 | Thumbs.db
39 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # SEC EDGAR Text
  2 | The goal of this project is to download large numbers of company filings
  3 | from the SEC EDGAR service, extract key text sections of interest,
  4 | and store them in an easily accessible and readable format.
  5 | 
  6 | ![Home depot screenshots](img/home_depot_screenshots.png)
  7 | 
  8 | Download key text sections of SEC EDGAR company filings. Format, organise
  9 | and store the text excerpts ready for both automated processing (NLP) and
 10 | for human reading (spot-checking). Structured storage of text and
 11 | metadata, with logging of failed document analyses. Suitable for
 12 | automation of large-scale downloads, with flexibility to customise
 13 | which sections of the documents are extracted. Compatible with all
 14 | main EDGAR document formats from 1993 onwards, and easily adapted or
 15 | extended to extract different sections of interest in EDGAR filings.
 16 | 
 17 | Generally accurate in extracting text, but lots of room for improvement.
 18 | Comments and contributions welcome!
 19 | 
 20 | 
 21 | 
 22 | #### About the project
 23 | 
 24 | * I completed this project during a sabbatical from my job.
 25 | * I used [SEC-Edgar-Crawler](https://github.com/rahulrrixe/sec-edgar)
 26 | for initial ideas which helped this project.
 27 | * Thanks to my colleagues at Rosenberg Equities for help with an earlier
 28 | attempt to download EDGAR data.
 29 | 
 30 | ## Installation
 31 | Clone the repo, and install the packages in requirements.txt.
 32 | 
 33 |     git clone https://github.com/alions7000/SEC-EDGAR-text
 34 |     pip install -r SEC-EDGAR-text/requirements.txt
 35 | 
 36 | 
 37 | ## Usage
 38 | *Basic usage* This will download those 500 large US companies in the
 39 | included 'companies_list.txt' file. Run from the project folder,
 40 | and accept the default settings when prompted.
 41 | 
 42 |     python SEC-EDGAR-text
 43 | 
 44 | *Typical usage* There are several arguments available to choose which
 45 | companies download list ot use, which types of filings to download, where to
 46 | save the extracted documents, multiprocessing option, download rate
 47 | and so on. For example:
 48 | 
 49 |     python SEC-EDGAR-text --storage=/path/to/my_storage_location --start=20150101 --end=99991231 --filings=10-K --multiprocessing_cores=0 --traffic_limit_pause_ms=500
 50 | 
 51 | See module utils.py to see a full list of command line options.
 52 | 
 53 | To download of a full history of key sections from 10-K and 10-Q filings
 54 | for (most) US companies takes less than 40GB storage: around 1 million
 55 | text excerpt files, plus a similar number of metadata files.
 56 | 
 57 | 
 58 | 
 59 | 
 60 | ## Background
 61 | ### About EDGAR
 62 | 
 63 | *History and future of EDGAR, links to key SEC procedures documents etc.*
 64 | Electronic filing was mandatory from 1996.
 65 | 
 66 | 
 67 | ### Retrieving text data from EDGAR
 68 | 
 69 | ![Example of files download](img/output_files_example_image.png)
 70 | 
 71 | 
 72 | ### References
 73 | **Other packages**
 74 | 
 75 | Lots of open source projects have automated access to EDGAR filings.
 76 | A few access text data, like this one. Most focus on downloading whole
 77 | filing documents, financial statements information, or parsing
 78 | XBRL filings. This package aims to make access to large volumes of text
 79 | information easier and more consistent.
 80 | 
 81 | https://github.com/datasets/edgar A nice introduction to the EDGAR database
 82 | website.
 83 | https://github.com/eliangcs/pystock-crawler A project which also accesses
 84 | daily stock prices from Yahoo Finance
 85 | 
 86 | https://github.com/rahulrrixe/sec-edgar Download a list of filings and
 87 | save the complete document locally.
 88 | https://github.com/lukerosiak/pysec Django code for parsing XBRL documents.
 89 | https://github.com/eliangcs/pystock-crawler A project which also accesses
 90 | daily stock prices from Yahoo Finance
 91 | 
 92 | **Academic research**
 93 | 
 94 | Professor Bill McDonald, with collaborators including Prof Tim Loughran
 95 | has led much of the academic reserach into company filings' text
 96 | data in recent years. He much shares the approach
 97 | that he used for extracting EDGAR filings text data on his
 98 | [website](https://www3.nd.edu/~mcdonald/Word_Lists.html).
 99 | The approach to scraping the text data is somewhat different to that
100 | used in this project, but it has a similar goal, and the documentation
101 | gives a great introduction to the structure of the HTML filing documents.
102 | The website includes links to related research, and
103 | plenty of guidance on doing downstream research on EDGAR text documents.


--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 |     secedgartext: extract text from SEC corporate filings
3 |     Copyright (C) 2017  Alexander Ions
4 | 
5 |     You should have received a copy of the GNU General Public License
6 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
7 | """


--------------------------------------------------------------------------------
/__main__.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | """
 3 |     secedgartext: extract text from SEC corporate filings
 4 |     Copyright (C) 2017  Alexander Ions
 5 | 
 6 |     This program is free software: you can redistribute it and/or modify
 7 |     it under the terms of the GNU General Public License as published by
 8 |     the Free Software Foundation, either version 3 of the License, or
 9 |     (at your option) any later version.
10 | 
11 |     This program is distributed in the hope that it will be useful,
12 |     but WITHOUT ANY WARRANTY; without even the implied warranty of
13 |     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
14 |     GNU General Public License for more details.
15 | 
16 |     You should have received a copy of the GNU General Public License
17 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
18 | """
19 | 
20 | from src.control import Downloader
21 | from src.utils import logger, sql_cursor, sql_connection
22 | 
23 | def main():
24 |     try:
25 |         Downloader().download_companies(do_save_full_document=False)
26 |     except Exception:
27 |         # this makes sure that the full error message is recorded in
28 |         # the logger text file for the process
29 |         logger.exception("Fatal error in company downloading")
30 | 
31 |     # tidy up database before closing
32 |     sql_cursor.execute("delete from metadata where sec_cik like 'dummy%'")
33 |     sql_connection.close()
34 | 
35 | 
36 | if __name__ == '__main__':
37 |     main()
38 | 
39 | 
40 | 
41 | 
42 | 
43 | 
44 | 
45 | 
46 | 


--------------------------------------------------------------------------------
/companies_list.txt:
--------------------------------------------------------------------------------
  1 | # A list of some of the current largest US listed companies: SEC CIK code, followed by a ticker code (which is used for labelling the output files)
  2 | 354950 HD
  3 | 50104 TSO
  4 | 895421 MS
  5 | 89800 SHW
  6 | 1037868 AME
  7 | 850209 FL
  8 | 827054 MCHP
  9 | 277948 CSX
 10 | 100493 TSN
 11 | 4962 AXP
 12 | 1260221 TDG
 13 | 899689 VNO
 14 | 29989 OMC
 15 | 62709 MMC
 16 | 106535 WY
 17 | 51644 IPG
 18 | 1050915 PWR
 19 | 10456 BAX
 20 | 354908 FLIR
 21 | 1281761 RF
 22 | 100885 UNP
 23 | 1390777 BK
 24 | 65984 ETR
 25 | 1326160 DUK
 26 | 18230 CAT
 27 | 72333 JWN
 28 | 21665 CL
 29 | 1430602 SNI
 30 | 1492633 NLSN
 31 | 21271 VLO
 32 | 1341439 ORCL
 33 | 1120193 NDAQ
 34 | 76334 PH
 35 | 1024478 ROK
 36 | 1048911 FDX
 37 | 313616 DHR
 38 | 96021 SYY
 39 | 320335 TMK
 40 | 743988 XLNX
 41 | 813828 CBS
 42 | 859737 HOLX
 43 | 49826 ITW
 44 | 1140536 WLTW
 45 | 906107 EQR
 46 | 31791 PKI
 47 | 1001039 DIS
 48 | 63276 MAT
 49 | 886982 GS
 50 | 1393311 PSA
 51 | 92230 BBT
 52 | 35527 FITB
 53 | 315293 AON
 54 | 217346 TXT
 55 | 1067983 BRK/B
 56 | 1646383 CSRA
 57 | 815556 FAST
 58 | 63754 MKC
 59 | 940944 DRI
 60 | 1174922 WYNN
 61 | 100517 UAL
 62 | 36270 MTB
 63 | 1051470 CCI
 64 | 46080 HAS
 65 | 1659166 FTV
 66 | 60667 LOW
 67 | 97476 TXN
 68 | 754737 SCG
 69 | 823768 WM
 70 | 52988 JEC
 71 | 827052 EIX
 72 | 1063761 SPG
 73 | 55785 KMB
 74 | 80424 PG
 75 | 200406 JNJ
 76 | 6201 AAL
 77 | 920148 LH
 78 | 77476 PEP
 79 | 1039101 LLL
 80 | 814453 NWL
 81 | 891024 PDCO
 82 | 879101 KIM
 83 | 831001 C
 84 | 1410636 AWK
 85 | 21076 CLX
 86 | 42582 GT
 87 | 40533 GD
 88 | 1524472 XYL
 89 | 20286 CINF
 90 | 78239 PVH
 91 | 4281 ARNC
 92 | 1532063 ESRX
 93 | 1567892 MNK
 94 | 48465 HRL
 95 | 1012100 SEE
 96 | 936468 LMT
 97 | 1060391 RSG
 98 | 1043604 JNPR
 99 | 1489393 LYB
100 | 773840 HON
101 | 1004980 PCG
102 | 1618921 WBA
103 | 8818 AVY
104 | 50863 INTC
105 | 794323 LVLT
106 | 1275283 RAI
107 | 1361658 WYN
108 | 1136869 ZBH
109 | 27904 DAL
110 | 313927 CHD
111 | 1623613 MYL
112 | 899051 ALL
113 | 858877 CSCO
114 | 6951 AMAT
115 | 1002910 AEE
116 | 56873 KR
117 | 77360 PNR
118 | 1466258 IR
119 | 310158 MRK
120 | 1359841 HBI
121 | 350698 AN
122 | 79879 PPG
123 | 68505 MSI
124 | 723254 CTAS
125 | 1103982 MDLZ
126 | 12659 HRB
127 | 895648 GGP
128 | 86312 TRV
129 | 78003 PFE
130 | 104169 WMT
131 | 1144215 AYI
132 | 66740 MMM
133 | 1551182 ETN
134 | 1158449 AAP
135 | 1022079 DGX
136 | 915389 EMN
137 | 5272 AIG
138 | 59558 LNC
139 | 40987 GPC
140 | 832988 SIG
141 | 1418135 DPS
142 | 1510295 MPC
143 | 849399 SYMC
144 | 72903 XEL
145 | 764622 PNW
146 | 30625 FLS
147 | 6281 ADI
148 | 922224 PPL
149 | 1124198 FLR
150 | 1339947 VIAB
151 | 922864 AIV
152 | 9389 BLL
153 | 356028 CA
154 | 1037038 RL
155 | 877890 CTXS
156 | 1121788 GRMN
157 | 1067701 URI
158 | 1099219 MET
159 | 811156 CMS
160 | 1137774 PRU
161 | 7332 SWN
162 | 861878 SRCL
163 | 1365135 WU
164 | 820027 AMP
165 | 12927 BA
166 | 47111 HSY
167 | 72741 ES
168 | 5513 UNM
169 | 1637459 KHC
170 | 764180 MO
171 | 833444 JCI
172 | 723125 MU
173 | 107263 WMB
174 | 1506307 KMI
175 | 732717 T
176 | 37996 F
177 | 1530721 KORS
178 | 1013871 NRG
179 | 1534701 PSX
180 | 59478 LLY
181 | 875159 XL
182 | 39911 GPS
183 | 55067 K
184 | 27419 TGT
185 | 319201 KLAC
186 | 24545 TAP
187 | 7084 ADM
188 | 40704 GIS
189 | 886158 BBBY
190 | 60086 L
191 | 32604 EMR
192 | 783325 WEC
193 | 16732 CPB
194 | 106040 WDC
195 | 816761 TDC
196 | 1070750 HST
197 | 23217 CAG
198 | 764478 BBY
199 | 78814 PBI
200 | 320193 AAPL
201 | 788784 PEG
202 | 4904 AEP
203 | 318154 AMGN
204 | 24741 GLW
205 | 38777 BEN
206 | 1130310 CNP
207 | 1045609 PLD
208 | 1289490 EXR
209 | 1677703 CNDT
210 | 920760 LEN
211 | 47217 HPQ
212 | 1047862 ED
213 | 732712 VZ
214 | 1467858 GM
215 | 885639 KSS
216 | 1564708 NWS
217 | 1564708 NWSA
218 | 1636023 WRK
219 | 1002047 NTAP
220 | 1593034 ENDP
221 | 51434 IP
222 | 791519 SPLS
223 | 1267238 AIZ
224 | 4977 AFL
225 | 1137789 STX
226 | 18926 CTL
227 | 1031296 FE
228 | 1109357 EXC
229 | 874761 AES
230 | 765880 HCP
231 | 874766 HIG
232 | 1011006 YHOO
233 | 936340 DTE
234 | 794367 M
235 | 1274494 FSLR
236 | 1451505 RIG
237 | 20520 FTR
238 | 1385157 TEL
239 | 1137411 COL
240 | 1164727 NEM
241 | 912595 MAA
242 | 1593538 NAVI
243 | 1024305 COTY
244 | 882095 GILD
245 | 1645590 HPE
246 | 108772 XRX
247 | 21344 KO
248 | 63908 MCD
249 | 831259 FCX
250 | 1041061 YUM
251 | 101778 MRO
252 | 39899 TGNA
253 | 769397 ADSK
254 | 1358071 CXO
255 | 1336917 UAA
256 | 797468 OXY
257 | 1039684 OKE
258 | 1021860 NOV
259 | 1168054 XEC
260 | 1336917 UA
261 | 1018724 AMZN
262 | 87347 SLB
263 | 808362 BHI
264 | 33213 EQT
265 | 899866 ALXN
266 | 1065280 NFLX
267 | 1108524 CRM
268 | 796343 ADBE
269 | 1652044 GOOGL
270 | 1633917 PYPL
271 | 4447 HES
272 | 6769 APA
273 | 872589 REGN
274 | 72207 NBL
275 | 816284 CELG
276 | 1090012 DVN
277 | 1075531 PCLN
278 | 1403568 ULTA
279 | 773910 APC
280 | 1373835 SE
281 | 1086222 AKAM
282 | 1087423 RHT
283 | 1324424 EXPE
284 | 1058090 CMG
285 | 1045810 NVDA
286 | 1403161 V
287 | 34088 XOM
288 | 1526520 TRIP
289 | 865752 MNST
290 | 316709 SCHW
291 | 707549 LRCX
292 | 1141391 MA
293 | 1110803 ILMN
294 | 1101239 EQIX
295 | 29915 DOW
296 | 858470 COG
297 | 45012 HAL
298 | 895126 CHK
299 | 1038357 PXD
300 | 821189 EOG
301 | 875320 VRTX
302 | 1326801 FB
303 | 315852 RRC
304 | 46765 HP
305 | 882184 DHI
306 | 19617 JPM
307 | 93410 CVX
308 | 51143 IBM
309 | 912750 NFX
310 | 1101215 ADS
311 | 1020569 IRM
312 | 1035267 ISRG
313 | 1004434 AMG
314 | 1396009 VMC
315 | 1058290 CTSH
316 | 829224 SBUX
317 | 1135152 FTI
318 | 896878 INTU
319 | 874716 IDXX
320 | 1048286 MAR
321 | 1071739 CNC
322 | 728535 JBHT
323 | 1604778 QRVO
324 | 789019 MSFT
325 | 1099800 EW
326 | 804753 CERN
327 | 40545 GE
328 | 320187 NKE
329 | 1678531 EVHC
330 | 92122 SO
331 | 1551152 ABBV
332 | 1123360 GPN
333 | 822416 PHM
334 | 800459 HAR
335 | 75362 PCAR
336 | 202058 HRS
337 | 916076 MLM
338 | 766421 ALK
339 | 80661 PGR
340 | 30554 DD
341 | 731766 UNH
342 | 8670 ADP
343 | 1601712 SYF
344 | 29534 DG
345 | 354190 AJG
346 | 1364742 BLK
347 | 914208 IVZ
348 | 915912 AVB
349 | 1111711 NI
350 | 1324404 CF
351 | 915913 ALB
352 | 717423 MUR
353 | 1437107 DISCK
354 | 1015780 ETFC
355 | 701221 CI
356 | 33185 EFX
357 | 1800 ABT
358 | 723531 PAYX
359 | 1166691 CMCSA
360 | 882835 ROP
361 | 1578845 AGN
362 | 909832 COST
363 | 1519751 FBHS
364 | 203527 VAR
365 | 16918 STZ
366 | 885725 BSX
367 | 4127 SWKS
368 | 875045 BIIB
369 | 1649338 AVGO
370 | 898173 ORLY
371 | 109198 TJX
372 | 1138118 CBG
373 | 1467373 ACN
374 | 916365 TSCO
375 | 315189 DE
376 | 745732 ROST
377 | 49196 HBAN
378 | 352915 UHS
379 | 712515 EA
380 | 759944 CFG
381 | 912242 MAC
382 | 1156039 ANTM
383 | 920522 ESS
384 | 91419 SJM
385 | 1053507 AMT
386 | 1170010 KMX
387 | 815097 CCL
388 | 1001250 EL
389 | 1059556 MCO
390 | 64803 CVS
391 | 1032208 SRE
392 | 1297996 DLR
393 | 1110783 MON
394 | 1091667 CHTR
395 | 1442145 VRSK
396 | 54480 KSU
397 | 746515 EXPD
398 | 1585364 PRGO
399 | 1378946 PBCT
400 | 1308161 FOX
401 | 1308161 FOXA
402 | 310764 SYK
403 | 96223 LUK
404 | 1555280 ZTS
405 | 1048695 FFIV
406 | 1133421 NOC
407 | 1521332 DLPH
408 | 1285785 MOS
409 | 721683 TSS
410 | 34903 FRT
411 | 1122304 AET
412 | 935703 DLTR
413 | 804328 QCOM
414 | 1037540 BXP
415 | 51253 IFF
416 | 818479 XRAY
417 | 711404 COO
418 | 2969 APD
419 | 109380 ZION
420 | 9892 BCR
421 | 14693 BF/B
422 | 49071 HUM
423 | 798354 FISV
424 | 701985 LB
425 | 64040 SPGI
426 | 1000228 HSIC
427 | 28412 CMA
428 | 74208 UDR
429 | 37785 FMC
430 | 73309 NUE
431 | 753308 NEE
432 | 766704 HCN
433 | 718877 ATVI
434 | 1000697 WAT
435 | 1043277 CHRW
436 | 106640 WHR
437 | 73124 NTRS
438 | 1065088 EBAY
439 | 1413329 PM
440 | 927628 COF
441 | 1163165 COP
442 | 1105705 TWX
443 | 866787 AZO
444 | 91440 SNA
445 | 865436 WFM
446 | 851968 MHK
447 | 715957 D
448 | 721371 CAH
449 | 791907 LLTC
450 | 1156375 CME
451 | 1065696 LKQ
452 | 1113169 TROW
453 | 97745 TMO
454 | 1116132 COH
455 | 101829 UTX
456 | 14272 BMY
457 | 36104 USB
458 | 93556 SWK
459 | 1140859 ABC
460 | 820313 APH
461 | 884887 RCL
462 | 927066 DVA
463 | 92380 LUV
464 | 1037646 MTD
465 | 927653 MCK
466 | 750556 STI
467 | 85961 R
468 | 912615 URBN
469 | 70858 BAC
470 | 896159 CB
471 | 740260 VTR
472 | 1115222 DNB
473 | 1090872 A
474 | 10795 BDX
475 | 884905 PX
476 | 1126328 PFG
477 | 860730 HCA
478 | 1613103 MDT
479 | 352541 LNT
480 | 908255 BWA
481 | 1040971 SLG
482 | 62996 MAS
483 | 713676 PNC
484 | 277135 GWW
485 | 103379 VFC
486 | 98246 TIF
487 | 1571949 ICE
488 | 1579241 ALLE
489 | 58492 LEG
490 | 315213 RHI
491 | 29905 DOV
492 | 1136893 FIS
493 | 1090727 UPS
494 | 26172 CMI
495 | 1452575 MJN
496 | 1014473 VRSN
497 | 1393612 DFS
498 | 1047122 RTN
499 | 31462 ECL
500 | 726728 O
501 | 72971 WFC
502 | 793952 HOG
503 | 91576 KEY
504 | 702165 NSC
505 | 93751 STT
506 | 


--------------------------------------------------------------------------------
/document_group_section_search.json:
--------------------------------------------------------------------------------
  1 | {
  2 |     "10-K": [
  3 |         {
  4 |             "itemname": "Item1",
  5 |             "txt": [
  6 |                 {
  7 |                     "start": "\n\\s{,40}(?:PART.{,40})?Item_1.{,10}Business.{,39}?\n",
  8 |                     "end": "\n\\s{,40}(?:PART.{,40})?ITEM_(?:1A|2).{,10}(?:Risk_Factors|Properties).{,39}?\n"
  9 |                 }
 10 |             ],
 11 |             "html": [
 12 |                 {
 13 |                     "start": "\n_(?:PART.{,40})?Item_1.{,10}Business",
 14 |                     "end": "\n_(?:PART.{,40})?Item_(?:1A|2).{,10}(?:Risk_Factors|Properties).{,99}?\n"
 15 |                 }
 16 |             ]
 17 |         },
 18 |         {
 19 |             "itemname": "Item1A",
 20 |             "txt": [
 21 |                 {
 22 |                     "start": "\n\\s{,40}(?:PART.{,40})?Item_1.{,10}Risk_Factors.{,39}?\n",
 23 |                     "end": "\n\\s{,40}(?:PART.{,40})?ITEM_2.{,10}Properties.{,39}?\n"
 24 |                 },
 25 |                 {
 26 |                     "start": "\n\\s{,40}(?:PART.{,40})?Item_1.{,10}Risk_Factors",
 27 |                     "end": "\n\\s{,40}(?:PART.{,40})?ITEM_[2-9].{,39}?\n"
 28 |                 }
 29 |             ],
 30 |             "html": [
 31 |                 {
 32 |                     "start": "\n_(?:PART.{,40})?Item_1.{,10}Risk_Factors",
 33 |                     "end": "\n_(?:PART.{,40})?Item_2.{,10}Properties.{,99}?\n"
 34 |                 },
 35 |                 {
 36 |                     "start": "\n_(?:PART.{,40})?Item_1.{,10}Risk_Factors",
 37 |                     "end": "\n_(?:PART.{,40})?Item_[2-9].{,99}?\n"
 38 |                 },
 39 |                 {
 40 |                     "start": "\n_(?:PART.{,40})?Item_1\\(?A",
 41 |                     "end": "\n_(?:PART.{,40})?Item_[2-9].{,99}?\n"
 42 |                 }
 43 |             ]
 44 |         },
 45 |         {
 46 |             "itemname": "Item7",
 47 |             "txt": [
 48 |                 {
 49 |                     "start": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_7.{,40})?MANAGEMENT.{,5}DISCUSSION_AND.{,69}?\n",
 50 |                     "end": "\n\\s{,40}(?:PART.{,40})?ITEM_(?:7A|8).{,99}?\n"
 51 |                 },
 52 |                 {
 53 |                     "start": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_7.{,40})?MANAGEMENT.{,5}DISCUSSION_AND.{,69}?\n",
 54 |                     "end": "\n\\s{,40}(?:PART.{,40})?(?:INDEX_TO.{,20})?(?:QUANTITATIVE_AND_QUALITATIVE_DISCLOSURES|FINANCIAL_STATEMENTS|CONSOLIDATED_BALANCE|CONSOLIDATED_STATEMENTS).{,69}?\n"
 55 |                 }
 56 |             ],
 57 |             "html": [
 58 |                 {
 59 |                     "start": "\n_(?:PART.{,40})?_Item_7",
 60 |                     "end": "\n_(?:PART.{,40})?_Item_(?:7A|8).{,99}?\n"
 61 |                 },
 62 |                 {
 63 |                     "start": "\n(?:PART.{,40})?MANAGEMENT.{,5}DISCUSSION_AND",
 64 |                     "end": "\n(?:PART.{,40})?(?:INDEX_TO.{,20})?(?:QUANTITATIVE_AND_QUALITATIVE_DISCLOSURES|FINANCIAL_STATEMENTS|CONSOLIDATED_BALANCE|CONSOLIDATED_STATEMENTS?).{,99}?\n"
 65 |                 }
 66 |             ]
 67 |         },
 68 |         {
 69 |             "itemname": "Item7A",
 70 |             "txt": [
 71 |                 {
 72 |                     "start": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_7A.{,40})?QUANTITATIVE_AND_QUALITATIVE_DISCLOSURES.{,69}?\n",
 73 |                     "end": "\n\\s{,40}(?:PART.{,40})?ITEM_8.{,99}?\n"
 74 |                 },
 75 |                 {
 76 |                     "start": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_7A.{,40})?QUANTITATIVE_AND_QUALITATIVE_DISCLOSURES.{,69}?\n",
 77 |                     "end": "\n\\s{,40}(?:PART.{,40})?(?:INDEX_TO.{,20})?(?:FINANCIAL_STATEMENTS|CONSOLIDATED_BALANCE|CONSOLIDATED_STATEMENTS).{,69}?\n"
 78 |                 }
 79 |             ],
 80 |             "html": [
 81 |                 {
 82 |                     "start": "\n_(?:PART.{,40})?_Item_7A",
 83 |                     "end": "\n_(?:PART.{,40})?_Item_8.{,99}?\n"
 84 |                 },
 85 |                 {
 86 |                     "start": "\n(?:PART.{,40})?QUANTITATIVE_AND_QUALITATIVE_DISCLOSURES",
 87 |                     "end": "\n(?:PART.{,40})?(?:INDEX_TO.{,20})?(?:FINANCIAL_STATEMENTS|CONSOLIDATED_BALANCE|CONSOLIDATED_STATEMENTS?).{,99}?\n"
 88 |                 }
 89 |             ]
 90 |         }
 91 | 
 92 |     ],
 93 | 
 94 |     "EX-13": [
 95 |         {
 96 |             "itemname": "Exhibit13",
 97 |             "txt": [
 98 |                 {
 99 |                     "start": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_7.{,40})?Management.{,5}Discussion.{,69}?\n",
100 |                     "end": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_8.{,40})?(?:Financial_Statement|Consolidated_Statements?_of|Significant_Accounting|Cautionary_Language|Report_of_Independent).{,69}?\n"
101 |                 },
102 |                 {
103 |                     "start": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_7.{,40})?Financial_Review.{,69}?\n",
104 |                     "end": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_8.{,40})?(?:Financial_Statement|Consolidated_Statements?_of|Significant_Accounting|Cautionary_Language|Report_of_Independent).{,69}?\n"
105 |                 }
106 |             ],
107 |             "html": [
108 |                 {
109 |                     "start": "\n_(?:PART.{,40})?(?:ITEM_7.{,40})?(?:Management.{,5}Discussion|Financial_Review)",
110 |                     "end": "\n_(?:PART.{,40})?(?:ITEM_8.{,40})?(?:Financial_Statement|Consolidated_Statements?_of|Significant_Accounting|Cautionary_Language|Report_of_Independent).{,99}?\n"
111 |                 }
112 |             ]
113 |         }
114 |     ],
115 | 
116 |     "10-Q": [
117 |         {
118 |             "itemname": "Item1A",
119 |             "txt": [
120 |                 {
121 |                     "start": "\n\\s{,40}(?:PART.{,40})?ITEM_1.{,10}Risk_Factors.{,39}?\n",
122 |                     "end": "\n\\s{,40}(?:PART.{,40})?ITEM_2.{,39}?\n"
123 |                 }
124 |             ],
125 |             "html": [
126 |                 {
127 |                     "start": "\n_(?:PART.{,40})?Item_1.{,10}Risk_Factors",
128 |                     "end": "\n_(?:PART.{,40})?Item_2.{,10}Unregistered.{,99}?\n"
129 |                 },
130 |                 {
131 |                     "start": "\n_(?:PART.{,40})?Item_1.{,10}Risk_Factors",
132 |                     "end": "\n_(?:PART.{,40})?Item_\\d.{,99}?\n"
133 |                 },
134 |                 {
135 |                     "start": "\n_(?:PART.{,40})?Item_1\\(?A",
136 |                     "end": "\n_(?:PART.{,40})?Item_\\d.{,99}?\n"
137 |                 },
138 |                 {
139 |                     "start": "\n_(?:PART.{,40})?(?:Item_1.{,40})?.{,10}Risk_Factors",
140 |                     "end": "\n_(?:PART.{,40})?.{,10}(?:Item_[2-9]|Unregistered).{,99}?\n"
141 |                 }
142 |             ]
143 |         },
144 |         {
145 |             "itemname": "Item2",
146 |             "txt": [
147 |                 {
148 |                     "start": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_2.{,10})?Management.{,5}Discussion_and.{,69}?\n",
149 |                     "end": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_3|PART_II).{,99}?\n"
150 |                 },
151 |                 {
152 |                     "start": "\n\\s{,40}(?:PART.{,40})?(?:ITEM_2.{,10})?Management.{,5}Discussion_and.{,69}?\n",
153 |                     "end": "\n\\s{,40}(?:PART_II|Item_[3-9]).{,99}?\n"
154 |                 }
155 |             ],
156 |             "html": [
157 |                 {
158 |                     "start": "\n_(?:PART.{,40})?Items?_2.{,15}Management",
159 |                     "end": "\n_(?:PART.{,40})?(?:Item_[3-9]|Quantitative_and_Qualitative|Controls_and_Procedures|OTHER_INFORMATION|Legal_Proceedings).{,99}?\n"
160 |                 },
161 |                 {
162 |                     "start": "\n_(?:PART.{,40})?(?:Items?_2.{,15})?Management.{,5}Discussion_and_Analysis",
163 |                     "end": "\n_(?:PART.{,40})?(?:Item_[3-9]|Quantitative_and_Qualitative|Controls_and_Procedures|OTHER_INFORMATION|Legal_Proceedings).{,99}?\n"
164 |                 },
165 |                 {
166 |                     "start": "\n_(?:Financial_Review|Disclosure_Regarding_Forward|Executive_Overview|Business_Overview|Earnings_Performance).{,49}\n",
167 |                     "end": "\n_(?:PART.{,40})?(?:Item_[3-9]|Quantitative_and_Qualitative|Controls_and_Procedures|OTHER_INFORMATION|Legal_Proceedings).{,99}?\n"
168 |                 },
169 |                 {
170 |                     "start": "\n_(?:PART.{,40})?Items?_2(?!.{,10}Unregistered_Sales)",
171 |                     "end": "\n_(?:PART.{,40})?(?:Item_[3-9]|Quantitative_and_Qualitative|Controls_and_Procedures|OTHER_INFORMATION|Legal_Proceedings).{,99}?\n"
172 |                 }
173 |             ]
174 |         }
175 |     ],
176 | 
177 |     "EX-99": [
178 |         {
179 |             "itemname": "EX-99",
180 |             "txt": [
181 |                 {
182 |                     "start": "^",
183 |                     "end": "$"
184 |                 }
185 |             ],
186 |             "html": [
187 |                 {
188 |                     "start": "^",
189 |                     "end": "$"
190 |                 }
191 |             ]
192 |         }
193 | 
194 |     ],
195 |     "8-K": [
196 |         {
197 |             "itemname": "8-K",
198 |             "txt": [
199 |                 {
200 |                     "start": "^",
201 |                     "end": "$"
202 |                 }
203 |             ],
204 |             "html": [
205 |                 {
206 |                     "start": "^",
207 |                     "end": "$"
208 |                 }
209 |             ]
210 |         }
211 | 
212 |     ]
213 | }
214 | 


--------------------------------------------------------------------------------
/img/home_depot_screenshots.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alions7000/SEC-EDGAR-text/1acedca1babf79a2e3fba4cf88c02bd27b0b28e0/img/home_depot_screenshots.png


--------------------------------------------------------------------------------
/img/output_files_example_image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alions7000/SEC-EDGAR-text/1acedca1babf79a2e3fba4cf88c02bd27b0b28e0/img/output_files_example_image.png


--------------------------------------------------------------------------------
/output_files_examples/batch_0001/001/HD_0000354950_10K_20160131_Item1A_excerpt.txt:
--------------------------------------------------------------------------------
  1 | Item 1A. Risk Factors.
  2 | 
  3 | The risks and uncertainties described below could materially and adversely affect our business, financial condition and results of operations and could cause actual results to differ materially from our expectations and projections. You should read these Risk Factors in conjunction with "Management’s Discussion and Analysis of Financial Condition and Results of Operations" in Item 7 and our Consolidated Financial Statements and related notes in Item 8. There also may be other factors that we cannot anticipate or that are not described in this report generally because we do not currently perceive them to be material. Those factors could cause results to differ materially from our expectations.
  4 | 
  5 | Strong competition could adversely affect prices and demand for our products and services and could decrease our market share.
  6 | 
  7 | We operate in markets that are highly competitive. We compete principally based on customer service, price, store location and appearance, and quality, availability and assortment of merchandise. In each market we serve, there are a number of other home improvement stores, electrical, plumbing and building materials supply houses and lumber yards. With respect to some products and services, we also compete with specialty design stores, showrooms, discount stores, local, regional and national hardware stores, paint stores, mail order firms, warehouse clubs, independent building supply stores, MRO companies and, to a lesser extent, other retailers, as well as with installers of home improvement products. In addition, we face growing competition from online and multichannel retailers, some of whom may have a lower cost structure than ours, as our customers increasingly use computers, tablets, smartphones and other mobile devices to shop online and compare prices and products in real time. Intense competitive pressures from one or more of our competitors or our inability to adapt effectively and quickly to a changing competitive landscape could affect our prices, our margins or demand for our products and services. If we are unable to timely and appropriately respond to these competitive pressures, including through maintenance of customer service and customer relationships to deliver a superior customer experience, our market share and our financial performance could be adversely affected.
  8 | 
  9 |  7
 10 | We may not timely identify or effectively respond to consumer needs, expectations or trends, which could adversely affect our relationship with customers, our reputation, the demand for our products and services, and our market share.
 11 | 
 12 | The success of our business depends in part on our ability to identify and respond promptly to evolving trends in demographics; consumer preferences, expectations and needs; and unexpected weather conditions, while also managing appropriate inventory levels and maintaining an excellent customer experience. It is difficult to successfully predict the products and services our customers will demand. As the housing and home improvement market continues to recover, resulting changes in demand will put further pressure on our ability to meet customer needs and expectations and maintain high service levels. In addition, each of our primary customer groups – DIY, DIFM and Pro – have different needs and expectations, many of which evolve as the demographics in a particular customer group change. If we do not successfully differentiate the shopping experience to meet the individual needs and expectations of a customer group, we may lose market share with respect to those customers. 
 13 | 
 14 | Customer expectations about the methods by which they purchase and receive products or services are also becoming more demanding. Customers are increasingly using technology and mobile devices to rapidly compare products and prices, determine real-time product availability and purchase products. Once products are purchased, customers are seeking alternate options for delivery of those products, and they often expect quick and low-cost delivery. We must continually anticipate and adapt to these changes in the purchasing process. We have implemented programs like BOSS, BOPIS and direct fulfillment, and are rolling out BODFS, but we cannot guarantee that these programs or others we may implement will be implemented successfully or will meet customers’ needs and expectations. Customers are also using social media to provide feedback and information about our Company and products and services in a manner that can be quickly and broadly disseminated. To the extent a customer has a negative experience and shares it over social media, it may impact our brand and reputation. 
 15 | 
 16 | Further, we have an aging store base that requires maintenance and space reallocation initiatives to deliver the shopping environment that our customers desire. Failure to maintain our stores and utilize our store space effectively, to provide a compelling online presence, to timely identify or respond to changing consumer preferences, expectations and home improvement needs and to differentiate the customer experience for our three primary customer groups could adversely affect our relationship with customers, our reputation, the demand for our products and services, and our market share.
 17 | 
 18 | Our success depends upon our ability to attract, develop and retain highly qualified associates while also controlling our labor costs.
 19 | 
 20 | Our customers expect a high level of customer service and product knowledge from our associates. To meet the needs and expectations of our customers, we must attract, develop and retain a large number of highly qualified associates while at the same time controlling labor costs. Our ability to control labor costs is subject to numerous external factors, including prevailing wage rates and health and other insurance costs, as well as the impact of legislation or regulations governing labor relations, minimum wage, or healthcare benefits. An inability to provide wages and/or benefits that are competitive within the markets in which we operate could adversely affect our ability to retain and attract employees. In addition, we compete with other retail businesses for many of our associates in hourly positions, and we invest significant resources in training and motivating them to maintain a high level of job satisfaction. These positions have historically had high turnover rates, which can lead to increased training and retention costs, particularly if the economy continues to improve and employment opportunities increase. There is no assurance that we will be able to attract or retain highly qualified associates in the future.
 21 | 
 22 | We have incurred losses related to our Data Breach, and we are still in the process of determining the full impact of related government investigations and civil litigation on our results of operations, which could have an adverse impact on our operations, financial results and reputation.
 23 | 
 24 | The Data Breach involved the theft of certain payment card information and customer email addresses through unauthorized access to our systems. Since the Data Breach occurred, we have recorded $161 million of pretax expenses, net of expected insurance recoveries, in connection with the Data Breach, as described in more detail in Item 7, "Management’s Discussion and Analysis of Financial Condition and Results of Operation" and Note 13 to the Consolidated Financial Statements included in Item 8, "Financial Statements and Supplementary Data". We are facing putative class actions filed in the U.S. and Canada and a consolidated shareholder derivative action brought by two purported shareholders in the U.S., and other claims have been and may be asserted on behalf of customers, payment card issuing banks, shareholders, or others seeking damages or other related relief, allegedly arising out of the Data Breach. We are also facing investigations by a number of state and federal agencies. These claims and investigations may adversely affect how we operate our business, divert the attention of management from the operation of the business, have an adverse effect on our reputation, and result in additional costs and fines. In addition, the governmental agencies investigating the Data Breach may seek to impose injunctive relief, which could 
 25 | 
 26 |  8
 27 | materially increase our data security costs, adversely impact how we operate our systems and collect and use customer information, and put us at a competitive disadvantage with other retailers. 
 28 | 
 29 | If our efforts to maintain the privacy and security of customer, associate, supplier and Company information are not successful, we could incur substantial additional costs and reputational damage, and could become subject to further litigation and enforcement actions.
 30 | 
 31 | Our business, like that of most retailers, involves the receipt, storage and transmission of customers’ personal information, consumer preferences and payment card information, as well as confidential information about our associates, our suppliers and our Company, some of which is entrusted to third-party service providers and vendors. We also work with third-party service providers and vendors that provide technology, systems and services that we use in connection with the receipt, storage and transmission of this information. Our information systems, and those of our third-party service providers and vendors, are vulnerable to an increasing threat of continually evolving cybersecurity risks. Unauthorized parties may attempt to gain access to these systems or our information through fraud or other means of deceiving our associates, third-party service providers or vendors. Hardware, software or applications we develop or obtain from third parties may contain defects in design or manufacture or other problems that could unexpectedly compromise information security. The methods used to obtain unauthorized access, disable or degrade service or sabotage systems are also constantly changing and evolving and may be difficult to anticipate or detect for long periods of time. We have implemented and regularly review and update processes and procedures to protect against unauthorized access to or use of secured data and to prevent data loss. However, the ever-evolving threats mean we and our third-party service providers and vendors must continually evaluate and adapt our respective systems and processes, and there is no guarantee that they will be adequate to safeguard against all data security breaches or misuses of data. Any future significant compromise or breach of our data security, whether external or internal, or misuse of customer, associate, supplier or Company data, could result in additional significant costs, lost sales, fines, lawsuits, and damage to our reputation. In addition, as the regulatory environment related to information security, data collection and use, and privacy becomes increasingly rigorous, with new and constantly changing requirements applicable to our business, compliance with those requirements could also result in additional costs.
 32 | 
 33 | We are subject to payment-related risks that could increase our operating costs, expose us to fraud or theft, subject us to potential liability and potentially disrupt our business.
 34 | 
 35 | We accept payments using a variety of methods, including cash, checks, credit and debit cards, PayPal, our private label credit cards and installment loan program, and gift cards, and we may offer new payment options over time. Acceptance of these payment options subjects us to rules, regulations, contractual obligations and compliance requirements, including payment network rules and operating guidelines, data security standards and certification requirements, and rules governing electronic funds transfers. These requirements may change over time or be reinterpreted, making compliance more difficult or costly. For certain payment methods, including credit and debit cards, we pay interchange and other fees, which may increase over time and raise our operating costs. We rely on third parties to provide payment processing services, including the processing of credit cards, debit cards, and other forms of electronic payment. If these companies become unable to provide these services to us, or if their systems are compromised, it could potentially disrupt our business. The payment methods that we offer also subject us to potential fraud and theft by criminals, who are becoming increasingly more sophisticated, seeking to obtain unauthorized access to or exploit weaknesses that may exist in the payment systems, as reflected in our recent Data Breach. If we fail to comply with applicable rules or requirements for the payment methods we accept, or if payment-related data is compromised due to a breach or misuse of data, we may be liable for costs incurred by payment card issuing banks and other third parties or subject to fines and higher transaction fees, or our ability to accept or facilitate certain types of payments may be impaired. In addition, our customers could lose confidence in certain payment types, which may result in a shift to other payment types or potential changes to our payment systems that may result in higher costs. As a result, our business and operating results could be adversely affected.
 36 | 
 37 | Uncertainty regarding the housing market, economic conditions and other factors beyond our control could adversely affect demand for our products and services, our costs of doing business and our financial performance.
 38 | 
 39 | Our financial performance depends significantly on the stability of the housing, residential construction and home improvement markets, as well as general economic conditions, including changes in gross domestic product. Adverse conditions in or uncertainty about these markets or the economy could adversely impact our customers’ confidence or financial condition, causing them to determine not to purchase home improvement products and services or delay purchasing or payment for those products and services. Other factors beyond our control – including high levels of unemployment and foreclosures; interest rate fluctuations; fuel and other energy costs; labor and healthcare costs; the availability of financing; the state of the credit markets, including mortgages, home equity loans and consumer credit; weather; natural disasters and 
 40 | 
 41 |  9
 42 | other conditions beyond our control – could further adversely affect demand for our products and services, our costs of doing business and our financial performance.
 43 | 
 44 | A failure of a key information technology system or process could adversely affect our business.
 45 | 
 46 | We rely extensively on information technology systems, some of which are managed or provided by third-party service providers, to analyze, process, store, manage and protect transactions and data. We also rely heavily on the integrity of, security of and consistent access to this data in managing our business. For these systems and processes to operate effectively, we or our service providers must periodically maintain and update them. Our systems and the third-party systems on which we rely are subject to damage or interruption from a number of causes, including power outages; computer and telecommunications failures; computer viruses; security breaches; cyber-attacks; catastrophic events such as fires, floods, earthquakes, tornadoes, or hurricanes; acts of war or terrorism; and design or usage errors by our associates, contractors or third-party service providers. Although we and our third-party service providers seek to maintain our respective systems effectively and to successfully address the risk of compromise of the integrity, security and consistent operations of these systems, we may not be successful in doing so. As a result, we or our service providers could experience errors, interruptions, delays or cessations of service in key portions of our information technology infrastructure, which could significantly disrupt our operations and be costly, time consuming and resource-intensive to remedy.
 47 | 
 48 | Disruptions in our customer-facing technology systems could impair our interconnected retail strategy and give rise to negative customer experiences.
 49 | 
 50 | Through our information technology developments, we are able to provide an improved overall shopping and multichannel experience that empowers our customers to shop and interact with us from computers, tablets, smartphones and other mobile devices. We use our website both as a sales channel for our products and also as a method of providing product, project and other relevant information to our customers to drive both in-store and online sales. We have multiple online communities and knowledge centers that allow us to inform, assist and interact with our customers. Multichannel retailing is continually evolving and expanding, and we must effectively respond to changing customer expectations and new developments. For example, to improve our special order process we are currently rolling out COM, our new Customer Order Management system, which our customers will be able to access online, and we continually seek to enhance all of our online properties to provide an attractive user-friendly interface for our customers. Disruptions, failures or other performance issues with these customer-facing technology systems could impair the benefits that they provide to our online and in-store business and negatively affect our relationship with our customers.
 51 | 
 52 | If we fail to identify and develop relationships with a sufficient number of qualified suppliers, or if our suppliers experience financial difficulties or other challenges, our ability to timely and efficiently access products that meet our high standards for quality could be adversely affected.
 53 | 
 54 | We buy our products from suppliers located throughout the world. Our ability to continue to identify and develop relationships with qualified suppliers who can satisfy our high standards for quality and responsible sourcing, as well as our need to access products in a timely and efficient manner, is a significant challenge. Our ability to access products from our suppliers can be adversely affected by political instability, military conflict, the financial instability of suppliers (particularly in light of continuing economic difficulties in various regions of the world), suppliers’ noncompliance with applicable laws, trade restrictions, tariffs, currency exchange rates, any disruptions in our suppliers’ logistics or supply chain networks, and other factors beyond our or our suppliers’ control.
 55 | 
 56 | Disruptions in our supply chain and other factors affecting the distribution of our merchandise could adversely impact our business.
 57 | 
 58 | A disruption within our logistics or supply chain network could adversely affect our ability to deliver inventory in a timely manner, which could impair our ability to meet customer demand for products and result in lost sales, increased supply chain costs or damage to our reputation. Such disruptions may result from damage or destruction to our distribution centers; weather-related events; natural disasters; trade restrictions; tariffs; third-party strikes, lock-outs, work stoppages or slowdowns; shipping capacity constraints; supply or shipping interruptions or costs; or other factors beyond our control. Any such disruption could negatively impact our financial performance or financial condition.
 59 | 
 60 | The implementation of our supply chain and technology initiatives could disrupt our operations in the near term, and these initiatives might not provide the anticipated benefits or might fail.
 61 | 
 62 | We have made, and we plan to continue to make, significant investments in our supply chain and technology. These initiatives, such as Project Sync and COM, are designed to streamline our operations to allow our associates to continue to 
 63 | 
 64 |  10
 65 | provide high-quality service to our customers, while simplifying customer interaction and providing our customers with a more interconnected retail experience. The cost and potential problems and interruptions associated with the implementation of these initiatives, including those associated with managing third-party service providers and employing new web-based tools and services, could disrupt or reduce the efficiency of our operations in the near term and lead to product availability issues. In addition, our improved supply chain and new or upgraded technology might not provide the anticipated benefits, it might take longer than expected to realize the anticipated benefits, or the initiatives might fail altogether, each of which could adversely impact our competitive position and our financial condition, results of operations or cash flows. 
 66 | 
 67 | If we are unable to effectively manage and expand our alliances and relationships with selected suppliers of both brand name and proprietary products, we may be unable to effectively execute our strategy to differentiate ourselves from our competitors.
 68 | 
 69 | As part of our focus on product differentiation, we have formed strategic alliances and exclusive relationships with selected suppliers to market products under a variety of well-recognized brand names. We have also developed relationships with selected suppliers to allow us to market proprietary products that are comparable to national brands. Our proprietary products differentiate us from other retailers, generally carry higher margins than national brand products, and represent a growing portion of our business. If we are unable to manage and expand these alliances and relationships or identify alternative sources for comparable brand name and proprietary products, we may not be able to effectively execute product differentiation, which may impact our sales and gross margin results.
 70 | 
 71 | Our proprietary products subject us to certain increased risks.
 72 | 
 73 | As we expand our proprietary product offerings, we may become subject to increased risks due to our greater role in the design, manufacture, marketing and sale of those products. The risks include greater responsibility to administer and comply with applicable regulatory requirements, increased potential product liability and product recall exposure and increased potential reputational risks related to the responsible sourcing of those products. To effectively execute on our product differentiation strategy, we must also be able to successfully protect our proprietary rights and successfully navigate and avoid claims related to the proprietary rights of third parties. In addition, an increase in sales of our proprietary products may adversely affect sales of our vendors’ products, which, in turn, could adversely affect our relationships with certain of our vendors. Any failure to appropriately address some or all of these risks could damage our reputation and have an adverse effect on our business, results of operations and financial condition.
 74 | 
 75 | We may be unsuccessful in implementing our growth strategy, which includes the integration of Interline to expand our business with professional customers and in the MRO market, which could have an adverse impact on our financial condition and results of operation.
 76 | 
 77 | In fiscal 2015, we completed the acquisition of Interline, which we believe will enhance our ability to serve our professional customers and increase our share of the MRO market. Our goal is to serve all of our different Pro customer groups through one integrated approach to drive growth and capture market share in the retail, services and MRO markets, and this strategy depends, in part, on the successful integration of Interline. As with any acquisition, we need to successfully integrate the target company’s products, services, associates and systems into our business operations. Integration can be a complex and time-consuming process, and if the integration is not fully successful or is delayed for a material period of time, we may not achieve the anticipated synergies or benefits of the acquisition. An inability to realize the full extent of the anticipated synergies or benefits of the Interline acquisition could have an adverse effect on our financial condition or results of operation. Furthermore, even if Interline is successfully integrated, the acquisition may fail to further our business strategy as anticipated, expose us to increased competition or challenges with respect to our products or services, and expose us to additional liabilities associated with the Interline business.
 78 | 
 79 | If we are unable to manage effectively our installation service business, we could suffer lost sales and be subject to fines, lawsuits and reputational damage.
 80 | 
 81 | We act as a general contractor to provide installation services to our DIFM customers through third-party installers. As such, we are subject to regulatory requirements and risks applicable to general contractors, which include management of licensing, permitting and quality of our third-party installers. We have established processes and procedures that provide protections beyond those required by law to manage these requirements and ensure customer satisfaction with the services provided by our third-party installers. If we fail to manage these processes effectively or to provide proper oversight of these services, we could suffer lost sales, fines and lawsuits, as well as damage to our reputation, which could adversely affect our business.
 82 | 
 83 |  11
 84 | Our costs of doing business could increase as a result of changes in, expanded enforcement of, or adoption of new federal, state or local laws and regulations.
 85 | 
 86 | We are subject to various federal, state and local laws and regulations that govern numerous aspects of our business. Recently, there have been a large number of legislative and regulatory initiatives and reforms, as well as expanded enforcement of existing laws and regulations by federal, state and local agencies. Changes in, expanded enforcement of, or adoption of new federal, state or local laws and regulations governing minimum wage or living wage requirements; other wage, labor or workplace regulations; cybersecurity and data privacy; the sale of some of our products; transportation; logistics; supply chain transparency; taxes; energy costs or environmental matters could increase our costs of doing business or impact our operations. In addition, recent healthcare reform legislation could adversely impact our labor costs and our ability to negotiate favorable terms under our benefit plans for our associates.
 87 | 
 88 | If we cannot successfully manage the unique challenges presented by international markets, we may not be successful in our international operations and our sales and profit margins may be impacted.
 89 | 
 90 | Our ability to successfully conduct retail operations in, and source products and materials from, international markets is affected by many of the same risks we face in our U.S. operations, as well as unique costs and difficulties of managing international operations. Our international operations, including any expansion in international markets, may be adversely affected by local laws and customs, U.S. laws applicable to foreign operations and other legal and regulatory constraints, as well as political and economic conditions. Risks inherent in international operations also include, among others, potential adverse tax consequences, greater difficulty in enforcing intellectual property rights, risks associated with the Foreign Corrupt Practices Act and local anti-bribery law compliance, and challenges in our ability to identify and gain access to local suppliers. In addition, our operations in international markets create risk due to foreign currency exchange rates and fluctuations in those rates, which may adversely impact our sales and profit margins. 
 91 | 
 92 | The inflation or deflation of commodity prices could affect our prices, demand for our products, our sales and our profit margins.
 93 | 
 94 | Prices of certain commodity products, including lumber and other raw materials, are historically volatile and are subject to fluctuations arising from changes in domestic and international supply and demand, labor costs, competition, market speculation, government regulations and periodic delays in delivery. Rapid and significant changes in commodity prices may affect the demand for our products, our sales and our profit margins.
 95 | 
 96 | Changes in accounting standards and subjective assumptions, estimates and judgments by management related to complex accounting matters could significantly affect our financial results or financial condition.
 97 | 
 98 | Generally accepted accounting principles and related accounting pronouncements, implementation guidelines and interpretations with regard to a wide range of matters that are relevant to our business, such as revenue recognition, asset impairment, impairment of goodwill and other intangible assets, inventories, lease obligations, self-insurance, tax matters and litigation, are highly complex and involve many subjective assumptions, estimates and judgments. Changes in these rules or their interpretation or changes in underlying assumptions, estimates or judgments could significantly change our reported or expected financial performance or financial condition.
 99 | 
100 | We are involved in a number of legal and regulatory proceedings, and while we cannot predict the outcomes of those proceedings and other contingencies with certainty, some of these outcomes may adversely affect our operations or increase our costs.
101 | 
102 | In addition to the matters discussed above with respect to the Data Breach, we are involved in a number of legal proceedings and regulatory matters, including government inquiries and investigations, and consumer, employment, tort and other litigation that arise from time to time in the ordinary course of business. Litigation is inherently unpredictable, and the outcome of some of these proceedings and other contingencies could require us to take or refrain from taking actions which could adversely affect our operations or could result in excessive adverse verdicts. Additionally, involvement in these lawsuits, investigations and inquiries, and other proceedings may involve significant expense, divert management’s attention and resources from other matters, and impact the reputation of the Company.
103 | 
104 |  12
105 | Item 1B. Unresolved Staff Comments.
106 | 
107 | Not applicable.
108 | 
109 | Item 2. Properties.


--------------------------------------------------------------------------------
/output_files_examples/batch_0001/001/HD_0000354950_10K_20160131_Item1_excerpt.txt:
--------------------------------------------------------------------------------
  1 | PART I
  2 | 
  3 | Item 1. Business.
  4 | 
  5 | Introduction
  6 | 
  7 | The Home Depot, Inc. is the world’s largest home improvement retailer based on Net Sales for the fiscal year ended January 31, 2016 ("fiscal 2015"). The Home Depot sells a wide assortment of building materials, home improvement products and lawn and garden products and provides a number of services. The Home Depot stores average approximately 104,000 square feet of enclosed space, with approximately 24,000 additional square feet of outside garden area. As of the end of fiscal 2015, we had 2,274 The Home Depot stores located throughout the United States, including the Commonwealth of Puerto Rico and the territories of the U.S. Virgin Islands and Guam, Canada and Mexico. When we refer to "The Home Depot", the "Company", "we", "us" or "our" in this report, we are referring to The Home Depot, Inc. and its consolidated subsidiaries.
  8 | 
  9 | The Home Depot, Inc. is a Delaware corporation that was incorporated in 1978. Our Store Support Center (corporate office) is located at 2455 Paces Ferry Road, Atlanta, Georgia 30339. Our telephone number is (770) 433-8211.
 10 | 
 11 | Our internet website is www.homedepot.com. We make available on the Investor Relations section of our website, free of charge, our Annual Reports to shareholders, Annual Reports on Form 10-K, Quarterly Reports on Form 10-Q, Current Reports on Form 8-K, Proxy Statements and Forms 3, 4 and 5, and amendments to those reports, as soon as reasonably practicable after filing such documents with, or furnishing such documents to, the SEC.
 12 | 
 13 | We include our website addresses throughout this filing for reference only. The information contained on our websites is not incorporated by reference into this report.
 14 | 
 15 | For information on key financial highlights, including historical revenues, profits and total assets, see the "Five-Year Summary of Financial and Operating Results" on page F-1 of this report and Item 7, "Management’s Discussion and Analysis of Financial Condition and Results of Operations".
 16 | 
 17 |  1
 18 | Our Business
 19 | 
 20 | Operating Strategy 
 21 | 
 22 | Since 2009, we have been guided by a consistent strategic framework organized around our customers, our products and our disciplined use of capital, tied together through our interconnected retail initiative. In fiscal 2015, we announced an evolution of this strategy to reflect the changing needs of our customers and our business. The fundamental aspects remain the same, but we are now focused more than ever on connecting various aspects of our business to drive value for our customers, our associates, our suppliers and our shareholders. Our current strategic framework is comprised of three key initiatives – Customer Experience, Product Authority, and Productivity and Efficiency Driven by Capital Allocation – tied together by our interconnecting retail initiative. As customers increasingly expect to be able to buy how, when and where they want, we believe that providing a seamless and frictionless shopping experience across multiple channels, featuring innovative and expanded product choices delivered in a fast and cost-efficient manner, will be a key enabler for future success. Becoming a best-in-class interconnected retailer is growing in importance as the line between online and in-store shopping continues to blur and customers demand increased value and convenience.
 23 | 
 24 | Interconnecting retail is woven through each of our other three initiatives, as discussed in more detail below. For example, under our customer experience initiative, we are focused on connecting our stores to our online experience and connecting service to customer needs. Under our product authority initiative, we are focused on connecting our product assortment to local needs and connecting our customers with product information to inspire and empower them. Under our productivity and efficiency initiative, we are focused on connecting our merchandise from our suppliers to our customers by optimizing our supply chain. Overall, we are collaborating more closely, both internally and externally, through deeper cross-functional work and a more integrated, longer-term approach with our suppliers and other business partners, to build complete end-to-end solutions.
 25 | 
 26 | Customer Experience 
 27 | 
 28 | Our customer experience initiative is anchored on the principles of putting customers first and taking care of our associates. Our commitment to customer service is a key part of this initiative, and in fiscal 2015, to underscore the importance of customer service, we re-trained our store associates on our Customer FIRST program. We recognize that the customer experience includes more than just customer service, and we have taken a number of steps to enhance this initiative to provide our customers with a seamless and frictionless shopping experience in our stores, online, on the job site or in their homes.
 29 | 
 30 | Our Customers. We serve three primary customer groups, and we have different approaches to meet their particular needs:
 31 | 
 32 | •Do-It-Yourself ("DIY") Customers. These customers are typically home owners who purchase products and complete their own projects and installations. Our associates assist these customers with specific product and installation questions both in our stores and through online resources and other media designed to provide product and project knowledge. We also offer a variety of clinics and workshops both to impart this knowledge and to build an emotional connection with our DIY customers.
 33 | 
 34 | •Do-It-For-Me ("DIFM") Customers. These customers are typically home owners who purchase materials and hire third parties to complete the project or installation. Our stores offer a variety of installation services targeted at DIFM customers who purchase products and installation of those products from us in our stores, online or in their homes through in-home consultations. Our installation programs include many categories, such as flooring, cabinets, countertops, water heaters and sheds. In addition, we provide third-party professional installation in a number of categories sold through our in-home sales programs, such as roofing, siding, windows, cabinet refacing, furnaces and central air systems. This customer group is growing due to changing demographics, which we believe will increase demand for our installation services. Further, our focus on serving the professional customers, or "Pros", who perform these services for our DIFM customers will help us drive higher product sales.
 35 | 
 36 | •Professional Customers. These customers are primarily professional renovators/remodelers, general contractors, repairmen, installers, small business owners and tradesmen. With our acquisition of Interline Brands, Inc. ("Interline") in August 2015, we expanded our service to the maintenance, repair and operations ("MRO") Pro. We recognize the unique service needs of the Pro customer and use our expertise to facilitate their buying experience. We offer a variety of special programs to these customers, including delivery and will-call services, dedicated staff, expanded credit programs, designated parking spaces close to store entrances and bulk pricing programs for both online and in-store purchases. In addition, we maintain a loyalty program, Pro Xtra, that provides our Pros with discounts on useful business services, exclusive product offers and a purchase tracking tool to enable receipt lookup 
 37 | 
 38 |  2
 39 | online and job tracking of purchases across all forms of payment. This program, introduced in fiscal 2013, has continued to gain traction, with almost 4 million customers enrolled by the end of fiscal 2015.
 40 | 
 41 | We also recognize that our Pros have differing needs depending on the type of work they perform. Our goal is to develop a wide spectrum of solutions for all of our professional customers, such as supplying both recurring MRO needs and core building materials to large-scale property managers and providing inventory management solutions for our traditional Pro customers. We believe developing a unified approach to service all the needs of our Pros will differentiate us from competitors who are solely traditional retail, installation or MRO companies.
 42 | 
 43 | We help our DIY, DIFM and Pro customers finance their projects by offering private label credit products in our stores through third-party credit providers. We also help certain of our Pros through our own programs. In fiscal 2015, our customers opened approximately 3.2 million new The Home Depot private label credit accounts, and at fiscal year end the total number of The Home Depot active account holders was approximately 12 million. Private label credit card sales accounted for approximately 23% of sales in fiscal 2015. In addition, in the U.S. we re-launched our private label credit program at the end of fiscal 2015 with additional benefits, including a 365-day return policy for all of our customers and commercial fuel rewards and extended payment terms for our Pros.
 44 | 
 45 | Our Associates. Our associates are key to our customer experience initiative. As noted above, we empower our associates to deliver excellent customer service through our Customer FIRST training program, and we strive to remove complexity and inefficient processes from the stores to allow our associates to focus on our customers. In fiscal 2015, we began to roll out a number of new initiatives to improve freight handling in the stores, as well as Project Sync, which is discussed in more detail below under "Logistics". All of these programs are designed to make our freight handling process more efficient, which allows our associates to devote more time to the customer experience and makes working at The Home Depot a better experience for them. We also have a number of programs to recognize stores and individual associates for exceptional customer and community service. 
 46 | 
 47 | At the end of fiscal 2015, we employed approximately 385,000 associates, of whom approximately 24,000 were salaried, with the remainder compensated on an hourly or temporary basis. To attract and retain qualified personnel, we seek to maintain competitive salary and wage levels in each market we serve. We measure associate satisfaction regularly, and we believe that our employee relations are very good.
 48 | 
 49 | Interconnecting Retail. In fiscal 2015, we continued to enhance our customers’ interconnected shopping experiences through a variety of initiatives. Our associates used second generation FIRST phones, our web-enabled handheld devices, to help customers complete online sales in the aisle, expedite the checkout process for customers during peak traffic periods, locate products in the aisles and online, and check inventory on hand. We have also empowered our customers with improved product location and inventory availability tools through enhancements to our website and mobile app, and we have invested heavily in content improvements such as videos, ratings and reviews, and more detailed product information. These enhancements are critical for our increasingly interconnected customers who research products online and then go into one of our stores to view the products in person or talk to an associate before making the purchase. While in the store, customers may also go online to access ratings and reviews, compare prices, view our extended assortment and purchase products.
 50 | 
 51 | We continued to make enhancements to our special order process in fiscal 2015 with our new Customer Order Management platform ("COM"), which was introduced in fiscal 2014. This platform is designed to provide greater visibility into and improved execution of special orders by our associates and a more seamless and frictionless experience for our customers. After COM is rolled out to all U.S. stores, which we expect to occur by the end of fiscal 2016, store associates, suppliers and customers will be able to access relevant special order information online, regardless of where the order was placed. In addition, we have three online contact centers to service our online customers’ needs.
 52 | 
 53 | We also recognize that customers desire greater flexibility and convenience when it comes to receiving their products and services. In fiscal 2015, we began to roll out Buy Online, Deliver From Store ("BODFS"), which complements our existing interconnecting retail programs: Buy Online, Pick-up In Store ("BOPIS"), Buy Online, Ship to Store ("BOSS") and Buy Online, Return In Store ("BORIS"). We expect to complete the roll out of BODFS by the end of fiscal 2016. We will continue to blend our physical and digital assets in a seamless and frictionless way to enhance the end-to-end customer experience.
 54 | 
 55 |  3
 56 | Product Authority
 57 | 
 58 | Our product authority initiative is facilitated by our merchandising transformation and portfolio strategy, which is focused on delivering product innovation, assortment and value. In fiscal 2015, we continued to introduce a wide range of innovative new products to our DIY, DIFM and Pro customers, while remaining focused on offering everyday values in our stores and online. 
 59 | 
 60 | Our Products. In fiscal 2015, we introduced a number of innovative and distinctive products to our customers at attractive values. Examples of these new products include EGO™ 58-volt cordless outdoor power tools (string trimmer, hedge trimmer, blower, chainsaw and lawn mower); the Husky® 100 platform of mechanics tools; LifeProof Carpet®; Milwaukee® Cobalt Red Helix™ drill bits; and Feit® Electric HomeBrite® Bluetooth® Smart LED light bulbs.
 61 | 
 62 | During fiscal 2015, we continued to offer value to our customers through our proprietary and exclusive brands across a wide range of departments. Highlights of these offerings include Husky® hand tools and tool storage; Everbilt® hardware and fasteners; Hampton Bay® lighting, ceiling fans and patio furniture; Vigoro® lawn care products; RIDGID® and Ryobi® power tools; Glacier Bay® bath fixtures; HDX® storage and cleaning products; and Home Decorators Collection® furniture and home décor. We will continue to assess departments and categories, both online and in-store, for opportunities to expand the assortment of products offered within The Home Depot’s portfolio of proprietary and exclusive brands.
 63 | 
 64 | We maintain a global sourcing program to obtain high-quality and innovative products directly from manufacturers around the world. In fiscal 2015, in addition to our U.S. sourcing operations, we maintained sourcing offices in China, Taiwan, India, Italy, Mexico and Canada. With our acquisition of Interline, we also acquired additional sourcing offices in China, Thailand and Indonesia.
 65 | 
 66 | The percentage of Net Sales of each of our major product categories (and related services) for each of the last three fiscal years is presented in Note 1 to the Consolidated Financial Statements included in Item 8, "Financial Statements and Supplementary Data". Net Sales outside the U.S. were $8.0 billion, $8.5 billion and $8.5 billion for fiscal 2015, 2014 and 2013, respectively. Long-lived assets outside the U.S. totaled $2.3 billion, $2.5 billion and $2.9 billion as of January 31, 2016, February 1, 2015 and February 2, 2014, respectively.
 67 | 
 68 | Quality Assurance. Our suppliers are obligated to ensure that their products comply with applicable international, federal, state and local laws. In addition, we have both quality assurance and engineering resources dedicated to establishing criteria and overseeing compliance with safety, quality and performance standards for our proprietary branded products. We also have a global Supplier Social and Environmental Responsibility Program designed to ensure that all suppliers adhere to the highest standards of social and environmental responsibility.
 69 | 
 70 | Environmentally-Friendly Products and Programs. The Home Depot is committed to sustainable business practices – from the environmental impact of our operations, to our sourcing activities, to our involvement within the communities in which we do business. We believe these efforts continue to be successful in creating value for our customers and shareholders. For example, we offer a growing selection of environmentally-preferred products, which supports sustainability and helps our customers save energy, water and money. Through our Eco Options® Program introduced in 2007, we have created product categories that allow customers to easily identify products that meet specifications for energy efficiency, water conservation, healthy home, clean air and sustainable forestry. As of the end of fiscal 2015, our Eco Options® Program included over 10,000 products. Through this program, we sell ENERGY STAR® certified appliances, LED light bulbs, tankless water heaters and other products that enable our customers to save on their utility bills. We estimate that in fiscal 2015 we helped customers save over $700 million in electricity costs through sales of ENERGY STAR® certified products and over $300 million in product costs through ENERGY STAR® rebate programs. We also estimate our customers saved over 70 billion gallons of water resulting in over $590 million in water bill savings in fiscal 2015 through the sales of our WaterSense®-labeled bath faucets, showerheads, aerators, toilets and irrigation controllers.
 71 | 
 72 | We continue to offer store recycling programs nationwide, such as an in-store compact fluorescent light ("CFL") bulb recycling program launched in 2008. This service is offered to customers free of charge and is available in all U.S. stores. We also maintain an in-store rechargeable battery recycling program. Launched in 2001 and currently done in partnership with Call2Recycle, this program is also available to customers free of charge in all stores throughout the U.S. Through these recycling programs, in fiscal 2015 we helped recycle over 680,000 pounds of CFL bulbs and over 930,000 pounds of rechargeable batteries collected from our customers. In fiscal 2015, we also recycled over 170,000 lead acid batteries collected from our customers under our lead acid battery exchange program, as well as over 200,000 tons of cardboard through a nationwide cardboard recycling program across our U.S. stores. We believe our Eco Options® Program and our recycling efforts drive sales, which in turn benefits our shareholders, in addition to our customers and the environment.
 73 | 
 74 |  4
 75 | Interconnecting Retail. A typical The Home Depot store stocks approximately 30,000 to 40,000 products during the year, including both national brand name and proprietary items. To enhance our merchandising capabilities, we continued to make improvements to our information technology tools in fiscal 2015 to better understand our customers, provide more localized assortments to fit customer demand and optimize space to dedicate the right square footage to the right products in the right location. We also continued to use the resources of BlackLocus, Inc., a data analytics and pricing firm we acquired in fiscal 2012, to help us make focused merchandising decisions based on large, complex data sets.
 76 | 
 77 | Our online product offerings complement our stores by serving as an extended aisle, and we offer a significantly broader product assortment through our Home Depot, Home Decorators Collection and Blinds.com websites. We continue to enhance our websites and mobile experience by improving navigation and search functionalities to allow customers to more easily find and purchase an expanded array of products and provide our customers with flexibility and convenience for their purchases, for example, through our BOPIS, BOSS, BORIS and BODFS programs. In addition, we invest in content, such as videos, room scenes, buying guides and how-to information, and we routinely assess our online assortment to balance choice with curation so that we provide value to our customers. As a result of these efforts, in fiscal 2015 we enhanced the customer experience and saw increased traffic to our websites, improved online sales conversion rates, and a larger percentage of orders being picked up in our stores. For fiscal 2015, we had over 1.4 billion visits to our online properties; sales from our online channels increased over 25% compared to fiscal 2014; and over 40% of our online orders were picked up in a store.
 78 | 
 79 | Seasonality. Our business is subject to seasonal influences. Generally, our highest volume of sales occurs in our second fiscal quarter, and the lowest volume occurs either during our first or fourth fiscal quarter. 
 80 | 
 81 | Competition. Our industry is highly competitive, with competition based primarily on customer service, price, store location and appearance, and quality, availability and assortment of merchandise. Although we are currently the world’s largest home improvement retailer, in each of the markets we serve there are a number of other home improvement stores, electrical, plumbing and building materials supply houses, and lumber yards. With respect to some products and services, we also compete with specialty design stores, showrooms, discount stores, local, regional and national hardware stores, paint stores, mail order firms, warehouse clubs, independent building supply stores, MRO companies and, to a lesser extent, other retailers, as well as with installers of home improvement products. In addition, we face growing competition from online and multichannel retailers, some of whom may have a lower cost structure than ours, as our customers increasingly use computers, tablets, smartphones and other mobile devices to shop online and compare prices and products.
 82 | 
 83 | Intellectual Property. Our business has one of the most recognized brands in North America. As a result, we believe that The Home Depot® trademark has significant value and is an important factor in the marketing of our products, e-commerce, stores and business. We have registered or applied for registration of trademarks, service marks, copyrights and internet domain names, both domestically and internationally, for use in our business, including our expanding proprietary brands such as HDX®, Husky®, Hampton Bay®, Home Decorators Collection®, Glacier Bay® and Vigoro®. We also maintain patent portfolios relating to some of our products and services and seek to patent or otherwise protect innovations we incorporate into our products or business operations.
 84 | 
 85 | Productivity and Efficiency Driven by Capital Allocation
 86 | 
 87 | We have advanced this initiative by building best-in-class competitive advantages in our information technology and supply chain to better ensure product availability to our customers while managing our costs, which results in higher returns for our shareholders. During fiscal 2015, we continued to focus on optimizing our supply chain network and improving our inventory, transportation and distribution productivity.
 88 | 
 89 | Logistics. Our supply chain operations are focused on creating a competitive advantage through ensuring product availability for our customers, effectively using our investment in inventory, and managing total supply chain costs. One of our principal 2015 initiatives has been to further optimize and efficiently operate our network by beginning initial work on a multi-year program called Supply Chain Synchronization, or "Project Sync". 
 90 | 
 91 | Our distribution strategy is to provide the optimal flow path for a given product. Rapid Deployment Centers ("RDCs") play a key role in optimizing our network as they allow for aggregation of product needs for multiple stores to a single purchase order and then rapid allocation and deployment of inventory to individual stores upon arrival at the RDC. This results in a simplified ordering process and improved transportation and inventory management. We have 18 mechanized RDCs in the U.S. and two recently opened mechanized RDCs in Canada. Through Project Sync, which is being rolled out gradually to suppliers in several U.S. RDCs, we can significantly reduce our average lead time from supplier to shelf. Project Sync requires deep collaboration among our suppliers, transportation providers, RDCs and stores, as well as rigorous planning and information technology development to create an engineered flow schedule that shortens and stabilizes lead time, resulting in 
 92 | 
 93 |  5
 94 | more predictable and consistent freight flow. As we continue to roll out Project Sync throughout our supply chain over the next several years, we plan to create an end-to-end solution that benefits all participants in our supply chain, from our suppliers to our transportation providers to our RDC and store associates to our customers.
 95 | 
 96 | Over the past several years, we have centralized our inventory planning and replenishment function and continuously improved our forecasting and replenishment technology. This has helped us improve our product availability and our inventory productivity at the same time. At the end of fiscal 2015, over 95% of our U.S. store products were ordered through central inventory management.
 97 | 
 98 | In addition to our RDCs, at the end of fiscal 2015, we operated 34 bulk distribution centers, which handle products distributed optimally on flat bed trucks, in the U.S. and Canada; 22 stocking distribution centers in the U.S., Canada and Mexico; and ten specialty distribution centers, which include offshore consolidation and return logistics centers, in the U.S. and Canada. We also utilize four U.S. transload facilities, operated by third parties near ocean ports, for our imported product. These facilities allow us to improve our import logistics costs and inventory management by postponing final inventory deployment decisions until product arrives at destination ports. We remain committed to leveraging our supply chain capabilities to fully utilize and optimize our improved logistics network. 
 99 | 
100 | Interconnecting Retail. To support our online growth, in fiscal 2015 we opened the third of our three new direct fulfillment centers ("DFCs"). We expect these facilities to enable us to reach 90% of our U.S. customers in two business days or less with parcel shipping, which provides our customers with a balance of cost efficiency and speed in shipping online orders. For non-parcel orders originating from our DFCs, we have fully implemented BOSS via RDC delivery to provide our customers with a less expensive store pick-up alternative. With our acquisition of Interline, we have also added more than 90 distribution points with fast delivery of a broad assortment of MRO products.
101 | 
102 | In addition to the distribution and fulfillment centers described above, we leverage our almost 2,000 U.S. stores as a network of convenient customer pick-up, return and delivery fulfillment locations. For customers who shop online and wish to pick-up or return merchandise at our U.S. stores, we have fully implemented our BOPIS, BOSS and BORIS programs, which we believe provide us with a competitive advantage. For customers who would like the option to have store-based orders delivered directly to their home or job site, we pick, pack and ship orders to customers from our stores. We will continue our roll out of BODFS during fiscal 2016, allowing online customers to select their preferred delivery date and time windows for store-based deliveries. Our supply chain and logistics strategies will continue to be focused on providing our customers high product availability with convenient and low cost fulfillment options. 
103 | 
104 | Commitment to Sustainability and Environmentally Responsible Operations. The Home Depot focuses on sustainable operations and is committed to conducting business in an environmentally responsible manner. This commitment impacts all areas of our business, including energy usage, supply chain, store construction and maintenance, and, as noted above under "Environmentally-Friendly Products and Programs", product selection and recycling programs for our customers.
105 | 
106 | In our 2015 Sustainability Report, available on our corporate website under "Corporate Responsibility > THD and the Environment", we reported that we had significantly surpassed our energy and carbon reduction goals set in 2010 and announced two new sustainability goals for 2020. Our 2010 goals were to reduce our kilowatt hours (kWh) per square foot in our U.S. stores by 20% over 2004 levels and to reduce our supply chain carbon emissions by 20% over 2010 levels by 2015. We estimate that we have reduced those levels by over 30% and over 35%, respectively, as of the end of fiscal 2015. From 2014 to 2015 alone, we reduced our kWh per square foot by approximately 3.6%. Our new 2020 sustainability commitments are to reduce our U.S. stores’ energy use by 20% over 2010 levels and to produce and procure, on an annual basis, 135 megawatts of energy for our stores through renewable or alternate energy sources, such as wind, solar and fuel cell technology. We are committed to implementing strict operational standards that establish energy efficient operations in all of our U.S. facilities and continuing to invest in renewable energy. Our 2015 Sustainability Report also uses the Global Reporting Initiative (GRI) framework for sustainability reporting.
107 | 
108 | Additionally, we implemented a rainwater reclamation project in our stores in 2010. As of the end of fiscal 2015, 145 of our stores used reclamation tanks to collect rainwater and condensation from HVAC units and garden center roofs, which is in turn used to water plants in our outside garden centers. We estimate our annual water savings from these units to be approximately 500,000 gallons per store for total water savings of over 68 million gallons in fiscal 2015.
109 | 
110 | Our commitment to corporate sustainability has resulted in a number of environmental awards and recognitions. In 2015, we received three significant awards from the U.S. Environmental Protection Agency ("EPA"). The ENERGY STAR® division named us "Retail Partner of the Year – Sustained Excellence" for our overall excellence in energy efficiency, and we received the 2015 WaterSense® Sustained Excellence Award for our overall excellence in water efficiency. We also received the EPA’s 
111 | 
112 |  6
113 | "SmartWay Excellence Award", which recognizes The Home Depot as an industry leader in freight supply chain environmental performance and energy efficiency. We also participate in the CDP (formerly known as the Carbon Disclosure Project) reporting process. CDP is an independent, international, not-for-profit organization providing a global system for companies and cities to measure, disclose, manage and share environmental information. In 2015, we scored 99 out of 100 from the CDP for our disclosure, placing us among the highest scoring companies in the Index and near the top of our sector. We also were named as an industry leader by the CDP and received a performance band ranking of A- (out of a range from A to E), reflecting a high level of action on climate change mitigation, adaptation and transparency.
114 | 
115 | We are strongly committed to maintaining a safe shopping and working environment for our customers and associates and protecting the environment of the communities in which we do business. Our Environmental, Health & Safety ("EH&S") function is dedicated to ensuring the health and safety of our customers and associates, with trained associates who evaluate, develop, implement and enforce policies, processes and programs on a Company-wide basis. Our EH&S policies are woven into our everyday operations and are part of The Home Depot culture. Some common program elements include: daily store inspection checklists (by department); routine follow-up audits from our store-based safety team members and regional, district and store operations field teams; equipment enhancements and preventative maintenance programs to promote physical safety; departmental merchandising safety standards; training and education programs for all associates, with varying degrees of training provided based on an associate’s role and responsibilities; and awareness, communication and recognition programs designed to drive operational awareness and understanding of EH&S issues.
116 | 
117 | Returning Value to Shareholders. As noted above, we drive productivity and efficiency through our capital allocation decisions, with a focus on expense control. This discipline drove higher returns on invested capital and allowed us to return value to shareholders through $7.0 billion in share repurchases and $3.0 billion in dividends in fiscal 2015, as discussed in Item 7, "Management’s Discussion and Analysis of Financial Condition and Results of Operations".
118 | 
119 | Data Breach
120 | 
121 | In the third quarter of fiscal 2014, we confirmed that our payment data systems were breached, which impacted customers who used payment cards at our U.S. and Canadian stores (the "Data Breach"). For a description of matters related to the Data Breach, see Item 7, "Management’s Discussion and Analysis of Financial Condition and Results of Operations" and Note 13 to the Consolidated Financial Statements included in Item 8, "Financial Statements and Supplementary Data".
122 | 
123 | Item 1A. Risk Factors.


--------------------------------------------------------------------------------
/output_files_examples/batch_0001/001/HD_0000354950_10K_20160131_Item7A_excerpt.txt:
--------------------------------------------------------------------------------
1 | Item 7A. Quantitative and Qualitative Disclosures About Market Risk.
2 | 
3 | The information required by this item is incorporated by reference to Item 7, "Management’s Discussion and Analysis of Financial Condition and Results of Operations" of this report.
4 | 
5 |  28
6 | Item 8. Financial Statements and Supplementary Data.


--------------------------------------------------------------------------------
/output_files_examples/batch_0001/001/HD_0000354950_10K_20170129_Item1A_excerpt.txt:
--------------------------------------------------------------------------------
  1 | Item 1A. Risk Factors.
  2 | 
  3 | The risks and uncertainties described below could materially and adversely affect our business, financial condition and results of operations and could cause actual results to differ materially from our expectations and projections. You should read these Risk Factors in conjunction with "Management’s Discussion and Analysis of Financial Condition and Results of Operations" in Item 7 and our Consolidated Financial Statements and related notes in Item 8. There also may be other factors that we cannot anticipate or that are not described in this report generally because we do not currently perceive them to be material. Those factors could cause results to differ materially from our expectations.
  4 | 
  5 | Strong competition could adversely affect prices and demand for our products and services and could decrease our market share.
  6 | 
  7 | We operate in markets that are highly competitive. We compete principally based on customer experience, price, store location and appearance, and quality, availability, assortment and presentation of merchandise. In each market we serve, there are a number of other home improvement stores, electrical, plumbing and building materials supply houses and lumber yards. With respect to some products and services, we also compete with specialty design stores, showrooms, discount stores, local, regional and national hardware stores, paint stores, mail order firms, warehouse clubs, independent building supply stores, MRO companies and other retailers, as well as with providers of home improvement services. In addition, we face growing competition from online and multichannel retailers, some of whom may have a lower cost structure than ours, as our customers now routinely use computers, tablets, smartphones and other mobile devices to shop online and compare prices and products in real time. We use our marketing, advertising and promotional programs to drive customer traffic and compete more effectively, and we must regularly assess and adjust our efforts to address changes in the competitive landscape. Intense competitive pressures from one or more of our competitors, such as through aggressive promotional pricing or liquidation events, or our inability to adapt effectively and quickly to a changing competitive landscape could affect our prices, our 
  8 | 
  9 |  7
 10 | margins or demand for our products and services. If we are unable to timely and appropriately respond to these competitive pressures, including through the delivery of a superior customer experience or maintenance of effective marketing, advertising or promotional programs, our market share and our financial performance could be adversely affected.
 11 | 
 12 | We may not timely identify or effectively respond to consumer needs, expectations or trends, which could adversely affect our relationship with customers, our reputation, the demand for our products and services, and our market share.
 13 | 
 14 | The success of our business depends in part on our ability to identify and respond promptly to evolving trends in demographics; consumer preferences, expectations and needs; and unexpected weather conditions, while also managing appropriate inventory levels and maintaining an excellent customer experience. It is difficult to successfully predict the products and services our customers will demand. As we continue to see increasing strength in the housing and home improvement market, resulting changes in demand will put further pressure on our ability to meet customer needs and expectations and maintain high service levels. In addition, each of our primary customer groups – DIY, DIFM and Pro – have different needs and expectations, many of which evolve as the demographics in a particular customer group change. We also need to offer more localized assortments of our merchandise to appeal to local cultural and demographic tastes within each customer group. If we do not successfully differentiate the shopping experience to meet the individual needs and expectations of – or within – a customer group, we may lose market share with respect to those customers. 
 15 | 
 16 | Customer expectations about the methods by which they purchase and receive products or services are also becoming more demanding. Customers now routinely use technology and mobile devices to rapidly compare products and prices, determine real-time product availability and purchase products. Once products are purchased, customers are seeking alternate options for delivery of those products, and they often expect quick and low-cost delivery. We must continually anticipate and adapt to these changes in the purchasing process. We have implemented programs like BOSS, BOPIS, BODFS and direct fulfillment, but we cannot guarantee that these programs or others we may implement will be implemented successfully or will meet customers’ needs and expectations. Customers are also using social media to provide feedback and information about our Company and products and services in a manner that can be quickly and broadly disseminated. To the extent a customer has a negative experience and shares it over social media, it may impact our brand and reputation. 
 17 | 
 18 | Further, we have an aging store base that requires maintenance and space reallocation initiatives to deliver the shopping experience that our customers desire. We must also maintain a safe store environment for our customers and associates. Failure to maintain our stores and utilize our store space effectively; to provide a compelling online presence; to timely identify or respond to changing consumer preferences, expectations and home improvement needs; to provide quick and low-cost delivery alternatives; to differentiate the customer experience for our three primary customer groups; and to effectively implement an increasingly localized merchandising assortment could adversely affect our relationship with customers, our reputation, the demand for our products and services, and our market share.
 19 | 
 20 | Our success depends upon our ability to attract, develop and retain highly qualified associates while also controlling our labor costs.
 21 | 
 22 | Our customers expect a high level of customer service and product knowledge from our associates. To meet the needs and expectations of our customers, we must attract, develop and retain a large number of highly qualified associates while at the same time controlling labor costs. Our ability to control labor costs is subject to numerous external factors, including prevailing wage rates and health and other insurance costs, as well as the impact of legislation or regulations governing labor relations, minimum wage, or healthcare benefits. An inability to provide wages and/or benefits that are competitive within the markets in which we operate could adversely affect our ability to retain and attract employees. In addition, we compete with other retail businesses for many of our associates in hourly positions, and we invest significant resources in training and motivating them to maintain a high level of job satisfaction. These positions have historically had high turnover rates, which can lead to increased training and retention costs, particularly as the economy continues to improve and the labor market tightens. There is no assurance that we will be able to attract or retain highly qualified associates in the future.
 23 | 
 24 | A failure of a key information technology system or process could adversely affect our business.
 25 | 
 26 | We rely extensively on information technology systems, some of which are managed or provided by third-party service providers, to analyze, process, store, manage and protect transactions and data. In managing our business, we also rely heavily on the integrity of, security of and consistent access to this data for information such as sales, merchandise ordering, inventory replenishment and order fulfillment. For these information technology systems and processes to operate effectively, we or our service providers must periodically maintain and update them. Our systems and the third-party systems on which we rely are subject to damage or interruption from a number of causes, including power outages; computer and telecommunications failures; computer viruses; security breaches; cyber-attacks, including the use of ransomware; 
 27 | 
 28 |  8
 29 | catastrophic events such as fires, floods, earthquakes, tornadoes, or hurricanes; acts of war or terrorism; and design or usage errors by our associates, contractors or third-party service providers. Although we and our third-party service providers seek to maintain our respective systems effectively and to successfully address the risk of compromise of the integrity, security and consistent operations of these systems, such efforts may not be successful. As a result, we or our service providers could experience errors, interruptions, delays or cessations of service in key portions of our information technology infrastructure, which could significantly disrupt our operations and be costly, time consuming and resource-intensive to remedy.
 30 | 
 31 | Disruptions in our customer-facing technology systems could impair our interconnected retail strategy and give rise to negative customer experiences.
 32 | 
 33 | Through our information technology developments, we are able to provide an improved overall shopping and interconnected retail experience that empowers our customers to shop and interact with us from computers, tablets, smartphones and other mobile devices. We use our websites and our mobile app both as sales channels for our products and also as methods of providing product, project and other relevant information to our customers to drive both in-store and online sales. We have multiple online communities and knowledge centers that allow us to inform, assist and interact with our customers. Multichannel retailing is continually evolving and expanding, and we must effectively respond to changing customer preferences and new developments. We also continually seek to enhance all of our online properties to provide an attractive user-friendly interface for our customers, as evidenced by our recent redesign of our homedepot.com website. Disruptions, failures or other performance issues with these customer-facing technology systems could impair the benefits that they provide to our online and in-store business and negatively affect our relationship with our customers.
 34 | 
 35 | If our efforts to maintain the privacy and security of customer, associate, supplier and Company information are not successful, we could incur substantial additional costs and reputational damage, and could become subject to further litigation and enforcement actions.
 36 | 
 37 | Our business, like that of most retailers, involves the receipt, storage and transmission of customers’ personal information, preferences and payment card information, as well as other confidential information, such as personal information about our associates and our suppliers and confidential Company information. We also work with third-party service providers and vendors that provide technology, systems and services that we use in connection with the receipt, storage and transmission of this information. Our information systems, and those of our third-party service providers and vendors, are vulnerable to an increasing threat of continually evolving data protection and cybersecurity risks. Unauthorized parties may attempt to gain access to these systems or our information through fraud or other means of deceiving our associates, third-party service providers or vendors. Hardware, software or applications we develop or obtain from third parties may contain defects in design or manufacture or other problems that could unexpectedly compromise information security. The methods used to obtain unauthorized access, disable or degrade service or sabotage systems are also constantly changing and evolving and may be difficult to anticipate or detect for long periods of time. We have implemented and regularly review and update processes and procedures to protect against unauthorized access to or use of data and to prevent data loss. However, the ever-evolving threats mean we and our third-party service providers and vendors must continually evaluate and adapt our respective systems and processes and overall security environment, and there is no guarantee that they will be adequate to safeguard against all data security breaches, system compromises or misuses of data. Any future significant compromise or breach of our data security, whether external or internal, or misuse of customer, associate, supplier or Company data, could result in significant costs, lost sales, fines, lawsuits, and damage to our reputation. In addition, as the regulatory environment related to information security, data collection and use, and privacy becomes increasingly rigorous, with new and constantly changing requirements applicable to our business, compliance with those requirements could also result in significant costs.
 38 | 
 39 | We are subject to payment-related risks that could increase our operating costs, expose us to fraud or theft, subject us to potential liability and potentially disrupt our business.
 40 | 
 41 | We accept payments using a variety of methods, including cash, checks, credit and debit cards, PayPal, our private label credit cards, an installment loan program, trade credit, and gift cards, and we may offer new payment options over time. Acceptance of these payment options subjects us to rules, regulations, contractual obligations and compliance requirements, including payment network rules and operating guidelines, data security standards and certification requirements, and rules governing electronic funds transfers. These requirements may change over time or be reinterpreted, making compliance more difficult or costly. For certain payment methods, including credit and debit cards, we pay interchange and other fees, which may increase over time and raise our operating costs. We rely on third parties to provide payment processing services, including the processing of credit cards, debit cards, and other forms of electronic payment. If these companies become unable to provide these services to us, or if their systems are compromised, it could potentially disrupt our business. The payment methods that we offer also subject us to potential fraud and theft by criminals, who are becoming increasingly more sophisticated, seeking to obtain unauthorized access to or exploit weaknesses that may exist in the payment systems. If we 
 42 | 
 43 |  9
 44 | fail to comply with applicable rules or requirements for the payment methods we accept, or if payment-related data is compromised due to a breach or misuse of data, we may be liable for costs incurred by payment card issuing banks and other third parties or subject to fines and higher transaction fees, or our ability to accept or facilitate certain types of payments may be impaired. In addition, our customers could lose confidence in certain payment types, which may result in a shift to other payment types or potential changes to our payment systems that may result in higher costs. As a result, our business and operating results could be adversely affected.
 45 | 
 46 | Uncertainty regarding the housing market, economic conditions, political climate and other factors beyond our control could adversely affect demand for our products and services, our costs of doing business and our financial performance.
 47 | 
 48 | Our financial performance depends significantly on the stability of the housing, residential construction and home improvement markets, as well as general economic conditions, including changes in gross domestic product. Adverse conditions in or uncertainty about these markets, the economy or the political climate could adversely impact our customers’ confidence or financial condition, causing them to determine not to purchase home improvement products and services, causing them to delay purchasing decisions, or impacting their ability to pay for products and services. Other factors beyond our control – including unemployment and foreclosure rates; interest rate fluctuations; fuel and other energy costs; labor and healthcare costs; the availability of financing; the state of the credit markets, including mortgages, home equity loans and consumer credit; weather; natural disasters; acts of terrorism and other conditions beyond our control – could further adversely affect demand for our products and services, our costs of doing business and our financial performance.
 49 | 
 50 | If we fail to identify and develop relationships with a sufficient number of qualified suppliers, or if our suppliers experience financial difficulties or other challenges, our ability to timely and efficiently access products that meet our high standards for quality could be adversely affected.
 51 | 
 52 | We buy our products from suppliers located throughout the world. Our ability to continue to identify and develop relationships with qualified suppliers who can satisfy our high standards for quality and responsible sourcing, as well as our need to access products in a timely and efficient manner, is a significant challenge. Our ability to access products from our suppliers can be adversely affected by political instability, military conflict, acts of terrorism, the financial instability of suppliers, suppliers’ noncompliance with applicable laws, trade restrictions, tariffs, currency exchange rates, any disruptions in our suppliers’ logistics or supply chain networks, and other factors beyond our or our suppliers’ control.
 53 | 
 54 | The implementation of our supply chain and technology initiatives could disrupt our operations in the near term, and these initiatives might not provide the anticipated benefits or might fail.
 55 | 
 56 | We have made, and we plan to continue to make, significant investments in our supply chain and information technology systems. These initiatives, such as Project Sync and COM, our new Customer Order Management system, are designed to streamline our operations to allow our associates to continue to provide high-quality service to our customers, while simplifying customer interaction and providing our customers with a more interconnected retail experience. The cost and potential problems and interruptions associated with the implementation of these initiatives, including those associated with managing third-party service providers and employing new web-based tools and services, could disrupt or reduce the efficiency of our operations in the near term and lead to product availability issues. Failure to choose the right investments and implement them in the right manner and at the right pace could disrupt our operations. In addition, our improved supply chain and new or upgraded information technology systems might not provide the anticipated benefits, it might take longer than expected to realize the anticipated benefits, or the initiatives might fail altogether, each of which could adversely impact our competitive position and our financial condition, results of operations or cash flows.
 57 | 
 58 | Disruptions in our supply chain and other factors affecting the distribution of our merchandise could adversely impact our business.
 59 | 
 60 | A disruption within our logistics or supply chain network could adversely affect our ability to deliver inventory in a timely manner, which could impair our ability to meet customer demand for products and result in lost sales, increased supply chain costs or damage to our reputation. Such disruptions may result from damage or destruction to our distribution centers; weather-related events; natural disasters; trade policy changes or restrictions; tariffs or import-related taxes; third-party strikes, lock-outs, work stoppages or slowdowns; shipping capacity constraints; supply or shipping interruptions or costs; or other factors beyond our control. Any such disruption could negatively impact our financial performance or financial condition.
 61 | 
 62 |  10
 63 | If we are unable to effectively manage and expand our alliances and relationships with selected suppliers of both brand name and proprietary products, we may be unable to effectively execute our strategy to differentiate ourselves from our competitors.
 64 | 
 65 | As part of our focus on product differentiation, we have formed strategic alliances and exclusive relationships with selected suppliers to market products under a variety of well-recognized brand names. We have also developed relationships with selected suppliers to allow us to market proprietary products that are comparable to national brands. Our proprietary products differentiate us from other retailers, generally carry higher margins than national brand products, and represent a growing portion of our business. If we are unable to manage and expand these alliances and relationships or identify alternative sources for comparable brand name and proprietary products, we may not be able to effectively execute product differentiation, which may impact our sales and gross margin results.
 66 | 
 67 | Our proprietary products subject us to certain increased risks.
 68 | 
 69 | As we expand our proprietary product offerings, we may become subject to increased risks due to our greater role in the design, manufacture, marketing and sale of those products. The risks include greater responsibility to administer and comply with applicable regulatory requirements, increased potential product liability and product recall exposure and increased potential reputational risks related to the responsible sourcing of those products. To effectively execute on our product differentiation strategy, we must also be able to successfully protect our proprietary rights and successfully navigate and avoid claims related to the proprietary rights of third parties. In addition, an increase in sales of our proprietary products may adversely affect sales of our vendors’ products, which in turn could adversely affect our relationships with certain of our vendors. Any failure to appropriately address some or all of these risks could damage our reputation and have an adverse effect on our business, results of operations and financial condition.
 70 | 
 71 | If we are unable to manage effectively our installation services business, we could suffer lost sales and be subject to fines, lawsuits and reputational damage, or the loss of our general contractor licenses.
 72 | 
 73 | We act as a general contractor to provide installation services to our DIFM customers through professional third-party installers. As such, we are subject to regulatory requirements and risks applicable to general contractors, which include management of licensing, permitting and quality of work performed by our third-party installers. We have established processes and procedures to manage these requirements and ensure customer satisfaction with the services provided by our third-party installers. However, if we fail to manage these processes effectively or to provide proper oversight of these services, we could suffer lost sales, fines and lawsuits for violations of regulatory requirements, as well as for property damage or personal injury. In addition, we may suffer damage to our reputation or the loss of our general contractor licenses, which could adversely affect our business. 
 74 | 
 75 | We may be unsuccessful in implementing our growth strategy, which could have an adverse impact on our financial condition and results of operation.
 76 | 
 77 | In fiscal 2015, we completed the acquisition of Interline, which we believe has enhanced our ability to serve our professional customers and increased our share of the MRO market. During fiscal 2016, we continued to develop and implement our strategy with Interline. Our goal is to serve all of our different Pro customer groups through one integrated approach to drive growth and capture market share in the retail, services and MRO markets, and this strategy depends, in part, on our continuing integration of Interline. As with any acquisition, we need to successfully integrate Interline’s products, services, associates and systems into our business operations. Integration can be a complex and time-consuming process, and if the integration is not fully successful or is delayed for a material period of time, we may not achieve the anticipated synergies or benefits of the acquisition. Furthermore, even if Interline is successfully integrated, the acquisition may fail to further our business strategy as anticipated, expose us to increased competition or challenges with respect to our products or services, and expose us to additional liabilities associated with the Interline business and the wholesale market.
 78 | 
 79 | Our costs of doing business could increase as a result of changes in, expanded enforcement of, or adoption of new federal, state or local laws and regulations.
 80 | 
 81 | We are subject to various federal, state and local laws and regulations that govern numerous aspects of our business. In recent years, a number of new laws and regulations have been adopted, and there has been expanded enforcement of certain existing laws and regulations by federal, state and local agencies. These laws and regulations, and related interpretations and enforcement activity, may change as a result of a variety of factors, including political, economic or social events. Changes in, expanded enforcement of, or adoption of new federal, state or local laws and regulations governing minimum wage or living wage requirements; other wage, labor or workplace regulations; healthcare; data protection and cybersecurity; the sale of some of our products; transportation; logistics; international trade; supply chain transparency; taxes; unclaimed property; 
 82 | 
 83 |  11
 84 | energy costs; or environmental matters, including with respect to our installation services business, could increase our costs of doing business or impact our operations.
 85 | 
 86 | If we cannot successfully manage the unique challenges presented by international markets, we may not be successful in our international operations and our sales and profit margins may be impacted.
 87 | 
 88 | Our ability to successfully conduct retail operations in, and source products and materials from, international markets is affected by many of the same risks we face in our U.S. operations, as well as unique costs and difficulties of managing international operations. Our international operations, including any expansion in international markets, may be adversely affected by local laws and customs, U.S. laws applicable to foreign operations and other legal and regulatory constraints, as well as political and economic conditions. Risks inherent in international operations also include, among others, potential adverse tax consequences; potential tariffs and other import-related taxes; greater difficulty in enforcing intellectual property rights; risks associated with the Foreign Corrupt Practices Act and local anti-bribery law compliance; and challenges in our ability to identify and gain access to local suppliers. In addition, our operations in international markets create risk due to foreign currency exchange rates and fluctuations in those rates, which may adversely impact our sales and profit margins. 
 89 | 
 90 | The inflation or deflation of commodity prices could affect our prices, demand for our products, our sales and our profit margins.
 91 | 
 92 | Prices of certain commodity products, including lumber and other raw materials, are historically volatile and are subject to fluctuations arising from changes in domestic and international supply and demand, labor costs, competition, market speculation, government regulations and periodic delays in delivery. Rapid and significant changes in commodity prices may affect the demand for our products, our sales and our profit margins.
 93 | 
 94 | Changes in accounting standards and subjective assumptions, estimates and judgments by management related to complex accounting matters could significantly affect our financial results or financial condition.
 95 | 
 96 | Generally accepted accounting principles and related accounting pronouncements, implementation guidelines and interpretations with regard to a wide range of matters that are relevant to our business, such as revenue recognition, asset impairment, impairment of goodwill and other intangible assets, inventories, lease obligations, self-insurance, tax matters and litigation, are highly complex and involve many subjective assumptions, estimates and judgments. Changes in these rules or their interpretation or changes in underlying assumptions, estimates or judgments could significantly change our reported or expected financial performance or financial condition.
 97 | 
 98 | We have incurred losses related to the data breach we discovered in the third quarter of fiscal 2014 (the "Data Breach"); we may incur additional losses or experience future operational impacts on our business, which could have an adverse impact on our operations, financial results and reputation.
 99 | 
100 | The Data Breach involved the theft of certain payment card information and customer email addresses through unauthorized access to our systems. Since the Data Breach occurred, we have recorded $198 million of pretax expenses, net of expected insurance recoveries, in connection with the Data Breach, as described in more detail in Note 13 to the Consolidated Financial Statements included in Item 8, "Financial Statements and Supplementary Data". We are still facing a consolidated shareholder derivative action brought by two purported shareholders and an investigation by a number of State Attorneys General. These matters may adversely affect how we operate our business, divert the attention of management from the operation of the business, have an adverse effect on our reputation, and result in additional costs and fines.
101 | 
102 | We are involved in a number of legal and regulatory proceedings, and while we cannot predict the outcomes of those proceedings and other contingencies with certainty, some of these outcomes may adversely affect our operations or increase our costs.
103 | 
104 | In addition to the matters discussed above with respect to the Data Breach, we are involved in a number of legal proceedings and regulatory matters, including government inquiries and investigations, and consumer, employment, tort and other litigation that arise from time to time in the ordinary course of business. Litigation is inherently unpredictable, and the outcome of some of these proceedings and other contingencies could require us to take or refrain from taking actions which could adversely affect our operations or could result in excessive adverse verdicts. Additionally, involvement in these lawsuits, investigations and inquiries, and other proceedings may involve significant expense, divert management’s attention and resources from other matters, and impact the reputation of the Company.
105 | 
106 |  12
107 | Item 1B. Unresolved Staff Comments.
108 | 
109 | Not applicable.
110 | 
111 | Item 2. Properties.


--------------------------------------------------------------------------------
/output_files_examples/batch_0001/001/HD_0000354950_10K_20170129_Item1_excerpt.txt:
--------------------------------------------------------------------------------
  1 | PART I
  2 | 
  3 | Item 1. Business.
  4 | 
  5 | Introduction
  6 | 
  7 | The Home Depot, Inc. is the world’s largest home improvement retailer based on Net Sales for the fiscal year ended January 29, 2017 ("fiscal 2016"). The Home Depot sells a wide assortment of building materials, home improvement products and lawn and garden products and provides a number of services. The Home Depot stores average approximately 104,000 square feet of enclosed space, with approximately 24,000 additional square feet of outside garden area. As of the end of fiscal 2016, we had 2,278 The Home Depot stores located throughout the United States, including the Commonwealth of Puerto Rico and the territories of the U.S. Virgin Islands and Guam, Canada and Mexico. When we refer to "The Home Depot", the "Company", "we", "us" or "our" in this report, we are referring to The Home Depot, Inc. and its consolidated subsidiaries.
  8 | 
  9 | The Home Depot, Inc. is a Delaware corporation that was incorporated in 1978. Our Store Support Center (corporate office) is located at 2455 Paces Ferry Road, Atlanta, Georgia 30339. Our telephone number is (770) 433-8211.
 10 | 
 11 | Our internet website is www.homedepot.com. We make available on the Investor Relations section of our website, free of charge, our Annual Reports to shareholders, Annual Reports on Form 10-K, Quarterly Reports on Form 10-Q, Current Reports on Form 8-K, Proxy Statements and Forms 3, 4 and 5, and amendments to those reports, as soon as reasonably practicable after filing such documents with, or furnishing such documents to, the SEC.
 12 | 
 13 | We include our website addresses throughout this filing for reference only. The information contained on our websites is not incorporated by reference into this report.
 14 | 
 15 | For information on key financial highlights, including historical revenues, profits and total assets, see the "Five-Year Summary of Financial and Operating Results" on page F-1 of this report and Item 7, "Management’s Discussion and Analysis of Financial Condition and Results of Operations".
 16 | 
 17 |  1
 18 | Our Business
 19 | 
 20 | Operating Strategy 
 21 | 
 22 | Our strategic framework is founded on our three-legged stool of customer experience, product authority, and productivity and efficiency driven by effective capital allocation, all united by the goal of providing an interconnected retail experience to drive value for our customers, our associates, our suppliers and our shareholders. The retail landscape is rapidly evolving, and our goal is to become more agile in responding to the changing competitive landscape and customer preferences. As customers increasingly expect to be able to buy how, when and where they want, we believe that providing a seamless and frictionless shopping experience across multiple channels, featuring innovative and curated product choices delivered in a fast and cost-efficient manner, will be a key enabler for future success. 
 23 | 
 24 | Becoming a best-in-class interconnected retailer is growing in importance as the line between online and in-store shopping continues to blur and customers demand increased convenience and value. Under customer experience, we are focused on connecting our associates to customer needs and connecting our stores to our online experience. Under product authority, we are focused on connecting our product and services to customer needs, including offering local and customized assortments to meet the unique needs of our different communities. Under productivity and efficiency, we are focused on connecting our merchandise from our suppliers to our customers by optimizing our supply chain. Overall, we are collaborating more closely throughout our organization and with our business partners to build an interconnected experience for customers that offers an end-to-end solution for their home improvement needs.
 25 | 
 26 | Customer Experience 
 27 | 
 28 | Customer experience is anchored on the principles of putting customers first and taking care of our associates. Our commitment to customer service has been and will continue to be a key part of the customer experience. But we recognize that the customer experience includes more than just customer service, and we have taken a number of steps to provide our customers with a seamless and frictionless shopping experience in our stores, online, on the job site or in their homes.
 29 | 
 30 | Our Customers. We serve three primary customer groups, and we have different approaches to meet their particular needs:
 31 | 
 32 | •Do-It-Yourself ("DIY") Customers. These customers are typically home owners who purchase products and complete their own projects and installations. Our associates assist these customers with specific product and installation questions both in our stores and through online resources and other media designed to provide product and project knowledge. We also offer a variety of clinics and workshops both to impart this knowledge and to build an emotional connection with our DIY customers.
 33 | 
 34 | •Do-It-For-Me ("DIFM") Customers. These customers are typically home owners who purchase materials and hire third parties to complete the project or installation. Our stores offer a variety of installation services available to DIFM customers who purchase products and installation of those products from us in our stores, online or in their homes through in-home consultations. Our installation programs include many categories, such as flooring, cabinets, countertops, water heaters and sheds. In addition, we provide third-party professional installation in a number of categories sold through our in-home sales programs, such as roofing, siding, windows, cabinet refacing, furnaces and central air systems. This customer group is growing due to changing demographics, which we believe will increase demand for our installation services. Further, our focus on serving the professional customers, or "Pros", who perform these services for our DIFM customers will help us drive higher product sales. 
 35 | 
 36 | •Professional Customers. These customers are primarily professional renovators/remodelers, general contractors, handymen, property managers, building service contractors and specialty tradesmen, such as installers. We recognize the great value our Pro customers provide to their clients, and we strive to make the Pro’s job easier. For example, we offer our Pros a wide range of special programs such as delivery and will-call services, dedicated sales and service staff, enhanced credit programs, designated parking spaces close to store entrances and bulk pricing programs for both online and in-store purchases. In addition, we maintain a loyalty program, Pro Xtra, that provides our Pros with discounts on useful business services, exclusive product offers and a purchase tracking tool to enable receipt lookup online and job tracking of purchases across all forms of payment. 
 37 | 
 38 | With our acquisition of Interline Brands, Inc. ("Interline") in August 2015, we established a platform in the maintenance, repair and operations ("MRO") market where we serve primarily institutional customers (such as educational and healthcare institutions), hospitality businesses, and national apartment complexes. We also gained additional competencies relevant to our large professional customers, including outside sales and account management, expanded assortment of product, expanded financing options and last mile delivery capabilities.
 39 | 
 40 |  2
 41 | We recognize that our Pros have differing needs depending on the type of work they perform. Our goal is to develop a comprehensive set of capabilities for our professional customers to provide solutions across every purchase opportunity, such as supplying both recurring MRO needs and renovation products and services to property managers or providing inventory management solutions for specialty tradesmen’s replenishment needs. We believe that by bringing our best resources to bear for each individual customer, we can provide a differentiated customer experience and enhanced value proposition for our professional customers.
 42 | 
 43 | We help our DIY, DIFM and Pro customers finance their projects by offering private label credit products through third-party credit providers. Our private label credit program includes other benefits, such as a 365-day return policy and, for our Pros, commercial fuel rewards and extended payment terms. In fiscal 2016, our customers opened approximately 4.2 million new The Home Depot private label credit accounts, and at fiscal year end the total number of The Home Depot active account holders was approximately 13 million. Private label credit card sales accounted for approximately 23% of sales in fiscal 2016. 
 44 | 
 45 | Our Associates. Our associates are key to our customer experience. We empower our associates to deliver excellent customer service, and we strive to remove complexity and inefficient processes from the stores to allow our associates to focus on our customers. In fiscal 2016, we continued to implement new initiatives to improve freight handling in the stores, as well as Project Sync, which is discussed in more detail below under "Logistics". All of these programs are designed to make our freight handling process more efficient, which allows our associates to devote more time to the customer experience and makes working at The Home Depot a better experience for them. We also have a number of programs to recognize stores and individual associates for exceptional customer service. 
 46 | 
 47 | At the end of fiscal 2016, we employed approximately 406,000 associates, of whom approximately 27,000 were salaried, with the remainder compensated on an hourly or temporary basis. To attract and retain qualified personnel, we seek to maintain competitive salary and wage levels in each market we serve. We measure associate satisfaction regularly, and we believe that our employee relations are very good.
 48 | 
 49 | Interconnected Retail. In fiscal 2016, we continued to enhance our customers’ interconnected shopping experiences through a variety of initiatives. Our associates continued to use FIRST phones, our web-enabled handheld devices, to help customers complete online sales in the aisle, expedite the checkout process for customers during peak traffic periods, locate products in the aisles and online, and check inventory on hand. We have also empowered our customers with improved product location and inventory availability tools on our website and mobile app. During fiscal 2016, we launched a redesign of our website to provide our customers with better search capabilities and a faster checkout experience, and we upgraded our mobile app. Enhancements to our online and mobile properties are critical for our increasingly interconnected customers who research products online and then go into one of our stores to view the products in person or talk to an associate before making the purchase. While in the store, customers may also go online to access ratings and reviews, compare prices, view our extended assortment and purchase products.
 50 | 
 51 | We also recognize that customers desire greater flexibility and convenience when it comes to receiving their products and services. We strive to deliver on this customer expectation by blending the physical and digital channels into a seamless customer experience. A key component of this strategy is enabled through our new Customer Order Management platform ("COM"), which we rolled out to all U.S. stores in fiscal 2016. COM provides a common management system for customer orders, providing greater visibility and a more seamless and frictionless experience for our customers and associates. COM also provides the foundation for the enhanced delivery option we rolled out to all U.S. stores in fiscal 2016: Buy Online, Deliver From Store ("BODFS"). BODFS allows customers to schedule a specific delivery window, and it complements our existing interconnected retail programs: Buy Online, Pick-up In Store ("BOPIS"), Buy Online, Ship to Store ("BOSS") and Buy Online, Return In Store ("BORIS"). During the year, we also enabled a dynamic ETA (estimated time of arrival) feature for online orders, which provides our customers with a faster and more accurate delivery date based on their location.
 52 | 
 53 | We do not view the customer experience as a specific transaction; rather, it encompasses an entire process from inspiration and know-how, to purchase and fulfillment, to post-purchase care and support. Further, we believe that by connecting our stores to online and online to our stores, we drive sales not just in-store but also online. In fiscal 2016, we saw increased traffic to our online properties and improved online sales conversion rates. Sales from our online channels increased over 19% compared to fiscal 2015. We will continue to blend our physical and digital assets in a seamless and frictionless way to enhance the end-to-end customer experience.
 54 | 
 55 |  3
 56 | Product Authority
 57 | 
 58 | Product authority is facilitated by our merchandising transformation and portfolio strategy, which is focused on delivering product innovation, assortment and value. In fiscal 2016, we continued to introduce a wide range of innovative new products to our DIY, DIFM and Pro customers, while remaining focused on offering everyday values in our stores and online. 
 59 | 
 60 | Our Products. A typical The Home Depot store stocks approximately 30,000 to 40,000 products during the year, including both national brand name and proprietary items. To enhance our merchandising capabilities, we continued to make improvements to our information technology tools in fiscal 2016 to better understand our customers, provide more localized assortments to fit customer demand and optimize space to dedicate the right square footage to the right products in the right location. Our online product offerings complement our stores by serving as an extended aisle, and we offer a significantly broader product assortment through our websites, including homedepot.com and Blinds.com. We also routinely use our merchandising tools to refine our online assortment to balance the extended choice with a more curated offering.
 61 | 
 62 | In fiscal 2016, we introduced a number of innovative and distinctive products to our customers at attractive values. Examples of these new products include Feit® LED Edge-Lit flat panel lighting; Pergo® Outlast+ flooring; Thomasville® Studio 1904 kitchen cabinets; DEWALT FLEXVOLT® system for power tools; Diablo® carbide hole saws; Defiant® LED security lights; Makita® subcompact drills; and Milwaukee M18 FUEL® with ONE KEY™ power tools, compatible with the HIGH DEMAND™ 9.0Ah battery.
 63 | 
 64 | During fiscal 2016, we continued to offer value to our customers through our proprietary and exclusive brands across a wide range of departments. Highlights of these offerings include Husky® hand tools, tool storage and work benches; Everbilt® hardware and fasteners; Hampton Bay® lighting, ceiling fans and patio furniture; Vigoro® lawn care products; RIDGID® and Ryobi® power tools; Glacier Bay® bath fixtures; HDX® storage and cleaning products; and Home Decorators Collection® furniture and home décor. We will continue to assess departments and categories, both online and in-store, for opportunities to expand the assortment of products offered within The Home Depot’s portfolio of proprietary and exclusive brands.
 65 | 
 66 | We maintain a global sourcing program to obtain high-quality and innovative products directly from manufacturers around the world. In fiscal 2016, in addition to our U.S. sourcing operations, we maintained sourcing offices in Mexico, Canada, China, India, Southeast Asia and Europe.
 67 | 
 68 | The percentage of Net Sales of each of our major product categories (and related services) for each of the last three fiscal years is presented in Note 1 to the Consolidated Financial Statements included in Item 8, "Financial Statements and Supplementary Data". Net Sales inside the U.S. were $86.6 billion, $80.5 billion and $74.7 billion for fiscal 2016, 2015 and 2014, respectively. Net Sales outside the U.S. were $8.0 billion, $8.0 billion and $8.5 billion for fiscal 2016, 2015 and 2014, respectively. Long-lived assets, which consist of net property and equipment, inside the U.S. totaled $19.5 billion, $19.9 billion and $20.2 billion as of January 29, 2017, January 31, 2016 and February 1, 2015, respectively. Long-lived assets outside the U.S. totaled $2.4 billion, $2.3 billion and $2.5 billion as of January 29, 2017, January 31, 2016 and February 1, 2015, respectively.
 69 | 
 70 | Quality Assurance. Our suppliers are obligated to ensure that their products comply with applicable international, federal, state and local laws. In addition, we have both quality assurance and engineering resources dedicated to establishing criteria and overseeing compliance with safety, quality and performance standards for our proprietary branded products. We also have a global Supplier Social and Environmental Responsibility Program designed to ensure that suppliers adhere to the highest standards of social and environmental responsibility.
 71 | 
 72 | Environmentally-Preferred Products and Programs. The Home Depot is committed to sustainable business practices – from the environmental impact of our operations, to our sourcing activities, to our involvement within the communities in which we do business. We believe these efforts continue to be successful in creating value for our customers and shareholders. For example, we offer a growing selection of environmentally-preferred products, which supports sustainability and helps our customers save energy, water and money. Through our Eco Options® Program introduced in 2007, we have created product categories that allow customers to easily identify products that meet specifications for energy efficiency, water conservation, healthy home, clean air and sustainable forestry. As of the end of fiscal 2016, our Eco Options® Program included over 10,000 products. Through this program, we sell ENERGY STAR® certified appliances, LED light bulbs, tankless water heaters and other products that enable our customers to save on their utility bills. We estimate that in fiscal 2016 we helped customers save over $900 million in electricity costs through sales of ENERGY STAR® certified products. We also estimate our customers saved over 76 billion gallons of water resulting in over $640 million in water bill savings in fiscal 2016 through the sales of our WaterSense®-labeled bath faucets, showerheads, aerators, toilets and irrigation controllers. Our 2016 
 73 | 
 74 |  4
 75 | Responsibility Report, available on our website at https://corporate.homedepot.com/responsibility, describes many of our other environmentally-preferred products that promote energy efficiency, water conservation, clean air and a healthy home.
 76 | 
 77 | We continue to offer store recycling programs nationwide, such as an in-store compact fluorescent light ("CFL") bulb recycling program launched in 2008. This service is offered to customers free of charge and is available in all U.S. stores. We also maintain an in-store rechargeable battery recycling program. Launched in 2001 and currently done in partnership with Call2Recycle, this program is also available to customers free of charge in all stores throughout the U.S. Through our recycling programs, in fiscal 2016 we helped recycle over 860,000 pounds of CFL bulbs and over 1 million pounds of rechargeable batteries. In fiscal 2016, we also recycled over 180,000 lead acid batteries collected from our customers under our lead acid battery exchange program, as well as over 225,000 tons of cardboard through a nationwide cardboard recycling program across our U.S. operations. We believe our environmentally-preferred product selection and our recycling efforts drive sales, which in turn benefits our shareholders, in addition to our customers and the environment.
 78 | 
 79 | Seasonality. Our business is subject to seasonal influences. Generally, our highest volume of sales occurs in our second fiscal quarter, and the lowest volume occurs either during our first or fourth fiscal quarter. 
 80 | 
 81 | Competition. Our industry is highly competitive, with competition based primarily on customer experience, price, store location and appearance, and quality, availability, assortment and presentation of merchandise. Although we are currently the world’s largest home improvement retailer, in each of the markets we serve there are a number of other home improvement stores, electrical, plumbing and building materials supply houses, and lumber yards. With respect to some products and services, we also compete with specialty design stores, showrooms, discount stores, local, regional and national hardware stores, paint stores, mail order firms, warehouse clubs, independent building supply stores, MRO companies and other retailers, as well as with providers of home improvement services. In addition, we face growing competition from online and multichannel retailers, some of whom may have a lower cost structure than ours, as our customers now routinely use computers, tablets, smartphones and other mobile devices to shop online and compare prices and products.
 82 | 
 83 | Intellectual Property. Our business has one of the most recognized brands in North America. As a result, we believe that The Home Depot® trademark has significant value and is an important factor in the marketing of our products, e-commerce, stores and business. We have registered or applied for registration of trademarks, service marks, copyrights and internet domain names, both domestically and internationally, for use in our business, including our expanding proprietary brands such as HDX®, Husky®, Hampton Bay®, Home Decorators Collection®, Glacier Bay® and Vigoro®. We also maintain patent portfolios relating to some of our products and services and seek to patent or otherwise protect innovations we incorporate into our products or business operations.
 84 | 
 85 | Productivity and Efficiency Driven by Capital Allocation
 86 | 
 87 | We continue to drive productivity and efficiency by building best-in-class competitive advantages in our information technology and supply chain to ensure product availability for our customers while managing our costs, which results in higher returns for our shareholders. During fiscal 2016, we continued to focus on optimizing our end-to-end supply chain network and improving our inventory, transportation and distribution productivity.
 88 | 
 89 | Logistics. We continue to invest in our supply chain by optimizing our network through initiatives like Supply Chain Synchronization, or "Project Sync". As described in more detail below, Project Sync is a recent initiative to improve our in-stock rates, inventory productivity and logistics costs. During fiscal 2016, we rolled out the first phase of Project Sync to all U.S. stores, and we continue to onboard new suppliers. This is a multi-year, multi-phase endeavor, which we plan to implement in each of our distribution flow paths.
 90 | 
 91 | Our overall distribution strategy is to provide the optimal flow path for a given product. Rapid Deployment Centers ("RDCs") play a key role in optimizing our network as they allow for aggregation of product needs for multiple stores to a single purchase order and then rapid allocation and deployment of inventory to individual stores upon arrival at the RDC. This results in a simplified ordering process and improved transportation and inventory management. Through Project Sync, we can significantly reduce our average lead time from supplier to shelf. Project Sync requires deep collaboration among our suppliers, transportation providers, RDCs and stores, as well as rigorous planning and information technology development to create an engineered flow schedule that shortens and stabilizes lead time, resulting in more predictable and consistent freight flow. As we continue to roll out Project Sync throughout our supply chain over the next several years, we plan to create an end-to-end solution that benefits all participants in our supply chain, from our suppliers to our transportation providers to our RDC and store associates to our customers.
 92 | 
 93 | Over the past several years, we have centralized our inventory planning and replenishment function and continuously improved our forecasting and replenishment technology. This has helped us improve our product availability and our 
 94 | 
 95 |  5
 96 | inventory productivity at the same time. At the end of fiscal 2016, over 95% of our U.S. store products were ordered through central inventory management.
 97 | 
 98 | We operate multiple distribution center platforms tailored to meet the needs of our stores and customers, based on the types of products, local geography, and transportation and delivery requirements. The following table sets forth the number and type of distribution centers in our network as of the end of fiscal 2016:
 99 | 
100 | [DATA_TABLE_REMOVED]
101 | 
102 | To better serve our online customers, we offer a variety of delivery methods to get product to each customer’s preferred location. We completed the rollout of three new U.S. DFCs in fiscal 2015 to support our online growth. We also operate one DFC for our Home Decorators Collection orders and one DFC for "big and bulky" orders. Our DFCs enable us to reduce delivery costs and lead time, and improve the overall customer experience by shipping online orders directly to the customer. In addition, with our acquisition of Interline in fiscal 2015, we added more than 100 distribution points with fast delivery of a broad assortment of MRO products. We remain committed to leveraging our supply chain capabilities to fully utilize and optimize our improved logistics network.
103 | 
104 | In addition to the distribution and fulfillment centers described above, we leverage our almost 2,000 U.S. stores as a network of convenient customer pick-up, return and delivery fulfillment locations. For customers who shop online and wish to pick-up or return merchandise at our U.S. stores, we have fully implemented our BOPIS, BOSS and BORIS programs, which we believe provide us with a competitive advantage. For example, in the fourth quarter of fiscal 2016, approximately 45% of our online U.S. orders were picked up in a store. For customers who would like the option to have store-based orders delivered directly to their home or job site, we pick, pack and ship orders to customers from our stores. Our BODFS program, which we rolled out in fiscal 2016, allows our customers to select their preferred delivery date and time windows for store-based deliveries. Our supply chain and logistics strategies will continue to be focused on providing our customers high product availability with convenient and low cost fulfillment options. 
105 | 
106 | Commitment to Sustainability and Environmentally Responsible Operations. The Home Depot focuses on sustainable operations and is committed to conducting business in an environmentally responsible manner. This commitment impacts all areas of our business, including energy usage, supply chain and packaging, store construction and maintenance, and, as noted above under "Environmentally-Preferred Products and Programs", product selection and recycling programs for our customers.
107 | 
108 | In 2015, we announced two major sustainability commitments for 2020. Our first goal is to reduce our U.S. stores’ energy use by 20% over 2010 levels, and our second goal is to produce and procure, on an annual basis, 135 megawatts of energy for our stores through renewable or alternate energy sources, such as wind, solar and fuel cell technology. As of the end of fiscal 2016, we are on track to exceed both of our goals before the end of 2020. We are committed to implementing strict operational standards that establish energy efficient operations in all of our U.S. facilities and continuing to invest in renewable energy. 
109 | 
110 |  6
111 | Additionally, we implemented a rainwater reclamation project in our stores in 2010. As of the end of fiscal 2016, 145 of our stores used reclamation tanks to collect rainwater and condensation from HVAC units and garden center roofs, which is in turn used to water plants in our outside garden centers. We estimate our annual water savings from these units to be over 500,000 gallons per store for total water savings of over 72.5 million gallons in fiscal 2016. Our 2016 Responsibility Report, which uses the Global Reporting Initiative (GRI) framework for sustainability reporting, provides more information on sustainability efforts in other aspects of our operations. Our 2016 Responsibility Report is available on our website at https://corporate.homedepot.com/responsibility.
112 | 
113 | Our commitment to corporate sustainability has resulted in a number of environmental awards and recognitions. In 2016, we received three significant awards from the U.S. Environmental Protection Agency ("EPA"). The ENERGY STAR® division named us "Retail Partner of the Year – Sustained Excellence" for our overall excellence in energy efficiency, and we received the 2016 WaterSense® Sustained Excellence Award for our overall excellence in water efficiency. We also received the EPA’s "SmartWay Excellence Award", which recognizes The Home Depot as an industry leader in freight supply chain environmental performance and energy efficiency. We also participate in the CDP (formerly known as the Carbon Disclosure Project) reporting process. CDP is an independent, international, not-for-profit organization providing a global system for companies and cities to measure, disclose, manage and share environmental information. In 2016, we received a score of A- (out of a range from A to E) from the CDP, reflecting a high level of action on climate change mitigation, adaptation and transparency. We also were named an industry leader by the CDP.
114 | 
115 | We are strongly committed to maintaining a safe shopping and working environment for our customers and associates. Our Environmental, Health & Safety ("EH&S") function is dedicated to ensuring the health and safety of our customers and associates, with trained associates who evaluate, develop, implement and enforce policies, processes and programs on a Company-wide basis. Our EH&S policies are woven into our everyday operations and are part of The Home Depot culture. Some common program elements include: daily store inspection checklists (by department); routine follow-up audits from our store-based safety team members and regional, district and store operations field teams; equipment enhancements and preventative maintenance programs to promote physical safety; departmental merchandising safety standards; training and education programs for all associates, with varying degrees of training provided based on an associate’s role and responsibilities; and awareness, communication and recognition programs designed to drive operational awareness and understanding of EH&S issues.
116 | 
117 | Returning Value to Shareholders. As noted above, we drive productivity and efficiency through our capital allocation decisions, with a focus on expense control. This discipline drove higher returns on invested capital and allowed us to return value to shareholders through $7.0 billion in share repurchases and $3.4 billion in dividends in fiscal 2016, as discussed in Item 7, "Management’s Discussion and Analysis of Financial Condition and Results of Operations".
118 | 
119 | Item 1A. Risk Factors.


--------------------------------------------------------------------------------
/output_files_examples/batch_0001/001/HD_0000354950_10K_20170129_Item7A_excerpt.txt:
--------------------------------------------------------------------------------
1 | Item 7A. Quantitative and Qualitative Disclosures About Market Risk.
2 | 
3 | The information required by this item is incorporated by reference to Item 7, "Management’s Discussion and Analysis of Financial Condition and Results of Operations" of this report.
4 | 
5 |  28
6 | Item 8. Financial Statements and Supplementary Data.


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | beautifulsoup4>=4.4.1
2 | lxml>=3.5.0
3 | requests>=2.12.4


--------------------------------------------------------------------------------
/src/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 |     secedgartext: extract text from SEC corporate filings
3 |     Copyright (C) 2017  Alexander Ions
4 | 
5 |     You should have received a copy of the GNU General Public License
6 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
7 | """


--------------------------------------------------------------------------------
/src/control.py:
--------------------------------------------------------------------------------
  1 | """
  2 |     secedgartext: extract text from SEC corporate filings
  3 |     Copyright (C) 2017  Alexander Ions
  4 | 
  5 |     You should have received a copy of the GNU General Public License
  6 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
  7 | """
  8 | import re
  9 | import os
 10 | 
 11 | from .download import EdgarCrawler
 12 | from .utils import logger, args
 13 | from .utils import companies_file_location, single_company, date_search_string
 14 | from .utils import batch_number, storage_toplevel_directory
 15 | 
 16 | MAX_FILES_IN_SUBDIRECTORY = 1000
 17 | 
 18 | class Downloader(object):
 19 |     def __init__(self):
 20 |         self.storage_path = args.storage
 21 | 
 22 |     def download_companies (self, do_save_full_document=False):
 23 |         """Iterate through a list of companies and download documents.
 24 | 
 25 |         Downloading document contents within each filing type required
 26 |         :param do_save_full_document: save a local copy of the whole original
 27 |         document
 28 |         :return:
 29 |         """
 30 |         companies = list()
 31 |         if single_company:
 32 |             companies.append([str(single_company), str(single_company)])
 33 |             logger.info("Downloading single company: %s", args.company)
 34 |         if not companies:
 35 |             try:
 36 |                 companies = company_list(companies_file_location)
 37 |                 logger.info("Using companies list: %s",
 38 |                             companies_file_location)
 39 |             except:
 40 |                 logger.warning("Companies list not available")
 41 |                 company_input = input("Enter company code (CIK or ticker): ")
 42 |                 if company_input:
 43 |                     companies.append([company_input, company_input])
 44 |                     logger.info("Downloading single company: %s", company_input)
 45 |                 else:
 46 |                     # default company: Dow Chemical
 47 |                     company_default = 'DOW'
 48 |                     companies.append([company_default, company_default.title()])
 49 |                     logger.info("Downloading default company: %s",
 50 |                                 next(iter(companies)))
 51 |         start_date =  args.start    # TODO:this may be ignored by EDGAR web interface, consider removing this argument
 52 |         end_date = args.end
 53 |         filings = args.filings
 54 | 
 55 |         logger.info('-' * 65)
 56 |         logger.info("Downloading %i companies: %s", len(companies),
 57 |                     single_company or companies_file_location)
 58 |         logger.info("Filings period: %i - %i", args.start, args.end)
 59 |         logger.info("Filings search: %s", filings)
 60 |         logger.info("Storage location: %s", self.storage_path)
 61 |         logger.info('-' * 65)
 62 | 
 63 |         start_company = max(1, int(args.start_company or 1))
 64 |         end_company = min(len(companies),
 65 |                           int(args.end_company or len(companies)))
 66 | 
 67 |         download_companies = companies[start_company-1:end_company]
 68 |         seccrawler = EdgarCrawler()
 69 | 
 70 | 
 71 |         if do_save_full_document:
 72 |             logger.info("Saving source document and extracts "
 73 |                         "(if successful) locally")
 74 |         else:
 75 |             logger.info("Saving extracts (if successful) only. "
 76 |                         "Not saving source documents locally.")
 77 |         logger.info("SEC filing date range: %i to %i", start_date, end_date)
 78 |         storage_subdirectory_number = 1
 79 | 
 80 |         for c, company_keys in enumerate(download_companies):
 81 |             edgar_search_string = str(company_keys[0])
 82 |             company_description = str(company_keys[1]).strip()
 83 |             company_description = re.sub('/','', company_description)
 84 | 
 85 |             logger.info('Batch number: ' + str(batch_number) +
 86 |                         ', begin downloading company: ' +
 87 |                         str(c + 1) + ' / ' +
 88 |                         str(len(download_companies)))
 89 |             storage_subdirectory = os.path.join(storage_toplevel_directory,
 90 |                                                format(storage_subdirectory_number,
 91 |                                                       '03d'))
 92 |             if not os.path.exists(storage_subdirectory):
 93 |                 os.makedirs(storage_subdirectory)
 94 |             seccrawler.storage_folder = storage_subdirectory
 95 |             for filing_search_string in args.filings:
 96 |                 seccrawler.download_filings(company_description,
 97 |                                             edgar_search_string,
 98 |                                             filing_search_string,
 99 |                                             date_search_string,
100 |                                             str(start_date),
101 |                                             str(end_date), do_save_full_document)
102 |             if len(os.listdir(storage_subdirectory)) > MAX_FILES_IN_SUBDIRECTORY:
103 |                 storage_subdirectory_number += 1
104 |         logger.warning("SUCCESS: Finished attempted download of " +
105 |                        str(len(download_companies) or 0) +
106 |                        " companies from an overall list of " +
107 |                        str(len(companies) or 0) + " companies." )
108 | 
109 | 
110 | def company_list(text_file_location):
111 |     """Read companies list from text_file_location, load into a dictionary.
112 |     :param text_file_location:
113 |     :return: company_dict: each element is a list of CIK code text and
114 |     company descriptive text
115 |     """
116 |     company_list = list()
117 |     with open(text_file_location, newline='') as f:
118 |         for r in f.readlines():
119 |             if r[0] =='#' and len(company_list) > 0:
120 |                 break
121 |             if r[0] != '#' and len(r) > 1:
122 |                 r = re.sub('\n', '', r)
123 |                 text_items = re.split('[ ,\t]', r)  # various delimiters allowed
124 |                 edgar_search_text = text_items[0].zfill(10)
125 |                 company_description = '_'.join(
126 |                     text_items[1:2])
127 |                 company_list.append([edgar_search_text, company_description])
128 |     return company_list
129 | 
130 | 
131 | 
132 | 


--------------------------------------------------------------------------------
/src/document.py:
--------------------------------------------------------------------------------
 1 | """
 2 |     secedgartext: extract text from SEC corporate filings
 3 |     Copyright (C) 2017  Alexander Ions
 4 | 
 5 |     You should have received a copy of the GNU General Public License
 6 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
 7 | """
 8 | import time
 9 | from datetime import datetime
10 | import copy
11 | import os
12 | from abc import ABCMeta
13 | import multiprocessing as mp
14 | 
15 | from .utils import search_terms as master_search_terms
16 | from .utils import args, logger
17 | 
18 | class Document(object):
19 |     __metaclass__ = ABCMeta
20 | 
21 |     def __init__(self, file_path, doc_text, extraction_method):
22 |         self._file_path = file_path
23 |         self.doc_text = doc_text
24 |         self.extraction_method = extraction_method
25 |         self.log_cache = []
26 | 
27 |     def get_excerpt(self, input_text, form_type, metadata_master,
28 |                     skip_existing_excerpts):
29 |         """
30 | 
31 |         :param input_text:
32 |         :param form_type:
33 |         :param metadata_master:
34 |         :param skip_existing_excerpts:
35 |         :return:
36 |         """
37 |         start_time = time.process_time()
38 |         self.prepare_text()
39 |         prep_time = time.process_time() - start_time
40 |         file_name_root = metadata_master.metadata_file_name
41 |         for section_search_terms in master_search_terms[form_type]:
42 |             start_time = time.process_time()
43 |             metadata = copy.copy(metadata_master)
44 |             warnings = []
45 |             section_name = section_search_terms['itemname']
46 |             section_output_path = file_name_root + '_' + section_name
47 |             txt_output_path = section_output_path + '_excerpt.txt'
48 |             metadata_path = section_output_path + '_metadata.json'
49 |             failure_metadata_output_path = section_output_path + '_failure.json'
50 | 
51 |             search_pairs = section_search_terms[self.search_terms_type()]
52 |             text_extract, extraction_summary, start_text, end_text, warnings = \
53 |                 self.extract_section(search_pairs)
54 |             time_elapsed = time.process_time() - start_time
55 |             # metadata.extraction_method = self.extraction_method
56 |             metadata.section_name = section_name
57 |             if start_text:
58 |                 start_text = start_text.replace('\"', '\'')
59 |             if end_text:
60 |                 end_text = end_text.replace('\"', '\'')
61 |             metadata.endpoints = [start_text, end_text]
62 |             metadata.warnings = warnings
63 |             metadata.time_elapsed = round(prep_time + time_elapsed, 1)
64 |             metadata.section_end_time = str(datetime.utcnow())
65 |             if text_extract:
66 |                 # success: save the excerpt file
67 |                 metadata.section_n_characters = len(text_extract)
68 |                 with open(txt_output_path, 'w', encoding='utf-8',
69 |                           newline='\n') as txt_output:
70 |                     txt_output.write(text_extract)
71 |                 log_str = ': '.join(['SUCCESS Saved file for',
72 |                                          section_name, txt_output_path])
73 |                 self.log_cache.append(('DEBUG', log_str))
74 |                 try:
75 |                     os.remove(failure_metadata_output_path)
76 |                 except:
77 |                     pass
78 |                 metadata.output_file = txt_output_path
79 |                 metadata.metadata_file_name = metadata_path
80 |                 metadata.save_to_json(metadata_path)
81 |             else:
82 |                 log_str = ': '.join(['No excerpt located for ',
83 |                                          section_name, metadata.sec_index_url])
84 |                 self.log_cache.append(('WARNING', log_str))
85 |                 try:
86 |                     os.remove(metadata_path)
87 |                 except:
88 |                     pass
89 |                 metadata.metadata_file_name = failure_metadata_output_path
90 |                 metadata.save_to_json(failure_metadata_output_path)
91 |             if args.write_sql:
92 |                 metadata.save_to_db()
93 |         return(self.log_cache)
94 | 
95 |     def prepare_text(self):
96 |         # handled in child classes
97 |         pass
98 | 
99 | 


--------------------------------------------------------------------------------
/src/download.py:
--------------------------------------------------------------------------------
  1 | """
  2 |     secedgartext: extract text from SEC corporate filings
  3 |     Copyright (C) 2017  Alexander Ions
  4 | 
  5 |     You should have received a copy of the GNU General Public License
  6 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
  7 | """
  8 | # Originally adapted from "SEC-Edgar" package code
  9 | import multiprocessing as mp
 10 | import os
 11 | import re
 12 | import copy
 13 | from bs4 import BeautifulSoup
 14 | 
 15 | from .utils import args, logger, requests_get
 16 | from .metadata import Metadata
 17 | from .utils import search_terms as master_search_terms
 18 | from .html_document import HtmlDocument
 19 | from .text_document import TextDocument
 20 | 
 21 | 
 22 | class EdgarCrawler(object):
 23 |     def __init__(self):
 24 |         self.storage_folder = None
 25 | 
 26 |     def download_filings(self, company_description, edgar_search_string,
 27 |                          filing_search_string, date_search_string,
 28 |                          start_date, end_date,
 29 |                          do_save_full_document, count=100):
 30 |         """Build a list of all filings of a certain type, within a date range.
 31 | 
 32 |         Then download them and extract the text of interest
 33 |         :param: cik
 34 |         :param: count number of Filing Results to return on the (first) EDGAR
 35 |             Search Results query page. 9999=show all
 36 |         :param: type_serach_string
 37 |         :param: start_date, end_date
 38 |         :return: text_extract: str , warnings: [str]
 39 |         """
 40 | 
 41 |         filings_links = self.download_filings_links(edgar_search_string,
 42 |                                                     company_description,
 43 |                                                     filing_search_string,
 44 |                                                     date_search_string,
 45 |                                                     start_date, end_date, count)
 46 | 
 47 |         filings_list = []
 48 | 
 49 |         logger.info("Identified " + str(len(filings_links)) +
 50 |                     " filings, gathering SEC metadata and document links...")
 51 | 
 52 |         is_multiprocessing = args.multiprocessing_cores > 0
 53 |         if is_multiprocessing:
 54 |             pool = mp.Pool(processes = args.multiprocessing_cores)
 55 | 
 56 |         for i, index_url in enumerate(filings_links):
 57 |             # Get the URL for the (text-format) document which packages all
 58 |             # of the parts of the filing
 59 |             base_url = re.sub('-index.htm.?','',index_url) + ".txt"
 60 |             filings_list.append([index_url, base_url, company_description])
 61 |             filing_metadata = Metadata(index_url)
 62 | 
 63 |             if re.search(date_search_string,
 64 |                          str(filing_metadata.sec_period_of_report)):
 65 |                 filing_metadata.sec_index_url = index_url
 66 |                 filing_metadata.sec_url = base_url
 67 |                 filing_metadata.company_description = company_description
 68 |                 if is_multiprocessing:
 69 |                     # multi-core processing. Add jobs to pool.
 70 |                     pool.apply_async(self.download_filing,
 71 |                                      args=(filing_metadata, do_save_full_document),
 72 |                                      callback=self.process_log_cache)
 73 |                 else:
 74 |                     # single core processing
 75 |                     log_cache = self.download_filing(filing_metadata, do_save_full_document)
 76 |                     self.process_log_cache(log_cache)
 77 |         if is_multiprocessing:
 78 |             pool.close()
 79 |             pool.join()
 80 |         logger.debug("Finished attempting to download all the %s forms for %s",
 81 |                      filing_search_string, company_description)
 82 | 
 83 | 
 84 |     def process_log_cache(self, log_cache):
 85 |         """Output log_cache messages via logger
 86 |         """
 87 |         for msg in log_cache:
 88 |             msg_type = msg[0]
 89 |             msg_text = msg[1]
 90 |             if msg_type=='process_name':
 91 |                 id = '(' + msg_text + ') '
 92 |             elif msg_type=='INFO':
 93 |                 logger.info(id + msg_text)
 94 |             elif msg_type=='DEBUG':
 95 |                 logger.debug(id + msg_text)
 96 |             elif msg_type=='WARNING':
 97 |                 logger.warning(id + msg_text)
 98 |             elif msg_type=='ERROR':
 99 |                 logger.error(id + msg_text)
100 | 
101 | 
102 | 
103 |     def download_filings_links(self, edgar_search_string, company_description,
104 |                                filing_search_string, date_search_string,
105 |                                start_date, end_date, count):
106 |         """[docstring here]
107 |         :param edgar_search_string: 10-digit integer CIK code, or ticker
108 |         :param company_description:
109 |         :param filing_search_string: e.g. '10-K'
110 |         :param start_date: ccyymmdd
111 |         :param end_date: ccyymmdd
112 |         :param count:
113 |         :return: linkList, a list of links to main pages for each filing found
114 |         example of a typical base_url: http://www.sec.gov/cgi-bin/browse-secedgartext?action=getcompany&CIK=0000051143&type=10-K&datea=20011231&dateb=20131231&owner=exclude&output=xml&count=9999
115 |         """
116 | 
117 |         sec_website = "https://www.sec.gov/"
118 |         browse_url = sec_website + "cgi-bin/browse-edgar"
119 |         requests_params = {'action': 'getcompany',
120 |                            'CIK': str(edgar_search_string),
121 |                            'type': filing_search_string,
122 |                            'datea': start_date,
123 |                            'dateb': end_date,
124 |                            'owner': 'exclude',
125 |                            'output': 'html',
126 |                            'count': count}
127 |         logger.info('-' * 100)
128 |         logger.info(
129 |             "Query EDGAR database for " + filing_search_string + ", Search: " +
130 |             str(edgar_search_string) + " (" + company_description + ")")
131 | 
132 |         linkList = []  # List of all links from the CIK page
133 |         continuation_tag = 'first pass'
134 | 
135 |         while continuation_tag:
136 |             r = requests_get(browse_url, params=requests_params)
137 |             if continuation_tag == 'first pass':
138 |                 logger.debug("EDGAR search URL: " + r.url)
139 |                 logger.info('-' * 100)
140 |             data = r.text
141 |             soup = BeautifulSoup(data, "html.parser")
142 |             for link in soup.find_all('a', {'id': 'documentsbutton'}):
143 |                 URL = sec_website + link['href']
144 |                 linkList.append(URL)
145 |             continuation_tag = soup.find('input', {'value': 'Next ' + str(count)}) # a button labelled 'Next 100' for example
146 |             if continuation_tag:
147 |                 continuation_string = continuation_tag['onclick']
148 |                 browse_url = sec_website + re.findall('cgi-bin.*count=\d*', continuation_string)[0]
149 |                 requests_params = None
150 |         return linkList
151 | 
152 | 
153 |     def download_filing(self, filing_metadata, do_save_full_document):
154 |         """
155 |         Download filing, extract relevant sections.
156 | 
157 |         Download a filing (full filing submission). Find relevant <DOCUMENT>
158 |         portions of the filing, and send the raw text for text extraction
159 |         :param: doc_info: contains URL for the full filing submission, and
160 |         other EDGAR index metadata
161 |         """
162 |         log_cache = [('process_name', str(os.getpid()))]
163 |         filing_url = filing_metadata.sec_url
164 |         company_description = filing_metadata.company_description
165 |         log_str = "Retrieving: %s, %s, period: %s, index page: %s" \
166 |             % (filing_metadata.sec_company_name,
167 |                     filing_metadata.sec_form_header,
168 |                     filing_metadata.sec_period_of_report,
169 |                     filing_metadata.sec_index_url)
170 |         log_cache.append(('DEBUG', log_str))
171 | 
172 |         r = requests_get(filing_url)
173 |         filing_text = r.text
174 |         filing_metadata.add_data_from_filing_text(filing_text[0:10000])
175 | 
176 |         # Iterate through the DOCUMENT types that we are seeking,
177 |         # checking for each in turn whether they are included in the current
178 |         # filing. Note that searching for document_group '10-K' will also
179 |         # deliberately find DOCUMENT type variants such as 10-K/A, 10-K405 etc.
180 |         # Note we search for all DOCUMENT types that interest us, regardless of
181 |         # whether the current filing came from a '10-K' or '10-Q' web query
182 |         # originally. Also note that we process DOCUMENT types in no
183 |         # fixed order.
184 |         filtered_search_terms = {doc_type: master_search_terms[doc_type]
185 |                                  for doc_type in args.documents}
186 |         for document_group in filtered_search_terms:
187 |             doc_search = re.search("<DOCUMENT>.{,20}<TYPE>" + document_group +
188 |                                    ".*?</DOCUMENT>", filing_text,
189 |                                    flags=re.DOTALL | re.IGNORECASE)
190 |             if doc_search:
191 |                 doc_text = doc_search.group()
192 |                 doc_metadata = copy.copy(filing_metadata)
193 |                 # look for form type near the start of the document.
194 |                 type_search = re.search("<TYPE>.*",
195 |                                         doc_text[0:10000], re.IGNORECASE)
196 |                 if type_search:
197 |                     document_type = re.sub("^<TYPE>", "", type_search.group(), re.IGNORECASE)
198 |                     document_type = re.sub(r"(-|/|\.)", "",
199 |                                          document_type)  # remove hyphens etc
200 |                 else:
201 |                     document_type = "document_TYPE_not_tagged"
202 |                     log_cache.append(('ERROR',
203 |                                       "form <TYPE> not given in form?: " +
204 |                                       filing_url))
205 |                 local_path = os.path.join(self.storage_folder,
206 |                         company_description + '_' + \
207 |                         filing_metadata.sec_cik + "_" + document_type + "_" + \
208 |                         filing_metadata.sec_period_of_report)
209 |                 doc_metadata.document_type = document_type
210 |                 # doc_metadata.form_type_internal = form_string
211 |                 doc_metadata.document_group = document_group
212 |                 doc_metadata.metadata_file_name = local_path
213 | 
214 |                 # search for a <html>...</html> block in the DOCUMENT
215 |                 html_search = re.search(r"<html>.*?</html>",
216 |                                         doc_text, re.DOTALL | re.IGNORECASE)
217 |                 xbrl_search = re.search(r"<xbrl>.*?</xbrl>",
218 |                                         doc_text, re.DOTALL | re.IGNORECASE)
219 |                 # occasionally a (somewhat corrupted) filing includes a mixture
220 |                 # of HTML-format documents, but some of them are enclosed in
221 |                 # <TEXT>...</TEXT> tags and others in <HTML>...</HTML> tags.
222 |                 # If the first <TEXT>-enclosed document is before the first
223 |                 # <HTML> enclosed one, then we take that one instead of
224 |                 # the block identified in html_search.
225 |                 text_search = re.search(r"<text>.*?</text>",
226 |                                         doc_text, re.DOTALL | re.IGNORECASE)
227 |                 if text_search and html_search \
228 |                         and text_search.start() < html_search.start() \
229 |                         and html_search.start() > 5000:
230 |                     html_search = text_search
231 |                 if xbrl_search:
232 |                     doc_metadata.extraction_method = 'xbrl'
233 |                     doc_text = xbrl_search.group()
234 |                     main_path = local_path + ".xbrl"
235 |                     reader_class = HtmlDocument
236 |                 elif html_search:
237 |                     # if there's an html block inside the DOCUMENT then just
238 |                     # take this instead of the full DOCUMENT text
239 |                     doc_metadata.extraction_method = 'html'
240 |                     doc_text = html_search.group()
241 |                     main_path = local_path + ".htm"
242 |                     reader_class = HtmlDocument
243 |                 else:
244 |                     doc_metadata.extraction_method = 'txt'
245 |                     main_path = local_path + ".txt"
246 |                     reader_class = TextDocument
247 |                 doc_metadata.original_file_size = str(len(doc_text)) + ' chars'
248 |                 sections_log_items = reader_class(
249 |                     doc_metadata.original_file_name,
250 |                     doc_text, doc_metadata.extraction_method).\
251 |                     get_excerpt(doc_text, document_group,
252 |                                 doc_metadata,
253 |                                 skip_existing_excerpts=False)
254 |                 log_cache = log_cache + sections_log_items
255 |                 if do_save_full_document:
256 |                     with open(main_path, "w") as filename:
257 |                         filename.write(doc_text)
258 |                     log_str = "Saved file: " + main_path + ', ' + \
259 |                         str(round(os.path.getsize(main_path) / 1024)) + ' KB'
260 |                     log_cache.append(('DEBUG', log_str))
261 |                     filing_metadata.original_file_name = main_path
262 |                 else:
263 |                     filing_metadata.original_file_name = \
264 |                         "file was not saved locally"
265 |         return(log_cache)
266 | 
267 | 
268 | 
269 | 


--------------------------------------------------------------------------------
/src/html_document.py:
--------------------------------------------------------------------------------
  1 | """
  2 |     secedgartext: extract text from SEC corporate filings
  3 |     Copyright (C) 2017  Alexander Ions
  4 | 
  5 |     You should have received a copy of the GNU General Public License
  6 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
  7 | """
  8 | import re
  9 | import time
 10 | from statistics import median
 11 | from bs4 import BeautifulSoup, NavigableString, Tag, Comment
 12 | 
 13 | from .utils import logger
 14 | from .document import Document
 15 | 
 16 | USE_HTML2TEXT = False
 17 | 
 18 | class HtmlDocument(Document):
 19 |     soup = None
 20 |     plaintext = None
 21 | 
 22 |     def __init__(self, *args, **kwargs):
 23 |         super(HtmlDocument, self).__init__(*args, **kwargs)
 24 | 
 25 |     def search_terms_type(self):
 26 |         return "html"
 27 | 
 28 |     def prepare_text(self):
 29 |         """Strip unwanted text and parse the HTML.
 30 | 
 31 |         Remove some unhelpful text from the HTML, and parse the HTML,
 32 |         initialising the 'soup' attribute
 33 |         """
 34 |         html_text = self.doc_text
 35 |         # remove whitespace sometimes found inside tags,
 36 |         # which breaks the parser
 37 |         html_text = re.sub('<\s', '<', html_text)
 38 |         # remove small-caps formatting tags which can confuse later analysis
 39 |         html_text = re.sub('(<small>|</small>)', '', html_text,
 40 |                            flags=re.IGNORECASE)
 41 |         # for simplistic no-tags HTML documents (example: T 10K 20031231),
 42 |         # make sure the section headers get treated as new blocks.
 43 |         html_text = re.sub(r'(\nITEM\s{1,10}[1-9])', r'<br>\1', html_text,
 44 |                            flags=re.IGNORECASE)
 45 | 
 46 |         # we prefer to use lxml parser for speed, this requires seprate
 47 |         # installation. Straightforward on Linux, somewhat tricky on Windows.
 48 |         # http://stackoverflow.com/questions/29440482/how-to-install-lxml-on-windows
 49 |         # ...note install the 32-bit version for Intel 64-bit?
 50 |         start_time = time.process_time()
 51 |         try:
 52 |             soup = BeautifulSoup(html_text, 'lxml')
 53 |         except:
 54 |             soup = BeautifulSoup(html_text, 'html.parser')      # default parser
 55 |         parsing_time_elapsed = time.process_time() - start_time
 56 |         log_str = 'parsing time: ' + '% 3.2f' % \
 57 |                      (parsing_time_elapsed) + 's; ' + "{:,}". \
 58 |                      format(len(html_text)) + ' characters; ' + "{:,}". \
 59 |                      format(len(soup.find_all())) + ' HTML elements'
 60 |         self.log_cache.append(('DEBUG', log_str))
 61 | 
 62 |         # for some old, simplistic documents lacking a proper HTML tree,
 63 |         # put in <br> tags artificially to help with parsing paragraphs, ensures
 64 |         # that section headers get properly identified
 65 |         if len(html_text) / len(soup.find_all()) > 500:
 66 |             html_text = re.sub(r'\n\n', r'<br>', html_text,
 67 |                                flags=re.IGNORECASE)
 68 |             soup = BeautifulSoup(html_text, 'html.parser')
 69 | 
 70 |         # Remove numeric tables from soup
 71 |         tables_generator = (s for s in soup.find_all('table') if
 72 |                             self.should_remove_table(s))
 73 |         # debug: save the extracted tables to a text file
 74 |         # tables_debug_file = open(r'tables_deleted.txt', 'wt', encoding='latin1')
 75 |         for s in tables_generator:
 76 |             s.replace_with('[DATA_TABLE_REMOVED]')
 77 |             # tables_debug_file.write('#' * 80 + '\n')
 78 |             # tables_debug_file.write('\n'.join([x for x in s.text.splitlines()
 79 |             # if x.strip()]).encode('latin-1','replace').decode('latin-1'))
 80 |         # tables_debug_file.close()
 81 |         self.soup = soup
 82 | 
 83 |         if USE_HTML2TEXT:
 84 |             # option: use the HTML2TEXT library for paragraph splitting.
 85 |             # Purpose and performance is generally similar to the
 86 |             # home-made approach below
 87 |             import html2text
 88 |             h = html2text.HTML2Text(bodywidth=0)
 89 |             h.ignore_emphasis = True
 90 |             self.plaintext = h.handle(str(soup)) # use soup instead of the original html: it's faster and it benefits from the tables being excluded
 91 |         else:
 92 |             # paragraphs_analysis = []
 93 |             # p_idx = 0
 94 |             # has_href = False
 95 |             # has_crossreference = False
 96 |             paragraph_string = ''
 97 |             document_string = ''
 98 |             all_paras = []
 99 |             ec = soup.find()
100 |             is_in_a_paragraph = True
101 |             while not (ec is None):
102 |                 if is_line_break(ec) or ec.next_element is None:
103 |                     # end of paragraph tag (does not itself contain
104 |                     # Navigable String): insert double line-break for readability
105 |                     if is_in_a_paragraph:
106 |                         is_in_a_paragraph = False
107 |                         all_paras.append(paragraph_string)
108 |                         document_string = document_string + '\n\n' + paragraph_string
109 |                 else:
110 |                     # continuation of the current paragraph
111 |                     if isinstance(ec, NavigableString) and not \
112 |                             isinstance(ec, Comment):
113 |                         # # remove redundant line breaks and other whitespace at the
114 |                         # # ends, and in the middle, of the string
115 |                         # ecs = re.sub(r'\s+', ' ', ec.string.strip())
116 |                         ecs = re.sub(r'\s+', ' ', ec.string)
117 |                         if len(ecs) > 0:
118 |                             if not (is_in_a_paragraph):
119 |                                 # set up for the start of a new paragraph
120 |                                 is_in_a_paragraph = True
121 |                                 paragraph_string = ''
122 |                             # paragraph_string = paragraph_string + ' ' + ecs
123 |                             paragraph_string = paragraph_string + ecs
124 |                 ec = ec.next_element
125 |             # clean up multiple line-breaks
126 |             document_string = re.sub('\n\s+\n', '\n\n', document_string)
127 |             document_string = re.sub('\n{3,}', '\n\n', document_string)
128 |             self.plaintext = document_string
129 | 
130 | 
131 |     def extract_section(self, search_pairs):
132 |         """
133 | 
134 |         :param search_pairs:
135 |         :return:
136 |         """
137 |         start_text = 'na'
138 |         end_text = 'na'
139 |         warnings = []
140 |         text_extract = None
141 |         for st_idx, st in enumerate(search_pairs):
142 |             # ungreedy search (note '.*?' regex expression between 'start' and 'end' patterns
143 |             # also using (?:abc|def) for a non-capturing group
144 |             # also an extra pair of parentheses around the whole expression,
145 |             # so that we always return just one object, not a tuple of groups
146 |             # st = super().search_terms_pattern_to_regex()
147 |             # st = Reader.search_terms_pattern_to_regex(st)
148 |             item_search = re.findall(st['start']+'.*?'+ st['end'],
149 |                                      self.plaintext,
150 |                                      re.DOTALL | re.IGNORECASE)
151 |             # item_search = re.findall('(' + st['start']+'.*?'+ st['end']+')',
152 |             #                          self.plaintext,
153 |             #                          re.DOTALL | re.IGNORECASE)
154 |             if item_search:
155 |                 longest_text_length = 0
156 |                 for s in item_search:
157 |                     if isinstance(s, tuple):
158 |                         # If incorrect use of multiple regex groups has caused
159 |                         # more than one match, then s is returned as a tuple
160 |                         self.log_cache.append(('ERROR',
161 |                                    "Groups found in Regex, please correct"))
162 |                     if len(s) > longest_text_length:
163 |                         text_extract = s.strip()
164 |                         longest_text_length = len(s)
165 |                 # final_text_new = re.sub('^\n*', '', final_text_new)
166 |                 final_text_lines = text_extract.split('\n')
167 |                 start_text = final_text_lines[0]
168 |                 end_text = final_text_lines[-1]
169 |                 break
170 |         extraction_summary = self.extraction_method + '_document'
171 |         if not text_extract:
172 |             warnings.append('Extraction did not work for HTML file')
173 |             extraction_summary = self.extraction_method + '_document: failed'
174 |         else:
175 |             text_extract = re.sub('\n\s{,5}Table of Contents\n', '',
176 |                                   text_extract, flags=re.IGNORECASE)
177 | 
178 |         return text_extract, extraction_summary, start_text, end_text, warnings
179 | 
180 | 
181 |     def should_remove_table(self, html):
182 |         """Decide whether <table> html contains a mostly-numeric table.
183 | 
184 |         Identify text in table element 'html' which cannot (realistically) be
185 |         subject to downstream text analysis. Note there is a risk that we
186 |         inadvertently remove any Section headings that are inside <table> elements
187 |         We reduce this risk by only seeking takes with more than 5 (nonblank)
188 |         elements, the median length of which is fewer than 30 characters
189 |         """
190 |         char_counts = []
191 |         if html.stripped_strings:
192 |             for t in html.stripped_strings:
193 |                 if len(t) > 0:
194 |                     char_counts.append(len(t))
195 |             return len(char_counts) > 5 and median(char_counts) < 30
196 |         else:
197 |             self.log_cache.append(('ERROR',
198 |                                    "the should_remove_table function is broken"))
199 | 
200 | 
201 | 
202 | def is_line_break(e):
203 |     """Is e likely to function as a line break when document is rendered?
204 | 
205 |     we are including 'HTML block-level elements' here. Note <p> ('paragraph')
206 |     and other tags may not necessarily force the appearance of a 'line break',
207 |     on the page if they are enclosed inside other elements, notably a
208 |     table cell
209 |     """
210 | 
211 | 
212 |     is_block_tag = e.name != None and e.name in ['p', 'div', 'br', 'hr', 'tr',
213 |                                                  'table', 'form', 'h1', 'h2',
214 |                                                  'h3', 'h4', 'h5', 'h6']
215 |     # handle block tags inside tables: if the apparent block formatting is
216 |     # enclosed in a table cell <td> tags, and if there are no other block
217 |     # elements within the <td> cell (it's a singleton, then it will not
218 |     # necessarily appear on a new line so we don't treat it as a line break
219 |     if is_block_tag and e.parent.name == 'td':
220 |         if len(e.parent.findChildren(name=e.name)) == 1:
221 |             is_block_tag = False
222 |     # inspect the style attribute of element e (if any) to see if it has
223 |     # block style, which will appear as a line break in the document
224 |     if hasattr(e, 'attrs') and 'style' in e.attrs:
225 |         is_block_style = re.search('margin-(top|bottom)', e['style'])
226 |     else:
227 |         is_block_style = False
228 |     return is_block_tag or is_block_style
229 | 
230 | 


--------------------------------------------------------------------------------
/src/metadata.py:
--------------------------------------------------------------------------------
  1 | """
  2 |     secedgartext: extract text from SEC corporate filings
  3 |     Copyright (C) 2017  Alexander Ions
  4 | 
  5 |     You should have received a copy of the GNU General Public License
  6 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
  7 | """
  8 | import json
  9 | import re
 10 | from bs4 import BeautifulSoup, Tag, NavigableString
 11 | import time
 12 | import random
 13 | 
 14 | from .utils import logger
 15 | from .utils import args, requests_get
 16 | from .utils import batch_number, batch_start_time, batch_machine_id
 17 | from .utils import sql_cursor, sql_connection
 18 | 
 19 | 
 20 | class Metadata(object):
 21 |     def __init__(self, index_url=None):
 22 |         self.sec_cik = ''
 23 |         self.sec_company_name = ''
 24 |         self.document_type = ''
 25 |         self.sec_form_header = ''
 26 |         self.sec_period_of_report = ''
 27 |         self.sec_filing_date = ''
 28 |         self.sec_changed_date = ''
 29 |         self.sec_accepted_date = ''
 30 |         self.sec_index_url = ''
 31 |         self.sec_url = ''
 32 |         self.metadata_file_name = ''
 33 |         self.original_file_name = ''
 34 |         self.original_file_size = ''
 35 |         self.document_group = ''
 36 |         self.section_name = ''
 37 |         self.section_n_characters = None
 38 |         self.endpoints = []
 39 |         self.extraction_method = ''
 40 |         self.warnings = []
 41 |         self.company_description = ''
 42 |         self.output_file = None
 43 |         self.time_elapsed = None
 44 |         self.batch_number = batch_number
 45 |         self.batch_signature = args.batch_signature or ''
 46 |         self.batch_start_time = str(batch_start_time)
 47 |         self.batch_machine_id = batch_machine_id
 48 |         self.section_end_time = None
 49 | 
 50 |         if index_url:
 51 |             index_metadata = {}
 52 |             attempts = 0
 53 |             while attempts < 5:
 54 |                 try:
 55 |                     ri = requests_get(index_url)
 56 |                     logger.info('Status Code: ' + str(ri.status_code))
 57 |                     soup = BeautifulSoup(ri.text, 'html.parser')
 58 |                     # Parse the page to find metadata
 59 |                     form_type = soup.find('div', {'id': 'formHeader'}). \
 60 |                         find_next('strong').string.strip()
 61 |                     break
 62 |                 except:
 63 |                     attempts += 1
 64 |                     logger.warning('No valid index page, attempt %i: %s'
 65 |                                    % (attempts, index_url))
 66 |                     time.sleep(attempts*10 + random.randint(1,5))
 67 | 
 68 |             index_metadata['formHeader'] = form_type
 69 |             infoheads = soup.find_all('div', class_='infoHead')
 70 |             for i in infoheads:
 71 |                 j = i.next_element
 72 |                 while not (isinstance(j, Tag)) or not ('info') in \
 73 |                         j.attrs['class']:
 74 |                     j = j.next_element
 75 |                 # remove colons, spaces, hyphens from dates/times
 76 |                 if type(j.string) is NavigableString:
 77 |                     index_metadata[i.string] = re.sub('[: -]', '',
 78 |                                                       j.string).strip()
 79 |             i = soup.find('span', class_='companyName')
 80 |             while not (isinstance(i, NavigableString)):
 81 |                 i = i.next_element
 82 |             index_metadata['companyName'] = i.strip()
 83 |             i = soup.find(string='CIK')
 84 |             while not (isinstance(i, NavigableString)) or not (re.search('\d{10}', i.string)):
 85 |                 i = i.next_element
 86 |             index_metadata['CIK'] = re.search('\d{5,}', i).group()
 87 | 
 88 |             for pair in [['Period of Report', 'sec_period_of_report'],
 89 |                          ['Filing Date', 'sec_filing_date'],
 90 |                          ['Filing Date Changed', 'sec_changed_date'],
 91 |                          ['Accepted', 'sec_accepted_date'],
 92 |                          ['formHeader', 'sec_form_header'],
 93 |                          ['companyName', 'sec_company_name'],
 94 |                          ['CIK', 'sec_cik']]:
 95 |                 if pair[0] in index_metadata:
 96 |                     setattr(self, pair[1], index_metadata[pair[0]])
 97 | 
 98 |     def add_data_from_filing_text(self, text):
 99 |         """Scrape metadata from the filing document
100 | 
101 |         Find key metadata fields at the start of the filing submission,
102 |         if they were not already found in the SEC index page
103 |         :param text: full text of the filing
104 |         """
105 |         for pair in [['CONFORMED PERIOD OF REPORT:', 'sec_period_of_report'],
106 |                      ['FILED AS OF DATE:', 'sec_filing_date'],
107 |                      ['DATE AS OF CHANGE:', 'sec_changed_date'],
108 |                      ['<ACCEPTANCE-DATETIME>', 'sec_accepted_date'],
109 |                      ['COMPANY CONFORMED NAME:', 'sec_company_name'],
110 |                      ['CENTRAL INDEX KEY::', 'sec_cik']]:
111 |             srch = re.search('(?<=' + pair[0] + ').*', text)
112 |             if srch and not getattr(self, pair[1]):
113 |                 setattr(self, pair[1], srch.group().strip())
114 | 
115 |     def save_to_json(self, file_path):
116 |         """
117 |         we effectively convert the Metadata object's data into a dict
118 |         when we do json.dumps on it
119 |         :param file_path:
120 |         :return:
121 |         """
122 |         with open(file_path, 'w', encoding='utf-8') as json_output:
123 |             # to write the backslashes in the JSON file legibly
124 |             # (without duplicate backslashes), we have to
125 |             # encode/decode using the 'unicode_escape' codec. This then
126 |             # allows us to open the JSON file and click on the file link,
127 |             # for immediate viewing in a browser.
128 |             excerpt_as_json = json.dumps(self, default=lambda o: o.__dict__,
129 |                                          sort_keys=False, indent=4)
130 |             json_output.write(bytes(excerpt_as_json, "utf-8").
131 |                               decode("unicode_escape"))
132 | 
133 | 
134 |     def save_to_db(self):
135 |         """Append metadata to sqlite database
136 | 
137 |         """
138 | 
139 |         # conn = sqlite3.connect(path.join(args.storage, 'metadata.sqlite3'))
140 |         # c = conn.cursor()
141 |         sql_insert = """INSERT INTO metadata (
142 |             batch_number,
143 |             batch_signature,
144 |             batch_start_time,
145 |             batch_machine_id,
146 |             sec_cik,
147 |             company_description,
148 |             sec_company_name,
149 |             sec_form_header,
150 |             sec_period_of_report,
151 |             sec_filing_date,
152 |             sec_index_url,
153 |             sec_url,
154 |             metadata_file_name,
155 |             document_group,
156 |             section_name,
157 |             section_n_characters,
158 |             section_end_time,
159 |             extraction_method,
160 |             output_file,
161 |             start_line,
162 |             end_line,
163 |             time_elapsed) VALUES
164 |             """ + "('" + "', '".join([str(self.batch_number),
165 |                        str(self.batch_signature),
166 |                        str(self.batch_start_time)[:-3],  # take only 3dp microseconds
167 |                        self.batch_machine_id,
168 |                        self.sec_cik,
169 |                        re.sub("[\'\"]","", self.company_description).strip(),
170 |                        re.sub("[\'\"]","", self.sec_company_name).strip(),
171 |                        self.sec_form_header, self.sec_period_of_report,
172 |                        self.sec_filing_date,
173 |                        self.sec_index_url, self.sec_url,
174 |                        self.metadata_file_name, self.document_group,
175 |                        self.section_name, str(self.section_n_characters),
176 |                        str(self.section_end_time)[:-3],
177 |                        self.extraction_method,
178 |                        str(self.output_file),
179 |                        re.sub("[\'\"]","", self.endpoints[0]).strip()[0:200],
180 |                        re.sub("[\'\"]","", self.endpoints[1]).strip()[0:200],
181 |                        str(self.time_elapsed)]) + "')"
182 |         sql_insert = sql_insert.replace("'None'","NULL")
183 |         sql_cursor.execute(sql_insert)
184 |         sql_connection.commit()
185 | 
186 | 
187 | def load_from_json(file_path):
188 |     metadata = Metadata()
189 |     with open(file_path, 'r') as json_file:
190 |         try:
191 |             # data = json.loads(data_file.read().replace('\\', '\\\\'), strict=False)
192 |             data = json.loads(json_file.read())
193 |             metadata.sec_cik = data['sec_cik']
194 |             metadata.sec_company_name = data['sec_company_name']
195 |             metadata.company_description = data['company_description']
196 |             metadata.document_type = data['document_type']
197 |             metadata.sec_form_header = data['sec_form_header']
198 |             metadata.sec_period_of_report = data['sec_period_of_report']
199 |             metadata.sec_filing_date = data['sec_filing_date']
200 |             metadata.sec_changed_date = data['sec_changed_date']
201 |             metadata.sec_accepted_date = data['sec_accepted_date']
202 |             metadata.sec_accepted_date = data['sec_accepted_date']
203 |             metadata.sec_url = data['sec_url']
204 |             metadata.metadata_file_name = data['metadata_file_name']
205 |             metadata.original_file_name = data['original_file_name']
206 |             metadata.original_file_size = data['original_file_size']
207 |             metadata.document_group = data['form_group']
208 |             metadata.section_name = data['section_name']
209 |             metadata.section_n_characters = data['section_n_characters']
210 |             metadata.endpoints = data['endpoints']
211 |             metadata.extraction_method = data['extraction_method']
212 |             metadata.warnings = data['warnings']
213 |             metadata.output_file = data['output_file']
214 |             metadata.time_elapsed = data['time_elapsed']
215 |             metadata.batch_number = data['batch_number']
216 |             metadata.batch_signature = data['batch_signature']
217 |             metadata.batch_start_time = data['batch_start_time']
218 |             metadata.batch_machine_id = data['batch_machine_id']
219 |             metadata.section_end_time = data['section_end_time']
220 | 
221 |         except:
222 |             logger.info('Could not load corrupted JSON file: ' + file_path)
223 | 
224 |     return metadata
225 | 
226 | 


--------------------------------------------------------------------------------
/src/text_document.py:
--------------------------------------------------------------------------------
  1 | """
  2 |     secedgartext: extract text from SEC corporate filings
  3 |     Copyright (C) 2017  Alexander Ions
  4 | 
  5 |     You should have received a copy of the GNU General Public License
  6 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
  7 | """
  8 | import re
  9 | 
 10 | from .document import Document
 11 | 
 12 | 
 13 | class TextDocument(Document):
 14 |     def __init__(self, *args, **kwargs):
 15 |         super(TextDocument, self).__init__(*args, **kwargs)
 16 | 
 17 |     def search_terms_type(self):
 18 |         return "txt"
 19 | 
 20 |     def extract_section(self, search_pairs):
 21 |         """
 22 | 
 23 |         :param search_pairs:
 24 |         :return:
 25 |         """
 26 |         start_text = 'na'
 27 |         end_text = 'na'
 28 |         warnings = []
 29 |         text_extract = None
 30 |         for st_idx, st in enumerate(search_pairs):
 31 |             # ungreedy search (note '.*?' regex expression between 'start' and 'end' patterns
 32 |             # also using (?:abc|def) for a non-capturing group
 33 |             # st = super().search_terms_pattern_to_regex()
 34 |             # st = Reader.search_terms_pattern_to_regex(st)
 35 |             item_search = re.findall(st['start'] + '.*?' + st['end'],
 36 |                                      self.doc_text,
 37 |                                      re.DOTALL | re.IGNORECASE)
 38 |             if item_search:
 39 |                 longest_text_length = 0
 40 |                 for s in item_search:
 41 |                     text_extract = s.strip()
 42 |                     if len(s) > longest_text_length:
 43 |                         longest_text_length = len(text_extract)
 44 |                 # final_text_new = re.sub('^\n*', '', final_text_new)
 45 |                 final_text_lines = text_extract.split('\n')
 46 |                 start_text = final_text_lines[0]
 47 |                 end_text = final_text_lines[-1]
 48 |                 break
 49 |         if text_extract:
 50 |             # final_text = '\n'.join(final_text_lines)
 51 |             # text_extract = remove_table_lines(final_text)
 52 |             text_extract = remove_table_lines(text_extract)
 53 |             extraction_summary = self.extraction_method + '_document'
 54 |         else:
 55 |             warnings.append('Extraction did not work for text file')
 56 |             extraction_summary = self.extraction_method + '_document: failed'
 57 |         return text_extract, extraction_summary, start_text, end_text, warnings
 58 | 
 59 | def remove_table_lines(input_text):
 60 |     """Replace lines believed to be part of numeric tables with a placeholder.
 61 | 
 62 |     :param input_text:
 63 |     :return:
 64 |     """
 65 |     text_lines = []
 66 |     table_lines = []
 67 |     post_table_lines = []
 68 |     is_in_a_table = False
 69 |     is_in_a_post_table = False
 70 |     all_lines = input_text.splitlines(True)
 71 |     for i, line in enumerate(all_lines, 0):
 72 |         if is_table_line(line):
 73 |             # a table line, possibly not part of an excerpt
 74 |             if is_in_a_post_table:
 75 |                 # table resumes: put the inter-table lines into the table_line list
 76 |                 table_lines = table_lines + post_table_lines
 77 |                 post_table_lines = []
 78 |                 is_in_a_post_table = False
 79 |             table_lines.append(line)
 80 |             is_in_a_table = True
 81 |         else:
 82 |             # not a table line
 83 |             if is_in_a_table:
 84 |                 # the first post-table line
 85 |                 is_in_a_table = False
 86 |                 is_in_a_post_table = True
 87 |                 post_table_lines.append(line)
 88 |             elif is_in_a_post_table:
 89 |                 # 2nd and subsequent post-table lines, or final line
 90 |                 if len(post_table_lines) >= 4:
 91 |                     # sufficient post-table lines have accumulated now that we
 92 |                     # revert to standard 'not a post table' mode.
 93 |                     # We append the post-table lines to the text_lines,
 94 |                     # and we discard the table_lines
 95 |                     if len(table_lines) >= 3:
 96 |                         text_lines.append(
 97 |                             '[DATA_TABLE_REMOVED_' +
 98 |                             str(len(table_lines)) + '_LINES]\n\n')
 99 |                     else:
100 |                         # very short table, so we just leave it in
101 |                         # the document regardless
102 |                         text_lines = text_lines + table_lines
103 |                     text_lines = text_lines + post_table_lines
104 |                     table_lines = []
105 |                     post_table_lines = []
106 |                     is_in_a_post_table = False
107 |                 else:
108 |                     post_table_lines.append(line)
109 |         if not (is_in_a_table) and not (is_in_a_post_table):
110 |             # normal excerpt line: just append it to text_lines
111 |             text_lines.append(line)
112 |     # Tidy up any outstanding table_lines and post_table_lines at the end
113 |     if len(table_lines) >= 3:
114 |         text_lines.append(
115 |             '[DATA_TABLE_REMOVED_' + str(len(table_lines)) + '_LINES]\n\n')
116 |     else:
117 |         text_lines = text_lines + table_lines
118 |     text_lines = text_lines + post_table_lines
119 | 
120 |     final_text = ''.join(text_lines)
121 |     return final_text
122 | 
123 | 
124 | def is_table_line(s):
125 |     """Is text line string s likely to be part of a numeric table?
126 | 
127 |     gaps between table 'cells' are expected to have three or more whitespaces,
128 |     and table rows are expected to have at least 3 such gaps, i.e. 4 columns
129 | 
130 |     :param s:
131 |     :return:
132 |     """
133 |     s = s.replace('\t', '    ')
134 |     rs = re.findall('\S\s{3,}', s)  # \S = non-whitespace, \s = whitespace
135 |     r = re.search('(<TABLE>|(-|=|_){5,})', s)  # check for TABLE quasi-HTML tag,
136 |     # or use of lots of punctuation marks as table gridlines
137 |     # Previously also looking for ^\s{10,}[a-zA-z] "lots of spaces prior to
138 |     # the first (non-numeric i.e. not just a page number marker) character".
139 |     # Not using this approach because risk of confusion with centre-justified
140 |     # section headings in certain text documents
141 |     return len(rs) >= 2 or r != None


--------------------------------------------------------------------------------
/src/utils.py:
--------------------------------------------------------------------------------
  1 | """
  2 |     secedgartext: extract text from SEC corporate filings
  3 |     Copyright (C) 2017  Alexander Ions
  4 | 
  5 |     You should have received a copy of the GNU General Public License
  6 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
  7 | """
  8 | 
  9 | import logging
 10 | import os
 11 | import sys
 12 | import shutil
 13 | import argparse
 14 | import re
 15 | from os import path
 16 | import socket
 17 | import time
 18 | import datetime
 19 | import json
 20 | import sqlite3
 21 | import multiprocessing as mp
 22 | from copy import copy
 23 | 
 24 | 
 25 | """Parse the command line arguments
 26 | """
 27 | companies_file_location = ''
 28 | single_company = ''
 29 | project_dir = path.dirname(path.dirname(__file__))
 30 | parser = argparse.ArgumentParser()
 31 | parser.add_argument('--storage', help='Specify path to storage location')
 32 | parser.add_argument('--write_sql', default=True, help='Save metadata to sqlite database? (Boolean)')
 33 | parser.add_argument('--company', help='CIK code specifying company for single-company download')
 34 | parser.add_argument('--companies_list', help='path of text file with all company CIK codes to download')
 35 | parser.add_argument('--filings', help='comma-separated list of SEC filings of interest (10-Q,10-K...)')
 36 | parser.add_argument('--documents')
 37 | parser.add_argument('--start', help='document start date passed to EDGAR web interface')
 38 | parser.add_argument('--end', help='document end date passed to EDGAR web interface')
 39 | parser.add_argument('--report_period', help='search pattern for company report dates, e.g. 2012, 201206 etc.')
 40 | parser.add_argument('--batch_signature')
 41 | parser.add_argument('--start_company', help='index number of first company to download from the companies_list file')
 42 | parser.add_argument('--end_company', help='index number of last company to download from the companies_list file')
 43 | parser.add_argument('--traffic_limit_pause_ms', help='time to pause between download attempts, to avoid overloading EDGAR server')
 44 | parser.add_argument('--multiprocessing_cores', help='number of processor cores to use')
 45 | args = parser.parse_args()
 46 | 
 47 | if args.storage:
 48 |     if not path.isabs(args.storage):
 49 |         args.storage = path.join(project_dir, args.storage)
 50 | else:
 51 |     args.storage = path.join(project_dir, 'output_files_examples')
 52 | 
 53 | args.write_sql = args.write_sql or True
 54 | if args.company:
 55 |     single_company = args.company
 56 | else:
 57 |     if args.companies_list:
 58 |         companies_file_location = os.path.join(project_dir, args.companies_list)
 59 |     else:
 60 |         companies_file_location = os.path.join(project_dir, 'companies_list.txt')
 61 | 
 62 | args.filings = args.filings or \
 63 |     input('Enter filings search text (default: 10-K,10-Q): ') or \
 64 |     '10-K,10-Q'
 65 | args.filings = re.split(',', args.filings)          # ['10-K','10-Q']
 66 | 
 67 | if '10-K' in args.filings:
 68 |     search_window_days = 365
 69 | else:
 70 |     search_window_days = 91
 71 | ccyymmdd_default_start = (datetime.datetime.now() - datetime.timedelta(days=
 72 |                       search_window_days)).strftime('%Y%m%d')
 73 | args.start = int(args.start or \
 74 |     input('Enter start date for filings search (default: ' +
 75 |           ccyymmdd_default_start + '): ') or \
 76 |              ccyymmdd_default_start)
 77 | ccyymmdd_default_end = (datetime.datetime.strptime(str(args.start), '%Y%m%d') +
 78 |                         datetime.timedelta(days=search_window_days)).strftime('%Y%m%d')
 79 | args.end = int(args.end or \
 80 |     input('Enter end date for filings search (default: ' +
 81 |           ccyymmdd_default_end + '): ') or \
 82 |             ccyymmdd_default_end)
 83 | if str(args.report_period).lower() == 'all':
 84 |     date_search_string = '.*'
 85 | else:
 86 |     date_search_string = str(
 87 |         args.report_period or
 88 |         input('Enter filing report period ccyy, ccyymm etc. (default: all periods): ') or
 89 |         '.*')
 90 | 
 91 | 
 92 | """Set up the metadata database
 93 | """
 94 | batch_start_time = datetime.datetime.utcnow()
 95 | batch_machine_id = socket.gethostname()
 96 | 
 97 | if args.write_sql:
 98 |     db_location = path.join(args.storage, 'metadata.sqlite3')
 99 |     sql_connection = sqlite3.connect(db_location)
100 |     sql_cursor = sql_connection.cursor()
101 |     sql_cursor.execute("""
102 |             CREATE TABLE IF NOT EXISTS metadata (
103 |             id integer PRIMARY KEY,
104 |             batch_number integer NOT NULL,
105 |             batch_signature text NOT NULL,
106 |             batch_start_time datetime NOT NULL,
107 |             batch_machine_id text,
108 |             sec_cik text NOT NULL,
109 |             company_description text,
110 |             sec_company_name text,
111 |             sec_form_header text,
112 |             sec_period_of_report integer,
113 |             sec_filing_date integer,
114 |             sec_index_url text,
115 |             sec_url text,
116 |             metadata_file_name text,
117 |             document_group text,
118 |             section_name text,
119 |             section_n_characters integer,
120 |             section_end_time datetime,
121 |             extraction_method text,
122 |             output_file text,
123 |             start_line text,
124 |             end_line text,
125 |             time_elapsed real)
126 |             """)
127 |     sql_connection.commit()
128 |     query_result = sql_cursor.execute('SELECT max(batch_number) FROM metadata').fetchone()
129 |     if query_result and query_result[0]:
130 |         batch_number = query_result[0] + 1
131 |     else:
132 |         batch_number = 1
133 |     # put a dummy line into the metadata table to 'reserve' a batch number:
134 |     # prevents other processes running in parallel from taking the same batch_number
135 |     sql_cursor.execute("""
136 |         insert into metadata (batch_number, batch_signature,
137 |         batch_start_time, sec_cik) values
138 |         """ + " ('" + "', '".join([str(batch_number),
139 |                            str(args.batch_signature or ''),
140 |                            str(batch_start_time)[:-3],  # take only 3dp microseconds
141 |                            'dummy_cik_code']) + "')")
142 |     sql_connection.commit()
143 | else:
144 |     batch_number = 0
145 | 
146 | 
147 | """Set up numbered storage sub-directory for the current batch run
148 | """
149 | storage_toplevel_directory = os.path.join(args.storage,
150 |                                           'batch_' +
151 |                                           format(batch_number, '04d'))
152 | 
153 | # (re-)make the storage directory for the current batch. This will delete
154 | # any contents that might be left over from earlier runs, thus avoiding
155 | # any potential duplication/overlap/confusion
156 | if os.path.exists(storage_toplevel_directory):
157 |     shutil.rmtree(storage_toplevel_directory)
158 | os.makedirs(storage_toplevel_directory)
159 | 
160 | 
161 | 
162 | 
163 | """Set up logging
164 | """
165 | # log_file_name = 'sec_extractor_{0}.log'.format(ts)
166 | log_file_name = 'secedgartext_batch_%s.log' % format(batch_number, '04d')
167 | log_path = path.join(args.storage, log_file_name)
168 | 
169 | logger = logging.getLogger('text_analysis')
170 | # # set up the logger if it hasn't already been set up earlier in the execution run
171 | logger.setLevel(logging.DEBUG)  # we have to initialise this top-level setting otherwise everything defaults to logging.WARN level
172 | formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s',
173 |                               '%Y%m%d %H:%M:%S')
174 | 
175 | file_handler = logging.FileHandler(log_path)
176 | file_handler.setFormatter(formatter)
177 | file_handler.setLevel(logging.DEBUG)
178 | file_handler.set_name('my_file_handler')
179 | logger.addHandler(file_handler)
180 | 
181 | console_handler = logging.StreamHandler()
182 | console_handler.setFormatter(formatter)
183 | console_handler.setLevel(logging.DEBUG)
184 | console_handler.set_name('my_console_handler')
185 | logger.addHandler(console_handler)
186 | 
187 | 
188 | ts = time.time()
189 | logger.info('=' * 65)
190 | logger.info('Analysis started at {0}'.
191 |             format(datetime.datetime.fromtimestamp(ts).
192 |                    strftime('%Y%m%d %H:%M:%S')))
193 | logger.info('Command line:\t{0}'.format(sys.argv[0]))
194 | logger.info('Arguments:\t\t{0}'.format(' '.join(sys.argv[:])))
195 | logger.info('=' * 65)
196 | 
197 | if args.write_sql:
198 |     logger.info('Opened SQL connection: %s', db_location)
199 | 
200 | 
201 | if not args.traffic_limit_pause_ms:
202 |     # default pause after HTTP request: zero milliseconds
203 |     args.traffic_limit_pause_ms = 0
204 | else:
205 |     args.traffic_limit_pause_ms = int(args.traffic_limit_pause_ms)
206 | logger.info('Traffic Limit Pause (ms): %s' %
207 |             str(args.traffic_limit_pause_ms))
208 | 
209 | 
210 | if args.multiprocessing_cores:
211 |     args.multiprocessing_cores = min(mp.cpu_count()-1,
212 |                                      int(args.multiprocessing_cores))
213 | else:
214 |     args.multiprocessing_cores = 0
215 | 
216 | 
217 | """Create search_terms_regex, which stores the patterns that we
218 | use for identifying sections in each of EDGAR documents types
219 | """
220 | with open (path.join(project_dir, 'document_group_section_search.json'), 'r') as \
221 |         f:
222 |     json_text = f.read()
223 |     search_terms = json.loads(json_text)
224 |     if not search_terms:
225 |         logger.error('Search terms file is missing or corrupted: ' +
226 |               f.name)
227 | search_terms_regex = copy(search_terms)
228 | for filing in search_terms:
229 |     for idx, section in enumerate(search_terms[filing]):
230 |         for format in ['txt','html']:
231 |             for idx2, pattern in enumerate(search_terms[filing][idx][format]):
232 |                 for startend in ['start','end']:
233 |                     regex_string = search_terms[filing][idx][format] \
234 |                         [idx2][startend]
235 |                     regex_string = regex_string.replace('_','\\s{,5}')
236 |                     regex_string = regex_string.replace('\n', '\\n')
237 |                     search_terms_regex[filing][idx][format] \
238 |                         [idx2][startend] = regex_string
239 | """identify which 'document' types are to be downloaded. If no command line
240 |  argument given, then default to all of the document types listed in the
241 |  JSON file"""
242 | args.documents = args.documents or ','.join(list(search_terms.keys()))
243 | args.documents = re.split(',', args.documents)          # ['10-K','10-Q']
244 | 
245 | 
246 | def requests_get(url, params=None):
247 |     """retrieve text via url, fatal error if no internet connection available
248 |     :param url: source url
249 |     :return: text retriieved
250 |     """
251 |     import requests, random
252 |     retries = 0
253 |     success = False
254 |     hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}
255 |     while (not success) and (retries <= 20):
256 |         # wait for an increasingly long time (up to a day) in case internet
257 |         # connection is broken. Gives enough time to fix connection or SEC site
258 |         try:
259 |             # to test the timeout functionality, try loading this page:
260 |             # http://httpstat.us/200?sleep=20000  (20 seconds delay before page loads)
261 |             r = requests.get(url, headers=hdr, params=params, timeout=10)
262 |             success = True
263 |             # facility to add a pause to respect SEC EDGAR traffic limit
264 |             # https://www.sec.gov/privacy.htm#security
265 |             time.sleep(args.traffic_limit_pause_ms/1000)
266 |         except requests.exceptions.RequestException as e:
267 |             wait = (retries ^3) * 20 + random.randint(1,5)
268 |             logger.warning(e)
269 |             logger.info('URL: %s' % url)
270 |             logger.info(
271 |                 'Waiting %s secs and re-trying...' % wait)
272 |             time.sleep(wait)
273 |             retries += 1
274 |     if retries > 10:
275 |         logger.error('Download repeatedly failed: %s',
276 |                      url)
277 |         sys.exit('Download repeatedly failed: %s' %
278 |                  url)
279 |     return r
280 | 
281 | 
282 | 
283 | 
284 | 


--------------------------------------------------------------------------------