├── README.md
├── ascii_text.py
├── main.py
├── requirements.txt
└── scraper.py
/README.md:
--------------------------------------------------------------------------------
1 | # EDGAR 13F Scraper
2 |
3 | ## Installation
4 | This program uses Python 3, Selenium, PhantomJS,
5 | BeautifulSoup, and lxml.
6 |
7 | Before using the scraper:
8 | 1. Check that you have Python 3:
9 | python3 --version
10 | 2. Navigate to project folder
11 | 3. Create a virtual environment in Python:
12 | python3 -m venv myvenv
13 | 4. Activate virtual environment:
14 | source myvenv/bin/activate
15 | 5. Install requirements.txt:
16 | pip install -r requirements.txt
17 | 6. Install PhantomJS (if not already installed)
18 |
19 | ## Files and Functionality
20 | The EDGAR 13F Scraper parses fund holdings pulled from EDGAR, given a ticker or CIK.
21 |
22 | * main.py allows you to start the scraper.
23 | * scraper.py contains the main functionality to find and scrape holdings data from 13F filings on https://www.sec.gov.
24 |
25 | To start the scraper, run this command in an activated virtual environment:
26 | python main.py
27 |
28 | The program will prompt you for a ticker/CIK to find filings for.
29 | Example format:
30 | 0001166559
31 |
32 | Note that the scraper will find all filings for a ticker and generate a text file for each filing in the same directory as the script. To avoid generating a large number of files, you can limit the run to the most recent filing if you:
33 | * comment out lines 78-86 and uncomment lines 88-89 in scraper.py
34 |
35 | The scraper was tested on the following tickers/CIKs:
36 | * Gates Foundation '0001166559'
37 | * Hershey Trust '0000908551'
38 | * Menlo Advisors '0001279708'
39 | * BlackRock '0001086364'
40 | * Fidelity Advisors 'FELIX'
41 |
42 | ## Implementation for different 13F file formats
43 | In 2013, the SEC discontinued the ASCII text-based filing format for 13F filings and required the XML format instead.
44 |
45 | The scraper handles both XML and ASCII text files in the following manner (short sketches of both paths appear at the end of this README):
46 | * It first attempts to find the .xml file on the filing's documents page.
47 | If the .xml file exists, holdings are retrieved by parsing the XML from the
48 | .xml link, and the data is written to a tab-delimited file.
49 |
50 | * If no .xml file is found, holdings are retrieved by parsing the .txt link instead.
51 | The scraper looks for the table of holdings and saves it to a
52 | temporary file. The temporary file is then read and parsed again,
53 | splitting each line on two or more whitespace characters to write a tab-delimited text file.
54 |
55 |
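56 | ## Parsing examples
57 | The snippets below are minimal, standalone sketches of the two parsing paths described above; they are not part of the scraper. The XML sketch assumes you already have a direct link to a filing's information-table .xml file (the URL below is only a placeholder), keeps just three of the columns the scraper extracts, and writes to an arbitrarily named output file. It uses the same BeautifulSoup and lxml packages listed in requirements.txt.
58 |
59 |     import csv
60 |     import urllib.request
61 |
62 |     from bs4 import BeautifulSoup
63 |
64 |     # Placeholder URL; substitute the .xml link found on a filing's documents page.
65 |     XML_URL = 'https://www.sec.gov/Archives/edgar/data/.../infotable.xml'
66 |
67 |     with urllib.request.urlopen(XML_URL) as response:
68 |         # Parse as XML (requires lxml), then collect each <infoTable> entry.
69 |         soup = BeautifulSoup(response.read(), 'xml')
70 |
71 |     rows = []
72 |     for entry in soup.find_all('infoTable'):
73 |         # Each <infoTable> element describes one holding in the 13F information table.
74 |         row = []
75 |         for tag in ('nameOfIssuer', 'cusip', 'value'):
76 |             found = entry.find(tag)
77 |             row.append(found.text if found else 'n/a')
78 |         rows.append(row)
79 |
80 |     with open('holdings_example.txt', 'w', newline='') as f:
81 |         writer = csv.writer(f, dialect='excel-tab')
82 |         writer.writerow(['nameOfIssuer', 'cusip', 'value'])
83 |         writer.writerows(rows)
84 |
85 | The ASCII path comes down to the whitespace split used in scraper.py and ascii_text.py: each line of the saved holdings table is split wherever two or more consecutive whitespace characters occur, so single spaces inside a field survive. For example (the holdings line below is made up for illustration):
86 |
87 |     import re
88 |
89 |     line = 'APPLE INC                 COM       037833100    1,598    18,000 SH'
90 |     print(re.split(r'\s{2,}', line))
91 |     # ['APPLE INC', 'COM', '037833100', '1,598', '18,000 SH']
92 |
93 | The scraper itself performs these same steps through PhantomJS, writing one file per filing with the ticker, filing date, and period of report as header rows.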
--------------------------------------------------------------------------------
/ascii_text.py:
--------------------------------------------------------------------------------
1 | import urllib.request
2 | import xml.etree.ElementTree as ET
3 | import re
4 | import csv
5 | # from astropy.io import ascii  # only needed for the commented-out astropy experiment below
6 |
7 | url = 'https://www.sec.gov/Archives/edgar/data/908551/000090855109000003/0000908551-09-000003.txt'
8 | #url = 'https://www.sec.gov/Archives/edgar/data/1166559/000104746911000932/0001047469-11-000932.txt'
9 | #url = 'https://www.sec.gov/Archives/edgar/data/1279708/000127970813000005/0001279708-13-000005.txt'
10 | # data = urllib.request.urlopen(url)
11 | # parse = False
12 | # holdings = []
13 | # with open('holdingsFromXML.txt', 'w', newline='') as f:
14 | # writer = csv.writer(f)
15 | # for line in data:
16 | # line = line.decode('UTF-8').strip()
17 | # if re.search('^<TABLE>', line) or re.search('^<table>', line):
18 | # parse = True
19 | # if re.search('^</TABLE>$', line) or re.search('^</table>$', line):
20 | # parse = False
21 | # if parse:
22 | # writer.writerow([line])
23 |
24 | fh = open('holdingsFromXML.txt', 'r')
25 | #data = fh.read()
26 | #print(data)
27 | holdings = []
28 | # table = ascii.read(data, format='fixed_width', header_start=6, data_start=10)
29 | # print(table)
30 |
31 | # Pure Python
32 | for line in fh:
33 |     line = line.strip()
34 |     columns = re.split(r'\s{2,}', line)
35 |     # print(line)
36 |     # print(columns)
37 |     holdings.append(columns)
38 | fh.close()
39 | print(holdings)
40 |
41 | with open('parsedHoldings.txt', 'w', newline='') as f:
42 |     writer = csv.writer(f, dialect='excel-tab')
43 |     # writer.writerow(['Ticker: ' + self.ticker])
44 |     # writer.writerow(['Filing Date: ' + str(date)])
45 |     # writer.writerow(['Period of Report: ' + str(period)])
46 |     for row in holdings:
47 |         writer.writerow(row)
48 |
49 |
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | """main.py allows you to start the scraper."""
4 |
5 | import datetime
6 | from scraper import HoldingsScraper
7 | import sys
8 |
9 | ticker = ''
10 | while len(ticker) < 1:
11 | ticker = input('Please enter a ticker: ')
12 |
13 | sys.stdout.write('Scraping started at %s\n' % str(datetime.datetime.now()))
14 | holdings = HoldingsScraper(ticker)
15 | holdings.scrape()
16 |
17 | try:
18 | holdings.remove_temp_file()
19 | except OSError:
20 |     pass  # no temp file exists when only XML filings were parsed
21 |
22 | sys.stdout.write('Scraping completed at %s\n' % str(datetime.datetime.now()))
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | beautifulsoup4==4.5.1
2 | lxml==3.6.4
3 | selenium==3.0.1
4 |
--------------------------------------------------------------------------------
/scraper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | """
4 | scraper.py contains the main functionality to find and scrape
5 | holdings data from 13F filings on https://www.sec.gov.
6 |
7 | See README.md for instructions on how to run
8 | the scraper.
9 |
10 | Note, the scraper will find all filings for a ticker and
11 | generate a text file for each in the current directory.
12 | To test on the most recent filing only, you can comment out
13 | lines 78-86, and uncomment lines 88-89 in scraper.py
14 | """
15 |
16 | import csv
17 | import datetime
18 | import os
19 | import re
20 | import sys
21 | import time
22 | import urllib.request
23 | import urllib.parse
24 |
25 | from bs4 import BeautifulSoup
26 | from selenium import webdriver
27 | from selenium.webdriver.common.by import By
28 | from selenium.webdriver.common.keys import Keys
29 | from selenium.webdriver.support import expected_conditions as EC
30 | from selenium.webdriver.support.ui import WebDriverWait
31 |
32 | SEC_LINK = "https://www.sec.gov/edgar/searchedgar/companysearch.html"
33 | DOMAIN = "https://www.sec.gov"
34 |
35 |
36 | class HoldingsScraper:
37 | """Find holdings data in funds by scraping data from the SEC."""
38 |
39 | def __init__(self, ticker):
40 | self.browser = webdriver.PhantomJS()
41 | self.browser.set_window_size(1024, 768)
42 | self.ticker = ticker
43 | self.links = []
44 |
45 | def find_filings(self):
46 | """Open SEC page, feed HTML into BeautifulSoup, and find filings."""
47 | self.browser.get(SEC_LINK)
48 | soup = BeautifulSoup(self.browser.page_source, "html.parser")
49 |
50 | # Enter ticker to query filings
51 | search = self.browser.find_element_by_name('CIK')
52 | search.send_keys(self.ticker)
53 | search.send_keys(Keys.RETURN)
54 |
55 | try:
56 | wait = WebDriverWait(self.browser, 20)
57 | wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".tableFile2")))
58 |
59 | # Filter search results by '13F' filings
60 | search = self.browser.find_element_by_name('type')
61 | filing_type = '13F'
62 | search.send_keys(filing_type)
63 | search.send_keys(Keys.RETURN)
64 | time.sleep(5)
65 | self.retrieve_filings()
66 | except:
67 | sys.stdout.write('No results found for ticker: %s\n' % self.ticker)
68 |
69 | def retrieve_filings(self):
70 |         """Retrieve links for all 13F filings from search results."""
71 | sys.stdout.write('Retrieving filings for: %s\n' % self.ticker)
72 | soup = BeautifulSoup(self.browser.page_source, "html.parser")
73 | self.links.extend(soup('a', id='documentsbutton'))
74 | sys.stdout.write('13F filings found: %d\n' % len(self.links))
75 |
76 | # Comment out try/except lines below to run most recent filing only
77 | # Checks for more results and click next page
78 | try:
79 |             next_button = self.browser.find_element_by_xpath("//input[@value='Next 40']")
80 |             next_button.click()
81 | self.retrieve_filings()
82 | except:
83 | # Otherwise loop through all filings found to get data
84 | for link in self.links:
85 | url = DOMAIN + link.get('href', None)
86 | self.parse_filing(url)
87 | # Uncomment below to run most recent filing only
88 | # url = DOMAIN + self.links[0].get('href', None)
89 | # self.parse_filing(url)
90 |
91 | def parse_filing(self, url):
92 | """Examines filing to determine how to parse holdings data.
93 |
94 |         Opens the filing url, gets the filing and report period dates, and determines
95 |         whether to parse XML or ASCII based on the 2013 filing format change.
96 | """
97 | self.browser.get(url)
98 | soup = BeautifulSoup(self.browser.page_source, "html.parser")
99 |
100 | # Find report information for text file headers
101 | filing_date_loc = soup.find("div", text="Filing Date")
102 | filing_date = filing_date_loc.findNext('div').text
103 | period_of_report_loc = soup.find("div", text="Period of Report")
104 | period_of_report = period_of_report_loc.findNext('div').text
105 |
106 | # Prepare report header and file_name for each text file
107 | report_detail = self.report_info(filing_date, period_of_report)
108 | file_name = report_detail[0]
109 | report_headers = report_detail[1]
110 |
111 | # Determine if xml file exists, if not look for ASCII text file
112 | try:
113 | xml = soup.find('td', text="2")
114 |             xml_link = xml.findNext('a', text=re.compile(r"\.xml$"))
115 | xml_file = DOMAIN + xml_link.get('href', None)
116 | sys.stdout.write('Getting holdings from: %s\n' % xml_file)
117 | holdings = self.get_holdings_xml(xml_file)
118 | col_headers = holdings[0]
119 | data = holdings[1]
120 | self.save_holdings_xml(file_name, report_headers, col_headers, data)
121 |
122 | except:
123 | ascii = soup.find('td', text="Complete submission text file")
124 |             ascii_link = ascii.findNext('a', text=re.compile(r"\.txt$"))
125 | txt_file = DOMAIN + ascii_link.get('href', None)
126 | sys.stdout.write('Getting holdings from (ascii): %s\n' % txt_file)
127 | holdings = self.get_holdings_ascii(txt_file)
128 | self.save_holdings_ascii(file_name, report_headers, holdings)
129 |
130 | def report_info(self, date, period):
131 | """Prep report headers to be written to text file. """
132 | file_name = self.ticker + '_' + str(date) + '_filing_date.txt'
133 | headers = []
134 | headers.append('Ticker: ' + self.ticker)
135 | headers.append('Filing Date: ' + str(date))
136 | headers.append('Period of Report: ' + str(period))
137 | return(file_name, headers)
138 |
139 | def get_holdings_xml(self, xml_file):
140 | """Get holdings detail from xml file and store data.
141 |
142 | XML format for filings was required by SEC in 2013.
143 | """
144 | self.browser.get(xml_file)
145 | soup = BeautifulSoup(self.browser.page_source, "xml")
146 | holdings = soup.find_all('infoTable')
147 | data = []
148 | # Attempt retrieval of available attributes for 13F filings
149 | for i in range(len(holdings)):
150 | d = {}
151 | try:
152 | d['nameOfIssuer'] = holdings[i].find('nameOfIssuer').text
153 | except:
154 | pass
155 | try:
156 | d['titleOfClass'] = holdings[i].find('titleOfClass').text
157 | except:
158 | pass
159 | try:
160 | d['cusip'] = holdings[i].find('cusip').text
161 | except:
162 | pass
163 | try:
164 | d['value'] = holdings[i].find('value').text
165 | except:
166 | pass
167 | try:
168 | d['sshPrnamt'] = holdings[i].find('shrsOrPrnAmt').find('sshPrnamt').text
169 | except:
170 | pass
171 | try:
172 | d['sshPrnamtType'] = holdings[i].find('shrsOrPrnAmt').find('sshPrnamtType').text
173 | except:
174 | pass
175 | try:
176 | d['putCall'] = holdings[i].find('putCall').text
177 | except:
178 | pass
179 | try:
180 | d['investmentDiscretion'] = holdings[i].find('investmentDiscretion').text
181 | except:
182 | pass
183 | try:
184 | d['otherManager'] = holdings[i].find('otherManager').text
185 | except:
186 | pass
187 | try:
188 | d['votingAuthoritySole'] = holdings[i].find('votingAuthority').find('Sole').text
189 | except:
190 | pass
191 | try:
192 | d['votingAuthorityShared'] = holdings[i].find('votingAuthority').find('Shared').text
193 | except:
194 | pass
195 | try:
196 | d['votingAuthorityNone'] = holdings[i].find('votingAuthority').find('None').text
197 | except:
198 | pass
199 | data.append(d)
200 |
201 |         col_headers = list(dict.fromkeys(key for row in data for key in row))
202 | return(col_headers, data)
203 |
204 | def save_holdings_xml(self, file_name, report_headers, col_headers, data):
205 | """Create and write holdings data from XML to tab-delimited text file."""
206 | with open(file_name, 'w', newline='') as f:
207 | writer = csv.writer(f, dialect='excel-tab')
208 | for i in range(len(report_headers)):
209 | writer.writerow([report_headers[i]])
210 | writer.writerow(col_headers)
211 | for row in data:
212 | writer.writerow([row.get(k, 'n/a') for k in col_headers])
213 |
214 | def get_holdings_ascii(self, txt_file):
215 | """Get holdings detail from ASCII file and store data.
216 |
217 |         ASCII format was used before the 2013 switch to XML. Read and find
218 | holdings details from ASCII text file. Store data in 'temp_holdings.txt'
219 | file for save_holdings_ascii().
220 | """
221 | data = urllib.request.urlopen(txt_file)
222 | parse = False
223 | temp_file = 'temp_holdings.txt'
224 | with open(temp_file, 'w', newline='') as f:
225 | writer = csv.writer(f)
226 | # Look for table storing holdings data before writing to file
227 | for line in data:
228 | line = line.decode('UTF-8').strip()
229 |                 if re.search('^<TABLE>', line) or re.search('^<table>', line):
230 | parse = True
231 |                 if re.search('^</TABLE>$', line) or re.search('^</table>$', line):
232 | parse = False
233 | if parse:
234 | writer.writerow([line])
235 |
236 | return(temp_file)
237 |
238 | def save_holdings_ascii(self, file_name, report_headers, data):
239 | """Retrieves and reads 'temp_holdings.txt', then writes to tab-delimited file.
240 |
241 |         Parses holdings data in ASCII text format, splitting each line
242 |         on 2 or more whitespace characters, stores each row in 'holdings',
243 | then writes to tab-delimited text file.
244 | """
245 | with open(data, 'r') as f:
246 | holdings = []
247 | for line in f:
248 | line = line.strip()
249 | columns = re.split(r'\s{2,}', line)
250 | holdings.append(columns)
251 |
252 | # Write tab delimited file
253 | file_name = 'ASCII_' + file_name
254 | with open(file_name, 'w', newline='') as f:
255 | writer = csv.writer(f, dialect='excel-tab')
256 | for i in range(len(report_headers)):
257 | writer.writerow([report_headers[i]])
258 | for row in holdings:
259 |             writer.writerow(row)
260 |
261 | def remove_temp_file(self):
262 | """Deletes temp file used to parse ASCII data."""
263 | os.remove('temp_holdings.txt')
264 |
265 | def scrape(self):
266 | """Main method to start scraper and find SEC holdings data."""
267 | self.find_filings()
268 | self.browser.quit()
269 |
--------------------------------------------------------------------------------