├── README.md
├── ascii_text.py
├── main.py
├── requirements.txt
└── scraper.py

/README.md:
--------------------------------------------------------------------------------
# EDGAR 13F Scraper

## Installation
This program uses Python 3, Selenium, PhantomJS,
BeautifulSoup, and lxml.

Before using the scraper:
1. Check that you have Python 3:
        python3 --version
2. Navigate to the project folder
3. Create a virtual environment:
        python3 -m venv myvenv
4. Activate the virtual environment:
        source myvenv/bin/activate
5. Install the requirements:
        pip install -r requirements.txt
6. Install PhantomJS (if not already installed)

## Files and Functionality
The EDGAR 13F Scraper parses fund holdings pulled from EDGAR, given a ticker or CIK.

* main.py allows you to start the scraper.
* scraper.py contains the main functionality to find and scrape holdings data from 13F filings on https://www.sec.gov.

To start the scraper, run this command in an activated virtual environment:

    python main.py

The program will prompt you for a ticker/CIK to find filings for.
Example format:

    0001166559
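
A session looks roughly like the following (illustrative transcript only; the prompt and status messages come from main.py and scraper.py, while the timestamps, filing count, and URLs will vary by filer):

    $ python main.py
    Please enter a ticker: 0001166559
    Scraping started at <timestamp>
    Retrieving filings for: 0001166559
    13F filings found: 40
    Getting holdings from: https://www.sec.gov/Archives/edgar/data/.../...xml
    Getting holdings from (ascii): https://www.sec.gov/Archives/edgar/data/.../...txt
    ...
    Scraping completed at <timestamp>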

Note: the scraper will find all filings for a ticker and generate a text file for each filing in the same directory as the script. To avoid a large number of generated files, you can fetch only the most recent filing if you:
* comment out lines 78-86 and uncomment lines 88-89 in scraper.py

The scraper was tested on the following tickers:
* Gates Foundation '0001166559'
* Hershey Trust '0000908551'
* Menlo Advisors '0001279708'
* BlackRock '0001086364'
* Fidelity Advisors 'FELIX'

## Implementation for different 13F file formats
In 2013, the SEC discontinued the ASCII text-based filing format and required filings to be submitted in XML.

The scraper handles both XML and ASCII text filings in the following manner:
* It first attempts to find the XML file on the filing's documents page.
  If it exists, holdings are retrieved by parsing the XML from the .xml link
  (see the sketch below), and the data is written to a tab-delimited file.

* If no XML file is found, holdings are retrieved by parsing the .txt link.
  The scraper looks for the table of holdings and saves it to a
  temporary file. The temporary file is then read and parsed again,
  splitting each line on whitespace so a tab-delimited text file can be written.
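
As a rough illustration of the XML path, here is a minimal, standalone sketch (not part of the scraper) that parses a 13F information-table XML document. It assumes the .xml file has already been downloaded from EDGAR and saved locally as infotable.xml (a hypothetical file name), and it mirrors the field handling in scraper.py's get_holdings_xml():

    from bs4 import BeautifulSoup

    # Parse the saved information table; the 'xml' parser requires lxml
    with open('infotable.xml') as f:
        soup = BeautifulSoup(f.read(), 'xml')

    # Each <infoTable> element describes one holding
    for entry in soup.find_all('infoTable'):
        fields = [entry.find(tag) for tag in ('nameOfIssuer', 'cusip', 'value')]
        print('\t'.join(t.text if t is not None else 'n/a' for t in fields))

Pre-2013 ASCII filings have no such structure: the holdings table is plain fixed-width text, which is why get_holdings_ascii() simply collects the table lines into a temporary file and save_holdings_ascii() splits each line on runs of two or more spaces with re.split(r'\s{2,}', line).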

--------------------------------------------------------------------------------
/ascii_text.py:
--------------------------------------------------------------------------------
# Standalone script for experimenting with parsing the holdings table from a
# pre-2013 ASCII 13F filing; the same splitting approach is used by
# save_holdings_ascii() in scraper.py.
import urllib.request
import xml.etree.ElementTree as ET
import re
import csv
# from astropy.io import ascii  # only needed for the commented-out fixed_width
#                               # experiment below; astropy is not in requirements.txt

url = 'https://www.sec.gov/Archives/edgar/data/908551/000090855109000003/0000908551-09-000003.txt'
#url = 'https://www.sec.gov/Archives/edgar/data/1166559/000104746911000932/0001047469-11-000932.txt'
#url = 'https://www.sec.gov/Archives/edgar/data/1279708/000127970813000005/0001279708-13-000005.txt'
# data = urllib.request.urlopen(url)
# parse = False
# holdings = []
# with open('holdingsFromXML.txt', 'w', newline='') as f:
#     writer = csv.writer(f)
#     for line in data:
#         line = line.decode('UTF-8').strip()
#         if re.search('^<TABLE>', line) or re.search('^<table>', line):
#             parse = True
#         if re.search('^</TABLE>$', line) or re.search('^</table>$', line):
#             parse = False
#         if parse:
#             writer.writerow([line])

fh = open('holdingsFromXML.txt', 'r')
#data = fh.read()
#print(data)
holdings = []
# table = ascii.read(data, format='fixed_width', header_start=6, data_start=10)
# print(table)

# Pure Python: split each line on runs of two or more spaces
for line in fh:
    line = line.strip()
    columns = re.split(r'\s{2,}', line)
    # print(line)
    #print(columns)
    holdings.append(columns)
fh.close()
print(holdings)

with open('parsedHoldings.txt', 'w', newline='') as f:
    writer = csv.writer(f, dialect='excel-tab')
    # writer.writerow(['Ticker: ' + self.ticker])
    # writer.writerow(['Filing Date: ' + str(date)])
    # writer.writerow(['Period of Report: ' + str(period)])
    for row in holdings:
        writer.writerow([row[i] for i in range(len(row))])

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

"""main.py allows you to start the scraper."""

import datetime
from scraper import HoldingsScraper
import sys

ticker = ''
while len(ticker) < 1:
    ticker = input('Please enter a ticker: ')

sys.stdout.write('Scraping started at %s\n' % str(datetime.datetime.now()))
holdings = HoldingsScraper(ticker)
holdings.scrape()

try:
    holdings.remove_temp_file()
except:
    # No temp file was created (e.g. all filings were parsed from XML)
    pass

sys.stdout.write('Scraping completed at %s\n' % str(datetime.datetime.now()))

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
beautifulsoup4==4.5.1
lxml==3.6.4
selenium==3.0.1

--------------------------------------------------------------------------------
/scraper.py (line numbers shown because the README refers to them):
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | 
3 | """
4 | scraper.py contains the main functionality to find and scrape
5 | holdings data from 13F filings on https://www.sec.gov.
6 | 
7 | See comments in README.md for instructions on how to run
8 | the scraper.
9 | 
10 | Note: the scraper will find all filings for a ticker and
11 | generate a text file for each in the current directory.
12 | To test on the most recent filing only, you can comment out
13 | lines 78-86 and uncomment lines 88-89 in scraper.py
14 | """
15 | 
16 | import csv
17 | import datetime
18 | import os
19 | import re
20 | import sys
21 | import time
22 | import urllib.request
23 | import urllib.parse
24 | 
25 | from bs4 import BeautifulSoup
26 | from selenium import webdriver
27 | from selenium.webdriver.common.by import By
28 | from selenium.webdriver.common.keys import Keys
29 | from selenium.webdriver.support import expected_conditions as EC
30 | from selenium.webdriver.support.ui import WebDriverWait
31 | 
32 | SEC_LINK = "https://www.sec.gov/edgar/searchedgar/companysearch.html"
33 | DOMAIN = "https://www.sec.gov"
34 | 
35 | 
36 | class HoldingsScraper:
37 |     """Find holdings data in funds by scraping data from the SEC."""
38 | 
39 |     def __init__(self, ticker):
40 |         self.browser = webdriver.PhantomJS()
41 |         self.browser.set_window_size(1024, 768)
42 |         self.ticker = ticker
43 |         self.links = []
44 | 
45 |     def find_filings(self):
46 |         """Open SEC page, feed HTML into BeautifulSoup, and find filings."""
47 |         self.browser.get(SEC_LINK)
48 |         soup = BeautifulSoup(self.browser.page_source, "html.parser")
49 | 
50 |         # Enter ticker to query filings
51 |         search = self.browser.find_element_by_name('CIK')
52 |         search.send_keys(self.ticker)
53 |         search.send_keys(Keys.RETURN)
54 | 
55 |         try:
56 |             wait = WebDriverWait(self.browser, 20)
57 |             wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".tableFile2")))
58 | 
59 |             # Filter search results by '13F' filings
60 |             search = self.browser.find_element_by_name('type')
61 |             filing_type = '13F'
62 |             search.send_keys(filing_type)
63 |             search.send_keys(Keys.RETURN)
64 |             time.sleep(5)
65 |             self.retrieve_filings()
66 |         except:
67 |             sys.stdout.write('No results found for ticker: %s\n' % self.ticker)
68 | 
69 |     def retrieve_filings(self):
70 |         """Retrieve links for all 13F filings from search results."""
71 |         sys.stdout.write('Retrieving filings for: %s\n' % self.ticker)
72 |         soup = BeautifulSoup(self.browser.page_source, "html.parser")
73 |         self.links.extend(soup('a', id='documentsbutton'))
74 |         sys.stdout.write('13F filings found: %d\n' % len(self.links))
75 | 
76 |         # Comment out try/except lines below to run most recent filing only
77 |         # Check for more results and click next page
78 |         try:
79 |             next = self.browser.find_element_by_xpath("//input[@value='Next 40']")
80 |             next.click()
81 |             self.retrieve_filings()
82 |         except:
83 |             # Otherwise loop through all filings found to get data
84 |             for link in self.links:
85 |                 url = DOMAIN + link.get('href', None)
86 |                 self.parse_filing(url)
87 |         # Uncomment below to run most recent filing only
88 |         # url = DOMAIN + self.links[0].get('href', None)
89 |         # self.parse_filing(url)
90 | 
91 |     def parse_filing(self, url):
92 |         """Examines the filing to determine how to parse holdings data.
93 | 
94 |         Opens the filing URL, gets the filing date and period of report, and
95 |         determines whether to parse XML or ASCII based on the 2013 format change.
96 |         """
97 |         self.browser.get(url)
98 |         soup = BeautifulSoup(self.browser.page_source, "html.parser")
99 | 
100 |         # Find report information for text file headers
101 |         filing_date_loc = soup.find("div", text="Filing Date")
102 |         filing_date = filing_date_loc.findNext('div').text
103 |         period_of_report_loc = soup.find("div", text="Period of Report")
104 |         period_of_report = period_of_report_loc.findNext('div').text
105 | 
106 |         # Prepare report header and file_name for each text file
107 |         report_detail = self.report_info(filing_date, period_of_report)
108 |         file_name = report_detail[0]
109 |         report_headers = report_detail[1]
110 | 
111 |         # Determine if xml file exists, if not look for ASCII text file
112 |         try:
113 |             xml = soup.find('td', text="2")
114 |             xml_link = xml.findNext('a', text=re.compile(r"\.xml$"))
115 |             xml_file = DOMAIN + xml_link.get('href', None)
116 |             sys.stdout.write('Getting holdings from: %s\n' % xml_file)
117 |             holdings = self.get_holdings_xml(xml_file)
118 |             col_headers = holdings[0]
119 |             data = holdings[1]
120 |             self.save_holdings_xml(file_name, report_headers, col_headers, data)
121 | 
122 |         except:
123 |             ascii = soup.find('td', text="Complete submission text file")
124 |             ascii_link = ascii.findNext('a', text=re.compile(r"\.txt$"))
125 |             txt_file = DOMAIN + ascii_link.get('href', None)
126 |             sys.stdout.write('Getting holdings from (ascii): %s\n' % txt_file)
127 |             holdings = self.get_holdings_ascii(txt_file)
128 |             self.save_holdings_ascii(file_name, report_headers, holdings)
129 | 
130 |     def report_info(self, date, period):
131 |         """Prep report headers to be written to the text file."""
132 |         file_name = self.ticker + '_' + str(date) + '_filing_date.txt'
133 |         headers = []
134 |         headers.append('Ticker: ' + self.ticker)
135 |         headers.append('Filing Date: ' + str(date))
136 |         headers.append('Period of Report: ' + str(period))
137 |         return(file_name, headers)
138 | 
139 |     def get_holdings_xml(self, xml_file):
140 |         """Get holdings detail from xml file and store data.
141 | 
142 |         The XML format for filings has been required by the SEC since 2013.
143 |         """
144 |         self.browser.get(xml_file)
145 |         soup = BeautifulSoup(self.browser.page_source, "xml")
146 |         holdings = soup.find_all('infoTable')
147 |         data = []
148 |         # Attempt retrieval of available attributes for 13F filings
149 |         for i in range(len(holdings)):
150 |             d = {}
151 |             try:
152 |                 d['nameOfIssuer'] = holdings[i].find('nameOfIssuer').text
153 |             except:
154 |                 pass
155 |             try:
156 |                 d['titleOfClass'] = holdings[i].find('titleOfClass').text
157 |             except:
158 |                 pass
159 |             try:
160 |                 d['cusip'] = holdings[i].find('cusip').text
161 |             except:
162 |                 pass
163 |             try:
164 |                 d['value'] = holdings[i].find('value').text
165 |             except:
166 |                 pass
167 |             try:
168 |                 d['sshPrnamt'] = holdings[i].find('shrsOrPrnAmt').find('sshPrnamt').text
169 |             except:
170 |                 pass
171 |             try:
172 |                 d['sshPrnamtType'] = holdings[i].find('shrsOrPrnAmt').find('sshPrnamtType').text
173 |             except:
174 |                 pass
175 |             try:
176 |                 d['putCall'] = holdings[i].find('putCall').text
177 |             except:
178 |                 pass
179 |             try:
180 |                 d['investmentDiscretion'] = holdings[i].find('investmentDiscretion').text
181 |             except:
182 |                 pass
183 |             try:
184 |                 d['otherManager'] = holdings[i].find('otherManager').text
185 |             except:
186 |                 pass
187 |             try:
188 |                 d['votingAuthoritySole'] = holdings[i].find('votingAuthority').find('Sole').text
189 |             except:
190 |                 pass
191 |             try:
192 |                 d['votingAuthorityShared'] = holdings[i].find('votingAuthority').find('Shared').text
193 |             except:
194 |                 pass
195 |             try:
196 |                 d['votingAuthorityNone'] = holdings[i].find('votingAuthority').find('None').text
197 |             except:
198 |                 pass
199 |             data.append(d)
200 | 
201 |         col_headers = list(d.keys())
202 |         return(col_headers, data)
203 | 
204 |     def save_holdings_xml(self, file_name, report_headers, col_headers, data):
205 |         """Create and write holdings data from XML to tab-delimited text file."""
206 |         with open(file_name, 'w', newline='') as f:
207 |             writer = csv.writer(f, dialect='excel-tab')
208 |             for i in range(len(report_headers)):
209 |                 writer.writerow([report_headers[i]])
210 |             writer.writerow(col_headers)
211 |             for row in data:
212 |                 writer.writerow([row.get(k, 'n/a') for k in col_headers])
213 | 
214 |     def get_holdings_ascii(self, txt_file):
215 |         """Get holdings detail from ASCII file and store data.
216 | 
217 |         The ASCII format was used before the 2013 switch to XML. Read and find
218 |         holdings details from the ASCII text file. Store data in 'temp_holdings.txt'
219 |         file for save_holdings_ascii().
220 |         """
221 |         data = urllib.request.urlopen(txt_file)
222 |         parse = False
223 |         temp_file = 'temp_holdings.txt'
224 |         with open(temp_file, 'w', newline='') as f:
225 |             writer = csv.writer(f)
226 |             # Look for table storing holdings data before writing to file
227 |             for line in data:
228 |                 line = line.decode('UTF-8').strip()
229 |                 if re.search('^<TABLE>', line) or re.search('^<table>', line):
230 |                     parse = True
231 |                 if re.search('^</TABLE>$', line) or re.search('^</table>$', line):
232 |                     parse = False
233 |                 if parse:
234 |                     writer.writerow([line])
235 | 
236 |         return(temp_file)
237 | 
238 |     def save_holdings_ascii(self, file_name, report_headers, data):
239 |         """Retrieves and reads 'temp_holdings.txt', then writes to a tab-delimited file.
240 | 
241 |         Parses holdings data in ASCII text format, splitting each line
242 |         by looking for 2 or more whitespace characters, stores each row in 'holdings',
243 |         then writes a tab-delimited text file.
244 |         """
245 |         with open(data, 'r') as f:
246 |             holdings = []
247 |             for line in f:
248 |                 line = line.strip()
249 |                 columns = re.split(r'\s{2,}', line)
250 |                 holdings.append(columns)
251 | 
252 |         # Write tab delimited file
253 |         file_name = 'ASCII_' + file_name
254 |         with open(file_name, 'w', newline='') as f:
255 |             writer = csv.writer(f, dialect='excel-tab')
256 |             for i in range(len(report_headers)):
257 |                 writer.writerow([report_headers[i]])
258 |             for row in holdings:
259 |                 writer.writerow([row[i] for i in range(len(row))])
260 | 
261 |     def remove_temp_file(self):
262 |         """Deletes temp file used to parse ASCII data."""
263 |         os.remove('temp_holdings.txt')
264 | 
265 |     def scrape(self):
266 |         """Main method to start scraper and find SEC holdings data."""
267 |         self.find_filings()
268 |         self.browser.quit()
269 | 
--------------------------------------------------------------------------------