├── .gitignore ├── 10-K Text Analysis.pdf ├── 10-K _ 10-Q Text Analysis Eric He - eriqqc.pdf ├── Documentation ├── .Rhistory ├── Building-Document-Frequency-Matrices.Rmd ├── Building-Sentiment-Dictionary.Rmd ├── Calculating-Distance-Returns.Rmd ├── Calculating-Financial-Returns.Rmd ├── Calculating-NumProp-Returns.Rmd ├── Calculating-Sentiment-Returns.Rmd ├── Cleaning-Raw-Filings.md ├── Creating-Master-Index.Rmd ├── Script_10Q.R ├── Sentiment-Scores-Algo.Rmd ├── Text-Distance-Algo-10Q.R ├── Text-Distance-Algo.Rmd ├── documentation.html └── parsing-script.R ├── Graphs ├── cosine_distance_returns.png ├── jaccard_distance_returns.png ├── negative_sentiment_quantile_returns.png └── positive_sentiment_quantile_returns.png ├── README.md ├── Sample-Data └── Apple-2016-Cleaned.txt ├── analyses ├── optimize_10k_cleaning.ipynb ├── optimize_cosine_distance.ipynb └── optimize_doc_fetch.ipynb └── scripts ├── fetch_10k_docs.py ├── fetch_10k_urls.py └── fetch_russell_3000.py /.gitignore: -------------------------------------------------------------------------------- 1 | config.json 2 | data/ 3 | __pycache__/ 4 | */.ipynb_checkpoints -------------------------------------------------------------------------------- /10-K Text Analysis.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/10-K Text Analysis.pdf -------------------------------------------------------------------------------- /10-K _ 10-Q Text Analysis Eric He - eriqqc.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/10-K _ 10-Q Text Analysis Eric He - eriqqc.pdf -------------------------------------------------------------------------------- /Documentation/.Rhistory: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Documentation/.Rhistory -------------------------------------------------------------------------------- /Documentation/Building-Document-Frequency-Matrices.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "CorpusCreation" 3 | author: "Eric He" 4 | date: "July 17, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | This document details the process for taking the cleaned financial statements and parsing them into quanteda corpuses. The code is done for all the financial statements for a given year, and is repeated for each of the four years. 13 | 14 | We begin by loading in the required libraries. 15 | 16 | ```{r} 17 | library("quanteda") 18 | library("stringr") 19 | library("dplyr") 20 | library("purrr") 21 | ``` 22 | 23 | The first 6033 filings are filed in the year 2013, while filings in year 2014 range from 6034 to 11882. This can be recomputed by looking at the masterIndex in the folder. 24 | 25 | ```{r} 26 | year2013 <- c(1:6033) 27 | year2014 <- c(6034:11882) 28 | year2015 <- c(11883:17467) 29 | year2016 <- c(17468:22631) 30 | StopWordsList <- "StopWordsList.txt" %>% 31 | readLines() %>% 32 | str_split(pattern = " ") 33 | ``` 34 | 35 | In this code, each financial statement for a given year is loaded in one by one. 
The filing is split into its component sections, and each section becomes a document that is added to the corpus. That is to say, each row of the corpus corresponds to one of the 20 sections of a financial statement. Each financial statement should add 20 sections to the corpus. Unfortunately, the cleaning algorithm which tagged these sections within the financial statement is not perfect, and many false positives must be dealt with. False negatives are at this point impossible to catch, which is why the tagging procedure of the cleaning algorithm has been designed to be more liberal with its tagging. 36 | 37 | ```{r} 38 | years <- year2013 # replace year here 39 | bigcorpus <- corpus("") 40 | for (i in years){ 41 | text <- paste("parsed/", i, ".txt", sep = "") %>% 42 | readLines() %>% 43 | str_split(pattern = "(?s)(?i)°Item", simplify = TRUE) %>% 44 | str_replace_all(pattern = "(?s)<.*?>", replacement = "") %>% 45 | str_replace_all(pattern = "(?s) +", replacement = " ") 46 | text <- text[text != ""] 47 | names(text) <- paste(i, str_extract(text, pattern = "[1234567890ABC]+")) 48 | text <- corpus(text) 49 | bigcorpus <- bigcorpus + text 50 | rm(text) 51 | print(i) 52 | } 53 | save(bigcorpus, file = "RawCorpus2016.RData") 54 | ``` 55 | 56 | The bigcorpus object holds every identified section of every financial statement in the given year as a document. However, most of these documents are garbage! Many documents are actually snippets of the Table of Contents of various filings, which were mistakenly tagged by the cleaning algorithm as section texts. Unfortunately, the heterogeneity of filings makes this incorrect tagging difficult to avoid. Additionally, some documents are simply excerpts of other sections that are mistagged. 57 | 58 | Thus, the goal of the code below is to weed out these documents which are not desired. A good way to delete the documents which are actually part of the Table of Contents is to remove any document with less than 100 words. These documents are both likely to be not real sections, or contain only boiler plate language when the section is not relevant to the company (e.g. Mine Disclosures for a fast food company; the fast food company owns no mines!). At any rate, having less than 100 words is not very useful for our text analysis. 59 | 60 | ```{r} 61 | # sufficient word count 62 | #wordcount <- ntoken(bigcorpus) 63 | #load("wordcount2014.RData") 64 | enoughWords <- wordcount < 100 65 | 66 | # is not section 67 | names <- docnames(bigcorpus) 68 | real.section.letter <- !is.na(str_extract(names, pattern = "[1234567890]+ [ABCDEFGHIJKLMNOPQRSTUVWXYZ]")) 69 | 70 | # is duplicate 71 | real.section.toc <- !is.na((str_extract(names, pattern = "\\."))) 72 | index.numbers <- unique(str_extract(names[real.section.toc], pattern = "[1234567890]+ ")) 73 | all.names <- str_extract(names, pattern = "[0-9]+ ") 74 | has.duplicates <- is.element(all.names, index.numbers) 75 | real.section.duplicate <- !real.section.toc + has.duplicates > 1 76 | 77 | #filter out all trash 78 | the.trash <- real.section.duplicate + real.section.letter + enoughWords > 0 79 | 80 | docvars(bigcorpus, "Subset") <- the.trash 81 | bigcorpus <- corpus_subset(bigcorpus, subset = Subset == FALSE, select = FALSE) 82 | ``` 83 | 84 | Here, we extract relevant information from the remaining documents. The data of interest is as follows: 85 | 86 | 1) The section the document is of. 87 | 2) The filing the document belongs to. 88 | 3) The date during which the financial statement was filed with the SEC. 
89 | 4) The index number of the section within its new, subsetted corpus. 90 | 5) The number of words in the filing. 91 | 92 | ```{r} 93 | #remove the duplicate names 94 | names[real.section.toc] <- names[real.section.toc] %>% 95 | str_extract(pattern = ".*(?=\\.)") 96 | subsetted.names <- names[the.trash == FALSE] 97 | #extract section of filing 98 | section <- subsetted.names %>% 99 | str_extract(pattern = "(?<= ).*") %>% 100 | as.factor() 101 | #extract filing of section 102 | filing <- subsetted.names %>% 103 | str_extract(pattern = ".*(?= )") 104 | #extract date during which 10-k was filed 105 | date.filed <- masterIndex$DATE_FILED[as.numeric(filing)] 106 | #extract word count 107 | word.count <- wordcount[the.trash == FALSE] 108 | #combine into meta dataframe 109 | metadata <- data_frame(index = 1:ndoc(bigcorpus), subsetted.names, section, filing, date.filed, word.count) 110 | 111 | #remove the clutter 112 | rm(real.section.duplicate, has.duplicates, all.names, index.numbers, real.section.toc, real.section.letter, wordcount, names, date.filed, word.count, subsetted.names, section, filing, the.trash, enoughWords) 113 | ``` 114 | 115 | The meta dataframe is saved for future analysis. 116 | 117 | ```{r} 118 | save(metadata, file = "metadata2015.RData") 119 | write.csv(metadata, file = "metadata2015.csv") 120 | ``` 121 | 122 | It is at this stage that we create a document-frequency matrix (DFM) of the various documents. Each row of the DFM corresponds to a different section, while each column corresponds to a different word which appeared in any of the various documents. Cell i,j corresponds to the count of word j in document i. Punctuation and numbers are removed and do not appear in the DFM, since they are not actually words. 123 | 124 | ```{r} 125 | bigdfm <- dfm(bigcorpus, remove = StopWordsList, remove_punct = TRUE, remove_numbers = TRUE) %>% 126 | tfidf(scheme_tf="logave") 127 | ``` 128 | 129 | ```{r} 130 | save(bigcorpus, file = "parsedCorpus2016.RData") 131 | ``` -------------------------------------------------------------------------------- /Documentation/Building-Sentiment-Dictionary.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Making the Dictionary" 3 | author: "Eric He" 4 | date: "June 16, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("dplyr") 14 | dictionary <- read.csv("masterDictionary.csv") 15 | ``` 16 | 17 | We follow the exact specifications laid out by Luo. Positive and Interesting words from the dictionary are classified as Positive, while Negative, Uncertain, Litigious, Constraining, and Superfluous all are collapsed under the umbrella classification "Negative". 
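The `ï..Word` column selected in the next chunk is the dictionary's `Word` column with a UTF-8 byte-order mark mangled into its name by `read.csv`. A minimal alternative, assuming the file really is UTF-8 with a BOM (which the mangled name suggests), is to declare the encoding when the dictionary is read in, which yields a clean `Word` column:

```{r}
# Assumption: masterDictionary.csv is UTF-8 with a byte-order mark.
# Reading it this way gives a "Word" column instead of "ï..Word".
dictionary <- read.csv("masterDictionary.csv", fileEncoding = "UTF-8-BOM")
```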
18 | 19 | ```{r} 20 | positive <- dictionary %>% 21 | filter(Positive > 0 | Interesting > 0) %>% 22 | select(ï..Word) 23 | negative <- dictionary %>% 24 | filter(Negative > 0 | Uncertainty > 0 | Litigious > 0 | Constraining > 0 | Superfluous > 0) %>% 25 | select(ï..Word) 26 | write.csv(negative, row.names= FALSE, file = "negative.csv") 27 | write.csv(positive, row.names = FALSE, file = "positive.csv") 28 | ``` -------------------------------------------------------------------------------- /Documentation/Calculating-Distance-Returns.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "TextDistanceAlgo" 3 | author: "Eric He" 4 | date: "August 19, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("quanteda") 14 | library("dplyr") 15 | library("purrr") 16 | library("stringr") 17 | library("readtext") 18 | library("reshape2") 19 | library("magrittr") 20 | library("ggplot2") 21 | library("gridExtra") 22 | ``` 23 | 24 | ```{r} 25 | masterIndex <- read.csv("masterIndex.csv") 26 | masterIndex$filing %<>% as.character 27 | tickers <- readLines("tickers.txt") # use unique(masterIndex) if we wish to scale this across multiple years, or keep tickers.txt updated 28 | StopWordsList <- readLines("StopWordsList.txt") 29 | hpr <- read.csv("annualReturns.csv", na.strings = "NA") 30 | sections <- c("1", "1A", "3", "4", "7", "8", "9", "9A") 31 | section_names <- c("Business", "Risk Factors", "Legal Proceedings", "Mine Safety Disclosures", "MDA of Financial Conditions", "Financial Statements and Supplementary Data", "Changes on Accounting and Financial Disclosure", "Controls and Procedures") 32 | ``` 33 | 34 | The new section extractor algorithm is much more flexible and robust than the old algorithm. Before, we had an issue where we frequently got the table of contents of a financial statement masquerading as an extra 20 sections. Other, much more rare situations include problematic formatting by the financial statements which would tag normal words as section headings. This problem is solved by choosing the hit with the maximum word count and making it the target section. There are only a few niche cases were a mistagged string is longer than the actual section, and almost no situations where the table of contents is longer than the section itself. 35 | 36 | ```{r} 37 | # 1 statement, 1 section 38 | section_extractor <- function(statement, section){ 39 | name <- statement$doc_id # needs to be atomic vector 40 | pattern <- paste0("°Item ", section, "[^\\w|\\d]", ".*?°") # exclude any Item X where X is followed by any unexpected alphanumeric character 41 | # needs simplify=TRUE because FALSE returns 1-element list of multiple vectors which map() cannot handle. May file issue with stringr 42 | section_hits <- str_extract_all(statement, pattern, simplify=TRUE) 43 | #if section_hits is empty then we need function to skip this one 44 | if (is_empty(section_hits) == TRUE){ 45 | return("empty") 46 | } 47 | word_counts <- map_int(section_hits, ntoken) 48 | max_hit <- which(word_counts == max(word_counts)) 49 | max_filing <- section_hits[[max_hit[length(max_hit)]]] # select the "filing" with the largest word count. If two hits have the same word count, choose the last one. (Following the idea that the first one is probably ToC; it doesn't really matter which one we pick because we're tossing it out anyway because it definitely doesnt make word count.) 
50 | names(max_filing) <- paste(name, section, sep = "_") 51 | return(max_filing) 52 | } 53 | 54 | # multiple statements, 1 section. We use logs to discount frequently occurring words. No inverse document frequency is used because it is not useful for comparisons of documents which should be functionally equivalent (idf is used to differentiate documents with entirely different subject matters, since it highlights differences in word choice. In two risk factor sections, it is easy to discount words like "risk" or "dispute", which are still important words to us). 55 | 56 | section_dfm <- function(statements_list, section, min_words, tf){ 57 | map(statements_list, section_extractor, section=section) %>% 58 | map(corpus) %>% 59 | reduce(`+`) %>% 60 | dfm(tolower=TRUE, remove=StopWordsList, remove_punct=TRUE) %>% 61 | dfm_subset(., rowSums(.) > min_words) %>% 62 | when(tf==TRUE ~ tf(., scheme="log"), 63 | ~ .) 64 | } 65 | # the when statement looks like black magic but it is the functional version of an if-else statement. 66 | # syntax denoted by formula (~) object, LHS of ~ is the condition, RHS is the return. 67 | # If tfidf (the tfidf parameter) == TRUE then return tfidf(., scheme_tf="logave") (the tfidf function) 68 | # Else (no condition) then return . (return the input as the output (do nothing)) 69 | 70 | # multiple statements, multiple sections, 1 ticker. No reduce() since each filing section needs its own corpus 71 | filing_dfm <- function(sections, filings_list, min_words, tf){ 72 | map(sections, section_dfm, statements_list=filings_list, min_words=min_words, tf=tf) 73 | } 74 | 75 | # perform distance analysis on the processed dfm_list. The dist_parser function tries to wrangle with the distObj which textstat_simil returns. It returns a dataframe showing the cosine distance between each pair of filings. 76 | dist_parser <- function(distObj){ 77 | melted_frame <- as.matrix(distObj) %>% 78 | {. * upper.tri(.)} %>% # lambda function to extract the upper triangular part of b, since the diagonal is the identity distance and dist object is symmetric 79 | melt(varnames = c("previous_filing", "current_filing"), value.name = "distance") %>% # comparison filing is always filed before current_filing when using upper triangular 80 | filter(distance != 0) # cut out identity and duplicates. 
This assumes that no two legitimate documents are completely orthogonal, which I think is reasonable 81 | melted_frame$previous_filing %<>% str_extract(pattern = ".*?(?=\\.)") # cut out the text name/section 82 | melted_frame$current_filing %<>% str_extract(pattern = ".*?(?=\\.)") # to allow for easy joining with financial returns 83 | return(melted_frame) 84 | } 85 | 86 | filing_similarity <- function(dfm_list, method){ 87 | map(dfm_list, textstat_simil, method=method) %>% 88 | map(dist_parser)} 89 | 90 | index_filing_filterer <- function(ticker, index){ 91 | filter(index, TICKER == ticker) %>% 92 | pull(filing) # pull the file name, which in this case is just the filing number 93 | } 94 | index_year_filterer <- function(ticker, index){ 95 | filter(index, TICKER == ticker) %>% 96 | pull(YEAR) 97 | } 98 | 99 | plotter <- function(dfObj, section, nquantiles = 5){ 100 | dfObj %>% 101 | na.omit %>% 102 | mutate(quantile = ntile(distance, n = nquantiles)) %>% 103 | group_by(quantile) %>% 104 | summarise(average_return = mean(as.numeric(returns)) - 1) %>% 105 | ggplot(aes(x = quantile, y = average_return)) + 106 | geom_bar(stat = "identity") + 107 | theme(axis.title.y=element_blank()) + 108 | #coord_cartesian(ylim = c(-.2, .3)) + 109 | xlab(section) 110 | } 111 | ``` 112 | 113 | Do 1 ticker end to end. 114 | 115 | ```{r} 116 | index_filing_filterer <- function(ticker, index){ 117 | filter(index, TICKER == ticker) %>% 118 | pull(filing) # pull the file name, which in this case is just the filing number 119 | } 120 | index_year_filterer <- function(ticker, index){ 121 | filter(index, TICKER == ticker) %>% 122 | pull(YEAR) 123 | } 124 | 125 | file_path <- "parsed/" 126 | file_type <- ".txt" 127 | 128 | the_ticker <- "AAPL" 129 | 130 | file_names <- index_filing_filterer(the_ticker, masterIndex) 131 | file_years <- index_year_filterer(the_ticker, masterIndex) 132 | 133 | file_locations <- paste0(file_path, file_names, file_type) 134 | 135 | filings_list <- map(file_locations, readtext) 136 | 137 | years <- paste0("X", file_years[-1]) # financial return columns start with X bc colnames cannot only be numbers; chop out the first year 138 | returns_df <- filter(hpr, ticker == the_ticker) %>% # calling the_ticker ticker gives rise to namespace issues 139 | select(years) %>% 140 | t() 141 | colnames(returns_df) <- "returns" 142 | returns_df %<>% cbind(previous_filing = file_names[-length(file_names)], current_filing = file_names[-1], .) %>% 143 | as.data.frame(stringsAsFactors=FALSE) 144 | 145 | similarity_list <- filing_dfm(sections=sections, filings_list=filings_list, min_words=100, tf=TRUE) %>% 146 | filing_similarity("cosine") # jaccard distance doesnt need any term weightings, although tf=TRUE doesnt make any difference 147 | # similarity_list[map_dbl(similarity_list, nrow) == 0] <- NULL # not needed as we rbind 148 | 149 | distance_returns_df2 <- similarity_list %>% # MOVE THE FILTERING HERE, MORE FLEXIBLE THIS WAY 150 | map(right_join, returns_df) 151 | #map(~ data_frame(distance=., returns=returns_vector)) 152 | ``` 153 | Do all tickers. 154 | 155 | No financial returns data for 2033 of the tickers. 
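The count quoted above can be checked directly (equivalent to the tabulation in the next chunk):

```{r}
# Tickers in the master index with no row in the annual returns data
sum(!(tickers %in% hpr$ticker))
```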
156 | 157 | ```{r} 158 | tickers %in% hpr$ticker %>% table 159 | ``` 160 | 161 | ```{r} 162 | file_path <- "parsed/" 163 | file_type <- ".txt" 164 | 165 | distance_returns_calculator <- function(the_ticker){ 166 | file_names <- index_filing_filterer(the_ticker, masterIndex) 167 | file_years <- index_year_filterer(the_ticker, masterIndex) 168 | 169 | if (length(file_names) <= 1 | length(file_years) <= 1){ 170 | empty_list <- map(rep(NA, times = length(sections)), ~data_frame(previous_filing = ., current_filing = ., distance = ., returns = .)) 171 | print(paste("Only one filing available for ticker", the_ticker)) 172 | return(empty_list) 173 | } # companies with only one year of data cannot be used for a distance analysis. We return a data frame of NAs so that the rbind() can be smooth 174 | 175 | years <- paste0("X", file_years[-1]) # chop out the first year 176 | returns_df <- filter(hpr, ticker == the_ticker) %>% # calling the_ticker ticker gives rise to namespace issues 177 | select(years) %>% 178 | t() 179 | if (is_empty(returns_df) == TRUE){ 180 | empty_list <- map(rep(NA, times = length(sections)), ~data_frame(previous_filing = ., current_filing = ., distance = ., returns = .)) 181 | print(paste("No financial data for ticker", the_ticker)) 182 | return(empty_list) 183 | } 184 | 185 | file_locations <- paste0(file_path, file_names, file_type) 186 | 187 | filings_list <- map(file_locations, readtext) 188 | 189 | colnames(returns_df) <- "returns" 190 | returns_df %<>% cbind(previous_filing = file_names[-length(file_names)], current_filing = file_names[-1], .) %>% # assumes no broken years in the data; broken years can occur if there is no financial filing located one year, or financial data no existerino for a year. Obviously this occurring would be very nonstandard 191 | as.data.frame(stringsAsFactors = FALSE) 192 | 193 | similarity_list <- filing_dfm(sections=sections, filings_list=filings_list, min_words=100, tf=FALSE) %>% 194 | filing_similarity("jaccard") 195 | 196 | distance_returns_df <- similarity_list %>% 197 | map(right_join, returns_df, by = c("previous_filing", "current_filing")) 198 | print(paste("Successfully mapped distance scores to financial returns for ticker", the_ticker)) 199 | return(distance_returns_df) 200 | } 201 | 202 | distance_returns_df <- map(tickers, distance_returns_calculator) %>% 203 | pmap(rbind) # pmap takes the list of lists and rbinds each of the elements within the nested list together. 
its black magic 204 | 205 | save(distance_returns_df, file = "jaccard_distance_returns_df.RData") 206 | 207 | distance_returns_plot <- distance_returns_df %>% 208 | map2(sections, ~plotter(dfObj = .x, section = .y)) %>% 209 | arrangeGrob(grobs = ., ncol = 4, top = "Average Yearly Financial Returns By Jaccard Distance Quantile", left = "Average Yearly Return", bottom = "Filing Section") 210 | 211 | ggsave(distance_returns_plot, file = "jaccard_distance_returns.png", width = 7, height = 5) 212 | plot(distance_returns_plot) 213 | ``` 214 | -------------------------------------------------------------------------------- /Documentation/Calculating-Financial-Returns.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "FinancialReturns" 3 | author: "Eric He" 4 | date: "July 17, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("dplyr") 14 | library("lubridate") 15 | library("tidyr") 16 | library("purrr") 17 | ``` 18 | 19 | Financial returns data was downloaded from the Center for Research in Security Prices (CRSP) daily stock returns database from the Wharton Research Data Services (WRDS) account. Data from 01/01/2000 to 12/31/2016 was downloaded; every single ticker and data category was downloaded, just in case. Date was selected to be in MM/DD/YYYY form. 20 | 21 | Market cap data was also downloaded from the CRSP database provided by WRDS, same beginning and end date, every single ticker and data category again. Date is in YYYYMMDD form. 22 | 23 | Load in the data. 24 | 25 | ```{r} 26 | master_index <- read.csv("masterIndex.csv") 27 | full_returns <- read.csv("../Data/Financial Data/Trimmed_Returns_Raw.csv") 28 | full_cap <- read.csv("market_cap.csv") %>% 29 | select(permno = PERMNO, num_shares = SHROUT, date = SHRSDT) 30 | ``` 31 | 32 | ```{r} 33 | head(masterIndex) 34 | ``` 35 | 36 | ```{r} 37 | head(full_returns) 38 | ``` 39 | 40 | ```{r} 41 | head(full_cap) 42 | ``` 43 | 44 | Select the relevant columns: holding period returns (RET), tickers (TICKER), company name (COMNAM), delisting return (DLRET), Shares Observation End Date (shrenddt) 45 | 46 | End product: correct monthly returns 47 | 48 | ```{r} 49 | full_returns <- select(full_returns, permco = PERMCO, date = date, return = RET, ticker = TICKER, delisting_return = DLRET) %>% 50 | filter(ticker %in% master_index$TICKER) %>% # only need tickers for which we have filings to compare with 51 | mutate(delisting_return = as.numeric(as.character(delisting_return))) %>% # 52 | mutate(delisting_return = replace(delisting_return, is.na(delisting_return) == TRUE, 0)) %>% # replace NA with 0 53 | mutate(delisting_return = delisting_return + 1) %>% # so we can add 1 so we can multiply 54 | mutate(return = as.numeric(as.character(return))) %>% # change return from factor to numeric, characters which CRSP uses to represent missing data or point towards delisting return gets changed to NA 55 | mutate(return = replace(return, is.na(return) == TRUE, 0)) %>% 56 | mutate(return = return + 1) %>% 57 | filter(is.na(return) == FALSE) %>% 58 | mutate(return = return * delisting_return) %>% 59 | mutate(date = mdy(date)) %>% 60 | select(-delisting_return) 61 | ``` 62 | 63 | Given a ticker and a date from the master Index, devise a formula to calculate holding period returns for variable time intervals. 
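The convention relied on throughout is that `full_returns$return` already holds gross returns (1 + r, with the delisting gross return multiplied in), so the holding period return over any window is simply the product of the rows falling in that window. A toy illustration with made-up numbers:

```{r}
# Illustration only (values are made up): three daily gross returns
gross_returns <- c(1.010, 0.995, 1.020)
prod(gross_returns)     # gross holding period return, roughly 1.025
prod(gross_returns) - 1 # net return over the window, roughly 2.5%
```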
64 | 65 | ```{r} 66 | the_ticker <- "AAPL" 67 | the_year <- 2013 68 | the_month <- 10 69 | the_day <- 30 70 | 71 | interval_length <- 3 72 | interval_type <- "month" 73 | 74 | start_date <- ymd("20131030") 75 | 76 | end_date <- the_date %m+% months(3) 77 | 78 | returns_calculator <- function(the_ticker, filing_date, interval_length = 5, interval_type = "day"){ 79 | start_date_beginning <- filing_date %>% # start date is first non-weekend day before the filing date. 80 | ymd(.) %m-% days(3) 81 | start_date_candidates <- seq(start_date_beginning, ymd(filing_date) %m-% days(1), by = "days") 82 | start_date <- start_date_candidates[which(format(start_date_candidates, "%u") %in% c(1:5))[length(which(format(start_date_candidates, "%u") %in% c(1:5)))]] # pick out first weekday in the sequence, %u formats date object into numeric weekday 83 | end_date <- filing_date %>% 84 | ymd %>% # have to ram through ymd again because map() uses [[]] which messes with lubridate object type 85 | when(interval_type == "day" ~ . %m+% days(interval_length), 86 | interval_type == "month" ~ . %m+% months(interval_length), 87 | interval_type == "year" ~ . %m+% years(interval_length), 88 | ~ stop(print(interval_type))) 89 | date_sequence <- seq(start_date, end_date, by = "days") # see above comment 90 | date_returns <- filter(full_returns, as.character(ticker) == as.character(the_ticker), date %in% date_sequence) # tickers must be converted to character or else throws a factor level error 91 | if (nrow(date_returns) == 0){ 92 | empty_df <- data_frame(hpr = NA) 93 | print(paste("No financial data for ticker", the_ticker)) 94 | return(empty_df) # if no financial data then we would like to make that clear 95 | } # when statement does not work to break the function! :( 96 | hpr <- summarise(date_returns, hpr = prod(return)) 97 | print(paste("Calculated hpr for ticker", the_ticker, "and date", filing_date)) 98 | return(hpr)} 99 | ``` 100 | 101 | ```{r} 102 | returns_5_days <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator) 103 | returns_1_month <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator, interval_length = 1, interval_type = "month") 104 | returns_3_months <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator, interval_length = 3, interval_type = "month") 105 | returns_6_months <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator, interval_length = 6, interval_type = "month") 106 | returns_1_year <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator, interval_length = 1, interval_type = "year") 107 | ``` 108 | 109 | The volatility calculator uses the original trimmed raw returns data to calculate, since it is more accurate about which days are trading days! 110 | 111 | ```{r} 112 | volatility_calculator <- function(the_ticker, filing_date, interval_length = 1, interval_type = "year"){ 113 | end_date <- filing_date %>% ymd 114 | start_date <- end_date %>% 115 | ymd %>% 116 | when(interval_type == "day" ~ . %m-% days(interval_length), 117 | interval_type == "month" ~ . %m-% months(interval_length), 118 | interval_type == "year" ~ . 
%m-% years(interval_length), 119 | ~ stop(print(interval_type))) 120 | date_sequence <- seq(start_date, end_date, by = "days") 121 | date_sequence <- date_sequence[-which(format(date_sequence, "%u") %in% c(6,7))] 122 | date_returns <- filter(full_returns, as.character(ticker) == as.character(the_ticker), date %in% date_sequence) 123 | if (nrow(date_returns) == 0){ 124 | empty_df <- data_frame(sd = NA) 125 | print(paste("No financial data for ticker", the_ticker)) 126 | return(empty_df) 127 | } 128 | volatility <- summarise(date_returns, sd = sd(return)) 129 | print(paste("Calculated volatility for ticker", the_ticker, "and date", filing_date)) 130 | return(volatility) 131 | } 132 | ``` 133 | 134 | ```{r} 135 | vol_1year <- map2_df(master_index$TICKER, master_index$DATE_FILED, volatility_calculator) 136 | ``` 137 | 138 | ```{r} 139 | master_index <- cbind(master_index, vol_1year) 140 | master_index %<>% mutate( 141 | adj_ret5d = ret5d / (sqrt(5) * sd), 142 | adj_ret1m = ret1m / (sqrt(30) * sd), 143 | adj_ret3m = ret3m / (sqrt(90) * sd), 144 | adj_ret1y = ret1y / (sqrt(365) * sd) 145 | ) 146 | ``` 147 | 148 | ```{r} 149 | masterIndex <- cbind(masterIndex, ret5d = returns_5_days$hpr, ret1m = returns_1_month$hpr, ret3m = returns_3_months$hpr, ret6m = returns_6_months$hpr, ret1y = returns_1_year$hpr) 150 | write.csv(master_index, "../Data/Master Index/masterIndex.csv", row.names = FALSE) 151 | ``` 152 | 153 | -------------------------------------------------------------------------------- /Documentation/Calculating-NumProp-Returns.Rmd: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Documentation/Calculating-NumProp-Returns.Rmd -------------------------------------------------------------------------------- /Documentation/Calculating-Sentiment-Returns.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "sentimentAnalysisAlgo" 3 | author: "Eric He" 4 | date: "August 7, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("quanteda") 14 | library("dplyr") 15 | library("purrr") 16 | library("ggplot2") 17 | library("reshape2") 18 | library("gridExtra") 19 | ``` 20 | 21 | Read in the data relevant for all four years. 22 | 23 | ```{r} 24 | negative <- readLines("negative.txt") 25 | positive <- readLines("positive.txt") 26 | sections <- c("1", "1A", "1B", "2", "3", "4", "5", "6", "7", "7A", "8", "9", "9A", "9B", "10", "11", "12", "13", "14", "15") 27 | hpr <- read.csv("annualReturns.csv", na.strings = "NA") 28 | years <- c("X2012", "X2013", "X2014", "X2015", "X2016") 29 | years <- c(2013, 2014, 2015, 2016) 30 | masterIndex <- read.csv("masterIndex.csv") 31 | tickers <- data_frame(filing = c(1:nrow(masterIndex)), ticker = masterIndex$TICKER) 32 | ``` 33 | 34 | Build two functions, one which subsets the bigdfm according to section and the other recording the indices if the subsetted. 
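For orientation, this is how the two helpers defined in the next chunk would be called for a single section, assuming `metadata` and `bigdfm` from the corpus-building step are in the workspace (the `"1A"` label is only an example):

```{r}
# Hypothetical usage of the helpers defined below
risk_filing_ids <- indices_tfidf(metadata, cond = "1A")     # filing numbers of Item 1A documents
risk_weights <- subset_tfidf(metadata, bigdfm, cond = "1A") # tf-idf weighted dfm of those documents
```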
35 | 36 | ```{r} 37 | indices_tfidf <- function(meta_df, cond){ 38 | index_num <- filter(meta_df, section == cond) %>% 39 | select(filing) 40 | return(index_num) 41 | } 42 | subset_tfidf <- function(meta_df, dfmobj, cond){ 43 | index_num <- filter(meta_df, section == cond) %>% 44 | select(index) 45 | weightsdfm <- dfmobj[index_num$index,] %>% 46 | tfidf(scheme_tf = "logave") 47 | return(weightsdfm) 48 | } 49 | ``` 50 | 51 | Build a function which computes the sentiment scores. Recall that the sentiment score is the weights of the words in the sentiment dictionary, divided by the total weight of the words in the document. 52 | 53 | ```{r} 54 | dfmstat_ratio <- function(dfmObj, dict){ 55 | dfm_select(dfmObj, features = dict) %>% 56 | rowSums(.) / rowSums(dfmObj) 57 | } 58 | ``` 59 | 60 | We would like to do the algorithm for the positive, negative, positive-negative sentiment scorings. This requires the positive and negative dictionaries. 61 | Then we would like to do them for all four years. 62 | 63 | ```{r} 64 | indices_list <- map(sections, indices_tfidf, meta_df = metadata) 65 | weightsdfm_list <- map(sections, subset_tfidf, meta_df = metadata, dfmobj = bigdfm) 66 | sentiment_list <- weightsdfm_list %>% 67 | map(dfmstat_ratio, dict = negative) 68 | ``` 69 | 70 | Now do this for every year's worth of data. 71 | 72 | ```{r} 73 | years <- c(2013:2016) 74 | path_to_metadata <- "metadata" 75 | metadata_type <- ".csv" 76 | path_to_parsedDFM <- "parsedBigDfm" 77 | parsedDFM_type <- ".RData" 78 | 79 | weighter <- function(year){ 80 | metadata <- paste(path_to_metadata, year, metadata_type, sep = "") %>% 81 | read.csv() 82 | load(paste(path_to_parsedDFM, year, parsedDFM_type, sep = "")) 83 | 84 | indices_list <- map(sections, indices_tfidf, meta_df = metadata) 85 | weightsdfm_list <- map(sections, subset_tfidf, meta_df = metadata, dfmobj = bigdfm) 86 | 87 | save(indices_list, file = paste("indices_list_", year, ".RData", sep = "")) 88 | save(weightsdfm_list, file = paste("weightsdfm_list_", year, ".RData", sep = "")) 89 | } 90 | 91 | map(years, weighter) # The weighter function returns nothing which is fine. 92 | ``` 93 | 94 | Get every sentiment list for every year. 95 | 96 | ```{r} 97 | years <- c(2013:2016) 98 | path_to_weightsdfm_list <- "weightsdfm_list_" 99 | weightsdfm_list_type <- ".RData" 100 | path_to_indices_list <- "indices_list_" 101 | indices_list_type <- ".RData" 102 | 103 | returns_quantiler <- function(sentiment_list, index_list, return_df, n_quantiles){ 104 | sentiment_list %>% 105 | map(ntile, n = n_quantiles) %>% 106 | map(as.factor) %>% 107 | map(~ data_frame("quantile" = .)) %>% 108 | map2(index_list, cbind) %>% 109 | map(left_join, y = tickers, by = "filing") %>% 110 | map(left_join, y = return_df, by = "ticker") %>% 111 | map(group_by, quantile) %>% 112 | map(summarise, average_return = mean(return, na.rm = TRUE)) %>% 113 | map(transmute, relative_return = average_return / min(average_return) - 1) %>% # quantile 1 is row 1, 2 is row 2, etc. Higher quantile means higher sentiment value. 
114 | do.call(what = cbind) 115 | } 116 | 117 | sentiment_returns_algo <- function(year, sentiment_dict){ 118 | load(paste(path_to_weightsdfm_list, year, weightsdfm_list_type, sep = "")) # loads into global 119 | load(paste(path_to_indices_list, year, indices_list_type, sep = "")) # loads into global 120 | return_df <- select(hpr, ticker, return = paste("X", year, sep="")) # creates in function namespace so MUST be specified in returns_by_quantile function which would otherwise look in global for return_df 121 | sentiment_list <- weightsdfm_list %>% 122 | map(dfmstat_ratio, dict = sentiment_dict) 123 | returns_by_quantile <- returns_quantiler(sentiment_list, index_list = indices_list, return_df = return_df, n_quantiles = 5) 124 | names(returns_by_quantile) <- paste("section", sections, sep = "") 125 | return(returns_by_quantile) 126 | } 127 | 128 | negative_returns_by_quantile <- map(years, sentiment_returns_algo, sentiment_dict = negative) %>% 129 | reduce(`+`) / length(years) # require hpr, sections, masterIndex, tickers, dfmstat_ratio 130 | 131 | positive_returns_by_quantile <- map(years, sentiment_returns_algo, sentiment_dict = positive) %>% 132 | reduce(`+`) / length(years) 133 | ``` 134 | 135 | Make the ggplot graph. 136 | 137 | ```{r} 138 | quantile <- c(1:5) # little hacky but we need x to be the five quantiles; a proper melt would be the correct method, I think, but i cant get it to work. The problem is that this relies on the ggplot mapping 1 to quantile 1, 2 to quantile 2, etc. which works for c(1:5) but does not work for c("one", "two", "three", "four", "five"), for example, which will sort the character vector alphabetically 139 | 140 | nm <- names(negative_returns_by_quantile) 141 | negative_sentiment_quantile_returns <- map(nm, ~ ggplot(data = negative_returns_by_quantile, aes_string(x = quantile, y = .)) + 142 | geom_bar(stat = "identity") + 143 | theme(axis.title.x=element_blank(), 144 | axis.text.x=element_blank(), 145 | axis.ticks.x=element_blank())) %>% 146 | arrangeGrob(grobs = ., ncol = 5) 147 | ggsave(negative_sentiment_quantile_returns, file = "negative_sentiment_quantile_returns.png") 148 | 149 | nm <- names(positive_returns_by_quantile) 150 | positive_sentiment_quantile_returns <- map(nm, ~ ggplot(data = positive_returns_by_quantile, aes_string(x = quantile, y = .)) + 151 | geom_bar(stat = "identity") + 152 | theme(axis.title.x=element_blank(), 153 | axis.text.x=element_blank(), 154 | axis.ticks.x=element_blank())) %>% 155 | arrangeGrob(grobs = ., ncol = 5) 156 | ggsave(positive_sentiment_quantile_returns, file = "positive_sentiment_quantile_returns.png") 157 | ``` 158 | 159 | 160 | ```{r} 161 | returns_by_quantile <- sentiment_list %>% 162 | map(ntile, n = 5) %>% 163 | map(as.factor) %>% 164 | map(~ data_frame("quantile" = .)) %>% 165 | map2(indices_list, cbind) %>% 166 | map(left_join, y = tickers, by = "filing") %>% 167 | map(left_join, y = return_df, by = "ticker") %>% 168 | map(group_by, quantile) %>% 169 | map(summarise, average_return = mean(return, na.rm = TRUE)) %>% 170 | map(transmute, relative_return = average_return / min(average_return) - 1) %>% 171 | do.call(what = cbind) 172 | names(returns_by_quantile) <- rep(paste("section", sections)) 173 | 174 | quantiles <- map2(indices_list, quantile_negative, cbind) 175 | returns_by_quantile <- map(quantiles, group_by, return) %>% 176 | map(summarise, average_return = mean(X2012, na.rm = TRUE)) 177 | ``` 178 | -------------------------------------------------------------------------------- 
/Documentation/Cleaning-Raw-Filings.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Documentation/Cleaning-Raw-Filings.md -------------------------------------------------------------------------------- /Documentation/Creating-Master-Index.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Creating-Master-Index" 3 | author: "Eric He" 4 | date: "August 4, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("edgar") 14 | library("dplyr") 15 | ``` 16 | 17 | ```{r} 18 | getMasterIndex(c(2013, 2014, 2015, 2016) 19 | ``` 20 | 21 | ```{r} 22 | load("Master Index/2013master.Rda") 23 | index.2013 <- year.master 24 | index.2013 <- filter(index.2013, FORM_TYPE == "10-K") 25 | load("Master Index/2014master.Rda") 26 | index.2014 <- year.master 27 | index.2014 <- filter(index.2014, FORM_TYPE == "10-K") 28 | load("Master Index/2015master.Rda") 29 | index.2015 <- year.master 30 | index.2015 <- filter(index.2015, FORM_TYPE == "10-K") 31 | load("Master Index/2016master.Rda") 32 | index.2016 <- year.master 33 | index.2016 <- filter(index.2016, FORM_TYPE == "10-K") 34 | rm(year.master) 35 | index <- rbind(index.2013, index.2014, index.2015, index.2016) 36 | rm(index.2013, index.2014, index.2015, index.2016) 37 | ``` 38 | 39 | We have the text data needed to begin our analysis; however, we want to be able to access the financial data corresponding to the companies we are analyzing. This is done by linking the CIK values given by the SEC for companies to their stock tickers. The CIK-ticker mapping was downloaded from https://www.valuespreadsheet.com/iedgar/, and lists every publicly traded company's CIK, ticker, SIC code (which denotes the industry the company is classified as being in), and the exchange where that company's stock trades. 40 | 41 | ```{r} 42 | tickers <- "cik-ticker.csv" %>% 43 | read.csv() %>% 44 | rename(TICKER = ticker, CIK = cik, SIC = sic, EXCHANGE = exchange, HITS = hits) 45 | ``` 46 | 47 | Let's join the two datasets together. 48 | 49 | ```{r} 50 | index <- left_join(index, tickers, by = "CIK") %>% 51 | select(-name) 52 | ``` 53 | 54 | ```{r} 55 | write.csv(index, "masterIndex.csv") 56 | ``` -------------------------------------------------------------------------------- /Documentation/Script_10Q.R: -------------------------------------------------------------------------------- 1 | library("dplyr") 2 | library("lubridate") 3 | library("tidyr") 4 | library("purrr") 5 | 6 | returns_calculator <- function(the_ticker, filing_date, interval_length = 5, interval_type = "day"){ 7 | start_date_beginning <- filing_date %>% # start date is first non-weekend day before the filing date. 8 | ymd(.) %m-% days(3) 9 | start_date_candidates <- seq(start_date_beginning, ymd(filing_date) %m-% days(1), by = "days") 10 | start_date <- start_date_candidates[which(format(start_date_candidates, "%u") %in% c(1:5))[length(which(format(start_date_candidates, "%u") %in% c(1:5)))]] # pick out first weekday in the sequence, %u formats date object into numeric weekday 11 | end_date <- filing_date %>% 12 | ymd %>% # have to ram through ymd again because map() uses [[]] which messes with lubridate object type 13 | when(interval_type == "day" ~ . %m+% days(interval_length), 14 | interval_type == "month" ~ . 
%m+% months(interval_length), 15 | interval_type == "year" ~ . %m+% years(interval_length), 16 | ~ stop(print(interval_type))) 17 | date_sequence <- seq(start_date, end_date, by = "days") # see above comment 18 | date_returns <- filter(full_returns, as.character(ticker) == as.character(the_ticker), date %in% date_sequence) # tickers must be converted to character or else throws a factor level error 19 | if (nrow(date_returns) == 0){ 20 | empty_df <- data_frame(hpr = NA) 21 | print(paste("No financial data for ticker", the_ticker)) 22 | return(empty_df) # if no financial data then we would like to make that clear 23 | } # when statement does not work to break the function! :( 24 | hpr <- summarise(date_returns, hpr = prod(return)) 25 | print(paste("Calculated hpr for ticker", the_ticker, "and date", filing_date)) 26 | return(hpr)} 27 | 28 | volatility_calculator <- function(the_ticker, filing_date, interval_length = 1, interval_type = "year"){ 29 | end_date <- filing_date %>% ymd 30 | start_date <- end_date %>% 31 | ymd %>% 32 | when(interval_type == "day" ~ . %m-% days(interval_length), 33 | interval_type == "month" ~ . %m-% months(interval_length), 34 | interval_type == "year" ~ . %m-% years(interval_length), 35 | ~ stop(print(interval_type))) 36 | date_sequence <- seq(start_date, end_date, by = "days") 37 | date_sequence <- date_sequence[-which(format(date_sequence, "%u") %in% c(6,7))] 38 | date_returns <- filter(full_returns, as.character(ticker) == as.character(the_ticker), date %in% date_sequence) 39 | if (nrow(date_returns) == 0){ 40 | empty_df <- data_frame(sd = NA) 41 | print(paste("No financial data for ticker", the_ticker)) 42 | return(empty_df) 43 | } 44 | volatility <- summarise(date_returns, sd = sd(return)) 45 | print(paste("Calculated volatility for ticker", the_ticker, "and date", filing_date)) 46 | return(volatility) 47 | } 48 | 49 | master_index <- read.csv("master_index_10Q.csv") 50 | 51 | full_returns <- read.csv("../Data/Financial Data/Trimmed_Returns_Raw.csv") 52 | full_returns <- select(full_returns, permco = PERMCO, date = date, return = RET, ticker = TICKER, delisting_return = DLRET) %>% 53 | filter(ticker %in% master_index$ticker) %>% # only need tickers for which we have filings to compare with 54 | mutate(delisting_return = as.numeric(as.character(delisting_return))) %>% # 55 | mutate(delisting_return = replace(delisting_return, is.na(delisting_return) == TRUE, 0)) %>% # replace NA with 0 56 | mutate(delisting_return = delisting_return + 1) %>% # so we can add 1 so we can multiply 57 | mutate(return = as.numeric(as.character(return))) %>% # change return from factor to numeric, characters which CRSP uses to represent missing data or point towards delisting return gets changed to NA 58 | mutate(return = replace(return, is.na(return) == TRUE, 0)) %>% 59 | mutate(return = return + 1) %>% 60 | filter(is.na(return) == FALSE) %>% 61 | mutate(return = return * delisting_return) %>% 62 | mutate(date = mdy(date)) %>% 63 | select(-delisting_return) 64 | 65 | ret5d_10Q <- map2_df(master_index$ticker, master_index$date, returns_calculator) 66 | vol_10Q <- map2_df(master_index$ticker, master_index$date, volatility_calculator) 67 | 68 | master_index <- cbind(master_index, ret5d_10Q, vol_10Q) 69 | 70 | write.csv(master_index, "master_index_10Q_.csv") -------------------------------------------------------------------------------- /Documentation/Sentiment-Scores-Algo.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 
"sentimentAnalysisAlgo" 3 | author: "Eric He" 4 | date: "August 7, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("quanteda") 14 | library("dplyr") 15 | library("purrr") 16 | library("ggplot2") 17 | library("reshape2") 18 | library("gridExtra") 19 | ``` 20 | 21 | Read in the data relevant for all four years. 22 | 23 | The desired sentiment analysis algorithm computes sentiment scores of every 24 | 25 | ```{r} 26 | negative <- readLines("negative.txt") 27 | positive <- readLines("positive.txt") 28 | sections <- c("1", "1A", "1B", "2", "3", "4", "5", "6", "7", "7A", "8", "9", "9A", "9B", "10", "11", "12", "13", "14", "15") 29 | masterIndex <- read.csv("masterIndex.csv") 30 | tickers <- unique(masterIndex$TICKER) 31 | ``` 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | Build two functions, one which subsets the bigdfm according to section and the other recording the indices if the subsetted. 42 | 43 | ```{r} 44 | indices_tfidf <- function(meta_df, cond){ 45 | index_num <- filter(meta_df, section == cond) %>% 46 | select(filing) 47 | return(index_num) 48 | } 49 | subset_tfidf <- function(meta_df, dfmobj, cond){ 50 | index_num <- filter(meta_df, section == cond) %>% 51 | select(index) 52 | weightsdfm <- dfmobj[index_num$index,] %>% 53 | tfidf(scheme_tf = "logave") 54 | return(weightsdfm) 55 | } 56 | ``` 57 | 58 | Build a function which computes the sentiment scores. Recall that the sentiment score is the weights of the words in the sentiment dictionary, divided by the total weight of the words in the document. 59 | 60 | ```{r} 61 | dfmstat_ratio <- function(dfmObj, dict){ 62 | dfm_select(dfmObj, pattern = dict) %>% 63 | rowSums(.) / rowSums(dfmObj) 64 | } 65 | ``` 66 | 67 | We would like to do the algorithm for the positive, negative, positive-negative sentiment scorings. This requires the positive and negative dictionaries. 68 | Then we would like to do them for all four years. 69 | 70 | The indices_list sorts the metadata frame by section. 71 | 72 | ```{r} 73 | indices_list <- map(sections, indices_tfidf, meta_df = metadata) 74 | weightsdfm_list <- map(sections, subset_tfidf, meta_df = metadata, dfmobj = bigdfm) 75 | sentiment_list <- weightsdfm_list %>% 76 | map(dfmstat_ratio, dict = negative) 77 | ``` 78 | 79 | Now do this for every year's worth of data. The weighter function is used to save the indices and weightsdfm for every section in a given year. It is mapped across all four years of data. 80 | 81 | ```{r} 82 | years <- c(2013:2016) 83 | path_to_metadata <- "metadata" 84 | metadata_type <- ".csv" 85 | path_to_parsedDFM <- "parsedBigDfm" 86 | parsedDFM_type <- ".RData" 87 | 88 | weighter <- function(year){ 89 | metadata <- paste(path_to_metadata, year, metadata_type, sep = "") %>% 90 | read.csv() 91 | load(paste(path_to_parsedDFM, year, parsedDFM_type, sep = "")) 92 | 93 | indices_list <- map(sections, indices_tfidf, meta_df = metadata) 94 | weightsdfm_list <- map(sections, subset_tfidf, meta_df = metadata, dfmobj = bigdfm) 95 | 96 | save(indices_list, file = paste("indices_list_", year, ".RData", sep = "")) 97 | save(weightsdfm_list, file = paste("weightsdfm_list_", year, ".RData", sep = "")) 98 | } 99 | 100 | map(years, weighter) # The weighter function returns nothing which is fine. 101 | ``` 102 | 103 | Get every sentiment list for every year. 
104 | 105 | The returns quantiler splits the data of each section into five quantiles and calculates the mean returns for a portfolio which invests equally in all companies of the quantile stock. 106 | 107 | ```{r} 108 | years <- c(2013:2016) 109 | path_to_weightsdfm_list <- "weightsdfm_list_" 110 | weightsdfm_list_type <- ".RData" 111 | path_to_indices_list <- "indices_list_" 112 | indices_list_type <- ".RData" 113 | 114 | #TODO: unmap this cancer function 115 | 116 | returns_quantiler <- function(sentiment_list, index_list, return_df, n_quantiles){ 117 | sentiment_list %>% 118 | map(ntile, n = n_quantiles) %>% 119 | map(as.factor) %>% 120 | map(~ data_frame("quantile" = .)) %>% 121 | map2(index_list, cbind) %>% 122 | map(left_join, y = tickers, by = "filing") %>% 123 | map(left_join, y = return_df, by = "ticker") %>% 124 | map(group_by, quantile) %>% 125 | map(summarise, average_return = mean(return, na.rm = TRUE)) %>% 126 | map(transmute, relative_return = average_return / min(average_return) - 1) %>% # quantile 1 is row 1, 2 is row 2, etc. Higher quantile means higher sentiment value. 127 | do.call(what = cbind) 128 | } 129 | 130 | sentiment_returns_algo <- function(year, sentiment_dict){ 131 | load(paste(path_to_weightsdfm_list, year, weightsdfm_list_type, sep = "")) # loads into global 132 | load(paste(path_to_indices_list, year, indices_list_type, sep = "")) # loads into global 133 | sentiment_list <- weightsdfm_list %>% 134 | map(dfmstat_ratio, dict = sentiment_dict) 135 | # names(returns_by_quantile) <- paste("section", sections, sep = "") 136 | return(sentiment_list) 137 | } 138 | 139 | load("weightsdfm_list_2013.RData") 140 | 141 | metadata <- map(paste0("metadata", years, ".csv"), read.csv) %>% 142 | reduce(rbind) 143 | 144 | negative_sentiment <- map(years, sentiment_returns_algo, sentiment_dict = negative) %>% 145 | flatten %>% 146 | reduce(append) 147 | 148 | #TODO: clean up this cancer code 149 | 150 | names <- negative_sentiment %>% names %>% 151 | str_extract(pattern = ".*?(?=\\.)") 152 | 153 | dummy <- data_frame(subsetted.names = names, negative_sentiment = negative_sentiment) 154 | joined <- left_join(dummy, metadata, by = "subsetted.names") %>% 155 | group_by(filing, section) %>% 156 | filter(word.count == max(word.count)) %>% 157 | distinct(filing, section, .keep_all = TRUE) %>% 158 | select(filing, section, negative_sentiment) %>% 159 | spread(key = section, value = negative_sentiment) %>% 160 | rename(sec1sent = `1`, sec1Asent = `1A`, sec1Bsent = `1B`, sec2sent = `2`, sec3sent = `3`, sec4sent = `4`, sec5sent = `5`, sec6sent = `6`, sec7sent = `7`, sec7Asent = `7A`, sec8sent = `8`, sec9sent = `9`, sec9Asent = `9`, sec10sent = `10`, sec11sent = `11`, sec12sent = `12`, sec13sent = `13`, sec14sent = `14`, sec15sent = `15`) 161 | 162 | masterIndex <- left_join(masterIndex, joined, by = "filing") 163 | masterIndex <- read.csv("masterIndex.csv") 164 | 165 | a <- spread(joined, key = section, value = negative_sentiment) 166 | 167 | joined2 <- group_by(joined, filing, section) %>% 168 | + filter(word.count == max(word.count)) 169 | 170 | positive_returns_by_quantile <- map(years, sentiment_returns_algo, sentiment_dict = positive) %>% 171 | reduce(`+`) / length(years) 172 | ``` 173 | Load all the metadata together and rbind 174 | Load all the names from the negative_sentiment and then perform a join operation. 175 | Things in the metadata that are not in sentiment scores are not actual sections; e.g. they have sections tagged as section 16, 2014, 4A, etc. 
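A quick way to inspect those leftover rows, using the objects built above (`dummy` holds the named sentiment scores, `metadata` the combined per-section index):

```{r}
# Metadata rows with no matching sentiment score are the mis-tagged "sections"
mistagged <- anti_join(metadata, dummy, by = "subsetted.names")
table(mistagged$section) # e.g. spurious tags such as "16", "2014", "4A"
```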
176 | 177 | Make the ggplot graph. 178 | 179 | ```{r} 180 | quantile <- c(1:5) # little hacky but we need x to be the five quantiles; a proper melt would be the correct method, I think, but i cant get it to work. The problem is that this relies on the ggplot mapping 1 to quantile 1, 2 to quantile 2, etc. which works for c(1:5) but does not work for c("one", "two", "three", "four", "five"), for example, which will sort the character vector alphabetically 181 | 182 | nm <- names(negative_returns_by_quantile) 183 | negative_sentiment_quantile_returns <- map(nm, ~ ggplot(data = negative_returns_by_quantile, aes_string(x = quantile, y = .)) + 184 | geom_bar(stat = "identity") + 185 | theme(axis.title.x=element_blank(), 186 | axis.text.x=element_blank(), 187 | axis.ticks.x=element_blank())) %>% 188 | arrangeGrob(grobs = ., ncol = 5) 189 | ggsave(negative_sentiment_quantile_returns, file = "negative_sentiment_quantile_returns.png") 190 | 191 | nm <- names(positive_returns_by_quantile) 192 | positive_sentiment_quantile_returns <- map(nm, ~ ggplot(data = positive_returns_by_quantile, aes_string(x = quantile, y = .)) + 193 | geom_bar(stat = "identity") + 194 | theme(axis.title.x=element_blank(), 195 | axis.text.x=element_blank(), 196 | axis.ticks.x=element_blank())) %>% 197 | arrangeGrob(grobs = ., ncol = 5) 198 | ggsave(positive_sentiment_quantile_returns, file = "positive_sentiment_quantile_returns.png") 199 | ``` 200 | 201 | 202 | 203 | 204 | ```{r} 205 | #this is test code and should be ignored 206 | 207 | returns_by_quantile <- sentiment_list %>% 208 | map(ntile, n = 5) %>% 209 | map(as.factor) %>% 210 | map(~ data_frame("quantile" = .)) %>% 211 | map2(indices_list, cbind) %>% 212 | map(left_join, y = tickers, by = "filing") %>% 213 | map(left_join, y = return_df, by = "ticker") %>% 214 | map(group_by, quantile) %>% 215 | map(summarise, average_return = mean(return, na.rm = TRUE)) %>% 216 | map(transmute, relative_return = average_return / min(average_return) - 1) %>% 217 | do.call(what = cbind) 218 | names(returns_by_quantile) <- rep(paste("section", sections)) 219 | 220 | quantiles <- map2(indices_list, quantile_negative, cbind) 221 | returns_by_quantile <- map(quantiles, group_by, return) %>% 222 | map(summarise, average_return = mean(X2012, na.rm = TRUE)) 223 | ``` 224 | -------------------------------------------------------------------------------- /Documentation/Text-Distance-Algo-10Q.R: -------------------------------------------------------------------------------- 1 | library("quanteda") 2 | library("dplyr") 3 | library("readr") 4 | library("purrr") 5 | library("stringr") 6 | library("readtext") 7 | library("reshape2") 8 | library("magrittr") 9 | library("ggplot2") 10 | library("gridExtra") 11 | 12 | masterIndex <- read_csv("master_index_10Q_withFile.csv") 13 | tickers <- unique(masterIndex$ticker) # use unique(masterIndex) if we wish to scale this across multiple years, or keep tickers.txt updated 14 | StopWordsList <- readLines("../Data/Text Data/StopWordsList.txt") 15 | sections <- c("1", "1A", "2", "3", "4", "5", "6") 16 | file_path <- "../Data/Text Data/10-Q/" 17 | file_type <- ".txt" 18 | 19 | section_extractor <- function(statement, section){ 20 | name <- statement$doc_id 21 | pattern <- paste0("(?i)°Item ", section, "[^\\w|\\d]", ".*?°") 22 | section_hits <- str_extract_all(statement, pattern=pattern, simplify=TRUE) 23 | if (is_empty(section_hits) == TRUE){ 24 | empty_vec <- "empty" 25 | names(empty_vec) <- paste(name, section, sep = "_") 26 | print(paste("No hits for 
section", section, "of filing", name)) 27 | return(empty_vec) 28 | } 29 | word_counts <- map_int(section_hits, ntoken) 30 | max_hit <- which(word_counts == max(word_counts)) 31 | max_filing <- section_hits[[max_hit[length(max_hit)]]] 32 | if (max(word_counts) < 250 & str_detect(max_filing, pattern = "(?i)(incorporated by reference)|(incorporated herein by reference)") == TRUE){ 33 | empty_vec <- "empty" 34 | names(empty_vec) <- paste(name, section, sep = "_") 35 | print(paste("Section", section, "of filing", name, "incorporates by reference its information")) 36 | return(empty_vec) 37 | } 38 | names(max_filing) <- paste(name, section, sep = "_") 39 | return(max_filing) 40 | } 41 | 42 | section_dfm <- function(statements_list, section, min_words, tf){ 43 | map(statements_list, section_extractor, section=section) %>% 44 | map(corpus) %>% 45 | reduce(`+`) %>% 46 | dfm(tolower=TRUE, remove=StopWordsList, remove_punct=TRUE) %>% 47 | dfm_subset(., rowSums(.) > min_words) %>% 48 | when(tf==TRUE ~ tf(., scheme="log"), 49 | ~ .) 50 | } 51 | 52 | filing_dfm <- function(sections, filings_list, min_words, tf){ 53 | map(sections, section_dfm, statements_list=filings_list, min_words=min_words, tf=tf) 54 | } 55 | 56 | dist_parser <- function(distObj, section){ 57 | melted_frame <- as.matrix(distObj) %>% 58 | {. * upper.tri(.)} %>% 59 | melt(varnames = c("previous_filing", "filing"), value.name = paste0("sec", section, "dist")) 60 | melted_frame$previous_filing %<>% str_extract(pattern = ".*?(?=\\.)") 61 | melted_frame$filing %<>% str_extract(pattern = ".*?(?=\\.)") 62 | return(melted_frame) 63 | } 64 | 65 | filing_similarity <- function(dfm_list, method){ 66 | map(dfm_list, textstat_simil, method=method) %>% 67 | map(dist_parser)} 68 | 69 | index_filing_filterer <- function(the_ticker, index){ 70 | filter(index, ticker == the_ticker) %>% 71 | arrange(date) %>% 72 | pull(filing) 73 | } 74 | 75 | distance_returns_calculator <- function(the_ticker){ 76 | file_names <- index_filing_filterer(the_ticker, masterIndex) 77 | 78 | if (length(file_names) <= 1){ 79 | empty_list <- data_frame() 80 | print(paste("Only one filing available for ticker", the_ticker)) 81 | return(empty_list) 82 | } 83 | 84 | file_locations <- paste0(file_path, file_names, file_type) 85 | 86 | filings_list <- map(file_locations, readtext) 87 | 88 | similarity_list <- map(sections, section_dfm, statements_list=filings_list, min_words=10, tf=TRUE) %>% 89 | map(textstat_simil, method="cosine") %>% 90 | map2(sections, dist_parser) %>% 91 | reduce(left_join, by = c("previous_filing", "filing")) 92 | 93 | prev_current_mapping <- data_frame(previous_filing = file_names[-length(file_names)], filing = file_names[-1]) 94 | distance_returns_df <- left_join(prev_current_mapping, similarity_list, by = c("previous_filing", "filing")) 95 | print(paste("Successfully mapped distance scores to financial returns for ticker", the_ticker)) 96 | return(distance_returns_df)} 97 | 98 | distance_df <- map(tickers, distance_returns_calculator) %>% 99 | reduce(rbind) 100 | 101 | masterIndex %<>% left_join(distance_df, by = "filing") 102 | 103 | write.csv(masterIndex, file = "index_distance_10Q.csv", row.names = FALSE) -------------------------------------------------------------------------------- /Documentation/Text-Distance-Algo.Rmd: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Documentation/Text-Distance-Algo.Rmd -------------------------------------------------------------------------------- /Documentation/parsing-script.R: -------------------------------------------------------------------------------- 1 | library("stringr") 2 | library("dplyr") 3 | library("purrr") 4 | 5 | input <- "/data/edgar/data/" 6 | output <- "parsed-filings/" 7 | 8 | clean_filing <- function(file_name, input_cik){ 9 | paste0(input_cik, file_name) %>% 10 | readLines(encoding = "UTF-8") %>% 11 | str_c(collapse = " ") %>% 12 | str_extract(pattern = "(?s)(?m)10-Q.*?()") %>% 13 | str_replace(pattern = "((?i)).*?(?=<)", replacement = "") %>% 14 | str_replace(pattern = "((?i)).*?(?=<)", replacement = "") %>% 15 | str_replace(pattern = "((?i)).*?(?=<)", replacement = "") %>% 16 | str_replace(pattern = "((?i)).*?(?=<)", replacement = "") %>% 17 | str_replace(pattern = "(?s)(?i).*?", replacement = "") %>% 18 | str_replace(pattern = "(?s)(?i)<(table).*?()", replacement = "") %>% 19 | str_replace_all(pattern = "(?s)(?i)(?m)> +Item|>Item|^Item", replacement = ">°Item") %>% 20 | str_replace(pattern = "", replacement = "°") %>% 21 | str_replace_all(pattern = "(?s)<.*?>", replacement = " ") %>% 22 | str_replace_all(pattern = "&(.{2,6});", replacement = " ") %>% 23 | str_replace_all(pattern = "(?s) +", replacement = " ") %>% 24 | write(file = paste0(output, file_name)) 25 | print(paste("Cleaned filing", file_name)) 26 | } 27 | 28 | clean_cik <- function(cik){ 29 | input_cik <- paste0(input, cik, "/") 30 | 31 | files = input_cik %>% 32 | list.files %>% 33 | subset(str_detect(., pattern = "10-Q")) 34 | 35 | map(files, clean_filing, input_cik = input_cik) 36 | 37 | print(paste("Cleaned all filings for CIK", cik)) 38 | } 39 | 40 | cik_list <- list.files(input) 41 | 42 | map(cik_list, clean_cik) 43 | 44 | 45 | #mutate(cik = cik, 46 | # date = str_extract(files, pattern = "(?<=(.{1,10}_){2}).*?(?=_)"), 47 | # file_name = str_extract(files, pattern = "(?<=(.{1,10}_){3}).*"))) 48 | 49 | -------------------------------------------------------------------------------- /Graphs/cosine_distance_returns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Graphs/cosine_distance_returns.png -------------------------------------------------------------------------------- /Graphs/jaccard_distance_returns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Graphs/jaccard_distance_returns.png -------------------------------------------------------------------------------- /Graphs/negative_sentiment_quantile_returns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Graphs/negative_sentiment_quantile_returns.png -------------------------------------------------------------------------------- /Graphs/positive_sentiment_quantile_returns.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Graphs/positive_sentiment_quantile_returns.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/README.md -------------------------------------------------------------------------------- /analyses/optimize_cosine_distance.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "132c0f76", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import numpy as np\n", 11 | "import pandas as pd\n", 12 | "import cython\n", 13 | "import os\n", 14 | "import re\n", 15 | "import json\n", 16 | "from bs4 import BeautifulSoup\n", 17 | "from multiprocessing import Pool\n", 18 | "from pandarallel import pandarallel" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "id": "fe7c519c", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "os.chdir('/mnt/d/workspace/8-2/Financial-Statements-Text-Analysis/')" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 3, 34 | "id": "5faf2ba9", 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# params\n", 39 | "with open('config.json', 'r') as f:\n", 40 | " c = json.load(f)\n", 41 | "input_dir = os.path.join(c['DATA_DIR'], '10k_clean')\n", 42 | "# destination_dir = os.path.join(c['DATA_DIR'], '10k_clean')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "40763f7b", 48 | "metadata": {}, 49 | "source": [ 50 | "# read processed 10-Ks in" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 4, 56 | "id": "080d1c05", 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "metadata = pd.read_csv(os.path.join(c['DATA_DIR'], 'metadata.csv'))\n", 61 | "metadata_legacy = pd.read_csv(os.path.join(c['DATA_DIR'], 'metadata_2017.csv'))\n", 62 | "\n", 63 | "# only download the data from russell 3000 today\n", 64 | "metadata = metadata_legacy[metadata_legacy['TICKER'].isin(metadata['ticker'])]" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 5, 70 | "id": "1793c9ca", 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "name": "stderr", 75 | "output_type": "stream", 76 | "text": [ 77 | "/tmp/ipykernel_14741/1550976992.py:1: SettingWithCopyWarning: \n", 78 | "A value is trying to be set on a copy of a slice from a DataFrame.\n", 79 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 80 | "\n", 81 | "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", 82 | " metadata['LOCAL_LINK'] = input_dir + '/' + metadata['TICKER'] + '/' + metadata['EDGAR_LINK'].str.split(\"/\").str[-1]\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "metadata['LOCAL_LINK'] = input_dir + '/' + metadata['TICKER'] + '/' + metadata['EDGAR_LINK'].str.split(\"/\").str[-1]" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 6, 93 | "id": "99e4069e", 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "name": "stdout", 98 | "output_type": "stream", 99 | "text": [ 100 | "652 ms ± 19.2 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" 101 | ] 102 | } 103 | ], 104 | "source": [ 105 | "%%timeit\n", 106 | "\n", 107 | "for i in range(100):\n", 108 | " pd.read_csv(metadata.iloc[i]['LOCAL_LINK'])" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 7, 114 | "id": "fcef961c", 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "348 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "%%timeit \n", 127 | "links = [metadata.iloc[i]['LOCAL_LINK'] for i in range(100)]\n", 128 | "\n", 129 | "with Pool(processes=4) as pool:\n", 130 | " pool.map(pd.read_csv, links)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 8, 136 | "id": "d37cfbf5", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "def read_csv_wrapper(i):\n", 141 | " try:\n", 142 | " row = metadata.iloc[i]\n", 143 | " path = row['LOCAL_LINK']\n", 144 | " ticker = row['TICKER']\n", 145 | " \n", 146 | " df = pd.read_csv(path)\n", 147 | " df['ticker'] = ticker\n", 148 | " df['path'] = path\n", 149 | " df['filing_date'] = row['DATE_FILED']\n", 150 | " return df\n", 151 | " except:\n", 152 | " # some were unable to read because the parse failed \n", 153 | " return pd.DataFrame()\n", 154 | "\n", 155 | "with Pool(processes=16) as pool:\n", 156 | " dfs = pool.map(read_csv_wrapper, range(len(metadata)))\n", 157 | " \n", 158 | "df = pd.concat(dfs)\n", 159 | "# filter out failed reads\n", 160 | "df = df[~df['text'].isnull()]\n", 161 | "\n", 162 | "# order the df\n", 163 | "df = df.sort_values(['ticker', 'item', 'filing_date'])\n", 164 | "df['index'] = np.arange(len(df))\n", 165 | "df['lead_index'] = df.groupby(['ticker', 'item'])['index'].shift(-1)" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "adaafe25", 171 | "metadata": {}, 172 | "source": [ 173 | "# Text cleaning" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 9, 179 | "id": "2eceeaf7", 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "name": "stdout", 184 | "output_type": "stream", 185 | "text": [ 186 | "3.54 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 187 | ] 188 | } 189 | ], 190 | "source": [ 191 | "%%timeit\n", 192 | "df.head(1000)['text'].str.replace('\\W', ' ', regex=True)\\\n", 193 | " .str.lower()\\\n", 194 | " .str.split()\\\n", 195 | " .str.join(' ')" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 10, 201 | "id": "5252b875", 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "def clean_string(s):\n", 206 | " s = re.sub('\\W', ' ', s)\n", 207 | " s = s.lower()\n", 208 | " s = re.sub(' +', ' ', s)\n", 209 | " return s" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 11, 215 | "id": "02b207b2", 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "4.95 s ± 106 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" 223 | ] 224 | } 225 | ], 226 | "source": [ 227 | "%%timeit\n", 228 | "df.head(1000)['text'].apply(clean_string)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 12, 234 | "id": "9f52d9cb", 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "pandarallel.initialize(progress_bar=True, nb_workers=16, verbose=0)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 13, 244 | "id": "2795ac24", 245 | "metadata": {}, 246 | "outputs": [ 247 | { 248 | "data": { 249 | "application/vnd.jupyter.widget-view+json": { 250 | "model_id": "f93dfb3050194425b3e4044db2335bc1", 251 | "version_major": 2, 252 | "version_minor": 0 253 | }, 254 | "text/plain": [ 255 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 256 | ] 257 | }, 258 | "metadata": {}, 259 | "output_type": "display_data" 260 | }, 261 | { 262 | "data": { 263 | "application/vnd.jupyter.widget-view+json": { 264 | "model_id": "69aac78a4c5e4ac78beb042cdbd885ae", 265 | "version_major": 2, 266 | "version_minor": 0 267 | }, 268 | "text/plain": [ 269 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 270 | ] 271 | }, 272 | "metadata": {}, 273 | "output_type": "display_data" 274 | }, 275 | { 276 | "data": { 277 | "application/vnd.jupyter.widget-view+json": { 278 | "model_id": "34a10a212d6947bd9e5fae6e7b9ae6d0", 279 | "version_major": 2, 280 | "version_minor": 0 281 | }, 282 | "text/plain": [ 283 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 284 | ] 285 | }, 286 | "metadata": {}, 287 | "output_type": "display_data" 288 | }, 289 | { 290 | "data": { 291 | "application/vnd.jupyter.widget-view+json": { 292 | "model_id": "b30e20a430a94f09915f297066485805", 293 | "version_major": 2, 294 | "version_minor": 0 295 | }, 296 | "text/plain": [ 297 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 298 | ] 299 | }, 300 | "metadata": {}, 301 | "output_type": "display_data" 302 | }, 303 | { 304 | "data": { 305 | "application/vnd.jupyter.widget-view+json": { 306 | "model_id": "1b12fefdd6c549b59540fa0379cf3072", 307 | "version_major": 2, 308 | "version_minor": 0 309 | }, 310 | "text/plain": [ 311 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 312 | ] 313 | }, 314 | "metadata": {}, 315 | "output_type": "display_data" 316 | }, 317 | { 318 | "data": { 319 | "application/vnd.jupyter.widget-view+json": { 320 | "model_id": "e9bcdecccc9d4f578b381875e0c183b0", 321 | "version_major": 2, 322 | "version_minor": 0 323 | }, 324 | "text/plain": [ 325 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 326 | ] 327 | }, 328 | "metadata": {}, 329 | "output_type": "display_data" 330 | }, 331 | { 332 | "data": { 333 | "application/vnd.jupyter.widget-view+json": { 334 | "model_id": "b753ddfc5b604e60a35cae484ea8dc76", 335 | "version_major": 2, 336 | "version_minor": 0 337 | }, 338 | "text/plain": [ 339 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 340 | ] 341 | }, 342 | "metadata": {}, 343 | "output_type": "display_data" 344 | }, 345 | { 346 | "data": { 347 | "application/vnd.jupyter.widget-view+json": { 348 | "model_id": 
"fde8283eaa0e46a2b63cf68292df1bbb", 349 | "version_major": 2, 350 | "version_minor": 0 351 | }, 352 | "text/plain": [ 353 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 354 | ] 355 | }, 356 | "metadata": {}, 357 | "output_type": "display_data" 358 | }, 359 | { 360 | "name": "stdout", 361 | "output_type": "stream", 362 | "text": [ 363 | "2.34 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 364 | ] 365 | } 366 | ], 367 | "source": [ 368 | "%%timeit\n", 369 | "df.head(1000)['text'].parallel_apply(clean_string)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 14, 375 | "id": "a64c4639", 376 | "metadata": {}, 377 | "outputs": [ 378 | { 379 | "data": { 380 | "application/vnd.jupyter.widget-view+json": { 381 | "model_id": "dbc5bcd67d9442a0a0f923fb9f513789", 382 | "version_major": 2, 383 | "version_minor": 0 384 | }, 385 | "text/plain": [ 386 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=1556), Label(value='0 / 1556'))), …" 387 | ] 388 | }, 389 | "metadata": {}, 390 | "output_type": "display_data" 391 | } 392 | ], 393 | "source": [ 394 | "df['text'] = df['text'].parallel_apply(clean_string)" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "id": "944afb97", 400 | "metadata": {}, 401 | "source": [ 402 | "# transform to tfidf" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 15, 408 | "id": "20c6ee8a", 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 16, 418 | "id": "da8eb1c8", 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "comparison_df = df[~df['lead_index'].isnull()].copy()\n", 423 | "comparison_df['lead_index'] = comparison_df['lead_index'].astype(int)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 17, 429 | "id": "efe6a73f", 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "vectorizer = TfidfVectorizer()\n", 434 | "\n", 435 | "tfidf = vectorizer.fit_transform(comparison_df['text'])" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "id": "553b3709", 441 | "metadata": {}, 442 | "source": [ 443 | "# perform cosine distance computation" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 18, 449 | "id": "ff680833", 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "from sklearn.metrics.pairwise import cosine_similarity" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 19, 459 | "id": "16170ec3", 460 | "metadata": {}, 461 | "outputs": [ 462 | { 463 | "name": "stdout", 464 | "output_type": "stream", 465 | "text": [ 466 | "742 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 467 | ] 468 | } 469 | ], 470 | "source": [ 471 | "%%timeit\n", 472 | "cosine_similarity(tfidf[:1000], tfidf[:1000])" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 20, 478 | "id": "ab05f8f9", 479 | "metadata": {}, 480 | "outputs": [ 481 | { 482 | "name": "stdout", 483 | "output_type": "stream", 484 | "text": [ 485 | "2.67 s ± 65.1 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" 486 | ] 487 | } 488 | ], 489 | "source": [ 490 | "%%timeit\n", 491 | "cosine_similarity(tfidf[:2000], tfidf[:2000])" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 21, 497 | "id": "f88154f9", 498 | "metadata": {}, 499 | "outputs": [ 500 | { 501 | "name": "stdout", 502 | "output_type": "stream", 503 | "text": [ 504 | "76 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 505 | ] 506 | } 507 | ], 508 | "source": [ 509 | "%%timeit\n", 510 | "(tfidf[:999].multiply(tfidf[1:1000]).sum(axis=1) / \\\n", 511 | " np.sqrt(tfidf[:999].multiply(tfidf[:999]).sum(axis=1)) / np.sqrt(tfidf[1:1000].multiply(tfidf[1:1000]).sum(axis=1)))" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": 22, 517 | "id": "b9800088", 518 | "metadata": {}, 519 | "outputs": [ 520 | { 521 | "name": "stdout", 522 | "output_type": "stream", 523 | "text": [ 524 | "172 ms ± 4.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 525 | ] 526 | } 527 | ], 528 | "source": [ 529 | "%%timeit\n", 530 | "(tfidf[:1999].multiply(tfidf[1:2000]).sum(axis=1) / \\\n", 531 | " np.sqrt(tfidf[:1999].multiply(tfidf[:1999]).sum(axis=1)) / np.sqrt(tfidf[1:2000].multiply(tfidf[1:2000]).sum(axis=1)))" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": 23, 537 | "id": "1aae8383", 538 | "metadata": {}, 539 | "outputs": [ 540 | { 541 | "name": "stdout", 542 | "output_type": "stream", 543 | "text": [ 544 | "1.51 s ± 43.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 545 | ] 546 | } 547 | ], 548 | "source": [ 549 | "%%timeit\n", 550 | "(tfidf[:-1].multiply(tfidf[1:]).sum(axis=1) / \\\n", 551 | " np.sqrt(tfidf[:-1].multiply(tfidf[:-1]).sum(axis=1)) / np.sqrt(tfidf[1:].multiply(tfidf[1:]).sum(axis=1)))" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": 24, 557 | "id": "436dca7e", 558 | "metadata": {}, 559 | "outputs": [ 560 | { 561 | "data": { 562 | "text/plain": [ 563 | "matrix([[0.98227666],\n", 564 | " [0.99666753],\n", 565 | " [0.58170126],\n", 566 | " ...,\n", 567 | " [0.56405527],\n", 568 | " [0.93125178],\n", 569 | " [0.94108083]])" 570 | ] 571 | }, 572 | "execution_count": 24, 573 | "metadata": {}, 574 | "output_type": "execute_result" 575 | } 576 | ], 577 | "source": [ 578 | "(tfidf[:-1].multiply(tfidf[1:]).sum(axis=1) / \\\n", 579 | " np.sqrt(tfidf[:-1].multiply(tfidf[:-1]).sum(axis=1)) / np.sqrt(tfidf[1:].multiply(tfidf[1:]).sum(axis=1)))" 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": 25, 585 | "id": "9c0a7c7f", 586 | "metadata": {}, 587 | "outputs": [ 588 | { 589 | "data": { 590 | "text/plain": [ 591 | "(18301, 88229)" 592 | ] 593 | }, 594 | "execution_count": 25, 595 | "metadata": {}, 596 | "output_type": "execute_result" 597 | } 598 | ], 599 | "source": [ 600 | "tfidf.shape" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "id": "6ebed1f8", 607 | "metadata": {}, 608 | "outputs": [], 609 | "source": [] 610 | } 611 | ], 612 | "metadata": { 613 | "kernelspec": { 614 | "display_name": "Python 3 (ipykernel)", 615 | "language": "python", 616 | "name": "python3" 617 | }, 618 | "language_info": { 619 | "codemirror_mode": { 620 | "name": "ipython", 621 | "version": 3 622 | }, 623 | "file_extension": ".py", 624 | "mimetype": "text/x-python", 625 | "name": "python", 626 | "nbconvert_exporter": "python", 627 | "pygments_lexer": "ipython3", 628 | "version": "3.10.10" 629 | } 630 | }, 631 | "nbformat": 4, 632 | 
"nbformat_minor": 5 633 | } 634 | -------------------------------------------------------------------------------- /scripts/fetch_10k_docs.py: -------------------------------------------------------------------------------- 1 | # for analysis of which methods of fetching the file are best, 2 | # see analyses/optimize_doc_fetch.ipynb 3 | 4 | from sec_api import RenderApi 5 | import os 6 | import pandas as pd 7 | import json 8 | from pandarallel import pandarallel 9 | import requests 10 | 11 | # params 12 | with open('config.json', 'r') as f: 13 | c = json.load(f) 14 | destination_dir = os.path.join(c['DATA_DIR'], '10k_raw') 15 | api_key = c['SEC_API_KEY'] 16 | render_api = RenderApi(api_key=api_key) 17 | 18 | def download_filing(url, destination_file, skip_existing=True, skip_ixbrl=True, engine='requests'): 19 | """ 20 | Given a SEC EDGAR 10-K URL, will download the HTML file to the local destination file. 21 | If skip, then will not re-download file if the local destination file already exists. 22 | 23 | Args 24 | --------- 25 | engine : one of {'sec_api', 'requests'} 26 | sec_api : uses the proprietary SEC API to fetch the filing 27 | requests : uses the open source SEC EDGAR database and requests package to fetch the filing 28 | """ 29 | try: 30 | destination_dir = os.path.dirname(destination_file) 31 | 32 | if not os.path.isdir(destination_dir): 33 | os.makedirs(destination_dir) 34 | 35 | if skip_existing and os.path.exists(destination_file): 36 | print('⏭️ already exists, skipping download: {url}'.format( 37 | url=url)) 38 | return 39 | 40 | # do not download iXBRL output 41 | if skip_ixbrl: 42 | url = url.replace('ix?doc=/', '') 43 | if engine == 'sec_api': 44 | file_content = render_api.get_filing(url) 45 | elif engine == 'requests': 46 | file_content = requests.get(url).text 47 | 48 | with open(destination_file, "w") as f: 49 | f.write(file_content) 50 | 51 | except: 52 | print('❌ download failed: {url}'.format( 53 | url=url)) 54 | 55 | def pandarallel_wrapper(metadata): 56 | """ 57 | Basic wrapper of the download_filing functionality to allow for pandarallel optimization 58 | """ 59 | ticker = metadata['ticker'] 60 | url = metadata['filingUrl'] 61 | file_name = url.split("/")[-1] 62 | destination_file = os.path.join(destination_dir, ticker, file_name) 63 | 64 | download_filing(url, destination_file) 65 | 66 | def pandarallel_wrapper_legacy(metadata): 67 | """ 68 | Basic wrapper of the download_filing functionality to allow for pandarallel optimization 69 | """ 70 | ticker = metadata['TICKER'] 71 | url = 'https://www.sec.gov/Archives/' + metadata['EDGAR_LINK'] 72 | file_name = url.split("/")[-1] 73 | destination_file = os.path.join(destination_dir, ticker, file_name) 74 | 75 | download_filing(url, destination_file) 76 | 77 | if __name__ == '__main__': 78 | # this code follows the SEC API documentation 79 | # to fetch the files of the 10-Ks of the 3000 companies of the Russell 3000 80 | 81 | # The SEC API documentation can be found at 82 | # https://sec-api.io/docs/sec-filings-render-api/python-example 83 | import json 84 | import os 85 | import pandas as pd 86 | 87 | number_of_workers = 8 88 | 89 | # read URL table 90 | metadata = pd.read_csv(os.path.join(c['DATA_DIR'], 'metadata.csv')) 91 | metadata_legacy = pd.read_csv(os.path.join(c['DATA_DIR'], 'metadata_2017.csv')) 92 | 93 | # only download the data from russell 3000 today 94 | metadata = metadata_legacy[metadata_legacy['TICKER'].isin(metadata['ticker'])] 95 | 96 | # download multiple files in parallel 97 | 
pandarallel.initialize(progress_bar=True, nb_workers=number_of_workers, verbose=0) 98 | 99 | # uncomment to run a quick sample and download 50 filings 100 | # sample = metadata.head(50) 101 | # sample.parallel_apply(pandarallel_wrapper, axis=1) 102 | 103 | # download all filings 104 | metadata.parallel_apply(pandarallel_wrapper_legacy, axis=1) 105 | 106 | print('✅ Download completed') 107 | -------------------------------------------------------------------------------- /scripts/fetch_10k_urls.py: -------------------------------------------------------------------------------- 1 | # for analysis of which methods of fetching the URL are best, see the analysis 2 | # at analyses/optimize_url_fetch.ipynb 3 | # bulk download capabilities at https://www.sec.gov/edgar/sec-api-documentation 4 | 5 | from sec_api import QueryApi 6 | import pandas as pd 7 | 8 | 9 | def create_batches(tickers=[], max_length_of_batch=100): 10 | """ 11 | # create batches of tickers: [[A,B,C], [D,E,F], ...] 12 | # a single batch has a maximum of max_length_of_batch tickers 13 | """ 14 | batches = [[]] 15 | 16 | for ticker in tickers: 17 | if len(batches[len(batches)-1]) == max_length_of_batch: 18 | batches.append([]) 19 | 20 | batches[len(batches)-1].append(ticker) 21 | 22 | return batches 23 | 24 | def download_10K_metadata(query_api, tickers, start_year, end_year): 25 | """ 26 | Given a list of tickers, this function will return a Pandas Dataframe 27 | with the ticker, CIK, and URLs to download the 10-K files from the SEC database. 28 | 29 | Example Output 30 | ------------ 31 | ticker cik formType filedAt filingUrl 32 | 0 AVGO 1730168 10-K 2021-12-17T16:42:51-05:00 https://www.sec.gov/Archives/edgar/data/173016... 33 | 1 AMAT 6951 10-K 2021-12-17T16:14:51-05:00 https://www.sec.gov/Archives/edgar/data/6951/0... 34 | 2 DE 315189 10-K 2021-12-16T11:39:34-05:00 https://www.sec.gov/Archives/edgar/data/315189... 35 | 3 ADI 6281 10-K 2021-12-03T16:02:52-05:00 https://www.sec.gov/Archives/edgar/data/6281/0... 
36 | 37 | Args 38 | ------------ 39 | query_api : the QueryApi client used to run the URL-fetching queries 40 | tickers : a list of tickers for which to fetch URLs 41 | start_year : the start year to begin fetching 10-K URLs 42 | end_year : the final year for which 10-K URLs should be fetched 43 | """ 44 | print('✅ Starting download process') 45 | 46 | # create ticker batches, with 100 tickers per batch 47 | batches = create_batches(tickers) 48 | frames = [] 49 | 50 | for year in range(start_year, end_year + 1): 51 | for batch in batches: 52 | tickers_joined = ', '.join(batch) 53 | ticker_query = 'ticker:({})'.format(tickers_joined) 54 | 55 | query_string = ''' 56 | {ticker_query} 57 | AND filedAt:[{start_year}-01-01 TO {end_year}-12-31] 58 | AND formType:"10-K" 59 | AND NOT formType:"10-K/A" 60 | AND NOT formType:NT'''.format( 61 | ticker_query=ticker_query, start_year=year, end_year=year) 62 | 63 | query = { 64 | "query": {"query_string": { 65 | "query": query_string, 66 | "time_zone": "America/New_York" 67 | }}, 68 | "from": "0", 69 | "size": "200", 70 | "sort": [{"filedAt": {"order": "desc"}}] 71 | } 72 | 73 | response = query_api.get_filings(query) 74 | 75 | filings = response['filings'] 76 | 77 | metadata = list(map(lambda f: {'ticker': f['ticker'], 78 | 'cik': f['cik'], 79 | 'formType': f['formType'], 80 | 'filedAt': f['filedAt'], 81 | 'filingUrl': f['linkToFilingDetails']}, filings)) 82 | 83 | df = pd.DataFrame.from_records(metadata) 84 | 85 | frames.append(df) 86 | 87 | print('✅ Downloaded metadata for year', year) 88 | 89 | result = pd.concat(frames) 90 | return result 91 | 92 | 93 | if __name__ == '__main__': 94 | # this code follows the SEC API documentation 95 | # to fetch the URLs of the 10-Ks of the 3000 companies of the Russell 3000 96 | 97 | # The SEC API documentation can be found at 98 | # https://sec-api.io/docs/sec-filings-render-api/python-example 99 | import json 100 | import os 101 | import pandas as pd 102 | 103 | # params 104 | with open('config.json', 'r') as f: 105 | c = json.load(f) 106 | destination_dir = c['DATA_DIR'] 107 | destination_file = os.path.join(destination_dir, 'metadata.csv') 108 | api_key = c['SEC_API_KEY'] 109 | 110 | # read the cleaned Russell 3000 constituents file 111 | holdings = pd.read_csv(os.path.join(c['DATA_DIR'], 'russell_3000/russell-3000-clean.csv')) 112 | 113 | query_api = QueryApi(api_key=api_key) 114 | tickers = list(holdings['Ticker']) 115 | 116 | metadata = download_10K_metadata( 117 | query_api=query_api, tickers=tickers, start_year=2019, end_year=2023) 118 | 119 | number_metadata_downloaded = len(metadata) 120 | print('✅ Download completed. Metadata downloaded for {} filings.'.format( 121 | number_metadata_downloaded)) 122 | 123 | print('Writing to file {}'.format(destination_file)) 124 | metadata.to_csv(destination_file, index=False) 125 | print('Completed writing file. Exiting script.') 126 | -------------------------------------------------------------------------------- /scripts/fetch_russell_3000.py: -------------------------------------------------------------------------------- 1 | # this code fetches the 3000 constituents of the Russell 3000 2 | # by downloading and cleaning the iShares Russell 3000 holdings CSV 3 | 4 | # the general structure follows the SEC API documentation at 5 | # https://sec-api.io/docs/sec-filings-render-api/python-example 6 | 7 | # import libraries 8 | from sec_api import QueryApi, RenderApi 9 | import requests 10 | import os 11 | import json 12 | 13 | with open('config.json', 'r') as f: 14 | c = json.load(f) 15 | 16 | # params 17 | destination_dir = os.path.join(c['DATA_DIR'], 'russell_3000') 18 | raw_data_path = os.path.join(destination_dir, 'russell-3000.csv') 19 | clean_data_path = os.path.join(destination_dir, 'russell-3000-clean.csv') 20 | url = c['RUSSELL_3000_URL'] 21 | 22 | #### 23 | # Download the 3000 constituents of the Russell 3000 24 | #### 25 | response = requests.get(url) 26 | 27 | with open(raw_data_path, 'wb') as f: 28 | f.write(response.content) 29 | 30 | # cleaning the iShares CSV file 31 | import csv 32 | 33 | with open(raw_data_path, 'r', encoding='utf-8') as f: 34 | reader = csv.reader(f) 35 | rows = list(reader) 36 | 37 | empty_row_indices = [i for i in range(len(rows)) if (len(rows[i]) == 0 or '\xa0' in rows[i])] 38 | 39 | print('Empty rows:', empty_row_indices) 40 | 41 | start = empty_row_indices[0] + 1 42 | end = empty_row_indices[1] 43 | cleaned_rows = rows[start:end] 44 | 45 | with open(clean_data_path, 'w', newline='') as f: 46 | writer = csv.writer(f) 47 | writer.writerows(cleaned_rows) 48 | --------------------------------------------------------------------------------
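
A closing note on the cosine-distance computation benchmarked in `analyses/optimize_cosine_distance.ipynb`: the fast expression there compares row i of the TF-IDF matrix with row i + 1, so it assumes the rows are already sorted by ticker, item, and filing date, and any pair that straddles two different ticker/item groups would still have to be masked out afterwards (that is what the notebook's `index`/`lead_index` bookkeeping is for). The sketch below is a minimal, hedged restatement of that expression as a reusable function; the function name and the zero-denominator guard are illustrative additions, not part of the repository.

```python
# Illustrative sketch (not part of the repository): the row-wise cosine
# computation timed in analyses/optimize_cosine_distance.ipynb, wrapped in a
# reusable function. It compares row i of a sparse TF-IDF matrix with row
# i + 1, as the notebook's tfidf[:-1] / tfidf[1:] expressions do, and adds a
# guard against all-zero rows.
import numpy as np
from scipy.sparse import csr_matrix


def consecutive_cosine(tfidf: csr_matrix) -> np.ndarray:
    """Cosine similarity between each row of `tfidf` and the row after it."""
    a, b = tfidf[:-1], tfidf[1:]
    num = np.asarray(a.multiply(b).sum(axis=1)).ravel()      # row-wise dot products
    norm_a = np.sqrt(np.asarray(a.multiply(a).sum(axis=1)).ravel())
    norm_b = np.sqrt(np.asarray(b.multiply(b).sum(axis=1)).ravel())
    denom = norm_a * norm_b
    return np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
```

The appeal of this form, as the notebook's timings suggest, is that it only touches the n - 1 consecutive pairs instead of building the full n-by-n similarity matrix that `cosine_similarity` returns (roughly 76 ms versus 742 ms for 1,000 rows in the recorded runs), which is what makes scoring every filing pair in the corpus tractable.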