├── .gitignore ├── 10-K Text Analysis.pdf ├── 10-K _ 10-Q Text Analysis Eric He - eriqqc.pdf ├── Documentation ├── .Rhistory ├── Building-Document-Frequency-Matrices.Rmd ├── Building-Sentiment-Dictionary.Rmd ├── Calculating-Distance-Returns.Rmd ├── Calculating-Financial-Returns.Rmd ├── Calculating-NumProp-Returns.Rmd ├── Calculating-Sentiment-Returns.Rmd ├── Cleaning-Raw-Filings.md ├── Creating-Master-Index.Rmd ├── Script_10Q.R ├── Sentiment-Scores-Algo.Rmd ├── Text-Distance-Algo-10Q.R ├── Text-Distance-Algo.Rmd ├── documentation.html └── parsing-script.R ├── Graphs ├── cosine_distance_returns.png ├── jaccard_distance_returns.png ├── negative_sentiment_quantile_returns.png └── positive_sentiment_quantile_returns.png ├── README.md ├── Sample-Data └── Apple-2016-Cleaned.txt ├── analyses ├── optimize_10k_cleaning.ipynb ├── optimize_cosine_distance.ipynb └── optimize_doc_fetch.ipynb └── scripts ├── fetch_10k_docs.py ├── fetch_10k_urls.py └── fetch_russell_3000.py /.gitignore: -------------------------------------------------------------------------------- 1 | config.json 2 | data/ 3 | __pycache__/ 4 | */.ipynb_checkpoints -------------------------------------------------------------------------------- /10-K Text Analysis.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/10-K Text Analysis.pdf -------------------------------------------------------------------------------- /10-K _ 10-Q Text Analysis Eric He - eriqqc.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/10-K _ 10-Q Text Analysis Eric He - eriqqc.pdf -------------------------------------------------------------------------------- /Documentation/.Rhistory: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Documentation/.Rhistory -------------------------------------------------------------------------------- /Documentation/Building-Document-Frequency-Matrices.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "CorpusCreation" 3 | author: "Eric He" 4 | date: "July 17, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | This document details the process for taking the cleaned financial statements and parsing them into quanteda corpuses. The code is done for all the financial statements for a given year, and is repeated for each of the four years. 13 | 14 | We begin by loading in the required libraries. 15 | 16 | ```{r} 17 | library("quanteda") 18 | library("stringr") 19 | library("dplyr") 20 | library("purrr") 21 | ``` 22 | 23 | The first 6033 filings are filed in the year 2013, while filings in year 2014 range from 6034 to 11882. This can be recomputed by looking at the masterIndex in the folder. 24 | 25 | ```{r} 26 | year2013 <- c(1:6033) 27 | year2014 <- c(6034:11882) 28 | year2015 <- c(11883:17467) 29 | year2016 <- c(17468:22631) 30 | StopWordsList <- "StopWordsList.txt" %>% 31 | readLines() %>% 32 | str_split(pattern = " ") 33 | ``` 34 | 35 | In this code, each financial statement for a given year is loaded in one by one. 
The filing is split into its component sections, and each section becomes a document that is added to the corpus. That is to say, each row of the corpus corresponds to one of the 20 sections of a financial statement. Each financial statement should add 20 sections to the corpus. Unfortunately, the cleaning algorithm which tagged these sections within the financial statement is not perfect, and many false positives must be dealt with. False negatives are at this point impossible to catch, which is why the tagging procedure of the cleaning algorithm has been designed to be more liberal with its tagging. 36 | 37 | ```{r} 38 | years <- year2013 # replace year here 39 | bigcorpus <- corpus("") 40 | for (i in years){ 41 | text <- paste("parsed/", i, ".txt", sep = "") %>% 42 | readLines() %>% 43 | str_split(pattern = "(?s)(?i)°Item", simplify = TRUE) %>% 44 | str_replace_all(pattern = "(?s)<.*?>", replacement = "") %>% 45 | str_replace_all(pattern = "(?s) +", replacement = " ") 46 | text <- text[text != ""] 47 | names(text) <- paste(i, str_extract(text, pattern = "[1234567890ABC]+")) 48 | text <- corpus(text) 49 | bigcorpus <- bigcorpus + text 50 | rm(text) 51 | print(i) 52 | } 53 | save(bigcorpus, file = "RawCorpus2016.RData") 54 | ``` 55 | 56 | The bigcorpus object holds every identified section of every financial statement in the given year as a document. However, most of these documents are garbage! Many documents are actually snippets of the Table of Contents of various filings, which were mistakenly tagged by the cleaning algorithm as section texts. Unfortunately, the heterogeneity of filings makes this incorrect tagging difficult to avoid. Additionally, some documents are simply excerpts of other sections that are mistagged. 57 | 58 | Thus, the goal of the code below is to weed out these documents which are not desired. A good way to delete the documents which are actually part of the Table of Contents is to remove any document with less than 100 words. These documents are both likely to be not real sections, or contain only boiler plate language when the section is not relevant to the company (e.g. Mine Disclosures for a fast food company; the fast food company owns no mines!). At any rate, having less than 100 words is not very useful for our text analysis. 59 | 60 | ```{r} 61 | # sufficient word count 62 | #wordcount <- ntoken(bigcorpus) 63 | #load("wordcount2014.RData") 64 | enoughWords <- wordcount < 100 65 | 66 | # is not section 67 | names <- docnames(bigcorpus) 68 | real.section.letter <- !is.na(str_extract(names, pattern = "[1234567890]+ [ABCDEFGHIJKLMNOPQRSTUVWXYZ]")) 69 | 70 | # is duplicate 71 | real.section.toc <- !is.na((str_extract(names, pattern = "\\."))) 72 | index.numbers <- unique(str_extract(names[real.section.toc], pattern = "[1234567890]+ ")) 73 | all.names <- str_extract(names, pattern = "[0-9]+ ") 74 | has.duplicates <- is.element(all.names, index.numbers) 75 | real.section.duplicate <- !real.section.toc + has.duplicates > 1 76 | 77 | #filter out all trash 78 | the.trash <- real.section.duplicate + real.section.letter + enoughWords > 0 79 | 80 | docvars(bigcorpus, "Subset") <- the.trash 81 | bigcorpus <- corpus_subset(bigcorpus, subset = Subset == FALSE, select = FALSE) 82 | ``` 83 | 84 | Here, we extract relevant information from the remaining documents. The data of interest is as follows: 85 | 86 | 1) The section the document is of. 87 | 2) The filing the document belongs to. 88 | 3) The date during which the financial statement was filed with the SEC. 
89 | 4) The index number of the section within its new, subsetted corpus. 90 | 5) The number of words in the filing. 91 | 92 | ```{r} 93 | #remove the duplicate names 94 | names[real.section.toc] <- names[real.section.toc] %>% 95 | str_extract(pattern = ".*(?=\\.)") 96 | subsetted.names <- names[the.trash == FALSE] 97 | #extract section of filing 98 | section <- subsetted.names %>% 99 | str_extract(pattern = "(?<= ).*") %>% 100 | as.factor() 101 | #extract filing of section 102 | filing <- subsetted.names %>% 103 | str_extract(pattern = ".*(?= )") 104 | #extract date during which 10-k was filed 105 | date.filed <- masterIndex$DATE_FILED[as.numeric(filing)] 106 | #extract word count 107 | word.count <- wordcount[the.trash == FALSE] 108 | #combine into meta dataframe 109 | metadata <- data_frame(index = 1:ndoc(bigcorpus), subsetted.names, section, filing, date.filed, word.count) 110 | 111 | #remove the clutter 112 | rm(real.section.duplicate, has.duplicates, all.names, index.numbers, real.section.toc, real.section.letter, wordcount, names, date.filed, word.count, subsetted.names, section, filing, the.trash, enoughWords) 113 | ``` 114 | 115 | The meta dataframe is saved for future analysis. 116 | 117 | ```{r} 118 | save(metadata, file = "metadata2015.RData") 119 | write.csv(metadata, file = "metadata2015.csv") 120 | ``` 121 | 122 | It is at this stage that we create a document-frequency matrix (DFM) of the various documents. Each row of the DFM corresponds to a different section, while each column corresponds to a different word which appeared in any of the various documents. Cell i,j corresponds to the count of word j in document i. Punctuation and numbers are removed and do not appear in the DFM, since they are not actually words. 123 | 124 | ```{r} 125 | bigdfm <- dfm(bigcorpus, remove = StopWordsList, remove_punct = TRUE, remove_numbers = TRUE) %>% 126 | tfidf(scheme_tf="logave") 127 | ``` 128 | 129 | ```{r} 130 | save(bigcorpus, file = "parsedCorpus2016.RData") 131 | ``` -------------------------------------------------------------------------------- /Documentation/Building-Sentiment-Dictionary.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Making the Dictionary" 3 | author: "Eric He" 4 | date: "June 16, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("dplyr") 14 | dictionary <- read.csv("masterDictionary.csv") 15 | ``` 16 | 17 | We follow the exact specifications laid out by Luo. Positive and Interesting words from the dictionary are classified as Positive, while Negative, Uncertain, Litigious, Constraining, and Superfluous all are collapsed under the umbrella classification "Negative". 
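The `ï..Word` column selected in the next chunk is the dictionary's `Word` column with a UTF-8 byte-order mark mangled into its name by `read.csv`. A minimal alternative, assuming the file really is UTF-8 with a BOM (which the mangled name suggests), is to declare the encoding when the dictionary is read in, which yields a clean `Word` column:

```{r}
# Assumption: masterDictionary.csv is UTF-8 with a byte-order mark.
# Reading it this way gives a "Word" column instead of "ï..Word".
dictionary <- read.csv("masterDictionary.csv", fileEncoding = "UTF-8-BOM")
```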
18 | 19 | ```{r} 20 | positive <- dictionary %>% 21 | filter(Positive > 0 | Interesting > 0) %>% 22 | select(ï..Word) 23 | negative <- dictionary %>% 24 | filter(Negative > 0 | Uncertainty > 0 | Litigious > 0 | Constraining > 0 | Superfluous > 0) %>% 25 | select(ï..Word) 26 | write.csv(negative, row.names= FALSE, file = "negative.csv") 27 | write.csv(positive, row.names = FALSE, file = "positive.csv") 28 | ``` -------------------------------------------------------------------------------- /Documentation/Calculating-Distance-Returns.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "TextDistanceAlgo" 3 | author: "Eric He" 4 | date: "August 19, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("quanteda") 14 | library("dplyr") 15 | library("purrr") 16 | library("stringr") 17 | library("readtext") 18 | library("reshape2") 19 | library("magrittr") 20 | library("ggplot2") 21 | library("gridExtra") 22 | ``` 23 | 24 | ```{r} 25 | masterIndex <- read.csv("masterIndex.csv") 26 | masterIndex$filing %<>% as.character 27 | tickers <- readLines("tickers.txt") # use unique(masterIndex) if we wish to scale this across multiple years, or keep tickers.txt updated 28 | StopWordsList <- readLines("StopWordsList.txt") 29 | hpr <- read.csv("annualReturns.csv", na.strings = "NA") 30 | sections <- c("1", "1A", "3", "4", "7", "8", "9", "9A") 31 | section_names <- c("Business", "Risk Factors", "Legal Proceedings", "Mine Safety Disclosures", "MDA of Financial Conditions", "Financial Statements and Supplementary Data", "Changes on Accounting and Financial Disclosure", "Controls and Procedures") 32 | ``` 33 | 34 | The new section extractor algorithm is much more flexible and robust than the old algorithm. Before, we had an issue where we frequently got the table of contents of a financial statement masquerading as an extra 20 sections. Other, much more rare situations include problematic formatting by the financial statements which would tag normal words as section headings. This problem is solved by choosing the hit with the maximum word count and making it the target section. There are only a few niche cases were a mistagged string is longer than the actual section, and almost no situations where the table of contents is longer than the section itself. 35 | 36 | ```{r} 37 | # 1 statement, 1 section 38 | section_extractor <- function(statement, section){ 39 | name <- statement$doc_id # needs to be atomic vector 40 | pattern <- paste0("°Item ", section, "[^\\w|\\d]", ".*?°") # exclude any Item X where X is followed by any unexpected alphanumeric character 41 | # needs simplify=TRUE because FALSE returns 1-element list of multiple vectors which map() cannot handle. May file issue with stringr 42 | section_hits <- str_extract_all(statement, pattern, simplify=TRUE) 43 | #if section_hits is empty then we need function to skip this one 44 | if (is_empty(section_hits) == TRUE){ 45 | return("empty") 46 | } 47 | word_counts <- map_int(section_hits, ntoken) 48 | max_hit <- which(word_counts == max(word_counts)) 49 | max_filing <- section_hits[[max_hit[length(max_hit)]]] # select the "filing" with the largest word count. If two hits have the same word count, choose the last one. (Following the idea that the first one is probably ToC; it doesn't really matter which one we pick because we're tossing it out anyway because it definitely doesnt make word count.) 
50 | names(max_filing) <- paste(name, section, sep = "_") 51 | return(max_filing) 52 | } 53 | 54 | # multiple statements, 1 section. We use logs to discount frequently occurring words. No inverse document frequency is used because it is not useful for comparisons of documents which should be functionally equivalent (idf is used to differentiate documents with entirely different subject matters, since it highlights differences in word choice. In two risk factor sections, it is easy to discount words like "risk" or "dispute", which are still important words to us). 55 | 56 | section_dfm <- function(statements_list, section, min_words, tf){ 57 | map(statements_list, section_extractor, section=section) %>% 58 | map(corpus) %>% 59 | reduce(`+`) %>% 60 | dfm(tolower=TRUE, remove=StopWordsList, remove_punct=TRUE) %>% 61 | dfm_subset(., rowSums(.) > min_words) %>% 62 | when(tf==TRUE ~ tf(., scheme="log"), 63 | ~ .) 64 | } 65 | # the when statement looks like black magic but it is the functional version of an if-else statement. 66 | # syntax denoted by formula (~) object, LHS of ~ is the condition, RHS is the return. 67 | # If tfidf (the tfidf parameter) == TRUE then return tfidf(., scheme_tf="logave") (the tfidf function) 68 | # Else (no condition) then return . (return the input as the output (do nothing)) 69 | 70 | # multiple statements, multiple sections, 1 ticker. No reduce() since each filing section needs its own corpus 71 | filing_dfm <- function(sections, filings_list, min_words, tf){ 72 | map(sections, section_dfm, statements_list=filings_list, min_words=min_words, tf=tf) 73 | } 74 | 75 | # perform distance analysis on the processed dfm_list. The dist_parser function tries to wrangle with the distObj which textstat_simil returns. It returns a dataframe showing the cosine distance between each pair of filings. 76 | dist_parser <- function(distObj){ 77 | melted_frame <- as.matrix(distObj) %>% 78 | {. * upper.tri(.)} %>% # lambda function to extract the upper triangular part of b, since the diagonal is the identity distance and dist object is symmetric 79 | melt(varnames = c("previous_filing", "current_filing"), value.name = "distance") %>% # comparison filing is always filed before current_filing when using upper triangular 80 | filter(distance != 0) # cut out identity and duplicates. 
This assumes that no two legitimate documents are completely orthogonal, which I think is reasonable 81 | melted_frame$previous_filing %<>% str_extract(pattern = ".*?(?=\\.)") # cut out the text name/section 82 | melted_frame$current_filing %<>% str_extract(pattern = ".*?(?=\\.)") # to allow for easy joining with financial returns 83 | return(melted_frame) 84 | } 85 | 86 | filing_similarity <- function(dfm_list, method){ 87 | map(dfm_list, textstat_simil, method=method) %>% 88 | map(dist_parser)} 89 | 90 | index_filing_filterer <- function(ticker, index){ 91 | filter(index, TICKER == ticker) %>% 92 | pull(filing) # pull the file name, which in this case is just the filing number 93 | } 94 | index_year_filterer <- function(ticker, index){ 95 | filter(index, TICKER == ticker) %>% 96 | pull(YEAR) 97 | } 98 | 99 | plotter <- function(dfObj, section, nquantiles = 5){ 100 | dfObj %>% 101 | na.omit %>% 102 | mutate(quantile = ntile(distance, n = nquantiles)) %>% 103 | group_by(quantile) %>% 104 | summarise(average_return = mean(as.numeric(returns)) - 1) %>% 105 | ggplot(aes(x = quantile, y = average_return)) + 106 | geom_bar(stat = "identity") + 107 | theme(axis.title.y=element_blank()) + 108 | #coord_cartesian(ylim = c(-.2, .3)) + 109 | xlab(section) 110 | } 111 | ``` 112 | 113 | Do 1 ticker end to end. 114 | 115 | ```{r} 116 | index_filing_filterer <- function(ticker, index){ 117 | filter(index, TICKER == ticker) %>% 118 | pull(filing) # pull the file name, which in this case is just the filing number 119 | } 120 | index_year_filterer <- function(ticker, index){ 121 | filter(index, TICKER == ticker) %>% 122 | pull(YEAR) 123 | } 124 | 125 | file_path <- "parsed/" 126 | file_type <- ".txt" 127 | 128 | the_ticker <- "AAPL" 129 | 130 | file_names <- index_filing_filterer(the_ticker, masterIndex) 131 | file_years <- index_year_filterer(the_ticker, masterIndex) 132 | 133 | file_locations <- paste0(file_path, file_names, file_type) 134 | 135 | filings_list <- map(file_locations, readtext) 136 | 137 | years <- paste0("X", file_years[-1]) # financial return columns start with X bc colnames cannot only be numbers; chop out the first year 138 | returns_df <- filter(hpr, ticker == the_ticker) %>% # calling the_ticker ticker gives rise to namespace issues 139 | select(years) %>% 140 | t() 141 | colnames(returns_df) <- "returns" 142 | returns_df %<>% cbind(previous_filing = file_names[-length(file_names)], current_filing = file_names[-1], .) %>% 143 | as.data.frame(stringsAsFactors=FALSE) 144 | 145 | similarity_list <- filing_dfm(sections=sections, filings_list=filings_list, min_words=100, tf=TRUE) %>% 146 | filing_similarity("cosine") # jaccard distance doesnt need any term weightings, although tf=TRUE doesnt make any difference 147 | # similarity_list[map_dbl(similarity_list, nrow) == 0] <- NULL # not needed as we rbind 148 | 149 | distance_returns_df2 <- similarity_list %>% # MOVE THE FILTERING HERE, MORE FLEXIBLE THIS WAY 150 | map(right_join, returns_df) 151 | #map(~ data_frame(distance=., returns=returns_vector)) 152 | ``` 153 | Do all tickers. 154 | 155 | No financial returns data for 2033 of the tickers. 
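The count quoted above can be checked directly (equivalent to the tabulation in the next chunk):

```{r}
# Tickers in the master index with no row in the annual returns data
sum(!(tickers %in% hpr$ticker))
```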
156 | 157 | ```{r} 158 | tickers %in% hpr$ticker %>% table 159 | ``` 160 | 161 | ```{r} 162 | file_path <- "parsed/" 163 | file_type <- ".txt" 164 | 165 | distance_returns_calculator <- function(the_ticker){ 166 | file_names <- index_filing_filterer(the_ticker, masterIndex) 167 | file_years <- index_year_filterer(the_ticker, masterIndex) 168 | 169 | if (length(file_names) <= 1 | length(file_years) <= 1){ 170 | empty_list <- map(rep(NA, times = length(sections)), ~data_frame(previous_filing = ., current_filing = ., distance = ., returns = .)) 171 | print(paste("Only one filing available for ticker", the_ticker)) 172 | return(empty_list) 173 | } # companies with only one year of data cannot be used for a distance analysis. We return a data frame of NAs so that the rbind() can be smooth 174 | 175 | years <- paste0("X", file_years[-1]) # chop out the first year 176 | returns_df <- filter(hpr, ticker == the_ticker) %>% # calling the_ticker ticker gives rise to namespace issues 177 | select(years) %>% 178 | t() 179 | if (is_empty(returns_df) == TRUE){ 180 | empty_list <- map(rep(NA, times = length(sections)), ~data_frame(previous_filing = ., current_filing = ., distance = ., returns = .)) 181 | print(paste("No financial data for ticker", the_ticker)) 182 | return(empty_list) 183 | } 184 | 185 | file_locations <- paste0(file_path, file_names, file_type) 186 | 187 | filings_list <- map(file_locations, readtext) 188 | 189 | colnames(returns_df) <- "returns" 190 | returns_df %<>% cbind(previous_filing = file_names[-length(file_names)], current_filing = file_names[-1], .) %>% # assumes no broken years in the data; broken years can occur if there is no financial filing located one year, or financial data no existerino for a year. Obviously this occurring would be very nonstandard 191 | as.data.frame(stringsAsFactors = FALSE) 192 | 193 | similarity_list <- filing_dfm(sections=sections, filings_list=filings_list, min_words=100, tf=FALSE) %>% 194 | filing_similarity("jaccard") 195 | 196 | distance_returns_df <- similarity_list %>% 197 | map(right_join, returns_df, by = c("previous_filing", "current_filing")) 198 | print(paste("Successfully mapped distance scores to financial returns for ticker", the_ticker)) 199 | return(distance_returns_df) 200 | } 201 | 202 | distance_returns_df <- map(tickers, distance_returns_calculator) %>% 203 | pmap(rbind) # pmap takes the list of lists and rbinds each of the elements within the nested list together. 
its black magic 204 | 205 | save(distance_returns_df, file = "jaccard_distance_returns_df.RData") 206 | 207 | distance_returns_plot <- distance_returns_df %>% 208 | map2(sections, ~plotter(dfObj = .x, section = .y)) %>% 209 | arrangeGrob(grobs = ., ncol = 4, top = "Average Yearly Financial Returns By Jaccard Distance Quantile", left = "Average Yearly Return", bottom = "Filing Section") 210 | 211 | ggsave(distance_returns_plot, file = "jaccard_distance_returns.png", width = 7, height = 5) 212 | plot(distance_returns_plot) 213 | ``` 214 | -------------------------------------------------------------------------------- /Documentation/Calculating-Financial-Returns.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "FinancialReturns" 3 | author: "Eric He" 4 | date: "July 17, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("dplyr") 14 | library("lubridate") 15 | library("tidyr") 16 | library("purrr") 17 | ``` 18 | 19 | Financial returns data was downloaded from the Center for Research in Security Prices (CRSP) daily stock returns database from the Wharton Research Data Services (WRDS) account. Data from 01/01/2000 to 12/31/2016 was downloaded; every single ticker and data category was downloaded, just in case. Date was selected to be in MM/DD/YYYY form. 20 | 21 | Market cap data was also downloaded from the CRSP database provided by WRDS, same beginning and end date, every single ticker and data category again. Date is in YYYYMMDD form. 22 | 23 | Load in the data. 24 | 25 | ```{r} 26 | master_index <- read.csv("masterIndex.csv") 27 | full_returns <- read.csv("../Data/Financial Data/Trimmed_Returns_Raw.csv") 28 | full_cap <- read.csv("market_cap.csv") %>% 29 | select(permno = PERMNO, num_shares = SHROUT, date = SHRSDT) 30 | ``` 31 | 32 | ```{r} 33 | head(masterIndex) 34 | ``` 35 | 36 | ```{r} 37 | head(full_returns) 38 | ``` 39 | 40 | ```{r} 41 | head(full_cap) 42 | ``` 43 | 44 | Select the relevant columns: holding period returns (RET), tickers (TICKER), company name (COMNAM), delisting return (DLRET), Shares Observation End Date (shrenddt) 45 | 46 | End product: correct monthly returns 47 | 48 | ```{r} 49 | full_returns <- select(full_returns, permco = PERMCO, date = date, return = RET, ticker = TICKER, delisting_return = DLRET) %>% 50 | filter(ticker %in% master_index$TICKER) %>% # only need tickers for which we have filings to compare with 51 | mutate(delisting_return = as.numeric(as.character(delisting_return))) %>% # 52 | mutate(delisting_return = replace(delisting_return, is.na(delisting_return) == TRUE, 0)) %>% # replace NA with 0 53 | mutate(delisting_return = delisting_return + 1) %>% # so we can add 1 so we can multiply 54 | mutate(return = as.numeric(as.character(return))) %>% # change return from factor to numeric, characters which CRSP uses to represent missing data or point towards delisting return gets changed to NA 55 | mutate(return = replace(return, is.na(return) == TRUE, 0)) %>% 56 | mutate(return = return + 1) %>% 57 | filter(is.na(return) == FALSE) %>% 58 | mutate(return = return * delisting_return) %>% 59 | mutate(date = mdy(date)) %>% 60 | select(-delisting_return) 61 | ``` 62 | 63 | Given a ticker and a date from the master Index, devise a formula to calculate holding period returns for variable time intervals. 
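The convention relied on throughout is that `full_returns$return` already holds gross returns (1 + r, with the delisting gross return multiplied in), so the holding period return over any window is simply the product of the rows falling in that window. A toy illustration with made-up numbers:

```{r}
# Illustration only (values are made up): three daily gross returns
gross_returns <- c(1.010, 0.995, 1.020)
prod(gross_returns)     # gross holding period return, roughly 1.025
prod(gross_returns) - 1 # net return over the window, roughly 2.5%
```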
64 | 65 | ```{r} 66 | the_ticker <- "AAPL" 67 | the_year <- 2013 68 | the_month <- 10 69 | the_day <- 30 70 | 71 | interval_length <- 3 72 | interval_type <- "month" 73 | 74 | start_date <- ymd("20131030") 75 | 76 | end_date <- the_date %m+% months(3) 77 | 78 | returns_calculator <- function(the_ticker, filing_date, interval_length = 5, interval_type = "day"){ 79 | start_date_beginning <- filing_date %>% # start date is first non-weekend day before the filing date. 80 | ymd(.) %m-% days(3) 81 | start_date_candidates <- seq(start_date_beginning, ymd(filing_date) %m-% days(1), by = "days") 82 | start_date <- start_date_candidates[which(format(start_date_candidates, "%u") %in% c(1:5))[length(which(format(start_date_candidates, "%u") %in% c(1:5)))]] # pick out first weekday in the sequence, %u formats date object into numeric weekday 83 | end_date <- filing_date %>% 84 | ymd %>% # have to ram through ymd again because map() uses [[]] which messes with lubridate object type 85 | when(interval_type == "day" ~ . %m+% days(interval_length), 86 | interval_type == "month" ~ . %m+% months(interval_length), 87 | interval_type == "year" ~ . %m+% years(interval_length), 88 | ~ stop(print(interval_type))) 89 | date_sequence <- seq(start_date, end_date, by = "days") # see above comment 90 | date_returns <- filter(full_returns, as.character(ticker) == as.character(the_ticker), date %in% date_sequence) # tickers must be converted to character or else throws a factor level error 91 | if (nrow(date_returns) == 0){ 92 | empty_df <- data_frame(hpr = NA) 93 | print(paste("No financial data for ticker", the_ticker)) 94 | return(empty_df) # if no financial data then we would like to make that clear 95 | } # when statement does not work to break the function! :( 96 | hpr <- summarise(date_returns, hpr = prod(return)) 97 | print(paste("Calculated hpr for ticker", the_ticker, "and date", filing_date)) 98 | return(hpr)} 99 | ``` 100 | 101 | ```{r} 102 | returns_5_days <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator) 103 | returns_1_month <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator, interval_length = 1, interval_type = "month") 104 | returns_3_months <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator, interval_length = 3, interval_type = "month") 105 | returns_6_months <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator, interval_length = 6, interval_type = "month") 106 | returns_1_year <- map2_df(masterIndex$TICKER, masterIndex$DATE_FILED, returns_calculator, interval_length = 1, interval_type = "year") 107 | ``` 108 | 109 | The volatility calculator uses the original trimmed raw returns data to calculate, since it is more accurate about which days are trading days! 110 | 111 | ```{r} 112 | volatility_calculator <- function(the_ticker, filing_date, interval_length = 1, interval_type = "year"){ 113 | end_date <- filing_date %>% ymd 114 | start_date <- end_date %>% 115 | ymd %>% 116 | when(interval_type == "day" ~ . %m-% days(interval_length), 117 | interval_type == "month" ~ . %m-% months(interval_length), 118 | interval_type == "year" ~ . 
%m-% years(interval_length), 119 | ~ stop(print(interval_type))) 120 | date_sequence <- seq(start_date, end_date, by = "days") 121 | date_sequence <- date_sequence[-which(format(date_sequence, "%u") %in% c(6,7))] 122 | date_returns <- filter(full_returns, as.character(ticker) == as.character(the_ticker), date %in% date_sequence) 123 | if (nrow(date_returns) == 0){ 124 | empty_df <- data_frame(sd = NA) 125 | print(paste("No financial data for ticker", the_ticker)) 126 | return(empty_df) 127 | } 128 | volatility <- summarise(date_returns, sd = sd(return)) 129 | print(paste("Calculated volatility for ticker", the_ticker, "and date", filing_date)) 130 | return(volatility) 131 | } 132 | ``` 133 | 134 | ```{r} 135 | vol_1year <- map2_df(master_index$TICKER, master_index$DATE_FILED, volatility_calculator) 136 | ``` 137 | 138 | ```{r} 139 | master_index <- cbind(master_index, vol_1year) 140 | master_index %<>% mutate( 141 | adj_ret5d = ret5d / (sqrt(5) * sd), 142 | adj_ret1m = ret1m / (sqrt(30) * sd), 143 | adj_ret3m = ret3m / (sqrt(90) * sd), 144 | adj_ret1y = ret1y / (sqrt(365) * sd) 145 | ) 146 | ``` 147 | 148 | ```{r} 149 | masterIndex <- cbind(masterIndex, ret5d = returns_5_days$hpr, ret1m = returns_1_month$hpr, ret3m = returns_3_months$hpr, ret6m = returns_6_months$hpr, ret1y = returns_1_year$hpr) 150 | write.csv(master_index, "../Data/Master Index/masterIndex.csv", row.names = FALSE) 151 | ``` 152 | 153 | -------------------------------------------------------------------------------- /Documentation/Calculating-NumProp-Returns.Rmd: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Documentation/Calculating-NumProp-Returns.Rmd -------------------------------------------------------------------------------- /Documentation/Calculating-Sentiment-Returns.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "sentimentAnalysisAlgo" 3 | author: "Eric He" 4 | date: "August 7, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("quanteda") 14 | library("dplyr") 15 | library("purrr") 16 | library("ggplot2") 17 | library("reshape2") 18 | library("gridExtra") 19 | ``` 20 | 21 | Read in the data relevant for all four years. 22 | 23 | ```{r} 24 | negative <- readLines("negative.txt") 25 | positive <- readLines("positive.txt") 26 | sections <- c("1", "1A", "1B", "2", "3", "4", "5", "6", "7", "7A", "8", "9", "9A", "9B", "10", "11", "12", "13", "14", "15") 27 | hpr <- read.csv("annualReturns.csv", na.strings = "NA") 28 | years <- c("X2012", "X2013", "X2014", "X2015", "X2016") 29 | years <- c(2013, 2014, 2015, 2016) 30 | masterIndex <- read.csv("masterIndex.csv") 31 | tickers <- data_frame(filing = c(1:nrow(masterIndex)), ticker = masterIndex$TICKER) 32 | ``` 33 | 34 | Build two functions, one which subsets the bigdfm according to section and the other recording the indices if the subsetted. 
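For orientation, this is how the two helpers defined in the next chunk would be called for a single section, assuming `metadata` and `bigdfm` from the corpus-building step are in the workspace (the `"1A"` label is only an example):

```{r}
# Hypothetical usage of the helpers defined below
risk_filing_ids <- indices_tfidf(metadata, cond = "1A")     # filing numbers of Item 1A documents
risk_weights <- subset_tfidf(metadata, bigdfm, cond = "1A") # tf-idf weighted dfm of those documents
```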
35 | 36 | ```{r} 37 | indices_tfidf <- function(meta_df, cond){ 38 | index_num <- filter(meta_df, section == cond) %>% 39 | select(filing) 40 | return(index_num) 41 | } 42 | subset_tfidf <- function(meta_df, dfmobj, cond){ 43 | index_num <- filter(meta_df, section == cond) %>% 44 | select(index) 45 | weightsdfm <- dfmobj[index_num$index,] %>% 46 | tfidf(scheme_tf = "logave") 47 | return(weightsdfm) 48 | } 49 | ``` 50 | 51 | Build a function which computes the sentiment scores. Recall that the sentiment score is the weights of the words in the sentiment dictionary, divided by the total weight of the words in the document. 52 | 53 | ```{r} 54 | dfmstat_ratio <- function(dfmObj, dict){ 55 | dfm_select(dfmObj, features = dict) %>% 56 | rowSums(.) / rowSums(dfmObj) 57 | } 58 | ``` 59 | 60 | We would like to do the algorithm for the positive, negative, positive-negative sentiment scorings. This requires the positive and negative dictionaries. 61 | Then we would like to do them for all four years. 62 | 63 | ```{r} 64 | indices_list <- map(sections, indices_tfidf, meta_df = metadata) 65 | weightsdfm_list <- map(sections, subset_tfidf, meta_df = metadata, dfmobj = bigdfm) 66 | sentiment_list <- weightsdfm_list %>% 67 | map(dfmstat_ratio, dict = negative) 68 | ``` 69 | 70 | Now do this for every year's worth of data. 71 | 72 | ```{r} 73 | years <- c(2013:2016) 74 | path_to_metadata <- "metadata" 75 | metadata_type <- ".csv" 76 | path_to_parsedDFM <- "parsedBigDfm" 77 | parsedDFM_type <- ".RData" 78 | 79 | weighter <- function(year){ 80 | metadata <- paste(path_to_metadata, year, metadata_type, sep = "") %>% 81 | read.csv() 82 | load(paste(path_to_parsedDFM, year, parsedDFM_type, sep = "")) 83 | 84 | indices_list <- map(sections, indices_tfidf, meta_df = metadata) 85 | weightsdfm_list <- map(sections, subset_tfidf, meta_df = metadata, dfmobj = bigdfm) 86 | 87 | save(indices_list, file = paste("indices_list_", year, ".RData", sep = "")) 88 | save(weightsdfm_list, file = paste("weightsdfm_list_", year, ".RData", sep = "")) 89 | } 90 | 91 | map(years, weighter) # The weighter function returns nothing which is fine. 92 | ``` 93 | 94 | Get every sentiment list for every year. 95 | 96 | ```{r} 97 | years <- c(2013:2016) 98 | path_to_weightsdfm_list <- "weightsdfm_list_" 99 | weightsdfm_list_type <- ".RData" 100 | path_to_indices_list <- "indices_list_" 101 | indices_list_type <- ".RData" 102 | 103 | returns_quantiler <- function(sentiment_list, index_list, return_df, n_quantiles){ 104 | sentiment_list %>% 105 | map(ntile, n = n_quantiles) %>% 106 | map(as.factor) %>% 107 | map(~ data_frame("quantile" = .)) %>% 108 | map2(index_list, cbind) %>% 109 | map(left_join, y = tickers, by = "filing") %>% 110 | map(left_join, y = return_df, by = "ticker") %>% 111 | map(group_by, quantile) %>% 112 | map(summarise, average_return = mean(return, na.rm = TRUE)) %>% 113 | map(transmute, relative_return = average_return / min(average_return) - 1) %>% # quantile 1 is row 1, 2 is row 2, etc. Higher quantile means higher sentiment value. 
114 | do.call(what = cbind) 115 | } 116 | 117 | sentiment_returns_algo <- function(year, sentiment_dict){ 118 | load(paste(path_to_weightsdfm_list, year, weightsdfm_list_type, sep = "")) # loads into global 119 | load(paste(path_to_indices_list, year, indices_list_type, sep = "")) # loads into global 120 | return_df <- select(hpr, ticker, return = paste("X", year, sep="")) # creates in function namespace so MUST be specified in returns_by_quantile function which would otherwise look in global for return_df 121 | sentiment_list <- weightsdfm_list %>% 122 | map(dfmstat_ratio, dict = sentiment_dict) 123 | returns_by_quantile <- returns_quantiler(sentiment_list, index_list = indices_list, return_df = return_df, n_quantiles = 5) 124 | names(returns_by_quantile) <- paste("section", sections, sep = "") 125 | return(returns_by_quantile) 126 | } 127 | 128 | negative_returns_by_quantile <- map(years, sentiment_returns_algo, sentiment_dict = negative) %>% 129 | reduce(`+`) / length(years) # require hpr, sections, masterIndex, tickers, dfmstat_ratio 130 | 131 | positive_returns_by_quantile <- map(years, sentiment_returns_algo, sentiment_dict = positive) %>% 132 | reduce(`+`) / length(years) 133 | ``` 134 | 135 | Make the ggplot graph. 136 | 137 | ```{r} 138 | quantile <- c(1:5) # little hacky but we need x to be the five quantiles; a proper melt would be the correct method, I think, but i cant get it to work. The problem is that this relies on the ggplot mapping 1 to quantile 1, 2 to quantile 2, etc. which works for c(1:5) but does not work for c("one", "two", "three", "four", "five"), for example, which will sort the character vector alphabetically 139 | 140 | nm <- names(negative_returns_by_quantile) 141 | negative_sentiment_quantile_returns <- map(nm, ~ ggplot(data = negative_returns_by_quantile, aes_string(x = quantile, y = .)) + 142 | geom_bar(stat = "identity") + 143 | theme(axis.title.x=element_blank(), 144 | axis.text.x=element_blank(), 145 | axis.ticks.x=element_blank())) %>% 146 | arrangeGrob(grobs = ., ncol = 5) 147 | ggsave(negative_sentiment_quantile_returns, file = "negative_sentiment_quantile_returns.png") 148 | 149 | nm <- names(positive_returns_by_quantile) 150 | positive_sentiment_quantile_returns <- map(nm, ~ ggplot(data = positive_returns_by_quantile, aes_string(x = quantile, y = .)) + 151 | geom_bar(stat = "identity") + 152 | theme(axis.title.x=element_blank(), 153 | axis.text.x=element_blank(), 154 | axis.ticks.x=element_blank())) %>% 155 | arrangeGrob(grobs = ., ncol = 5) 156 | ggsave(positive_sentiment_quantile_returns, file = "positive_sentiment_quantile_returns.png") 157 | ``` 158 | 159 | 160 | ```{r} 161 | returns_by_quantile <- sentiment_list %>% 162 | map(ntile, n = 5) %>% 163 | map(as.factor) %>% 164 | map(~ data_frame("quantile" = .)) %>% 165 | map2(indices_list, cbind) %>% 166 | map(left_join, y = tickers, by = "filing") %>% 167 | map(left_join, y = return_df, by = "ticker") %>% 168 | map(group_by, quantile) %>% 169 | map(summarise, average_return = mean(return, na.rm = TRUE)) %>% 170 | map(transmute, relative_return = average_return / min(average_return) - 1) %>% 171 | do.call(what = cbind) 172 | names(returns_by_quantile) <- rep(paste("section", sections)) 173 | 174 | quantiles <- map2(indices_list, quantile_negative, cbind) 175 | returns_by_quantile <- map(quantiles, group_by, return) %>% 176 | map(summarise, average_return = mean(X2012, na.rm = TRUE)) 177 | ``` 178 | -------------------------------------------------------------------------------- 
/Documentation/Cleaning-Raw-Filings.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Documentation/Cleaning-Raw-Filings.md -------------------------------------------------------------------------------- /Documentation/Creating-Master-Index.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Creating-Master-Index" 3 | author: "Eric He" 4 | date: "August 4, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("edgar") 14 | library("dplyr") 15 | ``` 16 | 17 | ```{r} 18 | getMasterIndex(c(2013, 2014, 2015, 2016) 19 | ``` 20 | 21 | ```{r} 22 | load("Master Index/2013master.Rda") 23 | index.2013 <- year.master 24 | index.2013 <- filter(index.2013, FORM_TYPE == "10-K") 25 | load("Master Index/2014master.Rda") 26 | index.2014 <- year.master 27 | index.2014 <- filter(index.2014, FORM_TYPE == "10-K") 28 | load("Master Index/2015master.Rda") 29 | index.2015 <- year.master 30 | index.2015 <- filter(index.2015, FORM_TYPE == "10-K") 31 | load("Master Index/2016master.Rda") 32 | index.2016 <- year.master 33 | index.2016 <- filter(index.2016, FORM_TYPE == "10-K") 34 | rm(year.master) 35 | index <- rbind(index.2013, index.2014, index.2015, index.2016) 36 | rm(index.2013, index.2014, index.2015, index.2016) 37 | ``` 38 | 39 | We have the text data needed to begin our analysis; however, we want to be able to access the financial data corresponding to the companies we are analyzing. This is done by linking the CIK values given by the SEC for companies to their stock tickers. The CIK-ticker mapping was downloaded from https://www.valuespreadsheet.com/iedgar/, and lists every publicly traded company's CIK, ticker, SIC code (which denotes the industry the company is classified as being in), and the exchange where that company's stock trades. 40 | 41 | ```{r} 42 | tickers <- "cik-ticker.csv" %>% 43 | read.csv() %>% 44 | rename(TICKER = ticker, CIK = cik, SIC = sic, EXCHANGE = exchange, HITS = hits) 45 | ``` 46 | 47 | Let's join the two datasets together. 48 | 49 | ```{r} 50 | index <- left_join(index, tickers, by = "CIK") %>% 51 | select(-name) 52 | ``` 53 | 54 | ```{r} 55 | write.csv(index, "masterIndex.csv") 56 | ``` -------------------------------------------------------------------------------- /Documentation/Script_10Q.R: -------------------------------------------------------------------------------- 1 | library("dplyr") 2 | library("lubridate") 3 | library("tidyr") 4 | library("purrr") 5 | 6 | returns_calculator <- function(the_ticker, filing_date, interval_length = 5, interval_type = "day"){ 7 | start_date_beginning <- filing_date %>% # start date is first non-weekend day before the filing date. 8 | ymd(.) %m-% days(3) 9 | start_date_candidates <- seq(start_date_beginning, ymd(filing_date) %m-% days(1), by = "days") 10 | start_date <- start_date_candidates[which(format(start_date_candidates, "%u") %in% c(1:5))[length(which(format(start_date_candidates, "%u") %in% c(1:5)))]] # pick out first weekday in the sequence, %u formats date object into numeric weekday 11 | end_date <- filing_date %>% 12 | ymd %>% # have to ram through ymd again because map() uses [[]] which messes with lubridate object type 13 | when(interval_type == "day" ~ . %m+% days(interval_length), 14 | interval_type == "month" ~ . 
%m+% months(interval_length), 15 | interval_type == "year" ~ . %m+% years(interval_length), 16 | ~ stop(print(interval_type))) 17 | date_sequence <- seq(start_date, end_date, by = "days") # see above comment 18 | date_returns <- filter(full_returns, as.character(ticker) == as.character(the_ticker), date %in% date_sequence) # tickers must be converted to character or else throws a factor level error 19 | if (nrow(date_returns) == 0){ 20 | empty_df <- data_frame(hpr = NA) 21 | print(paste("No financial data for ticker", the_ticker)) 22 | return(empty_df) # if no financial data then we would like to make that clear 23 | } # when statement does not work to break the function! :( 24 | hpr <- summarise(date_returns, hpr = prod(return)) 25 | print(paste("Calculated hpr for ticker", the_ticker, "and date", filing_date)) 26 | return(hpr)} 27 | 28 | volatility_calculator <- function(the_ticker, filing_date, interval_length = 1, interval_type = "year"){ 29 | end_date <- filing_date %>% ymd 30 | start_date <- end_date %>% 31 | ymd %>% 32 | when(interval_type == "day" ~ . %m-% days(interval_length), 33 | interval_type == "month" ~ . %m-% months(interval_length), 34 | interval_type == "year" ~ . %m-% years(interval_length), 35 | ~ stop(print(interval_type))) 36 | date_sequence <- seq(start_date, end_date, by = "days") 37 | date_sequence <- date_sequence[-which(format(date_sequence, "%u") %in% c(6,7))] 38 | date_returns <- filter(full_returns, as.character(ticker) == as.character(the_ticker), date %in% date_sequence) 39 | if (nrow(date_returns) == 0){ 40 | empty_df <- data_frame(sd = NA) 41 | print(paste("No financial data for ticker", the_ticker)) 42 | return(empty_df) 43 | } 44 | volatility <- summarise(date_returns, sd = sd(return)) 45 | print(paste("Calculated volatility for ticker", the_ticker, "and date", filing_date)) 46 | return(volatility) 47 | } 48 | 49 | master_index <- read.csv("master_index_10Q.csv") 50 | 51 | full_returns <- read.csv("../Data/Financial Data/Trimmed_Returns_Raw.csv") 52 | full_returns <- select(full_returns, permco = PERMCO, date = date, return = RET, ticker = TICKER, delisting_return = DLRET) %>% 53 | filter(ticker %in% master_index$ticker) %>% # only need tickers for which we have filings to compare with 54 | mutate(delisting_return = as.numeric(as.character(delisting_return))) %>% # 55 | mutate(delisting_return = replace(delisting_return, is.na(delisting_return) == TRUE, 0)) %>% # replace NA with 0 56 | mutate(delisting_return = delisting_return + 1) %>% # so we can add 1 so we can multiply 57 | mutate(return = as.numeric(as.character(return))) %>% # change return from factor to numeric, characters which CRSP uses to represent missing data or point towards delisting return gets changed to NA 58 | mutate(return = replace(return, is.na(return) == TRUE, 0)) %>% 59 | mutate(return = return + 1) %>% 60 | filter(is.na(return) == FALSE) %>% 61 | mutate(return = return * delisting_return) %>% 62 | mutate(date = mdy(date)) %>% 63 | select(-delisting_return) 64 | 65 | ret5d_10Q <- map2_df(master_index$ticker, master_index$date, returns_calculator) 66 | vol_10Q <- map2_df(master_index$ticker, master_index$date, volatility_calculator) 67 | 68 | master_index <- cbind(master_index, ret5d_10Q, vol_10Q) 69 | 70 | write.csv(master_index, "master_index_10Q_.csv") -------------------------------------------------------------------------------- /Documentation/Sentiment-Scores-Algo.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 
"sentimentAnalysisAlgo" 3 | author: "Eric He" 4 | date: "August 7, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ```{r} 13 | library("quanteda") 14 | library("dplyr") 15 | library("purrr") 16 | library("ggplot2") 17 | library("reshape2") 18 | library("gridExtra") 19 | ``` 20 | 21 | Read in the data relevant for all four years. 22 | 23 | The desired sentiment analysis algorithm computes sentiment scores of every 24 | 25 | ```{r} 26 | negative <- readLines("negative.txt") 27 | positive <- readLines("positive.txt") 28 | sections <- c("1", "1A", "1B", "2", "3", "4", "5", "6", "7", "7A", "8", "9", "9A", "9B", "10", "11", "12", "13", "14", "15") 29 | masterIndex <- read.csv("masterIndex.csv") 30 | tickers <- unique(masterIndex$TICKER) 31 | ``` 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | Build two functions, one which subsets the bigdfm according to section and the other recording the indices if the subsetted. 42 | 43 | ```{r} 44 | indices_tfidf <- function(meta_df, cond){ 45 | index_num <- filter(meta_df, section == cond) %>% 46 | select(filing) 47 | return(index_num) 48 | } 49 | subset_tfidf <- function(meta_df, dfmobj, cond){ 50 | index_num <- filter(meta_df, section == cond) %>% 51 | select(index) 52 | weightsdfm <- dfmobj[index_num$index,] %>% 53 | tfidf(scheme_tf = "logave") 54 | return(weightsdfm) 55 | } 56 | ``` 57 | 58 | Build a function which computes the sentiment scores. Recall that the sentiment score is the weights of the words in the sentiment dictionary, divided by the total weight of the words in the document. 59 | 60 | ```{r} 61 | dfmstat_ratio <- function(dfmObj, dict){ 62 | dfm_select(dfmObj, pattern = dict) %>% 63 | rowSums(.) / rowSums(dfmObj) 64 | } 65 | ``` 66 | 67 | We would like to do the algorithm for the positive, negative, positive-negative sentiment scorings. This requires the positive and negative dictionaries. 68 | Then we would like to do them for all four years. 69 | 70 | The indices_list sorts the metadata frame by section. 71 | 72 | ```{r} 73 | indices_list <- map(sections, indices_tfidf, meta_df = metadata) 74 | weightsdfm_list <- map(sections, subset_tfidf, meta_df = metadata, dfmobj = bigdfm) 75 | sentiment_list <- weightsdfm_list %>% 76 | map(dfmstat_ratio, dict = negative) 77 | ``` 78 | 79 | Now do this for every year's worth of data. The weighter function is used to save the indices and weightsdfm for every section in a given year. It is mapped across all four years of data. 80 | 81 | ```{r} 82 | years <- c(2013:2016) 83 | path_to_metadata <- "metadata" 84 | metadata_type <- ".csv" 85 | path_to_parsedDFM <- "parsedBigDfm" 86 | parsedDFM_type <- ".RData" 87 | 88 | weighter <- function(year){ 89 | metadata <- paste(path_to_metadata, year, metadata_type, sep = "") %>% 90 | read.csv() 91 | load(paste(path_to_parsedDFM, year, parsedDFM_type, sep = "")) 92 | 93 | indices_list <- map(sections, indices_tfidf, meta_df = metadata) 94 | weightsdfm_list <- map(sections, subset_tfidf, meta_df = metadata, dfmobj = bigdfm) 95 | 96 | save(indices_list, file = paste("indices_list_", year, ".RData", sep = "")) 97 | save(weightsdfm_list, file = paste("weightsdfm_list_", year, ".RData", sep = "")) 98 | } 99 | 100 | map(years, weighter) # The weighter function returns nothing which is fine. 101 | ``` 102 | 103 | Get every sentiment list for every year. 
104 | 105 | The returns quantiler splits the data of each section into five quantiles and calculates the mean returns for a portfolio which invests equally in all companies of the quantile stock. 106 | 107 | ```{r} 108 | years <- c(2013:2016) 109 | path_to_weightsdfm_list <- "weightsdfm_list_" 110 | weightsdfm_list_type <- ".RData" 111 | path_to_indices_list <- "indices_list_" 112 | indices_list_type <- ".RData" 113 | 114 | #TODO: unmap this cancer function 115 | 116 | returns_quantiler <- function(sentiment_list, index_list, return_df, n_quantiles){ 117 | sentiment_list %>% 118 | map(ntile, n = n_quantiles) %>% 119 | map(as.factor) %>% 120 | map(~ data_frame("quantile" = .)) %>% 121 | map2(index_list, cbind) %>% 122 | map(left_join, y = tickers, by = "filing") %>% 123 | map(left_join, y = return_df, by = "ticker") %>% 124 | map(group_by, quantile) %>% 125 | map(summarise, average_return = mean(return, na.rm = TRUE)) %>% 126 | map(transmute, relative_return = average_return / min(average_return) - 1) %>% # quantile 1 is row 1, 2 is row 2, etc. Higher quantile means higher sentiment value. 127 | do.call(what = cbind) 128 | } 129 | 130 | sentiment_returns_algo <- function(year, sentiment_dict){ 131 | load(paste(path_to_weightsdfm_list, year, weightsdfm_list_type, sep = "")) # loads into global 132 | load(paste(path_to_indices_list, year, indices_list_type, sep = "")) # loads into global 133 | sentiment_list <- weightsdfm_list %>% 134 | map(dfmstat_ratio, dict = sentiment_dict) 135 | # names(returns_by_quantile) <- paste("section", sections, sep = "") 136 | return(sentiment_list) 137 | } 138 | 139 | load("weightsdfm_list_2013.RData") 140 | 141 | metadata <- map(paste0("metadata", years, ".csv"), read.csv) %>% 142 | reduce(rbind) 143 | 144 | negative_sentiment <- map(years, sentiment_returns_algo, sentiment_dict = negative) %>% 145 | flatten %>% 146 | reduce(append) 147 | 148 | #TODO: clean up this cancer code 149 | 150 | names <- negative_sentiment %>% names %>% 151 | str_extract(pattern = ".*?(?=\\.)") 152 | 153 | dummy <- data_frame(subsetted.names = names, negative_sentiment = negative_sentiment) 154 | joined <- left_join(dummy, metadata, by = "subsetted.names") %>% 155 | group_by(filing, section) %>% 156 | filter(word.count == max(word.count)) %>% 157 | distinct(filing, section, .keep_all = TRUE) %>% 158 | select(filing, section, negative_sentiment) %>% 159 | spread(key = section, value = negative_sentiment) %>% 160 | rename(sec1sent = `1`, sec1Asent = `1A`, sec1Bsent = `1B`, sec2sent = `2`, sec3sent = `3`, sec4sent = `4`, sec5sent = `5`, sec6sent = `6`, sec7sent = `7`, sec7Asent = `7A`, sec8sent = `8`, sec9sent = `9`, sec9Asent = `9`, sec10sent = `10`, sec11sent = `11`, sec12sent = `12`, sec13sent = `13`, sec14sent = `14`, sec15sent = `15`) 161 | 162 | masterIndex <- left_join(masterIndex, joined, by = "filing") 163 | masterIndex <- read.csv("masterIndex.csv") 164 | 165 | a <- spread(joined, key = section, value = negative_sentiment) 166 | 167 | joined2 <- group_by(joined, filing, section) %>% 168 | + filter(word.count == max(word.count)) 169 | 170 | positive_returns_by_quantile <- map(years, sentiment_returns_algo, sentiment_dict = positive) %>% 171 | reduce(`+`) / length(years) 172 | ``` 173 | Load all the metadata together and rbind 174 | Load all the names from the negative_sentiment and then perform a join operation. 175 | Things in the metadata that are not in sentiment scores are not actual sections; e.g. they have sections tagged as section 16, 2014, 4A, etc. 
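A quick way to inspect those leftover rows, using the objects built above (`dummy` holds the named sentiment scores, `metadata` the combined per-section index):

```{r}
# Metadata rows with no matching sentiment score are the mis-tagged "sections"
mistagged <- anti_join(metadata, dummy, by = "subsetted.names")
table(mistagged$section) # e.g. spurious tags such as "16", "2014", "4A"
```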
176 | 177 | Make the ggplot graph. 178 | 179 | ```{r} 180 | quantile <- c(1:5) # little hacky but we need x to be the five quantiles; a proper melt would be the correct method, I think, but i cant get it to work. The problem is that this relies on the ggplot mapping 1 to quantile 1, 2 to quantile 2, etc. which works for c(1:5) but does not work for c("one", "two", "three", "four", "five"), for example, which will sort the character vector alphabetically 181 | 182 | nm <- names(negative_returns_by_quantile) 183 | negative_sentiment_quantile_returns <- map(nm, ~ ggplot(data = negative_returns_by_quantile, aes_string(x = quantile, y = .)) + 184 | geom_bar(stat = "identity") + 185 | theme(axis.title.x=element_blank(), 186 | axis.text.x=element_blank(), 187 | axis.ticks.x=element_blank())) %>% 188 | arrangeGrob(grobs = ., ncol = 5) 189 | ggsave(negative_sentiment_quantile_returns, file = "negative_sentiment_quantile_returns.png") 190 | 191 | nm <- names(positive_returns_by_quantile) 192 | positive_sentiment_quantile_returns <- map(nm, ~ ggplot(data = positive_returns_by_quantile, aes_string(x = quantile, y = .)) + 193 | geom_bar(stat = "identity") + 194 | theme(axis.title.x=element_blank(), 195 | axis.text.x=element_blank(), 196 | axis.ticks.x=element_blank())) %>% 197 | arrangeGrob(grobs = ., ncol = 5) 198 | ggsave(positive_sentiment_quantile_returns, file = "positive_sentiment_quantile_returns.png") 199 | ``` 200 | 201 | 202 | 203 | 204 | ```{r} 205 | #this is test code and should be ignored 206 | 207 | returns_by_quantile <- sentiment_list %>% 208 | map(ntile, n = 5) %>% 209 | map(as.factor) %>% 210 | map(~ data_frame("quantile" = .)) %>% 211 | map2(indices_list, cbind) %>% 212 | map(left_join, y = tickers, by = "filing") %>% 213 | map(left_join, y = return_df, by = "ticker") %>% 214 | map(group_by, quantile) %>% 215 | map(summarise, average_return = mean(return, na.rm = TRUE)) %>% 216 | map(transmute, relative_return = average_return / min(average_return) - 1) %>% 217 | do.call(what = cbind) 218 | names(returns_by_quantile) <- rep(paste("section", sections)) 219 | 220 | quantiles <- map2(indices_list, quantile_negative, cbind) 221 | returns_by_quantile <- map(quantiles, group_by, return) %>% 222 | map(summarise, average_return = mean(X2012, na.rm = TRUE)) 223 | ``` 224 | -------------------------------------------------------------------------------- /Documentation/Text-Distance-Algo-10Q.R: -------------------------------------------------------------------------------- 1 | library("quanteda") 2 | library("dplyr") 3 | library("readr") 4 | library("purrr") 5 | library("stringr") 6 | library("readtext") 7 | library("reshape2") 8 | library("magrittr") 9 | library("ggplot2") 10 | library("gridExtra") 11 | 12 | masterIndex <- read_csv("master_index_10Q_withFile.csv") 13 | tickers <- unique(masterIndex$ticker) # use unique(masterIndex) if we wish to scale this across multiple years, or keep tickers.txt updated 14 | StopWordsList <- readLines("../Data/Text Data/StopWordsList.txt") 15 | sections <- c("1", "1A", "2", "3", "4", "5", "6") 16 | file_path <- "../Data/Text Data/10-Q/" 17 | file_type <- ".txt" 18 | 19 | section_extractor <- function(statement, section){ 20 | name <- statement$doc_id 21 | pattern <- paste0("(?i)°Item ", section, "[^\\w|\\d]", ".*?°") 22 | section_hits <- str_extract_all(statement, pattern=pattern, simplify=TRUE) 23 | if (is_empty(section_hits) == TRUE){ 24 | empty_vec <- "empty" 25 | names(empty_vec) <- paste(name, section, sep = "_") 26 | print(paste("No hits for 
section", section, "of filing", name)) 27 | return(empty_vec) 28 | } 29 | word_counts <- map_int(section_hits, ntoken) 30 | max_hit <- which(word_counts == max(word_counts)) 31 | max_filing <- section_hits[[max_hit[length(max_hit)]]] 32 | if (max(word_counts) < 250 & str_detect(max_filing, pattern = "(?i)(incorporated by reference)|(incorporated herein by reference)") == TRUE){ 33 | empty_vec <- "empty" 34 | names(empty_vec) <- paste(name, section, sep = "_") 35 | print(paste("Section", section, "of filing", name, "incorporates by reference its information")) 36 | return(empty_vec) 37 | } 38 | names(max_filing) <- paste(name, section, sep = "_") 39 | return(max_filing) 40 | } 41 | 42 | section_dfm <- function(statements_list, section, min_words, tf){ 43 | map(statements_list, section_extractor, section=section) %>% 44 | map(corpus) %>% 45 | reduce(`+`) %>% 46 | dfm(tolower=TRUE, remove=StopWordsList, remove_punct=TRUE) %>% 47 | dfm_subset(., rowSums(.) > min_words) %>% 48 | when(tf==TRUE ~ tf(., scheme="log"), 49 | ~ .) 50 | } 51 | 52 | filing_dfm <- function(sections, filings_list, min_words, tf){ 53 | map(sections, section_dfm, statements_list=filings_list, min_words=min_words, tf=tf) 54 | } 55 | 56 | dist_parser <- function(distObj, section){ 57 | melted_frame <- as.matrix(distObj) %>% 58 | {. * upper.tri(.)} %>% 59 | melt(varnames = c("previous_filing", "filing"), value.name = paste0("sec", section, "dist")) 60 | melted_frame$previous_filing %<>% str_extract(pattern = ".*?(?=\\.)") 61 | melted_frame$filing %<>% str_extract(pattern = ".*?(?=\\.)") 62 | return(melted_frame) 63 | } 64 | 65 | filing_similarity <- function(dfm_list, method){ 66 | map(dfm_list, textstat_simil, method=method) %>% 67 | map(dist_parser)} 68 | 69 | index_filing_filterer <- function(the_ticker, index){ 70 | filter(index, ticker == the_ticker) %>% 71 | arrange(date) %>% 72 | pull(filing) 73 | } 74 | 75 | distance_returns_calculator <- function(the_ticker){ 76 | file_names <- index_filing_filterer(the_ticker, masterIndex) 77 | 78 | if (length(file_names) <= 1){ 79 | empty_list <- data_frame() 80 | print(paste("Only one filing available for ticker", the_ticker)) 81 | return(empty_list) 82 | } 83 | 84 | file_locations <- paste0(file_path, file_names, file_type) 85 | 86 | filings_list <- map(file_locations, readtext) 87 | 88 | similarity_list <- map(sections, section_dfm, statements_list=filings_list, min_words=10, tf=TRUE) %>% 89 | map(textstat_simil, method="cosine") %>% 90 | map2(sections, dist_parser) %>% 91 | reduce(left_join, by = c("previous_filing", "filing")) 92 | 93 | prev_current_mapping <- data_frame(previous_filing = file_names[-length(file_names)], filing = file_names[-1]) 94 | distance_returns_df <- left_join(prev_current_mapping, similarity_list, by = c("previous_filing", "filing")) 95 | print(paste("Successfully mapped distance scores to financial returns for ticker", the_ticker)) 96 | return(distance_returns_df)} 97 | 98 | distance_df <- map(tickers, distance_returns_calculator) %>% 99 | reduce(rbind) 100 | 101 | masterIndex %<>% left_join(distance_df, by = "filing") 102 | 103 | write.csv(masterIndex, file = "index_distance_10Q.csv", row.names = FALSE) -------------------------------------------------------------------------------- /Documentation/Text-Distance-Algo.Rmd: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Documentation/Text-Distance-Algo.Rmd -------------------------------------------------------------------------------- /Documentation/parsing-script.R: -------------------------------------------------------------------------------- 1 | library("stringr") 2 | library("dplyr") 3 | library("purrr") 4 | 5 | input <- "/data/edgar/data/" 6 | output <- "parsed-filings/" 7 | 8 | clean_filing <- function(file_name, input_cik){ 9 | paste0(input_cik, file_name) %>% 10 | readLines(encoding = "UTF-8") %>% 11 | str_c(collapse = " ") %>% 12 | str_extract(pattern = "(?s)(?m)10-Q.*?()") %>% 13 | str_replace(pattern = "((?i)).*?(?=<)", replacement = "") %>% 14 | str_replace(pattern = "((?i)).*?(?=<)", replacement = "") %>% 15 | str_replace(pattern = "((?i)).*?(?=<)", replacement = "") %>% 16 | str_replace(pattern = "((?i)).*?(?=<)", replacement = "") %>% 17 | str_replace(pattern = "(?s)(?i).*?", replacement = "") %>% 18 | str_replace(pattern = "(?s)(?i)<(table).*?()", replacement = "") %>% 19 | str_replace_all(pattern = "(?s)(?i)(?m)> +Item|>Item|^Item", replacement = ">°Item") %>% 20 | str_replace(pattern = "", replacement = "°") %>% 21 | str_replace_all(pattern = "(?s)<.*?>", replacement = " ") %>% 22 | str_replace_all(pattern = "&(.{2,6});", replacement = " ") %>% 23 | str_replace_all(pattern = "(?s) +", replacement = " ") %>% 24 | write(file = paste0(output, file_name)) 25 | print(paste("Cleaned filing", file_name)) 26 | } 27 | 28 | clean_cik <- function(cik){ 29 | input_cik <- paste0(input, cik, "/") 30 | 31 | files = input_cik %>% 32 | list.files %>% 33 | subset(str_detect(., pattern = "10-Q")) 34 | 35 | map(files, clean_filing, input_cik = input_cik) 36 | 37 | print(paste("Cleaned all filings for CIK", cik)) 38 | } 39 | 40 | cik_list <- list.files(input) 41 | 42 | map(cik_list, clean_cik) 43 | 44 | 45 | #mutate(cik = cik, 46 | # date = str_extract(files, pattern = "(?<=(.{1,10}_){2}).*?(?=_)"), 47 | # file_name = str_extract(files, pattern = "(?<=(.{1,10}_){3}).*"))) 48 | 49 | -------------------------------------------------------------------------------- /Graphs/cosine_distance_returns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Graphs/cosine_distance_returns.png -------------------------------------------------------------------------------- /Graphs/jaccard_distance_returns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Graphs/jaccard_distance_returns.png -------------------------------------------------------------------------------- /Graphs/negative_sentiment_quantile_returns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Graphs/negative_sentiment_quantile_returns.png -------------------------------------------------------------------------------- /Graphs/positive_sentiment_quantile_returns.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/Graphs/positive_sentiment_quantile_returns.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EricHe98/Financial-Statements-Text-Analysis/0bd4dd172f0a083c60751ef991364b2258eee75d/README.md -------------------------------------------------------------------------------- /analyses/optimize_cosine_distance.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "132c0f76", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import numpy as np\n", 11 | "import pandas as pd\n", 12 | "import cython\n", 13 | "import os\n", 14 | "import re\n", 15 | "import json\n", 16 | "from bs4 import BeautifulSoup\n", 17 | "from multiprocessing import Pool\n", 18 | "from pandarallel import pandarallel" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "id": "fe7c519c", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "os.chdir('/mnt/d/workspace/8-2/Financial-Statements-Text-Analysis/')" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 3, 34 | "id": "5faf2ba9", 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# params\n", 39 | "with open('config.json', 'r') as f:\n", 40 | " c = json.load(f)\n", 41 | "input_dir = os.path.join(c['DATA_DIR'], '10k_clean')\n", 42 | "# destination_dir = os.path.join(c['DATA_DIR'], '10k_clean')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "40763f7b", 48 | "metadata": {}, 49 | "source": [ 50 | "# read processed 10-Ks in" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 4, 56 | "id": "080d1c05", 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "metadata = pd.read_csv(os.path.join(c['DATA_DIR'], 'metadata.csv'))\n", 61 | "metadata_legacy = pd.read_csv(os.path.join(c['DATA_DIR'], 'metadata_2017.csv'))\n", 62 | "\n", 63 | "# only download the data from russell 3000 today\n", 64 | "metadata = metadata_legacy[metadata_legacy['TICKER'].isin(metadata['ticker'])]" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 5, 70 | "id": "1793c9ca", 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "name": "stderr", 75 | "output_type": "stream", 76 | "text": [ 77 | "/tmp/ipykernel_14741/1550976992.py:1: SettingWithCopyWarning: \n", 78 | "A value is trying to be set on a copy of a slice from a DataFrame.\n", 79 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 80 | "\n", 81 | "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", 82 | " metadata['LOCAL_LINK'] = input_dir + '/' + metadata['TICKER'] + '/' + metadata['EDGAR_LINK'].str.split(\"/\").str[-1]\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "metadata['LOCAL_LINK'] = input_dir + '/' + metadata['TICKER'] + '/' + metadata['EDGAR_LINK'].str.split(\"/\").str[-1]" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 6, 93 | "id": "99e4069e", 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "name": "stdout", 98 | "output_type": "stream", 99 | "text": [ 100 | "652 ms ± 19.2 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" 101 | ] 102 | } 103 | ], 104 | "source": [ 105 | "%%timeit\n", 106 | "\n", 107 | "for i in range(100):\n", 108 | " pd.read_csv(metadata.iloc[i]['LOCAL_LINK'])" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 7, 114 | "id": "fcef961c", 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "348 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "%%timeit \n", 127 | "links = [metadata.iloc[i]['LOCAL_LINK'] for i in range(100)]\n", 128 | "\n", 129 | "with Pool(processes=4) as pool:\n", 130 | " pool.map(pd.read_csv, links)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 8, 136 | "id": "d37cfbf5", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "def read_csv_wrapper(i):\n", 141 | " try:\n", 142 | " row = metadata.iloc[i]\n", 143 | " path = row['LOCAL_LINK']\n", 144 | " ticker = row['TICKER']\n", 145 | " \n", 146 | " df = pd.read_csv(path)\n", 147 | " df['ticker'] = ticker\n", 148 | " df['path'] = path\n", 149 | " df['filing_date'] = row['DATE_FILED']\n", 150 | " return df\n", 151 | " except:\n", 152 | " # some were unable to read because the parse failed \n", 153 | " return pd.DataFrame()\n", 154 | "\n", 155 | "with Pool(processes=16) as pool:\n", 156 | " dfs = pool.map(read_csv_wrapper, range(len(metadata)))\n", 157 | " \n", 158 | "df = pd.concat(dfs)\n", 159 | "# filter out failed reads\n", 160 | "df = df[~df['text'].isnull()]\n", 161 | "\n", 162 | "# order the df\n", 163 | "df = df.sort_values(['ticker', 'item', 'filing_date'])\n", 164 | "df['index'] = np.arange(len(df))\n", 165 | "df['lead_index'] = df.groupby(['ticker', 'item'])['index'].shift(-1)" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "adaafe25", 171 | "metadata": {}, 172 | "source": [ 173 | "# Text cleaning" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 9, 179 | "id": "2eceeaf7", 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "name": "stdout", 184 | "output_type": "stream", 185 | "text": [ 186 | "3.54 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 187 | ] 188 | } 189 | ], 190 | "source": [ 191 | "%%timeit\n", 192 | "df.head(1000)['text'].str.replace('\\W', ' ', regex=True)\\\n", 193 | " .str.lower()\\\n", 194 | " .str.split()\\\n", 195 | " .str.join(' ')" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 10, 201 | "id": "5252b875", 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "def clean_string(s):\n", 206 | " s = re.sub('\\W', ' ', s)\n", 207 | " s = s.lower()\n", 208 | " s = re.sub(' +', ' ', s)\n", 209 | " return s" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 11, 215 | "id": "02b207b2", 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "4.95 s ± 106 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" 223 | ] 224 | } 225 | ], 226 | "source": [ 227 | "%%timeit\n", 228 | "df.head(1000)['text'].apply(clean_string)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 12, 234 | "id": "9f52d9cb", 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "pandarallel.initialize(progress_bar=True, nb_workers=16, verbose=0)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 13, 244 | "id": "2795ac24", 245 | "metadata": {}, 246 | "outputs": [ 247 | { 248 | "data": { 249 | "application/vnd.jupyter.widget-view+json": { 250 | "model_id": "f93dfb3050194425b3e4044db2335bc1", 251 | "version_major": 2, 252 | "version_minor": 0 253 | }, 254 | "text/plain": [ 255 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 256 | ] 257 | }, 258 | "metadata": {}, 259 | "output_type": "display_data" 260 | }, 261 | { 262 | "data": { 263 | "application/vnd.jupyter.widget-view+json": { 264 | "model_id": "69aac78a4c5e4ac78beb042cdbd885ae", 265 | "version_major": 2, 266 | "version_minor": 0 267 | }, 268 | "text/plain": [ 269 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 270 | ] 271 | }, 272 | "metadata": {}, 273 | "output_type": "display_data" 274 | }, 275 | { 276 | "data": { 277 | "application/vnd.jupyter.widget-view+json": { 278 | "model_id": "34a10a212d6947bd9e5fae6e7b9ae6d0", 279 | "version_major": 2, 280 | "version_minor": 0 281 | }, 282 | "text/plain": [ 283 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 284 | ] 285 | }, 286 | "metadata": {}, 287 | "output_type": "display_data" 288 | }, 289 | { 290 | "data": { 291 | "application/vnd.jupyter.widget-view+json": { 292 | "model_id": "b30e20a430a94f09915f297066485805", 293 | "version_major": 2, 294 | "version_minor": 0 295 | }, 296 | "text/plain": [ 297 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 298 | ] 299 | }, 300 | "metadata": {}, 301 | "output_type": "display_data" 302 | }, 303 | { 304 | "data": { 305 | "application/vnd.jupyter.widget-view+json": { 306 | "model_id": "1b12fefdd6c549b59540fa0379cf3072", 307 | "version_major": 2, 308 | "version_minor": 0 309 | }, 310 | "text/plain": [ 311 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 312 | ] 313 | }, 314 | "metadata": {}, 315 | "output_type": "display_data" 316 | }, 317 | { 318 | "data": { 319 | "application/vnd.jupyter.widget-view+json": { 320 | "model_id": "e9bcdecccc9d4f578b381875e0c183b0", 321 | "version_major": 2, 322 | "version_minor": 0 323 | }, 324 | "text/plain": [ 325 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 326 | ] 327 | }, 328 | "metadata": {}, 329 | "output_type": "display_data" 330 | }, 331 | { 332 | "data": { 333 | "application/vnd.jupyter.widget-view+json": { 334 | "model_id": "b753ddfc5b604e60a35cae484ea8dc76", 335 | "version_major": 2, 336 | "version_minor": 0 337 | }, 338 | "text/plain": [ 339 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 340 | ] 341 | }, 342 | "metadata": {}, 343 | "output_type": "display_data" 344 | }, 345 | { 346 | "data": { 347 | "application/vnd.jupyter.widget-view+json": { 348 | "model_id": 
"fde8283eaa0e46a2b63cf68292df1bbb", 349 | "version_major": 2, 350 | "version_minor": 0 351 | }, 352 | "text/plain": [ 353 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=63), Label(value='0 / 63'))), HBox…" 354 | ] 355 | }, 356 | "metadata": {}, 357 | "output_type": "display_data" 358 | }, 359 | { 360 | "name": "stdout", 361 | "output_type": "stream", 362 | "text": [ 363 | "2.34 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 364 | ] 365 | } 366 | ], 367 | "source": [ 368 | "%%timeit\n", 369 | "df.head(1000)['text'].parallel_apply(clean_string)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 14, 375 | "id": "a64c4639", 376 | "metadata": {}, 377 | "outputs": [ 378 | { 379 | "data": { 380 | "application/vnd.jupyter.widget-view+json": { 381 | "model_id": "dbc5bcd67d9442a0a0f923fb9f513789", 382 | "version_major": 2, 383 | "version_minor": 0 384 | }, 385 | "text/plain": [ 386 | "VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=1556), Label(value='0 / 1556'))), …" 387 | ] 388 | }, 389 | "metadata": {}, 390 | "output_type": "display_data" 391 | } 392 | ], 393 | "source": [ 394 | "df['text'] = df['text'].parallel_apply(clean_string)" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "id": "944afb97", 400 | "metadata": {}, 401 | "source": [ 402 | "# transform to tfidf" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 15, 408 | "id": "20c6ee8a", 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 16, 418 | "id": "da8eb1c8", 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "comparison_df = df[~df['lead_index'].isnull()].copy()\n", 423 | "comparison_df['lead_index'] = comparison_df['lead_index'].astype(int)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 17, 429 | "id": "efe6a73f", 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "vectorizer = TfidfVectorizer()\n", 434 | "\n", 435 | "tfidf = vectorizer.fit_transform(comparison_df['text'])" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "id": "553b3709", 441 | "metadata": {}, 442 | "source": [ 443 | "# perform cosine distance computation" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 18, 449 | "id": "ff680833", 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "from sklearn.metrics.pairwise import cosine_similarity" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 19, 459 | "id": "16170ec3", 460 | "metadata": {}, 461 | "outputs": [ 462 | { 463 | "name": "stdout", 464 | "output_type": "stream", 465 | "text": [ 466 | "742 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 467 | ] 468 | } 469 | ], 470 | "source": [ 471 | "%%timeit\n", 472 | "cosine_similarity(tfidf[:1000], tfidf[:1000])" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 20, 478 | "id": "ab05f8f9", 479 | "metadata": {}, 480 | "outputs": [ 481 | { 482 | "name": "stdout", 483 | "output_type": "stream", 484 | "text": [ 485 | "2.67 s ± 65.1 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" 486 | ] 487 | } 488 | ], 489 | "source": [ 490 | "%%timeit\n", 491 | "cosine_similarity(tfidf[:2000], tfidf[:2000])" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 21, 497 | "id": "f88154f9", 498 | "metadata": {}, 499 | "outputs": [ 500 | { 501 | "name": "stdout", 502 | "output_type": "stream", 503 | "text": [ 504 | "76 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 505 | ] 506 | } 507 | ], 508 | "source": [ 509 | "%%timeit\n", 510 | "(tfidf[:999].multiply(tfidf[1:1000]).sum(axis=1) / \\\n", 511 | " np.sqrt(tfidf[:999].multiply(tfidf[:999]).sum(axis=1)) / np.sqrt(tfidf[1:1000].multiply(tfidf[1:1000]).sum(axis=1)))" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": 22, 517 | "id": "b9800088", 518 | "metadata": {}, 519 | "outputs": [ 520 | { 521 | "name": "stdout", 522 | "output_type": "stream", 523 | "text": [ 524 | "172 ms ± 4.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 525 | ] 526 | } 527 | ], 528 | "source": [ 529 | "%%timeit\n", 530 | "(tfidf[:1999].multiply(tfidf[1:2000]).sum(axis=1) / \\\n", 531 | " np.sqrt(tfidf[:1999].multiply(tfidf[:1999]).sum(axis=1)) / np.sqrt(tfidf[1:2000].multiply(tfidf[1:2000]).sum(axis=1)))" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": 23, 537 | "id": "1aae8383", 538 | "metadata": {}, 539 | "outputs": [ 540 | { 541 | "name": "stdout", 542 | "output_type": "stream", 543 | "text": [ 544 | "1.51 s ± 43.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 545 | ] 546 | } 547 | ], 548 | "source": [ 549 | "%%timeit\n", 550 | "(tfidf[:-1].multiply(tfidf[1:]).sum(axis=1) / \\\n", 551 | " np.sqrt(tfidf[:-1].multiply(tfidf[:-1]).sum(axis=1)) / np.sqrt(tfidf[1:].multiply(tfidf[1:]).sum(axis=1)))" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": 24, 557 | "id": "436dca7e", 558 | "metadata": {}, 559 | "outputs": [ 560 | { 561 | "data": { 562 | "text/plain": [ 563 | "matrix([[0.98227666],\n", 564 | " [0.99666753],\n", 565 | " [0.58170126],\n", 566 | " ...,\n", 567 | " [0.56405527],\n", 568 | " [0.93125178],\n", 569 | " [0.94108083]])" 570 | ] 571 | }, 572 | "execution_count": 24, 573 | "metadata": {}, 574 | "output_type": "execute_result" 575 | } 576 | ], 577 | "source": [ 578 | "(tfidf[:-1].multiply(tfidf[1:]).sum(axis=1) / \\\n", 579 | " np.sqrt(tfidf[:-1].multiply(tfidf[:-1]).sum(axis=1)) / np.sqrt(tfidf[1:].multiply(tfidf[1:]).sum(axis=1)))" 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": 25, 585 | "id": "9c0a7c7f", 586 | "metadata": {}, 587 | "outputs": [ 588 | { 589 | "data": { 590 | "text/plain": [ 591 | "(18301, 88229)" 592 | ] 593 | }, 594 | "execution_count": 25, 595 | "metadata": {}, 596 | "output_type": "execute_result" 597 | } 598 | ], 599 | "source": [ 600 | "tfidf.shape" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "id": "6ebed1f8", 607 | "metadata": {}, 608 | "outputs": [], 609 | "source": [] 610 | } 611 | ], 612 | "metadata": { 613 | "kernelspec": { 614 | "display_name": "Python 3 (ipykernel)", 615 | "language": "python", 616 | "name": "python3" 617 | }, 618 | "language_info": { 619 | "codemirror_mode": { 620 | "name": "ipython", 621 | "version": 3 622 | }, 623 | "file_extension": ".py", 624 | "mimetype": "text/x-python", 625 | "name": "python", 626 | "nbconvert_exporter": "python", 627 | "pygments_lexer": "ipython3", 628 | "version": "3.10.10" 629 | } 630 | }, 631 | "nbformat": 4, 632 | 
"nbformat_minor": 5 633 | } 634 | -------------------------------------------------------------------------------- /scripts/fetch_10k_docs.py: -------------------------------------------------------------------------------- 1 | # for analysis of which methods of fetching the file are best, 2 | # see analyses/optimize_doc_fetch.ipynb 3 | 4 | from sec_api import RenderApi 5 | import os 6 | import pandas as pd 7 | import json 8 | from pandarallel import pandarallel 9 | import requests 10 | 11 | # params 12 | with open('config.json', 'r') as f: 13 | c = json.load(f) 14 | destination_dir = os.path.join(c['DATA_DIR'], '10k_raw') 15 | api_key = c['SEC_API_KEY'] 16 | render_api = RenderApi(api_key=api_key) 17 | 18 | def download_filing(url, destination_file, skip_existing=True, skip_ixbrl=True, engine='requests'): 19 | """ 20 | Given a SEC EDGAR 10-K URL, will download the HTML file to the local destination file. 21 | If skip, then will not re-download file if the local destination file already exists. 22 | 23 | Args 24 | --------- 25 | engine : one of {'sec_api', 'requests'} 26 | sec_api : uses the proprietary SEC API to fetch the filing 27 | requests : uses the open source SEC EDGAR database and requests package to fetch the filing 28 | """ 29 | try: 30 | destination_dir = os.path.dirname(destination_file) 31 | 32 | if not os.path.isdir(destination_dir): 33 | os.makedirs(destination_dir) 34 | 35 | if skip_existing and os.path.exists(destination_file): 36 | print('⏭️ already exists, skipping download: {url}'.format( 37 | url=url)) 38 | return 39 | 40 | # do not download iXBRL output 41 | if skip_ixbrl: 42 | url = url.replace('ix?doc=/', '') 43 | if engine == 'sec_api': 44 | file_content = render_api.get_filing(url) 45 | elif engine == 'requests': 46 | file_content = requests.get(url).text 47 | 48 | with open(destination_file, "w") as f: 49 | f.write(file_content) 50 | 51 | except: 52 | print('❌ download failed: {url}'.format( 53 | url=url)) 54 | 55 | def pandarallel_wrapper(metadata): 56 | """ 57 | Basic wrapper of the download_filing functionality to allow for pandarallel optimization 58 | """ 59 | ticker = metadata['ticker'] 60 | url = metadata['filingUrl'] 61 | file_name = url.split("/")[-1] 62 | destination_file = os.path.join(destination_dir, ticker, file_name) 63 | 64 | download_filing(url, destination_file) 65 | 66 | def pandarallel_wrapper_legacy(metadata): 67 | """ 68 | Basic wrapper of the download_filing functionality to allow for pandarallel optimization 69 | """ 70 | ticker = metadata['TICKER'] 71 | url = 'https://www.sec.gov/Archives/' + metadata['EDGAR_LINK'] 72 | file_name = url.split("/")[-1] 73 | destination_file = os.path.join(destination_dir, ticker, file_name) 74 | 75 | download_filing(url, destination_file) 76 | 77 | if __name__ == '__main__': 78 | # this code follows the SEC API documentation 79 | # to fetch the files of the 10-Ks of the 3000 companies of the Russell 3000 80 | 81 | # The SEC API documentation can be found at 82 | # https://sec-api.io/docs/sec-filings-render-api/python-example 83 | import json 84 | import os 85 | import pandas as pd 86 | 87 | number_of_workers = 8 88 | 89 | # read URL table 90 | metadata = pd.read_csv(os.path.join(c['DATA_DIR'], 'metadata.csv')) 91 | metadata_legacy = pd.read_csv(os.path.join(c['DATA_DIR'], 'metadata_2017.csv')) 92 | 93 | # only download the data from russell 3000 today 94 | metadata = metadata_legacy[metadata_legacy['TICKER'].isin(metadata['ticker'])] 95 | 96 | # download multiple files in parallel 97 | 
pandarallel.initialize(progress_bar=True, nb_workers=number_of_workers, verbose=0) 98 | 99 | # uncomment to run a quick sample and download 50 filings 100 | # sample = metadata.head(50) 101 | # sample.parallel_apply(pandarallel_wrapper, axis=1) 102 | 103 | # download all filings 104 | metadata.parallel_apply(pandarallel_wrapper_legacy, axis=1) 105 | 106 | print('✅ Download completed') 107 | -------------------------------------------------------------------------------- /scripts/fetch_10k_urls.py: -------------------------------------------------------------------------------- 1 | # for analysis of which methods of fetching the URL are best, see the analysis 2 | # at analyses/optimize_url_fetch.ipynb 3 | # bulk download capabilities at https://www.sec.gov/edgar/sec-api-documentation 4 | 5 | from sec_api import QueryApi 6 | import pandas as pd 7 | 8 | 9 | def create_batches(tickers=[], max_length_of_batch=100): 10 | """ 11 | # create batches of tickers: [[A,B,C], [D,E,F], ...] 12 | # a single batch has a maximum of max_length_of_batch tickers 13 | """ 14 | batches = [[]] 15 | 16 | for ticker in tickers: 17 | if len(batches[len(batches)-1]) == max_length_of_batch: 18 | batches.append([]) 19 | 20 | batches[len(batches)-1].append(ticker) 21 | 22 | return batches 23 | 24 | def download_10K_metadata(query_api, tickers, start_year, end_year): 25 | """ 26 | Given a list of tickers, this function will return a Pandas Dataframe 27 | with the ticker, CIK, and URLs to download the 10-K files from the SEC database. 28 | 29 | Example Output 30 | ------------ 31 | ticker cik formType filedAt filingUrl 32 | 0 AVGO 1730168 10-K 2021-12-17T16:42:51-05:00 https://www.sec.gov/Archives/edgar/data/173016... 33 | 1 AMAT 6951 10-K 2021-12-17T16:14:51-05:00 https://www.sec.gov/Archives/edgar/data/6951/0... 34 | 2 DE 315189 10-K 2021-12-16T11:39:34-05:00 https://www.sec.gov/Archives/edgar/data/315189... 35 | 3 ADI 6281 10-K 2021-12-03T16:02:52-05:00 https://www.sec.gov/Archives/edgar/data/6281/0... 
36 | 37 | Args 38 | ------------ 39 | query_api : the QueryApi client used to run the URL-fetching queries 40 | tickers : a list of tickers for which to fetch URLs 41 | start_year : the start year to begin fetching 10-K URLs 42 | end_year : the final year for which 10-K URLs should be fetched 43 | """ 44 | print('✅ Starting download process') 45 | 46 | # create ticker batches, with 100 tickers per batch 47 | batches = create_batches(tickers) 48 | frames = [] 49 | 50 | for year in range(start_year, end_year + 1): 51 | for batch in batches: 52 | tickers_joined = ', '.join(batch) 53 | ticker_query = 'ticker:({})'.format(tickers_joined) 54 | 55 | query_string = ''' 56 | {ticker_query} 57 | AND filedAt:[{start_year}-01-01 TO {end_year}-12-31] 58 | AND formType:"10-K" 59 | AND NOT formType:"10-K/A" 60 | AND NOT formType:NT'''.format( 61 | ticker_query=ticker_query, start_year=year, end_year=year) 62 | 63 | query = { 64 | "query": {"query_string": { 65 | "query": query_string, 66 | "time_zone": "America/New_York" 67 | }}, 68 | "from": "0", 69 | "size": "200", 70 | "sort": [{"filedAt": {"order": "desc"}}] 71 | } 72 | 73 | response = query_api.get_filings(query) 74 | 75 | filings = response['filings'] 76 | 77 | metadata = list(map(lambda f: {'ticker': f['ticker'], 78 | 'cik': f['cik'], 79 | 'formType': f['formType'], 80 | 'filedAt': f['filedAt'], 81 | 'filingUrl': f['linkToFilingDetails']}, filings)) 82 | 83 | df = pd.DataFrame.from_records(metadata) 84 | 85 | frames.append(df) 86 | 87 | print('✅ Downloaded metadata for year', year) 88 | 89 | result = pd.concat(frames) 90 | return result 91 | 92 | 93 | if __name__ == '__main__': 94 | # this code follows the SEC API documentation 95 | # to fetch the URLs of the 10-Ks of the 3000 companies of the Russell 3000 96 | 97 | # The SEC API documentation can be found at 98 | # https://sec-api.io/docs/sec-filings-render-api/python-example 99 | import json 100 | import os 101 | import pandas as pd 102 | 103 | # params 104 | with open('config.json', 'r') as f: 105 | c = json.load(f) 106 | destination_dir = c['DATA_DIR'] 107 | destination_file = os.path.join(destination_dir, 'metadata.csv') 108 | api_key = c['SEC_API_KEY'] 109 | 110 | # read the cleaned Russell 3000 constituents file 111 | holdings = pd.read_csv(os.path.join(c['DATA_DIR'], 'russell_3000/russell-3000-clean.csv')) 112 | 113 | query_api = QueryApi(api_key=api_key) 114 | tickers = list(holdings['Ticker']) 115 | 116 | metadata = download_10K_metadata( 117 | query_api=query_api, tickers=tickers, start_year=2019, end_year=2023) 118 | 119 | number_metadata_downloaded = len(metadata) 120 | print('✅ Download completed. Metadata downloaded for {} filings.'.format( 121 | number_metadata_downloaded)) 122 | 123 | print('Writing to file {}'.format(destination_file)) 124 | metadata.to_csv(destination_file, index=False) 125 | print('Completed writing file. Exiting script.') 126 | -------------------------------------------------------------------------------- /scripts/fetch_russell_3000.py: -------------------------------------------------------------------------------- 1 | # this code fetches the 3000 constituents of the Russell 3000 2 | # by downloading and cleaning the iShares Russell 3000 holdings CSV 3 | 4 | # the general structure follows the SEC API documentation at 5 | # https://sec-api.io/docs/sec-filings-render-api/python-example 6 | 7 | # import libraries 8 | from sec_api import QueryApi, RenderApi 9 | import requests 10 | import os 11 | import json 12 | 13 | with open('config.json', 'r') as f: 14 | c = json.load(f) 15 | 16 | # params 17 | destination_dir = os.path.join(c['DATA_DIR'], 'russell_3000') 18 | raw_data_path = os.path.join(destination_dir, 'russell-3000.csv') 19 | clean_data_path = os.path.join(destination_dir, 'russell-3000-clean.csv') 20 | url = c['RUSSELL_3000_URL'] 21 | 22 | #### 23 | # Download the 3000 constituents of the Russell 3000 24 | #### 25 | response = requests.get(url) 26 | 27 | with open(raw_data_path, 'wb') as f: 28 | f.write(response.content) 29 | 30 | # cleaning the iShares CSV file 31 | import csv 32 | 33 | with open(raw_data_path, 'r', encoding='utf-8') as f: 34 | reader = csv.reader(f) 35 | rows = list(reader) 36 | 37 | empty_row_indices = [i for i in range(len(rows)) if (len(rows[i]) == 0 or '\xa0' in rows[i])] 38 | 39 | print('Empty rows:', empty_row_indices) 40 | 41 | start = empty_row_indices[0] + 1 42 | end = empty_row_indices[1] 43 | cleaned_rows = rows[start:end] 44 | 45 | with open(clean_data_path, 'w', newline='') as f: 46 | writer = csv.writer(f) 47 | writer.writerows(cleaned_rows) 48 | --------------------------------------------------------------------------------
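
A closing note on the cosine-distance computation benchmarked in `analyses/optimize_cosine_distance.ipynb`: the fast expression there compares row i of the TF-IDF matrix with row i + 1, so it assumes the rows are already sorted by ticker, item, and filing date, and any pair that straddles two different ticker/item groups would still have to be masked out afterwards (that is what the notebook's `index`/`lead_index` bookkeeping is for). The sketch below is a minimal, hedged restatement of that expression as a reusable function; the function name and the zero-denominator guard are illustrative additions, not part of the repository.

```python
# Illustrative sketch (not part of the repository): the row-wise cosine
# computation timed in analyses/optimize_cosine_distance.ipynb, wrapped in a
# reusable function. It compares row i of a sparse TF-IDF matrix with row
# i + 1, as the notebook's tfidf[:-1] / tfidf[1:] expressions do, and adds a
# guard against all-zero rows.
import numpy as np
from scipy.sparse import csr_matrix


def consecutive_cosine(tfidf: csr_matrix) -> np.ndarray:
    """Cosine similarity between each row of `tfidf` and the row after it."""
    a, b = tfidf[:-1], tfidf[1:]
    num = np.asarray(a.multiply(b).sum(axis=1)).ravel()      # row-wise dot products
    norm_a = np.sqrt(np.asarray(a.multiply(a).sum(axis=1)).ravel())
    norm_b = np.sqrt(np.asarray(b.multiply(b).sum(axis=1)).ravel())
    denom = norm_a * norm_b
    return np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
```

The appeal of this form, as the notebook's timings suggest, is that it only touches the n - 1 consecutive pairs instead of building the full n-by-n similarity matrix that `cosine_similarity` returns (roughly 76 ms versus 742 ms for 1,000 rows in the recorded runs), which is what makes scoring every filing pair in the corpus tractable.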