├── Consolidated Report.pdf
├── model
│   ├── interim_model_dict.pickle
│   ├── tfidf_vectorizer_2014_dict.pickle
│   └── ensemble_model_function_2014.py
├── other reports
│   ├── Data Wrangling.pdf
│   └── Machine Learning Implementation.pdf
├── requirements.txt
├── README.md
└── public codebase (not executable)
    ├── CB1 Data Wrangling.ipynb
    └── CB4 Machine Learning Implementation.ipynb
/Consolidated Report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gslicht/Using_NLP_to_Predict_Almost_Bankruptcy/HEAD/Consolidated Report.pdf -------------------------------------------------------------------------------- /model/interim_model_dict.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gslicht/Using_NLP_to_Predict_Almost_Bankruptcy/HEAD/model/interim_model_dict.pickle -------------------------------------------------------------------------------- /other reports/Data Wrangling.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gslicht/Using_NLP_to_Predict_Almost_Bankruptcy/HEAD/other reports/Data Wrangling.pdf -------------------------------------------------------------------------------- /model/tfidf_vectorizer_2014_dict.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gslicht/Using_NLP_to_Predict_Almost_Bankruptcy/HEAD/model/tfidf_vectorizer_2014_dict.pickle -------------------------------------------------------------------------------- /other reports/Machine Learning Implementation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gslicht/Using_NLP_to_Predict_Almost_Bankruptcy/HEAD/other reports/Machine Learning Implementation.pdf -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib 3.1.3 2 | numpy 1.18.1 3 | pandas 1.0.1 4 | python 3.7.6 5 | quandl (python 3) 6 | scipy 1.4.1 7 | seaborn 0.10.0 8 | sec_edgar_downloader 3.0.2 9 | sklearn 0.22.1 10 | 11 | 12 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Using_NLP_to_Predict_Almost_Bankruptcy 2 | The goal of the project is to see whether sharp and extreme equity drawdowns, as a proxy for bankruptcy, credit risk and governance, can be predicted from the language of a company’s annual report, and to compare this with the performance of more traditional measures based on financial metrics. 3 | 4 | The model is available in the self-named folder and uses the first expanding window with training data up to and including 2014. It is structured as a function that takes the annual company report as a document input and produces 1 for prediction of an 80% (or greater) 20-day stock price drawdown in the year following the report filing date and 0 otherwise. Have fun!
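
A minimal usage sketch (illustrative only): it assumes the script is run from inside the model folder, since the pickled vectorizer and models are loaded via relative paths, and my_10k.txt is a placeholder for any 10-K that has already been reduced to plain text.

```python
from ensemble_model_function_2014 import ensemble_model_function_2014

# read a plain-text annual report (placeholder filename)
with open("my_10k.txt") as f:
    report_text = f.read()

# the function expects an iterable of documents and returns an array of 0/1 labels
prediction = ensemble_model_function_2014([report_text])
print(prediction)  # e.g. [1] flags a predicted 80%+ 20-day drawdown within a year
```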
5 | -------------------------------------------------------------------------------- /model/ensemble_model_function_2014.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Aug 25 15:07:17 2020 4 | 5 | @author: gslicht 6 | """ 7 | 8 | 9 | import pandas as pd 10 | import numpy as np 11 | 12 | 13 | def ensemble_model_function_2014(doc): 14 | 15 | vectorizer = pd.read_pickle('tfidf_vectorizer_2014_dict.pickle') 16 | dict_models = pd.read_pickle('interim_model_dict.pickle') 17 | 18 | model_over = dict_models['model_over'] 19 | model_under = dict_models['model_under'] 20 | 21 | X = vectorizer.transform(doc) 22 | 23 | y_over_log_proba = model_over.predict_log_proba(X)[:,1] 24 | y_under_log_proba = model_under.predict_log_proba(X)[:,1] 25 | 26 | y_log_proba = 0.25*y_over_log_proba + 0.75*y_under_log_proba 27 | y_pred = (y_log_proba > np.log(0.5))*1 28 | 29 | return y_pred 30 | -------------------------------------------------------------------------------- /public codebase (not executable)/CB1 Data Wrangling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Codebase 1: Data Wrangling\n", 8 | "\n", 9 | "The structured data is sourced from Sharadar's paid subscription and consists of (i) market data, (ii) financial data and (iii) metadata. The market data is used to calculate the maximum 20 day rolling drawdown in the 1 year period following the filing of the annual report. The binary target is defined as a positive event when this drawdown is 80% or greater and as a negative event otherwise. \n", 10 | "\n", 11 | "The codebase is structured in the following sections:\n", 12 | "\n", 13 | "1. Data retrieval and early calculations\n", 14 | "2. Preprocessing\n", 15 | "3. Merging datasets into target-features data frame \n" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "### (1) Data retrieval and early calculations\n", 23 | "\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "The market price database was too big to be loaded by API call and was instead bulk downloaded as a CSV file from the Quandl site.\n", 31 | "\n", 32 | "The below code uses the closing price of each equity to return the rolling 20 day max drawdowns on a daily basis.\n",
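"\n", "For example, if a stock's closing price falls from its 20 day rolling high of 100 to 15, the rolling drawdown is 15/100 - 1 = -0.85, which breaches the 80% threshold and is labelled a positive event."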
33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "name": "stdout", 42 | "output_type": "stream", 43 | "text": [ 44 | "2020-08-25 11:23:12.440870\n" 45 | ] 46 | } 47 | ], 48 | "source": [ 49 | "'''\n", 50 | "Converts daily equity prices from Sharadar database to rolling 20 day max\n", 51 | "drawdowns in dataframe format with columns as ticker and dates as index\n", 52 | "'''\n", 53 | "\n", 54 | "\n", 55 | "import pickle\n", 56 | "import pandas as pd\n", 57 | "from datetime import datetime as dt\n", 58 | "\n", 59 | "\n", 60 | "t1 = dt.now()\n", 61 | "print(t1)\n", 62 | "\n", 63 | "#specify inputs\n", 64 | "window_dd = 20\n", 65 | "\n", 66 | "input_file_1 = 'daily_equity_prices.csv'\n", 67 | "output_file = 'monthly_rolling_20d_dd_whole_db.pickle'\n", 68 | "\n", 69 | "#read csv file of stock price data\n", 70 | "df_stocks = pd.read_csv(input_file_1, parse_dates=['date'])\n", 71 | "\n", 72 | "#pivot table and select closing prices only\n", 73 | "df_prices = df_stocks.pivot(index='date', columns='ticker', values='close')\n", 74 | "\n", 75 | "df_prices = df_prices.sort_index()\n", 76 | "#calculate max rolling 20 day drawdowns on rolling daily basis\n", 77 | " #compute rolling dd\n", 78 | "df_dd = df_prices / df_prices.rolling(window_dd).max() -1\n", 79 | "df_dd = df_dd.applymap(lambda x: min(x,0))\n", 80 | "\n", 81 | "df_dd = df_dd.dropna(how='all', axis=1)\n", 82 | "\n", 83 | "#save dataframe to pickle\n", 84 | "with open(output_file, 'wb') as handle: \n", 85 | " pickle.dump(df_dd, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 86 | "\n", 87 | "t2 = dt.now()\n", 88 | "print(t2 - t1)\n", 89 | "\n", 90 | "#runtime 3min30sec" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "Sharadar also provides metadata for each equity ticker and this is downloaded via the Quandl API in Python." 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "import quandl\n", 107 | "import pickle\n", 108 | "import numpy as np\n", 109 | "\n", 110 | "\n", 111 | "output_file = 'meta_df_whole_db.pickle' \n", 112 | "\n", 113 | "#API Key\n", 114 | "quandl.ApiConfig.api_key = \"key\"\n", 115 | "\n", 116 | "#Pull data from quandl in df format\n", 117 | "df_meta = quandl.get_table('SHARADAR/TICKERS', paginate = True) #all tickers and metadata\n", 118 | "\n", 119 | "\n", 120 | "#Wrangle Meta Table \n", 121 | "df_meta = df_meta[df_meta.table.eq('SF1')] #filter by table 'SF1' \n", 122 | "df_meta.set_index('ticker', inplace=True) #set ticker as index\n", 123 | "df_meta['CIK'] = df_meta['secfilings'].apply(lambda x: x[x.find('CIK=')+4:].strip()) #form new column for CIK reference number (number as text) \n", 124 | "df_meta.fillna(np.NaN, inplace=True) #fill None with NaN\n", 125 | "df_meta = df_meta.transpose() \n", 126 | "\n", 127 | "#save dataframe to file\n", 128 | "with open(output_file, 'wb') as handle: \n", 129 | " pickle.dump(df_meta, handle, protocol=pickle.HIGHEST_PROTOCOL) " 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "The next step is to download the 10Ks from the SEC website. Given the highly imbalanced dataset, we use the rolling drawdown dataframe to find those tickers with maximum drawdowns over 80% and make sure these 10Ks are downloaded first.
While company tickers can change for various reasons, the CIK number is unique and this links the drawdown and SEC 10K data through the metadata. Once the target companies have been specified and the CIK numbers retrieved, we use the existing SEC downloader library to retrieve these annual statements to the local drive. " 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "'''\n", 146 | "Takes in CIK number from metadata and drawdown dataframe to choose tickers for \n", 147 | "download from SEC website. Downloads to local drive and saves custom log.\n", 148 | "'''\n", 149 | "\n", 150 | "import pickle\n", 151 | "from sec_edgar_downloader import Downloader\n", 152 | "from datetime import datetime as dt\n", 153 | "\n", 154 | "#Specify input and output files\n", 155 | "\n", 156 | "input_file_meta = 'meta_df_whole_db.pickle' \n", 157 | "input_file_dd = 'monthly_rolling_20d_dd_whole_db.pickle'\n", 158 | "output_log_file = '10k_dowload_logs.pickle'\n", 159 | "\n", 160 | "local_drive_destination = 'XXX'\n", 161 | "\n", 162 | "with open(input_file_meta, 'rb') as f_meta:\n", 163 | "    df_meta = pickle.load(f_meta)\n", 164 | "\n", 165 | "with open(input_file_dd, 'rb') as f_dd:\n", 166 | "    df_dd = pickle.load(f_dd)\n", 167 | "\n", 168 | "#find tickers with max dd >= 80%\n", 169 | "s_dd = df_dd.min(axis=0)\n", 170 | "mask_dd = s_dd <= -0.8\n", 171 | "pos_tickers = s_dd[mask_dd].index.tolist()\n", 172 | "neg_tickers = s_dd[~mask_dd].index.tolist() \n", 173 | "tickers = pos_tickers + neg_tickers #ensure pos_tickers downloaded first\n", 174 | "\n", 175 | "t0 = dt.now()\n", 176 | "print(t0)\n", 177 | "\n", 178 | "df_ticker_cik = df_meta.loc['CIK']\n", 179 | "\n", 180 | "#Initialize a downloader instance with specified destination\n", 181 | "dl = Downloader(local_drive_destination)\n", 182 | "\n", 183 | "# Initialize lists for custom log\n", 184 | "descr_list = []\n", 185 | "error_list = []\n", 186 | "\n", 187 | "#download all 10Ks of ticker after January 1997\n", 188 | "for idx, ticker in enumerate(tickers): \n", 189 | "    cik = df_ticker_cik[ticker]\n", 190 | "    try:\n", 191 | "        t1 = dt.now()\n", 192 | "        dl.get(\"10-K\", cik, after_date=\"19970101\") \n", 193 | "        t2 = dt.now()\n", 194 | "        delta = t2-t1\n", 195 | "        descr = str(idx) + ' : ' + ticker + ' : ' + str(delta.seconds) + 'sec'\n", 196 | "        descr_list.append(descr)\n", 197 | "        print(descr)\n", 198 | "    except:\n", 199 | "        error_list.append(ticker)\n", 200 | "        descr = str(idx) + ' : ' + ticker + ' : ' + 'Error'\n", 201 | "        print(descr)\n", 202 | "        continue\n", 203 | "\n", 204 | "d_log = {'log': descr_list, 'error_codes': error_list}\n", 205 | "\n", 206 | "#save custom log to file\n", 207 | "with open(output_log_file, 'wb') as handle:\n", 208 | "    pickle.dump(d_log, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 209 | "\n", 210 | "t3 =dt.now()\n", 211 | "\n", 212 | "print(t3-t0) \n", 213 | "\n", 214 | "#runtime overnight- stopped in morning" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "### (2) Preprocessing\n", 222 | "\n", 223 | "The 10Ks are pulled over a 20+ year period and are inconsistent in format (text, html, xbrl). As a result, the more general regex method is preferred for preprocessing. This is programmed as a function below. Stemming and lemmatization are intentionally excluded in order to leave the corpus as nuanced as possible. 
" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "def remove_html_tags_char(text):\n", 233 | " '''Takes in string and removes defined special characters '''\n", 234 | " \n", 235 | " #Define special Chars\n", 236 | " clean1 = re.compile('\\n') \n", 237 | " clean2 = re.compile('\\r') \n", 238 | " clean3 = re.compile(' ') \n", 239 | " clean4 = re.compile(' ')\n", 240 | " clean5 = re.compile(' ')\n", 241 | " #Define html tags\n", 242 | " clean6 = re.compile('<.*?>')\n", 243 | " #remove special characters and html tags\n", 244 | " text = re.sub(clean1,' ', text)\n", 245 | " text = re.sub(clean2,' ',text) \n", 246 | " text = re.sub(clean3,' ',text) \n", 247 | " text = re.sub(clean4,' ',text) \n", 248 | " text = re.sub(clean5,' ',text) \n", 249 | " text = re.sub(clean6,' ',text) \n", 250 | " # check spacing\n", 251 | " final_text = ' '.join(text.split()) \n", 252 | " \n", 253 | " return final_text " 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "In addition to cleaning the 10Ks of special characters, the below code also pulls out document metadata from the text and creates a custom log to track failed documents. The most important metadata is the filing date which will be used to join the unstructured data with the target data. The program saves the output as a dictionary with each document specified by a concatenation of the ticker name and year as the primary key. " 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "\n", 270 | "\n", 271 | "\"\"\"\n", 272 | "Program processes downloaded 10ks with the following steps:\n", 273 | " (i) maps SEC CIK number to stock exchange tickers needed for later comparison to financial data\n", 274 | " (ii) Finds CIKs with two tickers to ensure 10k data stored for both\n", 275 | " (iii) Walks through directory of downloaded CIKs converting CIK to ticker label\n", 276 | " (iv) finds metadata section and extracts metadata for each 10k\n", 277 | " (v) Find main 10k body and uses regex to remove html and other tags \n", 278 | " before storing document as single string (10ks from 1997 have different\n", 279 | " and inconsistent formats so regex preferred to html parser)\n", 280 | " (vi) Ticker metadata added eg: sector, industry\n", 281 | " (vii) user defined log and error list per ticker created and stored\n", 282 | " (viii) Final output is dictionary with keys for log, errors and data.\n", 283 | " Data is a nested dictionary with keys equal to ticker_name concateded with\n", 284 | " year in label of 10k document, values are another dictionary including\n", 285 | " document metadata, ticker metadata, and processed 10k text as string \n", 286 | "\"\"\"\n", 287 | "\n", 288 | "import os\n", 289 | "import pickle\n", 290 | "import pandas as pd\n", 291 | "from datetime import datetime as dt\n", 292 | "from capstone_10k_functions import remove_html_tags_char\n", 293 | "\n", 294 | "t0 = dt.now()\n", 295 | "\n", 296 | "rootdir = 'XXX' #for looping through raw 10ks \n", 297 | "input_file_1 = 'meta_df_whole_db.pickle' #metadata \n", 298 | "output_file = '10k_clean_dict.pickle'\n", 299 | "\n", 300 | "\n", 301 | "with open(input_file_1, 'rb') as f1:\n", 302 | " v = pickle.load(f1)\n", 303 | " \n", 304 | "\n", 305 | "#create cik to ticker df\n", 306 | "df_cik2tic = pd.DataFrame(v.loc['CIK',:])\n", 307 | "df_cik2tic = 
df_cik2tic.reset_index()\n", 308 | "\n", 309 | "\n", 310 | "#find duplicate tickers for single CIK\n", 311 | "bool_series = df_cik2tic['CIK'].duplicated(keep=False)\n", 312 | "df_dup_cik = df_cik2tic[bool_series].sort_values(by='CIK')\n", 313 | "df_dup_cik['CIK'] = df_dup_cik['CIK'].apply(lambda x: x.lstrip('0'))\n", 314 | "dup_n = len(df_dup_cik)\n", 315 | "\n", 316 | "if dup_n % 2 != 0:\n", 317 | " print('Error: duplicate CIKs not an even number')\n", 318 | "else:\n", 319 | " pass\n", 320 | " \n", 321 | "index_list = [2*number for number in range(dup_n//2)]\n", 322 | "dict_dupes_cik2tic = {df_dup_cik.CIK.iloc[j]: (df_dup_cik.ticker.iloc[j], \n", 323 | " df_dup_cik.ticker.iloc[j+1]) for j in index_list}\n", 324 | "\n", 325 | "#Remove duplicates from primary cik to ticker df \n", 326 | "df_cik2tic = df_cik2tic[~bool_series]\n", 327 | "df_cik2tic['CIK'] = df_cik2tic['CIK'].apply(lambda x: x.lstrip('0'))\n", 328 | "df_cik2tic = df_cik2tic.set_index('CIK')\n", 329 | "\n", 330 | "\n", 331 | "#\n", 332 | "n = 0\n", 333 | "\n", 334 | "d_all = {} #final outout dict\n", 335 | "descr_list = [] #description list for live debugging\n", 336 | "error_list = [] #list for error logging\n", 337 | "\n", 338 | "#Walk thorugh 10k downloads for primary non-duplicated ciks\n", 339 | "for subdir, dirs, files in os.walk(rootdir):\n", 340 | " for file in files:\n", 341 | " \n", 342 | " t1 = dt.now() #start clock for each document \n", 343 | " \n", 344 | " #find cik number from filename\n", 345 | " subdir_str = str(subdir)\n", 346 | " start_sub= subdir_str.find('filings') + 8\n", 347 | " end_sub = subdir_str.find('10-K') - 1\n", 348 | " cik = subdir_str[start_sub:end_sub]\n", 349 | " \n", 350 | " #find year in name of document (label in file, may not reflect report yr)\n", 351 | " old_fname_str = str(file)\n", 352 | " start_fn = old_fname_str.find('-') +1\n", 353 | " end_fn = start_fn + 2\n", 354 | " year_fn = old_fname_str[start_fn:end_fn] \n", 355 | " \n", 356 | " #map cik to ticker for renaming & check if cik map unique\n", 357 | " try:\n", 358 | " ticker = df_cik2tic.loc[cik, 'ticker']\n", 359 | " key_all = ticker + '_' + year_fn\n", 360 | " dupe_flag = False\n", 361 | " except:\n", 362 | " list_ticker = dict_dupes_cik2tic[cik]\n", 363 | " key_all_0 = list_ticker[0] + '_' + year_fn\n", 364 | " key_all_1 = list_ticker[1] + '_' + year_fn\n", 365 | " dupe_flag = True \n", 366 | " \n", 367 | " #finally ready to open and work with document\n", 368 | " filename = os.path.join(subdir, file)\n", 369 | " \n", 370 | " n += 1 #counter for live print debugging\n", 371 | " \n", 372 | " try:\n", 373 | " with open (filename, 'r') as file: \n", 374 | " \n", 375 | " file_str = file.read() #read file into memory\n", 376 | " \n", 377 | " end_1 = file_str.find('2') #start section follow main 10k\n", 378 | " start = file_str.find('1') #start of 10k / end of metadata\n", 379 | " end = file_str.find('') #end of 10k if before end_1\n", 380 | " \n", 381 | " #Extract metadata from document\n", 382 | " meta_text = file_str[:start] #meta data section\n", 383 | " \n", 384 | " doc_metadata = ['ACCESSION NUMBER:', 'CONFORMED SUBMISSION TYPE:', #metadata labels in document (order important)\n", 385 | " 'PUBLIC DOCUMENT COUNT:', 'CONFORMED PERIOD OF REPORT:', \n", 386 | " 'FILED AS OF DATE:', 'DATE AS OF CHANGE:']\n", 387 | " \n", 388 | " doc_key_names = ['Accession_#', 'Type', 'Doc_Count','Period', #key names for metadata\n", 389 | " 'Filed_Date', 'Change_Date']\n", 390 | " \n", 391 | " pos_start = [meta_text.find(label) for label in 
doc_metadata] #start pos meta data label\n", 392 | " pos_end = [meta_text.find(label) + len(label) for label in doc_metadata] #end pos meta data label\n", 393 | " doc_meta_values = [meta_text[pos_end[j]:pos_start[j+1]].strip() #metadata value between end label and beg next label \n", 394 | " for j in range(len(doc_metadata)-1)] \n", 395 | " doc_meta_values[-1] = doc_meta_values[-1][:8] #last label manual as no next label\n", 396 | " \n", 397 | " #define 10k body and clean of html / text / xbrl etc.\n", 398 | " text = file_str[start:end_1] #define sequence 1\n", 399 | " text = text[:end] #end sequence 1 doc\n", 400 | " \n", 401 | " #remove html tags and special chars\n", 402 | " text = remove_html_tags_char(text)\n", 403 | " \n", 404 | " #create main dict with 10k metadata and 10k text as string\n", 405 | " d = dict(zip(doc_key_names, doc_meta_values))\n", 406 | " d.update({'Text': text})\n", 407 | " \n", 408 | " #add ticker sector / industry metadata to main dict\n", 409 | " ticker_meta_short = ['name', 'sicsector', 'sicindustry', 'famasector', #define metadata of interest\n", 410 | " 'famaindustry', 'sector', 'industry'] #for later categorical analysis\n", 411 | " df_meta_short = v[ticker][ticker_meta_short] #extract metadata \n", 412 | " d.update(df_meta_short.to_dict()) #add to main dict\n", 413 | "\n", 414 | " \n", 415 | " #treat for cik duplicate or not to populate final dict with\n", 416 | " #logs and errors\n", 417 | " if dupe_flag == False: #no dupe, write 10k to unique ticker\n", 418 | " d_all.update({key_all: d})\n", 419 | " \n", 420 | " t2 = dt.now()\n", 421 | " delta = t2 - t1\n", 422 | " descr = str(n) + ' : ' + key_all + ' : ' + str(delta.microseconds/1000000) #description for log\n", 423 | " descr_list.append(descr) #append to log\n", 424 | " print(descr) #print to screen for live record\n", 425 | " else:\n", 426 | " d_all.update({key_all_0: d, key_all_1: d}) #if dupe, write 10k to both tickers\n", 427 | " \n", 428 | " t2 = dt.now()\n", 429 | " delta = t2 - t1\n", 430 | " descr = str(n) + ' : ' + key_all_0 + ' | ' + key_all_1 + ' : ' + str(delta.microseconds/1000000) #description for log\n", 431 | " descr_list.append(descr) #append to log\n", 432 | " print(descr) #print to screen for live record\n", 433 | " \n", 434 | " #if metadata and 10k wrangle fails, record error \n", 435 | " except: \n", 436 | " \n", 437 | " try:\n", 438 | " if dupe_flag == False:\n", 439 | " error_list.append(key_all) #write error to list\n", 440 | " else: \n", 441 | " error_list.append(key_all_0 + ' | ' + key_all_1) #if fail on duplicate, make sure to record both tickers\n", 442 | " \n", 443 | " t2 = dt.now() \n", 444 | " delta = t2 - t1\n", 445 | " descr = str(n) + ' : ' + key_all + ' : ' + 'Error' #print to screen for live record of error\n", 446 | " descr_list.append(descr) #append to log\n", 447 | " print(descr)\n", 448 | " continue\n", 449 | " \n", 450 | " except:\n", 451 | " continue\n", 452 | " \n", 453 | "\n", 454 | "\n", 455 | "\n", 456 | "\n", 457 | "d_all.update({'log' : descr_list, 'error_codes': error_list}) #write log and errors to final dict\n", 458 | "\n", 459 | "with open(output_file, 'wb') as handle: #save final dict as pickle\n", 460 | " pickle.dump(d_all, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 461 | "\n", 462 | "t3 = dt.now() \n", 463 | "print(t3-t0) \n", 464 | "\n", 465 | "#runtime 30mins" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "The dictionary is converted to a data frame format where the concatenated reference 
is dropped and the \"tickers\" and \"Filed_Date\" columns become the unique identifiers." 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": null, 478 | "metadata": {}, 479 | "outputs": [], 480 | "source": [ 481 | "\n", 482 | "\"\"\"\n", 483 | "Convert dictionary of processed 10k statements dictionary with ticker as keys and value as a dataframe\n", 484 | "\"\"\"\n", 485 | "\n", 486 | "import pickle\n", 487 | "import pandas as pd\n", 488 | "\n", 489 | "input_file = '10k_clean_dict.pickle'\n", 490 | "output_file = '10k_clean_df.pickle'\n", 491 | "\n", 492 | "#Load clean dictionary \n", 493 | "with open(input_file, 'rb') as f:\n", 494 | " z = pickle.load(f).copy()\n", 495 | " \n", 496 | "#delete keys that are not related to k10 data \n", 497 | "del z['log'] \n", 498 | "del z['error_codes']\n", 499 | "\n", 500 | "#create dictionary with ticker as key and a list of all annual dictionaries as values\n", 501 | "d_temp = {}\n", 502 | "for k, v in z.items():\n", 503 | " end = k.find('_') #find ticker (eg: AAPL) from long key name (eg: AAPL_18)\n", 504 | " ticker = k[:end]\n", 505 | " \n", 506 | " if ticker not in list(d_temp.keys()): \n", 507 | " d_temp.update({ticker: [v]}) #if key hasn't appeared yet, initialise with list for value\n", 508 | " else:\n", 509 | " d_temp[ticker].append(v) #if key has appeared, append value to list\n", 510 | " \n", 511 | "\n", 512 | "#convert ticker dictionaries to dataframes \n", 513 | "df_final = pd.DataFrame()\n", 514 | "for k, v in d_temp.items(): \n", 515 | " df = pd.DataFrame.from_dict(v, orient='columns') \n", 516 | " df['ticker']= k #add ticker column\n", 517 | " df['file_month_date'] = pd.to_datetime(df['Filed_Date'], errors = 'coerce')\n", 518 | " df['file_month_date'] = df['file_month_date'] + pd.offsets.MonthEnd(0)\n", 519 | " #add column year of of statement plus 1\n", 520 | " df = df.sort_values(['ticker', 'file_month_date']) \n", 521 | " \n", 522 | " df_final = df_final.append(df) #set key (ticker) and value (df) for final dictionary\n", 523 | "\n", 524 | "\n", 525 | "with open(output_file, 'wb') as handle: #save final dictionary as pickle file\n", 526 | " pickle.dump(df_final, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 527 | " \n", 528 | "#runtime 10min" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "### (3) Merging the Datasets\n", 536 | "\n", 537 | "Merging the 10K and drawdown dataframes will result in a loss of information. for example, there will be some price tickers with price history but no recorded filings or with incomplete filings. 
The below code inner joins the dataframes on the ticker and Filed_Date columns and calculates the maximum of the drawdowns in the year following the filing as the target variable.\n", 538 | "\n", 539 | "The function for computing this drawdown is" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": null, 545 | "metadata": {}, 546 | "outputs": [], 547 | "source": [ 548 | "def find_max_dd_period(s, date1, date2, window=20):\n", 549 | "    \"\"\"finds the max drawdown of the 20 day rolling dd series between the dates\"\"\"\n", 550 | "    s_dd = pd.Series(s[window-1:].values, index=s.index[:-(window-1)], name=s.name)\n", 551 | "\n", 552 | "    mask = (s_dd.index > date1) & (s_dd.index <= date2)\n", 553 | "    \n", 554 | "    max_dd = s_dd[mask].min()\n", 555 | "    \n", 556 | "    return max_dd\n" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "The code for the merge is" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": null, 569 | "metadata": {}, 570 | "outputs": [], 571 | "source": [ 572 | "import pickle\n", 573 | "import pandas as pd\n", 574 | "from capstone_10k_functions import find_max_dd_period\n", 575 | "from pandas.tseries.offsets import DateOffset\n", 576 | "\n", 577 | "\n", 578 | "input_text = '10k_clean_df.pickle'\n", 579 | "input_dd = 'monthly_rolling_20d_dd_whole_db.pickle'\n", 580 | "\n", 581 | "output_file = 'dict_10k_matched_dd.pickle'\n", 582 | "\n", 583 | "\n", 584 | "\n", 585 | "with open(input_text, 'rb') as f_text:\n", 586 | "    df_text = pickle.load(f_text)\n", 587 | "\n", 588 | "#set to datetime format\n", 589 | "df_text['Filed_Date'] = pd.to_datetime(df_text['Filed_Date'], errors = 'coerce')\n", 590 | "#define 10k text df for later merging\n", 591 | "df_text_actual = df_text[['ticker', 'Filed_Date', 'Text']]\n", 592 | "df_text_actual.columns = ['ticker_', 'Filed_Date', 'Text']\n", 593 | "#non text data to carry through calcs before merge\n", 594 | "df_text = df_text[['ticker', 'Filed_Date', 'sector', 'sicsector']]\n", 595 | "\n", 596 | "#10K tickers to list\n", 597 | "tickers_text = set(df_text['ticker'].tolist()) #len = 4,482\n", 598 | "\n", 599 | "\n", 600 | "with open(input_dd, 'rb') as f_dd:\n", 601 | "    df_dd = pickle.load(f_dd)\n", 602 | "    \n", 603 | "#drawdown tickers to list\n", 604 | "tickers_dd = set(df_dd.columns.tolist()) #len = 16,973\n", 605 | "\n", 606 | "#find intersection of tickers across the dataframes\n", 607 | "tickers = tickers_text.intersection(tickers_dd) #len = 4,456\n", 608 | "tickers = list(tickers)\n", 609 | "\n", 610 | "\n", 611 | "#match 10k file date with max 20d dd over next 12 months\n", 612 | "counter=0\n", 613 | "#loop through tickers\n", 614 | "for code in tickers:\n", 615 | "    s_dd = df_dd[code]\n", 616 | "    \n", 617 | "    #event flag column\n", 618 | "    ticker_dd_flag = (s_dd.min() <= -0.8)*1\n", 619 | "    \n", 620 | "    df_10k = df_text[df_text.ticker == code].reset_index(drop=True)\n", 621 | "    df_10k.columns = ['ticker_', 'Filed_Date', 'sector', 'sicsector']\n", 622 | "    \n", 623 | "    #info for meta df\n", 624 | "    sector =df_10k['sector'][0]\n", 625 | "    sic_sector =df_10k['sicsector'][0]\n", 626 | "    #custom sector category\n", 627 | "    custom_sector = str(sector) + ' : ' + str(sic_sector)\n", 628 | "    \n", 629 | "    meta_dict = {'sector': sector, 'sic_sector': sic_sector, \n", 630 | "                 'custom_sector': custom_sector, 'ticker_dd_flag': \n", 631 | "                 ticker_dd_flag }\n", 632 | "    df_meta = pd.DataFrame(meta_dict, index = [code])\n", 633 | "    \n", 634 | "    #loop through
years\n", 635 | "    for row in range(len(df_10k)):\n", 636 | "        \n", 637 | "        #find max dd over next 1 and 2 years\n", 638 | "        start = df_10k.loc[row, 'Filed_Date']\n", 639 | "        end_1yr = start + DateOffset(months=12)\n", 640 | "        end_2yr = start + DateOffset(months=24)\n", 641 | "        max_dd_1yr = find_max_dd_period(s_dd, start, end_1yr, window=20)\n", 642 | "        max_dd_2yr = find_max_dd_period(s_dd, start, end_2yr, window=20)\n", 643 | "        df_10k.loc[row, 'max_dd_1yr'] = max_dd_1yr\n", 644 | "        df_10k.loc[row, 'max_dd_2yr'] = max_dd_2yr\n", 645 | "        df_10k.loc[row, 'year_dd_flag'] = (max_dd_1yr <= -0.8)*1\n", 646 | "    \n", 647 | "    #add_cumulative_year_dd_flag (incl)\n", 648 | "    df_10k['cum_year_dd_flag'] = df_10k['year_dd_flag'].expanding().max()\n", 649 | "    \n", 650 | "    df_10k = df_10k.dropna()\n", 651 | "    \n", 652 | "    #if empty then skip ticker\n", 653 | "    if df_10k.empty:\n", 654 | "        counter +=1\n", 655 | "        continue\n", 656 | "    else:\n", 657 | "        pass\n", 658 | "    #if first iteration, initialize df for concat over next loops\n", 659 | "    if counter == 0:\n", 660 | "        df_final = df_10k\n", 661 | "        df_meta_final = df_meta\n", 662 | "    else:\n", 663 | "        df_final = pd.concat([df_final, df_10k])\n", 664 | "        df_meta_final = pd.concat([df_meta_final, df_meta])\n", 665 | "    \n", 666 | "    counter += 1\n", 667 | "\n", 668 | "df_final = df_final.reset_index(drop=True)\n", 669 | "\n", 670 | "#\n", 671 | "df_final = df_final.merge(df_text_actual, on=['ticker_','Filed_Date'], how='inner')\n", 672 | "\n", 673 | "dict_final = {'matched_df_10k_dd': df_final, 'matched_df_10k_dd_meta': df_meta_final} #len = 4,365 tickers / 38,807 docs\n", 674 | "\n", 675 | "\n", 676 | "#save dict to pickle\n", 677 | "with open(output_file, 'wb') as handle: \n", 678 | "    pickle.dump(dict_final, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 679 | "\n", 680 | "\n" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "The next step is to convert the 10Ks from a text string in a column of a dataframe to a feature set in tf-idf matrix form, which makes use of the following function " 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "metadata": {}, 694 | "outputs": [], 695 | "source": [ 696 | "def vectorize_corpus(text_series, vectorizer_func, min_df, max_df, ngram_range):\n", 697 | "    '''vectorize corpus with specified vectorizer (tfidf or count) \n", 698 | "    and parameters'''\n", 699 | "    \n", 700 | "    vectorizer = vectorizer_func(min_df=min_df, max_df=max_df, ngram_range=ngram_range)\n", 701 | "    vectors = vectorizer.fit_transform(text_series)\n", 702 | "    feature_names = vectorizer.get_feature_names()\n", 703 | "\n", 704 | "    #wrap vectors in sparse dataframe and label\n", 705 | "    df = pd.DataFrame.sparse.from_spmatrix(vectors, columns = feature_names)\n", 706 | "    \n", 707 | "    #drop null columns\n", 708 | "    df_test = df[:5]\n", 709 | "    null_columns = df_test.columns[df_test.isnull().any()]\n", 710 | "    df = df.drop(null_columns, axis=1)\n", 711 | "    \n", 712 | "    dict_answer ={'df_wv': df, 'vectorizer': vectorizer}\n", 713 | "    \n", 714 | "    return dict_answer" 715 | ] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "metadata": {}, 720 | "source": [ 721 | "The code for the vectorization and formation of the training and test sets across the expanding time-series cross validation folds is given by " 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": null, 727 | "metadata": {}, 728 | "outputs": [], 729 | "source": [ 730 | "\"\"\"\n", 731 | "vectorize corpus and
create target-features matrix across validation folds \n", 732 | "using expanding windows for time series data\n", 733 | "\"\"\"\n", 734 | "\n", 735 | "\n", 736 | "import pickle\n", 737 | "import pandas as pd\n", 738 | "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", 739 | "from capstone_10k_functions import vectorize_corpus\n", 740 | "from datetime import datetime as dt\n", 741 | "\n", 742 | "t1 = dt.now()\n", 743 | "print(t1)\n", 744 | "\n", 745 | "\n", 746 | "input_file = 'dict_10k_matched_dd.pickle'\n", 747 | "\n", 748 | "\n", 749 | "vector_func = TfidfVectorizer \n", 750 | "func_name = 'TfidfVectorizer' #['TfidfVectorizer', 'CountVectorizer']\n", 751 | "\n", 752 | "hold_out_set_start = 2015\n", 753 | "\n", 754 | "k_ratio = 0.2\n", 755 | "min_df = 15\n", 756 | "min_df_grid = [min_df]\n", 757 | "\n", 758 | "max_df = 0.5\n", 759 | "ngram = (1,2)\n", 760 | "ngram_name = 'bigram'\n", 761 | "\n", 762 | "label_cv = ['cv1', 'cv2', 'cv3', 'cv4']\n", 763 | "\n", 764 | "\n", 765 | "\n", 766 | "with open(input_file , 'rb') as f:\n", 767 | " d_data = pickle.load(f)\n", 768 | "df = d_data['matched_df_10k_dd']\n", 769 | "\n", 770 | "df = df.sort_values(\"Filed_Date\")\n", 771 | "\n", 772 | "\n", 773 | "##Define validation sets\n", 774 | "mask_hold_out = df['Filed_Date'].dt.year >= hold_out_set_start\n", 775 | "df_v = df[~mask_hold_out]\n", 776 | "size = df_v.shape[0]\n", 777 | "n = int(k_ratio*size)\n", 778 | "k_stops = [n, 2*n, 3*n, 4*n, size]\n", 779 | "\n", 780 | "\n", 781 | " \n", 782 | "#Generate df master (word vector / vectorizer) sets for each cv fold\n", 783 | "\n", 784 | "\n", 785 | "for idx_cv, label in enumerate(label_cv):\n", 786 | " \n", 787 | " output_filename = label + '_' + func_name + '_' +'min_df_' + str(min_df) +'_' + ngram_name + '.pickle'\n", 788 | " dict_cv = {}\n", 789 | " \n", 790 | " print(label)\n", 791 | " \n", 792 | " stop_train = k_stops[idx_cv]\n", 793 | " stop_test = k_stops[idx_cv + 1]\n", 794 | " df_test = df[stop_train: stop_test]\n", 795 | " df_train = df[:stop_train ]\n", 796 | " \n", 797 | " #format training data\n", 798 | " df_train_text = df_train[['ticker_','Filed_Date', 'Text']]\n", 799 | " df_train_other = df_train.drop('Text', axis=1)\n", 800 | " df_train_other.columns = ['ticker_', 'Filed_Date', 'sector_', 'sic_sector', \n", 801 | " 'max_dd_1yr', 'max_dd_2yr', 'year_dd_flag', \n", 802 | " 'cum_year_dd_flag']\n", 803 | " df_train_other['custom_sector'] = str(df_train_other['sector_']) + ' : ' + str(df_train_other['sic_sector'])\n", 804 | "\n", 805 | " #format testing data\n", 806 | " df_test_text = df_test[['ticker_','Filed_Date', 'Text']]\n", 807 | " df_test_other = df_test.drop('Text', axis=1)\n", 808 | " df_test_other.columns = ['ticker_', 'Filed_Date', 'sector_', 'sic_sector', \n", 809 | " 'max_dd_1yr', 'max_dd_2yr', 'year_dd_flag', \n", 810 | " 'cum_year_dd_flag']\n", 811 | " df_test_other['custom_sector'] = str(df_test_other['sector_']) + ' : ' + str(df_test_other['sic_sector'])\n", 812 | " \n", 813 | "\n", 814 | " for min_df in min_df_grid: \n", 815 | " print(min_df)\n", 816 | " \n", 817 | " #name for cv dictionary specified by min_df value\n", 818 | " key_name = 'min_df_' + str(min_df)\n", 819 | " \n", 820 | " #vectorize corpus and assign word vector and vectorizer\n", 821 | " function = vectorize_corpus(df_train_text['Text'], vector_func, min_df, \n", 822 | " max_df,ngram)\n", 823 | " X = function['df_wv']\n", 824 | " vectorizer = function['vectorizer']\n", 825 | " \n", 826 | " #Transform training data into 
df_master format\n", 827 | "        vocab = X.columns.tolist()\n", 828 | "        X['Filed_Date'] = df_train_text['Filed_Date'].values\n", 829 | "        X['ticker_'] = df_train_text['ticker_'].values\n", 830 | "        \n", 831 | "        df_train_master = df_train_other.merge(X, on=['ticker_','Filed_Date'], how='inner')\n", 832 | "        \n", 833 | "        #Transform test data into df master format\n", 834 | "        arr_test_transform = vectorizer.transform(df_test_text['Text'])\n", 835 | "        df_test_transform = pd.DataFrame.sparse.from_spmatrix(arr_test_transform,\n", 836 | "                                                              columns = vocab)\n", 837 | "        df_test_transform['Filed_Date'] = df_test_text['Filed_Date'].values\n", 838 | "        df_test_transform['ticker_'] = df_test_text['ticker_'].values\n", 839 | "        \n", 840 | "        \n", 841 | "        df_test_master = df_test_other.merge(df_test_transform, \n", 842 | "                                             on=['ticker_','Filed_Date'], \n", 843 | "                                             how='inner')\n", 844 | "        \n", 845 | "        \n", 846 | "        dict_final = {'df_test_master': df_test_master, 'df_train_master': df_train_master}\n", 847 | "        \n", 848 | "        dict_cv[key_name] = dict_final\n", 849 | "        \n", 850 | "    with open(output_filename, 'wb') as handle: \n", 851 | "        pickle.dump(dict_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 852 | "        \n", 853 | "\n", 854 | "\n", 855 | "\n", 856 | "t2 = dt.now()\n", 857 | "print(t2)\n", 858 | "print(t2-t1)\n", 859 | " \n", 860 | " \n", 861 | " #runtime 2hrs30mins\n" 862 | ] 863 | }, 864 | { 865 | "cell_type": "code", 866 | "execution_count": null, 867 | "metadata": {}, 868 | "outputs": [], 869 | "source": [] 870 | } 871 | ], 872 | "metadata": { 873 | "kernelspec": { 874 | "display_name": "Python 3", 875 | "language": "python", 876 | "name": "python3" 877 | }, 878 | "language_info": { 879 | "codemirror_mode": { 880 | "name": "ipython", 881 | "version": 3 882 | }, 883 | "file_extension": ".py", 884 | "mimetype": "text/x-python", 885 | "name": "python", 886 | "nbconvert_exporter": "python", 887 | "pygments_lexer": "ipython3", 888 | "version": "3.7.6" 889 | } 890 | }, 891 | "nbformat": 4, 892 | "nbformat_minor": 4 893 | } 894 | -------------------------------------------------------------------------------- /public codebase (not executable)/CB4 Machine Learning Implementation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Codebase 4: Machine Learning Implementation\n", 8 | "\n", 9 | "Implementation is structured in the following parts, where sections 1 to 6 comprise cross-validation.\n", 10 | "\n", 11 | "1. Helper functions\n", 12 | "2. Undersampling\n", 13 | "3. Oversampling\n", 14 | "4. Weight Class\n", 15 | "5. Threshold Testing \n", 16 | "6. Ensembles\n", 17 | "7. Create Financial Ratio Features Target Matrix\n", 18 | "8. Generate Holdout Testing Sets\n", 19 | "9. Test Holdout Sets" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### 1.
Helper Functions\n", 27 | "\n", 28 | "There are four key helper functions in implementation: (i) time equalization preprocesses the sample to ensure that the annual ratio of events in the majority and minority classes are equal, (ii) undersampling, (iii) oversampling and (iv) a function to convert sparse dataframes to sparse matrices in chunks.\n", 29 | "\n", 30 | "First up is time equalization:" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "def equalize_year_ratio(df_train, ratio):\n", 40 | " '''preprocesses sample to set the annual ratio of negative events (majority class) \n", 41 | " equal to that of positive events (minority class)'''\n", 42 | " \n", 43 | " mask_pos = df_train['max_dd_1yr'] < -0.8\n", 44 | " df_year = df_train[['ticker_','Filed_Date']][mask_pos]\n", 45 | " df_year['year'] = df_year['Filed_Date'].dt.year\n", 46 | " df_pos_count = df_year['year'].value_counts()\n", 47 | " \n", 48 | " years = sorted(df_pos_count.index.tolist())\n", 49 | " \n", 50 | " df_train['year'] = df_train['Filed_Date'].dt.year\n", 51 | " \n", 52 | " #loop through years and randomly select n samples\n", 53 | " count=0\n", 54 | " for year in years:\n", 55 | " \n", 56 | " mask_year = df_train['year']==year\n", 57 | " df = df_train[mask_year]\n", 58 | " mask_neg = df['max_dd_1yr'] >=-0.8\n", 59 | " df = df[mask_neg]\n", 60 | " m = df.shape[0]\n", 61 | " n = int(ratio*df_pos_count[year])\n", 62 | " \n", 63 | " random_idx = random.sample(range(0,m), n)\n", 64 | " df = df.iloc[random_idx]\n", 65 | " if count == 0:\n", 66 | " df_final = df\n", 67 | " else:\n", 68 | " df_final = pd.concat([df_final,df])\n", 69 | " count +=1 \n", 70 | " \n", 71 | " df_train = df_train.drop('year', axis=1)\n", 72 | " df_neg = df_final.drop('year', axis=1)\n", 73 | " df_pos = df_train[mask_pos]\n", 74 | " \n", 75 | " df_train = pd.concat([df_neg, df_pos])\n", 76 | " df_train = df_train.sort_values(by='Filed_Date') \n", 77 | "\n", 78 | " return df_train " 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "Undersampling" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "def undersample_random(X_train, y_train, seed = 41):\n", 95 | " '''undersample negative events (majority class) to same size\n", 96 | " as positive events (minority class)'''\n", 97 | " \n", 98 | " random.seed(seed)\n", 99 | "\n", 100 | " #get indices and calculate class sizes\n", 101 | " idx_train_argsort= np.argsort(y_train)\n", 102 | " n_pos = y_train.sum()\n", 103 | " n_neg = y_train.shape[0] - n_pos\n", 104 | " n_min = min(n_pos, n_neg)\n", 105 | " \n", 106 | " #set minority class sets\n", 107 | " idx_pos_prelim = idx_train_argsort[-n_pos:]\n", 108 | " random_idx_pos = random.sample(range(0,n_pos), n_min)\n", 109 | " idx_pos = idx_pos_prelim[random_idx_pos]\n", 110 | " X_train_pos = X_train[idx_pos]\n", 111 | " y_train_pos = y_train[idx_pos]\n", 112 | " \n", 113 | " #undersample majority class without replacement\n", 114 | " idx_neg_prelim = idx_train_argsort[:n_neg]\n", 115 | " random_idx_neg = random.sample(range(0,n_neg), n_min)\n", 116 | " idx_neg = idx_neg_prelim[random_idx_neg]\n", 117 | " X_train_neg = X_train[idx_neg]\n", 118 | " y_train_neg = y_train[idx_neg]\n", 119 | "\n", 120 | " #join classes for training and testing sets \n", 121 | " X_train_balanced = vstack((X_train_pos,X_train_neg)) #X arrays assumed sparse\n", 
122 | " y_train_balanced = np.concatenate((y_train_pos,y_train_neg), axis=0) \n", 123 | " \n", 124 | " dict_answer = {'X_balanced': X_train_balanced , 'y_balanced': y_train_balanced}\n", 125 | " \n", 126 | " return dict_answer" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "and Oversampling:" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "def oversample_random(X_train, y_train, seed = 41, flag=False, n='define'):\n", 143 | " '''oversmaple positive events (minority class) to same size\n", 144 | " as negative events (majority class)'''\n", 145 | " \n", 146 | " random.seed(seed)\n", 147 | " \n", 148 | " #get indices and calculate class sizes\n", 149 | " idx_train_argsort= np.argsort(y_train)\n", 150 | " n_pos = y_train.sum()\n", 151 | " n_neg = y_train.shape[0] - n_pos\n", 152 | " \n", 153 | " #Default to size of largest sample unless set in argument\n", 154 | " if flag==True:\n", 155 | " n_max = n\n", 156 | " else: \n", 157 | " n_max = max(n_pos, n_neg)\n", 158 | " \n", 159 | " #oversample minority class with replacement\n", 160 | " idx_pos_prelim = idx_train_argsort[-n_pos:]\n", 161 | " random_idx_pos = np.random.choice(range(0,n_pos), n_max)\n", 162 | " idx_pos = idx_pos_prelim[random_idx_pos]\n", 163 | " X_train_pos = X_train[idx_pos]\n", 164 | " y_train_pos = y_train[idx_pos]\n", 165 | " \n", 166 | " #randomly sample majority class without replacement to required size\n", 167 | " idx_neg_prelim = idx_train_argsort[:n_neg]\n", 168 | " random_idx_neg = random.sample(range(0,n_neg), n_max)\n", 169 | " idx_neg = idx_neg_prelim[random_idx_neg]\n", 170 | " X_train_neg = X_train[idx_neg]\n", 171 | " y_train_neg = y_train[idx_neg]\n", 172 | "\n", 173 | " #join classes for training and testing sets \n", 174 | " try:\n", 175 | " X_train_balanced = vstack((X_train_pos,X_train_neg)) #if X arrays sparse\n", 176 | " except:\n", 177 | " X_train_balanced = np.concatenate((X_train_pos,X_train_neg), axis=0)\n", 178 | " \n", 179 | " y_train_balanced = np.concatenate((y_train_pos,y_train_neg), axis=0)\n", 180 | " \n", 181 | " dict_answer = {'X_balanced': X_train_balanced , 'y_balanced': y_train_balanced}\n", 182 | " \n", 183 | " return dict_answer" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "Converting sparse dataframe to sparse matrix in chunks:" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "def convert_df_values_csc_chunk(df):\n", 200 | " '''converts dataframe to sparse matrix in chunks'''\n", 201 | " \n", 202 | " #calculate number loops for 3,000 chunk size\n", 203 | " rows = df.shape[0]\n", 204 | " if rows < 3000:\n", 205 | " num = rows - 1\n", 206 | " else:\n", 207 | " num = 3000\n", 208 | " \n", 209 | " loops = rows // num\n", 210 | " stub_start = num * loops \n", 211 | "\n", 212 | " #process loops \n", 213 | " for j in range(1, 1+ loops):\n", 214 | " arr = df[(j-1)*num: j*num].values\n", 215 | " arr = np.nan_to_num(arr)\n", 216 | " mat_csr = csr_matrix(arr)\n", 217 | " if j == 1:\n", 218 | " answer = mat_csr\n", 219 | " else:\n", 220 | " answer = vstack([answer, mat_csr])\n", 221 | " \n", 222 | " #process end stub \n", 223 | " arr_stub = df[stub_start:].values\n", 224 | " arr_stub = np.nan_to_num(arr_stub) \n", 225 | " mat_csr_stub = csr_matrix(arr_stub)\n", 226 | " \n", 227 | " 
#join stub to rest\n", 228 | "    answer = vstack([answer, mat_csr_stub])\n", 229 | "    \n", 230 | "    return answer" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "### 2. Undersampling" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "As the method with the least computation, undersampling is where we do the heavy lifting to find the core model. This is where we choose min_df, decide on random or time equalized sampling and investigate whether sector dummy variables improve the model.\n", 245 | "\n", 246 | "We start with choosing min_df by looking at performance over a random selection of samples for various min_df values: " 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "import pickle\n", 256 | "import numpy as np\n", 257 | "import pandas as pd\n", 258 | "import random\n", 259 | "from scipy.sparse import vstack\n", 260 | "from sklearn.linear_model import LogisticRegression\n", 261 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", 262 | "from sklearn.preprocessing import normalize\n", 263 | "from sklearn.metrics import classification_report, confusion_matrix, recall_score\n", 264 | "from capstone_10k_functions import convert_df_values_csc_chunk\n", 265 | "\n", 266 | "#inputs\n", 267 | "label_cv = ['cv1', 'cv2', 'cv3', 'cv4'] #['cv1', 'cv2', 'cv3', 'cv4']\n", 268 | "vector_func = 'TfidfVectorizer'\n", 269 | "ngram = 'unigram'\n", 270 | "min_df = 25\n", 271 | "x_drop_columns = ['Filed_Date', 'ticker_','sector_', 'sic_sector', 'max_dd_1yr','max_dd_2yr', \n", 272 | "                  'year_dd_flag', 'cum_year_dd_flag', 'custom_sector']\n", 273 | "models = [GradientBoostingClassifier(random_state=41),\n", 274 | "          RandomForestClassifier(n_estimators=100, bootstrap = True, \n", 275 | "                                 max_features = 'sqrt'),\n", 276 | "          LogisticRegression(random_state=41)]\n", 277 | "method = 'undersample_random' #describes method for output name\n", 278 | "model_names = ['grad_boost', 'random_forest', 'log_reg' ]\n", 279 | "output_filename= 'dict_cv_' + method +'_' + vector_func + '_' + 'min_df_' + str(min_df) + '_' + ngram + '.pickle'\n", 280 | "\n", 281 | "#calculations\n", 282 | "\n", 283 | "dict_cv= {}\n", 284 | "\n", 285 | "for label in label_cv:\n", 286 | "    \n", 287 | "    print(label)\n", 288 | "    \n", 289 | "    #open dataframe files for cv set\n", 290 | "    dict_cv[label] = {}\n", 291 | "    filename = label + '_' + vector_func + '_' +'min_df_' + str(min_df) + '_' + ngram + '.pickle' \n", 292 | "    d_cv = pd.read_pickle(filename)\n", 293 | "    df_train = d_cv['min_df_' + str(min_df)]['df_train_master']\n", 294 | "    df_test = d_cv['min_df_' + str(min_df)]['df_test_master']\n", 295 | "    \n", 296 | "    #define X and y dataframes\n", 297 | "    df_x_train = df_train.drop(x_drop_columns, axis=1)\n", 298 | "    df_y_train= (df_train['max_dd_1yr'] <= -0.8)*1\n", 299 | "    df_x_test = df_test.drop(x_drop_columns, axis=1)\n", 300 | "    df_y_test= (df_test['max_dd_1yr'] <= -0.8)*1\n", 301 | "\n", 302 | "    #convert to X and Y arrays\n", 303 | "    X_train = convert_df_values_csc_chunk(df_x_train)\n", 304 | "    y_train = df_y_train.values\n", 305 | "    X_test = convert_df_values_csc_chunk(df_x_test)\n", 306 | "    y_test = df_y_test.values\n", 307 | "\n", 308 | "    #normalize (row wise so not strictly necessary to process this way\n", 309 | "    #but kept in format for generalization to other preprocessing)\n", 310 | "    n_test =
y_test.shape[0]\n", 311 | " X_train_test = vstack((X_train, X_test ))\n", 312 | " X_train = normalize(X_train)\n", 313 | " X_test = normalize(X_train_test)[-n_test:] \n", 314 | " \n", 315 | " #undersample\n", 316 | " dict_bal = undersample_random(X_train, y_train, seed = 41)\n", 317 | " X_train_balanced = dict_bal['X_balanced']\n", 318 | " y_train_balanced = dict_bal['y_balanced'] \n", 319 | " \n", 320 | " #train model\n", 321 | " for idx, model_func in enumerate(models):\n", 322 | " model_name = model_names[idx]\n", 323 | " model =model_func\n", 324 | " model.fit(X_train_balanced, y_train_balanced)\n", 325 | " y_pred = model.predict(X_test)\n", 326 | " y_proba = model.predict_proba(X_test)\n", 327 | " \n", 328 | " cm = confusion_matrix(y_test, y_pred, labels=[0,1])\n", 329 | " report = classification_report(y_test, y_pred,output_dict=True)\n", 330 | " recall = recall_score(y_test, y_pred, average='macro')\n", 331 | " \n", 332 | " dict_cv[label][model_name] = {'y_test': y_test, 'y_pred': y_pred,\n", 333 | " 'y_proba': y_proba, 'conf_matrix':cm,\n", 334 | " 'class_report': report, \n", 335 | " 'macro_recall':recall}\n", 336 | " \n", 337 | " print(model_name, ' : ', \"{:.2f}\".format(recall))\n", 338 | " \n", 339 | " with open(output_filename, 'wb') as handle: \n", 340 | " pickle.dump(dict_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "Before considering time-equalization:" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "import pickle\n", 357 | "import numpy as np\n", 358 | "import pandas as pd\n", 359 | "import random\n", 360 | "from scipy.sparse import vstack\n", 361 | "from sklearn.linear_model import LogisticRegression\n", 362 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", 363 | "from sklearn.preprocessing import normalize\n", 364 | "from sklearn.metrics import classification_report, confusion_matrix, recall_score\n", 365 | "from capstone_10k_functions import convert_df_values_csc_chunk\n", 366 | "\n", 367 | "#inputs\n", 368 | "ratio = 5\n", 369 | "label_cv = ['cv1', 'cv2', 'cv3', 'cv4'] #['cv1', 'cv2', 'cv3', 'cv4']\n", 370 | "vector_func = 'TfidfVectorizer'\n", 371 | "ngram = 'unigram'\n", 372 | "min_df = 25\n", 373 | "x_drop_columns = ['Filed_Date', 'ticker_','sector_', 'sic_sector', 'max_dd_1yr','max_dd_2yr', \n", 374 | " 'year_dd_flag', 'cum_year_dd_flag', 'custom_sector']\n", 375 | "\n", 376 | "models = [GradientBoostingClassifier(random_state=41),\n", 377 | " RandomForestClassifier(n_estimators=100, bootstrap = True, \n", 378 | " max_features = 'sqrt'),\n", 379 | " LogisticRegression(random_state=41)]\n", 380 | "method = 'undersample_equal_num'\n", 381 | "model_names = ['grad_boost', 'random_forest', 'log_reg' ]\n", 382 | "output_filename= 'dict_cv_' + method +'_' + vector_func + '_' + 'min_df_' + str(min_df) + '_' + ngram + '.pickle'\n", 383 | "\n", 384 | "#calculations\n", 385 | "dict_cv= {}\n", 386 | "\n", 387 | "for label in label_cv:\n", 388 | " \n", 389 | " print(label)\n", 390 | " #open dataframe files for cv set\n", 391 | " dict_cv[label] = {}\n", 392 | " filename = label + '_' + vector_func + '_' +'min_df_' + str(min_df) + '_' + ngram + '.pickle'\n", 393 | " d_cv = pd.read_pickle(filename) \n", 394 | " df_train = d_cv['min_df_' + str(min_df)]['df_train_master']\n", 395 | " df_test = d_cv['min_df_' + 
str(min_df)]['df_test_master']\n", 396 | " \n", 397 | " #equal number algo\n", 398 | " df_train = equalize_year_ratio(df_train, ratio)\n", 399 | " \n", 400 | " #define X and y dataframes\n", 401 | " df_x_train = df_train.drop(x_drop_columns, axis=1)\n", 402 | " df_y_train= (df_train['max_dd_1yr'] <= -0.8)*1 \n", 403 | " df_x_test = df_test.drop(x_drop_columns, axis=1)\n", 404 | " df_y_test= (df_test['max_dd_1yr'] <= -0.8)*1\n", 405 | "\n", 406 | " #convert to X and Y arrays\n", 407 | " X_train = convert_df_values_csc_chunk(df_x_train)\n", 408 | " y_train = df_y_train.values\n", 409 | " X_test = convert_df_values_csc_chunk(df_x_test)\n", 410 | " y_test = df_y_test.values\n", 411 | "\n", 412 | " #normalize (row wise so not strictly necesarry to process this way\n", 413 | " #but kept in format for generalization to other preprocessing)\n", 414 | " n_test = y_test.shape[0]\n", 415 | " X_train_test = vstack((X_train, X_test ))\n", 416 | " X_train = normalize(X_train)\n", 417 | " X_test = normalize(X_train_test)[-n_test:] \n", 418 | " \n", 419 | " #undersample\n", 420 | " dict_bal = undersample_random(X_train, y_train, seed = 41)\n", 421 | " X_train_balanced = dict_bal['X_balanced']\n", 422 | " y_train_balanced = dict_bal['y_balanced'] \n", 423 | " \n", 424 | " #train model\n", 425 | " for idx, model_func in enumerate(models):\n", 426 | " model_name = model_names[idx]\n", 427 | " model =model_func\n", 428 | " model.fit(X_train_balanced, y_train_balanced)\n", 429 | " y_pred = model.predict(X_test)\n", 430 | " y_proba = model.predict_proba(X_test)\n", 431 | " \n", 432 | " cm = confusion_matrix(y_test, y_pred, labels=[0,1])\n", 433 | " report = classification_report(y_test, y_pred,output_dict=True)\n", 434 | " recall = recall_score(y_test, y_pred, average='macro')\n", 435 | " \n", 436 | " dict_cv[label][model_name] = {'y_test': y_test, 'y_pred': y_pred,\n", 437 | " 'y_proba': y_proba, 'conf_matrix':cm,\n", 438 | " 'class_report': report, \n", 439 | " 'macro_recall':recall}\n", 440 | " \n", 441 | " print(model_name, ' : ', \"{:.2f}\".format(recall))\n", 442 | " \n", 443 | " with open(output_filename, 'wb') as handle: \n", 444 | " pickle.dump(dict_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "And sector dummy variables:" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "import pickle\n", 461 | "import numpy as np\n", 462 | "import pandas as pd\n", 463 | "import random\n", 464 | "from scipy.sparse import csr_matrix, vstack, hstack\n", 465 | "from sklearn.linear_model import LogisticRegression\n", 466 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", 467 | "from sklearn.preprocessing import normalize\n", 468 | "from sklearn.metrics import classification_report, confusion_matrix, recall_score\n", 469 | "from capstone_10k_functions import convert_df_values_csc_chunk\n", 470 | "\n", 471 | "#inputs\n", 472 | "ratio = 5\n", 473 | "label_cv = ['cv1', 'cv2', 'cv3', 'cv4'] #['cv1', 'cv2', 'cv3', 'cv4']\n", 474 | "vector_func = 'TfidfVectorizer'\n", 475 | "ngram = 'unigram'\n", 476 | "min_df = 25\n", 477 | "x_drop_columns = ['Filed_Date', 'ticker_','sector_', 'sic_sector', 'max_dd_1yr','max_dd_2yr', \n", 478 | " 'year_dd_flag', 'cum_year_dd_flag', 'custom_sector']\n", 479 | "models = [GradientBoostingClassifier(random_state=41),\n", 480 | " 
RandomForestClassifier(n_estimators=100, bootstrap = True, \n", 481 | " max_features = 'sqrt'),\n", 482 | " LogisticRegression(random_state=41)]\n", 483 | "method = 'undersample_equal_num_DUMMIES'\n", 484 | "model_names = ['grad_boost', 'random_forest', 'log_reg' ]\n", 485 | "\n", 486 | "output_filename= 'dict_cv_' + method +'_' + vector_func + '_' + 'min_df_' + str(min_df) + '_' + ngram + '.pickle'\n", 487 | "\n", 488 | "#calculations\n", 489 | "dict_cv= {}\n", 490 | "\n", 491 | "for label in label_cv:\n", 492 | " \n", 493 | " print(label)\n", 494 | " #open dataframe files for cv set\n", 495 | " dict_cv[label] = {}\n", 496 | " filename = label + '_' + vector_func + '_' +'min_df_' + str(min_df) + '_' + ngram + '.pickle'\n", 497 | " d_cv = pd.read_pickle(filename)\n", 498 | " df_train = d_cv['min_df_' + str(min_df)]['df_train_master']\n", 499 | " df_test = d_cv['min_df_' + str(min_df)]['df_test_master']\n", 500 | " \n", 501 | " #equal number algo\n", 502 | " df_train = equalize_year_ratio(df_train, ratio)\n", 503 | " \n", 504 | " #get_dummies\n", 505 | " df_dummies_train = pd.get_dummies(df_train['sector_'], prefix='_').astype(int)\n", 506 | " df_dummies_test = pd.get_dummies(df_test['sector_'], prefix='_').astype(int)\n", 507 | " #make sure they have same columns\n", 508 | " dummies_train = set(df_dummies_train.columns)\n", 509 | " dummies_test = set(df_dummies_test.columns)\n", 510 | " union_dummies = dummies_train.union(dummies_test)\n", 511 | " add_dummies_train = union_dummies - dummies_train\n", 512 | " add_dummies_test = union_dummies - dummies_test\n", 513 | " #make sure train and test sets have same number cols \n", 514 | " if len(add_dummies_train) == 0:\n", 515 | " pass\n", 516 | " else:\n", 517 | " for dummy in add_dummies_train:\n", 518 | " df_dummies_train[dummy] = 0\n", 519 | " \n", 520 | " if len(add_dummies_test) == 0:\n", 521 | " pass\n", 522 | " else:\n", 523 | " for dummy in add_dummies_test:\n", 524 | " df_dummies_test[dummy] = 0\n", 525 | " #convert to sparse matrices\n", 526 | " csr_dummies_train = csr_matrix(df_dummies_train.values)\n", 527 | " csr_dummies_test = csr_matrix(df_dummies_test.values)\n", 528 | " \n", 529 | " #define X and y dataframes\n", 530 | " df_x_train = df_train.drop(x_drop_columns, axis=1)\n", 531 | " df_y_train= (df_train['max_dd_1yr'] <= -0.8)*1\n", 532 | " df_x_test = df_test.drop(x_drop_columns, axis=1)\n", 533 | " df_y_test= (df_test['max_dd_1yr'] <= -0.8)*1\n", 534 | " \n", 535 | " #convert to X and Y arrays\n", 536 | " X_train = convert_df_values_csc_chunk(df_x_train)\n", 537 | " X_train = hstack((X_train, csr_dummies_train))\n", 538 | " y_train = df_y_train.values\n", 539 | " X_test = convert_df_values_csc_chunk(df_x_test)\n", 540 | " X_test = hstack((X_test, csr_dummies_test))\n", 541 | " y_test = df_y_test.values\n", 542 | "\n", 543 | " #normalize (row wise so not strictly necesarry to process this way\n", 544 | " #but kept in format for generalization to other preprocessing)\n", 545 | " n_test = y_test.shape[0]\n", 546 | " X_train_test = vstack((X_train, X_test ))\n", 547 | " X_train = normalize(X_train)\n", 548 | " X_test = normalize(X_train_test)[-n_test:] \n", 549 | " \n", 550 | " #undersample\n", 551 | " dict_bal = undersample_random(X_train, y_train, seed = 41)\n", 552 | " X_train_balanced = dict_bal['X_balanced']\n", 553 | " y_train_balanced = dict_bal['y_balanced'] \n", 554 | " \n", 555 | " #train model\n", 556 | " for idx, model_func in enumerate(models):\n", 557 | " model_name = model_names[idx]\n", 558 | " model 
=model_func\n", 559 | "            model.fit(X_train_balanced, y_train_balanced)\n", 560 | "            y_pred = model.predict(X_test)\n", 561 | "            y_proba = model.predict_proba(X_test)\n", 562 | "            \n", 563 | "            cm = confusion_matrix(y_test, y_pred, labels=[0,1])\n", 564 | "            report = classification_report(y_test, y_pred,output_dict=True)\n", 565 | "            recall = recall_score(y_test, y_pred, average='macro')\n", 566 | "            dict_cv[label][model_name] = {'y_test': y_test, 'y_pred': y_pred, 'y_proba': y_proba, 'conf_matrix':cm, 'class_report': report, \n", 567 | "                                          'macro_recall':recall}\n", 568 | "            \n", 569 | "            print(model_name, ' : ', \"{:.2f}\".format(recall))\n", 570 | "    \n", 571 | "    with open(output_filename, 'wb') as handle: \n", 572 | "        pickle.dump(dict_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "### 3. Oversampling\n", 580 | "\n", 581 | "We apply the core model (min_df=25, time equalization and no sector dummy variables) to oversampling. Given the 1:20 data imbalance, we test a more computationally practical n = 2, 3 and 5, where n x (minority class size for unique events) sets the size of the samples:" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": null, 587 | "metadata": {}, 588 | "outputs": [], 589 | "source": [ 590 | "\n", 591 | "\n", 592 | "import pickle\n", 593 | "import numpy as np\n", 594 | "import pandas as pd\n", 595 | "import random\n", 596 | "from scipy.sparse import vstack\n", 597 | "from sklearn.linear_model import LogisticRegression\n", 598 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", 599 | "from sklearn.naive_bayes import MultinomialNB\n", 600 | "from sklearn.preprocessing import normalize\n", 601 | "from sklearn.metrics import classification_report, confusion_matrix, recall_score\n", 602 | "from capstone_10k_functions import convert_df_values_csc_chunk\n", 603 | "\n", 604 | "#inputs\n", 605 | "pd.set_option('mode.chained_assignment', None) \n", 606 | "ratio = 5\n", 607 | "label_cv = ['cv1', 'cv2', 'cv3', 'cv4'] #['cv1', 'cv2', 'cv3', 'cv4']\n", 608 | "vector_func = 'TfidfVectorizer'\n", 609 | "ngram = 'unigram'\n", 610 | "min_df = 25\n", 611 | "over_sample_amount = [2, 3, 5]\n", 612 | "x_drop_columns = ['Filed_Date', 'ticker_','sector_', 'sic_sector', 'max_dd_1yr','max_dd_2yr', \n", 613 | "                  'year_dd_flag', 'cum_year_dd_flag', 'custom_sector']\n", 614 | "models = [GradientBoostingClassifier(random_state=41),\n", 615 | "          RandomForestClassifier(n_estimators=100, bootstrap = True, \n", 616 | "                                 max_features = 'sqrt'),\n", 617 | "          LogisticRegression(random_state=41), MultinomialNB()]\n", 618 | "method = 'oversample_equal_num'\n", 619 | "model_names = ['grad_boost', 'random_forest', 'log_reg', 'NBayes' ]\n", 620 | "\n", 621 | "\n", 622 | "#create dictionary structure\n", 623 | "dict_cv = {}\n", 624 | "for label in label_cv:\n", 625 | "    dict_cv.update({label: {}})\n", 626 | "    for number in over_sample_amount:\n", 627 | "        dict_cv[label].update({'number'+ str(number):{}})\n", 628 | "        for model in model_names:\n", 629 | "            dict_cv[label]['number'+ str(number)].update({model: {}})\n", 630 | "\n", 631 | "\n", 632 | "for label in label_cv:\n", 633 | "    \n", 634 | "    print(label)\n", 635 | "    #open dataframe files for cv set\n", 636 | "    filename = label + '_' + vector_func + '_' +'min_df_' + str(min_df) + '_' + ngram + '.pickle'\n", 637 | "    d_cv = pd.read_pickle(filename)\n", 638 | "    df_train = d_cv['min_df_' + str(min_df)]['df_train_master']\n", 639 | "    df_test = d_cv['min_df_' + str(min_df)]['df_test_master']\n", 640 | "    \n", 641 | "    #Equal number algo\n",
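For readers of this excerpt: `equalize_year_ratio`, called just below, is a helper defined elsewhere in this codebase and not shown here. The following is a minimal, hypothetical sketch of its behaviour, modelled on the inline "Equal number algo" used in the holdout tests of Sections 8-9 further down (the name and signature come from this notebook; the body is an assumption):

```python
# Hypothetical sketch only -- the real helper may differ in detail.
import pandas as pd

def equalize_year_ratio(df_train, ratio, seed=41):
    """Per filing year, keep every positive event (max_dd_1yr <= -0.8) and a
    random sample of ratio * (that year's positive count) negative events."""
    df = df_train.copy()
    df['year'] = df['Filed_Date'].dt.year
    mask_pos = df['max_dd_1yr'] <= -0.8
    pos_counts = df.loc[mask_pos, 'year'].value_counts()

    sampled_neg = []
    for year, n_pos in pos_counts.items():
        df_year_neg = df[(df['year'] == year) & ~mask_pos]
        n = min(int(ratio * n_pos), df_year_neg.shape[0])
        sampled_neg.append(df_year_neg.sample(n=n, random_state=seed))

    df_out = pd.concat([df[mask_pos]] + sampled_neg)
    return df_out.drop('year', axis=1).sort_values(by='Filed_Date')
```

The effect is to cap each training year's negatives at `ratio` times that year's positive count, so that no single year dominates the time-equalized training set.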
642 | " df_train = equalize_year_ratio(df_train, ratio)\n", 643 | " \n", 644 | " #define X and y dataframes\n", 645 | " df_x_train = df_train.drop(x_drop_columns, axis=1)\n", 646 | " df_y_train= (df_train['max_dd_2yr'] <= -0.8)*1\n", 647 | " df_x_test = df_test.drop(x_drop_columns, axis=1)\n", 648 | " df_y_test= (df_test['max_dd_2yr'] <= -0.8)*1\n", 649 | "\n", 650 | " #convert to X and Y arrays\n", 651 | " X_train = convert_df_values_csc_chunk(df_x_train)\n", 652 | " y_train = df_y_train.values\n", 653 | " X_test = convert_df_values_csc_chunk(df_x_test)\n", 654 | " y_test = df_y_test.values \n", 655 | "\n", 656 | " #normalize (row wise so not strictly necesarry to process this way\n", 657 | " #but kept in format for generalization to other preprocessing)\n", 658 | " n_test = y_test.shape[0]\n", 659 | " X_train_test = vstack((X_train, X_test ))\n", 660 | " X_train = normalize(X_train)\n", 661 | " X_test = normalize(X_train_test)[-n_test:] \n", 662 | " \n", 663 | " #oversample\n", 664 | " for number in over_sample_amount:\n", 665 | " \n", 666 | " key_number = 'number'+ str(number) \n", 667 | " n = int(number*n_pos)\n", 668 | " \n", 669 | " dict_bal = oversample_random(X_train, y_train, seed = 41, flag=True, n=n)\n", 670 | " X_train_balanced = dict_bal['X_balanced']\n", 671 | " y_train_balanced = dict_bal['y_balanced'] \n", 672 | "\n", 673 | " #train model\n", 674 | " for idx, model_func in enumerate(models):\n", 675 | " model_name = model_names[idx]\n", 676 | " model =model_func\n", 677 | " model.fit(X_train_balanced, y_train_balanced)\n", 678 | " y_pred = model.predict(X_test)\n", 679 | " y_proba = model.predict_proba(X_test)\n", 680 | " \n", 681 | " cm = confusion_matrix(y_test, y_pred, labels=[0,1])\n", 682 | " report = classification_report(y_test, y_pred,output_dict=True)\n", 683 | " recall = recall_score(y_test, y_pred, average='macro')\n", 684 | " \n", 685 | " dict_cv[label][key_number][model_name] = {'y_test': y_test, 'y_pred': y_pred,\n", 686 | " 'y_proba': y_proba, 'conf_matrix':cm,\n", 687 | " 'class_report': report, \n", 688 | " 'macro_recall':recall}\n", 689 | " \n", 690 | " print(model_name, ' : ', \"{:.2f}\".format(recall))\n", 691 | " \n", 692 | " \n", 693 | "output_filename= 'dict_cv_' + method +'_' + vector_func + '_' + 'min_df_' + str(min_df) + '_number_ratios_' + ngram + '.pickle'\n", 694 | "with open(output_filename, 'wb') as handle: \n", 695 | " pickle.dump(dict_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 696 | "\n" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": {}, 702 | "source": [ 703 | "### 4. 
Weight Classes\n", 704 | "\n", 705 | "We use SK-Learn's Random Forest to see how balanced weight class performs for the problem:" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": null, 711 | "metadata": {}, 712 | "outputs": [], 713 | "source": [ 714 | "import pickle\n", 715 | "import pandas as pd\n", 716 | "import random\n", 717 | "from scipy.sparse import csr_matrix, vstack, hstack\n", 718 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", 719 | "from sklearn.preprocessing import normalize\n", 720 | "from sklearn.metrics import classification_report, confusion_matrix, recall_score\n", 721 | "from capstone_10k_functions import convert_df_values_csc_chunk\n", 722 | "\n", 723 | "pd.set_option('mode.chained_assignment', None)\n", 724 | "\n", 725 | "#inputs\n", 726 | "ratio = 5\n", 727 | "label_cv = ['cv1', 'cv2', 'cv3', 'cv4'] #['cv1', 'cv2', 'cv3', 'cv4']\n", 728 | "vector_func = 'TfidfVectorizer' \n", 729 | "ngram = 'unigram'\n", 730 | "min_df = 25\n", 731 | "x_drop_columns = ['Filed_Date', 'ticker_','sector_', 'sic_sector', 'max_dd_1yr','max_dd_2yr', \n", 732 | " 'year_dd_flag', 'cum_year_dd_flag', 'custom_sector']\n", 733 | "models = [GradientBoostingClassifier(random_state=41),\n", 734 | " RandomForestClassifier(n_estimators=100, bootstrap = True, \n", 735 | " max_features = 'sqrt'),\n", 736 | " RandomForestClassifier(n_estimators=100, bootstrap = True, \n", 737 | " max_features = 'sqrt', class_weight='balanced')]\n", 738 | "method = 'class_weights_equal_num'\n", 739 | "model_names = ['grad_boost', 'random_forest_balanced_cw_none',\n", 740 | " 'random_forest_cw_balanced']\n", 741 | "output_filename= 'dict_cv_' + method +'_' + vector_func + '_' + 'min_df_' + str(min_df) + '_' + ngram + '.pickle'\n", 742 | "\n", 743 | "#calculations\n", 744 | "dict_cv= {}\n", 745 | "\n", 746 | "for label in label_cv:\n", 747 | " \n", 748 | " print(label)\n", 749 | " #open dataframe files for cv set\n", 750 | " dict_cv[label] = {}\n", 751 | " filename = label + '_' + vector_func + '_' +'min_df_' + str(min_df) + '_' + ngram + '.pickle'\n", 752 | " d_cv = pd.read_pickle(filename)\n", 753 | " df_train = d_cv['min_df_' + str(min_df)]['df_train_master']\n", 754 | " df_test = d_cv['min_df_' + str(min_df)]['df_test_master']\n", 755 | " \n", 756 | " #equal number algo\n", 757 | " df_train = equalize_year_ratio(df_train, ratio)\n", 758 | " \n", 759 | " #define X and y dataframes\n", 760 | " df_x_train = df_train.drop(x_drop_columns, axis=1)\n", 761 | " df_y_train= (df_train['max_dd_1yr'] <= -0.8)*1\n", 762 | " df_x_test = df_test.drop(x_drop_columns, axis=1)\n", 763 | " df_y_test= (df_test['max_dd_1yr'] <= -0.8)*1\n", 764 | "\n", 765 | " #convert to X and Y arrays\n", 766 | " X_train = convert_df_values_csc_chunk(df_x_train)\n", 767 | " y_train = df_y_train.values\n", 768 | " X_test = convert_df_values_csc_chunk(df_x_test)\n", 769 | " y_test = df_y_test.values\n", 770 | "\n", 771 | " #normalize (row wise so not strictly necesarry to process this way\n", 772 | " #but kept in format for generalization to other preprocessing)\n", 773 | " n_test = y_test.shape[0]\n", 774 | " X_train_test = vstack((X_train, X_test ))\n", 775 | " X_train = normalize(X_train)\n", 776 | " X_test = normalize(X_train_test)[-n_test:] \n", 777 | " \n", 778 | " #train model\n", 779 | " for idx, model_func in enumerate(models):\n", 780 | " model_name = model_names[idx]\n", 781 | " model =model_func\n", 782 | " model.fit(X_train, y_train)\n", 783 | " y_pred = model.predict(X_test)\n", 
784 | "        y_proba = model.predict_proba(X_test)\n", 785 | "        \n", 786 | "        cm = confusion_matrix(y_test, y_pred, labels=[0,1])\n", 787 | "        report = classification_report(y_test, y_pred,output_dict=True)\n", 788 | "        recall = recall_score(y_test, y_pred, average='macro')\n", 789 | "        \n", 790 | "        dict_cv[label][model_name] = {'y_test': y_test, 'y_pred': y_pred,\n", 791 | "                                      'y_proba': y_proba, 'conf_matrix':cm,\n", 792 | "                                      'class_report': report, \n", 793 | "                                      'macro_recall':recall}\n", 794 | "        \n", 795 | "        print(model_name, ' : ', \"{:.2f}\".format(recall))\n", 796 | "    \n", 797 | "    with open(output_filename, 'wb') as handle: \n", 798 | "        pickle.dump(dict_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": {}, 804 | "source": [ 805 | "### 5. Threshold Testing\n", 806 | "\n", 807 | "This section investigates how performance changes as the probability threshold is moved away from 50%.\n", 808 | "\n", 809 | "This analysis can be found in Code Base 4: Interactive Analysis and Interpretation, where it uses the results saved in the preceding sections." 810 | ] 811 | }, 812 | { 813 | "cell_type": "markdown", 814 | "metadata": {}, 815 | "source": [ 816 | "### 6. Ensemble\n", 817 | "\n", 818 | "We investigate ensemble methods by (i) combining learning algorithms for undersampling, (ii) combining learning algorithms for oversampling and (iii) combining the best models for vanilla oversampling and undersampling.\n", 819 | "\n", 820 | "This analysis can be found in Code Base 4: Interactive Analysis and Interpretation, where it uses the results saved in the preceding sections." 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "metadata": {}, 826 | "source": [ 827 | "### 7. Create Financial Ratios Target-Features Matrix" 828 | ] 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "metadata": {}, 833 | "source": [ 834 | "The below code matches tickers and filing dates of the NLP dataset and calculates the relevant annual financial ratios where there is data. 
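Referring back to Section 5 above (the full analysis lives in Code Base 4): a minimal sketch, assuming the `dict_cv` pickle written by the undersampling cell earlier (method = 'undersample_equal_num'), of how the 50% decision threshold can be swept and macro recall re-scored from the stored probabilities:

```python
# Sketch only: sweep the decision threshold over the probabilities saved above.
import pandas as pd
from sklearn.metrics import recall_score

dict_cv = pd.read_pickle('dict_cv_undersample_equal_num_TfidfVectorizer_min_df_25_unigram.pickle')
fold = dict_cv['cv1']['grad_boost']
y_test = fold['y_test']
y_proba = fold['y_proba'][:, 1]          # probability of the positive (drawdown) class

for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred = (y_proba > threshold) * 1
    print(threshold, round(recall_score(y_test, y_pred, average='macro'), 2))
```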
" 835 | ] 836 | }, 837 | { 838 | "cell_type": "code", 839 | "execution_count": null, 840 | "metadata": {}, 841 | "outputs": [], 842 | "source": [ 843 | "import pickle\n", 844 | "import numpy as np\n", 845 | "import pandas as pd\n", 846 | "\n", 847 | "#inputs\n", 848 | "input_fin = 'fundamentals_df_all_db_ary.pickle'\n", 849 | "input_all_10ks = '10k_clean_df.pickle'\n", 850 | "input_master = 'dict_10k_matched_dd.pickle'\n", 851 | "output_file = \"df_fund_ratios_matched.pickle\"\n", 852 | "\n", 853 | "\n", 854 | "#Alignment calcs: reportperiod to filed date\n", 855 | "df_align = pd.read_pickle(input_all_10ks)\n", 856 | "df_align = df_align[['ticker','Period', 'Filed_Date']]\n", 857 | "df_align['Period'] = pd.to_datetime(df_align['Period'], errors='coerce').values\n", 858 | "df_align['Filed_Date'] = pd.to_datetime(df_align['Filed_Date'], errors='coerce').values\n", 859 | "df_align.columns = ['ticker_','reportperiod', 'Filed_Date']\n", 860 | "\n", 861 | "#Calculate Ratios\n", 862 | "df_fin= pd.read_pickle(input_fin)\n", 863 | "\n", 864 | "df_fin_ratios = pd.DataFrame(df_fin['reportperiod'])\n", 865 | "df_fin_ratios['ticker_'] = df_fin['ticker']\n", 866 | "\n", 867 | "df_fin_ratios['netinc_assets'] = df_fin['netinc'].values / df_fin['assets'].values\n", 868 | "df_fin_ratios['leverage_'] = df_fin['assets'].values / df_fin['liabilities'].values\n", 869 | "df_fin_ratios['accruals_'] = df_fin['ncfo'].values / df_fin['netinc'].values\n", 870 | "df_fin_ratios['cash_debt'] = df_fin['ncfo'].values / df_fin['liabilities'].values\n", 871 | "df_fin_ratios['coe_'] = df_fin['ncfo'].values / (df_fin['assets'] - df_fin['liabilities']).values\n", 872 | "df_fin_ratios['roe_'] = df_fin['netinc'].values / (df_fin['assets'] - df_fin['liabilities']).values\n", 873 | "\n", 874 | "del df_fin #memory management\n", 875 | "\n", 876 | "df_fin_ratios = df_fin_ratios.dropna()\n", 877 | "df_fin_ratios = df_fin_ratios.merge(df_align, on =['ticker_','reportperiod'], how='inner')\n", 878 | "df_fin_ratios = df_fin_ratios.drop('reportperiod', axis=1)\n", 879 | "df_fin_ratios = df_fin_ratios.replace([np.inf, -np.inf], np.nan).dropna()\n", 880 | "\n", 881 | "#merge to NLP matched master tickers / dates\n", 882 | "dict_input = pd.read_pickle(input_master) \n", 883 | "df = dict_input['matched_df_10k_dd']\n", 884 | "df = df[['ticker_', 'Filed_Date','sector', 'max_dd_1yr']].sort_values('Filed_Date')\n", 885 | "df.columns = ['ticker_', 'Filed_Date','sector_', 'max_dd_1yr']\n", 886 | "\n", 887 | "df = df.merge(df_fin_ratios, on=['ticker_', 'Filed_Date'], how='inner')\n", 888 | "\n", 889 | "with open(output_file, 'wb') as handle: \n", 890 | " pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)" 891 | ] 892 | }, 893 | { 894 | "cell_type": "markdown", 895 | "metadata": {}, 896 | "source": [ 897 | "### 8. Generate Holdout Testing Sets\n", 898 | "\n", 899 | "This section starts by generating the holdout set before testing it with the optimal model found in CV (ensemble of oversmapling and undersampling). An expanding annual window method is used to update the model and test on an annual basis. 
The aggregation of these results then form the full set of holdout results.\n", 900 | "\n", 901 | "Generate holdout sets for NLP model:" 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "execution_count": null, 907 | "metadata": {}, 908 | "outputs": [], 909 | "source": [ 910 | "import pickle\n", 911 | "import pandas as pd\n", 912 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 913 | "from capstone_10k_functions import vectorize_corpus\n", 914 | "from datetime import datetime as dt\n", 915 | "\n", 916 | "t1 = dt.now()\n", 917 | "print(t1)\n", 918 | "\n", 919 | "#inputs\n", 920 | "input_file = 'dict_10k_matched_dd.pickle'\n", 921 | "vector_func = TfidfVectorizer \n", 922 | "func_name = 'TfidfVectorizer' #['TfidfVectorizer', 'CountVectorizer']\n", 923 | "hold_out_set_start = [2015, 2016, 2017, 2018, 2019]\n", 924 | "min_df_grid = [25]\n", 925 | "max_df = 0.5\n", 926 | "ngram = (1,1)\n", 927 | "ngram_name = 'unigram'\n", 928 | "label = ['hold_out_2015', 'hold_out_2016', 'hold_out_2017', 'hold_out_2018', \n", 929 | " 'hold_out_2019']\n", 930 | "\n", 931 | "\n", 932 | "#read data into memory\n", 933 | "with open(input_file , 'rb') as f:\n", 934 | " d_data = pickle.load(f)\n", 935 | "df = d_data['matched_df_10k_dd']\n", 936 | "df = df.sort_values(\"Filed_Date\")\n", 937 | "\n", 938 | "for idx_ho, year in enumerate(hold_out_set_start):\n", 939 | " \n", 940 | " print(label[idx_ho])\n", 941 | " \n", 942 | " ##Define hold out set\n", 943 | " mask_train = df['Filed_Date'].dt.year < year\n", 944 | " mask_test = df['Filed_Date'].dt.year == year\n", 945 | " df_test = df[mask_test]\n", 946 | " df_train = df[mask_train]\n", 947 | "\n", 948 | " #Generate df master (word vector / vectorizer) sets for each cv fold\n", 949 | " dict_cv = {}\n", 950 | " #format training data\n", 951 | " df_train_text = df_train[['ticker_','Filed_Date', 'Text']]\n", 952 | " df_train_other = df_train.drop('Text', axis=1)\n", 953 | " df_train_other.columns = ['ticker_', 'Filed_Date', 'sector_', 'sic_sector', \n", 954 | " 'max_dd_1yr', 'max_dd_2yr', 'year_dd_flag', \n", 955 | " 'cum_year_dd_flag']\n", 956 | " df_train_other['custom_sector'] = str(df_train_other['sector_']) + ' : ' + str(df_train_other['sic_sector'])\n", 957 | "\n", 958 | " #format testing data\n", 959 | " df_test_text = df_test[['ticker_','Filed_Date', 'Text']]\n", 960 | " df_test_other = df_test.drop('Text', axis=1)\n", 961 | " df_test_other.columns = ['ticker_', 'Filed_Date', 'sector_', 'sic_sector', \n", 962 | " 'max_dd_1yr', 'max_dd_2yr', 'year_dd_flag', \n", 963 | " 'cum_year_dd_flag']\n", 964 | " df_test_other['custom_sector'] = str(df_test_other['sector_']) + ' : ' + str(df_test_other['sic_sector'])\n", 965 | " \n", 966 | "\n", 967 | " for min_df in min_df_grid: \n", 968 | " print(min_df)\n", 969 | " \n", 970 | " #name for cv dictionary specified by min_df value\n", 971 | " key_name = 'min_df_' + str(min_df)\n", 972 | " \n", 973 | " #vectorize corpus and assign word vector and vectorizer\n", 974 | " function = vectorize_corpus(df_train_text['Text'], vector_func, min_df, \n", 975 | " max_df,ngram)\n", 976 | " X = function['df_wv']\n", 977 | " vectorizer = function['vectorizer']\n", 978 | " \n", 979 | " #Transform training data into df_master format\n", 980 | " vocab = X.columns.tolist()\n", 981 | " X['Filed_Date'] = df_train_text['Filed_Date'].values\n", 982 | " X['ticker_'] = df_train_text['ticker_'].values\n", 983 | " \n", 984 | " df_train_master = df_train_other.merge(X, on=['ticker_','Filed_Date'], how='inner')\n", 985 | " \n", 
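As an aside for this excerpt: `vectorize_corpus` (imported from capstone_10k_functions and called a few lines above) is not shown here. A minimal, hypothetical sketch that is consistent with how its outputs (`df_wv` and `vectorizer`) are used in this cell:

```python
# Hypothetical sketch only -- the real helper may differ.
import pandas as pd

def vectorize_corpus(corpus, vector_func, min_df, max_df, ngram):
    """Fit the chosen vectorizer class (e.g. TfidfVectorizer) on the training
    corpus and return the sparse word-vector DataFrame plus the fitted vectorizer."""
    vectorizer = vector_func(min_df=min_df, max_df=max_df, ngram_range=ngram)
    arr = vectorizer.fit_transform(corpus)
    # get_feature_names() matches sklearn 0.22 in requirements.txt
    # (get_feature_names_out() in newer releases)
    df_wv = pd.DataFrame.sparse.from_spmatrix(arr, columns=vectorizer.get_feature_names())
    return {'df_wv': df_wv, 'vectorizer': vectorizer}
```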
986 | " #Transform test data into df master format\n", 987 | " arr_test_transform = vectorizer.transform(df_test_text['Text'])\n", 988 | " df_test_transform = pd.DataFrame.sparse.from_spmatrix(arr_test_transform,\n", 989 | " columns = vocab)\n", 990 | " df_test_transform['Filed_Date'] = df_test_text['Filed_Date'].values\n", 991 | " df_test_transform['ticker_'] = df_test_text['ticker_'].values\n", 992 | " \n", 993 | " df_test_master = df_test_other.merge(df_test_transform, \n", 994 | " on=['ticker_','Filed_Date'], \n", 995 | " how='inner')\n", 996 | " \n", 997 | " dict_final = {'df_test_master': df_test_master, 'df_train_master': df_train_master}\n", 998 | " \n", 999 | " dict_cv[key_name] = dict_final\n", 1000 | " \n", 1001 | " output_filename = label[idx_ho] + '_' + func_name + '_' + key_name + '_' + ngram_name + '.pickle'\n", 1002 | " \n", 1003 | " with open(output_filename, 'wb') as handle: \n", 1004 | " pickle.dump(dict_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 1005 | " \n", 1006 | "t2 = dt.now()\n", 1007 | "print(t2)\n", 1008 | "print(t2-t1)\n", 1009 | "\n", 1010 | "#runtime 2hrs11mins" 1011 | ] 1012 | }, 1013 | { 1014 | "cell_type": "markdown", 1015 | "metadata": {}, 1016 | "source": [ 1017 | "Generate holdout sets for financial ratio (FIN) model:" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "code", 1022 | "execution_count": null, 1023 | "metadata": {}, 1024 | "outputs": [], 1025 | "source": [ 1026 | "import pandas as pd\n", 1027 | "import pickle\n", 1028 | "\n", 1029 | "#inputs\n", 1030 | "input_file = \"df_fund_ratios_matched.pickle\"\n", 1031 | "output_filename = 'hold_out_fund_ratio_sets.pickle'\n", 1032 | "years = [2015, 2016, 2017, 2018, 2019]\n", 1033 | "labels = list(map(lambda x: 'fund_h_out_' + str(x), years))\n", 1034 | "\n", 1035 | "#read data into memory and set new year_column\n", 1036 | "df_all = pd.read_pickle(input_file)\n", 1037 | "df_all['year_'] = (df_all['Filed_Date'].dt.year).values\n", 1038 | "\n", 1039 | "#create dictionary\n", 1040 | "dict_sets={}\n", 1041 | "for label in labels:\n", 1042 | " dict_sets.update({label: {'df_train_master': pd.DataFrame(),\n", 1043 | " 'df_test_master': pd.DataFrame()}}) \n", 1044 | " \n", 1045 | "#Create holdout sets and populate dictionary \n", 1046 | "for idx, label in enumerate(labels):\n", 1047 | " year = years[idx]\n", 1048 | " \n", 1049 | " mask_train = df_all['year_'] < year\n", 1050 | " mask_test = df_all['year_'] == year\n", 1051 | " \n", 1052 | " df_train = df_all[mask_train]\n", 1053 | " df_test = df_all[mask_test]\n", 1054 | " \n", 1055 | " dict_sets[label]['df_train_master'] = df_train\n", 1056 | " dict_sets[label]['df_test_master'] = df_test\n", 1057 | " \n", 1058 | "with open(output_filename, 'wb') as handle: \n", 1059 | " pickle.dump(dict_sets, handle, protocol=pickle.HIGHEST_PROTOCOL)" 1060 | ] 1061 | }, 1062 | { 1063 | "cell_type": "markdown", 1064 | "metadata": {}, 1065 | "source": [ 1066 | "The holdout sets for the market (MKT) model are set equal to FIN holdout sets." 1067 | ] 1068 | }, 1069 | { 1070 | "cell_type": "markdown", 1071 | "metadata": {}, 1072 | "source": [ 1073 | "### 9. 
Test Holdout Sets \n", 1074 | "\n", 1075 | "Test and store results for the NLP model: " 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "code", 1080 | "execution_count": null, 1081 | "metadata": {}, 1082 | "outputs": [], 1083 | "source": [ 1084 | "import pickle\n", 1085 | "import numpy as np\n", 1086 | "import pandas as pd\n", 1087 | "import random\n", 1088 | "from scipy.sparse import csr_matrix, vstack, hstack, identity\n", 1089 | "from sklearn.linear_model import LogisticRegression\n", 1090 | "from sklearn.ensemble import GradientBoostingClassifier\n", 1091 | "from sklearn.preprocessing import normalize\n", 1092 | "from sklearn.metrics import classification_report, confusion_matrix\n", 1093 | "from capstone_10k_functions import convert_df_values_csc_chunk\n", 1094 | "from datetime import datetime as dt\n", 1095 | "\n", 1096 | "t1 = dt.now()\n", 1097 | "print(t1)\n", 1098 | "pd.set_option('mode.chained_assignment', None)\n", 1099 | "\n", 1100 | "#inputs\n", 1101 | "ratio = 5\n", 1102 | "vector_func = 'TfidfVectorizer'\n", 1103 | "ngram = 'unigram'\n", 1104 | "years = [2015, 2016, 2017, 2018, 2019]\n", 1105 | "label_years = list(map(lambda x: 'hold_out_' + str(x), years))\n", 1106 | "number = 3\n", 1107 | "x_drop_columns = ['Filed_Date', 'ticker_','sector_', 'sic_sector', 'max_dd_1yr','max_dd_2yr', \n", 1108 | " 'year_dd_flag', 'cum_year_dd_flag', 'custom_sector']\n", 1109 | "weight_over =0.25\n", 1110 | "weight_under = 0.75\n", 1111 | "model_over = GradientBoostingClassifier(random_state=41)\n", 1112 | "model_under = GradientBoostingClassifier(random_state=41)\n", 1113 | "method = 'over_under_ensemble_words_only_hold_out'\n", 1114 | "\n", 1115 | "#create dictionary structure\n", 1116 | "dict_years = {}\n", 1117 | "for label in label_years:\n", 1118 | " dict_years.update({label: {}})\n", 1119 | "\n", 1120 | "#loop through holdout sets and populate dictionary\n", 1121 | "for label in label_years:\n", 1122 | " \n", 1123 | " print(label)\n", 1124 | " \n", 1125 | " filename = label + '_' + vector_func + '_min_df_25_' + ngram + '.pickle'\n", 1126 | " d_years = pd.read_pickle(filename)\n", 1127 | " df_train = d_years['min_df_25']['df_train_master']\n", 1128 | " df_test = d_years['min_df_25']['df_test_master']\n", 1129 | "\n", 1130 | " #Equal number algo\n", 1131 | " df_train = equalize_year_ratio(df_train, ratio)\n", 1132 | " \n", 1133 | " #define X and y dataframes \n", 1134 | " df_x_train = df_train.drop(x_drop_columns, axis=1)\n", 1135 | " df_y_train= (df_train['max_dd_1yr'] <= -0.8)*1\n", 1136 | " df_x_test = df_test.drop(x_drop_columns, axis=1)\n", 1137 | " df_y_test= (df_test['max_dd_1yr'] <= -0.8)*1\n", 1138 | "\n", 1139 | " \n", 1140 | " #convert X and y arrays\n", 1141 | " X_train = convert_df_values_csc_chunk(df_x_train)\n", 1142 | " y_train = df_y_train.values\n", 1143 | " X_test = convert_df_values_csc_chunk(df_x_test)\n", 1144 | " y_test = df_y_test.values \n", 1145 | "\n", 1146 | " #normalize (row wise so not strictly necesarry to process this way\n", 1147 | " #but kept in format for generalization to other preprocessing\n", 1148 | " n_test = y_test.shape[0]\n", 1149 | " X_train_test = vstack((X_train, X_test ))\n", 1150 | " X_train = normalize(X_train)\n", 1151 | " X_test = normalize(X_train_test)[-n_test:] \n", 1152 | "\n", 1153 | " #oversample \n", 1154 | " key_number = 'number'+ str(number)\n", 1155 | " n_over = int(number*n_pos_over)\n", 1156 | " dict_bal_over = oversample_random(X_train, y_train, seed = 41, flag=True, n=n_over)\n", 1157 | " X_train_balanced_over = 
dict_bal_over['X_balanced'] \n", 1158 | " y_train_balanced_over = dict_bal_over['y_balanced'] \n", 1159 | "\n", 1160 | " #undersample\n", 1161 | " dict_bal_under = undersample_random(X_train, y_train, seed = 41)\n", 1162 | " X_train_balanced_under = dict_bal_under['X_balanced'] \n", 1163 | " y_train_balanced_under = dict_bal_under['y_balanced'] \n", 1164 | "\n", 1165 | " #train model\n", 1166 | " model_1 =model_over\n", 1167 | " model_1.fit(X_train_balanced_over, y_train_balanced_over)\n", 1168 | " y_1_log_proba = model_1.predict_log_proba(X_test)[:,1]\n", 1169 | " \n", 1170 | " model_2 =model_under\n", 1171 | " model_2.fit(X_train_balanced_under, y_train_balanced_under)\n", 1172 | " y_2_log_proba = model_2.predict_log_proba(X_test)[:,1]\n", 1173 | " \n", 1174 | " y_log_proba = weight_over*y_1_log_proba + weight_under*y_2_log_proba\n", 1175 | " y_pred = (y_log_proba > np.log(0.5))*1\n", 1176 | " \n", 1177 | " #calcuate model metrics \n", 1178 | " cm = confusion_matrix(y_test, y_pred, labels=[0,1])\n", 1179 | " pos_recall = cm[1,1] / (cm[1,0] + cm[1,1])\n", 1180 | " neg_recall = cm[0,0] / (cm[0,0] + cm[0,1])\n", 1181 | " mh_recall = 2*neg_recall*pos_recall / (neg_recall + pos_recall)\n", 1182 | " report = classification_report(y_test, y_pred,output_dict=True)\n", 1183 | " \n", 1184 | " print('mh_recall = ', mh_recall)\n", 1185 | " print('pos_recall = ', pos_recall)\n", 1186 | " print('neg_recall = ', neg_recall)\n", 1187 | " \n", 1188 | " df_y_result = df_test[['ticker_', 'Filed_Date']]\n", 1189 | " df_y_result['true_'] = y_test\n", 1190 | " df_y_result['pred_'] = y_pred\n", 1191 | " df_y_result['log_proba'] = y_log_proba\n", 1192 | " \n", 1193 | " dict_years[label] = {'df_y_result': df_y_result, 'class_report': report}\n", 1194 | " \n", 1195 | " \n", 1196 | " #find words in testing set only and count doc number\n", 1197 | " #drop null columns\n", 1198 | " null_columns = df_x_test.columns[df_x_test.isnull().any()]\n", 1199 | " df_x_test_notna = df_x_test.drop(null_columns, axis=1)\n", 1200 | " #count how many docs word in\n", 1201 | " df_x_test_notna = df_x_test_notna.transpose()\n", 1202 | " s_word_in_test_doc_count = df_x_test_notna.apply(lambda x: (x != 0).sum(), axis=1) \n", 1203 | " mask_keep = s_word_in_test_doc_count != 0\n", 1204 | " s_word_in_test_doc_count = s_word_in_test_doc_count[mask_keep]\n", 1205 | " s_word_in_test_doc_count = s_word_in_test_doc_count.sort_values(ascending=False)\n", 1206 | "\n", 1207 | " #find words in predicted pos only only\n", 1208 | " mask_pred_pos = (y_pred == 1)\n", 1209 | " df_x_pred_pos = df_x_test[mask_pred_pos]\n", 1210 | " #drop null columns\n", 1211 | " null_columns = df_x_pred_pos.columns[df_x_pred_pos.isnull().any()]\n", 1212 | " df_x_pred_pos_notna = df_x_pred_pos.drop(null_columns, axis=1)\n", 1213 | " #count how many docs word in\n", 1214 | " df_x_pred_pos_notna= df_x_pred_pos_notna.transpose()\n", 1215 | " s_word_in_pred_pos_doc_count = df_x_pred_pos_notna.apply(lambda x: (x != 0).sum(), axis=1) \n", 1216 | " mask_keep = s_word_in_pred_pos_doc_count!= 0\n", 1217 | " s_word_in_pred_pos_doc_count = s_word_in_pred_pos_doc_count[mask_keep] \n", 1218 | " s_word_in_pred_pos_doc_count = s_word_in_pred_pos_doc_count.sort_values(ascending=False)\n", 1219 | " \n", 1220 | "\n", 1221 | " #Generate prob word matrix \n", 1222 | " words_list = list(df_x_test.columns) \n", 1223 | " n_id = len(words_list)\n", 1224 | "\n", 1225 | " word_arr = identity(n_id).tolil()\n", 1226 | " word_arr_q = csr_matrix(word_arr)\n", 1227 | " \n", 1228 | " word_1_log_proba = 
model_1.predict_log_proba(word_arr_q)[:,1]\n", 1229 | " word_2_log_proba = model_2.predict_log_proba(word_arr_q)[:,1]\n", 1230 | " word_log_proba = weight_over*word_1_log_proba + weight_under*word_2_log_proba\n", 1231 | " word_proba = np.exp(word_log_proba)\n", 1232 | "\n", 1233 | " df_words_prob = pd.DataFrame(word_proba, index = words_list, columns=['prob'])\n", 1234 | " df_words_prob = df_words_prob.sort_values('prob', ascending=False)\n", 1235 | " \n", 1236 | " dict_years[label] = {'df_y_result': df_y_result,\n", 1237 | " 'conf_mat': cm,\n", 1238 | " 'class_report': report, \n", 1239 | " 'df_words_proba': df_words_prob,\n", 1240 | " 'words_test_doc_count': s_word_in_test_doc_count,\n", 1241 | " 'words_pred_pos_doc_count': s_word_in_pred_pos_doc_count}\n", 1242 | " \n", 1243 | " \n", 1244 | "output_filename= 'hold_out_results_under_over_ensemble_words_only.pickle'\n", 1245 | "with open(output_filename, 'wb') as handle: \n", 1246 | " pickle.dump(dict_years, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 1247 | "\n", 1248 | "t2= dt.now()\n", 1249 | "print(t2)\n", 1250 | "print(t2-t1)\n", 1251 | " \n", 1252 | "#runtime 1hr56mins\n", 1253 | " " 1254 | ] 1255 | }, 1256 | { 1257 | "cell_type": "markdown", 1258 | "metadata": {}, 1259 | "source": [ 1260 | "Test and store results for the baseline financial ratio (FIN) model: " 1261 | ] 1262 | }, 1263 | { 1264 | "cell_type": "code", 1265 | "execution_count": null, 1266 | "metadata": {}, 1267 | "outputs": [], 1268 | "source": [ 1269 | "import pickle\n", 1270 | "import numpy as np\n", 1271 | "import pandas as pd\n", 1272 | "import random\n", 1273 | "from sklearn.ensemble import GradientBoostingClassifier\n", 1274 | "from sklearn.preprocessing import scale\n", 1275 | "from sklearn.metrics import classification_report, confusion_matrix\n", 1276 | "\n", 1277 | "pd.set_option('mode.chained_assignment', None)\n", 1278 | "#not refactored\n", 1279 | "\n", 1280 | "#inputs\n", 1281 | "random.seed(41)\n", 1282 | "input_file = 'hold_out_fund_ratio_sets.pickle'\n", 1283 | "output_filename = 'fund_ratio_hold_out_results.pickle'\n", 1284 | "ratio = 5\n", 1285 | "years = [2015, 2016, 2017, 2018, 2019]\n", 1286 | "label_years = list(map(lambda x: 'fund_h_out_' + str(x), years))\n", 1287 | "over_sample_amount = [3]\n", 1288 | "x_drop_columns = ['Filed_Date', 'ticker_','sector_', 'max_dd_1yr', 'ann_qcut']\n", 1289 | "models = [GradientBoostingClassifier(random_state=41)]\n", 1290 | "model_names = ['grad_boost']\n", 1291 | "\n", 1292 | "#create dictionary structure\n", 1293 | "dict_results = {}\n", 1294 | "for label in label_years:\n", 1295 | " dict_results.update({label: {}})\n", 1296 | " for model in model_names:\n", 1297 | " dict_results[label].update({model: {}})\n", 1298 | "\n", 1299 | "#populate dictionary\n", 1300 | "d_h_out = pd.read_pickle(input_file)\n", 1301 | "\n", 1302 | "for label in label_years:\n", 1303 | " \n", 1304 | " df_train = d_h_out[label]['df_train_master']\n", 1305 | " df_test = d_h_out[label]['df_test_master']\n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | "#calculate previous year quartile rank\n", 1310 | " #join training and test to calculate annual quartile ranks\n", 1311 | " df_all = pd.concat([df_train, df_test])\n", 1312 | " df_all['year_'] = df_all['Filed_Date'].dt.year.values\n", 1313 | " \n", 1314 | " years = set(df_all['year_'].tolist())\n", 1315 | " years = sorted(list(years))\n", 1316 | " n_years =len(years)\n", 1317 | " \n", 1318 | " counter=0\n", 1319 | " for j in range(n_years-1):\n", 1320 | " current_year = 
years[j+1]\n", 1321 | " prev_year = years[j]\n", 1322 | " \n", 1323 | " mask_prev = df_all['year_'] == prev_year\n", 1324 | " df_prev = df_all[mask_prev]\n", 1325 | " df_prev['ann_qcut'] = pd.qcut(df_prev['max_dd_1yr'], q=[0,0.25,0.5,0.75,1],\n", 1326 | " labels=[4,3,2,1]).astype(int)\n", 1327 | " df_prev = df_prev[['ticker_','ann_qcut']]\n", 1328 | " mask_current = df_all['year_'] == current_year\n", 1329 | " df_current = df_all[mask_current]\n", 1330 | " df_current = df_current.merge(df_prev, on=['ticker_'], how='inner')\n", 1331 | " \n", 1332 | " if counter==0:\n", 1333 | " df_final = df_current\n", 1334 | " else:\n", 1335 | " df_final = pd.concat([df_final, df_current])\n", 1336 | " counter+=1\n", 1337 | " \n", 1338 | " #drop year_ column and sort ascending date\n", 1339 | " df_final = df_final.drop('year_', axis=1)\n", 1340 | " df_final = df_final.sort_values(by='Filed_Date')\n", 1341 | " \n", 1342 | " #split data bank into training and test sets\n", 1343 | " all_rows = df_all.shape[0] \n", 1344 | " train_perc = df_train.shape[0] / df_all.shape[0] \n", 1345 | " updated_train_rows = int(train_perc * df_final.shape[0] // 1)\n", 1346 | " df_train = df_final.iloc[:updated_train_rows, :].reset_index(drop=True)\n", 1347 | " df_test = df_final.iloc[updated_train_rows:, :].reset_index(drop=True)\n", 1348 | "\n", 1349 | " \n", 1350 | "\n", 1351 | "#Equal number algo (see helper function for comments)\n", 1352 | " mask_pos = df_train['max_dd_1yr'] < -0.8\n", 1353 | " df_year = df_train[['ticker_','Filed_Date']][mask_pos]\n", 1354 | " df_year['year'] = df_year['Filed_Date'].dt.year\n", 1355 | " df_pos_count = df_year['year'].value_counts()\n", 1356 | " \n", 1357 | " years = sorted(df_pos_count.index.tolist())\n", 1358 | " \n", 1359 | " df_train['year'] = df_train['Filed_Date'].dt.year\n", 1360 | "\n", 1361 | " count=0\n", 1362 | " for year in years:\n", 1363 | " \n", 1364 | " mask_year = df_train['year']==year\n", 1365 | " df = df_train[mask_year]\n", 1366 | " mask_neg = df['max_dd_1yr'] >= -0.8 \n", 1367 | " df = df[mask_neg]\n", 1368 | " m = df.shape[0]\n", 1369 | " n = int(ratio*df_pos_count[year])\n", 1370 | " \n", 1371 | " random_idx = random.sample(range(0,m), n)\n", 1372 | " df = df.iloc[random_idx]\n", 1373 | " if count == 0:\n", 1374 | " df_final = df\n", 1375 | " else:\n", 1376 | " df_final = pd.concat([df_final,df])\n", 1377 | " count +=1 \n", 1378 | " \n", 1379 | " df_train = df_train.drop('year', axis=1)\n", 1380 | " df_neg = df_final.drop('year', axis=1)\n", 1381 | " df_pos = df_train[mask_pos]\n", 1382 | " \n", 1383 | " df_train = pd.concat([df_neg, df_pos])\n", 1384 | " df_train = df_train.sort_values(by='Filed_Date')\n", 1385 | " \n", 1386 | " \n", 1387 | " #define X and y dataframes\n", 1388 | " df_x_train = df_train.drop(x_drop_columns, axis=1)\n", 1389 | " df_y_train= (df_train['max_dd_1yr'] <= -0.8)*1\n", 1390 | " df_x_test = df_test.drop(x_drop_columns, axis=1)\n", 1391 | " df_y_test= (df_test['max_dd_1yr'] <= -0.8)*1\n", 1392 | "\n", 1393 | " #convert X and y arrays\n", 1394 | " X_train = df_x_train.values\n", 1395 | " y_train = df_y_train.values\n", 1396 | " X_test = df_x_test.values\n", 1397 | " y_test = df_y_test.values\n", 1398 | " \n", 1399 | " \n", 1400 | " #oversample (see helper function for comments)\n", 1401 | " for number in over_sample_amount:\n", 1402 | " \n", 1403 | " key_number = 'number'+ str(number)\n", 1404 | " idx_train_argsort= np.argsort(y_train)\n", 1405 | " n_pos = y_train.sum()\n", 1406 | " n_neg = y_train.shape[0] - n_pos\n", 1407 | " \n", 1408 | " 
idx_pos_prelim = idx_train_argsort[-n_pos:]\n", 1409 | " \n", 1410 | " n = int(number*n_pos)\n", 1411 | "\n", 1412 | " idx_pos = np.random.choice(idx_pos_prelim, n) \n", 1413 | " X_train_pos = X_train[idx_pos]\n", 1414 | " y_train_pos = y_train[idx_pos]\n", 1415 | " \n", 1416 | " idx_neg_prelim = idx_train_argsort[:n_neg]\n", 1417 | " random_idx = random.sample(range(0,n_neg), n)\n", 1418 | " idx_neg = idx_neg_prelim[random_idx]\n", 1419 | " X_train_neg = X_train[idx_neg]\n", 1420 | " y_train_neg = y_train[idx_neg]\n", 1421 | "\n", 1422 | " X_train_balanced = np.concatenate((X_train_pos,X_train_neg), axis=0)\n", 1423 | " y_train_balanced = np.concatenate((y_train_pos,y_train_neg),axis=0)\n", 1424 | " \n", 1425 | " \n", 1426 | " #train model\n", 1427 | " for idx, model_func in enumerate(models):\n", 1428 | " model_name = model_names[idx]\n", 1429 | " model =model_func\n", 1430 | " model.fit(X_train_balanced, y_train_balanced)\n", 1431 | " y_pred = model.predict(X_test)\n", 1432 | " \n", 1433 | " cm = confusion_matrix(y_test, y_pred, labels=[0,1])\n", 1434 | " pos_recall = cm[1,1] / (cm[1,0] + cm[1,1])\n", 1435 | " neg_recall = cm[0,0] / (cm[0,0] + cm[0,1])\n", 1436 | " mh_recall = 2*neg_recall*pos_recall / (neg_recall + pos_recall)\n", 1437 | " report = classification_report(y_test, y_pred,output_dict=True)\n", 1438 | " \n", 1439 | " print('mh_recall = ', mh_recall)\n", 1440 | " print('pos_recall = ', pos_recall)\n", 1441 | " print('neg_recall = ', neg_recall)\n", 1442 | " \n", 1443 | " df_y_result = df_test[['ticker_', 'Filed_Date']]\n", 1444 | " df_y_result['true_'] = y_test\n", 1445 | " df_y_result['pred_'] = y_pred\n", 1446 | " \n", 1447 | " dict_results[label][model_name] = {'df_y_result': df_y_result, 'class_report': report}\n", 1448 | " \n", 1449 | " \n", 1450 | "with open(output_filename, 'wb') as handle: \n", 1451 | " pickle.dump(dict_results, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 1452 | "\n" 1453 | ] 1454 | }, 1455 | { 1456 | "cell_type": "markdown", 1457 | "metadata": {}, 1458 | "source": [ 1459 | "Test and store results for the market (MKT) model: " 1460 | ] 1461 | }, 1462 | { 1463 | "cell_type": "code", 1464 | "execution_count": null, 1465 | "metadata": {}, 1466 | "outputs": [], 1467 | "source": [ 1468 | "import pickle\n", 1469 | "import numpy as np\n", 1470 | "import pandas as pd\n", 1471 | "import random\n", 1472 | "from sklearn.ensemble import GradientBoostingClassifier\n", 1473 | "from sklearn.preprocessing import scale\n", 1474 | "from sklearn.metrics import classification_report, confusion_matrix\n", 1475 | "\n", 1476 | "pd.set_option('mode.chained_assignment', None)\n", 1477 | "#not refactored\n", 1478 | "\n", 1479 | "#inputs\n", 1480 | "random.seed(41)\n", 1481 | "input_file = 'hold_out_fund_ratio_sets.pickle'\n", 1482 | "output_filename = 'prev_quartile_only_hold_out_results.pickle'\n", 1483 | "ratio = 5\n", 1484 | "years = [2015, 2016, 2017, 2018, 2019]\n", 1485 | "label_years = list(map(lambda x: 'fund_h_out_' + str(x), years))\n", 1486 | "over_sample_amount = [3]\n", 1487 | "x_drop_columns = ['Filed_Date', 'ticker_','sector_', 'max_dd_1yr']\n", 1488 | "models = [GradientBoostingClassifier(random_state=41)]\n", 1489 | "model_names = ['grad_boost']\n", 1490 | "\n", 1491 | "#create dictionary structure\n", 1492 | "dict_results = {}\n", 1493 | "for label in label_years:\n", 1494 | " dict_results.update({label: {}})\n", 1495 | " for model in model_names:\n", 1496 | " dict_results[label].update({model: {}})\n", 1497 | "\n", 1498 | "#read data into memory\n", 1499 
| "d_h_out = pd.read_pickle(input_file)\n", 1500 | "\n", 1501 | "#loop through holdout sets and populate dictionary\n", 1502 | "for label in label_years:\n", 1503 | " \n", 1504 | " df_train = d_h_out[label]['df_train_master']\n", 1505 | " df_test = d_h_out[label]['df_test_master']\n", 1506 | " \n", 1507 | "#calculate previous year quartile rank\n", 1508 | " #join training and test to calculate annual quartile ranks\n", 1509 | " df_all = pd.concat([df_train, df_test])\n", 1510 | " df_all['year_'] = df_all['Filed_Date'].dt.year.values\n", 1511 | " \n", 1512 | " years = set(df_all['year_'].tolist())\n", 1513 | " years = sorted(list(years))\n", 1514 | " n_years =len(years)\n", 1515 | " \n", 1516 | " counter=0\n", 1517 | " for j in range(n_years-1):\n", 1518 | " current_year = years[j+1]\n", 1519 | " prev_year = years[j]\n", 1520 | " \n", 1521 | " mask_prev = df_all['year_'] == prev_year\n", 1522 | " df_prev = df_all[mask_prev]\n", 1523 | " df_prev['ann_qcut'] = pd.qcut(df_prev['max_dd_1yr'], q=[0,0.25,0.5,0.75,1],\n", 1524 | " labels=[4,3,2,1]).astype(int)\n", 1525 | " df_prev = df_prev[['ticker_','ann_qcut']]\n", 1526 | " mask_current = df_all['year_'] == current_year\n", 1527 | " df_current = df_all[mask_current]\n", 1528 | " df_current = df_current.merge(df_prev, on=['ticker_'], how='inner')\n", 1529 | " \n", 1530 | " if counter==0:\n", 1531 | " df_final = df_current\n", 1532 | " else:\n", 1533 | " df_final = pd.concat([df_final, df_current])\n", 1534 | " counter+=1\n", 1535 | " \n", 1536 | " #drop year_ column and sort ascending date\n", 1537 | " df_final = df_final.drop('year_', axis=1)\n", 1538 | " df_final = df_final.sort_values(by='Filed_Date')\n", 1539 | " \n", 1540 | " #split data bank into training and test sets\n", 1541 | " all_rows = df_all.shape[0] \n", 1542 | " train_perc = df_train.shape[0] / df_all.shape[0] \n", 1543 | " updated_train_rows = int(train_perc * df_final.shape[0] // 1)\n", 1544 | " df_train = df_final.iloc[:updated_train_rows, :].reset_index(drop=True)\n", 1545 | " df_test = df_final.iloc[updated_train_rows:, :].reset_index(drop=True)\n", 1546 | "\n", 1547 | "#Equal number algo (see helper function for comments)\n", 1548 | " mask_pos = df_train['max_dd_1yr'] < -0.8\n", 1549 | " df_year = df_train[['ticker_','Filed_Date']][mask_pos]\n", 1550 | " df_year['year'] = df_year['Filed_Date'].dt.year\n", 1551 | " df_pos_count = df_year['year'].value_counts()\n", 1552 | " \n", 1553 | " years = sorted(df_pos_count.index.tolist())\n", 1554 | " \n", 1555 | " df_train['year'] = df_train['Filed_Date'].dt.year\n", 1556 | "\n", 1557 | " count=0\n", 1558 | " for year in years:\n", 1559 | " \n", 1560 | " mask_year = df_train['year']==year\n", 1561 | " df = df_train[mask_year]\n", 1562 | " mask_neg = df['max_dd_1yr'] >= -0.8 \n", 1563 | " df = df[mask_neg]\n", 1564 | " m = df.shape[0]\n", 1565 | " n = int(ratio*df_pos_count[year])\n", 1566 | " \n", 1567 | " random_idx = random.sample(range(0,m), n)\n", 1568 | " df = df.iloc[random_idx]\n", 1569 | " if count == 0:\n", 1570 | " df_final = df\n", 1571 | " else:\n", 1572 | " df_final = pd.concat([df_final,df])\n", 1573 | " count +=1 \n", 1574 | " \n", 1575 | " df_train = df_train.drop('year', axis=1)\n", 1576 | " df_neg = df_final.drop('year', axis=1)\n", 1577 | " df_pos = df_train[mask_pos]\n", 1578 | " \n", 1579 | " df_train = pd.concat([df_neg, df_pos])\n", 1580 | " df_train = df_train.sort_values(by='Filed_Date')\n", 1581 | " \n", 1582 | " \n", 1583 | " #define X and y dataframes\n", 1584 | " df_x_train = df_train['ann_qcut']\n", 
1585 | " df_y_train= (df_train['max_dd_1yr'] <= -0.8)*1\n", 1586 | " df_x_test = df_test['ann_qcut']\n", 1587 | " df_y_test= (df_test['max_dd_1yr'] <= -0.8)*1\n", 1588 | "\n", 1589 | " #convert X and y arrays\n", 1590 | " X_train = df_x_train.values\n", 1591 | " X_train = X_train.reshape(-1, 1)\n", 1592 | " y_train = df_y_train.values\n", 1593 | " X_test = df_x_test.values\n", 1594 | " X_test = X_test.reshape(-1, 1)\n", 1595 | " y_test = df_y_test.values\n", 1596 | " \n", 1597 | " #oversample (see helper function for comments)\n", 1598 | " idx_train_argsort= np.argsort(y_train)\n", 1599 | " n_pos = y_train.sum()\n", 1600 | " n_neg = y_train.shape[0] - n_pos\n", 1601 | " \n", 1602 | " idx_pos_prelim = idx_train_argsort[-n_pos:]\n", 1603 | " \n", 1604 | " for number in over_sample_amount:\n", 1605 | " \n", 1606 | " key_number = 'number'+ str(number)\n", 1607 | " \n", 1608 | " n = int(number*n_pos)\n", 1609 | "\n", 1610 | " idx_pos = np.random.choice(idx_pos_prelim, n) \n", 1611 | " X_train_pos = X_train[idx_pos]\n", 1612 | " y_train_pos = y_train[idx_pos]\n", 1613 | "\n", 1614 | " idx_neg_prelim = idx_train_argsort[:n_neg]\n", 1615 | " random_idx = random.sample(range(0,n_neg), n)\n", 1616 | " idx_neg = idx_neg_prelim[random_idx]\n", 1617 | " X_train_neg = X_train[idx_neg]\n", 1618 | " y_train_neg = y_train[idx_neg]\n", 1619 | "\n", 1620 | " X_train_balanced = np.concatenate((X_train_pos,X_train_neg), axis=0)\n", 1621 | " y_train_balanced = np.concatenate((y_train_pos,y_train_neg),axis=0)\n", 1622 | " \n", 1623 | " \n", 1624 | " #train model\n", 1625 | " for idx, model_func in enumerate(models):\n", 1626 | " model_name = model_names[idx]\n", 1627 | " model =model_func\n", 1628 | " model.fit(X_train_balanced, y_train_balanced)\n", 1629 | " y_pred = model.predict(X_test)\n", 1630 | " \n", 1631 | " cm = confusion_matrix(y_test, y_pred, labels=[0,1])\n", 1632 | " pos_recall = cm[1,1] / (cm[1,0] + cm[1,1])\n", 1633 | " neg_recall = cm[0,0] / (cm[0,0] + cm[0,1])\n", 1634 | " mh_recall = 2*neg_recall*pos_recall / (neg_recall + pos_recall)\n", 1635 | " report = classification_report(y_test, y_pred,output_dict=True)\n", 1636 | " \n", 1637 | " print('mh_recall = ', mh_recall)\n", 1638 | " print('pos_recall = ', pos_recall)\n", 1639 | " print('neg_recall = ', neg_recall)\n", 1640 | " \n", 1641 | " df_y_result = df_test[['ticker_', 'Filed_Date']]\n", 1642 | " df_y_result['true_'] = y_test\n", 1643 | " df_y_result['pred_'] = y_pred\n", 1644 | " \n", 1645 | " dict_results[label][model_name] = {'df_y_result': df_y_result, 'class_report': report}\n", 1646 | " \n", 1647 | " \n", 1648 | "with open(output_filename, 'wb') as handle: \n", 1649 | " pickle.dump(dict_results, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 1650 | "\n" 1651 | ] 1652 | } 1653 | ], 1654 | "metadata": { 1655 | "kernelspec": { 1656 | "display_name": "Python 3", 1657 | "language": "python", 1658 | "name": "python3" 1659 | }, 1660 | "language_info": { 1661 | "codemirror_mode": { 1662 | "name": "ipython", 1663 | "version": 3 1664 | }, 1665 | "file_extension": ".py", 1666 | "mimetype": "text/x-python", 1667 | "name": "python", 1668 | "nbconvert_exporter": "python", 1669 | "pygments_lexer": "ipython3", 1670 | "version": "3.7.6" 1671 | } 1672 | }, 1673 | "nbformat": 4, 1674 | "nbformat_minor": 4 1675 | } 1676 | --------------------------------------------------------------------------------