├── README.md
├── Urban Air Pollution Challenge by #ZindiWeekendz
│   ├── Instructions.md
│   ├── README.md
│   ├── zindi-weekendz-pollution-blend.ipynb
│   └── zindi-weekendz-pollution-v4.ipynb
└── The Zimnat Insurance Assurance Challenge by #ZindiWeekendz
    └── zindi_weekend_insurance_final_solution.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # zindi-winning-solutions
2 | My solutions to various Zindi competitions. Zindi is an African data science competition platform, which can be found at https://zindi.africa/
3 |
--------------------------------------------------------------------------------
/Urban Air Pollution Challenge by #ZindiWeekendz/Instructions.md:
--------------------------------------------------------------------------------
1 | ## Instructions for running.
2 |
3 | 1. Run the notebooks zindi-weekendz-pollution-v1.ipynb through zindi-weekendz-pollution-v4.ipynb to generate an output submission file for each of them.
4 | 2. In each notebook, change the data path from '/kaggle/input' to wherever the files are stored on your local machine (see the example below).
5 | 3. Finally, run the notebook zindi-weekendz-pollution-blend.ipynb to generate the final output file, **submission.csv**.
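A minimal sketch of the path change in step 2, assuming the competition CSVs have been downloaded to a local `data/` folder (the Kaggle path is the one used in the notebooks; `DATA_DIR` is a hypothetical local location):

```python
import pandas as pd

# The notebooks read the data from Kaggle paths such as:
#   train = pd.read_csv('/kaggle/input/zindi-weekendz-pollution/Train.csv')
# Point them at your local copy instead, for example:
DATA_DIR = './data'  # hypothetical folder holding Train.csv, Test.csv, SampleSubmission.csv
train = pd.read_csv(f'{DATA_DIR}/Train.csv')
test = pd.read_csv(f'{DATA_DIR}/Test.csv')
sample_sub = pd.read_csv(f'{DATA_DIR}/SampleSubmission.csv')
```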
6 |
7 | ## Approach
8 |
9 | Only the first notebook, **zindi-weekendz-pollution-v1.ipynb**, has comments added, because all the notebooks are almost the same.
10 |
11 | The idea was simply to use past and future values, along with the current readings, to predict the target. If you look at the feature importances, you will see that these past and future values are indeed important. The kinds of features that were added include:
12 |
13 | 1. Past and future values of the target for an arbitrary number of days, which you can experiment with.
14 | 2. Past and future values of the sensor readings for an arbitrary number of days.
15 | 3. Extracting information from the date column, such as weekday, month, is_month_start and is_month_end.
16 | 4. Adding cyclic features because of the cyclic behaviour of time. Consider that Monday repeats every 7 days and Sunday comes just before Monday, yet an ordinal encoding would map Sunday to 6 and Monday to 0, putting them far apart. Cyclic features help eliminate some of this problem (see the sketch after this list).
17 | 5. Adding frequency encoding for **place_ID**, since some place_IDs occur more often in the data than others; a feature was added to capture this.
18 |
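The sketch below illustrates features 1-5 under stated assumptions: `df` is the concatenated train/test frame with a parsed `Date` column, a `Place_ID` column and the `target` column, and the one-day shift and helper name are illustrative rather than the exact choices made in the notebooks.

```python
import numpy as np
import pandas as pd

def add_simple_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(['Place_ID', 'Date']).copy()

    # 1 & 2: past and future values, shifted within each place.
    # The same pattern applies to any sensor-reading column.
    for col in ['target']:
        df[f'{col}_lag1'] = df.groupby('Place_ID')[col].shift(1)
        df[f'{col}_lead1'] = df.groupby('Place_ID')[col].shift(-1)

    # 3: plain date parts.
    df['weekday'] = df['Date'].dt.weekday
    df['month'] = df['Date'].dt.month
    df['is_month_start'] = df['Date'].dt.is_month_start.astype(int)
    df['is_month_end'] = df['Date'].dt.is_month_end.astype(int)

    # 4: cyclic encoding, so Sunday (6) and Monday (0) end up close together.
    df['weekday_sin'] = np.sin(2 * np.pi * df['weekday'] / 7)
    df['weekday_cos'] = np.cos(2 * np.pi * df['weekday'] / 7)

    # 5: frequency encoding of Place_ID.
    df['place_freq'] = df['Place_ID'].map(df['Place_ID'].value_counts())
    return df
```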
19 | The final model was a blend of the 4 different LightGBM models I created; the weights were adjusted for optimal performance on the leaderboard (see the sketch below).
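For reference, the blend in zindi-weekendz-pollution-blend.ipynb boils down to the following (same file names and weights as the notebook, rewritten here with a loop and a `.copy()` to avoid the SettingWithCopyWarning the notebook triggers):

```python
import pandas as pd

ID_COL = 'Place_ID X Date'

df = pd.read_csv('preds_lgbm_v1.csv')[[ID_COL]]
for v in ['v1', 'v2', 'v3', 'v4']:
    df[v] = pd.read_csv(f'preds_lgbm_{v}.csv')['target']

# Nested weighted average; the weights were tuned against the leaderboard.
sub_df = df[[ID_COL]].copy()
sub_df['target'] = ((df['v1'] * 0.75 + df['v2'] * 0.25) * 0.6 + df['v4'] * 0.4) * 0.9 + df['v3'] * 0.1
sub_df.to_csv('submission.csv', index=False)
```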
20 |
21 |
22 |
--------------------------------------------------------------------------------
/Urban Air Pollution Challenge by #ZindiWeekendz/README.md:
--------------------------------------------------------------------------------
1 | # Urban Air Pollution Challenge by #ZindiWeekendz
2 |
3 | ## Can you predict air quality in cities around the world using satellite data?
4 |
5 | The objective of this challenge is to predict PM2.5 particulate matter concentration in the air every day for each city. PM2.5 refers to atmospheric particulate matter with a diameter of less than 2.5 micrometers, and is one of the most harmful air pollutants. PM2.5 is a common measure of air quality that normally requires ground-based sensors to measure. The data covers the last three months, spanning hundreds of cities across the globe.
6 |
7 | The data comes from three main sources:
8 |
9 | * Ground-based air quality sensors. These measure the target variable (PM2.5 particle concentration). In addition to the target column (which is the daily mean concentration) there are also columns for minimum and maximum readings on that day, the variance of the readings and the total number (count) of sensor readings used to compute the target value. This data is only provided for the train set - you must predict the target variable for the test set.
10 | * The Global Forecast System (GFS) for weather data. Humidity, temperature and wind speed, which can be used as inputs for your model.
11 | * The Sentinel 5P satellite. This satellite monitors various pollutants in the atmosphere. For each pollutant, we queried the offline Level 3 (L3) datasets available in Google Earth Engine (you can read more about the individual products here: https://developers.google.com/earth-engine/datasets/catalog/sentinel-5p). For a given pollutant, for example NO2, we provide all data from the Sentinel 5P dataset for that pollutant. This includes the key measurements like NO2_column_number_density (a measure of NO2 concentration) as well as metadata like the satellite altitude. We recommend that you focus on the key measurements, either the column_number_density or the tropospheric_X_column_number_density (which measures density closer to Earth’s surface).
12 | Unfortunately, this data is not 100% complete. Some locations have no sensor readings for a particular day, and so those rows have been excluded. There are also gaps in the input data, particularly the satellite data for CH4.
13 |
14 | ### Variable Definitions: Read about the datasets at the following pages:
15 |
16 | * Weather Data: https://developers.google.com/earth-engine/datasets/catalog/NOAA_GFS0P25
17 | * Sentinel 5P data: https://developers.google.com/earth-engine/datasets/catalog/sentinel-5p - all columns begin with the dataset name (eg L3_NO2 corresponds to https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_NO2) - look at the corresponding dataset on GEE for detailed descriptions of the image bands - band names should match the second half of the column titles.
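For example, the key measurements recommended above can be picked out by name thanks to this naming convention (a minimal sketch; it assumes `Train.csv` is in the working directory):

```python
import pandas as pd

train = pd.read_csv('Train.csv')  # assumed local path to the training data

# Satellite columns are named "<dataset>_<band>", so the density measurements
# can be selected with a simple substring match.
density_cols = [c for c in train.columns if 'column_number_density' in c]
tropospheric_cols = [c for c in density_cols if 'tropospheric' in c]
print(len(density_cols), len(tropospheric_cols))
```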
18 |
19 | ## Leaderboard
20 |
21 | * **[Private LB](https://zindi.africa/hackathons/urban-air-pollution-challenge/leaderboard)**: **1st place out of 115**
22 |
--------------------------------------------------------------------------------
/Urban Air Pollution Challenge by #ZindiWeekendz/zindi-weekendz-pollution-blend.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"import numpy as np\nimport pandas as pd\nimport os","execution_count":33,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"ID_COL = 'Place_ID X Date'","execution_count":34,"outputs":[]},{"metadata":{"_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","trusted":true},"cell_type":"code","source":"df = pd.read_csv('preds_lgbm_v1.csv')[[ID_COL]]\ndf['v1'] = pd.read_csv('preds_lgbm_v1.csv')['target']\ndf['v2'] = pd.read_csv('preds_lgbm_v2.csv')['target']\ndf['v3'] = pd.read_csv('preds_lgbm_v3.csv')['target']\ndf['v4'] = pd.read_csv('preds_lgbm_v4.csv')['target']","execution_count":35,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# Blending the different versions"},{"metadata":{},"cell_type":"markdown","source":"## The weights were decided based on the performance of each version on the leaderboard"},{"metadata":{"trusted":true},"cell_type":"code","source":"sub_df = df[[ID_COL]]\nsub_df['target'] = ((df['v1']*0.75 + df['v2']*0.25)*0.6 + df['v4']*0.4)*0.9 + df['v3']*0.1","execution_count":36,"outputs":[{"output_type":"stream","text":"/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame.\nTry using .loc[row_indexer,col_indexer] = value instead\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n \n","name":"stderr"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"sub_df.to_csv('submission.csv', index=False)","execution_count":37,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"sub_df.head(10)","execution_count":38,"outputs":[{"output_type":"execute_result","execution_count":38,"data":{"text/plain":" Place_ID X Date target\n0 0OS9LVX X 2020-01-02 32.680554\n1 0OS9LVX X 2020-01-03 31.289176\n2 0OS9LVX X 2020-01-04 28.412420\n3 0OS9LVX X 2020-01-05 34.441060\n4 0OS9LVX X 2020-01-06 56.556865\n5 0OS9LVX X 2020-01-07 57.094444\n6 0OS9LVX X 2020-01-08 27.458843\n7 0OS9LVX X 2020-01-09 29.744958\n8 0OS9LVX X 2020-01-10 26.061362\n9 0OS9LVX X 2020-01-11 34.992617","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"sub_df['target'].describe()","execution_count":39,"outputs":[{"output_type":"execute_result","execution_count":39,"data":{"text/plain":"count 16136.000000\nmean 58.007864\nstd 35.462370\nmin 2.591207\n25% 31.461230\n50% 48.513680\n75% 75.257445\nmax 280.533302\nName: target, dtype: float64"},"metadata":{}}]}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":4}
--------------------------------------------------------------------------------
/The Zimnat Insurance Assurance Challenge by #ZindiWeekendz/zindi_weekend_insurance_final_solution.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"metadata":{},"cell_type":"markdown","source":"# The Zimnat Insurance Assurance Challenge by #ZindiWeekendz - Farzi Data Scientists (Rank 3)\n\n*The data describes 51,685 life assurance policies, each identified by a unique Policy ID. Each year, some policies lapse as clients change jobs, move countries etc. The information on these policies are in multiple files.*\n\n*Numerical quantities have been transformed, and many categories have been assigned unique IDs in place of the original text.*\n\n*The objective of this hackathon is to develop a predictive model that determines the likelihood for a customer to churn - to seek an alternative insurer or simply stop paying for insurance altogether.*\n\n*sample_submission.csv contains rows for the remaining Policy IDs (with ‘?’s in TRAIN). You must predict which of these policies marked with '?' will lapse in 2020*\n\n* client_data.csv - Contains some personal information on the principal member, such as location, branch and agent code, age etc.\n* payment_history.csv - Contains payment history up to the end of 2018 tied to Policy ID. Payments made in 2019 are not provided.\n* policy_data.csv - Describes the policies themselves. There may be multiple rows for each Policy ID since policies can cover more than one person.\n* train.csv - contains a list of all the policies. Policies that lapsed in 2017, 2018 or 2019 are identified with a 1 in the ‘Lapse’ column, and the year is provided. The policies with a '?' in the 'Lapse' and 'Lapse Year' column are the policies that remained and had not lapsed as of the end of 2019. You must estimate the likelihood that these policies lapsed or not in 2020.\n* sample_submission.csv - is an example of what your submission should look like. The order of the rows does not matter but the name of the ID must be correct.\n* variable_defintions.txt - definitions of the variables"},{"metadata":{"trusted":true},"cell_type":"code","source":"from lightgbm import LGBMClassifier\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\nimport os\nfrom tqdm import *\nfrom sklearn.metrics import *\nimport warnings \nwarnings.simplefilter('ignore')\nfrom IPython.core.interactiveshell import InteractiveShell\nInteractiveShell.ast_node_interactive = \"all\"","execution_count":99,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"%%time\npayment_history = pd.read_csv('payment_history.csv')\nclient_data = pd.read_csv('client_data.csv')\npolicy_data = pd.read_csv('policy_data.csv')\ntrain = pd.read_csv('train.csv')\nsample_sub = pd.read_csv('sample_sub.csv')","execution_count":101,"outputs":[{"output_type":"stream","text":"CPU times: user 1.85 s, sys: 36 ms, total: 1.89 s\nWall time: 1.88 s\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":"Our final solution used only **policy_data**.\n**client_data** cannot be used since it will not vary with time. 
(Or maybe it did, and we did not find out)\n**payment_history** was given only till 2018, and caused overfitting for 2020 test data, hence we ignored it in our final solution."},{"metadata":{"trusted":true},"cell_type":"code","source":"policy_data.shape, payment_history.shape","execution_count":102,"outputs":[{"output_type":"execute_result","execution_count":102,"data":{"text/plain":"((282815, 14), (495503, 5))"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"policy_data = policy_data.drop_duplicates()\npayment_history = payment_history.drop_duplicates()","execution_count":103,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"policy_data.shape, payment_history.shape","execution_count":104,"outputs":[{"output_type":"execute_result","execution_count":104,"data":{"text/plain":"((278988, 14), (482179, 5))"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"policy_data.nunique()","execution_count":105,"outputs":[{"output_type":"execute_result","execution_count":105,"data":{"text/plain":"Policy ID 51685\nNP2_EFFECTDATE 43\nPPR_PRODCD 17\nNPR_PREMIUM 2234\nNPH_LASTNAME 25275\nCLF_LIFECD 6\nNSP_SUBPROPOSAL 171\nNPR_SUMASSURED 1200\nNLO_TYPE 6\nNLO_AMOUNT 974\nAAG_AGCODE 591\nPCL_LOCATCODE 15\nOCCUPATION 240\nCATEGORY 6\ndtype: int64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"policy_data.info()","execution_count":106,"outputs":[{"output_type":"stream","text":"\nInt64Index: 278988 entries, 0 to 282814\nData columns (total 14 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 Policy ID 278988 non-null object \n 1 NP2_EFFECTDATE 278988 non-null object \n 2 PPR_PRODCD 278988 non-null object \n 3 NPR_PREMIUM 278934 non-null float64\n 4 NPH_LASTNAME 278988 non-null object \n 5 CLF_LIFECD 278988 non-null int64 \n 6 NSP_SUBPROPOSAL 278988 non-null int64 \n 7 NPR_SUMASSURED 185357 non-null float64\n 8 NLO_TYPE 278988 non-null object \n 9 NLO_AMOUNT 88654 non-null float64\n 10 AAG_AGCODE 278988 non-null object \n 11 PCL_LOCATCODE 278988 non-null object \n 12 OCCUPATION 278988 non-null object \n 13 CATEGORY 278988 non-null object \ndtypes: float64(3), int64(2), object(9)\nmemory usage: 31.9+ MB\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"policy_data.head()","execution_count":107,"outputs":[{"output_type":"execute_result","execution_count":107,"data":{"text/plain":" Policy ID NP2_EFFECTDATE PPR_PRODCD NPR_PREMIUM \\\n0 PID_EPZDSP8 1/9/2019 PPR_PRODCD_B2KVCE7 265.724174 \n1 PID_6M6G9IB 1/8/2018 PPR_PRODCD_64QNIHM 2795.069380 \n2 PID_UL0F7LH 1/8/2017 PPR_PRODCD_KOFUYNN 2492.759107 \n3 PID_TRGUBTU 1/4/2018 PPR_PRODCD_KOFUYNN 3982.538095 \n4 PID_TODLPIB 1/12/2019 PPR_PRODCD_KOFUYNN 1143.953733 \n\n NPH_LASTNAME CLF_LIFECD NSP_SUBPROPOSAL NPR_SUMASSURED \\\n0 NPH_LASTNAME_BPN2LEB 2 222 NaN \n1 NPH_LASTNAME_U2H3GC6 1 111 213380.713197 \n2 NPH_LASTNAME_B68RERV 1 111 238857.872515 \n3 NPH_LASTNAME_NPN3VGI 1 111 74968.903115 \n4 NPH_LASTNAME_9VSNH0E 3 555 238857.872515 \n\n NLO_TYPE NLO_AMOUNT AAG_AGCODE PCL_LOCATCODE \\\n0 NLO_TYPE_DPBHSAH NaN AAG_AGCODE_APWOOPE PCL_LOCATCODE_7SHK7I9 \n1 NLO_TYPE_XTHV3A3 609.054794 AAG_AGCODE_9Z3FBGA PCL_LOCATCODE_7VFS3EQ \n2 NLO_TYPE_XAJI0Y6 1339.461987 AAG_AGCODE_Y0LKFF0 PCL_LOCATCODE_SKPRCR4 \n3 NLO_TYPE_XAJI0Y6 7870.961557 AAG_AGCODE_1OCF2N0 PCL_LOCATCODE_SPQHMX5 \n4 NLO_TYPE_DPBHSAH NaN AAG_AGCODE_E31VV8B PCL_LOCATCODE_0T6GYGX \n\n OCCUPATION CATEGORY \n0 OCCUPATION_NNHJ7XV CATEGORY_GWW4FYB \n1 
OCCUPATION_IKCIDKW CATEGORY_R821UZV \n2 OCCUPATION_NUJZA7T CATEGORY_8DALFYO \n3 OCCUPATION_W9XA3KX CATEGORY_LXSLG6M \n4 OCCUPATION_NNHJ7XV CATEGORY_GWW4FYB ","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"policy_data['NP2_EFFECTDATE'] = pd.to_datetime(policy_data['NP2_EFFECTDATE'], format = '%d/%m/%Y')\nobj_cols = [c for c in policy_data.select_dtypes('object').columns.tolist() if c != 'Policy ID']\npolicy_data[obj_cols] = policy_data[obj_cols].apply(lambda x: pd.factorize(x)[0])\n\npolicy_data['NPR_PREMIUM - NLO_AMOUNT'] = policy_data['NPR_PREMIUM'] - policy_data['NLO_AMOUNT']\npolicy_data['NPR_PREMIUM / NPR_SUMASSURED'] = policy_data['NPR_PREMIUM'] / policy_data['NPR_SUMASSURED']\n\ndef get_last_payment_diff(x):\n try:\n return (x['NP2_EFFECTDATE'].values[-1] - x['NP2_EFFECTDATE'].values[-2])/ np.timedelta64(1, 'D')\n except:\n return np.nan\n\ndef get_pd_agg(pd):\n pd['policy_count'] = pd['Policy ID'].map(pd['Policy ID'].value_counts())\n aggs = {'NP2_EFFECTDATE': ['min', 'max', 'nunique', 'size'],\n 'NPR_PREMIUM': ['mean', 'min', 'max', 'sum', 'std', 'nunique'],\n 'NPR_SUMASSURED': ['mean','min', 'max', 'sum'],\n 'NLO_AMOUNT': ['mean', 'min', 'max', 'sum', 'std', 'nunique'],\n 'policy_count': ['sum', 'mean', 'std',],\n 'NPR_PREMIUM - NLO_AMOUNT': ['sum', 'mean', 'std'],\n 'NPR_PREMIUM / NPR_SUMASSURED': ['mean', 'std',]}\n pd_agg = pd.groupby('Policy ID').agg(aggs)\n pd_agg.columns = ['_'.join(c).strip('_') for c in pd_agg.columns]\n return pd_agg","execution_count":108,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"policy_data.columns","execution_count":109,"outputs":[{"output_type":"execute_result","execution_count":109,"data":{"text/plain":"Index(['Policy ID', 'NP2_EFFECTDATE', 'PPR_PRODCD', 'NPR_PREMIUM',\n 'NPH_LASTNAME', 'CLF_LIFECD', 'NSP_SUBPROPOSAL', 'NPR_SUMASSURED',\n 'NLO_TYPE', 'NLO_AMOUNT', 'AAG_AGCODE', 'PCL_LOCATCODE', 'OCCUPATION',\n 'CATEGORY', 'NPR_PREMIUM - NLO_AMOUNT', 'NPR_PREMIUM / NPR_SUMASSURED'],\n dtype='object')"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"def get_features(df, year = 2020):\n '''\n This function calculates the stats and generates new features only upto the year specified.\n e.g: year = 2018 implies years 2019 and 2020 are ignored while creating new features\n '''\n \n ### Policy data features\n pdata = policy_data[policy_data['NP2_EFFECTDATE'].dt.year <= year]\n \n ### Filter the policy data so it contains data only upto the previous year\n prev_pdata = policy_data[policy_data['NP2_EFFECTDATE'].dt.year <= year-1]\n \n ### Get aggregate features upto current year\n pd_agg = get_pd_agg(pdata)\n df = pd.merge(left=df, right=pd_agg, on = 'Policy ID', how = 'left')\n \n ### Change in sum of numerical features between current year and previous year\n for c in ['NPR_SUMASSURED', 'NPR_PREMIUM', 'NLO_AMOUNT']:\n t1 = pdata.groupby('Policy ID')[c].sum()\n t2 = prev_pdata.groupby('Policy ID')[c].sum()\n t_diff = (t1 - t2).reset_index().rename({c: f'{c}_sum_change'}, axis=1).fillna(0)\n df = pd.merge(df, t_diff, on = 'Policy ID', how = 'left')\n \n \n ### Change and ratio of policy counts between current year and Previous year\n policy_cnt_curr_yr = pdata['Policy ID'].value_counts()\n policy_cnt_prev_yr = prev_pdata['Policy ID'].value_counts()\n \n policy_cnt_diff = (policy_cnt_curr_yr - policy_cnt_prev_yr).reset_index().rename({'index': 'Policy ID', 'Policy ID': 'policy_count_change'}, axis=1).fillna(0)\n df = pd.merge(df, policy_cnt_diff, on = 'Policy ID', how = 'left')\n \n policy_cnt_change_ratio = (policy_cnt_curr_yr / policy_cnt_prev_yr).reset_index().rename({'index': 'Policy ID', 'Policy ID': 
'policy_count_change_ratio'}, axis=1).fillna(0)\n df = pd.merge(df, policy_cnt_change_ratio, on = 'Policy ID', how = 'left')\n \n ### Change in mean of numerical features between current year and previous year\n for c in ['NPR_PREMIUM', 'NPR_SUMASSURED','NLO_AMOUNT']:\n curr = pdata.groupby('Policy ID')[c].mean()\n prev = prev_pdata.groupby('Policy ID')[c].mean()\n diff = (curr - prev).reset_index().rename({'index': 'Policy ID', c: f'{c}_change'}, axis=1).fillna(0)\n df = pd.merge(df, diff, on = 'Policy ID', how = 'left')\n \n return df","execution_count":110,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"### Train using only the data upto 2018. Data upto 2018 implies any policies that have lapsed in 2019 and 2020, haven't lapsed yet (We are still in 2018 :P), so are 0.\n\ntrn_2k18 = train.copy()\ntrn_2k18['target'] = 1\ntrn_2k18.loc[~trn_2k18['Lapse Year'].isin(['2017', '2018']), 'target'] = 0\ntrn_2k18 = get_features(trn_2k18, year = 2018)","execution_count":111,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"### Validate using 2019's data\nval_2k19 = train[~train['Lapse Year'].isin(['2017', '2018'])]\nval_2k19['target'] = 1\nval_2k19.loc[val_2k19['Lapse Year'] == '?', 'target'] = 0\nval_2k19 = get_features(val_2k19, year = 2019)\n\n#### Train again using complete data, and predict on test data\ntrn_all = train.copy()\ntrn_all['target'] = 1\ntrn_all.loc[trn_all['Lapse Year'] == '?', 'target'] = 0\ntrn_all = get_features(trn_all, year = 2020)\n\ntest = train[train['Lapse Year'] == '?']\ntest = pd.merge(sample_sub[['Policy ID']], test, on = 'Policy ID', how = 'left')\ntest = get_features(test, year = 2020)\n\ntest.shape, sample_sub.shape","execution_count":112,"outputs":[{"output_type":"execute_result","execution_count":112,"data":{"text/plain":"((43707, 39), (43707, 2))"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"### Highly imbalanced dataset.\ntrn_2k18['target'].value_counts(normalize=True)","execution_count":113,"outputs":[{"output_type":"execute_result","execution_count":113,"data":{"text/plain":"0 0.942517\n1 0.057483\nName: target, dtype: float64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"### Highly imbalanced dataset.\ntrn_all['target'].value_counts(normalize=True)","execution_count":114,"outputs":[{"output_type":"execute_result","execution_count":114,"data":{"text/plain":"0 0.845642\n1 0.154358\nName: target, dtype: float64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"### Try setting all the submission values, to this constant value, you will get 0.30xxx as score, without doing anything. 
That was a baseline :D\n### 0.057 -> 1 count for year upto 2018\n### 0.154 -> 1 count for year upto 2019\n\n(0.057 + 0.154)/2","execution_count":115,"outputs":[{"output_type":"execute_result","execution_count":115,"data":{"text/plain":"0.1055"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"### Some features were overfitting hence ignored\n\nfeatures = [c for c in trn_all.columns if c not in ['Policy ID', 'Lapse', 'Lapse Year', 'NP2_EFFECTDATE_min', 'NP2_EFFECTDATE_max', 'target']]\nfeatures","execution_count":116,"outputs":[{"output_type":"execute_result","execution_count":116,"data":{"text/plain":"['NP2_EFFECTDATE_nunique',\n 'NP2_EFFECTDATE_size',\n 'NPR_PREMIUM_mean',\n 'NPR_PREMIUM_min',\n 'NPR_PREMIUM_max',\n 'NPR_PREMIUM_sum',\n 'NPR_PREMIUM_std',\n 'NPR_PREMIUM_nunique',\n 'NPR_SUMASSURED_mean',\n 'NPR_SUMASSURED_min',\n 'NPR_SUMASSURED_max',\n 'NPR_SUMASSURED_sum',\n 'NLO_AMOUNT_mean',\n 'NLO_AMOUNT_min',\n 'NLO_AMOUNT_max',\n 'NLO_AMOUNT_sum',\n 'NLO_AMOUNT_std',\n 'NLO_AMOUNT_nunique',\n 'policy_count_sum',\n 'policy_count_mean',\n 'policy_count_std',\n 'NPR_PREMIUM - NLO_AMOUNT_sum',\n 'NPR_PREMIUM - NLO_AMOUNT_mean',\n 'NPR_PREMIUM - NLO_AMOUNT_std',\n 'NPR_PREMIUM / NPR_SUMASSURED_mean',\n 'NPR_PREMIUM / NPR_SUMASSURED_std',\n 'NPR_SUMASSURED_sum_change',\n 'NPR_PREMIUM_sum_change',\n 'NLO_AMOUNT_sum_change',\n 'policy_count_change',\n 'policy_count_change_ratio',\n 'NPR_PREMIUM_change',\n 'NPR_SUMASSURED_change',\n 'NLO_AMOUNT_change']"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"trn_all.head()","execution_count":117,"outputs":[{"output_type":"execute_result","execution_count":117,"data":{"text/plain":" Policy ID Lapse Lapse Year target NP2_EFFECTDATE_min NP2_EFFECTDATE_max \\\n0 PID_4928TWH ? ? 0 2017-08-01 2017-08-01 \n1 PID_KBLLEGK ? ? 0 2018-05-01 2018-05-01 \n2 PID_90F0QA3 ? ? 0 2019-09-01 2019-09-01 \n3 PID_18F3NHF ? ? 0 2019-12-01 2019-12-01 \n4 PID_SX4QUVO ? ? 0 2019-03-01 2019-03-01 \n\n NP2_EFFECTDATE_nunique NP2_EFFECTDATE_size NPR_PREMIUM_mean \\\n0 1 1 42911.077278 \n1 1 2 3561.268991 \n2 1 2 6164.812836 \n3 1 1 2278.189789 \n4 1 1 1619.046308 \n\n NPR_PREMIUM_min ... NPR_PREMIUM / NPR_SUMASSURED_mean \\\n0 42911.077278 ... 0.125962 \n1 3561.268991 ... 0.021576 \n2 6164.812836 ... 0.024418 \n3 2278.189789 ... 0.009538 \n4 1619.046308 ... 0.011165 \n\n NPR_PREMIUM / NPR_SUMASSURED_std NPR_SUMASSURED_sum_change \\\n0 NaN 0.0 \n1 0.0 0.0 \n2 0.0 0.0 \n3 NaN 0.0 \n4 NaN 0.0 \n\n NPR_PREMIUM_sum_change NLO_AMOUNT_sum_change policy_count_change \\\n0 0.0 0.0 0.0 \n1 0.0 0.0 0.0 \n2 0.0 0.0 0.0 \n3 0.0 0.0 0.0 \n4 0.0 0.0 0.0 \n\n policy_count_change_ratio NPR_PREMIUM_change NPR_SUMASSURED_change \\\n0 1.0 0.0 0.0 \n1 1.0 0.0 0.0 \n2 1.0 0.0 0.0 \n3 1.0 0.0 0.0 \n4 1.0 0.0 0.0 \n\n NLO_AMOUNT_change \n0 0.0 \n1 0.0 \n2 0.0 \n3 0.0 \n4 0.0 \n\n[5 rows x 40 columns]","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"### Some features can be removed still\n\ndrop_features = [c for c in features if trn_all[c].sum() == 0]\nfeatures = [c for c in features if c not in drop_features]\nprint(drop_features)","execution_count":118,"outputs":[{"output_type":"stream","text":"['policy_count_std']\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"### Some features were overfitting hence ignored\n\nfeatures = [c for c in features if c not in ['AMOUNTPAID', 'NP2_EFFECTDATE_days_diff', 'policy_count_std']]","execution_count":119,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Training on 2018, validation on 2019"},{"metadata":{"trusted":true},"cell_type":"code","source":"clf = LGBMClassifier(num_leaves=32,max_depth=18, learning_rate=0.01, reg_alpha=1,reg_lambda=1, n_estimators=600, subsample=1, subsample_freq=1, colsample_bytree=0.55)\nclf.fit(trn_2k18[features], trn_2k18['target'], eval_set = [(val_2k19[features], val_2k19['target'])], verbose = 20)","execution_count":120,"outputs":[{"output_type":"stream","text":"[20]\tvalid_0's binary_logloss: 0.317732\n[40]\tvalid_0's binary_logloss: 0.300291\n[60]\tvalid_0's binary_logloss: 0.291181\n[80]\tvalid_0's binary_logloss: 0.283651\n[100]\tvalid_0's binary_logloss: 0.278971\n[120]\tvalid_0's binary_logloss: 0.275966\n[140]\tvalid_0's binary_logloss: 0.273309\n[160]\tvalid_0's binary_logloss: 0.271762\n[180]\tvalid_0's binary_logloss: 0.270549\n[200]\tvalid_0's binary_logloss: 0.269712\n[220]\tvalid_0's binary_logloss: 0.269263\n[240]\tvalid_0's binary_logloss: 0.268986\n[260]\tvalid_0's binary_logloss: 0.268713\n[280]\tvalid_0's binary_logloss: 0.268712\n[300]\tvalid_0's binary_logloss: 0.268737\n[320]\tvalid_0's binary_logloss: 0.268734\n[340]\tvalid_0's binary_logloss: 0.268875\n[360]\tvalid_0's binary_logloss: 0.269038\n[380]\tvalid_0's binary_logloss: 0.269172\n[400]\tvalid_0's binary_logloss: 0.269256\n[420]\tvalid_0's binary_logloss: 0.269424\n[440]\tvalid_0's binary_logloss: 0.269646\n[460]\tvalid_0's binary_logloss: 0.269832\n[480]\tvalid_0's binary_logloss: 0.270026\n[500]\tvalid_0's binary_logloss: 0.270278\n[520]\tvalid_0's binary_logloss: 0.270428\n[540]\tvalid_0's binary_logloss: 0.270593\n[560]\tvalid_0's binary_logloss: 0.270737\n[580]\tvalid_0's binary_logloss: 0.270912\n[600]\tvalid_0's binary_logloss: 0.271107\n","name":"stdout"},{"output_type":"execute_result","execution_count":120,"data":{"text/plain":"LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.55,\n importance_type='split', learning_rate=0.01, max_depth=18,\n min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,\n n_estimators=600, n_jobs=-1, num_leaves=32, objective=None,\n random_state=None, reg_alpha=1, reg_lambda=1, silent=True,\n subsample=1, subsample_for_bin=200000, subsample_freq=1)"},"metadata":{}}]},{"metadata":{},"cell_type":"markdown","source":"## Training on complete data.\n\n* The number of iterations have been increased a little more compared to the best iteration for validation"},{"metadata":{"trusted":true},"cell_type":"code","source":"tp = pd.DataFrame()\nfor i in tqdm_notebook(range(7)):\n clf = LGBMClassifier(num_leaves=32, learning_rate=0.01, reg_alpha=2, n_estimators=380, subsample=1, subsample_freq=1, colsample_bytree=0.5, random_state=2**i)\n clf.fit(trn_all[features], trn_all['target'], eval_set = [(val_2k19[features], val_2k19['target'])], verbose = 380)\n tp[i] = clf.predict_proba(test[features])[:, 
1]\ntest_preds = tp.mean(axis=1)","execution_count":121,"outputs":[{"output_type":"display_data","data":{"text/plain":"HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"bd2a68089e004589bd43bf907aa05045"}},"metadata":{}},{"output_type":"stream","text":"[380]\tvalid_0's binary_logloss: 0.252244\n[380]\tvalid_0's binary_logloss: 0.251058\n[380]\tvalid_0's binary_logloss: 0.251556\n[380]\tvalid_0's binary_logloss: 0.251686\n[380]\tvalid_0's binary_logloss: 0.251647\n[380]\tvalid_0's binary_logloss: 0.251662\n[380]\tvalid_0's binary_logloss: 0.251376\n\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":"Lets see some stats about our target"},{"metadata":{"trusted":true},"cell_type":"code","source":"test_preds = tp.mean(axis=1)\nprint(pd.Series(test_preds).describe())","execution_count":122,"outputs":[{"output_type":"stream","text":"count 43707.000000\nmean 0.128244\nstd 0.130293\nmin 0.003687\n25% 0.003703\n50% 0.104520\n75% 0.234366\nmax 0.653318\ndtype: float64\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":"This thresholding greatly improved our score, from 0.260x to 0.24xx :D. The intuition is log loss heavily penalizes confident wrong predictions. Also most of the values were 0, so using a threshold to rectify it, seemed ok."},{"metadata":{"trusted":true},"cell_type":"code","source":"test_preds[test_preds < 0.03] = 0\ntest_preds[test_preds > 0.18] = 0.18\npd.Series(test_preds).describe()","execution_count":123,"outputs":[{"output_type":"execute_result","execution_count":123,"data":{"text/plain":"count 43707.000000\nmean 0.090678\nstd 0.086338\nmin 0.000000\n25% 0.000000\n50% 0.104520\n75% 0.180000\nmax 0.180000\ndtype: float64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"SUB_FILE_NAME = 'all_preds.csv'\nsample_sub['Lapse'] = test_preds\nsample_sub.to_csv(SUB_FILE_NAME, index=False)\n\nfrom IPython.display import HTML\ndef create_download_link(title = \"Download CSV file\", filename = \"data.csv\"): \n html = '{title}'\n html = html.format(title=title,filename=filename)\n return HTML(html)\ncreate_download_link(filename = SUB_FILE_NAME)","execution_count":124,"outputs":[{"output_type":"execute_result","execution_count":124,"data":{"text/plain":"","text/html":"Download CSV file"},"metadata":{}}]}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":4}
--------------------------------------------------------------------------------
/Urban Air Pollution Challenge by #ZindiWeekendz/zindi-weekendz-pollution-v4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
8 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
9 | },
10 | "outputs": [],
11 | "source": [
12 | "import numpy as np\n",
13 | "import pandas as pd\n",
14 | "import lightgbm as lgb\n",
15 | "import gc\n",
16 | "from fastai.tabular import *\n",
17 | "from sklearn.metrics import mean_squared_error as mse\n",
18 | "from sklearn.model_selection import KFold, StratifiedKFold\n",
19 | "from IPython.core.interactiveshell import InteractiveShell\n",
20 | "InteractiveShell.ast_node_interactivity = \"all\"\n",
21 | "import datetime\n",
22 | "from tqdm import tqdm_notebook"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 2,
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "def rmse(y_true, y_pred):\n",
32 | " return np.sqrt(mse(y_true, y_pred))"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 3,
38 | "metadata": {
39 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
40 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a"
41 | },
42 | "outputs": [],
43 | "source": [
44 | "train = pd.read_csv('/kaggle/input/zindi-weekendz-pollution/Train.csv')\n",
45 | "test = pd.read_csv('/kaggle/input/zindi-weekendz-pollution/Test.csv')\n",
46 | "sample_sub = pd.read_csv('/kaggle/input/zindi-weekendz-pollution/SampleSubmission.csv')\n",
47 | "\n",
48 | "train['Date'] = pd.to_datetime(train['Date'], format='%Y-%m-%d')\n",
49 | "test['Date'] = pd.to_datetime(test['Date'], format='%Y-%m-%d')\n",
50 | "\n",
51 | "ID_COL, TARGET_COL = 'Place_ID X Date', 'target'"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 4,
57 | "metadata": {},
58 | "outputs": [
59 | {
60 | "name": "stderr",
61 | "output_type": "stream",
62 | "text": [
63 | "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version\n",
64 | "of pandas will change to not sort by default.\n",
65 | "\n",
66 | "To accept the future behavior, pass 'sort=False'.\n",
67 | "\n",
68 | "To retain the current behavior and silence the warning, pass 'sort=True'.\n",
69 | "\n",
70 | " \"\"\"Entry point for launching an IPython kernel.\n"
71 | ]
72 | },
73 | {
74 | "data": {
75 | "text/plain": [
76 | "38"
77 | ]
78 | },
79 | "execution_count": 4,
80 | "metadata": {},
81 | "output_type": "execute_result"
82 | }
83 | ],
84 | "source": [
85 | "df = pd.concat([train, test]).reset_index(drop=True)\n",
86 | "features = [c for c in df.columns if c not in ['Date', 'target_count', 'target_min', 'Place_ID X Date', 'target_variance', 'Place_ID', 'target_max', 'target']]\n",
87 | "simple_feats = [c for c in features if ('angle' not in c) & ('height' not in c) & ('altittude' not in c)]\n",
88 | "len(simple_feats)"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 5,
94 | "metadata": {},
95 | "outputs": [
96 | {
97 | "name": "stderr",
98 | "output_type": "stream",
99 | "text": [
100 | "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: TqdmDeprecationWarning: This function will be removed in tqdm==5.0.0\n",
101 | "Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`\n",
102 | " \n"
103 | ]
104 | },
105 | {
106 | "data": {
107 | "application/vnd.jupyter.widget-view+json": {
108 | "model_id": "cbea4e3d24a645eeb3733f9508cf69ef",
109 | "version_major": 2,
110 | "version_minor": 0
111 | },
112 | "text/plain": [
113 | "HBox(children=(FloatProgress(value=0.0, max=24.0), HTML(value='')))"
114 | ]
115 | },
116 | "metadata": {},
117 | "output_type": "display_data"
118 | },
119 | {
120 | "name": "stdout",
121 | "output_type": "stream",
122 | "text": [
123 | "\n"
124 | ]
125 | },
126 | {
127 | "name": "stderr",
128 | "output_type": "stream",
129 | "text": [
130 | "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:10: TqdmDeprecationWarning: This function will be removed in tqdm==5.0.0\n",
131 | "Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`\n",
132 | " # Remove the CWD from sys.path while we load stuff.\n"
133 | ]
134 | },
135 | {
136 | "data": {
137 | "application/vnd.jupyter.widget-view+json": {
138 | "model_id": "a6234022dc9d4ea983e62bc863344757",
139 | "version_major": 2,
140 | "version_minor": 0
141 | },
142 | "text/plain": [
143 | "HBox(children=(FloatProgress(value=0.0, max=44.0), HTML(value='')))"
144 | ]
145 | },
146 | "metadata": {},
147 | "output_type": "display_data"
148 | },
149 | {
150 | "name": "stdout",
151 | "output_type": "stream",
152 | "text": [
153 | "\n"
154 | ]
155 | },
156 | {
157 | "name": "stderr",
158 | "output_type": "stream",
159 | "text": [
160 | "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:15: TqdmDeprecationWarning: This function will be removed in tqdm==5.0.0\n",
161 | "Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`\n",
162 | " from ipykernel import kernelapp as app\n"
163 | ]
164 | },
165 | {
166 | "data": {
167 | "application/vnd.jupyter.widget-view+json": {
168 | "model_id": "41d874433a334d6abda65442beba3550",
169 | "version_major": 2,
170 | "version_minor": 0
171 | },
172 | "text/plain": [
173 | "HBox(children=(FloatProgress(value=0.0, max=21.0), HTML(value='')))"
174 | ]
175 | },
176 | "metadata": {},
177 | "output_type": "display_data"
178 | },
179 | {
180 | "name": "stdout",
181 | "output_type": "stream",
182 | "text": [
183 | "\n"
184 | ]
185 | },
186 | {
187 | "data": {
188 | "text/html": [
189 | "