└── kaggle-hm-recommend.ipynb /kaggle-hm-recommend.ipynb: -------------------------------------------------------------------------------- {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Recommend Items Frequently Purchased Together\nThis notebook demonstrates how effective it is to recommend items that are frequently purchased together. The current best-scoring public notebook [here][1] recommends each customer's own last purchases and scores public LB 0.020. In this notebook, we will begin with that idea and additionally recommend items that are frequently purchased together with a customer's previous purchases. This improves the score to LB 0.021. 
This notebook's strategy is as follows:\n* recommend items previously purchased [idea here][1]\n* recommend items that are bought together with previous purchases [idea here][2]\n* recommend popular items [idea here][1]\n\n[1]: https://www.kaggle.com/hengzheng/time-is-our-best-friend-v2\n[2]: https://www.kaggle.com/cdeotte/customers-who-bought-this-frequently-buy-this","metadata":{}},{"cell_type":"markdown","source":"# RAPIDS cuDF\nWe will use RAPIDS cuDF for fast dataframe operations.","metadata":{}},{"cell_type":"code","source":"import cudf\nprint('RAPIDS version',cudf.__version__)","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:54:42.216243Z","iopub.execute_input":"2022-02-20T02:54:42.216574Z","iopub.status.idle":"2022-02-20T02:54:45.800073Z","shell.execute_reply.started":"2022-02-20T02:54:42.216493Z","shell.execute_reply":"2022-02-20T02:54:45.799236Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Load Transactions, Reduce Memory\nA discussion about reducing memory is [here][1].\n\n[1]: https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635","metadata":{}},{"cell_type":"code","source":"train = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')\ntrain['customer_id'] = train['customer_id'].str[-16:].str.hex_to_int().astype('int64')\ntrain['article_id'] = train.article_id.astype('int32')\ntrain.t_dat = cudf.to_datetime(train.t_dat)\ntrain = train[['t_dat','customer_id','article_id']]\ntrain.to_parquet('train.pqt',index=False)\nprint( train.shape )\ntrain.head()","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:54:45.801797Z","iopub.execute_input":"2022-02-20T02:54:45.80252Z","iopub.status.idle":"2022-02-20T02:55:25.310423Z","shell.execute_reply.started":"2022-02-20T02:54:45.802481Z","shell.execute_reply":"2022-02-20T02:55:25.30968Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Find 
Each Customer's Last Week of Purchases\nOur final predictions will follow the row order of our dataframe. Each row of our dataframe will be a prediction. We will create the prediction string later with `train.groupby('customer_id').article_id.sum()`. Since `article_id` is a string, the groupby sum will concatenate all of a customer's predictions into a single string, in the order of the dataframe. So as we proceed in this notebook, we will order the dataframe how we want our predictions ordered.","metadata":{}},{"cell_type":"code","source":"tmp = train.groupby('customer_id').t_dat.max().reset_index()\ntmp.columns = ['customer_id','max_dat']\ntrain = train.merge(tmp,on=['customer_id'],how='left')\ntrain['diff_dat'] = (train.max_dat - train.t_dat).dt.days\ntrain = train.loc[train['diff_dat']<=6]\nprint('Train shape:',train.shape)","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:55:25.311718Z","iopub.execute_input":"2022-02-20T02:55:25.311959Z","iopub.status.idle":"2022-02-20T02:55:25.495643Z","shell.execute_reply.started":"2022-02-20T02:55:25.311925Z","shell.execute_reply":"2022-02-20T02:55:25.494879Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# (1) Recommend Most Often Previously Purchased Items\nNote that many operations in cuDF will shuffle the order of the dataframe rows. Therefore we need to sort afterward, because we want the most often previously purchased items first; this will be the order of our predictions. 
Since we sort by `ct` and then `t_dat`, we will recommend items that have been purchased more frequently first, followed by items purchased more recently.","metadata":{}},{"cell_type":"code","source":"tmp = train.groupby(['customer_id','article_id'])['t_dat'].agg('count').reset_index()\ntmp.columns = ['customer_id','article_id','ct']\ntrain = train.merge(tmp,on=['customer_id','article_id'],how='left')\ntrain = train.sort_values(['ct','t_dat'],ascending=False)\ntrain = train.drop_duplicates(['customer_id','article_id'])\ntrain = train.sort_values(['ct','t_dat'],ascending=False)\ntrain.head()","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:55:25.497788Z","iopub.execute_input":"2022-02-20T02:55:25.498561Z","iopub.status.idle":"2022-02-20T02:55:25.77822Z","shell.execute_reply.started":"2022-02-20T02:55:25.498521Z","shell.execute_reply":"2022-02-20T02:55:25.777441Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# (2) Recommend Items Purchased Together\nIn my notebook [here][1], we compute a dictionary of items frequently purchased together. We will load and use that dictionary below. Note that we use `drop_duplicates` so that we don't recommend an item that the user has already bought and that we have already recommended above. We will need to use Pandas for some steps because RAPIDS cuDF doesn't have two convenient commands: (1) creating a new column from a dictionary map of another column, and (2) groupby aggregate string sum.\n\nWe concatenate these rows after the rows containing customers' previous purchases. Therefore we will recommend previously purchased items first and items purchased together second. 
Note that the trick to convert a column of int32 into a prediction string (using groupby agg string sum) is from the notebook [here][2].\n\n[1]: https://www.kaggle.com/cdeotte/customers-who-bought-this-frequently-buy-this\n[2]: https://www.kaggle.com/hiroshisakiyama/recommending-items-recently-bought","metadata":{}},{"cell_type":"code","source":"# USE PANDAS TO MAP COLUMN WITH DICTIONARY\nimport pandas as pd, numpy as np\ntrain = train.to_pandas()\npairs = np.load('../input/hmitempairs/pairs_cudf.npy',allow_pickle=True).item()\ntrain['article_id2'] = train.article_id.map(pairs)","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:55:25.779661Z","iopub.execute_input":"2022-02-20T02:55:25.779975Z","iopub.status.idle":"2022-02-20T02:55:26.335798Z","shell.execute_reply.started":"2022-02-20T02:55:25.77994Z","shell.execute_reply":"2022-02-20T02:55:26.335071Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# RECOMMENDATION OF PAIRED ITEMS\ntrain2 = train[['customer_id','article_id2']].copy()\ntrain2 = train2.loc[train2.article_id2.notnull()]\ntrain2 = train2.drop_duplicates(['customer_id','article_id2'])\ntrain2 = train2.rename({'article_id2':'article_id'},axis=1)","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:55:26.337146Z","iopub.execute_input":"2022-02-20T02:55:26.337426Z","iopub.status.idle":"2022-02-20T02:55:27.6275Z","shell.execute_reply.started":"2022-02-20T02:55:26.337365Z","shell.execute_reply":"2022-02-20T02:55:27.626712Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# CONCATENATE PAIRED ITEM RECOMMENDATIONS AFTER PREVIOUS PURCHASE RECOMMENDATIONS\ntrain = train[['customer_id','article_id']]\ntrain = pd.concat([train,train2],axis=0,ignore_index=True)\ntrain.article_id = train.article_id.astype('int32')\ntrain = 
train.drop_duplicates(['customer_id','article_id'])","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:55:27.628788Z","iopub.execute_input":"2022-02-20T02:55:27.629036Z","iopub.status.idle":"2022-02-20T02:55:29.600364Z","shell.execute_reply.started":"2022-02-20T02:55:27.629002Z","shell.execute_reply":"2022-02-20T02:55:29.599697Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# CONVERT RECOMMENDATIONS INTO SINGLE STRING\ntrain.article_id = ' 0' + train.article_id.astype('str')\npreds = cudf.DataFrame( train.groupby('customer_id').article_id.sum().reset_index() )\npreds.columns = ['customer_id','prediction']\npreds.head()","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:55:29.601495Z","iopub.execute_input":"2022-02-20T02:55:29.603286Z","iopub.status.idle":"2022-02-20T02:55:42.703889Z","shell.execute_reply.started":"2022-02-20T02:55:29.603253Z","shell.execute_reply":"2022-02-20T02:55:42.703213Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# (3) Recommend Last Week's Most Popular Items\nAfter recommending previous purchases and items purchased together, we will then recommend the 12 most popular items. 
Therefore, if our previous recommendations did not fill up a customer's 12 recommendations, the remainder will be filled with popular items.","metadata":{}},{"cell_type":"code","source":"train = cudf.read_parquet('train.pqt')\ntrain.t_dat = cudf.to_datetime(train.t_dat)\ntrain = train.loc[train.t_dat >= cudf.to_datetime('2020-09-16')]\ntop12 = ' 0' + ' 0'.join(train.article_id.value_counts().to_pandas().index.astype('str')[:12])\nprint(\"Last week's top 12 popular items:\")\nprint( top12 )","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:55:42.705242Z","iopub.execute_input":"2022-02-20T02:55:42.705714Z","iopub.status.idle":"2022-02-20T02:55:43.170621Z","shell.execute_reply.started":"2022-02-20T02:55:42.705679Z","shell.execute_reply":"2022-02-20T02:55:43.169177Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Write Submission CSV\nWe will merge our predictions onto `sample_submission.csv` and submit to Kaggle.","metadata":{}},{"cell_type":"code","source":"sub = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv')\nsub = sub[['customer_id']]\nsub['customer_id_2'] = sub['customer_id'].str[-16:].str.hex_to_int().astype('int64')\nsub = sub.merge(preds.rename({'customer_id':'customer_id_2'},axis=1),\\\n on='customer_id_2', how='left').fillna('')\ndel sub['customer_id_2']\nsub.prediction = sub.prediction + top12\nsub.prediction = sub.prediction.str.strip()\nsub.prediction = sub.prediction.str[:131]\nsub.to_csv('submission.csv',index=False)\nsub.head()","metadata":{"execution":{"iopub.status.busy":"2022-02-20T02:55:43.172957Z","iopub.execute_input":"2022-02-20T02:55:43.173205Z","iopub.status.idle":"2022-02-20T02:55:47.517952Z","shell.execute_reply.started":"2022-02-20T02:55:43.173171Z","shell.execute_reply":"2022-02-20T02:55:47.517241Z"},"trusted":true},"execution_count":null,"outputs":[]}]} --------------------------------------------------------------------------------