├── questions
    ├── q1.png
    └── q2.png
├── explanation
    ├── doc.pdf
    └── doc.tex
├── .gitignore
├── README.md
├── problem1
    ├── problem1.py
    └── problem1.ipynb
├── problem2
    └── problem2.ipynb
└── problem3
    └── problem3.ipynb

/questions/q1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ffs97/jpmc-quant-challenge/HEAD/questions/q1.png
--------------------------------------------------------------------------------
/questions/q2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ffs97/jpmc-quant-challenge/HEAD/questions/q2.png
--------------------------------------------------------------------------------
/explanation/doc.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ffs97/jpmc-quant-challenge/HEAD/explanation/doc.pdf
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *data*
2 | *.vscode*
3 | 
4 | *.swp
5 | *.swo
6 | *~
7 | 
8 | *.log
9 | *.aux
10 | *.bbl
11 | *.blg
12 | *.out
13 | 
14 | *.pyc
15 | *.o
16 | *.ipynb_checkpoint*
17 | 
18 | *.tar.*
19 | *.zip
20 | *.gz*
21 | 
22 | *-env*
23 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # JPMC Quant Challenge, 2018
2 | 
3 | This repository contains the code and explanation document for my submission to the
4 | JP Morgan Quant Challenge 2018, hosted by Mettl.
5 | 
6 | The explanation document, stored in the explanation directory (doc.pdf), describes
7 | my approach to each of the three problem statements.
8 | 
--------------------------------------------------------------------------------
/problem1/problem1.py:
--------------------------------------------------------------------------------
1 | import urllib2
2 | import pyperclip
3 | import numpy as np
4 | 
5 | from datetime import datetime
6 | from sklearn.neural_network import MLPRegressor
7 | 
8 | tr_url = "https://s3-ap-southeast-1.amazonaws.com/mettl-arq/questions/codelysis/machine-learning/fare-prediction/train.csv"
9 | ts_url = "https://s3-ap-southeast-1.amazonaws.com/mettl-arq/questions/codelysis/machine-learning/fare-prediction/test.csv"
10 | 
11 | def read_data(url, split_string=",", test=False):
12 |     data = urllib2.urlopen(url)
13 |     data.readline()
14 |     data = data.readlines()
15 | 
16 |     def process(line):
17 |         line = line.strip().split(split_string)
18 | 
19 |         if test:
20 |             return line[1:]
21 | 
22 |         return line[1:-1], line[-1]
23 | 
24 |     data = [process(line) for line in data]
25 | 
26 |     if test:
27 |         return data
28 | 
29 |     X, Y = zip(*data)
30 | 
31 |     X = list(X)
32 |     Y = np.array(Y, dtype=float)
33 | 
34 |     return X, Y
35 | 
36 | X_tr, Y_tr = read_data(tr_url)
37 | X_ts = read_data(ts_url, test=True)
38 | 
39 | cities = [(x[1], x[2]) for x in X_tr]
40 | c1, c2 = zip(*cities)
41 | cities = c1 + c2
42 | cities = list(set(cities))
43 | cities.sort()
44 | tuple_cities = {}
45 | index = 0
46 | for i, city_1 in enumerate(cities):
47 |     for j, city_2 in enumerate(cities[(i+1):]):
48 |         tuple_cities[city_1 + city_2] = index
49 |         tuple_cities[city_2 + city_1] = index
50 |         index += 1
51 | 
52 | def process_features(X, tuple_cities):
53 |     today = datetime.today()
54 |     time_0 = datetime.strptime("0:0", "%H:%M")
55 | 
56 |     for i in range(len(X)):
57 |         x = X[i]
58 | 
59 |         cities_index = tuple_cities[x[1] + x[2]]
60 |         cities_one_hot = [0] * len(tuple_cities)
61 |         cities_one_hot[cities_index] = 1
62 | 
63 |         flight_day = datetime.strptime(x[3] + " " + x[4], "%Y-%m-%d %H:%M")
64 |         booking_day = datetime.strptime(x[5], "%Y-%m-%d")
65 |         days_diff = (flight_day - booking_day).days
66 | 
67 |         dob = datetime.strptime(X[i][0], "%Y-%m-%d")
68 |         age = int(round((today - dob).days / 365.0))
69 | 
70 |         flight_time = datetime.strptime(x[4], "%H:%M")
71 |         flight_time = (time_0 - flight_time).seconds / 600
72 | 
73 |         bclass = 0 if x[6] == "Economy" else 1
74 | 
75 |         # X[i] = np.array([bclass, cities_index, flight_time, days_diff, age])
76 |         X[i] = np.array([bclass, flight_time, days_diff, age] + cities_one_hot)
77 | 
78 |     X = np.array(X)
79 |     return X
80 | 
81 | X_tr = process_features(X_tr, tuple_cities)
82 | X_ts = process_features(X_ts, tuple_cities)
83 | 
84 | ### Regression
85 | 
86 | #regressor = MLPRegressor(hidden_layer_sizes=(50, 20, 5), max_iter=10000)
87 | #regressor.fit(X_tr, Y_tr)
88 | 
89 | # Train one network per ticket class (column 0 is the economy/business flag)
90 | eco_indices = (X_tr[:, 0] == 0)
91 | regressor_eco = MLPRegressor(hidden_layer_sizes=(10, 20, 5), max_iter=10000, tol=0.00001)
92 | regressor_eco.fit(X_tr[eco_indices][:, 1:], Y_tr[eco_indices])
93 | regressor_eco.score(X_tr[eco_indices][:, 1:], Y_tr[eco_indices])
94 | 
95 | bus_indices = (X_tr[:, 0] == 1)
96 | regressor_bus = MLPRegressor(hidden_layer_sizes=(10, 20, 5), max_iter=10000)
97 | regressor_bus.fit(X_tr[bus_indices][:, 1:], Y_tr[bus_indices])
98 | regressor_bus.score(X_tr[bus_indices][:, 1:], Y_tr[bus_indices])
99 | 
100 | # Predict each test fare with the regressor trained on its ticket class
101 | Y_ts = np.zeros(len(X_ts))
102 | Y_ts[X_ts[:, 0] == 0] = regressor_eco.predict(X_ts[X_ts[:, 0] == 0][:, 1:])
103 | Y_ts[X_ts[:, 0] == 1] = regressor_bus.predict(X_ts[X_ts[:, 0] == 1][:, 1:])
104 | 
105 | pyperclip.copy("return [" + ", ".join([str(y) for y in Y_ts]) + "]")
106 | 
--------------------------------------------------------------------------------
/problem2/problem2.ipynb:
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "ExecuteTime": { 8 | "end_time": "2018-09-30T11:48:42.852886Z", 9 | "start_time": "2018-09-30T11:48:42.720898Z" 10 | } 11 | }, 12 | "outputs": [], 13 | "source": [ 14 | "import math\n", 15 | "import numpy as np" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": { 22 | "ExecuteTime": { 23 | "end_time": "2018-09-30T11:48:42.861517Z", 24 | "start_time": "2018-09-30T11:48:42.856734Z" 25 | } 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "a = 1\n", 30 | "theta = 150\n", 31 | "X0 = 150\n", 32 | "sigma = 5\n", 33 | "T = 2" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 3, 39 | "metadata": { 40 | "ExecuteTime": { 41 | "end_time": "2018-09-30T11:48:42.870001Z", 42 | "start_time": "2018-09-30T11:48:42.864642Z" 43 | } 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "n_partitions = 100\n", 48 | "dsigma = 0.001\n", 49 | "dt = T / float(n_partitions)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 4, 55 | "metadata": { 56 | "ExecuteTime": { 57 | "end_time": "2018-09-30T11:48:42.877689Z", 58 | "start_time": "2018-09-30T11:48:42.873587Z" 59 | } 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "sigma_ntrl = sigma\n", 64 | "sigma_plus = sigma + dsigma\n", 65 | "sigma_mins = sigma - dsigma" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 5, 71 | "metadata": { 72 | "ExecuteTime": { 73 | "end_time": "2018-09-30T11:48:46.928788Z", 74 | "start_time": "2018-09-30T11:48:42.881400Z" 75 | } 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "values = []\n", 80 | "for _ in range(10000):\n", 81 | " x_ntrl = X0\n", 82 | " x_plus = X0\n", 83 | " x_mins = X0\n", 84 | " Nw = np.random.normal(0, 1, n_partitions)\n", 85 | " \n", 86 | " for i in range(n_partitions):\n", 87 | " try:\n", 88 | " sq_xwdt_ntrl = math.sqrt(x_ntrl * dt)\n", 89 | " sq_xwdt_plus = math.sqrt(x_plus * dt)\n", 90 | " sq_xwdt_mins = math.sqrt(x_mins * dt)\n", 91 | " except:\n", 92 | " print x, dt\n", 93 | " \n", 94 | " x_ntrl += a * (theta - x_ntrl) * dt + sigma_ntrl * sq_xwdt_ntrl * Nw[i]\n", 95 | " x_plus += a * (theta - x_plus) * dt + sigma_plus * sq_xwdt_plus * Nw[i]\n", 96 | " x_mins += a * (theta - x_mins) * dt + sigma_mins * sq_xwdt_mins * Nw[i]\n", 97 | " \n", 98 | " values.append([x_ntrl, x_plus, x_mins])" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 6, 104 | "metadata": { 105 | "ExecuteTime": { 106 | "end_time": "2018-09-30T11:48:46.945305Z", 107 | "start_time": "2018-09-30T11:48:46.932603Z" 108 | } 109 | }, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "[151.01249869855152, 52.484295091300716, 1.3319134247176123]\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "values = np.array(values)\n", 121 | "print [np.mean(values[:, 0]), np.mean(np.maximum(values[:, 0] - 100, 0)), (np.mean(np.maximum(values[:, 1] - 100, 0)) - np.mean(np.maximum(values[:, 2] - 100, 0))) / (2 * dsigma)]" 122 | ] 123 | } 124 | ], 125 | "metadata": { 126 | "hide_input": false, 127 | "kernelspec": { 128 | "display_name": "Python 2 (Machine Learning)", 129 | "language": "python", 130 | "name": "machine-learning" 131 | }, 132 | "language_info": { 133 | "codemirror_mode": { 134 | "name": "ipython", 135 | "version": 2 136 | }, 137 | "file_extension": ".py", 138 | "mimetype": "text/x-python", 
139 | "name": "python", 140 | "nbconvert_exporter": "python", 141 | "pygments_lexer": "ipython2", 142 | "version": "2.7.15" 143 | } 144 | }, 145 | "nbformat": 4, 146 | "nbformat_minor": 2 147 | } 148 | -------------------------------------------------------------------------------- /problem3/problem3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "ExecuteTime": { 8 | "end_time": "2018-09-30T14:37:08.506936Z", 9 | "start_time": "2018-09-30T14:37:08.460940Z" 10 | } 11 | }, 12 | "outputs": [], 13 | "source": [ 14 | "import math\n", 15 | "import numpy as np" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": { 22 | "ExecuteTime": { 23 | "end_time": "2018-09-30T14:37:08.527858Z", 24 | "start_time": "2018-09-30T14:37:08.511406Z" 25 | } 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "X_SPOT = np.array([2443.25,2447.83,2459.27,2459.14,2460.61,2473.83,2473.45,2472.54,2469.91,2477.13,2477.83,2475.42,2472.1,2470.3,2476.35,2477.57,2472.16,2476.83,2480.91,2474.92,2474.02,2438.21,2441.32,2465.84,2464.61,2468.11,2430.01,2425.55,2428.37,2452.51,2444.04,2438.97,2443.05,2444.24,2446.3,2457.59,2471.65,2476.55,2457.85,2457.85,2465.54,2465.1,2461.43,2488.11,2496.48,2498.37,2495.62,2500.23,2503.87,2506.65,2508.24,2500.6,2502.22,2496.66,2496.84,2507.04,2510.06,2519.36,2529.12,2534.58,2537.74,2552.07,2549.33,2544.73,2550.64,2555.24,2550.93,2553.17,2557.64,2559.36,2561.26,2562.1,2575.21,2564.98,2569.13,2557.15,2560.4,2581.07,2572.83,2575.26,2579.36,2579.85,2587.84,2591.13,2590.64,2594.38,2584.62,2582.3,2584.84,2578.87,2564.62,2585.64,2578.85,2582.14,2599.03,2597.08,2602.42,2602.42,2601.42,2627.04,2626.07,2647.58,2642.22,2639.44,2629.57,2629.27,2636.98,2651.5,2659.99,2664.11,2662.85,2652.01,2675.81,2690.16,2681.47,2679.25,2684.57,2683.34,2680.5,2680.5,2682.62,2687.54,2673.61,2695.81,2695.81,2713.06,2723.99,2743.15,2747.71,2751.29,2748.23,2767.56,2786.24,2776.42,2776.42,2802.56,2798.03,2810.3,2832.97,2839.13,2837.54,2839.25,2872.87,2853.53,2822.43,2823.81,2821.98,2762.13,2648.94,2695.14,2681.66,2581,2619.55,2656,2662.94,2698.63,2731.2,2732.22,2716.26,2716.26,2701.33,2703.96,2747.3,2779.6,2744.28,2713.83,2677.67,2691.25,2720.94,2728.12,2726.8,2738.97,2786.57,2783.02,2765.31,2749.48,2747.33,2752.01,2712.92,2716.94,2711.93,2643.69,2588.26,2658.55,2612.62,2605,2640.87,2581.88,2581.88,2614.45,2644.69,2662.84,2604.47,2613.16,2656.87,2642.19,2663.99,2656.3,2677.84,2706.39,2708.64,2693.13,2670.14,2670.29,2634.56,2639.4,2666.94,2669.91,2648.05,2654.8,2635.67,2629.73,2663.42,2672.63,2671.92,2697.79,2723.07,2727.72,2730.13,2711.45,2722.46,2720.13,2712.97,2733.01,2724.44,2733.29,2727.76,2721.33,2689.86,2689.86,2724.01,2705.27,2734.62,2746.87,2748.8,2772.35,2770.37,2779.03,2782,2786.85,2775.63,2782.49,2779.66,2773.75,2762.59,2767.32,2749.76,2754.88,2717.07,2723.06,2699.63,2716.31,2718.37,2726.71,2713.22,2736.61,2736.61,2759.82,2784.17,2793.84,2774.02,2798.29])\n", 30 | "Y_SPOT = 
np.array([35.5,35.86,36.35,36.38,36.43,36.47,36.41,36.07,35.82,35.57,35.62,35.94,35.77,35.98,34.76,34.82,34.77,35.27,35.3,35.39,35.28,34.88,34.93,35.47,35.54,35.75,35,34.83,34.91,35.3,35.49,35.52,35.6,35.51,35.52,35.82,36.54,37.36,37.23,37.23,37.67,36.91,37,37.35,37.89,38.21,38.79,38.88,38.59,38.7,38.88,39.1,39.42,40.3,40.26,40.58,40.58,40.38,42.15,43.45,43.78,43.85,44.93,45.33,45.21,45.47,44.89,45.88,45.76,45.02,45.12,45.35,45.61,45.15,46.48,45.12,45.25,44.64,43.37,42.98,43.13,42.6,42.34,42.14,41.7,42.11,42.11,42.66,43.57,43,42.86,43.6,43.88,44.88,44.97,44.29,44.46,44.46,44.17,44.92,43.81,43.09,42.79,43.05,42.8,42.15,42.02,42.02,41.67,41.53,41.4,40.81,40.95,42.15,42.49,42.52,42.16,42.02,41.8,41.8,41.31,41.38,40.99,41.8,41.8,42.82,44.14,44.01,44.22,44.05,43,44.19,44.07,44.19,44.19,44.03,43.86,43.15,43.29,43.38,44.16,43.16,43.49,43.02,42.7,42.41,42.43,41,39.54,41.86,42.39,40.75,41.46,42,41.4,41.81,41.85,41.09,40.77,40.77,40.56,40.91,40.91,41.54,40.17,39.35,37.79,37.43,37.74,37.93,37.74,37.84,37.84,37.83,38.01,37.69,37.85,37.94,37.01,36.89,37.58,36.35,35.17,35.99,34.87,35.47,36.34,35.76,35.76,36.94,38.03,38,37.68,37.83,39.07,39,38.83,38.73,39.17,39.22,38.93,37.77,37.61,37.69,37.93,38.11,38.25,37.65,36.74,36.42,36.2,36.15,36.71,36.34,36.33,36.27,37.16,36.89,36.63,36.94,38.03,38.3,37.79,38.09,38.28,37.85,38.39,38.3,37.38,37.38,37.83,42.7,43.2,43.78,43.41,43.93,44.01,44.25,44.85,44.18,44.45,43.57,43.91,43.95,42.26,41.95,41.12,41.25,40.61,41.01,40.37,40.52,39.4,39.5,38.97,39.47,39.47,39.16,39.75,40.09,39.3,39.27])" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 3, 36 | "metadata": { 37 | "ExecuteTime": { 38 | "end_time": "2018-09-30T14:37:08.537536Z", 39 | "start_time": "2018-09-30T14:37:08.531876Z" 40 | } 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "def delta(arr, map_function=None, use_map=False):\n", 45 | " if map_function is not None:\n", 46 | " if use_map:\n", 47 | " arr = map(lambda x: map_function(x), arr)\n", 48 | " else:\n", 49 | " arr = map_function(arr)\n", 50 | " \n", 51 | " return np.array(arr[1:] - arr[:-1])" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 4, 57 | "metadata": { 58 | "ExecuteTime": { 59 | "end_time": "2018-09-30T14:37:08.582252Z", 60 | "start_time": "2018-09-30T14:37:08.542758Z" 61 | } 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "log_delta_X = delta(X_SPOT, np.log)\n", 66 | "log_delta_Y = delta(Y_SPOT, np.log)\n", 67 | "\n", 68 | "rho = np.corrcoef(log_delta_X, log_delta_Y)\n", 69 | "\n", 70 | "sigma_X = np.std(log_delta_X)\n", 71 | "sigma_Y = np.std(log_delta_Y)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 5, 77 | "metadata": { 78 | "ExecuteTime": { 79 | "end_time": "2018-09-30T14:37:08.590875Z", 80 | "start_time": "2018-09-30T14:37:08.586089Z" 81 | } 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "N = 4\n", 86 | "r = 0.01" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 6, 92 | "metadata": { 93 | "ExecuteTime": { 94 | "end_time": "2018-09-30T14:37:08.602025Z", 95 | "start_time": "2018-09-30T14:37:08.595988Z" 96 | } 97 | }, 98 | "outputs": [], 99 | "source": [ 100 | "n_partitions = 252\n", 101 | "m = 10000\n", 102 | "\n", 103 | "dt = N\n", 104 | "\n", 105 | "spot_X = X_SPOT[-1]\n", 106 | "spot_Y = Y_SPOT[-1]\n", 107 | "\n", 108 | "rdt = r * dt\n", 109 | "sigmadt_X = sigma_X * math.sqrt(dt)\n", 110 | "sigmadt_Y = sigma_Y * math.sqrt(dt)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 7, 116 | "metadata": { 117 | 
"ExecuteTime": { 118 | "end_time": "2018-09-30T14:37:08.611969Z", 119 | "start_time": "2018-09-30T14:37:08.606897Z" 120 | } 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "partitions_to_check = dict([((0.25 * i * 252) / N - 1, True) for i in range(1, N + 1)])\n", 125 | "\n", 126 | "q_spot_X = [0.75 * spot_X, 1.25 * spot_X]\n", 127 | "q_spot_Y = [0.75 * spot_Y, 1.25 * spot_Y]" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 8, 133 | "metadata": { 134 | "ExecuteTime": { 135 | "end_time": "2018-09-30T14:37:11.405186Z", 136 | "start_time": "2018-09-30T14:37:08.615665Z" 137 | } 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "values = []\n", 142 | "for _ in range(m):\n", 143 | " x = spot_X\n", 144 | " y = spot_Y\n", 145 | " \n", 146 | " dW = np.random.multivariate_normal([0, 0], rho, n_partitions)\n", 147 | " \n", 148 | " accept = True\n", 149 | " \n", 150 | " for i in range(n_partitions):\n", 151 | " x += x * rdt + sigmadt_X * dW[i][0]\n", 152 | " y += y * rdt + sigmadt_Y * dW[i][1]\n", 153 | " \n", 154 | " if i in partitions_to_check:\n", 155 | " if not (q_spot_X[0] <= x <= q_spot_X[1]):\n", 156 | " accept = False\n", 157 | " break\n", 158 | " \n", 159 | " if not (q_spot_Y[0] <= y <= q_spot_Y[1]):\n", 160 | " accept = False\n", 161 | " break\n", 162 | "\n", 163 | " if accept:\n", 164 | " values.append([x, y])" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 9, 170 | "metadata": { 171 | "ExecuteTime": { 172 | "end_time": "2018-09-30T14:37:11.415177Z", 173 | "start_time": "2018-09-30T14:37:11.408193Z" 174 | } 175 | }, 176 | "outputs": [ 177 | { 178 | "name": "stdout", 179 | "output_type": "stream", 180 | "text": [ 181 | "0.0\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "if len(values) == 0:\n", 187 | " print 0.0\n", 188 | "else:\n", 189 | " values = np.array(values, dtype=float) / np.array([spot_X, spot_Y])\n", 190 | " values = np.sum(values, axis=1) / 2.0 - 1\n", 191 | " values = np.maximum(values, 0)\n", 192 | " \n", 193 | " print max(0.0, np.mean(values))" 194 | ] 195 | } 196 | ], 197 | "metadata": { 198 | "hide_input": false, 199 | "kernelspec": { 200 | "display_name": "Python 2", 201 | "language": "python", 202 | "name": "python2" 203 | }, 204 | "language_info": { 205 | "codemirror_mode": { 206 | "name": "ipython", 207 | "version": 2 208 | }, 209 | "file_extension": ".py", 210 | "mimetype": "text/x-python", 211 | "name": "python", 212 | "nbconvert_exporter": "python", 213 | "pygments_lexer": "ipython2", 214 | "version": "2.7.15" 215 | } 216 | }, 217 | "nbformat": 4, 218 | "nbformat_minor": 2 219 | } 220 | -------------------------------------------------------------------------------- /problem1/problem1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "ExecuteTime": { 8 | "end_time": "2018-09-30T14:00:05.363390Z", 9 | "start_time": "2018-09-30T14:00:04.919401Z" 10 | } 11 | }, 12 | "outputs": [], 13 | "source": [ 14 | "%matplotlib inline\n", 15 | "\n", 16 | "import urllib2\n", 17 | "import numpy as np\n", 18 | "\n", 19 | "from datetime import datetime\n", 20 | "from matplotlib import pyplot as plt\n", 21 | "from sklearn.neural_network import MLPRegressor\n", 22 | "from sklearn.tree import DecisionTreeRegressor" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "metadata": { 29 | "ExecuteTime": { 30 | "end_time": "2018-09-30T14:00:05.372579Z", 31 | 
"start_time": "2018-09-30T14:00:05.368156Z" 32 | } 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "tr_url = \"https://s3-ap-southeast-1.amazonaws.com/mettl-arq/questions/codelysis/machine-learning/fare-prediction/train.csv\"\n", 37 | "ts_url = \"https://s3-ap-southeast-1.amazonaws.com/mettl-arq/questions/codelysis/machine-learning/fare-prediction/test.csv\"" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": { 44 | "ExecuteTime": { 45 | "end_time": "2018-09-30T14:00:05.392127Z", 46 | "start_time": "2018-09-30T14:00:05.376195Z" 47 | } 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "def read_data(url, split_string=\",\", test=False):\n", 52 | " data = urllib2.urlopen(url)\n", 53 | " data.readline()\n", 54 | " data = data.readlines()\n", 55 | " \n", 56 | " def process(line):\n", 57 | " line = line.strip().split(split_string)\n", 58 | " \n", 59 | " if test:\n", 60 | " return line[1:]\n", 61 | " \n", 62 | " return line[1:-1], line[-1]\n", 63 | " \n", 64 | " data = [process(line) for line in data]\n", 65 | " \n", 66 | " if test:\n", 67 | " return data\n", 68 | " \n", 69 | " X, Y = zip(*data)\n", 70 | " \n", 71 | " X = list(X)\n", 72 | " Y = np.array(Y, dtype=float)\n", 73 | " \n", 74 | " return X, Y" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 4, 80 | "metadata": { 81 | "ExecuteTime": { 82 | "end_time": "2018-09-30T14:00:10.240325Z", 83 | "start_time": "2018-09-30T14:00:05.395908Z" 84 | } 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "X_tr, Y_tr = read_data(tr_url)\n", 89 | "X_ts = read_data(ts_url, test=True)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 5, 95 | "metadata": { 96 | "ExecuteTime": { 97 | "end_time": "2018-09-30T14:00:10.262597Z", 98 | "start_time": "2018-09-30T14:00:10.243672Z" 99 | } 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "cities = [(x[1], x[2]) for x in X_tr]\n", 104 | "c1,c2 = zip(*cities)\n", 105 | "cities = c1 + c2\n", 106 | "cities = list(set(cities))\n", 107 | "cities.sort()\n", 108 | "tuple_cities = {}\n", 109 | "index = 0\n", 110 | "for i, city_1 in enumerate(cities):\n", 111 | " tuple_cities[city_1] = dict()\n", 112 | " for j, city_2 in enumerate(cities):\n", 113 | " tuple_cities[city_1][city_2] = index\n", 114 | " #tuple_cities[city_2 + city_1] = index\n", 115 | " index += 1" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 8, 121 | "metadata": { 122 | "ExecuteTime": { 123 | "end_time": "2018-09-30T14:00:53.499153Z", 124 | "start_time": "2018-09-30T14:00:53.473108Z" 125 | } 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "def process_features(X, tuple_cities):\n", 130 | " today = datetime.today()\n", 131 | " time_0 = datetime.strptime(\"0:0\", \"%H:%M\")\n", 132 | "\n", 133 | " for i in range(len(X)):\n", 134 | " x = X[i]\n", 135 | "\n", 136 | " cities_index = tuple_cities[x[1]][x[2]]\n", 137 | " cities_one_hot = [0] * (len(cities) * len(cities))\n", 138 | " cities_one_hot[cities_index] = 1\n", 139 | "\n", 140 | " flight_day = datetime.strptime(x[3] + \" \" + x[4], \"%Y-%m-%d %H:%M\")\n", 141 | " bookind_day = datetime.strptime(x[5], \"%Y-%m-%d\")\n", 142 | " days_diff = (flight_day - bookind_day).days\n", 143 | "\n", 144 | " dob = datetime.strptime(X[i][0], \"%Y-%m-%d\")\n", 145 | " age = int(round((today - dob).days / 365.0))\n", 146 | "\n", 147 | " flight_time = datetime.strptime(x[4], \"%H:%M\")\n", 148 | " flight_time = (time_0 - flight_time).seconds / 600\n", 149 | "\n", 150 | " bclass = 0 if x[6] == 
\"Economy\" else 1\n", 151 | "\n", 152 | " # X[i] = np.array([bclass, cities_index, flight_time, days_diff, age])\n", 153 | " X[i] = np.array([bclass, flight_time, days_diff, age] + cities_one_hot)\n", 154 | "\n", 155 | " X = np.array(X)\n", 156 | " return X" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 9, 162 | "metadata": { 163 | "ExecuteTime": { 164 | "end_time": "2018-09-30T14:00:55.104849Z", 165 | "start_time": "2018-09-30T14:00:54.570801Z" 166 | } 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "X_tr = process_features(X_tr, tuple_cities)\n", 171 | "X_ts = process_features(X_ts, tuple_cities)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 10, 177 | "metadata": { 178 | "ExecuteTime": { 179 | "end_time": "2018-09-30T14:00:55.756975Z", 180 | "start_time": "2018-09-30T14:00:55.750180Z" 181 | } 182 | }, 183 | "outputs": [], 184 | "source": [ 185 | "plot = False\n", 186 | "if plot:\n", 187 | " var = X[:, 3]\n", 188 | " classes = list(set(var))\n", 189 | " Y_tr = []\n", 190 | " # fig=plt.figure(figsize=(18, 16), dpi= 240, facecolor='w', edgecolor='k')\n", 191 | " for classe in classes:\n", 192 | " Y_tr.append(list(Y[var == classe]))\n", 193 | " bp = plt.boxplot(Y_tr)\n", 194 | " plt.savefig(\"class.png\")\n", 195 | " plt.close()" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Regression" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 11, 208 | "metadata": { 209 | "ExecuteTime": { 210 | "end_time": "2018-09-30T14:01:19.177956Z", 211 | "start_time": "2018-09-30T14:01:02.552837Z" 212 | } 213 | }, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/plain": [ 218 | "0.7690609579210475" 219 | ] 220 | }, 221 | "execution_count": 11, 222 | "metadata": {}, 223 | "output_type": "execute_result" 224 | } 225 | ], 226 | "source": [ 227 | "regressor = MLPRegressor(hidden_layer_sizes=(50, 20, 5), max_iter=10000, tol=0.00001)\n", 228 | "regressor.fit(X_tr, Y_tr)\n", 229 | "regressor.score(X_tr, Y_tr)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": { 236 | "ExecuteTime": { 237 | "end_time": "2018-09-30T13:43:54.663069Z", 238 | "start_time": "2018-09-30T13:43:54.659749Z" 239 | } 240 | }, 241 | "outputs": [], 242 | "source": [ 243 | "# Y_ts = regressor.predict(X_ts)" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": { 250 | "ExecuteTime": { 251 | "end_time": "2018-09-30T13:49:33.994871Z", 252 | "start_time": "2018-09-30T13:49:24.052552Z" 253 | } 254 | }, 255 | "outputs": [], 256 | "source": [ 257 | "regressor_eco = MLPRegressor(hidden_layer_sizes=(10, 20, 5), max_iter=10000, tol=0.00001)\n", 258 | "eco_indices = (X_tr[:, 0] == 0)\n", 259 | "regressor_eco.fit(X_tr[eco_indices][:, 1:], Y_tr[eco_indices])\n", 260 | "regressor_eco.score(X_tr[eco_indices][:, 1:], Y_tr[eco_indices])" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "ExecuteTime": { 268 | "end_time": "2018-09-30T13:48:54.199893Z", 269 | "start_time": "2018-09-30T13:48:47.156053Z" 270 | } 271 | }, 272 | "outputs": [], 273 | "source": [ 274 | "regressor_bus = MLPRegressor(hidden_layer_sizes=(10, 20, 5), max_iter=10000)\n", 275 | "bus_indices = (X_tr[:, 0] == 1)\n", 276 | "regressor_bus.fit(X_tr[bus_indices][:, 1:], Y_tr[bus_indices])\n", 277 | "regressor_bus.score(X_tr[bus_indices][:, 1:], Y_tr[bus_indices])" 278 | ] 279 | }, 280 
| { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": { 284 | "ExecuteTime": { 285 | "end_time": "2018-09-30T13:49:39.422768Z", 286 | "start_time": "2018-09-30T13:49:39.413773Z" 287 | } 288 | }, 289 | "outputs": [], 290 | "source": [ 291 | "Y_ts = np.zeros(len(X_ts))\n", 292 | "Y_ts[X_ts[:, 0] == 0] = regressor_eco.predict(X_ts[X_ts[:, 0] == 0][:, 1:])\n", 293 | "Y_ts[X_ts[:, 0] == 1] = regressor_bus.predict(X_ts[X_ts[:, 0] == 1][:, 1:])" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": { 300 | "ExecuteTime": { 301 | "end_time": "2018-09-30T13:49:40.041171Z", 302 | "start_time": "2018-09-30T13:49:40.020251Z" 303 | } 304 | }, 305 | "outputs": [], 306 | "source": [ 307 | "import pyperclip\n", 308 | "pyperclip.copy(\"return [\" + \", \".join([str(y) for y in Y_ts]) + \"]\")" 309 | ] 310 | } 311 | ], 312 | "metadata": { 313 | "hide_input": false, 314 | "kernelspec": { 315 | "display_name": "Python 2 (Machine Learning)", 316 | "language": "python", 317 | "name": "machine-learning" 318 | }, 319 | "language_info": { 320 | "codemirror_mode": { 321 | "name": "ipython", 322 | "version": 2 323 | }, 324 | "file_extension": ".py", 325 | "mimetype": "text/x-python", 326 | "name": "python", 327 | "nbconvert_exporter": "python", 328 | "pygments_lexer": "ipython2", 329 | "version": "2.7.15" 330 | } 331 | }, 332 | "nbformat": 4, 333 | "nbformat_minor": 2 334 | } 335 | -------------------------------------------------------------------------------- /explanation/doc.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | 3 | \usepackage{assign} 4 | \setcoursetitle{JP Morgan Quant Challenge} 5 | \setstartnewpage{TRUE} 6 | \setquestionheader{TRUE} 7 | 8 | \RenewDocumentCommand{\makeheader}{ O{ASSIGNMENT} O{} }{ 9 | \parbox{0.75\textwidth}{ 10 | {\sffamily\bfseries JP Morgan Quant Challenge, 2018} \\ 11 | {\sffamily\bfseries Explanation Document} 12 | 13 | \rule{0mm}{2pt} 14 | 15 | {\em Name:\/} \authname \\ 16 | {\em Team:\/} \$\$\$\$ \\ 17 | {\em Date:\/} \headate 18 | 19 | } 20 | \hfill 21 | \begin{tabular}{c@{}} 22 | { 23 | \sffamily\bfseries\Large #1 24 | }\\ 25 | \rule{0mm}{15mm}\scalebox{5}{ 26 | \sffamily \ifthenelse{\equal{#2}{}}{\assigncode}{#2} 27 | } 28 | \end{tabular} 29 | 30 | \vspace{1mm} 31 | 32 | \hrule height1pt 33 | 34 | \vspace{3mm} 35 | } 36 | 37 | \begin{document} 38 | 39 | \begin{question} 40 | 41 | \begin{qsection}{Method} 42 | I have simply used Multi-Layer Perceptron to predict the ticket rate for each customer. This is implemented using Scikit-Learn library for python. 43 | \end{qsection} 44 | 45 | \begin{qsection}{Features} 46 | 47 | For each customer, features are computed using his/her DOB, Flight Time, Flight Date, Booking Date, Place of Departure, Place of Arrival and Ticket Class (Business/Economy). 48 | 49 | \begin{enumerate}[label=\textbf{\arabic*.}] 50 | \item From DOB, the customer's age is extracted, based on the intuition that the age of a customer might affect the price he/she is willing to pay for a flight. 51 | 52 | \item As Flight Times affect the rate of a flight, the time is used to compute the difference of minutes from 0:00, i.e. for a flight at 0:20, the feature value would be (0x60 + 20 = 20). 
53 | 
54 | \item The Flight Date, Flight Time and Booking Date are used together to compute how many days before the flight's actual departure the customer booked the ticket, as the ticket rate often increases towards the departure date.
55 | 
56 | \item Since the Ticket Class obviously affects the fare, a binary variable is computed which, if one, indicates that the ticket is a business-class ticket.
57 | 
58 | \item The place of departure and place of arrival determine the distance travelled by a flight, and therefore affect the fare. Since the distances are not directly available, I compute 21 features (there are 7 places and therefore ${}^{7}C_{2} = 21$ possible tuples, ignoring the order), each binary and indicating whether a flight is between the pair of places denoted by the respective tuple/feature.
59 | \end{enumerate}
60 | 
61 | \end{qsection}
62 | 
63 | \begin{qsection}{Model}
64 | Using these features, I compute the feature set and train a regressor against the actual fares paid by the customers.
65 | 
66 | On the training data, decision trees worked best but seemed to overfit heavily (training $R^2 \approx 0.99$, test $R^2 \approx 0.60$). Based on this, I decided to use neural networks (I did not try linear or polynomial regression).
67 | 
68 | Another observation was that performance improves if two separate networks are trained, one per ticket class. That is, the tickets/customers are divided into two classes, one for economy and another for business, and for these two sets of features and labels, separate and independent networks are trained. This gives a marginal improvement over a single network. The network sizes that seemed to work best were (50, 20, 5) for both networks (multiple combinations were tried).
69 | \end{qsection}
70 | 
71 | \begin{qsection}{Code}
72 | 
73 | The code is divided into three parts:
74 | 
75 | \begin{qsubsection}{Reading the Data}
76 | 
77 | This simply involves reading the data from the URLs for the training and
78 | testing data, and splitting each line into its fields.
79 | 
80 | This is handled by the function
81 | read\_data ( string url, string split\_string [default: ','], boolean test [default: False] )
82 | \end{qsubsection}
83 | 
84 | \begin{qsubsection}{Processing Features}
85 | 
86 | This step computes the feature set, based on the features described
87 | in the previous section.
88 | 
89 | This is handled by the function
90 | process\_features( 2D-array X, dictionary tuple\_cities )
91 | 
92 | Note: The tuple\_cities dictionary is computed beforehand; it assigns
93 | an index to each tuple (of departure and arrival cities) and stores the same index
94 | for both orderings of the cities in a tuple.
95 | \end{qsubsection}
96 | 
97 | \begin{qsubsection}{Regressing}
98 | 
99 | This step fits the training data, predicts the test labels, and puts the
100 | predicted output into the clipboard (for my ease of testing).
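
As an illustration, the sketch below shows the two-network regression step with scikit-learn. It assumes the processed feature matrices X\_tr and X\_ts and the fare vector Y\_tr from the previous steps, with the economy/business flag stored in the first feature column; the layer sizes follow the submitted script, and the helper name fit\_and\_predict is only for illustration.

\begin{verbatim}
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_and_predict(X_tr, Y_tr, X_ts):
    # Train one network per ticket class (column 0 is the economy/business
    # flag) and predict each test fare with the matching network.
    Y_ts = np.zeros(len(X_ts))
    for label in (0, 1):  # 0 = economy, 1 = business
        tr_mask = X_tr[:, 0] == label
        ts_mask = X_ts[:, 0] == label
        reg = MLPRegressor(hidden_layer_sizes=(10, 20, 5),
                           max_iter=10000, tol=1e-5)
        reg.fit(X_tr[tr_mask][:, 1:], Y_tr[tr_mask])
        Y_ts[ts_mask] = reg.predict(X_ts[ts_mask][:, 1:])
    return Y_ts
\end{verbatim}
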
101 | \end{qsubsection}
102 | 
103 | \begin{qsubsection}{Required Libraries}
104 | 
105 | For running the code, one needs to have the following Python libraries installed
106 | \begin{enumerate}
107 | \item numpy
108 | \item datetime
109 | \item scikit-learn
110 | \item pyperclip (for copying to clipboard)
111 | \item urllib2
112 | \end{enumerate}
113 | 
114 | \end{qsubsection}
115 | 
116 | \end{qsection}
117 | 
118 | \end{question}
119 | 
120 | \begin{question}
121 | 
122 | We are given a continuous random process $X$. In order to model this process, we break the continuous time interval into discrete time steps, as given in the hint. The number of steps, again as suggested by the hint, is taken to be 100. Therefore, the state of the random process $X$ is computed at 100 discrete points, uniformly spread over the interval $\brac{0, T}$ where $T = 2$.
123 | 
124 | Using the aforementioned method, we can obtain one estimate of $X_T$. However, to compute the expectations, we use the Monte Carlo method: we sample multiple values of $X_T$ and average over them to compute the expectations required by the question.
125 | 
126 | To do this, 10000 estimates (again, as suggested in the hints) of $X_T$ are generated using the process mentioned earlier, and these estimates are used to compute Monte Carlo estimates of the expectations as follows
127 | \begin{enumerate}
128 | \item $\E{X_T} = \frac{1}{10000} \sum_{i = 1}^{10000} X_T^{(i)}$
129 | \item $\E{\max\set{X_T - 100, 0}} = \frac{1}{10000} \sum_{i = 1}^{10000} \max\set{X_T^{(i)} - 100, 0}$
130 | \end{enumerate}
131 | 
132 | For the third part, Monte Carlo alone does not give a solution, as a derivative is involved. In order to compute the partial derivative, I use the central-difference technique: I compute the expectation (the term inside the partial derivative) at $\sigma + d\sigma$ and at $\sigma - d\sigma$, and approximate the derivative as
133 | \[ \derp{\E{\max\set{X_T - 100, 0}}}{\sigma} \approx \frac{\E{\max\set{X_T(\sigma + d\sigma) - 100, 0}} - \E{\max\set{X_T(\sigma - d\sigma) - 100, 0}}}{2\ d\sigma} \]
134 | 
135 | Using these techniques, I have estimated all three quantities required by the question.
136 | 
137 | \begin{qsection}{Code}
138 | 
139 | For each path, I compute three estimates of $X_T$, corresponding to $\sigma$, $\sigma + d\sigma$ and $\sigma - d\sigma$. All three estimates are computed using the same samples of $dW$ in order to save time (the expectations remain unbiased, and therefore still provide good estimates of the required quantities).
140 | 
141 | The random normal sampling is done using NumPy's random sampling function, and all the random values required for the 100 steps of a path are drawn at once, to reduce the number of calls to NumPy and therefore save time.
142 | 
143 | Using these 100 random samples, $dX$ is computed at each time step, and $X$ is updated as $X = X + dX$, where $X$ is the current state of the given random process. After 100 steps, we obtain $X_T = X$ as the estimate for the current path. The same procedure is repeated for $\sigma + d\sigma$ and $\sigma - d\sigma$, using the same random normal samples.
144 | 
145 | This is repeated for 10000 paths, and the three values corresponding to $\sigma$, $\sigma + d\sigma$ and $\sigma - d\sigma$ are saved in an array.
146 | 
147 | Post sampling, the 10000 samples are used in the equations given previously to compute the expectations and the derivative.
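
For concreteness, a condensed, vectorized sketch of this procedure is shown below (Python~3 with NumPy). The parameter values are those given in the question and used in the notebook; the clipping of the state at zero inside the square root is a small guard added here so that the scheme stays defined, and the function name simulate is only for illustration.

\begin{verbatim}
import numpy as np

a, theta, X0, sigma, T = 1.0, 150.0, 150.0, 5.0, 2.0
n_steps, n_paths, dsigma = 100, 10000, 0.001
dt = T / n_steps

def simulate(sig, Z):
    # Euler steps for all paths at once; Z holds one row of normals per path
    x = np.full(n_paths, X0)
    for i in range(n_steps):
        x = x + a * (theta - x) * dt \
              + sig * np.sqrt(np.maximum(x, 0.0) * dt) * Z[:, i]
    return x

# Common random numbers for the three sigma values, as described above
Z = np.random.normal(size=(n_paths, n_steps))
x_mid, x_plus, x_mins = (simulate(s, Z)
                         for s in (sigma, sigma + dsigma, sigma - dsigma))

print(np.mean(x_mid))                       # E[X_T]
print(np.mean(np.maximum(x_mid - 100, 0)))  # E[max(X_T - 100, 0)]
print((np.mean(np.maximum(x_plus - 100, 0))
       - np.mean(np.maximum(x_mins - 100, 0))) / (2 * dsigma))  # d/dsigma
\end{verbatim}
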
148 | 
149 | \end{qsection}
150 | 
151 | \end{question}
152 | 
153 | \begin{question}
154 | 
155 | This question is similar to the second question in terms of discretizing the time interval and using Monte Carlo to compute expectations. However, in this case, we need to use the historical data in order to compute the parameters of the random processes. These parameters are $\sigma_X$, $\sigma_Y$ and $\rho$.
156 | 
157 | As pointed out by the question, $\sigma_X$ and $\sigma_Y$ are computed from the log differences between consecutive time steps. They are estimated using the spot data for $X$ and $Y$, that is, by using the spot samples to compute an empirical estimate of $\sigma_X$ and $\sigma_Y$. The quantity $\rho$ is computed similarly, by treating the spot samples as tuples of $X$ and $Y$.
158 | 
159 | Similar to the second question, we are given two continuous random processes $X$ and $Y$. In order to model these processes, we break the continuous time interval into discrete time steps, as given in the hint. The number of steps, again as suggested by the hint, is taken to be 252. Therefore, the states of the random processes $X$ and $Y$ are computed at 252 discrete points, uniformly spread over the interval $\brac{0, T}$ where, in this case, $T = 252 \cdot N$.
160 | 
161 | Using the aforementioned method, we can obtain one estimate each of $X_T$ and $Y_T$. However, to compute the expectation, we use the Monte Carlo method, and therefore sample multiple values of $X_T$ and $Y_T$ and average over them to compute the expectation required by the question.
162 | 
163 | To do this, 10000 estimates (again, as suggested in the hints) of $X_T$ and $Y_T$ are generated using the process mentioned earlier, and these estimates are used to compute a Monte Carlo estimate of the payoff as follows
164 | \begin{enumerate}
165 | \item $\E{\max\set{\frac{1}{2} \frac{X_T}{X_0} + \frac{1}{2} \frac{Y_T}{Y_0} - 1, 0}} = \frac{1}{10000} \sum_{i = 1}^{10000} \is{accept(i)} \max\set{\frac{1}{2}\frac{X_T^{(i)}}{X_0} + \frac{1}{2} \frac{Y_T^{(i)}}{Y_0} - 1, 0}$
166 | \end{enumerate}
167 | 
168 | The function accept(i) is used to enforce the constraint mentioned in the question, that for all $k = 0 \cdots N$, $X_t \in \brac{0.75 X_0, 1.25 X_0}$ and $Y_t \in \brac{0.75 Y_0, 1.25 Y_0}$, where $t = t_0 + 0.25 \cdot k \cdot 252$ days.
169 | 
170 | \begin{qsection}{Code}
171 | 
172 | Using the historical data, the values of $\sigma_X$, $\sigma_Y$ and $\rho$ are computed first, using the formulas provided in the question.
173 | 
174 | For each path, estimates of $X_T$ and $Y_T$ are computed. These estimates are computed using samples of $\varepsilon$, which is distributed according to a bivariate normal distribution with correlation $\rho$. The random sampling is done using NumPy's multivariate\_normal function, and all the random values required for the 252 steps of a path are drawn at once, to reduce the number of calls to NumPy and therefore save time.
175 | 
176 | Using these 252 random samples, $dX$ and $dY$ are computed at each time step, and $X$ is updated as $X = X + dX$, where $X$ is the current state of the given random process (likewise for $Y$). After 252 steps, we obtain $X_T = X$ and $Y_T = Y$ as the estimates for the current path.
177 | 
178 | This is then repeated for 10000 paths, and all paths that satisfy the acceptance criteria are saved in an array. Post sampling, these samples are used in the equation given previously to compute the expected payoff.
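
The sketch below condenses this simulation loop (Python~3 syntax). It follows the submitted notebook, including its choice of $dt = N$; the quarterly barrier checks are written here as integer step indices (one reading of the constraint), rejected paths contribute a zero payoff as in the expectation above, and the function name estimate\_payoff is only for illustration.

\begin{verbatim}
import numpy as np

def estimate_payoff(spot_x, spot_y, sigma_x, sigma_y, rho,
                    r=0.01, N=4, n_steps=252, n_paths=10000):
    dt = N                              # step size, as in the notebook
    rdt = r * dt
    sdt_x = sigma_x * np.sqrt(dt)
    sdt_y = sigma_y * np.sqrt(dt)
    cov = [[1.0, rho], [rho, 1.0]]
    # Indices of the quarterly barrier checks (end of each quarter)
    check_at = {int(round(0.25 * k * n_steps)) - 1 for k in range(1, N + 1)}
    payoffs = []
    for _ in range(n_paths):
        x, y = spot_x, spot_y
        dW = np.random.multivariate_normal([0.0, 0.0], cov, n_steps)
        alive = True
        for i in range(n_steps):
            x += x * rdt + sdt_x * dW[i, 0]
            y += y * rdt + sdt_y * dW[i, 1]
            if i in check_at and not (0.75 * spot_x <= x <= 1.25 * spot_x and
                                      0.75 * spot_y <= y <= 1.25 * spot_y):
                alive = False
                break
        payoffs.append(max(0.5 * x / spot_x + 0.5 * y / spot_y - 1.0, 0.0)
                       if alive else 0.0)
    return np.mean(payoffs)
\end{verbatim}
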
179 | 
180 | \end{qsection}
181 | 
182 | \end{question}
183 | 
184 | \end{document}
185 | 
--------------------------------------------------------------------------------