├── DRL_GBM.ipynb
├── LICENSE
├── README.md
├── ddpg_agent.py
├── main.py
├── model.py
└── report.pdf
/DRL_GBM.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "# Deep Reinforcement Learning for Optimal Execution of Portfolio Transactions "
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "# Introduction\n",
32 | "\n",
33 | "This notebook demonstrates how to use Deep Reinforcement Learning (DRL) for optimizing the execution of large portfolio transactions. We begin with a brief review of reinforcement learning and actor-critic methods. Then, you will use an actor-critic method to generate optimal trading strategies that maximize profit when liquidating a block of shares. \n",
34 | "\n",
35 | "# Actor-Critic Methods\n",
36 | "\n",
37 | "In reinforcement learning, an agent makes observations and takes actions within an environment, and in return it receives rewards. Its objective is to learn to act in a way that will maximize its expected long-term rewards. \n",
38 | "\n",
42 | "Fig 1. - Reinforcement Learning.\n",
45 | "\n",
46 | "There are several types of RL algorithms, and they can be divided into three groups:\n",
47 | "\n",
48 | "- **Critic-Only**: Critic-Only methods, also known as Value-Based methods, first find the optimal value function and then derive an optimal policy from it. \n",
49 | "\n",
50 | "\n",
51 | "- **Actor-Only**: Actor-Only methods, also known as Policy-Based methods, search directly for the optimal policy in policy space. This is typically done by using a parameterized family of policies over which optimization procedures can be used directly. \n",
52 | "\n",
53 | "\n",
54 | "- **Actor-Critic**: Actor-Critic methods combine the advantages of actor-only and critic-only methods. In this method, the critic learns the value function and uses it to determine how the actor's policy parameters should be changed. In this case, the actor brings the advantage of computing continuous actions without the need for optimization procedures on a value function, while the critic supplies the actor with knowledge of the performance. Actor-critic methods usually have good convergence properties, in contrast to critic-only methods. The **Deep Deterministic Policy Gradients (DDPG)** algorithm is one example of an actor-critic method.\n",
55 | "\n",
59 | "Fig 2. - Actor-Critic Reinforcement Learning.\n",
62 | "\n",
63 | "In this notebook, we will use DDPG to determine the optimal execution of portfolio transactions. In other words, we will use the DDPG algorithm to solve the optimal liquidation problem. But before we can apply the DDPG algorithm, we first need to formulate the optimal liquidation problem so that it can be solved using reinforcement learning. In the next section we will see how to do this. "
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "# Modeling Optimal Execution as a Reinforcement Learning Problem\n",
71 | "\n",
72 | "As we learned in the previous lessons, the optimal liquidation problem is a minimization problem, *i.e.* we need to find the trading list that minimizes the implementation shortfall. In order to solve this problem through reinforcement learning, we need to restate the optimal liquidation problem in terms of **States**, **Actions**, and **Rewards**. Let's start by defining our States.\n",
73 | "\n",
74 | "### States\n",
75 | "\n",
76 | "The optimal liquidation problem entails that we sell all our shares within a given time frame. Therefore, our state vector must contain some information about the time remaining or, equivalently, the number of trades remaining. We will use the latter and use the following features to define the state vector at time $t_k$:\n",
77 | "\n",
78 | "\n",
79 | "$$\n",
80 | "[r_{k-5},\\, r_{k-4},\\, r_{k-3},\\, r_{k-2},\\, r_{k-1},\\, r_{k},\\, m_{k},\\, i_{k}]\n",
81 | "$$\n",
82 | "\n",
83 | "where:\n",
84 | "\n",
85 | "- $r_{k} = \\log\\left(\\frac{\\tilde{S}_k}{\\tilde{S}_{k-1}}\\right)$ is the log-return at time $t_k$\n",
86 | "\n",
87 | "\n",
88 | "- $m_{k} = \\frac{N_k}{N}$ is the number of trades remaining at time $t_k$ normalized by the total number of trades.\n",
89 | "\n",
90 | "\n",
91 | "- $i_{k} = \\frac{x_k}{X}$ is the remaining number of shares at time $t_k$ normalized by the total number of shares.\n",
92 | "\n",
93 | "The log-returns capture information about stock prices before time $t_k$, which can be used to detect possible price trends. The number of trades and shares remaining allow the agent to learn to sell all the shares within a given time frame. It is important to note that in real world trading scenarios, this state vector can hold many more variables. \n",
94 | "\n",
95 | "### Actions\n",
96 | "\n",
97 | "Since the optimal liquidation problem only requires us to sell stocks, it is reasonable to define the action $a_k$ to be the number of shares to sell at time $t_{k}$. However, if we start with millions of shares, interpreting the action directly as the number of shares to sell at each time step can lead to convergence problems, because the agent will need to produce actions with very high values. Instead, we will interpret the action $a_k$ as a **percentage** of the remaining shares. In this case, the actions produced by the agent will only need to be between 0 and 1. Using this interpretation, we can determine the number of shares to sell at each time step using:\n",
98 | "\n",
99 | "$$\n",
100 | "n_k = a_k \\times x_k\n",
101 | "$$\n",
102 | "\n",
103 | "where $x_k$ is the number of shares remaining at time $t_k$.\n",
104 | "\n",
105 | "### Rewards\n",
106 | "\n",
107 | "Defining the rewards is trickier than defining states and actions, since the original problem is a minimization problem. One option is to use the difference between two consecutive utility functions. Remember that the utility function is given by:\n",
108 | "\n",
109 | "$$\n",
110 | "U(x) = E(x) + λ V(x)\n",
111 | "$$\n",
112 | "\n",
113 | "After each time step, we compute the utility using the equations for $E(x)$ and $V(x)$ from the Almgren and Chriss model for the remaining time and inventory while holding parameter λ constant. Denoting the optimal trading trajectory computed at time $t$ as $x^*_t$, we define the reward as: \n",
114 | "\n",
115 | "$$\n",
116 | "R_{t} = {{U_t(x^*_t) - U_{t+1}(x^*_{t+1})}\\over{U_t(x^*_t)}}\n",
117 | "$$\n",
118 | "\n",
119 | "where we have normalized the difference to make the actor-critic model easier to train."
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "# Simulation Environment\n",
127 | "\n",
128 | "In order to train our DDPG algorithm we will use a very simple simulated trading environment. This environment simulates stock prices that follow a discrete arithmetic random walk, and it assumes that the permanent and temporary market impact functions are linear functions of the rate of trading, just like in the Almgren and Chriss model. This simple trading environment serves as a starting point to create more complex trading environments. You are encouraged to extend this simple trading environment by adding more complexity to simulate real world trading dynamics, such as order books, network latencies, trading fees, etc. \n",
129 | "\n",
130 | "The simulated environment is contained in the **syntheticChrissAlmgren.py** module. You are encouraged to take a look at it and modify its parameters as you wish. Let's take a look at the default parameters of our simulation environment. We have set the initial stock price to be $S_0 = 50$, and the total number of shares to sell to one million. This gives an initial portfolio value of $\\$50$ million. We have also set the trader's risk aversion to $\\lambda = 10^{-6}$.\n",
131 | "\n",
132 | "The stock price will have 12\\% annual volatility, a [bid-ask spread](https://www.investopedia.com/terms/b/bid-askspread.asp) of 1/8 and an average daily trading volume of 5 million shares. Assuming there are 250 trading days in a year, this gives a daily volatility in stock price of $0.12 / \\sqrt{250} \\approx 0.8\\%$. We will use a liquidation time of $T = 60$ days and we will set the number of trades $N = 60$. This gives $\\tau=\\frac{T}{N} = 1$, which means we will be making one trade per day. \n",
133 | "\n",
134 | "For the temporary cost function we will set the fixed cost of selling to be 1/2 of the bid-ask spread, $\\epsilon = 1/16$. We will set $\\eta$ such that for each one percent of the daily volume we trade, we incur a price impact equal to the bid-ask
135 | "spread. For example, trading at a rate of $5\\%$ of the daily trading volume incurs a one-time cost on each trade of 5/8. Under this assumption we have $\\eta =(1/8)/(0.01 \\times 5 \\times 10^6) = 2.5 \\times 10^{-6}$.\n",
136 | "\n",
137 | "For the permanent costs, a common rule of thumb is that price effects become significant when we sell $10\\%$ of the daily volume. If we suppose that significant means that the price depression is one bid-ask spread, and that the effect is linear for smaller and larger trading rates, then we have $\\gamma = (1/8)/(0.1 \\times 5 \\times 10^6) = 2.5 \\times 10^{-7}$. \n",
138 | "\n",
139 | "The tables below summarize the default parameters of the simulation environment"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 1,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "import utils\n",
149 | "\n",
150 | "# Get the default financial and AC Model parameters\n",
151 | "financial_params, ac_params = utils.get_env_param()"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 2,
157 | "metadata": {},
158 | "outputs": [
159 | {
160 | "data": {
161 | "text/html": [
162 | "\n",
163 | "Financial Parameters \n",
164 | "\n",
165 | " Annual Volatility: 12% Bid-Ask Spread: 0.125 \n",
166 | " \n",
167 | "\n",
168 | " Daily Volatility: 0.8% Daily Trading Volume: 5,000,000 \n",
169 | " \n",
170 | "
"
171 | ],
172 | "text/plain": [
173 | ""
174 | ]
175 | },
176 | "execution_count": 2,
177 | "metadata": {},
178 | "output_type": "execute_result"
179 | }
180 | ],
181 | "source": [
182 | "financial_params"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 3,
188 | "metadata": {},
189 | "outputs": [
190 | {
191 | "data": {
192 | "text/html": [
193 | "\n",
194 | "Almgren and Chriss Model Parameters \n",
195 | "\n",
196 | " Total Number of Shares to Sell: 1,000,000 Fixed Cost of Selling per Share: $0.062 \n",
197 | " \n",
198 | "\n",
199 | " Starting Price per Share: $50.00 Trader's Risk Aversion: 1e-06 \n",
200 | " \n",
201 | "\n",
202 | " Price Impact for Each 1% of Daily Volume Traded: $2.5e-06 Permanent Impact Constant: 2.5e-07 \n",
203 | " \n",
204 | "\n",
205 | " Number of Days to Sell All the Shares: 60 Single Step Variance: 0.144 \n",
206 | " \n",
207 | "\n",
208 | " Number of Trades: 60 Time Interval between trades: 1.0 \n",
209 | " \n",
210 | "
"
211 | ],
212 | "text/plain": [
213 | ""
214 | ]
215 | },
216 | "execution_count": 3,
217 | "metadata": {},
218 | "output_type": "execute_result"
219 | }
220 | ],
221 | "source": [
222 | "ac_params"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 8,
228 | "metadata": {},
229 | "outputs": [],
230 | "source": [
231 | "import importlib\n",
232 | "import syntheticChrissAlmgren as sca"
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 14,
238 | "metadata": {},
239 | "outputs": [
240 | {
241 | "data": {
242 | "text/plain": [
243 | ""
244 | ]
245 | },
246 | "execution_count": 14,
247 | "metadata": {},
248 | "output_type": "execute_result"
249 | }
250 | ],
251 | "source": [
252 | "importlib.reload(sca)"
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": null,
258 | "metadata": {},
259 | "outputs": [],
260 | "source": []
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": 15,
265 | "metadata": {},
266 | "outputs": [
267 | {
268 | "name": "stdout",
269 | "output_type": "stream",
270 | "text": [
271 | "Average Stock Price: $54.48\n",
272 | "Standard Deviation in Stock Price: $6.42\n"
273 | ]
274 | },
275 | {
276 | "data": {
277 | "image/png": "\n",
278 | "text/plain": [
279 | ""
280 | ]
281 | },
282 | "metadata": {
283 | "needs_background": "light"
284 | },
285 | "output_type": "display_data"
286 | }
287 | ],
288 | "source": [
289 | "utils.plot_price_model()"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": 16,
295 | "metadata": {},
296 | "outputs": [
297 | {
298 | "data": {
299 | "text/html": [
300 | "\n",
301 | "AC Optimal Strategy \n",
302 | "\n",
303 | " Number of Days to Sell All the Shares: 60 Initial Portfolio Value: $50,000,000.00 \n",
304 | " \n",
305 | "\n",
306 | " Half-Life of The Trade: 4.1 Expected Shortfall: $477,712.60 \n",
307 | " \n",
308 | "\n",
309 | " Utility: $704,723.22 Standard Deviation of Shortfall: $476,456.32 \n",
310 | " \n",
311 | "
"
312 | ],
313 | "text/plain": [
314 | ""
315 | ]
316 | },
317 | "execution_count": 16,
318 | "metadata": {},
319 | "output_type": "execute_result"
320 | }
321 | ],
322 | "source": [
323 | "utils.get_optimal_vals()"
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": null,
329 | "metadata": {},
330 | "outputs": [],
331 | "source": []
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": null,
336 | "metadata": {},
337 | "outputs": [],
338 | "source": []
339 | },
340 | {
341 | "cell_type": "markdown",
342 | "metadata": {},
343 | "source": [
344 | "# Reinforcement Learning\n",
345 | "\n",
346 | "In the code below we use DDPG to find a policy that can generate optimal trading trajectories that minimize implementation shortfall, and can be benchmarked against the Almgren and Chriss model. We will implement a typical reinforcement learning workflow to train the actor and critic using the simulation environment. We feed the states observed from our simulator to an agent. The agent first predicts an action using the actor model and performs the action in the environment. Then, the environment returns the reward and the new state. This process continues for the given number of episodes. To get accurate results, you should run the code for at least 10,000 episodes."
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "metadata": {},
352 | "source": [
353 | "# Todo\n",
354 | "\n",
355 | "The code in this notebook should provide you with a starting framework for incorporating more complex dynamics into our model. Here are a few things you can try out:\n",
356 | "\n",
357 | "- Incorporate your own reward function in the simulation environment to see if you can achieve an expected shortfall that is better (lower) than that produced by the Almgren and Chriss model.\n",
358 | "\n",
359 | "\n",
360 | "- Experiment with rewarding the agent at every step versus only giving a reward at the end.\n",
361 | "\n",
362 | "\n",
363 | "- Use more realistic price dynamics, such as geometric Brownian motion (GBM). The equations used to model GBM can be found in section 3b of this [paper](https://ro.uow.edu.au/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1705&context=aabfj)\n",
364 | "\n",
365 | "\n",
366 | "- Try different functions for the action. You can change the values of the actions produced by the agent by using different functions. You can choose your function depending on the interpretation you give to the action. For example, you could set the action to be a function of the trading rate.\n",
367 | "\n",
368 | "\n",
369 | "- Add more complex dynamics to the environment. Try incorporating trading fees, for example. This can be done by adding an extra term to the fixed cost of selling, $\\epsilon$."
370 | ]
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": 17,
375 | "metadata": {
376 | "scrolled": false
377 | },
378 | "outputs": [
379 | {
380 | "name": "stdout",
381 | "output_type": "stream",
382 | "text": [
383 | "Episode [100/10000]\tAverage Shortfall: $1,955,559.82\n",
384 | "Episode [200/10000]\tAverage Shortfall: $2,562,500.00\n",
385 | "Episode [300/10000]\tAverage Shortfall: $2,562,500.00\n",
386 | "Episode [400/10000]\tAverage Shortfall: $2,562,500.00\n",
387 | "Episode [500/10000]\tAverage Shortfall: $2,562,500.00\n",
388 | "Episode [600/10000]\tAverage Shortfall: $2,562,500.00\n",
389 | "Episode [700/10000]\tAverage Shortfall: $2,562,500.00\n",
390 | "Episode [800/10000]\tAverage Shortfall: $2,562,500.00\n",
391 | "Episode [900/10000]\tAverage Shortfall: $2,562,500.00\n",
392 | "Episode [1000/10000]\tAverage Shortfall: $2,562,500.00\n",
393 | "Episode [1100/10000]\tAverage Shortfall: $2,562,500.00\n",
394 | "Episode [1200/10000]\tAverage Shortfall: $2,562,500.00\n",
395 | "Episode [1300/10000]\tAverage Shortfall: $2,562,500.00\n",
396 | "Episode [1400/10000]\tAverage Shortfall: $2,562,500.00\n",
397 | "Episode [1500/10000]\tAverage Shortfall: $2,562,500.00\n",
398 | "Episode [1600/10000]\tAverage Shortfall: $2,562,500.00\n",
399 | "Episode [1700/10000]\tAverage Shortfall: $2,562,500.00\n",
400 | "Episode [1800/10000]\tAverage Shortfall: $2,562,500.00\n",
401 | "Episode [1900/10000]\tAverage Shortfall: $2,562,500.00\n",
402 | "Episode [2000/10000]\tAverage Shortfall: $2,365,766.34\n",
403 | "Episode [2100/10000]\tAverage Shortfall: $-4,961,069.15\n",
404 | "Episode [2200/10000]\tAverage Shortfall: $-9,865,626.73\n",
405 | "Episode [2300/10000]\tAverage Shortfall: $-10,111,475.89\n",
406 | "Episode [2400/10000]\tAverage Shortfall: $-9,667,788.71\n",
407 | "Episode [2500/10000]\tAverage Shortfall: $-12,157,641.08\n",
408 | "Episode [2600/10000]\tAverage Shortfall: $-9,094,492.66\n",
409 | "Episode [2700/10000]\tAverage Shortfall: $-9,853,531.99\n",
410 | "Episode [2800/10000]\tAverage Shortfall: $-1,816,561.93\n",
411 | "Episode [2900/10000]\tAverage Shortfall: $-10,270,762.94\n",
412 | "Episode [3000/10000]\tAverage Shortfall: $-10,202,420.94\n",
413 | "Episode [3100/10000]\tAverage Shortfall: $-10,528,299.52\n",
414 | "Episode [3200/10000]\tAverage Shortfall: $-10,470,899.05\n",
415 | "Episode [3300/10000]\tAverage Shortfall: $-9,406,544.15\n",
416 | "Episode [3400/10000]\tAverage Shortfall: $-8,838,479.19\n",
417 | "Episode [3500/10000]\tAverage Shortfall: $-11,441,167.07\n",
418 | "Episode [3600/10000]\tAverage Shortfall: $-9,947,253.26\n",
419 | "Episode [3700/10000]\tAverage Shortfall: $-10,928,396.01\n",
420 | "Episode [3800/10000]\tAverage Shortfall: $-11,946,627.98\n",
421 | "Episode [3900/10000]\tAverage Shortfall: $-6,900,682.33\n",
422 | "Episode [4000/10000]\tAverage Shortfall: $-6,695.71\n",
423 | "Episode [4100/10000]\tAverage Shortfall: $-8,904,595.61\n",
424 | "Episode [4200/10000]\tAverage Shortfall: $-9,659,425.17\n",
425 | "Episode [4300/10000]\tAverage Shortfall: $-10,491,365.42\n",
426 | "Episode [4400/10000]\tAverage Shortfall: $-9,667,934.38\n",
427 | "Episode [4500/10000]\tAverage Shortfall: $-9,290,083.81\n",
428 | "Episode [4600/10000]\tAverage Shortfall: $-9,772,319.45\n",
429 | "Episode [4700/10000]\tAverage Shortfall: $-9,741,509.38\n",
430 | "Episode [4800/10000]\tAverage Shortfall: $-9,893,239.53\n",
431 | "Episode [4900/10000]\tAverage Shortfall: $-9,595,988.92\n",
432 | "Episode [5000/10000]\tAverage Shortfall: $-12,501,569.45\n",
433 | "Episode [5100/10000]\tAverage Shortfall: $-11,811,195.05\n",
434 | "Episode [5200/10000]\tAverage Shortfall: $-10,138,894.46\n",
435 | "Episode [5300/10000]\tAverage Shortfall: $-10,504,233.09\n",
436 | "Episode [5400/10000]\tAverage Shortfall: $-5,038,707.51\n",
437 | "Episode [5500/10000]\tAverage Shortfall: $-8,936,829.38\n",
438 | "Episode [5600/10000]\tAverage Shortfall: $-10,781,792.26\n",
439 | "Episode [5700/10000]\tAverage Shortfall: $-8,302,488.86\n",
440 | "Episode [5800/10000]\tAverage Shortfall: $-9,594,546.28\n",
441 | "Episode [5900/10000]\tAverage Shortfall: $-9,814,352.78\n",
442 | "Episode [6000/10000]\tAverage Shortfall: $-10,684,048.74\n",
443 | "Episode [6100/10000]\tAverage Shortfall: $-10,373,680.60\n",
444 | "Episode [6200/10000]\tAverage Shortfall: $-10,625,956.59\n",
445 | "Episode [6300/10000]\tAverage Shortfall: $-11,631,723.77\n",
446 | "Episode [6400/10000]\tAverage Shortfall: $-10,585,561.44\n",
447 | "Episode [6500/10000]\tAverage Shortfall: $-9,381,861.37\n",
448 | "Episode [6600/10000]\tAverage Shortfall: $-11,218,867.50\n",
449 | "Episode [6700/10000]\tAverage Shortfall: $-7,701,202.56\n",
450 | "Episode [6800/10000]\tAverage Shortfall: $-10,332,167.94\n",
451 | "Episode [6900/10000]\tAverage Shortfall: $-9,952,372.31\n",
452 | "Episode [7000/10000]\tAverage Shortfall: $-10,887,028.27\n",
453 | "Episode [7100/10000]\tAverage Shortfall: $-9,948,462.96\n",
454 | "Episode [7200/10000]\tAverage Shortfall: $-10,970,077.82\n",
455 | "Episode [7300/10000]\tAverage Shortfall: $-9,718,404.47\n",
456 | "Episode [7400/10000]\tAverage Shortfall: $-8,927,763.57\n",
457 | "Episode [7500/10000]\tAverage Shortfall: $-9,266,710.71\n",
458 | "Episode [7600/10000]\tAverage Shortfall: $-10,700,665.61\n",
459 | "Episode [7700/10000]\tAverage Shortfall: $-10,407,278.17\n",
460 | "Episode [7800/10000]\tAverage Shortfall: $-10,812,626.73\n",
461 | "Episode [7900/10000]\tAverage Shortfall: $-9,742,862.11\n",
462 | "Episode [8000/10000]\tAverage Shortfall: $-8,727,273.28\n",
463 | "Episode [8100/10000]\tAverage Shortfall: $-11,945,952.88\n",
464 | "Episode [8200/10000]\tAverage Shortfall: $-9,305,444.17\n",
465 | "Episode [8300/10000]\tAverage Shortfall: $-9,569,747.92\n",
466 | "Episode [8400/10000]\tAverage Shortfall: $-11,008,021.52\n",
467 | "Episode [8500/10000]\tAverage Shortfall: $-10,176,711.27\n",
468 | "Episode [8600/10000]\tAverage Shortfall: $-8,852,801.73\n",
469 | "Episode [8700/10000]\tAverage Shortfall: $-9,340,764.19\n",
470 | "Episode [8800/10000]\tAverage Shortfall: $-9,508,081.87\n",
471 | "Episode [8900/10000]\tAverage Shortfall: $-9,887,704.43\n",
472 | "Episode [9000/10000]\tAverage Shortfall: $-10,422,108.32\n",
473 | "Episode [9100/10000]\tAverage Shortfall: $-11,640,813.81\n",
474 | "Episode [9200/10000]\tAverage Shortfall: $-8,812,642.36\n",
475 | "Episode [9300/10000]\tAverage Shortfall: $-10,013,637.13\n",
476 | "Episode [9400/10000]\tAverage Shortfall: $-10,609,369.46\n",
477 | "Episode [9500/10000]\tAverage Shortfall: $-9,189,505.15\n",
478 | "Episode [9600/10000]\tAverage Shortfall: $-9,111,533.09\n",
479 | "Episode [9700/10000]\tAverage Shortfall: $-9,875,456.01\n",
480 | "Episode [9800/10000]\tAverage Shortfall: $-10,223,336.47\n",
481 | "Episode [9900/10000]\tAverage Shortfall: $-10,590,809.41\n",
482 | "Episode [10000/10000]\tAverage Shortfall: $-11,169,677.73\n",
483 | "\n",
484 | "Average Implementation Shortfall: $-7,262,618.76 \n",
485 | "\n"
486 | ]
487 | }
488 | ],
489 | "source": [
490 | "import numpy as np\n",
491 | "\n",
492 | "import syntheticChrissAlmgren as sca\n",
493 | "from ddpg_agent import Agent\n",
494 | "from workspace_utils import keep_awake\n",
495 | "from collections import deque\n",
496 | "\n",
497 | "# Create simulation environment\n",
498 | "env = sca.MarketEnvironment()\n",
499 | "\n",
500 | "# Initialize Feed-forward DNNs for Actor and Critic models. \n",
501 | "agent = Agent(state_size=env.observation_space_dimension(), action_size=env.action_space_dimension(), random_seed=0)\n",
502 | "\n",
503 | "# Set the liquidation time\n",
504 | "lqt = 60\n",
505 | "\n",
506 | "# Set the number of trades\n",
507 | "n_trades = 60\n",
508 | "\n",
509 | "# Set trader's risk aversion\n",
510 | "tr = 1e-6\n",
511 | "\n",
512 | "# Set the number of episodes to run the simulation\n",
513 | "episodes = 10000\n",
514 | "\n",
515 | "shortfall_hist = np.array([])\n",
516 | "shortfall_deque = deque(maxlen=100)\n",
517 | "\n",
518 | "\n",
519 | "for episode in keep_awake(range(episodes)): \n",
520 | "    # Reset the environment\n",
521 | "    cur_state = env.reset(seed = episode, liquid_time = lqt, num_trades = n_trades, lamb = tr)\n",
522 | "\n",
523 | "    # set the environment to make transactions\n",
524 | "    env.start_transactions()\n",
525 | "\n",
526 | "    for i in range(n_trades + 1):\n",
527 | "    \n",
528 | "        # Predict the best action for the current state. \n",
529 | "        action = agent.act(cur_state, add_noise = True)\n",
530 | "        \n",
531 | "        # Action is performed and new state, reward, info are received. \n",
532 | "        new_state, reward, done, info = env.step(action)\n",
533 | "        \n",
534 | "        # current state, action, reward, new state are stored in the experience replay\n",
535 | "        agent.step(cur_state, action, reward, new_state, done)\n",
536 | "        \n",
537 | "        # roll over new state\n",
538 | "        cur_state = new_state\n",
539 | "\n",
540 | "        if info.done:\n",
541 | "            shortfall_hist = np.append(shortfall_hist, info.implementation_shortfall)\n",
542 | "            shortfall_deque.append(info.implementation_shortfall)\n",
543 | "            break\n",
544 | "    \n",
545 | "    if (episode + 1) % 100 == 0: # print average shortfall over last 100 episodes\n",
546 | "        print('\\rEpisode [{}/{}]\\tAverage Shortfall: ${:,.2f}'.format(episode + 1, episodes, np.mean(shortfall_deque))) \n",
547 | "\n",
548 | "print('\\nAverage Implementation Shortfall: ${:,.2f} \\n'.format(np.mean(shortfall_hist)))"
549 | ]
550 | },
551 | {
552 | "cell_type": "code",
553 | "execution_count": null,
554 | "metadata": {},
555 | "outputs": [],
556 | "source": []
557 | }
558 | ],
559 | "metadata": {
560 | "kernelspec": {
561 | "display_name": "Python 3",
562 | "language": "python",
563 | "name": "python3"
564 | },
565 | "language_info": {
566 | "codemirror_mode": {
567 | "name": "ipython",
568 | "version": 3
569 | },
570 | "file_extension": ".py",
571 | "mimetype": "text/x-python",
572 | "name": "python",
573 | "nbconvert_exporter": "python",
574 | "pygments_lexer": "ipython3",
575 | "version": "3.6.3"
576 | }
577 | },
578 | "nbformat": 4,
579 | "nbformat_minor": 2
580 | }
581 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Akshay Sathe
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Actor_Critic-Method-for-Financial-Trading
2 |
3 |
4 | ## Implementation of the DDPG algorithm for Automatic Finance Trading
5 |
6 | In this project, I implemented the DDPG algorithm to solve the optimization problem of large portfolio transactions. This is an example of how Deep Reinforcement Learning can be used to solve real-world problems by simulating the problem in the form of an environment. Just as Deep RL algorithms such as DQN, DDPG and Multi-Agent DDPG are used to solve tasks related to robotics, gaming or even autonomous car driving, these algorithms can also be used to solve many supervised learning problems by properly formulating the tasks as an environment.
7 |
8 | The 'environment' is basically a Python class with methods and attributes related to the task. It generates 'States' for an agent. The agent is a class that follows one of the above algorithms. The agent then generates 'Actions' and gets rewarded based on those actions. The important thing is to carefully define the 'State', 'Action' and 'Reward' for the environment based on the problem.
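
The state/action/reward loop described above can be sketched with a toy environment. This is only an illustration under stated assumptions: `ToyLiquidationEnv`, `run_episode`, the placeholder reward and the Gaussian log-returns are hypothetical and are not the classes or the reward used in this repository. The state layout (six log-returns, fraction of trades remaining, fraction of shares remaining) and the action as a fraction of the remaining shares follow the notebook's description.

```python
import numpy as np


class ToyLiquidationEnv:
    """Hypothetical stand-in for a market environment (not the repository's class)."""

    def __init__(self, total_shares=1_000_000, num_trades=60, seed=0):
        self.total_shares = total_shares
        self.num_trades = num_trades
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.shares_left = float(self.total_shares)
        self.trades_left = self.num_trades
        self.log_returns = np.zeros(6)  # r_{k-5}, ..., r_k
        return self._state()

    def _state(self):
        m_k = self.trades_left / self.num_trades    # trades remaining, normalized
        i_k = self.shares_left / self.total_shares  # shares remaining, normalized
        return np.concatenate([self.log_returns, [m_k, i_k]])

    def step(self, action):
        # The action is the fraction of the remaining shares to sell: n_k = a_k * x_k.
        fraction = float(np.clip(action, 0.0, 1.0))
        if self.trades_left == 1:
            fraction = 1.0  # assumption: force full liquidation on the last trade
        sold = fraction * self.shares_left
        self.shares_left -= sold
        self.trades_left -= 1
        # Assumption: placeholder Gaussian log-return with ~0.8% daily volatility.
        new_return = self.rng.normal(0.0, 0.008)
        self.log_returns = np.append(self.log_returns[1:], new_return)
        reward = sold / self.total_shares  # placeholder reward, NOT the utility-based one
        done = self.trades_left == 0 or self.shares_left <= 0
        return self._state(), reward, done


def run_episode(env, policy):
    """Roll out one episode; `policy` maps a state vector to an action in [0, 1]."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward, state
```

For example, `run_episode(ToyLiquidationEnv(), lambda state: 0.1)` sells 10% of the remaining shares at each step. A trained agent would replace the constant policy with the output of its actor network.
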
9 |
10 | In this repository, I will show how this can be done for a finance-related problem.
11 |
12 | Please see the report.pdf file for a full description.
13 |
14 | The basic code for this project was provided by Udacity. I modified it to simulate the stock prices based on a more realistic Geometric Brownian Motion.
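
As a rough illustration of what simulating prices with Geometric Brownian Motion involves, the sketch below uses the standard exact discretization S_{k+1} = S_k exp((mu - sigma^2/2) dt + sigma sqrt(dt) Z_k). The function name `simulate_gbm` and the zero drift `mu` are assumptions made for this example. The starting price, annual volatility, horizon and 250 trading days per year mirror the notebook's defaults. The environment code in this repository may implement the dynamics differently.

```python
import numpy as np


def simulate_gbm(s0=50.0, mu=0.0, sigma=0.12, days=60, steps_per_year=250, seed=0):
    """Simulate one daily GBM price path (hypothetical helper, not the repo's API).

    s0: starting price, mu: annual drift (assumed 0 here), sigma: annual volatility.
    Returns an array of length days + 1 whose first entry is s0.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / steps_per_year
    z = rng.standard_normal(days)
    # Log-price increments of the exact GBM solution over one time step.
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    log_path = np.concatenate([[0.0], np.cumsum(log_increments)])
    return s0 * np.exp(log_path)


prices = simulate_gbm()
```

Unlike an arithmetic random walk, every price on a path generated this way stays strictly positive, because the noise enters through the exponent.
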
15 |
16 |
17 |
--------------------------------------------------------------------------------
/ddpg_agent.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import random
3 | import copy
4 | from collections import namedtuple, deque
5 |
6 | from model import Actor, Critic
7 |
8 | import torch
9 | import torch.nn.functional as F
10 | import torch.optim as optim
11 |
12 | BUFFER_SIZE = int(1e4) # replay buffer size
13 | BATCH_SIZE = 128 # minibatch size
14 | GAMMA = 0.99 # discount factor
15 | TAU = 1e-3 # for soft update of target parameters
16 | LR_ACTOR = 1e-4 # learning rate of the actor
17 | LR_CRITIC = 1e-3 # learning rate of the critic
18 | WEIGHT_DECAY = 0 # L2 weight decay
19 |
20 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
21 |
22 | class Agent():
23 | """Interacts with and learns from the environment."""
24 |
25 | def __init__(self, state_size, action_size, random_seed):
26 | """Initialize an Agent object.
27 |
28 | Params
29 | ======
30 | state_size (int): dimension of each state
31 | action_size (int): dimension of each action
32 | random_seed (int): random seed
33 | """
34 | self.state_size = state_size
35 | self.action_size = action_size
36 | self.seed = random.seed(random_seed)
37 |
38 | # Actor Network (w/ Target Network)
39 | self.actor_local = Actor(state_size, action_size, random_seed).to(device)
40 | self.actor_target = Actor(state_size, action_size, random_seed).to(device)
41 | self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR)
42 |
43 | # Critic Network (w/ Target Network)
44 | self.critic_local = Critic(state_size, action_size, random_seed).to(device)
45 | self.critic_target = Critic(state_size, action_size, random_seed).to(device)
46 | self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)
47 |
48 | # Noise process
49 | self.noise = OUNoise(action_size, random_seed)
50 |
51 | # Replay memory
52 | self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, random_seed)
53 |
54 | def step(self, state, action, reward, next_state, done):
55 | """Save experience in replay memory, and use random sample from buffer to learn."""
56 | # Save experience / reward
57 | self.memory.add(state, action, reward, next_state, done)
58 |
59 | # Learn, if enough samples are available in memory
60 | if len(self.memory) > BATCH_SIZE:
61 | experiences = self.memory.sample()
62 | self.learn(experiences, GAMMA)
63 |
64 | def act(self, state, add_noise=True):
65 | """Returns actions for given state as per current policy."""
66 | state = torch.from_numpy(state).float().to(device)
67 | self.actor_local.eval()
68 | with torch.no_grad():
69 |             action = self.actor_local(state).cpu().numpy()  # output of no_grad() carries no graph, so .data is not needed
70 | self.actor_local.train()
71 | if add_noise:
72 | action += self.noise.sample()
73 | action = (action + 1.0) / 2.0
74 | return np.clip(action, 0, 1)
75 |
76 |
77 | def reset(self):
78 | self.noise.reset()
79 |
80 | def learn(self, experiences, gamma):
81 | """Update policy and value parameters using given batch of experience tuples.
82 | Q_targets = r + γ * critic_target(next_state, actor_target(next_state))
83 | where:
84 | actor_target(state) -> action
85 | critic_target(state, action) -> Q-value
86 |
87 | Params
88 | ======
89 | experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples
90 | gamma (float): discount factor
91 | """
92 | states, actions, rewards, next_states, dones = experiences
93 |
94 | # ---------------------------- update critic ---------------------------- #
95 | # Get predicted next-state actions and Q values from target models
96 | actions_next = self.actor_target(next_states)
97 |         Q_targets_next = self.critic_target(next_states, actions_next).detach()  # treat targets as constants: no gradients flow into the target networks
98 | # Compute Q targets for current states (y_i)
99 | Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
100 | # Compute critic loss
101 | Q_expected = self.critic_local(states, actions)
102 | critic_loss = F.mse_loss(Q_expected, Q_targets)
103 | # Minimize the loss
104 | self.critic_optimizer.zero_grad()
105 | critic_loss.backward()
106 | self.critic_optimizer.step()
107 |
108 | # ---------------------------- update actor ---------------------------- #
109 | # Compute actor loss
110 | actions_pred = self.actor_local(states)
111 | actor_loss = -self.critic_local(states, actions_pred).mean()
112 | # Minimize the loss
113 | self.actor_optimizer.zero_grad()
114 | actor_loss.backward()
115 | self.actor_optimizer.step()
116 |
117 | # ----------------------- update target networks ----------------------- #
118 | self.soft_update(self.critic_local, self.critic_target, TAU)
119 | self.soft_update(self.actor_local, self.actor_target, TAU)
120 |
121 | def soft_update(self, local_model, target_model, tau):
122 | """Soft update model parameters.
123 | θ_target = τ*θ_local + (1 - τ)*θ_target
124 |
125 | Params
126 | ======
127 | local_model: PyTorch model (weights will be copied from)
128 | target_model: PyTorch model (weights will be copied to)
129 | tau (float): interpolation parameter
130 | """
131 | for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
132 | target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)
133 |
134 | class OUNoise:
135 | """Ornstein-Uhlenbeck process."""
136 |
137 | def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
138 | """Initialize parameters and noise process."""
139 | self.mu = mu * np.ones(size)
140 | self.theta = theta
141 | self.sigma = sigma
142 | self.seed = random.seed(seed)
143 | self.reset()
144 |
145 | def reset(self):
146 | """Reset the internal state (= noise) to mean (mu)."""
147 | self.state = copy.copy(self.mu)
148 |
149 | def sample(self):
150 | """Update internal state and return it as a noise sample."""
151 | x = self.state
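152 |         dx = self.theta * (self.mu - x) + self.sigma * np.array([random.gauss(0.0, 1.0) for _ in range(len(x))])  # zero-mean Gaussian increments; random.random() is uniform on [0, 1) and biases the noise upward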
153 | self.state = x + dx
154 | return self.state
155 |
156 | class ReplayBuffer:
157 | """Fixed-size buffer to store experience tuples."""
158 |
159 | def __init__(self, action_size, buffer_size, batch_size, seed):
160 | """Initialize a ReplayBuffer object.
161 | Params
162 | ======
163 | buffer_size (int): maximum size of buffer
164 | batch_size (int): size of each training batch
165 | """
166 | self.action_size = action_size
167 | self.memory = deque(maxlen=buffer_size) # internal memory (deque)
168 | self.batch_size = batch_size
169 | self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
170 | self.seed = random.seed(seed)
171 |
172 | def add(self, state, action, reward, next_state, done):
173 | """Add a new experience to memory."""
174 | e = self.experience(state, action, reward, next_state, done)
175 | self.memory.append(e)
176 |
177 | def sample(self):
178 | """Randomly sample a batch of experiences from memory."""
179 | experiences = random.sample(self.memory, k=self.batch_size)
180 |
181 | states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
182 | actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
183 | rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
184 | next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
185 | dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
186 |
187 | return (states, actions, rewards, next_states, dones)
188 |
189 | def __len__(self):
190 | """Return the current size of internal memory."""
191 | return len(self.memory)
--------------------------------------------------------------------------------
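The `learn` and `soft_update` docstrings in `ddpg_agent.py` above state the two update rules only as formulas. The following is a minimal sketch of that arithmetic in plain Python, without PyTorch. The helper names `td_target` and `soft_update_value` and the values of `gamma` and `tau` are illustrative assumptions, not part of the repository.

```python
def td_target(reward, q_next, done, gamma):
    """One-step target: r + gamma * Q_target(s', a') * (1 - done)."""
    return reward + gamma * q_next * (1.0 - done)


def soft_update_value(local, target, tau):
    """Blend a single parameter: tau * local + (1 - tau) * target."""
    return tau * local + (1.0 - tau) * target


# Non-terminal transition: the discounted next-state value is added to the reward.
print(td_target(reward=1.0, q_next=10.0, done=0.0, gamma=0.5))  # 6.0

# Terminal transition: the (1 - done) mask removes the bootstrap term.
print(td_target(reward=1.0, q_next=10.0, done=1.0, gamma=0.5))  # 1.0

# tau = 0.25 moves the target a quarter of the way toward the local value.
print(soft_update_value(local=1.0, target=0.0, tau=0.25))  # 0.25
```

With a small `tau`, repeated soft updates let the target network trail the local network slowly, which keeps the critic's regression targets from shifting abruptly between learning steps.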
/main.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | import syntheticChrissAlmgren as sca
4 | from ddpg_agent import Agent
5 | from workspace_utils import keep_awake
6 | from collections import deque
7 |
8 | # Create simulation environment
9 | env = sca.MarketEnvironment()
10 |
11 | # Initialize Feed-forward DNNs for Actor and Critic models.
12 | agent = Agent(state_size=env.observation_space_dimension(), action_size=env.action_space_dimension(), random_seed=0)
13 |
14 | # Set the liquidation time
15 | lqt = 60
16 |
17 | # Set the number of trades
18 | n_trades = 60
19 |
20 | # Set trader's risk aversion
21 | tr = 1e-6
22 |
23 | # Set the number of episodes to run the simulation
24 | episodes = 10000
25 |
26 | shortfall_hist = np.array([])
27 | shortfall_deque = deque(maxlen=100)
28 |
29 |
30 | for episode in keep_awake(range(episodes)):
31 |     # Reset the environment
32 | cur_state = env.reset(seed = episode, liquid_time = lqt, num_trades = n_trades, lamb = tr)
33 |
34 | # set the environment to make transactions
35 | env.start_transactions()
36 |
37 | for i in range(n_trades + 1):
38 |
39 | # Predict the best action for the current state.
40 | action = agent.act(cur_state, add_noise = True)
41 |
42 | # Action is performed and new state, reward, info are received.
43 | new_state, reward, done, info = env.step(action)
44 |
45 | # current state, action, reward, new state are stored in the experience replay
46 | agent.step(cur_state, action, reward, new_state, done)
47 |
48 | # roll over new state
49 | cur_state = new_state
50 |
51 | if info.done:
52 | shortfall_hist = np.append(shortfall_hist, info.implementation_shortfall)
53 | shortfall_deque.append(info.implementation_shortfall)
54 | break
55 |
56 | if (episode + 1) % 100 == 0: # print average shortfall over last 100 episodes
57 | print('\rEpisode [{}/{}]\tAverage Shortfall: ${:,.2f}'.format(episode + 1, episodes, np.mean(shortfall_deque)))
58 |
59 | print('\nAverage Implementation Shortfall: ${:,.2f} \n'.format(np.mean(shortfall_hist)))
--------------------------------------------------------------------------------
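The training loop in `main.py` above reports the mean shortfall over the most recent 100 episodes by appending to a `deque(maxlen=100)`. The sketch below shows that rolling-window behavior on a smaller scale. The window of 3 and the shortfall values are made up for illustration.

```python
from collections import deque

window = deque(maxlen=3)  # main.py uses maxlen=100; 3 keeps the example short
averages = []

for shortfall in [10.0, 20.0, 30.0, 40.0]:  # made-up shortfall values
    window.append(shortfall)  # once the deque is full, the oldest entry is dropped
    averages.append(sum(window) / len(window))

print(list(window))  # [20.0, 30.0, 40.0]
print(averages)      # [10.0, 15.0, 20.0, 30.0]
```

Until the window fills, the average covers only the episodes seen so far.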
/model.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | import torch
4 | import torch.nn as nn
5 | import torch.nn.functional as F
6 |
7 | def hidden_init(layer):
8 |     fan_in = layer.weight.data.size()[1]  # nn.Linear weight has shape (out_features, in_features); fan-in is the input dimension
9 | lim = 1. / np.sqrt(fan_in)
10 | return (-lim, lim)
11 |
12 | class Actor(nn.Module):
13 | """Actor (Policy) Model."""
14 |
15 | def __init__(self, state_size, action_size, seed, fc1_units=24, fc2_units=48):
16 | """Initialize parameters and build model.
17 | Params
18 | ======
19 | state_size (int): Dimension of each state
20 | action_size (int): Dimension of each action
21 | seed (int): Random seed
22 | fc1_units (int): Number of nodes in first hidden layer
23 | fc2_units (int): Number of nodes in second hidden layer
24 | """
25 | super(Actor, self).__init__()
26 | self.seed = torch.manual_seed(seed)
27 | self.fc1 = nn.Linear(state_size, fc1_units)
28 | self.fc2 = nn.Linear(fc1_units, fc2_units)
29 | self.fc3 = nn.Linear(fc2_units, action_size)
30 | self.reset_parameters()
31 |
32 | def reset_parameters(self):
33 | self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
34 | self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
35 | self.fc3.weight.data.uniform_(-3e-3, 3e-3)
36 |
37 | def forward(self, state):
38 | """Build an actor (policy) network that maps states -> actions."""
39 | x = F.relu(self.fc1(state))
40 | x = F.relu(self.fc2(x))
41 |         return torch.tanh(self.fc3(x))  # torch.tanh replaces the deprecated F.tanh
42 |
43 |
44 | class Critic(nn.Module):
45 | """Critic (Value) Model."""
46 |
47 | def __init__(self, state_size, action_size, seed, fcs1_units=24, fc2_units=48):
48 | """Initialize parameters and build model.
49 | Params
50 | ======
51 | state_size (int): Dimension of each state
52 | action_size (int): Dimension of each action
53 | seed (int): Random seed
54 | fcs1_units (int): Number of nodes in the first hidden layer
55 | fc2_units (int): Number of nodes in the second hidden layer
56 | """
57 | super(Critic, self).__init__()
58 | self.seed = torch.manual_seed(seed)
59 | self.fcs1 = nn.Linear(state_size, fcs1_units)
60 | self.fc2 = nn.Linear(fcs1_units+action_size, fc2_units)
61 | self.fc3 = nn.Linear(fc2_units, 1)
62 | self.reset_parameters()
63 |
64 | def reset_parameters(self):
65 | self.fcs1.weight.data.uniform_(*hidden_init(self.fcs1))
66 | self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
67 | self.fc3.weight.data.uniform_(-3e-3, 3e-3)
68 |
69 | def forward(self, state, action):
70 | """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
71 | xs = F.relu(self.fcs1(state))
72 | x = torch.cat((xs, action), dim=1)
73 | x = F.relu(self.fc2(x))
74 | return self.fc3(x)
75 |
--------------------------------------------------------------------------------
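The `hidden_init` helper in `model.py` above computes a uniform range of the form ±1/sqrt(n) for the hidden layers. If n is meant to be the fan-in (the number of inputs to the layer), it helps to remember that PyTorch stores an `nn.Linear(in_features, out_features)` weight with shape `(out_features, in_features)`. The sketch below works on a plain shape tuple rather than a real layer. The helper name `fan_in_limits` is hypothetical.

```python
import math


def fan_in_limits(weight_shape):
    """Return (-lim, lim) with lim = 1 / sqrt(fan_in) for a Linear-style weight shape."""
    out_features, in_features = weight_shape  # PyTorch layout: (out, in)
    lim = 1.0 / math.sqrt(in_features)
    return (-lim, lim)


# Shape of the weight in nn.Linear(24, 48), the sizes used for fc2 in the Actor.
low, high = fan_in_limits((48, 24))
print(low, high)

# Using the first dimension instead would give the narrower fan-out based bound.
print(1.0 / math.sqrt(48))
```

For this layer the fan-in bound is 1/sqrt(24), while indexing the first dimension would give 1/sqrt(48).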
/report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AkshayS21/Reinforcement-Learning-for-Optimal-Financial-Trading/b34069de0b73c60d5fa9a5af1359ad82119d937e/report.pdf
--------------------------------------------------------------------------------