├── .gitignore
├── README.md
├── doc
│   ├── .visio
│   │   └── bellmanVisio.vsdx
│   ├── discrete.py
│   ├── image
│   │   └── 扫码_搜索联合传播样式-微信标准绿版.png
│   └── 强化学习汇报Piper_202009.pptx
├── mathematics
│   ├── 梯度赌博机算法中,偏好函数更新:梯度上升公式是精确梯度上升的随机近似的证明.md
│   ├── 第10章:基于函数逼近的同轨策略控制.md
│   ├── 第11章:基于函数逼近的离轨策略方法.md
│   ├── 第12章:资格迹.md
│   ├── 第13章:策略梯度方法.md
│   ├── 第9章:基于函数逼近的同轨策略预测.md
│   ├── 策略改进(Policy Improvement)使策略更优的数学证明.md
│   └── 表格型方法总结.md
├── open_lRLwp_jupyter.bat
├── open_vscode_project.bat
├── practice
│   ├── 01-Stochastic-Multi-Armed-Bandit.ipynb
│   ├── 02-MDP-and-Bellman-Equation.ipynb
│   ├── 03-01-Grid-World.ipynb
│   ├── 03-02-Policy-Iteration.ipynb
│   ├── 03-03-Value-Iteration-and-Asynchronous-etc.ipynb
│   ├── 04-Monte-Carlo-Methods.ipynb
│   ├── 05-01-Temporal-Difference-Prediction.ipynb
│   ├── 05-02-Temporal-Difference-Control.ipynb
│   ├── 06-N-Step-Bootstrapping.ipynb
│   ├── 07-01-Maze-Problem-with-DynaQ-and-Priority.ipynb
│   ├── 07-02-Expectation-vs-Sample.ipynb
│   ├── 07-03-Trajectory-Sampling.ipynb
│   ├── Counterexample.ipynb
│   ├── Mountain-Car-Acess-Control.ipynb
│   ├── On-policy-Prediction-with-Approximation.ipynb
│   ├── Random-Walk-Mountain-Car.ipynb
│   ├── Short-Corridor.ipynb
│   └── images
│       ├── 03-02-policy-iteration.png
│       ├── 03-03-value-iteration.png
│       ├── 03-04-generalized-policy-iteration.png
│       ├── 03_grid_world.jpg
│       ├── 04-01.png
│       ├── 04-02.png
│       ├── 04-03.png
│       ├── 05-01.png
│       ├── 05-02.png
│       ├── 05-03.png
│       ├── 05-04.png
│       ├── 05-05.png
│       ├── 06-01.png
│       ├── 06-02.png
│       ├── 06-03.png
│       ├── 06-04.png
│       ├── 06-05.png
│       ├── 07-01.jpg
│       ├── 07-02.jpg
│       ├── 07-03.jpg
│       ├── 07-04.png
│       ├── 07-05.png
│       ├── example_13_1.png
│       └── figure_8_5.png
└── resources
    ├── Approximate Dynamic Programming by Powell 2nd edition.pdf
    ├── DEEP REINFORCEMENT LEARNING arXiv_1810.06339.pdf
    ├── RL-An Introdction exercises Solution.pdf
    ├── RL-An Introdction notes.pdf
    └── Reinforcement Learning - An Introduction 2018.pdf
/.gitignore:
--------------------------------------------------------------------------------
1 | /practice/.ipynb_checkpoints
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Hi, these are Liu Hongjia's reinforcement learning notes
2 |
3 | *This is my first systematic study of reinforcement learning. The notes are written in Chinese.*
4 |
5 | ****
6 |
7 | ### Where my notes live
8 | - 🥊 Introductory study / reading notes [GitHub: PiperLiu/Reinforcement-Learning-practice-zh](https://github.com/PiperLiu/Reinforcement-Learning-practice-zh)
9 | - 💻 Notes on papers / video courses [GitHub: PiperLiu/introRL](https://github.com/PiperLiu/introRL)
10 | - ✨ Assorted algorithms / practice playground [GitHub: PiperLiu/Approachable-Reinforcement-Learning](https://github.com/PiperLiu/Approachable-Reinforcement-Learning)
11 |
12 | ### Ongoing and planned study
13 | - [X] First pass through the RL bible [[details]](#first-pass-through-the-rl-bible)
14 | - [ ] First read of Deep Reinforcement Learning [[details]](#first-read-of-deep-reinforcement-learning)
15 | - [ ] First read of Approximate Dynamic Programming [[details]](#first-read-of-approximate-dynamic-programming)
16 |
17 | ****
18 |
19 | ### First pass through the RL bible
20 |
21 | **Output is the best way to learn. My method:**
22 | - Read the book; to keep up the pace I read the Chinese edition [[1-2]](#references);
23 | - In general, after finishing a chapter I restate its knowledge structure in my own words, which triggers plenty of thinking: writing it out completely surfaces the questions I glossed over while reading;
24 | - Notes that come with code are written as `.ipynb` files in [./practice/](./practice/); notes without code are written as `.md` files in [./mathematics/](./mathematics/);
25 | - I also consult other people's notes and thoughts; the most helpful have been:
26 | - - [github.com/ShangtongZhang, Python reproductions of the book's examples](https://github.com/ShangtongZhang/reinforcement-learning-an-introduction);
27 | - - [github.com/brynhayder, notes on the book and solutions to the exercises](https://github.com/brynhayder/reinforcement_learning_an_introduction).
28 |
29 | Completed so far:
30 |
31 | - [X] Part I: Tabular Solution Methods [summary](./mathematics/表格型方法总结.md)
32 | - [X] Part II: Approximate Solution Methods
33 | - [X] Part III: Looking Deeper
34 |
35 | Table of contents of the study notes (all `.ipynb` links point to `nbviewer.jupyter.org/github/`):
36 |
37 | ##### Part I: Tabular Solution Methods
38 |
39 | - Multi-armed bandits:
40 | - - Example code: [01-Stochastic-Multi-Armed-Bandit.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/01-Stochastic-Multi-Armed-Bandit.ipynb)
41 | - - Discussion of the math: [梯度赌博机算法中,偏好函数更新:梯度上升公式是精确梯度上升的随机近似的证明.md](./mathematics/梯度赌博机算法中,偏好函数更新:梯度上升公式是精确梯度上升的随机近似的证明.md)
42 | - Markov decision processes and the Bellman equation:
43 | - - Example: [02-MDP-and-Bellman-Equation.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/02-MDP-and-Bellman-Equation.ipynb)
44 | - Dynamic programming:
45 | - - Example 1: [./practice/03-01-Grid-World.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/03-01-Grid-World.ipynb)
46 | - - Example 2: [./practice/03-02-Policy-Iteration.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/03-02-Policy-Iteration.ipynb)
47 | - - Example 3: [./practice/03-03-Value-Iteration-and-Asynchronous-etc.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/03-03-Value-Iteration-and-Asynchronous-etc.ipynb)
48 | - Monte Carlo methods: [./practice/04-Monte-Carlo-Methods.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/04-Monte-Carlo-Methods.ipynb)
49 | - (One-step) temporal-difference learning:
50 | - - Prediction: [./practice/05-01-Temporal-Difference-Prediction.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/05-01-Temporal-Difference-Prediction.ipynb)
51 | - - Control: [./practice/05-02-Temporal-Difference-Control.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/05-02-Temporal-Difference-Control.ipynb)
52 | - n-step bootstrapping: [./practice/06-N-Step-Bootstrapping.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/06-N-Step-Bootstrapping.ipynb)
53 | - Planning and learning with tabular methods:
54 | - - **Summary of the first eight chapters:** [./mathematics/表格型方法总结.md](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/mathematics/表格型方法总结.md)
55 | - - Dyna-Q and prioritized sweeping example: [./practice/07-01-Maze-Problem-with-DynaQ-and-Priority.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/07-01-Maze-Problem-with-DynaQ-and-Priority.ipynb)
56 | - - Expected vs. sample updates: [./practice/07-02-Expectation-vs-Sample.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/07-02-Expectation-vs-Sample.ipynb)
57 | - - Trajectory sampling: [./practice/07-03-Trajectory-Sampling.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/07-03-Trajectory-Sampling.ipynb)
58 |
59 | ##### Part II: Approximate Solution Methods
60 |
61 | - Chapter 9: On-policy Prediction with Approximation:
62 | - - Notes: [第9章:基于函数逼近的同轨策略预测.md](./mathematics/第9章:基于函数逼近的同轨策略预测.md)
63 | - - Example (random walk, and coarse-coding width): [./practice/On-policy-Prediction-with-Approximation.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/On-policy-Prediction-with-Approximation.ipynb)
64 | - Chapter 10: On-policy Control with Approximation:
65 | - - Notes: [第10章:基于函数逼近的同轨策略控制.md](./mathematics/第10章:基于函数逼近的同轨策略控制.md)
66 | - - Example (n-step Sarsa control and average reward): [./practice/Mountain-Car-Acess-Control.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Mountain-Car-Acess-Control.ipynb)
67 | - Chapter 11: Off-policy Methods with Approximation:
68 | - - Notes: [第11章:基于函数逼近的离轨策略方法.md](./mathematics/第11章:基于函数逼近的离轨策略方法.md)
69 | - - Example: [./practice/Counterexample.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Counterexample.ipynb)
70 | - Chapter 12: Eligibility Traces:
71 | - - Notes: [第12章:资格迹.md](./mathematics/第12章:资格迹.md)
72 | - - Example: [./practice/Random-Walk-Mountain-Car.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Random-Walk-Mountain-Car.ipynb)
73 | - Chapter 13: Policy Gradient Methods:
74 | - - Notes: [第13章:策略梯度方法.md](./mathematics/第13章:策略梯度方法.md)
75 | - - Example: [./practice/Short-Corridor.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Short-Corridor.ipynb)
76 |
77 | ****
78 |
79 | ### First read of Deep Reinforcement Learning
80 |
81 | This survey is said to be good:
82 |
83 | [Li Y. Deep reinforcement learning: An overview[J]. arXiv preprint arXiv:1701.07274, 2017.](./resources/)
84 |
85 | For papers together with code, consider starting from:
86 |
87 | [https://github.com/ShangtongZhang/DeepRL](https://github.com/ShangtongZhang/DeepRL)
88 |
89 | ****
90 |
91 | ### First read of Approximate Dynamic Programming
92 |
93 | Where does reinforcement learning (approximate dynamic programming) apply in management? My advisor recommended this book:
94 |
95 | [Powell W B. Approximate Dynamic Programming: Solving the curses of dimensionality[M]. John Wiley & Sons, 2007.](./resources/)
96 |
97 | ****
98 |
99 | ### References
100 |
101 | - [1] Reinforcement Learning (2nd ed.); Richard S. Sutton, Andrew G. Barto; Chinese translation by Yu Kai (俞凯).
102 | - [2] Before that edition was published, a community translation was already under way: [http://rl.qiwihui.com/](http://rl.qiwihui.com/).
103 | - [3] The English electronic original: [http://rl.qiwihui.com/zh_CN/latest/chapter1/introduction.html](http://rl.qiwihui.com/zh_CN/latest/chapter1/introduction.html), already downloaded into this repository: [./resources/Reinforcement Learning - An Introduction 2018.pdf](./resources/)
104 | - [4] RL reading-notes series; WeChat official account: 老薛带你学Python (xue_python)
105 |
106 | ### More platforms
107 |
108 | "Output is the best way to learn": you are welcome to follow my study footprints on other platforms!
109 |
110 |
111 | ![扫码_搜索联合传播样式-微信标准绿版](doc/image/扫码_搜索联合传播样式-微信标准绿版.png)
112 |
--------------------------------------------------------------------------------
/doc/.visio/bellmanVisio.vsdx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/doc/.visio/bellmanVisio.vsdx
--------------------------------------------------------------------------------
/doc/discrete.py:
--------------------------------------------------------------------------------
1 | # Train CartPole-v0 with several tianshou policies (PG / PPO / DQN / A2C).
2 | import time
3 | import argparse
4 | import gym
5 | import numpy as np
6 | import torch
7 | import torch.nn.functional as F
8 | from torch import nn
9 | from torch.utils.tensorboard import SummaryWriter
10 | import tianshou as ts
11 |
12 | env = gym.make('CartPole-v0')
13 | # train_envs = gym.make('CartPole-v0')
14 | # test_envs = gym.make('CartPole-v0')
15 |
16 | train_envs = ts.env.VectorEnv([lambda: gym.make('CartPole-v0') for _ in range(8)])
17 | test_envs = ts.env.VectorEnv([lambda: gym.make('CartPole-v0') for _ in range(100)])
18 |
24 |
25 | class Net(nn.Module):
26 | def __init__(self, layer_num, state_shape, action_shape=0, device='cpu'):
27 | super().__init__()
28 | self.device = device
29 | self.model = [
30 | nn.Linear(np.prod(state_shape), 128),
31 | nn.ReLU(inplace=True)]
32 | for i in range(layer_num):
33 | self.model += [nn.Linear(128, 128), nn.ReLU(inplace=True)]
34 | if action_shape:
35 | self.model += [nn.Linear(128, np.prod(action_shape))]
36 | self.model = nn.Sequential(*self.model)
37 |
38 | def forward(self, s, state=None, info={}):
39 | if not isinstance(s, torch.Tensor):
40 | s = torch.tensor(s, device=self.device, dtype=torch.float)
41 | batch = s.shape[0]
42 | s = s.view(batch, -1)
43 | logits = self.model(s)
44 | return logits, state
45 |
46 |
47 | class Actor(nn.Module):
48 | def __init__(self, preprocess_net, action_shape):
49 | super().__init__()
50 | self.preprocess = preprocess_net
51 | self.last = nn.Linear(128, np.prod(action_shape))
52 |
53 | def forward(self, s, state=None, info={}):
54 | logits, h = self.preprocess(s, state)
55 | logits = F.softmax(self.last(logits), dim=-1)
56 | return logits, h
57 |
58 | class Critic(nn.Module):
59 | def __init__(self, preprocess_net):
60 | super().__init__()
61 | self.preprocess = preprocess_net
62 | self.last = nn.Linear(128, 1)
63 |
64 | def forward(self, s):
65 | logits, h = self.preprocess(s, None)
66 | logits = self.last(logits)
67 | return logits
68 |
69 | class DQN(nn.Module):
70 |
71 | def __init__(self, h, w, action_shape, device='cpu'):
72 | super(DQN, self).__init__()
73 | self.device = device
74 |
75 | self.conv1 = nn.Conv2d(4, 16, kernel_size=5, stride=2)
76 | self.bn1 = nn.BatchNorm2d(16)
77 | self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
78 | self.bn2 = nn.BatchNorm2d(32)
79 | self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
80 | self.bn3 = nn.BatchNorm2d(32)
81 |
82 | def conv2d_size_out(size, kernel_size=5, stride=2):
83 | return (size - (kernel_size - 1) - 1) // stride + 1
84 |
85 | convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
86 | convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
87 | linear_input_size = convw * convh * 32
88 | self.fc = nn.Linear(linear_input_size, 512)
89 | self.head = nn.Linear(512, action_shape)
90 |
91 | def forward(self, x, state=None, info={}):
92 | if not isinstance(x, torch.Tensor):
93 | x = torch.tensor(x, device=self.device, dtype=torch.float)
94 | x = F.relu(self.bn1(self.conv1(x)))
95 | x = F.relu(self.bn2(self.conv2(x)))
96 | x = F.relu(self.bn3(self.conv3(x)))
97 | x = self.fc(x.reshape(x.size(0), -1))
98 | return self.head(x), state
99 |
100 |
101 | state_shape = env.observation_space.shape or env.observation_space.n
102 | action_shape = env.action_space.shape or env.action_space.n
103 |
104 | def build_policy(_p='PG'):
105 | if _p == 'PG':
106 | net = Net(3, state_shape, action_shape)
107 | optim = torch.optim.Adam(net.parameters(), lr=1e-3)
108 | dist = torch.distributions.Categorical
109 | policy = ts.policy.PGPolicy(net, optim, dist, discount_factor=0.99)
110 | elif _p == 'PPO':
111 | net = Net(3, state_shape)
112 | actor = Actor(net, action_shape)
113 | critic = Critic(net)
114 | optim = torch.optim.Adam(list(
115 | actor.parameters()) + list(critic.parameters()), lr=1e-3)
116 | dist = torch.distributions.Categorical
117 | policy = ts.policy.PPOPolicy(
118 | actor, critic, optim, dist,
119 | discount_factor=0.99, action_range=None)
120 | elif _p == 'DQN':
121 | # model
122 | net = Net(3, state_shape, action_shape)
123 | optim = torch.optim.Adam(net.parameters(), lr=1e-3)
124 | policy = ts.policy.DQNPolicy(
125 | net, optim, 0.9, 3,  # discount factor 0.9, 3-step TD target
126 | use_target_network=True,
127 | target_update_freq=320)
128 | elif _p == 'A2C':
129 | net = Net(3, state_shape)
130 | actor = Actor(net, action_shape)
131 | critic = Critic(net)
132 | optim = torch.optim.Adam(list(
133 | actor.parameters()) + list(critic.parameters()), lr=1e-3)
134 | dist = torch.distributions.Categorical
135 | policy = ts.policy.A2CPolicy(
136 | actor, critic, optim, dist, 0.9, vf_coef=0.5,
137 | ent_coef=0.01, max_grad_norm=None)
138 | else:
139 | raise ValueError('No such policy in this file!')
140 |
141 | return policy
142 |
143 | def get_args():
144 | parser = argparse.ArgumentParser()
145 | parser.add_argument('--policy', type=str, default='PG')
146 | # parser.add_argument('--task', type=str, default='CartPole-v0')
147 | # parser.add_argument('--seed', type=int, default=1626)
148 | # parser.add_argument('--buffer-size', type=int, default=20000)
149 | # parser.add_argument('--lr', type=float, default=3e-4)
150 | # parser.add_argument('--gamma', type=float, default=0.9)
151 | parser.add_argument('--epoch', type=int, default=100)
152 | parser.add_argument('--step-per-epoch', type=int, default=1000)
153 | parser.add_argument('--collect-per-step', type=int, default=10)
154 | parser.add_argument('--repeat-per-collect', type=int, default=1)
155 | parser.add_argument('--batch-size', type=int, default=64)
156 | # parser.add_argument('--layer-num', type=int, default=2)
157 | parser.add_argument('--training-num', type=int, default=32)
158 | parser.add_argument('--test-num', type=int, default=100)
159 | # parser.add_argument('--logdir', type=str, default='log')
160 | # parser.add_argument('--render', type=float, default=0.)
161 | # parser.add_argument(
162 | # '--device', type=str,
163 | # default='cuda' if torch.cuda.is_available() else 'cpu')
164 | # # a2c special
165 | # parser.add_argument('--vf-coef', type=float, default=0.5)
166 | # parser.add_argument('--ent-coef', type=float, default=0.001)
167 | # parser.add_argument('--max-grad-norm', type=float, default=None)
168 | args = parser.parse_known_args()[0]
169 | return args
170 |
171 |
172 | def stop_fn(x):
173 | return x >= env.spec.reward_threshold
174 |
175 |
176 | if __name__ == "__main__":
177 | args = get_args()
178 | policy = build_policy(args.policy)
179 | writer = SummaryWriter('log' + '/' + args.policy + '/' + time.strftime("%d_%Y_%H_%M_%S"))
180 | train_collector = ts.data.Collector(
181 | policy, train_envs, ts.data.ReplayBuffer(2000))
182 | test_collector = ts.data.Collector(policy, test_envs)
183 | result = ts.trainer.onpolicy_trainer(
184 | policy, train_collector, test_collector, args.epoch,
185 | args.step_per_epoch, args.collect_per_step, args.repeat_per_collect,
186 | args.test_num, args.batch_size, stop_fn=stop_fn, writer=writer)
187 | assert stop_fn(result['best_reward'])
188 | train_collector.close()
189 | test_collector.close()
190 |
--------------------------------------------------------------------------------
/doc/image/扫码_搜索联合传播样式-微信标准绿版.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/doc/image/扫码_搜索联合传播样式-微信标准绿版.png
--------------------------------------------------------------------------------
/doc/强化学习汇报Piper_202009.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/doc/强化学习汇报Piper_202009.pptx
--------------------------------------------------------------------------------
/mathematics/梯度赌博机算法中,偏好函数更新:梯度上升公式是精确梯度上升的随机近似的证明.md:
--------------------------------------------------------------------------------
1 | > This note proves the soundness of the preference-update rule of the gradient bandit algorithm for the introductory k-armed bandit problem: $H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \overline{R_t})(1-\pi_t(A_t))$. The book's treatment can be hard to follow, so I spell out the "why & how" of each step in plainer language.
2 |
3 | > Source: Reinforcement Learning (2nd ed.); Richard S. Sutton, Andrew G. Barto; Chinese translation by Yu Kai (俞凯)
4 |
5 | ### Preface
6 |
7 | The gradient bandit algorithm for the introductory **k-armed bandit problem** introduces a preference function. The value of a preference by itself does not matter; what matters is the preference of one action relative to another, so the action-selection distribution is a softmax:
8 |
9 | $$\Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} = \pi_t(a)$$
10 |
11 | $\pi_t(a)$ is the probability of selecting action a at time t; all preferences start at the same value (0 will do).
12 |
13 | The preferences are then updated by the following rule:
14 |
15 | |$H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \overline{R_t})(1-\pi_t(A_t))$|for the selected action $A_t$| (1) |
16 | |---|---|---|
17 | |$H_{t+1}(a) = H_t(a) - \alpha (R_t - \overline{R_t})\,\pi_t(a)$|for all $a \not= A_t$| (2) |
18 |
19 | where $\alpha > 0$ is the step size, and $\overline{R_t}$ is the average of all rewards up to time t, called the baseline.
20 |
21 | **A personal thought: why does the update involve the probability?** For rule (1): if the probability is already large, $H_{t+1}$ should not grow by much; if $\pi_t = 1$, $H_{t+1}$ need not change at all.
22 |
23 | This intuition is reasonable, **but the soundness of the update rule can actually be proved mathematically**. A code sketch of the rule comes first; the proof follows it.
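A minimal sketch of rules (1) and (2) in code, written for illustration (the 10-arm testbed, the reward model, and the step size are all made-up choices, not prescriptions):

```python
import numpy as np

def gradient_bandit_step(H, avg_R, t, rng, alpha=0.1):
    """One step of the gradient bandit algorithm on a made-up 10-armed testbed."""
    k = len(H)
    pi = np.exp(H - H.max())
    pi /= pi.sum()                                # softmax over preferences
    A = rng.choice(k, p=pi)                       # sample an action
    R = rng.normal(loc=0.1 * A, scale=1.0)        # illustrative reward model
    avg_R += (R - avg_R) / (t + 1)                # incremental baseline R-bar
    indicator = np.eye(k)[A]                      # 1 for the chosen action, 0 otherwise
    H += alpha * (R - avg_R) * (indicator - pi)   # rules (1) and (2) in one vector update
    return H, avg_R

rng = np.random.default_rng(0)
H, avg_R = np.zeros(10), 0.0
for t in range(2000):
    H, avg_R = gradient_bandit_step(H, avg_R, t, rng)
print(np.argmax(H))  # drifts toward the best arm (arm 9 under this reward model)
```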
24 |
25 | ****
26 |
27 | ### Proof
28 |
29 | Exact gradient ascent would update
30 |
31 | $$H_{t+1}(a)=H_t(a) + \alpha \frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)}$$
32 |
33 | where the performance measure is the overall expected reward:
34 |
35 | $$ \mathbb{E}[R_t] = \sum_x \pi_t (x) q_* (x)$$
36 |
37 | The true action values $q_* (x)$ are unknown, so exact gradient ascent cannot be implemented; but we can approximate it with stochastic gradient ascent.
38 |
39 | So let us derive an approximation of $\frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)}$:
40 |
41 | $$\frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)} = \frac{\partial}{\partial H_t(a)}\left[ \sum_x \pi_t (x) q_* (x) \right]$$
42 |
43 | Since $q_* (x)$ is an objective quantity that does not depend on $H_t (a)$:
44 |
45 | $$\frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)} = \sum_x q_* (x) \frac{\partial \pi_t (x)}{\partial H_t(a)}$$
46 |
47 | Because $\sum_x \frac{\partial \pi_t (x)}{\partial H_t(a)}=0$ (proved below: [the action derivatives sum to zero](#1)), we may insert a "baseline" $B_t$:
48 |
49 | $$\frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)} = \sum_x (q_* (x) - B_t ) \frac{\partial \pi_t (x)}{\partial H_t(a)}$$
50 |
51 | Multiplying by $\pi_t(x) / \pi_t(x)$ gives:
52 |
53 | $$\frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)} = \sum_x \pi_t(x) (q_* (x) - B_t ) \frac{\partial \pi_t (x)}{\partial H_t(a)} / \pi_t(x)$$
54 |
55 | This is exactly an expectation, under the distribution $\pi_t(x)$, of the quantity $(q_* (x) - B_t ) \frac{\partial \pi_t (x)}{\partial H_t(a)} / \pi_t(x)$, i.e.:
56 |
57 | $$\frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)} = \mathbb{E} \left[ (q_* (x) - B_t ) \frac{\partial \pi_t (x)}{\partial H_t(a)} / \pi_t(x) \right]$$
58 |
59 | The random variable here is the action $x$, which we now write as the selected action $A_t$; we take the baseline $B_t = \overline{R_t}$; and since the expected reward of selecting $A_t$ is $q_* (A_t)=\mathbb{E}[R_t | A_t]$, the sample $R_t$ can stand in for $q_*(A_t)$. Hence:
60 |
61 | $$\frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)} = \mathbb{E} \left[ (R_t - \overline{R_t} ) \frac{\partial \pi_t ( A_t)}{\partial H_t(a)} / \pi_t( A_t) \right]$$
62 |
63 | Furthermore, $\frac{\partial \pi_t (x)}{\partial H_t(a)}=\pi_t(x) (\mathbb{I}_{a=x} - \pi_t(a))$, where $\mathbb{I}_{a=x}$ is 1 if $a=x$ and 0 otherwise; this is proved below: [derivative of the softmax policy](#2).
64 |
65 | Substituting $\frac{\partial \pi_t (x)}{\partial H_t(a)}=\pi_t(x) (\mathbb{I}_{a=x} - \pi_t(a))$ with $x = A_t$:
66 |
67 | $$\frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)} = \mathbb{E} \left[ (R_t - \overline{R_t} ) (\mathbb{I}_{a=A_t} - \pi_t(a)) \right]$$
68 |
69 | Plugging this back into $H_{t+1}(a)=H_t(a) + \alpha \frac{\partial \mathbb{E}[R_t]}{\partial H_t (a)}$ and sampling the expectation gives
70 |
71 | $$H_{t+1}(a)=H_t(a) + \alpha (R_t - \overline{R_t} ) (\mathbb{I}_{a=A_t} - \pi_t(a))$$
72 |
73 | whose expected step equals the exact gradient-ascent step: the stochastic update follows the true gradient in expectation.
74 |
75 | Q.E.D.
76 |
77 | ### The action derivatives sum to zero
78 |
79 |
80 | Claim: $\sum_x \frac{\partial \pi_t (x)}{\partial H_t(a)}=0$.
81 |
82 | Since $\sum_x \pi_t (x)=1$ (the probabilities sum to one), differentiating every term with respect to $H_t(a)$ makes the right-hand side 0:
83 |
84 | $$\sum_x \frac{\partial \pi_t (x)}{\partial H_t(a)}=0$$
85 |
86 | Q.E.D.
87 |
88 | ### Derivative of the softmax policy
89 |
90 |
91 | Claim: $\frac{\partial \pi_t (x)}{\partial H_t(a)}=\pi_t(x) (\mathbb{I}_{a=x} - \pi_t(a))$, where $\mathbb{I}_{a=x}$ is 1 if $a=x$ and 0 otherwise.
92 |
93 | This is really just an application of the quotient rule $(\frac{f(x)}{g(x)})^{'}$.
94 |
95 | Simplify $\frac{\partial \pi_t (x)}{\partial H_t(a)}$ by abbreviating $H_t(x)$ as $x$ and working with
96 |
97 | $$\pi_t (x) = \frac{e^{x}}{\sum_{i=1}^{k} e^{i}}$$
98 |
99 | It then suffices to show:
100 |
101 | $$\frac{\partial \pi_t (x)}{\partial x} = \left\{
102 | \begin{aligned}
103 | \pi_t(x)(1-\pi_t(a)) & & x=a \\
104 | -\pi_t(x) \pi_t(a) & & x\not= a \\
105 | \end{aligned}
106 | \right.$$
107 |
108 | This is high-school calculus: apply the quotient rule $(\frac{f(x)}{g(x)})^{'} = \frac{f^{'}(x)g(x) - g^{'}(x)f(x)}{g(x)^{2}}$ and treat the two cases separately.
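As a numerical sanity check of the piecewise derivative above, the softmax Jacobian can be compared against finite differences; a quick sketch with made-up preferences:

```python
import numpy as np

def softmax(H):
    e = np.exp(H - H.max())
    return e / e.sum()

rng = np.random.default_rng(1)
H = rng.normal(size=5)
pi = softmax(H)

# Analytic Jacobian: d pi(x) / d H(a) = pi(x) * (1{a=x} - pi(a))
J_analytic = np.diag(pi) - np.outer(pi, pi)

# Finite-difference Jacobian
eps = 1e-6
J_numeric = np.empty((5, 5))
for a in range(5):
    Hp, Hm = H.copy(), H.copy()
    Hp[a] += eps
    Hm[a] -= eps
    J_numeric[:, a] = (softmax(Hp) - softmax(Hm)) / (2 * eps)

print(np.abs(J_analytic - J_numeric).max())  # ~1e-10: the formula checks out
```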
109 |
--------------------------------------------------------------------------------
/mathematics/第10章:基于函数逼近的同轨策略控制.md:
--------------------------------------------------------------------------------
1 | > Preface: this note gives an overview of Chapter 10 of Reinforcement Learning (2nd ed.).
2 |
3 | Everything below reflects my personal understanding and may contain mistakes; feedback is welcome: piperliu@qq.com.
4 |
5 | # Main text
6 |
7 | A very short chapter, continuing Chapter 9.
8 |
9 | In Chapter 9 we used `function approximation` to estimate $V(S)$; in Chapter 10 we apply the same idea to $q_\pi(s,a)$. The concrete details, however, differ a great deal.
10 |
11 | ### 10.1 - 10.2
12 |
13 | These two sections cover:
14 | - 10.1 Episodic semi-gradient control
15 | - 10.2 Semi-gradient n-step Sarsa
16 |
17 | They are a direct continuation of Chapter 9. For the `implementation`, two points deserve emphasis (see the sketch after this list):
18 | - From Chapter 9 on we assume a `linear relationship` between $w$ and the target $v$ or $q$, so $w$ can be read as a vector of `"weights"`; to find the best action, pick the one whose weighted score is largest;
19 | - `Tile coding`: in Chapter 9 we tile-coded the state $s$; in Chapter 10 we tile-code the pair $[s,a]$, and this is what sets the method apart from a table. `Put differently, without tile coding we would hit a big problem:` **naively aggregating (s,a) pairs is no different from tabular solving, with the (s,a) pairs fully decoupled; tile coding is what couples them.**
20 |
21 | I reached this conclusion after staring for a long time at the code of Shangtong Zhang (a student of Sutton's; the officially endorsed code).
22 |
23 | The tile coding uses hashing; Sutton himself provides the `python3` file: [http://incompleteideas.net/tiles/tiles3.py-remove](http://incompleteideas.net/tiles/tiles3.py-remove)
24 |
25 | I annotated the code a little, see: [https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Mountain-Car-Acess-Control.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Mountain-Car-Acess-Control.ipynb)
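A minimal sketch of how $(s,a)$ tile coding is typically wired up with Sutton's tiles3.py in the Mountain Car setting, assuming tiles3.py sits next to the script (the table size, tiling count, and scaling constants are the usual illustrative choices, not prescriptions):

```python
import numpy as np
from tiles3 import IHT, tiles  # Sutton's tile-coding software (tiles3.py)

iht = IHT(4096)   # hash table for tile indices
num_tilings = 8

def active_tiles(position, velocity, action):
    """Tile-code the state-action pair: the action is passed as an int coordinate,
    which is what couples (s, a) inside the tiling."""
    # scale each state dimension so one tile spans 1/8 of its range
    return tiles(iht, num_tilings,
                 [8 * position / (0.5 + 1.2), 8 * velocity / (0.07 + 0.07)],
                 [action])

w = np.zeros(4096)

def q_hat(position, velocity, action):
    # linear action value: sum of the weights of the active tiles
    return w[active_tiles(position, velocity, action)].sum()
```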
26 |
27 | ### 10.3 - 10.5
28 |
29 | These three sections cover:
30 | - 10.3 Average reward: a new problem setting for continuing tasks
31 | - 10.4 Deprecating the discounted setting
32 | - 10.5 Differential semi-gradient n-step Sarsa
33 |
34 | To handle control in continuing tasks, 10.3 and 10.4 introduce the `average reward`, the `differential return`, and the `differential value function`, and prove mathematically that **discounting is futile in the continuing setting.** The proof rests on **an ergodicity assumption about the MDP.**
35 |
36 | Like the weights $w$, the average reward exists objectively but is unknown, so it too must be estimated by updates, with a step size of its own. A minimal sketch of the resulting update is given below.
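The sketch assumes linear features, so $\hat{q} = w^\top x$; the function shape and step sizes are my own illustrative choices:

```python
import numpy as np

def differential_sarsa_step(w, x, r, x_next, avg_r, alpha=0.1, beta=0.01):
    """One differential semi-gradient Sarsa(0) update with linear features.

    w      -- weight vector; the value estimate is w @ x
    x      -- feature vector of the current state-action pair
    r      -- observed reward
    x_next -- feature vector of the next state-action pair
    avg_r  -- running estimate of the average reward, with its own step size beta
    """
    delta = r - avg_r + w @ x_next - w @ x   # differential TD error
    avg_r += beta * delta                    # update the average-reward estimate
    w += alpha * delta * x                   # semi-gradient weight update
    return w, avg_r
```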
37 |
38 | ### Additions to the exercise solutions
39 |
40 | Exercise 10.6 is a lot of fun. For the solution, see `RL-An Introdction exercises Solution.pdf` in [https://github.com/PiperLiu/Reinforcement-Learning-practice-zh/tree/master/resources](https://github.com/PiperLiu/Reinforcement-Learning-practice-zh/tree/master/resources).
41 |
42 | > tip: I learned what a `congruence equation` is.
43 |
44 | Here is why $V(A) = -\overline{R} + V(B)$.
45 |
46 | My understanding:
47 |
48 | $$\lim_{t \to \infty} \delta_t = 0$$
49 |
50 | while
51 |
52 | $$\delta_t = R_{t+1} - \overline{R}_{t+1} + \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t,w_t)$$
53 |
54 | In the setting of the exercise this becomes:
55 |
56 | $$0 = 0 - \frac{1}{3} + V(B) - V(A)$$
57 |
58 | The equation involving $V(C)$ follows the same way.
59 |
60 | - - -
61 |
62 | Exercise 10.7 is just as interesting. **But Bryn Elesedy's answer contains a `typo`.**
63 |
64 | **The original reads:**
65 |
66 | Now to compute the differential state values we write
67 |
68 | $$ V(S; \gamma) = \lim_{h\to \infty} \sum_{t=0}^h \gamma^t \left( \mathbb{E}[R_{t+1} \vert{} S_0 = s] - \bar{R} \right)$$
69 | then
70 |
71 | $$
72 | \begin{aligned}
73 | V(A; \gamma) &= 1 - \bar{R} + \gamma V(A; \gamma) \\
74 | V(A; \gamma) &= - \bar{R} + \gamma V(B; \gamma)
75 | \end{aligned}
76 | $$
77 |
78 | so
79 |
80 | $$
81 | V(A; \gamma) = \frac12 ( 1 - \gamma ) - \gamma^2 V(A; \gamma)
82 | $$
83 |
84 | **My correction and explanation:**
85 |
86 | In the setting of this exercise the original definition of the differential return ($G_t$, formula 10.9) does not apply, so the exercise defines state values as:
87 |
88 | $$ V(S; \gamma) = \lim_{h\to \infty} \sum_{t=0}^h \gamma^t \left( \mathbb{E}[R_{t+1} \vert{} S_0 = S] - \bar{R} \right)$$
89 |
90 | From that definition:
91 |
92 | $$
93 | \begin{aligned}
94 | V(B; \gamma) &= \lim_{h\to \infty} \sum_{t=0}^h \gamma^t \left( \mathbb{E}[R_{t+1} \vert{} S_0 = B] - \bar{R} \right) \\
95 | & = \lim_{h\to \infty} \gamma^0 (\mathbb{E}[R_{1} \vert{} S_0 = B] - \bar{R}) + \lim_{h\to \infty} \sum_{t=1}^h \gamma^t \left( \mathbb{E}[R_{t+1} \vert{} S_0 = B] - \bar{R} \right)\\
96 | & = 1 - \bar{R} + \gamma V(A; \gamma) \\
97 | V(A; \gamma) &= - \bar{R} + \gamma V(B; \gamma)
98 | \end{aligned}
99 | $$
100 |
101 | Solving the two simultaneous linear equations:
102 |
103 | $$
104 | V(A; \gamma) = \frac12 ( 1 - \gamma ) - \gamma^2 V(A; \gamma)
105 | $$
106 |
107 | - - -
108 |
109 | Exercise 10.8 shows vividly and cleverly how updating $\bar{R}$ with the book's equation 10.10 behaves: "Once $\bar{R}$ gets to the correct value it never leaves".
110 |
111 |
--------------------------------------------------------------------------------
/mathematics/第11章:基于函数逼近的离轨策略方法.md:
--------------------------------------------------------------------------------
1 | > Preface: this note gives an overview of Chapter 11 of Reinforcement Learning (2nd ed.).
2 |
3 | Everything below reflects my personal understanding and may contain mistakes; feedback is welcome: piperliu@qq.com.
4 |
5 | Overall, Chapter 11 was not a pleasant read. Perhaps because the material itself is abstract (it is a starred chapter), the exercises did not grab me. One more disappointment: although the chapter discusses quite a few update targets and algorithms (many of them counterexamples) and gives plenty of matrix-heavy formulas, it offers few worked examples. For now, understanding the broad ideas seems enough.
6 |
7 | # Main text
8 |
9 | Most `off-policy methods with function approximation` are unsatisfying, in theory and in practice alike.
10 |
11 | Off-policy learning currently faces two challenges:
12 | - The first part of the challenge has to do with `the target of the update` (not to be confused with the target policy);
13 | - and the second part has to do with the `distribution of the updates`.
14 |
15 | To explain: in off-policy prediction/control, the target we update toward differs from the behavior we interact under, hence the first challenge. In the non-tabular case we need the `state distribution` $\mu(s)$, hence the second challenge.
16 |
17 | The first challenge can be addressed by bringing in `importance sampling`.
18 |
19 | ### 11.1 Semi-gradient methods
20 |
21 | This section extends `importance sampling` to the approximate off-policy setting. **In other words, it converts the earlier tabular off-policy algorithms into semi-gradient form** (a one-step sketch follows).
22 |
23 | Note that these methods are guaranteed stable and asymptotically unbiased only in the `tabular` case.
24 |
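A minimal sketch of one such semi-gradient step, assuming linear features and a precomputed importance-sampling ratio (all names and step sizes are illustrative):

```python
import numpy as np

def off_policy_semi_gradient_td0(w, x, r, x_next, rho, alpha=0.01, gamma=0.99):
    """One semi-gradient off-policy TD(0) update with linear features.

    rho is the per-step importance-sampling ratio pi(A|S) / b(A|S),
    which re-weights the update toward the target policy pi.
    """
    delta = r + gamma * (w @ x_next) - (w @ x)  # TD error under linear v-hat
    w += alpha * rho * delta * x                # gradient of v-hat(S,w) is x
    return w
```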
25 |
26 | ### 11.2 Examples of off-policy divergence
27 |
28 | This section gives one small and one large example of the `instability` of semi-gradient (and other) methods.
29 |
30 | Code for Baird's counterexample: [https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Counterexample.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Counterexample.ipynb)
31 |
32 | In addition, Tsitsiklis and van Roy's counterexample is established by direct mathematical derivation.
33 |
34 | Let me stress Baird's counterexample, which held me up for quite a while; even now I have not fully worked through its computation. What I can offer: `if you wonder why a state is encoded as something like 2w_1 + w_8, my explanation is that the state encoding (the construction of the linear features) is not ours to choose; in this example the features have been constructed deliberately badly, and under that bad construction we can observe that conventional semi-gradient methods diverge and are unstable.`
35 |
36 | Then why not simply construct the features "cleverly"?
37 |
38 | Because most of the time we cannot fully control the feature construction (the `weight vector` $w$ cannot have too many dimensions). Constructing features involves the mapping from $w$ to $s$ (my understanding: `neural networks` / `tile coding` can handle this). That is also why the end of the section mentions that "`another way to avoid instability is to use special function-approximation methods`".
39 |
40 | ### 11.3 The deadly triad
41 |
42 | Instability and divergence are guaranteed whenever a method combines `all three` of:
43 | - function approximation
44 | - bootstrapping
45 | - off-policy training
46 |
47 | So which one should be given up for stability?
48 |
49 | 1. Function approximation saves enormous amounts of computation and cannot be abandoned.
50 |
51 | 2. Bootstrapping could perhaps be abandoned, but the efficiency loss would be huge:
52 | - learning without bootstrapping is slow, as Chapter 10 already showed with examples;
53 | - Monte Carlo brings a huge storage burden.
54 |
55 | 3. Off-policy training may be the one to abandon, since in many cases on-policy is sufficient. But achieving true "generality" on on-policy training alone is not enough.
56 |
57 | ### 11.4 The geometry of linear value functions
58 |
59 | This section gives vectors a "geometric" reading, and it puzzled me for quite a while. It lays the groundwork for the `objectives introduced in the following sections`.
60 |
61 | The book considers this setting:
62 | - $S = \{ s_1,s_2,s_3 \}$
63 | - $w = (w_1, w_2)^T$
64 |
65 | It then draws a picture of a 2-D plane crossing a 3-D space, `which confused me`. The explanation I settled on (`still with some doubt`): a point of this space has as its three coordinates the state values induced by some $w$; $v(s)$ is the triple $(v(s_1),v(s_2),v(s_3))$.
66 |
67 | The approximate `linear` value function is:
68 |
69 | $$v_w = X w$$
70 |
71 | where:
72 |
73 | $$X = \begin{bmatrix}
74 | x_1(s_1) & x_2(s_1) \\
75 | x_1(s_2) & x_2(s_2) \\
76 | x_1(s_3) & x_2(s_3) \\
77 | \end{bmatrix} \in \mathbb{R}^{|S| \times d}$$
78 |
79 | Read this carefully: $S$ originally has `three` dimensions, and the `"feature engineering"` reduces it to an $X$ with only two features. Why two? Because our parameter vector has only two components ($w=(w_1,w_2)^T$).
80 |
81 | Relying on the linear map alone, because the parameter has only two components ($w=(w_1,w_2)^T$) and $X$ may be poor, the reachable state-value vectors $(\hat{v}(s_1),\hat{v}(s_2),\hat{v}(s_3))$ (think of the `geometric meaning of matrix-vector products`, especially `direction`) very likely cannot reach $v_\pi$.
82 |
83 | Note also that distances here are not measured with the Euclidean norm.
84 |
85 | ### 11.5 Gradient descent on the Bellman error
86 |
87 | This section first takes the `mean squared TD error` $\overline{TDE}$ as the objective and runs full SGD on it. This is guaranteed to converge, and the algorithm is called the `naive residual-gradient` algorithm. However, **what naive residual-gradient converges to is not the optimal policy.**
88 |
89 | The `A-split` example (Example 11.2 in the book) is a neat counterexample.
90 |
91 | So we consider the Bellman error $\overline{BE}$ instead. If the values are exact, the Bellman error is zero.
92 |
93 | The Bellman error is the expectation of the TD error in a state. It needs `two independent samples` of the next state, which is not hard to arrange. Moreover, in the linear case the Bellman-error objective always converges to the minimizing $w$, and that $w$ is unique.
94 |
95 | Using the Bellman error as the descent objective, however, raises three problems:
96 |
97 | - the corresponding `residual-gradient algorithm` is too slow;
98 | - the Bellman error still seems to converge to the wrong values (Example 11.3, the A-presplit example);
99 | - the Bellman error is not learnable.
100 |
101 |
102 | ### 11.6 The Bellman error is not learnable
103 |
104 | What does "not learnable" mean?
105 |
106 | It means: not learnable from observable data.
107 |
108 | The book gives an example: two environments that are genuinely different generate data following the same distribution, yet their $\overline{VE}$ differ. In other words, $\overline{VE}$ is `not a function uniquely determined by the data distribution` (it has surprisingly little to do with the data sequence!). The same reasoning applies to $\overline{BE}$; hence $\overline{BE}$ is not learnable.
109 |
110 | Then why did the previous chapter treat $\overline{VE}$ as usable? Because the parameter that optimizes it is learnable. This leads to the `mean squared return error` $\overline{RE}$.
111 |
112 | ### 11.7 Gradient-TD methods
113 |
114 | Consider SGD on the `mean squared projected Bellman error` $\overline{PBE}$.
115 |
116 | Through a chain of mathematical derivations, this section arrives at `least-mean-square (LMS)`-style updates and the algorithms `GTD2` and `TD(0) with gradient correction (TDC), a.k.a. GTD(0)`.
117 |
118 | Examples show that gradient TD(0) works.
119 |
120 | GTD2 and TDC actually contain two learning processes: learning w and learning v. The book calls this asymmetric dependence a `cascade`.
121 |
122 | ### 11.8 Emphatic-TD methods
123 |
124 | The book also briefly presents emphatic-TD methods, whose core idea is to warp the update distribution back into an on-policy distribution.
125 |
126 | This requires an `interest` value and an `emphasis` value.
127 |
128 | In theory, in expectation, the algorithm can converge to the optimal solution; in practice it does not behave that well.
129 |
130 | ### 11.9 Reducing variance
131 |
132 | This section sketches "importance-weight-aware" updates, the tree-backup algorithm, "recognizers", and other notions that may reduce the variance of the estimates and use the data more efficiently.
133 |
134 | PiperLiu
135 | 2020-3-15 16:39:22
--------------------------------------------------------------------------------
/mathematics/第12章:资格迹.md:
--------------------------------------------------------------------------------
1 | > Preface: this note gives an overview of Chapter 12 of Reinforcement Learning (2nd ed.).
2 |
3 | Everything below reflects my personal understanding and may contain mistakes; feedback is welcome: piperliu@qq.com.
4 |
5 | There is still much in Chapter 12 I do not fully understand. Here I only try to narrate, as best I can, the knowledge I did absorb, `and I list my open questions, with guessed answers/workarounds, at the end`. Since there is still a lot to learn and build, this first pass does not aim for full rigor; I value engineering ability over theory for now.
6 |
7 | `Chapter 12 proposes a great many methods. To me, the core idea is to use the auxiliary vector called the "eligibility trace" to turn forward views into backward views "elegantly" (making computation online and efficient).`
8 |
9 | # Main text
10 |
11 | ### 12.1 The lambda-return
12 |
13 | I find it easier to start from the `Monte Carlo method`: `Monte Carlo` is a `special case of the lambda-return`.
14 |
15 | What do I mean? Only when `lambda=1` do we recover the `Monte Carlo method`; when `lambda != 1`, the computation costs roughly n times more than Monte Carlo.
16 |
17 | At the end of an episode:
18 | - the `Monte Carlo method` backs up only from the final step (the t of $G_t$ being the final step), updating every state's value;
19 | - the `lambda-return` backs up not only from the final step `but also from steps T-1, T-2, T-3, ...`, each term weighted by a `lambda-dependent factor`.
20 |
21 | **Put differently:**
22 |
23 | Both `Monte Carlo` and the `lambda-return` attach a return $G_t$ to each time step, but the `lambda-return`'s return carries richer information, including the lambda-weighted terms, and is therefore written $G_t^\lambda$.
24 |
25 | Code at [https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Random-Walk-Mountain-Car.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Random-Walk-Mountain-Car.ipynb).
26 |
27 | ### 12.2 TD(lambda)
28 |
29 | The lambda-return of the previous section has these problems:
30 | - it cannot run online; updates are delayed (it is a forward view, needing data that only arrives later, so updates must wait until the episode ends);
31 | - at episode end, all the update work is handled in one batch (a computational spike).
32 |
33 | `TD(lambda)` is an old method, yet it solves both problems.
34 |
35 | The trick is to `introduce an auxiliary vector, the "eligibility trace"`, of the same size as the weight vector, which keeps a "trace" of the weights.
36 |
37 | The trace is updated along with every iteration.
38 |
39 | This yields a `backward-view` algorithm (one that updates the weights from data already generated, lying behind us); a minimal sketch follows after the next note.
40 |
41 | Note that in the random-walk case of the code at [https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Random-Walk-Mountain-Car.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Random-Walk-Mountain-Car.ipynb), although the weights are what gets updated, the relation is essentially still a tabular $v(s)$: the programmer models it as $v(s) = w^T x = w_1 x_1 + w_2 x_2 + \ldots$ with indicator (one-hot) features $x(s_1)=[1, 0, 0, \ldots]^T, x(s_2)=[0, 1, 0, \ldots]^T$, and so on.
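A minimal sketch of the backward-view update with an accumulating trace, assuming linear features (names and step sizes are illustrative):

```python
import numpy as np

def td_lambda_step(w, z, x, r, x_next, alpha=0.01, gamma=1.0, lam=0.8):
    """One semi-gradient TD(lambda) update with an accumulating trace.

    z is the eligibility trace, same shape as w; for linear v-hat the
    gradient of v-hat(S, w) is simply the feature vector x.
    """
    z = gamma * lam * z + x                      # decay the trace, add current gradient
    delta = r + gamma * (w @ x_next) - (w @ x)   # one-step TD error
    w += alpha * delta * z                       # credit all recently visited features
    return w, z
```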
42 |
43 | ### 12.3 n-step truncated lambda-return methods
44 |
45 | Now back to the offline view: give time `t` a `horizon h`, so the return at t may only use the `(h-t) steps from t to h`, written $G_{t:h}^\lambda$.
46 |
47 | We call this return the `truncated return`.
48 |
49 | If you read the random-walk code carefully, you will find the truncation idea already at work: once lambda is raised to a high power, the term's influence on the update is negligible and can be dropped.
50 |
51 | The related algorithms are called `truncated TD(lambda)`, or `TTD(lambda)`.
52 |
53 | By extension, $G_{t:t+k}^\lambda$ has an efficient recursive formula, equation 12.10 in the book. [Question 1]
54 |
55 | ### 12.4 Redoing updates: the online lambda-return algorithm
56 |
57 | Section 12.4 really just sets up 12.5: 12.4 proposes an algorithm of enormous computational cost, and 12.5 gives its simplified form (far less computation, identical effect).
58 |
59 | In the `redo-updates online lambda-return algorithm`, as `time moves forward` (as the largest available t grows), the collected data grows too, so the `horizon h` may grow with it; hence at every step (whenever h grows by one) we can redo a full pass of updates from t=0 up to t=h.
60 |
61 | About the formulas I had a question that genuinely bothered me for a day; my guessed answer follows.
62 |
63 | $$
64 | \begin{aligned}
65 | h = 1 & \; : & \; w_1^1 = w_0^1 + \alpha [G_{0:1}^\lambda - \hat{v}(S_0,w_0^1)]\nabla \hat{v}(S_0,w_0^1) \\
66 | h = 2 & \; : & \; w_1^2 = w_0^2 + \alpha [G_{0:2}^\lambda - \hat{v}(S_0,w_0^2)]\nabla \hat{v}(S_0,w_0^2) \\
67 | & & \; w_2^2 = w_1^2 + \alpha [G_{1:2}^\lambda - \hat{v}(S_1,w_1^2)]\nabla \hat{v}(S_1,w_1^2) \\
68 | h = 3 & \; : & \; w_1^3 = w_0^3 + \alpha [G_{0:3}^\lambda - \hat{v}(S_0,w_0^3)]\nabla \hat{v}(S_0,w_0^3) \\
69 | & & \; w_2^3 = w_1^3 + \alpha [G_{1:3}^\lambda - \hat{v}(S_1,w_1^3)]\nabla \hat{v}(S_1,w_1^3) \\
70 | & & \; w_3^3 = w_2^3 + \alpha [G_{2:3}^\lambda - \hat{v}(S_2,w_2^3)]\nabla \hat{v}(S_2,w_2^3) \\
71 | \end{aligned}
72 | $$
73 |
74 | The book says $w_0^h$ is inherited from the previous episode, and the weights finally needed (the final weight values) are $w_T^T$. My question: `why compute anything at h=1, h=2, ...? Why not compute only at h=T? After all, I notice the h=T formulas do not seem to use the variables w^{T-1} produced at h=T-1.`
75 |
76 | Later I answered it myself: the `redo-updates online lambda-return algorithm` still rests on the `truncated lambda-return` of formula 12.10 from 12.3. For example, computing $w_1^2$ uses $G_{0:2}^\lambda$, and computing $G_{0:2}^\lambda$ in turn needs value estimates built from weights that were produced during the previous horizon's pass; so the earlier horizons' computations are indeed used.
77 |
78 | ### 12.5 True online TD(lambda)
79 |
80 | `True online TD(lambda)` produces exactly the same results as (is mathematically equivalent to) the redo-updates algorithm, while being `truly online with small per-step cost`.
81 |
82 | Its eligibility trace is called the `dutch trace`.
83 |
84 | ### 12.6 Dutch traces in Monte Carlo learning
85 |
86 | This section proves the equivalence of the forward and backward views. [Question 3]
87 |
88 | ### 12.7 Sarsa(lambda)
89 |
90 | Seeing `Sarsa`, we know the discussion turns to `control`:
91 |
92 | that is, the machinery for estimating $v(s, w)$ is carried over to estimating $q(s, a, w)$.
93 |
94 | ### 12.8 Variable lambda and gamma
95 |
96 | This section draws no firm conclusion about controlling `variable lambda and gamma`, and I did not dig deeper.
97 |
98 | ### 12.9 Off-policy traces with control variates
99 |
100 | Having discussed the `on-policy` problem, we turn, as usual, to the `off-policy` problem.
101 |
102 | This section generalizes the `per-decision importance sampling with control variates` bootstrapping of Section 7.4 into a `lambda-return form with importance sampling`. I did not dig into the mathematical proof.
103 |
104 | Even so, the off-policy stability of semi-gradient methods remains, in the end, unsatisfying.
105 |
106 | ### 12.10 From Watkins's Q(lambda) to tree-backup TB(lambda)
107 |
108 | First comes Watkins's `Q(lambda)`, which has the advantage of both `using off-policy data` and `avoiding importance sampling`.
109 |
110 | There is also the `tree-backup(lambda)`, or `TB(lambda)`, method. Without a worked example, I did not dig deeper. [Question 4]
111 |
112 | ### 12.11 Stable off-policy methods with traces
113 |
114 | This section combines the gradient-TD and emphatic-TD ideas of 11.7 and 11.8. The `GTD(lambda)` algorithm introduces an extra vector $\textbf{v}$ used to store samples in a distributed way [Question 5].
115 |
116 | The `HTD(lambda)` algorithm can be seen as an extension of `TD(lambda)` that also incorporates `GTD(lambda)`.
117 |
118 | The `emphatic TD(lambda)` algorithm extends the one-step `emphatic-TD algorithm`; questions of convergence arise here too. [Question 6]
119 |
120 |
121 | ### 12.12 Implementation issues
122 |
123 | Real-world computation constantly involves a `trade-off between cost and accuracy`.
124 |
125 | On parallel machines this is easy to handle.
126 |
127 | On a conventional serial machine, many lambda/gamma terms are usually close to 0 and can be dropped.
128 |
129 | # Questions
130 |
131 | #### Question 1:
132 |
133 | The proof here, and exercise 12.5, should be subtle and intricate; I did not pursue them. The answer is presumably in the original paper.
134 |
135 | #### Question 2:
136 |
137 | Answered in `12.4 Redoing updates: the online lambda-return algorithm` above.
138 |
139 | #### Question 3:
140 |
141 | I did not work through the mathematical proof of `12.6 Dutch traces in Monte Carlo learning`.
142 |
143 | #### Question 4:
144 |
145 | For `12.10 From Watkins's Q(lambda) to tree-backup TB(lambda)`: when I revisit this topic, writing out the pseudocode would be worthwhile.
146 |
147 | #### Question 5:
148 |
149 | Not having understood it thoroughly, I am unsure whether "stores samples in a distributed way" is the right description of $\textbf{v}$.
150 |
151 | #### Question 6:
152 |
153 | The book does not say, and I do not know, how these algorithms differ in effectiveness and use cases. The original papers presumably compare them.
154 |
155 | Piper Liu
156 | 2020-3-20 00:08:12
157 |
--------------------------------------------------------------------------------
/mathematics/第13章:策略梯度方法.md:
--------------------------------------------------------------------------------
1 | > Preface: this note gives an overview of Chapter 13 of Reinforcement Learning (2nd ed.).
2 |
3 | Everything below reflects my personal understanding and may contain mistakes; feedback is welcome: piperliu@qq.com.
4 |
5 | Rewind to the very beginning: `Chapter 2, multi-armed bandits`. There we had not yet met the notion of `state`; we only explored for the optimal action in a stationary environment. Chapter 13 skips the `q(s,a)` mapping and works on `π(s,a)` directly, iterating its parameters. In Chapter 2 the policy was merely `π(a)`, the usage probability of each action; in Chapter 13 we add the concept of "state", so the action-selection probabilities differ across states. Which action to prefer can be modeled via a mapping of the state, e.g. `f(θx(s)) = π(s)`. As the chapter deepens, we find that `policy gradient` methods not only have `theoretical advantages` but also `handle continuous action spaces` and more; their logic even `meshes with how artificial neural network units operate`, and `using a neural network to extract features and build the mapping suits policy gradient methods well`.
6 |
7 | At last, the first 13 chapters are done! Chapter 13 feels a bit abrupt to me: it draws on the ideas of chapters 3-12 yet abandons the `q(s,a)`-centric methods we discussed for ten whole chapters. In any case, I can finally move to the engineering part and get a taste of DRL. **The tiny-beginner is becoming a beginner.**
8 |
9 | # Main text
10 |
11 | ### 13.1 Policy approximation and its advantages
12 |
13 | At the start, the discussion still concerns the `episodic case` and discrete state spaces.
14 |
15 | As in the multi-armed bandit chapter, we can build an `exponential soft-max` policy function:
16 |
17 | $$\pi(a|s,\theta) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}$$
18 |
19 | This form of policy parameterization is called `soft-max in action preferences`.
20 |
21 | The state space is, of course, no longer a constraint, because we have `parameterization`; e.g. $\theta$ can be the weight vector of a neural network, while in `soft-max in action preferences` we use a simple linear combination of features:
22 |
23 | $$h(s,a,\theta)=\theta^T x(s,a)$$
24 |
25 | This section also introduces the "short corridor with switched actions" task, handy for seeing how the basic algorithms behave; code at: [https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Short-Corridor.ipynb](https://nbviewer.jupyter.org/github/PiperLiu/Reinforcement-Learning-practice-zh/blob/master/practice/Short-Corridor.ipynb). Note that the policy model there is deliberately crude: it still ignores the state and only learns one global probability for "move left" vs. "move right".
26 |
27 | ### 13.2 The policy gradient theorem
28 |
29 | This section turns to the `theoretical advantages` of `policy gradients`.
30 |
31 | "Methods based on the policy gradient enjoy stronger convergence guarantees than methods based on action values."
32 |
33 | In the `episodic` case, the performance measure is:
34 |
35 | $$J(\theta)=v_{\pi_\theta}(s_0)$$
36 |
37 | The policy gradient theorem says the gradient of this performance measure is proportional to a sum of exactly the kind of terms our earlier update rules move along:
38 |
39 | $$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi (s,a) \nabla \pi(a|s,\theta)$$
40 |
41 | The book gives a rigorous mathematical proof.
42 |
43 | ### 13.3 REINFORCE: Monte Carlo policy gradient
44 |
45 | Williams proposed the classic `iterative` REINFORCE algorithm, which `updates on one action per step`.
46 |
47 | With the ascent direction $\nabla J(\theta)$ in hand, deriving and simplifying it mathematically yields the update rule (a sketch follows):
48 |
49 | $$\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}$$
50 |
51 | where $\frac{\nabla \pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}$ equals $\nabla \ln\pi(A_t|S_t,\theta_t)$, which the book calls the eligibility vector.
52 |
53 | Note that the algorithm here is still Monte Carlo: it can only update `after each episode ends`, i.e. it is `offline` in that sense.
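A minimal sketch of this update over one finished episode, assuming a linear soft-max policy as in 13.1 (the feature function, action count, and step size are illustrative):

```python
import numpy as np

def reinforce_episode(theta, features, episode, n_actions=2, alpha=2e-4, gamma=1.0):
    """One REINFORCE pass over a finished episode.

    episode  -- list of (state, action, reward) tuples
    features -- features(s, a) -> feature vector x(s, a) of the soft-max policy
    """
    for t, (s, a, _) in enumerate(episode):
        # return G_t: discounted sum of rewards from time t onward
        G = sum(gamma ** (k - t) * r for k, (_, _, r) in enumerate(episode) if k >= t)
        # soft-max in action preferences, h(s, a, theta) = theta @ x(s, a)
        prefs = np.array([theta @ features(s, b) for b in range(n_actions)])
        pi = np.exp(prefs - prefs.max())
        pi /= pi.sum()
        # eligibility vector: grad ln pi(a|s) = x(s,a) - sum_b pi(b) x(s,b)
        elig = features(s, a) - sum(pi[b] * features(s, b) for b in range(n_actions))
        theta += alpha * (gamma ** t) * G * elig
    return theta
```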
54 |
55 | ### 13.4 REINFORCE with a baseline
56 |
57 | We discussed this back in the multi-armed bandit chapter: adding a baseline (there called the "reference term") does not change what the method converges to, because within a state all action probabilities sum to the constant 1, whose derivative is 0.
58 |
59 | Adding a baseline does, however, speed convergence up (it helps the convergence process). As the book puts it, the baseline reduces the variance.
60 |
61 | The baseline's value can be anything, but for a given state $s$ it must be the same across actions; better yet, `the baseline should vary as the state varies`.
62 |
63 | The book's intuition: "In some states all actions have high values and we need a high baseline to differentiate the higher valued actions from the less highly valued ones; in other states all actions will have low values and a low baseline is appropriate."
64 |
65 | So $\hat{v}(S_t, w)$ is a very natural baseline. The book gives pseudocode that updates $\theta$ and $w$ together.
66 |
67 | ### 13.5 Actor-critic methods
68 |
69 | Note that all the methods discussed so far are Monte Carlo, i.e. offline in the above sense.
70 |
71 | Now we turn to `online` methods.
72 |
73 | We need to introduce a "critic" to obtain a quantity that can replace REINFORCE's "whole return";
74 |
75 | that is, $G_t$ is replaced by:
76 |
77 | $$G_{t:t+1} - \hat{v} (S_t,w)=R_{t+1} + \gamma\hat{v} (S_{t+1},w) - \hat{v} (S_t,w)$$
78 |
79 | Moreover, since this involves `turning a forward view into a backward view`, the book also gives pseudocode with `eligibility traces`; there are `two traces`, one for $w$ and one for $\theta$. A one-step sketch (without traces) follows.
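A minimal sketch of the resulting one-step actor-critic update without traces, assuming a linear critic (all names and step sizes are illustrative):

```python
import numpy as np

def actor_critic_step(theta, w, x_s, x_s_next, r, elig,
                      alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    """One-step actor-critic update (no traces), with a linear critic.

    x_s, x_s_next -- state feature vectors; the critic's value is w @ x
    elig          -- eligibility vector grad ln pi(A|S,theta) of the taken action
    """
    delta = r + gamma * (w @ x_s_next) - (w @ x_s)  # one-step TD error from the critic
    w = w + alpha_w * delta * x_s                   # critic: semi-gradient TD(0)
    theta = theta + alpha_theta * delta * elig      # actor: policy-gradient step
    return theta, w
```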
80 |
81 | ### 13.6 Policy gradient for continuing problems
82 |
83 | The policy gradient theorem holds for continuing problems as well.
84 |
85 | Only the performance measure must change accordingly:
86 |
87 | $$J(\theta) = r(\pi)$$
88 |
89 | Under the ergodicity assumption, a rigorous mathematical proof again goes through.
90 |
91 | I summarized the differences between the `episodic` and `continuing` versions of the `actor-critic with eligibility traces`; there are only two:
92 | - the continuing version has no discounting;
93 | - in the continuing version the first term of $\delta$ uses the differential reward $R-\bar{R}$.
94 |
95 | ### 13.7 Policy parameterization for continuous actions
96 |
97 | For continuous actions the section gives one worked pattern: parameterize the policy with a normal probability distribution. I will not repeat it here.
98 |
99 | The exercises are all worthwhile.
100 |
101 | 2020-3-22 00:48:13
102 | Piper Liu
103 |
--------------------------------------------------------------------------------
/mathematics/第9章:基于函数逼近的同轨策略预测.md:
--------------------------------------------------------------------------------
1 | > **Preface:** this note gives an overview of Chapter 9 of Reinforcement Learning (2nd ed.).
2 |
3 | *Everything below reflects my personal understanding and may contain mistakes; feedback is welcome: piperliu@qq.com.*
4 |
5 | # Main text
6 |
7 | ### Introduction
8 |
9 | The "tabular methods" of the first eight chapters share one problem: they cannot represent arbitrarily large state spaces, e.g. continuous states.
10 |
11 | Moreover, in continuous state spaces we believe the numbers describing a state are quantitatively related to its value, which means we can describe $v$ with $v(s,w)$ instead of building an $s \rightarrow v$ table.
12 |
13 | Here $w$ is a vector (the `weight vector`). Inferring the sought states' behavior from known states is `generalization`, usually done with `function approximation` (the realm of `supervised learning`).
14 |
15 | ### Value-function approximation
16 |
17 | Clearly, "on-policy prediction with function approximation" amounts to updating $w$.
18 |
19 | One way reinforcement learning differs from supervised learning, though, is its emphasis on **online learning and interaction with the environment**.
20 |
21 | This means the algorithm must also be able to handle nonstationary situations.
22 |
23 | ### The prediction objective ($\overline{VE}$)
24 |
25 | In tabular learning, learning the value function was `decoupled` across states: updating one state did not affect the others. With function approximation, this no longer necessarily holds.
26 |
27 | We use the `mean squared value error` as the objective function. The ultimate goal of reinforcement learning is a better policy, so this objective is not necessarily the best, but for now it serves:
28 |
29 | $$\overline{VE} (w) = \sum_{s \in S} \mu(s) [v_\pi (s) - \hat{v} (s,w)]^2$$
30 |
31 | where $\mu(s)$ is the state distribution. This makes sense: errors in frequently visited states weigh more.
32 |
33 | ### Stochastic-gradient and semi-gradient methods
34 |
35 | `Gradient descent` is basic and intuitive; no introduction needed here.
36 |
37 | Stochastic gradient descent (SGD) is called "stochastic" because each of its updates relies on a single, randomly drawn example.
38 |
39 | Its update rule:
40 |
41 | $$w \leftarrow w + \alpha [G_t - \hat{v}(S_t, w) ]\nabla \hat{v} (S_t,w)$$
42 |
43 | This rule is central. The "gradient Monte Carlo" pseudocode of Section 9.3 uses it, and the value estimated with it is unbiased.
44 |
45 | **The "semi-gradient" estimate, by contrast, is biased.**
46 |
47 | Section 9.3's "semi-gradient TD(0)" uses the update:
48 |
49 | $$w \leftarrow w + \alpha [R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) ]\nabla \hat{v} (S_t,w)$$
50 |
51 | Why "semi"-gradient:
52 | - the update target $\hat{v}(S_{t+1}, w)$ itself depends on $w$, yet its gradient is ignored, so this is not true gradient descent;
53 | - the resulting estimate is biased.
54 |
55 | Semi-gradient advantages (a minimal sketch follows):
56 | - it is fast;
57 | - it can learn online and continually, without waiting for an episode to end;
58 | - in the linear case it still converges robustly, **and the linear case is extremely common.**
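A minimal sketch of the semi-gradient TD(0) update, assuming linear features (names and step sizes are illustrative):

```python
import numpy as np

def semi_gradient_td0(w, x, r, x_next, alpha=0.05, gamma=1.0):
    """One semi-gradient TD(0) update with linear features, v-hat(s,w) = w @ x(s).

    The bootstrap target r + gamma * w @ x_next is treated as a constant:
    only grad v-hat(S_t, w) = x enters the update, hence "semi"-gradient.
    """
    delta = r + gamma * (w @ x_next) - (w @ x)
    w += alpha * delta * x
    return w
```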
59 |
60 | ### Linear methods
61 |
62 | In linear methods $\hat{v}(s,w) = w^T x(s)=\sum_{i=1}^d w_i x_i (s)$, where the vector $x(s)$ is called the feature vector.
63 |
64 | For linear TD(0), prediction can be proved to converge, namely to the `TD fixed point`:
65 |
66 | $$w_{TD} = A^{-1} b$$
67 |
68 | $$A = \mathbb{E} [x_t (x_t - \gamma x_{t+1})^T] \in \mathbb{R}^{d \times d}$$
69 |
70 | $$b=\mathbb{E} [R_{t+1}x_t] \in \mathbb{R}^{d}$$
71 |
72 |
73 | This matters for the "least squares" method later on.
74 |
75 | **More interesting still: the tabular methods of Part I are in fact a special case of linear function approximation, in which the feature vector is the one-hot indicator of the state, so the active component of $\nabla \hat{v} (S_t,w)$ is 1.**
76 |
77 | Going further: in the tabular case, $w(S_t)$ simply is $\hat{v} (S_t,w)$ itself.
78 |
79 | ### Feature construction for linear methods
80 |
81 | Many fun techniques here, e.g. `polynomial bases`, `Fourier bases`, `coarse coding`, and `tile coding` (an instance of coarse coding).
82 |
83 | As I see it, the first two are "feature engineering": letting a linear computation capture nonlinear (interaction) structure between state variables.
84 |
85 | The latter two resemble aggregation, but with multiple aggregation "grids", so one state can receive different "codes" in different tilings.
86 |
87 | Tile-coding implementations can use hash coding to save time and space, though the book gives no concrete example.
88 |
89 | There are also `radial basis functions`.
90 |
91 | They treat features somewhat like a Gaussian, using the notion of a `norm`/`distance metric`; but the norm need not measure distance, and other rules can be set.
92 |
93 | ### Selecting step-size parameters manually
94 |
95 | The book suggests setting the linear-SGD step size to $\alpha = (\tau \mathbb{E}[x^T x])^{-1}$, where $\tau$ is the number of experiences within which learning should be roughly complete.
96 |
97 | ### Nonlinear function approximation: artificial neural networks
98 |
99 | ANNs are ubiquitous in DL; the book only gives a descriptive introduction.
100 |
101 | A few names worth noting:
102 | - deep belief networks;
103 | - batch normalization;
104 | - deep residual learning.
105 |
106 | ### Least-squares TD
107 |
108 | As mentioned above, computing with the fixed point is mathematically effective (provably sound), and it uses the data fully.
109 |
110 | Least-squares TD transforms the A and b formulas, adding a term $\epsilon I$ to A so that it is always invertible.
111 |
112 | This turns a batch statistical formula into an iteratively updatable online one.
113 |
114 | ### Memory-based function approximation
115 |
116 | Here the book discusses `lazy learning`: do not update and apply parameters immediately; instead, when a state's value is needed, look up related data (states held in memory) and compute an estimate from them.
117 |
118 | Common examples: `nearest neighbors`, `weighted averaging`, and the like.
119 |
120 | This performs local learning effectively, but query speed becomes a problem, so `special hardware`, `parallel computation`, and suitable `data structures` are options for addressing it.
121 |
122 | ### Kernel-based function approximation
123 |
124 | `Radial basis functions` already touched the concept of distance; a `kernel function` determines how that `distance` is mapped. A small sketch follows.
125 |
126 | With $D$ a set of examples and $g(s')$ the target for state $s'$, the formula $\hat{v}(s,D) = \sum_{s' \in D}k(s,s')g(s')$ shows that the kernel determines the weights (and whether a term is computed at all: to keep computation local, the kernel maps unrelated state pairs to 0).
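This sketch uses a Gaussian (RBF) kernel and normalizes the weights so the estimate becomes a weighted average; both choices are mine, for illustration:

```python
import numpy as np

def kernel_value(s, D, g, bandwidth=1.0):
    """Kernel-based value estimate: v-hat(s, D) = sum_{s'} k(s, s') g(s')."""
    k = np.exp(-np.linalg.norm(D - s, axis=1) ** 2 / (2 * bandwidth ** 2))
    k /= k.sum()  # normalize so the estimate is a weighted average
    return k @ g

D = np.array([[0.0], [1.0], [2.0]])   # made-up 1-D states held in memory
g = np.array([0.0, 1.0, 4.0])         # their observed targets
print(kernel_value(np.array([1.5]), D, g))  # dominated by the nearby neighbors
```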
127 |
128 | ### Looking deeper at on-policy learning: "interest" and "emphasis"
129 |
130 | The idea stays the same: keep learning well-targeted, i.e. exploit the notion of "local learning".
131 |
132 | That is, states should not all be treated as equals, so the update rules gain an `emphasis` value and an `interest` value, themselves computed recursively.
133 |
134 | 2020-3-9 00:10:21
135 | Piper Liu
--------------------------------------------------------------------------------
/mathematics/策略改进(Policy Improvement)使策略更优的数学证明.md:
--------------------------------------------------------------------------------
1 | > **Preface:** Section 4.2 of Sutton's Reinforcement Learning (2nd ed.), "Policy Improvement", improves a policy greedily. Why must a policy updated by the greedy rule (which picks the current action by looking only at one successor state) be better than the original policy? The book does argue this, but not prominently. Here I reorganize the argument, straightening its order and logic.
2 |
3 | #### Definition: policy A is better than policy B
4 |
5 | If $\pi$ and $\pi '$ are two deterministic policies such that for every $s \in S$:
6 |
7 | $$q_\pi (s, \pi ' (s)) \ge v_\pi (s)$$
8 |
9 | then we say $\pi '$ is better than $\pi$. *Here $q$ is the action-value (an expectation) and $v$ is the state-value.*
10 |
11 | #### A better policy means higher value
12 |
13 | Expanding the inequality $q_\pi (s, \pi ' (s)) \ge v_\pi (s)$:
14 |
15 | $$
16 | \begin{aligned}
17 | v_\pi (s) & \le q_\pi (s, \pi ' (s)) \\
18 | & ... \\
19 | & \le \mathbb{E}_{\pi '} [R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s] \\
20 | & = v_{\pi '} (s)
21 | \end{aligned}
22 | $$
23 |
24 | From this it follows that, in a state $s$, choosing the action of higher expected value (choosing the better policy) **raises the state's value.**
25 |
26 | **tips:** the derivation above omits steps; its core is to apply the definition of $q$ and expand $q$ iteratively:
27 |
28 | $$
29 | \begin{aligned}
30 | q_{\pi} (s, \pi' (s)) & = \mathbb{E} [R_{t+1} + \gamma v_\pi (S_{t+1} )| S_t = s, A_t = \pi' (s)]\\
31 | & = \mathbb{E}_{\pi '} [R_{t+1} + \gamma v_\pi (S_{t+1} )| S_t = s] \\
32 | & \le \mathbb{E}_{\pi '} [R_{t+1} + \gamma q_{\pi} (S_{t+1}, \pi' (S_{t+1}))| S_t = s] \\
33 | & ... \\
34 | \end{aligned}
35 | $$
36 |
37 | #### Policy improvement: constructing a better policy greedily
38 |
39 | In every state, pick an optimal action according to $q_\pi (s, a)$; in other words, consider the new greedy policy $\pi '$ satisfying (a code sketch follows):
40 |
41 | $$\begin{aligned}
42 | \pi' (s) & = \arg\max_a q_\pi (s, a) \\
43 | & = \arg\max_a \mathbb{E} [R_{t+1} + \gamma v_\pi (S_{t+1} )| S_t = s, A_t = a] \\
44 | & = \arg\max_a \left\{ \sum_{s', r} p(s', r | s, a)[r + \gamma v_\pi (s')] \right\}
45 | \end{aligned}$$
46 |
47 | This greedy policy takes the action that looks best in the short term: a one-step search based on $v_{\pi}$.
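A minimal sketch of one improvement sweep for a small finite MDP, assuming the dynamics are given as arrays (shapes and names are illustrative):

```python
import numpy as np

def greedy_improvement(v, p, r, gamma=0.9):
    """One policy-improvement sweep over a finite MDP.

    v -- current state values under pi, shape (S,)
    p -- transition probabilities, shape (S, A, S), p[s, a, s']
    r -- expected rewards, shape (S, A)
    Returns the greedy policy pi'(s) = argmax_a E[r + gamma * v(s')].
    """
    q = r + gamma * np.einsum('sat,t->sa', p, v)  # one-step lookahead action values
    return q.argmax(axis=1)
```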
48 |
49 | #### Proof by contradiction: policy improvement is "effective"
50 |
51 | We know the greedy policy rests on $q_\pi (s, \pi ' (s)) \ge v_\pi (s)$, which guarantees $v_{\pi '} (s) \ge v_\pi (s)$.
52 |
53 | But could the greedy step ever be vacuous, i.e. could some update produce $v_{\pi '} (s) = v_\pi (s)$ (making that update ineffective)?
54 |
55 | In fact, that cannot happen unless we are done. **Proof by contradiction:**
56 |
57 | If the new policy $\pi'$ only ties the old policy $\pi$, i.e. $v_{\pi '} (s) = v_\pi (s)$, then by the greedy construction above, for every $s \in S$:
58 |
59 | $$\begin{aligned}
60 | v_{\pi '}(s) & = \max_a \mathbb{E} [R_{t+1} + \gamma v_{\pi '} (S_{t+1} )| S_t = s, A_t = a] \\
61 | & = \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_{\pi '} (s')]
62 | \end{aligned}$$
63 |
64 | But this is exactly the system of Bellman optimality equations, so the solution $v_{\pi '}$ must be the optimal $v_{*}$.
65 |
66 | Therefore, if the new policy $\pi'$ only ties the old policy $\pi$ ($v_{\pi '} (s) = v_\pi (s)$), we must already have iterated to an optimal policy; otherwise, policy improvement is guaranteed to produce a strictly better result.
67 |
68 | What we covered here lays the theoretical foundation for later engineering work: updating the policy according to the Bellman equation is provably effective.
69 |
--------------------------------------------------------------------------------
/mathematics/表格型方法总结.md:
--------------------------------------------------------------------------------
1 | > **Preface:** this note gives an overview of Chapter 8 of Reinforcement Learning (2nd ed.), and at the same time a systematic account of the book's Part I (of three): tabular solution methods.
2 |
3 | *Everything below reflects my personal understanding and may contain mistakes; feedback is welcome: piperliu@qq.com.*
4 |
5 | ****
6 |
7 | ### I. A map of the tabular solution methods
8 |
9 | #### Summary of chapters 2 through 7
10 |
11 | Finishing Chapter 8 marks the completion of the book's Part I, **Tabular Solution Methods**. What has been learned so far:
12 | - **Multi-armed bandits:** in a stationary single-state environment, how to estimate each action's value and perform control (choose the next action: explore (exploration) or exploit (exploitation), and on what basis to exploit);
13 | - **Finite Markov decision processes:** how to model environment, actions, and rewards (how to abstract the problem at hand into a mathematical expression)? What is an episode? What are the basic elements of reinforcement learning?
14 | - **Dynamic programming:** with a known environment (meaning: the state-transition probabilities are known and the environment is stationary; or put differently, state transitions can be described by probabilities), how to evaluate the value $v_\pi(s)$ of a state under some policy, and the value $q_\pi(s,a)$ of choosing an action in a state under that policy? And how to approach the environment's optimal policy by iteration (obtaining the optimal policy is called control)?
15 | - **Monte Carlo methods:** with an unknown environment, information can be obtained by interacting with it. How, under these conditions, to estimate $v_\pi(s)$ and $q_\pi(s,a)$? Monte Carlo methods "learn" from the data of many episodes. If the actions taken in the environment follow the policy $\pi$ being evaluated, the method is called on-policy; otherwise, off-policy. Off-policy control requires reweighting the interaction data, generally with the importance-sampling ratio.
16 | - **Temporal-difference learning:** Monte Carlo methods must wait until an episode ends to use its data. Is there a method that updates the value estimates immediately, without waiting for episode end? TD learning gives the answer. This raises a question: to update the current state/action value we need the next state's value; how to estimate it? Sarsa, Q-Learning, and Expected Sarsa provide references. Control is again discussed on- and off-policy. And against the backdrop of possible maximization bias, double learning is introduced as a remedy; its expectation is unbiased.
17 | - **n-step bootstrapping:** Monte Carlo waits for episode end, TD updates immediately, but evidence shows that methods in between beat both: not "infinitely many steps until episode end" and not "1 step", but n steps works best. This involves several nodes, each backed up by some rule to evaluate the current value: importance sampling, tree backup, importance sampling plus expectation, alternating expectation with sampling, and so on.
18 |
19 | Note that the last three chapters (**Monte Carlo methods, TD learning, n-step bootstrapping**) are not based on planning: they **need no model of the environment** and learn purely from the environment's inputs and outputs.
20 |
21 | **Chapter 8, however, returns to planning.**
22 |
23 | #### Chapter 8 and its relation to chapters 2-7
24 |
25 | Although a model (and planning) is used, Chapter 8's requirements are far milder than those of Chapter 4's **dynamic programming**:
26 | - a complete model of the environment is not needed; I can simulate an environment myself and learn each state's/action's value by feeding inputs to, and reading outputs from, the "simulated system";
27 | - given the (simulated) environment, one **need not use classical dynamic programming's update (the expected update)**, which weights and sums all child nodes (taking the parent's expectation): the environment may be complex, and many states are useless (no "sensible" policy ever reaches them), so there is no need to sweep them or evaluate their values **(even under planning, sample updates are used to reduce computation)**;
28 | - the chapter therefore proposes many interesting, useful methods: **prioritized sweeping, where an update's priority depends on its effect (if taking the step would barely change the value, there is no need to consider the update)**, **trajectory-based sampling**, **real-time dynamic programming (updating only experienced states; on-policy, sample-update dynamic programming)**, **heuristic search / decision-time planning (focusing on the current state/decision, computing values of the various possible continuations without storing them, a heuristic search)**, **rollout algorithms**, and **Monte Carlo tree search**.
29 |
30 | So Chapter 8 can be seen as:
31 | - using dynamic programming's "planning" idea;
32 | - or rather, Dyna-Q couples planning with temporal-difference learning;
33 | - but planning need not use "expected updates", and many states can be ignored;
34 | - and the update can exploit "planning" by rolling out future states solely to evaluate the current state/decision (the rolled-out states' values need not be stored).
35 |
36 | ****
37 |
38 | ### II. Section-by-section overview of Chapter 8, planning and learning with tabular methods
39 |
40 | > The notes portion of this project served as a reference:
41 | > [https://github.com/brynhayder/reinforcement_learning_an_introduction](https://github.com/brynhayder/reinforcement_learning_an_introduction)
42 |
43 | #### 8.1 Models and Planning
44 |
45 | Modeling the environment means helping the agent predict the environment's response to an action. Models come in two kinds:
46 | - distribution models, which return the probability distribution of responses;
47 | - sample models, which return one concrete response.
48 |
49 | Models are used to simulate the environment.
50 |
51 | #### 8.2 Dyna: Integrated Planning, Acting and Learning
52 |
53 | ![Dyna-Q](../practice/images/figure_8_5.png)
54 |
55 | As the figure shows, the Dyna-Q algorithm combines Q-Learning with planning (a sketch follows):
56 | - after the ordinary Q update finishes, the model is updated;
57 | - then, from the current model, n planning iterations update previously encountered $Q(S, A)$ pairs.
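A minimal sketch of one Dyna-Q step, assuming a deterministic environment and a small discrete action set (the data structures and constants are illustrative):

```python
import random
from collections import defaultdict

ACTIONS = [0, 1, 2, 3]  # e.g. up/down/left/right in a maze

def dyna_q_step(Q, model, s, a, r, s_next, n=10, alpha=0.1, gamma=0.95):
    """One Dyna-Q step: direct RL update, model update, then n planning updates.

    Q     -- defaultdict(float) keyed by (state, action)
    model -- dict keyed by (state, action) -> (reward, next_state),
             i.e. a sample model of a deterministic environment
    """
    best = lambda sx: max(Q[(sx, b)] for b in ACTIONS)
    # (a) direct Q-Learning update from real experience
    Q[(s, a)] += alpha * (r + gamma * best(s_next) - Q[(s, a)])
    # (b) model learning: remember the last observed outcome
    model[(s, a)] = (r, s_next)
    # (c) planning: replay n randomly chosen remembered transitions
    for _ in range(n):
        (ps, pa), (pr, pn) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * best(pn) - Q[(ps, pa)])
    return Q, model
```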
58 |
59 | #### 8.3 When the Model is Wrong
60 |
61 | Insufficient sampling, or getting stuck in a locally suboptimal solution, biases the model. Dyna-Q+ addresses this with a bonus that encourages the model to try actions it has not taken for a while.
62 |
63 | #### 8.4 Prioritized Sweeping
64 |
65 | Many states are irrelevant to the optimal policy; put differently, approaching the optimal policy does not require sampling the irrelevant states.
66 |
67 | By analogy: Xiao Ming sets out from Shenyang for Chengdu and is asked to find the shortest route; he may travel back and forth several times, perhaps through Beijing, perhaps through Xi'an, but there is certainly no reason to visit Tokyo first and then Chengdu. The state "arrive in Tokyo" is irrelevant to our goal of "the shortest path from Shenyang to Chengdu".
68 |
69 | Prioritized sweeping was proposed to filter out such useless samples. For instance, with Q-Learning in a deterministic environment, only if
70 |
71 | $$|R+\gamma \max_a Q(S',a) - Q(S, A)| > \theta$$
72 |
73 | is the priority P of the corresponding state put into the queue PQueue for updating (a sketch follows).
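A minimal sketch of that enqueue test, assuming a deterministic environment and a dict-of-dicts Q table (names and the threshold are illustrative):

```python
import heapq

def maybe_enqueue(pqueue, Q, s, a, r, s_next, theta=1e-4, gamma=0.95):
    """Push (s, a) into the priority queue if its potential update is big enough.

    Q is a dict: Q[state][action] -> value. heapq is a min-heap, so the
    priority is stored negated to pop the largest correction first.
    """
    P = abs(r + gamma * max(Q[s_next].values()) - Q[s][a])
    if P > theta:
        heapq.heappush(pqueue, (-P, (s, a)))
```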
74 |
75 | #### 8.5 Expected vs. Sample Updates
76 |
77 | With a distribution model, or with very many successor branches, an expected update costs too much: in $\sum_{i}^{n_{s'}} p_i q_i$ the number of successors $n_{s'}$ is too large. Sample updates are therefore used in place of expected updates.
78 |
79 | Evidence shows that with a large computation budget / many iterations, sample updates perform no worse than expected updates.
80 |
81 | ![](../practice/images/07-04.png)
82 |
83 | This rather incisive figure from the book is three-dimensional:
84 | - dimension one: **current policy vs. optimal policy**; my reading: since in on-policy methods the policy, and with it the sampling, can shift as iteration proceeds, "current policy" estimates are mostly used on-policy, while estimates under the optimal policy are mostly used off-policy;
85 | - dimension two: **state values vs. state-action values**, the former mostly for value evaluation, the latter mostly for control;
86 | - dimension three: **expected vs. sample updates**.
87 |
88 | #### 8.6 Trajectory Sampling
89 |
90 | Sampling from the uniform distribution can introduce bias, since many "samples" would in fact never occur. Sampling on-policy during planning brings "fast early convergence".
91 |
92 | #### 8.7 Real-time Dynamic Programming
93 |
94 | Real-time DP is a special kind of value iteration: unlike classical dynamic programming, it does not use expected updates but sample updates, sampling trajectories under the on-policy distribution.
95 |
96 | #### 8.8 Planning at Decision Time
97 |
98 | What is planning at decision time?
99 |
100 | The algorithms mentioned so far all plan from accumulated experience (background planning); decision-time planning instead considers the interaction ahead, i.e. it simulates the samples obtainable after a contemplated action (possibly considering many steps).
101 |
102 | #### 8.9 Heuristic Search
103 |
104 | As I see it: "roll out" the candidate actions into a "decision tree", and back up its branches in some order (depth-first), in an unconcentrated fashion. This often produces better decisions than "concentrated backup updates".
105 |
106 | #### 8.10 Rollout Algorithms
107 |
108 | Rollout algorithms are decision-time planning based on Monte Carlo control, sampling via simulated trajectories.
109 |
110 | A rollout algorithm:
111 | - starts from some state;
112 | - simulates under a policy (the rollout policy) and evaluates values;
113 | - picks the action of highest simulated value, and so on.
114 |
115 | Rollout algorithms serve to improve the rollout policy's performance, not to find the optimal policy.
116 |
117 | #### 8.11 Monte Carlo Tree Search
118 |
119 | MCTS is the culmination of decision-time planning and rollout algorithms: rollout provides the frame for value estimation, and during simulation Monte Carlo simulation guides the search. AlphaGo used this technique.
120 |
121 | MCTS can be summed up in four steps:
122 | - Selection: following the tree policy (which weighs the action values at the tree's frontier), select a leaf node;
123 | - Expansion: take non-exploratory actions from the selected node, adding child nodes for it;
124 | - Simulation: from the leaf (or newly added leaf), simulate one whole episode under the rollout policy; within the tree, actions follow the Monte Carlo tree's policy, outside it, the rollout policy;
125 | - Backup: in this update, propagate the trajectory's return upward; states and actions outside the tree are not retained.
126 |
127 | ![](../practice/images/07-05.png)
128 |
129 | As I understand the figure, the four MCTS steps amount to:
130 | - **a process of gradually expanding the tree:** the tree itself embodies a policy, but before the first update the tree does not exist; with each update (one round of the four steps) the tree grows a little (one leaf, or several, it depends);
131 | - **a process of gradually refining the tree:** during simulation, states inside the tree follow the tree policy, and once outside the tree, the rollout policy; backing the results up **makes the tree ever more robust**;
132 | - simulating the out-of-tree policy with Monte Carlo control (simulating to episode end) presumably makes the returns more accurate and the interaction with the environment fuller;
133 | - it is like a strong Go player reading several moves ahead: if I play here, how will the opponent answer, and how do I answer that...
134 | - note that this picture can mislead beginners (such as me this morning): **the state we face during learning is not necessarily a frontier or root node of the tree; after many rounds of learning (after the machine has played many games), at a fresh opening the tree is already comprehensive and robust.**
135 |
136 | ### Wrapping up reinforcement learning part I / tabular solution methods
137 |
138 | All the reinforcement learning methods in Sutton's book share three important ideas:
139 | 1. estimate value functions;
140 | 2. back up value estimates along real or simulated state trajectories;
141 | 3. follow the general flow of generalized policy iteration (GPI): maintain an approximate value function and an approximate policy, and continually improve each on the basis of the other.
142 |
143 | ****
144 |
145 | Looking forward to what comes next.
146 |
147 | *Piper Liu*
148 |
149 | *2020-1-31 23:38:22*
150 |
--------------------------------------------------------------------------------
/open_lRLwp_jupyter.bat:
--------------------------------------------------------------------------------
1 | CALL D:\Anaconda3\Scripts\activate.bat D:\Anaconda3
2 | d:
3 | cd D:\GitHub\Reinforcement-Learning-practice-zh
4 | jupyter notebook
--------------------------------------------------------------------------------
/open_vscode_project.bat:
--------------------------------------------------------------------------------
1 | code .
--------------------------------------------------------------------------------
/practice/02-MDP-and-Bellman-Equation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "2020-1-11 19:57:40\n",
8 | "刚刚读完《强化学习(第二版)》第三章:有限马尔科夫决策过程,印象深刻的有:\n",
9 | "- 贝尔曼方程。"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {
15 | "ExecuteTime": {
16 | "end_time": "2020-01-11T11:59:45.286758Z",
17 | "start_time": "2020-01-11T11:59:45.282797Z"
18 | }
19 | },
20 | "source": [
21 | "### 贝尔曼方程是什么\n",
22 | "\n",
23 | "$$\\begin{aligned} \n",
24 | "v_\\pi (s) = & \\mathbb{E}_\\pi [G_t | S_t = s] \\\\\n",
25 | "= & \\mathbb{E}_{\\pi} [R_{t+1} + \\gamma G_{t+1} | S_t = s] \\\\\n",
26 | "= & \\sum_a \\pi(a|s) \\sum_{s'} \\sum_{r} p(s', r| s,a) \\left[ r + \\gamma \\mathbb{E}_\\pi [G_{t+1} | S_{t+1} = s'] \\right] \\\\\n",
27 | "= &\\sum_a \\pi(a|s) \\sum_{s',r} p(s',r|s,a)[r + \\gamma v_\\pi (s')] \\quad for \\; all \\; s \\in S\n",
28 | "\\end{aligned}$$\n",
29 | "\n",
30 | "各符号意义:\n",
31 | "- 上图中,$v_\\pi(s)$表示在状态s下的,使用策略集$\\pi$的价值;\n",
32 | "- $G_t$就是在当前时刻$t$所产生的“回报”,在有限时刻中,通常引入折扣率$\\gamma$的概念,将$G_t$定义为$G_t = R_{t+1} + \\gamma G_{t+1}$,表示下一步对当前决策影响最大,时间越远,影响越小;\n",
33 | "- $\\pi(a|s)$是策略,在我看来就是在状态$s$下选择动作$a$的概率;\n",
34 | "- $p()$是状态转移概率,$r$是回报。\n",
35 | "\n",
36 | "对这个方程我还有个通俗的解释,请见[https://blog.csdn.net/weixin_42815609/article/details/103934891](https://blog.csdn.net/weixin_42815609/article/details/103934891)。"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "### 贝尔曼方程用处\n",
44 | "\n",
45 | "其实只在理论上有用(目前在我看来)。\n",
46 | "\n",
47 | "#### 贝尔曼方程组\n",
48 | "\n",
49 | "$$\n",
50 | "\\begin{aligned}\n",
51 | "& v(s_1) = f(v(s_1), v(s_2), ..., v(s_n)) \\\\\n",
52 | "& v(s_2) = f(v(s_1), v(s_2), ..., v(s_n)) \\\\\n",
53 | "& ... \\\\\n",
54 | "& v(s_n) = f(v(s_1), v(s_2), ..., v(s_n)) \\\\\n",
55 | "\\end{aligned}\n",
56 | "$$\n",
57 | "\n",
58 | "可见,这构造了一个关于$v(s_i)$的n元1次方程组,可以求解每个状态的价值。\n",
59 | "\n",
60 | "当然,这里$v_\\pi (s)$简写成了$v(s)$,我们知道每个状态的价值是由策略决定的,策略糟糕,价值低。\n",
61 | "\n",
62 | "#### 贝尔曼最优方程\n",
63 | "\n",
64 | "最优方程说明:最优策略下各个状态的价值一定等于这个状态下最优动作的期望回报。\n",
65 | "\n",
66 | "假设只有2个状态($s_1$与$s_2$),对于状态$s_1$,其最优价值:\n",
67 | "\n",
68 | "$$v_* (s_1) = =\\max \\left\\{ \\begin{aligned}\n",
69 | "& p(s_1 | s_1, a_1) [r(s_1, a_1, s_1) + \\gamma v_* (s_1)] + \n",
70 | "p(s_2 | s_1, a_1) [r(s_2, a_1, s_1) + \\gamma v_* (s_2)] \\\\\n",
71 | "& p(s_1 | s_1, a_2) [r(s_1, a_2, s_1) + \\gamma v_* (s_1)] + \n",
72 | "p(s_2 | s_1, a_2) [r(s_2, a_2, s_1) + \\gamma v_* (s_2)] \\\\\n",
73 | "& ... \\\\\n",
74 | "& p(s_1 | s_1, a_n) [r(s_1, a_n, s_1) + \\gamma v_* (s_1)] + \n",
75 | "p(s_2 | s_1, a_n) [r(s_2, a_n, s_1) + \\gamma v_* (s_2)] \\\\\n",
76 | "\\end{aligned} \\right\\}$$\n",
77 | "\n",
78 | "如上,是需要选择出一个/多个最优动作的。\n",
79 | "\n",
80 | "如果将两个状态的方程式联立,则计算量急剧增大。\n",
81 | "\n",
82 | "**而对于状态多的更不用说,几乎不可计算。因此,要使用近似算。**"
83 | ]
84 | },
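{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a fixed policy the Bellman system above is linear, so a tiny instance can be solved directly; below is a minimal sketch with a made-up 2-state example (the transition matrix and rewards are illustrative)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Solve v = r + gamma * P v directly for a tiny, made-up 2-state example\n",
"import numpy as np\n",
"P = np.array([[0.5, 0.5], [0.2, 0.8]])  # transition matrix under the fixed policy\n",
"r = np.array([1.0, 0.0])                # expected one-step reward in each state\n",
"gamma = 0.9\n",
"v = np.linalg.solve(np.eye(2) - gamma * P, r)\n",
"print(v)  # the unique solution of the Bellman system\n"
]
},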
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {
88 | "ExecuteTime": {
89 | "end_time": "2020-01-11T12:21:49.968997Z",
90 | "start_time": "2020-01-11T12:21:49.962985Z"
91 | }
92 | },
93 | "source": [
94 | "### 复现一下网格问题\n",
95 | "\n",
96 | "#### 问题描述\n",
97 | "\n",
98 | ""
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "#### 注意下面的分析也很有趣"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {
111 | "ExecuteTime": {
112 | "end_time": "2020-01-11T12:23:55.319564Z",
113 | "start_time": "2020-01-11T12:23:55.314576Z"
114 | }
115 | },
116 | "source": [
117 | "#### 复现\n",
118 | "\n",
119 | "参考[https://github.com/ShangtongZhang/reinforcement-learning-an-introduction](https://github.com/ShangtongZhang/reinforcement-learning-an-introduction)。\n",
120 | "\n",
121 | "复现了策略={上下左右概率一样}与策略={最优}时。"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 31,
127 | "metadata": {
128 | "ExecuteTime": {
129 | "end_time": "2020-01-11T12:51:19.118125Z",
130 | "start_time": "2020-01-11T12:51:19.110146Z"
131 | }
132 | },
133 | "outputs": [],
134 | "source": [
135 | "import matplotlib\n",
136 | "import matplotlib.pyplot as plt\n",
137 | "import numpy as np\n",
138 | "from matplotlib.table import Table\n",
139 | "%matplotlib inline\n",
140 | "\n",
141 | "WORLD_SIZE = 5\n",
142 | "A_POS = [0, 1]\n",
143 | "A_PRIME_POS = [4, 1]\n",
144 | "B_POS = [0, 3]\n",
145 | "B_PRIME_POS = [2, 3]\n",
146 | "DISCOUNT = 0.9\n",
147 | "\n",
148 | "# left, up, right, down\n",
149 | "# 用坐标表示动作,比如向左ACTIONS[0],即y+0,x-1\n",
150 | "ACTIONS = [np.array([0, -1]),\n",
151 | " np.array([-1, 0]),\n",
152 | " np.array([0, 1]),\n",
153 | " np.array([1, 0])]\n",
154 | "ACTION_PROB = 0.25"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": 32,
160 | "metadata": {
161 | "ExecuteTime": {
162 | "end_time": "2020-01-11T12:51:19.126105Z",
163 | "start_time": "2020-01-11T12:51:19.120120Z"
164 | }
165 | },
166 | "outputs": [],
167 | "source": [
168 | "def step(state, action):\n",
169 | " if state == A_POS:\n",
170 | " return A_PRIME_POS, 10\n",
171 | " if state == B_POS:\n",
172 | " return B_PRIME_POS, 5\n",
173 | "\n",
174 | " next_state = (np.array(state) + action).tolist()\n",
175 | " x, y = next_state\n",
176 | " if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:\n",
177 | " reward = -1.0\n",
178 | " next_state = state\n",
179 | " else:\n",
180 | " reward = 0\n",
181 | " return next_state, reward"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 33,
187 | "metadata": {
188 | "ExecuteTime": {
189 | "end_time": "2020-01-11T12:51:19.135080Z",
190 | "start_time": "2020-01-11T12:51:19.128099Z"
191 | }
192 | },
193 | "outputs": [],
194 | "source": [
195 | "def draw_image(image):\n",
196 | " fig, ax = plt.subplots()\n",
197 | " ax.set_axis_off()\n",
198 | " tb = Table(ax, bbox=[0, 0, 1, 1])\n",
199 | "\n",
200 | " nrows, ncols = image.shape\n",
201 | " width, height = 1.0 / ncols, 1.0 / nrows\n",
202 | "\n",
203 | " # Add cells\n",
204 | " for (i, j), val in np.ndenumerate(image):\n",
205 | " tb.add_cell(i, j, width, height, text=val,\n",
206 | " loc='center', facecolor='white')\n",
207 | "\n",
208 | " # Row and column labels...\n",
209 | " # 行号、列号\n",
210 | " for i in range(len(image)):\n",
211 | " tb.add_cell(i, -1, width, height, text=i+1, loc='right',\n",
212 | " edgecolor='none', facecolor='none')\n",
213 | " tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',\n",
214 | " edgecolor='none', facecolor='none')\n",
215 | "\n",
216 | " ax.add_table(tb)"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": 34,
222 | "metadata": {
223 | "ExecuteTime": {
224 | "end_time": "2020-01-11T12:51:19.144056Z",
225 | "start_time": "2020-01-11T12:51:19.137075Z"
226 | }
227 | },
228 | "outputs": [],
229 | "source": [
230 | "def figure_3_2():\n",
231 | " value = np.zeros((WORLD_SIZE, WORLD_SIZE))\n",
232 | " while True:\n",
233 | " # keep iteration until convergence\n",
234 | " # 直到迭代不动了,停止(收敛到解了)\n",
235 | " new_value = np.zeros_like(value)\n",
236 | " for i in range(WORLD_SIZE):\n",
237 | " for j in range(WORLD_SIZE):\n",
238 | " for action in ACTIONS:\n",
239 | " # 上下左右都进行一遍(因为现在的策略是等概率选择上下左右的)\n",
240 | " (next_i, next_j), reward = step([i, j], action)\n",
241 | " # bellman equation\n",
242 | " new_value[i, j] += ACTION_PROB * (reward + DISCOUNT * value[next_i, next_j])\n",
243 | " if np.sum(np.abs(value - new_value)) < 1e-4:\n",
244 | " draw_image(np.round(new_value, decimals=2))\n",
245 | " plt.show()\n",
246 | " break\n",
247 | " value = new_value"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 35,
253 | "metadata": {
254 | "ExecuteTime": {
255 | "end_time": "2020-01-11T12:51:19.303928Z",
256 | "start_time": "2020-01-11T12:51:19.145057Z"
257 | }
258 | },
259 | "outputs": [
260 | {
261 | "data": {
262 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAagAAAEOCAYAAADc94MzAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJzt3Xt4FOXZP/DvY5KycjQgUZIoAQIh2c3ucg6KHOUsVDQKoRYuwLa0UCtqkforon2lHlAU5HS1aFD0TUQRwguIlIMKSA0CQQVCICZIDhDCISRByGHv3x8JE2IILCGz+4R8P9e1V3dnnk3vvZnZ7+zM044SERAREenmFm8XQEREdCUMKCIi0hIDioiItMSAIiIiLTGgiIhISwwoIiLSEgOKiIi0xIAiIiItMaCIiEhLDCgiItISA4qIiLTEgCIiIi0xoOoIpdS7SqkcpdQP3q7Fm5RSdymltiqlDiql9iul/uLtmrxFKWVRSiUqpfaV9+JFb9fkbUopH6XUXqXUWm/X4k1KqXSl1PdKqSSl1LferqemFP/fzOsGpVRvAAUA3hcRm7fr8RalVCsArURkj1KqCYDdAB4UkQNeLs3jlFIKQCMRKVBK+QHYDuAvIvJfL5fmNUqppwB0BdBURB7wdj3eopRKB9BVRHK9XcuN4C+oOkJEvgJw2tt1eJuIZIvInvLn+QAOAgjyblXeIWUKyl/6lT/q7RGnUioYwHAAS71dC9UOBhTVWUqpEACdAHzj3Uq8p/yUVhKAHAD/EZF62wsAbwGYDsDl7UI0IAA2KqV2K6V+7+1iaooBRXWSUqoxgJUAnhSRc96ux1tEpFREnACCAXRXStXL079KqQcA5IjIbm/Xool7RaQzgKEAppRfIqhzGFBU55Rfb1kJ4EMR+dTb9ehARM4C+ALAEC+X4i33AhhZfu0lHkB/pdQH3i3Je0Qkq/w/cwCsAtDduxXVDAOK6pTyiQHvADgoInO9XY83KaVaKqVuK39+K4D7ASR7tyrvEJG/iUiwiIQAGANgi4g85uWyvEIp1ah8AhGUUo0ADAJQJ2f/MqDqCKVUHICdAMKUUhlKqUnerslL7gXwW5QdISeVP4Z5uygvaQVgq1LqOwC7UHYNql5PryYAwB0Atiul9gFIBLBORDZ4uaYa4TRzIiLSEn9BERGRlhhQRESkJQYUERFpiQFFRERaYkAREZGWGFBERKQlX28XUFfceuutxy9cuHCHt+vQgcVicV24cIEHN2AvLsdeVGAvKlgslhM///zznTV5L/93UG5SSgl7VUYpBfaiDHtRgb2owF5UKO+Fqsl7mfBERKQlBhQREWmJAUVERFpiQBERkZYYUEREpCUGFBERaYkBRUREWmJAERGRlhhQRESkJQYUERFpiQFFRERaYkAREZGWGFBERKQlBhQREWmJAUVERFpiQBERkZbqbUAppd5VSuUopX7wdi0XLlxA9+7d4XA4YLVaMWvWrCpjlixZgsjISDidTvTq1QsHDhwAAJw6dQr9+vVD48aNMXXqVE+Xboo333wTVqsVNpsNMTExuHDhQqX106ZNg9PphNPpRIcOHXDbbbcZ65599lnYbDbYbDZ89NFHni7dFKWlpejUqRMeeOCBasd88sknUErh22+/BQAkJiYaPXI4HFi1apWnyjVNSEiIsQ907dq1yvqEhATY7XZj/fbt2wEASUlJ6NmzJ6xWK+x2e53fLiZOnIiAgADYbLarjtu1axd8fHzwySefGMvq3P4hIvXyAaA3gM4AfnBzvJjF5XJJfn6+iIgUFRVJ9+7dZefOnZXG5OXlGc8TEhJk8ODBIiJSUFAg27Ztk8WLF8uUKVNMq/FyZvYiIyNDQkJC5Pz58yIi8sgjj0hsbGy14+fPny8TJkwQEZG1a9fK/fffL8XFxVJQUCBdunSp1DczmNmLS9544w2JiYmR4cOHX3H9uXPn5L777pMePXrIrl27RESksLBQiouLRUQkKytLWrZsabw2i9m9aN26tZw8ebLa9fn5+eJyuUREZN++fRIWFiYiIocOHZKUlBQREcnMzJQ777xTzpw5Y2qtZvbiyy+/lN27d4vVaq12TElJifTr10+GDh0qH3/8sYh4Z/8QMXpRo+/pevsLSkS+AnDa23UAZbdEbty4MQCguLgYxcXFUKryHZKbNm1qPC8sLDTWN2rUCL169YLFYvFcwSYrKSnBzz//jJKSEpw/fx6BgYHVjo2Li0NMTAwA4MCBA+jTpw98fX3RqFEjOBwObNiwwVNlmyIjIwPr1q3D448/Xu2YmTNnYvr06ZW2gYYNG8LX1xdA2S/0X25PN6PGjRsbn/PyfaRDhw5o3749ACAwMBABAQE4efKk1+q8Ub1790bz5s2vOubtt9/Gww8/jICAAGNZXdw/6m1A6aa0tBROpxMBAQEYOHAgevToUWXMwoUL0a5dO0yfPh3z58/3QpXmCwoKwjPPPIO7774brVq1QrNmzTBo0KArjj169CjS0tLQv39/AIDD4cBnn32G8+fPIzc3F1u3bsWxY8c8WX6te/LJJ/Haa6/hlluuvKvu3bsXx44du+Lpv2+++QZWqxWRkZFYsmSJEVh1lVIKgwYNQpcuXfCvf/3rimNWrVqFjh07Yvjw4Xj33XerrE9MTERRURHatWtndrlek5mZiVWrVmHy5MmVltfF/YMBpQkfHx8kJSUhIyMDiYmJ+OGHqpfGpkyZgtTUVLz66qt46aWXvFCl+c6cOYOEhASkpaUhKysLhYWF+OCDD644Nj4+HtHR0fDx8QEADBo0CMOGDcM999yDmJgY9OzZs05/Ka9duxYBAQHo0qXLFde7XC5MmzYNb7zxxhXX9+jRA/v378euXbvw8ssvV7mWV9fs2LEDe/bswWeffYaFCxfiq6++qjJm1KhRSE5OxurVqzFz5sxK67Kzs/Hb3/4WsbGx1Qb+zeDJJ5/Eq6++auwXl9TJ/aOm5wZvhgeAEGhwDeqXXnjhBZkzZ06160tLS6Vp06aVlsXGxt4U16BWrFghEydONF6/99578sc//vGKY51Op+zYsaPavxUTEyPr1q2r9RovZ2YvZsyYIUFBQdK6dWu544475NZbb5Xf/OY3xvqzZ89KixYtpHXr1tK6dWtp0KCBtGrVyrgOdbm+fftecXlt8uQ+MmvWrKvuIyIiISEhxjWrvLw86dSpk6xYscIT5Znei7S0tGqvQYWEhBjbRKNGjaRly5ayatWqKuM8sX+I8BpUnXfy5EmcPXsWAPDzzz9j06ZN6NixY6Uxhw8fNp6vW7fOOKd+s7n77rvx3//+F+fPn4eIYPPmzQgPD68y7tChQzhz5gx69uxpLCstLcWpU6cAAN999x2+++67ak8P1gUvv/wyMjIykJ6ejvj4ePTv37/Sr8lmzZohNzcX6enpSE9PR1RUFNasWYOuXbsiLS0NJSUlAMpOhR46dAghISFe+iQ3rrCwEPn5+cbzjRs3VpnFduTIkUsHk9izZw+KiorQokULFBUVYdSoURg
3bhweeeQRj9fuaWlpacY2ER0djUWLFuHBBx+sk/uH5r/vzKOUigPQF8DtSqkMALNE5B1v1JKdnY3x48ejtLQULpcLjz76KB544AE8//zz6Nq1K0aOHIkFCxZg06ZN8PPzg7+/P9577z3j/SEhITh37hyKioqwevVqbNy4EREREd74KDesR48eiI6ORufOneHr64tOnTrh97//faVeAGWTI8aMGVPp4n9xcTHuu+8+AGWTSj744AP9T2HUwC97cSXbt2/HK6+8Aj8/P9xyyy1YtGgRbr/9dg9WWbtOnDiBUaNGASibRDN27FgMGTIES5YsAQBMnjwZK1euxPvvvw8/Pz/ceuut+Oijj6CUwooVK/DVV1/h1KlTWLZsGQBg2bJlcDqd3vo4NyQmJgZffPEFcnNzERwcjBdffBHFxcUAUOW60+Xq4v6hLh1x0NUppYS9KqOUAntRhr2owF5UYC8qlPeiRtNIeYqPiIi0xIAiIiItMaCIiEhLDCgiItISA4qIiLTEgCIiIi0xoIiISEsMKCIi0hIDioiItMSAIiIiLTGgiIhISwwoIiLSEgOKiIi0xIAiIiItMaCIiEhLDCgiItISA4qIiLSk9/1+NWKxWFxKKQY6AIvFUulW6/UZe1GBvajAXlSwWCyumr6Xt3x3E2/5XoG3s67AXlRgLyqwFxV4y3ciIrrpMKCIiEhLDCgiItISA4qIiLTEgCIiIi0xoIiISEsMKCIi0hIDioiItMSAIiIiLTGgiIhISwwoIiLSEgOKiIi0xIAiIiItMaCIiEhLDCgiItJSvQ0opdRdSqmtSqmDSqn9Sqm/eLOeiRMnIiAgADab7Yrrv/jiCzRr1gxOpxNOpxP/+Mc/AADHjh1Dv379EB4eDqvVinnz5nmy7Frnzuc5c+YMRo0aBbvdju7du+OHH34w1s2bNw82mw1WqxVvvfWWJ0uvde704sMPP4Tdbofdbsc999yDffv2GetCQkIQGRkJp9OJrl27erL0Wnet/eNq28S13lsXbdiwAWFhYQgNDcUrr7xSZf2yZcvQsmVL4/ti6dKlxjofHx9j+ciRIz1Z9vUTkXr5ANAKQOfy500ApACIuMp4MdOXX34pu3fvFqvVesX1W7duleHDh1dZnpWVJbt37xYRkXPnzkn79u1l//79ptZqZi/c+TzPPPOMvPDCCyIicvDgQenfv7+IiHz//fditVqlsLBQiouLZcCAAZKSkmJarSLe78WOHTvk9OnTIiKyfv166d69u7GudevWcvLkSdPq+yUze3Gt/aO6bcKd95rBzF6UlJRI27ZtJTU1VS5evCh2u73KdhEbGytTpky54vsbNWpkWm1XUt6LGn1P19tfUCKSLSJ7yp/nAzgIIMhb9fTu3RvNmze/7ve1atUKnTt3BgA0adIE4eHhyMzMrO3yPMadz3PgwAEMGDAAANCxY0ekp6fjxIkTOHjwIKKiotCwYUP4+vqiT58+WLVqlcc/Q21xpxf33HMP/P39AQBRUVHIyMjweJ2ecK39o7ptwp331jWJiYkIDQ1F27Zt8atf/QpjxoxBQkKCt8syRb0NqMsppUIAdALwjXcrubqdO3fC4XBg6NCh2L9/f5X16enp2Lt3L3r06OGF6mpfdZ/H4XDg008/BVC2sx49ehQZGRmw2Wz46quvcOrUKZw/fx7r16/HsWPHvFF6rXPn3/add97B0KFDjddKKQwaNAhdunTBv/71L0+U6TXVbRM3o8zMTNx1113G6+Dg4CselK5cuRJ2ux3R0dGV9oMLFy6ga9euiIqKwurVqz1Sc035ersAb1NKNQawEsCTInLO2/VUp3Pnzjh69CgaN26M9evX48EHH8Thw4eN9QUFBXj44Yfx1ltvoWnTpl6stHZc7fPMmDEDf/nLX+B0OhEZGYlOnTrB19cX4eHhePbZZzFw4EA0btwYDocDvr51fxN3599269ateOedd7B9+3Zj2Y4dOxAYGIicnBwMHDgQHTt2RO/evT1VtkdVt03cjMrOmlWmlKr0esSIEYiJiUGDBg2wZMkSjB8/Hlu2bAEA/PTTTwgMDMSPP/6I/v37IzIyEu3atfNI7detpucGb4YHAD8AnwN4yo2xbp5xrbm0tDS3z5Nffn2hqKhIBg0aJG+88YaZ5RnM7sX1fB6XyyWtW7eWvLy8Kuv+9re/ycKFC80o0aBDL/bt2ydt27aVQ4cOVTtm1qxZMmfOHDNKNJjdC3f3jyttE9ezb9UGM3vx9ddfy6BBg4zX//znP+Wf//xnteNLSkqkadOmV1w3fvx4+fjjj2u9xsuB16Cunyo75HgHwEERmevteq7l+PHjxpFTYmIiXC4XWrRoARHBpEmTEB4ejqeeesrLVd44dz7P2bNnUVRUBABYunQpevfubfyyyMnJAVB2lPjpp58iJibGM4WbwJ1e/PTTT3jooYewfPlydOjQwVheWFiI/Px84/nGjRtvqllsv3S1beJm061bNxw+fBhpaWkoKipCfHx8ldl42dnZxvM1a9YgPDwcQNlsx4sXLwIAcnNzsWPHDkRERHiu+OtV02Sr6w8AvQAIgO8AJJU/hl1lvPuHDDUwZswYufPOO8XX11eCgoJk6dKlsnjxYlm8eLGIiLz99tsSEREhdrtdevToITt27BARkW3btgkAiYyMFIfDIQ6HQ9atW2dqrWb2orrPc3kvvv76awkNDZWwsDAZNWqUMYtNRKRXr14SHh4udrtdNm3aZFqdl3i7F5MmTZLbbrvNWN+lSxcREUlNTRW73S52u10iIiLkpZdeMq3OS8zsxbX2j6ttE1d6r9nM/r5Yt26dtG/fXtq2bWv8286cOVMSEhJERGTGjBnG90Xfvn3l4MGDIlI269Nms4ndbhebzebJXtToe1qJVD2fSVUppYS9KqOUAntRhr2owF5UYC8qlPdCXXtkVfX2FB8REemNAUVERFpiQBERkZYYUEREpCUGFBERaYkBRUREWmJAERGRlhhQRESkJQYUERFpiQFFRERaYkAREZGWGFBERKQlBhQREWmJAUVERFpiQBERkZYYUEREpCUGFBERacnX2wXUFRaLxaWUYqADsFgsUKpGN8i86bAXFdiLCuxFBYvF4qrpe3nLdzfxlu8VeDvrCuxFBfaiAntRgbd8JyKimw4DioiItMSAIiIiLTGgiIhISwwoIiLSEgOKiIi0xIAiIiItMaCIiEhLDCgiItISA4qIiLTEgCIiIi0xoIiISEsMKCIi0hIDioiItMSAIiIiLTGgiIhIS/U2oJRSFqVUolJqn1Jqv1LqRW/Ws2HDBoSFhSE0NBSvvPJKlfUXL17E6NGjERoaih49eiA9PR0AkJ6ejltvvRVOpxNOpxOTJ0/2cOW171q9mDZtmvF5O3TogNtuu63S+nPnziEoKAhTp071VMmmuVYvAGDFihWIiIiA1WrF2LFjAQBbt241euR0OmGxWLB69WpPll7rrtWLJUuWIDIyEk6nE7169cKBAweMdd999x169uwJq9WKyMhIXLhwwZOlm0pE8MQTTyA0NBR2ux179uy56viRI0
fCZrN5qLobJCL18gFAAWhc/twPwDcAoq4yXsxSUlIibdu2ldTUVLl48aLY7XbZv39/pTELFy6UP/zhDyIiEhcXJ48++qiIiKSlpYnVajWttivxdi8uN3/+fJkwYUKlZU888YTExMTIlClTTKvzEm/3IiUlRZxOp5w+fVpERE6cOFHl75w6dUr8/f2lsLDQtFpFvN+LvLw843lCQoIMHjxYRESKi4slMjJSkpKSREQkNzdXSkpKTKtVxNxe/NK6detkyJAh4nK5ZOfOndK9e/dqx65cuVJiYmI8+p1R3osafU/X219Q5b0rKH/pV/7wyj2aExMTERoairZt2+JXv/oVxowZg4SEhEpjEhISMH78eABAdHQ0Nm/efFPeUtqdXlwuLi4OMTExxuvdu3fjxIkTGDRokCfKNZU7vfj3v/+NKVOmwN/fHwAQEBBQ5e988sknGDp0KBo2bOiRus3gTi+aNm1qPC8sLIRSZXcZ37hxI+x2OxwOBwCgRYsW8PHx8VzxJktISMC4ceOglEJUVBTOnj2L7OzsKuMKCgowd+5c/P3vf/dClTVTbwMKAJRSPkqpJAA5AP4jIt94o47MzEzcddddxuvg4GBkZmZWO8bX1xfNmjXDqVOnAABpaWno1KkT+vTpg23btnmucBO404tLjh49irS0NPTv3x8A4HK58PTTT2POnDkeqdVs7vQiJSUFKSkpuPfeexEVFYUNGzZU+Tvx8fGVQrwucne7WLhwIdq1a4fp06dj/vz5AMp6pJTC4MGD0blzZ7z22mseq9sT3O3NzJkz8fTTT9epA5V6HVAiUioiTgDBALorpbxyYvZKv4QuHf1da0yrVq3w008/Ye/evZg7dy7Gjh2Lc+fOmVar2dzpxSXx8fGIjo42joYXLVqEYcOGVdpZ6zJ3elFSUoLDhw/jiy++QFxcHB5//HGcPXvWWJ+dnY3vv/8egwcPNr1eM7m7XUyZMgWpqal49dVX8dJLLwEo69H27dvx4YcfYvv27Vi1ahU2b95ses2e4k5vkpKScOTIEYwaNcpTZdWKeh1Ql4jIWQBfABjijf/+4OBgHDt2zHidkZGBwMDAaseUlJQgLy8PzZs3R4MGDdCiRQsAQJcuXdCuXTukpKR4rvha5k4vLvnlL4OdO3diwYIFCAkJwTPPPIP3338fM2bMML1ms7i7Xfz617+Gn58f2rRpg7CwMBw+fNhYv2LFCowaNQp+fn4eq9sM17NdAMCYMWOMSSHBwcHo06cPbr/9djRs2BDDhg275kQC3S1cuNCYABMYGHjN3uzcuRO7d+9GSEgIevXqhZSUFPTt29fDVddATS9e1fUHgJYAbit/fiuAbQAeuMr4q18JvAHFxcXSpk0b+fHHH40LwD/88EOlMQsWLKg0SeKRRx4REZGcnBzjgm9qaqoEBgbKqVOnTKtVxNwLwO70QkQkOTlZWrduLS6X64p/JzY2ts5PknCnF5999pmMGzdOREROnjwpwcHBkpuba6zv0aOHbNmyxbQaL+ftXqSkpBjP16xZI126dBERkdOnT0unTp2ksLBQiouLZcCAAbJ27VrTahXx7CSJtWvXVpok0a1bt6uO9/TEKtzAJAlfb4SiJloBeE8p5YOyX5IrRGStNwrx9fXFggULMHjwYJSWlmLixImwWq14/vnn0bVrV4wcORKTJk3Cb3/7W4SGhqJ58+aIj48HAHz11Vd4/vnn4evrCx8fHyxZsgTNmzf3xseoFe70AiibHDFmzJhqT//dDNzpxeDBg7Fx40ZERETAx8cHc+bMMX5Rp6en49ixY+jTp4+XP8mNc6cXCxYswKZNm+Dn5wd/f3+89957AAB/f3889dRT6NatG5RSGDZsGIYPH+7lT1R7hg0bhvXr1yM0NBQNGzZEbGyssc7pdCIpKcmL1d0YJTfhTDAzKKWEvSqjlLopZxDWBHtRgb2owF5UKO9FjY4keQ2KiIi0xIAiIiItMaCIiEhLDCgiItISA4qIiLTEgCIiIi0xoIiISEsMKCIi0hIDioiItMSAIiIiLTGgiIhISwwoIiLSEgOKiIi0xIAiIiItMaCIiEhLDCgiItISA4qIiLRUn2/5fl0sFotLKcVAB2CxWG7qW61fD/aiAntRgb2oYLFYXDV9L2/57ibe8r0Cb2ddgb2owF5UYC8q8JbvRER002FAERGRlhhQRESkJQYUERFpiQFFRERaYkAREZGWGFBERKQlBhQREWmJAUVERFpiQBERkZYYUEREpCUGFBERaYkBRUREWmJAERGRlhhQRESkpXofUEopH6XUXqXUWm/WISJ44oknEBoaCrvdjj179lxx3EcffQS73Q6r1Yrp06cby6dNmwan0wmn04kOHTrgtttu81Tptc7dXlwycuRI2Gy2Kstff/11KKWQm5trVqmmc7cXQ4YMgcPhgNVqxeTJk1FaWgoAeOGFFxAUFGRsG+vXr/dk+bXK3V707dsXYWFhxmfOyckBABw9ehQDBgyA3W5H3759kZGR4cnya1VycjJ69uyJBg0a4PXXX6923IIFCxAaGlplP8jLy8OIESOMbSY2NtYTZV8/EanXDwBPAfhfAGuvMU7MtG7dOhkyZIi4XC7ZuXOndO/evcqY3NxcueuuuyQnJ0dERMaNGyebNm2qMm7+/PkyYcIE02rVoReXrFy5UmJiYsRqtVZa/tNPP8mgQYPk7rvvlpMnT5pWqy69yMvLExERl8slDz30kMTFxYmIyKxZs2TOnDmm1niJLr3o06eP7Nq1q8ry6OhoWbZsmYiIbN68WR577DHTajW7FydOnJDExER57rnnrvrvu2fPHklLS5PWrVtX2g9mz54t06dPFxGRnJwc8ff3l4sXL5pSa3kvavT9XK9/QSmlggEMB7DU27UkJCRg3LhxUEohKioKZ8+eRXZ2dqUxP/74Izp06ICWLVsCAO6//36sXLmyyt+Ki4tDTEyMR+o2gzu9AICCggLMnTsXf//736usmzZtGl577bU6f9ttd3vRtGlTAEBJSQmKiorq/Oe+End7UZ0DBw5gwIABAIB+/fohISHBrFJNFxAQgG7dusHPz++q4zp16oSQkJAqy5VSyM/Ph4igoKAAzZs3h6+vr0nV1ly9DigAbwGYDsDl7UIyMzNx1113Ga+Dg4ORmZlZaUxoaCiSk5ORnp6OkpISrF69GseOHas05ujRo0hLS0P//v09UrcZ3OkFAMycORNPP/00GjZsWGn5mjVrEBQUBIfDYXqtZnO3FwAwePBgBAQEoEmTJoiOjjaWL1iwAHa7HRMnTsSZM2dMr9ks19OLCRMmwOl04n/+53+MW687HA7jgG7VqlXIz8/HqVOnzC9cQ1OnTsXBgwcRGBiIyMhIzJs3D7fcol8c6FeRhyilHgCQIyK7vV0LAGMnutwvj4L9/f2xePFijB49Gvfddx9CQkKqHPXEx8cjOjoaPj4+ptZrJnd6kZSUhCNHjmDUqFGVlp8/fx6zZ8/GP/7xD1Nr9BR3enHJ559/juzsbFy8eBFbtmwBAPzxj39EamoqkpKS0KpVKzz99NOm1
msmd3vx4Ycf4vvvv8e2bduwbds2LF++HEDZNckvv/wSnTp1wpdffomgoCAtfzV4wueffw6n04msrCwkJSVh6tSpOHfunLfLqqLeBhSAewGMVEqlA4gH0F8p9YEnC1i4cKFxITcwMLDSr6GMjAwEBgZWec+IESPwzTffYOfOnQgLC0P79u0rrY+Pj6+Tp/eutxc7d+7E7t27ERISgl69eiElJQV9+/ZFamoq0tLS4HA4EBISgoyMDHTu3BnHjx/39EeqsZpsF5dYLBaMHDnSOH11xx13wMfHB7fccgt+97vfITEx0fT6a1NNehEUFAQAaNKkCcaOHWt85sDAQHz66afYu3cvZs+eDQBo1qyZBz5F7bi8F1lZWTf0t2JjY/HQQw9BKYXQ0FC0adMGycnJtVRpLarpxaub6QGgL7w8SWLt2rWVLgB369btiuNOnDghIiKnT58Wh8Mhhw4dMtYlJydL69Z/MFqRAAAKfElEQVStxeVymVqrLr24JC0trcokiUt+eXG4tunQi/z8fMnKyhIRkeLiYnn00Ufl7bffFhExlouIzJ07V0aPHm1arTr0ori42Pj3LioqkocfflgWL14sIiInT56U0tJSERF57rnnZObMmabVanYvLnF3Eswv94PJkyfLrFmzRETk+PHjEhgYaNp+ghuYJOH1cNDhoUNAuVwu+dOf/iRt27YVm81WaRaSw+Ewno8ZM0bCw8MlPDzcmKl1yaxZs+TZZ581tU4R83c+d3txyc0cUO704vjx49K1a1eJjIyUiIgImTp1qhQXF4uIyGOPPSY2m00iIyNlxIgRlQKrtunQi4KCAuncubPRiyeeeEJKSkpEROTjjz+W0NBQad++vUyaNEkuXLhgWq1m9yI7O1uCgoKkSZMm0qxZMwkKCjJmcg4dOlQyMzNFRGTevHkSFBQkPj4+0qpVK5k0aZKIiGRmZsrAgQPFZrOJ1WqV5cuXm1brjQSUKns/XYtSStirMkopsBdl2IsK7EUF9qJCeS9qNK20Pl+DIiIijTGgiIhISwwoIiLSEgOKiIi0xIAiIiItMaCIiEhLDCgiItISA4qIiLTEgCIiIi0xoIiISEsMKCIi0hIDioiItMSAIiIiLTGgiIhISwwoIiLSEgOKiIi0xIAiIiIt+Xq7gLrCYrG4lFIMdAAWiwVK1egGmTcd9qICe1GBvahgsVhcNX0vb/nuJt7yvQJvZ12BvajAXlRgLyrwlu9ERHTTYUAREZGWGFBERKQlBhQREWmJAUVERFpiQBERkZYYUEREpCUGFBERaYkBRUREWmJAERGRlhhQRESkJQYUERFpiQFFRERaYkAREZGWGFBERKQlBhQREWmpXgeUUipdKfW9UipJKfWtN2tJTk5Gz5490aBBA7z++uvVjtu8eTM6d+4Mp9OJXr164ciRI8a6FStWICIiAlarFWPHjvVE2aZwtxeTJk2Cw+GA3W5HdHQ0CgoKAADLli1Dy5Yt4XQ64XQ6sXTpUk+VXuvc7cVvfvMbhIWFwWazYeLEiSguLgYAJCQkwG63w+l0omvXrti+fbunSq917vbikj//+c9o3Lix8Xru3LmIiIiA3W7HgAEDcPToUTPLNZW7vdiyZQs6d+4Mm82G8ePHo6SkBACQl5eHESNGwOFwwGq1IjY21lOlXx8RqbcPAOkAbndzrJjpxIkTkpiYKM8995zMmTOn2nHt27eXAwcOiIjIwoULZfz48SIikpKSIk6nU06fPm38PbPo0ou8vDzj+bRp0+Tll18WEZHY2FiZMmWKqTVeoksv1q1bJy6XS1wul4wZM0YWLVokIiL5+fnicrlERGTfvn0SFhZmWq269EJEZNeuXfLYY49Jo0aNjGVbtmyRwsJCERFZtGiRPProo6bVqkMvSktLJTg4WA4dOiQiIjNnzpSlS5eKiMjs2bNl+vTpIiKSk5Mj/v7+cvHiRVNqLe9Fjb6j6/UvKJ0EBASgW7du8PPzu+o4pRTOnTsHoOwoKDAwEADw73//G1OmTIG/v7/x9+oqd3vRtGlTAGUHWT///DOUqtFdpbXmbi+GDRsGpRSUUujevTsyMjIAAI0bNzb6UlhYWKd75G4vSktL8de//hWvvfZapeX9+vVDw4YNAQBRUVFGj+oid3px6tQpNGjQAB06dAAADBw4ECtXrgRQ9j2Sn58PEUFBQQGaN28OX19fj9R+Pep7QAmAjUqp3Uqp33u7GHcsXboUw4YNQ3BwMJYvX44ZM2YAAFJSUpCSkoJ7770XUVFR2LBhg5cr9YwJEybgzjvvRHJyMv785z8by1euXGmc+jt27JgXK/Ss4uJiLF++HEOGDDGWrVq1Ch07dsTw4cPx7rvverE6z1iwYAFGjhyJVq1aVTvmnXfewdChQz1YlefdfvvtKC4uxrffll29+OSTT4x9YerUqTh48CACAwMRGRmJefPm4ZZb9IsD/SryrHtFpDOAoQCmKKV6e7uga3nzzTexfv16ZGRkYMKECXjqqacAACUlJTh8+DC++OILxMXF4fHHH8fZs2e9XK35YmNjkZWVhfDwcHz00UcAgBEjRiA9PR3fffcd7r//fowfP97LVXrOn/70J/Tu3Rv33XefsWzUqFFITk7G6tWrMXPmTC9WZ76srCx8/PHHlQ5WfumDDz7At99+i7/+9a8erMzzlFKIj4/HtGnT0L17dzRp0sT4lfT555/D6XQiKysLSUlJmDp1qnFmRif1OqBEJKv8P3MArALQ3ZP//QsXLjQu5GdlZV1z/MmTJ7Fv3z706NEDADB69Gh8/fXXAIDg4GD8+te/hp+fH9q0aYOwsDAcPnzY1Ppr0/X24nI+Pj4YPXq0cfqiRYsWaNCgAQDgd7/7HXbv3l3r9Zqppr148cUXcfLkScydO/eK63v37o3U1FTk5ubWVqmmu95e7N27F0eOHEFoaChCQkJw/vx5hIaGGus3bdqE2bNnY82aNcY2UlfUZLvo2bMntm3bhsTERPTu3Rvt27cHUHZg99BDD0EphdDQULRp0wbJyclmll8j9TaglFKNlFJNLj0HMAjAD56sYcqUKUhKSkJSUpJxLelq/P39kZeXh5SUFADAf/7zH4SHhwMAHnzwQWzduhUAkJubi5SUFLRt29a84mvZ9fZCRIwZjCKC//u//0PHjh0BANnZ2ca4NWvWGD2qK663F0DZqd/PP/8ccXFxlU7VHDly5NIkH+zZswdFRUVo0aKFKXWb4Xp7MXz4cBw/fhzp6elIT09Hw4YNje1k7969+MMf/oA1a9bUyWu0NdkucnJyAAAXL17Eq6++ismTJwMA7r77bmzevBkAcOLECRw6dEjP74uazq6o6w8AbQHsK3/sB/D/rjHevSkrNZSdnS1BQUHSpEkTadasmQQFBRmz1IYOHSqZmZkiIvLpp5+KzWYTu90uffr0kdTUVBERcblcMm3aNAkPDxebzSZxcXGm1apDL0pLS+Wee+4Rm80mVqtVxo4da4yZMWOGREREiN1ul759+8rBgwdNq1WHXoiI+Pj4SNu2bcXhcIjD4ZAXX3xRREReeeUViYiI
EIfDIVFRUbJt2zbTatWlF5e7fBbfgAEDJCAgwOjRiBEjTKtVl14888wz0rFjR+nQoYO8+eabxvszMzNl4MCBxv6zfPly02rFDcziU1J+dEVXp5QS9qqMUgrsRRn2ogJ7UYG9qFDeixpNH623p/iIiEhvDCgiItISA4qIiLTEgCIiIi0xoIiISEsMKCIi0hIDioiItMSAIiIiLTGgiIhISwwoIiLSEgOKiIi0xIAiIiItMaCIiEhLDCgiItISA4qIiLTEgCIiIi0xoIiISEu+3i6grrBYLCeUUnd4uw4dWCwWl1KKBzdgLy7HXlRgLypYLJYTNX0vb/lORERaYsITEZGWGFBERKQlBhQREWmJAUVERFpiQBERkZYYUEREpCUGFBERaYkBRUREWmJAERGRlhhQRESkJQYUERFpiQFFRERaYkAREZGWGFBERKQlBhQREWmJAUVERFpiQBERkZYYUEREpCUGFBERaYkBRUREWmJAERGRlhhQRESkpf8PfkjOg+3xekAAAAAASUVORK5CYII=\n",
263 | "text/plain": [
264 | ""
265 | ]
266 | },
267 | "metadata": {
268 | "needs_background": "light"
269 | },
270 | "output_type": "display_data"
271 | }
272 | ],
273 | "source": [
274 | "figure_3_2()"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": 36,
280 | "metadata": {
281 | "ExecuteTime": {
282 | "end_time": "2020-01-11T12:51:19.309913Z",
283 | "start_time": "2020-01-11T12:51:19.304926Z"
284 | }
285 | },
286 | "outputs": [],
287 | "source": [
288 | "def figure_3_5():\n",
289 | " value = np.zeros((WORLD_SIZE, WORLD_SIZE))\n",
290 | " while True:\n",
291 | " # keep iteration until convergence\n",
292 | " new_value = np.zeros_like(value)\n",
293 | " for i in range(WORLD_SIZE):\n",
294 | " for j in range(WORLD_SIZE):\n",
295 | " values = []\n",
296 | " for action in ACTIONS:\n",
297 | " (next_i, next_j), reward = step([i, j], action)\n",
298 | " # value iteration\n",
299 | " values.append(reward + DISCOUNT * value[next_i, next_j])\n",
300 | " # 这里,没有保留每个状态的最优动作是什么,让我做个改进\n",
301 | " new_value[i, j] = np.max(values)\n",
302 | " if np.sum(np.abs(new_value - value)) < 1e-4:\n",
303 | " draw_image(np.round(new_value, decimals=2))\n",
304 | " plt.show()\n",
305 | " break\n",
306 | " value = new_value"
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": 37,
312 | "metadata": {
313 | "ExecuteTime": {
314 | "end_time": "2020-01-11T12:51:19.935651Z",
315 | "start_time": "2020-01-11T12:51:19.310909Z"
316 | }
317 | },
318 | "outputs": [
319 | {
320 | "data": {
321 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAagAAAEOCAYAAADc94MzAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJzt3X9U1HW+P/DnW4nIvdoeW/EHIDTJGWZnBCRDPGvw9XIhxa123WvC0tYJ9rS71VoZ/ujUdrOj1ckKvK3yh6a3tf3aqZNdLaz2IqZ8j7kiZT/MdK9hArVqpVKQOOM8v38AMyIDGs2PD/F8nDNnmc/n8x5en5dv5jmfz3zajyEJERERqxkS6QJEREQCUUCJiIglKaBERMSSFFAiImJJCigREbEkBZSIiFiSAkpERCxJASUiIpakgBIREUtSQImIiCUpoERExJIUUCIiYkkKqAHCGLPWGHPMGPNhpGuJJGNMgjFmmzFmvzFmnzHm7kjXFCnGmBhjzG5jzHudvVgS6ZoizRgz1BjzrjHmtUjXEknGmMPGmA+MMXuNMXsiXU9/Gf2/mQ8MxphsAN8A+AtJV6TriRRjzFgAY0m+Y4wZDqAewC9IfhTh0sLOGGMA/IjkN8aYSwD8PwB3k9wV4dIixhgzH8BkACNI/jzS9USKMeYwgMkkv4h0Ld+HjqAGCJI7AHwV6ToijeTnJN/p/PlrAPsBxEW2qshgh286n17S+Ri0nziNMfEAZgFYE+laJDgUUDJgGWOSAEwC8PfIVhI5nae09gI4BuB/SA7aXgCoALAQgDfShVgAAfzNGFNvjLk90sX0lwJKBiRjzL8AeBnAPSRbIl1PpJA8SzIdQDyATGPMoDz9a4z5OYBjJOsjXYtF/IxkBoCZAO7s/IpgwFFAyYDT+X3LywD+SnJjpOuxApInAbwFYEaES4mUnwG4ofO7lxcA/Ksx5vnIlhQ5JD/r/N9jAF4BkBnZivpHASUDSueFAc8C2E/y6UjXE0nGmFHGmB93/nwZgH8D8HFkq4oMkveTjCeZBKAQQA3JmyNcVkQYY37UeQERjDE/ApAPYEBe/auAGiCMMRsAvA3AboxpMsaURrqmCPkZgN+g4xPy3s5HQaSLipCxALYZY94HUIeO76AG9eXVAgAYDeD/GWPeA7AbQBXJNyJcU7/oMnMREbEkHUGJiIglKaBERMSSFFAiImJJCigREbEkBZSIiFiSAkpERCwpKtIFDBSXXXbZP0+fPj060nVYQUxMjPf06dP6cAP14lzqhZ964RcTE3P022+/HdOfsfrvoC6SMYbqVQdjDNSLDuqFn3rhp174dfbC9GesEl5ERCxJASUiIpakgBIREUtSQImIiCUpoERExJIUUCIiYkkKKBERsSQFlIiIWJICSkRELEkBJSIilqSAEhERS1JAiYiIJSmgRETEkhRQIiJiSQooERGxJAWUiIhY0qANKGPMWmPMMWPMh5H4/Y2NjZg+fTocDgecTidWrFgBAHjppZfgdDoxZMgQ7Nmzp9fxK1asgMvlgtPpREVFhW/53r17kZWVhfT0dEyePBm7d+8O+b58X731osuTTz4JYwy++OKLXl+jpaUFcXFxuOuuuwAAbW1tmDVrFlJSUuB0OrF48eKQ7kOwaF74lZSUIDY2Fi6Xy7fsvffew9SpUzFx4kRcf/31aGlp6XX82bNnMWnSJPz85z/3LSsuLobdbofL5UJJSQncbndI9yFYAvVi7ty5SE9PR3p6OpKSkpCent7r+EC92Lp1KzIyMpCeno5p06bhf//3f0O6D/1CclA+AGQDyADw4UVuz2D67LPPWF9fT5JsaWlhcnIy9+3bx48++ogff/wxc3JyWFdXF3DsBx98QKfTydbWVrrdbubm5vLgwYMkyby8PG7ZsoUkWVVVxZycnKDWTZLh6gVJHjlyhPn5+Rw/fjyPHz/e62vMmzePRUVFvPPOO0mSra2trKmpIUm2t7dz2rRpvr4Ek+aFX7B7sX37dtbX19PpdPqWTZ48mW+99RZJ8tlnn+WDDz7Y6/innnqKRUVFnDVrlm9ZVVUVvV4vvV4vCwsLuWrVqqDW3CUcvTjX/PnzuWTJkl7HB+pFcnIyP/roI5LkypUreeuttwa15i6dvejX+/SgPYIiuQPAV5H6/WPHjkVGRgYAYPjw4XA4HGhubobD4YDdbu9z7P79+5GVlYVhw4YhKioKOTk5eOWVVwB03F6561PlqVOnMG7cuNDuSBD01gsAuPfee/HEE0/AmN7vGF1fX4+jR48iPz/ft2zYsGGYPn06ACA6OhoZGRloamoK4V4Eh+aFX3Z2NkaOHNlt2YEDB5CdnQ0AyMvLw8svvxxwbFNTE6qqqvDb3/622/KCggIYY2CMQWZm5oCYE0DgXnQhiRdffBFFRUUB1/fWi4EwJ6IiXYAAhw8fxrvvvospU6Zc1PYulwsPPPAAvvzyS1x22WXYsmULJk+eDACoqKjAddddh7KyMni9XuzcuTOUpQfdub3YvHkz4uLikJaW1uv2Xq8X9913H9avX4+tW7cG3ObkyZN49dVXcffdd4eq7JDQvOjJ5XJh8+bNuPHGG/HSSy+hsbEx4Hb33HMPnnjiCXz99dcB17vdbqxfv77H6eSBqLa2FqNHj0ZycnLA9b31Ys2aNSgoKMBll12GESNGYNeuXeEo9zsZtEdQVvHNN9/gV7/6FSoqKjBixIiLGuNwOLBo0SLk5eVhxowZSEtLQ1RUx2eNyspKlJeXo7GxEeXl5SgtLQ1l+UF1bi+ioqKwbNkyPPLII32OWbVqFQoKCpCQkBBwvcfjQVFREebNmwebzRaKskNC8yKwtWvXYuXKlbj66qvx9ddfIzo6usc2r732GmJjY3H11Vf3+jp33HEHsrOzce2114ay3LDYsGFDr0dPffWivLwcW7ZsQVNTE2677TbMnz8/1KV+d/09N/hDeABIQoS+gyLJM2fOMD8/n0899VSPdX1913C++++/nytXriRJjhgxgl6vlyTp9Xo5fPjw4BXcKRy9eP/99zlq1CgmJiYyMTGRQ4cOZUJCAj///PNu4379618zISGBiYmJvOKKKzh8+HAuWrTIt/62227jH//4x6DX20Xzwi8UvWhoaOj1e5cDBw7wmmuu6bF88eLFjIuLY2JiIkePHs3LLruMxcXFvvUPP/wwb7zxRp49ezbo9XYJVy/cbjdjY2PZ2NgYcExvvTh27BhtNptvu08//ZQOhyPoNZPf7zuoiIdEJB+RDCiv18vf/OY3vPvuuwOuv9Ab0dGjR0l2TCy73c6vvvqKJJmSksJt27aRJKurq5mRkRHUusng//FdqBckmZiY2OdFEiS5bt0630USJPnAAw9w9uzZA+qNSPOiu/PflLv27+zZs/zNb37DZ599ts/x27Zt63ZhwOrVqzl16lS2tbUFvdZzhSugXn/9dWZnZ1/U+HN74Xa7ecUVV/DAgQMkyTVr1nD27NnBLbiTAqp/4bQBwOcA3ACaAJReYPvv9q9yAbW
1tQTAiRMnMi0tjWlpaayqquLGjRsZFxfH6OhoxsbGMj8/nyTZ3NzMmTNn+sZPmzaNDoeDqamprK6u7va6GRkZTE1NZWZmJvfs2RPUusng//H11otznRtQdXV1LC0t7fE65wZUY2MjATAlJcX3mqtXrw5q3WT4ejEY50VhYSHHjBnDqKgoxsXFcc2aNayoqGBycjKTk5O5aNEi31Hh+X3ocn5ADR06lDabzdfbvq58+z7C0QuSvPXWW1lZWdlt24vtxcaNG+lyuZiamsqcnBweOnQoqDV3+T4BZTrGy4UYY6hedTDGQL3ooF74qRd+6oVfZy96vwy3D7pIQkRELEkBJSIilqSAEhERS1JAiYiIJSmgRETEkhRQIiJiSQooERGxJAWUiIhYkgJKREQsSQElIiKWpIASERFLUkCJiIglKaBERMSSFFAiImJJCigREbEkBZSIiFiSAkpERCwpKtIFDBQxMTFeY4wCHUBMTAyM6dcNMn9w1As/9cJPvfCLiYnx9nesbvl+kXTLdz/dztpPvfBTL/zUCz/d8l1ERH5wFFAiImJJCigREbEkBZSIiFiSAkpERCxJASUiIpakgBIREUtSQImIiCUpoERExJIUUCIiYkkKKBERsSQFlIiIWJICSkRELEkBJSIilqSAEhERSxq0AWWMSTDGbDPG7DfG7DPG3B3O319SUoLY2Fi4XC7fsvfeew9Tp07FxIkTcf3116OlpaXHuAMHDiA9Pd33GDFiBCoqKgAAe/fuRVZWFtLT0zF58mTs3r07bPvzfTQ2NmL69OlwOBxwOp1YsWIFAOCll16C0+nEkCFDsGfPnl7Hr1ixAi6XC06n09cLYGD2Q/Oid4F6M3fuXN8+JyUlIT09PeDY8vJyOJ1OuFwuFBUV4fTp0+EqOygC7TsAPPPMM7Db7XA6nVi4cGHAsW+88QbsdjsmTJiAxx9/3Le8uLgYdrsdLpcLJSUlcLvdId2HfiE5KB8AxgLI6Px5OICDAH7ax/YMpu3bt7O+vp5Op9O3bPLkyXzrrbdIks8++ywffPDBPl/D4/Fw9OjRPHz4MEkyLy+PW7ZsIUlWVVUxJycnqDV3CXYvPvvsM9bX15MkW1pamJyczH379vGjjz7ixx9/zJycHNbV1QUc+8EHH9DpdLK1tZVut5u5ubk8ePAgyfD0Q/PCL9i9OF+g3pxr/vz5XLJkSY/lTU1NTEpKYltbG0lyzpw5XLduXShLDcu8qKmpYW5uLk+fPk2SPHr0aI9xHo+HNpuNhw4dYnt7O1NTU7lv3z6SHXPB6/XS6/WysLCQq1atCmrNXTp70a/36UF7BEXyc5LvdP78NYD9AOLC9fuzs7MxcuTIbssOHDiA7OxsAEBeXh5efvnlPl9j69atuOqqq5CYmAig486VXZ+uT506hXHjxoWg8uAbO3YsMjIyAADDhw+Hw+FAc3MzHA4H7HZ7n2P379+PrKwsDBs2DFFRUcjJycErr7wCYGD2Q/Oid4F604UkXnzxRRQVFQVc7/F48O2338Lj8aCtrW3A9SDQvldWVmLx4sW49NJLAQCxsbE9xu3evRsTJkyAzWZDdHQ0CgsLsWnTJgBAQUEBjDEwxiAzMxNNTU2h35HvaNAG1LmMMUkAJgH4eyTrcLlc2Lx5M4CO01uNjY19bv/CCy90+4OsqKjAggULkJCQgLKyMjz22GMhrTcUDh8+jHfffRdTpky5qO1dLhd27NiBL7/8Em1tbdiyZYuvbz+EfgCaFxejtrYWo0ePRnJyco91cXFxKCsrw/jx4zF27FhcfvnlyM/Pj0CVwXXw4EHU1tZiypQpyMnJQV1dXY9tmpubkZCQ4HseHx+P5ubmbtu43W6sX78eM2bMCHnN39WgDyhjzL8AeBnAPSR7ntwPo7Vr12LlypW4+uqr8fXXXyM6OrrXbc+cOYPNmzdjzpw5vmWVlZUoLy9HY2MjysvLUVpaGo6yg+abb77Br371K1RUVGDEiBEXNcbhcGDRokXIy8vDjBkzkJaWhqioKAADvx9dBvu8uBgbNmzo9ejpxIkT2LRpExoaGvDZZ5+htbUVzz//fJgrDD6Px4MTJ05g165dWL58OW666aauryN8zn8OdBxRn+uOO+5AdnY2rr322pDW2y/9PTf4Q3gAuATAmwDmX8S2F3W+9btoaGjo9Xz6gQMHeM011/Q69r//+7+Zl5fXbdmIESPo9XpJkl6vl8OHDw9esecIRS/OnDnD/Px8PvXUUz3W9fUd1Pnuv/9+rly5kmR4+qF54ReKXpwvUG/cbjdjY2PZ2NgYcMyLL77IkpIS3/PnnnuOf/jDH0JaZzjmxXXXXcdt27b5nttsNh47dqzbmJ07dzI/P9/3/NFHH+Wjjz7qe/7www/zxhtv5NmzZ4NebxfoO6jvznR8jHgWwH6ST0e6HgA4duwYAMDr9WLp0qX4/e9/3+u2gT4xjhs3Dtu3bwcA1NTUBDzdYUUkUVpaCofDgfnz53/n8V19O3LkCDZu3Ojry0Dtx/kG67y4WNXV1UhJSUF8fHzA9ePHj8euXbvQ1tYGkti6dSscDkeYqwy+X/ziF6ipqQHQcbrvzJkz+MlPftJtm2uuuQb/+Mc/0NDQgDNnzuCFF17ADTfcAABYs2YN3nzzTWzYsAFDhlg0CvqbbAP9AWAaAAJ4H8DezkdBH9t/lw8NF1RYWMgxY8YwKiqKcXFxXLNmDSsqKpicnMzk5GQuWrTI96m3ubmZM2fO9I1tbW3lyJEjefLkyW6vWVtby4yMDKampjIzM5N79uwJas1dgt2L2tpaAuDEiROZlpbGtLQ0VlVVcePGjYyLi2N0dDRjY2N9nwTP78e0adPocDiYmprK6urqbq8b6n5oXvgFuxfnC9Qbkrz11ltZWVnZbdvze/PQQw/RbrfT6XTy5ptv9l35FirhmBft7e0sLi6m0+nkpEmTuHXrVpI9972qqorJycm02WxcunSpb/nQoUNps9l8f3OBroAMBnyPIyjTMV4uxBhD9aqDMQbqRQf1wk+98FMv/Dp7YS68ZU8WPa4TEZHBTgElIiKWpIASERFLUkCJiIglKaBERMSSFFAiImJJCigREbEkBZSIiFiSAkpERCxJASUiIpakgBIREUtSQImIiCUpoERExJIUUCIiYkkKKBERsSQFlIiIWJICSkRELCkq0gUMFDExMV5jjAIdQExMDIzp1w0yf3DUCz/1wk+98IuJifH2d6xu+X6RdMt3P93O2k+98FMv/NQLP93yXUREfnAUUCIiYkkKKBERsSQFlIiIWJICSkRELEkBJSIilqSAEhERS1JAiYiIJSmgRETEkhRQIiJiSQooERGxJAWUiIhYkgJKREQsSQElIiKWpIASERFLUkCJiIglDdqAMsbEGGN2G2PeM8bsM8YsiWQ9JSUliI2Nhcvl8i2bO3cu0tPTkZ6ejqSkJKSnpwccW15eDqfTCZfLhaKiIpw+fTpcZQdFoH1/7733MHXqVEycOBHXX389Wlpaeow7cOCArz/p6ekYMWIEKioqAAB79+5FVlYW0tPTMXnyZOzevT
ts+xNMmhfd9x0AnnnmGdjtdjidTixcuDDg2DfeeAN2ux0TJkzA448/7lteXFwMu90Ol8uFkpISuN3ukO5DsPTWCwB48sknYYzBF1980ev4lpYWxMXF4a677gIAtLW1YdasWUhJSYHT6cTixYtDVvv3QnJQPgAYAP/S+fMlAP4OIKuP7RlK27dvZ319PZ1OZ8D18+fP55IlS3osb2pqYlJSEtva2kiSc+bM4bp160JZKoPdi0D7PnnyZL711lskyWeffZYPPvhgn6/h8Xg4evRoHj58mCSZl5fHLVu2kCSrqqqYk5MT1Jq7aF74hWNe1NTUMDc3l6dPnyZJHj16tMc4j8dDm83GQ4cOsb29nampqdy3bx/Jjrng9Xrp9XpZWFjIVatWBbXmLuHoBUkeOXKE+fn5HD9+PI8fP97r+Hnz5rGoqIh33nknSbK1tZU1NTUkyfb2dk6bNs339xJsnb3o1/v0oD2C6uzdN51PL+l8ROwezdnZ2Rg5cmTAdSTx4osvoqioKOB6j8eDb7/9Fh6PB21tbRg3blwoSw26QPt+4MABZGdnAwDy8vLw8ssv9/kaW7duxVVXXYXExEQAHbeZ7jrqOnXq1IDrSRfNi+77XllZicWLF+PSSy8FAMTGxvYYt3v3bkyYMAE2mw3R0dEoLCzEpk2bAAAFBQUwxsAYg8zMTDQ1NYV+R4Kgt3lw77334oknnoAxvd9Rvb6+HkePHkV+fr5v2bBhwzB9+nQAQHR0NDIyMizZi0EbUABgjBlqjNkL4BiA/yH590jXFEhtbS1Gjx6N5OTkHuvi4uJQVlaG8ePHY+zYsbj88su7TcSByuVyYfPmzQCAl156CY2NjX1u/8ILL3R7o66oqMCCBQuQkJCAsrIyPPbYYyGtNxIG47w4ePAgamtrMWXKFOTk5KCurq7HNs3NzUhISPA9j4+PR3Nzc7dt3G431q9fjxkzZoS85lDZvHkz4uLikJaW1us2Xq8X9913H5YvX97rNidPnsSrr76K3NzcUJT5vQzqgCJ5lmQ6gHgAmcaYnid4LWDDhg29fko+ceIENm3ahIaGBnz22WdobW3F888/H+YKg2/t2rVYuXIlrr76anz99deIjo7uddszZ85g8+bNmDNnjm9ZZWUlysvL0djYiPLycpSWloaj7LAajPPC4/HgxIkT2LVrF5YvX46bbrqp6xS8z/nPAfQ4wrjjjjuQnZ2Na6+9NqT1hkpbWxuWLVuGRx55pM/tVq1ahYKCgm6BfS6Px4OioiLMmzcPNpstFKV+L4M6oLqQPAngLQCW+zjl8XiwceNGzJ07N+D66upqXHnllRg1ahQuueQSzJ49Gzt37gxzlcGXkpKCv/3tb6ivr0dRURGuuuqqXrd9/fXXkZGRgdGjR/uWPffcc5g9ezYAYM6cOQP2IoneDNZ5ER8fj9mzZ/tO0Q0ZMqTHxQHx8fHdjribmpq6nd5csmQJjh8/jqeffjpsdQfboUOH0NDQgLS0NCQlJaGpqQkZGRn45z//2W27t99+G3/+85+RlJSEsrIy/OUvf+l2QcTtt9+O5ORk3HPPPeHehYsyaAPKGDPKGPPjzp8vA/BvAD6ObFU9VVdXIyUlBfHx8QHXjx8/Hrt27UJbWxtIYuvWrXA4HGGuMviOHTsGoOMUxdKlS/H73/++120DHUmMGzcO27dvBwDU1NQEPA02kA3WefGLX/wCNTU1ADpO9505cwY/+clPum1zzTXX4B//+AcaGhpw5swZvPDCC7jhhhsAAGvWrMGbb76JDRs2YMiQgfv2N3HiRBw7dgyHDx/G4cOHER8fj3feeQdjxozptt1f//pXHDlyBIcPH8aTTz6JW265xXdV44MPPohTp075rny1pP5eXTHQHwBSAbwL4H0AHwJ46ALbX+Bale+nsLCQY8aMYVRUFOPi4rhmzRqS5K233srKyspu2zY3N3PmzJm+5w899BDtdjudTidvvvlm3xVOoRLsXgTa94qKCiYnJzM5OZmLFi2i1+sl2XPfW1tbOXLkSJ48ebLba9bW1jIjI4OpqanMzMzknj17glpzF80Lv3DMi/b2dhYXF9PpdHLSpEncunUryZ77XlVVxeTkZNpsNi5dutS3fOjQobTZbExLS2NaWlrAKyCDIRy9OFdiYqLvKr66ujqWlpb2eI1169b5ruJrbGwkAKakpPh6sXr16qDW3AXf4yo+wwDna6UnYwzVqw7GmIDn+Qcj9cJPvfBTL/w6e9H7ZYZ9GLjHuCIi8oOmgBIREUtSQImIiCUpoERExJIUUCIiYkkKKBERsSQFlIiIWJICSkRELEkBJSIilqSAEhERS1JAiYiIJSmgRETEkhRQIiJiSQooERGxJAWUiIhYkgJKREQsSQElIiKWFBXpAgaKmJgYrzFGgQ4gJiYGxvTrBpk/OOqFn3rhp174xcTEePs7Vrd8v0i65bufbmftp174qRd+6oWfbvkuIiI/OAooERGxJAWUiIhYkgJKREQsSQElIiKWpIASERFLUkCJiIglKaBERMSSFFAiImJJCigREbEkBZSIiFiSAkpERCxJASUiIpakgBIREUtSQImIiCUN+oAyxgw1xrxrjHktnL+3pKQEsbGxcLlc3ZY/88wzsNvtcDqdWLhwYcCxb7zxBux2OyZMmIDHH3/ct7y4uBh2ux0ulwslJSVwu90h3YdQCdSbuXPnIj09Henp6UhKSkJ6enrAseXl5XA6nXC5XCgqKsLp06fDVXZQaF749dYLAHjyySdhjMEXX3zR6/iWlhbExcXhrrvuAgC0tbVh1qxZSElJgdPpxOLFi0NWe7AF6sWCBQuQkpKC1NRU/PKXv8TJkycDjl2xYgVcLhecTicqKip8y/fu3YusrCykp6dj8uTJ2L17d8j34zsjOagfAOYD+L8AXrvAdgym7du3s76+nk6n07espqaGubm5PH36NEny6NGjPcZ5PB7abDYeOnSI7e3tTE1N5b59+0iSVVVV9Hq99Hq9LCws5KpVq4Jac5dg9+J8gXpzrvnz53PJkiU9ljc1NTEpKYltbW0kyTlz5nDdunWhLDXovdC88OttHhw5coT5+fkcP348jx8/3uv4efPmsaioiHfeeSdJsrW1lTU1NSTJ9vZ2Tps2jVu2bAlqzV3C0Ys333yTbrebJLlw4UIuXLiwx7gPPviATqeTra2tdLvdzM3N5cGDB0mSeXl5vv2vqqpiTk5OUGvu0tmLfr0/D+ojKGNMPIBZANaE+3dnZ2dj5MiR3ZZVVlZi8eLFuPTSSwEAsbGxPcbt3r0bEyZMgM1mQ3R0NAoLC7Fp0yYAQEFBAYwxMMYgMzMTTU1Nod+REAjUmy4k8eKLL6KoqCjgeo/Hg2+//RYejwdtbW0YN25cKEsNOs0Lv97mwb333osnnniiz1uq19fX4+jRo8jPz/ctGzZsGKZPnw4AiI6ORkZGxoDuRX5+PqKiogAAWVlZAfdl//79yMrKwrBhwxAVFYWcnBy88sorADrudNvS0gIAOHXqlCX/VgZ1QAGoALAQgDfShQDAwYMHUVtbiylTpiAnJ
wd1dXU9tmlubkZCQoLveXx8PJqbm7tt43a7sX79esyYMSPkNYdbbW0tRo8ejeTk5B7r4uLiUFZWhvHjx2Ps2LG4/PLLu71BDVSaF36bN29GXFwc0tLSet3G6/Xivvvuw/Lly3vd5uTJk3j11VeRm5sbijLDbu3atZg5c2aP5S6XCzt27MCXX36JtrY2bNmyBY2NjQCAiooKLFiwAAkJCSgrK8Njjz0W7rIvaNAGlDHm5wCOkayPdC1dPB4PTpw4gV27dmH58uW46aabuk4v+pz/HECPT5J33HEHsrOzce2114a03kjYsGFDr0dPJ06cwKZNm9DQ0IDPPvsMra2teP7558NcYfBpXnRoa2vDsmXL8Mgjj/S53apVq1BQUNAtsM/l8XhQVFSEefPmwWazhaLUsFq2bBmioqJQXFzcY53D4cCiRYuQl5eHGTNmIC0tzXfUVVlZifLycjQ2NqK8vBylpaVjXLncAAALuElEQVThLv2CBm1AAfgZgBuMMYcBvADgX40xEX03i4+Px+zZs32nYoYMGdLjS+D4+HjfJyAAaGpq6nZovmTJEhw/fhxPP/102OoOF4/Hg40bN2Lu3LkB11dXV+PKK6/EqFGjcMkll2D27NnYuXNnmKsMPs2LDocOHUJDQwPS0tKQlJSEpqYmZGRk4J///Ge37d5++238+c9/RlJSEsrKyvCXv/yl2wURt99+O5KTk3HPPfeEexeC7rnnnsNrr72Gv/71r72e8iwtLcU777yDHTt2YOTIkb6zD8899xxmz54NAJgzZ44ukrDqA8D/QZgvkiDJhoaGbl96VlZW8k9/+hNJ8sCBA4yPj6fX6+02xu1288orr+Qnn3zi+zL8ww8/JEmuXr2aU6dO9V0kECqh6MX5zu8NSb7++uvMzs7udcyuXbv405/+lK2trfR6vbzlllv4n//5nyGtU/PCLxy9OFdiYmKfF0mQ5Lp163wXSZDkAw88wNmzZ/Ps2bNBrfN84ejF66+/TofDwWPHjvU5ruuimk8//ZR2u51fffUVSTIlJYXbtm0jSVZXVzMjIyPoNZPf7yKJiIeDFR6RCKjCwkKOGTOGUVFRjIuL45o1a9je3s7i4mI6nU5OmjSJW7duJUk2Nzdz5syZvrFVVVVMTk6mzWbj0qVLfcuHDh1Km83GtLQ0pqWlBbzSLRhCHVCBekOSt956KysrK7tte35vHnroIdrtdjqdTt58882+K99CRfPCLxy9ONe5AVVXV8fS0tIer3FuQDU2NhIAU1JSfL1YvXp1UGvuEo5eXHXVVYyPj/fty+9+9zuSPefFtGnT6HA4mJqayurqat/y2tpaZmRkMDU1lZmZmdyzZ09Qa+7yfQLKdIyXCzHGUL3qYIyBetFBvfBTL/zUC7/OXvR+yWUfBvN3UCIiYmEKKBERsSQFlIiIWJICSkRELEkBJSIilqSAEhERS1JAiYiIJSmgRETEkhRQIiJiSQooERGxJAWUiIhYkgJKREQsSQElIiKWpIASERFLUkCJiIglKaBERMSSFFAiImJJUZEuYKCIiYnxGmMU6ABiYmJgTL9ukPmDo174qRd+6oVfTEyMt79jdcv3i6RbvvvpdtZ+6oWfeuGnXvjplu8iIvKDo4ASERFLUkCJiIglKaBERMSSFFAiImJJCigREbEkBZSIiFiSAkpERCxJASUiIpakgBIREUtSQImIiCUpoERExJIUUCIiYkkKKBERsSQFlIiIWJICSkRELGlQB5Qx5rAx5gNjzF5jzJ5w//6SkhLExsbC5XL1WPfkk0/CGIMvvvii1/EtLS2Ii4vDXXfdBQBoa2vDrFmzkJKSAqfTicWLF4es9mDqrQ/PPPMM7HY7nE4nFi5cGHDsG2+8AbvdjgkTJuDxxx/3LS8uLobdbofL5UJJSQncbndI9yFYNCf8AvViwYIFSElJQWpqKn75y1/i5MmTAceuWLECLpcLTqcTFRUVvuV79+5FVlYW0tPTMXnyZOzevTvk+xEMgXrx0ksvwel0YsiQIdizp/e3r5MnT+Lf//3fkZKSAofDgbfffhvAAOkFyUH7AHAYwE8uclsG2/bt21lfX0+n09lt+ZEjR5ifn8/x48fz+PHjvY6fN28ei4qKeOedd5IkW1tbWVNTQ5Jsb2/ntGnTuGXLlqDXHexeBOpDTU0Nc3Nzefr0aZLk0aNHe4zzeDy02Ww8dOgQ29vbmZqayn379pEkq6qq6PV66fV6WVhYyFWrVgW15i7h6AVp/TlBhqcXb775Jt1uN0ly4cKFXLhwYY9xH3zwAZ1OJ1tbW+l2u5mbm8uDBw+SJPPy8nz7X1VVxZycnKDW3CUcvfjoo4/48ccfMycnh3V1db2OveWWW7h69WqSHXPgxIkTJMPei369Rw/qI6hIy87OxsiRI3ssv/fee/HEE0/AmN7vklxfX4+jR48iPz/ft2zYsGGYPn06ACA6OhoZGRloamoKfuFBFqgPlZWVWLx4MS699FIAQGxsbI9xu3fvxoQJE2Cz2RAdHY3CwkJs2rQJAFBQUABjDIwxyMzMHBB9ADQnzhWoF/n5+YiKigIAZGVlBdyX/fv3IysrC8OGDUNUVBRycnLwyiuvAOi4/XhLSwsA4NSpUxg3blyI9yI4AvXC4XDAbrf3Oa6lpQU7duxAaWkpgI458OMf/xjAwOjFYA8oAvibMabeGHN7pIsBgM2bNyMuLg5paWm9buP1enHfffdh+fLlvW5z8uRJvPrqq8jNzQ1FmSF38OBB1NbWYsqUKcjJyUFdXV2PbZqbm5GQkOB7Hh8fj+bm5m7buN1urF+/HjNmzAh5zaGiORHY2rVrMXPmzB7LXS4XduzYgS+//BJtbW3YsmULGhsbAQAVFRVYsGABEhISUFZWhsceeyzcZYfVJ598glGjRuG2227DpEmT8Nvf/hatra0ABkYvBntA/YxkBoCZAO40xmRHspi2tjYsW7YMjzzySJ/brVq1CgUFBd3enM/l8XhQVFSEefPmwWazhaLUkPN4PDhx4gR27dqF5cuX46abbuo61epz/nMAPY4w7rjjDmRnZ+Paa68Nab2hojkR2LJlyxAVFYXi4uIe6xwOBxYtWoS8vDzMmDEDaWlpvqOuyspKlJeXo7GxEeXl5b4jix8qj8eDd955B3/4wx/w7rvv4kc/+pHvu9oB0Yv+nhv8oT0APAygrI/1F3vK9TtpaGjwnVd+//33OWrUKCYmJjIxMZFDhw5lQkICP//8825jfv3rXzMhIYGJiYm84oorOHz4cC5atMi3/rbbbuMf//jHkNRLBv/8Otm9DyR53XXXcdu2bb7nNpuNx44d6zZm586dzM/P9z1/9NFH+eijj/qeP/zww7zxxht59uzZoNfbJdS9GChzggzPvCDJ//qv/2JWVhZbW1sv6jXuv/9+rly5kiQ5YsQIer1ekqTX6+Xw4cODW3CncPWCZJ/fQX3++edMTEz0Pd+xYwcLCgpIhr0X/Xtf7u/Agf4A8CMAw8/5eSeAGX1s/93+VS5Sb5OOJBMTE/v8Qpwk161b5/tCnCQfeOABzp49e0C/
KZNkZWUl//SnP5EkDxw4wPj4eN8fUxe3280rr7ySn3zyie8iiQ8//JAkuXr1ak6dOpVtbW1Br/Vc4XwjIq07J8jw9OL111+nw+Ho8WHlfF0X1Xz66ae02+386quvSJIpKSm+Dz7V1dXMyMgIes2kdQKKJKdNm8aPP/6YJPkf//EfLCsrIxn2XiigvtOOAzYA73U+9gF44ALbf7d/lYtQWFjIMWPGMCoqinFxcVyzZk239ee+GdXV1bG0tLTHa5z7ZtTY2EgATElJYVpaGtPS0nxX7wRTsHsRqA/t7e0sLi6m0+nkpEmTuHXrVpJkc3MzZ86c6RtbVVXF5ORk2mw2Ll261Ld86NChtNlsvj4sWbIkqDV3CUcvzmXVOUGGpxdXXXUV4+Pjffvyu9/9jmTPeTFt2jQ6HA6mpqayurrat7y2tpYZGRlMTU1lZmYm9+zZE9Sau4SjFxs3bmRcXByjo6MZGxvrO5twfi/effddXn311Zw4cSJvvPFGX1iHuRf9ep82HePlQowxVK86GGOgXnRQL/zUCz/1wq+zF71fftqHwX6RhIiIWJQCSkRELEkBJSIilqSAEhERS1JAiYiIJSmgRETEkhRQIiJiSQooERGxJAWUiIhYkgJKREQsSQElIiKWpIASERFLUkCJiIglKaBERMSSFFAiImJJCigREbEkBZSIiFhSVKQLGChiYmKOGmNGR7oOK4iJifEaY/ThBurFudQLP/XCLyYm5mh/x+qW7yIiYklKeBERsSQFlIiIWJICSkRELEkBJSIilqSAEhERS1JAiYiIJSmgRETEkhRQIiJiSQooERGxJAWUiIhYkgJKREQsSQElIiKWpIASERFLUkCJiIglKaBERMSSFFAiImJJCigREbEkBZSIiFiSAkpERCxJASUiIpakgBIREUtSQImIiCX9f9uya0OjRVvgAAAAAElFTkSuQmCC\n",
322 | "text/plain": [
323 | ""
324 | ]
325 | },
326 | "metadata": {
327 | "needs_background": "light"
328 | },
329 | "output_type": "display_data"
330 | }
331 | ],
332 | "source": [
333 | "figure_3_5()"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": 48,
339 | "metadata": {
340 | "ExecuteTime": {
341 | "end_time": "2020-01-11T13:33:22.415984Z",
342 | "start_time": "2020-01-11T13:33:22.407974Z"
343 | }
344 | },
345 | "outputs": [],
346 | "source": [
347 | "def figure_3_5_prove():\n",
348 | " # PiperLiu@qq.com\n",
349 | " # 2020-1-11 20:48:40\n",
350 | " value = np.zeros((WORLD_SIZE, WORLD_SIZE))\n",
351 | " while True:\n",
352 | " # keep iteration until convergence\n",
353 | " new_value = np.zeros_like(value)\n",
354 | " action_optimal = dict()\n",
355 | " for i in range(WORLD_SIZE):\n",
356 | " for j in range(WORLD_SIZE):\n",
357 | " values = []\n",
358 | " for action in ACTIONS:\n",
359 | " (next_i, next_j), reward = step([i, j], action)\n",
360 | " # value iteration\n",
361 | " values.append(reward + DISCOUNT * value[next_i, next_j])\n",
362 | " # 这里,没有保留每个状态的最优动作是什么,让我做个改进\n",
363 | " new_value[i, j] = np.max(values)\n",
364 | " action_optimal.setdefault((i, j), np.where(new_value[i, j] == values))\n",
365 | " if np.sum(np.abs(new_value - value)) < 1e-4:\n",
366 | " draw_image_prove(action_optimal)\n",
367 | " plt.show()\n",
368 | " break\n",
369 | " value = new_value\n",
370 | "\n",
371 | "def number2char(number):\n",
372 | " if number == 0:\n",
373 | " return 'left,'\n",
374 | " elif number == 1:\n",
375 | " return 'up,'\n",
376 | " elif number == 2:\n",
377 | " return 'right,'\n",
378 | " return 'down,'\n",
379 | "\n",
380 | "def number2str(number):\n",
381 | " str = ''\n",
382 | " for num in number[0]:\n",
383 | " str = str + number2char(num)\n",
384 | " return str\n",
385 | "\n",
386 | "def draw_image_prove(image):\n",
387 | " fig, ax = plt.subplots()\n",
388 | " ax.set_axis_off()\n",
389 | " tb = Table(ax, bbox=[0, 0, 1, 1])\n",
390 | "\n",
391 | " nrows, ncols = 0, 0\n",
392 | " for key in image:\n",
393 | " if key[0] > nrows:\n",
394 | " nrows = key[0]\n",
395 | " if key[1] > ncols:\n",
396 | " ncols = key[1]\n",
397 | " width, height = 1.0 / ncols, 1.0 / nrows\n",
398 | "\n",
399 | " # Add cells\n",
400 | " for key in image:\n",
401 | " tb.add_cell(key[0], key[1], width, height, text=number2str(image[key]),\n",
402 | " loc='center', facecolor='white')\n",
403 | "\n",
404 | " # Row and column labels...\n",
405 | " for i in range(nrows): # 这里默认为正方形了,可以向原仓库提供一个pull request\n",
406 | " tb.add_cell(i, -1, width, height, text=i+1, loc='right',\n",
407 | " edgecolor='none', facecolor='none')\n",
408 | " tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',\n",
409 | " edgecolor='none', facecolor='none')\n",
410 | "\n",
411 | " ax.add_table(tb)"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": 49,
417 | "metadata": {
418 | "ExecuteTime": {
419 | "end_time": "2020-01-11T13:33:25.504840Z",
420 | "start_time": "2020-01-11T13:33:25.268091Z"
421 | }
422 | },
423 | "outputs": [
424 | {
425 | "data": {
426 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAbgAAAEOCAYAAAD7WQLbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAFJ9JREFUeJzt3W9InOm9xvHrXmNnJKF1S5waqNtDlpQwRSFQWpN2G5SFtKxn04VTihbaUmhXui3aItlKlWMhbbbJG0ukdcIpuLCbsA3pimhzkOBKIvaVbsiihE6xc9riv1PSyBb/jt7nhWaeeLJJNDvOk/zm+wEh6jzOPVcmftVl0XnvBQCANU+FfQAAAHYCgQMAmETgAAAmETgAgEkEDgBgEoEDAJhE4AAAJhE4AIBJBA4AYBKBAwCYROAAACYROACASQQOAGASgQMAmETgAAAmEbg84pz7tHOuyzn31bDPEjbn3HPOuVedc6875z4e9nnC5Jz7jHPux8653zjn9oZ9nrA5515wzvWEfQ58eAQuj3jv/ySpK+xzPA6899e897+U9GdJxWGfJ0ze+zFJM5JKJa2EfJxQOecOSYpKmgj7LPjwdoV9ACAszrk6SRPe+7z/ZOa9f9M5d1vSM5LeC/s8IfqKpHlJh5xzh7z374Z9IDw6ApdHnHOlkv5DUpFz7l3v/f+EfaawOOe+Jumbkv7bOfepPN/iy5IqJD0r6T9DPk6ovPe/kCTn3L8Rtyef896HfQYAALKO/wYHADCJwAEATCJwAACTCBwAwCQCBwAwicABAEwicAAAk/gfvXOoqKhoenFx8RNhn+NxEI1G1xYXF/kCS2xxN7YIsEUgGo3OLCwslG73Ov5H7xxyznn2XuecE1usY4sAWwTYIrCxhdvudXx1AAAwicABAEwicAAAkwgcAMAkAgcAMInAAQBMInAAAJMIHADAJAIHADCJwAEATCJwAACTCBwAwCQCBwAwicABAEwicAAAkwgcAMAkAgcAMInAAQBMInBZ4Jz7tHOuyzn31bDPcre33nrrnrd1dXXp+vXrmdcbGxtzeaSH+v/nk6Smpib19fV96LN+0B6pVErt7e0PvP87HretHtVObrxTeF7svCfxefEwBC4LvPd/ktQV9jnuaGtrU3t7u86dO6fJyUn98Ic/VEdHR+ZJ+sYbb+jll1/W2NiYxsfH1dXVpXQ6HfKpA/39/Tpz5owaGhqUTCZ148YN7dmzR+Pj4+rt7c3c7s7jaW9vVyqV0rFjx3ThwgW1tLRkbjM4OKjvfve7GhgY0B//+Ed579XY2Khz587p+PHjkqR33nlHbW1t6urqUjKZVE9Pj/7+979Lkubn5/XKK6/o17/+tf75z39qenpaP/rRj/Szn/1M77zzjl599VX961//0qFDh/T+++/r5z//uerr6/X666/rO9/5jlZWVnK43NZtdePHCc+LnfckPi8ehMAZVVtbq/Lycl27dk0vvfSS6urqMu/7+te/rpdfflkjIyOKx+P69re/rV27dmlhYSHEEweGh4dVUlKi4uJiFRUVKR6P6+jRo4rH46qpqVE6nd70CWJ1dVWSVFpaqtraWs3NzWl5eVnz8/OSpOeff17V1dWSpNnZWX30ox/V9773PX3sYx+TJH3xi19UW1ub3n33XR04cEAvvviiPvnJT2p+fl7vvfeePvvZz+r73/++vPe6evWqXnjhBf3kJz/R22+/rc9//vN67bXX9I1vfEOnT59WdXW1CgsL9a1vfUvl5eWamprKnONxst2NHwc8L3bek/i8eBAClwXOuVJJ/yHp351znwr7PJIUiUQkSc8995zefvttvfHGG9q1a5ckqbCwUE899ZTW1tb0zDPPqKOjQ3/729/0q1/9KswjZ1RWVmp2dlaxWEyxWGzT+y5duqSLFy9qZGREFRUVOnv2rIaGhiRJMzMzSiQSikQiKigoUGtrq6RgC0mKxWKam5vTuXPn9P7770ta30OSnHN69tln9dZbb+mvf/2rmpubVV5errGxMb355pvy3utLX/qS+vr69Nprr+mll17S888/r9///vf6wQ9+oN/97nf63Oc+p4KCgszHW1tbU3Nz845vtl1b3fhxwvNi5z2Jz4sHcd77sM+QN5xzPoy9f/vb32pqakpf+MIXVFVV9YG3mZub00c+8hEVFRXl5EzOOT3qFtPT0yotLb3n7Y2NjZn/brK0tKT5+Xk9/fTT99zuD3/4g/7yl79Ikl555ZVt30+2znvHh9lip2TrsW8Xz4sAz4vAxhZu29c9bgNaFlbgHkeP4z/esLBFgC0CbBF41MDxI0oAgEkEDgBgEoEDAJhE4AAAJhE4AIBJBA4AYBKBAwCYROAAACYROACASQQOAGASgQMAmETgAAAmETgAgEkEDgBgEoEDAJhE4AAAJhE4AIBJBA4AYNKusA+QT6LR6Jpzji8qJEWjUTm37d9AbxJbBNgiwBaBaDS69ijXOe99ts+C+3DOefZe55wTW6xjiwBbBNgisLHFtmvPdxMAAJMIHADAJAIHADCJwAEATCJwAACTCBwAwCQCBwAwicABAEwicAAAkwgcAMAkAgcAMInAAQBMInAAAJMIHADAJAIHADCJwAEATCJwAACTCBwAwCQClwXOueecc6865153zn087PPcT09PjxYWFu55e2NjY+bPg4OD6u7uzuWxsIO6urp0/fr1TW9rampSX1/fpr/3fMU+AYtbELgs8N5f897/UtKfJRXn8r7b2tp0+/ZtdXd3a+/evTp//ryampqUTqclSalUSsePH1dvb69GR0e1tLSk1tZWdXZ2qqqqSpI0Ojqq06dP69SpU0omkxoYGNDNmzdz+TBy4mFbWdXf368zZ86ooaFByWRSN27c0J49ezQ+Pq7e3t7M7e58Emtvb1cqldKxY8d04cIFtbS0hHX0nNjqPvnA2hYELkucc3WSJrz3Ezm+X3nvtbq6qt27d6uurk779u3TxMSE5ufnJUlHjhxRTU1N5prZ2VnV19errKxMklReXq4TJ05oenpaBw4cUHV1tQ4ePJi53oqtbGXR8PCwSkpKVFxcrKKiIsXjcR09elTxeFw1NTVKp9NaWVnJ3H51dVWSVFpaqtraWs3NzWl5ednsRtvdxzJrWxC4LHDOfU3SNyWVOOc+lcv7Pnz4sDo6OnT58mUtLCwokUhoYmJC+/fvV3NzsyQpEolsuqakpESJREKTk5OSpMLCwjuPQ2VlZbpy5YrGxsYy11uxla0sqqys1OzsrGKxmGKx2Kb3Xbp0SRcvXtTIyIgqKip09uxZDQ0NSZJmZmaUSCQUiURUUFCg1tbWMI6/47a6Tz6wtoXz3od9hrzhnPM7uXdjY6Pa29szr09PT6u0tPSe2127dk3j4+P6xz/+oZ/+9Kf3/Xj3uz4b7nw3FZatbpULYW9xv8d+90ZLS0uan5/X008/vaNnCXuLDxLWc4MtAhtbuG1f97gNaNlOB+5J8jj+4w0LWwTYIsAWgUcNHD+iBACYROAAACY
ROACASQQOAGASgQMAmETgAAAmETgAgEkEDgBgEoEDAJhE4AAAJhE4AIBJBA4AYBKBAwCYROAAACYROACASQQOAGASgQMAmLQr7APkk2g0uuac44sKSdFoVM5t+xf0msQWAbYIsEUgGo2uPcp1jl+JnjvOOc/e6zZ+BX3Yx3gssEWALQJsEdjYYtu157sJAIBJBA4AYBKBAwCYROAAACYROACASQQOAGASgQMAmETgAAAmETgAgEkEDgBgEoEDAJhE4AAAJhE4AIBJBA4AYBKBAwCYROAAACYROACASQQOAGASgcsC59xnnHM/ds79xjm3N+zz3E9PT48WFhbueXtjY2Pmz4ODg+ru7s7lsbCDurq6dP369U1va2pqUl9f36a/93zAFoF82YLAZYH3fkzSjKRSSSu5vO+2tjbdvn1b3d3d2rt3r86fP6+mpial02lJUiqV0vHjx9Xb26vR0VEtLS2ptbVVnZ2dqqqqkiSNjo7q9OnTOnXqlJLJpAYGBnTz5s1cPoyceNhWVvX39+vMmTNqaGhQMpnUjRs3tGfPHo2Pj6u3tzdzuzuf2Nrb25VKpXTs2DFduHBBLS0tYR0969gikA9bELgs8d6/Kem/JD2Ty/t1zsl7r9XVVe3evVt1dXXat2+fJiYmND8/L0k6cuSIampqMtfMzs6qvr5eZWVlkqTy8nKdOHFC09PTOnDggKqrq3Xw4MHM9VZsZSuLhoeHVVJSouLiYhUVFSkej+vo0aOKx+OqqalROp3Wykrwddnq6qokqbS0VLW1tZqbm9Py8rKJjdgikA9bELgscM592Tl3QtKLkv43l/d9+PBhdXR06PLly1pYWFAikdDExIT279+v5uZmSVIkEtl0TUlJiRKJhCYnJyVJhYWFdx6HysrKdOXKFY2NjWWut2IrW1lUWVmp2dlZxWIxxWKxTe+7dOmSLl68qJGREVVUVOjs2bMaGhqSJM3MzCiRSCgSiaigoECtra1hHD+r2CKQF1t473nJ0cv63DunoaFh0+tTU1MfeLurV6/6zs5Of/LkyQd+vPtdnw07vcXDbHWrXAh7i/s99rs3Wlxc9Ldu3drxs7BFgC0CG1ts+3OuW78WueCc8+y97s6PC8EWd2OLAFsENrZw272OH1ECAEwicAAAkwgcAMAkAgcAMInAAQBMInAAAJMIHADAJAIHADCJwAEATCJwAACTCBwAwCQCBwAwicABAEwicAAAkwgcAMAkAgcAMInAAQBMInAAAJN2hX2AfBKNRtecc3xRISkajcq5bf8GepPYIsAWAbYIRKPRtUe5znnvs30W3IdzzrP3Ouec2GIdWwTYIsAWgY0ttl17vpsAAJhE4AAAJhE4AIBJBA4AYBKBAwCYROAAACYROACASQQOAGASgQMAmETgAAAmETgAgEkEDgBgEoEDAJhE4AAAJhE4AIBJBA4AYBKBAwCYROAAACYRuCxxzr3gnOsJ+xwP0tPTo4WFhXve3tjYmPnz4OCguru7c3ks7KCuri5dv35909uamprU19e36e89H7BFIF+2IHBZ4Jw7JCkqaSLX993W1qbbt2+ru7tbe/fu1fnz59XU1KR0Oi1JSqVSOn78uHp7ezU6OqqlpSW1traqs7NTVVVVkqTR0VGdPn1ap06dUjKZ1MDAgG7evJnrh7LjHraVVf39/Tpz5owaGhqUTCZ148YN7dmzR+Pj4+rt7c3c7s4ntvb2dqVSKR07dkwXLlxQS0tLWEfPOrYI5MMWBC47viKpTNKhjdjljHNO3nutrq5q9+7dqqur0759+zQxMaH5+XlJ0pEjR1RTU5O5ZnZ2VvX19SorK5MklZeX68SJE5qentaBAwdUXV2tgwcPZq63YitbWTQ8PKySkhIVFxerqKhI8XhcR48eVTweV01NjdLptFZWVjK3X11dlSSVlpaqtrZWc3NzWl5eNrERWwTyYQsClwXe+19479slveu9fzeX93348GF1dHTo8uXLWlhYUCKR0MTEhPbv36/m5mZJUiQS2XRNSUmJEomEJicnJUmFhYWS1gNQVlamK1euaGxsLHO9FVvZyqLKykrNzs4qFospFottet+lS5d08eJFjYyMqKKiQmfPntXQ0JAkaWZmRolEQpFIRAUFBWptbQ3j+FnFFoG82MJ7z0uOXtbn3jkNDQ2bXp+amvrA2129etV3dnb6kydPPvDj3e/6bNjpLR5mq1vlQthb3O+x373R4uKiv3Xr1o6fhS0CbBHY2GLbn3Pd+rXIBeecZ+91d35cCLa4G1sE2CKwsYXb7nX8iBIAYBKBAwCYROAAACYROACASQQOAGASgQMAmETgAAAmETgAgEkEDgBgEoEDAJhE4AAAJhE4AIBJBA4AYBKBAwCYROAAACYROACASQQOAGDSrrAPkE+i0eiac44vKiRFo1E5t+1f0GsSWwTYIsAWgWg0uvYo1zl+JXruOOc8e6/b+BX0YR/jscAWAbYIsEVgY4tt157vJgAAJhE4AIBJBA4AYBKBAwCYROAAACYROACASQQOAGASgQMAmETgAAAmETgAgEkEDgBgEoEDAJhE4AAAJhE4AIBJBA4AYBKBAwCYROAAACYROACASQQuj/T09GhhYeGetzc2Nmb+PDg4qO7u7lweCzuoq6tL169f3/S2pqYm9fX1bfp7zwdsEciXLQjcE66trU23b99Wd3e39u7dq/Pnz6upqUnpdFqSlEqldPz4cfX29mp0dFRLS0tqbW1VZ2enqqqqJEmjo6M6ffq0Tp06pWQyqYGBAd28eTPMh7UjHraVVf39/Tpz5owaGhqUTCZ148YN7dmzR+Pj4+rt7c3c7s4ntvb2dqVSKR07dkwXLlxQS0tLWEfPOrYI5MMWBO4J55yT916rq6vavXu36urqtG/fPk1MTGh+fl6SdOTIEdXU1GSumZ2dVX19vcrKyiRJ5eXlOnHihKanp3XgwAFVV1fr4MGDmeut2MpWFg0PD6ukpETFxcUqKipSPB7X0aNHFY/HVVNTo3Q6rZWVlcztV1dXJUmlpaWqra3V3NyclpeXTWzEFoF82ILAPeEOHz6sjo4OXb58WQsLC0okEpqYmND+/fvV3NwsSYpEIpuuKSkpUSKR0OTkpCSpsLBQ0noAysrKdOXKFY2NjWWut2IrW1lUWVmp2dlZxWIxxWKxTe+7dOmSLl68qJGREVVUVOjs2bMaGhqSJM3MzCiRSCgSiaigoECtra1hHD+r2CKQF1t473nJ0cv63DunoaFh0+tTU1MfeLurV6/6zs5Of/LkyQd+vPtdnw07vcXDbHWrXAh7i/s99rs3Wlxc9Ldu3drxs7BFgC0CG1ts+3OuW78WueCc8+y97s6PC8EWd2OLAFsENrZw272OH1ECAEwicAAAkwgcAMAkAgcAMInAAQBMInAAAJMIHADAJAIHADCJwAEATCJwAACTCBwAwCQCBwAwicABAEwicAAAkwgcAMAkAgcAMInAAQBMInAAAJN2hX2AfBKNRmecc58I+xyPg2g0uuac4wssscXd2CLAFoFoNDrzKNc57322zwIAQOj46g
AAYBKBAwCYROAAACYROACASQQOAGASgQMAmETgAAAmETgAgEkEDgBgEoEDAJhE4AAAJhE4AIBJBA4AYBKBAwCYROAAACYROACASQQOAGASgQMAmETgAAAmETgAgEkEDgBgEoEDAJj0f7ANJZ6wKyd8AAAAAElFTkSuQmCC\n",
427 | "text/plain": [
428 | ""
429 | ]
430 | },
431 | "metadata": {
432 | "needs_background": "light"
433 | },
434 | "output_type": "display_data"
435 | }
436 | ],
437 | "source": [
438 | "figure_3_5_prove()"
439 | ]
440 | },
441 | {
442 | "cell_type": "markdown",
443 | "metadata": {},
444 | "source": [
445 | "**如上,我写的代码,所产出的最优动作结果与书上结果一致。**\n",
446 | "\n",
447 | "2020-1-11 21:33:57"
448 | ]
449 | }
457 | ],
458 | "metadata": {
459 | "kernelspec": {
460 | "display_name": "Python 3",
461 | "language": "python",
462 | "name": "python3"
463 | },
464 | "language_info": {
465 | "codemirror_mode": {
466 | "name": "ipython",
467 | "version": 3
468 | },
469 | "file_extension": ".py",
470 | "mimetype": "text/x-python",
471 | "name": "python",
472 | "nbconvert_exporter": "python",
473 | "pygments_lexer": "ipython3",
474 | "version": "3.7.0"
475 | },
476 | "toc": {
477 | "base_numbering": 1,
478 | "nav_menu": {},
479 | "number_sections": true,
480 | "sideBar": true,
481 | "skip_h1_title": false,
482 | "title_cell": "Table of Contents",
483 | "title_sidebar": "Contents",
484 | "toc_cell": false,
485 | "toc_position": {},
486 | "toc_section_display": true,
487 | "toc_window_display": false
488 | },
489 | "varInspector": {
490 | "cols": {
491 | "lenName": 16,
492 | "lenType": 16,
493 | "lenVar": 40
494 | },
495 | "kernels_config": {
496 | "python": {
497 | "delete_cmd_postfix": "",
498 | "delete_cmd_prefix": "del ",
499 | "library": "var_list.py",
500 | "varRefreshCmd": "print(var_dic_list())"
501 | },
502 | "r": {
503 | "delete_cmd_postfix": ") ",
504 | "delete_cmd_prefix": "rm(",
505 | "library": "var_list.r",
506 | "varRefreshCmd": "cat(var_dic_list()) "
507 | }
508 | },
509 | "types_to_exclude": [
510 | "module",
511 | "function",
512 | "builtin_function_or_method",
513 | "instance",
514 | "_Feature"
515 | ],
516 | "window_display": false
517 | }
518 | },
519 | "nbformat": 4,
520 | "nbformat_minor": 2
521 | }
522 |
--------------------------------------------------------------------------------
/practice/03-01-Grid-World.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 2,
4 | "metadata": {
5 | "language_info": {
6 | "name": "python",
7 | "codemirror_mode": {
8 | "name": "ipython",
9 | "version": 3
10 | },
11 | "version": "3.7.0-final"
12 | },
13 | "orig_nbformat": 2,
14 | "file_extension": ".py",
15 | "mimetype": "text/x-python",
16 | "name": "python",
17 | "npconvert_exporter": "python",
18 | "pygments_lexer": "ipython3",
19 | "version": 3,
20 | "kernelspec": {
21 | "name": "python37064bitbasecondaf1f4ce8bd9ee468caf98567667ef0765",
22 | "display_name": "Python 3.7.0 64-bit ('base': conda)"
23 | }
24 | },
25 | "cells": [
26 | {
27 | "cell_type": "markdown",
28 | "execution_count": null,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "### 网格游戏\n",
33 | "\n",
34 | "我们将讨论:\n",
35 | "- in-place与out-of-place在实现中的区别;\n",
36 | "- 同[02-MDP-and-Bellman-Equation.ipynb](02-MDP-and-Bellman-Equation.ipynb),我们将改进[Zhang的代码](https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter04/grid_world.py),尝试在jupyter中查看最优动作。"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 15,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "import matplotlib\n",
46 | "import matplotlib.pyplot as plt\n",
47 | "import numpy as np\n",
48 | "from matplotlib.table import Table\n",
49 | "%matplotlib inline\n",
50 | "\n",
51 | "WORLD_SIZE = 4\n",
52 | "# left, up, right, down\n",
53 | "ACTIONS = [np.array([0, -1]),\n",
54 | " np.array([-1, 0]),\n",
55 | " np.array([0, 1]),\n",
56 | " np.array([1, 0])]\n",
57 | "ACTION_PROB = 0.25\n",
58 | "\n",
59 | "\n",
60 | "def is_terminal(state):\n",
61 | " x, y = state\n",
62 | " return (x == 0 and y == 0) or (x == WORLD_SIZE - 1 and y == WORLD_SIZE - 1)\n",
63 | "\n",
64 | "\n",
65 | "def step(state, action):\n",
66 | " if is_terminal(state):\n",
67 | " return state, 0\n",
68 | "\n",
69 | " next_state = (np.array(state) + action).tolist()\n",
70 | " x, y = next_state\n",
71 | "\n",
72 | " if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:\n",
73 | " next_state = state\n",
74 | "\n",
75 | " reward = -1\n",
76 | " return next_state, reward\n",
77 | "\n",
78 | "\n",
79 | "def draw_image(image):\n",
80 | " fig, ax = plt.subplots()\n",
81 | " ax.set_axis_off()\n",
82 | " tb = Table(ax, bbox=[0, 0, 1, 1])\n",
83 | "\n",
84 | " nrows, ncols = image.shape\n",
85 | " width, height = 1.0 / ncols, 1.0 / nrows\n",
86 | "\n",
87 | " # Add cells\n",
88 | " for (i, j), val in np.ndenumerate(image):\n",
89 | " tb.add_cell(i, j, width, height, text=val,\n",
90 | " loc='center', facecolor='white')\n",
91 | "\n",
92 | " # Row and column labels...\n",
93 | " for i in range(len(image)):\n",
94 | " tb.add_cell(i, -1, width, height, text=i+1, loc='right',\n",
95 | " edgecolor='none', facecolor='none')\n",
96 | " tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',\n",
97 | " edgecolor='none', facecolor='none')\n",
98 | " ax.add_table(tb)\n",
99 | "\n",
100 | "\n",
101 | "def compute_state_value(in_place=True, discount=1.0):\n",
102 | " new_state_values = np.zeros((WORLD_SIZE, WORLD_SIZE))\n",
103 | " iteration = 0\n",
104 | " while True:\n",
105 | " if in_place:\n",
106 | " # 在 in place 下, state_values 和 new_state_values 是同一个数组,不同名字\n",
107 | " state_values = new_state_values\n",
108 | " else:\n",
109 | " state_values = new_state_values.copy()\n",
110 | " old_state_values = state_values.copy()\n",
111 | "\n",
112 | " for i in range(WORLD_SIZE):\n",
113 | " for j in range(WORLD_SIZE):\n",
114 | " value = 0\n",
115 | " for action in ACTIONS:\n",
116 | " (next_i, next_j), reward = step([i, j], action)\n",
117 | " value += ACTION_PROB * (reward + discount * state_values[next_i, next_j])\n",
118 | " # 在 in place 下,对 new_state_values 进行更新即对 state_values 更新;\n",
119 | " # in place 下,对 state_values 立即更新,即会对本次迭代中的其他状态产生影响\n",
120 | " # 否则,使用 out of place 迭代策略,本次迭代中贝尔曼公式用到的 state_values 都来自上次迭代\n",
121 | " new_state_values[i, j] = value\n",
122 | "\n",
123 | " max_delta_value = abs(old_state_values - new_state_values).max()\n",
124 | " if max_delta_value < 1e-4:\n",
125 | " break\n",
126 | "\n",
127 | " iteration += 1\n",
128 | "\n",
129 | " return new_state_values, iteration\n",
130 | "\n",
131 | "\n",
132 | "def figure_4_1():\n",
133 | " # While the author suggests using in-place iterative policy evaluation,\n",
134 | " # Figure 4.1 actually uses out-of-place version.\n",
135 | " _, asycn_iteration = compute_state_value(in_place=True)\n",
136 | " values, sync_iteration = compute_state_value(in_place=False)\n",
137 | " draw_image(np.round(values, decimals=2))\n",
138 | " print('In-place: {} iterations'.format(asycn_iteration))\n",
139 | " print('Synchronous: {} iterations'.format(sync_iteration))\n",
140 | "\n",
141 | " plt.plot()"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 16,
147 | "metadata": {},
148 | "outputs": [
149 | {
150 | "name": "stdout",
151 | "output_type": "stream",
152 | "text": "In-place: 113 iterations\nSynchronous: 172 iterations\n"
153 | },
154 | {
155 | "data": {
156 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAbQAAAEUCAYAAABDKMOoAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAGNRJREFUeJzt3G9MlOee//HP5QGdJofin5T9KdjAMIRY7IAlCHI2m+3JptHV7AMFhZTqA0lMukk1u80mPxNrNvGB9c+vtpH4ZNluNnXdJ7vJGDAmp0qz7bQ9NK1Ho6nZBIthgLWJtnSrQEC+vwdSKs4gUzpww8X7ldwJzH1dzXc+zpnPDPed48xMAAAsdEuCHgAAgEyg0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAwwTn3z865b51z14OeZaFzzq11znU45752zt1wzu0PeqaFzDkXcs51Oueujuf5j0HPtNA5537jnLvinGsLepZModDwuH+RtDnoITwxKunvzWydpBpJf+uceyHgmRayYUm/N7NySRWSNjvnagKeaaHbL+nroIfIJAoNE8zsvyTdC3oOH5hZv5l9Nf7z/+rRG0d+sFMtXPbIj+O/Zo8fFuBIC5pzrkDSVkn/FPQsmUShAbPMOVcoaYOkPwY7ycI2/ieyP0n6VtIfzIw8Z+6UpH+QNBb0IJlEoQGzyDn3W0n/IemAmf0Q9DwLmZk9NLMKSQWSNjrn1gc900LknNsm6Vsz+zLoWTKNQgNmiXMuW4/K7KyZ/WfQ8/jCzL6X9JG43jtTv5P0N865bkn/Lun3zrkPgh0pMyg0YBY455ykVklfm9n/C3qehc4595xzbvn4z89I+itJN4OdamEys/9rZgVmViipQdJlM2sKeKyMoNAwwTl3TtJnkkqdcwnn3N6gZ1rAfifpNT369Pun8eOvgx5qAVstqcM5d03SF3p0Dc2b282RGc6MG4UAAAsf39AAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeyAp6gMXkmWee+Z+hoaE/C3oOX4RCobGhoSE+lGUAWWYWeWZWKBS6Mzg4+H+mW8f/9dUccs4ZeWeOc07kmRlkmVnkmVnjebrp1vEJAgDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKbQacc//snPvWOXc96Fl+jYsXL6q0tFSRSERHjx5NOj88PKxdu3YpEomourpa3d3dcz/kPHbz5k1t2rRJy5Yt04kTJ5LOP3z4UBs2bNC2bdtS7iffyc6ePatoNKpoNKra2lpdvXp14tx0r1WJPB83VZY9PT16+eWXtW7dOpWVlendd99Nud/M9MYbbygSiSgajeqrr76ay/Fnzsw4fuEh6S8kvSTp+i/cZ/PF6OiohcNh6+rqsuHhYYtGo3bjxo1Ja1paWmzfvn1mZnbu3DnbuXNnEKNOKeg879y5Y52dnXbw4EE7fvx40vmTJ09aY2Ojbd26NeX++ZRv0FmamcXjcbt3756ZmV24cME2btxoZum9Vs3I83FTZdnX12dffvmlmZn98MMPVlJSkjLL9vZ227x5s42Njdlnn302sT8o43lO+x7LN7QZMLP/knQv6Dl+jc7OTkUiEYXDYS1dulQNDQ2KxWKT1sRiMe3Zs0eSVFdXp0uXLv1UzJCUl5enqqoqZWdnJ51LJBJqb29Xc3PzlPvJd7La2lqtWLFCklRTU6NEIiEpvdeqRJ6PmyrL1atX66WXXpIk5eTkaN26dert7U3aH4vFtHv3bjnnVFNTo++//179/f1z9wRmiEJbpHp7e7V27dqJ3wsKCpJe2I+vycrKUm5uru7evTuncy5UBw4c0LFjx7RkydT/EyPfqbW2tmrLli2S0nutPrmOPH/2eJaP6+7u1pUrV1RdXZ10Lt3M55usoAdAMFJ9cnXO/eI1SNbW1qa8vDxVVlbqo48+mnId+abW0dGh1tZWffLJJ5LSz4k8kz2Z5U9+/PFH7dixQ6dOndKzzz6btG+hZsk3tEWqoKBAPT09E78nEgmtWbNmyjWjo6MaGBjQypUr53TO+aalpUUVFRWqqKhQX19fyjXxeFznz59XYWGhGhoadPnyZTU1NSWtI9/kPK9du6bm5mbFYjGtWrVKUnqv1SfXLcY808lSkkZGRrRjxw69+uqr2r59e8r/VrqZzzvpXGjjSHmDR6EW8E0hIyMjVlRUZLdu3Zq40H79+vVJa06fPj3pInt9fX0Qo05pvuR5+PDhlDeFmJl1dHRMeVPIfMp3PmR5+/ZtKy4utng8PunxdF6rZuT5uKmyHBsbs9dee83279//1P1tbW2TbgqpqqqazXGnpTRvCgm8GBbiIemcpH5JI5ISkvamuW/6f7k51N7ebiUlJRYOh+3IkSNmZnbo0CGLxWJmZjY4OGh1dXVWXFxsVVVV1tXVFeS4SYLOs7+/3/Lz8y0nJ8dyc3MtPz/fBgYGJq15stDma75BZ2lmtnfvXlu+fLmVl5dbeXm5VVZWTpxL9Vo1I8+pTJXlxx9/bJLsxRdfnDjX3t5uZmZnzpyxM2fOmNmj4nv99dctHA7b+vXr7YsvvgjsuZilX2ju0VrMBeeckXfmOOdEnplBlplFnpk1nue0F/G4hgYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwQlbQAywmoVBozDnHh4gMCYVCcs4FPYYXyDKzyDOzQqHQWDrrnJnN9iwY55wz8s4c55zIMzPIMrPIM7PG85z2EwLfFgAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNBmwDm31jnX4Zz72jl3wzm3P+iZ0nHz5k1t2rRJy5Yt04kTJ5LOP3z4UBs2bNC2bdtS7h8eHtauXbsUiURUXV2t7u7uWZ54fntanu+8847Kysq0fv16NTY2amhoKGk/eU529uxZRaNRRaNR1dbW6urVqxPnLl68qNLSUkUiER09ejTlfvL82aLN0sw4fuEhabWkl8Z/zpH035JeSGOfBenOnTvW2dlpBw8etOPHjyedP3nypDU2NtrWrVtT7m9pabF9+/aZmdm5c+ds586dszrvdOZrnolEwgoLC+3BgwdmZlZfX2/vv/9+0v75lGfQWZqZxeNxu3fvnpmZXbhwwTZu3Gh
mZqOjoxYOh62rq8uGh4ctGo3ajRs3kvaT5898ytJsIs9p35v5hjYDZtZvZl+N//y/kr6WlB/sVNPLy8tTVVWVsrOzk84lEgm1t7erubl5yv2xWEx79uyRJNXV1enSpUs/FfWi9LQ8R0dHNTg4qNHRUT148EBr1qxJWkOek9XW1mrFihWSpJqaGiUSCUlSZ2enIpGIwuGwli5dqoaGBsVisaT95PmzxZolhfYrOecKJW2Q9MdgJ/l1Dhw4oGPHjmnJkqlfEr29vVq7dq0kKSsrS7m5ubp79+5cjbhg5Ofn680339Tzzz+v1atXKzc3V6+88krSOvKcWmtrq7Zs2SJpck6SVFBQoN7e3qQ95JnaYsqSQvsVnHO/lfQfkg6Y2Q9BzzNTbW1tysvLU2Vl5VPXpfqE5pybrbEWrO+++06xWEzffPON+vr6dP/+fX3wwQdJ68gztY6ODrW2turtt9+WlH5O5JlssWVJoc2Qcy5bj8rsrJn9Z9DzTKWlpUUVFRWqqKhQX19fyjXxeFznz59XYWGhGhoadPnyZTU1NSWtKygoUE9Pj6RHf1IbGBjQypUrZ3X++SadPD/88EMVFRXpueeeU3Z2trZv365PP/00aR15Jud57do1NTc3KxaLadWqVZIm5yQ9+vN4qj/hLvY8yVLcFDKTQ5KT9K+STv3CfUkXO4Nw+PDhlDeFmJl1dHRMeVPI6dOnJ10orq+vn7UZ0zFf8/z888/thRdesPv379vY2Jjt3r3b3nvvvaR98ynP+ZDl7du3rbi42OLx+KTHR0ZGrKioyG7dujVxI8P169eT9pPnz3zK0iz9m0ICL4eFeEj6c0km6ZqkP40ff53GvjT+6WZPf3+/5efnW05OjuXm5lp+fr4NDAxMWvNkoR06dMhisZiZmQ0ODlpdXZ0VFxdbVVWVdXV1zen8T5rPeb711ltWWlpqZWVl1tTUZENDQ2Y2f/MMOkszs71799ry5cutvLzcysvLrbKycuJce3u7lZSUWDgctiNHjkw8Tp6p+ZSlWfqF5h6txVxwzhl5Z45zTuSZGWSZWeSZWeN5TnsRj2toAAAvUGgAAC9QaAAAL1BoAAAvUGgAAC9QaAAAL1BoAAAvUGgAAC9QaAAAL1BoAAAvUGgAAC9QaAAAL1BoAAAvUGgAAC9QaAAAL1BoAAAvUGgAAC9QaAAAL1BoAAAvUGgAAC9QaAAAL1BoAAAvUGgAAC9kBT3AYhIKhcacc3yIyJBQKCTnXNBjeIEsM4s8MysUCo2ls86Z2WzPgnHOOSPvzHHOiTwzgywzizwzazzPaT8h8G0BAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKDQDgBQoNAOAFCg0A4AUKbQaccyHnXKdz7qpz7oZz7h+DnikdZ8+eVTQaVTQaVW1tra5evTpx7uLFiyotLVUkEtHRo0dT7h8eHtauXbsUiURUXV2t7u7uOZp8fiLPzLp586Y2bdqkZcuW6cSJE5POvfPOOyorK9P69evV2NiooaGhpP3k+bOnZSlJDx8+1IYNG7Rt27aU+xdslmbG8QsPSU7Sb8d/zpb0R0k1aeyzIMXjcbt3756ZmV24cME2btxoZmajo6MWDoetq6vLhoeHLRqN2o0bN5L2t7S02L59+8zM7Ny5c7Zz5865Gz4F8sycoLM0M7tz5451dnbawYMH7fjx4xOPJxIJKywstAcPHpiZWX19vb3//vtJ+8nzZ1Nl+ZOTJ09aY2Ojbd26NeX++ZSl2USe07438w1tBsYz/nH81+zxwwIcKS21tbVasWKFJKmmpkaJREKS1NnZqUgkonA4rKVLl6qhoUGxWCxpfywW0549eyRJdXV1unTp0k9FvSiRZ2bl5eWpqqpK2dnZSedGR0c1ODio0dFRPXjwQGvWrElaQ54/e1qWiURC7e3tam5unnL/Qs2SQpsh59xvnHN/kvStpD+Y2R+DnumXaG1t1ZYtWyRJvb29Wrt27cS5goIC9fb2Ju15fF1WVpZyc3N19+7duRl4niPP2ZOfn68333xTzz//vFavXq3c3Fy98sorSevIMz0HDhzQsWPHtGTJ1G//CzVLCm2GzOyhmVVIKpC00Tm3PuiZ0tXR0aHW1la9/fbbkpTyk5dzLumxdNctNuQ5u7777jvFYjF988036uvr0/379/XBBx8krSPP6bW1tSkvL0+VlZVPXbdQs6TQfiUz+17SR5I2BzxKSi0tLaqoqFBFRYX6+vp07do1NTc3KxaLadWqVZIefYPo6emZ2JNIJFL+SefxdaOjoxoYGNDKlSvn5onME+SZWU/mmcqHH36ooqIiPffcc8rOztb27dv16aefJq1b7Hmmk2U8Htf58+dVWFiohoYGXb58WU1NTUnrFmyW6Vxo40i6ueM5ScvHf35G0seStqWxz4J0+/ZtKy4utng8PunxkZERKyoqslu3bk3cxHD9+vWk/adPn550obi+vn5O5p4KeWZO0Fk+7vDhw5NuZPj888/thRdesPv379vY2Jjt3r3b3nvvvaR95JnsySwf19HRMeVNIfMpS7P0bwoJvBwW4iEpKumKpGuSrkt6K8190//LzaK9e/fa8uXLrby83MrLy62ysnLiXHt7u5WUlFg4HLYjR45MPH7o0CGLxWJmZjY4OGh1dXVWXFxsVVVV1tXVNefP4XHkmTlBZ2lm1t/fb/n5+ZaTk2O5ubmWn59vAwMDZmb21ltvWWlpqZWVlVlTU5MNDQ2ZGXlO5WlZ/uTJQpuvWZqlX2ju0VrMBeeckXfmOOdEnplBlplFnpk1nue0F/G4hgYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwQlbQAywmoVBozDnHh4gMCYVCcs4FPYYXyDKzyDOzQqHQWDrrnJnN9iwY55wz8s4c55zIMzPIMrPIM7PG85z2EwLfFgAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0AAAXqDQAABeoNAAAF6g0H4F59xvnHNXnHNtQc+SjrNnzyoajSoajaq2tlZXr16VJPX09Ojll1/WunXrVFZWpnfffTflfjPTG2+8oUgkomg0qq+++moux593pspTki5evKjS0lJFIhEdPXo05f7h4WHt2rVLkUhE1dXV6u7unqPJ56ebN29q06ZNWrZsmU6cOJF0/uHDh9qwYYO2bduWcj95Tm2616M32ZkZxwwPSX8n6d8ktaW53oIUj8ft3r17ZmZ24cIF27hxo5mZ9fX12ZdffmlmZj/88IOVlJTYjRs3kva3t7fb5s2bbWxszD
777LOJ/UGZr3mOjo5aOBy2rq4uGx4etmg0mjLPlpYW27dvn5mZnTt3znbu3Dl3wz8h6CzNzO7cuWOdnZ128OBBO378eNL5kydPWmNjo23dujXlfvJMLZ3X43zKLpXxPKd9j+Ub2gw55wokbZX0T0HPkq7a2lqtWLFCklRTU6NEIiFJWr16tV566SVJUk5OjtatW6fe3t6k/bFYTLt375ZzTjU1Nfr+++/V398/d09gnpkqz87OTkUiEYXDYS1dulQNDQ2KxWJJ+2OxmPbs2SNJqqur06VLl3764LMo5eXlqaqqStnZ2UnnEomE2tvb1dzcPOV+8kwtndejL9lRaDN3StI/SBoLepCZaG1t1ZYtW5Ie7+7u1pUrV1RdXZ10rre3V2vXrp34vaCgIGXxLUaP55luTo+vy8rKUm5uru7evTs3Ay8wBw4c0LFjx7RkydRvWeSZWjqvR1+yywp6gIXIObdN0rdm9qVz7i+DnueX6ujoUGtrqz755JNJj//444/asWOHTp06pWeffTZpX6pPbM65WZtzoXgyz3RzIs/0tLW1KS8vT5WVlfroo4+mXEeeqaWTiy/Z8Q1tZn4n6W+cc92S/l3S751zHwQ7UmotLS2qqKhQRUWF+vr6dO3aNTU3NysWi2nVqlUT60ZGRrRjxw69+uqr2r59e8r/VkFBgXp6eiZ+TyQSWrNmzaw/h/kknTzTzenxdaOjoxoYGNDKlSvn5onME0/mmUo8Htf58+dVWFiohoYGXb58WU1NTUnryDO1dF6P3mSXzoU2jqfe6PGXWiA3hdy+fduKi4stHo9PenxsbMxee+01279//1P3t7W1TboppKqqajbHndZ8zXNkZMSKiors1q1bExfhr1+/nrT/9OnTky7E19fXz8ncqQSd5eMOHz6c8qYQM7OOjo4pbwohz9TSeT3Op+xSUZo3hQReCAv9WEiFtnfvXlu+fLmVl5dbeXm5VVZWmpnZxx9/bJLsxRdfnDjX3t5uZmZnzpyxM2fOmNmj4nv99dctHA7b+vXr7YsvvgjsuZgF/6YxVZ5mj+4ILSkpsXA4bEeOHJl4/NChQxaLxczMbHBw0Orq6qy4uNiqqqqsq6trzp/DT4LO0sysv7/f8vPzLScnx3Jzcy0/P98GBgYmrXmy0MgzPalej/M1u1TSLTT3aC3mgnPOyDtznHMiz8wgy8wiz8waz3Pai3pcQwMAeIFCAwB4gUIDAHiBQgMAeIFCAwB4gUIDAHiBQgMAeIFCAwB4gUIDAHiBQgMAeIFCAwB4gUIDAHiBQgMAeIFCAwB4gUIDAHiBQgMAeIFCAwB4gUIDAHiBQgMAeIFCAwB4gUIDAHiBQgMAeIFCAwB4ISvoARaTUCh0xzn3Z0HP4YtQKDTmnONDWQaQZWaRZ2aFQqE76axzZjbbswAAMOv4BAEA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8AKFBgDwAoUGAPAChQYA8ML/B09G1MZ0t/bBAAAAAElFTkSuQmCC\n",
157 | "image/svg+xml": "<SVG figure data omitted>",
158 | "text/plain": ""
159 | },
160 | "metadata": {
161 | "needs_background": "light"
162 | },
163 | "output_type": "display_data"
164 | }
165 | ],
166 | "source": [
167 | "figure_4_1()"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "execution_count": null,
173 | "metadata": {},
174 | "outputs": [],
175 | "source": [
176 | "**Note:**\n",
177 | "- the difference between the in-place and out-of-place implementations above;\n",
178 | "- the difference in iteration speed between in-place and out-of-place."
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "execution_count": null,
184 | "metadata": {},
185 | "outputs": [],
186 | "source": [
187 | "The process above can be implemented in two ways. The first uses two arrays: one holds $v_{k+1}(s)$ and the other holds $v_k(s)$. In this case, updating one state's value does not affect the updates of the other states within the same iteration, because $v_k(s)$ stays fixed throughout the round; this scheme is therefore independent of the update order.\n",
188 | "\n",
189 | "The alternative uses a single array and updates in place: as soon as a new value is computed, it immediately overwrites the state's old value, so within one iteration some updates may be influenced by other states' already-updated values. In-place updating still converges to $v_\\pi$, and usually does so faster, so the in-place method is what we typically use.\n",
190 | "\n",
191 | "In each iteration we update the value of every state in the state space; this is called a sweep. For the in-place version, the order of the sweep strongly affects the rate of convergence. (A standalone sketch of both schemes follows this notebook.)"
192 | ]
193 | }
194 | ]
195 | }
--------------------------------------------------------------------------------
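
The in-place versus two-array distinction discussed at the end of the notebook above is easy to verify directly. Below is a minimal standalone sketch (an editor's addition, not a cell from the notebook; the toy chain environment and all names are illustrative) that runs iterative policy evaluation both ways and counts sweeps to convergence:

```python
import numpy as np

# Toy chain: states 0..4 with terminals at both ends, reward -1 per step,
# and an equiprobable left/right policy. Terminal values stay 0.
N, THETA = 5, 1e-6

def evaluate(in_place):
    v = np.zeros(N)
    sweeps = 0
    while True:
        # Two-array (out-of-place): read from a frozen copy of v.
        # In-place: read from v itself, so fresh values are used immediately.
        src = v if in_place else v.copy()
        delta = 0.0
        for s in range(1, N - 1):
            new = 0.5 * (-1 + src[s - 1]) + 0.5 * (-1 + src[s + 1])
            delta = max(delta, abs(new - v[s]))
            v[s] = new  # writes always go into v
        sweeps += 1
        if delta < THETA:
            return v, sweeps

v_out, sweeps_out = evaluate(in_place=False)
v_in, sweeps_in = evaluate(in_place=True)
print(sweeps_out, sweeps_in)  # the in-place variant typically needs fewer sweeps
```

Both runs converge to the same $v_\pi$; only the number of sweeps differs, which is exactly the speed gap the notebook asks the reader to notice.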
/practice/07-02-Expectation-vs-Sample.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This program helps us understand what an expected update is."
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "ExecuteTime": {
15 | "end_time": "2020-02-01T13:53:04.858054Z",
16 | "start_time": "2020-02-01T13:53:02.439031Z"
17 | }
18 | },
19 | "outputs": [],
20 | "source": [
21 | "#######################################################################\n",
22 | "# Copyright (C) #\n",
23 | "# 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #\n",
24 | "# Permission given to modify the code as long as you keep this #\n",
25 | "# declaration at the top #\n",
26 | "#######################################################################\n",
27 | "\n",
28 | "import numpy as np\n",
29 | "import matplotlib\n",
30 | "%matplotlib inline\n",
31 | "import matplotlib.pyplot as plt\n",
32 | "from tqdm import tqdm\n",
33 | "\n",
34 | "# for figure 8.7, run a simulation of 2 * @b steps\n",
35 | "def b_steps(b):\n",
36 | " # set the value of the next b states\n",
37 | " # it is not clear how to set this\n",
38 | " distribution = np.random.randn(b)\n",
39 | "\n",
40 | " # true value of the current state\n",
41 | " true_v = np.mean(distribution)\n",
42 | "\n",
43 | " samples = []\n",
44 | " errors = []\n",
45 | "\n",
46 | " # sample 2b steps\n",
47 | " for t in range(2 * b):\n",
48 | " v = np.random.choice(distribution)\n",
49 | " samples.append(v)\n",
50 | " errors.append(np.abs(np.mean(samples) - true_v))\n",
51 | "\n",
52 | " return errors"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 2,
58 | "metadata": {
59 | "ExecuteTime": {
60 | "end_time": "2020-02-01T13:53:41.152569Z",
61 | "start_time": "2020-02-01T13:53:25.756005Z"
62 | }
63 | },
64 | "outputs": [
65 | {
66 | "name": "stderr",
67 | "output_type": "stream",
68 | "text": [
69 | "100%|████████████████| 100/100 [00:00<00:00, 4359.48it/s]\n",
70 | "100%|████████████████| 100/100 [00:00<00:00, 3458.39it/s]\n",
71 | "100%|█████████████████| 100/100 [00:00<00:00, 257.87it/s]\n",
72 | "100%|██████████████████| 100/100 [00:14<00:00, 6.98it/s]\n"
73 | ]
74 | },
75 | {
76 | "data": {
77 | "image/png": "<base64-encoded PNG data omitted: RMS error vs. number of computations for b = 2, 10, 100, 1000>\n",
78 | "text/plain": [
79 | ""
80 | ]
81 | },
82 | "metadata": {
83 | "needs_background": "light"
84 | },
85 | "output_type": "display_data"
86 | }
87 | ],
88 | "source": [
89 | "runs = 100\n",
90 | "branch = [2, 10, 100, 1000]\n",
91 | "for b in branch:\n",
92 | " errors = np.zeros((runs, 2 * b))\n",
93 | " for r in tqdm(np.arange(runs)):\n",
94 | " errors[r] = b_steps(b)\n",
95 | " errors = errors.mean(axis=0)\n",
96 | " x_axis = (np.arange(len(errors)) + 1) / float(b)\n",
97 | " \"\"\"\n",
98 | "    Why np.zeros((runs, 2 * b))? Each run records 2*b errors, which are averaged over runs.\n",
99 | "    To plot curves for different b on a shared 0, b, 2b axis,\n",
100 | "    x_axis rescales the horizontal coordinates into units of b.\n",
101 | " \"\"\"\n",
102 | " plt.plot(x_axis, errors, label='b = %d' % (b))\n",
103 | "\n",
104 | "plt.xlabel('number of computations')\n",
105 | "plt.xticks([0, 1.0, 2.0], ['0', 'b', '2b'])\n",
106 | "plt.ylabel('RMS error')\n",
107 | "plt.legend()\n",
108 | "\n",
109 | "plt.show()"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "- For an expected update, b computations are needed before the error drops to 0;\n",
117 | "- but when b is large, sample updates can bring the error close to 0 with far fewer than b computations (see the sketch following this notebook)."
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": null,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": []
126 | }
127 | ],
128 | "metadata": {
129 | "kernelspec": {
130 | "display_name": "Python 3",
131 | "language": "python",
132 | "name": "python3"
133 | },
134 | "language_info": {
135 | "codemirror_mode": {
136 | "name": "ipython",
137 | "version": 3
138 | },
139 | "file_extension": ".py",
140 | "mimetype": "text/x-python",
141 | "name": "python",
142 | "nbconvert_exporter": "python",
143 | "pygments_lexer": "ipython3",
144 | "version": "3.7.0"
145 | },
146 | "toc": {
147 | "base_numbering": 1,
148 | "nav_menu": {},
149 | "number_sections": true,
150 | "sideBar": true,
151 | "skip_h1_title": false,
152 | "title_cell": "Table of Contents",
153 | "title_sidebar": "Contents",
154 | "toc_cell": false,
155 | "toc_position": {},
156 | "toc_section_display": true,
157 | "toc_window_display": false
158 | },
159 | "varInspector": {
160 | "cols": {
161 | "lenName": 16,
162 | "lenType": 16,
163 | "lenVar": 40
164 | },
165 | "kernels_config": {
166 | "python": {
167 | "delete_cmd_postfix": "",
168 | "delete_cmd_prefix": "del ",
169 | "library": "var_list.py",
170 | "varRefreshCmd": "print(var_dic_list())"
171 | },
172 | "r": {
173 | "delete_cmd_postfix": ") ",
174 | "delete_cmd_prefix": "rm(",
175 | "library": "var_list.r",
176 | "varRefreshCmd": "cat(var_dic_list()) "
177 | }
178 | },
179 | "types_to_exclude": [
180 | "module",
181 | "function",
182 | "builtin_function_or_method",
183 | "instance",
184 | "_Feature"
185 | ],
186 | "window_display": false
187 | }
188 | },
189 | "nbformat": 4,
190 | "nbformat_minor": 2
191 | }
192 |
--------------------------------------------------------------------------------
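
The takeaway above has a known analytic counterpart: for a state with branching factor b, the RMS error of a sample-average estimate after t sample updates falls roughly as √((b−1)/(bt)) (the rate given in Sutton and Barto's discussion of expected vs. sample updates), while a single expected update costs b computations and only then drops the error to 0. The following minimal sketch (an editor's addition, not a notebook cell; it reuses the same simplification as b_steps) makes the contrast concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
b = 1000
successor_values = rng.normal(size=b)   # values of the b next states

# Expected update: one full pass over all b branches; error becomes exactly 0.
target = successor_values.mean()

# Sample updates: maintain a running sample average over randomly drawn branches.
estimate = 0.0
for t in range(1, b + 1):
    v = rng.choice(successor_values)
    estimate += (v - estimate) / t      # incremental mean update
    if t in (b // 10, b // 2, b):
        print(f"t={t:4d}  |error|={abs(estimate - target):.4f}")
# After b // 10 samples the error is already small, whereas the expected
# update would still be only a tenth of the way through its single pass.
```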
/practice/images/03-02-policy-iteration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/03-02-policy-iteration.png
--------------------------------------------------------------------------------
/practice/images/03-03-value-iteration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/03-03-value-iteration.png
--------------------------------------------------------------------------------
/practice/images/03-04-generalized-policy-iteration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/03-04-generalized-policy-iteration.png
--------------------------------------------------------------------------------
/practice/images/03_grid_world.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/03_grid_world.jpg
--------------------------------------------------------------------------------
/practice/images/04-01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/04-01.png
--------------------------------------------------------------------------------
/practice/images/04-02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/04-02.png
--------------------------------------------------------------------------------
/practice/images/04-03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/04-03.png
--------------------------------------------------------------------------------
/practice/images/05-01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/05-01.png
--------------------------------------------------------------------------------
/practice/images/05-02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/05-02.png
--------------------------------------------------------------------------------
/practice/images/05-03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/05-03.png
--------------------------------------------------------------------------------
/practice/images/05-04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/05-04.png
--------------------------------------------------------------------------------
/practice/images/05-05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/05-05.png
--------------------------------------------------------------------------------
/practice/images/06-01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/06-01.png
--------------------------------------------------------------------------------
/practice/images/06-02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/06-02.png
--------------------------------------------------------------------------------
/practice/images/06-03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/06-03.png
--------------------------------------------------------------------------------
/practice/images/06-04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/06-04.png
--------------------------------------------------------------------------------
/practice/images/06-05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/06-05.png
--------------------------------------------------------------------------------
/practice/images/07-01.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/07-01.jpg
--------------------------------------------------------------------------------
/practice/images/07-02.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/07-02.jpg
--------------------------------------------------------------------------------
/practice/images/07-03.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/07-03.jpg
--------------------------------------------------------------------------------
/practice/images/07-04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/07-04.png
--------------------------------------------------------------------------------
/practice/images/07-05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/07-05.png
--------------------------------------------------------------------------------
/practice/images/example_13_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/example_13_1.png
--------------------------------------------------------------------------------
/practice/images/figure_8_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/practice/images/figure_8_5.png
--------------------------------------------------------------------------------
/resources/Approximate Dynamic Programming by Powell 2nd edition.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/resources/Approximate Dynamic Programming by Powell 2nd edition.pdf
--------------------------------------------------------------------------------
/resources/DEEP REINFORCEMENT LEARNING arXiv_1810.06339.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/resources/DEEP REINFORCEMENT LEARNING arXiv_1810.06339.pdf
--------------------------------------------------------------------------------
/resources/RL-An Introdction exercises Solution.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/resources/RL-An Introdction exercises Solution.pdf
--------------------------------------------------------------------------------
/resources/RL-An Introdction notes.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/resources/RL-An Introdction notes.pdf
--------------------------------------------------------------------------------
/resources/Reinforcement Learning - An Introduction 2018.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PiperLiu/Reinforcement-Learning-practice-zh/960bbe30ca09d971d26459c47fbad54a1f5543de/resources/Reinforcement Learning - An Introduction 2018.pdf
--------------------------------------------------------------------------------