├── LICENSE ├── PER-in-RL ├── .gitignore ├── README.md ├── binary_heap.py ├── rank_based.py ├── rl │ ├── __init__.py │ ├── binary_heap.py │ ├── neural_q_learner.py │ ├── pg_actor_critic.py │ ├── pg_ddpg.py │ ├── pg_ddpg.py~ │ ├── pg_reinforce.py │ ├── rank_based.py │ ├── replay_buffer.py │ ├── tabular_q_learner.py │ └── utility.py ├── run_actor_critic_acrobot.py ├── run_cem_cartpole.py ├── run_ddpg_mujoco.py ├── run_dqn_cartpole.py ├── run_ql_cartpole.py ├── run_reinforce_cartpole.py └── utility.py └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 HOU Yuenan 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /PER-in-RL/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | 55 | # Sphinx documentation 56 | docs/_build/ 57 | 58 | # PyBuilder 59 | target/ 60 | 61 | #Ipython Notebook 62 | .ipynb_checkpoints 63 | -------------------------------------------------------------------------------- /PER-in-RL/README.md: -------------------------------------------------------------------------------- 1 | # Tensorflow-Reinforce 2 | A collection of [Tensorflow](https://www.tensorflow.org) implementations of reinforcement learning models. Models are evaluated in [OpenAI Gym](https://gym.openai.com) environments. Any contribution/feedback is more than welcome. 
**Disclaimer:** These implementations are used for educational purposes only (i.e., to learn deep RL myself). There is no guarantee that the exact models will work on any of your particular RL problems without changes. 3 | 4 | Models 5 | ------ 6 | | Model | Code | References | 7 | |:------------- |:-------------- |:------------| 8 | | Cross-Entropy Method | [run_cem_cartpole](https://github.com/yukezhu/tensorflow-reinforce/blob/master/run_cem_cartpole.py) | [Cross-entropy method](https://en.wikipedia.org/wiki/Cross-entropy_method) | 9 | | Tabular Q Learning | [rl/tabular_q_learner](https://github.com/yukezhu/tensorflow-reinforce/blob/master/rl/tabular_q_learner.py) | [Sutton and Barto, Chapter 8](http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf) | 10 | | Deep Q Network | [rl/neural_q_learner](https://github.com/yukezhu/tensorflow-reinforce/blob/master/rl/neural_q_learner.py) | [Mnih et al.](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) | 11 | | Double Deep Q Network | [rl/neural_q_learner](https://github.com/yukezhu/tensorflow-reinforce/blob/master/rl/neural_q_learner.py) | [van Hasselt et al.](http://arxiv.org/abs/1509.06461) | 12 | | REINFORCE Policy Gradient | [rl/pg_reinforce](https://github.com/yukezhu/tensorflow-reinforce/blob/master/rl/pg_reinforce.py) | [Sutton et al.](https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf) | 13 | | Actor-critic Policy Gradient | [rl/pg_actor_critic](https://github.com/yukezhu/tensorflow-reinforce/blob/master/rl/pg_actor_critic.py) | [Minh et al.](https://arxiv.org/abs/1602.01783) | 14 | | Deep Deterministic Policy Gradient | [rl/pg_ddpg](https://github.com/yukezhu/tensorflow-reinforce/blob/master/rl/pg_ddpg.py) | [Lillicrap et al.](https://arxiv.org/abs/1509.02971) | 15 | 16 | License 17 | ------- 18 | MIT 19 | -------------------------------------------------------------------------------- /PER-in-RL/binary_heap.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- encoding=utf-8 -*- 3 | # author: Ian 4 | # e-mail: stmayue@gmail.com 5 | # description: 6 | 7 | import sys 8 | import math 9 | 10 | import utility 11 | 12 | 13 | class BinaryHeap(object): 14 | 15 | def __init__(self, priority_size=100, priority_init=None, replace=True): 16 | self.e2p = {} 17 | self.p2e = {} 18 | self.replace = replace 19 | 20 | if priority_init is None: 21 | self.priority_queue = {} 22 | self.size = 0 23 | self.max_size = priority_size 24 | else: 25 | # not yet test 26 | self.priority_queue = priority_init 27 | self.size = len(self.priority_queue) 28 | self.max_size = None or self.size 29 | 30 | experience_list = list(map(lambda x: self.priority_queue[x], self.priority_queue)) 31 | self.p2e = utility.list_to_dict(experience_list) 32 | self.e2p = utility.exchange_key_value(self.p2e) 33 | for i in range(int(self.size / 2), -1, -1): 34 | self.down_heap(i) 35 | 36 | def __repr__(self): 37 | """ 38 | :return: string of the priority queue, with level info 39 | """ 40 | if self.size == 0: 41 | return 'No element in heap!' 
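        # Layout reminder: priority_queue is a dict used as a 1-indexed, array-style max-heap,
        # so node i has its parent at i // 2 and its children at 2 * i and 2 * i + 1; the walk
        # below prints indices 1..size level by level, indenting each level of the tree.
        # Hypothetical usage sketch (values are made up; the calls are the methods defined here):
        #   h = BinaryHeap(priority_size=8)
        #   h.update(0.5, 1); h.update(2.0, 2)   # unknown e_ids fall through to _insert
        #   h.pop()                              # -> (2.0, 2): the root always holds the max priority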
42 | to_string = '' 43 | level = -1 44 | max_level = int (math.floor(math.log(self.size, 2))) 45 | 46 | for i in range(1, self.size + 1): 47 | now_level = int (math.floor(math.log(i, 2))) 48 | if level != now_level: 49 | to_string = to_string + ('\n' if level != -1 else '') \ 50 | + ' ' * (max_level - now_level) 51 | level = now_level 52 | 53 | to_string = to_string + '%.2f ' % self.priority_queue[i][1] + ' ' * (max_level - now_level) 54 | 55 | return to_string 56 | 57 | def check_full(self): 58 | return self.size > self.max_size 59 | 60 | def _insert(self, priority, e_id): 61 | """ 62 | insert new experience id with priority 63 | (maybe don't need get_max_priority and implement it in this function) 64 | :param priority: priority value 65 | :param e_id: experience id 66 | :return: bool 67 | """ 68 | self.size += 1 69 | 70 | if self.check_full() and not self.replace: 71 | sys.stderr.write('Error: no space left to add experience id %d with priority value %f\n' % (e_id, priority)) 72 | return False 73 | else: 74 | self.size = min(self.size, self.max_size) 75 | 76 | self.priority_queue[self.size] = (priority, e_id) 77 | self.p2e[self.size] = e_id 78 | self.e2p[e_id] = self.size 79 | 80 | self.up_heap(self.size) 81 | return True 82 | 83 | def update(self, priority, e_id): 84 | """ 85 | update priority value according its experience id 86 | :param priority: new priority value 87 | :param e_id: experience id 88 | :return: bool 89 | """ 90 | if e_id in self.e2p: 91 | p_id = self.e2p[e_id] 92 | self.priority_queue[p_id] = (priority, e_id) 93 | self.p2e[p_id] = e_id 94 | 95 | self.down_heap(p_id) 96 | self.up_heap(p_id) 97 | return True 98 | else: 99 | # this e id is new, do insert 100 | return self._insert(priority, e_id) 101 | 102 | def get_max_priority(self): 103 | """ 104 | get max priority, if no experience, return 1 105 | :return: max priority if size > 0 else 1 106 | """ 107 | if self.size > 0: 108 | return self.priority_queue[1][0] 109 | else: 110 | return 1 111 | 112 | def pop(self): 113 | """ 114 | pop out the max priority value with its experience id 115 | :return: priority value & experience id 116 | """ 117 | if self.size == 0: 118 | sys.stderr.write('Error: no value in heap, pop failed\n') 119 | return False, False 120 | 121 | pop_priority, pop_e_id = self.priority_queue[1] 122 | self.e2p[pop_e_id] = -1 123 | # replace first 124 | last_priority, last_e_id = self.priority_queue[self.size] 125 | self.priority_queue[1] = (last_priority, last_e_id) 126 | self.size -= 1 127 | self.e2p[last_e_id] = 1 128 | self.p2e[1] = last_e_id 129 | 130 | self.down_heap(1) 131 | 132 | return pop_priority, pop_e_id 133 | 134 | def up_heap(self, i): 135 | """ 136 | upward balance 137 | :param i: tree node i 138 | :return: None 139 | """ 140 | if i > 1: 141 | parent = int (math.floor(i / 2)) 142 | if self.priority_queue[parent][0] < self.priority_queue[i][0]: 143 | tmp = self.priority_queue[i] 144 | self.priority_queue[i] = self.priority_queue[parent] 145 | self.priority_queue[parent] = tmp 146 | # change e2p & p2e 147 | self.e2p[self.priority_queue[i][1]] = i 148 | self.e2p[self.priority_queue[parent][1]] = parent 149 | self.p2e[i] = self.priority_queue[i][1] 150 | self.p2e[parent] = self.priority_queue[parent][1] 151 | # up heap parent 152 | self.up_heap(parent) 153 | 154 | def down_heap(self, i): 155 | """ 156 | downward balance 157 | :param i: tree node i 158 | :return: None 159 | """ 160 | if i < self.size: 161 | greatest = i 162 | left, right = i * 2, i * 2 + 1 163 | if left < self.size and 
self.priority_queue[left][0] > self.priority_queue[greatest][0]: 164 | greatest = left 165 | if right < self.size and self.priority_queue[right][0] > self.priority_queue[greatest][0]: 166 | greatest = right 167 | 168 | if greatest != i: 169 | tmp = self.priority_queue[i] 170 | self.priority_queue[i] = self.priority_queue[greatest] 171 | self.priority_queue[greatest] = tmp 172 | # change e2p & p2e 173 | self.e2p[self.priority_queue[i][1]] = i 174 | self.e2p[self.priority_queue[greatest][1]] = greatest 175 | self.p2e[i] = self.priority_queue[i][1] 176 | self.p2e[greatest] = self.priority_queue[greatest][1] 177 | # down heap greatest 178 | self.down_heap(greatest) 179 | 180 | def get_priority(self): 181 | """ 182 | get all priority value 183 | :return: list of priority 184 | """ 185 | return list(map(lambda x: x[0], self.priority_queue.values()))[0:self.size] 186 | 187 | def get_e_id(self): 188 | """ 189 | get all experience id in priority queue 190 | :return: list of experience ids order by their priority 191 | """ 192 | return list(map(lambda x: x[1], self.priority_queue.values()))[0:self.size] 193 | 194 | def balance_tree(self): 195 | """ 196 | rebalance priority queue 197 | :return: None 198 | """ 199 | sort_array = sorted(self.priority_queue.values(), key=lambda x: x[0], reverse=True) 200 | # reconstruct priority_queue 201 | self.priority_queue.clear() 202 | self.p2e.clear() 203 | self.e2p.clear() 204 | cnt = 1 205 | while cnt <= self.size: 206 | priority, e_id = sort_array[cnt - 1] 207 | self.priority_queue[cnt] = (priority, e_id) 208 | self.p2e[cnt] = e_id 209 | self.e2p[e_id] = cnt 210 | cnt += 1 211 | # sort the heap 212 | for i in range(int (math.floor(self.size / 2)), 1, -1): 213 | self.down_heap(i) 214 | 215 | def priority_to_experience(self, priority_ids): 216 | """ 217 | retrieve experience ids by priority ids 218 | :param priority_ids: list of priority id 219 | :return: list of experience id 220 | """ 221 | return [self.p2e[i] for i in priority_ids] 222 | -------------------------------------------------------------------------------- /PER-in-RL/rank_based.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- encoding=utf-8 -*- 3 | # author: Ian 4 | # e-mail: stmayue@gmail.com 5 | # description: 6 | 7 | import sys 8 | import math 9 | import random 10 | import numpy as np 11 | 12 | import binary_heap 13 | 14 | 15 | class Experience(object): 16 | 17 | def __init__(self, conf): 18 | self.size = conf['size'] 19 | self.replace_flag = conf['replace_old'] if 'replace_old' in conf else True 20 | self.priority_size = conf['priority_size'] if 'priority_size' in conf else self.size 21 | 22 | self.alpha = conf['alpha'] if 'alpha' in conf else 0.7 23 | self.beta_zero = conf['beta_zero'] if 'beta_zero' in conf else 0.5 24 | self.batch_size = conf['batch_size'] if 'batch_size' in conf else 32 25 | self.learn_start = conf['learn_start'] if 'learn_start' in conf else 1000 26 | self.total_steps = conf['steps'] if 'steps' in conf else 100000 27 | # partition number N, split total size to N part 28 | self.partition_num = conf['partition_num'] if 'partition_num' in conf else 100 29 | 30 | self.index = 0 31 | self.record_size = 0 32 | self.isFull = False 33 | 34 | self._experience = {} 35 | self.priority_queue = binary_heap.BinaryHeap(self.priority_size) 36 | self.distributions = self.build_distributions() 37 | 38 | self.beta_grad = (1 - self.beta_zero) / (self.total_steps - self.learn_start) 39 | 40 | def build_distributions(self): 41 | 
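        # Precomputes, for several buffer fill levels n, the rank-based sampling distribution
        #   P(i) = i^(-alpha) / sum_{j=1..n} j^(-alpha)
        # plus batch_size strata of roughly equal probability mass 1 / batch_size, so that
        # sample() can later draw one transition from each segment (stratified sampling).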
""" 42 | preprocess pow of rank 43 | (rank i) ^ (-alpha) / sum ((rank i) ^ (-alpha)) 44 | :return: distributions, dict 45 | """ 46 | res = {} 47 | n_partitions = self.partition_num 48 | partition_num = 1 49 | # each part size 50 | partition_size = int (math.floor(self.size / n_partitions)) 51 | 52 | for n in range(partition_size, self.size + 1, partition_size): 53 | if self.learn_start <= n <= self.priority_size: 54 | distribution = {} 55 | # P(i) = (rank i) ^ (-alpha) / sum ((rank i) ^ (-alpha)) 56 | pdf = list( 57 | map(lambda x: math.pow(x, -self.alpha), range(1, n + 1)) 58 | ) 59 | pdf_sum = math.fsum(pdf) 60 | distribution['pdf'] = list(map(lambda x: x / pdf_sum, pdf)) 61 | # split to k segment, and than uniform sample in each k 62 | # set k = batch_size, each segment has total probability is 1 / batch_size 63 | # strata_ends keep each segment start pos and end pos 64 | cdf = np.cumsum(distribution['pdf']) 65 | strata_ends = {1: 0, self.batch_size + 1: n} 66 | step = 1.0 / self.batch_size 67 | index = 1 68 | for s in range(2, self.batch_size + 1): 69 | while cdf[index] < step: 70 | index += 1 71 | strata_ends[s] = index 72 | step += 1.0 / self.batch_size 73 | 74 | distribution['strata_ends'] = strata_ends 75 | 76 | res[partition_num] = distribution 77 | 78 | partition_num += 1 79 | 80 | return res 81 | 82 | def fix_index(self): 83 | """ 84 | get next insert index 85 | :return: index, int 86 | """ 87 | if self.record_size <= self.size: 88 | self.record_size += 1 89 | if self.index % self.size == 0: 90 | self.isFull = True if len(self._experience) == self.size else False 91 | if self.replace_flag: 92 | self.index = 1 93 | return self.index 94 | else: 95 | sys.stderr.write('Experience replay buff is full and replace is set to FALSE!\n') 96 | return -1 97 | else: 98 | self.index += 1 99 | return self.index 100 | 101 | def store(self, experience): 102 | """ 103 | store experience, suggest that experience is a tuple of (s1, a, r, s2, t) 104 | so each experience is valid 105 | :param experience: maybe a tuple, or list 106 | :return: bool, indicate insert status 107 | """ 108 | insert_index = self.fix_index() 109 | if insert_index > 0: 110 | if insert_index in self._experience: 111 | del self._experience[insert_index] 112 | self._experience[insert_index] = experience 113 | # add to priority queue 114 | priority = self.priority_queue.get_max_priority() 115 | self.priority_queue.update(priority, insert_index) 116 | return True 117 | else: 118 | sys.stderr.write('Insert failed\n') 119 | return False 120 | 121 | def retrieve(self, indices): 122 | """ 123 | get experience from indices 124 | :param indices: list of experience id 125 | :return: experience replay sample 126 | """ 127 | return [self._experience[v] for v in indices] 128 | 129 | def rebalance(self): 130 | """ 131 | rebalance priority queue 132 | :return: None 133 | """ 134 | self.priority_queue.balance_tree() 135 | 136 | def update_priority(self, indices, delta): 137 | """ 138 | update priority according indices and deltas 139 | :param indices: list of experience id 140 | :param delta: list of delta, order correspond to indices 141 | :return: None 142 | """ 143 | for i in range(0, len(indices)): 144 | self.priority_queue.update(math.fabs(delta[i]), indices[i]) 145 | 146 | def sample(self, global_step): 147 | """ 148 | sample a mini batch from experience replay 149 | :param global_step: now training step 150 | :return: experience, list, samples 151 | :return: w, list, weights 152 | :return: rank_e_id, list, samples id, used for update 
priority 153 | """ 154 | if self.record_size < self.learn_start: 155 | sys.stderr.write('Record size less than learn start! Sample failed\n') 156 | return False, False, False 157 | 158 | dist_index = int (math.floor(self.record_size / self.size * self.partition_num)) 159 | # issue 1 by @camigord 160 | partition_size = int (math.floor(self.size / self.partition_num)) 161 | partition_max = dist_index * partition_size 162 | distribution = self.distributions[dist_index] 163 | print distribution 164 | rank_list = [] 165 | # sample from k segments 166 | for n in range(1, self.batch_size + 1): 167 | if distribution['strata_ends'][n] + 1 <= distribution['strata_ends'][n + 1]: 168 | index = random.randint(distribution['strata_ends'][n] + 1, 169 | distribution['strata_ends'][n + 1]) 170 | else: 171 | index = random.randint(distribution['strata_ends'][n + 1], 172 | distribution['strata_ends'][n] + 1) 173 | rank_list.append(index) 174 | 175 | # beta, increase by global_step, max 1 176 | beta = min(self.beta_zero + (global_step - self.learn_start - 1) * self.beta_grad, 1) 177 | # find all alpha pow, notice that pdf is a list, start from 0 178 | alpha_pow = [distribution['pdf'][v - 1] for v in rank_list] 179 | # w = (N * P(i)) ^ (-beta) / max w 180 | w = np.power(np.array(alpha_pow) * partition_max, -beta) 181 | w_max = max(w) 182 | w = np.divide(w, w_max) 183 | # rank list is priority id 184 | # convert to experience id 185 | rank_e_id = self.priority_queue.priority_to_experience(rank_list) 186 | # get experience id according rank_e_id 187 | experience = self.retrieve(rank_e_id) 188 | return experience, w, rank_e_id 189 | -------------------------------------------------------------------------------- /PER-in-RL/rl/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cardwing/Codes-for-RL-Project/e48e29fe0626ce3d6f9ad7c0b8b1fc5886edccbb/PER-in-RL/rl/__init__.py -------------------------------------------------------------------------------- /PER-in-RL/rl/binary_heap.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- encoding=utf-8 -*- 3 | # author: Ian 4 | # e-mail: stmayue@gmail.com 5 | # description: 6 | 7 | import sys 8 | import math 9 | 10 | import utility 11 | 12 | 13 | class BinaryHeap(object): 14 | 15 | def __init__(self, priority_size=100, priority_init=None, replace=True): 16 | self.e2p = {} 17 | self.p2e = {} 18 | self.replace = replace 19 | 20 | if priority_init is None: 21 | self.priority_queue = {} 22 | self.size = 0 23 | self.max_size = priority_size 24 | else: 25 | # not yet test 26 | self.priority_queue = priority_init 27 | self.size = len(self.priority_queue) 28 | self.max_size = None or self.size 29 | 30 | experience_list = list(map(lambda x: self.priority_queue[x], self.priority_queue)) 31 | self.p2e = utility.list_to_dict(experience_list) 32 | self.e2p = utility.exchange_key_value(self.p2e) 33 | for i in range(int(self.size / 2), -1, -1): 34 | self.down_heap(i) 35 | 36 | def __repr__(self): 37 | """ 38 | :return: string of the priority queue, with level info 39 | """ 40 | if self.size == 0: 41 | return 'No element in heap!' 
42 | to_string = '' 43 | level = -1 44 | max_level = int (math.floor(math.log(self.size, 2))) 45 | 46 | for i in range(1, self.size + 1): 47 | now_level = int (math.floor(math.log(i, 2))) 48 | if level != now_level: 49 | to_string = to_string + ('\n' if level != -1 else '') \ 50 | + ' ' * (max_level - now_level) 51 | level = now_level 52 | 53 | to_string = to_string + '%.2f ' % self.priority_queue[i][1] + ' ' * (max_level - now_level) 54 | 55 | return to_string 56 | 57 | def check_full(self): 58 | return self.size > self.max_size 59 | 60 | def _insert(self, priority, e_id): 61 | """ 62 | insert new experience id with priority 63 | (maybe don't need get_max_priority and implement it in this function) 64 | :param priority: priority value 65 | :param e_id: experience id 66 | :return: bool 67 | """ 68 | self.size += 1 69 | 70 | if self.check_full() and not self.replace: 71 | sys.stderr.write('Error: no space left to add experience id %d with priority value %f\n' % (e_id, priority)) 72 | return False 73 | else: 74 | self.size = min(self.size, self.max_size) 75 | 76 | self.priority_queue[self.size] = (priority, e_id) 77 | self.p2e[self.size] = e_id 78 | self.e2p[e_id] = self.size 79 | 80 | self.up_heap(self.size) 81 | return True 82 | 83 | def update(self, priority, e_id): 84 | """ 85 | update priority value according its experience id 86 | :param priority: new priority value 87 | :param e_id: experience id 88 | :return: bool 89 | """ 90 | if e_id in self.e2p: 91 | p_id = self.e2p[e_id] 92 | self.priority_queue[p_id] = (priority, e_id) 93 | self.p2e[p_id] = e_id 94 | 95 | self.down_heap(p_id) 96 | self.up_heap(p_id) 97 | return True 98 | else: 99 | # this e id is new, do insert 100 | return self._insert(priority, e_id) 101 | 102 | def get_max_priority(self): 103 | """ 104 | get max priority, if no experience, return 1 105 | :return: max priority if size > 0 else 1 106 | """ 107 | if self.size > 0: 108 | return self.priority_queue[1][0] 109 | else: 110 | return 1 111 | 112 | def pop(self): 113 | """ 114 | pop out the max priority value with its experience id 115 | :return: priority value & experience id 116 | """ 117 | if self.size == 0: 118 | sys.stderr.write('Error: no value in heap, pop failed\n') 119 | return False, False 120 | 121 | pop_priority, pop_e_id = self.priority_queue[1] 122 | self.e2p[pop_e_id] = -1 123 | # replace first 124 | last_priority, last_e_id = self.priority_queue[self.size] 125 | self.priority_queue[1] = (last_priority, last_e_id) 126 | self.size -= 1 127 | self.e2p[last_e_id] = 1 128 | self.p2e[1] = last_e_id 129 | 130 | self.down_heap(1) 131 | 132 | return pop_priority, pop_e_id 133 | 134 | def up_heap(self, i): 135 | """ 136 | upward balance 137 | :param i: tree node i 138 | :return: None 139 | """ 140 | if i > 1: 141 | parent = int (math.floor(i / 2)) 142 | if self.priority_queue[parent][0] < self.priority_queue[i][0]: 143 | tmp = self.priority_queue[i] 144 | self.priority_queue[i] = self.priority_queue[parent] 145 | self.priority_queue[parent] = tmp 146 | # change e2p & p2e 147 | self.e2p[self.priority_queue[i][1]] = i 148 | self.e2p[self.priority_queue[parent][1]] = parent 149 | self.p2e[i] = self.priority_queue[i][1] 150 | self.p2e[parent] = self.priority_queue[parent][1] 151 | # up heap parent 152 | self.up_heap(parent) 153 | 154 | def down_heap(self, i): 155 | """ 156 | downward balance 157 | :param i: tree node i 158 | :return: None 159 | """ 160 | if i < self.size: 161 | greatest = i 162 | left, right = i * 2, i * 2 + 1 163 | if left < self.size and 
self.priority_queue[left][0] > self.priority_queue[greatest][0]: 164 | greatest = left 165 | if right < self.size and self.priority_queue[right][0] > self.priority_queue[greatest][0]: 166 | greatest = right 167 | 168 | if greatest != i: 169 | tmp = self.priority_queue[i] 170 | self.priority_queue[i] = self.priority_queue[greatest] 171 | self.priority_queue[greatest] = tmp 172 | # change e2p & p2e 173 | self.e2p[self.priority_queue[i][1]] = i 174 | self.e2p[self.priority_queue[greatest][1]] = greatest 175 | self.p2e[i] = self.priority_queue[i][1] 176 | self.p2e[greatest] = self.priority_queue[greatest][1] 177 | # down heap greatest 178 | self.down_heap(greatest) 179 | 180 | def get_priority(self): 181 | """ 182 | get all priority value 183 | :return: list of priority 184 | """ 185 | return list(map(lambda x: x[0], self.priority_queue.values()))[0:self.size] 186 | 187 | def get_e_id(self): 188 | """ 189 | get all experience id in priority queue 190 | :return: list of experience ids order by their priority 191 | """ 192 | return list(map(lambda x: x[1], self.priority_queue.values()))[0:self.size] 193 | 194 | def balance_tree(self): 195 | """ 196 | rebalance priority queue 197 | :return: None 198 | """ 199 | sort_array = sorted(self.priority_queue.values(), key=lambda x: x[0], reverse=True) 200 | # reconstruct priority_queue 201 | self.priority_queue.clear() 202 | self.p2e.clear() 203 | self.e2p.clear() 204 | cnt = 1 205 | while cnt <= self.size: 206 | priority, e_id = sort_array[cnt - 1] 207 | self.priority_queue[cnt] = (priority, e_id) 208 | self.p2e[cnt] = e_id 209 | self.e2p[e_id] = cnt 210 | cnt += 1 211 | # sort the heap 212 | for i in range(int (math.floor(self.size / 2)), 1, -1): 213 | self.down_heap(i) 214 | 215 | def priority_to_experience(self, priority_ids): 216 | """ 217 | retrieve experience ids by priority ids 218 | :param priority_ids: list of priority id 219 | :return: list of experience id 220 | """ 221 | return [self.p2e[i] for i in priority_ids] 222 | -------------------------------------------------------------------------------- /PER-in-RL/rl/neural_q_learner.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow as tf 4 | from replay_buffer import ReplayBuffer 5 | 6 | class NeuralQLearner(object): 7 | 8 | def __init__(self, session, 9 | optimizer, 10 | q_network, 11 | state_dim, 12 | num_actions, 13 | batch_size=32, 14 | init_exp=0.5, # initial exploration prob 15 | final_exp=0.1, # final exploration prob 16 | anneal_steps=10000, # N steps for annealing exploration 17 | replay_buffer_size=10000, 18 | store_replay_every=5, # how frequent to store experience 19 | discount_factor=0.9, # discount future rewards 20 | target_update_rate=0.01, 21 | reg_param=0.01, # regularization constants 22 | max_gradient=5, # max gradient norms 23 | double_q_learning=False, 24 | summary_writer=None, 25 | summary_every=100): 26 | 27 | # tensorflow machinery 28 | self.session = session 29 | self.optimizer = optimizer 30 | self.summary_writer = summary_writer 31 | 32 | # model components 33 | self.q_network = q_network 34 | self.replay_buffer = ReplayBuffer(buffer_size=replay_buffer_size) 35 | 36 | # Q learning parameters 37 | self.batch_size = batch_size 38 | self.state_dim = state_dim 39 | self.num_actions = num_actions 40 | self.exploration = init_exp 41 | self.init_exp = init_exp 42 | self.final_exp = final_exp 43 | self.anneal_steps = anneal_steps 44 | self.discount_factor = discount_factor 45 | 
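        # target_update_rate is the soft-update coefficient: each training step the target
        # network moves toward the online Q-network as target <- target - rate * (target - online),
        # i.e. an exponential moving average of the online weights (see update_target_network below).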
self.target_update_rate = target_update_rate 46 | self.double_q_learning = double_q_learning 47 | 48 | # training parameters 49 | self.max_gradient = max_gradient 50 | self.reg_param = reg_param 51 | 52 | # counters 53 | self.store_replay_every = store_replay_every 54 | self.store_experience_cnt = 0 55 | self.train_iteration = 0 56 | 57 | # create and initialize variables 58 | self.create_variables() 59 | var_lists = tf.get_collection(tf.GraphKeys.VARIABLES) 60 | self.session.run(tf.initialize_variables(var_lists)) 61 | 62 | # make sure all variables are initialized 63 | self.session.run(tf.assert_variables_initialized()) 64 | 65 | if self.summary_writer is not None: 66 | # graph was not available when journalist was created 67 | self.summary_writer.add_graph(self.session.graph) 68 | self.summary_every = summary_every 69 | 70 | def create_variables(self): 71 | # compute action from a state: a* = argmax_a Q(s_t,a) 72 | with tf.name_scope("predict_actions"): 73 | # raw state representation 74 | self.states = tf.placeholder(tf.float32, (None, self.state_dim), name="states") 75 | # initialize Q network 76 | with tf.variable_scope("q_network"): 77 | self.q_outputs = self.q_network(self.states) 78 | # predict actions from Q network 79 | self.action_scores = tf.identity(self.q_outputs, name="action_scores") 80 | tf.histogram_summary("action_scores", self.action_scores) 81 | self.predicted_actions = tf.argmax(self.action_scores, dimension=1, name="predicted_actions") 82 | 83 | # estimate rewards using the next state: r(s_t,a_t) + argmax_a Q(s_{t+1}, a) 84 | with tf.name_scope("estimate_future_rewards"): 85 | self.next_states = tf.placeholder(tf.float32, (None, self.state_dim), name="next_states") 86 | self.next_state_mask = tf.placeholder(tf.float32, (None,), name="next_state_masks") 87 | 88 | if self.double_q_learning: 89 | # reuse Q network for action selection 90 | with tf.variable_scope("q_network", reuse=True): 91 | self.q_next_outputs = self.q_network(self.next_states) 92 | self.action_selection = tf.argmax(tf.stop_gradient(self.q_next_outputs), 1, name="action_selection") 93 | tf.histogram_summary("action_selection", self.action_selection) 94 | self.action_selection_mask = tf.one_hot(self.action_selection, self.num_actions, 1, 0) 95 | # use target network for action evaluation 96 | with tf.variable_scope("target_network"): 97 | self.target_outputs = self.q_network(self.next_states) * tf.cast(self.action_selection_mask, tf.float32) 98 | self.action_evaluation = tf.reduce_sum(self.target_outputs, reduction_indices=[1,]) 99 | tf.histogram_summary("action_evaluation", self.action_evaluation) 100 | self.target_values = self.action_evaluation * self.next_state_mask 101 | else: 102 | # initialize target network 103 | with tf.variable_scope("target_network"): 104 | self.target_outputs = self.q_network(self.next_states) 105 | # compute future rewards 106 | self.next_action_scores = tf.stop_gradient(self.target_outputs) 107 | self.target_values = tf.reduce_max(self.next_action_scores, reduction_indices=[1,]) * self.next_state_mask 108 | tf.histogram_summary("next_action_scores", self.next_action_scores) 109 | 110 | self.rewards = tf.placeholder(tf.float32, (None,), name="rewards") 111 | self.future_rewards = self.rewards + self.discount_factor * self.target_values 112 | 113 | # compute loss and gradients 114 | with tf.name_scope("compute_temporal_differences"): 115 | # compute temporal difference loss 116 | self.action_mask = tf.placeholder(tf.float32, (None, self.num_actions), name="action_mask") 
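            # The one-hot action_mask picks out Q(s, a) for the action actually taken; the TD error
            # below is Q(s, a) - (r + gamma * max_a' Q_target(s', a')) (or the Double-DQN variant
            # when double_q_learning is set), and its mean square becomes td_loss.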
117 | self.masked_action_scores = tf.reduce_sum(self.action_scores * self.action_mask, reduction_indices=[1,]) 118 | self.temp_diff = self.masked_action_scores - self.future_rewards 119 | self.td_loss = tf.reduce_mean(tf.square(self.temp_diff)) 120 | # regularization loss 121 | q_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="q_network") 122 | self.reg_loss = self.reg_param * tf.reduce_sum([tf.reduce_sum(tf.square(x)) for x in q_network_variables]) 123 | # compute total loss and gradients 124 | self.loss = self.td_loss + self.reg_loss 125 | gradients = self.optimizer.compute_gradients(self.loss) 126 | # clip gradients by norm 127 | for i, (grad, var) in enumerate(gradients): 128 | if grad is not None: 129 | gradients[i] = (tf.clip_by_norm(grad, self.max_gradient), var) 130 | # add histograms for gradients. 131 | for grad, var in gradients: 132 | tf.histogram_summary(var.name, var) 133 | if grad is not None: 134 | tf.histogram_summary(var.name + '/gradients', grad) 135 | self.train_op = self.optimizer.apply_gradients(gradients) 136 | 137 | # update target network with Q network 138 | with tf.name_scope("update_target_network"): 139 | self.target_network_update = [] 140 | # slowly update target network parameters with Q network parameters 141 | q_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="q_network") 142 | target_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="target_network") 143 | for v_source, v_target in zip(q_network_variables, target_network_variables): 144 | # this is equivalent to target = (1-alpha) * target + alpha * source 145 | update_op = v_target.assign_sub(self.target_update_rate * (v_target - v_source)) 146 | self.target_network_update.append(update_op) 147 | self.target_network_update = tf.group(*self.target_network_update) 148 | 149 | # scalar summaries 150 | tf.scalar_summary("td_loss", self.td_loss) 151 | tf.scalar_summary("reg_loss", self.reg_loss) 152 | tf.scalar_summary("total_loss", self.loss) 153 | tf.scalar_summary("exploration", self.exploration) 154 | 155 | self.summarize = tf.merge_all_summaries() 156 | self.no_op = tf.no_op() 157 | 158 | def storeExperience(self, state, action, reward, next_state, done): 159 | # always store end states 160 | if self.store_experience_cnt % self.store_replay_every == 0 or done: 161 | self.replay_buffer.add(state, action, reward, next_state, done) 162 | self.store_experience_cnt += 1 163 | 164 | def eGreedyAction(self, states, explore=True): 165 | if explore and self.exploration > random.random(): 166 | return random.randint(0, self.num_actions-1) 167 | else: 168 | return self.session.run(self.predicted_actions, {self.states: states})[0] 169 | 170 | def annealExploration(self, stategy='linear'): 171 | ratio = max((self.anneal_steps - self.train_iteration)/float(self.anneal_steps), 0) 172 | self.exploration = (self.init_exp - self.final_exp) * ratio + self.final_exp 173 | 174 | def updateModel(self): 175 | # not enough experiences yet 176 | if self.replay_buffer.count() < self.batch_size: 177 | return 178 | 179 | batch = self.replay_buffer.getBatch(self.batch_size) 180 | states = np.zeros((self.batch_size, self.state_dim)) 181 | rewards = np.zeros((self.batch_size,)) 182 | action_mask = np.zeros((self.batch_size, self.num_actions)) 183 | next_states = np.zeros((self.batch_size, self.state_dim)) 184 | next_state_mask = np.zeros((self.batch_size,)) 185 | 186 | for k, (s0, a, r, s1, done) in enumerate(batch): 187 | states[k] = s0 188 | 
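            # unpack (s0, a, r, s1, done) from the sampled batch; next_state_mask is left at 0 for
            # terminal transitions, so their bootstrap target collapses to the immediate reward r.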
rewards[k] = r 189 | action_mask[k][a] = 1 190 | # check terminal state 191 | if not done: 192 | next_states[k] = s1 193 | next_state_mask[k] = 1 194 | 195 | # whether to calculate summaries 196 | calculate_summaries = self.train_iteration % self.summary_every == 0 and self.summary_writer is not None 197 | 198 | # perform one update of training 199 | cost, _, summary_str = self.session.run([ 200 | self.loss, 201 | self.train_op, 202 | self.summarize if calculate_summaries else self.no_op 203 | ], { 204 | self.states: states, 205 | self.next_states: next_states, 206 | self.next_state_mask: next_state_mask, 207 | self.action_mask: action_mask, 208 | self.rewards: rewards 209 | }) 210 | 211 | # update target network using Q-network 212 | self.session.run(self.target_network_update) 213 | 214 | # emit summaries 215 | if calculate_summaries: 216 | self.summary_writer.add_summary(summary_str, self.train_iteration) 217 | 218 | self.annealExploration() 219 | self.train_iteration += 1 220 | -------------------------------------------------------------------------------- /PER-in-RL/rl/pg_actor_critic.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow as tf 4 | 5 | class PolicyGradientActorCritic(object): 6 | 7 | def __init__(self, session, 8 | optimizer, 9 | actor_network, 10 | critic_network, 11 | state_dim, 12 | num_actions, 13 | init_exp=0.1, # initial exploration prob 14 | final_exp=0.0, # final exploration prob 15 | anneal_steps=1000, # N steps for annealing exploration 16 | discount_factor=0.99, # discount future rewards 17 | reg_param=0.001, # regularization constants 18 | max_gradient=5, # max gradient norms 19 | summary_writer=None, 20 | summary_every=100): 21 | 22 | # tensorflow machinery 23 | self.session = session 24 | self.optimizer = optimizer 25 | self.summary_writer = summary_writer 26 | 27 | # model components 28 | self.actor_network = actor_network 29 | self.critic_network = critic_network 30 | 31 | # training parameters 32 | self.state_dim = state_dim 33 | self.num_actions = num_actions 34 | self.discount_factor = discount_factor 35 | self.max_gradient = max_gradient 36 | self.reg_param = reg_param 37 | 38 | # exploration parameters 39 | self.exploration = init_exp 40 | self.init_exp = init_exp 41 | self.final_exp = final_exp 42 | self.anneal_steps = anneal_steps 43 | 44 | # counters 45 | self.train_iteration = 0 46 | 47 | # rollout buffer 48 | self.state_buffer = [] 49 | self.reward_buffer = [] 50 | self.action_buffer = [] 51 | 52 | # create and initialize variables 53 | self.create_variables() 54 | var_lists = tf.get_collection(tf.GraphKeys.VARIABLES) 55 | self.session.run(tf.initialize_variables(var_lists)) 56 | 57 | # make sure all variables are initialized 58 | self.session.run(tf.assert_variables_initialized()) 59 | 60 | if self.summary_writer is not None: 61 | # graph was not available when journalist was created 62 | self.summary_writer.add_graph(self.session.graph) 63 | self.summary_every = summary_every 64 | 65 | def resetModel(self): 66 | self.cleanUp() 67 | self.train_iteration = 0 68 | self.exploration = self.init_exp 69 | var_lists = tf.get_collection(tf.GraphKeys.VARIABLES) 70 | self.session.run(tf.initialize_variables(var_lists)) 71 | 72 | def create_variables(self): 73 | 74 | with tf.name_scope("model_inputs"): 75 | # raw state representation 76 | self.states = tf.placeholder(tf.float32, (None, self.state_dim), name="states") 77 | 78 | # rollout action based on current 
policy 79 | with tf.name_scope("predict_actions"): 80 | # initialize actor-critic network 81 | with tf.variable_scope("actor_network"): 82 | self.policy_outputs = self.actor_network(self.states) 83 | with tf.variable_scope("critic_network"): 84 | self.value_outputs = self.critic_network(self.states) 85 | 86 | # predict actions from policy network 87 | self.action_scores = tf.identity(self.policy_outputs, name="action_scores") 88 | # Note 1: tf.multinomial is not good enough to use yet 89 | # so we don't use self.predicted_actions for now 90 | self.predicted_actions = tf.multinomial(self.action_scores, 1) 91 | 92 | # get variable list 93 | actor_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="actor_network") 94 | critic_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="critic_network") 95 | 96 | # compute loss and gradients 97 | with tf.name_scope("compute_pg_gradients"): 98 | # gradients for selecting action from policy network 99 | self.taken_actions = tf.placeholder(tf.int32, (None,), name="taken_actions") 100 | self.discounted_rewards = tf.placeholder(tf.float32, (None,), name="discounted_rewards") 101 | 102 | with tf.variable_scope("actor_network", reuse=True): 103 | self.logprobs = self.actor_network(self.states) 104 | 105 | with tf.variable_scope("critic_network", reuse=True): 106 | self.estimated_values = self.critic_network(self.states) 107 | 108 | # compute policy loss and regularization loss 109 | self.cross_entropy_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(self.logprobs, self.taken_actions) 110 | self.pg_loss = tf.reduce_mean(self.cross_entropy_loss) 111 | self.actor_reg_loss = tf.reduce_sum([tf.reduce_sum(tf.square(x)) for x in actor_network_variables]) 112 | self.actor_loss = self.pg_loss + self.reg_param * self.actor_reg_loss 113 | 114 | # compute actor gradients 115 | self.actor_gradients = self.optimizer.compute_gradients(self.actor_loss, actor_network_variables) 116 | # compute advantages A(s) = R - V(s) 117 | self.advantages = tf.reduce_sum(self.discounted_rewards - self.estimated_values) 118 | # compute policy gradients 119 | for i, (grad, var) in enumerate(self.actor_gradients): 120 | if grad is not None: 121 | self.actor_gradients[i] = (grad * self.advantages, var) 122 | 123 | # compute critic gradients 124 | self.mean_square_loss = tf.reduce_mean(tf.square(self.discounted_rewards - self.estimated_values)) 125 | self.critic_reg_loss = tf.reduce_sum([tf.reduce_sum(tf.square(x)) for x in critic_network_variables]) 126 | self.critic_loss = self.mean_square_loss + self.reg_param * self.critic_reg_loss 127 | self.critic_gradients = self.optimizer.compute_gradients(self.critic_loss, critic_network_variables) 128 | 129 | # collect all gradients 130 | self.gradients = self.actor_gradients + self.critic_gradients 131 | 132 | # clip gradients 133 | for i, (grad, var) in enumerate(self.gradients): 134 | # clip gradients by norm 135 | if grad is not None: 136 | self.gradients[i] = (tf.clip_by_norm(grad, self.max_gradient), var) 137 | 138 | # summarize gradients 139 | for grad, var in self.gradients: 140 | tf.histogram_summary(var.name, var) 141 | if grad is not None: 142 | tf.histogram_summary(var.name + '/gradients', grad) 143 | 144 | # emit summaries 145 | tf.histogram_summary("estimated_values", self.estimated_values) 146 | tf.scalar_summary("actor_loss", self.actor_loss) 147 | tf.scalar_summary("critic_loss", self.critic_loss) 148 | tf.scalar_summary("reg_loss", self.actor_reg_loss + self.critic_reg_loss) 
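            # In short: the actor gradients above are scaled by the advantage A(s) = R - V(s), so
            # actions whose discounted return beats the critic's estimate are reinforced, while the
            # critic is regressed toward R with an L2 penalty; both gradient sets are clipped by norm
            # and applied together in the training update below.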
149 | 150 | # training update 151 | with tf.name_scope("train_actor_critic"): 152 | # apply gradients to update actor network 153 | self.train_op = self.optimizer.apply_gradients(self.gradients) 154 | 155 | self.summarize = tf.merge_all_summaries() 156 | self.no_op = tf.no_op() 157 | 158 | def sampleAction(self, states): 159 | # TODO: use this code piece when tf.multinomial gets better 160 | # sample action from current policy 161 | # actions = self.session.run(self.predicted_actions, {self.states: states})[0] 162 | # return actions[0] 163 | 164 | # temporary workaround 165 | def softmax(y): 166 | """ simple helper function here that takes unnormalized logprobs """ 167 | maxy = np.amax(y) 168 | e = np.exp(y - maxy) 169 | return e / np.sum(e) 170 | 171 | # epsilon-greedy exploration strategy 172 | if random.random() < self.exploration: 173 | return random.randint(0, self.num_actions-1) 174 | else: 175 | action_scores = self.session.run(self.action_scores, {self.states: states})[0] 176 | action_probs = softmax(action_scores) - 1e-5 177 | action = np.argmax(np.random.multinomial(1, action_probs)) 178 | return action 179 | 180 | def updateModel(self): 181 | 182 | N = len(self.reward_buffer) 183 | r = 0 # use discounted reward to approximate Q value 184 | 185 | # compute discounted future rewards 186 | discounted_rewards = np.zeros(N) 187 | for t in reversed(xrange(N)): 188 | # future discounted reward from now on 189 | r = self.reward_buffer[t] + self.discount_factor * r 190 | discounted_rewards[t] = r 191 | 192 | # whether to calculate summaries 193 | calculate_summaries = self.train_iteration % self.summary_every == 0 and self.summary_writer is not None 194 | 195 | # update policy network with the rollout in batches 196 | for t in xrange(N-1): 197 | 198 | # prepare inputs 199 | states = self.state_buffer[t][np.newaxis, :] 200 | actions = np.array([self.action_buffer[t]]) 201 | rewards = np.array([discounted_rewards[t]]) 202 | 203 | # perform one update of training 204 | _, summary_str = self.session.run([ 205 | self.train_op, 206 | self.summarize if calculate_summaries else self.no_op 207 | ], { 208 | self.states: states, 209 | self.taken_actions: actions, 210 | self.discounted_rewards: rewards 211 | }) 212 | 213 | # emit summaries 214 | if calculate_summaries: 215 | self.summary_writer.add_summary(summary_str, self.train_iteration) 216 | 217 | self.annealExploration() 218 | self.train_iteration += 1 219 | 220 | # clean up 221 | self.cleanUp() 222 | 223 | def annealExploration(self, stategy='linear'): 224 | ratio = max((self.anneal_steps - self.train_iteration)/float(self.anneal_steps), 0) 225 | self.exploration = (self.init_exp - self.final_exp) * ratio + self.final_exp 226 | 227 | def storeRollout(self, state, action, reward): 228 | self.action_buffer.append(action) 229 | self.reward_buffer.append(reward) 230 | self.state_buffer.append(state) 231 | 232 | def cleanUp(self): 233 | self.state_buffer = [] 234 | self.reward_buffer = [] 235 | self.action_buffer = [] 236 | -------------------------------------------------------------------------------- /PER-in-RL/rl/pg_ddpg.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow as tf 4 | from replay_buffer import ReplayBuffer 5 | import rank_based 6 | 7 | 8 | class DeepDeterministicPolicyGradient(object): 9 | 10 | def __init__(self, session, 11 | optimizer, 12 | actor_network, 13 | critic_network, 14 | state_dim, 15 | action_dim, 16 | batch_size=32, 17 | 
replay_buffer_size=10000, # size of replay buffer 18 | store_replay_every=1, # how frequent to store experience 19 | discount_factor=0.99, # discount future rewards 20 | target_update_rate=0.01, 21 | reg_param=0.01, # regularization constants 22 | max_gradient=5, # max gradient norms 23 | noise_sigma=0.20, 24 | noise_theta=0.15, 25 | summary_writer=None, 26 | summary_every=100): 27 | 28 | # tensorflow machinery 29 | self.session = session 30 | self.optimizer = optimizer 31 | self.summary_writer = summary_writer 32 | 33 | # model components 34 | self.actor_network = actor_network 35 | self.critic_network = critic_network 36 | self.replay_buffer = ReplayBuffer(buffer_size=replay_buffer_size) 37 | 38 | # training parameters 39 | self.batch_size = batch_size 40 | self.state_dim = state_dim 41 | self.action_dim = action_dim 42 | self.discount_factor = discount_factor 43 | self.target_update_rate = target_update_rate 44 | self.max_gradient = max_gradient 45 | self.reg_param = reg_param 46 | 47 | # Ornstein-Uhlenbeck noise for exploration 48 | self.noise_var = tf.Variable(tf.zeros([1, action_dim])) 49 | noise_random = tf.random_normal([1, action_dim], stddev=noise_sigma) 50 | self.noise = self.noise_var.assign_sub((noise_theta) * self.noise_var - noise_random) 51 | 52 | # counters 53 | self.store_replay_every = store_replay_every 54 | self.store_experience_cnt = 0 55 | self.train_iteration = 0 56 | 57 | # create and initialize variables 58 | self.create_variables() 59 | var_lists = tf.get_collection(tf.GraphKeys.VARIABLES) 60 | self.session.run(tf.initialize_variables(var_lists)) 61 | 62 | # make sure all variables are initialized 63 | self.session.run(tf.assert_variables_initialized()) 64 | 65 | if self.summary_writer is not None: 66 | # graph was not available when journalist was created 67 | self.summary_writer.add_graph(self.session.graph) 68 | self.summary_every = summary_every 69 | 70 | def create_variables(self): 71 | 72 | with tf.name_scope("model_inputs"): 73 | # raw state representation 74 | self.states = tf.placeholder(tf.float32, (None, self.state_dim), name="states") 75 | # action input used by critic network 76 | self.action = tf.placeholder(tf.float32, (None, self.action_dim), name="action") 77 | 78 | # define outputs from the actor and the critic 79 | with tf.name_scope("predict_actions"): 80 | # initialize actor-critic network 81 | with tf.variable_scope("actor_network"): 82 | self.policy_outputs = self.actor_network(self.states) 83 | with tf.variable_scope("critic_network"): 84 | self.value_outputs = self.critic_network(self.states, self.action) 85 | self.action_gradients = tf.gradients(self.value_outputs, self.action)[0] 86 | 87 | # predict actions from policy network 88 | self.predicted_actions = tf.identity(self.policy_outputs, name="predicted_actions") 89 | # tf.histogram_summary("predicted_actions", self.predicted_actions) 90 | # tf.histogram_summary("action_scores", self.value_outputs) 91 | 92 | # get variable list 93 | actor_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="actor_network") 94 | critic_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="critic_network") 95 | 96 | # estimate rewards using the next state: r + argmax_a Q'(s_{t+1}, u'(a)) 97 | with tf.name_scope("estimate_future_rewards"): 98 | self.next_states = tf.placeholder(tf.float32, (None, self.state_dim), name="next_states") 99 | self.next_state_mask = tf.placeholder(tf.float32, (None,), name="next_state_mask")#s") 100 | self.rewards = 
tf.placeholder(tf.float32, (None,), name="rewards") 101 | 102 | # initialize target network 103 | with tf.variable_scope("target_actor_network"): 104 | self.target_actor_outputs = self.actor_network(self.next_states) 105 | with tf.variable_scope("target_critic_network"): 106 | self.target_critic_outputs = self.critic_network(self.next_states, self.target_actor_outputs) 107 | 108 | # compute future rewards 109 | self.next_action_scores = tf.stop_gradient(self.target_critic_outputs)[:,0] * self.next_state_mask 110 | # tf.histogram_summary("next_action_scores", self.next_action_scores) 111 | self.future_rewards = self.rewards + self.discount_factor * self.next_action_scores 112 | 113 | # compute loss and gradients 114 | with tf.name_scope("compute_pg_gradients"): 115 | self.w_id=tf.placeholder(tf.float32, (None,), name="w_id") 116 | self.w =tf.reshape(self.w_id, [32,1]) 117 | #self.ww=tf.shape(self.w_id) 118 | # compute gradients for critic network 119 | self.temp_diff = self.value_outputs[:,0] - self.future_rewards#TD-error 120 | #self.temp = tf.diag(tf.matmul(tf.square(self.temp_diff), self.w_id, False, True)) 121 | #tf.transpose 122 | #self.mean_square_loss = tf.reduce_mean(tf.square(self.temp_diff))#tf.square(self.temp_diff)) 123 | self.tem=tf.reshape(tf.square(self.temp_diff),[32,1]) 124 | self.temp=tf.matmul(tf.transpose(self.tem), self.w)/self.batch_size 125 | #self.dim=tf.shape(self.temp) 126 | #self.temp=tf.reshape(self.temp,[1,1]) 127 | self.mean_square_loss = tf.reduce_mean(self.temp) 128 | #self.mean_square_loss = tf.reduce_mean(tf.square(self.temp_diff))#self.temp/self.batch_size#tf.reduce_mean(self.temp) 129 | self.critic_reg_loss = tf.reduce_sum(tf.stack([tf.reduce_sum(tf.square(x)) for x in critic_network_variables])) 130 | self.critic_loss = self.mean_square_loss + self.reg_param * self.critic_reg_loss 131 | self.critic_gradients = self.optimizer.compute_gradients(self.critic_loss, critic_network_variables) 132 | 133 | # compute actor gradients (we don't do weight decay for actor network) 134 | self.q_action_grad = tf.placeholder(tf.float32, (None, self.action_dim), name="q_action_grad") 135 | actor_policy_gradients = tf.gradients(self.policy_outputs, actor_network_variables, -self.q_action_grad) 136 | self.actor_gradients = zip(actor_policy_gradients, actor_network_variables) 137 | 138 | # collect all gradients 139 | self.gradients = self.actor_gradients + self.critic_gradients 140 | 141 | # clip gradients 142 | for i, (grad, var) in enumerate(self.gradients): 143 | # clip gradients by norm 144 | if grad is not None: 145 | self.gradients[i] = (tf.clip_by_norm(grad, self.max_gradient), var) 146 | 147 | # summarize gradients 148 | '''for grad, var in self.gradients: 149 | tf.histogram_summary(var.name, var) 150 | if grad is not None: 151 | tf.histogram_summary(var.name + '/gradients', grad)''' 152 | 153 | # emit summaries 154 | # tf.scalar_summary("critic_loss", self.critic_loss) 155 | # tf.scalar_summary("critic_td_loss", self.mean_square_loss) 156 | # tf.scalar_summary("critic_reg_loss", self.critic_reg_loss) 157 | 158 | # apply gradients to update actor network 159 | self.train_op = self.optimizer.apply_gradients(self.gradients) 160 | 161 | # update target network with Q network 162 | with tf.name_scope("update_target_network"): 163 | self.target_network_update = [] 164 | 165 | # slowly update target network parameters with the actor network parameters 166 | actor_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="actor_network") 167 | 
target_actor_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="target_actor_network") 168 | for v_source, v_target in zip(actor_network_variables, target_actor_network_variables): 169 | # this is equivalent to target = (1-alpha) * target + alpha * source 170 | update_op = v_target.assign_sub(self.target_update_rate * (v_target - v_source)) 171 | self.target_network_update.append(update_op) 172 | 173 | # same for the critic network 174 | critic_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="critic_network") 175 | target_critic_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="target_critic_network") 176 | for v_source, v_target in zip(critic_network_variables, target_critic_network_variables): 177 | # this is equivalent to target = (1-alpha) * target + alpha * source 178 | update_op = v_target.assign_sub(self.target_update_rate * (v_target - v_source)) 179 | self.target_network_update.append(update_op) 180 | 181 | # group all assignment operations together 182 | self.target_network_update = tf.group(*self.target_network_update) 183 | 184 | # self.summarize = tf.merge_all_summaries() 185 | self.no_op = tf.no_op() 186 | 187 | def sampleAction(self, states, exploration=True): 188 | policy_outs, ou_noise = self.session.run([ 189 | self.policy_outputs, 190 | self.noise 191 | ], { 192 | self.states: states 193 | }) 194 | # add OU noise for exploration 195 | policy_outs = policy_outs + ou_noise if exploration else policy_outs 196 | return policy_outs 197 | 198 | def updateModel(self, recorder): 199 | 200 | # not enough experiences yet 201 | if self.replay_buffer.count() < self.batch_size*20: 202 | return 203 | 204 | batch, w_id, eid= self.replay_buffer.getBatch(self.batch_size) 205 | states = np.zeros((self.batch_size, self.state_dim)) 206 | rewards = np.zeros((self.batch_size,)) 207 | actions = np.zeros((self.batch_size, self.action_dim)) 208 | next_states = np.zeros((self.batch_size, self.state_dim)) 209 | next_state_mask = np.zeros((self.batch_size,)) 210 | e_id = eid 211 | 212 | for i in range(32): 213 | recorder[e_id[i]%10000]=recorder[e_id[i]%10000]+1 214 | 215 | 216 | for k, (s0, a, r, s1, done) in enumerate(batch): 217 | states[k] = s0 218 | rewards[k] = r 219 | actions[k] = a 220 | if not done: 221 | next_states[k] = s1 222 | next_state_mask[k] = 1 223 | 224 | # whether to calculate summaries 225 | # calculate_summaries = self.train_iteration % self.summary_every == 0 and self.summary_writer is not None 226 | 227 | # compute a = u(s) 228 | policy_outs = self.session.run(self.policy_outputs, { 229 | self.states: states 230 | }) 231 | 232 | # compute d_a Q(s,a) where s=s_i, a=u(s) 233 | action_grads = self.session.run(self.action_gradients, { 234 | self.states: states, 235 | self.action: policy_outs 236 | }) 237 | 238 | critic_loss, _, delta = self.session.run([ 239 | self.critic_loss, 240 | self.train_op, 241 | # self.summarize if calculate_summaries else self.no_op, 242 | self.temp_diff 243 | ], { 244 | self.states: states, 245 | self.next_states: next_states, 246 | self.next_state_mask: next_state_mask, 247 | self.action: actions, 248 | self.rewards: rewards, 249 | self.q_action_grad: action_grads, 250 | self.w_id: w_id 251 | }) 252 | self.error=delta 253 | # print temp1, self.train_iteration 254 | # update target network using Q-network 255 | self.session.run(self.target_network_update) 256 | 257 | # emit summaries 258 | # if calculate_summaries: 259 | # self.summary_writer.add_summary(summary_str, 
self.train_iteration) 260 | #self.replay_memory.update_priority(self.e_id, self.error) 261 | # if self.time_step % 100 ==0: 262 | # self.replay_memory.rebalance() 263 | self.replay_buffer.update_priority(e_id, self.error) 264 | self.train_iteration += 1 265 | if self.train_iteration % 100 == 0: 266 | self.replay_buffer.rebalance() 267 | 268 | return self.error 269 | 270 | def storeExperience(self, state, action, reward, next_state, done): 271 | # always store end states 272 | if self.store_experience_cnt % self.store_replay_every == 0 or done: 273 | self.replay_buffer.add(state, action, reward, next_state, done) 274 | self.store_experience_cnt += 1 275 | --------------------------------------------------------------------------------
/PER-in-RL/rl/pg_reinforce.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow as tf 4 | 5 | class PolicyGradientREINFORCE(object): 6 | 7 | def __init__(self, session, 8 | optimizer, 9 | policy_network, 10 | state_dim, 11 | num_actions, 12 | init_exp=0.5, # initial exploration prob 13 | final_exp=0.0, # final exploration prob 14 | anneal_steps=10000, # N steps for annealing exploration 15 | discount_factor=0.99, # discount future rewards 16 | reg_param=0.001, # regularization constants 17 | max_gradient=5, # max gradient norms 18 | summary_writer=None, 19 | summary_every=100): 20 | 21 | # tensorflow machinery 22 | self.session = session 23 | self.optimizer = optimizer 24 | self.summary_writer = summary_writer 25 | 26 | # model components 27 | self.policy_network = policy_network 28 | 29 | # training parameters 30 | self.state_dim = state_dim 31 | self.num_actions = num_actions 32 | self.discount_factor = discount_factor 33 | self.max_gradient = max_gradient 34 | self.reg_param = reg_param 35 | 36 | # exploration parameters 37 | self.exploration = init_exp 38 | self.init_exp = init_exp 39 | self.final_exp = final_exp 40 | self.anneal_steps = anneal_steps 41 | 42 | # counters 43 | self.train_iteration = 0 44 | 45 | # rollout buffer 46 | self.state_buffer = [] 47 | self.reward_buffer = [] 48 | self.action_buffer = [] 49 | 50 | # record reward history for normalization 51 | self.all_rewards = [] 52 | self.max_reward_length = 1000000 53 | 54 | # create and initialize variables 55 | self.create_variables() 56 | 
var_lists = tf.get_collection(tf.GraphKeys.VARIABLES) 57 | self.session.run(tf.initialize_variables(var_lists)) 58 | 59 | # make sure all variables are initialized 60 | self.session.run(tf.assert_variables_initialized()) 61 | 62 | if self.summary_writer is not None: 63 | # graph was not available when journalist was created 64 | self.summary_writer.add_graph(self.session.graph) 65 | self.summary_every = summary_every 66 | 67 | def resetModel(self): 68 | self.cleanUp() 69 | self.train_iteration = 0 70 | self.exploration = self.init_exp 71 | var_lists = tf.get_collection(tf.GraphKeys.VARIABLES) 72 | self.session.run(tf.initialize_variables(var_lists)) 73 | 74 | def create_variables(self): 75 | 76 | with tf.name_scope("model_inputs"): 77 | # raw state representation 78 | self.states = tf.placeholder(tf.float32, (None, self.state_dim), name="states") 79 | 80 | # rollout action based on current policy 81 | with tf.name_scope("predict_actions"): 82 | # initialize policy network 83 | with tf.variable_scope("policy_network"): 84 | self.policy_outputs = self.policy_network(self.states) 85 | 86 | # predict actions from policy network 87 | self.action_scores = tf.identity(self.policy_outputs, name="action_scores") 88 | # Note 1: tf.multinomial is not good enough to use yet 89 | # so we don't use self.predicted_actions for now 90 | self.predicted_actions = tf.multinomial(self.action_scores, 1) 91 | 92 | # regularization loss 93 | policy_network_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="policy_network") 94 | 95 | # compute loss and gradients 96 | with tf.name_scope("compute_pg_gradients"): 97 | # gradients for selecting action from policy network 98 | self.taken_actions = tf.placeholder(tf.int32, (None,), name="taken_actions") 99 | self.discounted_rewards = tf.placeholder(tf.float32, (None,), name="discounted_rewards") 100 | 101 | with tf.variable_scope("policy_network", reuse=True): 102 | self.logprobs = self.policy_network(self.states) 103 | 104 | # compute policy loss and regularization loss 105 | self.cross_entropy_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(self.logprobs, self.taken_actions) 106 | self.pg_loss = tf.reduce_mean(self.cross_entropy_loss) 107 | self.reg_loss = tf.reduce_sum([tf.reduce_sum(tf.square(x)) for x in policy_network_variables]) 108 | self.loss = self.pg_loss + self.reg_param * self.reg_loss 109 | 110 | # compute gradients 111 | self.gradients = self.optimizer.compute_gradients(self.loss) 112 | 113 | # compute policy gradients 114 | for i, (grad, var) in enumerate(self.gradients): 115 | if grad is not None: 116 | self.gradients[i] = (grad * self.discounted_rewards, var) 117 | 118 | for grad, var in self.gradients: 119 | tf.histogram_summary(var.name, var) 120 | if grad is not None: 121 | tf.histogram_summary(var.name + '/gradients', grad) 122 | 123 | # emit summaries 124 | tf.scalar_summary("policy_loss", self.pg_loss) 125 | tf.scalar_summary("reg_loss", self.reg_loss) 126 | tf.scalar_summary("total_loss", self.loss) 127 | 128 | # training update 129 | with tf.name_scope("train_policy_network"): 130 | # apply gradients to update policy network 131 | self.train_op = self.optimizer.apply_gradients(self.gradients) 132 | 133 | self.summarize = tf.merge_all_summaries() 134 | self.no_op = tf.no_op() 135 | 136 | def sampleAction(self, states): 137 | # TODO: use this code piece when tf.multinomial gets better 138 | # sample action from current policy 139 | # actions = self.session.run(self.predicted_actions, {self.states: states})[0] 140 | # 
return actions[0] 141 | 142 | # temporary workaround 143 | def softmax(y): 144 | """ simple helper function here that takes unnormalized logprobs """ 145 | maxy = np.amax(y) 146 | e = np.exp(y - maxy) 147 | return e / np.sum(e) 148 | 149 | # epsilon-greedy exploration strategy 150 | if random.random() < self.exploration: 151 | return random.randint(0, self.num_actions-1) 152 | else: 153 | action_scores = self.session.run(self.action_scores, {self.states: states})[0] 154 | action_probs = softmax(action_scores) - 1e-5 155 | action = np.argmax(np.random.multinomial(1, action_probs)) 156 | return action 157 | 158 | def updateModel(self): 159 | 160 | N = len(self.reward_buffer) 161 | r = 0 # use discounted reward to approximate Q value 162 | 163 | # compute discounted future rewards 164 | discounted_rewards = np.zeros(N) 165 | for t in reversed(xrange(N)): 166 | # future discounted reward from now on 167 | r = self.reward_buffer[t] + self.discount_factor * r 168 | discounted_rewards[t] = r 169 | 170 | # reduce gradient variance by normalization 171 | self.all_rewards += discounted_rewards.tolist() 172 | self.all_rewards = self.all_rewards[:self.max_reward_length] 173 | discounted_rewards -= np.mean(self.all_rewards) 174 | discounted_rewards /= np.std(self.all_rewards) 175 | 176 | # whether to calculate summaries 177 | calculate_summaries = self.train_iteration % self.summary_every == 0 and self.summary_writer is not None 178 | 179 | # update policy network with the rollout in batches 180 | for t in xrange(N-1): 181 | 182 | # prepare inputs 183 | states = self.state_buffer[t][np.newaxis, :] 184 | actions = np.array([self.action_buffer[t]]) 185 | rewards = np.array([discounted_rewards[t]]) 186 | 187 | # evaluate gradients 188 | grad_evals = [grad for grad, var in self.gradients] 189 | 190 | # perform one update of training 191 | _, summary_str = self.session.run([ 192 | self.train_op, 193 | self.summarize if calculate_summaries else self.no_op 194 | ], { 195 | self.states: states, 196 | self.taken_actions: actions, 197 | self.discounted_rewards: rewards 198 | }) 199 | 200 | # emit summaries 201 | if calculate_summaries: 202 | self.summary_writer.add_summary(summary_str, self.train_iteration) 203 | 204 | self.annealExploration() 205 | self.train_iteration += 1 206 | 207 | # clean up 208 | self.cleanUp() 209 | 210 | def annealExploration(self, stategy='linear'): 211 | ratio = max((self.anneal_steps - self.train_iteration)/float(self.anneal_steps), 0) 212 | self.exploration = (self.init_exp - self.final_exp) * ratio + self.final_exp 213 | 214 | def storeRollout(self, state, action, reward): 215 | self.action_buffer.append(action) 216 | self.reward_buffer.append(reward) 217 | self.state_buffer.append(state) 218 | 219 | def cleanUp(self): 220 | self.state_buffer = [] 221 | self.reward_buffer = [] 222 | self.action_buffer = [] -------------------------------------------------------------------------------- /PER-in-RL/rl/rank_based.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- encoding=utf-8 -*- 3 | # author: Ian 4 | # e-mail: stmayue@gmail.com 5 | # description: 6 | 7 | import sys 8 | import math 9 | import random 10 | import numpy as np 11 | 12 | import binary_heap 13 | 14 | 15 | class Experience(object): 16 | 17 | def __init__(self, conf): 18 | self.size = conf['size'] 19 | self.replace_flag = conf['replace_old'] if 'replace_old' in conf else True 20 | self.priority_size = conf['priority_size'] if 'priority_size' in conf else 
self.size 21 | 22 | self.alpha = conf['alpha'] if 'alpha' in conf else 0.7 23 | self.beta_zero = conf['beta_zero'] if 'beta_zero' in conf else 0.5 24 | self.batch_size = conf['batch_size'] if 'batch_size' in conf else 32 25 | self.learn_start = conf['learn_start'] if 'learn_start' in conf else 1000 26 | self.total_steps = conf['steps'] if 'steps' in conf else 100000 27 | # partition number N, split total size to N part 28 | self.partition_num = conf['partition_num'] if 'partition_num' in conf else 100 29 | 30 | self.index = 0 31 | self.record_size = 0 32 | self.isFull = False 33 | 34 | self._experience = {} 35 | self.priority_queue = binary_heap.BinaryHeap(self.priority_size) 36 | self.distributions = self.build_distributions() 37 | 38 | self.beta_grad = (1 - self.beta_zero) / (self.total_steps - self.learn_start) 39 | 40 | def build_distributions(self): 41 | """ 42 | preprocess pow of rank 43 | (rank i) ^ (-alpha) / sum ((rank i) ^ (-alpha)) 44 | :return: distributions, dict 45 | """ 46 | res = {} 47 | n_partitions = self.partition_num 48 | partition_num = 1 49 | # each part size 50 | partition_size = int (math.floor(1.0*self.size / n_partitions)) 51 | 52 | for n in range(partition_size, self.size + 1, partition_size): 53 | if self.learn_start <= n <= self.priority_size: 54 | distribution = {} 55 | # P(i) = (rank i) ^ (-alpha) / sum ((rank i) ^ (-alpha)) 56 | pdf = list( 57 | map(lambda x: math.pow(x, -self.alpha), range(1, n + 1)) 58 | ) 59 | pdf_sum = math.fsum(pdf) 60 | distribution['pdf'] = list(map(lambda x: x / pdf_sum, pdf)) 61 | # split to k segment, and than uniform sample in each k 62 | # set k = batch_size, each segment has total probability is 1 / batch_size 63 | # strata_ends keep each segment start pos and end pos 64 | cdf = np.cumsum(distribution['pdf']) 65 | strata_ends = {1: 0, self.batch_size + 1: n} 66 | step = 1.0 / self.batch_size 67 | index = 1 68 | for s in range(2, self.batch_size + 1): 69 | while cdf[index] < step: 70 | index += 1 71 | strata_ends[s] = index 72 | step += 1.0 / self.batch_size 73 | 74 | distribution['strata_ends'] = strata_ends 75 | 76 | res[partition_num] = distribution 77 | 78 | partition_num += 1 79 | 80 | return res 81 | 82 | def fix_index(self): 83 | """ 84 | get next insert index 85 | :return: index, int 86 | """ 87 | if self.record_size < self.size:#self.record_size <= self.size: 88 | self.record_size += 1 89 | if self.index % self.size == 0: 90 | self.isFull = True if len(self._experience) == self.size else False 91 | if self.replace_flag: 92 | self.index = 1 93 | return self.index 94 | else: 95 | sys.stderr.write('Experience replay buff is full and replace is set to FALSE!\n') 96 | return -1 97 | else: 98 | self.index += 1 99 | return self.index 100 | 101 | def store(self, experience): 102 | """ 103 | store experience, suggest that experience is a tuple of (s1, a, r, s2, t) 104 | so each experience is valid 105 | :param experience: maybe a tuple, or list 106 | :return: bool, indicate insert status 107 | """ 108 | insert_index = self.fix_index() 109 | if insert_index > 0: 110 | if insert_index in self._experience: 111 | del self._experience[insert_index] 112 | self._experience[insert_index] = experience 113 | # add to priority queue 114 | priority = self.priority_queue.get_max_priority() 115 | self.priority_queue.update(priority, insert_index) 116 | return True 117 | else: 118 | sys.stderr.write('Insert failed\n') 119 | return False 120 | 121 | def retrieve(self, indices): 122 | """ 123 | get experience from indices 124 | :param indices: 
list of experience id 125 | :return: experience replay sample 126 | """ 127 | return [self._experience[v] for v in indices] 128 | 129 | def rebalance(self): 130 | """ 131 | rebalance priority queue 132 | :return: None 133 | """ 134 | self.priority_queue.balance_tree() 135 | 136 | def update_priority(self, indices, delta): 137 | """ 138 | update priority according to indices and deltas 139 | :param indices: list of experience id 140 | :param delta: list of delta, order corresponds to indices 141 | :return: None 142 | """ 143 | for i in range(0, len(indices)): 144 | self.priority_queue.update(math.fabs(delta[i]), indices[i]) 145 | 146 | def sample(self, global_step): 147 | """ 148 | sample a mini batch from experience replay 149 | :param global_step: current training step 150 | :return: experience, list, samples 151 | :return: w, list, weights 152 | :return: rank_e_id, list, sample ids, used to update priority 153 | """ 154 | if self.record_size < self.learn_start: 155 | sys.stderr.write('Record size less than learn start! Sample failed\n') 156 | return False, False, False 157 | dist_index = int (math.floor(1.0*self.record_size / self.size * self.partition_num)) 158 | # issue 1 by @camigord 159 | partition_size = int (math.floor(1.0*self.size / self.partition_num)) 160 | partition_max = dist_index * partition_size 161 | distribution = self.distributions[dist_index] 162 | rank_list = [] 163 | # sample from k segments 164 | for n in range(1, self.batch_size + 1): 165 | if distribution['strata_ends'][n] + 1 <= distribution['strata_ends'][n + 1]: 166 | index = random.randint(distribution['strata_ends'][n] + 1, 167 | distribution['strata_ends'][n + 1]) 168 | else: 169 | index = random.randint(distribution['strata_ends'][n + 1], 170 | distribution['strata_ends'][n] + 1) 171 | rank_list.append(index) 172 | 173 | # beta, increase by global_step, max 1 174 | beta = min(self.beta_zero + (global_step - self.learn_start - 1) * self.beta_grad, 1) 175 | # find all alpha pow, notice that pdf is a list, start from 0 176 | alpha_pow = [distribution['pdf'][v - 1] for v in rank_list] 177 | # w = (N * P(i)) ^ (-beta) / max w 178 | w = np.power(np.array(alpha_pow) * partition_max, -beta) 179 | w_max = max(w) 180 | w = np.divide(w, w_max) 181 | # rank list is priority id 182 | # convert to experience id 183 | rank_e_id = self.priority_queue.priority_to_experience(rank_list) 184 | # retrieve experiences according to rank_e_id 185 | experience = self.retrieve(rank_e_id) 186 | return experience, w, rank_e_id 187 | --------------------------------------------------------------------------------
/PER-in-RL/rl/replay_buffer.py: -------------------------------------------------------------------------------- 1 | from collections import deque 2 | import random 3 | import rank_based 4 | 5 | class ReplayBuffer(object): 6 | 7 | def __init__(self, buffer_size): 8 | 9 | self.buffer_size = buffer_size 10 | self.num_experiences = 0 11 | #self.buffer = deque() 12 | conf = {'size': 10000, 13 | 'learn_start': 32, 14 | 'partition_num': 32, 15 | 'steps': 10000, # total training steps; rank_based.Experience reads this key for beta annealing 16 | 'batch_size': 32} 17 | self.replay_memory = rank_based.Experience(conf) 18 | 19 | def getBatch(self, batch_size): 20 | # prioritized (rank-based) sampling instead of a uniform random draw 21 | #return random.sample(self.buffer, batch_size) 22 | batch, w, e_id = self.replay_memory.sample(self.num_experiences) 23 | self.e_id=e_id 24 | self.w_id=w 25 | '''#state t 26 | self.state_t_batch = [item[0] for item in batch] 27 | self.state_t_batch = np.array(self.state_t_batch) 28 | #state t+1 29 | self.state_t_1_batch = [item[1] for item 
in batch] 30 | self.state_t_1_batch = np.array( self.state_t_1_batch) 31 | self.action_batch = [item[2] for item in batch] 32 | self.action_batch = np.array(self.action_batch) 33 | self.action_batch = np.reshape(self.action_batch,[len(self.action_batch),self.num_actions]) 34 | self.reward_batch = [item[3] for item in batch] 35 | self.reward_batch = np.array(self.reward_batch) 36 | self.done_batch = [item[4] for item in batch] 37 | self.done_batch = np.array(self.done_batch)''' 38 | return batch, self.w_id, self.e_id 39 | 40 | 41 | def size(self): 42 | return self.buffer_size 43 | 44 | def add(self, state, action, reward, next_state, done):#add(self, state, next_state, action, reward, done): 45 | #new_experience = (state, next_action, action, reward, done)#(state, action, reward, next_state, done) 46 | self.replay_memory.store((state, action, reward, next_state, done)) 47 | #if self.num_experiences < self.buffer_size: 48 | # self.buffer.append(new_experience) 49 | self.num_experiences += 1 50 | #else: 51 | # self.buffer.popleft() 52 | # self.buffer.append(new_experience) 53 | 54 | def count(self): 55 | # if buffer is full, return buffer size 56 | # otherwise, return experience counter 57 | return self.num_experiences 58 | 59 | #def erase(self): 60 | # self.buffer = deque() 61 | # self.num_experiences = 0 62 | def rebalance(self): 63 | self.replay_memory.rebalance() 64 | 65 | def update_priority(self, indices, delta): 66 | self.replay_memory.update_priority(indices, delta) 67 | -------------------------------------------------------------------------------- /PER-in-RL/rl/tabular_q_learner.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | 4 | # tabular Q-learning where states and actions 5 | # are discrete and stored in a table 6 | class QLearner(object): 7 | 8 | def __init__(self, state_dim, 9 | num_actions, 10 | init_exp=0.5, # initial exploration prob 11 | final_exp=0.0, # final exploration prob 12 | anneal_steps=500, # N steps for annealing exploration 13 | alpha = 0.2, 14 | discount_factor=0.9): # discount future rewards 15 | 16 | # Q learning parameters 17 | self.state_dim = state_dim 18 | self.num_actions = num_actions 19 | self.exploration = init_exp 20 | self.init_exp = init_exp 21 | self.final_exp = final_exp 22 | self.anneal_steps = anneal_steps 23 | self.discount_factor = discount_factor 24 | self.alpha = alpha 25 | 26 | # counters 27 | self.train_iteration = 0 28 | 29 | # table of q values 30 | self.qtable = np.random.uniform(low=-1, high=1, size=(state_dim, num_actions)) 31 | 32 | def initializeState(self, state): 33 | self.state = state 34 | self.action = self.qtable[state].argsort()[-1] 35 | return self.action 36 | 37 | # select action based on epsilon-greedy strategy 38 | def eGreedyAction(self, state): 39 | if self.exploration > random.random(): 40 | action = random.randint(0, self.num_actions-1) 41 | else: 42 | action = self.qtable[state].argsort()[-1] 43 | return action 44 | 45 | # do one value iteration update 46 | def updateModel(self, state, reward): 47 | action = self.eGreedyAction(state) 48 | 49 | self.train_iteration += 1 50 | self.annealExploration() 51 | self.qtable[self.state, self.action] = (1 - self.alpha) * self.qtable[self.state, self.action] + self.alpha * (reward + self.discount_factor * self.qtable[state, action]) 52 | 53 | self.state = state 54 | self.action = action 55 | 56 | return self.action 57 | 58 | # anneal learning rate 59 | def annealExploration(self, 
stategy='linear'): 60 | ratio = max((self.anneal_steps - self.train_iteration)/float(self.anneal_steps), 0) 61 | self.exploration = (self.init_exp - self.final_exp) * ratio + self.final_exp 62 | -------------------------------------------------------------------------------- /PER-in-RL/rl/utility.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- encoding=utf-8 -*- 3 | # author: Ian 4 | # e-mail: stmayue@gmail.com 5 | # description: 6 | 7 | 8 | def list_to_dict(in_list): 9 | return dict((i, in_list[i]) for i in range(0, len(in_list))) 10 | 11 | 12 | def exchange_key_value(in_dict): 13 | return dict((in_dict[i], i) for i in in_dict) 14 | 15 | 16 | def main(): 17 | pass 18 | 19 | 20 | if __name__ == '__main__': 21 | main() 22 | 23 | -------------------------------------------------------------------------------- /PER-in-RL/run_actor_critic_acrobot.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from collections import deque 3 | 4 | from rl.pg_actor_critic import PolicyGradientActorCritic 5 | import tensorflow as tf 6 | import numpy as np 7 | import gym 8 | 9 | env_name = 'Acrobot-v0' 10 | env = gym.make(env_name) 11 | 12 | sess = tf.Session() 13 | optimizer = tf.train.RMSPropOptimizer(learning_rate=0.0001, decay=0.9) 14 | writer = tf.train.SummaryWriter("/tmp/{}-experiment-1".format(env_name)) 15 | 16 | state_dim = env.observation_space.shape[0] 17 | num_actions = env.action_space.n 18 | 19 | def actor_network(states): 20 | # define policy neural network 21 | W1 = tf.get_variable("W1", [state_dim, 20], 22 | initializer=tf.random_normal_initializer()) 23 | b1 = tf.get_variable("b1", [20], 24 | initializer=tf.constant_initializer(0)) 25 | h1 = tf.nn.tanh(tf.matmul(states, W1) + b1) 26 | W2 = tf.get_variable("W2", [20, num_actions], 27 | initializer=tf.random_normal_initializer(stddev=0.1)) 28 | b2 = tf.get_variable("b2", [num_actions], 29 | initializer=tf.constant_initializer(0)) 30 | p = tf.matmul(h1, W2) + b2 31 | return p 32 | 33 | def critic_network(states): 34 | # define policy neural network 35 | W1 = tf.get_variable("W1", [state_dim, 20], 36 | initializer=tf.random_normal_initializer()) 37 | b1 = tf.get_variable("b1", [20], 38 | initializer=tf.constant_initializer(0)) 39 | h1 = tf.nn.tanh(tf.matmul(states, W1) + b1) 40 | W2 = tf.get_variable("W2", [20, 1], 41 | initializer=tf.random_normal_initializer()) 42 | b2 = tf.get_variable("b2", [1], 43 | initializer=tf.constant_initializer(0)) 44 | v = tf.matmul(h1, W2) + b2 45 | return v 46 | 47 | pg_reinforce = PolicyGradientActorCritic(sess, 48 | optimizer, 49 | actor_network, 50 | critic_network, 51 | state_dim, 52 | num_actions, 53 | summary_writer=writer) 54 | 55 | MAX_EPISODES = 10000 56 | MAX_STEPS = 1000 57 | 58 | no_reward_since = 0 59 | 60 | episode_history = deque(maxlen=100) 61 | for i_episode in xrange(MAX_EPISODES): 62 | 63 | # initialize 64 | state = env.reset() 65 | total_rewards = 0 66 | 67 | for t in xrange(MAX_STEPS): 68 | env.render() 69 | action = pg_reinforce.sampleAction(state[np.newaxis,:]) 70 | next_state, reward, done, _ = env.step(action) 71 | 72 | total_rewards += reward 73 | reward = 5.0 if done else -0.1 74 | pg_reinforce.storeRollout(state, action, reward) 75 | 76 | state = next_state 77 | if done: break 78 | 79 | # if we don't see rewards in consecutive episodes 80 | # it's likely that the model gets stuck in bad local optima 81 | # we simply reset the model and try 
again 82 | if not done: 83 | no_reward_since += 1 84 | if no_reward_since >= 5: 85 | # create and initialize variables 86 | print('Resetting model... start anew!') 87 | pg_reinforce.resetModel() 88 | no_reward_since = 0 89 | continue 90 | else: 91 | no_reward_since = 0 92 | 93 | pg_reinforce.updateModel() 94 | 95 | episode_history.append(total_rewards) 96 | mean_rewards = np.mean(episode_history) 97 | 98 | print("Episode {}".format(i_episode)) 99 | print("Finished after {} timesteps".format(t+1)) 100 | print("Reward for this episode: {}".format(total_rewards)) 101 | print("Average reward for last 100 episodes: {}".format(mean_rewards)) 102 | if mean_rewards >= -100.0 and len(episode_history) >= 100: 103 | print("Environment {} solved after {} episodes".format(env_name, i_episode+1)) 104 | break 105 | -------------------------------------------------------------------------------- /PER-in-RL/run_cem_cartpole.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from collections import deque 3 | import numpy as np 4 | import gym 5 | 6 | env_name = 'CartPole-v0' 7 | env = gym.make(env_name) 8 | 9 | def observation_to_action(ob, theta): 10 | # define policy neural network 11 | W1 = theta[:-1] 12 | b1 = theta[-1] 13 | return int((ob.dot(W1) + b1) < 0) 14 | 15 | def theta_rollout(env, theta, num_steps, render = False): 16 | total_rewards = 0 17 | observation = env.reset() 18 | for t in range(num_steps): 19 | action = observation_to_action(observation, theta) 20 | observation, reward, done, _ = env.step(action) 21 | total_rewards += reward 22 | if render: env.render() 23 | if done: break 24 | return total_rewards, t 25 | 26 | MAX_EPISODES = 10000 27 | MAX_STEPS = 200 28 | batch_size = 25 29 | top_per = 0.2 # percentage of theta with highest score selected from all the theta 30 | std = 1 # scale of standard deviation 31 | 32 | # initialize 33 | theta_mean = np.zeros(env.observation_space.shape[0] + 1) 34 | theta_std = np.ones_like(theta_mean) * std 35 | 36 | episode_history = deque(maxlen=100) 37 | for i_episode in xrange(MAX_EPISODES): 38 | # maximize function theta_rollout through cross-entropy method 39 | theta_sample = np.tile(theta_mean, (batch_size, 1)) + np.tile(theta_std, (batch_size, 1)) * np.random.randn(batch_size, theta_mean.size) 40 | reward_sample = np.array([theta_rollout(env, th, MAX_STEPS)[0] for th in theta_sample]) 41 | top_idx = np.argsort(-reward_sample)[:np.round(batch_size * top_per)] 42 | top_theta = theta_sample[top_idx] 43 | theta_mean = top_theta.mean(axis = 0) 44 | theta_std = top_theta.std(axis = 0) 45 | total_rewards, t = theta_rollout(env, theta_mean, MAX_STEPS, render = True) 46 | 47 | episode_history.append(total_rewards) 48 | mean_rewards = np.mean(episode_history) 49 | 50 | print("Episode {}".format(i_episode)) 51 | print("Finished after {} timesteps".format(t+1)) 52 | print("Reward for this episode: {}".format(total_rewards)) 53 | print("Average reward for last 100 episodes: {}".format(mean_rewards)) 54 | if mean_rewards >= 195.0: 55 | print("Environment {} solved after {} episodes".format(env_name, i_episode+1)) 56 | break 57 | -------------------------------------------------------------------------------- /PER-in-RL/run_ddpg_mujoco.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from collections import deque 3 | 4 | from rl.pg_ddpg import DeepDeterministicPolicyGradient 5 | import tensorflow as tf 6 | 
import numpy as np 7 | import sys 8 | # sys.path.append('/home/stone/gym/') 9 | import gym 10 | 11 | # env_name = 'Hopper-v1' 12 | # env_name = 'Swimmer-v1' 13 | # env_name = 'HalfCheetah-v1' 14 | env_name = 'Walker2d-v1' 15 | # env_name = 'InvertedPendulum-v1' 16 | # env_name = 'InvertedDoublePendulum-v1' 17 | # env_name = 'Reacher-v1' 18 | 19 | env = gym.make(env_name) 20 | sess = tf.Session() 21 | optimizer = tf.train.AdamOptimizer(learning_rate=0.0001) 22 | # writer = tf.train.SummaryWriter("/tmp/{}-experiment-1".format(env_name)) 23 | 24 | state_dim = env.observation_space.shape[0] 25 | action_dim = env.action_space.shape[0] 26 | 27 | # DDPG actor and critic architecture 28 | # Continuous control with deep reinforcement learning 29 | # Timothy P. Lillicrap, et al., 2015 30 | 31 | def actor_network(states): 32 | h1_dim = 400 33 | h2_dim = 300 34 | 35 | # define policy neural network 36 | W1 = tf.get_variable("W1", [state_dim, h1_dim], 37 | initializer=tf.contrib.layers.xavier_initializer()) 38 | b1 = tf.get_variable("b1", [h1_dim], 39 | initializer=tf.constant_initializer(0)) 40 | h1 = tf.nn.relu(tf.matmul(states, W1) + b1) 41 | 42 | W2 = tf.get_variable("W2", [h1_dim, h2_dim], 43 | initializer=tf.contrib.layers.xavier_initializer()) 44 | b2 = tf.get_variable("b2", [h2_dim], 45 | initializer=tf.constant_initializer(0)) 46 | h2 = tf.nn.relu(tf.matmul(h1, W2) + b2) 47 | 48 | # use tanh to bound the action 49 | W3 = tf.get_variable("W3", [h2_dim, action_dim], 50 | initializer=tf.contrib.layers.xavier_initializer()) 51 | b3 = tf.get_variable("b3", [action_dim], 52 | initializer=tf.constant_initializer(0)) 53 | 54 | # we assume actions range from [-1, 1] 55 | # you can scale action outputs with any constant here 56 | a = tf.nn.tanh(tf.matmul(h2, W3) + b3) 57 | return a 58 | 59 | def critic_network(states, action): 60 | h1_dim = 400 61 | h2_dim = 300 62 | 63 | # define policy neural network 64 | W1 = tf.get_variable("W1", [state_dim, h1_dim], 65 | initializer=tf.contrib.layers.xavier_initializer()) 66 | b1 = tf.get_variable("b1", [h1_dim], 67 | initializer=tf.constant_initializer(0)) 68 | h1 = tf.nn.relu(tf.matmul(states, W1) + b1) 69 | # skip action from the first layer 70 | h1_concat = tf.concat([h1, action], 1) 71 | 72 | W2 = tf.get_variable("W2", [h1_dim + action_dim, h2_dim], 73 | initializer=tf.contrib.layers.xavier_initializer()) 74 | b2 = tf.get_variable("b2", [h2_dim], 75 | initializer=tf.constant_initializer(0)) 76 | h2 = tf.nn.relu(tf.matmul(h1_concat, W2) + b2) 77 | 78 | W3 = tf.get_variable("W3", [h2_dim, 1], 79 | initializer=tf.contrib.layers.xavier_initializer()) 80 | b3 = tf.get_variable("b3", [1], 81 | initializer=tf.constant_initializer(0)) 82 | v = tf.matmul(h2, W3) + b3 83 | return v 84 | 85 | pg_ddpg = DeepDeterministicPolicyGradient(sess, 86 | optimizer, 87 | actor_network, 88 | critic_network, 89 | state_dim, 90 | action_dim) # , 91 | # summary_writer=writer) 92 | 93 | MAX_EPISODES = 5000 # 100000 94 | #MAX_STEPS = 50 # if testing "Reacher" task 95 | MAX_STEPS = 5000 # 1000 96 | mean_rewards = 0 97 | 98 | record = [0 for x in range(0, 10000)] 99 | 100 | 101 | f=open('/home/cardwing/Desktop/usefulcode_HouYuenan/reward_curve/PER-DDPG/walker2d.txt','w') 102 | episode_history = deque(maxlen=100)#record total reward 103 | # reward_history = deque(maxlen=1000) 104 | for i_episode in xrange(MAX_EPISODES): 105 | #if i_episode == 300: 106 | # MAX_STEPS = 5000 107 | # initialize 108 | state = env.reset() 109 | total_rewards = 0 110 | # print(MAX_STEPS) 111 | # run for an episode 
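    # Each step below: sample an action with OU exploration noise, advance the
    # environment, store the transition in the prioritized replay buffer, and
    # run one DDPG update, which also refreshes the priorities of the sampled
    # batch and tallies per-slot sampling counts in `record`.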
112 | for t in xrange(MAX_STEPS): 113 | # env.render() 114 | action = pg_ddpg.sampleAction(state[np.newaxis,:]) 115 | next_state, reward, done, _ = env.step(action) 116 | pg_ddpg.storeExperience(state, action, reward, next_state, done) 117 | error = pg_ddpg.updateModel(record) 118 | total_rewards += reward 119 | state = next_state 120 | if done: break 121 | episode_history.append(total_rewards) 122 | f.write(str(total_rewards)) 123 | f.write('\n') 124 | if i_episode % 100 == 0: 125 | for i_eval in range(100): 126 | total_rewards = 0 127 | state = env.reset() 128 | for t in xrange(MAX_STEPS): 129 | if i_episode >= 400 and mean_rewards > 8000: 130 | env.render() 131 | action = pg_ddpg.sampleAction(state[np.newaxis,:], exploration=False) 132 | next_state, reward, done, _ = env.step(action) 133 | total_rewards += reward 134 | state = next_state 135 | if done: break 136 | episode_history.append(total_rewards) 137 | mean_rewards = np.mean(episode_history) 138 | 139 | print("Episode {}".format(i_episode)) 140 | print("Finished after {} timesteps".format(t+1)) 141 | print("Reward for this episode: {:.2f}".format(total_rewards)) 142 | print("Average reward for last 100 episodes: {:.2f}".format(mean_rewards)) 143 | #if mean_rewards >= -3.75:# for Reacher-v1 144 | # if mean_rewards >= 950.0: # for InvertedPendulum-v1 145 | if mean_rewards >= 9100.0 or i_episode >= 5000: # for InvertedDoublePendulum-v1 146 | #if mean_rewards >= -3: # for Walker2d-v1 147 | print("Environment {} solved after {} episodes".format(env_name, i_episode+1)) 148 | f.close() 149 | break 150 | f.close() 151 | -------------------------------------------------------------------------------- /PER-in-RL/run_dqn_cartpole.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from collections import deque 3 | 4 | from rl.neural_q_learner import NeuralQLearner 5 | import tensorflow as tf 6 | import numpy as np 7 | import gym 8 | 9 | env_name = 'CartPole-v0' 10 | env = gym.make(env_name) 11 | 12 | sess = tf.Session() 13 | optimizer = tf.train.RMSPropOptimizer(learning_rate=0.0001, decay=0.9) 14 | writer = tf.train.SummaryWriter("/tmp/{}-experiment-1".format(env_name)) 15 | 16 | state_dim = env.observation_space.shape[0] 17 | num_actions = env.action_space.n 18 | 19 | def observation_to_action(states): 20 | # define policy neural network 21 | W1 = tf.get_variable("W1", [state_dim, 20], 22 | initializer=tf.random_normal_initializer()) 23 | b1 = tf.get_variable("b1", [20], 24 | initializer=tf.constant_initializer(0)) 25 | h1 = tf.nn.relu(tf.matmul(states, W1) + b1) 26 | W2 = tf.get_variable("W2", [20, num_actions], 27 | initializer=tf.random_normal_initializer()) 28 | b2 = tf.get_variable("b2", [num_actions], 29 | initializer=tf.constant_initializer(0)) 30 | q = tf.matmul(h1, W2) + b2 31 | return q 32 | 33 | q_learner = NeuralQLearner(sess, 34 | optimizer, 35 | observation_to_action, 36 | state_dim, 37 | num_actions, 38 | summary_writer=writer) 39 | 40 | MAX_EPISODES = 10000 41 | MAX_STEPS = 200 42 | 43 | episode_history = deque(maxlen=100) 44 | for i_episode in xrange(MAX_EPISODES): 45 | 46 | # initialize 47 | state = env.reset() 48 | total_rewards = 0 49 | 50 | for t in xrange(MAX_STEPS): 51 | env.render() 52 | action = q_learner.eGreedyAction(state[np.newaxis,:]) 53 | next_state, reward, done, _ = env.step(action) 54 | 55 | total_rewards += reward 56 | # reward = -10 if done else 0.1 # normalize reward 57 | q_learner.storeExperience(state, action, reward, 
next_state, done) 58 | 59 | q_learner.updateModel() 60 | state = next_state 61 | 62 | if done: break 63 | 64 | episode_history.append(total_rewards) 65 | mean_rewards = np.mean(episode_history) 66 | 67 | print("Episode {}".format(i_episode)) 68 | print("Finished after {} timesteps".format(t+1)) 69 | print("Reward for this episode: {}".format(total_rewards)) 70 | print("Average reward for last 100 episodes: {}".format(mean_rewards)) 71 | if mean_rewards >= 195.0: 72 | print("Environment {} solved after {} episodes".format(env_name, i_episode+1)) 73 | break 74 | -------------------------------------------------------------------------------- /PER-in-RL/run_ql_cartpole.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from collections import deque 3 | from rl.tabular_q_learner import QLearner 4 | import numpy as np 5 | import gym 6 | 7 | env_name = 'CartPole-v0' 8 | env = gym.make(env_name) 9 | 10 | cart_position_bins = np.linspace(-2.4, 2.4, num = 11)[1:-1] 11 | pole_angle_bins = np.linspace(-2, 2, num = 11)[1:-1] 12 | cart_velocity_bins = np.linspace(-1, 1, num = 11)[1:-1] 13 | angle_rate_bins = np.linspace(-3.5, 3.5, num = 11)[1:-1] 14 | 15 | def digitalizeState(observation): 16 | return int("".join([str(o) for o in observation])) 17 | 18 | state_dim = 10 ** env.observation_space.shape[0] 19 | num_actions = env.action_space.n 20 | 21 | q_learner = QLearner(state_dim, num_actions) 22 | 23 | MAX_EPISODES = 10000 24 | MAX_STEPS = 200 25 | 26 | episode_history = deque(maxlen=100) 27 | for i_episode in xrange(MAX_EPISODES): 28 | 29 | # initialize 30 | observation = env.reset() 31 | cart_position, pole_angle, cart_velocity, angle_rate_of_change = observation 32 | state = digitalizeState([np.digitize(cart_position, cart_position_bins), 33 | np.digitize(pole_angle, pole_angle_bins), 34 | np.digitize(cart_velocity, cart_velocity_bins), 35 | np.digitize(angle_rate_of_change, angle_rate_bins)]) 36 | action = q_learner.initializeState(state) 37 | total_rewards = 0 38 | 39 | for t in xrange(MAX_STEPS): 40 | env.render() 41 | observation, reward, done, _ = env.step(action) 42 | cart_position, pole_angle, cart_velocity, angle_rate_of_change = observation 43 | 44 | state = digitalizeState([np.digitize(cart_position, cart_position_bins), 45 | np.digitize(pole_angle, pole_angle_bins), 46 | np.digitize(cart_velocity, cart_velocity_bins), 47 | np.digitize(angle_rate_of_change, angle_rate_bins)]) 48 | 49 | 50 | total_rewards += reward 51 | if done: reward = -200 # normalize reward 52 | action = q_learner.updateModel(state, reward) 53 | 54 | if done: break 55 | 56 | episode_history.append(total_rewards) 57 | mean_rewards = np.mean(episode_history) 58 | 59 | print("Episode {}".format(i_episode)) 60 | print("Finished after {} timesteps".format(t+1)) 61 | print("Reward for this episode: {}".format(total_rewards)) 62 | print("Average reward for last 100 episodes: {}".format(mean_rewards)) 63 | if mean_rewards >= 195.0: 64 | print("Environment {} solved after {} episodes".format(env_name, i_episode+1)) 65 | break 66 | -------------------------------------------------------------------------------- /PER-in-RL/run_reinforce_cartpole.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from collections import deque 3 | 4 | from rl.pg_reinforce import PolicyGradientREINFORCE 5 | import tensorflow as tf 6 | import numpy as np 7 | import gym 8 | 9 | env_name = 
'CartPole-v0' 10 | env = gym.make(env_name) 11 | 12 | sess = tf.Session() 13 | optimizer = tf.train.RMSPropOptimizer(learning_rate=0.0001, decay=0.9) 14 | writer = tf.train.SummaryWriter("/tmp/{}-experiment-1".format(env_name)) 15 | 16 | state_dim = env.observation_space.shape[0] 17 | num_actions = env.action_space.n 18 | 19 | def policy_network(states): 20 | # define policy neural network 21 | W1 = tf.get_variable("W1", [state_dim, 20], 22 | initializer=tf.random_normal_initializer()) 23 | b1 = tf.get_variable("b1", [20], 24 | initializer=tf.constant_initializer(0)) 25 | h1 = tf.nn.tanh(tf.matmul(states, W1) + b1) 26 | W2 = tf.get_variable("W2", [20, num_actions], 27 | initializer=tf.random_normal_initializer(stddev=0.1)) 28 | b2 = tf.get_variable("b2", [num_actions], 29 | initializer=tf.constant_initializer(0)) 30 | p = tf.matmul(h1, W2) + b2 31 | return p 32 | 33 | pg_reinforce = PolicyGradientREINFORCE(sess, 34 | optimizer, 35 | policy_network, 36 | state_dim, 37 | num_actions, 38 | summary_writer=writer) 39 | 40 | MAX_EPISODES = 10000 41 | MAX_STEPS = 200 42 | 43 | episode_history = deque(maxlen=100) 44 | for i_episode in xrange(MAX_EPISODES): 45 | 46 | # initialize 47 | state = env.reset() 48 | total_rewards = 0 49 | 50 | for t in xrange(MAX_STEPS): 51 | env.render() 52 | action = pg_reinforce.sampleAction(state[np.newaxis,:]) 53 | next_state, reward, done, _ = env.step(action) 54 | 55 | total_rewards += reward 56 | reward = -10 if done else 0.1 # normalize reward 57 | pg_reinforce.storeRollout(state, action, reward) 58 | 59 | state = next_state 60 | if done: break 61 | 62 | pg_reinforce.updateModel() 63 | 64 | episode_history.append(total_rewards) 65 | mean_rewards = np.mean(episode_history) 66 | 67 | print("Episode {}".format(i_episode)) 68 | print("Finished after {} timesteps".format(t+1)) 69 | print("Reward for this episode: {}".format(total_rewards)) 70 | print("Average reward for last 100 episodes: {}".format(mean_rewards)) 71 | if mean_rewards >= 195.0 and len(episode_history) >= 100: 72 | print("Environment {} solved after {} episodes".format(env_name, i_episode+1)) 73 | break 74 | -------------------------------------------------------------------------------- /PER-in-RL/utility.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- encoding=utf-8 -*- 3 | # author: Ian 4 | # e-mail: stmayue@gmail.com 5 | # description: 6 | 7 | 8 | def list_to_dict(in_list): 9 | return dict((i, in_list[i]) for i in range(0, len(in_list))) 10 | 11 | 12 | def exchange_key_value(in_dict): 13 | return dict((in_dict[i], i) for i in in_dict) 14 | 15 | 16 | def main(): 17 | pass 18 | 19 | 20 | if __name__ == '__main__': 21 | main() 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Codes for projects of IERG 6130 Reinforcement Learning 2 | 3 | ### Requirements 4 | 5 | - Tensorflow 1.4.0 6 | - MuJoCo 7 | - Gym 0.7.4 8 | 9 | ### Install necessary components 10 | conda create -n tensorflow_gpu pip python=2.7 11 | source activate tensorflow_gpu 12 | pip install --upgrade tensorflow-gpu==1.4 13 | pip install gym==0.7.4 14 | pip install mujoco-py==0.5.5 15 | 16 | 17 | ### Run the code 18 | source activate tensorflow_gpu 19 | cd PER-in-RL 20 | CUDA_VISIBLE_DEVICES=0 python run_ddpg_mujoco.py 21 | 22 | ### Notes 23 | export MUJOCO_PY_MJKEY_PATH=/path/to/mjpro131/bin/mjkey.txt 24 | export 
MUJOCO_PY_MJPRO_PATH=/path/to/mjpro131 25 | 26 | Set the two environment variables above so that they point to your MuJoCo (mjpro131) installation and its license key file (mjkey.txt). 27 | 28 | ### Acknowledgement 29 | This repo is built upon [Tensorflow-Reinforce](https://github.com/yukezhu/tensorflow-reinforce) and [prioritized-experience-replay](https://github.com/Damcy/prioritized-experience-replay). 30 | 31 | --------------------------------------------------------------------------------
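For reference, here is a minimal sketch of driving the rank-based prioritized replay in `rl/rank_based.py` on its own (illustrative only: the `conf` values mirror `rl/replay_buffer.py`, the dummy transitions and TD errors are placeholders, and the imports assume the script is run from `PER-in-RL/rl/`, where `binary_heap.py` and `utility.py` live):

    import numpy as np
    import rank_based

    conf = {'size': 10000,        # replay capacity
            'learn_start': 32,    # minimum number of stored transitions before sampling
            'partition_num': 32,  # number of precomputed rank distributions
            'steps': 10000,       # total steps used to anneal beta towards 1
            'batch_size': 32}
    memory = rank_based.Experience(conf)

    # store (state, action, reward, next_state, done) transitions
    for step in range(1000):
        s = np.random.randn(4)
        memory.store((s, 0, 0.0, s, False))

    # sample a prioritized batch together with its importance-sampling weights
    batch, weights, e_ids = memory.sample(1000)

    # after computing new TD errors for the batch, refresh its priorities
    memory.update_priority(e_ids, np.abs(np.random.randn(len(e_ids))))  # placeholder TD errors
    memory.rebalance()  # occasionally rebuild the underlying binary heap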