├── MDP_RL_Lecture1.pdf
├── MDP_RL_Lecture2.pdf
├── README.md
├── frameworks_demo
│   └── Gym
│       ├── images
│       │   ├── algo.png
│       │   ├── atari.png
│       │   ├── cartpole.jpg
│       │   ├── go.png
│       │   └── mujoco.png
│       ├── lua
│       │   ├── README.MD
│       │   ├── dqn.lua
│       │   ├── gym_http_client.lua
│       │   ├── layers.lua
│       │   ├── main.lua
│       │   ├── replay_buffer.lua
│       │   └── utils.lua
│       ├── main.ipynb
│       └── python
│           ├── README.MD
│           ├── dqn.py
│           ├── main.py
│           └── replay_buffer.py
└── mb_demo.ipynb

/MDP_RL_Lecture1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cilvrRG/RL/3d1a68624d14b29deb80bce23bcab6379f507fc8/MDP_RL_Lecture1.pdf
--------------------------------------------------------------------------------
/MDP_RL_Lecture2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cilvrRG/RL/3d1a68624d14b29deb80bce23bcab6379f507fc8/MDP_RL_Lecture2.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# RL
## Reading Group on Reinforcement Learning topics
## NYU, Fall 2016

### Logistics
- Meetings run every **Wednesday at 9:30** (before the CILVR lab meeting) in the large conference room at 715/719 Broadway, 12th floor. Breakfast will be provided.
- Paper discussion and review plan: each week we will assign one or two papers to volunteers, who will present them the following week. During the reading/presentation, we will edit a review of the paper, which will be posted here. [also subject to change]
- Guest speakers: we will try to invite RL experts (e.g. G. Tesauro) with some frequency.
- Other communication channels (Facebook groups? Slack?) [TBD].

### Organization
The RG is initially organized by J. Bruna, K. Cho, S. Sukhbaatar, K. Ross, and D. Sontag, with help from the rest of the CILVR group.

### Tentative Agenda

- 9/21: [Tutorial on MDPs, Policy Gradient (part 1)](MDP_RL_Lecture1.pdf). [**Keith Ross**]
    - Markov Decision Process Paradigm
    - Discounted and average cost criteria
    - Model-free Reinforcement Learning Paradigm
    - Policy Gradient: parameterized policies; policy gradient theorem; Monte Carlo Policy Gradient (REINFORCE)
    - Using Policy Gradient and deep neural networks to learn the Atari game "Pong".

- 9/28: [Tutorial on MDPs, Policy Gradient (part 2)](MDP_RL_Lecture2.pdf). [**Keith**]
    - Dynamic Programming equations for MDPs
    - Policy iteration
    - Value iteration
    - Monte Carlo methods for RL
    - Q-learning for RL

- 10/5 and 10/12: Actor-Critic. [**Martin**]
    - Deterministic Policy Gradient
    - Off-Policy variants
    - Relevant Papers:
        - [Policy Gradient and Actor Critic](https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf)
        - [Deterministic Policy Gradients](http://jmlr.org/proceedings/papers/v32/silver14.pdf)
        - [Off-policy actor critic](https://webdocs.cs.ualberta.ca/~sutton/papers/Degris-OffPAC-ICML-2012.pdf)

- 10/19: Tutorial on OpenAI Gym and Mazebase.
  Also, Twitter's new [twrl](https://github.com/twitter/torch-twrl) [**Sainaa and Ilya**]
    - MazeBase: https://github.com/facebook/MazeBase
- 10/26: [Apprenticeship Learning via Inverse Reinforcement Learning](http://ai.stanford.edu/~ang/papers/icml04-apprentice.pdf) and [Model-Free Imitation Learning with Policy Optimization](https://arxiv.org/abs/1605.08478) [**Arthur**]
- 10/31: Trust region policy optimization (TRPO) [**Elman, Ilya**]




### Pool of Papers [please fill]

- Guided Policy Search
- Value Iteration Networks
- TRPO [Elman, early November]
- Review of recent hierarchical reinforcement learning papers [Sainaa]
- Intrinsically Motivated Reinforcement Learning [Martin?]:
    - [Intrinsically Motivated Reinforcement Learning](https://web.eecs.umich.edu/~baveja/Papers/FinalNIPSIMRL.pdf)
    - [Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning](https://arxiv.org/pdf/1509.08731v1.pdf)
    - [Bayesian Surprise Attracts Human Attention](https://papers.nips.cc/paper/2822-bayesian-surprise-attracts-human-attention.pdf)
    - [Variational Information Maximizing Exploration](https://arxiv.org/abs/1605.09674)
    - [Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/pdf/1606.01868v2.pdf)
- High dimensional action spaces:
    - [Reinforcement Learning with Factored States and Actions](http://www.jmlr.org/papers/volume5/sallans04a/sallans04a.pdf)
    - [Deep Reinforcement Learning in Large Discrete Action Spaces](https://arxiv.org/pdf/1512.07679.pdf)
    - [Learning Multiagent Communication with Backpropagation](https://arxiv.org/pdf/1605.07736.pdf)
- Stability in RL (these 4 papers shouldn't take more than 1 or 2 lectures):
    - [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
    - [Human-level control through deep reinforcement
learning](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html)
    - [Double Q-learning](http://papers.nips.cc/paper/3964-double-q-learning.pdf). Also consider reading about [the optimizer's curse](https://faculty.fuqua.duke.edu/~jes9/bio/Optimizers_Curse.pdf) to make the reading simpler.
    - [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)

--------------------------------------------------------------------------------
/frameworks_demo/Gym/images/algo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cilvrRG/RL/3d1a68624d14b29deb80bce23bcab6379f507fc8/frameworks_demo/Gym/images/algo.png
--------------------------------------------------------------------------------
/frameworks_demo/Gym/images/atari.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cilvrRG/RL/3d1a68624d14b29deb80bce23bcab6379f507fc8/frameworks_demo/Gym/images/atari.png
--------------------------------------------------------------------------------
/frameworks_demo/Gym/images/cartpole.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cilvrRG/RL/3d1a68624d14b29deb80bce23bcab6379f507fc8/frameworks_demo/Gym/images/cartpole.jpg
--------------------------------------------------------------------------------
/frameworks_demo/Gym/images/go.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cilvrRG/RL/3d1a68624d14b29deb80bce23bcab6379f507fc8/frameworks_demo/Gym/images/go.png
--------------------------------------------------------------------------------
/frameworks_demo/Gym/images/mujoco.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cilvrRG/RL/3d1a68624d14b29deb80bce23bcab6379f507fc8/frameworks_demo/Gym/images/mujoco.png
--------------------------------------------------------------------------------
/frameworks_demo/Gym/lua/README.MD:
--------------------------------------------------------------------------------
1. Install server: https://github.com/openai/gym-http-api
2. Launch server:
```bash
python gym_http_server.py
```
3. Run:
```bash
th main.lua
```

--------------------------------------------------------------------------------
/frameworks_demo/Gym/lua/dqn.lua:
--------------------------------------------------------------------------------
require 'nn'
require 'optim'

local class = require 'class'

DQN = class('DQN')

function DQN:__init(model, discount, optimConfig)
    self.model = model
    self.parameters, self.gradParameters = self.model:getParameters()

    self.criterion = nn.MSECriterion()

    self.optimConfig = optimConfig or {
        optimizer = "rmsprop",
        learningRate = 0.001,
    }

    self.targets = torch.Tensor() --A buffer for targets
    self.discount = discount or 0.99
end

function DQN:act(observation)
    local qvalues = self.model:forward(observation)
    local _, indices = torch.max(qvalues, 2)
    return indices
end

function DQN:create_targets(rewards, next_observations, mask)
    local values = torch.max(self.model:forward(next_observations), 2)
    return torch.cmul(mask, values) * self.discount + rewards
end

function DQN:update_params(observations, actions, qtargets)
    local feval = function(x)

        -- reset gradients
        self.gradParameters:zero()

        local qvalues = self.model:forward(observations)

        self.targets:resizeAs(qvalues)
        self.targets:copy(qvalues)

        -- A simple trick to avoid using a mask for other actions.
        -- Set targets only for taken actions, zero gradient for others.
        for i=1,actions:size(1) do
            self.targets[i][actions[i][1]] = qtargets[i]
        end

        local error = self.criterion:forward(qvalues, self.targets)
        local df_do = self.criterion:backward(qvalues, self.targets)

        self.model:backward(observations, df_do)

        return error, self.gradParameters
    end

    optim[self.optimConfig.optimizer](feval, self.parameters, self.optimConfig)
end

function DQN:update(observations, actions, rewards, next_observations, mask)
    local qtargets = self:create_targets(rewards, next_observations, mask)
    self:update_params(observations, actions, qtargets)
end

--------------------------------------------------------------------------------
/frameworks_demo/Gym/lua/gym_http_client.lua:
--------------------------------------------------------------------------------
local HttpClient = require("httpclient")
local json = require("dkjson")
local os = require 'os'

local GymClient = {}
local m = {}

function m.new(remote_base)
    local self = {}
    self.remote_base = remote_base
    self.http = HttpClient.new()
    setmetatable(self, {__index = GymClient})
    return self
end

function GymClient:parse_server_error_or_raise_for_status(resp)
    local resp_data, pos, err = {}
    if resp.err then
        err = resp.err
        -- print('Response error: ' .. err)
    else
        if resp.code ~= 200 then
            err = resp.status_line
            -- Descriptive message from the server side
            -- print('Response: ' .. err)
        end
        resp_data, pos, err = json.decode(resp.body)
    end
    return resp_data, pos, err
end

function GymClient:get_request(route)
    url = self.remote_base ..
        route
    options = {}
    options.content_type = 'application/json'
    resp = self.http:get(url, options)
    return self:parse_server_error_or_raise_for_status(resp)
end

function GymClient:post_request(route, req_data)
    url = self.remote_base .. route
    options = {}
    options.content_type = 'application/json'
    json_str = json.encode(req_data)
    resp = self.http:post(url, json_str, options)
    return self:parse_server_error_or_raise_for_status(resp)
end

function GymClient:env_create(env_id)
    route = '/v1/envs/'
    req_data = {env_id = env_id}
    resp_data = self:post_request(route, req_data)
    return resp_data['instance_id']
end

function GymClient:env_list_all()
    route = '/v1/envs/'
    resp_data = self:get_request(route)
    return resp_data['all_envs']
end

function GymClient:env_reset(instance_id)
    route = '/v1/envs/'..instance_id..'/reset/'
    resp_data = self:post_request(route, '')
    return resp_data['observation']
end

function GymClient:env_step(instance_id, action, render, video_callable)
    render = render or false
    video_callable = video_callable or false
    route = '/v1/envs/'..instance_id..'/step/'
    req_data = {action = action, render = render, video_callable = video_callable}
    resp_data = self:post_request(route, req_data)
    obs = resp_data['observation']
    reward = resp_data['reward']
    done = resp_data['done']
    info = resp_data['info']
    return obs, reward, done, info
end

function GymClient:env_action_space_info(instance_id)
    route = '/v1/envs/'..instance_id..'/action_space/'
    resp_data = self:get_request(route)
    return resp_data['info']
end

function GymClient:env_action_space_sample(instance_id)
    route = '/v1/envs/'..instance_id..'/action_space/sample'
    resp_data = self:get_request(route)
    action = resp_data['action']
    return action
end

function GymClient:env_action_space_contains(instance_id)
    route = '/v1/envs/'..instance_id..'/action_space/contains'
    resp_data = self:get_request(route)
    -- Read the decoded JSON body (resp_data), not the raw HTTP response
    member = resp_data['member']
    return member
end

function GymClient:env_observation_space_info(instance_id)
    route = '/v1/envs/'..instance_id..'/observation_space/'
    resp_data = self:get_request(route)
    return resp_data['info']
end

function GymClient:env_monitor_start(instance_id, directory, force, resume, video_callable)
    if not force then force = false end
    if not resume then resume = false end
    req_data = {directory = directory,
                force = tostring(force),
                resume = tostring(resume),
                video_callable = video_callable}
    route = '/v1/envs/'..instance_id..'/monitor/start/'
    resp_data = self:post_request(route, req_data)
end

function GymClient:env_monitor_close(instance_id)
    route = '/v1/envs/'..instance_id..'/monitor/close/'
    resp_data = self:post_request(route, '')
end

function GymClient:upload(training_dir, algorithm_id, api_key)
    if not api_key then
        api_key = os.getenv('OPENAI_GYM_API_KEY')
    end
    if not algorithm_id then algorithm_id = '' end
    req_data = {training_dir = training_dir,
                algorithm_id = algorithm_id,
                api_key = api_key}
    route = '/v1/upload/'
    resp = self:post_request(route, req_data)
    return resp
end

function GymClient:shutdown_server()
    route = '/v1/shutdown/'
    self:post_request(route, '')
end

return m

--------------------------------------------------------------------------------
/frameworks_demo/Gym/lua/layers.lua:
--------------------------------------------------------------------------------
require 'nngraph'

--[[
Copied from:
https://raw.githubusercontent.com/ryankiros/layer-norm/master/torch_modules/LayerNormalization.lua
--]]

function nn.LayerNormalization(nOutput, bias, eps, affine)
    local eps = eps or 1e-5
    -- Default to true only when affine is omitted; `affine or true` ignores an explicit false
    if affine == nil then affine = true end
    local bias = bias or nil

    local input = nn.Identity()()
    local mean = nn.Mean(2)(input)
    local mean_rep = nn.Replicate(nOutput,2)(mean)

    local input_center = nn.CSubTable()({input, mean_rep})
    local std = nn.Sqrt()(nn.Mean(2)(nn.Square()(input_center)))
    local std_rep = nn.AddConstant(eps)(nn.Replicate(nOutput,2)(std))
    local output = nn.CDivTable()({input_center, std_rep})

    if affine == true then
        local biasTransform = nn.Add(nOutput, false)
        if bias ~=nil then
            biasTransform.bias:fill(bias)
        end
        local gainTransform = nn.CMul(nOutput)
        gainTransform.weight:fill(1.)
        output = biasTransform(gainTransform(output))
    end
    return nn.gModule({input},{output})
end

--------------------------------------------------------------------------------
/frameworks_demo/Gym/lua/main.lua:
--------------------------------------------------------------------------------
require 'dqn'
require 'replay_buffer'
local utils = require 'utils'
local GymClient = require("gym_http_client")
local HttpClient = require("httpclient")

client = GymClient.new('http://127.0.0.1:5000')

-- Parameters
BATCH_SIZE = 32
NUM_EPISODES = 50000
MIN_SAMPLES = 2 * BATCH_SIZE -- Minimal number of samples for an update
EVAL_EPISODE = 10 -- Perform evaluation each EVAL_EPISODE number of episodes
NUM_STEPS = 200
NUM_EVALUATIONS = 10 -- Number of times we run evaluation
INITIAL_EPSILON = 0.9
MIN_EPSILON = 0.1
EPSILON_DECAY = 0.999
BUFFER_SIZE = 100000

-- Set up environment
env_id = 'CartPole-v0'
instance_id = client:env_create(env_id)

ndim_action =
    client:env_action_space_info(instance_id).n
ndim_obs = client:env_observation_space_info(instance_id).shape[1]
epsilon = INITIAL_EPSILON

agent = DQN(utils.mlp({ndim_obs, 128, 128, ndim_action}))
replayBuffer = ReplayBuffer(BUFFER_SIZE)

for i = 1, NUM_EPISODES do
    -- Training
    observation = torch.Tensor(client:env_reset(instance_id))
    for j = 1, NUM_STEPS do
        -- Epsilon greedy exploration
        if torch.uniform() < epsilon then
            -- Sample a 1-based action index over all actions instead of a hard-coded 2
            action = torch.random(ndim_action)
        else
            action = agent:act(observation:view(1, -1))[1][1]
        end

        -- Perform an action, observe next state and reward
        next_observation, reward, done, info = client:env_step(instance_id, action-1, true)
        next_observation = torch.Tensor(next_observation)

        -- Insert it to replay buffer
        replayBuffer:insert(observation, action, reward, done, next_observation)

        if done == true then
            break
        else
            observation = next_observation
        end

        if #replayBuffer.replayBuffer > MIN_SAMPLES then
            -- Decay epsilon
            epsilon = epsilon * EPSILON_DECAY
            epsilon = math.max(epsilon, MIN_EPSILON)

            -- Sample a batch of samples and then update
            observations, actions, rewards, next_observations, mask = replayBuffer:sample(BATCH_SIZE)
            agent:update(observations, actions, rewards, next_observations, mask)
        end
    end

    -- Evaluation
    if i % EVAL_EPISODE == 0 then
        local total_reward = 0
        for k = 1, NUM_EVALUATIONS do
            observation = torch.Tensor(client:env_reset(instance_id))

            for j = 1, NUM_STEPS do
                action = agent:act(observation:view(1, -1))[1][1]
                next_observation, reward, done, info = client:env_step(instance_id, action-1, true)
                next_observation = torch.Tensor(next_observation)
                total_reward = total_reward + reward

                if done == true then
                    break
                else
                    observation = next_observation
                end
            end
        end
        print("Eval after episode #".. i .. ", average reward: " .. total_reward/NUM_EVALUATIONS)
    end
end

--------------------------------------------------------------------------------
/frameworks_demo/Gym/lua/replay_buffer.lua:
--------------------------------------------------------------------------------
require 'nn'
local class = require 'class'

ReplayBuffer = class('ReplayBuffer')

function ReplayBuffer:__init(maxSize)
    self.maxSize = maxSize
    self.currentIndex = 1
    self.replayBuffer = {}
    self.observations = torch.Tensor()
    self.actions = torch.LongTensor()
    self.rewards = torch.Tensor()
    self.next_observations = torch.Tensor()
    self.mask = torch.Tensor()
end

function ReplayBuffer:insert(observation, action, reward, done, next_observation)
    self.replayBuffer[self.currentIndex] = {
        observation = observation:clone(),
        action = action,
        next_observation = next_observation:clone(),
        reward = reward,
        done = done
    }

    self.currentIndex = self.currentIndex + 1
    if self.currentIndex > self.maxSize then
        self.currentIndex = 1
    end
end

function ReplayBuffer:sample(nSamples)
    local observationSize = self.replayBuffer[1].observation:size(1)
    self.observations:resize(nSamples, observationSize)
    self.actions:resize(nSamples, 1)
    self.rewards:resize(nSamples, 1)
    self.next_observations:resize(nSamples, observationSize)
    self.mask:resize(nSamples, 1)

    for i=1,nSamples do
        local index = torch.random(#self.replayBuffer)
        self.observations[i] = self.replayBuffer[index].observation
        self.actions[i] = self.replayBuffer[index].action
        self.rewards[i] = self.replayBuffer[index].reward
        self.next_observations[i] = self.replayBuffer[index].next_observation
        self.mask[i] = self.replayBuffer[index].done and 0 or 1
    end

    return self.observations, self.actions, self.rewards,
        self.next_observations, self.mask
end

--------------------------------------------------------------------------------
/frameworks_demo/Gym/lua/utils.lua:
--------------------------------------------------------------------------------
require 'layers'

local utils = {}

function utils.mlp(dims)
    local model = nn.Sequential()

    for i=1,#dims-2 do
        model:add(nn.Linear(dims[i], dims[i+1]))
        model:add(nn.LayerNormalization(dims[i+1]))
        model:add(nn.ReLU())
    end

    model:add(nn.Linear(dims[#dims-1], dims[#dims]))
    return model
end

return utils

--------------------------------------------------------------------------------
/frameworks_demo/Gym/python/README.MD:
--------------------------------------------------------------------------------
# Run
```bash
python main.py
```

--------------------------------------------------------------------------------
/frameworks_demo/Gym/python/dqn.py:
--------------------------------------------------------------------------------
import tensorflow as tf
import numpy as np
from tensorflow.contrib import slim


class dqn(object):

    def __init__(self, ndim_input, num_actions, discount=0.99, model_path=None):
        self.num_actions = num_actions

        self.discount = discount
        self.sess = tf.Session()

        self.inputs_tf = tf.placeholder(tf.float32, [None, ndim_input])
        self.q_values_tf = self.get_q_net(self.inputs_tf, num_actions)

        self.action_mask_tf = tf.placeholder(tf.float32, [None, num_actions])
        self.target_tf = tf.placeholder(tf.float32, [None, 1])

        loss = tf.reduce_mean(
            tf.pow(self.q_values_tf - self.target_tf, 2) * self.action_mask_tf)

        optim = tf.train.RMSPropOptimizer(1e-3)

        self.train_op = slim.learning.create_train_op(loss, optim)

        self.sess.run(tf.initialize_all_variables())
        self.saver = tf.train.Saver()

        if
model_path is not None:
            self.saver.restore(self.sess, model_path)

    def get_q_net(self, inputs, num_actions):
        with slim.arg_scope([slim.fully_connected],
                            activation_fn=tf.nn.relu,
                            normalizer_fn=slim.layer_norm):
            # Build on the `inputs` argument rather than reaching for self.inputs_tf
            net = slim.fully_connected(inputs, 128, scope="l1")
            net = slim.fully_connected(net, 128, scope="l2")
            return slim.fully_connected(
                net, num_actions, normalizer_fn=None, activation_fn=None, scope="l3")

    def act(self, obs):
        q_values_np = self.sess.run(self.q_values_tf, {self.inputs_tf: obs})
        return np.argmax(q_values_np, 1)

    def get_targets(self, reward, newobs, mask):
        q_values_np = self.sess.run(
            self.q_values_tf, {self.inputs_tf: newobs})

        return np.max(q_values_np, 1, keepdims=True) * mask * self.discount + reward

    def update(self, obs, action, reward, newobs, mask):
        q_targets = self.get_targets(reward, newobs, mask)

        onehot_actions = np.zeros((action.shape[0], self.num_actions))
        onehot_actions[np.arange(action.shape[0]),
                       action.astype('int')[:, 0]] = 1

        self.sess.run(self.train_op, {self.inputs_tf: obs,
                                      self.target_tf: q_targets,
                                      self.action_mask_tf: onehot_actions})

    def save(self, save_path="/tmp/model.ckpt"):
        self.saver.save(self.sess, save_path)

--------------------------------------------------------------------------------
/frameworks_demo/Gym/python/main.py:
--------------------------------------------------------------------------------
from __future__ import division
from __future__ import absolute_import

import gym
import tensorflow as tf
import numpy as np
from dqn import dqn
from replay_buffer import replay_buffer

# Parameters
BATCH_SIZE = 32
NUM_EPISODES = 100000
MIN_SAMPLES = 2 * BATCH_SIZE  # Minimal number of samples for an update
EVAL_EPISODE = 10  # Perform evaluation each EVAL_EPISODE number of episodes
NUM_STEPS = 200
NUM_EVALUATIONS = 10  # Number of times we run evaluation
INITIAL_EPSILON = 0.9
MIN_EPSILON = 0.1
EPSILON_DECAY = 0.999
BUFFER_SIZE = 100000

env = gym.make('CartPole-v0')

ndim_action = env.action_space.n
ndim_obs = env.observation_space.shape[0]
epsilon = INITIAL_EPSILON

agent = dqn(ndim_obs, ndim_action)
replay = replay_buffer(BUFFER_SIZE)

for i in range(NUM_EPISODES):
    # Training
    obs = env.reset()
    for j in range(NUM_STEPS):
        # Epsilon greedy exploration
        if np.random.uniform() < epsilon:
            action = env.action_space.sample()
        else:
            action = agent.act(np.expand_dims(obs, 0))[0]

        # Perform an action, observe next state and reward
        newobs, reward, done, info = env.step(action)

        # Insert it to replay buffer (mask is 0 for terminal transitions, 1 otherwise)
        replay.insert(obs, action, reward, newobs, 0 if done else 1)

        if done:
            break
        else:
            obs = newobs

        if len(replay.deque) >= MIN_SAMPLES:
            # Decay epsilon
            epsilon = epsilon * EPSILON_DECAY
            epsilon = max(epsilon, MIN_EPSILON)

            # Sample a batch of samples and then update
            observation_batch, action_batch, reward_batch, next_observation_batch, mask_batch = replay.sample(
                BATCH_SIZE)
            agent.update(observation_batch, action_batch, reward_batch,
                         next_observation_batch, mask_batch)

    agent.save()

    # Evaluation
    if i % EVAL_EPISODE == 0:
        total_reward = 0
        for k in range(NUM_EVALUATIONS):
            obs = env.reset()
            for j in range(NUM_STEPS):
                env.render()

                action = agent.act(np.expand_dims(obs, 0))[0]
                newobs, reward, done, info = env.step(action)
                total_reward += reward

                if done:
                    break
                else:
                    obs = newobs

        print("Eval after episode #{0}, average reward: {1}".format(
            i, total_reward / NUM_EVALUATIONS))

--------------------------------------------------------------------------------
/frameworks_demo/Gym/python/replay_buffer.py:
--------------------------------------------------------------------------------
from collections import deque
import numpy as np


class replay_buffer(object):

    def __init__(self, max_size):
        self.max_size = max_size
        self.deque = deque()

        # Reusable batch buffers, resized on the first call to sample()
        self.observation_batch = np.array([])
        self.action_batch = np.array([])
        self.reward_batch = np.array([])
        self.next_observation_batch = np.array([])
        self.mask_batch = np.array([])

    def insert(self, obs, action, reward, newobs, mask):
        self.deque.append([obs, action, reward, newobs, mask])
        if len(self.deque) > self.max_size:
            self.deque.popleft()

    def sample(self, batch_size):
        indices = np.random.choice(len(self.deque), batch_size)

        self.observation_batch.resize(batch_size, self.deque[0][0].shape[0])
        self.action_batch.resize(batch_size, 1)
        self.reward_batch.resize(batch_size, 1)
        self.next_observation_batch.resize(
            batch_size, self.deque[0][0].shape[0])
        self.mask_batch.resize(batch_size, 1)

        for b in range(batch_size):
            other_b = indices[b]
            self.observation_batch[b] = self.deque[other_b][0]
            self.action_batch[b] = self.deque[other_b][1]
            self.reward_batch[b] = self.deque[other_b][2]
            self.next_observation_batch[b] = self.deque[other_b][3]
            self.mask_batch[b] = self.deque[other_b][4]

        return self.observation_batch, self.action_batch, self.reward_batch, self.next_observation_batch, self.mask_batch

--------------------------------------------------------------------------------
/mb_demo.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "mb = require'mazebase'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "g_opts = {games_config_path='/home/sainbar/MazeBase-public/lua/mazebase/config/game_config.lua'}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mb.init_vocab()\n",
    "mb.init_game()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "g_opts.game = 'MultiGoals'\n",
    "g = mb.new_game()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "Console does not support images"
      ]
     },
     "metadata": {
      "image/png": {
       "height": 320,
       "width": 160
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "g:act(1)\n",
    "itorch.image(g.map:to_image())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-0.3\t\n"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "g:get_reward()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{\n"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       " 1 : up\n",
       " 2 : down\n",
       " 3 : left\n",
       " 4 : right\n",
       " 5 : stop\n",
       " 6 : toggle\n",
       " 7 : push_up\n",
       " 8 : push_down\n",
       " 9 : push_left\n",
       " 10 : push_right\n",
       "}\n"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "g.agent.action_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{\n",
       " 1 : obj2\n",
       " 2 : info\n",
       " 3 : goal3\n",
       "}\n"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "g.items_bytype['info'][2]:to_sentence()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "Console does not support images"
      ]
     },
     "metadata": {
      "image/png": {
       "height": 320,
       "width": 160
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "g:place_item({type='block'},3,1)\n",
    "itorch.image(g.map:to_image())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "iTorch",
   "language": "lua",
   "name": "itorch"
  },
  "language_info": {
   "name": "lua",
   "version": "5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}

--------------------------------------------------------------------------------
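
Both DQN demos in this repository (`create_targets` in `dqn.lua`, `get_targets` and the masked loss in `dqn.py`) compute a one-step Q-learning target and penalize only the Q-value of the action that was taken. The following is a minimal NumPy sketch of that arithmetic, with no neural network involved. The function names `q_targets` and `masked_squared_error` and the toy numbers are illustrative assumptions, not part of the repository.

```python
import numpy as np


def q_targets(rewards, next_q_values, mask, discount=0.99):
    """One-step Q-learning targets: r + discount * mask * max_a Q(s', a).

    rewards:       (batch, 1) rewards observed after each action
    next_q_values: (batch, num_actions) Q-values predicted for the next states
    mask:          (batch, 1), 0 where the episode ended and 1 otherwise
    """
    best_next = np.max(next_q_values, axis=1, keepdims=True)
    return rewards + discount * mask * best_next


def masked_squared_error(q_values, targets, actions):
    """Squared error counted only for the taken actions, averaged over all entries."""
    onehot = np.zeros_like(q_values)
    onehot[np.arange(len(actions)), actions] = 1.0
    return np.mean((q_values - targets) ** 2 * onehot)


# Toy batch of two transitions (hypothetical values); the second one is terminal.
rewards = np.array([[1.0], [1.0]])
next_q = np.array([[0.5, 2.0], [3.0, 1.0]])
mask = np.array([[1.0], [0.0]])
targets = q_targets(rewards, next_q, mask, discount=0.9)

q_values = np.array([[1.0, 2.0], [0.0, 0.0]])
actions = np.array([1, 0])
loss = masked_squared_error(q_values, targets, actions)
print(targets.ravel(), loss)
```

For the terminal transition the mask removes the bootstrap term, so its target is just the reward. Averaging over every entry of the Q-value matrix (rather than only the taken actions) mirrors the `tf.reduce_mean` call in `dqn.py` and only rescales the loss by a constant.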