├── LICENSE
├── README.md
├── envs
│   └── cartpole_cont.py
├── gym_cartpole.py
├── gym_pendulum.py
├── plot_cost.py
├── pytorch_mppi
│   ├── __init__.py
│   ├── mppi.py
│   └── smooth_mppi.py
└── setup.py

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2021 Kim Taekyung
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SMPPI Implementation in PyTorch
2 | 
3 | This repository implements Smooth Model Predictive Path Integral control (SMPPI) with a neural-network dynamics model in PyTorch. SMPPI is a general framework that obtains smooth actions from sampling-based MPC without any extra smoothing algorithm (e.g., a Savitzky-Golay filter). The related paper will be released soon.
4 | 
5 | # Installation
6 | 
7 | Clone the repository, then run `pip install -e .` (or `pip3 install -e .`, depending on your environment).
8 | 
9 | Or you can install the dependencies manually:
10 | 
11 | - pytorch
12 | - numpy
13 | - gym
14 | - scipy
15 | 
16 | # How to Run Examples
17 | 
18 | You can run our test examples as follows.
19 | 
20 | For the pendulum:
21 | ```bash
22 | python gym_pendulum.py
23 | ```
24 | For the cartpole:
25 | ```bash
26 | python gym_cartpole.py
27 | ```
28 | 
29 | 
30 | The first example is the inverted pendulum in the gym environment. Sample results from the four different controllers are shown below:
31 | 
32 | | MPPI w/o Smoothing | MPPI (apply smoothing on noise sequence) |
33 | | :------------------------------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------: |
34 | | | |
35 | | __MPPI (apply smoothing on control sequence)__ | __SMPPI__ |
36 | | | |
37 | 
38 | The second example is a cartpole environment with continuous actions. Since MPPI requires random noise sampling of actions, the cartpole environment in OpenAI gym (which has only two discrete actions, left or right) is not suitable for testing MPPI. So we made a custom environment that provides continuous actions; the action can vary continuously from -10.0 to 10.0. 
(For more details, see `envs/cartpole_cont.py`.)
39 | 
40 | The sample result of the SMPPI controller is shown below:
41 | 
42 | 
43 | 
44 | 
45 | Both examples collect a dataset of state-action pairs through exploration. The dynamics models are retrained every 50 iterations. SMPPI can accurately find the optimal action sequence right after the neural-network dynamics is retrained.
46 | # How to Use
47 | 
48 | Simply import SMPPI from `pytorch_mppi` to obtain a sequence of smooth optimal actions from sampling-based MPC.
49 | 
50 | ```python
51 | from pytorch_mppi import smooth_mppi as smppi
52 | # define your dynamics model (works for both nominal dynamics and a neural-network approximation)
53 | # create the controller with chosen parameters
54 | mppi_env = smppi.MPPI(dynamics, running_cost, nx, nu, noise_sigma,
55 |                       num_samples=N_SAMPLES,
56 |                       horizon=TIMESTEPS, lambda_=lambda_, gamma_=gamma_, device=device,
57 |                       w_action_seq_cost=Omega,
58 |                       u_min=torch.tensor(D_ACTION_LOW, dtype=dtype, device=device),
59 |                       u_max=torch.tensor(D_ACTION_HIGH, dtype=dtype, device=device),
60 |                       action_min=torch.tensor(ACTION_LOW, dtype=dtype, device=device),
61 |                       action_max=torch.tensor(ACTION_HIGH, dtype=dtype, device=device))
62 | 
63 | # assuming you have a gym-like env
64 | obs = env.reset()
65 | for i in range(100):
66 |     action = mppi_env.command(obs)
67 |     obs, reward, done, _ = env.step(action.cpu().numpy())
68 | ```
69 | 
70 | Alternatively, you can test the original MPPI with different smoothing methods.
71 | 
72 | ```python
73 | from pytorch_mppi import mppi
74 | # define your dynamics model (works for both nominal dynamics and a neural-network approximation)
75 | # create the controller with chosen parameters
76 | mppi_env = mppi.MPPI(dynamics, running_cost, nx, nu, noise_sigma,
77 |                      num_samples=N_SAMPLES,
78 |                      horizon=TIMESTEPS, lambda_=lambda_, gamma_=gamma_, device=device,
79 |                      u_min=torch.tensor(ACTION_LOW, dtype=dtype, device=device),
80 |                      u_max=torch.tensor(ACTION_HIGH, dtype=dtype, device=device),
81 |                      smooth=SMOOTHING_METHOD)
82 | ```
83 | 
84 | You have three options for `SMOOTHING_METHOD`:
85 | 
86 | 1. __"no filter"__ : no smoothing
87 | 2. __"smooth u"__ : smooth the control sequence after adding noise
88 | 3. __"smooth noise"__ : smooth the noise sequence before adding it
89 | 
90 | For the smoothing algorithm, we use the convolutional Savitzky-Golay filter (from scipy).
91 | 
92 | # Parameters Description
93 | 
94 | ### lambda\_
95 | 
96 | - temperature; a positive scalar where larger values allow more exploration
97 | - we recommend 10.0 ~ 20.0 when you have more than 1,000 samples
98 | 
99 | ### gamma\_
100 | 
101 | - running action cost parameter
102 | - see the [MPPI paper](https://ieeexplore.ieee.org/abstract/document/8558663?casa_token=RTtdCK4jrykAAAAA:YgIhGuAKv_dPA_JjvaxHT2npZuaFVI0utE4JSnDkALwqbUvh676UydsOUg44ka5rawG7edPo) for more detail
103 | 
104 | ### w_action_seq_cost
105 | 
106 | - (nu x nu) weight parameter for smoothing the action sequence
107 | 
108 | ### num_samples
109 | 
110 | - number of trajectories to sample; generally, the more the better (determine this parameter based on the size of your neural network model)
111 | - try to keep it between 1K ~ 10K, if your GPU memory allows
112 | 
113 | ### noise_sigma
114 | 
115 | - (nu x nu) control noise covariance; a larger covariance yields more exploration
116 | 
117 | > See our paper for further information (to be released soon).
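
As a concrete illustration, a minimal parameter setup for a hypothetical 2-D action space might look like the sketch below (the numeric values are illustrative starting points taken from the guidelines above, not tuned recommendations):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.double

nu = 2  # action dimension
# (nu x nu) covariance: independent noise with a different scale per action dimension
noise_sigma = torch.tensor([[1.0, 0.0],
                            [0.0, 2.0]], dtype=dtype, device=device)
lambda_ = 10.0    # temperature; 10.0 ~ 20.0 is recommended for >1,000 samples
gamma_ = 0.1      # running action cost weight
Omega = 1.0       # w_action_seq_cost; here a scalar weight on the action-sequence smoothness cost
N_SAMPLES = 1000  # num_samples; 1K ~ 10K depending on GPU memory
```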
118 | 
119 | # Requirements
120 | 
121 | - `next state <- dynamics(state, action)` function (it doesn't have to be the true dynamics)
122 |   - `state` is `K x nx`, `action` is `K x nu`
123 | - `cost <- running_cost(state, action)` function
124 |   - `cost` is `K x 1`, `state` is `K x nx`, `action` is `K x nu`
125 | 
126 | > __The shapes of the important tensors (such as 'states', 'noise', 'actions') are all commented in the scripts.__
127 | 
128 | # Related Works
129 | 
130 | This repository was built on top of the [pytorch implementation of MPPI](https://github.com/UM-ARM-Lab/pytorch_mppi), to which I contributed before. Thanks to [LemonPi](https://github.com/LemonPi) for the great work.
131 | 
--------------------------------------------------------------------------------
/envs/cartpole_cont.py:
--------------------------------------------------------------------------------
1 | """
2 | Classic cart-pole system implemented by Rich Sutton et al.
3 | Copied from http://incompleteideas.net/sutton/book/code/pole.c
4 | permalink: https://perma.cc/C9ZM-652R
5 | """
6 | 
7 | import math
8 | import gym
9 | from gym import spaces, logger
10 | from gym.utils import seeding
11 | import numpy as np
12 | 
13 | 
14 | class CartPoleContEnv(gym.Env):
15 | 
16 |     metadata = {"render.modes": ["human", "rgb_array"], "video.frames_per_second": 50}
17 | 
18 |     def __init__(self):
19 |         self.gravity = 9.8
20 |         self.masscart = 1.0
21 |         self.masspole = 0.1
22 |         self.total_mass = self.masspole + self.masscart
23 |         self.length = 0.5  # actually half the pole's length
24 |         self.polemass_length = self.masspole * self.length
25 |         # self.force_mag = 30.0
26 |         self.tau = 0.02  # seconds between state updates
27 |         self.min_action = -10.0
28 |         self.max_action = 10.0
29 | 
30 |         # Angle at which to fail the episode
31 |         self.theta_threshold_radians = 12 * 2 * math.pi / 360
32 |         self.x_threshold = 2.4  # 10.0
33 | 
34 |         # Angle limit set to 2 * theta_threshold_radians so a failing observation
35 |         # is still within bounds.
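        # note: x is bounded at 2 * x_threshold so that a terminal observation stays
        # within bounds, while the other entries are left unbounded (inf) since this
        # continuous variant only terminates on the cart position, not the pole angle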
36 | high = np.array( 37 | [ 38 | #np.finfo(np.float32).max, 39 | self.x_threshold * 2, 40 | np.finfo(np.float32).max, 41 | np.finfo(np.float32).max, 42 | #self.theta_threshold_radians * 2, 43 | np.finfo(np.float32).max 44 | ], 45 | dtype=np.float32 46 | ) 47 | 48 | self.action_space = spaces.Box( 49 | low = self.min_action, 50 | high = self.max_action, 51 | shape=(1,), 52 | dtype=np.float32 53 | ) 54 | self.observation_space = spaces.Box(-high, high, dtype=np.float32) 55 | 56 | self.seed() 57 | self.viewer = None 58 | self.state = None 59 | 60 | self.steps_beyond_done = None 61 | 62 | def seed(self, seed=None): 63 | self.np_random, seed = seeding.np_random(seed) 64 | return [seed] 65 | 66 | def step(self, action): 67 | #err_msg = "%r (%s) invalid" % (action, type(action)) 68 | #assert self.action_space.contains(action), err_msg 69 | 70 | x, x_dot, theta, theta_dot = self.state 71 | #force = self.force_mag * float(action) 72 | force = float(action) 73 | costheta = math.cos(theta) 74 | sintheta = math.sin(theta) 75 | 76 | # For the interested reader: 77 | # https://coneural.org/florian/papers/05_cart_pole.pdf 78 | temp = ( 79 | force + self.polemass_length * theta_dot ** 2 * sintheta 80 | ) / self.total_mass 81 | thetaacc = (self.gravity * sintheta - costheta * temp) / ( 82 | self.length * (4.0 / 3.0 - self.masspole * costheta ** 2 / self.total_mass) 83 | ) 84 | xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass 85 | 86 | x = x + self.tau * x_dot 87 | x_dot = x_dot + self.tau * xacc 88 | theta = theta + self.tau * theta_dot 89 | theta_dot = theta_dot + self.tau * thetaacc 90 | 91 | self.state = (x, x_dot, theta, theta_dot) 92 | 93 | #done = False 94 | done = bool( 95 | x < -self.x_threshold 96 | or x > self.x_threshold 97 | #or theta < -self.theta_threshold_radians 98 | #or theta > self.theta_threshold_radians 99 | ) 100 | 101 | if not done: 102 | reward = 1.0 103 | elif self.steps_beyond_done is None: 104 | # Pole just fell! 105 | self.steps_beyond_done = 0 106 | reward = 1.0 107 | else: 108 | if self.steps_beyond_done == 0: 109 | logger.warn( 110 | "You are calling 'step()' even though this " 111 | "environment has already returned done = True. You " 112 | "should always call 'reset()' once you receive 'done = " 113 | "True' -- any further steps are undefined behavior." 
114 | ) 115 | self.steps_beyond_done += 1 116 | reward = 0.0 117 | 118 | return np.array(self.state, dtype=np.float32), reward, done, {} 119 | 120 | def reset(self): 121 | self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,)) 122 | self.state[2] += np.pi 123 | self.steps_beyond_done = None 124 | return np.array(self.state, dtype=np.float32) 125 | 126 | def render(self, mode="human"): 127 | screen_width = 600 128 | screen_height = 400 129 | 130 | world_width = self.x_threshold * 2 131 | scale = screen_width / world_width 132 | carty = 100 # TOP OF CART 133 | polewidth = 10.0 134 | polelen = scale * (2 * self.length) 135 | cartwidth = 50.0 136 | cartheight = 30.0 137 | 138 | if self.viewer is None: 139 | from gym.envs.classic_control import rendering 140 | 141 | self.viewer = rendering.Viewer(screen_width, screen_height) 142 | l, r, t, b = -cartwidth / 2, cartwidth / 2, cartheight / 2, -cartheight / 2 143 | axleoffset = cartheight / 4.0 144 | cart = rendering.FilledPolygon([(l, b), (l, t), (r, t), (r, b)]) 145 | self.carttrans = rendering.Transform() 146 | cart.add_attr(self.carttrans) 147 | self.viewer.add_geom(cart) 148 | l, r, t, b = ( 149 | -polewidth / 2, 150 | polewidth / 2, 151 | polelen - polewidth / 2, 152 | -polewidth / 2, 153 | ) 154 | pole = rendering.FilledPolygon([(l, b), (l, t), (r, t), (r, b)]) 155 | pole.set_color(0.8, 0.6, 0.4) 156 | self.poletrans = rendering.Transform(translation=(0, axleoffset)) 157 | pole.add_attr(self.poletrans) 158 | pole.add_attr(self.carttrans) 159 | self.viewer.add_geom(pole) 160 | self.axle = rendering.make_circle(polewidth / 2) 161 | self.axle.add_attr(self.poletrans) 162 | self.axle.add_attr(self.carttrans) 163 | self.axle.set_color(0.5, 0.5, 0.8) 164 | self.viewer.add_geom(self.axle) 165 | self.track = rendering.Line((0, carty), (screen_width, carty)) 166 | self.track.set_color(0, 0, 0) 167 | self.viewer.add_geom(self.track) 168 | 169 | self._pole_geom = pole 170 | 171 | if self.state is None: 172 | return None 173 | 174 | # Edit the pole polygon vertex 175 | pole = self._pole_geom 176 | l, r, t, b = ( 177 | -polewidth / 2, 178 | polewidth / 2, 179 | polelen - polewidth / 2, 180 | -polewidth / 2, 181 | ) 182 | pole.v = [(l, b), (l, t), (r, t), (r, b)] 183 | 184 | x = self.state 185 | cartx = x[0] * scale + screen_width / 2.0 # MIDDLE OF CART 186 | self.carttrans.set_translation(cartx, carty) 187 | self.poletrans.set_rotation(-x[2]) 188 | 189 | return self.viewer.render(return_rgb_array=mode == "rgb_array") 190 | 191 | def close(self): 192 | if self.viewer: 193 | self.viewer.close() 194 | self.viewer = None 195 | -------------------------------------------------------------------------------- /gym_cartpole.py: -------------------------------------------------------------------------------- 1 | import gym 2 | import numpy as np 3 | import torch 4 | import logging 5 | import math 6 | from gym import wrappers, logger as gym_log 7 | from torch.utils.data import DataLoader 8 | from torch.utils.data import Dataset 9 | from envs import cartpole_cont 10 | 11 | gym_log.set_level(gym_log.INFO) 12 | logger = logging.getLogger(__name__) 13 | logging.basicConfig(level=logging.INFO, 14 | format='[%(levelname)s %(asctime)s %(pathname)s:%(lineno)d] %(message)s', 15 | datefmt='%m-%d %H:%M:%S') 16 | 17 | SMPPI = True 18 | 19 | # three options for control smoothing 20 | # 1: "no filter" : no smoothing 21 | # 2: "smooth u" : smooth control sequence after adding noise 22 | # 3: "smooth noise" : smooth noise sequence before adding noise 23 | ## for 
more detail, see our paper (now under review).
24 | if not SMPPI:
25 |     SMOOTH = "no filter"
26 | 
27 | 
28 | 
29 | if SMPPI:
30 |     from pytorch_mppi import smooth_mppi as mppi
31 | else:
32 |     from pytorch_mppi import mppi
33 | 
34 | if __name__ == "__main__":
35 |     # ENV_NAME = "ContinuousCart_Pole-v1"
36 |     TIMESTEPS = 75
37 |     N_SAMPLES = 1000
38 |     ACTION_LOW = -10.0
39 |     ACTION_HIGH = 10.0
40 |     D_ACTION_LOW = -1.0
41 |     D_ACTION_HIGH = 1.0
42 | 
43 |     device = torch.device("cuda") if torch.cuda.is_available(
44 |     ) else torch.device("cpu")
45 |     dtype = torch.double
46 | 
47 |     noise_sigma = torch.tensor([5.0], device=device, dtype=dtype)
48 |     # if the size of the action space is larger than 1:
49 |     # noise_sigma = torch.tensor([[1, 0], [0, 2]], device=device, dtype=dtype)
50 |     lambda_ = 10.
51 |     gamma_ = 0.1
52 | 
53 |     import random
54 | 
55 |     randseed = 42
56 |     if randseed is None:
57 |         randseed = random.randint(0, 1000000)
58 |     random.seed(randseed)
59 |     np.random.seed(randseed)
60 |     torch.manual_seed(randseed)
61 |     logger.info("random seed %d", randseed)
62 | 
63 |     H_UNITS = 32
64 |     TRAIN_EPOCH = 100
65 |     BOOT_STRAP_ITER = 0  # 30000
66 |     EPISODE_CUT = 1000
67 |     BATCH_SIZE = 50
68 | 
69 |     cost_tolerance = 40.
70 |     SUCCESS_CRITERION = 1000
71 | 
72 |     nx = 4
73 |     nu = 1
74 |     # network output is the state residual
75 |     network = torch.nn.Sequential(
76 |         torch.nn.Linear(nx + nu, H_UNITS),
77 |         torch.nn.Tanh(),
78 |         torch.nn.Linear(H_UNITS, H_UNITS),
79 |         torch.nn.Tanh(),
80 |         torch.nn.Linear(H_UNITS, nx - 1)
81 |     ).double().to(device=device)
82 | 
83 |     def dynamics(state, perturbed_action):
84 |         tau = 0.02
85 |         u = torch.clamp(perturbed_action, ACTION_LOW, ACTION_HIGH)
86 |         xu = torch.cat((state, u), dim=1)
87 |         # feed in cosine and sine of angle instead of theta
88 |         xu = torch.cat(
89 |             (xu[:, 1].view(-1, 1),
90 |              torch.sin(xu[:, 2]).view(-1, 1),
91 |              torch.cos(xu[:, 2]).view(-1, 1),
92 |              xu[:, 3].view(-1, 1),
93 |              xu[:, 4].view(-1, 1)), dim=1)
94 |         network.eval()
95 |         with torch.no_grad():
96 |             state_residual = network(xu)
97 | 
98 |         # the network outputs the residuals of [x_dot, theta, theta_dot], so we can just add them
99 |         next_state = torch.zeros_like(state)
100 |         next_state[:, 1:] = state[:, 1:].clone().detach() + state_residual
101 |         next_state[:, 0] = state[:, 0].clone().detach() + tau * state[:, 1].clone().detach()
102 |         next_state[:, 2] = angle_normalize(next_state[:, 2])
103 |         return next_state
104 | 
105 |     def true_dynamics(state, perturbed_action):
106 |         perturbed_action = torch.clamp(perturbed_action, ACTION_LOW, ACTION_HIGH)
107 |         gravity = 9.8
108 |         masscart = 1.0
109 |         masspole = 0.1
110 |         total_mass = masspole + masscart
111 |         length = 0.5  # actually half the pole's length
112 |         polemass_length = masspole * length
113 |         tau = 0.02  # seconds between state updates
114 | 
115 |         x = state[:, 0].view(-1, 1)
116 |         x_dot = state[:, 1].view(-1, 1)
117 |         theta = state[:, 2].view(-1, 1)
118 |         theta_dot = state[:, 3].view(-1, 1)
119 | 
120 |         # force = force_mag * perturbed_action
121 |         force = perturbed_action
122 |         costheta = torch.cos(theta)
123 |         sintheta = torch.sin(theta)
124 |         # For the interested reader:
125 |         # https://coneural.org/florian/papers/05_cart_pole.pdf
126 |         temp = (
127 |             force + polemass_length * theta_dot ** 2 * sintheta
128 |         ) / total_mass
129 |         thetaacc = (gravity * sintheta - costheta * temp) / (
130 |             length * (4.0 / 3.0 - masspole * costheta ** 2 / total_mass)
131 |         )
132 |         xacc = temp - polemass_length * thetaacc * costheta / total_mass
133 | 
134 |         x = x + tau * x_dot
135 |         x_dot = x_dot + tau * xacc
136 |         theta = theta + tau * 
theta_dot 137 | theta_dot = theta_dot + tau * thetaacc 138 | theta = angle_normalize(theta) 139 | next_state = torch.cat((x, x_dot, theta, theta_dot), dim=1) 140 | return next_state 141 | 142 | def angular_diff_batch(a, b): 143 | """Angle difference from b to a (a - b)""" 144 | d = a - b 145 | d[d > math.pi] -= 2 * math.pi 146 | d[d < -math.pi] += 2 * math.pi 147 | return d 148 | 149 | def angle_normalize(x): 150 | return (((x + math.pi) % (2 * math.pi)) - math.pi) 151 | 152 | def running_cost(state): 153 | x = state[:, 0] 154 | x_dot = state[:, 1] 155 | theta = state[:, 2] 156 | theta_dot = state[:, 3] 157 | w_x = 50. 158 | w_x_dot = 0.01 159 | w_theta = 5. 160 | w_theta_dot = 0.01 161 | cost = w_x * x ** 2 + w_x_dot * x_dot ** 2 + w_theta * angle_normalize(theta) ** 2 + w_theta_dot * theta_dot ** 2 162 | return cost 163 | 164 | dataset_xu = None 165 | dataset_Y = None 166 | # create some true dynamics validation set to compare model 167 | Nv = 1000 168 | statev = torch.cat(((torch.rand(Nv, 1, dtype=dtype, device=device) - 0.5) * 2 * 1.2, 169 | (torch.rand(Nv, 1, dtype=dtype, device=device) - 0.5) * 2, 170 | (torch.rand(Nv, 1, dtype=dtype, device=device) - 0.5) * 2 * math.pi * 10 / 180, 171 | (torch.rand(Nv, 1, dtype=dtype, device=device) - 0.5) * 2 172 | ), dim=1) 173 | actionv = (torch.rand(Nv, 1, dtype=dtype, device=device) - 174 | 0.5) * (ACTION_HIGH - ACTION_LOW) 175 | 176 | class CustomDataset(Dataset): 177 | def __init__(self, x, y): 178 | self.x_data = x 179 | self.y_data = y 180 | 181 | def __len__(self): 182 | return len(self.x_data) 183 | 184 | def __getitem__(self, item): 185 | x_ = self.x_data[item] 186 | y_ = self.y_data[item] 187 | return x_, y_ 188 | 189 | def dataset_append(state, action, next_state): 190 | global dataset_xu, dataset_Y 191 | state[2] = angle_normalize(state[2]) 192 | next_state[2] = angle_normalize(next_state[2]) 193 | action = torch.clamp(action.clone().detach(), ACTION_LOW, ACTION_HIGH) 194 | 195 | xu = torch.cat((state, action), dim=0) 196 | 197 | xu = torch.tensor((xu[1], 198 | torch.sin(xu[2]), 199 | torch.cos(xu[2]), 200 | xu[3], 201 | xu[4])).view(1, -1) 202 | dx = next_state[0] - state[0] 203 | dx_dot = next_state[1] - state[1] 204 | dtheta = angular_diff_batch(next_state[2], state[2]) 205 | dtheta_dot = next_state[3] - state[3] 206 | Y = torch.tensor((dx_dot, dtheta, dtheta_dot)).view(1, -1).clone().detach() 207 | 208 | if dataset_xu is None and dataset_Y is None: 209 | dataset_xu = xu 210 | dataset_Y = Y 211 | 212 | else: 213 | dataset_xu = torch.cat((dataset_xu, xu), dim=0) 214 | dataset_Y = torch.cat((dataset_Y, Y), dim=0) 215 | 216 | def train(epoch=TRAIN_EPOCH): 217 | global dataset_xu, dataset_Y, network 218 | 219 | # thaw network 220 | for param in network.parameters(): 221 | param.requires_grad = True 222 | 223 | optimizer = torch.optim.Adam(network.parameters(), lr=1e-3) 224 | train_dataset = CustomDataset(dataset_xu, dataset_Y) 225 | train_loader = DataLoader( 226 | train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True) 227 | 228 | network.train() 229 | for i in range(epoch): 230 | # MSE loss 231 | for x, y in train_loader: 232 | x, y = x.to(device), y.to(device) 233 | 234 | yhat = network(x) 235 | loss = (y - yhat).norm(2, dim=1) ** 2 236 | optimizer.zero_grad() 237 | loss.mean().backward() 238 | optimizer.step() 239 | logger.debug("ds %d epoch %d loss %f", 240 | dataset_xu.shape[0], i, loss.mean().item()) 241 | 242 | # freeze network 243 | for param in network.parameters(): 244 | param.requires_grad = False 245 | 246 | # 
evaluate network against true dynamics 247 | yt = true_dynamics(statev, actionv) 248 | yp = dynamics(statev, actionv) 249 | dx = yp[:, 0] - yt[:, 0] 250 | dx_dot = yp[:, 1] - yt[:, 1] 251 | dtheta = angular_diff_batch(yp[:, 2], yt[:, 2]) 252 | dtheta_dot = yp[:, 3] - yt[:, 3] 253 | E = torch.cat((dx.view(-1, 1), dx_dot.view(-1, 1), dtheta.view(-1, 1), dtheta_dot.view(-1, 1)), 254 | dim=1).norm(dim=1) 255 | logger.info("Error with true dynamics x %f x_dot %f theta %f theta_dot %f norm %f", dx.abs().mean(), dx_dot.abs().mean(), dtheta.abs().mean(), 256 | dtheta_dot.abs().mean(), E.mean()) 257 | logger.debug("Start next collection sequence") 258 | 259 | def model_save(): 260 | global network 261 | torch.save(network.state_dict(), 'model_weights_cartpole.pth') 262 | 263 | env = cartpole_cont.CartPoleContEnv() 264 | if BOOT_STRAP_ITER: 265 | logger.info( 266 | "bootstrapping with random action for %d actions", BOOT_STRAP_ITER) 267 | data_count = 0 268 | while True: 269 | env.reset() 270 | for i in range(EPISODE_CUT): 271 | state = env.state 272 | state = torch.tensor(state, dtype=torch.float64).to(device=device) 273 | action = np.random.uniform(low=ACTION_LOW, high=ACTION_HIGH) 274 | action = torch.tensor([action], dtype=torch.float64).to(device=device) 275 | s, _, done, _ = env.step(action.cpu().numpy()) 276 | next_state = env.state 277 | next_state = torch.tensor(next_state, dtype=torch.float64).to(device=device) 278 | dataset_append(state, action, next_state) 279 | data_count += 1 280 | if data_count == BOOT_STRAP_ITER: 281 | break 282 | if done: 283 | break 284 | if data_count == BOOT_STRAP_ITER: 285 | break 286 | train(epoch=500) 287 | logger.info("bootstrapping finished") 288 | 289 | env = wrappers.Monitor(env, '/tmp/mppi/', force=True) 290 | 291 | if SMPPI: 292 | mppi_gym = mppi.MPPI(dynamics, running_cost, nx, nu, noise_sigma, 293 | num_samples=N_SAMPLES, 294 | horizon=TIMESTEPS, 295 | lambda_=lambda_, 296 | gamma_=gamma_, 297 | device=device, 298 | u_min=torch.tensor( 299 | D_ACTION_LOW, dtype=dtype, device=device), 300 | u_max=torch.tensor( 301 | D_ACTION_HIGH, dtype=dtype, device=device), 302 | action_min=torch.tensor( 303 | ACTION_LOW, dtype=dtype, device=device), 304 | action_max=torch.tensor(ACTION_HIGH, dtype=dtype, device=device)) 305 | else: 306 | mppi_gym = mppi.MPPI(dynamics, running_cost, nx, nu, noise_sigma, 307 | num_samples=N_SAMPLES, 308 | horizon=TIMESTEPS, 309 | lambda_=lambda_, 310 | gamma_=gamma_, 311 | device=device, 312 | u_min=torch.tensor( 313 | ACTION_LOW, dtype=dtype, device=device), 314 | u_max=torch.tensor( 315 | ACTION_HIGH, dtype=dtype, device=device), 316 | smooth=SMOOTH) 317 | 318 | cost_history = mppi.run_mppi_episode(mppi_gym, env, dataset_append, train, running_cost, model_save, cost_tolerance, SUCCESS_CRITERION) -------------------------------------------------------------------------------- /gym_pendulum.py: -------------------------------------------------------------------------------- 1 | import gym 2 | import numpy as np 3 | import torch 4 | import logging 5 | import math 6 | from gym import wrappers, logger as gym_log 7 | from torch.utils.data import DataLoader 8 | from torch.utils.data import Dataset 9 | 10 | gym_log.set_level(gym_log.INFO) 11 | logger = logging.getLogger(__name__) 12 | logging.basicConfig(level=logging.INFO, 13 | format='[%(levelname)s %(asctime)s %(pathname)s:%(lineno)d] %(message)s', 14 | datefmt='%m-%d %H:%M:%S') 15 | 16 | SMPPI = True 17 | 18 | downward_start = True 19 | INIT_VEL = 0 20 | 21 | # three options for control 
smoothing
22 | # 1: "no filter"   : no smoothing
23 | # 2: "smooth u"     : smooth the control sequence after adding noise
24 | # 3: "smooth noise" : smooth the noise sequence before adding noise
25 | # for more detail, see our paper (now under review).
26 | if not SMPPI:
27 |     SMOOTH = "no filter"
28 | 
29 | 
30 | if SMPPI:
31 |     from pytorch_mppi import smooth_mppi as mppi
32 | else:
33 |     from pytorch_mppi import mppi
34 | 
35 | if __name__ == "__main__":
36 |     ENV_NAME = "Pendulum-v1"
37 |     TIMESTEPS = 15  # T
38 |     N_SAMPLES = 1000  # K
39 |     ACTION_LOW = -2.0
40 |     ACTION_HIGH = 2.0
41 |     D_ACTION_LOW = -8.0
42 |     D_ACTION_HIGH = 8.0
43 | 
44 |     device = torch.device("cuda") if torch.cuda.is_available(
45 |     ) else torch.device("cpu")
46 |     dtype = torch.double
47 | 
48 |     noise_sigma = torch.tensor([1.], device=device, dtype=dtype)
49 |     # if the size of the action space is larger than 1:
50 |     # noise_sigma = torch.tensor([[1, 0], [0, 2]], device=device, dtype=dtype)
51 |     lambda_ = 10.
52 |     gamma_ = 0.1
53 | 
54 |     import random
55 | 
56 |     randseed = 42
57 |     if randseed is None:
58 |         randseed = random.randint(0, 1000000)
59 |     random.seed(randseed)
60 |     np.random.seed(randseed)
61 |     torch.manual_seed(randseed)
62 |     logger.info("random seed %d", randseed)
63 | 
64 |     # new hyperparameters for the approximate dynamics
65 |     H_UNITS = 32
66 |     TRAIN_EPOCH = 100  # 150
67 |     BOOT_STRAP_ITER = 0
68 |     EPISODE_CUT = 1000
69 |     BATCH_SIZE = 50
70 | 
71 |     cost_tolerance = 0.1
72 |     SUCCESS_CRITERION = 300
73 | 
74 |     nx = 2
75 |     nu = 1
76 |     # network output is the state residual
77 |     network = torch.nn.Sequential(
78 |         torch.nn.Linear(nx + nu + 1, H_UNITS),
79 |         torch.nn.Tanh(),
80 |         torch.nn.Linear(H_UNITS, H_UNITS),
81 |         torch.nn.Tanh(),
82 |         torch.nn.Linear(H_UNITS, nx)
83 |     ).double().to(device=device)
84 | 
85 |     def dynamics(state, perturbed_action):
86 |         u = torch.clamp(perturbed_action, ACTION_LOW, ACTION_HIGH)
87 |         if state.dim() == 1 or u.dim() == 1:
88 |             state = state.view(1, -1)
89 |             u = u.view(1, -1)
90 |         xu = torch.cat((state, u), dim=1)
91 |         # feed in cosine and sine of angle instead of theta
92 |         xu = torch.cat((torch.sin(
93 |             xu[:, 0]).view(-1, 1), torch.cos(xu[:, 0]).view(-1, 1), xu[:, 1:]), dim=1)
94 | 
95 |         network.eval()
96 |         with torch.no_grad():
97 |             state_residual = network(xu)
98 |         # the network outputs dtheta directly, so we can just add it
99 |         next_state = state.clone().detach() + state_residual
100 |         next_state[:, 0] = angle_normalize(next_state[:, 0])
101 |         return next_state
102 | 
103 |     def true_dynamics(state, perturbed_action):
104 |         # true dynamics from gym
105 |         th = state[:, 0].view(-1, 1)
106 |         thdot = state[:, 1].view(-1, 1)
107 | 
108 |         g = 10
109 |         m = 1
110 |         l = 1
111 |         dt = 0.05
112 | 
113 |         u = perturbed_action
114 |         u = torch.clamp(u, -2, 2)
115 | 
116 |         newthdot = thdot + (-3 * g / (2 * l) *
117 |                             torch.sin(th + np.pi) + 3. 
/ (m * l ** 2) * u) * dt 118 | newth = th + newthdot * dt 119 | newthdot = torch.clamp(newthdot, -8, 8) 120 | 121 | next_state = torch.cat((newth, newthdot), dim=1) 122 | return next_state 123 | 124 | def angular_diff_batch(a, b): 125 | """Angle difference from b to a (a - b)""" 126 | d = a - b 127 | d[d > math.pi] -= 2 * math.pi 128 | d[d < -math.pi] += 2 * math.pi 129 | return d 130 | 131 | def angle_normalize(x): 132 | return (((x + math.pi) % (2 * math.pi)) - math.pi) 133 | 134 | def running_cost(state): 135 | theta = state[:, 0] 136 | theta_dt = state[:, 1] 137 | cost = angle_normalize(theta) ** 2 + 0.1 * theta_dt ** 2 138 | return cost 139 | 140 | dataset_xu = None 141 | dataset_Y = None 142 | # create some true dynamics validation set to compare model against 143 | Nv = 1000 144 | statev = torch.cat(((torch.rand(Nv, 1, dtype=dtype, device=device) - 0.5) * 2 * math.pi, 145 | (torch.rand(Nv, 1, dtype=dtype, device=device) - 0.5) * 16), dim=1) 146 | actionv = (torch.rand(Nv, 1, dtype=dtype, device=device) - 147 | 0.5) * (ACTION_HIGH - ACTION_LOW) 148 | 149 | class CustomDataset(Dataset): 150 | def __init__(self, x, y): 151 | self.x_data = x 152 | self.y_data = y 153 | 154 | def __len__(self): 155 | return len(self.x_data) 156 | 157 | def __getitem__(self, item): 158 | x_ = self.x_data[item] 159 | y_ = self.y_data[item] 160 | return x_, y_ 161 | 162 | def dataset_append(state, action, next_state): 163 | global dataset_xu, dataset_Y 164 | state[0] = angle_normalize(state[0]) 165 | next_state[0] = angle_normalize(next_state[0]) 166 | action = torch.clamp(action.clone().detach(), ACTION_LOW, ACTION_HIGH) 167 | 168 | xu = torch.cat((state, action), dim=0) 169 | 170 | xu = torch.tensor( 171 | (torch.sin(xu[0]), 172 | torch.cos(xu[0]), 173 | xu[1], 174 | xu[2])).view(1, -1) 175 | dtheta = angular_diff_batch(next_state[0], state[0]) 176 | dtheta_dot = next_state[1] - state[1] 177 | Y = torch.tensor((dtheta, dtheta_dot)).view(1, -1).clone().detach() 178 | 179 | if dataset_xu is None and dataset_Y is None: 180 | dataset_xu = xu 181 | dataset_Y = Y 182 | 183 | else: 184 | dataset_xu = torch.cat((dataset_xu, xu), dim=0) 185 | dataset_Y = torch.cat((dataset_Y, Y), dim=0) 186 | 187 | def train(epoch=TRAIN_EPOCH): 188 | global dataset_xu, dataset_Y, network 189 | # thaw network 190 | for param in network.parameters(): 191 | param.requires_grad = True 192 | 193 | optimizer = torch.optim.Adam(network.parameters()) 194 | train_dataset = CustomDataset(dataset_xu, dataset_Y) 195 | train_loader = DataLoader( 196 | train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True) 197 | 198 | network.train() 199 | for i in range(epoch): 200 | # MSE loss 201 | for x, y in train_loader: 202 | x, y = x.to(device), y.to(device) 203 | yhat = network(x) 204 | loss = (y - yhat).norm(2, dim=1) ** 2 205 | optimizer.zero_grad() 206 | loss.mean().backward() 207 | optimizer.step() 208 | logger.debug("ds %d epoch %d loss %f", 209 | dataset_xu.shape[0], i, loss.mean().item()) 210 | 211 | # freeze network 212 | for param in network.parameters(): 213 | param.requires_grad = False 214 | 215 | # evaluate network against true dynamics 216 | yt = true_dynamics(statev, actionv) 217 | yp = dynamics(statev, actionv) 218 | dtheta = angular_diff_batch(yp[:, 0], yt[:, 0]) 219 | dtheta_dt = yp[:, 1] - yt[:, 1] 220 | E = torch.cat((dtheta.view(-1, 1), dtheta_dt.view(-1, 1)), 221 | dim=1).norm(dim=1) 222 | logger.info("Error with true dynamics theta %f theta_dt %f norm %f", dtheta.abs().mean(), 223 | dtheta_dt.abs().mean(), E.mean()) 
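        # note: the error above is a one-step prediction error on a fixed random
        # validation set (statev, actionv), pushed through both the learned and the
        # true dynamics; it does not measure multi-step rollout drift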
224 | logger.debug("Start next collection sequence") 225 | 226 | def model_save(): 227 | global network 228 | torch.save(network.state_dict(), 'model_weights_pendulum.pth') 229 | 230 | env = gym.make(ENV_NAME).env # bypass the default TimeLimit wrapper 231 | # bootstrap network with random actions 232 | if BOOT_STRAP_ITER: 233 | logger.info( 234 | "bootstrapping with random action for %d actions", BOOT_STRAP_ITER) 235 | data_count = 0 236 | while True: 237 | env.reset() 238 | for i in range(EPISODE_CUT): 239 | state = env.state 240 | state = torch.tensor( 241 | state, dtype=torch.float64).to(device=device) 242 | action = np.random.uniform(low=ACTION_LOW, high=ACTION_HIGH) 243 | action = torch.tensor( 244 | [action], dtype=torch.float64).to(device=device) 245 | s, _, done, _ = env.step(action.cpu().numpy()) 246 | next_state = env.state 247 | next_state = torch.tensor( 248 | next_state, dtype=torch.float64).to(device=device) 249 | dataset_append(state, action, next_state) 250 | data_count += 1 251 | if data_count == BOOT_STRAP_ITER: 252 | break 253 | if done: 254 | break 255 | if data_count == BOOT_STRAP_ITER: 256 | break 257 | train(epoch=500) 258 | logger.info("bootstrapping finished") 259 | 260 | env = wrappers.Monitor(env, '/tmp/mppi/', force=True) 261 | env.reset() 262 | if downward_start: 263 | env.env.state = [np.pi, INIT_VEL] 264 | 265 | if SMPPI: 266 | mppi_gym = mppi.MPPI(dynamics, running_cost, nx, nu, noise_sigma, 267 | num_samples=N_SAMPLES, 268 | horizon=TIMESTEPS, 269 | lambda_=lambda_, 270 | gamma_=gamma_, 271 | device=device, 272 | u_min=torch.tensor( 273 | D_ACTION_LOW, dtype=dtype, device=device), 274 | u_max=torch.tensor( 275 | D_ACTION_HIGH, dtype=dtype, device=device), 276 | action_min=torch.tensor( 277 | ACTION_LOW, dtype=dtype, device=device), 278 | action_max=torch.tensor(ACTION_HIGH, dtype=dtype, device=device)) 279 | else: 280 | mppi_gym = mppi.MPPI(dynamics, running_cost, nx, nu, noise_sigma, 281 | num_samples=N_SAMPLES, 282 | horizon=TIMESTEPS, 283 | lambda_=lambda_, 284 | gamma_=gamma_, 285 | device=device, 286 | u_min=torch.tensor( 287 | ACTION_LOW, dtype=dtype, device=device), 288 | u_max=torch.tensor( 289 | ACTION_HIGH, dtype=dtype, device=device), 290 | smooth=SMOOTH) 291 | 292 | cost_history = mppi.run_mppi_episode( 293 | mppi_gym, env, dataset_append, train, running_cost, model_save, cost_tolerance, SUCCESS_CRITERION) 294 | -------------------------------------------------------------------------------- /plot_cost.py: -------------------------------------------------------------------------------- 1 | from plotly.subplots import make_subplots 2 | import plotly.graph_objects as go 3 | import csv 4 | import numpy as np 5 | import os 6 | import random 7 | import math 8 | import torch 9 | import torch.nn as nn 10 | from torch.utils.data import Dataset 11 | from torch.utils.data import DataLoader 12 | import torch.optim as optim 13 | import argparse 14 | import seaborn as sns 15 | import matplotlib.pyplot as plt 16 | import pandas as pd 17 | 18 | timestep = 50 19 | 20 | 21 | def csv2np(filename): 22 | raw_data = [] 23 | with open(filename, newline='') as csvfile: 24 | spamreader = csv.reader(csvfile, delimiter=',', quotechar='|') 25 | for i, row in enumerate(spamreader): 26 | raw_data.append(row) 27 | return np.array(raw_data, dtype='float32') 28 | 29 | 30 | filename = 'original' 31 | split_data = csv2np(filename+'.csv') 32 | split_iteration = np.hsplit(split_data, timestep) 33 | split_iteration = np.array(split_iteration) 34 | print(split_iteration.shape) 35 | 
iteration_column = np.reshape( 36 | np.arange(split_iteration.shape[0]), (split_iteration.shape[0], 1)) 37 | df = pd.DataFrame(data=np.repeat(iteration_column, 38 | split_iteration.shape[1], axis=0), columns=['Iteration']) 39 | mean_cost = [np.mean(it, axis=1) for it in split_iteration] 40 | mean_cost = np.transpose(np.array(mean_cost)) 41 | mean_cost = np.clip(mean_cost, 0.01, 10) 42 | df.loc[:, 'Original w/o filter'] = pd.Series(np.squeeze(np.reshape( 43 | np.transpose(mean_cost), (1, -1))), index=df.index) 44 | 45 | # fig = go.Figure() 46 | # mean_cost = split_data 47 | # t = np.arange(0, mean_cost.shape[1]) 48 | 49 | # fig.add_trace(go.Scatter(x=t, y=mean_cost[0], mode='lines')) 50 | # fig.add_trace(go.Scatter(x=t, y=mean_cost[1], mode='lines')) 51 | # fig.add_trace(go.Scatter(x=t, y=mean_cost[2], mode='lines')) 52 | # fig.add_trace(go.Scatter(x=t, y=mean_cost[3], mode='lines')) 53 | # fig.add_trace(go.Scatter(x=t, y=mean_cost[4], mode='lines')) 54 | # fig.add_trace(go.Scatter(x=t, y=mean_cost[5], mode='lines')) 55 | # fig.add_trace(go.Scatter(x=t, y=mean_cost[6], mode='lines')) 56 | # fig.show() 57 | 58 | 59 | # fig = go.Figure() 60 | 61 | # fig.add_trace(go.Scatter(x=t, y=mean_cost, mode='lines', name='original', marker={ 62 | # 'color': 'rgb(255,0,0)'}, line=dict(width=4, dash='dash'))) 63 | 64 | filename = 'u_cost' 65 | split_data = csv2np(filename+'.csv') 66 | split_iteration = np.hsplit(split_data, timestep) 67 | split_iteration = np.array(split_iteration) 68 | mean_cost = [np.mean(it, axis=1) for it in split_iteration] 69 | mean_cost = np.transpose(np.array(mean_cost)) 70 | mean_cost = np.clip(mean_cost, 0.01, 10) 71 | df.loc[:, 'Original w/ action cost'] = pd.Series(np.squeeze(np.reshape( 72 | np.transpose(mean_cost), (1, -1))), index=df.index) 73 | 74 | filename = 'original_filter_on_noise' 75 | split_data = csv2np(filename+'.csv') 76 | split_iteration = np.hsplit(split_data, timestep) 77 | split_iteration = np.array(split_iteration) 78 | mean_cost = [np.mean(it, axis=1) for it in split_iteration] 79 | mean_cost = np.transpose(np.array(mean_cost)) 80 | mean_cost = np.clip(mean_cost, 0.01, 10) 81 | df.loc[:, 'Original (SGF(\u03B5))'] = pd.Series(np.squeeze(np.reshape( 82 | np.transpose(mean_cost), (1, -1))), index=df.index) 83 | 84 | filename = 'original_filter_on_u' 85 | split_data = csv2np(filename+'.csv') 86 | split_iteration = np.hsplit(split_data, timestep) 87 | split_iteration = np.array(split_iteration) 88 | mean_cost = [np.mean(it, axis=1) for it in split_iteration] 89 | mean_cost = np.transpose(np.array(mean_cost)) 90 | mean_cost = np.clip(mean_cost, 0.01, 10) 91 | df.loc[:, 'Original (SGF(u))'] = pd.Series(np.squeeze(np.reshape( 92 | np.transpose(mean_cost), (1, -1))), index=df.index) 93 | 94 | 95 | filename = 'das' 96 | split_data = csv2np(filename+'.csv') 97 | split_iteration = np.hsplit(split_data, timestep) 98 | split_iteration = np.array(split_iteration) 99 | mean_cost = [np.mean(it, axis=1) for it in split_iteration] 100 | mean_cost = np.transpose(np.array(mean_cost)) 101 | mean_cost = np.clip(mean_cost, 0.01, 10) 102 | df.loc[:, 'Ours'] = pd.Series(np.squeeze(np.reshape( 103 | np.transpose(mean_cost), (1, -1))), index=df.index) 104 | 105 | 106 | #sns.lineplot(data=df, x="iteration", y="Ours", ci=50, legend='full') 107 | #sns.lineplot(data=df, x="iteration", y="Original", ci=50, legend='full') 108 | df = df.melt('Iteration', var_name='Method', value_name='State cost') 109 | plt.figure(figsize=[6, 4]) 110 | 111 | sns.set_palette(reversed(sns.color_palette('Set1', 
5)), 5)
112 | sns.lineplot(data=df, x='Iteration', y='State cost', hue='Method', ci=80)
113 | plt.yscale('log')
114 | plt.tight_layout()
115 | plt.xlim(0, 35)
116 | plt.ylim(0.009, 10)
117 | plt.xticks(fontsize=12)
118 | plt.yticks(fontsize=12)
119 | plt.xlabel(xlabel='Iteration', fontsize=12)
120 | plt.ylabel(ylabel='State cost', fontsize=12)
121 | plt.legend(prop={'size': 12})
122 | plt.savefig('/home/add/Desktop/tempo.pdf', format='pdf')
123 | # plt.legend()
124 | plt.show()
125 | plt.close()
126 | 
127 | 
128 | # fig.update_layout(
129 | #     xaxis_title="iteration",
130 | #     yaxis_title="cost",
131 | #     height=650,
132 | #     width=1800,
133 | #     margin=dict(l=30, r=30, t=30, b=30),
134 | #     font=dict(size=20),
135 | #     legend=dict(font=dict(size=25), yanchor="top",
136 | #                 y=0.99, xanchor="right", x=0.99)
137 | # )
138 | # fig.layout.template = 'plotly_white'
139 | # fig.show()
140 | 
141 | # if not os.path.exists("image"):
142 | #     os.mkdir("image")
143 | # fig.write_image("./image/inference_"+filename+"_vx.svg")
144 | 
--------------------------------------------------------------------------------
/pytorch_mppi/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tkkim-robot/smooth-mppi-pytorch/dcc6f4285069dbf8add0a111c8b936d0ef1830e6/pytorch_mppi/__init__.py
--------------------------------------------------------------------------------
/pytorch_mppi/mppi.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import time
3 | import logging
4 | from torch.distributions.multivariate_normal import MultivariateNormal
5 | import numpy as np
6 | import math
7 | import csv
8 | from scipy.signal import savgol_filter
9 | 
10 | f = open('mppi.csv', 'a', encoding='utf-8', newline='')
11 | wr = csv.writer(f)
12 | 
13 | logger = logging.getLogger(__name__)
14 | 
15 | def is_tensor_like(x):
16 |     return torch.is_tensor(x) or type(x) is np.ndarray
17 | 
18 | 
19 | class MPPI():
20 |     """
21 |     Model Predictive Path Integral control
22 |     This implementation batch samples the trajectories and so scales well with the number of samples K.
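    Optionally applies a Savitzky-Golay filter to the updated control sequence or to the weighted noise (see the `smooth` parameter).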
23 | 
24 |     Implemented according to algorithm 2 in Williams et al., 2017
25 |     'Information Theoretic MPC for Model-Based Reinforcement Learning',
26 |     based off of https://github.com/ferreirafabio/mppi_pendulum
27 |     """
28 | 
29 |     def __init__(self, dynamics, running_cost, nx, nu, noise_sigma, num_samples=100, horizon=15, device="cpu",
30 |                  terminal_state_cost=None,
31 |                  lambda_=1.,
32 |                  gamma_=1.,
33 |                  noise_mu=None,
34 |                  u_min=None,
35 |                  u_max=None,
36 |                  u_init=None,
37 |                  U_init=None,
38 |                  u_scale=1,
39 |                  u_per_command=1,
40 |                  step_dependent_dynamics=False,
41 |                  rollout_var_cost=0,
42 |                  rollout_var_discount=0.95,
43 |                  sample_null_action=False,
44 |                  smooth="no filter"):
45 |         """
46 |         :param dynamics: function(state, action) -> next_state (K x nx) taking in batch state (K x nx) and action (K x nu)
47 |         :param running_cost: function(state, action) -> cost (K x 1) taking in batch state and action (same as dynamics)
48 |         :param nx: state dimension
49 |         :param noise_sigma: (nu x nu) control noise covariance (assume v_t ~ N(u_t, noise_sigma))
50 |         :param num_samples: K, number of trajectories to sample
51 |         :param horizon: T, length of each trajectory
52 |         :param device: pytorch device
53 |         :param terminal_state_cost: function(state) -> cost (K x 1) taking in batch state
54 |         :param lambda_: temperature, positive scalar where larger values will allow more exploration
55 |         :param noise_mu: (nu) control noise mean (used to bias control samples); defaults to zero mean
56 |         :param u_min: (nu) minimum values for each dimension of control to pass into dynamics
57 |         :param u_max: (nu) maximum values for each dimension of control to pass into dynamics
58 |         :param u_init: (nu) what to initialize the new end of the trajectory control to be; defaults to zero
59 |         :param U_init: (T x nu) initial control sequence; defaults to noise
60 |         :param step_dependent_dynamics: whether the passed in dynamics needs the horizon step passed in (as 3rd arg)
61 |         :param rollout_var_cost: cost attached to the variance of costs across trajectory rollouts
62 |         :param rollout_var_discount: discount of variance cost over control horizon
63 |         :param sample_null_action: whether to explicitly sample a null action (bad for starting in a local minima)
64 |         :param smooth: smoothing method; one of "no filter", "smooth u" (filter the updated control sequence), or "smooth noise" (filter the weighted noise)
65 | 
66 |         * option
67 |         :param rollout_samples: M, number of state trajectories to rollout for each control trajectory
68 |             (should be 1 for deterministic dynamics and more for models that output a distribution)
69 |         """
70 | 
71 |         self.smooth_method = smooth
72 |         self.device = device
73 |         self.dtype = noise_sigma.dtype
74 |         self.K = num_samples  # N_SAMPLES
75 |         self.T = horizon  # TIMESTEPS
76 | 
77 |         # dimensions of state and control
78 |         self.nx = nx
79 |         self.nu = 1 if len(noise_sigma.shape) == 0 else noise_sigma.shape[0]
80 |         self.lambda_ = lambda_
81 |         self.gamma_ = gamma_
82 | 
83 |         if noise_mu is None:
84 |             noise_mu = torch.zeros(self.nu, dtype=self.dtype)
85 | 
86 |         if u_init is None:
87 |             u_init = torch.zeros_like(noise_mu)
88 | 
89 |         if U_init is None:
90 |             U_init = torch.zeros(self.T, self.nu).to(device)
91 | 
92 |         # handle 1D edge case
93 |         if self.nu == 1:
94 |             noise_mu = noise_mu.view(-1)
95 |             noise_sigma = noise_sigma.view(-1, 1)
96 | 
97 |         # bounds
98 |         self.u_min = u_min
99 |         self.u_max = u_max
100 |         self.u_scale = u_scale
101 |         self.u_per_command = u_per_command
102 |         # make sure that if either bound is specified, both are specified
103 |         if self.u_max is not None and self.u_min is None:
104 |             if not torch.is_tensor(self.u_max):
105 |                 self.u_max = torch.tensor(self.u_max)
106 |             self.u_min = -self.u_max
107 |         if self.u_min is not None and self.u_max is None:
108 |             if not torch.is_tensor(self.u_min):
109 |                 self.u_min = torch.tensor(self.u_min)
110 |             self.u_max = -self.u_min
111 |         if self.u_min is not None:
112 |             self.u_min = self.u_min.to(device=self.device)
113 |             self.u_max = self.u_max.to(device=self.device)
114 | 
115 |         self.noise_mu = noise_mu.to(self.device)
116 |         self.noise_sigma = noise_sigma.to(self.device)
117 |         self.noise_sigma_inv = torch.inverse(self.noise_sigma)
118 |         self.noise_dist = MultivariateNormal(
119 |             self.noise_mu, covariance_matrix=self.noise_sigma)
120 |         # T x nu control sequence
121 |         self.U = U_init
122 |         self.u_init = u_init.to(self.device)
123 | 
124 |         if self.U is None:
125 |             self.U = self.noise_dist.sample((self.T,))
126 |             self.U = torch.zeros_like(self.U)
127 | 
128 |         # last 5 applied controls, used to warm-start the Savitzky-Golay filter
129 |         self.U_history = torch.zeros_like(self.U)[:5]
130 | 
131 |         self.step_dependency = step_dependent_dynamics
132 |         self.F = dynamics
133 |         self.running_cost = running_cost
134 |         self.terminal_state_cost = terminal_state_cost
135 |         self.sample_null_action = sample_null_action
136 |         self.state = None
137 | 
138 |         # handling dynamics models that output a distribution (take multiple trajectory samples)
139 |         self.rollout_var_cost = rollout_var_cost
140 |         self.rollout_var_discount = rollout_var_discount
141 | 
142 |         # sampled results from last command
143 |         self.cost_total = None
144 |         self.cost_total_non_zero = None
145 |         self.omega = None
146 |         self.states = None
147 |         self.actions = None
148 | 
149 |     def _dynamics(self, state, u, t):
150 |         return self.F(state, u, t) if self.step_dependency else self.F(state, u)
151 | 
152 |     def _running_cost(self, state):
153 |         return self.running_cost(state)
154 | 
155 |     def command(self, state):
156 |         """
157 |         :param state: (nx) or (K x nx) current state, or samples of states (for propagating a distribution of states)
158 |         :returns action: (nu) best action
159 |         """
160 |         # shift command 1 time step
161 |         self.U = torch.roll(self.U, -1, dims=0)
162 |         self.U[-1] = self.u_init
163 | 
164 |         perturbed_action = self.action_sampling()
165 | 
166 |         cost_total, states = self._compute_batch_rollout_costs(
167 |             perturbed_action, state)
168 |         self.omega = self._compute_weighting(cost_total)
169 | 
170 |         weighted_noise = torch.sum(
171 |             self.omega.view(-1, 1, 1) * self.noise, dim=0)
172 | 
173 |         if self.smooth_method == "no filter":
174 |             self.U += weighted_noise
175 |         elif self.smooth_method == "smooth u":
176 |             self.U += weighted_noise
177 |             U_filtered = savgol_filter(
178 |                 torch.cat([self.U_history, self.U]).to('cpu'), 5, 3, axis=0)
179 |             self.U = torch.tensor(U_filtered[-self.T:]).to(self.device)
180 | 
181 |             self.U_history = torch.roll(self.U_history, -1, dims=0)
182 |             self.U_history[-1] = self.U[0]
183 |         elif self.smooth_method == "smooth noise":
184 |             self.U += torch.tensor(savgol_filter(weighted_noise.to('cpu'),
185 |                                                  5, 3, axis=0)).to(self.device)
186 |         else:
187 |             raise ValueError("unknown smooth option: {}".format(self.smooth_method))
188 | 
189 |         action = self.U[0]
190 | 
191 |         return action
192 | 
193 |     def reset(self):
194 |         """
195 |         Clear controller state after finishing a trial
196 |         """
197 |         self.U = torch.zeros_like(self.U)
198 | 
199 |     def _compute_weighting(self, cost_total):
200 |         beta = torch.min(cost_total)
201 |         cost_total_non_zero = torch.exp(-1/self.lambda_ * (cost_total - beta))
202 |         eta = torch.sum(cost_total_non_zero)
203 |         omega = (1. / eta) * cost_total_non_zero
204 |         return omega
205 | 
206 |     def _compute_batch_rollout_costs(self, perturbed_actions, state):
207 |         K, T, nu = perturbed_actions.shape
208 |         assert nu == self.nu
209 | 
210 |         cost_total = torch.zeros(K, device=self.device, dtype=self.dtype)
211 |         cost_samples = torch.zeros(K, device=self.device, dtype=self.dtype)
212 | 
213 |         # allow propagation of a sample of states (ex. to carry a distribution), or to start with a single state
214 |         # state -> nx
215 |         if state.shape == (K, self.nx):
216 |             pass  # already a batch of K states
217 |         else:
218 |             state = state.view(1, -1).repeat(K, 1)
219 |         # state -> K x nx
220 | 
221 |         states = []
222 |         actions = []
223 | 
224 | 
225 |         for t in range(T):
226 |             # perturbed_actions -> K*T*nu
227 |             # perturbed_actions[:, t] -> K*nu
228 |             u = self.u_scale * perturbed_actions[:, t]  # v -> K*nu
229 | 
230 |             state = self._dynamics(state, u, t)
231 |             c = self._running_cost(state)  # c -> K
232 | 
233 |             cost_samples += c  # cost_samples -> K
234 | 
235 |             # Save total states/actions
236 |             states.append(state)
237 |             actions.append(u)
238 | 
239 |         # actions -> [K*nu, K*nu ...] with size T
240 |         # torch.stack(actions, dim=-2) -> K*T*nu
241 |         actions = torch.stack(actions, dim=-2)
242 |         states = torch.stack(states, dim=-2)
243 | 
244 |         # terminal state cost
245 |         if self.terminal_state_cost:
246 |             phi = self.terminal_state_cost(states, actions)
247 |             cost_samples += phi
248 | 
249 |         action_cost = self.gamma_ * self.noise @ self.noise_sigma_inv
250 | 
251 |         # action_cost -> K*T*nu
252 |         # U -> T*nu
253 |         perturbation_cost = torch.sum(self.U * action_cost, dim=(1, 2))
254 | 
255 |         cost_total += cost_samples + perturbation_cost  # K dim
256 | 
257 |         return cost_total, states
258 | 
259 |     def _bound_action(self, action):
260 | 
261 |         return torch.max(torch.min(action, self.u_max), self.u_min)  # clamp to [u_min, u_max]
262 | 
263 |     def action_sampling(self):
264 |         # parallelize sampling across trajectories
265 |         # resample noise each time we take an action
266 | 
267 |         # a small portion of the samples are just Gaussian perturbations around zero
268 |         self.noise = self.noise_dist.sample(
269 |             (round(self.K*0.99), self.T))  # K*T*nu (noise_dist has nu-dim)
270 |         # broadcast own control to noise over samples; now it's K x T x nu
271 |         perturbed_action = self.U + self.noise
272 |         perturbed_action = torch.cat(
273 |             [perturbed_action, self.noise_dist.sample((round(self.K*0.01), self.T))])
274 | 
275 |         if self.sample_null_action:
276 |             perturbed_action[self.K - 1] = 0
277 | 
278 |         perturbed_action = self._bound_action(perturbed_action)
279 | 
280 |         # subtract U to recover the (now bounded) noise
281 |         self.noise = perturbed_action - self.U
282 | 
283 |         return perturbed_action
284 | 
285 | 
286 | def angle_normalize(x):
287 |     return (((x + math.pi) % (2 * math.pi)) - math.pi)
288 | 
289 | def run_mppi_episode(mppi, env, dataset_append, retrain_dynamics, cost, model_save, cost_tolerance, SUCCESS_CRITERION, retrain_after_iter=50, num_episode=30, render=True):
290 |     dataset_count = 0
291 |     cost_history = []
292 |     cost_ = 0.
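    # per episode: roll the env forward with MPPI commands, append each
    # (state, action, next_state) transition to the dataset, retrain the dynamics
    # model every `retrain_after_iter` collected samples, and stop once the running
    # cost stays below cost_tolerance for SUCCESS_CRITERION consecutive steps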
293 |     for ep in range(num_episode):
294 |         env.reset()
295 |         success_count = 0
296 |         cost_episode = []
297 | 
298 |         while True:
299 |             if render:
300 |                 env.render()
301 |             state = env.state
302 |             state = torch.tensor(state, dtype=mppi.noise_sigma.dtype).to(device=mppi.device)
303 |             command_start = time.perf_counter()
304 |             action = mppi.command(state)
305 |             elapsed = time.perf_counter() - command_start
306 |             s, _, done, _ = env.step(action.cpu().numpy())
307 |             next_state = env.state
308 |             next_state = torch.tensor(next_state, dtype=mppi.noise_sigma.dtype).to(device=mppi.device)
309 | 
310 |             # collect training data
311 |             dataset_append(state, action, next_state)
312 | 
313 |             logger.debug(
314 |                 "action taken: %.4f cost received: %.4f time taken: %.5fs", action, cost_, elapsed)
315 | 
316 |             dataset_count += 1
317 |             di = dataset_count % retrain_after_iter
318 |             if di == 0 and dataset_count > 0:
319 |                 retrain_dynamics()
320 | 
321 |             cost_ = cost(next_state.view(1, -1))
322 |             cost_episode.append(cost_.item())
323 | 
324 |             if cost_ < cost_tolerance:
325 |                 success_count += 1
326 |                 if success_count >= SUCCESS_CRITERION:
327 |                     print("Task completed")
328 |                     cost_history.append(cost_episode)
329 |                     model_save()
330 |                     return cost_history
331 |             else:
332 |                 success_count = 0
333 | 
334 |             if done:
335 |                 print("Episode {} terminated".format(ep + 1))
336 |                 break
337 |         wr.writerow(cost_episode)
338 |         cost_history.append(cost_episode)
339 |     return cost_history
--------------------------------------------------------------------------------
/pytorch_mppi/smooth_mppi.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import time
3 | import logging
4 | from torch.distributions.multivariate_normal import MultivariateNormal
5 | import numpy as np
6 | import math
7 | import csv
8 | 
9 | f = open('smppi.csv', 'a', encoding='utf-8', newline='')
10 | wr = csv.writer(f)
11 | 
12 | logger = logging.getLogger(__name__)
13 | 
14 | def is_tensor_like(x):
15 |     return torch.is_tensor(x) or type(x) is np.ndarray
16 | 
17 | 
18 | class MPPI():
19 |     """
20 |     Model Predictive Path Integral control
21 |     This implementation batch samples the trajectories and so scales well with the number of samples K.
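    This SMPPI variant samples noise in a lifted control space (the action derivative U) and integrates it into the action sequence, which yields smooth actions without an explicit filter.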
22 | 
23 |     Implemented according to algorithm 2 in Williams et al., 2017
24 |     'Information Theoretic MPC for Model-Based Reinforcement Learning',
25 |     based off of https://github.com/ferreirafabio/mppi_pendulum
26 |     """
27 | 
28 |     def __init__(self, dynamics, running_cost, nx, nu, noise_sigma, num_samples=100, horizon=15, device="cpu",
29 |                  terminal_state_cost=None,
30 |                  lambda_=1.,
31 |                  gamma_=1.,
32 |                  w_action_seq_cost=1.,
33 |                  noise_mu=None,
34 |                  u_min=None,
35 |                  u_max=None,
36 |                  action_min=None,
37 |                  action_max=None,
38 |                  u_init=None,
39 |                  U_init=None,
40 |                  u_scale=1,
41 |                  u_per_command=1,
42 |                  step_dependent_dynamics=False,
43 |                  rollout_var_cost=0,
44 |                  rollout_var_discount=0.95,
45 |                  sample_null_action=False):
46 |         """
47 |         :param dynamics: function(state, action) -> next_state (K x nx) taking in batch state (K x nx) and action (K x nu)
48 |         :param running_cost: function(state, action) -> cost (K x 1) taking in batch state and action (same as dynamics)
49 |         :param nx: state dimension
50 |         :param noise_sigma: (nu x nu) control noise covariance (assume v_t ~ N(u_t, noise_sigma))
51 |         :param num_samples: K, number of trajectories to sample
52 |         :param horizon: T, length of each trajectory
53 |         :param device: pytorch device
54 |         :param terminal_state_cost: function(state) -> cost (K x 1) taking in batch state
55 |         :param lambda_: temperature, positive scalar where larger values will allow more exploration
56 |         :param gamma_: running action cost parameter
57 |         :param w_action_seq_cost: (nu x nu) weight parameter for the action sequence cost
58 |         :param noise_mu: (nu) control noise mean (used to bias control samples); defaults to zero mean
59 |         :param u_min: (nu) minimum values for each dimension of control
60 |         :param u_max: (nu) maximum values for each dimension of control
61 |         :param action_min: (nu) minimum values for each dimension of action to pass into dynamics
62 |         :param action_max: (nu) maximum values for each dimension of action to pass into dynamics
63 |         :param u_init: (nu) what to initialize the new end of the trajectory control to be; defaults to zero
64 |         :param U_init: (T x nu) initial control sequence; defaults to noise
65 |         :param step_dependent_dynamics: whether the passed in dynamics needs the horizon step passed in (as 3rd arg)
66 |         :param rollout_var_cost: cost attached to the variance of costs across trajectory rollouts
67 |         :param rollout_var_discount: discount of variance cost over control horizon
68 |         :param sample_null_action: whether to explicitly sample a null action (bad for starting in a local minima)
69 | 
70 |         * option
71 |         :param rollout_samples: M, number of state trajectories to rollout for each control trajectory
72 |             (should be 1 for deterministic dynamics and more for models that output a distribution)
73 |         """
74 | 
75 |         self.device = device
76 |         self.dtype = noise_sigma.dtype
77 |         self.K = num_samples  # N_SAMPLES
78 |         self.T = horizon  # TIMESTEPS
79 | 
80 |         # dimensions of state and control
81 |         self.nx = nx
82 |         self.nu = 1 if len(noise_sigma.shape) == 0 else noise_sigma.shape[0]
83 |         self.lambda_ = lambda_
84 |         self.gamma_ = gamma_
85 | 
86 |         self.w_action_seq_cost = w_action_seq_cost
87 | 
88 |         if noise_mu is None:
89 |             noise_mu = torch.zeros(self.nu, dtype=self.dtype)
90 | 
91 |         if u_init is None:
92 |             u_init = torch.zeros_like(noise_mu)
93 | 
94 |         if U_init is None:
95 |             U_init = torch.zeros(self.T, self.nu).to(device)
96 | 
97 |         # handle 1D edge case
98 |         if self.nu == 1:
99 |             noise_mu = noise_mu.view(-1)
100 |             noise_sigma = noise_sigma.view(-1, 1)
101 | 
102 |         # bounds
        # bounds
        self.u_min = u_min
        self.u_max = u_max
        self.action_min = action_min
        self.action_max = action_max
        self.u_scale = u_scale
        self.u_per_command = u_per_command
        # if only one control bound is specified, mirror it to get a symmetric range
        if self.u_max is not None and self.u_min is None:
            if not torch.is_tensor(self.u_max):
                self.u_max = torch.tensor(self.u_max)
            self.u_min = -self.u_max
        if self.u_min is not None and self.u_max is None:
            if not torch.is_tensor(self.u_min):
                self.u_min = torch.tensor(self.u_min)
            self.u_max = -self.u_min
        if self.u_min is not None:
            self.u_min = self.u_min.to(device=self.device)
            self.u_max = self.u_max.to(device=self.device)
            self.action_min = self.action_min.to(device=self.device)
            self.action_max = self.action_max.to(device=self.device)

        self.noise_mu = noise_mu.to(self.device)
        self.noise_sigma = noise_sigma.to(self.device)
        self.noise_sigma_inv = torch.inverse(self.noise_sigma)
        self.noise_dist = MultivariateNormal(
            self.noise_mu, covariance_matrix=self.noise_sigma)
        # T x nu control sequence
        self.U = U_init
        # clone so that U and the action sequence do not alias the same tensor
        self.action_sequence = U_init.clone()
        self.u_init = u_init.to(self.device)

        self.step_dependency = step_dependent_dynamics
        self.F = dynamics
        self.running_cost = running_cost
        self.terminal_state_cost = terminal_state_cost
        self.sample_null_action = sample_null_action
        self.state = None

        # handling dynamics models that output a distribution (take multiple trajectory samples)
        self.rollout_var_cost = rollout_var_cost
        self.rollout_var_discount = rollout_var_discount

        # sampled results from last command
        self.cost_total = None
        self.cost_total_non_zero = None
        self.omega = None
        self.states = None
        self.actions = None

    def _dynamics(self, state, u, t):
        return self.F(state, u, t) if self.step_dependency else self.F(state, u)

    def _running_cost(self, state):
        return self.running_cost(state)

    def command(self, state):
        """
        :param state: (nx) or (K x nx) current state, or samples of states (for propagating a distribution of states)
        :returns action: (nu) best action
        """
        # shift command 1 time step
        self.U = torch.roll(self.U, -1, dims=0)
        self.U[-1] = self.u_init
        self.action_sequence = torch.roll(self.action_sequence, -1, dims=0)
        self.action_sequence[-1] = self.action_sequence[-2]  # repeat the T-1 action at the new tail step

        perturbed_action = self.noise_sampling()

        cost_total, states = self._compute_batch_rollout_costs(
            perturbed_action, state)
        self.omega = self._compute_weighting(cost_total)

        # update the derivative-action sequence with the importance-weighted
        # noise, then integrate it into the action sequence
        weighted_noise = torch.sum(
            self.omega.view(-1, 1, 1) * self.noise, dim=0)
        self.U += weighted_noise

        self.action_sequence += self.U

        action = self.action_sequence[0]

        return action

    def reset(self):
        """
        Clear controller state after finishing a trial
        """
        self.U = torch.zeros_like(self.U)
        self.action_sequence = torch.zeros_like(self.U)

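    # The weighting below is the standard MPPI softmax over rollout costs:
    #   omega_k = exp(-(S_k - beta) / lambda_) / eta, with beta = min_k S_k
    #   and eta = sum_k exp(-(S_k - beta) / lambda_).
    # Illustrative numbers (not executed): with lambda_ = 10 and costs
    # [5., 15.], the unnormalized weights are exp(0) = 1.0 and exp(-1) ~ 0.37,
    # giving omega ~ [0.73, 0.27] -- cheaper rollouts dominate the update.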
    def _compute_weighting(self, cost_total):
        beta = torch.min(cost_total)
        cost_total_non_zero = torch.exp(-1 / self.lambda_ * (cost_total - beta))
        eta = torch.sum(cost_total_non_zero)
        omega = (1. / eta) * cost_total_non_zero
        return omega

    def _compute_batch_rollout_costs(self, perturbed_actions, state):
        K, T, nu = perturbed_actions.shape
        assert nu == self.nu

        cost_samples = torch.zeros(K, device=self.device, dtype=self.dtype)

        # allow propagation of a sample of states (ex. to carry a distribution),
        # or start from a single state; either way, state -> K x nx
        if state.shape != (K, self.nx):
            state = state.view(1, -1).repeat(K, 1)

        states = []
        actions = []

        for t in range(T):
            # perturbed_actions -> K x T x nu, so perturbed_actions[:, t] -> K x nu
            action = self.u_scale * perturbed_actions[:, t]

            state = self._dynamics(state, action, t)
            c = self._running_cost(state)  # c -> K

            cost_samples += c

            # save rollout states/actions
            states.append(state)
            actions.append(action)

        # actions -> [K x nu, K x nu, ...] with length T
        # torch.stack(actions, dim=-2) -> K x T x nu
        actions = torch.stack(actions, dim=-2)
        states = torch.stack(states, dim=-2)

        # terminal state cost
        if self.terminal_state_cost:
            phi = self.terminal_state_cost(states, actions)
            cost_samples += phi

        control_cost = self.gamma_ * self.noise @ self.noise_sigma_inv

        # control_cost -> K x T x nu, U -> T x nu
        control_cost = torch.sum(self.U * control_cost, dim=(1, 2))

        # penalize differences between consecutive actions to encourage smoothness
        action_diff = self.u_scale * \
            (perturbed_actions[:, 1:] - perturbed_actions[:, :-1])
        action_sequence_cost = torch.sum(torch.square(action_diff), dim=(1, 2))
        action_sequence_cost *= self.w_action_seq_cost

        cost_total = cost_samples + control_cost + action_sequence_cost  # K dim

        return cost_total, states

    def _bound_d_action(self, control):
        # clamp the derivative-action (control) to [u_min, u_max]
        return torch.max(torch.min(control, self.u_max), self.u_min)

    def _bound_action(self, action):
        # clamp the integrated action to [action_min, action_max]
        return torch.max(torch.min(action, self.action_max), self.action_min)

    def noise_sampling(self):
        # parallelize sampling across trajectories;
        # resample noise each time we take an action
        n_biased = round(self.K * 0.99)
        self.noise = self.noise_dist.sample(
            (n_biased, self.T))  # n_biased x T x nu (noise_dist is nu-dimensional)
        # broadcast own control over the samples
        perturbed_control = self.U + self.noise
        # the remaining samples are pure Gaussian perturbations around zero,
        # so the total is exactly K x T x nu
        perturbed_control = torch.cat(
            [perturbed_control, self.noise_dist.sample((self.K - n_biased, self.T))])

        perturbed_control = self._bound_d_action(perturbed_control)

        perturbed_action = perturbed_control + self.action_sequence

        if self.sample_null_action:
            perturbed_action[self.K - 1] = 0

        perturbed_action = self._bound_action(perturbed_action)

        # subtract the action sequence and U to recover the doubly-bounded noise
        self.noise = perturbed_action - self.action_sequence - self.U

        return perturbed_action


def angle_normalize(x):
    # wrap an angle to [-pi, pi)
    return (((x + math.pi) % (2 * math.pi)) - math.pi)

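# Contract of the callbacks below (comment added for clarity):
#   dataset_append(state, action, next_state) -- store one transition
#   retrain_dynamics() -- refit the dynamics model; called every
#       retrain_after_iter collected transitions
#   cost(state) -- (1 x nx) state tensor -> scalar cost tensor, used both for
#       logging and for the success check against cost_tolerance
#   model_save() -- persist the learned model once the task is solved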
def run_mppi_episode(mppi, env, dataset_append, retrain_dynamics, cost, model_save, cost_tolerance,
                     SUCCESS_CRITERION, retrain_after_iter=50, num_episode=30, render=True):
    dataset_count = 0
    cost_history = []
    for ep in range(num_episode):
        env.reset()
        success_count = 0
        cost_episode = []

        while True:
            if render:
                env.render()
            state = env.state
            state = torch.tensor(state, dtype=mppi.noise_sigma.dtype).to(device=mppi.device)
            command_start = time.perf_counter()
            action = mppi.command(state)
            elapsed = time.perf_counter() - command_start
            # the full state is read from env.state rather than the returned observation
            _, _, done, _ = env.step(action.cpu().numpy())
            next_state = env.state
            next_state = torch.tensor(next_state, dtype=mppi.noise_sigma.dtype).to(device=mppi.device)

            # collect training data
            dataset_append(state, action, next_state)

            # compute the cost of the resulting state before logging it
            cost_ = cost(next_state.view(1, -1))
            cost_episode.append(cost_.item())

            logger.debug(
                "action taken: %.4f cost received: %.4f time taken: %.5fs", action, cost_, elapsed)

            dataset_count += 1
            if dataset_count % retrain_after_iter == 0:
                retrain_dynamics()

            if cost_ < cost_tolerance:
                success_count += 1
                if success_count >= SUCCESS_CRITERION:
                    print("Task completed")
                    cost_history.append(cost_episode)
                    model_save()
                    return cost_history
            else:
                success_count = 0

            if done:
                print("Episode {} terminated".format(ep + 1))
                break
        wr.writerow(cost_episode)
        cost_history.append(cost_episode)
    return cost_history
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup

setup(
    name='pytorch_mppi',
    version='0.1.0',
    packages=['pytorch_mppi'],
    url='https://github.com/ktk1501/smooth-mppi-pytorch',
    license='MIT',
    author='Taekyung Kim',
    author_email='ktk1501@kakao.com',
    description='Smooth Model Predictive Path Integral without Smoothing (SMPPI) implemented in pytorch',
    install_requires=[
        'torch',
        'numpy',
        'scipy'
    ],
    tests_require=[
        'gym'
    ]
)
--------------------------------------------------------------------------------