├── xuance ├── utils │ ├── __init__.py │ ├── layer.py │ ├── distribution.py │ ├── block.py │ ├── common.py │ └── memory.py ├── environment │ ├── custom_envs │ │ ├── __pycache__ │ │ │ └── dmc.cpython-39.pyc │ │ ├── atari.py │ │ └── dmc.py │ ├── __init__.py │ ├── wrappers.py │ ├── env_utils.py │ ├── normalizer.py │ ├── vectorize.py │ └── envpool_utils.py ├── representation │ ├── __init__.py │ └── network.py ├── agent │ ├── __init__.py │ ├── ddpg.py │ ├── td3.py │ ├── a2c.py │ ├── dqn.py │ └── ppo.py ├── policy │ ├── __init__.py │ ├── categorical.py │ ├── gaussian.py │ ├── dqn.py │ └── deterministic.py └── learner │ ├── __init__.py │ ├── dqn.py │ ├── ddqn.py │ ├── a2c.py │ ├── ddpg.py │ ├── td3.py │ └── ppo.py ├── docs ├── api │ ├── xuance.agent.rst │ ├── xuance.policy.rst │ ├── xuance.utils.rst │ ├── xuance.learner.rst │ ├── xuance.environment.rst │ └── xuance.representation.rst ├── tutorials │ ├── concept.rst │ ├── custom_env.rst │ ├── custom_loss.rst │ ├── logger.rst │ ├── multi_inputs.rst │ ├── custom_network.rst │ └── configuration.rst ├── Makefile ├── make.bat ├── conf.py └── index.rst ├── figures ├── Ant.png ├── Hopper.png ├── plotter.png ├── wandb_vis.png ├── halfcheetah.png ├── tensorboard.png ├── InvertedPendulum.png ├── mujoco_benchmark.png └── tensorboard_vis.png ├── .gitignore ├── config ├── a2c │ └── mujoco.yaml ├── ddqn │ └── atari.yaml ├── dqn │ └── atari.yaml ├── duelingdqn │ └── atari.yaml ├── ddpg │ └── mujoco.yaml ├── td3 │ └── mujoco.yaml └── ppo │ ├── mujoco.yaml │ └── walkerStand.yaml ├── .readthedocs.yaml ├── LICENSE.txt ├── setup.py ├── example ├── run_ddqn.py ├── run_dqn.py ├── run_dueldqn.py ├── run_ppo.py ├── run_td3.py ├── run_ddpg.py └── run_a2c.py ├── example_win ├── run_gym_ppo.py ├── test_dmc_ppo.py └── run_dmc_ppo.py └── README.md /xuance/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/api/xuance.agent.rst: -------------------------------------------------------------------------------- 1 | xuance.agent 2 | ================================== -------------------------------------------------------------------------------- /docs/api/xuance.policy.rst: -------------------------------------------------------------------------------- 1 | xuance.policy 2 | ================================== -------------------------------------------------------------------------------- /docs/api/xuance.utils.rst: -------------------------------------------------------------------------------- 1 | xuance.utils 2 | ================================== -------------------------------------------------------------------------------- /docs/api/xuance.learner.rst: -------------------------------------------------------------------------------- 1 | xuance.learner 2 | ================================== -------------------------------------------------------------------------------- /docs/api/xuance.environment.rst: -------------------------------------------------------------------------------- 1 | xuance.environment 2 | ================================== -------------------------------------------------------------------------------- /docs/tutorials/concept.rst: -------------------------------------------------------------------------------- 1 | Basic Concept in XuanCE 2 | ================================== -------------------------------------------------------------------------------- /docs/tutorials/custom_env.rst: 
-------------------------------------------------------------------------------- 1 | Custom Environment 2 | ================================== -------------------------------------------------------------------------------- /figures/Ant.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/figures/Ant.png -------------------------------------------------------------------------------- /docs/tutorials/custom_loss.rst: -------------------------------------------------------------------------------- 1 | Custom Loss Function 2 | ================================== -------------------------------------------------------------------------------- /docs/tutorials/logger.rst: -------------------------------------------------------------------------------- 1 | WandB and Tensorboard Logger 2 | ================================== -------------------------------------------------------------------------------- /figures/Hopper.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/figures/Hopper.png -------------------------------------------------------------------------------- /figures/plotter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/figures/plotter.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | __pycache__ 3 | logs/ 4 | models/ 5 | _build/ 6 | wandb/ 7 | xuance.egg-info -------------------------------------------------------------------------------- /docs/api/xuance.representation.rst: -------------------------------------------------------------------------------- 1 | xuance.representation 2 | ================================== -------------------------------------------------------------------------------- /docs/tutorials/multi_inputs.rst: -------------------------------------------------------------------------------- 1 | Multi-Input Observations 2 | ================================== -------------------------------------------------------------------------------- /figures/wandb_vis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/figures/wandb_vis.png -------------------------------------------------------------------------------- /docs/tutorials/custom_network.rst: -------------------------------------------------------------------------------- 1 | Custom Representation Network 2 | ================================== -------------------------------------------------------------------------------- /figures/halfcheetah.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/figures/halfcheetah.png -------------------------------------------------------------------------------- /figures/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/figures/tensorboard.png -------------------------------------------------------------------------------- /docs/tutorials/configuration.rst: -------------------------------------------------------------------------------- 1 | Explanations of Configuration Files
2 | ================================== -------------------------------------------------------------------------------- /figures/InvertedPendulum.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/figures/InvertedPendulum.png -------------------------------------------------------------------------------- /figures/mujoco_benchmark.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/figures/mujoco_benchmark.png -------------------------------------------------------------------------------- /figures/tensorboard_vis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/figures/tensorboard_vis.png -------------------------------------------------------------------------------- /xuance/environment/custom_envs/__pycache__/dmc.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wzcai99/XuanCE-Tiny/HEAD/xuance/environment/custom_envs/__pycache__/dmc.cpython-39.pyc -------------------------------------------------------------------------------- /xuance/representation/__init__.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | import numpy as np 6 | from torchvision.transforms import transforms 7 | from xuance.utils.block import * 8 | from .network import MLP,CNN 9 | -------------------------------------------------------------------------------- /config/a2c/mujoco.yaml: -------------------------------------------------------------------------------- 1 | algo_name: a2c 2 | env_name: Humanoid 3 | seed: 6782 4 | 5 | nenvs: 16 6 | nsize: 16 7 | nminibatch: 1 8 | nepoch: 1 9 | 10 | vf_coef: 0.25 11 | ent_coef: 0.00 12 | clipgrad_norm: 0.5 13 | lr_rate: 0.0007 14 | 15 | save_model_frequency: 1000 16 | train_steps: 62500 17 | evaluate_steps: 10000 18 | 19 | gamma: 0.99 20 | tdlam: 0.95 21 | 22 | logger: wandb 23 | logdir: "./logs/" 24 | modeldir: "./models/" -------------------------------------------------------------------------------- /config/ddqn/atari.yaml: -------------------------------------------------------------------------------- 1 | algo_name: double-dqn 2 | env_name: Pong 3 | seed: 1069 4 | 5 | nenvs: 4 6 | nsize: 25000 7 | minibatch: 256 8 | start_egreedy: 0.5 9 | end_egreedy: 0.1 10 | training_frequency: 4 11 | update_frequency: 25 12 | lr_rate: 0.001 13 | save_model_frequency: 100 14 | start_training_size: 1000 15 | train_steps: 25000 16 | evaluate_steps: 500 17 | gamma: 0.99 18 | logger: wandb 19 | logdir: "./logs/" 20 | modeldir: "./models/" -------------------------------------------------------------------------------- /config/dqn/atari.yaml: -------------------------------------------------------------------------------- 1 | algo_name: dqn 2 | env_name: Pong 3 | seed: 1069 4 | 5 | nenvs: 8 6 | nsize: 10000 7 | minibatch: 128 8 | start_egreedy: 0.5 9 | end_egreedy: 0.05 10 | training_frequency: 1 11 | update_frequency: 500 12 | lr_rate: 0.00025 13 | save_model_frequency: 5000 14 | start_training_size: 10000 15 | train_steps: 1250000 16 | evaluate_steps: 10000 17 | gamma: 0.99 18 | logger: wandb 19 | logdir: "./logs/" 20 | modeldir: "./models/" 
-------------------------------------------------------------------------------- /config/duelingdqn/atari.yaml: -------------------------------------------------------------------------------- 1 | algo_name: dueling-dqn 2 | env_name: Pong 3 | seed: 1069 4 | 5 | nenvs: 4 6 | nsize: 25000 7 | minibatch: 256 8 | start_egreedy: 0.5 9 | end_egreedy: 0.1 10 | training_frequency: 4 11 | update_frequency: 25 12 | lr_rate: 0.001 13 | save_model_frequency: 100 14 | start_training_size: 1000 15 | train_steps: 25000 16 | evaluate_steps: 500 17 | gamma: 0.99 18 | logger: wandb 19 | logdir: "./logs/" 20 | modeldir: "./models/" -------------------------------------------------------------------------------- /config/ddpg/mujoco.yaml: -------------------------------------------------------------------------------- 1 | algo_name: ddpg 2 | env_name: Walker2d 3 | seed: 19089 4 | 5 | nenvs: 4 6 | nsize: 50000 7 | minibatch: 256 8 | start_noise: 1.0 9 | end_noise: 0.01 10 | training_frequency: 1 11 | tau: 0.01 12 | actor_lr_rate: 0.001 13 | critic_lr_rate: 0.001 14 | save_model_frequency: 10000 15 | start_training_size: 10000 16 | train_steps: 250000 17 | evaluate_steps: 10000 18 | gamma: 0.99 19 | logger: tensorboard 20 | logdir: "./logs/" 21 | modeldir: "./models/" -------------------------------------------------------------------------------- /xuance/agent/__init__.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import wandb 5 | from xuance.utils.memory import * 6 | from xuance.utils.common import get_time_hm, get_time_full 7 | from torch.utils.tensorboard import SummaryWriter 8 | from tqdm import tqdm 9 | 10 | from .a2c import A2C_Agent 11 | from .ppo import PPO_Agent 12 | from .dqn import DQN_Agent 13 | from .ddpg import DDPG_Agent 14 | from .td3 import TD3_Agent 15 | 16 | -------------------------------------------------------------------------------- /config/td3/mujoco.yaml: -------------------------------------------------------------------------------- 1 | algo_name: td3 2 | env_name: Walker2d 3 | seed: 1929 4 | 5 | nenvs: 4 6 | nsize: 50000 7 | minibatch: 256 8 | start_noise: 0.5 9 | end_noise: 0.01 10 | training_frequency: 1 11 | tau: 0.01 12 | actor_lr_rate: 0.001 13 | actor_update_decay: 3 14 | critic_lr_rate: 0.001 15 | save_model_frequency: 10000 16 | start_training_size: 10000 17 | train_steps: 250000 18 | evaluate_steps: 10000 19 | gamma: 0.99 20 | logger: wandb 21 | logdir: "./logs/" 22 | modeldir: "./models/" -------------------------------------------------------------------------------- /config/ppo/mujoco.yaml: -------------------------------------------------------------------------------- 1 | algo_name: ppo 2 | env_name: LunarLander 3 | seed: 79811 4 | 5 | nenvs: 16 6 | nsize: 256 7 | nminibatch: 8 8 | nepoch: 16 9 | 10 | vf_coef: 0.25 11 | ent_coef: 0.00 12 | clipgrad_norm: 0.5 13 | clip_range: 0.20 14 | target_kl: 0.01 15 | lr_rate: 0.0007 16 | 17 | save_model_frequency: 1000 18 | train_steps: 62500 19 | evaluate_steps: 10000 20 | 21 | gamma: 0.99 22 | tdlam: 0.95 23 | 24 | logger: tensorboard # or wandb | tensorboard 25 | logdir: "logs/" 26 | modeldir: "models/" -------------------------------------------------------------------------------- /xuance/policy/__init__.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import gym.spaces 5 | import copy 6 | 
from xuance.utils.block import * 7 | from xuance.utils.distribution import CategoricalDistribution,DiagGaussianDistribution 8 | 9 | from .categorical import ActorCriticPolicy as Categorical_ActorCritic 10 | from .gaussian import ActorCriticPolicy as Gaussian_ActorCritic 11 | from .dqn import DQN_Policy,DuelDQN_Policy 12 | from .deterministic import DDPGPolicy,TD3Policy -------------------------------------------------------------------------------- /xuance/learner/__init__.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import time 5 | import copy 6 | import os 7 | import wandb 8 | import numpy as np 9 | from xuance.utils.common import create_directory 10 | from xuance.utils.common import get_time_hm, get_time_full 11 | from torch.utils.tensorboard import SummaryWriter 12 | 13 | from .a2c import A2C_Learner 14 | from .ppo import PPO_Learner 15 | 16 | from .dqn import DQN_Learner 17 | from .ddqn import DDQN_Learner 18 | 19 | from .ddpg import DDPG_Learner 20 | from .td3 import TD3_Learner -------------------------------------------------------------------------------- /xuance/environment/__init__.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import gym 3 | import copy 4 | import cv2 5 | from xuance.utils.common import * 6 | from abc import ABC, abstractmethod 7 | from xuance.utils.common import discount_cumsum 8 | from .env_utils import * 9 | from .custom_envs.dmc import DMControl 10 | from .custom_envs.atari import Atari 11 | from .vectorize import DummyVecEnv 12 | from .wrappers import BasicWrapper 13 | from .normalizer import RewardNorm,ObservationNorm,ActionNorm 14 | from .envpool_utils import EnvPool_Wrapper,EnvPool_ActionNorm,EnvPool_ObservationNorm,EnvPool_RewardNorm 15 | # from .custom_envs.mt import MT10_Env -------------------------------------------------------------------------------- /.readthedocs.yaml: -------------------------------------------------------------------------------- 1 | # Required 2 | version: 2 3 | 4 | # Set the version of Python and other tools you might need 5 | build: 6 | os: ubuntu-20.04 7 | tools: 8 | python: "3.9" 9 | # You can also specify other tool versions: 10 | # nodejs: "19" 11 | # rust: "1.64" 12 | # golang: "1.19" 13 | # Build documentation in the docs/ directory with Sphinx 14 | sphinx: 15 | configuration: docs/conf.py 16 | # If using Sphinx, optionally build your docs in additional formats such as PDF 17 | # formats: 18 | # - pdf 19 | # Optionally declare the Python requirements required to build your docs 20 | # python: 21 | # install: 22 | # - requirements: docs/requirements.txt -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line, and also 5 | # from the environment for the first two. 6 | SPHINXOPTS ?= 7 | SPHINXBUILD ?= sphinx-build 8 | SOURCEDIR = . 9 | BUILDDIR = _build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | -------------------------------------------------------------------------------- /config/ppo/walkerStand.yaml: -------------------------------------------------------------------------------- 1 | algo_name: ppo 2 | env_name: walkerStand0612-2 3 | seed: 79811 4 | 5 | nenvs: 64 6 | nsize: 512 7 | nminibatch: 16 8 | nepoch: 8 9 | 10 | mlp_hiddens: 128, 128 11 | 12 | vf_coef: 0.25 13 | ent_coef: 0.00 14 | clipgrad_norm: 0.5 15 | clip_range: 0.20 16 | target_kl: 0.025 # "generally found the approx_kl stays below 0.02, and if approx_kl becomes too high it usually means the policy is changing too quickly and there is a bug." 17 | lr_rate: 0.0004 # "In MuJoCo, the learning rate linearly decays from 3e-4 to 0" 18 | 19 | save_model_frequency: 1000 20 | train_steps: 1024000 # better set as (evaluate_steps * X) 21 | evaluate_steps: 10240 # better set as (nsize * X) 22 | 23 | gamma: 0.98 24 | tdlam: 0.95 25 | 26 | logger: tensorboard # or wandb | tensorboard 27 | logdir: "logs/" 28 | modeldir: "models/" -------------------------------------------------------------------------------- /docs/make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=. 11 | set BUILDDIR=_build 12 | 13 | %SPHINXBUILD% >NUL 2>NUL 14 | if errorlevel 9009 ( 15 | echo. 16 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 17 | echo.installed, then set the SPHINXBUILD environment variable to point 18 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 19 | echo.may add the Sphinx directory to PATH. 20 | echo. 21 | echo.If you don't have Sphinx installed, grab it from 22 | echo.https://www.sphinx-doc.org/ 23 | exit /b 1 24 | ) 25 | 26 | if "%1" == "" goto help 27 | 28 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 29 | goto end 30 | 31 | :help 32 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 33 | 34 | :end 35 | popd 36 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 xuance 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # Configuration file for the Sphinx documentation builder. 2 | # 3 | # For the full list of built-in configuration values, see the documentation: 4 | # https://www.sphinx-doc.org/en/master/usage/configuration.html 5 | 6 | # -- Project information ----------------------------------------------------- 7 | # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information 8 | 9 | project = 'xuance' 10 | copyright = '2023, wenzhe cai, wenzhang liu' 11 | author = 'wenzhe cai, wenzhang liu' 12 | release = '0.1.0' 13 | 14 | # -- General configuration --------------------------------------------------- 15 | # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration 16 | 17 | extensions = [] 18 | templates_path = ['_templates'] 19 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] 20 | 21 | # -- Options for HTML output ------------------------------------------------- 22 | # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output 23 | 24 | import sphinx_rtd_theme 25 | html_theme = "sphinx_rtd_theme" 26 | html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] 27 | html_static_path = ['_static'] 28 | 29 | from recommonmark.parser import CommonMarkParser 30 | source_parsers = { 31 | '.md': CommonMarkParser, 32 | } 33 | source_suffix = ['.rst', '.md'] -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup,find_packages 2 | import codecs 3 | setup( 4 | name='xuance', 5 | packages=find_packages(), 6 | version='0.1.0', 7 | description= 'xuance: a simple and clean deep reinforcement learning framework and implementations', 8 | long_description=codecs.open('README.md', 'r', encoding='utf-8').read(), 9 | long_description_content_type='text/markdown', 10 | license='MIT License', 11 | install_requires=[ 12 | 'gym==0.26.1', 13 | 'matplotlib>=3.7.1', 14 | 'opencv-python >= 4.7.0.72', 15 | 'pandas>=1.5.3', 16 | 'PyYAML>=6.0', 17 | 'scipy>=1.10.1', 18 | 'seaborn>=0.12.2', 19 | 'tensorboard>=2.12.0', 20 | 'torch>=1.12.1,<2.0.0', 21 | 'torchvision>=0.13.0,<0.14.0', 22 | 'torchaudio>=0.13.0,<0.14.0', 23 | 'tqdm>=4.65.0', 24 | 'mujoco==2.3.3', 25 | 'mujoco-py==2.1.2.14', 26 | 'free-mujoco-py==2.1.6', 27 | 'dm_control==1.0.11', 28 | 'ale-py==0.8.1', 29 | 'atari-py==0.3.0', 30 | 'attrs==21.2.0', 31 | 'AutoROM==0.4.2', 32 | 'terminaltables==3.1.10', 33 | 'AutoROM.accept-rom-license==0.4.2', 34 | 'envpool==0.8.2', 35 | 'wandb==0.15.1', 36 | 'moviepy==1.0.3', 37 | 'imageio==2.28.1', 38 | 'numpy==1.23.1', 39 | ] 40 | ) -------------------------------------------------------------------------------- /xuance/environment/wrappers.py: -------------------------------------------------------------------------------- 1 | from xuance.environment import * 2 | class BasicWrapper(gym.Wrapper): 3 | def __init__(self,env): 4 | super().__init__(env) 5 | self._observation_space = env.observation_space 6 | self._action_space = env.action_space 7 | self._reward_range = env._reward_range 8 | self._metadata = env._metadata 9 | if not isinstance(self.observation_space,gym.spaces.Dict): 10 | self._observation_space = gym.spaces.Dict({'observation':self._observation_space}) 11 | def reset(self): 12 | self.episode_length = 0 13 | 
self.episode_score = 0 14 | obs,info = super().reset() 15 | info['episode_length'] = self.episode_length 16 | info['episode_score'] = self.episode_score 17 | if isinstance(obs,dict): 18 | info['next_observation'] = obs 19 | return obs,info 20 | info['next_observation'] = {'observation':obs} 21 | return {'observation':obs},info 22 | def step(self,action): 23 | next_obs,reward,terminal,trunction,info = super().step(action) 24 | self.episode_length += 1 25 | self.episode_score += reward 26 | info['episode_length'] = self.episode_length 27 | info['episode_score'] = self.episode_score 28 | if isinstance(next_obs,dict): 29 | info['next_observation'] = next_obs 30 | return next_obs,reward,terminal,trunction,info 31 | info['next_observation'] = {'observation':next_obs} 32 | return {'observation':next_obs},reward,terminal,trunction,info 33 | 34 | 35 | 36 | -------------------------------------------------------------------------------- /xuance/utils/layer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import numpy as np 4 | from typing import Any, Dict, Optional, Sequence, Tuple, Type, Union, Callable 5 | ModuleType = Type[nn.Module] 6 | 7 | class NoisyLinear(nn.Linear): 8 | def __init__(self, 9 | input_features:int, 10 | output_features:int, 11 | sigma: float = 0.02, 12 | bias: bool = True, 13 | dtype: Any = None, 14 | device: Any = None): 15 | super().__init__(input_features,output_features,bias,device,dtype) 16 | sigma_init = sigma / np.sqrt(input_features) 17 | self.sigma_weight = nn.Parameter(torch.ones((output_features,input_features),dtype=dtype,device=device)*sigma_init) 18 | self.register_buffer("epsilon_input", torch.zeros(1, input_features,dtype=dtype,device=device)) 19 | self.register_buffer("epsilon_output", torch.zeros(output_features, 1,dtype=dtype,device=device)) 20 | if bias: 21 | self.sigma_bias = nn.Parameter(torch.ones((output_features,),dtype=dtype,device=device)*sigma_init) 22 | 23 | def forward(self, input): 24 | bias = self.bias 25 | func = lambda x: torch.sign(x) * torch.sqrt(torch.abs(x)) 26 | with torch.no_grad(): 27 | torch.randn(self.epsilon_input.size(), out=self.epsilon_input) 28 | torch.randn(self.epsilon_output.size(), out=self.epsilon_output) 29 | eps_in = func(self.epsilon_input) 30 | eps_out = func(self.epsilon_output) 31 | noise_v = torch.mul(eps_in, eps_out).detach() 32 | if bias is not None: 33 | bias = bias + self.sigma_bias * eps_out.t() 34 | return nn.functional.linear(input, self.weight + self.sigma_weight * noise_v, bias) 35 | -------------------------------------------------------------------------------- /xuance/representation/network.py: -------------------------------------------------------------------------------- 1 | from xuance.representation import * 2 | class MLP(nn.Module): 3 | def __init__(self, 4 | input_shape, 5 | hidden_sizes, 6 | activation, 7 | initialize, 8 | device): 9 | super(MLP,self).__init__() 10 | self.device = device 11 | self.input_shape = input_shape 12 | self.output_shape = {'state':(hidden_sizes[-1],)} 13 | layers = [] 14 | input_shape = self.input_shape['observation'] 15 | for h in hidden_sizes: 16 | block,input_shape = mlp_block(input_shape[0],h,activation,initialize,device) 17 | layers.extend(block) 18 | self.model = nn.Sequential(*layers) 19 | def forward(self,observation: dict): 20 | tensor_observation = torch.as_tensor(observation['observation'],dtype=torch.float32,device=self.device) 21 | state = self.model(tensor_observation) 22 | return
{'state':state} 23 | 24 | class CNN(nn.Module): 25 | def __init__(self, 26 | input_shape, 27 | filters, 28 | kernels, 29 | strides, 30 | activation, 31 | initialize, 32 | device): 33 | super(CNN,self).__init__() 34 | self.device = device 35 | self.input_shape = input_shape 36 | layers = [] 37 | input_shape = self.input_shape['observation'] 38 | for f,k,s in zip(filters,kernels,strides): 39 | block,input_shape = cnn_block(input_shape,f,k,s,activation,initialize,device) 40 | layers.extend(block) 41 | layers.append(nn.Flatten()) 42 | self.output_shape = {'state':(np.prod(input_shape),)} 43 | self.model = nn.Sequential(*layers) 44 | 45 | def forward(self,observation: dict): 46 | tensor_observation = torch.as_tensor(observation['observation']/255.0,dtype=torch.float32,device=self.device) 47 | state = self.model(tensor_observation) 48 | return {'state':state} 49 | 50 | 51 | 52 | -------------------------------------------------------------------------------- /xuance/environment/env_utils.py: -------------------------------------------------------------------------------- 1 | from xuance.environment import * 2 | class Running_MeanStd: 3 | def __init__(self, 4 | shape:dict, 5 | epsilon=1e-4): 6 | assert isinstance(shape,dict) 7 | self.shape = shape 8 | self.mean = {key:np.zeros(shape[key],np.float32) for key in shape.keys()} 9 | self.var = {key:np.ones(shape[key],np.float32) for key in shape.keys()} 10 | self.count = {key:epsilon for key in shape.keys()} 11 | 12 | @property 13 | def std(self): 14 | return {key:np.sqrt(self.var[key]) for key in self.shape.keys()} 15 | 16 | def update(self,x): 17 | batch_means = {} 18 | batch_vars = {} 19 | batch_counts = {} 20 | for key in self.shape.keys(): 21 | if len(x[key].shape) == 1: 22 | batch_mean, batch_std, batch_count = np.mean(x[key][np.newaxis,:], axis=0), np.std(x[key][np.newaxis,:], axis=0), x[key][np.newaxis,:].shape[0] 23 | else: 24 | batch_mean, batch_std, batch_count = np.mean(x[key], axis=0), np.std(x[key], axis=0), x[key].shape[0] 25 | batch_means[key] = batch_mean 26 | batch_vars[key] = np.square(batch_std) 27 | batch_counts[key] = batch_count 28 | self.update_from_moments(batch_means, batch_vars, batch_counts) 29 | 30 | def update_from_moments(self,batch_mean,batch_var,batch_count): 31 | for key in self.shape.keys(): 32 | delta = batch_mean[key] - self.mean[key] 33 | tot_count = self.count[key] + batch_count[key] 34 | new_mean = self.mean[key] + delta * batch_count[key] / tot_count 35 | m_a = self.var[key] * (self.count[key]) 36 | m_b = batch_var[key] * (batch_count[key]) 37 | M2 = m_a + m_b + np.square(delta) * self.count[key] * batch_count[key] / (self.count[key] + batch_count[key]) 38 | new_var = M2 / (self.count[key] + batch_count[key]) 39 | new_count = batch_count[key] + self.count[key] 40 | self.mean[key] = new_mean 41 | self.var[key] = new_var 42 | self.count[key] = new_count 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /example/run_ddqn.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import argparse 5 | import envpool 6 | import numpy as np 7 | import random 8 | from xuance.utils.common import space2shape,get_config 9 | from xuance.environment import BasicWrapper,DummyVecEnv,Atari 10 | from xuance.environment import EnvPool_Wrapper,EnvPool_RewardNorm,EnvPool_ActionNorm,EnvPool_ObservationNorm 11 | from 
xuance.representation import CNN 12 | from xuance.policy import DQN_Policy 13 | from xuance.learner import DDQN_Learner 14 | from xuance.agent import DQN_Agent 15 | 16 | def get_args(): 17 | parser = argparse.ArgumentParser() 18 | parser.add_argument("--device",type=str,default="cuda:0") 19 | parser.add_argument("--config",type=str,default="./config/ddqn/") 20 | parser.add_argument("--domain",type=str,default="atari") 21 | parser.add_argument("--pretrain_weight",type=str,default=None) 22 | parser.add_argument("--render",type=bool,default=False) 23 | args = parser.parse_known_args()[0] 24 | return args 25 | 26 | def set_seed(seed): 27 | torch.manual_seed(seed) 28 | torch.cuda.manual_seed(seed) 29 | torch.cuda.manual_seed_all(seed) 30 | np.random.seed(seed) 31 | random.seed(seed) 32 | 33 | if __name__ == "__main__": 34 | args = get_args() 35 | device = args.device 36 | config = get_config(args.config,args.domain) 37 | set_seed(config.seed) 38 | 39 | #define a envpool environment 40 | train_envs = envpool.make("Pong-v5","gym",num_envs=config.nenvs) 41 | train_envs = EnvPool_Wrapper(train_envs) 42 | 43 | representation = CNN(space2shape(train_envs.observation_space),(16,16,32,32),(8,6,4,4),(2,2,2,2),nn.LeakyReLU,nn.init.orthogonal_,device) 44 | policy = DQN_Policy(train_envs.action_space,representation,nn.init.orthogonal_,device) 45 | if args.pretrain_weight: 46 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 47 | optimizer = torch.optim.Adam(policy.parameters(),config.lr_rate,eps=1e-5) 48 | scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1,total_iters=config.train_steps/config.training_frequency) 49 | learner = DDQN_Learner(config,policy,optimizer,scheduler,device) 50 | agent = DQN_Agent(config,train_envs,policy,learner) 51 | 52 | def build_env_fn(): 53 | test_envs = [BasicWrapper(Atari("PongNoFrameskip-v4",render_mode="rgb_array")) for _ in range(1)] 54 | test_envs = DummyVecEnv(test_envs) 55 | return test_envs 56 | test_envs = build_env_fn() 57 | agent.benchmark(test_envs,config.train_steps,config.evaluate_steps,0,render=args.render) 58 | 59 | 60 | 61 | 62 | -------------------------------------------------------------------------------- /example/run_dqn.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import argparse 5 | import envpool 6 | import numpy as np 7 | import random 8 | from xuance.utils.common import space2shape,get_config 9 | from xuance.environment import BasicWrapper,DummyVecEnv,Atari 10 | from xuance.environment import EnvPool_Wrapper,EnvPool_RewardNorm,EnvPool_ActionNorm,EnvPool_ObservationNorm 11 | from xuance.representation import CNN 12 | from xuance.policy import DQN_Policy 13 | from xuance.learner import DQN_Learner 14 | from xuance.agent import DQN_Agent 15 | 16 | def get_args(): 17 | parser = argparse.ArgumentParser() 18 | parser.add_argument("--device",type=str,default="cuda:0") 19 | parser.add_argument("--config",type=str,default="./config/dqn/") 20 | parser.add_argument("--domain",type=str,default="atari") 21 | parser.add_argument("--pretrain_weight",type=str,default=None) 22 | parser.add_argument("--render",type=bool,default=False) 23 | args = parser.parse_known_args()[0] 24 | return args 25 | 26 | def set_seed(seed): 27 | torch.manual_seed(seed) 28 | torch.cuda.manual_seed(seed) 29 | torch.cuda.manual_seed_all(seed) 30 | np.random.seed(seed) 31 | random.seed(seed) 32 | 33 | 
if __name__ == "__main__": 34 | args = get_args() 35 | device = args.device 36 | config = get_config(args.config,args.domain) 37 | set_seed(config.seed) 38 | 39 | #define a envpool environment 40 | train_envs = envpool.make("Pong-v5","gym",num_envs=config.nenvs) 41 | train_envs = EnvPool_Wrapper(train_envs) 42 | 43 | representation = CNN(space2shape(train_envs.observation_space),(16,16,32,32),(8,6,4,4),(2,2,2,2),nn.LeakyReLU,nn.init.orthogonal_,device) 44 | policy = DQN_Policy(train_envs.action_space,representation,nn.init.orthogonal_,device) 45 | if args.pretrain_weight: 46 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 47 | optimizer = torch.optim.Adam(policy.parameters(),config.lr_rate,eps=1e-5) 48 | scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1,total_iters=config.train_steps/config.training_frequency) 49 | learner = DQN_Learner(config,policy,optimizer,scheduler,device) 50 | agent = DQN_Agent(config,train_envs,policy,learner) 51 | 52 | def build_env_fn(): 53 | test_envs = [BasicWrapper(Atari("PongNoFrameskip-v4",render_mode="rgb_array")) for _ in range(1)] 54 | test_envs = DummyVecEnv(test_envs) 55 | return test_envs 56 | test_envs = build_env_fn() 57 | agent.benchmark(test_envs,config.train_steps,config.evaluate_steps,0,render=args.render) 58 | 59 | 60 | 61 | 62 | -------------------------------------------------------------------------------- /example/run_dueldqn.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import argparse 5 | import envpool 6 | import numpy as np 7 | import random 8 | from xuance.utils.common import space2shape,get_config 9 | from xuance.environment import BasicWrapper,DummyVecEnv,Atari 10 | from xuance.environment import EnvPool_Wrapper,EnvPool_RewardNorm,EnvPool_ActionNorm,EnvPool_ObservationNorm 11 | from xuance.representation import CNN 12 | from xuance.policy import DuelDQN_Policy 13 | from xuance.learner import DQN_Learner 14 | from xuance.agent import DQN_Agent 15 | 16 | def get_args(): 17 | parser = argparse.ArgumentParser() 18 | parser.add_argument("--device",type=str,default="cuda:0") 19 | parser.add_argument("--config",type=str,default="./config/dueldqn/") 20 | parser.add_argument("--domain",type=str,default="atari") 21 | parser.add_argument("--pretrain_weight",type=str,default=None) 22 | parser.add_argument("--render",type=bool,default=False) 23 | args = parser.parse_known_args()[0] 24 | return args 25 | 26 | def set_seed(seed): 27 | torch.manual_seed(seed) 28 | torch.cuda.manual_seed(seed) 29 | torch.cuda.manual_seed_all(seed) 30 | np.random.seed(seed) 31 | random.seed(seed) 32 | 33 | if __name__ == "__main__": 34 | args = get_args() 35 | device = args.device 36 | config = get_config(args.config,args.domain) 37 | set_seed(config.seed) 38 | 39 | #define a envpool environment 40 | train_envs = envpool.make("Pong-v5","gym",num_envs=config.nenvs) 41 | train_envs = EnvPool_Wrapper(train_envs) 42 | 43 | representation = CNN(space2shape(train_envs.observation_space),(16,16,32,32),(8,6,4,4),(2,2,2,2),nn.LeakyReLU,nn.init.orthogonal_,device) 44 | policy = DuelDQN_Policy(train_envs.action_space,representation,nn.init.orthogonal_,device) 45 | if args.pretrain_weight: 46 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 47 | optimizer = torch.optim.Adam(policy.parameters(),config.lr_rate,eps=1e-5) 48 | scheduler = 
torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1,total_iters=config.train_steps/config.training_frequency) 49 | learner = DQN_Learner(config,policy,optimizer,scheduler,device) 50 | agent = DQN_Agent(config,train_envs,policy,learner) 51 | 52 | def build_env_fn(): 53 | test_envs = [BasicWrapper(Atari("PongNoFrameskip-v4",render_mode="rgb_array")) for _ in range(1)] 54 | test_envs = DummyVecEnv(test_envs) 55 | return test_envs 56 | test_envs = build_env_fn() 57 | agent.benchmark(test_envs,config.train_steps,config.evaluate_steps,0,render=args.render) 58 | 59 | 60 | 61 | 62 | -------------------------------------------------------------------------------- /xuance/policy/categorical.py: -------------------------------------------------------------------------------- 1 | from xuance.policy import * 2 | class ActorNet(nn.Module): 3 | def __init__(self, 4 | state_dim:int, 5 | action_dim:int, 6 | initialize, 7 | device 8 | ): 9 | super(ActorNet,self).__init__() 10 | self.device = device 11 | self.action_dim = action_dim 12 | self.model = nn.Sequential(*mlp_block(state_dim,action_dim,None,initialize,device)[0]) 13 | self.distribution = CategoricalDistribution(action_dim) 14 | self.output_shape = self.distribution.params_shape 15 | def forward(self,x:torch.Tensor): 16 | distribution = CategoricalDistribution(self.action_dim) 17 | distribution.set_param(logits=self.model(x)) 18 | return distribution.get_param(),distribution 19 | 20 | class CriticNet(nn.Module): 21 | def __init__(self, 22 | state_dim:int, 23 | initialize, 24 | device 25 | ): 26 | super(CriticNet,self).__init__() 27 | self.device = device 28 | self.model = nn.Sequential(*mlp_block(state_dim,1,None,initialize,device)[0]) 29 | self.output_shape = {'critic':()} 30 | def forward(self,x:torch.Tensor): 31 | return self.model(x).squeeze(dim=-1) 32 | 33 | class ActorCriticPolicy(nn.Module): 34 | def __init__(self, 35 | action_space:gym.spaces.Discrete, 36 | representation:torch.nn.Module, 37 | initialize, 38 | device): 39 | assert isinstance(action_space, gym.spaces.Discrete) 40 | super(ActorCriticPolicy,self).__init__() 41 | self.action_space = action_space 42 | self.representation = representation 43 | self.input_shape = self.representation.input_shape.copy() 44 | self.output_shape = self.representation.output_shape.copy() 45 | self.actor = ActorNet(self.representation.output_shape['state'][0], 46 | self.action_space.n, 47 | initialize, 48 | device) 49 | self.critic = CriticNet(self.representation.output_shape['state'][0], 50 | initialize, 51 | device) 52 | for key,value in zip(self.actor.output_shape.keys(),self.actor.output_shape.values()): 53 | self.output_shape[key] = value 54 | self.output_shape['critic'] = () 55 | 56 | def forward(self,observation:dict): 57 | outputs = self.representation(observation) 58 | a_param,a = self.actor(outputs['state']) 59 | v = self.critic(outputs['state']) 60 | for key in self.actor.output_shape.keys(): 61 | outputs[key] = a_param[key] 62 | outputs['critic'] = v 63 | return outputs,a,v 64 | -------------------------------------------------------------------------------- /xuance/utils/distribution.py: -------------------------------------------------------------------------------- 1 | from abc import abstractmethod 2 | import torch 3 | from torch.distributions import Categorical,Normal 4 | 5 | class Distribution: 6 | def __init__(self): 7 | self.params_shape = None 8 | self.params = {} 9 | self.distribution = None 10 | def set_param(self,**kwargs): 11 | for key,value in 
zip(kwargs.keys(),kwargs.values()): 12 | self.params[key] = value 13 | def get_param(self): 14 | return self.params 15 | def get_distribution(self): 16 | return self.distribution 17 | def logprob(self,x:torch.Tensor): 18 | return self.distribution.log_prob(x) 19 | def entropy(self): 20 | return self.distribution.entropy() 21 | def sample(self): 22 | return self.distribution.sample() 23 | def deterministic(self): 24 | raise NotImplementedError 25 | def kl_divergence(self,other_pd): 26 | return torch.distributions.kl_divergence(self.distribution,other_pd.distribution) 27 | def detach(self): 28 | raise NotImplementedError 29 | 30 | class CategoricalDistribution(Distribution): 31 | def __init__(self,action_dim:int): 32 | super().__init__() 33 | self.params_shape = {'logits':(action_dim,)} 34 | def set_param(self,**kwargs): 35 | super().set_param(**kwargs) 36 | self.distribution = Categorical(logits=self.params['logits']) 37 | def detach(self): 38 | self.distribution.logits.detach() 39 | def deterministic(self): 40 | return self.params['logits'].argmax(dim=-1) 41 | 42 | class DiagGaussianDistribution(Distribution): 43 | def __init__(self,action_dim:int): 44 | super().__init__() 45 | self.params_shape = {'mu':(action_dim,),'std':(action_dim,)} 46 | def set_param(self, **kwargs): 47 | super().set_param(**kwargs) 48 | self.distribution = Normal(self.params['mu'],self.params['std']) 49 | def logprob(self,x): 50 | return super().logprob(x).sum(-1) 51 | def entropy(self): 52 | return super().entropy().sum(-1) 53 | def deterministic(self): 54 | return self.params['mu'] 55 | def detach(self): 56 | self.distribution.mean.detach() 57 | self.distribution.stddev.detach() 58 | 59 | # class MultiheadDiagGaussianDistribution(Distribution): 60 | # def __init__(self,action_dim:int,num_head:int): 61 | # super().__init__() 62 | # self.params_shape = {'mu':(num_head,action_dim,),'std':(num_head,action_dim,)} 63 | # def set_param(self, **kwargs): 64 | # super().set_param(**kwargs) 65 | # self.distribution = Normal(self.params['mu'],self.params['std']) 66 | # def logprob(self,x): 67 | # return super().logprob(x).sum(-1) 68 | # def entropy(self): 69 | # return super().entropy().sum(-1) 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | -------------------------------------------------------------------------------- /xuance/environment/custom_envs/atari.py: -------------------------------------------------------------------------------- 1 | IMAGE_SIZE = 84 2 | STACK_SIZE = 4 3 | IMAGE_CHANNEL = 1 4 | ACTION_REPEAT = 4 5 | 6 | import gym 7 | import cv2 8 | import numpy as np 9 | from gym.spaces import Space, Box, Discrete, Dict 10 | 11 | class Atari(gym.Env): 12 | def __init__(self,env_id,render_mode='rgb_array'): 13 | self.env = gym.make(env_id,render_mode=render_mode) 14 | self.observation_space = Box(0, 1, (IMAGE_CHANNEL * STACK_SIZE,IMAGE_SIZE, IMAGE_SIZE)) 15 | self.action_space = self.env.action_space 16 | self.render_mode = render_mode 17 | self._metadata = self.env._metadata 18 | self._reward_range = self.env._reward_range 19 | 20 | def _process_reset_image(self, image): 21 | resize_image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE)) 22 | if IMAGE_CHANNEL == 1: 23 | resize_image = cv2.cvtColor(resize_image, cv2.COLOR_BGR2GRAY) 24 | resize_image = np.expand_dims(resize_image, axis=2) 25 | self.stack_image = np.tile(resize_image, (1, 1, STACK_SIZE)) 26 | return self.stack_image.transpose(2,0,1) 27 | 28 | def _process_step_image(self, image): 29 | resize_image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE)) 30 | if 
IMAGE_CHANNEL == 1: 31 | resize_image = cv2.cvtColor(resize_image, cv2.COLOR_BGR2GRAY) 32 | resize_image = np.expand_dims(resize_image, axis=2) 33 | self.stack_image = np.concatenate((self.stack_image[:, :, IMAGE_CHANNEL:], resize_image), axis=2) # drop the oldest frame and append the newest 34 | return self.stack_image.transpose(2,0,1) 35 | 36 | # FireReset 37 | # NoOpReset 38 | def reset(self): 39 | obs,info = self.env.reset() 40 | self.lives = self.env.unwrapped.ale.lives() 41 | if self.env.unwrapped.get_action_meanings()[1] == 'FIRE': 42 | obs, _, done,trunction, _ = self.env.step(1) 43 | if done or trunction: 44 | obs,info = self.env.reset() 45 | noop = np.random.randint(0, 30) 46 | for i in range(noop): 47 | obs, _, done,trunction, _ = self.env.step(0) 48 | if done or trunction: 49 | obs,info = self.env.reset() 50 | break 51 | return self._process_reset_image(obs),info 52 | 53 | def step(self, action): 54 | cum_reward = 0 55 | last_image = np.zeros(self.env.observation_space.shape, np.uint8) 56 | for i in range(ACTION_REPEAT): 57 | obs, rew, done, trunction, info = self.env.step(action) 58 | cum_reward += rew 59 | concat_image = np.concatenate((np.expand_dims(last_image, axis=0), np.expand_dims(obs, axis=0)), axis=0) 60 | max_image = np.max(concat_image, axis=0) 61 | last_image = obs 62 | done = (done or (self.lives > self.env.unwrapped.ale.lives())) 63 | if done: 64 | break 65 | return self._process_step_image(max_image), cum_reward, done, trunction, info 66 | 67 | def render(self): 68 | return self.env.render() 69 | 70 | -------------------------------------------------------------------------------- /example_win/run_gym_ppo.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import argparse 5 | import gym 6 | # import envpool 7 | import numpy as np 8 | import random 9 | from xuance.utils.common import space2shape,get_config 10 | from xuance.environment import BasicWrapper,DummyVecEnv,RewardNorm,ObservationNorm,ActionNorm 11 | # from xuance.environment import EnvPool_Wrapper,EnvPool_ActionNorm,EnvPool_RewardNorm,EnvPool_ObservationNorm 12 | from xuance.representation import MLP 13 | from xuance.policy import Categorical_ActorCritic,Gaussian_ActorCritic 14 | from xuance.learner import PPO_Learner 15 | from xuance.agent import PPO_Agent 16 | 17 | def get_args(): 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument("--device",type=str,default="cuda:0") 20 | parser.add_argument("--config",type=str,default="config/ppo/") 21 | parser.add_argument("--domain",type=str,default="mujoco") # default: same config.yaml for env from the same domain 22 | parser.add_argument("--env_id",type=str,default="BipedalWalker-v3") # LunarLander-v2, BipedalWalker-v3, ...
23 | parser.add_argument("--pretrain_weight",type=str,default=None) 24 | parser.add_argument("--render",type=bool,default=True) 25 | args = parser.parse_known_args()[0] 26 | return args 27 | 28 | def set_seed(seed): 29 | torch.manual_seed(seed) 30 | torch.cuda.manual_seed(seed) 31 | torch.cuda.manual_seed_all(seed) 32 | np.random.seed(seed) 33 | random.seed(seed) 34 | 35 | if __name__ == "__main__": 36 | args = get_args() 37 | device = args.device 38 | config = get_config(args.config,args.domain) 39 | set_seed(config.seed) 40 | 41 | # in some cases, the training environment is different with the testing environment 42 | def build_train_envs(): 43 | env = gym.make(args.env_id, render_mode="rgb_array") 44 | envs = [BasicWrapper(env) for _ in range(config.nenvs)] 45 | return ObservationNorm(config, RewardNorm(config, ActionNorm(DummyVecEnv(envs)), train=True), train=True) 46 | 47 | def build_test_envs(): 48 | env = gym.make(args.env_id, render_mode="rgb_array") 49 | envs = [BasicWrapper(env) for _ in range(2)] 50 | return ObservationNorm(config, RewardNorm(config, ActionNorm(DummyVecEnv(envs)), train=False), train=False) 51 | 52 | train_envs = build_train_envs() 53 | 54 | representation = MLP(space2shape(train_envs.observation_space),(256,256),nn.LeakyReLU,nn.init.orthogonal_,device) 55 | policy = Gaussian_ActorCritic(train_envs.action_space,representation,nn.init.orthogonal_,device) 56 | if args.pretrain_weight: 57 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 58 | optimizer = torch.optim.Adam(policy.parameters(),config.lr_rate) 59 | scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.5,total_iters=config.train_steps/config.nsize * config.nepoch * config.nminibatch) 60 | learner = PPO_Learner(config,policy,optimizer,scheduler,device) 61 | agent = PPO_Agent(config,train_envs,policy,learner) 62 | 63 | agent.benchmark(build_test_envs(),config.train_steps,config.evaluate_steps,render=args.render) -------------------------------------------------------------------------------- /example_win/test_dmc_ppo.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | import argparse 6 | # import gym 7 | # import envpool 8 | import xuance.environment.custom_envs.dmc as dmc 9 | import numpy as np 10 | import random 11 | from xuance.utils.common import space2shape,get_config 12 | from xuance.environment import BasicWrapper,DummyVecEnv,RewardNorm,ObservationNorm,ActionNorm 13 | from xuance.representation import MLP 14 | from xuance.policy import Gaussian_ActorCritic 15 | 16 | 17 | def get_args(): 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument("--device",type=str,default="cuda:0") 20 | 21 | parser.add_argument("--config",type=str,default="config/ppo/") 22 | parser.add_argument("--domain",type=str,default="walkerStand") # default: same config.yaml for env from the same domain 23 | parser.add_argument("--task_id",type=str,default="walker") # walker, swimmer, ... 24 | parser.add_argument("--env_id",type=str,default="stand") # stand, walk, ... 
25 | parser.add_argument("--time_limit",type=int,default=150) 26 | 27 | parser.add_argument("--pretrain_weight",type=str,default=r"D:\zzm_codes\xuance_TneitapSimHand\models\walkerStand0612-2\ppo-79811\best_model.pth") 28 | 29 | parser.add_argument("--render",type=bool,default=True) 30 | 31 | args = parser.parse_known_args()[0] 32 | return args 33 | 34 | to_cpu = lambda tensor: tensor.detach().cpu().numpy() 35 | 36 | if __name__ == "__main__": 37 | 38 | args = get_args() 39 | device = args.device 40 | config = get_config(args.config,args.domain) 41 | 42 | def build_test_envs(): 43 | env = dmc.DMControl(args.task_id,args.env_id, args.time_limit) 44 | envs = [BasicWrapper(env) for _ in range(8)] 45 | envs = RewardNorm(config, ActionNorm(DummyVecEnv(envs)), train=False) 46 | envs.load_rms() # os.path.join(self.save_dir,"reward_stat.npy") 47 | envs = ObservationNorm(config, envs, train=False) 48 | envs.load_rms() 49 | return envs 50 | 51 | envs = build_test_envs() 52 | 53 | mlp_hiddens = tuple(map(int, config.mlp_hiddens.split(","))) 54 | representation = MLP(space2shape(envs.observation_space),mlp_hiddens,nn.LeakyReLU,nn.init.orthogonal_,device) 55 | policy = Gaussian_ActorCritic(envs.action_space,representation,nn.init.orthogonal_,device) 56 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 57 | print("use weights: ", args.pretrain_weight) 58 | policy.eval() 59 | 60 | obs,infos = envs.reset() # (nenvs, 24) 61 | 62 | test_episode = 1000 63 | current_episode = 0 64 | while current_episode < test_episode: 65 | print("[%03d]"%(current_episode)) 66 | envs.render("human") 67 | # obs_Tsor = torch.from_numpy(obs['observation']).float().to(policy.actor.device) 68 | _,act_Distrib,_ = policy.forward(obs) # (nenvs, 6) 69 | act_Tsor = act_Distrib.sample() 70 | next_obs,rewards,terminals,trunctions,infos = envs.step(to_cpu(act_Tsor)) 71 | for i in range(envs.num_envs): 72 | if terminals[i] == True or trunctions[i] == True: 73 | current_episode += 1 74 | obs = next_obs -------------------------------------------------------------------------------- /example/run_ppo.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import argparse 5 | import gym 6 | import envpool 7 | import numpy as np 8 | import random 9 | from xuance.utils.common import space2shape,get_config 10 | from xuance.environment import BasicWrapper,DummyVecEnv,RewardNorm,ObservationNorm,ActionNorm 11 | from xuance.environment import EnvPool_Wrapper,EnvPool_ActionNorm,EnvPool_RewardNorm,EnvPool_ObservationNorm 12 | from xuance.representation import MLP 13 | from xuance.policy import Categorical_ActorCritic,Gaussian_ActorCritic 14 | from xuance.learner import PPO_Learner 15 | from xuance.agent import PPO_Agent 16 | 17 | def get_args(): 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument("--device",type=str,default="cuda:0") 20 | parser.add_argument("--config",type=str,default="./config/ppo/") 21 | parser.add_argument("--domain",type=str,default="mujoco") 22 | parser.add_argument("--env_id",type=str,default="HalfCheetah-v4") 23 | parser.add_argument("--pretrain_weight",type=str,default=None) 24 | parser.add_argument("--render",type=bool,default=False) 25 | args = parser.parse_known_args()[0] 26 | return args 27 | 28 | def set_seed(seed): 29 | torch.manual_seed(seed) 30 | torch.cuda.manual_seed(seed) 31 | torch.cuda.manual_seed_all(seed) 32 | np.random.seed(seed) 33 | random.seed(seed) 34 | 35 
| if __name__ == "__main__": 36 | args = get_args() 37 | device = args.device 38 | config = get_config(args.config,args.domain) 39 | set_seed(config.seed) 40 | 41 | # define a envpool environment 42 | train_envs = envpool.make(args.env_id,"gym",num_envs=config.nenvs) 43 | train_envs = EnvPool_Wrapper(train_envs) 44 | train_envs = EnvPool_ActionNorm(train_envs) 45 | train_envs = EnvPool_RewardNorm(config,train_envs) 46 | train_envs = EnvPool_ObservationNorm(config,train_envs) 47 | representation = MLP(space2shape(train_envs.observation_space),(256,256),nn.LeakyReLU,nn.init.orthogonal_,device) 48 | policy = Gaussian_ActorCritic(train_envs.action_space,representation,nn.init.orthogonal_,device) 49 | if args.pretrain_weight: 50 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 51 | optimizer = torch.optim.Adam(policy.parameters(),config.lr_rate) 52 | scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.5,total_iters=config.train_steps/config.nsize * config.nepoch * config.nminibatch) 53 | learner = PPO_Learner(config,policy,optimizer,scheduler,device) 54 | agent = PPO_Agent(config,train_envs,policy,learner) 55 | 56 | # in many cases, the training environment is different with the testing environment 57 | def build_test_env(): 58 | test_envs = [BasicWrapper(gym.make(args.env_id,render_mode="rgb_array")) for _ in range(1)] 59 | test_envs = DummyVecEnv(test_envs) 60 | test_envs = ActionNorm(test_envs) 61 | test_envs = RewardNorm(config,test_envs,train=False) 62 | test_envs = ObservationNorm(config,test_envs,train=False) 63 | return test_envs 64 | test_envs = build_test_env() 65 | agent.benchmark(test_envs,config.train_steps,config.evaluate_steps,render=args.render) 66 | 67 | -------------------------------------------------------------------------------- /example/run_td3.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import argparse 5 | import gym 6 | import envpool 7 | import numpy as np 8 | import random 9 | from xuance.utils.common import space2shape,get_config 10 | from xuance.environment import BasicWrapper,ActionNorm,DummyVecEnv 11 | from xuance.environment import EnvPool_Wrapper,EnvPool_RewardNorm,EnvPool_ActionNorm,EnvPool_ObservationNorm 12 | from xuance.representation import MLP 13 | from xuance.policy import TD3Policy 14 | from xuance.learner import TD3_Learner 15 | from xuance.agent import TD3_Agent 16 | 17 | def get_args(): 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument("--device",type=str,default="cuda:0") 20 | parser.add_argument("--config",type=str,default="./config/td3/") 21 | parser.add_argument("--domain",type=str,default="mujoco") 22 | parser.add_argument("--env_id",type=str,default="HalfCheetah-v4") 23 | parser.add_argument("--pretrain_weight",type=str,default=None) 24 | parser.add_argument("--render",type=bool,default=False) 25 | args = parser.parse_known_args()[0] 26 | return args 27 | 28 | def set_seed(seed): 29 | torch.manual_seed(seed) 30 | torch.cuda.manual_seed(seed) 31 | torch.cuda.manual_seed_all(seed) 32 | np.random.seed(seed) 33 | random.seed(seed) 34 | 35 | if __name__ == "__main__": 36 | args = get_args() 37 | device = args.device 38 | config = get_config(args.config,args.domain) 39 | set_seed(config.seed) 40 | 41 | # define a envpool environment 42 | train_envs = envpool.make(args.env_id,"gym",num_envs=config.nenvs) 43 | train_envs = EnvPool_Wrapper(train_envs) 
44 | train_envs = EnvPool_ActionNorm(train_envs) 45 | 46 | representation = MLP(space2shape(train_envs.observation_space),(256,),nn.LeakyReLU,nn.init.xavier_uniform_,device) 47 | policy = TD3Policy(train_envs.action_space,representation,nn.init.xavier_uniform_,device) 48 | if args.pretrain_weight: 49 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 50 | actor_optimizer = torch.optim.Adam(policy.actor_parameters,config.actor_lr_rate,eps=1e-5) 51 | critic_optimizer = torch.optim.Adam(policy.critic_parameters,config.critic_lr_rate,eps=1e-5) 52 | actor_scheduler = torch.optim.lr_scheduler.LinearLR(actor_optimizer, start_factor=1.0, end_factor=0.1,total_iters=config.train_steps/config.training_frequency) 53 | critic_scheduler = torch.optim.lr_scheduler.LinearLR(critic_optimizer, start_factor=1.0, end_factor=0.1,total_iters=config.train_steps/config.training_frequency) 54 | learner = TD3_Learner(config,policy,[actor_optimizer,critic_optimizer],[actor_scheduler,critic_scheduler],device) 55 | agent = TD3_Agent(config,train_envs,policy,learner) 56 | 57 | # in many cases, the training environment is different with the testing environment 58 | def build_test_env(): 59 | test_envs = [BasicWrapper(gym.make(args.env_id,render_mode='rgb_array')) for _ in range(1)] 60 | test_envs = DummyVecEnv(test_envs) 61 | test_envs = ActionNorm(test_envs) 62 | return test_envs 63 | test_envs = build_test_env() 64 | agent.benchmark(test_envs,config.train_steps,config.evaluate_steps,render=args.render) 65 | 66 | 67 | 68 | -------------------------------------------------------------------------------- /example/run_ddpg.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import argparse 5 | import gym 6 | import envpool 7 | import numpy as np 8 | import random 9 | from xuance.utils.common import space2shape,get_config 10 | from xuance.environment import BasicWrapper,ActionNorm,DummyVecEnv 11 | from xuance.environment import EnvPool_Wrapper,EnvPool_RewardNorm,EnvPool_ActionNorm,EnvPool_ObservationNorm 12 | from xuance.representation import MLP 13 | from xuance.policy import DDPGPolicy 14 | from xuance.learner import DDPG_Learner 15 | from xuance.agent import DDPG_Agent 16 | 17 | def get_args(): 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument("--device",type=str,default="cuda:0") 20 | parser.add_argument("--config",type=str,default="./config/ddpg/") 21 | parser.add_argument("--domain",type=str,default="mujoco") 22 | parser.add_argument("--env_id",type=str,default="HalfCheetah-v4") 23 | parser.add_argument("--pretrain_weight",type=str,default=None) 24 | parser.add_argument("--render",type=bool,default=False) 25 | args = parser.parse_known_args()[0] 26 | return args 27 | 28 | def set_seed(seed): 29 | torch.manual_seed(seed) 30 | torch.cuda.manual_seed(seed) 31 | torch.cuda.manual_seed_all(seed) 32 | np.random.seed(seed) 33 | random.seed(seed) 34 | 35 | if __name__ == "__main__": 36 | args = get_args() 37 | device = args.device 38 | config = get_config(args.config,args.domain) 39 | set_seed(config.seed) 40 | 41 | # define a envpool environment 42 | train_envs = envpool.make(args.env_id,"gym",num_envs=config.nenvs) 43 | train_envs = EnvPool_Wrapper(train_envs) 44 | train_envs = EnvPool_ActionNorm(train_envs) 45 | 46 | representation = MLP(space2shape(train_envs.observation_space),(256,),nn.LeakyReLU,nn.init.xavier_uniform_,device) 47 | policy = 
DDPGPolicy(train_envs.action_space,representation,nn.init.xavier_uniform_,device) 48 | if args.pretrain_weight: 49 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 50 | actor_optimizer = torch.optim.Adam(policy.actor_parameters,config.actor_lr_rate,eps=1e-5) 51 | critic_optimizer = torch.optim.Adam(policy.critic_parameters,config.critic_lr_rate,eps=1e-5) 52 | actor_scheduler = torch.optim.lr_scheduler.LinearLR(actor_optimizer, start_factor=1.0, end_factor=0.1,total_iters=config.train_steps/config.training_frequency) 53 | critic_scheduler = torch.optim.lr_scheduler.LinearLR(critic_optimizer, start_factor=1.0, end_factor=0.1,total_iters=config.train_steps/config.training_frequency) 54 | learner = DDPG_Learner(config,policy,[actor_optimizer,critic_optimizer],[actor_scheduler,critic_scheduler],device) 55 | agent = DDPG_Agent(config,train_envs,policy,learner) 56 | 57 | # in many cases, the training environment is different with the testing environment 58 | def build_test_env(): 59 | test_envs = [BasicWrapper(gym.make(args.env_id,render_mode='rgb_array')) for _ in range(1)] 60 | test_envs = DummyVecEnv(test_envs) 61 | test_envs = ActionNorm(test_envs) 62 | return test_envs 63 | test_envs = build_test_env() 64 | agent.benchmark(test_envs,config.train_steps,config.evaluate_steps,render=args.render) 65 | 66 | 67 | 68 | -------------------------------------------------------------------------------- /example/run_a2c.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import argparse 5 | import gym 6 | import numpy as np 7 | import envpool 8 | import random 9 | from xuance.utils.common import space2shape,get_config 10 | from xuance.environment import BasicWrapper,DummyVecEnv,RewardNorm,ObservationNorm,ActionNorm 11 | from xuance.environment import EnvPool_Wrapper,EnvPool_ActionNorm,EnvPool_RewardNorm,EnvPool_ObservationNorm 12 | from xuance.representation import MLP 13 | from xuance.policy import Categorical_ActorCritic,Gaussian_ActorCritic 14 | from xuance.learner import A2C_Learner 15 | from xuance.agent import A2C_Agent 16 | 17 | def get_args(): 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument("--device",type=str,default="cuda:0") 20 | parser.add_argument("--config",type=str,default="./config/a2c/") 21 | parser.add_argument("--domain",type=str,default="mujoco") 22 | parser.add_argument("--env_id",type=str,default="HalfCheetah-v4") 23 | parser.add_argument("--pretrain_weight",type=str,default=None) 24 | parser.add_argument("--render",type=bool,default=False) 25 | args = parser.parse_known_args()[0] 26 | return args 27 | 28 | def set_seed(seed): 29 | torch.manual_seed(seed) 30 | torch.cuda.manual_seed(seed) 31 | torch.cuda.manual_seed_all(seed) 32 | np.random.seed(seed) 33 | random.seed(seed) 34 | 35 | if __name__ == "__main__": 36 | args = get_args() 37 | device = args.device 38 | config = get_config(args.config,args.domain) 39 | set_seed(config.seed) 40 | 41 | # define the envpool environment 42 | train_envs = envpool.make(args.env_id,"gym",num_envs=config.nenvs) 43 | train_envs = EnvPool_Wrapper(train_envs) 44 | train_envs = EnvPool_ActionNorm(train_envs) 45 | train_envs = EnvPool_RewardNorm(config,train_envs,train=(args.pretrain_weight is None)) 46 | train_envs = EnvPool_ObservationNorm(config,train_envs,train=(args.pretrain_weight is None)) 47 | 48 | # network and training 49 | representation = 
MLP(space2shape(train_envs.observation_space),(256,256),nn.LeakyReLU,nn.init.orthogonal_,device) 50 | policy = Gaussian_ActorCritic(train_envs.action_space,representation,nn.init.orthogonal_,device) 51 | if args.pretrain_weight: 52 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 53 | optimizer = torch.optim.Adam(policy.parameters(),config.lr_rate) 54 | scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.5,total_iters=config.train_steps/config.nsize * config.nepoch * config.nminibatch) 55 | learner = A2C_Learner(config,policy,optimizer,scheduler,device) 56 | agent = A2C_Agent(config,train_envs,policy,learner) 57 | 58 | # in many cases, the training environment is different with the testing environment 59 | def build_test_env(): 60 | test_envs = [BasicWrapper(gym.make(args.env_id,render_mode="rgb_array")) for _ in range(1)] 61 | test_envs = DummyVecEnv(test_envs) 62 | test_envs = ActionNorm(test_envs) 63 | test_envs = RewardNorm(config,test_envs,train=False) 64 | test_envs = ObservationNorm(config,test_envs,train=False) 65 | return test_envs 66 | 67 | test_envs = build_test_env() 68 | agent.benchmark(test_envs,config.train_steps,config.evaluate_steps,render=args.render) 69 | 70 | 71 | 72 | 73 | -------------------------------------------------------------------------------- /xuance/environment/custom_envs/dmc.py: -------------------------------------------------------------------------------- 1 | from xuance.environment import * 2 | from dm_control import suite 3 | import gym.spaces as gym_spaces 4 | import cv2 5 | class DMControl(gym.Env): 6 | def __init__(self,domain_name="humanoid",task_name="stand",timelimit=100,render_mode='rgb_array'): 7 | self.domain_name = domain_name 8 | self.task_name = task_name 9 | self.env = suite.load(domain_name=domain_name,task_name=task_name) 10 | self.action_spec = self.env.action_spec() 11 | self.observation_spec = self.env.observation_spec() 12 | 13 | self.timelimit = timelimit 14 | self.render_mode = render_mode 15 | self.action_space = self.make_action_space(self.action_spec) 16 | self.observation_space = self.make_observation_space(self.observation_spec) 17 | self._metadata = {} 18 | self._reward_range = (-float("inf"), float("inf")) 19 | 20 | def make_observation_space(self,obs_spec): 21 | obs_shape_dim = 0 22 | for key,value in zip(obs_spec.keys(),obs_spec.values()): 23 | shape = value.shape 24 | if len(shape) == 0: 25 | obs_shape_dim += 1 26 | else: 27 | obs_shape_dim += shape[0] 28 | return gym_spaces.Box(-np.inf, np.inf,(obs_shape_dim,)) 29 | 30 | def make_action_space(self,act_spec): 31 | return gym_spaces.Box(act_spec.minimum.astype(np.float32),act_spec.maximum.astype(np.float32),act_spec.shape) 32 | 33 | def render(self): 34 | camera_frame0 = self.env.physics.render(camera_id=0, height=240, width=320) 35 | camera_frame1 = self.env.physics.render(camera_id=1, height=240, width=320) 36 | if self.render_mode == 'rgb_array': 37 | return np.concatenate((camera_frame0,camera_frame1),axis=0) 38 | elif self.render_mode == 'human': 39 | cv2.imshow("render_dmc",np.concatenate((camera_frame0,camera_frame1),axis=0)) 40 | cv2.waitKey(5) 41 | 42 | def make_observation(self,timestep_data): 43 | # return_observation = np.empty((0,),dtype=np.float32) 44 | # for key,value in zip(timestep_data.observation.keys(),timestep_data.observation.values()): 45 | # value = np.array([value],np.float32) 46 | # if len(value.shape) == 1: 47 | # 
return_observation=np.concatenate((return_observation,value),axis=-1) 48 | # elif len(value.shape) == 2: 49 | # return_observation=np.concatenate((return_observation,value[0]),axis=-1) 50 | # else: 51 | # raise NotImplementedError 52 | 53 | # return return_observation 54 | return np.concatenate([val.reshape(-1) for key, val in timestep_data.observation.items()], dtype=np.float32) 55 | 56 | def reset(self): 57 | self.episode_time = 0 58 | timestep_data = self.env.reset() 59 | info = {} 60 | return self.make_observation(timestep_data),info 61 | 62 | def step(self,action): 63 | timestep_data = self.env.step(action) 64 | next_obs = self.make_observation(timestep_data) 65 | reward = timestep_data.reward 66 | done = (self.episode_time >= self.timelimit) 67 | self.episode_time += 1 68 | trunction = (self.episode_time >= self.timelimit) 69 | info = {} 70 | return next_obs,reward,done,trunction,info 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | -------------------------------------------------------------------------------- /xuance/learner/dqn.py: -------------------------------------------------------------------------------- 1 | from xuance.learner import * 2 | class DQN_Learner: 3 | def __init__(self, 4 | config, 5 | policy, 6 | optimizer, 7 | scheduler, 8 | device): 9 | self.policy = policy 10 | self.optimizer = optimizer 11 | self.scheduler = scheduler 12 | self.device = device 13 | 14 | self.gamma = config.gamma 15 | self.update_frequency = config.update_frequency 16 | self.save_model_frequency = config.save_model_frequency 17 | self.iterations = 0 18 | 19 | self.logdir = os.path.join(config.logdir,config.env_name,config.algo_name+"-%d/"%config.seed) 20 | self.modeldir = os.path.join(config.modeldir,config.env_name,config.algo_name+"-%d/"%config.seed) 21 | self.logger = config.logger 22 | create_directory(self.modeldir) 23 | create_directory(self.logdir) 24 | if self.logger == 'wandb': 25 | self.summary = wandb.init(project="XuanCE", 26 | group=config.env_name, 27 | name=config.algo_name, 28 | config=wandb.helper.parse_config(vars(config), exclude=('logger','logdir','modeldir'))) 29 | wandb.define_metric("iterations") 30 | wandb.define_metric("train/*",step_metric="iterations") 31 | elif self.logger == 'tensorboard': 32 | self.summary = SummaryWriter(self.logdir) 33 | else: 34 | raise NotImplementedError 35 | 36 | def update(self,input_batch,action_batch,reward_batch,terminal_batch,next_input_batch): 37 | self.iterations += 1 38 | tensor_action_batch = torch.as_tensor(action_batch,device=self.device) 39 | tensor_reward_batch = torch.as_tensor(reward_batch,device=self.device) 40 | tensor_terminal_batch = torch.as_tensor(terminal_batch,device=self.device) 41 | 42 | _,evalQ,_ = self.policy(input_batch) 43 | _,_,targetQ = self.policy(next_input_batch) 44 | 45 | evalQ = (evalQ * F.one_hot(tensor_action_batch.long(),evalQ.shape[-1])).sum(-1) 46 | targetQ = targetQ.max(dim=-1).values 47 | targetQ = tensor_reward_batch + self.gamma*(1-tensor_terminal_batch)*targetQ 48 | 49 | loss = F.mse_loss(evalQ,targetQ.detach()) 50 | self.optimizer.zero_grad() 51 | loss.backward() 52 | self.optimizer.step() 53 | self.scheduler.step() 54 | 55 | if self.iterations % self.update_frequency == 0: 56 | self.policy.update_target() 57 | 58 | if self.logger == 'tensorboard': 59 | self.summary.add_scalar("Q-loss",loss.item(),self.iterations) 60 | self.summary.add_scalar("learning-rate",self.optimizer.state_dict()['param_groups'][0]['lr'],self.iterations) 61 | 
self.summary.add_scalar("value_function",evalQ.mean().item(),self.iterations) 62 | else: 63 | wandb.log({'train/Q-loss':loss.item(), 64 | "train/learning-rate":self.optimizer.state_dict()['param_groups'][0]['lr'], 65 | "train/value_function":evalQ.mean().item(), 66 | "iterations":self.iterations}) 67 | 68 | if self.iterations % self.save_model_frequency == 0: 69 | model_path = self.modeldir + "model-%s-%s.pth" % (get_time_full(), str(self.iterations)) 70 | torch.save(self.policy.state_dict(), model_path) 71 | 72 | -------------------------------------------------------------------------------- /example_win/run_dmc_ppo.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import argparse 5 | # import gym 6 | # import envpool 7 | import xuance.environment.custom_envs.dmc as dmc 8 | import numpy as np 9 | import random 10 | from xuance.utils.common import space2shape,get_config, summarize_ppo_config 11 | from xuance.environment import BasicWrapper,DummyVecEnv,RewardNorm,ObservationNorm,ActionNorm 12 | # from xuance.environment import EnvPool_Wrapper,EnvPool_ActionNorm,EnvPool_RewardNorm,EnvPool_ObservationNorm 13 | from xuance.representation import MLP 14 | from xuance.policy import Categorical_ActorCritic,Gaussian_ActorCritic 15 | from xuance.learner import PPO_Learner 16 | from xuance.agent import PPO_Agent 17 | 18 | def get_args(): 19 | parser = argparse.ArgumentParser() 20 | parser.add_argument("--device",type=str,default="cuda:0") 21 | parser.add_argument("--config",type=str,default="config/ppo/") 22 | parser.add_argument("--domain",type=str,default="walkerStand") # default: same config.yaml for env from the same domain 23 | parser.add_argument("--task_id",type=str,default="walker") # walker, swimmer, ... 24 | parser.add_argument("--env_id",type=str,default="stand") # stand, walk, ... 
25 | parser.add_argument("--time_limit",type=int,default=100) 26 | parser.add_argument("--pretrain_weight",type=str, default=None) 27 | parser.add_argument("--render",type=bool,default=False) 28 | args = parser.parse_known_args()[0] 29 | return args 30 | 31 | def set_seed(seed): 32 | torch.manual_seed(seed) 33 | torch.cuda.manual_seed(seed) 34 | torch.cuda.manual_seed_all(seed) 35 | np.random.seed(seed) 36 | random.seed(seed) 37 | 38 | if __name__ == "__main__": 39 | args = get_args() 40 | device = args.device 41 | config = get_config(args.config,args.domain) 42 | set_seed(config.seed) 43 | 44 | # in some cases, the training environment is different with the testing environment 45 | def build_train_envs(): 46 | env = dmc.DMControl(args.task_id,args.env_id, args.time_limit) 47 | envs = [BasicWrapper(env) for _ in range(config.nenvs)] 48 | rms_need_train = False if args.pretrain_weight else True 49 | return ObservationNorm(config, RewardNorm(config, ActionNorm(DummyVecEnv(envs)), train=rms_need_train), train=rms_need_train) 50 | 51 | def build_test_envs(): 52 | env = dmc.DMControl(args.task_id,args.env_id, args.time_limit) 53 | envs = [BasicWrapper(env) for _ in range(2)] 54 | return ObservationNorm(config, RewardNorm(config, ActionNorm(DummyVecEnv(envs)), train=False), train=False) 55 | 56 | train_envs = build_train_envs() 57 | mlp_hiddens = tuple(map(int, config.mlp_hiddens.split(","))) 58 | representation = MLP(space2shape(train_envs.observation_space),mlp_hiddens,nn.LeakyReLU,nn.init.orthogonal_,device) 59 | policy = Gaussian_ActorCritic(train_envs.action_space,representation,nn.init.orthogonal_,device) 60 | if args.pretrain_weight: 61 | policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device)) 62 | 63 | #eps: follow 'ICLR:ppo-implementation-details(3)' 64 | optimizer = torch.optim.Adam(policy.parameters(),config.lr_rate) 65 | scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=config.train_steps/config.nsize * config.nepoch * config.nminibatch) 66 | learner = PPO_Learner(config,policy,optimizer,scheduler,device) 67 | agent = PPO_Agent(config,train_envs,policy,learner) 68 | 69 | summarize_ppo_config(config) 70 | agent.benchmark(build_test_envs(),config.train_steps,config.evaluate_steps,render=args.render) -------------------------------------------------------------------------------- /xuance/learner/ddqn.py: -------------------------------------------------------------------------------- 1 | from xuance.learner import * 2 | class DDQN_Learner: 3 | def __init__(self, 4 | config, 5 | policy, 6 | optimizer, 7 | scheduler, 8 | device): 9 | self.policy = policy 10 | self.optimizer = optimizer 11 | self.scheduler = scheduler 12 | self.device = device 13 | self.gamma = config.gamma 14 | self.update_frequency = config.update_frequency 15 | self.save_model_frequency = config.save_model_frequency 16 | self.iterations = 0 17 | 18 | self.logdir = os.path.join(config.logdir,config.env_name,config.algo_name+"-%d/"%config.seed) 19 | self.modeldir = os.path.join(config.modeldir,config.env_name,config.algo_name+"-%d/"%config.seed) 20 | self.logger = config.logger 21 | create_directory(self.modeldir) 22 | create_directory(self.logdir) 23 | if self.logger == 'wandb': 24 | self.summary = wandb.init(project="XuanCE", 25 | group=config.env_name, 26 | name=config.algo_name, 27 | config=wandb.helper.parse_config(vars(config), exclude=('logger','logdir','modeldir'))) 28 | wandb.define_metric("iterations") 29 | 
wandb.define_metric("train/*",step_metric="iterations") 30 | elif self.logger == 'tensorboard': 31 | self.summary = SummaryWriter(self.logdir) 32 | else: 33 | raise NotImplementedError 34 | 35 | def update(self,input_batch,action_batch,reward_batch,terminal_batch,next_input_batch): 36 | self.iterations += 1 37 | tensor_action_batch = torch.as_tensor(action_batch,device=self.device) 38 | tensor_reward_batch = torch.as_tensor(reward_batch,device=self.device) 39 | tensor_terminal_batch = torch.as_tensor(terminal_batch,device=self.device) 40 | 41 | _,evalQ,_ = self.policy(input_batch) 42 | _,_,targetQ = self.policy(next_input_batch) 43 | _,next_evalQ,_ = self.policy(next_input_batch) 44 | 45 | evalQ = (evalQ * F.one_hot(tensor_action_batch.long(),evalQ.shape[-1])).sum(-1) 46 | targetA = next_evalQ.argmax(dim=-1) 47 | targetQ = (targetQ * F.one_hot(targetA.long(),targetQ.shape[-1])).sum(-1) 48 | targetQ = tensor_reward_batch + self.gamma*(1-tensor_terminal_batch)*targetQ 49 | 50 | loss = F.mse_loss(evalQ,targetQ.detach()) 51 | self.optimizer.zero_grad() 52 | loss.backward() 53 | self.optimizer.step() 54 | self.scheduler.step() 55 | 56 | if self.iterations % self.update_frequency == 0: 57 | self.policy.update_target() 58 | 59 | if self.logger == 'tensorboard': 60 | self.summary.add_scalar("Q-loss",loss.item(),self.iterations) 61 | self.summary.add_scalar("learning-rate",self.optimizer.state_dict()['param_groups'][0]['lr'],self.iterations) 62 | self.summary.add_scalar("value_function",evalQ.mean().item(),self.iterations) 63 | else: 64 | wandb.log({'train/Q-loss':loss.item(), 65 | "train/learning-rate":self.optimizer.state_dict()['param_groups'][0]['lr'], 66 | "train/value_function":evalQ.mean().item(), 67 | "iterations":self.iterations}) 68 | 69 | if self.iterations % self.save_model_frequency == 0: 70 | model_path = self.modeldir + "model-%s-%s.pth" % (get_time_full(), str(self.iterations)) 71 | torch.save(self.policy.state_dict(), model_path) 72 | 73 | -------------------------------------------------------------------------------- /xuance/learner/a2c.py: -------------------------------------------------------------------------------- 1 | from xuance.learner import * 2 | class A2C_Learner: 3 | def __init__(self, 4 | config, 5 | policy, 6 | optimizer, 7 | scheduler, 8 | device): 9 | self.policy = policy 10 | self.optimizer = optimizer 11 | self.scheduler = scheduler 12 | self.device = device 13 | 14 | self.vf_coef = config.vf_coef 15 | self.ent_coef = config.ent_coef 16 | self.clipgrad_norm = config.clipgrad_norm 17 | self.save_model_frequency = config.save_model_frequency 18 | self.iterations = 0 19 | self.logdir = os.path.join(config.logdir,config.env_name,config.algo_name+"-%d/"%config.seed) 20 | self.modeldir = os.path.join(config.modeldir,config.env_name,config.algo_name+"-%d/"%config.seed) 21 | self.logger = config.logger 22 | create_directory(self.modeldir) 23 | create_directory(self.logdir) 24 | 25 | if self.logger == 'wandb': 26 | self.summary = wandb.init(project="XuanCE", 27 | group=config.env_name, 28 | name=config.algo_name, 29 | config=wandb.helper.parse_config(vars(config), exclude=('logger','logdir','modeldir'))) 30 | wandb.define_metric("iterations") 31 | wandb.define_metric("train/*",step_metric="iterations") 32 | elif self.logger == 'tensorboard': 33 | self.summary = SummaryWriter(self.logdir) 34 | else: 35 | raise NotImplementedError 36 | 37 | 38 | def update(self,input_batch,action_batch,return_batch,advantage_batch): 39 | self.iterations += 1 40 | tensor_action_batch = 
torch.as_tensor(action_batch,device=self.device) 41 | tensor_return_batch = torch.as_tensor(return_batch,device=self.device) 42 | tensor_advantage_batch = torch.as_tensor(advantage_batch,device=self.device) 43 | 44 | _,actor,critic = self.policy(input_batch) 45 | actor_loss = -(tensor_advantage_batch * actor.logprob(tensor_action_batch)).mean() 46 | critic_loss = F.mse_loss(critic,tensor_return_batch) 47 | entropy_loss = actor.entropy().mean() 48 | 49 | loss = actor_loss + self.vf_coef * critic_loss - self.ent_coef * entropy_loss 50 | self.optimizer.zero_grad() 51 | loss.backward() 52 | torch.nn.utils.clip_grad_norm_(self.policy.parameters(), self.clipgrad_norm) 53 | self.optimizer.step() 54 | self.scheduler.step() 55 | 56 | if self.logger == 'tensorboard': 57 | self.summary.add_scalar("actor-loss",actor_loss.item(),self.iterations) 58 | self.summary.add_scalar("critic-loss",critic_loss.item(),self.iterations) 59 | self.summary.add_scalar("entropy-loss",entropy_loss.item(),self.iterations) 60 | self.summary.add_scalar("learning-rate",self.optimizer.state_dict()['param_groups'][0]['lr'],self.iterations) 61 | self.summary.add_scalar("value_function",critic.mean().item(),self.iterations) 62 | else: 63 | wandb.log({'train/actor-loss':actor_loss.item(), 64 | "train/critic-loss":critic_loss.item(), 65 | "train/entropy-loss":entropy_loss.item(), 66 | "train/learning-rate":self.optimizer.state_dict()['param_groups'][0]['lr'], 67 | "train/value_function":critic.mean().item(), 68 | "iterations":self.iterations}) 69 | 70 | if self.iterations % self.save_model_frequency == 0: 71 | model_path = self.modeldir + "model-%s-%s.pth" % (get_time_full(), str(self.iterations)) 72 | torch.save(self.policy.state_dict(), model_path) -------------------------------------------------------------------------------- /xuance/utils/block.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import numpy as np 4 | from typing import Any, Dict, Optional, Sequence, Tuple, Type, Union, Callable 5 | from .layer import NoisyLinear 6 | ModuleType = Type[nn.Module] 7 | 8 | def mlp_block(input_dim: int, 9 | output_dim: int, 10 | activation: Optional[ModuleType] = None, 11 | initialize: Optional[Callable[[torch.Tensor], torch.Tensor]] = None, 12 | device: Optional[Union[str, int, torch.device]] = None): 13 | block = [] 14 | linear = nn.Linear(input_dim, output_dim, device=device) 15 | if initialize is not None: 16 | initialize(linear.weight) 17 | block.append(linear) 18 | if activation is not None: 19 | block.append(activation()) 20 | return block, (output_dim,) 21 | 22 | def noisy_mlp_block(input_dim: int, 23 | output_dim: int, 24 | sigma: float, 25 | activation: Optional[ModuleType] = None, 26 | initialize: Optional[Callable[[torch.Tensor], torch.Tensor]] = None, 27 | device: Optional[Union[str, int, torch.device]] = None): 28 | block = [] 29 | linear = NoisyLinear(input_dim, output_dim, sigma, device=device) 30 | if initialize is not None: 31 | initialize(linear.weight) 32 | block.append(linear) 33 | if activation is not None: 34 | block.append(activation()) 35 | return block, (output_dim,) 36 | 37 | def cnn_block(input_shape: Sequence[int], 38 | filter: int, 39 | kernel_size: int, 40 | stride: int, 41 | activation: Optional[ModuleType] = None, 42 | initialize: Optional[Callable[[torch.Tensor], torch.Tensor]] = None, 43 | device: Optional[Union[str, int, torch.device]] = None 44 | ): 45 | assert len(input_shape) == 3 # CxHxW 46 | C, H, W = 
input_shape 47 | padding = int((kernel_size - stride) // 2) 48 | block = [] 49 | cnn = nn.Conv2d(C, filter, kernel_size, stride, padding=padding, device=device) 50 | if initialize is not None: 51 | initialize(cnn.weight) 52 | block.append(cnn) 53 | C = filter 54 | H = int((H + 2 * padding - (kernel_size - 1) - 1) / stride + 1) 55 | W = int((W + 2 * padding - (kernel_size - 1) - 1) / stride + 1) 56 | if activation is not None: 57 | block.append(activation()) 58 | return block, (C, H, W) 59 | 60 | def gru_block(input_dim: Sequence[int], 61 | output_dim: int, 62 | dropout: float = 0, 63 | initialize: Optional[Callable[[torch.Tensor], torch.Tensor]] = None, 64 | device: Optional[Union[str, int, torch.device]] = None): 65 | gru = nn.GRU(input_size=input_dim, 66 | hidden_size=output_dim, 67 | batch_first=True, 68 | dropout=dropout, 69 | device=device) 70 | if initialize is not None: 71 | for weight_list in gru.all_weights: 72 | for weight in weight_list: 73 | if len(weight.shape) > 1: 74 | initialize(weight) 75 | return gru 76 | 77 | def lstm_block(input_dim: Sequence[int], 78 | output_dim: int, 79 | dropout: float = 0, 80 | initialize: Optional[Callable[[torch.Tensor], torch.Tensor]] = None, 81 | device: Optional[Union[str, int, torch.device]] = None) -> ModuleType: 82 | lstm = nn.LSTM(input_size=input_dim, 83 | hidden_size=output_dim, 84 | batch_first=True, 85 | dropout=dropout, 86 | device=device) 87 | if initialize is not None: 88 | for weight_list in lstm.all_weights: 89 | for weight in weight_list: 90 | if len(weight.shape) > 1: 91 | initialize(weight) 92 | return lstm -------------------------------------------------------------------------------- /xuance/utils/common.py: -------------------------------------------------------------------------------- 1 | import gym 2 | import os 3 | import time 4 | import yaml 5 | import scipy.signal 6 | import numpy as np 7 | from terminaltables import AsciiTable 8 | from argparse import Namespace 9 | 10 | def get_config(dir_name, args_name): 11 | with open(os.path.join(dir_name, args_name + ".yaml"), "r") as f: 12 | try: 13 | config_dict = yaml.load(f, Loader=yaml.FullLoader) 14 | except yaml.YAMLError as exc: 15 | assert False, args_name + ".yaml error: {}".format(exc) 16 | return Namespace(**config_dict) 17 | 18 | 19 | def create_directory(path): 20 | dir_split = path.split("/") 21 | current_dir = dir_split[0] + "/" 22 | for i in range(1, len(dir_split)): 23 | if not os.path.exists(current_dir): 24 | os.makedirs(current_dir, exist_ok=True) 25 | current_dir = current_dir + dir_split[i] + "/" 26 | 27 | def space2shape(observation_space: gym.Space): 28 | if isinstance(observation_space, gym.spaces.Dict): 29 | return {key: observation_space[key].shape for key in observation_space.keys()} 30 | else: 31 | return observation_space.shape 32 | 33 | def discount_cumsum(x, discount=0.99): 34 | return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1] 35 | 36 | def combined_shape(length, shape=None): 37 | if shape is None: 38 | return (length,) 39 | return (length, shape) if np.isscalar(shape) else (length, *shape) 40 | 41 | 42 | ################# Time Func For Log ############################## 43 | def get_time_hm()->str: 44 | localtime = time.localtime(time.time()) 45 | return "%02d:%02d"%(localtime.tm_hour, localtime.tm_min) 46 | 47 | def get_time_full()->str: 48 | return time.asctime().replace(":", "_")#.replace(" ", "_") 49 | 50 | 51 | ################################################################### 52 | 53 | 54 | def 
log_the_table(title, info_table: list, txt_writer): 55 | def log_the_str(*print_paras): 56 | for para_i in print_paras: 57 | print(para_i, end= "") 58 | print(para_i, end= "", file = txt_writer) 59 | print("") 60 | print("", file = txt_writer) 61 | return 62 | 63 | table = AsciiTable(info_table, title).table 64 | log_the_str(table) 65 | 66 | def summarize_ppo_config(pConfig): 67 | 68 | info_table = [['item', 'detail']] 69 | info_table.append(["# model&log saving dir", pConfig.env_name]) 70 | info_table.append(["# interact envs", pConfig.nenvs]) 71 | info_table.append(["# total train_steps", str(pConfig.train_steps/1000) + " K"]) 72 | info_table.append(["# train-test epochs", pConfig.train_steps // pConfig.evaluate_steps]) 73 | info_table.append([" ", " "]) 74 | info_table.append(["# train_step per train", pConfig.evaluate_steps]) 75 | info_table.append(["# policy max updates per train_step", pConfig.evaluate_steps//pConfig.nsize]) 76 | info_table.append(["# updates to save model", str(pConfig.save_model_frequency/1000) + " K"]) 77 | info_table.append(["# data samples per train_step", pConfig.nminibatch * pConfig.nepoch]) 78 | info_table.append(["# data batchsize per update", pConfig.nenvs * pConfig.nsize // pConfig.nsize]) 79 | info_table.append([" ", " "]) 80 | info_table.append(["target_kl for erly-stopping train_step", pConfig.target_kl]) 81 | info_table.append(["base lr for Adam", pConfig.lr_rate]) 82 | info_table.append([" ", " "]) 83 | info_table.append(["loss-graph log timestep <=", str((pConfig.train_steps//pConfig.nsize)*pConfig.nminibatch * pConfig.nepoch/1000)+" K"]) 84 | info_table.append(["reward-graph log timestep ==", str(pConfig.train_steps * pConfig.nenvs/1000000)+" M"]) 85 | os.makedirs(os.path.join(pConfig.logdir, pConfig.env_name), exist_ok=True) 86 | logfile_path = os.path.join(pConfig.logdir, pConfig.env_name, "config_settings.txt") 87 | txt_writer = open(logfile_path, 'a+') 88 | log_the_table("config" , info_table, txt_writer) 89 | txt_writer.close() -------------------------------------------------------------------------------- /xuance/learner/ddpg.py: -------------------------------------------------------------------------------- 1 | from xuance.learner import * 2 | class DDPG_Learner: 3 | def __init__(self, 4 | config, 5 | policy, 6 | optimizer, 7 | scheduler, 8 | device): 9 | self.policy = policy 10 | self.actor_optimizer = optimizer[0] 11 | self.critic_optimizer = optimizer[1] 12 | self.actor_scheduler = scheduler[0] 13 | self.critic_scheduler = scheduler[1] 14 | self.device = device 15 | 16 | self.tau = config.tau 17 | self.gamma = config.gamma 18 | self.save_model_frequency = config.save_model_frequency 19 | self.iterations = 0 20 | 21 | self.logdir = os.path.join(config.logdir,config.env_name,config.algo_name+"-%d/"%config.seed) 22 | self.modeldir = os.path.join(config.modeldir,config.env_name,config.algo_name+"-%d/"%config.seed) 23 | self.logger = config.logger 24 | create_directory(self.modeldir) 25 | create_directory(self.logdir) 26 | if self.logger == 'wandb': 27 | self.summary = wandb.init(project="XuanCE", 28 | group=config.env_name, 29 | name=config.algo_name, 30 | config=wandb.helper.parse_config(vars(config), exclude=('logger','logdir','modeldir'))) 31 | wandb.define_metric("iterations") 32 | wandb.define_metric("train/*",step_metric="iterations") 33 | elif self.logger == 'tensorboard': 34 | self.summary = SummaryWriter(self.logdir) 35 | else: 36 | raise NotImplementedError 37 | 38 | def 
update(self,input_batch,action_batch,reward_batch,terminal_batch,next_input_batch): 39 | self.iterations += 1 40 | tensor_action_batch = torch.as_tensor(action_batch,device=self.device) 41 | tensor_reward_batch = torch.as_tensor(reward_batch,device=self.device) 42 | tensor_terminal_batch = torch.as_tensor(terminal_batch,device=self.device) 43 | 44 | #update Q network 45 | with torch.no_grad(): 46 | targetQ = self.policy.Qtarget(next_input_batch) 47 | targetQ = tensor_reward_batch + self.gamma * (1-tensor_terminal_batch) * targetQ 48 | 49 | currentQ = self.policy.Qaction(input_batch,tensor_action_batch) 50 | Q_loss = F.mse_loss(currentQ,targetQ) 51 | self.critic_optimizer.zero_grad() 52 | Q_loss.backward() 53 | self.critic_optimizer.step() 54 | self.critic_scheduler.step() 55 | # update A network 56 | _,_,evalQ = self.policy(input_batch) 57 | A_loss = -evalQ.mean() 58 | self.actor_optimizer.zero_grad() 59 | A_loss.backward() 60 | self.actor_optimizer.step() 61 | self.actor_scheduler.step() 62 | self.policy.soft_update(self.tau) 63 | 64 | if self.logger == 'tensorboard': 65 | self.summary.add_scalar("Q-loss",Q_loss.item(),self.iterations) 66 | self.summary.add_scalar("A-loss",A_loss.item(),self.iterations) 67 | self.summary.add_scalar("actor-learning-rate",self.actor_optimizer.state_dict()['param_groups'][0]['lr'],self.iterations) 68 | self.summary.add_scalar("critic-learning-rate",self.critic_optimizer.state_dict()['param_groups'][0]['lr'],self.iterations) 69 | self.summary.add_scalar("value_function",evalQ.mean().item(),self.iterations) 70 | else: 71 | wandb.log({'train/Q-loss':Q_loss.item(), 72 | "train/A-loss":A_loss.item(), 73 | "train/actor-learning-rate":self.actor_optimizer.state_dict()['param_groups'][0]['lr'], 74 | "train/critic-learning-rate":self.critic_optimizer.state_dict()['param_groups'][0]['lr'], 75 | "train/value_function":evalQ.mean().item(), 76 | "iterations":self.iterations}) 77 | 78 | if self.iterations % self.save_model_frequency == 0: 79 | model_path = self.modeldir + "model-%s-%s.pth" % (get_time_full(), str(self.iterations)) 80 | torch.save(self.policy.state_dict(), model_path) 81 | -------------------------------------------------------------------------------- /xuance/learner/td3.py: -------------------------------------------------------------------------------- 1 | from xuance.learner import * 2 | class TD3_Learner: 3 | def __init__(self, 4 | config, 5 | policy, 6 | optimizer, 7 | scheduler, 8 | device): 9 | self.policy = policy 10 | self.actor_optimizer = optimizer[0] 11 | self.critic_optimizer = optimizer[1] 12 | self.actor_scheduler = scheduler[0] 13 | self.critic_scheduler = scheduler[1] 14 | self.device = device 15 | 16 | self.tau = config.tau 17 | self.gamma = config.gamma 18 | self.update_decay = config.actor_update_decay 19 | self.save_model_frequency = config.save_model_frequency 20 | self.iterations = 0 21 | 22 | self.logdir = os.path.join(config.logdir,config.env_name,config.algo_name+"-%d/"%config.seed) 23 | self.modeldir = os.path.join(config.modeldir,config.env_name,config.algo_name+"-%d/"%config.seed) 24 | self.logger = config.logger 25 | create_directory(self.modeldir) 26 | create_directory(self.logdir) 27 | if self.logger == 'wandb': 28 | self.summary = wandb.init(project="XuanCE", 29 | group=config.env_name, 30 | name=config.algo_name, 31 | config=wandb.helper.parse_config(vars(config), exclude=('logger','logdir','modeldir'))) 32 | wandb.define_metric("iterations") 33 | wandb.define_metric("train/*",step_metric="iterations") 34 | elif 
self.logger == 'tensorboard': 35 | self.summary = SummaryWriter(self.logdir) 36 | else: 37 | raise NotImplementedError 38 | 39 | def update(self,input_batch,action_batch,reward_batch,terminal_batch,next_input_batch): 40 | self.iterations += 1 41 | tensor_action_batch = torch.as_tensor(action_batch,device=self.device) 42 | tensor_reward_batch = torch.as_tensor(reward_batch,device=self.device) 43 | tensor_terminal_batch = torch.as_tensor(terminal_batch,device=self.device) 44 | 45 | #update Q network 46 | with torch.no_grad(): 47 | targetQ = self.policy.Qtarget(next_input_batch) 48 | targetQ = tensor_reward_batch + self.gamma * (1-tensor_terminal_batch) * targetQ 49 | currentQA,currentQB = self.policy.Qaction(input_batch,tensor_action_batch) 50 | Q_loss = F.mse_loss(currentQA,targetQ) + F.mse_loss(currentQB,targetQ) 51 | self.critic_optimizer.zero_grad() 52 | Q_loss.backward() 53 | self.critic_optimizer.step() 54 | self.critic_scheduler.step() 55 | # update A network 56 | if self.iterations % self.update_decay == 0: 57 | _,_,evalQ = self.policy(input_batch) 58 | A_loss = -evalQ.mean() 59 | self.actor_optimizer.zero_grad() 60 | A_loss.backward() 61 | self.actor_optimizer.step() 62 | self.actor_scheduler.step() 63 | if self.logger == 'tensorboard': 64 | self.summary.add_scalar("A-loss",A_loss.item(),self.iterations) 65 | self.summary.add_scalar("value_function",evalQ.mean().item(),self.iterations) 66 | self.summary.add_scalar("Q-loss",Q_loss.item(),self.iterations) 67 | self.summary.add_scalar("actor-learning-rate",self.actor_optimizer.state_dict()['param_groups'][0]['lr'],self.iterations) 68 | self.summary.add_scalar("critic-learning-rate",self.critic_optimizer.state_dict()['param_groups'][0]['lr'],self.iterations) 69 | else: 70 | wandb.log({'train/Q-loss':Q_loss.item(), 71 | "train/A-loss":A_loss.item(), 72 | "train/actor-learning-rate":self.actor_optimizer.state_dict()['param_groups'][0]['lr'], 73 | "train/critic-learning-rate":self.critic_optimizer.state_dict()['param_groups'][0]['lr'], 74 | "train/value_function":evalQ.mean().item(), 75 | "iterations":self.iterations}) 76 | 77 | self.policy.soft_update(self.tau) 78 | if self.iterations % self.save_model_frequency == 0: 79 | model_path = self.modeldir + "model-%s-%s.pth" % (get_time_full(), str(self.iterations)) 80 | torch.save(self.policy.state_dict(), model_path) 81 | -------------------------------------------------------------------------------- /xuance/learner/ppo.py: -------------------------------------------------------------------------------- 1 | from xuance.learner import * 2 | class PPO_Learner: 3 | def __init__(self, 4 | config, 5 | policy, 6 | optimizer, 7 | scheduler, 8 | device): 9 | self.policy = policy 10 | self.optimizer = optimizer 11 | self.scheduler = scheduler 12 | self.device = device 13 | 14 | self.vf_coef = config.vf_coef 15 | self.ent_coef = config.ent_coef 16 | self.clipgrad_norm = config.clipgrad_norm 17 | self.clip_range = config.clip_range 18 | self.save_model_frequency = config.save_model_frequency 19 | self.iterations = 0 20 | 21 | self.logdir = os.path.join(config.logdir,config.env_name,config.algo_name+"-%d/"%config.seed) 22 | self.modeldir = os.path.join(config.modeldir,config.env_name,config.algo_name+"-%d/"%config.seed) 23 | self.logger = config.logger 24 | create_directory(self.modeldir) 25 | create_directory(self.logdir) 26 | if self.logger == 'wandb': 27 | self.summary = wandb.init(project="XuanCE", 28 | group=config.env_name, 29 | name=config.algo_name, 30 | 
config=wandb.helper.parse_config(vars(config), exclude=('logger','logdir','modeldir'))) 31 | wandb.define_metric("iterations") 32 | wandb.define_metric("train/*",step_metric="iterations") 33 | elif self.logger == 'tensorboard': 34 | self.summary = SummaryWriter(self.logdir) 35 | else: 36 | raise NotImplementedError 37 | 38 | def update(self,input_batch,action_batch,output_batch,return_batch,advantage_batch): 39 | self.iterations += 1 40 | tensor_action_batch = torch.as_tensor(action_batch,device=self.device) 41 | tensor_return_batch = torch.as_tensor(return_batch,device=self.device) 42 | tensor_advantage_batch = torch.as_tensor(advantage_batch,device=self.device) 43 | 44 | # get current policy distribution 45 | _,actor,critic = self.policy(input_batch) 46 | current_logp = actor.logprob(tensor_action_batch) 47 | # get old policy distribution 48 | _,old_actor,_ = self.policy(input_batch) 49 | param_dict = {} 50 | for key in self.policy.actor.output_shape.keys(): 51 | param_dict[key] = torch.as_tensor(output_batch[key],device=self.device) 52 | old_actor.set_param(**param_dict) 53 | old_logp = old_actor.logprob(tensor_action_batch).detach() 54 | ratio = (current_logp - old_logp).exp().float() 55 | approx_kl = actor.kl_divergence(old_actor).mean() 56 | 57 | surrogate1 = tensor_advantage_batch * ratio 58 | surrogate2 = ratio.clamp(1-self.clip_range,1+self.clip_range)*tensor_advantage_batch 59 | 60 | actor_loss = -torch.minimum(surrogate1,surrogate2).mean() 61 | critic_loss = F.mse_loss(critic,tensor_return_batch) 62 | entropy_loss = actor.entropy().mean() 63 | 64 | loss = actor_loss + self.vf_coef * critic_loss - self.ent_coef * entropy_loss 65 | self.optimizer.zero_grad() 66 | loss.backward() 67 | torch.nn.utils.clip_grad_norm_(self.policy.parameters(), self.clipgrad_norm) 68 | self.optimizer.step() 69 | self.scheduler.step() 70 | if self.logger == 'tensorboard': 71 | self.summary.add_scalar("ratio",ratio.mean().item(),self.iterations) 72 | self.summary.add_scalar("actor-loss",actor_loss.item(),self.iterations) 73 | self.summary.add_scalar("critic-loss",critic_loss.item(),self.iterations) 74 | self.summary.add_scalar("entropy-loss",entropy_loss.item(),self.iterations) 75 | self.summary.add_scalar("kl-divergence",approx_kl.item(),self.iterations) 76 | self.summary.add_scalar("learning-rate",self.optimizer.state_dict()['param_groups'][0]['lr'],self.iterations) 77 | self.summary.add_scalar("value_function",critic.mean().item(),self.iterations) 78 | else: 79 | wandb.log({'train/ratio':ratio.mean().item(), 80 | 'train/actor-loss':actor_loss.item(), 81 | "train/critic-loss":critic_loss.item(), 82 | "train/entropy-loss":entropy_loss.item(), 83 | "train/kl-divergence":approx_kl.item(), 84 | "train/learning-rate":self.optimizer.state_dict()['param_groups'][0]['lr'], 85 | "train/value_function":critic.mean().item(), 86 | "iterations":self.iterations}) 87 | if self.iterations % self.save_model_frequency == 0: 88 | model_path = self.modeldir + "model-%s-%s.pth" % (get_time_full(), str(self.iterations)) 89 | torch.save(self.policy.state_dict(), model_path) 90 | return approx_kl 91 | 92 | -------------------------------------------------------------------------------- /xuance/policy/gaussian.py: -------------------------------------------------------------------------------- 1 | from xuance.policy import * 2 | class ActorNet(nn.Module): 3 | def __init__(self, 4 | state_dim:int, 5 | action_dim:int, 6 | initialize, 7 | device 8 | ): 9 | super(ActorNet,self).__init__() 10 | self.device = device 11 | 
self.action_dim = action_dim
        self.mu = nn.Sequential(*mlp_block(state_dim,action_dim,None,initialize,device)[0],nn.Tanh())
        self.logstd = nn.Parameter(-torch.ones((action_dim,),device=device))
        self.distribution = DiagGaussianDistribution(self.action_dim)
        self.output_shape = self.distribution.params_shape
    def forward(self,x:torch.Tensor):
        distribution = DiagGaussianDistribution(self.action_dim)
        distribution.set_param(mu=self.mu(x),std=self.logstd.exp())
        return distribution.get_param(),distribution

class CriticNet(nn.Module):
    def __init__(self,
                 state_dim:int,
                 initialize,
                 device
                 ):
        super(CriticNet,self).__init__()
        self.device = device
        self.model = nn.Sequential(*mlp_block(state_dim,1,None,initialize,device)[0])
        self.output_shape = {'critic':()}
    def forward(self,x:torch.Tensor):
        return self.model(x).squeeze(dim=-1)

class ActorCriticPolicy(nn.Module):
    def __init__(self,
                 action_space:gym.Space,
                 representation:torch.nn.Module,
                 initialize,
                 device):
        assert isinstance(action_space, gym.spaces.Box)
        super(ActorCriticPolicy,self).__init__()
        self.action_space = action_space
        self.representation = representation
        self.input_shape = self.representation.input_shape.copy()
        self.output_shape = self.representation.output_shape.copy()
        self.output_shape['actor'] = self.action_space.shape
        self.output_shape['critic'] = ()
        self.actor = ActorNet(self.representation.output_shape['state'][0],
                              self.action_space.shape[0],
                              initialize,
                              device)
        self.critic = CriticNet(self.representation.output_shape['state'][0],
                                initialize,
                                device)
        for key,value in zip(self.actor.output_shape.keys(),self.actor.output_shape.values()):
            self.output_shape[key] = value
        self.output_shape['critic'] = ()

    def forward(self,observation:dict):
        outputs = self.representation(observation)
        a_param,a = self.actor(outputs['state'])
        v = self.critic(outputs['state'])
        for key in self.actor.output_shape.keys():
            outputs[key] = a_param[key]
        outputs['critic'] = v
        return outputs,a,v

class SACCriticNet(nn.Module):
    def __init__(self,
                 state_dim:int,
                 action_dim:int,
                 initialize,
                 device):
        super(SACCriticNet,self).__init__()
        self.device = device
        self.model = nn.Sequential(*mlp_block(state_dim+action_dim,state_dim,nn.LeakyReLU,initialize,device)[0],
                                   *mlp_block(state_dim,1,None,initialize,device)[0])
        self.output_shape = {'critic':()}
    def forward(self,x:torch.Tensor,a:torch.Tensor):
        return self.model(torch.concat((x,a),dim=-1))[:,0]

class SACPolicy(nn.Module):
    def __init__(self,
                 action_space:gym.Space,
                 representation:torch.nn.Module,
                 initialize,
                 device):
        assert isinstance(action_space, gym.spaces.Box)
        super(SACPolicy,self).__init__()
        self.action_space = action_space
        self.representation = representation
        self.input_shape = self.representation.input_shape.copy()
        self.output_shape = self.representation.output_shape.copy()

        self.actor = ActorNet(self.representation.output_shape['state'][0],
                              self.action_space.shape[0],
                              initialize,
                              device)

        self.criticA_representation = copy.deepcopy(representation)
        self.criticA = SACCriticNet(self.representation.output_shape['state'][0],
                                    self.action_space.shape[0],
                                    initialize,
                                    device)

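        # Twin Q-networks: criticA/criticB below each own a deep copy of the representation,
        # and SAC-style learners typically bootstrap from the minimum of the two target
        # critics to curb Q-value overestimation.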
106 | self.criticB_representation = copy.deepcopy(representation) 107 | self.criticB = SACCriticNet(self.representation.output_shape['state'][0], 108 | self.action_space.shape[0], 109 | initialize, 110 | device) 111 | 112 | self.targetA_critic_representation = copy.deepcopy(representation) 113 | self.targetB_critic_representation = copy.deepcopy(representation) 114 | self.target_criticA = copy.deepcopy(self.criticA) 115 | self.target_criticB = copy.deepcopy(self.criticB) 116 | 117 | for key,value in zip(self.actor.output_shape.keys(),self.actor.output_shape.values()): 118 | self.output_shape[key] = value 119 | self.output_shape['criticA'] = () 120 | self.output_shape['criticB'] = () 121 | self.actor_parameters = list(self.representation.parameters()) + list(self.actor.parameters()) 122 | self.critic_parameters = list(self.criticA_representation.parameters()) + list(self.criticA.parameters()) + list(self.criticB_representation.parameters()) + list(self.criticB.parameters()) 123 | -------------------------------------------------------------------------------- /xuance/policy/dqn.py: -------------------------------------------------------------------------------- 1 | from xuance.policy import * 2 | class BasicQhead(nn.Module): 3 | def __init__(self, 4 | state_dim:int, 5 | action_dim:int, 6 | initialize, 7 | device): 8 | super(BasicQhead,self).__init__() 9 | self.model = nn.Sequential(*mlp_block(state_dim,action_dim,None,initialize,device)[0]) 10 | def forward(self,x:torch.Tensor): 11 | return self.model(x) 12 | 13 | class DuelQhead(nn.Module): 14 | def __init__(self, 15 | state_dim:int, 16 | action_dim:int, 17 | initialize, 18 | device): 19 | super(DuelQhead,self).__init__() 20 | self.a_model = nn.Sequential(*mlp_block(state_dim,action_dim,None,initialize,device)[0]) 21 | self.v_model = nn.Sequential(*mlp_block(state_dim,1,None,initialize,device)[0]) 22 | def forward(self,x:torch.Tensor): 23 | v = self.v_model(x) 24 | a = self.a_model(x) 25 | q = v + (a - a.mean(dim=-1).unsqueeze(dim=-1)) 26 | return q 27 | 28 | class C51Qhead(nn.Module): 29 | def __init__(self, 30 | state_dim: int, 31 | action_dim: int, 32 | atom_num: int, 33 | initialize: Optional[Callable[..., torch.Tensor]] = None, 34 | device: Optional[Union[str, int, torch.device]] = None): 35 | super(C51Qhead, self).__init__() 36 | self.action_dim = action_dim 37 | self.atom_num = atom_num 38 | self.model = nn.Sequential(*mlp_block(state_dim,action_dim*atom_num,None,initialize,device)[0]) 39 | def forward(self, x: torch.Tensor): 40 | dist_logits = self.model(x).view(-1, self.action_dim, self.atom_num) 41 | dist_probs = F.softmax(dist_logits, dim=-1) 42 | return dist_probs 43 | 44 | class QRDQNhead(nn.Module): 45 | def __init__(self, 46 | state_dim: int, 47 | action_dim: int, 48 | atom_num: int, 49 | initialize: Optional[Callable[..., torch.Tensor]] = None, 50 | device: Optional[Union[str, int, torch.device]] = None): 51 | super(QRDQNhead, self).__init__() 52 | self.action_dim = action_dim 53 | self.atom_num = atom_num 54 | self.model = nn.Sequential(*mlp_block(state_dim,action_dim*atom_num,None,initialize,device)[0]) 55 | def forward(self, x: torch.Tensor): 56 | quantiles = self.model(x).view(-1, self.action_dim, self.atom_num) 57 | return quantiles 58 | 59 | class DQN_Policy(nn.Module): 60 | def __init__(self, 61 | action_space, 62 | representation, 63 | initialize, 64 | device): 65 | super(DQN_Policy,self).__init__() 66 | assert isinstance(action_space,gym.spaces.Discrete), "DQN is not supported for non-discrete action space" 67 | 
self.action_dim = action_space.n 68 | self.input_shape = representation.input_shape.copy() 69 | self.output_shape = representation.output_shape.copy() 70 | self.output_shape['evalQ'] = (self.action_dim,) 71 | self.output_shape['targetQ'] = (self.action_dim,) 72 | self.eval_representation = representation 73 | self.evalQ = BasicQhead(representation.output_shape['state'][0],self.action_dim,initialize,device) 74 | self.target_representation = copy.deepcopy(self.eval_representation) 75 | self.targetQ = copy.deepcopy(self.evalQ) 76 | def forward(self,observation:dict): 77 | eval_outputs = self.eval_representation(observation) 78 | target_outputs = self.target_representation(observation) 79 | evalQ = self.evalQ(eval_outputs['state']) 80 | targetQ = self.targetQ(target_outputs['state']).detach() 81 | eval_outputs['evalQ'] = evalQ 82 | eval_outputs['targetQ'] = targetQ 83 | return eval_outputs,evalQ,targetQ 84 | def update_target(self): 85 | for ep,tp in zip(self.eval_representation.parameters(),self.target_representation.parameters()): 86 | tp.data.copy_(ep.data) 87 | for ep,tp in zip(self.evalQ.parameters(),self.targetQ.parameters()): 88 | tp.data.copy_(ep.data) 89 | 90 | class DuelDQN_Policy(DQN_Policy): 91 | def __init__(self, 92 | action_space, 93 | representation, 94 | initialize, 95 | device): 96 | super(DuelDQN_Policy,self).__init__(action_space,representation,initialize,device) 97 | assert isinstance(action_space,gym.spaces.Discrete), "Dueling-DQN is not supported for non-discrete action space" 98 | self.evalQ = DuelQhead(representation.output_shape['state'][0],self.action_dim,initialize,device) 99 | self.targetQ = copy.deepcopy(self.evalQ) 100 | 101 | class C51_Policy(DQN_Policy): 102 | def __init__(self, 103 | action_space, 104 | representation, 105 | value_range, 106 | atom_num, 107 | initialize, 108 | device): 109 | super(C51_Policy,self).__init__(action_space,representation,initialize,device) 110 | assert isinstance(action_space,gym.spaces.Discrete), "C51 is not supported for non-discrete action space" 111 | self.evalQ = C51Qhead(representation.output_shape['state'][0],action_space.n,atom_num,initialize,device) 112 | self.targetQ = copy.deepcopy(self.evalQ) 113 | self.value_range = value_range 114 | self.atom_num = atom_num 115 | self.supports = torch.nn.Parameter(torch.linspace(self.value_range[0], self.value_range[1], self.atom_num), requires_grad=False).to(device) 116 | self.deltaz = (value_range[1] - value_range[0]) / (atom_num - 1) 117 | 118 | class QRDQN_Policy(DQN_Policy): 119 | def __init__(self, 120 | action_space, 121 | representation, 122 | atom_num, 123 | initialize, 124 | device): 125 | super(QRDQN_Policy,self).__init__(action_space,representation,initialize,device) 126 | assert isinstance(action_space,gym.spaces.Discrete), "QRDQN is not supported for non-discrete action space" 127 | self.evalQ = QRDQNhead(representation.output_shape['state'][0],action_space.n,atom_num,initialize,device) 128 | self.targetQ = copy.deepcopy(self.evalQ) 129 | self.atom_num = atom_num 130 | 131 | 132 | 133 | 134 | 135 | -------------------------------------------------------------------------------- /xuance/environment/normalizer.py: -------------------------------------------------------------------------------- 1 | from xuance.environment import * 2 | from xuance.environment.vectorize import VecEnv 3 | class RewardNorm(VecEnv): 4 | def __init__(self,config,vecenv:VecEnv,scale_range=(0.1,10),reward_range=(-5,5),gamma=0.99,train=True): 5 | 
super(RewardNorm,self).__init__(vecenv.num_envs,vecenv.observation_space,vecenv.action_space) 6 | assert scale_range[0] < scale_range[1], "invalid scale_range." 7 | assert reward_range[0] < reward_range[1], "Invalid reward_range." 8 | assert gamma < 1, "Gamma should be a float value smaller than 1." 9 | self.config = config 10 | self.gamma = gamma 11 | self.scale_range = scale_range 12 | self.reward_range = reward_range 13 | self.vecenv = vecenv 14 | self.return_rms = Running_MeanStd({'return':(1,)}) 15 | self.episode_rewards = [[] for i in range(self.num_envs)] 16 | self.train_steps = 0 17 | self.train = train 18 | self.save_dir = os.path.join(self.config.modeldir,self.config.env_name,self.config.algo_name+"-%d"%self.config.seed) 19 | 20 | def load_rms(self): 21 | npy_path = os.path.join(self.save_dir,"reward_stat.npy") 22 | if not os.path.exists(npy_path): 23 | return 24 | rms_data = np.load(npy_path,allow_pickle=True).item() 25 | self.return_rms.count = rms_data['count'] 26 | self.return_rms.mean = rms_data['mean'] 27 | self.return_rms.var = rms_data['var'] 28 | 29 | def step_wait(self): 30 | obs,rews,terminals,trunctions,infos = self.vecenv.step_wait() 31 | for i in range(len(rews)): 32 | if terminals[i] != True and trunctions[i] != True: 33 | self.episode_rewards[i].append(rews[i]) 34 | else: 35 | if self.train: 36 | self.return_rms.update({'return':discount_cumsum(self.episode_rewards[i],self.gamma)[0:1][np.newaxis,:]}) 37 | self.episode_rewards[i].clear() 38 | scale = np.clip(self.return_rms.std['return'][0],self.scale_range[0],self.scale_range[1]) 39 | rews[i] = np.clip(rews[i]/scale,self.reward_range[0],self.reward_range[1]) 40 | return obs,rews,terminals,trunctions,infos 41 | def reset(self): 42 | if self.train == False: 43 | self.load_rms() 44 | return self.vecenv.reset() 45 | def step_async(self, actions): 46 | self.train_steps += 1 47 | if self.config.train_steps == self.train_steps or self.train_steps % self.config.save_model_frequency == 0: 48 | np.save(os.path.join(self.save_dir,"reward_stat.npy"),{'count':self.return_rms.count,'mean':self.return_rms.mean,'var':self.return_rms.var}) 49 | return self.vecenv.step_async(actions) 50 | def get_images(self): 51 | return self.vecenv.get_images() 52 | def close_extras(self): 53 | return self.vecenv.close_extras() 54 | 55 | class ObservationNorm(VecEnv): 56 | def __init__(self,config,vecenv:VecEnv,scale_range=(0.1,10),obs_range=(-5,5),forbidden_keys=[],train=True): 57 | super(ObservationNorm,self).__init__(vecenv.num_envs,vecenv.observation_space,vecenv.action_space) 58 | assert scale_range[0] < scale_range[1], "invalid scale_range." 59 | assert obs_range[0] < obs_range[1], "Invalid reward_range." 
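        # step_wait() below normalizes each observation key as
        #   clip((obs - running_mean) * clip(1 / (running_std + 1e-7), scale_range), obs_range),
        # while keys listed in forbidden_keys are skipped.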
60 | self.config = config 61 | self.scale_range = scale_range 62 | self.obs_range = obs_range 63 | self.forbidden_keys = forbidden_keys 64 | self.vecenv = vecenv 65 | self.obs_rms = Running_MeanStd(space2shape(vecenv.observation_space)) 66 | self.train_steps = 0 67 | self.train = train 68 | self.save_dir = os.path.join(self.config.modeldir,self.config.env_name,self.config.algo_name+"-%d"%self.config.seed) 69 | 70 | 71 | def load_rms(self): 72 | npy_path = os.path.join(self.save_dir,"observation_stat.npy") 73 | if not os.path.exists(npy_path): 74 | return 75 | rms_data = np.load(npy_path,allow_pickle=True).item() 76 | self.obs_rms.count = rms_data['count'] 77 | self.obs_rms.mean = rms_data['mean'] 78 | self.obs_rms.var = rms_data['var'] 79 | 80 | def step_wait(self): 81 | obs,rews,terminals,trunctions,infos = self.vecenv.step_wait() 82 | if self.train: 83 | self.obs_rms.update(obs) 84 | norm_observation = {} 85 | for key,value in zip(obs.keys(),obs.values()): 86 | if key in self.forbidden_keys: 87 | continue 88 | scale_factor = np.clip(1/(self.obs_rms.std[key] + 1e-7),self.scale_range[0],self.scale_range[1]) 89 | norm_observation[key] = np.clip((value - self.obs_rms.mean[key]) * scale_factor,self.obs_range[0],self.obs_range[1]) 90 | return norm_observation,rews,terminals,trunctions,infos 91 | def reset(self): 92 | if self.train == False: 93 | self.load_rms() 94 | return self.vecenv.reset() 95 | def step_async(self, actions): 96 | self.train_steps += 1 97 | if self.config.train_steps == self.train_steps or self.train_steps % self.config.save_model_frequency == 0: 98 | np.save(os.path.join(self.save_dir,"observation_stat.npy"),{'count':self.obs_rms.count,'mean':self.obs_rms.mean,'var':self.obs_rms.var}) 99 | return self.vecenv.step_async(actions) 100 | def get_images(self): 101 | return self.vecenv.get_images() 102 | def close_extras(self): 103 | return self.vecenv.close_extras() 104 | 105 | class ActionNorm(VecEnv): 106 | def __init__(self,vecenv:VecEnv,input_action_range=(-1,1)): 107 | super().__init__(vecenv.num_envs,vecenv.observation_space,vecenv.action_space) 108 | self.vecenv = vecenv 109 | self.input_action_range = input_action_range 110 | assert isinstance(self.action_space,gym.spaces.Box), "Only use the NormActionWrapper for Continuous Action." 111 | def step_async(self, act): 112 | act = np.clip(act,self.input_action_range[0],self.input_action_range[1]) 113 | assert np.min(act) >= self.input_action_range[0] and np.max(act) <= self.input_action_range[1], "input action is out of the defined action range." 
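        # The policy is assumed to emit actions inside input_action_range (default (-1, 1));
        # the lines below rescale them affinely onto the environment's Box bounds:
        #     proportion = (act - in_low) / (in_high - in_low)
        #     env_action = proportion * (high - low) + low
        # e.g. with input_action_range=(-1, 1) and Box(low=-2, high=2), an action of 0.5 maps to 1.0.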
114 | self.action_space_low = self.action_space.low 115 | self.action_space_high = self.action_space.high 116 | self.input_action_low = self.input_action_range[0] 117 | self.input_action_high = self.input_action_range[1] 118 | input_prop = (act - self.input_action_low) / (self.input_action_high - self.input_action_low) 119 | output_action = input_prop * (self.action_space_high - self.action_space_low) + self.action_space_low 120 | return self.vecenv.step_async(output_action) 121 | def step_wait(self): 122 | return self.vecenv.step_wait() 123 | def reset(self): 124 | return self.vecenv.reset() 125 | def get_images(self): 126 | return self.vecenv.get_images() 127 | def close_extras(self): 128 | return self.vecenv.close_extras() 129 | 130 | -------------------------------------------------------------------------------- /xuance/environment/vectorize.py: -------------------------------------------------------------------------------- 1 | from xuance.environment import * 2 | # referenced from openai/baselines 3 | class AlreadySteppingError(Exception): 4 | def __init__(self): 5 | msg = 'already running an async step' 6 | Exception.__init__(self, msg) 7 | class NotSteppingError(Exception): 8 | def __init__(self): 9 | msg = 'not running an async step' 10 | Exception.__init__(self, msg) 11 | 12 | def tile_images(images): 13 | image_nums = len(images) 14 | image_shape = images[0].shape 15 | image_height = image_shape[0] 16 | image_width = image_shape[1] 17 | rows = (image_nums - 1) // 4 + 1 18 | if image_nums >= 4: 19 | cols = 4 20 | else: 21 | cols = image_nums 22 | try: 23 | big_img = np.zeros( 24 | (rows * image_height + 10 * (rows - 1), cols * image_width + 10 * (cols - 1), image_shape[2]), np.uint8) 25 | except IndexError: 26 | big_img = np.zeros((rows * image_height + 10 * (rows - 1), cols * image_width + 10 * (cols - 1)), np.uint8) 27 | for i in range(image_nums): 28 | c = i % 4 29 | r = i // 4 30 | big_img[10 * r + image_height * r:10 * r + image_height * r + image_height, 31 | 10 * c + image_width * c:10 * c + image_width * c + image_width] = images[i] 32 | return big_img 33 | 34 | class VecEnv(ABC): 35 | def __init__(self, num_envs, observation_space, action_space): 36 | self.num_envs = num_envs 37 | self.observation_space = observation_space 38 | self.action_space = action_space 39 | self.closed = False 40 | @abstractmethod 41 | def reset(self): 42 | """ 43 | Reset all the environments and return an array of 44 | observations, or a dict of observation arrays. 45 | If step_async is still doing work, that work will 46 | be cancelled and step_wait() should not be called 47 | until step_async() is invoked again. 48 | """ 49 | pass 50 | @abstractmethod 51 | def step_async(self, actions): 52 | """ 53 | Tell all the environments to start taking a step 54 | with the given actions. 55 | Call step_wait() to get the results of the step. 56 | You should not call this if a step_async run is 57 | already pending. 58 | """ 59 | pass 60 | @abstractmethod 61 | def step_wait(self): 62 | """ 63 | Wait for the step taken with step_async(). 64 | Returns (obs, rews, dones, infos): 65 | - obs: an array of observations, or a dict of 66 | arrays of observations. 
67 | - rews: an array of rewards 68 | - dones: an array of "episode done" booleans 69 | - infos: a sequence of info objects 70 | """ 71 | pass 72 | @abstractmethod 73 | def get_images(self): 74 | """ 75 | Return RGB images from each environment 76 | """ 77 | pass 78 | @abstractmethod 79 | def close_extras(self): 80 | """ 81 | Clean up the extra resources, beyond what's in this base class. 82 | Only runs when not self.closed. 83 | """ 84 | pass 85 | def step(self, actions): 86 | self.step_async(actions) 87 | return self.step_wait() 88 | def render(self, mode): 89 | imgs = self.get_images() 90 | big_img = tile_images(imgs) 91 | if mode == "human": 92 | cv2.imshow("render", cv2.cvtColor(big_img,cv2.COLOR_RGB2BGR)) 93 | cv2.waitKey(1) 94 | elif mode == "rgb_array": 95 | return imgs 96 | else: 97 | raise NotImplementedError 98 | def close(self): 99 | if self.closed == True: 100 | return 101 | self.close_extras() 102 | self.closed = True 103 | 104 | class DummyVecEnv(VecEnv): 105 | """ 106 | VecEnv that does runs multiple environments sequentially, that is, 107 | the step and reset commands are send to one environment at a time. 108 | Useful when debugging and when num_env == 1 (in the latter case, 109 | avoids communication overhead) 110 | """ 111 | def __init__(self, envs): 112 | self.waiting = False 113 | self.closed = False 114 | self.envs = envs 115 | env = self.envs[0] 116 | VecEnv.__init__(self, len(envs), env.observation_space, env.action_space) 117 | self.obs_shape = space2shape(self.observation_space) 118 | if isinstance(self.observation_space, gym.spaces.Dict): 119 | self.buf_obs = {k: np.zeros(combined_shape(self.num_envs, v)) for k, v in 120 | zip(self.obs_shape.keys(), self.obs_shape.values())} 121 | else: 122 | self.buf_obs = np.zeros(combined_shape(self.num_envs, self.obs_shape), dtype=np.float32) 123 | self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool) 124 | self.buf_trunctions = np.zeros((self.num_envs,), dtype=np.bool) 125 | self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32) 126 | self.buf_infos = [{} for _ in range(self.num_envs)] 127 | self.actions = None 128 | 129 | def reset(self): 130 | for e in range(self.num_envs): 131 | obs,info = self.envs[e].reset() 132 | self._save_obs(e, obs) 133 | self.buf_infos[e] = info 134 | return copy.deepcopy(self.buf_obs),self.buf_infos.copy() 135 | 136 | def step_async(self, actions): 137 | if self.waiting == True: 138 | raise AlreadySteppingError 139 | listify = True 140 | try: 141 | if len(actions) == self.num_envs: 142 | listify = False 143 | except TypeError: 144 | pass 145 | if listify == False: 146 | self.actions = actions 147 | else: 148 | assert self.num_envs == 1, "actions {} is either not a list or has a wrong size - cannot match to {} environments".format( 149 | actions, self.num_envs) 150 | self.actions = [actions] 151 | self.waiting = True 152 | 153 | def step_wait(self): 154 | if self.waiting == False: 155 | raise NotSteppingError 156 | for e in range(self.num_envs): 157 | action = self.actions[e] 158 | obs, self.buf_rews[e], self.buf_dones[e], self.buf_trunctions[e], self.buf_infos[e] = self.envs[e].step(action) 159 | if self.buf_dones[e] or self.buf_trunctions[e]: 160 | obs,_ = self.envs[e].reset() 161 | self._save_obs(e, obs) 162 | self.waiting = False 163 | return copy.deepcopy(self.buf_obs), self.buf_rews.copy(), self.buf_dones.copy(), self.buf_trunctions.copy(), self.buf_infos.copy() 164 | 165 | def close_extras(self): 166 | self.closed = True 167 | for env in self.envs: 168 | env.close() 169 | def 
get_images(self): 170 | return [env.render() for env in self.envs] 171 | def render(self,mode): 172 | return super().render(mode) 173 | 174 | # save observation of indexes of e environment 175 | def _save_obs(self, e, obs): 176 | if isinstance(self.observation_space,gym.spaces.Dict): 177 | for k in self.obs_shape.keys(): 178 | self.buf_obs[k][e] = obs[k] 179 | else: 180 | self.buf_obs[e] = obs 181 | 182 | 183 | 184 | 185 | 186 | -------------------------------------------------------------------------------- /xuance/agent/ddpg.py: -------------------------------------------------------------------------------- 1 | from xuance.agent import * 2 | class DDPG_Agent: 3 | def __init__(self, 4 | config, 5 | environment, 6 | policy, 7 | learner): 8 | self.config = config 9 | self.environment = environment 10 | self.policy = policy 11 | self.learner = learner 12 | self.nenvs = environment.num_envs 13 | self.nsize = config.nsize 14 | self.minibatch = config.minibatch 15 | self.gamma = config.gamma 16 | 17 | self.input_shape = self.policy.input_shape 18 | self.action_shape = self.environment.action_space.shape 19 | self.output_shape = self.policy.output_shape 20 | 21 | self.start_noise = config.start_noise 22 | self.end_noise = config.end_noise 23 | self.noise = self.start_noise 24 | 25 | self.start_training_size = config.start_training_size 26 | self.training_frequency = config.training_frequency 27 | self.memory = DummyOffPolicyBuffer(self.input_shape, 28 | self.action_shape, 29 | self.output_shape, 30 | self.nenvs, 31 | self.nsize, 32 | self.minibatch) 33 | self.logger = self.config.logger 34 | self.summary = self.learner.summary 35 | self.train_episodes = np.zeros((self.nenvs,),np.int32) 36 | self.train_steps = 0 37 | 38 | if self.logger=="wandb": 39 | wandb.define_metric("train-steps") 40 | wandb.define_metric("train-rewards/*",step_metric="train-steps") 41 | wandb.define_metric("evaluate-steps") 42 | wandb.define_metric("evaluate-rewards/*",step_metric="evaluate-steps") 43 | 44 | 45 | def interact(self,inputs,noise): 46 | outputs,action,_ = self.policy(inputs) 47 | action = action.detach().cpu().numpy() 48 | action = action + np.random.normal(size=action.shape)*noise 49 | for key,value in zip(outputs.keys(),outputs.values()): 50 | outputs[key] = value.detach().cpu().numpy() 51 | return outputs,np.clip(action,-1,1) 52 | 53 | def train(self,train_steps:int=10000): 54 | obs,infos = self.environment.reset() 55 | for _ in tqdm(range(train_steps)): 56 | outputs,actions = self.interact(obs,self.noise) 57 | if self.train_steps < self.config.start_training_size: 58 | actions = [self.environment.action_space.sample() for i in range(self.nenvs)] 59 | next_obs,rewards,terminals,trunctions,infos = self.environment.step(actions) 60 | store_next_obs = next_obs.copy() 61 | for i in range(self.nenvs): 62 | if trunctions[i]: 63 | for key in infos[i].keys(): 64 | if key in store_next_obs.keys(): 65 | store_next_obs[key][i] = infos[i][key] 66 | self.memory.store(obs,actions,outputs,rewards,terminals,store_next_obs) 67 | if self.memory.size >= self.start_training_size and self.train_steps % self.training_frequency == 0: 68 | input_batch,action_batch,output_batch,reward_batch,terminal_batch,next_input_batch = self.memory.sample() 69 | self.learner.update(input_batch,action_batch,reward_batch,terminal_batch,next_input_batch) 70 | for i in range(self.nenvs): 71 | if terminals[i] or trunctions[i]: 72 | self.train_episodes[i] += 1 73 | if self.logger == "tensorboard": 74 | 
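                    # When an environment finishes an episode, its score from
                    # infos[i]['episode_score'] is logged against the global environment-step
                    # count (train_steps * nenvs), either as a TensorBoard scalar group or as
                    # a wandb metric keyed by the env index.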
self.summary.add_scalars("train-rewards-steps",{"env-%d"%i:infos[i]['episode_score']},self.train_steps*self.nenvs) 75 | else: 76 | wandb.log({f"train-rewards/{i}":infos[i]['episode_score'],'train-steps':self.train_steps*self.nenvs}) 77 | obs = next_obs 78 | self.train_steps += 1 79 | self.noise = self.noise - (self.start_noise-self.end_noise)/self.config.train_steps 80 | 81 | def test(self,test_environment,test_episode=10,render=False): 82 | obs,infos = test_environment.reset() 83 | current_episode = 0 84 | scores = [] 85 | images = [[] for i in range(test_environment.num_envs)] 86 | best_score = -np.inf 87 | 88 | while current_episode < test_episode: 89 | if render: 90 | test_environment.render("human") 91 | else: 92 | render_images = test_environment.render('rgb_array') 93 | for index,img in enumerate(render_images): 94 | images[index].append(img) 95 | outputs,actions = self.interact(obs,0) 96 | next_obs,rewards,terminals,trunctions,infos = test_environment.step(actions) 97 | for i in range(test_environment.num_envs): 98 | if terminals[i] == True or trunctions[i] == True: 99 | if self.logger == 'tensorboard': 100 | self.summary.add_scalars("evaluate-score",{"episode-%d"%current_episode:infos[i]['episode_score']},self.train_steps*self.nenvs) 101 | else: 102 | wandb.log({f"evaluate-rewards/{current_episode}":infos[i]['episode_score'],'evaluate-steps':self.train_steps*self.nenvs}) 103 | if infos[i]['episode_score'] > best_score: 104 | episode_images = images[i].copy() 105 | best_score = infos[i]['episode_score'] 106 | 107 | scores.append(infos[i]['episode_score']) 108 | images[i].clear() 109 | current_episode += 1 110 | obs = next_obs 111 | 112 | print("[%s] Training Steps:%.2f K, Evaluate Episodes:%d, Score Average:%f, Std:%f"%(get_time_hm(), self.train_steps*self.nenvs/1000, 113 | test_episode,np.mean(scores),np.std(scores))) 114 | return scores,episode_images 115 | 116 | def benchmark(self,test_environment,train_steps:int=10000,evaluate_steps:int=10000,test_episode=10,render=False,save_best_model=True): 117 | import time 118 | epoch = int(train_steps / evaluate_steps) + 1 119 | 120 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 121 | benchmark_scores = [] 122 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 123 | 124 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 125 | best_std_score = np.std(benchmark_scores[-1]['scores']) 126 | best_video = evaluate_video 127 | 128 | for i in range(epoch): 129 | if i == epoch - 1: 130 | train_step = train_steps - (i*evaluate_steps) 131 | else: 132 | train_step = evaluate_steps 133 | self.train(train_step) 134 | 135 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 136 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 137 | 138 | if np.mean(benchmark_scores[-1]['scores']) > best_average_score: 139 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 140 | best_std_score = np.std(benchmark_scores[-1]['scores']) 141 | best_video = evaluate_video 142 | if save_best_model == True: 143 | model_path = self.learner.modeldir + "/best_model.pth" 144 | torch.save(self.policy.state_dict(), model_path) 145 | 146 | if not render: 147 | # show the best performance video demo on web browser 148 | video_arr = np.array(best_video,dtype=np.uint8).transpose(0,3,1,2) 149 | if self.logger == "tensorboard": 150 | 
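            # Frames of the best evaluation episode are stacked into a (T, C, H, W) uint8
            # array; TensorBoard's add_video wants an extra leading batch dimension (hence
            # the unsqueeze(0) below), while wandb.Video consumes the (T, C, H, W) array
            # directly and renders it as a gif.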
self.summary.add_video("video",torch.as_tensor(video_arr,dtype=torch.uint8).unsqueeze(0),fps=50,global_step=self.nenvs*self.train_steps) 151 | else: 152 | wandb.log({"video":wandb.Video(video_arr,fps=50,format='gif')},step=self.nenvs*self.train_steps) 153 | 154 | 155 | np.save(self.learner.logdir+"/benchmark_%s.npy"%get_time_full(), benchmark_scores) 156 | print("Best Model score = %f, std = %f"%(best_average_score,best_std_score)) -------------------------------------------------------------------------------- /xuance/agent/td3.py: -------------------------------------------------------------------------------- 1 | from xuance.agent import * 2 | class TD3_Agent: 3 | def __init__(self, 4 | config, 5 | environment, 6 | policy, 7 | learner): 8 | self.config = config 9 | self.environment = environment 10 | self.policy = policy 11 | self.learner = learner 12 | self.nenvs = environment.num_envs 13 | self.nsize = config.nsize 14 | self.minibatch = config.minibatch 15 | self.gamma = config.gamma 16 | 17 | self.input_shape = self.policy.input_shape 18 | self.action_shape = self.environment.action_space.shape 19 | self.output_shape = self.policy.output_shape 20 | 21 | self.start_noise = config.start_noise 22 | self.end_noise = config.end_noise 23 | self.noise = self.start_noise 24 | 25 | self.start_training_size = config.start_training_size 26 | self.training_frequency = config.training_frequency 27 | self.memory = DummyOffPolicyBuffer(self.input_shape, 28 | self.action_shape, 29 | self.output_shape, 30 | self.nenvs, 31 | self.nsize, 32 | self.minibatch) 33 | 34 | self.logger = self.config.logger 35 | self.summary = self.learner.summary 36 | self.train_episodes = np.zeros((self.nenvs,),np.int32) 37 | self.train_steps = 0 38 | 39 | if self.logger=="wandb": 40 | wandb.define_metric("train-steps") 41 | wandb.define_metric("train-rewards/*",step_metric="train-steps") 42 | wandb.define_metric("evaluate-steps") 43 | wandb.define_metric("evaluate-rewards/*",step_metric="evaluate-steps") 44 | 45 | def interact(self,inputs,noise): 46 | outputs,action,_ = self.policy(inputs) 47 | action = action.detach().cpu().numpy() 48 | action = action + np.random.normal(size=action.shape)*noise 49 | for key,value in zip(outputs.keys(),outputs.values()): 50 | outputs[key] = value.detach().cpu().numpy() 51 | return outputs,np.clip(action,-1,1) 52 | 53 | def train(self,train_steps:int=10000): 54 | obs,infos = self.environment.reset() 55 | for step in tqdm(range(train_steps)): 56 | outputs,actions = self.interact(obs,self.noise) 57 | if self.train_steps < self.config.start_training_size: 58 | actions = [self.environment.action_space.sample() for i in range(self.nenvs)] 59 | next_obs,rewards,terminals,trunctions,infos = self.environment.step(actions) 60 | store_next_obs = next_obs.copy() 61 | for i in range(self.nenvs): 62 | if trunctions[i]: 63 | for key in infos[i].keys(): 64 | if key in store_next_obs.keys(): 65 | store_next_obs[key][i] = infos[i][key] 66 | self.memory.store(obs,actions,outputs,rewards,terminals,store_next_obs) 67 | if self.memory.size >= self.start_training_size and self.train_steps % self.training_frequency == 0: 68 | input_batch,action_batch,output_batch,reward_batch,terminal_batch,next_input_batch = self.memory.sample() 69 | self.learner.update(input_batch,action_batch,reward_batch,terminal_batch,next_input_batch) 70 | for i in range(self.nenvs): 71 | if terminals[i] or trunctions[i]: 72 | self.train_episodes[i] += 1 73 | if self.logger == "tensorboard": 74 | 
self.summary.add_scalars("train-rewards-steps",{"env-%d"%i:infos[i]['episode_score']},self.train_steps*self.nenvs) 75 | else: 76 | wandb.log({f"train-rewards/{i}":infos[i]['episode_score'],'train-steps':self.train_steps*self.nenvs}) 77 | obs = next_obs 78 | self.train_steps += 1 79 | self.noise = self.noise - (self.start_noise-self.end_noise)/self.config.train_steps 80 | 81 | def test(self,test_environment,test_episode=10,render=False): 82 | obs,infos = test_environment.reset() 83 | current_episode = 0 84 | scores = [] 85 | images = [[] for i in range(test_environment.num_envs)] 86 | best_score = -np.inf 87 | 88 | while current_episode < test_episode: 89 | if render: 90 | test_environment.render("human") 91 | else: 92 | render_images = test_environment.render('rgb_array') 93 | for index,img in enumerate(render_images): 94 | images[index].append(img) 95 | outputs,actions = self.interact(obs,0) 96 | next_obs,rewards,terminals,trunctions,infos = test_environment.step(actions) 97 | for i in range(test_environment.num_envs): 98 | if terminals[i] == True or trunctions[i] == True: 99 | if self.logger == 'tensorboard': 100 | self.summary.add_scalars("evaluate-score",{"episode-%d"%current_episode:infos[i]['episode_score']},self.train_steps*self.nenvs) 101 | else: 102 | wandb.log({f"evaluate-rewards/{current_episode}":infos[i]['episode_score'],'evaluate-steps':self.train_steps*self.nenvs}) 103 | 104 | if infos[i]['episode_score'] > best_score: 105 | episode_images = images[i].copy() 106 | best_score = infos[i]['episode_score'] 107 | 108 | scores.append(infos[i]['episode_score']) 109 | images[i] = [] 110 | current_episode += 1 111 | obs = next_obs 112 | 113 | print("[%s] Training Steps:%.2f K, Evaluate Episodes:%d, Score Average:%f, Std:%f"%(get_time_hm(), self.train_steps*self.nenvs/1000, 114 | test_episode,np.mean(scores),np.std(scores))) 115 | return scores,episode_images 116 | 117 | def benchmark(self,test_environment,train_steps:int=10000,evaluate_steps:int=10000,test_episode=10,render=False,save_best_model=True): 118 | epoch = int(train_steps / evaluate_steps) + 1 119 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 120 | benchmark_scores = [] 121 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 122 | 123 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 124 | best_std_score = np.std(benchmark_scores[-1]['scores']) 125 | best_video = evaluate_video 126 | 127 | for i in range(epoch): 128 | if i == epoch - 1: 129 | train_step = train_steps - (i*evaluate_steps) 130 | else: 131 | train_step = evaluate_steps 132 | self.train(train_step) 133 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 134 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 135 | 136 | if np.mean(benchmark_scores[-1]['scores']) > best_average_score: 137 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 138 | best_std_score = np.std(benchmark_scores[-1]['scores']) 139 | best_video = evaluate_video 140 | if save_best_model == True: 141 | model_path = self.learner.modeldir + "/best_model.pth" 142 | torch.save(self.policy.state_dict(), model_path) 143 | 144 | if not render: 145 | # show the best performance video demo on web browser 146 | video_arr = np.array(best_video,dtype=np.uint8).transpose(0,3,1,2) 147 | if self.logger == "tensorboard": 148 | self.summary.add_video("video",torch.as_tensor(video_arr,dtype=torch.uint8).unsqueeze(0),fps=50,global_step=self.nenvs*self.train_steps) 
149 | else: 150 | wandb.log({"video":wandb.Video(video_arr,fps=50,format='gif')},step=self.nenvs*self.train_steps) 151 | 152 | np.save(self.learner.logdir+"/benchmark_%s.npy"%get_time_full(), benchmark_scores) 153 | print("Best Model score = %f, std = %f"%(best_average_score,best_std_score)) 154 | 155 | -------------------------------------------------------------------------------- /xuance/agent/a2c.py: -------------------------------------------------------------------------------- 1 | from xuance.agent import * 2 | class A2C_Agent: 3 | def __init__(self, 4 | config, 5 | environment, 6 | policy, 7 | learner): 8 | self.config = config 9 | self.environment = environment 10 | self.policy = policy 11 | self.learner = learner 12 | self.nenvs = environment.num_envs 13 | self.nsize = config.nsize 14 | self.nminibatch = config.nminibatch 15 | self.nepoch = config.nepoch 16 | self.gamma = config.gamma 17 | self.tdlam = config.tdlam 18 | self.input_shape = self.policy.input_shape 19 | self.action_shape = self.environment.action_space.shape 20 | self.output_shape = self.policy.output_shape 21 | self.memory = DummyOnPolicyBuffer(self.input_shape, 22 | self.action_shape, 23 | self.output_shape, 24 | self.nenvs, 25 | self.nsize, 26 | self.nminibatch, 27 | self.gamma, 28 | self.tdlam) 29 | self.logger = self.config.logger 30 | self.summary = self.learner.summary 31 | self.train_episodes = np.zeros((self.nenvs,),np.int32) 32 | self.train_steps = 0 33 | 34 | if self.logger=="wandb": 35 | wandb.define_metric("train-steps") 36 | wandb.define_metric("train-rewards/*",step_metric="train-steps") 37 | wandb.define_metric("evaluate-steps") 38 | wandb.define_metric("evaluate-rewards/*",step_metric="evaluate-steps") 39 | 40 | 41 | def interact(self,inputs,training=True): 42 | outputs,dist,v = self.policy(inputs) 43 | if training: 44 | action = dist.sample().detach().cpu().numpy() 45 | else: 46 | action = dist.deterministic().detach().cpu().numpy() 47 | v = v.detach().cpu().numpy() 48 | for key,value in zip(outputs.keys(),outputs.values()): 49 | outputs[key] = value.detach().cpu().numpy() 50 | return outputs,action,v 51 | 52 | def train(self,train_steps:int=10000): 53 | obs,infos = self.environment.reset() 54 | for _ in tqdm(range(train_steps)): 55 | outputs,actions,pred_values = self.interact(obs) 56 | next_obs,rewards,terminals,trunctions,infos = self.environment.step(actions) 57 | self.memory.store(obs,actions,outputs,rewards,pred_values) 58 | for i in range(self.nenvs): 59 | if terminals[i] == True: 60 | self.memory.finish_path(0,i) 61 | elif trunctions[i] == True: 62 | real_next_observation = infos[i]['next_observation'] 63 | for key in real_next_observation.keys(): 64 | real_next_observation[key] = real_next_observation[key][np.newaxis,:] 65 | _,_,truncate_value = self.interact(real_next_observation) 66 | self.memory.finish_path(truncate_value[0],i) 67 | if self.memory.full: 68 | _,_,next_pred_values = self.interact(next_obs) 69 | for i in range(self.nenvs): 70 | self.memory.finish_path(next_pred_values[i]*(1-terminals[i]),i) 71 | for _ in range(self.nminibatch * self.nepoch): 72 | input_batch,action_batch,output_batch,return_batch,advantage_batch = self.memory.sample() 73 | self.learner.update(input_batch,action_batch,return_batch,advantage_batch) 74 | self.memory.clear() 75 | for i in range(self.nenvs): 76 | if terminals[i] or trunctions[i]: 77 | self.train_episodes[i] += 1 78 | if self.logger == "tensorboard": 79 | 
self.summary.add_scalars("train-rewards-steps",{"env-%d"%i:infos[i]['episode_score']},self.train_steps*self.nenvs) 80 | else: 81 | wandb.log({f"train-rewards/{i}":infos[i]['episode_score'],'train-steps':self.train_steps*self.nenvs}) 82 | 83 | obs = next_obs 84 | self.train_steps += 1 85 | 86 | def test(self,test_environment,test_episode=10,render=False): 87 | obs,infos = test_environment.reset() 88 | current_episode = 0 89 | scores = [] 90 | images = [[] for i in range(test_environment.num_envs)] 91 | best_score = -np.inf 92 | 93 | while current_episode < test_episode: 94 | if render: 95 | test_environment.render("human") 96 | else: 97 | render_images = test_environment.render('rgb_array') 98 | for index,img in enumerate(render_images): 99 | images[index].append(img) 100 | outputs,actions,pred_values = self.interact(obs,False) 101 | next_obs,rewards,terminals,trunctions,infos = test_environment.step(actions) 102 | for i in range(test_environment.num_envs): 103 | if terminals[i] == True or trunctions[i] == True: 104 | if self.logger == 'tensorboard': 105 | self.summary.add_scalars("evaluate-rewards",{"episode-%d"%current_episode:infos[i]['episode_score']},self.train_steps*self.nenvs) 106 | else: 107 | wandb.log({f"evaluate-rewards/{current_episode}":infos[i]['episode_score'],'evaluate-steps':self.train_steps*self.nenvs}) 108 | if infos[i]['episode_score'] > best_score: 109 | episode_images = images[i].copy() 110 | best_score = infos[i]['episode_score'] 111 | scores.append(infos[i]['episode_score']) 112 | images[i].clear() 113 | current_episode += 1 114 | obs = next_obs 115 | print("[%s] Training Steps:%.2f K, Evaluate Episodes:%d, Score Average:%f, Std:%f"%(get_time_hm(), self.train_steps*self.nenvs/1000, 116 | test_episode,np.mean(scores),np.std(scores))) 117 | return scores,episode_images 118 | 119 | def benchmark(self,test_environment,train_steps:int=10000,evaluate_steps:int=10000,test_episode=10,render=False,save_best_model=True): 120 | 121 | epoch = int(train_steps / evaluate_steps) + 1 122 | 123 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 124 | benchmark_scores = [] 125 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 126 | 127 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 128 | best_std_score = np.std(benchmark_scores[-1]['scores']) 129 | best_video = evaluate_video 130 | 131 | for i in range(epoch): 132 | if i == epoch - 1: 133 | train_step = train_steps - (i*evaluate_steps) 134 | else: 135 | train_step = evaluate_steps 136 | self.train(train_step) 137 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 138 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 139 | 140 | if np.mean(benchmark_scores[-1]['scores']) > best_average_score: 141 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 142 | best_std_score = np.std(benchmark_scores[-1]['scores']) 143 | best_video = evaluate_video 144 | if save_best_model == True: 145 | model_path = self.learner.modeldir + "/best_model.pth" 146 | torch.save(self.policy.state_dict(), model_path) 147 | 148 | if not render: 149 | # show the best performance video demo on web browser 150 | video_arr = np.array(best_video,dtype=np.uint8).transpose(0,3,1,2) 151 | if self.logger == "tensorboard": 152 | self.summary.add_video("video",torch.as_tensor(video_arr,dtype=torch.uint8).unsqueeze(0),fps=50,global_step=self.nenvs*self.train_steps) 153 | else: 154 | 
wandb.log({"video":wandb.Video(video_arr,fps=50,format='gif')},step=self.nenvs*self.train_steps) 155 | 156 | 157 | np.save(self.learner.logdir+"/benchmark_%s.npy"%get_time_full(), benchmark_scores) 158 | print("Best Model score = %f, std = %f"%(best_average_score,best_std_score)) -------------------------------------------------------------------------------- /xuance/agent/dqn.py: -------------------------------------------------------------------------------- 1 | from xuance.agent import * 2 | class DQN_Agent: 3 | def __init__(self, 4 | config, 5 | environment, 6 | policy, 7 | learner): 8 | self.config = config 9 | self.environment = environment 10 | self.policy = policy 11 | self.learner = learner 12 | self.nenvs = environment.num_envs 13 | self.nsize = config.nsize 14 | self.minibatch = config.minibatch 15 | self.gamma = config.gamma 16 | self.input_shape = self.policy.input_shape 17 | self.action_shape = self.environment.action_space.shape 18 | self.output_shape = self.policy.output_shape 19 | self.start_egreedy = config.start_egreedy 20 | self.end_egreedy = config.end_egreedy 21 | self.egreedy = self.start_egreedy 22 | 23 | self.start_training_size = config.start_training_size 24 | self.training_frequency = config.training_frequency 25 | self.memory = DummyOffPolicyBuffer(self.input_shape, 26 | self.action_shape, 27 | self.output_shape, 28 | self.nenvs, 29 | self.nsize, 30 | self.minibatch) 31 | 32 | self.logger = self.config.logger 33 | self.summary = self.learner.summary 34 | self.train_episodes = np.zeros((self.nenvs,),np.int32) 35 | self.train_steps = 0 36 | 37 | if self.logger=="wandb": 38 | wandb.define_metric("steps") 39 | wandb.define_metric("train-rewards/*",step_metric="steps") 40 | wandb.define_metric("evaluate-rewards/*",step_metric="steps") 41 | 42 | def interact(self,inputs,egreedy): 43 | outputs,evalQ,_ = self.policy(inputs) 44 | argmax_action = evalQ.argmax(dim=-1) 45 | random_action = np.random.choice(self.environment.action_space.n,self.nenvs) 46 | if np.random.rand() < egreedy: 47 | action = random_action 48 | else: 49 | action = argmax_action.detach().cpu().numpy() 50 | for key,value in zip(outputs.keys(),outputs.values()): 51 | outputs[key] = value.detach().cpu().numpy() 52 | return outputs,action 53 | 54 | def train(self,train_steps:int=10000): 55 | obs,infos = self.environment.reset() 56 | for _ in tqdm(range(train_steps)): 57 | outputs,actions = self.interact(obs,self.egreedy) 58 | next_obs,rewards,terminals,trunctions,infos = self.environment.step(actions) 59 | store_next_obs = next_obs.copy() 60 | 61 | for i in range(self.nenvs): 62 | if trunctions[i]: 63 | for key in infos[i].keys(): 64 | if key in store_next_obs.keys(): 65 | store_next_obs[key][i] = infos[i][key] 66 | self.memory.store(obs,actions,outputs,rewards,terminals,store_next_obs) 67 | 68 | if self.memory.size >= self.start_training_size and self.train_steps % self.training_frequency == 0: 69 | input_batch,action_batch,output_batch,reward_batch,terminal_batch,next_input_batch = self.memory.sample() 70 | self.learner.update(input_batch,action_batch,reward_batch,terminal_batch,next_input_batch) 71 | 72 | for i in range(self.nenvs): 73 | if terminals[i] or trunctions[i]: 74 | self.train_episodes[i] += 1 75 | if self.logger == "tensorboard": 76 | self.summary.add_scalars("train-rewards-steps",{"env-%d"%i:infos[i]['episode_score']},self.train_steps*self.nenvs) 77 | else: 78 | wandb.log({f"train-rewards/{i}":infos[i]['episode_score'],'steps':self.train_steps*self.nenvs}) 79 | 80 | 81 | obs = next_obs 82 | 
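            # epsilon-greedy exploration is annealed linearly: each interaction step lowers
            # egreedy by (start_egreedy - end_egreedy) / config.train_steps, so exploration
            # reaches end_egreedy after roughly config.train_steps steps of training.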
self.train_steps += 1 83 | self.egreedy = self.egreedy - (self.start_egreedy-self.end_egreedy)/self.config.train_steps 84 | 85 | def test(self,test_environment,test_episode=10,render=False): 86 | obs,infos = test_environment.reset() 87 | current_episode = 0 88 | scores = [] 89 | images = [[] for i in range(test_environment.num_envs)] 90 | best_score = -np.inf 91 | 92 | while current_episode < test_episode: 93 | if render: 94 | test_environment.render("human") 95 | else: 96 | render_images = test_environment.render('rgb_array') 97 | for index,img in enumerate(render_images): 98 | images[index].append(img) 99 | outputs,actions = self.interact(obs,0) 100 | next_obs,rewards,terminals,trunctions,infos = test_environment.step(actions) 101 | for i in range(test_environment.num_envs): 102 | if terminals[i] == True or trunctions[i] == True: 103 | if self.logger == 'tensorboard': 104 | self.summary.add_scalars("evaluate-score",{"episode-%d"%current_episode:infos[i]['episode_score']},self.train_steps*self.nenvs) 105 | else: 106 | wandb.log({f"evaluate-rewards/{current_episode}":infos[i]['episode_score'],'steps':self.train_steps*self.nenvs}) 107 | 108 | if infos[i]['episode_score'] > best_score: 109 | episode_images = images[i].copy() 110 | best_score = infos[i]['episode_score'] 111 | 112 | scores.append(infos[i]['episode_score']) 113 | images[i] = [] 114 | current_episode += 1 115 | obs = next_obs 116 | 117 | if self.logger == "wandb": 118 | wandb.log({"video":wandb.Video(np.array(episode_images[0],dtype=np.uint8).transpose(0,3,1,2),fps=50,format='gif'), 119 | 'train-step':self.train_steps*self.nenvs}) 120 | print("[%s] Training Steps:%.2f K, Evaluate Episodes:%d, Score Average:%f, Std:%f"%(get_time_hm(), self.train_steps*self.nenvs/1000, 121 | test_episode,np.mean(scores),np.std(scores))) 122 | return scores,episode_images 123 | 124 | def benchmark(self,test_environment,train_steps:int=10000,evaluate_steps:int=10000,test_episode=10,render=False,save_best_model=True): 125 | import time 126 | epoch = int(train_steps / evaluate_steps) + 1 127 | 128 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 129 | benchmark_scores = [] 130 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 131 | 132 | 133 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 134 | best_std_score = np.std(benchmark_scores[-1]['scores']) 135 | best_video = evaluate_video 136 | 137 | for i in range(epoch): 138 | if i == epoch - 1: 139 | train_step = train_steps - (i*evaluate_steps) 140 | else: 141 | train_step = evaluate_steps 142 | self.train(train_step) 143 | 144 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 145 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 146 | 147 | 148 | if np.mean(benchmark_scores[-1]['scores']) > best_average_score: 149 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 150 | best_std_score = np.std(benchmark_scores[-1]['scores']) 151 | best_video = evaluate_video 152 | if save_best_model == True: 153 | model_path = self.learner.modeldir + "/best_model.pth" 154 | torch.save(self.policy.state_dict(), model_path) 155 | 156 | if not render: 157 | # show the best performance video demo on web browser 158 | video_arr = np.array(best_video,dtype=np.uint8).transpose(0,3,1,2) 159 | if self.logger == "tensorboard": 160 | self.summary.add_video("video",torch.as_tensor(video_arr,dtype=torch.uint8).unsqueeze(0),fps=50,global_step=self.nenvs*self.train_steps) 161 | else: 
162 | wandb.log({"video":wandb.Video(video_arr,fps=50,format='gif')},step=self.nenvs*self.train_steps) 163 | 164 | np.save(self.learner.logdir+"/benchmark_%s.npy"%get_time_full(), benchmark_scores) 165 | print("Best Model score = %f, std = %f"%(best_average_score,best_std_score)) 166 | -------------------------------------------------------------------------------- /xuance/utils/memory.py: -------------------------------------------------------------------------------- 1 | from venv import create 2 | import numpy as np 3 | from .common import discount_cumsum 4 | from typing import Optional, Union, Sequence 5 | 6 | def create_memory(shape: Optional[Union[tuple, dict]], nenvs: int, nsize: int): 7 | if shape == None: 8 | return None 9 | elif isinstance(shape, dict): 10 | memory = {} 11 | for key, value in zip(shape.keys(), shape.values()): 12 | if value is None: # save an object type 13 | if nenvs == 0: 14 | memory[key] = np.zeros([nsize], dtype=object) 15 | else: 16 | memory[key] = np.zeros([nenvs,nsize], dtype=object) 17 | else: 18 | if nenvs == 0: 19 | memory[key] = np.zeros([nsize] + list(value), dtype=np.float32) 20 | else: 21 | memory[key] = np.zeros([nenvs, nsize] + list(value), dtype=np.float32) 22 | return memory 23 | elif isinstance(shape, tuple): 24 | if nenvs == 0: 25 | return np.zeros([nsize] + list(shape), np.float32) 26 | else: 27 | return np.zeros([nenvs, nsize] + list(shape), np.float32) 28 | else: 29 | raise NotImplementedError 30 | 31 | def store_element(data: Optional[Union[np.ndarray, dict, float]], memory: Union[dict, np.ndarray], ptr: int): 32 | if data is None: 33 | return 34 | elif isinstance(data, dict): 35 | for key, value in zip(data.keys(), data.values()): 36 | memory[key][:, ptr] = data[key] 37 | else: 38 | memory[:, ptr] = data 39 | 40 | def store_batch_element(data: Optional[Union[np.ndarray, dict, float]], memory: Union[dict, np.ndarray], ptr: int): 41 | if data is None: 42 | return 43 | elif isinstance(data,dict): 44 | for key,value in zip(data.keys(),data.values()): 45 | memory[key][ptr:ptr+value.shape[0]] = value 46 | else: 47 | memory[ptr:ptr+data.shape[0]] = data 48 | 49 | def sample_batch(memory: Optional[Union[np.ndarray, dict]], index: np.ndarray): 50 | if memory is None: 51 | return None 52 | elif isinstance(memory, dict): 53 | batch = {} 54 | for key, value in zip(memory.keys(), memory.values()): 55 | batch[key] = value[index] 56 | return batch 57 | else: 58 | return memory[index] 59 | 60 | class DummyOnPolicyBuffer: 61 | def __init__(self, 62 | input_shape: dict, 63 | action_shape: tuple, 64 | output_shape: dict, 65 | nenvs: int, 66 | nsize: int, 67 | nminibatch: int, 68 | gamma: float=0.99, 69 | tdlam: float=0.95): 70 | self.input_shape,self.action_shape,self.output_shape = input_shape,action_shape,output_shape 71 | self.size,self.ptr = 0,0 72 | self.nenvs,self.nsize,self.nminibatch = nenvs,nsize,nminibatch 73 | self.gamma,self.tdlam = gamma,tdlam 74 | self.start_ids = np.zeros(self.nenvs,np.int32) 75 | self.inputs = create_memory(input_shape,nenvs,nsize) 76 | self.actions = create_memory(action_shape,nenvs,nsize) 77 | self.outputs = create_memory(output_shape,nenvs,nsize) 78 | self.rewards = create_memory((),self.nenvs,self.nsize) 79 | self.returns = create_memory((),self.nenvs,self.nsize) 80 | self.advantages = create_memory((),self.nenvs,self.nsize) 81 | @property 82 | def full(self): 83 | return self.size >= self.nsize 84 | def clear(self): 85 | self.size,self.ptr = 0,0 86 | self.start_ids = np.zeros(self.nenvs,np.int32) 87 | self.inputs = 
create_memory(self.input_shape,self.nenvs,self.nsize) 88 | self.actions = create_memory(self.action_shape,self.nenvs,self.nsize) 89 | self.outputs = create_memory(self.output_shape,self.nenvs,self.nsize) 90 | self.rewards = create_memory((),self.nenvs,self.nsize) 91 | self.returns = create_memory((),self.nenvs,self.nsize) 92 | self.advantages = create_memory((),self.nenvs,self.nsize) 93 | 94 | def store(self,input,action,output,reward,value): 95 | store_element(input,self.inputs,self.ptr) 96 | store_element(action,self.actions,self.ptr) 97 | store_element(output,self.outputs,self.ptr) 98 | store_element(reward,self.rewards,self.ptr) 99 | store_element(value,self.returns,self.ptr) 100 | self.ptr = (self.ptr + 1) % self.nsize 101 | self.size = min(self.size+1,self.nsize) 102 | 103 | def finish_path(self, val, i): 104 | if self.full: 105 | path_slice = np.arange(self.start_ids[i], self.nsize).astype(np.int32) 106 | else: 107 | path_slice = np.arange(self.start_ids[i], self.ptr).astype(np.int32) 108 | rewards = np.append(np.array(self.rewards[i, path_slice]), [val], axis=0) 109 | critics = np.append(np.array(self.returns[i, path_slice]), [val], axis=0) 110 | returns = discount_cumsum(rewards, self.gamma)[:-1] 111 | deltas = rewards[:-1] + self.gamma * critics[1:] - critics[:-1] 112 | advantages = discount_cumsum(deltas, self.gamma * self.tdlam) 113 | self.returns[i, path_slice] = returns 114 | self.advantages[i, path_slice] = advantages 115 | self.start_ids[i] = self.ptr 116 | 117 | def sample(self): 118 | assert self.full, "Not enough transitions for on-policy buffer to random sample" 119 | env_choices = np.random.choice(self.nenvs, self.nenvs * self.nsize // self.nminibatch) 120 | step_choices = np.random.choice(self.nsize, self.nenvs * self.nsize // self.nminibatch) 121 | input_batch = sample_batch(self.inputs,tuple([env_choices, step_choices])) 122 | action_batch = sample_batch(self.actions,tuple([env_choices, step_choices])) 123 | output_batch = sample_batch(self.outputs,tuple([env_choices, step_choices])) 124 | return_batch = sample_batch(self.returns,tuple([env_choices, step_choices])) 125 | advantage_batch = sample_batch(self.advantages,tuple([env_choices, step_choices])) 126 | advantage_batch = (advantage_batch - np.mean(self.advantages)) / (np.std(self.advantages) + 1e-7) 127 | return input_batch,action_batch,output_batch,return_batch,advantage_batch 128 | 129 | class DummyOffPolicyBuffer: 130 | def __init__(self, 131 | input_shape, 132 | action_shape, 133 | output_shape, 134 | nenvs, 135 | nsize, 136 | minibatch,): 137 | self.input_shape = input_shape 138 | self.action_shape = action_shape 139 | self.output_shape = output_shape 140 | self.nenvs,self.nsize,self.minibatch = nenvs,nsize,minibatch 141 | self.inputs = create_memory(input_shape,nenvs,nsize) 142 | self.actions = create_memory(action_shape,nenvs,nsize) 143 | self.outputs = create_memory(output_shape,nenvs,nsize) 144 | self.next_inputs = create_memory(input_shape,nenvs,nsize) 145 | self.rewards = create_memory((),nenvs,nsize) 146 | self.terminals = create_memory((),nenvs,nsize) 147 | self.ptr,self.size = 0,0 148 | def clear(self): 149 | self.ptr,self.size = 0,0 150 | def store(self,input,action,output,reward,terminal,next_input): 151 | store_element(input,self.inputs,self.ptr) 152 | store_element(action,self.actions,self.ptr) 153 | store_element(output,self.outputs,self.ptr) 154 | store_element(reward,self.rewards,self.ptr) 155 | store_element(terminal,self.terminals,self.ptr) 156 | 
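        # next_input is whatever the agents pass as store_next_obs (for truncated episodes the
        # real next observation recovered from infos). The buffer is a per-environment ring
        # buffer: ptr advances modulo nsize and size saturates at nsize, so the oldest
        # transitions are overwritten once full; sample() draws `minibatch` random
        # (env, step) index pairs from the filled region.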
store_element(next_input,self.next_inputs,self.ptr) 157 | self.ptr = (self.ptr+1)%self.nsize 158 | self.size = min(self.size+1,self.nsize) 159 | def sample(self): 160 | env_choices = np.random.choice(self.nenvs,self.minibatch) 161 | step_choices = np.random.choice(self.size,self.minibatch) 162 | input_batch = sample_batch(self.inputs,tuple([env_choices, step_choices])) 163 | action_batch = sample_batch(self.actions,tuple([env_choices, step_choices])) 164 | output_batch = sample_batch(self.outputs,tuple([env_choices, step_choices])) 165 | reward_batch = sample_batch(self.rewards,tuple([env_choices, step_choices])) 166 | terminal_batch = sample_batch(self.terminals,tuple([env_choices, step_choices])) 167 | next_input_batch = sample_batch(self.next_inputs,tuple([env_choices, step_choices])) 168 | return input_batch,action_batch,output_batch,reward_batch,terminal_batch,next_input_batch 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | # class EpisodeOnPolicyBuffer: 191 | 192 | # class DummyOffPolicyBuffer: 193 | 194 | 195 | 196 | # class EpisodeOffPolicyBuffer: -------------------------------------------------------------------------------- /xuance/agent/ppo.py: -------------------------------------------------------------------------------- 1 | from xuance.agent import * 2 | class PPO_Agent: 3 | def __init__(self, 4 | config, 5 | environment, 6 | policy, 7 | learner): 8 | self.config = config 9 | self.environment = environment 10 | self.policy = policy 11 | self.learner = learner 12 | self.nenvs = environment.num_envs 13 | self.nsize = config.nsize 14 | self.nminibatch = config.nminibatch 15 | self.nepoch = config.nepoch 16 | self.gamma = config.gamma 17 | self.tdlam = config.tdlam 18 | self.input_shape = self.policy.input_shape 19 | self.action_shape = self.environment.action_space.shape 20 | self.output_shape = self.policy.output_shape 21 | self.memory = DummyOnPolicyBuffer(self.input_shape, 22 | self.action_shape, 23 | self.output_shape, 24 | self.nenvs, 25 | self.nsize, 26 | self.nminibatch, 27 | self.gamma, 28 | self.tdlam) 29 | self.logger = self.config.logger 30 | self.summary = self.learner.summary 31 | self.train_episodes = np.zeros((self.nenvs,),np.int32) 32 | self.train_steps = 0 33 | 34 | if self.logger=="wandb": 35 | wandb.define_metric("train-steps") 36 | wandb.define_metric("train-rewards/*",step_metric="train-steps") 37 | wandb.define_metric("evaluate-steps") 38 | wandb.define_metric("evaluate-rewards/*",step_metric="evaluate-steps") 39 | 40 | def interact(self,inputs,training=True): 41 | outputs,dist,v = self.policy(inputs) 42 | if training: 43 | action = dist.sample().detach().cpu().numpy() 44 | else: 45 | action = dist.deterministic().detach().cpu().numpy() 46 | v = v.detach().cpu().numpy() 47 | for key,value in zip(outputs.keys(),outputs.values()): 48 | outputs[key] = value.detach().cpu().numpy() 49 | return outputs,action,v 50 | 51 | def train(self,train_steps:int=10000): 52 | obs,infos = self.environment.reset() 53 | memfull_episode_count = 0; earlystop_episode_count = 0 54 | for _ in tqdm(range(train_steps)): 55 | outputs,actions,pred_values = self.interact(obs) 56 | next_obs,rewards,terminals,trunctions,infos = self.environment.step(actions) 57 | self.memory.store(obs,actions,outputs,rewards,pred_values) 58 | for i in range(self.nenvs): 59 | if terminals[i] == True: 60 | self.memory.finish_path(0,i) 61 | elif trunctions[i] == True: 62 | real_next_observation = infos[i]['next_observation'] 
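                    # A time-limit truncation is not a real termination, so the value target is
                    # bootstrapped from the true next observation that the wrapper stored in
                    # infos: each key gets a leading batch axis below, the policy predicts its
                    # value, and finish_path() closes this trajectory segment with that value
                    # instead of 0.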
63 | for key in real_next_observation.keys(): 64 | real_next_observation[key] = real_next_observation[key][np.newaxis,:] 65 | _,_,truncate_value = self.interact(real_next_observation) 66 | self.memory.finish_path(truncate_value[0],i) 67 | if self.memory.full: 68 | _,_,next_pred_values = self.interact(next_obs) 69 | for i in range(self.nenvs): 70 | self.memory.finish_path(next_pred_values[i]*(1-terminals[i]),i) 71 | for _ in range(self.nminibatch * self.nepoch): 72 | input_batch,action_batch,output_batch,return_batch,advantage_batch = self.memory.sample() 73 | approx_kl = self.learner.update(input_batch,action_batch,output_batch,return_batch,advantage_batch) 74 | if approx_kl > self.config.target_kl: 75 | earlystop_episode_count+=1 76 | break 77 | self.memory.clear() 78 | memfull_episode_count+=1 79 | 80 | for i in range(self.nenvs): 81 | if terminals[i] or trunctions[i]: 82 | self.train_episodes[i] += 1 83 | if self.logger == "tensorboard": 84 | self.summary.add_scalars("train-rewards-steps",{"env-%d"%i:infos[i]['episode_score']},self.train_steps*self.nenvs) 85 | else: 86 | wandb.log({f"train-rewards/{i}":infos[i]['episode_score'],'train-steps':self.train_steps*self.nenvs}) 87 | 88 | obs = next_obs 89 | self.train_steps += 1 90 | # end single epoch train 91 | print("\t[Agent] #Interact= %d; #Mem-full= %d; #Early-stop= %d | [Learner] #Accum-iter= %d"%(train_steps, memfull_episode_count, earlystop_episode_count, self.learner.iterations)) 92 | 93 | def test(self,test_environment,test_episode=10,render=False): 94 | obs,infos = test_environment.reset() 95 | current_episode = 0 96 | scores = [] 97 | images = [[] for i in range(test_environment.num_envs)] 98 | best_score = -np.inf 99 | while current_episode < test_episode: 100 | if render: 101 | test_environment.render("human") 102 | else: 103 | render_images = test_environment.render('rgb_array') 104 | for index,img in enumerate(render_images): 105 | images[index].append(img.astype(np.uint8)) 106 | 107 | outputs,actions,pred_values = self.interact(obs,False) 108 | next_obs,rewards,terminals,trunctions,infos = test_environment.step(actions) 109 | for i in range(test_environment.num_envs): 110 | if terminals[i] == True or trunctions[i] == True: 111 | if self.logger == 'tensorboard': 112 | self.summary.add_scalars("evaluate-score",{"episode-%d"%current_episode:infos[i]['episode_score']},self.train_steps*self.nenvs) 113 | else: 114 | wandb.log({f"evaluate-rewards/{current_episode}":infos[i]['episode_score'],'evaluate-steps':self.train_steps*self.nenvs}) 115 | 116 | if infos[i]['episode_score'] > best_score: 117 | episode_images = images[i].copy() 118 | best_score = infos[i]['episode_score'] 119 | 120 | scores.append(infos[i]['episode_score']) 121 | images[i].clear() 122 | current_episode += 1 123 | obs = next_obs 124 | 125 | 126 | print("[%s] Training Steps:%.2f K, Evaluate Episodes:%d, Score Average:%f, Std:%f"%(get_time_hm(), self.train_steps*self.nenvs/1000, 127 | test_episode,np.mean(scores),np.std(scores))) 128 | return scores,episode_images 129 | 130 | def benchmark(self,test_environment,train_steps:int=10000,evaluate_steps:int=10000,test_episode=10,render=False,save_best_model=True): 131 | 132 | epoch = int(train_steps / evaluate_steps) + 1 # training times 133 | 134 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 135 | benchmark_scores = [] 136 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 137 | 138 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 139 | 
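        # The test() call above gives the baseline score before any training; the loop below
        # alternates train() and test(), keeps the best average evaluation score and its video,
        # optionally saves the corresponding weights to best_model.pth under the learner's
        # modeldir, and finally dumps the per-round scores to a timestamped benchmark_*.npy
        # file in logdir.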
best_std_score = np.std(benchmark_scores[-1]['scores']) 140 | best_video = evaluate_video 141 | 142 | for i in range(epoch): 143 | if i == epoch - 1: 144 | train_step = train_steps - (i*evaluate_steps) 145 | else: 146 | train_step = evaluate_steps 147 | print("[Train-test epoch %03d/%03d]: "%(i, epoch)) 148 | self.train(train_step) 149 | evaluate_scores,evaluate_video = self.test(test_environment,test_episode,render) 150 | benchmark_scores.append({'steps':self.train_steps,'scores':evaluate_scores}) 151 | 152 | if np.mean(benchmark_scores[-1]['scores']) > best_average_score: 153 | best_average_score = np.mean(benchmark_scores[-1]['scores']) 154 | best_std_score = np.std(benchmark_scores[-1]['scores']) 155 | best_video = evaluate_video 156 | if save_best_model == True: 157 | model_path = self.learner.modeldir + "/best_model.pth" 158 | torch.save(self.policy.state_dict(), model_path) 159 | 160 | if not render: 161 | # show the best performance video demo on web browser 162 | video_arr = np.array(best_video,dtype=np.uint8).transpose(0,3,1,2) 163 | if self.logger == "tensorboard": 164 | self.summary.add_video("best_video",torch.as_tensor(video_arr,dtype=torch.uint8).unsqueeze(0),fps=50,global_step=self.nenvs*self.train_steps) 165 | else: 166 | wandb.log({"best_video":wandb.Video(video_arr,fps=50,format='gif')},step=self.nenvs*self.train_steps) 167 | 168 | np.save(self.learner.logdir+"/benchmark_%s.npy"%get_time_full(), benchmark_scores) 169 | print("Best Model score = %f, std = %f"%(best_average_score,best_std_score)) 170 | 171 | 172 | -------------------------------------------------------------------------------- /xuance/environment/envpool_utils.py: -------------------------------------------------------------------------------- 1 | import gym 2 | import numpy as np 3 | from xuance.environment import * 4 | 5 | class EnvPool_Wrapper: 6 | def __init__(self,vecenv): 7 | self.vecenv = vecenv 8 | self.num_envs = vecenv.config['num_envs'] 9 | if isinstance(vecenv.observation_space,gym.spaces.Dict): 10 | self.observation_space = vecenv.observation_space 11 | else: 12 | self.observation_space = gym.spaces.Dict({'observation':vecenv.observation_space}) 13 | self.action_space = vecenv.action_space 14 | 15 | def reset(self): 16 | obs,_ = self.vecenv.reset() 17 | self.episode_lengths = np.zeros((self.num_envs,),np.int32) 18 | self.episode_scores = np.zeros((self.num_envs,),np.float32) 19 | self.last_episode_lengths = self.episode_lengths.copy() 20 | self.last_episode_scores = self.episode_scores.copy() 21 | infos = [] 22 | for i in range(self.num_envs): 23 | if isinstance(obs,dict): 24 | current_dict = {} 25 | for key,value in zip(obs.keys(),obs.values()): 26 | current_dict[key] = value[i] 27 | infos.append({'episode_length':self.episode_lengths[i], 28 | 'episode_score':self.episode_scores[i], 29 | 'next_observation':current_dict}) 30 | else: 31 | infos.append({'episode_length':self.episode_lengths[i], 32 | 'episode_score':self.episode_scores[i], 33 | 'next_observation':{'observation':obs[i]}}) 34 | if isinstance(obs,dict): 35 | return obs,infos 36 | else: 37 | return {'observation':obs},infos 38 | 39 | def step(self,actions): 40 | self.last_episode_lengths = self.episode_lengths.copy() 41 | self.last_episode_scores = self.episode_scores.copy() 42 | obs,rewards,terminals,trunctions,_ = self.vecenv.step(actions) 43 | self.episode_lengths += 1 44 | self.episode_scores += rewards 45 | infos = [] 46 | for i in range(self.num_envs): 47 | if terminals[i] or trunctions[i]: 48 | self.episode_scores[i] = 0 49 | 
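                # EnvPool auto-resets finished environments, so the running counters are
                # zeroed here; the episode statistics written into infos below come from the
                # snapshot taken at the top of step() (last_episode_lengths / last_episode_scores).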
self.episode_lengths[i] = 0 50 | if isinstance(obs,dict): 51 | current_dict = {} 52 | for key,value in zip(obs.keys(),obs.values()): 53 | current_dict[key] = value[i] 54 | infos.append({'episode_length':self.last_episode_lengths[i], 55 | 'episode_score':self.last_episode_scores[i], 56 | 'next_observation':current_dict}) 57 | else: 58 | infos.append({'episode_length':self.last_episode_lengths[i], 59 | 'episode_score':self.last_episode_scores[i], 60 | 'next_observation':{'observation':obs[i]}}) 61 | if isinstance(obs,dict): 62 | return obs,rewards,terminals,trunctions,infos 63 | else: 64 | return {'observation':obs},rewards,terminals,trunctions,infos 65 | 66 | 67 | class EnvPool_Normalizer: 68 | def __init__(self,vecenv:EnvPool_Wrapper): 69 | self.vecenv = vecenv 70 | self.num_envs = vecenv.num_envs 71 | self.observation_space = self.vecenv.observation_space 72 | self.action_space = self.vecenv.action_space 73 | def step(self,actions): 74 | raise NotImplementedError 75 | def reset(self): 76 | raise NotImplementedError 77 | 78 | class EnvPool_ObservationNorm(EnvPool_Normalizer): 79 | def __init__(self,config,vecenv,scale_range=(0.1,10),obs_range=(-5,5),forbidden_keys=[],train=True): 80 | super(EnvPool_ObservationNorm,self).__init__(vecenv) 81 | assert scale_range[0] < scale_range[1], "invalid scale_range." 82 | assert obs_range[0] < obs_range[1], "Invalid reward_range." 83 | self.config = config 84 | self.scale_range = scale_range 85 | self.obs_range = obs_range 86 | self.forbidden_keys = forbidden_keys 87 | self.obs_rms = Running_MeanStd(space2shape(self.observation_space)) 88 | self.train = train 89 | self.save_dir = os.path.join(self.config.modeldir,self.config.env_name,self.config.algo_name+"-%d"%self.config.seed) 90 | if self.train == False: 91 | self.load_rms() 92 | 93 | def load_rms(self): 94 | npy_path = os.path.join(self.save_dir,"observation_stat.npy") 95 | if not os.path.exists(npy_path): 96 | return 97 | rms_data = np.load(npy_path,allow_pickle=True).item() 98 | self.obs_rms.count = rms_data['count'] 99 | self.obs_rms.mean = rms_data['mean'] 100 | self.obs_rms.var = rms_data['var'] 101 | 102 | def reset(self): 103 | self.train_steps = 0 104 | return self.vecenv.reset() 105 | 106 | def step(self,actions): 107 | self.train_steps += 1 108 | if self.config.train_steps == self.train_steps or self.train_steps % self.config.save_model_frequency == 0: 109 | np.save(os.path.join(self.save_dir,"observation_stat.npy"),{'count':self.obs_rms.count,'mean':self.obs_rms.mean,'var':self.obs_rms.var}) 110 | obs,rews,terminals,trunctions,infos = self.vecenv.step(actions) 111 | if self.train: 112 | self.obs_rms.update(obs) 113 | norm_observation = {} 114 | for key,value in zip(obs.keys(),obs.values()): 115 | if key in self.forbidden_keys: 116 | continue 117 | scale_factor = np.clip(1/(self.obs_rms.std[key] + 1e-7),self.scale_range[0],self.scale_range[1]) 118 | norm_observation[key] = np.clip((value - self.obs_rms.mean[key]) * scale_factor,self.obs_range[0],self.obs_range[1]) 119 | return norm_observation,rews,terminals,trunctions,infos 120 | 121 | class EnvPool_RewardNorm(EnvPool_Normalizer): 122 | def __init__(self,config,vecenv,scale_range=(0.1,10),reward_range=(-5,5),gamma=0.99,train=True): 123 | super(EnvPool_RewardNorm,self).__init__(vecenv) 124 | assert scale_range[0] < scale_range[1], "invalid scale_range." 125 | assert reward_range[0] < reward_range[1], "Invalid reward_range." 126 | assert gamma < 1, "Gamma should be a float value smaller than 1." 
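        # Same reward-normalization scheme as RewardNorm in xuance/environment/normalizer.py,
        # adapted to the plain step()/reset() interface of the EnvPool wrappers (there is no
        # step_async/step_wait split): rewards are divided by the clipped running std of the
        # discounted episode return and clipped to reward_range.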
127 | self.config = config 128 | self.gamma = gamma 129 | self.scale_range = scale_range 130 | self.reward_range = reward_range 131 | self.return_rms = Running_MeanStd({'return':(1,)}) 132 | self.episode_rewards = [[] for i in range(self.num_envs)] 133 | self.train = train 134 | self.save_dir = os.path.join(self.config.modeldir,self.config.env_name,self.config.algo_name+"-%d"%self.config.seed) 135 | if train == False: 136 | self.load_rms() 137 | def load_rms(self): 138 | npy_path = os.path.join(self.save_dir,"reward_stat.npy") 139 | if not os.path.exists(npy_path): 140 | return 141 | rms_data = np.load(npy_path,allow_pickle=True).item() 142 | self.return_rms.count = rms_data['count'] 143 | self.return_rms.mean = rms_data['mean'] 144 | self.return_rms.var = rms_data['var'] 145 | 146 | def reset(self): 147 | self.train_steps = 0 148 | return self.vecenv.reset() 149 | def step(self,act): 150 | self.train_steps += 1 151 | if self.config.train_steps == self.train_steps or self.train_steps % self.config.save_model_frequency == 0: 152 | np.save(os.path.join(self.save_dir,"reward_stat.npy"),{'count':self.return_rms.count,'mean':self.return_rms.mean,'var':self.return_rms.var}) 153 | obs,rews,terminals,trunctions,infos = self.vecenv.step(act) 154 | for i in range(len(rews)): 155 | if terminals[i] != True and trunctions[i] != True: 156 | self.episode_rewards[i].append(rews[i]) 157 | else: 158 | if self.train: 159 | self.return_rms.update({'return':discount_cumsum(self.episode_rewards[i],self.gamma)[0:1][np.newaxis,:]}) 160 | self.episode_rewards[i].clear() 161 | scale = np.clip(self.return_rms.std['return'][0],self.scale_range[0],self.scale_range[1]) 162 | rews[i] = np.clip(rews[i]/scale,self.reward_range[0],self.reward_range[1]) 163 | return obs,rews,terminals,trunctions,infos 164 | 165 | class EnvPool_ActionNorm(EnvPool_Normalizer): 166 | def __init__(self,vecenv,input_action_range=(-1,1)): 167 | super(EnvPool_ActionNorm,self).__init__(vecenv) 168 | self.input_action_range = input_action_range 169 | assert isinstance(self.action_space,gym.spaces.Box), "Only use the NormActionWrapper for Continuous Action." 170 | def reset(self): 171 | return self.vecenv.reset() 172 | def step(self,actions): 173 | act = np.clip(actions,self.input_action_range[0],self.input_action_range[1]) 174 | assert np.min(act) >= self.input_action_range[0] and np.max(act) <= self.input_action_range[1], "input action is out of the defined action range." 
175 | self.action_space_low = self.action_space.low 176 | self.action_space_high = self.action_space.high 177 | self.input_action_low = self.input_action_range[0] 178 | self.input_action_high = self.input_action_range[1] 179 | input_prop = (act - self.input_action_low) / (self.input_action_high - self.input_action_low) 180 | output_action = input_prop * (self.action_space_high - self.action_space_low) + self.action_space_low 181 | return self.vecenv.step(output_action) 182 | 183 | 184 | -------------------------------------------------------------------------------- /xuance/policy/deterministic.py: -------------------------------------------------------------------------------- 1 | from xuance.policy import * 2 | class ActorNet(nn.Module): 3 | def __init__(self, 4 | state_dim:int, 5 | action_dim:int, 6 | initialize, 7 | device 8 | ): 9 | super(ActorNet,self).__init__() 10 | self.device = device 11 | self.action_dim = action_dim 12 | self.model = nn.Sequential(*mlp_block(state_dim,action_dim,nn.Tanh,initialize,device)[0]) 13 | self.output_shape = {'actor':(action_dim,)} 14 | def forward(self,x:torch.Tensor): 15 | return self.model(x) 16 | class CriticNet(nn.Module): 17 | def __init__(self, 18 | state_dim:int, 19 | action_dim:int, 20 | initialize, 21 | device): 22 | super(CriticNet,self).__init__() 23 | self.device = device 24 | self.model = nn.Sequential(*mlp_block(state_dim+action_dim,state_dim,nn.LeakyReLU,initialize,device)[0], 25 | *mlp_block(state_dim,1,None,initialize,device)[0]) 26 | self.output_shape = {'critic':()} 27 | def forward(self,x:torch.Tensor,a:torch.Tensor): 28 | return self.model(torch.concat((x,a),dim=-1))[:,0] 29 | 30 | class DDPGPolicy(nn.Module): 31 | def __init__(self, 32 | action_space:gym.spaces.Space, 33 | representation:torch.nn.Module, 34 | initialize, 35 | device): 36 | assert isinstance(action_space,gym.spaces.Box) 37 | super(DDPGPolicy,self).__init__() 38 | self.action_space = action_space 39 | self.input_shape = representation.input_shape.copy() 40 | self.output_shape = representation.output_shape.copy() 41 | # dont share the representation network in actor and critic 42 | self.representation = representation 43 | self.critic_representation = copy.deepcopy(representation) 44 | self.target_actor_representation = copy.deepcopy(representation) 45 | self.target_critic_representation = copy.deepcopy(representation) 46 | 47 | # create actor,critic and target actor, target critic 48 | self.actor = ActorNet(self.representation.output_shape['state'][0], 49 | self.action_space.shape[0], 50 | initialize, 51 | device) 52 | self.target_actor = copy.deepcopy(self.actor) 53 | self.critic = CriticNet(self.representation.output_shape['state'][0], 54 | self.action_space.shape[0], 55 | initialize, 56 | device) 57 | self.target_critic = copy.deepcopy(self.critic) 58 | 59 | for key,value in zip(self.actor.output_shape.keys(),self.actor.output_shape.values()): 60 | self.output_shape[key] = value 61 | self.output_shape['critic'] = () 62 | self.actor_parameters = list(self.representation.parameters()) + list(self.actor.parameters()) 63 | self.critic_parameters = list(self.critic_representation.parameters()) + list(self.critic.parameters()) 64 | 65 | def forward(self,observation:dict): 66 | actor_outputs = self.representation(observation) 67 | critic_outputs = self.critic_representation(observation) 68 | action = self.actor(actor_outputs['state']) 69 | critic = self.critic(critic_outputs['state'],action) 70 | actor_outputs['actor'] = action 71 | actor_outputs['critic'] = critic 72 
| return actor_outputs,action,critic 73 | 74 | def Qtarget(self,observation:dict): 75 | actor_outputs = self.target_actor_representation(observation) 76 | critic_outputs = self.target_critic_representation(observation) 77 | target_action = self.target_actor(actor_outputs['state']).detach() 78 | target_critic = self.target_critic(critic_outputs['state'],target_action) 79 | return target_critic.detach() 80 | 81 | def Qaction(self,observation:dict,action:torch.Tensor): 82 | outputs = self.critic_representation(observation) 83 | critic = self.critic(outputs['state'],action) 84 | return critic 85 | 86 | def soft_update(self, tau=0.005): 87 | for ep, tp in zip(self.representation.parameters(), self.target_actor_representation.parameters()): 88 | tp.data.mul_(1 - tau) 89 | tp.data.add_(tau * ep.data) 90 | for ep, tp in zip(self.critic_representation.parameters(), self.target_critic_representation.parameters()): 91 | tp.data.mul_(1 - tau) 92 | tp.data.add_(tau * ep.data) 93 | for ep, tp in zip(self.actor.parameters(), self.target_actor.parameters()): 94 | tp.data.mul_(1 - tau) 95 | tp.data.add_(tau * ep.data) 96 | for ep, tp in zip(self.critic.parameters(), self.target_critic.parameters()): 97 | tp.data.mul_(1 - tau) 98 | tp.data.add_(tau * ep.data) 99 | 100 | class TD3Policy(nn.Module): 101 | def __init__(self, 102 | action_space:gym.spaces.Space, 103 | representation:torch.nn.Module, 104 | initialize, 105 | device): 106 | assert isinstance(action_space,gym.spaces.Box) 107 | super(TD3Policy,self).__init__() 108 | self.action_space = action_space 109 | self.input_shape = representation.input_shape.copy() 110 | self.output_shape = representation.output_shape.copy() 111 | # dont share the representation network in actor and critic 112 | 113 | self.representation = representation 114 | self.criticA_representation = copy.deepcopy(representation) 115 | self.criticB_representation = copy.deepcopy(representation) 116 | 117 | self.target_actor_representation = copy.deepcopy(representation) 118 | self.targetA_critic_representation = copy.deepcopy(representation) 119 | self.targetB_critic_representation = copy.deepcopy(representation) 120 | 121 | # create actor,critic and target actor, target critic 122 | self.actor = ActorNet(self.representation.output_shape['state'][0], 123 | self.action_space.shape[0], 124 | initialize, 125 | device) 126 | self.target_actor = copy.deepcopy(self.actor) 127 | 128 | self.criticA = CriticNet(self.representation.output_shape['state'][0], 129 | self.action_space.shape[0], 130 | initialize, 131 | device) 132 | self.criticB = CriticNet(self.representation.output_shape['state'][0], 133 | self.action_space.shape[0], 134 | initialize, 135 | device) 136 | self.target_criticA = copy.deepcopy(self.criticA) 137 | self.target_criticB = copy.deepcopy(self.criticB) 138 | 139 | for key,value in zip(self.actor.output_shape.keys(),self.actor.output_shape.values()): 140 | self.output_shape[key] = value 141 | self.output_shape['criticA'] = () 142 | self.output_shape['criticB'] = () 143 | self.actor_parameters = list(self.representation.parameters()) + list(self.actor.parameters()) 144 | self.critic_parameters = list(self.criticA_representation.parameters()) + list(self.criticA.parameters()) + list(self.criticB_representation.parameters()) + list(self.criticB.parameters()) 145 | 146 | def soft_update(self, tau=0.005): 147 | for ep, tp in zip(self.representation.parameters(), self.target_actor_representation.parameters()): 148 | tp.data.mul_(1 - tau) 149 | tp.data.add_(tau * ep.data) 150 | for 
ep, tp in zip(self.criticA_representation.parameters(), self.targetA_critic_representation.parameters()):
            tp.data.mul_(1 - tau)
            tp.data.add_(tau * ep.data)
        for ep, tp in zip(self.criticB_representation.parameters(), self.targetB_critic_representation.parameters()):
            tp.data.mul_(1 - tau)
            tp.data.add_(tau * ep.data)
        for ep, tp in zip(self.actor.parameters(), self.target_actor.parameters()):
            tp.data.mul_(1 - tau)
            tp.data.add_(tau * ep.data)
        for ep, tp in zip(self.criticA.parameters(), self.target_criticA.parameters()):
            tp.data.mul_(1 - tau)
            tp.data.add_(tau * ep.data)
        for ep, tp in zip(self.criticB.parameters(), self.target_criticB.parameters()):
            tp.data.mul_(1 - tau)
            tp.data.add_(tau * ep.data)

    def Qtarget(self,observation:dict):
        actor_outputs = self.target_actor_representation(observation)
        criticA_outputs = self.targetA_critic_representation(observation)
        criticB_outputs = self.targetB_critic_representation(observation)
        target_action = self.target_actor(actor_outputs['state']).detach()
        # target policy smoothing: perturb the target action with zero-mean, clipped noise
        target_action = target_action + torch.randn_like(target_action).clamp(-0.1,0.1)
        targetA_critic = self.target_criticA(criticA_outputs['state'],target_action)
        targetB_critic = self.target_criticB(criticB_outputs['state'],target_action)
        return torch.minimum(targetA_critic,targetB_critic).detach()

    def forward(self,observation:dict):
        actor_outputs = self.representation(observation)
        criticA_outputs = self.criticA_representation(observation)
        criticB_outputs = self.criticB_representation(observation)
        action = self.actor(actor_outputs['state'])
        criticA = self.criticA(criticA_outputs['state'],action)
        criticB = self.criticB(criticB_outputs['state'],action)
        actor_outputs['actor'] = action
        actor_outputs['criticA'] = criticA
        actor_outputs['criticB'] = criticB
        return actor_outputs,action,(criticA+criticB)/2

    def Qaction(self,observation:dict,action:torch.Tensor):
        outputsA = self.criticA_representation(observation)
        criticA = self.criticA(outputsA['state'],action)
        outputsB = self.criticB_representation(observation)
        criticB = self.criticB(outputsB['state'],action)
        return criticA,criticB
--------------------------------------------------------------------------------
/docs/index.rst:
--------------------------------------------------------------------------------
.. xuance documentation master file, created by
   sphinx-quickstart on Tue May 16 20:51:38 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to XuanCE!
==================================
XuanCE is a reinforcement learning algorithm platform which supports multiple deep learning frameworks (Pytorch, TensorFlow, Mindspore) and both multi-agent RL and single-agent RL methods.
This project is a pruned version of the original project XuanPolicy with only Pytorch-based implementations and single-agent RL algorithms.
We keep this repo highly modularized and clean to be friendly for RL starters.
This framework is also compatible and easy to use for researchers who want to implement their own ideas.
Currently, the supported algorithms include:

- Advantage Actor-Critic (A2C)
  [`paper <https://arxiv.org/pdf/1602.01783v2.pdf>`__]
- Proximal Policy Optimization (PPO)
  [`paper <https://arxiv.org/pdf/1707.06347.pdf>`__]
- Deep Deterministic Policy Gradient (DDPG)
  [`paper <https://arxiv.org/pdf/1509.02971.pdf>`__]
- Twin Delayed DDPG (TD3)
  [`paper <https://arxiv.org/pdf/1802.09477.pdf>`__]
- Deep Q-Network (DQN) [`paper <https://arxiv.org/pdf/1312.5602.pdf>`__]
- Dueling Q-Network (Dueling DQN)
  [`paper <https://arxiv.org/pdf/1511.06581.pdf>`__]
- Double Q-Network (DDQN)
  [`paper <https://arxiv.org/pdf/1509.06461.pdf>`__]
- Generalized Advantage Estimation (GAE)
  [`paper <https://arxiv.org/pdf/1506.02438.pdf>`__]

Here are some other features of XuanCE:

- Support custom environments, policy networks, and optimization processes
- Support custom policy evaluation with an environment different from the training one
- Support dictionary observation inputs
- Support dictionary network outputs
- Support efficient environment parallelization
  (`EnvPool <https://github.com/sail-sg/envpool>`__)
- Support the Weights & Biases logger and the Tensorboard logger
  (`WandB <https://wandb.ai/>`__)
- Support video capturing
- Benchmarking experiments

Installation
==================
You can clone this repository and install an editable version locally:

::

    $ conda create -n xuance python=3.8
    $ conda activate xuance
    $ git clone https://github.com/wzcai99/xuance.git
    $ cd xuance
    $ pip install -e .

Quick Start
==================
You can run the RL algorithms with the provided examples,

::

    $ python -m example.run_ppo

or follow the step-by-step instructions below.

**Step 1: Define all the hyper-parameters in PyYAML format.** Here is an example configuration file for PPO.

::

    algo_name: ppo  # the name of the algorithm, used by the logger
    env_name: InvertedDoublePendulum  # the environment name, used by the logger
    seed: 7891  # random seed

    nenvs: 16  # number of environments running in parallel
    nsize: 256  # how many steps per environment to collect for each policy iteration
    nminibatch: 8  # batchsize = nenvs * nsize / nminibatch
    nepoch: 16  # iteration times = nepoch * nminibatch

    vf_coef: 0.25  # value function loss weight
    ent_coef: 0.00  # policy entropy regularization weight
    clipgrad_norm: 0.5  # gradient clipping norm
    clip_range: 0.20  # ppo surrogate clip ratio
    target_kl: 0.01  # restriction on the kl-divergence from the old policy
    lr_rate: 0.0004  # learning rate

    save_model_frequency: 1000  # policy model save frequency (in iterations)
    train_steps: 62500  # total environment steps = train_steps * nenvs
    evaluate_steps: 10000  # evaluation frequency (in steps)

    gamma: 0.99  # discount factor
    tdlam: 0.95  # td-lambda in GAE

    logger: wandb  # logger, you can choose wandb or tensorboard
    logdir: "./logs/"  # logging directory
    modeldir: "./models/"  # model save directory

**Step 2: Import some relevant packages:**

::

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import argparse
    import gym
    from xuance.utils.common import space2shape,get_config

**Step 3: Parse some relevant arguments:**

::

    def get_args():
        parser = argparse.ArgumentParser()
        parser.add_argument("--device",type=str,default="cuda:0")
        parser.add_argument("--config",type=str,default="./config/a2c/")
        parser.add_argument("--domain",type=str,default="mujoco")
        parser.add_argument("--env_id",type=str,default="HalfCheetah-v4")
        parser.add_argument("--pretrain_weight",type=str,default=None)
        parser.add_argument("--render",type=bool,default=False)
        args = parser.parse_known_args()[0]
        return args
    args = get_args()
    device = args.device
    config = get_config(args.config,args.domain)

Note that the argument **config** is the directory containing the PyYAML
file and the argument **domain** is the filename of the PyYAML file.

**Step 4: Define the training environments (vectorized).**

::

    from xuance.environment import BasicWrapper,DummyVecEnv,RewardNorm,ObservationNorm,ActionNorm
    train_envs = [BasicWrapper(gym.make(args.env_id,render_mode='rgb_array')) for i in range(config.nenvs)]
    train_envs = DummyVecEnv(train_envs)
    train_envs = ActionNorm(train_envs)
    train_envs = ObservationNorm(config,train_envs,train=(args.pretrain_weight is None))
    train_envs = RewardNorm(config,train_envs,train=(args.pretrain_weight is None))

Note that in some RL algorithms (e.g. A2C, PPO), normalizing the
observation data and scaling the reward value are essential for data
efficiency; therefore, we introduce ActionNorm, ObservationNorm, and
RewardNorm. You can adjust them according to your needs.

**Similarly, EnvPool-based vector environments are also supported.**

::

    import envpool
    from xuance.environment import EnvPool_Wrapper,EnvPool_ActionNorm,EnvPool_RewardNorm,EnvPool_ObservationNorm
    train_envs = envpool.make(args.env_id,"gym",num_envs=config.nenvs)
    train_envs = EnvPool_Wrapper(train_envs)
    train_envs = EnvPool_ActionNorm(train_envs)
    train_envs = EnvPool_RewardNorm(config,train_envs,train=(args.pretrain_weight is None))
    train_envs = EnvPool_ObservationNorm(config,train_envs,train=(args.pretrain_weight is None))

**Step 5: Define a representation network for state encoding.**

::

    from xuance.representation import MLP
    representation = MLP(space2shape(train_envs.observation_space),(128,128),nn.LeakyReLU,nn.init.orthogonal_,device)

**Step 6: Define the policy head on top of the representation network.**

For the discrete action space:

::

    from xuance.policy import Categorical_ActorCritic
    policy = Categorical_ActorCritic(train_envs.action_space,representation,nn.init.orthogonal_,device)

For the continuous action space:

::

    from xuance.policy import Gaussian_ActorCritic
    policy = Gaussian_ActorCritic(train_envs.action_space,representation,nn.init.orthogonal_,device)

If you want to load a pre-trained policy weight:

::

    if args.pretrain_weight:
        policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device))

**Step 7: Define an optimizer and a learning rate scheduler.**

::

    optimizer = torch.optim.Adam(policy.parameters(),config.lr_rate)
    scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1,total_iters=config.train_steps/config.nsize * config.nepoch * config.nminibatch)

**Step 8: Define the RL learner and agent.**

::

    from xuance.learner import PPO_Learner
    from xuance.agent import PPO_Agent
    learner = PPO_Learner(config,policy,optimizer,scheduler,device)
    agent = PPO_Agent(config,train_envs,policy,learner)

**Step 9: Train and evaluate the RL agent.**

Train the RL agent:
::

    agent.train(config.train_steps)

In many cases, the RL algorithm is evaluated in a different
environment to test its generalization ability. Therefore, in our
framework, before evaluating the policy, you need to define a
function that builds the test environment. Here is an example:

::

    def build_test_env():
        test_envs = [BasicWrapper(gym.make(args.env_id,render_mode="rgb_array")) for _ in range(1)]
        test_envs = DummyVecEnv(test_envs)
        test_envs = ActionNorm(test_envs)
        test_envs = RewardNorm(config,test_envs,train=False)
        test_envs = ObservationNorm(config,test_envs,train=False)
        return test_envs

Then you can test your RL agent by calling this function and passing
the resulting environment:

::

    test_env = build_test_env()
    agent.test(test_env,10,args.render) # test for 10 episodes

You can also run a benchmark experiment as follows:

::

    agent.benchmark(build_test_env,config.train_steps,config.evaluate_steps,render=args.render)

The benchmark function will automatically switch between training and
evaluation.

**Step 10: Use tensorboard or wandb to visualize the training
process.** If you use wandb, we recommend running a wandb server
locally to avoid network errors; to install and run wandb locally,
follow the tutorial `here <https://github.com/wandb/server>`__.

.. toctree::
   :maxdepth: 1
   :caption: Tutorials

   tutorials/concept
   tutorials/configuration
   tutorials/custom_env
   tutorials/custom_network
   tutorials/custom_loss
   tutorials/multi_inputs
   tutorials/logger

.. toctree::
   :maxdepth: 1
   :caption: API Docs

   api/xuance.environment
   api/xuance.utils
   api/xuance.representation
   api/xuance.policy
   api/xuance.learner
   api/xuance.agent

Indices and tables
------------------

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## XuanCE: A simple framework and implementations of reinforcement learning algorithms ##
![commit](https://img.shields.io/github/last-commit/wzcai99/xuance?style=plastic)
![docs](https://img.shields.io/readthedocs/xuance?style=plastic)
![MIT](https://img.shields.io/github/license/wzcai99/xuance?style=plastic)
![issue](https://img.shields.io/github/issues/wzcai99/xuance?style=plastic)
![star](https://img.shields.io/github/stars/wzcai99/xuance?style=plastic)
![fork](https://img.shields.io/github/forks/wzcai99/xuance?style=plastic)
![contributor](https://img.shields.io/github/contributors/wzcai99/xuance?style=plastic)

XuanCE is a reinforcement learning algorithm platform which supports multiple deep learning frameworks (Pytorch, TensorFlow, Mindspore) and both multi-agent RL and single-agent RL methods. This repository is a pruned version of the original project [XuanPolicy](https://github.com/agi-brain/xuanpolicy) with only Pytorch-based implementations and single-agent RL algorithms.
We keep this repo highly modularized and clean to be friendly for RL starters.
This framework is also compatible and easy to use for researchers who want to implement their own ideas.

- For example, if you want to benchmark the RL algorithms on some novel problems, just follow the examples provided in xuance/environment/custom_envs/*.py to formalize the problem as a Markov Decision Process (MDP) in a gym-based wrapper. A tutorial is provided [here](https://xuance.readthedocs.io/en/latest/).
- If you want to try some advanced representation network for state encoding, just follow the example provided in xuance/representation/network.py to define a new class based on torch.nn.Module. A tutorial is provided [here](https://xuance.readthedocs.io/en/latest/).
- If you figure out a better way to run the RL optimization process, just add a learner similar to those in xuance/learner/*.py and define your own loss function. You can compare the differences between xuance/learner/a2c.py and xuance/learner/ppo.py as a guide for your own implementation.
- If you propose a more efficient memory buffer and experience replay scheme, just add your own memory buffer class in xuance/utils/memory.py and replace the memory used in xuance/agent/*.py.
- ...
- More details of the usage can be found in the [documentation](https://xuance.readthedocs.io/en/latest/).

In summary, our highly modularized design allows you to focus on the design and improvement of a single unit while leaving the other parts untouched.
The highlights of our project are listed below:
- Support custom environments, policy networks, and optimization processes
- Support custom policy evaluation with an environment different from the training one
- Support dictionary observation inputs
- Support dictionary network outputs
- Support efficient environment parallelization ([EnvPool](https://github.com/sail-sg/envpool))
- Support the Weights & Biases logger and the Tensorboard logger ([WandB](https://wandb.ai/))
- Support video capturing
- Benchmarking experiments

Currently, this repo supports the following RL algorithms:
- Advantage Actor-Critic (A2C) [[paper](https://arxiv.org/pdf/1602.01783v2.pdf)]
- Proximal Policy Optimization (PPO) [[paper](https://arxiv.org/pdf/1707.06347.pdf)]
- Deep Deterministic Policy Gradient (DDPG) [[paper](https://arxiv.org/pdf/1509.02971.pdf)]
- Twin Delayed DDPG (TD3) [[paper](https://arxiv.org/pdf/1802.09477.pdf)]
- Deep Q-Network (DQN) [[paper](https://arxiv.org/pdf/1312.5602.pdf)]
- Dueling Q-Network (Dueling DQN) [[paper](https://arxiv.org/pdf/1511.06581.pdf)]
- Double Q-Network (DDQN) [[paper](https://arxiv.org/pdf/1509.06461.pdf)]
- Generalized Advantage Estimation (GAE) [[paper](https://arxiv.org/pdf/1506.02438.pdf)]

## Installation ##
You can clone this repository and install an editable version locally:
```
$ conda create -n xuance python=3.8
$ conda activate xuance
$ git clone https://github.com/wzcai99/xuance.git
$ cd xuance
$ pip install -e .
```

## Quick Start ##
You can run the RL algorithms with the provided examples,
```
$ python -m example.run_ppo
```
or follow the step-by-step instructions below.

**Step 1: Define all the hyper-parameters in PyYAML format.**

Here is an example configuration file for PPO.
```
algo_name: ppo  # the name of the algorithm, used by the logger
env_name: InvertedDoublePendulum  # the environment name, used by the logger
seed: 7891  # random seed

nenvs: 16  # number of environments running in parallel
nsize: 256  # how many steps per environment to collect for each policy iteration
nminibatch: 8  # batchsize = nenvs * nsize / nminibatch
nepoch: 16  # iteration times = nepoch * nminibatch

vf_coef: 0.25  # value function loss weight
ent_coef: 0.00  # policy entropy regularization weight
clipgrad_norm: 0.5  # gradient clipping norm
clip_range: 0.20  # ppo surrogate clip ratio
target_kl: 0.01  # restriction on the kl-divergence from the old policy
lr_rate: 0.0004  # learning rate

save_model_frequency: 1000  # policy model save frequency (in iterations)
train_steps: 62500  # total environment steps = train_steps * nenvs
evaluate_steps: 10000  # evaluation frequency (in steps)

gamma: 0.99  # discount factor
tdlam: 0.95  # td-lambda in GAE

logger: wandb  # logger, you can choose wandb or tensorboard
logdir: "./logs/"  # logging directory
modeldir: "./models/"  # model save directory
```
**Step 2: Import some relevant packages:**
```
import torch
import torch.nn as nn
import torch.nn.functional as F
import argparse
import gym
from xuance.utils.common import space2shape,get_config
```
**Step 3: Parse some relevant arguments:**
```
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--device",type=str,default="cuda:0")
    parser.add_argument("--config",type=str,default="./config/a2c/")
    parser.add_argument("--domain",type=str,default="mujoco")
    parser.add_argument("--env_id",type=str,default="HalfCheetah-v4")
    parser.add_argument("--pretrain_weight",type=str,default=None)
    parser.add_argument("--render",type=bool,default=False)
    args = parser.parse_known_args()[0]
    return args
args = get_args()
device = args.device
config = get_config(args.config,args.domain)
```
Note that the argument **config** is the directory containing the PyYAML file and the argument **domain** is the filename of the PyYAML file.

**Step 4: Define the training environments (vectorized).**
```
from xuance.environment import BasicWrapper,DummyVecEnv,RewardNorm,ObservationNorm,ActionNorm
train_envs = [BasicWrapper(gym.make(args.env_id,render_mode='rgb_array')) for i in range(config.nenvs)]
train_envs = DummyVecEnv(train_envs)
train_envs = ActionNorm(train_envs)
train_envs = ObservationNorm(config,train_envs,train=(args.pretrain_weight is None))
train_envs = RewardNorm(config,train_envs,train=(args.pretrain_weight is None))
```
Note that in some RL algorithms (e.g. A2C, PPO), normalizing the observation data and scaling the reward value are essential for data efficiency; therefore, we introduce ActionNorm, ObservationNorm, and RewardNorm. You can adjust them according to your needs.
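
For intuition, the ObservationNorm-style wrappers keep running mean/std statistics of the observations and use them to standardize (and clip) every observation they return, while RewardNorm does the analogous thing with a running estimate of the episode return. The snippet below is a minimal, self-contained sketch of that idea (the name `RunningObsNorm` is illustrative and not part of the xuance API; the real wrappers additionally handle dictionary observations and save/load their statistics alongside the model):
```
import numpy as np

class RunningObsNorm:
    """Minimal running mean/std observation normalizer (illustrative only)."""
    def __init__(self, shape, clip=5.0, eps=1e-7):
        self.mean = np.zeros(shape, np.float64)
        self.var = np.ones(shape, np.float64)
        self.count = eps
        self.clip = clip
        self.eps = eps

    def update(self, batch):
        # merge the batch statistics into the running statistics (parallel variance update)
        b_mean, b_var, b_count = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        self.mean = self.mean + delta * b_count / total
        self.var = (self.var * self.count + b_var * b_count
                    + delta ** 2 * self.count * b_count / total) / total
        self.count = total

    def __call__(self, obs):
        # standardize with the running statistics and clip to a fixed range
        return np.clip((obs - self.mean) / np.sqrt(self.var + self.eps), -self.clip, self.clip)

norm = RunningObsNorm(shape=(17,))   # e.g. the 17-dimensional HalfCheetah-v4 observation
batch = np.random.randn(16, 17)      # one step of observations from 16 parallel environments
norm.update(batch)
print(norm(batch).shape)             # (16, 17), roughly zero-mean/unit-variance and clipped
```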

**Similarly, EnvPool-based vector environments are also supported.**
```
import envpool
from xuance.environment import EnvPool_Wrapper,EnvPool_ActionNorm,EnvPool_RewardNorm,EnvPool_ObservationNorm
train_envs = envpool.make(args.env_id,"gym",num_envs=config.nenvs)
train_envs = EnvPool_Wrapper(train_envs)
train_envs = EnvPool_ActionNorm(train_envs)
train_envs = EnvPool_RewardNorm(config,train_envs,train=(args.pretrain_weight is None))
train_envs = EnvPool_ObservationNorm(config,train_envs,train=(args.pretrain_weight is None))
```

**Step 5: Define a representation network for state encoding.**
```
from xuance.representation import MLP
representation = MLP(space2shape(train_envs.observation_space),(128,128),nn.LeakyReLU,nn.init.orthogonal_,device)
```
**Step 6: Define the policy head on top of the representation network.**

For the discrete action space:
```
from xuance.policy import Categorical_ActorCritic
policy = Categorical_ActorCritic(train_envs.action_space,representation,nn.init.orthogonal_,device)
```
For the continuous action space:
```
from xuance.policy import Gaussian_ActorCritic
policy = Gaussian_ActorCritic(train_envs.action_space,representation,nn.init.orthogonal_,device)
```
If you want to load a pre-trained policy weight:
```
if args.pretrain_weight:
    policy.load_state_dict(torch.load(args.pretrain_weight,map_location=device))
```

**Step 7: Define an optimizer and a learning rate scheduler:**
```
optimizer = torch.optim.Adam(policy.parameters(),config.lr_rate)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1,total_iters=config.train_steps/config.nsize * config.nepoch * config.nminibatch)
```

**Step 8: Define the RL learner and agent.**
```
from xuance.learner import PPO_Learner
from xuance.agent import PPO_Agent
learner = PPO_Learner(config,policy,optimizer,scheduler,device)
agent = PPO_Agent(config,train_envs,policy,learner)
```

**Step 9: Train and evaluate the RL agent.**

Train the RL agent:
```
agent.train(config.train_steps)
```

In many cases, the RL algorithm is evaluated in a different environment to test its generalization ability. Therefore, in our framework, before evaluating the policy, you need to define a function that builds the test environment. Here is an example:
```
def build_test_env():
    test_envs = [BasicWrapper(gym.make(args.env_id,render_mode="rgb_array")) for _ in range(1)]
    test_envs = DummyVecEnv(test_envs)
    test_envs = ActionNorm(test_envs)
    test_envs = RewardNorm(config,test_envs,train=False)
    test_envs = ObservationNorm(config,test_envs,train=False)
    return test_envs
```
Then you can test your RL agent by calling this function and passing the resulting environment:
```
test_env = build_test_env()
agent.test(test_env,10,args.render) # test for 10 episodes
```
You can also run a benchmark experiment as follows:
```
agent.benchmark(build_test_env,config.train_steps,config.evaluate_steps,render=args.render)
```
The benchmark function will automatically switch between training and evaluation.
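
Concretely, `benchmark()` alternates between training for `evaluate_steps` steps and testing the current policy, keeping track of the best-performing model. The loop below is a simplified, illustrative sketch of that behavior; the real method also records a video of the best run, saves all scores to a timestamped `.npy` file, handles the leftover steps in the last epoch, and writes the best checkpoint under the configured `modeldir` (the episode count of 10 and the file path here are just illustrative values):
```
import numpy as np
import torch

epochs = config.train_steps // config.evaluate_steps
best_average_score = -np.inf
benchmark_scores = []
for i in range(epochs):
    agent.train(config.evaluate_steps)                          # train for one evaluation interval
    scores, _ = agent.test(build_test_env(), 10, args.render)   # then evaluate the current policy
    benchmark_scores.append({'steps': agent.train_steps, 'scores': scores})
    if np.mean(scores) > best_average_score:                    # keep the best checkpoint seen so far
        best_average_score = np.mean(scores)
        torch.save(agent.policy.state_dict(), "best_model.pth")
```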

**Step 10: Use tensorboard or wandb to visualize the training process.**
If you use wandb, we recommend running a wandb server locally to avoid network errors; to install and run wandb locally, follow the tutorial [here](https://github.com/wandb/server). If everything goes well, the wandb and tensorboard logging pages will look as follows:
*(Screenshots of the WandB and TensorBoard logging dashboards are embedded here in the rendered README.)*
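
If you choose the tensorboard logger in the configuration file, the event files should appear under the configured `logdir`, and you can browse them locally with the standard TensorBoard command (the path below assumes the default `logdir: "./logs/"`):
```
$ tensorboard --logdir ./logs/
```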
## Benchmark Results (ToDo) ##
The benchmark results for MuJoCo are provided below. More experimental results across different environments will be released in the near future.
The MuJoCo performance is evaluated with the best model obtained during training, and we report the average scores over 3 different random seeds.
We compare our performance with the published results and with the experiments from the [Tianshou benchmark](https://tianshou.readthedocs.io/en/master/tutorials/benchmark.html#mujoco-benchmark).
*(MuJoCo benchmark result figures are embedded here in the rendered README.)*
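
If you run `agent.benchmark(...)` yourself, the evaluation scores are also dumped to a timestamped `benchmark_*.npy` file under the logging directory as a list of `{'steps': ..., 'scores': ...}` records. Here is a small, illustrative way to summarize such files (the exact folder layout depends on your `logdir` and logger settings):
```
import glob
import numpy as np

for path in sorted(glob.glob("./logs/**/benchmark_*.npy", recursive=True)):
    records = np.load(path, allow_pickle=True)   # list of {'steps': ..., 'scores': [...]} records
    final = records[-1]                          # scores from the last evaluation round
    print("%s: steps=%d, mean=%.2f, std=%.2f" % (
        path, final['steps'], np.mean(final['scores']), np.std(final['scores'])))
```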
## Contributing ##
XuanCE is still under active development. More algorithms and features are going to be added, and contributions are always welcome.

## Citing XuanCE ##
If you use XuanCE in your work, please cite our paper:
```
@article{XuanPolicy2023,
    author = {Wenzhang Liu and Wenzhe Cai and Kun Jiang and Yuanda Wang and Guangran Cheng and Jiawei Wang and Jingyu Cao and Lele Xu and Chaoxu Mu and Changyin Sun},
    title = {XuanPolicy: A Comprehensive and Unified Deep Reinforcement Learning Library},
    year = {2023}
}
```
--------------------------------------------------------------------------------