├── README.md
├── visualizing.md
├── code
│   └── zero_sum_reward.py
├── graphs.md
├── learner_settings.md
├── rewards.md
├── intro.md
└── making_a_good_bot.md

/README.md:
--------------------------------------------------------------------------------
1 | # What is this guide about?
2 | 
3 | This guide will explain how to make your first ML Rocket League bot with RLGym-PPO, a nice and easy-to-use learning framework.
4 | I will both be explaining how to use the library, as well as how to make a bot in general.
5 | 
6 | *If you notice a mistake in this guide, let me know!*
7 | 
8 | ## Prerequisites
9 | 
10 | If you want to learn how to train ML bots, the only three requirements are:
11 | - Knowing how to code (preferably in Python)
12 | - Having a PC to train the bot on
13 | - Not giving up easily
14 | 
15 | This guide assumes you have some basic Python experience.
16 | If you are coming from another language, that's fine too, but you might need to google some basic stuff and watch a few tutorials along the way.
17 | 
18 | I won't hand-hold basic Python tasks like adding an import or making a function, nor will I explain what an argument or constructor is.
19 | If you don't know, google it!
20 | 
21 | You don't need any prior experience in machine learning.
22 | Feel free to skip ahead if I cover something you already understand.
23 | 
24 | ## Table of Contents
25 | 
26 | *Start with reading the intro.*
27 | 
28 | [Introduction](intro.md) <- *How to set up RLGym-PPO, and the basic concepts of training bots*
29 | 
30 | _____
31 | 
32 | [Learner Settings](learner_settings.md) <- *What the various learner settings do*
33 | 
34 | [Rewards](rewards.md) <- *How rewards work, and how to write them*
35 | 
36 | [Visualizing Your Bot](visualizing.md) <- *How to watch your bot play in a visualizer*
37 | 
38 | [Understanding The Graphs](graphs.md) <- *What the metric graphs mean*
39 | 
40 | [Making a Good Bot](making_a_good_bot.md) <- *How to make a bot that is actually good*
--------------------------------------------------------------------------------
/visualizing.md:
--------------------------------------------------------------------------------
1 | # Visualizing your bot
2 | 
3 | This short section will explain how to watch your bot play the game.
4 | 
5 | ## Enabling render mode
6 | 
7 | "Render mode" will have one of the environments send its state to a visualizer tool so you can see how your bot is playing.
8 | You can enable it via the `render` boolean in the [learner settings](learner_settings.md).
9 | 
10 | ## What's a visualizer?
11 | 
12 | RLGym-PPO uses rlgym-sim, which simulates Rocket League games using RocketSim. This means the actual game isn't running.
13 | So, if we want to watch our bot, we need a program that can render the game state in 3D.
14 | 
15 | Such programs for rendering RocketSim games are called **visualizers**.
16 | 
17 | There are multiple visualizers to choose from; the default one is VirxEC's https://github.com/VirxEC/rlviser/
18 | 
19 | Virx's visualizer can sometimes be troublesome to set up, so I wrote my own visualizer with the goal of being as easy to use as possible.
20 | You can find it, along with how to connect it to RLGym-PPO/`rlgym_sim`, here: https://github.com/ZealanL/RocketSimVis
21 | 
22 | ## Adjusting render delay
23 | 
24 | Render delay is the time between sending states to the renderer.
25 | Since the simulation runs at maximum speed, it will run way too fast for you to see what's going on.
26 | 27 | You can adjust the `render_delay` in the [learner settings](learner_settings.md). 28 | 29 | By default, the render delay is very small, so you will be watching your bot in 2x speed or whatever. 30 | 31 | To determine the render delay for normal speed, you should use the knowledge that: 32 | - A state is sent to the renderer each step 33 | - Each step is `tick_skip` ticks 34 | - There are 120 ticks in a second 35 | 36 | I recommend defining a constant for `TICK_SKIP`, and another constant called `STEP_TIME`, which is the time between steps, written in terms of `TICK_SKIP`. 37 | 38 | This constant will also be useful when tracking time in rewards and terminal conditions and such. 39 | 40 | _____ 41 | [Back to Table of Contents](README.md) -------------------------------------------------------------------------------- /code/zero_sum_reward.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from rlgym_sim.utils import RewardFunction 3 | from rlgym_sim.utils.gamestates import GameState, PlayerData 4 | 5 | ''' 6 | This is a wrapper to put around an existing reward 7 | Usage: ZeroSumReward(YourOtherReward(), team_spirit, [optional: opp_scale]) 8 | NOTE: Due to limitations in rlgym-sim, "previous_action" is not supported and will be "None" for the child reward 9 | ''' 10 | class ZeroSumReward(RewardFunction): 11 | ''' 12 | child_reward: The underlying reward function 13 | team_spirit: How much to share this reward with teammates (0-1) 14 | opp_scale: How to scale the penalty we get for the opponents getting this reward (usually 1) 15 | ''' 16 | def __init__(self, child_reward: RewardFunction, team_spirit, opp_scale = 1.0): 17 | self.child_reward = child_reward # type: RewardFunction 18 | self.team_spirit = team_spirit 19 | self.opp_scale = opp_scale 20 | 21 | self._update_next = True 22 | self._rewards_cache = {} 23 | 24 | def reset(self, initial_state: GameState): 25 | self.child_reward.reset(initial_state) 26 | 27 | def pre_step(self, state: GameState): 28 | self.child_reward.pre_step(state) 29 | 30 | # Mark the next get_reward call as being the first reward call of the step 31 | self._update_next = True 32 | 33 | def update(self, state: GameState, is_final): 34 | self._rewards_cache = {} 35 | 36 | ''' 37 | Each player's reward is calculated using this equation: 38 | reward = individual_rewards * (1-team_spirit) + avg_team_reward * team_spirit - avg_opp_reward * opp_scale 39 | ''' 40 | 41 | # Get the individual rewards from each player while also adding them to that team's reward list 42 | individual_rewards = {} 43 | team_reward_lists = [ [], [] ] 44 | for player in state.players: 45 | if is_final: 46 | reward = self.child_reward.get_final_reward(player, state, None) 47 | else: 48 | reward = self.child_reward.get_reward(player, state, None) 49 | individual_rewards[player.car_id] = reward 50 | team_reward_lists[int(player.team_num)].append(reward) 51 | 52 | # If a team has no players, add a single 0 to their team rewards so the average doesn't break 53 | for i in range(2): 54 | if len(team_reward_lists[i]) == 0: 55 | team_reward_lists[i].append(0) 56 | 57 | # Turn the team-sorted reward lists into averages for each time 58 | # Example: 59 | # Before: team_rewards = [ [1, 3], [4, 8] ] 60 | # After: team_rewards = [ 2, 6 ] 61 | team_rewards = np.average(team_reward_lists, 1) 62 | 63 | # Compute and cache: 64 | # reward = individual_rewards * (1-team_spirit) 65 | # + avg_team_reward * team_spirit 66 | # - avg_opp_reward * 
opp_scale 67 | for player in state.players: 68 | self._rewards_cache[player.car_id] = ( 69 | individual_rewards[player.car_id] * (1 - self.team_spirit) 70 | + team_rewards[int(player.team_num)] * self.team_spirit 71 | - team_rewards[1 - int(player.team_num)] * self.opp_scale 72 | ) 73 | 74 | ''' 75 | I made get_reward and get_final_reward both call get_reward_multi, using the "is_final" argument to distinguish 76 | Otherwise I would have to rewrite this function for final rewards, which is lame 77 | ''' 78 | def get_reward_multi(self, player: PlayerData, state: GameState, previous_action: np.ndarray, is_final) -> float: 79 | # If this is the first get_reward call this step, we need to update the rewards for all players 80 | if self._update_next: 81 | self.update(state, is_final) 82 | self._update_next = False 83 | return self._rewards_cache[player.car_id] 84 | 85 | def get_reward(self, player: PlayerData, state: GameState, previous_action: np.ndarray) -> float: 86 | return self.get_reward_multi(player, state, previous_action, False) 87 | 88 | def get_final_reward(self, player: PlayerData, state: GameState, previous_action: np.ndarray) -> float: 89 | return self.get_reward_multi(player, state, previous_action, True) -------------------------------------------------------------------------------- /graphs.md: -------------------------------------------------------------------------------- 1 | # Understanding the graphs 2 | 3 | > :warning: *This section is a W.I.P. and needs more review!* :warning: 4 | 5 | This section will explain to you what most of the default RLGym-PPO metric graphs mean, and how to interpret them. 6 | 7 | By default, RLGym-PPO logs metrics with wandb. If you have wandb enabled, you can view them at your bot's wandb link, which is posted in the console when you start a training session. 8 | 9 | Note that I will purposefully not elaborate on some of the more complicated stuff. I may add optional more-detailed explanations in the future. 10 | 11 | ## *Remember: Watch your bot!* 12 | 13 | Unless you have an ELO-tracking system (where different versions of your bot are versed against each other to determine skill), no graph is going to be as good as of an indicator of progress as watching your bot play. 14 | 15 | You shouldn't use vague graphs (like policy reward) to come to vague conclusions. Obsessing over weird changes and patterns in graphs like entropy or clip fraction will drive you insane and just waste your time. Totally normal learning often results in weird cyclic behavior in some graphs. Sometimes graphs suddenly shift for seemingly no reason at all. 16 | 17 | However, if a general graph completely dies or skyrockets to an insane value, something *is* probably broken. Just don't solely rely on the graphs to determine if there is a problem unless it's very obvious. 18 | 19 | ## Policy Reward 20 | ![image](https://github.com/ZealanL/RLGym-PPO-Guide/assets/36944229/cb480e81-38c4-488e-9b2a-f563257ef7ca) 21 | 22 | This graph shows the average of the total reward each player gets, per episode. 23 | 24 | This should increase a lot in the early stages, HOWEVER, please DO NOT assume that a decrease or plateau in this graph means your bot is not improving. 25 | 26 | The average episode reward is directly scaled by the length of each episode. 27 | At the beginning stages, your episodes are probably ending due to the timeout condition, so the average episode duration will increase if your bot starts hitting the ball. 
28 | However, as your bot starts hitting the ball, the goal-scored condition will become the primary episode-ender. Therefore, the more often your bot is scoring, the shorter the episodes, and the lower the "policy reward". 29 | 30 | From my experience, this graph will increase very strongly at the early stages of learning. 31 | Then, once the bot can hit the ball frequently, it will begin to flatten out or decrease, as goal scoring becomes prominent. 32 | 33 | Note that if you are using zero-sum rewards, this graph is basically useless, as the average total zero-sum reward over any period of time is just going to be zero. 34 | However, since zero-sum rewards are not helpful when the bots can't really reach the ball yet, this is still useful for tracking the progress of early learning. 35 | 36 | ## Policy Entropy 37 | ![image](https://github.com/ZealanL/RLGym-PPO-Guide/assets/36944229/a3974e70-30ae-4cb2-9cf6-3fad02a09fb3) 38 | 39 | This one's pretty cool. It shows how much variety there is in the actions of your bot, on average. This graph will directly scale with `ent_coef`, as well as what situations the bot is in. 40 | 41 | ## Value Function Loss 42 | ![image](https://github.com/ZealanL/RLGym-PPO-Guide/assets/36944229/832d5b31-3bef-4551-ad8e-8dfed9d90d45) 43 | 44 | This graph shows how much the critic is struggling to predict the rewards of the policy. 45 | The graph scales pretty consistently with how often event rewards (goals, demos, etc.) occur, as they are extremely difficult (and in many cases straight-up impossible) for the critic to predict in the future. 46 | 47 | It should decrease a lot at the very beginning, but then settle down to a "best guess" at some base loss, which will then fluctuate depending on what rewards your bot is getting. 48 | 49 | You will also see this graph immediately shift if you make significant changes to your rewards. 50 | 51 | ## Policy Update Magnitude/Value Function Update Magnitude 52 | ![image](https://github.com/ZealanL/RLGym-PPO-Guide/assets/36944229/d8726d44-dd0b-42a4-92dc-c0695adb03b7) 53 | 54 | These are the scale of changes made to the policy and critic each iteration. These directly scale with learning rate, and they both tend to immediately spike or shift with significant reward changes. 55 | 56 | ## SB3 Clip Fraction/Mean KL Divergence 57 | ![image](https://github.com/ZealanL/RLGym-PPO-Guide/assets/36944229/ebca338e-bd0b-407a-b5cc-3f4539a54301) 58 | 59 | These scale with the change in the policy each iteration (see "Policy Update Magnitude"). 60 | Many bot creators will adjust learning rate to keep one of these graphs near a certain value (I usually see people targeting ~0.08 as their clip fraction). 61 | 62 | ### *TODO: Add more graphs!* 63 | 64 | _____ 65 | [Back to Table of Contents](README.md) -------------------------------------------------------------------------------- /learner_settings.md: -------------------------------------------------------------------------------- 1 | # Learner settings 2 | 3 | Next, we will cover most of the settings for the `Learner`. The `Learner` is a Python class that runs all of the learning loop, and it has many settings for all sorts of things related to the learning process. 4 | 5 | Learner settings are set in the `Learner` constructor in `example.py`. 6 | ```py 7 | learner = Learner(build_rocketsim_env, 8 | n_proc=n_proc, 9 | min_inference_size=min_inference_size, 10 | metrics_logger=metrics_logger, 11 | ... 12 | ``` 13 | ⚠️ *Many of the settings set in `example.py` are bad, such as `ent_coef`. 
Please read about them and change them.*
14 | 
15 | Note that only some of the learner settings are being set here. To see all settings, go to [rlgym-ppo/learner.py](https://github.com/AechPro/rlgym-ppo/blob/main/rlgym_ppo/learner.py) and look at the constructor (`def __init__(...`).
16 | 
17 | ___
18 | `n_proc`: To make learning faster, multiple games are run simultaneously, each in its own Python process. This number controls how many Python processes are launched. I recommend increasing this number until you have max CPU usage, to get the best possible steps/second (you can check in Task Manager on Windows).
19 | ___
20 | `render`: This enables render mode, which slows down one of the games and sends it to a rendering program. By default, RLGym-PPO uses https://github.com/VirxEC/rlviser/.
21 | 
22 | You will turn this on when you want to watch your bot play, but don't leave it on, as it will slow down the learning.
23 | ___
24 | `render_delay`: The delay between sending steps to the renderer. Lowering this speeds up the game speed in render mode. I like to have this lower than real-time so I can watch the bot in ~2x speed.
25 | ___
26 | `timestep_limit`: How many timesteps until the learner automatically stops learning. I like to set this to a stupid big number, like `10e15` (10 quadrillion).
27 | ___
28 | `exp_buffer_size`: The learning phase doesn't just learn based on the most recent collected batch of steps, but also the previous few batches. This setting controls how big the buffer that stores all the steps is. I recommend setting this to `ts_per_iteration*2` or `ts_per_iteration*3`.
29 | ___
30 | `ts_per_iteration`: How many steps are collected each iteration. The optimal number is highly dependent on a number of factors, and you can play around with it to see what results in faster learning. I've found that values of `50_000` are good for early learning, but once the bot is actually hitting the ball, it should be increased to `100_000`. Once the bot is actually shooting and scoring, increase it to `200_000` or even `300_000`.
31 | ___
32 | `policy_layer_sizes`: How big your bot's **policy** is.
33 | 
34 | Each number is the width of a layer of neurons. By default, there are 3 layers, each with 256 neurons.
35 | 
36 | I haven't mentioned this yet for simplicity, but the learning algorithm actually uses two neural networks: a **policy** that actually plays the game, and a **critic** that predicts how much reward the policy will get.
37 | 
38 | The critic learns to predict the reward the policy will get in a given situation, and the policy learns to get reward that the critic didn't predict. This causes the bot to explore new methods of getting reward, resulting in better learning.
39 | 
40 | The default policy size is quite small; I would highly recommend increasing both policy and critic size until you start losing a ton of SPS (steps per second). In general, a bigger policy and critic will learn better.
41 | 
42 | My computer (with an RTX 3060 Ti) seems to run best on a policy/critic size of `[2048, 2048, 1024, 1024]`. More than that, and my SPS tanks. If you aren't using a GPU, your CPU is going to take ages to run a large network, so you might need to stick with a smaller one.
43 | 
44 | If you change this, you will need to reset your bot.
45 | ___
46 | `critic_layer_sizes`: This is usually set to the same sizes as the policy.
47 | 
48 | There is some evidence to suggest that the critic should be bigger, but this hasn't been thoroughly tested in Rocket League yet.
49 | 
50 | I recommend just making it the same size as the policy, unless you know what you are doing.
51 | 
52 | If you change this, you need to reset the critic. However, the critic doesn't play the game, so you can train a new critic on the same policy. You should set the policy's learning rate to 0 while you do this, so the noob critic doesn't screw up the policy while it is still figuring out what's happening.
53 | ___
54 | `ppo_epochs`: This is how many times the learning phase is repeated on the same batch of data. I recommend 2 or 3.
55 | 
56 | Play around with this and see what learns the fastest. Increasing this will lower your SPS because the learning step is repeated multiple times, but you will get better learning up until a certain point. When testing, compare the increase in rewards from a common starting point, not SPS.
57 | ___
58 | `ppo_batch_size`: Just set this to `ts_per_iteration`. This is the amount of data that the learning algorithm trains on each iteration.
59 | ___
60 | `ppo_minibatch_size`: This should be a small portion of `ppo_batch_size`; I recommend `25_000` or `50_000`. Data will be split into chunks of this size to conserve VRAM (your GPU's memory). This does not affect anything other than how fast the learning phase runs. Mess around and see what gives you the highest SPS.
61 | 
62 | If you aren't using a GPU, this isn't as important. I have no clue what the optimal value is for CPU learning. I'd guess something very big (RAM is usually bigger than VRAM), or something quite small (CPU cache size).
63 | ___
64 | `ent_coef`: This is the scale factor for entropy. Entropy fights against learning, pushing your bot to pick actions more randomly. This is useful because it forces the bot to try a larger variety of actions in all situations, which leads to better exploration.
65 | 
66 | The golden value for this seems to be about `0.01`.
67 | You can reduce this significantly to cause your bot to stop exploring the game and start refining what it already knows. Don't do this if you plan to continue training your bot after.
68 | ___
69 | `policy_lr`/`critic_lr`: The learning rate for the policy and critic. If you have experience in supervised ML, this means the same thing. I recommend keeping them the same unless you know what you're doing.
70 | 
71 | Bigger values increase how much the policy and critic change during learning.
72 | Too small, and you are wasting time. Too big, and the learning gets stuck jittering between different directions, unable to actually progress.
73 | 
74 | I generally start LR high, then slowly decrease it as the bot improves. If your bot seems stuck, try decreasing LR. Generally, the better the bot is, the lower the LR should be.
75 | 
76 | Here's what I've found to be ideal at different stages of learning:
77 | - Bot that can't score yet: `2e-4`
78 | - Bot that is actually trying to score on its opponent: `1e-4`
79 | - Bot that is learning outplay mechanics (dribbling and flicking, air dribbles, etc.): `0.8e-4` or lower
80 | 
81 | The optimal values for your bot will be different, though. This is a value you should play around with until you find something that seems optimal.
82 | ___
83 | `log_to_wandb`: Whether or not to log metrics to `wandb`.
84 | ___
85 | `checkpoints_save_folder`: Where to save checkpoints to.
86 | 
87 | Checkpoints are your bot's current policy and critic, as well as some associated info.
88 | 
89 | Checkpoints are used to save and load the bot.
90 | 91 | If you mess something up and your bot's brain gets fried, you will want to restore to an earlier checkpoint. 92 | 93 | I also save backups every day or so, because only so many checkpoints are stored, and sometimes all of them are fried without you realizing until after. 94 | ___ 95 | `checkpoint_load_folder`: The folder to load a SPECIFIC checkpoint from. 96 | 97 | This is not set by default, meaning the bot will not automatically re-load. 98 | 99 | It is easy to assume this will just load the most recent checkpoint, but no, it loads a specifically chosen checkpoint. 100 | 101 | I recommend you add a little code to load the most recent one and save yourself the effort. 102 | 103 | A neat little Python line for getting the name of the most recent checkpoint in a folder (written by Lamp I think?): 104 | ```py 105 | # Note: You MUST disable the "add_unix_timestamp" learner setting for this to work properly 106 | latest_checkpoint_dir = "data/checkpoints/rlgym-ppo-run/" + str(max(os.listdir("data/checkpoints/rlgym-ppo-run"), key=lambda d: int(d))) 107 | ```` 108 | ___ 109 | [Back to Table of Contents](README.md) 110 | -------------------------------------------------------------------------------- /rewards.md: -------------------------------------------------------------------------------- 1 | # Rewards 2 | 3 | This section covers what rewards do, and how to create your own rewards to get the behavior you want. 4 | 5 | ## What rewards do 6 | 7 | For every step of gameplay, there's a corresponding reward value: positive reward values are good, and negative reward values are bad. 8 | 9 | The learning algorithm (PPO) is always trying to maximize both current and future rewards. 10 | 11 | So, if you want your bot to do more of something, you can add a positive reward for doing that thing. 12 | 13 | If you want your bot to do less of something, you can add a negative reward (called a penalty) for doing that thing. 14 | 15 | ## Reward functions 16 | 17 | Reward functions are responsible for applying reward to a given state. They are run for each player, every step. 18 | This is where all of your reward logic will take place. 19 | 20 | NOTE: The reward function code that follows will assume you already have these imports: 21 | ```py 22 | import numpy as np # Import numpy, the python math library 23 | from rlgym_sim.utils import RewardFunction # Import the base RewardFunction class 24 | from rlgym_sim.utils.gamestates import GameState, PlayerData # Import game state stuff 25 | ``` 26 | 27 | ## A simple in-air reward 28 | 29 | Here, we will look at an example reward function that rewards the player for being in the air. 30 | 31 | ```py 32 | class InAirReward(RewardFunction): # We extend the class "RewardFunction" 33 | # Empty default constructor (required) 34 | def __init__(self): 35 | super().__init__() 36 | 37 | # Called when the game resets (i.e. after a goal is scored) 38 | def reset(self, initial_state: GameState): 39 | pass # Don't do anything when the game resets 40 | 41 | # Get the reward for a specific player, at the current state 42 | def get_reward(self, player: PlayerData, state: GameState, previous_action) -> float: 43 | 44 | # "player" is the current player we are getting the reward of 45 | # "state" is the current state of the game (ball, all players, etc.) 46 | # "previous_action" is the previous inputs of the player (throttle, steer, jump, boost, etc.) as an array 47 | 48 | if not player.on_ground: 49 | # We are in the air! 
Return full reward 50 | return 1 51 | else: 52 | # We are on ground, don't give any reward 53 | return 0 54 | ``` 55 | 56 | The actual reward logic here for the in-air reward is pretty simple: 57 | ```py 58 | if not player.on_ground: 59 | # We are in the air! Return full reward 60 | return 1 61 | else: 62 | # We are on ground, don't give any reward 63 | return 0 64 | ``` 65 | 66 | Here I use `on_ground`, a field of a player (players are `PlayerData`). 67 | You can browse the other player fields in `PlayerData`'s source code, here: https://github.com/AechPro/rocket-league-gym-sim/blob/main/rlgym_sim/utils/gamestates/player_data.py 68 | 69 | Note that all reward functions need to have: 70 | 1. A constructor 71 | 2. A function called when the game resets (after a terminal condition) 72 | 3. A function that returns the reward for a given player 73 | 74 | There are other functions you can also implement, like `pre_step()`, but these are the fundamental ones. 75 | 76 | Your reward functions should always output in a range up to 1, such as `[-1, 1]` or `[0, 1]`. 77 | 78 | ## A speed-toward-ball reward 79 | 80 | Now we will move on to a more advanced reward, involving the ball. 81 | 82 | We want our player to hit the ball, so we will reward it for having speed in the direction of the ball. 83 | 84 | ```py 85 | # Import CAR_MAX_SPEED from common game values 86 | from rlgym_sim.utils.common_values import CAR_MAX_SPEED 87 | 88 | class SpeedTowardBallReward(RewardFunction): 89 | # Default constructor 90 | def __init__(self): 91 | super().__init__() 92 | 93 | # Do nothing on game reset 94 | def reset(self, initial_state: GameState): 95 | pass 96 | 97 | # Get the reward for a specific player, at the current state 98 | def get_reward(self, player: PlayerData, state: GameState, previous_action: np.ndarray) -> float: 99 | # Velocity of our player 100 | player_vel = player.car_data.linear_velocity 101 | 102 | # Difference in position between our player and the ball 103 | # When getting the change needed to reach B from A, we can use the formula: (B - A) 104 | pos_diff = (state.ball.position - player.car_data.position) 105 | 106 | # Determine the distance to the ball 107 | # The distance is just the length of pos_diff 108 | dist_to_ball = np.linalg.norm(pos_diff) 109 | 110 | # We will now normalize our pos_diff vector, so that it has a length/magnitude of 1 111 | # This will give us the direction to the ball, instead of the difference in position 112 | # Normalizing a vector can be done by dividing the vector by its length 113 | dir_to_ball = pos_diff / dist_to_ball 114 | 115 | # Use a dot product to determine how much of our velocity is in this direction 116 | # Note that this will go negative when we are going away from the ball 117 | speed_toward_ball = np.dot(player_vel, dir_to_ball) 118 | 119 | if speed_toward_ball > 0: 120 | # We are moving toward the ball at a speed of "speed_toward_ball" 121 | # The maximum speed we can move toward the ball is the maximum car speed 122 | # We want to return a reward from 0 to 1, so we need to divide our "speed_toward_ball" by the max player speed 123 | reward = speed_toward_ball / CAR_MAX_SPEED 124 | return reward 125 | else: 126 | # We are not moving toward the ball 127 | # Many good behaviors require moving away from the ball, so I highly recommend you don't punish moving away 128 | # We'll just not give any reward 129 | return 0 130 | ``` 131 | 132 | *NOTE: From my testing, the `SpeedTowardBallReward` above is better than the default `VelocityPlayerToBallReward`, 
because it does not go negative.* 133 | 134 | ## Using multiple rewards 135 | 136 | The `rlgym_sim` environment requires a single reward function. To allow multiple rewards in a single function, there exists `CombinedReward`. 137 | This reward function runs a set of rewards, multiplies each reward by a corresponding weight, and sums them together as the total reward. 138 | 139 | You can see `CombinedReward` is already being used in `example.py` by default: 140 | ```py 141 | rewards_to_combine = (VelocityPlayerToBallReward(), 142 | VelocityBallToGoalReward(), 143 | EventReward(team_goal=1, concede=-1, demo=0.1)) 144 | reward_weights = (0.01, 0.1, 10.0) 145 | 146 | reward_fn = CombinedReward(reward_functions=rewards_to_combine, 147 | reward_weights=reward_weights) 148 | ``` 149 | 150 | Each reward in the `rewards_to_combine` has a corresponding weight in the `reward_weights` tuple. A good reason for each reward function to output from -1 to 1 is so that its maximum absolute output is determined by its weight. 151 | 152 | To demonstrate how to add reward functions and weights, I'll add our new `InAirReward`, and replace `VelocityPlayerToBallReward` with our new `SpeedTowardBallReward` like so: 153 | 154 | ```py 155 | rewards_to_combine = ( # I like to break open the parentheses like this 156 | InAirReward(), 157 | SpeedTowardBallReward(), 158 | VelocityBallToGoalReward(), 159 | EventReward(team_goal=1, concede=-1, demo=0.1) 160 | ) 161 | reward_weights = (0.002, 0.01, 0.1, 10.0) 162 | 163 | reward_fn = CombinedReward(reward_functions=rewards_to_combine, reward_weights=reward_weights) 164 | ``` 165 | 166 | Personally, I find having to keep track of the weight list separately very annoying. 167 | Thankfully, `CombinedReward` has a static function called `from_zipped()`, which takes in pairs of reward functions and weights in one big list. This allows you to do the following: 168 | 169 | ```py 170 | 171 | reward_fn = CombinedReward.from_zipped( 172 | # Format is (func, weight) 173 | (InAirReward(), 0.002), 174 | (SpeedTowardBallReward(), 0.01), 175 | (VelocityBallToGoalReward(), 0.1), 176 | (EventReward(team_goal=1, concede=-1, demo=0.1), 10.0) 177 | ) 178 | ``` 179 | 180 | ## Zero-sum rewards 181 | 182 | > :warning: *This section is a W.I.P. and needs more review!* :warning: 183 | 184 | If you are rewarding a player for doing something is good, it only makes sense to equally punish the opponent. 185 | 186 | Every good bot I have seen has used zero-sum rewards, either partially or completely. 187 | 188 | A zero-sum reward can be implemented with the following logic: `player_reward = self_reward - opponent_reward` 189 | 190 | My commented implementation of a zero-sum reward wrapper can be found here: [zero_sum_reward.py](code/zero_sum_reward.py) 191 | 192 | ### To zero-sum, or not to zero-sum 193 | 194 | Some bot creators make every reward zero-sum, but I believe this is overkill, and from my testing, making certain rewards zero-sum slows down learning. 
195 | 
196 | Based on this testing, my philosophy is the following: **A reward should only be zero-sum if it is beneficial for the opponent to prevent it.**
197 | 
198 | Examples of things that the opponent should be trying to prevent:
199 | - Bumps/demos
200 | - Flip resets
201 | - Strong powershots
202 | - Collecting boost
203 | - Having speed
204 | 
205 | Examples of rewards that the opponent shouldn't worry about preventing as much:
206 | - Speed flips
207 | - Air roll in air
208 | - Air reward/ground penalty
209 | 
210 | Most rewards tend to benefit from being zero-sum, but some rewards, such as those for gameplay/movement tuning, shouldn't be.
211 | 
212 | If a reward is made zero-sum when it shouldn't be, it will add noise to the overall reward of every player.
213 | This noise makes other reward signals weaker and will slow down learning.
214 | 
215 | I also recommend not using zero-sum rewards on specific behaviors like flip resets until your bot is skilled enough to start defending against them.
216 | 
217 | ### Team rewards and "team spirit"
218 | 
219 | Another thing zero-sum rewards do is distribute rewards between teammates.
220 | 
221 | If you are training a bot in a team mode, and not using zero-sum rewards, you may notice the bot fighting over the ball with its teammates.
222 | This obviously is not good teamplay!
223 | 
224 | The solution is to share reward among teammates.
225 | 
226 | Team spirit systems do this through a setting called `team_spirit`, which ranges from 0 to 1.
227 | 
228 | At `team_spirit = 0`, no reward is shared between teammates.
229 | At `team_spirit = 1`, all reward is shared between teammates, with each player's reward being the average of their team's rewards.
230 | 
231 | When training a bot, you generally want to start `team_spirit` very low, and slowly increase it as the bot improves, until it reaches `1`.
232 | 
233 | Having high team spirit too early in training will greatly slow down learning.
234 | This is because early learning mostly focuses on individual behaviors, not team ones.
235 | 
236 | Adding team spirit to zero-sum reward logic gives us this:
237 | ```py
238 | avg_team_reward = ... # Average reward of everyone on our team (including us)
239 | 
240 | avg_opp_reward = ... # Average reward of all opponents
241 | 
242 | # As team spirit increases, our own reward will be replaced by our team's average reward
243 | # After that, we then just subtract the average opponent reward
244 | player_reward = (self_reward * (1 - team_spirit)) + (avg_team_reward * team_spirit) - avg_opp_reward
245 | ```
246 | _____
247 | [Back to Table of Contents](README.md)
--------------------------------------------------------------------------------
/intro.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 | 
3 | ## Installing RLGym-PPO and rlgym-sim:
4 | Here are the steps to install everything needed:
5 | *Skip a step if you already have the thing!*
6 | 1. Install [Python](https://www.python.org/downloads/) (make sure to add it to your PATH/environment variables)
7 | 2. Install [Git](https://git-scm.com/downloads) (you can just click through the install with all of the default settings)
8 | 3. Install the `RocketSim` package with `pip install rocketsim`
9 | 4. Install the `rlgym_sim` package with `pip install git+https://github.com/AechPro/rocket-league-gym-sim@main`
10 | 5. 
[Download the asset dumper](https://github.com/ZealanL/RLArenaCollisionDumper/releases/tag/v1.0.0) and [follow its usage instructions](https://github.com/ZealanL/RLArenaCollisionDumper/blob/main/README.md) to make the `collision_meshes` folder (we will move this later) 11 | 6. If you have an NVIDIA GPU, install [CUDA v11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive) 12 | 7. Install PyTorch from [its website](https://pytorch.org/get-started/locally/) (if you installed CUDA, select the CUDA version, otherwise select CPU) 13 | 8. Make a folder for your bot 14 | 9. Install RLGym-PPO with `pip install git+https://github.com/AechPro/rlgym-ppo` 15 | 10. Steal [example.py](https://github.com/AechPro/rlgym-ppo/blob/main/example.py) from RLGym-PPO and add it to your bot folder 16 | 11. Move `collision_meshes` to your bot folder 17 | 18 | ### Wait, where is Rocket League involved? 19 | RLGym-PPO uses rlgym-sim, which is a version of RLGym that runs on a simulated version of Rocket League, without actually running the game itself. This means that you can use RLGym-PPO on non-windows platforms, without having Rocket League, and can also collect data from the game much faster. 20 | 21 | ## Actually running your bot 22 | Once you have installed RLGym-PPO, you can run your bot by running `example.py` (you should do this through a command prompt, instead of double-clicking). 23 | 24 | This will start training the bot, and will report its results to *wandb*, a data platform that you can use to see graphs of various info as your bot trains. It will also print out a big list of stuff into the console, which I will elaborate on more in the next section. 25 | 26 | ## The basics of the training loop 27 | Training is a process of: 28 | - **Collection**: The bot collects data from the environment (i.e. the bot plays the game at super-speed). Each data point during gameplay is called a **step**. 29 | - **Learning**: The learning algorithm uses those collected steps to update the brain of the bot. 30 | 31 | Every time this cycle of learning happens it is called an **iteration**. After each iteration, the bot will print out a report, which will look something like this: 32 | 33 | ``` 34 | --------BEGIN ITERATION REPORT-------- 35 | Policy Reward: 0.03805 36 | Policy Entropy: 0.80701 37 | Value Function Loss: 0.01235 38 | 39 | Mean KL Divergence: 0.00279 40 | SB3 Clip Fraction: 0.02373 41 | Policy Update Magnitude: 0.07114 42 | Value Function Update Magnitude: 0.13493 43 | 44 | Collected Steps per Second: 8,581.13182 45 | Overall Steps per Second: 5,977.26097 46 | 47 | Timestep Collection Time: 4.31789 48 | Timestep Consumption Time: 2.25241 49 | PPO Batch Consumption Time: 0.11061 50 | Total Iteration Time: 5.57030 51 | 52 | Cumulative Model Updates: 28 53 | Cumulative Timesteps: 550,208 54 | 55 | Timesteps Collected: 50,006 56 | --------END ITERATION REPORT-------- 57 | ``` 58 | 59 | Some of these terms are complicated and require lots of learning about ML to understand, so I'll just cover the simpler ones. 60 | 61 | `Policy Reward`: This is how much reward the bot has collected, on average, in each **episode**. 62 | An **episode** is a small piece of gameplay that ends once a certain condition is met, such as a goal being scored. 63 | 64 | `Collected steps per second`: This is how many **steps** of gameplay are being collected every second. The better computer you have, the higher this number will be. 65 | 66 | A **step** is a tiny portion of time in the game. 
Rocket League physics runs at 120 FPS, and each of these physics frames is called a tick. A step is a handful of ticks, 8 by default. Since `120/8 = 15`, this means your bot is running at 15 updates a second.
67 | 
68 | We'll talk more about this later.
69 | 
70 | `Overall steps per second`: Collected steps/sec is just for the collection phase, but what about the learning phase? Overall steps/sec includes the time it takes to learn, after collecting all the data. If you want to know how quickly your bot is running in general, this is the number to look at.
71 | 
72 | `Timestep Collection Time`: How long it took to collect all of the steps.
73 | 
74 | `Timestep Consumption Time`: How long it took to "consume" (organize and learn from) all of those steps.
75 | 
76 | `Cumulative Timesteps`: This is the total number of steps collected during your bot's lifetime. If you plan on making a good bot, expect to see numbers in the tens of billions someday!
77 | 
78 | *Fun fact: If your bot steps 15 times a second, 1 billion steps is ~18,500 hours of Rocket League. Bots do not go outside.*
79 | 
80 | ## Ok, but how does it learn to play RL?
81 | 
82 | When your bot first starts learning, it has absolutely no idea what is going on. Its brain starts off completely random, so it just mashes random inputs.
83 | 
84 | In order to get our bot to actually learn something, it needs to have rewards. Rewards are things that happen in the game that you want the bot to do more or less often. The learning algorithm is designed to try to maximize how much reward the bot is getting, while also exploring new ways to get even more rewards.
85 | 
86 | Rewards can be both positive and negative, and are ultimately just numbers assigned to specific steps. Positive rewards encourage, negative rewards punish.
87 | 
88 | By default, your bot has 3 rewards:
89 | - `VelocityPlayerToBallReward()`: Positive reward for moving towards the ball, negative reward for moving away
90 | - `VelocityBallToGoalReward()`: Positive reward for having the ball move towards the opponent's goal, negative reward for having the ball move towards your own goal
91 | - `EventReward()`: Positive reward for scoring, negative reward for conceding (getting scored on), and a smaller reward for getting a demo
92 | 
93 | *You can find these rewards in `example.py`.*
94 | 
95 | As your bot continues to mash random inputs, it will accidentally trigger a reward or punishment. Then, during the learning phase, the learning algorithm will try to adjust the bot's brain such that it is more/less likely to do the things that were rewarded/punished.
96 | 
97 | ### Wait, why don't we just have a goal reward?
98 | If you think about it, the only thing that ultimately matters for winning a game of Rocket League is scoring and not getting scored on, so why do we need other rewards?
99 | 
100 | Well, the answer is that bots are not very smart compared to humans. They can't plan out what they want to do, nor do they know what a car is, nor have they ever heard of balls or goals before. They need a lot more specific encouragement to learn how to move towards the ball, hit the ball, collect boost, and so on.
101 | 
102 | A lot of the difficulty of making a bot is creating rewards that encourage the bot to do what you want, without limiting its ability to explore other options.
103 | 
104 | Also, for future reference, when I say **resetting the bot**, I mean resetting all learning back to nothing. Resetting the bot is a good choice for a number of reasons, and usually occurs when the bot is not improving, or a significant change needs to be made that would break the current bot.
105 | 
106 | ## Diving into actually modifying our bot
107 | 
108 | So, now that you know some of the fundamental ideas, let's start actually messing with stuff.
109 | 
110 | If you haven't already, open `example.py` in the Python editor of your choice.
111 | 
112 | ### Action parser
113 | 
114 | The first thing I recommend changing is the action parser.
115 | This is set at the line:
116 | 
117 | ```python
118 | action_parser = ContinuousAction()
119 | ```
120 | 
121 | **Continuous actions** mean the bot can use any partial input, which allows for more precise control. However, this is more difficult to use, and I do not recommend it as your first action parser.
122 | 
123 | An **action** is the combination of controller inputs the bot presses (throttle, steer, jump, boost, etc.), and an action parser converts the outputs of the bot's brain into these controller inputs.
124 | 
125 | Most bots use a **discrete action** parser, which separates every useful permutation of inputs into its own box, and the bot can control the car by picking a specific box of inputs.
126 | 
127 | Now before you go ahead and swap out `ContinuousAction` with `DiscreteAction`, beware:
128 | `DiscreteAction` is actually `MultiDiscrete`, which is not what I described.
129 | The fully-discrete[*] action parser is called `LookupAction`, and is not included by the library by default.
130 | 
131 | You can find it here: https://github.com/RLGym/rlgym-tools/blob/main/rlgym_tools/extra_action_parsers/lookup_act.py
132 | 
133 | Since action parsers define how your bot controls the car, changing it usually means resetting the bot.
134 | 
135 | 
136 | ### Rewards and weights
137 | 
138 | Down below the list of rewards is a list of numbers.
139 | These numbers are the weights of each reward, which is how intensely they will influence the bot. `VelocityPlayerToBallReward()` is the lowest, `VelocityBallToGoalReward()` is 10x more influential, and `EventReward()` is 100x more influential than that.
140 | 
141 | **Event-type rewards** are rewards that activate once when a specific thing happens. They are usually important game events, like hitting the ball, shooting, scoring, and so on.
142 | 
143 | **Continuous-type rewards** are rewards that are active *while* something is happening, and thus can run for many steps in a row. Since they happen so often, they are inherently stronger than event rewards, and usually should have far less weight.
144 | 
145 | `VelocityPlayerToBallReward` and `VelocityBallToGoalReward` are continuous rewards, whereas `EventReward` is.. well.. yeah.
146 | 
147 | Inside the constructor to `EventReward()` are sub-weights for different events. Each event will be multiplied by its sub-weight, then multiplied by the reward's weight after. Scoring is `1 * 10 = 10` reward total, whereas demos are only a tenth the reward of scoring, `0.1 * 10 = 1` reward total.
148 | 
149 | All rewards are eventually normalized in the learning algorithm (unless you specifically turn that off, which you probably shouldn't). This means that what actually matters is how rewards are weighted *in relation* to other rewards.
150 | 151 | I recommend that you: 152 | - Increase `VelocityPlayerToBallReward` a bit (it's very important in the early stages) 153 | - Add `FaceBallReward` with a small weight (this will reward your bot for facing the ball, which is very helpful in the early stages of learning) 154 | 155 | ### Obs builder 156 | 157 | **Obs** is short for observation, and it is how your bot perceives the game. An **obs builder** converts the current **state** of the game into inputs to your bot's brain. 158 | 159 | The default obs builder is a decent starting point. I've found you can get moderately better results if you add car-relative positions and velocities, but that's a bit more advanced. 160 | 161 | Making obs builders is quite tedious, and also very very high-risk. If you slightly mess something up, it could make your bot unable to play the game, or even worse, play the game but poorly (which is worse, because it is harder to tell that something is wrong). 162 | 163 | Since the obs defines the input to your bot's brain, changing obs usually means resetting the bot. 164 | 165 | ### Number of players 166 | 167 | Two variables control the amount of players in each game. 168 | 169 | ```python 170 | spawn_opponents = True 171 | team_size = 1 172 | ``` 173 | 174 | `spawn_opponents` means that each game will have an orange player for every blue player. 175 | You probably want that, unless you want to train something very specific that doesn't involve other players. 176 | 177 | `team_size` is the number of players on each team. I recommend starting with the default of `1`, as making bots that can play with teammates has added challenges. You can also have a bot that plays all modes (1v1, 2v2, and 3v3), but that is more advanced. 178 | 179 | ### Terminal conditions 180 | ```py 181 | terminal_conditions = [NoTouchTimeoutCondition(timeout_ticks), GoalScoredCondition()] 182 | ``` 183 | 184 | These are a set of conditions that define when episodes end. If any of these conditions trigger, the episode is over. Basically all bots have a `GoalScoredCondition` (because otherwise the ball is going to stay inside the goal... which is weird), as well as a `NoTouchTimeoutCondition`. 185 | 186 | The `NoTouchTimeoutCondition` ends the episode if no player has touched the ball for a certain amount of time (10 seconds by default). This is helpful, especially in the early stages, and prevents you from wasting time collecting a ton of data on two motionless bots who aren't doing anything or are stuck upside-down. 187 | 188 | ### State setters 189 | 190 | Once the game is terminal, it needs to be reset. By default, it will be reset to kickoff. 191 | 192 | However, for beginner bots, kickoff is usually not the best state setter. 193 | 194 | I recommend using the `RandomState` state setter, especially in the early stages of training. 195 | Its constructor takes 3 arguments: `ball_rand_speed`, `cars_rand_speed`, and `cars_on_ground`. 196 | I recommend you use `(True, True, False)`, as this will make the cars and ball start at a random location with random velocities. 197 | The cars will also spawn airborne half of the time, meaning they will quickly learn how to somewhat orient themselves in the air, too. 198 | 199 | The state setter is an argument of `rlgym_sim.make()`, within your `build_rocketsim_env()`. 
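To make that concrete, here is a minimal sketch of plugging the recommended `RandomState(True, True, False)` into `build_rocketsim_env()`. That the state setter is an argument of `rlgym_sim.make()` comes from the paragraph above; the import path for `RandomState` (`rlgym_sim.utils.state_setters`) is an assumption based on rlgym-sim mirroring standard RLGym, so check it against your installed version. The other arguments shown are ones `example.py` already passes, trimmed for brevity:

```py
import rlgym_sim
from rlgym_sim.utils.state_setters import RandomState  # assumed import path, mirroring standard RLGym

def build_rocketsim_env():
    # Random ball/car positions and velocities; cars may also spawn in the air
    state_setter = RandomState(True, True, False)

    # Keep your existing reward_fn, obs_builder, action_parser, and terminal_conditions
    # arguments from example.py here as well; they are omitted to keep the sketch short.
    env = rlgym_sim.make(tick_skip=8,
                         team_size=1,
                         spawn_opponents=True,
                         state_setter=state_setter)
    return env
```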
200 | 
201 | _____
202 | [Back to Table of Contents](README.md)
203 | 
--------------------------------------------------------------------------------
/making_a_good_bot.md:
--------------------------------------------------------------------------------
1 | # Making a Good Bot
2 | 
3 | > :warning: *This section is a W.I.P. and needs more review!* :warning:
4 | 
5 | While most of this guide explains how to make a bot, this section is specifically intended to teach you how to make a *good* bot.
6 | This is, of course, my personal experience as well as what I have learned from testing and speaking with other bot creators.
7 | *If you believe some part of this section is wrong or misleading, please let me know so I can fix it!*
8 | 
9 | This section aims to provide enough general and specific guidance to allow a dedicated bot creator to train a bot from nothing to GC.
10 | 
11 | I will be referencing the [rewards section](rewards.md), [learner settings section](learner_settings.md), and [graphs section](graphs.md) frequently.
12 | 
13 | # Early Stages (Bronze - Silver)
14 | 
15 | There are differing opinions on what "early stages" mean, but personally I define the early stages of training as the period of time before your bot is actually trying to score.
16 | Bots in the early stages cannot yet push or shoot the ball into the goal.
17 | 
18 | In this stage, you primarily should focus your bot on these 2 tasks:
19 | 1. Learn to touch the ball
20 | 2. Don't forget how to jump
21 | 
22 | ### Why do bots forget how to jump?
23 | 
24 | Controlling your car in the air is hard and is a lot less forgiving than driving on the ground.
25 | A fresh bot simply tasked with reaching the ball will learn this very quickly, and will stop pressing jump altogether.
26 | However, jumping is very obviously useful, and it can be very difficult for the bot to rediscover jumping later on.
27 | To combat this, most bots use some sort of reward for being in the air, or a penalty for being on the ground.
28 | 
29 | An alternative is to add more jump actions to a discrete action parser.
30 | This will make a higher portion of possible actions have jumping, and will greatly increase jump usage in early stages.
31 | Doubling the jump actions seems to be enough to eliminate the need for air rewards or ground penalties.
32 | 
33 | ### What rewards should I use in the early stages?
34 | 
35 | From my testing, here are some good rewards to get a fresh bot to learn to hit the ball as quickly as possible:
36 | ```py
37 | # Format: (reward, weight)
38 | rewards = (
39 |     (EventReward(touch=1), 50), # Giant reward for actually hitting the ball
40 |     (SpeedTowardBallReward(), 5), # Move towards the ball!
41 |     (FaceBallReward(), 1), # Make sure we don't start driving backward at the ball
42 |     (AirReward(), 0.15) # Make sure we don't forget how to jump
43 | )
44 | # NOTE: SpeedTowardBallReward and AirReward (written as InAirReward there) can be found in the rewards section of this guide
45 | ```
46 | 
47 | Notice how I didn't include any rewards for scoring, or even moving the ball toward the goal.
48 | Having these rewards before the bot is capable of actually hitting the ball just adds lots of noise to the overall reward and will slow learning.
49 | 
50 | I recommend using a learning rate of around `2e-4` for the early stages.
51 | 
52 | After running these rewards for a few dozen million steps, your bot should be hitting the ball quite frequently.
53 | 
54 | If your bot stops jumping, increase the `AirReward`!
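If you want to drop these into `example.py`, here is a minimal sketch using the `CombinedReward.from_zipped()` helper shown in the [rewards section](rewards.md). The `EventReward`/`FaceBallReward` import path assumes rlgym-sim's usual `common_rewards` module, and `rewards` is a hypothetical local file where you have saved `SpeedTowardBallReward` and `InAirReward` (the "AirReward" above) from that section:

```py
from rlgym_sim.utils.reward_functions import CombinedReward
from rlgym_sim.utils.reward_functions.common_rewards import EventReward, FaceBallReward

# Hypothetical local module containing the custom rewards written in the rewards section
from rewards import SpeedTowardBallReward, InAirReward

# Early-stage reward setup, matching the (reward, weight) pairs above
reward_fn = CombinedReward.from_zipped(
    (EventReward(touch=1), 50),    # Giant reward for actually hitting the ball
    (SpeedTowardBallReward(), 5),  # Move towards the ball!
    (FaceBallReward(), 1),         # Don't drive backward at the ball
    (InAirReward(), 0.15)          # Don't forget how to jump
)
```

This `reward_fn` then replaces the default one passed to `rlgym_sim.make()` in `build_rocketsim_env()`.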
55 | 56 | ### Learning to score 57 | 58 | Once your bot is capable of hitting the ball, you should introduce rewards for moving the ball to the goal and scoring. 59 | You should also decrease the `TouchBallReward` a lot so that it is no longer the bot's top priority. 60 | 61 | I recommend using `VelocityBallToGoalReward` as the continuous scoring encouragement, it should be a fair bit stronger than `SpeedTowardBallReward`. 62 | 63 | ### Don't give massive goal rewards! 64 | 65 | Many people are inclined to add a goal reward with massive weight to the bot, like this: 66 | ```py 67 | reward = ( 68 | (EventReward(team_goal=1, concede=-1), 100), 69 | (VelocityBallToGoalReward(), 2), 70 | (SpeedTowardBallReward(), 1), 71 | (FaceBallReward(), 0.1), 72 | ... 73 | ) 74 | ``` 75 | 76 | **Don't do this!** 77 | 78 | This *feels* like it makes sense because goals are the most important thing in the game. 79 | However, from my testing and experience, adding massive goal rewards early on in training simply slows down learning and decreases exploration. 80 | 81 | A giant goal reward will drown out every other reward you have. Pick a more reasonable weight like `20`, in this instance. 82 | 83 | I have trained a bot to element level without the use of any goal rewards. It is less important than you might expect. 84 | 85 | # Middle Stages (Gold - Plat) 86 | 87 | Once your bot is capable of pushing the ball into the net, you are now in the "middle stages". 88 | This stage is more complex and more difficult to get right. 89 | 90 | There are several different things you probably want your bot to learn in this stage: 91 | - Basic shots 92 | - Basic saves 93 | - Basic jump-touches and baby aerials 94 | - Basic 50s 95 | - Collecting boost and keeping 96 | - Giving space to teammates (if your bot isn't 1v1-only) 97 | 98 | Some of these behaviors will develop naturally if given enough time, whereas others are much harder for bots to discover on their own. 99 | 100 | Also, you'll generally want to decrease LR to around `1e-4` now that you're out of the early stages. Watch your clip fraction! 101 | 102 | ### A better ball-touch reward 103 | 104 | The default `touch` part of `EventReward` is not very good once your bot can touch the ball. 105 | This is because ball touches can easily be farmed by constantly pushing the ball, instead of shooting or cutting it. 106 | 107 | A substantial improvement to this reward is to **scale the reward with the strength of the touch**. 108 | This means that a slight push that barely changes the velocity of the ball will give almost no reward, but a strong shot will give lots of reward. 109 | 110 | I won't provide a copy-pasteable reward for this, as you should try to write this reward on your own. Here are some hints: 111 | - `ball_touched` is a property of players, and is `True` if they hit the ball since the previous step 112 | - You should use the ball's current velocity and previous velocity to calculate the change in velocity 113 | 114 | ### A good air-touch reward 115 | 116 | Bots find the air scary, so they usually need some strong encouragement to actually hit the ball in the air, especially when it is up high. 117 | 118 | A basic air touch reward just rewards the player for how high the ball is: 119 | ```py 120 | reward = ball.position[2] / CommonValues.CEILING_Z 121 | ``` 122 | 123 | However, the bot will usually learn to just hit the ball off of the wall high-up to get this reward with minimal effort--usually in the lame "plat wall-shot" way, not the cool "air dribble" way. 
124 | A solution is to track the amount of time the player has spent in the air (or get it from RocketSim's `air_time`), and to combine that with the height scaling. 125 | 126 | ```py 127 | MAX_TIME_IN_AIR = 1.75 # A rough estimate of the maximum reasonable aerial time 128 | air_time_frac = min(player.air_time, MAX_TIME_IN_AIR) / MAX_TIME_IN_AIR 129 | height_frac = ball.position[2] / CommonValues.CEILING_Z 130 | reward = min(air_time_frac, height_frac) 131 | ``` 132 | 133 | ### Get boost, and don't waste it this time! 134 | 135 | Bots will sort of discover picking up boost if given enough time, but are generally pretty wasteful once they have it. 136 | 137 | I always recommend a general `SaveBoostReward`, which rewards the player based on how much boost they have. 138 | ```py 139 | reward = sqrt(player.boost_amount) 140 | ``` 141 | 142 | Note that I am using `sqrt()` here. 143 | The `sqrt()` effectively makes boost more important the less you have (within this reward), which is just a fact of Rocket League. 144 | Going from 0 boost to 50 boost is more useful than going from 50 boost to 100 boost (remember that boost ranges from 0-1 in RLGym stuff). 145 | 146 | If your bot is wasting boost, increase the `SaveBoostReward`. If your bot is hogging boost and is afraid to use it, decrease the reward. 147 | 148 | For picking up boost, `EventReward`'s `boost_pickup` is a good starting point. 149 | However, bots have a tendency to ignore small pads, so I recommend making the small pad pickup reward much stronger than just 12% of the big boost pickup reward (via a custom reward). 150 | 151 | ### Developing outplays 152 | 153 | One of the key skills that will bring your bot into the later stages is the ability to outplay. 154 | This means learning a mechanic that can change the direction of the ball to get past a challenging opponent. 155 | 156 | The most common way bots do this is with dribbling and flicking, a behavior that comes quite naturally to bots. However, this is not the only way to outplay opponents. 157 | Cuts, passes, air dribbles, and flip resets are also very strong outplays--however most of those mechanics typically aren't discovered on their own. 158 | If there's a particular mechanic you would like your bot to perform to outplay, I recommend creating and adding a reward for that mechanic in these stages. 159 | 160 | # Later Stages (Diamond+) 161 | 162 | Once your bot is capable of outplaying opponents, collecting boost, saving, shooting, and other fundamental game mechanics, I consider it to be in the later stages. 163 | Bots entering these stages are usually around diamond rank (although its 1v1 rank may be plat, as 1v1 ranks are more difficult). 164 | 165 | ## How do I know if my bot is improving? 166 | 167 | The better your bot gets, the slower it will improve. This is the nature of pretty much any improving thing, and bots are no exception. 168 | 169 | In the early stages, it is blatantly obvious if your bot is improving or not. 170 | In the middle stages, it is less obvious, but repeated observation usually makes it clear. 171 | However, in the later stages, it can be very difficult to tell. 172 | 173 | Since the bot is playing against itself, an increase in reward, goals, or most other metrics do imply a change in gameplay, but not improvement. 174 | 175 | Luckily, the PPO learning algorithm is pretty good at its job and generally doesn't get worse at a task unless you mess something up really bad. 

### Developing outplays

One of the key skills that will bring your bot into the later stages is the ability to outplay.
This means learning a mechanic that can change the direction of the ball to get past a challenging opponent.

The most common way bots do this is with dribbling and flicking, a behavior that comes quite naturally to bots. However, this is not the only way to outplay opponents.
Cuts, passes, air dribbles, and flip resets are also very strong outplays--however, most of those mechanics typically aren't discovered on their own.
If there's a particular mechanic you would like your bot to use for outplays, I recommend creating and adding a reward for that mechanic in these stages.

# Later Stages (Diamond+)

Once your bot is capable of outplaying opponents, collecting boost, saving, shooting, and other fundamental game mechanics, I consider it to be in the later stages.
Bots entering these stages are usually around diamond rank (although their 1v1 rank may be plat, as 1v1 ranks are more difficult).

## How do I know if my bot is improving?

The better your bot gets, the slower it will improve. This is the nature of pretty much anything that improves, and bots are no exception.

In the early stages, it is blatantly obvious whether your bot is improving or not.
In the middle stages, it is less obvious, but repeated observation usually makes it clear.
However, in the later stages, it can be very difficult to tell.

Since the bot is playing against itself, an increase in reward, goals, or most other metrics does imply a change in gameplay, but not necessarily an improvement.

Luckily, the PPO learning algorithm is pretty good at its job and generally doesn't get worse at a task unless you mess something up really badly.
However, the learning algorithm is trying to maximize and explore your rewards, not to win games.
So, if your rewards are farmable in a way that does not improve your bot's skill, your bot will likely get worse at the game.

### Objectively measuring improvement

The bot training framework [rocket-learn](https://github.com/Rolv-Arild/rocket-learn) uses an Elo-like rating system to track the skill of the bot against previous versions,
so you can actually measure how much the bot is improving. I do plan on implementing such a system for rlgym-ppo in both C++ and Python, but I haven't gotten around to it yet.

If you want to manually test improvement, you can pit your current bot against an older version and see who comes out on top.
If you are doing this in RLBot, you will have to wait a while, as it takes many goals before you can begin to tell whether the bot has improved.

For my first rlgym-ppo bot, I wrote a little Python script that used rlgym-sim to run an infinite game between two versions of my bot.
This could run far faster than real-time (unlike RLBot), so I was quickly able to see whether the bot was actually improving as the scoreline rapidly climbed.

## Nextoification

Nexto mostly used very general and gentle rewards, and while there was definitely some deliberate influence on playstyle and mechanics (such as aerials), it's a good example of how bots *want* to play.

Generally, the less specifically you reward your bot, the more its playstyle will resemble Nexto's.
Nexto's passive dribble-flick playstyle with mostly forward flicks seems to be a natural evolution of basic ballchasing behavior.

### Natural Dribbling and Flicking

Dribbling and flicking will generally be discovered by most bots with basic rewards, even without any reward for dribbling or flicking.

Almost all bots have a far faster reaction time than humans. Nexto, with a `tick_skip` of `8`, can react to something in as little as 67 milliseconds--whereas humans take around 200-300ms.
This makes dribbling far easier, as the bot just has to react to the ball falling off the edge of its car by accelerating or turning that way, instead of having to predict how the ball will move.
Dribbling is just a natural evolution of pushing the ball in that sense.

This also makes flicking over opponents far easier, as the bot can wait until the very last moment to flick, unlike humans, who have to guess or flick much earlier to avoid being successfully challenged.

You can still use some slight encouragement to start dribbling if you don't want to wait for it to evolve, or if you have big rewards for doing something different (like air dribbling).

### Natural Passiveness

While Nexto's dribbles and flicks are extremely precise and effective, one of its main flaws as a bot is that it is very passive.
If you are persistent, Nexto will often give you the ball for free and wander away to its goal, applying no pressure and allowing you to set up a strong outplay with ease.
This is a flaw you will notice in most bots. They are unwilling to take risks and be aggressive when given the option. They would rather wait to save than fight to score.

I believe there are two main reasons why passiveness is so natural for bots:
1. Having a faster reaction time makes being passive more viable (you can react faster to threats)
2. It is simply easier to be passive than to be aggressive, because being aggressive requires more predictive decision-making

Personally, I am not a fan of passiveness in bots. In all of my bots, I have taken many steps to encourage and promote riskier, more aggressive play.
This allowed my bots to discover much stronger plays and better defense as a result.

But *how* do you promote aggression? My most general solution to this problem is to **just decrease the concede penalty**.
It seems like basic logic that your penalty for conceding (getting scored on) should just be the negative of the goal reward. After all, getting scored on is just as bad as scoring is good.
However, as you may have already learned, just because something is theoretically correct doesn't mean it is optimal for training bots.

I introduce what I call `aggression_bias`, which is the portion of the concede penalty that is removed to promote aggression.
At `aggression_bias = 0.25`, the penalty for conceding is 25% less than the reward for scoring.
You can define the concede reward using `aggression_bias` like so: `concede_reward = -goal_reward * (1 - aggression_bias)`.

I generally use an `aggression_bias` of around `0.2` in my bots, but I frequently change it depending on how aggressively the bot is playing.
If the `aggression_bias` isn't enough, you may want to add strong rewards for challenging a play, and a penalty for there being no player on a given team near the ball.
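
To make that concrete, here is a hypothetical reward layout in the same style as the goal-reward snippet earlier, with an `aggression_bias` of `0.2` applied to the concede penalty. The specific weights are just placeholders:

```py
AGGRESSION_BIAS = 0.2
GOAL_REWARD_WEIGHT = 20

reward = (
    # The concede penalty is (1 - AGGRESSION_BIAS) of the goal reward, not the full amount
    (EventReward(team_goal=1, concede=-(1 - AGGRESSION_BIAS)), GOAL_REWARD_WEIGHT),
    (VelocityBallToGoalReward(), 2),
    (SpeedTowardBallReward(), 1),
    ...
)
```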

## Letting it cook

In these later stages, it is more important to let the bot slowly explore and improve on its own.
Sometimes you will see no changes on the graphs, but that often means the bot is slowly improving at everything rather than changing how it plays.

A good amount of patience is required to get a high-level bot.

_____
[Back to Table of Contents](README.md)
--------------------------------------------------------------------------------