├── .gitignore
├── .scalafmt.conf
├── README.md
├── build-site.sh
├── build.sbt
├── data
│   └── pacman
│       └── Q.json
├── gridworld.html
├── index.html
├── pacman.html
├── polecart-human.html
├── polecart-qlearning.html
├── project
│   ├── build.properties
│   └── plugins.sbt
└── src
    └── main
        └── scala
            └── rl
                ├── core
                │   ├── ActionResult.scala
                │   ├── AgentBehaviour.scala
                │   ├── Environment.scala
                │   ├── QLearning.scala
                │   ├── StateConversion.scala
                │   └── package.scala
                ├── gridworld
                │   ├── core
                │   │   └── GridworldProblem.scala
                │   └── ui
                │       └── GridworldUI.scala
                ├── pacman
                │   ├── core
                │   │   └── PacmanProblem.scala
                │   ├── training
                │   │   ├── PacmanTraining.scala
                │   │   └── QKeyValue.scala
                │   └── ui
                │       └── PacmanUI.scala
                └── polecart
                    ├── core
                    │   └── PoleBalancingProblem.scala
                    └── ui
                        ├── HumanUI.scala
                        └── QLearningUI.scala

/.gitignore:
--------------------------------------------------------------------------------
pacman-training/
--------------------------------------------------------------------------------
/.scalafmt.conf:
--------------------------------------------------------------------------------
align = true
maxColumn = 100
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Reinforcement Learning in Scala

This repo contains the source code for the demos to accompany my talk
'Reinforcement Learning in Scala'.

The slides are available
[here](https://slides.com/cb372/reinforcement-learning-in-scala).

The demos are available [here](https://cb372.github.io/rl-in-scala/).

## Running locally

The demos are implemented using Scala.js, so first you need to build the
JavaScript:

```
$ sbt fastOptJS
```

Next, start a simple web server of your choice. I use the Python one:

```
$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
```

Finally, open the site in your browser:

```
$ open localhost:8000
```

## Pacman training

If you'd like to try your hand at making the Pacman agent smarter, the expected
workflow looks something like this:

1. Update
   [PacmanProblem.scala](src/main/scala/rl/pacman/core/PacmanProblem.scala) to
   improve the agent's state space, making it a more efficient learner.

2. Run the training harness:

   ```
   $ sbt run
   ```

   This will make the agent play a very large number of games of Pacman. It
   will run forever. Every 1 million time steps it will print out some stats to
   give an indication of the agent's learning progress. Every 5 million time
   steps it will write the agent's Q-values to a JSON file in the
   `pacman-training` directory.

3. Once you have Q-values you are happy with, copy the JSON file to
   `data/pacman/Q.json`, overwriting the existing file.

4. Follow the steps above for running locally. Open the Pacman UI in your
   browser and watch your trained agent show those ghosts who's boss!

### Hints

If you make your state space too large, you'll have a number of problems:

* Your JSON file will probably be large enough to crash your browser when the
  UI tries to load it.

* The agent will learn very slowly because it needs to explore so many states.

So the trick is to find a way of encoding enough information about the game
state without the number of states exploding. For example, if you were to track
the exact locations of Pacman and both ghosts, you would already have
65 x 65 x 65 = 274,625 states to deal with.

Your state encoding should also make sense when combined with the reward
function. For example, the environment gives a reward when Pacman eats food, so
intuitively the state should track food in some way.

If your agent is struggling to win games, you could try:

* Making the ghosts move more randomly by reducing their `smartMoveProb`

* Making a smaller grid, maybe with only one ghost
--------------------------------------------------------------------------------
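To make the hint about state-space size concrete, here is a minimal sketch of the kind of compact state encoding the exercise is asking for. Every name below (`CompactState`, `Direction`, the four features) is hypothetical and invented for illustration; the real types live in `PacmanProblem.scala` and the actual encoding is yours to design.

```scala
// Hypothetical sketch only: a coarse Pacman state that keeps the state count small.
// None of these types exist in the repo; PacmanProblem.scala defines the real ones.
object CompactStateSketch {

  sealed trait Direction
  case object Up    extends Direction
  case object Down  extends Direction
  case object Left  extends Direction
  case object Right extends Direction

  // Instead of exact (x, y) positions for Pacman and both ghosts (65 x 65 x 65 states
  // before you even consider food), track only coarse, decision-relevant features.
  final case class CompactState(
      nearestGhostDirection: Direction, // 4 values
      nearestGhostIsClose: Boolean,     // 2 values
      nearestFoodDirection: Direction,  // 4 values
      ghostsAreVulnerable: Boolean      // 2 values
  )

  // 4 * 2 * 4 * 2 = 64 states: small enough to explore quickly and to keep Q.json tiny.
  val stateCount: Int = 4 * 2 * 4 * 2

  def main(args: Array[String]): Unit =
    println(s"compact state space size: $stateCount")
}
```

The trade-off, as the Hints above note, is that the encoding must still carry enough signal for the reward function to make sense (hence the food-related feature in this sketch).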
/build-site.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash

set -e

sbt clean fullOptJS

echo "Copying reinforcement-learning-in-scala-opt.js"
mkdir -p ../rl-in-scala/js
cp target/scala-2.12/reinforcement-learning-in-scala-opt.js ../rl-in-scala/js

for file in *.html; do
  echo "Copying $file"
  sed -e "s/target\/scala-2.12\/reinforcement-learning-in-scala-fastopt.js/js\/reinforcement-learning-in-scala-opt.js/" $file > ../rl-in-scala/$file
done

echo "Copying Pacman data dir"
mkdir -p ../rl-in-scala/data
cp -R data/pacman ../rl-in-scala/data
--------------------------------------------------------------------------------
/build.sbt:
--------------------------------------------------------------------------------
scalaVersion := "2.12.6"
enablePlugins(ScalaJSPlugin)

//scalaJSUseMainModuleInitializer := true
libraryDependencies ++= Seq(
  "org.scala-js" %%% "scalajs-dom" % "0.9.6",
  "io.circe" %%% "circe-generic" % "0.10.1",
  "io.circe" %%% "circe-parser" % "0.10.1"
)
--------------------------------------------------------------------------------
/data/pacman/Q.json:
--------------------------------------------------------------------------------
[]
--------------------------------------------------------------------------------
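`data/pacman/Q.json` ships as an empty array and is overwritten with trained Q-values as described in the README. The actual JSON schema is whatever the training harness and `QKeyValue.scala` define, so the snippet below is only a hypothetical illustration of decoding such a file with circe (the JSON library declared in build.sbt); the `QEntry` field names are invented.

```scala
// Hypothetical sketch: decoding a Q-value dump with circe.
// The real schema is defined by the repo (see QKeyValue.scala); the QEntry shape
// below is invented purely for illustration.
import io.circe.generic.auto._
import io.circe.parser.decode

final case class QEntry(state: String, action: String, value: Double)

object LoadQSketch {
  def main(args: Array[String]): Unit = {
    val json = """[ { "state": "s0", "action": "Left", "value": 0.5 } ]"""

    decode[List[QEntry]](json) match {
      case Right(entries) => println(s"loaded ${entries.size} Q entries")
      case Left(error)    => println(s"failed to parse Q.json: $error")
    }
  }
}
```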
/gridworld.html:
--------------------------------------------------------------------------------
[Page markup largely lost in extraction. The recoverable content is: the grid with
cells labelled "A" (→ jump to A', reward = 10), "B" (→ jump to B', reward = 5), "-1"
and "0", followed by two empty 5x5 tables that are presumably filled in by the UI at
runtime (the Q-value and policy tables described on the index page below).]
--------------------------------------------------------------------------------
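All three demos use the same tabular Q-learning algorithm with ε-greedy exploration, as the index page below explains. As a reference point for reading those descriptions, here is a minimal, self-contained sketch of the update rule and the ε-greedy choice. The names are illustrative only and do not mirror the repo's rl.core API (QLearning.scala, AgentBehaviour.scala).

```scala
// Illustrative only: a minimal tabular Q-learning update with ε-greedy action
// selection. Names are hypothetical; see the repo's rl.core package for the
// real implementation.
import scala.util.Random

object QLearningSketch {

  type Q[S, A] = Map[(S, A), Double]

  // ε-greedy: explore with probability ε, otherwise exploit the best-known action.
  def chooseAction[S, A](q: Q[S, A], state: S, actions: List[A], epsilon: Double, rng: Random): A =
    if (rng.nextDouble() < epsilon) actions(rng.nextInt(actions.size))
    else actions.maxBy(a => q.getOrElse((state, a), 0.0))

  // One-step Q-learning update:
  // Q(s, a) <- Q(s, a) + α * (r + γ * max_a' Q(s', a') - Q(s, a))
  def update[S, A](q: Q[S, A], s: S, a: A, reward: Double, s2: S, actions: List[A],
                   alpha: Double = 0.1, gamma: Double = 0.9): Q[S, A] = {
    val oldValue = q.getOrElse((s, a), 0.0)
    val nextBest = actions.map(a2 => q.getOrElse((s2, a2), 0.0)).max
    val newValue = oldValue + alpha * (reward + gamma * nextBest - oldValue)
    q.updated((s, a), newValue)
  }
}
```

Each observed transition (s, a, r, s') nudges Q(s, a) towards r + γ·max over a' of Q(s', a'), where α is the learning rate and γ the discount factor.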
/index.html:
--------------------------------------------------------------------------------
[Page markup lost in extraction; the recoverable text content follows.]

This site contains the demos for my 'Reinforcement Learning in Scala' talk.

The slides for the talk are available at
https://slides.com/cb372/reinforcement-learning-in-scala.

The source code for all the demos is available on GitHub.

There are 3 demos, all of which use the same RL algorithm, known as Q-learning.

[Gridworld]

This is a continuous (non-episodic) problem with very simple rules:

* If the agent is in the special cell A and moves in any direction, it jumps to A'
  and gets a reward of 10.
* If the agent is in the special cell B and moves in any direction, it jumps to B'
  and gets a reward of 5.

Of course, the optimal policy is to always move towards A in order to pick up the
reward of 10. If you run the demo, you should see the agent gradually learn this
policy.

It may get stuck in a local optimum (i.e. preferring the B cell) for a while, but it
is guaranteed to eventually converge on the optimal policy. This is because the agent
constantly explores the state space using the ε-greedy algorithm.

The big table under the grid shows the agent's current Q(s, a) for all state-action
pairs. This is the agent's estimate of the value of being in state s and taking
action a.

The smaller table shows the same information summarised as a policy. In other words,
for a given state, it shows which action(s) the agent currently believes to be best.

[Pole balancing]

This episodic problem is a classic in the RL literature.

At every time step the agent must push the cart either to the left or the right. The
goal is to stop the pole from toppling too far either to the left or the right,
whilst also ensuring the cart does not crash into the walls.

The rules are as follows:

[list lost in extraction]

It's fascinating to see how quickly the agent learns, especially bearing in mind:

[list lost in extraction]

To get a feel for the problem, you might want to try it yourself first. Use the Left
and Right arrow keys on your keyboard to move the cart.

Next you can watch the agent learn. Use the buttons to run through a single time
step, a single episode or continuously.

[Pacman]

This one is an exercise for the reader.

The demo shows a very "dumb" agent. Its state space is enormous, so it has no chance
of doing any meaningful learning.

See if you can improve the agent by redesigning its state space and putting it
through some training.

Take a look at the README for more details.
--------------------------------------------------------------------------------
---|---|---|---|---|---|---|---|---|---|---|
Pole velocity | 105 | 106 |Fast left | 107 |Slow | 108 |Fast right | 109 | 110 |Fast left | 111 |Slow | 112 |Fast right | 113 | 114 |Fast left | 115 |Slow | 116 |Fast right | 117 ||
Pole angle | 122 |Very left | 123 | 124 |125 | | 126 | | 127 | | 128 | | 129 | | 130 | | 131 | | 132 | | 133 | |
Quite left | 137 | 138 |139 | | 140 | | 141 | | 142 | | 143 | | 144 | | 145 | | 146 | | 147 | | |
Slightly left | 151 | 152 |153 | | 154 | | 155 | | 156 | | 157 | | 158 | | 159 | | 160 | | 161 | | |
Slightly right | 165 | 166 |167 | | 168 | | 169 | | 170 | | 171 | | 172 | | 173 | | 174 | | 175 | | |
Quite right | 179 | 180 |181 | | 182 | | 183 | | 184 | | 185 | | 186 | | 187 | | 188 | | 189 | | |
Very right | 193 | 194 |195 | | 196 | | 197 | | 198 | | 199 | | 200 | | 201 | | 202 | | 203 | |
Cart velocity | 213 | 214 |Fast left | 215 |Slow | 216 |Fast right | 217 ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Pole velocity | 220 | 221 |Fast left | 222 |Slow | 223 |Fast right | 224 | 225 |Fast left | 226 |Slow | 227 |Fast right | 228 | 229 |Fast left | 230 |Slow | 231 |Fast right | 232 ||
Pole angle | 237 |Very left | 238 | 239 |240 | | 241 | | 242 | | 243 | | 244 | | 245 | | 246 | | 247 | | 248 | |
Quite left | 252 | 253 |254 | | 255 | | 256 | | 257 | | 258 | | 259 | | 260 | | 261 | | 262 | | |
Slightly left | 266 | 267 |268 | | 269 | | 270 | | 271 | | 272 | | 273 | | 274 | | 275 | | 276 | | |
Slightly right | 280 | 281 |282 | | 283 | | 284 | | 285 | | 286 | | 287 | | 288 | | 289 | | 290 | | |
Quite right | 294 | 295 |296 | | 297 | | 298 | | 299 | | 300 | | 301 | | 302 | | 303 | | 304 | | |
Very right | 308 | 309 |310 | | 311 | | 312 | | 313 | | 314 | | 315 | | 316 | | 317 | | 318 | |
Cart velocity | 328 | 329 |Fast left | 330 |Slow | 331 |Fast right | 332 ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Pole velocity | 335 | 336 |Fast left | 337 |Slow | 338 |Fast right | 339 | 340 |Fast left | 341 |Slow | 342 |Fast right | 343 | 344 |Fast left | 345 |Slow | 346 |Fast right | 347 ||
Pole angle | 352 |Very left | 353 | 354 |355 | | 356 | | 357 | | 358 | | 359 | | 360 | | 361 | | 362 | | 363 | |
Quite left | 367 | 368 |369 | | 370 | | 371 | | 372 | | 373 | | 374 | | 375 | | 376 | | 377 | | |
Slightly left | 381 | 382 |383 | | 384 | | 385 | | 386 | | 387 | | 388 | | 389 | | 390 | | 391 | | |
Slightly right | 395 | 396 |397 | | 398 | | 399 | | 400 | | 401 | | 402 | | 403 | | 404 | | 405 | | |
Quite right | 409 | 410 |411 | | 412 | | 413 | | 414 | | 415 | | 416 | | 417 | | 418 | | 419 | | |
Very right | 423 | 424 |425 | | 426 | | 427 | | 428 | | 429 | | 430 | | 431 | | 432 | | 433 | |
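The bucket labels above imply that the continuous cart-pole state is discretised into a small number of coarse buckets before Q-learning is applied. The sketch below shows one hypothetical way to do that kind of bucketing; the thresholds, units and names are invented and are not taken from PoleBalancingProblem.scala or StateConversion.scala.

```scala
// Hypothetical sketch of discretising a continuous cart-pole state into the kind of
// coarse buckets shown in the tables above. Thresholds and names are invented; the
// repo's real mapping lives in its polecart code.
object PoleCartDiscretisationSketch {

  // Raw, continuous physics state (units are illustrative).
  final case class PhysicsState(cartVelocity: Double, poleAngle: Double, poleVelocity: Double)

  sealed trait VelocityBucket
  case object FastLeft  extends VelocityBucket
  case object Slow      extends VelocityBucket
  case object FastRight extends VelocityBucket

  sealed trait AngleBucket
  case object VeryLeft      extends AngleBucket
  case object QuiteLeft     extends AngleBucket
  case object SlightlyLeft  extends AngleBucket
  case object SlightlyRight extends AngleBucket
  case object QuiteRight    extends AngleBucket
  case object VeryRight     extends AngleBucket

  final case class BucketedState(cart: VelocityBucket, pole: VelocityBucket, angle: AngleBucket)

  private def bucketVelocity(v: Double): VelocityBucket =
    if (v < -0.5) FastLeft else if (v > 0.5) FastRight else Slow

  private def bucketAngle(radians: Double): AngleBucket =
    if (radians < -0.2) VeryLeft
    else if (radians < -0.1) QuiteLeft
    else if (radians < 0.0) SlightlyLeft
    else if (radians < 0.1) SlightlyRight
    else if (radians < 0.2) QuiteRight
    else VeryRight

  // 3 x 3 x 6 = 54 discrete states, matching the table layout above.
  def discretise(s: PhysicsState): BucketedState =
    BucketedState(bucketVelocity(s.cartVelocity), bucketVelocity(s.poleVelocity), bucketAngle(s.poleAngle))
}
```

Coarse buckets like these are what make a tabular Q(s, a) feasible for an otherwise continuous problem: 54 states times two actions is a table small enough to display on the page.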