├── paper ├── rldm.pdf ├── rldm.out ├── rldm.aux ├── rldmsubmit.sty └── rldm.tex ├── 13-recommended-courses └── README.md ├── 12-recommended-books └── README.md ├── 11-conclusion └── README.md ├── LICENSE ├── .gitignore ├── 09-cooperative-and-adversarial-agents └── README.md ├── Dockerfile ├── 08-single-and-multiple-agents └── README.md ├── 03-deterministic-and-stochastic-actions └── README.md ├── 10-decision-making-and-humans └── README.md ├── 06-discrete-and-continuous-actions └── README.md ├── 07-observable-and-partially-observable-states └── README.md ├── 01-introduction-to-decision-making └── README.md ├── 05-discrete-and-continuous-states └── README.md ├── 04-known-and-unknown-environments └── README.md ├── README.md ├── 02-sequential-decisions └── README.md └── notebooks ├── 02-dynamic-programming.ipynb ├── solutions ├── 02-dynamic-programming.ipynb └── 03-planning-algorithms.ipynb └── 03-planning-algorithms.ipynb /paper/rldm.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mimoralea/applied-reinforcement-learning/HEAD/paper/rldm.pdf -------------------------------------------------------------------------------- /13-recommended-courses/README.md: -------------------------------------------------------------------------------- 1 | ### 13. Recommended Courses 2 | 3 | * [Reinforcement Learning and Decision Making by Michael Littman and Charles Isbell](https://in.udacity.com/course/reinforcement-learning--ud600/) 4 | * [Reinforcement Learning by David Silver](https://www.youtube.com/playlist?list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT) 5 | * [Deep Reinforcement Learning by Sergey Levine, Chelsea Finn, et al.](https://www.youtube.com/playlist?list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX) 6 | -------------------------------------------------------------------------------- /12-recommended-books/README.md: -------------------------------------------------------------------------------- 1 | ### 12. Recommended Books 2 | 3 | * Reinforcement Learning State-of-the-Art by Marco Wiering et al. 4 | * Probabilistic Robotics by Sebastian Thrun et al. 5 | * Decision Making Under Uncertainty by Mykel J. Kochenderfer 6 | * Neuro-Dynamic Programming by Dimitri P. Bertsekas et al. 7 | * Statistical Reinforcement Learning by Masashi Sugiyama 8 | * Markov Decision Processes by Martin L. Puterman 9 | * Approximate Dynamic Programming by Warren B. Powell 10 | * Reinforcement Learning and Dynamic Programming Using Function Approximators by Lucian Busoniu et al. 11 | * Optimal Control Theory by Donald E. Kirk 12 | * Dynamic Programming by Richard Bellman 13 | * Dynamic Programming and Optimal Control Vol I by Dimitri P. Bertsekas 14 | * Dynamic Programming and Optimal Control Vol II by Dimitri P. Bertsekas 15 | * Reinforcement Learning: An Introduction by Richard S. Sutton et al. 16 | -------------------------------------------------------------------------------- /11-conclusion/README.md: -------------------------------------------------------------------------------- 1 | ### 11. Conclusion 2 | 3 | Reinforcement Learning is such an exciting field. Some people would argue that this field is the 4 | true "artificial intelligence". This project should provide you with a good foundation for 5 | understanding reinforcement learning. We, not only gave our intuitive take on each of the concepts 6 | introduced but also provided you with problems that you could do on your own to gain a 7 | deeper, hands-on understanding of the concepts discussed. 
Also, we pointed at several papers to 8 | further your knowledge in each of the areas presented. 9 | We really hope that this serves you well. Also, remember that if you want to help to make this 10 | series better, everything from a typo, to another notebook, or maybe an entire section, you are 11 | welcome to open a pull-request and I'll take a look and accept it if it helps with this project's 12 | objective. To give an intuitive and hands-on perspective of reinforcement learning and decision-making. 13 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Miguel Morales 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /paper/rldm.out: -------------------------------------------------------------------------------- 1 | \BOOKMARK [1][-]{section.1}{Introduction}{}% 1 2 | \BOOKMARK [1][-]{section.2}{Sparking Curiosity}{}% 2 3 | \BOOKMARK [2][-]{subsection.2.1}{Using Simple And Direct Language}{section.2}% 3 4 | \BOOKMARK [2][-]{subsection.2.2}{Keeping A Single Narrative}{section.2}% 4 5 | \BOOKMARK [2][-]{subsection.2.3}{Showing Concepts And Their Complement}{section.2}% 5 6 | \BOOKMARK [1][-]{section.3}{Removing Friction}{}% 6 7 | \BOOKMARK [2][-]{subsection.3.1}{Setting Up A Convenient Environment}{section.3}% 7 8 | \BOOKMARK [2][-]{subsection.3.2}{Providing With Boilerplate Code}{section.3}% 8 9 | \BOOKMARK [2][-]{subsection.3.3}{Asking For Minimal Effort}{section.3}% 9 10 | \BOOKMARK [1][-]{section.4}{Showing Options}{}% 10 11 | \BOOKMARK [2][-]{subsection.4.1}{Assigning Relevant Readings}{section.4}% 11 12 | \BOOKMARK [2][-]{subsection.4.2}{Watching Academic Lectures}{section.4}% 12 13 | \BOOKMARK [2][-]{subsection.4.3}{Completing Homework and Projects}{section.4}% 13 14 | \BOOKMARK [1][-]{section.5}{Future Work}{}% 14 15 | \BOOKMARK [2][-]{subsection.5.1}{Additional Notebooks}{section.5}% 15 16 | \BOOKMARK [2][-]{subsection.5.2}{Effectiveness Evaluation}{section.5}% 16 17 | \BOOKMARK [2][-]{subsection.5.3}{Request For Feedback}{section.5}% 17 18 | \BOOKMARK [1][-]{section.6}{Conclusion}{}% 18 19 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | notebooks/solutions/* 92 | -------------------------------------------------------------------------------- /09-cooperative-and-adversarial-agents/README.md: -------------------------------------------------------------------------------- 1 | ### 9. 
Cooperative and Adversarial Agents 2 | 3 | #### 9.1 Agents with conflicting objectives 4 | 5 | Going one step further, agents on an environment could actually have conflicting goals. Once we 6 | starting taking into account multiple agents competing for the same objectives, the field of game 7 | theory becomes important. Game Theory and reinforcement learning are the two fundamental fields 8 | of multi-agent reinforcement learning. When agents have opposing goals, there is probably no clear 9 | optimal solution and an equilibrium among the agents need to search for. For this, lots of game 10 | theory come to play. 11 | 12 | #### 9.2 Teams of agents with conflicting objectives 13 | 14 | Finally, we can think of worlds in which teams of agents compete against other teams for conflicting 15 | objectives. The RoboCup soccer is a well-known example of this type of environments. Make sure to check the 16 | recommended readings below, and work the RoboCup soccer 17 | provided by OpenAI Gym. 18 | 19 | #### 9.3 Further Reading 20 | 21 | * [Adversarial Reinforcement Learning](http://www.cs.cmu.edu/~mmv/papers/03TR-advRL.pdf) 22 | * [Markov games as a framework for multi-agent reinforcement learning](https://www.cs.rutgers.edu/~mlittman/papers/ml94-final.pdf) 23 | * [Value-function reinforcement learning in Markov games](http://www.sts.rpi.edu/~rsun/si-mal/article3.pdf) 24 | * [Learning To Play the Game of Chess](https://papers.nips.cc/paper/1007-learning-to-play-the-game-of-chess.pdf) 25 | * [Mastering the Game of Go with Deep Neural Networks and Tree Search](https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf) 26 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM jupyter/tensorflow-notebook 2 | MAINTAINER Miguel Morales 3 | USER root 4 | 5 | # update ubuntu installation 6 | RUN apt-get update -y 7 | RUN apt-get install -y --no-install-recommends apt-utils 8 | RUN apt-get upgrade -y 9 | 10 | # install dependencies 11 | RUN apt-get install -y libav-tools python3 ipython3 python3-pip python3-dev python3-opengl 12 | RUN apt-get install -y libpq-dev libjpeg-dev libboost-all-dev libsdl2-dev 13 | RUN apt-get install -y curl cmake swig wget unzip git xpra xvfb flex 14 | RUN apt-get install -y libav-tools fluidsynth build-essential qt-sdk 15 | 16 | # clean up 17 | RUN apt-get clean && rm -rf /var/lib/apt/lists/* 18 | 19 | # jupyter notebook 20 | EXPOSE 8888 21 | 22 | # tensorboard 23 | EXPOSE 6006 24 | 25 | # switch back to user 26 | USER $NB_USER 27 | 28 | # install necessary packages 29 | RUN pip3 install --upgrade pip 30 | RUN pip3 install numpy scikit-learn scipy pyglet setuptools pygame 31 | RUN pip3 install gym tensorflow keras asciinema pandas 32 | RUN pip3 install git+https://github.com/openai/gym-soccer.git@master 33 | RUN pip3 install git+https://github.com/lusob/gym-ple.git@master 34 | RUN pip3 install git+https://github.com/ntasfi/PyGame-Learning-Environment.git@master 35 | #git clone https://github.com/ntasfi/PyGame-Learning-Environment.git 36 | 37 | # create a script to start the notebook with xvfb on the back 38 | # this allows screen display to work well 39 | RUN echo '#!/bin/bash' > /tmp/run.sh && \ 40 | echo "nohup sh -c 'tensorboard --logdir=/mnt/notebooks/logs' > /dev/null 2>&1 &" >> /tmp/run.sh && \ 41 | echo 'xvfb-run -s "-screen 0 1280x720x24" /usr/local/bin/start-notebook.sh' >> /tmp/run.sh && \ 42 | chmod +x /tmp/run.sh 43 | 44 | # move 
notebooks into container 45 | # ADD notebooks /mnt/notebooks 46 | 47 | # make the dir with notebooks the working dir 48 | WORKDIR /mnt/notebooks 49 | 50 | # run the script to start the notebook 51 | ENTRYPOINT ["/tmp/run.sh"] 52 | -------------------------------------------------------------------------------- /08-single-and-multiple-agents/README.md: -------------------------------------------------------------------------------- 1 | ## Part IV: Multiple Decision-Making Agents 2 | 3 | ### 8. Single and Multiple Agents 4 | 5 | #### 8.1 Agents with same objectives 6 | 7 | The methods for reinforcement learning that we have seen so far relate to single agents taking decisions on an environment. We can think, however, on a slightly different kind of problem in which multiple agents 8 | jointly act on the same environment trying to maximize a common reward signal. Such environment could be robotics, networking, economics, auctions, etc. Often time, the algorithms discussed up until now would 9 | potentially fail in such environments. The problem is that in these kinds of environments, the control of 10 | the agents is decentralized and therefore it requires coordination and cooperation to maximize the reward 11 | signal. 12 | 13 | Even though decentralizing the decision-making adds considerable complexity, the need for a multi-agent system 14 | for some problems is real. Often a centralized approach is just not possible, perhaps due to physical constraints, for example, a network routing system being decentralized, or a team of robots with shared objectives but independent processing capabilities. So, the methods of decentralized reinforcement learning, 15 | often called Dec-MDPs and Dec-POMDPs, are very important as well. 16 | 17 | #### 8.2 What when other agents are at play? 18 | 19 | When other agents take actions on the same environment, game theory becomes important. Game theory is a field that researches conflict of interests. Economics, political science, psychology, biology and so on 20 | are some of the most conventional fields using game theory concepts. 21 | 22 | #### 8.3 Further Reading 23 | 24 | * [Game Theory: Basic Concepts](http://www.umass.edu/preferen/Game%20Theory%20Evolving/GTE%20Public/GTE%20Game%20Theory%20Basic%20Concepts.pdf) 25 | * [Game Theory](http://www.cdam.lse.ac.uk/Reports/Files/cdam-2001-09.pdf) 26 | * [An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning](http://www.cs.cmu.edu/~mmv/papers/00TR-mike.pdf) 27 | * [Multi-agent reinforcement learning: An overview](https://pdfs.semanticscholar.org/d96d/a4ac9f78924871c3c4d0dece0b84884fe483.pdf) 28 | * [Multi Agent Reinforcement Learning: Independent vs Cooperative Agents](http://web.media.mit.edu/~cynthiab/Readings/tan-MAS-reinfLearn.pdf) 29 | -------------------------------------------------------------------------------- /03-deterministic-and-stochastic-actions/README.md: -------------------------------------------------------------------------------- 1 | ### 3. Deterministic and Stochastic Actions 2 | 3 | #### 3.1 We can't perfectly control the world 4 | 5 | One of the main points in reinforcement learning is that actions are not always deterministic. That is, taking 6 | an action does not imply that the action will affect the world the same way each time. Even if the action is taken 7 | given the exact same environmental conditions, the actions are not always deterministic. 
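As a tiny illustration of this point, consider the sketch below (the transition table is made up for this example and is not from the lesson's notebooks): the same action, taken from the same state, can produce different next states and rewards.

```python
import random

# A made-up "slippery" transition table: taking action 'right' in state 0
# succeeds 80% of the time, leaves the agent in place 10% of the time, and
# overshoots 10% of the time. Each outcome is (probability, next_state, reward).
TRANSITIONS = {
    (0, 'right'): [(0.8, 1, 1.0), (0.1, 0, 0.0), (0.1, 2, -1.0)],
}

def step(state, action):
    """Sample one stochastic outcome of taking `action` in `state`."""
    outcomes = TRANSITIONS[(state, action)]
    draw, cumulative = random.random(), 0.0
    for probability, next_state, reward in outcomes:
        cumulative += probability
        if draw <= cumulative:
            return next_state, reward
    return outcomes[-1][1:]  # guard against floating-point round-off

# The same action from the same state does not always do the same thing.
for _ in range(5):
    print(step(0, 'right'))

# Weighting each outcome's reward by its probability gives the expectation,
# which is the quantity discussed in section 3.2 below.
expected_reward = sum(p * r for p, _, r in TRANSITIONS[(0, 'right')])
print(expected_reward)  # 0.8 * 1.0 + 0.1 * 0.0 + 0.1 * (-1.0) = 0.7
```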
In fact, most real-world 8 | problems have some stochasticity attached to it in how the world reacts to the agents' actions. For example, we can 9 | think the stock trading agent taking an action to buy a stock, but encountering network issues along the way and 10 | therefore failing at the transaction. Similarly, for the robotics example, we can imagine how moving a robotic 11 | arm to a given location might be precise within a certain range. So the probability of that actions affecting the 12 | environment the same way each time even if given the same exact initial conditions is not total. 13 | 14 | #### 3.2 Dealing with stochasticity 15 | 16 | The way we account for the fact that the world is stochastic is by using expectation of rewards. For example, when 17 | we calculate the rewards we would obtain for taking an action in a given state, we would take into account the probabilities 18 | of transitioning to every single other new state and multiply this probability by the reward we would obtain. If we 19 | sum all of them, we obtain the expectation. 20 | 21 | #### 3.3 Exercises 22 | 23 | In this lesson, we looked into how the environment can get more complex than we discussed in previous lessons. 24 | However, the same algorithms we presented earlier can help us plan when we have a model of the environment. On 25 | the Notebook below we will implement the algorithms discussed in the previous chapter in worlds with deterministic 26 | and stochastic transitions. 27 | 28 | Lesson 3 Notebook. 29 | 30 | #### 3.4 Further Reading 31 | 32 | * [Markov Decision Processes: Concepts and Algorithms](http://www.cs.vu.nl/~annette/SIKS2009/material/SIKS-RLIntro.pdf) 33 | * [Markov Decision Processes: Lecture Notes](https://math.la.asu.edu/~jtaylor/teaching/Fall2012/STP425/lectures/MDP.pdf) 34 | * [Markov Decision Processes](http://www.lancaster.ac.uk/postgrad/zaninie/MDP.pdf) 35 | * [Markov Decision Processes](https://cs.uwaterloo.ca/~jhoey/teaching/cs486/mdp.pdf) 36 | -------------------------------------------------------------------------------- /10-decision-making-and-humans/README.md: -------------------------------------------------------------------------------- 1 | ## Part V: Human Decision-Making and Beyond 2 | 3 | ### 10. Decision-Making and Humans 4 | 5 | #### 10.1 Similarities between methods discussed and humans 6 | 7 | It is not surprising that many of the methods and algorithms explored in these lessons 8 | have some similarities to how humans perceive intelligence. After all, if it is humans 9 | that are building these methods, it is expected that we will create this with our bias of how we see the world. One of the most interesting similarities is how reinforcement 10 | learning algorithms can maximize the expected future reward over a long term horizon. 11 | This is perhaps what makes humans the most intelligent creatures on earth, we plan, 12 | execute and even sacrifice high short-term rewards for even smaller rewards given 13 | multiple times in the future. This is truly amazing. 14 | 15 | Another obvious and important similarity is that of learning by trial and error. It is 16 | true that humans learn with supervision as well. But lots of the learning come from trial and error. There is no way of teaching a toddler to walk with pen and paper, or by reading books, a child will learn to walk by walking. As incredible as it sounds, it is a fact. 
17 | 18 | #### 10.2 Differences between methods discussed and humans 19 | 20 | However, not everything is immediately obvious, and it is in fact, a mistake to just 21 | try to recreate a human brain. Usually, technology advances more quickly when we build systems that enhance humans, instead of trying to replace us. One of the fundamental 22 | differences between the way reinforcement learning works and how humans behave seem to be a lot the reward system. The fact that different humans can perceive the same reward signal different. In reinforcement learning, the reward signal is given by the environment, but it is not clear that this is how to world is actually model in reality. 23 | Sure, no human would consider that stepping on a nail is a positive signal, but it is 24 | true that many humans, especially, successful ones, usually have a way of "bending" 25 | reality to always look for the positive. Perhaps, this is something researchers should 26 | be working on these days. 27 | 28 | #### 10.3 Further Reading 29 | 30 | * [Reinforcement Learning, High-Level Cognition, and the Human Brain](http://users.ugent.be/~tverguts/Publications_files/Silvetti%20RL%20chapter.pdf) 31 | * [A Comparison of Human and Agent Reinforcement Learning in Partially Observable Domains](http://mlg.eng.cam.ac.uk/pub/pdf/DosGha11.pdf) 32 | * [Intrinsically Motivated Reinforcement Learning](http://web.eecs.umich.edu/~baveja/Papers/FinalNIPSIMRL.pdf) 33 | * [Intrinsic Motivation For Reinforcement Learning Systems](http://www-anw.cs.umass.edu/pubs/2005/barto_s_yale05.pdf) 34 | * [Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation](https://arxiv.org/pdf/1604.06057.pdf) 35 | -------------------------------------------------------------------------------- /paper/rldm.aux: -------------------------------------------------------------------------------- 1 | \relax 2 | \providecommand\hyper@newdestlabel[2]{} 3 | \providecommand\HyperFirstAtBeginDocument{\AtBeginDocument} 4 | \HyperFirstAtBeginDocument{\ifx\hyper@anchor\@undefined 5 | \global\let\oldcontentsline\contentsline 6 | \gdef\contentsline#1#2#3#4{\oldcontentsline{#1}{#2}{#3}} 7 | \global\let\oldnewlabel\newlabel 8 | \gdef\newlabel#1#2{\newlabelxx{#1}#2} 9 | \gdef\newlabelxx#1#2#3#4#5#6{\oldnewlabel{#1}{{#2}{#3}}} 10 | \AtEndDocument{\ifx\hyper@anchor\@undefined 11 | \let\contentsline\oldcontentsline 12 | \let\newlabel\oldnewlabel 13 | \fi} 14 | \fi} 15 | \global\let\hyper@last\relax 16 | \gdef\HyperFirstAtBeginDocument#1{#1} 17 | \providecommand\HyField@AuxAddToFields[1]{} 18 | \providecommand\HyField@AuxAddToCoFields[2]{} 19 | \citation{gapranda} 20 | \citation{suttons98} 21 | \citation{intuition} 22 | \citation{directinstruction} 23 | \citation{compare} 24 | \@writefile{toc}{\contentsline {section}{\numberline {1}Introduction}{1}{section.1}} 25 | \@writefile{toc}{\contentsline {section}{\numberline {2}Sparking Curiosity}{1}{section.2}} 26 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.1}Using Simple And Direct Language}{1}{subsection.2.1}} 27 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Keeping A Single Narrative}{1}{subsection.2.2}} 28 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.3}Showing Concepts And Their Complement}{1}{subsection.2.3}} 29 | \citation{openaigym} 30 | \citation{visualization} 31 | \@writefile{toc}{\contentsline {section}{\numberline {3}Removing Friction}{2}{section.3}} 32 | \@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Setting Up A 
Convenient Environment}{2}{subsection.3.1}} 33 | \@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Providing With Boilerplate Code}{2}{subsection.3.2}} 34 | \@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Asking For Minimal Effort}{2}{subsection.3.3}} 35 | \@writefile{toc}{\contentsline {section}{\numberline {4}Showing Options}{2}{section.4}} 36 | \@writefile{toc}{\contentsline {subsection}{\numberline {4.1}Assigning Relevant Readings}{2}{subsection.4.1}} 37 | \@writefile{toc}{\contentsline {subsection}{\numberline {4.2}Watching Academic Lectures}{2}{subsection.4.2}} 38 | \bibcite{gapranda}{1} 39 | \@writefile{toc}{\contentsline {subsection}{\numberline {4.3}Completing Homework and Projects}{3}{subsection.4.3}} 40 | \@writefile{toc}{\contentsline {section}{\numberline {5}Future Work}{3}{section.5}} 41 | \@writefile{toc}{\contentsline {subsection}{\numberline {5.1}Additional Notebooks}{3}{subsection.5.1}} 42 | \@writefile{toc}{\contentsline {subsection}{\numberline {5.2}Effectiveness Evaluation}{3}{subsection.5.2}} 43 | \@writefile{toc}{\contentsline {subsection}{\numberline {5.3}Request For Feedback}{3}{subsection.5.3}} 44 | \@writefile{toc}{\contentsline {section}{\numberline {6}Conclusion}{3}{section.6}} 45 | \bibcite{suttons98}{2} 46 | \bibcite{intuition}{3} 47 | \bibcite{directinstruction}{4} 48 | \bibcite{compare}{5} 49 | \bibcite{visualization}{6} 50 | \bibcite{openaigym}{7} 51 | -------------------------------------------------------------------------------- /06-discrete-and-continuous-actions/README.md: -------------------------------------------------------------------------------- 1 | ### 6. Discrete and Continuous Actions 2 | 3 | #### 6.1 Continuous action space 4 | 5 | Just like the state space, the action space can also become too large to handle in a 6 | traditional way. Certainly, the problems that we have seen so far have very few 7 | available actions: "move up, down, left, right". However, in other types of problems, 8 | like robotics, for instance, even a small number of degrees of freedom can make the 9 | action space just too large for traditional methods. 10 | 11 | #### 6.2 Discretization of action space 12 | 13 | For the action space, discretization is commonly used. This is a fine method for 14 | problems like the ones we have seen before. However, once we enter the realm of physical control, which often deals with continuous values, every new degree of freedom exponentially increases the number of possible action combinations. This gets out of control quickly. 15 | 16 | #### 6.3 Use of function approximation 17 | 18 | The use of function approximation is again a good way of approaching this problem. 19 | Just as before, linear and non-linear function approximation methods can work, as 20 | long as the problem we are dealing with is linear or non-linear, respectively. 21 | 22 | #### 6.4 Searching for the policy 23 | 24 | One way of approaching reinforcement learning problems that we haven't covered yet is, instead of calculating values for state and action pairs to come up with 25 | optimal policies, to search for the optimal policy directly. There are different 26 | ways of doing this, and this is, in fact, one of the areas of active research in reinforcement learning. One of the advantages of using policy search instead of some 27 | of the methods we have seen before is that it is possible to find the optimal policy 28 | even if we don't find optimal values.
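To make the idea of searching the policy space directly a bit more concrete, here is a minimal sketch on a made-up one-dimensional control problem; the environment, the linear policy, and simple random-perturbation hill climbing are all assumptions chosen for illustration, not the method used in the lesson's notebook.

```python
import numpy as np

# A toy continuous problem: the state is a point on a line, the action is a
# real-valued push, and the return is higher the closer we stay to the origin.
def episode_return(policy_weight, n_steps=20, start=5.0):
    state, total = start, 0.0
    for _ in range(n_steps):
        action = policy_weight * state   # linear policy: a = w * s
        state = state + action           # deterministic toy dynamics
        total += -abs(state)             # reward for staying near the origin
    return total

# Direct policy search by random perturbation (hill climbing): we only compare
# whole-episode returns and never estimate a value for any state or action.
rng = np.random.default_rng(0)
best_w, best_return = 0.0, episode_return(0.0)
for _ in range(200):
    candidate = best_w + rng.normal(scale=0.1)
    candidate_return = episode_return(candidate)
    if candidate_return > best_return:
        best_w, best_return = candidate, candidate_return

print(best_w, best_return)  # a weight near -1.0 drives the state toward 0
```

The search lands on a good policy without ever producing value estimates, which is exactly the trade-off described above.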
For example, you can think of the trading agent 29 | looking to calculate the value of buying a stock now: whether that value is $10,000 or 30 | $100,000, you don't care about the precise number; you care to know that it is the best action 31 | to take right now. This is because you care about the policy, not the values. The 32 | same concept applies to policy search. You could apply traditional search methods or just gradient descent to search directly for the optimal policy. We will look at 33 | a method that searches for the optimal policy in a continuous state space and continuous 34 | action space in the notebook. 35 | 36 | #### 6.5 Exercises 37 | 38 | In this lesson, we looked for the first time at problems in which both the number of states and the number 39 | of actions available to the agent are very large or continuous. We introduced a series of methods based on policy 40 | search. So, for this lesson's Notebook, we will look into a problem with continuous states and actions, and we 41 | will apply function approximation to come up with the best solution to it. 42 | 43 | Lesson 6 Notebook. 44 | 45 | #### 6.6 Further Reading 46 | 47 | * [Reinforcement Learning in Continuous State and Action Spaces](http://oai.cwi.nl/oai/asset/19689/19689B.pdf) 48 | * [Continuous Control with Deep Reinforcement Learning](https://arxiv.org/pdf/1509.02971.pdf) 49 | * [Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods](https://papers.nips.cc/paper/3318-reinforcement-learning-in-continuous-action-spaces-through-sequential-monte-carlo-methods.pdf) 50 | * [Q-Learning in Continuous State and Action Spaces](http://users.cecs.anu.edu.au/~rsl/rsl_papers/99ai.kambara.pdf) 51 | * [Deep Reinforcement Learning: An Overview](https://arxiv.org/pdf/1701.07274.pdf) 52 | -------------------------------------------------------------------------------- /07-observable-and-partially-observable-states/README.md: -------------------------------------------------------------------------------- 1 | ### 7. Observable and Partially-Observable States 2 | 3 | #### 7.1 Is what we see what it is? 4 | 5 | The reinforcement learning methods that we have discussed so far make the assumption that the agent 6 | has perfect sensing capability. That is, the agent is able to perceive the world exactly as the 7 | world is. However, in many environments, this is not entirely true. Moreover, in some environments, 8 | it is vital to take this uncertainty into account. For example, in many robotics environments, 9 | our sensor measurements are accurate only within a range. GPS readings can often vary from 2 meters 10 | up to 10 meters. Temperature sensors can provide readings with a 5% error margin. The problem is then 11 | that the methods that we have covered until now are not capable of taking this error into account. This 12 | is because the MDP-based methods have a fundamental assumption, the Markovian assumption. Once this assumption no longer holds true, because the state signal is not fully observable, we enter the field of partially-observable Markov decision processes (POMDPs). 13 | 14 | #### 7.2 State Estimation 15 | 16 | From the robotics world, a few methods emerged to deal with sensor errors. These methods use probabilistic 17 | techniques to model the uncertainty in the sensor readings. In fact, these methods are some of the most 18 | commonly used methods today in areas like autonomous vehicles, object tracking, navigation, and much more. 19 | These methods are called Bayesian filters.
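To give a concrete feel for what a Bayesian filter does, here is a minimal sketch of the simplest case: a one-dimensional Kalman filter tracking a quantity that does not move, so the prediction step is trivial. The true value and the noise level are made-up numbers for illustration; this is not the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 10.0                 # the quantity we are trying to estimate
measurement_noise_var = 4.0       # the sensor is accurate only within a range

estimate, estimate_var = 0.0, 1000.0   # start with a very uncertain belief
for _ in range(20):
    # a noisy sensor reading of the (static) true value
    z = true_value + rng.normal(scale=np.sqrt(measurement_noise_var))
    # Kalman update: blend belief and measurement according to uncertainty
    kalman_gain = estimate_var / (estimate_var + measurement_noise_var)
    estimate = estimate + kalman_gain * (z - estimate)
    estimate_var = (1.0 - kalman_gain) * estimate_var

print(estimate, estimate_var)  # close to 10, with a small remaining variance
```

Each update blends the current belief with the new measurement in proportion to how uncertain each is; that weighting by uncertainty is the common idea behind all the Bayesian filters mentioned above.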
We will look into one of them, the Kalman Filter, in the notebook 20 | for this lecture. 21 | 22 | #### 7.3 Control in Partially-Observable Environments 23 | 24 | It is important to note that Bayesian filters do not solve the entire decision-making problem; however, they 25 | do efficiently solve the state estimation problem. POMDPs are very complex, and so is the theory underlying them. 26 | However, it is good to mention that there exist extensions of most of the algorithms that we have looked at 27 | so far to solve POMDPs for discrete worlds. These methods, however, are inapplicable to many practical 28 | problems in robotics, for instance. There are approximate POMDP methods that sit in between MDPs and POMDPs 29 | and that are capable of giving sufficiently good approximate answers to POMDPs in a reasonable amount of time. 30 | We will refer those looking for more information to interesting readings in this area. 31 | 32 | #### 7.4 Exercises 33 | 34 | In this lesson, we learned that what we see is not always what is happening in the world. Our perceptions might be 35 | biased, we might not have 20/20 vision, and more importantly, we might think we have 20/20 vision when we do not. For this 36 | reason, it is important to know that there are other ways of estimating states. In the following Notebook, we will look 37 | at a very popular method for state estimation, called the Kalman Filter, on a very basic partially-observable 38 | states problem. 39 | 40 | #### 7.5 Further Reading 41 | 42 | * [Particle Filters in Robotics](https://arxiv.org/pdf/1301.0607.pdf) 43 | * [Bayesian Filtering: From Kalman Filters to Particle Filters, and Beyond](http://www.dsi.unifi.it/users/chisci/idfric/Nonlinear_filtering_Chen.pdf) 44 | * [Reinforcement Learning Using Approximate Belief States](https://ai.stanford.edu/~koller/Papers/Rodriguez+al:NIPS99.pdf) 45 | * [Bayesian Reinforcement Learning in Continuous POMDPs with Gaussian Processes](https://www.cs.cmu.edu/~sross1/publications/gppomgp_IROS09.pdf) 46 | * [Deep Reinforcement Learning with POMDPs](http://cs229.stanford.edu/proj2015/363_report.pdf) 47 | -------------------------------------------------------------------------------- /01-introduction-to-decision-making/README.md: -------------------------------------------------------------------------------- 1 | ## Part I: Introduction 2 | 3 | ### 1. Introduction to Decision-Making 4 | 5 | #### 1.1 Decision-Making 6 | 7 | Decision-making has captivated human intelligence for many years. Humans have always 8 | wondered what makes us the most intelligent animal on this planet. The fact is that 9 | decision-making could be seen as directly correlated with intelligence. The better 10 | the decisions being made, either by a natural or artificial agent, the more likely we 11 | will perceive that agent as intelligent. Moreover, the level of impact that decisions have 12 | is directly or indirectly recognized by our societies. Roles in which decision-making 13 | is a primary responsibility are the most highly regarded in today's workforce. If we 14 | think of prestige and salary, for example, leadership roles rate higher than management, 15 | and management rates higher than the rest of the labor force. 16 | 17 | Being such an important field, it comes as no surprise that decision-making is studied 18 | under many different names.
Economics, Neuroscience, Psychology, Operations Research, Adaptive 19 | Control, Statistics, Optimal Control Theory, and Reinforcement Learning are some of the prominent 20 | fields contributing to the understanding of decision-making. However, if we think deeper, 21 | most other fields are also concerned with optimal decision-making. They might not 22 | necessarily contribute directly to improving our understanding of how we make optimal 23 | decisions, but they do study decision-making applied to a specific trade. For instance, 24 | think of journalism. This activity is not concerned with understanding how to make optimal 25 | decisions in general, but it is definitely interested in learning how to make optimal 26 | decisions with regard to preparing news and writing for newspapers. By the same token, we 27 | can see how fields that study decision-making are a generalization of these other fields. 28 | 29 | In the following lessons, we will explore decision-making in the context of Reinforcement 30 | Learning. As Reinforcement Learning is a descendant of Artificial Intelligence, in the remainder 31 | of this chapter we will briefly touch on Artificial Intelligence. Also, being related fields, 32 | we will look at some basics of probability and statistics. In the rest of this lesson, we 33 | will discuss decision-making when there is only one decision to make. This is perhaps the major 34 | difference between Reinforcement Learning and other related fields: Reinforcement Learning 35 | relaxes this constraint, allowing the notion of sequential decision-making. 36 | This sense of interaction with an environment sets Reinforcement Learning apart. In later lessons, 37 | we will continue relaxing constraints and presenting more abstract topics related to Reinforcement 38 | Learning. After this lesson, we will explore deterministic and stochastic transitions, known 39 | and unknown environments, discrete and continuous states, discrete and continuous actions, 40 | observable and partially observable states, single and multiple agents, cooperative and 41 | adversarial agents, and finally, we will put everything in the perspective of human 42 | intelligence. I hope you enjoy this work. 43 | 44 | #### 1.2 Further Reading 45 | 46 | * [Decision Theory A Brief Introduction](http://people.kth.se/~soh/decisiontheory.pdf) 47 | * [Statistical Decision Theory: Concepts, Methods and Applications](http://probability.ca/jeff/ftpdir/anjali0.pdf) 48 | * [A Brief History of Decision Making](https://hbr.org/2006/01/a-brief-history-of-decision-making) 49 | * [The Theory of Decision Making](http://worthylab.tamu.edu/courses_files/01_edwards_1954.pdf) 50 | -------------------------------------------------------------------------------- /05-discrete-and-continuous-states/README.md: -------------------------------------------------------------------------------- 1 | ## Part III: Decision-Making in Hard Problems 2 | 3 | ### 5. Discrete and Continuous States 4 | 5 | #### 5.1 Too large to hold in memory 6 | 7 | The truth is, as we use all of the previous methods to solve decision-making problems, 8 | there will come a time when the problems are very large. Some problems become so large that we can no longer represent them in computer memory. Moreover, even if we could hold a table with all state and action pairs in memory, collecting experience for every state-action combination would be inefficient. 9 | 10 | #### 5.2 Discretization of state space 11 | 12 | One way of approaching this problem is to combine states into buckets by similarity, as in the quick sketch below.
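The bin edges in the sketch are made-up values loosely inspired by a cart-pole-style observation (position, velocity, angle, angular velocity); they are assumptions for illustration, not the notebook's settings.

```python
import numpy as np

# One set of bin edges per observation dimension (all ranges are assumptions).
BINS = [
    np.linspace(-2.4, 2.4, 9),    # cart position
    np.linspace(-3.0, 3.0, 9),    # cart velocity
    np.linspace(-0.21, 0.21, 9),  # pole angle (radians)
    np.linspace(-3.0, 3.0, 9),    # pole angular velocity
]

def discretize(observation):
    """Map a continuous observation to a tuple of bucket indices."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(observation, BINS))

# Two nearby observations fall into the same bucket, so they share a single
# entry in a table of values.
print(discretize([0.05, 0.20, 0.010, -0.10]))
print(discretize([0.06, 0.21, 0.012, -0.11]))
```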
13 | This approach could effectively reduce the number of states of the problem to a 14 | number that allows us to solve the problem using one of the methods seen in previous 15 | lessons. For example, in the OpenAI Lunar Lander world, we can see how the entire 16 | right side of the landing pad and the left side could be counted as 2 unique states. 17 | The truth is, no matter where you are in that right or left area, your best action will be to fly either left or right, respectively, making sure you end up in the middle. Additionally, 18 | the upper 50% of the vertical axis could easily be treated as a single area, with many smaller areas as we get closer to the landing pad. We will see how to apply discretization 19 | to the cart-pole problem in this lesson's notebook. 20 | 21 | #### 5.3 Use of function approximation 22 | 23 | Soon after looking into discretization, any Machine Learning Engineer would shake their head: why not use function approximation instead of doing this by hand? This is exactly why function approximation exists. In fact, we could use 24 | any function approximator, like KNN or SVM; however, if the environment is non-linear, 25 | then non-linear function approximators should be used instead, as without them we 26 | might find a solution that improves but never converges to 27 | the optimal policy. Perhaps the most popular non-linear function approximators 28 | nowadays are neural networks. In fact, the use of neural networks that are more than 3 layers deep in combination with reinforcement learning algorithms is often grouped into a field called Deep Reinforcement Learning. This is perhaps one of the 29 | most interesting and promising areas of reinforcement learning, and we will look 30 | into it in the next lesson's notebook. 31 | 32 | #### 5.4 Exercises 33 | 34 | In this lesson, we got a step closer to what we could call 'real-world' reinforcement learning. Specifically, 35 | we looked at a kind of environment in which there are so many states that we can no longer represent a table 36 | of all of them, either because the state space is too large or flat-out continuous. 37 | 38 | In order to get a sense for this type of problem, we will look at a basic Cart Pole problem, and we will solve it by 39 | discretizing the state space, in effect making a manual function approximation of this problem. 40 | 41 | Lesson 5 Notebook. 42 | 43 | #### 5.5 Further Reading 44 | 45 | * [An Analysis of Reinforcement Learning with Function Approximation](http://icml2008.cs.helsinki.fi/papers/652.pdf) 46 | * [Residual Algorithms: Reinforcement Learning with Function Approximation](http://www.leemon.com/papers/1995b.pdf) 47 | * [A Brief Survey of Parametric Value Function Approximation](http://www.cs.utexas.edu/~dana/MLClass/RL_VF.pdf) 48 | * [A Tutorial on Linear Function Approximators for Dynamic Programming and Reinforcement Learning](https://cs.brown.edu/people/stefie10/publications/geramifard13.pdf) 49 | * [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) 50 | * [Function Approximation via Tile Coding: Automating Parameter Choice](http://www.cs.utexas.edu/~ai-lab/pubs/SARA05.pdf) 51 | -------------------------------------------------------------------------------- /04-known-and-unknown-environments/README.md: -------------------------------------------------------------------------------- 1 | ### 4. Known and Unknown Environments 2 | 3 | #### 4.1 What if we don't have a model of the environment?
4 | 5 | One of the first things that will come to your head after reviewing the last tutorial is, but what's the point if 6 | we need to have the dynamics of the environment? What if it is such a complex environment that it is just too hard 7 | to model? Or better yet, what if we just don't know the environment? Can we still learn the best actions to take 8 | in order to maximize long-term rewards? 9 | 10 | And, the answer to that question is, of course, we can deal with unknown environments. Perhaps, this is the most exciting aspect of reinforcement learning; agents are capable of, through interaction only, learning the best sequence of actions 11 | to take to maximize long-term reward. 12 | 13 | #### 4.2 The need to explore 14 | 15 | The fact that we do not have a map of the environment puts us in need to explore it. Before we were given a map or as 16 | it is called in reinforcement learning, a model (MDP), but now, we are just dropped in the middle of a world with no 17 | other guidance than our own experiences. The need for exploration comes with a price. We can no longer take a perfect 18 | sequence of actions that maximize the long-term rewards from the very first time. Instead, we are now ensured to, at 19 | least, fail a couple of times trying to understand the environment and attempting to reach better goals each time. 20 | 21 | If you think about it, this puts us in a dilemma, as exploration has a cost associated with it, how much is it effective 22 | to pay for it such that the long term rewards are maximized. For example, think about a young person graduating from college 23 | at 20 and getting his/her first job. This person goes around in 3-4 different jobs early on, but later when he is 50 and 24 | has accumulated experience in a specific field, it might no longer be beneficial to do a career change. It could be much 25 | more effective to keep exploiting the experience he/she has gained. Even if there exists a possibility for higher reward on some 26 | other field. Potentially, given the time left in his/her career, the price of learning a new set of skills might not benefit 27 | the long term goals. 28 | 29 | #### 4.3 What to learn? 30 | 31 | There are two ways you could think of interacting with the world. At first, we could think of the value of taking actions in 32 | given states. For example, we calculate the expectation of taking action 'a' when on the state 's', then do the same for all 33 | possible state, action combinations. However, as we saw on previous lessons, if we had a model of the environment, we could 34 | determine the exact best value for each state. So, how about learning the model of the environment and then using some of the algorithms on previous lectures to help us guide our decision making? 35 | 36 | It turns out that these two ways are the most fundamental classes of algorithms in reinforcement learning. Model-Free methods 37 | are those algorithms that learn straight the action selection. These methods are incredibly useful as they are 38 | capable of learning best actions without any knowledge of the environment. However, they are very data hungry and it 39 | requires lots of samples to get good results. Practically speaking, we cannot just let a bipedal robot fall 1,000,000 times 40 | just for the sake of gaining experience. On the other side of the spectrum, Model-Based methods learn and use the model 41 | of the environment in order to improve the action selection especially early on. 
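As a rough sketch of what "learning the model" looks like in its simplest tabular form, the snippet below estimates transition probabilities and expected rewards from experience tuples; the experience itself is fabricated for illustration, and this is not the notebook's implementation.

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))
reward_sums = defaultdict(float)
visits = defaultdict(int)

# Made-up experience tuples: (state, action, reward, next_state).
experience = [(0, 'right', 0.0, 1), (0, 'right', 0.0, 1),
              (0, 'right', -1.0, 0), (1, 'right', 1.0, 2)]

for state, action, reward, next_state in experience:
    transition_counts[(state, action)][next_state] += 1
    reward_sums[(state, action)] += reward
    visits[(state, action)] += 1

def estimated_model(state, action):
    """Return (transition probabilities, expected reward) estimated so far."""
    n = visits[(state, action)]
    probabilities = {s2: count / n
                     for s2, count in transition_counts[(state, action)].items()}
    return probabilities, reward_sums[(state, action)] / n

print(estimated_model(0, 'right'))  # roughly ({1: 0.67, 0: 0.33}, -0.33)
```

Once these estimates are reasonable, the planning algorithms from the earlier lessons can be run on the learned model.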
Model-Based methods are much more data 42 | efficient and for this reason, they are utilized more frequently on problems involving hardware such as robotics. 43 | 44 | #### 4.4 What to do with what we learn? 45 | 46 | We saw before that we will have to interact with the environment in order to learn. This obvious way of learning is 47 | called "Online learning". In contrast, however, we could also collect the samples we get from our experience and 48 | use that to further evaluate our actions. Intuitively we can think of how humans learn. When we interact with our 49 | environment, we learn directly from our experiences with it, but also, after we have collected these experiences, 50 | we use our memory to think about it and learn so more of what happened, what we did and how we could improve the 51 | outcome if we are facing the same problem again. This way of learning is called "Offline learning", and it is also 52 | used in reinforcement learning. 53 | 54 | #### 4.5 Adding small randomness to your actions 55 | 56 | Finally, there is some other important point often seen in reinforcement learning. The fact that we learn a good 57 | policy does not imply that such policy should be always followed. What if there are some better actions we could 58 | have taken? How do we ensure we always keep an eye on yet a better policy? In reinforcement learning, there are 59 | two main classes of algorithms that address ways of learning while constantly striving for finding better policies. 60 | One way is called off-policy, and it basically means that the actions taken by the agent are not necessarily always 61 | those that we have determined as the best actions. We would then be updating the values of a policy as if we were 62 | taking the actions of that policy when in fact we selected the action from another policy. We can also see off-policy 63 | as having two different policies, one that determines the actions that we are selecting, and the other the one that 64 | we use to evaluate our action selection. Conversely, we also have on-policy learning in which we learn and act on 65 | top of the same policy. That is, we evaluate and follow the actions from the same policy. 66 | 67 | #### 4.6 Exercises 68 | 69 | In this lesson, we learned the difference between planning and reinforcement learning. We compared two styles 70 | of doing reinforcement learning, one in which we learn to behave without trying to understand the dynamics 71 | of the environment. And on the other hand, we learn to behave by simultaneously trying to learn the environment 72 | so that our learning could become more and more accurate each time. 73 | 74 | For this, we will look into a couple of algorithms for model-free reinforcement learning and we will also look 75 | at an algorithm that tries to learn the model with each observation and become much more efficient with every 76 | iteration. 77 | 78 | Lesson 4 Notebook. 
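For reference, here is a compact sketch of the model-free, off-policy style of learning described above: Q-learning with an epsilon-greedy behavior policy. The `env` argument is assumed to follow the classic OpenAI Gym interface (`reset()`, `step()`, `action_space.n`) on a small discrete task such as `FrozenLake-v0`; treat this as an illustrative sketch rather than the notebook's exact implementation.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    q = defaultdict(float)                      # q[(state, action)]
    actions = list(range(env.action_space.n))
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # behavior policy: mostly greedy, sometimes random (exploration)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # off-policy target: the greedy value of the next state, no matter
            # which action the behavior policy ends up taking there
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

# Example usage (assuming this Gym environment id exists in your installation):
# import gym; q = q_learning(gym.make('FrozenLake-v0'))
```

The update bootstraps from the greedy value of the next state even though the behavior policy sometimes acts randomly, which is what makes it off-policy in the sense of section 4.5.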
79 | 80 | #### 4.7 Further Reading 81 | 82 | * [Reinforcement Learning: A Survey](https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kaelbling.pdf) 83 | * [Algorithms for Sequential Decision Making](https://www.cs.rutgers.edu/~mlittman/papers/thesis-with-gammas.pdf) 84 | * [Shaping and policy search in Reinforcement learning](http://www.cs.ubc.ca/~nando/550-2006/handouts/andrew-ng.pdf) 85 | * [Dynamic Programming and Optimal Control](http://web.mit.edu/dimitrib/www/dpchapter.pdf) 86 | -------------------------------------------------------------------------------- /paper/rldmsubmit.sty: -------------------------------------------------------------------------------- 1 | %%%% RLDM Macros (LaTex) 2 | %%%% Style File 3 | %%%% March 2013 4 | %%%% This has been purloined almost wholesale from nips10submit_e 5 | %%%% There are minor changes for RLDM purposes 6 | 7 | % This file can be used with Latex2e whether running in main mode, or 8 | % 2.09 compatibility mode. 9 | % 10 | % If using main mode, you need to include the commands 11 | % \documentclass{article} 12 | % \usepackage{rldmsubmit,times} 13 | % as the first lines in your document. Or, if you do not have Times 14 | % Roman font available, you can just use 15 | % \documentclass{article} 16 | % \usepackage{rldmsubmit} 17 | % instead. 18 | % 19 | 20 | % Change the overall width of the page. If these parameters are 21 | % changed, they will require corresponding changes in the 22 | % maketitle section. 23 | % 24 | \usepackage{eso-pic} % used by \AddToShipoutPicture 25 | 26 | \renewcommand{\topfraction}{0.95} % let figure take up nearly whole page 27 | \renewcommand{\textfraction}{0.05} % let figure take up nearly whole page 28 | 29 | % Define rldmfinal, set to true if rldmfinalcopy is defined 30 | \newif\ifrldmfinal 31 | \rldmfinaltrue 32 | \def\rldmfinalcopy{\rldmfinaltrue} 33 | \font\rldmtenhv = phvb at 8pt % *** IF THIS FAILS, SEE rldm10submit_e.sty *** 34 | 35 | % Specify the dimensions of each page 36 | 37 | %\setlength{\paperheight}{11in} 38 | %\setlength{\paperwidth}{8.5in} 39 | 40 | \newlength{\rldmFPmargin} 41 | \setlength{\rldmFPmargin}{1.5cm} 42 | \setlength{\headheight}{0pt} 43 | \setlength{\headsep}{0pt} 44 | 45 | \setlength{\textwidth}{\paperwidth} 46 | \addtolength{\textwidth}{-2\rldmFPmargin} 47 | \setlength{\oddsidemargin}{\rldmFPmargin} 48 | \addtolength{\oddsidemargin}{-1in} 49 | \setlength{\evensidemargin}{\oddsidemargin} 50 | \setlength{\textheight}{\paperheight} 51 | \addtolength{\textheight}{-\headheight} 52 | \addtolength{\textheight}{-\headsep} 53 | \addtolength{\textheight}{-\footskip} 54 | \addtolength{\textheight}{-2\rldmFPmargin} 55 | \setlength{\topmargin}{\rldmFPmargin} 56 | \addtolength{\topmargin}{-1in} 57 | 58 | %\textheight 23 true cm % Height of text (including footnotes & figures) 59 | %\textwidth 14 true cm % Width of text line. 60 | \widowpenalty=10000 61 | \clubpenalty=10000 62 | 63 | \thispagestyle{empty} %\pagestyle{empty} 64 | \flushbottom \sloppy 65 | 66 | % We're never going to need a table of contents, so just flush it to 67 | % save space --- suggested by drstrip@sandia-2 68 | \def\addcontentsline#1#2#3{} 69 | 70 | % Title stuff, taken from deproc. 
71 | \def\maketitle{\par 72 | \begingroup 73 | \def\thefootnote{\fnsymbol{footnote}} 74 | \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}} % for perfect author 75 | % name centering 76 | % The footnote-mark was overlapping the footnote-text, 77 | % added the following to fix this problem (MK) 78 | \long\def\@makefntext##1{\parindent 1em\noindent 79 | \hbox to1.8em{\hss $\m@th ^{\@thefnmark}$}##1} 80 | \@maketitle \@thanks 81 | \endgroup 82 | \setcounter{footnote}{0} 83 | \let\maketitle\relax \let\@maketitle\relax 84 | \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax} 85 | 86 | % The toptitlebar has been raised to top-justify the first page 87 | 88 | % Title (includes both anonimized and non-anonimized versions) 89 | \def\@maketitle{\vbox{\hsize\textwidth 90 | \linewidth\hsize \vskip 0.1in \toptitlebar \centering 91 | {\LARGE\bf \@title\par} \bottomtitlebar % \vskip 0.1in % minus 92 | \ifrldmfinal 93 | \def\And{\end{tabular}\hfil\linebreak[0]\hfil 94 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\ignorespaces}% 95 | \def\AND{\end{tabular}\hfil\linebreak[4]\hfil 96 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\ignorespaces}% 97 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\@author\end{tabular}% 98 | \else 99 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt} 100 | Anonymous Author(s) \\ 101 | Affiliation \\ 102 | Address \\ 103 | \texttt{email} \\ 104 | \end{tabular}% 105 | \fi 106 | \vskip 0.3in minus 0.1in}} 107 | 108 | \newcommand\startmain{\newpage\setcounter{page}{1}\par} 109 | \newcommand\startopt{\newpage\centerline{\Large \bf Supplementary Material}} 110 | 111 | \def\keywords#1{\vskip.2in\begin{minipage}[t]{1.4in}% 112 | {\bf Keywords:}\end{minipage}\begin{minipage}[t]{4in}#1\end{minipage}} 113 | 114 | \def\acknowledgements#1{\vskip.2in\subsubsection*{Acknowledgements}#1} 115 | 116 | \def\repository#1{\vskip.2in\begin{minipage}[t]{1.4in}% 117 | {\bf Repository:}\end{minipage}\begin{minipage}[t]{5in}\url{#1}\end{minipage}} 118 | 119 | \def\spresentation#1{\vskip.2in\begin{minipage}[t]{1.4in}% 120 | {\bf Short Presentation:}\end{minipage}\begin{minipage}[t]{5in}\url{#1}\end{minipage}} 121 | 122 | \def\lpresentation#1{\vskip.2in\begin{minipage}[t]{1.4in}% 123 | {\bf Long Presentation:}\end{minipage}\begin{minipage}[t]{5in}\url{#1}\end{minipage}} 124 | 125 | \renewenvironment{abstract}{\vskip.075in\centerline{\large\bf 126 | Abstract}\vspace{0.5ex}}{\par} 127 | 128 | % sections with less space 129 | \def\section{\@startsection {section}{1}{\z@}{-2.0ex plus 130 | -0.5ex minus -.2ex}{1.5ex plus 0.3ex 131 | minus0.2ex}{\large\bf\raggedright}} 132 | 133 | \def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus 134 | -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}} 135 | \def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex 136 | plus -0.5ex minus -.2ex}{0.5ex plus 137 | .2ex}{\normalsize\bf\raggedright}} 138 | \def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus 139 | 0.5ex minus .2ex}{-1em}{\normalsize\bf}} 140 | \def\subparagraph{\@startsection{subparagraph}{5}{\z@}{1.5ex plus 141 | 0.5ex minus .2ex}{-1em}{\normalsize\bf}} 142 | \def\subsubsubsection{\vskip 143 | 5pt{\noindent\normalsize\rm\raggedright}} 144 | 145 | 146 | % Footnotes 147 | \footnotesep 6.65pt % 148 | \skip\footins 9pt plus 4pt minus 2pt 149 | \def\footnoterule{\kern-3pt \hrule width 12pc \kern 2.6pt } 150 | \setcounter{footnote}{0} 151 | 152 | % Lists and paragraphs 153 | \parindent 0pt 154 | \topsep 4pt plus 1pt minus 2pt 155 | \partopsep 1pt plus 0.5pt minus 0.5pt 156 | \itemsep 
2pt plus 1pt minus 0.5pt 157 | \parsep 2pt plus 1pt minus 0.5pt 158 | \parskip .5pc 159 | 160 | 161 | %\leftmargin2em 162 | \leftmargin3pc 163 | \leftmargini\leftmargin \leftmarginii 2em 164 | \leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em 165 | 166 | %\labelsep \labelsep 5pt 167 | 168 | \def\@listi{\leftmargin\leftmargini} 169 | \def\@listii{\leftmargin\leftmarginii 170 | \labelwidth\leftmarginii\advance\labelwidth-\labelsep 171 | \topsep 2pt plus 1pt minus 0.5pt 172 | \parsep 1pt plus 0.5pt minus 0.5pt 173 | \itemsep \parsep} 174 | \def\@listiii{\leftmargin\leftmarginiii 175 | \labelwidth\leftmarginiii\advance\labelwidth-\labelsep 176 | \topsep 1pt plus 0.5pt minus 0.5pt 177 | \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt 178 | \itemsep \topsep} 179 | \def\@listiv{\leftmargin\leftmarginiv 180 | \labelwidth\leftmarginiv\advance\labelwidth-\labelsep} 181 | \def\@listv{\leftmargin\leftmarginv 182 | \labelwidth\leftmarginv\advance\labelwidth-\labelsep} 183 | \def\@listvi{\leftmargin\leftmarginvi 184 | \labelwidth\leftmarginvi\advance\labelwidth-\labelsep} 185 | 186 | \abovedisplayskip 7pt plus2pt minus5pt% 187 | \belowdisplayskip \abovedisplayskip 188 | \abovedisplayshortskip 0pt plus3pt% 189 | \belowdisplayshortskip 4pt plus3pt minus3pt% 190 | 191 | % Less leading in most fonts (due to the narrow columns) 192 | % The choices were between 1-pt and 1.5-pt leading 193 | %\def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} % got rid of @ (MK) 194 | \def\normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} 195 | \def\small{\@setsize\small{10pt}\ixpt\@ixpt} 196 | \def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt} 197 | \def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt} 198 | \def\tiny{\@setsize\tiny{7pt}\vipt\@vipt} 199 | \def\large{\@setsize\large{14pt}\xiipt\@xiipt} 200 | \def\Large{\@setsize\Large{16pt}\xivpt\@xivpt} 201 | \def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt} 202 | \def\huge{\@setsize\huge{23pt}\xxpt\@xxpt} 203 | \def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt} 204 | 205 | \def\toptitlebar{\hrule height4pt\vskip .6cm\vskip-\parskip} 206 | 207 | \def\bottomtitlebar{\vskip .7cm\vskip-\parskip\hrule height1pt\vskip 208 | .09in} % 209 | %Reduced second vskip to compensate for adding the strut in \@author 210 | 211 | % Vertical Ruler 212 | % This code is, largely, from the CVPR 2010 conference style file 213 | % ----- define vruler 214 | \makeatletter 215 | \newbox\rldmrulerbox 216 | \newcount\rldmrulercount 217 | \newdimen\rldmruleroffset 218 | \newdimen\cv@lineheight 219 | \newdimen\cv@boxheight 220 | \newbox\cv@tmpbox 221 | \newcount\cv@refno 222 | \newcount\cv@tot 223 | % NUMBER with left flushed zeros \fillzeros[] 224 | \newcount\cv@tmpc@ \newcount\cv@tmpc 225 | \def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi 226 | \cv@tmpc=1 % 227 | \loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi 228 | \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat 229 | \ifnum#2<0\advance\cv@tmpc1\relax-\fi 230 | \loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat 231 | \cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}% 232 | % \makevruler[][][][][] 233 | \def\makevruler[#1][#2][#3][#4][#5]{\begingroup\offinterlineskip 234 | \textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt% 235 | \global\setbox\rldmrulerbox=\vbox to \textheight{% 236 | {\parskip=0pt\hfuzz=150em\cv@boxheight=\textheight 237 | \cv@lineheight=#1\global\rldmrulercount=#2% 
238 | \cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2% 239 | \cv@refno1\vskip-\cv@lineheight\vskip1ex% 240 | \loop\setbox\cv@tmpbox=\hbox to0cm{{\rldmtenhv\hfil\fillzeros[#4]\rldmrulercount}}% 241 | \ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break 242 | \advance\cv@refno1\global\advance\rldmrulercount#3\relax 243 | \ifnum\cv@refno<\cv@tot\repeat}}\endgroup}% 244 | \makeatother 245 | % ----- end of vruler 246 | 247 | % \makevruler[][][][][] 248 | \def\rldmruler#1{\makevruler[12pt][#1][1][3][0.993\textheight]\usebox{\rldmrulerbox}} 249 | \AddToShipoutPicture{% 250 | \ifrldmfinal\else 251 | \rldmruleroffset=\textheight 252 | \advance\rldmruleroffset by -3.7pt 253 | \color[rgb]{.7,.7,.7} 254 | \AtTextUpperLeft{% 255 | \put(\LenToUnit{-35pt},\LenToUnit{-\rldmruleroffset}){%left ruler 256 | \rldmruler{\rldmrulercount}} 257 | } 258 | \fi 259 | } 260 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Applied Reinforcement Learning 2 | 3 | I've been studying reinforcement learning and decision-making for a couple of years now. 4 | One of the most difficult things that I've encountered is not necessarily related to 5 | the concepts but how these concepts have been explained. To me, learning occurs when one 6 | is able to make a connection with the concepts being taught. For this, often an intuitive 7 | explanation is required, and likely a hands-on approach helps build that kind of 8 | understanding. 9 | 10 | My goal for this repository is to create, with the community, a resource that would help 11 | newcomers understand reinforcement learning in an intuitive way. Consider what you see here 12 | my initial attempt to teach some of these concepts as plain and simple as I can possibly 13 | explain them. 14 | 15 | If you'd like to collaborate, whether a typo, or an entire addition to the text, maybe a fix 16 | to a notebook or a whole new notebook, please feel free to send your issue and/or pull 17 | request to make things better. As long as your pull request aligns with the goal of the 18 | repository, it is very likely we will merge. I'm not the best teacher, or reinforcement 19 | learning researcher, but I do believe we can make reinforcement learning and decision-making 20 | easy for anyone to understand. Well, at least easier. 21 | 22 | Table of Contents 23 | ================= 24 | 25 | * [Notebooks Installation](#notebooks-installation) 26 | * [Install git](#install-git) 27 | * [Install Docker](#install-docker) 28 | * [Run Notebooks](#run-notebooks) 29 | * [TL;DR version](#tldr-version) 30 | * [A little more detailed version:](#a-little-more-detailed-version) 31 | * [Open the Notebooks in your browser:](#open-the-notebooks-in-your-browser) 32 | * [Open TensorBoard at the following address:](#open-tensorboard-at-the-following-address) 33 | * [Docker Tips](#docker-tips) 34 | * [Part I: Introduction](01-introduction-to-decision-making/README.md#part-i-introduction) 35 | * [1. Introduction to Decision-Making](01-introduction-to-decision-making/README.md#1-introduction-to-decision-making) 36 | * [1.1 Decision-Making](01-introduction-to-decision-making/README.md#11-decision-making) 37 | * [1.2 Further Reading](01-introduction-to-decision-making/README.md#12-further-reading) 38 | * [Part II: Reinforcement Learning and Decision-Making](02-sequential-decisions/README.md#part-ii-reinforcement-learning-and-decision-making) 39 | * [2. 
Sequential Decisions](02-sequential-decisions/README.md#2-sequential-decisions) 40 | * [2.1 Modeling Decision-Making Problems](02-sequential-decisions/README.md#21-modeling-decision-making-problems) 41 | * [2.2 Solutions Representation](02-sequential-decisions/README.md#22-solutions-representation) 42 | * [2.3 Simple Sequential Problem](02-sequential-decisions/README.md#23-simple-sequential-problem) 43 | * [2.4 Slightly more complex problems](02-sequential-decisions/README.md#24-slightly-more-complex-problems) 44 | * [2.5 Evaluating solutions](02-sequential-decisions/README.md#25-evaluating-solutions) 45 | * [2.6 Improving on solutions](02-sequential-decisions/README.md#26-improving-on-solutions) 46 | * [2.7 Finding Optimal solutions](02-sequential-decisions/README.md#27-finding-optimal-solutions) 47 | * [2.8 Improving on Policy Iteration](02-sequential-decisions/README.md#28-improving-on-policy-iteration) 48 | * [2.9 Exercises](02-sequential-decisions/README.md#29-exercises) 49 | * [2.10 Further Reading](02-sequential-decisions/README.md#210-further-reading) 50 | * [3. Deterministic and Stochastic Actions](03-deterministic-and-stochastic-actions/README.md#3-deterministic-and-stochastic-actions) 51 | * [3.1 We can't perfectly control the world](03-deterministic-and-stochastic-actions/README.md#31-we-cant-perfectly-control-the-world) 52 | * [3.2 Dealing with stochasticity](03-deterministic-and-stochastic-actions/README.md#32-dealing-with-stochasticity) 53 | * [3.3 Exercises](03-deterministic-and-stochastic-actions/README.md#33-exercises) 54 | * [3.4 Further Reading](03-deterministic-and-stochastic-actions/README.md#34-further-reading) 55 | * [4. Known and Unknown Environments](04-known-and-unknown-environments/README.md#4-known-and-unknown-environments) 56 | * [4.1 What if we don't have a model of the environment?](04-known-and-unknown-environments/README.md#41-what-if-we-dont-have-a-model-of-the-environment) 57 | * [4.2 The need to explore](04-known-and-unknown-environments/README.md#42-the-need-to-explore) 58 | * [4.3 What to learn?](04-known-and-unknown-environments/README.md#43-what-to-learn) 59 | * [4.4 What to do with what we learn?](04-known-and-unknown-environments/README.md#44-what-to-do-with-what-we-learn) 60 | * [4.5 Adding small randomness to your actions](04-known-and-unknown-environments/README.md#45-adding-small-randomness-to-your-actions) 61 | * [4.6 Exercises](04-known-and-unknown-environments/README.md#46-exercises) 62 | * [4.7 Further Reading](04-known-and-unknown-environments/README.md#47-further-reading) 63 | * [Part III: Decision-Making in Hard Problems](05-discrete-and-continuous-states/README.md#part-iii-decision-making-in-hard-problems) 64 | * [5. Discrete and Continuous States](05-discrete-and-continuous-states/README.md#5-discrete-and-continuous-states) 65 | * [5.1 Too large to hold in memory](05-discrete-and-continuous-states/README.md#51-too-large-to-hold-in-memory) 66 | * [5.2 Discretization of state space](05-discrete-and-continuous-states/README.md#52-discretization-of-state-space) 67 | * [5.3 Use of function approximation](05-discrete-and-continuous-states/README.md#53-use-of-function-approximation) 68 | * [5.4 Exercises](05-discrete-and-continuous-states/README.md#54-exercises) 69 | * [5.5 Further Reading](05-discrete-and-continuous-states/README.md#55-further-reading) 70 | * [6. 
Discrete and Continuous Actions](06-discrete-and-continuous-actions/README.md#6-discrete-and-continuous-actions) 71 | * [6.1 Continuous action space](06-discrete-and-continuous-actions/README.md#61-continuous-action-space) 72 | * [6.2 Discretizition of action space](06-discrete-and-continuous-actions/README.md#62-discretizition-of-action-space) 73 | * [6.3 Use of function approximation](06-discrete-and-continuous-actions/README.md#63-use-of-function-approximation) 74 | * [6.4 Searching for the policy](06-discrete-and-continuous-actions/README.md#64-searching-for-the-policy) 75 | * [6.5 Exercises](06-discrete-and-continuous-actions/README.md#65-exercises) 76 | * [6.6 Further Reading](06-discrete-and-continuous-actions/README.md#66-further-reading) 77 | * [7. Observable and Partially-Observable States](07-observable-and-partially-observable-states/README.md#7-observable-and-partially-observable-states) 78 | * [7.1 Is what we see what it is?](07-observable-and-partially-observable-states/README.md#71-is-what-we-see-what-it-is) 79 | * [7.2 State Estimation](07-observable-and-partially-observable-states/README.md#72-state-estimation) 80 | * [7.3 Control in Partially-Observable Environments](07-observable-and-partially-observable-states/README.md#73-control-in-partially-observable-environments) 81 | * [7.4 Further Reading](07-observable-and-partially-observable-states/README.md#74-further-reading) 82 | * [Part IV: Multiple Decision-Making Agents](08-single-and-multiple-agents/README.md#part-iv-multiple-decision-making-agents) 83 | * [8. Single and Multiple Agents](08-single-and-multiple-agents/README.md#8-single-and-multiple-agents) 84 | * [8.1 Agents with same objectives](08-single-and-multiple-agents/README.md#81-agents-with-same-objectives) 85 | * [8.2 What when other agents are at play?](08-single-and-multiple-agents/README.md#82-what-when-other-agents-are-at-play) 86 | * [8.3 Further Reading](08-single-and-multiple-agents/README.md#83-further-reading) 87 | * [9. Cooperative and Adversarial Agents](09-cooperative-and-adversarial-agents/README.md#9-cooperative-and-adversarial-agents) 88 | * [9.1 Agents with conflicting objectives](09-cooperative-and-adversarial-agents/README.md#91-agents-with-conflicting-objectives) 89 | * [9.2 Teams of agents with conflicting objectives](09-cooperative-and-adversarial-agents/README.md#92-teams-of-agents-with-conflicting-objectives) 90 | * [9.3 Further Reading](09-cooperative-and-adversarial-agents/README.md#93-further-reading) 91 | * [Part V: Human Decision-Making and Beyond](10-decision-making-and-humans/README.md#part-v-human-decision-making-and-beyond) 92 | * [10. Decision-Making and Humans](10-decision-making-and-humans/README.md#10-decision-making-and-humans) 93 | * [10.1 Similarities between methods discussed and humans](10-decision-making-and-humans/README.md#101-similarities-between-methods-discussed-and-humans) 94 | * [10.2 Differences between methods discussed and humans](10-decision-making-and-humans/README.md#102-differences-between-methods-discussed-and-humans) 95 | * [10.3 Further Reading](10-decision-making-and-humans/README.md#103-further-reading) 96 | * [11. Conclusion](11-conclusion/README.md#11-conclusion) 97 | * [12. Recommended Books](12-recommended-books/README.md#12-recommended-books) 98 | * [12. Recommended Courses](13-recommended-courses/README.md#13-recommended-courses) 99 | 100 | 101 | 102 | # Notebooks Installation 103 | 104 | This repository contains Jupyter Notebooks to follow along with the lectures. 
However, there are several 105 | packages and applications that need to be installed. To make things easier on you, I took a little longer 106 | time to setup a reproducible environment that you can use to follow along. 107 | 108 | ## Install git 109 | 110 | Follow the instructions at (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) 111 | 112 | ## Install Docker 113 | 114 | Follow the instructions at (https://docs.docker.com/engine/getstarted/step_one/#step-2-install-docker) 115 | 116 | ## Run Notebooks 117 | 118 | ### TL;DR version 119 | 120 | 1. `git clone git@github.com:mimoralea/applied-reinforcement-learning.git && cd applied-reinforcement-learning` 121 | 2. `docker pull mimoralea/openai-gym:v1` 122 | 3. `docker run -it --rm -p 8888:8888 -p 6006:6006 -v $PWD/notebooks/:/mnt/notebooks/ mimoralea/openai-gym:v1` 123 | 124 | ### A little more detailed version: 125 | 126 | 1. Clone the repository to a desired location (E.g. `git clone git@github.com:mimoralea/applied-reinforcement-learning.git ~/Projects/applied-reinforcement-learning`) 127 | 2. Enter into the repository directory (E.g. `cd ~/Projects/applied-reinforcement-learning`) 128 | 3. Either Build yourself or Pull the already built Docker container: 129 | 3.1. To build it use the following command: `docker build -t mimoralea/openai-gym:v1 .` 130 | 3.2. To pull it from Docker hub use: `docker pull mimoralea/openai-gym:v1` 131 | 4. Run the container: `docker run -it --rm -p 8888:8888 -p 6006:6006 -v $PWD/notebooks/:/mnt/notebooks/ mimoralea/openai-gym:v1` 132 | 133 | #### Open the Notebooks in your browser: 134 | 135 | * `http://localhost:8888` (or follow the link that came out of the run command about which will include the token) 136 | 137 | #### Open TensorBoard at the following address: 138 | 139 | * `http://localhost:6006` 140 | 141 | This will help you visualize the Neural Network in the lessons with function approximation. 142 | 143 | ## Docker Tips 144 | 145 | * If you'd like to access a bash session of a running container do: 146 | ** `docker ps` # will show you currently running containers -- note the id of the container you are trying to access 147 | ** `docker exec --user root -it c3fbc82f1b49 /bin/bash` # in this case c3fbc82f1b49 is the id 148 | * If you'd like to start a new container instance straight into bash (without running Jupyter or TensorBoard) 149 | ** `docker run -it --rm mimoralea/openai-gym:v1 /bin/bash` # this will run the bash session as the Notebook user 150 | ** `docker run --user root -e GRANT_SUDO=yes -it --rm mimoralea/openai-gym:v1 /bin/bash` # this will run the bash session as root 151 | 152 | 153 | -------------------------------------------------------------------------------- /02-sequential-decisions/README.md: -------------------------------------------------------------------------------- 1 | ## Part II: Reinforcement Learning and Decision-Making 2 | 3 | ### 2. Sequential Decisions 4 | 5 | As mentioned before, Reinforcement Learning introduces the notion of sequential decision-making. This 6 | idea of making a series of decisions forces the agent to take into account future sequences of actions, 7 | states, and rewards. In this lesson, we will explore some of the most fundamental aspects of sequential 8 | decision making. 9 | 10 | #### 2.1 Modeling Decision-Making Problems 11 | 12 | In order to attempt solving a problem, we must be able to represent it in a form that abstracts it 13 | allowing us to work on it. 
For decision-making problems, we can think of a few aspects that are 14 | common to all problems. 15 | 16 | First, we need to be able to receive percepts of the world; that is, the agent needs to be able to sense 17 | its environment. The input we get from the environment could directly represent the 18 | true state of the world. However, this is not always the case. For example, if we are creating a stock trading 19 | bot, we can think of the current stock price as part of the current state of the world. However, anyone 20 | who has purchased stocks knows that the sell and buy prices are mere estimates of the price the 21 | stock will actually trade at, and for some transactions those estimates are not entirely accurate. Another example in which this 22 | issue is much easier to see is robotics. GPS localization, for instance, is only accurate 23 | to within a few meters. That amount of sensor noise could be the difference between an autonomous 24 | car driving safely and getting into an accident with the car in the next lane. The point is that, when we model 25 | the real world, we need to account for the fact that the things we "see" are not 26 | necessarily the things that "are". This distinction will come up again later; for now, we can assume that we live 27 | in a perfect world and that our percepts are a true representation of the state of the world. Another 28 | important fact to clarify is that the representation of a state must include all necessary history within the state itself. 29 | In other words, states should be represented as memory-less. This is known as the Markov property, and it is 30 | a fundamental assumption behind the kinds of decision-making problems we will be exploring in these lessons. 31 | 32 | Second, all decision-making problems have available actions. For the stock bot, we can think 33 | of a few actions: sell, buy, and hold. We could also add some special actions such as limit sell, limit buy, 34 | options, etc. A robot's actions could be to apply a given voltage to a given actuator for a given amount of time. Just as 35 | we clarified that a percept may not exactly represent the state of the world, actions might 36 | not turn out with the same outcome every time they are taken. That is, actions are not necessarily deterministic. 37 | For the stock agent example, we can think of the small probability that sending a buy request to the server 38 | returns with a communication error. That is, the probability of actually executing the action we selected 39 | could be 99.9%, but there is still a small chance that the action doesn't go through as we intended. This 40 | stochasticity is represented by transition functions. These functions give the probability of each possible transition 41 | when taking an action in a given state. The transition probabilities for a given state-action pair must sum to 1. 42 | One thing we need to make clear, however, is that these probabilities must always stay the same. That is, we might not know 43 | the exact probability of transitioning to a new state given the current state and action, but whatever that value is, it remains the 44 | same. In other words, the model of the world must be stationary. 45 | 46 | Third, we also have to introduce a feedback signal so that we can evaluate our decision-making abilities. 47 | Many fields other than Reinforcement Learning represent this feedback as a cost signal; Reinforcement Learning 48 | refers to these signals as rewards.
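To make the transition function just described a bit more concrete, here is a minimal sketch of one way it could be stored in code. Everything in it is a made-up example used purely for illustration (the state names, actions, and probabilities are not part of the lessons or notebooks); the point is only that each state-action pair maps to outcomes whose probabilities sum to 1, and that those numbers never change, which is the stationarity assumption mentioned above.

```python
# Hypothetical transition function for a tiny stochastic world.
# T[state][action] is a list of (probability, next_state) pairs,
# and the probabilities for each state-action pair must sum to 1.
T = {
    'standing': {
        'walk': [(0.9, 'moved'), (0.1, 'standing')],  # walking usually works
        'run':  [(0.6, 'moved'), (0.4, 'fallen')],    # running is riskier
    },
    'fallen': {
        'stand-up': [(1.0, 'standing')],              # a deterministic action
    },
}

def check_stationary_model(transitions):
    """Sanity-check that every state-action pair defines a proper distribution."""
    for state, actions in transitions.items():
        for action, outcomes in actions.items():
            total = sum(prob for prob, _ in outcomes)
            assert abs(total - 1.0) < 1e-9, (state, action, total)

check_stationary_model(T)
```

With states, actions, and transitions in place, the only missing ingredient is the feedback signal discussed next.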
On our trading agent, the reward could simply be the profit or loss made 49 | from a single transaction, or perhaps we could make our reward signal the difference in total assets before and 50 | after making a transaction. In a robotic task, the reward could be slightly more complex. For example, we could 51 | design an agent that gets a positive reward for every step it stays upright while walking. Or maybe it gets a reward signal only after 52 | a specific task is accomplished. The important point is that the reward will ultimately have a big influence 53 | on how our agent performs. As we can see, rewards are part of the environment. However, oftentimes we have to design 54 | these reward signals ourselves. Ideally, we are able to identify a natural signal that we are interested in maximizing. 55 | 56 | The model representation described above is widely known as a Markov Decision Process (MDP). An MDP is a framework 57 | for modeling sequential decision-making problems. An MDP is composed of a tuple (S, A, R, T) in which S is the set 58 | of states, A is the set of actions, R is the reward function mapping state-action pairs to a numeric value, and T is 59 | the transition function mapping each state-action pair to a probability distribution over next states. 60 | 61 | We will be using MDPs moving forward, though it is important to mention that MDPs have lots of variants; 62 | Dec-MDPs, POMDPs, QMDPs, AMDPs, MC-POMDPs, Dec-POMDPs, ND-POMDPs, and MMDPs are some of the most common ones. They all 63 | represent some type of problem related to MDPs. We will be loosening the constraints MDPs impose and generalizing 64 | the representation of decision-making problems as we go. 65 | 66 | #### 2.2 Solutions Representation 67 | 68 | Now that we have a framework to represent decision-making problems, we need to devise a way of communicating 69 | possible solutions to these problems. The first word that comes to mind when thinking about solutions to 70 | decision-making problems is "plan". A plan can be seen as a sequence of steps to accomplish a goal. This is 71 | great but probably too simplistic. Mike Tyson once said, "Everyone has a plan 'till they get punched in the 72 | mouth." And it is true: we need something more adaptive than a simple plan. The next step, then, is to think 73 | of a plan and create conditions that help us deal with the uncertainty of the environment. This type of planning 74 | is known as conditional planning, which is basically a regular plan in which we prepare in advance for the 75 | contingencies that may arise. However, if we expand this a bit further, we can think of a conditional plan 76 | that takes into account every single possible contingency, even those we haven't thought of. This is called a universal 77 | plan or, better yet, a policy. In Reinforcement Learning, a policy is a function mapping states to actions, and it 78 | represents a solution to an MDP. The algorithms that we will be discussing later will directly or indirectly produce 79 | the best possible policy, also called the optimal policy. This is important to understand and remember. 80 | 81 | #### 2.3 Simple Sequential Problem 82 | 83 | Given all of the information above, let's review the simplest problem we can think of: a casino with 2 slot machines. To illustrate some important points better, imagine you enter 84 | the slot machine area after paying a flat fee. However, you are only allowed to play 100 trials on either of the 2 machines.
Also, the machines pay either $1 or nothing on each pull, according to an underlying, fixed, and unknown 86 | probability. The Reinforcement Learning problem then becomes: how can you maximize the amount of money you 87 | get out of those 100 pulls? Should you pick one arm and stick to it for all 100 trials? Should you instead alternate, pulling 1 88 | and 1? Should you pull 50 and 50? In other words, what is the best strategy or policy for maximizing all 89 | future rewards? 90 | 91 | The difficulty of this problem, also known as the k-Armed Bandit (in this case, k=2), is that you need to 92 | simultaneously acquire knowledge of the environment and, at the same time, exploit the knowledge you 93 | have already acquired. This fundamental trade-off between exploration and exploitation is what makes 94 | decision-making problems hard. You might believe that a particular arm has a fairly high payoff probability; 95 | should you then choose it every time? Should you instead choose an arm you know less well in order to gain information 96 | about its payoff? Or should you pick one you already have good information about, on the chance that knowing a little more 97 | would improve your knowledge of the environment? 98 | 99 | The answers to the questions posed above depend on several factors. For example, if instead of allowing you 100 | 100 trials I give you only 3, how would your strategy change? Moreover, if I give you an infinite number of trials, 101 | then you really want to invest time in learning the environment, even if doing so gives you sub-optimal results early on. 102 | The knowledge that you gain from the initial exploration will ensure you maximize the expected future rewards over the long 103 | term. 104 | 105 | 106 | #### 2.4 Slightly more complex problems 107 | 108 | When explaining reinforcement learning, it is very common to use a very basic grid world to illustrate fundamental 109 | concepts. Let's think of a grid world where the agent starts at 'S'. Reaching the space marked with a 'G' ends the game and gives the agent a reward of 1. Reaching the space with an 'F' ends the game and gives the agent 110 | a reward of -1. The agent is able to select one of 4 actions every time: (N, S, E, W). Each action has exactly the effect we expect: for example, N moves the agent one cell up, and E moves it to the cell on the right. The exceptions are when the agent attempts to enter a space marked with an 'X', which is a wall and cannot be entered, or tries to move off the grid, for example moving left from the leftmost cell; in those cases the agent simply bounces back to the cell it took the action from. 111 | 112 | 113 | #### 2.5 Evaluating solutions 114 | 115 | Before we begin exploring how to get the best solution to this problem, I'd like us to take a detour into how we can 116 | tell how good a solution is. For example, we can imagine a solution given by a series of arrows representing 117 | the different actions to be taken at each cell. 118 | 119 | Is there a way we can put a number to this policy so we can later rank it? 120 | 121 | If we need to use a single number, I think we could all agree that the value of the policy can be defined 122 | as the sum of all rewards that we would get starting in state 'S' and following the policy. The algorithm that computes this value is 123 | called policy evaluation. 124 | 125 | One thing you might be wondering after reading the previous paragraph is: what happens if one policy gives 126 | lots of rewards early on and nothing later, while another policy gives no rewards early on but lots of rewards later?
Is there a way we can account for our preference for early rewards? The answer is yes. So, instead of using the 128 | sum of all rewards as mentioned before, we will use the sum of discounted rewards, in which the reward received at time 129 | `t` is discounted by a factor, let's call it gamma, raised to the power `t`. And so we get that policy evaluation basically 130 | calculates the following equation for all states: 131 | 132 | ``` 133 | Vpi(s) = Epi{ r_{t+1} + g*r_{t+2} + g^2*r_{t+3} + ... | S_t = s } 134 | ``` 135 | 136 | So, we are basically finding the value we would get from each of the states if we followed this policy. 137 | Fair enough. Let's not dwell on the equations here; check the "Further Reading" sections for the details. 138 | 139 | #### 2.6 Improving on solutions 140 | 141 | Now that we know how to come up with a single value for a given policy, the natural question is: how 142 | do we improve on a policy? If we can devise a way to improve a policy, and we already know how to evaluate one, we should be able 143 | to iterate between evaluation and improvement and reach the best policy starting from any random policy. And that 144 | would be very useful. 145 | 146 | The core of the question is whether there is an action, different from the one suggested by the 147 | policy, that would make the value calculated above larger. How about we temporarily select a different action than 148 | the one suggested by the policy and then follow the policy as originally suggested afterwards? This way we isolate the 149 | effect of that single action on the value of the policy. This is actually the basis for an algorithm called policy improvement. 150 | 151 | #### 2.7 Finding Optimal solutions 152 | 153 | One of the powerful facts about policy improvement is that this way of finding better policies from 154 | a given policy guarantees that the policy returned will be at least as good as the one we started with, and often better. 155 | This allows us to think of an algorithm that uses policy evaluation to get the value of a policy and then 156 | policy improvement to try to improve that policy, repeating the two steps until the improvement step can no longer produce a better policy, at which point we 157 | stop. This algorithm is called policy iteration. 158 | 159 | #### 2.8 Improving on Policy Iteration 160 | 161 | Policy iteration is great because it guarantees that we will get the very best policy available for a given 162 | MDP. However, sometimes it takes an unnecessarily large amount of computation before it comes up with that best policy. 163 | Another way of thinking about this is: is there a small number delta (e.g. 0.0001) that we would be OK with 164 | accepting as the largest change in the value of any given state? If there is, then we can cut the policy evaluation 165 | algorithm short and use the resulting state values to guide our decision-making. This algorithm is called value iteration. 166 | 167 | #### 2.9 Exercises 168 | 169 | In this lesson, we reviewed ways to solve sequential problems. The following Notebook goes into a little more 170 | detail about the Dynamic Programming way of solving problems. We will look into the Fibonacci sequence problem 171 | and devise a few ways of solving it: Recursion, Memoization, and Dynamic Programming. 172 | 173 | Lesson 2 Notebook.
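To connect the ideas from sections 2.5 through 2.8 with the exercises, here is a minimal value-iteration sketch. It assumes a hypothetical dictionary-based MDP of the form `mdp[state][action] -> [(probability, next_state, reward), ...]`; the two-state example at the bottom and all of its numbers are made up for illustration and are not the notebook's solution code.

```python
# Minimal value-iteration sketch over a made-up, dictionary-based MDP.
def value_iteration(mdp, gamma=0.9, delta=1e-4):
    V = {s: 0.0 for s in mdp}                        # start every state value at zero
    while True:
        max_change = 0.0
        for s, actions in mdp.items():
            if not actions:                          # terminal states keep a value of zero
                continue
            # back up each action's expected return and keep the best one
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            max_change = max(max_change, abs(best - V[s]))
            V[s] = best
        if max_change < delta:                       # stop once no state changes by more than delta
            break
    # greedy policy with respect to the final state values
    policy = {
        s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a]))
        for s, actions in mdp.items() if actions
    }
    return V, policy

example_mdp = {
    'S': {'E': [(0.8, 'G', 1.0), (0.2, 'F', -1.0)],  # risky move toward the goal
          'W': [(1.0, 'S', 0.0)]},                   # safe move that goes nowhere
    'G': {},                                         # reaching the goal ends the game
    'F': {},                                         # reaching the failure state ends the game
}
print(value_iteration(example_mdp))
```

Policy iteration would look very similar; the difference is that it alternates a full policy-evaluation pass with a separate policy-improvement step instead of folding both into a single update, which is exactly the trade-off discussed in section 2.8.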
174 | 175 | #### 2.10 Further Reading 176 | 177 | * [Dynamic programming](https://people.eecs.berkeley.edu/~vazirani/algorithms/chap6.pdf) 178 | * [Value iteration and policy iteration algorithms for Markov decision problem](http://www.ics.uci.edu/~csp/r42a-mdp_report.pdf) 179 | * [Introduction to Markov Decision Processes](http://castlelab.princeton.edu/ORF569papers/Powell_ADP_2ndEdition_Chapter%203.pdf) 180 | * [Reinforcement Learning: A Survey](https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kaelbling.pdf) 181 | -------------------------------------------------------------------------------- /paper/rldm.tex: -------------------------------------------------------------------------------- 1 | \documentclass[11pt]{article} % For LaTeX2e 2 | \usepackage{rldmsubmit,palatino} 3 | \usepackage{graphicx} 4 | \usepackage{hyperref} 5 | 6 | \title{A More Robust Way of Teaching Reinforcement Learning and Decision Making} 7 | 8 | \author{ 9 | Miguel Morales\thanks{http://www.mimoralea.com} \\ 10 | Department of Computer Science \\ 11 | Georgia Institute of Technology \\ 12 | Atlanta, GA 30332 \\ 13 | \texttt{mimoralea@gatech.edu} \\ 14 | } 15 | 16 | \newcommand{\fix}{\marginpar{FIX}} 17 | \newcommand{\new}{\marginpar{NEW}} 18 | 19 | \begin{document} 20 | 21 | \maketitle 22 | 23 | \begin{abstract} 24 | I propose a new way of teaching reinforcement learning and decision making 25 | that is designed to be an improvement to traditional academic teaching. I use 26 | a three-step approach to delivering a complete learning experience in a 27 | way that engages the student and allows them to grasp the concepts regardless 28 | of their skill level. I present a specific way of teaching the content, a 29 | new and fully configured coding platform, a set of hands-on exercises and 30 | a group of recommended next steps for deeper learning. 31 | \end{abstract} 32 | 33 | \keywords{teaching tutorials jupyter intuition hands-on} 34 | \repository{https://www.github.com/mimoralea/applied-reinforcement-learning} 35 | \spresentation{https://youtu.be/ltjS5ktziLQ} 36 | \lpresentation{https://youtu.be/1WjNj_JmFaE} 37 | 38 | \acknowledgements{ 39 | I am thankful to my mentor, Kenneth Brooks, for providing assistance when 40 | navigating the field of Educational Technology. Also, for giving direct, 41 | concise and clear feedback on how to make this project better. Thank you 42 | to all my peers who also provided sincere feedback throughout the semester. 43 | I hope to see you all enjoying our OMSCS course in Reinforcement Learning 44 | and Decision Making, but not before going through these lessons. It will 45 | be a rewarding experience. Pun intended. 46 | } 47 | 48 | \startmain % to start the main 1-4 pages of the submission. 49 | 50 | \section{Introduction} 51 | 52 | Reinforcement Learning and Decision Making is a complex subject. Being the 53 | focus of research of a variety of fields including artificial intelligence, 54 | psychology, machine learning, operations research, control theory, animal 55 | and human neuroscience, economics, and ethology, it is expected that the 56 | vast amount of available information could become counterproductive if not 57 | handled properly. Beginners often find themselves lost while trying to grasp 58 | the key concepts that are truly vital for understanding. 
Additionally, reinforcement 59 | learning and decision making, being a relatively new field, is often taught by 60 | world-class researchers who frequently, and unintentionally, omit explaining 61 | core concepts that might seem too basic \cite{gapranda}, yet remain 62 | fundamental. This creates a gap in knowledge that, if left unfilled, causes 63 | trouble when learning the more advanced topics. 64 | 65 | These points present some of the challenges of sparking an interest and keeping the 66 | students engaged throughout their entire learning experience. If the content is not 67 | delivered correctly, the students can quickly feel confused, lost and disengaged, and 68 | when that happens learning stops. 69 | 70 | \section{Sparking Curiosity} 71 | 72 | Fortunately, since reinforcement learning and decision making is studied 73 | by fields like animal and human neuroscience, ethology, and psychology\cite{suttons98}, 74 | the concepts can often be taught in a direct way, using ordinary examples, 75 | in order to connect on an intuitive level. Recent studies in neuroscience have 76 | shown that emotions and cognition are interrelated\cite{intuition}. By keeping 77 | the readings approachable, I allow students to connect with the narratives at different 78 | levels. The notion of learning by interacting with an environment should be easy enough 79 | for all of us to understand, as this is one of the ways we learn. Reinforcement 80 | learning in artificial intelligence has several similarities with human learning. 81 | 82 | I leverage this fact and use a strategy to keep the readers engaged in the 83 | material. 84 | 85 | \subsection{Using Simple And Direct Language} 86 | 87 | Another important component of this work is the use of simple and direct 88 | language throughout the documents. This keeps the reader engaged regardless 89 | of their reinforcement learning knowledge level. 90 | 91 | I carefully select words and examples that bring the concepts to a 92 | common-sense understanding so that all students can follow the initial 93 | readings. 94 | 95 | \subsection{Keeping A Single Narrative} 96 | 97 | Additionally, and what was perhaps the most difficult part, I keep a single 98 | narrative throughout the sequence of concepts being presented. The intention 99 | here is to allow students to keep reading and to use the understanding they 100 | accumulate in previous lessons to understand the subsequent ones. Similar to what 101 | the direct instruction paradigm\cite{directinstruction} encourages, one of the 102 | most important parts of this project is providing the structure and sequence 103 | in which the concepts are presented. 104 | 105 | The more traditional approach is to select concepts from the entire body of 106 | reinforcement learning and decision making and use different lessons to present 107 | different material. However, the problem with this approach is that it does 108 | not help the student grasp the complete picture or the connections between topics. 109 | The effort to present concepts in a logical sequence, despite being complex to define 110 | initially, not only feels more natural to beginners, but also helps 111 | them stay engaged in the material as they continue learning new concepts. 112 | 113 | \subsection{Showing Concepts And Their Complement} 114 | 115 | Finally, in order to spark and maintain the students' curiosity, I show the 116 | full spectrum of a single concept.
Even if I only define the opposite side, I 117 | still make an effort to mention it and briefly explain it. Often, things in 118 | life have a complementary side that, when the two are shown together, better reveals the qualities 119 | of one another. For example, explaining deterministic actions is interesting 120 | all by itself, but you gain a much better understanding if I explain 121 | them alongside stochastic actions. This approach is also known as Compare and 122 | Contrast, and the literature suggests that teaching comparative thinking 123 | strengthens student learning\cite{compare}. 124 | 125 | I paid close attention to showing concepts and their complements in every 126 | lesson. The expectation is that this helps the students get a better 127 | sense of the full range of possibilities for any given point. Presenting concepts 128 | in this format keeps students engaged as the concepts get progressively more 129 | and more complex. 130 | 131 | \section{Removing Friction} 132 | 133 | Once the students' curiosity has been sparked and intuition is engaged, a 134 | convenient way to interact with the concepts should be presented. The 135 | friction of getting hands-on experience is one of the most difficult 136 | barriers to break for beginners, but once it is overcome, the student can 137 | understand the concepts much better. 138 | 139 | I worked on three important points to fully remove the friction 140 | beginners face when first getting into reinforcement learning. 141 | 142 | \subsection{Setting Up A Convenient Environment} 143 | 144 | One of the most remarkable accomplishments of this project is the creation 145 | of a fully configured reinforcement learning platform for using OpenAI 146 | Gym \cite{openaigym} environments from Jupyter Notebooks inside Docker 147 | containers. 148 | 149 | Technicalities aside, it is wonderful to have a ready-to-go environment that gets 150 | students up and running with roughly a 20-minute wait for the first run after 151 | copy-pasting the provided commands. After that initial setup, 152 | every subsequent run takes less than a minute. This allows the student to spend 153 | only a minimal amount of time configuring and battling with packages and configuration 154 | scripts, which add no reinforcement learning knowledge, and lets them concentrate 155 | all their effort on the concepts that truly matter. 156 | 157 | \subsection{Providing With Boilerplate Code} 158 | 159 | Moreover, I supplement the notebooks with abundant boilerplate code. Plotting and 160 | visualization functions, which very likely aid the learning process \cite{visualization}, and 161 | helpers that issue web requests in the background to show videos of carefully 162 | selected agent episodes are some examples of the 163 | code provided to the students. 164 | 165 | This allows the students to interact only with the bits of code that are directly 166 | related to reinforcement learning, and to safely ignore the rest. 167 | 168 | \subsection{Asking For Minimal Effort} 169 | 170 | Then, I proceed to ask students to put in just enough effort to get them engaged. 171 | The hands-on interactions with the notebooks are designed for beginners getting 172 | started with reinforcement learning. Perhaps these students have not seen 173 | reinforcement learning, or even machine learning, code in action before.
Therefore, 174 | in addition to all of the boilerplate material already mentioned, I also 175 | provide the most common algorithms in each of the notebooks, and 176 | only ask the students to complete small sections that make the core 177 | algorithms work more effectively. 178 | 179 | The idea is that after they have had contact with reinforcement learning code, they 180 | will have more confidence when tackling the more advanced problems and 181 | projects during the OMSCS course. 182 | 183 | \section{Showing Options} 184 | 185 | Lastly, connecting to intuition and getting hands-on experience will be 186 | futile unless the students develop an interest in exploring the field by 187 | themselves. This is the most important aspect of this project: I believe 188 | education is about motivation. The role of an instructor is merely to spark 189 | students' curiosity and help them find the path to their own realization. 190 | 191 | Therefore, at this point I hope to have awakened the students' interest in 192 | exploring this marvelous field. Now, showing the path for further learning is 193 | a final and very important step. 194 | 195 | \subsection{Assigning Relevant Readings} 196 | 197 | To help the students better navigate the field of reinforcement learning, 198 | I provide ``Further Reading'' sections in every single lesson, and a 199 | single, final ``Recommended Books'' section at the end of the project. The fact 200 | that I teach the concepts in direct and simple language is by no means 201 | an indication that the academic material can be skipped. Rather, the way I 202 | present the material should be seen as a \emph{primer}, helping the 203 | concepts presented later come together more naturally and be absorbed more quickly. 204 | 205 | \subsection{Watching Academic Lectures} 206 | 207 | Next, I hope students go on to watch academic lectures. 208 | Having world-class experts in the field of reinforcement learning teach 209 | concepts that they are thoroughly familiar with, and have been studying and working with for years, 210 | is valuable for the students. For this reason, I added a ``Recommended Courses'' 211 | section for students to continue the search and learning on their own. 212 | 213 | \subsection{Completing Homework and Projects} 214 | 215 | Finally, I would hope that many of the students using these materials are 216 | the same students who are either planning to enroll in the OMSCS course or have just 217 | enrolled. The OMSCS course, after a brief explanation of core concepts, 218 | moves on to very advanced material at a very rapid pace. In addition, there are specially 219 | designed homework and project assignments so that the students get a solid 220 | grasp of reinforcement learning. 221 | 222 | Completing the coursework would certainly put the students in the driver's seat, 223 | making them owners of their own learning and letting them wisely pick which reinforcement 224 | learning area to explore next. 225 | 226 | \section{Future Work} 227 | 228 | No work is perfect and neither is this one. However, for the 229 | {\raise.17ex\hbox{$\scriptstyle\sim$}}2 months of effort 230 | put into it, I think the progress that has been made is remarkable. I started 231 | with an aggressive proposal and delivered on most of it. I kept progress steady, but 232 | flexible enough to adapt along the way, while still completing core components.
233 | The lessons, the container, the notebooks, the assigned readings, the recommended 234 | courses, all provide with a solid foundation for the deep understanding of 235 | reinforcement learning and decision making. 236 | 237 | It is this foundation that can now make further progress easier to achieve. After 238 | opening this work to the community during the summer semester, I hope to receive 239 | help and feedback to make this project even better going forward. 240 | 241 | \subsection{Additional Notebooks} 242 | 243 | An important component for future work is the addition of notebooks. I had the capability 244 | to complete seven notebooks, but while trying to rush in some final work, I noticed 245 | the quality of the later notebooks were seriously degrading the quality of the 246 | project. Instead of pushing onto additional notebooks, I opted for improving the quality 247 | of previous notebooks and leaving the newer projects out of this release. 248 | 249 | This creates an opportunity for re-adding those notebooks that were removed and 250 | improving them considerably. Also, the addition of new notebooks would be of 251 | great benefit as well. 252 | 253 | \subsection{Effectiveness Evaluation} 254 | 255 | A more difficult future work component would be to find a way to measure the effectiveness 256 | of this material. Ideally, an Educational Technology student can take on the task 257 | to research whether the strategy presented here actually improves student 258 | performance. It would be interesting to gather and study this kind of feedback. 259 | 260 | \subsection{Request For Feedback} 261 | 262 | Finally, one of the next steps I will be taking on is to release this project 263 | in different places. First, to previous students on the Slack channel of the 264 | Georgia Tech Study Group organization. These folks are now veterans of our course 265 | and would be a great source of feedback. Second, I will release to the OpenAI 266 | community through their discussion forums in an attempt to get a very diverse group 267 | to review and provide feedback. The expectation is that this feedback will be 268 | followed up with actual changes in the form of GitHub pull requests. This, and only 269 | this, would make this the project I initially envisioned. 270 | 271 | \section{Conclusion} 272 | 273 | In this paper, I proposed a more robust way of teaching reinforcement learning 274 | and decision making. I presented a series of lessons taught in a very specific 275 | format, I delivered a fully-configured coding environment for the development 276 | of reinforcement learning agents and algorithms, I provided with boilerplate 277 | code and a series of notebooks to assist with hands-on experimentation, and I 278 | supplemented this with more academic readings, and lectures. 279 | 280 | I sincerely hope this project will be useful to lots of people interested in 281 | learning the ins and outs of reinforcement learning and decision making. And, 282 | in fact, the project recently helped an OMSCS Reinforcement Learning and 283 | Decision Making student find his way around the complex topic of function 284 | approximation in reinforcement learning. The potential, however, is bigger, and 285 | the path to improvement obvious in some cases. My desire is to see this 286 | work continue to grow into a more mature and effective way of teaching this 287 | amazing field. 288 | 289 | \medskip 290 | 291 | \begin{thebibliography}{9} 292 | 293 | \bibitem{gapranda} 294 | Ferguson, Julie E. 
295 | \textit{Bridging The Gap Between Research and Practice} 296 | KM4Dev, Volume 1(3), 46-54. 297 | 298 | \bibitem{suttons98} 299 | Richard Sutton, Andrew Barto. 300 | \textit{Reinforcement Learning: An Introduction.} 301 | MIT Press, 1998. 302 | 303 | \bibitem{intuition} 304 | Maray Immordino-Yang, Matthias Faeth. 305 | \textit{Building Smart Students: A Neuroscience Perspective on the Role 306 | of Emotion and Skilled Intuition in Learning.} 307 | Bloomington, 2010. 308 | 309 | \bibitem{directinstruction} 310 | Baumann, James F. 311 | \textit{The effectiveness of a direct instruction paradigm for teaching main idea comprehension.} 312 | Reading Research Quarterly (1984): 93-115. 313 | 314 | \bibitem{compare} 315 | Silver, Harvey F. 316 | \textit{Compare and Contrast.} 317 | Strategic Teacher PLC Guides, 2010. 318 | 319 | \bibitem{visualization} 320 | Naps, Thomas L., et al. 321 | \textit{Exploring the role of visualization and engagement in computer science education.} 322 | ACM Sigcse Bulletin. Vol. 35. No. 2. ACM, 2002. 323 | 324 | \bibitem{openaigym} 325 | Greg Brockman, Vicki Cheung, Ludwig Pettersson et al. 326 | \textit{OpenAI Gym.} 327 | ArXiv, 1606.01540, 2016. 328 | 329 | \end{thebibliography} 330 | 331 | \end{document} 332 | -------------------------------------------------------------------------------- /notebooks/02-dynamic-programming.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": true, 7 | "editable": true 8 | }, 9 | "source": [ 10 | "### Recursion, Memoization and Dynamic Programming" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "deletable": true, 17 | "editable": true 18 | }, 19 | "source": [ 20 | "Remember how we talk about using recursion and dynamic programming. One interesting thing to do is to implement the solution to a common problem called Fibonnaci numbers on these two styles and compare the compute time." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "deletable": true, 27 | "editable": true 28 | }, 29 | "source": [ 30 | "The Fibonacci series looks something like: `0, 1, 1, 2, 3, 5, 8, 13, 21 …` and so on. Any person can quickly notice the pattern. `f(n) = f(n-1) + f(n-2)` So, let's walk through a recursive implementation that solves this problem." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": { 37 | "collapsed": true, 38 | "deletable": true, 39 | "editable": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "def fib(n):\n", 44 | " if n < 2:\n", 45 | " return n\n", 46 | " return fib(n-2) + fib(n-1)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 2, 52 | "metadata": { 53 | "collapsed": false, 54 | "deletable": true, 55 | "editable": true 56 | }, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "CPU times: user 296 ms, sys: 0 ns, total: 296 ms\n", 63 | "Wall time: 295 ms\n" 64 | ] 65 | }, 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "832040" 70 | ] 71 | }, 72 | "execution_count": 2, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "%time fib(30)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": { 84 | "deletable": true, 85 | "editable": true 86 | }, 87 | "source": [ 88 | "Now, the main problem of this algorithm is that we are computing some of the subproblems more than once. 
For instance, to compute fib(4) we would compute fib(3) and fib(2). However, to compute fib(3) we also have to compute fib(2). Say hello to memoization." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": { 94 | "deletable": true, 95 | "editable": true 96 | }, 97 | "source": [ 98 | "A technique called memoization we are cache the results of previously computed sub problems to avoid unnecessary computations." 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 3, 104 | "metadata": { 105 | "collapsed": true, 106 | "deletable": true, 107 | "editable": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "m = {}\n", 112 | "def fibm(n):\n", 113 | " if n in m:\n", 114 | " return m[n]\n", 115 | " m[n] = n if n < 2 else fibm(n-2) + fibm(n-1)\n", 116 | " return m[n]" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 4, 122 | "metadata": { 123 | "collapsed": false, 124 | "deletable": true, 125 | "editable": true 126 | }, 127 | "outputs": [ 128 | { 129 | "name": "stdout", 130 | "output_type": "stream", 131 | "text": [ 132 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 133 | "Wall time: 17.6 µs\n" 134 | ] 135 | }, 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "832040" 140 | ] 141 | }, 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "%time fibm(30)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": { 154 | "deletable": true, 155 | "editable": true 156 | }, 157 | "source": [ 158 | "But the question is, can we do better than this? The use of the array is helpful, but when calculating very large numbers, or perhaps on memory contraint environments it might not be desirable. This is where Dynamic Programming fits the bill." 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": { 164 | "deletable": true, 165 | "editable": true 166 | }, 167 | "source": [ 168 | "In DP we take a bottom-up approach. Meaning, we solve the next Fibonacci number we can with the information we already have." 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 13, 174 | "metadata": { 175 | "collapsed": true, 176 | "deletable": true, 177 | "editable": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "def fibdp(n):\n", 182 | " if n == 0: return 0\n", 183 | " prev, curr = (0, 1)\n", 184 | " for i in range(2, n+1):\n", 185 | " newf = prev + curr\n", 186 | " prev = curr\n", 187 | " curr = newf\n", 188 | " return curr" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 25, 194 | "metadata": { 195 | "collapsed": false, 196 | "deletable": true, 197 | "editable": true 198 | }, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 205 | "Wall time: 6.44 µs\n" 206 | ] 207 | }, 208 | { 209 | "data": { 210 | "text/plain": [ 211 | "832040" 212 | ] 213 | }, 214 | "execution_count": 25, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "%time fibdp(30)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": { 226 | "deletable": true, 227 | "editable": true 228 | }, 229 | "source": [ 230 | "In this format, we don’t need to recurse or keep up with the memory intensive cache dictionary. These, add up to an even better performance." 
231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": { 236 | "deletable": true, 237 | "editable": true 238 | }, 239 | "source": [ 240 | "Let's now give it a try with factorials. Remember `4! = 4 * 3 * 2 * 1 = 24`. Can you give it try?" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 1, 246 | "metadata": { 247 | "collapsed": true, 248 | "deletable": true, 249 | "editable": true 250 | }, 251 | "outputs": [], 252 | "source": [ 253 | "def factr(n):\n", 254 | " if n < 3:\n", 255 | " return n\n", 256 | " return n * factr(n - 1)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 5, 262 | "metadata": { 263 | "collapsed": false, 264 | "deletable": true, 265 | "editable": true 266 | }, 267 | "outputs": [ 268 | { 269 | "name": "stdout", 270 | "output_type": "stream", 271 | "text": [ 272 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 273 | "Wall time: 43.2 µs\n" 274 | ] 275 | }, 276 | { 277 | "data": { 278 | "text/plain": [ 279 | "265252859812191058636308480000000" 280 | ] 281 | }, 282 | "execution_count": 5, 283 | "metadata": {}, 284 | "output_type": "execute_result" 285 | } 286 | ], 287 | "source": [ 288 | "%time factr(30)" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 99, 294 | "metadata": { 295 | "collapsed": true, 296 | "deletable": true, 297 | "editable": true 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "m = {}\n", 302 | "def factm(n):\n", 303 | " if n in m:\n", 304 | " return m[n]\n", 305 | " m[n] = n if n < 3 else n * factr(n - 1)\n", 306 | " return m[n]" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 100, 312 | "metadata": { 313 | "collapsed": false, 314 | "deletable": true, 315 | "editable": true 316 | }, 317 | "outputs": [ 318 | { 319 | "name": "stdout", 320 | "output_type": "stream", 321 | "text": [ 322 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 323 | "Wall time: 47.2 µs\n" 324 | ] 325 | }, 326 | { 327 | "data": { 328 | "text/plain": [ 329 | "265252859812191058636308480000000" 330 | ] 331 | }, 332 | "execution_count": 100, 333 | "metadata": {}, 334 | "output_type": "execute_result" 335 | } 336 | ], 337 | "source": [ 338 | "%time factm(30)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 10, 344 | "metadata": { 345 | "collapsed": true, 346 | "deletable": true, 347 | "editable": true 348 | }, 349 | "outputs": [], 350 | "source": [ 351 | "def factdp(n):\n", 352 | " if n < 3: return n\n", 353 | " fact = 2\n", 354 | " for i in range(3, n + 1):\n", 355 | " fact *= i\n", 356 | " return fact" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 11, 362 | "metadata": { 363 | "collapsed": false, 364 | "deletable": true, 365 | "editable": true 366 | }, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | "text": [ 372 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 373 | "Wall time: 7.87 µs\n" 374 | ] 375 | }, 376 | { 377 | "data": { 378 | "text/plain": [ 379 | "265252859812191058636308480000000" 380 | ] 381 | }, 382 | "execution_count": 11, 383 | "metadata": {}, 384 | "output_type": "execute_result" 385 | } 386 | ], 387 | "source": [ 388 | "%time factdp(30)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": { 394 | "deletable": true, 395 | "editable": true 396 | }, 397 | "source": [ 398 | "Let's think of a slightly different problem. 
Imagine that you want to find the cheapest way to go from city A to city B, but when you are about to buy your ticket, you see that you could hop in different combinations of route and get a much cheaper price than if you go directly. How do you efficiently calculate the best possible combination of tickets and come up with the cheapest route? We will start with basic recursion and work on improving it until we reach dynamic programming.\n", 399 | "\n", 400 | "For this last problem in dynamic programming, create 2 functions that calculates the cheapest route from city A to B. I will give you the recursive solution, you will build one with memoization and the one with dynamic programming." 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 1, 406 | "metadata": { 407 | "collapsed": true, 408 | "deletable": true, 409 | "editable": true 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "import numpy as np" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "Utility function to get fares between cities" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 2, 426 | "metadata": { 427 | "collapsed": false, 428 | "deletable": true, 429 | "editable": true, 430 | "scrolled": false 431 | }, 432 | "outputs": [], 433 | "source": [ 434 | "def get_fares(n_cities, max_fare):\n", 435 | " np.random.seed(123456)\n", 436 | " fares = np.sort(np.random.random((n_cities, n_cities)) * max_fare).astype(int)\n", 437 | " for i in range(len(fares)):\n", 438 | " fares[i] = np.roll(fares[i], i + 1)\n", 439 | " np.fill_diagonal(fares, 0)\n", 440 | " for i in range(1, len(fares)):\n", 441 | " for j in range(0, i):\n", 442 | " fares[i][j] = -1\n", 443 | " return fares" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "Let's try it out with 4 cities and random fares with a max of 1000." 
451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 5, 456 | "metadata": { 457 | "collapsed": false, 458 | "deletable": true, 459 | "editable": true 460 | }, 461 | "outputs": [ 462 | { 463 | "data": { 464 | "text/plain": [ 465 | "array([[ 0, 126, 260, 897],\n", 466 | " [ -1, 0, 50, 376],\n", 467 | " [ -1, -1, 0, 123],\n", 468 | " [ -1, -1, -1, 0]])" 469 | ] 470 | }, 471 | "execution_count": 5, 472 | "metadata": {}, 473 | "output_type": "execute_result" 474 | } 475 | ], 476 | "source": [ 477 | "n_cities = 4\n", 478 | "max_fare = 1000\n", 479 | "fares = get_fares(n_cities, max_fare)\n", 480 | "fares[1][2] = 50\n", 481 | "fares" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "Here is the recursive solution:" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 7, 494 | "metadata": { 495 | "collapsed": true, 496 | "deletable": true, 497 | "editable": true 498 | }, 499 | "outputs": [], 500 | "source": [ 501 | "def cheapestr(s, d, c):\n", 502 | " if s == d or s == d - 1:\n", 503 | " return c[s][d]\n", 504 | " \n", 505 | " cheapest = c[s][d]\n", 506 | " for i in range(s + 1, d):\n", 507 | " tmp = cheapestr(s, i, c) + cheapestr(i, d, c)\n", 508 | " cheapest = tmp if tmp < cheapest else cheapest\n", 509 | " return cheapest" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 8, 515 | "metadata": { 516 | "collapsed": false, 517 | "deletable": true, 518 | "editable": true 519 | }, 520 | "outputs": [ 521 | { 522 | "name": "stdout", 523 | "output_type": "stream", 524 | "text": [ 525 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 526 | "Wall time: 68.7 µs\n" 527 | ] 528 | }, 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "299" 533 | ] 534 | }, 535 | "execution_count": 8, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "%time cheapestr(0, len(fares[0]) - 1, fares)" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "Now, you build the memoization one:" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 9, 554 | "metadata": { 555 | "collapsed": false, 556 | "deletable": true, 557 | "editable": true 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "m = {}\n", 562 | "def cheapestm(s, d, c):\n", 563 | " \"\"\" YOU WRITE THIS FUNCTION \"\"\"\n", 564 | " return 0" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": 10, 570 | "metadata": { 571 | "collapsed": false, 572 | "deletable": true, 573 | "editable": true, 574 | "scrolled": true 575 | }, 576 | "outputs": [ 577 | { 578 | "name": "stdout", 579 | "output_type": "stream", 580 | "text": [ 581 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 582 | "Wall time: 22.6 µs\n" 583 | ] 584 | }, 585 | { 586 | "data": { 587 | "text/plain": [ 588 | "299" 589 | ] 590 | }, 591 | "execution_count": 10, 592 | "metadata": {}, 593 | "output_type": "execute_result" 594 | } 595 | ], 596 | "source": [ 597 | "%time cheapestm(0, len(fares[0]) - 1, fares)" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "Faster, you see?\n", 605 | "\n", 606 | "Now, do the dynamic programming version." 
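Before writing the dynamic-programming version, it is worth seeing why the plain recursive `cheapestr` above needs the help: it re-solves the same `(s, d)` sub-routes over and over. The instrumented copy below is my own throwaway variant (not part of the notebook); it only adds a call counter.

```python
call_count = 0

def cheapestr_counted(s, d, c):
    # identical logic to cheapestr above, plus a global call counter
    global call_count
    call_count += 1
    if s == d or s == d - 1:
        return c[s][d]
    cheapest = c[s][d]
    for i in range(s + 1, d):
        tmp = cheapestr_counted(s, i, c) + cheapestr_counted(i, d, c)
        cheapest = min(cheapest, tmp)
    return cheapest

cheapestr_counted(0, len(fares[0]) - 1, fares)
print(call_count)   # 9 calls for 4 cities; the count grows as 3**(n - 2)
```

With 18 cities that is 3^16 ≈ 43 million calls, which is why the larger run further down takes on the order of 18 seconds, even though there are only about n²/2 distinct `(s, d)` pairs. That redundancy is exactly what memoization and dynamic programming remove.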
607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": 11, 612 | "metadata": { 613 | "collapsed": false, 614 | "deletable": true, 615 | "editable": true 616 | }, 617 | "outputs": [], 618 | "source": [ 619 | "def cheapestdp(s, d, c):\n", 620 | " \"\"\" YOU WRITE THIS FUNCTION \"\"\"\n", 621 | " return 0" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 12, 627 | "metadata": { 628 | "collapsed": false, 629 | "deletable": true, 630 | "editable": true 631 | }, 632 | "outputs": [ 633 | { 634 | "name": "stdout", 635 | "output_type": "stream", 636 | "text": [ 637 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 638 | "Wall time: 61.5 µs\n" 639 | ] 640 | }, 641 | { 642 | "data": { 643 | "text/plain": [ 644 | "299" 645 | ] 646 | }, 647 | "execution_count": 12, 648 | "metadata": {}, 649 | "output_type": "execute_result" 650 | } 651 | ], 652 | "source": [ 653 | "%time cheapestdp(0, len(fares[0]) - 1, fares)" 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": { 659 | "deletable": true, 660 | "editable": true 661 | }, 662 | "source": [ 663 | "Let's now try with a larger example:" 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": 337, 669 | "metadata": { 670 | "collapsed": false, 671 | "deletable": true, 672 | "editable": true, 673 | "scrolled": true 674 | }, 675 | "outputs": [ 676 | { 677 | "data": { 678 | "text/plain": [ 679 | "array([[ 0, 123, 126, 129, 228, 260, 336, 352, 373, 376, 447, 451, 543,\n", 680 | " 776, 820, 840, 859, 897],\n", 681 | " [ -1, 0, 37, 61, 137, 146, 235, 245, 340, 343, 405, 574, 589,\n", 682 | " 590, 594, 753, 852, 861],\n", 683 | " [ -1, -1, 0, 16, 99, 117, 170, 199, 274, 342, 394, 401, 414,\n", 684 | " 462, 481, 595, 610, 641],\n", 685 | " [ -1, -1, -1, 0, 94, 95, 134, 138, 155, 433, 471, 497, 560,\n", 686 | " 630, 639, 683, 732, 758],\n", 687 | " [ -1, -1, -1, -1, 0, 85, 140, 149, 329, 370, 386, 395, 477,\n", 688 | " 544, 562, 566, 619, 634],\n", 689 | " [ -1, -1, -1, -1, -1, 0, 29, 30, 44, 113, 187, 207, 247,\n", 690 | " 249, 249, 356, 409, 630],\n", 691 | " [ -1, -1, -1, -1, -1, -1, 0, 22, 60, 168, 216, 277, 279,\n", 692 | " 372, 419, 449, 606, 690],\n", 693 | " [ -1, -1, -1, -1, -1, -1, -1, 0, 30, 36, 273, 321, 355,\n", 694 | " 415, 419, 421, 497, 500],\n", 695 | " [ -1, -1, -1, -1, -1, -1, -1, -1, 0, 19, 123, 400, 412,\n", 696 | " 418, 535, 547, 559, 591],\n", 697 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 110, 120, 140,\n", 698 | " 204, 220, 395, 454, 488],\n", 699 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 87, 169,\n", 700 | " 219, 257, 312, 487, 527],\n", 701 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 7,\n", 702 | " 11, 22, 244, 250, 281],\n", 703 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0,\n", 704 | " 10, 116, 123, 259, 287],\n", 705 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 706 | " 0, 22, 93, 109, 236],\n", 707 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 708 | " -1, 0, 29, 157, 185],\n", 709 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 710 | " -1, -1, 0, 18, 78],\n", 711 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 712 | " -1, -1, -1, 0, 4],\n", 713 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 714 | " -1, -1, -1, -1, 0]])" 715 | ] 716 | }, 717 | "execution_count": 337, 718 | "metadata": {}, 719 | "output_type": "execute_result" 720 | } 721 | ], 722 | "source": [ 723 | "n_cities = 18 # this will take a little before 20 seconds. 
Try not to make it any larger :)\n", 724 | "max_fare = 1000\n", 725 | "fares = get_fares(n_cities, max_fare)\n", 726 | "fares" 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": 338, 732 | "metadata": { 733 | "collapsed": false, 734 | "deletable": true, 735 | "editable": true 736 | }, 737 | "outputs": [ 738 | { 739 | "name": "stdout", 740 | "output_type": "stream", 741 | "text": [ 742 | "CPU times: user 18.3 s, sys: 6.67 ms, total: 18.3 s\n", 743 | "Wall time: 18.4 s\n" 744 | ] 745 | }, 746 | { 747 | "data": { 748 | "text/plain": [ 749 | "480" 750 | ] 751 | }, 752 | "execution_count": 338, 753 | "metadata": {}, 754 | "output_type": "execute_result" 755 | } 756 | ], 757 | "source": [ 758 | "%time cheapestr(0, len(fares[0]) - 1, fares)" 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": 339, 764 | "metadata": { 765 | "collapsed": false, 766 | "deletable": true, 767 | "editable": true 768 | }, 769 | "outputs": [ 770 | { 771 | "name": "stdout", 772 | "output_type": "stream", 773 | "text": [ 774 | "CPU times: user 14.7 s, sys: 3.33 ms, total: 14.7 s\n", 775 | "Wall time: 14.7 s\n" 776 | ] 777 | }, 778 | { 779 | "data": { 780 | "text/plain": [ 781 | "480" 782 | ] 783 | }, 784 | "execution_count": 339, 785 | "metadata": {}, 786 | "output_type": "execute_result" 787 | } 788 | ], 789 | "source": [ 790 | "%time cheapestm(0, len(fares[0]) - 1, fares)" 791 | ] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "execution_count": 340, 796 | "metadata": { 797 | "collapsed": false, 798 | "deletable": true, 799 | "editable": true 800 | }, 801 | "outputs": [ 802 | { 803 | "name": "stdout", 804 | "output_type": "stream", 805 | "text": [ 806 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 807 | "Wall time: 75.8 µs\n" 808 | ] 809 | }, 810 | { 811 | "data": { 812 | "text/plain": [ 813 | "480" 814 | ] 815 | }, 816 | "execution_count": 340, 817 | "metadata": {}, 818 | "output_type": "execute_result" 819 | } 820 | ], 821 | "source": [ 822 | "%time cheapestdp(0, len(fares[0]) - 1, fares)" 823 | ] 824 | }, 825 | { 826 | "cell_type": "markdown", 827 | "metadata": { 828 | "deletable": true, 829 | "editable": true 830 | }, 831 | "source": [ 832 | "BAAAAAM! See how much faster dynamic programming is?\n", 833 | "\n", 834 | "Well, there you have it!!! This is the power of dynamic programming.\n", 835 | "\n", 836 | "As mentioned in the tutorials, reinforcement learning leverages the power of dynamic programming in many algorithms. Value Iteration, Q-Learning, etc have a similar take on calculation. The bottom line is to think sequentially instead of recursively. And bottom-up instead of top-down. Let's continue this journey." 
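To make that last piece of advice concrete for the fares exercise above: a bottom-up version fills in the cheapest cost of reaching each city in order, reusing the costs already computed for earlier cities. The sketch below is one possible shape, with a hypothetical name and a simplified signature; it is not necessarily the official answer, which lives in the solutions notebook.

```python
def cheapestdp_sketch(c):
    # best[j] = cheapest known cost of getting from city 0 to city j
    best = list(c[0])                  # seed with the direct fares out of city 0
    for j in range(2, len(c)):
        for i in range(1, j):          # consider arriving at j via any earlier city i
            best[j] = min(best[j], best[i] + c[i][j])
    return best[-1]                    # cheapest cost of reaching the last city

# cheapestdp_sketch(fares) should agree with cheapestr(0, len(fares[0]) - 1, fares)
```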
837 | ] 838 | } 839 | ], 840 | "metadata": { 841 | "kernelspec": { 842 | "display_name": "Python 3", 843 | "language": "python", 844 | "name": "python3" 845 | }, 846 | "language_info": { 847 | "codemirror_mode": { 848 | "name": "ipython", 849 | "version": 3 850 | }, 851 | "file_extension": ".py", 852 | "mimetype": "text/x-python", 853 | "name": "python", 854 | "nbconvert_exporter": "python", 855 | "pygments_lexer": "ipython3", 856 | "version": "3.5.2" 857 | } 858 | }, 859 | "nbformat": 4, 860 | "nbformat_minor": 2 861 | } 862 | -------------------------------------------------------------------------------- /notebooks/solutions/02-dynamic-programming.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": true, 7 | "editable": true 8 | }, 9 | "source": [ 10 | "### Recursion, Memoization and Dynamic Programming" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "deletable": true, 17 | "editable": true 18 | }, 19 | "source": [ 20 | "Remember how we talk about using recursion and dynamic programming. One interesting thing to do is to implement the solution to a common problem called Fibonnaci numbers on these two styles and compare the compute time." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "deletable": true, 27 | "editable": true 28 | }, 29 | "source": [ 30 | "The Fibonacci series looks something like: `0, 1, 1, 2, 3, 5, 8, 13, 21 …` and so on. Any person can quickly notice the pattern. `f(n) = f(n-1) + f(n-2)` So, let's walk through a recursive implementation that solves this problem." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": { 37 | "collapsed": true, 38 | "deletable": true, 39 | "editable": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "def fib(n):\n", 44 | " if n < 2:\n", 45 | " return n\n", 46 | " return fib(n-2) + fib(n-1)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 2, 52 | "metadata": { 53 | "collapsed": false, 54 | "deletable": true, 55 | "editable": true 56 | }, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "CPU times: user 296 ms, sys: 0 ns, total: 296 ms\n", 63 | "Wall time: 295 ms\n" 64 | ] 65 | }, 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "832040" 70 | ] 71 | }, 72 | "execution_count": 2, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "%time fib(30)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": { 84 | "deletable": true, 85 | "editable": true 86 | }, 87 | "source": [ 88 | "Now, the main problem of this algorithm is that we are computing some of the subproblems more than once. For instance, to compute fib(4) we would compute fib(3) and fib(2). However, to compute fib(3) we also have to compute fib(2). Say hello to memoization." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": { 94 | "deletable": true, 95 | "editable": true 96 | }, 97 | "source": [ 98 | "A technique called memoization we are cache the results of previously computed sub problems to avoid unnecessary computations." 
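Building the cache by hand, as the next cell does, is the instructive route. As an aside, Python's standard library can do the same bookkeeping for you via `functools.lru_cache`; a minimal sketch (the function name here is mine):

```python
from functools import lru_cache

@lru_cache(maxsize=None)      # unbounded cache: remember every argument seen so far
def fib_cached(n):
    if n < 2:
        return n
    return fib_cached(n - 2) + fib_cached(n - 1)

fib_cached(30)                # 832040, with each subproblem computed only once
```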
99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 3, 104 | "metadata": { 105 | "collapsed": true, 106 | "deletable": true, 107 | "editable": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "m = {}\n", 112 | "def fibm(n):\n", 113 | " if n in m:\n", 114 | " return m[n]\n", 115 | " m[n] = n if n < 2 else fibm(n-2) + fibm(n-1)\n", 116 | " return m[n]" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 4, 122 | "metadata": { 123 | "collapsed": false, 124 | "deletable": true, 125 | "editable": true 126 | }, 127 | "outputs": [ 128 | { 129 | "name": "stdout", 130 | "output_type": "stream", 131 | "text": [ 132 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 133 | "Wall time: 17.6 µs\n" 134 | ] 135 | }, 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "832040" 140 | ] 141 | }, 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "%time fibm(30)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": { 154 | "deletable": true, 155 | "editable": true 156 | }, 157 | "source": [ 158 | "But the question is, can we do better than this? The use of the array is helpful, but when calculating very large numbers, or perhaps on memory contraint environments it might not be desirable. This is where Dynamic Programming fits the bill." 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": { 164 | "deletable": true, 165 | "editable": true 166 | }, 167 | "source": [ 168 | "In DP we take a bottom-up approach. Meaning, we solve the next Fibonacci number we can with the information we already have." 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 13, 174 | "metadata": { 175 | "collapsed": true, 176 | "deletable": true, 177 | "editable": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "def fibdp(n):\n", 182 | " if n == 0: return 0\n", 183 | " prev, curr = (0, 1)\n", 184 | " for i in range(2, n+1):\n", 185 | " newf = prev + curr\n", 186 | " prev = curr\n", 187 | " curr = newf\n", 188 | " return curr" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 25, 194 | "metadata": { 195 | "collapsed": false, 196 | "deletable": true, 197 | "editable": true 198 | }, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 205 | "Wall time: 6.44 µs\n" 206 | ] 207 | }, 208 | { 209 | "data": { 210 | "text/plain": [ 211 | "832040" 212 | ] 213 | }, 214 | "execution_count": 25, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "%time fibdp(30)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": { 226 | "deletable": true, 227 | "editable": true 228 | }, 229 | "source": [ 230 | "In this format, we don’t need to recurse or keep up with the memory intensive cache dictionary. These, add up to an even better performance." 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": { 236 | "deletable": true, 237 | "editable": true 238 | }, 239 | "source": [ 240 | "Let's now give it a try with factorials. Remember `4! = 4 * 3 * 2 * 1 = 24`. Can you give it try?" 
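One caveat about the memoized factorial in the cells that follow: as written, `factm` recurses into `factr` rather than into itself, so intermediate results never land in the cache and the timing ends up essentially the same as the plain recursive version. (Factorial has no repeated subproblems within a single call anyway, so memoization mainly pays off across repeated calls.) A self-referential sketch, with a hypothetical name:

```python
m = {}

def factm_fixed(n):
    if n in m:
        return m[n]
    # recurse into the memoized function itself so every intermediate value is cached;
    # note 0! = 1, which the n < 3 base case used elsewhere in the notebook would return as 0
    m[n] = 1 if n < 2 else n * factm_fixed(n - 1)
    return m[n]
```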
241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 1, 246 | "metadata": { 247 | "collapsed": true, 248 | "deletable": true, 249 | "editable": true 250 | }, 251 | "outputs": [], 252 | "source": [ 253 | "def factr(n):\n", 254 | " if n < 3:\n", 255 | " return n\n", 256 | " return n * factr(n - 1)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 5, 262 | "metadata": { 263 | "collapsed": false, 264 | "deletable": true, 265 | "editable": true 266 | }, 267 | "outputs": [ 268 | { 269 | "name": "stdout", 270 | "output_type": "stream", 271 | "text": [ 272 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 273 | "Wall time: 43.2 µs\n" 274 | ] 275 | }, 276 | { 277 | "data": { 278 | "text/plain": [ 279 | "265252859812191058636308480000000" 280 | ] 281 | }, 282 | "execution_count": 5, 283 | "metadata": {}, 284 | "output_type": "execute_result" 285 | } 286 | ], 287 | "source": [ 288 | "%time factr(30)" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 99, 294 | "metadata": { 295 | "collapsed": true, 296 | "deletable": true, 297 | "editable": true 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "m = {}\n", 302 | "def factm(n):\n", 303 | " if n in m:\n", 304 | " return m[n]\n", 305 | " m[n] = n if n < 3 else n * factr(n - 1)\n", 306 | " return m[n]" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 100, 312 | "metadata": { 313 | "collapsed": false, 314 | "deletable": true, 315 | "editable": true 316 | }, 317 | "outputs": [ 318 | { 319 | "name": "stdout", 320 | "output_type": "stream", 321 | "text": [ 322 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 323 | "Wall time: 47.2 µs\n" 324 | ] 325 | }, 326 | { 327 | "data": { 328 | "text/plain": [ 329 | "265252859812191058636308480000000" 330 | ] 331 | }, 332 | "execution_count": 100, 333 | "metadata": {}, 334 | "output_type": "execute_result" 335 | } 336 | ], 337 | "source": [ 338 | "%time factm(30)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 10, 344 | "metadata": { 345 | "collapsed": true, 346 | "deletable": true, 347 | "editable": true 348 | }, 349 | "outputs": [], 350 | "source": [ 351 | "def factdp(n):\n", 352 | " if n < 3: return n\n", 353 | " fact = 2\n", 354 | " for i in range(3, n + 1):\n", 355 | " fact *= i\n", 356 | " return fact" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 11, 362 | "metadata": { 363 | "collapsed": false, 364 | "deletable": true, 365 | "editable": true 366 | }, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | "text": [ 372 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 373 | "Wall time: 7.87 µs\n" 374 | ] 375 | }, 376 | { 377 | "data": { 378 | "text/plain": [ 379 | "265252859812191058636308480000000" 380 | ] 381 | }, 382 | "execution_count": 11, 383 | "metadata": {}, 384 | "output_type": "execute_result" 385 | } 386 | ], 387 | "source": [ 388 | "%time factdp(30)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": { 394 | "deletable": true, 395 | "editable": true 396 | }, 397 | "source": [ 398 | "Let's think of a slightly different problem. Imagine that you want to find the cheapest way to go from city A to city B, but when you are about to buy your ticket, you see that you could hop in different combinations of route and get a much cheaper price than if you go directly. How do you efficiently calculate the best possible combination of tickets and come up with the cheapest route? 
We will start with basic recursion and work on improving it until we reach dynamic programming.\n", 399 | "\n", 400 | "For this last problem in dynamic programming, create 2 functions that calculates the cheapest route from city A to B. I will give you the recursive solution, you will build one with memoization and the one with dynamic programming." 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 1, 406 | "metadata": { 407 | "collapsed": true, 408 | "deletable": true, 409 | "editable": true 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "import numpy as np" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "Utility function to get fares between cities" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 2, 426 | "metadata": { 427 | "collapsed": false, 428 | "deletable": true, 429 | "editable": true, 430 | "scrolled": false 431 | }, 432 | "outputs": [], 433 | "source": [ 434 | "def get_fares(n_cities, max_fare):\n", 435 | " np.random.seed(123456)\n", 436 | " fares = np.sort(np.random.random((n_cities, n_cities)) * max_fare).astype(int)\n", 437 | " for i in range(len(fares)):\n", 438 | " fares[i] = np.roll(fares[i], i + 1)\n", 439 | " np.fill_diagonal(fares, 0)\n", 440 | " for i in range(1, len(fares)):\n", 441 | " for j in range(0, i):\n", 442 | " fares[i][j] = -1\n", 443 | " return fares" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "Let's try it out with 4 cities and random fares with a max of 1000." 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 5, 456 | "metadata": { 457 | "collapsed": false, 458 | "deletable": true, 459 | "editable": true 460 | }, 461 | "outputs": [ 462 | { 463 | "data": { 464 | "text/plain": [ 465 | "array([[ 0, 126, 260, 897],\n", 466 | " [ -1, 0, 50, 376],\n", 467 | " [ -1, -1, 0, 123],\n", 468 | " [ -1, -1, -1, 0]])" 469 | ] 470 | }, 471 | "execution_count": 5, 472 | "metadata": {}, 473 | "output_type": "execute_result" 474 | } 475 | ], 476 | "source": [ 477 | "n_cities = 4\n", 478 | "max_fare = 1000\n", 479 | "fares = get_fares(n_cities, max_fare)\n", 480 | "fares[1][2] = 50\n", 481 | "fares" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "Here is the recursive solution:" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 7, 494 | "metadata": { 495 | "collapsed": true, 496 | "deletable": true, 497 | "editable": true 498 | }, 499 | "outputs": [], 500 | "source": [ 501 | "def cheapestr(s, d, c):\n", 502 | " if s == d or s == d - 1:\n", 503 | " return c[s][d]\n", 504 | " \n", 505 | " cheapest = c[s][d]\n", 506 | " for i in range(s + 1, d):\n", 507 | " tmp = cheapestr(s, i, c) + cheapestr(i, d, c)\n", 508 | " cheapest = tmp if tmp < cheapest else cheapest\n", 509 | " return cheapest" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 8, 515 | "metadata": { 516 | "collapsed": false, 517 | "deletable": true, 518 | "editable": true 519 | }, 520 | "outputs": [ 521 | { 522 | "name": "stdout", 523 | "output_type": "stream", 524 | "text": [ 525 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 526 | "Wall time: 68.7 µs\n" 527 | ] 528 | }, 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "299" 533 | ] 534 | }, 535 | "execution_count": 8, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "%time cheapestr(0, len(fares[0]) 
- 1, fares)" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "Now, you build the memoization one:" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 9, 554 | "metadata": { 555 | "collapsed": false, 556 | "deletable": true, 557 | "editable": true 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "m = {}\n", 562 | "def cheapestm(s, d, c):\n", 563 | " if s == d or s == d - 1:\n", 564 | " return c[s][d]\n", 565 | "\n", 566 | " if s in m and d in m[s]:\n", 567 | " return m[s][d]\n", 568 | " \n", 569 | " cheapest = c[s][d]\n", 570 | " for i in range(s + 1, d):\n", 571 | " tmp = cheapestm(s, i, c) + cheapestm(i, d, c)\n", 572 | " cheapest = tmp if tmp < cheapest else cheapest\n", 573 | " m[s] = {}\n", 574 | " m[s][d] = cheapest\n", 575 | " return m[s][d]" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 10, 581 | "metadata": { 582 | "collapsed": false, 583 | "deletable": true, 584 | "editable": true, 585 | "scrolled": true 586 | }, 587 | "outputs": [ 588 | { 589 | "name": "stdout", 590 | "output_type": "stream", 591 | "text": [ 592 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 593 | "Wall time: 22.6 µs\n" 594 | ] 595 | }, 596 | { 597 | "data": { 598 | "text/plain": [ 599 | "299" 600 | ] 601 | }, 602 | "execution_count": 10, 603 | "metadata": {}, 604 | "output_type": "execute_result" 605 | } 606 | ], 607 | "source": [ 608 | "%time cheapestm(0, len(fares[0]) - 1, fares)" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "metadata": {}, 614 | "source": [ 615 | "Faster, you see?\n", 616 | "\n", 617 | "Now, do the dynamic programming version." 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 11, 623 | "metadata": { 624 | "collapsed": false, 625 | "deletable": true, 626 | "editable": true 627 | }, 628 | "outputs": [], 629 | "source": [ 630 | "def cheapestdp(s, d, c):\n", 631 | " cheapest = c[0]\n", 632 | " for i in range(2, len(c)):\n", 633 | " for j in range(1, i):\n", 634 | " new_route = cheapest[j] + c[j][i]\n", 635 | " cheapest[i] = new_route if cheapest[i] > new_route else cheapest[i] \n", 636 | " return cheapest[-1]" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": 12, 642 | "metadata": { 643 | "collapsed": false, 644 | "deletable": true, 645 | "editable": true 646 | }, 647 | "outputs": [ 648 | { 649 | "name": "stdout", 650 | "output_type": "stream", 651 | "text": [ 652 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 653 | "Wall time: 61.5 µs\n" 654 | ] 655 | }, 656 | { 657 | "data": { 658 | "text/plain": [ 659 | "299" 660 | ] 661 | }, 662 | "execution_count": 12, 663 | "metadata": {}, 664 | "output_type": "execute_result" 665 | } 666 | ], 667 | "source": [ 668 | "%time cheapestdp(0, len(fares[0]) - 1, fares)" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "metadata": { 674 | "deletable": true, 675 | "editable": true 676 | }, 677 | "source": [ 678 | "Let's now try with a larger example:" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": 337, 684 | "metadata": { 685 | "collapsed": false, 686 | "deletable": true, 687 | "editable": true, 688 | "scrolled": true 689 | }, 690 | "outputs": [ 691 | { 692 | "data": { 693 | "text/plain": [ 694 | "array([[ 0, 123, 126, 129, 228, 260, 336, 352, 373, 376, 447, 451, 543,\n", 695 | " 776, 820, 840, 859, 897],\n", 696 | " [ -1, 0, 37, 61, 137, 146, 235, 245, 340, 343, 405, 574, 589,\n", 697 | " 590, 594, 753, 852, 861],\n", 698 | " [ 
-1, -1, 0, 16, 99, 117, 170, 199, 274, 342, 394, 401, 414,\n", 699 | " 462, 481, 595, 610, 641],\n", 700 | " [ -1, -1, -1, 0, 94, 95, 134, 138, 155, 433, 471, 497, 560,\n", 701 | " 630, 639, 683, 732, 758],\n", 702 | " [ -1, -1, -1, -1, 0, 85, 140, 149, 329, 370, 386, 395, 477,\n", 703 | " 544, 562, 566, 619, 634],\n", 704 | " [ -1, -1, -1, -1, -1, 0, 29, 30, 44, 113, 187, 207, 247,\n", 705 | " 249, 249, 356, 409, 630],\n", 706 | " [ -1, -1, -1, -1, -1, -1, 0, 22, 60, 168, 216, 277, 279,\n", 707 | " 372, 419, 449, 606, 690],\n", 708 | " [ -1, -1, -1, -1, -1, -1, -1, 0, 30, 36, 273, 321, 355,\n", 709 | " 415, 419, 421, 497, 500],\n", 710 | " [ -1, -1, -1, -1, -1, -1, -1, -1, 0, 19, 123, 400, 412,\n", 711 | " 418, 535, 547, 559, 591],\n", 712 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 110, 120, 140,\n", 713 | " 204, 220, 395, 454, 488],\n", 714 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 87, 169,\n", 715 | " 219, 257, 312, 487, 527],\n", 716 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 7,\n", 717 | " 11, 22, 244, 250, 281],\n", 718 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0,\n", 719 | " 10, 116, 123, 259, 287],\n", 720 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 721 | " 0, 22, 93, 109, 236],\n", 722 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 723 | " -1, 0, 29, 157, 185],\n", 724 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 725 | " -1, -1, 0, 18, 78],\n", 726 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 727 | " -1, -1, -1, 0, 4],\n", 728 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 729 | " -1, -1, -1, -1, 0]])" 730 | ] 731 | }, 732 | "execution_count": 337, 733 | "metadata": {}, 734 | "output_type": "execute_result" 735 | } 736 | ], 737 | "source": [ 738 | "n_cities = 18 # this will take a little before 20 seconds. 
Try not to make it any larger :)\n", 739 | "max_fare = 1000\n", 740 | "fares = get_fares(n_cities, max_fare)\n", 741 | "fares" 742 | ] 743 | }, 744 | { 745 | "cell_type": "code", 746 | "execution_count": 338, 747 | "metadata": { 748 | "collapsed": false, 749 | "deletable": true, 750 | "editable": true 751 | }, 752 | "outputs": [ 753 | { 754 | "name": "stdout", 755 | "output_type": "stream", 756 | "text": [ 757 | "CPU times: user 18.3 s, sys: 6.67 ms, total: 18.3 s\n", 758 | "Wall time: 18.4 s\n" 759 | ] 760 | }, 761 | { 762 | "data": { 763 | "text/plain": [ 764 | "480" 765 | ] 766 | }, 767 | "execution_count": 338, 768 | "metadata": {}, 769 | "output_type": "execute_result" 770 | } 771 | ], 772 | "source": [ 773 | "%time cheapestr(0, len(fares[0]) - 1, fares)" 774 | ] 775 | }, 776 | { 777 | "cell_type": "code", 778 | "execution_count": 339, 779 | "metadata": { 780 | "collapsed": false, 781 | "deletable": true, 782 | "editable": true 783 | }, 784 | "outputs": [ 785 | { 786 | "name": "stdout", 787 | "output_type": "stream", 788 | "text": [ 789 | "CPU times: user 14.7 s, sys: 3.33 ms, total: 14.7 s\n", 790 | "Wall time: 14.7 s\n" 791 | ] 792 | }, 793 | { 794 | "data": { 795 | "text/plain": [ 796 | "480" 797 | ] 798 | }, 799 | "execution_count": 339, 800 | "metadata": {}, 801 | "output_type": "execute_result" 802 | } 803 | ], 804 | "source": [ 805 | "%time cheapestm(0, len(fares[0]) - 1, fares)" 806 | ] 807 | }, 808 | { 809 | "cell_type": "code", 810 | "execution_count": 340, 811 | "metadata": { 812 | "collapsed": false, 813 | "deletable": true, 814 | "editable": true 815 | }, 816 | "outputs": [ 817 | { 818 | "name": "stdout", 819 | "output_type": "stream", 820 | "text": [ 821 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 822 | "Wall time: 75.8 µs\n" 823 | ] 824 | }, 825 | { 826 | "data": { 827 | "text/plain": [ 828 | "480" 829 | ] 830 | }, 831 | "execution_count": 340, 832 | "metadata": {}, 833 | "output_type": "execute_result" 834 | } 835 | ], 836 | "source": [ 837 | "%time cheapestdp(0, len(fares[0]) - 1, fares)" 838 | ] 839 | }, 840 | { 841 | "cell_type": "markdown", 842 | "metadata": { 843 | "deletable": true, 844 | "editable": true 845 | }, 846 | "source": [ 847 | "BAAAAAM! See how much faster dynamic programming is?\n", 848 | "\n", 849 | "Well, there you have it!!! This is the power of dynamic programming.\n", 850 | "\n", 851 | "As mentioned in the tutorials, reinforcement learning leverages the power of dynamic programming in many algorithms. Value Iteration, Q-Learning, etc have a similar take on calculation. The bottom line is to think sequentially instead of recursively. And bottom-up instead of top-down. Let's continue this journey." 
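One observation on the solution's `cheapestm` above, suggested by those timings: the memoized run (≈14.7 s) is barely faster than plain recursion (≈18.3 s). The likely reason, from my reading of the code, is that `m[s] = {}` recreates the per-source dictionary on every store, wiping all previously cached destinations for that source. A sketch that only creates the inner dictionary when it is missing (hypothetical name, same interface):

```python
m = {}

def cheapestm_fixed(s, d, c):
    if s == d or s == d - 1:
        return c[s][d]
    if s in m and d in m[s]:
        return m[s][d]
    cheapest = c[s][d]
    for i in range(s + 1, d):
        tmp = cheapestm_fixed(s, i, c) + cheapestm_fixed(i, d, c)
        cheapest = min(cheapest, tmp)
    m.setdefault(s, {})[d] = cheapest   # extend the cache instead of resetting it
    return cheapest
```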
852 | ] 853 | } 854 | ], 855 | "metadata": { 856 | "kernelspec": { 857 | "display_name": "Python 3", 858 | "language": "python", 859 | "name": "python3" 860 | }, 861 | "language_info": { 862 | "codemirror_mode": { 863 | "name": "ipython", 864 | "version": 3 865 | }, 866 | "file_extension": ".py", 867 | "mimetype": "text/x-python", 868 | "name": "python", 869 | "nbconvert_exporter": "python", 870 | "pygments_lexer": "ipython3", 871 | "version": "3.5.2" 872 | } 873 | }, 874 | "nbformat": 4, 875 | "nbformat_minor": 2 876 | } 877 | -------------------------------------------------------------------------------- /notebooks/03-planning-algorithms.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Planning Algorithms" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Do you remember on lesson 2 and 3 we discussed algorithms that basically solve MDPs? That is, find a policy given a exact representation of the environment. In this section, we will explore 2 such algorithms. Value Iteration and policy iteration." 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import numpy as np\n", 24 | "import pandas as pd\n", 25 | "import tempfile\n", 26 | "import pprint\n", 27 | "import json\n", 28 | "import sys\n", 29 | "import gym\n", 30 | "\n", 31 | "from gym import wrappers\n", 32 | "from subprocess import check_output\n", 33 | "from IPython.display import HTML" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "#### Value Iteration" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "The Value Iteration algorithm uses dynamic programming by dividing the problem into common sub-problems and leveraging that optimal structure to speed-up computations.\n", 48 | "\n", 49 | "Let me show you how value iterations look like:" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 2, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "def value_iteration(S, A, P, gamma=.99, theta = 0.0000001):\n", 59 | " \n", 60 | " V = np.random.random(len(S))\n", 61 | " for i in range(100000):\n", 62 | " old_V = V.copy()\n", 63 | " \n", 64 | " Q = np.zeros((len(S), len(A)), dtype=float)\n", 65 | " for s in S:\n", 66 | " for a in A:\n", 67 | " for prob, s_prime, reward, done in P[s][a]:\n", 68 | " Q[s][a] += prob * (reward + gamma * old_V[s_prime] * (not done))\n", 69 | " V[s] = Q[s].max()\n", 70 | " if np.all(np.abs(old_V - V) < theta):\n", 71 | " break\n", 72 | " \n", 73 | " pi = np.argmax(Q, axis=1)\n", 74 | " return pi, V" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "As we can see, value iteration expects a set of states, e.g. (0,1,2,3,4) a set of actions, e.g. (0,1) and a set of transition probabilities that represent the dynamics of the environment. 
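For reference, the update inside that loop is the Bellman optimality backup. In the notation of the code, with `prob` standing for p(s′ | s, a) and the `(not done)` factor zeroing out the value of terminal successors:

$$ V_{k+1}(s) \;=\; \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V_k(s')\,(1 - \mathrm{done}) \,\bigr] $$

The sweep stops once the largest change across states falls below `theta`, and the greedy policy is then read off with the final `argmax` over the Q-table.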
Let's take a look at these variables:" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stderr", 91 | "output_type": "stream", 92 | "text": [ 93 | "[2017-04-26 00:30:53,851] Making new env: FrozenLake-v0\n" 94 | ] 95 | } 96 | ], 97 | "source": [ 98 | "mdir = tempfile.mkdtemp()\n", 99 | "env = gym.make('FrozenLake-v0')\n", 100 | "env = wrappers.Monitor(env, mdir, force=True)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 4, 106 | "metadata": { 107 | "collapsed": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "S = range(env.env.observation_space.n)\n", 112 | "A = range(env.env.action_space.n)\n", 113 | "P = env.env.env.P" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 5, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/plain": [ 124 | "range(0, 16)" 125 | ] 126 | }, 127 | "execution_count": 5, 128 | "metadata": {}, 129 | "output_type": "execute_result" 130 | } 131 | ], 132 | "source": [ 133 | "S" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 6, 139 | "metadata": {}, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "range(0, 4)" 145 | ] 146 | }, 147 | "execution_count": 6, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "A" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 9, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "{0: [(0.3333333333333333, 6, 0.0, False),\n", 165 | " (0.3333333333333333, 9, 0.0, False),\n", 166 | " (0.3333333333333333, 14, 0.0, False)],\n", 167 | " 1: [(0.3333333333333333, 9, 0.0, False),\n", 168 | " (0.3333333333333333, 14, 0.0, False),\n", 169 | " (0.3333333333333333, 11, 0.0, True)],\n", 170 | " 2: [(0.3333333333333333, 14, 0.0, False),\n", 171 | " (0.3333333333333333, 11, 0.0, True),\n", 172 | " (0.3333333333333333, 6, 0.0, False)],\n", 173 | " 3: [(0.3333333333333333, 11, 0.0, True),\n", 174 | " (0.3333333333333333, 6, 0.0, False),\n", 175 | " (0.3333333333333333, 9, 0.0, False)]}" 176 | ] 177 | }, 178 | "execution_count": 9, 179 | "metadata": {}, 180 | "output_type": "execute_result" 181 | } 182 | ], 183 | "source": [ 184 | "P[10]" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "You see the world we are looking into \"FrozenLake-v0\" has 16 different states, 4 different actions. The `P[10]` is basically showing us a peek into the dynamics of the world. For example, in this case, if you are in state \"10\" (from `P[10]`) and you take action 0 (see dictionary key 0), you have equal probability of 0.3333 to land in either state 6, 9 or 14. None of those transitions give you any reward and none of them is terminal.\n", 192 | "\n", 193 | "In contrast, we can see taking action 2, might transition you to state 11, which **is** terminal. \n", 194 | "\n", 195 | "Get the hang of it? Let's run it!" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 10, 201 | "metadata": { 202 | "collapsed": true 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "pi, V = value_iteration(S, A, P)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "Now, value iteration calculates two important things. 
First, it calculates `V`, which tells us how much should we expect from each state if we always act optimally. Second, it gives us `pi`, which is the optimal policy given `V`. Let's take a deeper look:" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 12, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/plain": [ 224 | "array([ 9.82775479e-006, 4.77561742e-007, 8.29890013e-006,\n", 225 | " 7.77646736e-006, 5.68794576e-006, 0.00000000e+000,\n", 226 | " 3.38430298e-208, 0.00000000e+000, 8.92176447e-007,\n", 227 | " 5.28039771e-006, 3.09721331e-006, 0.00000000e+000,\n", 228 | " 0.00000000e+000, 9.53731304e-006, 9.80392157e-001,\n", 229 | " 0.00000000e+000])" 230 | ] 231 | }, 232 | "execution_count": 12, 233 | "metadata": {}, 234 | "output_type": "execute_result" 235 | } 236 | ], 237 | "source": [ 238 | "V" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 13, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "data": { 248 | "text/plain": [ 249 | "array([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 250 | ] 251 | }, 252 | "execution_count": 13, 253 | "metadata": {}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "pi" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "See? This policy basically says in state `0`, take action `0`. In state `1` take action `3`. In state `2` take action `0` and so on. Got it?\n", 266 | "\n", 267 | "Now, we have the \"directions\" or this \"map\". With this, we can just use this policy and solve the environment as we interact with it.\n", 268 | "\n", 269 | "Let's try it out!" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 14, 275 | "metadata": { 276 | "scrolled": true 277 | }, 278 | "outputs": [ 279 | { 280 | "name": "stderr", 281 | "output_type": "stream", 282 | "text": [ 283 | "[2017-04-26 00:40:00,747] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video010000.json\n", 284 | "[2017-04-26 00:40:01,236] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video011000.json\n", 285 | "[2017-04-26 00:40:01,752] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video012000.json\n", 286 | "[2017-04-26 00:40:02,236] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video013000.json\n", 287 | "[2017-04-26 00:40:02,732] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video014000.json\n", 288 | "[2017-04-26 00:40:03,235] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video015000.json\n", 289 | "[2017-04-26 00:40:03,745] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video016000.json\n", 290 | "[2017-04-26 00:40:04,234] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video017000.json\n", 291 | "[2017-04-26 00:40:04,748] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video018000.json\n", 292 | "[2017-04-26 00:40:05,262] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video019000.json\n" 293 | ] 294 | } 295 | ], 296 | "source": [ 297 | "for _ in range(10000):\n", 298 | " state = env.reset()\n", 299 | " while True:\n", 300 | " state, reward, done, info = env.step(pi[state])\n", 301 | " if done:\n", 302 | " break" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | 
"source": [ 309 | "That was the agent interacting with the environment. Let's take a look at some of the episodes:" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 16, 315 | "metadata": {}, 316 | "outputs": [ 317 | { 318 | "name": "stdout", 319 | "output_type": "stream", 320 | "text": [ 321 | "https://asciinema.org/a/6rphgm3w1rbkjvoo2rq6tpjqu\n" 322 | ] 323 | } 324 | ], 325 | "source": [ 326 | "last_video = env.videos[-1][0]\n", 327 | "out = check_output([\"asciinema\", \"upload\", last_video])\n", 328 | "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n", 329 | "print(out)" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "You can look on that link, or better, let's show it on the notebook:" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 18, 342 | "metadata": {}, 343 | "outputs": [ 344 | { 345 | "data": { 346 | "text/html": [ 347 | "\n", 348 | "\n" 353 | ], 354 | "text/plain": [ 355 | "" 356 | ] 357 | }, 358 | "execution_count": 18, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "castid = out.split('/')[-1]\n", 365 | "html_tag = \"\"\"\n", 366 | "\n", 371 | "\"\"\"\n", 372 | "html_tag = html_tag.format(castid)\n", 373 | "HTML(data=html_tag)" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "Interesting right? Did you get the world yet?\n", 381 | "\n", 382 | "So, 'S' is the starting state, 'G' the goal. 'F' are Frozen grids, and 'H' are holes. Your goal is to go from S to G without falling into any H. The problem is, F is slippery so, often times you are better of by trying moves that seems counter-intuitive. But because you are preventing falling on 'H's it makes sense in the end. For example, the second row, first column 'F', you can see how our agent was trying so hard to go left!! Smashing his head against the wall?? Silly. But why?" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 19, 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "data": { 392 | "text/plain": [ 393 | "{0: [(0.3333333333333333, 0, 0.0, False),\n", 394 | " (0.3333333333333333, 4, 0.0, False),\n", 395 | " (0.3333333333333333, 8, 0.0, False)],\n", 396 | " 1: [(0.3333333333333333, 4, 0.0, False),\n", 397 | " (0.3333333333333333, 8, 0.0, False),\n", 398 | " (0.3333333333333333, 5, 0.0, True)],\n", 399 | " 2: [(0.3333333333333333, 8, 0.0, False),\n", 400 | " (0.3333333333333333, 5, 0.0, True),\n", 401 | " (0.3333333333333333, 0, 0.0, False)],\n", 402 | " 3: [(0.3333333333333333, 5, 0.0, True),\n", 403 | " (0.3333333333333333, 0, 0.0, False),\n", 404 | " (0.3333333333333333, 4, 0.0, False)]}" 405 | ] 406 | }, 407 | "execution_count": 19, 408 | "metadata": {}, 409 | "output_type": "execute_result" 410 | } 411 | ], 412 | "source": [ 413 | "P[4]" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "See how action 0 (left) doesn't have any transition leading to a terminal state??\n", 421 | "\n", 422 | "All other actions give you a 0.333333 chance each of pushing you into the hole in state '5'!!! So it actually makes sense to go left until it slips you downward to state 8.\n", 423 | "\n", 424 | "Cool right?" 
425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 20, 430 | "metadata": {}, 431 | "outputs": [ 432 | { 433 | "data": { 434 | "text/plain": [ 435 | "array([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 436 | ] 437 | }, 438 | "execution_count": 20, 439 | "metadata": {}, 440 | "output_type": "execute_result" 441 | } 442 | ], 443 | "source": [ 444 | "pi" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "See how the \"prescribed\" action is 0 (left) on the policy calculated by value iteration?\n", 452 | "\n", 453 | "How about the values?" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 21, 459 | "metadata": {}, 460 | "outputs": [ 461 | { 462 | "data": { 463 | "text/plain": [ 464 | "array([ 9.82775479e-006, 4.77561742e-007, 8.29890013e-006,\n", 465 | " 7.77646736e-006, 5.68794576e-006, 0.00000000e+000,\n", 466 | " 3.38430298e-208, 0.00000000e+000, 8.92176447e-007,\n", 467 | " 5.28039771e-006, 3.09721331e-006, 0.00000000e+000,\n", 468 | " 0.00000000e+000, 9.53731304e-006, 9.80392157e-001,\n", 469 | " 0.00000000e+000])" 470 | ] 471 | }, 472 | "execution_count": 21, 473 | "metadata": {}, 474 | "output_type": "execute_result" 475 | } 476 | ], 477 | "source": [ 478 | "V" 479 | ] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "metadata": {}, 484 | "source": [ 485 | "These show the expected rewards on each state." 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 22, 491 | "metadata": {}, 492 | "outputs": [ 493 | { 494 | "data": { 495 | "text/plain": [ 496 | "{0: [(1.0, 15, 0, True)],\n", 497 | " 1: [(1.0, 15, 0, True)],\n", 498 | " 2: [(1.0, 15, 0, True)],\n", 499 | " 3: [(1.0, 15, 0, True)]}" 500 | ] 501 | }, 502 | "execution_count": 22, 503 | "metadata": {}, 504 | "output_type": "execute_result" 505 | } 506 | ], 507 | "source": [ 508 | "P[15]" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "See how the state '15' gives you a reward of +1?? These signal gets propagated all the way to the start state using Value Iteration and it shows the values all accross.\n", 516 | "\n", 517 | "Cool? Good." 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 39, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [ 526 | "env.close()" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "If you want to submit to OpenAI Gym, get your API Key and paste it here:" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 9, 539 | "metadata": {}, 540 | "outputs": [ 541 | { 542 | "name": "stderr", 543 | "output_type": "stream", 544 | "text": [ 545 | "[2017-04-01 16:55:43,229] [FrozenLake-v0] Uploading 10000 episodes of training data\n", 546 | "[2017-04-01 16:55:44,905] [FrozenLake-v0] Uploading videos of 19 training episodes (2158 bytes)\n", 547 | "[2017-04-01 16:55:45,131] [FrozenLake-v0] Creating evaluation object from /tmp/tmpfukeltbz with learning curve and training video\n", 548 | "[2017-04-01 16:55:45,620] \n", 549 | "****************************************************\n", 550 | "You successfully uploaded your evaluation on FrozenLake-v0 to\n", 551 | "OpenAI Gym! 
You can find it at:\n", 552 | "\n", 553 | " https://gym.openai.com/evaluations/eval_ycTPCbyiTWK6T0C4DyrvRg\n", 554 | "\n", 555 | "****************************************************\n" 556 | ] 557 | } 558 | ], 559 | "source": [ 560 | "gym.upload(mdir, api_key='')" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": {}, 566 | "source": [ 567 | "#### Policy Iteration" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "There is another method called policy iteration. This method is composed of 2 other methods, policy evaluation and policy improvement. The logic goes that policy iteration is 'evaluating' a policy to check for convergence (meaning the policy doesn't change), and 'improving' the policy, which is applying something similar to a 1 step value iteration to get a slightly better policy, but definitely not worse.\n", 575 | "\n", 576 | "These two functions cycling together are what policy iteration is about.\n", 577 | "\n", 578 | "Can you implement this algorithm yourself? Try it. Make sure to look the solution notebook in case you get stuck.\n", 579 | "\n", 580 | "I will give you the policy evaluation and policy improvement methods, you build the policy iteration cycling between the evaluation and improvement methods until there are no changes to the policy." 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 24, 586 | "metadata": { 587 | "collapsed": true 588 | }, 589 | "outputs": [], 590 | "source": [ 591 | "def policy_evaluation(pi, S, A, P, gamma=.99, theta=0.0000001):\n", 592 | " \n", 593 | " V = np.zeros(len(S))\n", 594 | " while True:\n", 595 | " delta = 0\n", 596 | " for s in S:\n", 597 | " v = V[s]\n", 598 | " \n", 599 | " V[s] = 0\n", 600 | " for prob, dst, reward, done in P[s][pi[s]]:\n", 601 | " V[s] += prob * (reward + gamma * V[dst] * (not done))\n", 602 | " \n", 603 | " delta = max(delta, np.abs(v - V[s]))\n", 604 | " if delta < theta:\n", 605 | " break\n", 606 | " return V" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": 1, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "def policy_improvement(pi, V, S, A, P, gamma=.99):\n", 616 | " for s in S:\n", 617 | " old_a = pi[s]\n", 618 | " \n", 619 | " Qs = np.zeros(len(A), dtype=float)\n", 620 | " for a in A:\n", 621 | " for prob, s_prime, reward, done in P[s][a]:\n", 622 | " Qs[a] += prob * (reward + gamma * V[s_prime] * (not done))\n", 623 | " pi[s] = np.argmax(Qs)\n", 624 | " V[s] = np.max(Qs)\n", 625 | " return pi, V" 626 | ] 627 | }, 628 | { 629 | "cell_type": "code", 630 | "execution_count": 27, 631 | "metadata": {}, 632 | "outputs": [], 633 | "source": [ 634 | "def policy_iteration(S, A, P, gamma=.99):\n", 635 | " pi = np.random.choice(A, len(S))\n", 636 | " \"\"\" YOU COMPLETE THIS METHOD \"\"\"\n", 637 | " return pi" 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "metadata": {}, 643 | "source": [ 644 | "After you implement the algorithms, you can run it and calculate the optimal policy:" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 29, 650 | "metadata": {}, 651 | "outputs": [ 652 | { 653 | "name": "stderr", 654 | "output_type": "stream", 655 | "text": [ 656 | "[2017-04-26 00:54:49,917] Making new env: FrozenLake-v0\n", 657 | "[2017-04-26 00:54:49,919] Finished writing results. 
You can upload them to the scoreboard via gym.upload('/tmp/tmppra935u6')\n" 658 | ] 659 | }, 660 | { 661 | "name": "stdout", 662 | "output_type": "stream", 663 | "text": [ 664 | "[0 3 0 3 0 0 0 0 3 1 0 0 0 2 1 0]\n" 665 | ] 666 | } 667 | ], 668 | "source": [ 669 | "mdir = tempfile.mkdtemp()\n", 670 | "env = gym.make('FrozenLake-v0')\n", 671 | "env = wrappers.Monitor(env, mdir, force=True)\n", 672 | "\n", 673 | "S = range(env.env.observation_space.n)\n", 674 | "A = range(env.env.action_space.n)\n", 675 | "P = env.env.env.P\n", 676 | "\n", 677 | "pi = policy_iteration(S, A, P)\n", 678 | "print(pi)" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "And, of course, interact with the environment looking at the \"directions\" or \"policy\":" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": 30, 691 | "metadata": { 692 | "scrolled": true 693 | }, 694 | "outputs": [ 695 | { 696 | "name": "stderr", 697 | "output_type": "stream", 698 | "text": [ 699 | "[2017-04-26 00:55:44,764] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000000.json\n", 700 | "[2017-04-26 00:55:44,767] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000001.json\n", 701 | "[2017-04-26 00:55:44,772] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000008.json\n", 702 | "[2017-04-26 00:55:44,788] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000027.json\n", 703 | "[2017-04-26 00:55:44,810] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000064.json\n", 704 | "[2017-04-26 00:55:44,838] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000125.json\n", 705 | "[2017-04-26 00:55:44,891] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000216.json\n", 706 | "[2017-04-26 00:55:44,958] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000343.json\n", 707 | "[2017-04-26 00:55:45,043] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000512.json\n", 708 | "[2017-04-26 00:55:45,155] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000729.json\n", 709 | "[2017-04-26 00:55:45,295] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video001000.json\n", 710 | "[2017-04-26 00:55:45,889] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video002000.json\n", 711 | "[2017-04-26 00:55:46,418] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video003000.json\n", 712 | "[2017-04-26 00:55:46,934] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video004000.json\n", 713 | "[2017-04-26 00:55:47,441] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video005000.json\n", 714 | "[2017-04-26 00:55:47,963] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video006000.json\n", 715 | "[2017-04-26 00:55:48,473] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video007000.json\n", 716 | "[2017-04-26 00:55:48,989] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video008000.json\n", 717 | "[2017-04-26 00:55:49,492] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video009000.json\n" 718 | ] 719 | } 720 | ], 721 | 
"source": [ 722 | "for _ in range(10000):\n", 723 | " state = env.reset()\n", 724 | " while True:\n", 725 | " state, reward, done, info = env.step(pi[state])\n", 726 | " if done:\n", 727 | " break" 728 | ] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "execution_count": 32, 733 | "metadata": {}, 734 | "outputs": [ 735 | { 736 | "name": "stdout", 737 | "output_type": "stream", 738 | "text": [ 739 | "https://asciinema.org/a/c6phe9z2ntyy3y3lfflzwqwiy\n" 740 | ] 741 | } 742 | ], 743 | "source": [ 744 | "last_video = env.videos[-1][0]\n", 745 | "out = check_output([\"asciinema\", \"upload\", last_video])\n", 746 | "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n", 747 | "print(out)" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": 34, 753 | "metadata": {}, 754 | "outputs": [ 755 | { 756 | "data": { 757 | "text/html": [ 758 | "\n", 759 | "\n" 764 | ], 765 | "text/plain": [ 766 | "" 767 | ] 768 | }, 769 | "execution_count": 34, 770 | "metadata": {}, 771 | "output_type": "execute_result" 772 | } 773 | ], 774 | "source": [ 775 | "castid = out.split('/')[-1]\n", 776 | "html_tag = \"\"\"\n", 777 | "\n", 782 | "\"\"\"\n", 783 | "html_tag = html_tag.format(castid)\n", 784 | "HTML(data=html_tag)" 785 | ] 786 | }, 787 | { 788 | "cell_type": "markdown", 789 | "metadata": {}, 790 | "source": [ 791 | "Similar as before. Policies could be slightly different if there is a state in which more than one action give the same value in the end." 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": 35, 797 | "metadata": {}, 798 | "outputs": [ 799 | { 800 | "data": { 801 | "text/plain": [ 802 | "array([ 9.82775479e-006, 4.77561742e-007, 8.29890013e-006,\n", 803 | " 7.77646736e-006, 5.68794576e-006, 0.00000000e+000,\n", 804 | " 3.38430298e-208, 0.00000000e+000, 8.92176447e-007,\n", 805 | " 5.28039771e-006, 3.09721331e-006, 0.00000000e+000,\n", 806 | " 0.00000000e+000, 9.53731304e-006, 9.80392157e-001,\n", 807 | " 0.00000000e+000])" 808 | ] 809 | }, 810 | "execution_count": 35, 811 | "metadata": {}, 812 | "output_type": "execute_result" 813 | } 814 | ], 815 | "source": [ 816 | "V" 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": 37, 822 | "metadata": {}, 823 | "outputs": [ 824 | { 825 | "data": { 826 | "text/plain": [ 827 | "array([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 828 | ] 829 | }, 830 | "execution_count": 37, 831 | "metadata": {}, 832 | "output_type": "execute_result" 833 | } 834 | ], 835 | "source": [ 836 | "pi" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "metadata": {}, 842 | "source": [ 843 | "That's it let's wrap up." 844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "execution_count": 38, 849 | "metadata": {}, 850 | "outputs": [ 851 | { 852 | "name": "stderr", 853 | "output_type": "stream", 854 | "text": [ 855 | "[2017-04-26 00:57:28,406] Finished writing results. 
You can upload them to the scoreboard via gym.upload('/tmp/tmp0oe_0gtp')\n" 856 | ] 857 | } 858 | ], 859 | "source": [ 860 | "env.close()" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": {}, 866 | "source": [ 867 | "If you want to submit to OpenAI Gym, get your API Key and paste it here:" 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": 134, 873 | "metadata": {}, 874 | "outputs": [ 875 | { 876 | "name": "stderr", 877 | "output_type": "stream", 878 | "text": [ 879 | "[2017-04-01 20:40:54,103] [FrozenLake-v0] Uploading 10000 episodes of training data\n", 880 | "[2017-04-01 20:40:55,854] [FrozenLake-v0] Uploading videos of 19 training episodes (2278 bytes)\n", 881 | "[2017-04-01 20:40:56,102] [FrozenLake-v0] Creating evaluation object from /tmp/tmpyspcx0sa with learning curve and training video\n", 882 | "[2017-04-01 20:40:56,451] \n", 883 | "****************************************************\n", 884 | "You successfully uploaded your evaluation on FrozenLake-v0 to\n", 885 | "OpenAI Gym! You can find it at:\n", 886 | "\n", 887 | " https://gym.openai.com/evaluations/eval_vAvbhsGQRVSAe5DZkFNrQ\n", 888 | "\n", 889 | "****************************************************\n" 890 | ] 891 | } 892 | ], 893 | "source": [ 894 | "gym.upload(mdir, api_key='')" 895 | ] 896 | }, 897 | { 898 | "cell_type": "markdown", 899 | "metadata": { 900 | "collapsed": true 901 | }, 902 | "source": [ 903 | "Hope you liked it... Value Iteration and Policy Iteration might seem disappointing at first, and I understand. What is intelligent about following directions you were given!? What if you just don't have a map of the environment you are interacting with? Come on, that's not AI. You are right, it is not. However, Value Iteration and Policy Iteration form the basis of 2 of the 3 most fundamental paradigms of algorithms in reinforcement learning.\n", 904 | "\n", 905 | "In the next notebooks we start looking into slightly more complicated environments. We will also learn about algorithms that learn while interacting with the environment, also called \"online\" learning." 906 | ] 907 | } 908 | ], 909 | "metadata": { 910 | "kernelspec": { 911 | "display_name": "Python 3", 912 | "language": "python", 913 | "name": "python3" 914 | }, 915 | "language_info": { 916 | "codemirror_mode": { 917 | "name": "ipython", 918 | "version": 3 919 | }, 920 | "file_extension": ".py", 921 | "mimetype": "text/x-python", 922 | "name": "python", 923 | "nbconvert_exporter": "python", 924 | "pygments_lexer": "ipython3", 925 | "version": "3.5.2" 926 | } 927 | }, 928 | "nbformat": 4, 929 | "nbformat_minor": 2 930 | } 931 | -------------------------------------------------------------------------------- /notebooks/solutions/03-planning-algorithms.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Planning Algorithms" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Do you remember that in lessons 2 and 3 we discussed algorithms that basically solve MDPs? That is, they find a policy given an exact representation of the environment. In this section, we will explore two such algorithms: Value Iteration and Policy Iteration."
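To make "exact representation" concrete before diving in: in Gym's FrozenLake the dynamics are exposed as a nested dictionary `P[state][action]` holding `(probability, next_state, reward, done)` tuples, and that is exactly what the planning functions in this notebook consume. Here is a minimal sketch of how such a structure is read (illustrative values only, not one of the original cells; the real `P` comes from the environment a few cells below):

```python
# Illustrative sketch of the dynamics dictionary the planners below expect.
# P[s][a] is a list of (probability, next_state, reward, done) tuples.
P_example = {
    0: {0: [(1.0, 0, 0.0, False)],    # in state 0, action 0 keeps you in state 0
        1: [(0.5, 1, 0.0, False),     # action 1 moves you to state 1 half the time...
            (0.5, 2, 1.0, True)]},    # ...and to terminal state 2 (reward +1) otherwise
}

for prob, s_prime, reward, done in P_example[0][1]:
    print(prob, s_prime, reward, done)
```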
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import numpy as np\n", 24 | "import pandas as pd\n", 25 | "import tempfile\n", 26 | "import pprint\n", 27 | "import json\n", 28 | "import sys\n", 29 | "import gym\n", 30 | "\n", 31 | "from gym import wrappers\n", 32 | "from subprocess import check_output\n", 33 | "from IPython.display import HTML" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "#### Value Iteration" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "The Value Iteration algorithm uses dynamic programming: it divides the problem into common sub-problems and leverages that optimal substructure to speed up computations.\n", 48 | "\n", 49 | "Let me show you what value iteration looks like:" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 2, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "def value_iteration(S, A, P, gamma=.99, theta = 0.0000001):\n", 59 | " \n", 60 | " V = np.random.random(len(S))\n", 61 | " for i in range(100000):\n", 62 | " old_V = V.copy()\n", 63 | " \n", 64 | " Q = np.zeros((len(S), len(A)), dtype=float)\n", 65 | " for s in S:\n", 66 | " for a in A:\n", 67 | " for prob, s_prime, reward, done in P[s][a]:\n", 68 | " Q[s][a] += prob * (reward + gamma * old_V[s_prime] * (not done))\n", 69 | " V[s] = Q[s].max()\n", 70 | " if np.all(np.abs(old_V - V) < theta):\n", 71 | " break\n", 72 | " \n", 73 | " pi = np.argmax(Q, axis=1)\n", 74 | " return pi, V" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "As we can see, value iteration expects a set of states, e.g. (0,1,2,3,4), a set of actions, e.g. (0,1), and a set of transition probabilities that represents the dynamics of the environment.
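For reference, and in standard notation rather than anything from the original notebook, the update that the nested loops above implement is the Bellman optimality backup, with the `(not done)` factor zeroing out the bootstrap term on terminal transitions:

$$V_{k+1}(s) = \max_{a} \sum_{s'} p(s' \mid s, a)\,\big[\, r(s, a, s') + \gamma\, V_k(s') \,\big]$$

Once the change between $V_{k+1}$ and $V_k$ falls below `theta` for every state, the policy that is greedy with respect to the final $Q$ is returned.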
Let's take a look at these variables:" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stderr", 91 | "output_type": "stream", 92 | "text": [ 93 | "[2017-08-27 08:15:35,098] Making new env: FrozenLake-v0\n" 94 | ] 95 | } 96 | ], 97 | "source": [ 98 | "mdir = tempfile.mkdtemp()\n", 99 | "env = gym.make('FrozenLake-v0')\n", 100 | "env = wrappers.Monitor(env, mdir, force=True)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 4, 106 | "metadata": { 107 | "collapsed": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "S = range(env.env.observation_space.n)\n", 112 | "A = range(env.env.action_space.n)\n", 113 | "P = env.env.env.P" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 5, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/plain": [ 124 | "range(0, 16)" 125 | ] 126 | }, 127 | "execution_count": 5, 128 | "metadata": {}, 129 | "output_type": "execute_result" 130 | } 131 | ], 132 | "source": [ 133 | "S" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 6, 139 | "metadata": {}, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "range(0, 4)" 145 | ] 146 | }, 147 | "execution_count": 6, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "A" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 7, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "{0: [(0.3333333333333333, 6, 0.0, False),\n", 165 | " (0.3333333333333333, 9, 0.0, False),\n", 166 | " (0.3333333333333333, 14, 0.0, False)],\n", 167 | " 1: [(0.3333333333333333, 9, 0.0, False),\n", 168 | " (0.3333333333333333, 14, 0.0, False),\n", 169 | " (0.3333333333333333, 11, 0.0, True)],\n", 170 | " 2: [(0.3333333333333333, 14, 0.0, False),\n", 171 | " (0.3333333333333333, 11, 0.0, True),\n", 172 | " (0.3333333333333333, 6, 0.0, False)],\n", 173 | " 3: [(0.3333333333333333, 11, 0.0, True),\n", 174 | " (0.3333333333333333, 6, 0.0, False),\n", 175 | " (0.3333333333333333, 9, 0.0, False)]}" 176 | ] 177 | }, 178 | "execution_count": 7, 179 | "metadata": {}, 180 | "output_type": "execute_result" 181 | } 182 | ], 183 | "source": [ 184 | "P[10]" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "You see the world we are looking into \"FrozenLake-v0\" has 16 different states, 4 different actions. The `P[10]` is basically showing us a peek into the dynamics of the world. For example, in this case, if you are in state \"10\" (from `P[10]`) and you take action 0 (see dictionary key 0), you have equal probability of 0.3333 to land in either state 6, 9 or 14. None of those transitions give you any reward and none of them is terminal.\n", 192 | "\n", 193 | "In contrast, we can see taking action 2, might transition you to state 11, which **is** terminal. \n", 194 | "\n", 195 | "Get the hang of it? Let's run it!" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 8, 201 | "metadata": { 202 | "collapsed": true 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "pi, V = value_iteration(S, A, P)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "Now, value iteration calculates two important things. 
First, it calculates `V`, which tells us how much should we expect from each state if we always act optimally. Second, it gives us `pi`, which is the optimal policy given `V`. Let's take a deeper look:" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 9, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/plain": [ 224 | "array([ 0.54202426, 0.49880096, 0.47069307, 0.45684887, 0.55844941,\n", 225 | " 0. , 0.35834688, 0. , 0.59179743, 0.64307884,\n", 226 | " 0.61520669, 0. , 0. , 0.74171974, 0.86283707, 0. ])" 227 | ] 228 | }, 229 | "execution_count": 9, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "V" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 10, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "data": { 245 | "text/plain": [ 246 | "array([0, 3, 3, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 247 | ] 248 | }, 249 | "execution_count": 10, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "pi" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "See? This policy basically says in state `0`, take action `0`. In state `1` take action `3`. In state `2` take action `3` and so on. Got it?\n", 263 | "\n", 264 | "Now, we have the \"directions\" or this \"map\". With this, we can just use this policy and solve the environment as we interact with it.\n", 265 | "\n", 266 | "Let's try it out!" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 11, 272 | "metadata": { 273 | "scrolled": true 274 | }, 275 | "outputs": [ 276 | { 277 | "name": "stderr", 278 | "output_type": "stream", 279 | "text": [ 280 | "[2017-08-27 08:16:30,009] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000000.json\n", 281 | "[2017-08-27 08:16:30,015] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000001.json\n", 282 | "[2017-08-27 08:16:30,024] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000008.json\n", 283 | "[2017-08-27 08:16:30,063] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000027.json\n", 284 | "[2017-08-27 08:16:30,116] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000064.json\n", 285 | "[2017-08-27 08:16:30,168] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000125.json\n", 286 | "[2017-08-27 08:16:30,245] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000216.json\n", 287 | "[2017-08-27 08:16:30,346] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000343.json\n", 288 | "[2017-08-27 08:16:30,461] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000512.json\n", 289 | "[2017-08-27 08:16:30,613] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000729.json\n", 290 | "[2017-08-27 08:16:30,796] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video001000.json\n", 291 | "[2017-08-27 08:16:31,510] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video002000.json\n", 292 | "[2017-08-27 08:16:32,407] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video003000.json\n", 293 | "[2017-08-27 08:16:33,056] Starting new 
video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video004000.json\n", 294 | "[2017-08-27 08:16:33,717] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video005000.json\n", 295 | "[2017-08-27 08:16:34,350] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video006000.json\n", 296 | "[2017-08-27 08:16:34,995] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video007000.json\n", 297 | "[2017-08-27 08:16:35,629] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video008000.json\n", 298 | "[2017-08-27 08:16:36,299] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video009000.json\n" 299 | ] 300 | } 301 | ], 302 | "source": [ 303 | "for _ in range(10000):\n", 304 | " state = env.reset()\n", 305 | " while True:\n", 306 | " state, reward, done, info = env.step(pi[state])\n", 307 | " if done:\n", 308 | " break" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "That was the agent interacting with the environment. Let's take a look at some of the episodes:" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": 13, 321 | "metadata": {}, 322 | "outputs": [ 323 | { 324 | "name": "stdout", 325 | "output_type": "stream", 326 | "text": [ 327 | "https://asciinema.org/a/cJ4n5wZKQJIxjwKpndi0OKmWX\n" 328 | ] 329 | } 330 | ], 331 | "source": [ 332 | "last_video = env.videos[-1][0]\n", 333 | "out = check_output([\"asciinema\", \"upload\", last_video])\n", 334 | "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n", 335 | "print(out)" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "You can look at that link or, better, let's show it right in the notebook:" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 14, 348 | "metadata": {}, 349 | "outputs": [ 350 | { 351 | "data": { 352 | "text/html": [ 353 | "\n", 354 | "\n" 359 | ], 360 | "text/plain": [ 361 | "" 362 | ] 363 | }, 364 | "execution_count": 14, 365 | "metadata": {}, 366 | "output_type": "execute_result" 367 | } 368 | ], 369 | "source": [ 370 | "castid = out.split('/')[-1]\n", 371 | "html_tag = \"\"\"\n", 372 | "\n", 377 | "\"\"\"\n", 378 | "html_tag = html_tag.format(castid)\n", 379 | "HTML(data=html_tag)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "Interesting, right? Did you get the world yet?\n", 387 | "\n", 388 | "So, 'S' is the starting state, 'G' the goal, 'F' are frozen cells, and 'H' are holes. Your goal is to go from S to G without falling into any H. The problem is that F is slippery, so often you are better off trying moves that seem counter-intuitive; because they keep you away from the 'H's, they make sense in the end. For example, on the 'F' in the second row, first column, you can see how our agent was trying so hard to go left!! Smashing its head against the wall?? Silly. But why?"
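Before answering, it can help to literally look at the map. Here is a small optional snippet, not one of the original cells, that prints the greedy policy as arrows on the lake; it assumes the `pi` computed above is in scope and uses FrozenLake-v0's standard 4x4 layout with the usual action encoding (0=left, 1=down, 2=right, 3=up):

```python
# Optional visualization: print the greedy policy on the 4x4 lake.
arrows = ['<', 'v', '>', '^']      # 0=left, 1=down, 2=right, 3=up
lake = 'SFFFFHFHFFFHHFFG'          # FrozenLake-v0's standard map, row by row
for row in range(4):
    line = ''
    for col in range(4):
        s = row * 4 + col
        line += arrows[pi[s]] if lake[s] in 'SF' else lake[s]
    print(line)
```

Now, back to why smashing into the left wall there is actually the smart move: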
389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 15, 394 | "metadata": {}, 395 | "outputs": [ 396 | { 397 | "data": { 398 | "text/plain": [ 399 | "{0: [(0.3333333333333333, 0, 0.0, False),\n", 400 | " (0.3333333333333333, 4, 0.0, False),\n", 401 | " (0.3333333333333333, 8, 0.0, False)],\n", 402 | " 1: [(0.3333333333333333, 4, 0.0, False),\n", 403 | " (0.3333333333333333, 8, 0.0, False),\n", 404 | " (0.3333333333333333, 5, 0.0, True)],\n", 405 | " 2: [(0.3333333333333333, 8, 0.0, False),\n", 406 | " (0.3333333333333333, 5, 0.0, True),\n", 407 | " (0.3333333333333333, 0, 0.0, False)],\n", 408 | " 3: [(0.3333333333333333, 5, 0.0, True),\n", 409 | " (0.3333333333333333, 0, 0.0, False),\n", 410 | " (0.3333333333333333, 4, 0.0, False)]}" 411 | ] 412 | }, 413 | "execution_count": 15, 414 | "metadata": {}, 415 | "output_type": "execute_result" 416 | } 417 | ], 418 | "source": [ 419 | "P[4]" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "See how action 0 (left) doesn't have any transition leading to a terminal state??\n", 427 | "\n", 428 | "All other actions each give you a 0.333333 chance of pushing you into the hole in state '5'!!! So it actually makes sense to go left until the slipping pushes you downward to state 8.\n", 429 | "\n", 430 | "Cool, right?" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 16, 436 | "metadata": {}, 437 | "outputs": [ 438 | { 439 | "data": { 440 | "text/plain": [ 441 | "array([0, 3, 3, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 442 | ] 443 | }, 444 | "execution_count": 16, 445 | "metadata": {}, 446 | "output_type": "execute_result" 447 | } 448 | ], 449 | "source": [ 450 | "pi" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "See how the \"prescribed\" action is 0 (left) in the policy calculated by value iteration?\n", 458 | "\n", 459 | "How about the values?" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": 17, 465 | "metadata": {}, 466 | "outputs": [ 467 | { 468 | "data": { 469 | "text/plain": [ 470 | "array([ 0.54202426, 0.49880096, 0.47069307, 0.45684887, 0.55844941,\n", 471 | " 0. , 0.35834688, 0. , 0.59179743, 0.64307884,\n", 472 | " 0.61520669, 0. , 0. , 0.74171974, 0.86283707, 0. ])" 473 | ] 474 | }, 475 | "execution_count": 17, 476 | "metadata": {}, 477 | "output_type": "execute_result" 478 | } 479 | ], 480 | "source": [ 481 | "V" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "These show the expected (discounted) return from each state when acting optimally." 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 18, 494 | "metadata": {}, 495 | "outputs": [ 496 | { 497 | "data": { 498 | "text/plain": [ 499 | "{0: [(1.0, 15, 0, True)],\n", 500 | " 1: [(1.0, 15, 0, True)],\n", 501 | " 2: [(1.0, 15, 0, True)],\n", 502 | " 3: [(1.0, 15, 0, True)]}" 503 | ] 504 | }, 505 | "execution_count": 18, 506 | "metadata": {}, 507 | "output_type": "execute_result" 508 | } 509 | ], 510 | "source": [ 511 | "P[15]" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "See how state '15' is the goal? Reaching it earns you the reward of +1, and it is terminal (it only transitions back to itself). This signal gets propagated all the way back to the start state by Value Iteration, and it shows in the values all across the grid.\n", 519 | "\n", 520 | "Cool? Good."
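As a quick sanity check, and again not one of the original cells, you can redo the one-step lookahead for state 4 using the `P`, `A`, and `V` already in scope (and the same `gamma=0.99` the functions above default to), and confirm that going left really is the greedy choice there:

```python
import numpy as np

# Q(4, a) = sum over transitions of prob * (reward + gamma * V[s'] * (not done))
gamma = 0.99
Q4 = np.zeros(len(A), dtype=float)
for a in A:
    for prob, s_prime, reward, done in P[4][a]:
        Q4[a] += prob * (reward + gamma * V[s_prime] * (not done))

print(Q4)           # action 0 (left) should have the largest value...
print(Q4.argmax())  # ...matching pi[4] == 0
```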
521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 19, 526 | "metadata": {}, 527 | "outputs": [ 528 | { 529 | "name": "stderr", 530 | "output_type": "stream", 531 | "text": [ 532 | "[2017-08-27 08:18:16,163] Finished writing results. You can upload them to the scoreboard via gym.upload('/tmp/tmp8ebvhkul')\n" 533 | ] 534 | } 535 | ], 536 | "source": [ 537 | "env.close()" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "If you want to submit to OpenAI Gym, get your API Key and paste it here:" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": 9, 550 | "metadata": {}, 551 | "outputs": [ 552 | { 553 | "name": "stderr", 554 | "output_type": "stream", 555 | "text": [ 556 | "[2017-04-01 16:55:43,229] [FrozenLake-v0] Uploading 10000 episodes of training data\n", 557 | "[2017-04-01 16:55:44,905] [FrozenLake-v0] Uploading videos of 19 training episodes (2158 bytes)\n", 558 | "[2017-04-01 16:55:45,131] [FrozenLake-v0] Creating evaluation object from /tmp/tmpfukeltbz with learning curve and training video\n", 559 | "[2017-04-01 16:55:45,620] \n", 560 | "****************************************************\n", 561 | "You successfully uploaded your evaluation on FrozenLake-v0 to\n", 562 | "OpenAI Gym! You can find it at:\n", 563 | "\n", 564 | " https://gym.openai.com/evaluations/eval_ycTPCbyiTWK6T0C4DyrvRg\n", 565 | "\n", 566 | "****************************************************\n" 567 | ] 568 | } 569 | ], 570 | "source": [ 571 | "gym.upload(mdir, api_key='')" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "#### Policy Iteration" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "There is another method called Policy Iteration. This method is composed of 2 other methods, policy evaluation and policy improvement. The logic goes like this: policy iteration alternates between 'evaluating' a policy (computing the value of each state under it) and 'improving' it, which applies something similar to a one-step value iteration to get a policy that is at least as good, and usually better. The cycle stops when the policy converges, meaning it no longer changes.\n", 586 | "\n", 587 | "These two functions cycling together are what policy iteration is about.\n", 588 | "\n", 589 | "Can you implement this algorithm yourself? Try it. Make sure to look at the solution notebook in case you get stuck.\n", 590 | "\n", 591 | "I will give you the policy evaluation and policy improvement methods; you build the policy iteration that cycles between the evaluation and improvement methods until there are no changes to the policy."
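In equations (standard notation, added here only for reference), the two steps being cycled are policy evaluation, which solves for the value of the current policy,

$$V^{\pi}(s) = \sum_{s'} p(s' \mid s, \pi(s))\,\big[\, r(s, \pi(s), s') + \gamma\, V^{\pi}(s') \,\big]$$

and policy improvement, which makes the policy greedy with respect to that value function:

$$\pi'(s) = \arg\max_{a} \sum_{s'} p(s' \mid s, a)\,\big[\, r(s, a, s') + \gamma\, V^{\pi}(s') \,\big]$$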
592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": 32, 597 | "metadata": { 598 | "collapsed": true 599 | }, 600 | "outputs": [], 601 | "source": [ 602 | "def policy_evaluation(pi, S, A, P, gamma=.99, theta=0.0000001):\n", 603 | " \n", 604 | " V = np.zeros(len(S))\n", 605 | " while True:\n", 606 | " delta = 0\n", 607 | " for s in S:\n", 608 | " v = V[s]\n", 609 | " \n", 610 | " V[s] = 0\n", 611 | " for prob, dst, reward, done in P[s][pi[s]]:\n", 612 | " V[s] += prob * (reward + gamma * V[dst] * (not done))\n", 613 | " \n", 614 | " delta = max(delta, np.abs(v - V[s]))\n", 615 | " if delta < theta:\n", 616 | " break\n", 617 | " return V" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 33, 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "def policy_improvement(pi, V, S, A, P, gamma=.99):\n", 627 | " for s in S:\n", 628 | " old_a = pi[s]\n", 629 | " \n", 630 | " Qs = np.zeros(len(A), dtype=float)\n", 631 | " for a in A:\n", 632 | " for prob, s_prime, reward, done in P[s][a]:\n", 633 | " Qs[a] += prob * (reward + gamma * V[s_prime] * (not done))\n", 634 | " pi[s] = np.argmax(Qs)\n", 635 | " V[s] = np.max(Qs)\n", 636 | " return pi, V" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": 34, 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "def policy_iteration(S, A, P, gamma=.99):\n", 646 | " pi = np.random.choice(A, len(S))\n", 647 | " while True: \n", 648 | " V = policy_evaluation(pi, S, A, P, gamma)\n", 649 | " new_pi, new_V = policy_improvement(\n", 650 | " pi.copy(), V.copy(), S, A, P, gamma)\n", 651 | " if np.all(pi == new_pi):\n", 652 | " break\n", 653 | " pi = new_pi\n", 654 | " V = new_V\n", 655 | " return pi" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "After you implement the algorithms, you can run it and calculate the optimal policy:" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 35, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "name": "stderr", 672 | "output_type": "stream", 673 | "text": [ 674 | "[2017-08-27 08:21:56,074] Making new env: FrozenLake-v0\n", 675 | "[2017-08-27 08:21:56,078] Finished writing results. 
You can upload them to the scoreboard via gym.upload('/tmp/tmpsqiqif_m')\n" 676 | ] 677 | }, 678 | { 679 | "name": "stdout", 680 | "output_type": "stream", 681 | "text": [ 682 | "[1 3 0 3 0 0 0 0 3 1 0 0 0 2 1 0]\n" 683 | ] 684 | } 685 | ], 686 | "source": [ 687 | "mdir = tempfile.mkdtemp()\n", 688 | "env = gym.make('FrozenLake-v0')\n", 689 | "env = wrappers.Monitor(env, mdir, force=True)\n", 690 | "\n", 691 | "S = range(env.env.observation_space.n)\n", 692 | "A = range(env.env.action_space.n)\n", 693 | "P = env.env.env.P\n", 694 | "\n", 695 | "pi = policy_iteration(S, A, P)\n", 696 | "print(pi)" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": {}, 702 | "source": [ 703 | "And, of course, interact with the environment looking at the \"directions\" or \"policy\":" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": 36, 709 | "metadata": { 710 | "scrolled": true 711 | }, 712 | "outputs": [ 713 | { 714 | "name": "stderr", 715 | "output_type": "stream", 716 | "text": [ 717 | "[2017-08-27 08:21:59,041] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000000.json\n", 718 | "[2017-08-27 08:21:59,053] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000001.json\n", 719 | "[2017-08-27 08:21:59,059] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000008.json\n", 720 | "[2017-08-27 08:21:59,086] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000027.json\n", 721 | "[2017-08-27 08:21:59,127] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000064.json\n", 722 | "[2017-08-27 08:21:59,166] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000125.json\n", 723 | "[2017-08-27 08:21:59,214] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000216.json\n", 724 | "[2017-08-27 08:21:59,287] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000343.json\n", 725 | "[2017-08-27 08:21:59,375] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000512.json\n", 726 | "[2017-08-27 08:21:59,490] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000729.json\n", 727 | "[2017-08-27 08:21:59,624] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video001000.json\n", 728 | "[2017-08-27 08:22:00,092] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video002000.json\n", 729 | "[2017-08-27 08:22:00,837] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video003000.json\n", 730 | "[2017-08-27 08:22:01,269] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video004000.json\n", 731 | "[2017-08-27 08:22:01,720] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video005000.json\n", 732 | "[2017-08-27 08:22:02,184] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video006000.json\n", 733 | "[2017-08-27 08:22:02,614] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video007000.json\n", 734 | "[2017-08-27 08:22:03,085] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video008000.json\n", 735 | "[2017-08-27 08:22:03,518] Starting new video recorder writing to 
/tmp/tmpqfn2e0ho/openaigym.video.4.3760.video009000.json\n" 736 | ] 737 | } 738 | ], 739 | "source": [ 740 | "for _ in range(10000):\n", 741 | " state = env.reset()\n", 742 | " while True:\n", 743 | " state, reward, done, info = env.step(pi[state])\n", 744 | " if done:\n", 745 | " break" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": 37, 751 | "metadata": {}, 752 | "outputs": [ 753 | { 754 | "name": "stdout", 755 | "output_type": "stream", 756 | "text": [ 757 | "https://asciinema.org/a/NIeAt9sdjwkvmQbZSOVAkJOIb\n" 758 | ] 759 | } 760 | ], 761 | "source": [ 762 | "last_video = env.videos[-1][0]\n", 763 | "out = check_output([\"asciinema\", \"upload\", last_video])\n", 764 | "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n", 765 | "print(out)" 766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": 38, 771 | "metadata": {}, 772 | "outputs": [ 773 | { 774 | "data": { 775 | "text/html": [ 776 | "\n", 777 | "\n" 782 | ], 783 | "text/plain": [ 784 | "" 785 | ] 786 | }, 787 | "execution_count": 38, 788 | "metadata": {}, 789 | "output_type": "execute_result" 790 | } 791 | ], 792 | "source": [ 793 | "castid = out.split('/')[-1]\n", 794 | "html_tag = \"\"\"\n", 795 | "\n", 800 | "\"\"\"\n", 801 | "html_tag = html_tag.format(castid)\n", 802 | "HTML(data=html_tag)" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": {}, 808 | "source": [ 809 | "Similar to before. Policies could be slightly different if there is a state in which more than one action gives the same value in the end." 810 | ] 811 | }, 812 | { 813 | "cell_type": "code", 814 | "execution_count": 39, 815 | "metadata": {}, 816 | "outputs": [ 817 | { 818 | "data": { 819 | "text/plain": [ 820 | "array([ 0.54202426, 0.49880096, 0.47069307, 0.45684887, 0.55844941,\n", 821 | " 0. , 0.35834688, 0. , 0.59179743, 0.64307884,\n", 822 | " 0.61520669, 0. , 0. , 0.74171974, 0.86283707, 0. ])" 823 | ] 824 | }, 825 | "execution_count": 39, 826 | "metadata": {}, 827 | "output_type": "execute_result" 828 | } 829 | ], 830 | "source": [ 831 | "V" 832 | ] 833 | }, 834 | { 835 | "cell_type": "code", 836 | "execution_count": 40, 837 | "metadata": {}, 838 | "outputs": [ 839 | { 840 | "data": { 841 | "text/plain": [ 842 | "array([1, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 843 | ] 844 | }, 845 | "execution_count": 40, 846 | "metadata": {}, 847 | "output_type": "execute_result" 848 | } 849 | ], 850 | "source": [ 851 | "pi" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "That's it, let's wrap up." 859 | ] 860 | }, 861 | { 862 | "cell_type": "code", 863 | "execution_count": 41, 864 | "metadata": {}, 865 | "outputs": [ 866 | { 867 | "name": "stderr", 868 | "output_type": "stream", 869 | "text": [ 870 | "[2017-08-27 08:22:29,264] Finished writing results.
You can upload them to the scoreboard via gym.upload('/tmp/tmpqfn2e0ho')\n" 871 | ] 872 | } 873 | ], 874 | "source": [ 875 | "env.close()" 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "metadata": {}, 881 | "source": [ 882 | "If you want to submit to OpenAI Gym, get your API Key and paste it here:" 883 | ] 884 | }, 885 | { 886 | "cell_type": "code", 887 | "execution_count": 134, 888 | "metadata": {}, 889 | "outputs": [ 890 | { 891 | "name": "stderr", 892 | "output_type": "stream", 893 | "text": [ 894 | "[2017-04-01 20:40:54,103] [FrozenLake-v0] Uploading 10000 episodes of training data\n", 895 | "[2017-04-01 20:40:55,854] [FrozenLake-v0] Uploading videos of 19 training episodes (2278 bytes)\n", 896 | "[2017-04-01 20:40:56,102] [FrozenLake-v0] Creating evaluation object from /tmp/tmpyspcx0sa with learning curve and training video\n", 897 | "[2017-04-01 20:40:56,451] \n", 898 | "****************************************************\n", 899 | "You successfully uploaded your evaluation on FrozenLake-v0 to\n", 900 | "OpenAI Gym! You can find it at:\n", 901 | "\n", 902 | " https://gym.openai.com/evaluations/eval_vAvbhsGQRVSAe5DZkFNrQ\n", 903 | "\n", 904 | "****************************************************\n" 905 | ] 906 | } 907 | ], 908 | "source": [ 909 | "gym.upload(mdir, api_key='')" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": { 915 | "collapsed": true 916 | }, 917 | "source": [ 918 | "Hope you liked it... Value Iteration and Policy Iteration might seem disappointing at first, and I understand. What is intelligent about following directions you were given!? What if you just don't have a map of the environment you are interacting with? Come on, that's not AI. You are right, it is not. However, Value Iteration and Policy Iteration form the basis of 2 of the 3 most fundamental paradigms of algorithms in reinforcement learning.\n", 919 | "\n", 920 | "In the next notebooks we start looking into slightly more complicated environments. We will also learn about algorithms that learn while interacting with the environment, also called \"online\" learning." 921 | ] 922 | } 923 | ], 924 | "metadata": { 925 | "kernelspec": { 926 | "display_name": "Python 3", 927 | "language": "python", 928 | "name": "python3" 929 | }, 930 | "language_info": { 931 | "codemirror_mode": { 932 | "name": "ipython", 933 | "version": 3 934 | }, 935 | "file_extension": ".py", 936 | "mimetype": "text/x-python", 937 | "name": "python", 938 | "nbconvert_exporter": "python", 939 | "pygments_lexer": "ipython3", 940 | "version": "3.5.2" 941 | } 942 | }, 943 | "nbformat": 4, 944 | "nbformat_minor": 2 945 | } 946 | --------------------------------------------------------------------------------