├── paper ├── rldm.pdf ├── rldm.out ├── rldm.aux ├── rldmsubmit.sty └── rldm.tex ├── 13-recommended-courses └── README.md ├── 12-recommended-books └── README.md ├── 11-conclusion └── README.md ├── LICENSE ├── .gitignore ├── 09-cooperative-and-adversarial-agents └── README.md ├── Dockerfile ├── 08-single-and-multiple-agents └── README.md ├── 03-deterministic-and-stochastic-actions └── README.md ├── 10-decision-making-and-humans └── README.md ├── 06-discrete-and-continuous-actions └── README.md ├── 07-observable-and-partially-observable-states └── README.md ├── 01-introduction-to-decision-making └── README.md ├── 05-discrete-and-continuous-states └── README.md ├── 04-known-and-unknown-environments └── README.md ├── README.md ├── 02-sequential-decisions └── README.md └── notebooks ├── 02-dynamic-programming.ipynb ├── solutions ├── 02-dynamic-programming.ipynb └── 03-planning-algorithms.ipynb └── 03-planning-algorithms.ipynb /paper/rldm.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mimoralea/applied-reinforcement-learning/HEAD/paper/rldm.pdf -------------------------------------------------------------------------------- /13-recommended-courses/README.md: -------------------------------------------------------------------------------- 1 | ### 13. Recommended Courses 2 | 3 | * [Reinforcement Learning and Decision Making by Michael Littman and Charles Isbell](https://in.udacity.com/course/reinforcement-learning--ud600/) 4 | * [Reinforcement Learning by David Silver](https://www.youtube.com/playlist?list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT) 5 | * [Deep Reinforcement Learning by Sergey Levine, Chelsea Finn, et al.](https://www.youtube.com/playlist?list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX) 6 | -------------------------------------------------------------------------------- /12-recommended-books/README.md: -------------------------------------------------------------------------------- 1 | ### 12. Recommended Books 2 | 3 | * Reinforcement Learning State-of-the-Art by Marco Wiering et al. 4 | * Probabilistic Robotics by Sebastian Thrun et al. 5 | * Decision Making Under Uncertainty by Mykel J. Kochenderfer 6 | * Neuro-Dynamic Programming by Dimitri P. Bertsekas et al. 7 | * Statistical Reinforcement Learning by Masashi Sugiyama 8 | * Markov Decision Processes by Martin L. Puterman 9 | * Approximate Dynamic Programming by Warren B. Powell 10 | * Reinforcement Learning and Dynamic Programming Using Function Approximators by Lucian Busoniu et al. 11 | * Optimal Control Theory by Donald E. Kirk 12 | * Dynamic Programming by Richard Bellman 13 | * Dynamic Programming and Optimal Control Vol I by Dimitri P. Bertsekas 14 | * Dynamic Programming and Optimal Control Vol II by Dimitri P. Bertsekas 15 | * Reinforcement Learning: An Introduction by Richard S. Sutton et al. 16 | -------------------------------------------------------------------------------- /11-conclusion/README.md: -------------------------------------------------------------------------------- 1 | ### 11. Conclusion 2 | 3 | Reinforcement Learning is such an exciting field. Some people would argue that this field is the 4 | true "artificial intelligence". This project should provide you with a good foundation for 5 | understanding reinforcement learning. We, not only gave our intuitive take on each of the concepts 6 | introduced but also provided you with problems that you could do on your own to gain a 7 | deeper, hands-on understanding of the concepts discussed. 
Also, we pointed at several papers to 8 | further your knowledge in each of the areas presented. 9 | We really hope that this serves you well. Also, remember that if you want to help to make this 10 | series better, everything from a typo, to another notebook, or maybe an entire section, you are 11 | welcome to open a pull-request and I'll take a look and accept it if it helps with this project's 12 | objective. To give an intuitive and hands-on perspective of reinforcement learning and decision-making. 13 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Miguel Morales 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /paper/rldm.out: -------------------------------------------------------------------------------- 1 | \BOOKMARK [1][-]{section.1}{Introduction}{}% 1 2 | \BOOKMARK [1][-]{section.2}{Sparking Curiosity}{}% 2 3 | \BOOKMARK [2][-]{subsection.2.1}{Using Simple And Direct Language}{section.2}% 3 4 | \BOOKMARK [2][-]{subsection.2.2}{Keeping A Single Narrative}{section.2}% 4 5 | \BOOKMARK [2][-]{subsection.2.3}{Showing Concepts And Their Complement}{section.2}% 5 6 | \BOOKMARK [1][-]{section.3}{Removing Friction}{}% 6 7 | \BOOKMARK [2][-]{subsection.3.1}{Setting Up A Convenient Environment}{section.3}% 7 8 | \BOOKMARK [2][-]{subsection.3.2}{Providing With Boilerplate Code}{section.3}% 8 9 | \BOOKMARK [2][-]{subsection.3.3}{Asking For Minimal Effort}{section.3}% 9 10 | \BOOKMARK [1][-]{section.4}{Showing Options}{}% 10 11 | \BOOKMARK [2][-]{subsection.4.1}{Assigning Relevant Readings}{section.4}% 11 12 | \BOOKMARK [2][-]{subsection.4.2}{Watching Academic Lectures}{section.4}% 12 13 | \BOOKMARK [2][-]{subsection.4.3}{Completing Homework and Projects}{section.4}% 13 14 | \BOOKMARK [1][-]{section.5}{Future Work}{}% 14 15 | \BOOKMARK [2][-]{subsection.5.1}{Additional Notebooks}{section.5}% 15 16 | \BOOKMARK [2][-]{subsection.5.2}{Effectiveness Evaluation}{section.5}% 16 17 | \BOOKMARK [2][-]{subsection.5.3}{Request For Feedback}{section.5}% 17 18 | \BOOKMARK [1][-]{section.6}{Conclusion}{}% 18 19 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | notebooks/solutions/* 92 | -------------------------------------------------------------------------------- /09-cooperative-and-adversarial-agents/README.md: -------------------------------------------------------------------------------- 1 | ### 9. 
Cooperative and Adversarial Agents 2 | 3 | #### 9.1 Agents with conflicting objectives 4 | 5 | Going one step further, agents on an environment could actually have conflicting goals. Once we 6 | starting taking into account multiple agents competing for the same objectives, the field of game 7 | theory becomes important. Game Theory and reinforcement learning are the two fundamental fields 8 | of multi-agent reinforcement learning. When agents have opposing goals, there is probably no clear 9 | optimal solution and an equilibrium among the agents need to search for. For this, lots of game 10 | theory come to play. 11 | 12 | #### 9.2 Teams of agents with conflicting objectives 13 | 14 | Finally, we can think of worlds in which teams of agents compete against other teams for conflicting 15 | objectives. The RoboCup soccer is a well-known example of this type of environments. Make sure to check the 16 | recommended readings below, and work the RoboCup soccer 17 | provided by OpenAI Gym. 18 | 19 | #### 9.3 Further Reading 20 | 21 | * [Adversarial Reinforcement Learning](http://www.cs.cmu.edu/~mmv/papers/03TR-advRL.pdf) 22 | * [Markov games as a framework for multi-agent reinforcement learning](https://www.cs.rutgers.edu/~mlittman/papers/ml94-final.pdf) 23 | * [Value-function reinforcement learning in Markov games](http://www.sts.rpi.edu/~rsun/si-mal/article3.pdf) 24 | * [Learning To Play the Game of Chess](https://papers.nips.cc/paper/1007-learning-to-play-the-game-of-chess.pdf) 25 | * [Mastering the Game of Go with Deep Neural Networks and Tree Search](https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf) 26 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM jupyter/tensorflow-notebook 2 | MAINTAINER Miguel Morales 3 | USER root 4 | 5 | # update ubuntu installation 6 | RUN apt-get update -y 7 | RUN apt-get install -y --no-install-recommends apt-utils 8 | RUN apt-get upgrade -y 9 | 10 | # install dependencies 11 | RUN apt-get install -y libav-tools python3 ipython3 python3-pip python3-dev python3-opengl 12 | RUN apt-get install -y libpq-dev libjpeg-dev libboost-all-dev libsdl2-dev 13 | RUN apt-get install -y curl cmake swig wget unzip git xpra xvfb flex 14 | RUN apt-get install -y libav-tools fluidsynth build-essential qt-sdk 15 | 16 | # clean up 17 | RUN apt-get clean && rm -rf /var/lib/apt/lists/* 18 | 19 | # jupyter notebook 20 | EXPOSE 8888 21 | 22 | # tensorboard 23 | EXPOSE 6006 24 | 25 | # switch back to user 26 | USER $NB_USER 27 | 28 | # install necessary packages 29 | RUN pip3 install --upgrade pip 30 | RUN pip3 install numpy scikit-learn scipy pyglet setuptools pygame 31 | RUN pip3 install gym tensorflow keras asciinema pandas 32 | RUN pip3 install git+https://github.com/openai/gym-soccer.git@master 33 | RUN pip3 install git+https://github.com/lusob/gym-ple.git@master 34 | RUN pip3 install git+https://github.com/ntasfi/PyGame-Learning-Environment.git@master 35 | #git clone https://github.com/ntasfi/PyGame-Learning-Environment.git 36 | 37 | # create a script to start the notebook with xvfb on the back 38 | # this allows screen display to work well 39 | RUN echo '#!/bin/bash' > /tmp/run.sh && \ 40 | echo "nohup sh -c 'tensorboard --logdir=/mnt/notebooks/logs' > /dev/null 2>&1 &" >> /tmp/run.sh && \ 41 | echo 'xvfb-run -s "-screen 0 1280x720x24" /usr/local/bin/start-notebook.sh' >> /tmp/run.sh && \ 42 | chmod +x /tmp/run.sh 43 | 44 | # move 
notebooks into container 45 | # ADD notebooks /mnt/notebooks 46 | 47 | # make the dir with notebooks the working dir 48 | WORKDIR /mnt/notebooks 49 | 50 | # run the script to start the notebook 51 | ENTRYPOINT ["/tmp/run.sh"] 52 | -------------------------------------------------------------------------------- /08-single-and-multiple-agents/README.md: -------------------------------------------------------------------------------- 1 | ## Part IV: Multiple Decision-Making Agents 2 | 3 | ### 8. Single and Multiple Agents 4 | 5 | #### 8.1 Agents with same objectives 6 | 7 | The methods for reinforcement learning that we have seen so far relate to single agents taking decisions on an environment. We can think, however, on a slightly different kind of problem in which multiple agents 8 | jointly act on the same environment trying to maximize a common reward signal. Such environment could be robotics, networking, economics, auctions, etc. Often time, the algorithms discussed up until now would 9 | potentially fail in such environments. The problem is that in these kinds of environments, the control of 10 | the agents is decentralized and therefore it requires coordination and cooperation to maximize the reward 11 | signal. 12 | 13 | Even though decentralizing the decision-making adds considerable complexity, the need for a multi-agent system 14 | for some problems is real. Often a centralized approach is just not possible, perhaps due to physical constraints, for example, a network routing system being decentralized, or a team of robots with shared objectives but independent processing capabilities. So, the methods of decentralized reinforcement learning, 15 | often called Dec-MDPs and Dec-POMDPs, are very important as well. 16 | 17 | #### 8.2 What when other agents are at play? 18 | 19 | When other agents take actions on the same environment, game theory becomes important. Game theory is a field that researches conflict of interests. Economics, political science, psychology, biology and so on 20 | are some of the most conventional fields using game theory concepts. 21 | 22 | #### 8.3 Further Reading 23 | 24 | * [Game Theory: Basic Concepts](http://www.umass.edu/preferen/Game%20Theory%20Evolving/GTE%20Public/GTE%20Game%20Theory%20Basic%20Concepts.pdf) 25 | * [Game Theory](http://www.cdam.lse.ac.uk/Reports/Files/cdam-2001-09.pdf) 26 | * [An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning](http://www.cs.cmu.edu/~mmv/papers/00TR-mike.pdf) 27 | * [Multi-agent reinforcement learning: An overview](https://pdfs.semanticscholar.org/d96d/a4ac9f78924871c3c4d0dece0b84884fe483.pdf) 28 | * [Multi Agent Reinforcement Learning: Independent vs Cooperative Agents](http://web.media.mit.edu/~cynthiab/Readings/tan-MAS-reinfLearn.pdf) 29 | -------------------------------------------------------------------------------- /03-deterministic-and-stochastic-actions/README.md: -------------------------------------------------------------------------------- 1 | ### 3. Deterministic and Stochastic Actions 2 | 3 | #### 3.1 We can't perfectly control the world 4 | 5 | One of the main points in reinforcement learning is that actions are not always deterministic. That is, taking 6 | an action does not imply that the action will affect the world the same way each time. Even if the action is taken 7 | given the exact same environmental conditions, the actions are not always deterministic. 
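As a tiny illustration of this point, consider the sketch below (the transition table is made up for this example and is not from the lesson's notebooks): the same action, taken from the same state, can produce different next states and rewards.

```python
import random

# A made-up "slippery" transition table: taking action 'right' in state 0
# succeeds 80% of the time, leaves the agent in place 10% of the time, and
# overshoots 10% of the time. Each outcome is (probability, next_state, reward).
TRANSITIONS = {
    (0, 'right'): [(0.8, 1, 1.0), (0.1, 0, 0.0), (0.1, 2, -1.0)],
}

def step(state, action):
    """Sample one stochastic outcome of taking `action` in `state`."""
    outcomes = TRANSITIONS[(state, action)]
    draw, cumulative = random.random(), 0.0
    for probability, next_state, reward in outcomes:
        cumulative += probability
        if draw <= cumulative:
            return next_state, reward
    return outcomes[-1][1:]  # guard against floating-point round-off

# The same action from the same state does not always do the same thing.
for _ in range(5):
    print(step(0, 'right'))

# Weighting each outcome's reward by its probability gives the expectation,
# which is the quantity discussed in section 3.2 below.
expected_reward = sum(p * r for p, _, r in TRANSITIONS[(0, 'right')])
print(expected_reward)  # 0.8 * 1.0 + 0.1 * 0.0 + 0.1 * (-1.0) = 0.7
```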
In fact, most real-world 8 | problems have some stochasticity attached to it in how the world reacts to the agents' actions. For example, we can 9 | think the stock trading agent taking an action to buy a stock, but encountering network issues along the way and 10 | therefore failing at the transaction. Similarly, for the robotics example, we can imagine how moving a robotic 11 | arm to a given location might be precise within a certain range. So the probability of that actions affecting the 12 | environment the same way each time even if given the same exact initial conditions is not total. 13 | 14 | #### 3.2 Dealing with stochasticity 15 | 16 | The way we account for the fact that the world is stochastic is by using expectation of rewards. For example, when 17 | we calculate the rewards we would obtain for taking an action in a given state, we would take into account the probabilities 18 | of transitioning to every single other new state and multiply this probability by the reward we would obtain. If we 19 | sum all of them, we obtain the expectation. 20 | 21 | #### 3.3 Exercises 22 | 23 | In this lesson, we looked into how the environment can get more complex than we discussed in previous lessons. 24 | However, the same algorithms we presented earlier can help us plan when we have a model of the environment. On 25 | the Notebook below we will implement the algorithms discussed in the previous chapter in worlds with deterministic 26 | and stochastic transitions. 27 | 28 | Lesson 3 Notebook. 29 | 30 | #### 3.4 Further Reading 31 | 32 | * [Markov Decision Processes: Concepts and Algorithms](http://www.cs.vu.nl/~annette/SIKS2009/material/SIKS-RLIntro.pdf) 33 | * [Markov Decision Processes: Lecture Notes](https://math.la.asu.edu/~jtaylor/teaching/Fall2012/STP425/lectures/MDP.pdf) 34 | * [Markov Decision Processes](http://www.lancaster.ac.uk/postgrad/zaninie/MDP.pdf) 35 | * [Markov Decision Processes](https://cs.uwaterloo.ca/~jhoey/teaching/cs486/mdp.pdf) 36 | -------------------------------------------------------------------------------- /10-decision-making-and-humans/README.md: -------------------------------------------------------------------------------- 1 | ## Part V: Human Decision-Making and Beyond 2 | 3 | ### 10. Decision-Making and Humans 4 | 5 | #### 10.1 Similarities between methods discussed and humans 6 | 7 | It is not surprising that many of the methods and algorithms explored in these lessons 8 | have some similarities to how humans perceive intelligence. After all, if it is humans 9 | that are building these methods, it is expected that we will create this with our bias of how we see the world. One of the most interesting similarities is how reinforcement 10 | learning algorithms can maximize the expected future reward over a long term horizon. 11 | This is perhaps what makes humans the most intelligent creatures on earth, we plan, 12 | execute and even sacrifice high short-term rewards for even smaller rewards given 13 | multiple times in the future. This is truly amazing. 14 | 15 | Another obvious and important similarity is that of learning by trial and error. It is 16 | true that humans learn with supervision as well. But lots of the learning come from trial and error. There is no way of teaching a toddler to walk with pen and paper, or by reading books, a child will learn to walk by walking. As incredible as it sounds, it is a fact. 
17 | 18 | #### 10.2 Differences between methods discussed and humans 19 | 20 | However, not everything is immediately obvious, and it is in fact, a mistake to just 21 | try to recreate a human brain. Usually, technology advances more quickly when we build systems that enhance humans, instead of trying to replace us. One of the fundamental 22 | differences between the way reinforcement learning works and how humans behave seem to be a lot the reward system. The fact that different humans can perceive the same reward signal different. In reinforcement learning, the reward signal is given by the environment, but it is not clear that this is how to world is actually model in reality. 23 | Sure, no human would consider that stepping on a nail is a positive signal, but it is 24 | true that many humans, especially, successful ones, usually have a way of "bending" 25 | reality to always look for the positive. Perhaps, this is something researchers should 26 | be working on these days. 27 | 28 | #### 10.3 Further Reading 29 | 30 | * [Reinforcement Learning, High-Level Cognition, and the Human Brain](http://users.ugent.be/~tverguts/Publications_files/Silvetti%20RL%20chapter.pdf) 31 | * [A Comparison of Human and Agent Reinforcement Learning in Partially Observable Domains](http://mlg.eng.cam.ac.uk/pub/pdf/DosGha11.pdf) 32 | * [Intrinsically Motivated Reinforcement Learning](http://web.eecs.umich.edu/~baveja/Papers/FinalNIPSIMRL.pdf) 33 | * [Intrinsic Motivation For Reinforcement Learning Systems](http://www-anw.cs.umass.edu/pubs/2005/barto_s_yale05.pdf) 34 | * [Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation](https://arxiv.org/pdf/1604.06057.pdf) 35 | -------------------------------------------------------------------------------- /paper/rldm.aux: -------------------------------------------------------------------------------- 1 | \relax 2 | \providecommand\hyper@newdestlabel[2]{} 3 | \providecommand\HyperFirstAtBeginDocument{\AtBeginDocument} 4 | \HyperFirstAtBeginDocument{\ifx\hyper@anchor\@undefined 5 | \global\let\oldcontentsline\contentsline 6 | \gdef\contentsline#1#2#3#4{\oldcontentsline{#1}{#2}{#3}} 7 | \global\let\oldnewlabel\newlabel 8 | \gdef\newlabel#1#2{\newlabelxx{#1}#2} 9 | \gdef\newlabelxx#1#2#3#4#5#6{\oldnewlabel{#1}{{#2}{#3}}} 10 | \AtEndDocument{\ifx\hyper@anchor\@undefined 11 | \let\contentsline\oldcontentsline 12 | \let\newlabel\oldnewlabel 13 | \fi} 14 | \fi} 15 | \global\let\hyper@last\relax 16 | \gdef\HyperFirstAtBeginDocument#1{#1} 17 | \providecommand\HyField@AuxAddToFields[1]{} 18 | \providecommand\HyField@AuxAddToCoFields[2]{} 19 | \citation{gapranda} 20 | \citation{suttons98} 21 | \citation{intuition} 22 | \citation{directinstruction} 23 | \citation{compare} 24 | \@writefile{toc}{\contentsline {section}{\numberline {1}Introduction}{1}{section.1}} 25 | \@writefile{toc}{\contentsline {section}{\numberline {2}Sparking Curiosity}{1}{section.2}} 26 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.1}Using Simple And Direct Language}{1}{subsection.2.1}} 27 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Keeping A Single Narrative}{1}{subsection.2.2}} 28 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.3}Showing Concepts And Their Complement}{1}{subsection.2.3}} 29 | \citation{openaigym} 30 | \citation{visualization} 31 | \@writefile{toc}{\contentsline {section}{\numberline {3}Removing Friction}{2}{section.3}} 32 | \@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Setting Up A 
Convenient Environment}{2}{subsection.3.1}} 33 | \@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Providing With Boilerplate Code}{2}{subsection.3.2}} 34 | \@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Asking For Minimal Effort}{2}{subsection.3.3}} 35 | \@writefile{toc}{\contentsline {section}{\numberline {4}Showing Options}{2}{section.4}} 36 | \@writefile{toc}{\contentsline {subsection}{\numberline {4.1}Assigning Relevant Readings}{2}{subsection.4.1}} 37 | \@writefile{toc}{\contentsline {subsection}{\numberline {4.2}Watching Academic Lectures}{2}{subsection.4.2}} 38 | \bibcite{gapranda}{1} 39 | \@writefile{toc}{\contentsline {subsection}{\numberline {4.3}Completing Homework and Projects}{3}{subsection.4.3}} 40 | \@writefile{toc}{\contentsline {section}{\numberline {5}Future Work}{3}{section.5}} 41 | \@writefile{toc}{\contentsline {subsection}{\numberline {5.1}Additional Notebooks}{3}{subsection.5.1}} 42 | \@writefile{toc}{\contentsline {subsection}{\numberline {5.2}Effectiveness Evaluation}{3}{subsection.5.2}} 43 | \@writefile{toc}{\contentsline {subsection}{\numberline {5.3}Request For Feedback}{3}{subsection.5.3}} 44 | \@writefile{toc}{\contentsline {section}{\numberline {6}Conclusion}{3}{section.6}} 45 | \bibcite{suttons98}{2} 46 | \bibcite{intuition}{3} 47 | \bibcite{directinstruction}{4} 48 | \bibcite{compare}{5} 49 | \bibcite{visualization}{6} 50 | \bibcite{openaigym}{7} 51 | -------------------------------------------------------------------------------- /06-discrete-and-continuous-actions/README.md: -------------------------------------------------------------------------------- 1 | ### 6. Discrete and Continuous Actions 2 | 3 | #### 6.1 Continuous action space 4 | 5 | Just like the state space, the action space can also become too large to handle in a 6 | traditional way. Certainly, the problems that we have seen so far have very few 7 | available actions: "move up, down, left, right". However, in other types of problems, 8 | like robotics, for instance, even a small number of degrees of freedom can make the 9 | action space just too large for traditional methods. 10 | 11 | #### 6.2 Discretization of action space 12 | 13 | For the action space, discretization is commonly used. This is a fine method for 14 | problems like the ones we have seen before. However, once we enter the realm of physical control, which often deals with continuous values, every new degree of freedom exponentially increases the number of possible action combinations. This gets out of control quickly. 15 | 16 | #### 6.3 Use of function approximation 17 | 18 | The use of function approximation is again a good way of approaching this problem. 19 | Just as before, linear and non-linear function approximation methods can work, as 20 | long as the problem we are dealing with is linear or non-linear, respectively. 21 | 22 | #### 6.4 Searching for the policy 23 | 24 | One way of approaching reinforcement learning problems that we haven't covered yet is, instead of calculating values for state and action pairs to come up with 25 | optimal policies, to search for the optimal policy directly. There are different 26 | ways of doing this, and this is, in fact, one of the areas of active research in reinforcement learning. One of the advantages of using policy search instead of some 27 | of the methods we have seen before is that it is possible to find the optimal policy 28 | even if we don't find optimal values.
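To make the idea of searching the policy space directly a bit more concrete, here is a minimal sketch on a made-up one-dimensional control problem; the environment, the linear policy, and simple random-perturbation hill climbing are all assumptions chosen for illustration, not the method used in the lesson's notebook.

```python
import numpy as np

# A toy continuous problem: the state is a point on a line, the action is a
# real-valued push, and the return is higher the closer we stay to the origin.
def episode_return(policy_weight, n_steps=20, start=5.0):
    state, total = start, 0.0
    for _ in range(n_steps):
        action = policy_weight * state   # linear policy: a = w * s
        state = state + action           # deterministic toy dynamics
        total += -abs(state)             # reward for staying near the origin
    return total

# Direct policy search by random perturbation (hill climbing): we only compare
# whole-episode returns and never estimate a value for any state or action.
rng = np.random.default_rng(0)
best_w, best_return = 0.0, episode_return(0.0)
for _ in range(200):
    candidate = best_w + rng.normal(scale=0.1)
    candidate_return = episode_return(candidate)
    if candidate_return > best_return:
        best_w, best_return = candidate, candidate_return

print(best_w, best_return)  # a weight near -1.0 drives the state toward 0
```

The search lands on a good policy without ever producing value estimates, which is exactly the trade-off described above.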
For example, you can think of the trading agent 29 | looking to calculate the value of buying a stock now: whether that value is $10,000 or 30 | $100,000, you don't care about the precise number; you care to know that it is the best action 31 | to take right now. This is because you care about the policy, not the values. The 32 | same concept applies to policy search. You could apply traditional search methods or just gradient descent to search directly for the optimal policy. We will look at 33 | a method that searches for the optimal policy in a continuous state space and continuous 34 | action space in the notebook. 35 | 36 | #### 6.5 Exercises 37 | 38 | In this lesson, we looked for the first time at problems in which both the number of states and the number 39 | of actions available to the agent are very large or continuous. We introduced a series of methods based on policy 40 | search. So, for this lesson's Notebook, we will look into a problem with continuous states and actions, and we 41 | will apply function approximation to come up with the best solution to it. 42 | 43 | Lesson 6 Notebook. 44 | 45 | #### 6.6 Further Reading 46 | 47 | * [Reinforcement Learning in Continuous State and Action Spaces](http://oai.cwi.nl/oai/asset/19689/19689B.pdf) 48 | * [Continuous Control with Deep Reinforcement Learning](https://arxiv.org/pdf/1509.02971.pdf) 49 | * [Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods](https://papers.nips.cc/paper/3318-reinforcement-learning-in-continuous-action-spaces-through-sequential-monte-carlo-methods.pdf) 50 | * [Q-Learning in Continuous State and Action Spaces](http://users.cecs.anu.edu.au/~rsl/rsl_papers/99ai.kambara.pdf) 51 | * [Deep Reinforcement Learning: An Overview](https://arxiv.org/pdf/1701.07274.pdf) 52 | -------------------------------------------------------------------------------- /07-observable-and-partially-observable-states/README.md: -------------------------------------------------------------------------------- 1 | ### 7. Observable and Partially-Observable States 2 | 3 | #### 7.1 Is what we see what it is? 4 | 5 | The reinforcement learning methods that we have discussed so far make the assumption that the agent 6 | has perfect sensing capability. That is, the agent is able to perceive the world exactly as the 7 | world is. However, in many environments, this is not entirely true. Moreover, in some environments, 8 | it is vital to take this uncertainty into account. For example, in many robotics environments, 9 | our sensor measurements are accurate only within a range. GPS readings can often vary from 2 meters 10 | up to 10 meters. Temperature sensors can provide readings with a 5% error margin. The problem is then 11 | that the methods that we have covered until now are not capable of taking this error into account. This 12 | is because the MDP-based methods have a fundamental assumption, the Markovian assumption. Once this assumption no longer holds true, because the state signal is not fully observable, we enter the field of partially-observable Markov decision processes (POMDPs). 13 | 14 | #### 7.2 State Estimation 15 | 16 | From the robotics world, a few methods emerged to deal with sensor errors. These methods use probabilistic 17 | techniques to model the uncertainty in the sensor readings. In fact, these methods are some of the most 18 | commonly used methods today in areas like autonomous vehicles, object tracking, navigation, and much more. 19 | These methods are called Bayesian filters.
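To give a concrete feel for what a Bayesian filter does, here is a minimal sketch of the simplest case: a one-dimensional Kalman filter tracking a quantity that does not move, so the prediction step is trivial. The true value and the noise level are made-up numbers for illustration; this is not the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 10.0                 # the quantity we are trying to estimate
measurement_noise_var = 4.0       # the sensor is accurate only within a range

estimate, estimate_var = 0.0, 1000.0   # start with a very uncertain belief
for _ in range(20):
    # a noisy sensor reading of the (static) true value
    z = true_value + rng.normal(scale=np.sqrt(measurement_noise_var))
    # Kalman update: blend belief and measurement according to uncertainty
    kalman_gain = estimate_var / (estimate_var + measurement_noise_var)
    estimate = estimate + kalman_gain * (z - estimate)
    estimate_var = (1.0 - kalman_gain) * estimate_var

print(estimate, estimate_var)  # close to 10, with a small remaining variance
```

Each update blends the current belief with the new measurement in proportion to how uncertain each is; that weighting by uncertainty is the common idea behind all the Bayesian filters mentioned above.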
We will look into one of them, the Kalman Filter, in the notebook 20 | for this lecture. 21 | 22 | #### 7.3 Control in Partially-Observable Environments 23 | 24 | It is important to note that Bayesian filters do not solve the entire decision-making problem; however, they 25 | do efficiently solve the state estimation problem. POMDPs are very complex, and so is the theory underlying them. 26 | However, it is good to mention that there exist extensions of most of the algorithms that we have looked at 27 | so far to solve POMDPs for discrete worlds. These methods, however, are inapplicable to many practical 28 | problems in robotics, for instance. There are approximate POMDP methods that sit in between MDPs and POMDPs 29 | and that are capable of giving sufficiently good approximate answers to POMDPs in a reasonable amount of time. 30 | We will refer those looking for more information to interesting readings in this area. 31 | 32 | #### 7.4 Exercises 33 | 34 | In this lesson, we learned that what we see is not always what is happening in the world. Our perceptions might be 35 | biased, we might not have 20/20 vision, and more importantly, we might think we have 20/20 vision when we do not. For this 36 | reason, it is important to know that there are other ways of estimating states. In the following Notebook, we will look 37 | at a very popular method for state estimation, called the Kalman Filter, on a very basic partially-observable 38 | states problem. 39 | 40 | #### 7.5 Further Reading 41 | 42 | * [Particle Filters in Robotics](https://arxiv.org/pdf/1301.0607.pdf) 43 | * [Bayesian Filtering: From Kalman Filters to Particle Filters, and Beyond](http://www.dsi.unifi.it/users/chisci/idfric/Nonlinear_filtering_Chen.pdf) 44 | * [Reinforcement Learning Using Approximate Belief States](https://ai.stanford.edu/~koller/Papers/Rodriguez+al:NIPS99.pdf) 45 | * [Bayesian Reinforcement Learning in Continuous POMDPs with Gaussian Processes](https://www.cs.cmu.edu/~sross1/publications/gppomgp_IROS09.pdf) 46 | * [Deep Reinforcement Learning with POMDPs](http://cs229.stanford.edu/proj2015/363_report.pdf) 47 | -------------------------------------------------------------------------------- /01-introduction-to-decision-making/README.md: -------------------------------------------------------------------------------- 1 | ## Part I: Introduction 2 | 3 | ### 1. Introduction to Decision-Making 4 | 5 | #### 1.1 Decision-Making 6 | 7 | Decision-making has captivated human intelligence for many years. Humans have always 8 | wondered what makes us the most intelligent animal on this planet. The fact is that 9 | decision-making could be seen as directly correlated with intelligence. The better 10 | the decisions being made, either by a natural or artificial agent, the more likely we 11 | will perceive that agent as intelligent. Moreover, the level of impact that decisions have 12 | is directly or indirectly recognized by our societies. Roles in which decision-making 13 | is a primary responsibility are the most highly regarded in today's workforce. If we 14 | think of prestige and salary, for example, leadership roles rate higher than management, 15 | and management rates higher than the rest of the labor force. 16 | 17 | Being such an important field, it comes as no surprise that decision-making is studied 18 | under many different names.
Economics, Neuroscience, Psychology, Operations Research, Adaptive 19 | Control, Statistics, Optimal Control Theory, and Reinforcement Learning are some of the prominent 20 | fields contributing to the understanding of decision-making. However, if we think deeper, 21 | most other fields are also concerned with optimal decision-making. They might not 22 | necessarily contribute directly to improving our understanding of how we make optimal 23 | decisions, but they do study decision-making applied to a specific trade. For instance, 24 | think of journalism. This activity is not concerned with understanding how to make optimal 25 | decisions in general, but it is definitely interested in learning how to make optimal 26 | decisions with regard to preparing news and writing for newspapers. By the same token, we 27 | can see how fields that study decision-making are a generalization of these other fields. 28 | 29 | In the following lessons, we will explore decision-making in the context of Reinforcement 30 | Learning. As Reinforcement Learning is a descendant of Artificial Intelligence, in the remainder 31 | of this chapter we will briefly touch on Artificial Intelligence. Also, being related fields, 32 | we will look at some basics of probability and statistics. In the rest of this lesson, we 33 | will discuss decision-making when there is only one decision to make. This is perhaps the major 34 | difference between Reinforcement Learning and other related fields: Reinforcement Learning 35 | relaxes this constraint, allowing the notion of sequential decision-making. 36 | This sense of interaction with an environment sets Reinforcement Learning apart. In later lessons, 37 | we will continue relaxing constraints and presenting more abstract topics related to Reinforcement 38 | Learning. After this lesson, we will explore deterministic and stochastic transitions, known 39 | and unknown environments, discrete and continuous states, discrete and continuous actions, 40 | observable and partially observable states, single and multiple agents, cooperative and 41 | adversarial agents, and finally, we will put everything in the perspective of human 42 | intelligence. I hope you enjoy this work. 43 | 44 | #### 1.2 Further Reading 45 | 46 | * [Decision Theory A Brief Introduction](http://people.kth.se/~soh/decisiontheory.pdf) 47 | * [Statistical Decision Theory: Concepts, Methods and Applications](http://probability.ca/jeff/ftpdir/anjali0.pdf) 48 | * [A Brief History of Decision Making](https://hbr.org/2006/01/a-brief-history-of-decision-making) 49 | * [The Theory of Decision Making](http://worthylab.tamu.edu/courses_files/01_edwards_1954.pdf) 50 | -------------------------------------------------------------------------------- /05-discrete-and-continuous-states/README.md: -------------------------------------------------------------------------------- 1 | ## Part III: Decision-Making in Hard Problems 2 | 3 | ### 5. Discrete and Continuous States 4 | 5 | #### 5.1 Too large to hold in memory 6 | 7 | The truth is, as we use all of the previous methods to solve decision-making problems, 8 | there will come a time when the problems are very large. Some problems become so large that we can no longer represent them in computer memory. Moreover, even if we could hold a table with all state and action pairs in memory, collecting experience for every state-action combination would be inefficient. 9 | 10 | #### 5.2 Discretization of state space 11 | 12 | One way of approaching this problem is to combine states into buckets by similarity, as in the quick sketch below.
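The bin edges in the sketch are made-up values loosely inspired by a cart-pole-style observation (position, velocity, angle, angular velocity); they are assumptions for illustration, not the notebook's settings.

```python
import numpy as np

# One set of bin edges per observation dimension (all ranges are assumptions).
BINS = [
    np.linspace(-2.4, 2.4, 9),    # cart position
    np.linspace(-3.0, 3.0, 9),    # cart velocity
    np.linspace(-0.21, 0.21, 9),  # pole angle (radians)
    np.linspace(-3.0, 3.0, 9),    # pole angular velocity
]

def discretize(observation):
    """Map a continuous observation to a tuple of bucket indices."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(observation, BINS))

# Two nearby observations fall into the same bucket, so they share a single
# entry in a table of values.
print(discretize([0.05, 0.20, 0.010, -0.10]))
print(discretize([0.06, 0.21, 0.012, -0.11]))
```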
13 | This approach could effectively reduce the number of states of the problem to a 14 | number that allows us to solve the problem using one of the methods seen in previous 15 | lessons. For example, in the OpenAI Lunar Lander world, we can see how the entire 16 | right side of the landing pad and the left side could be counted as 2 unique states. 17 | The truth is, no matter where you are in that right or left area, your best action will be to fly either left or right, respectively, making sure you end up in the middle. Additionally, 18 | the upper 50% of the vertical axis could easily be treated as a single area, with many smaller areas as we get closer to the landing pad. We will see how to apply discretization 19 | to the cart-pole problem in this lesson's notebook. 20 | 21 | #### 5.3 Use of function approximation 22 | 23 | Soon after looking into discretization, any Machine Learning Engineer would shake their head: why not use function approximation instead of doing this by hand? This is exactly why function approximation exists. In fact, we could use 24 | any function approximator, like KNN or SVM; however, if the environment is non-linear, 25 | then non-linear function approximators should be used instead, as without them we 26 | might find a solution that improves but never converges to 27 | the optimal policy. Perhaps the most popular non-linear function approximators 28 | nowadays are neural networks. In fact, the use of neural networks that are more than 3 layers deep in combination with reinforcement learning algorithms is often grouped into a field called Deep Reinforcement Learning. This is perhaps one of the 29 | most interesting and promising areas of reinforcement learning, and we will look 30 | into it in the next lesson's notebook. 31 | 32 | #### 5.4 Exercises 33 | 34 | In this lesson, we got a step closer to what we could call 'real-world' reinforcement learning. Specifically, 35 | we looked at a kind of environment in which there are so many states that we can no longer represent a table 36 | of all of them, either because the state space is too large or flat-out continuous. 37 | 38 | In order to get a sense for this type of problem, we will look at a basic Cart Pole problem, and we will solve it by 39 | discretizing the state space, in effect making a manual function approximation of this problem. 40 | 41 | Lesson 5 Notebook. 42 | 43 | #### 5.5 Further Reading 44 | 45 | * [An Analysis of Reinforcement Learning with Function Approximation](http://icml2008.cs.helsinki.fi/papers/652.pdf) 46 | * [Residual Algorithms: Reinforcement Learning with Function Approximation](http://www.leemon.com/papers/1995b.pdf) 47 | * [A Brief Survey of Parametric Value Function Approximation](http://www.cs.utexas.edu/~dana/MLClass/RL_VF.pdf) 48 | * [A Tutorial on Linear Function Approximators for Dynamic Programming and Reinforcement Learning](https://cs.brown.edu/people/stefie10/publications/geramifard13.pdf) 49 | * [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) 50 | * [Function Approximation via Tile Coding: Automating Parameter Choice](http://www.cs.utexas.edu/~ai-lab/pubs/SARA05.pdf) 51 | -------------------------------------------------------------------------------- /04-known-and-unknown-environments/README.md: -------------------------------------------------------------------------------- 1 | ### 4. Known and Unknown Environments 2 | 3 | #### 4.1 What if we don't have a model of the environment?
4 | 5 | One of the first things that will come to your head after reviewing the last tutorial is, but what's the point if 6 | we need to have the dynamics of the environment? What if it is such a complex environment that it is just too hard 7 | to model? Or better yet, what if we just don't know the environment? Can we still learn the best actions to take 8 | in order to maximize long-term rewards? 9 | 10 | And, the answer to that question is, of course, we can deal with unknown environments. Perhaps, this is the most exciting aspect of reinforcement learning; agents are capable of, through interaction only, learning the best sequence of actions 11 | to take to maximize long-term reward. 12 | 13 | #### 4.2 The need to explore 14 | 15 | The fact that we do not have a map of the environment puts us in need to explore it. Before we were given a map or as 16 | it is called in reinforcement learning, a model (MDP), but now, we are just dropped in the middle of a world with no 17 | other guidance than our own experiences. The need for exploration comes with a price. We can no longer take a perfect 18 | sequence of actions that maximize the long-term rewards from the very first time. Instead, we are now ensured to, at 19 | least, fail a couple of times trying to understand the environment and attempting to reach better goals each time. 20 | 21 | If you think about it, this puts us in a dilemma, as exploration has a cost associated with it, how much is it effective 22 | to pay for it such that the long term rewards are maximized. For example, think about a young person graduating from college 23 | at 20 and getting his/her first job. This person goes around in 3-4 different jobs early on, but later when he is 50 and 24 | has accumulated experience in a specific field, it might no longer be beneficial to do a career change. It could be much 25 | more effective to keep exploiting the experience he/she has gained. Even if there exists a possibility for higher reward on some 26 | other field. Potentially, given the time left in his/her career, the price of learning a new set of skills might not benefit 27 | the long term goals. 28 | 29 | #### 4.3 What to learn? 30 | 31 | There are two ways you could think of interacting with the world. At first, we could think of the value of taking actions in 32 | given states. For example, we calculate the expectation of taking action 'a' when on the state 's', then do the same for all 33 | possible state, action combinations. However, as we saw on previous lessons, if we had a model of the environment, we could 34 | determine the exact best value for each state. So, how about learning the model of the environment and then using some of the algorithms on previous lectures to help us guide our decision making? 35 | 36 | It turns out that these two ways are the most fundamental classes of algorithms in reinforcement learning. Model-Free methods 37 | are those algorithms that learn straight the action selection. These methods are incredibly useful as they are 38 | capable of learning best actions without any knowledge of the environment. However, they are very data hungry and it 39 | requires lots of samples to get good results. Practically speaking, we cannot just let a bipedal robot fall 1,000,000 times 40 | just for the sake of gaining experience. On the other side of the spectrum, Model-Based methods learn and use the model 41 | of the environment in order to improve the action selection especially early on. 
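As a rough sketch of what "learning the model" looks like in its simplest tabular form, the snippet below estimates transition probabilities and expected rewards from experience tuples; the experience itself is fabricated for illustration, and this is not the notebook's implementation.

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))
reward_sums = defaultdict(float)
visits = defaultdict(int)

# Made-up experience tuples: (state, action, reward, next_state).
experience = [(0, 'right', 0.0, 1), (0, 'right', 0.0, 1),
              (0, 'right', -1.0, 0), (1, 'right', 1.0, 2)]

for state, action, reward, next_state in experience:
    transition_counts[(state, action)][next_state] += 1
    reward_sums[(state, action)] += reward
    visits[(state, action)] += 1

def estimated_model(state, action):
    """Return (transition probabilities, expected reward) estimated so far."""
    n = visits[(state, action)]
    probabilities = {s2: count / n
                     for s2, count in transition_counts[(state, action)].items()}
    return probabilities, reward_sums[(state, action)] / n

print(estimated_model(0, 'right'))  # roughly ({1: 0.67, 0: 0.33}, -0.33)
```

Once these estimates are reasonable, the planning algorithms from the earlier lessons can be run on the learned model.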
Model-Based methods are much more data 42 | efficient and for this reason, they are utilized more frequently on problems involving hardware such as robotics. 43 | 44 | #### 4.4 What to do with what we learn? 45 | 46 | We saw before that we will have to interact with the environment in order to learn. This obvious way of learning is 47 | called "Online learning". In contrast, however, we could also collect the samples we get from our experience and 48 | use that to further evaluate our actions. Intuitively we can think of how humans learn. When we interact with our 49 | environment, we learn directly from our experiences with it, but also, after we have collected these experiences, 50 | we use our memory to think about it and learn so more of what happened, what we did and how we could improve the 51 | outcome if we are facing the same problem again. This way of learning is called "Offline learning", and it is also 52 | used in reinforcement learning. 53 | 54 | #### 4.5 Adding small randomness to your actions 55 | 56 | Finally, there is some other important point often seen in reinforcement learning. The fact that we learn a good 57 | policy does not imply that such policy should be always followed. What if there are some better actions we could 58 | have taken? How do we ensure we always keep an eye on yet a better policy? In reinforcement learning, there are 59 | two main classes of algorithms that address ways of learning while constantly striving for finding better policies. 60 | One way is called off-policy, and it basically means that the actions taken by the agent are not necessarily always 61 | those that we have determined as the best actions. We would then be updating the values of a policy as if we were 62 | taking the actions of that policy when in fact we selected the action from another policy. We can also see off-policy 63 | as having two different policies, one that determines the actions that we are selecting, and the other the one that 64 | we use to evaluate our action selection. Conversely, we also have on-policy learning in which we learn and act on 65 | top of the same policy. That is, we evaluate and follow the actions from the same policy. 66 | 67 | #### 4.6 Exercises 68 | 69 | In this lesson, we learned the difference between planning and reinforcement learning. We compared two styles 70 | of doing reinforcement learning, one in which we learn to behave without trying to understand the dynamics 71 | of the environment. And on the other hand, we learn to behave by simultaneously trying to learn the environment 72 | so that our learning could become more and more accurate each time. 73 | 74 | For this, we will look into a couple of algorithms for model-free reinforcement learning and we will also look 75 | at an algorithm that tries to learn the model with each observation and become much more efficient with every 76 | iteration. 77 | 78 | Lesson 4 Notebook. 
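For reference, here is a compact sketch of the model-free, off-policy style of learning described above: Q-learning with an epsilon-greedy behavior policy. The `env` argument is assumed to follow the classic OpenAI Gym interface (`reset()`, `step()`, `action_space.n`) on a small discrete task such as `FrozenLake-v0`; treat this as an illustrative sketch rather than the notebook's exact implementation.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    q = defaultdict(float)                      # q[(state, action)]
    actions = list(range(env.action_space.n))
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # behavior policy: mostly greedy, sometimes random (exploration)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # off-policy target: the greedy value of the next state, no matter
            # which action the behavior policy ends up taking there
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

# Example usage (assuming this Gym environment id exists in your installation):
# import gym; q = q_learning(gym.make('FrozenLake-v0'))
```

The update bootstraps from the greedy value of the next state even though the behavior policy sometimes acts randomly, which is what makes it off-policy in the sense of section 4.5.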
79 | 80 | #### 4.7 Further Reading 81 | 82 | * [Reinforcement Learning: A Survey](https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kaelbling.pdf) 83 | * [Algorithms for Sequential Decision Making](https://www.cs.rutgers.edu/~mlittman/papers/thesis-with-gammas.pdf) 84 | * [Shaping and policy search in Reinforcement learning](http://www.cs.ubc.ca/~nando/550-2006/handouts/andrew-ng.pdf) 85 | * [Dynamic Programming and Optimal Control](http://web.mit.edu/dimitrib/www/dpchapter.pdf) 86 | -------------------------------------------------------------------------------- /paper/rldmsubmit.sty: -------------------------------------------------------------------------------- 1 | %%%% RLDM Macros (LaTex) 2 | %%%% Style File 3 | %%%% March 2013 4 | %%%% This has been purloined almost wholesale from nips10submit_e 5 | %%%% There are minor changes for RLDM purposes 6 | 7 | % This file can be used with Latex2e whether running in main mode, or 8 | % 2.09 compatibility mode. 9 | % 10 | % If using main mode, you need to include the commands 11 | % \documentclass{article} 12 | % \usepackage{rldmsubmit,times} 13 | % as the first lines in your document. Or, if you do not have Times 14 | % Roman font available, you can just use 15 | % \documentclass{article} 16 | % \usepackage{rldmsubmit} 17 | % instead. 18 | % 19 | 20 | % Change the overall width of the page. If these parameters are 21 | % changed, they will require corresponding changes in the 22 | % maketitle section. 23 | % 24 | \usepackage{eso-pic} % used by \AddToShipoutPicture 25 | 26 | \renewcommand{\topfraction}{0.95} % let figure take up nearly whole page 27 | \renewcommand{\textfraction}{0.05} % let figure take up nearly whole page 28 | 29 | % Define rldmfinal, set to true if rldmfinalcopy is defined 30 | \newif\ifrldmfinal 31 | \rldmfinaltrue 32 | \def\rldmfinalcopy{\rldmfinaltrue} 33 | \font\rldmtenhv = phvb at 8pt % *** IF THIS FAILS, SEE rldm10submit_e.sty *** 34 | 35 | % Specify the dimensions of each page 36 | 37 | %\setlength{\paperheight}{11in} 38 | %\setlength{\paperwidth}{8.5in} 39 | 40 | \newlength{\rldmFPmargin} 41 | \setlength{\rldmFPmargin}{1.5cm} 42 | \setlength{\headheight}{0pt} 43 | \setlength{\headsep}{0pt} 44 | 45 | \setlength{\textwidth}{\paperwidth} 46 | \addtolength{\textwidth}{-2\rldmFPmargin} 47 | \setlength{\oddsidemargin}{\rldmFPmargin} 48 | \addtolength{\oddsidemargin}{-1in} 49 | \setlength{\evensidemargin}{\oddsidemargin} 50 | \setlength{\textheight}{\paperheight} 51 | \addtolength{\textheight}{-\headheight} 52 | \addtolength{\textheight}{-\headsep} 53 | \addtolength{\textheight}{-\footskip} 54 | \addtolength{\textheight}{-2\rldmFPmargin} 55 | \setlength{\topmargin}{\rldmFPmargin} 56 | \addtolength{\topmargin}{-1in} 57 | 58 | %\textheight 23 true cm % Height of text (including footnotes & figures) 59 | %\textwidth 14 true cm % Width of text line. 60 | \widowpenalty=10000 61 | \clubpenalty=10000 62 | 63 | \thispagestyle{empty} %\pagestyle{empty} 64 | \flushbottom \sloppy 65 | 66 | % We're never going to need a table of contents, so just flush it to 67 | % save space --- suggested by drstrip@sandia-2 68 | \def\addcontentsline#1#2#3{} 69 | 70 | % Title stuff, taken from deproc. 
71 | \def\maketitle{\par 72 | \begingroup 73 | \def\thefootnote{\fnsymbol{footnote}} 74 | \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}} % for perfect author 75 | % name centering 76 | % The footnote-mark was overlapping the footnote-text, 77 | % added the following to fix this problem (MK) 78 | \long\def\@makefntext##1{\parindent 1em\noindent 79 | \hbox to1.8em{\hss $\m@th ^{\@thefnmark}$}##1} 80 | \@maketitle \@thanks 81 | \endgroup 82 | \setcounter{footnote}{0} 83 | \let\maketitle\relax \let\@maketitle\relax 84 | \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax} 85 | 86 | % The toptitlebar has been raised to top-justify the first page 87 | 88 | % Title (includes both anonimized and non-anonimized versions) 89 | \def\@maketitle{\vbox{\hsize\textwidth 90 | \linewidth\hsize \vskip 0.1in \toptitlebar \centering 91 | {\LARGE\bf \@title\par} \bottomtitlebar % \vskip 0.1in % minus 92 | \ifrldmfinal 93 | \def\And{\end{tabular}\hfil\linebreak[0]\hfil 94 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\ignorespaces}% 95 | \def\AND{\end{tabular}\hfil\linebreak[4]\hfil 96 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\ignorespaces}% 97 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\@author\end{tabular}% 98 | \else 99 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt} 100 | Anonymous Author(s) \\ 101 | Affiliation \\ 102 | Address \\ 103 | \texttt{email} \\ 104 | \end{tabular}% 105 | \fi 106 | \vskip 0.3in minus 0.1in}} 107 | 108 | \newcommand\startmain{\newpage\setcounter{page}{1}\par} 109 | \newcommand\startopt{\newpage\centerline{\Large \bf Supplementary Material}} 110 | 111 | \def\keywords#1{\vskip.2in\begin{minipage}[t]{1.4in}% 112 | {\bf Keywords:}\end{minipage}\begin{minipage}[t]{4in}#1\end{minipage}} 113 | 114 | \def\acknowledgements#1{\vskip.2in\subsubsection*{Acknowledgements}#1} 115 | 116 | \def\repository#1{\vskip.2in\begin{minipage}[t]{1.4in}% 117 | {\bf Repository:}\end{minipage}\begin{minipage}[t]{5in}\url{#1}\end{minipage}} 118 | 119 | \def\spresentation#1{\vskip.2in\begin{minipage}[t]{1.4in}% 120 | {\bf Short Presentation:}\end{minipage}\begin{minipage}[t]{5in}\url{#1}\end{minipage}} 121 | 122 | \def\lpresentation#1{\vskip.2in\begin{minipage}[t]{1.4in}% 123 | {\bf Long Presentation:}\end{minipage}\begin{minipage}[t]{5in}\url{#1}\end{minipage}} 124 | 125 | \renewenvironment{abstract}{\vskip.075in\centerline{\large\bf 126 | Abstract}\vspace{0.5ex}}{\par} 127 | 128 | % sections with less space 129 | \def\section{\@startsection {section}{1}{\z@}{-2.0ex plus 130 | -0.5ex minus -.2ex}{1.5ex plus 0.3ex 131 | minus0.2ex}{\large\bf\raggedright}} 132 | 133 | \def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus 134 | -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}} 135 | \def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex 136 | plus -0.5ex minus -.2ex}{0.5ex plus 137 | .2ex}{\normalsize\bf\raggedright}} 138 | \def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus 139 | 0.5ex minus .2ex}{-1em}{\normalsize\bf}} 140 | \def\subparagraph{\@startsection{subparagraph}{5}{\z@}{1.5ex plus 141 | 0.5ex minus .2ex}{-1em}{\normalsize\bf}} 142 | \def\subsubsubsection{\vskip 143 | 5pt{\noindent\normalsize\rm\raggedright}} 144 | 145 | 146 | % Footnotes 147 | \footnotesep 6.65pt % 148 | \skip\footins 9pt plus 4pt minus 2pt 149 | \def\footnoterule{\kern-3pt \hrule width 12pc \kern 2.6pt } 150 | \setcounter{footnote}{0} 151 | 152 | % Lists and paragraphs 153 | \parindent 0pt 154 | \topsep 4pt plus 1pt minus 2pt 155 | \partopsep 1pt plus 0.5pt minus 0.5pt 156 | \itemsep 
2pt plus 1pt minus 0.5pt 157 | \parsep 2pt plus 1pt minus 0.5pt 158 | \parskip .5pc 159 | 160 | 161 | %\leftmargin2em 162 | \leftmargin3pc 163 | \leftmargini\leftmargin \leftmarginii 2em 164 | \leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em 165 | 166 | %\labelsep \labelsep 5pt 167 | 168 | \def\@listi{\leftmargin\leftmargini} 169 | \def\@listii{\leftmargin\leftmarginii 170 | \labelwidth\leftmarginii\advance\labelwidth-\labelsep 171 | \topsep 2pt plus 1pt minus 0.5pt 172 | \parsep 1pt plus 0.5pt minus 0.5pt 173 | \itemsep \parsep} 174 | \def\@listiii{\leftmargin\leftmarginiii 175 | \labelwidth\leftmarginiii\advance\labelwidth-\labelsep 176 | \topsep 1pt plus 0.5pt minus 0.5pt 177 | \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt 178 | \itemsep \topsep} 179 | \def\@listiv{\leftmargin\leftmarginiv 180 | \labelwidth\leftmarginiv\advance\labelwidth-\labelsep} 181 | \def\@listv{\leftmargin\leftmarginv 182 | \labelwidth\leftmarginv\advance\labelwidth-\labelsep} 183 | \def\@listvi{\leftmargin\leftmarginvi 184 | \labelwidth\leftmarginvi\advance\labelwidth-\labelsep} 185 | 186 | \abovedisplayskip 7pt plus2pt minus5pt% 187 | \belowdisplayskip \abovedisplayskip 188 | \abovedisplayshortskip 0pt plus3pt% 189 | \belowdisplayshortskip 4pt plus3pt minus3pt% 190 | 191 | % Less leading in most fonts (due to the narrow columns) 192 | % The choices were between 1-pt and 1.5-pt leading 193 | %\def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} % got rid of @ (MK) 194 | \def\normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} 195 | \def\small{\@setsize\small{10pt}\ixpt\@ixpt} 196 | \def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt} 197 | \def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt} 198 | \def\tiny{\@setsize\tiny{7pt}\vipt\@vipt} 199 | \def\large{\@setsize\large{14pt}\xiipt\@xiipt} 200 | \def\Large{\@setsize\Large{16pt}\xivpt\@xivpt} 201 | \def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt} 202 | \def\huge{\@setsize\huge{23pt}\xxpt\@xxpt} 203 | \def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt} 204 | 205 | \def\toptitlebar{\hrule height4pt\vskip .6cm\vskip-\parskip} 206 | 207 | \def\bottomtitlebar{\vskip .7cm\vskip-\parskip\hrule height1pt\vskip 208 | .09in} % 209 | %Reduced second vskip to compensate for adding the strut in \@author 210 | 211 | % Vertical Ruler 212 | % This code is, largely, from the CVPR 2010 conference style file 213 | % ----- define vruler 214 | \makeatletter 215 | \newbox\rldmrulerbox 216 | \newcount\rldmrulercount 217 | \newdimen\rldmruleroffset 218 | \newdimen\cv@lineheight 219 | \newdimen\cv@boxheight 220 | \newbox\cv@tmpbox 221 | \newcount\cv@refno 222 | \newcount\cv@tot 223 | % NUMBER with left flushed zeros \fillzeros[] 224 | \newcount\cv@tmpc@ \newcount\cv@tmpc 225 | \def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi 226 | \cv@tmpc=1 % 227 | \loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi 228 | \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat 229 | \ifnum#2<0\advance\cv@tmpc1\relax-\fi 230 | \loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat 231 | \cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}% 232 | % \makevruler[][][][][] 233 | \def\makevruler[#1][#2][#3][#4][#5]{\begingroup\offinterlineskip 234 | \textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt% 235 | \global\setbox\rldmrulerbox=\vbox to \textheight{% 236 | {\parskip=0pt\hfuzz=150em\cv@boxheight=\textheight 237 | \cv@lineheight=#1\global\rldmrulercount=#2% 
238 | \cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2% 239 | \cv@refno1\vskip-\cv@lineheight\vskip1ex% 240 | \loop\setbox\cv@tmpbox=\hbox to0cm{{\rldmtenhv\hfil\fillzeros[#4]\rldmrulercount}}% 241 | \ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break 242 | \advance\cv@refno1\global\advance\rldmrulercount#3\relax 243 | \ifnum\cv@refno<\cv@tot\repeat}}\endgroup}% 244 | \makeatother 245 | % ----- end of vruler 246 | 247 | % \makevruler[][][][][] 248 | \def\rldmruler#1{\makevruler[12pt][#1][1][3][0.993\textheight]\usebox{\rldmrulerbox}} 249 | \AddToShipoutPicture{% 250 | \ifrldmfinal\else 251 | \rldmruleroffset=\textheight 252 | \advance\rldmruleroffset by -3.7pt 253 | \color[rgb]{.7,.7,.7} 254 | \AtTextUpperLeft{% 255 | \put(\LenToUnit{-35pt},\LenToUnit{-\rldmruleroffset}){%left ruler 256 | \rldmruler{\rldmrulercount}} 257 | } 258 | \fi 259 | } 260 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Applied Reinforcement Learning 2 | 3 | I've been studying reinforcement learning and decision-making for a couple of years now. 4 | One of the most difficult things that I've encountered is not necessarily related to 5 | the concepts but how these concepts have been explained. To me, learning occurs when one 6 | is able to make a connection with the concepts being taught. For this, often an intuitive 7 | explanation is required, and likely a hands-on approach helps build that kind of 8 | understanding. 9 | 10 | My goal for this repository is to create, with the community, a resource that would help 11 | newcomers understand reinforcement learning in an intuitive way. Consider what you see here 12 | my initial attempt to teach some of these concepts as plain and simple as I can possibly 13 | explain them. 14 | 15 | If you'd like to collaborate, whether a typo, or an entire addition to the text, maybe a fix 16 | to a notebook or a whole new notebook, please feel free to send your issue and/or pull 17 | request to make things better. As long as your pull request aligns with the goal of the 18 | repository, it is very likely we will merge. I'm not the best teacher, or reinforcement 19 | learning researcher, but I do believe we can make reinforcement learning and decision-making 20 | easy for anyone to understand. Well, at least easier. 21 | 22 | Table of Contents 23 | ================= 24 | 25 | * [Notebooks Installation](#notebooks-installation) 26 | * [Install git](#install-git) 27 | * [Install Docker](#install-docker) 28 | * [Run Notebooks](#run-notebooks) 29 | * [TL;DR version](#tldr-version) 30 | * [A little more detailed version:](#a-little-more-detailed-version) 31 | * [Open the Notebooks in your browser:](#open-the-notebooks-in-your-browser) 32 | * [Open TensorBoard at the following address:](#open-tensorboard-at-the-following-address) 33 | * [Docker Tips](#docker-tips) 34 | * [Part I: Introduction](01-introduction-to-decision-making/README.md#part-i-introduction) 35 | * [1. Introduction to Decision-Making](01-introduction-to-decision-making/README.md#1-introduction-to-decision-making) 36 | * [1.1 Decision-Making](01-introduction-to-decision-making/README.md#11-decision-making) 37 | * [1.2 Further Reading](01-introduction-to-decision-making/README.md#12-further-reading) 38 | * [Part II: Reinforcement Learning and Decision-Making](02-sequential-decisions/README.md#part-ii-reinforcement-learning-and-decision-making) 39 | * [2. 
Sequential Decisions](02-sequential-decisions/README.md#2-sequential-decisions) 40 | * [2.1 Modeling Decision-Making Problems](02-sequential-decisions/README.md#21-modeling-decision-making-problems) 41 | * [2.2 Solutions Representation](02-sequential-decisions/README.md#22-solutions-representation) 42 | * [2.3 Simple Sequential Problem](02-sequential-decisions/README.md#23-simple-sequential-problem) 43 | * [2.4 Slightly more complex problems](02-sequential-decisions/README.md#24-slightly-more-complex-problems) 44 | * [2.5 Evaluating solutions](02-sequential-decisions/README.md#25-evaluating-solutions) 45 | * [2.6 Improving on solutions](02-sequential-decisions/README.md#26-improving-on-solutions) 46 | * [2.7 Finding Optimal solutions](02-sequential-decisions/README.md#27-finding-optimal-solutions) 47 | * [2.8 Improving on Policy Iteration](02-sequential-decisions/README.md#28-improving-on-policy-iteration) 48 | * [2.9 Exercises](02-sequential-decisions/README.md#29-exercises) 49 | * [2.10 Further Reading](02-sequential-decisions/README.md#210-further-reading) 50 | * [3. Deterministic and Stochastic Actions](03-deterministic-and-stochastic-actions/README.md#3-deterministic-and-stochastic-actions) 51 | * [3.1 We can't perfectly control the world](03-deterministic-and-stochastic-actions/README.md#31-we-cant-perfectly-control-the-world) 52 | * [3.2 Dealing with stochasticity](03-deterministic-and-stochastic-actions/README.md#32-dealing-with-stochasticity) 53 | * [3.3 Exercises](03-deterministic-and-stochastic-actions/README.md#33-exercises) 54 | * [3.4 Further Reading](03-deterministic-and-stochastic-actions/README.md#34-further-reading) 55 | * [4. Known and Unknown Environments](04-known-and-unknown-environments/README.md#4-known-and-unknown-environments) 56 | * [4.1 What if we don't have a model of the environment?](04-known-and-unknown-environments/README.md#41-what-if-we-dont-have-a-model-of-the-environment) 57 | * [4.2 The need to explore](04-known-and-unknown-environments/README.md#42-the-need-to-explore) 58 | * [4.3 What to learn?](04-known-and-unknown-environments/README.md#43-what-to-learn) 59 | * [4.4 What to do with what we learn?](04-known-and-unknown-environments/README.md#44-what-to-do-with-what-we-learn) 60 | * [4.5 Adding small randomness to your actions](04-known-and-unknown-environments/README.md#45-adding-small-randomness-to-your-actions) 61 | * [4.6 Exercises](04-known-and-unknown-environments/README.md#46-exercises) 62 | * [4.7 Further Reading](04-known-and-unknown-environments/README.md#47-further-reading) 63 | * [Part III: Decision-Making in Hard Problems](05-discrete-and-continuous-states/README.md#part-iii-decision-making-in-hard-problems) 64 | * [5. Discrete and Continuous States](05-discrete-and-continuous-states/README.md#5-discrete-and-continuous-states) 65 | * [5.1 Too large to hold in memory](05-discrete-and-continuous-states/README.md#51-too-large-to-hold-in-memory) 66 | * [5.2 Discretization of state space](05-discrete-and-continuous-states/README.md#52-discretization-of-state-space) 67 | * [5.3 Use of function approximation](05-discrete-and-continuous-states/README.md#53-use-of-function-approximation) 68 | * [5.4 Exercises](05-discrete-and-continuous-states/README.md#54-exercises) 69 | * [5.5 Further Reading](05-discrete-and-continuous-states/README.md#55-further-reading) 70 | * [6. 
Discrete and Continuous Actions](06-discrete-and-continuous-actions/README.md#6-discrete-and-continuous-actions) 71 | * [6.1 Continuous action space](06-discrete-and-continuous-actions/README.md#61-continuous-action-space) 72 | * [6.2 Discretizition of action space](06-discrete-and-continuous-actions/README.md#62-discretizition-of-action-space) 73 | * [6.3 Use of function approximation](06-discrete-and-continuous-actions/README.md#63-use-of-function-approximation) 74 | * [6.4 Searching for the policy](06-discrete-and-continuous-actions/README.md#64-searching-for-the-policy) 75 | * [6.5 Exercises](06-discrete-and-continuous-actions/README.md#65-exercises) 76 | * [6.6 Further Reading](06-discrete-and-continuous-actions/README.md#66-further-reading) 77 | * [7. Observable and Partially-Observable States](07-observable-and-partially-observable-states/README.md#7-observable-and-partially-observable-states) 78 | * [7.1 Is what we see what it is?](07-observable-and-partially-observable-states/README.md#71-is-what-we-see-what-it-is) 79 | * [7.2 State Estimation](07-observable-and-partially-observable-states/README.md#72-state-estimation) 80 | * [7.3 Control in Partially-Observable Environments](07-observable-and-partially-observable-states/README.md#73-control-in-partially-observable-environments) 81 | * [7.4 Further Reading](07-observable-and-partially-observable-states/README.md#74-further-reading) 82 | * [Part IV: Multiple Decision-Making Agents](08-single-and-multiple-agents/README.md#part-iv-multiple-decision-making-agents) 83 | * [8. Single and Multiple Agents](08-single-and-multiple-agents/README.md#8-single-and-multiple-agents) 84 | * [8.1 Agents with same objectives](08-single-and-multiple-agents/README.md#81-agents-with-same-objectives) 85 | * [8.2 What when other agents are at play?](08-single-and-multiple-agents/README.md#82-what-when-other-agents-are-at-play) 86 | * [8.3 Further Reading](08-single-and-multiple-agents/README.md#83-further-reading) 87 | * [9. Cooperative and Adversarial Agents](09-cooperative-and-adversarial-agents/README.md#9-cooperative-and-adversarial-agents) 88 | * [9.1 Agents with conflicting objectives](09-cooperative-and-adversarial-agents/README.md#91-agents-with-conflicting-objectives) 89 | * [9.2 Teams of agents with conflicting objectives](09-cooperative-and-adversarial-agents/README.md#92-teams-of-agents-with-conflicting-objectives) 90 | * [9.3 Further Reading](09-cooperative-and-adversarial-agents/README.md#93-further-reading) 91 | * [Part V: Human Decision-Making and Beyond](10-decision-making-and-humans/README.md#part-v-human-decision-making-and-beyond) 92 | * [10. Decision-Making and Humans](10-decision-making-and-humans/README.md#10-decision-making-and-humans) 93 | * [10.1 Similarities between methods discussed and humans](10-decision-making-and-humans/README.md#101-similarities-between-methods-discussed-and-humans) 94 | * [10.2 Differences between methods discussed and humans](10-decision-making-and-humans/README.md#102-differences-between-methods-discussed-and-humans) 95 | * [10.3 Further Reading](10-decision-making-and-humans/README.md#103-further-reading) 96 | * [11. Conclusion](11-conclusion/README.md#11-conclusion) 97 | * [12. Recommended Books](12-recommended-books/README.md#12-recommended-books) 98 | * [12. Recommended Courses](13-recommended-courses/README.md#13-recommended-courses) 99 | 100 | 101 | 102 | # Notebooks Installation 103 | 104 | This repository contains Jupyter Notebooks to follow along with the lectures. 
However, there are several 105 | packages and applications that need to be installed. To make things easier on you, I took a little longer 106 | time to setup a reproducible environment that you can use to follow along. 107 | 108 | ## Install git 109 | 110 | Follow the instructions at (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) 111 | 112 | ## Install Docker 113 | 114 | Follow the instructions at (https://docs.docker.com/engine/getstarted/step_one/#step-2-install-docker) 115 | 116 | ## Run Notebooks 117 | 118 | ### TL;DR version 119 | 120 | 1. `git clone git@github.com:mimoralea/applied-reinforcement-learning.git && cd applied-reinforcement-learning` 121 | 2. `docker pull mimoralea/openai-gym:v1` 122 | 3. `docker run -it --rm -p 8888:8888 -p 6006:6006 -v $PWD/notebooks/:/mnt/notebooks/ mimoralea/openai-gym:v1` 123 | 124 | ### A little more detailed version: 125 | 126 | 1. Clone the repository to a desired location (E.g. `git clone git@github.com:mimoralea/applied-reinforcement-learning.git ~/Projects/applied-reinforcement-learning`) 127 | 2. Enter into the repository directory (E.g. `cd ~/Projects/applied-reinforcement-learning`) 128 | 3. Either Build yourself or Pull the already built Docker container: 129 | 3.1. To build it use the following command: `docker build -t mimoralea/openai-gym:v1 .` 130 | 3.2. To pull it from Docker hub use: `docker pull mimoralea/openai-gym:v1` 131 | 4. Run the container: `docker run -it --rm -p 8888:8888 -p 6006:6006 -v $PWD/notebooks/:/mnt/notebooks/ mimoralea/openai-gym:v1` 132 | 133 | #### Open the Notebooks in your browser: 134 | 135 | * `http://localhost:8888` (or follow the link that came out of the run command about which will include the token) 136 | 137 | #### Open TensorBoard at the following address: 138 | 139 | * `http://localhost:6006` 140 | 141 | This will help you visualize the Neural Network in the lessons with function approximation. 142 | 143 | ## Docker Tips 144 | 145 | * If you'd like to access a bash session of a running container do: 146 | ** `docker ps` # will show you currently running containers -- note the id of the container you are trying to access 147 | ** `docker exec --user root -it c3fbc82f1b49 /bin/bash` # in this case c3fbc82f1b49 is the id 148 | * If you'd like to start a new container instance straight into bash (without running Jupyter or TensorBoard) 149 | ** `docker run -it --rm mimoralea/openai-gym:v1 /bin/bash` # this will run the bash session as the Notebook user 150 | ** `docker run --user root -e GRANT_SUDO=yes -it --rm mimoralea/openai-gym:v1 /bin/bash` # this will run the bash session as root 151 | 152 | 153 | -------------------------------------------------------------------------------- /02-sequential-decisions/README.md: -------------------------------------------------------------------------------- 1 | ## Part II: Reinforcement Learning and Decision-Making 2 | 3 | ### 2. Sequential Decisions 4 | 5 | As mentioned before, Reinforcement Learning introduces the notion of sequential decision-making. This 6 | idea of making a series of decisions forces the agent to take into account future sequences of actions, 7 | states, and rewards. In this lesson, we will explore some of the most fundamental aspects of sequential 8 | decision making. 9 | 10 | #### 2.1 Modeling Decision-Making Problems 11 | 12 | In order to attempt solving a problem, we must be able to represent it in a form that abstracts it 13 | allowing us to work on it. 
For decision-making problems, we can think of a few aspects that are 14 | common to all problems. 15 | 16 | First, we need to be able to receive percepts of the world; that is, the agent needs to be able to sense 17 | its environment. The input we get from the environment could directly represent the 18 | true state of the world. However, this is not always the case. For example, if we are creating a stock trading 19 | bot, we can think of the current stock price as part of the current state of the world. However, anyone 20 | who has purchased stocks knows that the sell and buy prices are mere estimates of the price the 21 | stock will actually trade at, and for some transactions those estimates are not entirely accurate. Another example in which this 22 | issue is much easier to see is robotics. GPS localization, for instance, is only accurate 23 | to within a few meters. That amount of sensor noise could be the difference between an autonomous 24 | car driving safely and getting into an accident with the car in the next lane. The point is that, when we model 25 | the real world, we need to account for the fact that the things we "see" are not 26 | necessarily the things that "are". This distinction will come up again later; for now, we can assume that we live 27 | in a perfect world and that our percepts are a true representation of the state of the world. Another 28 | important fact to clarify is that the representation of a state must include all necessary history within the state itself. 29 | In other words, states should be represented as memory-less. This is known as the Markov property, and it is 30 | a fundamental assumption behind the kinds of decision-making problems we will be exploring in these lessons. 31 | 32 | Second, all decision-making problems have available actions. For the stock bot, we can think 33 | of a few actions: sell, buy, and hold. We could also add some special actions such as limit sell, limit buy, 34 | options, etc. A robot's actions could be to apply a given voltage to a given actuator for a given amount of time. Just as 35 | we clarified that a percept may not exactly represent the state of the world, actions might 36 | not turn out with the same outcome every time they are taken. That is, actions are not necessarily deterministic. 37 | For the stock agent example, we can think of the small probability that sending a buy request to the server 38 | returns with a communication error. That is, the probability of actually executing the action we selected 39 | could be 99.9%, but there is still a small chance that the action doesn't go through as we intended. This 40 | stochasticity is represented by transition functions. These functions give the probability of each possible transition 41 | when taking an action in a given state. The transition probabilities for a given state-action pair must sum to 1. 42 | One thing we need to make clear, however, is that these probabilities must always stay the same. That is, we might not know 43 | the exact probability of transitioning to a new state given the current state and action, but whatever that value is, it remains the 44 | same. In other words, the model of the world must be stationary. 45 | 46 | Third, we also have to introduce a feedback signal so that we can evaluate our decision-making abilities. 47 | Many fields other than Reinforcement Learning represent this feedback as a cost signal; Reinforcement Learning 48 | refers to these signals as rewards.
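To make the transition function just described a bit more concrete, here is a minimal sketch of one way it could be stored in code. Everything in it is a made-up example used purely for illustration (the state names, actions, and probabilities are not part of the lessons or notebooks); the point is only that each state-action pair maps to outcomes whose probabilities sum to 1, and that those numbers never change, which is the stationarity assumption mentioned above.

```python
# Hypothetical transition function for a tiny stochastic world.
# T[state][action] is a list of (probability, next_state) pairs,
# and the probabilities for each state-action pair must sum to 1.
T = {
    'standing': {
        'walk': [(0.9, 'moved'), (0.1, 'standing')],  # walking usually works
        'run':  [(0.6, 'moved'), (0.4, 'fallen')],    # running is riskier
    },
    'fallen': {
        'stand-up': [(1.0, 'standing')],              # a deterministic action
    },
}

def check_stationary_model(transitions):
    """Sanity-check that every state-action pair defines a proper distribution."""
    for state, actions in transitions.items():
        for action, outcomes in actions.items():
            total = sum(prob for prob, _ in outcomes)
            assert abs(total - 1.0) < 1e-9, (state, action, total)

check_stationary_model(T)
```

With states, actions, and transitions in place, the only missing ingredient is the feedback signal discussed next.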
On our trading agent, the reward could simply be the profit or loss made 49 | from a single transaction, or perhaps we could make our reward signal the difference in total assets before and 50 | after making a transaction. In a robotic task, the reward could be slightly more complex. For example, we could 51 | design an agent that gets a positive reward for every step it stays upright while walking. Or maybe it gets a reward signal only after 52 | a specific task is accomplished. The important point is that the reward will ultimately have a big influence 53 | on how our agent performs. As we can see, rewards are part of the environment. However, oftentimes we have to design 54 | these reward signals ourselves. Ideally, we are able to identify a natural signal that we are interested in maximizing. 55 | 56 | The model representation described above is widely known as a Markov Decision Process (MDP). An MDP is a framework 57 | for modeling sequential decision-making problems. An MDP is composed of a tuple (S, A, R, T) in which S is the set 58 | of states, A is the set of actions, R is the reward function mapping state-action pairs to a numeric value, and T is 59 | the transition function mapping each state-action pair to a probability distribution over next states. 60 | 61 | We will be using MDPs moving forward, though it is important to mention that MDPs have lots of variants; 62 | Dec-MDPs, POMDPs, QMDPs, AMDPs, MC-POMDPs, Dec-POMDPs, ND-POMDPs, and MMDPs are some of the most common ones. They all 63 | represent some type of problem related to MDPs. We will be loosening the constraints MDPs impose and generalizing 64 | the representation of decision-making problems as we go. 65 | 66 | #### 2.2 Solutions Representation 67 | 68 | Now that we have a framework to represent decision-making problems, we need to devise a way of communicating 69 | possible solutions to these problems. The first word that comes to mind when thinking about solutions to 70 | decision-making problems is "plan". A plan can be seen as a sequence of steps to accomplish a goal. This is 71 | great but probably too simplistic. Mike Tyson once said, "Everyone has a plan 'till they get punched in the 72 | mouth." And it is true: we need something more adaptive than a simple plan. The next step, then, is to think 73 | of a plan and create conditions that help us deal with the uncertainty of the environment. This type of planning 74 | is known as conditional planning, which is basically a regular plan in which we prepare in advance for the 75 | contingencies that may arise. However, if we expand this a bit further, we can think of a conditional plan 76 | that takes into account every single possible contingency, even those we haven't thought of. This is called a universal 77 | plan or, better yet, a policy. In Reinforcement Learning, a policy is a function mapping states to actions, and it 78 | represents a solution to an MDP. The algorithms that we will be discussing later will directly or indirectly produce 79 | the best possible policy, also called the optimal policy. This is important to understand and remember. 80 | 81 | #### 2.3 Simple Sequential Problem 82 | 83 | Given all of the information above, let's review the simplest problem we can think of: a casino with 2 slot machines. To illustrate some important points better, imagine you enter 84 | the slot machine area after paying a flat fee. However, you are only allowed to play 100 trials on either of the 2 machines.
Also, the machines pay either $1 or nothing on each pull, according to an underlying, fixed, and unknown 86 | probability. The Reinforcement Learning problem then becomes: how can you maximize the amount of money you 87 | get out of those 100 pulls? Should you pick one arm and stick to it for all 100 trials? Should you instead alternate, pulling 1 88 | and 1? Should you pull 50 and 50? In other words, what is the best strategy or policy for maximizing all 89 | future rewards? 90 | 91 | The difficulty of this problem, also known as the k-Armed Bandit (in this case, k=2), is that you need to 92 | simultaneously acquire knowledge of the environment and, at the same time, exploit the knowledge you 93 | have already acquired. This fundamental trade-off between exploration and exploitation is what makes 94 | decision-making problems hard. You might believe that a particular arm has a fairly high payoff probability; 95 | should you then choose it every time? Should you instead choose an arm you know less well in order to gain information 96 | about its payoff? Or should you pick one you already have good information about, on the chance that knowing a little more 97 | would improve your knowledge of the environment? 98 | 99 | The answers to the questions posed above depend on several factors. For example, if instead of allowing you 100 | 100 trials I give you only 3, how would your strategy change? Moreover, if I give you an infinite number of trials, 101 | then you really want to invest time in learning the environment, even if doing so gives you sub-optimal results early on. 102 | The knowledge that you gain from the initial exploration will ensure you maximize the expected future rewards over the long 103 | term. 104 | 105 | 106 | #### 2.4 Slightly more complex problems 107 | 108 | When explaining reinforcement learning, it is very common to use a very basic grid world to illustrate fundamental 109 | concepts. Let's think of a grid world where the agent starts at 'S'. Reaching the space marked with a 'G' ends the game and gives the agent a reward of 1. Reaching the space with an 'F' ends the game and gives the agent 110 | a reward of -1. The agent is able to select one of 4 actions every time: (N, S, E, W). Each action has exactly the effect we expect: for example, N moves the agent one cell up, and E moves it to the cell on the right. The exceptions are when the agent attempts to enter a space marked with an 'X', which is a wall and cannot be entered, or tries to move off the grid, for example moving left from the leftmost cell; in those cases the agent simply bounces back to the cell it took the action from. 111 | 112 | 113 | #### 2.5 Evaluating solutions 114 | 115 | Before we begin exploring how to get the best solution to this problem, I'd like us to take a detour into how we can 116 | tell how good a solution is. For example, we can imagine a solution given by a series of arrows representing 117 | the different actions to be taken at each cell. 118 | 119 | Is there a way we can put a number to this policy so we can later rank it? 120 | 121 | If we need to use a single number, I think we could all agree that the value of the policy can be defined 122 | as the sum of all rewards that we would get starting in state 'S' and following the policy. The algorithm that computes this value is 123 | called policy evaluation. 124 | 125 | One thing you might be wondering after reading the previous paragraph is: what happens if one policy gives 126 | lots of rewards early on and nothing later, while another policy gives no rewards early on but lots of rewards later?
Is there a way we can account for our preference for early rewards? The answer is yes. So, instead of using the 128 | sum of all rewards as mentioned before, we will use the sum of discounted rewards, in which the reward received at time 129 | `t` is discounted by a factor, let's call it gamma, raised to the power `t`. And so we get that policy evaluation basically 130 | calculates the following equation for all states: 131 | 132 | ``` 133 | Vpi(s) = Epi{ r_{t+1} + g*r_{t+2} + g^2*r_{t+3} + ... | S_t = s } 134 | ``` 135 | 136 | So, we are basically finding the value we would get from each of the states if we followed this policy. 137 | Fair enough. Let's not dwell on the equations here; check the "Further Reading" sections for the details. 138 | 139 | #### 2.6 Improving on solutions 140 | 141 | Now that we know how to come up with a single value for a given policy, the natural question is: how 142 | do we improve on a policy? If we can devise a way to improve a policy, and we already know how to evaluate one, we should be able 143 | to iterate between evaluation and improvement and reach the best policy starting from any random policy. And that 144 | would be very useful. 145 | 146 | The core of the question is whether there is an action, different from the one suggested by the 147 | policy, that would make the value calculated above larger. How about we temporarily select a different action than 148 | the one suggested by the policy and then follow the policy as originally suggested afterwards? This way we isolate the 149 | effect of that single action on the value of the policy. This is actually the basis for an algorithm called policy improvement. 150 | 151 | #### 2.7 Finding Optimal solutions 152 | 153 | One of the powerful facts about policy improvement is that this way of finding better policies from 154 | a given policy guarantees that the policy returned will be at least as good as the one we started with, and often better. 155 | This allows us to think of an algorithm that uses policy evaluation to get the value of a policy and then 156 | policy improvement to try to improve that policy, repeating the two steps until the improvement step can no longer produce a better policy, at which point we 157 | stop. This algorithm is called policy iteration. 158 | 159 | #### 2.8 Improving on Policy Iteration 160 | 161 | Policy iteration is great because it guarantees that we will get the very best policy available for a given 162 | MDP. However, sometimes it takes an unnecessarily large amount of computation before it comes up with that best policy. 163 | Another way of thinking about this is: is there a small number delta (e.g. 0.0001) that we would be OK with 164 | accepting as the largest change in the value of any given state? If there is, then we can cut the policy evaluation 165 | algorithm short and use the resulting state values to guide our decision-making. This algorithm is called value iteration. 166 | 167 | #### 2.9 Exercises 168 | 169 | In this lesson, we reviewed ways to solve sequential problems. The following Notebook goes into a little more 170 | detail about the Dynamic Programming way of solving problems. We will look into the Fibonacci sequence problem 171 | and devise a few ways of solving it: Recursion, Memoization, and Dynamic Programming. 172 | 173 | Lesson 2 Notebook.
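To connect the ideas from sections 2.5 through 2.8 with the exercises, here is a minimal value-iteration sketch. It assumes a hypothetical dictionary-based MDP of the form `mdp[state][action] -> [(probability, next_state, reward), ...]`; the two-state example at the bottom and all of its numbers are made up for illustration and are not the notebook's solution code.

```python
# Minimal value-iteration sketch over a made-up, dictionary-based MDP.
def value_iteration(mdp, gamma=0.9, delta=1e-4):
    V = {s: 0.0 for s in mdp}                        # start every state value at zero
    while True:
        max_change = 0.0
        for s, actions in mdp.items():
            if not actions:                          # terminal states keep a value of zero
                continue
            # back up each action's expected return and keep the best one
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            max_change = max(max_change, abs(best - V[s]))
            V[s] = best
        if max_change < delta:                       # stop once no state changes by more than delta
            break
    # greedy policy with respect to the final state values
    policy = {
        s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a]))
        for s, actions in mdp.items() if actions
    }
    return V, policy

example_mdp = {
    'S': {'E': [(0.8, 'G', 1.0), (0.2, 'F', -1.0)],  # risky move toward the goal
          'W': [(1.0, 'S', 0.0)]},                   # safe move that goes nowhere
    'G': {},                                         # reaching the goal ends the game
    'F': {},                                         # reaching the failure state ends the game
}
print(value_iteration(example_mdp))
```

Policy iteration would look very similar; the difference is that it alternates a full policy-evaluation pass with a separate policy-improvement step instead of folding both into a single update, which is exactly the trade-off discussed in section 2.8.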
174 | 175 | #### 2.10 Further Reading 176 | 177 | * [Dynamic programming](https://people.eecs.berkeley.edu/~vazirani/algorithms/chap6.pdf) 178 | * [Value iteration and policy iteration algorithms for Markov decision problem](http://www.ics.uci.edu/~csp/r42a-mdp_report.pdf) 179 | * [Introduction to Markov Decision Processes](http://castlelab.princeton.edu/ORF569papers/Powell_ADP_2ndEdition_Chapter%203.pdf) 180 | * [Reinforcement Learning: A Survey](https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kaelbling.pdf) 181 | -------------------------------------------------------------------------------- /paper/rldm.tex: -------------------------------------------------------------------------------- 1 | \documentclass[11pt]{article} % For LaTeX2e 2 | \usepackage{rldmsubmit,palatino} 3 | \usepackage{graphicx} 4 | \usepackage{hyperref} 5 | 6 | \title{A More Robust Way of Teaching Reinforcement Learning and Decision Making} 7 | 8 | \author{ 9 | Miguel Morales\thanks{http://www.mimoralea.com} \\ 10 | Department of Computer Science \\ 11 | Georgia Institute of Technology \\ 12 | Atlanta, GA 30332 \\ 13 | \texttt{mimoralea@gatech.edu} \\ 14 | } 15 | 16 | \newcommand{\fix}{\marginpar{FIX}} 17 | \newcommand{\new}{\marginpar{NEW}} 18 | 19 | \begin{document} 20 | 21 | \maketitle 22 | 23 | \begin{abstract} 24 | I propose a new way of teaching reinforcement learning and decision making 25 | that is designed to be an improvement to traditional academic teaching. I use 26 | a three-step approach to delivering a complete learning experience in a 27 | way that engages the student and allows them to grasp the concepts regardless 28 | of their skill level. I present a specific way of teaching the content, a 29 | new and fully configured coding platform, a set of hands-on exercises and 30 | a group of recommended next steps for deeper learning. 31 | \end{abstract} 32 | 33 | \keywords{teaching tutorials jupyter intuition hands-on} 34 | \repository{https://www.github.com/mimoralea/applied-reinforcement-learning} 35 | \spresentation{https://youtu.be/ltjS5ktziLQ} 36 | \lpresentation{https://youtu.be/1WjNj_JmFaE} 37 | 38 | \acknowledgements{ 39 | I am thankful to my mentor, Kenneth Brooks, for providing assistance when 40 | navigating the field of Educational Technology. Also, for giving direct, 41 | concise and clear feedback on how to make this project better. Thank you 42 | to all my peers who also provided sincere feedback throughout the semester. 43 | I hope to see you all enjoying our OMSCS course in Reinforcement Learning 44 | and Decision Making, but not before going through these lessons. It will 45 | be a rewarding experience. Pun intended. 46 | } 47 | 48 | \startmain % to start the main 1-4 pages of the submission. 49 | 50 | \section{Introduction} 51 | 52 | Reinforcement Learning and Decision Making is a complex subject. Being the 53 | focus of research of a variety of fields including artificial intelligence, 54 | psychology, machine learning, operations research, control theory, animal 55 | and human neuroscience, economics, and ethology, it is expected that the 56 | vast amount of available information could become counterproductive if not 57 | handled properly. Beginners often find themselves lost while trying to grasp 58 | the key concepts that are truly vital for understanding. 
Additionally, reinforcement 59 | learning and decision making, being a relatively new field, is often taught by 60 | world-class researchers who frequently, and unintentionally, omit explaining 61 | core concepts that might seem too basic \cite{gapranda}, yet remain 62 | fundamental. This creates a gap in knowledge that, if left unfilled, causes 63 | trouble when learning the more advanced topics. 64 | 65 | These points present some of the challenges of sparking an interest and keeping the 66 | students engaged throughout their entire learning experience. If the content is not 67 | delivered correctly, the students can quickly feel confused, lost and disengaged, and 68 | when that happens learning stops. 69 | 70 | \section{Sparking Curiosity} 71 | 72 | Fortunately, since reinforcement learning and decision making is studied 73 | by fields like animal and human neuroscience, ethology, and psychology\cite{suttons98}, 74 | the concepts can often be taught in a direct way, using ordinary examples, 75 | in order to connect on an intuitive level. Recent studies in neuroscience have 76 | shown that emotions and cognition are interrelated\cite{intuition}. By keeping 77 | the readings approachable, I allow students to connect with the narratives at different 78 | levels. The notion of learning by interacting with an environment should be easy enough 79 | for all of us to understand, as this is one of the ways we learn. Reinforcement 80 | learning in artificial intelligence has several similarities with human learning. 81 | 82 | I leverage this fact and use a strategy to keep the readers engaged in the 83 | material. 84 | 85 | \subsection{Using Simple And Direct Language} 86 | 87 | Another important component of this work is the use of simple and direct 88 | language throughout the documents. This keeps the reader engaged regardless 89 | of their reinforcement learning knowledge level. 90 | 91 | I carefully select words and examples that bring the concepts to a 92 | common-sense understanding so that all students can follow the initial 93 | readings. 94 | 95 | \subsection{Keeping A Single Narrative} 96 | 97 | Additionally, and what was perhaps the most difficult part, I keep a single 98 | narrative throughout the sequence of concepts being presented. The intention 99 | here is to allow students to keep reading and to use the understanding they 100 | accumulate in previous lessons to understand the subsequent ones. Similar to what 101 | the direct instruction paradigm\cite{directinstruction} encourages, one of the 102 | most important parts of this project is providing the structure and sequence 103 | in which the concepts are presented. 104 | 105 | The more traditional approach is to select concepts from the entire body of 106 | reinforcement learning and decision making and use different lessons to present 107 | different material. However, the problem with this approach is that it does 108 | not help the student grasp the complete picture or the connections between topics. 109 | The effort to present concepts in a logical sequence, despite being complex to define 110 | initially, not only feels more natural to beginners, but also helps 111 | them stay engaged in the material as they continue learning new concepts. 112 | 113 | \subsection{Showing Concepts And Their Complement} 114 | 115 | Finally, in order to spark and maintain the students' curiosity, I show the 116 | full spectrum of a single concept.
Even if I only define the opposite side, I 117 | still make an effort to mention it and briefly explain it. Often, things in 118 | life have a complementary side that, when the two are shown together, better reveals the qualities 119 | of one another. For example, explaining deterministic actions is interesting 120 | all by itself, but you gain a much better understanding if I explain 121 | them alongside stochastic actions. This approach is also known as Compare and 122 | Contrast, and the literature suggests that teaching comparative thinking 123 | strengthens student learning\cite{compare}. 124 | 125 | I paid close attention to showing concepts and their complements in every 126 | lesson. The expectation is that this helps the students get a better 127 | sense of the full range of possibilities for any given point. Presenting concepts 128 | in this format keeps students engaged as the concepts get progressively more 129 | and more complex. 130 | 131 | \section{Removing Friction} 132 | 133 | Once the students' curiosity has been sparked and intuition is engaged, a 134 | convenient way to interact with the concepts should be presented. The 135 | friction of getting hands-on experience is one of the most difficult 136 | barriers to break for beginners, but once it is overcome, the student can 137 | understand the concepts much better. 138 | 139 | I worked on three important points to fully remove the friction 140 | beginners face when first getting into reinforcement learning. 141 | 142 | \subsection{Setting Up A Convenient Environment} 143 | 144 | One of the most remarkable accomplishments of this project is the creation 145 | of a fully configured reinforcement learning platform for using OpenAI 146 | Gym \cite{openaigym} environments from Jupyter Notebooks inside Docker 147 | containers. 148 | 149 | Technicalities aside, it is wonderful to have a ready-to-go environment that gets 150 | students up and running with roughly a 20-minute wait for the first run after 151 | copy-pasting the provided commands. After that initial setup, 152 | every subsequent run takes less than a minute. This allows the student to spend 153 | only a minimal amount of time configuring and battling with packages and configuration 154 | scripts, which add no reinforcement learning knowledge, and lets them concentrate 155 | all their effort on the concepts that truly matter. 156 | 157 | \subsection{Providing With Boilerplate Code} 158 | 159 | Moreover, I supplement the notebooks with abundant boilerplate code. Plotting and 160 | visualization functions, which very likely aid the learning process \cite{visualization}, and 161 | helpers that issue web requests in the background to show videos of carefully 162 | selected agent episodes are some examples of the 163 | code provided to the students. 164 | 165 | This allows the students to interact only with the bits of code that are directly 166 | related to reinforcement learning, and to safely ignore the rest. 167 | 168 | \subsection{Asking For Minimal Effort} 169 | 170 | Then, I proceed to ask students to put in just enough effort to get them engaged. 171 | The hands-on interactions with the notebooks are designed for beginners getting 172 | started with reinforcement learning. Perhaps these students have not seen 173 | reinforcement learning, or even machine learning, code in action before.
Therefore, 174 | in addition to all of the boilerplate material already mentioned, I also 175 | provide the most common algorithms in each of the notebooks, and 176 | only ask the students to complete small sections that make the core 177 | algorithms work more effectively. 178 | 179 | The idea is that after they have had contact with reinforcement learning code, they 180 | will have more confidence when tackling the more advanced problems and 181 | projects during the OMSCS course. 182 | 183 | \section{Showing Options} 184 | 185 | Lastly, connecting to intuition and getting hands-on experience will be 186 | futile unless the students develop an interest in exploring the field by 187 | themselves. This is the most important aspect of this project: I believe 188 | education is about motivation. The role of an instructor is merely to spark 189 | students' curiosity and help them find the path to their own realization. 190 | 191 | Therefore, at this point I hope to have awakened the students' interest in 192 | exploring this marvelous field. Now, showing the path for further learning is 193 | a final and very important step. 194 | 195 | \subsection{Assigning Relevant Readings} 196 | 197 | To help the students better navigate the field of reinforcement learning, 198 | I provide ``Further Reading'' sections in every single lesson, and a 199 | single, final ``Recommended Books'' section at the end of the project. The fact 200 | that I teach the concepts in direct and simple language is by no means 201 | an indication that the academic material can be skipped. Rather, the way I 202 | present the material should be seen as a \emph{primer}, helping the 203 | concepts presented later come together more naturally and be absorbed more quickly. 204 | 205 | \subsection{Watching Academic Lectures} 206 | 207 | Next, I hope students go on to watch academic lectures. 208 | Having world-class experts in the field of reinforcement learning teach 209 | concepts that they are thoroughly familiar with, and have been studying and working with for years, 210 | is valuable for the students. For this reason, I added a ``Recommended Courses'' 211 | section for students to continue the search and learning on their own. 212 | 213 | \subsection{Completing Homework and Projects} 214 | 215 | Finally, I would hope that many of the students using these materials are 216 | the same students who are either planning to enroll in the OMSCS course or have just 217 | enrolled. The OMSCS course, after a brief explanation of core concepts, 218 | moves on to very advanced material at a very rapid pace. In addition, there are specially 219 | designed homework and project assignments so that the students get a solid 220 | grasp of reinforcement learning. 221 | 222 | Completing the coursework would certainly put the students in the driver's seat, 223 | making them owners of their own learning and letting them wisely pick which reinforcement 224 | learning area to explore next. 225 | 226 | \section{Future Work} 227 | 228 | No work is perfect and neither is this one. However, for the 229 | {\raise.17ex\hbox{$\scriptstyle\sim$}}2 months of effort 230 | put into it, I think the progress that has been made is remarkable. I started 231 | with an aggressive proposal and delivered on most of it. I kept progress steady, but 232 | flexible enough to adapt along the way, while still completing core components.
233 | The lessons, the container, the notebooks, the assigned readings, the recommended 234 | courses, all provide with a solid foundation for the deep understanding of 235 | reinforcement learning and decision making. 236 | 237 | It is this foundation that can now make further progress easier to achieve. After 238 | opening this work to the community during the summer semester, I hope to receive 239 | help and feedback to make this project even better going forward. 240 | 241 | \subsection{Additional Notebooks} 242 | 243 | An important component for future work is the addition of notebooks. I had the capability 244 | to complete seven notebooks, but while trying to rush in some final work, I noticed 245 | the quality of the later notebooks were seriously degrading the quality of the 246 | project. Instead of pushing onto additional notebooks, I opted for improving the quality 247 | of previous notebooks and leaving the newer projects out of this release. 248 | 249 | This creates an opportunity for re-adding those notebooks that were removed and 250 | improving them considerably. Also, the addition of new notebooks would be of 251 | great benefit as well. 252 | 253 | \subsection{Effectiveness Evaluation} 254 | 255 | A more difficult future work component would be to find a way to measure the effectiveness 256 | of this material. Ideally, an Educational Technology student can take on the task 257 | to research whether the strategy presented here actually improves student 258 | performance. It would be interesting to gather and study this kind of feedback. 259 | 260 | \subsection{Request For Feedback} 261 | 262 | Finally, one of the next steps I will be taking on is to release this project 263 | in different places. First, to previous students on the Slack channel of the 264 | Georgia Tech Study Group organization. These folks are now veterans of our course 265 | and would be a great source of feedback. Second, I will release to the OpenAI 266 | community through their discussion forums in an attempt to get a very diverse group 267 | to review and provide feedback. The expectation is that this feedback will be 268 | followed up with actual changes in the form of GitHub pull requests. This, and only 269 | this, would make this the project I initially envisioned. 270 | 271 | \section{Conclusion} 272 | 273 | In this paper, I proposed a more robust way of teaching reinforcement learning 274 | and decision making. I presented a series of lessons taught in a very specific 275 | format, I delivered a fully-configured coding environment for the development 276 | of reinforcement learning agents and algorithms, I provided with boilerplate 277 | code and a series of notebooks to assist with hands-on experimentation, and I 278 | supplemented this with more academic readings, and lectures. 279 | 280 | I sincerely hope this project will be useful to lots of people interested in 281 | learning the ins and outs of reinforcement learning and decision making. And, 282 | in fact, the project recently helped an OMSCS Reinforcement Learning and 283 | Decision Making student find his way around the complex topic of function 284 | approximation in reinforcement learning. The potential, however, is bigger, and 285 | the path to improvement obvious in some cases. My desire is to see this 286 | work continue to grow into a more mature and effective way of teaching this 287 | amazing field. 288 | 289 | \medskip 290 | 291 | \begin{thebibliography}{9} 292 | 293 | \bibitem{gapranda} 294 | Ferguson, Julie E. 
295 | \textit{Bridging The Gap Between Research and Practice} 296 | KM4Dev, Volume 1(3), 46-54. 297 | 298 | \bibitem{suttons98} 299 | Richard Sutton, Andrew Barto. 300 | \textit{Reinforcement Learning: An Introduction.} 301 | MIT Press, 1998. 302 | 303 | \bibitem{intuition} 304 | Maray Immordino-Yang, Matthias Faeth. 305 | \textit{Building Smart Students: A Neuroscience Perspective on the Role 306 | of Emotion and Skilled Intuition in Learning.} 307 | Bloomington, 2010. 308 | 309 | \bibitem{directinstruction} 310 | Baumann, James F. 311 | \textit{The effectiveness of a direct instruction paradigm for teaching main idea comprehension.} 312 | Reading Research Quarterly (1984): 93-115. 313 | 314 | \bibitem{compare} 315 | Silver, Harvey F. 316 | \textit{Compare and Contrast.} 317 | Strategic Teacher PLC Guides, 2010. 318 | 319 | \bibitem{visualization} 320 | Naps, Thomas L., et al. 321 | \textit{Exploring the role of visualization and engagement in computer science education.} 322 | ACM Sigcse Bulletin. Vol. 35. No. 2. ACM, 2002. 323 | 324 | \bibitem{openaigym} 325 | Greg Brockman, Vicki Cheung, Ludwig Pettersson et al. 326 | \textit{OpenAI Gym.} 327 | ArXiv, 1606.01540, 2016. 328 | 329 | \end{thebibliography} 330 | 331 | \end{document} 332 | -------------------------------------------------------------------------------- /notebooks/02-dynamic-programming.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": true, 7 | "editable": true 8 | }, 9 | "source": [ 10 | "### Recursion, Memoization and Dynamic Programming" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "deletable": true, 17 | "editable": true 18 | }, 19 | "source": [ 20 | "Remember how we talk about using recursion and dynamic programming. One interesting thing to do is to implement the solution to a common problem called Fibonnaci numbers on these two styles and compare the compute time." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "deletable": true, 27 | "editable": true 28 | }, 29 | "source": [ 30 | "The Fibonacci series looks something like: `0, 1, 1, 2, 3, 5, 8, 13, 21 …` and so on. Any person can quickly notice the pattern. `f(n) = f(n-1) + f(n-2)` So, let's walk through a recursive implementation that solves this problem." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": { 37 | "collapsed": true, 38 | "deletable": true, 39 | "editable": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "def fib(n):\n", 44 | " if n < 2:\n", 45 | " return n\n", 46 | " return fib(n-2) + fib(n-1)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 2, 52 | "metadata": { 53 | "collapsed": false, 54 | "deletable": true, 55 | "editable": true 56 | }, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "CPU times: user 296 ms, sys: 0 ns, total: 296 ms\n", 63 | "Wall time: 295 ms\n" 64 | ] 65 | }, 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "832040" 70 | ] 71 | }, 72 | "execution_count": 2, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "%time fib(30)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": { 84 | "deletable": true, 85 | "editable": true 86 | }, 87 | "source": [ 88 | "Now, the main problem of this algorithm is that we are computing some of the subproblems more than once. 
For instance, to compute fib(4) we would compute fib(3) and fib(2). However, to compute fib(3) we also have to compute fib(2). Say hello to memoization." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": { 94 | "deletable": true, 95 | "editable": true 96 | }, 97 | "source": [ 98 | "A technique called memoization we are cache the results of previously computed sub problems to avoid unnecessary computations." 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 3, 104 | "metadata": { 105 | "collapsed": true, 106 | "deletable": true, 107 | "editable": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "m = {}\n", 112 | "def fibm(n):\n", 113 | " if n in m:\n", 114 | " return m[n]\n", 115 | " m[n] = n if n < 2 else fibm(n-2) + fibm(n-1)\n", 116 | " return m[n]" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 4, 122 | "metadata": { 123 | "collapsed": false, 124 | "deletable": true, 125 | "editable": true 126 | }, 127 | "outputs": [ 128 | { 129 | "name": "stdout", 130 | "output_type": "stream", 131 | "text": [ 132 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 133 | "Wall time: 17.6 µs\n" 134 | ] 135 | }, 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "832040" 140 | ] 141 | }, 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "%time fibm(30)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": { 154 | "deletable": true, 155 | "editable": true 156 | }, 157 | "source": [ 158 | "But the question is, can we do better than this? The use of the array is helpful, but when calculating very large numbers, or perhaps on memory contraint environments it might not be desirable. This is where Dynamic Programming fits the bill." 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": { 164 | "deletable": true, 165 | "editable": true 166 | }, 167 | "source": [ 168 | "In DP we take a bottom-up approach. Meaning, we solve the next Fibonacci number we can with the information we already have." 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 13, 174 | "metadata": { 175 | "collapsed": true, 176 | "deletable": true, 177 | "editable": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "def fibdp(n):\n", 182 | " if n == 0: return 0\n", 183 | " prev, curr = (0, 1)\n", 184 | " for i in range(2, n+1):\n", 185 | " newf = prev + curr\n", 186 | " prev = curr\n", 187 | " curr = newf\n", 188 | " return curr" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 25, 194 | "metadata": { 195 | "collapsed": false, 196 | "deletable": true, 197 | "editable": true 198 | }, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 205 | "Wall time: 6.44 µs\n" 206 | ] 207 | }, 208 | { 209 | "data": { 210 | "text/plain": [ 211 | "832040" 212 | ] 213 | }, 214 | "execution_count": 25, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "%time fibdp(30)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": { 226 | "deletable": true, 227 | "editable": true 228 | }, 229 | "source": [ 230 | "In this format, we don’t need to recurse or keep up with the memory intensive cache dictionary. These, add up to an even better performance." 
231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": { 236 | "deletable": true, 237 | "editable": true 238 | }, 239 | "source": [ 240 | "Let's now give it a try with factorials. Remember `4! = 4 * 3 * 2 * 1 = 24`. Can you give it try?" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 1, 246 | "metadata": { 247 | "collapsed": true, 248 | "deletable": true, 249 | "editable": true 250 | }, 251 | "outputs": [], 252 | "source": [ 253 | "def factr(n):\n", 254 | " if n < 3:\n", 255 | " return n\n", 256 | " return n * factr(n - 1)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 5, 262 | "metadata": { 263 | "collapsed": false, 264 | "deletable": true, 265 | "editable": true 266 | }, 267 | "outputs": [ 268 | { 269 | "name": "stdout", 270 | "output_type": "stream", 271 | "text": [ 272 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 273 | "Wall time: 43.2 µs\n" 274 | ] 275 | }, 276 | { 277 | "data": { 278 | "text/plain": [ 279 | "265252859812191058636308480000000" 280 | ] 281 | }, 282 | "execution_count": 5, 283 | "metadata": {}, 284 | "output_type": "execute_result" 285 | } 286 | ], 287 | "source": [ 288 | "%time factr(30)" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 99, 294 | "metadata": { 295 | "collapsed": true, 296 | "deletable": true, 297 | "editable": true 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "m = {}\n", 302 | "def factm(n):\n", 303 | " if n in m:\n", 304 | " return m[n]\n", 305 | " m[n] = n if n < 3 else n * factr(n - 1)\n", 306 | " return m[n]" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 100, 312 | "metadata": { 313 | "collapsed": false, 314 | "deletable": true, 315 | "editable": true 316 | }, 317 | "outputs": [ 318 | { 319 | "name": "stdout", 320 | "output_type": "stream", 321 | "text": [ 322 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 323 | "Wall time: 47.2 µs\n" 324 | ] 325 | }, 326 | { 327 | "data": { 328 | "text/plain": [ 329 | "265252859812191058636308480000000" 330 | ] 331 | }, 332 | "execution_count": 100, 333 | "metadata": {}, 334 | "output_type": "execute_result" 335 | } 336 | ], 337 | "source": [ 338 | "%time factm(30)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 10, 344 | "metadata": { 345 | "collapsed": true, 346 | "deletable": true, 347 | "editable": true 348 | }, 349 | "outputs": [], 350 | "source": [ 351 | "def factdp(n):\n", 352 | " if n < 3: return n\n", 353 | " fact = 2\n", 354 | " for i in range(3, n + 1):\n", 355 | " fact *= i\n", 356 | " return fact" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 11, 362 | "metadata": { 363 | "collapsed": false, 364 | "deletable": true, 365 | "editable": true 366 | }, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | "text": [ 372 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 373 | "Wall time: 7.87 µs\n" 374 | ] 375 | }, 376 | { 377 | "data": { 378 | "text/plain": [ 379 | "265252859812191058636308480000000" 380 | ] 381 | }, 382 | "execution_count": 11, 383 | "metadata": {}, 384 | "output_type": "execute_result" 385 | } 386 | ], 387 | "source": [ 388 | "%time factdp(30)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": { 394 | "deletable": true, 395 | "editable": true 396 | }, 397 | "source": [ 398 | "Let's think of a slightly different problem. 
Imagine that you want to find the cheapest way to go from city A to city B, but when you are about to buy your ticket, you see that you could hop in different combinations of route and get a much cheaper price than if you go directly. How do you efficiently calculate the best possible combination of tickets and come up with the cheapest route? We will start with basic recursion and work on improving it until we reach dynamic programming.\n", 399 | "\n", 400 | "For this last problem in dynamic programming, create 2 functions that calculates the cheapest route from city A to B. I will give you the recursive solution, you will build one with memoization and the one with dynamic programming." 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 1, 406 | "metadata": { 407 | "collapsed": true, 408 | "deletable": true, 409 | "editable": true 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "import numpy as np" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "Utility function to get fares between cities" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 2, 426 | "metadata": { 427 | "collapsed": false, 428 | "deletable": true, 429 | "editable": true, 430 | "scrolled": false 431 | }, 432 | "outputs": [], 433 | "source": [ 434 | "def get_fares(n_cities, max_fare):\n", 435 | " np.random.seed(123456)\n", 436 | " fares = np.sort(np.random.random((n_cities, n_cities)) * max_fare).astype(int)\n", 437 | " for i in range(len(fares)):\n", 438 | " fares[i] = np.roll(fares[i], i + 1)\n", 439 | " np.fill_diagonal(fares, 0)\n", 440 | " for i in range(1, len(fares)):\n", 441 | " for j in range(0, i):\n", 442 | " fares[i][j] = -1\n", 443 | " return fares" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "Let's try it out with 4 cities and random fares with a max of 1000." 
451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 5, 456 | "metadata": { 457 | "collapsed": false, 458 | "deletable": true, 459 | "editable": true 460 | }, 461 | "outputs": [ 462 | { 463 | "data": { 464 | "text/plain": [ 465 | "array([[ 0, 126, 260, 897],\n", 466 | " [ -1, 0, 50, 376],\n", 467 | " [ -1, -1, 0, 123],\n", 468 | " [ -1, -1, -1, 0]])" 469 | ] 470 | }, 471 | "execution_count": 5, 472 | "metadata": {}, 473 | "output_type": "execute_result" 474 | } 475 | ], 476 | "source": [ 477 | "n_cities = 4\n", 478 | "max_fare = 1000\n", 479 | "fares = get_fares(n_cities, max_fare)\n", 480 | "fares[1][2] = 50\n", 481 | "fares" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "Here is the recursive solution:" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 7, 494 | "metadata": { 495 | "collapsed": true, 496 | "deletable": true, 497 | "editable": true 498 | }, 499 | "outputs": [], 500 | "source": [ 501 | "def cheapestr(s, d, c):\n", 502 | " if s == d or s == d - 1:\n", 503 | " return c[s][d]\n", 504 | " \n", 505 | " cheapest = c[s][d]\n", 506 | " for i in range(s + 1, d):\n", 507 | " tmp = cheapestr(s, i, c) + cheapestr(i, d, c)\n", 508 | " cheapest = tmp if tmp < cheapest else cheapest\n", 509 | " return cheapest" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 8, 515 | "metadata": { 516 | "collapsed": false, 517 | "deletable": true, 518 | "editable": true 519 | }, 520 | "outputs": [ 521 | { 522 | "name": "stdout", 523 | "output_type": "stream", 524 | "text": [ 525 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 526 | "Wall time: 68.7 µs\n" 527 | ] 528 | }, 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "299" 533 | ] 534 | }, 535 | "execution_count": 8, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "%time cheapestr(0, len(fares[0]) - 1, fares)" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "Now, you build the memoization one:" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 9, 554 | "metadata": { 555 | "collapsed": false, 556 | "deletable": true, 557 | "editable": true 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "m = {}\n", 562 | "def cheapestm(s, d, c):\n", 563 | " \"\"\" YOU WRITE THIS FUNCTION \"\"\"\n", 564 | " return 0" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": 10, 570 | "metadata": { 571 | "collapsed": false, 572 | "deletable": true, 573 | "editable": true, 574 | "scrolled": true 575 | }, 576 | "outputs": [ 577 | { 578 | "name": "stdout", 579 | "output_type": "stream", 580 | "text": [ 581 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 582 | "Wall time: 22.6 µs\n" 583 | ] 584 | }, 585 | { 586 | "data": { 587 | "text/plain": [ 588 | "299" 589 | ] 590 | }, 591 | "execution_count": 10, 592 | "metadata": {}, 593 | "output_type": "execute_result" 594 | } 595 | ], 596 | "source": [ 597 | "%time cheapestm(0, len(fares[0]) - 1, fares)" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "Faster, you see?\n", 605 | "\n", 606 | "Now, do the dynamic programming version." 
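Before writing the dynamic-programming version, it is worth seeing why the plain recursive `cheapestr` above needs the help: it re-solves the same `(s, d)` sub-routes over and over. The instrumented copy below is my own throwaway variant (not part of the notebook); it only adds a call counter.

```python
call_count = 0

def cheapestr_counted(s, d, c):
    # identical logic to cheapestr above, plus a global call counter
    global call_count
    call_count += 1
    if s == d or s == d - 1:
        return c[s][d]
    cheapest = c[s][d]
    for i in range(s + 1, d):
        tmp = cheapestr_counted(s, i, c) + cheapestr_counted(i, d, c)
        cheapest = min(cheapest, tmp)
    return cheapest

cheapestr_counted(0, len(fares[0]) - 1, fares)
print(call_count)   # 9 calls for 4 cities; the count grows as 3**(n - 2)
```

With 18 cities that is 3^16 ≈ 43 million calls, which is why the larger run further down takes on the order of 18 seconds, even though there are only about n²/2 distinct `(s, d)` pairs. That redundancy is exactly what memoization and dynamic programming remove.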
607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": 11, 612 | "metadata": { 613 | "collapsed": false, 614 | "deletable": true, 615 | "editable": true 616 | }, 617 | "outputs": [], 618 | "source": [ 619 | "def cheapestdp(s, d, c):\n", 620 | " \"\"\" YOU WRITE THIS FUNCTION \"\"\"\n", 621 | " return 0" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 12, 627 | "metadata": { 628 | "collapsed": false, 629 | "deletable": true, 630 | "editable": true 631 | }, 632 | "outputs": [ 633 | { 634 | "name": "stdout", 635 | "output_type": "stream", 636 | "text": [ 637 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 638 | "Wall time: 61.5 µs\n" 639 | ] 640 | }, 641 | { 642 | "data": { 643 | "text/plain": [ 644 | "299" 645 | ] 646 | }, 647 | "execution_count": 12, 648 | "metadata": {}, 649 | "output_type": "execute_result" 650 | } 651 | ], 652 | "source": [ 653 | "%time cheapestdp(0, len(fares[0]) - 1, fares)" 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": { 659 | "deletable": true, 660 | "editable": true 661 | }, 662 | "source": [ 663 | "Let's now try with a larger example:" 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": 337, 669 | "metadata": { 670 | "collapsed": false, 671 | "deletable": true, 672 | "editable": true, 673 | "scrolled": true 674 | }, 675 | "outputs": [ 676 | { 677 | "data": { 678 | "text/plain": [ 679 | "array([[ 0, 123, 126, 129, 228, 260, 336, 352, 373, 376, 447, 451, 543,\n", 680 | " 776, 820, 840, 859, 897],\n", 681 | " [ -1, 0, 37, 61, 137, 146, 235, 245, 340, 343, 405, 574, 589,\n", 682 | " 590, 594, 753, 852, 861],\n", 683 | " [ -1, -1, 0, 16, 99, 117, 170, 199, 274, 342, 394, 401, 414,\n", 684 | " 462, 481, 595, 610, 641],\n", 685 | " [ -1, -1, -1, 0, 94, 95, 134, 138, 155, 433, 471, 497, 560,\n", 686 | " 630, 639, 683, 732, 758],\n", 687 | " [ -1, -1, -1, -1, 0, 85, 140, 149, 329, 370, 386, 395, 477,\n", 688 | " 544, 562, 566, 619, 634],\n", 689 | " [ -1, -1, -1, -1, -1, 0, 29, 30, 44, 113, 187, 207, 247,\n", 690 | " 249, 249, 356, 409, 630],\n", 691 | " [ -1, -1, -1, -1, -1, -1, 0, 22, 60, 168, 216, 277, 279,\n", 692 | " 372, 419, 449, 606, 690],\n", 693 | " [ -1, -1, -1, -1, -1, -1, -1, 0, 30, 36, 273, 321, 355,\n", 694 | " 415, 419, 421, 497, 500],\n", 695 | " [ -1, -1, -1, -1, -1, -1, -1, -1, 0, 19, 123, 400, 412,\n", 696 | " 418, 535, 547, 559, 591],\n", 697 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 110, 120, 140,\n", 698 | " 204, 220, 395, 454, 488],\n", 699 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 87, 169,\n", 700 | " 219, 257, 312, 487, 527],\n", 701 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 7,\n", 702 | " 11, 22, 244, 250, 281],\n", 703 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0,\n", 704 | " 10, 116, 123, 259, 287],\n", 705 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 706 | " 0, 22, 93, 109, 236],\n", 707 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 708 | " -1, 0, 29, 157, 185],\n", 709 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 710 | " -1, -1, 0, 18, 78],\n", 711 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 712 | " -1, -1, -1, 0, 4],\n", 713 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 714 | " -1, -1, -1, -1, 0]])" 715 | ] 716 | }, 717 | "execution_count": 337, 718 | "metadata": {}, 719 | "output_type": "execute_result" 720 | } 721 | ], 722 | "source": [ 723 | "n_cities = 18 # this will take a little before 20 seconds. 
Try not to make it any larger :)\n", 724 | "max_fare = 1000\n", 725 | "fares = get_fares(n_cities, max_fare)\n", 726 | "fares" 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": 338, 732 | "metadata": { 733 | "collapsed": false, 734 | "deletable": true, 735 | "editable": true 736 | }, 737 | "outputs": [ 738 | { 739 | "name": "stdout", 740 | "output_type": "stream", 741 | "text": [ 742 | "CPU times: user 18.3 s, sys: 6.67 ms, total: 18.3 s\n", 743 | "Wall time: 18.4 s\n" 744 | ] 745 | }, 746 | { 747 | "data": { 748 | "text/plain": [ 749 | "480" 750 | ] 751 | }, 752 | "execution_count": 338, 753 | "metadata": {}, 754 | "output_type": "execute_result" 755 | } 756 | ], 757 | "source": [ 758 | "%time cheapestr(0, len(fares[0]) - 1, fares)" 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": 339, 764 | "metadata": { 765 | "collapsed": false, 766 | "deletable": true, 767 | "editable": true 768 | }, 769 | "outputs": [ 770 | { 771 | "name": "stdout", 772 | "output_type": "stream", 773 | "text": [ 774 | "CPU times: user 14.7 s, sys: 3.33 ms, total: 14.7 s\n", 775 | "Wall time: 14.7 s\n" 776 | ] 777 | }, 778 | { 779 | "data": { 780 | "text/plain": [ 781 | "480" 782 | ] 783 | }, 784 | "execution_count": 339, 785 | "metadata": {}, 786 | "output_type": "execute_result" 787 | } 788 | ], 789 | "source": [ 790 | "%time cheapestm(0, len(fares[0]) - 1, fares)" 791 | ] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "execution_count": 340, 796 | "metadata": { 797 | "collapsed": false, 798 | "deletable": true, 799 | "editable": true 800 | }, 801 | "outputs": [ 802 | { 803 | "name": "stdout", 804 | "output_type": "stream", 805 | "text": [ 806 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 807 | "Wall time: 75.8 µs\n" 808 | ] 809 | }, 810 | { 811 | "data": { 812 | "text/plain": [ 813 | "480" 814 | ] 815 | }, 816 | "execution_count": 340, 817 | "metadata": {}, 818 | "output_type": "execute_result" 819 | } 820 | ], 821 | "source": [ 822 | "%time cheapestdp(0, len(fares[0]) - 1, fares)" 823 | ] 824 | }, 825 | { 826 | "cell_type": "markdown", 827 | "metadata": { 828 | "deletable": true, 829 | "editable": true 830 | }, 831 | "source": [ 832 | "BAAAAAM! See how much faster dynamic programming is?\n", 833 | "\n", 834 | "Well, there you have it!!! This is the power of dynamic programming.\n", 835 | "\n", 836 | "As mentioned in the tutorials, reinforcement learning leverages the power of dynamic programming in many algorithms. Value Iteration, Q-Learning, etc have a similar take on calculation. The bottom line is to think sequentially instead of recursively. And bottom-up instead of top-down. Let's continue this journey." 
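To make that last piece of advice concrete for the fares exercise above: a bottom-up version fills in the cheapest cost of reaching each city in order, reusing the costs already computed for earlier cities. The sketch below is one possible shape, with a hypothetical name and a simplified signature; it is not necessarily the official answer, which lives in the solutions notebook.

```python
def cheapestdp_sketch(c):
    # best[j] = cheapest known cost of getting from city 0 to city j
    best = list(c[0])                  # seed with the direct fares out of city 0
    for j in range(2, len(c)):
        for i in range(1, j):          # consider arriving at j via any earlier city i
            best[j] = min(best[j], best[i] + c[i][j])
    return best[-1]                    # cheapest cost of reaching the last city

# cheapestdp_sketch(fares) should agree with cheapestr(0, len(fares[0]) - 1, fares)
```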
837 | ] 838 | } 839 | ], 840 | "metadata": { 841 | "kernelspec": { 842 | "display_name": "Python 3", 843 | "language": "python", 844 | "name": "python3" 845 | }, 846 | "language_info": { 847 | "codemirror_mode": { 848 | "name": "ipython", 849 | "version": 3 850 | }, 851 | "file_extension": ".py", 852 | "mimetype": "text/x-python", 853 | "name": "python", 854 | "nbconvert_exporter": "python", 855 | "pygments_lexer": "ipython3", 856 | "version": "3.5.2" 857 | } 858 | }, 859 | "nbformat": 4, 860 | "nbformat_minor": 2 861 | } 862 | -------------------------------------------------------------------------------- /notebooks/solutions/02-dynamic-programming.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": true, 7 | "editable": true 8 | }, 9 | "source": [ 10 | "### Recursion, Memoization and Dynamic Programming" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "deletable": true, 17 | "editable": true 18 | }, 19 | "source": [ 20 | "Remember how we talk about using recursion and dynamic programming. One interesting thing to do is to implement the solution to a common problem called Fibonnaci numbers on these two styles and compare the compute time." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "deletable": true, 27 | "editable": true 28 | }, 29 | "source": [ 30 | "The Fibonacci series looks something like: `0, 1, 1, 2, 3, 5, 8, 13, 21 …` and so on. Any person can quickly notice the pattern. `f(n) = f(n-1) + f(n-2)` So, let's walk through a recursive implementation that solves this problem." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": { 37 | "collapsed": true, 38 | "deletable": true, 39 | "editable": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "def fib(n):\n", 44 | " if n < 2:\n", 45 | " return n\n", 46 | " return fib(n-2) + fib(n-1)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 2, 52 | "metadata": { 53 | "collapsed": false, 54 | "deletable": true, 55 | "editable": true 56 | }, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "CPU times: user 296 ms, sys: 0 ns, total: 296 ms\n", 63 | "Wall time: 295 ms\n" 64 | ] 65 | }, 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "832040" 70 | ] 71 | }, 72 | "execution_count": 2, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "%time fib(30)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": { 84 | "deletable": true, 85 | "editable": true 86 | }, 87 | "source": [ 88 | "Now, the main problem of this algorithm is that we are computing some of the subproblems more than once. For instance, to compute fib(4) we would compute fib(3) and fib(2). However, to compute fib(3) we also have to compute fib(2). Say hello to memoization." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": { 94 | "deletable": true, 95 | "editable": true 96 | }, 97 | "source": [ 98 | "A technique called memoization we are cache the results of previously computed sub problems to avoid unnecessary computations." 
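Building the cache by hand, as the next cell does, is the instructive route. As an aside, Python's standard library can do the same bookkeeping for you via `functools.lru_cache`; a minimal sketch (the function name here is mine):

```python
from functools import lru_cache

@lru_cache(maxsize=None)      # unbounded cache: remember every argument seen so far
def fib_cached(n):
    if n < 2:
        return n
    return fib_cached(n - 2) + fib_cached(n - 1)

fib_cached(30)                # 832040, with each subproblem computed only once
```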
99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 3, 104 | "metadata": { 105 | "collapsed": true, 106 | "deletable": true, 107 | "editable": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "m = {}\n", 112 | "def fibm(n):\n", 113 | " if n in m:\n", 114 | " return m[n]\n", 115 | " m[n] = n if n < 2 else fibm(n-2) + fibm(n-1)\n", 116 | " return m[n]" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 4, 122 | "metadata": { 123 | "collapsed": false, 124 | "deletable": true, 125 | "editable": true 126 | }, 127 | "outputs": [ 128 | { 129 | "name": "stdout", 130 | "output_type": "stream", 131 | "text": [ 132 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 133 | "Wall time: 17.6 µs\n" 134 | ] 135 | }, 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "832040" 140 | ] 141 | }, 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "%time fibm(30)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": { 154 | "deletable": true, 155 | "editable": true 156 | }, 157 | "source": [ 158 | "But the question is, can we do better than this? The use of the array is helpful, but when calculating very large numbers, or perhaps on memory contraint environments it might not be desirable. This is where Dynamic Programming fits the bill." 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": { 164 | "deletable": true, 165 | "editable": true 166 | }, 167 | "source": [ 168 | "In DP we take a bottom-up approach. Meaning, we solve the next Fibonacci number we can with the information we already have." 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 13, 174 | "metadata": { 175 | "collapsed": true, 176 | "deletable": true, 177 | "editable": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "def fibdp(n):\n", 182 | " if n == 0: return 0\n", 183 | " prev, curr = (0, 1)\n", 184 | " for i in range(2, n+1):\n", 185 | " newf = prev + curr\n", 186 | " prev = curr\n", 187 | " curr = newf\n", 188 | " return curr" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 25, 194 | "metadata": { 195 | "collapsed": false, 196 | "deletable": true, 197 | "editable": true 198 | }, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 205 | "Wall time: 6.44 µs\n" 206 | ] 207 | }, 208 | { 209 | "data": { 210 | "text/plain": [ 211 | "832040" 212 | ] 213 | }, 214 | "execution_count": 25, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "%time fibdp(30)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": { 226 | "deletable": true, 227 | "editable": true 228 | }, 229 | "source": [ 230 | "In this format, we don’t need to recurse or keep up with the memory intensive cache dictionary. These, add up to an even better performance." 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": { 236 | "deletable": true, 237 | "editable": true 238 | }, 239 | "source": [ 240 | "Let's now give it a try with factorials. Remember `4! = 4 * 3 * 2 * 1 = 24`. Can you give it try?" 
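One caveat about the memoized factorial in the cells that follow: as written, `factm` recurses into `factr` rather than into itself, so intermediate results never land in the cache and the timing ends up essentially the same as the plain recursive version. (Factorial has no repeated subproblems within a single call anyway, so memoization mainly pays off across repeated calls.) A self-referential sketch, with a hypothetical name:

```python
m = {}

def factm_fixed(n):
    if n in m:
        return m[n]
    # recurse into the memoized function itself so every intermediate value is cached;
    # note 0! = 1, which the n < 3 base case used elsewhere in the notebook would return as 0
    m[n] = 1 if n < 2 else n * factm_fixed(n - 1)
    return m[n]
```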
241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 1, 246 | "metadata": { 247 | "collapsed": true, 248 | "deletable": true, 249 | "editable": true 250 | }, 251 | "outputs": [], 252 | "source": [ 253 | "def factr(n):\n", 254 | " if n < 3:\n", 255 | " return n\n", 256 | " return n * factr(n - 1)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 5, 262 | "metadata": { 263 | "collapsed": false, 264 | "deletable": true, 265 | "editable": true 266 | }, 267 | "outputs": [ 268 | { 269 | "name": "stdout", 270 | "output_type": "stream", 271 | "text": [ 272 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 273 | "Wall time: 43.2 µs\n" 274 | ] 275 | }, 276 | { 277 | "data": { 278 | "text/plain": [ 279 | "265252859812191058636308480000000" 280 | ] 281 | }, 282 | "execution_count": 5, 283 | "metadata": {}, 284 | "output_type": "execute_result" 285 | } 286 | ], 287 | "source": [ 288 | "%time factr(30)" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 99, 294 | "metadata": { 295 | "collapsed": true, 296 | "deletable": true, 297 | "editable": true 298 | }, 299 | "outputs": [], 300 | "source": [ 301 | "m = {}\n", 302 | "def factm(n):\n", 303 | " if n in m:\n", 304 | " return m[n]\n", 305 | " m[n] = n if n < 3 else n * factr(n - 1)\n", 306 | " return m[n]" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 100, 312 | "metadata": { 313 | "collapsed": false, 314 | "deletable": true, 315 | "editable": true 316 | }, 317 | "outputs": [ 318 | { 319 | "name": "stdout", 320 | "output_type": "stream", 321 | "text": [ 322 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 323 | "Wall time: 47.2 µs\n" 324 | ] 325 | }, 326 | { 327 | "data": { 328 | "text/plain": [ 329 | "265252859812191058636308480000000" 330 | ] 331 | }, 332 | "execution_count": 100, 333 | "metadata": {}, 334 | "output_type": "execute_result" 335 | } 336 | ], 337 | "source": [ 338 | "%time factm(30)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 10, 344 | "metadata": { 345 | "collapsed": true, 346 | "deletable": true, 347 | "editable": true 348 | }, 349 | "outputs": [], 350 | "source": [ 351 | "def factdp(n):\n", 352 | " if n < 3: return n\n", 353 | " fact = 2\n", 354 | " for i in range(3, n + 1):\n", 355 | " fact *= i\n", 356 | " return fact" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 11, 362 | "metadata": { 363 | "collapsed": false, 364 | "deletable": true, 365 | "editable": true 366 | }, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | "text": [ 372 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 373 | "Wall time: 7.87 µs\n" 374 | ] 375 | }, 376 | { 377 | "data": { 378 | "text/plain": [ 379 | "265252859812191058636308480000000" 380 | ] 381 | }, 382 | "execution_count": 11, 383 | "metadata": {}, 384 | "output_type": "execute_result" 385 | } 386 | ], 387 | "source": [ 388 | "%time factdp(30)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": { 394 | "deletable": true, 395 | "editable": true 396 | }, 397 | "source": [ 398 | "Let's think of a slightly different problem. Imagine that you want to find the cheapest way to go from city A to city B, but when you are about to buy your ticket, you see that you could hop in different combinations of route and get a much cheaper price than if you go directly. How do you efficiently calculate the best possible combination of tickets and come up with the cheapest route? 
We will start with basic recursion and work on improving it until we reach dynamic programming.\n", 399 | "\n", 400 | "For this last problem in dynamic programming, create 2 functions that calculates the cheapest route from city A to B. I will give you the recursive solution, you will build one with memoization and the one with dynamic programming." 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 1, 406 | "metadata": { 407 | "collapsed": true, 408 | "deletable": true, 409 | "editable": true 410 | }, 411 | "outputs": [], 412 | "source": [ 413 | "import numpy as np" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "Utility function to get fares between cities" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 2, 426 | "metadata": { 427 | "collapsed": false, 428 | "deletable": true, 429 | "editable": true, 430 | "scrolled": false 431 | }, 432 | "outputs": [], 433 | "source": [ 434 | "def get_fares(n_cities, max_fare):\n", 435 | " np.random.seed(123456)\n", 436 | " fares = np.sort(np.random.random((n_cities, n_cities)) * max_fare).astype(int)\n", 437 | " for i in range(len(fares)):\n", 438 | " fares[i] = np.roll(fares[i], i + 1)\n", 439 | " np.fill_diagonal(fares, 0)\n", 440 | " for i in range(1, len(fares)):\n", 441 | " for j in range(0, i):\n", 442 | " fares[i][j] = -1\n", 443 | " return fares" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "Let's try it out with 4 cities and random fares with a max of 1000." 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 5, 456 | "metadata": { 457 | "collapsed": false, 458 | "deletable": true, 459 | "editable": true 460 | }, 461 | "outputs": [ 462 | { 463 | "data": { 464 | "text/plain": [ 465 | "array([[ 0, 126, 260, 897],\n", 466 | " [ -1, 0, 50, 376],\n", 467 | " [ -1, -1, 0, 123],\n", 468 | " [ -1, -1, -1, 0]])" 469 | ] 470 | }, 471 | "execution_count": 5, 472 | "metadata": {}, 473 | "output_type": "execute_result" 474 | } 475 | ], 476 | "source": [ 477 | "n_cities = 4\n", 478 | "max_fare = 1000\n", 479 | "fares = get_fares(n_cities, max_fare)\n", 480 | "fares[1][2] = 50\n", 481 | "fares" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "Here is the recursive solution:" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 7, 494 | "metadata": { 495 | "collapsed": true, 496 | "deletable": true, 497 | "editable": true 498 | }, 499 | "outputs": [], 500 | "source": [ 501 | "def cheapestr(s, d, c):\n", 502 | " if s == d or s == d - 1:\n", 503 | " return c[s][d]\n", 504 | " \n", 505 | " cheapest = c[s][d]\n", 506 | " for i in range(s + 1, d):\n", 507 | " tmp = cheapestr(s, i, c) + cheapestr(i, d, c)\n", 508 | " cheapest = tmp if tmp < cheapest else cheapest\n", 509 | " return cheapest" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 8, 515 | "metadata": { 516 | "collapsed": false, 517 | "deletable": true, 518 | "editable": true 519 | }, 520 | "outputs": [ 521 | { 522 | "name": "stdout", 523 | "output_type": "stream", 524 | "text": [ 525 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 526 | "Wall time: 68.7 µs\n" 527 | ] 528 | }, 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "299" 533 | ] 534 | }, 535 | "execution_count": 8, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "%time cheapestr(0, len(fares[0]) 
- 1, fares)" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "Now, you build the memoization one:" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 9, 554 | "metadata": { 555 | "collapsed": false, 556 | "deletable": true, 557 | "editable": true 558 | }, 559 | "outputs": [], 560 | "source": [ 561 | "m = {}\n", 562 | "def cheapestm(s, d, c):\n", 563 | " if s == d or s == d - 1:\n", 564 | " return c[s][d]\n", 565 | "\n", 566 | " if s in m and d in m[s]:\n", 567 | " return m[s][d]\n", 568 | " \n", 569 | " cheapest = c[s][d]\n", 570 | " for i in range(s + 1, d):\n", 571 | " tmp = cheapestm(s, i, c) + cheapestm(i, d, c)\n", 572 | " cheapest = tmp if tmp < cheapest else cheapest\n", 573 | " m[s] = {}\n", 574 | " m[s][d] = cheapest\n", 575 | " return m[s][d]" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 10, 581 | "metadata": { 582 | "collapsed": false, 583 | "deletable": true, 584 | "editable": true, 585 | "scrolled": true 586 | }, 587 | "outputs": [ 588 | { 589 | "name": "stdout", 590 | "output_type": "stream", 591 | "text": [ 592 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 593 | "Wall time: 22.6 µs\n" 594 | ] 595 | }, 596 | { 597 | "data": { 598 | "text/plain": [ 599 | "299" 600 | ] 601 | }, 602 | "execution_count": 10, 603 | "metadata": {}, 604 | "output_type": "execute_result" 605 | } 606 | ], 607 | "source": [ 608 | "%time cheapestm(0, len(fares[0]) - 1, fares)" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "metadata": {}, 614 | "source": [ 615 | "Faster, you see?\n", 616 | "\n", 617 | "Now, do the dynamic programming version." 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 11, 623 | "metadata": { 624 | "collapsed": false, 625 | "deletable": true, 626 | "editable": true 627 | }, 628 | "outputs": [], 629 | "source": [ 630 | "def cheapestdp(s, d, c):\n", 631 | " cheapest = c[0]\n", 632 | " for i in range(2, len(c)):\n", 633 | " for j in range(1, i):\n", 634 | " new_route = cheapest[j] + c[j][i]\n", 635 | " cheapest[i] = new_route if cheapest[i] > new_route else cheapest[i] \n", 636 | " return cheapest[-1]" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": 12, 642 | "metadata": { 643 | "collapsed": false, 644 | "deletable": true, 645 | "editable": true 646 | }, 647 | "outputs": [ 648 | { 649 | "name": "stdout", 650 | "output_type": "stream", 651 | "text": [ 652 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 653 | "Wall time: 61.5 µs\n" 654 | ] 655 | }, 656 | { 657 | "data": { 658 | "text/plain": [ 659 | "299" 660 | ] 661 | }, 662 | "execution_count": 12, 663 | "metadata": {}, 664 | "output_type": "execute_result" 665 | } 666 | ], 667 | "source": [ 668 | "%time cheapestdp(0, len(fares[0]) - 1, fares)" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "metadata": { 674 | "deletable": true, 675 | "editable": true 676 | }, 677 | "source": [ 678 | "Let's now try with a larger example:" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": 337, 684 | "metadata": { 685 | "collapsed": false, 686 | "deletable": true, 687 | "editable": true, 688 | "scrolled": true 689 | }, 690 | "outputs": [ 691 | { 692 | "data": { 693 | "text/plain": [ 694 | "array([[ 0, 123, 126, 129, 228, 260, 336, 352, 373, 376, 447, 451, 543,\n", 695 | " 776, 820, 840, 859, 897],\n", 696 | " [ -1, 0, 37, 61, 137, 146, 235, 245, 340, 343, 405, 574, 589,\n", 697 | " 590, 594, 753, 852, 861],\n", 698 | " [ 
-1, -1, 0, 16, 99, 117, 170, 199, 274, 342, 394, 401, 414,\n", 699 | " 462, 481, 595, 610, 641],\n", 700 | " [ -1, -1, -1, 0, 94, 95, 134, 138, 155, 433, 471, 497, 560,\n", 701 | " 630, 639, 683, 732, 758],\n", 702 | " [ -1, -1, -1, -1, 0, 85, 140, 149, 329, 370, 386, 395, 477,\n", 703 | " 544, 562, 566, 619, 634],\n", 704 | " [ -1, -1, -1, -1, -1, 0, 29, 30, 44, 113, 187, 207, 247,\n", 705 | " 249, 249, 356, 409, 630],\n", 706 | " [ -1, -1, -1, -1, -1, -1, 0, 22, 60, 168, 216, 277, 279,\n", 707 | " 372, 419, 449, 606, 690],\n", 708 | " [ -1, -1, -1, -1, -1, -1, -1, 0, 30, 36, 273, 321, 355,\n", 709 | " 415, 419, 421, 497, 500],\n", 710 | " [ -1, -1, -1, -1, -1, -1, -1, -1, 0, 19, 123, 400, 412,\n", 711 | " 418, 535, 547, 559, 591],\n", 712 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 110, 120, 140,\n", 713 | " 204, 220, 395, 454, 488],\n", 714 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 87, 169,\n", 715 | " 219, 257, 312, 487, 527],\n", 716 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 7,\n", 717 | " 11, 22, 244, 250, 281],\n", 718 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0,\n", 719 | " 10, 116, 123, 259, 287],\n", 720 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 721 | " 0, 22, 93, 109, 236],\n", 722 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 723 | " -1, 0, 29, 157, 185],\n", 724 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 725 | " -1, -1, 0, 18, 78],\n", 726 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 727 | " -1, -1, -1, 0, 4],\n", 728 | " [ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n", 729 | " -1, -1, -1, -1, 0]])" 730 | ] 731 | }, 732 | "execution_count": 337, 733 | "metadata": {}, 734 | "output_type": "execute_result" 735 | } 736 | ], 737 | "source": [ 738 | "n_cities = 18 # this will take a little before 20 seconds. 
Try not to make it any larger :)\n", 739 | "max_fare = 1000\n", 740 | "fares = get_fares(n_cities, max_fare)\n", 741 | "fares" 742 | ] 743 | }, 744 | { 745 | "cell_type": "code", 746 | "execution_count": 338, 747 | "metadata": { 748 | "collapsed": false, 749 | "deletable": true, 750 | "editable": true 751 | }, 752 | "outputs": [ 753 | { 754 | "name": "stdout", 755 | "output_type": "stream", 756 | "text": [ 757 | "CPU times: user 18.3 s, sys: 6.67 ms, total: 18.3 s\n", 758 | "Wall time: 18.4 s\n" 759 | ] 760 | }, 761 | { 762 | "data": { 763 | "text/plain": [ 764 | "480" 765 | ] 766 | }, 767 | "execution_count": 338, 768 | "metadata": {}, 769 | "output_type": "execute_result" 770 | } 771 | ], 772 | "source": [ 773 | "%time cheapestr(0, len(fares[0]) - 1, fares)" 774 | ] 775 | }, 776 | { 777 | "cell_type": "code", 778 | "execution_count": 339, 779 | "metadata": { 780 | "collapsed": false, 781 | "deletable": true, 782 | "editable": true 783 | }, 784 | "outputs": [ 785 | { 786 | "name": "stdout", 787 | "output_type": "stream", 788 | "text": [ 789 | "CPU times: user 14.7 s, sys: 3.33 ms, total: 14.7 s\n", 790 | "Wall time: 14.7 s\n" 791 | ] 792 | }, 793 | { 794 | "data": { 795 | "text/plain": [ 796 | "480" 797 | ] 798 | }, 799 | "execution_count": 339, 800 | "metadata": {}, 801 | "output_type": "execute_result" 802 | } 803 | ], 804 | "source": [ 805 | "%time cheapestm(0, len(fares[0]) - 1, fares)" 806 | ] 807 | }, 808 | { 809 | "cell_type": "code", 810 | "execution_count": 340, 811 | "metadata": { 812 | "collapsed": false, 813 | "deletable": true, 814 | "editable": true 815 | }, 816 | "outputs": [ 817 | { 818 | "name": "stdout", 819 | "output_type": "stream", 820 | "text": [ 821 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 822 | "Wall time: 75.8 µs\n" 823 | ] 824 | }, 825 | { 826 | "data": { 827 | "text/plain": [ 828 | "480" 829 | ] 830 | }, 831 | "execution_count": 340, 832 | "metadata": {}, 833 | "output_type": "execute_result" 834 | } 835 | ], 836 | "source": [ 837 | "%time cheapestdp(0, len(fares[0]) - 1, fares)" 838 | ] 839 | }, 840 | { 841 | "cell_type": "markdown", 842 | "metadata": { 843 | "deletable": true, 844 | "editable": true 845 | }, 846 | "source": [ 847 | "BAAAAAM! See how much faster dynamic programming is?\n", 848 | "\n", 849 | "Well, there you have it!!! This is the power of dynamic programming.\n", 850 | "\n", 851 | "As mentioned in the tutorials, reinforcement learning leverages the power of dynamic programming in many algorithms. Value Iteration, Q-Learning, etc have a similar take on calculation. The bottom line is to think sequentially instead of recursively. And bottom-up instead of top-down. Let's continue this journey." 
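One observation on the solution's `cheapestm` above, suggested by those timings: the memoized run (≈14.7 s) is barely faster than plain recursion (≈18.3 s). The likely reason, from my reading of the code, is that `m[s] = {}` recreates the per-source dictionary on every store, wiping all previously cached destinations for that source. A sketch that only creates the inner dictionary when it is missing (hypothetical name, same interface):

```python
m = {}

def cheapestm_fixed(s, d, c):
    if s == d or s == d - 1:
        return c[s][d]
    if s in m and d in m[s]:
        return m[s][d]
    cheapest = c[s][d]
    for i in range(s + 1, d):
        tmp = cheapestm_fixed(s, i, c) + cheapestm_fixed(i, d, c)
        cheapest = min(cheapest, tmp)
    m.setdefault(s, {})[d] = cheapest   # extend the cache instead of resetting it
    return cheapest
```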
852 | ] 853 | } 854 | ], 855 | "metadata": { 856 | "kernelspec": { 857 | "display_name": "Python 3", 858 | "language": "python", 859 | "name": "python3" 860 | }, 861 | "language_info": { 862 | "codemirror_mode": { 863 | "name": "ipython", 864 | "version": 3 865 | }, 866 | "file_extension": ".py", 867 | "mimetype": "text/x-python", 868 | "name": "python", 869 | "nbconvert_exporter": "python", 870 | "pygments_lexer": "ipython3", 871 | "version": "3.5.2" 872 | } 873 | }, 874 | "nbformat": 4, 875 | "nbformat_minor": 2 876 | } 877 | -------------------------------------------------------------------------------- /notebooks/03-planning-algorithms.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Planning Algorithms" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Do you remember on lesson 2 and 3 we discussed algorithms that basically solve MDPs? That is, find a policy given a exact representation of the environment. In this section, we will explore 2 such algorithms. Value Iteration and policy iteration." 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import numpy as np\n", 24 | "import pandas as pd\n", 25 | "import tempfile\n", 26 | "import pprint\n", 27 | "import json\n", 28 | "import sys\n", 29 | "import gym\n", 30 | "\n", 31 | "from gym import wrappers\n", 32 | "from subprocess import check_output\n", 33 | "from IPython.display import HTML" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "#### Value Iteration" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "The Value Iteration algorithm uses dynamic programming by dividing the problem into common sub-problems and leveraging that optimal structure to speed-up computations.\n", 48 | "\n", 49 | "Let me show you how value iterations look like:" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 2, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "def value_iteration(S, A, P, gamma=.99, theta = 0.0000001):\n", 59 | " \n", 60 | " V = np.random.random(len(S))\n", 61 | " for i in range(100000):\n", 62 | " old_V = V.copy()\n", 63 | " \n", 64 | " Q = np.zeros((len(S), len(A)), dtype=float)\n", 65 | " for s in S:\n", 66 | " for a in A:\n", 67 | " for prob, s_prime, reward, done in P[s][a]:\n", 68 | " Q[s][a] += prob * (reward + gamma * old_V[s_prime] * (not done))\n", 69 | " V[s] = Q[s].max()\n", 70 | " if np.all(np.abs(old_V - V) < theta):\n", 71 | " break\n", 72 | " \n", 73 | " pi = np.argmax(Q, axis=1)\n", 74 | " return pi, V" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "As we can see, value iteration expects a set of states, e.g. (0,1,2,3,4) a set of actions, e.g. (0,1) and a set of transition probabilities that represent the dynamics of the environment. 
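For reference, the update inside that loop is the Bellman optimality backup. In the notation of the code, with `prob` standing for p(s′ | s, a) and the `(not done)` factor zeroing out the value of terminal successors:

$$ V_{k+1}(s) \;=\; \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V_k(s')\,(1 - \mathrm{done}) \,\bigr] $$

The sweep stops once the largest change across states falls below `theta`, and the greedy policy is then read off with the final `argmax` over the Q-table.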
Let's take a look at these variables:" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stderr", 91 | "output_type": "stream", 92 | "text": [ 93 | "[2017-04-26 00:30:53,851] Making new env: FrozenLake-v0\n" 94 | ] 95 | } 96 | ], 97 | "source": [ 98 | "mdir = tempfile.mkdtemp()\n", 99 | "env = gym.make('FrozenLake-v0')\n", 100 | "env = wrappers.Monitor(env, mdir, force=True)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 4, 106 | "metadata": { 107 | "collapsed": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "S = range(env.env.observation_space.n)\n", 112 | "A = range(env.env.action_space.n)\n", 113 | "P = env.env.env.P" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 5, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/plain": [ 124 | "range(0, 16)" 125 | ] 126 | }, 127 | "execution_count": 5, 128 | "metadata": {}, 129 | "output_type": "execute_result" 130 | } 131 | ], 132 | "source": [ 133 | "S" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 6, 139 | "metadata": {}, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "range(0, 4)" 145 | ] 146 | }, 147 | "execution_count": 6, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "A" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 9, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "{0: [(0.3333333333333333, 6, 0.0, False),\n", 165 | " (0.3333333333333333, 9, 0.0, False),\n", 166 | " (0.3333333333333333, 14, 0.0, False)],\n", 167 | " 1: [(0.3333333333333333, 9, 0.0, False),\n", 168 | " (0.3333333333333333, 14, 0.0, False),\n", 169 | " (0.3333333333333333, 11, 0.0, True)],\n", 170 | " 2: [(0.3333333333333333, 14, 0.0, False),\n", 171 | " (0.3333333333333333, 11, 0.0, True),\n", 172 | " (0.3333333333333333, 6, 0.0, False)],\n", 173 | " 3: [(0.3333333333333333, 11, 0.0, True),\n", 174 | " (0.3333333333333333, 6, 0.0, False),\n", 175 | " (0.3333333333333333, 9, 0.0, False)]}" 176 | ] 177 | }, 178 | "execution_count": 9, 179 | "metadata": {}, 180 | "output_type": "execute_result" 181 | } 182 | ], 183 | "source": [ 184 | "P[10]" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "You see the world we are looking into \"FrozenLake-v0\" has 16 different states, 4 different actions. The `P[10]` is basically showing us a peek into the dynamics of the world. For example, in this case, if you are in state \"10\" (from `P[10]`) and you take action 0 (see dictionary key 0), you have equal probability of 0.3333 to land in either state 6, 9 or 14. None of those transitions give you any reward and none of them is terminal.\n", 192 | "\n", 193 | "In contrast, we can see taking action 2, might transition you to state 11, which **is** terminal. \n", 194 | "\n", 195 | "Get the hang of it? Let's run it!" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 10, 201 | "metadata": { 202 | "collapsed": true 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "pi, V = value_iteration(S, A, P)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "Now, value iteration calculates two important things. 
First, it calculates `V`, which tells us how much should we expect from each state if we always act optimally. Second, it gives us `pi`, which is the optimal policy given `V`. Let's take a deeper look:" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 12, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/plain": [ 224 | "array([ 9.82775479e-006, 4.77561742e-007, 8.29890013e-006,\n", 225 | " 7.77646736e-006, 5.68794576e-006, 0.00000000e+000,\n", 226 | " 3.38430298e-208, 0.00000000e+000, 8.92176447e-007,\n", 227 | " 5.28039771e-006, 3.09721331e-006, 0.00000000e+000,\n", 228 | " 0.00000000e+000, 9.53731304e-006, 9.80392157e-001,\n", 229 | " 0.00000000e+000])" 230 | ] 231 | }, 232 | "execution_count": 12, 233 | "metadata": {}, 234 | "output_type": "execute_result" 235 | } 236 | ], 237 | "source": [ 238 | "V" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 13, 244 | "metadata": {}, 245 | "outputs": [ 246 | { 247 | "data": { 248 | "text/plain": [ 249 | "array([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 250 | ] 251 | }, 252 | "execution_count": 13, 253 | "metadata": {}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "pi" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "See? This policy basically says in state `0`, take action `0`. In state `1` take action `3`. In state `2` take action `0` and so on. Got it?\n", 266 | "\n", 267 | "Now, we have the \"directions\" or this \"map\". With this, we can just use this policy and solve the environment as we interact with it.\n", 268 | "\n", 269 | "Let's try it out!" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 14, 275 | "metadata": { 276 | "scrolled": true 277 | }, 278 | "outputs": [ 279 | { 280 | "name": "stderr", 281 | "output_type": "stream", 282 | "text": [ 283 | "[2017-04-26 00:40:00,747] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video010000.json\n", 284 | "[2017-04-26 00:40:01,236] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video011000.json\n", 285 | "[2017-04-26 00:40:01,752] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video012000.json\n", 286 | "[2017-04-26 00:40:02,236] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video013000.json\n", 287 | "[2017-04-26 00:40:02,732] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video014000.json\n", 288 | "[2017-04-26 00:40:03,235] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video015000.json\n", 289 | "[2017-04-26 00:40:03,745] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video016000.json\n", 290 | "[2017-04-26 00:40:04,234] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video017000.json\n", 291 | "[2017-04-26 00:40:04,748] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video018000.json\n", 292 | "[2017-04-26 00:40:05,262] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video019000.json\n" 293 | ] 294 | } 295 | ], 296 | "source": [ 297 | "for _ in range(10000):\n", 298 | " state = env.reset()\n", 299 | " while True:\n", 300 | " state, reward, done, info = env.step(pi[state])\n", 301 | " if done:\n", 302 | " break" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | 
"source": [ 309 | "That was the agent interacting with the environment. Let's take a look at some of the episodes:" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 16, 315 | "metadata": {}, 316 | "outputs": [ 317 | { 318 | "name": "stdout", 319 | "output_type": "stream", 320 | "text": [ 321 | "https://asciinema.org/a/6rphgm3w1rbkjvoo2rq6tpjqu\n" 322 | ] 323 | } 324 | ], 325 | "source": [ 326 | "last_video = env.videos[-1][0]\n", 327 | "out = check_output([\"asciinema\", \"upload\", last_video])\n", 328 | "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n", 329 | "print(out)" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "You can look on that link, or better, let's show it on the notebook:" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 18, 342 | "metadata": {}, 343 | "outputs": [ 344 | { 345 | "data": { 346 | "text/html": [ 347 | "\n", 348 | "\n" 353 | ], 354 | "text/plain": [ 355 | "" 356 | ] 357 | }, 358 | "execution_count": 18, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "castid = out.split('/')[-1]\n", 365 | "html_tag = \"\"\"\n", 366 | "\n", 371 | "\"\"\"\n", 372 | "html_tag = html_tag.format(castid)\n", 373 | "HTML(data=html_tag)" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "Interesting right? Did you get the world yet?\n", 381 | "\n", 382 | "So, 'S' is the starting state, 'G' the goal. 'F' are Frozen grids, and 'H' are holes. Your goal is to go from S to G without falling into any H. The problem is, F is slippery so, often times you are better of by trying moves that seems counter-intuitive. But because you are preventing falling on 'H's it makes sense in the end. For example, the second row, first column 'F', you can see how our agent was trying so hard to go left!! Smashing his head against the wall?? Silly. But why?" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 19, 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "data": { 392 | "text/plain": [ 393 | "{0: [(0.3333333333333333, 0, 0.0, False),\n", 394 | " (0.3333333333333333, 4, 0.0, False),\n", 395 | " (0.3333333333333333, 8, 0.0, False)],\n", 396 | " 1: [(0.3333333333333333, 4, 0.0, False),\n", 397 | " (0.3333333333333333, 8, 0.0, False),\n", 398 | " (0.3333333333333333, 5, 0.0, True)],\n", 399 | " 2: [(0.3333333333333333, 8, 0.0, False),\n", 400 | " (0.3333333333333333, 5, 0.0, True),\n", 401 | " (0.3333333333333333, 0, 0.0, False)],\n", 402 | " 3: [(0.3333333333333333, 5, 0.0, True),\n", 403 | " (0.3333333333333333, 0, 0.0, False),\n", 404 | " (0.3333333333333333, 4, 0.0, False)]}" 405 | ] 406 | }, 407 | "execution_count": 19, 408 | "metadata": {}, 409 | "output_type": "execute_result" 410 | } 411 | ], 412 | "source": [ 413 | "P[4]" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "See how action 0 (left) doesn't have any transition leading to a terminal state??\n", 421 | "\n", 422 | "All other actions give you a 0.333333 chance each of pushing you into the hole in state '5'!!! So it actually makes sense to go left until it slips you downward to state 8.\n", 423 | "\n", 424 | "Cool right?" 
425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 20, 430 | "metadata": {}, 431 | "outputs": [ 432 | { 433 | "data": { 434 | "text/plain": [ 435 | "array([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 436 | ] 437 | }, 438 | "execution_count": 20, 439 | "metadata": {}, 440 | "output_type": "execute_result" 441 | } 442 | ], 443 | "source": [ 444 | "pi" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "See how the \"prescribed\" action is 0 (left) on the policy calculated by value iteration?\n", 452 | "\n", 453 | "How about the values?" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 21, 459 | "metadata": {}, 460 | "outputs": [ 461 | { 462 | "data": { 463 | "text/plain": [ 464 | "array([ 9.82775479e-006, 4.77561742e-007, 8.29890013e-006,\n", 465 | " 7.77646736e-006, 5.68794576e-006, 0.00000000e+000,\n", 466 | " 3.38430298e-208, 0.00000000e+000, 8.92176447e-007,\n", 467 | " 5.28039771e-006, 3.09721331e-006, 0.00000000e+000,\n", 468 | " 0.00000000e+000, 9.53731304e-006, 9.80392157e-001,\n", 469 | " 0.00000000e+000])" 470 | ] 471 | }, 472 | "execution_count": 21, 473 | "metadata": {}, 474 | "output_type": "execute_result" 475 | } 476 | ], 477 | "source": [ 478 | "V" 479 | ] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "metadata": {}, 484 | "source": [ 485 | "These show the expected rewards on each state." 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 22, 491 | "metadata": {}, 492 | "outputs": [ 493 | { 494 | "data": { 495 | "text/plain": [ 496 | "{0: [(1.0, 15, 0, True)],\n", 497 | " 1: [(1.0, 15, 0, True)],\n", 498 | " 2: [(1.0, 15, 0, True)],\n", 499 | " 3: [(1.0, 15, 0, True)]}" 500 | ] 501 | }, 502 | "execution_count": 22, 503 | "metadata": {}, 504 | "output_type": "execute_result" 505 | } 506 | ], 507 | "source": [ 508 | "P[15]" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "See how the state '15' gives you a reward of +1?? These signal gets propagated all the way to the start state using Value Iteration and it shows the values all accross.\n", 516 | "\n", 517 | "Cool? Good." 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 39, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [ 526 | "env.close()" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "If you want to submit to OpenAI Gym, get your API Key and paste it here:" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 9, 539 | "metadata": {}, 540 | "outputs": [ 541 | { 542 | "name": "stderr", 543 | "output_type": "stream", 544 | "text": [ 545 | "[2017-04-01 16:55:43,229] [FrozenLake-v0] Uploading 10000 episodes of training data\n", 546 | "[2017-04-01 16:55:44,905] [FrozenLake-v0] Uploading videos of 19 training episodes (2158 bytes)\n", 547 | "[2017-04-01 16:55:45,131] [FrozenLake-v0] Creating evaluation object from /tmp/tmpfukeltbz with learning curve and training video\n", 548 | "[2017-04-01 16:55:45,620] \n", 549 | "****************************************************\n", 550 | "You successfully uploaded your evaluation on FrozenLake-v0 to\n", 551 | "OpenAI Gym! 
You can find it at:\n", 552 | "\n", 553 | " https://gym.openai.com/evaluations/eval_ycTPCbyiTWK6T0C4DyrvRg\n", 554 | "\n", 555 | "****************************************************\n" 556 | ] 557 | } 558 | ], 559 | "source": [ 560 | "gym.upload(mdir, api_key='')" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": {}, 566 | "source": [ 567 | "#### Policy Iteration" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "There is another method called policy iteration. This method is composed of 2 other methods, policy evaluation and policy improvement. The logic goes that policy iteration is 'evaluating' a policy to check for convergence (meaning the policy doesn't change), and 'improving' the policy, which is applying something similar to a 1 step value iteration to get a slightly better policy, but definitely not worse.\n", 575 | "\n", 576 | "These two functions cycling together are what policy iteration is about.\n", 577 | "\n", 578 | "Can you implement this algorithm yourself? Try it. Make sure to look the solution notebook in case you get stuck.\n", 579 | "\n", 580 | "I will give you the policy evaluation and policy improvement methods, you build the policy iteration cycling between the evaluation and improvement methods until there are no changes to the policy." 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 24, 586 | "metadata": { 587 | "collapsed": true 588 | }, 589 | "outputs": [], 590 | "source": [ 591 | "def policy_evaluation(pi, S, A, P, gamma=.99, theta=0.0000001):\n", 592 | " \n", 593 | " V = np.zeros(len(S))\n", 594 | " while True:\n", 595 | " delta = 0\n", 596 | " for s in S:\n", 597 | " v = V[s]\n", 598 | " \n", 599 | " V[s] = 0\n", 600 | " for prob, dst, reward, done in P[s][pi[s]]:\n", 601 | " V[s] += prob * (reward + gamma * V[dst] * (not done))\n", 602 | " \n", 603 | " delta = max(delta, np.abs(v - V[s]))\n", 604 | " if delta < theta:\n", 605 | " break\n", 606 | " return V" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": 1, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "def policy_improvement(pi, V, S, A, P, gamma=.99):\n", 616 | " for s in S:\n", 617 | " old_a = pi[s]\n", 618 | " \n", 619 | " Qs = np.zeros(len(A), dtype=float)\n", 620 | " for a in A:\n", 621 | " for prob, s_prime, reward, done in P[s][a]:\n", 622 | " Qs[a] += prob * (reward + gamma * V[s_prime] * (not done))\n", 623 | " pi[s] = np.argmax(Qs)\n", 624 | " V[s] = np.max(Qs)\n", 625 | " return pi, V" 626 | ] 627 | }, 628 | { 629 | "cell_type": "code", 630 | "execution_count": 27, 631 | "metadata": {}, 632 | "outputs": [], 633 | "source": [ 634 | "def policy_iteration(S, A, P, gamma=.99):\n", 635 | " pi = np.random.choice(A, len(S))\n", 636 | " \"\"\" YOU COMPLETE THIS METHOD \"\"\"\n", 637 | " return pi" 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "metadata": {}, 643 | "source": [ 644 | "After you implement the algorithms, you can run it and calculate the optimal policy:" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 29, 650 | "metadata": {}, 651 | "outputs": [ 652 | { 653 | "name": "stderr", 654 | "output_type": "stream", 655 | "text": [ 656 | "[2017-04-26 00:54:49,917] Making new env: FrozenLake-v0\n", 657 | "[2017-04-26 00:54:49,919] Finished writing results. 
You can upload them to the scoreboard via gym.upload('/tmp/tmppra935u6')\n" 658 | ] 659 | }, 660 | { 661 | "name": "stdout", 662 | "output_type": "stream", 663 | "text": [ 664 | "[0 3 0 3 0 0 0 0 3 1 0 0 0 2 1 0]\n" 665 | ] 666 | } 667 | ], 668 | "source": [ 669 | "mdir = tempfile.mkdtemp()\n", 670 | "env = gym.make('FrozenLake-v0')\n", 671 | "env = wrappers.Monitor(env, mdir, force=True)\n", 672 | "\n", 673 | "S = range(env.env.observation_space.n)\n", 674 | "A = range(env.env.action_space.n)\n", 675 | "P = env.env.env.P\n", 676 | "\n", 677 | "pi = policy_iteration(S, A, P)\n", 678 | "print(pi)" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "And, of course, interact with the environment looking at the \"directions\" or \"policy\":" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": 30, 691 | "metadata": { 692 | "scrolled": true 693 | }, 694 | "outputs": [ 695 | { 696 | "name": "stderr", 697 | "output_type": "stream", 698 | "text": [ 699 | "[2017-04-26 00:55:44,764] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000000.json\n", 700 | "[2017-04-26 00:55:44,767] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000001.json\n", 701 | "[2017-04-26 00:55:44,772] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000008.json\n", 702 | "[2017-04-26 00:55:44,788] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000027.json\n", 703 | "[2017-04-26 00:55:44,810] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000064.json\n", 704 | "[2017-04-26 00:55:44,838] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000125.json\n", 705 | "[2017-04-26 00:55:44,891] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000216.json\n", 706 | "[2017-04-26 00:55:44,958] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000343.json\n", 707 | "[2017-04-26 00:55:45,043] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000512.json\n", 708 | "[2017-04-26 00:55:45,155] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000729.json\n", 709 | "[2017-04-26 00:55:45,295] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video001000.json\n", 710 | "[2017-04-26 00:55:45,889] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video002000.json\n", 711 | "[2017-04-26 00:55:46,418] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video003000.json\n", 712 | "[2017-04-26 00:55:46,934] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video004000.json\n", 713 | "[2017-04-26 00:55:47,441] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video005000.json\n", 714 | "[2017-04-26 00:55:47,963] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video006000.json\n", 715 | "[2017-04-26 00:55:48,473] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video007000.json\n", 716 | "[2017-04-26 00:55:48,989] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video008000.json\n", 717 | "[2017-04-26 00:55:49,492] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video009000.json\n" 718 | ] 719 | } 720 | ], 721 | 
"source": [ 722 | "for _ in range(10000):\n", 723 | " state = env.reset()\n", 724 | " while True:\n", 725 | " state, reward, done, info = env.step(pi[state])\n", 726 | " if done:\n", 727 | " break" 728 | ] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "execution_count": 32, 733 | "metadata": {}, 734 | "outputs": [ 735 | { 736 | "name": "stdout", 737 | "output_type": "stream", 738 | "text": [ 739 | "https://asciinema.org/a/c6phe9z2ntyy3y3lfflzwqwiy\n" 740 | ] 741 | } 742 | ], 743 | "source": [ 744 | "last_video = env.videos[-1][0]\n", 745 | "out = check_output([\"asciinema\", \"upload\", last_video])\n", 746 | "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n", 747 | "print(out)" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": 34, 753 | "metadata": {}, 754 | "outputs": [ 755 | { 756 | "data": { 757 | "text/html": [ 758 | "\n", 759 | "\n" 764 | ], 765 | "text/plain": [ 766 | "" 767 | ] 768 | }, 769 | "execution_count": 34, 770 | "metadata": {}, 771 | "output_type": "execute_result" 772 | } 773 | ], 774 | "source": [ 775 | "castid = out.split('/')[-1]\n", 776 | "html_tag = \"\"\"\n", 777 | "\n", 782 | "\"\"\"\n", 783 | "html_tag = html_tag.format(castid)\n", 784 | "HTML(data=html_tag)" 785 | ] 786 | }, 787 | { 788 | "cell_type": "markdown", 789 | "metadata": {}, 790 | "source": [ 791 | "Similar as before. Policies could be slightly different if there is a state in which more than one action give the same value in the end." 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": 35, 797 | "metadata": {}, 798 | "outputs": [ 799 | { 800 | "data": { 801 | "text/plain": [ 802 | "array([ 9.82775479e-006, 4.77561742e-007, 8.29890013e-006,\n", 803 | " 7.77646736e-006, 5.68794576e-006, 0.00000000e+000,\n", 804 | " 3.38430298e-208, 0.00000000e+000, 8.92176447e-007,\n", 805 | " 5.28039771e-006, 3.09721331e-006, 0.00000000e+000,\n", 806 | " 0.00000000e+000, 9.53731304e-006, 9.80392157e-001,\n", 807 | " 0.00000000e+000])" 808 | ] 809 | }, 810 | "execution_count": 35, 811 | "metadata": {}, 812 | "output_type": "execute_result" 813 | } 814 | ], 815 | "source": [ 816 | "V" 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": 37, 822 | "metadata": {}, 823 | "outputs": [ 824 | { 825 | "data": { 826 | "text/plain": [ 827 | "array([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 828 | ] 829 | }, 830 | "execution_count": 37, 831 | "metadata": {}, 832 | "output_type": "execute_result" 833 | } 834 | ], 835 | "source": [ 836 | "pi" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "metadata": {}, 842 | "source": [ 843 | "That's it let's wrap up." 844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "execution_count": 38, 849 | "metadata": {}, 850 | "outputs": [ 851 | { 852 | "name": "stderr", 853 | "output_type": "stream", 854 | "text": [ 855 | "[2017-04-26 00:57:28,406] Finished writing results. 
You can upload them to the scoreboard via gym.upload('/tmp/tmp0oe_0gtp')\n" 856 | ] 857 | } 858 | ], 859 | "source": [ 860 | "env.close()" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": {}, 866 | "source": [ 867 | "If you want to submit to OpenAI Gym, get your API Key and paste it here:" 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": 134, 873 | "metadata": {}, 874 | "outputs": [ 875 | { 876 | "name": "stderr", 877 | "output_type": "stream", 878 | "text": [ 879 | "[2017-04-01 20:40:54,103] [FrozenLake-v0] Uploading 10000 episodes of training data\n", 880 | "[2017-04-01 20:40:55,854] [FrozenLake-v0] Uploading videos of 19 training episodes (2278 bytes)\n", 881 | "[2017-04-01 20:40:56,102] [FrozenLake-v0] Creating evaluation object from /tmp/tmpyspcx0sa with learning curve and training video\n", 882 | "[2017-04-01 20:40:56,451] \n", 883 | "****************************************************\n", 884 | "You successfully uploaded your evaluation on FrozenLake-v0 to\n", 885 | "OpenAI Gym! You can find it at:\n", 886 | "\n", 887 | " https://gym.openai.com/evaluations/eval_vAvbhsGQRVSAe5DZkFNrQ\n", 888 | "\n", 889 | "****************************************************\n" 890 | ] 891 | } 892 | ], 893 | "source": [ 894 | "gym.upload(mdir, api_key='')" 895 | ] 896 | }, 897 | { 898 | "cell_type": "markdown", 899 | "metadata": { 900 | "collapsed": true 901 | }, 902 | "source": [ 903 | "Hope you liked it... Value Iteration and Policy Iteration might seem disappointing at first, and I understand. What is intelligent about following directions you were given!? What if you just don't have a map of the environment you are interacting with? Come on, that's not AI. You are right, it is not. However, Value Iteration and Policy Iteration form the basis of 2 of the 3 most fundamental paradigms of algorithms in reinforcement learning.\n", 904 | "\n", 905 | "In the next notebooks we start looking into slightly more complicated environments. We will also learn about algorithms that learn while interacting with the environment, also called \"online\" learning." 906 | ] 907 | } 908 | ], 909 | "metadata": { 910 | "kernelspec": { 911 | "display_name": "Python 3", 912 | "language": "python", 913 | "name": "python3" 914 | }, 915 | "language_info": { 916 | "codemirror_mode": { 917 | "name": "ipython", 918 | "version": 3 919 | }, 920 | "file_extension": ".py", 921 | "mimetype": "text/x-python", 922 | "name": "python", 923 | "nbconvert_exporter": "python", 924 | "pygments_lexer": "ipython3", 925 | "version": "3.5.2" 926 | } 927 | }, 928 | "nbformat": 4, 929 | "nbformat_minor": 2 930 | } 931 | -------------------------------------------------------------------------------- /notebooks/solutions/03-planning-algorithms.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Planning Algorithms" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Do you remember that in lessons 2 and 3 we discussed algorithms that basically solve MDPs? That is, they find a policy given an exact representation of the environment. In this section, we will explore two such algorithms: Value Iteration and Policy Iteration."
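To make "exact representation" concrete before diving in: in Gym's FrozenLake the dynamics are exposed as a nested dictionary `P[state][action]` holding `(probability, next_state, reward, done)` tuples, and that is exactly what the planning functions in this notebook consume. Here is a minimal sketch of how such a structure is read (illustrative values only, not one of the original cells; the real `P` comes from the environment a few cells below):

```python
# Illustrative sketch of the dynamics dictionary the planners below expect.
# P[s][a] is a list of (probability, next_state, reward, done) tuples.
P_example = {
    0: {0: [(1.0, 0, 0.0, False)],    # in state 0, action 0 keeps you in state 0
        1: [(0.5, 1, 0.0, False),     # action 1 moves you to state 1 half the time...
            (0.5, 2, 1.0, True)]},    # ...and to terminal state 2 (reward +1) otherwise
}

for prob, s_prime, reward, done in P_example[0][1]:
    print(prob, s_prime, reward, done)
```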
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import numpy as np\n", 24 | "import pandas as pd\n", 25 | "import tempfile\n", 26 | "import pprint\n", 27 | "import json\n", 28 | "import sys\n", 29 | "import gym\n", 30 | "\n", 31 | "from gym import wrappers\n", 32 | "from subprocess import check_output\n", 33 | "from IPython.display import HTML" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "#### Value Iteration" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "The Value Iteration algorithm uses dynamic programming: it divides the problem into common sub-problems and leverages that optimal substructure to speed up computations.\n", 48 | "\n", 49 | "Let me show you what value iteration looks like:" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 2, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "def value_iteration(S, A, P, gamma=.99, theta = 0.0000001):\n", 59 | " \n", 60 | " V = np.random.random(len(S))\n", 61 | " for i in range(100000):\n", 62 | " old_V = V.copy()\n", 63 | " \n", 64 | " Q = np.zeros((len(S), len(A)), dtype=float)\n", 65 | " for s in S:\n", 66 | " for a in A:\n", 67 | " for prob, s_prime, reward, done in P[s][a]:\n", 68 | " Q[s][a] += prob * (reward + gamma * old_V[s_prime] * (not done))\n", 69 | " V[s] = Q[s].max()\n", 70 | " if np.all(np.abs(old_V - V) < theta):\n", 71 | " break\n", 72 | " \n", 73 | " pi = np.argmax(Q, axis=1)\n", 74 | " return pi, V" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "As we can see, value iteration expects a set of states, e.g. (0,1,2,3,4), a set of actions, e.g. (0,1), and a set of transition probabilities that represents the dynamics of the environment.
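For reference, and in standard notation rather than anything from the original notebook, the update that the nested loops above implement is the Bellman optimality backup, with the `(not done)` factor zeroing out the bootstrap term on terminal transitions:

$$V_{k+1}(s) = \max_{a} \sum_{s'} p(s' \mid s, a)\,\big[\, r(s, a, s') + \gamma\, V_k(s') \,\big]$$

Once the change between $V_{k+1}$ and $V_k$ falls below `theta` for every state, the policy that is greedy with respect to the final $Q$ is returned.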
Let's take a look at these variables:" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stderr", 91 | "output_type": "stream", 92 | "text": [ 93 | "[2017-08-27 08:15:35,098] Making new env: FrozenLake-v0\n" 94 | ] 95 | } 96 | ], 97 | "source": [ 98 | "mdir = tempfile.mkdtemp()\n", 99 | "env = gym.make('FrozenLake-v0')\n", 100 | "env = wrappers.Monitor(env, mdir, force=True)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 4, 106 | "metadata": { 107 | "collapsed": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "S = range(env.env.observation_space.n)\n", 112 | "A = range(env.env.action_space.n)\n", 113 | "P = env.env.env.P" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 5, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/plain": [ 124 | "range(0, 16)" 125 | ] 126 | }, 127 | "execution_count": 5, 128 | "metadata": {}, 129 | "output_type": "execute_result" 130 | } 131 | ], 132 | "source": [ 133 | "S" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 6, 139 | "metadata": {}, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "range(0, 4)" 145 | ] 146 | }, 147 | "execution_count": 6, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "A" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 7, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "{0: [(0.3333333333333333, 6, 0.0, False),\n", 165 | " (0.3333333333333333, 9, 0.0, False),\n", 166 | " (0.3333333333333333, 14, 0.0, False)],\n", 167 | " 1: [(0.3333333333333333, 9, 0.0, False),\n", 168 | " (0.3333333333333333, 14, 0.0, False),\n", 169 | " (0.3333333333333333, 11, 0.0, True)],\n", 170 | " 2: [(0.3333333333333333, 14, 0.0, False),\n", 171 | " (0.3333333333333333, 11, 0.0, True),\n", 172 | " (0.3333333333333333, 6, 0.0, False)],\n", 173 | " 3: [(0.3333333333333333, 11, 0.0, True),\n", 174 | " (0.3333333333333333, 6, 0.0, False),\n", 175 | " (0.3333333333333333, 9, 0.0, False)]}" 176 | ] 177 | }, 178 | "execution_count": 7, 179 | "metadata": {}, 180 | "output_type": "execute_result" 181 | } 182 | ], 183 | "source": [ 184 | "P[10]" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "You see the world we are looking into \"FrozenLake-v0\" has 16 different states, 4 different actions. The `P[10]` is basically showing us a peek into the dynamics of the world. For example, in this case, if you are in state \"10\" (from `P[10]`) and you take action 0 (see dictionary key 0), you have equal probability of 0.3333 to land in either state 6, 9 or 14. None of those transitions give you any reward and none of them is terminal.\n", 192 | "\n", 193 | "In contrast, we can see taking action 2, might transition you to state 11, which **is** terminal. \n", 194 | "\n", 195 | "Get the hang of it? Let's run it!" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 8, 201 | "metadata": { 202 | "collapsed": true 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "pi, V = value_iteration(S, A, P)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "Now, value iteration calculates two important things. 
First, it calculates `V`, which tells us how much should we expect from each state if we always act optimally. Second, it gives us `pi`, which is the optimal policy given `V`. Let's take a deeper look:" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 9, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/plain": [ 224 | "array([ 0.54202426, 0.49880096, 0.47069307, 0.45684887, 0.55844941,\n", 225 | " 0. , 0.35834688, 0. , 0.59179743, 0.64307884,\n", 226 | " 0.61520669, 0. , 0. , 0.74171974, 0.86283707, 0. ])" 227 | ] 228 | }, 229 | "execution_count": 9, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "V" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 10, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "data": { 245 | "text/plain": [ 246 | "array([0, 3, 3, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 247 | ] 248 | }, 249 | "execution_count": 10, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "pi" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "See? This policy basically says in state `0`, take action `0`. In state `1` take action `3`. In state `2` take action `3` and so on. Got it?\n", 263 | "\n", 264 | "Now, we have the \"directions\" or this \"map\". With this, we can just use this policy and solve the environment as we interact with it.\n", 265 | "\n", 266 | "Let's try it out!" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 11, 272 | "metadata": { 273 | "scrolled": true 274 | }, 275 | "outputs": [ 276 | { 277 | "name": "stderr", 278 | "output_type": "stream", 279 | "text": [ 280 | "[2017-08-27 08:16:30,009] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000000.json\n", 281 | "[2017-08-27 08:16:30,015] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000001.json\n", 282 | "[2017-08-27 08:16:30,024] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000008.json\n", 283 | "[2017-08-27 08:16:30,063] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000027.json\n", 284 | "[2017-08-27 08:16:30,116] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000064.json\n", 285 | "[2017-08-27 08:16:30,168] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000125.json\n", 286 | "[2017-08-27 08:16:30,245] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000216.json\n", 287 | "[2017-08-27 08:16:30,346] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000343.json\n", 288 | "[2017-08-27 08:16:30,461] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000512.json\n", 289 | "[2017-08-27 08:16:30,613] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000729.json\n", 290 | "[2017-08-27 08:16:30,796] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video001000.json\n", 291 | "[2017-08-27 08:16:31,510] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video002000.json\n", 292 | "[2017-08-27 08:16:32,407] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video003000.json\n", 293 | "[2017-08-27 08:16:33,056] Starting new 
video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video004000.json\n", 294 | "[2017-08-27 08:16:33,717] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video005000.json\n", 295 | "[2017-08-27 08:16:34,350] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video006000.json\n", 296 | "[2017-08-27 08:16:34,995] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video007000.json\n", 297 | "[2017-08-27 08:16:35,629] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video008000.json\n", 298 | "[2017-08-27 08:16:36,299] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video009000.json\n" 299 | ] 300 | } 301 | ], 302 | "source": [ 303 | "for _ in range(10000):\n", 304 | " state = env.reset()\n", 305 | " while True:\n", 306 | " state, reward, done, info = env.step(pi[state])\n", 307 | " if done:\n", 308 | " break" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "That was the agent interacting with the environment. Let's take a look at some of the episodes:" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": 13, 321 | "metadata": {}, 322 | "outputs": [ 323 | { 324 | "name": "stdout", 325 | "output_type": "stream", 326 | "text": [ 327 | "https://asciinema.org/a/cJ4n5wZKQJIxjwKpndi0OKmWX\n" 328 | ] 329 | } 330 | ], 331 | "source": [ 332 | "last_video = env.videos[-1][0]\n", 333 | "out = check_output([\"asciinema\", \"upload\", last_video])\n", 334 | "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n", 335 | "print(out)" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "You can look at that link or, better, let's show it right in the notebook:" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 14, 348 | "metadata": {}, 349 | "outputs": [ 350 | { 351 | "data": { 352 | "text/html": [ 353 | "\n", 354 | "\n" 359 | ], 360 | "text/plain": [ 361 | "" 362 | ] 363 | }, 364 | "execution_count": 14, 365 | "metadata": {}, 366 | "output_type": "execute_result" 367 | } 368 | ], 369 | "source": [ 370 | "castid = out.split('/')[-1]\n", 371 | "html_tag = \"\"\"\n", 372 | "\n", 377 | "\"\"\"\n", 378 | "html_tag = html_tag.format(castid)\n", 379 | "HTML(data=html_tag)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "Interesting, right? Did you get the world yet?\n", 387 | "\n", 388 | "So, 'S' is the starting state, 'G' the goal, 'F' are frozen cells, and 'H' are holes. Your goal is to go from S to G without falling into any H. The problem is that F is slippery, so often you are better off trying moves that seem counter-intuitive; because they keep you away from the 'H's, they make sense in the end. For example, on the 'F' in the second row, first column, you can see how our agent was trying so hard to go left!! Smashing its head against the wall?? Silly. But why?"
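Before answering, it can help to literally look at the map. Here is a small optional snippet, not one of the original cells, that prints the greedy policy as arrows on the lake; it assumes the `pi` computed above is in scope and uses FrozenLake-v0's standard 4x4 layout with the usual action encoding (0=left, 1=down, 2=right, 3=up):

```python
# Optional visualization: print the greedy policy on the 4x4 lake.
arrows = ['<', 'v', '>', '^']      # 0=left, 1=down, 2=right, 3=up
lake = 'SFFFFHFHFFFHHFFG'          # FrozenLake-v0's standard map, row by row
for row in range(4):
    line = ''
    for col in range(4):
        s = row * 4 + col
        line += arrows[pi[s]] if lake[s] in 'SF' else lake[s]
    print(line)
```

Now, back to why smashing into the left wall there is actually the smart move: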
389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 15, 394 | "metadata": {}, 395 | "outputs": [ 396 | { 397 | "data": { 398 | "text/plain": [ 399 | "{0: [(0.3333333333333333, 0, 0.0, False),\n", 400 | " (0.3333333333333333, 4, 0.0, False),\n", 401 | " (0.3333333333333333, 8, 0.0, False)],\n", 402 | " 1: [(0.3333333333333333, 4, 0.0, False),\n", 403 | " (0.3333333333333333, 8, 0.0, False),\n", 404 | " (0.3333333333333333, 5, 0.0, True)],\n", 405 | " 2: [(0.3333333333333333, 8, 0.0, False),\n", 406 | " (0.3333333333333333, 5, 0.0, True),\n", 407 | " (0.3333333333333333, 0, 0.0, False)],\n", 408 | " 3: [(0.3333333333333333, 5, 0.0, True),\n", 409 | " (0.3333333333333333, 0, 0.0, False),\n", 410 | " (0.3333333333333333, 4, 0.0, False)]}" 411 | ] 412 | }, 413 | "execution_count": 15, 414 | "metadata": {}, 415 | "output_type": "execute_result" 416 | } 417 | ], 418 | "source": [ 419 | "P[4]" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "See how action 0 (left) doesn't have any transition leading to a terminal state??\n", 427 | "\n", 428 | "All other actions each give you a 0.333333 chance of pushing you into the hole in state '5'!!! So it actually makes sense to go left until the slipping pushes you downward to state 8.\n", 429 | "\n", 430 | "Cool, right?" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 16, 436 | "metadata": {}, 437 | "outputs": [ 438 | { 439 | "data": { 440 | "text/plain": [ 441 | "array([0, 3, 3, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 442 | ] 443 | }, 444 | "execution_count": 16, 445 | "metadata": {}, 446 | "output_type": "execute_result" 447 | } 448 | ], 449 | "source": [ 450 | "pi" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "See how the \"prescribed\" action is 0 (left) in the policy calculated by value iteration?\n", 458 | "\n", 459 | "How about the values?" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": 17, 465 | "metadata": {}, 466 | "outputs": [ 467 | { 468 | "data": { 469 | "text/plain": [ 470 | "array([ 0.54202426, 0.49880096, 0.47069307, 0.45684887, 0.55844941,\n", 471 | " 0. , 0.35834688, 0. , 0.59179743, 0.64307884,\n", 472 | " 0.61520669, 0. , 0. , 0.74171974, 0.86283707, 0. ])" 473 | ] 474 | }, 475 | "execution_count": 17, 476 | "metadata": {}, 477 | "output_type": "execute_result" 478 | } 479 | ], 480 | "source": [ 481 | "V" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "These show the expected (discounted) return from each state when acting optimally." 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 18, 494 | "metadata": {}, 495 | "outputs": [ 496 | { 497 | "data": { 498 | "text/plain": [ 499 | "{0: [(1.0, 15, 0, True)],\n", 500 | " 1: [(1.0, 15, 0, True)],\n", 501 | " 2: [(1.0, 15, 0, True)],\n", 502 | " 3: [(1.0, 15, 0, True)]}" 503 | ] 504 | }, 505 | "execution_count": 18, 506 | "metadata": {}, 507 | "output_type": "execute_result" 508 | } 509 | ], 510 | "source": [ 511 | "P[15]" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "See how state '15' is the goal? Reaching it earns you the reward of +1, and it is terminal (it only transitions back to itself). This signal gets propagated all the way back to the start state by Value Iteration, and it shows in the values all across the grid.\n", 519 | "\n", 520 | "Cool? Good."
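As a quick sanity check, and again not one of the original cells, you can redo the one-step lookahead for state 4 using the `P`, `A`, and `V` already in scope (and the same `gamma=0.99` the functions above default to), and confirm that going left really is the greedy choice there:

```python
import numpy as np

# Q(4, a) = sum over transitions of prob * (reward + gamma * V[s'] * (not done))
gamma = 0.99
Q4 = np.zeros(len(A), dtype=float)
for a in A:
    for prob, s_prime, reward, done in P[4][a]:
        Q4[a] += prob * (reward + gamma * V[s_prime] * (not done))

print(Q4)           # action 0 (left) should have the largest value...
print(Q4.argmax())  # ...matching pi[4] == 0
```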
521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 19, 526 | "metadata": {}, 527 | "outputs": [ 528 | { 529 | "name": "stderr", 530 | "output_type": "stream", 531 | "text": [ 532 | "[2017-08-27 08:18:16,163] Finished writing results. You can upload them to the scoreboard via gym.upload('/tmp/tmp8ebvhkul')\n" 533 | ] 534 | } 535 | ], 536 | "source": [ 537 | "env.close()" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "If you want to submit to OpenAI Gym, get your API Key and paste it here:" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": 9, 550 | "metadata": {}, 551 | "outputs": [ 552 | { 553 | "name": "stderr", 554 | "output_type": "stream", 555 | "text": [ 556 | "[2017-04-01 16:55:43,229] [FrozenLake-v0] Uploading 10000 episodes of training data\n", 557 | "[2017-04-01 16:55:44,905] [FrozenLake-v0] Uploading videos of 19 training episodes (2158 bytes)\n", 558 | "[2017-04-01 16:55:45,131] [FrozenLake-v0] Creating evaluation object from /tmp/tmpfukeltbz with learning curve and training video\n", 559 | "[2017-04-01 16:55:45,620] \n", 560 | "****************************************************\n", 561 | "You successfully uploaded your evaluation on FrozenLake-v0 to\n", 562 | "OpenAI Gym! You can find it at:\n", 563 | "\n", 564 | " https://gym.openai.com/evaluations/eval_ycTPCbyiTWK6T0C4DyrvRg\n", 565 | "\n", 566 | "****************************************************\n" 567 | ] 568 | } 569 | ], 570 | "source": [ 571 | "gym.upload(mdir, api_key='')" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "#### Policy Iteration" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "There is another method called Policy Iteration. This method is composed of 2 other methods, policy evaluation and policy improvement. The logic goes like this: policy iteration alternates between 'evaluating' a policy (computing the value of each state under it) and 'improving' it, which applies something similar to a one-step value iteration to get a policy that is at least as good, and usually better. The cycle stops when the policy converges, meaning it no longer changes.\n", 586 | "\n", 587 | "These two functions cycling together are what policy iteration is about.\n", 588 | "\n", 589 | "Can you implement this algorithm yourself? Try it. Make sure to look at the solution notebook in case you get stuck.\n", 590 | "\n", 591 | "I will give you the policy evaluation and policy improvement methods; you build the policy iteration that cycles between the evaluation and improvement methods until there are no changes to the policy."
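In equations (standard notation, added here only for reference), the two steps being cycled are policy evaluation, which solves for the value of the current policy,

$$V^{\pi}(s) = \sum_{s'} p(s' \mid s, \pi(s))\,\big[\, r(s, \pi(s), s') + \gamma\, V^{\pi}(s') \,\big]$$

and policy improvement, which makes the policy greedy with respect to that value function:

$$\pi'(s) = \arg\max_{a} \sum_{s'} p(s' \mid s, a)\,\big[\, r(s, a, s') + \gamma\, V^{\pi}(s') \,\big]$$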
592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": 32, 597 | "metadata": { 598 | "collapsed": true 599 | }, 600 | "outputs": [], 601 | "source": [ 602 | "def policy_evaluation(pi, S, A, P, gamma=.99, theta=0.0000001):\n", 603 | " \n", 604 | " V = np.zeros(len(S))\n", 605 | " while True:\n", 606 | " delta = 0\n", 607 | " for s in S:\n", 608 | " v = V[s]\n", 609 | " \n", 610 | " V[s] = 0\n", 611 | " for prob, dst, reward, done in P[s][pi[s]]:\n", 612 | " V[s] += prob * (reward + gamma * V[dst] * (not done))\n", 613 | " \n", 614 | " delta = max(delta, np.abs(v - V[s]))\n", 615 | " if delta < theta:\n", 616 | " break\n", 617 | " return V" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 33, 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "def policy_improvement(pi, V, S, A, P, gamma=.99):\n", 627 | " for s in S:\n", 628 | " old_a = pi[s]\n", 629 | " \n", 630 | " Qs = np.zeros(len(A), dtype=float)\n", 631 | " for a in A:\n", 632 | " for prob, s_prime, reward, done in P[s][a]:\n", 633 | " Qs[a] += prob * (reward + gamma * V[s_prime] * (not done))\n", 634 | " pi[s] = np.argmax(Qs)\n", 635 | " V[s] = np.max(Qs)\n", 636 | " return pi, V" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": 34, 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "def policy_iteration(S, A, P, gamma=.99):\n", 646 | " pi = np.random.choice(A, len(S))\n", 647 | " while True: \n", 648 | " V = policy_evaluation(pi, S, A, P, gamma)\n", 649 | " new_pi, new_V = policy_improvement(\n", 650 | " pi.copy(), V.copy(), S, A, P, gamma)\n", 651 | " if np.all(pi == new_pi):\n", 652 | " break\n", 653 | " pi = new_pi\n", 654 | " V = new_V\n", 655 | " return pi" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "After you implement the algorithms, you can run it and calculate the optimal policy:" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 35, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "name": "stderr", 672 | "output_type": "stream", 673 | "text": [ 674 | "[2017-08-27 08:21:56,074] Making new env: FrozenLake-v0\n", 675 | "[2017-08-27 08:21:56,078] Finished writing results. 
You can upload them to the scoreboard via gym.upload('/tmp/tmpsqiqif_m')\n" 676 | ] 677 | }, 678 | { 679 | "name": "stdout", 680 | "output_type": "stream", 681 | "text": [ 682 | "[1 3 0 3 0 0 0 0 3 1 0 0 0 2 1 0]\n" 683 | ] 684 | } 685 | ], 686 | "source": [ 687 | "mdir = tempfile.mkdtemp()\n", 688 | "env = gym.make('FrozenLake-v0')\n", 689 | "env = wrappers.Monitor(env, mdir, force=True)\n", 690 | "\n", 691 | "S = range(env.env.observation_space.n)\n", 692 | "A = range(env.env.action_space.n)\n", 693 | "P = env.env.env.P\n", 694 | "\n", 695 | "pi = policy_iteration(S, A, P)\n", 696 | "print(pi)" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": {}, 702 | "source": [ 703 | "And, of course, interact with the environment looking at the \"directions\" or \"policy\":" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": 36, 709 | "metadata": { 710 | "scrolled": true 711 | }, 712 | "outputs": [ 713 | { 714 | "name": "stderr", 715 | "output_type": "stream", 716 | "text": [ 717 | "[2017-08-27 08:21:59,041] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000000.json\n", 718 | "[2017-08-27 08:21:59,053] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000001.json\n", 719 | "[2017-08-27 08:21:59,059] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000008.json\n", 720 | "[2017-08-27 08:21:59,086] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000027.json\n", 721 | "[2017-08-27 08:21:59,127] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000064.json\n", 722 | "[2017-08-27 08:21:59,166] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000125.json\n", 723 | "[2017-08-27 08:21:59,214] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000216.json\n", 724 | "[2017-08-27 08:21:59,287] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000343.json\n", 725 | "[2017-08-27 08:21:59,375] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000512.json\n", 726 | "[2017-08-27 08:21:59,490] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000729.json\n", 727 | "[2017-08-27 08:21:59,624] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video001000.json\n", 728 | "[2017-08-27 08:22:00,092] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video002000.json\n", 729 | "[2017-08-27 08:22:00,837] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video003000.json\n", 730 | "[2017-08-27 08:22:01,269] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video004000.json\n", 731 | "[2017-08-27 08:22:01,720] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video005000.json\n", 732 | "[2017-08-27 08:22:02,184] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video006000.json\n", 733 | "[2017-08-27 08:22:02,614] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video007000.json\n", 734 | "[2017-08-27 08:22:03,085] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video008000.json\n", 735 | "[2017-08-27 08:22:03,518] Starting new video recorder writing to 
/tmp/tmpqfn2e0ho/openaigym.video.4.3760.video009000.json\n" 736 | ] 737 | } 738 | ], 739 | "source": [ 740 | "for _ in range(10000):\n", 741 | " state = env.reset()\n", 742 | " while True:\n", 743 | " state, reward, done, info = env.step(pi[state])\n", 744 | " if done:\n", 745 | " break" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": 37, 751 | "metadata": {}, 752 | "outputs": [ 753 | { 754 | "name": "stdout", 755 | "output_type": "stream", 756 | "text": [ 757 | "https://asciinema.org/a/NIeAt9sdjwkvmQbZSOVAkJOIb\n" 758 | ] 759 | } 760 | ], 761 | "source": [ 762 | "last_video = env.videos[-1][0]\n", 763 | "out = check_output([\"asciinema\", \"upload\", last_video])\n", 764 | "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n", 765 | "print(out)" 766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": 38, 771 | "metadata": {}, 772 | "outputs": [ 773 | { 774 | "data": { 775 | "text/html": [ 776 | "\n", 777 | "\n" 782 | ], 783 | "text/plain": [ 784 | "" 785 | ] 786 | }, 787 | "execution_count": 38, 788 | "metadata": {}, 789 | "output_type": "execute_result" 790 | } 791 | ], 792 | "source": [ 793 | "castid = out.split('/')[-1]\n", 794 | "html_tag = \"\"\"\n", 795 | "\n", 800 | "\"\"\"\n", 801 | "html_tag = html_tag.format(castid)\n", 802 | "HTML(data=html_tag)" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": {}, 808 | "source": [ 809 | "Similar to before. Policies could be slightly different if there is a state in which more than one action gives the same value in the end." 810 | ] 811 | }, 812 | { 813 | "cell_type": "code", 814 | "execution_count": 39, 815 | "metadata": {}, 816 | "outputs": [ 817 | { 818 | "data": { 819 | "text/plain": [ 820 | "array([ 0.54202426, 0.49880096, 0.47069307, 0.45684887, 0.55844941,\n", 821 | " 0. , 0.35834688, 0. , 0.59179743, 0.64307884,\n", 822 | " 0.61520669, 0. , 0. , 0.74171974, 0.86283707, 0. ])" 823 | ] 824 | }, 825 | "execution_count": 39, 826 | "metadata": {}, 827 | "output_type": "execute_result" 828 | } 829 | ], 830 | "source": [ 831 | "V" 832 | ] 833 | }, 834 | { 835 | "cell_type": "code", 836 | "execution_count": 40, 837 | "metadata": {}, 838 | "outputs": [ 839 | { 840 | "data": { 841 | "text/plain": [ 842 | "array([1, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])" 843 | ] 844 | }, 845 | "execution_count": 40, 846 | "metadata": {}, 847 | "output_type": "execute_result" 848 | } 849 | ], 850 | "source": [ 851 | "pi" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "That's it, let's wrap up." 859 | ] 860 | }, 861 | { 862 | "cell_type": "code", 863 | "execution_count": 41, 864 | "metadata": {}, 865 | "outputs": [ 866 | { 867 | "name": "stderr", 868 | "output_type": "stream", 869 | "text": [ 870 | "[2017-08-27 08:22:29,264] Finished writing results.
You can upload them to the scoreboard via gym.upload('/tmp/tmpqfn2e0ho')\n" 871 | ] 872 | } 873 | ], 874 | "source": [ 875 | "env.close()" 876 | ] 877 | }, 878 | { 879 | "cell_type": "markdown", 880 | "metadata": {}, 881 | "source": [ 882 | "If you want to submit to OpenAI Gym, get your API Key and paste it here:" 883 | ] 884 | }, 885 | { 886 | "cell_type": "code", 887 | "execution_count": 134, 888 | "metadata": {}, 889 | "outputs": [ 890 | { 891 | "name": "stderr", 892 | "output_type": "stream", 893 | "text": [ 894 | "[2017-04-01 20:40:54,103] [FrozenLake-v0] Uploading 10000 episodes of training data\n", 895 | "[2017-04-01 20:40:55,854] [FrozenLake-v0] Uploading videos of 19 training episodes (2278 bytes)\n", 896 | "[2017-04-01 20:40:56,102] [FrozenLake-v0] Creating evaluation object from /tmp/tmpyspcx0sa with learning curve and training video\n", 897 | "[2017-04-01 20:40:56,451] \n", 898 | "****************************************************\n", 899 | "You successfully uploaded your evaluation on FrozenLake-v0 to\n", 900 | "OpenAI Gym! You can find it at:\n", 901 | "\n", 902 | " https://gym.openai.com/evaluations/eval_vAvbhsGQRVSAe5DZkFNrQ\n", 903 | "\n", 904 | "****************************************************\n" 905 | ] 906 | } 907 | ], 908 | "source": [ 909 | "gym.upload(mdir, api_key='')" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": { 915 | "collapsed": true 916 | }, 917 | "source": [ 918 | "Hope you liked it... Value Iteration and Policy Iteration might seem disappointing at first, and I understand. What is intelligent about following directions you were given!? What if you just don't have a map of the environment you are interacting with? Come on, that's not AI. You are right, it is not. However, Value Iteration and Policy Iteration form the basis of 2 of the 3 most fundamental paradigms of algorithms in reinforcement learning.\n", 919 | "\n", 920 | "In the next notebooks we start looking into slightly more complicated environments. We will also learn about algorithms that learn while interacting with the environment, also called \"online\" learning." 921 | ] 922 | } 923 | ], 924 | "metadata": { 925 | "kernelspec": { 926 | "display_name": "Python 3", 927 | "language": "python", 928 | "name": "python3" 929 | }, 930 | "language_info": { 931 | "codemirror_mode": { 932 | "name": "ipython", 933 | "version": 3 934 | }, 935 | "file_extension": ".py", 936 | "mimetype": "text/x-python", 937 | "name": "python", 938 | "nbconvert_exporter": "python", 939 | "pygments_lexer": "ipython3", 940 | "version": "3.5.2" 941 | } 942 | }, 943 | "nbformat": 4, 944 | "nbformat_minor": 2 945 | } 946 | --------------------------------------------------------------------------------