├── requirements.txt
├── img
│   ├── img1.PNG
│   └── img2.PNG
├── LICENSE
├── README.md
├── notebook
│   ├── 4-dim-example.ipynb
│   └── 3-dim-example.ipynb
└── NashQLearn.py
/requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | tqdm 3 | nashpy 4 | -------------------------------------------------------------------------------- /img/img1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jtonglet/Nash-Q-Learning/HEAD/img/img1.PNG -------------------------------------------------------------------------------- /img/img2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jtonglet/Nash-Q-Learning/HEAD/img/img2.PNG -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Jonathan Tonglet 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Nash Q Learning 2 | 3 | Implementation of the Nash Q-Learning algorithm to solve games with two agents, as seen in the course Multiagent Systems @ PoliMi. 4 | The algorithm was first introduced in the paper [**Nash q-learning for general-sum stochastic games**](https://dl.acm.org/doi/10.5555/945365.964288) (Hu, J., Wellman, M.P., 2003). 5 | 6 | Feel free to use it for your own projects or to contribute! 7 | 8 | ## Example 9 | 10 | Consider the following game where two robots need to reach the reward. One obstacle lies in the middle of the grid. The two robots cannot be on the same tile at the same time, except for the reward tile. See this [notebook](https://github.com/jtonglet/Nash_Q_Learning/blob/main/notebook/3-dim-example.ipynb) for a detailed walkthrough. 11 | 12 | 13 | ![](img/img1.PNG) 14 | 15 | 16 | The robots and the game grid are represented by the Player and Grid objects. 
17 | 18 | ```python 19 | from NashQLearn import Player, Grid 20 | #Initialize the two players 21 | player1 = Player([0,0]) 22 | player2 = Player([2,0]) 23 | #Initialize the grid 24 | grid = Grid(length = 3, 25 | width = 3, 26 | players = [player1,player2], 27 | obstacle_coordinates = [[1,1]], 28 | reward_coordinates = [1,2], 29 | reward_value = 20, 30 | collision_penalty = -1) 31 | ``` 32 | 33 | Once the game settings are defined, a NashQLearning object is initialized with the desired hyperparameters and trained on the grid. 34 | 35 | ```python 36 | from NashQLearn import NashQLearning 37 | nashQ = NashQLearning(grid, 38 | max_iter = 2000, 39 | discount_factor = 0.7, 40 | learning_rate = 0.7, 41 | epsilon = 0.5, 42 | decision_strategy = 'epsilon-greedy') #Available strategies : 'random', 'greedy', and 'epsilon-greedy' 43 | #Retrieve the Q tables after fitting the algorithm 44 | Q0, Q1 = nashQ.fit(return_history = False) 45 | #Best path followed by each player given the values in the Q tables 46 | p0, p1 = nashQ.get_best_policy(Q0,Q1) 47 | 48 | #Show the results 49 | print('Player 0 follows the policy : %s of length %s.'%('-'.join(p0),len(p0))) 50 | >>> Player 0 follows the policy : up-up-right of length 3. 51 | print('Player 1 follows the policy : %s of length %s.'%('-'.join(p1),len(p1))) 52 | >>> Player 1 follows the policy : up-up-left of length 3. 53 | ``` 54 | In this case, the joint optimal policy was found by the algorithm, as shown on the figure below. 55 | ![](img/img2.PNG) 56 | 57 | 58 | ## Requirements 59 | 60 | - python>=3.7 61 | - numpy 62 | - tqdm 63 | - nashpy 64 | 65 | The package nashpy is used to compute the Nash equilibrium for each stage game during the learning process. 66 | -------------------------------------------------------------------------------- /notebook/4-dim-example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "c409679b", 6 | "metadata": {}, 7 | "source": [ 8 | "## 4-dim grid example" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "b1720590", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "#Load packages\n", 19 | "from NashQLearn import Player, Grid, NashQLearning\n", 20 | "import warnings\n", 21 | "warnings.filterwarnings('ignore')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "c116587f", 27 | "metadata": {}, 28 | "source": [ 29 | "This notebook applies the Nash Q Learning algorithm to the following multiagent problem : two robots placed on a grid need to reach the reward. Robots are allowed to move up, down, to the left, and to the right, or to stay at their current position. 
Robots are not allowed to be on the same tile unless it is the reward tile.\n", 30 | "\n", 31 | "### Prepare the game environment" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 2, 37 | "id": "b6a6e69e", 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "#Initialize the two players\n", 42 | "player1 = Player([3,0])\n", 43 | "player2 = Player([2,2])" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 3, 49 | "id": "3b91d86b", 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "#Initialize the grid\n", 54 | "grid = Grid(length = 4,\n", 55 | " width = 4,\n", 56 | " players = [player1,player2],\n", 57 | " obstacle_coordinates = [[0,0], [1,0],[1,2],[1,3],[0,3],[2,3],[3,1]],\n", 58 | " reward_coordinates = [0,2],\n", 59 | " reward_value = 20,\n", 60 | " collision_penalty = -1)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 4, 66 | "id": "abdc9942", 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "Available joint states : 73\n", 74 | "[[[0, 1], [0, 2]], [[0, 1], [1, 1]], [[0, 1], [2, 0]], [[0, 1], [2, 1]], [[0, 1], [2, 2]], [[0, 1], [3, 0]], [[0, 1], [3, 2]], [[0, 1], [3, 3]], [[0, 2], [0, 1]], [[0, 2], [1, 1]], [[0, 2], [2, 0]], [[0, 2], [2, 1]], [[0, 2], [2, 2]], [[0, 2], [3, 0]], [[0, 2], [3, 2]], [[0, 2], [3, 3]], [[1, 1], [0, 1]], [[1, 1], [0, 2]], [[1, 1], [2, 0]], [[1, 1], [2, 1]], [[1, 1], [2, 2]], [[1, 1], [3, 0]], [[1, 1], [3, 2]], [[1, 1], [3, 3]], [[2, 0], [0, 1]], [[2, 0], [0, 2]], [[2, 0], [1, 1]], [[2, 0], [2, 1]], [[2, 0], [2, 2]], [[2, 0], [3, 0]], [[2, 0], [3, 2]], [[2, 0], [3, 3]], [[2, 1], [0, 1]], [[2, 1], [0, 2]], [[2, 1], [1, 1]], [[2, 1], [2, 0]], [[2, 1], [2, 2]], [[2, 1], [3, 0]], [[2, 1], [3, 2]], [[2, 1], [3, 3]], [[2, 2], [0, 1]], [[2, 2], [0, 2]], [[2, 2], [1, 1]], [[2, 2], [2, 0]], [[2, 2], [2, 1]], [[2, 2], [3, 0]], [[2, 2], [3, 2]], [[2, 2], [3, 3]], [[3, 0], [0, 1]], [[3, 0], [0, 2]], [[3, 0], [1, 1]], [[3, 0], [2, 0]], [[3, 0], [2, 1]], [[3, 0], [2, 2]], [[3, 0], [3, 2]], [[3, 0], [3, 3]], [[3, 2], [0, 1]], [[3, 2], [0, 2]], [[3, 2], [1, 1]], [[3, 2], [2, 0]], [[3, 2], [2, 1]], [[3, 2], [2, 2]], [[3, 2], [3, 0]], [[3, 2], [3, 3]], [[3, 3], [0, 1]], [[3, 3], [0, 2]], [[3, 3], [1, 1]], [[3, 3], [2, 0]], [[3, 3], [2, 1]], [[3, 3], [2, 2]], [[3, 3], [3, 0]], [[3, 3], [3, 2]], [[0, 2], [0, 2]]]\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "joint_states = grid.joint_states()\n", 80 | "print('Available joint states : %s'%len(joint_states))#Correct\n", 81 | "print(joint_states)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 5, 87 | "id": "10282bf9", 88 | "metadata": {}, 89 | "outputs": [ 90 | { 91 | "data": { 92 | "text/plain": [ 93 | "[['left', [0, 1]],\n", 94 | " ['down', [0, 1]],\n", 95 | " ['left', [0, 2]],\n", 96 | " ['right', [0, 2]],\n", 97 | " ['up', [0, 2]],\n", 98 | " ['up', [1, 1]],\n", 99 | " ['down', [1, 1]],\n", 100 | " ['left', [2, 0]],\n", 101 | " ['down', [2, 0]],\n", 102 | " ['right', [2, 1]],\n", 103 | " ['left', [2, 2]],\n", 104 | " ['up', [2, 2]],\n", 105 | " ['right', [3, 0]],\n", 106 | " ['up', [3, 0]],\n", 107 | " ['down', [3, 0]],\n", 108 | " ['right', [3, 2]],\n", 109 | " ['down', [3, 2]],\n", 110 | " ['left', [3, 3]],\n", 111 | " ['right', [3, 3]],\n", 112 | " ['up', [3, 3]]]" 113 | ] 114 | }, 115 | "execution_count": 5, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "walls = 
grid.identify_walls()\n", 122 | "walls " 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "6c924a07", 128 | "metadata": {}, 129 | "source": [ 130 | "### Run the Nash Q Learning algorithm" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 6, 136 | "id": "28406b6d", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "nashQ = NashQLearning(grid, \n", 141 | " max_iter = 2000,\n", 142 | " discount_factor = 0.9,\n", 143 | " learning_rate = 0.7,\n", 144 | " epsilon = 0.4,\n", 145 | " decision_strategy = 'epsilon-greedy')" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 7, 151 | "id": "c7c49ebe", 152 | "metadata": { 153 | "scrolled": true 154 | }, 155 | "outputs": [ 156 | { 157 | "name": "stderr", 158 | "output_type": "stream", 159 | "text": [ 160 | "100%|████████████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 2437.05it/s]\n", 161 | "100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [09:04<00:00, 3.68it/s]\n" 162 | ] 163 | } 164 | ], 165 | "source": [ 166 | "#Retrieve the updated Q matrix after fitting the algorithm\n", 167 | "Q0, Q1 = nashQ.fit(return_history = False)" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 8, 173 | "id": "aa2f5917", 174 | "metadata": {}, 175 | "outputs": [ 176 | { 177 | "name": "stdout", 178 | "output_type": "stream", 179 | "text": [ 180 | "[[3, 0], [2, 2]]\n", 181 | "[[2, 0], [2, 1]]\n", 182 | "[[3, 0], [1, 1]]\n", 183 | "[[2, 0], [0, 1]]\n", 184 | "[[2, 1], [0, 2]]\n", 185 | "[[1, 1], [0, 2]]\n", 186 | "[[0, 1], [0, 2]]\n", 187 | "[[0, 2], [0, 2]]\n" 188 | ] 189 | } 190 | ], 191 | "source": [ 192 | "#Best path followed by each player given the values in the q tables\n", 193 | "p0, p1 = nashQ.get_best_policy(Q0,Q1)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 9, 199 | "id": "85a96bc4", 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "Player 0 follows the policy : left-right-left-up-left-left-up of length 7\n", 207 | "Player 1 follows the policy : down-left-left-up-stay-stay-stay of length 7\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "print('Player 0 follows the policy : %s of length %s' %('-'.join(p0),len(p0)))\n", 213 | "print('Player 1 follows the policy : %s of length %s'%('-'.join(p1),len(p1)))" 214 | ] 215 | } 216 | ], 217 | "metadata": { 218 | "kernelspec": { 219 | "display_name": "Python 3 (ipykernel)", 220 | "language": "python", 221 | "name": "python3" 222 | }, 223 | "language_info": { 224 | "codemirror_mode": { 225 | "name": "ipython", 226 | "version": 3 227 | }, 228 | "file_extension": ".py", 229 | "mimetype": "text/x-python", 230 | "name": "python", 231 | "nbconvert_exporter": "python", 232 | "pygments_lexer": "ipython3", 233 | "version": "3.7.13" 234 | } 235 | }, 236 | "nbformat": 4, 237 | "nbformat_minor": 5 238 | } 239 | -------------------------------------------------------------------------------- /notebook/3-dim-example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "7cf87dfd", 6 | "metadata": {}, 7 | "source": [ 8 | "## 3-dim grid example" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "b1720590", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "#Load packages\n", 19 
| "from NashQLearn import Player, Grid, NashQLearning\n", 20 | "import warnings\n", 21 | "warnings.filterwarnings('ignore')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "1cf74196", 27 | "metadata": {}, 28 | "source": [ 29 | "This notebook applies the Nash Q Learning algorithm to the following multiagent problem : two robots placed on a grid need to reach the reward. Robots are allowed to move up, down, to the left, and to the right, or to stay at their current position. \n", 30 | "Robots are not allowed to be on the same tile unless it is the reward tile.\n", 31 | "\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "id": "14455cf6", 37 | "metadata": {}, 38 | "source": [ 39 | "### Prepare the game environment" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "id": "b6a6e69e", 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "#Initialize the two players\n", 50 | "player1 = Player([0,0])\n", 51 | "player2 = Player([2,0])" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 3, 57 | "id": "3b91d86b", 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "#Initialize the grid\n", 62 | "grid = Grid(length = 3,\n", 63 | " width = 3,\n", 64 | " players = [player1,player2],\n", 65 | " obstacle_coordinates = [[1,1]], #A single obstacle in the middle of the grid\n", 66 | " reward_coordinates = [1,2],\n", 67 | " reward_value = 20,\n", 68 | " collision_penalty = -1)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 4, 74 | "id": "abdc9942", 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "name": "stdout", 79 | "output_type": "stream", 80 | "text": [ 81 | "Available joint states : 57\n", 82 | "[[[0, 0], [0, 1]], [[0, 0], [0, 2]], [[0, 0], [1, 0]], [[0, 0], [1, 2]], [[0, 0], [2, 0]], [[0, 0], [2, 1]], [[0, 0], [2, 2]], [[0, 1], [0, 0]], [[0, 1], [0, 2]], [[0, 1], [1, 0]], [[0, 1], [1, 2]], [[0, 1], [2, 0]], [[0, 1], [2, 1]], [[0, 1], [2, 2]], [[0, 2], [0, 0]], [[0, 2], [0, 1]], [[0, 2], [1, 0]], [[0, 2], [1, 2]], [[0, 2], [2, 0]], [[0, 2], [2, 1]], [[0, 2], [2, 2]], [[1, 0], [0, 0]], [[1, 0], [0, 1]], [[1, 0], [0, 2]], [[1, 0], [1, 2]], [[1, 0], [2, 0]], [[1, 0], [2, 1]], [[1, 0], [2, 2]], [[1, 2], [0, 0]], [[1, 2], [0, 1]], [[1, 2], [0, 2]], [[1, 2], [1, 0]], [[1, 2], [2, 0]], [[1, 2], [2, 1]], [[1, 2], [2, 2]], [[2, 0], [0, 0]], [[2, 0], [0, 1]], [[2, 0], [0, 2]], [[2, 0], [1, 0]], [[2, 0], [1, 2]], [[2, 0], [2, 1]], [[2, 0], [2, 2]], [[2, 1], [0, 0]], [[2, 1], [0, 1]], [[2, 1], [0, 2]], [[2, 1], [1, 0]], [[2, 1], [1, 2]], [[2, 1], [2, 0]], [[2, 1], [2, 2]], [[2, 2], [0, 0]], [[2, 2], [0, 1]], [[2, 2], [0, 2]], [[2, 2], [1, 0]], [[2, 2], [1, 2]], [[2, 2], [2, 0]], [[2, 2], [2, 1]], [[1, 2], [1, 2]]]\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "joint_states = grid.joint_states()\n", 88 | "print('Available joint states : %s'%len(joint_states))\n", 89 | "print(joint_states)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 5, 95 | "id": "10282bf9", 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "data": { 100 | "text/plain": [ 101 | "[['left', [0, 0]],\n", 102 | " ['down', [0, 0]],\n", 103 | " ['left', [0, 1]],\n", 104 | " ['right', [0, 1]],\n", 105 | " ['left', [0, 2]],\n", 106 | " ['up', [0, 2]],\n", 107 | " ['up', [1, 0]],\n", 108 | " ['down', [1, 0]],\n", 109 | " ['up', [1, 2]],\n", 110 | " ['down', [1, 2]],\n", 111 | " ['right', [2, 0]],\n", 112 | " ['down', [2, 0]],\n", 113 | " ['left', [2, 1]],\n", 114 | " ['right', [2, 1]],\n", 115 | " 
['right', [2, 2]],\n", 116 | " ['up', [2, 2]]]" 117 | ] 118 | }, 119 | "execution_count": 5, 120 | "metadata": {}, 121 | "output_type": "execute_result" 122 | } 123 | ], 124 | "source": [ 125 | "walls = grid.identify_walls()\n", 126 | "walls" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "id": "4807410f", 132 | "metadata": {}, 133 | "source": [ 134 | "### Run the Nash Q Learning algorithm\n", 135 | "\n", 136 | "The efficiency of the algorithm depends on a set of parameters (the max number of iterations, the discount factor, the learning rate, ...) which may require some tuning. In general, the epsilon-greedy decision strategy outperforms both the random and greedy strategies." 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 11, 142 | "id": "28406b6d", 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "nashQ = NashQLearning(grid, \n", 147 | " max_iter = 2000,\n", 148 | " discount_factor = 0.7,\n", 149 | " learning_rate = 0.7,\n", 150 | " epsilon = 0.5,\n", 151 | " decision_strategy = 'epsilon-greedy')" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 12, 157 | "id": "c7c49ebe", 158 | "metadata": { 159 | "scrolled": true 160 | }, 161 | "outputs": [ 162 | { 163 | "name": "stderr", 164 | "output_type": "stream", 165 | "text": [ 166 | "100%|████████████████████████████████████████████████████████████████████████████████| 57/57 [00:00<00:00, 2716.77it/s]\n", 167 | "100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [08:18<00:00, 4.01it/s]\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "#Retrieve the updated Q matrix after fitting the algorithm\n", 173 | "Q0, Q1 = nashQ.fit(return_history = False)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 13, 179 | "id": "8ae26f6e", 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "name": "stdout", 184 | "output_type": "stream", 185 | "text": [ 186 | "[[0, 0], [2, 0]]\n", 187 | "[[0, 1], [2, 1]]\n", 188 | "[[0, 2], [2, 2]]\n", 189 | "[[1, 2], [1, 2]]\n" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "#Best path followed by each player given the values in the q tables\n", 195 | "p0, p1 = nashQ.get_best_policy(Q0,Q1)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 14, 201 | "id": "e576d4ff", 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "name": "stdout", 206 | "output_type": "stream", 207 | "text": [ 208 | "Player 0 follows the policy : up-up-right of length 3\n", 209 | "Player 1 follows the policy : up-up-left of length 3\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "print('Player 0 follows the policy : %s of length %s' %('-'.join(p0),len(p0)))\n", 215 | "print('Player 1 follows the policy : %s of length %s'%('-'.join(p1),len(p1)))" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "id": "7c4b401a", 221 | "metadata": {}, 222 | "source": [ 223 | "In this experiment, the two players have successfully identified the optimal path to the reward.\n" 224 | ] 225 | } 226 | ], 227 | "metadata": { 228 | "kernelspec": { 229 | "display_name": "Python 3 (ipykernel)", 230 | "language": "python", 231 | "name": "python3" 232 | }, 233 | "language_info": { 234 | "codemirror_mode": { 235 | "name": "ipython", 236 | "version": 3 237 | }, 238 | "file_extension": ".py", 239 | "mimetype": "text/x-python", 240 | "name": "python", 241 | "nbconvert_exporter": "python", 242 | "pygments_lexer": "ipython3", 243 | "version": "3.7.13" 244 | } 245 | }, 246 | 
"nbformat": 4, 247 | "nbformat_minor": 5 248 | } 249 | -------------------------------------------------------------------------------- /NashQLearn.py: -------------------------------------------------------------------------------- 1 | #Implementation of the Nash Q Learning algorithm for simple games with two agents 2 | 3 | import numpy as np 4 | import random 5 | from collections import defaultdict 6 | import nashpy as nash 7 | from tqdm import tqdm 8 | 9 | 10 | class Player: 11 | def __init__(self, 12 | position = [0,0], 13 | movements = ['left','right','up','down','stay'] 14 | ): 15 | ''' 16 | This class is a representation of a player. 17 | 18 | position (list) : list of two integers giving the starting position coordinates of the player 19 | movements (list) : list of strings containing the possible movements. 20 | ''' 21 | self.movements = movements 22 | self.position = position 23 | 24 | 25 | def move(self, movement): 26 | ''' 27 | Compute the new position of a player after performing a movement. 28 | movement (string) : the movement to perform. Invalid string values are interpreted as the 'stay' movement. 29 | ''' 30 | if movement == 'left' and 'left' in self.movements: 31 | new_position = [self.position[0] - 1, self.position[1]] 32 | elif movement == 'right' and 'right' in self.movements: 33 | new_position = [self.position[0] + 1, self.position[1]] 34 | elif movement == 'up' and 'up' in self.movements: 35 | new_position = [self.position[0], self.position[1] + 1] 36 | elif movement == 'down' and 'down' in self.movements: 37 | new_position = [self.position[0], self.position[1] - 1] 38 | else: 39 | new_position = self.position 40 | 41 | return new_position 42 | 43 | 44 | class Grid: 45 | def __init__(self, 46 | length = 2, 47 | width = 2, 48 | players = [Player(),Player()], 49 | reward_coordinates = [1,1], 50 | reward_value = 20, 51 | obstacle_coordinates = [], 52 | collision_allowed = False, 53 | collision_penalty = 0): 54 | ''' 55 | This class is a representation of the game grid. 56 | 57 | length (int) : horizontal dimension of the grid 58 | width (int) : vertical dimension of the grid 59 | players (list) : list of 2 Player objects. 60 | reward_coordinates (list) : list of 2 integers giving the coordinates of the reward 61 | reward_value (int) : value obtained by reaching the reward coordinates 62 | obstacle_coordinates (list) : list of obstacle coordinates. Each obstacle coordinate is a list of 2 integers giving their coordinates 63 | collision_allowed (bool) : whether agents are allowed to be at the same time on the same cell or not 64 | collision_penalty (int) : negative reward obtained for hitting a wall or colliding with another player 65 | joint_player_coordinates (list) : list containing the starting positions of the two players 66 | ''' 67 | self.length = length 68 | self.width = width 69 | self.players = players 70 | self.reward_coordinates = reward_coordinates 71 | self.reward_value = reward_value 72 | self.obstacle_coordinates = obstacle_coordinates 73 | self.collision_allowed = collision_allowed 74 | self.collision_penalty = collision_penalty 75 | self.joint_player_coordinates = [players[0].position, players[1].position] 76 | 77 | 78 | def get_player_0(self): 79 | return self.players[0] 80 | 81 | 82 | def get_player_1(self): 83 | return self.players[1] 84 | 85 | 86 | def joint_states(self): 87 | ''' 88 | Returns a list of all possible joint states in the game. 
89 | ''' 90 | if not self.collision_allowed: 91 | #Agents are only allowed to collide on the reward cell, whether they arrive there at the same time or not 92 | joint_states = [[[i,j], 93 | [k,l]] for i in range(self.length) for j in range(self.width) 94 | for k in range(self.length) for l in range(self.width) 95 | if [i,j] != [k,l] and [i,j] not in self.obstacle_coordinates and [k,l] not in self.obstacle_coordinates 96 | ] 97 | joint_states.append([self.reward_coordinates,self.reward_coordinates]) #Add the reward state as joint state 98 | 99 | else: #Agents can collide on any cell, but they can't move to an obstacle 100 | joint_states = [[[i,j], 101 | [k,l]] for i in range(self.length) for j in range(self.width) 102 | for k in range(self.length) for l in range(self.width) 103 | if [i,j] not in self.obstacle_coordinates and [k,l] not in self.obstacle_coordinates 104 | ] 105 | 106 | return joint_states 107 | 108 | 109 | def identify_walls(self): 110 | ''' 111 | Identify all impossible transitions due to the grid walls and the obstacles 112 | ''' 113 | walls = [] 114 | for i in range(self.length): 115 | for j in range(self.width): 116 | if [i,j] not in self.obstacle_coordinates: 117 | fictitious_player = Player(position = [i,j]) #Used to explore the grid in search of walls 118 | if fictitious_player.move('left')[0] not in range(self.length) or fictitious_player.move('left') in self.obstacle_coordinates: 119 | walls.append(['left',fictitious_player.position]) 120 | if fictitious_player.move('right')[0] not in range(self.length) or fictitious_player.move('right') in self.obstacle_coordinates: 121 | walls.append(['right',fictitious_player.position]) 122 | if fictitious_player.move('up')[1] not in range(self.width) or fictitious_player.move('up') in self.obstacle_coordinates: 123 | walls.append(['up',fictitious_player.position]) 124 | if fictitious_player.move('down')[1] not in range(self.width) or fictitious_player.move('down') in self.obstacle_coordinates: 125 | walls.append(['down',fictitious_player.position]) 126 | 127 | return walls 128 | 129 | 130 | def compute_reward(self, 131 | old_state, 132 | new_state, 133 | movement, 134 | collision_detected = False): 135 | ''' 136 | Compute the reward obtained by a player for transitioning from its old state to its new state 137 | ''' 138 | if old_state == self.reward_coordinates: #Stop receiving rewards once the goal is reached 139 | reward = 0 140 | elif new_state == self.reward_coordinates: #The goal state is reached for the first time 141 | reward = self.reward_value 142 | elif new_state == old_state and movement != 'stay': #The player moved and bumped into another player or an obstacle 143 | reward = self.collision_penalty 144 | elif movement == 'stay' and collision_detected: #The player stayed but was hit by another player 145 | reward = self.collision_penalty 146 | else: # The player made a regular valid movement 147 | reward = 0 148 | return reward 149 | 150 | 151 | def create_transition_table(self): 152 | ''' 153 | Creates a dictionary where each pair of joint state and joint movement is mapped to a new resulting joint state 154 | ''' 155 | recursivedict = lambda : defaultdict(recursivedict) 156 | transitions = recursivedict() 157 | joint_states = self.joint_states() 158 | walls = self.identify_walls() 159 | player0_movements = self.players[0].movements 160 | player1_movements = self.players[1].movements 161 | 162 | for state in joint_states: 163 | for m0 in player0_movements: 164 | for m1 in player1_movements: 165 | if [m1,state[1]] in walls or state[1] == 
self.reward_coordinates: 166 | if [m0,state[0]] in walls or state[0] == self.reward_coordinates: 167 | new_state = state 168 | else : 169 | new_state = [Player(state[0]).move(m0),state[1]] 170 | else: 171 | if [m0,state[0]] in walls or state[0] == self.reward_coordinates: 172 | new_state = [state[0],Player(state[1]).move(m1)] 173 | else: 174 | new_state = [Player(state[0]).move(m0),Player(state[1]).move(m1)] 175 | if (new_state[0] == state[1] and new_state[1] == state[0] ) or new_state not in joint_states: 176 | # There is a collision or a swap of positions 177 | new_state = state #Return to previous state 178 | transitions[joint_states.index(state)][m0][m1] = joint_states.index(new_state) 179 | 180 | return transitions 181 | 182 | 183 | def create_stage_games(self): 184 | ''' 185 | Creates the stage game tables which contain the reward obtained by the players for each pair of joint states and joint movements. 186 | The stage game tables are represented as 3-dimensional tensors. 187 | ''' 188 | joint_states = self.joint_states() 189 | walls = self.identify_walls() 190 | player0_movements = self.players[0].movements 191 | player1_movements = self.players[1].movements 192 | 193 | stage_games0 = np.zeros((len(joint_states), 194 | len(player0_movements), 195 | len(player1_movements), 196 | )) 197 | 198 | stage_games1 = np.zeros((len(joint_states), 199 | len(player0_movements), 200 | len(player1_movements), 201 | )) 202 | for state in tqdm(joint_states): 203 | for m0 in player0_movements: 204 | for m1 in player1_movements: 205 | if [m1,state[1]] in walls: 206 | if [m0,state[0]] not in walls: 207 | new_state = [Player(state[0]).move(m0),state[1]] 208 | else : 209 | new_state = state 210 | else: 211 | if [m0,state[0]] in walls: 212 | new_state = [state[0],Player(state[1]).move(m1)] 213 | else: 214 | new_state = [Player(state[0]).move(m0),Player(state[1]).move(m1)] 215 | collision_detected = False 216 | if (new_state[0] == state[1] and new_state[1] == state[0] ) or new_state not in joint_states: 217 | # There is a collision 218 | new_state = state #Return to previous state 219 | collision_detected = True 220 | 221 | reward0 = self.compute_reward(state[0],new_state[0],m0,collision_detected) 222 | reward1 = self.compute_reward(state[1],new_state[1],m1,collision_detected) 223 | stage_games0[joint_states.index(state)][player0_movements.index(m0)][player1_movements.index(m1)] = reward0 224 | stage_games1[joint_states.index(state)][player0_movements.index(m0)][player1_movements.index(m1)] = reward1 225 | 226 | return stage_games0, stage_games1 227 | 228 | 229 | def create_q_tables(self): 230 | ''' 231 | Creates the Q tables which contain the Q-values used by the Nash Q-Learning algorithm. 232 | The Q tables are represented as 3-dimensional tensors and are initialized with zeros. 
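The first dimension indexes the joint states, the second the movements of player 0, and the third the movements of player 1.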
233 | ''' 234 | joint_states = self.joint_states() 235 | player0_movements = self.players[0].movements 236 | player1_movements = self.players[1].movements 237 | q_tables0 = np.zeros((len(joint_states), 238 | len(player0_movements), 239 | len(player1_movements) 240 | )) 241 | q_tables1 = np.zeros((len(joint_states), 242 | len(player0_movements), 243 | len(player1_movements) 244 | )) 245 | 246 | return q_tables0, q_tables1 247 | 248 | 249 | class NashQLearning: 250 | 251 | def __init__(self, 252 | grid = Grid(), 253 | learning_rate = 0.5, 254 | max_iter = 100, 255 | discount_factor = 0.7, 256 | decision_strategy = 'random', 257 | epsilon = 0.5, 258 | random_state = 42): 259 | ''' 260 | This class represents an instance of the Nash Q-Learning algorithm 261 | 262 | grid (Grid) : the game grid 263 | learning_rate (float) : weight given to the newly computed Q-value relative to its current value in the update 264 | max_iter (int) : max number of iterations of the algorithm 265 | discount_factor (float) : discount factor applied to the Nash equilibrium value in the Q-value update formula 266 | decision_strategy (str) : decision strategy applied to select the next movement, possible values are 'random','greedy','epsilon-greedy' 267 | epsilon (float) : only used if decision_strategy is 'epsilon-greedy', threshold to decide between a greedy and a random movement 268 | random_state (int) : seed for results reproducibility 269 | ''' 270 | self.grid = grid 271 | self.learning_rate = learning_rate 272 | self.max_iter = max_iter 273 | self.discount_factor = discount_factor 274 | self.decision_strategy = decision_strategy 275 | self.epsilon = epsilon 276 | random.seed(random_state) 277 | 278 | 279 | def fit(self, 280 | return_history = False): 281 | """ 282 | Fit the Nash Q Learning algorithm on the grid and return one Q table per player. 283 | return_history (bool) : if True, print the successive joint positions of the players on the grid during the learning cycle. 
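Returns Q0, Q1 : the learned Q tables of player 0 and player 1, as numpy arrays.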
284 | """ 285 | current_state = [self.grid.players[0].position,self.grid.players[1].position] 286 | joint_states = self.grid.joint_states() 287 | player0_movements = self.grid.players[0].movements 288 | player1_movements = self.grid.players[1].movements 289 | stage_games0, stage_games1 = self.grid.create_stage_games() 290 | Q0, Q1 = self.grid.create_q_tables() 291 | transition_table = self.grid.create_transition_table() 292 | state_tracker = [current_state] 293 | 294 | for i in tqdm(range(self.max_iter)): 295 | if current_state == joint_states[-1]: #Both players reached the reward, return to original position 296 | current_state = [self.grid.players[0].position,self.grid.players[1].position] 297 | 298 | if self.decision_strategy == 'random': 299 | m0 = player0_movements[random.randrange(len(player0_movements))] 300 | m1 = player1_movements[random.randrange(len(player1_movements))] 301 | if self.decision_strategy == 'greedy': 302 | greedy_matrix0 = Q0[joint_states.index(current_state)] 303 | greedy_matrix1 = Q1[joint_states.index(current_state)] 304 | greedy_game = nash.Game(greedy_matrix0,greedy_matrix1) 305 | equilibriums = list(greedy_game.support_enumeration()) 306 | greedy_equilibrium = equilibriums[random.randrange(len(equilibriums))] #One random equilibrium 307 | if len(np.where(greedy_equilibrium[0] == 1)[0]) ==0: #No strict equilibrium found 308 | m0 = player0_movements[random.randrange(len(player0_movements))] #Random move 309 | m1 = player1_movements[random.randrange(len(player1_movements))] 310 | else: #Select the movements corresponding to the nash equilibrium 311 | m0 = player0_movements[np.where(greedy_equilibrium[0] == 1)[0][0]] 312 | m1 = player1_movements[np.where(greedy_equilibrium[1] == 1)[0][0]] 313 | if self.decision_strategy == 'epsilon-greedy': 314 | random_number = random.uniform(0,1) 315 | if random_number >= self.epsilon: #greedy 316 | greedy_matrix0 = Q0[joint_states.index(current_state)] 317 | greedy_matrix1 = Q1[joint_states.index(current_state)] 318 | greedy_game = nash.Game(greedy_matrix0,greedy_matrix1) 319 | equilibriums = list(greedy_game.support_enumeration()) 320 | greedy_equilibrium = equilibriums[random.randrange(len(equilibriums))] #One random equilibrium 321 | if len(np.where(greedy_equilibrium[0] == 1)[0]) == 0: #No strict equilibrium found 322 | m0 = player0_movements[random.randrange(len(player0_movements))] #Random move 323 | m1 = player1_movements[random.randrange(len(player1_movements))] 324 | else: #Select the movements corresponding to the nash equilibrium 325 | m0 = player0_movements[np.where(greedy_equilibrium[0] == 1)[0][0]] 326 | m1 = player1_movements[np.where(greedy_equilibrium[1] == 1)[0][0]] 327 | else: #random 328 | m0 = player0_movements[random.randrange(len(player0_movements))] 329 | m1 = player1_movements[random.randrange(len(player1_movements))] 330 | 331 | #Update state 332 | new_state = joint_states[transition_table[joint_states.index(current_state)][m0][m1]] 333 | #Solve Nash equilibrium problem in new state 334 | nash_eq_matrix0 = Q0[joint_states.index(new_state)] 335 | nash_eq_matrix1 = Q1[joint_states.index(new_state)] 336 | game = nash.Game(nash_eq_matrix0,nash_eq_matrix1) 337 | equilibriums = list(game.support_enumeration()) 338 | best_payoff = -np.Inf 339 | equilibrium_values = [] 340 | for eq in equilibriums: 341 | payoff = game[eq][0] + game[eq][1] 342 | if payoff >= best_payoff: 343 | best_payoff = payoff 344 | equilibrium_values = game[eq] 345 | 346 | #Q Tables update formula 347 | 
Q0[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] = ( 348 | (1 - self.learning_rate) * Q0[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] 349 | + self.learning_rate * (stage_games0[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] 350 | + self.discount_factor * equilibrium_values[0]) 351 | ) 352 | 353 | Q1[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] = ( 354 | (1 - self.learning_rate) * Q1[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] 355 | + self.learning_rate * (stage_games1[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] 356 | + self.discount_factor * equilibrium_values[1]) 357 | ) 358 | 359 | current_state = new_state 360 | state_tracker.append(current_state) 361 | 362 | if return_history: 363 | print(state_tracker) 364 | return Q0, Q1 365 | 366 | 367 | def get_best_policy(self, 368 | Q0, 369 | Q1): 370 | """ 371 | Given two Q tables, one for each agent, return their best available path on the grid. 372 | """ 373 | current_state = [self.grid.players[0].position,self.grid.players[1].position] 374 | joint_states = self.grid.joint_states() 375 | transition_table = self.grid.create_transition_table() 376 | player0_movements = self.grid.players[0].movements 377 | player1_movements = self.grid.players[1].movements 378 | policy0 = [] 379 | policy1 = [] 380 | while current_state != joint_states[-1]: #while the reward state is not reached for both agents 381 | print(current_state) 382 | q_state0 = Q0[joint_states.index(current_state)] 383 | q_state1 = Q1[joint_states.index(current_state)] 384 | game = nash.Game(q_state0,q_state1) 385 | equilibriums = list(game.support_enumeration()) 386 | best_payoff = -np.Inf 387 | m0 = 'stay' 388 | m1 = 'stay' 389 | for eq in equilibriums: 390 | if len(np.where(eq[0] == 1)[0]) != 0: #The equilibrium needs to be a strict nash equilibrium (no mixed-strategy) 391 | total_payoff = q_state0[np.where(eq[0]==1)[0][0]][np.where(eq[1]==1)[0][0]] + q_state1[np.where(eq[0]==1)[0][0]][np.where(eq[1]==1)[0][0]] 392 | if total_payoff >= best_payoff and (player0_movements[np.where(eq[0]==1)[0][0]] != 'stay' 393 | or player1_movements[np.where(eq[1]==1)[0][0]] != 'stay'): 394 | #payoff is better and at least one agent is moving 395 | best_payoff = total_payoff 396 | m0 = player0_movements[np.where(eq[0]==1)[0][0]] 397 | m1 = player1_movements[np.where(eq[1]==1)[0][0]] 398 | if current_state[0] != joint_states[-1][0]: 399 | policy0.append(m0) 400 | else : #target reached for player 0 401 | policy0.append('stay') 402 | if current_state[1] != joint_states[-1][1]: 403 | policy1.append(m1) 404 | else: #target reached for player 1 405 | policy1.append('stay') 406 | if current_state != joint_states[transition_table[joint_states.index(current_state)][m0][m1]]: #there was a movement 407 | current_state = joint_states[transition_table[joint_states.index(current_state)][m0][m1]] 408 | else : #No movement, the model did not converge 409 | policy0 = 'model failed to converge to a policy' 410 | policy1 = 'model failed to converge to a policy' 411 | break 412 | print(current_state) 413 | return policy0, policy1 414 | --------------------------------------------------------------------------------