├── requirements.txt
├── img
│   ├── img1.PNG
│   └── img2.PNG
├── LICENSE
├── README.md
├── notebook
│   ├── 4-dim-example.ipynb
│   └── 3-dim-example.ipynb
└── NashQLearn.py
/requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | tqdm 3 | nashpy 4 | -------------------------------------------------------------------------------- /img/img1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jtonglet/Nash-Q-Learning/HEAD/img/img1.PNG -------------------------------------------------------------------------------- /img/img2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jtonglet/Nash-Q-Learning/HEAD/img/img2.PNG -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Jonathan Tonglet 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Nash Q Learning 2 | 3 | Implementation of the Nash Q-Learning algorithm to solve games with two agents, as seen in the course Multiagent Systems @ PoliMi. 4 | The algorithm was first introduced in the paper [**Nash q-learning for general-sum stochastic games**](https://dl.acm.org/doi/10.5555/945365.964288) (Hu, J., Wellman, M.P., 2003). 5 | 6 | Feel free to use it for your own projects or to contribute! 7 | 8 | ## Example 9 | 10 | Consider the following game where two robots need to reach the reward. One obstacle lies in the middle of the grid. The two robots cannot be on the same tile at the same time, except for the reward tile. See this [notebook](https://github.com/jtonglet/Nash_Q_Learning/blob/main/notebook/3-dim-example.ipynb) for a detailed walkthrough. 11 | 12 | 13 | ![](img/img1.PNG) 14 | 15 | 16 | The robots and the game grid are represented by the Player and Grid objects. 
17 | 18 | ```python 19 | from NashQLearn import Player, Grid 20 | #Initialize the two players 21 | player1 = Player([0,0]) 22 | player2 = Player([2,0]) 23 | #Initialize the grid 24 | grid = Grid(length = 3, 25 | width = 3, 26 | players = [player1,player2], 27 | obstacle_coordinates = [[1,1]], 28 | reward_coordinates = [1,2], 29 | reward_value = 20, 30 | collision_penalty = -1) 31 | ``` 32 | 33 | Once the game settings are defined, a NashQLearning object is initialized with the desired hyperparameters and trained on the grid. 34 | 35 | ```python 36 | from NashQLearn import NashQLearning 37 | nashQ = NashQLearning(grid, 38 | max_iter = 2000, 39 | discount_factor = 0.7, 40 | learning_rate = 0.7, 41 | epsilon = 0.5, 42 | decision_strategy = 'epsilon-greedy') #Available strategies : 'random', 'greedy', and 'epsilon-greedy' 43 | #Retrieve the Q tables after fitting the algorithm 44 | Q0, Q1 = nashQ.fit(return_history = False) 45 | #Best path followed by each player given the values in the Q tables 46 | p0, p1 = nashQ.get_best_policy(Q0,Q1) 47 | 48 | #Show the results 49 | print('Player 0 follows the policy : %s of length %s.'%('-'.join(p0),len(p0))) 50 | >>> Player 0 follows the policy : up-up-right of length 3. 51 | print('Player 1 follows the policy : %s of length %s.'%('-'.join(p1),len(p1))) 52 | >>> Player 1 follows the policy : up-up-left of length 3. 53 | ``` 54 | In this case, the joint optimal policy was found by the algorithm, as shown on the figure below. 55 | ![](img/img2.PNG) 56 | 57 | 58 | ## Requirements 59 | 60 | - python>=3.7 61 | - numpy 62 | - tqdm 63 | - nashpy 64 | 65 | The package nashpy is used to compute the Nash equilibrium for each stage game during the learning process. 66 | -------------------------------------------------------------------------------- /notebook/4-dim-example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "c409679b", 6 | "metadata": {}, 7 | "source": [ 8 | "## 4-dim grid example" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "b1720590", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "#Load packages\n", 19 | "from NashQLearn import Player, Grid, NashQLearning\n", 20 | "import warnings\n", 21 | "warnings.filterwarnings('ignore')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "c116587f", 27 | "metadata": {}, 28 | "source": [ 29 | "This notebook applies the Nash Q Learning algorithm to the following multiagent problem : two robots placed on a grid need to reach the reward. Robots are allowed to move up, down, to the left, and to the right, or to stay at their current position. 
Robots are not allowed to be on the same tile unless it is the reward tile.\n", 30 | "\n", 31 | "### Prepare the game environment" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 2, 37 | "id": "b6a6e69e", 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "#Initialize the two players\n", 42 | "player1 = Player([3,0])\n", 43 | "player2 = Player([2,2])" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 3, 49 | "id": "3b91d86b", 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "#Initialize the grid\n", 54 | "grid = Grid(length = 4,\n", 55 | " width = 4,\n", 56 | " players = [player1,player2],\n", 57 | " obstacle_coordinates = [[0,0], [1,0],[1,2],[1,3],[0,3],[2,3],[3,1]],\n", 58 | " reward_coordinates = [0,2],\n", 59 | " reward_value = 20,\n", 60 | " collision_penalty = -1)" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 4, 66 | "id": "abdc9942", 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "Available joint states : 73\n", 74 | "[[[0, 1], [0, 2]], [[0, 1], [1, 1]], [[0, 1], [2, 0]], [[0, 1], [2, 1]], [[0, 1], [2, 2]], [[0, 1], [3, 0]], [[0, 1], [3, 2]], [[0, 1], [3, 3]], [[0, 2], [0, 1]], [[0, 2], [1, 1]], [[0, 2], [2, 0]], [[0, 2], [2, 1]], [[0, 2], [2, 2]], [[0, 2], [3, 0]], [[0, 2], [3, 2]], [[0, 2], [3, 3]], [[1, 1], [0, 1]], [[1, 1], [0, 2]], [[1, 1], [2, 0]], [[1, 1], [2, 1]], [[1, 1], [2, 2]], [[1, 1], [3, 0]], [[1, 1], [3, 2]], [[1, 1], [3, 3]], [[2, 0], [0, 1]], [[2, 0], [0, 2]], [[2, 0], [1, 1]], [[2, 0], [2, 1]], [[2, 0], [2, 2]], [[2, 0], [3, 0]], [[2, 0], [3, 2]], [[2, 0], [3, 3]], [[2, 1], [0, 1]], [[2, 1], [0, 2]], [[2, 1], [1, 1]], [[2, 1], [2, 0]], [[2, 1], [2, 2]], [[2, 1], [3, 0]], [[2, 1], [3, 2]], [[2, 1], [3, 3]], [[2, 2], [0, 1]], [[2, 2], [0, 2]], [[2, 2], [1, 1]], [[2, 2], [2, 0]], [[2, 2], [2, 1]], [[2, 2], [3, 0]], [[2, 2], [3, 2]], [[2, 2], [3, 3]], [[3, 0], [0, 1]], [[3, 0], [0, 2]], [[3, 0], [1, 1]], [[3, 0], [2, 0]], [[3, 0], [2, 1]], [[3, 0], [2, 2]], [[3, 0], [3, 2]], [[3, 0], [3, 3]], [[3, 2], [0, 1]], [[3, 2], [0, 2]], [[3, 2], [1, 1]], [[3, 2], [2, 0]], [[3, 2], [2, 1]], [[3, 2], [2, 2]], [[3, 2], [3, 0]], [[3, 2], [3, 3]], [[3, 3], [0, 1]], [[3, 3], [0, 2]], [[3, 3], [1, 1]], [[3, 3], [2, 0]], [[3, 3], [2, 1]], [[3, 3], [2, 2]], [[3, 3], [3, 0]], [[3, 3], [3, 2]], [[0, 2], [0, 2]]]\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "joint_states = grid.joint_states()\n", 80 | "print('Available joint states : %s'%len(joint_states))#Correct\n", 81 | "print(joint_states)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 5, 87 | "id": "10282bf9", 88 | "metadata": {}, 89 | "outputs": [ 90 | { 91 | "data": { 92 | "text/plain": [ 93 | "[['left', [0, 1]],\n", 94 | " ['down', [0, 1]],\n", 95 | " ['left', [0, 2]],\n", 96 | " ['right', [0, 2]],\n", 97 | " ['up', [0, 2]],\n", 98 | " ['up', [1, 1]],\n", 99 | " ['down', [1, 1]],\n", 100 | " ['left', [2, 0]],\n", 101 | " ['down', [2, 0]],\n", 102 | " ['right', [2, 1]],\n", 103 | " ['left', [2, 2]],\n", 104 | " ['up', [2, 2]],\n", 105 | " ['right', [3, 0]],\n", 106 | " ['up', [3, 0]],\n", 107 | " ['down', [3, 0]],\n", 108 | " ['right', [3, 2]],\n", 109 | " ['down', [3, 2]],\n", 110 | " ['left', [3, 3]],\n", 111 | " ['right', [3, 3]],\n", 112 | " ['up', [3, 3]]]" 113 | ] 114 | }, 115 | "execution_count": 5, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "walls = 
grid.identify_walls()\n", 122 | "walls " 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "6c924a07", 128 | "metadata": {}, 129 | "source": [ 130 | "### Run the Nash Q Learning algorithm" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 6, 136 | "id": "28406b6d", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "nashQ = NashQLearning(grid, \n", 141 | " max_iter = 2000,\n", 142 | " discount_factor = 0.9,\n", 143 | " learning_rate = 0.7,\n", 144 | " epsilon = 0.4,\n", 145 | " decision_strategy = 'epsilon-greedy')" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 7, 151 | "id": "c7c49ebe", 152 | "metadata": { 153 | "scrolled": true 154 | }, 155 | "outputs": [ 156 | { 157 | "name": "stderr", 158 | "output_type": "stream", 159 | "text": [ 160 | "100%|████████████████████████████████████████████████████████████████████████████████| 73/73 [00:00<00:00, 2437.05it/s]\n", 161 | "100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [09:04<00:00, 3.68it/s]\n" 162 | ] 163 | } 164 | ], 165 | "source": [ 166 | "#Retrieve the updated Q matrix after fitting the algorithm\n", 167 | "Q0, Q1 = nashQ.fit(return_history = False)" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 8, 173 | "id": "aa2f5917", 174 | "metadata": {}, 175 | "outputs": [ 176 | { 177 | "name": "stdout", 178 | "output_type": "stream", 179 | "text": [ 180 | "[[3, 0], [2, 2]]\n", 181 | "[[2, 0], [2, 1]]\n", 182 | "[[3, 0], [1, 1]]\n", 183 | "[[2, 0], [0, 1]]\n", 184 | "[[2, 1], [0, 2]]\n", 185 | "[[1, 1], [0, 2]]\n", 186 | "[[0, 1], [0, 2]]\n", 187 | "[[0, 2], [0, 2]]\n" 188 | ] 189 | } 190 | ], 191 | "source": [ 192 | "#Best path followed by each player given the values in the q tables\n", 193 | "p0, p1 = nashQ.get_best_policy(Q0,Q1)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 9, 199 | "id": "85a96bc4", 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "Player 0 follows the policy : left-right-left-up-left-left-up of length 7\n", 207 | "Player 1 follows the policy : down-left-left-up-stay-stay-stay of length 7\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "print('Player 0 follows the policy : %s of length %s' %('-'.join(p0),len(p0)))\n", 213 | "print('Player 1 follows the policy : %s of length %s'%('-'.join(p1),len(p1)))" 214 | ] 215 | } 216 | ], 217 | "metadata": { 218 | "kernelspec": { 219 | "display_name": "Python 3 (ipykernel)", 220 | "language": "python", 221 | "name": "python3" 222 | }, 223 | "language_info": { 224 | "codemirror_mode": { 225 | "name": "ipython", 226 | "version": 3 227 | }, 228 | "file_extension": ".py", 229 | "mimetype": "text/x-python", 230 | "name": "python", 231 | "nbconvert_exporter": "python", 232 | "pygments_lexer": "ipython3", 233 | "version": "3.7.13" 234 | } 235 | }, 236 | "nbformat": 4, 237 | "nbformat_minor": 5 238 | } 239 | -------------------------------------------------------------------------------- /notebook/3-dim-example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "7cf87dfd", 6 | "metadata": {}, 7 | "source": [ 8 | "## 3-dim grid example" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "b1720590", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "#Load packages\n", 19 
| "from NashQLearn import Player, Grid, NashQLearning\n", 20 | "import warnings\n", 21 | "warnings.filterwarnings('ignore')" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "1cf74196", 27 | "metadata": {}, 28 | "source": [ 29 | "This notebook applies the Nash Q Learning algorithm to the following multiagent problem : two robots placed on a grid need to reach the reward. Robots are allowed to move up, down, to the left, and to the right, or to stay at their current position. \n", 30 | "Robots are not allowed to be on the same tile unless it is the reward tile.\n", 31 | "\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "id": "14455cf6", 37 | "metadata": {}, 38 | "source": [ 39 | "### Prepare the game environment" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "id": "b6a6e69e", 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "#Initialize the two players\n", 50 | "player1 = Player([0,0])\n", 51 | "player2 = Player([2,0])" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 3, 57 | "id": "3b91d86b", 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "#Initialize the grid\n", 62 | "grid = Grid(length = 3,\n", 63 | " width = 3,\n", 64 | " players = [player1,player2],\n", 65 | " obstacle_coordinates = [[1,1]], #A single obstacle in the middle of the grid\n", 66 | " reward_coordinates = [1,2],\n", 67 | " reward_value = 20,\n", 68 | " collision_penalty = -1)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 4, 74 | "id": "abdc9942", 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "name": "stdout", 79 | "output_type": "stream", 80 | "text": [ 81 | "Available joint states : 57\n", 82 | "[[[0, 0], [0, 1]], [[0, 0], [0, 2]], [[0, 0], [1, 0]], [[0, 0], [1, 2]], [[0, 0], [2, 0]], [[0, 0], [2, 1]], [[0, 0], [2, 2]], [[0, 1], [0, 0]], [[0, 1], [0, 2]], [[0, 1], [1, 0]], [[0, 1], [1, 2]], [[0, 1], [2, 0]], [[0, 1], [2, 1]], [[0, 1], [2, 2]], [[0, 2], [0, 0]], [[0, 2], [0, 1]], [[0, 2], [1, 0]], [[0, 2], [1, 2]], [[0, 2], [2, 0]], [[0, 2], [2, 1]], [[0, 2], [2, 2]], [[1, 0], [0, 0]], [[1, 0], [0, 1]], [[1, 0], [0, 2]], [[1, 0], [1, 2]], [[1, 0], [2, 0]], [[1, 0], [2, 1]], [[1, 0], [2, 2]], [[1, 2], [0, 0]], [[1, 2], [0, 1]], [[1, 2], [0, 2]], [[1, 2], [1, 0]], [[1, 2], [2, 0]], [[1, 2], [2, 1]], [[1, 2], [2, 2]], [[2, 0], [0, 0]], [[2, 0], [0, 1]], [[2, 0], [0, 2]], [[2, 0], [1, 0]], [[2, 0], [1, 2]], [[2, 0], [2, 1]], [[2, 0], [2, 2]], [[2, 1], [0, 0]], [[2, 1], [0, 1]], [[2, 1], [0, 2]], [[2, 1], [1, 0]], [[2, 1], [1, 2]], [[2, 1], [2, 0]], [[2, 1], [2, 2]], [[2, 2], [0, 0]], [[2, 2], [0, 1]], [[2, 2], [0, 2]], [[2, 2], [1, 0]], [[2, 2], [1, 2]], [[2, 2], [2, 0]], [[2, 2], [2, 1]], [[1, 2], [1, 2]]]\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "joint_states = grid.joint_states()\n", 88 | "print('Available joint states : %s'%len(joint_states))\n", 89 | "print(joint_states)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 5, 95 | "id": "10282bf9", 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "data": { 100 | "text/plain": [ 101 | "[['left', [0, 0]],\n", 102 | " ['down', [0, 0]],\n", 103 | " ['left', [0, 1]],\n", 104 | " ['right', [0, 1]],\n", 105 | " ['left', [0, 2]],\n", 106 | " ['up', [0, 2]],\n", 107 | " ['up', [1, 0]],\n", 108 | " ['down', [1, 0]],\n", 109 | " ['up', [1, 2]],\n", 110 | " ['down', [1, 2]],\n", 111 | " ['right', [2, 0]],\n", 112 | " ['down', [2, 0]],\n", 113 | " ['left', [2, 1]],\n", 114 | " ['right', [2, 1]],\n", 115 | " 
['right', [2, 2]],\n", 116 | " ['up', [2, 2]]]" 117 | ] 118 | }, 119 | "execution_count": 5, 120 | "metadata": {}, 121 | "output_type": "execute_result" 122 | } 123 | ], 124 | "source": [ 125 | "walls = grid.identify_walls()\n", 126 | "walls" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "id": "4807410f", 132 | "metadata": {}, 133 | "source": [ 134 | "### Run the Nash Q Learning algorithm\n", 135 | "\n", 136 | "The efficiency of the algorithm depends on a set of parameters (the max number of iterations, the discount factor, the learning rate, ...) which may require some tuning. In general, the epsilon-greedy decision strategy outperforms both the random and greedy strategies." 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 11, 142 | "id": "28406b6d", 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "nashQ = NashQLearning(grid, \n", 147 | " max_iter = 2000,\n", 148 | " discount_factor = 0.7,\n", 149 | " learning_rate = 0.7,\n", 150 | " epsilon = 0.5,\n", 151 | " decision_strategy = 'epsilon-greedy')" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 12, 157 | "id": "c7c49ebe", 158 | "metadata": { 159 | "scrolled": true 160 | }, 161 | "outputs": [ 162 | { 163 | "name": "stderr", 164 | "output_type": "stream", 165 | "text": [ 166 | "100%|████████████████████████████████████████████████████████████████████████████████| 57/57 [00:00<00:00, 2716.77it/s]\n", 167 | "100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [08:18<00:00, 4.01it/s]\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "#Retrieve the updated Q matrix after fitting the algorithm\n", 173 | "Q0, Q1 = nashQ.fit(return_history = False)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 13, 179 | "id": "8ae26f6e", 180 | "metadata": {}, 181 | "outputs": [ 182 | { 183 | "name": "stdout", 184 | "output_type": "stream", 185 | "text": [ 186 | "[[0, 0], [2, 0]]\n", 187 | "[[0, 1], [2, 1]]\n", 188 | "[[0, 2], [2, 2]]\n", 189 | "[[1, 2], [1, 2]]\n" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "#Best path followed by each player given the values in the q tables\n", 195 | "p0, p1 = nashQ.get_best_policy(Q0,Q1)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 14, 201 | "id": "e576d4ff", 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "name": "stdout", 206 | "output_type": "stream", 207 | "text": [ 208 | "Player 0 follows the policy : up-up-right of length 3\n", 209 | "Player 1 follows the policy : up-up-left of length 3\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "print('Player 0 follows the policy : %s of length %s' %('-'.join(p0),len(p0)))\n", 215 | "print('Player 1 follows the policy : %s of length %s'%('-'.join(p1),len(p1)))" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "id": "7c4b401a", 221 | "metadata": {}, 222 | "source": [ 223 | "In this experiment, the two players have successfully identified the optimal path to the reward.\n" 224 | ] 225 | } 226 | ], 227 | "metadata": { 228 | "kernelspec": { 229 | "display_name": "Python 3 (ipykernel)", 230 | "language": "python", 231 | "name": "python3" 232 | }, 233 | "language_info": { 234 | "codemirror_mode": { 235 | "name": "ipython", 236 | "version": 3 237 | }, 238 | "file_extension": ".py", 239 | "mimetype": "text/x-python", 240 | "name": "python", 241 | "nbconvert_exporter": "python", 242 | "pygments_lexer": "ipython3", 243 | "version": "3.7.13" 244 | } 245 | }, 246 | 
"nbformat": 4, 247 | "nbformat_minor": 5 248 | } 249 | -------------------------------------------------------------------------------- /NashQLearn.py: -------------------------------------------------------------------------------- 1 | #Implementation of the Nash Q Learning algorithm for simple games with two agents 2 | 3 | import numpy as np 4 | import random 5 | from collections import defaultdict 6 | import nashpy as nash 7 | from tqdm import tqdm 8 | 9 | 10 | class Player: 11 | def __init__(self, 12 | position = [0,0], 13 | movements = ['left','right','up','down','stay'] 14 | ): 15 | ''' 16 | This class is a representation of a player. 17 | 18 | position (list) : list of two integers giving the starting position coordinates of the player 19 | movements (list) : list of strings containing the possible movements. 20 | ''' 21 | self.movements = movements 22 | self.position = position 23 | 24 | 25 | def move(self, movement): 26 | ''' 27 | Compute the new position of a player after performing a movement. 28 | movement (string) : the movement to perform. Invalid string values are interpreted as the 'stay' movement. 29 | ''' 30 | if movement == 'left' and 'left' in self.movements: 31 | new_position = [self.position[0] - 1, self.position[1]] 32 | elif movement == 'right' and 'right' in self.movements: 33 | new_position = [self.position[0] + 1, self.position[1]] 34 | elif movement == 'up' and 'up' in self.movements: 35 | new_position = [self.position[0], self.position[1] + 1] 36 | elif movement == 'down' and 'down' in self.movements: 37 | new_position = [self.position[0], self.position[1] - 1] 38 | else: 39 | new_position = self.position 40 | 41 | return new_position 42 | 43 | 44 | class Grid: 45 | def __init__(self, 46 | length = 2, 47 | width = 2, 48 | players = [Player(),Player()], 49 | reward_coordinates = [1,1], 50 | reward_value = 20, 51 | obstacle_coordinates = [], 52 | collision_allowed = False, 53 | collision_penalty = 0): 54 | ''' 55 | This class is a representation of the game grid. 56 | 57 | length (int) : horizontal dimension of the grid 58 | width (int) : vertical dimension of the grid 59 | players (list) : list of 2 Player objects. 60 | reward_coordinates (list) : list of 2 integers giving the coordinates of the reward 61 | reward_value (int) : value obtained by reaching the reward coordinates 62 | obstacle_coordinates (list) : list of obstacle coordinates. Each obstacle coordinate is a list of 2 integers giving their coordinates 63 | collision_allowed (bool) : whether agents are allowed to be at the same time on the same cell or not 64 | collision_penalty (int) : negative reward obtained for hitting a wall or colliding with another player 65 | joint_player_coordinates (list) : list containing the starting positions of the two players 66 | ''' 67 | self.length = length 68 | self.width = width 69 | self.players = players 70 | self.reward_coordinates = reward_coordinates 71 | self.reward_value = reward_value 72 | self.obstacle_coordinates = obstacle_coordinates 73 | self.collision_allowed = collision_allowed 74 | self.collision_penalty = collision_penalty 75 | self.joint_player_coordinates = [players[0].position, players[1].position] 76 | 77 | 78 | def get_player_0(self): 79 | return self.players[0] 80 | 81 | 82 | def get_player_1(self): 83 | return self.players[1] 84 | 85 | 86 | def joint_states(self): 87 | ''' 88 | Returns a list of all possible joint states in the game. 
89 | ''' 90 | if not self.collision_allowed: 91 | #Agents are only allowed to collide on the reward cell, whether they arrive there at the same time or not 92 | joint_states = [[[i,j], 93 | [k,l]] for i in range(self.length) for j in range(self.width) 94 | for k in range(self.length) for l in range(self.width) 95 | if [i,j] != [k,l] and [i,j] not in self.obstacle_coordinates and [k,l] not in self.obstacle_coordinates 96 | ] 97 | joint_states.append([self.reward_coordinates,self.reward_coordinates]) #Add the reward state as joint state 98 | 99 | else: #Agents can collide on any cell, but they can't move to an obstacle 100 | joint_states = [[[i,j], 101 | [k,l]] for i in range(self.length) for j in range(self.width) 102 | for k in range(self.length) for l in range(self.width) 103 | if [i,j] not in self.obstacle_coordinates and [k,l] not in self.obstacle_coordinates 104 | ] 105 | 106 | return joint_states 107 | 108 | 109 | def identify_walls(self): 110 | ''' 111 | Identify all impossible transitions due to the grid walls and the obstacles 112 | ''' 113 | walls = [] 114 | for i in range(self.length): 115 | for j in range(self.width): 116 | if [i,j] not in self.obstacle_coordinates: 117 | fictitious_player = Player(position = [i,j]) #Used to explore the grid in search of walls 118 | if fictitious_player.move('left')[0] not in range(self.length) or fictitious_player.move('left') in self.obstacle_coordinates: 119 | walls.append(['left',fictitious_player.position]) 120 | if fictitious_player.move('right')[0] not in range(self.length) or fictitious_player.move('right') in self.obstacle_coordinates: 121 | walls.append(['right',fictitious_player.position]) 122 | if fictitious_player.move('up')[1] not in range(self.width) or fictitious_player.move('up') in self.obstacle_coordinates: 123 | walls.append(['up',fictitious_player.position]) 124 | if fictitious_player.move('down')[1] not in range(self.width) or fictitious_player.move('down') in self.obstacle_coordinates: 125 | walls.append(['down',fictitious_player.position]) 126 | 127 | return walls 128 | 129 | 130 | def compute_reward(self, 131 | old_state, 132 | new_state, 133 | movement, 134 | collision_detected = False): 135 | ''' 136 | Compute the reward obtained by a player for transitioning from its old state to its new state 137 | ''' 138 | if old_state == self.reward_coordinates: #Stop receiving rewards once the goal is reached 139 | reward = 0 140 | elif new_state == self.reward_coordinates: #The goal state is reached for the first time 141 | reward = self.reward_value 142 | elif new_state == old_state and movement != 'stay': #The player moved and bumped into another player or an obstacle 143 | reward = self.collision_penalty 144 | elif movement == 'stay' and collision_detected: #The player stayed but was hit by another player 145 | reward = self.collision_penalty 146 | else: # The player made a regular valid movement 147 | reward = 0 148 | return reward 149 | 150 | 151 | def create_transition_table(self): 152 | ''' 153 | Creates a dictionary where each pair of joint state and joint movement is mapped to a new resulting joint state 154 | ''' 155 | recursivedict = lambda : defaultdict(recursivedict) 156 | transitions = recursivedict() 157 | joint_states = self.joint_states() 158 | walls = self.identify_walls() 159 | player0_movements = self.players[0].movements 160 | player1_movements = self.players[1].movements 161 | 162 | for state in joint_states: 163 | for m0 in player0_movements: 164 | for m1 in player1_movements: 165 | if [m1,state[1]] in walls or state[1] == 
self.reward_coordinates: 166 | if [m0,state[0]] in walls or state[0] == self.reward_coordinates: 167 | new_state = state 168 | else : 169 | new_state = [Player(state[0]).move(m0),state[1]] 170 | else: 171 | if [m0,state[0]] in walls or state[0] == self.reward_coordinates: 172 | new_state = [state[0],Player(state[1]).move(m1)] 173 | else: 174 | new_state = [Player(state[0]).move(m0),Player(state[1]).move(m1)] 175 | if (new_state[0] == state[1] and new_state[1] == state[0] ) or new_state not in joint_states: 176 | # There is a collision or a swap of positions 177 | new_state = state #Return to previous state 178 | transitions[joint_states.index(state)][m0][m1] = joint_states.index(new_state) 179 | 180 | return transitions 181 | 182 | 183 | def create_stage_games(self): 184 | ''' 185 | Creates the stage game tables which contain the reward obtained by the players for each pair of joint states and joint movements. 186 | The stage game tables are represented as 3-dimensional tensors. 187 | ''' 188 | joint_states = self.joint_states() 189 | walls = self.identify_walls() 190 | player0_movements = self.players[0].movements 191 | player1_movements = self.players[1].movements 192 | 193 | stage_games0 = np.zeros((len(joint_states), 194 | len(player0_movements), 195 | len(player1_movements), 196 | )) 197 | 198 | stage_games1 = np.zeros((len(joint_states), 199 | len(player0_movements), 200 | len(player1_movements), 201 | )) 202 | for state in tqdm(joint_states): 203 | for m0 in player0_movements: 204 | for m1 in player1_movements: 205 | if [m1,state[1]] in walls: 206 | if [m0,state[0]] not in walls: 207 | new_state = [Player(state[0]).move(m0),state[1]] 208 | else : 209 | new_state = state 210 | else: 211 | if [m0,state[0]] in walls: 212 | new_state = [state[0],Player(state[1]).move(m1)] 213 | else: 214 | new_state = [Player(state[0]).move(m0),Player(state[1]).move(m1)] 215 | collision_detected = False 216 | if (new_state[0] == state[1] and new_state[1] == state[0] ) or new_state not in joint_states: 217 | # There is a collision 218 | new_state = state #Return to previous state 219 | collision_detected = True 220 | 221 | reward0 = self.compute_reward(state[0],new_state[0],m0,collision_detected) 222 | reward1 = self.compute_reward(state[1],new_state[1],m1,collision_detected) 223 | stage_games0[joint_states.index(state)][player0_movements.index(m0)][player1_movements.index(m1)] = reward0 224 | stage_games1[joint_states.index(state)][player0_movements.index(m0)][player1_movements.index(m1)] = reward1 225 | 226 | return stage_games0, stage_games1 227 | 228 | 229 | def create_q_tables(self): 230 | ''' 231 | Creates the Q tables which contain the Q-values used by the Nash Q-Learning algorithm. 232 | The Q tables are represented as 3-dimensional tensors and are initialized with zeros. 
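The first dimension indexes the joint states, the second the movements of player 0, and the third the movements of player 1.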
233 | ''' 234 | joint_states = self.joint_states() 235 | player0_movements = self.players[0].movements 236 | player1_movements = self.players[1].movements 237 | q_tables0 = np.zeros((len(joint_states), 238 | len(player0_movements), 239 | len(player1_movements) 240 | )) 241 | q_tables1 = np.zeros((len(joint_states), 242 | len(player0_movements), 243 | len(player1_movements) 244 | )) 245 | 246 | return q_tables0, q_tables1 247 | 248 | 249 | class NashQLearning: 250 | 251 | def __init__(self, 252 | grid = Grid(), 253 | learning_rate = 0.5, 254 | max_iter = 100, 255 | discount_factor = 0.7, 256 | decision_strategy = 'random', 257 | epsilon = 0.5, 258 | random_state = 42): 259 | ''' 260 | This class represents an instance of the Nash Q-Learning algorithm 261 | 262 | grid (Grid) : the game grid 263 | learning_rate (float) : weight given to the newly computed Q-value relative to its current value in the update 264 | max_iter (int) : max number of iterations of the algorithm 265 | discount_factor (float) : discount factor applied to the Nash equilibrium value in the Q-value update formula 266 | decision_strategy (str) : decision strategy applied to select the next movement, possible values are 'random','greedy','epsilon-greedy' 267 | epsilon (float) : only used if decision_strategy is 'epsilon-greedy', threshold to decide between a greedy and a random movement 268 | random_state (int) : seed for results reproducibility 269 | ''' 270 | self.grid = grid 271 | self.learning_rate = learning_rate 272 | self.max_iter = max_iter 273 | self.discount_factor = discount_factor 274 | self.decision_strategy = decision_strategy 275 | self.epsilon = epsilon 276 | random.seed(random_state) 277 | 278 | 279 | def fit(self, 280 | return_history = False): 281 | """ 282 | Fit the Nash Q Learning algorithm on the grid and return one Q table per player. 283 | return_history (bool) : if True, print the successive joint positions of the players on the grid during the learning cycle. 
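Returns Q0, Q1 : the learned Q tables of player 0 and player 1, as numpy arrays.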
284 | """ 285 | current_state = [self.grid.players[0].position,self.grid.players[1].position] 286 | joint_states = self.grid.joint_states() 287 | player0_movements = self.grid.players[0].movements 288 | player1_movements = self.grid.players[1].movements 289 | stage_games0, stage_games1 = self.grid.create_stage_games() 290 | Q0, Q1 = self.grid.create_q_tables() 291 | transition_table = self.grid.create_transition_table() 292 | state_tracker = [current_state] 293 | 294 | for i in tqdm(range(self.max_iter)): 295 | if current_state == joint_states[-1]: #Both players reached the reward, return to original position 296 | current_state = [self.grid.players[0].position,self.grid.players[1].position] 297 | 298 | if self.decision_strategy == 'random': 299 | m0 = player0_movements[random.randrange(len(player0_movements))] 300 | m1 = player1_movements[random.randrange(len(player1_movements))] 301 | if self.decision_strategy == 'greedy': 302 | greedy_matrix0 = Q0[joint_states.index(current_state)] 303 | greedy_matrix1 = Q1[joint_states.index(current_state)] 304 | greedy_game = nash.Game(greedy_matrix0,greedy_matrix1) 305 | equilibriums = list(greedy_game.support_enumeration()) 306 | greedy_equilibrium = equilibriums[random.randrange(len(equilibriums))] #One random equilibrium 307 | if len(np.where(greedy_equilibrium[0] == 1)[0]) ==0: #No strict equilibrium found 308 | m0 = player0_movements[random.randrange(len(player0_movements))] #Random move 309 | m1 = player1_movements[random.randrange(len(player1_movements))] 310 | else: #Select the movements corresponding to the nash equilibrium 311 | m0 = player0_movements[np.where(greedy_equilibrium[0] == 1)[0][0]] 312 | m1 = player1_movements[np.where(greedy_equilibrium[1] == 1)[0][0]] 313 | if self.decision_strategy == 'epsilon-greedy': 314 | random_number = random.uniform(0,1) 315 | if random_number >= self.epsilon: #greedy 316 | greedy_matrix0 = Q0[joint_states.index(current_state)] 317 | greedy_matrix1 = Q1[joint_states.index(current_state)] 318 | greedy_game = nash.Game(greedy_matrix0,greedy_matrix1) 319 | equilibriums = list(greedy_game.support_enumeration()) 320 | greedy_equilibrium = equilibriums[random.randrange(len(equilibriums))] #One random equilibrium 321 | if len(np.where(greedy_equilibrium[0] == 1)[0]) == 0: #No strict equilibrium found 322 | m0 = player0_movements[random.randrange(len(player0_movements))] #Random move 323 | m1 = player1_movements[random.randrange(len(player1_movements))] 324 | else: #Select the movements corresponding to the nash equilibrium 325 | m0 = player0_movements[np.where(greedy_equilibrium[0] == 1)[0][0]] 326 | m1 = player1_movements[np.where(greedy_equilibrium[1] == 1)[0][0]] 327 | else: #random 328 | m0 = player0_movements[random.randrange(len(player0_movements))] 329 | m1 = player1_movements[random.randrange(len(player1_movements))] 330 | 331 | #Update state 332 | new_state = joint_states[transition_table[joint_states.index(current_state)][m0][m1]] 333 | #Solve Nash equilibrium problem in new state 334 | nash_eq_matrix0 = Q0[joint_states.index(new_state)] 335 | nash_eq_matrix1 = Q1[joint_states.index(new_state)] 336 | game = nash.Game(nash_eq_matrix0,nash_eq_matrix1) 337 | equilibriums = list(game.support_enumeration()) 338 | best_payoff = -np.Inf 339 | equilibrium_values = [] 340 | for eq in equilibriums: 341 | payoff = game[eq][0] + game[eq][1] 342 | if payoff >= best_payoff: 343 | best_payoff = payoff 344 | equilibrium_values = game[eq] 345 | 346 | #Q Tables update formula 347 | 
Q0[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] = ( 348 | (1 - self.learning_rate) * Q0[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] 349 | + self.learning_rate * (stage_games0[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] 350 | + self.discount_factor * equilibrium_values[0]) 351 | ) 352 | 353 | Q1[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] = ( 354 | (1 - self.learning_rate) * Q1[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] 355 | + self.learning_rate * (stage_games1[joint_states.index(current_state)][player0_movements.index(m0)][player1_movements.index(m1)] 356 | + self.discount_factor * equilibrium_values[1]) 357 | ) 358 | 359 | current_state = new_state 360 | state_tracker.append(current_state) 361 | 362 | if return_history: 363 | print(state_tracker) 364 | return Q0, Q1 365 | 366 | 367 | def get_best_policy(self, 368 | Q0, 369 | Q1): 370 | """ 371 | Given two Q tables, one for each agent, return their best available path on the grid. 372 | """ 373 | current_state = [self.grid.players[0].position,self.grid.players[1].position] 374 | joint_states = self.grid.joint_states() 375 | transition_table = self.grid.create_transition_table() 376 | player0_movements = self.grid.players[0].movements 377 | player1_movements = self.grid.players[1].movements 378 | policy0 = [] 379 | policy1 = [] 380 | while current_state != joint_states[-1]: #while the reward state is not reached for both agents 381 | print(current_state) 382 | q_state0 = Q0[joint_states.index(current_state)] 383 | q_state1 = Q1[joint_states.index(current_state)] 384 | game = nash.Game(q_state0,q_state1) 385 | equilibriums = list(game.support_enumeration()) 386 | best_payoff = -np.Inf 387 | m0 = 'stay' 388 | m1 = 'stay' 389 | for eq in equilibriums: 390 | if len(np.where(eq[0] == 1)[0]) != 0: #The equilibrium needs to be a strict nash equilibrium (no mixed-strategy) 391 | total_payoff = q_state0[np.where(eq[0]==1)[0][0]][np.where(eq[1]==1)[0][0]] + q_state1[np.where(eq[0]==1)[0][0]][np.where(eq[1]==1)[0][0]] 392 | if total_payoff >= best_payoff and (player0_movements[np.where(eq[0]==1)[0][0]] != 'stay' 393 | or player1_movements[np.where(eq[1]==1)[0][0]] != 'stay'): 394 | #payoff is better and at least one agent is moving 395 | best_payoff = total_payoff 396 | m0 = player0_movements[np.where(eq[0]==1)[0][0]] 397 | m1 = player1_movements[np.where(eq[1]==1)[0][0]] 398 | if current_state[0] != joint_states[-1][0]: 399 | policy0.append(m0) 400 | else : #target reached for player 0 401 | policy0.append('stay') 402 | if current_state[1] != joint_states[-1][1]: 403 | policy1.append(m1) 404 | else: #target reached for player 1 405 | policy1.append('stay') 406 | if current_state != joint_states[transition_table[joint_states.index(current_state)][m0][m1]]: #there was a movement 407 | current_state = joint_states[transition_table[joint_states.index(current_state)][m0][m1]] 408 | else : #No movement, the model did not converge 409 | policy0 = 'model failed to converge to a policy' 410 | policy1 = 'model failed to converge to a policy' 411 | break 412 | print(current_state) 413 | return policy0, policy1 414 | --------------------------------------------------------------------------------