├── LICENSE
├── README.md
└── ttt.c

/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2025-future, Salvatore Sanfilippo

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Tic Tac Toe with Reinforcement Learning

*The only winning move is not to play*

This code implements a neural network that learns to play tic-tac-toe using
reinforcement learning, simply by playing against a random adversary, in
**under 400 lines of C code**, without using any external library. There are
many examples of RL out there, written in PyTorch or with other ML frameworks;
however, what I wanted to accomplish here was to have the whole thing
implemented from scratch, so that each part is understandable and self-evident.

While the code is a toy, designed to help interested people learn the basics
of reinforcement learning, it actually showcases the power of RL in learning
things without any pre-existing clue about the game:

1. It uses a cold start: the neural network is initialized with random weights.
2. *Tabula rasa* learning: no knowledge of the game is put into the program, other than the fact that X or O can't be placed on an already used tile, and the fact that the game is won when there are three identical symbols in a row, or tied when all the tiles are used.
3. The only signal used to train the neural network is the reward of the game: win, lose, tie.

In my past experience with [the Kilo editor](https://github.com/antirez/kilo)
and the [Picol interpreter](https://github.com/antirez/picol) I noticed that
for programmers who want to understand new fields (especially young
programmers), small C programs without dependencies, clearly written,
commented and *very short*, are a good starting point. So, in order to
celebrate the Turing Award assigned to Sutton and Barto, I thought of writing
this one.

To try this program, compile with:

    cc ttt.c -o ttt -O3 -Wall -W -ffast-math -lm

Then run it with:

    ./ttt

By default, the program plays against a random opponent (an opponent that just
throws "X" at random free tiles at each move) for 150k games. Then it starts a
CLI interface to play against the human user. You can specify how many games
it should play against the random opponent before playing with the human by
passing the number as the first argument:

    ./ttt 2000000

With 2 million games (a few seconds are enough) it usually no longer loses
a game.

After enough training games against the random opponent, the program achieves
what is likely perfect play:

    Games: 2000000, Wins: 1756049 (87.8%)
    Losses: 731 (0.0%)
    Ties: 243220 (12.2%)

Note that some runs are luckier than others, likely because of the random
weights initialization of the neural network. Run the program multiple times
if you can't reach 0 losses.

# How it works

The code tries hard to be simple, and is well commented, with about one line
of comments for every two lines of code, so understanding how it works should
be relatively easy. Still, in this README I try to outline a few important
things. Also make sure to check the *LEARNING OPPORTUNITY* comments inside the
code: there I tried to highlight important results or techniques in the field
of neural networks that you may want to study further.

## Board representation

The state of the game is just this:

```c
typedef struct {
    char board[9];          // Can be "." (empty) or "X", "O".
    int current_player;     // 0 for player (X), 1 for computer (O).
} GameState;
```

The human and the computer always play in the same order: the human starts,
and the computer replies to each move. They also always play with the same
symbols: "X" for the human, "O" for the computer.

The board itself is represented by just 9 characters, each one telling whether
the tile is empty or contains an X or an O.

## Neural network

The neural network is very hard-coded, because the code really wants to be
simple: it only has a single hidden layer, which is enough to model such a
simple game (adding more layers doesn't help it converge faster nor play
better).

Note that tic tac toe has only 5478 possible states, and by default, with
100 hidden neurons, our neural network has:

    18 (inputs) * 100 (hidden) +
    100 (hidden) * 9 (outputs) weights +
    100 + 9 biases

For a total of 2809 parameters, so our neural network is very close to being
able to memorize each state of the game. However, you can reduce the hidden
size to 25 (or less) and it is still able to play well (but not perfectly)
with around 700 parameters (or fewer).

```c
typedef struct {
    // Weights and biases.
    float weights_ih[NN_INPUT_SIZE * NN_HIDDEN_SIZE];
    float weights_ho[NN_HIDDEN_SIZE * NN_OUTPUT_SIZE];
    float biases_h[NN_HIDDEN_SIZE];
    float biases_o[NN_OUTPUT_SIZE];

    // Activations are part of the structure itself for simplicity.
    float inputs[NN_INPUT_SIZE];
    float hidden[NN_HIDDEN_SIZE];
    float raw_logits[NN_OUTPUT_SIZE];   // Outputs before softmax().
    float outputs[NN_OUTPUT_SIZE];      // Outputs after softmax().
} NeuralNetwork;
```
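
As a quick sanity check of the parameter arithmetic above, here is a tiny
standalone program (not part of ttt.c, just an illustration reusing the same
defines) that prints the total count:

```c
#include <stdio.h>

// Same sizes as the defines at the top of ttt.c.
#define NN_INPUT_SIZE 18
#define NN_HIDDEN_SIZE 100
#define NN_OUTPUT_SIZE 9

int main(void) {
    int weights = NN_INPUT_SIZE * NN_HIDDEN_SIZE +      // 1800 input-to-hidden.
                  NN_HIDDEN_SIZE * NN_OUTPUT_SIZE;      // 900 hidden-to-output.
    int biases = NN_HIDDEN_SIZE + NN_OUTPUT_SIZE;       // 100 + 9 = 109.
    printf("Total parameters: %d\n", weights + biases); // Prints 2809.
    return 0;
}
```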

Activations are always stored directly inside the NeuralNetwork structure, so
calculating the deltas and performing the backpropagation is very simple.

We use ReLU because of its simple derivative; almost any activation would work
for such a simple problem. The weights initialization doesn't take ReLU into
account either: weights are just random values from -0.5 to 0.5 (no He
initialization).

The output is computed using softmax(), since this neural network basically
assigns probabilities to every possible next move. In theory we use cross
entropy as the loss function, but in practice we evaluate our *agent* based on
the results of the games, so the loss is only used implicitly here:

```c
deltas[i] = output[i] - target[i]
```

That is the output delta you get when combining softmax with cross entropy.

## Reinforcement learning policy

This is the reward policy used (the same values you find in learn_from_game()):

```c
if (winner == 'T') {
    reward = 0.3f;      // Small reward for draw
} else if (winner == nn_symbol) {
    reward = 1.0f;      // Large reward for win
} else {
    reward = -2.0f;     // Negative reward for loss
}
```

When rewarding, we recreate all the states of the game where the neural
network was about to move, and for each of them we reward the winning moves
(not just the final move that won, but *all* the moves performed in the game
we won) by using as target output the move we want to reward set to 1 and all
the other moves set to 0. Then we do a pass of backpropagation and update the
weights.

For ties it's like winning, but the reward is smaller. When the game was lost,
instead, the target has the played move set to 0, all the invalid moves set to
0 as well, and all the other valid moves set to `1/(number-of-valid-moves)`.

We also scale the reward according to how early the move was played: moves
near the start of the game get a smaller reward, while moves near the end of
the game get a stronger one:

    float move_importance = 0.5f + 0.5f * (float)move_idx/(float)num_moves;
    float scaled_reward = reward * move_importance;

Note that the above makes a lot of difference in how well the program works.
Also note that while this may look similar to Temporal Difference learning in
reinforcement learning, it is not: in this case we don't have a simple way to
evaluate whether a single step produced a positive or negative reward, so we
need to wait for each game to finish. The temporal scaling above is just a way
to encode in the network that early moves are more open, while, as the game
goes on, we need to play more selectively.

## Weights updating

We just use plain backpropagation, and the code is designed to show very
clearly that, after all, things work in a VERY similar way to supervised
learning: the only difference is that the input/output pairs are not known
beforehand, but are generated on the fly based on the reward policy of
reinforcement learning.

Please check the code for more information.
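
To make the above concrete, this is a condensed sketch of what happens for
each move the network played in a finished game. The real logic lives in
learn_from_game() and backprop() inside ttt.c; the build_move_target() helper
below is hypothetical (it does not exist in the real code), it just isolates
the target-construction step:

```c
/* Sketch: build the per-move target distribution described above.
 * 'board' is the position BEFORE the network moved, 'move' is the move it
 * actually played, 'scaled_reward' is the temporally scaled game reward. */
void build_move_target(const char *board, int move, float scaled_reward,
                       float target_probs[9]) {
    for (int i = 0; i < 9; i++) target_probs[i] = 0.0f;
    if (scaled_reward >= 0) {
        // Win or tie: all the probability mass goes to the move we played.
        target_probs[move] = 1.0f;
    } else {
        // Loss: spread the probability mass over the other valid moves.
        int alternatives = 0;
        for (int i = 0; i < 9; i++)
            if (board[i] == '.' && i != move) alternatives++;
        if (alternatives == 0) return; // Nothing else to encourage.
        for (int i = 0; i < 9; i++)
            if (board[i] == '.' && i != move)
                target_probs[i] = 1.0f / alternatives;
    }
    /* The caller then performs one gradient step, with the reward magnitude
     * scaling the output error:
     * backprop(nn, target_probs, LEARNING_RATE, scaled_reward); */
}
```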

## Future work

Things I didn't test because the complexity would somewhat sabotage the
educational value of the program, and/or for lack of time, but that could be
interesting exercises and interesting other projects / branches:

* Can this approach work with connect four as well? The much larger state space of that problem would make it really interesting and less of a toy.
* Train the network to play both sides, by adding an additional set of inputs encoding the symbol that is about to move (useful especially in the case of connect four), so that we can use the network itself as the opponent instead of playing against random moves. A possible encoding is sketched after this list.
* Implement proper sampling, in the case above, so that initially moves are quite random, and later they start to pick the predicted move more consistently.
* MCTS.
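
For the second bullet, one possible (untested) extension of board_to_inputs()
would grow NN_INPUT_SIZE from 18 to 20 and encode the side to move with the
same two-bit scheme used for the tiles; the function below is a hypothetical
sketch, not something present in ttt.c:

```c
/* Hypothetical variant of board_to_inputs() for a network that plays both
 * sides: 9 tiles * 2 bits + 2 bits for the side to move = 20 inputs
 * (NN_INPUT_SIZE would become 20). */
void board_to_inputs_both_sides(GameState *state, float *inputs) {
    for (int i = 0; i < 9; i++) {
        if (state->board[i] == '.') {
            inputs[i*2] = 0; inputs[i*2+1] = 0;
        } else if (state->board[i] == 'X') {
            inputs[i*2] = 1; inputs[i*2+1] = 0;
        } else { // 'O'
            inputs[i*2] = 0; inputs[i*2+1] = 1;
        }
    }
    // Encode the symbol about to move: 10 = X to move, 01 = O to move.
    inputs[18] = (state->current_player == 0) ? 1 : 0;
    inputs[19] = (state->current_player == 0) ? 0 : 1;
}
```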
--------------------------------------------------------------------------------
/ttt.c:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>

// Neural network parameters.
#define NN_INPUT_SIZE 18
#define NN_HIDDEN_SIZE 100
#define NN_OUTPUT_SIZE 9
#define LEARNING_RATE 0.1

// Game board representation.
typedef struct {
    char board[9];          // Can be "." (empty) or "X", "O".
    int current_player;     // 0 for player (X), 1 for computer (O).
} GameState;

/* Neural network structure. For simplicity we have just
 * one hidden layer and fixed sizes (see defines above).
 * However, for this problem, going deeper than one hidden layer
 * is useless. */
typedef struct {
    // Weights and biases.
    float weights_ih[NN_INPUT_SIZE * NN_HIDDEN_SIZE];
    float weights_ho[NN_HIDDEN_SIZE * NN_OUTPUT_SIZE];
    float biases_h[NN_HIDDEN_SIZE];
    float biases_o[NN_OUTPUT_SIZE];

    // Activations are part of the structure itself for simplicity.
    float inputs[NN_INPUT_SIZE];
    float hidden[NN_HIDDEN_SIZE];
    float raw_logits[NN_OUTPUT_SIZE];   // Outputs before softmax().
    float outputs[NN_OUTPUT_SIZE];      // Outputs after softmax().
} NeuralNetwork;

/* ReLU activation function. */
float relu(float x) {
    return x > 0 ? x : 0;
}

/* Derivative of the ReLU activation function. */
float relu_derivative(float x) {
    return x > 0 ? 1.0f : 0.0f;
}

/* Initialize a neural network with random weights. We should
 * use something like He initialization since we use ReLU, but we don't
 * care as this is a trivial example. */
#define RANDOM_WEIGHT() (((float)rand() / RAND_MAX) - 0.5f)
void init_neural_network(NeuralNetwork *nn) {
    // Initialize weights with random values between -0.5 and 0.5.
    for (int i = 0; i < NN_INPUT_SIZE * NN_HIDDEN_SIZE; i++)
        nn->weights_ih[i] = RANDOM_WEIGHT();

    for (int i = 0; i < NN_HIDDEN_SIZE * NN_OUTPUT_SIZE; i++)
        nn->weights_ho[i] = RANDOM_WEIGHT();

    for (int i = 0; i < NN_HIDDEN_SIZE; i++)
        nn->biases_h[i] = RANDOM_WEIGHT();

    for (int i = 0; i < NN_OUTPUT_SIZE; i++)
        nn->biases_o[i] = RANDOM_WEIGHT();
}

/* Apply the softmax activation function to an input array, and
 * set the result into output. */
void softmax(float *input, float *output, int size) {
    /* Find the maximum value, then subtract it to avoid
     * numerical stability issues with exp(). */
    float max_val = input[0];
    for (int i = 1; i < size; i++) {
        if (input[i] > max_val) {
            max_val = input[i];
        }
    }

    // Calculate exp(x_i - max) for each element and sum.
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        output[i] = expf(input[i] - max_val);
        sum += output[i];
    }

    // Normalize to get probabilities.
    if (sum > 0) {
        for (int i = 0; i < size; i++) {
            output[i] /= sum;
        }
    } else {
        /* Fallback in case of numerical issues: just provide
         * a uniform distribution. */
        for (int i = 0; i < size; i++) {
            output[i] = 1.0f / size;
        }
    }
}

/* Neural network forward pass (inference). We store the activations
 * so we can also do backpropagation later. */
void forward_pass(NeuralNetwork *nn, float *inputs) {
    // Copy inputs.
    memcpy(nn->inputs, inputs, NN_INPUT_SIZE * sizeof(float));

    // Input to hidden layer.
    for (int i = 0; i < NN_HIDDEN_SIZE; i++) {
        float sum = nn->biases_h[i];
        for (int j = 0; j < NN_INPUT_SIZE; j++) {
            sum += inputs[j] * nn->weights_ih[j * NN_HIDDEN_SIZE + i];
        }
        nn->hidden[i] = relu(sum);
    }

    // Hidden to output (raw logits).
    for (int i = 0; i < NN_OUTPUT_SIZE; i++) {
        nn->raw_logits[i] = nn->biases_o[i];
        for (int j = 0; j < NN_HIDDEN_SIZE; j++) {
            nn->raw_logits[i] += nn->hidden[j] * nn->weights_ho[j * NN_OUTPUT_SIZE + i];
        }
    }

    // Apply softmax to get the final probabilities.
    softmax(nn->raw_logits, nn->outputs, NN_OUTPUT_SIZE);
}

/* Initialize the game state with an empty board. */
void init_game(GameState *state) {
    memset(state->board, '.', 9);
    state->current_player = 0;  // Player (X) goes first.
}

/* Show the board on screen in ASCII "art"... */
void display_board(GameState *state) {
    for (int row = 0; row < 3; row++) {
        // Display the board symbols.
        printf("%c%c%c ", state->board[row*3], state->board[row*3+1],
                          state->board[row*3+2]);

        // Display the position numbers for this row, for the poor human.
        printf("%d%d%d\n", row*3, row*3+1, row*3+2);
    }
    printf("\n");
}

/* Convert the board state to neural network inputs. Note that we use
 * a peculiar encoding I described here:
 * https://www.youtube.com/watch?v=EXbgUXt8fFU
 *
 * Instead of one-hot encoding, we can represent N different categories
 * as different bit patterns. In this specific case it's trivial:
 *
 * 00 = empty
 * 10 = X
 * 01 = O
 *
 * Two inputs per symbol instead of 3 in this case, but in the general case
 * this reduces the input dimensionality A LOT.
 *
 * LEARNING OPPORTUNITY: You may want to learn (if not already aware) of
 * different ways to represent non scalar inputs in neural networks:
 * one hot encoding, learned embeddings, and, even if it's just my random
 * experiment, this "permutation coding" that I'm using here.
 */
void board_to_inputs(GameState *state, float *inputs) {
    for (int i = 0; i < 9; i++) {
        if (state->board[i] == '.') {
            inputs[i*2] = 0;
            inputs[i*2+1] = 0;
        } else if (state->board[i] == 'X') {
            inputs[i*2] = 1;
            inputs[i*2+1] = 0;
        } else {  // 'O'
            inputs[i*2] = 0;
            inputs[i*2+1] = 1;
        }
    }
}

/* Check if the game is over (win or tie).
 * Very brutal but fast enough. */
int check_game_over(GameState *state, char *winner) {
    // Check rows.
    for (int i = 0; i < 3; i++) {
        if (state->board[i*3] != '.' &&
            state->board[i*3] == state->board[i*3+1] &&
            state->board[i*3+1] == state->board[i*3+2]) {
            *winner = state->board[i*3];
            return 1;
        }
    }

    // Check columns.
    for (int i = 0; i < 3; i++) {
        if (state->board[i] != '.' &&
            state->board[i] == state->board[i+3] &&
            state->board[i+3] == state->board[i+6]) {
            *winner = state->board[i];
            return 1;
        }
    }

    // Check diagonals.
    if (state->board[0] != '.' &&
        state->board[0] == state->board[4] &&
        state->board[4] == state->board[8]) {
        *winner = state->board[0];
        return 1;
    }
    if (state->board[2] != '.' &&
        state->board[2] == state->board[4] &&
        state->board[4] == state->board[6]) {
        *winner = state->board[2];
        return 1;
    }

    // Check for a tie (no free tiles left).
    int empty_tiles = 0;
    for (int i = 0; i < 9; i++) {
        if (state->board[i] == '.') empty_tiles++;
    }
    if (empty_tiles == 0) {
        *winner = 'T';  // Tie.
        return 1;
    }

    return 0;  // Game continues.
}

/* Get the best move for the computer using the neural network.
 * Note that there is no complex sampling at all: we just pick
 * the output with the highest value THAT corresponds to an empty tile. */
int get_computer_move(GameState *state, NeuralNetwork *nn, int display_probs) {
    float inputs[NN_INPUT_SIZE];

    board_to_inputs(state, inputs);
    forward_pass(nn, inputs);

    // Find the highest probability value and the best legal move.
    float highest_prob = -1.0f;
    int highest_prob_idx = -1;
    int best_move = -1;
    float best_legal_prob = -1.0f;

    for (int i = 0; i < 9; i++) {
        // Track the highest probability overall.
        if (nn->outputs[i] > highest_prob) {
            highest_prob = nn->outputs[i];
            highest_prob_idx = i;
        }

        // Track the best legal move.
        if (state->board[i] == '.' &&
            (best_move == -1 || nn->outputs[i] > best_legal_prob))
        {
            best_move = i;
            best_legal_prob = nn->outputs[i];
        }
    }

    // This is just for debugging. It's interesting to show to the user
    // in the first games, since you can see how, initially, the net
    // picks illegal moves as best, and so forth.
    if (display_probs) {
        printf("Neural network move probabilities:\n");
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 3; col++) {
                int pos = row * 3 + col;

                // Print the probability as a percentage.
                printf("%5.1f%%", nn->outputs[pos] * 100.0f);

                // Add markers.
                if (pos == highest_prob_idx) {
                    printf("*");  // Highest probability overall.
                }
                if (pos == best_move) {
                    printf("#");  // Selected move (highest valid probability).
                }
                printf(" ");
            }
            printf("\n");
        }

        // The sum of the probabilities should be 1.0, hopefully.
        // Just debugging.
        float total_prob = 0.0f;
        for (int i = 0; i < 9; i++)
            total_prob += nn->outputs[i];
        printf("Sum of all probabilities: %.2f\n\n", total_prob);
    }
    return best_move;
}

/* Backpropagation function.
 * The only difference here from vanilla backprop is that we have
 * a 'reward_scaling' argument that makes the output error more/less
 * dramatic, so that we can adjust the weights proportionally to the
 * reward we want to provide. */
void backprop(NeuralNetwork *nn, float *target_probs, float learning_rate, float reward_scaling) {
    float output_deltas[NN_OUTPUT_SIZE];
    float hidden_deltas[NN_HIDDEN_SIZE];

    /* === STEP 1: Compute deltas === */

    /* Calculate the output layer deltas.
     * Note what's going on here: we are technically using softmax
     * as the output function and cross entropy as the loss, but we never use
     * cross entropy in practice, since we check the progress in terms
     * of winning the game.
     *
     * Still, calculating the deltas in the output as:
     *
     *      output[i] - target[i]
     *
     * is exactly what you get if you derive the deltas for
     * softmax combined with cross entropy.
     *
     * LEARNING OPPORTUNITY: This is a well established and fundamental
     * result in neural networks, you may want to read more about it. */
    for (int i = 0; i < NN_OUTPUT_SIZE; i++) {
        output_deltas[i] =
            (nn->outputs[i] - target_probs[i]) * fabsf(reward_scaling);
    }

    // Backpropagate the error to the hidden layer.
    for (int i = 0; i < NN_HIDDEN_SIZE; i++) {
        float error = 0;
        for (int j = 0; j < NN_OUTPUT_SIZE; j++) {
            error += output_deltas[j] * nn->weights_ho[i * NN_OUTPUT_SIZE + j];
        }
        hidden_deltas[i] = error * relu_derivative(nn->hidden[i]);
    }

    /* === STEP 2: Update weights === */

    // Output layer weights and biases.
    for (int i = 0; i < NN_HIDDEN_SIZE; i++) {
        for (int j = 0; j < NN_OUTPUT_SIZE; j++) {
            nn->weights_ho[i * NN_OUTPUT_SIZE + j] -=
                learning_rate * output_deltas[j] * nn->hidden[i];
        }
    }
    for (int j = 0; j < NN_OUTPUT_SIZE; j++) {
        nn->biases_o[j] -= learning_rate * output_deltas[j];
    }

    // Hidden layer weights and biases.
    for (int i = 0; i < NN_INPUT_SIZE; i++) {
        for (int j = 0; j < NN_HIDDEN_SIZE; j++) {
            nn->weights_ih[i * NN_HIDDEN_SIZE + j] -=
                learning_rate * hidden_deltas[j] * nn->inputs[i];
        }
    }
    for (int j = 0; j < NN_HIDDEN_SIZE; j++) {
        nn->biases_h[j] -= learning_rate * hidden_deltas[j];
    }
}

/* Train the neural network based on the game outcome.
 *
 * The move_history is just an integer array with the indexes of all the
 * moves. This function is designed so that you can specify whether the
 * game was started by the NN or by the human, but in practice the
 * code always lets the human (or random) player move first. */
void learn_from_game(NeuralNetwork *nn, int *move_history, int num_moves, int nn_moves_even, char winner) {
    // Determine the reward based on the game outcome.
    float reward;
    char nn_symbol = nn_moves_even ? 'O' : 'X';

    if (winner == 'T') {
        reward = 0.3f;      // Small reward for draw
    } else if (winner == nn_symbol) {
        reward = 1.0f;      // Large reward for win
    } else {
        reward = -2.0f;     // Negative reward for loss
    }

    GameState state;
    float target_probs[NN_OUTPUT_SIZE];

    // Process each move the neural network made.
    for (int move_idx = 0; move_idx < num_moves; move_idx++) {
        // Skip if this wasn't a move by the neural network.
        if ((nn_moves_even && move_idx % 2 != 1) ||
            (!nn_moves_even && move_idx % 2 != 0))
        {
            continue;
        }

        // Recreate the board state BEFORE this move was made.
        init_game(&state);
        for (int i = 0; i < move_idx; i++) {
            char symbol = (i % 2 == 0) ? 'X' : 'O';
            state.board[move_history[i]] = symbol;
        }

        // Convert the board to inputs and do a forward pass.
        float inputs[NN_INPUT_SIZE];
        board_to_inputs(&state, inputs);
        forward_pass(nn, inputs);

        /* The move that was actually made by the NN, that is
         * the one we want to reward (positively or negatively). */
        int move = move_history[move_idx];

        /* Here we can't really implement temporal difference in the strict
         * reinforcement learning sense, since we don't have an easy way to
         * evaluate if the current situation is better or worse than the
         * previous state in the game.
         *
         * However, time-wise we do something that is very effective in
         * this case: we scale the reward according to the move time, so that
         * later moves are impacted more (the game is less open to different
         * solutions as we go forward).
         *
         * We give a fixed 0.5 importance to all the moves, plus
         * a 0.5 that depends on the move position.
         *
         * NOTE: this makes A LOT of difference. Experiment with different
         * values.
         *
         * LEARNING OPPORTUNITY: Temporal Difference in Reinforcement Learning
         * is a very important result, that was worth the Turing Award in
         * 2024 to Sutton and Barto. You may want to read about it. */
        float move_importance = 0.5f + 0.5f * (float)move_idx/(float)num_moves;
        float scaled_reward = reward * move_importance;

        /* Create the target probability distribution:
         * let's start with all the probabilities set to 0. */
        for (int i = 0; i < NN_OUTPUT_SIZE; i++)
            target_probs[i] = 0;

        /* Set the target for the chosen move based on the reward: */
        if (scaled_reward >= 0) {
            /* For a positive reward, set the probability of the chosen move
             * to 1, with all the rest set to 0. */
            target_probs[move] = 1;
        } else {
            /* For a negative reward, distribute the probability among the
             * OTHER valid moves, which is conceptually the same as
             * discouraging the move that we want to discourage. */
            int valid_moves_left = 9-move_idx-1;
            float other_prob = 1.0f / valid_moves_left;
            for (int i = 0; i < 9; i++) {
                if (state.board[i] == '.' && i != move) {
                    target_probs[i] = other_prob;
                }
            }
        }

        /* Call the generic backpropagation function, using
         * our target probabilities as the target. */
        backprop(nn, target_probs, LEARNING_RATE, scaled_reward);
    }
}

/* Play one game of Tic Tac Toe against the neural network. */
void play_game(NeuralNetwork *nn) {
    GameState state;
    char winner;
    int move_history[9];    // Maximum 9 moves in a game.
    int num_moves = 0;

    init_game(&state);

    printf("Welcome to Tic Tac Toe! You are X, the computer is O.\n");
    printf("Enter positions as numbers from 0 to 8 (see picture).\n");

    while (!check_game_over(&state, &winner)) {
        display_board(&state);

        if (state.current_player == 0) {
            // Human's turn.
            int move;
            char movec;
            printf("Your move (0-8): ");
            scanf(" %c", &movec);
            move = movec - '0';  // Turn the character into a number.

            // Check if the move is valid.
            if (move < 0 || move > 8 || state.board[move] != '.') {
                printf("Invalid move! Try again.\n");
                continue;
            }

            state.board[move] = 'X';
            move_history[num_moves++] = move;
        } else {
            // Computer's turn.
            printf("Computer's move:\n");
            int move = get_computer_move(&state, nn, 1);
            state.board[move] = 'O';
            printf("Computer placed O at position %d\n", move);
            move_history[num_moves++] = move;
        }

        state.current_player = !state.current_player;
    }

    display_board(&state);

    if (winner == 'X') {
        printf("You win!\n");
    } else if (winner == 'O') {
        printf("Computer wins!\n");
    } else {
        printf("It's a tie!\n");
    }

    // Learn from this game.
    learn_from_game(nn, move_history, num_moves, 1, winner);
}

/* Get a random valid move; this is used for training
 * against a random opponent. Note: this function will loop forever
 * if the board is full, but here we want simple code. */
int get_random_move(GameState *state) {
    while(1) {
        int move = rand() % 9;
        if (state->board[move] != '.') continue;
        return move;
    }
}

/* Play a game against random moves and learn from it.
 *
 * This is a very simple Monte Carlo method applied to reinforcement
 * learning:
 *
 * 1. We play a complete random game (episode).
 * 2. We determine the reward based on the outcome of the game.
 * 3. We update the neural network in order to maximize future rewards.
 *
 * LEARNING OPPORTUNITY: while the code uses a Monte Carlo-like
 * technique, important results were recently obtained using
 * Monte Carlo Tree Search (MCTS), where a tree structure represents
 * potential future game states that are explored according to
 * some selection criterion: you may want to learn about it. */
char play_random_game(NeuralNetwork *nn, int *move_history, int *num_moves) {
    GameState state;
    char winner = 0;
    *num_moves = 0;

    init_game(&state);

    while (!check_game_over(&state, &winner)) {
        int move;

        if (state.current_player == 0) {  // Random player's turn (X).
            move = get_random_move(&state);
        } else {  // Neural network's turn (O).
            move = get_computer_move(&state, nn, 0);
        }

        /* Make the move and store it: we need the move sequence
         * during the learning stage. */
        char symbol = (state.current_player == 0) ? 'X' : 'O';
        state.board[move] = symbol;
        move_history[(*num_moves)++] = move;

        // Switch player.
        state.current_player = !state.current_player;
    }

    // Learn from this game: the neural network is 'O', so it plays the
    // second, fourth, ... move of each game.
    learn_from_game(nn, move_history, *num_moves, 1, winner);
    return winner;
}

/* Train the neural network against random moves. */
void train_against_random(NeuralNetwork *nn, int num_games) {
    int move_history[9];
    int num_moves;
    int wins = 0, losses = 0, ties = 0;

    printf("Training neural network against %d random games...\n", num_games);

    int played_games = 0;
    for (int i = 0; i < num_games; i++) {
        char winner = play_random_game(nn, move_history, &num_moves);
        played_games++;

        // Accumulate statistics that are provided to the user (it's fun).
        if (winner == 'O') {
            wins++;     // Neural network won.
        } else if (winner == 'X') {
            losses++;   // Random player won.
        } else {
            ties++;     // Tie.
        }

        // Show progress every so many games to avoid flooding stdout.
        if ((i + 1) % 10000 == 0) {
            printf("Games: %d, Wins: %d (%.1f%%), "
                   "Losses: %d (%.1f%%), Ties: %d (%.1f%%)\n",
                   i + 1, wins, (float)wins * 100 / played_games,
                   losses, (float)losses * 100 / played_games,
                   ties, (float)ties * 100 / played_games);
            played_games = 0;
            wins = 0;
            losses = 0;
            ties = 0;
        }
    }
    printf("\nTraining complete!\n");
}

int main(int argc, char **argv) {
    int random_games = 150000;  // Fast, and enough to play in a decent way.

    if (argc > 1) random_games = atoi(argv[1]);
    srand(time(NULL));

    // Initialize the neural network.
    NeuralNetwork nn;
    init_neural_network(&nn);

    // Train against random moves.
    if (random_games > 0) train_against_random(&nn, random_games);

    // Play games against the human and keep learning.
    while(1) {
        char play_again;
        play_game(&nn);

        printf("Play again? (y/n): ");
        scanf(" %c", &play_again);
        if (play_again != 'y' && play_again != 'Y') break;
    }
    return 0;
}
--------------------------------------------------------------------------------